LKML Archive on lore.kernel.org
* [PATCH v9 0/7] Enable cpuset controller in default hierarchy
@ 2018-05-29 13:41 Waiman Long
  2018-05-29 13:41 ` [PATCH v9 1/7] cpuset: " Waiman Long
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Waiman Long @ 2018-05-29 13:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

v9:
 - Rename cpuset.sched.domain to cpuset.sched.domain_root to better
   identify its purpose as the root of a new scheduling domain or
   partition.
 - Clarify in the document the purpose of domain_root and
   load_balance. Using domain_root is the only way to create a
   new partition.
 - Fix a lockdep warning in update_isolated_cpumask() function.
 - Add a new patch to eliminate call to generate_sched_domains() for
   v2 when a change in cpu list does not touch a domain_root.

v8:
 - Remove cpuset.cpus.isolated and add a new cpuset.sched.domain flag
   and rework the code accordingly.

v7:
 - Add a root-only cpuset.cpus.isolated control file for CPU isolation.
 - Enforce that load_balancing can only be turned off on cpusets with
   CPUs from the isolated list.
 - Update sched domain generation to allow cpusets with CPUs only
   from the isolated CPU list to be in separate root domains.

v6:
 - Hide cpuset control knobs in root cgroup.
 - Rename effective_cpus and effective_mems to cpus.effective and
   mems.effective respectively.
 - Remove cpuset.flags and add cpuset.sched_load_balance instead,
   as the behavior of sched_load_balance has changed and it is no
   longer a simple flag.
 - Update cgroup-v2.txt accordingly.

v5:
 - Add patch 2 to provide the cpuset.flags control knob for the
   sched_load_balance flag which should be the only feature that is
   essential as a replacement of the "isolcpus" kernel boot parameter.

v4:
 - Further minimize the feature set by removing the flags control knob.

v3:
 - Further trim the additional features down to just memory_migrate.
 - Update Documentation/cgroup-v2.txt.

v6 patch: https://lkml.org/lkml/2018/3/21/530
v7 patch: https://lkml.org/lkml/2018/4/19/448
v8 patch: https://lkml.org/lkml/2018/5/17/939

The purpose of this patchset is to provide a basic set of cpuset
control files for cgroup v2. This basic set includes the non-root
"cpus", "mems", "sched.load_balance" and "sched.domain_root" files.
The "cpus.effective" and "mems.effective" files will appear in all
cpuset-enabled cgroups.

The new control file that is unique to v2 is "sched.domain_root".
It is a boolean flag file that designates whether a cgroup is the
root of a new scheduling domain or partition with its own unique
set of CPUs, disjoint from those of other partitions from the
scheduler's perspective. The root cgroup is always a scheduling
domain root. Multiple levels of scheduling domains are supported
with some limitations, so a container scheduling domain root can
behave like a real root.

When a scheduling domain root cgroup is removed, its list of exclusive
CPUs will be returned to the parent's cpus.effective automatically.

The "sched.load_balance" flag can only be changed in a scheduling
domain root with no child cpuset-enabled cgroups; all other cgroups
inherit its value from their parents. This ensures that all cpusets
within the same partition have the same load balancing state. The
"sched.load_balance" flag can no longer be used to create additional
partitions as a side effect.

This patchset does not exclude the possibility of adding more features
in the future after careful consideration.

Patch 1 enables cpuset in cgroup v2 with cpus, mems and their
effective counterparts.

Patch 2 adds a new "sched.domain_root" control file for setting up
multiple scheduling domains or partitions. A scheduling domain root
implies cpu_exclusive.

Patch 3 adds a "sched.load_balance" flag to turn off load balancing in
a scheduling domain or partition.

Patch 4 updates the scheduling domain generation code to work with
the new scheduling domain feature.

Patch 5 exposes cpus.effective and mems.effective to the root cgroup,
as enabling child scheduling domains will take CPUs away from the
root cgroup. It is therefore useful to monitor which CPUs are left
there.

Patch 6 eliminates the need to rebuild sched domains for v2 if cpu list
changes occur to non-domain root cpusets only.

Patch 7 enables printing of debug information about scheduling
domain generation.

Waiman Long (7):
  cpuset: Enable cpuset controller in default hierarchy
  cpuset: Add new v2 cpuset.sched.domain_root flag
  cpuset: Add cpuset.sched.load_balance flag to v2
  cpuset: Make generate_sched_domains() recognize isolated_cpus
  cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  cpuset: Don't rebuild sched domains if cpu changes in non-domain root
  cpuset: Allow reporting of sched domain generation info

 Documentation/cgroup-v2.txt | 144 +++++++++++++++-
 kernel/cgroup/cpuset.c      | 396 ++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 518 insertions(+), 22 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 1/7] cpuset: Enable cpuset controller in default hierarchy
  2018-05-29 13:41 [PATCH v9 0/7] Enable cpuset controller in default hierarchy Waiman Long
@ 2018-05-29 13:41 ` Waiman Long
  2018-05-29 13:41 ` [PATCH v9 2/7] cpuset: Add new v2 cpuset.sched.domain_root flag Waiman Long
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2018-05-29 13:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

Now that thread mode has been merged into 4.14, it is time to enable
cpuset to be used in the default hierarchy (cgroup v2), as it is
clearly threaded.

The cpuset controller has experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables the cpuset controller in the default hierarchy
with a minimal set of features, namely just the cpus and mems and
their effective_* counterparts.  More features can certainly be
added to the default hierarchy later if there is a real user need
for them.

Alternatively, with the unified hierarchy, it may make more sense
to move some of those additional cpuset features, if desired, to
the memory controller or perhaps to the cpu controller instead of
keeping them in cpuset.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt | 90 ++++++++++++++++++++++++++++++++++++++++++---
 kernel/cgroup/cpuset.c      | 48 ++++++++++++++++++++++--
 2 files changed, 130 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeae..cf7bac6 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
        5-3-2. Writeback
      5-4. PID
        5-4-1. PID Interface Files
-     5-5. Device
-     5-6. RDMA
-       5-6-1. RDMA Interface Files
-     5-7. Misc
-       5-7-1. perf_event
+     5-5. Cpuset
+       5-5-1. Cpuset Interface Files
+     5-6. Device
+     5-7. RDMA
+       5-7-1. RDMA Interface Files
+     5-8. Misc
+       5-8-1. perf_event
      5-N. Non-normative information
        5-N-1. CPU controller root cgroup process behaviour
        5-N-2. IO controller root cgroup process behaviour
@@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN if the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+  cpuset.cpus
+	A read-write multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the CPUs allowed to be used by tasks within this
+	cgroup.  The CPU numbers are comma-separated numbers or
+	ranges.  For example:
+
+	  # cat cpuset.cpus
+	  0-4,6,8-10
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.cpus" or all the available CPUs if none is found.
+
+	The value of "cpuset.cpus" stays constant until the next update
+	and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+	A read-only multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the onlined CPUs that are actually allowed to be
+	used by tasks within the current cgroup.  If "cpuset.cpus"
+	is empty, it shows all the CPUs from the parent cgroup that
+	will be available to be used by this cgroup.  Otherwise, it is
+	a subset of "cpuset.cpus".  Its value will be affected by CPU
+	hotplug events.
+
+  cpuset.mems
+	A read-write multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the memory nodes allowed to be used by tasks within
+	this cgroup.  The memory node numbers are comma-separated
+	numbers or ranges.  For example:
+
+	  # cat cpuset.mems
+	  0-1,3
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.mems" or all the available memory nodes if none
+	is found.
+
+	The value of "cpuset.mems" stays constant until the next update
+	and won't be affected by any memory nodes hotplug events.
+
+  cpuset.mems.effective
+	A read-only multiple values file which exists on non-root
+	cpuset-enabled cgroups.
+
+	It lists the onlined memory nodes that are actually allowed to
+	be used by tasks within the current cgroup.  If "cpuset.mems"
+	is empty, it shows all the memory nodes from the parent cgroup
+	that will be available to be used by this cgroup.  Otherwise,
+	it is a subset of "cpuset.mems".  Its value will be affected
+	by memory nodes hotplug events.
+
+
 Device controller
 -----------------
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b42037e..419b758 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1823,12 +1823,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	return 0;
 }
 
-
 /*
  * for the common functions, 'private' gives the type of file
  */
 
-static struct cftype files[] = {
+static struct cftype legacy_files[] = {
 	{
 		.name = "cpus",
 		.seq_show = cpuset_common_seq_show,
@@ -1931,6 +1930,47 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 };
 
 /*
+ * This is currently a minimal set for the default hierarchy. It can be
+ * expanded later on by migrating more features and control files from v1.
+ */
+static struct cftype dfl_files[] = {
+	{
+		.name = "cpus",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * NR_CPUS),
+		.private = FILE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * MAX_NUMNODES),
+		.private = FILE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "cpus.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{ }	/* terminate */
+};
+
+
+/*
  *	cpuset_css_alloc - allocate a cpuset css
  *	cgrp:	control group that the new cpuset will be part of
  */
@@ -2104,8 +2144,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 	.post_attach	= cpuset_post_attach,
 	.bind		= cpuset_bind,
 	.fork		= cpuset_fork,
-	.legacy_cftypes	= files,
+	.legacy_cftypes	= legacy_files,
+	.dfl_cftypes	= dfl_files,
 	.early_init	= true,
+	.threaded	= true,
 };
 
 /**
-- 
1.8.3.1


* [PATCH v9 2/7] cpuset: Add new v2 cpuset.sched.domain_root flag
  2018-05-29 13:41 [PATCH v9 0/7] Enable cpuset controller in default hierarchy Waiman Long
  2018-05-29 13:41 ` [PATCH v9 1/7] cpuset: " Waiman Long
@ 2018-05-29 13:41 ` Waiman Long
  2018-05-30 14:18   ` Juri Lelli
  2018-05-31  9:49   ` Peter Zijlstra
  2018-05-29 13:41 ` [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2 Waiman Long
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Waiman Long @ 2018-05-29 13:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

A new cpuset.sched.domain_root boolean flag is added to cpuset
v2. This new flag, if set, indicates that the cgroup is the root of
a new scheduling domain or partition that includes itself and all its
descendants except those that are scheduling domain roots themselves
and their descendants.

With this new flag, one can directly create as many partitions as
necessary without ever using the v1 trick of turning off load balancing
in specific cpusets to create partitions as a side effect.

This new flag is owned by the parent and will cause the CPUs in the
cpuset to be removed from the effective CPUs of its parent.

This is implemented internally by adding a new isolated_cpus mask that
holds the CPUs belonging to child scheduling domain cpusets so that:

	isolated_cpus | effective_cpus = cpus_allowed
	isolated_cpus & effective_cpus = 0

This new flag can only be turned on in a cpuset if its parent is a
scheduling domain root itself. The state of this flag cannot be changed
if the cpuset has children.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt |  28 +++++
 kernel/cgroup/cpuset.c      | 246 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 271 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index cf7bac6..e7534c5 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1514,6 +1514,34 @@ Cpuset Interface Files
 	it is a subset of "cpuset.mems".  Its value will be affected
 	by memory nodes hotplug events.
 
+  cpuset.sched.domain_root
+	A read-write single value file which exists on non-root
+	cpuset-enabled cgroups.  It is a binary value flag that accepts
+	either "0" (off) or "1" (on).  This flag is set by the parent
+	and is not delegatable.
+
+	If set, it indicates that the current cgroup is the root of a
+	new scheduling domain or partition that comprises itself and
+	all its descendants except those that are scheduling domain
+	roots themselves and their descendants.  The root cgroup is
+	always a scheduling domain root.
+
+	There are constraints on where this flag can be set.  It can
+	only be set in a cgroup if all the following conditions are true.
+
+	1) The "cpuset.cpus" is not empty and the list of CPUs is
+	   exclusive, i.e. they are not shared by any of its siblings.
+	2) The parent cgroup is also a scheduling domain root.
+	3) There are no child cgroups with cpuset enabled.  This
+	   eliminates corner cases that would have to be handled if
+	   such a condition were allowed.
+
+	Setting this flag will take the CPUs away from the effective
+	CPUs of the parent cgroup.  Once it is set, this flag cannot
+	be cleared if there are any child cgroups with cpuset enabled.
+	Further changes made to "cpuset.cpus" are allowed as long as
+	the first condition above is still true.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 419b758..405b072 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -109,6 +109,9 @@ struct cpuset {
 	cpumask_var_t effective_cpus;
 	nodemask_t effective_mems;
 
+	/* Isolated CPUs for scheduling domain children */
+	cpumask_var_t isolated_cpus;
+
 	/*
 	 * This is old Memory Nodes tasks took on.
 	 *
@@ -134,6 +137,9 @@ struct cpuset {
 
 	/* for custom sched domain */
 	int relax_domain_level;
+
+	/* for isolated_cpus */
+	int isolation_count;
 };
 
 static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
@@ -175,6 +181,7 @@ static inline bool task_has_mempolicy(struct task_struct *task)
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_SCHED_DOMAIN_ROOT,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -203,6 +210,11 @@ static inline int is_sched_load_balance(const struct cpuset *cs)
 	return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 }
 
+static inline int is_sched_domain_root(const struct cpuset *cs)
+{
+	return test_bit(CS_SCHED_DOMAIN_ROOT, &cs->flags);
+}
+
 static inline int is_memory_migrate(const struct cpuset *cs)
 {
 	return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
@@ -220,7 +232,7 @@ static inline int is_spread_slab(const struct cpuset *cs)
 
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
-		  (1 << CS_MEM_EXCLUSIVE)),
+		  (1 << CS_MEM_EXCLUSIVE) | (1 << CS_SCHED_DOMAIN_ROOT)),
 };
 
 /**
@@ -902,7 +914,19 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
 		struct cpuset *parent = parent_cs(cp);
 
-		cpumask_and(new_cpus, cp->cpus_allowed, parent->effective_cpus);
+		/*
+		 * If parent has isolated CPUs, include them in the list
+		 * of allowable CPUs.
+		 */
+		if (parent->isolation_count) {
+			cpumask_or(new_cpus, parent->effective_cpus,
+				   parent->isolated_cpus);
+			cpumask_and(new_cpus, new_cpus, cpu_online_mask);
+			cpumask_and(new_cpus, new_cpus, cp->cpus_allowed);
+		} else {
+			cpumask_and(new_cpus, cp->cpus_allowed,
+				    parent->effective_cpus);
+		}
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
@@ -948,6 +972,162 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 }
 
 /**
+ * update_isolated_cpumask - update the isolated_cpus mask of parent cpuset
+ * @cpuset:  The cpuset that requests CPU isolation
+ * @oldmask: The old isolated cpumask to be removed from the parent
+ * @newmask: The new isolated cpumask to be added to the parent
+ * Return: 0 if successful, an error code otherwise
+ *
+ * Changes to the isolated CPUs are not allowed if any of CPUs changing
+ * state are in any of the child cpusets of the parent except the requesting
+ * child.
+ *
+ * If the sched_domain_root flag changes, either the oldmask (0=>1) or the
+ * newmask (1=>0) will be NULL.
+ *
+ * Called with cpuset_mutex held.
+ */
+static int update_isolated_cpumask(struct cpuset *cpuset,
+	struct cpumask *oldmask, struct cpumask *newmask)
+{
+	int retval;
+	int adding, deleting;
+	cpumask_var_t addmask, delmask;
+	struct cpuset *parent = parent_cs(cpuset);
+	struct cpuset *sibling;
+	struct cgroup_subsys_state *pos_css;
+	int old_count = parent->isolation_count;
+	bool dying = cpuset->css.flags & CSS_DYING;
+
+	/*
+	 * The new cpumask, if present, must not be empty and its parent
+	 * must be a scheduling domain root.
+	 */
+	if ((newmask && cpumask_empty(newmask)) ||
+	   !is_sched_domain_root(parent))
+		return -EINVAL;
+
+	/*
+	 * The oldmask, if present, must be a subset of parent's isolated
+	 * CPUs.
+	 */
+	if (oldmask && !cpumask_empty(oldmask) && (!parent->isolation_count ||
+		       !cpumask_subset(oldmask, parent->isolated_cpus))) {
+		WARN_ON_ONCE(1);
+		return -EINVAL;
+	}
+
+	/*
+	 * A sched_domain_root state change is not allowed if there are
+	 * online children and the cpuset is not dying.
+	 */
+	if (!dying && (!oldmask || !newmask) &&
+	    css_has_online_children(&cpuset->css))
+		return -EBUSY;
+
+	if (!zalloc_cpumask_var(&addmask, GFP_KERNEL))
+		return -ENOMEM;
+	if (!zalloc_cpumask_var(&delmask, GFP_KERNEL)) {
+		free_cpumask_var(addmask);
+		return -ENOMEM;
+	}
+
+	if (!old_count) {
+		if (!zalloc_cpumask_var(&parent->isolated_cpus, GFP_KERNEL)) {
+			retval = -ENOMEM;
+			goto out;
+		}
+		old_count = 1;
+	}
+
+	retval = -EBUSY;
+	adding = deleting = false;
+	if (newmask)
+		cpumask_copy(addmask, newmask);
+	if (oldmask)
+		deleting = cpumask_andnot(delmask, oldmask, addmask);
+	if (newmask)
+		adding = cpumask_andnot(addmask, newmask, delmask);
+
+	if (!adding && !deleting)
+		goto out_ok;
+
+	/*
+	 * The cpus to be added must be in the parent's effective_cpus mask
+	 * but not in the isolated_cpus mask.
+	 */
+	if (!cpumask_subset(addmask, parent->effective_cpus))
+		goto out;
+	if (parent->isolation_count &&
+	    cpumask_intersects(parent->isolated_cpus, addmask))
+		goto out;
+
+	/*
+	 * Check if any CPUs in addmask or delmask are in a sibling cpuset.
+	 * An empty sibling cpus_allowed means it is the same as parent's
+	 * effective_cpus. This checking is skipped if the cpuset is dying.
+	 */
+	if (dying)
+		goto updated_isolated_cpus;
+
+	rcu_read_lock();
+	cpuset_for_each_child(sibling, pos_css, parent) {
+		if ((sibling == cpuset) || !(sibling->css.flags & CSS_ONLINE))
+			continue;
+		if (cpumask_empty(sibling->cpus_allowed))
+			goto out_unlock;
+		if (adding &&
+		    cpumask_intersects(sibling->cpus_allowed, addmask))
+			goto out_unlock;
+		if (deleting &&
+		    cpumask_intersects(sibling->cpus_allowed, delmask))
+			goto out_unlock;
+	}
+	rcu_read_unlock();
+
+	/*
+	 * Change the isolated CPU list.
+	 * Newly added isolated CPUs will be removed from effective_cpus
+	 * and newly deleted ones will be added back if they are online.
+	 */
+updated_isolated_cpus:
+	spin_lock_irq(&callback_lock);
+	if (adding)
+		cpumask_or(parent->isolated_cpus,
+			   parent->isolated_cpus, addmask);
+
+	if (deleting)
+		cpumask_andnot(parent->isolated_cpus,
+			       parent->isolated_cpus, delmask);
+
+	/*
+	 * New effective_cpus = (cpus_allowed & ~isolated_cpus) &
+	 *			 cpu_online_mask
+	 */
+	cpumask_andnot(parent->effective_cpus, parent->cpus_allowed,
+		       parent->isolated_cpus);
+	cpumask_and(parent->effective_cpus, parent->effective_cpus,
+		    cpu_online_mask);
+
+	parent->isolation_count = cpumask_weight(parent->isolated_cpus);
+	spin_unlock_irq(&callback_lock);
+
+out_ok:
+	retval = 0;
+out:
+	free_cpumask_var(addmask);
+	free_cpumask_var(delmask);
+	if (old_count && !parent->isolation_count)
+		free_cpumask_var(parent->isolated_cpus);
+
+	return retval;
+
+out_unlock:
+	rcu_read_unlock();
+	goto out;
+}
+
+/**
  * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
  * @cs: the cpuset to consider
  * @trialcs: trial cpuset
@@ -988,6 +1168,13 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		return retval;
 
+	if (is_sched_domain_root(cs)) {
+		retval = update_isolated_cpumask(cs, cs->cpus_allowed,
+						 trialcs->cpus_allowed);
+		if (retval < 0)
+			return retval;
+	}
+
 	spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
 	spin_unlock_irq(&callback_lock);
@@ -1316,6 +1503,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	struct cpuset *trialcs;
 	int balance_flag_changed;
 	int spread_flag_changed;
+	int domain_flag_changed;
 	int err;
 
 	trialcs = alloc_trial_cpuset(cs);
@@ -1327,6 +1515,18 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	else
 		clear_bit(bit, &trialcs->flags);
 
+	/*
+	 *  Turning on sched.domain flag (default hierarchy only) implies
+	 *  an implicit cpu_exclusive. Turning off sched.domain will clear
+	 *  the cpu_exclusive flag.
+	 */
+	if (bit == CS_SCHED_DOMAIN_ROOT) {
+		if (turning_on)
+			set_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+		else
+			clear_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+	}
+
 	err = validate_change(cs, trialcs);
 	if (err < 0)
 		goto out;
@@ -1337,11 +1537,27 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
+	domain_flag_changed = (is_sched_domain_root(cs) !=
+			       is_sched_domain_root(trialcs));
+
+	if (domain_flag_changed) {
+		err = turning_on
+		    ? update_isolated_cpumask(cs, NULL, cs->cpus_allowed)
+		    : update_isolated_cpumask(cs, cs->cpus_allowed, NULL);
+		if (err < 0)
+			goto out;
+		/*
+		 * At this point, the state has been changed.
+		 * So we can't back out with error anymore.
+		 */
+	}
+
 	spin_lock_irq(&callback_lock);
 	cs->flags = trialcs->flags;
 	spin_unlock_irq(&callback_lock);
 
-	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
+	if (!cpumask_empty(trialcs->cpus_allowed) &&
+	   (balance_flag_changed || domain_flag_changed))
 		rebuild_sched_domains_locked();
 
 	if (spread_flag_changed)
@@ -1596,6 +1812,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	FILE_MEM_EXCLUSIVE,
 	FILE_MEM_HARDWALL,
 	FILE_SCHED_LOAD_BALANCE,
+	FILE_SCHED_DOMAIN_ROOT,
 	FILE_SCHED_RELAX_DOMAIN_LEVEL,
 	FILE_MEMORY_PRESSURE_ENABLED,
 	FILE_MEMORY_PRESSURE,
@@ -1629,6 +1846,9 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	case FILE_SCHED_LOAD_BALANCE:
 		retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, val);
 		break;
+	case FILE_SCHED_DOMAIN_ROOT:
+		retval = update_flag(CS_SCHED_DOMAIN_ROOT, cs, val);
+		break;
 	case FILE_MEMORY_MIGRATE:
 		retval = update_flag(CS_MEMORY_MIGRATE, cs, val);
 		break;
@@ -1790,6 +2010,8 @@ static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 		return is_mem_hardwall(cs);
 	case FILE_SCHED_LOAD_BALANCE:
 		return is_sched_load_balance(cs);
+	case FILE_SCHED_DOMAIN_ROOT:
+		return is_sched_domain_root(cs);
 	case FILE_MEMORY_MIGRATE:
 		return is_memory_migrate(cs);
 	case FILE_MEMORY_PRESSURE_ENABLED:
@@ -1966,6 +2188,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
+	{
+		.name = "sched.domain_root",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_SCHED_DOMAIN_ROOT,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
 	{ }	/* terminate */
 };
 
@@ -2075,6 +2305,9 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
  * If the cpuset being removed has its flag 'sched_load_balance'
  * enabled, then simulate turning sched_load_balance off, which
  * will call rebuild_sched_domains_locked().
+ *
+ * If the cpuset has the 'sched_domain_root' flag enabled, simulate
+ * turning sched_domain_root off.
  */
 
 static void cpuset_css_offline(struct cgroup_subsys_state *css)
@@ -2083,6 +2316,13 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 
 	mutex_lock(&cpuset_mutex);
 
+	/*
+	 * Calling update_flag() may fail, so we have to call
+	 * update_isolated_cpumask directly to be sure.
+	 */
+	if (is_sched_domain_root(cs))
+		update_isolated_cpumask(cs, cs->cpus_allowed, NULL);
+
 	if (is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
-- 
1.8.3.1


* [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2
  2018-05-29 13:41 [PATCH v9 0/7] Enable cpuset controller in default hierarchy Waiman Long
  2018-05-29 13:41 ` [PATCH v9 1/7] cpuset: " Waiman Long
  2018-05-29 13:41 ` [PATCH v9 2/7] cpuset: Add new v2 cpuset.sched.domain_root flag Waiman Long
@ 2018-05-29 13:41 ` Waiman Long
  2018-05-31 10:44   ` Peter Zijlstra
                     ` (2 more replies)
  2018-05-29 13:41 ` [PATCH v9 4/7] cpuset: Make generate_sched_domains() recognize isolated_cpus Waiman Long
                   ` (4 subsequent siblings)
  7 siblings, 3 replies; 27+ messages in thread
From: Waiman Long @ 2018-05-29 13:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

The sched.load_balance flag is needed to enable CPU isolation similar to
what can be done with the "isolcpus" kernel boot parameter. Its value
can only be changed in a scheduling domain root with no child
cpusets. On a non-domain-root cpuset, the value of
sched.load_balance is inherited from its parent. This is to make
sure that all the cpusets within the same scheduling domain or
partition have the same load balancing state.

This flag is set by the parent and is not delegatable.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt | 26 +++++++++++++++++++++
 kernel/cgroup/cpuset.c      | 55 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 77 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index e7534c5..681a809 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1542,6 +1542,32 @@ Cpuset Interface Files
 	Further changes made to "cpuset.cpus" is allowed as long as
 	the first condition above is still true.
 
+	A parent scheduling domain root cgroup cannot distribute all
+	its CPUs to its child scheduling domain root cgroups unless
+	its load balancing flag is turned off.
+
+  cpuset.sched.load_balance
+	A read-write single value file which exists on non-root
+	cpuset-enabled cgroups.  It is a binary value flag that accepts
+	either "0" (off) or "1" (on).  This flag is set by the parent
+	and is not delegatable.  It is on by default in the root cgroup.
+
+	When it is on, tasks within this cpuset will be load-balanced
+	by the kernel scheduler.  Tasks will periodically be moved
+	from CPUs with high load to less loaded CPUs within the
+	same cpuset.
+
+	When it is off, there will be no load balancing among CPUs on
+	this cgroup.  Tasks will stay in the CPUs they are running on
+	and will not be moved to other CPUs.
+
+	The load balancing state of a cgroup can only be changed on a
+	scheduling domain root cgroup with no cpuset-enabled children.
+	All cgroups within a scheduling domain or partition must have
+	the same load balancing state.	As descendant cgroups of a
+	scheduling domain root are created, they inherit the same load
+	balancing state of their root.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 405b072..b94d4a0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -510,7 +510,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 
 	par = parent_cs(cur);
 
-	/* On legacy hiearchy, we must be a subset of our parent cpuset. */
+	/* On legacy hierarchy, we must be a subset of our parent cpuset. */
 	ret = -EACCES;
 	if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
 		goto out;
@@ -1063,6 +1063,14 @@ static int update_isolated_cpumask(struct cpuset *cpuset,
 		goto out;
 
 	/*
+	 * A parent can't distribute all its CPUs to child scheduling
+	 * domain root cpusets unless load balancing is off.
+	 */
+	if (adding && !deleting && is_sched_load_balance(parent) &&
+	    cpumask_equal(addmask, parent->effective_cpus))
+		goto out;
+
+	/*
 	 * Check if any CPUs in addmask or delmask are in a sibling cpuset.
 	 * An empty sibling cpus_allowed means it is the same as parent's
 	 * effective_cpus. This checking is skipped if the cpuset is dying.
@@ -1540,6 +1548,18 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	domain_flag_changed = (is_sched_domain_root(cs) !=
 			       is_sched_domain_root(trialcs));
 
+	/*
+	 * On default hierarchy, a load balance flag change is only allowed
+	 * in a scheduling domain root with no child cpusets, as all the
+	 * cpusets within the same scheduling domain/partition must have
+	 * the same load balancing state.
+	 */
+	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
+	   (!is_sched_domain_root(cs) || css_has_online_children(&cs->css))) {
+		err = -EINVAL;
+		goto out;
+	}
+
 	if (domain_flag_changed) {
 		err = turning_on
 		    ? update_isolated_cpumask(cs, NULL, cs->cpus_allowed)
@@ -2196,6 +2216,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
+	{
+		.name = "sched.load_balance",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_SCHED_LOAD_BALANCE,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
 	{ }	/* terminate */
 };
 
@@ -2209,19 +2237,38 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 cpuset_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct cpuset *cs;
+	struct cgroup_subsys_state *errptr = ERR_PTR(-ENOMEM);
 
 	if (!parent_css)
 		return &top_cpuset.css;
 
 	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
 	if (!cs)
-		return ERR_PTR(-ENOMEM);
+		return errptr;
 	if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL))
 		goto free_cs;
 	if (!alloc_cpumask_var(&cs->effective_cpus, GFP_KERNEL))
 		goto free_cpus;
 
-	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+	/*
+	 * On default hierarchy, inherit parent's CS_SCHED_LOAD_BALANCE flag.
+	 * Creating a new cpuset is also not allowed if the effective_cpus
+	 * of its parent is empty.
+	 */
+	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+		struct cpuset *parent = css_cs(parent_css);
+
+		if (test_bit(CS_SCHED_LOAD_BALANCE, &parent->flags))
+			set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+
+		if (cpumask_empty(parent->effective_cpus)) {
+			errptr = ERR_PTR(-EINVAL);
+			goto free_cpus;
+		}
+	} else {
+		set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+	}
+
 	cpumask_clear(cs->cpus_allowed);
 	nodes_clear(cs->mems_allowed);
 	cpumask_clear(cs->effective_cpus);
@@ -2235,7 +2282,7 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	free_cpumask_var(cs->cpus_allowed);
 free_cs:
 	kfree(cs);
-	return ERR_PTR(-ENOMEM);
+	return errptr;
 }
 
 static int cpuset_css_online(struct cgroup_subsys_state *css)
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 4/7] cpuset: Make generate_sched_domains() recognize isolated_cpus
  2018-05-29 13:41 [PATCH v9 0/7] Enable cpuset controller in default hierarchy Waiman Long
                   ` (2 preceding siblings ...)
  2018-05-29 13:41 ` [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2 Waiman Long
@ 2018-05-29 13:41 ` Waiman Long
  2018-05-29 13:41 ` [PATCH v9 5/7] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2018-05-29 13:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

The generate_sched_domains() function and the hotplug code are modified
to use the newly introduced isolated_cpus mask for sched domain
generation.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b94d4a0..71cd920 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -672,13 +672,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	int ndoms = 0;		/* number of sched domains in result */
 	int nslot;		/* next empty doms[] struct cpumask slot */
 	struct cgroup_subsys_state *pos_css;
+	bool root_load_balance = is_sched_load_balance(&top_cpuset);
 
 	doms = NULL;
 	dattr = NULL;
 	csa = NULL;
 
 	/* Special case for the 99% of systems with one, full, sched domain */
-	if (is_sched_load_balance(&top_cpuset)) {
+	if (root_load_balance && !top_cpuset.isolation_count) {
 		ndoms = 1;
 		doms = alloc_sched_domains(ndoms);
 		if (!doms)
@@ -701,6 +702,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	csn = 0;
 
 	rcu_read_lock();
+	if (root_load_balance)
+		csa[csn++] = &top_cpuset;
 	cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
 		if (cp == &top_cpuset)
 			continue;
@@ -711,6 +714,9 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		 * parent's cpus, so just skip them, and then we call
 		 * update_domain_attr_tree() to calc relax_domain_level of
 		 * the corresponding sched domain.
+		 *
+		 * If the root is load-balancing, we can skip @cp if its
+		 * CPUs are a subset of the root's effective_cpus.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    !(is_sched_load_balance(cp) &&
@@ -718,11 +724,16 @@ static int generate_sched_domains(cpumask_var_t **domains,
 					 housekeeping_cpumask(HK_FLAG_DOMAIN))))
 			continue;
 
+		if (root_load_balance &&
+		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
+			continue;
+
 		if (is_sched_load_balance(cp))
 			csa[csn++] = cp;
 
-		/* skip @cp's subtree */
-		pos_css = css_rightmost_descendant(pos_css);
+		/* skip @cp's subtree if not a scheduling domain root */
+		if (!is_sched_domain_root(cp))
+			pos_css = css_rightmost_descendant(pos_css);
 	}
 	rcu_read_unlock();
 
@@ -849,7 +860,12 @@ static void rebuild_sched_domains_locked(void)
 	 * passing doms with offlined cpu to partition_sched_domains().
 	 * Anyways, hotplug work item will rebuild sched domains.
 	 */
-	if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+	if (!top_cpuset.isolation_count &&
+	    !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+		goto out;
+
+	if (top_cpuset.isolation_count &&
+	   !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
 		goto out;
 
 	/* Generate domain masks and attrs */
@@ -2635,6 +2651,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	cpumask_copy(&new_cpus, cpu_active_mask);
 	new_mems = node_states[N_MEMORY];
 
+	/*
+	 * If isolated_cpus is populated, the check below will likely
+	 * report cpus_updated even when the cpu list hasn't changed.
+	 * That causes extra work, but it is better to be safe.
+	 */
 	cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
 	mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
 
@@ -2643,6 +2664,10 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
+
+		if (top_cpuset.isolation_count)
+			cpumask_andnot(&new_cpus, &new_cpus,
+					top_cpuset.isolated_cpus);
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
 		spin_unlock_irq(&callback_lock);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 5/7] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  2018-05-29 13:41 [PATCH v9 0/7] Enable cpuset controller in default hierarchy Waiman Long
                   ` (3 preceding siblings ...)
  2018-05-29 13:41 ` [PATCH v9 4/7] cpuset: Make generate_sched_domains() recognize isolated_cpus Waiman Long
@ 2018-05-29 13:41 ` Waiman Long
  2018-05-29 13:41 ` [PATCH v9 6/7] cpuset: Don't rebuild sched domains if cpu changes in non-domain root Waiman Long
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2018-05-29 13:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

Because setting the "cpuset.sched.domain_root" flag in a direct child
of the root can remove CPUs from the root's effective CPU list, it
makes sense to know what CPUs are left in the root cgroup for
scheduling purposes. So the "cpuset.cpus.effective" control file is now
exposed in the v2 cgroup root.

For consistency, the "cpuset.mems.effective" control file is exposed
as well.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt | 4 ++--
 kernel/cgroup/cpuset.c      | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 681a809..b97f211 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1474,7 +1474,7 @@ Cpuset Interface Files
 	and won't be affected by any CPU hotplug events.
 
   cpuset.cpus.effective
-	A read-only multiple values file which exists on non-root
+	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
 
 	It lists the onlined CPUs that are actually allowed to be
@@ -1504,7 +1504,7 @@ Cpuset Interface Files
 	and won't be affected by any memory nodes hotplug events.
 
   cpuset.mems.effective
-	A read-only multiple values file which exists on non-root
+	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
 
 	It lists the onlined memory nodes that are actually allowed to
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 71cd920..f6ae483 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2214,14 +2214,12 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.name = "cpus.effective",
 		.seq_show = cpuset_common_seq_show,
 		.private = FILE_EFFECTIVE_CPULIST,
-		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
 	{
 		.name = "mems.effective",
 		.seq_show = cpuset_common_seq_show,
 		.private = FILE_EFFECTIVE_MEMLIST,
-		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
 	{
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 6/7] cpuset: Don't rebuild sched domains if cpu changes in non-domain root
  2018-05-29 13:41 [PATCH v9 0/7] Enable cpuset controller in default hierarchy Waiman Long
                   ` (4 preceding siblings ...)
  2018-05-29 13:41 ` [PATCH v9 5/7] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
@ 2018-05-29 13:41 ` Waiman Long
  2018-05-29 13:41 ` [PATCH v9 7/7] cpuset: Allow reporting of sched domain generation info Waiman Long
  2018-05-30 10:13 ` [PATCH v9 0/7] Enable cpuset controller in default hierarchy Juri Lelli
  7 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2018-05-29 13:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

With cpuset v1, any change to the list of allowable CPUs in a cpuset
may cause changes in the sched domain configuration, depending on the
load balancing states and the cpu lists of its parent and its children.

With cpuset v2 (on the default hierarchy), there are more restrictions
on how the load balancing state of a cpuset can change. As a result,
only changes made in a sched domain root can affect the corresponding
sched domain. CPU changes to any non-domain-root cpuset will not change
the sched domain configuration, so we don't need to call
rebuild_sched_domains_locked() for such changes, saving precious cpu
cycles.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f6ae483..9513f90 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -971,11 +971,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 		update_tasks_cpumask(cp);
 
 		/*
-		 * If the effective cpumask of any non-empty cpuset is changed,
-		 * we need to rebuild sched domains.
+		 * On legacy hierarchy, if the effective cpumask of any
+		 * non-empty cpuset is changed, we need to rebuild sched
+		 * domains.  On default hierarchy, the cpuset needs to be
+		 * a sched domain root as well.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
-		    is_sched_load_balance(cp))
+		    is_sched_load_balance(cp) &&
+		   (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
+		    is_sched_domain_root(cp)))
 			need_rebuild_sched_domains = true;
 
 		rcu_read_lock();
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v9 7/7] cpuset: Allow reporting of sched domain generation info
  2018-05-29 13:41 [PATCH v9 0/7] Enable cpuset controller in default hierarchy Waiman Long
                   ` (5 preceding siblings ...)
  2018-05-29 13:41 ` [PATCH v9 6/7] cpuset: Don't rebuild sched domains if cpu changes in non-domain root Waiman Long
@ 2018-05-29 13:41 ` Waiman Long
  2018-05-30 10:13 ` [PATCH v9 0/7] Enable cpuset controller in default hierarchy Juri Lelli
  7 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2018-05-29 13:41 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli,
	Patrick Bellasi, Waiman Long

This patch enables reporting of sched domain generation information.

If DYNAMIC_DEBUG is enabled, issuing the following command

  echo "file cpuset.c +p" > /sys/kernel/debug/dynamic_debug/control

and setting the loglevel to 8 will make the kernel show what sched
domain changes are being made.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9513f90..71fb2d0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -820,6 +820,12 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	}
 	BUG_ON(nslot != ndoms);
 
+#ifdef CONFIG_DEBUG_KERNEL
+	for (i = 0; i < ndoms; i++)
+		pr_debug("generate_sched_domains dom %d: %*pbl\n", i,
+			 cpumask_pr_args(doms[i]));
+#endif
+
 done:
 	kfree(csa);
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 0/7] Enable cpuset controller in default hierarchy
  2018-05-29 13:41 [PATCH v9 0/7] Enable cpuset controller in default hierarchy Waiman Long
                   ` (6 preceding siblings ...)
  2018-05-29 13:41 ` [PATCH v9 7/7] cpuset: Allow reporting of sched domain generation info Waiman Long
@ 2018-05-30 10:13 ` Juri Lelli
  2018-05-30 12:56   ` Waiman Long
  7 siblings, 1 reply; 27+ messages in thread
From: Juri Lelli @ 2018-05-30 10:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, Mike Galbraith, torvalds, Roman Gushchin, Patrick Bellasi

Hi,

On 29/05/18 09:41, Waiman Long wrote:
> v9:
>  - Rename cpuset.sched.domain to cpuset.sched.domain_root to better
>    identify its purpose as the root of a new scheduling domain or
>    partition.
>  - Clarify in the document about the purpose of domain_root and
>    load_balance. Using domain_root is th only way to create new
>    partition.
>  - Fix a lockdep warning in update_isolated_cpumask() function.
>  - Add a new patch to eliminate call to generate_sched_domains() for
>    v2 when a change in cpu list does not touch a domain_root.

I was playing with this and ended up with the situation below:

 g1/cgroup.controllers:cpuset
 g1/cgroup.events:populated 0
 g1/cgroup.max.depth:max
 g1/cgroup.max.descendants:max
 g1/cgroup.stat:nr_descendants 1
 g1/cgroup.stat:nr_dying_descendants 0
 g1/cgroup.subtree_control:cpuset
 g1/cgroup.type:domain
 g1/cpuset.cpus:0-5                   <---
 g1/cpuset.cpus.effective:0-5
 g1/cpuset.mems.effective:0-1
 g1/cpuset.sched.domain_root:1        <---
 g1/cpuset.sched.load_balance:1
 g1/cpu.stat:usage_usec 0
 g1/cpu.stat:user_usec 0
 g1/cpu.stat:system_usec 0
 g1/g11/cgroup.events:populated 0
 g1/g11/cgroup.max.descendants:max
 g1/g11/cpu.stat:usage_usec 0
 g1/g11/cpu.stat:user_usec 0
 g1/g11/cpu.stat:system_usec 0
 g1/g11/cgroup.type:domain
 g1/g11/cgroup.stat:nr_descendants 0
 g1/g11/cgroup.stat:nr_dying_descendants 0
 g1/g11/cpuset.cpus.effective:0-5
 g1/g11/cgroup.controllers:cpuset
 g1/g11/cpuset.sched.load_balance:1
 g1/g11/cpuset.mems.effective:0-1
 g1/g11/cpuset.cpus:6-11              <---
 g1/g11/cgroup.max.depth:max
 g1/g11/cpuset.sched.domain_root:0

Should this be allowed? I was expecting subgroup g11 should be
restricted to a subset of g1's cpus.

Best,

- Juri

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 0/7] Enable cpuset controller in default hierarchy
  2018-05-30 10:13 ` [PATCH v9 0/7] Enable cpuset controller in default hierarchy Juri Lelli
@ 2018-05-30 12:56   ` Waiman Long
  2018-05-30 13:05     ` Juri Lelli
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2018-05-30 12:56 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, Mike Galbraith, torvalds, Roman Gushchin, Patrick Bellasi

On 05/30/2018 06:13 AM, Juri Lelli wrote:
> Hi,
>
> On 29/05/18 09:41, Waiman Long wrote:
>> v9:
>>  - Rename cpuset.sched.domain to cpuset.sched.domain_root to better
>>    identify its purpose as the root of a new scheduling domain or
>>    partition.
>>  - Clarify in the document about the purpose of domain_root and
>>    load_balance. Using domain_root is th only way to create new
>>    partition.
>>  - Fix a lockdep warning in update_isolated_cpumask() function.
>>  - Add a new patch to eliminate call to generate_sched_domains() for
>>    v2 when a change in cpu list does not touch a domain_root.
> I was playing with this and ended up with the situation below:
>
>  g1/cgroup.controllers:cpuset
>  g1/cgroup.events:populated 0
>  g1/cgroup.max.depth:max
>  g1/cgroup.max.descendants:max
>  g1/cgroup.stat:nr_descendants 1
>  g1/cgroup.stat:nr_dying_descendants 0
>  g1/cgroup.subtree_control:cpuset
>  g1/cgroup.type:domain
>  g1/cpuset.cpus:0-5                   <---
>  g1/cpuset.cpus.effective:0-5
>  g1/cpuset.mems.effective:0-1
>  g1/cpuset.sched.domain_root:1        <---
>  g1/cpuset.sched.load_balance:1
>  g1/cpu.stat:usage_usec 0
>  g1/cpu.stat:user_usec 0
>  g1/cpu.stat:system_usec 0
>  g1/g11/cgroup.events:populated 0
>  g1/g11/cgroup.max.descendants:max
>  g1/g11/cpu.stat:usage_usec 0
>  g1/g11/cpu.stat:user_usec 0
>  g1/g11/cpu.stat:system_usec 0
>  g1/g11/cgroup.type:domain
>  g1/g11/cgroup.stat:nr_descendants 0
>  g1/g11/cgroup.stat:nr_dying_descendants 0
>  g1/g11/cpuset.cpus.effective:0-5
>  g1/g11/cgroup.controllers:cpuset
>  g1/g11/cpuset.sched.load_balance:1
>  g1/g11/cpuset.mems.effective:0-1
>  g1/g11/cpuset.cpus:6-11              <---
>  g1/g11/cgroup.max.depth:max
>  g1/g11/cpuset.sched.domain_root:0
>
> Should this be allowed? I was expecting subgroup g11 should be
> restricted to a subset of g1's cpus.
>
> Best,
>
> - Juri

That shouldn't be allowed. The code is probably missing some checks
that should have been done. What was the sequence of commands leading
to the above configuration?

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 0/7] Enable cpuset controller in default hierarchy
  2018-05-30 12:56   ` Waiman Long
@ 2018-05-30 13:05     ` Juri Lelli
  2018-05-30 13:47       ` Waiman Long
  0 siblings, 1 reply; 27+ messages in thread
From: Juri Lelli @ 2018-05-30 13:05 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, Mike Galbraith, torvalds, Roman Gushchin, Patrick Bellasi

On 30/05/18 08:56, Waiman Long wrote:
> On 05/30/2018 06:13 AM, Juri Lelli wrote:
> > Hi,
> >
> > On 29/05/18 09:41, Waiman Long wrote:
> >> v9:
> >>  - Rename cpuset.sched.domain to cpuset.sched.domain_root to better
> >>    identify its purpose as the root of a new scheduling domain or
> >>    partition.
> >>  - Clarify in the document about the purpose of domain_root and
> >>    load_balance. Using domain_root is th only way to create new
> >>    partition.
> >>  - Fix a lockdep warning in update_isolated_cpumask() function.
> >>  - Add a new patch to eliminate call to generate_sched_domains() for
> >>    v2 when a change in cpu list does not touch a domain_root.
> > I was playing with this and ended up with the situation below:
> >
> >  g1/cgroup.controllers:cpuset
> >  g1/cgroup.events:populated 0
> >  g1/cgroup.max.depth:max
> >  g1/cgroup.max.descendants:max
> >  g1/cgroup.stat:nr_descendants 1
> >  g1/cgroup.stat:nr_dying_descendants 0
> >  g1/cgroup.subtree_control:cpuset
> >  g1/cgroup.type:domain
> >  g1/cpuset.cpus:0-5                   <---
> >  g1/cpuset.cpus.effective:0-5
> >  g1/cpuset.mems.effective:0-1
> >  g1/cpuset.sched.domain_root:1        <---
> >  g1/cpuset.sched.load_balance:1
> >  g1/cpu.stat:usage_usec 0
> >  g1/cpu.stat:user_usec 0
> >  g1/cpu.stat:system_usec 0
> >  g1/g11/cgroup.events:populated 0
> >  g1/g11/cgroup.max.descendants:max
> >  g1/g11/cpu.stat:usage_usec 0
> >  g1/g11/cpu.stat:user_usec 0
> >  g1/g11/cpu.stat:system_usec 0
> >  g1/g11/cgroup.type:domain
> >  g1/g11/cgroup.stat:nr_descendants 0
> >  g1/g11/cgroup.stat:nr_dying_descendants 0
> >  g1/g11/cpuset.cpus.effective:0-5
> >  g1/g11/cgroup.controllers:cpuset
> >  g1/g11/cpuset.sched.load_balance:1
> >  g1/g11/cpuset.mems.effective:0-1
> >  g1/g11/cpuset.cpus:6-11              <---
> >  g1/g11/cgroup.max.depth:max
> >  g1/g11/cpuset.sched.domain_root:0
> >
> > Should this be allowed? I was expecting subgroup g11 should be
> > restricted to a subset of g1's cpus.
> >
> > Best,
> >
> > - Juri
> 
> That shouldn't be allowed.The code is probably missing some checks that
> should have been done. What was the sequence of commands leading to the
> above configuration?

This is an E5-2609 v3 (12 cores) Fedora Server box (with systemd, so
the first command is needed to be able to use the cpuset controller
with v2, IIUC):

# umount /sys/fs/cgroup/cpuset 
# cd /sys/fs/cgroup/unified/
# echo "+cpuset" >cgroup.subtree_control 
# mkdir g1
# echo 0-5 >g1/cpuset.cpus
# echo 6-11 >init.scope/cpuset.cpus
# echo 6-11 >machine.slice/cpuset.cpus
# echo 6-11 >system.slice/cpuset.cpus
# echo 6-11 >user.slice/cpuset.cpus
# echo 1 >g1/cpuset.sched.domain_root 
# mkdir g1/g11
# echo "+cpuset" > g1/cgroup.subtree_control 
# echo 6-11 >g1/g11/cpuset.cpus
# grep -R . g1/*

That should be it. Am I doing something wrong?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 0/7] Enable cpuset controller in default hierarchy
  2018-05-30 13:05     ` Juri Lelli
@ 2018-05-30 13:47       ` Waiman Long
  2018-05-30 13:52         ` Juri Lelli
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2018-05-30 13:47 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, Mike Galbraith, torvalds, Roman Gushchin, Patrick Bellasi

On 05/30/2018 09:05 AM, Juri Lelli wrote:
> On 30/05/18 08:56, Waiman Long wrote:
>> On 05/30/2018 06:13 AM, Juri Lelli wrote:
>>> Hi,
>>>
>>> On 29/05/18 09:41, Waiman Long wrote:
>>>> v9:
>>>>  - Rename cpuset.sched.domain to cpuset.sched.domain_root to better
>>>>    identify its purpose as the root of a new scheduling domain or
>>>>    partition.
>>>>  - Clarify in the document about the purpose of domain_root and
>>>>    load_balance. Using domain_root is th only way to create new
>>>>    partition.
>>>>  - Fix a lockdep warning in update_isolated_cpumask() function.
>>>>  - Add a new patch to eliminate call to generate_sched_domains() for
>>>>    v2 when a change in cpu list does not touch a domain_root.
>>> I was playing with this and ended up with the situation below:
>>>
>>>  g1/cgroup.controllers:cpuset
>>>  g1/cgroup.events:populated 0
>>>  g1/cgroup.max.depth:max
>>>  g1/cgroup.max.descendants:max
>>>  g1/cgroup.stat:nr_descendants 1
>>>  g1/cgroup.stat:nr_dying_descendants 0
>>>  g1/cgroup.subtree_control:cpuset
>>>  g1/cgroup.type:domain
>>>  g1/cpuset.cpus:0-5                   <---
>>>  g1/cpuset.cpus.effective:0-5
>>>  g1/cpuset.mems.effective:0-1
>>>  g1/cpuset.sched.domain_root:1        <---
>>>  g1/cpuset.sched.load_balance:1
>>>  g1/cpu.stat:usage_usec 0
>>>  g1/cpu.stat:user_usec 0
>>>  g1/cpu.stat:system_usec 0
>>>  g1/g11/cgroup.events:populated 0
>>>  g1/g11/cgroup.max.descendants:max
>>>  g1/g11/cpu.stat:usage_usec 0
>>>  g1/g11/cpu.stat:user_usec 0
>>>  g1/g11/cpu.stat:system_usec 0
>>>  g1/g11/cgroup.type:domain
>>>  g1/g11/cgroup.stat:nr_descendants 0
>>>  g1/g11/cgroup.stat:nr_dying_descendants 0
>>>  g1/g11/cpuset.cpus.effective:0-5
>>>  g1/g11/cgroup.controllers:cpuset
>>>  g1/g11/cpuset.sched.load_balance:1
>>>  g1/g11/cpuset.mems.effective:0-1
>>>  g1/g11/cpuset.cpus:6-11              <---
>>>  g1/g11/cgroup.max.depth:max
>>>  g1/g11/cpuset.sched.domain_root:0
>>>
>>> Should this be allowed? I was expecting subgroup g11 should be
>>> restricted to a subset of g1's cpus.
>>>
>>> Best,
>>>
>>> - Juri
>> That shouldn't be allowed.The code is probably missing some checks that
>> should have been done. What was the sequence of commands leading to the
>> above configuration?
> This is a E5-2609 v3 (12 cores) Fedora Server box (with systemd, so
> first command is needed to be able to use cpuset controller with v2,
> IIUC):
>
> # umount /sys/fs/cgroup/cpuset 
> # cd /sys/fs/cgroup/unified/
> # echo "+cpuset" >cgroup.subtree_control 
> # mkdir g1
> # echo 0-5 >g1/cpuset.cpus
> # echo 6-11 >init.scope/cpuset.cpus
> # echo 6-11 >machine.slice/cpuset.cpus
> # echo 6-11 >system.slice/cpuset.cpus
> # echo 6-11 >user.slice/cpuset.cpus
> # echo 1 >g1/cpuset.sched.domain_root 
> # mkdir g1/g11
> # echo "+cpuset" > g1/cgroup.subtree_control 
> # echo 6-11 >g1/g11/cpuset.cpus
> # grep -R . g1/*
>
> That should be it. Am I doing something wrong?
>
> Thanks,
>
> - Juri


Yes, it is a bug in the existing code. I have sent out a patch to fix that.

-Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 0/7] Enable cpuset controller in default hierarchy
  2018-05-30 13:47       ` Waiman Long
@ 2018-05-30 13:52         ` Juri Lelli
  0 siblings, 0 replies; 27+ messages in thread
From: Juri Lelli @ 2018-05-30 13:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, Mike Galbraith, torvalds, Roman Gushchin, Patrick Bellasi

On 30/05/18 09:47, Waiman Long wrote:

[...]

> Yes, it is a bug in the existing code. I have sent out a patch to fix that.

Cool, thanks. I'll have a look.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 2/7] cpuset: Add new v2 cpuset.sched.domain_root flag
  2018-05-29 13:41 ` [PATCH v9 2/7] cpuset: Add new v2 cpuset.sched.domain_root flag Waiman Long
@ 2018-05-30 14:18   ` Juri Lelli
  2018-05-30 14:57     ` Waiman Long
  2018-05-31  9:49   ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Juri Lelli @ 2018-05-30 14:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, Mike Galbraith, torvalds, Roman Gushchin, Patrick Bellasi

Hi,

On 29/05/18 09:41, Waiman Long wrote:

[...]

> +  cpuset.sched.domain_root
> +	A read-write single value file which exists on non-root
> +	cpuset-enabled cgroups.  It is a binary value flag that accepts
> +	either "0" (off) or "1" (on).  This flag is set by the parent
> +	and is not delegatable.
> +
> +	If set, it indicates that the current cgroup is the root of a
> +	new scheduling domain or partition that comprises itself and
> +	all its descendants except those that are scheduling domain
> +	roots themselves and their descendants.  The root cgroup is
> +	always a scheduling domain root.
> +
> +	There are constraints on where this flag can be set.  It can
> +	only be set in a cgroup if all the following conditions are true.
> +
> +	1) The "cpuset.cpus" is not empty and the list of CPUs are
> +	   exclusive, i.e. they are not shared by any of its siblings.
> +	2) The parent cgroup is also a scheduling domain root.
> +	3) There are no child cgroups with cpuset enabled.  This is
> +	   for eliminating corner cases that have to be handled if such
> +	   a condition is allowed.
> +
> +	Setting this flag will take the CPUs away from the effective
> +	CPUs of the parent cgroup.  Once it is set, this flag cannot
> +	be cleared if there are any child cgroups with cpuset enabled.
> +	Further changes made to "cpuset.cpus" are allowed as long as
> +	the first condition above is still true.

IIUC, with the configuration below

 cpuset.cpus.effective:6-11
 cgroup.controllers:cpuset
 cpuset.mems.effective:0-1
 cgroup.subtree_control:cpuset
 g1/cpuset.cpus.effective:0-5
 g1/cgroup.controllers:cpuset
 g1/cpuset.sched.load_balance:1
 g1/cpuset.mems.effective:0-1
 g1/cpuset.cpus:0-5
 g1/cpuset.sched.domain_root:1
 user.slice/cpuset.cpus.effective:6-11
 user.slice/cgroup.controllers:cpuset
 user.slice/cpuset.sched.load_balance:1
 user.slice/cpuset.mems.effective:0-1
 user.slice/cpuset.cpus:6-11
 user.slice/cpuset.sched.domain_root:0
 init.scope/cpuset.cpus.effective:6-11
 init.scope/cgroup.controllers:cpuset
 init.scope/cpuset.sched.load_balance:1
 init.scope/cpuset.mems.effective:0-1
 init.scope/cpuset.cpus:6-11
 init.scope/cpuset.sched.domain_root:0
 system.slice/cpuset.cpus.effective:6-11
 system.slice/cgroup.controllers:cpuset
 system.slice/cpuset.sched.load_balance:1
 system.slice/cpuset.mems.effective:0-1
 system.slice/cpuset.cpus:6-11
 system.slice/cpuset.sched.domain_root:0
 machine.slice/cpuset.cpus.effective:6-11
 machine.slice/cgroup.controllers:cpuset
 machine.slice/cpuset.sched.load_balance:1
 machine.slice/cpuset.mems.effective:0-1
 machine.slice/cpuset.cpus:6-11
 machine.slice/cpuset.sched.domain_root:0

I should be able to

 # echo 0-4 >g1/cpuset.cpus

?

It doesn't let me.

I'm not sure we actually want to allow that, but that's what would I
expect as per your text above.

Thanks,

- Juri

BTW: thanks a lot for your prompt feedback and hope it's OK if I keep
playing and asking questions. :)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 2/7] cpuset: Add new v2 cpuset.sched.domain_root flag
  2018-05-30 14:18   ` Juri Lelli
@ 2018-05-30 14:57     ` Waiman Long
  0 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2018-05-30 14:57 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, kernel-team, pjt,
	luto, Mike Galbraith, torvalds, Roman Gushchin, Patrick Bellasi

On 05/30/2018 10:18 AM, Juri Lelli wrote:
> Hi,
>
> On 29/05/18 09:41, Waiman Long wrote:
>
> [...]
>
>> +  cpuset.sched.domain_root
>> +	A read-write single value file which exists on non-root
>> +	cpuset-enabled cgroups.  It is a binary value flag that accepts
>> +	either "0" (off) or "1" (on).  This flag is set by the parent
>> +	and is not delegatable.
>> +
>> +	If set, it indicates that the current cgroup is the root of a
>> +	new scheduling domain or partition that comprises itself and
>> +	all its descendants except those that are scheduling domain
>> +	roots themselves and their descendants.  The root cgroup is
>> +	always a scheduling domain root.
>> +
>> +	There are constraints on where this flag can be set.  It can
>> +	only be set in a cgroup if all the following conditions are true.
>> +
>> +	1) The "cpuset.cpus" is not empty and the list of CPUs is
>> +	   exclusive, i.e. they are not shared by any of its siblings.
>> +	2) The parent cgroup is also a scheduling domain root.
>> +	3) There are no child cgroups with cpuset enabled.  This
>> +	   eliminates corner cases that would otherwise have to be
>> +	   handled.
>> +
>> +	Setting this flag will take the CPUs away from the effective
>> +	CPUs of the parent cgroup.  Once it is set, this flag cannot
>> +	be cleared if there are any child cgroups with cpuset enabled.
>> +	Further changes made to "cpuset.cpus" are allowed as long as
>> +	the first condition above is still true.
> IIUC, with the configuration below
>
>  cpuset.cpus.effective:6-11
>  cgroup.controllers:cpuset
>  cpuset.mems.effective:0-1
>  cgroup.subtree_control:cpuset
>  g1/cpuset.cpus.effective:0-5
>  g1/cgroup.controllers:cpuset
>  g1/cpuset.sched.load_balance:1
>  g1/cpuset.mems.effective:0-1
>  g1/cpuset.cpus:0-5
>  g1/cpuset.sched.domain_root:1
>  user.slice/cpuset.cpus.effective:6-11
>  user.slice/cgroup.controllers:cpuset
>  user.slice/cpuset.sched.load_balance:1
>  user.slice/cpuset.mems.effective:0-1
>  user.slice/cpuset.cpus:6-11
>  user.slice/cpuset.sched.domain_root:0
>  init.scope/cpuset.cpus.effective:6-11
>  init.scope/cgroup.controllers:cpuset
>  init.scope/cpuset.sched.load_balance:1
>  init.scope/cpuset.mems.effective:0-1
>  init.scope/cpuset.cpus:6-11
>  init.scope/cpuset.sched.domain_root:0
>  system.slice/cpuset.cpus.effective:6-11
>  system.slice/cgroup.controllers:cpuset
>  system.slice/cpuset.sched.load_balance:1
>  system.slice/cpuset.mems.effective:0-1
>  system.slice/cpuset.cpus:6-11
>  system.slice/cpuset.sched.domain_root:0
>  machine.slice/cpuset.cpus.effective:6-11
>  machine.slice/cgroup.controllers:cpuset
>  machine.slice/cpuset.sched.load_balance:1
>  machine.slice/cpuset.mems.effective:0-1
>  machine.slice/cpuset.cpus:6-11
>  machine.slice/cpuset.sched.domain_root:0
>
> I should be able to
>
>  # echo 0-4 >g1/cpuset.cpus
>
> ?
>
> It doesn't let me.

It should allow that. I will fix this issue.
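For reference, the intended check can be modelled in a few lines of Python (purely illustrative; the function name is invented and this is not the kernel code): a cpuset.cpus write on a scheduling domain root should succeed as long as the new list is non-empty and stays exclusive with respect to its siblings.

```python
# Illustrative model of the documented constraint, not kernel code:
# a write to cpuset.cpus on a scheduling domain root is acceptable as
# long as the new CPU list is non-empty and remains exclusive, i.e.
# shares no CPU with any sibling cgroup.
def cpus_update_allowed(new_cpus, sibling_cpu_sets):
    if not new_cpus:                      # condition 1: must not be empty
        return False
    return all(new_cpus.isdisjoint(s) for s in sibling_cpu_sets)

# g1 holds CPUs 0-5; every sibling shown above holds 6-11.
siblings = [set(range(6, 12))]
assert cpus_update_allowed(set(range(0, 5)), siblings)   # "echo 0-4": legal
assert not cpus_update_allowed(set(), siblings)          # empty list: rejected
assert not cpus_update_allowed({4, 6}, siblings)         # overlaps a sibling
```

Under this model, the `echo 0-4 >g1/cpuset.cpus` write above keeps condition 1 true and should therefore be accepted.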

>
> I'm not sure we actually want to allow that, but that's what I would
> expect per your text above.
>
> Thanks,
>
> - Juri
>
> BTW: thanks a lot for your prompt feedback and hope it's OK if I keep
> playing and asking questions. :)

Of course. I appreciate your help in looking for issues in the patch
that I might have overlooked.

Thanks,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 2/7] cpuset: Add new v2 cpuset.sched.domain_root flag
  2018-05-29 13:41 ` [PATCH v9 2/7] cpuset: Add new v2 cpuset.sched.domain_root flag Waiman Long
  2018-05-30 14:18   ` Juri Lelli
@ 2018-05-31  9:49   ` Peter Zijlstra
  1 sibling, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2018-05-31  9:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On Tue, May 29, 2018 at 09:41:29AM -0400, Waiman Long wrote:
> +  cpuset.sched.domain_root
> +	A read-write single value file which exists on non-root
> +	cpuset-enabled cgroups.  It is a binary value flag that accepts
> +	either "0" (off) or "1" (on).  This flag is set by the parent
> +	and is not delegatable.

What does "is not delegatable" mean?

I think you used to say "is owned by the parent", which I took to mean
file ownership is that of the parent directory (..) and not of the
current (.), which is slightly odd but works.

So if you chown a cgroup to a user, that user will not be able to change
the file of its 'root' (which will actually be the root in case of a
container), but it _can_ change this file for any sub-cgroups it
creates, right?

So in that respect the feature is delegatable, a container can create
sub-partitions. It just cannot change its 'root' partition, which is
consistent with a real root.

The only inconsistency left is then that the real root does not have
the file at all, vs. a container root having it but not accessible.
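That ownership rule can be sketched as a toy model (illustrative only; `Cgroup` and `may_write_domain_root` are invented names, not the real cgroup permission code):

```python
# Toy model of the ownership rule described above (not the actual
# cgroup permission code): cpuset.sched.domain_root is "set by the
# parent", so write permission follows the owner of the parent cgroup,
# not the owner of the cgroup itself.
class Cgroup:
    def __init__(self, owner, parent=None):
        self.owner = owner
        self.parent = parent

def may_write_domain_root(user, cgroup):
    # The real root cgroup has no such file at all; elsewhere the
    # parent's owner controls it.
    return cgroup.parent is not None and user == cgroup.parent.owner

root = Cgroup(owner="root")
container = Cgroup(owner="alice", parent=root)    # delegated to alice
sub = Cgroup(owner="alice", parent=container)     # created by alice

assert not may_write_domain_root("alice", container)  # her own 'root': no
assert may_write_domain_root("alice", sub)            # her sub-cgroups: yes
```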

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2
  2018-05-29 13:41 ` [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2 Waiman Long
@ 2018-05-31 10:44   ` Peter Zijlstra
  2018-05-31 10:54   ` Peter Zijlstra
  2018-05-31 12:26   ` Peter Zijlstra
  2 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2018-05-31 10:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On Tue, May 29, 2018 at 09:41:30AM -0400, Waiman Long wrote:
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index e7534c5..681a809 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1542,6 +1542,32 @@ Cpuset Interface Files
>  	Further changes made to "cpuset.cpus" are allowed as long as
>  	the first condition above is still true.
>  
> +	A parent scheduling domain root cgroup cannot distribute all
> +	its CPUs to its child scheduling domain root cgroups

This I think wants to be in the previous patch

>                                                            unless
> +	its load balancing flag is turned off.

And this is indeed for here.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2
  2018-05-29 13:41 ` [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2 Waiman Long
  2018-05-31 10:44   ` Peter Zijlstra
@ 2018-05-31 10:54   ` Peter Zijlstra
  2018-05-31 13:36     ` Waiman Long
  2018-05-31 12:26   ` Peter Zijlstra
  2 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2018-05-31 10:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On Tue, May 29, 2018 at 09:41:30AM -0400, Waiman Long wrote:

> +  cpuset.sched.load_balance
> +	A read-write single value file which exists on non-root
> +	cpuset-enabled cgroups.  It is a binary value flag that accepts
> +	either "0" (off) or "1" (on).  This flag is set by the parent
> +	and is not delegatable.  It is on by default in the root cgroup.
> +
> +	When it is on, tasks within this cpuset will be load-balanced
> +	by the kernel scheduler.  Tasks will be moved from CPUs with
> +	high load to other CPUs within the same cpuset with less load
> +	periodically.
> +
> +	When it is off, there will be no load balancing among CPUs on
> +	this cgroup.  Tasks will stay in the CPUs they are running on
> +	and will not be moved to other CPUs.

That is not entirely accurate I'm afraid (unless the patch makes it so,
I've yet to check). When you disable load-balancing on a cgroup you'll
get whatever balancing is left for the partition you happen to end up
in.

Take for instance workqueue thingies, they use kthread_bind_mask()
(IIRC) and thus end up with PF_NO_SETAFFINITY so cpusets (or any other
cgroups really) do not have effect on them (long standing complaint).

So take for instance the unbound numa enabled workqueue threads, those
will land in whatever partition and get balanced there.
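A minimal sketch of that escape hatch (the flag value mirrors the kernel's PF_NO_SETAFFINITY; the rest is invented for illustration):

```python
# Toy model, not kernel code: a task carrying PF_NO_SETAFFINITY rejects
# affinity changes, so a cpuset's effective mask is never applied to it
# and it stays wherever it was bound at creation time.
PF_NO_SETAFFINITY = 0x04000000   # value as in include/linux/sched.h

class Task:
    def __init__(self, flags=0, cpus_allowed=()):
        self.flags = flags
        self.cpus_allowed = set(cpus_allowed)

def set_cpus_allowed(task, new_mask):
    if task.flags & PF_NO_SETAFFINITY:
        return -1                # the kernel returns -EINVAL here
    task.cpus_allowed = set(new_mask)
    return 0

worker = Task(flags=PF_NO_SETAFFINITY, cpus_allowed={0, 1})
assert set_cpus_allowed(worker, {6, 7}) == -1
assert worker.cpus_allowed == {0, 1}   # unchanged: it lands where it was
```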

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2
  2018-05-29 13:41 ` [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2 Waiman Long
  2018-05-31 10:44   ` Peter Zijlstra
  2018-05-31 10:54   ` Peter Zijlstra
@ 2018-05-31 12:26   ` Peter Zijlstra
  2018-05-31 13:54     ` Waiman Long
  2 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2018-05-31 12:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Thomas Gleixner

On Tue, May 29, 2018 at 09:41:30AM -0400, Waiman Long wrote:
> The sched.load_balance flag is needed to enable CPU isolation similar to
> what can be done with the "isolcpus" kernel boot parameter. Its value
> can only be changed in a scheduling domain with no child cpusets. On
> a non-scheduling domain cpuset, the value of sched.load_balance is
> inherited from its parent. This is to make sure that all the cpusets
> within the same scheduling domain or partition have the same load
> balancing state.
> 
> This flag is set by the parent and is not delegatable.

> +  cpuset.sched.domain_root
> +	A read-write single value file which exists on non-root
> +	cpuset-enabled cgroups.  It is a binary value flag that accepts
> +	either "0" (off) or "1" (on).  This flag is set by the parent
> +	and is not delegatable.
> +
> +	If set, it indicates that the current cgroup is the root of a
> +	new scheduling domain or partition that comprises itself and
> +	all its descendants except those that are scheduling domain
> +	roots themselves and their descendants.  The root cgroup is
> +	always a scheduling domain root.
> +
> +	There are constraints on where this flag can be set.  It can
> +	only be set in a cgroup if all the following conditions are true.
> +
> +	1) The "cpuset.cpus" is not empty and the list of CPUs is
> +	   exclusive, i.e. they are not shared by any of its siblings.
> +	2) The parent cgroup is also a scheduling domain root.
> +	3) There are no child cgroups with cpuset enabled.  This
> +	   eliminates corner cases that would otherwise have to be
> +	   handled.
> +
> +	Setting this flag will take the CPUs away from the effective
> +	CPUs of the parent cgroup.  Once it is set, this flag cannot
> +	be cleared if there are any child cgroups with cpuset enabled.
> +	Further changes made to "cpuset.cpus" are allowed as long as
> +	the first condition above is still true.
> +
> +	A parent scheduling domain root cgroup cannot distribute all
> +	its CPUs to its child scheduling domain root cgroups unless
> +	its load balancing flag is turned off.
> +
> +  cpuset.sched.load_balance
> +	A read-write single value file which exists on non-root
> +	cpuset-enabled cgroups.  It is a binary value flag that accepts
> +	either "0" (off) or "1" (on).  This flag is set by the parent
> +	and is not delegatable.  It is on by default in the root cgroup.
> +
> +	When it is on, tasks within this cpuset will be load-balanced
> +	by the kernel scheduler.  Tasks will be moved from CPUs with
> +	high load to other CPUs within the same cpuset with less load
> +	periodically.
> +
> +	When it is off, there will be no load balancing among CPUs on
> +	this cgroup.  Tasks will stay in the CPUs they are running on
> +	and will not be moved to other CPUs.
> +
> +	The load balancing state of a cgroup can only be changed on a
> +	scheduling domain root cgroup with no cpuset-enabled children.
> +	All cgroups within a scheduling domain or partition must have
> +	the same load balancing state.	As descendant cgroups of a
> +	scheduling domain root are created, they inherit the same load
> +	balancing state of their root.

I still find all that a bit weird.

So load_balance=0 basically changes a partition into a
'fully-partitioned partition' with the seemingly random side-effect that
now sub-partitions are allowed to consume all CPUs.

The rationale, only given in the Changelog above, seems to be to allow
'easy' emulation of isolcpus.

I'm still not convinced this is a useful knob to have. You can do
fully-partitioned by simply creating a lot of 1-CPU partitions.

So this one knob does two separate things, both of which seem, to me,
redundant.

Can we please get better rationale for this?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2
  2018-05-31 10:54   ` Peter Zijlstra
@ 2018-05-31 13:36     ` Waiman Long
  0 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2018-05-31 13:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi

On 05/31/2018 06:54 AM, Peter Zijlstra wrote:
> On Tue, May 29, 2018 at 09:41:30AM -0400, Waiman Long wrote:
>
>> +  cpuset.sched.load_balance
>> +	A read-write single value file which exists on non-root
>> +	cpuset-enabled cgroups.  It is a binary value flag that accepts
>> +	either "0" (off) or "1" (on).  This flag is set by the parent
>> +	and is not delegatable.  It is on by default in the root cgroup.
>> +
>> +	When it is on, tasks within this cpuset will be load-balanced
>> +	by the kernel scheduler.  Tasks will be moved from CPUs with
>> +	high load to other CPUs within the same cpuset with less load
>> +	periodically.
>> +
>> +	When it is off, there will be no load balancing among CPUs on
>> +	this cgroup.  Tasks will stay in the CPUs they are running on
>> +	and will not be moved to other CPUs.
> That is not entirely accurate I'm afraid (unless the patch makes it so,
> I've yet to check). When you disable load-balancing on a cgroup you'll
> get whatever balancing is left for the partition you happen to end up
> in.
>
> Take for instance workqueue thingies, they use kthread_bind_mask()
> (IIRC) and thus end up with PF_NO_SETAFFINITY so cpusets (or any other
> cgroups really) do not have effect on them (long standing complaint).
>
> So take for instance the unbound numa enabled workqueue threads, those
> will land in whatever partition and get balanced there.

Thanks for the clarification. The patch doesn't make any changes in the
scheduler. I was trying to say what the flag does. I will update the
documentation about this nuance.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2
  2018-05-31 12:26   ` Peter Zijlstra
@ 2018-05-31 13:54     ` Waiman Long
  2018-05-31 15:20       ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2018-05-31 13:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Thomas Gleixner

On 05/31/2018 08:26 AM, Peter Zijlstra wrote:
> On Tue, May 29, 2018 at 09:41:30AM -0400, Waiman Long wrote:
>> The sched.load_balance flag is needed to enable CPU isolation similar to
>> what can be done with the "isolcpus" kernel boot parameter. Its value
>> can only be changed in a scheduling domain with no child cpusets. On
>> a non-scheduling domain cpuset, the value of sched.load_balance is
>> inherited from its parent. This is to make sure that all the cpusets
>> within the same scheduling domain or partition have the same load
>> balancing state.
>>
>> This flag is set by the parent and is not delegatable.
>> +  cpuset.sched.domain_root
>> +	A read-write single value file which exists on non-root
>> +	cpuset-enabled cgroups.  It is a binary value flag that accepts
>> +	either "0" (off) or "1" (on).  This flag is set by the parent
>> +	and is not delegatable.
>> +
>> +	If set, it indicates that the current cgroup is the root of a
>> +	new scheduling domain or partition that comprises itself and
>> +	all its descendants except those that are scheduling domain
>> +	roots themselves and their descendants.  The root cgroup is
>> +	always a scheduling domain root.
>> +
>> +	There are constraints on where this flag can be set.  It can
>> +	only be set in a cgroup if all the following conditions are true.
>> +
>> +	1) The "cpuset.cpus" is not empty and the list of CPUs is
>> +	   exclusive, i.e. they are not shared by any of its siblings.
>> +	2) The parent cgroup is also a scheduling domain root.
>> +	3) There are no child cgroups with cpuset enabled.  This
>> +	   eliminates corner cases that would otherwise have to be
>> +	   handled.
>> +
>> +	Setting this flag will take the CPUs away from the effective
>> +	CPUs of the parent cgroup.  Once it is set, this flag cannot
>> +	be cleared if there are any child cgroups with cpuset enabled.
>> +	Further changes made to "cpuset.cpus" are allowed as long as
>> +	the first condition above is still true.
>> +
>> +	A parent scheduling domain root cgroup cannot distribute all
>> +	its CPUs to its child scheduling domain root cgroups unless
>> +	its load balancing flag is turned off.
>> +
>> +  cpuset.sched.load_balance
>> +	A read-write single value file which exists on non-root
>> +	cpuset-enabled cgroups.  It is a binary value flag that accepts
>> +	either "0" (off) or "1" (on).  This flag is set by the parent
>> +	and is not delegatable.  It is on by default in the root cgroup.
>> +
>> +	When it is on, tasks within this cpuset will be load-balanced
>> +	by the kernel scheduler.  Tasks will be moved from CPUs with
>> +	high load to other CPUs within the same cpuset with less load
>> +	periodically.
>> +
>> +	When it is off, there will be no load balancing among CPUs on
>> +	this cgroup.  Tasks will stay in the CPUs they are running on
>> +	and will not be moved to other CPUs.
>> +
>> +	The load balancing state of a cgroup can only be changed on a
>> +	scheduling domain root cgroup with no cpuset-enabled children.
>> +	All cgroups within a scheduling domain or partition must have
>> +	the same load balancing state.	As descendant cgroups of a
>> +	scheduling domain root are created, they inherit the same load
>> +	balancing state of their root.
> I still find all that a bit weird.
>
> So load_balance=0 basically changes a partition into a
> 'fully-partitioned partition' with the seemingly random side-effect that
> now sub-partitions are allowed to consume all CPUs.

Are you suggesting that we should allow sub-partitions to consume all
the CPUs no matter the load balance state? I can live with that if you
think it is more logical.

> The rationale, only given in the Changelog above, seems to be to allow
> 'easy' emulation of isolcpus.
>
> I'm still not convinced this is a useful knob to have. You can do
> fully-partitioned by simply creating a lot of 1-CPU partitions.

That is certainly true. However, I think there is some additional
overhead on the scheduler side in maintaining those 1-CPU partitions, right?

> So this one knob does two separate things, both of which seem, to me,
> redundant.
>
> Can we please get better rationale for this?

I am fine with getting rid of the load_balance flag if this is the
consensus. However, we do need to come up with a good migration story
for those users that need the isolcpus capability. I think Mike was the
one asking for supporting isolcpus. So Mike, what is your take on that?

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2
  2018-05-31 13:54     ` Waiman Long
@ 2018-05-31 15:20       ` Peter Zijlstra
  2018-05-31 15:36         ` Waiman Long
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2018-05-31 15:20 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Thomas Gleixner

On Thu, May 31, 2018 at 09:54:27AM -0400, Waiman Long wrote:
> On 05/31/2018 08:26 AM, Peter Zijlstra wrote:

> > I still find all that a bit weird.
> >
> > So load_balance=0 basically changes a partition into a
> > 'fully-partitioned partition' with the seemingly random side-effect that
> > now sub-partitions are allowed to consume all CPUs.
> 
> Are you suggesting that we should allow sub-partitions to consume all the
> CPUs no matter the load balance state? I can live with that if you think
> it is more logical.

I'm on the fence myself; the only thing I'm fairly sure of is that tying
this particular behaviour to the load-balance knob seems off.

> > The rationale, only given in the Changelog above, seems to be to allow
> > 'easy' emulation of isolcpus.
> >
> > I'm still not convinced this is a useful knob to have. You can do
> > fully-partitioned by simply creating a lot of 1-CPU partitions.
> 
> That is certainly true. However, I think there is some additional
> overhead on the scheduler side in maintaining those 1-CPU partitions, right?

cpuset-controller as such doesn't have much overhead scheduler-wise;
cpu-controller OTOH does, and there depth is the predominant factor, so
many sibling groups should not matter there either.

> > So this one knob does two separate things, both of which seem, to me,
> > redundant.
> >
> > Can we please get better rationale for this?
> 
> I am fine with getting rid of the load_balance flag if this is the
> consensus. However, we do need to come up with a good migration story for
> those users that need the isolcpus capability. I think Mike was the one
> asking for supporting isolcpus. So Mike, what is your take on that?

So I don't strictly mind having a knob that does the 'fully-partitioned
partition' thing -- however odd that sounds -- but I feel we should have
a solid use-case for it.

I also think we should not mix the 'consume all' thing with the
'fully-partitioned' thing, as they are otherwise unrelated.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2
  2018-05-31 15:20       ` Peter Zijlstra
@ 2018-05-31 15:36         ` Waiman Long
  2018-05-31 16:08           ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2018-05-31 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Thomas Gleixner

On 05/31/2018 11:20 AM, Peter Zijlstra wrote:
> On Thu, May 31, 2018 at 09:54:27AM -0400, Waiman Long wrote:
>> On 05/31/2018 08:26 AM, Peter Zijlstra wrote:
>>> I still find all that a bit weird.
>>>
>>> So load_balance=0 basically changes a partition into a
>>> 'fully-partitioned partition' with the seemingly random side-effect that
>>> now sub-partitions are allowed to consume all CPUs.
>> Are you suggesting that we should allow sub-partitions to consume all the
>> CPUs no matter the load balance state? I can live with that if you think
>> it is more logical.
> I'm on the fence myself; the only thing I'm fairly sure of is that tying
> this particular behaviour to the load-balance knob seems off.

The main reason for doing it this way is that I don't want to have a
load-balanced partition with no CPU in it. How about we just don't allow
consume-all at all? Each partition must have at least one CPU.
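That invariant can be sketched as a small validity check (illustrative Python, not kernel code; the names are made up):

```python
# Sketch of the proposed invariant: carving a child partition out of a
# parent is only valid if the child takes a non-empty subset of the
# parent's CPUs and the parent keeps at least one CPU for itself.
def can_create_child_partition(parent_effective, child_cpus):
    if not child_cpus or not child_cpus <= parent_effective:
        return False
    return len(parent_effective - child_cpus) >= 1

parent = {0, 1, 2, 3}
assert can_create_child_partition(parent, {0, 1})      # parent keeps 2-3
assert not can_create_child_partition(parent, parent)  # consume-all rejected
```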

>
>>> The rationale, only given in the Changelog above, seems to be to allow
>>> 'easy' emulation of isolcpus.
>>>
>>> I'm still not convinced this is a useful knob to have. You can do
>>> fully-partitioned by simply creating a lot of 1-CPU partitions.
>> That is certainly true. However, I think there is some additional
>> overhead on the scheduler side in maintaining those 1-CPU partitions, right?
> cpuset-controller as such doesn't have much overhead scheduler-wise;
> cpu-controller OTOH does, and there depth is the predominant factor, so
> many sibling groups should not matter there either.
>
>>> So this one knob does two separate things, both of which seem, to me,
>>> redundant.
>>>
>>> Can we please get better rationale for this?
>> I am fine with getting rid of the load_balance flag if this is the consensus.
>> However, we do need to come up with a good migration story for those
>> users that need the isolcpus capability. I think Mike was the one asking
>> for supporting isolcpus. So Mike, what is your take on that?
> So I don't strictly mind having a knob that does the 'fully-partitioned
> partition' thing -- however odd that sounds -- but I feel we should have
> a solid use-case for it.
>
> I also think we should not mix the 'consume all' thing with the
> 'fully-partitioned' thing, as they are otherwise unrelated.

The "consume all" and "fully-partitioned" look the same to me. Are you
talking about allocating all the CPUs in a partition to sub-partitions
so that there is no CPU left in the parent partition?

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2
  2018-05-31 15:36         ` Waiman Long
@ 2018-05-31 16:08           ` Peter Zijlstra
  2018-05-31 16:42             ` Waiman Long
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2018-05-31 16:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Thomas Gleixner

On Thu, May 31, 2018 at 11:36:39AM -0400, Waiman Long wrote:
> > I'm on the fence myself; the only thing I'm fairly sure of is that tying
> > this particular behaviour to the load-balance knob seems off.
> 
> The main reason for doing it this way is that I don't want to have a
> load-balanced partition with no CPU in it. How about we just don't allow
> consume-all at all? Each partition must have at least one CPU.

I suspect that might be sufficient. It certainly is for the use-cases
I'm aware of. You always want a system/control set which runs the
regular busy work of running a system.

Then you have one (or more) partitions to run your 'important' work.

> > I also think we should not mix the 'consume all' thing with the
> > 'fully-partitioned' thing, as they are otherwise unrelated.
>
> The "consume all" and "fully-partitioned" look the same to me. Are you
> talking about allocating all the CPUs in a partition to sub-partitions
> so that there is no CPU left in the parent partition?

Not sure what you're asking. "consume all" is allowing sub-partitions to
allocate all CPUs of the parent, such that there are none left.

"fully-partitioned" is N cpus but no load-balancing, also equivalent to
N 1-CPU partitions.

They are distinct things. Disabling load-balancing should not affect how
many CPUs can be allocated to sub-partitions; the moment you hit 1 CPU,
the load balancing is effectively off already. Going down to 0 CPUs
isn't a problem for the load-balancer; it wasn't doing anything anyway.

So the question is whether someone really needs the one partition without
balancing over N separate partitions.
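The claimed equivalence can be checked with a toy model (illustrative only; `balance_domains` is an invented helper, not a kernel function):

```python
# Toy check of the equivalence: the balancing domains produced by one
# N-CPU partition with load balancing off are the same as those of N
# separate 1-CPU partitions.
def balance_domains(partitions):
    """partitions: iterable of (cpus, load_balance) pairs; returns the
    resulting balancing domains as a set of frozensets.  A partition
    with balancing off degenerates into one singleton domain per CPU."""
    domains = set()
    for cpus, load_balance in partitions:
        if load_balance:
            domains.add(frozenset(cpus))
        else:
            domains.update(frozenset({c}) for c in cpus)
    return domains

one_unbalanced = balance_domains([({0, 1, 2}, False)])
three_singletons = balance_domains([({0}, True), ({1}, True), ({2}, True)])
assert one_unbalanced == three_singletons
```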

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2
  2018-05-31 16:08           ` Peter Zijlstra
@ 2018-05-31 16:42             ` Waiman Long
  2018-06-20 14:46               ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Waiman Long @ 2018-05-31 16:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Thomas Gleixner

On 05/31/2018 12:08 PM, Peter Zijlstra wrote:
> On Thu, May 31, 2018 at 11:36:39AM -0400, Waiman Long wrote:
>>> I'm on the fence myself; the only thing I'm fairly sure of is that tying
>>> this particular behaviour to the load-balance knob seems off.
>> The main reason for doing it this way is that I don't want to have a
>> load-balanced partition with no CPU in it. How about we just don't allow
>> consume-all at all? Each partition must have at least one CPU.
> I suspect that might be sufficient. It certainly is for the use-cases
> I'm aware of. You always want a system/control set which runs the
> regular busy work of running a system.
>
> Then you have one (or more) partitions to run your 'important' work.

Good. I will make the change in the next version.

>
>>> I also think we should not mix the 'consume all' thing with the
>>> 'fully-partitioned' thing, as they are otherwise unrelated.
>> The "consume all" and "fully-partitioned" look the same to me. Are you
>> talking about allocating all the CPUs in a partition to sub-partitions
>> so that there is no CPU left in the parent partition?
> Not sure what you're asking. "consume all" is allowing sub-partitions to
> allocate all CPUs of the parent, such that there are none left.
>
> "fully-partitioned" is N cpus but no load-balancing, also equivalent to
> N 1-CPU partitions.

Thanks for the clarification.

> They are distinct things. Disabling load-balancing should not affect how
> many CPUs can be allocated to sub-partitions; the moment you hit 1 CPU,
> the load balancing is effectively off already. Going down to 0 CPUs
> isn't a problem for the load-balancer; it wasn't doing anything anyway.
>
> So the question is if someone really needs the one partition without
> balancing over N separate paritions.

Thinking about isolcpus emulation, I now realize that it is more than
just disabling load balancing. It also disables some kernel threads like
kworker from running so that a userspace application can monopolize as
much of a CPU as possible. Disabling kernel threads from running isn't
that hard if it is only done once at boot time. It is trickier if we
have to do it at run time.

Without good isolcpus emulation, disabling load balance kind of loses
its usefulness. So I am going to take out the load_balance flag for now
unless I hear objection otherwise.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2
  2018-05-31 16:42             ` Waiman Long
@ 2018-06-20 14:46               ` Peter Zijlstra
  2018-06-21  7:40                 ` Waiman Long
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2018-06-20 14:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Thomas Gleixner, Frederic Weisbecker

On Thu, May 31, 2018 at 12:42:20PM -0400, Waiman Long wrote:
> Thinking about isolcpus emulation, I now realize that it is more than
> just disabling load balancing. It also disables some kernel threads like
> kworker from running so that a userspace application can monopolize as
> much of a CPU as possible. Disabling kernel threads from running isn't
> that hard if it is only done once at boot time. It is trickier if we
> have to do it at run time.

Don't think it is all that difficult, we just need a notifier for when
that housekeeping thing changes and ensure that everybody who uses it
re-evaluates crap.

> Without good isolcpus emulation, disabling load balancing kind of loses
> its usefulness. So I am going to take out the load_balance flag for now
> unless I hear objections.

I'm not seeing the direct link between the load_balance flag and
isolcpus emulation in the proposed stuff.

We can tie the housekeeping mask to whatever CPUs remain in the root
cgroup, couple that to that notifier and it should all just work I
think.
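In miniature, that notifier idea looks like a plain observer pattern
(a Python sketch; the class and method names are made up for
illustration, and the kernel would use an actual notifier chain):

```python
# Minimal sketch: the housekeeping mask follows whatever CPUs remain
# in the root cgroup, and registered callbacks re-evaluate on change.
class HousekeepingState:
    def __init__(self, root_cgroup_cpus):
        self.mask = set(root_cgroup_cpus)
        self._notifiers = []

    def register_notifier(self, callback):
        # Anyone who uses the housekeeping mask registers here.
        self._notifiers.append(callback)

    def root_cgroup_changed(self, new_cpus):
        # Update the mask and make every user re-evaluate placement.
        self.mask = set(new_cpus)
        for notify in self._notifiers:
            notify(self.mask)

seen = []
hk = HousekeepingState({0, 1, 2, 3})
hk.register_notifier(lambda mask: seen.append(sorted(mask)))

# CPUs 2-3 move out of the root cgroup into a partition:
hk.root_cgroup_changed({0, 1})
print(seen)   # [[0, 1]]
```

Each subsystem's callback would then re-pin its kernel threads,
unbound workqueues, and so on to the new mask.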

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2
  2018-06-20 14:46               ` Peter Zijlstra
@ 2018-06-21  7:40                 ` Waiman Long
  0 siblings, 0 replies; 27+ messages in thread
From: Waiman Long @ 2018-06-21  7:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, kernel-team, pjt, luto, Mike Galbraith,
	torvalds, Roman Gushchin, Juri Lelli, Patrick Bellasi,
	Thomas Gleixner, Frederic Weisbecker

On 06/20/2018 10:46 PM, Peter Zijlstra wrote:
> On Thu, May 31, 2018 at 12:42:20PM -0400, Waiman Long wrote:
>> Thinking about isolcpus emulation, I now realize that it is more than
>> just disabling load balancing. It also disables some kernel threads,
>> like kworkers, from running so that a userspace application can
>> monopolize as much of a CPU as possible. Keeping kernel threads from
>> running isn't that hard if it is only done once at boot time; it is
>> trickier if we have to do it at run time.
> Don't think it is all that difficult, we just need a notifier for when
> that housekeeping thing changes and ensure that everybody who uses it
> re-evaluates crap.

Yes, it is certainly doable. I can work on that in my free time once the
first cpuset v2 patchset is done. There is enough complexity in the
current patchset, and I don't want to add stuff that is not part of the
core cpuset functionality at this point. We can always add new features
in the future, but once an interface is in, it is hard to change.

>> Without good isolcpus emulation, disabling load balancing kind of loses
>> its usefulness. So I am going to take out the load_balance flag for now
>> unless I hear objections.
> I'm not seeing the direct link between the load_balance flag and
> isolcpus emulation in the proposed stuff.
>
> We can tie the housekeeping mask to whatever CPUs remain in the root
> cgroup, couple that to that notifier and it should all just work I
> think.

The CPUs in isolcpus are outside the reach of the scheduler, which is
equivalent to turning off load balancing for them in that sense. Of
course, there may also be other things that need to be handled in order
to have a proper emulation of isolcpus.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2018-06-21  7:41 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-29 13:41 [PATCH v9 0/7] Enable cpuset controller in default hierarchy Waiman Long
2018-05-29 13:41 ` [PATCH v9 1/7] cpuset: " Waiman Long
2018-05-29 13:41 ` [PATCH v9 2/7] cpuset: Add new v2 cpuset.sched.domain_root flag Waiman Long
2018-05-30 14:18   ` Juri Lelli
2018-05-30 14:57     ` Waiman Long
2018-05-31  9:49   ` Peter Zijlstra
2018-05-29 13:41 ` [PATCH v9 3/7] cpuset: Add cpuset.sched.load_balance flag to v2 Waiman Long
2018-05-31 10:44   ` Peter Zijlstra
2018-05-31 10:54   ` Peter Zijlstra
2018-05-31 13:36     ` Waiman Long
2018-05-31 12:26   ` Peter Zijlstra
2018-05-31 13:54     ` Waiman Long
2018-05-31 15:20       ` Peter Zijlstra
2018-05-31 15:36         ` Waiman Long
2018-05-31 16:08           ` Peter Zijlstra
2018-05-31 16:42             ` Waiman Long
2018-06-20 14:46               ` Peter Zijlstra
2018-06-21  7:40                 ` Waiman Long
2018-05-29 13:41 ` [PATCH v9 4/7] cpuset: Make generate_sched_domains() recognize isolated_cpus Waiman Long
2018-05-29 13:41 ` [PATCH v9 5/7] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root Waiman Long
2018-05-29 13:41 ` [PATCH v9 6/7] cpuset: Don't rebuild sched domains if cpu changes in non-domain root Waiman Long
2018-05-29 13:41 ` [PATCH v9 7/7] cpuset: Allow reporting of sched domain generation info Waiman Long
2018-05-30 10:13 ` [PATCH v9 0/7] Enable cpuset controller in default hierarchy Juri Lelli
2018-05-30 12:56   ` Waiman Long
2018-05-30 13:05     ` Juri Lelli
2018-05-30 13:47       ` Waiman Long
2018-05-30 13:52         ` Juri Lelli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).