LKML Archive on lore.kernel.org
* [mm][PATCH 0/4] Memory cgroup hierarchy introduction
@ 2008-11-01 18:48 Balbir Singh
  2008-11-01 18:48 ` [mm] [PATCH 1/4] Memory cgroup hierarchy documentation Balbir Singh
                   ` (6 more replies)
  0 siblings, 7 replies; 34+ messages in thread
From: Balbir Singh @ 2008-11-01 18:48 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki

This patchset follows several iterations of the memory controller hierarchy
patches: the hardwall approach by Kamezawa-San [1] and version 1 of this
patchset at [2].

The current approach is based on [2] and has the following properties:

1. Hierarchies are very natural in a filesystem like the cgroup filesystem.
   Multi-level tree hierarchies have long been supported by filesystems.
   When the feature is turned on, we honor hierarchies such that the root
   accounts for the resource usage of all children and limits can be set
   at any point in the hierarchy. Any memory cgroup is limited by the
   limits along the hierarchy. The total usage of all children of a node
   cannot exceed the limit of the node.
2. The hierarchy feature is selectable and off by default.
3. Hierarchies are expensive and the trade-off is depth versus performance.
   Hierarchies can also be completely turned off.

The patches are against 2.6.28-rc2-mm1 and were tested in a KVM instance
with SMP and swap turned on.
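
A usage sketch, for illustration only (the paths and sizes are hypothetical,
and it assumes the memory controller is mounted at /cgroups with the feature
selector from patch 4/4 applied):

# mount -t cgroup -o memory none /cgroups
# echo 1 > /cgroups/memory.features
# mkdir /cgroups/a
# mkdir /cgroups/a/b
# echo 100M > /cgroups/a/memory.limit_in_bytes
# echo 40M > /cgroups/a/b/memory.limit_in_bytes

Note that memory.features must be written while the root cgroup is still the
only cgroup. With hierarchy on, pages charged to a/b are charged to a as
well, so a/b is bounded both by its own 40M limit and by a's 100M limit.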

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>

Series
------

memcg-hierarchy-documentation.patch
resource-counters-hierarchy-support.patch
memcg-hierarchical-reclaim.patch
memcg-add-hierarchy-selector.patch

Reviews? Comments?

References

1. http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-06/msg05417.html
2. http://kerneltrap.org/mailarchive/linux-kernel/2008/4/19/1513644/thread

-- 
	Balbir


* [mm] [PATCH 1/4] Memory cgroup hierarchy documentation
  2008-11-01 18:48 [mm][PATCH 0/4] Memory cgroup hierarchy introduction Balbir Singh
@ 2008-11-01 18:48 ` Balbir Singh
  2008-11-04  6:25   ` Paul Menage
  2008-11-01 18:48 ` [mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy Balbir Singh
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2008-11-01 18:48 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki



Documentation updates for hierarchy support

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/controllers/memory.txt |   34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff -puN Documentation/controllers/memory.txt~memcg-hierarchy-documentation Documentation/controllers/memory.txt
--- linux-2.6.28-rc2/Documentation/controllers/memory.txt~memcg-hierarchy-documentation	2008-11-02 00:14:54.000000000 +0530
+++ linux-2.6.28-rc2-balbir/Documentation/controllers/memory.txt	2008-11-02 00:14:54.000000000 +0530
@@ -245,6 +245,40 @@ cgroup might have some charge associated
 tasks have migrated away from it. Such charges are automatically dropped at
 rmdir() if there are no tasks.
 
+5. Hierarchy support
+
+The memory controller supports a deep hierarchy and hierarchical accounting.
+The hierarchy is created by creating the appropriate cgroups in the
+cgroup filesystem. Consider, for example, the following cgroup filesystem
+hierarchy:
+
+                  root
+                 /  |  \
+                /   |   \
+               a    b    c
+                         | \
+                         |  \
+                         d   e
+
+In the diagram above, with hierarchical accounting enabled, all memory
+usage of e is accounted to its ancestors up to the root (i.e., c and root).
+If one of the ancestors goes over its limit, the reclaim algorithm reclaims
+from the tasks in the ancestor and from the children of the ancestor.
+
+5.1 Enabling hierarchical accounting and reclaim
+
+The memory controller disables the hierarchy feature by default. Support
+can be enabled by writing 1 to the memory.features file of the root cgroup:
+
+# echo 1 > memory.features
+
+The feature can be disabled by
+
+# echo 0 > memory.features
+
+NOTE: Enabling/disabling will fail if the root cgroup already has other
+cgroups created below it.
+
-5. TODO
+6. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
_
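
A sketch of how the example hierarchy in the text above could be created
(the /cgroups mount point is an assumption, following the style of the
existing examples in this document):

# echo 1 > /cgroups/memory.features
# mkdir /cgroups/a /cgroups/b /cgroups/c
# mkdir /cgroups/c/d /cgroups/c/e

After this, any page charged to e is also accounted to c and to the root.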

-- 
	Balbir


* [mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy
  2008-11-01 18:48 [mm][PATCH 0/4] Memory cgroup hierarchy introduction Balbir Singh
  2008-11-01 18:48 ` [mm] [PATCH 1/4] Memory cgroup hierarchy documentation Balbir Singh
@ 2008-11-01 18:48 ` Balbir Singh
  2008-11-02  5:42   ` KAMEZAWA Hiroyuki
  2008-11-01 18:48 ` [mm] [PATCH 3/4] Memory cgroup hierarchical reclaim Balbir Singh
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2008-11-01 18:48 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki


Add support for building hierarchies in resource counters. Cgroups allow us
to build a deep hierarchy, but we currently don't link the resource counters
belonging to the memory controller control groups, which are linked in the
cgroup hierarchy. This patch provides the infrastructure for resource
counters that have the same hierarchy as their cgroup counterparts.

This set of patches is based on the resource counter hierarchy patches
posted by Pavel Emelianov.

NOTE: Building hierarchies is expensive; deeper hierarchies imply charging
all the way up to the root. Since hierarchies are known to be expensive,
the user needs to be careful and aware of the trade-offs before creating
very deep ones.


Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/res_counter.h |    8 ++++++--
 kernel/res_counter.c        |   42 ++++++++++++++++++++++++++++++++++--------
 mm/memcontrol.c             |    9 ++++++---
 3 files changed, 46 insertions(+), 13 deletions(-)

diff -puN include/linux/res_counter.h~resource-counters-hierarchy-support include/linux/res_counter.h
--- linux-2.6.28-rc2/include/linux/res_counter.h~resource-counters-hierarchy-support	2008-11-02 00:14:58.000000000 +0530
+++ linux-2.6.28-rc2-balbir/include/linux/res_counter.h	2008-11-02 00:14:58.000000000 +0530
@@ -43,6 +43,10 @@ struct res_counter {
 	 * the routines below consider this to be IRQ-safe
 	 */
 	spinlock_t lock;
+	/*
+	 * Parent counter, used for hierarchical resource accounting
+	 */
+	struct res_counter *parent;
 };
 
 /**
@@ -87,7 +91,7 @@ enum {
  * helpers for accounting
  */
 
-void res_counter_init(struct res_counter *counter);
+void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 
 /*
  * charge - try to consume more resource.
@@ -103,7 +107,7 @@ void res_counter_init(struct res_counter
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
 int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val);
+		unsigned long val, struct res_counter **limit_fail_at);
 
 /*
  * uncharge - tell that some portion of the resource is released
diff -puN kernel/res_counter.c~resource-counters-hierarchy-support kernel/res_counter.c
--- linux-2.6.28-rc2/kernel/res_counter.c~resource-counters-hierarchy-support	2008-11-02 00:14:58.000000000 +0530
+++ linux-2.6.28-rc2-balbir/kernel/res_counter.c	2008-11-02 00:14:58.000000000 +0530
@@ -15,10 +15,11 @@
 #include <linux/uaccess.h>
 #include <linux/mm.h>
 
-void res_counter_init(struct res_counter *counter)
+void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 {
 	spin_lock_init(&counter->lock);
 	counter->limit = (unsigned long long)LLONG_MAX;
+	counter->parent = parent;
 }
 
 int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -34,14 +35,34 @@ int res_counter_charge_locked(struct res
 	return 0;
 }
 
-int res_counter_charge(struct res_counter *counter, unsigned long val)
+int res_counter_charge(struct res_counter *counter, unsigned long val,
+			struct res_counter **limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
+	struct res_counter *c, *u;
 
-	spin_lock_irqsave(&counter->lock, flags);
-	ret = res_counter_charge_locked(counter, val);
-	spin_unlock_irqrestore(&counter->lock, flags);
+	*limit_fail_at = NULL;
+	local_irq_save(flags);
+	for (c = counter; c != NULL; c = c->parent) {
+		spin_lock(&c->lock);
+		ret = res_counter_charge_locked(c, val);
+		spin_unlock(&c->lock);
+		if (ret < 0) {
+			*limit_fail_at = c;
+			goto undo;
+		}
+	}
+	ret = 0;
+	goto done;
+undo:
+	for (u = counter; u != c; u = u->parent) {
+		spin_lock(&u->lock);
+		res_counter_uncharge_locked(u, val);
+		spin_unlock(&u->lock);
+	}
+done:
+	local_irq_restore(flags);
 	return ret;
 }
 
@@ -56,10 +77,15 @@ void res_counter_uncharge_locked(struct 
 void res_counter_uncharge(struct res_counter *counter, unsigned long val)
 {
 	unsigned long flags;
+	struct res_counter *c;
 
-	spin_lock_irqsave(&counter->lock, flags);
-	res_counter_uncharge_locked(counter, val);
-	spin_unlock_irqrestore(&counter->lock, flags);
+	local_irq_save(flags);
+	for (c = counter; c != NULL; c = c->parent) {
+		spin_lock(&c->lock);
+		res_counter_uncharge_locked(c, val);
+		spin_unlock(&c->lock);
+	}
+	local_irq_restore(flags);
 }
 
 
diff -puN mm/memcontrol.c~resource-counters-hierarchy-support mm/memcontrol.c
--- linux-2.6.28-rc2/mm/memcontrol.c~resource-counters-hierarchy-support	2008-11-02 00:14:58.000000000 +0530
+++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-02 00:14:58.000000000 +0530
@@ -485,6 +485,7 @@ int mem_cgroup_try_charge(struct mm_stru
 {
 	struct mem_cgroup *mem;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct res_counter *fail_res;
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
@@ -510,7 +511,7 @@ int mem_cgroup_try_charge(struct mm_stru
 	}
 
 
-	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
+	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
 		if (!(gfp_mask & __GFP_WAIT))
 			goto nomem;
 
@@ -1175,18 +1176,20 @@ static void mem_cgroup_free(struct mem_c
 static struct cgroup_subsys_state *
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
-	struct mem_cgroup *mem;
+	struct mem_cgroup *mem, *parent;
 	int node;
 
 	if (unlikely((cont->parent) == NULL)) {
 		mem = &init_mem_cgroup;
+		parent = NULL;
 	} else {
 		mem = mem_cgroup_alloc();
+		parent = mem_cgroup_from_cont(cont->parent);
 		if (!mem)
 			return ERR_PTR(-ENOMEM);
 	}
 
-	res_counter_init(&mem->res);
+	res_counter_init(&mem->res, parent ? &parent->res : NULL);
 
 	for_each_node_state(node, N_POSSIBLE)
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
_
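
As a hedged sketch of the calling convention the new signature implies
(reclaim_from() is a hypothetical helper, used only to show where the
returned pointer leads; mem is a struct mem_cgroup as in
mem_cgroup_try_charge() above):

	struct res_counter *fail_res;
	int err;

	err = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
	if (err) {
		/*
		 * fail_res points at the first counter on the path from
		 * mem->res towards the root whose limit was exceeded. The
		 * counters below it have already been rolled back by the
		 * undo loop, so the caller only has to decide whom to
		 * reclaim from.
		 */
		reclaim_from(fail_res);		/* hypothetical */
	}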

-- 
	Balbir


* [mm] [PATCH 3/4] Memory cgroup hierarchical reclaim
  2008-11-01 18:48 [mm][PATCH 0/4] Memory cgroup hierarchy introduction Balbir Singh
  2008-11-01 18:48 ` [mm] [PATCH 1/4] Memory cgroup hierarchy documentation Balbir Singh
  2008-11-01 18:48 ` [mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy Balbir Singh
@ 2008-11-01 18:48 ` Balbir Singh
  2008-11-02  5:37   ` KAMEZAWA Hiroyuki
  2008-11-01 18:49 ` [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector Balbir Singh
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2008-11-01 18:48 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki


This patch introduces hierarchical reclaim. When an ancestor goes over its
limit, the charging routine points to the parent that is above its limit.
The reclaim process then starts from the last scanned child of the ancestor
and reclaims until the ancestor goes below its limit.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 mm/memcontrol.c |  153 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 129 insertions(+), 24 deletions(-)

diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
--- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim	2008-11-02 00:14:59.000000000 +0530
+++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-02 00:14:59.000000000 +0530
@@ -132,6 +132,11 @@ struct mem_cgroup {
 	 * statistics.
 	 */
 	struct mem_cgroup_stat stat;
+	/*
+	 * While reclaiming in a hierarchy, we cache the last child we
+	 * reclaimed from.
+	 */
+	struct mem_cgroup *last_scanned_child;
 };
 static struct mem_cgroup init_mem_cgroup;
 
@@ -467,6 +472,125 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
+static struct mem_cgroup *
+mem_cgroup_from_res_counter(struct res_counter *counter)
+{
+	return container_of(counter, struct mem_cgroup, res);
+}
+
+/*
+ * Dance down the hierarchy if needed to reclaim memory. We remember the
+ * last child we reclaimed from, so that we don't end up penalizing
+ * one child extensively based on its position in the children list
+ */
+static int
+mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem, gfp_t gfp_mask)
+{
+	struct cgroup *cg, *cg_current, *cgroup;
+	struct mem_cgroup *mem_child;
+	int ret = 0;
+
+	if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
+		return -ENOMEM;
+
+	/*
+	 * try_to_free_mem_cgroup_pages() might not give us a full
+	 * picture of reclaim. Some pages are reclaimed and might be
+	 * moved to swap cache or just unmapped from the cgroup.
+	 * Check the limit again to see if the reclaim reduced the
+	 * current usage of the cgroup before giving up
+	 */
+	if (res_counter_check_under_limit(&mem->res))
+		return 0;
+
+	/*
+	 * Scan all children under the mem_cgroup mem
+	 */
+	if (!mem->last_scanned_child)
+		cgroup = list_first_entry(&mem->css.cgroup->children,
+				struct cgroup, sibling);
+	else
+		cgroup = mem->last_scanned_child->css.cgroup;
+
+	cg_current = cgroup;
+
+	/*
+	 * We iterate twice. This is in part a fundamental list issue:
+	 * the elements are inserted using list_add() and hence the list
+	 * behaves like a stack, so list_for_each_entry_safe_from() stops
+	 * after seeing the first child. The two loops let us work
+	 * independently of the insertion order and give us a full pass
+	 * at scanning all list entries for reclaim.
+	 */
+	list_for_each_entry_safe_from(cgroup, cg, &cg_current->parent->children,
+						 sibling) {
+		mem_child = mem_cgroup_from_cont(cgroup);
+
+		/*
+		 * Move beyond last scanned child
+		 */
+		if (mem_child == mem->last_scanned_child)
+			continue;
+
+		ret = try_to_free_mem_cgroup_pages(mem_child, gfp_mask);
+		mem->last_scanned_child = mem_child;
+
+		if (res_counter_check_under_limit(&mem->res)) {
+			ret = 0;
+			goto done;
+		}
+	}
+
+	list_for_each_entry_safe(cgroup, cg, &cg_current->parent->children,
+						 sibling) {
+		mem_child = mem_cgroup_from_cont(cgroup);
+
+		ret = try_to_free_mem_cgroup_pages(mem_child, gfp_mask);
+		mem->last_scanned_child = mem_child;
+
+		if (res_counter_check_under_limit(&mem->res)) {
+			ret = 0;
+			goto done;
+		}
+	}
+
+done:
+	return ret;
+}
+
+/*
+ * Charge memory cgroup mem and check if it is over its limit. If so, reclaim
+ * from mem.
+ */
+static int mem_cgroup_charge_and_reclaim(struct mem_cgroup *mem, gfp_t gfp_mask)
+{
+	int ret = 0;
+	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct res_counter *fail_res;
+	struct mem_cgroup *mem_over_limit;
+
+	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
+		if (!(gfp_mask & __GFP_WAIT))
+			goto out;
+
+		/*
+		 * Is one of our ancestors over their limit?
+		 */
+		if (fail_res)
+			mem_over_limit = mem_cgroup_from_res_counter(fail_res);
+		else
+			mem_over_limit = mem;
+
+		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask);
+
+		if (!nr_retries--) {
+			mem_cgroup_out_of_memory(mem, gfp_mask);
+			goto out;
+		}
+	}
+out:
+	return ret;
+}
 
 /**
  * mem_cgroup_try_charge - get charge of PAGE_SIZE.
@@ -484,8 +608,7 @@ int mem_cgroup_try_charge(struct mm_stru
 			gfp_t gfp_mask, struct mem_cgroup **memcg)
 {
 	struct mem_cgroup *mem;
-	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct res_counter *fail_res;
+
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
@@ -510,29 +633,9 @@ int mem_cgroup_try_charge(struct mm_stru
 		css_get(&mem->css);
 	}
 
+	if (mem_cgroup_charge_and_reclaim(mem, gfp_mask))
+		goto nomem;
 
-	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
-		if (!(gfp_mask & __GFP_WAIT))
-			goto nomem;
-
-		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
-			continue;
-
-		/*
-		 * try_to_free_mem_cgroup_pages() might not give us a full
-		 * picture of reclaim. Some pages are reclaimed and might be
-		 * moved to swap cache or just unmapped from the cgroup.
-		 * Check the limit again to see if the reclaim reduced the
-		 * current usage of the cgroup before giving up
-		 */
-		if (res_counter_check_under_limit(&mem->res))
-			continue;
-
-		if (!nr_retries--) {
-			mem_cgroup_out_of_memory(mem, gfp_mask);
-			goto nomem;
-		}
-	}
 	return 0;
 nomem:
 	css_put(&mem->css);
@@ -1195,6 +1298,8 @@ mem_cgroup_create(struct cgroup_subsys *
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
 			goto free_out;
 
+	mem->last_scanned_child = NULL;
+
 	return &mem->css;
 free_out:
 	for_each_node_state(node, N_POSSIBLE)
_
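
mem_cgroup_from_res_counter() above is the standard container_of() pattern;
a minimal standalone illustration with simplified stand-in types (not the
kernel definitions) is:

	#include <stddef.h>	/* offsetof */

	struct res_counter { unsigned long usage, limit; };
	struct mem_cgroup { int id; struct res_counter res; };

	/* recover a pointer to the enclosing object from a member pointer */
	#define container_of(ptr, type, member) \
		((type *)((char *)(ptr) - offsetof(type, member)))

	static struct mem_cgroup *from_res(struct res_counter *cnt)
	{
		return container_of(cnt, struct mem_cgroup, res);
	}

This is how the fail_res pointer handed back by res_counter_charge() is
mapped to the mem_cgroup that reclaim should target.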

-- 
	Balbir


* [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector
  2008-11-01 18:48 [mm][PATCH 0/4] Memory cgroup hierarchy introduction Balbir Singh
                   ` (2 preceding siblings ...)
  2008-11-01 18:48 ` [mm] [PATCH 3/4] Memory cgroup hierarchical reclaim Balbir Singh
@ 2008-11-01 18:49 ` Balbir Singh
  2008-11-02  5:38   ` KAMEZAWA Hiroyuki
  2008-11-04  0:15 ` [mm][PATCH 0/4] Memory cgroup hierarchy introduction KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2008-11-01 18:49 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki


Don't enable hierarchy support by default. This patch introduces
a features element that can be set to enable the nested hierarchy
feature. This feature can only be enabled when there is just one cgroup
(the root cgroup).

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 mm/memcontrol.c |   38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff -puN mm/memcontrol.c~memcg-add-hierarchy-selector mm/memcontrol.c
--- linux-2.6.28-rc2/mm/memcontrol.c~memcg-add-hierarchy-selector	2008-11-02 00:15:00.000000000 +0530
+++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-02 00:15:00.000000000 +0530
@@ -40,6 +40,9 @@
 struct cgroup_subsys mem_cgroup_subsys __read_mostly;
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
+static unsigned long mem_cgroup_features;
+#define MEM_CGROUP_FEAT_HIERARCHY	0x1
+
 /*
  * Statistics for memory cgroup.
  */
@@ -1080,6 +1083,31 @@ out:
 	return ret;
 }
 
+static u64 mem_cgroup_feature_read(struct cgroup *cont, struct cftype *cft)
+{
+	return mem_cgroup_features;
+}
+
+static int mem_cgroup_feature_write(struct cgroup *cont, struct cftype *cft,
+			    		u64 val)
+{
+	int retval = 0;
+	struct cgroup *cgroup = init_mem_cgroup.css.cgroup;
+
+	if (val & MEM_CGROUP_FEAT_HIERARCHY) {
+		if (list_empty(&cgroup->children))
+			mem_cgroup_features |= MEM_CGROUP_FEAT_HIERARCHY;
+		else
+			retval = -EBUSY;
+	} else {
+		if (list_empty(&cgroup->children))
+			mem_cgroup_features &= ~MEM_CGROUP_FEAT_HIERARCHY;
+		else
+			retval = -EBUSY;
+	}
+	return retval;
+}
+
 static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 {
 	return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
@@ -1214,6 +1242,11 @@ static struct cftype mem_cgroup_files[] 
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
+	{
+		.name = "features",
+		.write_u64 = mem_cgroup_feature_write,
+		.read_u64 = mem_cgroup_feature_read,
+	},
 };
 
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1292,7 +1325,10 @@ mem_cgroup_create(struct cgroup_subsys *
 			return ERR_PTR(-ENOMEM);
 	}
 
-	res_counter_init(&mem->res, parent ? &parent->res : NULL);
+	if ((mem_cgroup_features & MEM_CGROUP_FEAT_HIERARCHY) && parent)
+		res_counter_init(&mem->res, &parent->res);
+	else
+		res_counter_init(&mem->res, NULL);
 
 	for_each_node_state(node, N_POSSIBLE)
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
_
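
The feature test in mem_cgroup_create() could also be wrapped in a helper;
a hypothetical sketch (the helper name is not part of this patch):

	static inline bool mem_cgroup_hierarchy_enabled(void)
	{
		return mem_cgroup_features & MEM_CGROUP_FEAT_HIERARCHY;
	}

	/* in mem_cgroup_create(): */
	res_counter_init(&mem->res,
			 (mem_cgroup_hierarchy_enabled() && parent) ?
			 &parent->res : NULL);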

-- 
	Balbir


* Re: [mm] [PATCH 3/4] Memory cgroup hierarchical reclaim
  2008-11-01 18:48 ` [mm] [PATCH 3/4] Memory cgroup hierarchical reclaim Balbir Singh
@ 2008-11-02  5:37   ` KAMEZAWA Hiroyuki
  2008-11-02  5:44     ` Balbir Singh
  0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-02  5:37 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Sun, 02 Nov 2008 00:18:49 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> 
> This patch introduces hierarchical reclaim. When an ancestor goes over its
> limit, the charging routine points to the parent that is above its limit.
> The reclaim process then starts from the last scanned child of the ancestor
> and reclaims until the ancestor goes below its limit.
> 
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
> 
>  mm/memcontrol.c |  153 +++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 129 insertions(+), 24 deletions(-)
> 
> diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
> --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim	2008-11-02 00:14:59.000000000 +0530
> +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-02 00:14:59.000000000 +0530
> @@ -132,6 +132,11 @@ struct mem_cgroup {
>  	 * statistics.
>  	 */
>  	struct mem_cgroup_stat stat;
> +	/*
> +	 * While reclaiming in a hierarchy, we cache the last child we
> +	 * reclaimed from.
> +	 */
> +	struct mem_cgroup *last_scanned_child;
>  };
>  static struct mem_cgroup init_mem_cgroup;
>  
> @@ -467,6 +472,125 @@ unsigned long mem_cgroup_isolate_pages(u
>  	return nr_taken;
>  }
>  
> +static struct mem_cgroup *
> +mem_cgroup_from_res_counter(struct res_counter *counter)
> +{
> +	return container_of(counter, struct mem_cgroup, res);
> +}
> +
> +/*
> + * Dance down the hierarchy if needed to reclaim memory. We remember the
> + * last child we reclaimed from, so that we don't end up penalizing
> + * one child extensively based on its position in the children list
> + */
> +static int
> +mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem, gfp_t gfp_mask)
> +{
> +	struct cgroup *cg, *cg_current, *cgroup;
> +	struct mem_cgroup *mem_child;
> +	int ret = 0;
> +
> +	if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> +		return -ENOMEM;
> +
> +	/*
> +	 * try_to_free_mem_cgroup_pages() might not give us a full
> +	 * picture of reclaim. Some pages are reclaimed and might be
> +	 * moved to swap cache or just unmapped from the cgroup.
> +	 * Check the limit again to see if the reclaim reduced the
> +	 * current usage of the cgroup before giving up
> +	 */
> +	if (res_counter_check_under_limit(&mem->res))
> +		return 0;
> +
> +	/*
> +	 * Scan all children under the mem_cgroup mem
> +	 */
> +	if (!mem->last_scanned_child)
> +		cgroup = list_first_entry(&mem->css.cgroup->children,
> +				struct cgroup, sibling);
> +	else
> +		cgroup = mem->last_scanned_child->css.cgroup;
> +
> +	cg_current = cgroup;
> +
> +	/*
> +	 * We iterate twice. This is in part a fundamental list issue:
> +	 * the elements are inserted using list_add() and hence the list
> +	 * behaves like a stack, so list_for_each_entry_safe_from() stops
> +	 * after seeing the first child. The two loops let us work
> +	 * independently of the insertion order and give us a full pass
> +	 * at scanning all list entries for reclaim.
> +	 */
> +	list_for_each_entry_safe_from(cgroup, cg, &cg_current->parent->children,
> +						 sibling) {
> +		mem_child = mem_cgroup_from_cont(cgroup);
> +
> +		/*
> +		 * Move beyond last scanned child
> +		 */
> +		if (mem_child == mem->last_scanned_child)
> +			continue;
> +
> +		ret = try_to_free_mem_cgroup_pages(mem_child, gfp_mask);
> +		mem->last_scanned_child = mem_child;
> +
> +		if (res_counter_check_under_limit(&mem->res)) {
> +			ret = 0;
> +			goto done;
> +		}
> +	}

Is this safe against cgroup create/remove? Is cgroup_mutex held?

Thanks,
-Kame



* Re: [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector
  2008-11-01 18:49 ` [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector Balbir Singh
@ 2008-11-02  5:38   ` KAMEZAWA Hiroyuki
  2008-11-02  6:03     ` Balbir Singh
  0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-02  5:38 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Sun, 02 Nov 2008 00:19:02 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> 
> Don't enable hierarchy support by default. This patch introduces
> a features element that can be set to enable the nested hierarchy
> feature. This feature can only be enabled when there is just one cgroup
> (the root cgroup).
> 
Why is the flag for the whole system?
Is flag-per-subtree of no use?

Thanks,
-Kame

> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
> 
>  mm/memcontrol.c |   38 +++++++++++++++++++++++++++++++++++++-
>  1 file changed, 37 insertions(+), 1 deletion(-)
> 
> diff -puN mm/memcontrol.c~memcg-add-hierarchy-selector mm/memcontrol.c
> --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-add-hierarchy-selector	2008-11-02 00:15:00.000000000 +0530
> +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-02 00:15:00.000000000 +0530
> @@ -40,6 +40,9 @@
>  struct cgroup_subsys mem_cgroup_subsys __read_mostly;
>  #define MEM_CGROUP_RECLAIM_RETRIES	5
>  
> +static unsigned long mem_cgroup_features;
> +#define MEM_CGROUP_FEAT_HIERARCHY	0x1
> +
>  /*
>   * Statistics for memory cgroup.
>   */
> @@ -1080,6 +1083,31 @@ out:
>  	return ret;
>  }
>  
> +static u64 mem_cgroup_feature_read(struct cgroup *cont, struct cftype *cft)
> +{
> +	return mem_cgroup_features;
> +}
> +
> +static int mem_cgroup_feature_write(struct cgroup *cont, struct cftype *cft,
> +			    		u64 val)
> +{
> +	int retval = 0;
> +	struct cgroup *cgroup = init_mem_cgroup.css.cgroup;
> +
> +	if (val & MEM_CGROUP_FEAT_HIERARCHY) {
> +		if (list_empty(&cgroup->children))
> +			mem_cgroup_features |= MEM_CGROUP_FEAT_HIERARCHY;
> +		else
> +			retval = -EBUSY;
> +	} else {
> +		if (list_empty(&cgroup->children))
> +			mem_cgroup_features &= ~MEM_CGROUP_FEAT_HIERARCHY;
> +		else
> +			retval = -EBUSY;
> +	}
> +	return retval;
> +}
> +
>  static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
>  {
>  	return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
> @@ -1214,6 +1242,11 @@ static struct cftype mem_cgroup_files[] 
>  		.name = "stat",
>  		.read_map = mem_control_stat_show,
>  	},
> +	{
> +		.name = "features",
> +		.write_u64 = mem_cgroup_feature_write,
> +		.read_u64 = mem_cgroup_feature_read,
> +	},
>  };
>  
>  static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> @@ -1292,7 +1325,10 @@ mem_cgroup_create(struct cgroup_subsys *
>  			return ERR_PTR(-ENOMEM);
>  	}
>  
> -	res_counter_init(&mem->res, parent ? &parent->res : NULL);
> +	if ((mem_cgroup_features & MEM_CGROUP_FEAT_HIERARCHY) && parent)
> +		res_counter_init(&mem->res, &parent->res);
> +	else
> +		res_counter_init(&mem->res, NULL);
>  
>  	for_each_node_state(node, N_POSSIBLE)
>  		if (alloc_mem_cgroup_per_zone_info(mem, node))
> _
> 
> -- 
> 	Balbir
> 



* Re: [mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy
  2008-11-01 18:48 ` [mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy Balbir Singh
@ 2008-11-02  5:42   ` KAMEZAWA Hiroyuki
  2008-11-02  5:49     ` Balbir Singh
  0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-02  5:42 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Sun, 02 Nov 2008 00:18:37 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> 
> Add support for building hierarchies in resource counters. Cgroups allow us
> to build a deep hierarchy, but we currently don't link the resource counters
> belonging to the memory controller control groups, which are linked in the
> cgroup hierarchy. This patch provides the infrastructure for resource
> counters that have the same hierarchy as their cgroup counterparts.
>
> This set of patches is based on the resource counter hierarchy patches
> posted by Pavel Emelianov.
>
> NOTE: Building hierarchies is expensive; deeper hierarchies imply charging
> all the way up to the root. Since hierarchies are known to be expensive,
> the user needs to be careful and aware of the trade-offs before creating
> very deep ones.
> 
...isn't it better to add "root_lock" to res_counter rather than taking
all levels of lock one by one ?

 spin_lock(&res_counter->hierarchy_root->lock);
 do all charge/uncharge to hierarchy
 spin_unlock(&res_counter->hierarchy_root->lock);

Hmm ?

Thanks,
-Kame
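
For concreteness, the suggestion above might look roughly like the following
sketch (hierarchy_root is the field named in the suggestion, which
res_counter_init() would have to set up; this is an untested illustration,
not part of the patch):

	int res_counter_charge(struct res_counter *counter, unsigned long val,
			       struct res_counter **limit_fail_at)
	{
		struct res_counter *c, *u;
		unsigned long flags;
		int ret = 0;

		*limit_fail_at = NULL;
		/* one root lock serializes charges to the whole hierarchy */
		spin_lock_irqsave(&counter->hierarchy_root->lock, flags);
		for (c = counter; c != NULL; c = c->parent) {
			ret = res_counter_charge_locked(c, val);
			if (ret < 0) {
				*limit_fail_at = c;
				/* roll back the counters already charged */
				for (u = counter; u != c; u = u->parent)
					res_counter_uncharge_locked(u, val);
				break;
			}
		}
		spin_unlock_irqrestore(&counter->hierarchy_root->lock, flags);
		return ret;
	}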


> 
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
> 
>  include/linux/res_counter.h |    8 ++++++--
>  kernel/res_counter.c        |   42 ++++++++++++++++++++++++++++++++++--------
>  mm/memcontrol.c             |    9 ++++++---
>  3 files changed, 46 insertions(+), 13 deletions(-)
> 
> diff -puN include/linux/res_counter.h~resource-counters-hierarchy-support include/linux/res_counter.h
> --- linux-2.6.28-rc2/include/linux/res_counter.h~resource-counters-hierarchy-support	2008-11-02 00:14:58.000000000 +0530
> +++ linux-2.6.28-rc2-balbir/include/linux/res_counter.h	2008-11-02 00:14:58.000000000 +0530
> @@ -43,6 +43,10 @@ struct res_counter {
>  	 * the routines below consider this to be IRQ-safe
>  	 */
>  	spinlock_t lock;
> +	/*
> +	 * Parent counter, used for hierarchical resource accounting
> +	 */
> +	struct res_counter *parent;
>  };
>  
>  /**
> @@ -87,7 +91,7 @@ enum {
>   * helpers for accounting
>   */
>  
> -void res_counter_init(struct res_counter *counter);
> +void res_counter_init(struct res_counter *counter, struct res_counter *parent);
>  
>  /*
>   * charge - try to consume more resource.
> @@ -103,7 +107,7 @@ void res_counter_init(struct res_counter
>  int __must_check res_counter_charge_locked(struct res_counter *counter,
>  		unsigned long val);
>  int __must_check res_counter_charge(struct res_counter *counter,
> -		unsigned long val);
> +		unsigned long val, struct res_counter **limit_fail_at);
>  
>  /*
>   * uncharge - tell that some portion of the resource is released
> diff -puN kernel/res_counter.c~resource-counters-hierarchy-support kernel/res_counter.c
> --- linux-2.6.28-rc2/kernel/res_counter.c~resource-counters-hierarchy-support	2008-11-02 00:14:58.000000000 +0530
> +++ linux-2.6.28-rc2-balbir/kernel/res_counter.c	2008-11-02 00:14:58.000000000 +0530
> @@ -15,10 +15,11 @@
>  #include <linux/uaccess.h>
>  #include <linux/mm.h>
>  
> -void res_counter_init(struct res_counter *counter)
> +void res_counter_init(struct res_counter *counter, struct res_counter *parent)
>  {
>  	spin_lock_init(&counter->lock);
>  	counter->limit = (unsigned long long)LLONG_MAX;
> +	counter->parent = parent;
>  }
>  
>  int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> @@ -34,14 +35,34 @@ int res_counter_charge_locked(struct res
>  	return 0;
>  }
>  
> -int res_counter_charge(struct res_counter *counter, unsigned long val)
> +int res_counter_charge(struct res_counter *counter, unsigned long val,
> +			struct res_counter **limit_fail_at)
>  {
>  	int ret;
>  	unsigned long flags;
> +	struct res_counter *c, *u;
>  
> -	spin_lock_irqsave(&counter->lock, flags);
> -	ret = res_counter_charge_locked(counter, val);
> -	spin_unlock_irqrestore(&counter->lock, flags);
> +	*limit_fail_at = NULL;
> +	local_irq_save(flags);
> +	for (c = counter; c != NULL; c = c->parent) {
> +		spin_lock(&c->lock);
> +		ret = res_counter_charge_locked(c, val);
> +		spin_unlock(&c->lock);
> +		if (ret < 0) {
> +			*limit_fail_at = c;
> +			goto undo;
> +		}
> +	}
> +	ret = 0;
> +	goto done;
> +undo:
> +	for (u = counter; u != c; u = u->parent) {
> +		spin_lock(&u->lock);
> +		res_counter_uncharge_locked(u, val);
> +		spin_unlock(&u->lock);
> +	}
> +done:
> +	local_irq_restore(flags);
>  	return ret;
>  }
>  
> @@ -56,10 +77,15 @@ void res_counter_uncharge_locked(struct 
>  void res_counter_uncharge(struct res_counter *counter, unsigned long val)
>  {
>  	unsigned long flags;
> +	struct res_counter *c;
>  
> -	spin_lock_irqsave(&counter->lock, flags);
> -	res_counter_uncharge_locked(counter, val);
> -	spin_unlock_irqrestore(&counter->lock, flags);
> +	local_irq_save(flags);
> +	for (c = counter; c != NULL; c = c->parent) {
> +		spin_lock(&c->lock);
> +		res_counter_uncharge_locked(c, val);
> +		spin_unlock(&c->lock);
> +	}
> +	local_irq_restore(flags);
>  }
>  
>  
> diff -puN mm/memcontrol.c~resource-counters-hierarchy-support mm/memcontrol.c
> --- linux-2.6.28-rc2/mm/memcontrol.c~resource-counters-hierarchy-support	2008-11-02 00:14:58.000000000 +0530
> +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-02 00:14:58.000000000 +0530
> @@ -485,6 +485,7 @@ int mem_cgroup_try_charge(struct mm_stru
>  {
>  	struct mem_cgroup *mem;
>  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	struct res_counter *fail_res;
>  	/*
>  	 * We always charge the cgroup the mm_struct belongs to.
>  	 * The mm_struct's mem_cgroup changes on task migration if the
> @@ -510,7 +511,7 @@ int mem_cgroup_try_charge(struct mm_stru
>  	}
>  
>  
> -	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
> +	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
>  		if (!(gfp_mask & __GFP_WAIT))
>  			goto nomem;
>  
> @@ -1175,18 +1176,20 @@ static void mem_cgroup_free(struct mem_c
>  static struct cgroup_subsys_state *
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> -	struct mem_cgroup *mem;
> +	struct mem_cgroup *mem, *parent;
>  	int node;
>  
>  	if (unlikely((cont->parent) == NULL)) {
>  		mem = &init_mem_cgroup;
> +		parent = NULL;
>  	} else {
>  		mem = mem_cgroup_alloc();
> +		parent = mem_cgroup_from_cont(cont->parent);
>  		if (!mem)
>  			return ERR_PTR(-ENOMEM);
>  	}
>  
> -	res_counter_init(&mem->res);
> +	res_counter_init(&mem->res, parent ? &parent->res : NULL);
>  
>  	for_each_node_state(node, N_POSSIBLE)
>  		if (alloc_mem_cgroup_per_zone_info(mem, node))
> _
> 
> -- 
> 	Balbir
> 



* Re: [mm] [PATCH 3/4] Memory cgroup hierarchical reclaim
  2008-11-02  5:37   ` KAMEZAWA Hiroyuki
@ 2008-11-02  5:44     ` Balbir Singh
  2008-11-04  2:17       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2008-11-02  5:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Sun, 02 Nov 2008 00:18:49 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> This patch introduces hierarchical reclaim. When an ancestor goes over its
>> limit, the charging routine points to the parent that is above its limit.
>> The reclaim process then starts from the last scanned child of the ancestor
>> and reclaims until the ancestor goes below its limit.
>>
>> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>> ---
>>
>>  mm/memcontrol.c |  153 +++++++++++++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 129 insertions(+), 24 deletions(-)
>>
>> diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
>> --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim	2008-11-02 00:14:59.000000000 +0530
>> +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-02 00:14:59.000000000 +0530
>> @@ -132,6 +132,11 @@ struct mem_cgroup {
>>  	 * statistics.
>>  	 */
>>  	struct mem_cgroup_stat stat;
>> +	/*
>> +	 * While reclaiming in a hierarchy, we cache the last child we
>> +	 * reclaimed from.
>> +	 */
>> +	struct mem_cgroup *last_scanned_child;
>>  };
>>  static struct mem_cgroup init_mem_cgroup;
>>  
>> @@ -467,6 +472,125 @@ unsigned long mem_cgroup_isolate_pages(u
>>  	return nr_taken;
>>  }
>>  
>> +static struct mem_cgroup *
>> +mem_cgroup_from_res_counter(struct res_counter *counter)
>> +{
>> +	return container_of(counter, struct mem_cgroup, res);
>> +}
>> +
>> +/*
>> + * Dance down the hierarchy if needed to reclaim memory. We remember the
>> + * last child we reclaimed from, so that we don't end up penalizing
>> + * one child extensively based on its position in the children list
>> + */
>> +static int
>> +mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem, gfp_t gfp_mask)
>> +{
>> +	struct cgroup *cg, *cg_current, *cgroup;
>> +	struct mem_cgroup *mem_child;
>> +	int ret = 0;
>> +
>> +	if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
>> +		return -ENOMEM;
>> +
>> +	/*
>> +	 * try_to_free_mem_cgroup_pages() might not give us a full
>> +	 * picture of reclaim. Some pages are reclaimed and might be
>> +	 * moved to swap cache or just unmapped from the cgroup.
>> +	 * Check the limit again to see if the reclaim reduced the
>> +	 * current usage of the cgroup before giving up
>> +	 */
>> +	if (res_counter_check_under_limit(&mem->res))
>> +		return 0;
>> +
>> +	/*
>> +	 * Scan all children under the mem_cgroup mem
>> +	 */
>> +	if (!mem->last_scanned_child)
>> +		cgroup = list_first_entry(&mem->css.cgroup->children,
>> +				struct cgroup, sibling);
>> +	else
>> +		cgroup = mem->last_scanned_child->css.cgroup;
>> +
>> +	cg_current = cgroup;
>> +
>> +	/*
>> +	 * We iterate twice. This is in part a fundamental list issue:
>> +	 * the elements are inserted using list_add() and hence the list
>> +	 * behaves like a stack, so list_for_each_entry_safe_from() stops
>> +	 * after seeing the first child. The two loops let us work
>> +	 * independently of the insertion order and give us a full pass
>> +	 * at scanning all list entries for reclaim.
>> +	 */
>> +	list_for_each_entry_safe_from(cgroup, cg, &cg_current->parent->children,
>> +						 sibling) {
>> +		mem_child = mem_cgroup_from_cont(cgroup);
>> +
>> +		/*
>> +		 * Move beyond last scanned child
>> +		 */
>> +		if (mem_child == mem->last_scanned_child)
>> +			continue;
>> +
>> +		ret = try_to_free_mem_cgroup_pages(mem_child, gfp_mask);
>> +		mem->last_scanned_child = mem_child;
>> +
>> +		if (res_counter_check_under_limit(&mem->res)) {
>> +			ret = 0;
>> +			goto done;
>> +		}
>> +	}
> 
> Is this safe against cgroup create/remove? Is cgroup_mutex held?

Yes, I thought about it, but with this setup, each parent will be busy since it
has children and hence cannot be removed. The leaf child itself has tasks, so
it cannot be removed. IOW, it should be safe against removal.

For creation we might need to hold the mutex, I'll review that part of the code.

Thanks for the comments,

-- 
	Balbir


* Re: [mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy
  2008-11-02  5:42   ` KAMEZAWA Hiroyuki
@ 2008-11-02  5:49     ` Balbir Singh
  2008-11-02  5:56       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2008-11-02  5:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Sun, 02 Nov 2008 00:18:37 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> Add support for building hierarchies in resource counters. Cgroups allow us
>> to build a deep hierarchy, but we currently don't link the resource counters
>> belonging to the memory controller control groups, which are linked in the
>> cgroup hierarchy. This patch provides the infrastructure for resource
>> counters that have the same hierarchy as their cgroup counterparts.
>>
>> This set of patches is based on the resource counter hierarchy patches
>> posted by Pavel Emelianov.
>>
>> NOTE: Building hierarchies is expensive; deeper hierarchies imply charging
>> all the way up to the root. Since hierarchies are known to be expensive,
>> the user needs to be careful and aware of the trade-offs before creating
>> very deep ones.
>>
> ...isn't it better to add "root_lock" to res_counter rather than taking
> all levels of lock one by one ?
> 
>  spin_lock(&res_counter->hierarchy_root->lock);
>  do all charge/uncharge to hierarchy
>  spin_unlock(&res_counter->hierarchy_root->lock);
> 
> Hmm ?
> 

Good thought process, but that affects and adds code complexity for the case
when hierarchy is enabled/disabled. It is also inefficient, since all charges
will now contend on the root lock; in the current approach it is step by step,
and contention only occurs on common parts of the hierarchy (root being the
best case).

Thanks for the comments,

-- 
	Balbir


* Re: [mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy
  2008-11-02  5:49     ` Balbir Singh
@ 2008-11-02  5:56       ` KAMEZAWA Hiroyuki
  2008-11-02 11:46         ` Balbir Singh
  0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-02  5:56 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Sun, 02 Nov 2008 11:19:38 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Sun, 02 Nov 2008 00:18:37 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> >> Add support for building hierarchies in resource counters. Cgroups allow us
> >> to build a deep hierarchy, but we currently don't link the resource counters
> >> belonging to the memory controller control groups, which are linked in the
> >> cgroup hierarchy. This patch provides the infrastructure for resource
> >> counters that have the same hierarchy as their cgroup counterparts.
> >>
> >> This set of patches is based on the resource counter hierarchy patches
> >> posted by Pavel Emelianov.
> >>
> >> NOTE: Building hierarchies is expensive; deeper hierarchies imply charging
> >> all the way up to the root. Since hierarchies are known to be expensive,
> >> the user needs to be careful and aware of the trade-offs before creating
> >> very deep ones.
> >>
> > ...isn't it better to add "root_lock" to res_counter rather than taking
> > all levels of lock one by one ?
> > 
> >  spin_lock(&res_counter->hierarchy_root->lock);
> >  do all charge/uncharge to hierarchy
> >  spin_unlock(&res_counter->hierarchy_root->lock);
> > 
> > Hmm ?
> > 
> 
> Good thought process, but that affects and adds code complexity for the case
> when hierarchy is enabled/disabled. It is also inefficient, since all charges
> will now contend on the root lock; in the current approach it is step by step,
> and contention only occurs on common parts of the hierarchy (root being the
> best case).
> 

The above code's contention level is no different from the "only root, no
children" case; just the work inside the lock is heavier.

Thanks,
-Kame







* Re: [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector
  2008-11-02  5:38   ` KAMEZAWA Hiroyuki
@ 2008-11-02  6:03     ` Balbir Singh
  2008-11-02  6:24       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2008-11-02  6:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Sun, 02 Nov 2008 00:19:02 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> Don't enable hierarchy support by default. This patch introduces
>> a features element that can be set to enable the nested hierarchy
>> feature. This feature can only be enabled when there is just one cgroup
>> (the root cgroup).
>>
> Why is the flag for the whole system?
> Is flag-per-subtree of no use?

Flag per subtree might not be useful, since we charge all the way up to the root,
which means that all subtrees have the root cgroup and hence the global flag.

Thanks for the comments,

-- 
	Balbir


* Re: [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector
  2008-11-02  6:03     ` Balbir Singh
@ 2008-11-02  6:24       ` KAMEZAWA Hiroyuki
  2008-11-02 15:52         ` Balbir Singh
  2008-11-06  6:56         ` Balbir Singh
  0 siblings, 2 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-02  6:24 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Sun, 02 Nov 2008 11:33:51 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Sun, 02 Nov 2008 00:19:02 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> >> Don't enable hierarchy support by default. This patch introduces
> >> a features element that can be set to enable the nested hierarchy
> >> feature. This feature can only be enabled when there is just one cgroup
> >> (the root cgroup).
> >>
> > Why is the flag for the whole system?
> > Is flag-per-subtree of no use?
> 
> Flag per subtree might not be useful, since we charge all the way up to the root,
Ah, what I said is "How about enabling/disabling hierarchy support per subtree?"
Sorry for bad text.

Like this:
  - you can turn the hierarchy mode of a cgroup on/off when...
    * it doesn't have any tasks under it && it doesn't have any child cgroup.

Thanks,
-Kame



* Re: [mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy
  2008-11-02  5:56       ` KAMEZAWA Hiroyuki
@ 2008-11-02 11:46         ` Balbir Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Balbir Singh @ 2008-11-02 11:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Sun, 02 Nov 2008 11:19:38 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> KAMEZAWA Hiroyuki wrote:
>>> On Sun, 02 Nov 2008 00:18:37 +0530
>>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>>
>>>> Add support for building hierarchies in resource counters. Cgroups allow us
>>>> to build a deep hierarchy, but we currently don't link the resource counters
>>>> belonging to the memory controller control groups, which are linked in the
>>>> cgroup hierarchy. This patch provides the infrastructure for resource
>>>> counters that have the same hierarchy as their cgroup counterparts.
>>>>
>>>> This set of patches is based on the resource counter hierarchy patches
>>>> posted by Pavel Emelianov.
>>>>
>>>> NOTE: Building hierarchies is expensive; deeper hierarchies imply charging
>>>> all the way up to the root. Since hierarchies are known to be expensive,
>>>> the user needs to be careful and aware of the trade-offs before creating
>>>> very deep ones.
>>>>
>>> ...isn't it better to add "root_lock" to res_counter rather than taking
>>> all levels of lock one by one ?
>>>
>>>  spin_lock(&res_counter->hierarchy_root->lock);
>>>  do all charge/uncharge to hierarchy
>>>  spin_unlock(&res_counter->hierarchy_root->lock);
>>>
>>> Hmm ?
>>>
>> Good thought process, but that affects and adds code complexity for the case
>> when hierarchy is enabled/disabled. It is also inefficient, since all charges
>> will now contend on the root lock; in the current approach it is step by step,
>> and contention only occurs on common parts of the hierarchy (root being the
>> best case).
>>
> 
> The above code's contention level is no different from the "only root, no
> children" case; just the work inside the lock is heavier.

Yes, correct! I think the approach in the patches is better.

-- 
	Balbir


* Re: [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector
  2008-11-02  6:24       ` KAMEZAWA Hiroyuki
@ 2008-11-02 15:52         ` Balbir Singh
  2008-11-04  6:37           ` Paul Menage
  2008-11-06  6:56         ` Balbir Singh
  1 sibling, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2008-11-02 15:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Sun, 02 Nov 2008 11:33:51 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> KAMEZAWA Hiroyuki wrote:
>>> On Sun, 02 Nov 2008 00:19:02 +0530
>>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>>
>>>> Don't enable hierarchy support by default. This patch introduces
>>>> a features element that can be set to enable the nested hierarchy
>>>> feature. This feature can only be enabled when there is just one cgroup
>>>> (the root cgroup).
>>>>
>>> Why is the flag for the whole system?
>>> Is flag-per-subtree of no use?
>> Flag per subtree might not be useful, since we charge all the way up to the root,
> Ah, what I said is "How about enabling/disabling hierarchy support per subtree?"
> Sorry for bad text.
> 
> Like this:
>   - you can turn the hierarchy mode of a cgroup on/off when...
>     * it doesn't have any tasks under it && it doesn't have any child cgroup.

That should not be hard, but having it per-subtree sounds a little complex to
exploit from the end-user perspective, and also from a symmetry perspective
(the CPU cgroup controller provides hierarchy control for the full hierarchy).

-- 
	Balbir


* Re: [mm][PATCH 0/4] Memory cgroup hierarchy introduction
  2008-11-01 18:48 [mm][PATCH 0/4] Memory cgroup hierarchy introduction Balbir Singh
                   ` (3 preceding siblings ...)
  2008-11-01 18:49 ` [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector Balbir Singh
@ 2008-11-04  0:15 ` KAMEZAWA Hiroyuki
  2008-11-05 13:51   ` Balbir Singh
  2008-11-04  9:21 ` [patch 1/2] memcg: hierarchy, yet another one KAMEZAWA Hiroyuki
  2008-11-04  9:25 ` KAMEZAWA Hiroyuki
  6 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-04  0:15 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Sun, 02 Nov 2008 00:18:12 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> This patchset follows several iterations of the memory controller hierarchy
> patches: the hardwall approach by Kamezawa-San [1] and version 1 of this
> patchset at [2].
>
> The current approach is based on [2] and has the following properties:
>
> 1. Hierarchies are very natural in a filesystem like the cgroup filesystem.
>    Multi-level tree hierarchies have long been supported by filesystems.
>    When the feature is turned on, we honor hierarchies such that the root
>    accounts for the resource usage of all children and limits can be set
>    at any point in the hierarchy. Any memory cgroup is limited by the
>    limits along the hierarchy. The total usage of all children of a node
>    cannot exceed the limit of the node.
> 2. The hierarchy feature is selectable and off by default.
> 3. Hierarchies are expensive and the trade-off is depth versus performance.
>    Hierarchies can also be completely turned off.
> 
> The patches are against 2.6.28-rc2-mm1 and were tested in a KVM instance
> with SMP and swap turned on.
> 

As a first impression, I think hierarchical LRU management is not good...
meaning it is not fair from the viewpoint of memory management.
I'd like to show some other possible implementation of
try_to_free_mem_cgroup_pages() if I can.

Anyway, I have to merge this with the mem+swap controller.

Thanks,
-Kame





* Re: [mm] [PATCH 3/4] Memory cgroup hierarchical reclaim
  2008-11-02  5:44     ` Balbir Singh
@ 2008-11-04  2:17       ` KAMEZAWA Hiroyuki
  2008-11-05 13:34         ` Balbir Singh
  0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-04  2:17 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Sun, 02 Nov 2008 11:14:48 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Sun, 02 Nov 2008 00:18:49 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> >> This patch introduces hierarchical reclaim. When an ancestor goes over its
> >> limit, the charging routine points to the parent that is above its limit.
> >> The reclaim process then starts from the last scanned child of the ancestor
> >> and reclaims until the ancestor goes below its limit.
> >>
> >> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> >> ---
> >>
> >>  mm/memcontrol.c |  153 +++++++++++++++++++++++++++++++++++++++++++++++---------
> >>  1 file changed, 129 insertions(+), 24 deletions(-)
> >>
> >> diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
> >> --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim	2008-11-02 00:14:59.000000000 +0530
> >> +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-02 00:14:59.000000000 +0530
> >> @@ -132,6 +132,11 @@ struct mem_cgroup {
> >>  	 * statistics.
> >>  	 */
> >>  	struct mem_cgroup_stat stat;
> >> +	/*
> >> +	 * While reclaiming in a hierarchy, we cache the last child we
> >> +	 * reclaimed from.
> >> +	 */
> >> +	struct mem_cgroup *last_scanned_child;
> >>  };
> >>  static struct mem_cgroup init_mem_cgroup;
> >>  
> >> @@ -467,6 +472,125 @@ unsigned long mem_cgroup_isolate_pages(u
> >>  	return nr_taken;
> >>  }
> >>  
> >> +static struct mem_cgroup *
> >> +mem_cgroup_from_res_counter(struct res_counter *counter)
> >> +{
> >> +	return container_of(counter, struct mem_cgroup, res);
> >> +}
> >> +
> >> +/*
> >> + * Dance down the hierarchy if needed to reclaim memory. We remember the
> >> + * last child we reclaimed from, so that we don't end up penalizing
> >> + * one child extensively based on its position in the children list
> >> + */
> >> +static int
> >> +mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem, gfp_t gfp_mask)
> >> +{
> >> +	struct cgroup *cg, *cg_current, *cgroup;
> >> +	struct mem_cgroup *mem_child;
> >> +	int ret = 0;
> >> +
> >> +	if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> >> +		return -ENOMEM;
> >> +
> >> +	/*
> >> +	 * try_to_free_mem_cgroup_pages() might not give us a full
> >> +	 * picture of reclaim. Some pages are reclaimed and might be
> >> +	 * moved to swap cache or just unmapped from the cgroup.
> >> +	 * Check the limit again to see if the reclaim reduced the
> >> +	 * current usage of the cgroup before giving up
> >> +	 */
> >> +	if (res_counter_check_under_limit(&mem->res))
> >> +		return 0;
> >> +
> >> +	/*
> >> +	 * Scan all children under the mem_cgroup mem
> >> +	 */
> >> +	if (!mem->last_scanned_child)
> >> +		cgroup = list_first_entry(&mem->css.cgroup->children,
> >> +				struct cgroup, sibling);
> >> +	else
> >> +		cgroup = mem->last_scanned_child->css.cgroup;
> >> +
> >> +	cg_current = cgroup;
> >> +
> >> +	/*
> >> +	 * We iterate twice; one reason is a fundamental list issue: the
> >> +	 * elements are inserted using list_add, so the list behaves like
> >> +	 * a stack and list_for_each_entry_safe_from() stops after seeing
> >> +	 * the first child. The two loops let us work independently of the
> >> +	 * insertion order and give us a full pass at scanning all list
> >> +	 * entries for reclaim
> >> +	 */
> >> +	list_for_each_entry_safe_from(cgroup, cg, &cg_current->parent->children,
> >> +						 sibling) {
> >> +		mem_child = mem_cgroup_from_cont(cgroup);
> >> +
> >> +		/*
> >> +		 * Move beyond last scanned child
> >> +		 */
> >> +		if (mem_child == mem->last_scanned_child)
> >> +			continue;
> >> +
> >> +		ret = try_to_free_mem_cgroup_pages(mem_child, gfp_mask);
> >> +		mem->last_scanned_child = mem_child;
> >> +
> >> +		if (res_counter_check_under_limit(&mem->res)) {
> >> +			ret = 0;
> >> +			goto done;
> >> +		}
> >> +	}
> > 
> > Is this safe against cgroup create/remove? Is cgroup_mutex held?
> 
> Yes, I thought about it, but with the setup, each parent will be busy since they
> have children and hence cannot be removed. The leaf child itself has tasks, so
> it cannot be removed. IOW, it should be safe against removal.
> 
I'm sorry if I misunderstand something. Could you explain the following?

In the following tree,

    level-1
         -  level-2
                -  level-3
                       -  level-4
level-1's usage = level-1 + level-2 + level-3 + level-4
level-2's usage = level-2 + level-3 + level-4
level-3's usage = level-3 + level-4
level-4's usage = level-4

Assume that a task in level-2 hits its limit. It has to reclaim memory from
level-2, level-3, and level-4.

How can we guarantee that level-4 has a task in this case?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm] [PATCH 1/4] Memory cgroup hierarchy documentation
  2008-11-01 18:48 ` [mm] [PATCH 1/4] Memory cgroup hierarchy documentation Balbir Singh
@ 2008-11-04  6:25   ` Paul Menage
  2008-11-04  6:26     ` Paul Menage
  0 siblings, 1 reply; 34+ messages in thread
From: Paul Menage @ 2008-11-04  6:25 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Andrew Morton,
	KAMEZAWA Hiroyuki

On Sat, Nov 1, 2008 at 10:48 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> +
> +The memory controller by default disables the hierarchy feature. Support
> +can be enabled by writing 1 to memory.features file of the root cgroup
> +
> +# echo 1 > memory.features
> +
> +The feature can be disabled by
> +
> +# echo 0 > memory.features

That's not a very intuitive interface. Why not memory.use_hierarchy?
(I presume that you're trying to cram multiple bits into a single
file, but that seems a bit ugly).

Paul

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm] [PATCH 1/4] Memory cgroup hierarchy documentation
  2008-11-04  6:25   ` Paul Menage
@ 2008-11-04  6:26     ` Paul Menage
  2008-11-05 13:55       ` Balbir Singh
  0 siblings, 1 reply; 34+ messages in thread
From: Paul Menage @ 2008-11-04  6:26 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Andrew Morton,
	KAMEZAWA Hiroyuki

On Mon, Nov 3, 2008 at 10:25 PM, Paul Menage <menage@google.com> wrote:
>
> That's not a very intuitive interface. Why not memory.use_hierarchy?

Or for consistency with the existing cpuset.memory_pressure_enabled,
just memory.hierarchy_enabled ?

Paul

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector
  2008-11-02 15:52         ` Balbir Singh
@ 2008-11-04  6:37           ` Paul Menage
  2008-11-06  7:00             ` Balbir Singh
  0 siblings, 1 reply; 34+ messages in thread
From: Paul Menage @ 2008-11-04  6:37 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-mm, YAMAMOTO Takashi, lizf,
	linux-kernel, Nick Piggin, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Andrew Morton

On Sun, Nov 2, 2008 at 7:52 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> That should not be hard, but having it per-subtree sounds a little complex in
> terms of exploiting from the end-user perspective and from symmetry perspective
> (the CPU cgroup controller provides hierarchy control for the full hierarchy).
>

The difference is that the CPU controller works in terms of shares,
whereas memory works in terms of absolute memory size. So it pretty
much has to limit the hierarchy to a single tree. Also, I didn't think
that you could modify the shares for the root cgroup - what would that
mean if so?

With this patch set as it is now, the root cgroup's lock becomes a
global memory allocation/deallocation lock, which seems a bit painful.
Having a bunch of top-level cgroups each with their own independent
memory limits, and allowing sub cgroups of them to be constrained by
the parent's memory limit, seems more useful than a single hierarchy
connected at the root.

In what realistic circumstances do you actually want to limit the root
cgroup's memory usage?

Paul

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [patch 1/2] memcg: hierarchy, yet another one.
  2008-11-01 18:48 [mm][PATCH 0/4] Memory cgroup hierarchy introduction Balbir Singh
                   ` (4 preceding siblings ...)
  2008-11-04  0:15 ` [mm][PATCH 0/4] Memory cgroup hierarchy introduction KAMEZAWA Hiroyuki
@ 2008-11-04  9:21 ` KAMEZAWA Hiroyuki
  2008-11-04  9:25 ` KAMEZAWA Hiroyuki
  6 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-04  9:21 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

This patch is a toy for showing the idea;
what do you think?

[1/2] ... res counter part
[2/2] ... for memcg (mem+swap controller)

Based on the mem+swap controller + synchronized LRU + some fixes (not posted).
I'm just wondering how to support hierarchy and do not intend to improve this now.

[1/2] ... hierarchical res_counter
==
Hierarchy support for res_counter.

This patch adds the following interfaces to res_counter.
 - res_counter_init_hierarchy().
 - res_counter_charge_hierarchy().
 - res_counter_uncharge_hierarchy().
 - res_counter_check_under_limit_hierarchy().

With this, if the res_counter is properly initialized, a charge to a res_counter
will be propagated up to the root of the res_counter tree.

The root res_counter has its res_counter->parent pointer set to NULL.
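
As a rough usage sketch (illustrative only: the counter names below are
made up, and the call pattern simply mirrors the interfaces added above):

	struct res_counter parent_cnt, child_cnt;
	struct res_counter *failed;

	res_counter_init(&parent_cnt);		/* root: parent == NULL */
	res_counter_init_hierarchy(&child_cnt, &parent_cnt);

	/* a charge propagates from child_cnt up to the root */
	if (res_counter_charge_hierarchy(&child_cnt, PAGE_SIZE, &failed)) {
		/*
		 * "failed" points at the counter that hit its limit;
		 * counters charged below it were already rolled back.
		 */
	}

	/* uncharge walks the same parent chain */
	res_counter_uncharge_hierarchy(&child_cnt, PAGE_SIZE);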

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 include/linux/res_counter.h |   13 ++++++++
 kernel/res_counter.c        |   65 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 77 insertions(+), 1 deletion(-)

Index: mmotm-2.6.28-rc2+/include/linux/res_counter.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/res_counter.h
+++ mmotm-2.6.28-rc2+/include/linux/res_counter.h
@@ -39,6 +39,10 @@ struct res_counter {
 	 */
 	unsigned long long failcnt;
 	/*
+	 * The parent counter sharing the resource.
+	 */
+	struct res_counter *parent;
+	/*
 	 * the lock to protect all of the above.
 	 * the routines below consider this to be IRQ-safe
 	 */
@@ -88,6 +92,8 @@ enum {
  */
 
 void res_counter_init(struct res_counter *counter);
+void res_counter_init_hierarchy(struct res_counter *counter,
+				struct res_counter *parent);
 
 /*
  * charge - try to consume more resource.
@@ -105,6 +111,8 @@ int __must_check res_counter_charge_lock
 int __must_check res_counter_charge(struct res_counter *counter,
 		unsigned long val);
 
+int __must_check res_counter_charge_hierarchy(struct res_counter *counter,
+		      unsigned long val, struct res_counter **fail);
 /*
  * uncharge - tell that some portion of the resource is released
  *
@@ -118,6 +126,9 @@ int __must_check res_counter_charge(stru
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
 void res_counter_uncharge(struct res_counter *counter, unsigned long val);
 
+void res_counter_uncharge_hierarchy(struct res_counter *counter,
+		unsigned long val);
+
 static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 {
 	if (cnt->usage < cnt->limit)
@@ -126,6 +137,8 @@ static inline bool res_counter_limit_che
 	return false;
 }
 
+bool res_counter_check_under_limit_hierarchy(struct res_counter *cnt,
+					     struct res_counter **fail);
 /*
  * Helper function to detect if the cgroup is within it's limit or
  * not. It's currently called from cgroup_rss_prepare()
Index: mmotm-2.6.28-rc2+/kernel/res_counter.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/kernel/res_counter.c
+++ mmotm-2.6.28-rc2+/kernel/res_counter.c
@@ -15,10 +15,17 @@
 #include <linux/uaccess.h>
 #include <linux/mm.h>
 
-void res_counter_init(struct res_counter *counter)
+void res_counter_init_hierarchy(struct res_counter *counter,
+				struct res_counter *parent)
 {
 	spin_lock_init(&counter->lock);
 	counter->limit = (unsigned long long)LLONG_MAX;
+	counter->parent = parent;
+}
+
+void res_counter_init(struct res_counter *counter)
+{
+	res_counter_init_hierarchy(counter, NULL);
 }
 
 int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -45,6 +52,7 @@ int res_counter_charge(struct res_counte
 	return ret;
 }
 
+
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
 {
 	if (WARN_ON(counter->usage < val))
@@ -62,6 +70,61 @@ void res_counter_uncharge(struct res_cou
 	spin_unlock_irqrestore(&counter->lock, flags);
 }
 
+/**
+ * res_counter_charge_hierarchy - hierarchical resource charging.
+ * @counter: a res_counter to be charged.
+ * @val: value to be charged.
+ * @fail: a pointer to a *res_counter for returning where we failed.
+ *
+ * Charge "val" to the res_counter and all of its ancestors. If this fails, a
+ * pointer to the res_counter which failed to be charged is returned in "fail".
+ */
+int res_counter_charge_hierarchy(struct res_counter *counter,
+				 unsigned long val,
+				 struct res_counter **fail)
+{
+	struct res_counter *tmp, *undo;
+	int ret = 0;
+
+	for (tmp = counter; tmp != NULL; tmp = tmp->parent) {
+		ret = res_counter_charge(tmp, val);
+		if (ret)
+			break;
+	}
+	if (!ret)
+		return ret;
+
+	*fail = tmp;
+	for (undo = tmp, tmp = counter; tmp != undo; tmp = tmp->parent)
+		res_counter_uncharge(tmp, val);
+
+	return ret;
+}
+
+
+void res_counter_uncharge_hierarchy(struct res_counter *counter,
+				    unsigned long  val)
+{
+	struct res_counter *tmp;
+
+	for (tmp = counter; tmp != NULL; tmp = tmp->parent)
+		res_counter_uncharge(tmp, val);
+}
+
+bool res_counter_check_under_limit_hierarchy(struct res_counter *counter,
+					     struct res_counter **fail)
+{
+	struct res_counter *tmp;
+	for (tmp = counter; tmp != NULL; tmp = tmp->parent)
+		if (!res_counter_check_under_limit(tmp))
+			break;
+
+	if (!tmp)
+		return true;
+	*fail = tmp;
+	return false;
+}
+
 
 static inline unsigned long long *
 res_counter_member(struct res_counter *counter, int member)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [patch 1/2] memcg: hierarchy, yet another one.
  2008-11-01 18:48 [mm][PATCH 0/4] Memory cgroup hierarchy introduction Balbir Singh
                   ` (5 preceding siblings ...)
  2008-11-04  9:21 ` [patch 1/2] memcg: hierarchy, yet another one KAMEZAWA Hiroyuki
@ 2008-11-04  9:25 ` KAMEZAWA Hiroyuki
  6 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-04  9:25 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton


Support for hierarchical memory/mem+swap limit management.
The major differences are "shared LRU" and hierarchy-per-subtree.

may cause panic or deadlock ;)

=
This patch adds hierarchy support for memcg's LRU management.

It works like the following.

     /root
       |
       |-group_A  hierarchy = 1
       |   |- App_1
       |   |- App_2
       |
       |-group_B  hierarchy = 0.
           |- App_3
           |- App_4

 * group_A, App_1, and App_2 share an LRU, and each one has its own limit.
   If group_A hits its limit, memory is reclaimed from group_A, App_1, App_2.
   If App_1 hits its limit, memory is reclaimed from App_1.

 * group_B, App_3, and App_4 each have their own LRU and their own limit.
   If group_B hits its limit, memory is reclaimed from group_B itself.
   If App_3 hits its limit, memory is reclaimed from App_3.

 For the shared LRU, App_1 and App_2 use the LRU of group_A; they really share
 one LRU. group_B, App_3, and App_4 don't share anything and each has its own.

 For reclaiming memory from a shared LRU, this works as follows.
 Assume the following hierarchical tree.

   group_W
     |- group_X
          |- group_Y1
               |- group_Z1
               |- group_Z2
          |- group_Y2
               |- group_A

  All pages of W,X,Y1,Z1,Z2,Y2,A is on a LRU.

  When we hit the limit in group Y1, the tree is marked as follows.

   group_W
     |- group_X
          |- group_Y1 (*)
               |- group_Z1 (*)
               |- group_Z2 (*)
          |- group_Y2
               |- group_A

  We reclaim memory from group Y1, Z1, Z2.

  When we hit the limit in group X, the tree is marked as follows.

   group_W
     |- group_X (*)
          |- group_Y1 (*)
               |- group_Z1 (*)
               |- group_Z2 (*)
          |- group_Y2 (*)
               |- group_A (*)

  We reclaim memory from group Y1, Z1, Z2, Y2, A.

  Because all groups under a hierarchy share one LRU, the question "which
  memory should be reclaimed?" is answered much as it is for the global LRU.

  Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/memcontrol.c |  265 ++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 221 insertions(+), 44 deletions(-)

Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -150,6 +150,12 @@ struct mem_cgroup {
 	 */
 	int	prev_priority;	/* for recording reclaim priority */
 	/*
+	 * For supporting multilevel memory reclaim.
+	 */
+	int		is_root; /* This cgroup is the master of LRU */
+	int		use_hierarchy; /* This cgroup is under some hierarchy */
+	atomic_t	in_reclaim; /* This cgroup is the target of reclaim */
+	/*
 	 * used for counting reference from swap_cgroup.
 	 */
 	int		obsolete;
@@ -158,6 +164,18 @@ struct mem_cgroup {
 
 static struct mem_cgroup init_mem_cgroup;
 
+
+static struct mem_cgroup *mem_cgroup_from_res(struct res_counter *res)
+{
+	return container_of(res, struct mem_cgroup, res);
+}
+
+static struct mem_cgroup *mem_cgroup_from_memsw(struct res_counter *res)
+{
+	return container_of(res, struct mem_cgroup, memsw);
+}
+
+
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -186,6 +204,8 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
 #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
+/* Special File Name */
+#define MEMCG_FILE_HIERARCHY	(0xab01)
 
 static void mem_cgroup_forget_swapref(struct mem_cgroup *mem);
 
@@ -452,6 +472,9 @@ unsigned long mem_cgroup_isolate_pages(u
 		if (unlikely(!PageLRU(page)))
 			continue;
 
+		/* Is this page a target of this reclaim? */
+		if (!atomic_read(&pc->mem_cgroup->in_reclaim))
+			continue;
 		scan++;
 		if (__isolate_lru_page(page, mode, file) == 0) {
 			list_move(&page->lru, dst);
@@ -463,6 +486,56 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
+static void inc_dec_reclaim_counter(struct cgroup *cg, bool set)
+{
+	struct cgroup *tmp;
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cg);
+
+	if (set)
+		atomic_inc(&mem->in_reclaim);
+	else
+		atomic_dec(&mem->in_reclaim);
+
+	if (!list_empty(&cg->children)) {
+	    list_for_each_entry(tmp, &cg->children, sibling)
+		    inc_dec_reclaim_counter(tmp, set);
+	}
+}
+
+
+static void set_hierarchy_mask(struct mem_cgroup *mem,
+			      struct mem_cgroup *root, bool set)
+{
+	/*
+	 * We reclaim memory from all cgroups under "fail";
+	 * set the target marker on them.
+	 */
+	/* SHOULD BE FIXED: Need some magic to avoid this mutex.... */
+	cgroup_lock();
+	inc_dec_reclaim_counter(root->css.cgroup, set);
+	cgroup_unlock();
+}
+
+static int __try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
+					  gfp_t gfp_mask, bool noswap,
+					  struct mem_cgroup *root)
+{
+	int ret;
+
+	atomic_inc(&mem->in_reclaim);
+	/* If we already hold cgroup_lock(), root must be NULL */
+	if (mem->use_hierarchy && root)
+		set_hierarchy_mask(mem, root, true);
+
+	ret = try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap);
+
+	if (mem->use_hierarchy && root)
+		set_hierarchy_mask(mem, root, false);
+
+	atomic_dec(&mem->in_reclaim);
+
+	return ret;
+}
 /*
  * Unlike exported interface, "oom" parameter is added. if oom==true,
  * oom-killer can be invoked.
@@ -499,22 +572,31 @@ static int __mem_cgroup_try_charge(struc
 	while (1) {
 		int ret;
 		bool noswap = false;
+		struct res_counter *hit;
+		struct mem_cgroup *fail;
 
-		ret = res_counter_charge(&mem->res, PAGE_SIZE);
+		ret = res_counter_charge_hierarchy(&mem->res, PAGE_SIZE, &hit);
 		if (likely(!ret)) {
 			if (!do_swap_account)
 				break;
-			ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
+			ret = res_counter_charge_hierarchy(&mem->memsw,
+						PAGE_SIZE, &hit);
 			if (likely(!ret))
 				break;
+			fail = mem_cgroup_from_memsw(hit);
 			/* mem+swap counter fails */
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
+			res_counter_uncharge_hierarchy(&mem->res, PAGE_SIZE);
 			noswap = true;
-		}
+
+		} else
+			fail = mem_cgroup_from_res(hit);
+
 		if (!(gfp_mask & __GFP_WAIT))
 			goto nomem;
 
-		if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
+
+		if (__try_to_free_mem_cgroup_pages(mem, gfp_mask,
+						   noswap, fail))
 			continue;
 
 		/*
@@ -579,9 +661,9 @@ static void __mem_cgroup_commit_charge(s
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		res_counter_uncharge_hierarchy(&mem->res, PAGE_SIZE);
 		if (do_swap_account)
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+			res_counter_uncharge_hierarchy(&mem->memsw, PAGE_SIZE);
 		css_put(&mem->css);
 		return;
 	}
@@ -635,10 +717,10 @@ static int mem_cgroup_move_account(struc
 		goto out;
 
 	css_put(&from->css);
-	res_counter_uncharge(&from->res, PAGE_SIZE);
+	res_counter_uncharge_hierarchy(&from->res, PAGE_SIZE);
 	mem_cgroup_charge_statistics(from, pc, false);
 	if (do_swap_account)
-		res_counter_uncharge(&from->memsw, PAGE_SIZE);
+		res_counter_uncharge_hierarchy(&from->memsw, PAGE_SIZE);
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, pc, true);
 	css_get(&to->css);
@@ -687,15 +769,16 @@ static int mem_cgroup_move_parent(struct
 	/* drop extra refcnt by try_charge() (move_account increment one) */
 	css_put(&parent->css);
 	putback_lru_page(page);
+
 	if (!ret) {
 		put_page(page);
 		return 0;
 	}
 	/* uncharge if move fails */
 cancel:
-	res_counter_uncharge(&parent->res, PAGE_SIZE);
+	res_counter_uncharge_hierarchy(&parent->res, PAGE_SIZE);
 	if (do_swap_account)
-		res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+		res_counter_uncharge_hierarchy(&parent->memsw, PAGE_SIZE);
 	put_page(page);
 	return ret;
 }
@@ -857,7 +940,7 @@ int mem_cgroup_cache_charge_swapin(struc
 		 */
 		mem = swap_cgroup_record(ent, NULL);
 		if (mem) {
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+			res_counter_uncharge_hierarchy(&mem->memsw, PAGE_SIZE);
 			mem_cgroup_forget_swapref(mem);
 		}
 	}
@@ -892,7 +975,8 @@ void mem_cgroup_commit_charge_swapin(str
 		memcg = swap_cgroup_record(ent, NULL);
 		if (memcg) {
 			/* If memcg is obsolete, memcg can be != ptr */
-			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+			res_counter_uncharge_hierarchy(&memcg->memsw,
+							PAGE_SIZE);
 			mem_cgroup_forget_swapref(memcg);
 		}
 
@@ -906,8 +990,8 @@ void mem_cgroup_cancel_charge_swapin(str
 		return;
 	if (!mem)
 		return;
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
-	res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+	res_counter_uncharge_hierarchy(&mem->res, PAGE_SIZE);
+	res_counter_uncharge_hierarchy(&mem->memsw, PAGE_SIZE);
 	css_put(&mem->css);
 }
 
@@ -958,9 +1042,9 @@ __mem_cgroup_uncharge_common(struct page
 		break;
 	}
 
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	res_counter_uncharge_hierarchy(&mem->res, PAGE_SIZE);
 	if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
-		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+		res_counter_uncharge_hierarchy(&mem->memsw, PAGE_SIZE);
 
 	mem_cgroup_charge_statistics(mem, pc, false);
 	ClearPageCgroupUsed(pc);
@@ -1025,7 +1109,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
 
 	memcg = swap_cgroup_record(ent, NULL);
 	if (memcg) {
-		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		res_counter_uncharge_hierarchy(&memcg->memsw, PAGE_SIZE);
 		mem_cgroup_forget_swapref(memcg);
 	}
 }
@@ -1119,9 +1203,10 @@ void mem_cgroup_end_migration(struct mem
  */
 int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask)
 {
-	struct mem_cgroup *mem;
+	struct mem_cgroup *mem, *root;
 	int progress = 0;
 	int retry = MEM_CGROUP_RECLAIM_RETRIES;
+	struct res_counter *hit;
 
 	if (mem_cgroup_subsys.disabled)
 		return 0;
@@ -1136,10 +1221,13 @@ int mem_cgroup_shrink_usage(struct mm_st
 	}
 	css_get(&mem->css);
 	rcu_read_unlock();
-
+	root = mem;
 	do {
-		progress = try_to_free_mem_cgroup_pages(mem, gfp_mask, true);
-		progress += res_counter_check_under_limit(&mem->res);
+		progress = __try_to_free_mem_cgroup_pages(mem,
+					gfp_mask, true, root);
+		progress += res_counter_check_under_limit_hierarchy(&mem->res,
+					&hit);
+		root = mem_cgroup_from_res(hit);
 	} while (!progress && --retry);
 
 	css_put(&mem->css);
@@ -1164,8 +1252,9 @@ int mem_cgroup_resize_limit(struct mem_c
 			ret = -EBUSY;
 			break;
 		}
-		progress = try_to_free_mem_cgroup_pages(memcg,
-				GFP_HIGHUSER_MOVABLE, false);
+		/* reclaim memory from all children. */
+		progress = __try_to_free_mem_cgroup_pages(memcg,
+					  GFP_HIGHUSER_MOVABLE, false, memcg);
 		if (!progress)
 			retry_count--;
 	}
@@ -1206,7 +1295,8 @@ int mem_cgroup_resize_memsw_limit(struct
 
 		if (!ret)
 			break;
-		try_to_free_mem_cgroup_pages(memcg, GFP_HIGHUSER_MOVABLE, true);
+		__try_to_free_mem_cgroup_pages(memcg,
+				       GFP_HIGHUSER_MOVABLE, true, memcg);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		if (curusage >= oldusage)
 			retry_count--;
@@ -1243,16 +1333,25 @@ static int mem_cgroup_force_empty_list(s
 			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			break;
 		}
-		pc = list_entry(list->prev, struct page_cgroup, lru);
+		list_for_each_entry(pc, list, lru) {
+			/* check is not done under lock. fixed up later */
+			if (pc->mem_cgroup == mem)
+				break;
+		}
+		if (&pc->lru == list) { /* empty ? */
+			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			break;
+		}
 		if (busy == pc) {
-			list_move(&pc->lru, list);
-			busy = 0;
+			list_move_tail(&pc->lru, list);
+			busy = NULL;
 			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			continue;
 		}
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
 		ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
+
 		if (ret == -ENOMEM)
 			break;
 
@@ -1264,8 +1363,6 @@ static int mem_cgroup_force_empty_list(s
 			busy = NULL;
 	}
 
-	if (!ret && !list_empty(list))
-		return -EBUSY;
 	return ret;
 }
 
@@ -1287,7 +1384,8 @@ move_account:
 		ret = -EBUSY;
 		if (atomic_read(&mem->css.cgroup->count) > 0)
 			goto out;
-
+		if (signal_pending(current))
+			goto out;
 		/* This is for making all *used* pages to be on LRU. */
 		lru_add_drain_all();
 		ret = 0;
@@ -1324,8 +1422,8 @@ try_to_free:
 	shrink = 1;
 	while (nr_retries && mem->res.usage > 0) {
 		int progress;
-		progress = try_to_free_mem_cgroup_pages(mem,
-						  GFP_HIGHUSER_MOVABLE, false);
+		progress = __try_to_free_mem_cgroup_pages(mem,
+					  GFP_HIGHUSER_MOVABLE, false, NULL);
 		if (!progress)
 			nr_retries--;
 
@@ -1346,6 +1444,10 @@ static u64 mem_cgroup_read(struct cgroup
 
 	type = MEMFILE_TYPE(cft->private);
 	name = MEMFILE_ATTR(cft->private);
+	/* something special ? */
+	if (name == MEMCG_FILE_HIERARCHY) {
+		return mem->use_hierarchy;
+	}
 	switch (type) {
 	case _MEM:
 		val = res_counter_read_u64(&mem->res, name);
@@ -1360,6 +1462,28 @@ static u64 mem_cgroup_read(struct cgroup
 	}
 	return val;
 }
+
+
+static int mem_cgroup_set_hierarchy(struct mem_cgroup *mem,
+                                     unsigned long long val)
+{
+	struct cgroup *cg = mem->css.cgroup;
+	int ret = 0;
+
+	cgroup_lock();
+	if (!list_empty(&cg->children)) {
+		ret =  -EBUSY;
+	} else {
+		if (val)
+			mem->use_hierarchy = 1;
+		else
+			mem->use_hierarchy = 0;
+	}
+	cgroup_unlock();
+
+	return ret;
+}
+
 /*
  * The user of this function is...
  * RES_LIMIT.
@@ -1385,6 +1509,12 @@ static int mem_cgroup_write(struct cgrou
 		else
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
+	case MEMCG_FILE_HIERARCHY:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		ret = mem_cgroup_set_hierarchy(memcg, val);
+		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
 		break;
@@ -1441,8 +1571,11 @@ static int mem_control_stat_show(struct 
 		val *= mem_cgroup_stat_desc[i].unit;
 		cb->fill(cb, mem_cgroup_stat_desc[i].msg, val);
 	}
-	/* showing # of active pages */
-	{
+	/*
+	 * showing # of active pages if it's the root cgroup of a hierarchy.
+	 * The LRU is shared under a tree.
+	 */
+	if (mem_cont->is_root) {
 		unsigned long active_anon, inactive_anon;
 		unsigned long active_file, inactive_file;
 		unsigned long unevictable;
@@ -1499,6 +1632,12 @@ static struct cftype mem_cgroup_files[] 
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
+	{
+		.name = "hierarchy",
+		.private = MEMCG_FILE_HIERARCHY,
+		.write_string = mem_cgroup_write,
+		.read_u64  = mem_cgroup_read,
+	},
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 	{
 		.name = "memsw.usage_in_bytes",
@@ -1562,6 +1701,19 @@ static void free_mem_cgroup_per_zone_inf
 	kfree(mem->info.nodeinfo[node]);
 }
 
+static void mem_cgroup_share_node_info(struct mem_cgroup *mem,
+				       struct mem_cgroup *parent)
+{
+	struct mem_cgroup_per_node *pn;
+	int node;
+
+	for_each_node_state(node, N_POSSIBLE) {
+		pn = parent->info.nodeinfo[node];
+		if (pn)
+			mem->info.nodeinfo[node] = pn;
+	}
+}
+
 static struct mem_cgroup *mem_cgroup_alloc(void)
 {
 	struct mem_cgroup *mem;
@@ -1599,9 +1751,10 @@ static void mem_cgroup_free(struct mem_c
 			return;
 	}
 
-
-	for_each_node_state(node, N_POSSIBLE)
-		free_mem_cgroup_per_zone_info(mem, node);
+	if (mem->is_root) {
+		for_each_node_state(node, N_POSSIBLE)
+			free_mem_cgroup_per_zone_info(mem, node);
+	}
 
 	if (sizeof(*mem) < PAGE_SIZE)
 		kfree(mem);
@@ -1636,24 +1789,48 @@ static void __init enable_swap_cgroup(vo
 static struct cgroup_subsys_state *
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
-	struct mem_cgroup *mem;
+	struct mem_cgroup *mem, *parent;
 	int node;
 
 	if (unlikely((cont->parent) == NULL)) {
 		mem = &init_mem_cgroup;
 		enable_swap_cgroup();
+		parent = NULL;
+		atomic_set(&mem->in_reclaim, 0);
 	} else {
 		mem = mem_cgroup_alloc();
 		if (!mem)
 			return ERR_PTR(-ENOMEM);
+		parent = mem_cgroup_from_cont(cont->parent);
 	}
 
-	res_counter_init(&mem->res);
-	res_counter_init(&mem->memsw);
+	/* If the parent uses hierarchy, the children do too. */
+	if (parent)
+		mem->use_hierarchy = parent->use_hierarchy;
+	else /* default is no hierarchy */
+		mem->use_hierarchy = 0;
 
-	for_each_node_state(node, N_POSSIBLE)
-		if (alloc_mem_cgroup_per_zone_info(mem, node))
-			goto free_out;
+	if (!mem->use_hierarchy) {
+		mem->is_root = 1;
+		parent = NULL;
+		res_counter_init(&mem->res);
+		res_counter_init(&mem->memsw);
+	} else {
+		res_counter_init_hierarchy(&mem->res, &parent->res);
+		res_counter_init_hierarchy(&mem->memsw, &parent->memsw);
+	}
+	/*
+	 * If this memcg is a hierarchy root, use its own LRU.
+	 * If not, share the parent's (root's) one.
+	 */
+	if (mem->is_root) {
+		for_each_node_state(node, N_POSSIBLE)
+			if (alloc_mem_cgroup_per_zone_info(mem, node))
+				goto free_out;
+	} else {
+		for_each_node_state(node, N_POSSIBLE)
+			mem_cgroup_share_node_info(mem, parent);
+	}
 
 	return &mem->css;
 free_out:


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm] [PATCH 3/4] Memory cgroup hierarchical reclaim
  2008-11-04  2:17       ` KAMEZAWA Hiroyuki
@ 2008-11-05 13:34         ` Balbir Singh
  2008-11-05 16:20           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2008-11-05 13:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Sun, 02 Nov 2008 11:14:48 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> KAMEZAWA Hiroyuki wrote:
>>> On Sun, 02 Nov 2008 00:18:49 +0530
>>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>>
>>>> This patch introduces hierarchical reclaim. When an ancestor goes over its
>>>> limit, the charging routine points to the parent that is above its limit.
>>>> The reclaim process then starts from the last scanned child of the ancestor
>>>> and reclaims until the ancestor goes below its limit.
>>>>
>>>> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>>>> ---
>>>>
>>>>  mm/memcontrol.c |  153 +++++++++++++++++++++++++++++++++++++++++++++++---------
>>>>  1 file changed, 129 insertions(+), 24 deletions(-)
>>>>
>>>> diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
>>>> --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim	2008-11-02 00:14:59.000000000 +0530
>>>> +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-02 00:14:59.000000000 +0530
>>>> @@ -132,6 +132,11 @@ struct mem_cgroup {
>>>>  	 * statistics.
>>>>  	 */
>>>>  	struct mem_cgroup_stat stat;
>>>> +	/*
>>>> +	 * While reclaiming in a hierarchy, we cache the last child we
>>>> +	 * reclaimed from.
>>>> +	 */
>>>> +	struct mem_cgroup *last_scanned_child;
>>>>  };
>>>>  static struct mem_cgroup init_mem_cgroup;
>>>>  
>>>> @@ -467,6 +472,125 @@ unsigned long mem_cgroup_isolate_pages(u
>>>>  	return nr_taken;
>>>>  }
>>>>  
>>>> +static struct mem_cgroup *
>>>> +mem_cgroup_from_res_counter(struct res_counter *counter)
>>>> +{
>>>> +	return container_of(counter, struct mem_cgroup, res);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Dance down the hierarchy if needed to reclaim memory. We remember the
>>>> + * last child we reclaimed from, so that we don't end up penalizing
>>>> + * one child extensively based on its position in the children list
>>>> + */
>>>> +static int
>>>> +mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem, gfp_t gfp_mask)
>>>> +{
>>>> +	struct cgroup *cg, *cg_current, *cgroup;
>>>> +	struct mem_cgroup *mem_child;
>>>> +	int ret = 0;
>>>> +
>>>> +	if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
>>>> +		return -ENOMEM;
>>>> +
>>>> +	/*
>>>> +	 * try_to_free_mem_cgroup_pages() might not give us a full
>>>> +	 * picture of reclaim. Some pages are reclaimed and might be
>>>> +	 * moved to swap cache or just unmapped from the cgroup.
>>>> +	 * Check the limit again to see if the reclaim reduced the
>>>> +	 * current usage of the cgroup before giving up
>>>> +	 */
>>>> +	if (res_counter_check_under_limit(&mem->res))
>>>> +		return 0;
>>>> +
>>>> +	/*
>>>> +	 * Scan all children under the mem_cgroup mem
>>>> +	 */
>>>> +	if (!mem->last_scanned_child)
>>>> +		cgroup = list_first_entry(&mem->css.cgroup->children,
>>>> +				struct cgroup, sibling);
>>>> +	else
>>>> +		cgroup = mem->last_scanned_child->css.cgroup;
>>>> +
>>>> +	cg_current = cgroup;
>>>> +
>>>> +	/*
>>>> +	 * We iterate twice; one reason is a fundamental list issue: the
>>>> +	 * elements are inserted using list_add, so the list behaves like
>>>> +	 * a stack and list_for_each_entry_safe_from() stops after seeing
>>>> +	 * the first child. The two loops let us work independently of the
>>>> +	 * insertion order and give us a full pass at scanning all list
>>>> +	 * entries for reclaim
>>>> +	 */
>>>> +	list_for_each_entry_safe_from(cgroup, cg, &cg_current->parent->children,
>>>> +						 sibling) {
>>>> +		mem_child = mem_cgroup_from_cont(cgroup);
>>>> +
>>>> +		/*
>>>> +		 * Move beyond last scanned child
>>>> +		 */
>>>> +		if (mem_child == mem->last_scanned_child)
>>>> +			continue;
>>>> +
>>>> +		ret = try_to_free_mem_cgroup_pages(mem_child, gfp_mask);
>>>> +		mem->last_scanned_child = mem_child;
>>>> +
>>>> +		if (res_counter_check_under_limit(&mem->res)) {
>>>> +			ret = 0;
>>>> +			goto done;
>>>> +		}
>>>> +	}
>>> Is this safe against cgroup create/remove? Is cgroup_mutex held?
>> Yes, I thought about it, but with the setup, each parent will be busy since they
>> have children and hence cannot be removed. The leaf child itself has tasks, so
>> it cannot be removed. IOW, it should be safe against removal.
>>
> I'm sorry if I misunderstand something. Could you explain the following?
> 
> In the following tree,
> 
>     level-1
>          -  level-2
>                 -  level-3
>                        -  level-4
> level-1's usage = level-1 + level-2 + level-3 + level-4
> level-2's usage = level-2 + level-3 + level-4
> level-3's usage = level-3 + level-4
> level-4's usage = level-4
> 
> Assume that a task in level-2 hits its limit. It has to reclaim memory from
> level-2, level-3, and level-4.
> 
> How can we guarantee that level-4 has a task in this case?

Good question. If there is no task, the LRUs will be empty and reclaim will
return. We could also add other checks if needed.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm][PATCH 0/4] Memory cgroup hierarchy introduction
  2008-11-04  0:15 ` [mm][PATCH 0/4] Memory cgroup hierarchy introduction KAMEZAWA Hiroyuki
@ 2008-11-05 13:51   ` Balbir Singh
  2008-11-05 16:33     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2008-11-05 13:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Sun, 02 Nov 2008 00:18:12 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> This patch follows several iterations of the memory controller hierarchy
>> patches. The hardwall approach by Kamezawa-San[1]. Version 1 of this patchset
>> at [2].
>>
>> The current approach is based on [2] and has the following properties
>>
>> 1. Hierarchies are very natural in a filesystem like the cgroup filesystem.
>>    A multi-tree hierarchy has been supported for a long time in filesystems.
>>    When the feature is turned on, we honor hierarchies such that the root
>>    accounts for resource usage of all children and limits can be set at
>>    any point in the hierarchy. Any memory cgroup is limited by limits
>>    along the hierarchy. The total usage of all children of a node cannot
>>    exceed the limit of the node.
>> 2. The hierarchy feature is selectable and off by default
>> 3. Hierarchies are expensive and the trade off is depth versus performance.
>>    Hierarchies can also be completely turned off.
>>
>> The patches are against 2.6.28-rc2-mm1 and were tested in a KVM instance
>> with SMP and swap turned on.
>>
> 
> As a first impression, I think hierarchical LRU management is not good...meaning
> it is not fair from the viewpoint of memory management.

Could you elaborate on this further? Is scanning of children during reclaim the
issue? Do you want weighted reclaim for each of the children?

> I'd like to show some other possible implementation of
> try_to_free_mem_cgroup_pages() if I can.
> 

Elaborate please!

> Anyway, I have to merge this with mem+swap controller. 

Cool! I'll send you an updated version.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm] [PATCH 1/4] Memory cgroup hierarchy documentation
  2008-11-04  6:26     ` Paul Menage
@ 2008-11-05 13:55       ` Balbir Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Balbir Singh @ 2008-11-05 13:55 UTC (permalink / raw)
  To: Paul Menage
  Cc: linux-mm, YAMAMOTO Takashi, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Andrew Morton,
	KAMEZAWA Hiroyuki

Paul Menage wrote:
> On Mon, Nov 3, 2008 at 10:25 PM, Paul Menage <menage@google.com> wrote:
>> That's not a very intuitive interface. Why not memory.use_hierarchy?
> 
> Or for consistency with the existing cpuset.memory_pressure_enabled,
> just memory.hierarchy_enabled ?

Yes, I can change it to that.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm] [PATCH 3/4] Memory cgroup hierarchical reclaim
  2008-11-05 13:34         ` Balbir Singh
@ 2008-11-05 16:20           ` KAMEZAWA Hiroyuki
  2008-11-06 14:00             ` Balbir Singh
  0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-05 16:20 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-mm, YAMAMOTO Takashi, Paul Menage, lizf,
	linux-kernel, Nick Piggin, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Andrew Morton

Balbir Singh said:
>>>>> +	list_for_each_entry_safe_from(cgroup, cg,
>>>>> &cg_current->parent->children,
>>>>> +						 sibling) {
>>>>> +		mem_child = mem_cgroup_from_cont(cgroup);
>>>>> +
>>>>> +		/*
>>>>> +		 * Move beyond last scanned child
>>>>> +		 */
>>>>> +		if (mem_child == mem->last_scanned_child)
>>>>> +			continue;
>>>>> +
>>>>> +		ret = try_to_free_mem_cgroup_pages(mem_child, gfp_mask);
>>>>> +		mem->last_scanned_child = mem_child;
>>>>> +
>>>>> +		if (res_counter_check_under_limit(&mem->res)) {
>>>>> +			ret = 0;
>>>>> +			goto done;
>>>>> +		}
>>>>> +	}
>>>> Is this safe against cgroup create/remove? Is cgroup_mutex held?
>>> Yes, I thought about it, but with the setup, each parent will be busy
>>> since they
>>> have children and hence cannot be removed. The leaf child itself has
>>> tasks, so
>>> it cannot be removed. IOW, it should be safe against removal.
>>>
>> I'm sorry if I misunderstand something. Could you explain the following?
>>
>> In the following tree,
>>
>>     level-1
>>          -  level-2
>>                 -  level-3
>>                        -  level-4
>> level-1's usage = level-1 + level-2 + level-3 + level-4
>> level-2's usage = level-2 + level-3 + level-4
>> level-3's usage = level-3 + level-4
>> level-4's usage = level-4
>>
>> Assume that a task in level-2 hits its limit. It has to reclaim memory
>> from
>> level-2, level-3, and level-4.
>>
>> How can we guarantee that level-4 has a task in this case?
>
> Good question. If there is no task, the LRUs will be empty and reclaim
> will
> return. We could also add other checks if needed.
>
If needed? Yes, you do need them.
The problem is that you are walking a list in the usual way without any lock
or any guarantee that the list will never be modified.

My quick idea is the following.
==
Before starting reclaim:
 1. take cgroup_lock()
 2. scan the tree and create a "private" list as a snapshot of the tree to be
    scanned.
 3. cgroup_unlock()
 4. start reclaim.

Adding a refcnt to memcg to delay freeing the memcg control area is necessary.
(The mem+swap controller has a function to do this, and you may be able to
 reuse it.)
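
A minimal sketch of that idea (illustrative only: the mem_cgroup_get()/put()
style refcounting is assumed, as suggested above, and the allocation context
is glossed over):

	struct reclaim_victim {
		struct list_head link;
		struct mem_cgroup *mem;
	};

	/*
	 * Snapshot the children under cgroup_lock(), pinning each memcg,
	 * so the actual reclaim can run without holding the lock.
	 */
	static void snapshot_children(struct mem_cgroup *mem,
				      struct list_head *victims)
	{
		struct cgroup *child;
		struct reclaim_victim *v;

		cgroup_lock();
		list_for_each_entry(child, &mem->css.cgroup->children,
				    sibling) {
			v = kmalloc(sizeof(*v), GFP_KERNEL);
			if (!v)
				continue;
			v->mem = mem_cgroup_from_cont(child);
			mem_cgroup_get(v->mem);	/* delay freeing */
			list_add_tail(&v->link, victims);
		}
		cgroup_unlock();
	}

After reclaim, the caller would walk "victims", dropping each reference
with the matching mem_cgroup_put() and freeing the entries.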

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm][PATCH 0/4] Memory cgroup hierarchy introduction
  2008-11-05 13:51   ` Balbir Singh
@ 2008-11-05 16:33     ` KAMEZAWA Hiroyuki
  2008-11-05 17:52       ` Balbir Singh
  0 siblings, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-05 16:33 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-mm, YAMAMOTO Takashi, Paul Menage, lizf,
	linux-kernel, Nick Piggin, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Andrew Morton

Balbir Singh said:
> KAMEZAWA Hiroyuki wrote:
>> On Sun, 02 Nov 2008 00:18:12 +0530
>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> As a first impression, I think hierarchical LRU management is not
>> good...meaning
>> it is not fair from the viewpoint of memory management.
>
> Could you elaborate on this further? Is scanning of children during
> reclaim the
> issue? Do you want weighted reclaim for each of the children?
>
No. Consider the following case:
   /root/group_root/group_A
                   /group_B
                   /group_C

  sum of group A, B, C is limited by group_root's limit.

  Now,
        /group_root limit=1G, usage=990M
                    /group_A  usage=600M , no limit, no tasks for a while
                    /group_B  usage=10M  , no limit, no tasks
                    /group_C  usage=380M , no limit, 2 tasks

  A user runs a new task in group_B.
  In your algorithm, group_A's, B's, and C's memory is reclaimed
  to the same extent because there is no information to show that
  "group_A's memory has been accessed less recently than B's or C's".

  This information is what we want for managing memory.

>> I'd like to show some other possible implementation of
>> try_to_free_mem_cgroup_pages() if I can.
>>
>
> Elaborate please!
>
OK. But, at least, please add:
  - a per-subtree hierarchy flag.
  - cgroup_lock somewhere around walking the list of cgroups.

I already sent my "shared LRU" version just as a hint for you.
It is somewhat extreme but contains something good, I think.

>> Anyway, I have to merge this with mem+swap controller.
>
> Cool! I'll send you an updated version.
>

Synchronized LRU patch may help you.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm][PATCH 0/4] Memory cgroup hierarchy introduction
  2008-11-05 16:33     ` KAMEZAWA Hiroyuki
@ 2008-11-05 17:52       ` Balbir Singh
  2008-11-06  0:22         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2008-11-05 17:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> Balbir Singh said:
>> KAMEZAWA Hiroyuki wrote:
>>> On Sun, 02 Nov 2008 00:18:12 +0530
>>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>> As a first impression, I think hierarchical LRU management is not
>>> good...meaning
>>> it is not fair from the viewpoint of memory management.
>> Could you elaborate on this further? Is scanning of children during
>> reclaim the
>> issue? Do you want weighted reclaim for each of the children?
>>
> No. Consider the following case:
>    /root/group_root/group_A
>                    /group_B
>                    /group_C
> 
>   sum of group A, B, C is limited by group_root's limit.
> 
>   Now,
>         /group_root limit=1G, usage=990M
>                     /group_A  usage=600M , no limit, no tasks for a while
>                     /group_B  usage=10M  , no limit, no tasks
>                     /group_C  usage=380M , no limit, 2 tasks
> 
>   A user runs a new task in group_B.
>   In your algorithm, group_A's, B's, and C's memory is reclaimed
>   to the same extent because there is no information to show that
>   "group_A's memory has been accessed less recently than B's or C's".
> 
>   This information is what we want for managing memory.
> 

For that sort of implementation, we'll need a common LRU. I actually thought of
implementing it by sharing a common LRU, but then we would end up with just one
common LRU at the root :)

The reclaim algorithm is smart in that it knows which pages are commonly
accessed. group_A will get reclaimed more since its pages are not actively
referenced; reclaim on group_C will be harder. Simple experiments seem to show that.


>>> I'd like to show some other possible implementation of
>>> try_to_free_mem_cgroup_pages() if I can.
>>>
>> Elaborate please!
>>
> OK. But, at least, please add:
>   - a per-subtree hierarchy flag.
>   - cgroup_lock somewhere around walking the list of cgroups.
> 
> I already sent my "shared LRU" version just as a hint for you.
> It is somewhat extreme but contains something good, I think.
> 
>>> Anyway, I have to merge this with mem+swap controller.
>> Cool! I'll send you an updated version.
>>
> 
> Synchronized LRU patch may help you.

Let me get a good working version against current -mm and then we'll integrate
our patches.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm][PATCH 0/4] Memory cgroup hierarchy introduction
  2008-11-05 17:52       ` Balbir Singh
@ 2008-11-06  0:22         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-06  0:22 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Wed, 05 Nov 2008 23:22:36 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> >   Now,
> >         /group_root limit=1G, usage=990M
> >                     /group_A  usage=600M , no limit, no tasks for a while
> >                     /group_B  usage=10M  , no limit, no tasks
> >                     /group_C  usage=380M , no limit, 2 tasks
> > 
> >   A user run a new task in group_B.
> >   In your algorithm, group_A and B and C's memory are reclaimed
> >   to the same extent becasue there is no information to show
> >   "group A's memory are not accessed recently rather than B or C".
> > 
> >   This information is what we want for managing memory.
> > 
> 
> For that sort of implementation, we'll need a common LRU. I actually thought of
> implementing it by sharing a common LRU, but then we would end up with just one
> common LRU at the root :)
> 
> The reclaim algorithm is smart in that it knows which pages are commonly
> accessed. group_A will get reclaimed more since its pages are not actively
> referenced; reclaim on group_C will be harder.
Why? I think isolate_lru_page() removes SWAP_CLUSTER_MAX pages from each group.

> Simple experiments seem to show that.
> 
Please record the problem as a problem and put it in a TODO or FIXME, at least,
or add a logical explanation of "why this works well" in the text.
As Andrew Morton said, validation of LRU management tends to need a long time.


> >>> I'd like to show some other possible implementation of
> >>> try_to_free_mem_cgroup_pages() if I can.
> >>>
> >> Elaborate please!
> >>
> > OK. But, at least, please add:
> >   - a per-subtree hierarchy flag.
> >   - cgroup_lock somewhere around walking the list of cgroups.
> > 
> > I already sent my "shared LRU" version just as a hint for you.
> > It is somewhat extreme but contains something good, I think.
> > 
> >>> Anyway, I have to merge this with mem+swap controller.
> >> Cool! I'll send you an updated version.
> >>
> > 
> > Synchronized LRU patch may help you.
> 
> Let me get a good working version against current -mm and then we'll integrate
> our patches.
> 
Does the patch set I posted yesterday not work?
If it doesn't, please wait until the next mmotm comes out.

-Kame


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector
  2008-11-02  6:24       ` KAMEZAWA Hiroyuki
  2008-11-02 15:52         ` Balbir Singh
@ 2008-11-06  6:56         ` Balbir Singh
  2008-11-06  7:30           ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2008-11-06  6:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Sun, 02 Nov 2008 11:33:51 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> KAMEZAWA Hiroyuki wrote:
>>> On Sun, 02 Nov 2008 00:19:02 +0530
>>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>>
>>>> Don't enable multiple hierarchy support by default. This patch introduces
>>>> a features element that can be set to enable the nested depth hierarchy
>>>> feature. This feature can only be enabled when there is just one cgroup
>>>> (the root cgroup).
>>>>
>>> Why is the flag for the whole system?
>>> Is flag-per-subtree of no use?
>> Flag per subtree might not be useful, since we charge all the way up to root,
Ah, what I said is "How about enabling/disabling hierarchy support per subtree?"
Sorry for the bad text.
> 
Like this:
  - you can turn a cgroup's hierarchy mode on/off when...
    * it doesn't have any tasks under it && it doesn't have any child cgroup.

I see... the presence of tasks doesn't matter, since the root cgroup will always
have tasks. The presence of child cgroups does matter.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector
  2008-11-04  6:37           ` Paul Menage
@ 2008-11-06  7:00             ` Balbir Singh
  2008-11-06  7:01               ` Balbir Singh
  0 siblings, 1 reply; 34+ messages in thread
From: Balbir Singh @ 2008-11-06  7:00 UTC (permalink / raw)
  To: Paul Menage
  Cc: KAMEZAWA Hiroyuki, linux-mm, YAMAMOTO Takashi, lizf,
	linux-kernel, Nick Piggin, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Andrew Morton, Dhaval Giani, Srivatsa Vaddagiri

Paul Menage wrote:
> On Sun, Nov 2, 2008 at 7:52 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> That should not be hard, but having it per-subtree sounds a little complex in
>> terms of exploiting from the end-user perspective and from symmetry perspective
>> (the CPU cgroup controller provides hierarchy control for the full hierarchy).
>>
> 
> The difference is that the CPU controller works in terms of shares,
> whereas memory works in terms of absolute memory size. So it pretty
> much has to limit the hierarchy to a single tree. Also, I didn't think
> that you could modify the shares for the root cgroup - what would that
> mean if so?
> 

The shares are proportional for the CPU controller. I am confused as to which
shares (CPU?) you are talking about.

> With this patch set as it is now, the root cgroup's lock becomes a
> global memory allocation/deallocation lock, which seems a bit painful.

Yes, true, but then that is the cost associated with using a hierarchy.

> Having a bunch of top-level cgroups each with their own independent
> memory limits, and allowing sub cgroups of them to be constrained by
> the parent's memory limit, seems more useful than a single hierarchy
> connected at the root.

That is certainly doable, but it can be confusing to users, given how other
controllers work. We can document that.

> 
> In what realistic circumstances do you actually want to limit the root
> cgroup's memory usage?
> 

Good point. I would expect that people would mostly use the hierarchy with
soft limits or shares. I am now beginning to like Kamezawa's and your suggestion
of limiting usage to subtrees.


-- 
	Balbir

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector
  2008-11-06  7:00             ` Balbir Singh
@ 2008-11-06  7:01               ` Balbir Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Balbir Singh @ 2008-11-06  7:01 UTC (permalink / raw)
  To: Paul Menage
  Cc: KAMEZAWA Hiroyuki, linux-mm, YAMAMOTO Takashi, lizf,
	linux-kernel, Nick Piggin, David Rientjes, Pavel Emelianov,
	Dhaval Giani, Andrew Morton, Srivatsa Vaddagiri

Balbir Singh wrote:
> Paul Menage wrote:
>> On Sun, Nov 2, 2008 at 7:52 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>> That should not be hard, but having it per-subtree sounds a little complex in
>>> terms of exploiting from the end-user perspective and from symmetry perspective
>>> (the CPU cgroup controller provides hierarchy control for the full hierarchy).
>>>
>> The difference is that the CPU controller works in terms of shares,
>> whereas memory works in terms of absolute memory size. So it pretty
>> much has to limit the hierarchy to a single tree. Also, I didn't think
>> that you could modify the shares for the root cgroup - what would that
>> mean if so?
>>
> 
> The shares are proportional for the CPU controller. I am confused as to which
> shares (CPU?) you are talking about.
> 
>> With this patch set as it is now, the root cgroup's lock becomes a
>> global memory allocation/deallocation lock, which seems a bit painful.
> 
> Yes, true, but then that is the cost associated with using a hierarchy.
> 
>> Having a bunch of top-level cgroups each with their own independent
>> memory limits, and allowing sub cgroups of them to be constrained by
>> the parent's memory limit, seems more useful than a single hierarchy
>> connected at the root.
> 
> That is certainly doable, but it can be confusing to users, given how other
> controllers work. We can document that.
> 
>> In what realistic circumstances do you actually want to limit the root
>> cgroup's memory usage?
>>
> 
> Good point. I would expect that people would mostly use the hierarchy with
> soft limits or shares. I am now beginning to like Kamezawa's and your suggestion
> of limiting usage to subtrees.

Oh! I am not sure if I mentioned this, but you don't need to limit usage at the root.
Any parent along the hierarchy can be limited and it will act as if the entire
subtree is limited by that limit.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector
  2008-11-06  6:56         ` Balbir Singh
@ 2008-11-06  7:30           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-06  7:30 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Thu, 06 Nov 2008 12:26:29 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Sun, 02 Nov 2008 11:33:51 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> >> KAMEZAWA Hiroyuki wrote:
> >>> On Sun, 02 Nov 2008 00:19:02 +0530
> >>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >>>
> >>>> Don't enable multiple hierarchy support by default. This patch introduces
> >>>> a features element that can be set to enable the nested depth hierarchy
> >>>> feature. This feature can only be enabled when there is just one cgroup
> >>>> (the root cgroup).
> >>>>
> >>> Why is the flag for the whole system?
> >>> Is flag-per-subtree of no use?
> >> Flag per subtree might not be useful, since we charge all the way up to root,
> > Ah, what I said is "How about enabling/disabling hierarchy support per subtree?"
> > Sorry for the bad text.
> > 
> > Like this:
> >   - you can turn a cgroup's hierarchy mode on/off when...
> >     * it doesn't have any tasks under it && it doesn't have any child cgroup.
> 
> I see... the presence of tasks doesn't matter, since the root cgroup will always
> have tasks. The presence of child cgroups does matter.
> 
Yes, you're right.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm] [PATCH 3/4] Memory cgroup hierarchical reclaim
  2008-11-05 16:20           ` KAMEZAWA Hiroyuki
@ 2008-11-06 14:00             ` Balbir Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Balbir Singh @ 2008-11-06 14:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> Balbir Singh said:
>>>>>> +	list_for_each_entry_safe_from(cgroup, cg,
>>>>>> &cg_current->parent->children,
>>>>>> +						 sibling) {
>>>>>> +		mem_child = mem_cgroup_from_cont(cgroup);
>>>>>> +
>>>>>> +		/*
>>>>>> +		 * Move beyond last scanned child
>>>>>> +		 */
>>>>>> +		if (mem_child == mem->last_scanned_child)
>>>>>> +			continue;
>>>>>> +
>>>>>> +		ret = try_to_free_mem_cgroup_pages(mem_child, gfp_mask);
>>>>>> +		mem->last_scanned_child = mem_child;
>>>>>> +
>>>>>> +		if (res_counter_check_under_limit(&mem->res)) {
>>>>>> +			ret = 0;
>>>>>> +			goto done;
>>>>>> +		}
>>>>>> +	}
>>>>> Is this safe against cgroup create/remove? Is cgroup_mutex held?
>>>> Yes, I thought about it, but with this setup each parent will be busy,
>>>> since it has children, and hence cannot be removed. The leaf child itself
>>>> has tasks, so it cannot be removed. IOW, the walk should be safe against
>>>> removal.
>>>>
>>> I'm sorry if I misunderstand something. Could you explain the following?
>>>
>>> In the following tree:
>>>
>>>     level-1
>>>          -  level-2
>>>                 -  level-3
>>>                        -  level-4
>>> level-1's usage = level-1 + level-2 + level-3 + level-4
>>> level-2's usage = level-2 + level-3 + level-4
>>> level-3's usage = level-3 + level-4
>>> level-4's usage = level-4
>>>
>>> Assume that a task in level-2 hits its limit. It has to reclaim memory
>>> from level-2, level-3, and level-4.
>>>
>>> How can we guarantee that level-4 has a task in this case?
>> Good question. If there is no task, the LRUs will be empty and reclaim will
>> return. We could also add other checks if needed.
>>
> If needed? Yes, you need to.
> The problem is that you are walking the list in the usual way, without any
> lock or any guarantee that the list will never be modified.
> 
> My quick idea is the following.
> ==
> Before starting reclaim:
>  1. take cgroup_lock()
>  2. scan the tree and create a "private" list as a snapshot of the tree to
>     be scanned.

This might not be feasible, since we would need to recurse down the tree
structure. I am wondering what the best way to walk down a hierarchy is.
Paul, any suggestions?

Here is what I have so far (rough sketch below):

1. Take the cgroup lock
2. Walk the cgroup's children with a list_for_each_entry_safe()-style loop
3. Reclaim from the local tasks
4. Reclaim from the child cgroups (starting from the last child we stopped
   at), recursively, so that we can walk down the full hierarchy
5. Unlock the cgroup
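
A rough sketch of that walk, under the assumptions above. This is not working
patch code: it omits the last_scanned_child bookkeeping, and it holds
cgroup_lock() across reclaim, which is the (heavy) price of keeping the walk
simple:

/* steps 2-4: recursive part; the caller holds cgroup_lock() */
static int __mem_cgroup_reclaim_subtree(struct mem_cgroup *root,
					struct mem_cgroup *mem,
					gfp_t gfp_mask)
{
	struct cgroup *child;
	int ret;

	/* step 3: reclaim from this group's own pages first */
	ret = try_to_free_mem_cgroup_pages(mem, gfp_mask);
	if (res_counter_check_under_limit(&root->res))
		return 0;

	/* step 4: recurse into each child, covering the full hierarchy */
	list_for_each_entry(child, &mem->css.cgroup->children, sibling) {
		ret = __mem_cgroup_reclaim_subtree(root,
				mem_cgroup_from_cont(child), gfp_mask);
		if (res_counter_check_under_limit(&root->res))
			return 0;
	}
	return ret;
}

static int mem_cgroup_reclaim_tree(struct mem_cgroup *mem, gfp_t gfp_mask)
{
	int ret;

	cgroup_lock();		/* step 1: no cgroup create/remove under us */
	ret = __mem_cgroup_reclaim_subtree(mem, mem, gfp_mask);
	cgroup_unlock();	/* step 5 */
	return ret;
}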



-- 
	Balbir

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2008-11-06 14:00 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-11-01 18:48 [mm][PATCH 0/4] Memory cgroup hierarchy introduction Balbir Singh
2008-11-01 18:48 ` [mm] [PATCH 1/4] Memory cgroup hierarchy documentation Balbir Singh
2008-11-04  6:25   ` Paul Menage
2008-11-04  6:26     ` Paul Menage
2008-11-05 13:55       ` Balbir Singh
2008-11-01 18:48 ` [mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy Balbir Singh
2008-11-02  5:42   ` KAMEZAWA Hiroyuki
2008-11-02  5:49     ` Balbir Singh
2008-11-02  5:56       ` KAMEZAWA Hiroyuki
2008-11-02 11:46         ` Balbir Singh
2008-11-01 18:48 ` [mm] [PATCH 3/4] Memory cgroup hierarchical reclaim Balbir Singh
2008-11-02  5:37   ` KAMEZAWA Hiroyuki
2008-11-02  5:44     ` Balbir Singh
2008-11-04  2:17       ` KAMEZAWA Hiroyuki
2008-11-05 13:34         ` Balbir Singh
2008-11-05 16:20           ` KAMEZAWA Hiroyuki
2008-11-06 14:00             ` Balbir Singh
2008-11-01 18:49 ` [mm] [PATCH 4/4] Memory cgroup hierarchy feature selector Balbir Singh
2008-11-02  5:38   ` KAMEZAWA Hiroyuki
2008-11-02  6:03     ` Balbir Singh
2008-11-02  6:24       ` KAMEZAWA Hiroyuki
2008-11-02 15:52         ` Balbir Singh
2008-11-04  6:37           ` Paul Menage
2008-11-06  7:00             ` Balbir Singh
2008-11-06  7:01               ` Balbir Singh
2008-11-06  6:56         ` Balbir Singh
2008-11-06  7:30           ` KAMEZAWA Hiroyuki
2008-11-04  0:15 ` [mm][PATCH 0/4] Memory cgroup hierarchy introduction KAMEZAWA Hiroyuki
2008-11-05 13:51   ` Balbir Singh
2008-11-05 16:33     ` KAMEZAWA Hiroyuki
2008-11-05 17:52       ` Balbir Singh
2008-11-06  0:22         ` KAMEZAWA Hiroyuki
2008-11-04  9:21 ` [patch 1/2] memcg: hierarchy, yet another one KAMEZAWA Hiroyuki
2008-11-04  9:25 ` KAMEZAWA Hiroyuki
