LKML Archive on lore.kernel.org
* [RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v3)
@ 2008-11-11 12:33 Balbir Singh
  2008-11-11 12:33 ` [RFC][mm] [PATCH 1/4] Memory cgroup hierarchy documentation (v3) Balbir Singh
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Balbir Singh @ 2008-11-11 12:33 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki

This patchset follows several iterations of the memory controller hierarchy
patches: the hardwall approach by Kamezawa-San[1] and version 1 of this
patchset at [2].

The current approach is based on [2] and has the following properties:

1. Hierarchies are very natural in a filesystem like the cgroup filesystem;
   deep hierarchies have been supported in filesystems for a long time.
   When the feature is turned on, we honor hierarchies such that the root
   accounts for the resource usage of all its children and limits can be set
   at any point in the hierarchy. Any memory cgroup is limited by the limits
   along its path to the root: the total usage of all children of a node
   cannot exceed the limit of the node (a rough sketch of this rule follows
   this list).
2. The hierarchy feature is selectable and off by default
3. Hierarchies are expensive; the trade-off is depth versus performance.
   Hierarchies can also be completely turned off.
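
As a rough illustration of the rule in property 1, here is a minimal
user-space sketch of a charge having to fit within the limit of every
ancestor. This is not the kernel code; the struct and function names below
are made up for the example:

#include <stddef.h>

/* Illustrative only: a cgroup-like node with a usage and a limit. */
struct node {
	unsigned long usage;	/* own usage plus that of all descendants */
	unsigned long limit;
	struct node *parent;
};

/*
 * A charge of 'val' against 'n' is allowed only if it fits within the
 * limit of n and of every ancestor up to the root.
 */
static int charge_allowed(struct node *n, unsigned long val)
{
	struct node *c;

	for (c = n; c != NULL; c = c->parent)
		if (c->usage + val > c->limit)
			return 0;	/* some ancestor would exceed its limit */
	return 1;
}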

The patches are against 2.6.28-rc2-mm1 and were tested in a KVM instance
with SMP and swap turned on.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>

v3..v2
======
1. Hierarchy selection logic now allows use_hierarchy changes only if the
   parent's use_hierarchy is set to 0 and there are no children
2. last_scanned_child is protected by cgroup_lock()
3. cgroup_lock() is released before lru_add_drain_all() in
   mem_cgroup_force_empty()

v2..v1
======
1. The hierarchy is now selectable per-subtree
2. The features file has been renamed to use_hierarchy
3. Reclaim now holds the cgroup lock and does a recursive walk and reclaim

Acknowledgements
----------------

Thanks for the feedback from Li Zefan, Kamezawa Hiroyuki, Paul Menage and
others.

Series
------

memcg-hierarchy-documentation.patch
resource-counters-hierarchy-support.patch
memcg-hierarchical-reclaim.patch
memcg-add-hierarchy-selector.patch

Reviews? Comments?

References

1. http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-06/msg05417.html
2. http://kerneltrap.org/mailarchive/linux-kernel/2008/4/19/1513644/thread

-- 
	Balbir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC][mm] [PATCH 1/4] Memory cgroup hierarchy documentation (v3)
  2008-11-11 12:33 [RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v3) Balbir Singh
@ 2008-11-11 12:33 ` Balbir Singh
  2008-11-11 12:34 ` [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v3) Balbir Singh
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Balbir Singh @ 2008-11-11 12:33 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki



Documentation updates for hierarchy support

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/controllers/memory.txt |   37 +++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff -puN Documentation/controllers/memory.txt~memcg-hierarchy-documentation Documentation/controllers/memory.txt
--- linux-2.6.28-rc2/Documentation/controllers/memory.txt~memcg-hierarchy-documentation	2008-11-11 17:51:54.000000000 +0530
+++ linux-2.6.28-rc2-balbir/Documentation/controllers/memory.txt	2008-11-11 17:51:54.000000000 +0530
@@ -245,6 +245,43 @@ cgroup might have some charge associated
 tasks have migrated away from it. Such charges are automatically dropped at
 rmdir() if there are no tasks.
 
+5. Hierarchy support
+
+The memory controller supports a deep hierarchy and hierarchical accounting.
+The hierarchy is created by creating the appropriate cgroups in the
+cgroup filesystem. Consider for example, the following cgroup filesystem
+hierarchy
+
+		root
+	     /  |   \
+           /	|    \
+	  a	b	c
+			| \
+			|  \
+			d   e
+
+In the diagram above, with hierarchical accounting enabled, all memory
+usage of e is accounted to its ancestors up to the root (i.e., c and root)
+that have memory.use_hierarchy enabled.  If one of the ancestors goes over its
+limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
+children of the ancestor.
+
+5.1 Enabling hierarchical accounting and reclaim
+
+The memory controller disables the hierarchy feature by default. Support
+can be enabled by writing 1 to the memory.use_hierarchy file of the root cgroup
+
+# echo 1 > memory.use_hierarchy
+
+The feature can be disabled by
+
+# echo 0 > memory.use_hierarchy
+
+NOTE1: Enabling/disabling will fail if the cgroup already has other
+cgroups created below it.
+
+NOTE2: This feature can be enabled/disabled per subtree.
+
 5. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
_

-- 
	Balbir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v3)
  2008-11-11 12:33 [RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v3) Balbir Singh
  2008-11-11 12:33 ` [RFC][mm] [PATCH 1/4] Memory cgroup hierarchy documentation (v3) Balbir Singh
@ 2008-11-11 12:34 ` Balbir Singh
  2008-11-11 12:34 ` [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3) Balbir Singh
  2008-11-11 12:34 ` [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v3) Balbir Singh
  3 siblings, 0 replies; 20+ messages in thread
From: Balbir Singh @ 2008-11-11 12:34 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki


Add support for building hierarchies in resource counters. Cgroups allow us
to build a deep hierarchy, but we currently don't link the resource counters
belonging to the memory controller's control groups in the same fashion as
the corresponding cgroup entries in the cgroup hierarchy. This patch provides
the infrastructure for resource counters that have the same hierarchy as
their cgroup counterparts.

This set of patches is based on the resource counter hierarchy patches posted
by Pavel Emelianov.

NOTE: Building hierarchies is expensive; deeper hierarchies imply charging
all the way up to the root. Since hierarchies are known to be expensive, the
user needs to be careful and aware of the trade-offs before creating very
deep ones.
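
As a rough user-space model of the charge/undo protocol this patch adds,
including the new limit_fail_at out-parameter (the names below are
illustrative; the real kernel code is in the diff that follows):

#include <stdio.h>

struct counter {
	const char *name;
	unsigned long usage, limit;
	struct counter *parent;
};

/*
 * Charge 'val' against 'cnt' and all of its ancestors. On failure, undo
 * the partial charges and report the counter whose limit was hit through
 * *limit_fail_at, so the caller knows which level to reclaim from.
 */
static int counter_charge(struct counter *cnt, unsigned long val,
			  struct counter **limit_fail_at)
{
	struct counter *c, *u;

	*limit_fail_at = NULL;
	for (c = cnt; c != NULL; c = c->parent) {
		if (c->usage + val > c->limit) {
			*limit_fail_at = c;
			goto undo;
		}
		c->usage += val;
	}
	return 0;
undo:
	for (u = cnt; u != c; u = u->parent)
		u->usage -= val;	/* roll back partial charges */
	return -1;
}

int main(void)
{
	struct counter root  = { "root",  0, ~0UL, NULL };
	struct counter child = { "child", 0, 64,   &root };
	struct counter *fail_at;

	counter_charge(&child, 48, &fail_at);		/* fits */
	if (counter_charge(&child, 32, &fail_at))	/* 80 > 64: fails */
		printf("over limit at %s\n", fail_at->name);
	return 0;
}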


Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/res_counter.h |    8 ++++++--
 kernel/res_counter.c        |   42 ++++++++++++++++++++++++++++++++++--------
 mm/memcontrol.c             |    9 ++++++---
 3 files changed, 46 insertions(+), 13 deletions(-)

diff -puN include/linux/res_counter.h~resource-counters-hierarchy-support include/linux/res_counter.h
--- linux-2.6.28-rc2/include/linux/res_counter.h~resource-counters-hierarchy-support	2008-11-11 17:51:55.000000000 +0530
+++ linux-2.6.28-rc2-balbir/include/linux/res_counter.h	2008-11-11 17:51:55.000000000 +0530
@@ -43,6 +43,10 @@ struct res_counter {
 	 * the routines below consider this to be IRQ-safe
 	 */
 	spinlock_t lock;
+	/*
+	 * Parent counter, used for hierarchical resource accounting
+	 */
+	struct res_counter *parent;
 };
 
 /**
@@ -87,7 +91,7 @@ enum {
  * helpers for accounting
  */
 
-void res_counter_init(struct res_counter *counter);
+void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 
 /*
  * charge - try to consume more resource.
@@ -103,7 +107,7 @@ void res_counter_init(struct res_counter
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
 int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val);
+		unsigned long val, struct res_counter **limit_fail_at);
 
 /*
  * uncharge - tell that some portion of the resource is released
diff -puN kernel/res_counter.c~resource-counters-hierarchy-support kernel/res_counter.c
--- linux-2.6.28-rc2/kernel/res_counter.c~resource-counters-hierarchy-support	2008-11-11 17:51:55.000000000 +0530
+++ linux-2.6.28-rc2-balbir/kernel/res_counter.c	2008-11-11 17:51:55.000000000 +0530
@@ -15,10 +15,11 @@
 #include <linux/uaccess.h>
 #include <linux/mm.h>
 
-void res_counter_init(struct res_counter *counter)
+void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 {
 	spin_lock_init(&counter->lock);
 	counter->limit = (unsigned long long)LLONG_MAX;
+	counter->parent = parent;
 }
 
 int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -34,14 +35,34 @@ int res_counter_charge_locked(struct res
 	return 0;
 }
 
-int res_counter_charge(struct res_counter *counter, unsigned long val)
+int res_counter_charge(struct res_counter *counter, unsigned long val,
+			struct res_counter **limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
+	struct res_counter *c, *u;
 
-	spin_lock_irqsave(&counter->lock, flags);
-	ret = res_counter_charge_locked(counter, val);
-	spin_unlock_irqrestore(&counter->lock, flags);
+	*limit_fail_at = NULL;
+	local_irq_save(flags);
+	for (c = counter; c != NULL; c = c->parent) {
+		spin_lock(&c->lock);
+		ret = res_counter_charge_locked(c, val);
+		spin_unlock(&c->lock);
+		if (ret < 0) {
+			*limit_fail_at = c;
+			goto undo;
+		}
+	}
+	ret = 0;
+	goto done;
+undo:
+	for (u = counter; u != c; u = u->parent) {
+		spin_lock(&u->lock);
+		res_counter_uncharge_locked(u, val);
+		spin_unlock(&u->lock);
+	}
+done:
+	local_irq_restore(flags);
 	return ret;
 }
 
@@ -56,10 +77,15 @@ void res_counter_uncharge_locked(struct 
 void res_counter_uncharge(struct res_counter *counter, unsigned long val)
 {
 	unsigned long flags;
+	struct res_counter *c;
 
-	spin_lock_irqsave(&counter->lock, flags);
-	res_counter_uncharge_locked(counter, val);
-	spin_unlock_irqrestore(&counter->lock, flags);
+	local_irq_save(flags);
+	for (c = counter; c != NULL; c = c->parent) {
+		spin_lock(&c->lock);
+		res_counter_uncharge_locked(c, val);
+		spin_unlock(&c->lock);
+	}
+	local_irq_restore(flags);
 }
 
 
diff -puN mm/memcontrol.c~resource-counters-hierarchy-support mm/memcontrol.c
--- linux-2.6.28-rc2/mm/memcontrol.c~resource-counters-hierarchy-support	2008-11-11 17:51:55.000000000 +0530
+++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-11 17:51:55.000000000 +0530
@@ -485,6 +485,7 @@ int mem_cgroup_try_charge(struct mm_stru
 {
 	struct mem_cgroup *mem;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct res_counter *fail_res;
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
@@ -510,7 +511,7 @@ int mem_cgroup_try_charge(struct mm_stru
 	}
 
 
-	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
+	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
 		if (!(gfp_mask & __GFP_WAIT))
 			goto nomem;
 
@@ -1175,18 +1176,20 @@ static void mem_cgroup_free(struct mem_c
 static struct cgroup_subsys_state *
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
-	struct mem_cgroup *mem;
+	struct mem_cgroup *mem, *parent;
 	int node;
 
 	if (unlikely((cont->parent) == NULL)) {
 		mem = &init_mem_cgroup;
+		parent = NULL;
 	} else {
 		mem = mem_cgroup_alloc();
+		parent = mem_cgroup_from_cont(cont->parent);
 		if (!mem)
 			return ERR_PTR(-ENOMEM);
 	}
 
-	res_counter_init(&mem->res);
+	res_counter_init(&mem->res, parent ? &parent->res : NULL);
 
 	for_each_node_state(node, N_POSSIBLE)
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
_

-- 
	Balbir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3)
  2008-11-11 12:33 [RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v3) Balbir Singh
  2008-11-11 12:33 ` [RFC][mm] [PATCH 1/4] Memory cgroup hierarchy documentation (v3) Balbir Singh
  2008-11-11 12:34 ` [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v3) Balbir Singh
@ 2008-11-11 12:34 ` Balbir Singh
  2008-11-12  3:52   ` KAMEZAWA Hiroyuki
  2008-11-12  5:02   ` KAMEZAWA Hiroyuki
  2008-11-11 12:34 ` [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v3) Balbir Singh
  3 siblings, 2 replies; 20+ messages in thread
From: Balbir Singh @ 2008-11-11 12:34 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki


This patch introduces hierarchical reclaim. When an ancestor goes over its
limit, the charging routine points to the parent that is above its limit.
The reclaim process then starts from the last scanned child of the ancestor
and reclaims until the ancestor goes below its limit.
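
A simplified user-space model of the walk this patch implements (assumptions:
no cgroup_lock, children kept in a plain array, and reclaim_some() standing
in for try_to_free_mem_cgroup_pages(); the real code is in the diff below):

struct memcg {
	unsigned long usage, limit;
	struct memcg **children;
	int nr_children;
	int last_scanned;	/* index of the last child reclaimed from */
};

static int over_limit(struct memcg *m)
{
	return m->usage > m->limit;
}

/* Stand-in for try_to_free_mem_cgroup_pages(). */
static void reclaim_some(struct memcg *m)
{
	(void)m;
}

/* Reclaim from 'mem' and its subtree until 'root' is back under its limit. */
static void hierarchical_reclaim(struct memcg *mem, struct memcg *root)
{
	int i, start;

	/*
	 * Reclaim here first: there may be leftover charges even after
	 * all tasks have moved away from this group.
	 */
	reclaim_some(mem);
	if (!over_limit(root) || mem->nr_children == 0)
		return;

	/*
	 * Resume from the last scanned child so that no child is penalized
	 * just because of its position in the children list.
	 */
	start = mem->last_scanned;
	for (i = 0; i < mem->nr_children; i++) {
		int idx = (start + i) % mem->nr_children;

		hierarchical_reclaim(mem->children[idx], root);
		mem->last_scanned = idx;
		if (!over_limit(root))
			return;
	}
}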

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 mm/memcontrol.c |  179 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 152 insertions(+), 27 deletions(-)

diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
--- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim	2008-11-11 17:51:56.000000000 +0530
+++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-11 17:51:56.000000000 +0530
@@ -132,6 +132,11 @@ struct mem_cgroup {
 	 * statistics.
 	 */
 	struct mem_cgroup_stat stat;
+	/*
+	 * While reclaiming in a hiearchy, we cache the last child we
+	 * reclaimed from. Protected by cgroup_lock()
+	 */
+	struct mem_cgroup *last_scanned_child;
 };
 static struct mem_cgroup init_mem_cgroup;
 
@@ -467,6 +472,126 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
+static struct mem_cgroup *
+mem_cgroup_from_res_counter(struct res_counter *counter)
+{
+	return container_of(counter, struct mem_cgroup, res);
+}
+
+/*
+ * Dance down the hierarchy if needed to reclaim memory. We remember the
+ * last child we reclaimed from, so that we don't end up penalizing
+ * one child extensively based on its position in the children list.
+ *
+ * root_mem is the original ancestor that we've been reclaim from.
+ */
+static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
+						struct mem_cgroup *root_mem,
+						gfp_t gfp_mask)
+{
+	struct cgroup *cg_current, *cgroup;
+	struct mem_cgroup *mem_child;
+	int ret = 0;
+
+	/*
+	 * Reclaim unconditionally and don't check for return value.
+	 * We need to reclaim in the current group and down the tree.
+	 * One might think about checking for children before reclaiming,
+	 * but there might be left over accounting, even after children
+	 * have left.
+	 */
+	try_to_free_mem_cgroup_pages(mem, gfp_mask);
+
+	if (res_counter_check_under_limit(&root_mem->res))
+		return 0;
+
+	cgroup_lock();
+
+	if (list_empty(&mem->css.cgroup->children)) {
+		cgroup_unlock();
+		return 0;
+	}
+
+	/*
+	 * Scan all children under the mem_cgroup mem
+	 */
+	if (!mem->last_scanned_child)
+		cgroup = list_first_entry(&mem->css.cgroup->children,
+				struct cgroup, sibling);
+	else
+		cgroup = mem->last_scanned_child->css.cgroup;
+
+	cg_current = cgroup;
+
+	do {
+		struct list_head *next;
+
+		mem_child = mem_cgroup_from_cont(cgroup);
+		cgroup_unlock();
+
+		ret = mem_cgroup_hierarchical_reclaim(mem_child, root_mem,
+							gfp_mask);
+		cgroup_lock();
+		mem->last_scanned_child = mem_child;
+		if (res_counter_check_under_limit(&root_mem->res)) {
+			ret = 0;
+			goto done;
+		}
+
+		/*
+		 * Since we gave up the lock, it is time to
+		 * start from last cgroup
+		 */
+		cgroup = mem->last_scanned_child->css.cgroup;
+		next = cgroup->sibling.next;
+
+		if (next == &cg_current->parent->children)
+			cgroup = list_first_entry(&mem->css.cgroup->children,
+							struct cgroup, sibling);
+		else
+			cgroup = container_of(next, struct cgroup, sibling);
+	} while (cgroup != cg_current);
+
+done:
+	cgroup_unlock();
+	return ret;
+}
+
+/*
+ * Charge memory cgroup mem and check if it is over its limit. If so, reclaim
+ * from mem.
+ */
+static int mem_cgroup_charge_and_reclaim(struct mem_cgroup *mem, gfp_t gfp_mask)
+{
+	int ret = 0;
+	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct res_counter *fail_res;
+	struct mem_cgroup *mem_over_limit;
+
+	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
+		if (!(gfp_mask & __GFP_WAIT))
+			goto out;
+
+		/*
+		 * Is one of our ancestors over their limit?
+		 */
+		if (fail_res)
+			mem_over_limit = mem_cgroup_from_res_counter(fail_res);
+		else
+			mem_over_limit = mem;
+
+		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit,
+							mem_over_limit,
+							gfp_mask);
+
+		if (!nr_retries--) {
+			mem_cgroup_out_of_memory(mem, gfp_mask);
+			goto out;
+		}
+	}
+out:
+	return ret;
+}
 
 /**
  * mem_cgroup_try_charge - get charge of PAGE_SIZE.
@@ -484,8 +609,7 @@ int mem_cgroup_try_charge(struct mm_stru
 			gfp_t gfp_mask, struct mem_cgroup **memcg)
 {
 	struct mem_cgroup *mem;
-	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct res_counter *fail_res;
+
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
@@ -510,29 +634,9 @@ int mem_cgroup_try_charge(struct mm_stru
 		css_get(&mem->css);
 	}
 
+	if (mem_cgroup_charge_and_reclaim(mem, gfp_mask))
+		goto nomem;
 
-	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
-		if (!(gfp_mask & __GFP_WAIT))
-			goto nomem;
-
-		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
-			continue;
-
-		/*
-		 * try_to_free_mem_cgroup_pages() might not give us a full
-		 * picture of reclaim. Some pages are reclaimed and might be
-		 * moved to swap cache or just unmapped from the cgroup.
-		 * Check the limit again to see if the reclaim reduced the
-		 * current usage of the cgroup before giving up
-		 */
-		if (res_counter_check_under_limit(&mem->res))
-			continue;
-
-		if (!nr_retries--) {
-			mem_cgroup_out_of_memory(mem, gfp_mask);
-			goto nomem;
-		}
-	}
 	return 0;
 nomem:
 	css_put(&mem->css);
@@ -945,7 +1049,7 @@ static void mem_cgroup_force_empty_list(
  * make mem_cgroup's charge to be 0 if there is no task.
  * This enables deleting this mem_cgroup.
  */
-static int mem_cgroup_force_empty(struct mem_cgroup *mem)
+static int mem_cgroup_force_empty(struct mem_cgroup *mem, bool cgroup_locked)
 {
 	int ret = -EBUSY;
 	int node, zid;
@@ -959,8 +1063,20 @@ static int mem_cgroup_force_empty(struct
 	while (mem->res.usage > 0) {
 		if (atomic_read(&mem->css.cgroup->count) > 0)
 			goto out;
+
+		/*
+		 * We need to give up the cgroup lock if it is held, since
+		 * it creates the potential for deadlock. cgroup_mutex should
+		 * be acquired after cpu_hotplug lock. In this path, we
+		 * acquire the cpu_hotplug lock after acquiring the cgroup_mutex
+		 * Giving it up should be OK
+		 */
+		if (cgroup_locked)
+			cgroup_unlock();
 		/* This is for making all *used* pages to be on LRU. */
 		lru_add_drain_all();
+		if (cgroup_locked)
+			cgroup_lock();
 		for_each_node_state(node, N_POSSIBLE)
 			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 				struct mem_cgroup_per_zone *mz;
@@ -1025,7 +1141,7 @@ static int mem_cgroup_reset(struct cgrou
 
 static int mem_force_empty_write(struct cgroup *cont, unsigned int event)
 {
-	return mem_cgroup_force_empty(mem_cgroup_from_cont(cont));
+	return mem_cgroup_force_empty(mem_cgroup_from_cont(cont), false);
 }
 
 static const struct mem_cgroup_stat_desc {
@@ -1195,6 +1311,8 @@ mem_cgroup_create(struct cgroup_subsys *
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
 			goto free_out;
 
+	mem->last_scanned_child = NULL;
+
 	return &mem->css;
 free_out:
 	for_each_node_state(node, N_POSSIBLE)
@@ -1208,7 +1326,7 @@ static void mem_cgroup_pre_destroy(struc
 					struct cgroup *cont)
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
-	mem_cgroup_force_empty(mem);
+	mem_cgroup_force_empty(mem, true);
 }
 
 static void mem_cgroup_destroy(struct cgroup_subsys *ss,
@@ -1217,6 +1335,13 @@ static void mem_cgroup_destroy(struct cg
 	int node;
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 
+	if (cont->parent) {
+		struct mem_cgroup *parent_mem =
+			mem_cgroup_from_cont(cont->parent);
+		if (parent_mem->last_scanned_child == mem)
+			parent_mem->last_scanned_child = NULL;
+	}
+
 	for_each_node_state(node, N_POSSIBLE)
 		free_mem_cgroup_per_zone_info(mem, node);
 
_

-- 
	Balbir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v3)
  2008-11-11 12:33 [RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v3) Balbir Singh
                   ` (2 preceding siblings ...)
  2008-11-11 12:34 ` [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3) Balbir Singh
@ 2008-11-11 12:34 ` Balbir Singh
  2008-11-13  1:28   ` Li Zefan
  2008-11-13  1:39   ` Li Zefan
  3 siblings, 2 replies; 20+ messages in thread
From: Balbir Singh @ 2008-11-11 12:34 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki


Don't enable multiple hierarchy support by default. This patch introduces
a use_hierarchy selector that can be set to enable the nested hierarchy
feature. The feature can only be enabled when the cgroup for which it is
being enabled has no children.
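
A small sketch of the intended rule enforced on writes to
memory.use_hierarchy (user-space model only; the actual handler is
mem_cgroup_hierarchy_write() in the diff below):

#include <errno.h>

/* 'parent_use_hierarchy' is 0 for the root cgroup, which has no parent. */
static int use_hierarchy_write_allowed(int parent_use_hierarchy,
				       int has_children, unsigned long val)
{
	if (val != 0 && val != 1)
		return -EINVAL;		/* only 0 or 1 make sense */
	if (parent_use_hierarchy)
		return -EINVAL;		/* inherited from the parent; fixed */
	if (has_children)
		return -EBUSY;		/* children already exist below us */
	return 0;
}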

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 mm/memcontrol.c |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff -puN mm/memcontrol.c~memcg-add-hierarchy-selector mm/memcontrol.c
--- linux-2.6.28-rc2/mm/memcontrol.c~memcg-add-hierarchy-selector	2008-11-11 17:51:57.000000000 +0530
+++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-11 17:51:57.000000000 +0530
@@ -137,6 +137,11 @@ struct mem_cgroup {
 	 * reclaimed from. Protected by cgroup_lock()
 	 */
 	struct mem_cgroup *last_scanned_child;
+	/*
+	 * Should the accounting and control be hierarchical, per subtree?
+	 */
+	unsigned long use_hierarchy;
+
 };
 static struct mem_cgroup init_mem_cgroup;
 
@@ -1093,6 +1098,42 @@ out:
 	return ret;
 }
 
+static u64 mem_cgroup_hierarchy_read(struct cgroup *cont, struct cftype *cft)
+{
+	return mem_cgroup_from_cont(cont)->use_hierarchy;
+}
+
+static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
+					u64 val)
+{
+	int retval = 0;
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	struct cgroup *parent = cont->parent;
+	struct mem_cgroup *parent_mem = NULL;
+
+	if (parent)
+		parent_mem = mem_cgroup_from_cont(parent);
+
+	/*
+	 * If parent's use_hierarchy is set, we can't make any modifications
+	 * in the child subtrees. If it is unset, then the change can
+	 * occur, provided the current cgroup has no children.
+	 *
+	 * For the root cgroup, parent_mem is NULL, we allow value to be
+	 * set if there are no children.
+	 */
+	if (!parent_mem || (!parent_mem->use_hierarchy &&
+				(val == 1 || val == 0))) {
+		if (list_empty(&cont->children))
+			mem->use_hierarchy = val;
+		else
+			retval = -EBUSY;
+	} else
+		retval = -EINVAL;
+
+	return retval;
+}
+
 static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 {
 	return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
@@ -1227,6 +1268,11 @@ static struct cftype mem_cgroup_files[] 
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
+	{
+		.name = "use_hierarchy",
+		.write_u64 = mem_cgroup_hierarchy_write,
+		.read_u64 = mem_cgroup_hierarchy_read,
+	},
 };
 
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1303,9 +1349,13 @@ mem_cgroup_create(struct cgroup_subsys *
 		parent = mem_cgroup_from_cont(cont->parent);
 		if (!mem)
 			return ERR_PTR(-ENOMEM);
+		mem->use_hierarchy = parent->use_hierarchy;
 	}
 
-	res_counter_init(&mem->res, parent ? &parent->res : NULL);
+	if (parent && parent->use_hierarchy)
+		res_counter_init(&mem->res, &parent->res);
+	else
+		res_counter_init(&mem->res, NULL);
 
 	for_each_node_state(node, N_POSSIBLE)
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
_

-- 
	Balbir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3)
  2008-11-11 12:34 ` [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3) Balbir Singh
@ 2008-11-12  3:52   ` KAMEZAWA Hiroyuki
  2008-11-12  4:00     ` Balbir Singh
  2008-11-12  5:02   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-12  3:52 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Tue, 11 Nov 2008 18:04:17 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> +
> +		/*
> +		 * We need to give up the cgroup lock if it is held, since
> +		 * it creates the potential for deadlock. cgroup_mutex should
> +		 * be acquired after cpu_hotplug lock. In this path, we
> +		 * acquire the cpu_hotplug lock after acquiring the cgroup_mutex
> +		 * Giving it up should be OK
> +		 */
> +		if (cgroup_locked)
> +			cgroup_unlock();

nice catch. I'll post a fix to this as its own patch. 

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3)
  2008-11-12  3:52   ` KAMEZAWA Hiroyuki
@ 2008-11-12  4:00     ` Balbir Singh
  0 siblings, 0 replies; 20+ messages in thread
From: Balbir Singh @ 2008-11-12  4:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Tue, 11 Nov 2008 18:04:17 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> +
>> +		/*
>> +		 * We need to give up the cgroup lock if it is held, since
>> +		 * it creates the potential for deadlock. cgroup_mutex should
>> +		 * be acquired after cpu_hotplug lock. In this path, we
>> +		 * acquire the cpu_hotplug lock after acquiring the cgroup_mutex
>> +		 * Giving it up should be OK
>> +		 */
>> +		if (cgroup_locked)
>> +			cgroup_unlock();
> 
> nice catch. I'll post a fix to this as its own patch. 

Sure, feel free to add my signed-off-by on it.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3)
  2008-11-11 12:34 ` [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3) Balbir Singh
  2008-11-12  3:52   ` KAMEZAWA Hiroyuki
@ 2008-11-12  5:02   ` KAMEZAWA Hiroyuki
  2008-11-12  5:49     ` Balbir Singh
  1 sibling, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-12  5:02 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Tue, 11 Nov 2008 18:04:17 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> 
> This patch introduces hierarchical reclaim. When an ancestor goes over its
> limit, the charging routine points to the parent that is above its limit.
> The reclaim process then starts from the last scanned child of the ancestor
> and reclaims until the ancestor goes below its limit.
> 

> +/*
> + * Dance down the hierarchy if needed to reclaim memory. We remember the
> + * last child we reclaimed from, so that we don't end up penalizing
> + * one child extensively based on its position in the children list.
> + *
> + * root_mem is the original ancestor that we've been reclaim from.
> + */
> +static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
> +						struct mem_cgroup *root_mem,
> +						gfp_t gfp_mask)
> +{
> +	struct cgroup *cg_current, *cgroup;
> +	struct mem_cgroup *mem_child;
> +	int ret = 0;
> +
> +	/*
> +	 * Reclaim unconditionally and don't check for return value.
> +	 * We need to reclaim in the current group and down the tree.
> +	 * One might think about checking for children before reclaiming,
> +	 * but there might be left over accounting, even after children
> +	 * have left.
> +	 */
> +	try_to_free_mem_cgroup_pages(mem, gfp_mask);
> +
> +	if (res_counter_check_under_limit(&root_mem->res))
> +		return 0;
> +
> +	cgroup_lock();
> +
> +	if (list_empty(&mem->css.cgroup->children)) {
> +		cgroup_unlock();
> +		return 0;
> +	}
> +
> +	/*
> +	 * Scan all children under the mem_cgroup mem
> +	 */
> +	if (!mem->last_scanned_child)
> +		cgroup = list_first_entry(&mem->css.cgroup->children,
> +				struct cgroup, sibling);
> +	else
> +		cgroup = mem->last_scanned_child->css.cgroup;
> +
> +	cg_current = cgroup;
> +
> +	do {
> +		struct list_head *next;
> +
> +		mem_child = mem_cgroup_from_cont(cgroup);
> +		cgroup_unlock();
> +
> +		ret = mem_cgroup_hierarchical_reclaim(mem_child, root_mem,
> +							gfp_mask);
> +		cgroup_lock();
> +		mem->last_scanned_child = mem_child;
> +		if (res_counter_check_under_limit(&root_mem->res)) {
> +			ret = 0;
> +			goto done;
> +		}
> +
> +		/*
> +		 * Since we gave up the lock, it is time to
> +		 * start from last cgroup
> +		 */
> +		cgroup = mem->last_scanned_child->css.cgroup;
> +		next = cgroup->sibling.next;
> +
> +		if (next == &cg_current->parent->children)
> +			cgroup = list_first_entry(&mem->css.cgroup->children,
> +							struct cgroup, sibling);
> +		else
> +			cgroup = container_of(next, struct cgroup, sibling);
> +	} while (cgroup != cg_current);
> +
> +done:
> +	cgroup_unlock();
> +	return ret;
> +}

Hmm, does this function need to be as complex as this?
I'm sorry I don't have enough time to review now (chasing a memory online/offline bug).

But I'm not convinced this is a good way to reclaim in a hierarchical manner.

In the following tree, assume that processes hit the limit of Level_2.

   Level_1 (no limit)
	-> Level_2	(limit=1G)
		-> Level_3_A (usage=30M)
		-> Level_3_B (usage=100M)
			-> Level_4_A (usage=50M)
			-> Level_4_B (usage=400M)
			-> Level_4_C (usage=420M)

Even if we know Level_4_C includes tons of inactive file caches,
some amount of swap-out will occur before reaching Level_4_C.

Can't we do this hierarchical reclaim in another way ?
(start from Level_4_C because we know it has tons of inactive caches.)

This style of recursive call doesn't have a chance to do that kind of optimization.
Can we do this reclaim in a flatter manner, as a loop like the following?
=
try:
  select the most inactive one
	-> try_to_free_memory
		-> check limit
			-> go to try;
==

?
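
One possible reading of the loop above, as a sketch (it assumes a flat list
of descendants and an inactive-page count per group, neither of which exists
in the posted patches):

#include <stddef.h>

/* Illustrative only: what "select the most inactive one" could look like. */
struct victim {
	unsigned long inactive_pages;
	/* ... whatever else identifies the group ... */
};

static void flat_reclaim(struct victim **descendants, int n,
			 int (*still_over_limit)(void),
			 void (*reclaim_from)(struct victim *))
{
	while (still_over_limit()) {
		struct victim *best = NULL;
		int i;

		/* pick the descendant with the most inactive pages */
		for (i = 0; i < n; i++)
			if (!best ||
			    descendants[i]->inactive_pages > best->inactive_pages)
				best = descendants[i];
		if (!best)
			break;		/* nothing left to reclaim from */
		reclaim_from(best);	/* then re-check the limit and retry */
	}
}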

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3)
  2008-11-12  5:02   ` KAMEZAWA Hiroyuki
@ 2008-11-12  5:49     ` Balbir Singh
  2008-11-12  6:01       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 20+ messages in thread
From: Balbir Singh @ 2008-11-12  5:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Tue, 11 Nov 2008 18:04:17 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> This patch introduces hierarchical reclaim. When an ancestor goes over its
>> limit, the charging routine points to the parent that is above its limit.
>> The reclaim process then starts from the last scanned child of the ancestor
>> and reclaims until the ancestor goes below its limit.
>>
> 
>> +/*
>> + * Dance down the hierarchy if needed to reclaim memory. We remember the
>> + * last child we reclaimed from, so that we don't end up penalizing
>> + * one child extensively based on its position in the children list.
>> + *
>> + * root_mem is the original ancestor that we've been reclaim from.
>> + */
>> +static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
>> +						struct mem_cgroup *root_mem,
>> +						gfp_t gfp_mask)
>> +{
>> +	struct cgroup *cg_current, *cgroup;
>> +	struct mem_cgroup *mem_child;
>> +	int ret = 0;
>> +
>> +	/*
>> +	 * Reclaim unconditionally and don't check for return value.
>> +	 * We need to reclaim in the current group and down the tree.
>> +	 * One might think about checking for children before reclaiming,
>> +	 * but there might be left over accounting, even after children
>> +	 * have left.
>> +	 */
>> +	try_to_free_mem_cgroup_pages(mem, gfp_mask);
>> +
>> +	if (res_counter_check_under_limit(&root_mem->res))
>> +		return 0;
>> +
>> +	cgroup_lock();
>> +
>> +	if (list_empty(&mem->css.cgroup->children)) {
>> +		cgroup_unlock();
>> +		return 0;
>> +	}
>> +
>> +	/*
>> +	 * Scan all children under the mem_cgroup mem
>> +	 */
>> +	if (!mem->last_scanned_child)
>> +		cgroup = list_first_entry(&mem->css.cgroup->children,
>> +				struct cgroup, sibling);
>> +	else
>> +		cgroup = mem->last_scanned_child->css.cgroup;
>> +
>> +	cg_current = cgroup;
>> +
>> +	do {
>> +		struct list_head *next;
>> +
>> +		mem_child = mem_cgroup_from_cont(cgroup);
>> +		cgroup_unlock();
>> +
>> +		ret = mem_cgroup_hierarchical_reclaim(mem_child, root_mem,
>> +							gfp_mask);
>> +		cgroup_lock();
>> +		mem->last_scanned_child = mem_child;
>> +		if (res_counter_check_under_limit(&root_mem->res)) {
>> +			ret = 0;
>> +			goto done;
>> +		}
>> +
>> +		/*
>> +		 * Since we gave up the lock, it is time to
>> +		 * start from last cgroup
>> +		 */
>> +		cgroup = mem->last_scanned_child->css.cgroup;
>> +		next = cgroup->sibling.next;
>> +
>> +		if (next == &cg_current->parent->children)
>> +			cgroup = list_first_entry(&mem->css.cgroup->children,
>> +							struct cgroup, sibling);
>> +		else
>> +			cgroup = container_of(next, struct cgroup, sibling);
>> +	} while (cgroup != cg_current);
>> +
>> +done:
>> +	cgroup_unlock();
>> +	return ret;
>> +}
> 
> Hmm, does this function is necessary to be complex as this ?
> I'm sorry I don't have enough time to review now. (chasing memory online/offline bug.)
> 
> But I can't convice this is a good way to reclaim in hierachical manner.
> 
> In following tree, Assume that processes hit limitation of Level_2.
> 
>    Level_1 (no limit)
> 	-> Level_2	(limit=1G)
> 		-> Level_3_A (usage=30M)
> 		-> Level_3_B (usage=100M)
> 			-> Level_4_A (usage=50M)
> 			-> Level_4_B (usage=400M)
> 			-> Level_4_C (usage=420M)
> 
> Even if we know Level_4_C incudes tons of Inactive file caches,
> some amount of swap-out will occur until reachin Level_4_C.
> 
> Can't we do this hierarchical reclaim in another way ?
> (start from Level_4_C because we know it has tons of inactive caches.)
> 
> This style of recursive call doesn't have chance to do kind of optimization.
> Can we do this reclaim in more flat manner as loop like following
> =
> try:
>   select the most inactive one
> 	-> try_to_fre_memory
> 		-> check limit
> 			-> go to try;
> ==
> 

I've been thinking along those lines as well, and that will get more important
as we try to implement soft limits. However, for the current version I wanted
correctness. Fairness, from what I've seen, is achieved, since groups with a
large number of inactive pages do get reclaimed from more than others (in my
simple experiments).

As far as the pseudo code is concerned, selecting the most inactive one is an
O(c) operation, where c is the number of nodes under the subtree, and is
expensive. The data structure and selection algorithm get expensive. I am
thinking about a more suitable approach for implementation, but I want to
focus on correctness as the first step. Since the hierarchy is not enabled by
default, I am not adding any additional overhead, so I think that this
approach is suitable.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3)
  2008-11-12  5:49     ` Balbir Singh
@ 2008-11-12  6:01       ` KAMEZAWA Hiroyuki
  2008-11-12  6:10         ` Balbir Singh
  0 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-12  6:01 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Wed, 12 Nov 2008 11:19:37 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Tue, 11 Nov 2008 18:04:17 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> >> This patch introduces hierarchical reclaim. When an ancestor goes over its
> >> limit, the charging routine points to the parent that is above its limit.
> >> The reclaim process then starts from the last scanned child of the ancestor
> >> and reclaims until the ancestor goes below its limit.
> >>
> > 
> >> +/*
> >> + * Dance down the hierarchy if needed to reclaim memory. We remember the
> >> + * last child we reclaimed from, so that we don't end up penalizing
> >> + * one child extensively based on its position in the children list.
> >> + *
> >> + * root_mem is the original ancestor that we've been reclaim from.
> >> + */
> >> +static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
> >> +						struct mem_cgroup *root_mem,
> >> +						gfp_t gfp_mask)
> >> +{
> >> +	struct cgroup *cg_current, *cgroup;
> >> +	struct mem_cgroup *mem_child;
> >> +	int ret = 0;
> >> +
> >> +	/*
> >> +	 * Reclaim unconditionally and don't check for return value.
> >> +	 * We need to reclaim in the current group and down the tree.
> >> +	 * One might think about checking for children before reclaiming,
> >> +	 * but there might be left over accounting, even after children
> >> +	 * have left.
> >> +	 */
> >> +	try_to_free_mem_cgroup_pages(mem, gfp_mask);
> >> +
> >> +	if (res_counter_check_under_limit(&root_mem->res))
> >> +		return 0;
> >> +
> >> +	cgroup_lock();
> >> +
> >> +	if (list_empty(&mem->css.cgroup->children)) {
> >> +		cgroup_unlock();
> >> +		return 0;
> >> +	}
> >> +
> >> +	/*
> >> +	 * Scan all children under the mem_cgroup mem
> >> +	 */
> >> +	if (!mem->last_scanned_child)
> >> +		cgroup = list_first_entry(&mem->css.cgroup->children,
> >> +				struct cgroup, sibling);
> >> +	else
> >> +		cgroup = mem->last_scanned_child->css.cgroup;
> >> +
> >> +	cg_current = cgroup;
> >> +
> >> +	do {
> >> +		struct list_head *next;
> >> +
> >> +		mem_child = mem_cgroup_from_cont(cgroup);
> >> +		cgroup_unlock();
> >> +
> >> +		ret = mem_cgroup_hierarchical_reclaim(mem_child, root_mem,
> >> +							gfp_mask);
> >> +		cgroup_lock();
> >> +		mem->last_scanned_child = mem_child;
> >> +		if (res_counter_check_under_limit(&root_mem->res)) {
> >> +			ret = 0;
> >> +			goto done;
> >> +		}
> >> +
> >> +		/*
> >> +		 * Since we gave up the lock, it is time to
> >> +		 * start from last cgroup
> >> +		 */
> >> +		cgroup = mem->last_scanned_child->css.cgroup;
> >> +		next = cgroup->sibling.next;
> >> +
> >> +		if (next == &cg_current->parent->children)
> >> +			cgroup = list_first_entry(&mem->css.cgroup->children,
> >> +							struct cgroup, sibling);
> >> +		else
> >> +			cgroup = container_of(next, struct cgroup, sibling);
> >> +	} while (cgroup != cg_current);
> >> +
> >> +done:
> >> +	cgroup_unlock();
> >> +	return ret;
> >> +}
> > 
> > Hmm, does this function is necessary to be complex as this ?
> > I'm sorry I don't have enough time to review now. (chasing memory online/offline bug.)
> > 
> > But I can't convice this is a good way to reclaim in hierachical manner.
> > 
> > In following tree, Assume that processes hit limitation of Level_2.
> > 
> >    Level_1 (no limit)
> > 	-> Level_2	(limit=1G)
> > 		-> Level_3_A (usage=30M)
> > 		-> Level_3_B (usage=100M)
> > 			-> Level_4_A (usage=50M)
> > 			-> Level_4_B (usage=400M)
> > 			-> Level_4_C (usage=420M)
> > 
> > Even if we know Level_4_C incudes tons of Inactive file caches,
> > some amount of swap-out will occur until reachin Level_4_C.
> > 
> > Can't we do this hierarchical reclaim in another way ?
> > (start from Level_4_C because we know it has tons of inactive caches.)
> > 
> > This style of recursive call doesn't have chance to do kind of optimization.
> > Can we do this reclaim in more flat manner as loop like following
> > =
> > try:
> >   select the most inactive one
> > 	-> try_to_fre_memory
> > 		-> check limit
> > 			-> go to try;
> > ==
> > 
> 
> I've been thinking along those lines as well and that will get more important as
> we try to implement soft limits. However, for the current version I wanted
> correctness. Fairness, I've seen is achieved, since groups with large number of
> inactive pages, does get reclaimed from more than others (in my simple
> experiments).
> 
> As far the pseudo code is concerned, select the most inactive one is an O(c)
> operation, where c is the number of nodes under the subtree and is expensive.
> The data structure and select algorithm get expensive. I am thinking about a
> more suitable approach for implementation, but I want to focus on correctness as
> the first step. Since the hierarchy is not enabled by default, I am not adding
> any additional overhead, so I think that this approach is suitable.
> 
What I say here is not "implement fairness" but "please make this algorithm
easy to update." If you'll implement soft limits, please design this code to
be easily reused. (Again, I don't say do it now, but please make the code
simpler.)

Can you make this code iterative rather than recursive?

I don't like this kind of recursive call with complex lock/unlock.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3)
  2008-11-12  6:01       ` KAMEZAWA Hiroyuki
@ 2008-11-12  6:10         ` Balbir Singh
  2008-11-12  6:12           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 20+ messages in thread
From: Balbir Singh @ 2008-11-12  6:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Wed, 12 Nov 2008 11:19:37 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> KAMEZAWA Hiroyuki wrote:
>>> On Tue, 11 Nov 2008 18:04:17 +0530
>>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>>
>>>> This patch introduces hierarchical reclaim. When an ancestor goes over its
>>>> limit, the charging routine points to the parent that is above its limit.
>>>> The reclaim process then starts from the last scanned child of the ancestor
>>>> and reclaims until the ancestor goes below its limit.
>>>>
>>>> +/*
>>>> + * Dance down the hierarchy if needed to reclaim memory. We remember the
>>>> + * last child we reclaimed from, so that we don't end up penalizing
>>>> + * one child extensively based on its position in the children list.
>>>> + *
>>>> + * root_mem is the original ancestor that we've been reclaim from.
>>>> + */
>>>> +static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
>>>> +						struct mem_cgroup *root_mem,
>>>> +						gfp_t gfp_mask)
>>>> +{
>>>> +	struct cgroup *cg_current, *cgroup;
>>>> +	struct mem_cgroup *mem_child;
>>>> +	int ret = 0;
>>>> +
>>>> +	/*
>>>> +	 * Reclaim unconditionally and don't check for return value.
>>>> +	 * We need to reclaim in the current group and down the tree.
>>>> +	 * One might think about checking for children before reclaiming,
>>>> +	 * but there might be left over accounting, even after children
>>>> +	 * have left.
>>>> +	 */
>>>> +	try_to_free_mem_cgroup_pages(mem, gfp_mask);
>>>> +
>>>> +	if (res_counter_check_under_limit(&root_mem->res))
>>>> +		return 0;
>>>> +
>>>> +	cgroup_lock();
>>>> +
>>>> +	if (list_empty(&mem->css.cgroup->children)) {
>>>> +		cgroup_unlock();
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Scan all children under the mem_cgroup mem
>>>> +	 */
>>>> +	if (!mem->last_scanned_child)
>>>> +		cgroup = list_first_entry(&mem->css.cgroup->children,
>>>> +				struct cgroup, sibling);
>>>> +	else
>>>> +		cgroup = mem->last_scanned_child->css.cgroup;
>>>> +
>>>> +	cg_current = cgroup;
>>>> +
>>>> +	do {
>>>> +		struct list_head *next;
>>>> +
>>>> +		mem_child = mem_cgroup_from_cont(cgroup);
>>>> +		cgroup_unlock();
>>>> +
>>>> +		ret = mem_cgroup_hierarchical_reclaim(mem_child, root_mem,
>>>> +							gfp_mask);
>>>> +		cgroup_lock();
>>>> +		mem->last_scanned_child = mem_child;
>>>> +		if (res_counter_check_under_limit(&root_mem->res)) {
>>>> +			ret = 0;
>>>> +			goto done;
>>>> +		}
>>>> +
>>>> +		/*
>>>> +		 * Since we gave up the lock, it is time to
>>>> +		 * start from last cgroup
>>>> +		 */
>>>> +		cgroup = mem->last_scanned_child->css.cgroup;
>>>> +		next = cgroup->sibling.next;
>>>> +
>>>> +		if (next == &cg_current->parent->children)
>>>> +			cgroup = list_first_entry(&mem->css.cgroup->children,
>>>> +							struct cgroup, sibling);
>>>> +		else
>>>> +			cgroup = container_of(next, struct cgroup, sibling);
>>>> +	} while (cgroup != cg_current);
>>>> +
>>>> +done:
>>>> +	cgroup_unlock();
>>>> +	return ret;
>>>> +}
>>> Hmm, does this function is necessary to be complex as this ?
>>> I'm sorry I don't have enough time to review now. (chasing memory online/offline bug.)
>>>
>>> But I can't convice this is a good way to reclaim in hierachical manner.
>>>
>>> In following tree, Assume that processes hit limitation of Level_2.
>>>
>>>    Level_1 (no limit)
>>> 	-> Level_2	(limit=1G)
>>> 		-> Level_3_A (usage=30M)
>>> 		-> Level_3_B (usage=100M)
>>> 			-> Level_4_A (usage=50M)
>>> 			-> Level_4_B (usage=400M)
>>> 			-> Level_4_C (usage=420M)
>>>
>>> Even if we know Level_4_C incudes tons of Inactive file caches,
>>> some amount of swap-out will occur until reachin Level_4_C.
>>>
>>> Can't we do this hierarchical reclaim in another way ?
>>> (start from Level_4_C because we know it has tons of inactive caches.)
>>>
>>> This style of recursive call doesn't have chance to do kind of optimization.
>>> Can we do this reclaim in more flat manner as loop like following
>>> =
>>> try:
>>>   select the most inactive one
>>> 	-> try_to_fre_memory
>>> 		-> check limit
>>> 			-> go to try;
>>> ==
>>>
>> I've been thinking along those lines as well and that will get more important as
>> we try to implement soft limits. However, for the current version I wanted
>> correctness. Fairness, I've seen is achieved, since groups with large number of
>> inactive pages, does get reclaimed from more than others (in my simple
>> experiments).
>>
>> As far the pseudo code is concerned, select the most inactive one is an O(c)
>> operation, where c is the number of nodes under the subtree and is expensive.
>> The data structure and select algorithm get expensive. I am thinking about a
>> more suitable approach for implementation, but I want to focus on correctness as
>> the first step. Since the hierarchy is not enabled by default, I am not adding
>> any additional overhead, so I think that this approach is suitable.
>>
> What I say here is not "implement fairness" but "please make this algorithm easy
> to be updated." If you'll implement soft-limit, please design this code to be
> easily reused. (Again, I don't say do it now but please make code simpler.)
> 

I think of it as easy to update: given the modularity, you can plug out
hierarchical reclaim easily and implement your own hierarchical reclaim.

> Can you make this code iterative rather than recursive ?
> 
> I don't like this kind of recursive call with complexed lock/unlock.

I tried an iterative version, which ended up looking very ugly. I think the
recursive version is easier to understand. What we do is a DFS walk - pretty
standard algorithm.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3)
  2008-11-12  6:10         ` Balbir Singh
@ 2008-11-12  6:12           ` KAMEZAWA Hiroyuki
  2008-11-12  6:22             ` Balbir Singh
  0 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-12  6:12 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Wed, 12 Nov 2008 11:40:13 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> I think of it as easy to update - as in the modularity, you can plug out
> hierarchical reclaim easily and implement your own hierarchical reclaim.
> 
When I do so, I'll rewrite it all, again.

> > Can you make this code iterative rather than recursive ?
> > 
> > I don't like this kind of recursive call with complexed lock/unlock.
> 
> I tried an iterative version, which ended up looking very ugly. I think the
> recursive version is easier to understand. What we do is a DFS walk - pretty
> standard algorithm.
> 
But a recursive one is not good for a search-and-try algorithm.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3)
  2008-11-12  6:12           ` KAMEZAWA Hiroyuki
@ 2008-11-12  6:22             ` Balbir Singh
  2008-11-12  6:33               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 20+ messages in thread
From: Balbir Singh @ 2008-11-12  6:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Wed, 12 Nov 2008 11:40:13 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> I think of it as easy to update - as in the modularity, you can plug out
>> hierarchical reclaim easily and implement your own hierarchical reclaim.
>>
> When I do so, I'll rewrite all, again.
> 

I didn't intend to ask you to rewrite it all; by "you" I meant a generic
person. With hierarchy we will need weighted reclaim, which I'll add in
later.

>>> Can you make this code iterative rather than recursive ?
>>>
>>> I don't like this kind of recursive call with complexed lock/unlock.
>> I tried an iterative version, which ended up looking very ugly. I think the
>> recursive version is easier to understand. What we do is a DFS walk - pretty
>> standard algorithm.
>>
> But recursive one is not good for search-and-try algorithm.

OK, I'll post the iterative algorithm, but it is going to be dirty :)

-- 
	Balbir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3)
  2008-11-12  6:22             ` Balbir Singh
@ 2008-11-12  6:33               ` KAMEZAWA Hiroyuki
  2008-11-12 11:21                 ` Balbir Singh
  0 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-12  6:33 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Wed, 12 Nov 2008 11:52:47 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Wed, 12 Nov 2008 11:40:13 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >> I think of it as easy to update - as in the modularity, you can plug out
> >> hierarchical reclaim easily and implement your own hierarchical reclaim.
> >>
> > When I do so, I'll rewrite all, again.
> > 
> 
> I don't intend to ask you to rewrite it, rewrite all, I meant you as in a
> generic person. With hierarchy we will need weighted reclaim, which I'll add in
> later.
> 
> >>> Can you make this code iterative rather than recursive ?
> >>>
> >>> I don't like this kind of recursive call with complexed lock/unlock.
> >> I tried an iterative version, which ended up looking very ugly. I think the
> >> recursive version is easier to understand. What we do is a DFS walk - pretty
> >> standard algorithm.
> >>
> > But recursive one is not good for search-and-try algorithm.
> 
> OK, I'll post the iterative algorithm, but it is going to be dirty :)
> 
Ah, thanks. I think maybe you're right that the iterative one is dirty.
I want to compare the two before going further.
Thank you for your patience.

Regards,
-Kame


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3)
  2008-11-12  6:33               ` KAMEZAWA Hiroyuki
@ 2008-11-12 11:21                 ` Balbir Singh
  2008-11-13  4:18                   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 20+ messages in thread
From: Balbir Singh @ 2008-11-12 11:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-11-12 15:33:14]:

> On Wed, 12 Nov 2008 11:52:47 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > KAMEZAWA Hiroyuki wrote:
> > > On Wed, 12 Nov 2008 11:40:13 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > >> I think of it as easy to update - as in the modularity, you can plug out
> > >> hierarchical reclaim easily and implement your own hierarchical reclaim.
> > >>
> > > When I do so, I'll rewrite all, again.
> > > 
> > 
> > I don't intend to ask you to rewrite it, rewrite all, I meant you as in a
> > generic person. With hierarchy we will need weighted reclaim, which I'll add in
> > later.
> > 
> > >>> Can you make this code iterative rather than recursive ?
> > >>>
> > >>> I don't like this kind of recursive call with complexed lock/unlock.
> > >> I tried an iterative version, which ended up looking very ugly. I think the
> > >> recursive version is easier to understand. What we do is a DFS walk - pretty
> > >> standard algorithm.
> > >>
> > > But recursive one is not good for search-and-try algorithm.
> > 
> > OK, I'll post the iterative algorithm, but it is going to be dirty :)
> > 
> Ah, thanks. I think maybe you're right that ittrative one is dirty.
> I want to compare before going further. 
> Thank you for your patience.

Here is the iterative version of this patch. I tested it in my
test environment. NOTE: The cgroup_locked check is still present; I'll
remove that shortly after your patch is accepted.

This patch introduces hierarchical reclaim. When an ancestor goes over its
limit, the charging routine points to the parent that is above its limit.
The reclaim process then starts from the last scanned child of the ancestor
and reclaims until the ancestor goes below its limit.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 mm/memcontrol.c |  190 ++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 163 insertions(+), 27 deletions(-)

diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
--- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim	2008-11-11 17:51:56.000000000 +0530
+++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-12 16:49:42.000000000 +0530
@@ -132,6 +132,11 @@ struct mem_cgroup {
 	 * statistics.
 	 */
 	struct mem_cgroup_stat stat;
+	/*
+	 * While reclaiming in a hiearchy, we cache the last child we
+	 * reclaimed from. Protected by cgroup_lock()
+	 */
+	struct mem_cgroup *last_scanned_child;
 };
 static struct mem_cgroup init_mem_cgroup;
 
@@ -467,6 +472,137 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
+static struct mem_cgroup *
+mem_cgroup_from_res_counter(struct res_counter *counter)
+{
+	return container_of(counter, struct mem_cgroup, res);
+}
+
+/*
+ * Dance down the hierarchy if needed to reclaim memory. We remember the
+ * last child we reclaimed from, so that we don't end up penalizing
+ * one child extensively based on its position in the children list.
+ *
+ * root_mem is the original ancestor that we've been reclaim from.
+ */
+static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
+						struct mem_cgroup *root_mem,
+						gfp_t gfp_mask)
+{
+	struct cgroup *cgroup;
+	struct mem_cgroup *mem_child;
+	int ret = 0;
+
+	/*
+	 * Reclaim unconditionally and don't check for return value.
+	 * We need to reclaim in the current group and down the tree.
+	 * One might think about checking for children before reclaiming,
+	 * but there might be left over accounting, even after children
+	 * have left.
+	 */
+reclaim_down:
+	ret = try_to_free_mem_cgroup_pages(mem, gfp_mask);
+
+	if (res_counter_check_under_limit(&root_mem->res))
+		goto loop_next;
+
+	cgroup_lock();
+
+	if (list_empty(&mem->css.cgroup->children)) {
+		cgroup_unlock();
+		goto loop_next;
+	}
+
+	/*
+	 * Scan all children under the mem_cgroup mem
+	 */
+	if (!mem->last_scanned_child)
+		cgroup = list_first_entry(&mem->css.cgroup->children,
+				struct cgroup, sibling);
+	else {
+		struct list_head *next;
+		cgroup = mem->last_scanned_child->css.cgroup;
+		next = cgroup->sibling.next;
+
+		if (next == &cgroup->parent->children)
+			cgroup = list_first_entry(&mem->css.cgroup->children,
+							struct cgroup, sibling);
+		else
+			cgroup = container_of(next, struct cgroup, sibling);
+	}
+
+reclaim_up:
+	do {
+		struct list_head *next;
+
+		mem_child = mem_cgroup_from_cont(cgroup);
+		cgroup_unlock();
+
+		mem = mem_child;
+		goto reclaim_down;
+
+loop_next:
+		cgroup_lock();
+		mem->last_scanned_child = mem_child;
+
+		/*
+		 * Since we gave up the lock, it is time to
+		 * start from last cgroup
+		 */
+		cgroup = mem->last_scanned_child->css.cgroup;
+		next = cgroup->sibling.next;
+
+		if (next == &cgroup->parent->children)
+			break;
+		else
+			cgroup = container_of(next, struct cgroup, sibling);
+	} while (cgroup != &cgroup->parent->children);
+
+	cgroup = cgroup->parent;
+	mem = mem_cgroup_from_cont(cgroup);
+	if (mem == root_mem)
+		goto done;
+	goto reclaim_up;
+done:
+	cgroup_unlock();
+	return ret;
+}
+
+/*
+ * Charge memory cgroup mem and check if it is over its limit. If so, reclaim
+ * from mem.
+ */
+static int mem_cgroup_charge_and_reclaim(struct mem_cgroup *mem, gfp_t gfp_mask)
+{
+	int ret = 0;
+	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct res_counter *fail_res;
+	struct mem_cgroup *mem_over_limit;
+
+	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
+		if (!(gfp_mask & __GFP_WAIT))
+			goto out;
+
+		/*
+		 * Is one of our ancestors over their limit?
+		 */
+		if (fail_res)
+			mem_over_limit = mem_cgroup_from_res_counter(fail_res);
+		else
+			mem_over_limit = mem;
+
+		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit,
+							mem_over_limit,
+							gfp_mask);
+
+		if (!nr_retries--) {
+			mem_cgroup_out_of_memory(mem, gfp_mask);
+			goto out;
+		}
+	}
+out:
+	return ret;
+}
 
 /**
  * mem_cgroup_try_charge - get charge of PAGE_SIZE.
@@ -484,8 +620,7 @@ int mem_cgroup_try_charge(struct mm_stru
 			gfp_t gfp_mask, struct mem_cgroup **memcg)
 {
 	struct mem_cgroup *mem;
-	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct res_counter *fail_res;
+
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
@@ -510,29 +645,9 @@ int mem_cgroup_try_charge(struct mm_stru
 		css_get(&mem->css);
 	}
 
+	if (mem_cgroup_charge_and_reclaim(mem, gfp_mask))
+		goto nomem;
 
-	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
-		if (!(gfp_mask & __GFP_WAIT))
-			goto nomem;
-
-		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
-			continue;
-
-		/*
-		 * try_to_free_mem_cgroup_pages() might not give us a full
-		 * picture of reclaim. Some pages are reclaimed and might be
-		 * moved to swap cache or just unmapped from the cgroup.
-		 * Check the limit again to see if the reclaim reduced the
-		 * current usage of the cgroup before giving up
-		 */
-		if (res_counter_check_under_limit(&mem->res))
-			continue;
-
-		if (!nr_retries--) {
-			mem_cgroup_out_of_memory(mem, gfp_mask);
-			goto nomem;
-		}
-	}
 	return 0;
 nomem:
 	css_put(&mem->css);
@@ -945,7 +1060,7 @@ static void mem_cgroup_force_empty_list(
  * make mem_cgroup's charge to be 0 if there is no task.
  * This enables deleting this mem_cgroup.
  */
-static int mem_cgroup_force_empty(struct mem_cgroup *mem)
+static int mem_cgroup_force_empty(struct mem_cgroup *mem, bool cgroup_locked)
 {
 	int ret = -EBUSY;
 	int node, zid;
@@ -959,8 +1074,20 @@ static int mem_cgroup_force_empty(struct
 	while (mem->res.usage > 0) {
 		if (atomic_read(&mem->css.cgroup->count) > 0)
 			goto out;
+
+		/*
+		 * We need to give up the cgroup lock if it is held, since
+		 * it creates the potential for deadlock. cgroup_mutex should
+		 * be acquired after cpu_hotplug lock. In this path, we
+	 * acquire the cpu_hotplug lock after acquiring the cgroup_mutex.
+	 * Giving it up should be OK.
+		 */
+		if (cgroup_locked)
+			cgroup_unlock();
 		/* This is for making all *used* pages to be on LRU. */
 		lru_add_drain_all();
+		if (cgroup_locked)
+			cgroup_lock();
 		for_each_node_state(node, N_POSSIBLE)
 			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 				struct mem_cgroup_per_zone *mz;
@@ -1025,7 +1152,7 @@ static int mem_cgroup_reset(struct cgrou
 
 static int mem_force_empty_write(struct cgroup *cont, unsigned int event)
 {
-	return mem_cgroup_force_empty(mem_cgroup_from_cont(cont));
+	return mem_cgroup_force_empty(mem_cgroup_from_cont(cont), false);
 }
 
 static const struct mem_cgroup_stat_desc {
@@ -1195,6 +1322,8 @@ mem_cgroup_create(struct cgroup_subsys *
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
 			goto free_out;
 
+	mem->last_scanned_child = NULL;
+
 	return &mem->css;
 free_out:
 	for_each_node_state(node, N_POSSIBLE)
@@ -1208,7 +1337,7 @@ static void mem_cgroup_pre_destroy(struc
 					struct cgroup *cont)
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
-	mem_cgroup_force_empty(mem);
+	mem_cgroup_force_empty(mem, true);
 }
 
 static void mem_cgroup_destroy(struct cgroup_subsys *ss,
@@ -1217,6 +1346,13 @@ static void mem_cgroup_destroy(struct cg
 	int node;
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 
+	if (cont->parent) {
+		struct mem_cgroup *parent_mem =
+			mem_cgroup_from_cont(cont->parent);
+		if (parent_mem->last_scanned_child == mem)
+			parent_mem->last_scanned_child = NULL;
+	}
+
 	for_each_node_state(node, N_POSSIBLE)
 		free_mem_cgroup_per_zone_info(mem, node);
 
_

-- 
	Balbir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v3)
  2008-11-11 12:34 ` [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v3) Balbir Singh
@ 2008-11-13  1:28   ` Li Zefan
  2008-11-13  1:34     ` Balbir Singh
  2008-11-13  1:39   ` Li Zefan
  1 sibling, 1 reply; 20+ messages in thread
From: Li Zefan @ 2008-11-13  1:28 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton, KAMEZAWA Hiroyuki

> +	/*
> +	 * If parent's use_hierarchy is set, we can't make any modifications
> +	 * in the child subtrees. If it is unset, then the change can
> +	 * occur, provided the current cgroup has no children.
> +	 *
> +	 * For the root cgroup, parent_mem is NULL, we allow value to be
> +	 * set if there are no children.
> +	 */
> +	if (!parent_mem || (!parent_mem->use_hierarchy &&
> +				(val == 1 || val == 0))) {
> +		if (list_empty(&cont->children))
> +			mem->use_hierarchy = val;
> +		else
> +			retval = -EBUSY;
> +	} else
> +		retval = -EINVAL;
> +
> +	return retval;
> +}

As I mentioned there is a race here. :(

echo 1 > /memcg/memory.use_hierarchy
 =>if (list_empty(&cont->children))
                                      mkdir /memcg/0
                                       => mem->use_hierarchy = 0
       mem->use_hierarchy = 1;

Now we end up with the parent's use_hierarchy set but the child's
use_hierarchy unset.
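
(A minimal sketch of one way to close that window, not the patch under
review: take cgroup_lock() around both the children check and the store, so
the mkdir cannot slip in between. The function name and exact signature are
made up for illustration; only the fields discussed in this thread are
assumed.)

static int memcg_set_use_hierarchy_locked(struct cgroup *cont,
					  struct mem_cgroup *mem,
					  struct mem_cgroup *parent_mem,
					  u64 val)
{
	int retval = 0;

	if (val != 0 && val != 1)
		return -EINVAL;

	cgroup_lock();
	/*
	 * With cgroup_lock() held, mkdir either runs first and makes the
	 * children list non-empty (we return -EBUSY), or runs after the
	 * store and the new child copies the value we just set.
	 */
	if (!parent_mem || !parent_mem->use_hierarchy) {
		if (list_empty(&cont->children))
			mem->use_hierarchy = val;
		else
			retval = -EBUSY;
	} else
		retval = -EINVAL;
	cgroup_unlock();

	return retval;
}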

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v3)
  2008-11-13  1:28   ` Li Zefan
@ 2008-11-13  1:34     ` Balbir Singh
  0 siblings, 0 replies; 20+ messages in thread
From: Balbir Singh @ 2008-11-13  1:34 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton, KAMEZAWA Hiroyuki

Li Zefan wrote:
>> +	/*
>> +	 * If parent's use_hierarchy is set, we can't make any modifications
>> +	 * in the child subtrees. If it is unset, then the change can
>> +	 * occur, provided the current cgroup has no children.
>> +	 *
>> +	 * For the root cgroup, parent_mem is NULL, we allow value to be
>> +	 * set if there are no children.
>> +	 */
>> +	if (!parent_mem || (!parent_mem->use_hierarchy &&
>> +				(val == 1 || val == 0))) {
>> +		if (list_empty(&cont->children))
>> +			mem->use_hierarchy = val;
>> +		else
>> +			retval = -EBUSY;
>> +	} else
>> +		retval = -EINVAL;
>> +
>> +	return retval;
>> +}
> 
> As I mentioned there is a race here. :(
> 
> echo 1 > /memcg/memory.use_hierarchy
>  =>if (list_empty(&cont->children))
>                                       mkdir /memcg/0
>                                        => mem->use_hierarchy = 0
>        mem->use_hierarchy = 1;
> 

Hi, Li,

I thought I had the cgroup_lock() around that check, but I seem to have missed
it. I'll fix that in v4.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v3)
  2008-11-11 12:34 ` [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v3) Balbir Singh
  2008-11-13  1:28   ` Li Zefan
@ 2008-11-13  1:39   ` Li Zefan
  1 sibling, 0 replies; 20+ messages in thread
From: Li Zefan @ 2008-11-13  1:39 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton, KAMEZAWA Hiroyuki

> @@ -137,6 +137,11 @@ struct mem_cgroup {
>  	 * reclaimed from. Protected by cgroup_lock()
>  	 */
>  	struct mem_cgroup *last_scanned_child;
> +	/*
> +	 * Should the accounting and control be hierarchical, per subtree?
> +	 */
> +	unsigned long use_hierarchy;
> +

A minor comment: 'unsigned int' is sufficient, and it would save 4 bytes
per mem_cgroup on 64-bit machines.

>  };
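
(Whether those 4 bytes actually show up in sizeof(struct mem_cgroup) depends
on what follows the field: an adjacent 8-byte-aligned member can turn the
difference into padding. A throwaway userspace illustration -- the struct
layouts here are made up, not the real mem_cgroup:)

#include <stdio.h>

struct with_long { unsigned long use_hierarchy; void *next; };
struct with_int  { unsigned int  use_hierarchy; void *next; };

int main(void)
{
	/* On LP64 both print 16: the 4 bytes saved by the smaller field
	 * are eaten by padding before the pointer. With a 4-byte member
	 * following instead, the saving is real. */
	printf("%zu %zu\n", sizeof(struct with_long), sizeof(struct with_int));
	return 0;
}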


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3)
  2008-11-12 11:21                 ` Balbir Singh
@ 2008-11-13  4:18                   ` KAMEZAWA Hiroyuki
  2008-11-13 13:33                     ` Balbir Singh
  0 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-13  4:18 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Wed, 12 Nov 2008 16:51:41 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:


> Here is the iterative version of this patch. I tested it in my
> test environment. NOTE: The cgroup_locked check is still present, I'll
> remove that shortly after your patch is accepted.
> 
> This patch introduces hierarchical reclaim. When an ancestor goes over its
> limit, the charging routine points to the parent that is above its limit.
> The reclaim process then starts from the last scanned child of the ancestor
> and reclaims until the ancestor goes below its limit.
> 

It's complicated, as you said, but that seems to come from the coding style.

I expected something like the following.
==

struct mem_cgroup *memcg_select_next_token(struct mem_cgroup *itr,
					struct mem_cgroup *cur,
					struct mem_cgroup *root)
{
	struct cgroup *pos, *tmp, *parent, *rootpos;

	cgroup_lock();
	if (!itr || itr->obsolete)
		itr = cur;

	rootpos = root->css.cgroup;
	pos = itr->css.cgroup;
	parent = pos->parent;
	/* start from children */
	if (!list_empty(&pos->children)) {
		pos = list_entry(pos->children.next, struct cgroup, sibling);
		mem_cgroup_put(itr);
		itr = mem_cgroup_from_cont(pos);
		mem_cgroup_get(itr);
		goto unlock;
	}
next_parent:
	if (pos == rootpos) {
		/* I'm root and no available children */
		mem_cgroup_put(itr);
		itr = mem_cgroup_from_cont(pos);
		mem_cgroup_get(itr);
		goto unlock;
	}
	/* Do I have next siblings ? */
	if (pos->sibling.next != &parent->children) {
		pos = list_entry(pos->sibling.next, struct cgroup, sibling);
		mem_cgroup_put(itr);
		itr = mem_cgroup_from_cont(pos);
		mem_cgroup_get(itr);
		goto unlock;
	}
	/* Ok, go back to parent */
	pos = pos->parent;
	parent = pos->parent;
	goto next_parent;
	
unlock:
	root->reclaim_token = itr;
	cgroup_unlock();
	return itr;
}

struct mem_cgroup *memcg_select_start_token(struct mem_cgroup *cur,
					    struct mem_cgroup *root)
{
	struct mem_cgroup *token;

	if (cur == root)
		return cur;

	cgroup_lock();

	token = root->reclaim_token;
	if (token->obsolete) {
		mem_cgroup_put(token);  /* decrease refcnt */
		root->reclaim_token = cur;
		token = cur;
		mem_cgroup_get(cur);	/* increase refcnt */
		cgroup_unlock();
		return token;
	}
	cgroup_unlock();
	
	return memcg_select_next_token(token, cur, root);
}


int mem_cgroup_do_reclaim(struct mem_cgroup *mem,
			  struct mem_cgroup *root_mem,
			  gfp_t mask)
{
	struct mem_cgroup *token, *start;
	int ret = 0, retry = MEM_CGROUP_RECLAIM_RETRIES;
	/*
	 * We do memory reclaim under "root_mem".
	 * We have to be careful not to reclaim memory only from one
	 * unlucky group. To avoid that, we use a "token".
	 */

	token = memcg_select_start_token(mem, root_mem);

	start = NULL;

	while (start != token) {
		if (!token->obsolete) {
			ret = try_to_free_mem_cgroup_pages(token,
						GFP_HIGHUSER_MOVABLE);
			if (res_counter_check_under_limit(&root_mem->res))
				return 0;
			if (ret == 0 && !retry--)
				break;
			start = token;
			token = memcg_select_next_token(token, mem, root_mem);
		} else {
			/* This mem_cgroup is destroyed. */
			mem_cgroup_put(token);
			token = memcg_select_next_token(NULL, mem, root_mem);
		}
	}

	return ret;
}


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3)
  2008-11-13  4:18                   ` KAMEZAWA Hiroyuki
@ 2008-11-13 13:33                     ` Balbir Singh
  0 siblings, 0 replies; 20+ messages in thread
From: Balbir Singh @ 2008-11-13 13:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Wed, 12 Nov 2008 16:51:41 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> 
>> Here is the iterative version of this patch. I tested it in my
>> test environment. NOTE: The cgroup_locked check is still present, I'll
>> remove that shortly after your patch is accepted.
>>
>> This patch introduces hierarchical reclaim. When an ancestor goes over its
>> limit, the charging routine points to the parent that is above its limit.
>> The reclaim process then starts from the last scanned child of the ancestor
>> and reclaims until the ancestor goes below its limit.
>>
> 
> It's complicated, as you said, but that seems to come from the coding style.
> 
> I expected something like the following.

Thanks, it looks very similar to what I have. I like the split of the iterator
into start token and next token. I'll refactor the code based on your
suggestion, if possible, in the next version.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2008-11-13 13:33 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-11-11 12:33 [RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v3) Balbir Singh
2008-11-11 12:33 ` [RFC][mm] [PATCH 1/4] Memory cgroup hierarchy documentation (v3) Balbir Singh
2008-11-11 12:34 ` [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v3) Balbir Singh
2008-11-11 12:34 ` [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v3) Balbir Singh
2008-11-12  3:52   ` KAMEZAWA Hiroyuki
2008-11-12  4:00     ` Balbir Singh
2008-11-12  5:02   ` KAMEZAWA Hiroyuki
2008-11-12  5:49     ` Balbir Singh
2008-11-12  6:01       ` KAMEZAWA Hiroyuki
2008-11-12  6:10         ` Balbir Singh
2008-11-12  6:12           ` KAMEZAWA Hiroyuki
2008-11-12  6:22             ` Balbir Singh
2008-11-12  6:33               ` KAMEZAWA Hiroyuki
2008-11-12 11:21                 ` Balbir Singh
2008-11-13  4:18                   ` KAMEZAWA Hiroyuki
2008-11-13 13:33                     ` Balbir Singh
2008-11-11 12:34 ` [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v3) Balbir Singh
2008-11-13  1:28   ` Li Zefan
2008-11-13  1:34     ` Balbir Singh
2008-11-13  1:39   ` Li Zefan
