LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC][PATCH 0/6] memcg updates (05/Nov)
@ 2008-11-05  8:16 KAMEZAWA Hiroyuki
  2008-11-05  8:18 ` [RFC][PATCH 1/6] memcg: move all accounts to parent at rmdir() KAMEZAWA Hiroyuki
                   ` (6 more replies)
  0 siblings, 7 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-05  8:16 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, balbir, menage, nishimura

Weekly (RFC) update for memcg.

This set includes

1. change force_empty to do move account rather than forget all
2. swap cache handling
3. mem+swap controller kconfig
4. swap_cgroup for rememver swap account information
5. mem+swap controller core
6. synchronize memcg's LRU and global LRU.

"1" is already sent, "6" is a newcomer.
I'd like to push out "2" or "2-5" in the next week (if no bugs.)

after 6, next candidates are
  - dirty_ratio handler
  - account move at task move.

Some more explanation about purpose of "6". (see details in patch itself)
Now, one of complicated logic in memcg is LRU handling. Because the place of
lru_head depends on page_cgroup->mem_cgroup pointer, we have to take
lock as following even under zone->lru_lock.
==
  pc = lookup_page_cgroup(page);
  if (!trylock_page_cgroup(pc))
  	return -EBUSY;

   if (PageCgroupUsed(pc)) {
	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
	spin_lock_irqsave(&mz->lru_lock, flags);
	....some operation on LRU.
	spin_unlock_irqrestore(&mz->lru_lock, flags);
   }
   unlock_page_cgroup(pc);
==
Sigh..

After "6", page_cgroup's LRU management can be done independently to some extent.
== as
  (zone->lru_lock is held here)
  pc = lookup_page_cgroup(page);
  list operation on pc.
  (unlock zone->lru_lock)
==
Maybe good for maintainance and as a bonus, we can make use of isolate_lru_page() when
doing some racy operation.

	isolate_lru_page(page);
	pc = lookup_page_cgroup(page);
	do some jobs.
	putback_lru_page(page);

Maybe this will be a help to implement "account move at task move".

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC][PATCH 1/6] memcg: move all accounts to parent at rmdir()
  2008-11-05  8:16 [RFC][PATCH 0/6] memcg updates (05/Nov) KAMEZAWA Hiroyuki
@ 2008-11-05  8:18 ` KAMEZAWA Hiroyuki
  2008-11-05  8:20 ` [RFC][PATCH 2/6] memcg: handle swap cache KAMEZAWA Hiroyuki
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-05  8:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

Not different from a patch I sent.
==
This patch provides a function to move account information of a page between
mem_cgroups and rewrite force_empty to make use of this.

This moving of page_cgroup is done under
 - lru_lock of source/destination mem_cgroup is held.
 - lock_page_cgroup() is held.

Then, a routine which touches pc->mem_cgroup without lock_page_cgroup() should
confirm pc->mem_cgroup is still valid or not. Typical code can be following.

(while page is not under lock_page())
	mem = pc->mem_cgroup;
	mz = page_cgroup_zoneinfo(pc)
	spin_lock_irqsave(&mz->lru_lock);
	if (pc->mem_cgroup == mem)
		...../* some list handling */
	spin_unlock_irqrestore(&mz->lru_lock);

Of course, better way is
	lock_page_cgroup(pc);
	....
	unlock_page_cgroup(pc);

But you should confirm the nest of lock and avoid deadlock.

If you treats page_cgroup from mem_cgroup's LRU under mz->lru_lock,
you don't have to worry about what pc->mem_cgroup points to.
moved pages are added to head of lru, not to tail.

Expected users of this routine is:
  - force_empty (rmdir)
  - moving tasks between cgroup (for moving account information.)
  - hierarchy (maybe useful.)

force_empty(rmdir) uses this move_account and move pages to its parent.
This "move" will not cause OOM (I added "oom" parameter to try_charge().)

If the parent is busy (not enough memory), force_empty calls try_to_free_page()
and reduce usage.

Purpose of this behavior is
  - Fix "forget all" behavior of force_empty and avoid leak of accounting.
  - By "moving first, free if necessary", keep pages on memory as much as
    possible.

Adding a switch to change behavior of force_empty to
  - free first, move if necessary
  - free all, if there is mlocked/busy pages, return -EBUSY.
is under consideration.

This patch removes memory.force_empty file, a brutal debug-only interface.

Changelog: (v8) -> (v9)
  - fixed excessive use of __mem_cgroup_try_charge().
  - explained "page cache moves, too" in Documentation.

Changelog: (v6) -> (v8)
  - removed memory.force_empty file which was provided only for debug.

Changelog: (v5) -> (v6)
  - removed unnecessary check.
  - do all under lock_page_cgroup().
  - removed res_counter_charge() from move function itself.
    (and modifies try_charge() function.)
  - add argument to add_list() to specify to add page_cgroup head or tail.
  - merged with force_empty patch. (to answer who is user? question)

Changelog: (v4) -> (v5)
  - check for lock_page() is removed.
  - rewrote description.

Changelog: (v2) -> (v4)
  - added lock_page_cgroup().
  - moved out from new-force-empty patch.
  - added how-to-use text.
  - fixed race in __mem_cgroup_uncharge_common().

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 Documentation/controllers/memory.txt |   12 -
 mm/memcontrol.c                      |  277 ++++++++++++++++++++++++++---------
 2 files changed, 214 insertions(+), 75 deletions(-)

Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -257,7 +257,7 @@ static void __mem_cgroup_remove_list(str
 }
 
 static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
-				struct page_cgroup *pc)
+				struct page_cgroup *pc, bool hot)
 {
 	int lru = LRU_BASE;
 
@@ -271,7 +271,10 @@ static void __mem_cgroup_add_list(struct
 	}
 
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
-	list_add(&pc->lru, &mz->lists[lru]);
+	if (hot)
+		list_add(&pc->lru, &mz->lists[lru]);
+	else
+		list_add_tail(&pc->lru, &mz->lists[lru]);
 
 	mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
 }
@@ -467,21 +470,12 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
-
-/**
- * mem_cgroup_try_charge - get charge of PAGE_SIZE.
- * @mm: an mm_struct which is charged against. (when *memcg is NULL)
- * @gfp_mask: gfp_mask for reclaim.
- * @memcg: a pointer to memory cgroup which is charged against.
- *
- * charge against memory cgroup pointed by *memcg. if *memcg == NULL, estimated
- * memory cgroup from @mm is got and stored in *memcg.
- *
- * Returns 0 if success. -ENOMEM at failure.
+/*
+ * Unlike exported interface, "oom" parameter is added. if oom==true,
+ * oom-killer can be invoked.
  */
-
-int mem_cgroup_try_charge(struct mm_struct *mm,
-			gfp_t gfp_mask, struct mem_cgroup **memcg)
+static int __mem_cgroup_try_charge(struct mm_struct *mm,
+			gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
 {
 	struct mem_cgroup *mem;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -528,7 +522,8 @@ int mem_cgroup_try_charge(struct mm_stru
 			continue;
 
 		if (!nr_retries--) {
-			mem_cgroup_out_of_memory(mem, gfp_mask);
+			if (oom)
+				mem_cgroup_out_of_memory(mem, gfp_mask);
 			goto nomem;
 		}
 	}
@@ -538,6 +533,25 @@ nomem:
 	return -ENOMEM;
 }
 
+/**
+ * mem_cgroup_try_charge - get charge of PAGE_SIZE.
+ * @mm: an mm_struct which is charged against. (when *memcg is NULL)
+ * @gfp_mask: gfp_mask for reclaim.
+ * @memcg: a pointer to memory cgroup which is charged against.
+ *
+ * charge against memory cgroup pointed by *memcg. if *memcg == NULL, estimated
+ * memory cgroup from @mm is got and stored in *memcg.
+ *
+ * Returns 0 if success. -ENOMEM at failure.
+ * This call can invoke OOM-Killer.
+ */
+
+int mem_cgroup_try_charge(struct mm_struct *mm,
+			  gfp_t mask, struct mem_cgroup **memcg)
+{
+	return __mem_cgroup_try_charge(mm, mask, memcg, true);
+}
+
 /*
  * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
  * USED state. If already USED, uncharge and return.
@@ -571,11 +585,109 @@ static void __mem_cgroup_commit_charge(s
 	mz = page_cgroup_zoneinfo(pc);
 
 	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_add_list(mz, pc);
+	__mem_cgroup_add_list(mz, pc, true);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 	unlock_page_cgroup(pc);
 }
 
+/**
+ * mem_cgroup_move_account - move account of the page
+ * @pc:	page_cgroup of the page.
+ * @from: mem_cgroup which the page is moved from.
+ * @to:	mem_cgroup which the page is moved to. @from != @to.
+ *
+ * The caller must confirm following.
+ * 1. disable irq.
+ * 2. lru_lock of old mem_cgroup(@from) should be held.
+ *
+ * returns 0 at success,
+ * returns -EBUSY when lock is busy or "pc" is unstable.
+ *
+ * This function does "uncharge" from old cgroup but doesn't do "charge" to
+ * new cgroup. It should be done by a caller.
+ */
+
+static int mem_cgroup_move_account(struct page_cgroup *pc,
+	struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	struct mem_cgroup_per_zone *from_mz, *to_mz;
+	int nid, zid;
+	int ret = -EBUSY;
+
+	VM_BUG_ON(!irqs_disabled());
+	VM_BUG_ON(from == to);
+
+	nid = page_cgroup_nid(pc);
+	zid = page_cgroup_zid(pc);
+	from_mz =  mem_cgroup_zoneinfo(from, nid, zid);
+	to_mz =  mem_cgroup_zoneinfo(to, nid, zid);
+
+
+	if (!trylock_page_cgroup(pc))
+		return ret;
+
+	if (!PageCgroupUsed(pc))
+		goto out;
+
+	if (pc->mem_cgroup != from)
+		goto out;
+
+	if (spin_trylock(&to_mz->lru_lock)) {
+		__mem_cgroup_remove_list(from_mz, pc);
+		css_put(&from->css);
+		res_counter_uncharge(&from->res, PAGE_SIZE);
+		pc->mem_cgroup = to;
+		css_get(&to->css);
+		__mem_cgroup_add_list(to_mz, pc, false);
+		ret = 0;
+		spin_unlock(&to_mz->lru_lock);
+	}
+out:
+	unlock_page_cgroup(pc);
+	return ret;
+}
+
+/*
+ * move charges to its parent.
+ */
+
+static int mem_cgroup_move_parent(struct page_cgroup *pc,
+				  struct mem_cgroup *child,
+				  gfp_t gfp_mask)
+{
+	struct cgroup *cg = child->css.cgroup;
+	struct cgroup *pcg = cg->parent;
+	struct mem_cgroup *parent;
+	struct mem_cgroup_per_zone *mz;
+	unsigned long flags;
+	int ret;
+
+	/* Is ROOT ? */
+	if (!pcg)
+		return -EINVAL;
+
+	parent = mem_cgroup_from_cont(pcg);
+
+	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
+	if (ret)
+		return ret;
+
+	mz = mem_cgroup_zoneinfo(child,
+			page_cgroup_nid(pc), page_cgroup_zid(pc));
+
+	spin_lock_irqsave(&mz->lru_lock, flags);
+	ret = mem_cgroup_move_account(pc, child, parent);
+	spin_unlock_irqrestore(&mz->lru_lock, flags);
+
+	/* drop extra refcnt */
+	css_put(&parent->css);
+	/* uncharge if move fails */
+	if (ret)
+		res_counter_uncharge(&parent->res, PAGE_SIZE);
+
+	return ret;
+}
+
 /*
  * Charge the memory controller for page usage.
  * Return
@@ -597,7 +709,7 @@ static int mem_cgroup_charge_common(stru
 	prefetchw(pc);
 
 	mem = memcg;
-	ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
+	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
 	if (ret)
 		return ret;
 
@@ -898,46 +1010,52 @@ int mem_cgroup_resize_limit(struct mem_c
  * This routine traverse page_cgroup in given list and drop them all.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
  */
-#define FORCE_UNCHARGE_BATCH	(128)
-static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
+static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 			    struct mem_cgroup_per_zone *mz,
 			    enum lru_list lru)
 {
-	struct page_cgroup *pc;
-	struct page *page;
-	int count = FORCE_UNCHARGE_BATCH;
+	struct page_cgroup *pc, *busy;
 	unsigned long flags;
+	unsigned long loop;
 	struct list_head *list;
+	int ret = 0;
 
 	list = &mz->lists[lru];
 
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	while (!list_empty(list)) {
-		pc = list_entry(list->prev, struct page_cgroup, lru);
-		page = pc->page;
-		if (!PageCgroupUsed(pc))
+	loop = MEM_CGROUP_ZSTAT(mz, lru);
+	/* give some margin against EBUSY etc...*/
+	loop += 256;
+	busy = NULL;
+	while (loop--) {
+		ret = 0;
+		spin_lock_irqsave(&mz->lru_lock, flags);
+		if (list_empty(list)) {
+			spin_unlock_irqrestore(&mz->lru_lock, flags);
 			break;
-		get_page(page);
+		}
+		pc = list_entry(list->prev, struct page_cgroup, lru);
+		if (busy == pc) {
+			list_move(&pc->lru, list);
+			busy = 0;
+			spin_unlock_irqrestore(&mz->lru_lock, flags);
+			continue;
+		}
 		spin_unlock_irqrestore(&mz->lru_lock, flags);
-		/*
-		 * Check if this page is on LRU. !LRU page can be found
-		 * if it's under page migration.
-		 */
-		if (PageLRU(page)) {
-			__mem_cgroup_uncharge_common(page,
-					MEM_CGROUP_CHARGE_TYPE_FORCE);
-			put_page(page);
-			if (--count <= 0) {
-				count = FORCE_UNCHARGE_BATCH;
-				cond_resched();
-			}
-		} else {
-			spin_lock_irqsave(&mz->lru_lock, flags);
+
+		ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
+		if (ret == -ENOMEM)
 			break;
-		}
-		spin_lock_irqsave(&mz->lru_lock, flags);
+
+		if (ret == -EBUSY || ret == -EINVAL) {
+			/* found lock contention or "pc" is obsolete. */
+			busy = pc;
+			cond_resched();
+		} else
+			busy = NULL;
 	}
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	if (!ret && !list_empty(list))
+		return -EBUSY;
+	return ret;
 }
 
 /*
@@ -946,34 +1064,68 @@ static void mem_cgroup_force_empty_list(
  */
 static int mem_cgroup_force_empty(struct mem_cgroup *mem)
 {
-	int ret = -EBUSY;
-	int node, zid;
+	int ret;
+	int node, zid, shrink;
+	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 
 	css_get(&mem->css);
-	/*
-	 * page reclaim code (kswapd etc..) will move pages between
-	 * active_list <-> inactive_list while we don't take a lock.
-	 * So, we have to do loop here until all lists are empty.
-	 */
+
+	shrink = 0;
+move_account:
 	while (mem->res.usage > 0) {
+		ret = -EBUSY;
 		if (atomic_read(&mem->css.cgroup->count) > 0)
 			goto out;
+
 		/* This is for making all *used* pages to be on LRU. */
 		lru_add_drain_all();
-		for_each_node_state(node, N_POSSIBLE)
-			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		ret = 0;
+		for_each_node_state(node, N_POSSIBLE) {
+			for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
 				struct mem_cgroup_per_zone *mz;
 				enum lru_list l;
 				mz = mem_cgroup_zoneinfo(mem, node, zid);
-				for_each_lru(l)
-					mem_cgroup_force_empty_list(mem, mz, l);
+				for_each_lru(l) {
+					ret = mem_cgroup_force_empty_list(mem,
+								  mz, l);
+					if (ret)
+						break;
+				}
 			}
+			if (ret)
+				break;
+		}
+		/* it seems parent cgroup doesn't have enough mem */
+		if (ret == -ENOMEM)
+			goto try_to_free;
 		cond_resched();
 	}
 	ret = 0;
 out:
 	css_put(&mem->css);
 	return ret;
+
+try_to_free:
+	/* returns EBUSY if we come here twice. */
+	if (shrink)  {
+		ret = -EBUSY;
+		goto out;
+	}
+	/* try to free all pages in this cgroup */
+	shrink = 1;
+	while (nr_retries && mem->res.usage > 0) {
+		int progress;
+		progress = try_to_free_mem_cgroup_pages(mem,
+						  GFP_HIGHUSER_MOVABLE);
+		if (!progress)
+			nr_retries--;
+
+	}
+	/* try move_account...there may be some *locked* pages. */
+	if (mem->res.usage)
+		goto move_account;
+	ret = 0;
+	goto out;
 }
 
 static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
@@ -1022,11 +1174,6 @@ static int mem_cgroup_reset(struct cgrou
 	return 0;
 }
 
-static int mem_force_empty_write(struct cgroup *cont, unsigned int event)
-{
-	return mem_cgroup_force_empty(mem_cgroup_from_cont(cont));
-}
-
 static const struct mem_cgroup_stat_desc {
 	const char *msg;
 	u64 unit;
@@ -1103,10 +1250,6 @@ static struct cftype mem_cgroup_files[] 
 		.read_u64 = mem_cgroup_read,
 	},
 	{
-		.name = "force_empty",
-		.trigger = mem_force_empty_write,
-	},
-	{
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
Index: mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-rc2+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
@@ -207,12 +207,6 @@ exceeded.
 The memory.stat file gives accounting information. Now, the number of
 caches, RSS and Active pages/Inactive pages are shown.
 
-The memory.force_empty gives an interface to drop *all* charges by force.
-
-# echo 1 > memory.force_empty
-
-will drop all charges in cgroup. Currently, this is maintained for test.
-
 4. Testing
 
 Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
@@ -242,8 +236,10 @@ reclaimed.
 
 A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
 cgroup might have some charge associated with it, even though all
-tasks have migrated away from it. Such charges are automatically dropped at
-rmdir() if there are no tasks.
+tasks have migrated away from it.
+Such charges are moved to its parent as much as possible and freed if parent
+is full. Both of RSS and CACHES are moved to parent.
+If both of them are busy, rmdir() returns -EBUSY.
 
 5. TODO
 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC][PATCH 2/6] memcg: handle swap cache
  2008-11-05  8:16 [RFC][PATCH 0/6] memcg updates (05/Nov) KAMEZAWA Hiroyuki
  2008-11-05  8:18 ` [RFC][PATCH 1/6] memcg: move all accounts to parent at rmdir() KAMEZAWA Hiroyuki
@ 2008-11-05  8:20 ` KAMEZAWA Hiroyuki
  2008-11-07  8:53   ` Daisuke Nishimura
  2008-11-05  8:20 ` [RFC][PATCH 3/6] memcg : mem+swap controller kconfig KAMEZAWA Hiroyuki
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-05  8:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

SwapCache support for memory resource controller (memcg)

Before mem+swap controller, memcg itself should handle SwapCache in proper way.

In current memcg, SwapCache is just leaked and the user can create tons of
SwapCache. This is a leak of account and should be handled.

SwapCache accounting is done as following.

  charge (anon)
	- charged when it's mapped.
	  (because of readahead, charge at add_to_swap_cache() is not sane)
  uncharge (anon)
	- uncharged when it's dropped from swapcache and fully unmapped.
	  means it's not uncharged at unmap.
	  Note: delete from swap cache at swap-in is done after rmap information
	        is established.
  charge (shmem)
	- charged at swap-in. this prevents charge at add_to_page_cache().

  uncharge (shmem)
	- uncharged when it's dropped from swapcache and not on shmem's
	  radix-tree.

  at migration, check against 'old page' is modified to handle shmem.

Comparing to the old version discussed (and caused troubles), we have
advantages of
  - PCG_USED bit.
  - simple migrating handling.

So, situation is much easier than several months ago, maybe.

Changelog (v1) -> (v2)
  - use lock_page() when we handle unlocked SwapCache.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 Documentation/controllers/memory.txt |    5 ++
 include/linux/swap.h                 |   16 ++++++++
 mm/memcontrol.c                      |   67 +++++++++++++++++++++++++++++++----
 mm/shmem.c                           |   18 ++++++++-
 mm/swap_state.c                      |    1 
 5 files changed, 99 insertions(+), 8 deletions(-)

Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -21,6 +21,7 @@
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
 #include <linux/mm.h>
+#include <linux/pagemap.h>
 #include <linux/smp.h>
 #include <linux/page-flags.h>
 #include <linux/backing-dev.h>
@@ -140,6 +141,7 @@ enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
 	MEM_CGROUP_CHARGE_TYPE_SHMEM,	/* used by page migration of shmem */
 	MEM_CGROUP_CHARGE_TYPE_FORCE,	/* used by force_empty */
+	MEM_CGROUP_CHARGE_TYPE_SWAPOUT,	/* for accounting swapcache */
 	NR_CHARGE_TYPE,
 };
 
@@ -781,6 +783,33 @@ int mem_cgroup_cache_charge(struct page 
 				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
 }
 
+#ifdef CONFIG_SWAP
+int mem_cgroup_cache_charge_swapin(struct page *page,
+			struct mm_struct *mm, gfp_t mask, bool locked)
+{
+	int ret = 0;
+
+	if (mem_cgroup_subsys.disabled)
+		return 0;
+	if (unlikely(!mm))
+		mm = &init_mm;
+	if (!locked)
+		lock_page(page);
+	/*
+	 * If not locked, the page can be dropped from SwapCache until
+	 * we reach here.
+	 */
+	if (PageSwapCache(page)) {
+		ret = mem_cgroup_charge_common(page, mm, mask,
+				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
+	}
+	if (!locked)
+		unlock_page(page);
+
+	return ret;
+}
+#endif
+
 void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
 {
 	struct page_cgroup *pc;
@@ -818,6 +847,9 @@ __mem_cgroup_uncharge_common(struct page
 	if (mem_cgroup_subsys.disabled)
 		return;
 
+	if (PageSwapCache(page))
+		return;
+
 	/*
 	 * Check if our page_cgroup is valid
 	 */
@@ -826,12 +858,26 @@ __mem_cgroup_uncharge_common(struct page
 		return;
 
 	lock_page_cgroup(pc);
-	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED && page_mapped(page))
-	     || !PageCgroupUsed(pc)) {
-		/* This happens at race in zap_pte_range() and do_swap_page()*/
-		unlock_page_cgroup(pc);
-		return;
+
+	if (!PageCgroupUsed(pc))
+		goto unlock_out;
+
+	switch(ctype) {
+	case MEM_CGROUP_CHARGE_TYPE_MAPPED:
+		if (page_mapped(page))
+			goto unlock_out;
+		break;
+	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
+		if (!PageAnon(page)) {	/* Shared memory */
+			if (page->mapping && !page_is_file_cache(page))
+				goto unlock_out;
+		} else if (page_mapped(page)) /* Anon */
+				goto unlock_out;
+		break;
+	default:
+		break;
 	}
+
 	ClearPageCgroupUsed(pc);
 	mem = pc->mem_cgroup;
 
@@ -845,6 +891,10 @@ __mem_cgroup_uncharge_common(struct page
 	css_put(&mem->css);
 
 	return;
+
+unlock_out:
+	unlock_page_cgroup(pc);
+	return;
 }
 
 void mem_cgroup_uncharge_page(struct page *page)
@@ -864,6 +914,11 @@ void mem_cgroup_uncharge_cache_page(stru
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
 }
 
+void mem_cgroup_uncharge_swapcache(struct page *page)
+{
+	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+}
+
 /*
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page belongs to.
@@ -921,7 +976,7 @@ void mem_cgroup_end_migration(struct mem
 		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
 
 	/* unused page is not on radix-tree now. */
-	if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
+	if (unused)
 		__mem_cgroup_uncharge_common(unused, ctype);
 
 	pc = lookup_page_cgroup(target);
Index: mmotm-2.6.28-rc2+/mm/swap_state.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swap_state.c
+++ mmotm-2.6.28-rc2+/mm/swap_state.c
@@ -119,6 +119,7 @@ void __delete_from_swap_cache(struct pag
 	total_swapcache_pages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
+	mem_cgroup_uncharge_swapcache(page);
 }
 
 /**
Index: mmotm-2.6.28-rc2+/include/linux/swap.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/swap.h
+++ mmotm-2.6.28-rc2+/include/linux/swap.h
@@ -332,6 +332,22 @@ static inline void disable_swap_token(vo
 	put_swap_token(swap_token_mm);
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+extern int mem_cgroup_cache_charge_swapin(struct page *page,
+				struct mm_struct *mm, gfp_t mask, bool locked);
+extern void mem_cgroup_uncharge_swapcache(struct page *page);
+#else
+static inline
+int mem_cgroup_cache_charge_swapin(struct page *page,
+				struct mm_struct *mm, gfp_t mask, bool locked)
+{
+	return 0;
+}
+static inline void mem_cgroup_uncharge_swapcache(struct page *page)
+{
+}
+#endif
+
 #else /* CONFIG_SWAP */
 
 #define total_swap_pages			0
Index: mmotm-2.6.28-rc2+/mm/shmem.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/shmem.c
+++ mmotm-2.6.28-rc2+/mm/shmem.c
@@ -920,8 +920,12 @@ found:
 	error = 1;
 	if (!inode)
 		goto out;
-	/* Charge page using GFP_HIGHUSER_MOVABLE while we can wait */
-	error = mem_cgroup_cache_charge(page, current->mm, GFP_HIGHUSER_MOVABLE);
+	/*
+         * Charge page using GFP_HIGHUSER_MOVABLE while we can wait.
+         * charged back to the user(not to caller) when swap account is used.
+         */
+	error = mem_cgroup_cache_charge_swapin(page,
+			current->mm, GFP_HIGHUSER_MOVABLE, true);
 	if (error)
 		goto out;
 	error = radix_tree_preload(GFP_KERNEL);
@@ -1258,6 +1262,16 @@ repeat:
 				goto repeat;
 			}
 			wait_on_page_locked(swappage);
+			/*
+			 * We want to avoid charge at add_to_page_cache().
+			 * charge against this swap cache here.
+			 */
+			if (mem_cgroup_cache_charge_swapin(swappage,
+						current->mm, gfp, false)) {
+				page_cache_release(swappage);
+				error = -ENOMEM;
+				goto failed;
+			}
 			page_cache_release(swappage);
 			goto repeat;
 		}
Index: mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-rc2+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
@@ -137,6 +137,11 @@ behind this approach is that a cgroup th
 page will eventually get charged for it (once it is uncharged from
 the cgroup that brought it in -- this will happen on memory pressure).
 
+Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
+be backed into memory in force, charges for pages are accounted against the
+caller of swapoff rather than the users of shmem.
+
+
 2.4 Reclaim
 
 Each cgroup maintains a per cgroup LRU that consists of an active


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC][PATCH 3/6] memcg : mem+swap controller kconfig
  2008-11-05  8:16 [RFC][PATCH 0/6] memcg updates (05/Nov) KAMEZAWA Hiroyuki
  2008-11-05  8:18 ` [RFC][PATCH 1/6] memcg: move all accounts to parent at rmdir() KAMEZAWA Hiroyuki
  2008-11-05  8:20 ` [RFC][PATCH 2/6] memcg: handle swap cache KAMEZAWA Hiroyuki
@ 2008-11-05  8:20 ` KAMEZAWA Hiroyuki
  2008-11-06 11:07   ` Daisuke Nishimura
  2008-11-05  8:21 ` [RFC][PATCH 4/6] memcg : swap cgroup KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-05  8:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

Experimental.

Config and control variable for mem+swap controller.

This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
(memory resource controller swap extension.)

For accounting swap, it's obvious that we have to use additional memory
to remember "who uses swap". This adds more overhead.
So, it's better to offer "choice" to users. This patch adds 2 choices.

This patch adds 2 parameters to enable swap extension or not.
  - CONFIG
  - boot option

Changelog: v1 -> v2
 - fixed typo.
 - make default value of "do_swap_account" to be 0 and turned on 1
   later if configured.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 Documentation/kernel-parameters.txt |    3 +++
 include/linux/memcontrol.h          |    3 +++
 init/Kconfig                        |   17 +++++++++++++++++
 mm/memcontrol.c                     |   32 ++++++++++++++++++++++++++++++++
 4 files changed, 55 insertions(+)

Index: mmotm-2.6.28-rc2+/init/Kconfig
===================================================================
--- mmotm-2.6.28-rc2+.orig/init/Kconfig
+++ mmotm-2.6.28-rc2+/init/Kconfig
@@ -428,6 +428,23 @@ config CGROUP_MEM_RES_CTLR
 config MM_OWNER
 	bool
 
+config CGROUP_MEM_RES_CTLR_SWAP
+	bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
+	depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
+	help
+	  Add swap management feature to memory resource controller. When you
+	  enable this, you can limit mem+swap usage per cgroup. In other words,
+	  when you disable this, memory resource controller has no cares to
+	  usage of swap...a process can exhaust all of the swap. This extension
+	  is useful when you want to avoid exhaustion swap but this itself
+	  adds more overheads and consumes memory for remembering information.
+	  Especially if you use 32bit system or small memory system, please
+	  be careful about enabling this. When memory resource controller
+	  is disabled by boot option, this will be automatically disabled and
+	  there will be no overhead from this. Even when you set this config=y,
+	  if boot option "noswapaccount" is set, swap will not be accounted.
+
+
 endmenu
 
 config SYSFS_DEPRECATED
Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -41,6 +41,15 @@
 struct cgroup_subsys mem_cgroup_subsys __read_mostly;
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/* Turned on only when memory cgroup is enabled && really_do_swap_account = 0 */
+int do_swap_account __read_mostly;
+static int really_do_swap_account __initdata = 1; /* for remember boot option*/
+#else
+#define do_swap_account		(0)
+#endif
+
+
 /*
  * Statistics for memory cgroup.
  */
@@ -1370,6 +1379,18 @@ static void mem_cgroup_free(struct mem_c
 }
 
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+static void __init enable_swap_cgroup(void)
+{
+	if (!mem_cgroup_subsys.disabled && really_do_swap_account)
+		do_swap_account = 1;
+}
+#else
+static void __init enable_swap_cgroup(void)
+{
+}
+#endif
+
 static struct cgroup_subsys_state *
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -1378,6 +1399,7 @@ mem_cgroup_create(struct cgroup_subsys *
 
 	if (unlikely((cont->parent) == NULL)) {
 		mem = &init_mem_cgroup;
+		enable_swap_cgroup();
 	} else {
 		mem = mem_cgroup_alloc();
 		if (!mem)
@@ -1461,3 +1483,13 @@ struct cgroup_subsys mem_cgroup_subsys =
 	.attach = mem_cgroup_move_task,
 	.early_init = 0,
 };
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+static int __init disable_swap_account(char *s)
+{
+	really_do_swap_account = 0;
+	return 1;
+}
+__setup("noswapaccount", disable_swap_account);
+#endif
Index: mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
===================================================================
--- mmotm-2.6.28-rc2+.orig/Documentation/kernel-parameters.txt
+++ mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
@@ -1543,6 +1543,9 @@ and is between 256 and 4096 characters. 
 
 	nosoftlockup	[KNL] Disable the soft-lockup detector.
 
+	noswapaccount	[KNL] Disable accounting of swap in memory resource
+			controller. (See Documentation/controllers/memory.txt)
+
 	nosync		[HW,M68K] Disables sync negotiation for all devices.
 
 	notsc		[BUGS=X86-32] Disable Time Stamp Counter
Index: mmotm-2.6.28-rc2+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-rc2+/include/linux/memcontrol.h
@@ -77,6 +77,9 @@ extern void mem_cgroup_record_reclaim_pr
 extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
 					int priority, enum lru_list lru);
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern int do_swap_account;
+#endif
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC][PATCH 4/6] memcg : swap cgroup
  2008-11-05  8:16 [RFC][PATCH 0/6] memcg updates (05/Nov) KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2008-11-05  8:20 ` [RFC][PATCH 3/6] memcg : mem+swap controller kconfig KAMEZAWA Hiroyuki
@ 2008-11-05  8:21 ` KAMEZAWA Hiroyuki
  2008-11-06 11:25   ` Daisuke Nishimura
  2008-11-05  8:23 ` [RFC][PATCH 5/6] memcg: mem+swap controller KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-05  8:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

For accounting swap, we need a record per swap entry, at least.

This patch adds following function.
  - swap_cgroup_swapon() .... called from swapon
  - swap_cgroup_swapoff() ... called at the end of swapoff.

  - swap_cgroup_record() .... record information of swap entry.
  - swap_cgroup_lookup() .... lookup information of swap entry.

This patch just implements "how to record information". No actual
method for limit the usage of swap. These routine uses flat table
to record and lookup. "wise" lookup system like radix-tree requires
requires memory allocation at new records but swap-out is usually
called under memory shortage (or memcg hits limit.)
So, I used static allocation. (maybe dynamic allocation is not very
hard but it adds additional memory allocation in memory shortage path.)


Note1: In this, we use pointer to record information and this means
      8bytes per swap entry. I think we can reduce this when we
      create "id of cgroup" in the range of 0-65535 or 0-255.

Note2: array of swap_cgroup is allocated from HIGHMEM. maybe good for x86-32.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/page_cgroup.h |   35 +++++++
 mm/page_cgroup.c            |  201 ++++++++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c               |    8 +
 3 files changed, 244 insertions(+)

Index: mmotm-2.6.28-rc2+/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-rc2+/mm/page_cgroup.c
@@ -8,6 +8,8 @@
 #include <linux/memory.h>
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
+#include <linux/swapops.h>
+#include <linux/highmem.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -254,3 +256,202 @@ void __init pgdat_page_cgroup_init(struc
 }
 
 #endif
+
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+DEFINE_MUTEX(swap_cgroup_mutex);
+struct swap_cgroup_ctrl {
+	spinlock_t lock;
+	struct page **map;
+	unsigned long length;
+};
+
+struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
+
+/*
+ * This 8bytes seems big..maybe we can reduce this when we can use "id" for
+ * cgroup rather than pointer.
+ */
+struct swap_cgroup {
+	struct mem_cgroup	*val;
+};
+#define SC_PER_PAGE	(PAGE_SIZE/sizeof(struct swap_cgroup))
+#define SC_POS_MASK	(SC_PER_PAGE - 1)
+
+/*
+ * allocate buffer for swap_cgroup.
+ */
+static int swap_cgroup_prepare(int type)
+{
+	struct page *page;
+	struct swap_cgroup_ctrl *ctrl;
+	unsigned long idx, max;
+
+	if (!do_swap_account)
+		return 0;
+	ctrl = &swap_cgroup_ctrl[type];
+
+	for (idx = 0; idx < ctrl->length; idx++) {
+		page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
+		if (!page)
+			goto not_enough_page;
+		ctrl->map[idx] = page;
+	}
+	return 0;
+not_enough_page:
+	max = idx;
+	for (idx = 0; idx < max; idx++)
+		__free_page(ctrl->map[idx]);
+
+	return -ENOMEM;
+}
+
+/**
+ * swap_cgroup_record - record mem_cgroup for this swp_entry.
+ * @ent: swap entry to be recorded into
+ * @mem: mem_cgroup to be recorded
+ *
+ * Returns old value at success, NULL at failure.
+ * (Of course, old value can be NULL.)
+ */
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+	unsigned long flags;
+	int type = swp_type(ent);
+	unsigned long offset = swp_offset(ent);
+	unsigned long idx = offset / SC_PER_PAGE;
+	unsigned long pos = offset & SC_POS_MASK;
+	struct swap_cgroup_ctrl *ctrl;
+	struct page *mappage;
+	struct swap_cgroup *sc;
+	struct mem_cgroup *old;
+
+	if (!do_swap_account)
+		return NULL;
+
+	ctrl = &swap_cgroup_ctrl[type];
+
+	mappage = ctrl->map[idx];
+	spin_lock_irqsave(&ctrl->lock, flags);
+	sc = kmap_atomic(mappage, KM_USER0);
+	sc += pos;
+	old = sc->val;
+	sc->val = mem;
+	kunmap_atomic(mappage, KM_USER0);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+	return old;
+}
+
+/**
+ * lookup_swap_cgroup - lookup mem_cgroup tied to swap entry
+ * @ent: swap entry to be looked up.
+ *
+ * Returns pointer to mem_cgroup at success. NULL at failure.
+ */
+struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+	int type = swp_type(ent);
+	unsigned long flags;
+	unsigned long offset = swp_offset(ent);
+	unsigned long idx = offset / SC_PER_PAGE;
+	unsigned long pos = offset & SC_POS_MASK;
+	struct swap_cgroup_ctrl *ctrl;
+	struct page *mappage;
+	struct swap_cgroup *sc;
+	struct mem_cgroup *ret;
+
+	if (!do_swap_account)
+		return NULL;
+
+	ctrl = &swap_cgroup_ctrl[type];
+
+	mappage = ctrl->map[idx];
+
+	spin_lock_irqsave(&ctrl->lock, flags);
+	sc = kmap_atomic(mappage, KM_USER0);
+	sc += pos;
+	ret = sc->val;
+	kunmap_atomic(mapppage, KM_USER0);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+	return ret;
+}
+
+int swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+	void *array;
+	unsigned long array_size;
+	unsigned long length;
+	struct swap_cgroup_ctrl *ctrl;
+
+	if (!do_swap_account)
+		return 0;
+
+	length = ((max_pages/SC_PER_PAGE) + 1);
+	array_size = length * sizeof(void *);
+
+	array = vmalloc(array_size);
+	if (!array)
+		goto nomem;
+
+	memset(array, 0, array_size);
+	ctrl = &swap_cgroup_ctrl[type];
+	mutex_lock(&swap_cgroup_mutex);
+	ctrl->length = length;
+	ctrl->map = array;
+	if (swap_cgroup_prepare(type)) {
+		/* memory shortage */
+		ctrl->map = NULL;
+		ctrl->length = 0;
+		vfree(array);
+		mutex_unlock(&swap_cgroup_mutex);
+		goto nomem;
+	}
+	mutex_unlock(&swap_cgroup_mutex);
+
+	printk(KERN_INFO
+		"swap_cgroup: uses %ld bytes vmalloc and %ld bytes buffres\n",
+		array_size, length * PAGE_SIZE);
+	printk(KERN_INFO
+	"swap_cgroup can be disabled by noswapaccount boot option.\n");
+
+	return 0;
+nomem:
+	printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
+	printk(KERN_INFO
+		"swap_cgroup can be disabled by noswapaccount boot option\n");
+	return -ENOMEM;
+}
+
+void swap_cgroup_swapoff(int type)
+{
+	int i;
+	struct swap_cgroup_ctrl *ctrl;
+
+	if (!do_swap_account)
+		return;
+
+	mutex_lock(&swap_cgroup_mutex);
+	if (ctrl->map) {
+		ctrl = &swap_cgroup_ctrl[type];
+		for (i = 0; i < ctrl->length; i++) {
+			struct page *page = ctrl->map[i];
+			if (page)
+				__free_page(page);
+		}
+		vfree(ctrl->map);
+		ctrl->map = NULL;
+		ctrl->length = 0;
+	}
+	mutex_unlock(&swap_cgroup_mutex);
+}
+
+static int __init swap_cgroup_init(void)
+{
+	int i;
+	for (i = 0; i < MAX_SWAPFILES; i++)
+		spin_lock_init(&swap_cgroup_ctrl[i].lock);
+	return 0;
+}
+late_initcall(swap_cgroup_init);
+#endif
Index: mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
@@ -105,4 +105,39 @@ static inline void page_cgroup_init(void
 }
 
 #endif
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#include <linux/swap.h>
+extern struct mem_cgroup *
+swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem);
+extern struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent);
+extern int swap_cgroup_swapon(int type, unsigned long max_pages);
+extern void swap_cgroup_swapoff(int type);
+#else
+#include <linux/swap.h>
+
+static inline
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+	return NULL;
+}
+
+static inline
+struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+	return NULL;
+}
+
+static inline int
+swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+	return 0;
+}
+
+static inline void swap_cgroup_swapoff(int type)
+{
+	return;
+}
+
+#endif
 #endif
Index: mmotm-2.6.28-rc2+/mm/swapfile.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swapfile.c
+++ mmotm-2.6.28-rc2+/mm/swapfile.c
@@ -32,6 +32,7 @@
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 #include <linux/swapops.h>
+#include <linux/page_cgroup.h>
 
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -1345,6 +1346,9 @@ asmlinkage long sys_swapoff(const char _
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	vfree(swap_map);
+	/* Destroy swap account informatin */
+	swap_cgroup_swapoff(type);
+
 	inode = mapping->host;
 	if (S_ISBLK(inode->i_mode)) {
 		struct block_device *bdev = I_BDEV(inode);
@@ -1669,6 +1673,10 @@ asmlinkage long sys_swapon(const char __
 		nr_good_pages = swap_header->info.last_page -
 				swap_header->info.nr_badpages -
 				1 /* header page */;
+
+		if (!error)
+			error = swap_cgroup_swapon(type, maxpages);
+
 		if (error)
 			goto bad_swap;
 	}


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC][PATCH 5/6] memcg: mem+swap controller
  2008-11-05  8:16 [RFC][PATCH 0/6] memcg updates (05/Nov) KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2008-11-05  8:21 ` [RFC][PATCH 4/6] memcg : swap cgroup KAMEZAWA Hiroyuki
@ 2008-11-05  8:23 ` KAMEZAWA Hiroyuki
  2008-11-07  9:02   ` Daisuke Nishimura
                     ` (2 more replies)
  2008-11-05  8:24 ` [RFC][PATCH 6/6] memcg: synchronized LRU KAMEZAWA Hiroyuki
  2008-11-06  6:54 ` [RFC][PATCH 0/6] memcg updates (05/Nov) Balbir Singh
  6 siblings, 3 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-05  8:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

Mem+Swap controller core.

This patch implements per cgroup limit for usage of memory+swap.
However there are SwapCache, double counting of swap-cache and
swap-entry is avoided.

Mem+Swap controller works as following.
  - memory usage is limited by memory.limit_in_bytes.
  - memory + swap usage is limited by memory.memsw_limit_in_bytes.


This has following benefits.
  - A user can limit total resource usage of mem+swap.

    Without this, because memory resource controller doesn't take care of
    usage of swap, a process can exhaust all the swap (by memory leak.)
    We can avoid this case.

    And Swap is shared resource but it cannot be reclaimed (goes back to memory)
    until it's used. This characteristic can be trouble when the memory
    is divided into some parts by cpuset or memcg.
    Assume group A and group B.
    After some application executes, the system can be..
    
    Group A -- very large free memory space but occupy 99% of swap.
    Group B -- under memory shortage but cannot use swap...it's nearly full.

    Ability to set appropriate swap limit for each group is required.
      
Maybe someone wonder "why not swap but mem+swap ?"

  - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
    to move account from memory to swap...there is no change in usage of
    mem+swap.

    In other words, when we want to limit the usage of swap without affecting
    global LRU, mem+swap limit is better than just limiting swap.


Accounting target information is stored in swap_cgroup which is
per swap entry record.

Charge is done as following.
  map
    - charge  page and memsw.

  unmap
    - uncharge page/memsw if not SwapCache.

  swap-out (__delete_from_swap_cache)
    - uncharge page
    - record mem_cgroup information to swap_cgroup.

  swap-in (do_swap_page)
    - charged as page and memsw.
      record in swap_cgroup is cleared.
      memsw accounting is decremented.

  swap-free (swap_free())
    - if swap entry is freed, memsw is uncharged by PAGE_SIZE.


After this, usual memory resource controller handles SwapCache.
(It was lacked(ignored) feature in current memcg but must be handled.)

There are people work under never-swap environments and consider swap as
something bad. For such people, this mem+swap controller extension is just an
overhead.  This overhead is avoided by config or boot option.
(see Kconfig. detail is not in this patch.)

Changelog: v1 -> v2
 - fixed typos
 - fixed migration of anon pages.
 - fixed uncharge to check USED bit always.
 - code for swapcache is moved to another patch.
 - added "noswap" argument to try_to_free_mem_cgroup_pages
 - fixed lock_page around mem_cgroup_charge_cache_swap()
 - fixed failcnt file.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 Documentation/controllers/memory.txt |   29 ++
 include/linux/memcontrol.h           |   11 -
 include/linux/swap.h                 |   14 +
 mm/memcontrol.c                      |  340 +++++++++++++++++++++++++++++++----
 mm/memory.c                          |    3 
 mm/swap_state.c                      |    5 
 mm/swapfile.c                        |   11 -
 mm/vmscan.c                          |    6 
 8 files changed, 375 insertions(+), 44 deletions(-)

Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -132,6 +132,10 @@ struct mem_cgroup {
 	 */
 	struct res_counter res;
 	/*
+	 * the counter to account for mem+swap usage.
+	 */
+	struct res_counter memsw;
+	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
 	 */
@@ -142,6 +146,12 @@ struct mem_cgroup {
 	 * statistics.
 	 */
 	struct mem_cgroup_stat stat;
+
+	/*
+	 * used for counting reference from swap_cgroup.
+	 */
+	int		obsolete;
+	atomic_t	swapref;
 };
 static struct mem_cgroup init_mem_cgroup;
 
@@ -168,6 +178,16 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
 	0, /* FORCE */
 };
 
+
+/* for encoding cft->private value on file */
+#define _MEM			(0)
+#define _MEMSWAP		(1)
+#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
+#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
+#define MEMFILE_ATTR(val)	((val) & 0xffff)
+
+static void mem_cgroup_forget_swapref(struct mem_cgroup *mem);
+
 /*
  * Always modified under lru lock. Then, not necessary to preempt_disable()
  */
@@ -514,12 +534,25 @@ static int __mem_cgroup_try_charge(struc
 		css_get(&mem->css);
 	}
 
+	while (1) {
+		int ret;
+		bool noswap = false;
 
-	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
+		ret = res_counter_charge(&mem->res, PAGE_SIZE);
+		if (likely(!ret)) {
+			if (!do_swap_account)
+				break;
+			ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
+			if (likely(!ret))
+				break;
+			/* mem+swap counter fails */
+			res_counter_uncharge(&mem->res, PAGE_SIZE);
+			noswap = true;
+		}
 		if (!(gfp_mask & __GFP_WAIT))
 			goto nomem;
 
-		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
+		if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
 			continue;
 
 		/*
@@ -528,8 +561,13 @@ static int __mem_cgroup_try_charge(struc
 		 * moved to swap cache or just unmapped from the cgroup.
 		 * Check the limit again to see if the reclaim reduced the
 		 * current usage of the cgroup before giving up
+		 *
 		 */
-		if (res_counter_check_under_limit(&mem->res))
+		if (!do_swap_account &&
+			res_counter_check_under_limit(&mem->res))
+			continue;
+		if (do_swap_account &&
+			res_counter_check_under_limit(&mem->memsw))
 			continue;
 
 		if (!nr_retries--) {
@@ -583,6 +621,8 @@ static void __mem_cgroup_commit_charge(s
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
 		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		if (do_swap_account)
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
 		css_put(&mem->css);
 		return;
 	}
@@ -647,6 +687,8 @@ static int mem_cgroup_move_account(struc
 		__mem_cgroup_remove_list(from_mz, pc);
 		css_put(&from->css);
 		res_counter_uncharge(&from->res, PAGE_SIZE);
+		if (do_swap_account)
+			res_counter_uncharge(&from->memsw, PAGE_SIZE);
 		pc->mem_cgroup = to;
 		css_get(&to->css);
 		__mem_cgroup_add_list(to_mz, pc, false);
@@ -693,8 +735,11 @@ static int mem_cgroup_move_parent(struct
 	/* drop extra refcnt */
 	css_put(&parent->css);
 	/* uncharge if move fails */
-	if (ret)
+	if (ret) {
 		res_counter_uncharge(&parent->res, PAGE_SIZE);
+		if (do_swap_account)
+			res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+	}
 
 	return ret;
 }
@@ -792,7 +837,34 @@ int mem_cgroup_cache_charge(struct page 
 				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
 }
 
+int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+				 struct page *page,
+				 gfp_t mask, struct mem_cgroup **ptr)
+{
+	struct mem_cgroup *mem;
+	swp_entry_t     ent;
+
+	if (mem_cgroup_subsys.disabled)
+		return 0;
+
+	if (!do_swap_account)
+		goto charge_cur_mm;
+
+	ent.val = page_private(page);
+
+	mem = lookup_swap_cgroup(ent);
+	if (!mem || mem->obsolete)
+		goto charge_cur_mm;
+	*ptr = mem;
+	return __mem_cgroup_try_charge(NULL, mask, ptr, true);
+charge_cur_mm:
+	if (unlikely(!mm))
+		mm = &init_mm;
+	return __mem_cgroup_try_charge(mm, mask, ptr, true);
+}
+
 #ifdef CONFIG_SWAP
+
 int mem_cgroup_cache_charge_swapin(struct page *page,
 			struct mm_struct *mm, gfp_t mask, bool locked)
 {
@@ -809,8 +881,28 @@ int mem_cgroup_cache_charge_swapin(struc
 	 * we reach here.
 	 */
 	if (PageSwapCache(page)) {
+		struct mem_cgroup *mem = NULL;
+		swp_entry_t ent;
+
+		ent.val = page_private(page);
+		if (do_swap_account) {
+			mem = lookup_swap_cgroup(ent);
+			if (mem && mem->obsolete)
+				mem = NULL;
+			if (mem)
+				mm = NULL;
+		}
 		ret = mem_cgroup_charge_common(page, mm, mask,
-				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
+				MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
+
+		if (!ret && do_swap_account) {
+			/* avoid double counting */
+			mem = swap_cgroup_record(ent, NULL);
+			if (mem) {
+				res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+				mem_cgroup_forget_swapref(mem);
+			}
+		}
 	}
 	if (!locked)
 		unlock_page(page);
@@ -829,6 +921,23 @@ void mem_cgroup_commit_charge_swapin(str
 		return;
 	pc = lookup_page_cgroup(page);
 	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
+	/*
+	 * Now swap is on-memory. This means this page may be
+	 * counted both as mem and swap....double count.
+	 * Fix it by uncharging from memsw. This SwapCache is stable
+	 * because we're still under lock_page().
+	 */
+	if (do_swap_account) {
+		swp_entry_t ent = {.val = page_private(page)};
+		struct mem_cgroup *memcg;
+		memcg = swap_cgroup_record(ent, NULL);
+		if (memcg) {
+			/* If memcg is obsolete, memcg can be != ptr */
+			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+			mem_cgroup_forget_swapref(memcg);
+		}
+
+	}
 }
 
 void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
@@ -838,6 +947,7 @@ void mem_cgroup_cancel_charge_swapin(str
 	if (!mem)
 		return;
 	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	res_counter_uncharge(&mem->memsw, PAGE_SIZE);
 	css_put(&mem->css);
 }
 
@@ -845,29 +955,31 @@ void mem_cgroup_cancel_charge_swapin(str
 /*
  * uncharge if !page_mapped(page)
  */
-static void
+static struct mem_cgroup *
 __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 {
 	struct page_cgroup *pc;
-	struct mem_cgroup *mem;
+	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
 	unsigned long flags;
 
 	if (mem_cgroup_subsys.disabled)
-		return;
+		return NULL;
 
 	if (PageSwapCache(page))
-		return;
+		return NULL;
 
 	/*
 	 * Check if our page_cgroup is valid
 	 */
 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc || !PageCgroupUsed(pc)))
-		return;
+		return NULL;
 
 	lock_page_cgroup(pc);
 
+	mem = pc->mem_cgroup;
+
 	if (!PageCgroupUsed(pc))
 		goto unlock_out;
 
@@ -887,8 +999,11 @@ __mem_cgroup_uncharge_common(struct page
 		break;
 	}
 
+	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
+		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+
 	ClearPageCgroupUsed(pc);
-	mem = pc->mem_cgroup;
 
 	mz = page_cgroup_zoneinfo(pc);
 	spin_lock_irqsave(&mz->lru_lock, flags);
@@ -896,14 +1011,13 @@ __mem_cgroup_uncharge_common(struct page
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 	unlock_page_cgroup(pc);
 
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
 	css_put(&mem->css);
 
-	return;
+	return mem;
 
 unlock_out:
 	unlock_page_cgroup(pc);
-	return;
+	return NULL;
 }
 
 void mem_cgroup_uncharge_page(struct page *page)
@@ -923,10 +1037,42 @@ void mem_cgroup_uncharge_cache_page(stru
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
 }
 
-void mem_cgroup_uncharge_swapcache(struct page *page)
+/*
+ * called from __delete_from_swap_cache() and drop "page" account.
+ * memcg information is recorded to swap_cgroup of "ent"
+ */
+void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
 {
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+	struct mem_cgroup *memcg;
+
+	memcg = __mem_cgroup_uncharge_common(page,
+					MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+	/* record memcg information */
+	if (do_swap_account && memcg) {
+		swap_cgroup_record(ent, memcg);
+		atomic_inc(&memcg->swapref);
+	}
+}
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/*
+ * called from swap_entry_free(). remove record in swap_cgroup and
+ * uncharge "memsw" account.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t ent)
+{
+	struct mem_cgroup *memcg;
+
+	if (!do_swap_account)
+		return;
+
+	memcg = swap_cgroup_record(ent, NULL);
+	if (memcg) {
+		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		mem_cgroup_forget_swapref(memcg);
+	}
 }
+#endif
 
 /*
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
@@ -1035,7 +1181,7 @@ int mem_cgroup_shrink_usage(struct mm_st
 	rcu_read_unlock();
 
 	do {
-		progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
+		progress = try_to_free_mem_cgroup_pages(mem, gfp_mask, true);
 		progress += res_counter_check_under_limit(&mem->res);
 	} while (!progress && --retry);
 
@@ -1062,13 +1208,55 @@ int mem_cgroup_resize_limit(struct mem_c
 			break;
 		}
 		progress = try_to_free_mem_cgroup_pages(memcg,
-				GFP_HIGHUSER_MOVABLE);
+				GFP_HIGHUSER_MOVABLE, false);
 		if (!progress)
 			retry_count--;
 	}
 	return ret;
 }
 
+int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
+				unsigned long long val)
+{
+	int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
+	unsigned long flags;
+	u64 memlimit, oldusage, curusage;
+	int ret;
+
+	if (!do_swap_account)
+		return -EINVAL;
+
+	while (retry_count) {
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+		/*
+		 * Rather than hide all in some function, I do this in
+		 * open coded manner. You see what this really does.
+		 * We have to guarantee mem->res.limit < mem->memsw.limit.
+		 */
+		spin_lock_irqsave(&memcg->res.lock, flags);
+		memlimit = memcg->res.limit;
+		if (memlimit > val) {
+			spin_unlock_irqrestore(&memcg->res.lock, flags);
+			ret = -EINVAL;
+			break;
+		}
+		ret = res_counter_set_limit(&memcg->memsw, val);
+		oldusage = memcg->memsw.usage;
+		spin_unlock_irqrestore(&memcg->res.lock, flags);
+
+		if (!ret)
+			break;
+		try_to_free_mem_cgroup_pages(memcg, GFP_HIGHUSER_MOVABLE, true);
+		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+		if (curusage >= oldusage)
+			retry_count--;
+	}
+	return ret;
+}
+
 
 /*
  * This routine traverse page_cgroup in given list and drop them all.
@@ -1180,7 +1368,7 @@ try_to_free:
 	while (nr_retries && mem->res.usage > 0) {
 		int progress;
 		progress = try_to_free_mem_cgroup_pages(mem,
-						  GFP_HIGHUSER_MOVABLE);
+						  GFP_HIGHUSER_MOVABLE, false);
 		if (!progress)
 			nr_retries--;
 
@@ -1194,8 +1382,25 @@ try_to_free:
 
 static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 {
-	return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
-				    cft->private);
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	u64 val = 0;
+	int type, name;
+
+	type = MEMFILE_TYPE(cft->private);
+	name = MEMFILE_ATTR(cft->private);
+	switch (type) {
+	case _MEM:
+		val = res_counter_read_u64(&mem->res, name);
+		break;
+	case _MEMSWAP:
+		if (do_swap_account)
+			val = res_counter_read_u64(&mem->memsw, name);
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return val;
 }
 /*
  * The user of this function is...
@@ -1205,15 +1410,22 @@ static int mem_cgroup_write(struct cgrou
 			    const char *buffer)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	int type, name;
 	unsigned long long val;
 	int ret;
 
-	switch (cft->private) {
+	type = MEMFILE_TYPE(cft->private);
+	name = MEMFILE_ATTR(cft->private);
+	switch (name) {
 	case RES_LIMIT:
 		/* This function does all necessary parse...reuse it */
 		ret = res_counter_memparse_write_strategy(buffer, &val);
-		if (!ret)
+		if (ret)
+			break;
+		if (type == _MEM)
 			ret = mem_cgroup_resize_limit(memcg, val);
+		else
+			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
@@ -1225,14 +1437,23 @@ static int mem_cgroup_write(struct cgrou
 static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 {
 	struct mem_cgroup *mem;
+	int type, name;
 
 	mem = mem_cgroup_from_cont(cont);
-	switch (event) {
+	type = MEMFILE_TYPE(event);
+	name = MEMFILE_ATTR(event);
+	switch (name) {
 	case RES_MAX_USAGE:
-		res_counter_reset_max(&mem->res);
+		if (type == _MEM)
+			res_counter_reset_max(&mem->res);
+		else
+			res_counter_reset_max(&mem->memsw);
 		break;
 	case RES_FAILCNT:
-		res_counter_reset_failcnt(&mem->res);
+		if (type == _MEM)
+			res_counter_reset_failcnt(&mem->res);
+		else
+			res_counter_reset_failcnt(&mem->memsw);
 		break;
 	}
 	return 0;
@@ -1286,30 +1507,33 @@ static int mem_control_stat_show(struct 
 		cb->fill(cb, "unevictable", unevictable * PAGE_SIZE);
 
 	}
+	/* showing refs from disk-swap */
+	cb->fill(cb, "swap_on_disk", atomic_read(&mem_cont->swapref)
+					* PAGE_SIZE);
 	return 0;
 }
 
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
-		.private = RES_USAGE,
+		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
 		.read_u64 = mem_cgroup_read,
 	},
 	{
 		.name = "max_usage_in_bytes",
-		.private = RES_MAX_USAGE,
+		.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
 		.trigger = mem_cgroup_reset,
 		.read_u64 = mem_cgroup_read,
 	},
 	{
 		.name = "limit_in_bytes",
-		.private = RES_LIMIT,
+		.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
 		.write_string = mem_cgroup_write,
 		.read_u64 = mem_cgroup_read,
 	},
 	{
 		.name = "failcnt",
-		.private = RES_FAILCNT,
+		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
 		.trigger = mem_cgroup_reset,
 		.read_u64 = mem_cgroup_read,
 	},
@@ -1317,6 +1541,31 @@ static struct cftype mem_cgroup_files[] 
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+	{
+		.name = "memsw.usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
+		.read_u64 = mem_cgroup_read,
+	},
+	{
+		.name = "memsw.max_usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
+		.trigger = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read,
+	},
+	{
+		.name = "memsw.limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read,
+	},
+	{
+		.name = "memsw.failcnt",
+		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
+		.trigger = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read,
+	},
+#endif
 };
 
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1370,14 +1619,43 @@ static struct mem_cgroup *mem_cgroup_all
 	return mem;
 }
 
+/*
+ * At destroying mem_cgroup, references from swap_cgroup can remain.
+ * (scanning all at force_empty is too costly...)
+ *
+ * Instead of clearing all references at force_empty, we remember
+ * the number of reference from swap_cgroup and free mem_cgroup when
+ * it goes down to 0.
+ *
+ * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and
+ * entry which points to this memcg will be ignore at swapin.
+ *
+ * Removal of cgroup itself succeeds regardless of refs from swap.
+ */
+
 static void mem_cgroup_free(struct mem_cgroup *mem)
 {
+	if (do_swap_account) {
+		if (atomic_read(&mem->swapref) > 0)
+			return;
+	}
 	if (sizeof(*mem) < PAGE_SIZE)
 		kfree(mem);
 	else
 		vfree(mem);
 }
 
+static void mem_cgroup_forget_swapref(struct mem_cgroup *mem)
+{
+	if (!do_swap_account)
+		return;
+	if (atomic_dec_and_test(&mem->swapref)) {
+		if (!mem->obsolete)
+			return;
+		mem_cgroup_free(mem);
+	}
+}
+
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 static void __init enable_swap_cgroup(void)
@@ -1407,6 +1685,7 @@ mem_cgroup_create(struct cgroup_subsys *
 	}
 
 	res_counter_init(&mem->res);
+	res_counter_init(&mem->memsw);
 
 	for_each_node_state(node, N_POSSIBLE)
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
@@ -1425,6 +1704,7 @@ static void mem_cgroup_pre_destroy(struc
 					struct cgroup *cont)
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	mem->obsolete = 1;
 	mem_cgroup_force_empty(mem);
 }
 
Index: mmotm-2.6.28-rc2+/mm/swapfile.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swapfile.c
+++ mmotm-2.6.28-rc2+/mm/swapfile.c
@@ -271,8 +271,9 @@ out:
 	return NULL;
 }	
 
-static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
+static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
 {
+	unsigned long offset = swp_offset(ent);
 	int count = p->swap_map[offset];
 
 	if (count < SWAP_MAP_MAX) {
@@ -287,6 +288,7 @@ static int swap_entry_free(struct swap_i
 				swap_list.next = p - swap_info;
 			nr_swap_pages++;
 			p->inuse_pages--;
+			mem_cgroup_uncharge_swap(ent);
 		}
 	}
 	return count;
@@ -302,7 +304,7 @@ void swap_free(swp_entry_t entry)
 
 	p = swap_info_get(entry);
 	if (p) {
-		swap_entry_free(p, swp_offset(entry));
+		swap_entry_free(p, entry);
 		spin_unlock(&swap_lock);
 	}
 }
@@ -421,7 +423,7 @@ void free_swap_and_cache(swp_entry_t ent
 
 	p = swap_info_get(entry);
 	if (p) {
-		if (swap_entry_free(p, swp_offset(entry)) == 1) {
+		if (swap_entry_free(p, entry) == 1) {
 			page = find_get_page(&swapper_space, entry.val);
 			if (page && !trylock_page(page)) {
 				page_cache_release(page);
@@ -536,7 +538,8 @@ static int unuse_pte(struct vm_area_stru
 	pte_t *pte;
 	int ret = 1;
 
-	if (mem_cgroup_try_charge(vma->vm_mm, GFP_HIGHUSER_MOVABLE, &ptr))
+	if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
+					GFP_HIGHUSER_MOVABLE, &ptr))
 		ret = -ENOMEM;
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
Index: mmotm-2.6.28-rc2+/mm/swap_state.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swap_state.c
+++ mmotm-2.6.28-rc2+/mm/swap_state.c
@@ -17,6 +17,7 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
+#include <linux/page_cgroup.h>
 
 #include <asm/pgtable.h>
 
@@ -108,6 +109,8 @@ int add_to_swap_cache(struct page *page,
  */
 void __delete_from_swap_cache(struct page *page)
 {
+	swp_entry_t ent = {.val = page_private(page)};
+
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!PageSwapCache(page));
 	BUG_ON(PageWriteback(page));
@@ -119,7 +122,7 @@ void __delete_from_swap_cache(struct pag
 	total_swapcache_pages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
-	mem_cgroup_uncharge_swapcache(page);
+	mem_cgroup_uncharge_swapcache(page, ent);
 }
 
 /**
Index: mmotm-2.6.28-rc2+/include/linux/swap.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/swap.h
+++ mmotm-2.6.28-rc2+/include/linux/swap.h
@@ -213,7 +213,7 @@ static inline void lru_cache_add_active_
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
-							gfp_t gfp_mask);
+						gfp_t gfp_mask, bool noswap);
 extern int __isolate_lru_page(struct page *page, int mode, int file);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
@@ -335,7 +335,7 @@ static inline void disable_swap_token(vo
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 extern int mem_cgroup_cache_charge_swapin(struct page *page,
 				struct mm_struct *mm, gfp_t mask, bool locked);
-extern void mem_cgroup_uncharge_swapcache(struct page *page);
+extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
 #else
 static inline
 int mem_cgroup_cache_charge_swapin(struct page *page,
@@ -343,7 +343,15 @@ int mem_cgroup_cache_charge_swapin(struc
 {
 	return 0;
 }
-static inline void mem_cgroup_uncharge_swapcache(struct page *page)
+static inline void
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+{
+}
+#endif
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+#else
+static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
 {
 }
 #endif
Index: mmotm-2.6.28-rc2+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-rc2+/include/linux/memcontrol.h
@@ -32,6 +32,8 @@ extern int mem_cgroup_newpage_charge(str
 /* for swap handling */
 extern int mem_cgroup_try_charge(struct mm_struct *mm,
 		gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+		struct page *page, gfp_t mask, struct mem_cgroup **ptr);
 extern void mem_cgroup_commit_charge_swapin(struct page *page,
 					struct mem_cgroup *ptr);
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
@@ -80,7 +82,6 @@ extern long mem_cgroup_calc_reclaim(stru
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern int do_swap_account;
 #endif
-
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -97,7 +98,13 @@ static inline int mem_cgroup_cache_charg
 }
 
 static inline int mem_cgroup_try_charge(struct mm_struct *mm,
-				gfp_t gfp_mask, struct mem_cgroup **ptr)
+			gfp_t gfp_mask, struct mem_cgroup **ptr)
+{
+	return 0;
+}
+
+static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+		struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
 	return 0;
 }
Index: mmotm-2.6.28-rc2+/mm/memory.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memory.c
+++ mmotm-2.6.28-rc2+/mm/memory.c
@@ -2324,7 +2324,8 @@ static int do_swap_page(struct mm_struct
 	lock_page(page);
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 
-	if (mem_cgroup_try_charge(mm, GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
+	if (mem_cgroup_try_charge_swapin(mm, page,
+				GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
 		ret = VM_FAULT_OOM;
 		unlock_page(page);
 		goto out;
Index: mmotm-2.6.28-rc2+/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/vmscan.c
+++ mmotm-2.6.28-rc2+/mm/vmscan.c
@@ -1669,7 +1669,8 @@ unsigned long try_to_free_pages(struct z
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
-						gfp_t gfp_mask)
+						gfp_t gfp_mask,
+					   bool noswap)
 {
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
@@ -1682,6 +1683,9 @@ unsigned long try_to_free_mem_cgroup_pag
 	};
 	struct zonelist *zonelist;
 
+	if (noswap)
+		sc.may_swap = 0;
+
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
 	zonelist = NODE_DATA(numa_node_id())->node_zonelists;
Index: mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-rc2+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
@@ -137,12 +137,32 @@ behind this approach is that a cgroup th
 page will eventually get charged for it (once it is uncharged from
 the cgroup that brought it in -- this will happen on memory pressure).
 
-Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
+Exception: If CONFIG_CGROUP_CGROUP_MEM_RES_CTLR_SWAP is not used..
+When you do swapoff and make swapped-out pages of shmem(tmpfs) to
 be backed into memory in force, charges for pages are accounted against the
 caller of swapoff rather than the users of shmem.
 
 
-2.4 Reclaim
+2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
+Swap Extension allows you to record charge for swap. A swapped-in page is
+charged back to original page allocator if possible.
+
+When swap is accounted, following files are added.
+ - memory.memsw.usage_in_bytes.
+ - memory.memsw.limit_in_bytes.
+
+usage of mem+swap is limited by memsw.limit_in_bytes.
+
+Note: why 'mem+swap' rather than swap.
+The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
+to move account from memory to swap...there is no change in usage of
+mem+swap.
+
+In other words, when we want to limit the usage of swap without affecting
+global LRU, mem+swap limit is better than just limiting swap from OS point
+of view.
+
+2.5 Reclaim
 
 Each cgroup maintains a per cgroup LRU that consists of an active
 and inactive list. When a cgroup goes over its limit, we first try
@@ -246,6 +266,11 @@ Such charges are moved to its parent as 
 is full. Both of RSS and CACHES are moved to parent.
 If both of them are busy, rmdir() returns -EBUSY.
 
+Charges recorded in swap information is not updated at removal of cgroup.
+Recorded information is effectively discarded and a cgroup which uses swap
+(swapcache) will be charged as a new owner of it.
+
+
 5. TODO
 
 1. Add support for accounting huge pages (as a separate controller)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC][PATCH 6/6] memcg: synchronized LRU
  2008-11-05  8:16 [RFC][PATCH 0/6] memcg updates (05/Nov) KAMEZAWA Hiroyuki
                   ` (4 preceding siblings ...)
  2008-11-05  8:23 ` [RFC][PATCH 5/6] memcg: mem+swap controller KAMEZAWA Hiroyuki
@ 2008-11-05  8:24 ` KAMEZAWA Hiroyuki
  2008-11-06  6:54 ` [RFC][PATCH 0/6] memcg updates (05/Nov) Balbir Singh
  6 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-05  8:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

A big patch for chaning memcg's LRU semantics.

Now,
  - page_cgroup is linked to mem_cgroup's its own LRU (per zone).

  - LRU of page_cgroup is not synchronous with global LRU.

  - page and page_cgroup is one-to-one and statically allocated.

  - To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as
    - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

  - SwapCache is handled.

And, when we handle LRU list of page_cgroup, we do following.

	pc = lookup_page_cgroup(page);
	lock_page_cgroup(pc); .....................(1)
	mz = page_cgroup_zoneinfo(pc);
	spin_lock(&mz->lru_lock);
	.....add to LRU
	spin_unlock(&mz->lru_lock);
	unlock_page_cgroup(pc);

But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock.
So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct.

This is a trial to remove this dirty nesting of locks.
This patch changes mz->lru_lock to be zone->lru_lock.
Then, above sequence will be written as

        spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
	mem_cgroup_add/remove/etc_lru() {
		pc = lookup_page_cgroup(page);
		mz = page_cgroup_zoneinfo(pc);
		if (PageCgroupUsed(pc)) {
			....add to LRU
		}
        spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
	
This is much simpler.
(*) We're safe even if we don't take lock_page_cgroup(pc). Because..
    1. When pc->mem_cgroup can be modified.
       - at charge.
       - at account_move().
    2. at charge
       the PCG_USED bit is not set before pc->mem_cgroup is fixed.
    3. at account_move()
       the page is isolated and not on LRU.

Pros.
  - easy for maintainance.
  - memcg can make use of laziness of pagevec.
  - we don't have to duplicated LRU/Acitve/Unevictable bit in page_cgroup.
  - LRU status of memcg will be synchronized with global LRU's one.
  - # of locks are reduced.
  - account_move() is simplified very much.
Cons.
  - may increase cost of LRU rotation.
    (no impact if memcg is not configured.)

Changelog v0 -> v1
 - fixed statistics.

Signed-off-by: KAMEZAWA Hiruyoki <kamezawa.hiroyu@jp.fujitsu.com>

 fs/splice.c                 |    1 
 include/linux/memcontrol.h  |   29 +++
 include/linux/mm_inline.h   |    3 
 include/linux/page_cgroup.h |   17 --
 mm/memcontrol.c             |  337 +++++++++++++++++++-------------------------
 mm/page_cgroup.c            |    1 
 mm/swap.c                   |    1 
 mm/vmscan.c                 |    9 -
 8 files changed, 186 insertions(+), 212 deletions(-)

Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2+/mm/memcontrol.c
@@ -35,6 +35,7 @@
 #include <linux/vmalloc.h>
 #include <linux/mm_inline.h>
 #include <linux/page_cgroup.h>
+#include "internal.h"
 
 #include <asm/uaccess.h>
 
@@ -99,7 +100,6 @@ struct mem_cgroup_per_zone {
 	/*
 	 * spin_lock to protect the per cgroup LRU
 	 */
-	spinlock_t		lru_lock;
 	struct list_head	lists[NR_LRU_LISTS];
 	unsigned long		count[NR_LRU_LISTS];
 };
@@ -136,17 +136,19 @@ struct mem_cgroup {
 	 */
 	struct res_counter memsw;
 	/*
+	 * statistics.
+	 */
+	struct mem_cgroup_stat stat;
+        /*
 	 * Per cgroup active and inactive list, similar to the
-	 * per zone LRU lists.
+ 	 * per zone LRU lists. This area is read_mostly.
 	 */
 	struct mem_cgroup_lru_info info;
-
-	int	prev_priority;	/* for recording reclaim priority */
 	/*
-	 * statistics.
+	 * Per cgroup active and inactive list, similar to the
+	 * per zone LRU lists.
 	 */
-	struct mem_cgroup_stat stat;
-
+	int	prev_priority;	/* for recording reclaim priority */
 	/*
 	 * used for counting reference from swap_cgroup.
 	 */
@@ -167,14 +169,12 @@ enum charge_type {
 /* only for here (for easy reading.) */
 #define PCGF_CACHE	(1UL << PCG_CACHE)
 #define PCGF_USED	(1UL << PCG_USED)
-#define PCGF_ACTIVE	(1UL << PCG_ACTIVE)
 #define PCGF_LOCK	(1UL << PCG_LOCK)
-#define PCGF_FILE	(1UL << PCG_FILE)
 static const unsigned long
 pcg_default_flags[NR_CHARGE_TYPE] = {
-	PCGF_CACHE | PCGF_FILE | PCGF_USED | PCGF_LOCK, /* File Cache */
-	PCGF_ACTIVE | PCGF_USED | PCGF_LOCK, /* Anon */
-	PCGF_ACTIVE | PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* Shmem */
+	PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* File Cache */
+	PCGF_USED | PCGF_LOCK, /* Anon */
+	PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* Shmem */
 	0, /* FORCE */
 };
 
@@ -188,9 +188,6 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
 
 static void mem_cgroup_forget_swapref(struct mem_cgroup *mem);
 
-/*
- * Always modified under lru lock. Then, not necessary to preempt_disable()
- */
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 					 struct page_cgroup *pc,
 					 bool charge)
@@ -198,10 +195,9 @@ static void mem_cgroup_charge_statistics
 	int val = (charge)? 1 : -1;
 	struct mem_cgroup_stat *stat = &mem->stat;
 	struct mem_cgroup_stat_cpu *cpustat;
+	int cpu = get_cpu();
 
-	VM_BUG_ON(!irqs_disabled());
-
-	cpustat = &stat->cpustat[smp_processor_id()];
+	cpustat = &stat->cpustat[cpu];
 	if (PageCgroupCache(pc))
 		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
 	else
@@ -213,6 +209,7 @@ static void mem_cgroup_charge_statistics
 	else
 		__mem_cgroup_stat_add_safe(cpustat,
 				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
+	put_cpu();
 }
 
 static struct mem_cgroup_per_zone *
@@ -267,80 +264,95 @@ struct mem_cgroup *mem_cgroup_from_task(
 				struct mem_cgroup, css);
 }
 
-static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
-			struct page_cgroup *pc)
-{
-	int lru = LRU_BASE;
+/*
+ * Following LRU functions are allowed to be used without PCG_LOCK.
+ * Operations are called by routine of global LRU independently from memcg.
+ * What we have to take care of here is validness of pc->mem_cgroup.
+ *
+ * Changes to pc->mem_cgroup happens when
+ * 1. charge
+ * 2. moving account
+ * In typical case, "charge" is done before add-to-lru. Exception is SwapCache.
+ * It is added to LRU before charge.
+ * If PCG_USED bit is not set, page_cgroup is not added to this private LRU.
+ * When moving account, the page is not on LRU. It's isolated.
+ */
 
-	if (PageCgroupUnevictable(pc))
-		lru = LRU_UNEVICTABLE;
-	else {
-		if (PageCgroupActive(pc))
-			lru += LRU_ACTIVE;
-		if (PageCgroupFile(pc))
-			lru += LRU_FILE;
-	}
+void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *mem;
+	struct mem_cgroup_per_zone *mz;
 
+	if (mem_cgroup_subsys.disabled)
+		return;
+	pc = lookup_page_cgroup(page);
+	/* can happen while we handle swapcache. */
+	if (list_empty(&pc->lru))
+		return;
+	mz = page_cgroup_zoneinfo(pc);
+	mem = pc->mem_cgroup;
 	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
-
-	mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
-	list_del(&pc->lru);
+	list_del_init(&pc->lru);
+	return;
 }
 
-static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
-				struct page_cgroup *pc, bool hot)
+void mem_cgroup_del_lru(struct page *page)
 {
-	int lru = LRU_BASE;
+	mem_cgroup_del_lru_list(page, page_lru(page));
+}
 
-	if (PageCgroupUnevictable(pc))
-		lru = LRU_UNEVICTABLE;
-	else {
-		if (PageCgroupActive(pc))
-			lru += LRU_ACTIVE;
-		if (PageCgroupFile(pc))
-			lru += LRU_FILE;
-	}
+void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
+{
+	struct mem_cgroup_per_zone *mz;
+	struct page_cgroup *pc;
 
-	MEM_CGROUP_ZSTAT(mz, lru) += 1;
-	if (hot)
-		list_add(&pc->lru, &mz->lists[lru]);
-	else
-		list_add_tail(&pc->lru, &mz->lists[lru]);
+	if (mem_cgroup_subsys.disabled)
+		return;
 
-	mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
+	pc = lookup_page_cgroup(page);
+	smp_rmb();
+	/* unused page is not rotated. */
+	if (!PageCgroupUsed(pc))
+		return;
+	mz = page_cgroup_zoneinfo(pc);
+	list_move(&pc->lru, &mz->lists[lru]);
 }
 
-static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
+void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
 {
-	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
-	int active    = PageCgroupActive(pc);
-	int file      = PageCgroupFile(pc);
-	int unevictable = PageCgroupUnevictable(pc);
-	enum lru_list from = unevictable ? LRU_UNEVICTABLE :
-				(LRU_FILE * !!file + !!active);
+	struct page_cgroup *pc;
+	struct mem_cgroup_per_zone *mz;
 
-	if (lru == from)
+	if (mem_cgroup_subsys.disabled)
+		return;
+	pc = lookup_page_cgroup(page);
+	/* barrier to sync with "charge" */
+	smp_rmb();
+	if (!PageCgroupUsed(pc))
 		return;
 
-	MEM_CGROUP_ZSTAT(mz, from) -= 1;
-	/*
-	 * However this is done under mz->lru_lock, another flags, which
-	 * are not related to LRU, will be modified from out-of-lock.
-	 * We have to use atomic set/clear flags.
-	 */
-	if (is_unevictable_lru(lru)) {
-		ClearPageCgroupActive(pc);
-		SetPageCgroupUnevictable(pc);
-	} else {
-		if (is_active_lru(lru))
-			SetPageCgroupActive(pc);
-		else
-			ClearPageCgroupActive(pc);
-		ClearPageCgroupUnevictable(pc);
-	}
-
+	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
-	list_move(&pc->lru, &mz->lists[lru]);
+	list_add(&pc->lru, &mz->lists[lru]);
+}
+/*
+ * To add swapcache into LRU. Be careful to all this function.
+ * zone->lru_lock shouldn't be held and irq must not be disabled.
+ */
+static void mem_cgroup_lru_fixup(struct page *page)
+{
+	if (!isolate_lru_page(page))
+		putback_lru_page(page);
+}
+
+void mem_cgroup_move_lists(struct page *page,
+			   enum lru_list from, enum lru_list to)
+{
+	if (mem_cgroup_subsys.disabled)
+		return;
+	mem_cgroup_del_lru_list(page, from);
+	mem_cgroup_add_lru_list(page, to);
 }
 
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
@@ -354,37 +366,6 @@ int task_in_mem_cgroup(struct task_struc
 }
 
 /*
- * This routine assumes that the appropriate zone's lru lock is already held
- */
-void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
-{
-	struct page_cgroup *pc;
-	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
-
-	if (mem_cgroup_subsys.disabled)
-		return;
-
-	/*
-	 * We cannot lock_page_cgroup while holding zone's lru_lock,
-	 * because other holders of lock_page_cgroup can be interrupted
-	 * with an attempt to rotate_reclaimable_page.  But we cannot
-	 * safely get to page_cgroup without it, so just try_lock it:
-	 * mem_cgroup_isolate_pages allows for page left on wrong list.
-	 */
-	pc = lookup_page_cgroup(page);
-	if (!trylock_page_cgroup(pc))
-		return;
-	if (pc && PageCgroupUsed(pc)) {
-		mz = page_cgroup_zoneinfo(pc);
-		spin_lock_irqsave(&mz->lru_lock, flags);
-		__mem_cgroup_move_lists(pc, lru);
-		spin_unlock_irqrestore(&mz->lru_lock, flags);
-	}
-	unlock_page_cgroup(pc);
-}
-
-/*
  * Calculate mapped_ratio under memory controller. This will be used in
  * vmscan.c for deteremining we have to reclaim mapped pages.
  */
@@ -463,40 +444,24 @@ unsigned long mem_cgroup_isolate_pages(u
 	mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
 	src = &mz->lists[lru];
 
-	spin_lock(&mz->lru_lock);
 	scan = 0;
 	list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
 		if (scan >= nr_to_scan)
 			break;
+
+		page = pc->page;
 		if (unlikely(!PageCgroupUsed(pc)))
 			continue;
-		page = pc->page;
-
 		if (unlikely(!PageLRU(page)))
 			continue;
 
-		/*
-		 * TODO: play better with lumpy reclaim, grabbing anything.
-		 */
-		if (PageUnevictable(page) ||
-		    (PageActive(page) && !active) ||
-		    (!PageActive(page) && active)) {
-			__mem_cgroup_move_lists(pc, page_lru(page));
-			continue;
-		}
-
 		scan++;
-		list_move(&pc->lru, &pc_list);
-
 		if (__isolate_lru_page(page, mode, file) == 0) {
 			list_move(&page->lru, dst);
 			nr_taken++;
 		}
 	}
 
-	list_splice(&pc_list, src);
-	spin_unlock(&mz->lru_lock);
-
 	*scanned = scan;
 	return nr_taken;
 }
@@ -610,9 +575,6 @@ static void __mem_cgroup_commit_charge(s
 				     struct page_cgroup *pc,
 				     enum charge_type ctype)
 {
-	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
-
 	/* try_charge() can return NULL to *memcg, taking care of it. */
 	if (!mem)
 		return;
@@ -627,17 +589,11 @@ static void __mem_cgroup_commit_charge(s
 		return;
 	}
 	pc->mem_cgroup = mem;
-	/*
-	 * If a page is accounted as a page cache, insert to inactive list.
-	 * If anon, insert to active list.
-	 */
+	smp_wmb();
 	pc->flags = pcg_default_flags[ctype];
 
-	mz = page_cgroup_zoneinfo(pc);
+	mem_cgroup_charge_statistics(mem, pc, true);
 
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_add_list(mz, pc, true);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
 	unlock_page_cgroup(pc);
 }
 
@@ -648,8 +604,7 @@ static void __mem_cgroup_commit_charge(s
  * @to:	mem_cgroup which the page is moved to. @from != @to.
  *
  * The caller must confirm following.
- * 1. disable irq.
- * 2. lru_lock of old mem_cgroup(@from) should be held.
+ * - page is not on LRU (isolate_page() is useful.)
  *
  * returns 0 at success,
  * returns -EBUSY when lock is busy or "pc" is unstable.
@@ -665,15 +620,14 @@ static int mem_cgroup_move_account(struc
 	int nid, zid;
 	int ret = -EBUSY;
 
-	VM_BUG_ON(!irqs_disabled());
 	VM_BUG_ON(from == to);
+	VM_BUG_ON(PageLRU(pc->page));
 
 	nid = page_cgroup_nid(pc);
 	zid = page_cgroup_zid(pc);
 	from_mz =  mem_cgroup_zoneinfo(from, nid, zid);
 	to_mz =  mem_cgroup_zoneinfo(to, nid, zid);
 
-
 	if (!trylock_page_cgroup(pc))
 		return ret;
 
@@ -683,18 +637,15 @@ static int mem_cgroup_move_account(struc
 	if (pc->mem_cgroup != from)
 		goto out;
 
-	if (spin_trylock(&to_mz->lru_lock)) {
-		__mem_cgroup_remove_list(from_mz, pc);
-		css_put(&from->css);
-		res_counter_uncharge(&from->res, PAGE_SIZE);
-		if (do_swap_account)
-			res_counter_uncharge(&from->memsw, PAGE_SIZE);
-		pc->mem_cgroup = to;
-		css_get(&to->css);
-		__mem_cgroup_add_list(to_mz, pc, false);
-		ret = 0;
-		spin_unlock(&to_mz->lru_lock);
-	}
+	css_put(&from->css);
+	res_counter_uncharge(&from->res, PAGE_SIZE);
+	mem_cgroup_charge_statistics(from, pc, false);
+	if (do_swap_account)
+		res_counter_uncharge(&from->memsw, PAGE_SIZE);
+	pc->mem_cgroup = to;
+	mem_cgroup_charge_statistics(to, pc, true);
+	css_get(&to->css);
+	ret = 0;
 out:
 	unlock_page_cgroup(pc);
 	return ret;
@@ -708,39 +659,47 @@ static int mem_cgroup_move_parent(struct
 				  struct mem_cgroup *child,
 				  gfp_t gfp_mask)
 {
+	struct page *page = pc->page;
 	struct cgroup *cg = child->css.cgroup;
 	struct cgroup *pcg = cg->parent;
 	struct mem_cgroup *parent;
-	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
 	int ret;
 
 	/* Is ROOT ? */
 	if (!pcg)
 		return -EINVAL;
 
+
 	parent = mem_cgroup_from_cont(pcg);
 
+
 	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
 	if (ret)
 		return ret;
 
-	mz = mem_cgroup_zoneinfo(child,
-			page_cgroup_nid(pc), page_cgroup_zid(pc));
+	if (!get_page_unless_zero(page))
+		return -EBUSY;
+
+	ret = isolate_lru_page(page);
+
+	if (ret)
+		goto cancel;
 
-	spin_lock_irqsave(&mz->lru_lock, flags);
 	ret = mem_cgroup_move_account(pc, child, parent);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
 
-	/* drop extra refcnt */
+	/* drop extra refcnt by try_charge() (move_account increment one) */
 	css_put(&parent->css);
-	/* uncharge if move fails */
-	if (ret) {
-		res_counter_uncharge(&parent->res, PAGE_SIZE);
-		if (do_swap_account)
-			res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+	putback_lru_page(page);
+	if (!ret) {
+		put_page(page);
+		return 0;
 	}
-
+	/* uncharge if move fails */
+cancel:
+	res_counter_uncharge(&parent->res, PAGE_SIZE);
+	if (do_swap_account)
+		res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+	put_page(page);
 	return ret;
 }
 
@@ -906,6 +865,8 @@ int mem_cgroup_cache_charge_swapin(struc
 	}
 	if (!locked)
 		unlock_page(page);
+	/* add this page(page_cgroup) to the LRU we want. */
+	mem_cgroup_lru_fixup(page);
 
 	return ret;
 }
@@ -938,6 +899,8 @@ void mem_cgroup_commit_charge_swapin(str
 		}
 
 	}
+	/* add this page(page_cgroup) to the LRU we want. */
+	mem_cgroup_lru_fixup(page);
 }
 
 void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
@@ -961,7 +924,6 @@ __mem_cgroup_uncharge_common(struct page
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
 
 	if (mem_cgroup_subsys.disabled)
 		return NULL;
@@ -1003,12 +965,10 @@ __mem_cgroup_uncharge_common(struct page
 	if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
 		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
 
+	mem_cgroup_charge_statistics(mem, pc, false);
 	ClearPageCgroupUsed(pc);
 
 	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_remove_list(mz, pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
 	unlock_page_cgroup(pc);
 
 	css_put(&mem->css);
@@ -1257,21 +1217,22 @@ int mem_cgroup_resize_memsw_limit(struct
 	return ret;
 }
 
-
 /*
  * This routine traverse page_cgroup in given list and drop them all.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
  */
 static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
-			    struct mem_cgroup_per_zone *mz,
-			    enum lru_list lru)
+				int node, int zid, enum lru_list lru)
 {
+	struct zone *zone;
+	struct mem_cgroup_per_zone *mz;
 	struct page_cgroup *pc, *busy;
-	unsigned long flags;
-	unsigned long loop;
+	unsigned long flags, loop;
 	struct list_head *list;
 	int ret = 0;
 
+	zone = &NODE_DATA(node)->node_zones[zid];
+	mz = mem_cgroup_zoneinfo(mem, node, zid);
 	list = &mz->lists[lru];
 
 	loop = MEM_CGROUP_ZSTAT(mz, lru);
@@ -1280,19 +1241,19 @@ static int mem_cgroup_force_empty_list(s
 	busy = NULL;
 	while (loop--) {
 		ret = 0;
-		spin_lock_irqsave(&mz->lru_lock, flags);
+		spin_lock_irqsave(&zone->lru_lock, flags);
 		if (list_empty(list)) {
-			spin_unlock_irqrestore(&mz->lru_lock, flags);
+			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			break;
 		}
 		pc = list_entry(list->prev, struct page_cgroup, lru);
 		if (busy == pc) {
 			list_move(&pc->lru, list);
 			busy = 0;
-			spin_unlock_irqrestore(&mz->lru_lock, flags);
+			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			continue;
 		}
-		spin_unlock_irqrestore(&mz->lru_lock, flags);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
 		ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
 		if (ret == -ENOMEM)
@@ -1305,6 +1266,7 @@ static int mem_cgroup_force_empty_list(s
 		} else
 			busy = NULL;
 	}
+
 	if (!ret && !list_empty(list))
 		return -EBUSY;
 	return ret;
@@ -1334,12 +1296,10 @@ move_account:
 		ret = 0;
 		for_each_node_state(node, N_POSSIBLE) {
 			for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
-				struct mem_cgroup_per_zone *mz;
 				enum lru_list l;
-				mz = mem_cgroup_zoneinfo(mem, node, zid);
 				for_each_lru(l) {
 					ret = mem_cgroup_force_empty_list(mem,
-								  mz, l);
+							node, zid, l);
 					if (ret)
 						break;
 				}
@@ -1373,6 +1333,7 @@ try_to_free:
 			nr_retries--;
 
 	}
+	lru_add_drain();
 	/* try move_account...there may be some *locked* pages. */
 	if (mem->res.usage)
 		goto move_account;
@@ -1593,7 +1554,6 @@ static int alloc_mem_cgroup_per_zone_inf
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
-		spin_lock_init(&mz->lru_lock);
 		for_each_lru(l)
 			INIT_LIST_HEAD(&mz->lists[l]);
 	}
@@ -1635,10 +1595,17 @@ static struct mem_cgroup *mem_cgroup_all
 
 static void mem_cgroup_free(struct mem_cgroup *mem)
 {
+	int node;
+
 	if (do_swap_account) {
 		if (atomic_read(&mem->swapref) > 0)
 			return;
 	}
+
+
+	for_each_node_state(node, N_POSSIBLE)
+		free_mem_cgroup_per_zone_info(mem, node);
+
 	if (sizeof(*mem) < PAGE_SIZE)
 		kfree(mem);
 	else
@@ -1711,12 +1678,6 @@ static void mem_cgroup_pre_destroy(struc
 static void mem_cgroup_destroy(struct cgroup_subsys *ss,
 				struct cgroup *cont)
 {
-	int node;
-	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
-
-	for_each_node_state(node, N_POSSIBLE)
-		free_mem_cgroup_per_zone_info(mem, node);
-
 	mem_cgroup_free(mem_cgroup_from_cont(cont));
 }
 
Index: mmotm-2.6.28-rc2+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-rc2+/include/linux/memcontrol.h
@@ -40,7 +40,12 @@ extern void mem_cgroup_cancel_charge_swa
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
-extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
+extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_del_lru(struct page *page);
+extern void mem_cgroup_move_lists(struct page *page,
+				  enum lru_list from, enum lru_list to);
 extern void mem_cgroup_uncharge_page(struct page *page);
 extern void mem_cgroup_uncharge_cache_page(struct page *page);
 extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
@@ -131,7 +136,27 @@ static inline int mem_cgroup_shrink_usag
 	return 0;
 }
 
-static inline void mem_cgroup_move_lists(struct page *page, bool active)
+static inline void mem_cgroup_add_lru_list(struct page *page, int lru)
+{
+}
+
+static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
+{
+	return ;
+}
+
+static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
+{
+	return ;
+}
+
+static inline void mem_cgroup_del_lru(struct page *page)
+{
+	return ;
+}
+
+static inline void
+mem_cgroup_move_lists(struct page *page, enum lru_list from, enum lru_list to)
 {
 }
 
Index: mmotm-2.6.28-rc2+/include/linux/mm_inline.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/mm_inline.h
+++ mmotm-2.6.28-rc2+/include/linux/mm_inline.h
@@ -28,6 +28,7 @@ add_page_to_lru_list(struct zone *zone, 
 {
 	list_add(&page->lru, &zone->lru[l].list);
 	__inc_zone_state(zone, NR_LRU_BASE + l);
+	mem_cgroup_add_lru_list(page, l);
 }
 
 static inline void
@@ -35,6 +36,7 @@ del_page_from_lru_list(struct zone *zone
 {
 	list_del(&page->lru);
 	__dec_zone_state(zone, NR_LRU_BASE + l);
+	mem_cgroup_del_lru_list(page, l);
 }
 
 static inline void
@@ -54,6 +56,7 @@ del_page_from_lru(struct zone *zone, str
 		l += page_is_file_cache(page);
 	}
 	__dec_zone_state(zone, NR_LRU_BASE + l);
+	mem_cgroup_del_lru_list(page, l);
 }
 
 /**
Index: mmotm-2.6.28-rc2+/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-rc2+/mm/page_cgroup.c
@@ -17,6 +17,7 @@ __init_page_cgroup(struct page_cgroup *p
 	pc->flags = 0;
 	pc->mem_cgroup = NULL;
 	pc->page = pfn_to_page(pfn);
+	INIT_LIST_HEAD(&pc->lru);
 }
 static unsigned long total_usage;
 
Index: mmotm-2.6.28-rc2+/fs/splice.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/fs/splice.c
+++ mmotm-2.6.28-rc2+/fs/splice.c
@@ -21,6 +21,7 @@
 #include <linux/file.h>
 #include <linux/pagemap.h>
 #include <linux/splice.h>
+#include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
 #include <linux/swap.h>
 #include <linux/writeback.h>
Index: mmotm-2.6.28-rc2+/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/vmscan.c
+++ mmotm-2.6.28-rc2+/mm/vmscan.c
@@ -516,7 +516,6 @@ redo:
 		lru = LRU_UNEVICTABLE;
 		add_page_to_unevictable_list(page);
 	}
-	mem_cgroup_move_lists(page, lru);
 
 	/*
 	 * page's status can change while we move it among lru. If an evictable
@@ -551,7 +550,6 @@ void putback_lru_page(struct page *page)
 
 	lru = !!TestClearPageActive(page) + page_is_file_cache(page);
 	lru_cache_add_lru(page, lru);
-	mem_cgroup_move_lists(page, lru);
 	put_page(page);
 }
 #endif /* CONFIG_UNEVICTABLE_LRU */
@@ -823,6 +821,7 @@ int __isolate_lru_page(struct page *page
 		return ret;
 
 	ret = -EBUSY;
+
 	if (likely(get_page_unless_zero(page))) {
 		/*
 		 * Be careful not to clear PageLRU until after we're
@@ -831,6 +830,7 @@ int __isolate_lru_page(struct page *page
 		 */
 		ClearPageLRU(page);
 		ret = 0;
+		mem_cgroup_del_lru(page);
 	}
 
 	return ret;
@@ -1144,7 +1144,6 @@ static unsigned long shrink_inactive_lis
 			SetPageLRU(page);
 			lru = page_lru(page);
 			add_page_to_lru_list(zone, page, lru);
-			mem_cgroup_move_lists(page, lru);
 			if (PageActive(page) && scan_global_lru(sc)) {
 				int file = !!page_is_file_cache(page);
 				zone->recent_rotated[file]++;
@@ -1277,7 +1276,7 @@ static void shrink_active_list(unsigned 
 		ClearPageActive(page);
 
 		list_move(&page->lru, &zone->lru[lru].list);
-		mem_cgroup_move_lists(page, lru);
+		mem_cgroup_add_lru_list(page, lru);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
@@ -2436,6 +2435,7 @@ retry:
 
 		__dec_zone_state(zone, NR_UNEVICTABLE);
 		list_move(&page->lru, &zone->lru[l].list);
+		mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
 		__inc_zone_state(zone, NR_INACTIVE_ANON + l);
 		__count_vm_event(UNEVICTABLE_PGRESCUED);
 	} else {
@@ -2444,6 +2444,7 @@ retry:
 		 */
 		SetPageUnevictable(page);
 		list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
+		mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
 		if (page_evictable(page, NULL))
 			goto retry;
 	}
Index: mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.28-rc2+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
@@ -26,10 +26,6 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
-	/* flags for LRU placement */
-	PCG_ACTIVE, /* page is active in this cgroup */
-	PCG_FILE, /* page is file system backed */
-	PCG_UNEVICTABLE, /* page is unevictableable */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -50,19 +46,6 @@ TESTPCGFLAG(Cache, CACHE)
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 
-/* LRU management flags (from global-lru definition) */
-TESTPCGFLAG(File, FILE)
-SETPCGFLAG(File, FILE)
-CLEARPCGFLAG(File, FILE)
-
-TESTPCGFLAG(Active, ACTIVE)
-SETPCGFLAG(Active, ACTIVE)
-CLEARPCGFLAG(Active, ACTIVE)
-
-TESTPCGFLAG(Unevictable, UNEVICTABLE)
-SETPCGFLAG(Unevictable, UNEVICTABLE)
-CLEARPCGFLAG(Unevictable, UNEVICTABLE)
-
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
Index: mmotm-2.6.28-rc2+/mm/swap.c
===================================================================
--- mmotm-2.6.28-rc2+.orig/mm/swap.c
+++ mmotm-2.6.28-rc2+/mm/swap.c
@@ -168,7 +168,6 @@ void activate_page(struct page *page)
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(zone, page, lru);
 		__count_vm_event(PGACTIVATE);
-		mem_cgroup_move_lists(page, lru);
 
 		zone->recent_rotated[!!file]++;
 		zone->recent_scanned[!!file]++;


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/6] memcg updates (05/Nov)
  2008-11-05  8:16 [RFC][PATCH 0/6] memcg updates (05/Nov) KAMEZAWA Hiroyuki
                   ` (5 preceding siblings ...)
  2008-11-05  8:24 ` [RFC][PATCH 6/6] memcg: synchronized LRU KAMEZAWA Hiroyuki
@ 2008-11-06  6:54 ` Balbir Singh
  2008-11-06  7:03   ` KAMEZAWA Hiroyuki
  2008-11-06 10:41   ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior of rmdir) KAMEZAWA Hiroyuki
  6 siblings, 2 replies; 28+ messages in thread
From: Balbir Singh @ 2008-11-06  6:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, menage, nishimura

KAMEZAWA Hiroyuki wrote:
> Weekly (RFC) update for memcg.
> 
> This set includes
> 
> 1. change force_empty to do move account rather than forget all

I would like this to be selectable, please. We don't want to break behaviour and
not everyone would like to pay the cost of movement.

> 2. swap cache handling
> 3. mem+swap controller kconfig
> 4. swap_cgroup for rememver swap account information
> 5. mem+swap controller core
> 6. synchronize memcg's LRU and global LRU.
> 
> "1" is already sent, "6" is a newcomer.
> I'd like to push out "2" or "2-5" in the next week (if no bugs.)
> 
> after 6, next candidates are
>   - dirty_ratio handler
>   - account move at task move.
> 
> Some more explanation about purpose of "6". (see details in patch itself)
> Now, one of complicated logic in memcg is LRU handling. Because the place of
> lru_head depends on page_cgroup->mem_cgroup pointer, we have to take
> lock as following even under zone->lru_lock.
> ==
>   pc = lookup_page_cgroup(page);
>   if (!trylock_page_cgroup(pc))
>   	return -EBUSY;
> 
>    if (PageCgroupUsed(pc)) {
> 	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
> 	spin_lock_irqsave(&mz->lru_lock, flags);
> 	....some operation on LRU.
> 	spin_unlock_irqrestore(&mz->lru_lock, flags);
>    }
>    unlock_page_cgroup(pc);
> ==
> Sigh..
> 
> After "6", page_cgroup's LRU management can be done independently to some extent.
> == as
>   (zone->lru_lock is held here)
>   pc = lookup_page_cgroup(page);
>   list operation on pc.
>   (unlock zone->lru_lock)
> ==
> Maybe good for maintainance and as a bonus, we can make use of isolate_lru_page() when
> doing some racy operation.
> 
> 	isolate_lru_page(page);
> 	pc = lookup_page_cgroup(page);
> 	do some jobs.
> 	putback_lru_page(page);
> 
> Maybe this will be a help to implement "account move at task move".

Sounds promising!

-- 
	Balbir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/6] memcg updates (05/Nov)
  2008-11-06  6:54 ` [RFC][PATCH 0/6] memcg updates (05/Nov) Balbir Singh
@ 2008-11-06  7:03   ` KAMEZAWA Hiroyuki
  2008-11-06 10:41   ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior of rmdir) KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-06  7:03 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, linux-kernel, menage, nishimura

On Thu, 06 Nov 2008 12:24:11 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > Weekly (RFC) update for memcg.
> > 
> > This set includes
> > 
> > 1. change force_empty to do move account rather than forget all
> 
> I would like this to be selectable, please. We don't want to break behaviour and
> not everyone would like to pay the cost of movement.
> 
Just current behavior is broken ;)

Hmm. I have an option in my stack to do
 - call try_to_free_pages() only. or
 - call try_to_free_pages(). only when memory is locked, move to parent.

Ok ? *forget all* is no choice.

Thanks,
-Kame

> > 2. swap cache handling
> > 3. mem+swap controller kconfig
> > 4. swap_cgroup for rememver swap account information
> > 5. mem+swap controller core
> > 6. synchronize memcg's LRU and global LRU.
> > 
> > "1" is already sent, "6" is a newcomer.
> > I'd like to push out "2" or "2-5" in the next week (if no bugs.)
> > 
> > after 6, next candidates are
> >   - dirty_ratio handler
> >   - account move at task move.
> > 
> > Some more explanation about purpose of "6". (see details in patch itself)
> > Now, one of complicated logic in memcg is LRU handling. Because the place of
> > lru_head depends on page_cgroup->mem_cgroup pointer, we have to take
> > lock as following even under zone->lru_lock.
> > ==
> >   pc = lookup_page_cgroup(page);
> >   if (!trylock_page_cgroup(pc))
> >   	return -EBUSY;
> > 
> >    if (PageCgroupUsed(pc)) {
> > 	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
> > 	spin_lock_irqsave(&mz->lru_lock, flags);
> > 	....some operation on LRU.
> > 	spin_unlock_irqrestore(&mz->lru_lock, flags);
> >    }
> >    unlock_page_cgroup(pc);
> > ==
> > Sigh..
> > 
> > After "6", page_cgroup's LRU management can be done independently to some extent.
> > == as
> >   (zone->lru_lock is held here)
> >   pc = lookup_page_cgroup(page);
> >   list operation on pc.
> >   (unlock zone->lru_lock)
> > ==
> > Maybe good for maintainance and as a bonus, we can make use of isolate_lru_page() when
> > doing some racy operation.
> > 
> > 	isolate_lru_page(page);
> > 	pc = lookup_page_cgroup(page);
> > 	do some jobs.
> > 	putback_lru_page(page);
> > 
> > Maybe this will be a help to implement "account move at task move".
> 
> Sounds promising!
> 
> -- 
> 	Balbir
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC][PATCH 7/6] memcg: add atribute (for change bahavior of rmdir)
  2008-11-06  6:54 ` [RFC][PATCH 0/6] memcg updates (05/Nov) Balbir Singh
  2008-11-06  7:03   ` KAMEZAWA Hiroyuki
@ 2008-11-06 10:41   ` KAMEZAWA Hiroyuki
  2008-11-06 11:59     ` Hugh Dickins
  2008-11-06 13:46     ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior of rmdir) Balbir Singh
  1 sibling, 2 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-06 10:41 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, linux-kernel, menage, nishimura

On Thu, 06 Nov 2008 12:24:11 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > Weekly (RFC) update for memcg.
> > 
> > This set includes
> > 
> > 1. change force_empty to do move account rather than forget all
> 
> I would like this to be selectable, please. We don't want to break behaviour and
> not everyone would like to pay the cost of movement.

How about a patch like this ? I'd like to move this as [2/7], if possible.
It obviously needs painful rework. If I found it difficult, schedule this as [7/7].

BTW, cost of movement itself is not far from cost for force_empty.

If you can't find why "forget" is bad, please consider one more day.

==
This patch adds attribute to memory resource controller.

This memory.attribute file allows following to set/clear attribute.
  #echo attribute option > memory.attribute

This patch implements an attribute

 # on_rmdir [keep | drop] > memory.attribute.

 When on_rmdir=keep, memory remaining in memcg will be moved up to parent
 at rmdir. This is fast.
 When on_rmdir=drop, memory remaining in memcg will be freed.

 Characteristic of Keep.
  - fast.
  - Doesn't cause unnecessary freeing of memory(page cache).
    (IOW. page-cache for temporal files or some unnecessary pages will be kept.)
 Characteristic of Drop.
  - slow
  - No influence to its parent. all page caches will be dropped.
    (IOW. page-cache for libc or some important pages will be dropped.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 Documentation/controllers/memory.txt |   22 +++++
 mm/memcontrol.c                      |  134 ++++++++++++++++++++++++++++++++++-
 2 files changed, 154 insertions(+), 2 deletions(-)

Index: mmotm-2.6.28-rc2-Oct30/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-rc2-Oct30.orig/mm/memcontrol.c
+++ mmotm-2.6.28-rc2-Oct30/mm/memcontrol.c
@@ -35,6 +35,7 @@
 #include <linux/vmalloc.h>
 #include <linux/mm_inline.h>
 #include <linux/page_cgroup.h>
+#include <linux/ctype.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -146,6 +147,10 @@ struct mem_cgroup {
 	 */
 	int	prev_priority;	/* for recording reclaim priority */
 	/*
+	 * attribute
+	 */
+	char		drop_on_rmdir;
+	/*
 	 * used for counting reference from swap_cgroup.
 	 */
 	int		obsolete;
@@ -182,6 +187,22 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
 #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
 
+/*
+ *  attribute for memcg default value comes from its parent.
+ *  the root set all to false.
+ */
+enum {
+	MEMCG_ATTR_ON_RMDIR, /* drop_all if true, default is true. */
+	MEMCG_LAST_ATTR,
+};
+/* we may have to check status under racy situation. use global mutex. */
+DEFINE_MUTEX(memcg_attr_mutex);
+
+static char *memcg_attribute_names[MEMCG_LAST_ATTR] = {
+	"on_rmdir",
+};
+
+
 static void mem_cgroup_forget_swapref(struct mem_cgroup *mem);
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
@@ -1294,6 +1315,10 @@ static int mem_cgroup_force_empty(struct
 	css_get(&mem->css);
 
 	shrink = 0;
+	/* If this is true, free all orphan pages on LRU as much as possible */
+	if (mem->drop_on_rmdir)
+		goto try_to_free;
+
 move_account:
 	while (mem->res.usage > 0) {
 		ret = -EBUSY;
@@ -1311,6 +1336,9 @@ move_account:
 		/* it seems parent cgroup doesn't have enough mem */
 		if (ret == -ENOMEM)
 			goto try_to_free;
+		ret = -EINTR;
+		if (signal_pending(current))
+			goto out;
 		cond_resched();
 	}
 	ret = 0;
@@ -1332,6 +1360,10 @@ try_to_free:
 						  GFP_HIGHUSER_MOVABLE, false);
 		if (!progress)
 			nr_retries--;
+		ret = -EINTR;
+		if (signal_pending(current))
+			goto out;
+		cond_resched();
 
 	}
 	lru_add_drain();
@@ -1475,6 +1507,95 @@ static int mem_control_stat_show(struct 
 	return 0;
 }
 
+
+/*
+ * Assumes
+ * #echo attribute [option] > memory.feature
+ */
+static int
+parse_attr_option(char *buffer, char **attr, char **option, char **end)
+{
+	char *c = buffer;
+
+	*attr = NULL;
+	*option = NULL;
+	/* skip white space */
+	for (; *c && isspace(*c);c++);
+	/* found NULL ? */
+	if (!*c)
+		return -EINVAL;
+	*attr = c;
+	/* skip attribute */
+	for (; *c && !isspace(*c);c++);
+	/* skip space */
+	for (; *c && isspace(*c);c++);
+	/* pass pointer to option */
+	*option = c;
+	for (; *c; c++);
+	*end = c;
+	return 0;
+
+}
+
+static int mem_cgroup_write_attr(struct cgroup *cont,
+				     struct cftype *cft,
+				     const char *buffer)
+{
+	int i, len;
+	char *attr, *option, *end;
+	int ret = -EINVAL;
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+	mutex_lock(&memcg_attr_mutex);
+	/* parse attribute option */
+	ret = parse_attr_option((char*)buffer, &attr, &option, &end);
+	if (ret)
+		goto out;
+	for (i = 0; i < MEMCG_LAST_ATTR; i++) {
+
+		len = strlen(memcg_attribute_names[i]);
+		if ((option - attr) < len)
+			continue;
+		if (!strncmp(memcg_attribute_names[i], attr, len))
+			break;
+	}
+	ret = -EINVAL;
+	if (i == MEMCG_LAST_ATTR)
+		goto out;
+	switch(i) {
+	case MEMCG_ATTR_ON_RMDIR:
+		if ((end - option) < 4)
+			break;
+		ret = 0;
+		if (strncmp(option, "keep", 4) == 0)
+			mem->drop_on_rmdir = 0;
+		else if (strncmp(option, "drop", 4) == 0)
+			mem->drop_on_rmdir = 1;
+		else
+			ret = -EINVAL;
+		break;
+	}
+out:
+	mutex_unlock(&memcg_attr_mutex);
+	return ret;
+}
+
+static int mem_cgroup_read_attr(struct cgroup *cont, struct cftype *cft,
+				 struct seq_file *m)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	char *s;
+
+	s = memcg_attribute_names[MEMCG_ATTR_ON_RMDIR];
+
+	if (mem->drop_on_rmdir)
+		seq_printf(m, "%s drop\n",s);
+	else
+		seq_printf(m, "%s keep\n",s);
+
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -1503,6 +1624,11 @@ static struct cftype mem_cgroup_files[] 
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
+	{
+		.name = "attribute",
+		.write_string = mem_cgroup_write_attr,
+		.read_seq_string = mem_cgroup_read_attr,
+	},
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 	{
 		.name = "memsw.usage_in_bytes",
@@ -1640,20 +1766,26 @@ static void __init enable_swap_cgroup(vo
 static struct cgroup_subsys_state *
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
-	struct mem_cgroup *mem;
+	struct mem_cgroup *mem, *parent;
 	int node;
 
 	if (unlikely((cont->parent) == NULL)) {
 		mem = &init_mem_cgroup;
 		enable_swap_cgroup();
+		parent = NULL;
+		mem->drop_on_rmdir = 1; /* default is drop */
 	} else {
 		mem = mem_cgroup_alloc();
 		if (!mem)
 			return ERR_PTR(-ENOMEM);
+		parent = mem_cgroup_from_cont(cont->parent);
 	}
 
 	res_counter_init(&mem->res);
 	res_counter_init(&mem->memsw);
+	/* inherit */
+	if (parent)
+		mem->drop_on_rmdir = parent->drop_on_rmdir;
 
 	for_each_node_state(node, N_POSSIBLE)
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
Index: mmotm-2.6.28-rc2-Oct30/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-rc2-Oct30.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-rc2-Oct30/Documentation/controllers/memory.txt
@@ -270,8 +270,28 @@ Charges recorded in swap information is 
 Recorded information is effectively discarded and a cgroup which uses swap
 (swapcache) will be charged as a new owner of it.
 
+5. Attributes
+memory.attribute file is provided to set per-memcg attribute.
+You can specify attribute by
+ #echo attribute option > memory.attribute
+
+5.1 on_rmdir
+set behavior of memcg at rmdir (destroy cgroup) default is "drop".
+
+  5.1.1 drop
+	#echo on_rmdir drop > memory.attribute
+	This is default. All pages on this memcg will be freed.
+	If pages are locked or too busy, they will be moved up to the parent.
+	Useful when you want to drop (large) page caches used in this memcg.
+
+  5.1.2 keep
+	#echo on_rmdir keep > memory.attribute
+	All pages on this memcg will be moved to parent.
+	Useful when you don't want to drop page caches used in this memcg.
+	You can keep page caches from some library or DB accessed by this
+	memcg on memory.
 
-5. TODO
+6. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 3/6] memcg : mem+swap controller kconfig
  2008-11-05  8:20 ` [RFC][PATCH 3/6] memcg : mem+swap controller kconfig KAMEZAWA Hiroyuki
@ 2008-11-06 11:07   ` Daisuke Nishimura
  0 siblings, 0 replies; 28+ messages in thread
From: Daisuke Nishimura @ 2008-11-06 11:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

On Wed, 5 Nov 2008 17:20:50 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Experimental.
> 
> Config and control variable for mem+swap controller.
> 
> This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> (memory resource controller swap extension.)
> 
> For accounting swap, it's obvious that we have to use additional memory
> to remember "who uses swap". This adds more overhead.
> So, it's better to offer "choice" to users. This patch adds 2 choices.
> 
> This patch adds 2 parameters to enable swap extension or not.
>   - CONFIG
>   - boot option
> 
> Changelog: v1 -> v2
>  - fixed typo.
>  - make default value of "do_swap_account" to be 0 and turned on 1
>    later if configured.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
looks good to me.

	Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

> 
>  Documentation/kernel-parameters.txt |    3 +++
>  include/linux/memcontrol.h          |    3 +++
>  init/Kconfig                        |   17 +++++++++++++++++
>  mm/memcontrol.c                     |   32 ++++++++++++++++++++++++++++++++
>  4 files changed, 55 insertions(+)
> 
> Index: mmotm-2.6.28-rc2+/init/Kconfig
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/init/Kconfig
> +++ mmotm-2.6.28-rc2+/init/Kconfig
> @@ -428,6 +428,23 @@ config CGROUP_MEM_RES_CTLR
>  config MM_OWNER
>  	bool
>  
> +config CGROUP_MEM_RES_CTLR_SWAP
> +	bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
> +	depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> +	help
> +	  Add swap management feature to memory resource controller. When you
> +	  enable this, you can limit mem+swap usage per cgroup. In other words,
> +	  when you disable this, memory resource controller has no cares to
> +	  usage of swap...a process can exhaust all of the swap. This extension
> +	  is useful when you want to avoid exhaustion swap but this itself
> +	  adds more overheads and consumes memory for remembering information.
> +	  Especially if you use 32bit system or small memory system, please
> +	  be careful about enabling this. When memory resource controller
> +	  is disabled by boot option, this will be automatically disabled and
> +	  there will be no overhead from this. Even when you set this config=y,
> +	  if boot option "noswapaccount" is set, swap will not be accounted.
> +
> +
>  endmenu
>  
>  config SYSFS_DEPRECATED
> Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
> +++ mmotm-2.6.28-rc2+/mm/memcontrol.c
> @@ -41,6 +41,15 @@
>  struct cgroup_subsys mem_cgroup_subsys __read_mostly;
>  #define MEM_CGROUP_RECLAIM_RETRIES	5
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +/* Turned on only when memory cgroup is enabled && really_do_swap_account = 0 */
> +int do_swap_account __read_mostly;
> +static int really_do_swap_account __initdata = 1; /* for remember boot option*/
> +#else
> +#define do_swap_account		(0)
> +#endif
> +
> +
>  /*
>   * Statistics for memory cgroup.
>   */
> @@ -1370,6 +1379,18 @@ static void mem_cgroup_free(struct mem_c
>  }
>  
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +static void __init enable_swap_cgroup(void)
> +{
> +	if (!mem_cgroup_subsys.disabled && really_do_swap_account)
> +		do_swap_account = 1;
> +}
> +#else
> +static void __init enable_swap_cgroup(void)
> +{
> +}
> +#endif
> +
>  static struct cgroup_subsys_state *
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -1378,6 +1399,7 @@ mem_cgroup_create(struct cgroup_subsys *
>  
>  	if (unlikely((cont->parent) == NULL)) {
>  		mem = &init_mem_cgroup;
> +		enable_swap_cgroup();
>  	} else {
>  		mem = mem_cgroup_alloc();
>  		if (!mem)
> @@ -1461,3 +1483,13 @@ struct cgroup_subsys mem_cgroup_subsys =
>  	.attach = mem_cgroup_move_task,
>  	.early_init = 0,
>  };
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +
> +static int __init disable_swap_account(char *s)
> +{
> +	really_do_swap_account = 0;
> +	return 1;
> +}
> +__setup("noswapaccount", disable_swap_account);
> +#endif
> Index: mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/Documentation/kernel-parameters.txt
> +++ mmotm-2.6.28-rc2+/Documentation/kernel-parameters.txt
> @@ -1543,6 +1543,9 @@ and is between 256 and 4096 characters. 
>  
>  	nosoftlockup	[KNL] Disable the soft-lockup detector.
>  
> +	noswapaccount	[KNL] Disable accounting of swap in memory resource
> +			controller. (See Documentation/controllers/memory.txt)
> +
>  	nosync		[HW,M68K] Disables sync negotiation for all devices.
>  
>  	notsc		[BUGS=X86-32] Disable Time Stamp Counter
> Index: mmotm-2.6.28-rc2+/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.28-rc2+/include/linux/memcontrol.h
> @@ -77,6 +77,9 @@ extern void mem_cgroup_record_reclaim_pr
>  extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
>  					int priority, enum lru_list lru);
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +extern int do_swap_account;
> +#endif
>  
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct mem_cgroup;
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 4/6] memcg : swap cgroup
  2008-11-05  8:21 ` [RFC][PATCH 4/6] memcg : swap cgroup KAMEZAWA Hiroyuki
@ 2008-11-06 11:25   ` Daisuke Nishimura
  2008-11-06 12:44     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 28+ messages in thread
From: Daisuke Nishimura @ 2008-11-06 11:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

On Wed, 5 Nov 2008 17:21:41 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> For accounting swap, we need a record per swap entry, at least.
> 
> This patch adds following function.
>   - swap_cgroup_swapon() .... called from swapon
>   - swap_cgroup_swapoff() ... called at the end of swapoff.
> 
>   - swap_cgroup_record() .... record information of swap entry.
>   - swap_cgroup_lookup() .... lookup information of swap entry.
> 
> This patch just implements "how to record information". No actual
> method for limit the usage of swap. These routine uses flat table
> to record and lookup. "wise" lookup system like radix-tree requires
> requires memory allocation at new records but swap-out is usually
> called under memory shortage (or memcg hits limit.)
> So, I used static allocation. (maybe dynamic allocation is not very
> hard but it adds additional memory allocation in memory shortage path.)
> 
> 
> Note1: In this, we use pointer to record information and this means
>       8bytes per swap entry. I think we can reduce this when we
>       create "id of cgroup" in the range of 0-65535 or 0-255.
> 
> Note2: array of swap_cgroup is allocated from HIGHMEM. maybe good for x86-32.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
>  include/linux/page_cgroup.h |   35 +++++++
>  mm/page_cgroup.c            |  201 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/swapfile.c               |    8 +
>  3 files changed, 244 insertions(+)
> 
Is there any reason why they are defined not in memcontrol.[ch]
but in page_cgroup.[ch]?

> Index: mmotm-2.6.28-rc2+/mm/page_cgroup.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/page_cgroup.c
> +++ mmotm-2.6.28-rc2+/mm/page_cgroup.c
> @@ -8,6 +8,8 @@
>  #include <linux/memory.h>
>  #include <linux/vmalloc.h>
>  #include <linux/cgroup.h>
> +#include <linux/swapops.h>
> +#include <linux/highmem.h>
>  
>  static void __meminit
>  __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> @@ -254,3 +256,202 @@ void __init pgdat_page_cgroup_init(struc
>  }
>  
>  #endif
> +
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +
> +DEFINE_MUTEX(swap_cgroup_mutex);
> +struct swap_cgroup_ctrl {
> +	spinlock_t lock;
> +	struct page **map;
> +	unsigned long length;
> +};
> +
> +struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> +
> +/*
> + * This 8bytes seems big..maybe we can reduce this when we can use "id" for
> + * cgroup rather than pointer.
> + */
> +struct swap_cgroup {
> +	struct mem_cgroup	*val;
> +};
> +#define SC_PER_PAGE	(PAGE_SIZE/sizeof(struct swap_cgroup))
> +#define SC_POS_MASK	(SC_PER_PAGE - 1)
> +
> +/*
> + * allocate buffer for swap_cgroup.
> + */
> +static int swap_cgroup_prepare(int type)
> +{
> +	struct page *page;
> +	struct swap_cgroup_ctrl *ctrl;
> +	unsigned long idx, max;
> +
> +	if (!do_swap_account)
> +		return 0;
> +	ctrl = &swap_cgroup_ctrl[type];
> +
> +	for (idx = 0; idx < ctrl->length; idx++) {
> +		page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> +		if (!page)
> +			goto not_enough_page;
> +		ctrl->map[idx] = page;
> +	}
> +	return 0;
> +not_enough_page:
> +	max = idx;
> +	for (idx = 0; idx < max; idx++)
> +		__free_page(ctrl->map[idx]);
> +
> +	return -ENOMEM;
> +}
> +
> +/**
> + * swap_cgroup_record - record mem_cgroup for this swp_entry.
> + * @ent: swap entry to be recorded into
> + * @mem: mem_cgroup to be recorded
> + *
> + * Returns old value at success, NULL at failure.
> + * (Of course, old value can be NULL.)
> + */
> +struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
> +{
> +	unsigned long flags;
> +	int type = swp_type(ent);
> +	unsigned long offset = swp_offset(ent);
> +	unsigned long idx = offset / SC_PER_PAGE;
> +	unsigned long pos = offset & SC_POS_MASK;
> +	struct swap_cgroup_ctrl *ctrl;
> +	struct page *mappage;
> +	struct swap_cgroup *sc;
> +	struct mem_cgroup *old;
> +
> +	if (!do_swap_account)
> +		return NULL;
> +
> +	ctrl = &swap_cgroup_ctrl[type];
> +
> +	mappage = ctrl->map[idx];
> +	spin_lock_irqsave(&ctrl->lock, flags);
> +	sc = kmap_atomic(mappage, KM_USER0);
> +	sc += pos;
> +	old = sc->val;
> +	sc->val = mem;
> +	kunmap_atomic(mappage, KM_USER0);
> +	spin_unlock_irqrestore(&ctrl->lock, flags);
> +	return old;
> +}
> +
> +/**
> + * lookup_swap_cgroup - lookup mem_cgroup tied to swap entry
> + * @ent: swap entry to be looked up.
> + *
> + * Returns pointer to mem_cgroup at success. NULL at failure.
> + */
> +struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
> +{
> +	int type = swp_type(ent);
> +	unsigned long flags;
> +	unsigned long offset = swp_offset(ent);
> +	unsigned long idx = offset / SC_PER_PAGE;
> +	unsigned long pos = offset & SC_POS_MASK;
> +	struct swap_cgroup_ctrl *ctrl;
> +	struct page *mappage;
> +	struct swap_cgroup *sc;
> +	struct mem_cgroup *ret;
> +
> +	if (!do_swap_account)
> +		return NULL;
> +
> +	ctrl = &swap_cgroup_ctrl[type];
> +
> +	mappage = ctrl->map[idx];
> +
> +	spin_lock_irqsave(&ctrl->lock, flags);
> +	sc = kmap_atomic(mappage, KM_USER0);
> +	sc += pos;
> +	ret = sc->val;
> +	kunmap_atomic(mapppage, KM_USER0);
> +	spin_unlock_irqrestore(&ctrl->lock, flags);
> +	return ret;
> +}
> +
> +int swap_cgroup_swapon(int type, unsigned long max_pages)
> +{
> +	void *array;
> +	unsigned long array_size;
> +	unsigned long length;
> +	struct swap_cgroup_ctrl *ctrl;
> +
> +	if (!do_swap_account)
> +		return 0;
> +
> +	length = ((max_pages/SC_PER_PAGE) + 1);
> +	array_size = length * sizeof(void *);
> +
> +	array = vmalloc(array_size);
> +	if (!array)
> +		goto nomem;
> +
> +	memset(array, 0, array_size);
> +	ctrl = &swap_cgroup_ctrl[type];
> +	mutex_lock(&swap_cgroup_mutex);
> +	ctrl->length = length;
> +	ctrl->map = array;
> +	if (swap_cgroup_prepare(type)) {
> +		/* memory shortage */
> +		ctrl->map = NULL;
> +		ctrl->length = 0;
> +		vfree(array);
> +		mutex_unlock(&swap_cgroup_mutex);
> +		goto nomem;
> +	}
> +	mutex_unlock(&swap_cgroup_mutex);
> +
> +	printk(KERN_INFO
> +		"swap_cgroup: uses %ld bytes vmalloc and %ld bytes buffres\n",
> +		array_size, length * PAGE_SIZE);
> +	printk(KERN_INFO
> +	"swap_cgroup can be disabled by noswapaccount boot option.\n");
> +
> +	return 0;
> +nomem:
> +	printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
> +	printk(KERN_INFO
> +		"swap_cgroup can be disabled by noswapaccount boot option\n");
> +	return -ENOMEM;
> +}
> +
> +void swap_cgroup_swapoff(int type)
> +{
> +	int i;
> +	struct swap_cgroup_ctrl *ctrl;
> +
> +	if (!do_swap_account)
> +		return;
> +
> +	mutex_lock(&swap_cgroup_mutex);
> +	if (ctrl->map) {
> +		ctrl = &swap_cgroup_ctrl[type];
This line should be before "if (ctrl->map)"(otherwise "ctrl" will be undefined!).


Thanks,
Daisuke Nishimura.

> +		for (i = 0; i < ctrl->length; i++) {
> +			struct page *page = ctrl->map[i];
> +			if (page)
> +				__free_page(page);
> +		}
> +		vfree(ctrl->map);
> +		ctrl->map = NULL;
> +		ctrl->length = 0;
> +	}
> +	mutex_unlock(&swap_cgroup_mutex);
> +}
> +
> +static int __init swap_cgroup_init(void)
> +{
> +	int i;
> +	for (i = 0; i < MAX_SWAPFILES; i++)
> +		spin_lock_init(&swap_cgroup_ctrl[i].lock);
> +	return 0;
> +}
> +late_initcall(swap_cgroup_init);
> +#endif
> Index: mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.28-rc2+/include/linux/page_cgroup.h
> @@ -105,4 +105,39 @@ static inline void page_cgroup_init(void
>  }
>  
>  #endif
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +#include <linux/swap.h>
> +extern struct mem_cgroup *
> +swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem);
> +extern struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent);
> +extern int swap_cgroup_swapon(int type, unsigned long max_pages);
> +extern void swap_cgroup_swapoff(int type);
> +#else
> +#include <linux/swap.h>
> +
> +static inline
> +struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
> +{
> +	return NULL;
> +}
> +
> +static inline
> +struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
> +{
> +	return NULL;
> +}
> +
> +static inline int
> +swap_cgroup_swapon(int type, unsigned long max_pages)
> +{
> +	return 0;
> +}
> +
> +static inline void swap_cgroup_swapoff(int type)
> +{
> +	return;
> +}
> +
> +#endif
>  #endif
> Index: mmotm-2.6.28-rc2+/mm/swapfile.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/swapfile.c
> +++ mmotm-2.6.28-rc2+/mm/swapfile.c
> @@ -32,6 +32,7 @@
>  #include <asm/pgtable.h>
>  #include <asm/tlbflush.h>
>  #include <linux/swapops.h>
> +#include <linux/page_cgroup.h>
>  
>  static DEFINE_SPINLOCK(swap_lock);
>  static unsigned int nr_swapfiles;
> @@ -1345,6 +1346,9 @@ asmlinkage long sys_swapoff(const char _
>  	spin_unlock(&swap_lock);
>  	mutex_unlock(&swapon_mutex);
>  	vfree(swap_map);
> +	/* Destroy swap account informatin */
> +	swap_cgroup_swapoff(type);
> +
>  	inode = mapping->host;
>  	if (S_ISBLK(inode->i_mode)) {
>  		struct block_device *bdev = I_BDEV(inode);
> @@ -1669,6 +1673,10 @@ asmlinkage long sys_swapon(const char __
>  		nr_good_pages = swap_header->info.last_page -
>  				swap_header->info.nr_badpages -
>  				1 /* header page */;
> +
> +		if (!error)
> +			error = swap_cgroup_swapon(type, maxpages);
> +
>  		if (error)
>  			goto bad_swap;
>  	}
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 7/6] memcg: add atribute (for change bahavior of rmdir)
  2008-11-06 10:41   ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior of rmdir) KAMEZAWA Hiroyuki
@ 2008-11-06 11:59     ` Hugh Dickins
  2008-11-06 12:47       ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior ofrmdir) KAMEZAWA Hiroyuki
  2008-11-06 13:46     ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior of rmdir) Balbir Singh
  1 sibling, 1 reply; 28+ messages in thread
From: Hugh Dickins @ 2008-11-06 11:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, linux-mm, linux-kernel, menage, nishimura

On Thu, 6 Nov 2008, KAMEZAWA Hiroyuki wrote:
> On Thu, 06 Nov 2008 12:24:11 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > KAMEZAWA Hiroyuki wrote:
> > > 
> > > 1. change force_empty to do move account rather than forget all
> > 
> > I would like this to be selectable, please. We don't want to break behaviour and
> > not everyone would like to pay the cost of movement.
> 
> How about a patch like this ? I'd like to move this as [2/7], if possible.
> It obviously needs painful rework. If I found it difficult, schedule this as [7/7].
> 
> BTW, cost of movement itself is not far from cost for force_empty.
> 
> If you can't find why "forget" is bad, please consider one more day.

My recollection from a year ago is that force_empty totally violated
the rules established elsewhere, making a nonsense of it all: once a
force_empty had been done, you couldn't really be sure of anything
(perhaps it deserved a Taint flag).

Without studying your proposals at all, I do believe you've a good
chance of creating a sensible and consistent force_empty by moving
account, and abandoning the old "forget all" approach completely.

Hugh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 4/6] memcg : swap cgroup
  2008-11-06 11:25   ` Daisuke Nishimura
@ 2008-11-06 12:44     ` KAMEZAWA Hiroyuki
  2008-11-07  1:19       ` Daisuke Nishimura
  0 siblings, 1 reply; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-06 12:44 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, balbir, menage, nishimura

Daisuke Nishimura said:
> On Wed, 5 Nov 2008 17:21:41 +0900, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> Note1: In this, we use pointer to record information and this means
>>       8bytes per swap entry. I think we can reduce this when we
>>       create "id of cgroup" in the range of 0-65535 or 0-255.
>>
>> Note2: array of swap_cgroup is allocated from HIGHMEM. maybe good for
>> x86-32.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>>  include/linux/page_cgroup.h |   35 +++++++
>>  mm/page_cgroup.c            |  201
>> ++++++++++++++++++++++++++++++++++++++++++++
>>  mm/swapfile.c               |    8 +
>>  3 files changed, 244 insertions(+)
>>
> Is there any reason why they are defined not in memcontrol.[ch]
> but in page_cgroup.[ch]?
>
no strong reason. just because this is not core logic for acccounting.
do you want to see this in memcontrol.c ?

>> +void swap_cgroup_swapoff(int type)
>> +{
>> +	int i;
>> +	struct swap_cgroup_ctrl *ctrl;
>> +
>> +	if (!do_swap_account)
>> +		return;
>> +
>> +	mutex_lock(&swap_cgroup_mutex);
>> +	if (ctrl->map) {
>> +		ctrl = &swap_cgroup_ctrl[type];
> This line should be before "if (ctrl->map)"(otherwise "ctrl" will be
> undefined!).
>
Oh....maybe refresh mis...brame me.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 7/6] memcg: add atribute (for change bahavior  ofrmdir)
  2008-11-06 11:59     ` Hugh Dickins
@ 2008-11-06 12:47       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-06 12:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, balbir, linux-mm, linux-kernel, menage, nishimura

Hugh Dickins said:
> On Thu, 6 Nov 2008, KAMEZAWA Hiroyuki wrote:
>> On Thu, 06 Nov 2008 12:24:11 +0530
>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> > KAMEZAWA Hiroyuki wrote:
>> > >
>> > > 1. change force_empty to do move account rather than forget all
>> >
>> > I would like this to be selectable, please. We don't want to break
>> behaviour and
>> > not everyone would like to pay the cost of movement.
>>
>> How about a patch like this ? I'd like to move this as [2/7], if
>> possible.
>> It obviously needs painful rework. If I found it difficult, schedule
>> this as [7/7].
>>
>> BTW, cost of movement itself is not far from cost for force_empty.
>>
>> If you can't find why "forget" is bad, please consider one more day.
>
> My recollection from a year ago is that force_empty totally violated
> the rules established elsewhere, making a nonsense of it all: once a
> force_empty had been done, you couldn't really be sure of anything
> (perhaps it deserved a Taint flag).
>
yes. that was terrible. (but necessary to do so because we accounted
pages not on LRU or some other reason.)

> Without studying your proposals at all, I do believe you've a good
> chance of creating a sensible and consistent force_empty by moving
> account, and abandoning the old "forget all" approach completely.
>

yes. I'm very encouraged. thanks!
After patch [1/6]
  -> move all at force empty
After this
  -> move all or free (not forget) all.

-Kame



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 7/6] memcg: add atribute (for change bahavior of rmdir)
  2008-11-06 10:41   ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior of rmdir) KAMEZAWA Hiroyuki
  2008-11-06 11:59     ` Hugh Dickins
@ 2008-11-06 13:46     ` Balbir Singh
  2008-11-06 14:30       ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior ofrmdir) KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 28+ messages in thread
From: Balbir Singh @ 2008-11-06 13:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, menage, nishimura

KAMEZAWA Hiroyuki wrote:
> On Thu, 06 Nov 2008 12:24:11 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> KAMEZAWA Hiroyuki wrote:
>>> Weekly (RFC) update for memcg.
>>>
>>> This set includes
>>>
>>> 1. change force_empty to do move account rather than forget all
>> I would like this to be selectable, please. We don't want to break behaviour and
>> not everyone would like to pay the cost of movement.
> 
> How about a patch like this ? I'd like to move this as [2/7], if possible.
> It obviously needs painful rework. If I found it difficult, schedule this as [7/7].
> 
> BTW, cost of movement itself is not far from cost for force_empty.
> 
> If you can't find why "forget" is bad, please consider one more day.

The attributes seem quite reasonable, I've taken a quick look, not done a full
review or test.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 7/6] memcg: add atribute (for change bahavior  ofrmdir)
  2008-11-06 13:46     ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior of rmdir) Balbir Singh
@ 2008-11-06 14:30       ` KAMEZAWA Hiroyuki
  2008-11-07  1:12         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-06 14:30 UTC (permalink / raw)
  To: balbir; +Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, menage, nishimura

Balbir Singh said:
> KAMEZAWA Hiroyuki wrote:
>> BTW, cost of movement itself is not far from cost for force_empty.
>>
>> If you can't find why "forget" is bad, please consider one more day.
>
> The attributes seem quite reasonable, I've taken a quick look, not done a
> full
> review or test.
>
Thanks, I'll go ahead in this direction.

By the way, should we keep "one value per one file" for attributes ?
If so, I'll add a new file just for this.
Current my patch allows prural attributes set on a file.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 7/6] memcg: add atribute (for change bahavior  ofrmdir)
  2008-11-06 14:30       ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior ofrmdir) KAMEZAWA Hiroyuki
@ 2008-11-07  1:12         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-07  1:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, linux-mm, linux-kernel, menage, nishimura

On Thu, 6 Nov 2008 23:30:28 +0900 (JST)
"KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Balbir Singh said:
> > KAMEZAWA Hiroyuki wrote:
> >> BTW, cost of movement itself is not far from cost for force_empty.
> >>
> >> If you can't find why "forget" is bad, please consider one more day.
> >
> > The attributes seem quite reasonable, I've taken a quick look, not done a
> > full
> > review or test.
> >
> Thanks, I'll go ahead in this direction.
> 
It seems Andrew picked account_move patch into his queue. I'll post this as
add-on to mmotm, in the next week. patch-for-test can be posted in hours if
I work well...

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 4/6] memcg : swap cgroup
  2008-11-06 12:44     ` KAMEZAWA Hiroyuki
@ 2008-11-07  1:19       ` Daisuke Nishimura
  0 siblings, 0 replies; 28+ messages in thread
From: Daisuke Nishimura @ 2008-11-07  1:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

On Thu, 6 Nov 2008 21:44:21 +0900 (JST), "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Daisuke Nishimura said:
> > On Wed, 5 Nov 2008 17:21:41 +0900, KAMEZAWA Hiroyuki
> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> Note1: In this, we use pointer to record information and this means
> >>       8bytes per swap entry. I think we can reduce this when we
> >>       create "id of cgroup" in the range of 0-65535 or 0-255.
> >>
> >> Note2: array of swap_cgroup is allocated from HIGHMEM. maybe good for
> >> x86-32.
> >>
> >> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >>
> >>  include/linux/page_cgroup.h |   35 +++++++
> >>  mm/page_cgroup.c            |  201
> >> ++++++++++++++++++++++++++++++++++++++++++++
> >>  mm/swapfile.c               |    8 +
> >>  3 files changed, 244 insertions(+)
> >>
> > Is there any reason why they are defined not in memcontrol.[ch]
> > but in page_cgroup.[ch]?
> >
> no strong reason. just because this is not core logic for acccounting.
> do you want to see this in memcontrol.c ?
> 
I just felt strange just because they are not "page_cgroup".
I don't have any strong request to move them to memcontrol.c.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 2/6] memcg: handle swap cache
  2008-11-05  8:20 ` [RFC][PATCH 2/6] memcg: handle swap cache KAMEZAWA Hiroyuki
@ 2008-11-07  8:53   ` Daisuke Nishimura
  2008-11-07  9:13     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 28+ messages in thread
From: Daisuke Nishimura @ 2008-11-07  8:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

On Wed, 5 Nov 2008 17:20:09 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> SwapCache support for memory resource controller (memcg)
> 
> Before mem+swap controller, memcg itself should handle SwapCache in proper way.
> 
> In current memcg, SwapCache is just leaked and the user can create tons of
> SwapCache. This is a leak of account and should be handled.
> 
> SwapCache accounting is done as following.
> 
>   charge (anon)
> 	- charged when it's mapped.
> 	  (because of readahead, charge at add_to_swap_cache() is not sane)
>   uncharge (anon)
> 	- uncharged when it's dropped from swapcache and fully unmapped.
> 	  means it's not uncharged at unmap.
> 	  Note: delete from swap cache at swap-in is done after rmap information
> 	        is established.
>   charge (shmem)
> 	- charged at swap-in. this prevents charge at add_to_page_cache().
> 
>   uncharge (shmem)
> 	- uncharged when it's dropped from swapcache and not on shmem's
> 	  radix-tree.
> 
>   at migration, check against 'old page' is modified to handle shmem.
> 
> Comparing to the old version discussed (and caused troubles), we have
> advantages of
>   - PCG_USED bit.
>   - simple migrating handling.
> 
> So, situation is much easier than several months ago, maybe.
> 
> Changelog (v1) -> (v2)
>   - use lock_page() when we handle unlocked SwapCache.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
I tested this version under swap in/out activity with page migration/rmdir,
and it worked w/o errors for more than 24 hours.

	Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
	Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>


Thanks,
Daisuke Nishimura.

> ---
>  Documentation/controllers/memory.txt |    5 ++
>  include/linux/swap.h                 |   16 ++++++++
>  mm/memcontrol.c                      |   67 +++++++++++++++++++++++++++++++----
>  mm/shmem.c                           |   18 ++++++++-
>  mm/swap_state.c                      |    1 
>  5 files changed, 99 insertions(+), 8 deletions(-)
> 
> Index: mmotm-2.6.28-rc2+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/memcontrol.c
> +++ mmotm-2.6.28-rc2+/mm/memcontrol.c
> @@ -21,6 +21,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/cgroup.h>
>  #include <linux/mm.h>
> +#include <linux/pagemap.h>
>  #include <linux/smp.h>
>  #include <linux/page-flags.h>
>  #include <linux/backing-dev.h>
> @@ -140,6 +141,7 @@ enum charge_type {
>  	MEM_CGROUP_CHARGE_TYPE_MAPPED,
>  	MEM_CGROUP_CHARGE_TYPE_SHMEM,	/* used by page migration of shmem */
>  	MEM_CGROUP_CHARGE_TYPE_FORCE,	/* used by force_empty */
> +	MEM_CGROUP_CHARGE_TYPE_SWAPOUT,	/* for accounting swapcache */
>  	NR_CHARGE_TYPE,
>  };
>  
> @@ -781,6 +783,33 @@ int mem_cgroup_cache_charge(struct page 
>  				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
>  }
>  
> +#ifdef CONFIG_SWAP
> +int mem_cgroup_cache_charge_swapin(struct page *page,
> +			struct mm_struct *mm, gfp_t mask, bool locked)
> +{
> +	int ret = 0;
> +
> +	if (mem_cgroup_subsys.disabled)
> +		return 0;
> +	if (unlikely(!mm))
> +		mm = &init_mm;
> +	if (!locked)
> +		lock_page(page);
> +	/*
> +	 * If not locked, the page can be dropped from SwapCache until
> +	 * we reach here.
> +	 */
> +	if (PageSwapCache(page)) {
> +		ret = mem_cgroup_charge_common(page, mm, mask,
> +				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> +	}
> +	if (!locked)
> +		unlock_page(page);
> +
> +	return ret;
> +}
> +#endif
> +
>  void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
>  {
>  	struct page_cgroup *pc;
> @@ -818,6 +847,9 @@ __mem_cgroup_uncharge_common(struct page
>  	if (mem_cgroup_subsys.disabled)
>  		return;
>  
> +	if (PageSwapCache(page))
> +		return;
> +
>  	/*
>  	 * Check if our page_cgroup is valid
>  	 */
> @@ -826,12 +858,26 @@ __mem_cgroup_uncharge_common(struct page
>  		return;
>  
>  	lock_page_cgroup(pc);
> -	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED && page_mapped(page))
> -	     || !PageCgroupUsed(pc)) {
> -		/* This happens at race in zap_pte_range() and do_swap_page()*/
> -		unlock_page_cgroup(pc);
> -		return;
> +
> +	if (!PageCgroupUsed(pc))
> +		goto unlock_out;
> +
> +	switch(ctype) {
> +	case MEM_CGROUP_CHARGE_TYPE_MAPPED:
> +		if (page_mapped(page))
> +			goto unlock_out;
> +		break;
> +	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
> +		if (!PageAnon(page)) {	/* Shared memory */
> +			if (page->mapping && !page_is_file_cache(page))
> +				goto unlock_out;
> +		} else if (page_mapped(page)) /* Anon */
> +				goto unlock_out;
> +		break;
> +	default:
> +		break;
>  	}
> +
>  	ClearPageCgroupUsed(pc);
>  	mem = pc->mem_cgroup;
>  
> @@ -845,6 +891,10 @@ __mem_cgroup_uncharge_common(struct page
>  	css_put(&mem->css);
>  
>  	return;
> +
> +unlock_out:
> +	unlock_page_cgroup(pc);
> +	return;
>  }
>  
>  void mem_cgroup_uncharge_page(struct page *page)
> @@ -864,6 +914,11 @@ void mem_cgroup_uncharge_cache_page(stru
>  	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
>  }
>  
> +void mem_cgroup_uncharge_swapcache(struct page *page)
> +{
> +	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
> +}
> +
>  /*
>   * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
>   * page belongs to.
> @@ -921,7 +976,7 @@ void mem_cgroup_end_migration(struct mem
>  		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
>  
>  	/* unused page is not on radix-tree now. */
> -	if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
> +	if (unused)
>  		__mem_cgroup_uncharge_common(unused, ctype);
>  
>  	pc = lookup_page_cgroup(target);
> Index: mmotm-2.6.28-rc2+/mm/swap_state.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/swap_state.c
> +++ mmotm-2.6.28-rc2+/mm/swap_state.c
> @@ -119,6 +119,7 @@ void __delete_from_swap_cache(struct pag
>  	total_swapcache_pages--;
>  	__dec_zone_page_state(page, NR_FILE_PAGES);
>  	INC_CACHE_INFO(del_total);
> +	mem_cgroup_uncharge_swapcache(page);
>  }
>  
>  /**
> Index: mmotm-2.6.28-rc2+/include/linux/swap.h
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/include/linux/swap.h
> +++ mmotm-2.6.28-rc2+/include/linux/swap.h
> @@ -332,6 +332,22 @@ static inline void disable_swap_token(vo
>  	put_swap_token(swap_token_mm);
>  }
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +extern int mem_cgroup_cache_charge_swapin(struct page *page,
> +				struct mm_struct *mm, gfp_t mask, bool locked);
> +extern void mem_cgroup_uncharge_swapcache(struct page *page);
> +#else
> +static inline
> +int mem_cgroup_cache_charge_swapin(struct page *page,
> +				struct mm_struct *mm, gfp_t mask, bool locked)
> +{
> +	return 0;
> +}
> +static inline void mem_cgroup_uncharge_swapcache(struct page *page)
> +{
> +}
> +#endif
> +
>  #else /* CONFIG_SWAP */
>  
>  #define total_swap_pages			0
> Index: mmotm-2.6.28-rc2+/mm/shmem.c
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/mm/shmem.c
> +++ mmotm-2.6.28-rc2+/mm/shmem.c
> @@ -920,8 +920,12 @@ found:
>  	error = 1;
>  	if (!inode)
>  		goto out;
> -	/* Charge page using GFP_HIGHUSER_MOVABLE while we can wait */
> -	error = mem_cgroup_cache_charge(page, current->mm, GFP_HIGHUSER_MOVABLE);
> +	/*
> +         * Charge page using GFP_HIGHUSER_MOVABLE while we can wait.
> +         * charged back to the user(not to caller) when swap account is used.
> +         */
> +	error = mem_cgroup_cache_charge_swapin(page,
> +			current->mm, GFP_HIGHUSER_MOVABLE, true);
>  	if (error)
>  		goto out;
>  	error = radix_tree_preload(GFP_KERNEL);
> @@ -1258,6 +1262,16 @@ repeat:
>  				goto repeat;
>  			}
>  			wait_on_page_locked(swappage);
> +			/*
> +			 * We want to avoid charge at add_to_page_cache().
> +			 * charge against this swap cache here.
> +			 */
> +			if (mem_cgroup_cache_charge_swapin(swappage,
> +						current->mm, gfp, false)) {
> +				page_cache_release(swappage);
> +				error = -ENOMEM;
> +				goto failed;
> +			}
>  			page_cache_release(swappage);
>  			goto repeat;
>  		}
> Index: mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
> ===================================================================
> --- mmotm-2.6.28-rc2+.orig/Documentation/controllers/memory.txt
> +++ mmotm-2.6.28-rc2+/Documentation/controllers/memory.txt
> @@ -137,6 +137,11 @@ behind this approach is that a cgroup th
>  page will eventually get charged for it (once it is uncharged from
>  the cgroup that brought it in -- this will happen on memory pressure).
>  
> +Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
> +be backed into memory in force, charges for pages are accounted against the
> +caller of swapoff rather than the users of shmem.
> +
> +
>  2.4 Reclaim
>  
>  Each cgroup maintains a per cgroup LRU that consists of an active
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 5/6] memcg: mem+swap controller
  2008-11-05  8:23 ` [RFC][PATCH 5/6] memcg: mem+swap controller KAMEZAWA Hiroyuki
@ 2008-11-07  9:02   ` Daisuke Nishimura
  2008-11-07  9:19     ` KAMEZAWA Hiroyuki
  2008-11-07 13:21   ` Daisuke Nishimura
  2008-11-10  4:30   ` Daisuke Nishimura
  2 siblings, 1 reply; 28+ messages in thread
From: Daisuke Nishimura @ 2008-11-07  9:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

On Wed, 5 Nov 2008 17:23:16 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Mem+Swap controller core.
> 
> This patch implements per cgroup limit for usage of memory+swap.
> However there are SwapCache, double counting of swap-cache and
> swap-entry is avoided.
> 
> Mem+Swap controller works as following.
>   - memory usage is limited by memory.limit_in_bytes.
>   - memory + swap usage is limited by memory.memsw_limit_in_bytes.
> 
> 
> This has following benefits.
>   - A user can limit total resource usage of mem+swap.
> 
>     Without this, because memory resource controller doesn't take care of
>     usage of swap, a process can exhaust all the swap (by memory leak.)
>     We can avoid this case.
> 
>     And Swap is shared resource but it cannot be reclaimed (goes back to memory)
>     until it's used. This characteristic can be trouble when the memory
>     is divided into some parts by cpuset or memcg.
>     Assume group A and group B.
>     After some application executes, the system can be..
>     
>     Group A -- very large free memory space but occupy 99% of swap.
>     Group B -- under memory shortage but cannot use swap...it's nearly full.
> 
>     Ability to set appropriate swap limit for each group is required.
>       
> Maybe someone wonder "why not swap but mem+swap ?"
> 
>   - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
>     to move account from memory to swap...there is no change in usage of
>     mem+swap.
> 
>     In other words, when we want to limit the usage of swap without affecting
>     global LRU, mem+swap limit is better than just limiting swap.
> 
> 
> Accounting target information is stored in swap_cgroup which is
> per swap entry record.
> 
> Charge is done as following.
>   map
>     - charge  page and memsw.
> 
>   unmap
>     - uncharge page/memsw if not SwapCache.
> 
>   swap-out (__delete_from_swap_cache)
>     - uncharge page
>     - record mem_cgroup information to swap_cgroup.
> 
>   swap-in (do_swap_page)
>     - charged as page and memsw.
>       record in swap_cgroup is cleared.
>       memsw accounting is decremented.
> 
>   swap-free (swap_free())
>     - if swap entry is freed, memsw is uncharged by PAGE_SIZE.
> 
> 
> After this, usual memory resource controller handles SwapCache.
> (It was lacked(ignored) feature in current memcg but must be handled.)
> 
SwapCache has been handled in [2/6] already :)

(snip)
> @@ -514,12 +534,25 @@ static int __mem_cgroup_try_charge(struc
>  		css_get(&mem->css);
>  	}
>  
> +	while (1) {
> +		int ret;
> +		bool noswap = false;
>  
> -	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
> +		ret = res_counter_charge(&mem->res, PAGE_SIZE);
> +		if (likely(!ret)) {
> +			if (!do_swap_account)
> +				break;
> +			ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
> +			if (likely(!ret))
> +				break;
> +			/* mem+swap counter fails */
> +			res_counter_uncharge(&mem->res, PAGE_SIZE);
> +			noswap = true;
> +		}
>  		if (!(gfp_mask & __GFP_WAIT))
>  			goto nomem;
>  
> -		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> +		if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
>  			continue;
>  
>  		/*
I have two comment about try_charge.

1. It would be better if possible to avoid charging memsw at swapin (and uncharging
   it again at mem_cgroup_cache_charge_swapin/mem_cgroup_commit_charge_swapin).
   How about adding a new argument "charge_memsw" ? (it has many args already now...)
2. Should we use swap when exceeding mem.limit but mem.limit == memsw.limit ?

(snip)
>  void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
> @@ -838,6 +947,7 @@ void mem_cgroup_cancel_charge_swapin(str
>  	if (!mem)
>  		return;
>  	res_counter_uncharge(&mem->res, PAGE_SIZE);
> +	res_counter_uncharge(&mem->memsw, PAGE_SIZE);
>  	css_put(&mem->css);
>  }
>  
"if (do_swap_account)" is needed before uncharging memsw.

(snip)
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> -		.private = RES_USAGE,
> +		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
>  		.read_u64 = mem_cgroup_read,
>  	},
>  	{
>  		.name = "max_usage_in_bytes",
> -		.private = RES_MAX_USAGE,
> +		.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
>  		.trigger = mem_cgroup_reset,
>  		.read_u64 = mem_cgroup_read,
>  	},
>  	{
>  		.name = "limit_in_bytes",
> -		.private = RES_LIMIT,
> +		.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
>  		.write_string = mem_cgroup_write,
>  		.read_u64 = mem_cgroup_read,
>  	},
>  	{
>  		.name = "failcnt",
> -		.private = RES_FAILCNT,
> +		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
>  		.trigger = mem_cgroup_reset,
>  		.read_u64 = mem_cgroup_read,
>  	},
> @@ -1317,6 +1541,31 @@ static struct cftype mem_cgroup_files[] 
>  		.name = "stat",
>  		.read_map = mem_control_stat_show,
>  	},
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +	{
> +		.name = "memsw.usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
> +		.read_u64 = mem_cgroup_read,
> +	},
> +	{
> +		.name = "memsw.max_usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
> +		.trigger = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read,
> +	},
> +	{
> +		.name = "memsw.limit_in_bytes",
> +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
> +		.write_string = mem_cgroup_write,
> +		.read_u64 = mem_cgroup_read,
> +	},
> +	{
> +		.name = "memsw.failcnt",
> +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
> +		.trigger = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read,
> +	},
> +#endif
>  };
>  
IMHO, it would be better to define those "memsw.*" files as memsw_cgroup_files[],
and change mem_cgroup_populate() like:

static int mem_cgroup_populate(struct cgroup_subsys *ss,
				struct cgroup *cont)
{
	int ret;

	ret = cgroup_add_files(cont, ss, mem_cgroup_files,
					ARRAY_SIZE(mem_cgroup_files));
	if (!ret && do_swap_account)
		ret = cgroup_add_files(cont, ss, memsw_cgroup_files,
					ARRAY_SIZE(memsw_cgroup_files));

	return ret;
}

so that those files appear only when swap accounting is enabled.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 2/6] memcg: handle swap cache
  2008-11-07  8:53   ` Daisuke Nishimura
@ 2008-11-07  9:13     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-07  9:13 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, linux-kernel, balbir, menage

On Fri, 7 Nov 2008 17:53:03 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Wed, 5 Nov 2008 17:20:09 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > SwapCache support for memory resource controller (memcg)
> > 
> > Before mem+swap controller, memcg itself should handle SwapCache in proper way.
> > 
> > In current memcg, SwapCache is just leaked and the user can create tons of
> > SwapCache. This is a leak of account and should be handled.
> > 
> > SwapCache accounting is done as following.
> > 
> >   charge (anon)
> > 	- charged when it's mapped.
> > 	  (because of readahead, charge at add_to_swap_cache() is not sane)
> >   uncharge (anon)
> > 	- uncharged when it's dropped from swapcache and fully unmapped.
> > 	  means it's not uncharged at unmap.
> > 	  Note: delete from swap cache at swap-in is done after rmap information
> > 	        is established.
> >   charge (shmem)
> > 	- charged at swap-in. this prevents charge at add_to_page_cache().
> > 
> >   uncharge (shmem)
> > 	- uncharged when it's dropped from swapcache and not on shmem's
> > 	  radix-tree.
> > 
> >   at migration, check against 'old page' is modified to handle shmem.
> > 
> > Comparing to the old version discussed (and caused troubles), we have
> > advantages of
> >   - PCG_USED bit.
> >   - simple migrating handling.
> > 
> > So, situation is much easier than several months ago, maybe.
> > 
> > Changelog (v1) -> (v2)
> >   - use lock_page() when we handle unlocked SwapCache.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> I tested this version under swap in/out activity with page migration/rmdir,
> and it worked w/o errors for more than 24 hours.
> 
> 	Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> 	Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> 
Thank you!

-Kame


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 5/6] memcg: mem+swap controller
  2008-11-07  9:02   ` Daisuke Nishimura
@ 2008-11-07  9:19     ` KAMEZAWA Hiroyuki
  2008-11-07 13:30       ` Daisuke Nishimura
  0 siblings, 1 reply; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-07  9:19 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, linux-kernel, balbir, menage

On Fri, 7 Nov 2008 18:02:48 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Wed, 5 Nov 2008 17:23:16 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Mem+Swap controller core.
> > 
> > This patch implements per cgroup limit for usage of memory+swap.
> > However there are SwapCache, double counting of swap-cache and
> > swap-entry is avoided.
> > 
> > Mem+Swap controller works as following.
> >   - memory usage is limited by memory.limit_in_bytes.
> >   - memory + swap usage is limited by memory.memsw_limit_in_bytes.
> > 
> > 
> > This has following benefits.
> >   - A user can limit total resource usage of mem+swap.
> > 
> >     Without this, because memory resource controller doesn't take care of
> >     usage of swap, a process can exhaust all the swap (by memory leak.)
> >     We can avoid this case.
> > 
> >     And Swap is shared resource but it cannot be reclaimed (goes back to memory)
> >     until it's used. This characteristic can be trouble when the memory
> >     is divided into some parts by cpuset or memcg.
> >     Assume group A and group B.
> >     After some application executes, the system can be..
> >     
> >     Group A -- very large free memory space but occupy 99% of swap.
> >     Group B -- under memory shortage but cannot use swap...it's nearly full.
> > 
> >     Ability to set appropriate swap limit for each group is required.
> >       
> > Maybe someone wonder "why not swap but mem+swap ?"
> > 
> >   - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
> >     to move account from memory to swap...there is no change in usage of
> >     mem+swap.
> > 
> >     In other words, when we want to limit the usage of swap without affecting
> >     global LRU, mem+swap limit is better than just limiting swap.
> > 
> > 
> > Accounting target information is stored in swap_cgroup which is
> > per swap entry record.
> > 
> > Charge is done as following.
> >   map
> >     - charge  page and memsw.
> > 
> >   unmap
> >     - uncharge page/memsw if not SwapCache.
> > 
> >   swap-out (__delete_from_swap_cache)
> >     - uncharge page
> >     - record mem_cgroup information to swap_cgroup.
> > 
> >   swap-in (do_swap_page)
> >     - charged as page and memsw.
> >       record in swap_cgroup is cleared.
> >       memsw accounting is decremented.
> > 
> >   swap-free (swap_free())
> >     - if swap entry is freed, memsw is uncharged by PAGE_SIZE.
> > 
> > 
> > After this, usual memory resource controller handles SwapCache.
> > (It was lacked(ignored) feature in current memcg but must be handled.)
> > 
> SwapCache has been handled in [2/6] already :)
> 
yes. I'll rewrite this.


> (snip)
> > @@ -514,12 +534,25 @@ static int __mem_cgroup_try_charge(struc
> >  		css_get(&mem->css);
> >  	}
> >  
> > +	while (1) {
> > +		int ret;
> > +		bool noswap = false;
> >  
> > -	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
> > +		ret = res_counter_charge(&mem->res, PAGE_SIZE);
> > +		if (likely(!ret)) {
> > +			if (!do_swap_account)
> > +				break;
> > +			ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
> > +			if (likely(!ret))
> > +				break;
> > +			/* mem+swap counter fails */
> > +			res_counter_uncharge(&mem->res, PAGE_SIZE);
> > +			noswap = true;
> > +		}
> >  		if (!(gfp_mask & __GFP_WAIT))
> >  			goto nomem;
> >  
> > -		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> > +		if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
> >  			continue;
> >  
> >  		/*
> I have two comment about try_charge.
> 
> 1. It would be better if possible to avoid charging memsw at swapin (and uncharging
>    it again at mem_cgroup_cache_charge_swapin/mem_cgroup_commit_charge_swapin).
>    How about adding a new argument "charge_memsw" ? (it has many args already now...)

Hmm, maybe possible and good. I'll cosider this again.

> 2. Should we use swap when exceeding mem.limit but mem.limit == memsw.limit ?
> 
I'd like to put that special case into "TODO" list. Hmm...
maybe set noswap=true in that case is enough. but we have to be careful.


> (snip)
> >  void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
> > @@ -838,6 +947,7 @@ void mem_cgroup_cancel_charge_swapin(str
> >  	if (!mem)
> >  		return;
> >  	res_counter_uncharge(&mem->res, PAGE_SIZE);
> > +	res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> >  	css_put(&mem->css);
> >  }
> >  
> "if (do_swap_account)" is needed before uncharging memsw.
> 
good catch !

> (snip)
> >  static struct cftype mem_cgroup_files[] = {
> >  	{
> >  		.name = "usage_in_bytes",
> > -		.private = RES_USAGE,
> > +		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> >  		.read_u64 = mem_cgroup_read,
> >  	},
> >  	{
> >  		.name = "max_usage_in_bytes",
> > -		.private = RES_MAX_USAGE,
> > +		.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
> >  		.trigger = mem_cgroup_reset,
> >  		.read_u64 = mem_cgroup_read,
> >  	},
> >  	{
> >  		.name = "limit_in_bytes",
> > -		.private = RES_LIMIT,
> > +		.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
> >  		.write_string = mem_cgroup_write,
> >  		.read_u64 = mem_cgroup_read,
> >  	},
> >  	{
> >  		.name = "failcnt",
> > -		.private = RES_FAILCNT,
> > +		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
> >  		.trigger = mem_cgroup_reset,
> >  		.read_u64 = mem_cgroup_read,
> >  	},
> > @@ -1317,6 +1541,31 @@ static struct cftype mem_cgroup_files[] 
> >  		.name = "stat",
> >  		.read_map = mem_control_stat_show,
> >  	},
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +	{
> > +		.name = "memsw.usage_in_bytes",
> > +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
> > +		.read_u64 = mem_cgroup_read,
> > +	},
> > +	{
> > +		.name = "memsw.max_usage_in_bytes",
> > +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
> > +		.trigger = mem_cgroup_reset,
> > +		.read_u64 = mem_cgroup_read,
> > +	},
> > +	{
> > +		.name = "memsw.limit_in_bytes",
> > +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
> > +		.write_string = mem_cgroup_write,
> > +		.read_u64 = mem_cgroup_read,
> > +	},
> > +	{
> > +		.name = "memsw.failcnt",
> > +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
> > +		.trigger = mem_cgroup_reset,
> > +		.read_u64 = mem_cgroup_read,
> > +	},
> > +#endif
> >  };
> >  
> IMHO, it would be better to define those "memsw.*" files as memsw_cgroup_files[],
> and change mem_cgroup_populate() like:
> 
> static int mem_cgroup_populate(struct cgroup_subsys *ss,
> 				struct cgroup *cont)
> {
> 	int ret;
> 
> 	ret = cgroup_add_files(cont, ss, mem_cgroup_files,
> 					ARRAY_SIZE(mem_cgroup_files));
> 	if (!ret && do_swap_account)
> 		ret = cgroup_add_files(cont, ss, memsw_cgroup_files,
> 					ARRAY_SIZE(memsw_cgroup_files));
> 
> 	return ret;
> }
> 
> so that those files appear only when swap accounting is enabled.
> 

Nice idea. I'll try that. 

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 5/6] memcg: mem+swap controller
  2008-11-05  8:23 ` [RFC][PATCH 5/6] memcg: mem+swap controller KAMEZAWA Hiroyuki
  2008-11-07  9:02   ` Daisuke Nishimura
@ 2008-11-07 13:21   ` Daisuke Nishimura
  2008-11-10  4:30   ` Daisuke Nishimura
  2 siblings, 0 replies; 28+ messages in thread
From: Daisuke Nishimura @ 2008-11-07 13:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

> @@ -1286,30 +1507,33 @@ static int mem_control_stat_show(struct 
>  		cb->fill(cb, "unevictable", unevictable * PAGE_SIZE);
>  
>  	}
> +	/* showing refs from disk-swap */
> +	cb->fill(cb, "swap_on_disk", atomic_read(&mem_cont->swapref)
> +					* PAGE_SIZE);
>  	return 0;
>  }
>  
"if (do_swap_account)" is needed here too.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 5/6] memcg: mem+swap controller
  2008-11-07  9:19     ` KAMEZAWA Hiroyuki
@ 2008-11-07 13:30       ` Daisuke Nishimura
  0 siblings, 0 replies; 28+ messages in thread
From: Daisuke Nishimura @ 2008-11-07 13:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

> > >  static struct cftype mem_cgroup_files[] = {
> > >  	{
> > >  		.name = "usage_in_bytes",
> > > -		.private = RES_USAGE,
> > > +		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> > >  		.read_u64 = mem_cgroup_read,
> > >  	},
> > >  	{
> > >  		.name = "max_usage_in_bytes",
> > > -		.private = RES_MAX_USAGE,
> > > +		.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
> > >  		.trigger = mem_cgroup_reset,
> > >  		.read_u64 = mem_cgroup_read,
> > >  	},
> > >  	{
> > >  		.name = "limit_in_bytes",
> > > -		.private = RES_LIMIT,
> > > +		.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
> > >  		.write_string = mem_cgroup_write,
> > >  		.read_u64 = mem_cgroup_read,
> > >  	},
> > >  	{
> > >  		.name = "failcnt",
> > > -		.private = RES_FAILCNT,
> > > +		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
> > >  		.trigger = mem_cgroup_reset,
> > >  		.read_u64 = mem_cgroup_read,
> > >  	},
> > > @@ -1317,6 +1541,31 @@ static struct cftype mem_cgroup_files[] 
> > >  		.name = "stat",
> > >  		.read_map = mem_control_stat_show,
> > >  	},
> > > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > > +	{
> > > +		.name = "memsw.usage_in_bytes",
> > > +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
> > > +		.read_u64 = mem_cgroup_read,
> > > +	},
> > > +	{
> > > +		.name = "memsw.max_usage_in_bytes",
> > > +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
> > > +		.trigger = mem_cgroup_reset,
> > > +		.read_u64 = mem_cgroup_read,
> > > +	},
> > > +	{
> > > +		.name = "memsw.limit_in_bytes",
> > > +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
> > > +		.write_string = mem_cgroup_write,
> > > +		.read_u64 = mem_cgroup_read,
> > > +	},
> > > +	{
> > > +		.name = "memsw.failcnt",
> > > +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
> > > +		.trigger = mem_cgroup_reset,
> > > +		.read_u64 = mem_cgroup_read,
> > > +	},
> > > +#endif
> > >  };
> > >  
> > IMHO, it would be better to define those "memsw.*" files as memsw_cgroup_files[],
> > and change mem_cgroup_populate() like:
> > 
> > static int mem_cgroup_populate(struct cgroup_subsys *ss,
> > 				struct cgroup *cont)
> > {
> > 	int ret;
> > 
> > 	ret = cgroup_add_files(cont, ss, mem_cgroup_files,
> > 					ARRAY_SIZE(mem_cgroup_files));
> > 	if (!ret && do_swap_account)
> > 		ret = cgroup_add_files(cont, ss, memsw_cgroup_files,
> > 					ARRAY_SIZE(memsw_cgroup_files));
> > 
> > 	return ret;
> > }
> > 
> > so that those files appear only when swap accounting is enabled.
> > 
> 
> Nice idea. I'll try that. 
> 
I made a patch for this.

please merge this if it looks good to you.

I've confirmed that memsw.* files doesn't created with noswapaccount,
and this can be compiled with !CONFIG_CGROUP_MEM_RES_CTLR_SWAP.


Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 27f1772..03dfc46 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1542,6 +1542,9 @@ static struct cftype mem_cgroup_files[] = {
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
+};
+
+static struct cftype swap_cgroup_files[] = {
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 	{
 		.name = "memsw.usage_in_bytes",
@@ -1724,8 +1727,14 @@ static void mem_cgroup_destroy(struct cgroup_subsys *ss,
 static int mem_cgroup_populate(struct cgroup_subsys *ss,
 				struct cgroup *cont)
 {
-	return cgroup_add_files(cont, ss, mem_cgroup_files,
+	int ret;
+	ret = cgroup_add_files(cont, ss, mem_cgroup_files,
 					ARRAY_SIZE(mem_cgroup_files));
+	if (!ret && do_swap_account)
+		ret = cgroup_add_files(cont, ss, swap_cgroup_files,
+					ARRAY_SIZE(swap_cgroup_files));
+
+	return ret;
 }
 
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 5/6] memcg: mem+swap controller
  2008-11-05  8:23 ` [RFC][PATCH 5/6] memcg: mem+swap controller KAMEZAWA Hiroyuki
  2008-11-07  9:02   ` Daisuke Nishimura
  2008-11-07 13:21   ` Daisuke Nishimura
@ 2008-11-10  4:30   ` Daisuke Nishimura
  2008-11-10  7:03     ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 28+ messages in thread
From: Daisuke Nishimura @ 2008-11-10  4:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, menage, nishimura

> @@ -1062,13 +1208,55 @@ int mem_cgroup_resize_limit(struct mem_c
>  			break;
>  		}
>  		progress = try_to_free_mem_cgroup_pages(memcg,
> -				GFP_HIGHUSER_MOVABLE);
> +				GFP_HIGHUSER_MOVABLE, false);
>  		if (!progress)
>  			retry_count--;
>  	}
>  	return ret;
>  }
>  
mem_cgroup_resize_limit() should verify that mem.limit <= memsw.limit
as mem_cgroup_resize_memsw_limit() does.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 5/6] memcg: mem+swap controller
  2008-11-10  4:30   ` Daisuke Nishimura
@ 2008-11-10  7:03     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-10  7:03 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, linux-kernel, balbir, menage

On Mon, 10 Nov 2008 13:30:54 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > @@ -1062,13 +1208,55 @@ int mem_cgroup_resize_limit(struct mem_c
> >  			break;
> >  		}
> >  		progress = try_to_free_mem_cgroup_pages(memcg,
> > -				GFP_HIGHUSER_MOVABLE);
> > +				GFP_HIGHUSER_MOVABLE, false);
> >  		if (!progress)
> >  			retry_count--;
> >  	}
> >  	return ret;
> >  }
> >  
> mem_cgroup_resize_limit() should verify that mem.limit <= memsw.limit
> as mem_cgroup_resize_memsw_limit() does.
> 
> 
nice catch. will be fixed in the next version.

Thanks a lot!

-Kame


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC][PATCH 3/6] memcg: mem+swap controller kconfig
  2008-11-12  3:26 [RFC][PATCH 0/6] memcg updates (12/Nov/2008) KAMEZAWA Hiroyuki
@ 2008-11-12  3:28 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-12  3:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, balbir, nishimura, menage

Config and control variable for mem+swap controller.

This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
(memory resource controller swap extension.)

For accounting swap, it's obvious that we have to use additional memory
to remember "who uses swap". This adds more overhead.
So, it's better to offer "choice" to users. This patch adds 2 choices.

This patch adds 2 parameters to enable swap extension or not.
  - CONFIG
  - boot option

Changelog: v2 -> v3
 - adjusted to avoid HUNK.

Changelog: v1 -> v2
 - fixed typo.
 - make default value of "do_swap_account" to be 0 and turned on 1
   later if configured.

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 Documentation/kernel-parameters.txt |    3 +++
 include/linux/memcontrol.h          |    3 +++
 init/Kconfig                        |   17 +++++++++++++++++
 mm/memcontrol.c                     |   32 ++++++++++++++++++++++++++++++++
 4 files changed, 55 insertions(+)

Index: mmotm-2.6.28-Nov10/init/Kconfig
===================================================================
--- mmotm-2.6.28-Nov10.orig/init/Kconfig
+++ mmotm-2.6.28-Nov10/init/Kconfig
@@ -428,6 +428,23 @@ config CGROUP_MEM_RES_CTLR
 config MM_OWNER
 	bool
 
+config CGROUP_MEM_RES_CTLR_SWAP
+	bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
+	depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
+	help
+	  Add swap management feature to memory resource controller. When you
+	  enable this, you can limit mem+swap usage per cgroup. In other words,
+	  when you disable this, memory resource controller has no cares to
+	  usage of swap...a process can exhaust all of the swap. This extension
+	  is useful when you want to avoid exhaustion swap but this itself
+	  adds more overheads and consumes memory for remembering information.
+	  Especially if you use 32bit system or small memory system, please
+	  be careful about enabling this. When memory resource controller
+	  is disabled by boot option, this will be automatically disabled and
+	  there will be no overhead from this. Even when you set this config=y,
+	  if boot option "noswapaccount" is set, swap will not be accounted.
+
+
 endmenu
 
 config SYSFS_DEPRECATED
Index: mmotm-2.6.28-Nov10/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov10/mm/memcontrol.c
@@ -42,6 +42,15 @@
 struct cgroup_subsys mem_cgroup_subsys __read_mostly;
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/* Turned on only when memory cgroup is enabled && really_do_swap_account = 0 */
+int do_swap_account __read_mostly;
+static int really_do_swap_account __initdata = 1; /* for remember boot option*/
+#else
+#define do_swap_account		(0)
+#endif
+
+
 /*
  * Statistics for memory cgroup.
  */
@@ -1510,6 +1519,18 @@ static void mem_cgroup_free(struct mem_c
 }
 
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+static void __init enable_swap_cgroup(void)
+{
+	if (!mem_cgroup_subsys.disabled && really_do_swap_account)
+		do_swap_account = 1;
+}
+#else
+static void __init enable_swap_cgroup(void)
+{
+}
+#endif
+
 static struct cgroup_subsys_state *
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -1521,6 +1542,7 @@ mem_cgroup_create(struct cgroup_subsys *
 		mem = &init_mem_cgroup;
 		/* do this in explicit way. */
 		mem->on_rmdir = 0;
+		enable_swap_cgroup();
 	} else {
 		mem = mem_cgroup_alloc();
 		if (!mem)
@@ -1607,3 +1629,13 @@ struct cgroup_subsys mem_cgroup_subsys =
 	.attach = mem_cgroup_move_task,
 	.early_init = 0,
 };
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+static int __init disable_swap_account(char *s)
+{
+	really_do_swap_account = 0;
+	return 1;
+}
+__setup("noswapaccount", disable_swap_account);
+#endif
Index: mmotm-2.6.28-Nov10/Documentation/kernel-parameters.txt
===================================================================
--- mmotm-2.6.28-Nov10.orig/Documentation/kernel-parameters.txt
+++ mmotm-2.6.28-Nov10/Documentation/kernel-parameters.txt
@@ -1554,6 +1554,9 @@ and is between 256 and 4096 characters. 
 
 	nosoftlockup	[KNL] Disable the soft-lockup detector.
 
+	noswapaccount	[KNL] Disable accounting of swap in memory resource
+			controller. (See Documentation/controllers/memory.txt)
+
 	nosync		[HW,M68K] Disables sync negotiation for all devices.
 
 	notsc		[BUGS=X86-32] Disable Time Stamp Counter
Index: mmotm-2.6.28-Nov10/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Nov10.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Nov10/include/linux/memcontrol.h
@@ -77,6 +77,9 @@ extern void mem_cgroup_record_reclaim_pr
 extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
 					int priority, enum lru_list lru);
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern int do_swap_account;
+#endif
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2008-11-12  3:29 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-11-05  8:16 [RFC][PATCH 0/6] memcg updates (05/Nov) KAMEZAWA Hiroyuki
2008-11-05  8:18 ` [RFC][PATCH 1/6] memcg: move all accounts to parent at rmdir() KAMEZAWA Hiroyuki
2008-11-05  8:20 ` [RFC][PATCH 2/6] memcg: handle swap cache KAMEZAWA Hiroyuki
2008-11-07  8:53   ` Daisuke Nishimura
2008-11-07  9:13     ` KAMEZAWA Hiroyuki
2008-11-05  8:20 ` [RFC][PATCH 3/6] memcg : mem+swap controller kconfig KAMEZAWA Hiroyuki
2008-11-06 11:07   ` Daisuke Nishimura
2008-11-05  8:21 ` [RFC][PATCH 4/6] memcg : swap cgroup KAMEZAWA Hiroyuki
2008-11-06 11:25   ` Daisuke Nishimura
2008-11-06 12:44     ` KAMEZAWA Hiroyuki
2008-11-07  1:19       ` Daisuke Nishimura
2008-11-05  8:23 ` [RFC][PATCH 5/6] memcg: mem+swap controller KAMEZAWA Hiroyuki
2008-11-07  9:02   ` Daisuke Nishimura
2008-11-07  9:19     ` KAMEZAWA Hiroyuki
2008-11-07 13:30       ` Daisuke Nishimura
2008-11-07 13:21   ` Daisuke Nishimura
2008-11-10  4:30   ` Daisuke Nishimura
2008-11-10  7:03     ` KAMEZAWA Hiroyuki
2008-11-05  8:24 ` [RFC][PATCH 6/6] memcg: synchronized LRU KAMEZAWA Hiroyuki
2008-11-06  6:54 ` [RFC][PATCH 0/6] memcg updates (05/Nov) Balbir Singh
2008-11-06  7:03   ` KAMEZAWA Hiroyuki
2008-11-06 10:41   ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior of rmdir) KAMEZAWA Hiroyuki
2008-11-06 11:59     ` Hugh Dickins
2008-11-06 12:47       ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior ofrmdir) KAMEZAWA Hiroyuki
2008-11-06 13:46     ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior of rmdir) Balbir Singh
2008-11-06 14:30       ` [RFC][PATCH 7/6] memcg: add atribute (for change bahavior ofrmdir) KAMEZAWA Hiroyuki
2008-11-07  1:12         ` KAMEZAWA Hiroyuki
2008-11-12  3:26 [RFC][PATCH 0/6] memcg updates (12/Nov/2008) KAMEZAWA Hiroyuki
2008-11-12  3:28 ` [RFC][PATCH 3/6] memcg: mem+swap controller kconfig KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).