linux-mm.kvack.org archive mirror
* [PATCH 0/9] memcg updates (14/Nov/2008)
From: KAMEZAWA Hiroyuki @ 2008-11-14 10:12 UTC (permalink / raw)
  To: linux-mm
  Cc: balbir, nishimura, pbadari, jblunck, taka, akpm, lizf, linux-kernel

Several patches have been posted since the last update (12/Nov);
it's better to catch up with all of them as a series.

All patches are against the mm-of-the-moment snapshot 2008-11-13-17-22
  http://userweb.kernel.org/~akpm/mmotm/
(You may need to patch fs/dquota.c and fix a CONFIG error in kernel/auditsc.c.)

New ones are 1,2,3 and 9. 

IMHO, patches 1-4 are ready to go (but I want an Ack from Balbir for 3/9).

Contents:

[1/9] .... fix memory online/offline with memcg.
  This patch is for "real" memory hotplug, so the people who can test it
  are limited, I think. I asked Badari to try it.
  The fix itself is logically correct, I think, but there may be other bugs.

[2/9] .... reduce size of per-cpu allocation.
  This is from Jan Blunck <jblunck@suse.de>; I picked it up and rewrote it.
  Please test. It tries to reduce the memory usage of the mem_cgroup struct.

[3/9] .... add force_empty again with proper implementation.
  I removed "force_empty" by account_move patch in mmotm. But I asked not to
  do that brutal removal of interface. I'm sorry.
  This adds "force_empty", but implemntaion itself is much saner. After this,
  force_empty is no longer "debug only" interface.

[4/9] .... account swap-cache.
  Before accounting swap, we have to handle swap-cache.
  This patch has been tested for a month and seems to work well. It's still
  here, waiting for bug fixes to be moved in.

[5/9] .... mem+swap controller kconfig
  Kconfig changes and macro for mem+swap controller.

[6/9] .... swap cgroup.
  For accounting swap, we have to prepare storage for remembering swap usage.

[7/9] .... mem+swap controller.
  mem+swap controller core logic. Nishimura and I have been testing this
  for a month. It's getting nicer.

[8/9] .... synchronized LRU patch
  Removes mz->lru_lock and makes use of zone->lru_lock instead. With this, we
  do not have to duplicate vmscan's global LRU behavior in memcg.
  I think I'm the only tester of this ;) but it works well.

[9/9] .... mem_cgroup_disabled() patch
  Replaces if (mem_cgroup_subsys.disabled) with if (mem_cgroup_disabled()).
  Takahashi (dm-ioband team) posted their bio-cgroup interface working with
  page_cgroup. This is cut out from his work.
  Takahashi, if you ack this, send me a Signed-off-by or Acked-by and I'll
  queue it.

Thanks,
-Kame




* [PATCH 1/9] memcg: memory hotplug fix for notifier callback.
From: KAMEZAWA Hiroyuki @ 2008-11-14 10:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, pbadari, jblunck, taka, akpm, lizf,
	linux-kernel

Fixes for memcg/memory hotplug.


While memory hotplug allocates/frees the memmap, page_cgroup doesn't free
its table at OFFLINE when it was allocated via bootmem
(because freeing bootmem requires special care).

Then, if page_cgroup was allocated from bootmem and the memmap is freed and
re-allocated by memory hotplug, page_cgroup->page == page is no longer true.

But the current MEM_ONLINE handler doesn't check this, and doesn't update
page_cgroup->page when it's not necessary to allocate page_cgroup.
(This was not found because the memmap is not freed when SPARSEMEM_VMEMMAP=y.)

I also noticed that MEM_ONLINE can be called against "part of a section".
So, freeing page_cgroup at CANCEL_ONLINE would cause trouble
(it would free page_cgroup entries that are still in use).
Don't roll back at CANCEL.

One more thing: the current memory hotplug notifier chain is stopped by slub,
because slub sets NOTIFY_STOP_MASK in its return value. So page_cgroup's
callback is never called (it now has lower priority than slub's).

I think this slub behavior is not intentional (a bug), and this patch fixes
it.
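
For reference, this is why even a successful slub callback stops the chain:
if I read include/linux/notifier.h in this snapshot correctly, the helper
sets the stop mask unconditionally, roughly

	static inline int notifier_from_errno(int err)
	{
		return NOTIFY_STOP_MASK | (NOTIFY_OK - err);
	}

so notifier_from_errno(0) still carries NOTIFY_STOP_MASK and
notifier_call_chain() stops walking lower-priority callbacks. Hence the
explicit NOTIFY_OK on success in both callbacks below.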


Another approach to consider for page_cgroup allocation:
  - free page_cgroup at OFFLINE even if it came from bootmem
    and remove the special handling. But that requires more changes.


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/page_cgroup.c |   43 +++++++++++++++++++++++++++++--------------
 mm/slub.c        |    6 ++++--
 2 files changed, 33 insertions(+), 16 deletions(-)

Index: mmotm-2.6.28-Nov13/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-Nov13/mm/page_cgroup.c
@@ -104,19 +104,29 @@ int __meminit init_section_page_cgroup(u
 	unsigned long table_size;
 	int nid, index;
 
-	if (section->page_cgroup)
-		return 0;
-
-	nid = page_to_nid(pfn_to_page(pfn));
-
-	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
-	if (slab_is_available()) {
-		base = kmalloc_node(table_size, GFP_KERNEL, nid);
-		if (!base)
-			base = vmalloc_node(table_size, nid);
-	} else {
-		base = __alloc_bootmem_node_nopanic(NODE_DATA(nid), table_size,
+	if (!section->page_cgroup) {
+		nid = page_to_nid(pfn_to_page(pfn));
+		table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
+		if (slab_is_available()) {
+			base = kmalloc_node(table_size, GFP_KERNEL, nid);
+			if (!base)
+				base = vmalloc_node(table_size, nid);
+		} else {
+			base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+				table_size,
 				PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+		}
+	} else {
+		/*
+ 		 * We don't have to allocate page_cgroup again, but
+		 * address of memmap may be changed. So, we have to initialize
+		 * again.
+		 */
+		base = section->page_cgroup + pfn;
+		table_size = 0;
+		/* check address of memmap is changed or not. */
+		if (base->page == pfn_to_page(pfn))
+			return 0;
 	}
 
 	if (!base) {
@@ -204,18 +214,23 @@ static int page_cgroup_callback(struct n
 		ret = online_page_cgroup(mn->start_pfn,
 				   mn->nr_pages, mn->status_change_nid);
 		break;
-	case MEM_CANCEL_ONLINE:
 	case MEM_OFFLINE:
 		offline_page_cgroup(mn->start_pfn,
 				mn->nr_pages, mn->status_change_nid);
 		break;
+	case MEM_CANCEL_ONLINE:
 	case MEM_GOING_OFFLINE:
 		break;
 	case MEM_ONLINE:
 	case MEM_CANCEL_OFFLINE:
 		break;
 	}
-	ret = notifier_from_errno(ret);
+
+	if (ret)
+		ret = notifier_from_errno(ret);
+	else
+		ret = NOTIFY_OK;
+
 	return ret;
 }
 
Index: mmotm-2.6.28-Nov13/mm/slub.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/slub.c
+++ mmotm-2.6.28-Nov13/mm/slub.c
@@ -3220,8 +3220,10 @@ static int slab_memory_callback(struct n
 	case MEM_CANCEL_OFFLINE:
 		break;
 	}
-
-	ret = notifier_from_errno(ret);
+	if (ret)
+		ret = notifier_from_errno(ret);
+	else
+		ret = NOTIFY_OK;
 	return ret;
 }
 


* [PATCH 2/9] memcg : reduce size of mem_cgroup by using nr_cpu_ids.
From: KAMEZAWA Hiroyuki @ 2008-11-14 10:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, pbadari, jblunck, taka, akpm, lizf,
	linux-kernel

As Jan Blunck <jblunck@suse.de> pointed out, allocating the per-cpu stat
array for memcg at the size of NR_CPUS is not good.

This patch bases mem_cgroup's cpustat allocation on nr_cpu_ids rather than
NR_CPUS.
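
The diff below turns the stat array into a zero-length trailing member and
computes the real size at allocation time. A minimal user-space sketch of
that idiom (the names here are illustrative, not the patch's):

	#include <stdlib.h>
	#include <string.h>

	struct stat_cpu { long count[4]; };

	struct stats {
		long total;			/* ...other fields... */
		struct stat_cpu cpustat[0];	/* sized at allocation time */
	};

	static struct stats *stats_alloc(int nr_cpus)
	{
		/* allocate only nr_cpus entries, not a compile-time maximum */
		size_t size = sizeof(struct stats)
				+ nr_cpus * sizeof(struct stat_cpu);
		struct stats *s = malloc(size);

		if (s)
			memset(s, 0, size);
		return s;
	}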

Changelog:
 - fixed bugs in error path.

From: Jan Blunck <jblunck@suse.de>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/memcontrol.c |   35 ++++++++++++++++++-----------------
 1 file changed, 18 insertions(+), 17 deletions(-)

Index: mmotm-2.6.28-Nov13/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov13/mm/memcontrol.c
@@ -60,7 +60,7 @@ struct mem_cgroup_stat_cpu {
 } ____cacheline_aligned_in_smp;
 
 struct mem_cgroup_stat {
-	struct mem_cgroup_stat_cpu cpustat[NR_CPUS];
+	struct mem_cgroup_stat_cpu cpustat[0];
 };
 
 /*
@@ -129,11 +129,10 @@ struct mem_cgroup {
 
 	int	prev_priority;	/* for recording reclaim priority */
 	/*
-	 * statistics.
+	 * statistics. This must be placed at the end of memcg.
 	 */
 	struct mem_cgroup_stat stat;
 };
-static struct mem_cgroup init_mem_cgroup;
 
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -1292,23 +1291,30 @@ static void free_mem_cgroup_per_zone_inf
 	kfree(mem->info.nodeinfo[node]);
 }
 
+static int mem_cgroup_size(void)
+{
+	int cpustat_size = nr_cpu_ids * sizeof(struct mem_cgroup_stat_cpu);
+	return sizeof(struct mem_cgroup) + cpustat_size;
+}
+
 static struct mem_cgroup *mem_cgroup_alloc(void)
 {
 	struct mem_cgroup *mem;
+	int size = mem_cgroup_size();
 
-	if (sizeof(*mem) < PAGE_SIZE)
-		mem = kmalloc(sizeof(*mem), GFP_KERNEL);
+	if (size < PAGE_SIZE)
+		mem = kmalloc(size, GFP_KERNEL);
 	else
-		mem = vmalloc(sizeof(*mem));
+		mem = vmalloc(size);
 
 	if (mem)
-		memset(mem, 0, sizeof(*mem));
+		memset(mem, 0, size);
 	return mem;
 }
 
 static void mem_cgroup_free(struct mem_cgroup *mem)
 {
-	if (sizeof(*mem) < PAGE_SIZE)
+	if (mem_cgroup_size() < PAGE_SIZE)
 		kfree(mem);
 	else
 		vfree(mem);
@@ -1321,13 +1327,9 @@ mem_cgroup_create(struct cgroup_subsys *
 	struct mem_cgroup *mem;
 	int node;
 
-	if (unlikely((cont->parent) == NULL)) {
-		mem = &init_mem_cgroup;
-	} else {
-		mem = mem_cgroup_alloc();
-		if (!mem)
-			return ERR_PTR(-ENOMEM);
-	}
+	mem = mem_cgroup_alloc();
+	if (!mem)
+		return ERR_PTR(-ENOMEM);
 
 	res_counter_init(&mem->res);
 
@@ -1339,8 +1341,7 @@ mem_cgroup_create(struct cgroup_subsys *
 free_out:
 	for_each_node_state(node, N_POSSIBLE)
 		free_mem_cgroup_per_zone_info(mem, node);
-	if (cont->parent != NULL)
-		mem_cgroup_free(mem);
+	mem_cgroup_free(mem);
 	return ERR_PTR(-ENOMEM);
 }
 


* [PATCH 3/9] memcg: new force_empty to free pages under group
From: KAMEZAWA Hiroyuki @ 2008-11-14 10:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, pbadari, jblunck, taka, akpm, lizf,
	linux-kernel

With memcg-move-all-accounts-to-parent-at-rmdir.patch, there is no leak of
memory usage at rmdir and force_empty was removed.

This patch adds "force_empty" back, in a reasonable manner.

The memory.force_empty file is triggered by

  # echo 0 (or any value) > memory.force_empty

and behaves as follows.

  1. It only works when there are no tasks in this cgroup.
  2. It frees as many pages under this cgroup as possible.
  3. Pages which cannot be freed are moved up to the parent.
  4. As a result, the memcg is empty after the echo above returns.

This is much better behavior than the old "force_empty", which just forgot
all accounts. This patch also checks signal_pending(), so the "echo" above
can be stopped with Ctrl-C. A typical session is sketched below.
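
For example (the cgroup mount point and group name here are my assumptions,
not part of the patch):

  # mkdir /cgroups/test
  ... run tasks in /cgroups/test, then move them all out ...
  # echo 0 > /cgroups/test/memory.force_empty
  # rmdir /cgroups/test

Calling force_empty before rmdir() means rmdir() has almost nothing left to
move to the parent, which avoids pushing out-of-use page cache up the
hierarchy.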

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 Documentation/controllers/memory.txt |   27 +++++++++++++++++++++++----
 mm/memcontrol.c                      |   34 ++++++++++++++++++++++++++++++----
 2 files changed, 53 insertions(+), 8 deletions(-)

Index: mmotm-2.6.28-Nov13/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov13/mm/memcontrol.c
@@ -1061,7 +1061,7 @@ static int mem_cgroup_force_empty_list(s
  * make mem_cgroup's charge to be 0 if there is no task.
  * This enables deleting this mem_cgroup.
  */
-static int mem_cgroup_force_empty(struct mem_cgroup *mem)
+static int mem_cgroup_force_empty(struct mem_cgroup *mem, bool free_all)
 {
 	int ret;
 	int node, zid, shrink;
@@ -1070,12 +1070,17 @@ static int mem_cgroup_force_empty(struct
 	css_get(&mem->css);
 
 	shrink = 0;
+	/* should free all ? */
+	if (free_all)
+		goto try_to_free;
 move_account:
 	while (mem->res.usage > 0) {
 		ret = -EBUSY;
 		if (atomic_read(&mem->css.cgroup->count) > 0)
 			goto out;
-
+		ret = -EINTR;
+		if (signal_pending(current))
+			goto out;
 		/* This is for making all *used* pages to be on LRU. */
 		lru_add_drain_all();
 		ret = 0;
@@ -1110,14 +1115,24 @@ try_to_free:
 		ret = -EBUSY;
 		goto out;
 	}
+	/* we call try-to-free pages for make this cgroup empty */
+	lru_add_drain_all();
 	/* try to free all pages in this cgroup */
 	shrink = 1;
 	while (nr_retries && mem->res.usage > 0) {
 		int progress;
+
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			goto out;
+		}
 		progress = try_to_free_mem_cgroup_pages(mem,
 						  GFP_HIGHUSER_MOVABLE);
-		if (!progress)
+		if (!progress) {
 			nr_retries--;
+			/* maybe some writeback is necessary */
+			congestion_wait(WRITE, HZ/10);
+		}
 
 	}
 	/* try move_account...there may be some *locked* pages. */
@@ -1127,6 +1142,12 @@ try_to_free:
 	goto out;
 }
 
+int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
+{
+	return mem_cgroup_force_empty(mem_cgroup_from_cont(cont), true);
+}
+
+
 static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 {
 	return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
@@ -1224,6 +1245,7 @@ static int mem_control_stat_show(struct 
 	return 0;
 }
 
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -1252,6 +1274,10 @@ static struct cftype mem_cgroup_files[] 
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
+	{
+		.name = "force_empty",
+		.trigger = mem_cgroup_force_empty_write,
+	},
 };
 
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1349,7 +1375,7 @@ static void mem_cgroup_pre_destroy(struc
 					struct cgroup *cont)
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
-	mem_cgroup_force_empty(mem);
+	mem_cgroup_force_empty(mem, false);
 }
 
 static void mem_cgroup_destroy(struct cgroup_subsys *ss,
Index: mmotm-2.6.28-Nov13/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-Nov13.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-Nov13/Documentation/controllers/memory.txt
@@ -237,11 +237,30 @@ reclaimed.
 A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
 cgroup might have some charge associated with it, even though all
 tasks have migrated away from it.
-Such charges are moved to its parent as much as possible and freed if parent
-is full. Both of RSS and CACHES are moved to parent.
-If both of them are busy, rmdir() returns -EBUSY.
+Such charges are freed(at default) or moved to its parent. When moved,
+both of RSS and CACHES are moved to parent.
+If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also.
 
-5. TODO
+5. Misc. interfaces.
+
+5.1 force_empty
+  memory.force_empty interface is provided to make cgroup's memory usage empty.
+  You can use this interface only when the cgroup has no tasks.
+  When writing anything to this
+
+  # echo 0 > memory.force_empty
+
+  Almost all pages tracked by this memcg will be unmapped and freed. Some of
+  pages cannot be freed because it's locked or in-use. Such pages are moved
+  to parent and this cgroup will be empty. But this may return -EBUSY in
+  some too busy case.
+
+  Typical use case of this interface is that calling this before rmdir().
+  Because rmdir() moves all pages to parent, some out-of-use page caches can be
+  moved to the parent. If you want to avoid that, force_empty will be useful.
+
+
+6. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first


* [PATCH 4/9] memcg: handle swap caches
From: KAMEZAWA Hiroyuki @ 2008-11-14 10:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, pbadari, jblunck, taka, akpm, lizf,
	linux-kernel

SwapCache support for memory resource controller (memcg)

Before the mem+swap controller, memcg itself should handle SwapCache in a
proper way. This is cut out from that work.

In the current memcg, SwapCache is just leaked from accounting and a user
can create tons of SwapCache. This accounting leak should be handled.

SwapCache accounting is done as follows.

  charge (anon)
	- charged when it's mapped.
	  (because of readahead, charging at add_to_swap_cache() is not sane)
  uncharge (anon)
	- uncharged when it's dropped from swapcache and fully unmapped.
	  This means it's not uncharged at unmap time.
	  Note: deletion from swap cache at swap-in is done after rmap
	        information is established.
  charge (shmem)
	- charged at swap-in. This prevents charging at add_to_page_cache().

  uncharge (shmem)
	- uncharged when it's dropped from swapcache and is not on shmem's
	  radix-tree.

  At migration, the check against the 'old page' is modified to handle shmem.
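
Walking one anon page through these rules: a mapped anon page is charged;
adding it to swap cache changes nothing; fully unmapping it still leaves it
charged, because it is still in swap cache; the uncharge finally happens when
it is dropped from swap cache while unmapped. (That is my reading of the
rules above; the MEM_CGROUP_CHARGE_TYPE_SWAPOUT case in the diff implements
the last step.)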

Compared to the old version that was discussed (and caused trouble), we have
the advantages of
  - the PCG_USED bit.
  - simple migration handling.

So the situation is much easier than several months ago, maybe.


Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 Documentation/controllers/memory.txt |    5 ++
 include/linux/swap.h                 |   16 ++++++++
 mm/memcontrol.c                      |   67 +++++++++++++++++++++++++++++++----
 mm/shmem.c                           |   18 ++++++++-
 mm/swap_state.c                      |    1 
 5 files changed, 99 insertions(+), 8 deletions(-)

Index: mmotm-2.6.28-Nov13/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov13/mm/memcontrol.c
@@ -21,6 +21,7 @@
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
 #include <linux/mm.h>
+#include <linux/pagemap.h>
 #include <linux/smp.h>
 #include <linux/page-flags.h>
 #include <linux/backing-dev.h>
@@ -139,6 +140,7 @@ enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
 	MEM_CGROUP_CHARGE_TYPE_SHMEM,	/* used by page migration of shmem */
 	MEM_CGROUP_CHARGE_TYPE_FORCE,	/* used by force_empty */
+	MEM_CGROUP_CHARGE_TYPE_SWAPOUT,	/* for accounting swapcache */
 	NR_CHARGE_TYPE,
 };
 
@@ -780,6 +782,33 @@ int mem_cgroup_cache_charge(struct page 
 				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
 }
 
+#ifdef CONFIG_SWAP
+int mem_cgroup_cache_charge_swapin(struct page *page,
+			struct mm_struct *mm, gfp_t mask, bool locked)
+{
+	int ret = 0;
+
+	if (mem_cgroup_subsys.disabled)
+		return 0;
+	if (unlikely(!mm))
+		mm = &init_mm;
+	if (!locked)
+		lock_page(page);
+	/*
+	 * If not locked, the page can be dropped from SwapCache until
+	 * we reach here.
+	 */
+	if (PageSwapCache(page)) {
+		ret = mem_cgroup_charge_common(page, mm, mask,
+				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
+	}
+	if (!locked)
+		unlock_page(page);
+
+	return ret;
+}
+#endif
+
 void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
 {
 	struct page_cgroup *pc;
@@ -817,6 +846,9 @@ __mem_cgroup_uncharge_common(struct page
 	if (mem_cgroup_subsys.disabled)
 		return;
 
+	if (PageSwapCache(page))
+		return;
+
 	/*
 	 * Check if our page_cgroup is valid
 	 */
@@ -825,12 +857,26 @@ __mem_cgroup_uncharge_common(struct page
 		return;
 
 	lock_page_cgroup(pc);
-	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED && page_mapped(page))
-	     || !PageCgroupUsed(pc)) {
-		/* This happens at race in zap_pte_range() and do_swap_page()*/
-		unlock_page_cgroup(pc);
-		return;
+
+	if (!PageCgroupUsed(pc))
+		goto unlock_out;
+
+	switch (ctype) {
+	case MEM_CGROUP_CHARGE_TYPE_MAPPED:
+		if (page_mapped(page))
+			goto unlock_out;
+		break;
+	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
+		if (!PageAnon(page)) {	/* Shared memory */
+			if (page->mapping && !page_is_file_cache(page))
+				goto unlock_out;
+		} else if (page_mapped(page)) /* Anon */
+				goto unlock_out;
+		break;
+	default:
+		break;
 	}
+
 	ClearPageCgroupUsed(pc);
 	mem = pc->mem_cgroup;
 
@@ -844,6 +890,10 @@ __mem_cgroup_uncharge_common(struct page
 	css_put(&mem->css);
 
 	return;
+
+unlock_out:
+	unlock_page_cgroup(pc);
+	return;
 }
 
 void mem_cgroup_uncharge_page(struct page *page)
@@ -863,6 +913,11 @@ void mem_cgroup_uncharge_cache_page(stru
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
 }
 
+void mem_cgroup_uncharge_swapcache(struct page *page)
+{
+	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+}
+
 /*
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page belongs to.
@@ -920,7 +975,7 @@ void mem_cgroup_end_migration(struct mem
 		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
 
 	/* unused page is not on radix-tree now. */
-	if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
+	if (unused)
 		__mem_cgroup_uncharge_common(unused, ctype);
 
 	pc = lookup_page_cgroup(target);
Index: mmotm-2.6.28-Nov13/mm/swap_state.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/swap_state.c
+++ mmotm-2.6.28-Nov13/mm/swap_state.c
@@ -119,6 +119,7 @@ void __delete_from_swap_cache(struct pag
 	total_swapcache_pages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
+	mem_cgroup_uncharge_swapcache(page);
 }
 
 /**
Index: mmotm-2.6.28-Nov13/include/linux/swap.h
===================================================================
--- mmotm-2.6.28-Nov13.orig/include/linux/swap.h
+++ mmotm-2.6.28-Nov13/include/linux/swap.h
@@ -335,6 +335,22 @@ static inline void disable_swap_token(vo
 	put_swap_token(swap_token_mm);
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+extern int mem_cgroup_cache_charge_swapin(struct page *page,
+				struct mm_struct *mm, gfp_t mask, bool locked);
+extern void mem_cgroup_uncharge_swapcache(struct page *page);
+#else
+static inline
+int mem_cgroup_cache_charge_swapin(struct page *page,
+				struct mm_struct *mm, gfp_t mask, bool locked)
+{
+	return 0;
+}
+static inline void mem_cgroup_uncharge_swapcache(struct page *page)
+{
+}
+#endif
+
 #else /* CONFIG_SWAP */
 
 #define total_swap_pages			0
Index: mmotm-2.6.28-Nov13/mm/shmem.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/shmem.c
+++ mmotm-2.6.28-Nov13/mm/shmem.c
@@ -920,8 +920,12 @@ found:
 	error = 1;
 	if (!inode)
 		goto out;
-	/* Charge page using GFP_HIGHUSER_MOVABLE while we can wait */
-	error = mem_cgroup_cache_charge(page, current->mm, GFP_HIGHUSER_MOVABLE);
+	/*
+	 * Charge page using GFP_HIGHUSER_MOVABLE while we can wait.
+	 * charged back to the user(not to caller) when swap account is used.
+	 */
+	error = mem_cgroup_cache_charge_swapin(page,
+			current->mm, GFP_HIGHUSER_MOVABLE, true);
 	if (error)
 		goto out;
 	error = radix_tree_preload(GFP_KERNEL);
@@ -1258,6 +1262,16 @@ repeat:
 				goto repeat;
 			}
 			wait_on_page_locked(swappage);
+			/*
+			 * We want to avoid charge at add_to_page_cache().
+			 * charge against this swap cache here.
+			 */
+			if (mem_cgroup_cache_charge_swapin(swappage,
+						current->mm, gfp, false)) {
+				page_cache_release(swappage);
+				error = -ENOMEM;
+				goto failed;
+			}
 			page_cache_release(swappage);
 			goto repeat;
 		}
Index: mmotm-2.6.28-Nov13/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-Nov13.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-Nov13/Documentation/controllers/memory.txt
@@ -137,6 +137,11 @@ behind this approach is that a cgroup th
 page will eventually get charged for it (once it is uncharged from
 the cgroup that brought it in -- this will happen on memory pressure).
 
+Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
+be backed into memory in force, charges for pages are accounted against the
+caller of swapoff rather than the users of shmem.
+
+
 2.4 Reclaim
 
 Each cgroup maintains a per cgroup LRU that consists of an active


* [PATCH 5/9] memcg : mem+swap controller Kconfig
From: KAMEZAWA Hiroyuki @ 2008-11-14 10:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, pbadari, jblunck, taka, akpm, lizf,
	linux-kernel

Experimental.

Config and control variable for mem+swap controller.

This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
(the memory resource controller swap extension).

For accounting swap, it's obvious that we have to use additional memory to
remember "who uses swap". This adds more overhead, so it's better to offer
the choice to users.

This patch adds 2 ways to enable/disable the swap extension:
  - a CONFIG option
  - a boot option
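
Concretely (both knobs are in the hunks below):

  - build time: set CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
  - boot time:  even with the config enabled, appending "noswapaccount" to
                the kernel command line disables swap accounting.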

Changelog: v2 -> v3
 - adjusted to avoid HUNK.

Changelog: v1 -> v2
 - fixed typo.
 - make default value of "do_swap_account" to be 0 and turned on 1
   later if configured.

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 Documentation/kernel-parameters.txt |    3 +++
 include/linux/memcontrol.h          |    3 +++
 init/Kconfig                        |   17 +++++++++++++++++
 mm/memcontrol.c                     |   32 ++++++++++++++++++++++++++++++++
 4 files changed, 55 insertions(+)

Index: mmotm-2.6.28-Nov13/init/Kconfig
===================================================================
--- mmotm-2.6.28-Nov13.orig/init/Kconfig
+++ mmotm-2.6.28-Nov13/init/Kconfig
@@ -428,6 +428,23 @@ config CGROUP_MEM_RES_CTLR
 config MM_OWNER
 	bool
 
+config CGROUP_MEM_RES_CTLR_SWAP
+	bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
+	depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
+	help
+	  Add swap management feature to memory resource controller. When you
+	  enable this, you can limit mem+swap usage per cgroup. In other words,
+	  when you disable this, memory resource controller has no cares to
+	  usage of swap...a process can exhaust all of the swap. This extension
+	  is useful when you want to avoid exhaustion swap but this itself
+	  adds more overheads and consumes memory for remembering information.
+	  Especially if you use 32bit system or small memory system, please
+	  be careful about enabling this. When memory resource controller
+	  is disabled by boot option, this will be automatically disabled and
+	  there will be no overhead from this. Even when you set this config=y,
+	  if boot option "noswapaccount" is set, swap will not be accounted.
+
+
 endmenu
 
 config SYSFS_DEPRECATED
Index: mmotm-2.6.28-Nov13/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov13/mm/memcontrol.c
@@ -41,6 +41,15 @@
 struct cgroup_subsys mem_cgroup_subsys __read_mostly;
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/* Turned on only when memory cgroup is enabled && really_do_swap_account = 0 */
+int do_swap_account __read_mostly;
+static int really_do_swap_account __initdata = 1; /* for remember boot option*/
+#else
+#define do_swap_account		(0)
+#endif
+
+
 /*
  * Statistics for memory cgroup.
  */
@@ -1402,6 +1411,18 @@ static void mem_cgroup_free(struct mem_c
 }
 
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+static void __init enable_swap_cgroup(void)
+{
+	if (!mem_cgroup_subsys.disabled && really_do_swap_account)
+		do_swap_account = 1;
+}
+#else
+static void __init enable_swap_cgroup(void)
+{
+}
+#endif
+
 static struct cgroup_subsys_state *
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -1417,6 +1438,9 @@ mem_cgroup_create(struct cgroup_subsys *
 	for_each_node_state(node, N_POSSIBLE)
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
 			goto free_out;
+	/* root ? */
+	if (cont->parent == NULL)
+		enable_swap_cgroup();
 
 	return &mem->css;
 free_out:
@@ -1488,3 +1512,13 @@ struct cgroup_subsys mem_cgroup_subsys =
 	.attach = mem_cgroup_move_task,
 	.early_init = 0,
 };
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+static int __init disable_swap_account(char *s)
+{
+	really_do_swap_account = 0;
+	return 1;
+}
+__setup("noswapaccount", disable_swap_account);
+#endif
Index: mmotm-2.6.28-Nov13/Documentation/kernel-parameters.txt
===================================================================
--- mmotm-2.6.28-Nov13.orig/Documentation/kernel-parameters.txt
+++ mmotm-2.6.28-Nov13/Documentation/kernel-parameters.txt
@@ -1558,6 +1558,9 @@ and is between 256 and 4096 characters. 
 
 	nosoftlockup	[KNL] Disable the soft-lockup detector.
 
+	noswapaccount	[KNL] Disable accounting of swap in memory resource
+			controller. (See Documentation/controllers/memory.txt)
+
 	nosync		[HW,M68K] Disables sync negotiation for all devices.
 
 	notsc		[BUGS=X86-32] Disable Time Stamp Counter
Index: mmotm-2.6.28-Nov13/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Nov13.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Nov13/include/linux/memcontrol.h
@@ -77,6 +77,9 @@ extern void mem_cgroup_record_reclaim_pr
 extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
 					int priority, enum lru_list lru);
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern int do_swap_account;
+#endif
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;


* [PATCH 6/9] memcg : swap cgroup for remembering usage
From: KAMEZAWA Hiroyuki @ 2008-11-14 10:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, pbadari, jblunck, taka, akpm, lizf,
	linux-kernel

For accounting swap, we need at least a record per swap entry.

This patch adds the following functions.
  - swap_cgroup_swapon() .... called from swapon.
  - swap_cgroup_swapoff() ... called at the end of swapoff.

  - swap_cgroup_record() .... record information for a swap entry.
  - swap_cgroup_lookup() .... look up information for a swap entry.

This patch just implements "how to record information"; there is no actual
method to limit the usage of swap yet. These routines use a flat table for
record and lookup. A "wise" lookup structure like a radix-tree requires
memory allocation when inserting new records, but swap-out is usually called
under memory shortage (or when memcg hits its limit).
So, I used static allocation. (Maybe dynamic allocation is not very hard,
but it adds memory allocation to the memory-shortage path.)


Note1: Here we use a pointer to record the information, and this means
       8 bytes per swap entry. I think we can reduce this once we create an
       "id of cgroup" in the range of 0-65535 or 0-255.

Note2: The array of swap_cgroup pages is allocated from HIGHMEM; maybe good
       for x86-32.
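
To put Note1 in numbers: at 8 bytes per entry, a 4GB swap device with 4KB
pages has 1,048,576 entries and needs about 8MB of swap_cgroup pages, while
a 2-byte cgroup id would cut that to about 2MB. (The device size is just an
illustration.)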

Changelog: v2 -> v3
 - fixed typo

Changelog: v1 -> v2
 - fixed bug in swapoff.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/page_cgroup.h |   35 +++++++
 mm/page_cgroup.c            |  201 ++++++++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c               |    8 +
 3 files changed, 244 insertions(+)

Index: mmotm-2.6.28-Nov13/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-Nov13/mm/page_cgroup.c
@@ -8,6 +8,8 @@
 #include <linux/memory.h>
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
+#include <linux/swapops.h>
+#include <linux/highmem.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -266,3 +268,202 @@ void __init pgdat_page_cgroup_init(struc
 }
 
 #endif
+
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+DEFINE_MUTEX(swap_cgroup_mutex);
+struct swap_cgroup_ctrl {
+	spinlock_t lock;
+	struct page **map;
+	unsigned long length;
+};
+
+struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
+
+/*
+ * This 8bytes seems big..maybe we can reduce this when we can use "id" for
+ * cgroup rather than pointer.
+ */
+struct swap_cgroup {
+	struct mem_cgroup	*val;
+};
+#define SC_PER_PAGE	(PAGE_SIZE/sizeof(struct swap_cgroup))
+#define SC_POS_MASK	(SC_PER_PAGE - 1)
+
+/*
+ * allocate buffer for swap_cgroup.
+ */
+static int swap_cgroup_prepare(int type)
+{
+	struct page *page;
+	struct swap_cgroup_ctrl *ctrl;
+	unsigned long idx, max;
+
+	if (!do_swap_account)
+		return 0;
+	ctrl = &swap_cgroup_ctrl[type];
+
+	for (idx = 0; idx < ctrl->length; idx++) {
+		page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
+		if (!page)
+			goto not_enough_page;
+		ctrl->map[idx] = page;
+	}
+	return 0;
+not_enough_page:
+	max = idx;
+	for (idx = 0; idx < max; idx++)
+		__free_page(ctrl->map[idx]);
+
+	return -ENOMEM;
+}
+
+/**
+ * swap_cgroup_record - record mem_cgroup for this swp_entry.
+ * @ent: swap entry to be recorded into
+ * @mem: mem_cgroup to be recorded
+ *
+ * Returns old value at success, NULL at failure.
+ * (Of course, old value can be NULL.)
+ */
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+	unsigned long flags;
+	int type = swp_type(ent);
+	unsigned long offset = swp_offset(ent);
+	unsigned long idx = offset / SC_PER_PAGE;
+	unsigned long pos = offset & SC_POS_MASK;
+	struct swap_cgroup_ctrl *ctrl;
+	struct page *mappage;
+	struct swap_cgroup *sc;
+	struct mem_cgroup *old;
+
+	if (!do_swap_account)
+		return NULL;
+
+	ctrl = &swap_cgroup_ctrl[type];
+
+	mappage = ctrl->map[idx];
+	spin_lock_irqsave(&ctrl->lock, flags);
+	sc = kmap_atomic(mappage, KM_USER0);
+	sc += pos;
+	old = sc->val;
+	sc->val = mem;
+	kunmap_atomic(mappage, KM_USER0);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+	return old;
+}
+
+/**
+ * lookup_swap_cgroup - lookup mem_cgroup tied to swap entry
+ * @ent: swap entry to be looked up.
+ *
+ * Returns pointer to mem_cgroup at success. NULL at failure.
+ */
+struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+	int type = swp_type(ent);
+	unsigned long flags;
+	unsigned long offset = swp_offset(ent);
+	unsigned long idx = offset / SC_PER_PAGE;
+	unsigned long pos = offset & SC_POS_MASK;
+	struct swap_cgroup_ctrl *ctrl;
+	struct page *mappage;
+	struct swap_cgroup *sc;
+	struct mem_cgroup *ret;
+
+	if (!do_swap_account)
+		return NULL;
+
+	ctrl = &swap_cgroup_ctrl[type];
+
+	mappage = ctrl->map[idx];
+
+	spin_lock_irqsave(&ctrl->lock, flags);
+	sc = kmap_atomic(mappage, KM_USER0);
+	sc += pos;
+	ret = sc->val;
+	kunmap_atomic(mappage, KM_USER0);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+	return ret;
+}
+
+int swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+	void *array;
+	unsigned long array_size;
+	unsigned long length;
+	struct swap_cgroup_ctrl *ctrl;
+
+	if (!do_swap_account)
+		return 0;
+
+	length = ((max_pages/SC_PER_PAGE) + 1);
+	array_size = length * sizeof(void *);
+
+	array = vmalloc(array_size);
+	if (!array)
+		goto nomem;
+
+	memset(array, 0, array_size);
+	ctrl = &swap_cgroup_ctrl[type];
+	mutex_lock(&swap_cgroup_mutex);
+	ctrl->length = length;
+	ctrl->map = array;
+	if (swap_cgroup_prepare(type)) {
+		/* memory shortage */
+		ctrl->map = NULL;
+		ctrl->length = 0;
+		vfree(array);
+		mutex_unlock(&swap_cgroup_mutex);
+		goto nomem;
+	}
+	mutex_unlock(&swap_cgroup_mutex);
+
+	printk(KERN_INFO
+		"swap_cgroup: uses %ld bytes vmalloc and %ld bytes buffres\n",
+		array_size, length * PAGE_SIZE);
+	printk(KERN_INFO
+	"swap_cgroup can be disabled by noswapaccount boot option.\n");
+
+	return 0;
+nomem:
+	printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
+	printk(KERN_INFO
+		"swap_cgroup can be disabled by noswapaccount boot option\n");
+	return -ENOMEM;
+}
+
+void swap_cgroup_swapoff(int type)
+{
+	int i;
+	struct swap_cgroup_ctrl *ctrl;
+
+	if (!do_swap_account)
+		return;
+
+	mutex_lock(&swap_cgroup_mutex);
+	ctrl = &swap_cgroup_ctrl[type];
+	if (ctrl->map) {
+		for (i = 0; i < ctrl->length; i++) {
+			struct page *page = ctrl->map[i];
+			if (page)
+				__free_page(page);
+		}
+		vfree(ctrl->map);
+		ctrl->map = NULL;
+		ctrl->length = 0;
+	}
+	mutex_unlock(&swap_cgroup_mutex);
+}
+
+static int __init swap_cgroup_init(void)
+{
+	int i;
+	for (i = 0; i < MAX_SWAPFILES; i++)
+		spin_lock_init(&swap_cgroup_ctrl[i].lock);
+	return 0;
+}
+late_initcall(swap_cgroup_init);
+#endif
Index: mmotm-2.6.28-Nov13/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.28-Nov13.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.28-Nov13/include/linux/page_cgroup.h
@@ -105,4 +105,39 @@ static inline void page_cgroup_init(void
 }
 
 #endif
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#include <linux/swap.h>
+extern struct mem_cgroup *
+swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem);
+extern struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent);
+extern int swap_cgroup_swapon(int type, unsigned long max_pages);
+extern void swap_cgroup_swapoff(int type);
+#else
+#include <linux/swap.h>
+
+static inline
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+	return NULL;
+}
+
+static inline
+struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+	return NULL;
+}
+
+static inline int
+swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+	return 0;
+}
+
+static inline void swap_cgroup_swapoff(int type)
+{
+	return;
+}
+
+#endif
 #endif
Index: mmotm-2.6.28-Nov13/mm/swapfile.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/swapfile.c
+++ mmotm-2.6.28-Nov13/mm/swapfile.c
@@ -32,6 +32,7 @@
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 #include <linux/swapops.h>
+#include <linux/page_cgroup.h>
 
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -1345,6 +1346,9 @@ asmlinkage long sys_swapoff(const char _
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	vfree(swap_map);
+	/* Destroy swap account informatin */
+	swap_cgroup_swapoff(type);
+
 	inode = mapping->host;
 	if (S_ISBLK(inode->i_mode)) {
 		struct block_device *bdev = I_BDEV(inode);
@@ -1669,6 +1673,10 @@ asmlinkage long sys_swapon(const char __
 		nr_good_pages = swap_header->info.last_page -
 				swap_header->info.nr_badpages -
 				1 /* header page */;
+
+		if (!error)
+			error = swap_cgroup_swapon(type, maxpages);
+
 		if (error)
 			goto bad_swap;
 	}


* [PATCH 7/9] memcg : mem+swap controller core
From: KAMEZAWA Hiroyuki @ 2008-11-14 10:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, pbadari, jblunck, taka, akpm, lizf,
	linux-kernel

Mem+Swap controller core.

This patch implements a per-cgroup limit on the usage of memory+swap.
Although SwapCache exists, double counting of swap-cache and swap-entry is
avoided.

The mem+swap controller works as follows (an example follows the list).
  - memory usage is limited by memory.limit_in_bytes.
  - memory + swap usage is limited by memory.memsw_limit_in_bytes.
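
For example (file name as described above; the sizes are arbitrary):

  # echo 200M > memory.limit_in_bytes
  # echo 300M > memory.memsw_limit_in_bytes

The memsw limit may not be set below the memory limit; the resize handlers
in the diff below return -EINVAL in that case.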


This has the following benefits.
  - A user can limit the total resource usage of mem+swap.

    Without this, because the memory resource controller doesn't take care
    of swap usage, a process can exhaust all of the swap (by a memory leak).
    We can avoid that case.

    Also, swap is a shared resource, but a swap slot cannot be reclaimed
    (moved back to memory) until it is used again. This characteristic can
    be trouble when memory is divided into parts by cpuset or memcg.
    Assume groups A and B. After some applications run, the system can be:

    Group A -- very large free memory space, but occupying 99% of swap.
    Group B -- under memory shortage, but unable to use swap... it's
    nearly full.

    The ability to set an appropriate swap limit for each group is required.
      
Maybe someone wonders "why mem+swap rather than just swap?"

  - The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
    moving the account from memory to swap... there is no change in the
    usage of mem+swap.

    In other words, when we want to limit the usage of swap without
    affecting the global LRU, a mem+swap limit is better than just limiting
    swap.


The accounting target information is stored in swap_cgroup, which is a
per-swap-entry record.

Charging is done as follows (one page's lifecycle is traced after the list).
  map
    - charge page and memsw.

  unmap
    - uncharge page/memsw if not SwapCache.

  swap-out (__delete_from_swap_cache)
    - uncharge page.
    - record mem_cgroup information into swap_cgroup.

  swap-in (do_swap_page)
    - charged as page and memsw.
      The record in swap_cgroup is cleared and the memsw accounting is
      decremented.

  swap-free (swap_free())
    - when the swap entry is freed, memsw is uncharged by PAGE_SIZE.
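
Tracing one anon page: when mapped, res and memsw each hold one page.
Swap-out drops res by one page but leaves memsw as-is, with the owner
recorded in swap_cgroup. At swap-in, res and memsw are charged again, then
the swap_cgroup record is cleared and memsw is uncharged once, so memsw ends
up at one page rather than two. When the swap entry itself is finally freed,
memsw drops as well. (This is my reading of the rules above and of
mem_cgroup_commit_charge_swapin() in the diff.)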


There are people who work in never-swap environments and consider swap to
be something bad. For such people, this mem+swap controller extension is
just overhead. That overhead is avoided by a config or boot option.
(See Kconfig; the details are not in this patch.)

TODO:
 - Maybe more optimization can be done in the swap-in path (but it's not
   very safe). We just do simple accounting at this stage.

Changelog: v2 -> v3
 - create memsw.* file only when do_swap_account==1.
 - fixed cancel_charge_swapin().
 - swap_on_disk in stat file is dropped.
 - swapref is renamed to be refcnt. (so, this can be used for general use.)
 - fixed resize_limit() to check that new limit don't exceed memsw.limit

Changelog: v1 -> v2
 - fixed typos
 - fixed migration of anon pages.
 - fixed uncharge to check USED bit always.
 - code for swapcache is moved to another patch.
 - added "noswap" argument to try_to_free_mem_cgroup_pages
 - fixed lock_page around mem_cgroup_charge_cache_swap()
 - fixed failcnt file.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


 Documentation/controllers/memory.txt |   29 ++
 include/linux/memcontrol.h           |   11 -
 include/linux/swap.h                 |   14 +
 mm/memcontrol.c                      |  370 +++++++++++++++++++++++++++++++----
 mm/memory.c                          |    3 
 mm/swap_state.c                      |    5 
 mm/swapfile.c                        |   11 -
 mm/vmscan.c                          |    6 
 8 files changed, 402 insertions(+), 47 deletions(-)

Index: mmotm-2.6.28-Nov13/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov13/mm/memcontrol.c
@@ -132,12 +132,18 @@ struct mem_cgroup {
 	 */
 	struct res_counter res;
 	/*
+	 * the counter to account for mem+swap usage.
+	 */
+	struct res_counter memsw;
+	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
 	 */
 	struct mem_cgroup_lru_info info;
 
 	int	prev_priority;	/* for recording reclaim priority */
+	int		obsolete;
+	atomic_t	refcnt;
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
@@ -167,6 +173,17 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
 	0, /* FORCE */
 };
 
+
+/* for encoding cft->private value on file */
+#define _MEM			(0)
+#define _MEMSWAP		(1)
+#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
+#define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
+#define MEMFILE_ATTR(val)	((val) & 0xffff)
+
+static void mem_cgroup_get(struct mem_cgroup *mem);
+static void mem_cgroup_put(struct mem_cgroup *mem);
+
 /*
  * Always modified under lru lock. Then, not necessary to preempt_disable()
  */
@@ -485,7 +502,8 @@ unsigned long mem_cgroup_isolate_pages(u
  * oom-killer can be invoked.
  */
 static int __mem_cgroup_try_charge(struct mm_struct *mm,
-			gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
+			gfp_t gfp_mask, struct mem_cgroup **memcg,
+			bool oom)
 {
 	struct mem_cgroup *mem;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -513,12 +531,25 @@ static int __mem_cgroup_try_charge(struc
 		css_get(&mem->css);
 	}
 
+	while (1) {
+		int ret;
+		bool noswap = false;
 
-	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
+		ret = res_counter_charge(&mem->res, PAGE_SIZE);
+		if (likely(!ret)) {
+			if (!do_swap_account)
+				break;
+			ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
+			if (likely(!ret))
+				break;
+			/* mem+swap counter fails */
+			res_counter_uncharge(&mem->res, PAGE_SIZE);
+			noswap = true;
+		}
 		if (!(gfp_mask & __GFP_WAIT))
 			goto nomem;
 
-		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
+		if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
 			continue;
 
 		/*
@@ -527,8 +558,13 @@ static int __mem_cgroup_try_charge(struc
 		 * moved to swap cache or just unmapped from the cgroup.
 		 * Check the limit again to see if the reclaim reduced the
 		 * current usage of the cgroup before giving up
+		 *
 		 */
-		if (res_counter_check_under_limit(&mem->res))
+		if (!do_swap_account &&
+			res_counter_check_under_limit(&mem->res))
+			continue;
+		if (do_swap_account &&
+			res_counter_check_under_limit(&mem->memsw))
 			continue;
 
 		if (!nr_retries--) {
@@ -582,6 +618,8 @@ static void __mem_cgroup_commit_charge(s
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
 		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		if (do_swap_account)
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
 		css_put(&mem->css);
 		return;
 	}
@@ -646,6 +684,8 @@ static int mem_cgroup_move_account(struc
 		__mem_cgroup_remove_list(from_mz, pc);
 		css_put(&from->css);
 		res_counter_uncharge(&from->res, PAGE_SIZE);
+		if (do_swap_account)
+			res_counter_uncharge(&from->memsw, PAGE_SIZE);
 		pc->mem_cgroup = to;
 		css_get(&to->css);
 		__mem_cgroup_add_list(to_mz, pc, false);
@@ -692,8 +732,11 @@ static int mem_cgroup_move_parent(struct
 	/* drop extra refcnt */
 	css_put(&parent->css);
 	/* uncharge if move fails */
-	if (ret)
+	if (ret) {
 		res_counter_uncharge(&parent->res, PAGE_SIZE);
+		if (do_swap_account)
+			res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+	}
 
 	return ret;
 }
@@ -791,7 +834,34 @@ int mem_cgroup_cache_charge(struct page 
 				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
 }
 
+int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+				 struct page *page,
+				 gfp_t mask, struct mem_cgroup **ptr)
+{
+	struct mem_cgroup *mem;
+	swp_entry_t     ent;
+
+	if (mem_cgroup_subsys.disabled)
+		return 0;
+
+	if (!do_swap_account)
+		goto charge_cur_mm;
+
+	ent.val = page_private(page);
+
+	mem = lookup_swap_cgroup(ent);
+	if (!mem || mem->obsolete)
+		goto charge_cur_mm;
+	*ptr = mem;
+	return __mem_cgroup_try_charge(NULL, mask, ptr, true);
+charge_cur_mm:
+	if (unlikely(!mm))
+		mm = &init_mm;
+	return __mem_cgroup_try_charge(mm, mask, ptr, true);
+}
+
 #ifdef CONFIG_SWAP
+
 int mem_cgroup_cache_charge_swapin(struct page *page,
 			struct mm_struct *mm, gfp_t mask, bool locked)
 {
@@ -808,8 +878,28 @@ int mem_cgroup_cache_charge_swapin(struc
 	 * we reach here.
 	 */
 	if (PageSwapCache(page)) {
+		struct mem_cgroup *mem = NULL;
+		swp_entry_t ent;
+
+		ent.val = page_private(page);
+		if (do_swap_account) {
+			mem = lookup_swap_cgroup(ent);
+			if (mem && mem->obsolete)
+				mem = NULL;
+			if (mem)
+				mm = NULL;
+		}
 		ret = mem_cgroup_charge_common(page, mm, mask,
-				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
+				MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
+
+		if (!ret && do_swap_account) {
+			/* avoid double counting */
+			mem = swap_cgroup_record(ent, NULL);
+			if (mem) {
+				res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+				mem_cgroup_put(mem);
+			}
+		}
 	}
 	if (!locked)
 		unlock_page(page);
@@ -828,6 +918,23 @@ void mem_cgroup_commit_charge_swapin(str
 		return;
 	pc = lookup_page_cgroup(page);
 	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
+	/*
+	 * Now swap is on-memory. This means this page may be
+	 * counted both as mem and swap....double count.
+	 * Fix it by uncharging from memsw. This SwapCache is stable
+	 * because we're still under lock_page().
+	 */
+	if (do_swap_account) {
+		swp_entry_t ent = {.val = page_private(page)};
+		struct mem_cgroup *memcg;
+		memcg = swap_cgroup_record(ent, NULL);
+		if (memcg) {
+			/* If memcg is obsolete, memcg can be != ptr */
+			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+			mem_cgroup_put(memcg);
+		}
+
+	}
 }
 
 void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
@@ -837,6 +944,8 @@ void mem_cgroup_cancel_charge_swapin(str
 	if (!mem)
 		return;
 	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	if (do_swap_account)
+		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
 	css_put(&mem->css);
 }
 
@@ -844,29 +953,31 @@ void mem_cgroup_cancel_charge_swapin(str
 /*
  * uncharge if !page_mapped(page)
  */
-static void
+static struct mem_cgroup *
 __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 {
 	struct page_cgroup *pc;
-	struct mem_cgroup *mem;
+	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
 	unsigned long flags;
 
 	if (mem_cgroup_subsys.disabled)
-		return;
+		return NULL;
 
 	if (PageSwapCache(page))
-		return;
+		return NULL;
 
 	/*
 	 * Check if our page_cgroup is valid
 	 */
 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc || !PageCgroupUsed(pc)))
-		return;
+		return NULL;
 
 	lock_page_cgroup(pc);
 
+	mem = pc->mem_cgroup;
+
 	if (!PageCgroupUsed(pc))
 		goto unlock_out;
 
@@ -886,8 +997,11 @@ __mem_cgroup_uncharge_common(struct page
 		break;
 	}
 
+	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
+		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+
 	ClearPageCgroupUsed(pc);
-	mem = pc->mem_cgroup;
 
 	mz = page_cgroup_zoneinfo(pc);
 	spin_lock_irqsave(&mz->lru_lock, flags);
@@ -895,14 +1009,13 @@ __mem_cgroup_uncharge_common(struct page
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 	unlock_page_cgroup(pc);
 
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
 	css_put(&mem->css);
 
-	return;
+	return mem;
 
 unlock_out:
 	unlock_page_cgroup(pc);
-	return;
+	return NULL;
 }
 
 void mem_cgroup_uncharge_page(struct page *page)
@@ -922,10 +1035,42 @@ void mem_cgroup_uncharge_cache_page(stru
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
 }
 
-void mem_cgroup_uncharge_swapcache(struct page *page)
+/*
+ * called from __delete_from_swap_cache() and drop "page" account.
+ * memcg information is recorded to swap_cgroup of "ent"
+ */
+void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
 {
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+	struct mem_cgroup *memcg;
+
+	memcg = __mem_cgroup_uncharge_common(page,
+					MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+	/* record memcg information */
+	if (do_swap_account && memcg) {
+		swap_cgroup_record(ent, memcg);
+		mem_cgroup_get(memcg);
+	}
+}
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/*
+ * called from swap_entry_free(). remove record in swap_cgroup and
+ * uncharge "memsw" account.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t ent)
+{
+	struct mem_cgroup *memcg;
+
+	if (!do_swap_account)
+		return;
+
+	memcg = swap_cgroup_record(ent, NULL);
+	if (memcg) {
+		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		mem_cgroup_put(memcg);
+	}
 }
+#endif
 
 /*
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
@@ -1034,7 +1179,7 @@ int mem_cgroup_shrink_usage(struct mm_st
 	rcu_read_unlock();
 
 	do {
-		progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
+		progress = try_to_free_mem_cgroup_pages(mem, gfp_mask, true);
 		progress += res_counter_check_under_limit(&mem->res);
 	} while (!progress && --retry);
 
@@ -1051,6 +1196,11 @@ int mem_cgroup_resize_limit(struct mem_c
 	int progress;
 	int ret = 0;
 
+	if (do_swap_account) {
+		if (val > memcg->memsw.limit)
+			return -EINVAL;
+	}
+
 	while (res_counter_set_limit(&memcg->res, val)) {
 		if (signal_pending(current)) {
 			ret = -EINTR;
@@ -1061,13 +1211,55 @@ int mem_cgroup_resize_limit(struct mem_c
 			break;
 		}
 		progress = try_to_free_mem_cgroup_pages(memcg,
-				GFP_HIGHUSER_MOVABLE);
+				GFP_HIGHUSER_MOVABLE, false);
 		if (!progress)
 			retry_count--;
 	}
 	return ret;
 }
 
+int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
+				unsigned long long val)
+{
+	int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
+	unsigned long flags;
+	u64 memlimit, oldusage, curusage;
+	int ret;
+
+	if (!do_swap_account)
+		return -EINVAL;
+
+	while (retry_count) {
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+		/*
+		 * Rather than hide all in some function, I do this in
+		 * open coded manner. You see what this really does.
+		 * We have to guarantee mem->res.limit < mem->memsw.limit.
+		 */
+		spin_lock_irqsave(&memcg->res.lock, flags);
+		memlimit = memcg->res.limit;
+		if (memlimit > val) {
+			spin_unlock_irqrestore(&memcg->res.lock, flags);
+			ret = -EINVAL;
+			break;
+		}
+		ret = res_counter_set_limit(&memcg->memsw, val);
+		oldusage = memcg->memsw.usage;
+		spin_unlock_irqrestore(&memcg->res.lock, flags);
+
+		if (!ret)
+			break;
+		try_to_free_mem_cgroup_pages(memcg, GFP_HIGHUSER_MOVABLE, true);
+		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+		if (curusage >= oldusage)
+			retry_count--;
+	}
+	return ret;
+}
+
 
 /*
  * This routine traverse page_cgroup in given list and drop them all.
@@ -1191,7 +1383,7 @@ try_to_free:
 			goto out;
 		}
 		progress = try_to_free_mem_cgroup_pages(mem,
-						  GFP_HIGHUSER_MOVABLE);
+						  GFP_HIGHUSER_MOVABLE, false);
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
@@ -1214,8 +1406,25 @@ int mem_cgroup_force_empty_write(struct 
 
 static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 {
-	return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
-				    cft->private);
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	u64 val = 0;
+	int type, name;
+
+	type = MEMFILE_TYPE(cft->private);
+	name = MEMFILE_ATTR(cft->private);
+	switch (type) {
+	case _MEM:
+		val = res_counter_read_u64(&mem->res, name);
+		break;
+	case _MEMSWAP:
+		if (do_swap_account)
+			val = res_counter_read_u64(&mem->memsw, name);
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return val;
 }
 /*
  * The user of this function is...
@@ -1225,15 +1434,22 @@ static int mem_cgroup_write(struct cgrou
 			    const char *buffer)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	int type, name;
 	unsigned long long val;
 	int ret;
 
-	switch (cft->private) {
+	type = MEMFILE_TYPE(cft->private);
+	name = MEMFILE_ATTR(cft->private);
+	switch (name) {
 	case RES_LIMIT:
 		/* This function does all necessary parse...reuse it */
 		ret = res_counter_memparse_write_strategy(buffer, &val);
-		if (!ret)
+		if (ret)
+			break;
+		if (type == _MEM)
 			ret = mem_cgroup_resize_limit(memcg, val);
+		else
+			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
@@ -1245,14 +1461,23 @@ static int mem_cgroup_write(struct cgrou
 static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 {
 	struct mem_cgroup *mem;
+	int type, name;
 
 	mem = mem_cgroup_from_cont(cont);
-	switch (event) {
+	type = MEMFILE_TYPE(event);
+	name = MEMFILE_ATTR(event);
+	switch (name) {
 	case RES_MAX_USAGE:
-		res_counter_reset_max(&mem->res);
+		if (type == _MEM)
+			res_counter_reset_max(&mem->res);
+		else
+			res_counter_reset_max(&mem->memsw);
 		break;
 	case RES_FAILCNT:
-		res_counter_reset_failcnt(&mem->res);
+		if (type == _MEM)
+			res_counter_reset_failcnt(&mem->res);
+		else
+			res_counter_reset_failcnt(&mem->memsw);
 		break;
 	}
 	return 0;
@@ -1313,24 +1538,24 @@ static int mem_control_stat_show(struct 
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
-		.private = RES_USAGE,
+		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
 		.read_u64 = mem_cgroup_read,
 	},
 	{
 		.name = "max_usage_in_bytes",
-		.private = RES_MAX_USAGE,
+		.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
 		.trigger = mem_cgroup_reset,
 		.read_u64 = mem_cgroup_read,
 	},
 	{
 		.name = "limit_in_bytes",
-		.private = RES_LIMIT,
+		.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
 		.write_string = mem_cgroup_write,
 		.read_u64 = mem_cgroup_read,
 	},
 	{
 		.name = "failcnt",
-		.private = RES_FAILCNT,
+		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
 		.trigger = mem_cgroup_reset,
 		.read_u64 = mem_cgroup_read,
 	},
@@ -1344,6 +1569,47 @@ static struct cftype mem_cgroup_files[] 
 	},
 };
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+static struct cftype memsw_cgroup_files[] = {
+	{
+		.name = "memsw.usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
+		.read_u64 = mem_cgroup_read,
+	},
+	{
+		.name = "memsw.max_usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
+		.trigger = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read,
+	},
+	{
+		.name = "memsw.limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read,
+	},
+	{
+		.name = "memsw.failcnt",
+		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
+		.trigger = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read,
+	},
+};
+
+static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
+{
+	if (!do_swap_account)
+		return 0;
+	return cgroup_add_files(cont, ss, memsw_cgroup_files,
+				ARRAY_SIZE(memsw_cgroup_files));
+}
+#else
+static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
+{
+	return 0;
+}
+#endif
+
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 {
 	struct mem_cgroup_per_node *pn;
@@ -1402,14 +1668,44 @@ static struct mem_cgroup *mem_cgroup_all
 	return mem;
 }
 
+/*
+ * When a mem_cgroup is destroyed, references from swap_cgroup can remain.
+ * (Scanning them all at force_empty would be too costly...)
+ *
+ * Instead of clearing all references at force_empty, we remember the
+ * number of references from swap_cgroup and free the mem_cgroup when
+ * it drops to 0.
+ *
+ * When the mem_cgroup is destroyed, mem->obsolete is set to 1 and swap
+ * entries which point to this memcg will be ignored at swap-in.
+ *
+ * Removal of cgroup itself succeeds regardless of refs from swap.
+ */
+
 static void mem_cgroup_free(struct mem_cgroup *mem)
 {
+	if (atomic_read(&mem->refcnt) > 0)
+		return;
 	if (mem_cgroup_size() < PAGE_SIZE)
 		kfree(mem);
 	else
 		vfree(mem);
 }
 
+static void mem_cgroup_get(struct mem_cgroup *mem)
+{
+	atomic_inc(&mem->refcnt);
+}
+
+static void mem_cgroup_put(struct mem_cgroup *mem)
+{
+	if (atomic_dec_and_test(&mem->refcnt)) {
+		if (!mem->obsolete)
+			return;
+		mem_cgroup_free(mem);
+	}
+}
+
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 static void __init enable_swap_cgroup(void)
@@ -1434,6 +1730,7 @@ mem_cgroup_create(struct cgroup_subsys *
 		return ERR_PTR(-ENOMEM);
 
 	res_counter_init(&mem->res);
+	res_counter_init(&mem->memsw);
 
 	for_each_node_state(node, N_POSSIBLE)
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
@@ -1454,6 +1751,7 @@ static void mem_cgroup_pre_destroy(struc
 					struct cgroup *cont)
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	mem->obsolete = 1;
 	mem_cgroup_force_empty(mem, false);
 }
 
@@ -1472,8 +1770,14 @@ static void mem_cgroup_destroy(struct cg
 static int mem_cgroup_populate(struct cgroup_subsys *ss,
 				struct cgroup *cont)
 {
-	return cgroup_add_files(cont, ss, mem_cgroup_files,
-					ARRAY_SIZE(mem_cgroup_files));
+	int ret;
+
+	ret = cgroup_add_files(cont, ss, mem_cgroup_files,
+				ARRAY_SIZE(mem_cgroup_files));
+
+	if (!ret)
+		ret = register_memsw_files(cont, ss);
+	return ret;
 }
 
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
Index: mmotm-2.6.28-Nov13/mm/swapfile.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/swapfile.c
+++ mmotm-2.6.28-Nov13/mm/swapfile.c
@@ -271,8 +271,9 @@ out:
 	return NULL;
 }	
 
-static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
+static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
 {
+	unsigned long offset = swp_offset(ent);
 	int count = p->swap_map[offset];
 
 	if (count < SWAP_MAP_MAX) {
@@ -287,6 +288,7 @@ static int swap_entry_free(struct swap_i
 				swap_list.next = p - swap_info;
 			nr_swap_pages++;
 			p->inuse_pages--;
+			mem_cgroup_uncharge_swap(ent);
 		}
 	}
 	return count;
@@ -302,7 +304,7 @@ void swap_free(swp_entry_t entry)
 
 	p = swap_info_get(entry);
 	if (p) {
-		swap_entry_free(p, swp_offset(entry));
+		swap_entry_free(p, entry);
 		spin_unlock(&swap_lock);
 	}
 }
@@ -421,7 +423,7 @@ void free_swap_and_cache(swp_entry_t ent
 
 	p = swap_info_get(entry);
 	if (p) {
-		if (swap_entry_free(p, swp_offset(entry)) == 1) {
+		if (swap_entry_free(p, entry) == 1) {
 			page = find_get_page(&swapper_space, entry.val);
 			if (page && !trylock_page(page)) {
 				page_cache_release(page);
@@ -536,7 +538,8 @@ static int unuse_pte(struct vm_area_stru
 	pte_t *pte;
 	int ret = 1;
 
-	if (mem_cgroup_try_charge(vma->vm_mm, GFP_HIGHUSER_MOVABLE, &ptr))
+	if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
+					GFP_HIGHUSER_MOVABLE, &ptr))
 		ret = -ENOMEM;
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
Index: mmotm-2.6.28-Nov13/mm/swap_state.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/swap_state.c
+++ mmotm-2.6.28-Nov13/mm/swap_state.c
@@ -17,6 +17,7 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
+#include <linux/page_cgroup.h>
 
 #include <asm/pgtable.h>
 
@@ -108,6 +109,8 @@ int add_to_swap_cache(struct page *page,
  */
 void __delete_from_swap_cache(struct page *page)
 {
+	swp_entry_t ent = {.val = page_private(page)};
+
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!PageSwapCache(page));
 	BUG_ON(PageWriteback(page));
@@ -119,7 +122,7 @@ void __delete_from_swap_cache(struct pag
 	total_swapcache_pages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
-	mem_cgroup_uncharge_swapcache(page);
+	mem_cgroup_uncharge_swapcache(page, ent);
 }
 
 /**
Index: mmotm-2.6.28-Nov13/include/linux/swap.h
===================================================================
--- mmotm-2.6.28-Nov13.orig/include/linux/swap.h
+++ mmotm-2.6.28-Nov13/include/linux/swap.h
@@ -213,7 +213,7 @@ static inline void lru_cache_add_active_
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
-							gfp_t gfp_mask);
+						gfp_t gfp_mask, bool noswap);
 extern int __isolate_lru_page(struct page *page, int mode, int file);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
@@ -338,7 +338,7 @@ static inline void disable_swap_token(vo
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 extern int mem_cgroup_cache_charge_swapin(struct page *page,
 				struct mm_struct *mm, gfp_t mask, bool locked);
-extern void mem_cgroup_uncharge_swapcache(struct page *page);
+extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
 #else
 static inline
 int mem_cgroup_cache_charge_swapin(struct page *page,
@@ -346,7 +346,15 @@ int mem_cgroup_cache_charge_swapin(struc
 {
 	return 0;
 }
-static inline void mem_cgroup_uncharge_swapcache(struct page *page)
+static inline void
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+{
+}
+#endif
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+#else
+static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
 {
 }
 #endif
Index: mmotm-2.6.28-Nov13/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Nov13.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Nov13/include/linux/memcontrol.h
@@ -32,6 +32,8 @@ extern int mem_cgroup_newpage_charge(str
 /* for swap handling */
 extern int mem_cgroup_try_charge(struct mm_struct *mm,
 		gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+		struct page *page, gfp_t mask, struct mem_cgroup **ptr);
 extern void mem_cgroup_commit_charge_swapin(struct page *page,
 					struct mem_cgroup *ptr);
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
@@ -80,7 +82,6 @@ extern long mem_cgroup_calc_reclaim(stru
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern int do_swap_account;
 #endif
-
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -97,7 +98,13 @@ static inline int mem_cgroup_cache_charg
 }
 
 static inline int mem_cgroup_try_charge(struct mm_struct *mm,
-				gfp_t gfp_mask, struct mem_cgroup **ptr)
+			gfp_t gfp_mask, struct mem_cgroup **ptr)
+{
+	return 0;
+}
+
+static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+		struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
 	return 0;
 }
Index: mmotm-2.6.28-Nov13/mm/memory.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/memory.c
+++ mmotm-2.6.28-Nov13/mm/memory.c
@@ -2325,7 +2325,8 @@ static int do_swap_page(struct mm_struct
 	lock_page(page);
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 
-	if (mem_cgroup_try_charge(mm, GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
+	if (mem_cgroup_try_charge_swapin(mm, page,
+				GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
 		ret = VM_FAULT_OOM;
 		unlock_page(page);
 		goto out;
Index: mmotm-2.6.28-Nov13/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/vmscan.c
+++ mmotm-2.6.28-Nov13/mm/vmscan.c
@@ -1718,7 +1718,8 @@ unsigned long try_to_free_pages(struct z
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
-						gfp_t gfp_mask)
+						gfp_t gfp_mask,
+					   bool noswap)
 {
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
@@ -1731,6 +1732,9 @@ unsigned long try_to_free_mem_cgroup_pag
 	};
 	struct zonelist *zonelist;
 
+	if (noswap)
+		sc.may_swap = 0;
+
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
 	zonelist = NODE_DATA(numa_node_id())->node_zonelists;
Index: mmotm-2.6.28-Nov13/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.28-Nov13.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.28-Nov13/Documentation/controllers/memory.txt
@@ -137,12 +137,32 @@ behind this approach is that a cgroup th
 page will eventually get charged for it (once it is uncharged from
 the cgroup that brought it in -- this will happen on memory pressure).
 
-Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
+Exception: If CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used..
+When you do swapoff and make swapped-out pages of shmem(tmpfs) to
 be backed into memory in force, charges for pages are accounted against the
 caller of swapoff rather than the users of shmem.
 
 
-2.4 Reclaim
+2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
+Swap Extension allows you to record charges for swap. A swapped-in page
+is charged back to the original page allocator if possible.
+
+When swap is accounted, the following files are added.
+ - memory.memsw.usage_in_bytes.
+ - memory.memsw.limit_in_bytes.
+
+Usage of mem+swap is limited by memsw.limit_in_bytes.
+
+Note: why 'mem+swap' rather than swap alone?
+The global LRU (kswapd) can swap out arbitrary pages. Swapping a page
+out moves its charge from memory to swap; there is no change in the
+usage of mem+swap.
+
+In other words, when we want to limit swap usage without affecting the
+global LRU, a mem+swap limit is better, from the OS point of view, than
+limiting swap alone.
+
+2.5 Reclaim
 
 Each cgroup maintains a per cgroup LRU that consists of an active
 and inactive list. When a cgroup goes over its limit, we first try
@@ -246,6 +266,11 @@ Such charges are freed(at default) or mo
 both of RSS and CACHES are moved to parent.
 If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also.
 
+Charges recorded in swap information are not updated when a cgroup is
+removed. The recorded information is discarded, and a cgroup which later
+uses the swap (swapcache) will be charged as its new owner.
+
+
 5. Misc. interfaces.
 
 5.1 force_empty


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 8/9] memcg : synchronized LRU
  2008-11-14 10:12 [PATCH 0/9] memcg updates (14/Nov/2008) KAMEZAWA Hiroyuki
                   ` (6 preceding siblings ...)
  2008-11-14 10:19 ` [PATCH 7/9] memcg : mem+swap controlelr core KAMEZAWA Hiroyuki
@ 2008-11-14 10:20 ` KAMEZAWA Hiroyuki
  2008-11-14 10:21 ` [PATCH 9/9] memcg : add mem_cgroup_disabled() KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-14 10:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, pbadari, jblunck, taka, akpm, lizf,
	linux-kernel

A big patch for changing memcg's LRU semantics.

Now,
  - page_cgroup is linked to mem_cgroup's own LRU (per zone).

  - LRU of page_cgroup is not synchronous with global LRU.

  - page and page_cgroup are one-to-one and statically allocated.

  - To find which LRU a page_cgroup is on, you have to check pc->mem_cgroup, as
    - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

  - SwapCache is handled.

And when we handle the LRU list of page_cgroup, we do the following.

	pc = lookup_page_cgroup(page);
	lock_page_cgroup(pc); .....................(1)
	mz = page_cgroup_zoneinfo(pc);
	spin_lock(&mz->lru_lock);
	.....add to LRU
	spin_unlock(&mz->lru_lock);
	unlock_page_cgroup(pc);

But (1) is a spin_lock and we have to worry about deadlock with zone->lru_lock.
So trylock() is used at (1) for now. Without (1), we can't trust that "mz" is correct.

This is an attempt to remove this dirty nesting of locks.
This patch changes mz->lru_lock into zone->lru_lock.
Then, the above sequence can be written as

        spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
	mem_cgroup_add/remove/etc_lru() {
		pc = lookup_page_cgroup(page);
		mz = page_cgroup_zoneinfo(pc);
		if (PageCgroupUsed(pc)) {
			....add to LRU
		}
        spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
	
This is much simpler.
(*) We're safe even if we don't take lock_page_cgroup(pc), because
    (see the ordering sketch below):
    1. pc->mem_cgroup can only be modified
       - at charge.
       - at account_move().
    2. At charge,
       the PCG_USED bit is not set before pc->mem_cgroup is fixed.
    3. At account_move(),
       the page is isolated and not on the LRU.
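
Below is a minimal, hedged sketch of that publication ordering. It is not
the patch's code: the kernel side uses smp_wmb()/smp_rmb() around
pc->mem_cgroup and pc->flags, while this stand-alone illustration uses C11
acquire/release as a stand-in, and all names (pc_sketch, charge_publish,
lru_observe) are made up for the example:

	#include <stdatomic.h>

	#define PCG_USED_BIT 0x1u

	struct pc_sketch {
		void *mem_cgroup;	/* payload published at charge */
		atomic_uint flags;	/* PCG_USED lives here */
	};

	/* Writer (charge): store the payload, then publish USED. */
	static void charge_publish(struct pc_sketch *pc, void *memcg)
	{
		pc->mem_cgroup = memcg;
		atomic_fetch_or_explicit(&pc->flags, PCG_USED_BIT,
					 memory_order_release);
	}

	/* Reader (LRU ops): trust pc->mem_cgroup only after seeing USED. */
	static void *lru_observe(struct pc_sketch *pc)
	{
		unsigned int f = atomic_load_explicit(&pc->flags,
						      memory_order_acquire);
		if (!(f & PCG_USED_BIT))
			return NULL;	/* not charged yet: skip it */
		return pc->mem_cgroup;
	}

A reader that observes PCG_USED is then also guaranteed to observe the
pc->mem_cgroup store, which is what lets the LRU functions run without
PCG_LOCK.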

Pros.
  - easier maintenance.
  - memcg can make use of the laziness of pagevec.
  - we don't have to duplicate the LRU/Active/Unevictable bits in page_cgroup.
  - memcg's LRU status will be synchronized with the global LRU's.
  - the number of locks is reduced.
  - account_move() is simplified very much.
Cons.
  - may increase the cost of LRU rotation.
    (no impact if memcg is not configured.)

Changelog v0 -> v1
 - fixed statistics.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 fs/splice.c                 |    1 
 include/linux/memcontrol.h  |   29 +++
 include/linux/mm_inline.h   |    3 
 include/linux/page_cgroup.h |   17 --
 mm/memcontrol.c             |  323 +++++++++++++++++++-------------------------
 mm/page_cgroup.c            |    1 
 mm/swap.c                   |    1 
 mm/vmscan.c                 |    9 -
 8 files changed, 178 insertions(+), 206 deletions(-)

Index: mmotm-2.6.28-Nov13/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov13/mm/memcontrol.c
@@ -35,6 +35,7 @@
 #include <linux/vmalloc.h>
 #include <linux/mm_inline.h>
 #include <linux/page_cgroup.h>
+#include "internal.h"
 
 #include <asm/uaccess.h>
 
@@ -99,7 +100,6 @@ struct mem_cgroup_per_zone {
 	/*
 	 * spin_lock to protect the per cgroup LRU
 	 */
-	spinlock_t		lru_lock;
 	struct list_head	lists[NR_LRU_LISTS];
 	unsigned long		count[NR_LRU_LISTS];
 };
@@ -162,14 +162,12 @@ enum charge_type {
 /* only for here (for easy reading.) */
 #define PCGF_CACHE	(1UL << PCG_CACHE)
 #define PCGF_USED	(1UL << PCG_USED)
-#define PCGF_ACTIVE	(1UL << PCG_ACTIVE)
 #define PCGF_LOCK	(1UL << PCG_LOCK)
-#define PCGF_FILE	(1UL << PCG_FILE)
 static const unsigned long
 pcg_default_flags[NR_CHARGE_TYPE] = {
-	PCGF_CACHE | PCGF_FILE | PCGF_USED | PCGF_LOCK, /* File Cache */
-	PCGF_ACTIVE | PCGF_USED | PCGF_LOCK, /* Anon */
-	PCGF_ACTIVE | PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* Shmem */
+	PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* File Cache */
+	PCGF_USED | PCGF_LOCK, /* Anon */
+	PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* Shmem */
 	0, /* FORCE */
 };
 
@@ -184,9 +182,6 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 
-/*
- * Always modified under lru lock. Then, not necessary to preempt_disable()
- */
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 					 struct page_cgroup *pc,
 					 bool charge)
@@ -194,10 +189,9 @@ static void mem_cgroup_charge_statistics
 	int val = (charge)? 1 : -1;
 	struct mem_cgroup_stat *stat = &mem->stat;
 	struct mem_cgroup_stat_cpu *cpustat;
+	int cpu = get_cpu();
 
-	VM_BUG_ON(!irqs_disabled());
-
-	cpustat = &stat->cpustat[smp_processor_id()];
+	cpustat = &stat->cpustat[cpu];
 	if (PageCgroupCache(pc))
 		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
 	else
@@ -209,6 +203,7 @@ static void mem_cgroup_charge_statistics
 	else
 		__mem_cgroup_stat_add_safe(cpustat,
 				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
+	put_cpu();
 }
 
 static struct mem_cgroup_per_zone *
@@ -263,80 +258,95 @@ struct mem_cgroup *mem_cgroup_from_task(
 				struct mem_cgroup, css);
 }
 
-static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
-			struct page_cgroup *pc)
-{
-	int lru = LRU_BASE;
+/*
+ * The following LRU functions may be used without holding PCG_LOCK.
+ * They are called by the global LRU routines independently of memcg.
+ * What we have to take care of here is the validity of pc->mem_cgroup.
+ *
+ * pc->mem_cgroup changes at
+ * 1. charge
+ * 2. moving account
+ * In the typical case, "charge" is done before add-to-LRU. The exception
+ * is SwapCache, which is added to the LRU before being charged.
+ * If the PCG_USED bit is not set, the page_cgroup is not added to this
+ * private LRU. When moving an account, the page is not on the LRU; it
+ * has been isolated.
+ */
 
-	if (PageCgroupUnevictable(pc))
-		lru = LRU_UNEVICTABLE;
-	else {
-		if (PageCgroupActive(pc))
-			lru += LRU_ACTIVE;
-		if (PageCgroupFile(pc))
-			lru += LRU_FILE;
-	}
+void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *mem;
+	struct mem_cgroup_per_zone *mz;
 
+	if (mem_cgroup_subsys.disabled)
+		return;
+	pc = lookup_page_cgroup(page);
+	/* can happen while we handle swapcache. */
+	if (list_empty(&pc->lru))
+		return;
+	mz = page_cgroup_zoneinfo(pc);
+	mem = pc->mem_cgroup;
 	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
-
-	mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
-	list_del(&pc->lru);
+	list_del_init(&pc->lru);
+	return;
 }
 
-static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
-				struct page_cgroup *pc, bool hot)
+void mem_cgroup_del_lru(struct page *page)
 {
-	int lru = LRU_BASE;
+	mem_cgroup_del_lru_list(page, page_lru(page));
+}
 
-	if (PageCgroupUnevictable(pc))
-		lru = LRU_UNEVICTABLE;
-	else {
-		if (PageCgroupActive(pc))
-			lru += LRU_ACTIVE;
-		if (PageCgroupFile(pc))
-			lru += LRU_FILE;
-	}
+void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
+{
+	struct mem_cgroup_per_zone *mz;
+	struct page_cgroup *pc;
 
-	MEM_CGROUP_ZSTAT(mz, lru) += 1;
-	if (hot)
-		list_add(&pc->lru, &mz->lists[lru]);
-	else
-		list_add_tail(&pc->lru, &mz->lists[lru]);
+	if (mem_cgroup_subsys.disabled)
+		return;
 
-	mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
+	pc = lookup_page_cgroup(page);
+	smp_rmb();
+	/* unused page is not rotated. */
+	if (!PageCgroupUsed(pc))
+		return;
+	mz = page_cgroup_zoneinfo(pc);
+	list_move(&pc->lru, &mz->lists[lru]);
 }
 
-static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
+void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
 {
-	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
-	int active    = PageCgroupActive(pc);
-	int file      = PageCgroupFile(pc);
-	int unevictable = PageCgroupUnevictable(pc);
-	enum lru_list from = unevictable ? LRU_UNEVICTABLE :
-				(LRU_FILE * !!file + !!active);
+	struct page_cgroup *pc;
+	struct mem_cgroup_per_zone *mz;
 
-	if (lru == from)
+	if (mem_cgroup_subsys.disabled)
+		return;
+	pc = lookup_page_cgroup(page);
+	/* barrier to sync with "charge" */
+	smp_rmb();
+	if (!PageCgroupUsed(pc))
 		return;
 
-	MEM_CGROUP_ZSTAT(mz, from) -= 1;
-	/*
-	 * However this is done under mz->lru_lock, another flags, which
-	 * are not related to LRU, will be modified from out-of-lock.
-	 * We have to use atomic set/clear flags.
-	 */
-	if (is_unevictable_lru(lru)) {
-		ClearPageCgroupActive(pc);
-		SetPageCgroupUnevictable(pc);
-	} else {
-		if (is_active_lru(lru))
-			SetPageCgroupActive(pc);
-		else
-			ClearPageCgroupActive(pc);
-		ClearPageCgroupUnevictable(pc);
-	}
-
+	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
-	list_move(&pc->lru, &mz->lists[lru]);
+	list_add(&pc->lru, &mz->lists[lru]);
+}
+/*
+ * Adds swapcache to the LRU. Be careful when calling this function:
+ * zone->lru_lock must not be held and irqs must not be disabled.
+ */
+static void mem_cgroup_lru_fixup(struct page *page)
+{
+	if (!isolate_lru_page(page))
+		putback_lru_page(page);
+}
+
+void mem_cgroup_move_lists(struct page *page,
+			   enum lru_list from, enum lru_list to)
+{
+	if (mem_cgroup_subsys.disabled)
+		return;
+	mem_cgroup_del_lru_list(page, from);
+	mem_cgroup_add_lru_list(page, to);
 }
 
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
@@ -350,37 +360,6 @@ int task_in_mem_cgroup(struct task_struc
 }
 
 /*
- * This routine assumes that the appropriate zone's lru lock is already held
- */
-void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
-{
-	struct page_cgroup *pc;
-	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
-
-	if (mem_cgroup_subsys.disabled)
-		return;
-
-	/*
-	 * We cannot lock_page_cgroup while holding zone's lru_lock,
-	 * because other holders of lock_page_cgroup can be interrupted
-	 * with an attempt to rotate_reclaimable_page.  But we cannot
-	 * safely get to page_cgroup without it, so just try_lock it:
-	 * mem_cgroup_isolate_pages allows for page left on wrong list.
-	 */
-	pc = lookup_page_cgroup(page);
-	if (!trylock_page_cgroup(pc))
-		return;
-	if (pc && PageCgroupUsed(pc)) {
-		mz = page_cgroup_zoneinfo(pc);
-		spin_lock_irqsave(&mz->lru_lock, flags);
-		__mem_cgroup_move_lists(pc, lru);
-		spin_unlock_irqrestore(&mz->lru_lock, flags);
-	}
-	unlock_page_cgroup(pc);
-}
-
-/*
  * Calculate mapped_ratio under memory controller. This will be used in
  * vmscan.c for deteremining we have to reclaim mapped pages.
  */
@@ -459,40 +438,24 @@ unsigned long mem_cgroup_isolate_pages(u
 	mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
 	src = &mz->lists[lru];
 
-	spin_lock(&mz->lru_lock);
 	scan = 0;
 	list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
 		if (scan >= nr_to_scan)
 			break;
+
+		page = pc->page;
 		if (unlikely(!PageCgroupUsed(pc)))
 			continue;
-		page = pc->page;
-
 		if (unlikely(!PageLRU(page)))
 			continue;
 
-		/*
-		 * TODO: play better with lumpy reclaim, grabbing anything.
-		 */
-		if (PageUnevictable(page) ||
-		    (PageActive(page) && !active) ||
-		    (!PageActive(page) && active)) {
-			__mem_cgroup_move_lists(pc, page_lru(page));
-			continue;
-		}
-
 		scan++;
-		list_move(&pc->lru, &pc_list);
-
 		if (__isolate_lru_page(page, mode, file) == 0) {
 			list_move(&page->lru, dst);
 			nr_taken++;
 		}
 	}
 
-	list_splice(&pc_list, src);
-	spin_unlock(&mz->lru_lock);
-
 	*scanned = scan;
 	return nr_taken;
 }
@@ -607,9 +570,6 @@ static void __mem_cgroup_commit_charge(s
 				     struct page_cgroup *pc,
 				     enum charge_type ctype)
 {
-	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
-
 	/* try_charge() can return NULL to *memcg, taking care of it. */
 	if (!mem)
 		return;
@@ -624,17 +584,11 @@ static void __mem_cgroup_commit_charge(s
 		return;
 	}
 	pc->mem_cgroup = mem;
-	/*
-	 * If a page is accounted as a page cache, insert to inactive list.
-	 * If anon, insert to active list.
-	 */
+	smp_wmb();
 	pc->flags = pcg_default_flags[ctype];
 
-	mz = page_cgroup_zoneinfo(pc);
+	mem_cgroup_charge_statistics(mem, pc, true);
 
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_add_list(mz, pc, true);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
 	unlock_page_cgroup(pc);
 }
 
@@ -645,8 +599,7 @@ static void __mem_cgroup_commit_charge(s
  * @to:	mem_cgroup which the page is moved to. @from != @to.
  *
  * The caller must confirm following.
- * 1. disable irq.
- * 2. lru_lock of old mem_cgroup(@from) should be held.
+ * - page is not on LRU (isolate_page() is useful.)
  *
  * returns 0 at success,
  * returns -EBUSY when lock is busy or "pc" is unstable.
@@ -662,15 +615,14 @@ static int mem_cgroup_move_account(struc
 	int nid, zid;
 	int ret = -EBUSY;
 
-	VM_BUG_ON(!irqs_disabled());
 	VM_BUG_ON(from == to);
+	VM_BUG_ON(PageLRU(pc->page));
 
 	nid = page_cgroup_nid(pc);
 	zid = page_cgroup_zid(pc);
 	from_mz =  mem_cgroup_zoneinfo(from, nid, zid);
 	to_mz =  mem_cgroup_zoneinfo(to, nid, zid);
 
-
 	if (!trylock_page_cgroup(pc))
 		return ret;
 
@@ -680,18 +632,15 @@ static int mem_cgroup_move_account(struc
 	if (pc->mem_cgroup != from)
 		goto out;
 
-	if (spin_trylock(&to_mz->lru_lock)) {
-		__mem_cgroup_remove_list(from_mz, pc);
-		css_put(&from->css);
-		res_counter_uncharge(&from->res, PAGE_SIZE);
-		if (do_swap_account)
-			res_counter_uncharge(&from->memsw, PAGE_SIZE);
-		pc->mem_cgroup = to;
-		css_get(&to->css);
-		__mem_cgroup_add_list(to_mz, pc, false);
-		ret = 0;
-		spin_unlock(&to_mz->lru_lock);
-	}
+	css_put(&from->css);
+	res_counter_uncharge(&from->res, PAGE_SIZE);
+	mem_cgroup_charge_statistics(from, pc, false);
+	if (do_swap_account)
+		res_counter_uncharge(&from->memsw, PAGE_SIZE);
+	pc->mem_cgroup = to;
+	mem_cgroup_charge_statistics(to, pc, true);
+	css_get(&to->css);
+	ret = 0;
 out:
 	unlock_page_cgroup(pc);
 	return ret;
@@ -705,39 +654,47 @@ static int mem_cgroup_move_parent(struct
 				  struct mem_cgroup *child,
 				  gfp_t gfp_mask)
 {
+	struct page *page = pc->page;
 	struct cgroup *cg = child->css.cgroup;
 	struct cgroup *pcg = cg->parent;
 	struct mem_cgroup *parent;
-	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
 	int ret;
 
 	/* Is ROOT ? */
 	if (!pcg)
 		return -EINVAL;
 
+
 	parent = mem_cgroup_from_cont(pcg);
 
+
 	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
 	if (ret)
 		return ret;
 
-	mz = mem_cgroup_zoneinfo(child,
-			page_cgroup_nid(pc), page_cgroup_zid(pc));
+	if (!get_page_unless_zero(page))
+		return -EBUSY;
+
+	ret = isolate_lru_page(page);
+
+	if (ret)
+		goto cancel;
 
-	spin_lock_irqsave(&mz->lru_lock, flags);
 	ret = mem_cgroup_move_account(pc, child, parent);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
 
-	/* drop extra refcnt */
+	/* drop extra refcnt from try_charge() (move_account incremented one) */
 	css_put(&parent->css);
-	/* uncharge if move fails */
-	if (ret) {
-		res_counter_uncharge(&parent->res, PAGE_SIZE);
-		if (do_swap_account)
-			res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+	putback_lru_page(page);
+	if (!ret) {
+		put_page(page);
+		return 0;
 	}
-
+	/* uncharge if move fails */
+cancel:
+	res_counter_uncharge(&parent->res, PAGE_SIZE);
+	if (do_swap_account)
+		res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+	put_page(page);
 	return ret;
 }
 
@@ -903,6 +860,8 @@ int mem_cgroup_cache_charge_swapin(struc
 	}
 	if (!locked)
 		unlock_page(page);
+	/* add this page(page_cgroup) to the LRU we want. */
+	mem_cgroup_lru_fixup(page);
 
 	return ret;
 }
@@ -935,6 +894,8 @@ void mem_cgroup_commit_charge_swapin(str
 		}
 
 	}
+	/* add this page(page_cgroup) to the LRU we want. */
+	mem_cgroup_lru_fixup(page);
 }
 
 void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
@@ -959,7 +920,6 @@ __mem_cgroup_uncharge_common(struct page
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
 
 	if (mem_cgroup_subsys.disabled)
 		return NULL;
@@ -1001,12 +961,10 @@ __mem_cgroup_uncharge_common(struct page
 	if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
 		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
 
+	mem_cgroup_charge_statistics(mem, pc, false);
 	ClearPageCgroupUsed(pc);
 
 	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_remove_list(mz, pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
 	unlock_page_cgroup(pc);
 
 	css_put(&mem->css);
@@ -1260,21 +1218,22 @@ int mem_cgroup_resize_memsw_limit(struct
 	return ret;
 }
 
-
 /*
  * This routine traverse page_cgroup in given list and drop them all.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
  */
 static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
-			    struct mem_cgroup_per_zone *mz,
-			    enum lru_list lru)
+				int node, int zid, enum lru_list lru)
 {
+	struct zone *zone;
+	struct mem_cgroup_per_zone *mz;
 	struct page_cgroup *pc, *busy;
-	unsigned long flags;
-	unsigned long loop;
+	unsigned long flags, loop;
 	struct list_head *list;
 	int ret = 0;
 
+	zone = &NODE_DATA(node)->node_zones[zid];
+	mz = mem_cgroup_zoneinfo(mem, node, zid);
 	list = &mz->lists[lru];
 
 	loop = MEM_CGROUP_ZSTAT(mz, lru);
@@ -1283,19 +1242,19 @@ static int mem_cgroup_force_empty_list(s
 	busy = NULL;
 	while (loop--) {
 		ret = 0;
-		spin_lock_irqsave(&mz->lru_lock, flags);
+		spin_lock_irqsave(&zone->lru_lock, flags);
 		if (list_empty(list)) {
-			spin_unlock_irqrestore(&mz->lru_lock, flags);
+			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			break;
 		}
 		pc = list_entry(list->prev, struct page_cgroup, lru);
 		if (busy == pc) {
 			list_move(&pc->lru, list);
 			busy = NULL;
-			spin_unlock_irqrestore(&mz->lru_lock, flags);
+			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			continue;
 		}
-		spin_unlock_irqrestore(&mz->lru_lock, flags);
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
 		ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
 		if (ret == -ENOMEM)
@@ -1308,6 +1267,7 @@ static int mem_cgroup_force_empty_list(s
 		} else
 			busy = NULL;
 	}
+
 	if (!ret && !list_empty(list))
 		return -EBUSY;
 	return ret;
@@ -1342,12 +1302,10 @@ move_account:
 		ret = 0;
 		for_each_node_state(node, N_POSSIBLE) {
 			for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
-				struct mem_cgroup_per_zone *mz;
 				enum lru_list l;
-				mz = mem_cgroup_zoneinfo(mem, node, zid);
 				for_each_lru(l) {
 					ret = mem_cgroup_force_empty_list(mem,
-								  mz, l);
+							node, zid, l);
 					if (ret)
 						break;
 				}
@@ -1391,6 +1349,7 @@ try_to_free:
 		}
 
 	}
+	lru_add_drain();
 	/* try move_account...there may be some *locked* pages. */
 	if (mem->res.usage)
 		goto move_account;
@@ -1635,7 +1594,6 @@ static int alloc_mem_cgroup_per_zone_inf
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
-		spin_lock_init(&mz->lru_lock);
 		for_each_lru(l)
 			INIT_LIST_HEAD(&mz->lists[l]);
 	}
@@ -1684,8 +1642,15 @@ static struct mem_cgroup *mem_cgroup_all
 
 static void mem_cgroup_free(struct mem_cgroup *mem)
 {
+	int node;
+
 	if (atomic_read(&mem->refcnt) > 0)
 		return;
+
+
+	for_each_node_state(node, N_POSSIBLE)
+		free_mem_cgroup_per_zone_info(mem, node);
+
 	if (mem_cgroup_size() < PAGE_SIZE)
 		kfree(mem);
 	else
@@ -1758,12 +1723,6 @@ static void mem_cgroup_pre_destroy(struc
 static void mem_cgroup_destroy(struct cgroup_subsys *ss,
 				struct cgroup *cont)
 {
-	int node;
-	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
-
-	for_each_node_state(node, N_POSSIBLE)
-		free_mem_cgroup_per_zone_info(mem, node);
-
 	mem_cgroup_free(mem_cgroup_from_cont(cont));
 }
 
Index: mmotm-2.6.28-Nov13/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Nov13.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Nov13/include/linux/memcontrol.h
@@ -40,7 +40,12 @@ extern void mem_cgroup_cancel_charge_swa
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
-extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
+extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_del_lru(struct page *page);
+extern void mem_cgroup_move_lists(struct page *page,
+				  enum lru_list from, enum lru_list to);
 extern void mem_cgroup_uncharge_page(struct page *page);
 extern void mem_cgroup_uncharge_cache_page(struct page *page);
 extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
@@ -131,7 +136,27 @@ static inline int mem_cgroup_shrink_usag
 	return 0;
 }
 
-static inline void mem_cgroup_move_lists(struct page *page, bool active)
+static inline void mem_cgroup_add_lru_list(struct page *page, int lru)
+{
+}
+
+static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
+{
+	return;
+}
+
+static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
+{
+	return;
+}
+
+static inline void mem_cgroup_del_lru(struct page *page)
+{
+	return;
+}
+
+static inline void
+mem_cgroup_move_lists(struct page *page, enum lru_list from, enum lru_list to)
 {
 }
 
Index: mmotm-2.6.28-Nov13/include/linux/mm_inline.h
===================================================================
--- mmotm-2.6.28-Nov13.orig/include/linux/mm_inline.h
+++ mmotm-2.6.28-Nov13/include/linux/mm_inline.h
@@ -28,6 +28,7 @@ add_page_to_lru_list(struct zone *zone, 
 {
 	list_add(&page->lru, &zone->lru[l].list);
 	__inc_zone_state(zone, NR_LRU_BASE + l);
+	mem_cgroup_add_lru_list(page, l);
 }
 
 static inline void
@@ -35,6 +36,7 @@ del_page_from_lru_list(struct zone *zone
 {
 	list_del(&page->lru);
 	__dec_zone_state(zone, NR_LRU_BASE + l);
+	mem_cgroup_del_lru_list(page, l);
 }
 
 static inline void
@@ -54,6 +56,7 @@ del_page_from_lru(struct zone *zone, str
 		l += page_is_file_cache(page);
 	}
 	__dec_zone_state(zone, NR_LRU_BASE + l);
+	mem_cgroup_del_lru_list(page, l);
 }
 
 /**
Index: mmotm-2.6.28-Nov13/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-Nov13/mm/page_cgroup.c
@@ -17,6 +17,7 @@ __init_page_cgroup(struct page_cgroup *p
 	pc->flags = 0;
 	pc->mem_cgroup = NULL;
 	pc->page = pfn_to_page(pfn);
+	INIT_LIST_HEAD(&pc->lru);
 }
 static unsigned long total_usage;
 
Index: mmotm-2.6.28-Nov13/fs/splice.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/fs/splice.c
+++ mmotm-2.6.28-Nov13/fs/splice.c
@@ -21,6 +21,7 @@
 #include <linux/file.h>
 #include <linux/pagemap.h>
 #include <linux/splice.h>
+#include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
 #include <linux/swap.h>
 #include <linux/writeback.h>
Index: mmotm-2.6.28-Nov13/mm/vmscan.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/vmscan.c
+++ mmotm-2.6.28-Nov13/mm/vmscan.c
@@ -565,7 +565,6 @@ redo:
 		lru = LRU_UNEVICTABLE;
 		add_page_to_unevictable_list(page);
 	}
-	mem_cgroup_move_lists(page, lru);
 
 	/*
 	 * page's status can change while we move it among lru. If an evictable
@@ -600,7 +599,6 @@ void putback_lru_page(struct page *page)
 
 	lru = !!TestClearPageActive(page) + page_is_file_cache(page);
 	lru_cache_add_lru(page, lru);
-	mem_cgroup_move_lists(page, lru);
 	put_page(page);
 }
 #endif /* CONFIG_UNEVICTABLE_LRU */
@@ -872,6 +870,7 @@ int __isolate_lru_page(struct page *page
 		return ret;
 
 	ret = -EBUSY;
+
 	if (likely(get_page_unless_zero(page))) {
 		/*
 		 * Be careful not to clear PageLRU until after we're
@@ -880,6 +879,7 @@ int __isolate_lru_page(struct page *page
 		 */
 		ClearPageLRU(page);
 		ret = 0;
+		mem_cgroup_del_lru(page);
 	}
 
 	return ret;
@@ -1193,7 +1193,6 @@ static unsigned long shrink_inactive_lis
 			SetPageLRU(page);
 			lru = page_lru(page);
 			add_page_to_lru_list(zone, page, lru);
-			mem_cgroup_move_lists(page, lru);
 			if (PageActive(page) && scan_global_lru(sc)) {
 				int file = !!page_is_file_cache(page);
 				zone->recent_rotated[file]++;
@@ -1326,7 +1325,7 @@ static void shrink_active_list(unsigned 
 		ClearPageActive(page);
 
 		list_move(&page->lru, &zone->lru[lru].list);
-		mem_cgroup_move_lists(page, lru);
+		mem_cgroup_add_lru_list(page, lru);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
@@ -2486,6 +2485,7 @@ retry:
 
 		__dec_zone_state(zone, NR_UNEVICTABLE);
 		list_move(&page->lru, &zone->lru[l].list);
+		mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
 		__inc_zone_state(zone, NR_INACTIVE_ANON + l);
 		__count_vm_event(UNEVICTABLE_PGRESCUED);
 	} else {
@@ -2494,6 +2494,7 @@ retry:
 		 */
 		SetPageUnevictable(page);
 		list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
+		mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
 		if (page_evictable(page, NULL))
 			goto retry;
 	}
Index: mmotm-2.6.28-Nov13/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.28-Nov13.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.28-Nov13/include/linux/page_cgroup.h
@@ -26,10 +26,6 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
-	/* flags for LRU placement */
-	PCG_ACTIVE, /* page is active in this cgroup */
-	PCG_FILE, /* page is file system backed */
-	PCG_UNEVICTABLE, /* page is unevictableable */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -50,19 +46,6 @@ TESTPCGFLAG(Cache, CACHE)
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 
-/* LRU management flags (from global-lru definition) */
-TESTPCGFLAG(File, FILE)
-SETPCGFLAG(File, FILE)
-CLEARPCGFLAG(File, FILE)
-
-TESTPCGFLAG(Active, ACTIVE)
-SETPCGFLAG(Active, ACTIVE)
-CLEARPCGFLAG(Active, ACTIVE)
-
-TESTPCGFLAG(Unevictable, UNEVICTABLE)
-SETPCGFLAG(Unevictable, UNEVICTABLE)
-CLEARPCGFLAG(Unevictable, UNEVICTABLE)
-
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
Index: mmotm-2.6.28-Nov13/mm/swap.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/swap.c
+++ mmotm-2.6.28-Nov13/mm/swap.c
@@ -168,7 +168,6 @@ void activate_page(struct page *page)
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(zone, page, lru);
 		__count_vm_event(PGACTIVATE);
-		mem_cgroup_move_lists(page, lru);
 
 		zone->recent_rotated[!!file]++;
 		zone->recent_scanned[!!file]++;


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 9/9] memcg : add mem_cgroup_disabled()
  2008-11-14 10:12 [PATCH 0/9] memcg updates (14/Nov/2008) KAMEZAWA Hiroyuki
                   ` (7 preceding siblings ...)
  2008-11-14 10:20 ` [PATCH 8/9] memcg : synchronized LRU KAMEZAWA Hiroyuki
@ 2008-11-14 10:21 ` KAMEZAWA Hiroyuki
  2008-11-14 11:33 ` [PATCH 0/9] memcg updates (14/Nov/2008) Balbir Singh
  2008-11-15  3:00 ` KAMEZAWA Hiroyuki
  10 siblings, 0 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-14 10:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, pbadari, jblunck, taka, akpm, lizf,
	linux-kernel

We check whether mem_cgroup is disabled by testing mem_cgroup_subsys.disabled
directly, and there are now more such references than expected.

Replacing
   if (mem_cgroup_subsys.disabled)
with
   if (mem_cgroup_disabled())

gives us a cleaner look, I think.

From: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 include/linux/memcontrol.h |   13 +++++++++++++
 mm/memcontrol.c            |   28 ++++++++++++++--------------
 mm/page_cgroup.c           |    4 ++--
 3 files changed, 29 insertions(+), 16 deletions(-)

Index: mmotm-2.6.28-Nov13/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Nov13.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Nov13/include/linux/memcontrol.h
@@ -87,6 +87,14 @@ extern long mem_cgroup_calc_reclaim(stru
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern int do_swap_account;
 #endif
+
+static inline bool mem_cgroup_disabled(void)
+{
+	if (mem_cgroup_subsys.disabled)
+		return true;
+	return false;
+}
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -214,6 +222,11 @@ static inline long mem_cgroup_calc_recla
 {
 	return 0;
 }
+
+static inline bool mem_cgroup_disabled(void)
+{
+	return true;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
Index: mmotm-2.6.28-Nov13/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov13/mm/memcontrol.c
@@ -278,7 +278,7 @@ void mem_cgroup_del_lru_list(struct page
 	struct mem_cgroup *mem;
 	struct mem_cgroup_per_zone *mz;
 
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return;
 	pc = lookup_page_cgroup(page);
 	/* can happen while we handle swapcache. */
@@ -301,7 +301,7 @@ void mem_cgroup_rotate_lru_list(struct p
 	struct mem_cgroup_per_zone *mz;
 	struct page_cgroup *pc;
 
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return;
 
 	pc = lookup_page_cgroup(page);
@@ -318,7 +318,7 @@ void mem_cgroup_add_lru_list(struct page
 	struct page_cgroup *pc;
 	struct mem_cgroup_per_zone *mz;
 
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return;
 	pc = lookup_page_cgroup(page);
 	/* barrier to sync with "charge" */
@@ -343,7 +343,7 @@ static void mem_cgroup_lru_fixup(struct 
 void mem_cgroup_move_lists(struct page *page,
 			   enum lru_list from, enum lru_list to)
 {
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return;
 	mem_cgroup_del_lru_list(page, from);
 	mem_cgroup_add_lru_list(page, to);
@@ -730,7 +730,7 @@ static int mem_cgroup_charge_common(stru
 int mem_cgroup_newpage_charge(struct page *page,
 			      struct mm_struct *mm, gfp_t gfp_mask)
 {
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return 0;
 	if (PageCompound(page))
 		return 0;
@@ -752,7 +752,7 @@ int mem_cgroup_newpage_charge(struct pag
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return 0;
 	if (PageCompound(page))
 		return 0;
@@ -798,7 +798,7 @@ int mem_cgroup_try_charge_swapin(struct 
 	struct mem_cgroup *mem;
 	swp_entry_t     ent;
 
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return 0;
 
 	if (!do_swap_account)
@@ -824,7 +824,7 @@ int mem_cgroup_cache_charge_swapin(struc
 {
 	int ret = 0;
 
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return 0;
 	if (unlikely(!mm))
 		mm = &init_mm;
@@ -871,7 +871,7 @@ void mem_cgroup_commit_charge_swapin(str
 {
 	struct page_cgroup *pc;
 
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return;
 	if (!ptr)
 		return;
@@ -900,7 +900,7 @@ void mem_cgroup_commit_charge_swapin(str
 
 void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
 {
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return;
 	if (!mem)
 		return;
@@ -921,7 +921,7 @@ __mem_cgroup_uncharge_common(struct page
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
 
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return NULL;
 
 	if (PageSwapCache(page))
@@ -1040,7 +1040,7 @@ int mem_cgroup_prepare_migration(struct 
 	struct mem_cgroup *mem = NULL;
 	int ret = 0;
 
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return 0;
 
 	pc = lookup_page_cgroup(page);
@@ -1122,7 +1122,7 @@ int mem_cgroup_shrink_usage(struct mm_st
 	int progress = 0;
 	int retry = MEM_CGROUP_RECLAIM_RETRIES;
 
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return 0;
 	if (!mm)
 		return 0;
@@ -1675,7 +1675,7 @@ static void mem_cgroup_put(struct mem_cg
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 static void __init enable_swap_cgroup(void)
 {
-	if (!mem_cgroup_subsys.disabled && really_do_swap_account)
+	if (!mem_cgroup_disabled() && really_do_swap_account)
 		do_swap_account = 1;
 }
 #else
Index: mmotm-2.6.28-Nov13/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-Nov13.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-Nov13/mm/page_cgroup.c
@@ -72,7 +72,7 @@ void __init page_cgroup_init(void)
 
 	int nid, fail;
 
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -244,7 +244,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int fail = 0;
 
-	if (mem_cgroup_subsys.disabled)
+	if (mem_cgroup_disabled())
 		return;
 
 	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] memcg updates (14/Nov/2008)
  2008-11-14 10:12 [PATCH 0/9] memcg updates (14/Nov/2008) KAMEZAWA Hiroyuki
                   ` (8 preceding siblings ...)
  2008-11-14 10:21 ` [PATCH 9/9] memcg : add mem_cgroup_disabled() KAMEZAWA Hiroyuki
@ 2008-11-14 11:33 ` Balbir Singh
  2008-11-15  3:00 ` KAMEZAWA Hiroyuki
  10 siblings, 0 replies; 21+ messages in thread
From: Balbir Singh @ 2008-11-14 11:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, nishimura, pbadari, jblunck, taka, akpm, lizf, linux-kernel

KAMEZAWA Hiroyuki wrote:
> Several patches are posted after last update (12/Nov),
> it's better to catch all up as series.
> 
> All patchs are mm-of-the-moment snapshot 2008-11-13-17-22
>   http://userweb.kernel.org/~akpm/mmotm/
> (You may need to patch fs/dquota.c and fix kernel/auditsc.c CONFIG error)
> 
> New ones are 1,2,3 and 9. 
> 
> IMHO, patch 1-4 are ready to go. (but I want Ack from Balbir to 3/9)

Hi, Kamezawa,

Sorry to keep you waiting, I've been spending time on memcg hierarchy patches
(testing, fixing, revisiting them). Hopefully I'll find some time soon.

-- 
	Balbir


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] memcg updates (14/Nov/2008)
  2008-11-14 10:12 [PATCH 0/9] memcg updates (14/Nov/2008) KAMEZAWA Hiroyuki
                   ` (9 preceding siblings ...)
  2008-11-14 11:33 ` [PATCH 0/9] memcg updates (14/Nov/2008) Balbir Singh
@ 2008-11-15  3:00 ` KAMEZAWA Hiroyuki
  2008-11-15  7:25   ` Balbir Singh
  10 siblings, 1 reply; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-15  3:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, nishimura, taka, akpm, lizf, linux-mm

On Fri, 14 Nov 2008 19:12:46 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Several patches are posted after last update (12/Nov),
> it's better to catch all up as series.
> 
> All patchs are mm-of-the-moment snapshot 2008-11-13-17-22
>   http://userweb.kernel.org/~akpm/mmotm/
> (You may need to patch fs/dquota.c and fix kernel/auditsc.c CONFIG error)
> 
> New ones are 1,2,3 and 9. 
> 
> IMHO, patch 1-4 are ready to go. (but I want Ack from Balbir to 3/9)
> 
Reduced CCs.

Hi folks, I noticed that all 9 patches here are now in mmotm.
Thank you all for your patient help! Please try
"mm-of-the-moment snapshot 2008-11-14-17-14";
the mem+swap controller is now available there.

My concern is architectures other than x86-64. It seems Nishimura and I
mainly test on x86-64, so testing on other architectures is very welcome.

I have no patches in my queue and am wondering how to start the
  - shrink usage
  - dirty ratio for memcg
  - help with Balbir's hierarchy
work. (But I may have to clean up/optimize the code before going further.)

And Balbir, the world has changed after the synchronized-LRU patch ([8/9]);
please take a look.

Thanks!
-Kame



> Contents:
> 
> [1/9] .... fix memory online/offline with memcg.
>   This patch is for "real" memory hotplug. So, people who can test this
>   is limited, I think. I asked Badari to try this.
>   This fix itself is logically correct I think, but there may be other bugs..
> 
> [2/9] .... reduce size of per-cpu allocation.
>   This is from Jan Blunck <jblunck@suse.de> and I picked it up and rewrote.
>   please test. This tries to reduce memory usage of mem_cgroup struct.
> 
> [3/9] .... add force_empty again with proper implementation.
>   I removed "force_empty" by account_move patch in mmotm. But I asked not to
>   do that brutal removal of interface. I'm sorry.
>   This adds "force_empty", but implemntaion itself is much saner. After this,
>   force_empty is no longer "debug only" interface.
> 
> [4/9] .... account swap-cache.
>   Before accounting swap, we have to handle swap-cache.
>   This patch have been test for a month and seems to works well. Still here
>   and waiting for bug fixes moved into..
> 
> [5/9] .... mem+swap controller kconfig
>   Kconfig changes and macro for mem+swap controller.
> 
> [6/9] .... swap cgroup.
>   For accounting swap, we have to prepare a strage for remembering swap.
> 
> [7/9] .... mem+swap controller.
>   mem+swap controller core logic. I and Nishimura have been testing this
>   for a month. It's getting nicer.
> 
> [8/9] .... synchronized LRU patch
>   remove mz->lru_lock and make use of zone->lru_lock. By this, we do not have to
>   duplicate vmscan's global LRU behavior in memcg.
>   I think I'm an only tester of this ;) but works well.
> 
> [9/9] .... mem_cgroup_disabled() patch
>   Replacing if (mem_cgroup_subsys.disabled) to be if (mem_cgroup_disabled()).
>   Takahashi (dm-ioband team) posted their bio-cgroup interface working with
>   page_cgroup. This is cut out from his one.
>   Takahashi, If you ack me, send me Signed-off-by or Acked-by. I'll queue this.
> 
> Thanks,
> -Kame
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] memcg updates (14/Nov/2008)
  2008-11-15  3:00 ` KAMEZAWA Hiroyuki
@ 2008-11-15  7:25   ` Balbir Singh
  2008-11-15  9:16     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 21+ messages in thread
From: Balbir Singh @ 2008-11-15  7:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, taka, akpm, lizf, linux-mm

KAMEZAWA Hiroyuki wrote:
> On Fri, 14 Nov 2008 19:12:46 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
>> Several patches are posted after last update (12/Nov),
>> it's better to catch all up as series.
>>
>> All patchs are mm-of-the-moment snapshot 2008-11-13-17-22
>>   http://userweb.kernel.org/~akpm/mmotm/
>> (You may need to patch fs/dquota.c and fix kernel/auditsc.c CONFIG error)
>>
>> New ones are 1,2,3 and 9. 
>>
>> IMHO, patch 1-4 are ready to go. (but I want Ack from Balbir to 3/9)
>>
> Reduced CCs.
> 
> Hi folks, I noticed that all 9 pathces here are now in mmotm.
> Thank you for all your patient help! and
> please try "mm-of-the-moment snapshot 2008-11-14-17-14" 
> Now, mem+swap controller is available there.
> 
> My concern is architecture other than x86-64. It seems I and Nishimura use
> x86-64 in main test. So, test in other archtecuthre is very welcome.
> 
> I have no patches in my queue and wondering how to start
>   - shrink usage
>   - dirty ratio for memcg.
>   - help Balbir's hierarchy.
> works. (But I may have to clean up/optimize codes before going further.)
> 
> and Balbir, the world is changed after synchronized-LRU patch ([8/9]).
> please see it. 

Time to resynchronize the patches! I've taken a cursory look but not done a
detailed review of them. Help with hierarchy would be nice; I've got most of
the patches nailed down, except for resynchronization with mmotm.

-- 
	Balbir


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] memcg updates (14/Nov/2008)
  2008-11-15  7:25   ` Balbir Singh
@ 2008-11-15  9:16     ` KAMEZAWA Hiroyuki
  2008-11-15  9:19       ` Balbir Singh
  0 siblings, 1 reply; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-15  9:16 UTC (permalink / raw)
  To: balbir; +Cc: KAMEZAWA Hiroyuki, nishimura, taka, akpm, lizf, linux-mm

Balbir Singh said:
> KAMEZAWA Hiroyuki wrote:
> Time to resynchronize the patches! I've taken a cursory look, not done a
> detailed review of those patches. Help with hierarchy would be nice, I've
> got
> most of the patches nailed down, except for resynchronization with mmotm.
>
I have no other patches now and I'd like to use the time for testing and
reviewing. So yes, it's a good time to resynchronize the patches.

Okay, let's start hierarchy support first. I'll stop "new feature" work
for a while.

Regards,
-Kame


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] memcg updates (14/Nov/2008)
  2008-11-15  9:16     ` KAMEZAWA Hiroyuki
@ 2008-11-15  9:19       ` Balbir Singh
  0 siblings, 0 replies; 21+ messages in thread
From: Balbir Singh @ 2008-11-15  9:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, taka, akpm, lizf, linux-mm

KAMEZAWA Hiroyuki wrote:
> Balbir Singh said:
>> KAMEZAWA Hiroyuki wrote:
>> Time to resynchronize the patches! I've taken a cursory look, not done a
>> detailed review of those patches. Help with hierarchy would be nice, I've
>> got
>> most of the patches nailed down, except for resynchronization with mmotm.
>>
> I have no other patches now and I'd like to use time for testing and
> reviewing. So, it's nice time to resynchronize patches, yes.
> 
> Okay, let's start hierarchy support first. I'll stop "new feature" work
> for a while.

OK, let me post v4 and then we'll synchronize the patchset. I would like to
review/test as well after hierarchy, and then implement soft limits. Soft limits
will allow us to overcommit the mem cgroup, which is a very useful feature.

-- 
	Balbir


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 7/9] memcg : mem+swap controller core
  2008-11-14 10:19 ` [PATCH 7/9] memcg : mem+swap controller core KAMEZAWA Hiroyuki
@ 2008-11-21  2:40   ` Li Zefan
  2008-11-21  2:44     ` KAMEZAWA Hiroyuki
  2008-11-21  9:58     ` [PATCH 0/2] memcg: fix oom handling KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 21+ messages in thread
From: Li Zefan @ 2008-11-21  2:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, pbadari, jblunck, taka, akpm, linux-kernel

> @@ -513,12 +531,25 @@ static int __mem_cgroup_try_charge(struc
>  		css_get(&mem->css);
>  	}
>  
> +	while (1) {

This loop will never break out if memory.limit_in_bytes is too low.

Actually, when I set the limit to 0, moved a task into the cgroup, and let
the task allocate a page, the whole system froze and I had to reset
my machine.

And a small memory.limit will make the process get stuck:
# mkdir /memcg/0
# echo 40K > /memcg/0/memory.limit_in_bytes
# echo $$ > tasks
# ls
(stuck)

(another console)
# echo 100K > /memcg/0/memory.limit_in_bytes
(then the above 'ls' can continue)
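
To see why the loop never breaks out with limit = 0, here is a minimal
user-space sketch (hypothetical illustration code with a simplified
res_counter, not the kernel code): the charge can never succeed, reclaim
has nothing to free, and the check-under-limit test never fires, so every
pass ends in the OOM path, which the page-fault path then retries:

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE	4096UL
#define NR_RETRIES	5

struct res_counter { unsigned long usage, limit; };

/* Charge sz bytes; fail if that would push usage over the limit. */
static bool res_counter_charge(struct res_counter *rc, unsigned long sz)
{
	if (rc->usage + sz > rc->limit)
		return false;
	rc->usage += sz;
	return true;
}

/* Reclaim can only make progress if something is actually charged. */
static bool try_to_free_pages(struct res_counter *rc)
{
	if (rc->usage < PAGE_SIZE)
		return false;
	rc->usage -= PAGE_SIZE;
	return true;
}

int main(void)
{
	/* memory.limit_in_bytes = 0: a single page can never be charged */
	struct res_counter rc = { .usage = 0, .limit = 0 };
	int nr_retries = NR_RETRIES;

	while (1) {
		if (res_counter_charge(&rc, PAGE_SIZE))
			break;			/* never reached */
		if (try_to_free_pages(&rc))
			continue;		/* frees nothing here */
		if (rc.usage < rc.limit)	/* check_under_limit */
			continue;
		if (!nr_retries--) {
			puts("give up -> memcg OOM; the faulting task just"
			     " retries the charge, so the system livelocks");
			return 1;
		}
	}
	puts("charged");
	return 0;
}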

> +		int ret;
> +		bool noswap = false;
>  
> -	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
> +		ret = res_counter_charge(&mem->res, PAGE_SIZE);
> +		if (likely(!ret)) {
> +			if (!do_swap_account)
> +				break;
> +			ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
> +			if (likely(!ret))
> +				break;
> +			/* mem+swap counter fails */
> +			res_counter_uncharge(&mem->res, PAGE_SIZE);
> +			noswap = true;
> +		}
>  		if (!(gfp_mask & __GFP_WAIT))
>  			goto nomem;
>  
> -		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> +		if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
>  			continue;
>  
>  		/*
> @@ -527,8 +558,13 @@ static int __mem_cgroup_try_charge(struc
>  		 * moved to swap cache or just unmapped from the cgroup.
>  		 * Check the limit again to see if the reclaim reduced the
>  		 * current usage of the cgroup before giving up
> +		 *
>  		 */
> -		if (res_counter_check_under_limit(&mem->res))
> +		if (!do_swap_account &&
> +			res_counter_check_under_limit(&mem->res))
> +			continue;
> +		if (do_swap_account &&
> +			res_counter_check_under_limit(&mem->memsw))
>  			continue;
>  
>  		if (!nr_retries--) {



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 7/9] memcg : mem+swap controller core
  2008-11-21  2:40   ` Li Zefan
@ 2008-11-21  2:44     ` KAMEZAWA Hiroyuki
  2008-11-21  9:58     ` [PATCH 0/2] memcg: fix oom handling KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-21  2:44 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm, balbir, nishimura, pbadari, jblunck, taka, akpm, linux-kernel

On Fri, 21 Nov 2008 10:40:07 +0800
Li Zefan <lizf@cn.fujitsu.com> wrote:

> > @@ -513,12 +531,25 @@ static int __mem_cgroup_try_charge(struc
> >  		css_get(&mem->css);
> >  	}
> >  
> > +	while (1) {
> 
> This loop will never break out if memory.limit_in_bytes is too low.
> 
> Actually, when I set the limit to 0, moved a task into the cgroup, and let
> the task allocate a page, the whole system froze and I had to reset
> my machine.
> 
> And a small memory.limit will make the process get stuck:
> # mkdir /memcg/0
> # echo 40K > /memcg/0/memory.limit_in_bytes
> # echo $$ > tasks
> # ls
> (stuck)
> 
OK, I'll check and post a patch today for both.
Thank you for reporting.

Maybe some cleanup is needed around the "give up" or "continue" logic.

-Kame

> (another console)
> # echo 100K > /memcg/0/memory.limit_in_bytes
> (then the above 'ls' can continue)
> 
> > +		int ret;
> > +		bool noswap = false;
> >  
> > -	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
> > +		ret = res_counter_charge(&mem->res, PAGE_SIZE);
> > +		if (likely(!ret)) {
> > +			if (!do_swap_account)
> > +				break;
> > +			ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
> > +			if (likely(!ret))
> > +				break;
> > +			/* mem+swap counter fails */
> > +			res_counter_uncharge(&mem->res, PAGE_SIZE);
> > +			noswap = true;
> > +		}
> >  		if (!(gfp_mask & __GFP_WAIT))
> >  			goto nomem;
> >  
> > -		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> > +		if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
> >  			continue;
> >  
> >  		/*
> > @@ -527,8 +558,13 @@ static int __mem_cgroup_try_charge(struc
> >  		 * moved to swap cache or just unmapped from the cgroup.
> >  		 * Check the limit again to see if the reclaim reduced the
> >  		 * current usage of the cgroup before giving up
> > +		 *
> >  		 */
> > -		if (res_counter_check_under_limit(&mem->res))
> > +		if (!do_swap_account &&
> > +			res_counter_check_under_limit(&mem->res))
> > +			continue;
> > +		if (do_swap_account &&
> > +			res_counter_check_under_limit(&mem->memsw))
> >  			continue;
> >  
> >  		if (!nr_retries--) {
> 
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 0/2] memcg: fix oom handling
  2008-11-21  2:40   ` Li Zefan
  2008-11-21  2:44     ` KAMEZAWA Hiroyuki
@ 2008-11-21  9:58     ` KAMEZAWA Hiroyuki
  2008-11-21 10:01       ` [PATCH 1/2] memcg: avoid unnecessary system-wide-oom-killer KAMEZAWA Hiroyuki
                         ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-21  9:58 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm, balbir, nishimura, pbadari, jblunck, taka, akpm,
	linux-kernel, npiggin

Li Zefan reported:

(a) This deadlocks:
==
   #echo 0 >  (...)/01/memory.limit_in_bytes   #set memcg's limit to 0,
   #echo $$ > (...)/01/memory.tasks            #move task
   # do something...
==

(b) This seems to deadlock:
==
   #echo 40k >  (...)/01/memory.limit_in_bytes   #set memcg's limit to 40k,
   #echo $$ > (...)/01/memory.tasks            #move task
   # do something...
==


I think (a) is a BUG; (b) is just a slowdown.
(You can see the pgpgin/pgpgout counts increasing in case (b).)

This patch set handles (a). Li-san, could you check?
It works well in my environment (meaning the OOM killer is invoked in the proper way).

 [1/2].... current mmotm has pagefault_out_of_memory(), but it doesn't consider
           memcg. When memcg hits its limit during a page fault and panic_on_oom
           is set, the kernel panics.
           This tries to fix that.
           (See patches/mm-invoke-oom-killer-from-page-fault.patch)

 [2/2].... fixes the wrong check_under_limit logic.

Anyway, it seems hierarchy support is *not* sufficient in the OOM handler.
Balbir, could you check it?
I think "a bad process in the hierarchy, rather than in the memcg" should be killed.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/2] memcg: avoid unnecessary system-wide-oom-killer
  2008-11-21  9:58     ` [PATCH 0/2] memcg: fix oom handling KAMEZAWA Hiroyuki
@ 2008-11-21 10:01       ` KAMEZAWA Hiroyuki
  2008-11-21 10:03       ` [PATCH 2/2] memcg: fix reclaim result checks KAMEZAWA Hiroyuki
  2008-11-22  2:16       ` [PATCH 0/2] memcg: fix oom handling Li Zefan
  2 siblings, 0 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-21 10:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Li Zefan, linux-mm, balbir, nishimura, pbadari, jblunck, taka,
	akpm, linux-kernel, npiggin

Current mmotm has a new OOM function, pagefault_out_of_memory().
It was added to select a bad process rather than killing current.

When memcg hits its limit and calls OOM at page fault time, this handler
is called and system-wide OOM handling happens.
(This means the kernel panics if panic_on_oom is true.)

To avoid such overkill, check the memcg's recent behavior before
starting a system-wide OOM.

This patch also guarantees that we "don't account against a
process with TIF_MEMDIE". This is necessary for smooth OOM handling.
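
The window check added below boils down to this (a user-space sketch
assuming HZ=1000; time_before() is modeled on the kernel's wrap-safe
macro, and oom_called_recently() is a hypothetical stand-in for
mem_cgroup_oom_called()):

#include <stdbool.h>
#include <stdio.h>

#define HZ 1000
static unsigned long jiffies;	/* stand-in for the kernel's tick counter */

/* Wrap-safe "a is earlier than b", like the kernel's time_before(). */
static bool time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

struct mem_cgroup { unsigned long last_oom_jiffies; };

/*
 * True if this memcg invoked its own OOM killer within the last
 * HZ/10 ticks (~100ms); pagefault_out_of_memory() can then rest
 * and return instead of escalating to a system-wide OOM.
 */
static bool oom_called_recently(struct mem_cgroup *mem)
{
	return mem && time_before(jiffies, mem->last_oom_jiffies + HZ / 10);
}

int main(void)
{
	struct mem_cgroup memcg = { .last_oom_jiffies = 0 };

	jiffies = 50;	/* 50 ticks after the memcg OOM: suppress */
	printf("%d\n", oom_called_recently(&memcg));	/* prints 1 */
	jiffies = 500;	/* well past the window: escalate */
	printf("%d\n", oom_called_recently(&memcg));	/* prints 0 */
	return 0;
}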

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 include/linux/memcontrol.h |    6 ++++++
 mm/memcontrol.c            |   33 +++++++++++++++++++++++++++++----
 mm/oom_kill.c              |    8 ++++++++
 3 files changed, 43 insertions(+), 4 deletions(-)

Index: mmotm-2.6.28-Nov20/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Nov20.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Nov20/include/linux/memcontrol.h
@@ -95,6 +95,8 @@ static inline bool mem_cgroup_disabled(v
 	return false;
 }
 
+extern bool mem_cgroup_oom_called(struct task_struct *task);
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -227,6 +229,10 @@ static inline bool mem_cgroup_disabled(v
 {
 	return true;
 }
+static inline bool mem_cgroup_oom_called(struct task_struct *task)
+{
+	return false;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
Index: mmotm-2.6.28-Nov20/mm/oom_kill.c
===================================================================
--- mmotm-2.6.28-Nov20.orig/mm/oom_kill.c
+++ mmotm-2.6.28-Nov20/mm/oom_kill.c
@@ -560,6 +560,13 @@ void pagefault_out_of_memory(void)
 		/* Got some memory back in the last second. */
 		return;
 
+	/*
+	 * If this is from memcg, the OOM killer has already been invoked,
+	 * so it is not worth going for a system-wide OOM.
+	 */
+	if (mem_cgroup_oom_called(current))
+		goto rest_and_return;
+
 	if (sysctl_panic_on_oom)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
@@ -571,6 +578,7 @@ void pagefault_out_of_memory(void)
 	 * Give "p" a good chance of killing itself before we
 	 * retry to allocate memory.
 	 */
+rest_and_return:
 	if (!test_thread_flag(TIF_MEMDIE))
 		schedule_timeout_uninterruptible(1);
 }
Index: mmotm-2.6.28-Nov20/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov20.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov20/mm/memcontrol.c
@@ -153,7 +153,7 @@ struct mem_cgroup {
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
-
+	unsigned long	last_oom_jiffies;
 	int		obsolete;
 	atomic_t	refcnt;
 	/*
@@ -618,6 +618,22 @@ static int mem_cgroup_hierarchical_recla
 	return ret;
 }
 
+bool mem_cgroup_oom_called(struct task_struct *task)
+{
+	bool ret = false;
+	struct mem_cgroup *mem;
+	struct mm_struct *mm;
+
+	rcu_read_lock();
+	mm = task->mm;
+	if (!mm)
+		mm = &init_mm;
+	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+	if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
+		ret = true;
+	rcu_read_unlock();
+	return ret;
+}
 /*
  * Unlike exported interface, "oom" parameter is added. if oom==true,
  * oom-killer can be invoked.
@@ -629,6 +645,13 @@ static int __mem_cgroup_try_charge(struc
 	struct mem_cgroup *mem, *mem_over_limit;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct res_counter *fail_res;
+
+	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
+		/* Don't account this! */
+		*memcg = NULL;
+		return 0;
+	}
+
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
@@ -699,8 +722,10 @@ static int __mem_cgroup_try_charge(struc
 			continue;
 
 		if (!nr_retries--) {
-			if (oom)
+			if (oom) {
 				mem_cgroup_out_of_memory(mem, gfp_mask);
+				mem->last_oom_jiffies = jiffies;
+			}
 			goto nomem;
 		}
 	}
@@ -837,7 +862,7 @@ static int mem_cgroup_move_parent(struct
 
 
 	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
-	if (ret)
+	if (ret || !parent)
 		return ret;
 
 	if (!get_page_unless_zero(page))
@@ -888,7 +913,7 @@ static int mem_cgroup_charge_common(stru
 
 	mem = memcg;
 	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
-	if (ret)
+	if (ret || !mem)
 		return ret;
 
 	__mem_cgroup_commit_charge(mem, pc, ctype);


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/2] memcg: fix reclaim result checks.
  2008-11-21  9:58     ` [PATCH 0/2] memcg: fix oom handling KAMEZAWA Hiroyuki
  2008-11-21 10:01       ` [PATCH 1/2] memcg: avoid unnecessary system-wide-oom-killer KAMEZAWA Hiroyuki
@ 2008-11-21 10:03       ` KAMEZAWA Hiroyuki
  2008-11-22  2:16       ` [PATCH 0/2] memcg: fix oom handling Li Zefan
  2 siblings, 0 replies; 21+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-21 10:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Li Zefan, linux-mm, balbir, nishimura, pbadari, jblunck, taka,
	akpm, linux-kernel, npiggin

The check_under_limit logic was wrong: the check should be against
mem_over_limit rather than mem.
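
One way the old check goes wrong (a hypothetical user-space illustration,
not from the patch, assuming a hierarchical setup where the counter that
failed the charge belongs to an ancestor of the charged group): checking
the charged group's own counter can claim "under limit" forever, so the
loop keeps continuing and never reaches the retry/OOM path, while
checking mem_over_limit lets it fall through correctly:

#include <stdbool.h>
#include <stdio.h>

struct res_counter { unsigned long usage, limit; };

static bool res_counter_check_under_limit(struct res_counter *rc)
{
	return rc->usage < rc->limit;
}

int main(void)
{
	/* The child has an effectively unlimited counter... */
	struct res_counter child  = { .usage = 40960, .limit = ~0UL };
	/* ...but its parent (mem_over_limit) is at its limit. */
	struct res_counter parent = { .usage = 40960, .limit = 40960 };

	printf("old check (mem):            %s\n",
	       res_counter_check_under_limit(&child) ?
	       "continue -> loops forever" : "fall through to retry/OOM");
	printf("new check (mem_over_limit): %s\n",
	       res_counter_check_under_limit(&parent) ?
	       "continue" : "fall through to retry/OOM");
	return 0;
}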

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

Index: mmotm-2.6.28-Nov20/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov20.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov20/mm/memcontrol.c
@@ -714,17 +714,17 @@ static int __mem_cgroup_try_charge(struc
 		 * current usage of the cgroup before giving up
 		 *
 		 */
-		if (!do_swap_account &&
-			res_counter_check_under_limit(&mem->res))
-			continue;
-		if (do_swap_account &&
-			res_counter_check_under_limit(&mem->memsw))
-			continue;
+		if (do_swap_account) {
+			if (res_counter_check_under_limit(&mem_over_limit->res) &&
+			    res_counter_check_under_limit(&mem_over_limit->memsw))
+				continue;
+		} else if (res_counter_check_under_limit(&mem_over_limit->res))
+				continue;
 
 		if (!nr_retries--) {
 			if (oom) {
-				mem_cgroup_out_of_memory(mem, gfp_mask);
-				mem->last_oom_jiffies = jiffies;
+				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
+				mem_over_limit->last_oom_jiffies = jiffies;
 			}
 			goto nomem;
 		}


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/2] memcg: fix oom handling
  2008-11-21  9:58     ` [PATCH 0/2] memcg: fix oom handling KAMEZAWA Hiroyuki
  2008-11-21 10:01       ` [PATCH 1/2] memcg: avoid unnecessary system-wide-oom-killer KAMEZAWA Hiroyuki
  2008-11-21 10:03       ` [PATCH 2/2] memcg: fix reclaim result checks KAMEZAWA Hiroyuki
@ 2008-11-22  2:16       ` Li Zefan
  2 siblings, 0 replies; 21+ messages in thread
From: Li Zefan @ 2008-11-22  2:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, pbadari, jblunck, taka, akpm,
	linux-kernel, npiggin

KAMEZAWA Hiroyuki wrote:
> Li Zefan reported:
> 
> (a) This deadlocks:
> ==
>    #echo 0 >  (...)/01/memory.limit_in_bytes   #set memcg's limit to 0,
>    #echo $$ > (...)/01/memory.tasks            #move task
>    # do something...
> ==
> 
> (b) This seems to deadlock:
> ==
>    #echo 40k >  (...)/01/memory.limit_in_bytes   #set memcg's limit to 40k,
>    #echo $$ > (...)/01/memory.tasks            #move task
>    # do something...
> ==
> 
> 
> I think (a) is a BUG; (b) is just a slowdown.
> (You can see the pgpgin/pgpgout counts increasing in case (b).)
> 
> This patch set handles (a). Li-san, could you check?

Yes, it works for me now. :)

> It works well in my environment (meaning the OOM killer is invoked in the proper way).
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2008-11-22  2:16 UTC | newest]

Thread overview: 21+ messages
2008-11-14 10:12 [PATCH 0/9] memcg updates (14/Nov/2008) KAMEZAWA Hiroyuki
2008-11-14 10:14 ` [PATCH 1/9] memcg: memory hotplug fix for notifier callback KAMEZAWA Hiroyuki
2008-11-14 10:15 ` [PATCH 2/9] memcg : reduce size of mem_cgroup by using nr_cpu_ids KAMEZAWA Hiroyuki
2008-11-14 10:16 ` [PATCH 3/9] memcg: new force_empty to free pages under group KAMEZAWA Hiroyuki
2008-11-14 10:17 ` [PATCH 4/9] memcg: handle swap caches KAMEZAWA Hiroyuki
2008-11-14 10:18 ` [PATCH 5/9] memcg : mem+swap controller Kconfig KAMEZAWA Hiroyuki
2008-11-14 10:18 ` [PATCH 6/9] memcg : swap cgroup for remembering usage KAMEZAWA Hiroyuki
2008-11-14 10:19 ` [PATCH 7/9] memcg : mem+swap controller core KAMEZAWA Hiroyuki
2008-11-21  2:40   ` Li Zefan
2008-11-21  2:44     ` KAMEZAWA Hiroyuki
2008-11-21  9:58     ` [PATCH 0/2] memcg: fix oom handling KAMEZAWA Hiroyuki
2008-11-21 10:01       ` [PATCH 1/2] memcg: avoid unnecessary system-wide-oom-killer KAMEZAWA Hiroyuki
2008-11-21 10:03       ` [PATCH 2/2] memcg: fix reclaim result checks KAMEZAWA Hiroyuki
2008-11-22  2:16       ` [PATCH 0/2] memcg: fix oom handling Li Zefan
2008-11-14 10:20 ` [PATCH 8/9] memcg : synchronized LRU KAMEZAWA Hiroyuki
2008-11-14 10:21 ` [PATCH 9/9] memcg : add mem_cgroup_disabled() KAMEZAWA Hiroyuki
2008-11-14 11:33 ` [PATCH 0/9] memcg updates (14/Nov/2008) Balbir Singh
2008-11-15  3:00 ` KAMEZAWA Hiroyuki
2008-11-15  7:25   ` Balbir Singh
2008-11-15  9:16     ` KAMEZAWA Hiroyuki
2008-11-15  9:19       ` Balbir Singh
