* [RFC] [PATCH 0/9]  remove page_cgroup pointer (with some enhancements)
@ 2008-09-11 11:08 KAMEZAWA Hiroyuki
  2008-09-11 11:11 ` [RFC] [PATCH 1/9] memcg:make root no limit KAMEZAWA Hiroyuki
                   ` (9 more replies)
  0 siblings, 10 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-11 11:08 UTC (permalink / raw)
  To: balbir; +Cc: kamezawa.hiroyu, xemul, hugh, linux-mm, linux-kernel, menage

Hi, Balbir.

I wrote the remove-page-cgroup-pointer patch on top of my small patches.
This series includes enhancement patches for the memory resource controller
from my queue.

I think I can (or have to) do more tweaks, but I'm posting this while it's hot.
It has passed some tests.

The remove-page-cgroup-pointer patch is [8/9] and [9/9].
How about this?

Performance comparison is below.
==
rc5-mm1
==
Execl Throughput                           3006.5 lps   (29.8 secs, 3 samples)
C Compiler Throughput                      1006.7 lpm   (60.0 secs, 3 samples)
Shell Scripts (1 concurrent)               4863.7 lpm   (60.0 secs, 3 samples)
Shell Scripts (8 concurrent)                943.7 lpm   (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)               482.7 lpm   (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places         124804.9 lpm   (30.0 secs, 3 samples)

==
After this series
==
Execl Throughput                           3003.3 lps   (29.8 secs, 3 samples)
C Compiler Throughput                      1008.0 lpm   (60.0 secs, 3 samples)
Shell Scripts (1 concurrent)               4580.6 lpm   (60.0 secs, 3 samples)
Shell Scripts (8 concurrent)                913.3 lpm   (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)               569.0 lpm   (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places         124918.7 lpm   (30.0 secs, 3 samples)

Hmm.. no loss? But maybe I should find what I can do to improve this.

Brief patch descriptions are below.

1. patches/nolimit_root.patch
   This patch fixes the 'root' cgroup's limit to unlimited.

2. patches/atomic_flags.patch
   This patch makes page_cgroup->flags an unsigned long and adds atomic ops.

3. patches/account_move.patch
   This patch implements a move_account() function for moving an account
   from one memory resource controller to another.

4. patches/new_force_empty.patch
   This patch makes force_empty() use move_account() rather than just dropping
   accounts. (As a first step, accounts are moved to 'root'. We can change this later.)

5. patches/make_mapping_null.patch
   Clean up. This guarantees page->mapping is NULL before uncharge() against a
   page cache page is called.

6. patches/stat.patch
   Optimize mem_cgroup_charge_statistics().

7. patches/charge-will-success.patch
   Add an 'unlikely' annotation to the charge function.

8. patches/page_cgroup.patch
   Remove the page_cgroup pointer from struct page and add a lookup system
   to find a page_cgroup from a pfn (see the usage sketch after this list).

9. patches/boost_page_cgroup_lookupg.patch
   Use a per-cpu cache for fast access to page_cgroup.
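
After this series, memcg code no longer dereferences page->page_cgroup;
instead it looks the page_cgroup up by pfn. A sketch of the new usage
pattern, taken from the calls added in [8/9] and [9/9]:

	struct page_cgroup *pc;

	pc = lookup_page_cgroup(page_to_pfn(page));
	lock_page_cgroup(pc);
	/* ... check pc->mem_cgroup, update flags, etc. ... */
	unlock_page_cgroup(pc);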

Thanks,
-Kame


* [RFC] [PATCH 1/9] memcg:make root no limit
  2008-09-11 11:08 [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
@ 2008-09-11 11:11 ` KAMEZAWA Hiroyuki
  2008-09-11 11:13 ` [RFC] [PATCH 2/9] memcg: atomic page_cgroup flags KAMEZAWA Hiroyuki
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-11 11:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

Make the root cgroup of the memory resource controller have no limit.

With this, users cannot set a limit on the root group. This makes the root
cgroup a kind of trash-can.

For accounting pages which have no owner, which are created by force_empty,
we need some cgroup with no limit. A patch rewriting force_empty will
follow this one.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 Documentation/controllers/memory.txt |    4 ++++
 mm/memcontrol.c                      |    7 +++++++
 2 files changed, 11 insertions(+)

Index: mmtom-2.6.27-rc5+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc5+/mm/memcontrol.c
@@ -136,6 +136,9 @@ struct mem_cgroup {
 };
 static struct mem_cgroup init_mem_cgroup;
 
+#define is_root_cgroup(cgrp)	((cgrp) == &init_mem_cgroup)
+
+
 /*
  * We use the lower bit of the page->page_cgroup pointer as a bit spin
  * lock.  We need to ensure that page->page_cgroup is at least two
@@ -937,6 +940,10 @@ static int mem_cgroup_write(struct cgrou
 
 	switch (cft->private) {
 	case RES_LIMIT:
+		if (is_root_cgroup(memcg)) {
+			ret = -EINVAL;
+			break;
+		}
 		/* This function does all necessary parse...reuse it */
 		ret = res_counter_memparse_write_strategy(buffer, &val);
 		if (!ret)
Index: mmtom-2.6.27-rc5+/Documentation/controllers/memory.txt
===================================================================
--- mmtom-2.6.27-rc5+.orig/Documentation/controllers/memory.txt
+++ mmtom-2.6.27-rc5+/Documentation/controllers/memory.txt
@@ -121,6 +121,9 @@ The corresponding routines that remove a
 a page from Page Cache is used to decrement the accounting counters of the
 cgroup.
 
+The root cgroup cannot have a limit set, but its usage is accounted.
+For controlling memory usage, you need to create a cgroup.
+
 2.3 Shared Page Accounting
 
 Shared pages are accounted on the basis of the first touch approach. The
@@ -172,6 +175,7 @@ We can alter the memory limit:
 
 NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
 mega or gigabytes.
+Note: the root cgroup cannot have its limit set.
 
 # cat /cgroups/0/memory.limit_in_bytes
 4194304


* [RFC] [PATCH 2/9]  memcg: atomic page_cgroup flags
  2008-09-11 11:08 [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
  2008-09-11 11:11 ` [RFC] [PATCH 1/9] memcg:make root no limit KAMEZAWA Hiroyuki
@ 2008-09-11 11:13 ` KAMEZAWA Hiroyuki
  2008-09-11 11:14 ` [RFC] [PATCH 3/9] memcg: move_account between groups KAMEZAWA Hiroyuki
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-11 11:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

This patch makes page_cgroup->flags be handled with atomic ops and defines
functions (and macros) to access it.

This patch itself makes memcg slower, but its final purpose is
to remove lock_page_cgroup() (in some situations) and allow fast
(and deadlock-free) access to page_cgroup.
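
For reference (this sketch is not part of the patch), an instantiation such
as TESTPCGFLAG(Cache, CACHE) below expands to roughly:

	static inline int PageCgroupCache(struct page_cgroup *pc)
	{
		return test_bit(PCG_CACHE, &pc->flags);
	}

and SETPCGFLAG()/CLEARPCGFLAG() likewise generate SetPageCgroupXXX()/
ClearPageCgroupXXX() helpers using atomic set_bit()/clear_bit(), while the
__-prefixed variants use the non-atomic __set_bit()/__clear_bit().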

Changelog:  (v2) -> (v3)
 - renamed macros and flags to longer names.
 - added comments.
 - added "default bit set" for File, Shmem, Anon.

Changelog:  (preview) -> (v1):
 - patch ordering is changed.
 - Added a macro for defining Test/Set/Clear bit functions.
 - made the names of flags shorter.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/memcontrol.c |  114 ++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 82 insertions(+), 32 deletions(-)

Index: mmtom-2.6.27-rc5+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc5+/mm/memcontrol.c
@@ -161,12 +161,60 @@ struct page_cgroup {
 	struct list_head lru;		/* per cgroup LRU list */
 	struct page *page;
 	struct mem_cgroup *mem_cgroup;
-	int flags;
+	unsigned long flags;
 };
-#define PAGE_CGROUP_FLAG_CACHE	   (0x1)	/* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE    (0x2)	/* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE	   (0x4)	/* page is file system backed */
-#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8)	/* page is unevictableable */
+
+enum {
+	/* flags for mem_cgroup */
+	PCG_CACHE, /* charged as cache */
+	/* flags for LRU placement */
+	PCG_ACTIVE, /* page is active in this cgroup */
+	PCG_FILE, /* page is file system backed */
+	PCG_UNEVICTABLE, /* page is unevictableable */
+};
+
+#define TESTPCGFLAG(uname, lname)			\
+static inline int PageCgroup##uname(struct page_cgroup *pc)	\
+	{ return test_bit(PCG_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname)			\
+static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
+	{ set_bit(PCG_##lname, &pc->flags);  }
+
+#define CLEARPCGFLAG(uname, lname)			\
+static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
+	{ clear_bit(PCG_##lname, &pc->flags);  }
+
+#define __SETPCGFLAG(uname, lname)			\
+static inline void __SetPageCgroup##uname(struct page_cgroup *pc)\
+	{ __set_bit(PCG_##lname, &pc->flags);  }
+
+#define __CLEARPCGFLAG(uname, lname)			\
+static inline void __ClearPageCgroup##uname(struct page_cgroup *pc)	\
+	{ __clear_bit(PCG_##lname, &pc->flags);  }
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+__SETPCGFLAG(Cache, CACHE)
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+__SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+__SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
+
+#define PcgDefaultAnonFlag	((1 << PCG_ACTIVE))
+#define PcgDefaultFileFlag	((1 << PCG_CACHE) | (1 << PCG_FILE))
+#define PcgDefaultShmemFlag	((1 << PCG_CACHE) | (1 << PCG_ACTIVE))
 
 static int page_cgroup_nid(struct page_cgroup *pc)
 {
@@ -187,14 +235,15 @@ enum charge_type {
 /*
  * Always modified under lru lock. Then, not necessary to preempt_disable()
  */
-static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, int flags,
-					bool charge)
+static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
+					 struct page_cgroup *pc,
+					 bool charge)
 {
 	int val = (charge)? 1 : -1;
 	struct mem_cgroup_stat *stat = &mem->stat;
 
 	VM_BUG_ON(!irqs_disabled());
-	if (flags & PAGE_CGROUP_FLAG_CACHE)
+	if (PageCgroupCache(pc))
 		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
 	else
 		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
@@ -295,18 +344,18 @@ static void __mem_cgroup_remove_list(str
 {
 	int lru = LRU_BASE;
 
-	if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+	if (PageCgroupUnevictable(pc))
 		lru = LRU_UNEVICTABLE;
 	else {
-		if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+		if (PageCgroupActive(pc))
 			lru += LRU_ACTIVE;
-		if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+		if (PageCgroupFile(pc))
 			lru += LRU_FILE;
 	}
 
 	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
 
-	mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
+	mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
 	list_del(&pc->lru);
 }
 
@@ -315,27 +364,27 @@ static void __mem_cgroup_add_list(struct
 {
 	int lru = LRU_BASE;
 
-	if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+	if (PageCgroupUnevictable(pc))
 		lru = LRU_UNEVICTABLE;
 	else {
-		if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+		if (PageCgroupActive(pc))
 			lru += LRU_ACTIVE;
-		if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+		if (PageCgroupFile(pc))
 			lru += LRU_FILE;
 	}
 
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
 	list_add(&pc->lru, &mz->lists[lru]);
 
-	mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
+	mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
 }
 
 static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
 {
 	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
-	int active    = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
-	int file      = pc->flags & PAGE_CGROUP_FLAG_FILE;
-	int unevictable = pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE;
+	int active    = PageCgroupActive(pc);
+	int file      = PageCgroupFile(pc);
+	int unevictable = PageCgroupUnevictable(pc);
 	enum lru_list from = unevictable ? LRU_UNEVICTABLE :
 				(LRU_FILE * !!file + !!active);
 
@@ -343,16 +392,20 @@ static void __mem_cgroup_move_lists(stru
 		return;
 
 	MEM_CGROUP_ZSTAT(mz, from) -= 1;
-
+	/*
+	 * While this is done under mz->lru_lock, other flags, which
+	 * are not related to LRU, may be modified outside the lock.
+	 * We have to use atomic set/clear for the flags.
+	 */
 	if (is_unevictable_lru(lru)) {
-		pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
-		pc->flags |= PAGE_CGROUP_FLAG_UNEVICTABLE;
+		ClearPageCgroupActive(pc);
+		SetPageCgroupUnevictable(pc);
 	} else {
 		if (is_active_lru(lru))
-			pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+			SetPageCgroupActive(pc);
 		else
-			pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
-		pc->flags &= ~PAGE_CGROUP_FLAG_UNEVICTABLE;
+			ClearPageCgroupActive(pc);
+		ClearPageCgroupUnevictable(pc);
 	}
 
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
@@ -585,18 +638,20 @@ static int mem_cgroup_charge_common(stru
 
 	pc->mem_cgroup = mem;
 	pc->page = page;
-	/*
-	 * If a page is accounted as a page cache, insert to inactive list.
-	 * If anon, insert to active list.
-	 */
-	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
-		pc->flags = PAGE_CGROUP_FLAG_CACHE;
+
+	switch (ctype) {
+	case MEM_CGROUP_CHARGE_TYPE_CACHE:
 		if (page_is_file_cache(page))
-			pc->flags |= PAGE_CGROUP_FLAG_FILE;
+			pc->flags = PcgDefaultFileFlag;
 		else
-			pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
-	} else
-		pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
+			pc->flags = PcgDefaultShmemFlag;
+		break;
+	case MEM_CGROUP_CHARGE_TYPE_MAPPED:
+		pc->flags = PcgDefaultAnonFlag;
+		break;
+	default:
+		BUG();
+	}
 
 	lock_page_cgroup(page);
 	if (unlikely(page_get_page_cgroup(page))) {
@@ -704,8 +759,7 @@ __mem_cgroup_uncharge_common(struct page
 	VM_BUG_ON(pc->page != page);
 
 	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
-	    && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
-		|| page_mapped(page)))
+	    && ((PageCgroupCache(pc) || page_mapped(page))))
 		goto unlock;
 
 	mz = page_cgroup_zoneinfo(pc);
@@ -755,7 +809,7 @@ int mem_cgroup_prepare_migration(struct 
 	if (pc) {
 		mem = pc->mem_cgroup;
 		css_get(&mem->css);
-		if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
+		if (PageCgroupCache(pc))
 			ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	}
 	unlock_page_cgroup(page);


* [RFC] [PATCH 3/9]  memcg: move_account between groups
  2008-09-11 11:08 [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
  2008-09-11 11:11 ` [RFC] [PATCH 1/9] memcg:make root no limit KAMEZAWA Hiroyuki
  2008-09-11 11:13 ` [RFC] [PATCH 2/9] memcg: atomic page_cgroup flags KAMEZAWA Hiroyuki
@ 2008-09-11 11:14 ` KAMEZAWA Hiroyuki
  2008-09-12  4:36   ` KAMEZAWA Hiroyuki
  2008-09-11 11:16 ` [RFC] [PATCH 4/9] memcg: new force empty KAMEZAWA Hiroyuki
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-11 11:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

This patch provides a function to move account information of a page between
mem_cgroups.

This moving of a page_cgroup is done with:
 - the page locked.
 - the lru_lock of the source/destination mem_cgroup held.

Then, a routine which touches pc->mem_cgroup without lock_page() should
confirm whether pc->mem_cgroup is still valid. Typical code looks like this:

(while the page is not under lock_page())
	mem = pc->mem_cgroup;
	mz = page_cgroup_zoneinfo(pc);
	spin_lock_irqsave(&mz->lru_lock, flags);
	if (pc->mem_cgroup == mem)
		... /* some list handling */
	spin_unlock_irqrestore(&mz->lru_lock, flags);

If you find the page_cgroup on a mem_cgroup's LRU under mz->lru_lock, you don't
have to worry about anything.

Changelog: (v2) -> (v3)
  - added lock_page_cgroup().
  - split out from the new-force-empty patch.
  - added how-to-use text.
  - fixed a race in __mem_cgroup_uncharge_common().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/memcontrol.c |   74 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 71 insertions(+), 3 deletions(-)

Index: mmtom-2.6.27-rc5+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc5+/mm/memcontrol.c
@@ -428,6 +428,7 @@ int task_in_mem_cgroup(struct task_struc
 void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
 {
 	struct page_cgroup *pc;
+	struct mem_cgroup *mem;
 	struct mem_cgroup_per_zone *mz;
 	unsigned long flags;
 
@@ -446,9 +447,14 @@ void mem_cgroup_move_lists(struct page *
 
 	pc = page_get_page_cgroup(page);
 	if (pc) {
+		mem = pc->mem_cgroup;
 		mz = page_cgroup_zoneinfo(pc);
 		spin_lock_irqsave(&mz->lru_lock, flags);
-		__mem_cgroup_move_lists(pc, lru);
+		/*
+		 * check against the race with move_account.
+		 */
+		if (likely(mem == pc->mem_cgroup))
+			__mem_cgroup_move_lists(pc, lru);
 		spin_unlock_irqrestore(&mz->lru_lock, flags);
 	}
 	unlock_page_cgroup(page);
@@ -569,6 +575,67 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
+/**
+ * mem_cgroup_move_account - move account of the page
+ * @page ... the target page of being moved.
+ * @pc   ... page_cgroup of the page.
+ * @from ... mem_cgroup which the page is moved from.
+ * @to   ... mem_cgroup which the page is moved to.
+ *
+ * The caller must confirm following.
+ * 1. lock the page by lock_page().
+ * 2. disable irq.
+ * 3. lru_lock of old mem_cgroup should be held.
+ * 4. pc is guaranteed to be valid and on mem_cgroup's LRU.
+ *
+ * Because we cannot call try_to_free_page() here, the caller must guarantee
+ * that this move of the charge never fails. Currently this is called only
+ * against the root cgroup, which has no resource limit.
+ * Returns 0 on success, 1 on failure.
+ */
+int mem_cgroup_move_account(struct page *page, struct page_cgroup *pc,
+	struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	struct mem_cgroup_per_zone *from_mz, *to_mz;
+	int nid, zid;
+	int ret = 1;
+
+	VM_BUG_ON(!irqs_disabled());
+	VM_BUG_ON(!PageLocked(page));
+
+	nid = page_to_nid(page);
+	zid = page_zonenum(page);
+	from_mz =  mem_cgroup_zoneinfo(from, nid, zid);
+	to_mz =  mem_cgroup_zoneinfo(to, nid, zid);
+
+	if (res_counter_charge(&to->res, PAGE_SIZE)) {
+		/* Now, we assume no_limit...no failure here. */
+		return ret;
+	}
+	if (try_lock_page_cgroup(page))
+		return ret;
+
+	if (page_get_page_cgroup(page) != pc)
+		goto out;
+
+	if (spin_trylock(&to_mz->lru_lock)) {
+		__mem_cgroup_remove_list(from_mz, pc);
+		css_put(&from->css);
+		res_counter_uncharge(&from->res, PAGE_SIZE);
+		pc->mem_cgroup = to;
+		css_get(&to->css);
+		__mem_cgroup_add_list(to_mz, pc);
+		ret = 0;
+		spin_unlock(&to_mz->lru_lock);
+	} else {
+		res_counter_uncharge(&to->res, PAGE_SIZE);
+	}
+out:
+	unlock_page_cgroup(page);
+
+	return ret;
+}
+
 /*
  * Charge the memory controller for page usage.
  * Return
@@ -761,16 +828,24 @@ __mem_cgroup_uncharge_common(struct page
 	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
 	    && ((PageCgroupCache(pc) || page_mapped(page))))
 		goto unlock;
-
+retry:
+	mem = pc->mem_cgroup;
 	mz = page_cgroup_zoneinfo(pc);
 	spin_lock_irqsave(&mz->lru_lock, flags);
+	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED &&
+	    unlikely(mem != pc->mem_cgroup)) {
+		/* MAPPED account can be done without lock_page().
+		   Check race with mem_cgroup_move_account() */
+		spin_unlock_irqrestore(&mz->lru_lock, flags);
+		goto retry;
+	}
 	__mem_cgroup_remove_list(mz, pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 
 	page_assign_page_cgroup(page, NULL);
 	unlock_page_cgroup(page);
 
-	mem = pc->mem_cgroup;
+
 	res_counter_uncharge(&mem->res, PAGE_SIZE);
 	css_put(&mem->css);
 


* [RFC] [PATCH 4/9] memcg: new force empty
  2008-09-11 11:08 [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2008-09-11 11:14 ` [RFC] [PATCH 3/9] memcg: move_account between groups KAMEZAWA Hiroyuki
@ 2008-09-11 11:16 ` KAMEZAWA Hiroyuki
  2008-09-11 11:17 ` [RFC] [PATCH 5/9] memcg: set mapping null before uncharge KAMEZAWA Hiroyuki
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-11 11:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

The current force_empty of the memory resource controller just removes the
page_cgroup. This means the page is no longer accounted at all and creates an
in-use page which has no page_cgroup. (And we have to face a terrible race
condition....)

This patch tries to move the account to the "root" cgroup at force_empty.
With this patch, force_empty doesn't leak an account but moves the account to
the "root" cgroup. Maybe someone can think of other enhancements, such as
moving the account to its parent. Someone will revisit this behavior later.

For now, it just moves accounts to the root cgroup.

Note: all locks other than the old mem_cgroup's lru_lock
      in this path are taken with try_lock().

Changelog (v2) -> (v3)
 - split out mem_cgroup_move_account().
 - replaced get_page() with get_page_unless_zero().
   (This is necessary to avoid conflicts with migration.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 Documentation/controllers/memory.txt |    7 ++--
 mm/memcontrol.c                      |   51 +++++++++++++++++++++--------------
 2 files changed, 35 insertions(+), 23 deletions(-)

Index: mmtom-2.6.27-rc5+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc5+/mm/memcontrol.c
@@ -29,6 +29,7 @@
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
+#include <linux/pagemap.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
 #include <linux/vmalloc.h>
@@ -977,17 +978,14 @@ int mem_cgroup_resize_limit(struct mem_c
 
 
 /*
- * This routine traverse page_cgroup in given list and drop them all.
- * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
+ * This routine traverse page_cgroup in given list and move them all.
  */
-#define FORCE_UNCHARGE_BATCH	(128)
 static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 			    struct mem_cgroup_per_zone *mz,
 			    enum lru_list lru)
 {
 	struct page_cgroup *pc;
 	struct page *page;
-	int count = FORCE_UNCHARGE_BATCH;
 	unsigned long flags;
 	struct list_head *list;
 
@@ -997,23 +995,36 @@ static void mem_cgroup_force_empty_list(
 	while (!list_empty(list)) {
 		pc = list_entry(list->prev, struct page_cgroup, lru);
 		page = pc->page;
-		get_page(page);
-		spin_unlock_irqrestore(&mz->lru_lock, flags);
-		/*
-		 * Check if this page is on LRU. !LRU page can be found
-		 * if it's under page migration.
-		 */
-		if (PageLRU(page)) {
-			__mem_cgroup_uncharge_common(page,
-					MEM_CGROUP_CHARGE_TYPE_FORCE);
+		/* For avoiding race with speculative page cache handling. */
+		if (!PageLRU(page) || !get_page_unless_zero(page)) {
+			list_move(&pc->lru, list);
+			spin_unlock_irqrestore(&mz->lru_lock, flags);
+			yield();
+			spin_lock_irqsave(&mz->lru_lock, flags);
+			continue;
+		}
+		if (!trylock_page(page)) {
+			list_move(&pc->lru, list);
 			put_page(page);
-			if (--count <= 0) {
-				count = FORCE_UNCHARGE_BATCH;
-				cond_resched();
-			}
-		} else
-			cond_resched();
-		spin_lock_irqsave(&mz->lru_lock, flags);
+			spin_unlock_irqrestore(&mz->lru_lock, flags);
+			yield();
+			spin_lock_irqsave(&mz->lru_lock, flags);
+			continue;
+		}
+		if (mem_cgroup_move_account(page, pc, mem, &init_mem_cgroup)) {
+			/* some confliction */
+			list_move(&pc->lru, list);
+			unlock_page(page);
+			put_page(page);
+			spin_unlock_irqrestore(&mz->lru_lock, flags);
+			yield();
+			spin_lock_irqsave(&mz->lru_lock, flags);
+		} else {
+			unlock_page(page);
+			put_page(page);
+		}
+		if (atomic_read(&mem->css.cgroup->count) > 0)
+			break;
 	}
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 }
Index: mmtom-2.6.27-rc5+/Documentation/controllers/memory.txt
===================================================================
--- mmtom-2.6.27-rc5+.orig/Documentation/controllers/memory.txt
+++ mmtom-2.6.27-rc5+/Documentation/controllers/memory.txt
@@ -207,7 +207,8 @@ The memory.force_empty gives an interfac
 
 # echo 1 > memory.force_empty
 
-will drop all charges in cgroup. Currently, this is maintained for test.
+will move all charges to the root cgroup.
+(This policy may be modified in the future.)
 
 4. Testing
 
@@ -238,8 +239,8 @@ reclaimed.
 
 A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
 cgroup might have some charge associated with it, even though all
-tasks have migrated away from it. Such charges are automatically dropped at
-rmdir() if there are no tasks.
+tasks have migrated away from it. Such charges are automatically moved to
+root cgroup at rmdir() if there are no tasks. (This policy may be changed.)
 
 5. TODO
 


* [RFC] [PATCH 5/9] memcg: set mapping null before uncharge
  2008-09-11 11:08 [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2008-09-11 11:16 ` [RFC] [PATCH 4/9] memcg: new force empty KAMEZAWA Hiroyuki
@ 2008-09-11 11:17 ` KAMEZAWA Hiroyuki
  2008-09-11 11:18 ` [RFC] [PATCH 6/9] memcg: optimize stat KAMEZAWA Hiroyuki
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-11 11:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

This patch tries to make page->mapping NULL before
mem_cgroup_uncharge_cache_page() is called.

"page->mapping == NULL" is a good check for whether the page is still in the
radix-tree or not.
This patch also adds a VM_BUG_ON() to mem_cgroup_uncharge_cache_page().


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/filemap.c    |    2 +-
 mm/memcontrol.c |    1 +
 mm/migrate.c    |   12 +++++++++---
 3 files changed, 11 insertions(+), 4 deletions(-)

Index: mmtom-2.6.27-rc5+/mm/filemap.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/filemap.c
+++ mmtom-2.6.27-rc5+/mm/filemap.c
@@ -116,12 +116,12 @@ void __remove_from_page_cache(struct pag
 {
 	struct address_space *mapping = page->mapping;
 
-	mem_cgroup_uncharge_cache_page(page);
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	mapping->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	BUG_ON(page_mapped(page));
+	mem_cgroup_uncharge_cache_page(page);
 
 	/*
 	 * Some filesystems seem to re-dirty the page even after
Index: mmtom-2.6.27-rc5+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc5+/mm/memcontrol.c
@@ -864,6 +864,7 @@ void mem_cgroup_uncharge_page(struct pag
 void mem_cgroup_uncharge_cache_page(struct page *page)
 {
 	VM_BUG_ON(page_mapped(page));
+	VM_BUG_ON(page->mapping);
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
 }
 
Index: mmtom-2.6.27-rc5+/mm/migrate.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/migrate.c
+++ mmtom-2.6.27-rc5+/mm/migrate.c
@@ -330,8 +330,6 @@ static int migrate_page_move_mapping(str
 	__inc_zone_page_state(newpage, NR_FILE_PAGES);
 
 	spin_unlock_irq(&mapping->tree_lock);
-	if (!PageSwapCache(newpage))
-		mem_cgroup_uncharge_cache_page(page);
 
 	return 0;
 }
@@ -378,7 +376,15 @@ static void migrate_page_copy(struct pag
 #endif
 	ClearPagePrivate(page);
 	set_page_private(page, 0);
-	page->mapping = NULL;
+	/* page->mapping contains a flag for PageAnon() */
+	if (PageAnon(page)) {
+		/* This page is uncharged at try_to_unmap(). */
+		page->mapping = NULL;
+	} else {
+		/* Obsolete file cache should be uncharged */
+		page->mapping = NULL;
+		mem_cgroup_uncharge_cache_page(page);
+	}
 
 	/*
 	 * If any waiters have accumulated on the new page then


* [RFC] [PATCH 6/9] memcg: optimize stat
  2008-09-11 11:08 [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
                   ` (4 preceding siblings ...)
  2008-09-11 11:17 ` [RFC] [PATCH 5/9] memcg: set mapping null before uncharge KAMEZAWA Hiroyuki
@ 2008-09-11 11:18 ` KAMEZAWA Hiroyuki
  2008-09-11 11:20 ` [RFC] [PATCH 7/9] memcg: charge likely success KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-11 11:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

I found that mem_cgroup_charge_statistics() is a little big (in object code) and
does unnecessary address calculation.
This patch is an optimization to reduce the size of this function.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/memcontrol.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

Index: mmtom-2.6.27-rc5+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc5+/mm/memcontrol.c
@@ -67,11 +67,10 @@ struct mem_cgroup_stat {
 /*
  * For accounting under irq disable, no need for increment preempt count.
  */
-static void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat *stat,
+static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
 		enum mem_cgroup_stat_index idx, int val)
 {
-	int cpu = smp_processor_id();
-	stat->cpustat[cpu].count[idx] += val;
+	stat->count[idx] += val;
 }
 
 static s64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
@@ -242,18 +241,21 @@ static void mem_cgroup_charge_statistics
 {
 	int val = (charge)? 1 : -1;
 	struct mem_cgroup_stat *stat = &mem->stat;
+	struct mem_cgroup_stat_cpu *cpustat;
 
 	VM_BUG_ON(!irqs_disabled());
+
+	cpustat = &stat->cpustat[smp_processor_id()];
 	if (PageCgroupCache(pc))
-		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
+		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
 	else
-		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
+		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_RSS, val);
 
 	if (charge)
-		__mem_cgroup_stat_add_safe(stat,
+		__mem_cgroup_stat_add_safe(cpustat,
 				MEM_CGROUP_STAT_PGPGIN_COUNT, 1);
 	else
-		__mem_cgroup_stat_add_safe(stat,
+		__mem_cgroup_stat_add_safe(cpustat,
 				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
 }
 


* [RFC] [PATCH 7/9] memcg: charge likely success
  2008-09-11 11:08 [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
                   ` (5 preceding siblings ...)
  2008-09-11 11:18 ` [RFC] [PATCH 6/9] memcg: optimize stat KAMEZAWA Hiroyuki
@ 2008-09-11 11:20 ` KAMEZAWA Hiroyuki
  2008-09-11 11:22 ` [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-11 11:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

In the fast path, res_counter_charge() will succeed.
This 'unlikely' annotation works very well to make the footprint smaller.
(And you can see the benefit of this in some benchmarks.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Index: mmtom-2.6.27-rc5+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc5+/mm/memcontrol.c
@@ -683,7 +683,7 @@ static int mem_cgroup_charge_common(stru
 		css_get(&memcg->css);
 	}
 
-	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
+	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
 		if (!(gfp_mask & __GFP_WAIT))
 			goto out;
 


* [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap
  2008-09-11 11:08 [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
                   ` (6 preceding siblings ...)
  2008-09-11 11:20 ` [RFC] [PATCH 7/9] memcg: charge likely success KAMEZAWA Hiroyuki
@ 2008-09-11 11:22 ` KAMEZAWA Hiroyuki
  2008-09-11 14:00   ` Nick Piggin
                     ` (3 more replies)
  2008-09-11 11:24 ` [RFC] [PATCH 9/9] memcg: percpu page cgroup lookup cache KAMEZAWA Hiroyuki
  2008-09-12  9:35 ` [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
  9 siblings, 4 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-11 11:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

Remove page_cgroup pointer from struct page.

This patch removes the page_cgroup pointer from struct page and makes it
possible to look up a page_cgroup from a pfn. The relationship then is:

Before this:
  pfn <-> struct page <-> struct page_cgroup.
After this:
  struct page <-> pfn -> struct page_cgroup -> struct page.

The benefit of this approach is that we can remove 8 (4 on 32-bit) bytes from
struct page.

Other changes are:
  - lock/unlock_page_cgroup() use their own bit in struct page_cgroup.
  - all necessary page_cgroups are allocated at boot.

Characteristics:
  - page_cgroups are allocated in chunks (see the worked example after this
    list). This patch uses SECTION_SIZE as the chunk size if 64bit/SPARSEMEM
    is enabled. If not, an appropriate default is selected.
  - all page_cgroup chunks are maintained in a hash.
    I think we have 2 ways to handle a sparse index in general
    ...radix-tree and hash. This uses a hash because a radix-tree's layout is
    affected by the memory map's layout.
  - page_cgroup.h/page_cgroup.c are added.
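
As a worked example of the chunk sizing (illustrative only; this assumes
x86-64 SPARSEMEM with SECTION_SIZE_BITS == 27 and 4KB pages, i.e.
PAGE_SHIFT == 12):

	#define ENTS_PER_CHUNK_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)	/* 27 - 12 = 15 */
	#define ENTS_PER_CHUNK		(1 << ENTS_PER_CHUNK_SHIFT)		/* 32768 page_cgroups */

so one chunk covers 32768 * 4KB = 128MB of physical memory, and a pfn finds
its chunk via index = pfn >> ENTS_PER_CHUNK_SHIFT, which is the key used for
the hash lookup below.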

TODO:
  - memory hotplug support. (not difficult)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 include/linux/mm_types.h    |    4 
 include/linux/page_cgroup.h |   87 ++++++++++++++++++
 mm/Makefile                 |    2 
 mm/memcontrol.c             |  207 +++++++++-----------------------------------
 mm/page_alloc.c             |    9 -
 mm/page_cgroup.c            |  178 +++++++++++++++++++++++++++++++++++++
 6 files changed, 314 insertions(+), 173 deletions(-)

Index: mmtom-2.6.27-rc5+/mm/page_cgroup.c
===================================================================
--- /dev/null
+++ mmtom-2.6.27-rc5+/mm/page_cgroup.c
@@ -0,0 +1,178 @@
+#include <linux/mm.h>
+#include <linux/rcupdate.h>
+#include <linux/rculist.h>
+#include <linux/bootmem.h>
+#include <linux/bit_spinlock.h>
+#include <linux/page_cgroup.h>
+#include <linux/hash.h>
+
+void lock_page_cgroup(struct page_cgroup *pc)
+{
+	bit_spin_lock(PCG_LOCK, &pc->flags);
+}
+
+int trylock_page_cgroup(struct page_cgroup *pc)
+{
+	return bit_spin_trylock(PCG_LOCK, &pc->flags);
+}
+
+void unlock_page_cgroup(struct page_cgroup *pc)
+{
+	bit_spin_unlock(PCG_LOCK, &pc->flags);
+}
+
+
+struct pcg_hash_head {
+	spinlock_t		lock;
+	struct hlist_head	head;
+};
+
+static struct pcg_hash_head	*pcg_hashtable __read_mostly;
+
+struct pcg_hash {
+	struct hlist_node	node;
+	unsigned long		index;
+	struct page_cgroup	*map;
+};
+
+#if BITS_PER_LONG == 32 /* we use kmalloc() */
+#define ENTS_PER_CHUNK_SHIFT	(7)
+const bool chunk_vmalloc = false;
+#else /* we'll use vmalloc */
+#ifdef SECTION_SIZE_BITS
+#define ENTS_PER_CHUNK_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
+#else
+#define ENTS_PER_CHUNK_SHIFT	(14) /* covers 128MB on x86-64 */
+#endif
+const bool chunk_vmalloc = true;
+#endif
+
+#define ENTS_PER_CHUNK		(1 << (ENTS_PER_CHUNK_SHIFT))
+#define ENTS_PER_CHUNK_MASK	(ENTS_PER_CHUNK - 1)
+
+static int pcg_hashshift __read_mostly;
+static int pcg_hashmask  __read_mostly;
+
+#define PCG_HASHSHIFT		(pcg_hashshift)
+#define PCG_HASHMASK		(pcg_hashmask)
+#define PCG_HASHSIZE		(1 << pcg_hashshift)
+
+int pcg_hashfun(unsigned long index)
+{
+	return hash_long(index, pcg_hashshift);
+}
+
+struct page_cgroup *lookup_page_cgroup(unsigned long pfn)
+{
+	unsigned long index = pfn >> ENTS_PER_CHUNK_SHIFT;
+	struct pcg_hash *ent;
+	struct pcg_hash_head *head;
+	struct hlist_node *node;
+	struct page_cgroup *pc = NULL;
+	int hnum;
+
+	hnum = pcg_hashfun(index);
+	head = pcg_hashtable + hnum;
+	rcu_read_lock();
+	hlist_for_each_entry(ent, node, &head->head, node) {
+		if (ent->index == index) {
+			pc = ent->map + pfn;
+			break;
+		}
+	}
+	rcu_read_unlock();
+	return pc;
+}
+
+static void __meminit alloc_page_cgroup(int node, unsigned long index)
+{
+	struct pcg_hash *ent;
+	struct pcg_hash_head *head;
+	struct page_cgroup *pc;
+	unsigned long flags, base;
+	int hnum, i;
+	int mapsize = sizeof(struct page_cgroup) * ENTS_PER_CHUNK;
+
+	if (lookup_page_cgroup(index << ENTS_PER_CHUNK_SHIFT))
+		return;
+
+	if (!chunk_vmalloc) {
+		int ent_size = sizeof(*ent) + mapsize;
+		ent = kmalloc_node(ent_size, GFP_KERNEL, node);
+		pc = (void *)(ent + 1);
+	} else {
+		ent = kmalloc_node(sizeof(*ent), GFP_KERNEL, node);
+		pc =  vmalloc_node(mapsize, node);
+	}
+	ent->map = pc - (index << ENTS_PER_CHUNK_SHIFT);
+	ent->index = index;
+	INIT_HLIST_NODE(&ent->node);
+
+	for (base = index << ENTS_PER_CHUNK_SHIFT, i = 0;
+		i < ENTS_PER_CHUNK; i++) {
+		pc = ent->map + base + i;
+		pc->page = pfn_to_page(base + i);
+		pc->mem_cgroup = NULL;
+		pc->flags = 0;
+	}
+
+	hnum = pcg_hashfun(index);
+	head = &pcg_hashtable[hnum];
+	spin_lock_irqsave(&head->lock, flags);
+	hlist_add_head_rcu(&ent->node, &head->head);
+	spin_unlock_irqrestore(&head->lock, flags);
+	return;
+}
+
+
+/* Called from mem_cgroup's initialization */
+void __init page_cgroup_init(void)
+{
+	struct pcg_hash_head *head;
+	int node, i;
+	unsigned long start, pfn, end, index, offset;
+	long default_pcg_hash_size;
+
+	/* we don't need too large hash */
+	default_pcg_hash_size = (max_pfn/ENTS_PER_CHUNK);
+	default_pcg_hash_size *= 2;
+	/* if too big, use automatic calculation */
+	if (default_pcg_hash_size > 1024 * 1024)
+		default_pcg_hash_size = 0;
+
+	pcg_hashtable = alloc_large_system_hash("PageCgroup Hash",
+				sizeof(struct pcg_hash_head),
+				default_pcg_hash_size,
+				13,
+				0,
+				&pcg_hashshift,
+				&pcg_hashmask,
+				0);
+
+	for (i = 0; i < PCG_HASHSIZE; i++) {
+		head = &pcg_hashtable[i];
+		spin_lock_init(&head->lock);
+		INIT_HLIST_HEAD(&head->head);
+	}
+
+	for_each_node(node) {
+		start = NODE_DATA(node)->node_start_pfn;
+		end = start + NODE_DATA(node)->node_spanned_pages;
+		start >>= ENTS_PER_CHUNK_SHIFT;
+		end = (end  + ENTS_PER_CHUNK - 1) >> ENTS_PER_CHUNK_SHIFT;
+		for (index = start; index < end; index++) {
+			pfn = index << ENTS_PER_CHUNK_SHIFT;
+			/*
+			 * In usual, this loop breaks at offset=0.
+			 * Handle a case a hole in MAX_ORDER (ia64 only...)
+			 */
+			for (offset = 0; offset < ENTS_PER_CHUNK; offset++) {
+				if (pfn_valid(pfn + offset)) {
+					alloc_page_cgroup(node, index);
+					break;
+				}
+			}
+		}
+	}
+	return;
+}
Index: mmtom-2.6.27-rc5+/include/linux/mm_types.h
===================================================================
--- mmtom-2.6.27-rc5+.orig/include/linux/mm_types.h
+++ mmtom-2.6.27-rc5+/include/linux/mm_types.h
@@ -92,10 +92,6 @@ struct page {
 	void *virtual;			/* Kernel virtual address (NULL if
 					   not kmapped, ie. highmem) */
 #endif /* WANT_PAGE_VIRTUAL */
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-	unsigned long page_cgroup;
-#endif
-
 #ifdef CONFIG_KMEMCHECK
 	void *shadow;
 #endif
Index: mmtom-2.6.27-rc5+/mm/Makefile
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/Makefile
+++ mmtom-2.6.27-rc5+/mm/Makefile
@@ -34,6 +34,6 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
 obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
 obj-$(CONFIG_KMEMTRACE) += kmemtrace.o
Index: mmtom-2.6.27-rc5+/include/linux/page_cgroup.h
===================================================================
--- /dev/null
+++ mmtom-2.6.27-rc5+/include/linux/page_cgroup.h
@@ -0,0 +1,87 @@
+#ifndef __LINUX_PAGE_CGROUP_H
+#define __LINUX_PAGE_CGROUP_H
+
+/*
+ * Page Cgroup can be considered as an extended mem_map.
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup
+ * All page cgroups are allocated at boot or memory hotplug event,
+ * then the page cgroup for pfn always exists.
+ */
+struct page_cgroup {
+	unsigned long flags;
+	struct mem_cgroup *mem_cgroup;
+	struct page *page;
+	struct list_head lru;		/* per cgroup LRU list */
+};
+
+void __init page_cgroup_init(void);
+struct page_cgroup *lookup_page_cgroup(unsigned long pfn);
+void lock_page_cgroup(struct page_cgroup *pc);
+int trylock_page_cgroup(struct page_cgroup *pc);
+void unlock_page_cgroup(struct page_cgroup *pc);
+
+
+enum {
+	/* flags for mem_cgroup */
+	PCG_LOCK,  /* page cgroup is locked */
+	PCG_CACHE, /* charged as cache */
+	/* flags for LRU placement */
+	PCG_ACTIVE, /* page is active in this cgroup */
+	PCG_FILE, /* page is file system backed */
+	PCG_UNEVICTABLE, /* page is unevictableable */
+};
+
+#define TESTPCGFLAG(uname, lname)			\
+static inline int PageCgroup##uname(struct page_cgroup *pc)	\
+	{ return test_bit(PCG_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname)			\
+static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
+	{ set_bit(PCG_##lname, &pc->flags);  }
+
+#define CLEARPCGFLAG(uname, lname)			\
+static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
+	{ clear_bit(PCG_##lname, &pc->flags);  }
+
+#define __SETPCGFLAG(uname, lname)			\
+static inline void __SetPageCgroup##uname(struct page_cgroup *pc)\
+	{ __set_bit(PCG_##lname, &pc->flags);  }
+
+#define __CLEARPCGFLAG(uname, lname)			\
+static inline void __ClearPageCgroup##uname(struct page_cgroup *pc)	\
+	{ __clear_bit(PCG_##lname, &pc->flags);  }
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+__SETPCGFLAG(Cache, CACHE)
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+__SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+__SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
+
+#define PcgDefaultAnonFlag	((1 << PCG_ACTIVE))
+#define PcgDefaultFileFlag	((1 << PCG_CACHE) | (1 << PCG_FILE))
+#define PcgDefaultShmemFlag	((1 << PCG_CACHE) | (1 << PCG_ACTIVE))
+
+static inline int page_cgroup_nid(struct page_cgroup *pc)
+{
+	return page_to_nid(pc->page);
+}
+
+static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
+{
+	return page_zonenum(pc->page);
+}
+#endif
Index: mmtom-2.6.27-rc5+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc5+/mm/memcontrol.c
@@ -34,11 +34,11 @@
 #include <linux/seq_file.h>
 #include <linux/vmalloc.h>
 #include <linux/mm_inline.h>
+#include <linux/page_cgroup.h>
 
 #include <asm/uaccess.h>
 
 struct cgroup_subsys mem_cgroup_subsys __read_mostly;
-static struct kmem_cache *page_cgroup_cache __read_mostly;
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
 /*
@@ -153,78 +153,6 @@ static struct mem_cgroup init_mem_cgroup
 #define PAGE_CGROUP_LOCK	0x0
 #endif
 
-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
-	struct list_head lru;		/* per cgroup LRU list */
-	struct page *page;
-	struct mem_cgroup *mem_cgroup;
-	unsigned long flags;
-};
-
-enum {
-	/* flags for mem_cgroup */
-	PCG_CACHE, /* charged as cache */
-	/* flags for LRU placement */
-	PCG_ACTIVE, /* page is active in this cgroup */
-	PCG_FILE, /* page is file system backed */
-	PCG_UNEVICTABLE, /* page is unevictableable */
-};
-
-#define TESTPCGFLAG(uname, lname)			\
-static inline int PageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_bit(PCG_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname)			\
-static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
-	{ set_bit(PCG_##lname, &pc->flags);  }
-
-#define CLEARPCGFLAG(uname, lname)			\
-static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ clear_bit(PCG_##lname, &pc->flags);  }
-
-#define __SETPCGFLAG(uname, lname)			\
-static inline void __SetPageCgroup##uname(struct page_cgroup *pc)\
-	{ __set_bit(PCG_##lname, &pc->flags);  }
-
-#define __CLEARPCGFLAG(uname, lname)			\
-static inline void __ClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ __clear_bit(PCG_##lname, &pc->flags);  }
-
-/* Cache flag is set only once (at allocation) */
-TESTPCGFLAG(Cache, CACHE)
-__SETPCGFLAG(Cache, CACHE)
-
-/* LRU management flags (from global-lru definition) */
-TESTPCGFLAG(File, FILE)
-SETPCGFLAG(File, FILE)
-__SETPCGFLAG(File, FILE)
-CLEARPCGFLAG(File, FILE)
-
-TESTPCGFLAG(Active, ACTIVE)
-SETPCGFLAG(Active, ACTIVE)
-__SETPCGFLAG(Active, ACTIVE)
-CLEARPCGFLAG(Active, ACTIVE)
-
-TESTPCGFLAG(Unevictable, UNEVICTABLE)
-SETPCGFLAG(Unevictable, UNEVICTABLE)
-CLEARPCGFLAG(Unevictable, UNEVICTABLE)
-
-#define PcgDefaultAnonFlag	((1 << PCG_ACTIVE))
-#define PcgDefaultFileFlag	((1 << PCG_CACHE) | (1 << PCG_FILE))
-#define PcgDefaultShmemFlag	((1 << PCG_CACHE) | (1 << PCG_ACTIVE))
-
-static int page_cgroup_nid(struct page_cgroup *pc)
-{
-	return page_to_nid(pc->page);
-}
-
-static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
-{
-	return page_zonenum(pc->page);
-}
 
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -311,37 +239,6 @@ struct mem_cgroup *mem_cgroup_from_task(
 				struct mem_cgroup, css);
 }
 
-static inline int page_cgroup_locked(struct page *page)
-{
-	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
-{
-	VM_BUG_ON(!page_cgroup_locked(page));
-	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
-}
-
-struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
-	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void lock_page_cgroup(struct page *page)
-{
-	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
-	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
-	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
 static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
 			struct page_cgroup *pc)
 {
@@ -445,11 +342,12 @@ void mem_cgroup_move_lists(struct page *
 	 * safely get to page_cgroup without it, so just try_lock it:
 	 * mem_cgroup_isolate_pages allows for page left on wrong list.
 	 */
-	if (!try_lock_page_cgroup(page))
+	pc = lookup_page_cgroup(page_to_pfn(page));
+
+	if (!trylock_page_cgroup(pc))
 		return;
 
-	pc = page_get_page_cgroup(page);
-	if (pc) {
+	if (pc->mem_cgroup) {
 		mem = pc->mem_cgroup;
 		mz = page_cgroup_zoneinfo(pc);
 		spin_lock_irqsave(&mz->lru_lock, flags);
@@ -460,7 +358,7 @@ void mem_cgroup_move_lists(struct page *
 			__mem_cgroup_move_lists(pc, lru);
 		spin_unlock_irqrestore(&mz->lru_lock, flags);
 	}
-	unlock_page_cgroup(page);
+	unlock_page_cgroup(pc);
 }
 
 /*
@@ -615,10 +513,10 @@ int mem_cgroup_move_account(struct page 
 		/* Now, we assume no_limit...no failure here. */
 		return ret;
 	}
-	if (try_lock_page_cgroup(page))
+	if (trylock_page_cgroup(pc))
 		return ret;
 
-	if (page_get_page_cgroup(page) != pc)
+	if (!pc->mem_cgroup)
 		goto out;
 
 	if (spin_trylock(&to_mz->lru_lock)) {
@@ -634,7 +532,7 @@ int mem_cgroup_move_account(struct page 
 		res_counter_uncharge(&to->res, PAGE_SIZE);
 	}
 out:
-	unlock_page_cgroup(page);
+	unlock_page_cgroup(pc);
 
 	return ret;
 }
@@ -655,10 +553,6 @@ static int mem_cgroup_charge_common(stru
 	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup_per_zone *mz;
 
-	pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
-	if (unlikely(pc == NULL))
-		goto err;
-
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
@@ -670,7 +564,6 @@ static int mem_cgroup_charge_common(stru
 		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
 		if (unlikely(!mem)) {
 			rcu_read_unlock();
-			kmem_cache_free(page_cgroup_cache, pc);
 			return 0;
 		}
 		/*
@@ -705,9 +598,17 @@ static int mem_cgroup_charge_common(stru
 			goto out;
 		}
 	}
-
+	mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zonenum(page));
+	prefetch(mz);
+	pc = lookup_page_cgroup(page_to_pfn(page));
+	lock_page_cgroup(pc);
+	if (unlikely(pc->mem_cgroup)) {
+		unlock_page_cgroup(pc);
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		css_put(&mem->css);
+		goto done;
+	}
 	pc->mem_cgroup = mem;
-	pc->page = page;
 
 	switch (ctype) {
 	case MEM_CGROUP_CHARGE_TYPE_CACHE:
@@ -723,28 +624,15 @@ static int mem_cgroup_charge_common(stru
 		BUG();
 	}
 
-	lock_page_cgroup(page);
-	if (unlikely(page_get_page_cgroup(page))) {
-		unlock_page_cgroup(page);
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
-		css_put(&mem->css);
-		kmem_cache_free(page_cgroup_cache, pc);
-		goto done;
-	}
-	page_assign_page_cgroup(page, pc);
-
-	mz = page_cgroup_zoneinfo(pc);
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	__mem_cgroup_add_list(mz, pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 
-	unlock_page_cgroup(page);
+	unlock_page_cgroup(pc);
 done:
 	return 0;
 out:
 	css_put(&mem->css);
-	kmem_cache_free(page_cgroup_cache, pc);
-err:
 	return -ENOMEM;
 }
 
@@ -786,15 +674,16 @@ int mem_cgroup_cache_charge(struct page 
 	if (!(gfp_mask & __GFP_WAIT)) {
 		struct page_cgroup *pc;
 
-		lock_page_cgroup(page);
-		pc = page_get_page_cgroup(page);
-		if (pc) {
+
+		pc = lookup_page_cgroup(page_to_pfn(page));
+		lock_page_cgroup(pc);
+		if (pc->mem_cgroup) {
 			VM_BUG_ON(pc->page != page);
 			VM_BUG_ON(!pc->mem_cgroup);
-			unlock_page_cgroup(page);
+			unlock_page_cgroup(pc);
 			return 0;
 		}
-		unlock_page_cgroup(page);
+		unlock_page_cgroup(pc);
 	}
 
 	if (unlikely(!mm))
@@ -814,26 +703,26 @@ __mem_cgroup_uncharge_common(struct page
 	struct mem_cgroup *mem;
 	struct mem_cgroup_per_zone *mz;
 	unsigned long flags;
+	int nid, zid;
 
 	if (mem_cgroup_subsys.disabled)
 		return;
+	/* check the condition we can know from page */
+	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) && page_mapped(page))
+		return;
 
-	/*
-	 * Check if our page_cgroup is valid
-	 */
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	if (unlikely(!pc))
-		goto unlock;
-
-	VM_BUG_ON(pc->page != page);
+	nid = page_to_nid(page);
+	zid = page_zonenum(page);
+	pc = lookup_page_cgroup(page_to_pfn(page));
 
-	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
-	    && ((PageCgroupCache(pc) || page_mapped(page))))
+	lock_page_cgroup(pc);
+	if (unlikely(!pc->mem_cgroup))
+		goto unlock;
+	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) && PageCgroupCache(pc))
 		goto unlock;
 retry:
 	mem = pc->mem_cgroup;
-	mz = page_cgroup_zoneinfo(pc);
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED &&
 	    unlikely(mem != pc->mem_cgroup)) {
@@ -844,18 +733,15 @@ retry:
 	}
 	__mem_cgroup_remove_list(mz, pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
-
-	page_assign_page_cgroup(page, NULL);
-	unlock_page_cgroup(page);
-
+	pc->mem_cgroup = NULL;
+	unlock_page_cgroup(pc);
 
 	res_counter_uncharge(&mem->res, PAGE_SIZE);
 	css_put(&mem->css);
 
-	kmem_cache_free(page_cgroup_cache, pc);
 	return;
 unlock:
-	unlock_page_cgroup(page);
+	unlock_page_cgroup(pc);
 }
 
 void mem_cgroup_uncharge_page(struct page *page)
@@ -883,15 +769,16 @@ int mem_cgroup_prepare_migration(struct 
 	if (mem_cgroup_subsys.disabled)
 		return 0;
 
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	if (pc) {
+
+	pc = lookup_page_cgroup(page_to_pfn(page));
+	lock_page_cgroup(pc);
+	if (pc->mem_cgroup) {
 		mem = pc->mem_cgroup;
 		css_get(&mem->css);
 		if (PageCgroupCache(pc))
 			ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	}
-	unlock_page_cgroup(page);
+	unlock_page_cgroup(pc);
 	if (mem) {
 		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
 			ctype, mem);
@@ -1272,8 +1159,8 @@ mem_cgroup_create(struct cgroup_subsys *
 	int node;
 
 	if (unlikely((cont->parent) == NULL)) {
+		page_cgroup_init();
 		mem = &init_mem_cgroup;
-		page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
 	} else {
 		mem = mem_cgroup_alloc();
 		if (!mem)
Index: mmtom-2.6.27-rc5+/mm/page_alloc.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/page_alloc.c
+++ mmtom-2.6.27-rc5+/mm/page_alloc.c
@@ -223,17 +223,12 @@ static inline int bad_range(struct zone 
 
 static void bad_page(struct page *page)
 {
-	void *pc = page_get_page_cgroup(page);
-
 	printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
 		"page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
 		current->comm, page, (int)(2*sizeof(unsigned long)),
 		(unsigned long)page->flags, page->mapping,
 		page_mapcount(page), page_count(page));
-	if (pc) {
-		printk(KERN_EMERG "cgroup:%p\n", pc);
-		page_reset_bad_cgroup(page);
-	}
+
 	printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
 		KERN_EMERG "Backtrace:\n");
 	dump_stack();
@@ -472,7 +467,6 @@ static inline void free_pages_check(stru
 	free_page_mlock(page);
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_get_page_cgroup(page) != NULL) |
 		(page_count(page) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_FREE)))
 		bad_page(page);
@@ -609,7 +603,6 @@ static void prep_new_page(struct page *p
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_get_page_cgroup(page) != NULL) |
 		(page_count(page) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_PREP)))
 		bad_page(page);


* [RFC] [PATCH 9/9] memcg: percpu page cgroup lookup cache
  2008-09-11 11:08 [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
                   ` (7 preceding siblings ...)
  2008-09-11 11:22 ` [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap KAMEZAWA Hiroyuki
@ 2008-09-11 11:24 ` KAMEZAWA Hiroyuki
  2008-09-11 11:31   ` Nick Piggin
  2008-09-11 12:49   ` kamezawa.hiroyu
  2008-09-12  9:35 ` [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
  9 siblings, 2 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-11 11:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

Use a per-cpu cache for fast access to page_cgroup.
This patch makes the fast path faster.

Because page_cgroup is accessed when a page is allocated/freed,
we can assume that several contiguous page_cgroups will be accessed soon.
(Unless allocation is interleaved across NUMA nodes... but in that case,
alloc/free itself is already slow.)

We cache a set of page_cgroup base pointers in a per-cpu area and
use them on a hit.
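
As a worked example (assuming 4KB pages and a chunk covering 128MB of
memory, i.e. ENTS_PER_CHUNK_SHIFT == 15; the real value comes from 8/9,
so treat the numbers as illustrative):

	pfn   = 0x48123
	index = pfn >> 15  = 0x9	/* which 128MB chunk the pfn is in */
	slot  = index & 31 = 9		/* per-cpu cache slot for that chunk */

If slots[9].index is already 0x9 we return slots[9].base + pfn directly;
otherwise we fall back to the hash lookup and refill slot 9.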

TODO:
 - memory/cpu hotplug support.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/page_cgroup.c |   47 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

Index: mmtom-2.6.27-rc5+/mm/page_cgroup.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/page_cgroup.c
+++ mmtom-2.6.27-rc5+/mm/page_cgroup.c
@@ -57,14 +57,26 @@ static int pcg_hashmask  __read_mostly;
 #define PCG_HASHMASK		(pcg_hashmask)
 #define PCG_HASHSIZE		(1 << pcg_hashshift)
 
+#define PCG_CACHE_MAX_SLOT	(32)
+#define PCG_CACHE_MASK		(PCG_CACHE_MAX_SLOT - 1)
+struct percpu_page_cgroup_cache {
+	struct {
+		unsigned long	index;
+		struct page_cgroup *base;
+	} slots[PCG_CACHE_MAX_SLOT];
+};
+DEFINE_PER_CPU(struct percpu_page_cgroup_cache, pcg_cache);
+
 int pcg_hashfun(unsigned long index)
 {
 	return hash_long(index, pcg_hashshift);
 }
 
-struct page_cgroup *lookup_page_cgroup(unsigned long pfn)
+noinline static struct page_cgroup *
+__lookup_page_cgroup(struct percpu_page_cgroup_cache *pcc,unsigned long pfn)
 {
 	unsigned long index = pfn >> ENTS_PER_CHUNK_SHIFT;
+	int s = index & PCG_CACHE_MASK;
 	struct pcg_hash *ent;
 	struct pcg_hash_head *head;
 	struct hlist_node *node;
@@ -77,6 +89,8 @@ struct page_cgroup *lookup_page_cgroup(u
 	hlist_for_each_entry(ent, node, &head->head, node) {
 		if (ent->index == index) {
 			pc = ent->map + pfn;
+			pcc->slots[s].index = ent->index;
+			pcc->slots[s].base = ent->map;
 			break;
 		}
 	}
@@ -84,6 +98,22 @@ struct page_cgroup *lookup_page_cgroup(u
 	return pc;
 }
 
+struct page_cgroup *lookup_page_cgroup(unsigned long pfn)
+{
+	unsigned long index = pfn >> ENTS_PER_CHUNK_SHIFT;
+	int hnum = (pfn >> ENTS_PER_CHUNK_SHIFT) & PCG_CACHE_MASK;
+	struct percpu_page_cgroup_cache *pcc;
+	struct page_cgroup *ret;
+
+	pcc = &get_cpu_var(pcg_cache);
+	if (likely(pcc->slots[hnum].index == index))
+		ret = pcc->slots[hnum].base + pfn;
+	else
+		ret = __lookup_page_cgroup(pcc, pfn);
+	put_cpu_var(pcg_cache);
+	return ret;
+}
+
 static void __meminit alloc_page_cgroup(int node, unsigned long index)
 {
 	struct pcg_hash *ent;
@@ -124,12 +154,23 @@ static void __meminit alloc_page_cgroup(
 	return;
 }
 
+void clear_page_cgroup_cache_pcg(int cpu)
+{
+	struct percpu_page_cgroup_cache *pcc;
+	int i;
+
+	pcc = &per_cpu(pcg_cache, cpu);
+	for (i = 0; i <  PCG_CACHE_MAX_SLOT; i++) {
+		pcc->slots[i].index = -1;
+		pcc->slots[i].base = NULL;
+	}
+}
 
 /* Called From mem_cgroup's initilization */
 void __init page_cgroup_init(void)
 {
 	struct pcg_hash_head *head;
-	int node, i;
+	int node, cpu, i;
 	unsigned long start, pfn, end, index, offset;
 	long default_pcg_hash_size;
 
@@ -174,5 +215,7 @@ void __init page_cgroup_init(void)
 			}
 		}
 	}
+	for_each_possible_cpu(cpu)
+		clear_page_cgroup_cache_pcg(cpu);
 	return;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] [PATCH 9/9] memcg: percpu page cgroup lookup cache
  2008-09-11 11:24 ` [RFC] [PATCH 9/9] memcg: percpu page cgroup lookup cache KAMEZAWA Hiroyuki
@ 2008-09-11 11:31   ` Nick Piggin
  2008-09-11 12:49   ` kamezawa.hiroyu
  1 sibling, 0 replies; 27+ messages in thread
From: Nick Piggin @ 2008-09-11 11:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

On Thursday 11 September 2008 21:24, KAMEZAWA Hiroyuki wrote:
> Use per-cpu cache for fast access to page_cgroup.
> This patch is for making fastpath faster.
>
> Because page_cgroup is accessed when the page is allocated/freed,
> we can assume several of continuous page_cgroup will be accessed soon.
> (If not interleaved on NUMA...but in such case, alloc/free itself is slow.)
>
> We cache some set of page_cgroup's base pointer on per-cpu area and
> use it when we hit.
>
> TODO:
>  - memory/cpu hotplug support.

How much does this help?


>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> ---
>  mm/page_cgroup.c |   47 +++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 45 insertions(+), 2 deletions(-)
>
> Index: mmtom-2.6.27-rc5+/mm/page_cgroup.c
> ===================================================================
> --- mmtom-2.6.27-rc5+.orig/mm/page_cgroup.c
> +++ mmtom-2.6.27-rc5+/mm/page_cgroup.c
> @@ -57,14 +57,26 @@ static int pcg_hashmask  __read_mostly;
>  #define PCG_HASHMASK		(pcg_hashmask)
>  #define PCG_HASHSIZE		(1 << pcg_hashshift)
>
> +#define PCG_CACHE_MAX_SLOT	(32)
> +#define PCG_CACHE_MASK		(PCG_CACHE_MAX_SLOT - 1)
> +struct percpu_page_cgroup_cache {
> +	struct {
> +		unsigned long	index;
> +		struct page_cgroup *base;
> +	} slots[PCG_CACHE_MAX_SLOT];
> +};
> +DEFINE_PER_CPU(struct percpu_page_cgroup_cache, pcg_cache);
> +
>  int pcg_hashfun(unsigned long index)
>  {
>  	return hash_long(index, pcg_hashshift);
>  }
>
> -struct page_cgroup *lookup_page_cgroup(unsigned long pfn)
> +noinline static struct page_cgroup *
> +__lookup_page_cgroup(struct percpu_page_cgroup_cache *pcc,unsigned long
> pfn) {
>  	unsigned long index = pfn >> ENTS_PER_CHUNK_SHIFT;
> +	int s = index & PCG_CACHE_MASK;
>  	struct pcg_hash *ent;
>  	struct pcg_hash_head *head;
>  	struct hlist_node *node;
> @@ -77,6 +89,8 @@ struct page_cgroup *lookup_page_cgroup(u
>  	hlist_for_each_entry(ent, node, &head->head, node) {
>  		if (ent->index == index) {
>  			pc = ent->map + pfn;
> +			pcc->slots[s].index = ent->index;
> +			pcc->slots[s].base = ent->map;
>  			break;
>  		}
>  	}
> @@ -84,6 +98,22 @@ struct page_cgroup *lookup_page_cgroup(u
>  	return pc;
>  }
>
> +struct page_cgroup *lookup_page_cgroup(unsigned long pfn)
> +{
> +	unsigned long index = pfn >> ENTS_PER_CHUNK_SHIFT;
> +	int hnum = (pfn >> ENTS_PER_CHUNK_SHIFT) & PCG_CACHE_MASK;
> +	struct percpu_page_cgroup_cache *pcc;
> +	struct page_cgroup *ret;
> +
> +	pcc = &get_cpu_var(pcg_cache);
> +	if (likely(pcc->slots[hnum].index == index))
> +		ret = pcc->slots[hnum].base + pfn;
> +	else
> +		ret = __lookup_page_cgroup(pcc, pfn);
> +	put_cpu_var(pcg_cache);
> +	return ret;
> +}
> +
>  static void __meminit alloc_page_cgroup(int node, unsigned long index)
>  {
>  	struct pcg_hash *ent;
> @@ -124,12 +154,23 @@ static void __meminit alloc_page_cgroup(
>  	return;
>  }
>
> +void clear_page_cgroup_cache_pcg(int cpu)
> +{
> +	struct percpu_page_cgroup_cache *pcc;
> +	int i;
> +
> +	pcc = &per_cpu(pcg_cache, cpu);
> +	for (i = 0; i <  PCG_CACHE_MAX_SLOT; i++) {
> +		pcc->slots[i].index = -1;
> +		pcc->slots[i].base = NULL;
> +	}
> +}
>
>  /* Called From mem_cgroup's initilization */
>  void __init page_cgroup_init(void)
>  {
>  	struct pcg_hash_head *head;
> -	int node, i;
> +	int node, cpu, i;
>  	unsigned long start, pfn, end, index, offset;
>  	long default_pcg_hash_size;
>
> @@ -174,5 +215,7 @@ void __init page_cgroup_init(void)
>  			}
>  		}
>  	}
> +	for_each_possible_cpu(cpu)
> +		clear_page_cgroup_cache_pcg(cpu);
>  	return;
>  }
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Re: [RFC] [PATCH 9/9] memcg: percpu page cgroup lookup cache
  2008-09-11 11:24 ` [RFC] [PATCH 9/9] memcg: percpu page cgroup lookup cache KAMEZAWA Hiroyuki
  2008-09-11 11:31   ` Nick Piggin
@ 2008-09-11 12:49   ` kamezawa.hiroyu
  1 sibling, 0 replies; 27+ messages in thread
From: kamezawa.hiroyu @ 2008-09-11 12:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, balbir, xemul, hugh, linux-mm, linux-kernel, menage

----- Original Message -----
>On Thursday 11 September 2008 21:24, KAMEZAWA Hiroyuki wrote:
>> Use per-cpu cache for fast access to page_cgroup.
>> This patch is for making fastpath faster.
>>
>> Because page_cgroup is accessed when the page is allocated/freed,
>> we can assume several of continuous page_cgroup will be accessed soon.
>> (If not interleaved on NUMA...but in such case, alloc/free itself is slow.)
>>
>> We cache some set of page_cgroup's base pointer on per-cpu area and
>> use it when we hit.
>>
>> TODO:
>>  - memory/cpu hotplug support.
>
>How much does this help?
>
1-2% in the unixbench tests (in 0/9) on a 2-core/1-socket x86-64/SMP host.
(The CPU is not the newest one.)
This per-cpu cache covers 32 * 128MB = 4GB of area.
Is using 256 bytes (32 entries) overkill?
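
(Coverage math, by my count from the struct above: one slot remembers one
chunk's base pointer, a chunk spans 2^15 pages * 4KB = 128MB, and there are
32 slots, so 32 * 128MB = 4GB per CPU before slots start getting reused.
Each slot is an unsigned long plus a pointer, so the footprint is
32 * 8 = 256 bytes per CPU on 32-bit and 32 * 16 = 512 bytes on 64-bit.)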

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap
  2008-09-11 11:22 ` [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap KAMEZAWA Hiroyuki
@ 2008-09-11 14:00   ` Nick Piggin
  2008-09-11 14:38   ` kamezawa.hiroyu
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Nick Piggin @ 2008-09-11 14:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

On Thursday 11 September 2008 21:22, KAMEZAWA Hiroyuki wrote:
> Remove page_cgroup pointer from struct page.
>
> This patch removes page_cgroup pointer from struct page and make it be able
> to get from pfn. Then, relationship of them is
>
> Before this:
>   pfn <-> struct page <-> struct page_cgroup.
> After this:
>   struct page <-> pfn -> struct page_cgroup -> struct page.

So...

pfn -> *hash* -> struct page_cgroup, right?

While I don't think there is anything wrong with the approach, I
don't understand exactly where you guys are hoping to end up with
this?

I thought everyone was happy with preallocated page_cgroups because
of their good performance and simplicity, but this seems to be
going the other way again.

I'd worry that the hash and lookaside buffers and everything makes
performance more fragile, adds code and data and icache to fastpaths.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Re: [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap
  2008-09-11 11:22 ` [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap KAMEZAWA Hiroyuki
  2008-09-11 14:00   ` Nick Piggin
@ 2008-09-11 14:38   ` kamezawa.hiroyu
  2008-09-11 15:01   ` kamezawa.hiroyu
  2008-09-12 16:12   ` Balbir Singh
  3 siblings, 0 replies; 27+ messages in thread
From: kamezawa.hiroyu @ 2008-09-11 14:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, balbir, xemul, hugh, linux-mm, linux-kernel, menage

>On Thursday 11 September 2008 21:22, KAMEZAWA Hiroyuki wrote:
>> Remove page_cgroup pointer from struct page.
>>
>> This patch removes page_cgroup pointer from struct page and make it be able
>> to get from pfn. Then, relationship of them is
>>
>> Before this:
>>   pfn <-> struct page <-> struct page_cgroup.
>> After this:
>>   struct page <-> pfn -> struct page_cgroup -> struct page.
>
>So...
>
>pfn -> *hash* -> struct page_cgroup, right?
>
right.

>While I don't think there is anything wrong with the approach, I
>don't understand exactly where you guys are hoping to end up with
>this?
>
No, but this is simple. In the end I'd like to use a linear mapping like
sparsemem-vmemmap, with HUGETLB kernel pages.
But that needs much more work and should be done in the future.
(And it seems it would have to depend on SPARSEMEM.)
This is just a generic one.

>I thought everyone was happy with preallocated page_cgroups because
>of their good performance and simplicity, but this seems to be
>going the other way again.

I don't think so. The purpose of this is to add an interface for getting
page_cgroup from a pfn.

>
>I'd worry that the hash and lookaside buffers and everything makes
>performance more fragile, adds code and data and icache to fastpaths.
>
I worry, too.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Re: Re: [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap
  2008-09-11 11:22 ` [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap KAMEZAWA Hiroyuki
  2008-09-11 14:00   ` Nick Piggin
  2008-09-11 14:38   ` kamezawa.hiroyu
@ 2008-09-11 15:01   ` kamezawa.hiroyu
  2008-09-12 16:12   ` Balbir Singh
  3 siblings, 0 replies; 27+ messages in thread
From: kamezawa.hiroyu @ 2008-09-11 15:01 UTC (permalink / raw)
  To: kamezawa.hiroyu
  Cc: Nick Piggin, balbir, xemul, hugh, linux-mm, linux-kernel, menage

>>On Thursday 11 September 2008 21:22, KAMEZAWA Hiroyuki wrote:
>>> Remove page_cgroup pointer from struct page.
>>>
>>> This patch removes page_cgroup pointer from struct page and make it be able
>>> to get from pfn. Then, relationship of them is
>>>
>>> Before this:
>>>   pfn <-> struct page <-> struct page_cgroup.
>>> After this:
>>>   struct page <-> pfn -> struct page_cgroup -> struct page.
>>
>>So...
>>
>>pfn -> *hash* -> struct page_cgroup, right?
>>
>right.
>
>>While I don't think there is anything wrong with the approach, I
>>don't understand exactly where you guys are hoping to end up with
>>this?
>>
>No. but this is simple. I'd like to use linear mapping like
>sparsemem-vmemmap and HUGTLB kernel pages at the end.
>But it needs much more work and should be done in the future.
>(And it seems to have to depend on SPARSEMEM.)
>This is just a generic one.
>
This is my thinking, for now.

Adding a patch for FLATMEM is probably very easy. (Just do it as Balbir's
patch does.)

Adding a patch for DISCONTIGMEM is doubtful. I'm not sure how widely it's
used now.

When it comes to SPARSEMEM, our target is 64-bit arches and we need
SPARSEMEM_EXTREME...a 2-level table. Maybe not far different from the
current hash.

Our way should be a linear mapping like SPARSEMEM_VMEMMAP. But that needs
arch-dependent code, and allocating the virtual space, which can be very
huge, has some difficulty.
This should be done one arch at a time.
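
Just to illustrate that direction, here is a minimal sketch of my own
(pcg_vmemmap is a hypothetical symbol; the mapping itself would be set up
by arch code at boot/hotplug time, and this is not part of this series):

	/* one struct page_cgroup per pfn, in a reserved virtual range that
	 * arch code backs with (huge) pages as memory is added */
	extern struct page_cgroup *pcg_vmemmap;

	static inline struct page_cgroup *lookup_page_cgroup(unsigned long pfn)
	{
		return pcg_vmemmap + pfn;	/* no hash, no per-cpu cache */
	}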

Thanks,
-Kame




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] [PATCH 3/9]  memcg: move_account between groups
  2008-09-11 11:14 ` [RFC] [PATCH 3/9] memcg: move_account between groups KAMEZAWA Hiroyuki
@ 2008-09-12  4:36   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-12  4:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

On Thu, 11 Sep 2008 20:14:51 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> +	if (try_lock_page_cgroup(page))
> +		return ret;
This is buggy: it should be !try_lock_page_cgroup(),
and uncharge should be called.

> +
> +	if (page_get_page_cgroup(page) != pc)
> +		goto out;
uncharge should be called here as well.

An updated version will appear next week.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] [PATCH 0/9]  remove page_cgroup pointer (with some enhancements)
  2008-09-11 11:08 [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
                   ` (8 preceding siblings ...)
  2008-09-11 11:24 ` [RFC] [PATCH 9/9] memcg: percpu page cgroup lookup cache KAMEZAWA Hiroyuki
@ 2008-09-12  9:35 ` KAMEZAWA Hiroyuki
  2008-09-12 10:18   ` KAMEZAWA Hiroyuki
  9 siblings, 1 reply; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-12  9:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

On Thu, 11 Sep 2008 20:08:55 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Peformance comparison is below.
> ==
> rc5-mm1
> ==
> Execl Throughput                           3006.5 lps   (29.8 secs, 3 samples)
> C Compiler Throughput                      1006.7 lpm   (60.0 secs, 3 samples)
> Shell Scripts (1 concurrent)               4863.7 lpm   (60.0 secs, 3 samples)
> Shell Scripts (8 concurrent)                943.7 lpm   (60.0 secs, 3 samples)
> Shell Scripts (16 concurrent)               482.7 lpm   (60.0 secs, 3 samples)
> Dc: sqrt(2) to 99 decimal places         124804.9 lpm   (30.0 secs, 3 samples)
> 
> After this series
> ==
> Execl Throughput                           3003.3 lps   (29.8 secs, 3 samples)
> C Compiler Throughput                      1008.0 lpm   (60.0 secs, 3 samples)
> Shell Scripts (1 concurrent)               4580.6 lpm   (60.0 secs, 3 samples)
> Shell Scripts (8 concurrent)                913.3 lpm   (60.0 secs, 3 samples)
> Shell Scripts (16 concurrent)               569.0 lpm   (60.0 secs, 3 samples)
> Dc: sqrt(2) to 99 decimal places         124918.7 lpm   (30.0 secs, 3 samples)
> 
> Hmm..no loss ? But maybe I should find what I can do to improve this.
> 
These are the latest numbers.
 - added a "Used" flag, as in Balbir's patch.
 - rewrote and optimized the uncharge() path.
 - moved bit_spinlock() (lock_page_cgroup()) to a header file as an inlined function.

Execl Throughput                           3064.9 lps   (29.8 secs, 3 samples)
C Compiler Throughput                       998.0 lpm   (60.0 secs, 3 samples)
Shell Scripts (1 concurrent)               4717.0 lpm   (60.0 secs, 3 samples)
Shell Scripts (8 concurrent)                928.3 lpm   (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)               474.3 lpm   (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places         127184.0 lpm   (30.0 secs, 3 samples)

Hmm... something seems bad in the concurrent shell test?
(But this -mm's shell test is not trustworthy: it is 15% slower than rc4's.)

I also tried to avoid mz->lru_lock (it was in my set), but I found I can't,
so I'm postponing that. (Maybe removing mz->lru_lock and depending on
zone->lock is the way; that would make memcg's LRU synchronized with the
global LRU.)

Unfortunately, I'll be offline for 2 or 3 days. I'm sorry if I can't make a
quick response.

Thanks,
-Kame




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] [PATCH 0/9]  remove page_cgroup pointer (with some enhancements)
  2008-09-12  9:35 ` [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
@ 2008-09-12 10:18   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-12 10:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage

On Fri, 12 Sep 2008 18:35:40 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Execl Throughput                           3064.9 lps   (29.8 secs, 3 samples)
> C Compiler Throughput                       998.0 lpm   (60.0 secs, 3 samples)
> Shell Scripts (1 concurrent)               4717.0 lpm   (60.0 secs, 3 samples)
> Shell Scripts (8 concurrent)                928.3 lpm   (60.0 secs, 3 samples)
> Shell Scripts (16 concurrent)               474.3 lpm   (60.0 secs, 3 samples)
> Dc: sqrt(2) to 99 decimal places         127184.0 lpm   (30.0 secs, 3 samples)
> 
That number was due to a bug. The following is the fixed one.

Execl Throughput                           3026.0 lps   (29.8 secs, 3 samples)
C Compiler Throughput                       971.0 lpm   (60.0 secs, 3 samples)
Shell Scripts (1 concurrent)               4573.7 lpm   (60.0 secs, 3 samples)
Shell Scripts (8 concurrent)                913.0 lpm   (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)               465.7 lpm   (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places         125412.1 lpm   (30.0 secs, 3 samples)

Not good yet ;(  Very sorry for the noise.
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap
  2008-09-11 11:22 ` [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap KAMEZAWA Hiroyuki
                     ` (2 preceding siblings ...)
  2008-09-11 15:01   ` kamezawa.hiroyu
@ 2008-09-12 16:12   ` Balbir Singh
  2008-09-12 16:19     ` Dave Hansen
  2008-09-16 12:13     ` memcg: lazy_lru (was Re: [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap) KAMEZAWA Hiroyuki
  3 siblings, 2 replies; 27+ messages in thread
From: Balbir Singh @ 2008-09-12 16:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: xemul, hugh, linux-mm, linux-kernel, menage, Dave Hansen

KAMEZAWA Hiroyuki wrote:
> Remove page_cgroup pointer from struct page.
> 
> This patch removes page_cgroup pointer from struct page and make it be able
> to get from pfn. Then, relationship of them is
> 
> Before this:
>   pfn <-> struct page <-> struct page_cgroup.
> After this:
>   struct page <-> pfn -> struct page_cgroup -> struct page.
> 
> Benefit of this approach is we can remove 8(4) bytes from struct page.
> 
> Other changes are:
>   - lock/unlock_page_cgroup() uses its own bit on struct page_cgroup.
>   - all necessary page_cgroups are allocated at boot.
> 
> Characteristics:
>   - page cgroup is allocated as some amount of chunk.
>     This patch uses SECTION_SIZE as size of chunk if 64bit/SPARSEMEM is enabled.
>     If not, appropriate default number is selected.
>   - all page_cgroup struct is maintained by hash. 
>     I think we have 2 ways to handle sparse index in general
>     ...radix-tree and hash. This uses hash because radix-tree's layout is
>     affected by memory map's layout.
>   - page_cgroup.h/page_cgroup.c is added.
> 
> TODO:
>   - memory hotplug support. (not difficult)

Kamezawa,

I feel we can try the following approaches:

1. Try a per-node, per-zone radix tree with dynamic allocation
2. Try the approach you have
3. Integrate with sparsemem (last resort for performance); Dave Hansen suggested
adding a mem_section member and using that.

I am going to try #1 today and see what the performance looks like.
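
Roughly what I have in mind for #1 is something like the sketch below
(my assumptions only: a new per-zone radix tree "pcg_tree" keyed by the
pfn offset within the zone, filled on demand at charge time -- not actual
code):

	static struct page_cgroup *lookup_page_cgroup(struct page *page)
	{
		struct zone *zone = page_zone(page);
		unsigned long idx = page_to_pfn(page) - zone->zone_start_pfn;
		struct page_cgroup *pc;

		rcu_read_lock();
		pc = radix_tree_lookup(&zone->pcg_tree, idx);
		rcu_read_unlock();
		return pc;
	}

The charge path would then need radix_tree_preload()/radix_tree_insert()
for entries that are not present yet, which is where the dynamic
allocation comes in.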


-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap
  2008-09-12 16:12   ` Balbir Singh
@ 2008-09-12 16:19     ` Dave Hansen
  2008-09-12 16:23       ` Dave Hansen
  2008-09-16 12:13     ` memcg: lazy_lru (was Re: [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap) KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2008-09-12 16:19 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, xemul, hugh, linux-mm, linux-kernel, menage,
	Dave Hansen

On Fri, 2008-09-12 at 09:12 -0700, Balbir Singh wrote:
> 3. Integrate with sparsemem (last resort for performance), Dave Hansen suggested
> adding a mem_section member and using that.

I also suggested using the sparsemem *structure* without necessarily
using it for pfn_to_page() lookups.  That'll take some rework to
separate out SPARSEMEM_FOR_MEMMAP vs. CONFIG_SPARSE_STRUCTURE_FUN, but
it should be able to be prototyped pretty fast.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap
  2008-09-12 16:19     ` Dave Hansen
@ 2008-09-12 16:23       ` Dave Hansen
  0 siblings, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2008-09-12 16:23 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, xemul, hugh, linux-mm, linux-kernel, menage,
	Dave Hansen

On Fri, 2008-09-12 at 09:19 -0700, Dave Hansen wrote:
> On Fri, 2008-09-12 at 09:12 -0700, Balbir Singh wrote:
> > 3. Integrate with sparsemem (last resort for performance), Dave Hansen suggested
> > adding a mem_section member and using that.
> 
> I also suggested using the sparsemem *structure* without necessarily
> using it for pfn_to_page() lookups.  That'll take some rework to
> separate out SPARSEMEM_FOR_MEMMAP vs. CONFIG_SPARSE_STRUCTURE_FUN, but
> it should be able to be prototyped pretty fast.

Heh, now that I think about it, you could also use vmemmap to do the
same thing.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* memcg: lazy_lru (was Re: [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap)
  2008-09-12 16:12   ` Balbir Singh
  2008-09-12 16:19     ` Dave Hansen
@ 2008-09-16 12:13     ` KAMEZAWA Hiroyuki
  2008-09-16 12:17       ` [RFC][PATCH 10/9] get/put page at charge/uncharge KAMEZAWA Hiroyuki
                         ` (2 more replies)
  1 sibling, 3 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-16 12:13 UTC (permalink / raw)
  To: balbir
  Cc: xemul, hugh, linux-mm, linux-kernel, menage, Dave Hansen, nickpiggin

On Fri, 12 Sep 2008 09:12:48 -0700
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Kamezawa,
> 
> I feel we can try the following approaches
> 
> 1. Try per-node per-zone radix tree with dynamic allocation
> 2. Try the approach you have
> 3. Integrate with sparsemem (last resort for performance), Dave Hansen suggested
> adding a mem_section member and using that.
> 
> I am going to try #1 today and see what the performance looks like
> 

I'm now writing *lazy* LRU handling via a per-cpu struct, like pagevec.
It seems to work well (but not as fast as expected on a 2-cpu box....)
I need more tests, but it's not a bad point to share the logic at this stage.

I added 3 patches on top of this set. (My old set needs a bug fix.)
==
[1] patches/page_count.patch    ....get_page()/put_page() via page_cgroup.
[2] patches/lazy_lru_free.patch ....free page_cgroup from LRU in lazy way.
[3] patches/lazy_lru_add.patch  ....add page_cgroup to LRU in lazy way.

3 patches will follow this mail.

Because of the speculative radix-tree lookup, the page_count patch seems a
bit difficult.

Anyway, I'll make this patch readable and post it again.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC][PATCH 10/9] get/put page at charge/uncharge
  2008-09-16 12:13     ` memcg: lazy_lru (was Re: [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap) KAMEZAWA Hiroyuki
@ 2008-09-16 12:17       ` KAMEZAWA Hiroyuki
  2008-09-16 12:19       ` [RFC][PATCH 11/9] lazy lru free vector for memcg KAMEZAWA Hiroyuki
  2008-09-16 12:21       ` [RFC] [PATCH 12/9] lazy lru add via per cpu " KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-16 12:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage, Dave Hansen,
	nickpiggin

Although a page_cgroup references its page, it doesn't increment
page->count.

Now, a page and its page_cgroup have a one-to-one relationship and there is
no dynamic allocation. But the global LRU and the memory resource
controller's LRU are not synchronized. What this means is:
  - LRU handling cost is not high (because there is no synchronization).
  - We have to worry about "reuse" of the page.

Synchronizing the global LRU and memcg's LRU would double the cost of LRU
handling. Instead of that, this patch adds get_page()/put_page() to
charge/uncharge(). With this, at least, alloc/free/reuse of the page and its
page_cgroup are synchronized.
This makes memcg robust and helps future optimization of memcg.

What this patch does:
 - Ignore compound pages.
 - get_page()/put_page() at charge/uncharge.
 - Handle the "freeze page_count()" special case in page migration and the
   speculative page cache (see the worked example below). To do this, the
   call site of mem_cgroup_uncharge_cache_page() is moved; it's now called
   after all speculative-page-cache ops are finished.
 - Remove charge/uncharge() from insert_page()... it is for irregular
   mappings which don't use the radix tree.
 - Move charge() in do_swap_page() under lock_page().

Needs careful and thorough review.
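
As a worked example of the refcount expectations this changes (my reading,
please double-check): a clean file-backed page in the page cache that has
been isolated by reclaim is referenced by

	the page cache (radix tree)	1
	the isolating reclaimer		1
	the memcg charge (new here)	1
					--
	expected page_count()		3

so is_page_cache_freeable() and the non-swapcache page_freeze_refs() in
__remove_mapping() now test against 3 when memcg is enabled, while a
swapcache page at that point is already unmapped (and therefore uncharged)
and keeps the old expected count of 2.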

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


---
 include/linux/memcontrol.h |   20 ++++++++++++++++++++
 mm/filemap.c               |    2 +-
 mm/memcontrol.c            |   24 ++++++++++++------------
 mm/memory.c                |   18 +++++-------------
 mm/migrate.c               |   10 +++++++++-
 mm/swapfile.c              |   18 ++++++++++++++++--
 mm/vmscan.c                |   42 ++++++++++++++++++++++++++++--------------
 7 files changed, 91 insertions(+), 43 deletions(-)

Index: mmtom-2.6.27-rc5+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc5+/mm/memcontrol.c
@@ -332,7 +332,7 @@ void mem_cgroup_move_lists(struct page *
 	struct mem_cgroup_per_zone *mz;
 	unsigned long flags;
 
-	if (mem_cgroup_subsys.disabled)
+	if (!under_mem_cgroup(page))
 		return;
 
 	/*
@@ -555,6 +555,10 @@ static int mem_cgroup_charge_common(stru
 	struct mem_cgroup_per_zone *mz;
 	unsigned long flags;
 
+	/* avoid case in boot sequence */
+	if (unlikely(PageReserved(page)))
+		return 0;
+
 	pc = lookup_page_cgroup(page_to_pfn(page));
 	/* can happen at boot */
 	if (unlikely(!pc))
@@ -630,6 +634,7 @@ static int mem_cgroup_charge_common(stru
 	default:
 		BUG();
 	}
+	get_page(pc->page);
 	unlock_page_cgroup(pc);
 
 	mz = page_cgroup_zoneinfo(pc);
@@ -647,9 +652,7 @@ out:
 
 int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
 {
-	if (mem_cgroup_subsys.disabled)
-		return 0;
-	if (PageCompound(page))
+	if (!under_mem_cgroup(page))
 		return 0;
 	/*
 	 * If already mapped, we don't have to account.
@@ -669,9 +672,7 @@ int mem_cgroup_charge(struct page *page,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
-	if (mem_cgroup_subsys.disabled)
-		return 0;
-	if (PageCompound(page))
+	if (!under_mem_cgroup(page))
 		return 0;
 	/*
 	 * Corner case handling. This is called from add_to_page_cache()
@@ -716,10 +717,8 @@ __mem_cgroup_uncharge_common(struct page
 	unsigned long pfn = page_to_pfn(page);
 	unsigned long flags;
 
-	if (mem_cgroup_subsys.disabled)
+	if (!under_mem_cgroup(page))
 		return;
-	/* check the condition we can know from page */
-
 	pc = lookup_page_cgroup(pfn);
 	if (unlikely(!pc || !PageCgroupUsed(pc)))
 		return;
@@ -735,6 +734,7 @@ __mem_cgroup_uncharge_common(struct page
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	__mem_cgroup_remove_list(mz, pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	put_page(pc->page);
 	pc->mem_cgroup = NULL;
 	css_put(&mem->css);
 	preempt_enable();
@@ -769,7 +769,7 @@ int mem_cgroup_prepare_migration(struct 
 	enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
 	int ret = 0;
 
-	if (mem_cgroup_subsys.disabled)
+	if (!under_mem_cgroup(page))
 		return 0;
 
 
@@ -822,7 +822,7 @@ int mem_cgroup_shrink_usage(struct mm_st
 	int progress = 0;
 	int retry = MEM_CGROUP_RECLAIM_RETRIES;
 
-	if (mem_cgroup_subsys.disabled)
+	if (!under_mem_cgroup(NULL))
 		return 0;
 	if (!mm)
 		return 0;
Index: mmtom-2.6.27-rc5+/mm/swapfile.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/swapfile.c
+++ mmtom-2.6.27-rc5+/mm/swapfile.c
@@ -390,20 +390,34 @@ static int remove_exclusive_swap_page_co
 /*
  * Most of the time the page should have two references: one for the
  * process and one for the swap cache.
+ * If memory resource controller is used, the page has extra reference from it.
  */
 int remove_exclusive_swap_page(struct page *page)
 {
-	return remove_exclusive_swap_page_count(page, 2);
+	int count;
+	/* page is accounted only when it's mapped. (if swapcache) */
+	if (under_mem_cgroup(page) && page_mapped(page))
+		count = 3;
+	else
+		count = 2;
+	return remove_exclusive_swap_page_count(page, count);
 }
 
 /*
  * The pageout code holds an extra reference to the page.  That raises
  * the reference count to test for to 2 for a page that is only in the
  * swap cache plus 1 for each process that maps the page.
+ * If memory resource controller is used, the page has extra reference from it.
  */
 int remove_exclusive_swap_page_ref(struct page *page)
 {
-	return remove_exclusive_swap_page_count(page, 2 + page_mapcount(page));
+	int count;
+
+	count = page_mapcount(page);
+	/* page is accounted only when it's mapped. (if swapcache) */
+	if (under_mem_cgroup(page) && count)
+		count += 1;
+	return remove_exclusive_swap_page_count(page, 2 + count);
 }
 
 /*
Index: mmtom-2.6.27-rc5+/include/linux/memcontrol.h
===================================================================
--- mmtom-2.6.27-rc5+.orig/include/linux/memcontrol.h
+++ mmtom-2.6.27-rc5+/include/linux/memcontrol.h
@@ -20,6 +20,8 @@
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
 
+#include <linux/page-flags.h>
+#include <linux/cgroup.h>
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
@@ -72,6 +74,19 @@ extern void mem_cgroup_record_reclaim_pr
 extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
 					int priority, enum lru_list lru);
 
+extern struct cgroup_subsys mem_cgroup_subsys;
+static inline int under_mem_cgroup(struct page *page)
+{
+	if (mem_cgroup_subsys.disabled)
+		return 0;
+	if (!page)
+		return 1;
+	if (PageCompound(page))
+		return 0;
+	return 1;
+}
+
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 static inline void page_reset_bad_cgroup(struct page *page)
 {
@@ -163,6 +178,11 @@ static inline long mem_cgroup_calc_recla
 {
 	return 0;
 }
+
+static inline int under_mem_cgroup(void)
+{
+	return 0;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
Index: mmtom-2.6.27-rc5+/mm/migrate.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/migrate.c
+++ mmtom-2.6.27-rc5+/mm/migrate.c
@@ -283,8 +283,16 @@ static int migrate_page_move_mapping(str
 
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
  					page_index(page));
+	/*
+	 * Here, the page is unmapped and memcg's refcnt should be 0
+	 * if it's anon or SwapCache.
+	 */
+	if (PageAnon(page))
+		expected_count = 2 + !!PagePrivate(page);
+	else
+		expected_count = 2 + !!PagePrivate(page)
+				   + under_mem_cgroup(page);
 
-	expected_count = 2 + !!PagePrivate(page);
 	if (page_count(page) != expected_count ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
 		spin_unlock_irq(&mapping->tree_lock);
Index: mmtom-2.6.27-rc5+/mm/vmscan.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/vmscan.c
+++ mmtom-2.6.27-rc5+/mm/vmscan.c
@@ -310,7 +310,10 @@ static inline int page_mapping_inuse(str
 
 static inline int is_page_cache_freeable(struct page *page)
 {
-	return page_count(page) - !!PagePrivate(page) == 2;
+	if (under_mem_cgroup(page))
+		return page_count(page) - !!PagePrivate(page) == 3;
+	else
+		return page_count(page) - !!PagePrivate(page) == 2;
 }
 
 static int may_write_to_queue(struct backing_dev_info *bdi)
@@ -453,6 +456,7 @@ static pageout_t pageout(struct page *pa
  */
 static int __remove_mapping(struct address_space *mapping, struct page *page)
 {
+	int freeze_ref;
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
@@ -481,12 +485,19 @@ static int __remove_mapping(struct addre
 	 *
 	 * Note that if SetPageDirty is always performed via set_page_dirty,
 	 * and thus under tree_lock, then this ordering is not required.
+ 	 *
+	 * If memory resource controller is enabled, it has extra ref.
 	 */
-	if (!page_freeze_refs(page, 2))
+	if (!PageSwapCache(page))
+		freeze_ref = 2 + under_mem_cgroup(page);
+	else
+		freeze_ref = 2;
+
+	if (!page_freeze_refs(page, freeze_ref))
 		goto cannot_free;
 	/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
 	if (unlikely(PageDirty(page))) {
-		page_unfreeze_refs(page, 2);
+		page_unfreeze_refs(page, freeze_ref);
 		goto cannot_free;
 	}
 
@@ -500,7 +511,7 @@ static int __remove_mapping(struct addre
 		spin_unlock_irq(&mapping->tree_lock);
 	}
 
-	return 1;
+	return freeze_ref - 1;
 
 cannot_free:
 	spin_unlock_irq(&mapping->tree_lock);
@@ -515,16 +526,19 @@ cannot_free:
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page)) {
-		/*
-		 * Unfreezing the refcount with 1 rather than 2 effectively
-		 * drops the pagecache ref for us without requiring another
-		 * atomic operation.
-		 */
-		page_unfreeze_refs(page, 1);
-		return 1;
-	}
-	return 0;
+	int ret;
+	ret = __remove_mapping(mapping, page);
+	if (!ret)
+		return 0;
+	/*
+	 * Unfreezing the refcount with 1 or 2 rather than 2 effectively
+	 * drops the pagecache ref for us without requiring another
+	 * atomic operation.
+	 */
+	page_unfreeze_refs(page, ret);
+	if (ret == 2)
+		mem_cgroup_uncharge_cache_page(page);
+	return 1;
 }
 
 /**
Index: mmtom-2.6.27-rc5+/mm/memory.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/memory.c
+++ mmtom-2.6.27-rc5+/mm/memory.c
@@ -1319,22 +1319,17 @@ static int insert_page(struct vm_area_st
 			struct page *page, pgprot_t prot)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	int retval;
 	pte_t *pte;
 	spinlock_t *ptl;
+	int retval = -EINVAL;
 
-	retval = mem_cgroup_charge(page, mm, GFP_KERNEL);
-	if (retval)
-		goto out;
-
-	retval = -EINVAL;
 	if (PageAnon(page))
-		goto out_uncharge;
+		goto out;
 	retval = -ENOMEM;
 	flush_dcache_page(page);
 	pte = get_locked_pte(mm, addr, &ptl);
 	if (!pte)
-		goto out_uncharge;
+		goto out;
 	retval = -EBUSY;
 	if (!pte_none(*pte))
 		goto out_unlock;
@@ -1350,8 +1345,6 @@ static int insert_page(struct vm_area_st
 	return retval;
 out_unlock:
 	pte_unmap_unlock(pte, ptl);
-out_uncharge:
-	mem_cgroup_uncharge_page(page);
 out:
 	return retval;
 }
@@ -2325,16 +2318,15 @@ static int do_swap_page(struct mm_struct
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
 	}
+	lock_page(page);
 
+	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 	if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
-		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 		ret = VM_FAULT_OOM;
 		goto out;
 	}
 
 	mark_page_accessed(page);
-	lock_page(page);
-	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 
 	/*
 	 * Back out if somebody else already faulted in this pte.
Index: mmtom-2.6.27-rc5+/mm/filemap.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/filemap.c
+++ mmtom-2.6.27-rc5+/mm/filemap.c
@@ -121,7 +121,6 @@ void __remove_from_page_cache(struct pag
 	mapping->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	BUG_ON(page_mapped(page));
-	mem_cgroup_uncharge_cache_page(page);
 
 	/*
 	 * Some filesystems seem to re-dirty the page even after
@@ -144,6 +143,7 @@ void remove_from_page_cache(struct page 
 
 	spin_lock_irq(&mapping->tree_lock);
 	__remove_from_page_cache(page);
+	mem_cgroup_uncharge_cache_page(page);
 	spin_unlock_irq(&mapping->tree_lock);
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC][PATCH 11/9] lazy lru free vector for memcg
  2008-09-16 12:13     ` memcg: lazy_lru (was Re: [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap) KAMEZAWA Hiroyuki
  2008-09-16 12:17       ` [RFC][PATCH 10/9] get/put page at charge/uncharge KAMEZAWA Hiroyuki
@ 2008-09-16 12:19       ` KAMEZAWA Hiroyuki
  2008-09-16 12:23         ` Pavel Emelyanov
  2008-09-16 13:02         ` kamezawa.hiroyu
  2008-09-16 12:21       ` [RFC] [PATCH 12/9] lazy lru add via per cpu " KAMEZAWA Hiroyuki
  2 siblings, 2 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-16 12:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage, Dave Hansen,
	nickpiggin

Free page_cgroup from its LRU in a batched manner.

When uncharge() is called, the page_cgroup is pushed onto a per-cpu vector
and removed from the LRU when the vector is drained. This depends on the
increment-page-count-via-page-cgroup patch: because the page_cgroup holds a
reference on the page, we don't have to worry that the page is reused while
it's sitting on the vector.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/memcontrol.c |  163 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 155 insertions(+), 8 deletions(-)

Index: mmtom-2.6.27-rc5+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc5+/mm/memcontrol.c
@@ -35,6 +35,7 @@
 #include <linux/vmalloc.h>
 #include <linux/mm_inline.h>
 #include <linux/page_cgroup.h>
+#include <linux/cpu.h>
 
 #include <asm/uaccess.h>
 
@@ -539,6 +540,120 @@ out:
 	return ret;
 }
 
+
+#define MEMCG_PCPVEC_SIZE	(8)
+struct memcg_percpu_vec {
+	int nr;
+	int limit;
+	struct mem_cgroup 	   *hot_memcg;
+	struct mem_cgroup_per_zone *hot_mz;
+	struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
+};
+DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+
+static void
+__release_page_cgroup(struct memcg_percpu_vec *mpv)
+{
+	unsigned long flags;
+	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup *owner;
+	struct page_cgroup *pc;
+	struct page *freed[MEMCG_PCPVEC_SIZE];
+	int i, nr;
+
+	mz = mpv->hot_mz;
+	owner = mpv->hot_memcg;
+	spin_lock_irqsave(&mz->lru_lock, flags);
+	nr = mpv->nr;
+	for (i = nr - 1; i >= 0; i--) {
+		pc = mpv->vec[i];
+		VM_BUG_ON(PageCgroupUsed(pc));
+		__mem_cgroup_remove_list(mz, pc);
+		css_put(&owner->css);
+		freed[i] = pc->page;
+		pc->mem_cgroup = NULL;
+	}
+	mpv->nr = 0;
+	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	for (i = nr - 1; i >= 0; i--)
+		put_page(freed[i]);
+}
+
+static void
+release_page_cgroup(struct mem_cgroup_per_zone *mz,struct page_cgroup *pc)
+{
+	struct memcg_percpu_vec *mpv;
+
+	mpv = &get_cpu_var(memcg_free_vec);
+	if (mpv->hot_mz != mz) {
+		if (mpv->nr > 0)
+			__release_page_cgroup(mpv);
+		mpv->hot_mz = mz;
+		mpv->hot_memcg = pc->mem_cgroup;
+	}
+	mpv->vec[mpv->nr++] = pc;
+	if (mpv->nr >= mpv->limit)
+		__release_page_cgroup(mpv);
+	put_cpu_var(memcg_free_vec);
+}
+
+static void page_cgroup_start_cache_cpu(int cpu)
+{
+	struct memcg_percpu_vec *mpv;
+	mpv = &per_cpu(memcg_free_vec, cpu);
+	mpv->limit = MEMCG_PCPVEC_SIZE;
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void page_cgroup_stop_cache_cpu(int cpu)
+{
+	struct memcg_percpu_vec *mpv;
+	mpv = &per_cpu(memcg_free_vec, cpu);
+	mpv->limit = 0;
+}
+#endif
+
+
+/*
+ * Used when freeing memory resource controller to remove all
+ * page_cgroup (in obsolete list).
+ */
+static DEFINE_MUTEX(memcg_force_drain_mutex);
+
+static void drain_page_cgroup_local(struct work_struct *work)
+{
+	struct memcg_percpu_vec *mpv;
+	mpv = &get_cpu_var(memcg_free_vec);
+	__release_page_cgroup(mpv);
+	put_cpu_var(mpv);
+}
+
+static void drain_page_cgroup_cpu(int cpu)
+{
+	int local_cpu;
+	struct work_struct work;
+
+	local_cpu = get_cpu();
+	if (local_cpu == cpu) {
+		drain_page_cgroup_local(NULL);
+		put_cpu();
+		return;
+	}
+	put_cpu();
+
+	INIT_WORK(&work, drain_page_cgroup_local);
+	schedule_work_on(cpu, &work);
+	flush_work(&work);
+}
+
+static void drain_page_cgroup_all(void)
+{
+	mutex_lock(&memcg_force_drain_mutex);
+	schedule_on_each_cpu(drain_page_cgroup_local);
+	mutex_unlock(&memcg_force_drain_mutex);
+}
+
+
 /*
  * Charge the memory controller for page usage.
  * Return
@@ -715,7 +830,6 @@ __mem_cgroup_uncharge_common(struct page
 	struct mem_cgroup *mem;
 	struct mem_cgroup_per_zone *mz;
 	unsigned long pfn = page_to_pfn(page);
-	unsigned long flags;
 
 	if (!under_mem_cgroup(page))
 		return;
@@ -727,17 +841,12 @@ __mem_cgroup_uncharge_common(struct page
 	lock_page_cgroup(pc);
 	__ClearPageCgroupUsed(pc);
 	unlock_page_cgroup(pc);
+	preempt_enable();
 
 	mem = pc->mem_cgroup;
 	mz = page_cgroup_zoneinfo(pc);
 
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_remove_list(mz, pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
-	put_page(pc->page);
-	pc->mem_cgroup = NULL;
-	css_put(&mem->css);
-	preempt_enable();
+	release_page_cgroup(mz, pc);
 	res_counter_uncharge(&mem->res, PAGE_SIZE);
 
 	return;
@@ -938,6 +1047,7 @@ static int mem_cgroup_force_empty(struct
 	 * So, we have to do loop here until all lists are empty.
 	 */
 	while (mem->res.usage > 0) {
+		drain_page_cgroup_all();
 		if (atomic_read(&mem->css.cgroup->count) > 0)
 			goto out;
 		for_each_node_state(node, N_POSSIBLE)
@@ -950,6 +1060,7 @@ static int mem_cgroup_force_empty(struct
 			}
 	}
 	ret = 0;
+	drain_page_cgroup_all();
 out:
 	css_put(&mem->css);
 	return ret;
@@ -1154,6 +1265,38 @@ static void mem_cgroup_free(struct mem_c
 		vfree(mem);
 }
 
+static void mem_cgroup_init_pcp(int cpu)
+{
+	page_cgroup_start_cache_cpu(cpu);
+}
+
+static int cpu_memcgroup_callback(struct notifier_block *nb,
+			unsigned long action, void *hcpu)
+{
+	int cpu = (long)hcpu;
+
+	switch(action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		mem_cgroup_init_pcp(cpu);
+		break;
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		page_cgroup_stop_cache_cpu(cpu);
+		drain_page_cgroup_cpu(cpu);
+		break;
+#endif
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __refdata memcgroup_nb =
+{
+	.notifier_call = cpu_memcgroup_callback,
+};
 
 static struct cgroup_subsys_state *
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -1164,6 +1307,10 @@ mem_cgroup_create(struct cgroup_subsys *
 	if (unlikely((cont->parent) == NULL)) {
 		page_cgroup_init();
 		mem = &init_mem_cgroup;
+		cpu_memcgroup_callback(&memcgroup_nb,
+					(unsigned long)CPU_UP_PREPARE,
+					(void *)(long)smp_processor_id());
+		register_hotcpu_notifier(&memcgroup_nb);
 	} else {
 		mem = mem_cgroup_alloc();
 		if (!mem)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC] [PATCH 12/9] lazy lru add via per cpu vector for memcg.
  2008-09-16 12:13     ` memcg: lazy_lru (was Re: [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap) KAMEZAWA Hiroyuki
  2008-09-16 12:17       ` [RFC][PATCH 10/9] get/put page at charge/uncharge KAMEZAWA Hiroyuki
  2008-09-16 12:19       ` [RFC][PATCH 11/9] lazy lru free vector for memcg KAMEZAWA Hiroyuki
@ 2008-09-16 12:21       ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 27+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-16 12:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir, xemul, hugh, linux-mm, linux-kernel, menage, Dave Hansen,
	nickpiggin

Delay add-to-LRU and do it in batches.

For the delay, a PCG_LRU flag is added. If PCG_LRU is set, the page is on
the LRU and uncharge() has to remove it from the LRU. If not, the page has
not been added to the LRU yet.

To avoid races, all flags are modified under lock_page_cgroup().

The lazy-add logic reuses the lazy-free one.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 include/linux/page_cgroup.h |    4 +
 mm/memcontrol.c             |   91 ++++++++++++++++++++++++++++++++++++++------
 2 files changed, 84 insertions(+), 11 deletions(-)

Index: mmtom-2.6.27-rc5+/include/linux/page_cgroup.h
===================================================================
--- mmtom-2.6.27-rc5+.orig/include/linux/page_cgroup.h
+++ mmtom-2.6.27-rc5+/include/linux/page_cgroup.h
@@ -23,6 +23,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_LRU, /* this is on LRU */
 	/* flags for LRU placement */
 	PCG_ACTIVE, /* page is active in this cgroup */
 	PCG_FILE, /* page is file system backed */
@@ -57,6 +58,9 @@ TESTPCGFLAG(Used, USED)
 __SETPCGFLAG(Used, USED)
 __CLEARPCGFLAG(Used, USED)
 
+TESTPCGFLAG(LRU, LRU)
+SETPCGFLAG(LRU, LRU)
+
 /* LRU management flags (from global-lru definition) */
 TESTPCGFLAG(File, FILE)
 SETPCGFLAG(File, FILE)
Index: mmtom-2.6.27-rc5+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc5+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc5+/mm/memcontrol.c
@@ -348,7 +348,7 @@ void mem_cgroup_move_lists(struct page *
 	if (!trylock_page_cgroup(pc))
 		return;
 
-	if (PageCgroupUsed(pc)) {
+	if (PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
 		mem = pc->mem_cgroup;
 		mz = page_cgroup_zoneinfo(pc);
 		spin_lock_irqsave(&mz->lru_lock, flags);
@@ -508,6 +508,9 @@ int mem_cgroup_move_account(struct page 
 	from_mz =  mem_cgroup_zoneinfo(from, nid, zid);
 	to_mz =  mem_cgroup_zoneinfo(to, nid, zid);
 
+	if (!PageCgroupLRU(pc))
+		return ret;
+
 	if (res_counter_charge(&to->res, PAGE_SIZE)) {
 		/* Now, we assume no_limit...no failure here. */
 		return ret;
@@ -550,6 +553,7 @@ struct memcg_percpu_vec {
 	struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
 };
 DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_add_vec);
 
 static void
 __release_page_cgroup(struct memcg_percpu_vec *mpv)
@@ -580,6 +584,40 @@ __release_page_cgroup(struct memcg_percp
 }
 
 static void
+__use_page_cgroup(struct memcg_percpu_vec *mpv)
+{
+	unsigned long flags;
+	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup *owner;
+	struct page_cgroup *pc;
+	struct page *freed[MEMCG_PCPVEC_SIZE];
+	int i, nr, freed_num;
+
+	mz = mpv->hot_mz;
+	owner = mpv->hot_memcg;
+	spin_lock_irqsave(&mz->lru_lock, flags);
+	nr = mpv->nr;
+	mpv->nr = 0;
+	freed_num = 0;
+	for (i = nr - 1; i >= 0; i--) {
+		pc = mpv->vec[i];
+		lock_page_cgroup(pc);
+		if (likely(PageCgroupUsed(pc))) {
+			__mem_cgroup_add_list(mz, pc);
+			SetPageCgroupLRU(pc);
+		} else {
+			css_put(&owner->css);
+			freed[freed_num++] = pc->page;
+			pc->mem_cgroup = NULL;
+		}
+		unlock_page_cgroup(pc);
+	}
+	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	while (freed_num--)
+		put_page(freed[freed_num]);
+}
+
+static void
 release_page_cgroup(struct mem_cgroup_per_zone *mz,struct page_cgroup *pc)
 {
 	struct memcg_percpu_vec *mpv;
@@ -597,11 +635,30 @@ release_page_cgroup(struct mem_cgroup_pe
 	put_cpu_var(memcg_free_vec);
 }
 
+static void
+use_page_cgroup(struct mem_cgroup_per_zone *mz, struct page_cgroup *pc)
+{
+	struct memcg_percpu_vec *mpv;
+	mpv = &get_cpu_var(memcg_add_vec);
+	if (mpv->hot_mz != mz) {
+		if (mpv->nr > 0)
+			__use_page_cgroup(mpv);
+		mpv->hot_mz = mz;
+		mpv->hot_memcg = pc->mem_cgroup;
+	}
+	mpv->vec[mpv->nr++] = pc;
+	if (mpv->nr >= mpv->limit)
+		__use_page_cgroup(mpv);
+	put_cpu_var(memcg_add_vec);
+}
+
 static void page_cgroup_start_cache_cpu(int cpu)
 {
 	struct memcg_percpu_vec *mpv;
 	mpv = &per_cpu(memcg_free_vec, cpu);
 	mpv->limit = MEMCG_PCPVEC_SIZE;
+	mpv = &per_cpu(memcg_add_vec, cpu);
+	mpv->limit = MEMCG_PCPVEC_SIZE;
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -610,6 +667,8 @@ static void page_cgroup_stop_cache_cpu(i
 	struct memcg_percpu_vec *mpv;
 	mpv = &per_cpu(memcg_free_vec, cpu);
 	mpv->limit = 0;
+	mpv = &per_cpu(memcg_add_vec, cpu);
+	mpv->limit = 0;
 }
 #endif
 
@@ -623,6 +682,9 @@ static DEFINE_MUTEX(memcg_force_drain_mu
 static void drain_page_cgroup_local(struct work_struct *work)
 {
 	struct memcg_percpu_vec *mpv;
+	mpv = &get_cpu_var(memcg_add_vec);
+	__use_page_cgroup(mpv);
+	put_cpu_var(mpv);
 	mpv = &get_cpu_var(memcg_free_vec);
 	__release_page_cgroup(mpv);
 	put_cpu_var(mpv);
@@ -668,7 +730,6 @@ static int mem_cgroup_charge_common(stru
 	struct page_cgroup *pc;
 	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
 
 	/* avoid case in boot sequence */
 	if (unlikely(PageReserved(page)))
@@ -753,9 +814,7 @@ static int mem_cgroup_charge_common(stru
 	unlock_page_cgroup(pc);
 
 	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_add_list(mz, pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	use_page_cgroup(mz, pc);
 	preempt_enable();
 
 done:
@@ -830,23 +889,33 @@ __mem_cgroup_uncharge_common(struct page
 	struct mem_cgroup *mem;
 	struct mem_cgroup_per_zone *mz;
 	unsigned long pfn = page_to_pfn(page);
+	int need_to_release;
 
 	if (!under_mem_cgroup(page))
 		return;
 	pc = lookup_page_cgroup(pfn);
-	if (unlikely(!pc || !PageCgroupUsed(pc)))
+	if (unlikely(!pc))
 		return;
-
 	preempt_disable();
+
 	lock_page_cgroup(pc);
+
+	if (unlikely(!PageCgroupUsed(pc))) {
+		unlock_page_cgroup(pc);
+		preempt_enable();
+		return;
+	}
+
+	need_to_release = PageCgroupLRU(pc);
+	mem = pc->mem_cgroup;
 	__ClearPageCgroupUsed(pc);
 	unlock_page_cgroup(pc);
 	preempt_enable();
 
-	mem = pc->mem_cgroup;
-	mz = page_cgroup_zoneinfo(pc);
-
-	release_page_cgroup(mz, pc);
+	if (likely(need_to_release)) {
+		mz = page_cgroup_zoneinfo(pc);
+		release_page_cgroup(mz, pc);
+	}
 	res_counter_uncharge(&mem->res, PAGE_SIZE);
 
 	return;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH 11/9] lazy lru free vector for memcg
  2008-09-16 12:19       ` [RFC][PATCH 11/9] lazy lru free vector for memcg KAMEZAWA Hiroyuki
@ 2008-09-16 12:23         ` Pavel Emelyanov
  2008-09-16 13:02         ` kamezawa.hiroyu
  1 sibling, 0 replies; 27+ messages in thread
From: Pavel Emelyanov @ 2008-09-16 12:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir, hugh, linux-mm, linux-kernel, menage, Dave Hansen, nickpiggin

[snip]

> @@ -938,6 +1047,7 @@ static int mem_cgroup_force_empty(struct
>  	 * So, we have to do loop here until all lists are empty.
>  	 */
>  	while (mem->res.usage > 0) {
> +		drain_page_cgroup_all();

Shouldn't we wait here till the drain process completes?

>  		if (atomic_read(&mem->css.cgroup->count) > 0)
>  			goto out;
>  		for_each_node_state(node, N_POSSIBLE)


* Re: Re: [RFC][PATCH 11/9] lazy lru free vector for memcg
  2008-09-16 12:19       ` [RFC][PATCH 11/9] lazy lru free vector for memcg KAMEZAWA Hiroyuki
  2008-09-16 12:23         ` Pavel Emelyanov
@ 2008-09-16 13:02         ` kamezawa.hiroyu
  1 sibling, 0 replies; 27+ messages in thread
From: kamezawa.hiroyu @ 2008-09-16 13:02 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: KAMEZAWA Hiroyuki, balbir, hugh, linux-mm, linux-kernel, menage,
	Dave Hansen, nickpiggin

----- Original Message -----

>[snip]
>
>> @@ -938,6 +1047,7 @@ static int mem_cgroup_force_empty(struct
>>  	 * So, we have to do loop here until all lists are empty.
>>  	 */
>>  	while (mem->res.usage > 0) {
>> +		drain_page_cgroup_all();
>
>Shouldn't we wait here till the drain process completes?
>
I thought schedule_on_each_cpu() waits for completion of the work.
I'll check it again.
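
For reference, schedule_on_each_cpu() queues the work on every online CPU and flushes each work item before returning, so a drain built on top of it is synchronous. A minimal sketch of that assumption (the wrapper body below is illustrative, not quoted from the posted patch):

static void drain_page_cgroup_all(void)
{
	/*
	 * Runs drain_page_cgroup_local() on every online CPU; the call
	 * does not return until every queued work item has completed,
	 * so each CPU has flushed the vectors it had staged at that point.
	 */
	schedule_on_each_cpu(drain_page_cgroup_local);
}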

Thank you for review.

Regards,
-Kame



Thread overview: 27+ messages
2008-09-11 11:08 [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
2008-09-11 11:11 ` [RFC] [PATCH 1/9] memcg:make root no limit KAMEZAWA Hiroyuki
2008-09-11 11:13 ` [RFC] [PATCH 2/9] memcg: atomic page_cgroup flags KAMEZAWA Hiroyuki
2008-09-11 11:14 ` [RFC] [PATCH 3/9] memcg: move_account between groups KAMEZAWA Hiroyuki
2008-09-12  4:36   ` KAMEZAWA Hiroyuki
2008-09-11 11:16 ` [RFC] [PATCH 4/9] memcg: new force empty KAMEZAWA Hiroyuki
2008-09-11 11:17 ` [RFC] [PATCH 5/9] memcg: set mapping null before uncharge KAMEZAWA Hiroyuki
2008-09-11 11:18 ` [RFC] [PATCH 6/9] memcg: optimize stat KAMEZAWA Hiroyuki
2008-09-11 11:20 ` [RFC] [PATCH 7/9] memcg: charge likely success KAMEZAWA Hiroyuki
2008-09-11 11:22 ` [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap KAMEZAWA Hiroyuki
2008-09-11 14:00   ` Nick Piggin
2008-09-11 14:38   ` kamezawa.hiroyu
2008-09-11 15:01   ` kamezawa.hiroyu
2008-09-12 16:12   ` Balbir Singh
2008-09-12 16:19     ` Dave Hansen
2008-09-12 16:23       ` Dave Hansen
2008-09-16 12:13     ` memcg: lazy_lru (was Re: [RFC] [PATCH 8/9] memcg: remove page_cgroup pointer from memmap) KAMEZAWA Hiroyuki
2008-09-16 12:17       ` [RFC][PATCH 10/9] get/put page at charge/uncharge KAMEZAWA Hiroyuki
2008-09-16 12:19       ` [RFC][PATCH 11/9] lazy lru free vector for memcg KAMEZAWA Hiroyuki
2008-09-16 12:23         ` Pavel Emelyanov
2008-09-16 13:02         ` kamezawa.hiroyu
2008-09-16 12:21       ` [RFC] [PATCH 12/9] lazy lru add vie per cpu " KAMEZAWA Hiroyuki
2008-09-11 11:24 ` [RFC] [PATCH 9/9] memcg: percpu page cgroup lookup cache KAMEZAWA Hiroyuki
2008-09-11 11:31   ` Nick Piggin
2008-09-11 12:49   ` kamezawa.hiroyu
2008-09-12  9:35 ` [RFC] [PATCH 0/9] remove page_cgroup pointer (with some enhancements) KAMEZAWA Hiroyuki
2008-09-12 10:18   ` KAMEZAWA Hiroyuki
