* [RFC][PATCH 0/14] Mem+Swap Controller v2
@ 2008-08-22 11:27 KAMEZAWA Hiroyuki
2008-08-22 11:30 ` [RFC][PATCH 1/14] memcg: unlimted root cgroup KAMEZAWA Hiroyuki
` (15 more replies)
0 siblings, 16 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:27 UTC (permalink / raw)
To: linux-mm; +Cc: balbir, nishimura
Hi, I totally rewrote the series; it should be easier to review now.
This patch series provides memory resource controller enhancements (candidates).
The code is totally rewritten from the "preview" version.
Based on rc3 + a slightly old mmtom (it may not conflict with the latest...)
(I'm not in a hurry now; please look at this when you have time.)
The contents are as follows. I'll push them gradually when they seem O.K.
- New force_empty implementation.
  - Rather than dropping all accounting, move all accounting to "root".
    This behavior can be changed later (based on some policy).
    Settling on a "good" policy may take some time; I think starting from
    "move charges to the root" is good. I want to hear opinions about this
    interface's behavior.
- Lockless page_cgroup
  - Removes lock_page_cgroup() and makes access to page->page_cgroup safe
    under certain conditions. This will make memcg's lock semantics
    better.
- Exporting page_cgroup.
  - Because of lockless page_cgroup, we can easily access page_cgroup from
    outside of memcontrol.c. There are some people who have asked me to allow
    them to access page_cgroup.
- Mem+Swap controller.
  - This controller is implemented as an extension of the memory controller.
    If the Mem+Swap controller is enabled, you can set 2 limits and see
    2 counters (a small usage sketch follows this list):
      memory.limit_in_bytes .... limit on the amount of on-memory pages.
      memory.memsw_limit_in_bytes .... limit on the sum of on-memory pages
                                       and swap on disk.
      memory.usage_in_bytes .... current usage of on-memory pages.
      memory.memory.swap_in_bytes .... current usage of swap which is not
                                       on memory.
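For illustration only, here is a minimal userspace sketch of driving the two limits and reading a counter. It assumes the /cgroups mount point and the cgroup "0" used in the Documentation examples later in this series; the memsw file name is taken from the list above and may still change, so treat this as a sketch, not the final interface.

/* Hedged sketch: exercise the proposed mem+swap interface from userspace.
 * Assumes /cgroups is the cgroup mount point and "0" is an existing
 * memory cgroup, as in the Documentation examples in this series.
 */
#include <stdio.h>
#include <stdlib.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fprintf(f, "%s\n", val) < 0) {
		perror(path);
		exit(1);
	}
	fclose(f);
}

static void show(const char *path)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("%s: %s", path, buf);
	if (f)
		fclose(f);
}

int main(void)
{
	/* limit on on-memory pages (bytes; suffixes like "4M" also work here) */
	write_str("/cgroups/0/memory.limit_in_bytes", "4194304");
	/* limit on memory + swap, written as a plain byte count */
	write_str("/cgroups/0/memory.memsw_limit_in_bytes", "8388608");

	show("/cgroups/0/memory.usage_in_bytes");
	return 0;
}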
Any feedback and comments are welcome.
This set passed some fundamental tests on a small box and works well,
but I have not done a long-run test on a big box, so you may see panics
caused by race conditions...
TODO:
- Update Documentation more.
- Long-run test.
- Update force_empty's policy.
Major changes from v1:
- force_empty is updated.
- Small bug fix in lockless page_cgroup.
- Mem+Swap controller is added. (Implementation details are quite different
  from the preview version, but there is no change in the algorithm.)
Patch series:
I'd like to push patches 1...9 earlier than 10...14.
Comments about the direction of patches 1, 2, 11, and 13 would be helpful.
1. unlimted_root_cgroup.patch
.... makes the root cgroup's limit unlimited.
2. new_force_empty.patch
.... rewrites force_empty to move the resource rather than drop it.
3. atomic_flags.patch
.... makes page_cgroup->flags modifications use atomic ops.
4. lazy-lru-freeing.patch
.... delays freeing of page_cgroup.
5. rcu-free.patch
.... frees page_cgroup by RCU.
6. lockess.patch
.... removes lock_page_cgroup().
7. prefetch.patch
.... just adds prefetch().
8. make_mapping_null.patch
.... guarantees page->mapping is NULL before uncharging
file cache (and checks it with BUG_ON).
9. split-page-cgroup.patch
.... adds a page_cgroup.h file.
10. mem_counter.patch
.... replaces res_counter with the newly added mem_counter.
(The reason for this patch is shown in patch [11].)
11. memcgrp_id.patch
.... gives each mem_cgroup its own ID.
12. swap_cgroup_config.patch
.... adds Kconfig and some macros for the Mem+Swap Controller.
13. swap_counter.patch
.... modifies mem_counter to handle swap.
14. swap_account.patch
.... accounts swap.
Thanks,
-Kame
* [RFC][PATCH 1/14] memcg: unlimted root cgroup
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
@ 2008-08-22 11:30 ` KAMEZAWA Hiroyuki
2008-08-22 22:51 ` Balbir Singh
2008-08-23 0:38 ` kamezawa.hiroyu
2008-08-22 11:31 ` [RFC][PATCH 2/14] memcg: rewrite force_empty KAMEZAWA Hiroyuki
` (14 subsequent siblings)
15 siblings, 2 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:30 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
Make the root cgroup of the memory resource controller have no limit.
With this, users cannot set a limit on the root cgroup. This makes the root
cgroup usable as a kind of trash-can.
To account pages which have no owner (which are created by force_empty),
we need some cgroup with no_limit. A patch rewriting force_empty will
follow this one.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Documentation/controllers/memory.txt | 4 ++++
mm/memcontrol.c | 12 ++++++++++++
2 files changed, 16 insertions(+)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -133,6 +133,10 @@ struct mem_cgroup {
* statistics.
*/
struct mem_cgroup_stat stat;
+ /*
+ * special flags.
+ */
+ int no_limit;
};
static struct mem_cgroup init_mem_cgroup;
@@ -920,6 +924,10 @@ static int mem_cgroup_write(struct cgrou
switch (cft->private) {
case RES_LIMIT:
+ if (memcg->no_limit == 1) {
+ ret = -EINVAL;
+ break;
+ }
/* This function does all necessary parse...reuse it */
ret = res_counter_memparse_write_strategy(buffer, &val);
if (!ret)
@@ -1119,6 +1127,10 @@ mem_cgroup_create(struct cgroup_subsys *
if (alloc_mem_cgroup_per_zone_info(mem, node))
goto free_out;
+ /* Default cgroup has no limit */
+ if (cont->parent == NULL)
+ mem->no_limit = 1;
+
return &mem->css;
free_out:
for_each_node_state(node, N_POSSIBLE)
Index: mmtom-2.6.27-rc3+/Documentation/controllers/memory.txt
===================================================================
--- mmtom-2.6.27-rc3+.orig/Documentation/controllers/memory.txt
+++ mmtom-2.6.27-rc3+/Documentation/controllers/memory.txt
@@ -121,6 +121,9 @@ The corresponding routines that remove a
a page from Page Cache is used to decrement the accounting counters of the
cgroup.
+The root cgroup is not allowed to have a limit set, but its usage is accounted.
+To control memory usage, you need to create a cgroup.
+
2.3 Shared Page Accounting
Shared pages are accounted on the basis of the first touch approach. The
@@ -172,6 +175,7 @@ We can alter the memory limit:
NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
mega or gigabytes.
+Note: the root cgroup cannot have its limit set.
# cat /cgroups/0/memory.limit_in_bytes
4194304
* [RFC][PATCH 2/14] memcg: rewrite force_empty
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
2008-08-22 11:30 ` [RFC][PATCH 1/14] memcg: unlimted root cgroup KAMEZAWA Hiroyuki
@ 2008-08-22 11:31 ` KAMEZAWA Hiroyuki
2008-08-25 3:21 ` KAMEZAWA Hiroyuki
2008-08-29 11:45 ` Daisuke Nishimura
2008-08-22 11:32 ` [RFC][PATCH 3/14] memcg: atomic_flags KAMEZAWA Hiroyuki
` (13 subsequent siblings)
15 siblings, 2 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:31 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
The current force_empty of the memory resource controller just removes the
page_cgroup. This means the page is not accounted at all and creates an
in-use page which has no page_cgroup.
This patch instead moves the accounting to the "root" cgroup. With this patch,
force_empty doesn't leak accounting but moves it to the "root" cgroup. Maybe
someone can think of other enhancements, such as
1. move accounting to its parent.
2. move accounting to a default trash-can cgroup somewhere.
3. move accounting to a cgroup specified by an admin.
I think the routine this patch adds is generic enough to be the base for
supporting the above behavior (if someone wants it). For now, it just moves
accounting to the root cgroup.
While moving the mem_cgroup, lock_page(page) is held. This helps us avoid
race conditions when accessing page_cgroup->mem_cgroup.
While under lock_page(), page_cgroup->mem_cgroup points to the right cgroup.
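For reference, here is a condensed sketch (not part of the patch) of the calling pattern mem_cgroup_force_empty_list() uses with the new helper; the real loop, with its list handling and retry, is in the diff below.

/*
 * Sketch only: move one page's charge to the root cgroup.  Mirrors the
 * pattern in mem_cgroup_force_empty_list() below; assumes the caller
 * holds mz->lru_lock with IRQs disabled.
 */
static int move_one_charge_to_root(struct page *page, struct page_cgroup *pc,
				   struct mem_cgroup *from)
{
	int ret = 1;

	get_page(page);
	if (trylock_page(page)) {
		/* under lock_page(), pc->mem_cgroup cannot change */
		ret = mem_cgroup_move_account(page, pc, from, &init_mem_cgroup);
		unlock_page(page);
	}
	put_page(page);
	return ret;	/* non-zero means the caller should retry this page later */
}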
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Documentation/controllers/memory.txt | 7 +-
mm/memcontrol.c | 85 ++++++++++++++++++++++++++---------
2 files changed, 69 insertions(+), 23 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -368,6 +368,7 @@ int task_in_mem_cgroup(struct task_struc
void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
{
struct page_cgroup *pc;
+ struct mem_cgroup *mem;
struct mem_cgroup_per_zone *mz;
unsigned long flags;
@@ -386,9 +387,14 @@ void mem_cgroup_move_lists(struct page *
pc = page_get_page_cgroup(page);
if (pc) {
+ mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_move_lists(pc, lru);
+ /*
+ * check against the race with force_empty.
+ */
+ if (likely(mem == pc->mem_cgroup))
+ __mem_cgroup_move_lists(pc, lru);
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
unlock_page_cgroup(page);
@@ -830,19 +836,52 @@ int mem_cgroup_resize_limit(struct mem_c
return ret;
}
+int mem_cgroup_move_account(struct page *page, struct page_cgroup *pc,
+ struct mem_cgroup *from, struct mem_cgroup *to)
+{
+ struct mem_cgroup_per_zone *from_mz, *to_mz;
+ int nid, zid;
+ int ret = 1;
+
+ VM_BUG_ON(to->no_limit == 0);
+ VM_BUG_ON(!irqs_disabled());
+ VM_BUG_ON(!PageLocked(page));
+
+ nid = page_to_nid(page);
+ zid = page_zonenum(page);
+ from_mz = mem_cgroup_zoneinfo(from, nid, zid);
+ to_mz = mem_cgroup_zoneinfo(to, nid, zid);
+
+ if (res_counter_charge(&to->res, PAGE_SIZE)) {
+ /* Now, we assume no_limit...no failure here. */
+ return ret;
+ }
+
+ if (spin_trylock(&to_mz->lru_lock)) {
+ __mem_cgroup_remove_list(from_mz, pc);
+ css_put(&from->css);
+ res_counter_uncharge(&from->res, PAGE_SIZE);
+ pc->mem_cgroup = to;
+ css_get(&to->css);
+ __mem_cgroup_add_list(to_mz, pc);
+ ret = 0;
+ spin_unlock(&to_mz->lru_lock);
+ } else {
+ res_counter_uncharge(&to->res, PAGE_SIZE);
+ }
+
+ return ret;
+}
/*
- * This routine traverse page_cgroup in given list and drop them all.
- * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
+ * This routine moves all account to root cgroup.
*/
-#define FORCE_UNCHARGE_BATCH (128)
static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
enum lru_list lru)
{
struct page_cgroup *pc;
struct page *page;
- int count = FORCE_UNCHARGE_BATCH;
unsigned long flags;
struct list_head *list;
@@ -853,22 +892,28 @@ static void mem_cgroup_force_empty_list(
pc = list_entry(list->prev, struct page_cgroup, lru);
page = pc->page;
get_page(page);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- /*
- * Check if this page is on LRU. !LRU page can be found
- * if it's under page migration.
- */
- if (PageLRU(page)) {
- __mem_cgroup_uncharge_common(page,
- MEM_CGROUP_CHARGE_TYPE_FORCE);
+ if (!trylock_page(page)) {
+ list_move(&pc->lru, list);
+ put_page(page);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ yield();
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ continue;
+ }
+ if (mem_cgroup_move_account(page, pc, mem, &init_mem_cgroup)) {
+ /* some conflict */
+ list_move(&pc->lru, list);
+ unlock_page(page);
put_page(page);
- if (--count <= 0) {
- count = FORCE_UNCHARGE_BATCH;
- cond_resched();
- }
- } else
- cond_resched();
- spin_lock_irqsave(&mz->lru_lock, flags);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ yield();
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ } else {
+ unlock_page(page);
+ put_page(page);
+ }
+ if (atomic_read(&mem->css.cgroup->count) > 0)
+ break;
}
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
Index: mmtom-2.6.27-rc3+/Documentation/controllers/memory.txt
===================================================================
--- mmtom-2.6.27-rc3+.orig/Documentation/controllers/memory.txt
+++ mmtom-2.6.27-rc3+/Documentation/controllers/memory.txt
@@ -207,7 +207,8 @@ The memory.force_empty gives an interfac
# echo 1 > memory.force_empty
-will drop all charges in cgroup. Currently, this is maintained for test.
+will drop all charges in the cgroup and move them to the default cgroup.
+Currently, this is maintained for testing.
4. Testing
@@ -238,8 +239,8 @@ reclaimed.
A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
-tasks have migrated away from it. Such charges are automatically dropped at
-rmdir() if there are no tasks.
+tasks have migrated away from it. Such charges are automatically moved to
+the root cgroup at rmdir() if there are no tasks.
5. TODO
* [RFC][PATCH 3/14] memcg: atomic_flags
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
2008-08-22 11:30 ` [RFC][PATCH 1/14] memcg: unlimted root cgroup KAMEZAWA Hiroyuki
2008-08-22 11:31 ` [RFC][PATCH 2/14] memcg: rewrite force_empty KAMEZAWA Hiroyuki
@ 2008-08-22 11:32 ` KAMEZAWA Hiroyuki
2008-08-26 4:55 ` Balbir Singh
2008-08-26 8:46 ` kamezawa.hiroyu
2008-08-22 11:33 ` [RFC][PATCH 4/14] delay page_cgroup freeing KAMEZAWA Hiroyuki
` (12 subsequent siblings)
15 siblings, 2 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:32 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
This patch makes page_cgroup->flags be modified with atomic ops and defines
functions (and macros) to access it.
This patch by itself makes memcg slower, but its final purpose is
to remove lock_page_cgroup() and allow fast access to page_cgroup.
Before modifying the memory resource controller in that direction, these
atomic operations on flags are necessary.
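To make the macro scheme concrete, this is roughly what two of the instantiations below expand to (a sketch derived from the macro definitions in this patch, not extra code to apply):

/* TESTPCGFLAG(Cache, CACHE) expands to an atomic test helper: */
static inline int PcgCache(struct page_cgroup *pc)
{
	return test_bit(Pcg_CACHE, &pc->flags);
}

/* __SETPCGFLAG(Cache, CACHE) expands to the non-atomic variant, used only
 * while the page_cgroup is not yet visible to other CPUs:
 */
static inline void __SetPcgCache(struct page_cgroup *pc)
{
	__set_bit(Pcg_CACHE, &pc->flags);
}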
Changelog (v1) -> (v2)
- no changes
Changelog (preview) -> (v1):
- patch ordering is changed.
- Added macro for defining functions for Test/Set/Clear bit.
- made the names of flags shorter.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 108 +++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 77 insertions(+), 31 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -163,12 +163,57 @@ struct page_cgroup {
struct list_head lru; /* per cgroup LRU list */
struct page *page;
struct mem_cgroup *mem_cgroup;
- int flags;
+ unsigned long flags;
};
-#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
-#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
+
+enum {
+ /* flags for mem_cgroup */
+ Pcg_CACHE, /* charged as cache */
+ /* flags for LRU placement */
+ Pcg_ACTIVE, /* page is active in this cgroup */
+ Pcg_FILE, /* page is file system backed */
+ Pcg_UNEVICTABLE, /* page is unevictableable */
+};
+
+#define TESTPCGFLAG(uname, lname) \
+static inline int Pcg##uname(struct page_cgroup *pc) \
+ { return test_bit(Pcg_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname) \
+static inline void SetPcg##uname(struct page_cgroup *pc)\
+ { set_bit(Pcg_##lname, &pc->flags); }
+
+#define CLEARPCGFLAG(uname, lname) \
+static inline void ClearPcg##uname(struct page_cgroup *pc) \
+ { clear_bit(Pcg_##lname, &pc->flags); }
+
+#define __SETPCGFLAG(uname, lname) \
+static inline void __SetPcg##uname(struct page_cgroup *pc)\
+ { __set_bit(Pcg_##lname, &pc->flags); }
+
+#define __CLEARPCGFLAG(uname, lname) \
+static inline void __ClearPcg##uname(struct page_cgroup *pc) \
+ { __clear_bit(Pcg_##lname, &pc->flags); }
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+__SETPCGFLAG(Cache, CACHE)
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+__SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+__SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
+
static int page_cgroup_nid(struct page_cgroup *pc)
{
@@ -189,14 +234,15 @@ enum charge_type {
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
-static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, int flags,
- bool charge)
+static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
+ struct page_cgroup *pc,
+ bool charge)
{
int val = (charge)? 1 : -1;
struct mem_cgroup_stat *stat = &mem->stat;
VM_BUG_ON(!irqs_disabled());
- if (flags & PAGE_CGROUP_FLAG_CACHE)
+ if (PcgCache(pc))
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
else
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
@@ -289,18 +335,18 @@ static void __mem_cgroup_remove_list(str
{
int lru = LRU_BASE;
- if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+ if (PcgUnevictable(pc))
lru = LRU_UNEVICTABLE;
else {
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ if (PcgActive(pc))
lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ if (PcgFile(pc))
lru += LRU_FILE;
}
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
- mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
+ mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
list_del(&pc->lru);
}
@@ -309,27 +355,27 @@ static void __mem_cgroup_add_list(struct
{
int lru = LRU_BASE;
- if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+ if (PcgUnevictable(pc))
lru = LRU_UNEVICTABLE;
else {
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ if (PcgActive(pc))
lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ if (PcgFile(pc))
lru += LRU_FILE;
}
MEM_CGROUP_ZSTAT(mz, lru) += 1;
list_add(&pc->lru, &mz->lists[lru]);
- mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
+ mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
}
static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
{
struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
- int active = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
- int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
- int unevictable = pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE;
+ int active = PcgActive(pc);
+ int file = PcgFile(pc);
+ int unevictable = PcgUnevictable(pc);
enum lru_list from = unevictable ? LRU_UNEVICTABLE :
(LRU_FILE * !!file + !!active);
@@ -339,14 +385,14 @@ static void __mem_cgroup_move_lists(stru
MEM_CGROUP_ZSTAT(mz, from) -= 1;
if (is_unevictable_lru(lru)) {
- pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
- pc->flags |= PAGE_CGROUP_FLAG_UNEVICTABLE;
+ ClearPcgActive(pc);
+ SetPcgUnevictable(pc);
} else {
if (is_active_lru(lru))
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+ SetPcgActive(pc);
else
- pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
- pc->flags &= ~PAGE_CGROUP_FLAG_UNEVICTABLE;
+ ClearPcgActive(pc);
+ ClearPcgUnevictable(pc);
}
MEM_CGROUP_ZSTAT(mz, lru) += 1;
@@ -580,18 +626,19 @@ static int mem_cgroup_charge_common(stru
pc->mem_cgroup = mem;
pc->page = page;
+ pc->flags = 0;
/*
* If a page is accounted as a page cache, insert to inactive list.
* If anon, insert to active list.
*/
if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
- pc->flags = PAGE_CGROUP_FLAG_CACHE;
+ __SetPcgCache(pc);
if (page_is_file_cache(page))
- pc->flags |= PAGE_CGROUP_FLAG_FILE;
+ __SetPcgFile(pc);
else
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+ __SetPcgActive(pc);
} else
- pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
+ __SetPcgActive(pc);
lock_page_cgroup(page);
if (unlikely(page_get_page_cgroup(page))) {
@@ -699,8 +746,7 @@ __mem_cgroup_uncharge_common(struct page
VM_BUG_ON(pc->page != page);
if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
- && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
- || page_mapped(page)))
+ && ((PcgCache(pc) || page_mapped(page))))
goto unlock;
mz = page_cgroup_zoneinfo(pc);
@@ -750,7 +796,7 @@ int mem_cgroup_prepare_migration(struct
if (pc) {
mem = pc->mem_cgroup;
css_get(&mem->css);
- if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
+ if (PcgCache(pc))
ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
}
unlock_page_cgroup(page);
* [RFC][PATCH 4/14] delay page_cgroup freeing
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2008-08-22 11:32 ` [RFC][PATCH 3/14] memcg: atomic_flags KAMEZAWA Hiroyuki
@ 2008-08-22 11:33 ` KAMEZAWA Hiroyuki
2008-08-26 11:46 ` Balbir Singh
2008-08-22 11:34 ` [RFC][PATCH 5/14] memcg: free page_cgroup by RCU KAMEZAWA Hiroyuki
` (11 subsequent siblings)
15 siblings, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:33 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
Free page_cgroup at mem_cgroup_uncharge() in a lazy way.
In mem_cgroup_uncharge_common(), we don't free the page_cgroup;
we just link it to a per-cpu free queue
and remove it later once a threshold is reached.
This patch is a base patch for the freeing-page_cgroup-by-RCU patch.
This patch depends on the make-page_cgroup_flag-atomic patch.
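In short, the mechanism looks like the following sketch (condensed from the diff below; names are those of the patch):

/* Each CPU keeps a singly-linked list of OBSOLETE page_cgroups; once the
 * list reaches MEMCG_LRU_THRESH entries it is drained in one batch.
 */
struct mem_cgroup_sink_list {
	int count;
	struct page_cgroup *next;
};
DEFINE_PER_CPU(struct mem_cgroup_sink_list, memcg_sink_list);

static void queue_obsolete_page_cgroup(struct page_cgroup *pc)
{
	struct mem_cgroup_sink_list *mcsl;
	int count;

	mcsl = &get_cpu_var(memcg_sink_list);	/* disables preemption */
	pc->next = mcsl->next;
	mcsl->next = pc;
	count = ++mcsl->count;
	put_cpu_var(memcg_sink_list);

	if (count >= MEMCG_LRU_THRESH)		/* 16 in this patch */
		__free_obsolete_page_cgroup();	/* drain this CPU's queue */
}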
Changelog: (v1) -> (v2)
- fixed mem_cgroup_move_list()'s checking of PcgObsolete()
- fixed force_empty.
Changelog: (preview) -> (v1)
- Clean up.
- renamed functions
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 122 ++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 110 insertions(+), 12 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -164,11 +164,13 @@ struct page_cgroup {
struct page *page;
struct mem_cgroup *mem_cgroup;
unsigned long flags;
+ struct page_cgroup *next;
};
enum {
/* flags for mem_cgroup */
Pcg_CACHE, /* charged as cache */
+ Pcg_OBSOLETE, /* this page cgroup is invalid (unused) */
/* flags for LRU placement */
Pcg_ACTIVE, /* page is active in this cgroup */
Pcg_FILE, /* page is file system backed */
@@ -199,6 +201,10 @@ static inline void __ClearPcg##uname(str
TESTPCGFLAG(Cache, CACHE)
__SETPCGFLAG(Cache, CACHE)
+/* No "Clear" routine for OBSOLETE flag */
+TESTPCGFLAG(Obsolete, OBSOLETE);
+SETPCGFLAG(Obsolete, OBSOLETE);
+
/* LRU management flags (from global-lru definition) */
TESTPCGFLAG(File, FILE)
SETPCGFLAG(File, FILE)
@@ -225,6 +231,18 @@ static enum zone_type page_cgroup_zid(st
return page_zonenum(pc->page);
}
+/*
+ * per-cpu slot for freeing page_cgroup in lazy manner.
+ * All page_cgroup linked to this list is OBSOLETE.
+ */
+struct mem_cgroup_sink_list {
+ int count;
+ struct page_cgroup *next;
+};
+DEFINE_PER_CPU(struct mem_cgroup_sink_list, memcg_sink_list);
+#define MEMCG_LRU_THRESH (16)
+
+
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -440,7 +458,7 @@ void mem_cgroup_move_lists(struct page *
/*
* check against the race with force_empty.
*/
- if (likely(mem == pc->mem_cgroup))
+ if (!PcgObsolete(pc) && likely(mem == pc->mem_cgroup))
__mem_cgroup_move_lists(pc, lru);
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
@@ -531,6 +549,10 @@ unsigned long mem_cgroup_isolate_pages(u
list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
if (scan >= nr_to_scan)
break;
+
+ if (PcgObsolete(pc))
+ continue;
+
page = pc->page;
if (unlikely(!PageLRU(page)))
@@ -563,6 +585,81 @@ unsigned long mem_cgroup_isolate_pages(u
}
/*
+ * Free obsolete page_cgroups which is linked to per-cpu drop list.
+ */
+
+static void __free_obsolete_page_cgroup(void)
+{
+ struct mem_cgroup *memcg;
+ struct page_cgroup *pc, *next;
+ struct mem_cgroup_per_zone *mz, *page_mz;
+ struct mem_cgroup_sink_list *mcsl;
+ unsigned long flags;
+
+ mcsl = &get_cpu_var(memcg_sink_list);
+ next = mcsl->next;
+ mcsl->next = NULL;
+ mcsl->count = 0;
+ put_cpu_var(memcg_sink_list);
+
+ mz = NULL;
+
+ local_irq_save(flags);
+ while (next) {
+ pc = next;
+ VM_BUG_ON(!PcgObsolete(pc));
+ next = pc->next;
+ prefetch(next);
+ page_mz = page_cgroup_zoneinfo(pc);
+ memcg = pc->mem_cgroup;
+ if (page_mz != mz) {
+ if (mz)
+ spin_unlock(&mz->lru_lock);
+ mz = page_mz;
+ spin_lock(&mz->lru_lock);
+ }
+ __mem_cgroup_remove_list(mz, pc);
+ css_put(&memcg->css);
+ kmem_cache_free(page_cgroup_cache, pc);
+ }
+ if (mz)
+ spin_unlock(&mz->lru_lock);
+ local_irq_restore(flags);
+}
+
+static void free_obsolete_page_cgroup(struct page_cgroup *pc)
+{
+ int count;
+ struct mem_cgroup_sink_list *mcsl;
+
+ mcsl = &get_cpu_var(memcg_sink_list);
+ pc->next = mcsl->next;
+ mcsl->next = pc;
+ count = ++mcsl->count;
+ put_cpu_var(memcg_sink_list);
+ if (count >= MEMCG_LRU_THRESH)
+ __free_obsolete_page_cgroup();
+}
+
+/*
+ * Used when freeing memory resource controller to remove all
+ * page_cgroup (in obsolete list).
+ */
+static DEFINE_MUTEX(memcg_force_drain_mutex);
+
+static void mem_cgroup_local_force_drain(struct work_struct *work)
+{
+ __free_obsolete_page_cgroup();
+}
+
+static void mem_cgroup_all_force_drain(void)
+{
+ mutex_lock(&memcg_force_drain_mutex);
+ schedule_on_each_cpu(mem_cgroup_local_force_drain);
+ mutex_unlock(&memcg_force_drain_mutex);
+}
+
+/*
* Charge the memory controller for page usage.
* Return
* 0 if the charge was successful
@@ -627,6 +724,7 @@ static int mem_cgroup_charge_common(stru
pc->mem_cgroup = mem;
pc->page = page;
pc->flags = 0;
+ pc->next = NULL;
/*
* If a page is accounted as a page cache, insert to inactive list.
* If anon, insert to active list.
@@ -729,8 +827,6 @@ __mem_cgroup_uncharge_common(struct page
{
struct page_cgroup *pc;
struct mem_cgroup *mem;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
if (mem_cgroup_subsys.disabled)
return;
@@ -748,20 +844,14 @@ __mem_cgroup_uncharge_common(struct page
if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
&& ((PcgCache(pc) || page_mapped(page))))
goto unlock;
-
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
-
+ mem = pc->mem_cgroup;
+ SetPcgObsolete(pc);
page_assign_page_cgroup(page, NULL);
unlock_page_cgroup(page);
- mem = pc->mem_cgroup;
res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
+ free_obsolete_page_cgroup(pc);
- kmem_cache_free(page_cgroup_cache, pc);
return;
unlock:
unlock_page_cgroup(page);
@@ -937,6 +1027,14 @@ static void mem_cgroup_force_empty_list(
spin_lock_irqsave(&mz->lru_lock, flags);
while (!list_empty(list)) {
pc = list_entry(list->prev, struct page_cgroup, lru);
+ if (PcgObsolete(pc)) {
+ list_move(&pc->lru, list);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ mem_cgroup_all_force_drain();
+ yield();
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ continue;
+ }
page = pc->page;
if (!get_page_unless_zero(page)) {
list_move(&pc->lru, list);
* [RFC][PATCH 5/14] memcg: free page_cgroup by RCU
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
` (3 preceding siblings ...)
2008-08-22 11:33 ` [RFC][PATCH 4/14] delay page_cgroup freeing KAMEZAWA Hiroyuki
@ 2008-08-22 11:34 ` KAMEZAWA Hiroyuki
2008-08-28 10:06 ` Balbir Singh
2008-08-22 11:35 ` [RFC][PATCH 6/14] memcg: lockless page cgroup KAMEZAWA Hiroyuki
` (10 subsequent siblings)
15 siblings, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:34 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
Free page_cgroup by RCU.
This makes access to page->page_cgroup RCU-safe.
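For reference, the hand-off uses the standard call_rcu()/container_of() pattern; a condensed sketch follows (helper names here are illustrative, the real ones are in the diff below).

/* The detached per-cpu list is wrapped in a small work item and freed
 * from an RCU callback, so readers under rcu_read_lock() never see a
 * page_cgroup freed under them.
 */
struct page_cgroup_rcu_work {
	struct rcu_head head;
	struct page_cgroup *list;
};

static void free_obsolete_list_cb(struct rcu_head *head)
{
	struct page_cgroup_rcu_work *work =
		container_of(head, struct page_cgroup_rcu_work, head);

	/* walk work->list and kmem_cache_free() each entry (see the diff) */
	kfree(work);
}

static int defer_free_obsolete_list(struct page_cgroup *list)
{
	struct page_cgroup_rcu_work *work = kmalloc(sizeof(*work), GFP_ATOMIC);

	if (!work)
		return -ENOMEM;
	INIT_RCU_HEAD(&work->head);
	work->list = list;
	call_rcu(&work->head, free_obsolete_list_cb);
	return 0;
}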
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 44 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 36 insertions(+), 8 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -588,19 +588,23 @@ unsigned long mem_cgroup_isolate_pages(u
* Free obsolete page_cgroups which is linked to per-cpu drop list.
*/
-static void __free_obsolete_page_cgroup(void)
+struct page_cgroup_rcu_work {
+ struct rcu_head head;
+ struct page_cgroup *list;
+};
+
+static void __free_obsolete_page_cgroup_cb(struct rcu_head *head)
{
struct mem_cgroup *memcg;
struct page_cgroup *pc, *next;
struct mem_cgroup_per_zone *mz, *page_mz;
- struct mem_cgroup_sink_list *mcsl;
+ struct page_cgroup_rcu_work *work;
unsigned long flags;
- mcsl = &get_cpu_var(memcg_sink_list);
- next = mcsl->next;
- mcsl->next = NULL;
- mcsl->count = 0;
- put_cpu_var(memcg_sink_list);
+
+ work = container_of(head, struct page_cgroup_rcu_work, head);
+ next = work->list;
+ kfree(work);
mz = NULL;
@@ -627,6 +631,26 @@ static void __free_obsolete_page_cgroup(
local_irq_restore(flags);
}
+static int __free_obsolete_page_cgroup(void)
+{
+ struct page_cgroup_rcu_work *work;
+ struct mem_cgroup_sink_list *mcsl;
+
+ work = kmalloc(sizeof(*work), GFP_ATOMIC);
+ if (!work)
+ return -ENOMEM;
+ INIT_RCU_HEAD(&work->head);
+
+ mcsl = &get_cpu_var(memcg_sink_list);
+ work->list = mcsl->next;
+ mcsl->next = NULL;
+ mcsl->count = 0;
+ put_cpu_var(memcg_sink_list);
+
+ call_rcu(&work->head, __free_obsolete_page_cgroup_cb);
+ return 0;
+}
+
static void free_obsolete_page_cgroup(struct page_cgroup *pc)
{
int count;
@@ -649,13 +673,17 @@ static DEFINE_MUTEX(memcg_force_drain_mu
static void mem_cgroup_local_force_drain(struct work_struct *work)
{
- __free_obsolete_page_cgroup();
+ int ret;
+ do {
+ ret = __free_obsolete_page_cgroup();
+ } while (ret);
}
static void mem_cgroup_all_force_drain(void)
{
mutex_lock(&memcg_force_drain_mutex);
schedule_on_each_cpu(mem_cgroup_local_force_drain);
+ synchronize_rcu();
mutex_unlock(&memcg_force_drain_mutex);
}
* [RFC][PATCH 6/14] memcg: lockless page cgroup
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
` (4 preceding siblings ...)
2008-08-22 11:34 ` [RFC][PATCH 5/14] memcg: free page_cgroup by RCU KAMEZAWA Hiroyuki
@ 2008-08-22 11:35 ` KAMEZAWA Hiroyuki
2008-09-09 5:40 ` Daisuke Nishimura
2008-08-22 11:36 ` [RFC][PATCH 7/14] memcg: add prefetch to spinlock KAMEZAWA Hiroyuki
` (9 subsequent siblings)
15 siblings, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:35 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
This patch removes lock_page_cgroup(). Now, page_cgroup is guarded by RCU.
To remove lock_page_cgroup(), we have to confirm there is no race.
Anon pages:
* pages are charged/uncharged only when first-mapped/last-unmapped.
page_mapcount() handles that.
(And... pte_lock() is always held in any racy case.)
Swap pages:
There would be a race because the charge is done before lock_page().
This patch moves mem_cgroup_charge() under lock_page().
File pages: (not shmem)
* pages are charged/uncharged only when they are added to/removed from the radix-tree.
In this case, the page lock is always held.
Install page:
Is it worth charging this special mapped page, which is (maybe) not on the LRU?
I think not.
I removed charge/uncharge from install_page().
Page migration:
We precharge it and map it back under lock_page(). This should be treated
as a special case.
Freeing of page_cgroup is done under RCU.
After this patch, page_cgroup can be accessed via struct page->page_cgroup
under the following conditions:
1. The page is file cache and on the radix-tree.
(means lock_page() or mapping->tree_lock is held.)
2. The page is an anonymous page and mapped.
(means pte_lock is held.)
3. under RCU and the page_cgroup is not Obsolete.
The typical style for "3" is the following.
**
rcu_read_lock();
pc = page_get_page_cgroup(page);
if (pc && !PcgObsolete(pc)) {
......
}
rcu_read_unlock();
**
This is now under test. Don't apply if you're not brave.
Changelog: (v1) -> (v2)
- Added Documentation.
Changelog: (preview) -> (v1)
- Added comments.
- Fixed page migration.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Documentation/controllers/memory.txt | 16 ++++
include/linux/mm_types.h | 2
mm/memcontrol.c | 125 +++++++++++++----------------------
mm/memory.c | 16 +---
4 files changed, 70 insertions(+), 89 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -142,20 +142,6 @@ struct mem_cgroup {
static struct mem_cgroup init_mem_cgroup;
/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock. We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin). But since
- * bit_spin_lock doesn't actually set that lock bit in a non-debug
- * uniprocessor kernel, we should avoid setting it here too.
- */
-#define PAGE_CGROUP_LOCK_BIT 0x0
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
-#else
-#define PAGE_CGROUP_LOCK 0x0
-#endif
-
-/*
* A page_cgroup page is associated with every page descriptor. The
* page_cgroup helps us identify information about the cgroup
*/
@@ -317,35 +303,14 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
}
-static inline int page_cgroup_locked(struct page *page)
-{
- return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
{
- VM_BUG_ON(!page_cgroup_locked(page));
- page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
+ rcu_assign_pointer(page->page_cgroup, pc);
}
struct page_cgroup *page_get_page_cgroup(struct page *page)
{
- return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void lock_page_cgroup(struct page *page)
-{
- bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
- return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
- bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+ return rcu_dereference(page->page_cgroup);
}
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
@@ -440,29 +405,22 @@ void mem_cgroup_move_lists(struct page *
if (mem_cgroup_subsys.disabled)
return;
- /*
- * We cannot lock_page_cgroup while holding zone's lru_lock,
- * because other holders of lock_page_cgroup can be interrupted
- * with an attempt to rotate_reclaimable_page. But we cannot
- * safely get to page_cgroup without it, so just try_lock it:
- * mem_cgroup_isolate_pages allows for page left on wrong list.
- */
- if (!try_lock_page_cgroup(page))
- return;
-
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
if (pc) {
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
/*
- * check against the race with force_empty.
+ * check against the race with force_empty. pc->mem_cgroup is
+ * valid if pc is valid, because the page is under page_lock; the
+ * move function will not change pc->mem_cgroup.
*/
- if (!PcgObsolete(pc) && likely(mem == pc->mem_cgroup))
+ if (!PcgObsolete(pc))
__mem_cgroup_move_lists(pc, lru);
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
- unlock_page_cgroup(page);
+ rcu_read_unlock();
}
/*
@@ -766,14 +724,9 @@ static int mem_cgroup_charge_common(stru
} else
__SetPcgActive(pc);
- lock_page_cgroup(page);
- if (unlikely(page_get_page_cgroup(page))) {
- unlock_page_cgroup(page);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
- kmem_cache_free(page_cgroup_cache, pc);
- goto done;
- }
+ /* Double counting race condition ? */
+ VM_BUG_ON(page_get_page_cgroup(page));
+
page_assign_page_cgroup(page, pc);
mz = page_cgroup_zoneinfo(pc);
@@ -781,8 +734,6 @@ static int mem_cgroup_charge_common(stru
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
- unlock_page_cgroup(page);
-done:
return 0;
out:
css_put(&mem->css);
@@ -807,6 +758,28 @@ int mem_cgroup_charge(struct page *page,
return 0;
if (unlikely(!mm))
mm = &init_mm;
+ /*
+ * Check for the pre-charged case of an anonymous page,
+ * i.e. page migration.
+ *
+ * Under page migration, the new page (the target of migration) is charged
+ * before being mapped, and page->mapping points to an anon_vma.
+ * Check here whether we've already charged this page or not.
+ *
+ * But in this case, we don't charge a page which is newly
+ * allocated. It should be locked to avoid races.
+ */
+ if (PageAnon(page)) {
+ struct page_cgroup *pc;
+ VM_BUG_ON(!PageLocked(page));
+ rcu_read_lock();
+ pc = page_get_page_cgroup(page);
+ if (pc && !PcgObsolete(pc)) {
+ rcu_read_unlock();
+ return 0;
+ }
+ rcu_read_unlock();
+ }
return mem_cgroup_charge_common(page, mm, gfp_mask,
MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
}
@@ -824,20 +797,21 @@ int mem_cgroup_cache_charge(struct page
*
* For GFP_NOWAIT case, the page may be pre-charged before calling
* add_to_page_cache(). (See shmem.c) check it here and avoid to call
- * charge twice. (It works but has to pay a bit larger cost.)
+ * charge twice.
+ *
+ * Note: page migration doesn't call add_to_page_cache(). We can ignore
+ * the case.
*/
if (!(gfp_mask & __GFP_WAIT)) {
struct page_cgroup *pc;
-
- lock_page_cgroup(page);
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
- if (pc) {
+ if (pc && !PcgObsolete(pc)) {
VM_BUG_ON(pc->page != page);
VM_BUG_ON(!pc->mem_cgroup);
- unlock_page_cgroup(page);
return 0;
}
- unlock_page_cgroup(page);
+ rcu_read_unlock();
}
if (unlikely(!mm))
@@ -862,27 +836,26 @@ __mem_cgroup_uncharge_common(struct page
/*
* Check if our page_cgroup is valid
*/
- lock_page_cgroup(page);
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
- if (unlikely(!pc))
- goto unlock;
+ if (unlikely(!pc) || PcgObsolete(pc))
+ goto out;
VM_BUG_ON(pc->page != page);
if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
&& ((PcgCache(pc) || page_mapped(page))))
- goto unlock;
+ goto out;
mem = pc->mem_cgroup;
SetPcgObsolete(pc);
page_assign_page_cgroup(page, NULL);
- unlock_page_cgroup(page);
res_counter_uncharge(&mem->res, PAGE_SIZE);
free_obsolete_page_cgroup(pc);
+out:
+ rcu_read_unlock();
return;
-unlock:
- unlock_page_cgroup(page);
}
void mem_cgroup_uncharge_page(struct page *page)
@@ -909,15 +882,15 @@ int mem_cgroup_prepare_migration(struct
if (mem_cgroup_subsys.disabled)
return 0;
- lock_page_cgroup(page);
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
- if (pc) {
+ if (pc && !PcgObsolete(pc)) {
mem = pc->mem_cgroup;
css_get(&mem->css);
if (PcgCache(pc))
ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
}
- unlock_page_cgroup(page);
+ rcu_read_unlock();
if (mem) {
ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
ctype, mem);
Index: mmtom-2.6.27-rc3+/mm/memory.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memory.c
+++ mmtom-2.6.27-rc3+/mm/memory.c
@@ -1323,18 +1323,14 @@ static int insert_page(struct vm_area_st
pte_t *pte;
spinlock_t *ptl;
- retval = mem_cgroup_charge(page, mm, GFP_KERNEL);
- if (retval)
- goto out;
-
retval = -EINVAL;
if (PageAnon(page))
- goto out_uncharge;
+ goto out;
retval = -ENOMEM;
flush_dcache_page(page);
pte = get_locked_pte(mm, addr, &ptl);
if (!pte)
- goto out_uncharge;
+ goto out;
retval = -EBUSY;
if (!pte_none(*pte))
goto out_unlock;
@@ -1350,8 +1346,6 @@ static int insert_page(struct vm_area_st
return retval;
out_unlock:
pte_unmap_unlock(pte, ptl);
-out_uncharge:
- mem_cgroup_uncharge_page(page);
out:
return retval;
}
@@ -2325,16 +2319,16 @@ static int do_swap_page(struct mm_struct
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
}
+ lock_page(page);
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
ret = VM_FAULT_OOM;
+ unlock_page(page);
goto out;
}
mark_page_accessed(page);
- lock_page(page);
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
/*
* Back out if somebody else already faulted in this pte.
Index: mmtom-2.6.27-rc3+/include/linux/mm_types.h
===================================================================
--- mmtom-2.6.27-rc3+.orig/include/linux/mm_types.h
+++ mmtom-2.6.27-rc3+/include/linux/mm_types.h
@@ -93,7 +93,7 @@ struct page {
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- unsigned long page_cgroup;
+ struct page_cgroup *page_cgroup;
#endif
#ifdef CONFIG_KMEMCHECK
Index: mmtom-2.6.27-rc3+/Documentation/controllers/memory.txt
===================================================================
--- mmtom-2.6.27-rc3+.orig/Documentation/controllers/memory.txt
+++ mmtom-2.6.27-rc3+/Documentation/controllers/memory.txt
@@ -151,7 +151,21 @@ The memory controller uses the following
1. zone->lru_lock is used for selecting pages to be isolated
2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone)
-3. lock_page_cgroup() is used to protect page->page_cgroup
+
+Access to page_cgroup via struct page->page_cgroup is safe while
+
+1. The page is file cache and on the radix-tree.
+ (means mapping->tree_lock or lock_page() should be held.)
+2. The page is an anonymous page and is guaranteed to be mapped.
+ (means pte_lock should be held.)
+3. under rcu_read_lock() and !PcgObsolete(pc)
+
+In any case, the user should use page_get_page_cgroup().
+Accessing page_cgroup->flags is not dangerous.
+Accessing page_cgroup->mem_cgroup or page_cgroup->lru is a
+little more dangerous; you should avoid it outside of mm/memcontrol.c.
+
+
3. User Interface
* [RFC][PATCH 7/14] memcg: add prefetch to spinlock
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
` (5 preceding siblings ...)
2008-08-22 11:35 ` [RFC][PATCH 6/14] memcg: lockless page cgroup KAMEZAWA Hiroyuki
@ 2008-08-22 11:36 ` KAMEZAWA Hiroyuki
2008-08-28 11:00 ` Balbir Singh
2008-08-22 11:37 ` [RFC][PATCH 8/14] memcg: make mapping null before uncharge KAMEZAWA Hiroyuki
` (8 subsequent siblings)
15 siblings, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:36 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
Address of "mz" can be calculated in easy way.
prefetch it (we do spin_lock.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -707,6 +707,8 @@ static int mem_cgroup_charge_common(stru
}
}
+ mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zonenum(page));
+ prefetchw(mz);
pc->mem_cgroup = mem;
pc->page = page;
pc->flags = 0;
@@ -729,7 +731,6 @@ static int mem_cgroup_charge_common(stru
page_assign_page_cgroup(page, pc);
- mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
* [RFC][PATCH 8/14] memcg: make mapping null before uncharge
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
` (6 preceding siblings ...)
2008-08-22 11:36 ` [RFC][PATCH 7/14] memcg: add prefetch to spinlock KAMEZAWA Hiroyuki
@ 2008-08-22 11:37 ` KAMEZAWA Hiroyuki
2008-08-22 11:38 ` [RFC][PATCH 9/14] memcg: add page_cgroup.h file KAMEZAWA Hiroyuki
` (7 subsequent siblings)
15 siblings, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:37 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
This patch tries to make page->mapping be NULL before
mem_cgroup_uncharge_cache_page() is called.
"page->mapping == NULL" is a good check for "whether the page is still on the
radix-tree or not".
This patch also adds a BUG_ON() to mem_cgroup_uncharge_cache_page().
Changelog (v1) -> (v2)
- fixed page migration
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/filemap.c | 2 +-
mm/memcontrol.c | 1 +
mm/migrate.c | 12 +++++++++---
3 files changed, 11 insertions(+), 4 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/filemap.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/filemap.c
+++ mmtom-2.6.27-rc3+/mm/filemap.c
@@ -116,12 +116,12 @@ void __remove_from_page_cache(struct pag
{
struct address_space *mapping = page->mapping;
- mem_cgroup_uncharge_cache_page(page);
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
BUG_ON(page_mapped(page));
+ mem_cgroup_uncharge_cache_page(page);
/*
* Some filesystems seem to re-dirty the page even after
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -867,6 +867,7 @@ void mem_cgroup_uncharge_page(struct pag
void mem_cgroup_uncharge_cache_page(struct page *page)
{
VM_BUG_ON(page_mapped(page));
+ VM_BUG_ON(page->mapping);
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
Index: mmtom-2.6.27-rc3+/mm/migrate.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/migrate.c
+++ mmtom-2.6.27-rc3+/mm/migrate.c
@@ -330,8 +330,6 @@ static int migrate_page_move_mapping(str
__inc_zone_page_state(newpage, NR_FILE_PAGES);
spin_unlock_irq(&mapping->tree_lock);
- if (!PageSwapCache(newpage))
- mem_cgroup_uncharge_cache_page(page);
return 0;
}
@@ -378,7 +376,15 @@ static void migrate_page_copy(struct pag
#endif
ClearPagePrivate(page);
set_page_private(page, 0);
- page->mapping = NULL;
+ /* page->mapping contains a flag for PageAnon() */
+ if (PageAnon(page)) {
+ /* This page is uncharged at try_to_unmap(). */
+ page->mapping = NULL;
+ } else {
+ /* Obsolete file cache should be uncharged */
+ page->mapping = NULL;
+ mem_cgroup_uncharge_cache_page(page);
+ }
/*
* If any waiters have accumulated on the new page then
* [RFC][PATCH 9/14] memcg: add page_cgroup.h file
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
` (7 preceding siblings ...)
2008-08-22 11:37 ` [RFC][PATCH 8/14] memcg: make mapping null before uncharge KAMEZAWA Hiroyuki
@ 2008-08-22 11:38 ` KAMEZAWA Hiroyuki
2008-08-22 11:39 ` [RFC][PATCH 10/14] memcg: replace res_counter KAMEZAWA Hiroyuki
` (6 subsequent siblings)
15 siblings, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:38 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
Experimental.
page_cgroup is a struct for accounting each page under the memory resource
controller. Currently, it's only used in memcontrol.c, but there
are possible users of this struct (now).
(*) Because page_cgroup is an extended/on-demand mem_map by nature,
there are people who want to use it for recording information.
If there are no users, this patch is not necessary.
Changelog (v1) -> (v2)
- modified "how to use" comments.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/page_cgroup.h | 110 ++++++++++++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 81 --------------------------------
2 files changed, 111 insertions(+), 80 deletions(-)
Index: mmtom-2.6.27-rc3+/include/linux/page_cgroup.h
===================================================================
--- /dev/null
+++ mmtom-2.6.27-rc3+/include/linux/page_cgroup.h
@@ -0,0 +1,110 @@
+#ifndef __LINUX_PAGE_CGROUP_H
+#define __LINUX_PAGE_CGROUP_H
+
+/*
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup.
+ *
+ * This is pointed from struct page by page->page_cgroup pointer.
+ * This pointer is safe under RCU. If a page_cgroup is marked as
+ * Obsolete, don't access it.
+ *
+ * You can access page_cgroup in a safe way under...
+ *
+ * 1. the page is file cache and on radix-tree.
+ * (means you should hold lock_page() or mapping->tree_lock)
+ * 2. the page is anonymous and mapped.
+ * (means you should hold pte_lock)
+ * 3. under RCU.
+ *
+ * A typical way to access page_cgroup under RCU is the following.
+ *
+ * rcu_read_lock();
+ * pc = page_get_page_cgroup(page);
+ * if (pc && !PcgObsolete(pc)) {
+ * ......
+ * }
+ * rcu_read_unlock();
+ *
+ * But access to the members of page_cgroup should be restricted.
+ * The members lru, mem_cgroup, and next are dangerous.
+ */
+struct page_cgroup {
+ struct list_head lru; /* per zone/memcg LRU list */
+ struct page *page; /* the page this accounts for */
+ struct mem_cgroup *mem_cgroup; /* belongs to this mem_cgroup */
+ unsigned long flags;
+ struct page_cgroup *next;
+};
+
+enum {
+ /* flags for mem_cgroup */
+ Pcg_CACHE, /* charged as cache */
+ Pcg_OBSOLETE, /* this page cgroup is invalid (unused) */
+ /* flags for LRU placement */
+ Pcg_ACTIVE, /* page is active in this cgroup */
+ Pcg_FILE, /* page is file system backed */
+ Pcg_UNEVICTABLE, /* page is unevictableable */
+};
+
+#define TESTPCGFLAG(uname, lname) \
+static inline int Pcg##uname(struct page_cgroup *pc) \
+ { return test_bit(Pcg_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname) \
+static inline void SetPcg##uname(struct page_cgroup *pc)\
+ { set_bit(Pcg_##lname, &pc->flags); }
+
+#define CLEARPCGFLAG(uname, lname) \
+static inline void ClearPcg##uname(struct page_cgroup *pc) \
+ { clear_bit(Pcg_##lname, &pc->flags); }
+
+#define __SETPCGFLAG(uname, lname) \
+static inline void __SetPcg##uname(struct page_cgroup *pc)\
+ { __set_bit(Pcg_##lname, &pc->flags); }
+
+#define __CLEARPCGFLAG(uname, lname) \
+static inline void __ClearPcg##uname(struct page_cgroup *pc) \
+ { __clear_bit(Pcg_##lname, &pc->flags); }
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+__SETPCGFLAG(Cache, CACHE)
+
+/* No "Clear" routine for OBSOLETE flag */
+TESTPCGFLAG(Obsolete, OBSOLETE);
+SETPCGFLAG(Obsolete, OBSOLETE);
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+__SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+__SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
+
+
+static int page_cgroup_nid(struct page_cgroup *pc)
+{
+ return page_to_nid(pc->page);
+}
+
+static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
+{
+ return page_zonenum(pc->page);
+}
+
+struct page_cgroup *page_get_page_cgroup(struct page *page)
+{
+ return rcu_dereference(page->page_cgroup);
+}
+
+
+#endif
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -34,6 +34,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/pagemap.h>
+#include <linux/page_cgroup.h>
#include <asm/uaccess.h>
@@ -141,81 +142,6 @@ struct mem_cgroup {
};
static struct mem_cgroup init_mem_cgroup;
-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
- struct list_head lru; /* per cgroup LRU list */
- struct page *page;
- struct mem_cgroup *mem_cgroup;
- unsigned long flags;
- struct page_cgroup *next;
-};
-
-enum {
- /* flags for mem_cgroup */
- Pcg_CACHE, /* charged as cache */
- Pcg_OBSOLETE, /* this page cgroup is invalid (unused) */
- /* flags for LRU placement */
- Pcg_ACTIVE, /* page is active in this cgroup */
- Pcg_FILE, /* page is file system backed */
- Pcg_UNEVICTABLE, /* page is unevictableable */
-};
-
-#define TESTPCGFLAG(uname, lname) \
-static inline int Pcg##uname(struct page_cgroup *pc) \
- { return test_bit(Pcg_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname) \
-static inline void SetPcg##uname(struct page_cgroup *pc)\
- { set_bit(Pcg_##lname, &pc->flags); }
-
-#define CLEARPCGFLAG(uname, lname) \
-static inline void ClearPcg##uname(struct page_cgroup *pc) \
- { clear_bit(Pcg_##lname, &pc->flags); }
-
-#define __SETPCGFLAG(uname, lname) \
-static inline void __SetPcg##uname(struct page_cgroup *pc)\
- { __set_bit(Pcg_##lname, &pc->flags); }
-
-#define __CLEARPCGFLAG(uname, lname) \
-static inline void __ClearPcg##uname(struct page_cgroup *pc) \
- { __clear_bit(Pcg_##lname, &pc->flags); }
-
-/* Cache flag is set only once (at allocation) */
-TESTPCGFLAG(Cache, CACHE)
-__SETPCGFLAG(Cache, CACHE)
-
-/* No "Clear" routine for OBSOLETE flag */
-TESTPCGFLAG(Obsolete, OBSOLETE);
-SETPCGFLAG(Obsolete, OBSOLETE);
-
-/* LRU management flags (from global-lru definition) */
-TESTPCGFLAG(File, FILE)
-SETPCGFLAG(File, FILE)
-__SETPCGFLAG(File, FILE)
-CLEARPCGFLAG(File, FILE)
-
-TESTPCGFLAG(Active, ACTIVE)
-SETPCGFLAG(Active, ACTIVE)
-__SETPCGFLAG(Active, ACTIVE)
-CLEARPCGFLAG(Active, ACTIVE)
-
-TESTPCGFLAG(Unevictable, UNEVICTABLE)
-SETPCGFLAG(Unevictable, UNEVICTABLE)
-CLEARPCGFLAG(Unevictable, UNEVICTABLE)
-
-
-static int page_cgroup_nid(struct page_cgroup *pc)
-{
- return page_to_nid(pc->page);
-}
-
-static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
-{
- return page_zonenum(pc->page);
-}
/*
* per-cpu slot for freeing page_cgroup in lazy manner.
@@ -308,11 +234,6 @@ static void page_assign_page_cgroup(stru
rcu_assign_pointer(page->page_cgroup, pc);
}
-struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
- return rcu_dereference(page->page_cgroup);
-}
-
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
* [RFC][PATCH 10/14] memcg: replace res_counter
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
` (8 preceding siblings ...)
2008-08-22 11:38 ` [RFC][PATCH 9/14] memcg: add page_cgroup.h file KAMEZAWA Hiroyuki
@ 2008-08-22 11:39 ` KAMEZAWA Hiroyuki
2008-08-27 0:44 ` Daisuke Nishimura
2008-08-22 11:40 ` [RFC][PATCH 11/14] memcg: mem_cgroup private ID KAMEZAWA Hiroyuki
` (5 subsequent siblings)
15 siblings, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:39 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
For the mem+swap controller, we'll use a special counter which has 2 values and
2 limits. Before doing that, replace the current res_counter with a new mem_counter.
This patch doesn't have much meaning other than clean-up before the mem+swap
controller. The new mem_counter's counter is "unsigned long" and accounts the resource
by # of pages. (I think "unsigned long" is safe on 32bit machines when we count
the resource by # of pages rather than bytes: 2^32 pages of 4KB covers 16TB, far
more than a 32bit machine can hold.) No changes in user interface.
The user interface is still in "bytes".
With "unsigned long long", we have to be nervous about reading a torn, temporary
value without a lock.
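For illustration, a minimal sketch (this helper is not part of the patch) of why
a native word is easier for a lockless reader:

	/*
	 * On i386, reading a 64-bit "unsigned long long" counter takes two
	 * 32-bit loads, so an unlocked reader can observe a torn value and
	 * must take res.lock.  A single "unsigned long" page count is read
	 * in one load, so a quick peek like this is safe enough:
	 */
	static inline unsigned long mem_counter_peek_pages(struct mem_counter *cnt)
	{
		return cnt->pages;	/* one word; no spin_lock needed */
	}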
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 177 +++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 151 insertions(+), 26 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -17,10 +17,9 @@
* GNU General Public License for more details.
*/
-#include <linux/res_counter.h>
+#include <linux/mm.h>
#include <linux/memcontrol.h>
#include <linux/cgroup.h>
-#include <linux/mm.h>
#include <linux/smp.h>
#include <linux/page-flags.h>
#include <linux/backing-dev.h>
@@ -118,12 +117,21 @@ struct mem_cgroup_lru_info {
* no reclaim occurs from a cgroup at it's low water mark, this is
* a feature that will be implemented much later in the future.
*/
+struct mem_counter {
+ unsigned long pages;
+ unsigned long pages_limit;
+ unsigned long max_pages;
+ unsigned long failcnt;
+ spinlock_t lock;
+};
+
+
struct mem_cgroup {
struct cgroup_subsys_state css;
/*
* the counter to account for memory usage
*/
- struct res_counter res;
+ struct mem_counter res;
/*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
@@ -161,6 +169,14 @@ enum charge_type {
MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
};
+/* Private File ID for memory resource controller's interface */
+enum {
+ MEMCG_FILE_PAGE_LIMIT,
+ MEMCG_FILE_PAGE_USAGE,
+ MEMCG_FILE_PAGE_MAX_USAGE,
+ MEMCG_FILE_FAILCNT,
+};
+
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
@@ -234,6 +250,81 @@ static void page_assign_page_cgroup(stru
rcu_assign_pointer(page->page_cgroup, pc);
}
+/*
+ * counter for memory resource accounting.
+ */
+static void mem_counter_init(struct mem_cgroup *mem)
+{
+ memset(&mem->res, 0, sizeof(mem->res));
+ mem->res.pages_limit = ~0UL;
+ spin_lock_init(&mem->res.lock);
+}
+
+static int mem_counter_charge(struct mem_cgroup *mem, long num)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&mem->res.lock, flags);
+ if (mem->res.pages + num > mem->res.pages_limit)
+ goto busy_out;
+ mem->res.pages += num;
+ if (mem->res.pages > mem->res.max_pages)
+ mem->res.max_pages = mem->res.pages;
+ spin_unlock_irqrestore(&mem->res.lock, flags);
+ return 0;
+busy_out:
+ mem->res.failcnt++;
+ spin_unlock_irqrestore(&mem->res.lock, flags);
+ return -EBUSY;
+}
+
+static void mem_counter_uncharge_page(struct mem_cgroup *mem, long num)
+{
+ unsigned long flags;
+ spin_lock_irqsave(&mem->res.lock, flags);
+ mem->res.pages -= num;
+ spin_unlock_irqrestore(&mem->res.lock, flags);
+}
+
+static int mem_counter_set_pages_limit(struct mem_cgroup *mem,
+ unsigned long num)
+{
+ unsigned long flags;
+ int ret = -EBUSY;
+
+ spin_lock_irqsave(&mem->res.lock, flags);
+ if (mem->res.pages < num) {
+ mem->res.pages_limit = num;
+ ret = 0;
+ }
+ spin_unlock_irqrestore(&mem->res.lock, flags);
+ return ret;
+}
+
+static int mem_counter_check_under_pages_limit(struct mem_cgroup *mem)
+{
+ if (mem->res.pages < mem->res.pages_limit)
+ return 1;
+ return 0;
+}
+
+static void mem_counter_reset(struct mem_cgroup *mem, int member)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&mem->res.lock, flags);
+ switch (member) {
+ case MEMCG_FILE_PAGE_MAX_USAGE:
+ mem->res.max_pages = 0;
+ break;
+ case MEMCG_FILE_FAILCNT:
+ mem->res.failcnt = 0;
+ break;
+ }
+ spin_unlock_irqrestore(&mem->res.lock, flags);
+}
+
+
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
@@ -356,7 +447,7 @@ int mem_cgroup_calc_mapped_ratio(struct
* usage is recorded in bytes. But, here, we assume the number of
* physical pages can be represented by "long" on any arch.
*/
- total = (long) (mem->res.usage >> PAGE_SHIFT) + 1L;
+ total = (long) (mem->res.pages >> PAGE_SHIFT) + 1L;
rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
return (int)((rss * 100L) / total);
}
@@ -605,7 +696,7 @@ static int mem_cgroup_charge_common(stru
css_get(&memcg->css);
}
- while (res_counter_charge(&mem->res, PAGE_SIZE)) {
+ while (mem_counter_charge(mem, 1)) {
if (!(gfp_mask & __GFP_WAIT))
goto out;
@@ -619,7 +710,7 @@ static int mem_cgroup_charge_common(stru
* Check the limit again to see if the reclaim reduced the
* current usage of the cgroup before giving up
*/
- if (res_counter_check_under_limit(&mem->res))
+ if (mem_counter_check_under_pages_limit(mem))
continue;
if (!nr_retries--) {
@@ -772,7 +863,7 @@ __mem_cgroup_uncharge_common(struct page
SetPcgObsolete(pc);
page_assign_page_cgroup(page, NULL);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ mem_counter_uncharge_page(mem, 1);
free_obsolete_page_cgroup(pc);
out:
@@ -880,8 +971,12 @@ int mem_cgroup_resize_limit(struct mem_c
int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
int progress;
int ret = 0;
+ unsigned long new_lim = (unsigned long)(val >> PAGE_SHIFT);
- while (res_counter_set_limit(&memcg->res, val)) {
+ if (val & PAGE_SIZE)
+ new_lim += 1;
+
+ while (mem_counter_set_pages_limit(memcg, new_lim)) {
if (signal_pending(current)) {
ret = -EINTR;
break;
@@ -913,7 +1008,7 @@ int mem_cgroup_move_account(struct page
from_mz = mem_cgroup_zoneinfo(from, nid, zid);
to_mz = mem_cgroup_zoneinfo(to, nid, zid);
- if (res_counter_charge(&to->res, PAGE_SIZE)) {
+ if (mem_counter_charge(to, 1)) {
/* Now, we assume no_limit...no failure here. */
return ret;
}
@@ -921,14 +1016,14 @@ int mem_cgroup_move_account(struct page
if (spin_trylock(&to_mz->lru_lock)) {
__mem_cgroup_remove_list(from_mz, pc);
css_put(&from->css);
- res_counter_uncharge(&from->res, PAGE_SIZE);
+ mem_counter_uncharge_page(from, 1);
pc->mem_cgroup = to;
css_get(&to->css);
__mem_cgroup_add_list(to_mz, pc);
ret = 0;
spin_unlock(&to_mz->lru_lock);
} else {
- res_counter_uncharge(&to->res, PAGE_SIZE);
+ mem_counter_uncharge_page(to, 1);
}
return ret;
@@ -1008,7 +1103,7 @@ static int mem_cgroup_force_empty(struct
* active_list <-> inactive_list while we don't take a lock.
* So, we have to do loop here until all lists are empty.
*/
- while (mem->res.usage > 0) {
+ while (mem->res.pages > 0) {
if (atomic_read(&mem->css.cgroup->count) > 0)
goto out;
for_each_node_state(node, N_POSSIBLE)
@@ -1028,13 +1123,43 @@ out:
static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
{
- return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
- cft->private);
+ unsigned long long ret;
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+ switch (cft->private) {
+ case MEMCG_FILE_PAGE_LIMIT:
+ ret = (unsigned long long)mem->res.pages_limit << PAGE_SHIFT;
+ break;
+ case MEMCG_FILE_PAGE_USAGE:
+ ret = (unsigned long long)mem->res.pages << PAGE_SHIFT;
+ break;
+ case MEMCG_FILE_PAGE_MAX_USAGE:
+ ret = (unsigned long long)mem->res.max_pages << PAGE_SHIFT;
+ break;
+ case MEMCG_FILE_FAILCNT:
+ ret = (unsigned long long)mem->res.failcnt;
+ break;
+ default:
+ BUG();
+ }
+ return ret;
}
/*
* The user of this function is...
* RES_LIMIT.
*/
+static int call_memparse(const char *buf, unsigned long long *val)
+{
+ char *end;
+
+ *val = memparse((char *)buf, &end);
+ if (*end != '\0')
+ return -EINVAL;
+ *val = PAGE_ALIGN(*val);
+ return 0;
+}
+
+
static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
const char *buffer)
{
@@ -1043,13 +1168,13 @@ static int mem_cgroup_write(struct cgrou
int ret;
switch (cft->private) {
- case RES_LIMIT:
+ case MEMCG_FILE_PAGE_LIMIT:
if (memcg->no_limit == 1) {
ret = -EINVAL;
break;
}
/* This function does all necessary parse...reuse it */
- ret = res_counter_memparse_write_strategy(buffer, &val);
+ ret = call_memparse(buffer, &val);
if (!ret)
ret = mem_cgroup_resize_limit(memcg, val);
break;
@@ -1066,12 +1191,12 @@ static int mem_cgroup_reset(struct cgrou
mem = mem_cgroup_from_cont(cont);
switch (event) {
- case RES_MAX_USAGE:
- res_counter_reset_max(&mem->res);
- break;
- case RES_FAILCNT:
- res_counter_reset_failcnt(&mem->res);
+ case MEMCG_FILE_PAGE_MAX_USAGE:
+ case MEMCG_FILE_FAILCNT:
+ mem_counter_reset(mem, event);
break;
+ default:
+ BUG();
}
return 0;
}
@@ -1135,24 +1260,24 @@ static int mem_control_stat_show(struct
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
- .private = RES_USAGE,
+ .private = MEMCG_FILE_PAGE_USAGE,
.read_u64 = mem_cgroup_read,
},
{
.name = "max_usage_in_bytes",
- .private = RES_MAX_USAGE,
+ .private = MEMCG_FILE_PAGE_MAX_USAGE,
.trigger = mem_cgroup_reset,
.read_u64 = mem_cgroup_read,
},
{
.name = "limit_in_bytes",
- .private = RES_LIMIT,
+ .private = MEMCG_FILE_PAGE_LIMIT,
.write_string = mem_cgroup_write,
.read_u64 = mem_cgroup_read,
},
{
.name = "failcnt",
- .private = RES_FAILCNT,
+ .private = MEMCG_FILE_FAILCNT,
.trigger = mem_cgroup_reset,
.read_u64 = mem_cgroup_read,
},
@@ -1241,7 +1366,7 @@ mem_cgroup_create(struct cgroup_subsys *
return ERR_PTR(-ENOMEM);
}
- res_counter_init(&mem->res);
+ mem_counter_init(mem);
for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
--
* [RFC][PATCH 11/14] memcg: mem_cgroup private ID
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
` (9 preceding siblings ...)
2008-08-22 11:39 ` [RFC][PATCH 10/14] memcg: replace res_counter KAMEZAWA Hiroyuki
@ 2008-08-22 11:40 ` KAMEZAWA Hiroyuki
2008-08-22 11:41 ` [RFC][PATCH 12/14] memcg: mem+swap controller Kconfig KAMEZAWA Hiroyuki
` (4 subsequent siblings)
15 siblings, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:40 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
This patch adds a private ID to each memory resource controller.
This is for the mem+swap controller.
When we record memcg information for each swap entry, remembering a pointer
can consume 8 (or 4) bytes per entry. This is large.
This patch limits the number of memory resource controllers to 32768 and
gives an ID to each controller. (1 bit will be used for a flag..)
This can help to save space in future.
ID "0" is used to indicate an "invalid" or "not used" ID.
ID "1" is used for root.
(*) is 32768 too small?
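As a sketch of what the 15bit ID buys per swap entry (this is the record which
patch 14 of this series introduces as struct swap_cgroup):

	struct swap_cgroup {			/* 2 bytes per swap slot */
		unsigned short memcgrp_id:15;	/* room for 32767 valid IDs */
		unsigned short count:1;		/* the 1 bit reserved for a flag */
	};

i.e. 2 bytes instead of an 8-byte (64bit) or 4-byte (32bit) mem_cgroup pointer
for every swap entry we track.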
Changelog:
- new patch in v2.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 79 insertions(+), 1 deletion(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -40,6 +40,7 @@
struct cgroup_subsys mem_cgroup_subsys __read_mostly;
static struct kmem_cache *page_cgroup_cache __read_mostly;
#define MEM_CGROUP_RECLAIM_RETRIES 5
+#define NR_MEMCGRP_ID (32767)
/*
* Statistics for memory cgroup.
@@ -144,6 +145,10 @@ struct mem_cgroup {
*/
struct mem_cgroup_stat stat;
/*
+ * private ID
+ */
+ unsigned short memcgrp_id;
+ /*
* special flags.
*/
int no_limit;
@@ -324,6 +329,69 @@ static void mem_counter_reset(struct mem
spin_unlock_irqrestore(&mem->res.lock, flags);
}
+/*
+ * private ID management for memcg.
+ * set/clear bitmap is called by create/destroy and done under cgroup_mutex.
+ */
+static unsigned long *memcgrp_id_bitmap;
+static struct mem_cgroup **memcgrp_array;
+int nr_memcgrp;
+
+static int memcgrp_id_init(void)
+{
+ void *addr;
+ unsigned long bitmap_size = NR_MEMCGRP_ID/8;
+ unsigned long array_size = NR_MEMCGRP_ID * sizeof(void*);
+
+ addr = kmalloc(bitmap_size, GFP_KERNEL | __GFP_ZERO);
+ if (!addr)
+ return -ENOMEM;
+ memcgrp_array = vmalloc(array_size);
+ if (!memcgrp_array) {
+ kfree(memcgrp_array);
+ return -ENOMEM;
+ }
+ memcgrp_id_bitmap = addr;
+ /* 0 for "invalid id" */
+ set_bit(0, memcgrp_id_bitmap);
+ set_bit(1, memcgrp_id_bitmap);
+ memcgrp_array[0] = NULL;
+ memcgrp_array[1] = &init_mem_cgroup;
+ init_mem_cgroup.memcgrp_id = 1;
+ nr_memcgrp = 1;
+ return 0;
+}
+
+static unsigned int get_new_memcgrp_id(struct mem_cgroup *mem)
+{
+ int id;
+ id = find_first_zero_bit(memcgrp_id_bitmap, NR_MEMCGRP_ID);
+
+ if (id == NR_MEMCGRP_ID - 1)
+ return -ENOSPC;
+ set_bit(id, memcgrp_id_bitmap);
+ memcgrp_array[id] = mem;
+ mem->memcgrp_id = id;
+
+ return 0;
+}
+
+static void free_memcgrp_id(struct mem_cgroup *mem)
+{
+ memcgrp_array[mem->memcgrp_id] = NULL;
+ clear_bit(mem->memcgrp_id , memcgrp_id_bitmap);
+}
+
+/*
+ * please access this only while you can guarantee the mem_cgroup exists.
+ */
+
+static struct mem_cgroup *mem_cgroup_id_lookup(unsigned short id)
+{
+ return memcgrp_array[id];
+}
+
+
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
@@ -1358,12 +1426,19 @@ mem_cgroup_create(struct cgroup_subsys *
int node;
if (unlikely((cont->parent) == NULL)) {
+ if (memcgrp_id_init())
+ return ERR_PTR(-ENOMEM);
mem = &init_mem_cgroup;
page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
} else {
mem = mem_cgroup_alloc();
if (!mem)
return ERR_PTR(-ENOMEM);
+
+ if (get_new_memcgrp_id(mem)) {
+ kfree(mem);
+ return ERR_PTR(-ENOSPC);
+ }
}
mem_counter_init(mem);
@@ -1380,8 +1455,10 @@ mem_cgroup_create(struct cgroup_subsys *
free_out:
for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);
- if (cont->parent != NULL)
+ if (cont->parent != NULL) {
+ free_memcgrp_id(mem);
mem_cgroup_free(mem);
+ }
return ERR_PTR(-ENOMEM);
}
@@ -1398,6 +1475,7 @@ static void mem_cgroup_destroy(struct cg
int node;
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ free_memcgrp_id(mem);
for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);
--
* [RFC][PATCH 12/14] memcg: mem+swap controller Kconfig
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
` (10 preceding siblings ...)
2008-08-22 11:40 ` [RFC][PATCH 11/14] memcg: mem_cgroup private ID KAMEZAWA Hiroyuki
@ 2008-08-22 11:41 ` KAMEZAWA Hiroyuki
2008-08-22 11:41 ` [RFC][PATCH 13/14] memcg: mem+swap counter KAMEZAWA Hiroyuki
` (3 subsequent siblings)
15 siblings, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:41 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
Add a config option for the mem+swap controller and define a helper macro.
To keep the series stacked as patches of readable size, this marks the config
as Broken....a later patch will remove this word.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
init/Kconfig | 10 ++++++++++
mm/memcontrol.c | 7 +++++++
2 files changed, 17 insertions(+)
Index: mmtom-2.6.27-rc3+/init/Kconfig
===================================================================
--- mmtom-2.6.27-rc3+.orig/init/Kconfig
+++ mmtom-2.6.27-rc3+/init/Kconfig
@@ -415,6 +415,16 @@ config CGROUP_MEM_RES_CTLR
This config option also selects MM_OWNER config option, which
could in turn add some fork/exit overhead.
+config CGROUP_MEM_RES_CTLR_SWAP
+ bool "Memory Resource Controller Swap Extension (Broken)"
+ depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
+ help
+ Add swap management feature to memory resource controller. By this,
+ you can control swap consumption per cgroup by limiting the total
+ amount of memory+swap. Because this records additional information
+ at swap-out, this consumes extra memory. If you use 32bit system or
+ small memory system, please be careful to enable this.
+
config CGROUP_MEMRLIMIT_CTLR
bool "Memory resource limit controls for cgroups"
depends on CGROUPS && RESOURCE_COUNTERS && MMU
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -42,6 +42,13 @@ static struct kmem_cache *page_cgroup_ca
#define MEM_CGROUP_RECLAIM_RETRIES 5
#define NR_MEMCGRP_ID (32767)
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#define do_swap_account (1)
+#else
+#define do_swap_account (0)
+#endif
+
+
/*
* Statistics for memory cgroup.
*/
--
* [RFC][PATCH 13/14] memcg: mem+swap counter
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
` (11 preceding siblings ...)
2008-08-22 11:41 ` [RFC][PATCH 12/14] memcg: mem+swap controller Kconfig KAMEZAWA Hiroyuki
@ 2008-08-22 11:41 ` KAMEZAWA Hiroyuki
2008-08-28 8:51 ` Daisuke Nishimura
2008-08-22 11:44 ` [RFC][PATCH 14/14]memcg: mem+swap accounting KAMEZAWA Hiroyuki
` (2 subsequent siblings)
15 siblings, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:41 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
Add a counter for swap accounting to the memory resource controller.
This adds 1 counter and 1 limit:
res.swaps and res.memsw_limit. res.swaps is a counter for the # of swap entries in use.
(Later, you'll see that res.swaps shows the number of swap pages _on_ disk.)
These counters work as
res.pages + res.swaps < res.memsw_limit.
This means the sum of the on-memory resource and the on-swap resource is limited.
So, swap is in effect accounted when an anonymous page is charged. By this, the
user can avoid unexpected massive use of swap, and kswapd (the global LRU)
is not affected by the swap resource control feature when it tries add_to_swap.
...the swap is considered to be already accounted as a page.
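For example, with a hypothetical configuration and 4KB pages:

	memory.limit_in_bytes          = 100M  -> res.pages_limit = 25600 pages
	memory.memswap_limit_in_bytes  = 300M  -> res.memsw_limit = 76800 pages

then at most 100M of the group's pages can be resident, and memory plus
on-disk swap together cannot exceed 300M, so at most 200M of its data can
sit in swap.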
To avoid too many #ifdefs, this patch uses the "do_swap_account" macro.
If config=n, the compiler does a good job and drops those pieces of code.
This patch doesn't include the swap accounting infrastructure yet..so,
CONFIG_CGROUP_MEM_RES_CTLR_SWAP is still broken.
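A minimal sketch of that pattern (the call site here is illustrative; the macro
itself comes from patch 12):

	if (do_swap_account)		/* constant 0 when the config is off, */
		mem->res.swaps += 1;	/* so the compiler drops this branch;
					   no #ifdef is needed at the call site */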
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 112 insertions(+), 9 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -129,6 +129,8 @@ struct mem_counter {
unsigned long pages;
unsigned long pages_limit;
unsigned long max_pages;
+ unsigned long swaps;
+ unsigned long memsw_limit;
unsigned long failcnt;
spinlock_t lock;
};
@@ -178,6 +180,7 @@ DEFINE_PER_CPU(struct mem_cgroup_sink_li
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
+ MEM_CGROUP_CHARGE_TYPE_SWAPOUT,
MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
};
@@ -186,6 +189,8 @@ enum {
MEMCG_FILE_PAGE_LIMIT,
MEMCG_FILE_PAGE_USAGE,
MEMCG_FILE_PAGE_MAX_USAGE,
+ MEMCG_FILE_SWAP_USAGE,
+ MEMCG_FILE_MEMSW_LIMIT,
MEMCG_FILE_FAILCNT,
};
@@ -269,6 +274,7 @@ static void mem_counter_init(struct mem_
{
memset(&mem->res, 0, sizeof(mem->res));
mem->res.pages_limit = ~0UL;
+ mem->res.memsw_limit = ~0UL;
spin_lock_init(&mem->res.lock);
}
@@ -279,6 +285,10 @@ static int mem_counter_charge(struct mem
spin_lock_irqsave(&mem->res.lock, flags);
if (mem->res.pages + num > mem->res.pages_limit)
goto busy_out;
+ if (do_swap_account &&
+ (mem->res.pages + mem->res.swaps > mem->res.memsw_limit))
+ goto busy_out;
+
mem->res.pages += num;
if (mem->res.pages > mem->res.max_pages)
mem->res.max_pages = mem->res.pages;
@@ -298,6 +308,27 @@ static void mem_counter_uncharge_page(st
spin_unlock_irqrestore(&mem->res.lock, flags);
}
+static void mem_counter_recharge_swap(struct mem_cgroup *mem)
+{
+ unsigned long flags;
+ if (do_swap_account) {
+ spin_lock_irqsave(&mem->res.lock, flags);
+ mem->res.pages -= 1;
+ mem->res.swaps += 1;
+ spin_unlock_irqrestore(&mem->res.lock, flags);
+ }
+}
+
+static void mem_counter_uncharge_swap(struct mem_cgroup *mem)
+{
+ unsigned long flags;
+ if (do_swap_account) {
+ spin_lock_irqsave(&mem->res.lock, flags);
+ mem->res.swaps -= 1;
+ spin_unlock_irqrestore(&mem->res.lock, flags);
+ }
+}
+
static int mem_counter_set_pages_limit(struct mem_cgroup *mem,
unsigned long num)
{
@@ -305,7 +336,9 @@ static int mem_counter_set_pages_limit(s
int ret = -EBUSY;
spin_lock_irqsave(&mem->res.lock, flags);
- if (mem->res.pages < num) {
+ if (mem->res.memsw_limit < num) {
+ ret = -EINVAL;
+ } else if (mem->res.pages < num) {
mem->res.pages_limit = num;
ret = 0;
}
@@ -313,6 +346,23 @@ static int mem_counter_set_pages_limit(s
return ret;
}
+static int
+mem_counter_set_memsw_limit(struct mem_cgroup *mem, unsigned long num)
+{
+ unsigned long flags;
+ int ret = -EBUSY;
+
+ spin_lock_irqsave(&mem->res.lock, flags);
+ if (mem->res.pages_limit > num) {
+ ret = -EINVAL;
+ } else if (mem->res.swaps + mem->res.pages < num) {
+ mem->res.memsw_limit = num;
+ ret = 0;
+ }
+ spin_unlock_irqrestore(&mem->res.lock, flags);
+ return ret;
+}
+
static int mem_counter_check_under_pages_limit(struct mem_cgroup *mem)
{
if (mem->res.pages < mem->res.pages_limit)
@@ -320,6 +370,15 @@ static int mem_counter_check_under_pages
return 0;
}
+static int mem_counter_check_under_memsw_limit(struct mem_cgroup *mem)
+{
+ if (!do_swap_account)
+ return 1;
+ if (mem->res.pages + mem->res.swaps < mem->res.memsw_limit)
+ return 1;
+ return 0;
+}
+
static void mem_counter_reset(struct mem_cgroup *mem, int member)
{
unsigned long flags;
@@ -772,20 +831,28 @@ static int mem_cgroup_charge_common(stru
}
while (mem_counter_charge(mem, 1)) {
+ int progress;
if (!(gfp_mask & __GFP_WAIT))
goto out;
- if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
- continue;
+ progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
/*
+ * When we hit memsw limit, return value of "progress"
+ * has no meaning. (some pages may just be changed to swap)
+ */
+ if (mem_counter_check_under_memsw_limit(mem) && progress)
+ continue;
+ /*
* try_to_free_mem_cgroup_pages() might not give us a full
* picture of reclaim. Some pages are reclaimed and might be
* moved to swap cache or just unmapped from the cgroup.
* Check the limit again to see if the reclaim reduced the
* current usage of the cgroup before giving up
*/
- if (mem_counter_check_under_pages_limit(mem))
+
+ if (!do_swap_account
+ && mem_counter_check_under_pages_limit(mem))
continue;
if (!nr_retries--) {
@@ -938,7 +1005,10 @@ __mem_cgroup_uncharge_common(struct page
SetPcgObsolete(pc);
page_assign_page_cgroup(page, NULL);
- mem_counter_uncharge_page(mem, 1);
+ if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+ mem_counter_recharge_swap(mem);
+ else
+ mem_counter_uncharge_page(mem, 1);
free_obsolete_page_cgroup(pc);
out:
@@ -1040,7 +1110,9 @@ int mem_cgroup_shrink_usage(struct mm_st
return 0;
}
-int mem_cgroup_resize_limit(struct mem_cgroup *memcg, unsigned long long val)
+int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
+ unsigned long long val,
+ bool memswap)
{
int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
@@ -1051,7 +1123,14 @@ int mem_cgroup_resize_limit(struct mem_c
if (val & PAGE_SIZE)
new_lim += 1;
- while (mem_counter_set_pages_limit(memcg, new_lim)) {
+ do {
+ if (memswap)
+ ret = mem_counter_set_memsw_limit(memcg, new_lim);
+ else
+ ret = mem_counter_set_pages_limit(memcg, new_lim);
+
+ if (!ret || ret == -EINVAL)
+ break;
if (signal_pending(current)) {
ret = -EINTR;
break;
@@ -1063,7 +1142,8 @@ int mem_cgroup_resize_limit(struct mem_c
progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL);
if (!progress)
retry_count--;
- }
+ } while (1);
+
return ret;
}
@@ -1214,6 +1294,12 @@ static u64 mem_cgroup_read(struct cgroup
case MEMCG_FILE_FAILCNT:
ret = (unsigned long long)mem->res.failcnt;
break;
+ case MEMCG_FILE_SWAP_USAGE:
+ ret = (unsigned long long)mem->res.swaps << PAGE_SHIFT;
+ break;
+ case MEMCG_FILE_MEMSW_LIMIT:
+ ret = (unsigned long long)mem->res.memsw_limit << PAGE_SHIFT;
+ break;
default:
BUG();
}
@@ -1240,9 +1326,13 @@ static int mem_cgroup_write(struct cgrou
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
unsigned long long val;
+ bool memswap = false;
int ret;
switch (cft->private) {
+ case MEMCG_FILE_MEMSW_LIMIT:
+ memswap = true;
+ /* Fall through */
case MEMCG_FILE_PAGE_LIMIT:
if (memcg->no_limit == 1) {
ret = -EINVAL;
@@ -1251,7 +1341,7 @@ static int mem_cgroup_write(struct cgrou
/* This function does all necessary parse...reuse it */
ret = call_memparse(buffer, &val);
if (!ret)
- ret = mem_cgroup_resize_limit(memcg, val);
+ ret = mem_cgroup_resize_limit(memcg, val, memswap);
break;
default:
ret = -EINVAL; /* should be BUG() ? */
@@ -1364,6 +1454,19 @@ static struct cftype mem_cgroup_files[]
.name = "stat",
.read_map = mem_control_stat_show,
},
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+ {
+ .name = "swap_in_bytes",
+ .private = MEMCG_FILE_SWAP_USAGE,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "memswap_limit_in_bytes",
+ .private = MEMCG_FILE_MEMSW_LIMIT,
+ .write_string = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read,
+ }
+#endif
};
static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
--
* [RFC][PATCH 14/14]memcg: mem+swap accounting
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
` (12 preceding siblings ...)
2008-08-22 11:41 ` [RFC][PATCH 13/14] memcg: mem+swap counter KAMEZAWA Hiroyuki
@ 2008-08-22 11:44 ` KAMEZAWA Hiroyuki
2008-09-01 7:15 ` Daisuke Nishimura
2008-08-22 13:20 ` [RFC][PATCH 0/14] Mem+Swap Controller v2 Balbir Singh
2008-08-22 15:34 ` kamezawa.hiroyu
15 siblings, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 11:44 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
Add the swap accounting feature to the memory resource controller.
Accounting is done with the following logic.
Swap-out:
- When add_to_swap_cache() is called, the swp_entry is marked as being under
page->page_cgroup->mem_cgroup.
- When the swap cache page is fully unmapped (the point where a page is normally
uncharged), we don't uncharge it.
- When the swap cache is deleted, we uncharge it from memory and charge it to
swaps. This op is done only when the swap cache is already charged.
res.pages -= 1, res.swaps += 1.
Swap-in:
- When add_to_swap_cache() is called, we do nothing.
- When the swap is mapped, we charge it to memory and uncharge it from swap.
res.pages += 1, res.swaps -= 1.
SwapCache-Deleting:
- If the page doesn't have a page_cgroup, there is nothing to do.
- If the page is still charged as swap, just uncharge memory.
(This can happen under shmem/tmpfs.)
- If the page is not charged as swap, res.pages -= 1, res.swaps += 1.
Swap-Freeing:
- If the swap entry is charged, res.swaps -= 1.
Almost all operations are done against the swap cache, which is Locked.
This patch uses an array to remember the owner of each swp_entry. Considering
x86-32, we should avoid using NORMAL memory and the vmalloc() area too much, so
this patch uses HIGHMEM pages, accessed under kmap_atomic(KM_USER0), to record
the information. The information is recorded in 2 bytes per swap page.
(The memory controller's id is defined to fit in an unsigned short.)
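As a worked example of the overhead (assuming 4KB pages and a hypothetical 4GB
swap device):

	4GB of swap / 4KB per slot   = 1,048,576 swap slots
	1,048,576 slots * 2 bytes    = 2MB of HIGHMEM map pages
	(2048 entries fit in one 4KB map page, so 512 map pages; the
	 vmalloc'ed array of page pointers is 512 entries, about 4KB
	 on a 64bit kernel)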
Changelog: (preview) -> (v2)
- removed radix-tree. just use array.
- removed linked-list.
- use memcgrp_id rather than a pointer.
- added force_empty (temporary) support.
This should be reworked in future. (But for now, this works well for us.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/swap.h | 38 +++++
init/Kconfig | 2
mm/memcontrol.c | 364 ++++++++++++++++++++++++++++++++++++++++++++++++++-
mm/migrate.c | 7
mm/swap_state.c | 7
mm/swapfile.c | 14 +
6 files changed, 422 insertions(+), 10 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -34,6 +34,10 @@
#include <linux/mm_inline.h>
#include <linux/pagemap.h>
#include <linux/page_cgroup.h>
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#endif
#include <asm/uaccess.h>
@@ -43,9 +47,28 @@ static struct kmem_cache *page_cgroup_ca
#define NR_MEMCGRP_ID (32767)
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
#define do_swap_account (1)
+
+static void
+swap_cgroup_delete_account(struct mem_cgroup *mem, struct page *page);
+
+static struct mem_cgroup *lookup_mem_cgroup_from_swap(struct page *page);
+static void swap_cgroup_clean_account(struct mem_cgroup *mem);
#else
#define do_swap_account (0)
+
+static void
+swap_cgroup_delete_account(struct mem_cgroup *mem, struct page *page)
+{
+}
+static struct mem_cgroup *lookup_mem_cgroup_from_swap(struct page *page)
+{
+ return NULL;
+}
+static void swap_cgroup_clean_account(struct mem_cgroup *mem)
+{
+}
#endif
@@ -889,6 +912,9 @@ static int mem_cgroup_charge_common(stru
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
+ /* We did swap-in, uncharge swap. */
+ if (do_swap_account && PageSwapCache(page))
+ swap_cgroup_delete_account(mem, page);
return 0;
out:
css_put(&mem->css);
@@ -899,6 +925,8 @@ err:
int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
{
+ struct mem_cgroup *memcg = NULL;
+
if (mem_cgroup_subsys.disabled)
return 0;
@@ -935,13 +963,19 @@ int mem_cgroup_charge(struct page *page,
}
rcu_read_unlock();
}
+ /* Swap-in ? */
+ if (do_swap_account && PageSwapCache(page))
+ memcg = lookup_mem_cgroup_from_swap(page);
+
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+ MEM_CGROUP_CHARGE_TYPE_MAPPED, memcg);
}
int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
{
+ struct mem_cgroup *memcg = NULL;
+
if (mem_cgroup_subsys.disabled)
return 0;
@@ -971,9 +1005,11 @@ int mem_cgroup_cache_charge(struct page
if (unlikely(!mm))
mm = &init_mm;
+ if (do_swap_account && PageSwapCache(page))
+ memcg = lookup_mem_cgroup_from_swap(page);
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
+ MEM_CGROUP_CHARGE_TYPE_CACHE, memcg);
}
/*
@@ -998,9 +1034,11 @@ __mem_cgroup_uncharge_common(struct page
VM_BUG_ON(pc->page != page);
- if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
- && ((PcgCache(pc) || page_mapped(page))))
- goto out;
+ if ((ctype != MEM_CGROUP_CHARGE_TYPE_FORCE))
+ if (PageSwapCache(page) || page_mapped(page) ||
+ (page->mapping && !PageAnon(page)))
+ goto out;
+
mem = pc->mem_cgroup;
SetPcgObsolete(pc);
page_assign_page_cgroup(page, NULL);
@@ -1577,6 +1615,8 @@ static void mem_cgroup_pre_destroy(struc
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
mem_cgroup_force_empty(mem);
+ if (do_swap_account)
+ swap_cgroup_clean_account(mem);
}
static void mem_cgroup_destroy(struct cgroup_subsys *ss,
@@ -1635,3 +1675,317 @@ struct cgroup_subsys mem_cgroup_subsys =
.attach = mem_cgroup_move_task,
.early_init = 0,
};
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+/*
+ * swap accounting infrastructure.
+ */
+DEFINE_MUTEX(swap_cgroup_mutex);
+spinlock_t swap_cgroup_lock[MAX_SWAPFILES];
+struct page **swap_cgroup_map[MAX_SWAPFILES];
+unsigned long swap_cgroup_pages[MAX_SWAPFILES];
+
+
+/* This definition is based on NR_MEM_CGROUP==32768 */
+struct swap_cgroup {
+ unsigned short memcgrp_id:15;
+ unsigned short count:1;
+};
+#define ENTS_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
+
+/*
+ * Called from get_swap_ent().
+ */
+int swap_cgroup_prepare(swp_entry_t ent, gfp_t mask)
+{
+ struct page *page;
+ unsigned long array_index = swp_offset(ent) / ENTS_PER_PAGE;
+ int type = swp_type(ent);
+ unsigned long flags;
+
+ if (swap_cgroup_map[type][array_index])
+ return 0;
+ page = alloc_page(mask | __GFP_HIGHMEM | __GFP_ZERO);
+ if (!page)
+ return -ENOMEM;
+ spin_lock_irqsave(&swap_cgroup_lock[type], flags);
+ if (swap_cgroup_map[type][array_index] == NULL) {
+ swap_cgroup_map[type][array_index] = page;
+ page = NULL;
+ }
+ spin_unlock_irqrestore(&swap_cgroup_lock[type], flags);
+
+ if (page)
+ __free_page(page);
+ return 0;
+}
+
+/**
+ * swap_cgroup_record_info
+ * @page ..... a page which is in some mem_cgroup.
+ * @entry .... swp_entry of the page. (or old swp_entry of the page)
+ * @delete ... if 0 add entry, if 1 remove entry.
+ *
+ * At set new value:
+ * This is called from add_to_swap_cache() after added to swapper_space.
+ * Then...this is called under page_lock() and this page is on radix-tree
+ * We're safe to access page->page_cgroup->mem_cgroup.
+ * This function never fails. (may leak information...but it's not Oops.)
+ *
+ * At deletion:
+ * Returns count is set or not.
+ */
+int swap_cgroup_record_info(struct page *page, swp_entry_t entry, bool del)
+{
+ unsigned long flags;
+ int type = swp_type(entry);
+ unsigned long offset = swp_offset(entry);
+ unsigned long array_index = offset/ENTS_PER_PAGE;
+ unsigned long index = offset & (ENTS_PER_PAGE - 1);
+ struct page *mappage;
+ struct swap_cgroup *map;
+ struct page_cgroup *pc = NULL;
+ int ret = 0;
+
+ if (!del) {
+ /*
+ * At swap-in, the page is added to swap cache before tied to
+ * mem_cgroup. This page will be finally charged at page fault.
+ * Ignore this at this point.
+ */
+ pc = page_get_page_cgroup(page);
+ if (!pc)
+ return ret;
+ }
+ if (!swap_cgroup_map[type])
+ return ret;
+ mappage = swap_cgroup_map[type][array_index];
+ if (!mappage)
+ return ret;
+
+ local_irq_save(flags);
+ map = kmap_atomic(mappage, KM_USER0);
+ if (!del) {
+ map[index].memcgrp_id = pc->mem_cgroup->memcgrp_id;
+ map[index].count = 0;
+ } else {
+ if (map[index].count) {
+ ret = map[index].memcgrp_id;
+ map[index].count = 0;
+ }
+ map[index].memcgrp_id = 0;
+ }
+ kunmap_atomic(mappage, KM_USER0);
+ local_irq_restore(flags);
+ return ret;
+}
+
+/*
+ * returns the mem_cgroup pointer the swp_entry is assigned to.
+ */
+static struct mem_cgroup *swap_cgroup_lookup(swp_entry_t entry)
+{
+ unsigned long flags;
+ int type = swp_type(entry);
+ unsigned long offset = swp_offset(entry);
+ unsigned long array_index = offset/ENTS_PER_PAGE;
+ unsigned long index = offset & (ENTS_PER_PAGE - 1);
+ struct page *mappage;
+ struct swap_cgroup *map;
+ unsigned short id;
+
+ if (!swap_cgroup_map[type])
+ return NULL;
+ mappage = swap_cgroup_map[type][array_index];
+ if (!mappage)
+ return NULL;
+
+ local_irq_save(flags);
+ map = kmap_atomic(mappage, KM_USER0);
+ id = map[index].memcgrp_id;
+ kunmap_atomic(mappage, KM_USER0);
+ local_irq_restore(flags);
+ return mem_cgroup_id_lookup(id);
+}
+
+static struct mem_cgroup *lookup_mem_cgroup_from_swap(struct page *page)
+{
+ swp_entry_t entry = { .val = page_private(page) };
+ return swap_cgroup_lookup(entry);
+}
+
+/*
+ * set/clear accounting information of swap_cgroup.
+ *
+ * Called when set/clear accounting information.
+ * returns 1 at success.
+ */
+static int swap_cgroup_account(struct mem_cgroup *memcg,
+ swp_entry_t entry, bool set)
+{
+ unsigned long flags;
+ int type = swp_type(entry);
+ unsigned long offset = swp_offset(entry);
+ unsigned long array_index = offset/ENTS_PER_PAGE;
+ unsigned long index = offset & (ENTS_PER_PAGE - 1);
+ struct page *mappage;
+ struct swap_cgroup *map;
+ int ret = 0;
+
+ if (!swap_cgroup_map[type])
+ return ret;
+ mappage = swap_cgroup_map[type][array_index];
+ if (!mappage)
+ return ret;
+
+
+ local_irq_save(flags);
+ map = kmap_atomic(mappage, KM_USER0);
+ if (map[index].memcgrp_id == memcg->memcgrp_id) {
+ if (set && map[index].count == 0) {
+ map[index].count = 1;
+ ret = 1;
+ } else if (!set && map[index].count == 1) {
+ map[index].count = 0;
+ ret = 1;
+ }
+ }
+ kunmap_atomic(mappage, KM_USER0);
+ local_irq_restore(flags);
+ return ret;
+}
+
+void swap_cgroup_delete_account(struct mem_cgroup *mem, struct page *page)
+{
+ swp_entry_t val = { .val = page_private(page) };
+ if (swap_cgroup_account(mem, val, false))
+ mem_counter_uncharge_swap(mem);
+}
+
+/*
+ * Called from delete_from_swap_cache() then, page is Locked! and
+ * swp_entry is still in use.
+ */
+void swap_cgroup_delete_swapcache(struct page *page, swp_entry_t entry)
+{
+ struct page_cgroup *pc;
+
+ pc = page_get_page_cgroup(page);
+ /* swap-in but not mapped. */
+ if (!pc)
+ return;
+
+ if (swap_cgroup_account(pc->mem_cgroup, entry, true))
+ __mem_cgroup_uncharge_common(page,
+ MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+ else if (page->mapping && !PageAnon(page))
+ __mem_cgroup_uncharge_common(page,
+ MEM_CGROUP_CHARGE_TYPE_CACHE);
+ else
+ __mem_cgroup_uncharge_common(page,
+ MEM_CGROUP_CHARGE_TYPE_MAPPED);
+ return;
+}
+
+void swap_cgroup_delete_swap(swp_entry_t entry)
+{
+ int ret;
+ struct mem_cgroup *mem;
+
+ ret = swap_cgroup_record_info(NULL, entry, true);
+ if (ret) {
+ mem = mem_cgroup_id_lookup(ret);
+ if (mem)
+ mem_counter_uncharge_swap(mem);
+ }
+}
+
+
+/*
+ * Forget all accounts under swap_cgroup of memcg.
+ * Called from destroying context.
+ */
+static void swap_cgroup_clean_account(struct mem_cgroup *memcg)
+{
+ int type;
+ unsigned long array_index, flags;
+ int index;
+ struct page *page;
+ struct swap_cgroup *map;
+
+ if (!memcg->res.swaps)
+ return;
+ mutex_lock(&swap_cgroup_mutex);
+ for (type = 0; type < MAX_SWAPFILES; type++) {
+ if (swap_cgroup_pages[type] == 0)
+ continue;
+ for (array_index = 0;
+ array_index < swap_cgroup_pages[type];
+ array_index++) {
+ page = swap_cgroup_map[type][array_index];
+ if (!page)
+ continue;
+ local_irq_save(flags);
+ map = kmap_atomic(page, KM_USER0);
+ for (index = 0; index < ENTS_PER_PAGE; index++) {
+ if (map[index].memcgrp_id
+ == memcg->memcgrp_id) {
+ map[index].memcgrp_id = 0;
+ map[index].count = 0;
+ }
+ }
+ kunmap_atomic(page, KM_USER0);
+ local_irq_restore(flags);
+ }
+ mutex_unlock(&swap_cgroup_mutex);
+ yield();
+ mutex_lock(&swap_cgroup_mutex);
+ }
+ mutex_unlock(&swap_cgroup_mutex);
+}
+
+/*
+ * called from swapon().
+ */
+int swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+ void *array;
+ int array_size;
+
+ VM_BUG_ON(swap_cgroup_map[type]);
+
+ array_size = ((max_pages/ENTS_PER_PAGE) + 1) * sizeof(void *);
+
+ array = vmalloc(array_size);
+ if (!array) {
+ printk("swap %d will not be accounted\n", type);
+ return -ENOMEM;
+ }
+ memset(array, 0, array_size);
+ mutex_lock(&swap_cgroup_mutex);
+ swap_cgroup_pages[type] = (max_pages/ENTS_PER_PAGE + 1);
+ swap_cgroup_map[type] = array;
+ mutex_unlock(&swap_cgroup_mutex);
+ spin_lock_init(&swap_cgroup_lock[type]);
+ return 0;
+}
+
+/*
+ * called from swapoff().
+ */
+void swap_cgroup_swapoff(int type)
+{
+ int i;
+ for (i = 0; i < swap_cgroup_pages[type]; i++) {
+ struct page *page = swap_cgroup_map[type][i];
+ if (page)
+ __free_page(page);
+ }
+ mutex_lock(&swap_cgroup_mutex);
+ vfree(swap_cgroup_map[type]);
+ swap_cgroup_map[type] = NULL;
+ mutex_unlock(&swap_cgroup_mutex);
+ swap_cgroup_pages[type] = 0;
+}
+
+#endif
Index: mmtom-2.6.27-rc3+/include/linux/swap.h
===================================================================
--- mmtom-2.6.27-rc3+.orig/include/linux/swap.h
+++ mmtom-2.6.27-rc3+/include/linux/swap.h
@@ -335,6 +335,44 @@ static inline void disable_swap_token(vo
put_swap_token(swap_token_mm);
}
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern int swap_cgroup_swapon(int type, unsigned long max_pages);
+extern void swap_cgroup_swapoff(int type);
+extern void swap_cgroup_delete_swap(swp_entry_t entry);
+extern int swap_cgroup_prepare(swp_entry_t ent, gfp_t mask);
+extern int swap_cgroup_record_info(struct page *, swp_entry_t ent, bool del);
+extern void swap_cgroup_delete_swapcache(struct page *page, swp_entry_t entry);
+
+#else
+static inline int swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+ return 0;
+}
+static inline void swap_cgroup_swapoff(int type)
+{
+ return;
+}
+static inline void swap_cgroup_delete_swap(swp_entry_t entry)
+{
+ return;
+}
+static inline int swap_cgroup_prepare(swp_entry_t ent, gfp_t mask)
+{
+ return 0;
+}
+static inline int
+ swap_cgroup_record_info(struct page *, swp_entry_t ent, bool del)
+{
+ return 0;
+}
+static inline
+void swap_cgroup_delete_swapcache(struct page *page, swp_entry_t entry)
+{
+ return;
+}
+#endif
+
+
#else /* CONFIG_SWAP */
#define total_swap_pages 0
Index: mmtom-2.6.27-rc3+/mm/swapfile.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/swapfile.c
+++ mmtom-2.6.27-rc3+/mm/swapfile.c
@@ -270,8 +270,9 @@ out:
return NULL;
}
-static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
+static int swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
{
+ unsigned long offset = swp_offset(entry);
int count = p->swap_map[offset];
if (count < SWAP_MAP_MAX) {
@@ -286,6 +287,7 @@ static int swap_entry_free(struct swap_i
swap_list.next = p - swap_info;
nr_swap_pages++;
p->inuse_pages--;
+ swap_cgroup_delete_swap(entry);
}
}
return count;
@@ -301,7 +303,7 @@ void swap_free(swp_entry_t entry)
p = swap_info_get(entry);
if (p) {
- swap_entry_free(p, swp_offset(entry));
+ swap_entry_free(p, entry);
spin_unlock(&swap_lock);
}
}
@@ -420,7 +422,7 @@ void free_swap_and_cache(swp_entry_t ent
p = swap_info_get(entry);
if (p) {
- if (swap_entry_free(p, swp_offset(entry)) == 1) {
+ if (swap_entry_free(p, entry) == 1) {
page = find_get_page(&swapper_space, entry.val);
if (page && !trylock_page(page)) {
page_cache_release(page);
@@ -1343,6 +1345,7 @@ asmlinkage long sys_swapoff(const char _
spin_unlock(&swap_lock);
mutex_unlock(&swapon_mutex);
vfree(swap_map);
+ swap_cgroup_swapoff(type);
inode = mapping->host;
if (S_ISBLK(inode->i_mode)) {
struct block_device *bdev = I_BDEV(inode);
@@ -1669,6 +1672,11 @@ asmlinkage long sys_swapon(const char __
1 /* header page */;
if (error)
goto bad_swap;
+
+ if (swap_cgroup_swapon(type, maxpages)) {
+ printk("We don't enable swap accounting because of"
+ "memory shortage\n");
+ }
}
if (nr_good_pages) {
Index: mmtom-2.6.27-rc3+/mm/swap_state.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/swap_state.c
+++ mmtom-2.6.27-rc3+/mm/swap_state.c
@@ -76,6 +76,9 @@ int add_to_swap_cache(struct page *page,
BUG_ON(PageSwapCache(page));
BUG_ON(PagePrivate(page));
BUG_ON(!PageSwapBacked(page));
+ error = swap_cgroup_prepare(entry, gfp_mask);
+ if (error)
+ return error;
error = radix_tree_preload(gfp_mask);
if (!error) {
page_cache_get(page);
@@ -89,6 +92,7 @@ int add_to_swap_cache(struct page *page,
total_swapcache_pages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(add_total);
+ swap_cgroup_record_info(page, entry, false);
}
spin_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
@@ -108,6 +112,8 @@ int add_to_swap_cache(struct page *page,
*/
void __delete_from_swap_cache(struct page *page)
{
+ swp_entry_t entry = { .val = page_private(page) };
+
BUG_ON(!PageLocked(page));
BUG_ON(!PageSwapCache(page));
BUG_ON(PageWriteback(page));
@@ -117,6 +123,7 @@ void __delete_from_swap_cache(struct pag
set_page_private(page, 0);
ClearPageSwapCache(page);
total_swapcache_pages--;
+ swap_cgroup_delete_swapcache(page, entry);
__dec_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(del_total);
}
Index: mmtom-2.6.27-rc3+/init/Kconfig
===================================================================
--- mmtom-2.6.27-rc3+.orig/init/Kconfig
+++ mmtom-2.6.27-rc3+/init/Kconfig
@@ -416,7 +416,7 @@ config CGROUP_MEM_RES_CTLR
could in turn add some fork/exit overhead.
config CGROUP_MEM_RES_CTLR_SWAP
- bool "Memory Resource Controller Swap Extension (Broken)"
+ bool "Memory Resource Controller Swap Extension (EXPERIMENTAL)"
depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
help
Add swap management feature to memory resource controller. By this,
Index: mmtom-2.6.27-rc3+/mm/migrate.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/migrate.c
+++ mmtom-2.6.27-rc3+/mm/migrate.c
@@ -339,6 +339,8 @@ static int migrate_page_move_mapping(str
*/
static void migrate_page_copy(struct page *newpage, struct page *page)
{
+ int was_swapcache = 0;
+
copy_highpage(newpage, page);
if (PageError(page))
@@ -372,14 +374,17 @@ static void migrate_page_copy(struct pag
mlock_migrate_page(newpage, page);
#ifdef CONFIG_SWAP
+ was_swapcache = PageSwapCache(page);
ClearPageSwapCache(page);
#endif
ClearPagePrivate(page);
set_page_private(page, 0);
/* page->mapping contains a flag for PageAnon() */
if (PageAnon(page)) {
- /* This page is uncharged at try_to_unmap(). */
+ /* This page is uncharged at try_to_unmap() if not SwapCache. */
page->mapping = NULL;
+ if (was_swapcache)
+ mem_cgroup_uncharge_page(page);
} else {
/* Obsolete file cache should be uncharged */
page->mapping = NULL;
--
* Re: [RFC][PATCH 0/14] Mem+Swap Controller v2
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
` (13 preceding siblings ...)
2008-08-22 11:44 ` [RFC][PATCH 14/14]memcg: mem+swap accounting KAMEZAWA Hiroyuki
@ 2008-08-22 13:20 ` Balbir Singh
2008-08-22 15:34 ` kamezawa.hiroyu
15 siblings, 0 replies; 61+ messages in thread
From: Balbir Singh @ 2008-08-22 13:20 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura
KAMEZAWA Hiroyuki wrote:
> Hi, I totally rewrote the series. maybe easier to be reviewed.
>
> This patch series provides memory resource controller enhancement(candidates)
> Codes are totally rewritten from "preview".
> Based on rc3 + a bit old mmtom (may not have conflicts with the latest...)
> (I'm not so in hurry now, please see when you have time.)
>
> Contents are following. I'll push them gradually when it seems O.K.
>
> - New force_empty implementation.
> - rather than drop all accounting, move all accounting to "root".
> This behavior can be changed later (based on some policy.)
> It may take some amount of time about "good" policy, I think start from
> "move charge to the root" is good. I want to hear opinions about this
> interface's behavior.
>
> - Lockless page_cgroup
> - Removes lock_page_cgroup() and makes access to page->page_cgroup safe
> under some situation. This will makes memcg's lock semantics to
> be better.
>
> - Exporting page_cgroup.
> - Because of Lockess page_cgroup, we can easily access page_cgroup from
> outside of memcontrol.c. There are some people who ask me to allow
> them to access page_cgroup.
>
> - Mem+Swap controller.
> - This controller is implemented as an extention of memory controller.
> If Mem+Swap controller is enabled, you can set 2 limits ans see
> 2 counters.
>
> memory.limit_in_bytes .... limit of amounts of pages.
> memory.memsw_limit_in_bytes .... limit of amounts of the sum of pages
> and swap_on_disks.
> memory.usage_in_bytes .... current usage of on-memory pages.
> memory.memory.swap_in_bytes .... current usage of swaps which is not
> on_memory.
>
> Any feedback, comments are welcome.
>
> This set passed some fundamental tests on small box and works good.
> but I have not done long-run test on big box. So, you may see panic
> of race conditions....
>
> TODO:
> - Update Documentation more.
> - Long-run test.
> - Update force_empty's policy.
>
> Major Changes from v1.
> - force_empty is updated.
> - small bug fix on Lockless page_cgroup.
> - Mem+Swap controller is added. (Implementation detail is quite different
> from preview version. But no change in algorithm.)
>
> Patch series:
> I'd like to push patch 1...9 in early than 10..14
> Comments about the direction of patch 1,2,11,13 is helpful.
>
> 1. unlimted_root_cgroup.patch
> .... makes root cgroup's limitation to unlimited.
> 2. new_force_empty.patch
> .... rewrite force_empty to move the resource rather
> 3. atomic_flags.patch
> .... makes page_cgroup->flags modification to atomic_ops.
> 4. lazy-lru-freeing.patch
> .... makes freeing of page_cgroup to be delayed.
> 5. rcu-free.patch
> ....freeing of page_cgroup by RCU.
> 6. lockess.patch
> ....remove lock_page_cgroup()
> 7. prefetch.patch
> .... just adds prefetch().
> 8. make_mapping_null.patch
> .... guarantee page->mapping to be NULL before uncharge
> file cache (and check it by BUG_ON)
> 9. split-page-cgroup.patch
> .... add page_cgroup.h file.
> 10. mem_counter.patch
> .... replace res_coutner with mem_counter, newly added.
> (reason of this patch will be shown in patch[11])
> 11. memcgrp_id.patch
> .... give each mem_cgroup its own ID.
To be honest, I think something like this needs to happen at the cgroups level, no?
> 12. swap_cgroup_config.patch
> .... Add Kconfig and some macro for Mem+Swap Controller.
> 13. swap_counter.patch
> .... modifies mem_counter to handle swaps.
> 14. swap_account.patch
> .... account swap.
This is too fast for me to review. I'll review this series anyway. I was also
hoping to get down to user space notifications for OOM and the pagevec series.
Let me see if I can do the latter quickly.
--
Balbir
--
* Re: Re: [RFC][PATCH 0/14] Mem+Swap Controller v2
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
` (14 preceding siblings ...)
2008-08-22 13:20 ` [RFC][PATCH 0/14] Mem+Swap Controller v2 Balbir Singh
@ 2008-08-22 15:34 ` kamezawa.hiroyu
15 siblings, 0 replies; 61+ messages in thread
From: kamezawa.hiroyu @ 2008-08-22 15:34 UTC (permalink / raw)
To: balbir; +Cc: KAMEZAWA Hiroyuki, linux-mm, nishimura
>> 11. memcgrp_id.patch
>> .... give each mem_cgroup its own ID.
>
>To be honest, I think something like this needs to happen at the cgroups level, no?
>
I think ..yes..maybe. In my patch, this ID is defined as [0-32767], with 0 for
invalid and 1 for the root group. So, this is not designed as an ID for the whole
system-wide cgroup. Kicking this out of memcontrol.c and moving it to some
kernel/xxx.c as "cgroup hierarchy ID support" may be an idea.
I'd like to wait for Paul and hear his opinion.
Anyway, I like this idea (assigning a short ID to each cgrp) for saving space,
from 8 bytes (pointer) to 2 bytes (ID), when recording a cgroup's account
information in an array.
>> 12. swap_cgroup_config.patch
>> .... Add Kconfig and some macro for Mem+Swap Controller.
>> 13. swap_counter.patch
>> .... modifies mem_counter to handle swaps.
>> 14. swap_account.patch
>> .... account swap.
>
>This is too fast for me to review. I'll review this series anyway.
Thanks,
> I was also hoping to get down to user space notifications for OOM
I want to see this :)
> and pagevec series.
The lockless patches include a pagevec-like operation in the lazy-lru-free and rcu
patches. They do a batched lru operation at uncharge().
(I couldn't find a way to implement batched lru insertion at charge()
without a race condition.)
Thanks,
-Kame
>Let me see if I can do the latter quickly.
>
>--
> Balbir
--
* Re: [RFC][PATCH 1/14] memcg: unlimted root cgroup
2008-08-22 11:30 ` [RFC][PATCH 1/14] memcg: unlimted root cgroup KAMEZAWA Hiroyuki
@ 2008-08-22 22:51 ` Balbir Singh
2008-08-23 0:38 ` kamezawa.hiroyu
1 sibling, 0 replies; 61+ messages in thread
From: Balbir Singh @ 2008-08-22 22:51 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura
KAMEZAWA Hiroyuki wrote:
> Make root cgroup of memory resource controller to have no limit.
>
> By this, users cannot set limit to root group. This is for making root cgroup
> as a kind of trash-can.
>
> For accounting pages which have no owner, which are created by force_empty,
> we need some cgroup with no_limit. A patch rewriting force_empty will
> follow this one.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> ---
> Documentation/controllers/memory.txt | 4 ++++
> mm/memcontrol.c | 12 ++++++++++++
> 2 files changed, 16 insertions(+)
>
> Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
> ===================================================================
> --- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
> +++ mmtom-2.6.27-rc3+/mm/memcontrol.c
> @@ -133,6 +133,10 @@ struct mem_cgroup {
> * statistics.
> */
> struct mem_cgroup_stat stat;
> + /*
> + * special flags.
> + */
> + int no_limit;
Is this a generic implementation to support no limits? If not, why not store the
root memory controller pointer and see if someone is trying to set a limit on that?
--
Balbir
--
* Re: Re: [RFC][PATCH 1/14] memcg: unlimted root cgroup
2008-08-22 11:30 ` [RFC][PATCH 1/14] memcg: unlimted root cgroup KAMEZAWA Hiroyuki
2008-08-22 22:51 ` Balbir Singh
@ 2008-08-23 0:38 ` kamezawa.hiroyu
2008-08-25 3:19 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 61+ messages in thread
From: kamezawa.hiroyu @ 2008-08-23 0:38 UTC (permalink / raw)
To: balbir; +Cc: KAMEZAWA Hiroyuki, linux-mm, nishimura
----- Original Message -----
>KAMEZAWA Hiroyuki wrote:
>> Make root cgroup of memory resource controller to have no limit.
>>
>> By this, users cannot set limit to root group. This is for making root cgroup
>> as a kind of trash-can.
>>
>> For accounting pages which have no owner, which are created by force_empty,
>> we need some cgroup with no_limit. A patch rewriting force_empty will
>> follow this one.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> ---
>> Documentation/controllers/memory.txt | 4 ++++
>> mm/memcontrol.c | 12 ++++++++++++
>> 2 files changed, 16 insertions(+)
>>
>> Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
>> ===================================================================
>> --- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
>> +++ mmtom-2.6.27-rc3+/mm/memcontrol.c
>> @@ -133,6 +133,10 @@ struct mem_cgroup {
>> * statistics.
>> */
>> struct mem_cgroup_stat stat;
>> + /*
>> + * special flags.
>> + */
>> + int no_limit;
>
>Is this a generic implementation to support no limits? If not, why not store the
>root memory controller pointer and see if someone is trying to set a limit on that?
>
Just because I designed this for supporting trash-box and changed my mind..
Sorry. If pointer comparison is better, I'll do that.
Thanks,
-Kame
--
* Re: [RFC][PATCH 1/14] memcg: unlimted root cgroup
2008-08-23 0:38 ` kamezawa.hiroyu
@ 2008-08-25 3:19 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-25 3:19 UTC (permalink / raw)
To: kamezawa.hiroyu; +Cc: balbir, linux-mm, nishimura
On Sat, 23 Aug 2008 09:38:10 +0900 (JST)
kamezawa.hiroyu@jp.fujitsu.com wrote:
> >Is this a generic implementation to support no limits? If not, why not store the
> >root memory controller pointer and see if someone is trying to set a limit on that?
> >
> Just because I designed this for supporting trash-box and changed my mind..
> Sorry. If pointer comparison is better, I'll do that.
>
I decided to use the following macro instead of memcg->no_limit.
#define is_root_cgroup(cgrp) ((cgrp) == &init_mem_cgroup)
Thanks,
-Kame
--
* Re: [RFC][PATCH 2/14] memcg: rewrite force_empty
2008-08-22 11:31 ` [RFC][PATCH 2/14] memcg: rewrite force_empty KAMEZAWA Hiroyuki
@ 2008-08-25 3:21 ` KAMEZAWA Hiroyuki
2008-08-29 11:45 ` Daisuke Nishimura
1 sibling, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-25 3:21 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
On Fri, 22 Aug 2008 20:31:14 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Current force_empty of memory resource controller just removes page_cgroup.
> This means the page is not accounted at all and creates an in-use page which
> has no page_cgroup.
>
> This patch tries to move the account to "root" cgroup. By this patch, force_empty
> doesn't leak an account but moves it to the "root" cgroup. Maybe someone can
> think of other enhancements such as
>
> 1. move account to its parent.
> 2. move account to default-trash-can-cgroup somewhere.
> 3. move account to a cgroup specified by an admin.
>
> I think the routine this patch adds is generic enough and can be the base
> patch for supporting the above behavior (if someone wants). But, for now, it just
> moves the account to the root group.
>
> While moving the mem_cgroup, lock_page(page) is held. This helps us avoid a
> race condition when accessing page_cgroup->mem_cgroup.
> While under lock_page(), page_cgroup->mem_cgroup points to the right cgroup.
>
I decided to divide this patch into 2 pieces.
1. mem_cgroup_move_account() patch
2. rewrite force_empty to use mem_cgroup_move_account() patch.
(1) will add more generic helpers for mem_cgroup in the future.
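A rough sketch of the interface I have in mind for (1) (name and arguments
are tentative):

	/* move accounting of one charged page from one mem_cgroup to another */
	static int mem_cgroup_move_account(struct page *page,
					   struct page_cgroup *pc,
					   struct mem_cgroup *from,
					   struct mem_cgroup *to);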
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 3/14] memcg: atomic_flags
2008-08-22 11:32 ` [RFC][PATCH 3/14] memcg: atomic_flags KAMEZAWA Hiroyuki
@ 2008-08-26 4:55 ` Balbir Singh
2008-08-26 23:50 ` KAMEZAWA Hiroyuki
2008-08-26 8:46 ` kamezawa.hiroyu
1 sibling, 1 reply; 61+ messages in thread
From: Balbir Singh @ 2008-08-26 4:55 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura
KAMEZAWA Hiroyuki wrote:
> This patch makes page_cgroup->flags to be atomic_ops and define
> functions (and macros) to access it.
>
> This patch itself makes memcg slow but this patch's final purpose is
> to remove lock_page_cgroup() and allowing fast access to page_cgroup.
>
That is a cause of worry, do the patches that follow help performance? How do we
benefit from faster access to page_cgroup() if the memcg controller becomes slower?
> Before trying to modify memory resource controller, this atomic operation
> on flags is necessary.
> Changelog (v1) -> (v2)
> - no changes
> Changelog (preview) -> (v1):
> - patch ordering is changed.
> - Added macro for defining functions for Test/Set/Clear bit.
> - made the names of flags shorter.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> ---
> mm/memcontrol.c | 108 +++++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 77 insertions(+), 31 deletions(-)
>
> Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
> ===================================================================
> --- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
> +++ mmtom-2.6.27-rc3+/mm/memcontrol.c
> @@ -163,12 +163,57 @@ struct page_cgroup {
> struct list_head lru; /* per cgroup LRU list */
> struct page *page;
> struct mem_cgroup *mem_cgroup;
> - int flags;
> + unsigned long flags;
> };
> -#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
> -#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
> -#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
> -#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
> +
> +enum {
> + /* flags for mem_cgroup */
> + Pcg_CACHE, /* charged as cache */
Why Pcg_CACHE and not PCG_CACHE or PAGE_CGROUP_CACHE? I think the latter is more
readable, no?
> + /* flags for LRU placement */
> + Pcg_ACTIVE, /* page is active in this cgroup */
> + Pcg_FILE, /* page is file system backed */
> + Pcg_UNEVICTABLE, /* page is unevictableable */
> +};
> +
> +#define TESTPCGFLAG(uname, lname) \
^^ uname and lname?
How about TEST_PAGE_CGROUP_FLAG(func, bit)
> +static inline int Pcg##uname(struct page_cgroup *pc) \
> + { return test_bit(Pcg_##lname, &pc->flags); }
> +
I would prefer PageCgroup##func
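(For reference, not part of the patch text: TESTPCGFLAG(Cache, CACHE) as posted
expands to roughly

	static inline int PcgCache(struct page_cgroup *pc)
	{ return test_bit(Pcg_CACHE, &pc->flags); }

so the first argument picks the function name and the second the bit.)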
> +#define SETPCGFLAG(uname, lname) \
> +static inline void SetPcg##uname(struct page_cgroup *pc)\
> + { set_bit(Pcg_##lname, &pc->flags); }
> +
> +#define CLEARPCGFLAG(uname, lname) \
> +static inline void ClearPcg##uname(struct page_cgroup *pc) \
> + { clear_bit(Pcg_##lname, &pc->flags); }
> +
> +#define __SETPCGFLAG(uname, lname) \
> +static inline void __SetPcg##uname(struct page_cgroup *pc)\
> + { __set_bit(Pcg_##lname, &pc->flags); }
> +
OK, so we have the non-atomic version as well
> +#define __CLEARPCGFLAG(uname, lname) \
> +static inline void __ClearPcg##uname(struct page_cgroup *pc) \
> + { __clear_bit(Pcg_##lname, &pc->flags); }
> +
> +/* Cache flag is set only once (at allocation) */
> +TESTPCGFLAG(Cache, CACHE)
> +__SETPCGFLAG(Cache, CACHE)
> +
> +/* LRU management flags (from global-lru definition) */
> +TESTPCGFLAG(File, FILE)
> +SETPCGFLAG(File, FILE)
> +__SETPCGFLAG(File, FILE)
> +CLEARPCGFLAG(File, FILE)
> +
> +TESTPCGFLAG(Active, ACTIVE)
> +SETPCGFLAG(Active, ACTIVE)
> +__SETPCGFLAG(Active, ACTIVE)
> +CLEARPCGFLAG(Active, ACTIVE)
> +
> +TESTPCGFLAG(Unevictable, UNEVICTABLE)
> +SETPCGFLAG(Unevictable, UNEVICTABLE)
> +CLEARPCGFLAG(Unevictable, UNEVICTABLE)
> +
>
> static int page_cgroup_nid(struct page_cgroup *pc)
> {
> @@ -189,14 +234,15 @@ enum charge_type {
> /*
> * Always modified under lru lock. Then, not necessary to preempt_disable()
> */
> -static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, int flags,
> - bool charge)
> +static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> + struct page_cgroup *pc,
> + bool charge)
> {
> int val = (charge)? 1 : -1;
> struct mem_cgroup_stat *stat = &mem->stat;
>
> VM_BUG_ON(!irqs_disabled());
> - if (flags & PAGE_CGROUP_FLAG_CACHE)
> + if (PcgCache(pc))
Shouldn't we use __PcgCache(), see my comments below
> __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
> else
> __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
> @@ -289,18 +335,18 @@ static void __mem_cgroup_remove_list(str
> {
> int lru = LRU_BASE;
>
> - if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
> + if (PcgUnevictable(pc))
Since we call this under a lock, can't we use __PcgUnevictable(pc)? If not, what
are we buying by doing atomic operations under a lock?
> lru = LRU_UNEVICTABLE;
> else {
> - if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
> + if (PcgActive(pc))
Ditto
> lru += LRU_ACTIVE;
> - if (pc->flags & PAGE_CGROUP_FLAG_FILE)
> + if (PcgFile(pc))
Ditto
> lru += LRU_FILE;
> }
>
> MEM_CGROUP_ZSTAT(mz, lru) -= 1;
>
> - mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
> + mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
> list_del(&pc->lru);
> }
>
> @@ -309,27 +355,27 @@ static void __mem_cgroup_add_list(struct
> {
> int lru = LRU_BASE;
>
> - if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
> + if (PcgUnevictable(pc))
Ditto
> lru = LRU_UNEVICTABLE;
> else {
> - if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
> + if (PcgActive(pc))
> lru += LRU_ACTIVE;
> - if (pc->flags & PAGE_CGROUP_FLAG_FILE)
> + if (PcgFile(pc))
Ditto
> lru += LRU_FILE;
> }
>
> MEM_CGROUP_ZSTAT(mz, lru) += 1;
> list_add(&pc->lru, &mz->lists[lru]);
>
> - mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
> + mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
> }
>
> static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
> {
> struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
> - int active = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
> - int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
> - int unevictable = pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE;
> + int active = PcgActive(pc);
> + int file = PcgFile(pc);
> + int unevictable = PcgUnevictable(pc);
> enum lru_list from = unevictable ? LRU_UNEVICTABLE :
> (LRU_FILE * !!file + !!active);
>
> @@ -339,14 +385,14 @@ static void __mem_cgroup_move_lists(stru
> MEM_CGROUP_ZSTAT(mz, from) -= 1;
>
> if (is_unevictable_lru(lru)) {
> - pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
> - pc->flags |= PAGE_CGROUP_FLAG_UNEVICTABLE;
> + ClearPcgActive(pc);
> + SetPcgUnevictable(pc);
> } else {
> if (is_active_lru(lru))
> - pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> + SetPcgActive(pc);
> else
> - pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
> - pc->flags &= ~PAGE_CGROUP_FLAG_UNEVICTABLE;
> + ClearPcgActive(pc);
> + ClearPcgUnevictable(pc);
Again shouldn't we be using the __ variants?
> }
>
> MEM_CGROUP_ZSTAT(mz, lru) += 1;
> @@ -580,18 +626,19 @@ static int mem_cgroup_charge_common(stru
>
> pc->mem_cgroup = mem;
> pc->page = page;
> + pc->flags = 0;
> /*
> * If a page is accounted as a page cache, insert to inactive list.
> * If anon, insert to active list.
> */
> if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
> - pc->flags = PAGE_CGROUP_FLAG_CACHE;
> + __SetPcgCache(pc);
> if (page_is_file_cache(page))
> - pc->flags |= PAGE_CGROUP_FLAG_FILE;
> + __SetPcgFile(pc);
> else
> - pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> + __SetPcgActive(pc);
> } else
> - pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
> + __SetPcgActive(pc);
>
> lock_page_cgroup(page);
> if (unlikely(page_get_page_cgroup(page))) {
> @@ -699,8 +746,7 @@ __mem_cgroup_uncharge_common(struct page
> VM_BUG_ON(pc->page != page);
>
> if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> - && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
> - || page_mapped(page)))
> + && ((PcgCache(pc) || page_mapped(page))))
> goto unlock;
>
> mz = page_cgroup_zoneinfo(pc);
> @@ -750,7 +796,7 @@ int mem_cgroup_prepare_migration(struct
> if (pc) {
> mem = pc->mem_cgroup;
> css_get(&mem->css);
> - if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
> + if (PcgCache(pc))
> ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
> }
> unlock_page_cgroup(page);
Seems reasonable, my worry is the performance degradation that you've mentioned.
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Re: [RFC][PATCH 3/14] memcg: atomic_flags
2008-08-22 11:32 ` [RFC][PATCH 3/14] memcg: atomic_flags KAMEZAWA Hiroyuki
2008-08-26 4:55 ` Balbir Singh
@ 2008-08-26 8:46 ` kamezawa.hiroyu
2008-08-26 8:49 ` Balbir Singh
1 sibling, 1 reply; 61+ messages in thread
From: kamezawa.hiroyu @ 2008-08-26 8:46 UTC (permalink / raw)
To: balbir; +Cc: KAMEZAWA Hiroyuki, linux-mm, nishimura
----- Original Message -----
>KAMEZAWA Hiroyuki wrote:
>> This patch makes page_cgroup->flags to be atomic_ops and define
>> functions (and macros) to access it.
>>
>> This patch itself makes memcg slow but this patch's final purpose is
>> to remove lock_page_cgroup() and allowing fast access to page_cgroup.
>>
>
>That is a cause of worry, do the patches that follow help performance?
By applying patches for this and RCU and removing lock_page_cgroup(), I saw a small performance benefit.
> How do we
>benefit from faster access to page_cgroup() if the memcg controller becomes slower?
>
No slow-down on my box. But the cpu which I'm testing on is a bit old.
I'd like to try a newer CPU.
As you know, I don't like slow-down very much ;)
Thanks,
-Kame
>> Before trying to modify memory resource controller, this atomic operation
>> on flags is necessary.
>> Changelog (v1) -> (v2)
>> - no changes
>> Changelog (preview) -> (v1):
>> - patch ordering is changed.
>> - Added macro for defining functions for Test/Set/Clear bit.
>> - made the names of flags shorter.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> ---
>> mm/memcontrol.c | 108 +++++++++++++++++++++++++++++++++++++++-----------------
>> 1 file changed, 77 insertions(+), 31 deletions(-)
>>
>> Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
>> ===================================================================
>> --- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
>> +++ mmtom-2.6.27-rc3+/mm/memcontrol.c
>> @@ -163,12 +163,57 @@ struct page_cgroup {
>> struct list_head lru; /* per cgroup LRU list */
>> struct page *page;
>> struct mem_cgroup *mem_cgroup;
>> - int flags;
>> + unsigned long flags;
>> };
>> -#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
>> -#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
>> -#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
>> -#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
>> +
>> +enum {
>> + /* flags for mem_cgroup */
>> + Pcg_CACHE, /* charged as cache */
>
>Why Pcg_CACHE and not PCG_CACHE or PAGE_CGROUP_CACHE? I think the latter is more
>readable, no?
>
>> + /* flags for LRU placement */
>> + Pcg_ACTIVE, /* page is active in this cgroup */
>> + Pcg_FILE, /* page is file system backed */
>> + Pcg_UNEVICTABLE, /* page is unevictableable */
>> +};
>> +
>> +#define TESTPCGFLAG(uname, lname) \
> ^^ uname and lname?
>How about TEST_PAGE_CGROUP_FLAG(func, bit)
>
>> +static inline int Pcg##uname(struct page_cgroup *pc) \
>> + { return test_bit(Pcg_##lname, &pc->flags); }
>> +
>
>I would prefer PageCgroup##func
>
>> +#define SETPCGFLAG(uname, lname) \
>> +static inline void SetPcg##uname(struct page_cgroup *pc)\
>> + { set_bit(Pcg_##lname, &pc->flags); }
>> +
>> +#define CLEARPCGFLAG(uname, lname) \
>> +static inline void ClearPcg##uname(struct page_cgroup *pc) \
>> + { clear_bit(Pcg_##lname, &pc->flags); }
>> +
>> +#define __SETPCGFLAG(uname, lname) \
>> +static inline void __SetPcg##uname(struct page_cgroup *pc)\
>> + { __set_bit(Pcg_##lname, &pc->flags); }
>> +
>
>OK, so we have the non-atomic version as well
>
>> +#define __CLEARPCGFLAG(uname, lname) \
>> +static inline void __ClearPcg##uname(struct page_cgroup *pc) \
>> + { __clear_bit(Pcg_##lname, &pc->flags); }
>> +
>> +/* Cache flag is set only once (at allocation) */
>> +TESTPCGFLAG(Cache, CACHE)
>> +__SETPCGFLAG(Cache, CACHE)
>> +
>> +/* LRU management flags (from global-lru definition) */
>> +TESTPCGFLAG(File, FILE)
>> +SETPCGFLAG(File, FILE)
>> +__SETPCGFLAG(File, FILE)
>> +CLEARPCGFLAG(File, FILE)
>> +
>> +TESTPCGFLAG(Active, ACTIVE)
>> +SETPCGFLAG(Active, ACTIVE)
>> +__SETPCGFLAG(Active, ACTIVE)
>> +CLEARPCGFLAG(Active, ACTIVE)
>> +
>> +TESTPCGFLAG(Unevictable, UNEVICTABLE)
>> +SETPCGFLAG(Unevictable, UNEVICTABLE)
>> +CLEARPCGFLAG(Unevictable, UNEVICTABLE)
>> +
>>
>> static int page_cgroup_nid(struct page_cgroup *pc)
>> {
>> @@ -189,14 +234,15 @@ enum charge_type {
>> /*
>> * Always modified under lru lock. Then, not necessary to preempt_disable()
>> */
>> -static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, int flags,
>> - bool charge)
>> +static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>> + struct page_cgroup *pc,
>> + bool charge)
>> {
>> int val = (charge)? 1 : -1;
>> struct mem_cgroup_stat *stat = &mem->stat;
>>
>> VM_BUG_ON(!irqs_disabled());
>> - if (flags & PAGE_CGROUP_FLAG_CACHE)
>> + if (PcgCache(pc))
>
>Shouldn't we use __PcgCache(), see my comments below
>
>> __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
>> else
>> __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
>> @@ -289,18 +335,18 @@ static void __mem_cgroup_remove_list(str
>> {
>> int lru = LRU_BASE;
>>
>> - if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
>> + if (PcgUnevictable(pc))
>
>Since we call this under a lock, can't we use __PcgUnevictable(pc)? If not, what
>are we buying by doing atomic operations under a lock?
>
>> lru = LRU_UNEVICTABLE;
>> else {
>> - if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
>> + if (PcgActive(pc))
>
>Ditto
>
>> lru += LRU_ACTIVE;
>> - if (pc->flags & PAGE_CGROUP_FLAG_FILE)
>> + if (PcgFile(pc))
>
>Ditto
>
>> lru += LRU_FILE;
>> }
>>
>> MEM_CGROUP_ZSTAT(mz, lru) -= 1;
>>
>> - mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
>> + mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
>> list_del(&pc->lru);
>> }
>>
>> @@ -309,27 +355,27 @@ static void __mem_cgroup_add_list(struct
>> {
>> int lru = LRU_BASE;
>>
>> - if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
>> + if (PcgUnevictable(pc))
>
>Ditto
>
>> lru = LRU_UNEVICTABLE;
>> else {
>> - if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
>> + if (PcgActive(pc))
>> lru += LRU_ACTIVE;
>> - if (pc->flags & PAGE_CGROUP_FLAG_FILE)
>> + if (PcgFile(pc))
>
>Ditto
>
>> lru += LRU_FILE;
>> }
>>
>> MEM_CGROUP_ZSTAT(mz, lru) += 1;
>> list_add(&pc->lru, &mz->lists[lru]);
>>
>> - mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
>> + mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
>> }
>>
>> static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
>> {
>> struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
>> - int active = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
>> - int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
>> - int unevictable = pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE;
>> + int active = PcgActive(pc);
>> + int file = PcgFile(pc);
>> + int unevictable = PcgUnevictable(pc);
>> enum lru_list from = unevictable ? LRU_UNEVICTABLE :
>> (LRU_FILE * !!file + !!active);
>>
>> @@ -339,14 +385,14 @@ static void __mem_cgroup_move_lists(stru
>> MEM_CGROUP_ZSTAT(mz, from) -= 1;
>>
>> if (is_unevictable_lru(lru)) {
>> - pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
>> - pc->flags |= PAGE_CGROUP_FLAG_UNEVICTABLE;
>> + ClearPcgActive(pc);
>> + SetPcgUnevictable(pc);
>> } else {
>> if (is_active_lru(lru))
>> - pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
>> + SetPcgActive(pc);
>> else
>> - pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
>> - pc->flags &= ~PAGE_CGROUP_FLAG_UNEVICTABLE;
>> + ClearPcgActive(pc);
>> + ClearPcgUnevictable(pc);
>
>Again shouldn't we be using the __ variants?
>
>> }
>>
>> MEM_CGROUP_ZSTAT(mz, lru) += 1;
>> @@ -580,18 +626,19 @@ static int mem_cgroup_charge_common(stru
>>
>> pc->mem_cgroup = mem;
>> pc->page = page;
>> + pc->flags = 0;
>> /*
>> * If a page is accounted as a page cache, insert to inactive list.
>> * If anon, insert to active list.
>> */
>> if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
>> - pc->flags = PAGE_CGROUP_FLAG_CACHE;
>> + __SetPcgCache(pc);
>> if (page_is_file_cache(page))
>> - pc->flags |= PAGE_CGROUP_FLAG_FILE;
>> + __SetPcgFile(pc);
>> else
>> - pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
>> + __SetPcgActive(pc);
>> } else
>> - pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
>> + __SetPcgActive(pc);
>>
>> lock_page_cgroup(page);
>> if (unlikely(page_get_page_cgroup(page))) {
>> @@ -699,8 +746,7 @@ __mem_cgroup_uncharge_common(struct page
>> VM_BUG_ON(pc->page != page);
>>
>> if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
>> - && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
>> - || page_mapped(page)))
>> + && ((PcgCache(pc) || page_mapped(page))))
>> goto unlock;
>>
>> mz = page_cgroup_zoneinfo(pc);
>> @@ -750,7 +796,7 @@ int mem_cgroup_prepare_migration(struct
>> if (pc) {
>> mem = pc->mem_cgroup;
>> css_get(&mem->css);
>> - if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
>> + if (PcgCache(pc))
>> ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
>> }
>> unlock_page_cgroup(page);
>
>Seems reasonable, my worry is the performance degradation that you've mentioned.
>
>--
> Balbir
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org. For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 3/14] memcg: atomic_flags
2008-08-26 8:46 ` kamezawa.hiroyu
@ 2008-08-26 8:49 ` Balbir Singh
2008-08-26 23:41 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 61+ messages in thread
From: Balbir Singh @ 2008-08-26 8:49 UTC (permalink / raw)
To: kamezawa.hiroyu; +Cc: linux-mm, nishimura
kamezawa.hiroyu@jp.fujitsu.com wrote:
> ----- Original Message -----
>> KAMEZAWA Hiroyuki wrote:
>>> This patch makes page_cgroup->flags to be atomic_ops and define
>>> functions (and macros) to access it.
>>>
>>> This patch itself makes memcg slow but this patch's final purpose is
>>> to remove lock_page_cgroup() and allowing fast access to page_cgroup.
>>>
>> That is a cause of worry, do the patches that follow help performance?
> By applying patches for this and RCU and removing lock_page_cgroup(), I saw a small performance benefit.
>
>> How do we
>> benefit from faster access to page_cgroup() if the memcg controller becomes slower?
> No slow-down on my box. But the cpu which I'm testing on is a bit old.
> I'd like to try a newer CPU.
> As you know, I don't like slow-down very much ;)
I see, yes, I do know that you like to make things faster. BTW, you did not
comment on my comments below about the naming convention and using the __ variants
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 4/14] delay page_cgroup freeing
2008-08-22 11:33 ` [RFC][PATCH 4/14] delay page_cgroup freeing KAMEZAWA Hiroyuki
@ 2008-08-26 11:46 ` Balbir Singh
2008-08-26 23:55 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 61+ messages in thread
From: Balbir Singh @ 2008-08-26 11:46 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura
KAMEZAWA Hiroyuki wrote:
> Freeing page_cgroup at mem_cgroup_uncharge() in lazy way.
>
> In mem_cgroup_uncharge_common(), we don't free page_cgroup
> and just link it to per-cpu free queue.
> And remove it later by checking threshold.
>
> This patch is a base patch for freeing page_cgroup by RCU patch.
> This patch depends on make-page_cgroup_flag-atomic patch.
>
> Changelog: (v1) -> (v2)
> - fixed mem_cgroup_move_list()'s checking of PcgObsolete()
> - fixed force_empty.
> Changelog: (preview) -> (v1)
> - Clean up.
> - renamed functions
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> ---
> mm/memcontrol.c | 122 ++++++++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 110 insertions(+), 12 deletions(-)
>
> Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
> ===================================================================
> --- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
> +++ mmtom-2.6.27-rc3+/mm/memcontrol.c
> @@ -164,11 +164,13 @@ struct page_cgroup {
> struct page *page;
> struct mem_cgroup *mem_cgroup;
> unsigned long flags;
> + struct page_cgroup *next;
> };
>
> enum {
> /* flags for mem_cgroup */
> Pcg_CACHE, /* charged as cache */
> + Pcg_OBSOLETE, /* this page cgroup is invalid (unused) */
> /* flags for LRU placement */
> Pcg_ACTIVE, /* page is active in this cgroup */
> Pcg_FILE, /* page is file system backed */
> @@ -199,6 +201,10 @@ static inline void __ClearPcg##uname(str
> TESTPCGFLAG(Cache, CACHE)
> __SETPCGFLAG(Cache, CACHE)
>
> +/* No "Clear" routine for OBSOLETE flag */
> +TESTPCGFLAG(Obsolete, OBSOLETE);
> +SETPCGFLAG(Obsolete, OBSOLETE);
> +
> /* LRU management flags (from global-lru definition) */
> TESTPCGFLAG(File, FILE)
> SETPCGFLAG(File, FILE)
> @@ -225,6 +231,18 @@ static enum zone_type page_cgroup_zid(st
> return page_zonenum(pc->page);
> }
>
> +/*
> + * per-cpu slot for freeing page_cgroup in lazy manner.
> + * All page_cgroup linked to this list is OBSOLETE.
> + */
> +struct mem_cgroup_sink_list {
> + int count;
> + struct page_cgroup *next;
> +};
Can't we reuse the lru field in page_cgroup to build a list? Do we need them on
the memory controller LRU if they are obsolete? I want to do something similar
for both additions and deletions - reuse pagevec style, basically. I am OK,
having a list as well, in that case we can just reuse the LRU pointer.
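(By pagevec style I mean roughly the struct pagevec pattern, i.e. something like

	struct pagevec {
		unsigned long nr;
		struct page *pages[PAGEVEC_SIZE];
	};

a small fixed-size array plus a count that is filled and then drained in one go.)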
> +DEFINE_PER_CPU(struct mem_cgroup_sink_list, memcg_sink_list);
> +#define MEMCG_LRU_THRESH (16)
> +
> +
> enum charge_type {
> MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> MEM_CGROUP_CHARGE_TYPE_MAPPED,
> @@ -440,7 +458,7 @@ void mem_cgroup_move_lists(struct page *
> /*
> * check against the race with force_empty.
> */
> - if (likely(mem == pc->mem_cgroup))
> + if (!PcgObsolete(pc) && likely(mem == pc->mem_cgroup))
> __mem_cgroup_move_lists(pc, lru);
> spin_unlock_irqrestore(&mz->lru_lock, flags);
> }
> @@ -531,6 +549,10 @@ unsigned long mem_cgroup_isolate_pages(u
> list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
> if (scan >= nr_to_scan)
> break;
> +
> + if (PcgObsolete(pc))
> + continue;
> +
> page = pc->page;
>
> if (unlikely(!PageLRU(page)))
> @@ -563,6 +585,81 @@ unsigned long mem_cgroup_isolate_pages(u
> }
>
> /*
> + * Free obsolete page_cgroups which is linked to per-cpu drop list.
> + */
> +
> +static void __free_obsolete_page_cgroup(void)
> +{
> + struct mem_cgroup *memcg;
> + struct page_cgroup *pc, *next;
> + struct mem_cgroup_per_zone *mz, *page_mz;
> + struct mem_cgroup_sink_list *mcsl;
> + unsigned long flags;
> +
> + mcsl = &get_cpu_var(memcg_sink_list);
> + next = mcsl->next;
> + mcsl->next = NULL;
> + mcsl->count = 0;
> + put_cpu_var(memcg_sink_list);
> +
> + mz = NULL;
> +
> + local_irq_save(flags);
> + while (next) {
> + pc = next;
> + VM_BUG_ON(!PcgObsolete(pc));
> + next = pc->next;
> + prefetch(next);
> + page_mz = page_cgroup_zoneinfo(pc);
> + memcg = pc->mem_cgroup;
> + if (page_mz != mz) {
> + if (mz)
> + spin_unlock(&mz->lru_lock);
> + mz = page_mz;
> + spin_lock(&mz->lru_lock);
> + }
> + __mem_cgroup_remove_list(mz, pc);
> + css_put(&memcg->css);
> + kmem_cache_free(page_cgroup_cache, pc);
> + }
> + if (mz)
> + spin_unlock(&mz->lru_lock);
> + local_irq_restore(flags);
> +}
> +
> +static void free_obsolete_page_cgroup(struct page_cgroup *pc)
> +{
> + int count;
> + struct mem_cgroup_sink_list *mcsl;
> +
> + mcsl = &get_cpu_var(memcg_sink_list);
> + pc->next = mcsl->next;
> + mcsl->next = pc;
> + count = ++mcsl->count;
> + put_cpu_var(memcg_sink_list);
> + if (count >= MEMCG_LRU_THRESH)
> + __free_obsolete_page_cgroup();
> +}
> +
> +/*
> + * Used when freeing memory resource controller to remove all
> + * page_cgroup (in obsolete list).
> + */
> +static DEFINE_MUTEX(memcg_force_drain_mutex);
> +
> +static void mem_cgroup_local_force_drain(struct work_struct *work)
> +{
> + __free_obsolete_page_cgroup();
> +}
> +
> +static void mem_cgroup_all_force_drain(void)
> +{
> + mutex_lock(&memcg_force_drain_mutex);
> + schedule_on_each_cpu(mem_cgroup_local_force_drain);
> + mutex_unlock(&memcg_force_drain_mutex);
> +}
> +
> +/*
> * Charge the memory controller for page usage.
> * Return
> * 0 if the charge was successful
> @@ -627,6 +724,7 @@ static int mem_cgroup_charge_common(stru
> pc->mem_cgroup = mem;
> pc->page = page;
> pc->flags = 0;
> + pc->next = NULL;
> /*
> * If a page is accounted as a page cache, insert to inactive list.
> * If anon, insert to active list.
> @@ -729,8 +827,6 @@ __mem_cgroup_uncharge_common(struct page
> {
> struct page_cgroup *pc;
> struct mem_cgroup *mem;
> - struct mem_cgroup_per_zone *mz;
> - unsigned long flags;
>
> if (mem_cgroup_subsys.disabled)
> return;
> @@ -748,20 +844,14 @@ __mem_cgroup_uncharge_common(struct page
> if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> && ((PcgCache(pc) || page_mapped(page))))
> goto unlock;
> -
> - mz = page_cgroup_zoneinfo(pc);
> - spin_lock_irqsave(&mz->lru_lock, flags);
> - __mem_cgroup_remove_list(mz, pc);
> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> + mem = pc->mem_cgroup;
> + SetPcgObsolete(pc);
> page_assign_page_cgroup(page, NULL);
> unlock_page_cgroup(page);
>
> - mem = pc->mem_cgroup;
> res_counter_uncharge(&mem->res, PAGE_SIZE);
> - css_put(&mem->css);
> + free_obsolete_page_cgroup(pc);
>
> - kmem_cache_free(page_cgroup_cache, pc);
> return;
> unlock:
> unlock_page_cgroup(page);
> @@ -937,6 +1027,14 @@ static void mem_cgroup_force_empty_list(
> spin_lock_irqsave(&mz->lru_lock, flags);
> while (!list_empty(list)) {
> pc = list_entry(list->prev, struct page_cgroup, lru);
> + if (PcgObsolete(pc)) {
> + list_move(&pc->lru, list);
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> + mem_cgroup_all_force_drain();
> + yield();
> + spin_lock_irqsave(&mz->lru_lock, flags);
> + continue;
> + }
> page = pc->page;
> if (!get_page_unless_zero(page)) {
> list_move(&pc->lru, list);
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 3/14] memcg: atomic_flags
2008-08-26 8:49 ` Balbir Singh
@ 2008-08-26 23:41 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-26 23:41 UTC (permalink / raw)
To: balbir; +Cc: linux-mm, nishimura
On Tue, 26 Aug 2008 14:19:54 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> kamezawa.hiroyu@jp.fujitsu.com wrote:
> > ----- Original Message -----
> >> KAMEZAWA Hiroyuki wrote:
> >>> This patch makes page_cgroup->flags to be atomic_ops and define
> >>> functions (and macros) to access it.
> >>>
> >>> This patch itself makes memcg slow but this patch's final purpose is
> >>> to remove lock_page_cgroup() and allowing fast access to page_cgroup.
> >>>
> >> That is a cause of worry, do the patches that follow help performance?
> > By applying patches for this and RCU and removing lock_page_cgroup(), I saw a small performance benefit.
> >
> >> How do we
> >> benefit from faster access to page_cgroup() if the memcg controller becomes slower?
> > No slow-down on my box. But the cpu which I'm testing on is a bit old.
> > I'd like to try a newer CPU.
> > As you know, I don't like slow-down very much ;)
>
> I see, yes, I do know that you like to make things faster. BTW, you did not
> comment on my comments below about the naming convention and using the __ variants
Sorry, I missed it. I will write a reply.
Thanks,
-Kame
> --
> Balbir
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 3/14] memcg: atomic_flags
2008-08-26 4:55 ` Balbir Singh
@ 2008-08-26 23:50 ` KAMEZAWA Hiroyuki
2008-08-27 1:58 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-26 23:50 UTC (permalink / raw)
To: balbir; +Cc: linux-mm, nishimura
On Tue, 26 Aug 2008 10:25:55 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > Before trying to modify memory resource controller, this atomic operation
> > on flags is necessary.
> > Changelog (v1) -> (v2)
> > - no changes
> > Changelog (preview) -> (v1):
> > - patch ordering is changed.
> > - Added macro for defining functions for Test/Set/Clear bit.
> > - made the names of flags shorter.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > ---
> > mm/memcontrol.c | 108 +++++++++++++++++++++++++++++++++++++++-----------------
> > 1 file changed, 77 insertions(+), 31 deletions(-)
> >
> > Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
> > ===================================================================
> > --- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
> > +++ mmtom-2.6.27-rc3+/mm/memcontrol.c
> > @@ -163,12 +163,57 @@ struct page_cgroup {
> > struct list_head lru; /* per cgroup LRU list */
> > struct page *page;
> > struct mem_cgroup *mem_cgroup;
> > - int flags;
> > + unsigned long flags;
> > };
> > -#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
> > -#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
> > -#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
> > -#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
> > +
> > +enum {
> > + /* flags for mem_cgroup */
> > + Pcg_CACHE, /* charged as cache */
>
> Why Pcg_CACHE and not PCG_CACHE or PAGE_CGROUP_CACHE? I think the latter is more
> readable, no?
>
Hmm, ok.
> > + /* flags for LRU placement */
> > + Pcg_ACTIVE, /* page is active in this cgroup */
> > + Pcg_FILE, /* page is file system backed */
> > + Pcg_UNEVICTABLE, /* page is unevictableable */
> > +};
> > +
> > +#define TESTPCGFLAG(uname, lname) \
> ^^ uname and lname?
> How about TEST_PAGE_CGROUP_FLAG(func, bit)
>
This style is from the PageXXX macros and I like the shorter name.
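(The PageXXX style being the page-flags convention, e.g. roughly

	if (PageDirty(page))
		ClearPageDirty(page);

which is what the short Pcg*/SetPcg*/ClearPcg* names mirror.)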
> > +static inline int Pcg##uname(struct page_cgroup *pc) \
> > + { return test_bit(Pcg_##lname, &pc->flags); }
> > +
>
> I would prefer PageCgroup##func
>
Hmm..ok. I'll rewrite and see about the 80-column problem.
> > +#define SETPCGFLAG(uname, lname) \
> > +static inline void SetPcg##uname(struct page_cgroup *pc)\
> > + { set_bit(Pcg_##lname, &pc->flags); }
> > +
> > +#define CLEARPCGFLAG(uname, lname) \
> > +static inline void ClearPcg##uname(struct page_cgroup *pc) \
> > + { clear_bit(Pcg_##lname, &pc->flags); }
> > +
> > +#define __SETPCGFLAG(uname, lname) \
> > +static inline void __SetPcg##uname(struct page_cgroup *pc)\
> > + { __set_bit(Pcg_##lname, &pc->flags); }
> > +
>
> OK, so we have the non-atomic version as well
>
> > +#define __CLEARPCGFLAG(uname, lname) \
> > +static inline void __ClearPcg##uname(struct page_cgroup *pc) \
> > + { __clear_bit(Pcg_##lname, &pc->flags); }
> > +
> > +/* Cache flag is set only once (at allocation) */
> > +TESTPCGFLAG(Cache, CACHE)
> > +__SETPCGFLAG(Cache, CACHE)
> > +
> > +/* LRU management flags (from global-lru definition) */
> > +TESTPCGFLAG(File, FILE)
> > +SETPCGFLAG(File, FILE)
> > +__SETPCGFLAG(File, FILE)
> > +CLEARPCGFLAG(File, FILE)
> > +
> > +TESTPCGFLAG(Active, ACTIVE)
> > +SETPCGFLAG(Active, ACTIVE)
> > +__SETPCGFLAG(Active, ACTIVE)
> > +CLEARPCGFLAG(Active, ACTIVE)
> > +
> > +TESTPCGFLAG(Unevictable, UNEVICTABLE)
> > +SETPCGFLAG(Unevictable, UNEVICTABLE)
> > +CLEARPCGFLAG(Unevictable, UNEVICTABLE)
> > +
> >
> > static int page_cgroup_nid(struct page_cgroup *pc)
> > {
> > @@ -189,14 +234,15 @@ enum charge_type {
> > /*
> > * Always modified under lru lock. Then, not necessary to preempt_disable()
> > */
> > -static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, int flags,
> > - bool charge)
> > +static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> > + struct page_cgroup *pc,
> > + bool charge)
> > {
> > int val = (charge)? 1 : -1;
> > struct mem_cgroup_stat *stat = &mem->stat;
> >
> > VM_BUG_ON(!irqs_disabled());
> > - if (flags & PAGE_CGROUP_FLAG_CACHE)
> > + if (PcgCache(pc))
>
> Shouldn't we use __PcgCache(), see my comments below
>
> > __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
> > else
> > __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
> > @@ -289,18 +335,18 @@ static void __mem_cgroup_remove_list(str
> > {
> > int lru = LRU_BASE;
> >
> > - if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
> > + if (PcgUnevictable(pc))
>
> Since we call this under a lock, can't we use __PcgUnevictable(pc)? If not, what
> are we buying by doing atomic operations under a lock?
>
> > lru = LRU_UNEVICTABLE;
> > else {
> > - if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
> > + if (PcgActive(pc))
>
> Ditto
>
> > lru += LRU_ACTIVE;
> > - if (pc->flags & PAGE_CGROUP_FLAG_FILE)
> > + if (PcgFile(pc))
>
> Ditto
>
> > lru += LRU_FILE;
> > }
> >
> > MEM_CGROUP_ZSTAT(mz, lru) -= 1;
> >
> > - mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
> > + mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
> > list_del(&pc->lru);
> > }
> >
> > @@ -309,27 +355,27 @@ static void __mem_cgroup_add_list(struct
> > {
> > int lru = LRU_BASE;
> >
> > - if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
> > + if (PcgUnevictable(pc))
>
> Ditto
>
> > lru = LRU_UNEVICTABLE;
> > else {
> > - if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
> > + if (PcgActive(pc))
> > lru += LRU_ACTIVE;
> > - if (pc->flags & PAGE_CGROUP_FLAG_FILE)
> > + if (PcgFile(pc))
>
> Ditto
>
> > lru += LRU_FILE;
> > }
> >
> > MEM_CGROUP_ZSTAT(mz, lru) += 1;
> > list_add(&pc->lru, &mz->lists[lru]);
> >
> > - mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
> > + mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
> > }
> >
> > static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
> > {
> > struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
> > - int active = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
> > - int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
> > - int unevictable = pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE;
> > + int active = PcgActive(pc);
> > + int file = PcgFile(pc);
> > + int unevictable = PcgUnevictable(pc);
> > enum lru_list from = unevictable ? LRU_UNEVICTABLE :
> > (LRU_FILE * !!file + !!active);
> >
> > @@ -339,14 +385,14 @@ static void __mem_cgroup_move_lists(stru
> > MEM_CGROUP_ZSTAT(mz, from) -= 1;
> >
> > if (is_unevictable_lru(lru)) {
> > - pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
> > - pc->flags |= PAGE_CGROUP_FLAG_UNEVICTABLE;
> > + ClearPcgActive(pc);
> > + SetPcgUnevictable(pc);
> > } else {
> > if (is_active_lru(lru))
> > - pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> > + SetPcgActive(pc);
> > else
> > - pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
> > - pc->flags &= ~PAGE_CGROUP_FLAG_UNEVICTABLE;
> > + ClearPcgActive(pc);
> > + ClearPcgUnevictable(pc);
>
> Again shouldn't we be using the __ variants?
>
For testing, __ variants are ok, I think.
For setting/clearing, because we'll have PcgObsolete() flag or
some more flags later, atomic version is better.
(For example, PcgDirty() is in my plan.)
I'll check all again.
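To illustrate the concern (a hypothetical interleaving, not code from the patch;
__ClearPcgActive() stands for the non-atomic variant under discussion):

	/*
	 * CPU0 (holds mz->lru_lock)        CPU1 (no mz->lru_lock, e.g. uncharge)
	 *   __ClearPcgActive(pc);            SetPcgObsolete(pc);
	 *
	 * __ClearPcgActive() is a non-atomic read-modify-write of pc->flags,
	 * so CPU1's bit can be lost; clear_bit()/set_bit() avoid that.
	 */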
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 4/14] delay page_cgroup freeing
2008-08-26 11:46 ` Balbir Singh
@ 2008-08-26 23:55 ` KAMEZAWA Hiroyuki
2008-08-27 1:17 ` Balbir Singh
0 siblings, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-26 23:55 UTC (permalink / raw)
To: balbir; +Cc: linux-mm, nishimura
On Tue, 26 Aug 2008 17:16:20 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > +/*
> > + * per-cpu slot for freeing page_cgroup in lazy manner.
> > + * All page_cgroup linked to this list is OBSOLETE.
> > + */
> > +struct mem_cgroup_sink_list {
> > + int count;
> > + struct page_cgroup *next;
> > +};
>
> Can't we reuse the lru field in page_cgroup to build a list? Do we need them on
> the memory controller LRU if they are obsolete? I want to do something similar
> for both additions and deletions - reuse pagevec style, basically. I am OK,
> having a list as well, in that case we can just reuse the LRU pointer.
>
reusing page_cgroup->lru is not a choice because this patch is for avoiding
locking on mz->lru_lock (and kfree).
But using a vector can be a choice. I'll try it in the next version.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 10/14] memcg: replace res_counter
2008-08-22 11:39 ` [RFC][PATCH 10/14] memcg: replace res_counter KAMEZAWA Hiroyuki
@ 2008-08-27 0:44 ` Daisuke Nishimura
2008-08-27 1:26 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 61+ messages in thread
From: Daisuke Nishimura @ 2008-08-27 0:44 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir
Hi.
> @@ -356,7 +447,7 @@ int mem_cgroup_calc_mapped_ratio(struct
> * usage is recorded in bytes. But, here, we assume the number of
> * physical pages can be represented by "long" on any arch.
> */
> - total = (long) (mem->res.usage >> PAGE_SHIFT) + 1L;
> + total = (long) (mem->res.pages >> PAGE_SHIFT) + 1L;
I don't think this shift is needed.
> rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
> return (int)((rss * 100L) / total);
> }
> @@ -880,8 +971,12 @@ int mem_cgroup_resize_limit(struct mem_c
> int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
> int progress;
> int ret = 0;
> + unsigned long new_lim = (unsigned long)(val >> PAGE_SHIFT);
>
> - while (res_counter_set_limit(&memcg->res, val)) {
> + if (val & PAGE_SIZE)
> + new_lim += 1;
> +
I'm sorry I can't understand here.
Should it be "val & (PAGE_SIZE-1)"?
> + while (mem_counter_set_pages_limit(memcg, new_lim)) {
> if (signal_pending(current)) {
> ret = -EINTR;
> break;
Thanks,
Daisuke Nishimura.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 4/14] delay page_cgroup freeing
2008-08-26 23:55 ` KAMEZAWA Hiroyuki
@ 2008-08-27 1:17 ` Balbir Singh
2008-08-27 1:39 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 61+ messages in thread
From: Balbir Singh @ 2008-08-27 1:17 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura
KAMEZAWA Hiroyuki wrote:
> On Tue, 26 Aug 2008 17:16:20 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>>> +/*
>>> + * per-cpu slot for freeing page_cgroup in lazy manner.
>>> + * All page_cgroup linked to this list is OBSOLETE.
>>> + */
>>> +struct mem_cgroup_sink_list {
>>> + int count;
>>> + struct page_cgroup *next;
>>> +};
>> Can't we reuse the lru field in page_cgroup to build a list? Do we need them on
>> the memory controller LRU if they are obsolete? I want to do something similar
>> for both additions and deletions - reuse pagevec style, basically. I am OK,
>> having a list as well, in that case we can just reuse the LRU pointer.
>>
> reusing page_cgroup->lru is not a choice because this patch is for avoiding
> locking on mz->lru_lock (and kfree).
> But using a vector can be a choice. I'll try it in the next version.
Kame,
Do we need to use the lru_lock? If we do an atomic check on PcgObsolete(), can't
we use another lock for obsolete pages and still use the lru list head?
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 10/14] memcg: replace res_counter
2008-08-27 0:44 ` Daisuke Nishimura
@ 2008-08-27 1:26 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-27 1:26 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir
On Wed, 27 Aug 2008 09:44:26 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> Hi.
>
> > @@ -356,7 +447,7 @@ int mem_cgroup_calc_mapped_ratio(struct
> > * usage is recorded in bytes. But, here, we assume the number of
> > * physical pages can be represented by "long" on any arch.
> > */
> > - total = (long) (mem->res.usage >> PAGE_SHIFT) + 1L;
> > + total = (long) (mem->res.pages >> PAGE_SHIFT) + 1L;
> I don't think this shift is needed.
>
Oh, yes! thanks.
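(i.e., presumably just

	total = (long) mem->res.pages + 1L;

since res.pages is already a page count, not bytes.)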
> > rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
> > return (int)((rss * 100L) / total);
> > }
>
>
> > @@ -880,8 +971,12 @@ int mem_cgroup_resize_limit(struct mem_c
> > int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
> > int progress;
> > int ret = 0;
> > + unsigned long new_lim = (unsigned long)(val >> PAGE_SHIFT);
> >
> > - while (res_counter_set_limit(&memcg->res, val)) {
> > + if (val & PAGE_SIZE)
> > + new_lim += 1;
> > +
> I'm sorry I can't understand here.
>
> Should it be "val & (PAGE_SIZE-1)"?
>
yes...will fix.
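(Something like this, I think:

	unsigned long new_lim = (unsigned long)(val >> PAGE_SHIFT);

	if (val & (PAGE_SIZE - 1))
		new_lim += 1;

so a limit that is not page-aligned rounds up by one page.)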
Thanks,
-Kame
> > + while (mem_counter_set_pages_limit(memcg, new_lim)) {
> > if (signal_pending(current)) {
> > ret = -EINTR;
> > break;
>
>
> Thanks,
> Daisuke Nishimura.
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 4/14] delay page_cgroup freeing
2008-08-27 1:17 ` Balbir Singh
@ 2008-08-27 1:39 ` KAMEZAWA Hiroyuki
2008-08-27 2:25 ` Balbir Singh
0 siblings, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-27 1:39 UTC (permalink / raw)
To: balbir; +Cc: linux-mm, nishimura
On Wed, 27 Aug 2008 06:47:59 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> KAMEZAWA Hiroyuki wrote:
> > On Tue, 26 Aug 2008 17:16:20 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> >>> +/*
> >>> + * per-cpu slot for freeing page_cgroup in lazy manner.
> >>> + * All page_cgroup linked to this list is OBSOLETE.
> >>> + */
> >>> +struct mem_cgroup_sink_list {
> >>> + int count;
> >>> + struct page_cgroup *next;
> >>> +};
> >> Can't we reuse the lru field in page_cgroup to build a list? Do we need them on
> >> the memory controller LRU if they are obsolete? I want to do something similar
> >> for both additions and deletions - reuse pagevec style, basically. I am OK,
> >> having a list as well, in that case we can just reuse the LRU pointer.
> >>
> > reusing page_cgroup->lru is not a choice because this patch is for avoiding
> > locking on mz->lru_lock (and kfree).
> > But using a vector can be a choice. I'll try it in the next version.
>
> Kame,
>
> Do we need to use the lru_lock? If we do an atomic check on PcgObsolete(), can't
> we use another lock for obsolete pages and still use the lru list head?
To reuse that, we'll have to modify lru.prev or lru.next pointer.
And there will be race with
- move_list,
- isolate_pages,
- (new) force_empty
move_list and (new) force_empty modify lru.prev/lru.next.
So, I think it's dangerous at this stage. (We can revisit this when it's
necessary, if the vector approach seems bad.)
Anyway, I think I'll be able to remove page_cgroup->next pointer I added.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 3/14] memcg: atomic_flags
2008-08-26 23:50 ` KAMEZAWA Hiroyuki
@ 2008-08-27 1:58 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-27 1:58 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: balbir, linux-mm, nishimura
On Wed, 27 Aug 2008 08:50:35 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 26 Aug 2008 10:25:55 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > Before trying to modify memory resource controller, this atomic operation
> > > on flags is necessary.
> > > Changelog (v1) -> (v2)
> > > - no changes
> > > Changelog (preview) -> (v1):
> > > - patch ordering is changed.
> > > - Added macro for defining functions for Test/Set/Clear bit.
> > > - made the names of flags shorter.
> > >
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > >
> > > ---
> > > mm/memcontrol.c | 108 +++++++++++++++++++++++++++++++++++++++-----------------
> > > 1 file changed, 77 insertions(+), 31 deletions(-)
> > >
> > > Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
> > > ===================================================================
> > > --- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
> > > +++ mmtom-2.6.27-rc3+/mm/memcontrol.c
> > > @@ -163,12 +163,57 @@ struct page_cgroup {
> > > struct list_head lru; /* per cgroup LRU list */
> > > struct page *page;
> > > struct mem_cgroup *mem_cgroup;
> > > - int flags;
> > > + unsigned long flags;
> > > };
> > > -#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
> > > -#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
> > > -#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
> > > -#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
> > > +
> > > +enum {
> > > + /* flags for mem_cgroup */
> > > + Pcg_CACHE, /* charged as cache */
> >
> > Why Pcg_CACHE and not PCG_CACHE or PAGE_CGROUP_CACHE? I think the latter is more
> > readable, no?
> >
> Hmm, ok.
>
>
> > > + /* flags for LRU placement */
> > > + Pcg_ACTIVE, /* page is active in this cgroup */
> > > + Pcg_FILE, /* page is file system backed */
> > > + Pcg_UNEVICTABLE, /* page is unevictableable */
> > > +};
> > > +
> > > +#define TESTPCGFLAG(uname, lname) \
> > ^^ uname and lname?
> > How about TEST_PAGE_CGROUP_FLAG(func, bit)
> >
> This style is from the PageXXX macros and I like the shorter name.
>
>
> > > +static inline int Pcg##uname(struct page_cgroup *pc) \
> > > + { return test_bit(Pcg_##lname, &pc->flags); }
> > > +
> >
> > I would prefer PageCgroup##func
> >
> Hmm..ok. I'll rewrite and see about the 80-column problem.
>
>
> > > +#define SETPCGFLAG(uname, lname) \
> > > +static inline void SetPcg##uname(struct page_cgroup *pc)\
> > > + { set_bit(Pcg_##lname, &pc->flags); }
> > > +
> > > +#define CLEARPCGFLAG(uname, lname) \
> > > +static inline void ClearPcg##uname(struct page_cgroup *pc) \
> > > + { clear_bit(Pcg_##lname, &pc->flags); }
> > > +
> > > +#define __SETPCGFLAG(uname, lname) \
> > > +static inline void __SetPcg##uname(struct page_cgroup *pc)\
> > > + { __set_bit(Pcg_##lname, &pc->flags); }
> > > +
> >
> > OK, so we have the non-atomic version as well
> >
> > > +#define __CLEARPCGFLAG(uname, lname) \
> > > +static inline void __ClearPcg##uname(struct page_cgroup *pc) \
> > > + { __clear_bit(Pcg_##lname, &pc->flags); }
> > > +
> > > +/* Cache flag is set only once (at allocation) */
> > > +TESTPCGFLAG(Cache, CACHE)
> > > +__SETPCGFLAG(Cache, CACHE)
> > > +
> > > +/* LRU management flags (from global-lru definition) */
> > > +TESTPCGFLAG(File, FILE)
> > > +SETPCGFLAG(File, FILE)
> > > +__SETPCGFLAG(File, FILE)
> > > +CLEARPCGFLAG(File, FILE)
> > > +
> > > +TESTPCGFLAG(Active, ACTIVE)
> > > +SETPCGFLAG(Active, ACTIVE)
> > > +__SETPCGFLAG(Active, ACTIVE)
> > > +CLEARPCGFLAG(Active, ACTIVE)
> > > +
> > > +TESTPCGFLAG(Unevictable, UNEVICTABLE)
> > > +SETPCGFLAG(Unevictable, UNEVICTABLE)
> > > +CLEARPCGFLAG(Unevictable, UNEVICTABLE)
> > > +
> > >
> > > static int page_cgroup_nid(struct page_cgroup *pc)
> > > {
> > > @@ -189,14 +234,15 @@ enum charge_type {
> > > /*
> > > * Always modified under lru lock. Then, not necessary to preempt_disable()
> > > */
> > > -static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, int flags,
> > > - bool charge)
> > > +static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> > > + struct page_cgroup *pc,
> > > + bool charge)
> > > {
> > > int val = (charge)? 1 : -1;
> > > struct mem_cgroup_stat *stat = &mem->stat;
> > >
> > > VM_BUG_ON(!irqs_disabled());
> > > - if (flags & PAGE_CGROUP_FLAG_CACHE)
> > > + if (PcgCache(pc))
> >
> > Shouldn't we use __PcgCache(), see my comments below
> >
> > > __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
> > > else
> > > __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
> > > @@ -289,18 +335,18 @@ static void __mem_cgroup_remove_list(str
> > > {
> > > int lru = LRU_BASE;
> > >
> > > - if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
> > > + if (PcgUnevictable(pc))
> >
> > Since we call this under a lock, can't we use __PcgUnevictable(pc)? If not, what
> > are we buying by doing atomic operations under a lock?
> >
> > > lru = LRU_UNEVICTABLE;
> > > else {
> > > - if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
> > > + if (PcgActive(pc))
> >
> > Ditto
> >
> > > lru += LRU_ACTIVE;
> > > - if (pc->flags & PAGE_CGROUP_FLAG_FILE)
> > > + if (PcgFile(pc))
> >
> > Ditto
> >
> > > lru += LRU_FILE;
> > > }
> > >
> > > MEM_CGROUP_ZSTAT(mz, lru) -= 1;
> > >
> > > - mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
> > > + mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
> > > list_del(&pc->lru);
> > > }
> > >
> > > @@ -309,27 +355,27 @@ static void __mem_cgroup_add_list(struct
> > > {
> > > int lru = LRU_BASE;
> > >
> > > - if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
> > > + if (PcgUnevictable(pc))
> >
> > Ditto
> >
> > > lru = LRU_UNEVICTABLE;
> > > else {
> > > - if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
> > > + if (PcgActive(pc))
> > > lru += LRU_ACTIVE;
> > > - if (pc->flags & PAGE_CGROUP_FLAG_FILE)
> > > + if (PcgFile(pc))
> >
> > Ditto
> >
> > > lru += LRU_FILE;
> > > }
> > >
> > > MEM_CGROUP_ZSTAT(mz, lru) += 1;
> > > list_add(&pc->lru, &mz->lists[lru]);
> > >
> > > - mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
> > > + mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
> > > }
> > >
> > > static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
> > > {
> > > struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
> > > - int active = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
> > > - int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
> > > - int unevictable = pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE;
> > > + int active = PcgActive(pc);
> > > + int file = PcgFile(pc);
> > > + int unevictable = PcgUnevictable(pc);
> > > enum lru_list from = unevictable ? LRU_UNEVICTABLE :
> > > (LRU_FILE * !!file + !!active);
> > >
> > > @@ -339,14 +385,14 @@ static void __mem_cgroup_move_lists(stru
> > > MEM_CGROUP_ZSTAT(mz, from) -= 1;
> > >
> > > if (is_unevictable_lru(lru)) {
> > > - pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
> > > - pc->flags |= PAGE_CGROUP_FLAG_UNEVICTABLE;
> > > + ClearPcgActive(pc);
> > > + SetPcgUnevictable(pc);
> > > } else {
> > > if (is_active_lru(lru))
> > > - pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> > > + SetPcgActive(pc);
> > > else
> > > - pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
> > > - pc->flags &= ~PAGE_CGROUP_FLAG_UNEVICTABLE;
> > > + ClearPcgActive(pc);
> > > + ClearPcgUnevictable(pc);
> >
> > Again shouldn't we be using the __ variants?
> >
>
> For testing, __ variants are ok, I think.
Sorry for confusion, __xxx for test is meaningless ;)
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 4/14] delay page_cgroup freeing
2008-08-27 1:39 ` KAMEZAWA Hiroyuki
@ 2008-08-27 2:25 ` Balbir Singh
2008-08-27 2:46 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 61+ messages in thread
From: Balbir Singh @ 2008-08-27 2:25 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura
KAMEZAWA Hiroyuki wrote:
> On Wed, 27 Aug 2008 06:47:59 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>> KAMEZAWA Hiroyuki wrote:
>>> On Tue, 26 Aug 2008 17:16:20 +0530
>>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>>
>>>>> +/*
>>>>> + * per-cpu slot for freeing page_cgroup in lazy manner.
>>>>> + * All page_cgroup linked to this list is OBSOLETE.
>>>>> + */
>>>>> +struct mem_cgroup_sink_list {
>>>>> + int count;
>>>>> + struct page_cgroup *next;
>>>>> +};
>>>> Can't we reuse the lru field in page_cgroup to build a list? Do we need them on
>>>> the memory controller LRU if they are obsolete? I want to do something similar
>>>> for both additions and deletions - reuse pagevec style, basically. I am OK,
>>>> having a list as well, in that case we can just reuse the LRU pointer.
>>>>
>>> reusing page_cgroup->lru is not a choice because this patch is for avoiding
>>> locking on mz->lru_lock (and kfree).
>>> But using a vector can be a choice. I'll try it in the next version.
>> Kame,
>>
>> Do we need to use the lru_lock? If we do an atomic check on PcgObsolete(), can't
>> we use another lock for obsolete pages and still use the lru list head?
>
> To reuse that, we'll have to modify lru.prev or lru.next pointer.
>
> And there will be race with
> - move_list,
> - isolate_pages,
> - (new) force_empty
>
I was suggesting that once the page_cgroup is marked as obsolete, we could move
it on to another queue.
> move_list and (new) force_empty modify lru.prev/lru.next.
> So, I think it's dangerous at this stage. (We can revisit this when it's
> necessary, if the vector approach seems bad.)
OK, let's try with the vector and see how that turns out.
> Anyway, I think I'll be able to remove page_cgroup->next pointer I added.
>
Yes, with the vector you should not need it.
> Thanks,
> -Kame
>
--
Balbir
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 4/14] delay page_cgroup freeing
2008-08-27 2:25 ` Balbir Singh
@ 2008-08-27 2:46 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-27 2:46 UTC (permalink / raw)
To: balbir; +Cc: linux-mm, nishimura
On Wed, 27 Aug 2008 07:55:49 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > And there will be race with
> > - move_list,
> > - isolate_pages,
> > - (new) force_empty
> >
>
> I was suggesting that once the page_cgroup is marked as obsolete, we could move
> it onto another queue.
>
I don't understand this point. What is this other queue? Can it help us
avoid the lru_lock?
The following is my current version.
==
+/*
+ * Per-cpu slot for freeing page_cgroups in a lazy manner.
+ * All page_cgroups linked to this vec are OBSOLETE.
+ * The vector size is chosen so that this struct fits within 128 bytes on 64-bit archs.
+ */
+#define MEMCG_LRU_THRESH (15)
+struct mem_cgroup_sink_vec {
+ unsigned long nr;
+ struct page_cgroup *vec[MEMCG_LRU_THRESH];
+};
+DEFINE_PER_CPU(struct mem_cgroup_sink_vec, memcg_sink_vec);
+
Obsolete page_cgroups are recorded in the vec and freed in a batched manner.
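As a sketch of the intended use (illustrative only; sink_obsolete_page_cgroup()
and free_sink_vec() are made-up names, not the posted code):
==
static void sink_obsolete_page_cgroup(struct page_cgroup *pc)
{
	struct mem_cgroup_sink_vec *msv;

	/* the caller has already marked pc as Obsolete */
	msv = &get_cpu_var(memcg_sink_vec);
	msv->vec[msv->nr++] = pc;
	if (msv->nr == MEMCG_LRU_THRESH)
		free_sink_vec(msv);	/* free the whole batch, reset nr;
					 * mz->lru_lock is taken once per batch */
	put_cpu_var(memcg_sink_vec);
}
==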
Thanks,
-Kame
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 13/14] memcg: mem+swap counter
2008-08-22 11:41 ` [RFC][PATCH 13/14] memcg: mem+swap counter KAMEZAWA Hiroyuki
@ 2008-08-28 8:51 ` Daisuke Nishimura
2008-08-28 9:32 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 61+ messages in thread
From: Daisuke Nishimura @ 2008-08-28 8:51 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir
> @@ -279,6 +285,10 @@ static int mem_counter_charge(struct mem
> spin_lock_irqsave(&mem->res.lock, flags);
> if (mem->res.pages + num > mem->res.pages_limit)
> goto busy_out;
> + if (do_swap_account &&
> + (mem->res.pages + mem->res.swaps > mem->res.memsw_limit))
^^^
You need "+ num" here.
> + goto busy_out;
> +
> mem->res.pages += num;
> if (mem->res.pages > mem->res.max_pages)
> mem->res.max_pages = mem->res.pages;
> @@ -772,20 +831,28 @@ static int mem_cgroup_charge_common(stru
> }
>
> while (mem_counter_charge(mem, 1)) {
> + int progress;
> if (!(gfp_mask & __GFP_WAIT))
> goto out;
>
> - if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> - continue;
> + progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
>
> /*
> + * When we hit memsw limit, return value of "progress"
> + * has no meaning. (some pages may just be changed to swap)
> + */
> + if (mem_counter_check_under_memsw_limit(mem) && progress)
> + continue;
> + /*
> * try_to_free_mem_cgroup_pages() might not give us a full
> * picture of reclaim. Some pages are reclaimed and might be
> * moved to swap cache or just unmapped from the cgroup.
> * Check the limit again to see if the reclaim reduced the
> * current usage of the cgroup before giving up
> */
> - if (mem_counter_check_under_pages_limit(mem))
> +
> + if (!do_swap_account
> + && mem_counter_check_under_pages_limit(mem))
> continue;
>
> if (!nr_retries--) {
IMHO, try_to_free_mem_cgroup_pages() should use swap only when
!mem_counter_check_under_pages_limit(). Otherwise, it would
try to swap out some pages in vain.
How about adding a "may_swap" flag to the arguments of try_to_free_mem_cgroup_pages(),
and passing it through to sc->may_swap?
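As a sketch of how the charge path could then use it (illustrative, not posted
code):
==
	/* in the reclaim-retry loop of mem_cgroup_charge_common() */
	bool may_swap = !mem_counter_check_under_pages_limit(mem);

	/*
	 * Swapping out only helps when the plain page limit is what we hit;
	 * when only the mem+swap limit is exceeded, pages would merely move
	 * from res.pages to res.swaps.
	 */
	progress = try_to_free_mem_cgroup_pages(mem, gfp_mask, may_swap);
==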
Thanks,
Daisuke Nishimura.
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 13/14] memcg: mem+swap counter
2008-08-28 8:51 ` Daisuke Nishimura
@ 2008-08-28 9:32 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-28 9:32 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir
On Thu, 28 Aug 2008 17:51:51 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > @@ -279,6 +285,10 @@ static int mem_counter_charge(struct mem
> > spin_lock_irqsave(&mem->res.lock, flags);
> > if (mem->res.pages + num > mem->res.pages_limit)
> > goto busy_out;
> > + if (do_swap_account &&
> > + (mem->res.pages + mem->res.swaps > mem->res.memsw_limit))
> ^^^
> You need "+ num" here.
>
Oh, yes.
> > + goto busy_out;
> > +
> > mem->res.pages += num;
> > if (mem->res.pages > mem->res.max_pages)
> > mem->res.max_pages = mem->res.pages;
>
>
> > @@ -772,20 +831,28 @@ static int mem_cgroup_charge_common(stru
> > }
> >
> > while (mem_counter_charge(mem, 1)) {
> > + int progress;
> > if (!(gfp_mask & __GFP_WAIT))
> > goto out;
> >
> > - if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> > - continue;
> > + progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
> >
> > /*
> > + * When we hit memsw limit, return value of "progress"
> > + * has no meaning. (some pages may just be changed to swap)
> > + */
> > + if (mem_counter_check_under_memsw_limit(mem) && progress)
> > + continue;
> > + /*
> > * try_to_free_mem_cgroup_pages() might not give us a full
> > * picture of reclaim. Some pages are reclaimed and might be
> > * moved to swap cache or just unmapped from the cgroup.
> > * Check the limit again to see if the reclaim reduced the
> > * current usage of the cgroup before giving up
> > */
> > - if (mem_counter_check_under_pages_limit(mem))
> > +
> > + if (!do_swap_account
> > + && mem_counter_check_under_pages_limit(mem))
> > continue;
> >
> > if (!nr_retries--) {
> IMHO, try_to_free_mem_cgroup_pages() should use swap only when
> !mem_counter_check_under_pages_limit(). Otherwise, it would
> try to swap out some pages in vain.
>
> How about adding a "may_swap" flag to the arguments of try_to_free_mem_cgroup_pages(),
> and passing it through to sc->may_swap?
>
>
Makes sense. I'll try that, thanks.
Note: the new version may not be posted this week ;)
Thanks,
-Kame
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 5/14] memcg: free page_cgroup by RCU
2008-08-22 11:34 ` [RFC][PATCH 5/14] memcg: free page_cgroup by RCU KAMEZAWA Hiroyuki
@ 2008-08-28 10:06 ` Balbir Singh
2008-08-28 10:44 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 61+ messages in thread
From: Balbir Singh @ 2008-08-28 10:06 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura
> Freeing page_cgroup by RCU.
>
> This makes access to page->page_cgroup as RCU-safe.
>
In addition to freeing page_cgroup via RCU, we'll also need to use
rcu_assign_pointer() and rcu_dereference() to make the access RCU safe.
Oh! I just saw that the next set of patches does the correct thing; could you
please update the changelog to correctly indicate that this patch releases
page->page_cgroup via RCU.
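Something along these lines (a sketch only, assuming page->page_cgroup ends up
being a plain pointer once lock_page_cgroup() goes away in the lockless patch):
==
	/* writer side, when assigning the page_cgroup */
	rcu_assign_pointer(page->page_cgroup, pc);

	/* reader side, lockless lookup */
	rcu_read_lock();
	pc = rcu_dereference(page->page_cgroup);
	if (pc && !PcgObsolete(pc))
		/* pc stays valid until rcu_read_unlock() */;
	rcu_read_unlock();
==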
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/memcontrol.c | 44 ++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 36 insertions(+), 8 deletions(-)
>
> Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
> ===================================================================
> --- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
> +++ mmtom-2.6.27-rc3+/mm/memcontrol.c
> @@ -588,19 +588,23 @@ unsigned long mem_cgroup_isolate_pages(u
> * Free obsolete page_cgroups which is linked to per-cpu drop list.
> */
>
> -static void __free_obsolete_page_cgroup(void)
> +struct page_cgroup_rcu_work {
> + struct rcu_head head;
> + struct page_cgroup *list;
> +};
> +
> +static void __free_obsolete_page_cgroup_cb(struct rcu_head *head)
> {
> struct mem_cgroup *memcg;
> struct page_cgroup *pc, *next;
> struct mem_cgroup_per_zone *mz, *page_mz;
> - struct mem_cgroup_sink_list *mcsl;
> + struct page_cgroup_rcu_work *work;
> unsigned long flags;
>
> - mcsl = &get_cpu_var(memcg_sink_list);
> - next = mcsl->next;
> - mcsl->next = NULL;
> - mcsl->count = 0;
> - put_cpu_var(memcg_sink_list);
> +
> + work = container_of(head, struct page_cgroup_rcu_work, head);
> + next = work->list;
What do we do with next here? I must be missing it, but where is the page_cgroup
released?
> + kfree(work);
>
> mz = NULL;
>
> @@ -627,6 +631,26 @@ static void __free_obsolete_page_cgroup(
> local_irq_restore(flags);
> }
>
> +static int __free_obsolete_page_cgroup(void)
> +{
> + struct page_cgroup_rcu_work *work;
> + struct mem_cgroup_sink_list *mcsl;
> +
> + work = kmalloc(sizeof(*work), GFP_ATOMIC);
> + if (!work)
> + return -ENOMEM;
> + INIT_RCU_HEAD(&work->head);
> +
> + mcsl = &get_cpu_var(memcg_sink_list);
> + work->list = mcsl->next;
> + mcsl->next = NULL;
> + mcsl->count = 0;
> + put_cpu_var(memcg_sink_list);
> +
> + call_rcu(&work->head, __free_obsolete_page_cgroup_cb);
> + return 0;
> +}
> +
I don't like this approach; it seems complex: you allocate more memory in
GFP_ATOMIC context just so that free can be called from RCU context.
> static void free_obsolete_page_cgroup(struct page_cgroup *pc)
> {
> int count;
> @@ -649,13 +673,17 @@ static DEFINE_MUTEX(memcg_force_drain_mu
>
> static void mem_cgroup_local_force_drain(struct work_struct *work)
> {
> - __free_obsolete_page_cgroup();
> + int ret;
> + do {
> + ret = __free_obsolete_page_cgroup();
We keep repeating till we get 0?
> + } while (ret);
> }
>
> static void mem_cgroup_all_force_drain(void)
> {
> mutex_lock(&memcg_force_drain_mutex);
> schedule_on_each_cpu(mem_cgroup_local_force_drain);
> + synchronize_rcu();
> mutex_unlock(&memcg_force_drain_mutex);
> }
>
>
--
Balbir
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 5/14] memcg: free page_cgroup by RCU
2008-08-28 10:06 ` Balbir Singh
@ 2008-08-28 10:44 ` KAMEZAWA Hiroyuki
2008-09-01 6:51 ` YAMAMOTO Takashi
0 siblings, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-28 10:44 UTC (permalink / raw)
To: balbir; +Cc: linux-mm, nishimura
On Thu, 28 Aug 2008 15:36:58 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> KAMEZAWA Hiroyuki wrote:
> > Freeing page_cgroup by RCU.
> >
> > This makes access to page->page_cgroup as RCU-safe.
> >
>
> In addition to freeing page_cgroup via RCU, we'll also need to use
> rcu_assign_pointer() and rcu_dereference() to make the access RCU safe.
>
Yes, but it is not necessary until the lockless patches; until then access is
guarded by lock/unlock_page_cgroup(). So I didn't mention it here.
But OK, I'll write it.
> Oh! I just saw that the next set of patches does the correct thing; could you
> please update the changelog to correctly indicate that this patch releases
> page->page_cgroup via RCU.
>
Sorry, the changelog will be fixed.
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> > mm/memcontrol.c | 44 ++++++++++++++++++++++++++++++++++++--------
> > 1 file changed, 36 insertions(+), 8 deletions(-)
> >
> > Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
> > ===================================================================
> > --- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
> > +++ mmtom-2.6.27-rc3+/mm/memcontrol.c
> > @@ -588,19 +588,23 @@ unsigned long mem_cgroup_isolate_pages(u
> > * Free obsolete page_cgroups which is linked to per-cpu drop list.
> > */
> >
> > -static void __free_obsolete_page_cgroup(void)
> > +struct page_cgroup_rcu_work {
> > + struct rcu_head head;
> > + struct page_cgroup *list;
> > +};
> > +
> > +static void __free_obsolete_page_cgroup_cb(struct rcu_head *head)
> > {
> > struct mem_cgroup *memcg;
> > struct page_cgroup *pc, *next;
> > struct mem_cgroup_per_zone *mz, *page_mz;
> > - struct mem_cgroup_sink_list *mcsl;
> > + struct page_cgroup_rcu_work *work;
> > unsigned long flags;
> >
> > - mcsl = &get_cpu_var(memcg_sink_list);
> > - next = mcsl->next;
> > - mcsl->next = NULL;
> > - mcsl->count = 0;
> > - put_cpu_var(memcg_sink_list);
> > +
> > + work = container_of(head, struct page_cgroup_rcu_work, head);
> > + next = work->list;
>
> What do we do with next here? I must be missing it, but where is the page_cgroup
> released?
>
Maybe chasing the diff is a bit complicated here. After this line, the linked
list of page_cgroups starting at 'next' is freed one by one.
(But this behavior may change in the next version.)
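Roughly, the part hidden by the diff context looks like this (a from-memory
sketch, not the exact posted code; the helper names are the usual ones but may
differ here):
==
	local_irq_save(flags);
	while (next) {
		pc = next;
		next = pc->next;		/* advance before freeing */
		page_mz = page_cgroup_zoneinfo(pc);
		if (page_mz != mz) {		/* take lru_lock once per zone run */
			if (mz)
				spin_unlock(&mz->lru_lock);
			mz = page_mz;
			spin_lock(&mz->lru_lock);
		}
		__mem_cgroup_remove_list(mz, pc);	/* off the LRU, ZSTAT -= 1 */
		memcg = pc->mem_cgroup;
		css_put(&memcg->css);			/* drop ref taken at charge */
		kmem_cache_free(page_cgroup_cache, pc);
	}
	if (mz)
		spin_unlock(&mz->lru_lock);
	/* local_irq_restore(flags) follows, as in the hunk above */
==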
> > + kfree(work);
> >
> > mz = NULL;
> >
> > @@ -627,6 +631,26 @@ static void __free_obsolete_page_cgroup(
> > local_irq_restore(flags);
> > }
> >
> > +static int __free_obsolete_page_cgroup(void)
> > +{
> > + struct page_cgroup_rcu_work *work;
> > + struct mem_cgroup_sink_list *mcsl;
> > +
> > + work = kmalloc(sizeof(*work), GFP_ATOMIC);
> > + if (!work)
> > + return -ENOMEM;
> > + INIT_RCU_HEAD(&work->head);
> > +
> > + mcsl = &get_cpu_var(memcg_sink_list);
> > + work->list = mcsl->next;
> > + mcsl->next = NULL;
> > + mcsl->count = 0;
> > + put_cpu_var(memcg_sink_list);
> > +
> > + call_rcu(&work->head, __free_obsolete_page_cgroup_cb);
> > + return 0;
> > +}
> > +
>
> I don't like this approach, seems complex, you allocate more memory in
> GFP_ATOMIC context, so that free can be called from RCU context.
>
We should assume kmalloc() can return NULL.
But I understand you don't like this; allow me to consider it some more.
> > static void free_obsolete_page_cgroup(struct page_cgroup *pc)
> > {
> > int count;
> > @@ -649,13 +673,17 @@ static DEFINE_MUTEX(memcg_force_drain_mu
> >
> > static void mem_cgroup_local_force_drain(struct work_struct *work)
> > {
> > - __free_obsolete_page_cgroup();
> > + int ret;
> > + do {
> > + ret = __free_obsolete_page_cgroup();
>
> We keep repeating till we get 0?
>
Yes, this returns 0 or -ENOMEM.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 7/14] memcg: add prefetch to spinlock
2008-08-22 11:36 ` [RFC][PATCH 7/14] memcg: add prefetch to spinlock KAMEZAWA Hiroyuki
@ 2008-08-28 11:00 ` Balbir Singh
0 siblings, 0 replies; 61+ messages in thread
From: Balbir Singh @ 2008-08-28 11:00 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura
KAMEZAWA Hiroyuki wrote:
> Address of "mz" can be calculated in easy way.
> prefetch it (we do spin_lock.)
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/memcontrol.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
> ===================================================================
> --- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
> +++ mmtom-2.6.27-rc3+/mm/memcontrol.c
> @@ -707,6 +707,8 @@ static int mem_cgroup_charge_common(stru
> }
> }
>
> + mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zonenum(page));
> + prefetchw(mz);
Nice optimization!
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
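For reference, the address computation really is just pointer chasing; from
memory the helper looks roughly like this (not copied from the patch):
==
static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
{
	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
}
==
so the prefetchw() can start pulling in the cacheline that the later
spin_lock(&mz->lru_lock) will write to while the rest of the charge path runs.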
--
Balbir
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 2/14] memcg: rewrite force_empty
2008-08-22 11:31 ` [RFC][PATCH 2/14] memcg: rewrite force_empty KAMEZAWA Hiroyuki
2008-08-25 3:21 ` KAMEZAWA Hiroyuki
@ 2008-08-29 11:45 ` Daisuke Nishimura
2008-08-30 7:30 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 61+ messages in thread
From: Daisuke Nishimura @ 2008-08-29 11:45 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir
> /*
> - * This routine traverse page_cgroup in given list and drop them all.
> - * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
> + * This routine moves all account to root cgroup.
> */
> -#define FORCE_UNCHARGE_BATCH (128)
> static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
> struct mem_cgroup_per_zone *mz,
> enum lru_list lru)
> {
> struct page_cgroup *pc;
> struct page *page;
> - int count = FORCE_UNCHARGE_BATCH;
> unsigned long flags;
> struct list_head *list;
>
> @@ -853,22 +892,28 @@ static void mem_cgroup_force_empty_list(
> pc = list_entry(list->prev, struct page_cgroup, lru);
> page = pc->page;
> get_page(page);
> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> - /*
> - * Check if this page is on LRU. !LRU page can be found
> - * if it's under page migration.
> - */
> - if (PageLRU(page)) {
> - __mem_cgroup_uncharge_common(page,
> - MEM_CGROUP_CHARGE_TYPE_FORCE);
> + if (!trylock_page(page)) {
> + list_move(&pc->lru, list);
> + put_page(page):
^^^
s/:/;
Just to make sure :)
Thanks,
Daisuke Nishimura.
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> + yield();
> + spin_lock_irqsave(&mz->lru_lock, flags);
> + continue;
> + }
> + if (mem_cgroup_move_account(page, pc, mem, &init_mem_cgroup)) {
> + /* some confliction */
> + list_move(&pc->lru, list);
> + unlock_page(page);
> put_page(page);
> - if (--count <= 0) {
> - count = FORCE_UNCHARGE_BATCH;
> - cond_resched();
> - }
> - } else
> - cond_resched();
> - spin_lock_irqsave(&mz->lru_lock, flags);
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> + yield();
> + spin_lock_irqsave(&mz->lru_lock, flags);
> + } else {
> + unlock_page(page);
> + put_page(page);
> + }
> + if (atomic_read(&mem->css.cgroup->count) > 0)
> + break;
> }
> spin_unlock_irqrestore(&mz->lru_lock, flags);
> }
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 2/14] memcg: rewrite force_empty
2008-08-29 11:45 ` Daisuke Nishimura
@ 2008-08-30 7:30 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-30 7:30 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir
On Fri, 29 Aug 2008 20:45:49 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > /*
> > - * This routine traverse page_cgroup in given list and drop them all.
> > - * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
> > + * This routine moves all account to root cgroup.
> > */
> > -#define FORCE_UNCHARGE_BATCH (128)
> > static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
> > struct mem_cgroup_per_zone *mz,
> > enum lru_list lru)
> > {
> > struct page_cgroup *pc;
> > struct page *page;
> > - int count = FORCE_UNCHARGE_BATCH;
> > unsigned long flags;
> > struct list_head *list;
> >
> > @@ -853,22 +892,28 @@ static void mem_cgroup_force_empty_list(
> > pc = list_entry(list->prev, struct page_cgroup, lru);
> > page = pc->page;
> > get_page(page);
> > - spin_unlock_irqrestore(&mz->lru_lock, flags);
> > - /*
> > - * Check if this page is on LRU. !LRU page can be found
> > - * if it's under page migration.
> > - */
> > - if (PageLRU(page)) {
> > - __mem_cgroup_uncharge_common(page,
> > - MEM_CGROUP_CHARGE_TYPE_FORCE);
> > + if (!trylock_page(page)) {
> > + list_move(&pc->lru, list);
> > + put_page(page):
> ^^^
> s/:/;
>
> Just to make sure :)
>
Oh, thanks.
-Kame
>
> Thanks,
> Daisuke Nishimura.
>
>
> > + spin_unlock_irqrestore(&mz->lru_lock, flags);
> > + yield();
> > + spin_lock_irqsave(&mz->lru_lock, flags);
> > + continue;
> > + }
> > + if (mem_cgroup_move_account(page, pc, mem, &init_mem_cgroup)) {
> > + /* some confliction */
> > + list_move(&pc->lru, list);
> > + unlock_page(page);
> > put_page(page);
> > - if (--count <= 0) {
> > - count = FORCE_UNCHARGE_BATCH;
> > - cond_resched();
> > - }
> > - } else
> > - cond_resched();
> > - spin_lock_irqsave(&mz->lru_lock, flags);
> > + spin_unlock_irqrestore(&mz->lru_lock, flags);
> > + yield();
> > + spin_lock_irqsave(&mz->lru_lock, flags);
> > + } else {
> > + unlock_page(page);
> > + put_page(page);
> > + }
> > + if (atomic_read(&mem->css.cgroup->count) > 0)
> > + break;
> > }
> > spin_unlock_irqrestore(&mz->lru_lock, flags);
> > }
>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 5/14] memcg: free page_cgroup by RCU
2008-08-28 10:44 ` KAMEZAWA Hiroyuki
@ 2008-09-01 6:51 ` YAMAMOTO Takashi
2008-09-01 7:01 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 61+ messages in thread
From: YAMAMOTO Takashi @ 2008-09-01 6:51 UTC (permalink / raw)
To: kamezawa.hiroyu; +Cc: balbir, linux-mm, nishimura
hi,
> > > @@ -649,13 +673,17 @@ static DEFINE_MUTEX(memcg_force_drain_mu
> > >
> > > static void mem_cgroup_local_force_drain(struct work_struct *work)
> > > {
> > > - __free_obsolete_page_cgroup();
> > > + int ret;
> > > + do {
> > > + ret = __free_obsolete_page_cgroup();
> >
> > We keep repeating till we get 0?
> >
> yes. this returns 0 or -ENOMEM.
It's problematic to keep busy-looping on ENOMEM, especially for GFP_ATOMIC.
YAMAMOTO Takashi
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 5/14] memcg: free page_cgroup by RCU
2008-09-01 6:51 ` YAMAMOTO Takashi
@ 2008-09-01 7:01 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-01 7:01 UTC (permalink / raw)
To: YAMAMOTO Takashi; +Cc: balbir, linux-mm, nishimura
On Mon, 1 Sep 2008 15:51:44 +0900 (JST)
yamamoto@valinux.co.jp (YAMAMOTO Takashi) wrote:
> hi,
>
> > > > @@ -649,13 +673,17 @@ static DEFINE_MUTEX(memcg_force_drain_mu
> > > >
> > > > static void mem_cgroup_local_force_drain(struct work_struct *work)
> > > > {
> > > > - __free_obsolete_page_cgroup();
> > > > + int ret;
> > > > + do {
> > > > + ret = __free_obsolete_page_cgroup();
> > >
> > > We keep repeating till we get 0?
> > >
> > yes. this returns 0 or -ENOMEM.
>
> it's problematic to keep busy-looping on ENOMEM, esp. for GFP_ATOMIC.
>
Ah, thank you. I'll remove this routine in the next version.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 14/14]memcg: mem+swap accounting
2008-08-22 11:44 ` [RFC][PATCH 14/14]memcg: mem+swap accounting KAMEZAWA Hiroyuki
@ 2008-09-01 7:15 ` Daisuke Nishimura
2008-09-01 7:58 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 61+ messages in thread
From: Daisuke Nishimura @ 2008-09-01 7:15 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir
Hi, Kamezawa-san.
I'm testing these patches on mmotm-2008-08-29-01-08
(with some trivial fixes I've reported and some debug code),
but swap_in_bytes sometimes becomes very large (it seems that
over-uncharging is happening...) and I can see OOMs
if I've set memswap_limit.
I'm digging into this now, but have you ever seen it too?
Thanks,
Daisuke Nishimura.
On Fri, 22 Aug 2008 20:44:55 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Add Swap accounting feature to memory resource controller.
>
> Accounting is done in following logic.
>
> Swap-out:
> - When add_to_swap_cache() is called, swp_entry is marked as to be under
> page->page_cgroup->mem_cgroup.
> - When swap-cache is uncharged (fully unmapped), we don't uncharge it.
> - When swap-cache is deleted, we uncharge it from memory and charge it to
> swaps. This ops is done only when swap cache is already charged.
> res.pages -=1, res.swaps +=1.
>
> Swap-in:
> - When add_to_swapcache() is called, we do nothing.
> - When swap is mapped, we charge to memory and uncharge from swap
> res.pages +=1, res.swaps -=1.
>
> SwapCache-Deleting:
> - If the page doesn't have page_cgroup, nothing to do.
> - If the page is still charged as swap, just uncharge memory.
> (This can happen under shmem/tmpfs.)
> - If the page is not charged as swap, res.pages -= 1, res.swaps +=1.
>
> Swap-Freeing:
> - if swap entry is charged, res.swaps -= 1.
>
> Almost all operations are done against SwapCache, which is Locked.
>
> This patch uses an array to remember the owner of swp_entry. Considering x86-32,we should avoid to use NORMAL memory and vmalloc() area too much. This patch
> uses HIGHMEM to record information under kmap_atomic(KM_USER0). And information
> is recored in 2 bytes per 1 swap page.
> (memory controller's id is defined as smaller than unsigned short)
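(For scale, assuming 4KB pages: sizeof(struct swap_cgroup) above is 2 bytes, so
ENTS_PER_PAGE is 2048 and a 1GB swap device with 262144 slots needs 128 record
pages, i.e. about 512KB of HIGHMEM, allocated lazily by swap_cgroup_prepare().)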
>
> Changelog: (preview) -> (v2)
> - removed radix-tree. just use array.
> - removed linked-list.
> - use memcgroup_id rather than pointer.
> - added force_empty (temporal) support.
> This should be reworked in future. (But for now, this works well for us.)
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> ---
> include/linux/swap.h | 38 +++++
> init/Kconfig | 2
> mm/memcontrol.c | 364 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> mm/migrate.c | 7
> mm/swap_state.c | 7
> mm/swapfile.c | 14 +
> 6 files changed, 422 insertions(+), 10 deletions(-)
>
> Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
> ===================================================================
> --- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
> +++ mmtom-2.6.27-rc3+/mm/memcontrol.c
> @@ -34,6 +34,10 @@
> #include <linux/mm_inline.h>
> #include <linux/pagemap.h>
> #include <linux/page_cgroup.h>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
> +#endif
>
> #include <asm/uaccess.h>
>
> @@ -43,9 +47,28 @@ static struct kmem_cache *page_cgroup_ca
> #define NR_MEMCGRP_ID (32767)
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +
> #define do_swap_account (1)
> +
> +static void
> +swap_cgroup_delete_account(struct mem_cgroup *mem, struct page *page);
> +
> +static struct mem_cgroup *lookup_mem_cgroup_from_swap(struct page *page);
> +static void swap_cgroup_clean_account(struct mem_cgroup *mem);
> #else
> #define do_swap_account (0)
> +
> +static void
> +swap_cgroup_delete_account(struct mem_cgroup *mem, struct page *page)
> +{
> +}
> +static struct mem_cgroup *lookup_mem_cgroup_from_swap(struct page *page)
> +{
> + return NULL;
> +}
> +static void swap_cgroup_clean_account(struct mem_cgroup *mem)
> +{
> +}
> #endif
>
>
> @@ -889,6 +912,9 @@ static int mem_cgroup_charge_common(stru
> __mem_cgroup_add_list(mz, pc);
> spin_unlock_irqrestore(&mz->lru_lock, flags);
>
> + /* We did swap-in, uncharge swap. */
> + if (do_swap_account && PageSwapCache(page))
> + swap_cgroup_delete_account(mem, page);
> return 0;
> out:
> css_put(&mem->css);
> @@ -899,6 +925,8 @@ err:
>
> int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> {
> + struct mem_cgroup *memcg = NULL;
> +
> if (mem_cgroup_subsys.disabled)
> return 0;
>
> @@ -935,13 +963,19 @@ int mem_cgroup_charge(struct page *page,
> }
> rcu_read_unlock();
> }
> + /* Swap-in ? */
> + if (do_swap_account && PageSwapCache(page))
> + memcg = lookup_mem_cgroup_from_swap(page);
> +
> return mem_cgroup_charge_common(page, mm, gfp_mask,
> - MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> + MEM_CGROUP_CHARGE_TYPE_MAPPED, memcg);
> }
>
> int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask)
> {
> + struct mem_cgroup *memcg = NULL;
> +
> if (mem_cgroup_subsys.disabled)
> return 0;
>
> @@ -971,9 +1005,11 @@ int mem_cgroup_cache_charge(struct page
>
> if (unlikely(!mm))
> mm = &init_mm;
> + if (do_swap_account && PageSwapCache(page))
> + memcg = lookup_mem_cgroup_from_swap(page);
>
> return mem_cgroup_charge_common(page, mm, gfp_mask,
> - MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
> + MEM_CGROUP_CHARGE_TYPE_CACHE, memcg);
> }
>
> /*
> @@ -998,9 +1034,11 @@ __mem_cgroup_uncharge_common(struct page
>
> VM_BUG_ON(pc->page != page);
>
> - if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> - && ((PcgCache(pc) || page_mapped(page))))
> - goto out;
> + if ((ctype != MEM_CGROUP_CHARGE_TYPE_FORCE))
> + if (PageSwapCache(page) || page_mapped(page) ||
> + (page->mapping && !PageAnon(page)))
> + goto out;
> +
> mem = pc->mem_cgroup;
> SetPcgObsolete(pc);
> page_assign_page_cgroup(page, NULL);
> @@ -1577,6 +1615,8 @@ static void mem_cgroup_pre_destroy(struc
> {
> struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> mem_cgroup_force_empty(mem);
> + if (do_swap_account)
> + swap_cgroup_clean_account(mem);
> }
>
> static void mem_cgroup_destroy(struct cgroup_subsys *ss,
> @@ -1635,3 +1675,317 @@ struct cgroup_subsys mem_cgroup_subsys =
> .attach = mem_cgroup_move_task,
> .early_init = 0,
> };
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +/*
> + * swap accounting infrastructure.
> + */
> +DEFINE_MUTEX(swap_cgroup_mutex);
> +spinlock_t swap_cgroup_lock[MAX_SWAPFILES];
> +struct page **swap_cgroup_map[MAX_SWAPFILES];
> +unsigned long swap_cgroup_pages[MAX_SWAPFILES];
> +
> +
> +/* This definition is based onf NR_MEM_CGROUP==32768 */
> +struct swap_cgroup {
> + unsigned short memcgrp_id:15;
> + unsigned short count:1;
> +};
> +#define ENTS_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
> +
> +/*
> + * Called from get_swap_ent().
> + */
> +int swap_cgroup_prepare(swp_entry_t ent, gfp_t mask)
> +{
> + struct page *page;
> + unsigned long array_index = swp_offset(ent) / ENTS_PER_PAGE;
> + int type = swp_type(ent);
> + unsigned long flags;
> +
> + if (swap_cgroup_map[type][array_index])
> + return 0;
> + page = alloc_page(mask | __GFP_HIGHMEM | __GFP_ZERO);
> + if (!page)
> + return -ENOMEM;
> + spin_lock_irqsave(&swap_cgroup_lock[type], flags);
> + if (swap_cgroup_map[type][array_index] == NULL) {
> + swap_cgroup_map[type][array_index] = page;
> + page = NULL;
> + }
> + spin_unlock_irqrestore(&swap_cgroup_lock[type], flags);
> +
> + if (page)
> + __free_page(page);
> + return 0;
> +}
> +
> +/**
> + * swap_cgroup_record_info
> + * @page ..... a page which is in some mem_cgroup.
> + * @entry .... swp_entry of the page. (or old swp_entry of the page)
> + * @delete ... if 0 add entry, if 1 remove entry.
> + *
> + * At set new value:
> + * This is called from add_to_swap_cache() after added to swapper_space.
> + * Then...this is called under page_lock() and this page is on radix-tree
> + * We're safe to access page->page_cgroup->mem_cgroup.
> + * This function never fails. (may leak information...but it's not Oops.)
> + *
> + * At delettion:
> + * Returns count is set or not.
> + */
> +int swap_cgroup_record_info(struct page *page, swp_entry_t entry, bool del)
> +{
> + unsigned long flags;
> + int type = swp_type(entry);
> + unsigned long offset = swp_offset(entry);
> + unsigned long array_index = offset/ENTS_PER_PAGE;
> + unsigned long index = offset & (ENTS_PER_PAGE - 1);
> + struct page *mappage;
> + struct swap_cgroup *map;
> + struct page_cgroup *pc = NULL;
> + int ret = 0;
> +
> + if (!del) {
> + /*
> + * At swap-in, the page is added to swap cache before tied to
> + * mem_cgroup. This page will be finally charged at page fault.
> + * Ignore this at this point.
> + */
> + pc = page_get_page_cgroup(page);
> + if (!pc)
> + return ret;
> + }
> + if (!swap_cgroup_map[type])
> + return ret;
> + mappage = swap_cgroup_map[type][array_index];
> + if (!mappage)
> + return ret;
> +
> + local_irq_save(flags);
> + map = kmap_atomic(mappage, KM_USER0);
> + if (!del) {
> + map[index].memcgrp_id = pc->mem_cgroup->memcgrp_id;
> + map[index].count = 0;
> + } else {
> + if (map[index].count) {
> + ret = map[index].memcgrp_id;
> + map[index].count = 0;
> + }
> + map[index].memcgrp_id = 0;
> + }
> + kunmap_atomic(mappage, KM_USER0);
> + local_irq_restore(flags);
> + return ret;
> +}
> +
> +/*
> + * returns mem_cgroup pointer when swp_entry is assgiend to.
> + */
> +static struct mem_cgroup *swap_cgroup_lookup(swp_entry_t entry)
> +{
> + unsigned long flags;
> + int type = swp_type(entry);
> + unsigned long offset = swp_offset(entry);
> + unsigned long array_index = offset/ENTS_PER_PAGE;
> + unsigned long index = offset & (ENTS_PER_PAGE - 1);
> + struct page *mappage;
> + struct swap_cgroup *map;
> + unsigned short id;
> +
> + if (!swap_cgroup_map[type])
> + return NULL;
> + mappage = swap_cgroup_map[type][array_index];
> + if (!mappage)
> + return NULL;
> +
> + local_irq_save(flags);
> + map = kmap_atomic(mappage, KM_USER0);
> + id = map[index].memcgrp_id;
> + kunmap_atomic(mappage, KM_USER0);
> + local_irq_restore(flags);
> + return mem_cgroup_id_lookup(id);
> +}
> +
> +static struct mem_cgroup *lookup_mem_cgroup_from_swap(struct page *page)
> +{
> + swp_entry_t entry = { .val = page_private(page) };
> + return swap_cgroup_lookup(entry);
> +}
> +
> +/*
> + * set/clear accounting information of swap_cgroup.
> + *
> + * Called when set/clear accounting information.
> + * returns 1 at success.
> + */
> +static int swap_cgroup_account(struct mem_cgroup *memcg,
> + swp_entry_t entry, bool set)
> +{
> + unsigned long flags;
> + int type = swp_type(entry);
> + unsigned long offset = swp_offset(entry);
> + unsigned long array_index = offset/ENTS_PER_PAGE;
> + unsigned long index = offset & (ENTS_PER_PAGE - 1);
> + struct page *mappage;
> + struct swap_cgroup *map;
> + int ret = 0;
> +
> + if (!swap_cgroup_map[type])
> + return ret;
> + mappage = swap_cgroup_map[type][array_index];
> + if (!mappage)
> + return ret;
> +
> +
> + local_irq_save(flags);
> + map = kmap_atomic(mappage, KM_USER0);
> + if (map[index].memcgrp_id == memcg->memcgrp_id) {
> + if (set && map[index].count == 0) {
> + map[index].count = 1;
> + ret = 1;
> + } else if (!set && map[index].count == 1) {
> + map[index].count = 0;
> + ret = 1;
> + }
> + }
> + kunmap_atomic(mappage, KM_USER0);
> + local_irq_restore(flags);
> + return ret;
> +}
> +
> +void swap_cgroup_delete_account(struct mem_cgroup *mem, struct page *page)
> +{
> + swp_entry_t val = { .val = page_private(page) };
> + if (swap_cgroup_account(mem, val, false))
> + mem_counter_uncharge_swap(mem);
> +}
> +
> +/*
> + * Called from delete_from_swap_cache() then, page is Locked! and
> + * swp_entry is still in use.
> + */
> +void swap_cgroup_delete_swapcache(struct page *page, swp_entry_t entry)
> +{
> + struct page_cgroup *pc;
> +
> + pc = page_get_page_cgroup(page);
> + /* swap-in but not mapped. */
> + if (!pc)
> + return;
> +
> + if (swap_cgroup_account(pc->mem_cgroup, entry, true))
> + __mem_cgroup_uncharge_common(page,
> + MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
> + else if (page->mapping && !PageAnon(page))
> + __mem_cgroup_uncharge_common(page,
> + MEM_CGROUP_CHARGE_TYPE_CACHE);
> + else
> + __mem_cgroup_uncharge_common(page,
> + MEM_CGROUP_CHARGE_TYPE_MAPPED);
> + return;
> +}
> +
> +void swap_cgroup_delete_swap(swp_entry_t entry)
> +{
> + int ret;
> + struct mem_cgroup *mem;
> +
> + ret = swap_cgroup_record_info(NULL, entry, true);
> + if (ret) {
> + mem = mem_cgroup_id_lookup(ret);
> + if (mem)
> + mem_counter_uncharge_swap(mem);
> + }
> +}
> +
> +
> +/*
> + * Forget all accounts under swap_cgroup of memcg.
> + * Called from destroying context.
> + */
> +static void swap_cgroup_clean_account(struct mem_cgroup *memcg)
> +{
> + int type;
> + unsigned long array_index, flags;
> + int index;
> + struct page *page;
> + struct swap_cgroup *map;
> +
> + if (!memcg->res.swaps)
> + return;
> + mutex_lock(&swap_cgroup_mutex);
> + for (type = 0; type < MAX_SWAPFILES; type++) {
> + if (swap_cgroup_pages[type] == 0)
> + continue;
> + for (array_index = 0;
> + array_index < swap_cgroup_pages[type];
> + array_index++) {
> + page = swap_cgroup_map[type][array_index];
> + if (!page)
> + continue;
> + local_irq_save(flags);
> + map = kmap_atomic(page, KM_USER0);
> + for (index = 0; index < ENTS_PER_PAGE; index++) {
> + if (map[index].memcgrp_id
> + == memcg->memcgrp_id) {
> + map[index].memcgrp_id = 0;
> + map[index].count = 0;
> + }
> + }
> + kunmap_atomic(page, KM_USER0);
> + local_irq_restore(flags);
> + }
> + mutex_unlock(&swap_cgroup_mutex);
> + yield();
> + mutex_lock(&swap_cgroup_mutex);
> + }
> + mutex_unlock(&swap_cgroup_mutex);
> +}
> +
> +/*
> + * called from swapon().
> + */
> +int swap_cgroup_swapon(int type, unsigned long max_pages)
> +{
> + void *array;
> + int array_size;
> +
> + VM_BUG_ON(swap_cgroup_map[type]);
> +
> + array_size = ((max_pages/ENTS_PER_PAGE) + 1) * sizeof(void *);
> +
> + array = vmalloc(array_size);
> + if (!array) {
> + printk("swap %d will not be accounted\n", type);
> + return -ENOMEM;
> + }
> + memset(array, 0, array_size);
> + mutex_lock(&swap_cgroup_mutex);
> + swap_cgroup_pages[type] = (max_pages/ENTS_PER_PAGE + 1);
> + swap_cgroup_map[type] = array;
> + mutex_unlock(&swap_cgroup_mutex);
> + spin_lock_init(&swap_cgroup_lock[type]);
> + return 0;
> +}
> +
> +/*
> + * called from swapoff().
> + */
> +void swap_cgroup_swapoff(int type)
> +{
> + int i;
> + for (i = 0; i < swap_cgroup_pages[type]; i++) {
> + struct page *page = swap_cgroup_map[type][i];
> + if (page)
> + __free_page(page);
> + }
> + mutex_lock(&swap_cgroup_mutex);
> + vfree(swap_cgroup_map[type]);
> + swap_cgroup_map[type] = NULL;
> + mutex_unlock(&swap_cgroup_mutex);
> + swap_cgroup_pages[type] = 0;
> +}
> +
> +#endif
> Index: mmtom-2.6.27-rc3+/include/linux/swap.h
> ===================================================================
> --- mmtom-2.6.27-rc3+.orig/include/linux/swap.h
> +++ mmtom-2.6.27-rc3+/include/linux/swap.h
> @@ -335,6 +335,44 @@ static inline void disable_swap_token(vo
> put_swap_token(swap_token_mm);
> }
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +extern int swap_cgroup_swapon(int type, unsigned long max_pages);
> +extern void swap_cgroup_swapoff(int type);
> +extern void swap_cgroup_delete_swap(swp_entry_t entry);
> +extern int swap_cgroup_prepare(swp_entry_t ent, gfp_t mask);
> +extern int swap_cgroup_record_info(struct page *, swp_entry_t ent, bool del);
> +extern void swap_cgroup_delete_swapcache(struct page *page, swp_entry_t entry);
> +
> +#else
> +static inline int swap_cgroup_swapon(int type, unsigned long max_pages)
> +{
> + return 0;
> +}
> +static inline void swap_cgroup_swapoff(int type)
> +{
> + return;
> +}
> +static inline void swap_cgroup_delete_swap(swp_entry_t entry)
> +{
> + return;
> +}
> +static inline int swap_cgroup_prapare(swp_entry_t ent, gfp_t mask)
> +{
> + return 0;
> +}
> +static inline int
> + swap_cgroup_record_info(struct page *, swp_entry_t ent, bool del)
> +{
> + return 0;
> +}
> +static inline
> +void swap_cgroup_delete_swapcache(struct page *page, swp_entry_t entry)
> +{
> + return;
> +}
> +#endif
> +
> +
> #else /* CONFIG_SWAP */
>
> #define total_swap_pages 0
> Index: mmtom-2.6.27-rc3+/mm/swapfile.c
> ===================================================================
> --- mmtom-2.6.27-rc3+.orig/mm/swapfile.c
> +++ mmtom-2.6.27-rc3+/mm/swapfile.c
> @@ -270,8 +270,9 @@ out:
> return NULL;
> }
>
> -static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
> +static int swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> {
> + unsigned long offset = swp_offset(entry);
> int count = p->swap_map[offset];
>
> if (count < SWAP_MAP_MAX) {
> @@ -286,6 +287,7 @@ static int swap_entry_free(struct swap_i
> swap_list.next = p - swap_info;
> nr_swap_pages++;
> p->inuse_pages--;
> + swap_cgroup_delete_swap(entry);
> }
> }
> return count;
> @@ -301,7 +303,7 @@ void swap_free(swp_entry_t entry)
>
> p = swap_info_get(entry);
> if (p) {
> - swap_entry_free(p, swp_offset(entry));
> + swap_entry_free(p, entry);
> spin_unlock(&swap_lock);
> }
> }
> @@ -420,7 +422,7 @@ void free_swap_and_cache(swp_entry_t ent
>
> p = swap_info_get(entry);
> if (p) {
> - if (swap_entry_free(p, swp_offset(entry)) == 1) {
> + if (swap_entry_free(p, entry) == 1) {
> page = find_get_page(&swapper_space, entry.val);
> if (page && !trylock_page(page)) {
> page_cache_release(page);
> @@ -1343,6 +1345,7 @@ asmlinkage long sys_swapoff(const char _
> spin_unlock(&swap_lock);
> mutex_unlock(&swapon_mutex);
> vfree(swap_map);
> + swap_cgroup_swapoff(type);
> inode = mapping->host;
> if (S_ISBLK(inode->i_mode)) {
> struct block_device *bdev = I_BDEV(inode);
> @@ -1669,6 +1672,11 @@ asmlinkage long sys_swapon(const char __
> 1 /* header page */;
> if (error)
> goto bad_swap;
> +
> + if (swap_cgroup_swapon(type, maxpages)) {
> + printk("We don't enable swap accounting because of"
> + "memory shortage\n");
> + }
> }
>
> if (nr_good_pages) {
> Index: mmtom-2.6.27-rc3+/mm/swap_state.c
> ===================================================================
> --- mmtom-2.6.27-rc3+.orig/mm/swap_state.c
> +++ mmtom-2.6.27-rc3+/mm/swap_state.c
> @@ -76,6 +76,9 @@ int add_to_swap_cache(struct page *page,
> BUG_ON(PageSwapCache(page));
> BUG_ON(PagePrivate(page));
> BUG_ON(!PageSwapBacked(page));
> + error = swap_cgroup_prepare(entry, gfp_mask);
> + if (error)
> + return error;
> error = radix_tree_preload(gfp_mask);
> if (!error) {
> page_cache_get(page);
> @@ -89,6 +92,7 @@ int add_to_swap_cache(struct page *page,
> total_swapcache_pages++;
> __inc_zone_page_state(page, NR_FILE_PAGES);
> INC_CACHE_INFO(add_total);
> + swap_cgroup_record_info(page, entry, false);
> }
> spin_unlock_irq(&swapper_space.tree_lock);
> radix_tree_preload_end();
> @@ -108,6 +112,8 @@ int add_to_swap_cache(struct page *page,
> */
> void __delete_from_swap_cache(struct page *page)
> {
> + swp_entry_t entry = { .val = page_private(page) };
> +
> BUG_ON(!PageLocked(page));
> BUG_ON(!PageSwapCache(page));
> BUG_ON(PageWriteback(page));
> @@ -117,6 +123,7 @@ void __delete_from_swap_cache(struct pag
> set_page_private(page, 0);
> ClearPageSwapCache(page);
> total_swapcache_pages--;
> + swap_cgroup_delete_swapcache(page, entry);
> __dec_zone_page_state(page, NR_FILE_PAGES);
> INC_CACHE_INFO(del_total);
> }
> Index: mmtom-2.6.27-rc3+/init/Kconfig
> ===================================================================
> --- mmtom-2.6.27-rc3+.orig/init/Kconfig
> +++ mmtom-2.6.27-rc3+/init/Kconfig
> @@ -416,7 +416,7 @@ config CGROUP_MEM_RES_CTLR
> could in turn add some fork/exit overhead.
>
> config CGROUP_MEM_RES_CTLR_SWAP
> - bool "Memory Resource Controller Swap Extension (Broken)"
> + bool "Memory Resource Controller Swap Extension (EXPERIMENTAL)"
> depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> help
> Add swap management feature to memory resource controller. By this,
> Index: mmtom-2.6.27-rc3+/mm/migrate.c
> ===================================================================
> --- mmtom-2.6.27-rc3+.orig/mm/migrate.c
> +++ mmtom-2.6.27-rc3+/mm/migrate.c
> @@ -339,6 +339,8 @@ static int migrate_page_move_mapping(str
> */
> static void migrate_page_copy(struct page *newpage, struct page *page)
> {
> + int was_swapcache = 0;
> +
> copy_highpage(newpage, page);
>
> if (PageError(page))
> @@ -372,14 +374,17 @@ static void migrate_page_copy(struct pag
> mlock_migrate_page(newpage, page);
>
> #ifdef CONFIG_SWAP
> + was_swapcache = PageSwapCache(page);
> ClearPageSwapCache(page);
> #endif
> ClearPagePrivate(page);
> set_page_private(page, 0);
> /* page->mapping contains a flag for PageAnon() */
> if (PageAnon(page)) {
> - /* This page is uncharged at try_to_unmap(). */
> + /* This page is uncharged at try_to_unmap() if not SwapCache. */
> page->mapping = NULL;
> + if (was_swapcache)
> + mem_cgroup_uncharge_page(page);
> } else {
> /* Obsolete file cache should be uncharged */
> page->mapping = NULL;
>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [RFC][PATCH 14/14]memcg: mem+swap accounting
2008-09-01 7:15 ` Daisuke Nishimura
@ 2008-09-01 7:58 ` KAMEZAWA Hiroyuki
2008-09-01 8:53 ` Daisuke Nishimura
0 siblings, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-01 7:58 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir
On Mon, 1 Sep 2008 16:15:01 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> Hi, Kamezawa-san.
>
> I'm testing these patches on mmotm-2008-08-29-01-08
> (with some trivial fixes I've reported and some debug codes),
> but swap_in_bytes sometimes becomes very huge(it seems that
> over uncharge is happening..) and I can see OOM
> if I've set memswap_limit.
>
> I'm digging this now, but have you also ever seen it?
>
I didn't see that. But, as you say, maybe an over-uncharge. Hmm...
What kind of test? Just using swap? And did you use shmem or tmpfs?
Thanks,
-Kame
>
> Thanks,
> Daisuke Nishimura.
>
> On Fri, 22 Aug 2008 20:44:55 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Add Swap accounting feature to memory resource controller.
> >
> > Accounting is done in following logic.
> >
> > Swap-out:
> > - When add_to_swap_cache() is called, swp_entry is marked as to be under
> > page->page_cgroup->mem_cgroup.
> > - When swap-cache is uncharged (fully unmapped), we don't uncharge it.
> > - When swap-cache is deleted, we uncharge it from memory and charge it to
> > swaps. This ops is done only when swap cache is already charged.
> > res.pages -=1, res.swaps +=1.
> >
> > Swap-in:
> > - When add_to_swapcache() is called, we do nothing.
> > - When swap is mapped, we charge to memory and uncharge from swap
> > res.pages +=1, res.swaps -=1.
> >
> > SwapCache-Deleting:
> > - If the page doesn't have page_cgroup, nothing to do.
> > - If the page is still charged as swap, just uncharge memory.
> > (This can happen under shmem/tmpfs.)
> > - If the page is not charged as swap, res.pages -= 1, res.swaps +=1.
> >
> > Swap-Freeing:
> > - if swap entry is charged, res.swaps -= 1.
> >
> > Almost all operations are done against SwapCache, which is Locked.
> >
> > This patch uses an array to remember the owner of swp_entry. Considering x86-32,we should avoid to use NORMAL memory and vmalloc() area too much. This patch
> > uses HIGHMEM to record information under kmap_atomic(KM_USER0). And information
> > is recored in 2 bytes per 1 swap page.
> > (memory controller's id is defined as smaller than unsigned short)
> >
> > Changelog: (preview) -> (v2)
> > - removed radix-tree. just use array.
> > - removed linked-list.
> > - use memcgroup_id rather than pointer.
> > - added force_empty (temporal) support.
> > This should be reworked in future. (But for now, this works well for us.)
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > ---
> > include/linux/swap.h | 38 +++++
> > init/Kconfig | 2
> > mm/memcontrol.c | 364 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> > mm/migrate.c | 7
> > mm/swap_state.c | 7
> > mm/swapfile.c | 14 +
> > 6 files changed, 422 insertions(+), 10 deletions(-)
> >
> > Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
> > ===================================================================
> > --- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
> > +++ mmtom-2.6.27-rc3+/mm/memcontrol.c
> > @@ -34,6 +34,10 @@
> > #include <linux/mm_inline.h>
> > #include <linux/pagemap.h>
> > #include <linux/page_cgroup.h>
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +#include <linux/swap.h>
> > +#include <linux/swapops.h>
> > +#endif
> >
> > #include <asm/uaccess.h>
> >
> > @@ -43,9 +47,28 @@ static struct kmem_cache *page_cgroup_ca
> > #define NR_MEMCGRP_ID (32767)
> >
> > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +
> > #define do_swap_account (1)
> > +
> > +static void
> > +swap_cgroup_delete_account(struct mem_cgroup *mem, struct page *page);
> > +
> > +static struct mem_cgroup *lookup_mem_cgroup_from_swap(struct page *page);
> > +static void swap_cgroup_clean_account(struct mem_cgroup *mem);
> > #else
> > #define do_swap_account (0)
> > +
> > +static void
> > +swap_cgroup_delete_account(struct mem_cgroup *mem, struct page *page)
> > +{
> > +}
> > +static struct mem_cgroup *lookup_mem_cgroup_from_swap(struct page *page)
> > +{
> > + return NULL;
> > +}
> > +static void swap_cgroup_clean_account(struct mem_cgroup *mem)
> > +{
> > +}
> > #endif
> >
> >
> > @@ -889,6 +912,9 @@ static int mem_cgroup_charge_common(stru
> > __mem_cgroup_add_list(mz, pc);
> > spin_unlock_irqrestore(&mz->lru_lock, flags);
> >
> > + /* We did swap-in, uncharge swap. */
> > + if (do_swap_account && PageSwapCache(page))
> > + swap_cgroup_delete_account(mem, page);
> > return 0;
> > out:
> > css_put(&mem->css);
> > @@ -899,6 +925,8 @@ err:
> >
> > int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> > {
> > + struct mem_cgroup *memcg = NULL;
> > +
> > if (mem_cgroup_subsys.disabled)
> > return 0;
> >
> > @@ -935,13 +963,19 @@ int mem_cgroup_charge(struct page *page,
> > }
> > rcu_read_unlock();
> > }
> > + /* Swap-in ? */
> > + if (do_swap_account && PageSwapCache(page))
> > + memcg = lookup_mem_cgroup_from_swap(page);
> > +
> > return mem_cgroup_charge_common(page, mm, gfp_mask,
> > - MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> > + MEM_CGROUP_CHARGE_TYPE_MAPPED, memcg);
> > }
> >
> > int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> > gfp_t gfp_mask)
> > {
> > + struct mem_cgroup *memcg = NULL;
> > +
> > if (mem_cgroup_subsys.disabled)
> > return 0;
> >
> > @@ -971,9 +1005,11 @@ int mem_cgroup_cache_charge(struct page
> >
> > if (unlikely(!mm))
> > mm = &init_mm;
> > + if (do_swap_account && PageSwapCache(page))
> > + memcg = lookup_mem_cgroup_from_swap(page);
> >
> > return mem_cgroup_charge_common(page, mm, gfp_mask,
> > - MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
> > + MEM_CGROUP_CHARGE_TYPE_CACHE, memcg);
> > }
> >
> > /*
> > @@ -998,9 +1034,11 @@ __mem_cgroup_uncharge_common(struct page
> >
> > VM_BUG_ON(pc->page != page);
> >
> > - if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> > - && ((PcgCache(pc) || page_mapped(page))))
> > - goto out;
> > + if ((ctype != MEM_CGROUP_CHARGE_TYPE_FORCE))
> > + if (PageSwapCache(page) || page_mapped(page) ||
> > + (page->mapping && !PageAnon(page)))
> > + goto out;
> > +
> > mem = pc->mem_cgroup;
> > SetPcgObsolete(pc);
> > page_assign_page_cgroup(page, NULL);
> > @@ -1577,6 +1615,8 @@ static void mem_cgroup_pre_destroy(struc
> > {
> > struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> > mem_cgroup_force_empty(mem);
> > + if (do_swap_account)
> > + swap_cgroup_clean_account(mem);
> > }
> >
> > static void mem_cgroup_destroy(struct cgroup_subsys *ss,
> > @@ -1635,3 +1675,317 @@ struct cgroup_subsys mem_cgroup_subsys =
> > .attach = mem_cgroup_move_task,
> > .early_init = 0,
> > };
> > +
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +/*
> > + * swap accounting infrastructure.
> > + */
> > +DEFINE_MUTEX(swap_cgroup_mutex);
> > +spinlock_t swap_cgroup_lock[MAX_SWAPFILES];
> > +struct page **swap_cgroup_map[MAX_SWAPFILES];
> > +unsigned long swap_cgroup_pages[MAX_SWAPFILES];
> > +
> > +
> > +/* This definition is based onf NR_MEM_CGROUP==32768 */
> > +struct swap_cgroup {
> > + unsigned short memcgrp_id:15;
> > + unsigned short count:1;
> > +};
> > +#define ENTS_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
> > +
> > +/*
> > + * Called from get_swap_ent().
> > + */
> > +int swap_cgroup_prepare(swp_entry_t ent, gfp_t mask)
> > +{
> > + struct page *page;
> > + unsigned long array_index = swp_offset(ent) / ENTS_PER_PAGE;
> > + int type = swp_type(ent);
> > + unsigned long flags;
> > +
> > + if (swap_cgroup_map[type][array_index])
> > + return 0;
> > + page = alloc_page(mask | __GFP_HIGHMEM | __GFP_ZERO);
> > + if (!page)
> > + return -ENOMEM;
> > + spin_lock_irqsave(&swap_cgroup_lock[type], flags);
> > + if (swap_cgroup_map[type][array_index] == NULL) {
> > + swap_cgroup_map[type][array_index] = page;
> > + page = NULL;
> > + }
> > + spin_unlock_irqrestore(&swap_cgroup_lock[type], flags);
> > +
> > + if (page)
> > + __free_page(page);
> > + return 0;
> > +}
> > +
> > +/**
> > + * swap_cgroup_record_info
> > + * @page ..... a page which is in some mem_cgroup.
> > + * @entry .... swp_entry of the page. (or old swp_entry of the page)
> > + * @delete ... if 0 add entry, if 1 remove entry.
> > + *
> > + * At set new value:
> > + * This is called from add_to_swap_cache() after added to swapper_space.
> > + * Then...this is called under page_lock() and this page is on radix-tree
> > + * We're safe to access page->page_cgroup->mem_cgroup.
> > + * This function never fails. (may leak information...but it's not Oops.)
> > + *
> > + * At delettion:
> > + * Returns count is set or not.
> > + */
> > +int swap_cgroup_record_info(struct page *page, swp_entry_t entry, bool del)
> > +{
> > + unsigned long flags;
> > + int type = swp_type(entry);
> > + unsigned long offset = swp_offset(entry);
> > + unsigned long array_index = offset/ENTS_PER_PAGE;
> > + unsigned long index = offset & (ENTS_PER_PAGE - 1);
> > + struct page *mappage;
> > + struct swap_cgroup *map;
> > + struct page_cgroup *pc = NULL;
> > + int ret = 0;
> > +
> > + if (!del) {
> > + /*
> > + * At swap-in, the page is added to swap cache before tied to
> > + * mem_cgroup. This page will be finally charged at page fault.
> > + * Ignore this at this point.
> > + */
> > + pc = page_get_page_cgroup(page);
> > + if (!pc)
> > + return ret;
> > + }
> > + if (!swap_cgroup_map[type])
> > + return ret;
> > + mappage = swap_cgroup_map[type][array_index];
> > + if (!mappage)
> > + return ret;
> > +
> > + local_irq_save(flags);
> > + map = kmap_atomic(mappage, KM_USER0);
> > + if (!del) {
> > + map[index].memcgrp_id = pc->mem_cgroup->memcgrp_id;
> > + map[index].count = 0;
> > + } else {
> > + if (map[index].count) {
> > + ret = map[index].memcgrp_id;
> > + map[index].count = 0;
> > + }
> > + map[index].memcgrp_id = 0;
> > + }
> > + kunmap_atomic(mappage, KM_USER0);
> > + local_irq_restore(flags);
> > + return ret;
> > +}
> > +
> > +/*
> > + * returns mem_cgroup pointer when swp_entry is assgiend to.
> > + */
> > +static struct mem_cgroup *swap_cgroup_lookup(swp_entry_t entry)
> > +{
> > + unsigned long flags;
> > + int type = swp_type(entry);
> > + unsigned long offset = swp_offset(entry);
> > + unsigned long array_index = offset/ENTS_PER_PAGE;
> > + unsigned long index = offset & (ENTS_PER_PAGE - 1);
> > + struct page *mappage;
> > + struct swap_cgroup *map;
> > + unsigned short id;
> > +
> > + if (!swap_cgroup_map[type])
> > + return NULL;
> > + mappage = swap_cgroup_map[type][array_index];
> > + if (!mappage)
> > + return NULL;
> > +
> > + local_irq_save(flags);
> > + map = kmap_atomic(mappage, KM_USER0);
> > + id = map[index].memcgrp_id;
> > + kunmap_atomic(mappage, KM_USER0);
> > + local_irq_restore(flags);
> > + return mem_cgroup_id_lookup(id);
> > +}
> > +
> > +static struct mem_cgroup *lookup_mem_cgroup_from_swap(struct page *page)
> > +{
> > + swp_entry_t entry = { .val = page_private(page) };
> > + return swap_cgroup_lookup(entry);
> > +}
> > +
> > +/*
> > + * set/clear accounting information of swap_cgroup.
> > + *
> > + * Called when set/clear accounting information.
> > + * returns 1 at success.
> > + */
> > +static int swap_cgroup_account(struct mem_cgroup *memcg,
> > + swp_entry_t entry, bool set)
> > +{
> > + unsigned long flags;
> > + int type = swp_type(entry);
> > + unsigned long offset = swp_offset(entry);
> > + unsigned long array_index = offset/ENTS_PER_PAGE;
> > + unsigned long index = offset & (ENTS_PER_PAGE - 1);
> > + struct page *mappage;
> > + struct swap_cgroup *map;
> > + int ret = 0;
> > +
> > + if (!swap_cgroup_map[type])
> > + return ret;
> > + mappage = swap_cgroup_map[type][array_index];
> > + if (!mappage)
> > + return ret;
> > +
> > +
> > + local_irq_save(flags);
> > + map = kmap_atomic(mappage, KM_USER0);
> > + if (map[index].memcgrp_id == memcg->memcgrp_id) {
> > + if (set && map[index].count == 0) {
> > + map[index].count = 1;
> > + ret = 1;
> > + } else if (!set && map[index].count == 1) {
> > + map[index].count = 0;
> > + ret = 1;
> > + }
> > + }
> > + kunmap_atomic(mappage, KM_USER0);
> > + local_irq_restore(flags);
> > + return ret;
> > +}
> > +
> > +void swap_cgroup_delete_account(struct mem_cgroup *mem, struct page *page)
> > +{
> > + swp_entry_t val = { .val = page_private(page) };
> > + if (swap_cgroup_account(mem, val, false))
> > + mem_counter_uncharge_swap(mem);
> > +}
> > +
> > +/*
> > + * Called from delete_from_swap_cache(); the page is locked and the
> > + * swp_entry is still in use.
> > + */
> > +void swap_cgroup_delete_swapcache(struct page *page, swp_entry_t entry)
> > +{
> > + struct page_cgroup *pc;
> > +
> > + pc = page_get_page_cgroup(page);
> > + /* swap-in but not mapped. */
> > + if (!pc)
> > + return;
> > +
> > + if (swap_cgroup_account(pc->mem_cgroup, entry, true))
> > + __mem_cgroup_uncharge_common(page,
> > + MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
> > + else if (page->mapping && !PageAnon(page))
> > + __mem_cgroup_uncharge_common(page,
> > + MEM_CGROUP_CHARGE_TYPE_CACHE);
> > + else
> > + __mem_cgroup_uncharge_common(page,
> > + MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > + return;
> > +}
> > +
> > +void swap_cgroup_delete_swap(swp_entry_t entry)
> > +{
> > + int ret;
> > + struct mem_cgroup *mem;
> > +
> > + ret = swap_cgroup_record_info(NULL, entry, true);
> > + if (ret) {
> > + mem = mem_cgroup_id_lookup(ret);
> > + if (mem)
> > + mem_counter_uncharge_swap(mem);
> > + }
> > +}
> > +
> > +
> > +/*
> > + * Forget all accounting under swap_cgroup for this memcg.
> > + * Called from the destroy path.
> > + */
> > +static void swap_cgroup_clean_account(struct mem_cgroup *memcg)
> > +{
> > + int type;
> > + unsigned long array_index, flags;
> > + int index;
> > + struct page *page;
> > + struct swap_cgroup *map;
> > +
> > + if (!memcg->res.swaps)
> > + return;
> > + mutex_lock(&swap_cgroup_mutex);
> > + for (type = 0; type < MAX_SWAPFILES; type++) {
> > + if (swap_cgroup_pages[type] == 0)
> > + continue;
> > + for (array_index = 0;
> > + array_index < swap_cgroup_pages[type];
> > + array_index++) {
> > + page = swap_cgroup_map[type][array_index];
> > + if (!page)
> > + continue;
> > + local_irq_save(flags);
> > + map = kmap_atomic(page, KM_USER0);
> > + for (index = 0; index < ENTS_PER_PAGE; index++) {
> > + if (map[index].memcgrp_id
> > + == memcg->memcgrp_id) {
> > + map[index].memcgrp_id = 0;
> > + map[index].count = 0;
> > + }
> > + }
> > + kunmap_atomic(page, KM_USER0);
> > + local_irq_restore(flags);
> > + }
> > + mutex_unlock(&swap_cgroup_mutex);
> > + yield();
> > + mutex_lock(&swap_cgroup_mutex);
> > + }
> > + mutex_unlock(&swap_cgroup_mutex);
> > +}
> > +
> > +/*
> > + * called from swapon().
> > + */
> > +int swap_cgroup_swapon(int type, unsigned long max_pages)
> > +{
> > + void *array;
> > + int array_size;
> > +
> > + VM_BUG_ON(swap_cgroup_map[type]);
> > +
> > + array_size = ((max_pages/ENTS_PER_PAGE) + 1) * sizeof(void *);
> > +
> > + array = vmalloc(array_size);
> > + if (!array) {
> > + printk("swap %d will not be accounted\n", type);
> > + return -ENOMEM;
> > + }
> > + memset(array, 0, array_size);
> > + mutex_lock(&swap_cgroup_mutex);
> > + swap_cgroup_pages[type] = (max_pages/ENTS_PER_PAGE + 1);
> > + swap_cgroup_map[type] = array;
> > + mutex_unlock(&swap_cgroup_mutex);
> > + spin_lock_init(&swap_cgroup_lock[type]);
> > + return 0;
> > +}
> > +
> > +/*
> > + * called from swapoff().
> > + */
> > +void swap_cgroup_swapoff(int type)
> > +{
> > + int i;
> > + for (i = 0; i < swap_cgroup_pages[type]; i++) {
> > + struct page *page = swap_cgroup_map[type][i];
> > + if (page)
> > + __free_page(page);
> > + }
> > + mutex_lock(&swap_cgroup_mutex);
> > + vfree(swap_cgroup_map[type]);
> > + swap_cgroup_map[type] = NULL;
> > + mutex_unlock(&swap_cgroup_mutex);
> > + swap_cgroup_pages[type] = 0;
> > +}
> > +
> > +#endif
> > Index: mmtom-2.6.27-rc3+/include/linux/swap.h
> > ===================================================================
> > --- mmtom-2.6.27-rc3+.orig/include/linux/swap.h
> > +++ mmtom-2.6.27-rc3+/include/linux/swap.h
> > @@ -335,6 +335,44 @@ static inline void disable_swap_token(vo
> > put_swap_token(swap_token_mm);
> > }
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +extern int swap_cgroup_swapon(int type, unsigned long max_pages);
> > +extern void swap_cgroup_swapoff(int type);
> > +extern void swap_cgroup_delete_swap(swp_entry_t entry);
> > +extern int swap_cgroup_prepare(swp_entry_t ent, gfp_t mask);
> > +extern int swap_cgroup_record_info(struct page *, swp_entry_t ent, bool del);
> > +extern void swap_cgroup_delete_swapcache(struct page *page, swp_entry_t entry);
> > +
> > +#else
> > +static inline int swap_cgroup_swapon(int type, unsigned long max_pages)
> > +{
> > + return 0;
> > +}
> > +static inline void swap_cgroup_swapoff(int type)
> > +{
> > + return;
> > +}
> > +static inline void swap_cgroup_delete_swap(swp_entry_t entry)
> > +{
> > + return;
> > +}
> > +static inline int swap_cgroup_prepare(swp_entry_t ent, gfp_t mask)
> > +{
> > + return 0;
> > +}
> > +static inline int
> > + swap_cgroup_record_info(struct page *page, swp_entry_t ent, bool del)
> > +{
> > + return 0;
> > +}
> > +static inline
> > +void swap_cgroup_delete_swapcache(struct page *page, swp_entry_t entry)
> > +{
> > + return;
> > +}
> > +#endif
> > +
> > +
> > #else /* CONFIG_SWAP */
> >
> > #define total_swap_pages 0
> > Index: mmtom-2.6.27-rc3+/mm/swapfile.c
> > ===================================================================
> > --- mmtom-2.6.27-rc3+.orig/mm/swapfile.c
> > +++ mmtom-2.6.27-rc3+/mm/swapfile.c
> > @@ -270,8 +270,9 @@ out:
> > return NULL;
> > }
> >
> > -static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
> > +static int swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> > {
> > + unsigned long offset = swp_offset(entry);
> > int count = p->swap_map[offset];
> >
> > if (count < SWAP_MAP_MAX) {
> > @@ -286,6 +287,7 @@ static int swap_entry_free(struct swap_i
> > swap_list.next = p - swap_info;
> > nr_swap_pages++;
> > p->inuse_pages--;
> > + swap_cgroup_delete_swap(entry);
> > }
> > }
> > return count;
> > @@ -301,7 +303,7 @@ void swap_free(swp_entry_t entry)
> >
> > p = swap_info_get(entry);
> > if (p) {
> > - swap_entry_free(p, swp_offset(entry));
> > + swap_entry_free(p, entry);
> > spin_unlock(&swap_lock);
> > }
> > }
> > @@ -420,7 +422,7 @@ void free_swap_and_cache(swp_entry_t ent
> >
> > p = swap_info_get(entry);
> > if (p) {
> > - if (swap_entry_free(p, swp_offset(entry)) == 1) {
> > + if (swap_entry_free(p, entry) == 1) {
> > page = find_get_page(&swapper_space, entry.val);
> > if (page && !trylock_page(page)) {
> > page_cache_release(page);
> > @@ -1343,6 +1345,7 @@ asmlinkage long sys_swapoff(const char _
> > spin_unlock(&swap_lock);
> > mutex_unlock(&swapon_mutex);
> > vfree(swap_map);
> > + swap_cgroup_swapoff(type);
> > inode = mapping->host;
> > if (S_ISBLK(inode->i_mode)) {
> > struct block_device *bdev = I_BDEV(inode);
> > @@ -1669,6 +1672,11 @@ asmlinkage long sys_swapon(const char __
> > 1 /* header page */;
> > if (error)
> > goto bad_swap;
> > +
> > + if (swap_cgroup_swapon(type, maxpages)) {
> > + printk("We don't enable swap accounting because of "
> > + "memory shortage\n");
> > + }
> > }
> >
> > if (nr_good_pages) {
> > Index: mmtom-2.6.27-rc3+/mm/swap_state.c
> > ===================================================================
> > --- mmtom-2.6.27-rc3+.orig/mm/swap_state.c
> > +++ mmtom-2.6.27-rc3+/mm/swap_state.c
> > @@ -76,6 +76,9 @@ int add_to_swap_cache(struct page *page,
> > BUG_ON(PageSwapCache(page));
> > BUG_ON(PagePrivate(page));
> > BUG_ON(!PageSwapBacked(page));
> > + error = swap_cgroup_prepare(entry, gfp_mask);
> > + if (error)
> > + return error;
> > error = radix_tree_preload(gfp_mask);
> > if (!error) {
> > page_cache_get(page);
> > @@ -89,6 +92,7 @@ int add_to_swap_cache(struct page *page,
> > total_swapcache_pages++;
> > __inc_zone_page_state(page, NR_FILE_PAGES);
> > INC_CACHE_INFO(add_total);
> > + swap_cgroup_record_info(page, entry, false);
> > }
> > spin_unlock_irq(&swapper_space.tree_lock);
> > radix_tree_preload_end();
> > @@ -108,6 +112,8 @@ int add_to_swap_cache(struct page *page,
> > */
> > void __delete_from_swap_cache(struct page *page)
> > {
> > + swp_entry_t entry = { .val = page_private(page) };
> > +
> > BUG_ON(!PageLocked(page));
> > BUG_ON(!PageSwapCache(page));
> > BUG_ON(PageWriteback(page));
> > @@ -117,6 +123,7 @@ void __delete_from_swap_cache(struct pag
> > set_page_private(page, 0);
> > ClearPageSwapCache(page);
> > total_swapcache_pages--;
> > + swap_cgroup_delete_swapcache(page, entry);
> > __dec_zone_page_state(page, NR_FILE_PAGES);
> > INC_CACHE_INFO(del_total);
> > }
> > Index: mmtom-2.6.27-rc3+/init/Kconfig
> > ===================================================================
> > --- mmtom-2.6.27-rc3+.orig/init/Kconfig
> > +++ mmtom-2.6.27-rc3+/init/Kconfig
> > @@ -416,7 +416,7 @@ config CGROUP_MEM_RES_CTLR
> > could in turn add some fork/exit overhead.
> >
> > config CGROUP_MEM_RES_CTLR_SWAP
> > - bool "Memory Resource Controller Swap Extension (Broken)"
> > + bool "Memory Resource Controller Swap Extension (EXPERIMENTAL)"
> > depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> > help
> > Add swap management feature to memory resource controller. By this,
> > Index: mmtom-2.6.27-rc3+/mm/migrate.c
> > ===================================================================
> > --- mmtom-2.6.27-rc3+.orig/mm/migrate.c
> > +++ mmtom-2.6.27-rc3+/mm/migrate.c
> > @@ -339,6 +339,8 @@ static int migrate_page_move_mapping(str
> > */
> > static void migrate_page_copy(struct page *newpage, struct page *page)
> > {
> > + int was_swapcache = 0;
> > +
> > copy_highpage(newpage, page);
> >
> > if (PageError(page))
> > @@ -372,14 +374,17 @@ static void migrate_page_copy(struct pag
> > mlock_migrate_page(newpage, page);
> >
> > #ifdef CONFIG_SWAP
> > + was_swapcache = PageSwapCache(page);
> > ClearPageSwapCache(page);
> > #endif
> > ClearPagePrivate(page);
> > set_page_private(page, 0);
> > /* page->mapping contains a flag for PageAnon() */
> > if (PageAnon(page)) {
> > - /* This page is uncharged at try_to_unmap(). */
> > + /* This page is uncharged at try_to_unmap() if not SwapCache. */
> > page->mapping = NULL;
> > + if (was_swapcache)
> > + mem_cgroup_uncharge_page(page);
> > } else {
> > /* Obsolete file cache should be uncharged */
> > page->mapping = NULL;
> >
>
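As a worked example of the swap-entry indexing used by swap_cgroup_record_info()
and swap_cgroup_lookup() above, here is a small standalone sketch. The value of
ENTS_PER_PAGE is an assumption (PAGE_SIZE / sizeof(struct swap_cgroup), taken as
4096/4 = 1024 here), and the offset value is arbitrary.

#include <stdio.h>

#define ENTS_PER_PAGE 1024UL  /* assumed: PAGE_SIZE / sizeof(struct swap_cgroup) */

int main(void)
{
	unsigned long offset = 150000;                       /* swp_offset(entry) */
	unsigned long array_index = offset / ENTS_PER_PAGE;  /* which map page: 146 */
	unsigned long index = offset & (ENTS_PER_PAGE - 1);  /* slot in that page: 496 */

	printf("offset %lu -> map page %lu, slot %lu\n", offset, array_index, index);
	return 0;
}

Note that the mask form only matches the division because ENTS_PER_PAGE is a
power of two.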
* Re: [RFC][PATCH 14/14]memcg: mem+swap accounting
2008-09-01 7:58 ` KAMEZAWA Hiroyuki
@ 2008-09-01 8:53 ` Daisuke Nishimura
2008-09-01 9:53 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 61+ messages in thread
From: Daisuke Nishimura @ 2008-09-01 8:53 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir
On Mon, 1 Sep 2008 16:58:27 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 1 Sep 2008 16:15:01 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>
> > Hi, Kamezawa-san.
> >
> > I'm testing these patches on mmotm-2008-08-29-01-08
> > (with some trivial fixes I've reported and some debug codes),
This problem happens on the kernel without debug codes I added.
> > but swap_in_bytes sometimes becomes very huge(it seems that
> > over uncharge is happening..) and I can see OOM
> > if I've set memswap_limit.
> >
> > I'm digging this now, but have you also ever seen it?
> >
> I didn't see that.
I see, thanks.
> But, as you say, maybe over-uncharge. Hmm..
> What kind of test ? Just use swap ? and did you use shmem or tmpfs ?
>
I don't do anything special, and this can happen without shmem/tmpfs
(can happen with shmem/tmpfs, too).
For example:
- make swap out/in activity for a while(I used page01 of ltp).
- stop the test.
[root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
4096
- swapoff
[root@localhost ~]# swapoff -a
[root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
18446744073709395968
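(For reference: 18446744073709395968 = 2^64 - 155648, so the unsigned 64-bit
counter appears to have been uncharged 155648 bytes, i.e. 38 pages of 4096 bytes,
past zero.)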
Thanks,
Daisuke Nishimura.
* Re: [RFC][PATCH 14/14]memcg: mem+swap accounting
2008-09-01 8:53 ` Daisuke Nishimura
@ 2008-09-01 9:53 ` KAMEZAWA Hiroyuki
2008-09-01 10:21 ` Daisuke Nishimura
` (2 more replies)
0 siblings, 3 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-01 9:53 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir
On Mon, 1 Sep 2008 17:53:02 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Mon, 1 Sep 2008 16:58:27 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 1 Sep 2008 16:15:01 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> >
> > > Hi, Kamezawa-san.
> > >
> > > I'm testing these patches on mmotm-2008-08-29-01-08
> > > (with some trivial fixes I've reported and some debug codes),
> This problem happens on the kernel without debug codes I added.
>
> > > but swap_in_bytes sometimes becomes very huge(it seems that
> > > over uncharge is happening..) and I can see OOM
> > > if I've set memswap_limit.
> > >
> > > I'm digging this now, but have you also ever seen it?
> > >
> > I didn't see that.
> I see, thanks.
>
> > But, as you say, maybe over-uncharge. Hmm..
> > What kind of test ? Just use swap ? and did you use shmem or tmpfs ?
> >
> I don't do anything special, and this can happen without shmem/tmpfs
> (can happen with shmem/tmpfs, too).
>
> For example:
>
> - make swap out/in activity for a while(I used page01 of ltp).
> - stop the test.
>
> [root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
> 4096
>
> - swapoff
>
> [root@localhost ~]# swapoff -a
> [root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
> 18446744073709395968
>
>
Hmm ? can happen without swapoff ?
It seems "accounted" flag is on by mistake.
Maybe I missed some...but thank you. I'll try.
Thanks,
-Kame
* Re: [RFC][PATCH 14/14]memcg: mem+swap accounting
2008-09-01 9:53 ` KAMEZAWA Hiroyuki
@ 2008-09-01 10:21 ` Daisuke Nishimura
2008-09-02 2:21 ` Daisuke Nishimura
2008-09-02 11:09 ` Daisuke Nishimura
2 siblings, 0 replies; 61+ messages in thread
From: Daisuke Nishimura @ 2008-09-01 10:21 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir
On Mon, 1 Sep 2008 18:53:47 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 1 Sep 2008 17:53:02 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>
> > On Mon, 1 Sep 2008 16:58:27 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Mon, 1 Sep 2008 16:15:01 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > >
> > > > Hi, Kamezawa-san.
> > > >
> > > > I'm testing these patches on mmotm-2008-08-29-01-08
> > > > (with some trivial fixes I've reported and some debug codes),
> > This problem happens on the kernel without debug codes I added.
> >
> > > > but swap_in_bytes sometimes becomes very huge(it seems that
> > > > over uncharge is happening..) and I can see OOM
> > > > if I've set memswap_limit.
> > > >
> > > > I'm digging this now, but have you also ever seen it?
> > > >
> > > I didn't see that.
> > I see, thanks.
> >
> > > But, as you say, maybe over-uncharge. Hmm..
> > > What kind of test ? Just use swap ? and did you use shmem or tmpfs ?
> > >
> > I don't do anything special, and this can happen without shmem/tmpfs
> > (can happen with shmem/tmpfs, too).
> >
> > For example:
> >
> > - make swap out/in activity for a while(I used page01 of ltp).
> > - stop the test.
> >
> > [root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
> > 4096
> >
> > - swapoff
> >
> > [root@localhost ~]# swapoff -a
> > [root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
> > 18446744073709395968
> >
> >
> Hmm ? can happen without swapoff ?
> It seems "accounted" flag is on by mistake.
>
Yes.
I used the example above just to show over-uncharging is happening.
Actually, I've not yet seen OOM when running only page01,
but I saw OOM when I ran page01 and shmem_test_02 at the same time.
Below is the log showing the usage periodically when I got OOM.
----- Mon Sep 1 17:38:00 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
20480
### /cgroup/memory/01/memory.swap_in_bytes ###
0
----- Mon Sep 1 17:38:01 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
20480
### /cgroup/memory/01/memory.swap_in_bytes ###
0
----- Mon Sep 1 17:38:03 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
20480
### /cgroup/memory/01/memory.swap_in_bytes ###
0
----- Mon Sep 1 17:38:04 JST 2008 ----- <- start test
### /cgroup/memory/01/memory.usage_in_bytes ###
9269248
### /cgroup/memory/01/memory.swap_in_bytes ###
0
----- Mon Sep 1 17:38:06 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
33546240
### /cgroup/memory/01/memory.swap_in_bytes ###
921600
----- Mon Sep 1 17:38:08 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
33464320
### /cgroup/memory/01/memory.swap_in_bytes ###
11104256
----- Mon Sep 1 17:38:09 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
33087488
### /cgroup/memory/01/memory.swap_in_bytes ###
9048064
----- Mon Sep 1 17:38:11 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
33304576
### /cgroup/memory/01/memory.swap_in_bytes ###
5992448
----- Mon Sep 1 17:38:12 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
33517568
### /cgroup/memory/01/memory.swap_in_bytes ###
3706880
----- Mon Sep 1 17:38:14 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
33427456
### /cgroup/memory/01/memory.swap_in_bytes ###
1368064
----- Mon Sep 1 17:38:16 JST 2008 ----- <- over-uncharge
### /cgroup/memory/01/memory.usage_in_bytes ###
33312768
### /cgroup/memory/01/memory.swap_in_bytes ###
18446744073696366592
----- Mon Sep 1 17:38:17 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
33464320
### /cgroup/memory/01/memory.swap_in_bytes ###
18446744073694580736
----- Mon Sep 1 17:38:19 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
33542144
### /cgroup/memory/01/memory.swap_in_bytes ###
18446744073696337920
----- Mon Sep 1 17:38:21 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
33480704
### /cgroup/memory/01/memory.swap_in_bytes ###
18446744073696149504
----- Mon Sep 1 17:38:22 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
30011392
### /cgroup/memory/01/memory.swap_in_bytes ###
18446744073684623360
----- Mon Sep 1 17:38:26 JST 2008 ----- <- got OOM
### /cgroup/memory/01/memory.usage_in_bytes ###
0
### /cgroup/memory/01/memory.swap_in_bytes ###
18446744073675382784
----- Mon Sep 1 17:38:28 JST 2008 -----
### /cgroup/memory/01/memory.usage_in_bytes ###
0
### /cgroup/memory/01/memory.swap_in_bytes ###
18446744073675382784
Thanks,
Daisuke Nishimura.
* Re: [RFC][PATCH 14/14]memcg: mem+swap accounting
2008-09-01 9:53 ` KAMEZAWA Hiroyuki
2008-09-01 10:21 ` Daisuke Nishimura
@ 2008-09-02 2:21 ` Daisuke Nishimura
2008-09-02 11:09 ` Daisuke Nishimura
2 siblings, 0 replies; 61+ messages in thread
From: Daisuke Nishimura @ 2008-09-02 2:21 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir
On Mon, 1 Sep 2008 18:53:47 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 1 Sep 2008 17:53:02 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>
> > On Mon, 1 Sep 2008 16:58:27 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Mon, 1 Sep 2008 16:15:01 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > >
> > > > Hi, Kamezawa-san.
> > > >
> > > > I'm testing these patches on mmotm-2008-08-29-01-08
> > > > (with some trivial fixes I've reported and some debug codes),
> > This problem happens on the kernel without debug codes I added.
> >
> > > > but swap_in_bytes sometimes becomes very huge(it seems that
> > > > over uncharge is happening..) and I can see OOM
> > > > if I've set memswap_limit.
> > > >
> > > > I'm digging this now, but have you also ever seen it?
> > > >
> > > I didn't see that.
> > I see, thanks.
> >
> > > But, as you say, maybe over-uncharge. Hmm..
> > > What kind of test ? Just use swap ? and did you use shmem or tmpfs ?
> > >
> > I don't do anything special, and this can happen without shmem/tmpfs
> > (can happen with shmem/tmpfs, too).
> >
> > For example:
> >
> > - make swap out/in activity for a while(I used page01 of ltp).
> > - stop the test.
> >
> > [root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
> > 4096
> >
> > - swapoff
> >
> > [root@localhost ~]# swapoff -a
> > [root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
> > 18446744073709395968
> >
> >
> Hmm ? can happen without swapoff ?
> It seems "accounted" flag is on by mistake.
>
>
Just FYI, this over-uncharge at swapoff seems to happen only when
a process (bash) remains in the cgroup.
I'll dig more.
Thanks,
Daisuke Nishimura.
* Re: [RFC][PATCH 14/14]memcg: mem+swap accounting
2008-09-01 9:53 ` KAMEZAWA Hiroyuki
2008-09-01 10:21 ` Daisuke Nishimura
2008-09-02 2:21 ` Daisuke Nishimura
@ 2008-09-02 11:09 ` Daisuke Nishimura
2008-09-02 11:40 ` KAMEZAWA Hiroyuki
2 siblings, 1 reply; 61+ messages in thread
From: Daisuke Nishimura @ 2008-09-02 11:09 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir
On Mon, 1 Sep 2008 18:53:47 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 1 Sep 2008 17:53:02 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>
> > On Mon, 1 Sep 2008 16:58:27 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Mon, 1 Sep 2008 16:15:01 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > >
> > > > Hi, Kamezawa-san.
> > > >
> > > > I'm testing these patches on mmotm-2008-08-29-01-08
> > > > (with some trivial fixes I've reported and some debug codes),
> > This problem happens on the kernel without debug codes I added.
> >
> > > > but swap_in_bytes sometimes becomes very huge(it seems that
> > > > over uncharge is happening..) and I can see OOM
> > > > if I've set memswap_limit.
> > > >
> > > > I'm digging this now, but have you also ever seen it?
> > > >
> > > I didn't see that.
> > I see, thanks.
> >
> > > But, as you say, maybe over-uncharge. Hmm..
> > > What kind of test ? Just use swap ? and did you use shmem or tmpfs ?
> > >
> > I don't do anything special, and this can happen without shmem/tmpfs
> > (can happen with shmem/tmpfs, too).
> >
> > For example:
> >
> > - make swap out/in activity for a while(I used page01 of ltp).
> > - stop the test.
> >
> > [root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
> > 4096
> >
> > - swapoff
> >
> > [root@localhost ~]# swapoff -a
> > [root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
> > 18446744073709395968
> >
> >
> Hmm ? can happen without swapoff ?
> It seems "accounted" flag is on by mistake.
>
I found the cause of this problem.
If __mem_cgroup_uncharge_common() in __swap_cgroup_delete_swapcache() fails,
res.swaps is not incremented while swap_cgroup.count remains set.
This causes over-uncharging of res.swaps later.
This patch fixes the problem (it has worked well so far).
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 59ad6d8..ab62a95 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1015,14 +1015,15 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *m
/*
* uncharge if !page_mapped(page)
*/
-static void
+static int
__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
{
struct page_cgroup *pc;
struct mem_cgroup *mem;
+ int ret = 0;
if (mem_cgroup_subsys.disabled)
- return;
+ return ret;
/*
* Check if our page_cgroup is valid
@@ -1039,6 +1040,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type cty
(page->mapping && !PageAnon(page)))
goto out;
+ ret = 1;
mem = pc->mem_cgroup;
SetPcgObsolete(pc);
page_assign_page_cgroup(page, NULL);
@@ -1051,7 +1053,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type cty
out:
rcu_read_unlock();
- return;
+ return ret;
}
void mem_cgroup_uncharge_page(struct page *page)
@@ -1869,9 +1871,10 @@ void swap_cgroup_delete_swapcache(struct page *page, swp_entry_t en
if (!pc)
return;
- if (swap_cgroup_account(pc->mem_cgroup, entry, true))
- __mem_cgroup_uncharge_common(page,
- MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+ if (swap_cgroup_account(pc->mem_cgroup, entry, true)
+ && !__mem_cgroup_uncharge_common(page,
+ MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
+ WARN_ON(!swap_cgroup_account(pc->mem_cgroup, entry, false));
else if (page->mapping && !PageAnon(page))
__mem_cgroup_uncharge_common(page,
MEM_CGROUP_CHARGE_TYPE_CACHE);
@@ -1889,8 +1892,8 @@ void swap_cgroup_delete_swap(swp_entry_t entry)
ret = swap_cgroup_record_info(NULL, entry, true);
if (ret) {
mem = mem_cgroup_id_lookup(ret);
- if (mem)
- mem_counter_uncharge_swap(mem);
+ VM_BUG_ON(!mem);
+ mem_counter_uncharge_swap(mem);
}
}
* Re: [RFC][PATCH 14/14]memcg: mem+swap accounting
2008-09-02 11:09 ` Daisuke Nishimura
@ 2008-09-02 11:40 ` KAMEZAWA Hiroyuki
2008-09-03 6:23 ` Daisuke Nishimura
0 siblings, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-02 11:40 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir
On Tue, 2 Sep 2008 20:09:05 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Mon, 1 Sep 2008 18:53:47 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 1 Sep 2008 17:53:02 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> >
> > > On Mon, 1 Sep 2008 16:58:27 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > On Mon, 1 Sep 2008 16:15:01 +0900
> > > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > >
> > > > > Hi, Kamezawa-san.
> > > > >
> > > > > I'm testing these patches on mmotm-2008-08-29-01-08
> > > > > (with some trivial fixes I've reported and some debug codes),
> > > This problem happens on the kernel without debug codes I added.
> > >
> > > > > but swap_in_bytes sometimes becomes very huge(it seems that
> > > > > over uncharge is happening..) and I can see OOM
> > > > > if I've set memswap_limit.
> > > > >
> > > > > I'm digging this now, but have you also ever seen it?
> > > > >
> > > > I didn't see that.
> > > I see, thanks.
> > >
> > > > But, as you say, maybe over-uncharge. Hmm..
> > > > What kind of test ? Just use swap ? and did you use shmem or tmpfs ?
> > > >
> > > I don't do anything special, and this can happen without shmem/tmpfs
> > > (can happen with shmem/tmpfs, too).
> > >
> > > For example:
> > >
> > > - make swap out/in activity for a while(I used page01 of ltp).
> > > - stop the test.
> > >
> > > [root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
> > > 4096
> > >
> > > - swapoff
> > >
> > > [root@localhost ~]# swapoff -a
> > > [root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
> > > 18446744073709395968
> > >
> > >
> > Hmm ? can happen without swapoff ?
> > It seems "accounted" flag is on by mistake.
> >
> I found the cause of this problem.
>
> If __mem_cgroup_uncharge_common() in __swap_cgroup_delete_swapcache() fails,
> res.swaps would not be incremented while the swap_cgroup.count remains on.
> This causes over-uncharging of res.swaps.
>
> This patch fixes this problem(it works well so far).
>
Oh, thanks. But it seems to be an unexpected situation... hmm.
I think I misunderstood some calling sequence...
Maybe like this:
swap_cgroup_set_account()
-> mem_cgroup_uncharge()
-> the page is mapped ... no uncharge here.
-> then, res.page and res.swaps are not changed.
-> we should mark the swap account as false.
Anyway, thank you! I'll consider this again.
Thanks,
-Kame
>
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
>
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 59ad6d8..ab62a95 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1015,14 +1015,15 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *m
> /*
> * uncharge if !page_mapped(page)
> */
> -static void
> +static int
> __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> {
> struct page_cgroup *pc;
> struct mem_cgroup *mem;
> + int ret = 0;
>
> if (mem_cgroup_subsys.disabled)
> - return;
> + return ret;
>
> /*
> * Check if our page_cgroup is valid
> @@ -1039,6 +1040,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type cty
> (page->mapping && !PageAnon(page)))
> goto out;
>
> + ret = 1;
> mem = pc->mem_cgroup;
> SetPcgObsolete(pc);
> page_assign_page_cgroup(page, NULL);
> @@ -1051,7 +1053,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type cty
>
> out:
> rcu_read_unlock();
> - return;
> + return ret;
> }
>
> void mem_cgroup_uncharge_page(struct page *page)
> @@ -1869,9 +1871,10 @@ void swap_cgroup_delete_swapcache(struct page *page, swp_entry_t en
> if (!pc)
> return;
>
> - if (swap_cgroup_account(pc->mem_cgroup, entry, true))
> - __mem_cgroup_uncharge_common(page,
> - MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
> + if (swap_cgroup_account(pc->mem_cgroup, entry, true)
> + && !__mem_cgroup_uncharge_common(page,
> + MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> + WARN_ON(!swap_cgroup_account(pc->mem_cgroup, entry, false));
> else if (page->mapping && !PageAnon(page))
> __mem_cgroup_uncharge_common(page,
> MEM_CGROUP_CHARGE_TYPE_CACHE);
> @@ -1889,8 +1892,8 @@ void swap_cgroup_delete_swap(swp_entry_t entry)
> ret = swap_cgroup_record_info(NULL, entry, true);
> if (ret) {
> mem = mem_cgroup_id_lookup(ret);
> - if (mem)
> - mem_counter_uncharge_swap(mem);
> + VM_BUG_ON(!mem);
> + mem_counter_uncharge_swap(mem);
> }
> }
>
>
>
>
* Re: [RFC][PATCH 14/14]memcg: mem+swap accounting
2008-09-02 11:40 ` KAMEZAWA Hiroyuki
@ 2008-09-03 6:23 ` Daisuke Nishimura
2008-09-03 7:05 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 61+ messages in thread
From: Daisuke Nishimura @ 2008-09-03 6:23 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir
On Tue, 2 Sep 2008 20:40:53 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 2 Sep 2008 20:09:05 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>
> > On Mon, 1 Sep 2008 18:53:47 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Mon, 1 Sep 2008 17:53:02 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > >
> > > > On Mon, 1 Sep 2008 16:58:27 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > On Mon, 1 Sep 2008 16:15:01 +0900
> > > > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > > >
> > > > > > Hi, Kamezawa-san.
> > > > > >
> > > > > > I'm testing these patches on mmotm-2008-08-29-01-08
> > > > > > (with some trivial fixes I've reported and some debug codes),
> > > > This problem happens on the kernel without debug codes I added.
> > > >
> > > > > > but swap_in_bytes sometimes becomes very huge(it seems that
> > > > > > over uncharge is happening..) and I can see OOM
> > > > > > if I've set memswap_limit.
> > > > > >
> > > > > > I'm digging this now, but have you also ever seen it?
> > > > > >
> > > > > I didn't see that.
> > > > I see, thanks.
> > > >
> > > > > But, as you say, maybe over-uncharge. Hmm..
> > > > > What kind of test ? Just use swap ? and did you use shmem or tmpfs ?
> > > > >
> > > > I don't do anything special, and this can happen without shmem/tmpfs
> > > > (can happen with shmem/tmpfs, too).
> > > >
> > > > For example:
> > > >
> > > > - make swap out/in activity for a while(I used page01 of ltp).
> > > > - stop the test.
> > > >
> > > > [root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
> > > > 4096
> > > >
> > > > - swapoff
> > > >
> > > > [root@localhost ~]# swapoff -a
> > > > [root@localhost ~]# cat /cgroup/memory/01/memory.swap_in_bytes
> > > > 18446744073709395968
> > > >
> > > >
> > > Hmm ? can happen without swapoff ?
> > > It seems "accounted" flag is on by mistake.
> > >
> > I found the cause of this problem.
> >
> > If __mem_cgroup_uncharge_common() in __swap_cgroup_delete_swapcache() fails,
> > res.swaps would not be incremented while the swap_cgroup.count remains on.
> > This causes over-uncharging of res.swaps.
> >
> > This patch fixes this problem(it works well so far).
> >
> Oh, thanks. But it seems unexpected situation...hmm
> I think I misunderstand some calling sequence...
>
> maybe like this.
> swap_cgroup_set_account()
> -> mem_cgroup_uncharge()
> -> the page is mapped ..no uncharge here.
> -> then, res.page, res.swaps is not changed.
> -> we should mark swap account as false.
>
> Anyway, thank you! I'll consider this again.
>
I added some debug code to show why the uncharge fails there.
It showed that the cause of uncharge failure at swapoff was that
it tried to free a mapped page, as you said.
I think this can happen in the sequence below:
try_to_unuse()
unuse_mm()
...
unuse_pte()
mem_cgroup_charge()
swap_cgroup_delete_account()
- swap_cgroup->count is turned off.
page_add_anon_rmap()
- map page.
...
delete_from_swap_cache()
swap_cgroup_delete_swapcache()
- turns swap_cgroup->count on again.
- tries to uncharge a mapped page.
And I think deleting the swapcache of a mapped page can also happen
in other sequences (e.g. do_swap_page()->remove_exclusive_swap_cache()).
OTOH, as for shmem/tmpfs, swap_cgroup_delete_swapcache() tries to
uncharge a page which is on the radix tree (that's why I saw over-uncharging),
because shmem_getpage() calls add_to_page_cache() before
delete_from_swap_cache().
So, I think the current implementation should be changed anyway.
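To make that concrete, here is a minimal sketch of the reordering implied by the
analysis above. It is illustrative only, not the fix that was posted:
page_get_page_cgroup(), swap_cgroup_account(), __mem_cgroup_uncharge_common() and
the MEM_CGROUP_CHARGE_TYPE_* values are names from this series,
page_mapped()/PageAnon() are the generic helpers, and the ordering is just one
possible way to keep the accounting consistent.

void swap_cgroup_delete_swapcache(struct page *page, swp_entry_t entry)
{
	struct page_cgroup *pc = page_get_page_cgroup(page);

	/* swapped in but never charged: nothing to do */
	if (!pc)
		return;

	/*
	 * Still mapped (try_to_unuse() path) or already back on the
	 * radix tree (shmem_getpage() path): the page keeps its memory
	 * charge, so the swap entry must not be marked as accounted.
	 */
	if (page_mapped(page) || (page->mapping && !PageAnon(page)))
		return;

	/* only now convert the page charge into a swap charge */
	if (swap_cgroup_account(pc->mem_cgroup, entry, true))
		__mem_cgroup_uncharge_common(page,
				MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
	else
		__mem_cgroup_uncharge_common(page,
				MEM_CGROUP_CHARGE_TYPE_MAPPED);
}

The point is only to keep "swap_cgroup.count is set" equivalent to "res.swaps was
charged", which the sequence above shows the current ordering cannot guarantee.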
Thanks,
Daisuke Nishimura.
* Re: [RFC][PATCH 14/14]memcg: mem+swap accounting
2008-09-03 6:23 ` Daisuke Nishimura
@ 2008-09-03 7:05 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-03 7:05 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir
On Wed, 3 Sep 2008 15:23:14 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Tue, 2 Sep 2008 20:40:53 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > maybe like this.
> > swap_cgroup_set_account()
> > -> mem_cgroup_uncharge()
> > -> the page is mapped ..no uncharge here.
> > -> then, res.page, res.swaps is not changed.
> > -> we should mark swap account as false.
> >
> > Anyway, thank you! I'll consider this again.
> >
> I add some debug code to show why the uncharge fails there.
>
> It showed that the cause of uncharge failure at swapoff was that
> it tried to free a mapped page, as you said.
>
> I think this can happen in the sequence below:
>
> try_to_unuse()
> unuse_mm()
> ...
> unuse_pte()
> mem_cgroup_charge()
> swap_cgroup_delete_account()
> - swap_cgroup->count is turned off.
> page_add_anon_rmap()
> - map page.
> ...
> delete_from_swap_cache()
> swap_cgroup_delete_swapcache()
> - turns swap_cgroup->count on again.
> - tries to uncharge a mapped page.
>
> And I think deleting a swapcache of a mapped page can also happen
> in other sequences(e.g. do_swap_page()->remove_exclusive_swap_cache()).
>
> OTOH, as for shmem/tmpfs, swap_cgroup_delete_swapcache() tries to
> uncharge a page which is on radix tree(that's why I saw over-uncharging),
> because shmem_getpage() calls add_to_page_cache() before
> delete_from_swap_cache().
>
>
> So, I think current implementation should be changed anyway.
>
Thank you for your investigation. I'll refine the logic.
I think I have to tune the lockless** series first. It's almost done.
The mem+swap patch is also maintained, but my usual 8cpu host is
still under maintenance. (And the newest mmtom doesn't work on my 2cpu machine.)
So, please be patient for a while if there are no updates from me.
Of course, if you have your own version, please post it. I think the kmap_atomic()
logic in my patch is also beneficial to your original version.
Thanks,
-Kame
* Re: [RFC][PATCH 6/14] memcg: lockless page cgroup
2008-08-22 11:35 ` [RFC][PATCH 6/14] memcg: lockless page cgroup KAMEZAWA Hiroyuki
@ 2008-09-09 5:40 ` Daisuke Nishimura
2008-09-09 7:56 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 61+ messages in thread
From: Daisuke Nishimura @ 2008-09-09 5:40 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura
On Fri, 22 Aug 2008 20:35:51 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> This patch removes lock_page_cgroup(). Now, page_cgroup is guarded by RCU.
>
> To remove lock_page_cgroup(), we have to confirm there is no race.
>
> Anon pages:
> * pages are chareged/uncharged only when first-mapped/last-unmapped.
> page_mapcount() handles that.
> (And... pte_lock() is always held in any racy case.)
>
> Swap pages:
> There will be race because charge is done before lock_page().
> This patch moves mem_cgroup_charge() under lock_page().
>
> File pages: (not Shmem)
> * pages are charged/uncharged only when it's added/removed to radix-tree.
> In this case, PageLock() is always held.
>
> Install Page:
> Is it worth to charge this special map page ? which is (maybe) not on LRU.
> I think no.
> I removed charge/uncharge from install_page().
>
> Page Migration:
> We precharge it and map it back under lock_page(). This should be treated
> as special case.
>
> freeing page_cgroup is done under RCU.
>
> After this patch, page_cgroup can be accesced via struct page->page_cgroup
> under following conditions.
>
> 1. The page is file cache and on radix-tree.
> (means lock_page() or mapping->tree_lock is held.)
> 2. The page is anounymous page and mapped.
> (means pte_lock is held.)
> 3. under RCU and the page_cgroup is not Obsolete.
>
> Typical style of "3" is following.
> **
> rcu_read_lock();
> pc = page_get_page_cgroup(page);
> if (pc && !PcgObsolete(pc)) {
> ......
> }
> rcu_read_unlock();
> **
>
> This is now under test. Don't apply if you're not brave.
>
> Changelog: (v1) -> (v2)
> - Added Documentation.
>
> Changelog: (preview) -> (v1)
> - Added comments.
> - Fixed page migration.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
>
(snip)
> /*
> @@ -766,14 +724,9 @@ static int mem_cgroup_charge_common(stru
> } else
> __SetPcgActive(pc);
>
> - lock_page_cgroup(page);
> - if (unlikely(page_get_page_cgroup(page))) {
> - unlock_page_cgroup(page);
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> - css_put(&mem->css);
> - kmem_cache_free(page_cgroup_cache, pc);
> - goto done;
> - }
> + /* Double counting race condition ? */
> + VM_BUG_ON(page_get_page_cgroup(page));
> +
> page_assign_page_cgroup(page, pc);
>
> mz = page_cgroup_zoneinfo(pc);
I got this VM_BUG_ON at swapoff.
Trying to shmem_unuse_inode() a page which has been moved
to the swapcache by shmem_writepage() causes this BUG, because
the page has not been uncharged (with all the patches applied).
I made a patch which changes shmem_unuse_inode to charge with
GFP_NOWAIT first and shrink usage on failure, as shmem_getpage does.
But I don't stick to my patch if you handle this case :)
Thanks,
Daisuke Nishimura.
====
Change shmem_unuse_inode to charge with GFP_NOWAIT first and
shrink usage on failure, as shmem_getpage does.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
diff --git a/mm/shmem.c b/mm/shmem.c
index 72b5f03..d37cd51 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -922,15 +922,10 @@ found:
error = 1;
if (!inode)
goto out;
- /* Precharge page using GFP_KERNEL while we can wait */
- error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
- if (error)
- goto out;
+retry:
error = radix_tree_preload(GFP_KERNEL);
- if (error) {
- mem_cgroup_uncharge_cache_page(page);
+ if (error)
goto out;
- }
error = 1;
spin_lock(&info->lock);
@@ -938,9 +933,17 @@ found:
if (ptr && ptr->val == entry.val) {
error = add_to_page_cache_locked(page, inode->i_mapping,
idx, GFP_NOWAIT);
- /* does mem_cgroup_uncharge_cache_page on error */
- } else /* we must compensate for our precharge above */
- mem_cgroup_uncharge_cache_page(page);
+ if (error == -ENOMEM) {
+ if (ptr)
+ shmem_swp_unmap(ptr);
+ spin_unlock(&info->lock);
+ radix_tree_preload_end();
+ error = mem_cgroup_shrink_usage(current->mm, GFP_KERNEL);
+ if (error)
+ goto out;
+ goto retry;
+ }
+ }
if (error == -EEXIST) {
struct page *filepage = find_get_page(inode->i_mapping, idx);
* Re: [RFC][PATCH 6/14] memcg: lockless page cgroup
2008-09-09 5:40 ` Daisuke Nishimura
@ 2008-09-09 7:56 ` KAMEZAWA Hiroyuki
2008-09-09 8:11 ` Daisuke Nishimura
2008-09-09 14:04 ` Balbir Singh
0 siblings, 2 replies; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-09 7:56 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir
On Tue, 9 Sep 2008 14:40:07 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > + /* Double counting race condition ? */
> > + VM_BUG_ON(page_get_page_cgroup(page));
> > +
> > page_assign_page_cgroup(page, pc);
> >
> > mz = page_cgroup_zoneinfo(pc);
>
> I got this VM_BUG_ON at swapoff.
>
> Trying to shmem_unuse_inode a page which has been moved
> to swapcache by shmem_writepage causes this BUG, because
> the page has not been uncharged(with all the patches applied).
>
> I made a patch which changes shmem_unuse_inode to charge with
> GFP_NOWAIT first and shrink usage on failure, as shmem_getpage does.
>
> But I don't stick to my patch if you handle this case :)
>
Thank you for testing, and sorry for the lack of progress these days.
I'm sorry to say that I'll have to postpone this in order to remove the
page->page_cgroup pointer. I need some more performance-improvement
effort to remove the page->page_cgroup pointer without significant overhead.
So please be patient for a while.
Sorry,
-Kame
>
> Thanks,
> Daisuke Nishimura.
>
> ====
> Change shmem_unuse_inode to charge with GFP_NOWAIT first and
> shrink usage on failure, as shmem_getpage does.
>
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
>
> ---
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 72b5f03..d37cd51 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -922,15 +922,10 @@ found:
> error = 1;
> if (!inode)
> goto out;
> - /* Precharge page using GFP_KERNEL while we can wait */
> - error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
> - if (error)
> - goto out;
> +retry:
> error = radix_tree_preload(GFP_KERNEL);
> - if (error) {
> - mem_cgroup_uncharge_cache_page(page);
> + if (error)
> goto out;
> - }
> error = 1;
>
> spin_lock(&info->lock);
> @@ -938,9 +933,17 @@ found:
> if (ptr && ptr->val == entry.val) {
> error = add_to_page_cache_locked(page, inode->i_mapping,
> idx, GFP_NOWAIT);
> - /* does mem_cgroup_uncharge_cache_page on error */
> - } else /* we must compensate for our precharge above */
> - mem_cgroup_uncharge_cache_page(page);
> + if (error == -ENOMEM) {
> + if (ptr)
> + shmem_swp_unmap(ptr);
> + spin_unlock(&info->lock);
> + radix_tree_preload_end();
> + error = mem_cgroup_shrink_usage(current->mm, GFP_KERNEL);
> + if (error)
> + goto out;
> + goto retry;
> + }
> + }
>
> if (error == -EEXIST) {
> struct page *filepage = find_get_page(inode->i_mapping, idx);
>
* Re: [RFC][PATCH 6/14] memcg: lockless page cgroup
2008-09-09 7:56 ` KAMEZAWA Hiroyuki
@ 2008-09-09 8:11 ` Daisuke Nishimura
2008-09-09 11:11 ` KAMEZAWA Hiroyuki
2008-09-09 14:24 ` Balbir Singh
2008-09-09 14:04 ` Balbir Singh
1 sibling, 2 replies; 61+ messages in thread
From: Daisuke Nishimura @ 2008-09-09 8:11 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir
On Tue, 9 Sep 2008 16:56:08 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 9 Sep 2008 14:40:07 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>
> > > + /* Double counting race condition ? */
> > > + VM_BUG_ON(page_get_page_cgroup(page));
> > > +
> > > page_assign_page_cgroup(page, pc);
> > >
> > > mz = page_cgroup_zoneinfo(pc);
> >
> > I got this VM_BUG_ON at swapoff.
> >
> > Trying to shmem_unuse_inode a page which has been moved
> > to swapcache by shmem_writepage causes this BUG, because
> > the page has not been uncharged(with all the patches applied).
> >
> > I made a patch which changes shmem_unuse_inode to charge with
> > GFP_NOWAIT first and shrink usage on failure, as shmem_getpage does.
> >
> > But I don't stick to my patch if you handle this case :)
> >
> Thank you for testing and sorry for no progress in these days.
>
> I'm sorry to say that I'll have to postpone this to remove
> page->page_cgroup pointer. I need some more performance-improvement
> effort to remove page->page_cgroup pointer without significant overhead.
>
No problem. I know about that :)
And, I've started reviewing the radix tree patch and trying to test it.
Thanks,
Daisuke Nishimura.
* Re: [RFC][PATCH 6/14] memcg: lockless page cgroup
2008-09-09 8:11 ` Daisuke Nishimura
@ 2008-09-09 11:11 ` KAMEZAWA Hiroyuki
2008-09-09 11:48 ` Balbir Singh
2008-09-09 14:24 ` Balbir Singh
1 sibling, 1 reply; 61+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-09-09 11:11 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir
On Tue, 9 Sep 2008 17:11:54 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > I'm sorry to say that I'll have to postpone this to remove
> > page->page_cgroup pointer. I need some more performance-improvement
> > effort to remove page->page_cgroup pointer without significant overhead.
> >
> No problem. I know about that :)
>
This is the latest result of the lockless series (on rc5-mmtom).
(Don't trust the shell script results... they seem too slow.)
==on 2cpu/1socket x86-64 host==
rc5-mm1
==
Execl Throughput 3006.5 lps (29.8 secs, 3 samples)
C Compiler Throughput 1006.7 lpm (60.0 secs, 3 samples)
Shell Scripts (1 concurrent) 4863.7 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 943.7 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 482.7 lpm (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 124804.9 lpm (30.0 secs, 3 samples)
lockless
==
Execl Throughput 3035.5 lps (29.6 secs, 3 samples)
C Compiler Throughput 1010.3 lpm (60.0 secs, 3 samples)
Shell Scripts (1 concurrent) 4881.0 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 947.7 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 485.0 lpm (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 125437.9 lpm (30.0 secs, 3 samples)
==
I'll try to build the "remove-page-cgroup-pointer" patch on top of this
and see what happens tomorrow. (And I think my 8cpu box will come back.)
Thanks,
-Kame
* Re: [RFC][PATCH 6/14] memcg: lockless page cgroup
2008-09-09 11:11 ` KAMEZAWA Hiroyuki
@ 2008-09-09 11:48 ` Balbir Singh
0 siblings, 0 replies; 61+ messages in thread
From: Balbir Singh @ 2008-09-09 11:48 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-mm
KAMEZAWA Hiroyuki wrote:
> On Tue, 9 Sep 2008 17:11:54 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>>> I'm sorry to say that I'll have to postpone this to remove
>>> page->page_cgroup pointer. I need some more performance-improvement
>>> effort to remove page->page_cgroup pointer without significant overhead.
>>>
>> No problem. I know about that :)
>>
> This is the latest result of lockless series. (on rc5-mmtom)
> (Don't trust shell script result...it seems too slow.)
>
> ==on 2cpu/1socket x86-64 host==
> rc5-mm1
> ==
> Execl Throughput 3006.5 lps (29.8 secs, 3 samples)
> C Compiler Throughput 1006.7 lpm (60.0 secs, 3 samples)
> Shell Scripts (1 concurrent) 4863.7 lpm (60.0 secs, 3 samples)
> Shell Scripts (8 concurrent) 943.7 lpm (60.0 secs, 3 samples)
> Shell Scripts (16 concurrent) 482.7 lpm (60.0 secs, 3 samples)
> Dc: sqrt(2) to 99 decimal places 124804.9 lpm (30.0 secs, 3 samples)
>
> lockless
> ==
> Execl Throughput 3035.5 lps (29.6 secs, 3 samples)
> C Compiler Throughput 1010.3 lpm (60.0 secs, 3 samples)
> Shell Scripts (1 concurrent) 4881.0 lpm (60.0 secs, 3 samples)
> Shell Scripts (8 concurrent) 947.7 lpm (60.0 secs, 3 samples)
> Shell Scripts (16 concurrent) 485.0 lpm (60.0 secs, 3 samples)
> Dc: sqrt(2) to 99 decimal places 125437.9 lpm (30.0 secs, 3 samples)
> ==
>
> I'll try to build "remove-page-cgroup-pointer" patch on this
> and see what happens tomorrow. (And I think my 8cpu box will come back..
Looks good so far. Thanks for all the testing!
--
Balbir
* Re: [RFC][PATCH 6/14] memcg: lockless page cgroup
2008-09-09 7:56 ` KAMEZAWA Hiroyuki
2008-09-09 8:11 ` Daisuke Nishimura
@ 2008-09-09 14:04 ` Balbir Singh
1 sibling, 0 replies; 61+ messages in thread
From: Balbir Singh @ 2008-09-09 14:04 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-mm
KAMEZAWA Hiroyuki wrote:
> On Tue, 9 Sep 2008 14:40:07 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>
>>> + /* Double counting race condition ? */
>>> + VM_BUG_ON(page_get_page_cgroup(page));
>>> +
>>> page_assign_page_cgroup(page, pc);
>>>
>>> mz = page_cgroup_zoneinfo(pc);
>> I got this VM_BUG_ON at swapoff.
>>
>> Trying to shmem_unuse_inode a page which has been moved
>> to swapcache by shmem_writepage causes this BUG, because
>> the page has not been uncharged(with all the patches applied).
>>
>> I made a patch which changes shmem_unuse_inode to charge with
>> GFP_NOWAIT first and shrink usage on failure, as shmem_getpage does.
>>
>> But I don't stick to my patch if you handle this case :)
>>
> Thank you for testing and sorry for no progress in these days.
>
> I'm sorry to say that I'll have to postpone this to remove
> page->page_cgroup pointer. I need some more performance-improvement
> effort to remove page->page_cgroup pointer without significant overhead.
>
I don't think this should take long to do. It's really easy to do (I've tried
two approaches and it took me a day to get them working). I am trying some other
approach based on early_init and alloc_bootmem*.
> So please be patient for a while.
--
Balbir
* Re: [RFC][PATCH 6/14] memcg: lockless page cgroup
2008-09-09 8:11 ` Daisuke Nishimura
2008-09-09 11:11 ` KAMEZAWA Hiroyuki
@ 2008-09-09 14:24 ` Balbir Singh
1 sibling, 0 replies; 61+ messages in thread
From: Balbir Singh @ 2008-09-09 14:24 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: KAMEZAWA Hiroyuki, linux-mm
Daisuke Nishimura wrote:
> On Tue, 9 Sep 2008 16:56:08 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> On Tue, 9 Sep 2008 14:40:07 +0900
>> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>>
>>>> + /* Double counting race condition ? */
>>>> + VM_BUG_ON(page_get_page_cgroup(page));
>>>> +
>>>> page_assign_page_cgroup(page, pc);
>>>>
>>>> mz = page_cgroup_zoneinfo(pc);
>>> I got this VM_BUG_ON at swapoff.
>>>
>>> Trying to shmem_unuse_inode a page which has been moved
>>> to swapcache by shmem_writepage causes this BUG, because
>>> the page has not been uncharged(with all the patches applied).
>>>
>>> I made a patch which changes shmem_unuse_inode to charge with
>>> GFP_NOWAIT first and shrink usage on failure, as shmem_getpage does.
>>>
>>> But I don't stick to my patch if you handle this case :)
>>>
>> Thank you for testing and sorry for no progress in these days.
>>
>> I'm sorry to say that I'll have to postpone this to remove
>> page->page_cgroup pointer. I need some more performance-improvement
>> effort to remove page->page_cgroup pointer without significant overhead.
>>
> No problem. I know about that :)
>
> And, I've started reviewing the radix tree patch and trying to test it.
>
Thanks, Daisuke!
--
Balbir
Thread overview: 61+ messages
2008-08-22 11:27 [RFC][PATCH 0/14] Mem+Swap Controller v2 KAMEZAWA Hiroyuki
2008-08-22 11:30 ` [RFC][PATCH 1/14] memcg: unlimted root cgroup KAMEZAWA Hiroyuki
2008-08-22 22:51 ` Balbir Singh
2008-08-23 0:38 ` kamezawa.hiroyu
2008-08-25 3:19 ` KAMEZAWA Hiroyuki
2008-08-22 11:31 ` [RFC][PATCH 2/14] memcg: rewrite force_empty KAMEZAWA Hiroyuki
2008-08-25 3:21 ` KAMEZAWA Hiroyuki
2008-08-29 11:45 ` Daisuke Nishimura
2008-08-30 7:30 ` KAMEZAWA Hiroyuki
2008-08-22 11:32 ` [RFC][PATCH 3/14] memcg: atomic_flags KAMEZAWA Hiroyuki
2008-08-26 4:55 ` Balbir Singh
2008-08-26 23:50 ` KAMEZAWA Hiroyuki
2008-08-27 1:58 ` KAMEZAWA Hiroyuki
2008-08-26 8:46 ` kamezawa.hiroyu
2008-08-26 8:49 ` Balbir Singh
2008-08-26 23:41 ` KAMEZAWA Hiroyuki
2008-08-22 11:33 ` [RFC][PATCH 4/14] delay page_cgroup freeing KAMEZAWA Hiroyuki
2008-08-26 11:46 ` Balbir Singh
2008-08-26 23:55 ` KAMEZAWA Hiroyuki
2008-08-27 1:17 ` Balbir Singh
2008-08-27 1:39 ` KAMEZAWA Hiroyuki
2008-08-27 2:25 ` Balbir Singh
2008-08-27 2:46 ` KAMEZAWA Hiroyuki
2008-08-22 11:34 ` [RFC][PATCH 5/14] memcg: free page_cgroup by RCU KAMEZAWA Hiroyuki
2008-08-28 10:06 ` Balbir Singh
2008-08-28 10:44 ` KAMEZAWA Hiroyuki
2008-09-01 6:51 ` YAMAMOTO Takashi
2008-09-01 7:01 ` KAMEZAWA Hiroyuki
2008-08-22 11:35 ` [RFC][PATCH 6/14] memcg: lockless page cgroup KAMEZAWA Hiroyuki
2008-09-09 5:40 ` Daisuke Nishimura
2008-09-09 7:56 ` KAMEZAWA Hiroyuki
2008-09-09 8:11 ` Daisuke Nishimura
2008-09-09 11:11 ` KAMEZAWA Hiroyuki
2008-09-09 11:48 ` Balbir Singh
2008-09-09 14:24 ` Balbir Singh
2008-09-09 14:04 ` Balbir Singh
2008-08-22 11:36 ` [RFC][PATCH 7/14] memcg: add prefetch to spinlock KAMEZAWA Hiroyuki
2008-08-28 11:00 ` Balbir Singh
2008-08-22 11:37 ` [RFC][PATCH 8/14] memcg: make mapping null before uncharge KAMEZAWA Hiroyuki
2008-08-22 11:38 ` [RFC][PATCH 9/14] memcg: add page_cgroup.h file KAMEZAWA Hiroyuki
2008-08-22 11:39 ` [RFC][PATCH 10/14] memcg: replace res_counter KAMEZAWA Hiroyuki
2008-08-27 0:44 ` Daisuke Nishimura
2008-08-27 1:26 ` KAMEZAWA Hiroyuki
2008-08-22 11:40 ` [RFC][PATCH 11/14] memcg: mem_cgroup private ID KAMEZAWA Hiroyuki
2008-08-22 11:41 ` [RFC][PATCH 12/14] memcg: mem+swap controller Kconfig KAMEZAWA Hiroyuki
2008-08-22 11:41 ` [RFC][PATCH 13/14] memcg: mem+swap counter KAMEZAWA Hiroyuki
2008-08-28 8:51 ` Daisuke Nishimura
2008-08-28 9:32 ` KAMEZAWA Hiroyuki
2008-08-22 11:44 ` [RFC][PATCH 14/14]memcg: mem+swap accounting KAMEZAWA Hiroyuki
2008-09-01 7:15 ` Daisuke Nishimura
2008-09-01 7:58 ` KAMEZAWA Hiroyuki
2008-09-01 8:53 ` Daisuke Nishimura
2008-09-01 9:53 ` KAMEZAWA Hiroyuki
2008-09-01 10:21 ` Daisuke Nishimura
2008-09-02 2:21 ` Daisuke Nishimura
2008-09-02 11:09 ` Daisuke Nishimura
2008-09-02 11:40 ` KAMEZAWA Hiroyuki
2008-09-03 6:23 ` Daisuke Nishimura
2008-09-03 7:05 ` KAMEZAWA Hiroyuki
2008-08-22 13:20 ` [RFC][PATCH 0/14] Mem+Swap Controller v2 Balbir Singh
2008-08-22 15:34 ` kamezawa.hiroyu