* [RFC][PATCH 0/11] memcg updates / clean up, lazy lru ,mem+swap controller
@ 2008-10-23 8:58 KAMEZAWA Hiroyuki
2008-10-23 8:59 ` [RFC][PATCH 1/11] memcg: fix kconfig menu comment KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2008-10-23 8:58 UTC (permalink / raw)
To: linux-mm; +Cc: balbir, nishimura, xemul, menage
Just for internal review. The merge window is open now ;)
This is the next memcg update series (from my queue).
I'm testing this set now and would like to post it next week, one by one.
(Anyway, I'll wait until mmotm/mainline seems to be settled.)
These are against "The mm-of-the-moment snapshot 2008-10-22-17-18".
The set includes 1 cleanup and 4 major changes.
a. menuconfig cleanup
   "General Setup" in menuconfig is getting longer day by day, so add a
   cgroup submenu to tidy it up.
b. try/commit/cancel protocol.
   Make the mem_cgroup interface more specific and add a new
   try/commit/cancel interface (quoted just below this list).
   Because we allocate all page_cgroups at boot, we can do better handling
   of charge/uncharge calls.
c. change force_empty's behavior from "forget all" to "move to parent".
   Currently, force_empty does "forget all". This is not good.
   Change this behavior to
     - move the accounting to the parent.
     - if the parent hits its limit, free pages.
   This also removes the memory.force_empty interface... a debug-only
   brutal file. (This file can be a hole....)
d. lazy lru handling.
   Do adds/removes on memcg's LRU in a lazy way, as pagevec does.
e. Mem+Swap controller.
   Account swap and limit by mem+swap. This feature is implemented as an
   extension to memcg. (mem_counter is removed.)
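For reference, the new swap-in charging interface from item (b) looks like
this (signatures copied from the memcontrol.h hunk in patch 3):

	/* get a charge of PAGE_SIZE; *ptr returns the charged mem_cgroup */
	extern int mem_cgroup_try_charge(struct mm_struct *mm,
				gfp_t gfp_mask, struct mem_cgroup **ptr);
	/* mark the page's page_cgroup as USED and put it on the LRU */
	extern void mem_cgroup_commit_charge_swapin(struct page *page,
				struct mem_cgroup *ptr);
	/* give the charge back, e.g. when do_swap_page() fails */
	extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);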
In my view,
  a. is ok. (patches 1,2)
  b,c,d have been tested for 2-3 weeks unchanged. (patches 3-7)
  e. is very new and will stay in my queue for more weeks. (patches 8-11)
Patches.
[1/11] fix menu's comment about page_cgroup overhead.
[2/11] make cgroup's kconfig a submenu
[3/11] introduce charge/commit/cancel
[4/11] clean up page migration (again!)
[5/11] fix force_empty to move account to parent
[6/11] lazy memcg lru removal
[7/11] lazy memcg lru add
[8/11] make shmem's accounting clearer before mem+swap controller
[9/11] mem+swap controller kconfig.
[10/11] swap_cgroup for recording swap information
[11/11] mem+swap controller core
Thank you for all your patient help.
Regards,
-Kame
* [RFC][PATCH 1/11] memcg: fix kconfig menu comment
2008-10-23 8:58 [RFC][PATCH 0/11] memcg updates / clean up, lazy lru ,mem+swap controller KAMEZAWA Hiroyuki
@ 2008-10-23 8:59 ` KAMEZAWA Hiroyuki
2008-10-24 4:24 ` Randy Dunlap
2008-10-23 9:00 ` [RFC][PATCH 2/11] cgroup: make cgroup kconfig as submenu KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2008-10-23 8:59 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
Fixes menu help text for memcg-allocate-page-cgroup-at-boot.patch.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
init/Kconfig | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
Index: mmotm-2.6.27+/init/Kconfig
===================================================================
--- mmotm-2.6.27+.orig/init/Kconfig
+++ mmotm-2.6.27+/init/Kconfig
@@ -401,16 +401,20 @@ config CGROUP_MEM_RES_CTLR
depends on CGROUPS && RESOURCE_COUNTERS
select MM_OWNER
help
- Provides a memory resource controller that manages both page cache and
- RSS memory.
+ Provides a memory resource controller that manages both anonymous
+ memory and page cache. (See Documentation/controllers/memory.txt)
Note that setting this option increases fixed memory overhead
- associated with each page of memory in the system by 4/8 bytes
- and also increases cache misses because struct page on many 64bit
- systems will not fit into a single cache line anymore.
+ associated with each page of memory in the system. By this,
+ 20(40)bytes/PAGE_SIZE on 32(64)bit system will be occupied by memory
+ usage tracking struct at boot. Total amount of this is printed out
+ at boot.
Only enable when you're ok with these trade offs and really
- sure you need the memory resource controller.
+ sure you need the memory resource controller. Even when you enable
+ this, you can set "cgroup_disable=memory" at your boot option to
+ disable memory resource controller and you can avoid overheads.
+ (and lose benefits of memory resource controller)
This config option also selects MM_OWNER config option, which
could in turn add some fork/exit overhead.
* [RFC][PATCH 2/11] cgroup: make cgroup kconfig as submenu
2008-10-23 8:58 [RFC][PATCH 0/11] memcg updates / clean up, lazy lru ,mem+swap controller KAMEZAWA Hiroyuki
2008-10-23 8:59 ` [RFC][PATCH 1/11] memcg: fix kconfig menu comment KAMEZAWA Hiroyuki
@ 2008-10-23 9:00 ` KAMEZAWA Hiroyuki
2008-10-23 21:20 ` Paul Menage
2008-10-23 9:02 ` [RFC][PATCH 3/11] memcg: charge commit cancel protocol KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2008-10-23 9:00 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
Make the CGROUP related configs a submenu.
This patch makes the CGROUP related configs a submenu and shortens the
first-level list of configs under "General Setup".
It includes the following additional changes:
  - add help comments about CGROUPS and GROUP_SCHED.
  - move the MM_OWNER config to the bottom.
    (for good indentation in menuconfig)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
init/Kconfig | 117 ++++++++++++++++++++++++++++++-----------------------------
1 file changed, 61 insertions(+), 56 deletions(-)
Index: mmotm-2.6.27+/init/Kconfig
===================================================================
--- mmotm-2.6.27+.orig/init/Kconfig
+++ mmotm-2.6.27+/init/Kconfig
@@ -271,59 +271,6 @@ config LOG_BUF_SHIFT
13 => 8 KB
12 => 4 KB
-config CGROUPS
- bool "Control Group support"
- help
- This option will let you use process cgroup subsystems
- such as Cpusets
-
- Say N if unsure.
-
-config CGROUP_DEBUG
- bool "Example debug cgroup subsystem"
- depends on CGROUPS
- default n
- help
- This option enables a simple cgroup subsystem that
- exports useful debugging information about the cgroups
- framework
-
- Say N if unsure
-
-config CGROUP_NS
- bool "Namespace cgroup subsystem"
- depends on CGROUPS
- help
- Provides a simple namespace cgroup subsystem to
- provide hierarchical naming of sets of namespaces,
- for instance virtual servers and checkpoint/restart
- jobs.
-
-config CGROUP_FREEZER
- bool "control group freezer subsystem"
- depends on CGROUPS
- help
- Provides a way to freeze and unfreeze all tasks in a
- cgroup.
-
-config CGROUP_DEVICE
- bool "Device controller for cgroups"
- depends on CGROUPS && EXPERIMENTAL
- help
- Provides a cgroup implementing whitelists for devices which
- a process in the cgroup can mknod or open.
-
-config CPUSETS
- bool "Cpuset support"
- depends on SMP && CGROUPS
- help
- This option will let you create and manage CPUSETs which
- allow dynamically partitioning a system into sets of CPUs and
- Memory Nodes and assigning tasks to run only within those sets.
- This is primarily useful on large SMP or NUMA systems.
-
- Say N if unsure.
-
#
# Architectures with an unreliable sched_clock() should select this:
#
@@ -337,6 +284,8 @@ config GROUP_SCHED
help
This feature lets CPU scheduler recognize task groups and control CPU
bandwidth allocation to such task groups.
+ For allowing to make a group from arbitrary set of processes, use
+ CONFIG_CGROUPS. (See Control Group support.)
config FAIR_GROUP_SCHED
bool "Group scheduling for SCHED_OTHER"
@@ -379,6 +328,60 @@ config CGROUP_SCHED
endchoice
+menu "Control Group supprt"
+config CGROUPS
+ bool "Control Group support"
+ help
+ This option will let you use process cgroup subsystems
+ such as Cpusets
+
+ Say N if unsure.
+
+config CGROUP_DEBUG
+ bool "Example debug cgroup subsystem"
+ depends on CGROUPS
+ default n
+ help
+ This option enables a simple cgroup subsystem that
+ exports useful debugging information about the cgroups
+ framework
+
+ Say N if unsure
+
+config CGROUP_NS
+ bool "Namespace cgroup subsystem"
+ depends on CGROUPS
+ help
+ Provides a simple namespace cgroup subsystem to
+ provide hierarchical naming of sets of namespaces,
+ for instance virtual servers and checkpoint/restart
+ jobs.
+
+config CGROUP_FREEZER
+ bool "control group freezer subsystem"
+ depends on CGROUPS
+ help
+ Provides a way to freeze and unfreeze all tasks in a
+ cgroup.
+
+config CGROUP_DEVICE
+ bool "Device controller for cgroups"
+ depends on CGROUPS && EXPERIMENTAL
+ help
+ Provides a cgroup implementing whitelists for devices which
+ a process in the cgroup can mknod or open.
+
+config CPUSETS
+ bool "Cpuset support"
+ depends on SMP && CGROUPS
+ help
+ This option will let you create and manage CPUSETs which
+ allow dynamically partitioning a system into sets of CPUs and
+ Memory Nodes and assigning tasks to run only within those sets.
+ This is primarily useful on large SMP or NUMA systems.
+
+ Say N if unsure.
+
config CGROUP_CPUACCT
bool "Simple CPU accounting cgroup subsystem"
depends on CGROUPS
@@ -393,9 +396,6 @@ config RESOURCE_COUNTERS
infrastructure that works with cgroups
depends on CGROUPS
-config MM_OWNER
- bool
-
config CGROUP_MEM_RES_CTLR
bool "Memory Resource Controller for Control Groups"
depends on CGROUPS && RESOURCE_COUNTERS
@@ -419,6 +419,11 @@ config CGROUP_MEM_RES_CTLR
This config option also selects MM_OWNER config option, which
could in turn add some fork/exit overhead.
+config MM_OWNER
+ bool
+
+endmenu
+
config SYSFS_DEPRECATED
bool
* [RFC][PATCH 3/11] memcg: charge commit cancel protocol
2008-10-23 8:58 [RFC][PATCH 0/11] memcg updates / clean up, lazy lru ,mem+swap controller KAMEZAWA Hiroyuki
2008-10-23 8:59 ` [RFC][PATCH 1/11] memcg: fix kconfig menu comment KAMEZAWA Hiroyuki
2008-10-23 9:00 ` [RFC][PATCH 2/11] cgroup: make cgroup kconfig as submenu KAMEZAWA Hiroyuki
@ 2008-10-23 9:02 ` KAMEZAWA Hiroyuki
2008-10-23 9:03 ` [RFC][PATCH 4/11] memcg: better page migration handling KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2008-10-23 9:02 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
There is a small race in do_swap_page(). When the page being swapped in is
charged, its mapcount can be greater than 0. But, at the same time, some
process which shares it may call unmap, drop the mapcount from 1 to 0, and
the page gets uncharged.

          CPUA                           CPUB
                                         mapcount == 1.
  (1) charge if mapcount==0              zap_pte_range()
                                         (2) mapcount 1 => 0.
                                         (3) uncharge(). (success)
  (4) set page's rmap()
      mapcount 0 => 1

Then, the accounting for this swap page is leaked.
To fix this, I added a new interface.
  - precharge
    account to the res_counter by PAGE_SIZE and try to free pages if necessary.
  - commit
    register the page_cgroup and add it to the LRU if necessary.
  - cancel
    uncharge PAGE_SIZE because of a do_swap_page() failure.
With this, do_swap_page() on CPUA becomes:
  (1) charge (always)
  (2) set page's rmap (mapcount > 0)
  (3) commit the charge after set_pte(), whether it turned out to be
      needed or not.
This protocol uses the PCG_USED bit on page_cgroup to avoid over-accounting.
The usual mem_cgroup_charge_common() does precharge -> commit in one go.
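For illustration, the swap-in path then follows this pattern (a condensed
fragment based on the mm/memory.c hunk below; install_fails() is a
hypothetical stand-in for the real failure checks and error paths are
trimmed):

	struct mem_cgroup *ptr = NULL;

	/* (1) charge PAGE_SIZE against mm's cgroup, before rmap is set */
	if (mem_cgroup_try_charge(mm, GFP_KERNEL, &ptr) == -ENOMEM)
		return VM_FAULT_OOM;

	if (install_fails(page)) {		/* hypothetical check */
		/* cancel: give back the PAGE_SIZE charged above */
		mem_cgroup_cancel_charge_swapin(ptr);
		goto out;
	}

	page_add_anon_rmap(page, vma, address);	/* (2) set page's rmap */
	/* (3) commit: set PCG_USED on the page_cgroup, or silently
	 * uncharge if it was already committed by someone else */
	mem_cgroup_commit_charge_swapin(page, ptr);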
And this patch also adds the following functions to clarify all charges.
  - mem_cgroup_newpage_charge() .... replacement for mem_cgroup_charge(),
    called against newly allocated anon pages.
  - mem_cgroup_charge_migrate_fixup()
    called only from remove_migration_ptes().
    We'll have to rewrite this later (this patch just keeps the old
    behavior). This function will be removed by an additional patch to
    make migration clearer.
This is good for clarifying "what we do" at each charge point.
Then, we have the following 4 charge points:
  - newpage
  - swapin
  - add-to-cache
  - migration
Changelog v7 -> v8
  - handle the case where try_charge() sets *memcg to NULL.
Changelog v5 -> v7:
  - added newpage_charge() and migrate_fixup().
  - renamed functions for swap-in from "swap" to "swapin".
  - added a more precise description.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/memcontrol.h | 35 +++++++++-
mm/memcontrol.c | 151 +++++++++++++++++++++++++++++++++++----------
mm/memory.c | 12 ++-
mm/migrate.c | 2
mm/swapfile.c | 6 +
5 files changed, 165 insertions(+), 41 deletions(-)
Index: mmotm-2.6.27+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27+/include/linux/memcontrol.h
@@ -27,8 +27,17 @@ struct mm_struct;
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
+extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
+extern int mem_cgroup_charge_migrate_fixup(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask);
+/* for swap handling */
+extern int mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern void mem_cgroup_commit_charge_swapin(struct page *page,
+ struct mem_cgroup *ptr);
+extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
+
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
@@ -71,7 +80,9 @@ extern long mem_cgroup_calc_reclaim(stru
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
-static inline int mem_cgroup_charge(struct page *page,
+struct mem_cgroup;
+
+static inline int mem_cgroup_newpage_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
return 0;
@@ -83,6 +94,26 @@ static inline int mem_cgroup_cache_charg
return 0;
}
+static inline int mem_cgroup_charge_migrate_fixup(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
+{
+ return 0;
+}
+
+static int mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **ptr)
+{
+ return 0;
+}
+
+static void mem_cgroup_commit_charge_swapin(struct page *page,
+ struct mem_cgroup *ptr)
+{
+}
+static void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr)
+{
+}
+
static inline void mem_cgroup_uncharge_page(struct page *page)
{
}
Index: mmotm-2.6.27+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27+.orig/mm/memcontrol.c
+++ mmotm-2.6.27+/mm/memcontrol.c
@@ -467,35 +467,31 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}
-/*
- * Charge the memory controller for page usage.
- * Return
- * 0 if the charge was successful
- * < 0 if the cgroup is over its limit
+
+/**
+ * mem_cgroup_try_charge - get charge of PAGE_SIZE.
+ * @mm: an mm_struct which is charged against. (when *memcg is NULL)
+ * @gfp_mask: gfp_mask for reclaim.
+ * @memcg: a pointer to memory cgroup which is charged against.
+ *
+ * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
+ * memory cgroup from @mm is got and stored in *memcg.
+ *
+ * Retruns 0 if success. -ENOMEM at failure.
*/
-static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, enum charge_type ctype,
- struct mem_cgroup *memcg)
+
+int mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **memcg)
{
struct mem_cgroup *mem;
- struct page_cgroup *pc;
- unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
-
- pc = lookup_page_cgroup(page);
- /* can happen at boot */
- if (unlikely(!pc))
- return 0;
- prefetchw(pc);
+ int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
/*
* We always charge the cgroup the mm_struct belongs to.
* The mm_struct's mem_cgroup changes on task migration if the
* thread group leader migrates. It's possible that mm is not
* set, if so charge the init_mm (happens for pagecache usage).
*/
-
- if (likely(!memcg)) {
+ if (likely(!*memcg)) {
rcu_read_lock();
mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
if (unlikely(!mem)) {
@@ -506,15 +502,17 @@ static int mem_cgroup_charge_common(stru
* For every charge from the cgroup, increment reference count
*/
css_get(&mem->css);
+ *memcg = mem;
rcu_read_unlock();
} else {
- mem = memcg;
- css_get(&memcg->css);
+ mem = *memcg;
+ css_get(&mem->css);
}
+
while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
if (!(gfp_mask & __GFP_WAIT))
- goto out;
+ goto nomem;
if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
continue;
@@ -531,18 +529,37 @@ static int mem_cgroup_charge_common(stru
if (!nr_retries--) {
mem_cgroup_out_of_memory(mem, gfp_mask);
- goto out;
+ goto nomem;
}
}
+ return 0;
+nomem:
+ css_put(&mem->css);
+ return -ENOMEM;
+}
+/*
+ * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
+ * USED state. If already USED, uncharge and return.
+ */
+
+static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
+ struct page_cgroup *pc,
+ enum charge_type ctype)
+{
+ struct mem_cgroup_per_zone *mz;
+ unsigned long flags;
+
+ /* try_charge() can return NULL to *memcg, taking care of it. */
+ if (!mem)
+ return;
lock_page_cgroup(pc);
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
-
- goto done;
+ return;
}
pc->mem_cgroup = mem;
/*
@@ -557,15 +574,39 @@ static int mem_cgroup_charge_common(stru
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);
+}
-done:
+/*
+ * Charge the memory controller for page usage.
+ * Return
+ * 0 if the charge was successful
+ * < 0 if the cgroup is over its limit
+ */
+static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask, enum charge_type ctype,
+ struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *mem;
+ struct page_cgroup *pc;
+ int ret;
+
+ pc = lookup_page_cgroup(page);
+ /* can happen at boot */
+ if (unlikely(!pc))
+ return 0;
+ prefetchw(pc);
+
+ mem = memcg;
+ ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
+ if (ret)
+ return ret;
+
+ __mem_cgroup_commit_charge(mem, pc, ctype);
return 0;
-out:
- css_put(&mem->css);
- return -ENOMEM;
}
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
+int mem_cgroup_newpage_charge(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
{
if (mem_cgroup_subsys.disabled)
return 0;
@@ -586,6 +627,34 @@ int mem_cgroup_charge(struct page *page,
MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
}
+/*
+ * same as mem_cgroup_newpage_charge(), now.
+ * But what we assume is different from newpage, and this is special case.
+ * treat this in special function. easy for maintainance.
+ */
+
+int mem_cgroup_charge_migrate_fixup(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask)
+{
+ if (mem_cgroup_subsys.disabled)
+ return 0;
+
+ if (PageCompound(page))
+ return 0;
+
+ if (page_mapped(page) || (page->mapping && !PageAnon(page)))
+ return 0;
+
+ if (unlikely(!mm))
+ mm = &init_mm;
+
+ return mem_cgroup_charge_common(page, mm, gfp_mask,
+ MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+}
+
+
+
+
int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
{
@@ -628,6 +697,30 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}
+
+void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
+{
+ struct page_cgroup *pc;
+
+ if (mem_cgroup_subsys.disabled)
+ return;
+ if (!ptr)
+ return;
+ pc = lookup_page_cgroup(page);
+ __mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
+}
+
+void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
+{
+ if (mem_cgroup_subsys.disabled)
+ return;
+ if (!mem)
+ return;
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ css_put(&mem->css);
+}
+
+
/*
* uncharge if !page_mapped(page)
*/
Index: mmotm-2.6.27+/mm/memory.c
===================================================================
--- mmotm-2.6.27+.orig/mm/memory.c
+++ mmotm-2.6.27+/mm/memory.c
@@ -1889,7 +1889,7 @@ gotten:
cow_user_page(new_page, old_page, address, vma);
__SetPageUptodate(new_page);
- if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
+ if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
goto oom_free_new;
/*
@@ -2285,6 +2285,7 @@ static int do_swap_page(struct mm_struct
struct page *page;
swp_entry_t entry;
pte_t pte;
+ struct mem_cgroup *ptr = NULL;
int ret = 0;
if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
@@ -2323,7 +2324,7 @@ static int do_swap_page(struct mm_struct
lock_page(page);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
- if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+ if (mem_cgroup_try_charge(mm, GFP_KERNEL, &ptr) == -ENOMEM) {
ret = VM_FAULT_OOM;
unlock_page(page);
goto out;
@@ -2353,6 +2354,7 @@ static int do_swap_page(struct mm_struct
flush_icache_page(vma, page);
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);
+ mem_cgroup_commit_charge_swapin(page, ptr);
swap_free(entry);
if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
@@ -2373,7 +2375,7 @@ unlock:
out:
return ret;
out_nomap:
- mem_cgroup_uncharge_page(page);
+ mem_cgroup_cancel_charge_swapin(ptr);
pte_unmap_unlock(page_table, ptl);
unlock_page(page);
page_cache_release(page);
@@ -2403,7 +2405,7 @@ static int do_anonymous_page(struct mm_s
goto oom;
__SetPageUptodate(page);
- if (mem_cgroup_charge(page, mm, GFP_KERNEL))
+ if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
goto oom_free_page;
entry = mk_pte(page, vma->vm_page_prot);
@@ -2496,7 +2498,7 @@ static int __do_fault(struct mm_struct *
ret = VM_FAULT_OOM;
goto out;
}
- if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+ if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
ret = VM_FAULT_OOM;
page_cache_release(page);
goto out;
Index: mmotm-2.6.27+/mm/migrate.c
===================================================================
--- mmotm-2.6.27+.orig/mm/migrate.c
+++ mmotm-2.6.27+/mm/migrate.c
@@ -133,7 +133,7 @@ static void remove_migration_pte(struct
* be reliable, and this charge can actually fail: oh well, we don't
* make the situation any worse by proceeding as if it had succeeded.
*/
- mem_cgroup_charge(new, mm, GFP_ATOMIC);
+ mem_cgroup_charge_migrate_fixup(new, mm, GFP_ATOMIC);
get_page(new);
pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
Index: mmotm-2.6.27+/mm/swapfile.c
===================================================================
--- mmotm-2.6.27+.orig/mm/swapfile.c
+++ mmotm-2.6.27+/mm/swapfile.c
@@ -530,17 +530,18 @@ unsigned int count_swap_pages(int type,
static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, swp_entry_t entry, struct page *page)
{
+ struct mem_cgroup *ptr = NULL;
spinlock_t *ptl;
pte_t *pte;
int ret = 1;
- if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
+ if (mem_cgroup_try_charge(vma->vm_mm, GFP_KERNEL, &ptr))
ret = -ENOMEM;
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
if (ret > 0)
- mem_cgroup_uncharge_page(page);
+ mem_cgroup_cancel_charge_swapin(ptr);
ret = 0;
goto out;
}
@@ -550,6 +551,7 @@ static int unuse_pte(struct vm_area_stru
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
page_add_anon_rmap(page, vma, addr);
+ mem_cgroup_commit_charge_swapin(page, ptr);
swap_free(entry);
/*
* Move the page to the active list so it is not
* [RFC][PATCH 4/11] memcg: better page migration handling
2008-10-23 8:58 [RFC][PATCH 0/11] memcg updates / clean up, lazy lru ,mem+swap controller KAMEZAWA Hiroyuki
2008-10-23 9:02 ` [RFC][PATCH 3/11] memcg: charge commit cancel protocol KAMEZAWA Hiroyuki
@ 2008-10-23 9:03 ` KAMEZAWA Hiroyuki
2008-10-23 9:05 ` [RFC][PATCH 5/11] memcg: account move and change force_empty KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2008-10-23 9:03 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
Currently, the management of "charge" under page migration is done in the
following manner (assume we migrate page contents from oldpage to newpage).

before migration
  - "newpage" is charged before migration.
at success
  - "oldpage" is uncharged somewhere (at unmap, radix-tree replacement).
at failure
  - "newpage" is uncharged.
  - "oldpage" is charged again if necessary (*1)

But (*1) is not reliable.... because it uses GFP_ATOMIC.
This patch changes the behavior as follows, using the charge/commit/cancel
ops; a rough sketch of the resulting unmap_and_move() flow follows the list.

before migration
  - charge PAGE_SIZE (no target page yet).
at success
  - commit the charge against "newpage".
at failure
  - commit the charge against "oldpage".
    (the PCG_USED bit works effectively to avoid double-counting)
  - if "oldpage" is obsolete, cancel the charge of PAGE_SIZE.
Changelog: v7 -> v8
- fixed memcg==NULL case in migration handling.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/memcontrol.h | 19 ++-----
mm/memcontrol.c | 107 ++++++++++++++++++++++-----------------------
mm/migrate.c | 42 +++++------------
3 files changed, 74 insertions(+), 94 deletions(-)
Index: mmotm-2.6.27+/mm/migrate.c
===================================================================
--- mmotm-2.6.27+.orig/mm/migrate.c
+++ mmotm-2.6.27+/mm/migrate.c
@@ -121,20 +121,6 @@ static void remove_migration_pte(struct
if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
goto out;
- /*
- * Yes, ignore the return value from a GFP_ATOMIC mem_cgroup_charge.
- * Failure is not an option here: we're now expected to remove every
- * migration pte, and will cause crashes otherwise. Normally this
- * is not an issue: mem_cgroup_prepare_migration bumped up the old
- * page_cgroup count for safety, that's now attached to the new page,
- * so this charge should just be another incrementation of the count,
- * to keep in balance with rmap.c's mem_cgroup_uncharging. But if
- * there's been a force_empty, those reference counts may no longer
- * be reliable, and this charge can actually fail: oh well, we don't
- * make the situation any worse by proceeding as if it had succeeded.
- */
- mem_cgroup_charge_migrate_fixup(new, mm, GFP_ATOMIC);
-
get_page(new);
pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
if (is_write_migration_entry(entry))
@@ -382,9 +368,6 @@ static void migrate_page_copy(struct pag
anon = PageAnon(page);
page->mapping = NULL;
- if (!anon) /* This page was removed from radix-tree. */
- mem_cgroup_uncharge_cache_page(page);
-
/*
* If any waiters have accumulated on the new page then
* wake them up.
@@ -621,6 +604,7 @@ static int unmap_and_move(new_page_t get
struct page *newpage = get_new_page(page, private, &result);
int rcu_locked = 0;
int charge = 0;
+ struct mem_cgroup *mem;
if (!newpage)
return -ENOMEM;
@@ -630,24 +614,26 @@ static int unmap_and_move(new_page_t get
goto move_newpage;
}
- charge = mem_cgroup_prepare_migration(page, newpage);
- if (charge == -ENOMEM) {
- rc = -ENOMEM;
- goto move_newpage;
- }
/* prepare cgroup just returns 0 or -ENOMEM */
- BUG_ON(charge);
-
rc = -EAGAIN;
+
if (!trylock_page(page)) {
if (!force)
goto move_newpage;
lock_page(page);
}
+ /* precharge against new page */
+ charge = mem_cgroup_prepare_migration(page, &mem);
+ if (charge == -ENOMEM) {
+ rc = -ENOMEM;
+ goto unlock;
+ }
+ BUG_ON(charge);
+
if (PageWriteback(page)) {
if (!force)
- goto unlock;
+ goto uncharge;
wait_on_page_writeback(page);
}
/*
@@ -700,7 +686,9 @@ static int unmap_and_move(new_page_t get
rcu_unlock:
if (rcu_locked)
rcu_read_unlock();
-
+uncharge:
+ if (!charge)
+ mem_cgroup_end_migration(mem, page, newpage);
unlock:
unlock_page(page);
@@ -716,8 +704,6 @@ unlock:
}
move_newpage:
- if (!charge)
- mem_cgroup_end_migration(newpage);
/*
* Move the new page to the LRU. If migration was not successful
Index: mmotm-2.6.27+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27+/include/linux/memcontrol.h
@@ -29,8 +29,6 @@ struct mm_struct;
extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
-extern int mem_cgroup_charge_migrate_fixup(struct page *page,
- struct mm_struct *mm, gfp_t gfp_mask);
/* for swap handling */
extern int mem_cgroup_try_charge(struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **ptr);
@@ -60,8 +58,9 @@ extern struct mem_cgroup *mem_cgroup_fro
((cgroup) == mem_cgroup_from_task((mm)->owner))
extern int
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage);
-extern void mem_cgroup_end_migration(struct page *page);
+mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr);
+extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
+ struct page *oldpage, struct page *newpage);
/*
* For memory reclaim.
@@ -94,12 +93,6 @@ static inline int mem_cgroup_cache_charg
return 0;
}
-static inline int mem_cgroup_charge_migrate_fixup(struct page *page,
- struct mm_struct *mm, gfp_t gfp_mask)
-{
- return 0;
-}
-
static int mem_cgroup_try_charge(struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **ptr)
{
@@ -143,12 +136,14 @@ static inline int task_in_mem_cgroup(str
}
static inline int
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
+mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
{
return 0;
}
-static inline void mem_cgroup_end_migration(struct page *page)
+static inline void mem_cgroup_end_migration(struct mem_cgroup *mem,
+ struct page *oldpage,
+ struct page *newpage)
{
}
Index: mmotm-2.6.27+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27+.orig/mm/memcontrol.c
+++ mmotm-2.6.27+/mm/memcontrol.c
@@ -627,34 +627,6 @@ int mem_cgroup_newpage_charge(struct pag
MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
}
-/*
- * same as mem_cgroup_newpage_charge(), now.
- * But what we assume is different from newpage, and this is special case.
- * treat this in special function. easy for maintainance.
- */
-
-int mem_cgroup_charge_migrate_fixup(struct page *page,
- struct mm_struct *mm, gfp_t gfp_mask)
-{
- if (mem_cgroup_subsys.disabled)
- return 0;
-
- if (PageCompound(page))
- return 0;
-
- if (page_mapped(page) || (page->mapping && !PageAnon(page)))
- return 0;
-
- if (unlikely(!mm))
- mm = &init_mm;
-
- return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
-}
-
-
-
-
int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
{
@@ -697,7 +669,6 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}
-
void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
{
struct page_cgroup *pc;
@@ -782,13 +753,13 @@ void mem_cgroup_uncharge_cache_page(stru
}
/*
- * Before starting migration, account against new page.
+ * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
+ * page belongs to.
*/
-int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
+int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
{
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
- enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
int ret = 0;
if (mem_cgroup_subsys.disabled)
@@ -799,43 +770,70 @@ int mem_cgroup_prepare_migration(struct
if (PageCgroupUsed(pc)) {
mem = pc->mem_cgroup;
css_get(&mem->css);
- if (PageCgroupCache(pc)) {
- if (page_is_file_cache(page))
- ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
- else
- ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
- }
}
unlock_page_cgroup(pc);
+
if (mem) {
- ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
- ctype, mem);
+ ret = mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem);
css_put(&mem->css);
}
+ *ptr = mem;
return ret;
}
/* remove redundant charge if migration failed*/
-void mem_cgroup_end_migration(struct page *newpage)
+void mem_cgroup_end_migration(struct mem_cgroup *mem,
+ struct page *oldpage, struct page *newpage)
{
+ struct page *target, *unused;
+ struct page_cgroup *pc;
+ enum charge_type ctype;
+
+ if (!mem)
+ return;
+
+ /* at migration success, oldpage->mapping is NULL. */
+ if (oldpage->mapping) {
+ target = oldpage;
+ unused = NULL;
+ } else {
+ target = newpage;
+ unused = oldpage;
+ }
+
+ if (PageAnon(target))
+ ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
+ else if (page_is_file_cache(target))
+ ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
+ else
+ ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
+
+ /* unused page is not on radix-tree now. */
+ if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
+ __mem_cgroup_uncharge_common(unused, ctype);
+
+ pc = lookup_page_cgroup(target);
/*
- * At success, page->mapping is not NULL.
- * special rollback care is necessary when
- * 1. at migration failure. (newpage->mapping is cleared in this case)
- * 2. the newpage was moved but not remapped again because the task
- * exits and the newpage is obsolete. In this case, the new page
- * may be a swapcache. So, we just call mem_cgroup_uncharge_page()
- * always for avoiding mess. The page_cgroup will be removed if
- * unnecessary. File cache pages is still on radix-tree. Don't
- * care it.
+ * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
+ * So, double-counting is effectively avoided.
*/
- if (!newpage->mapping)
- __mem_cgroup_uncharge_common(newpage,
- MEM_CGROUP_CHARGE_TYPE_FORCE);
- else if (PageAnon(newpage))
- mem_cgroup_uncharge_page(newpage);
+ __mem_cgroup_commit_charge(mem, pc, ctype);
+
+ /*
+ * Both of oldpage and newpage are still under lock_page().
+ * Then, we don't have to care about race in radix-tree.
+ * But we have to be careful that this page is unmapped or not.
+ *
+ * There is a case for !page_mapped(). At the start of
+ * migration, oldpage was mapped. But now, it's zapped.
+ * But we know *target* page is not freed/reused under us.
+ * mem_cgroup_uncharge_page() does all necessary checks.
+ */
+ if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
+ mem_cgroup_uncharge_page(target);
}
+
/*
* A call to try to shrink memory usage under specified resource controller.
* This is typically used for page reclaiming for shmem for reducing side
* [RFC][PATCH 5/11] memcg: account move and change force_empty
2008-10-23 8:58 [RFC][PATCH 0/11] memcg updates / clean up, lazy lru ,mem+swap controller KAMEZAWA Hiroyuki
2008-10-23 9:03 ` [RFC][PATCH 4/11] memcg: better page migration handling KAMEZAWA Hiroyuki
@ 2008-10-23 9:05 ` KAMEZAWA Hiroyuki
2008-10-24 4:28 ` Randy Dunlap
2008-10-23 9:06 ` [RFC][PATCH 6/11] memcg: lazy LRU removal KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2008-10-23 9:05 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
This patch provides a function to move the account information of a page
between mem_cgroups, and rewrites force_empty to make use of it.
This moving of page_cgroup is done while
  - the lru_lock of the source/destination mem_cgroup is held, and
  - lock_page_cgroup() is held.
Then, a routine which touches pc->mem_cgroup without lock_page_cgroup() should
confirm that pc->mem_cgroup is still valid. Typical code can be as follows.
(while the page is not under lock_page())
	mem = pc->mem_cgroup;
	mz = page_cgroup_zoneinfo(pc);
	spin_lock_irqsave(&mz->lru_lock, flags);
	if (pc->mem_cgroup == mem)
		... /* some list handling */
	spin_unlock_irqrestore(&mz->lru_lock, flags);
Of course, a better way is
	lock_page_cgroup(pc);
	....
	unlock_page_cgroup(pc);
But then you should check the lock nesting and avoid deadlock.
If you handle page_cgroups on a mem_cgroup's LRU under mz->lru_lock,
you don't have to worry about what pc->mem_cgroup points to.
Moved pages are added to the tail of the LRU, not the head
(__mem_cgroup_add_list() is called with hot == false).
Expected users of this routine are:
  - force_empty (rmdir)
  - moving tasks between cgroups (for moving account information)
  - hierarchy (maybe useful)

force_empty (rmdir) uses this move_account and moves pages to the parent.
This "move" will not cause OOM (I added an "oom" parameter to try_charge()).
If the parent is busy (not enough memory), force_empty calls
try_to_free_mem_cgroup_pages() and reduces usage.

The purpose of this behavior is
  - to fix the "forget all" behavior of force_empty and avoid leaking
    accounted charges.
  - by "moving first, freeing only if necessary", to keep pages in memory
    as much as possible.

Adding a switch to change the behavior of force_empty to
  - free first, move if necessary
  - free all; if there are mlocked/busy pages, return -EBUSY
is under consideration.
This patch removes memory.force_empty file, a brutal debug-only interface.
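A condensed sketch of the new rmdir-time flow for a single page_cgroup
(based on the hunks below; the per-zone/per-LRU walk, locking and retry
logic of mem_cgroup_force_empty() are omitted):

	/* try to move the charge of "pc" from "mem" to mem's parent */
	ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);

	/* if the parent could not accept it (-ENOMEM), reclaim from this
	 * cgroup and retry the move afterwards */
	if (ret == -ENOMEM)
		try_to_free_mem_cgroup_pages(mem, GFP_HIGHUSER_MOVABLE);

	/* rmdir() fails with -EBUSY only when charges still remain after
	 * both moving and reclaim have been tried */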
Changelog: (v6) -> (v8)
- removed memory.force_empty file which was provided only for debug.
Changelog: (v5) -> (v6)
- removed unnecessary check.
- do all under lock_page_cgroup().
- removed res_counter_charge() from move function itself.
(and modifies try_charge() function.)
- add argument to add_list() to specify to add page_cgroup head or tail.
- merged with force_empty patch. (to answer who is user? question)
Changelog: (v4) -> (v5)
- check for lock_page() is removed.
- rewrote description.
Changelog: (v2) -> (v4)
- added lock_page_cgroup().
- split out from the new-force-empty patch.
- added how-to-use text.
- fixed race in __mem_cgroup_uncharge_common().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Documentation/controllers/memory.txt | 12 -
mm/memcontrol.c | 277 ++++++++++++++++++++++++++---------
2 files changed, 214 insertions(+), 75 deletions(-)
Index: mmotm-2.6.27+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27+.orig/mm/memcontrol.c
+++ mmotm-2.6.27+/mm/memcontrol.c
@@ -257,7 +257,7 @@ static void __mem_cgroup_remove_list(str
}
static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
- struct page_cgroup *pc)
+ struct page_cgroup *pc, bool hot)
{
int lru = LRU_BASE;
@@ -271,7 +271,10 @@ static void __mem_cgroup_add_list(struct
}
MEM_CGROUP_ZSTAT(mz, lru) += 1;
- list_add(&pc->lru, &mz->lists[lru]);
+ if (hot)
+ list_add(&pc->lru, &mz->lists[lru]);
+ else
+ list_add_tail(&pc->lru, &mz->lists[lru]);
mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
}
@@ -467,21 +470,12 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}
-
-/**
- * mem_cgroup_try_charge - get charge of PAGE_SIZE.
- * @mm: an mm_struct which is charged against. (when *memcg is NULL)
- * @gfp_mask: gfp_mask for reclaim.
- * @memcg: a pointer to memory cgroup which is charged against.
- *
- * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
- * memory cgroup from @mm is got and stored in *memcg.
- *
- * Retruns 0 if success. -ENOMEM at failure.
+/*
+ * Unlike exported interface, "oom" parameter is added. if oom==true,
+ * oom-killer can be invoked.
*/
-
-int mem_cgroup_try_charge(struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcg)
+static int __mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
{
struct mem_cgroup *mem;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -528,7 +522,8 @@ int mem_cgroup_try_charge(struct mm_stru
continue;
if (!nr_retries--) {
- mem_cgroup_out_of_memory(mem, gfp_mask);
+ if (oom)
+ mem_cgroup_out_of_memory(mem, gfp_mask);
goto nomem;
}
}
@@ -538,6 +533,25 @@ nomem:
return -ENOMEM;
}
+/**
+ * mem_cgroup_try_charge - get charge of PAGE_SIZE.
+ * @mm: an mm_struct which is charged against. (when *memcg is NULL)
+ * @gfp_mask: gfp_mask for reclaim.
+ * @memcg: a pointer to memory cgroup which is charged against.
+ *
+ * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
+ * memory cgroup from @mm is got and stored in *memcg.
+ *
+ * Retruns 0 if success. -ENOMEM at failure.
+ * This call can invoce OOM-Killer.
+ */
+
+int mem_cgroup_try_charge(struct mm_struct *mm,
+ gfp_t mask, struct mem_cgroup **memcg)
+{
+ return __mem_cgroup_try_charge(mm, mask, memcg, true);
+}
+
/*
* commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
* USED state. If already USED, uncharge and return.
@@ -571,11 +585,109 @@ static void __mem_cgroup_commit_charge(s
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_add_list(mz, pc);
+ __mem_cgroup_add_list(mz, pc, true);
spin_unlock_irqrestore(&mz->lru_lock, flags);
unlock_page_cgroup(pc);
}
+/**
+ * mem_cgroup_move_account - move account of the page
+ * @pc: page_cgroup of the page.
+ * @from: mem_cgroup which the page is moved from.
+ * @to: mem_cgroup which the page is moved to. @from != @to.
+ *
+ * The caller must confirm following.
+ * 1. disable irq.
+ * 2. lru_lock of old mem_cgroup(@from) should be held.
+ *
+ * returns 0 at success,
+ * returns -EBUSY when lock is busy or "pc" is unstable.
+ *
+ * This function does "uncharge" from old cgroup but doesn't do "charge" to
+ * new cgroup. It should be done by a caller.
+ */
+
+static int mem_cgroup_move_account(struct page_cgroup *pc,
+ struct mem_cgroup *from, struct mem_cgroup *to)
+{
+ struct mem_cgroup_per_zone *from_mz, *to_mz;
+ int nid, zid;
+ int ret = -EBUSY;
+
+ VM_BUG_ON(!irqs_disabled());
+ VM_BUG_ON(from == to);
+
+ nid = page_cgroup_nid(pc);
+ zid = page_cgroup_zid(pc);
+ from_mz = mem_cgroup_zoneinfo(from, nid, zid);
+ to_mz = mem_cgroup_zoneinfo(to, nid, zid);
+
+
+ if (!trylock_page_cgroup(pc))
+ return ret;
+
+ if (!PageCgroupUsed(pc))
+ goto out;
+
+ if (pc->mem_cgroup != from)
+ goto out;
+
+ if (spin_trylock(&to_mz->lru_lock)) {
+ __mem_cgroup_remove_list(from_mz, pc);
+ css_put(&from->css);
+ res_counter_uncharge(&from->res, PAGE_SIZE);
+ pc->mem_cgroup = to;
+ css_get(&to->css);
+ __mem_cgroup_add_list(to_mz, pc, false);
+ ret = 0;
+ spin_unlock(&to_mz->lru_lock);
+ }
+out:
+ unlock_page_cgroup(pc);
+ return ret;
+}
+
+/*
+ * move charges to its parent.
+ */
+
+static int mem_cgroup_move_parent(struct page_cgroup *pc,
+ struct mem_cgroup *child,
+ gfp_t gfp_mask)
+{
+ struct cgroup *cg = child->css.cgroup;
+ struct cgroup *pcg = cg->parent;
+ struct mem_cgroup *parent;
+ struct mem_cgroup_per_zone *mz;
+ unsigned long flags;
+ int ret;
+
+ /* Is ROOT ? */
+ if (!pcg)
+ return -EINVAL;
+
+ parent = mem_cgroup_from_cont(pcg);
+
+ ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
+ if (ret)
+ return ret;
+
+ mz = mem_cgroup_zoneinfo(child,
+ page_cgroup_nid(pc), page_cgroup_zid(pc));
+
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ ret = mem_cgroup_move_account(pc, child, parent);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+
+ /* drop extra refcnt */
+ css_put(&parent->css);
+ /* uncharge if move fails */
+ if (ret)
+ res_counter_uncharge(&parent->res, PAGE_SIZE);
+
+ return ret;
+}
+
/*
* Charge the memory controller for page usage.
* Return
@@ -597,7 +709,7 @@ static int mem_cgroup_charge_common(stru
prefetchw(pc);
mem = memcg;
- ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
+ ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
if (ret)
return ret;
@@ -898,46 +1010,52 @@ int mem_cgroup_resize_limit(struct mem_c
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
*/
-#define FORCE_UNCHARGE_BATCH (128)
-static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
+static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
enum lru_list lru)
{
- struct page_cgroup *pc;
- struct page *page;
- int count = FORCE_UNCHARGE_BATCH;
+ struct page_cgroup *pc, *busy;
unsigned long flags;
+ unsigned long loop;
struct list_head *list;
+ int ret = 0;
list = &mz->lists[lru];
- spin_lock_irqsave(&mz->lru_lock, flags);
- while (!list_empty(list)) {
- pc = list_entry(list->prev, struct page_cgroup, lru);
- page = pc->page;
- if (!PageCgroupUsed(pc))
+ loop = MEM_CGROUP_ZSTAT(mz, lru);
+ /* give some margin against EBUSY etc...*/
+ loop += 256;
+ busy = NULL;
+ while (loop--) {
+ ret = 0;
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ if (list_empty(list)) {
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
break;
- get_page(page);
+ }
+ pc = list_entry(list->prev, struct page_cgroup, lru);
+ if (busy == pc) {
+ list_move(&pc->lru, list);
+ busy = 0;
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ continue;
+ }
spin_unlock_irqrestore(&mz->lru_lock, flags);
- /*
- * Check if this page is on LRU. !LRU page can be found
- * if it's under page migration.
- */
- if (PageLRU(page)) {
- __mem_cgroup_uncharge_common(page,
- MEM_CGROUP_CHARGE_TYPE_FORCE);
- put_page(page);
- if (--count <= 0) {
- count = FORCE_UNCHARGE_BATCH;
- cond_resched();
- }
- } else {
- spin_lock_irqsave(&mz->lru_lock, flags);
+
+ ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
+ if (ret == -ENOMEM)
break;
- }
- spin_lock_irqsave(&mz->lru_lock, flags);
+
+ if (ret == -EBUSY || ret == -EINVAL) {
+ /* found lock contention or "pc" is obsolete. */
+ busy = pc;
+ cond_resched();
+ } else
+ busy = NULL;
}
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ if (!ret && !list_empty(list))
+ return -EBUSY;
+ return ret;
}
/*
@@ -946,34 +1064,68 @@ static void mem_cgroup_force_empty_list(
*/
static int mem_cgroup_force_empty(struct mem_cgroup *mem)
{
- int ret = -EBUSY;
- int node, zid;
+ int ret;
+ int node, zid, shrink;
+ int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
css_get(&mem->css);
- /*
- * page reclaim code (kswapd etc..) will move pages between
- * active_list <-> inactive_list while we don't take a lock.
- * So, we have to do loop here until all lists are empty.
- */
+
+ shrink = 0;
+move_account:
while (mem->res.usage > 0) {
+ ret = -EBUSY;
if (atomic_read(&mem->css.cgroup->count) > 0)
goto out;
+
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
- for_each_node_state(node, N_POSSIBLE)
- for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ ret = 0;
+ for_each_node_state(node, N_POSSIBLE) {
+ for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
struct mem_cgroup_per_zone *mz;
enum lru_list l;
mz = mem_cgroup_zoneinfo(mem, node, zid);
- for_each_lru(l)
- mem_cgroup_force_empty_list(mem, mz, l);
+ for_each_lru(l) {
+ ret = mem_cgroup_force_empty_list(mem,
+ mz, l);
+ if (ret)
+ break;
+ }
}
+ if (ret)
+ break;
+ }
+ /* it seems parent cgroup doesn't have enough mem */
+ if (ret == -ENOMEM)
+ goto try_to_free;
cond_resched();
}
ret = 0;
out:
css_put(&mem->css);
return ret;
+
+try_to_free:
+ /* returns EBUSY if we come here twice. */
+ if (shrink) {
+ ret = -EBUSY;
+ goto out;
+ }
+ /* try to free all pages in this cgroup */
+ shrink = 1;
+ while (nr_retries && mem->res.usage > 0) {
+ int progress;
+ progress = try_to_free_mem_cgroup_pages(mem,
+ GFP_HIGHUSER_MOVABLE);
+ if (!progress)
+ nr_retries--;
+
+ }
+ /* try move_account...there may be some *locked* pages. */
+ if (mem->res.usage)
+ goto move_account;
+ ret = 0;
+ goto out;
}
static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
@@ -1022,11 +1174,6 @@ static int mem_cgroup_reset(struct cgrou
return 0;
}
-static int mem_force_empty_write(struct cgroup *cont, unsigned int event)
-{
- return mem_cgroup_force_empty(mem_cgroup_from_cont(cont));
-}
-
static const struct mem_cgroup_stat_desc {
const char *msg;
u64 unit;
@@ -1103,10 +1250,6 @@ static struct cftype mem_cgroup_files[]
.read_u64 = mem_cgroup_read,
},
{
- .name = "force_empty",
- .trigger = mem_force_empty_write,
- },
- {
.name = "stat",
.read_map = mem_control_stat_show,
},
Index: mmotm-2.6.27+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.27+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.27+/Documentation/controllers/memory.txt
@@ -207,12 +207,6 @@ exceeded.
The memory.stat file gives accounting information. Now, the number of
caches, RSS and Active pages/Inactive pages are shown.
-The memory.force_empty gives an interface to drop *all* charges by force.
-
-# echo 1 > memory.force_empty
-
-will drop all charges in cgroup. Currently, this is maintained for test.
-
4. Testing
Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
@@ -242,8 +236,10 @@ reclaimed.
A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
-tasks have migrated away from it. Such charges are automatically dropped at
-rmdir() if there are no tasks.
+tasks have migrated away from it.
+Such charges are moved to its parent as much as possible and freed if parent
+is full.
+If both of them are busy, rmdir() returns -EBUSY.
5. TODO
* [RFC][PATCH 6/11] memcg: lazy LRU removal
2008-10-23 8:58 [RFC][PATCH 0/11] memcg updates / clean up, lazy lru ,mem+swap controller KAMEZAWA Hiroyuki
2008-10-23 9:05 ` [RFC][PATCH 5/11] memcg: account move and change force_empty KAMEZAWA Hiroyuki
@ 2008-10-23 9:06 ` KAMEZAWA Hiroyuki
2008-10-23 9:08 ` [RFC][PATCH 7/11] memcg: lazy lru add KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2008-10-23 9:06 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
Free page_cgroup from its LRU in a batched manner.
When uncharge() is called, the page_cgroup is pushed onto a per-cpu vector
and removed from the LRU later. This routine resembles the global LRU's
pagevec.
This patch is one half of the whole change and forms a set with the
following lazy LRU add patch.

After this, a pc with PageCgroupLRU(pc) == true is on the LRU.
This LRU bit is guarded by lru_lock().

  PageCgroupUsed(pc) && PageCgroupLRU(pc) means "pc" is used and on the LRU.
  This check makes sense only when both locks, lock_page_cgroup() and
  lru_lock(), are acquired.

  PageCgroupUsed(pc) && !PageCgroupLRU(pc) means "pc" is used but not on
  the LRU.

  !PageCgroupUsed(pc) && PageCgroupLRU(pc) means "pc" is unused but still
  on the LRU. LRU walk routines should avoid touching such a pc.
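For illustration, the uncharge side now defers the LRU removal roughly like
this (condensed from the release_page_cgroup() hunk below; the drain paths
used at cpu hotplug and force_empty are omitted):

	/* called after PCG_USED is cleared and the res_counter is uncharged:
	 * queue the pc on a small per-cpu vector ... */
	mpv = &get_cpu_var(memcg_free_vec);
	mpv->vec[mpv->nr++] = pc;
	/* ... and only when the vector fills up, take lru_lock once and
	 * unlink all queued, now-unused page_cgroups from their LRUs */
	if (mpv->nr >= mpv->limit)
		__release_page_cgroup(mpv);
	put_cpu_var(memcg_free_vec);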
Changelog (v6) -> (v7)
- added check for race to check pc->mem_cgroup without lock.
Changelog (v5) -> (v6)
- Fixing race and added PCG_LRU bit
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/page_cgroup.h | 5 +
mm/memcontrol.c | 210 ++++++++++++++++++++++++++++++++++++++++----
2 files changed, 199 insertions(+), 16 deletions(-)
Index: mmotm-2.6.27+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27+.orig/mm/memcontrol.c
+++ mmotm-2.6.27+/mm/memcontrol.c
@@ -34,6 +34,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include <linux/cpu.h>
#include <asm/uaccess.h>
@@ -344,7 +345,7 @@ void mem_cgroup_move_lists(struct page *
pc = lookup_page_cgroup(page);
if (!trylock_page_cgroup(pc))
return;
- if (pc && PageCgroupUsed(pc)) {
+ if (pc && PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_move_lists(pc, lru);
@@ -470,6 +471,129 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}
+
+#define MEMCG_PCPVEC_SIZE (14) /* size of pagevec */
+struct memcg_percpu_vec {
+ int nr;
+ int limit;
+ struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
+};
+static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+
+static void
+__release_page_cgroup(struct memcg_percpu_vec *mpv)
+{
+ unsigned long flags;
+ struct mem_cgroup_per_zone *mz, *prev_mz;
+ struct page_cgroup *pc;
+ struct mem_cgroup *tmp;
+ int i, nr;
+
+ local_irq_save(flags);
+ nr = mpv->nr;
+ mpv->nr = 0;
+ prev_mz = NULL;
+ for (i = nr - 1; i >= 0; i--) {
+ pc = mpv->vec[i];
+ tmp = pc->mem_cgroup;
+ mz = mem_cgroup_zoneinfo(tmp,
+ page_cgroup_nid(pc), page_cgroup_zid(pc));
+ if (prev_mz != mz) {
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ prev_mz = mz;
+ spin_lock(&mz->lru_lock);
+ }
+ /*
+ * this "pc" may be charge()->uncharge() while we are waiting
+ * for this. But charge() path check LRU bit and remove this
+ * from LRU if necessary. So, tmp == pc->mem_cgroup can be
+ * considered to be always true...but logically, we should check
+ * it.
+ */
+ if (!PageCgroupUsed(pc)
+ && PageCgroupLRU(pc)
+ && tmp == pc->mem_cgroup) {
+ ClearPageCgroupLRU(pc);
+ __mem_cgroup_remove_list(mz, pc);
+ css_put(&pc->mem_cgroup->css);
+ }
+ }
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ local_irq_restore(flags);
+
+}
+
+static void
+release_page_cgroup(struct page_cgroup *pc)
+{
+ struct memcg_percpu_vec *mpv;
+
+ mpv = &get_cpu_var(memcg_free_vec);
+ mpv->vec[mpv->nr++] = pc;
+ if (mpv->nr >= mpv->limit)
+ __release_page_cgroup(mpv);
+ put_cpu_var(memcg_free_vec);
+}
+
+static void page_cgroup_start_cache_cpu(int cpu)
+{
+ struct memcg_percpu_vec *mpv;
+ mpv = &per_cpu(memcg_free_vec, cpu);
+ mpv->limit = MEMCG_PCPVEC_SIZE;
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void page_cgroup_stop_cache_cpu(int cpu)
+{
+ struct memcg_percpu_vec *mpv;
+ mpv = &per_cpu(memcg_free_vec, cpu);
+ mpv->limit = 0;
+}
+#endif
+
+
+/*
+ * Used when freeing memory resource controller to remove all
+ * page_cgroup (in obsolete list).
+ */
+static DEFINE_MUTEX(memcg_force_drain_mutex);
+
+static void drain_page_cgroup_local(struct work_struct *work)
+{
+ struct memcg_percpu_vec *mpv;
+ mpv = &get_cpu_var(memcg_free_vec);
+ __release_page_cgroup(mpv);
+ put_cpu_var(mpv);
+}
+
+static void drain_page_cgroup_cpu(int cpu)
+{
+ int local_cpu;
+ struct work_struct work;
+
+ local_cpu = get_cpu();
+ if (local_cpu == cpu) {
+ drain_page_cgroup_local(NULL);
+ put_cpu();
+ return;
+ }
+ put_cpu();
+
+ INIT_WORK(&work, drain_page_cgroup_local);
+ schedule_work_on(cpu, &work);
+ flush_work(&work);
+}
+
+static void drain_page_cgroup_all(void)
+{
+ mutex_lock(&memcg_force_drain_mutex);
+ schedule_on_each_cpu(drain_page_cgroup_local);
+ mutex_unlock(&memcg_force_drain_mutex);
+}
+
+
/*
* Unlike exported interface, "oom" parameter is added. if oom==true,
* oom-killer can be invoked.
@@ -569,25 +693,46 @@ static void __mem_cgroup_commit_charge(s
return;
lock_page_cgroup(pc);
+ /*
+ * USED bit is set after pc->mem_cgroup has valid value.
+ */
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
return;
}
+ /*
+ * This page_cgroup is not used but may be on LRU.
+ */
+ if (unlikely(PageCgroupLRU(pc))) {
+ /*
+ * pc->mem_cgroup has old information. force_empty() guarantee
+ * that we never see stale mem_cgroup here.
+ */
+ mz = page_cgroup_zoneinfo(pc);
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ if (PageCgroupLRU(pc)) {
+ ClearPageCgroupLRU(pc);
+ __mem_cgroup_remove_list(mz, pc);
+ css_put(&pc->mem_cgroup->css);
+ }
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ }
+ /* Here, PCG_LRU bit is cleared */
pc->mem_cgroup = mem;
/*
- * If a page is accounted as a page cache, insert to inactive list.
- * If anon, insert to active list.
+ * below pcg_default_flags includes PCG_LOCK bit.
*/
pc->flags = pcg_default_flags[ctype];
+ unlock_page_cgroup(pc);
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_add_list(mz, pc, true);
+ SetPageCgroupLRU(pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
- unlock_page_cgroup(pc);
}
/**
@@ -626,7 +771,7 @@ static int mem_cgroup_move_account(struc
if (!trylock_page_cgroup(pc))
return ret;
- if (!PageCgroupUsed(pc))
+ if (!PageCgroupUsed(pc) || !PageCgroupLRU(pc))
goto out;
if (pc->mem_cgroup != from)
@@ -812,8 +957,6 @@ __mem_cgroup_uncharge_common(struct page
{
struct page_cgroup *pc;
struct mem_cgroup *mem;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
if (mem_cgroup_subsys.disabled)
return;
@@ -834,16 +977,13 @@ __mem_cgroup_uncharge_common(struct page
}
ClearPageCgroupUsed(pc);
mem = pc->mem_cgroup;
-
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- unlock_page_cgroup(pc);
-
+ /*
+ * We must uncharge here because "reuse" can occur just after we
+ * unlock this.
+ */
res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
-
+ unlock_page_cgroup(pc);
+ release_page_cgroup(pc);
return;
}
@@ -1079,6 +1219,7 @@ move_account:
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
+ drain_page_cgroup_all();
ret = 0;
for_each_node_state(node, N_POSSIBLE) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
@@ -1102,6 +1243,7 @@ move_account:
}
ret = 0;
out:
+ drain_page_cgroup_all();
css_put(&mem->css);
return ret;
@@ -1314,6 +1456,38 @@ static void mem_cgroup_free(struct mem_c
vfree(mem);
}
+static void mem_cgroup_init_pcp(int cpu)
+{
+ page_cgroup_start_cache_cpu(cpu);
+}
+
+static int cpu_memcgroup_callback(struct notifier_block *nb,
+ unsigned long action, void *hcpu)
+{
+ int cpu = (long)hcpu;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ mem_cgroup_init_pcp(cpu);
+ break;
+#ifdef CONFIG_HOTPLUG_CPU
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ page_cgroup_stop_cache_cpu(cpu);
+ drain_page_cgroup_cpu(cpu);
+ break;
+#endif
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __refdata memcgroup_nb =
+{
+ .notifier_call = cpu_memcgroup_callback,
+};
static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -1323,6 +1497,10 @@ mem_cgroup_create(struct cgroup_subsys *
if (unlikely((cont->parent) == NULL)) {
mem = &init_mem_cgroup;
+ cpu_memcgroup_callback(&memcgroup_nb,
+ (unsigned long)CPU_UP_PREPARE,
+ (void *)(long)smp_processor_id());
+ register_hotcpu_notifier(&memcgroup_nb);
} else {
mem = mem_cgroup_alloc();
if (!mem)
Index: mmotm-2.6.27+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.27+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.27+/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
PCG_LOCK, /* page cgroup is locked */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
+ PCG_LRU, /* on LRU */
/* flags for LRU placement */
PCG_ACTIVE, /* page is active in this cgroup */
PCG_FILE, /* page is file system backed */
@@ -50,6 +51,10 @@ TESTPCGFLAG(Cache, CACHE)
TESTPCGFLAG(Used, USED)
CLEARPCGFLAG(Used, USED)
+SETPCGFLAG(LRU, LRU)
+TESTPCGFLAG(LRU, LRU)
+CLEARPCGFLAG(LRU, LRU)
+
/* LRU management flags (from global-lru definition) */
TESTPCGFLAG(File, FILE)
SETPCGFLAG(File, FILE)
--
* [RFC][PATCH 7/11] memcg: lazy lru add
2008-10-23 8:58 [RFC][PATCH 0/11] memcg updates / clean up, lazy lru ,mem+swap controller KAMEZAWA Hiroyuki
` (5 preceding siblings ...)
2008-10-23 9:06 ` [RFC][PATCH 6/11] memcg: lary LRU removal KAMEZAWA Hiroyuki
@ 2008-10-23 9:08 ` KAMEZAWA Hiroyuki
2008-10-23 9:10 ` [RFC][PATCH 8/11] memcg: shmem account helper KAMEZAWA Hiroyuki
` (3 subsequent siblings)
10 siblings, 0 replies; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-23 9:08 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
Delay add_to_lru() and do it in a batched manner, as pagevec does.
Two flags, PCG_USED and PCG_LRU, are used for this (a minimal sketch of the
batching pattern is included below).
Because __set_page_cgroup_lru() itself doesn't take lock_page_cgroup(),
we need a sanity check under lru_lock.
This delaying makes css_put()/css_get() more complicated. To keep it clear,
* css_get() is called from mem_cgroup_add_list().
* css_put() is called from mem_cgroup_remove_list().
* a css_get()/css_put() pair is used across the try_charge()->commit/cancel
sequence.
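
For reference, the batching follows the same idea as pagevec: queue
page_cgroups in a small per-CPU vector and take lru_lock only once per flush.
Below is a minimal, single-threaded userspace sketch of that pattern; all
names are hypothetical and it deliberately ignores per-CPU handling and
locking, so it illustrates the shape of the code rather than the patch itself.

  /* toy model of the deferred-add vector (hypothetical names) */
  #include <stdio.h>

  #define VEC_SIZE 14                    /* plays the role of MEMCG_PCPVEC_SIZE */

  struct item { int id; };

  struct batch_vec {
          struct item *vec[VEC_SIZE];
          int nr;
          int limit;
  };

  /* flush: in the kernel this is where lru_lock is taken once per run */
  static void flush_batch(struct batch_vec *bv)
  {
          for (int i = 0; i < bv->nr; i++)
                  printf("add item %d to LRU\n", bv->vec[i]->id);
          bv->nr = 0;
  }

  /* queue one item; flush only when the vector is full */
  static void queue_item(struct batch_vec *bv, struct item *it)
  {
          bv->vec[bv->nr++] = it;
          if (bv->nr >= bv->limit)
                  flush_batch(bv);
  }

  int main(void)
  {
          struct batch_vec bv = { .nr = 0, .limit = VEC_SIZE };
          struct item items[20];

          for (int i = 0; i < 20; i++) {
                  items[i].id = i;
                  queue_item(&bv, &items[i]);
          }
          flush_batch(&bv);       /* drain leftovers, like the drain_* helpers */
          return 0;
  }

The real code additionally re-checks PCG_USED/PCG_LRU under lru_lock at flush
time, which is what the sanity check mentioned above refers to.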
Changelog: v6 -> v7
- No changes.
Changelog: v5 -> v6.
- css_get()/put comes back again...it's called via add_list(), remove_list().
- patch for PCG_LRU bit part is moved to release_page_cgroup_lru() patch.
- Avoid TestSet and just use lock_page_cgroup() etc.
- fixed race condition we saw in v5. (smp_wmb() and USED bit magic help us)
Changelog: v3 -> v5.
- removed css_get/put per page_cgroup struct.
Now, the *new* force_empty checks whether any page_cgroup remains on the
memcg, so we don't need to be afraid of leaks.
Changelog: v2 -> v3
- added TRANSIT flag and removed lock from core logic.
Changelog: v1 -> v2:
- renamed function name from use_page_cgroup to set_page_cgroup_lru().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 73 insertions(+), 11 deletions(-)
Index: mmotm-2.6.27+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27+.orig/mm/memcontrol.c
+++ mmotm-2.6.27+/mm/memcontrol.c
@@ -255,6 +255,7 @@ static void __mem_cgroup_remove_list(str
mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
list_del(&pc->lru);
+ css_put(&pc->mem_cgroup->css);
}
static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
@@ -278,6 +279,7 @@ static void __mem_cgroup_add_list(struct
list_add_tail(&pc->lru, &mz->lists[lru]);
mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
+ css_get(&pc->mem_cgroup->css);
}
static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
@@ -479,6 +481,7 @@ struct memcg_percpu_vec {
struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
};
static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_add_vec);
static void
__release_page_cgroup(struct memcg_percpu_vec *mpv)
@@ -516,7 +519,6 @@ __release_page_cgroup(struct memcg_percp
&& tmp == pc->mem_cgroup) {
ClearPageCgroupLRU(pc);
__mem_cgroup_remove_list(mz, pc);
- css_put(&pc->mem_cgroup->css);
}
}
if (prev_mz)
@@ -526,10 +528,53 @@ __release_page_cgroup(struct memcg_percp
}
static void
+__set_page_cgroup_lru(struct memcg_percpu_vec *mpv)
+{
+ unsigned long flags;
+ struct mem_cgroup *mem;
+ struct mem_cgroup_per_zone *mz, *prev_mz;
+ struct page_cgroup *pc;
+ int i, nr;
+
+ local_irq_save(flags);
+ nr = mpv->nr;
+ mpv->nr = 0;
+ prev_mz = NULL;
+
+ for (i = nr - 1; i >= 0; i--) {
+ pc = mpv->vec[i];
+ mem = pc->mem_cgroup;
+ mz = page_cgroup_zoneinfo(pc);
+ if (prev_mz != mz) {
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ prev_mz = mz;
+ spin_lock(&mz->lru_lock);
+ }
+ if (PageCgroupUsed(pc) && !PageCgroupLRU(pc)) {
+ /*
+			 * while we wait for lru_lock, uncharge()->charge()
+			 * can occur. Check here whether pc->mem_cgroup is
+			 * still what we expected.
+ */
+ smp_rmb();
+ if (likely(mem == pc->mem_cgroup)) {
+ SetPageCgroupLRU(pc);
+ __mem_cgroup_add_list(mz, pc, true);
+ }
+ }
+ }
+
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ local_irq_restore(flags);
+
+}
+
+static void
release_page_cgroup(struct page_cgroup *pc)
{
struct memcg_percpu_vec *mpv;
-
mpv = &get_cpu_var(memcg_free_vec);
mpv->vec[mpv->nr++] = pc;
if (mpv->nr >= mpv->limit)
@@ -537,11 +582,25 @@ release_page_cgroup(struct page_cgroup *
put_cpu_var(memcg_free_vec);
}
+static void
+set_page_cgroup_lru(struct page_cgroup *pc)
+{
+ struct memcg_percpu_vec *mpv;
+
+ mpv = &get_cpu_var(memcg_add_vec);
+ mpv->vec[mpv->nr++] = pc;
+ if (mpv->nr >= mpv->limit)
+ __set_page_cgroup_lru(mpv);
+ put_cpu_var(memcg_add_vec);
+}
+
static void page_cgroup_start_cache_cpu(int cpu)
{
struct memcg_percpu_vec *mpv;
mpv = &per_cpu(memcg_free_vec, cpu);
mpv->limit = MEMCG_PCPVEC_SIZE;
+ mpv = &per_cpu(memcg_add_vec, cpu);
+ mpv->limit = MEMCG_PCPVEC_SIZE;
}
#ifdef CONFIG_HOTPLUG_CPU
@@ -550,6 +609,8 @@ static void page_cgroup_stop_cache_cpu(i
struct memcg_percpu_vec *mpv;
mpv = &per_cpu(memcg_free_vec, cpu);
mpv->limit = 0;
+ mpv = &per_cpu(memcg_add_vec, cpu);
+ mpv->limit = 0;
}
#endif
@@ -563,6 +624,9 @@ static DEFINE_MUTEX(memcg_force_drain_mu
static void drain_page_cgroup_local(struct work_struct *work)
{
struct memcg_percpu_vec *mpv;
+ mpv = &get_cpu_var(memcg_add_vec);
+ __set_page_cgroup_lru(mpv);
+ put_cpu_var(mpv);
mpv = &get_cpu_var(memcg_free_vec);
__release_page_cgroup(mpv);
put_cpu_var(mpv);
@@ -715,24 +779,24 @@ static void __mem_cgroup_commit_charge(s
if (PageCgroupLRU(pc)) {
ClearPageCgroupLRU(pc);
__mem_cgroup_remove_list(mz, pc);
- css_put(&pc->mem_cgroup->css);
}
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
/* Here, PCG_LRU bit is cleared */
pc->mem_cgroup = mem;
/*
+ * We have to set pc->mem_cgroup before set USED bit for avoiding
+ * race with (delayed) __set_page_cgroup_lru() in other cpu.
+ */
+ smp_wmb();
+ /*
* below pcg_default_flags includes PCG_LOCK bit.
*/
pc->flags = pcg_default_flags[ctype];
unlock_page_cgroup(pc);
- mz = page_cgroup_zoneinfo(pc);
-
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_add_list(mz, pc, true);
- SetPageCgroupLRU(pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
+ set_page_cgroup_lru(pc);
+ css_put(&mem->css);
}
/**
@@ -779,10 +843,8 @@ static int mem_cgroup_move_account(struc
if (spin_trylock(&to_mz->lru_lock)) {
__mem_cgroup_remove_list(from_mz, pc);
- css_put(&from->css);
res_counter_uncharge(&from->res, PAGE_SIZE);
pc->mem_cgroup = to;
- css_get(&to->css);
__mem_cgroup_add_list(to_mz, pc, false);
ret = 0;
spin_unlock(&to_mz->lru_lock);
--
* [RFC][PATCH 8/11] memcg: shmem account helper
2008-10-23 8:58 [RFC][PATCH 0/11] memcg updates / clean up, lazy lru ,mem+swap controller KAMEZAWA Hiroyuki
` (6 preceding siblings ...)
2008-10-23 9:08 ` [RFC][PATCH 7/11] memcg: lazy lru add KAMEZAWA Hiroyuki
@ 2008-10-23 9:10 ` KAMEZAWA Hiroyuki
2008-10-23 9:12 ` [RFC][PATCH 9/11] memcg : mem+swap controlelr kconfig KAMEZAWA Hiroyuki
` (2 subsequent siblings)
10 siblings, 0 replies; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-23 9:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
In the mem+swap controller, we also have to catch shmem's swap-in.
This patch adds a hook to shmem's swap-in path.
As a good side effect, a charge that used to be done under a spinlock
(info->lock) is moved outside of that lock.
(Charging under a spinlock is a bug...)
This also fixes the gfp mask of shmem's charge: now that we don't
allocate page_cgroup dynamically, GFP_KERNEL is not necessary.
mem_cgroup_cache_charge_swapin() itself will be modified by a following patch.
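
The info->lock point is the usual "charge (which may sleep for reclaim)
before taking the spinlock" rule. A minimal userspace sketch of that shape,
with hypothetical names and a pthread mutex standing in for info->lock:

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  static pthread_mutex_t info_lock = PTHREAD_MUTEX_INITIALIZER;
  static long charged;                    /* stands in for the memcg counter */
  static const long limit = 4;

  /* stands in for a charge that may sleep for reclaim; never call it
   * while holding info_lock */
  static bool charge_one(void)
  {
          if (charged >= limit)
                  return false;
          charged++;
          return true;
  }

  static int add_page(int id)
  {
          if (!charge_one())              /* precharge while we can wait */
                  return -1;
          pthread_mutex_lock(&info_lock); /* short, non-sleeping section */
          printf("inserted page %d under the lock\n", id);
          pthread_mutex_unlock(&info_lock);
          return 0;
  }

  int main(void)
  {
          for (int i = 0; i < 6; i++)
                  if (add_page(i))
                          printf("page %d: charge failed before the lock\n", i);
          return 0;
  }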
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/memcontrol.h | 3 +++
mm/memcontrol.c | 12 ++++++++++++
mm/shmem.c | 17 ++++++++++++++---
3 files changed, 29 insertions(+), 3 deletions(-)
Index: mmotm-2.6.27+/mm/shmem.c
===================================================================
--- mmotm-2.6.27+.orig/mm/shmem.c
+++ mmotm-2.6.27+/mm/shmem.c
@@ -920,8 +920,9 @@ found:
error = 1;
if (!inode)
goto out;
- /* Precharge page using GFP_KERNEL while we can wait */
- error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
+ /* Precharge page using GFP_HIGHUSER_PAGECACHE while we can wait */
+ error = mem_cgroup_cache_charge_swapin(page, current->mm,
+ GFP_HIGHUSER_PAGECACHE);
if (error)
goto out;
error = radix_tree_preload(GFP_KERNEL);
@@ -1259,6 +1260,16 @@ repeat:
}
wait_on_page_locked(swappage);
page_cache_release(swappage);
+ /*
+			 * We want to charge against this page while not under
+			 * info->lock; do the precharge here.
+ */
+ if (mem_cgroup_cache_charge_swapin(swappage,
+ current->mm, gfp)) {
+ error = -ENOMEM;
+ goto failed;
+ }
+
goto repeat;
}
@@ -1371,7 +1382,7 @@ repeat:
/* Precharge page while we can wait, compensate after */
error = mem_cgroup_cache_charge(filepage, current->mm,
- gfp & ~__GFP_HIGHMEM);
+ gfp);
if (error) {
page_cache_release(filepage);
shmem_unacct_blocks(info->flags, 1);
Index: mmotm-2.6.27+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27+.orig/mm/memcontrol.c
+++ mmotm-2.6.27+/mm/memcontrol.c
@@ -988,6 +988,18 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}
+int mem_cgroup_cache_charge_swapin(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask)
+{
+ if (mem_cgroup_subsys.disabled)
+ return 0;
+ if (unlikely(!mm))
+ mm = &init_mm;
+ return mem_cgroup_charge_common(page, mm, gfp_mask,
+ MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
+
+}
+
void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
{
struct page_cgroup *pc;
Index: mmotm-2.6.27+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27+/include/linux/memcontrol.h
@@ -38,6 +38,9 @@ extern void mem_cgroup_cancel_charge_swa
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
+extern int mem_cgroup_cache_charge_swapin(struct page *page,
+ struct mm_struct *mm, gfp_t gfp_mask);
+
extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
--
* [RFC][PATCH 9/11] memcg : mem+swap controlelr kconfig
2008-10-23 8:58 [RFC][PATCH 0/11] memcg updates / clean up, lazy lru ,mem+swap controller KAMEZAWA Hiroyuki
` (7 preceding siblings ...)
2008-10-23 9:10 ` [RFC][PATCH 8/11] memcg: shmem account helper KAMEZAWA Hiroyuki
@ 2008-10-23 9:12 ` KAMEZAWA Hiroyuki
2008-10-24 4:32 ` Randy Dunlap
2008-10-27 6:39 ` Daisuke Nishimura
2008-10-23 9:13 ` [RFC][PATCH 10/11] memcg: swap cgroup KAMEZAWA Hiroyuki
2008-10-23 9:16 ` [RFC][PATCH 11/11] memcg: mem+swap controler core KAMEZAWA Hiroyuki
10 siblings, 2 replies; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-23 9:12 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
Config and control variable for mem+swap controller.
This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
(memory resource controller swap extension.)
For accounting swap, it's obvious that we have to use additional memory
to remember "who uses swap". This adds more overhead.
So, it's better to offer a choice to users. This patch adds 2 ways to
enable or disable the swap extension:
- CONFIG
- boot option (see the example below)
This version uses a policy of "enabled by default if configured".
Please tell me if you dislike this. See the following patches for details...
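
As a usage illustration only (boot loader syntax and image paths vary; the
ones below are made up), the accounting can be switched off at boot even when
the option is compiled in, by appending the parameter to the kernel command
line, e.g. in a GRUB entry:

  kernel /boot/vmlinuz-2.6.27-mm root=/dev/sda1 ro noswapaccount

After boot, the parameter should be visible in /proc/cmdline.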
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Documentation/kernel-parameters.txt | 3 +++
include/linux/memcontrol.h | 3 +++
init/Kconfig | 16 ++++++++++++++++
mm/memcontrol.c | 17 +++++++++++++++++
4 files changed, 39 insertions(+)
Index: mmotm-2.6.27+/init/Kconfig
===================================================================
--- mmotm-2.6.27+.orig/init/Kconfig
+++ mmotm-2.6.27+/init/Kconfig
@@ -613,6 +613,22 @@ config KALLSYMS_EXTRA_PASS
reported. KALLSYMS_EXTRA_PASS is only a temporary workaround while
you wait for kallsyms to be fixed.
+config CGROUP_MEM_RES_CTLR_SWAP
+ bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
+ depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
+ help
+	  Add swap management feature to the memory resource controller. When
+	  you enable this, you can limit mem+swap usage per cgroup. In other
+	  words, when you disable this, the memory resource controller takes no
+	  care of swap usage...a process can exhaust all of the swap. This
+	  extension is useful when you want to avoid exhaustion of swap, but it
+	  adds more overhead and consumes memory for remembering information.
+	  Especially if you use a 32bit system or a small-memory system, please
+	  be careful about enabling this. When the memory resource controller is
+	  disabled by boot option, this will be automatically disabled and
+	  there will be no overhead from it. Even when you set this config=y,
+	  if the boot option "noswapaccount" is set, swap will not be accounted.
+
config HOTPLUG
bool "Support for hot-pluggable devices" if EMBEDDED
Index: mmotm-2.6.27+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27+.orig/mm/memcontrol.c
+++ mmotm-2.6.27+/mm/memcontrol.c
@@ -41,6 +41,13 @@
struct cgroup_subsys mem_cgroup_subsys __read_mostly;
#define MEM_CGROUP_RECLAIM_RETRIES 5
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+int do_swap_account __read_mostly = 1;
+#else
+#define do_swap_account (0)
+#endif
+
+
/*
* Statistics for memory cgroup.
*/
@@ -1658,3 +1665,13 @@ struct cgroup_subsys mem_cgroup_subsys =
.attach = mem_cgroup_move_task,
.early_init = 0,
};
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+static int __init disable_swap_account(char *s)
+{
+ do_swap_account = 0;
+ return 1;
+}
+__setup("noswapaccount", disable_swap_account);
+#endif
Index: mmotm-2.6.27+/Documentation/kernel-parameters.txt
===================================================================
--- mmotm-2.6.27+.orig/Documentation/kernel-parameters.txt
+++ mmotm-2.6.27+/Documentation/kernel-parameters.txt
@@ -1540,6 +1540,9 @@ and is between 256 and 4096 characters.
nosoftlockup [KNL] Disable the soft-lockup detector.
+ noswapaccount [KNL] Disable accounting of swap in memory resource
+ controller. (See Documentation/controllers/memory.txt)
+
nosync [HW,M68K] Disables sync negotiation for all devices.
notsc [BUGS=X86-32] Disable Time Stamp Counter
Index: mmotm-2.6.27+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27+/include/linux/memcontrol.h
@@ -80,6 +80,9 @@ extern void mem_cgroup_record_reclaim_pr
extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
int priority, enum lru_list lru);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern int do_swap_account;
+#endif
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
--
* [RFC][PATCH 10/11] memcg: swap cgroup
2008-10-23 8:58 [RFC][PATCH 0/11] memcg updates / clean up, lazy lru ,mem+swap controller KAMEZAWA Hiroyuki
` (8 preceding siblings ...)
2008-10-23 9:12 ` [RFC][PATCH 9/11] memcg : mem+swap controlelr kconfig KAMEZAWA Hiroyuki
@ 2008-10-23 9:13 ` KAMEZAWA Hiroyuki
2008-10-27 7:02 ` Daisuke Nishimura
2008-10-23 9:16 ` [RFC][PATCH 11/11] memcg: mem+swap controler core KAMEZAWA Hiroyuki
10 siblings, 1 reply; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-23 9:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
For accounting swap, we need a record per swap entry, at least.
This patch adds the following functions.
- swap_cgroup_swapon() .... called from swapon
- swap_cgroup_swapoff() ... called at the end of swapoff.
- swap_cgroup_record() .... record information of swap entry.
- swap_cgroup_lookup() .... lookup information of swap entry.
This patch just implements "how to record information"; it adds no actual
method for limiting the usage of swap yet. These routines use a flat table
for recording and lookup. A "wiser" lookup structure such as a radix-tree
would require memory allocation when inserting new records, but swap-out is
usually called under memory shortage (or when memcg hits its limit). So, I
used static allocation (the index math is sketched below).
Note1: Here we use a pointer to record the information, which means
8 bytes per swap entry. I think we can reduce this when we create a
"cgroup id" in the range of 0-65535 or 0-255.
Note2: The array of swap_cgroup pages is allocated from HIGHMEM; this may
be good for x86-32.
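
The flat-table lookup is just index arithmetic over an array of backing
pages. A tiny standalone sketch of that math (hypothetical constants; the
real code kmaps HIGHMEM pages and takes ctrl->lock):

  #include <stdio.h>

  #define PAGE_SIZE   4096UL
  #define ENTRY_SIZE  sizeof(void *)                 /* one pointer per entry */
  #define SC_PER_PAGE (PAGE_SIZE / ENTRY_SIZE)
  #define SC_POS_MASK (SC_PER_PAGE - 1)

  int main(void)
  {
          unsigned long offset = 123456;             /* swp_offset(ent) */
          unsigned long idx = offset / SC_PER_PAGE;  /* which backing page */
          unsigned long pos = offset & SC_POS_MASK;  /* slot inside that page */

          printf("offset %lu -> map[%lu], slot %lu (%lu entries per page)\n",
                 offset, idx, pos, (unsigned long)SC_PER_PAGE);
          return 0;
  }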
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/page_cgroup.h | 34 +++++++
mm/page_cgroup.c | 199 ++++++++++++++++++++++++++++++++++++++++++++
mm/swapfile.c | 8 +
3 files changed, 241 insertions(+)
Index: mmotm-2.6.27+/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.27+.orig/mm/page_cgroup.c
+++ mmotm-2.6.27+/mm/page_cgroup.c
@@ -9,6 +9,8 @@
#include <linux/memory.h>
#include <linux/vmalloc.h>
#include <linux/cgroup.h>
+#include <linux/swapops.h>
+#include <linux/highmem.h>
static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -255,3 +257,200 @@ void __init pgdat_page_cgroup_init(struc
}
#endif
+
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+
+DEFINE_MUTEX(swap_cgroup_mutex);
+struct swap_cgroup_ctrl {
+ spinlock_t lock;
+ struct page **map;
+ unsigned long length;
+};
+
+struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
+
+/*
+ * This 8bytes seems big..maybe we can reduce this when we can use "id" for
+ * cgroup rather than pointer.
+ */
+struct swap_cgroup {
+ struct mem_cgroup *val;
+};
+#define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
+#define SC_POS_MASK (SC_PER_PAGE - 1)
+
+/*
+ * allocate buffer for swap_cgroup.
+ */
+static int swap_cgroup_prepare(int type)
+{
+ struct page *page;
+ struct swap_cgroup_ctrl *ctrl;
+ unsigned long idx, max;
+
+ if (!do_swap_account)
+ return 0;
+ ctrl = &swap_cgroup_ctrl[type];
+
+ for (idx = 0; idx < ctrl->length; idx++) {
+ page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
+ if (!page)
+ goto not_enough_page;
+ ctrl->map[idx] = page;
+ }
+ return 0;
+not_enough_page:
+ max = idx;
+ for (idx = 0; idx < max; idx++)
+ __free_page(ctrl->map[idx]);
+
+ return -ENOMEM;
+}
+
+/**
+ * swap_cgroup_record - record mem_cgroup for this swp_entry.
+ * @ent: swap entry to be recorded into
+ * @mem: mem_cgroup to be recorded
+ *
+ * Returns old value at success, NULL at failure.
+ * (Of course, old value can be NULL.)
+ */
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+ unsigned long flags;
+ int type = swp_type(ent);
+ unsigned long offset = swp_offset(ent);
+ unsigned long idx = offset / SC_PER_PAGE;
+ unsigned long pos = offset & SC_POS_MASK;
+ struct swap_cgroup_ctrl *ctrl;
+ struct page *mappage;
+ struct swap_cgroup *sc;
+ struct mem_cgroup *old;
+
+ if (!do_swap_account)
+ return NULL;
+
+ ctrl = &swap_cgroup_ctrl[type];
+
+ mappage = ctrl->map[idx];
+ spin_lock_irqsave(&ctrl->lock, flags);
+ sc = kmap_atomic(mappage, KM_USER0);
+ sc += pos;
+ old = sc->val;
+ sc->val = mem;
+ kunmap_atomic(mappage, KM_USER0);
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+ return old;
+}
+
+/**
+ * lookup_swap_cgroup - lookup mem_cgroup tied to swap entry
+ * @ent: swap entry to be looked up.
+ *
+ * Returns pointer to mem_cgroup at success. NULL at failure.
+ */
+struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+ int type = swp_type(ent);
+ unsigned long flags;
+ unsigned long offset = swp_offset(ent);
+ unsigned long idx = offset / SC_PER_PAGE;
+ unsigned long pos = offset & SC_POS_MASK;
+ struct swap_cgroup_ctrl *ctrl;
+ struct page *mappage;
+ struct swap_cgroup *sc;
+ struct mem_cgroup *ret;
+
+ if (!do_swap_account)
+ return NULL;
+
+ ctrl = &swap_cgroup_ctrl[type];
+
+ mappage = ctrl->map[idx];
+
+ spin_lock_irqsave(&ctrl->lock, flags);
+ sc = kmap_atomic(mappage, KM_USER0);
+ sc += pos;
+ ret = sc->val;
+	kunmap_atomic(mappage, KM_USER0);
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+ return ret;
+}
+
+int swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+ void *array;
+ unsigned long array_size;
+ unsigned long length;
+ struct swap_cgroup_ctrl *ctrl;
+
+ if (!do_swap_account)
+ return 0;
+
+ length = ((max_pages/SC_PER_PAGE) + 1);
+ array_size = length * sizeof(void *);
+
+ array = vmalloc(array_size);
+ if (!array)
+ goto nomem;
+
+ memset(array, 0, array_size);
+ ctrl = &swap_cgroup_ctrl[type];
+ mutex_lock(&swap_cgroup_mutex);
+ ctrl->length = length;
+ ctrl->map = array;
+ if (swap_cgroup_prepare(type)) {
+ /* memory shortage */
+ ctrl->map = NULL;
+ ctrl->length = 0;
+ vfree(array);
+ mutex_unlock(&swap_cgroup_mutex);
+ goto nomem;
+ }
+ mutex_unlock(&swap_cgroup_mutex);
+
+ printk(KERN_INFO
+ "swap_cgroup: uses %ldbytes vmalloc and %ld bytes buffres\n",
+ array_size, length * PAGE_SIZE);
+ printk(KERN_INFO
+ "swap_cgroup can be disabled by noswapaccount boot option.\n");
+
+ return 0;
+nomem:
+ printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
+ printk(KERN_INFO
+ "swap_cgroup can be disabled by noswapaccount boot option\n");
+ return -ENOMEM;
+}
+
+void swap_cgroup_swapoff(int type)
+{
+ int i;
+ struct swap_cgroup_ctrl *ctrl;
+
+ if (!do_swap_account)
+ return;
+
+ mutex_lock(&swap_cgroup_mutex);
+ ctrl = &swap_cgroup_ctrl[type];
+ for (i = 0; i < ctrl->length; i++) {
+ struct page *page = ctrl->map[i];
+ if (page)
+ __free_page(page);
+ }
+ vfree(ctrl->map);
+ ctrl->map = NULL;
+ ctrl->length = 0;
+ mutex_unlock(&swap_cgroup_mutex);
+}
+
+static int __init swap_cgroup_init(void)
+{
+ int i;
+ for (i = 0; i < MAX_SWAPFILES; i++)
+ spin_lock_init(&swap_cgroup_ctrl[i].lock);
+ return 0;
+}
+late_initcall(swap_cgroup_init);
+#endif
Index: mmotm-2.6.27+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.27+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.27+/include/linux/page_cgroup.h
@@ -110,4 +110,38 @@ static inline void page_cgroup_init(void
}
#endif
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+#include <linux/swap.h>
+extern struct mem_cgroup *
+swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem);
+extern struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent);
+extern int swap_cgroup_swapon(int type, unsigned long max_pages);
+extern void swap_cgroup_swapoff(int type);
+#else
+#include <linux/swap.h>
+
+static inline
+struct mem_cgroup *swap_cgroup_record(swp_entry_t ent, struct mem_cgroup *mem)
+{
+ return NULL;
+}
+
+static inline struct mem_cgroup *lookup_swap_cgroup(swp_entry_t ent)
+{
+	return NULL;
+}
+
+static inline int
+swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+	return 0;
+}
+
+static inline void swap_cgroup_swapoff(int type)
+{
+	return;
+}
+
+#endif
#endif
Index: mmotm-2.6.27+/mm/swapfile.c
===================================================================
--- mmotm-2.6.27+.orig/mm/swapfile.c
+++ mmotm-2.6.27+/mm/swapfile.c
@@ -32,6 +32,7 @@
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
#include <linux/swapops.h>
+#include <linux/page_cgroup.h>
static DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
@@ -1345,6 +1346,9 @@ asmlinkage long sys_swapoff(const char _
spin_unlock(&swap_lock);
mutex_unlock(&swapon_mutex);
vfree(swap_map);
+	/* Destroy swap account information */
+ swap_cgroup_swapoff(type);
+
inode = mapping->host;
if (S_ISBLK(inode->i_mode)) {
struct block_device *bdev = I_BDEV(inode);
@@ -1669,6 +1673,10 @@ asmlinkage long sys_swapon(const char __
nr_good_pages = swap_header->info.last_page -
swap_header->info.nr_badpages -
1 /* header page */;
+
+ if (!error)
+ error = swap_cgroup_swapon(type, maxpages);
+
if (error)
goto bad_swap;
}
--
* [RFC][PATCH 11/11] memcg: mem+swap controler core
2008-10-23 8:58 [RFC][PATCH 0/11] memcg updates / clean up, lazy lru ,mem+swap controller KAMEZAWA Hiroyuki
` (9 preceding siblings ...)
2008-10-23 9:13 ` [RFC][PATCH 10/11] memcg: swap cgroup KAMEZAWA Hiroyuki
@ 2008-10-23 9:16 ` KAMEZAWA Hiroyuki
2008-10-27 11:37 ` Daisuke Nishimura
10 siblings, 1 reply; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-23 9:16 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
Mem+Swap controller core.
This patch implements per cgroup limit for usage of memory+swap.
Although SwapCache pages exist, double counting of the swap-cache and the
swap-entry is avoided.
The Mem+Swap controller works as follows.
- memory usage is limited by memory.limit_in_bytes.
- memory + swap usage is limited by memory.memsw.limit_in_bytes.
This has the following benefits.
- A user can limit the total resource usage of mem+swap.
Without this, because the memory resource controller doesn't take care of
swap usage, a process can exhaust all the swap (e.g. by a memory leak).
We can avoid that case.
Also, swap is a shared resource, but swap space cannot be reclaimed (brought
back to memory) until it is actually used. This characteristic can be
troublesome when memory is divided into parts by cpuset or memcg.
Assume group A and group B. After some applications run, the system can end
up like this:
Group A -- very large free memory space but occupying 99% of swap.
Group B -- under memory shortage but unable to use swap...it's nearly full.
In general, this situation cannot be recovered from.
The ability to set an appropriate swap limit for each group is required.
Maybe someone wonders "why mem+swap rather than just swap?"
- The global LRU (kswapd) can swap out arbitrary pages. Swap-out just moves
the account from memory to swap...there is no change in the usage of
mem+swap.
In other words, when we want to limit the usage of swap without affecting
the global LRU, a mem+swap limit is better than just limiting swap.
The accounting target information is stored in swap_cgroup, which is a
per-swap-entry record.
Charging is done as follows (a toy model is sketched below).
map
- charge the page and memsw.
unmap
- uncharge the page/memsw if it is not SwapCache.
swap-out (__delete_from_swap_cache)
- uncharge the page.
- record the mem_cgroup information into swap_cgroup.
swap-in (do_swap_page)
- charge as page and memsw.
The record in swap_cgroup is cleared and the memsw accounting is
decremented.
swap-free (swap_free())
- when the swap entry is freed, memsw is uncharged by PAGE_SIZE.
After this, the usual memory resource controller handles SwapCache.
(This was a lacking (ignored) feature in the current memcg, but it must be
handled.)
There are people who work in never-swap environments and consider swap to be
something bad. For such people, this mem+swap controller extension is just
overhead. That overhead can be avoided by config or boot option.
(See Kconfig; the details are not in this patch.)
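
To make the charging flow above concrete, here is a toy, lock-free userspace
model of the two counters over a single page's lifecycle; all names and the
single global "cgroup" are made up for illustration only:

  #include <stdio.h>

  static long mem_usage, memsw_usage;   /* pages charged to res / memsw */
  static int  swap_record;              /* stands in for the swap_cgroup entry */

  static void map_page(void)            /* map (or swap-in): charge both */
  {
          mem_usage++;
          memsw_usage++;
          if (swap_record) {            /* swap-in clears the record and   */
                  memsw_usage--;        /* drops the duplicated memsw charge */
                  swap_record = 0;
          }
  }

  static void swap_out(void)            /* __delete_from_swap_cache */
  {
          mem_usage--;                  /* page charge goes away...        */
          swap_record = 1;              /* ...memsw stays, remembered via
                                         * the swap_cgroup record          */
  }

  static void swap_free_entry(void)     /* swap_entry_free */
  {
          if (swap_record) {
                  memsw_usage--;
                  swap_record = 0;
          }
  }

  static void show(const char *what)
  {
          printf("%-12s mem=%ld memsw=%ld\n", what, mem_usage, memsw_usage);
  }

  int main(void)
  {
          map_page();        show("mapped");
          swap_out();        show("swapped out");
          swap_free_entry(); show("swap freed");
          return 0;
  }

Unmap of a non-SwapCache page would simply decrement both counters; that case
is omitted to keep the sketch short.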
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/memcontrol.h | 3
include/linux/swap.h | 17 ++
mm/memcontrol.c | 356 +++++++++++++++++++++++++++++++++++++++++----
mm/swap_state.c | 4
mm/swapfile.c | 8 -
5 files changed, 359 insertions(+), 29 deletions(-)
Index: mmotm-2.6.27+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27+.orig/mm/memcontrol.c
+++ mmotm-2.6.27+/mm/memcontrol.c
@@ -130,6 +130,10 @@ struct mem_cgroup {
*/
struct res_counter res;
/*
+	 * the counter to account for mem+swap usage.
+ */
+ struct res_counter memsw;
+ /*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
@@ -140,6 +144,12 @@ struct mem_cgroup {
* statistics.
*/
struct mem_cgroup_stat stat;
+
+ /*
+ * used for counting reference from swap_cgroup.
+ */
+ int obsolete;
+ atomic_t swapref;
};
static struct mem_cgroup init_mem_cgroup;
@@ -148,6 +158,7 @@ enum charge_type {
MEM_CGROUP_CHARGE_TYPE_MAPPED,
MEM_CGROUP_CHARGE_TYPE_SHMEM, /* used by page migration of shmem */
MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
+	MEM_CGROUP_CHARGE_TYPE_SWAPOUT, /* uncharge at swap-out of SwapCache */
NR_CHARGE_TYPE,
};
@@ -165,6 +176,16 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
0, /* FORCE */
};
+
+/* for encoding cft->private value on file */
+#define _MEM (0)
+#define _MEMSWAP (1)
+#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
+#define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
+#define MEMFILE_ATTR(val) ((val) & 0xffff)
+
+static void mem_cgroup_forget_swapref(struct mem_cgroup *mem);
+
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
@@ -698,8 +719,19 @@ static int __mem_cgroup_try_charge(struc
css_get(&mem->css);
}
+ while (1) {
+ int ret;
- while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
+ ret = res_counter_charge(&mem->res, PAGE_SIZE);
+ if (likely(!ret)) {
+ if (!do_swap_account)
+ break;
+ ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
+ if (likely(!ret))
+ break;
+ /* mem+swap counter fails */
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ }
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
@@ -712,8 +744,13 @@ static int __mem_cgroup_try_charge(struc
* moved to swap cache or just unmapped from the cgroup.
* Check the limit again to see if the reclaim reduced the
* current usage of the cgroup before giving up
+ *
*/
- if (res_counter_check_under_limit(&mem->res))
+ if (!do_swap_account &&
+ res_counter_check_under_limit(&mem->res))
+ continue;
+ if (do_swap_account &&
+ res_counter_check_under_limit(&mem->memsw))
continue;
if (!nr_retries--) {
@@ -770,6 +807,8 @@ static void __mem_cgroup_commit_charge(s
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
css_put(&mem->css);
return;
}
@@ -851,6 +890,8 @@ static int mem_cgroup_move_account(struc
if (spin_trylock(&to_mz->lru_lock)) {
__mem_cgroup_remove_list(from_mz, pc);
res_counter_uncharge(&from->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&from->memsw, PAGE_SIZE);
pc->mem_cgroup = to;
__mem_cgroup_add_list(to_mz, pc, false);
ret = 0;
@@ -896,8 +937,11 @@ static int mem_cgroup_move_parent(struct
/* drop extra refcnt */
css_put(&parent->css);
/* uncharge if move fails */
- if (ret)
+ if (ret) {
res_counter_uncharge(&parent->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+ }
return ret;
}
@@ -972,7 +1016,6 @@ int mem_cgroup_cache_charge(struct page
if (!(gfp_mask & __GFP_WAIT)) {
struct page_cgroup *pc;
-
pc = lookup_page_cgroup(page);
if (!pc)
return 0;
@@ -998,15 +1041,74 @@ int mem_cgroup_cache_charge(struct page
int mem_cgroup_cache_charge_swapin(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
{
+ struct mem_cgroup *mem;
+ swp_entry_t ent;
+ int ret;
+
if (mem_cgroup_subsys.disabled)
return 0;
- if (unlikely(!mm))
- mm = &init_mm;
- return mem_cgroup_charge_common(page, mm, gfp_mask,
+
+ ent.val = page_private(page);
+ if (do_swap_account) {
+ mem = lookup_swap_cgroup(ent);
+ if (!mem || mem->obsolete)
+ goto charge_cur_mm;
+ ret = mem_cgroup_charge_common(page, NULL, gfp_mask,
+ MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
+ } else {
+charge_cur_mm:
+ if (unlikely(!mm))
+ mm = &init_mm;
+ ret = mem_cgroup_charge_common(page, mm, gfp_mask,
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
+ }
+ if (do_swap_account && !ret) {
+ /*
+ * At this point, we successfully charge both for mem and swap.
+ * fix this double counting, here.
+ */
+ mem = swap_cgroup_record(ent, NULL);
+ if (mem) {
+ /* If memcg is obsolete, memcg can be != ptr */
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ mem_cgroup_forget_swapref(mem);
+ }
+ }
+ return ret;
+}
+
+/*
+ * look into swap_cgroup and charge against mem_cgroup which does swapout
+ * if we can. If not, charge against current mm.
+ */
+
+int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+ struct page *page, gfp_t mask, struct mem_cgroup **ptr)
+{
+ struct mem_cgroup *mem;
+ swp_entry_t ent;
+
+ if (mem_cgroup_subsys.disabled)
+ return 0;
+ if (!do_swap_account)
+ goto charge_cur_mm;
+
+ ent.val = page_private(page);
+
+ mem = lookup_swap_cgroup(ent);
+ if (!mem || mem->obsolete)
+ goto charge_cur_mm;
+ *ptr = mem;
+ return __mem_cgroup_try_charge(NULL, mask, ptr, true);
+charge_cur_mm:
+ if (unlikely(!mm))
+ mm = &init_mm;
+ return __mem_cgroup_try_charge(mm, mask, ptr, true);
}
+
+
void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
{
struct page_cgroup *pc;
@@ -1017,6 +1119,23 @@ void mem_cgroup_commit_charge_swapin(str
return;
pc = lookup_page_cgroup(page);
__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
+ /*
+	 * Now the swap content is in memory. This means this page may be
+	 * counted both as mem and swap...double counting.
+ * Fix it by uncharging from memsw. This SwapCache is stable
+ * because we're still under lock_page().
+ */
+ if (do_swap_account) {
+ swp_entry_t ent = {.val = page_private(page)};
+ struct mem_cgroup *memcg;
+ memcg = swap_cgroup_record(ent, NULL);
+ if (memcg) {
+ /* If memcg is obsolete, memcg can be != ptr */
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ mem_cgroup_forget_swapref(memcg);
+ }
+
+ }
}
void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
@@ -1026,35 +1145,50 @@ void mem_cgroup_cancel_charge_swapin(str
if (!mem)
return;
res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
css_put(&mem->css);
}
-
/*
* uncharge if !page_mapped(page)
+ * returns memcg at success.
*/
-static void
+static struct mem_cgroup *
__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
{
struct page_cgroup *pc;
struct mem_cgroup *mem;
if (mem_cgroup_subsys.disabled)
- return;
+ return NULL;
+ if (PageSwapCache(page))
+ return NULL;
/*
* Check if our page_cgroup is valid
*/
pc = lookup_page_cgroup(page);
if (unlikely(!pc || !PageCgroupUsed(pc)))
- return;
+ return NULL;
lock_page_cgroup(pc);
+ if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
+ if (PageAnon(page)) {
+ if (page_mapped(page)) {
+ unlock_page_cgroup(pc);
+ return NULL;
+ }
+ } else if (page->mapping && !page_is_file_cache(page)) {
+ /* This is on radix-tree. */
+ unlock_page_cgroup(pc);
+ return NULL;
+ }
+ }
if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED && page_mapped(page))
|| !PageCgroupUsed(pc)) {
/* This happens at race in zap_pte_range() and do_swap_page()*/
unlock_page_cgroup(pc);
- return;
+ return NULL;
}
ClearPageCgroupUsed(pc);
mem = pc->mem_cgroup;
@@ -1063,9 +1197,11 @@ __mem_cgroup_uncharge_common(struct page
* unlock this.
*/
res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (do_swap_account && ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
unlock_page_cgroup(pc);
release_page_cgroup(pc);
- return;
+ return mem;
}
void mem_cgroup_uncharge_page(struct page *page)
@@ -1086,6 +1222,41 @@ void mem_cgroup_uncharge_cache_page(stru
}
/*
+ * called from __delete_from_swap_cache() and drop "page" account.
+ * memcg information is recorded to swap_cgroup of "ent"
+ */
+void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+{
+ struct mem_cgroup *memcg;
+
+ memcg = __mem_cgroup_uncharge_common(page,
+ MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+ /* record memcg information */
+ if (do_swap_account && memcg) {
+ swap_cgroup_record(ent, memcg);
+ atomic_inc(&memcg->swapref);
+ }
+}
+
+/*
+ * called from swap_entry_free(). remove record in swap_cgroup and
+ * uncharge "memsw" account.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t ent)
+{
+ struct mem_cgroup *memcg;
+
+ if (!do_swap_account)
+ return;
+
+ memcg = swap_cgroup_record(ent, NULL);
+ if (memcg) {
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+ mem_cgroup_forget_swapref(memcg);
+ }
+}
+
+/*
* Before starting migration, account PAGE_SIZE to mem_cgroup that the old
* page belongs to.
*/
@@ -1219,13 +1390,56 @@ int mem_cgroup_resize_limit(struct mem_c
ret = -EBUSY;
break;
}
- progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL);
+ progress = try_to_free_mem_cgroup_pages(memcg,
+ GFP_KERNEL);
if (!progress)
retry_count--;
}
return ret;
}
+int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
+ unsigned long long val)
+{
+ int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
+ unsigned long flags;
+ u64 memlimit, oldusage, curusage;
+ int ret;
+
+ if (!do_swap_account)
+ return -EINVAL;
+
+ while (retry_count) {
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ break;
+ }
+ /*
+		 * Rather than hiding it all in some function, I do this in an
+		 * open-coded manner, so you can see what it really does.
+ * We have to guarantee mem->res.limit < mem->memsw.limit.
+ */
+ spin_lock_irqsave(&memcg->res.lock, flags);
+ memlimit = memcg->res.limit;
+ if (memlimit > val) {
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+ ret = -EINVAL;
+ break;
+ }
+ ret = res_counter_set_limit(&memcg->memsw, val);
+ oldusage = memcg->memsw.usage;
+ spin_unlock_irqrestore(&memcg->res.lock, flags);
+
+ if (!ret)
+ break;
+ try_to_free_mem_cgroup_pages(memcg, GFP_HIGHUSER_MOVABLE);
+ curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+ if (curusage >= oldusage)
+ retry_count--;
+ }
+ return ret;
+}
+
/*
* This routine traverse page_cgroup in given list and drop them all.
@@ -1353,8 +1567,25 @@ try_to_free:
static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
{
- return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
- cft->private);
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ u64 val = 0;
+ int type, name;
+
+ type = MEMFILE_TYPE(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+ switch (type) {
+ case _MEM:
+ val = res_counter_read_u64(&mem->res, name);
+ break;
+ case _MEMSWAP:
+ if (do_swap_account)
+ val = res_counter_read_u64(&mem->memsw, name);
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return val;
}
/*
* The user of this function is...
@@ -1364,15 +1595,22 @@ static int mem_cgroup_write(struct cgrou
const char *buffer)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ int type, name;
unsigned long long val;
int ret;
- switch (cft->private) {
+ type = MEMFILE_TYPE(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+ switch (name) {
case RES_LIMIT:
/* This function does all necessary parse...reuse it */
ret = res_counter_memparse_write_strategy(buffer, &val);
- if (!ret)
+ if (ret)
+ break;
+ if (type == _MEM)
ret = mem_cgroup_resize_limit(memcg, val);
+ else
+ ret = mem_cgroup_resize_memsw_limit(memcg, val);
break;
default:
ret = -EINVAL; /* should be BUG() ? */
@@ -1384,14 +1622,23 @@ static int mem_cgroup_write(struct cgrou
static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
{
struct mem_cgroup *mem;
+ int type, name;
mem = mem_cgroup_from_cont(cont);
- switch (event) {
+ type = MEMFILE_TYPE(event);
+ name = MEMFILE_ATTR(event);
+ switch (name) {
case RES_MAX_USAGE:
- res_counter_reset_max(&mem->res);
+ if (type == _MEM)
+ res_counter_reset_max(&mem->res);
+ else
+ res_counter_reset_max(&mem->memsw);
break;
case RES_FAILCNT:
- res_counter_reset_failcnt(&mem->res);
+ if (type == _MEM)
+ res_counter_reset_failcnt(&mem->res);
+ else
+ res_counter_reset_failcnt(&mem->memsw);
break;
}
return 0;
@@ -1445,30 +1692,33 @@ static int mem_control_stat_show(struct
cb->fill(cb, "unevictable", unevictable * PAGE_SIZE);
}
+ /* showing refs from disk-swap */
+ cb->fill(cb, "swap_on_disk", atomic_read(&mem_cont->swapref)
+ * PAGE_SIZE);
return 0;
}
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
- .private = RES_USAGE,
+ .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
.read_u64 = mem_cgroup_read,
},
{
.name = "max_usage_in_bytes",
- .private = RES_MAX_USAGE,
+ .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
.trigger = mem_cgroup_reset,
.read_u64 = mem_cgroup_read,
},
{
.name = "limit_in_bytes",
- .private = RES_LIMIT,
+ .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
.write_string = mem_cgroup_write,
.read_u64 = mem_cgroup_read,
},
{
.name = "failcnt",
- .private = RES_FAILCNT,
+ .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
.trigger = mem_cgroup_reset,
.read_u64 = mem_cgroup_read,
},
@@ -1476,6 +1726,31 @@ static struct cftype mem_cgroup_files[]
.name = "stat",
.read_map = mem_control_stat_show,
},
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+ {
+ .name = "memsw.usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "memsw.max_usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
+ .trigger = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "memsw.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
+ .write_string = mem_cgroup_write,
+ .read_u64 = mem_cgroup_read,
+ },
+ {
+ .name = "failcnt",
+ .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
+ .trigger = mem_cgroup_reset,
+ .read_u64 = mem_cgroup_read,
+ },
+#endif
};
static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1529,14 +1804,43 @@ static struct mem_cgroup *mem_cgroup_all
return mem;
}
+/*
+ * At destroying mem_cgroup, references from swap_cgroup can remain.
+ * (scanning all at force_empty is too costly...)
+ *
+ * Instead of clearing all references at force_empty, we remember
+ * the number of reference from swap_cgroup and free mem_cgroup when
+ * it goes down to 0.
+ *
+ * When mem_cgroup is destroyed, mem->obsolete will be set to 1 and
+ * any entry which points to this memcg will be ignored at swapin.
+ *
+ * Removal of cgroup itself succeeds regardless of refs from swap.
+ */
+
static void mem_cgroup_free(struct mem_cgroup *mem)
{
+ if (do_swap_account) {
+ if (atomic_read(&mem->swapref) > 0)
+ return;
+ }
if (sizeof(*mem) < PAGE_SIZE)
kfree(mem);
else
vfree(mem);
}
+static void mem_cgroup_forget_swapref(struct mem_cgroup *mem)
+{
+ if (!do_swap_account)
+ return;
+ if (atomic_dec_and_test(&mem->swapref)) {
+ if (!mem->obsolete)
+ return;
+ mem_cgroup_free(mem);
+ }
+}
+
static void mem_cgroup_init_pcp(int cpu)
{
page_cgroup_start_cache_cpu(cpu);
@@ -1589,6 +1893,7 @@ mem_cgroup_create(struct cgroup_subsys *
}
res_counter_init(&mem->res);
+ res_counter_init(&mem->memsw);
for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
@@ -1607,6 +1912,7 @@ static void mem_cgroup_pre_destroy(struc
struct cgroup *cont)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ mem->obsolete = 1;
mem_cgroup_force_empty(mem);
}
Index: mmotm-2.6.27+/mm/swapfile.c
===================================================================
--- mmotm-2.6.27+.orig/mm/swapfile.c
+++ mmotm-2.6.27+/mm/swapfile.c
@@ -271,8 +271,9 @@ out:
return NULL;
}
-static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
+static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
{
+ unsigned long offset = swp_offset(ent);
int count = p->swap_map[offset];
if (count < SWAP_MAP_MAX) {
@@ -287,6 +288,7 @@ static int swap_entry_free(struct swap_i
swap_list.next = p - swap_info;
nr_swap_pages++;
p->inuse_pages--;
+ mem_cgroup_uncharge_swap(ent);
}
}
return count;
@@ -302,7 +304,7 @@ void swap_free(swp_entry_t entry)
p = swap_info_get(entry);
if (p) {
- swap_entry_free(p, swp_offset(entry));
+ swap_entry_free(p, entry);
spin_unlock(&swap_lock);
}
}
@@ -421,7 +423,7 @@ void free_swap_and_cache(swp_entry_t ent
p = swap_info_get(entry);
if (p) {
- if (swap_entry_free(p, swp_offset(entry)) == 1) {
+ if (swap_entry_free(p, entry) == 1) {
page = find_get_page(&swapper_space, entry.val);
if (page && !trylock_page(page)) {
page_cache_release(page);
Index: mmotm-2.6.27+/mm/swap_state.c
===================================================================
--- mmotm-2.6.27+.orig/mm/swap_state.c
+++ mmotm-2.6.27+/mm/swap_state.c
@@ -17,6 +17,7 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/migrate.h>
+#include <linux/page_cgroup.h>
#include <asm/pgtable.h>
@@ -108,6 +109,8 @@ int add_to_swap_cache(struct page *page,
*/
void __delete_from_swap_cache(struct page *page)
{
+ swp_entry_t ent = {.val = page_private(page)};
+
BUG_ON(!PageLocked(page));
BUG_ON(!PageSwapCache(page));
BUG_ON(PageWriteback(page));
@@ -119,6 +122,7 @@ void __delete_from_swap_cache(struct pag
total_swapcache_pages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
INC_CACHE_INFO(del_total);
+ mem_cgroup_uncharge_swapcache(page, ent);
}
/**
Index: mmotm-2.6.27+/include/linux/swap.h
===================================================================
--- mmotm-2.6.27+.orig/include/linux/swap.h
+++ mmotm-2.6.27+/include/linux/swap.h
@@ -332,6 +332,23 @@ static inline void disable_swap_token(vo
put_swap_token(swap_token_mm);
}
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/* This function requires swp_entry_t definition. see memcontrol.c */
+extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
+#else
+static inline void
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+{
+}
+#endif
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+#else
+static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
+{
+}
+#endif
+
#else /* CONFIG_SWAP */
#define total_swap_pages 0
Index: mmotm-2.6.27+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27+/include/linux/memcontrol.h
@@ -32,6 +32,8 @@ extern int mem_cgroup_newpage_charge(str
/* for swap handling */
extern int mem_cgroup_try_charge(struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+ struct page *page, gfp_t mask, struct mem_cgroup **ptr);
extern void mem_cgroup_commit_charge_swapin(struct page *page,
struct mem_cgroup *ptr);
extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
@@ -83,7 +85,6 @@ extern long mem_cgroup_calc_reclaim(stru
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
-
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
--
* Re: [RFC][PATCH 2/11] cgroup: make cgroup kconfig as submenu
2008-10-23 9:00 ` [RFC][PATCH 2/11] cgroup: make cgroup kconfig as submenu KAMEZAWA Hiroyuki
@ 2008-10-23 21:20 ` Paul Menage
2008-10-24 1:16 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 30+ messages in thread
From: Paul Menage @ 2008-10-23 21:20 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul
On Thu, Oct 23, 2008 at 2:00 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> @@ -337,6 +284,8 @@ config GROUP_SCHED
> help
> This feature lets CPU scheduler recognize task groups and control CPU
> bandwidth allocation to such task groups.
> + For allowing to make a group from arbitrary set of processes, use
> + CONFIG_CGROUPS. (See Control Group support.)
Please can we make this:
In order to create a scheduler group from an arbitrary set of
processes, use CONFIG_CGROUPS (See Control Group support).
>
> + This option will let you use process cgroup subsystems
> + such as Cpusets
This option adds support for grouping sets of processes together, for
use with process control subsystems such as Cpusets, CFS, memory
controls or device isolation.
Paul
--
* Re: [RFC][PATCH 2/11] cgroup: make cgroup kconfig as submenu
2008-10-23 21:20 ` Paul Menage
@ 2008-10-24 1:16 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-24 1:16 UTC (permalink / raw)
To: Paul Menage; +Cc: linux-mm, balbir, nishimura, xemul
On Thu, 23 Oct 2008 14:20:05 -0700
"Paul Menage" <menage@google.com> wrote:
> On Thu, Oct 23, 2008 at 2:00 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > @@ -337,6 +284,8 @@ config GROUP_SCHED
> > help
> > This feature lets CPU scheduler recognize task groups and control CPU
> > bandwidth allocation to such task groups.
> > + For allowing to make a group from arbitrary set of processes, use
> > + CONFIG_CGROUPS. (See Control Group support.)
>
> Please can we make this:
>
> In order to create a scheduler group from an arbitrary set of
> processes, use CONFIG_CGROUPS (See Control Group support).
>
> >
> > + This option will let you use process cgroup subsystems
> > + such as Cpusets
>
> This option adds support for grouping sets of processes together, for
> use with process control subsystems such as Cpusets, CFS, memory
> controls or device isolation.
>
O.K. thank you for help.
Thanks,
-Kame
--
* Re: [RFC][PATCH 1/11] memcg: fix kconfig menu comment
2008-10-23 8:59 ` [RFC][PATCH 1/11] memcg: fix kconfig menu comment KAMEZAWA Hiroyuki
@ 2008-10-24 4:24 ` Randy Dunlap
2008-10-24 4:28 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 30+ messages in thread
From: Randy Dunlap @ 2008-10-24 4:24 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
On Thu, 23 Oct 2008 17:59:46 +0900 KAMEZAWA Hiroyuki wrote:
> Fixes menu help text for memcg-allocate-page-cgroup-at-boot.patch.
>
>
> Signed-off-by: KAMEZAWA hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> init/Kconfig | 16 ++++++++++------
> 1 file changed, 10 insertions(+), 6 deletions(-)
>
> Index: mmotm-2.6.27+/init/Kconfig
> ===================================================================
> --- mmotm-2.6.27+.orig/init/Kconfig
> +++ mmotm-2.6.27+/init/Kconfig
> @@ -401,16 +401,20 @@ config CGROUP_MEM_RES_CTLR
> depends on CGROUPS && RESOURCE_COUNTERS
> select MM_OWNER
> help
> - Provides a memory resource controller that manages both page cache and
> - RSS memory.
> + Provides a memory resource controller that manages both anonymous
> + memory and page cache. (See Documentation/controllers/memory.txt)
>
> Note that setting this option increases fixed memory overhead
> - associated with each page of memory in the system by 4/8 bytes
> - and also increases cache misses because struct page on many 64bit
> - systems will not fit into a single cache line anymore.
> + associated with each page of memory in the system. By this,
> + 20(40)bytes/PAGE_SIZE on 32(64)bit system will be occupied by memory
> + usage tracking struct at boot. Total amount of this is printed out
> + at boot.
>
> Only enable when you're ok with these trade offs and really
> - sure you need the memory resource controller.
> + sure you need the memory resource controller. Even when you enable
> + this, you can set "cgroup_disable=memory" at your boot option to
> + disable memoyr resource controller and you can avoid overheads.
memory
> + (and lose benefits of memory resource contoller)
>
> This config option also selects MM_OWNER config option, which
> could in turn add some fork/exit overhead.
---
~Randy
--
* Re: [RFC][PATCH 1/11] memcg: fix kconfig menu comment
2008-10-24 4:24 ` Randy Dunlap
@ 2008-10-24 4:28 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-24 4:28 UTC (permalink / raw)
To: Randy Dunlap; +Cc: linux-mm, balbir, nishimura, xemul, menage
On Thu, 23 Oct 2008 21:24:13 -0700
Randy Dunlap <randy.dunlap@oracle.com> wrote:
> On Thu, 23 Oct 2008 17:59:46 +0900 KAMEZAWA Hiroyuki wrote:
>
> > Fixes menu help text for memcg-allocate-page-cgroup-at-boot.patch.
> >
> >
> > Signed-off-by: KAMEZAWA hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > init/Kconfig | 16 ++++++++++------
> > 1 file changed, 10 insertions(+), 6 deletions(-)
> >
> > Index: mmotm-2.6.27+/init/Kconfig
> > ===================================================================
> > --- mmotm-2.6.27+.orig/init/Kconfig
> > +++ mmotm-2.6.27+/init/Kconfig
> > @@ -401,16 +401,20 @@ config CGROUP_MEM_RES_CTLR
> > depends on CGROUPS && RESOURCE_COUNTERS
> > select MM_OWNER
> > help
> > - Provides a memory resource controller that manages both page cache and
> > - RSS memory.
> > + Provides a memory resource controller that manages both anonymous
> > + memory and page cache. (See Documentation/controllers/memory.txt)
> >
> > Note that setting this option increases fixed memory overhead
> > - associated with each page of memory in the system by 4/8 bytes
> > - and also increases cache misses because struct page on many 64bit
> > - systems will not fit into a single cache line anymore.
> > + associated with each page of memory in the system. By this,
> > + 20(40)bytes/PAGE_SIZE on 32(64)bit system will be occupied by memory
> > + usage tracking struct at boot. Total amount of this is printed out
> > + at boot.
> >
> > Only enable when you're ok with these trade offs and really
> > - sure you need the memory resource controller.
> > + sure you need the memory resource controller. Even when you enable
> > + this, you can set "cgroup_disable=memory" at your boot option to
> > + disable memoyr resource controller and you can avoid overheads.
>
> memory
>
Oh, I thought I had fixed this, but apparently not...
Thank you for the review!
Regards,
-Kame
> > + (and lose benefits of memory resource contoller)
> >
> > This config option also selects MM_OWNER config option, which
> > could in turn add some fork/exit overhead.
>
>
> ---
> ~Randy
>
* Re: [RFC][PATCH 5/11] memcg: account move and change force_empty
2008-10-23 9:05 ` [RFC][PATCH 5/11] memcg: account move and change force_empty KAMEZAWA Hiroyuki
@ 2008-10-24 4:28 ` Randy Dunlap
2008-10-24 4:37 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 30+ messages in thread
From: Randy Dunlap @ 2008-10-24 4:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
On Thu, 23 Oct 2008 18:05:38 +0900 KAMEZAWA Hiroyuki wrote:
> Documentation/controllers/memory.txt | 12 -
> mm/memcontrol.c | 277 ++++++++++++++++++++++++++---------
> 2 files changed, 214 insertions(+), 75 deletions(-)
>
> Index: mmotm-2.6.27+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.27+.orig/mm/memcontrol.c
> +++ mmotm-2.6.27+/mm/memcontrol.c
> @@ -538,6 +533,25 @@ nomem:
> return -ENOMEM;
> }
>
> +/**
> + * mem_cgroup_try_charge - get charge of PAGE_SIZE.
> + * @mm: an mm_struct which is charged against. (when *memcg is NULL)
> + * @gfp_mask: gfp_mask for reclaim.
> + * @memcg: a pointer to memory cgroup which is charged against.
> + *
> + * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
> + * memory cgroup from @mm is got and stored in *memcg.
> + *
> + * Retruns 0 if success. -ENOMEM at failure.
Returns
> + * This call can invoce OOM-Killer.
invoke
> + */
---
~Randy
* Re: [RFC][PATCH 9/11] memcg : mem+swap controlelr kconfig
2008-10-23 9:12 ` [RFC][PATCH 9/11] memcg : mem+swap controlelr kconfig KAMEZAWA Hiroyuki
@ 2008-10-24 4:32 ` Randy Dunlap
2008-10-24 4:37 ` KAMEZAWA Hiroyuki
2008-10-27 6:39 ` Daisuke Nishimura
1 sibling, 1 reply; 30+ messages in thread
From: Randy Dunlap @ 2008-10-24 4:32 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, xemul, menage
On Thu, 23 Oct 2008 18:12:20 +0900 KAMEZAWA Hiroyuki wrote:
> Documentation/kernel-parameters.txt | 3 +++
> include/linux/memcontrol.h | 3 +++
> init/Kconfig | 16 ++++++++++++++++
> mm/memcontrol.c | 17 +++++++++++++++++
> 4 files changed, 39 insertions(+)
>
> Index: mmotm-2.6.27+/init/Kconfig
> ===================================================================
> --- mmotm-2.6.27+.orig/init/Kconfig
> +++ mmotm-2.6.27+/init/Kconfig
> @@ -613,6 +613,22 @@ config KALLSYMS_EXTRA_PASS
> reported. KALLSYMS_EXTRA_PASS is only a temporary workaround while
> you wait for kallsyms to be fixed.
>
> +config CGROUP_MEM_RES_CTLR_SWAP
> + bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
> + depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> + help
> + Add swap management feature to memory resource controller. When you
> + enable this, you can limit mem+swap usage per cgroup. In other words,
> + when you disable this, memory resource controller have no cares to
probably: has
> + usage of swap...a process can exhaust the all swap. This extension
all of the swap.
> + is useful when you want to avoid exhausion of swap but this itself
exhaustion
> + adds more overheads and consumes memory for remembering information.
> + Especially if you use 32bit system or small memory system,
> + please be careful to enable this. When memory resource controller
probably: about enabling this.
> + is disabled by boot option, this will be automatiaclly disabled and
> + there will be no overhead from this. Even when you set this config=y,
> + if boot option "noswapaccount" is set, swap will not be accounted.
> +
>
> config HOTPLUG
> bool "Support for hot-pluggable devices" if EMBEDDED
---
~Randy
* Re: [RFC][PATCH 5/11] memcg: account move and change force_empty
2008-10-24 4:28 ` Randy Dunlap
@ 2008-10-24 4:37 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-24 4:37 UTC (permalink / raw)
To: Randy Dunlap; +Cc: linux-mm, balbir, nishimura, xemul, menage
On Thu, 23 Oct 2008 21:28:37 -0700
Randy Dunlap <randy.dunlap@oracle.com> wrote:
> On Thu, 23 Oct 2008 18:05:38 +0900 KAMEZAWA Hiroyuki wrote:
>
> > Documentation/controllers/memory.txt | 12 -
> > mm/memcontrol.c | 277 ++++++++++++++++++++++++++---------
> > 2 files changed, 214 insertions(+), 75 deletions(-)
> >
> > Index: mmotm-2.6.27+/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.27+.orig/mm/memcontrol.c
> > +++ mmotm-2.6.27+/mm/memcontrol.c
> > @@ -538,6 +533,25 @@ nomem:
> > return -ENOMEM;
> > }
> >
> > +/**
> > + * mem_cgroup_try_charge - get charge of PAGE_SIZE.
> > + * @mm: an mm_struct which is charged against. (when *memcg is NULL)
> > + * @gfp_mask: gfp_mask for reclaim.
> > + * @memcg: a pointer to memory cgroup which is charged against.
> > + *
> > + * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
> > + * memory cgroup from @mm is got and stored in *memcg.
> > + *
> > + * Retruns 0 if success. -ENOMEM at failure.
>
> Returns
>
> > + * This call can invoce OOM-Killer.
>
> invoke
>
Thanks, will fix. (and use aspell before next post..)
Regards,
-kame
> > + */
>
> ---
> ~Randy
>
* Re: [RFC][PATCH 9/11] memcg : mem+swap controlelr kconfig
2008-10-24 4:32 ` Randy Dunlap
@ 2008-10-24 4:37 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-24 4:37 UTC (permalink / raw)
To: Randy Dunlap; +Cc: linux-mm, balbir, nishimura, xemul, menage
On Thu, 23 Oct 2008 21:32:28 -0700
Randy Dunlap <randy.dunlap@oracle.com> wrote:
> On Thu, 23 Oct 2008 18:12:20 +0900 KAMEZAWA Hiroyuki wrote:
>
> > Documentation/kernel-parameters.txt | 3 +++
> > include/linux/memcontrol.h | 3 +++
> > init/Kconfig | 16 ++++++++++++++++
> > mm/memcontrol.c | 17 +++++++++++++++++
> > 4 files changed, 39 insertions(+)
> >
> > Index: mmotm-2.6.27+/init/Kconfig
> > ===================================================================
> > --- mmotm-2.6.27+.orig/init/Kconfig
> > +++ mmotm-2.6.27+/init/Kconfig
> > @@ -613,6 +613,22 @@ config KALLSYMS_EXTRA_PASS
> > reported. KALLSYMS_EXTRA_PASS is only a temporary workaround while
> > you wait for kallsyms to be fixed.
> >
> > +config CGROUP_MEM_RES_CTLR_SWAP
> > + bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
> > + depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> > + help
> > + Add swap management feature to memory resource controller. When you
> > + enable this, you can limit mem+swap usage per cgroup. In other words,
> > + when you disable this, memory resource controller have no cares to
>
> probably: has
>
> > + usage of swap...a process can exhaust the all swap. This extension
>
> all of the swap.
>
> > + is useful when you want to avoid exhausion of swap but this itself
>
> exhaustion
>
> > + adds more overheads and consumes memory for remembering information.
> > + Especially if you use 32bit system or small memory system,
> > + please be careful to enable this. When memory resource controller
>
> probably: about enabling this.
>
Thank you for the review. Will be fixed.
-Kame
> > + is disabled by boot option, this will be automatiaclly disabled and
> > + there will be no overhead from this. Even when you set this config=y,
> > + if boot option "noswapaccount" is set, swap will not be accounted.
> > +
> >
> > config HOTPLUG
> > bool "Support for hot-pluggable devices" if EMBEDDED
>
>
> ---
> ~Randy
>
* Re: [RFC][PATCH 9/11] memcg : mem+swap controlelr kconfig
2008-10-23 9:12 ` [RFC][PATCH 9/11] memcg : mem+swap controlelr kconfig KAMEZAWA Hiroyuki
2008-10-24 4:32 ` Randy Dunlap
@ 2008-10-27 6:39 ` Daisuke Nishimura
2008-10-27 7:17 ` Li Zefan
2008-10-28 0:08 ` KAMEZAWA Hiroyuki
1 sibling, 2 replies; 30+ messages in thread
From: Daisuke Nishimura @ 2008-10-27 6:39 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir, xemul, menage
On Thu, 23 Oct 2008 18:12:20 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Config and control variable for mem+swap controller.
>
> This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> (memory resource controller swap extension.)
>
> For accounting swap, it's obvious that we have to use additional memory
> to remember "who uses swap". This adds more overhead.
> So, it's better to offer "choice" to users. This patch adds 2 choices.
>
> This patch adds 2 parameters to enable swap extenstion or not.
> - CONFIG
> - boot option
>
> This version uses policy of "default is enable if configured."
> please tell me you dislike this. See patches following this in detail...
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
>
> Documentation/kernel-parameters.txt | 3 +++
> include/linux/memcontrol.h | 3 +++
> init/Kconfig | 16 ++++++++++++++++
> mm/memcontrol.c | 17 +++++++++++++++++
> 4 files changed, 39 insertions(+)
>
> Index: mmotm-2.6.27+/init/Kconfig
> ===================================================================
> --- mmotm-2.6.27+.orig/init/Kconfig
> +++ mmotm-2.6.27+/init/Kconfig
> @@ -613,6 +613,22 @@ config KALLSYMS_EXTRA_PASS
> reported. KALLSYMS_EXTRA_PASS is only a temporary workaround while
> you wait for kallsyms to be fixed.
>
> +config CGROUP_MEM_RES_CTLR_SWAP
> + bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
> + depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> + help
> + Add swap management feature to memory resource controller. When you
> + enable this, you can limit mem+swap usage per cgroup. In other words,
> + when you disable this, memory resource controller have no cares to
> + usage of swap...a process can exhaust the all swap. This extension
> + is useful when you want to avoid exhausion of swap but this itself
> + adds more overheads and consumes memory for remembering information.
> + Especially if you use 32bit system or small memory system,
> + please be careful to enable this. When memory resource controller
> + is disabled by boot option, this will be automatiaclly disabled and
> + there will be no overhead from this. Even when you set this config=y,
> + if boot option "noswapaccount" is set, swap will not be accounted.
> +
>
hmm... "cgroup_disable=memory" doesn't fully disable this feature.
Even if specifying "cgroup_disable=memory", memory for table of swap_cgroup
is allocated at swapon because "do_swap_account" is not turned off.
I think it can be turned off adding some codes at mem_cgroup_create() like:
===
@@ -1881,6 +1881,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
int node;
if (unlikely((cont->parent) == NULL)) {
+ if (mem_cgroup_subsys.disabled)
+ do_swap_account = 0;
mem = &init_mem_cgroup;
cpu_memcgroup_callback(&memcgroup_nb,
(unsigned long)CPU_UP_PREPARE,
===
BTW, is there any reason to call cgroup_init_subsys() even when the subsys
is disabled by boot option?
Thanks,
Daisuke Nishimura.
> config HOTPLUG
> bool "Support for hot-pluggable devices" if EMBEDDED
> Index: mmotm-2.6.27+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.27+.orig/mm/memcontrol.c
> +++ mmotm-2.6.27+/mm/memcontrol.c
> @@ -41,6 +41,13 @@
> struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> #define MEM_CGROUP_RECLAIM_RETRIES 5
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +int do_swap_account __read_mostly = 1;
> +#else
> +#define do_swap_account (0)
> +#endif
> +
> +
> /*
> * Statistics for memory cgroup.
> */
> @@ -1658,3 +1665,13 @@ struct cgroup_subsys mem_cgroup_subsys =
> .attach = mem_cgroup_move_task,
> .early_init = 0,
> };
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +
> +static int __init disable_swap_account(char *s)
> +{
> + do_swap_account = 0;
> + return 1;
> +}
> +__setup("noswapaccount", disable_swap_account);
> +#endif
> Index: mmotm-2.6.27+/Documentation/kernel-parameters.txt
> ===================================================================
> --- mmotm-2.6.27+.orig/Documentation/kernel-parameters.txt
> +++ mmotm-2.6.27+/Documentation/kernel-parameters.txt
> @@ -1540,6 +1540,9 @@ and is between 256 and 4096 characters.
>
> nosoftlockup [KNL] Disable the soft-lockup detector.
>
> + noswapaccount [KNL] Disable accounting of swap in memory resource
> + controller. (See Documentation/controllers/memory.txt)
> +
> nosync [HW,M68K] Disables sync negotiation for all devices.
>
> notsc [BUGS=X86-32] Disable Time Stamp Counter
> Index: mmotm-2.6.27+/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.27+.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.27+/include/linux/memcontrol.h
> @@ -80,6 +80,9 @@ extern void mem_cgroup_record_reclaim_pr
> extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
> int priority, enum lru_list lru);
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +extern int do_swap_account;
> +#endif
>
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
* Re: [RFC][PATCH 10/11] memcg: swap cgroup
2008-10-23 9:13 ` [RFC][PATCH 10/11] memcg: swap cgroup KAMEZAWA Hiroyuki
@ 2008-10-27 7:02 ` Daisuke Nishimura
2008-10-28 0:09 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 30+ messages in thread
From: Daisuke Nishimura @ 2008-10-27 7:02 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir, xemul, menage
> +int swap_cgroup_swapon(int type, unsigned long max_pages)
> +{
> + void *array;
> + unsigned long array_size;
> + unsigned long length;
> + struct swap_cgroup_ctrl *ctrl;
> +
> + if (!do_swap_account)
> + return 0;
> +
> + length = ((max_pages/SC_PER_PAGE) + 1);
> + array_size = length * sizeof(void *);
> +
> + array = vmalloc(array_size);
> + if (!array)
> + goto nomem;
> +
> + memset(array, 0, array_size);
> + ctrl = &swap_cgroup_ctrl[type];
> + mutex_lock(&swap_cgroup_mutex);
> + ctrl->length = length;
> + ctrl->map = array;
> + if (swap_cgroup_prepare(type)) {
> + /* memory shortage */
> + ctrl->map = NULL;
> + ctrl->length = 0;
> + vfree(array);
> + mutex_unlock(&swap_cgroup_mutex);
> + goto nomem;
> + }
> + mutex_unlock(&swap_cgroup_mutex);
> +
> + printk(KERN_INFO
> + "swap_cgroup: uses %ldbytes vmalloc and %ld bytes buffres\n",
just a minor nitpick, s/ldbytes/ld bytes.
Thanks,
Daisuke Nishimura.
* Re: [RFC][PATCH 9/11] memcg : mem+swap controlelr kconfig
2008-10-27 6:39 ` Daisuke Nishimura
@ 2008-10-27 7:17 ` Li Zefan
2008-10-27 7:24 ` Daisuke Nishimura
2008-10-28 0:08 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 30+ messages in thread
From: Li Zefan @ 2008-10-27 7:17 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: KAMEZAWA Hiroyuki, linux-mm, balbir, xemul, menage
> BTW, is there any reason to call cgroup_init_subsys() even when the subsys
> is disabled by boot option?
>
Yes, because cgroup_init_subsys() is called by both cgroup_init() and cgroup_init_early().
When cgroup_init_early() gets called, the boot params haven't been parsed yet, so we don't
know which subsystems are disabled at that time.
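For illustration, the rough boot-time ordering is something like the sketch below
(simplified from init/main.c; the exact call sites are written from memory, so
please take them as an assumption rather than the real code):
===
/*
 * start_kernel()
 * {
 *         ...
 *         cgroup_init_early();   -- early_init subsystems are set up here;
 *                                   "cgroup_disable=" has not been parsed yet.
 *         ...
 *         parse_args(...);       -- __setup("cgroup_disable=", ...) runs here
 *                                   and sets ss->disabled.
 *         ...
 *         cgroup_init();         -- the remaining subsystems are set up,
 *                                   now with ss->disabled already known.
 * }
 */
===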
* Re: [RFC][PATCH 9/11] memcg : mem+swap controlelr kconfig
2008-10-27 7:17 ` Li Zefan
@ 2008-10-27 7:24 ` Daisuke Nishimura
0 siblings, 0 replies; 30+ messages in thread
From: Daisuke Nishimura @ 2008-10-27 7:24 UTC (permalink / raw)
To: Li Zefan; +Cc: nishimura, KAMEZAWA Hiroyuki, linux-mm, balbir, xemul, menage
On Mon, 27 Oct 2008 15:17:07 +0800, Li Zefan <lizf@cn.fujitsu.com> wrote:
> > BTW, is there any reason to call cgroup_init_subsys() even when the subsys
> > is disabled by boot option?
> >
>
> Yes, because cgroup_init_subsys() is called by cgroup_init() and cgroup_init_early().
> When cgroup_init_earsy() gets called, the boot param hasn't been parsed, so we don't
> know which subsystems are disabled at that time.
>
Oh, I see.
Thank you for your explanation.
Daisuke Nishimura.
* Re: [RFC][PATCH 11/11] memcg: mem+swap controler core
2008-10-23 9:16 ` [RFC][PATCH 11/11] memcg: mem+swap controler core KAMEZAWA Hiroyuki
@ 2008-10-27 11:37 ` Daisuke Nishimura
2008-10-28 0:16 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 30+ messages in thread
From: Daisuke Nishimura @ 2008-10-27 11:37 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, xemul, menage, nishimura
On Thu, 23 Oct 2008 18:16:11 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Mem+Swap controller core.
>
> This patch implements per cgroup limit for usage of memory+swap.
> However there are SwapCache, double counting of swap-cache and
> swap-entry is avoided.
>
> Mem+Swap controller works as following.
> - memory usage is limited by memory.limit_in_bytes.
> - memory + swap usage is limited by memory.memsw_limit_in_bytes.
>
>
> This has following benefits.
> - A user can limit total resource usage of mem+swap.
>
> Without this, because memory resource controller doesn't take care of
> usage of swap, a process can exhaust all the swap (by memory leak.)
> We can avoid this case.
>
> And Swap is shared resource but it cannot be reclaimed (goes back to memory)
> until it's used. This characteristic can be trouble when the memory
> is divided into some parts by cpuset or memcg.
> Assume group A and group B.
> After some application executes, the system can be..
>
> Group A -- very large free memory space but occupy 99% of swap.
> Group B -- under memory shortage but cannot use swap...it's nearly full.
> This cannot be recovered in general.
> Ability to set appropriate swap limit for each group is required.
>
> Maybe someone wonder "why not swap but mem+swap ?"
>
> - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
> to move account from memory to swap...there is no change in usage of
> mem+swap.
>
> In other words, when we want to limit the usage of swap without affecting
> global LRU, mem+swap limit is better than just limiting swap.
>
>
> Accounting target information is stored in swap_cgroup which is
> per swap entry record.
>
> Charge is done as following.
> map
> - charge page and memsw.
>
> unmap
> - uncharge page/memsw if not SwapCache.
>
> swap-out (__delete_from_swap_cache)
> - uncharge page
> - record mem_cgroup information to swap_cgroup.
>
> swap-in (do_swap_page)
> - charged as page and memsw.
> record in swap_cgroup is cleared.
> memsw accounting is decremented.
>
> swap-free (swap_free())
> - if swap entry is freed, memsw is uncharged by PAGE_SIZE.
>
>
> After this, usual memory resource controller handles SwapCache.
> (It was lacked(ignored) feature in current memcg but must be
> handled.)
>
> There are people work under never-swap environments and consider swap as
> something bad. For such people, this mem+swap controller extension is just an
> overhead. This overhead is avoided by config or boot option.
> (see Kconfig. detail is not in this patch.)
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
>
> include/linux/memcontrol.h | 3
> include/linux/swap.h | 17 ++
> mm/memcontrol.c | 356 +++++++++++++++++++++++++++++++++++++++++----
> mm/swap_state.c | 4
> mm/swapfile.c | 8 -
> 5 files changed, 359 insertions(+), 29 deletions(-)
>
> Index: mmotm-2.6.27+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.27+.orig/mm/memcontrol.c
> +++ mmotm-2.6.27+/mm/memcontrol.c
> @@ -130,6 +130,10 @@ struct mem_cgroup {
> */
> struct res_counter res;
> /*
> + * the coutner to accoaunt for mem+swap usage.
> + */
> + struct res_counter memsw;
> + /*
> * Per cgroup active and inactive list, similar to the
> * per zone LRU lists.
> */
> @@ -140,6 +144,12 @@ struct mem_cgroup {
> * statistics.
> */
> struct mem_cgroup_stat stat;
> +
> + /*
> + * used for counting reference from swap_cgroup.
> + */
> + int obsolete;
> + atomic_t swapref;
> };
> static struct mem_cgroup init_mem_cgroup;
>
> @@ -148,6 +158,7 @@ enum charge_type {
> MEM_CGROUP_CHARGE_TYPE_MAPPED,
> MEM_CGROUP_CHARGE_TYPE_SHMEM, /* used by page migration of shmem */
> MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
> + MEM_CGROUP_CHARGE_TYPE_SWAPOUT, /* used by force_empty */
comment should be modified :)
> NR_CHARGE_TYPE,
> };
>
> @@ -165,6 +176,16 @@ pcg_default_flags[NR_CHARGE_TYPE] = {
> 0, /* FORCE */
> };
>
> +
> +/* for encoding cft->private value on file */
> +#define _MEM (0)
> +#define _MEMSWAP (1)
> +#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
> +#define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff)
> +#define MEMFILE_ATTR(val) ((val) & 0xffff)
> +
> +static void mem_cgroup_forget_swapref(struct mem_cgroup *mem);
> +
> /*
> * Always modified under lru lock. Then, not necessary to preempt_disable()
> */
> @@ -698,8 +719,19 @@ static int __mem_cgroup_try_charge(struc
> css_get(&mem->css);
> }
>
> + while (1) {
> + int ret;
>
> - while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
> + ret = res_counter_charge(&mem->res, PAGE_SIZE);
> + if (likely(!ret)) {
> + if (!do_swap_account)
> + break;
> + ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
> + if (likely(!ret))
> + break;
> + /* mem+swap counter fails */
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + }
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> @@ -712,8 +744,13 @@ static int __mem_cgroup_try_charge(struc
> * moved to swap cache or just unmapped from the cgroup.
> * Check the limit again to see if the reclaim reduced the
> * current usage of the cgroup before giving up
> + *
> */
> - if (res_counter_check_under_limit(&mem->res))
> + if (!do_swap_account &&
> + res_counter_check_under_limit(&mem->res))
> + continue;
> + if (do_swap_account &&
> + res_counter_check_under_limit(&mem->memsw))
> continue;
>
> if (!nr_retries--) {
> @@ -770,6 +807,8 @@ static void __mem_cgroup_commit_charge(s
> if (unlikely(PageCgroupUsed(pc))) {
> unlock_page_cgroup(pc);
> res_counter_uncharge(&mem->res, PAGE_SIZE);
> + if (do_swap_account)
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> css_put(&mem->css);
> return;
> }
> @@ -851,6 +890,8 @@ static int mem_cgroup_move_account(struc
> if (spin_trylock(&to_mz->lru_lock)) {
> __mem_cgroup_remove_list(from_mz, pc);
> res_counter_uncharge(&from->res, PAGE_SIZE);
> + if (do_swap_account)
> + res_counter_uncharge(&from->memsw, PAGE_SIZE);
> pc->mem_cgroup = to;
> __mem_cgroup_add_list(to_mz, pc, false);
> ret = 0;
> @@ -896,8 +937,11 @@ static int mem_cgroup_move_parent(struct
> /* drop extra refcnt */
> css_put(&parent->css);
> /* uncharge if move fails */
> - if (ret)
> + if (ret) {
> res_counter_uncharge(&parent->res, PAGE_SIZE);
> + if (do_swap_account)
> + res_counter_uncharge(&parent->memsw, PAGE_SIZE);
> + }
>
> return ret;
> }
> @@ -972,7 +1016,6 @@ int mem_cgroup_cache_charge(struct page
> if (!(gfp_mask & __GFP_WAIT)) {
> struct page_cgroup *pc;
>
> -
> pc = lookup_page_cgroup(page);
> if (!pc)
> return 0;
> @@ -998,15 +1041,74 @@ int mem_cgroup_cache_charge(struct page
> int mem_cgroup_cache_charge_swapin(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask)
> {
> + struct mem_cgroup *mem;
> + swp_entry_t ent;
> + int ret;
> +
> if (mem_cgroup_subsys.disabled)
> return 0;
> - if (unlikely(!mm))
> - mm = &init_mm;
> - return mem_cgroup_charge_common(page, mm, gfp_mask,
> +
> + ent.val = page_private(page);
> + if (do_swap_account) {
> + mem = lookup_swap_cgroup(ent);
> + if (!mem || mem->obsolete)
> + goto charge_cur_mm;
> + ret = mem_cgroup_charge_common(page, NULL, gfp_mask,
> + MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
> + } else {
> +charge_cur_mm:
> + if (unlikely(!mm))
> + mm = &init_mm;
> + ret = mem_cgroup_charge_common(page, mm, gfp_mask,
> MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
> + }
> + if (do_swap_account && !ret) {
> + /*
> + * At this point, we successfully charge both for mem and swap.
> + * fix this double counting, here.
> + */
> + mem = swap_cgroup_record(ent, NULL);
> + if (mem) {
> + /* If memcg is obsolete, memcg can be != ptr */
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> + mem_cgroup_forget_swapref(mem);
> + }
> + }
> + return ret;
> +}
> +
> +/*
> + * look into swap_cgroup and charge against mem_cgroup which does swapout
> + * if we can. If not, charge against current mm.
> + */
> +
> +int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> + struct page *page, gfp_t mask, struct mem_cgroup **ptr)
> +{
> + struct mem_cgroup *mem;
> + swp_entry_t ent;
> +
> + if (mem_cgroup_subsys.disabled)
> + return 0;
>
> + if (!do_swap_account)
> + goto charge_cur_mm;
> +
> + ent.val = page_private(page);
> +
> + mem = lookup_swap_cgroup(ent);
> + if (!mem || mem->obsolete)
> + goto charge_cur_mm;
> + *ptr = mem;
> + return __mem_cgroup_try_charge(NULL, mask, ptr, true);
> +charge_cur_mm:
> + if (unlikely(!mm))
> + mm = &init_mm;
> + return __mem_cgroup_try_charge(mm, mask, ptr, true);
> }
>
hmm... this function is not called from anywhere.
Should the mem_cgroup_try_charge() calls in do_swap_page() and unuse_pte()
be changed to mem_cgroup_try_charge_swapin()?
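If so, I guess the intended usage in do_swap_page() would be roughly like this
(just my sketch following the try/commit/cancel protocol of patch 3; the exact
call sites and error handling are an assumption):
===
        struct mem_cgroup *ptr = NULL;

        /* reserve the charge before setting up the pte */
        if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
                ret = VM_FAULT_OOM;
                goto out;
        }
        ...
        /* once the page is actually mapped, commit the reserved charge */
        mem_cgroup_commit_charge_swapin(page, ptr);
        ...
        /* on failure paths before mapping, drop the reservation */
        mem_cgroup_cancel_charge_swapin(ptr);
===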
> +
> +
> void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> {
> struct page_cgroup *pc;
> @@ -1017,6 +1119,23 @@ void mem_cgroup_commit_charge_swapin(str
> return;
> pc = lookup_page_cgroup(page);
> __mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> + /*
> + * Now swap is on-memory. This means this page may be
> + * counted both as mem and swap....double count.
> + * Fix it by uncharging from memsw. This SwapCache is stable
> + * because we're still under lock_page().
> + */
> + if (do_swap_account) {
> + swp_entry_t ent = {.val = page_private(page)};
> + struct mem_cgroup *memcg;
> + memcg = swap_cgroup_record(ent, NULL);
> + if (memcg) {
> + /* If memcg is obsolete, memcg can be != ptr */
> + res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> + mem_cgroup_forget_swapref(memcg);
> + }
> +
> + }
> }
>
> void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
> @@ -1026,35 +1145,50 @@ void mem_cgroup_cancel_charge_swapin(str
> if (!mem)
> return;
> res_counter_uncharge(&mem->res, PAGE_SIZE);
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> css_put(&mem->css);
> }
>
> -
> /*
> * uncharge if !page_mapped(page)
> + * returns memcg at success.
> */
> -static void
> +static struct mem_cgroup *
> __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> {
> struct page_cgroup *pc;
> struct mem_cgroup *mem;
>
> if (mem_cgroup_subsys.disabled)
> - return;
> + return NULL;
>
> + if (PageSwapCache(page))
> + return NULL;
> /*
> * Check if our page_cgroup is valid
> */
> pc = lookup_page_cgroup(page);
> if (unlikely(!pc || !PageCgroupUsed(pc)))
> - return;
> + return NULL;
>
> lock_page_cgroup(pc);
> + if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
> + if (PageAnon(page)) {
> + if (page_mapped(page)) {
> + unlock_page_cgroup(pc);
> + return NULL;
> + }
> + } else if (page->mapping && !page_is_file_cache(page)) {
> + /* This is on radix-tree. */
> + unlock_page_cgroup(pc);
> + return NULL;
> + }
> + }
> if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED && page_mapped(page))
> || !PageCgroupUsed(pc)) {
Isn't a check for PCG_USED also needed in the MEM_CGROUP_CHARGE_TYPE_SWAPOUT case?
> /* This happens at race in zap_pte_range() and do_swap_page()*/
> unlock_page_cgroup(pc);
> - return;
> + return NULL;
> }
> ClearPageCgroupUsed(pc);
> mem = pc->mem_cgroup;
> @@ -1063,9 +1197,11 @@ __mem_cgroup_uncharge_common(struct page
> * unlock this.
> */
> res_counter_uncharge(&mem->res, PAGE_SIZE);
> + if (do_swap_account && ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> unlock_page_cgroup(pc);
> release_page_cgroup(pc);
> - return;
> + return mem;
> }
>
Now that anon pages are not uncharged while they are PageSwapCache,
I think the "if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)" check at
mem_cgroup_end_migration() should be removed. Otherwise, oldpage
won't be uncharged if it is on the swap cache, will it?
Thanks,
Daisuke Nishimura.
> void mem_cgroup_uncharge_page(struct page *page)
> @@ -1086,6 +1222,41 @@ void mem_cgroup_uncharge_cache_page(stru
> }
>
> /*
> + * called from __delete_from_swap_cache() and drop "page" account.
> + * memcg information is recorded to swap_cgroup of "ent"
> + */
> +void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> +{
> + struct mem_cgroup *memcg;
> +
> + memcg = __mem_cgroup_uncharge_common(page,
> + MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
> + /* record memcg information */
> + if (do_swap_account && memcg) {
> + swap_cgroup_record(ent, memcg);
> + atomic_inc(&memcg->swapref);
> + }
> +}
> +
> +/*
> + * called from swap_entry_free(). remove record in swap_cgroup and
> + * uncharge "memsw" account.
> + */
> +void mem_cgroup_uncharge_swap(swp_entry_t ent)
> +{
> + struct mem_cgroup *memcg;
> +
> + if (!do_swap_account)
> + return;
> +
> + memcg = swap_cgroup_record(ent, NULL);
> + if (memcg) {
> + res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> + mem_cgroup_forget_swapref(memcg);
> + }
> +}
> +
> +/*
> * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
> * page belongs to.
> */
> @@ -1219,13 +1390,56 @@ int mem_cgroup_resize_limit(struct mem_c
> ret = -EBUSY;
> break;
> }
> - progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL);
> + progress = try_to_free_mem_cgroup_pages(memcg,
> + GFP_KERNEL);
> if (!progress)
> retry_count--;
> }
> return ret;
> }
>
> +int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> + unsigned long long val)
> +{
> + int retry_count = MEM_CGROUP_RECLAIM_RETRIES;
> + unsigned long flags;
> + u64 memlimit, oldusage, curusage;
> + int ret;
> +
> + if (!do_swap_account)
> + return -EINVAL;
> +
> + while (retry_count) {
> + if (signal_pending(current)) {
> + ret = -EINTR;
> + break;
> + }
> + /*
> + * Rather than hide all in some function, I do this in
> + * open coded mannaer. You see what this really does.
> + * We have to guarantee mem->res.limit < mem->memsw.limit.
> + */
> + spin_lock_irqsave(&memcg->res.lock, flags);
> + memlimit = memcg->res.limit;
> + if (memlimit > val) {
> + spin_unlock_irqrestore(&memcg->res.lock, flags);
> + ret = -EINVAL;
> + break;
> + }
> + ret = res_counter_set_limit(&memcg->memsw, val);
> + oldusage = memcg->memsw.usage;
> + spin_unlock_irqrestore(&memcg->res.lock, flags);
> +
> + if (!ret)
> + break;
> + try_to_free_mem_cgroup_pages(memcg, GFP_HIGHUSER_MOVABLE);
> + curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> + if (curusage >= oldusage)
> + retry_count--;
> + }
> + return ret;
> +}
> +
>
> /*
> * This routine traverse page_cgroup in given list and drop them all.
> @@ -1353,8 +1567,25 @@ try_to_free:
>
> static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
> {
> - return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
> - cft->private);
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> + u64 val = 0;
> + int type, name;
> +
> + type = MEMFILE_TYPE(cft->private);
> + name = MEMFILE_ATTR(cft->private);
> + switch (type) {
> + case _MEM:
> + val = res_counter_read_u64(&mem->res, name);
> + break;
> + case _MEMSWAP:
> + if (do_swap_account)
> + val = res_counter_read_u64(&mem->memsw, name);
> + break;
> + default:
> + BUG();
> + break;
> + }
> + return val;
> }
> /*
> * The user of this function is...
> @@ -1364,15 +1595,22 @@ static int mem_cgroup_write(struct cgrou
> const char *buffer)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> + int type, name;
> unsigned long long val;
> int ret;
>
> - switch (cft->private) {
> + type = MEMFILE_TYPE(cft->private);
> + name = MEMFILE_ATTR(cft->private);
> + switch (name) {
> case RES_LIMIT:
> /* This function does all necessary parse...reuse it */
> ret = res_counter_memparse_write_strategy(buffer, &val);
> - if (!ret)
> + if (ret)
> + break;
> + if (type == _MEM)
> ret = mem_cgroup_resize_limit(memcg, val);
> + else
> + ret = mem_cgroup_resize_memsw_limit(memcg, val);
> break;
> default:
> ret = -EINVAL; /* should be BUG() ? */
> @@ -1384,14 +1622,23 @@ static int mem_cgroup_write(struct cgrou
> static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
> {
> struct mem_cgroup *mem;
> + int type, name;
>
> mem = mem_cgroup_from_cont(cont);
> - switch (event) {
> + type = MEMFILE_TYPE(event);
> + name = MEMFILE_ATTR(event);
> + switch (name) {
> case RES_MAX_USAGE:
> - res_counter_reset_max(&mem->res);
> + if (type == _MEM)
> + res_counter_reset_max(&mem->res);
> + else
> + res_counter_reset_max(&mem->memsw);
> break;
> case RES_FAILCNT:
> - res_counter_reset_failcnt(&mem->res);
> + if (type == _MEM)
> + res_counter_reset_failcnt(&mem->res);
> + else
> + res_counter_reset_failcnt(&mem->memsw);
> break;
> }
> return 0;
> @@ -1445,30 +1692,33 @@ static int mem_control_stat_show(struct
> cb->fill(cb, "unevictable", unevictable * PAGE_SIZE);
>
> }
> + /* showing refs from disk-swap */
> + cb->fill(cb, "swap_on_disk", atomic_read(&mem_cont->swapref)
> + * PAGE_SIZE);
> return 0;
> }
>
> static struct cftype mem_cgroup_files[] = {
> {
> .name = "usage_in_bytes",
> - .private = RES_USAGE,
> + .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> .read_u64 = mem_cgroup_read,
> },
> {
> .name = "max_usage_in_bytes",
> - .private = RES_MAX_USAGE,
> + .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
> .trigger = mem_cgroup_reset,
> .read_u64 = mem_cgroup_read,
> },
> {
> .name = "limit_in_bytes",
> - .private = RES_LIMIT,
> + .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
> .write_string = mem_cgroup_write,
> .read_u64 = mem_cgroup_read,
> },
> {
> .name = "failcnt",
> - .private = RES_FAILCNT,
> + .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
> .trigger = mem_cgroup_reset,
> .read_u64 = mem_cgroup_read,
> },
> @@ -1476,6 +1726,31 @@ static struct cftype mem_cgroup_files[]
> .name = "stat",
> .read_map = mem_control_stat_show,
> },
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> + {
> + .name = "memsw.usage_in_bytes",
> + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
> + .read_u64 = mem_cgroup_read,
> + },
> + {
> + .name = "memsw.max_usage_in_bytes",
> + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
> + .trigger = mem_cgroup_reset,
> + .read_u64 = mem_cgroup_read,
> + },
> + {
> + .name = "memsw.limit_in_bytes",
> + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
> + .write_string = mem_cgroup_write,
> + .read_u64 = mem_cgroup_read,
> + },
> + {
> + .name = "failcnt",
> + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
> + .trigger = mem_cgroup_reset,
> + .read_u64 = mem_cgroup_read,
> + },
> +#endif
> };
>
> static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> @@ -1529,14 +1804,43 @@ static struct mem_cgroup *mem_cgroup_all
> return mem;
> }
>
> +/*
> + * At destroying mem_cgroup, references from swap_cgroup can remain.
> + * (scanning all at force_empty is too costly...)
> + *
> + * Instead of clearing all references at force_empty, we remember
> + * the number of reference from swap_cgroup and free mem_cgroup when
> + * it goes down to 0.
> + *
> + * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and
> + * entry which points to this memcg will be ignore at swapin.
> + *
> + * Removal of cgroup itself succeeds regardless of refs from swap.
> + */
> +
> static void mem_cgroup_free(struct mem_cgroup *mem)
> {
> + if (do_swap_account) {
> + if (atomic_read(&mem->swapref) > 0)
> + return;
> + }
> if (sizeof(*mem) < PAGE_SIZE)
> kfree(mem);
> else
> vfree(mem);
> }
>
> +static void mem_cgroup_forget_swapref(struct mem_cgroup *mem)
> +{
> + if (!do_swap_account)
> + return;
> + if (atomic_dec_and_test(&mem->swapref)) {
> + if (!mem->obsolete)
> + return;
> + mem_cgroup_free(mem);
> + }
> +}
> +
> static void mem_cgroup_init_pcp(int cpu)
> {
> page_cgroup_start_cache_cpu(cpu);
> @@ -1589,6 +1893,7 @@ mem_cgroup_create(struct cgroup_subsys *
> }
>
> res_counter_init(&mem->res);
> + res_counter_init(&mem->memsw);
>
> for_each_node_state(node, N_POSSIBLE)
> if (alloc_mem_cgroup_per_zone_info(mem, node))
> @@ -1607,6 +1912,7 @@ static void mem_cgroup_pre_destroy(struc
> struct cgroup *cont)
> {
> struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> + mem->obsolete = 1;
> mem_cgroup_force_empty(mem);
> }
>
> Index: mmotm-2.6.27+/mm/swapfile.c
> ===================================================================
> --- mmotm-2.6.27+.orig/mm/swapfile.c
> +++ mmotm-2.6.27+/mm/swapfile.c
> @@ -271,8 +271,9 @@ out:
> return NULL;
> }
>
> -static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
> +static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
> {
> + unsigned long offset = swp_offset(ent);
> int count = p->swap_map[offset];
>
> if (count < SWAP_MAP_MAX) {
> @@ -287,6 +288,7 @@ static int swap_entry_free(struct swap_i
> swap_list.next = p - swap_info;
> nr_swap_pages++;
> p->inuse_pages--;
> + mem_cgroup_uncharge_swap(ent);
> }
> }
> return count;
> @@ -302,7 +304,7 @@ void swap_free(swp_entry_t entry)
>
> p = swap_info_get(entry);
> if (p) {
> - swap_entry_free(p, swp_offset(entry));
> + swap_entry_free(p, entry);
> spin_unlock(&swap_lock);
> }
> }
> @@ -421,7 +423,7 @@ void free_swap_and_cache(swp_entry_t ent
>
> p = swap_info_get(entry);
> if (p) {
> - if (swap_entry_free(p, swp_offset(entry)) == 1) {
> + if (swap_entry_free(p, entry) == 1) {
> page = find_get_page(&swapper_space, entry.val);
> if (page && !trylock_page(page)) {
> page_cache_release(page);
> Index: mmotm-2.6.27+/mm/swap_state.c
> ===================================================================
> --- mmotm-2.6.27+.orig/mm/swap_state.c
> +++ mmotm-2.6.27+/mm/swap_state.c
> @@ -17,6 +17,7 @@
> #include <linux/backing-dev.h>
> #include <linux/pagevec.h>
> #include <linux/migrate.h>
> +#include <linux/page_cgroup.h>
>
> #include <asm/pgtable.h>
>
> @@ -108,6 +109,8 @@ int add_to_swap_cache(struct page *page,
> */
> void __delete_from_swap_cache(struct page *page)
> {
> + swp_entry_t ent = {.val = page_private(page)};
> +
> BUG_ON(!PageLocked(page));
> BUG_ON(!PageSwapCache(page));
> BUG_ON(PageWriteback(page));
> @@ -119,6 +122,7 @@ void __delete_from_swap_cache(struct pag
> total_swapcache_pages--;
> __dec_zone_page_state(page, NR_FILE_PAGES);
> INC_CACHE_INFO(del_total);
> + mem_cgroup_uncharge_swapcache(page, ent);
> }
>
> /**
> Index: mmotm-2.6.27+/include/linux/swap.h
> ===================================================================
> --- mmotm-2.6.27+.orig/include/linux/swap.h
> +++ mmotm-2.6.27+/include/linux/swap.h
> @@ -332,6 +332,23 @@ static inline void disable_swap_token(vo
> put_swap_token(swap_token_mm);
> }
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/* This function requires swp_entry_t definition. see memcontrol.c */
> +extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
> +#else
> +static inline void
> +mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> +{
> +}
> +#endif
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> +#else
> +static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
> +{
> +}
> +#endif
> +
> #else /* CONFIG_SWAP */
>
> #define total_swap_pages 0
> Index: mmotm-2.6.27+/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.27+.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.27+/include/linux/memcontrol.h
> @@ -32,6 +32,8 @@ extern int mem_cgroup_newpage_charge(str
> /* for swap handling */
> extern int mem_cgroup_try_charge(struct mm_struct *mm,
> gfp_t gfp_mask, struct mem_cgroup **ptr);
> +extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> + struct page *page, gfp_t mask, struct mem_cgroup **ptr);
> extern void mem_cgroup_commit_charge_swapin(struct page *page,
> struct mem_cgroup *ptr);
> extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
> @@ -83,7 +85,6 @@ extern long mem_cgroup_calc_reclaim(stru
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> extern int do_swap_account;
> #endif
> -
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
>
* Re: [RFC][PATCH 9/11] memcg : mem+swap controlelr kconfig
2008-10-27 6:39 ` Daisuke Nishimura
2008-10-27 7:17 ` Li Zefan
@ 2008-10-28 0:08 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-28 0:08 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir, xemul, menage
On Mon, 27 Oct 2008 15:39:11 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Thu, 23 Oct 2008 18:12:20 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Config and control variable for mem+swap controller.
> >
> > This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > (memory resource controller swap extension.)
> >
> > For accounting swap, it's obvious that we have to use additional memory
> > to remember "who uses swap". This adds more overhead.
> > So, it's better to offer "choice" to users. This patch adds 2 choices.
> >
> > This patch adds 2 parameters to enable swap extenstion or not.
> > - CONFIG
> > - boot option
> >
> > This version uses policy of "default is enable if configured."
> > please tell me you dislike this. See patches following this in detail...
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> >
> > Documentation/kernel-parameters.txt | 3 +++
> > include/linux/memcontrol.h | 3 +++
> > init/Kconfig | 16 ++++++++++++++++
> > mm/memcontrol.c | 17 +++++++++++++++++
> > 4 files changed, 39 insertions(+)
> >
> > Index: mmotm-2.6.27+/init/Kconfig
> > ===================================================================
> > --- mmotm-2.6.27+.orig/init/Kconfig
> > +++ mmotm-2.6.27+/init/Kconfig
> > @@ -613,6 +613,22 @@ config KALLSYMS_EXTRA_PASS
> > reported. KALLSYMS_EXTRA_PASS is only a temporary workaround while
> > you wait for kallsyms to be fixed.
> >
> > +config CGROUP_MEM_RES_CTLR_SWAP
> > + bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
> > + depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> > + help
> > + Add swap management feature to memory resource controller. When you
> > + enable this, you can limit mem+swap usage per cgroup. In other words,
> > + when you disable this, memory resource controller have no cares to
> > + usage of swap...a process can exhaust the all swap. This extension
> > + is useful when you want to avoid exhausion of swap but this itself
> > + adds more overheads and consumes memory for remembering information.
> > + Especially if you use 32bit system or small memory system,
> > + please be careful to enable this. When memory resource controller
> > + is disabled by boot option, this will be automatiaclly disabled and
> > + there will be no overhead from this. Even when you set this config=y,
> > + if boot option "noswapaccount" is set, swap will not be accounted.
> > +
> >
> hmm... "cgroup_disable=memory" doesn't fully disable this feature.
>
> Even if specifying "cgroup_disable=memory", memory for table of swap_cgroup
> is allocated at swapon because "do_swap_account" is not turned off.
>
Ah, OK. I see. Will be fixed.
Thanks,
-Kame
> I think it can be turned off adding some codes at mem_cgroup_create() like:
>
> ===
> @@ -1881,6 +1881,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> int node;
>
> if (unlikely((cont->parent) == NULL)) {
> + if (mem_cgroup_subsys.disabled)
> + do_swap_account = 0;
> mem = &init_mem_cgroup;
> cpu_memcgroup_callback(&memcgroup_nb,
> (unsigned long)CPU_UP_PREPARE,
> ===
>
> BTW, is there any reason to call cgroup_init_subsys() even when the subsys
> is disabled by boot option?
>
>
> Thanks,
> Daisuke Nishimura.
>
> > config HOTPLUG
> > bool "Support for hot-pluggable devices" if EMBEDDED
> > Index: mmotm-2.6.27+/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.27+.orig/mm/memcontrol.c
> > +++ mmotm-2.6.27+/mm/memcontrol.c
> > @@ -41,6 +41,13 @@
> > struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> > #define MEM_CGROUP_RECLAIM_RETRIES 5
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +int do_swap_account __read_mostly = 1;
> > +#else
> > +#define do_swap_account (0)
> > +#endif
> > +
> > +
> > /*
> > * Statistics for memory cgroup.
> > */
> > @@ -1658,3 +1665,13 @@ struct cgroup_subsys mem_cgroup_subsys =
> > .attach = mem_cgroup_move_task,
> > .early_init = 0,
> > };
> > +
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +
> > +static int __init disable_swap_account(char *s)
> > +{
> > + do_swap_account = 0;
> > + return 1;
> > +}
> > +__setup("noswapaccount", disable_swap_account);
> > +#endif
> > Index: mmotm-2.6.27+/Documentation/kernel-parameters.txt
> > ===================================================================
> > --- mmotm-2.6.27+.orig/Documentation/kernel-parameters.txt
> > +++ mmotm-2.6.27+/Documentation/kernel-parameters.txt
> > @@ -1540,6 +1540,9 @@ and is between 256 and 4096 characters.
> >
> > nosoftlockup [KNL] Disable the soft-lockup detector.
> >
> > + noswapaccount [KNL] Disable accounting of swap in memory resource
> > + controller. (See Documentation/controllers/memory.txt)
> > +
> > nosync [HW,M68K] Disables sync negotiation for all devices.
> >
> > notsc [BUGS=X86-32] Disable Time Stamp Counter
> > Index: mmotm-2.6.27+/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-2.6.27+.orig/include/linux/memcontrol.h
> > +++ mmotm-2.6.27+/include/linux/memcontrol.h
> > @@ -80,6 +80,9 @@ extern void mem_cgroup_record_reclaim_pr
> > extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
> > int priority, enum lru_list lru);
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +extern int do_swap_account;
> > +#endif
> >
> > #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> > struct mem_cgroup;
> >
>
* Re: [RFC][PATCH 10/11] memcg: swap cgroup
2008-10-27 7:02 ` Daisuke Nishimura
@ 2008-10-28 0:09 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-28 0:09 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir, xemul, menage
On Mon, 27 Oct 2008 16:02:16 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > + memset(array, 0, array_size);
> > + ctrl = &swap_cgroup_ctrl[type];
> > + mutex_lock(&swap_cgroup_mutex);
> > + ctrl->length = length;
> > + ctrl->map = array;
> > + if (swap_cgroup_prepare(type)) {
> > + /* memory shortage */
> > + ctrl->map = NULL;
> > + ctrl->length = 0;
> > + vfree(array);
> > + mutex_unlock(&swap_cgroup_mutex);
> > + goto nomem;
> > + }
> > + mutex_unlock(&swap_cgroup_mutex);
> > +
> > + printk(KERN_INFO
> > + "swap_cgroup: uses %ldbytes vmalloc and %ld bytes buffres\n",
> just a minor nitpick, s/ldbytes/ld bytes.
>
Yes, thank you for the review.
-Kame
>
> Thanks,
> Daisuke Nishimura.
* Re: [RFC][PATCH 11/11] memcg: mem+swap controler core
2008-10-27 11:37 ` Daisuke Nishimura
@ 2008-10-28 0:16 ` KAMEZAWA Hiroyuki
2008-10-28 2:06 ` Daisuke Nishimura
0 siblings, 1 reply; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-28 0:16 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir, xemul, menage
On Mon, 27 Oct 2008 20:37:51 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Thu, 23 Oct 2008 18:16:11 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > static struct mem_cgroup init_mem_cgroup;
> >
> > @@ -148,6 +158,7 @@ enum charge_type {
> > MEM_CGROUP_CHARGE_TYPE_MAPPED,
> > MEM_CGROUP_CHARGE_TYPE_SHMEM, /* used by page migration of shmem */
> > MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
> > + MEM_CGROUP_CHARGE_TYPE_SWAPOUT, /* used by force_empty */
> comment should be modified :)
>
sure.
> > NR_CHARGE_TYPE,
> > };
<snip>
> > +int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> > + struct page *page, gfp_t mask, struct mem_cgroup **ptr)
> > +{
> > + struct mem_cgroup *mem;
> > + swp_entry_t ent;
> > +
> > + if (mem_cgroup_subsys.disabled)
> > + return 0;
> >
> > + if (!do_swap_account)
> > + goto charge_cur_mm;
> > +
> > + ent.val = page_private(page);
> > +
> > + mem = lookup_swap_cgroup(ent);
> > + if (!mem || mem->obsolete)
> > + goto charge_cur_mm;
> > + *ptr = mem;
> > + return __mem_cgroup_try_charge(NULL, mask, ptr, true);
> > +charge_cur_mm:
> > + if (unlikely(!mm))
> > + mm = &init_mm;
> > + return __mem_cgroup_try_charge(mm, mask, ptr, true);
> > }
> >
> hmm... this function is not called from any functions.
> Should do_swap_page()->mem_cgroup_try_charge() and unuse_pte()->mem_cgroup_try_charge()
> are changed to mem_cgroup_try_charge_swapin()?
>
Yes. Hmm... the patch order is confusing? I'll look into it again.
> > lock_page_cgroup(pc);
> > + if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
> > + if (PageAnon(page)) {
> > + if (page_mapped(page)) {
> > + unlock_page_cgroup(pc);
> > + return NULL;
> > + }
> > + } else if (page->mapping && !page_is_file_cache(page)) {
> > + /* This is on radix-tree. */
> > + unlock_page_cgroup(pc);
> > + return NULL;
> > + }
> > + }
> > if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED && page_mapped(page))
> > || !PageCgroupUsed(pc)) {
> Isn't check for PCG_USED needed when MEM_CGROUP_CHARGE_TYPE_SWAPOUT?
>
Ah, that seems problematic. Thanks.
> > /* This happens at race in zap_pte_range() and do_swap_page()*/
> > unlock_page_cgroup(pc);
> > - return;
> > + return NULL;
> > }
> > ClearPageCgroupUsed(pc);
> > mem = pc->mem_cgroup;
> > @@ -1063,9 +1197,11 @@ __mem_cgroup_uncharge_common(struct page
> > * unlock this.
> > */
> > res_counter_uncharge(&mem->res, PAGE_SIZE);
> > + if (do_swap_account && ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > unlock_page_cgroup(pc);
> > release_page_cgroup(pc);
> > - return;
> > + return mem;
> > }
> >
> Now, anon pages are not uncharge if PageSwapCache,
> I think "if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)" at
> mem_cgroup_end_migration() should be removed. Otherwise oldpage
> is not uncharged if it is on swapcache, isn't it?
>
oldpage's swapcache bit is dropped at that stage.
I'll add a comment.
Thank you for review.
-Kame
* Re: [RFC][PATCH 11/11] memcg: mem+swap controler core
2008-10-28 0:16 ` KAMEZAWA Hiroyuki
@ 2008-10-28 2:06 ` Daisuke Nishimura
2008-10-28 2:30 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 30+ messages in thread
From: Daisuke Nishimura @ 2008-10-28 2:06 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir, xemul, menage
> > > /* This happens at race in zap_pte_range() and do_swap_page()*/
> > > unlock_page_cgroup(pc);
> > > - return;
> > > + return NULL;
> > > }
> > > ClearPageCgroupUsed(pc);
> > > mem = pc->mem_cgroup;
> > > @@ -1063,9 +1197,11 @@ __mem_cgroup_uncharge_common(struct page
> > > * unlock this.
> > > */
> > > res_counter_uncharge(&mem->res, PAGE_SIZE);
> > > + if (do_swap_account && ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > > + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > > unlock_page_cgroup(pc);
> > > release_page_cgroup(pc);
> > > - return;
> > > + return mem;
> > > }
> > >
> > Now, anon pages are not uncharge if PageSwapCache,
> > I think "if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)" at
> > mem_cgroup_end_migration() should be removed. Otherwise oldpage
> > is not uncharged if it is on swapcache, isn't it?
> >
> oldpage's swapcache bit is dropped at that stage.
> I'll add comment.
>
I'm sorry if I'm misunderstanding something.
Oldpage (anon on swapcache) isn't uncharged via try_to_unmap(),
and its ctype is MEM_CGROUP_CHARGE_TYPE_MAPPED, so
__mem_cgroup_uncharge_common() is not called at mem_cgroup_end_migration().
I think the "if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)" check in
mem_cgroup_end_migration() is not needed; the PCG_USED flag prevents double uncharging.
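In other words, something like this (only a sketch of the tail of
mem_cgroup_end_migration(); the surrounding code and arguments are my
assumption, only the condition itself is taken from the patch):
===
        /* current: a swapcache'd anon oldpage is skipped here */
        if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
                __mem_cgroup_uncharge_common(unused, ctype);

        /* suggested: always try; the PCG_USED check inside
           __mem_cgroup_uncharge_common() prevents double uncharge */
        if (unused)
                __mem_cgroup_uncharge_common(unused, ctype);
===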
Thanks,
Daisuke Nishimura.
* Re: [RFC][PATCH 11/11] memcg: mem+swap controler core
2008-10-28 2:06 ` Daisuke Nishimura
@ 2008-10-28 2:30 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-28 2:30 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, balbir, xemul, menage
On Tue, 28 Oct 2008 11:06:05 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > > /* This happens at race in zap_pte_range() and do_swap_page()*/
> > > > unlock_page_cgroup(pc);
> > > > - return;
> > > > + return NULL;
> > > > }
> > > > ClearPageCgroupUsed(pc);
> > > > mem = pc->mem_cgroup;
> > > > @@ -1063,9 +1197,11 @@ __mem_cgroup_uncharge_common(struct page
> > > > * unlock this.
> > > > */
> > > > res_counter_uncharge(&mem->res, PAGE_SIZE);
> > > > + if (do_swap_account && ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > > > + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > > > unlock_page_cgroup(pc);
> > > > release_page_cgroup(pc);
> > > > - return;
> > > > + return mem;
> > > > }
> > > >
> > > Now, anon pages are not uncharged if they are PageSwapCache.
> > > I think "if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)" at
> > > mem_cgroup_end_migration() should be removed. Otherwise, oldpage
> > > is not uncharged if it is on swapcache, is it?
> > >
> > oldpage's swapcache bit is dropped at that stage.
> > I'll add a comment.
> >
> I'm sorry if I'm misunderstanding something.
>
> Oldpage (anon on swapcache) isn't uncharged via try_to_unmap(),
yes.
> and its ctype is MEM_CGROUP_CHARGE_TYPE_MAPPED, so
yes.
> __mem_cgroup_uncharge_common() is not called at mem_cgroup_end_migration().
>
Ah, I see.
> I think "if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)" in
> mem_cgroup_end_migration() is not needed; the PCG_USED flag prevents double uncharging.
>
will fix.
Thanks,
-Kame
>
> Thanks,
> Daisuke Nishimura.
>
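
What the "will fix" above presumably amounts to is dropping the ctype test from the condition quoted earlier in the thread, so that the unused old page is uncharged regardless of its charge type and only the PCG_USED check decides whether anything actually happens. Roughly, as a sketch based on the quoted condition rather than the actual follow-up patch (the __mem_cgroup_uncharge_common() call underneath is unchanged):

-	if (unused && ctype != MEM_CGROUP_CHARGE_TYPE_MAPPED)
+	if (unused)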
^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread
Thread overview: 30+ messages
2008-10-23 8:58 [RFC][PATCH 0/11] memcg updates / clean up, lazy lru ,mem+swap controller KAMEZAWA Hiroyuki
2008-10-23 8:59 ` [RFC][PATCH 1/11] memcg: fix kconfig menu comment KAMEZAWA Hiroyuki
2008-10-24 4:24 ` Randy Dunlap
2008-10-24 4:28 ` KAMEZAWA Hiroyuki
2008-10-23 9:00 ` [RFC][PATCH 2/11] cgroup: make cgroup kconfig as submenu KAMEZAWA Hiroyuki
2008-10-23 21:20 ` Paul Menage
2008-10-24 1:16 ` KAMEZAWA Hiroyuki
2008-10-23 9:02 ` [RFC][PATCH 3/11] memcg: charge commit cancel protocol KAMEZAWA Hiroyuki
2008-10-23 9:03 ` [RFC][PATCH 4/11] memcg: better page migration handling KAMEZAWA Hiroyuki
2008-10-23 9:05 ` [RFC][PATCH 5/11] memcg: account move and change force_empty KAMEZAWA Hiroyuki
2008-10-24 4:28 ` Randy Dunlap
2008-10-24 4:37 ` KAMEZAWA Hiroyuki
2008-10-23 9:06 ` [RFC][PATCH 6/11] memcg: lazy LRU removal KAMEZAWA Hiroyuki
2008-10-23 9:08 ` [RFC][PATCH 7/11] memcg: lazy lru add KAMEZAWA Hiroyuki
2008-10-23 9:10 ` [RFC][PATCH 8/11] memcg: shmem account helper KAMEZAWA Hiroyuki
2008-10-23 9:12 ` [RFC][PATCH 9/11] memcg: mem+swap controller kconfig KAMEZAWA Hiroyuki
2008-10-24 4:32 ` Randy Dunlap
2008-10-24 4:37 ` KAMEZAWA Hiroyuki
2008-10-27 6:39 ` Daisuke Nishimura
2008-10-27 7:17 ` Li Zefan
2008-10-27 7:24 ` Daisuke Nishimura
2008-10-28 0:08 ` KAMEZAWA Hiroyuki
2008-10-23 9:13 ` [RFC][PATCH 10/11] memcg: swap cgroup KAMEZAWA Hiroyuki
2008-10-27 7:02 ` Daisuke Nishimura
2008-10-28 0:09 ` KAMEZAWA Hiroyuki
2008-10-23 9:16 ` [RFC][PATCH 11/11] memcg: mem+swap controller core KAMEZAWA Hiroyuki
2008-10-27 11:37 ` Daisuke Nishimura
2008-10-28 0:16 ` KAMEZAWA Hiroyuki
2008-10-28 2:06 ` Daisuke Nishimura
2008-10-28 2:30 ` KAMEZAWA Hiroyuki