linux-mm.kvack.org archive mirror
* [PATCH 0/5 -mm][memcg] towards I/O aware memory cgroup v4
@ 2010-08-05  9:44 KAMEZAWA Hiroyuki
  2010-08-05  9:57 ` [PATCH 1/4 -mm][memcg] quick ID lookup in memcg KAMEZAWA Hiroyuki
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-05  9:44 UTC (permalink / raw)
  To: linux-mm; +Cc: balbir, nishimura, vgoyal, m-ikeda, gthelen, akpm, linux-kernel

This is v4.

Major changes from v3 are:
 - dropped the per-page_cgroup spinlock (leaving 4 bytes of free space)
 - added more comments
 - clean-ups


This set has 2 purposes:
 1. redesign struct page_cgroup and make room for a blkio-cgroup ID.
 2. implement a quick update method for memcg's file statistics.

Next steps (not covered by this set):
 1. Check the influence of Mel's new writeback method.
    I think we'll hit OOM more easily. IIUC, memory cgroup needs a kswapd-like
    thread to do background writeback, or low/high watermarks.
    (With that, we can control the priority of the background writeout thread
     via CFS. This is very good.)

 2. Implement dirty_ratio.
    Greg Thelen is working on this now. One of the biggest problems with the
    previous attempt was the update cost of the statistics. I think this patch
    set can reduce it.

About reducing the size of struct page_cgroup:
We have several choices, so please wait.
I don't think packing the memcg ID and blkio ID into pc->flags is a good choice.
(I'm afraid of races and of the new limitations that would add.)

One idea: the free bits in pc->flags can be used... but it's better to store
          a very stable/static value there. For example, store the pfn,
          section ID, or node ID. Then we can remove the pc->page pointer:
          considering page_cgroup's usage, page->page_cgroup is a common
          operation but page_cgroup->page is not (it is only used at memory
          reclaim). So we can implement a page_cgroup_to_page() function and
          drop the page_cgroup->page pointer, using the free space in
          pc->flags for the lookup.
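
For illustration only, the lookup could look roughly like the sketch below.
The bit layout and the helper are hypothetical; nothing in this series
implements them, and __init_page_cgroup() would have to store the pfn (or a
section/node ID) in the upper bits of pc->flags once, at boot/hotplug time:

/*
 * Hypothetical sketch, not part of this series.  Assumes the pfn is
 * stored once in the upper bits of pc->flags and never changed, so
 * reading it needs no locking.
 */
#define PCG_PFN_SHIFT	16	/* low bits keep the existing PCG_* flags */

static inline struct page *page_cgroup_to_page(struct page_cgroup *pc)
{
	unsigned long pfn = pc->flags >> PCG_PFN_SHIFT;

	return pfn_to_page(pfn);
}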

After the MM Summit, I have another event until Aug 16 and will be busy for a
while (and cannot read my office mailbox). If you want to contact me, please
e-mail kahi at mte biglobe ne jp.

Thanks,
-Kame





* [PATCH 1/4 -mm][memcg] quick ID lookup in memcg
  2010-08-05  9:44 [PATCH 0/5 -mm][memcg] towards I/O aware memory cgroup v4 KAMEZAWA Hiroyuki
@ 2010-08-05  9:57 ` KAMEZAWA Hiroyuki
  2010-08-06  4:12   ` Greg Thelen
  2010-08-05  9:57 ` [PATCH 2/4 -mm][memcg] use id in page cgroup KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-05  9:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, vgoyal, m-ikeda, gthelen, akpm,
	linux-kernel

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, memory cgroup has an ID per cgroup and makes use of it for
 - hierarchy walks,
 - swap recording.

This patch is for making more use of it. The final purpose is
to replace page_cgroup->mem_cgroup's pointer with an unsigned short.

This patch caches memcg pointers in an array indexed by ID. With this, we
don't have to call css_lookup(), which requires a radix-tree walk, so
looking up a memcg via its ID becomes cheap.
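(For reference: the lookup array below costs one pointer per possible group,
 i.e. (CONFIG_MEM_CGROUP_MAX_GROUPS + 1) * sizeof(void *); with the 64-bit
 default of 8192 groups that is 8193 * 8 bytes, roughly 64KB of static data.)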

Changelog: 20100804
 - fixed description in init/Kconfig

Changelog: 20100730
 - fixed rcu_read_unlock() placement.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 init/Kconfig    |   10 ++++++++++
 mm/memcontrol.c |   48 ++++++++++++++++++++++++++++++++++--------------
 2 files changed, 44 insertions(+), 14 deletions(-)

Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -292,6 +292,30 @@ static bool move_file(void)
 					&mc.to->move_charge_at_immigrate);
 }
 
+/* 0 is unused */
+static atomic_t mem_cgroup_num;
+#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
+static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
+
+static struct mem_cgroup *id_to_memcg(unsigned short id)
+{
+	/*
+	 * This array is set to NULL when mem_cgroup is freed.
+	 * IOW, there are no more references && rcu_synchronized().
+	 * This lookup-caching is safe.
+	 */
+	if (unlikely(!mem_cgroups[id])) {
+		struct cgroup_subsys_state *css;
+
+		rcu_read_lock();
+		css = css_lookup(&mem_cgroup_subsys, id);
+		rcu_read_unlock();
+		if (!css)
+			return NULL;
+		mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
+	}
+	return mem_cgroups[id];
+}
 /*
  * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
  * limit reclaim to prevent infinite loops, if they ever occur.
@@ -1824,18 +1848,7 @@ static void mem_cgroup_cancel_charge(str
  * it's concern. (dropping refcnt from swap can be called against removed
  * memcg.)
  */
-static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
-{
-	struct cgroup_subsys_state *css;
 
-	/* ID 0 is unused ID */
-	if (!id)
-		return NULL;
-	css = css_lookup(&mem_cgroup_subsys, id);
-	if (!css)
-		return NULL;
-	return container_of(css, struct mem_cgroup, css);
-}
 
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
@@ -1856,7 +1869,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
 		ent.val = page_private(page);
 		id = lookup_swap_cgroup(ent);
 		rcu_read_lock();
-		mem = mem_cgroup_lookup(id);
+		mem = id_to_memcg(id);
 		if (mem && !css_tryget(&mem->css))
 			mem = NULL;
 		rcu_read_unlock();
@@ -2208,7 +2221,7 @@ __mem_cgroup_commit_charge_swapin(struct
 
 		id = swap_cgroup_record(ent, 0);
 		rcu_read_lock();
-		memcg = mem_cgroup_lookup(id);
+		memcg = id_to_memcg(id);
 		if (memcg) {
 			/*
 			 * This recorded memcg can be obsolete one. So, avoid
@@ -2472,7 +2485,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
 
 	id = swap_cgroup_record(ent, 0);
 	rcu_read_lock();
-	memcg = mem_cgroup_lookup(id);
+	memcg = id_to_memcg(id);
 	if (memcg) {
 		/*
 		 * We uncharge this because swap is freed.
@@ -3988,6 +4001,9 @@ static struct mem_cgroup *mem_cgroup_all
 	struct mem_cgroup *mem;
 	int size = sizeof(struct mem_cgroup);
 
+	if (atomic_read(&mem_cgroup_num) == NR_MEMCG_GROUPS)
+		return NULL;
+
 	/* Can be very big if MAX_NUMNODES is very big */
 	if (size < PAGE_SIZE)
 		mem = kmalloc(size, GFP_KERNEL);
@@ -4025,7 +4041,10 @@ static void __mem_cgroup_free(struct mem
 	int node;
 
 	mem_cgroup_remove_from_trees(mem);
+	/* No more lookup against this ID */
+	mem_cgroups[css_id(&mem->css)] = NULL;
 	free_css_id(&mem_cgroup_subsys, &mem->css);
+	atomic_dec(&mem_cgroup_num);
 
 	for_each_node_state(node, N_POSSIBLE)
 		free_mem_cgroup_per_zone_info(mem, node);
@@ -4162,6 +4181,7 @@ mem_cgroup_create(struct cgroup_subsys *
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
+	atomic_inc(&mem_cgroup_num);
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
Index: mmotm-0727/init/Kconfig
===================================================================
--- mmotm-0727.orig/init/Kconfig
+++ mmotm-0727/init/Kconfig
@@ -594,6 +594,16 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
 	  size is 4096bytes, 512k per 1Gbytes of swap.
 
+config MEM_CGROUP_MAX_GROUPS
+	int "Maximum number of memory cgroups on a system"
+	range 1 65535
+	default 8192 if 64BIT
+	default 2048 if 32BIT
+	help
+	  Memory cgroup has limitation of the number of groups created.
+	  Please select your favorite value. The more you allow, the more
+	  memory(a pointer per group) will be consumed.
+
 menuconfig CGROUP_SCHED
 	bool "Group CPU scheduler"
 	depends on EXPERIMENTAL && CGROUPS


* [PATCH 2/4 -mm][memcg] use id in page cgroup
  2010-08-05  9:44 [PATCH 0/5 -mm][memcg] towards I/O aware memory cgroup v4 KAMEZAWA Hiroyuki
  2010-08-05  9:57 ` [PATCH 1/4 -mm][memcg] quick ID lookup in memcg KAMEZAWA Hiroyuki
@ 2010-08-05  9:57 ` KAMEZAWA Hiroyuki
  2010-08-05 10:01 ` [PATCH 3/4 -mm][memcg] scalable file status update logic without race KAMEZAWA Hiroyuki
  2010-08-05 10:03 ` [PATCH 4/4 -mm][memcg] generic file-stat accounting interface for memcg KAMEZAWA Hiroyuki
  3 siblings, 0 replies; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-05  9:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, vgoyal, m-ikeda, gthelen, akpm,
	linux-kernel

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, the address of a memory cgroup can be obtained from its ID without a
complex lookup. This patch replaces pc->mem_cgroup, a pointer, with an
unsigned short. On 64-bit architectures, this frees 6 more bytes per
page_cgroup: 2 bytes are used for blkio-cgroup's page tracking, and the
remaining 4 bytes will be used for some light-weight concurrent access.

We may be able to move this ID into the flags field, but let's go step by step.
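
To make the 6-byte figure concrete, the replaced field breaks down roughly
like this on a typical 64-bit layout (illustrative only; padding is
architecture dependent):

/*
 * before:  struct mem_cgroup *mem_cgroup;   8 bytes
 * after:   unsigned short mem_cgroup;       2 bytes (memcg ID)
 *          unsigned short blk_cgroup;       2 bytes (blkio-cgroup ID, unused yet)
 *                                           4 bytes left over (padding for now,
 *                                                   reserved for light-weight
 *                                                   concurrent access)
 */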

Changelog: 20100804
 - added comments to page_cgroup.h
Changelog: 20100730
 - fixed some garbage added by debug code in early stage

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/page_cgroup.h |    6 +++++-
 mm/memcontrol.c             |   32 +++++++++++++++++++-------------
 mm/page_cgroup.c            |    2 +-
 3 files changed, 25 insertions(+), 15 deletions(-)

Index: mmotm-0727/include/linux/page_cgroup.h
===================================================================
--- mmotm-0727.orig/include/linux/page_cgroup.h
+++ mmotm-0727/include/linux/page_cgroup.h
@@ -9,10 +9,14 @@
  * page_cgroup helps us identify information about the cgroup
  * All page cgroups are allocated at boot or memory hotplug event,
  * then the page cgroup for pfn always exists.
+ *
+ * TODO: It seems ID for cgroup can be packed into "flags". But there will
+ * be race between assigning ID <-> set/clear flags. Please be careful.
  */
 struct page_cgroup {
 	unsigned long flags;
-	struct mem_cgroup *mem_cgroup;
+	unsigned short mem_cgroup;	/* ID of assigned memory cgroup */
+	unsigned short blk_cgroup;	/* Not Used..but will be. */
 	struct page *page;
 	struct list_head lru;		/* per cgroup LRU list */
 };
Index: mmotm-0727/mm/page_cgroup.c
===================================================================
--- mmotm-0727.orig/mm/page_cgroup.c
+++ mmotm-0727/mm/page_cgroup.c
@@ -15,7 +15,7 @@ static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
 {
 	pc->flags = 0;
-	pc->mem_cgroup = NULL;
+	pc->mem_cgroup = 0;
 	pc->page = pfn_to_page(pfn);
 	INIT_LIST_HEAD(&pc->lru);
 }
Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -379,7 +379,7 @@ struct cgroup_subsys_state *mem_cgroup_c
 static struct mem_cgroup_per_zone *
 page_cgroup_zoneinfo(struct page_cgroup *pc)
 {
-	struct mem_cgroup *mem = pc->mem_cgroup;
+	struct mem_cgroup *mem = id_to_memcg(pc->mem_cgroup);
 	int nid = page_cgroup_nid(pc);
 	int zid = page_cgroup_zid(pc);
 
@@ -721,6 +721,11 @@ static inline bool mem_cgroup_is_root(st
 	return (mem == root_mem_cgroup);
 }
 
+static inline bool mem_cgroup_is_rootid(unsigned short id)
+{
+	return (id == 1);
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -753,7 +758,7 @@ void mem_cgroup_del_lru_list(struct page
 	 */
 	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
-	if (mem_cgroup_is_root(pc->mem_cgroup))
+	if (mem_cgroup_is_rootid(pc->mem_cgroup))
 		return;
 	VM_BUG_ON(list_empty(&pc->lru));
 	list_del_init(&pc->lru);
@@ -780,7 +785,7 @@ void mem_cgroup_rotate_lru_list(struct p
 	 */
 	smp_rmb();
 	/* unused or root page is not rotated. */
-	if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
+	if (!PageCgroupUsed(pc) || mem_cgroup_is_rootid(pc->mem_cgroup))
 		return;
 	mz = page_cgroup_zoneinfo(pc);
 	list_move(&pc->lru, &mz->lists[lru]);
@@ -806,7 +811,7 @@ void mem_cgroup_add_lru_list(struct page
 	mz = page_cgroup_zoneinfo(pc);
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
 	SetPageCgroupAcctLRU(pc);
-	if (mem_cgroup_is_root(pc->mem_cgroup))
+	if (mem_cgroup_is_rootid(pc->mem_cgroup))
 		return;
 	list_add(&pc->lru, &mz->lists[lru]);
 }
@@ -1474,7 +1479,7 @@ void mem_cgroup_update_file_mapped(struc
 		return;
 
 	lock_page_cgroup(pc);
-	mem = pc->mem_cgroup;
+	mem = id_to_memcg(pc->mem_cgroup);
 	if (!mem || !PageCgroupUsed(pc))
 		goto done;
 
@@ -1862,7 +1867,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
 	pc = lookup_page_cgroup(page);
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
-		mem = pc->mem_cgroup;
+		mem = id_to_memcg(pc->mem_cgroup);
 		if (mem && !css_tryget(&mem->css))
 			mem = NULL;
 	} else if (PageSwapCache(page)) {
@@ -1898,7 +1903,7 @@ static void __mem_cgroup_commit_charge(s
 		return;
 	}
 
-	pc->mem_cgroup = mem;
+	pc->mem_cgroup = css_id(&mem->css);
 	/*
 	 * We access a page_cgroup asynchronously without lock_page_cgroup().
 	 * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
@@ -1956,7 +1961,7 @@ static void __mem_cgroup_move_account(st
 	VM_BUG_ON(PageLRU(pc->page));
 	VM_BUG_ON(!PageCgroupLocked(pc));
 	VM_BUG_ON(!PageCgroupUsed(pc));
-	VM_BUG_ON(pc->mem_cgroup != from);
+	VM_BUG_ON(id_to_memcg(pc->mem_cgroup) != from);
 
 	if (PageCgroupFileMapped(pc)) {
 		/* Update mapped_file data for mem_cgroup */
@@ -1971,7 +1976,7 @@ static void __mem_cgroup_move_account(st
 		mem_cgroup_cancel_charge(from);
 
 	/* caller should have done css_get */
-	pc->mem_cgroup = to;
+	pc->mem_cgroup = css_id(&to->css);
 	mem_cgroup_charge_statistics(to, pc, true);
 	/*
 	 * We charges against "to" which may not have any tasks. Then, "to"
@@ -1991,7 +1996,7 @@ static int mem_cgroup_move_account(struc
 {
 	int ret = -EINVAL;
 	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
+	if (PageCgroupUsed(pc) && id_to_memcg(pc->mem_cgroup) == from) {
 		__mem_cgroup_move_account(pc, from, to, uncharge);
 		ret = 0;
 	}
@@ -2330,7 +2335,7 @@ __mem_cgroup_uncharge_common(struct page
 
 	lock_page_cgroup(pc);
 
-	mem = pc->mem_cgroup;
+	mem = id_to_memcg(pc->mem_cgroup);
 
 	if (!PageCgroupUsed(pc))
 		goto unlock_out;
@@ -2575,7 +2580,7 @@ int mem_cgroup_prepare_migration(struct 
 	pc = lookup_page_cgroup(page);
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
-		mem = pc->mem_cgroup;
+		mem = id_to_memcg(pc->mem_cgroup);
 		css_get(&mem->css);
 		/*
 		 * At migrating an anonymous page, its mapcount goes down
@@ -4398,7 +4403,8 @@ static int is_target_pte_for_mc(struct v
 		 * mem_cgroup_move_account() checks the pc is valid or not under
 		 * the lock.
 		 */
-		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
+		if (PageCgroupUsed(pc) &&
+			id_to_memcg(pc->mem_cgroup) == mc.from) {
 			ret = MC_TARGET_PAGE;
 			if (target)
 				target->page = page;


* [PATCH 3/4 -mm][memcg] scalable file status update logic without race
  2010-08-05  9:44 [PATCH 0/5 -mm][memcg] towards I/O aware memory cgroup v4 KAMEZAWA Hiroyuki
  2010-08-05  9:57 ` [PATCH 1/4 -mm][memcg] quick ID lookup in memcg KAMEZAWA Hiroyuki
  2010-08-05  9:57 ` [PATCH 2/4 -mm][memcg] use id in page cgroup KAMEZAWA Hiroyuki
@ 2010-08-05 10:01 ` KAMEZAWA Hiroyuki
  2010-08-05 10:03 ` [PATCH 4/4 -mm][memcg] generic file-stat accounting interface for memcg KAMEZAWA Hiroyuki
  3 siblings, 0 replies; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-05 10:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, vgoyal, m-ikeda, gthelen, akpm,
	linux-kernel

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

When accounting file events per memory cgroup, we need to find the memory
cgroup via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup() for that.

But considering the contexts in which page_cgroups for file pages are
accessed, we can use a lighter-weight form of mutual exclusion in most cases.
When handling file caches, the only race we have to take care of is "moving"
accounts, IOW overwriting page_cgroup->mem_cgroup. Because file status
updates are done while the page cache is in a stable state, we don't have to
care about races with charge/uncharge.

Unlike charge/uncharge, "move" does not happen frequently; it happens only at
rmdir() and task moving (with special settings).
This patch adds a race checker for file-cache-status accounting vs. account
moving: a new per-cpu, per-memcg counter, MEM_CGROUP_ON_MOVE. The account-move
routine
  1. increments it before starting the move,
  2. calls synchronize_rcu(),
  3. decrements it after the move is done.
With this, the file-status-counting routine can check whether it needs to call
lock_page_cgroup(); in most cases it doesn't.
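
In condensed form, the handshake implemented below is (pseudo-code distilled
from the patch; error paths and details omitted):

	/* updater (account move); start_move() bundles steps 1 and 2 */
	mem_cgroup_start_move(mem);	/* ON_MOVE++ on all cpus, then synchronize_rcu() */
	/* ... move accounts, taking lock_page_cgroup() per page ... */
	mem_cgroup_end_move(mem);	/* ON_MOVE-- on all cpus */

	/* reader (file-stat update), see mem_cgroup_update_file_mapped() */
	rcu_read_lock();
	mem = id_to_memcg(pc->mem_cgroup);
	if (mem && mem_cgroup_is_moved(mem)) {	/* a move may be in progress */
		lock_page_cgroup(pc);		/* slow path: serialize with it */
		mem = id_to_memcg(pc->mem_cgroup);	/* re-read under the lock */
	}
	/* ... update the per-cpu counter and the pc->flags bit ... */
	rcu_read_unlock();			/* (and unlock if we locked) */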


Changelog: 20100804
 - added a comment as a possible optimization hint (for_each_possible...)
   (but this path doesn't need to be quick for now)
Changelog: 20100730
 - some cleanup.
Changelog: 20100729
 - replaced __this_cpu_xxx() with this_cpu_xxx()
   (because we no longer always run under a spinlock)
 - added VM_BUG_ON().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   79 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 67 insertions(+), 12 deletions(-)

Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -88,6 +88,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
 	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
+	MEM_CGROUP_ON_MOVE,   /* A check for locking move account/status */
 
 	MEM_CGROUP_STAT_NSTATS,
 };
@@ -1074,7 +1075,50 @@ static unsigned int get_swappiness(struc
 	return swappiness;
 }
 
-/* A routine for testing mem is not under move_account */
+static void mem_cgroup_start_move(struct mem_cgroup *mem)
+{
+	int cpu;
+	/* for fast checking in mem_cgroup_update_file_stat() etc..*/
+	spin_lock(&mc.lock);
+	/* TODO: Can we optimize this by for_each_online_cpu() ? */
+	for_each_possible_cpu(cpu)
+		per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
+	spin_unlock(&mc.lock);
+
+	synchronize_rcu();
+}
+
+static void mem_cgroup_end_move(struct mem_cgroup *mem)
+{
+	int cpu;
+
+	if (!mem)
+		return;
+	/* for fast checking in mem_cgroup_update_file_stat() etc..*/
+	spin_lock(&mc.lock);
+	for_each_possible_cpu(cpu)
+		per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
+	spin_unlock(&mc.lock);
+}
+
+/*
+ * mem_cgroup_is_moved -- checking a cgroup is mc.from target or not.
+ *                          used for avoiding race.
+ * mem_cgroup_under_move -- checking a cgroup is mc.from or mc.to or
+ *			    under hierarchy of them. used for waiting at
+ *			    memory pressure.
+ * Result of is_moved can be trusted until the end of rcu_read_unlock().
+ * The caller must do
+ *	rcu_read_lock();
+ *	result = mem_cgroup_is_moved();
+ *	.....make use of result here....
+ *	rcu_read_unlock();
+ */
+static bool mem_cgroup_is_moved(struct mem_cgroup *mem)
+{
+	VM_BUG_ON(!rcu_read_lock_held());
+	return this_cpu_read(mem->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
+}
 
 static bool mem_cgroup_under_move(struct mem_cgroup *mem)
 {
@@ -1473,29 +1517,36 @@ void mem_cgroup_update_file_mapped(struc
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
+	bool need_lock = false;
 
 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;
-
-	lock_page_cgroup(pc);
+	rcu_read_lock();
 	mem = id_to_memcg(pc->mem_cgroup);
-	if (!mem || !PageCgroupUsed(pc))
+	if (likely(mem)) {
+		if (mem_cgroup_is_moved(mem)) {
+			/* need to serialize with move_account */
+			lock_page_cgroup(pc);
+			need_lock = true;
+			mem = id_to_memcg(pc->mem_cgroup);
+			if (unlikely(!mem))
+				goto done;
+		}
+	}
+	if (unlikely(!PageCgroupUsed(pc)))
 		goto done;
-
-	/*
-	 * Preemption is already disabled. We can use __this_cpu_xxx
-	 */
 	if (val > 0) {
-		__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+		this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		SetPageCgroupFileMapped(pc);
 	} else {
-		__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+		this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		ClearPageCgroupFileMapped(pc);
 	}
-
 done:
-	unlock_page_cgroup(pc);
+	if (need_lock)
+		unlock_page_cgroup(pc);
+	rcu_read_unlock();
 }
 
 /*
@@ -3034,6 +3085,7 @@ move_account:
 		lru_add_drain_all();
 		drain_all_stock_sync();
 		ret = 0;
+		mem_cgroup_start_move(mem);
 		for_each_node_state(node, N_HIGH_MEMORY) {
 			for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
 				enum lru_list l;
@@ -3047,6 +3099,7 @@ move_account:
 			if (ret)
 				break;
 		}
+		mem_cgroup_end_move(mem);
 		memcg_oom_recover(mem);
 		/* it seems parent cgroup doesn't have enough mem */
 		if (ret == -ENOMEM)
@@ -4513,6 +4566,7 @@ static void mem_cgroup_clear_mc(void)
 	mc.to = NULL;
 	mc.moving_task = NULL;
 	spin_unlock(&mc.lock);
+	mem_cgroup_end_move(from);
 	memcg_oom_recover(from);
 	memcg_oom_recover(to);
 	wake_up_all(&mc.waitq);
@@ -4543,6 +4597,7 @@ static int mem_cgroup_can_attach(struct 
 			VM_BUG_ON(mc.moved_charge);
 			VM_BUG_ON(mc.moved_swap);
 			VM_BUG_ON(mc.moving_task);
+			mem_cgroup_start_move(from);
 			spin_lock(&mc.lock);
 			mc.from = from;
 			mc.to = mem;


* [PATCH 4/4 -mm][memcg] generic file-stat accounting interface for memcg
  2010-08-05  9:44 [PATCH 0/5 -mm][memcg] towards I/O aware memory cgroup v4 KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2010-08-05 10:01 ` [PATCH 3/4 -mm][memcg] scalable file status update logic without race KAMEZAWA Hiroyuki
@ 2010-08-05 10:03 ` KAMEZAWA Hiroyuki
  3 siblings, 0 replies; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-05 10:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, vgoyal, m-ikeda, gthelen, akpm,
	linux-kernel

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Preparing for adding new statistics around file caches (dirty, writeback, etc.)
by using a unified macro and more generic names.
All counters will follow the same update rule.
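
As a hypothetical illustration of what this buys us (nothing below is part of
this series), a later dirty-page counter would only need a new index between
MEM_CGROUP_FSTAT_BASE and MEM_CGROUP_FSTAT_END plus a thin wrapper; the
matching page_cgroup flag bit comes for free because NR_FILE_FLAGS_MEMCG is
derived from the enum:

	/* in enum mem_cgroup_stat_index, after MEM_CGROUP_FSTAT_FILE_MAPPED: */
	MEM_CGROUP_FSTAT_FILE_DIRTY,	/* # of pages charged as dirty */

void mem_cgroup_update_file_dirty(struct page *page, int val)
{
	mem_cgroup_update_file_stat(page, MEM_CGROUP_FSTAT_FILE_DIRTY, val);
}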

Changelog:
 - restructured for cleanup.
 - added VM_BUG_ON().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h  |   24 ++++++++++++++++++++++
 include/linux/page_cgroup.h |   19 ++++++++++++------
 mm/memcontrol.c             |   46 ++++++++++++++++++--------------------------
 3 files changed, 56 insertions(+), 33 deletions(-)

Index: mmotm-0727/include/linux/memcontrol.h
===================================================================
--- mmotm-0727.orig/include/linux/memcontrol.h
+++ mmotm-0727/include/linux/memcontrol.h
@@ -25,6 +25,30 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/*
+ * Per-cpu Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
+	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
+	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
+	MEM_CGROUP_ON_MOVE,   /* A check for locking move account/status */
+	/* When you add new member for file-stat, please update page_cgroup.h */
+	MEM_CGROUP_FSTAT_BASE,
+	MEM_CGROUP_FSTAT_FILE_MAPPED = MEM_CGROUP_FSTAT_BASE,
+	MEM_CGROUP_FSTAT_END,
+	MEM_CGROUP_STAT_NSTATS = MEM_CGROUP_FSTAT_END,
+};
+
+#define MEMCG_FSTAT_IDX(idx)	((idx) - MEM_CGROUP_FSTAT_BASE)
+#define NR_FILE_FLAGS_MEMCG ((MEM_CGROUP_FSTAT_END - MEM_CGROUP_FSTAT_BASE))
+
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -74,24 +74,6 @@ static int really_do_swap_account __init
 #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
 #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
 
-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
-	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
-	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
-	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
-	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
-	MEM_CGROUP_ON_MOVE,   /* A check for locking move account/status */
-
-	MEM_CGROUP_STAT_NSTATS,
-};
 
 struct mem_cgroup_stat_cpu {
 	s64 count[MEM_CGROUP_STAT_NSTATS];
@@ -1513,7 +1495,8 @@ bool mem_cgroup_handle_oom(struct mem_cg
  * Currently used to update mapped file statistics, but the routine can be
  * generalized to update other statistics as well.
  */
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+static void
+mem_cgroup_update_file_stat(struct page *page, unsigned int idx, int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
@@ -1537,11 +1520,11 @@ void mem_cgroup_update_file_mapped(struc
 	if (unlikely(!PageCgroupUsed(pc)))
 		goto done;
 	if (val > 0) {
-		this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		SetPageCgroupFileMapped(pc);
+		this_cpu_inc(mem->stat->count[idx]);
+		set_bit(fflag_idx(MEMCG_FSTAT_IDX(idx)), &pc->flags);
 	} else {
-		this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		ClearPageCgroupFileMapped(pc);
+		this_cpu_dec(mem->stat->count[idx]);
+		clear_bit(fflag_idx(MEMCG_FSTAT_IDX(idx)), &pc->flags);
 	}
 done:
 	if (need_lock)
@@ -1549,6 +1532,12 @@ done:
 	rcu_read_unlock();
 }
 
+void mem_cgroup_update_file_mapped(struct page *page, int val)
+{
+	return mem_cgroup_update_file_stat(page,
+		MEM_CGROUP_FSTAT_FILE_MAPPED, val);
+}
+
 /*
  * size of first charge trial. "32" comes from vmscan.c's magic value.
  * TODO: maybe necessary to use big numbers in big irons.
@@ -2008,17 +1997,20 @@ static void __mem_cgroup_commit_charge(s
 static void __mem_cgroup_move_account(struct page_cgroup *pc,
 	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
+	int i;
 	VM_BUG_ON(from == to);
 	VM_BUG_ON(PageLRU(pc->page));
 	VM_BUG_ON(!PageCgroupLocked(pc));
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(id_to_memcg(pc->mem_cgroup) != from);
 
-	if (PageCgroupFileMapped(pc)) {
+	for (i = MEM_CGROUP_FSTAT_BASE; i < MEM_CGROUP_FSTAT_END; ++i) {
+		if (!test_bit(fflag_idx(MEMCG_FSTAT_IDX(i)), &pc->flags))
+			continue;
 		/* Update mapped_file data for mem_cgroup */
 		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+		__this_cpu_dec(from->stat->count[i]);
+		__this_cpu_inc(to->stat->count[i]);
 		preempt_enable();
 	}
 	mem_cgroup_charge_statistics(from, pc, false);
@@ -3448,7 +3440,7 @@ static int mem_cgroup_get_local_stat(str
 	s->stat[MCS_CACHE] += val * PAGE_SIZE;
 	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
 	s->stat[MCS_RSS] += val * PAGE_SIZE;
-	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_MAPPED);
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_FSTAT_FILE_MAPPED);
 	s->stat[MCS_FILE_MAPPED] += val * PAGE_SIZE;
 	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGIN_COUNT);
 	s->stat[MCS_PGPGIN] += val;
Index: mmotm-0727/include/linux/page_cgroup.h
===================================================================
--- mmotm-0727.orig/include/linux/page_cgroup.h
+++ mmotm-0727/include/linux/page_cgroup.h
@@ -3,6 +3,7 @@
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 #include <linux/bit_spinlock.h>
+#include <linux/memcontrol.h> /* for flags */
 /*
  * Page Cgroup can be considered as an extended mem_map.
  * A page_cgroup page is associated with every page descriptor. The
@@ -43,10 +44,21 @@ enum {
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
-	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
 	PCG_MIGRATION, /* under page migration */
+	PCG_FILE_FLAGS_MEMCG, /* see memcontrol.h */
+	PCG_FILE_FLAGS_MEMCG_END
+		= PCG_FILE_FLAGS_MEMCG + NR_FILE_FLAGS_MEMCG - 1,
 };
 
+/*
+ * file-stat flags are defined regarding to memcg's stat information.
+ * Here, just defines a macro for indexing
+ */
+static inline int fflag_idx (int idx) {
+	VM_BUG_ON((idx)>=NR_FILE_FLAGS_MEMCG);
+	return ((idx) + PCG_FILE_FLAGS_MEMCG);
+}
+
 #define TESTPCGFLAG(uname, lname)			\
 static inline int PageCgroup##uname(struct page_cgroup *pc)	\
 	{ return test_bit(PCG_##lname, &pc->flags); }
@@ -79,11 +91,6 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
 TESTPCGFLAG(AcctLRU, ACCT_LRU)
 TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
 
-
-SETPCGFLAG(FileMapped, FILE_MAPPED)
-CLEARPCGFLAG(FileMapped, FILE_MAPPED)
-TESTPCGFLAG(FileMapped, FILE_MAPPED)
-
 SETPCGFLAG(Migration, MIGRATION)
 CLEARPCGFLAG(Migration, MIGRATION)
 TESTPCGFLAG(Migration, MIGRATION)


* Re: [PATCH 1/4 -mm][memcg] quick ID lookup in memcg
  2010-08-06  4:12   ` Greg Thelen
@ 2010-08-06  4:10     ` KAMEZAWA Hiroyuki
  2010-08-06  4:37       ` Greg Thelen
  0 siblings, 1 reply; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-06  4:10 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, balbir, nishimura, vgoyal, m-ikeda, akpm, linux-kernel

On Thu, 05 Aug 2010 21:12:50 -0700
Greg Thelen <gthelen@google.com> wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > Now, memory cgroup has an ID per cgroup and make use of it at
> >  - hierarchy walk,
> >  - swap recording.
> >
> > This patch is for making more use of it. The final purpose is
> > to replace page_cgroup->mem_cgroup's pointer to an unsigned short.
> >
> > This patch caches a pointer of memcg in an array. By this, we
> > don't have to call css_lookup() which requires radix-hash walk.
> > This saves some amount of memory footprint at lookup memcg via id.
> >
> > Changelog: 20100804
> >  - fixed description in init/Kconfig
> >
> > Changelog: 20100730
> >  - fixed rcu_read_unlock() placement.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  init/Kconfig    |   10 ++++++++++
> >  mm/memcontrol.c |   48 ++++++++++++++++++++++++++++++++++--------------
> >  2 files changed, 44 insertions(+), 14 deletions(-)
> >
> > Index: mmotm-0727/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0727.orig/mm/memcontrol.c
> > +++ mmotm-0727/mm/memcontrol.c
> > @@ -292,6 +292,30 @@ static bool move_file(void)
> >  					&mc.to->move_charge_at_immigrate);
> >  }
> >  
> > +/* 0 is unused */
> > +static atomic_t mem_cgroup_num;
> > +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> > +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> > +
> > +static struct mem_cgroup *id_to_memcg(unsigned short id)
> > +{
> > +	/*
> > +	 * This array is set to NULL when mem_cgroup is freed.
> > +	 * IOW, there are no more references && rcu_synchronized().
> > +	 * This lookup-caching is safe.
> > +	 */
> > +	if (unlikely(!mem_cgroups[id])) {
> > +		struct cgroup_subsys_state *css;
> > +
> > +		rcu_read_lock();
> > +		css = css_lookup(&mem_cgroup_subsys, id);
> > +		rcu_read_unlock();
> > +		if (!css)
> > +			return NULL;
> > +		mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
> > +	}
> > +	return mem_cgroups[id];
> > +}
> 
> I am worried that id may be larger than CONFIG_MEM_CGROUP_MAX_GROUPS and
> cause an illegal array index.  I see that
> mem_cgroup_uncharge_swapcache() uses css_id() to compute 'id'.
> mem_cgroup_num ensures that there are never more than
> CONFIG_MEM_CGROUP_MAX_GROUPS memcgs active.  But do we have a guarantee
> that all of the css_ids of the active memcgs are less than
> NR_MEMCG_GROUPS?
> 
Yes. kernel/cgroup.c's ID assignment routine always uses the smallest free number.



> >  /*
> >   * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> >   * limit reclaim to prevent infinite loops, if they ever occur.
> > @@ -1824,18 +1848,7 @@ static void mem_cgroup_cancel_charge(str
> >   * it's concern. (dropping refcnt from swap can be called against removed
> >   * memcg.)
> >   */
> > -static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> > -{
> > -	struct cgroup_subsys_state *css;
> >  
> > -	/* ID 0 is unused ID */
> > -	if (!id)
> > -		return NULL;
> > -	css = css_lookup(&mem_cgroup_subsys, id);
> > -	if (!css)
> > -		return NULL;
> > -	return container_of(css, struct mem_cgroup, css);
> > -}
> >  
> >  struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
> >  {
> > @@ -1856,7 +1869,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
> >  		ent.val = page_private(page);
> >  		id = lookup_swap_cgroup(ent);
> >  		rcu_read_lock();
> > -		mem = mem_cgroup_lookup(id);
> > +		mem = id_to_memcg(id);
> >  		if (mem && !css_tryget(&mem->css))
> >  			mem = NULL;
> >  		rcu_read_unlock();
> > @@ -2208,7 +2221,7 @@ __mem_cgroup_commit_charge_swapin(struct
> >  
> >  		id = swap_cgroup_record(ent, 0);
> >  		rcu_read_lock();
> > -		memcg = mem_cgroup_lookup(id);
> > +		memcg = id_to_memcg(id);
> >  		if (memcg) {
> >  			/*
> >  			 * This recorded memcg can be obsolete one. So, avoid
> > @@ -2472,7 +2485,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
> >  
> >  	id = swap_cgroup_record(ent, 0);
> >  	rcu_read_lock();
> > -	memcg = mem_cgroup_lookup(id);
> > +	memcg = id_to_memcg(id);
> >  	if (memcg) {
> >  		/*
> >  		 * We uncharge this because swap is freed.
> > @@ -3988,6 +4001,9 @@ static struct mem_cgroup *mem_cgroup_all
> >  	struct mem_cgroup *mem;
> >  	int size = sizeof(struct mem_cgroup);
> >  
> > +	if (atomic_read(&mem_cgroup_num) == NR_MEMCG_GROUPS)
> > +		return NULL;
> > +
> 
> I think that multiple tasks may be simultaneously running
> mem_cgroup_create().  Therefore more than NR_MEMCG_GROUPS memcgs may be
> created.
> 

No. cgroup_mutex is held.

Thanks,
-Kame


* Re: [PATCH 1/4 -mm][memcg] quick ID lookup in memcg
  2010-08-05  9:57 ` [PATCH 1/4 -mm][memcg] quick ID lookup in memcg KAMEZAWA Hiroyuki
@ 2010-08-06  4:12   ` Greg Thelen
  2010-08-06  4:10     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 8+ messages in thread
From: Greg Thelen @ 2010-08-06  4:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, vgoyal, m-ikeda, akpm, linux-kernel

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Now, memory cgroup has an ID per cgroup and make use of it at
>  - hierarchy walk,
>  - swap recording.
>
> This patch is for making more use of it. The final purpose is
> to replace page_cgroup->mem_cgroup's pointer to an unsigned short.
>
> This patch caches a pointer of memcg in an array. By this, we
> don't have to call css_lookup() which requires radix-hash walk.
> This saves some amount of memory footprint at lookup memcg via id.
>
> Changelog: 20100804
>  - fixed description in init/Kconfig
>
> Changelog: 20100730
>  - fixed rcu_read_unlock() placement.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  init/Kconfig    |   10 ++++++++++
>  mm/memcontrol.c |   48 ++++++++++++++++++++++++++++++++++--------------
>  2 files changed, 44 insertions(+), 14 deletions(-)
>
> Index: mmotm-0727/mm/memcontrol.c
> ===================================================================
> --- mmotm-0727.orig/mm/memcontrol.c
> +++ mmotm-0727/mm/memcontrol.c
> @@ -292,6 +292,30 @@ static bool move_file(void)
>  					&mc.to->move_charge_at_immigrate);
>  }
>  
> +/* 0 is unused */
> +static atomic_t mem_cgroup_num;
> +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
> +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
> +
> +static struct mem_cgroup *id_to_memcg(unsigned short id)
> +{
> +	/*
> +	 * This array is set to NULL when mem_cgroup is freed.
> +	 * IOW, there are no more references && rcu_synchronized().
> +	 * This lookup-caching is safe.
> +	 */
> +	if (unlikely(!mem_cgroups[id])) {
> +		struct cgroup_subsys_state *css;
> +
> +		rcu_read_lock();
> +		css = css_lookup(&mem_cgroup_subsys, id);
> +		rcu_read_unlock();
> +		if (!css)
> +			return NULL;
> +		mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
> +	}
> +	return mem_cgroups[id];
> +}

I am worried that id may be larger than CONFIG_MEM_CGROUP_MAX_GROUPS and
cause an illegal array index.  I see that
mem_cgroup_uncharge_swapcache() uses css_id() to compute 'id'.
mem_cgroup_num ensures that there are never more than
CONFIG_MEM_CGROUP_MAX_GROUPS memcgs active.  But do we have a guarantee
that all of the css_ids of the active memcgs are less than
NR_MEMCG_GROUPS?

>  /*
>   * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
>   * limit reclaim to prevent infinite loops, if they ever occur.
> @@ -1824,18 +1848,7 @@ static void mem_cgroup_cancel_charge(str
>   * it's concern. (dropping refcnt from swap can be called against removed
>   * memcg.)
>   */
> -static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> -{
> -	struct cgroup_subsys_state *css;
>  
> -	/* ID 0 is unused ID */
> -	if (!id)
> -		return NULL;
> -	css = css_lookup(&mem_cgroup_subsys, id);
> -	if (!css)
> -		return NULL;
> -	return container_of(css, struct mem_cgroup, css);
> -}
>  
>  struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  {
> @@ -1856,7 +1869,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
>  		ent.val = page_private(page);
>  		id = lookup_swap_cgroup(ent);
>  		rcu_read_lock();
> -		mem = mem_cgroup_lookup(id);
> +		mem = id_to_memcg(id);
>  		if (mem && !css_tryget(&mem->css))
>  			mem = NULL;
>  		rcu_read_unlock();
> @@ -2208,7 +2221,7 @@ __mem_cgroup_commit_charge_swapin(struct
>  
>  		id = swap_cgroup_record(ent, 0);
>  		rcu_read_lock();
> -		memcg = mem_cgroup_lookup(id);
> +		memcg = id_to_memcg(id);
>  		if (memcg) {
>  			/*
>  			 * This recorded memcg can be obsolete one. So, avoid
> @@ -2472,7 +2485,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
>  
>  	id = swap_cgroup_record(ent, 0);
>  	rcu_read_lock();
> -	memcg = mem_cgroup_lookup(id);
> +	memcg = id_to_memcg(id);
>  	if (memcg) {
>  		/*
>  		 * We uncharge this because swap is freed.
> @@ -3988,6 +4001,9 @@ static struct mem_cgroup *mem_cgroup_all
>  	struct mem_cgroup *mem;
>  	int size = sizeof(struct mem_cgroup);
>  
> +	if (atomic_read(&mem_cgroup_num) == NR_MEMCG_GROUPS)
> +		return NULL;
> +

I think that multiple tasks may be simultaneously running
mem_cgroup_create().  Therefore more than NR_MEMCG_GROUPS memcgs may be
created.

>  	/* Can be very big if MAX_NUMNODES is very big */
>  	if (size < PAGE_SIZE)
>  		mem = kmalloc(size, GFP_KERNEL);
> @@ -4025,7 +4041,10 @@ static void __mem_cgroup_free(struct mem
>  	int node;
>  
>  	mem_cgroup_remove_from_trees(mem);
> +	/* No more lookup against this ID */
> +	mem_cgroups[css_id(&mem->css)] = NULL;
>  	free_css_id(&mem_cgroup_subsys, &mem->css);
> +	atomic_dec(&mem_cgroup_num);
>  
>  	for_each_node_state(node, N_POSSIBLE)
>  		free_mem_cgroup_per_zone_info(mem, node);
> @@ -4162,6 +4181,7 @@ mem_cgroup_create(struct cgroup_subsys *
>  	atomic_set(&mem->refcnt, 1);
>  	mem->move_charge_at_immigrate = 0;
>  	mutex_init(&mem->thresholds_lock);
> +	atomic_inc(&mem_cgroup_num);
>  	return &mem->css;
>  free_out:
>  	__mem_cgroup_free(mem);
> Index: mmotm-0727/init/Kconfig
> ===================================================================
> --- mmotm-0727.orig/init/Kconfig
> +++ mmotm-0727/init/Kconfig
> @@ -594,6 +594,16 @@ config CGROUP_MEM_RES_CTLR_SWAP
>  	  Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
>  	  size is 4096bytes, 512k per 1Gbytes of swap.
>  
> +config MEM_CGROUP_MAX_GROUPS
> +	int "Maximum number of memory cgroups on a system"
> +	range 1 65535
> +	default 8192 if 64BIT
> +	default 2048 if 32BIT
> +	help
> +	  Memory cgroup has limitation of the number of groups created.
> +	  Please select your favorite value. The more you allow, the more
> +	  memory(a pointer per group) will be consumed.
> +
>  menuconfig CGROUP_SCHED
>  	bool "Group CPU scheduler"
>  	depends on EXPERIMENTAL && CGROUPS


* Re: [PATCH 1/4 -mm][memcg] quick ID lookup in memcg
  2010-08-06  4:10     ` KAMEZAWA Hiroyuki
@ 2010-08-06  4:37       ` Greg Thelen
  0 siblings, 0 replies; 8+ messages in thread
From: Greg Thelen @ 2010-08-06  4:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, balbir, nishimura, vgoyal, m-ikeda, akpm, linux-kernel

On Thu, Aug 5, 2010 at 9:10 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 05 Aug 2010 21:12:50 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
>>
>> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >
>> > Now, memory cgroup has an ID per cgroup and make use of it at
>> >  - hierarchy walk,
>> >  - swap recording.
>> >
>> > This patch is for making more use of it. The final purpose is
>> > to replace page_cgroup->mem_cgroup's pointer to an unsigned short.
>> >
>> > This patch caches a pointer of memcg in an array. By this, we
>> > don't have to call css_lookup() which requires radix-hash walk.
>> > This saves some amount of memory footprint at lookup memcg via id.
>> >
>> > Changelog: 20100804
>> >  - fixed description in init/Kconfig
>> >
>> > Changelog: 20100730
>> >  - fixed rcu_read_unlock() placement.
>> >
>> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > ---
>> >  init/Kconfig    |   10 ++++++++++
>> >  mm/memcontrol.c |   48 ++++++++++++++++++++++++++++++++++--------------
>> >  2 files changed, 44 insertions(+), 14 deletions(-)
>> >
>> > Index: mmotm-0727/mm/memcontrol.c
>> > ===================================================================
>> > --- mmotm-0727.orig/mm/memcontrol.c
>> > +++ mmotm-0727/mm/memcontrol.c
>> > @@ -292,6 +292,30 @@ static bool move_file(void)
>> >                                     &mc.to->move_charge_at_immigrate);
>> >  }
>> >
>> > +/* 0 is unused */
>> > +static atomic_t mem_cgroup_num;
>> > +#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
>> > +static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
>> > +
>> > +static struct mem_cgroup *id_to_memcg(unsigned short id)
>> > +{
>> > +   /*
>> > +    * This array is set to NULL when mem_cgroup is freed.
>> > +    * IOW, there are no more references && rcu_synchronized().
>> > +    * This lookup-caching is safe.
>> > +    */
>> > +   if (unlikely(!mem_cgroups[id])) {
>> > +           struct cgroup_subsys_state *css;
>> > +
>> > +           rcu_read_lock();
>> > +           css = css_lookup(&mem_cgroup_subsys, id);
>> > +           rcu_read_unlock();
>> > +           if (!css)
>> > +                   return NULL;
>> > +           mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
>> > +   }
>> > +   return mem_cgroups[id];
>> > +}
>>
>> I am worried that id may be larger than CONFIG_MEM_CGROUP_MAX_GROUPS and
>> cause an illegal array index.  I see that
>> mem_cgroup_uncharge_swapcache() uses css_id() to compute 'id'.
>> mem_cgroup_num ensures that there are never more than
>> CONFIG_MEM_CGROUP_MAX_GROUPS memcgs active.  But do we have a guarantee
>> that all of the css_ids of the active memcgs are less than
>> NR_MEMCG_GROUPS?
>>
> Yes. kernel/cgroup.c's ID assignment routine always uses the smallest free number.
>
>
>
>> >  /*
>> >   * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
>> >   * limit reclaim to prevent infinite loops, if they ever occur.
>> > @@ -1824,18 +1848,7 @@ static void mem_cgroup_cancel_charge(str
>> >   * it's concern. (dropping refcnt from swap can be called against removed
>> >   * memcg.)
>> >   */
>> > -static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
>> > -{
>> > -   struct cgroup_subsys_state *css;
>> >
>> > -   /* ID 0 is unused ID */
>> > -   if (!id)
>> > -           return NULL;
>> > -   css = css_lookup(&mem_cgroup_subsys, id);
>> > -   if (!css)
>> > -           return NULL;
>> > -   return container_of(css, struct mem_cgroup, css);
>> > -}
>> >
>> >  struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>> >  {
>> > @@ -1856,7 +1869,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
>> >             ent.val = page_private(page);
>> >             id = lookup_swap_cgroup(ent);
>> >             rcu_read_lock();
>> > -           mem = mem_cgroup_lookup(id);
>> > +           mem = id_to_memcg(id);
>> >             if (mem && !css_tryget(&mem->css))
>> >                     mem = NULL;
>> >             rcu_read_unlock();
>> > @@ -2208,7 +2221,7 @@ __mem_cgroup_commit_charge_swapin(struct
>> >
>> >             id = swap_cgroup_record(ent, 0);
>> >             rcu_read_lock();
>> > -           memcg = mem_cgroup_lookup(id);
>> > +           memcg = id_to_memcg(id);
>> >             if (memcg) {
>> >                     /*
>> >                      * This recorded memcg can be obsolete one. So, avoid
>> > @@ -2472,7 +2485,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
>> >
>> >     id = swap_cgroup_record(ent, 0);
>> >     rcu_read_lock();
>> > -   memcg = mem_cgroup_lookup(id);
>> > +   memcg = id_to_memcg(id);
>> >     if (memcg) {
>> >             /*
>> >              * We uncharge this because swap is freed.
>> > @@ -3988,6 +4001,9 @@ static struct mem_cgroup *mem_cgroup_all
>> >     struct mem_cgroup *mem;
>> >     int size = sizeof(struct mem_cgroup);
>> >
>> > +   if (atomic_read(&mem_cgroup_num) == NR_MEMCG_GROUPS)
>> > +           return NULL;
>> > +
>>
>> I think that multiple tasks may be simultaneously running
>> mem_cgroup_create().  Therefore more than NR_MEMCG_GROUPS memcgs may be
>> created.
>>
>
> No. cgroup_mutex is held.
>
> Thanks,
> -Kame
>
>

I see that now.  Thank you for the clarification.  I am doing some testing on
the patches now.



Thread overview: 8+ messages
2010-08-05  9:44 [PATCH 0/5 -mm][memcg] towards I/O aware memory cgroup v4 KAMEZAWA Hiroyuki
2010-08-05  9:57 ` [PATCH 1/4 -mm][memcg] quick ID lookup in memcg KAMEZAWA Hiroyuki
2010-08-06  4:12   ` Greg Thelen
2010-08-06  4:10     ` KAMEZAWA Hiroyuki
2010-08-06  4:37       ` Greg Thelen
2010-08-05  9:57 ` [PATCH 2/4 -mm][memcg] use id in page cgroup KAMEZAWA Hiroyuki
2010-08-05 10:01 ` [PATCH 3/4 -mm][memcg] scalable file status update logic without race KAMEZAWA Hiroyuki
2010-08-05 10:03 ` [PATCH 4/4 -mm][memcg] generic file-stat accounting interface for memcg KAMEZAWA Hiroyuki
