* [PATCH] memcg updates (2008/12/16)
@ 2008-12-16 9:09 KAMEZAWA Hiroyuki
2008-12-16 9:10 ` [PATCH 1/9] fix error path of mem_cgroup_create and refcnt handling KAMEZAWA Hiroyuki
` (8 more replies)
0 siblings, 9 replies; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-16 9:09 UTC (permalink / raw)
To: linux-mm; +Cc: menage, balbir, nishimura, lizf
This is just a dump of my queue, to share what is planned.
It includes Paul Menage's "RFC" patches (because my patches use them).
Some are new and others are not. All are against mmotm-Dec-15.
[1/9] Fix the mem_cgroup_create() error path and simplify refcnt handling.
(Maybe it's better to split this into a bugfix and a cleanup.)
[2/9] Paul Menage's cgroup_hierarchy_mutex()
[3/9] Paul Menage's hierarchy_mutex_in_memcg
[4/9] Paul Menage's css_tryget()
[5/9] Add css_is_removed()
[6/9] Use css_tryget() in memcg to remove memcg->obsolete
[7/9] Add css_id support.
A new implementation of the cgroup ID I posted earlier. IDs are per-CSS.
[8/9] Reclaim memory in a hierarchy without cgroup mutexes.
Reclaims memory in round-robin under the hierarchy by CSS ID.
[9/9] Fix an OOM killer bug under hierarchy.
The current memcg OOM killer is broken under hierarchy. This is a fix.
I wonder whether I have to keep 6-9 in my queue until next year.
(2-5 aren't under my control.)
The memcg patches piled onto mmotm look like a house of cards ;)
So I don't want to be aggressive.
If anyone is interested, 9/9 could be fixed without 2-8.
Thanks,
-Kame
--
* [PATCH 1/9] fix error path of mem_cgroup_create and refcnt handling
2008-12-16 9:09 [PATCH] memcg updates (2008/12/16) KAMEZAWA Hiroyuki
@ 2008-12-16 9:10 ` KAMEZAWA Hiroyuki
2008-12-17 2:26 ` Daisuke Nishimura
2008-12-16 9:12 ` [PATCH 2/9] cgroup hierarchy mutex KAMEZAWA Hiroyuki
` (7 subsequent siblings)
8 siblings, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-16 9:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, nishimura, lizf
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
1. Fix a double-free BUG in the error path of mem_cgroup_create();
   mem_cgroup_free() itself frees the per-zone info (see the sketch below).
2. Simplify memcg's refcnt handling:
   add one refcnt at creation and free when the refcnt goes down to 0.
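For reference, the double free happened because the old error path did
both of the following (a sketch reconstructed from the diff below, not
code from the patch):

    free_out:
            /* first free of the per-zone info */
            for_each_node_state(node, N_POSSIBLE)
                    free_mem_cgroup_per_zone_info(mem, node);
            /*
             * The old mem_cgroup_free() saw refcnt == 0, so it went on
             * to free the per-zone info a second time: a double free.
             */
            mem_cgroup_free(mem);
            return ERR_PTR(-ENOMEM);

After this patch the error path calls __mem_cgroup_free() exactly once,
and that is the only place the per-zone info is freed.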
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Index: mmotm-2.6.28-Dec15/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec15.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec15/mm/memcontrol.c
@@ -2087,14 +2087,10 @@ static struct mem_cgroup *mem_cgroup_all
* Removal of cgroup itself succeeds regardless of refs from swap.
*/
-static void mem_cgroup_free(struct mem_cgroup *mem)
+static void __mem_cgroup_free(struct mem_cgroup *mem)
{
int node;
- if (atomic_read(&mem->refcnt) > 0)
- return;
-
-
for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);
@@ -2111,11 +2107,8 @@ static void mem_cgroup_get(struct mem_cg
static void mem_cgroup_put(struct mem_cgroup *mem)
{
- if (atomic_dec_and_test(&mem->refcnt)) {
- if (!mem->obsolete)
- return;
- mem_cgroup_free(mem);
- }
+ if (atomic_dec_and_test(&mem->refcnt))
+ __mem_cgroup_free(mem);
}
@@ -2165,12 +2158,10 @@ mem_cgroup_create(struct cgroup_subsys *
if (parent)
mem->swappiness = get_swappiness(parent);
-
+ atomic_set(&mem->refcnt, 1);
return &mem->css;
free_out:
- for_each_node_state(node, N_POSSIBLE)
- free_mem_cgroup_per_zone_info(mem, node);
- mem_cgroup_free(mem);
+ __mem_cgroup_free(mem);
return ERR_PTR(-ENOMEM);
}
@@ -2185,7 +2176,7 @@ static void mem_cgroup_pre_destroy(struc
static void mem_cgroup_destroy(struct cgroup_subsys *ss,
struct cgroup *cont)
{
- mem_cgroup_free(mem_cgroup_from_cont(cont));
+ mem_cgroup_put(mem_cgroup_from_cont(cont));
}
static int mem_cgroup_populate(struct cgroup_subsys *ss,
--
* [PATCH 2/9] cgroup hierarchy mutex
2008-12-16 9:09 [PATCH] memcg updates (2008/12/16) KAMEZAWA Hiroyuki
2008-12-16 9:10 ` [PATCH 1/9] fix error path of mem_cgroup_create and refcnt handling KAMEZAWA Hiroyuki
@ 2008-12-16 9:12 ` KAMEZAWA Hiroyuki
2008-12-16 9:13 ` [PATCH 3/9] use hierarchy mutex in memcg KAMEZAWA Hiroyuki
` (6 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-16 9:12 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, nishimura, lizf
This was an RFC from Paul Menage, included here just as a base for my series.
==
From: menage@google.com

This patch adds a per-subsystem hierarchy_mutex, taken by the cgroup
framework for:
- linking a cgroup into that subsystem's cgroup tree
- unlinking a cgroup from that subsystem's cgroup tree
- moving the subsystem to/from a hierarchy (including across the
bind() callback)
Thus if the subsystem holds its own hierarchy_mutex, it can safely
traverse its own hierarchy.
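For example, a subsystem-side walk could then look like this (a minimal
sketch, not part of the patch; "my_subsys" and the function around the
walk are made up for illustration):

    /*
     * Sketch: with our hierarchy_mutex held, the ->children/->sibling
     * links of cgroups in this subsystem's hierarchy cannot change
     * underneath us, so a plain list walk is safe.
     */
    static void my_subsys_walk_children(struct cgroup *parent)
    {
            struct cgroup *child;

            mutex_lock(&my_subsys.hierarchy_mutex);
            list_for_each_entry(child, &parent->children, sibling) {
                    /* examine this subsystem's state for "child" */
            }
            mutex_unlock(&my_subsys.hierarchy_mutex);
    }

Patch 3 converts memcg's hierarchical-reclaim walk to exactly this
pattern.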
Signed-off-by: Paul Menage <menage@google.com>
---
Index: mmotm-2.6.28-Dec12/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Dec12.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Dec12/include/linux/cgroup.h
@@ -337,8 +337,15 @@ struct cgroup_subsys {
#define MAX_CGROUP_TYPE_NAMELEN 32
const char *name;
- struct cgroupfs_root *root;
+ /*
+ * Protects sibling/children links of cgroups in this
+ * hierarchy, plus protects which hierarchy (or none) the
+ * subsystem is a part of (i.e. root/sibling)
+ */
+ struct mutex hierarchy_mutex;
+ /* Protected by this->hierarchy_mutex and cgroup_lock() */
+ struct cgroupfs_root *root;
struct list_head sibling;
};
Index: mmotm-2.6.28-Dec12/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Dec12.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Dec12/kernel/cgroup.c
@@ -714,23 +714,26 @@ static int rebind_subsystems(struct cgro
BUG_ON(cgrp->subsys[i]);
BUG_ON(!dummytop->subsys[i]);
BUG_ON(dummytop->subsys[i]->cgroup != dummytop);
+ mutex_lock(&ss->hierarchy_mutex);
cgrp->subsys[i] = dummytop->subsys[i];
cgrp->subsys[i]->cgroup = cgrp;
list_move(&ss->sibling, &root->subsys_list);
ss->root = root;
if (ss->bind)
ss->bind(ss, cgrp);
-
+ mutex_unlock(&ss->hierarchy_mutex);
} else if (bit & removed_bits) {
/* We're removing this subsystem */
BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]);
BUG_ON(cgrp->subsys[i]->cgroup != cgrp);
+ mutex_lock(&ss->hierarchy_mutex);
if (ss->bind)
ss->bind(ss, dummytop);
dummytop->subsys[i]->cgroup = dummytop;
cgrp->subsys[i] = NULL;
subsys[i]->root = &rootnode;
list_move(&ss->sibling, &rootnode.subsys_list);
+ mutex_unlock(&ss->hierarchy_mutex);
} else if (bit & final_bits) {
/* Subsystem state should already exist */
BUG_ON(!cgrp->subsys[i]);
@@ -2326,6 +2329,29 @@ static void init_cgroup_css(struct cgrou
cgrp->subsys[ss->subsys_id] = css;
}
+static void cgroup_lock_hierarchy(struct cgroupfs_root *root)
+{
+ /* We need to take each hierarchy_mutex in a consistent order */
+ int i;
+
+ for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+ struct cgroup_subsys *ss = subsys[i];
+ if (ss->root == root)
+ mutex_lock_nested(&ss->hierarchy_mutex, i);
+ }
+}
+
+static void cgroup_unlock_hierarchy(struct cgroupfs_root *root)
+{
+ int i;
+
+ for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+ struct cgroup_subsys *ss = subsys[i];
+ if (ss->root == root)
+ mutex_unlock(&ss->hierarchy_mutex);
+ }
+}
+
/*
* cgroup_create - create a cgroup
* @parent: cgroup that will be parent of the new cgroup
@@ -2374,7 +2400,9 @@ static long cgroup_create(struct cgroup
init_cgroup_css(css, ss, cgrp);
}
+ cgroup_lock_hierarchy(root);
list_add(&cgrp->sibling, &cgrp->parent->children);
+ cgroup_unlock_hierarchy(root);
root->number_of_cgroups++;
err = cgroup_create_dir(cgrp, dentry, mode);
@@ -2492,8 +2520,12 @@ static int cgroup_rmdir(struct inode *un
if (!list_empty(&cgrp->release_list))
list_del(&cgrp->release_list);
spin_unlock(&release_list_lock);
- /* delete my sibling from parent->children */
+
+ cgroup_lock_hierarchy(cgrp->root);
+ /* delete this cgroup from parent->children */
list_del(&cgrp->sibling);
+ cgroup_unlock_hierarchy(cgrp->root);
+
spin_lock(&cgrp->dentry->d_lock);
d = dget(cgrp->dentry);
spin_unlock(&d->d_lock);
@@ -2535,6 +2567,7 @@ static void __init cgroup_init_subsys(st
* need to invoke fork callbacks here. */
BUG_ON(!list_empty(&init_task.tasks));
+ mutex_init(&ss->hierarchy_mutex);
ss->active = 1;
}
Index: mmotm-2.6.28-Dec12/Documentation/cgroups/cgroups.txt
===================================================================
--- mmotm-2.6.28-Dec12.orig/Documentation/cgroups/cgroups.txt
+++ mmotm-2.6.28-Dec12/Documentation/cgroups/cgroups.txt
@@ -528,7 +528,7 @@ example in cpusets, no task may attach b
up.
void bind(struct cgroup_subsys *ss, struct cgroup *root)
-(cgroup_mutex held by caller)
+(cgroup_mutex and ss->hierarchy_mutex held by caller)
Called when a cgroup subsystem is rebound to a different hierarchy
and root cgroup. Currently this will only involve movement between
--
* [PATCH 3/9] use hierarchy mutex in memcg
2008-12-16 9:09 [PATCH] memcg updates (2008/12/16) KAMEZAWA Hiroyuki
2008-12-16 9:10 ` [PATCH 1/9] fix error path of mem_cgroup_create and refcnt handling KAMEZAWA Hiroyuki
2008-12-16 9:12 ` [PATCH 2/9] cgroup hierarchy mutex KAMEZAWA Hiroyuki
@ 2008-12-16 9:13 ` KAMEZAWA Hiroyuki
2008-12-16 9:14 ` [PATCH 4/9] cgroup add css_tryget() KAMEZAWA Hiroyuki
` (5 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-16 9:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, nishimura, lizf
This was an RFC from Paul Menage, included just as a base for my series.
==
From: menage@google.com
This patch updates the memory controller to use its hierarchy_mutex
rather than calling cgroup_lock() to protect against
cgroup_mkdir()/cgroup_rmdir() occurring in its hierarchy.
Signed-off-by: Paul Menage <menage@google.com>
---
Index: mmotm-2.6.28-Dec12/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec12.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec12/mm/memcontrol.c
@@ -154,7 +154,7 @@ struct mem_cgroup {
/*
* While reclaiming in a hiearchy, we cache the last child we
- * reclaimed from. Protected by cgroup_lock()
+ * reclaimed from. Protected by hierarchy_mutex
*/
struct mem_cgroup *last_scanned_child;
/*
@@ -554,7 +554,7 @@ unsigned long mem_cgroup_isolate_pages(u
/*
* This routine finds the DFS walk successor. This routine should be
- * called with cgroup_mutex held
+ * called with hierarchy_mutex held
*/
static struct mem_cgroup *
mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
@@ -623,7 +623,7 @@ mem_cgroup_get_first_node(struct mem_cgr
/*
* Scan all children under the mem_cgroup mem
*/
- cgroup_lock();
+ mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
if (list_empty(&root_mem->css.cgroup->children)) {
ret = root_mem;
goto done;
@@ -644,7 +644,7 @@ mem_cgroup_get_first_node(struct mem_cgr
done:
root_mem->last_scanned_child = ret;
- cgroup_unlock();
+ mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
return ret;
}
@@ -708,18 +708,16 @@ static int mem_cgroup_hierarchical_recla
while (next_mem != root_mem) {
if (next_mem->obsolete) {
mem_cgroup_put(next_mem);
- cgroup_lock();
next_mem = mem_cgroup_get_first_node(root_mem);
- cgroup_unlock();
continue;
}
ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
get_swappiness(next_mem));
if (mem_cgroup_check_under_limit(root_mem))
return 0;
- cgroup_lock();
+ mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
- cgroup_unlock();
+ mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
}
return ret;
}
--
* [PATCH 4/9] cgroup add css_tryget()
2008-12-16 9:09 [PATCH] memcg updates (2008/12/16) KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2008-12-16 9:13 ` [PATCH 3/9] use hierarchy mutex in memcg KAMEZAWA Hiroyuki
@ 2008-12-16 9:14 ` KAMEZAWA Hiroyuki
2008-12-16 9:15 ` [PATCH 5/9] Add css_is_removed KAMEZAWA Hiroyuki
` (4 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-16 9:14 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, nishimura, lizf
This was an RFC from Paul Menage, included here just as a base for my series.
==
From: menage@google.com
This patch adds css_tryget(), which obtains a counted reference on a
CSS. It is used in situations where the caller has a "weak" reference
to the CSS, i.e. one that does not protect the cgroup from removal via
a reference count, but would instead be cleaned up by a destroy()
callback.
css_tryget() will return true on success, or false if the cgroup is
being removed.
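A typical caller would combine RCU with css_tryget() like this (a sketch
of the intended usage under the semantics above, not code from this
patch; "weak_css_ptr" is a hypothetical RCU-protected source pointer):

    struct cgroup_subsys_state *css;

    rcu_read_lock();
    css = rcu_dereference(weak_css_ptr);
    if (css && css_tryget(css)) {
            rcu_read_unlock();
            /* css is pinned: the cgroup cannot finish removal here */
            ...
            css_put(css);
    } else {
            rcu_read_unlock();
            /* the cgroup is being removed: fall back or retry */
    }

Patch 6 uses this pattern in memcg (see try_get_mem_cgroup_from_mm()).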
This is similar to Kamezawa Hiroyuki's patch from a week or two ago,
but with the difference that in the event of css_tryget() racing with
a cgroup_rmdir(), css_tryget() will only return false if the cgroup
really does get removed.
This implementation is done by biasing css->refcnt, so that a refcnt
of 1 means "releasable" and 0 means "released or releasing". In the
event of a race, css_tryget() distinguishes between "released" and
"releasing" by checking for the CSS_REMOVED flag in css->flags.
Signed-off-by: Paul Menage <menage@google.com>
---
Index: mmotm-2.6.28-Dec12/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Dec12.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Dec12/include/linux/cgroup.h
@@ -52,9 +52,9 @@ struct cgroup_subsys_state {
* hierarchy structure */
struct cgroup *cgroup;
- /* State maintained by the cgroup system to allow
- * subsystems to be "busy". Should be accessed via css_get()
- * and css_put() */
+ /* State maintained by the cgroup system to allow subsystems
+ * to be "busy". Should be accessed via css_get(),
+ * css_tryget() and css_put(). */
atomic_t refcnt;
@@ -64,11 +64,14 @@ struct cgroup_subsys_state {
/* bits in struct cgroup_subsys_state flags field */
enum {
CSS_ROOT, /* This CSS is the root of the subsystem */
+ CSS_REMOVED, /* This CSS is dead */
};
/*
- * Call css_get() to hold a reference on the cgroup;
- *
+ * Call css_get() to hold a reference on the css; it can be used
+ * for a reference obtained via:
+ * - an existing ref-counted reference to the css
+ * - task->cgroups for a locked task
*/
static inline void css_get(struct cgroup_subsys_state *css)
@@ -77,9 +80,27 @@ static inline void css_get(struct cgroup
if (!test_bit(CSS_ROOT, &css->flags))
atomic_inc(&css->refcnt);
}
+
+/*
+ * Call css_tryget() to take a reference on a css if your existing
+ * (known-valid) reference isn't already ref-counted. Returns false if
+ * the css has been destroyed.
+ */
+
+static inline bool css_tryget(struct cgroup_subsys_state *css)
+{
+ if (test_bit(CSS_ROOT, &css->flags))
+ return true;
+ while (!atomic_inc_not_zero(&css->refcnt)) {
+ if (test_bit(CSS_REMOVED, &css->flags))
+ return false;
+ }
+ return true;
+}
+
/*
* css_put() should be called to release a reference taken by
- * css_get()
+ * css_get() or css_tryget()
*/
extern void __css_put(struct cgroup_subsys_state *css);
Index: mmotm-2.6.28-Dec12/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Dec12.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Dec12/kernel/cgroup.c
@@ -2321,7 +2321,7 @@ static void init_cgroup_css(struct cgrou
struct cgroup *cgrp)
{
css->cgroup = cgrp;
- atomic_set(&css->refcnt, 0);
+ atomic_set(&css->refcnt, 1);
css->flags = 0;
if (cgrp == dummytop)
set_bit(CSS_ROOT, &css->flags);
@@ -2453,7 +2453,7 @@ static int cgroup_has_css_refs(struct cg
{
/* Check the reference count on each subsystem. Since we
* already established that there are no tasks in the
- * cgroup, if the css refcount is also 0, then there should
+ * cgroup, if the css refcount is also 1, then there should
* be no outstanding references, so the subsystem is safe to
* destroy. We scan across all subsystems rather than using
* the per-hierarchy linked list of mounted subsystems since
@@ -2474,12 +2474,59 @@ static int cgroup_has_css_refs(struct cg
* matter, since it can only happen if the cgroup
* has been deleted and hence no longer needs the
* release agent to be called anyway. */
- if (css && atomic_read(&css->refcnt))
+ if (css && (atomic_read(&css->refcnt) > 1))
return 1;
}
return 0;
}
+/*
+ * Atomically mark all of the cgroup's CSS objects as
+ * CSS_REMOVED. Return true on success, or false if the cgroup has
+ * busy subsystems. Call with cgroup_mutex held
+ */
+
+static int cgroup_clear_css_refs(struct cgroup *cgrp)
+{
+ struct cgroup_subsys *ss;
+ unsigned long flags;
+ bool failed = false;
+ local_irq_save(flags);
+ for_each_subsys(cgrp->root, ss) {
+ struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+ int refcnt;
+ do {
+ /* We can only remove a CSS with a refcnt==1 */
+ refcnt = atomic_read(&css->refcnt);
+ if (refcnt > 1) {
+ failed = true;
+ goto done;
+ }
+ BUG_ON(!refcnt);
+ /*
+ * Drop the refcnt to 0 while we check other
+ * subsystems. This will cause any racing
+ * css_tryget() to spin until we set the
+ * CSS_REMOVED bits or abort
+ */
+ } while (atomic_cmpxchg(&css->refcnt, refcnt, 0) != refcnt);
+ }
+ done:
+ for_each_subsys(cgrp->root, ss) {
+ struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+ if (failed) {
+ /* Restore old refcnt if necessary */
+ if (!atomic_read(&css->refcnt))
+ atomic_set(&css->refcnt, 1);
+ } else {
+ /* Commit the fact that the CSS is removed */
+ set_bit(CSS_REMOVED, &css->flags);
+ }
+ }
+ local_irq_restore(flags);
+ return !failed;
+}
+
static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
{
struct cgroup *cgrp = dentry->d_fsdata;
@@ -2510,7 +2557,7 @@ static int cgroup_rmdir(struct inode *un
if (atomic_read(&cgrp->count)
|| !list_empty(&cgrp->children)
- || cgroup_has_css_refs(cgrp)) {
+ || !cgroup_clear_css_refs(cgrp)) {
mutex_unlock(&cgroup_mutex);
return -EBUSY;
}
@@ -3065,7 +3112,8 @@ void __css_put(struct cgroup_subsys_stat
{
struct cgroup *cgrp = css->cgroup;
rcu_read_lock();
- if (atomic_dec_and_test(&css->refcnt) && notify_on_release(cgrp)) {
+ if ((atomic_dec_return(&css->refcnt) == 1) &&
+ notify_on_release(cgrp)) {
set_bit(CGRP_RELEASABLE, &cgrp->flags);
check_for_release(cgrp);
}
--
* [PATCH 5/9] Add css_is_removed
2008-12-16 9:09 [PATCH] memcg updates (2008/12/16) KAMEZAWA Hiroyuki
` (3 preceding siblings ...)
2008-12-16 9:14 ` [PATCH 4/9] cgroup add css_tryget() KAMEZAWA Hiroyuki
@ 2008-12-16 9:15 ` KAMEZAWA Hiroyuki
2008-12-16 9:17 ` [PATCH 6/9] memcg: use css_tryget() KAMEZAWA Hiroyuki
` (3 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-16 9:15 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, nishimura, lizf
I hear Paul Menage will add a similar call to his patch set.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Add a function for checking whether a css has been removed.
Maybe this patch will become unnecessary.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Index: mmotm-2.6.28-Dec12/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Dec12.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Dec12/include/linux/cgroup.h
@@ -110,6 +110,11 @@ static inline void css_put(struct cgroup
__css_put(css);
}
+static inline bool css_is_removed(struct cgroup_subsys_state *css)
+{
+ return test_bit(CSS_REMOVED, &css->flags);
+}
+
/* bits in struct cgroup flags field */
enum {
/* Control Group is dead */
--
* [PATCH 6/9] memcg: use css_tryget()
2008-12-16 9:09 [PATCH] memcg updates (2008/12/16) KAMEZAWA Hiroyuki
` (4 preceding siblings ...)
2008-12-16 9:15 ` [PATCH 5/9] Add css_is_removed KAMEZAWA Hiroyuki
@ 2008-12-16 9:17 ` KAMEZAWA Hiroyuki
2008-12-18 1:50 ` Daisuke Nishimura
2008-12-16 9:19 ` [PATCH 7/9] cgroup: Support CSS ID KAMEZAWA Hiroyuki
` (2 subsequent siblings)
8 siblings, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-16 9:17 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, nishimura, lizf
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Use css_tryget() in memcg.
css_tryget() is newly added; with it we can check whether a css is alive
and take a refcnt on it in a safe way.
("alive" here means "rmdir/destroy" has not been called.)
This patch replaces css_get() with css_tryget() wherever I cannot explain
why css_get() is safe, and removes the memcg->obsolete flag.
Changelog (v0) -> (v1):
- fixed css_ref leak bug at swap-in.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Index: mmotm-2.6.28-Dec15/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec15.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec15/mm/memcontrol.c
@@ -162,7 +162,6 @@ struct mem_cgroup {
*/
bool use_hierarchy;
unsigned long last_oom_jiffies;
- int obsolete;
atomic_t refcnt;
unsigned int swappiness;
@@ -283,6 +282,31 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
}
+static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
+{
+ struct mem_cgroup *mem = NULL;
+ /*
+ * Because we have no locks, mm->owner may be being moved to another
+ * cgroup. We use css_tryget() here even if this looks
+ * pessimistic (rather than adding locks here).
+ */
+ rcu_read_lock();
+ do {
+ mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ if (unlikely(!mem))
+ break;
+ } while (!css_tryget(&mem->css));
+ rcu_read_unlock();
+ return mem;
+}
+
+static bool mem_cgroup_is_obsolete(struct mem_cgroup *mem)
+{
+ if (!mem)
+ return true;
+ return css_is_removed(&mem->css);
+}
+
/*
* Following LRU functions are allowed to be used without PCG_LOCK.
* Operations are called by routine of global LRU independently from memcg.
@@ -617,8 +641,9 @@ mem_cgroup_get_first_node(struct mem_cgr
{
struct cgroup *cgroup;
struct mem_cgroup *ret;
- bool obsolete = (root_mem->last_scanned_child &&
- root_mem->last_scanned_child->obsolete);
+ bool obsolete;
+
+ obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
/*
* Scan all children under the mem_cgroup mem
@@ -706,7 +731,7 @@ static int mem_cgroup_hierarchical_recla
next_mem = mem_cgroup_get_first_node(root_mem);
while (next_mem != root_mem) {
- if (next_mem->obsolete) {
+ if (mem_cgroup_is_obsolete(next_mem)) {
mem_cgroup_put(next_mem);
next_mem = mem_cgroup_get_first_node(root_mem);
continue;
@@ -762,23 +787,17 @@ static int __mem_cgroup_try_charge(struc
* thread group leader migrates. It's possible that mm is not
* set, if so charge the init_mm (happens for pagecache usage).
*/
- if (likely(!*memcg)) {
- rcu_read_lock();
- mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
- if (unlikely(!mem)) {
- rcu_read_unlock();
- return 0;
- }
- /*
- * For every charge from the cgroup, increment reference count
- */
- css_get(&mem->css);
+ mem = *memcg;
+ if (likely(!mem)) {
+ mem = try_get_mem_cgroup_from_mm(mm);
*memcg = mem;
- rcu_read_unlock();
} else {
- mem = *memcg;
css_get(&mem->css);
}
+ if (unlikely(!mem))
+ return 0;
+
+ VM_BUG_ON(mem_cgroup_is_obsolete(mem));
while (1) {
int ret;
@@ -1065,12 +1084,19 @@ int mem_cgroup_cache_charge(struct page
MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
}
+/*
+ * During swap-in (try_charge -> commit or cancel), the page is locked.
+ * When try_charge() returns successfully, one refcnt to the memcg, not yet
+ * associated with a struct page_cgroup, is acquired. This refcnt will be
+ * consumed by "commit()" or dropped by "cancel()".
+ */
int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
struct page *page,
gfp_t mask, struct mem_cgroup **ptr)
{
struct mem_cgroup *mem;
swp_entry_t ent;
+ int ret;
if (mem_cgroup_disabled())
return 0;
@@ -1089,10 +1115,15 @@ int mem_cgroup_try_charge_swapin(struct
ent.val = page_private(page);
mem = lookup_swap_cgroup(ent);
- if (!mem || mem->obsolete)
+ if (!mem)
+ goto charge_cur_mm;
+ if (!css_tryget(&mem->css))
goto charge_cur_mm;
*ptr = mem;
- return __mem_cgroup_try_charge(NULL, mask, ptr, true);
+ ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
+ /* drop extra refcnt from tryget */
+ css_put(&mem->css);
+ return ret;
charge_cur_mm:
if (unlikely(!mm))
mm = &init_mm;
@@ -1123,13 +1154,18 @@ int mem_cgroup_cache_charge_swapin(struc
ent.val = page_private(page);
if (do_swap_account) {
mem = lookup_swap_cgroup(ent);
- if (mem && mem->obsolete)
- mem = NULL;
- if (mem)
- mm = NULL;
+ if (mem) {
+ if (css_tryget(&mem->css))
+ mm = NULL; /* charge to recorded */
+ else
+ mem = NULL; /* charge to current */
+ }
}
ret = mem_cgroup_charge_common(page, mm, mask,
MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
+ /* drop extra refcnt from tryget */
+ if (mem)
+ css_put(&mem->css);
if (!ret && do_swap_account) {
/* avoid double counting */
@@ -1171,7 +1207,6 @@ void mem_cgroup_commit_charge_swapin(str
struct mem_cgroup *memcg;
memcg = swap_cgroup_record(ent, NULL);
if (memcg) {
- /* If memcg is obsolete, memcg can be != ptr */
res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
mem_cgroup_put(memcg);
}
@@ -1414,14 +1449,9 @@ int mem_cgroup_shrink_usage(struct mm_st
if (!mm)
return 0;
- rcu_read_lock();
- mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
- if (unlikely(!mem)) {
- rcu_read_unlock();
+ mem = try_get_mem_cgroup_from_mm(mm);
+ if (unlikely(!mem))
return 0;
- }
- css_get(&mem->css);
- rcu_read_unlock();
do {
progress = mem_cgroup_hierarchical_reclaim(mem, gfp_mask, true);
@@ -2079,9 +2109,6 @@ static struct mem_cgroup *mem_cgroup_all
* the number of reference from swap_cgroup and free mem_cgroup when
* it goes down to 0.
*
- * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and
- * entry which points to this memcg will be ignore at swapin.
- *
* Removal of cgroup itself succeeds regardless of refs from swap.
*/
@@ -2167,7 +2194,6 @@ static void mem_cgroup_pre_destroy(struc
struct cgroup *cont)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
- mem->obsolete = 1;
mem_cgroup_force_empty(mem, false);
}
--
* [PATCH 7/9] cgroup: Support CSS ID
2008-12-16 9:09 [PATCH] memcg updates (2008/12/16) KAMEZAWA Hiroyuki
` (5 preceding siblings ...)
2008-12-16 9:17 ` [PATCH 6/9] memcg: use css_tryget() KAMEZAWA Hiroyuki
@ 2008-12-16 9:19 ` KAMEZAWA Hiroyuki
2008-12-16 10:24 ` Paul Menage
2008-12-16 9:21 ` [PATCH 8/9] memcg: hierarchical memory reclaim by round-robin KAMEZAWA Hiroyuki
2008-12-16 9:22 ` [PATCH 9/9] memcg : fix OOM killer under hierarchy KAMEZAWA Hiroyuki
8 siblings, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-16 9:19 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, nishimura, lizf
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Patch for per-CSS ID and private hierarchy code.

This patch attaches a unique ID to each css and provides the following
functions.

- css_lookup(subsys, id)
  returns the css with the given id.
- css_get_next(subsys, id, root, &foundid)
  returns the next css under "root" by scanning a bitmap (not by
  tree-walk).

When cgroup_subsys->use_id is set, the id field and a bitmap for css are
maintained. kernel/cgroup.c only prepares
- the css_id of the root css for the subsys
- alloc/free id functions.
So each subsys should allocate an ID in its create() callback if
necessary (a usage sketch follows the design choices below).
There are several reasons to develop this.
- Saving space.... For example, memcg's swap_cgroup is an array of
  pointers to cgroups, but it does not need to be very fast.
  By replacing pointers (8 bytes per entry) with IDs (2 bytes per entry),
  we can reduce the memory usage considerably.
- Scanning without locks.
  CSS ID provides a "scan IDs under this root" function. With this,
  scanning the css tree under a root can be written without locks.
ex)
	do {
		rcu_read_lock();
		next = css_get_next(subsys, id, root, &found);
		/* check sanity of next here */
		if (next)
			css_tryget(next);
		rcu_read_unlock();
		id = found + 1;
	} while (...);
Characteristics:
- Each css has a unique ID under its subsys.
- The lifetime of an ID is controlled by the subsys.
- A css ID contains the "ID", the "depth in hierarchy", and the stack of
  hierarchy IDs.
- Allowed IDs are 1-65535; ID 0 is the unused ID.
Design Choices:
- scan-by-ID vs. scan-by-tree-walk.
  As with /proc's pid scan, scan-by-ID is robust when scanning is done by
  a routine of the following kind:
	scan -> rest a while (release a lock) -> continue from where interrupted
  memcg's hierarchical reclaim does this.
- When subsys->use_id is set, the number of css in the system is limited
  to 65534.
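For a subsystem adopting this, the expected call sites look roughly as
follows (a sketch under the API above; "my_state" and "my_create" are
made up, and patch 8 shows the real memcg version):

    /* Sketch: a subsys with ->use_id = 1 allocates an ID at create(). */
    static struct cgroup_subsys_state *
    my_create(struct cgroup_subsys *ss, struct cgroup *cont)
    {
            struct my_state *st = kzalloc(sizeof(*st), GFP_KERNEL);
            int error;

            if (!st)
                    return ERR_PTR(-ENOMEM);
            if (cont->parent) {
                    /* the root css's ID is set up by the framework */
                    error = css_id_alloc(ss,
                                    cont->parent->subsys[ss->subsys_id],
                                    &st->css);
                    if (error) {
                            kfree(st);
                            return ERR_PTR(error);
                    }
            }
            return &st->css;
    }

and the ID is released on the subsys's free path (at destroy() or
wherever the css is freed) with free_css_id(ss, &st->css).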
Changelog: (v5) -> (v6)
- max depth is removed.
- changed arguments to "scan" from ID to pointer of root css.
Changelog: (v4) -> (v5)
- Totally re-designed as per-css ID.
Changelog: (v3) -> (v4)
- updated comments.
- renamed hierarchy_code[] to stack[]
- merged prepare_id routines.
Changelog (v2) -> (v3)
- removed cgroup_id_getref().
- added cgroup_id_tryget().
Changelog (v1) -> (v2):
- Design change: show only ID(integer) to outside of cgroup.c
- moved cgroup ID definition from include/ to kernel/cgroup.c
- struct cgroup_id is freed by RCU.
- changed interface from pointer to "int"
- kill_sb() is handled.
- ID 0 as unused ID.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/cgroup.h | 49 +++++++++
include/linux/idr.h | 1
kernel/cgroup.c | 251 ++++++++++++++++++++++++++++++++++++++++++++++++-
lib/idr.c | 46 ++++++++
4 files changed, 345 insertions(+), 2 deletions(-)
Index: mmotm-2.6.28-Dec15/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Dec15.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Dec15/include/linux/cgroup.h
@@ -15,13 +15,14 @@
#include <linux/cgroupstats.h>
#include <linux/prio_heap.h>
#include <linux/rwsem.h>
-
+#include <linux/idr.h>
#ifdef CONFIG_CGROUPS
struct cgroupfs_root;
struct cgroup_subsys;
struct inode;
struct cgroup;
+struct css_id;
extern int cgroup_init_early(void);
extern int cgroup_init(void);
@@ -59,6 +60,8 @@ struct cgroup_subsys_state {
atomic_t refcnt;
unsigned long flags;
+ /* ID for this css, if possible */
+ struct css_id *id;
};
/* bits in struct cgroup_subsys_state flags field */
@@ -360,6 +363,11 @@ struct cgroup_subsys {
int active;
int disabled;
int early_init;
+ /*
+ * set 1 if subsys uses ID. ID is not available before cgroup_init()
+ * (not available in early_init time.)
+ */
+ int use_id;
#define MAX_CGROUP_TYPE_NAMELEN 32
const char *name;
@@ -373,6 +381,9 @@ struct cgroup_subsys {
/* Protected by this->hierarchy_mutex and cgroup_lock() */
struct cgroupfs_root *root;
struct list_head sibling;
+ /* used when use_id == 1 */
+ struct idr idr;
+ spinlock_t id_lock;
};
#define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
@@ -426,6 +437,42 @@ void cgroup_iter_end(struct cgroup *cgrp
int cgroup_scan_tasks(struct cgroup_scanner *scan);
int cgroup_attach_task(struct cgroup *, struct task_struct *);
+/*
+ * CSS ID is an ID for all css structs under a subsys. It only works when
+ * cgroup_subsys->use_id != 0, and can be used for lookup and scanning.
+ * An ID is assigned at css allocation (create) and removed when the
+ * refcnt of the ID goes down to 0. The refcnt is incremented when a
+ * subsys wants to avoid reuse of an ID for persistent objects; usually
+ * the refcnt of an ID will be 0 by the time the cgroup is removed.
+ * These look-up and scan functions should be called under rcu_read_lock();
+ * cgroup_lock() is not necessary.
+ */
+
+/* called at create() */
+int css_id_alloc(struct cgroup_subsys *ss, struct cgroup_subsys_state *parent,
+ struct cgroup_subsys_state *css);
+/* called at destroy(), or somewhere we can free ID */
+void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
+
+/* Find a cgroup which the "ID" is attached. */
+struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
+/*
+ * Get next cgroup under tree. Returning a cgroup which has equal or greater
+ * ID than "id" in argument.
+ */
+struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
+ struct cgroup_subsys_state *root, int *foundid);
+
+/* get id and depth of css */
+unsigned short css_id(struct cgroup_subsys_state *css);
+unsigned short css_depth(struct cgroup_subsys_state *css);
+/* returns non-zero if root is ancestor of cg */
+int css_is_ancestor(struct cgroup_subsys_state *cg,
+ struct cgroup_subsys_state *root);
+
#else /* !CONFIG_CGROUPS */
static inline int cgroup_init_early(void) { return 0; }
Index: mmotm-2.6.28-Dec15/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Dec15.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Dec15/kernel/cgroup.c
@@ -46,7 +46,7 @@
#include <linux/cgroupstats.h>
#include <linux/hash.h>
#include <linux/namei.h>
-
+#include <linux/idr.h>
#include <asm/atomic.h>
static DEFINE_MUTEX(cgroup_mutex);
@@ -185,6 +185,8 @@ struct cg_cgroup_link {
static struct css_set init_css_set;
static struct cg_cgroup_link init_css_set_link;
+static int cgroup_subsys_init_idr(struct cgroup_subsys *ss);
+
/* css_set_lock protects the list of css_set objects, and the
* chain of tasks off each css_set. Nests outside task->alloc_lock
* due to cgroup_iter_start() */
@@ -2684,6 +2686,8 @@ int __init cgroup_init(void)
struct cgroup_subsys *ss = subsys[i];
if (!ss->early_init)
cgroup_init_subsys(ss);
+ if (ss->use_id)
+ cgroup_subsys_init_idr(ss);
}
/* Add init_css_set to the hash table */
@@ -3215,3 +3219,248 @@ static int __init cgroup_disable(char *s
return 1;
}
__setup("cgroup_disable=", cgroup_disable);
+
+/*
+ * CSS ID
+ */
+struct css_id {
+ /*
+ * The css to which this ID points. If the cgroup is removed,
+ * this will point to NULL.
+ */
+ struct cgroup_subsys_state *css;
+ /*
+ * ID of this css.
+ */
+ unsigned short id;
+ /*
+ * Depth in hierarchy which this ID belongs to.
+ */
+ unsigned short depth;
+ /*
+ * ID is freed by RCU. (and lookup routine is RCU safe.)
+ */
+ struct rcu_head rcu_head;
+ /*
+ * Hierarchy of CSS ID belongs to.
+ */
+ unsigned short stack[0]; /* Length of this field is defined by depth */
+};
+
+/*
+ * To get an ID other than 0, this should be called when !cgroup_is_removed().
+ */
+unsigned short css_id(struct cgroup_subsys_state *css)
+{
+ struct css_id *cssid = rcu_dereference(css->id);
+
+ if (cssid)
+ return cssid->id;
+ return 0;
+}
+
+unsigned short css_depth(struct cgroup_subsys_state *css)
+{
+ struct css_id *cssid = rcu_dereference(css->id);
+
+ if (cssid)
+ return cssid->depth;
+ return 0;
+}
+
+int css_is_ancestor(struct cgroup_subsys_state *css,
+ struct cgroup_subsys_state *root)
+{
+ struct css_id *id = css->id;
+ struct css_id *ans = root->id;
+
+ if (!id || !ans || (id->depth < ans->depth))
+ return 0;
+ return id->stack[ans->depth] == ans->id;
+}
+
+static void __free_css_id_cb(struct rcu_head *head)
+{
+ struct css_id *id;
+
+ id = container_of(head, struct css_id, rcu_head);
+ kfree(id);
+}
+
+void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css)
+{
+ struct css_id *id = css->id;
+
+ BUG_ON(!ss->use_id);
+
+ rcu_assign_pointer(id->css, NULL);
+ rcu_assign_pointer(css->id, NULL);
+ spin_lock(&ss->id_lock);
+ idr_remove(&ss->idr, id->id);
+ spin_unlock(&ss->id_lock);
+ call_rcu(&id->rcu_head, __free_css_id_cb);
+}
+
+
+static int __get_and_prepare_newid(struct cgroup_subsys *ss,
+ int depth, struct css_id **ret)
+{
+ struct css_id *newid;
+ int myid, error, size;
+
+ BUG_ON(!ss->use_id);
+
+ size = sizeof(struct css_id) + sizeof(unsigned short) * (depth + 1);
+ newid = kzalloc(size, GFP_KERNEL);
+ if (!newid)
+ return -ENOMEM;
+ /* get id */
+ if (unlikely(!idr_pre_get(&ss->idr, GFP_KERNEL))) {
+ error = -ENOMEM;
+ goto err_out;
+ }
+ spin_lock(&ss->id_lock);
+ /* Don't use 0 */
+ error = idr_get_new_above(&ss->idr, newid, 1, &myid);
+ spin_unlock(&ss->id_lock);
+
+ /* Returns error when there is no free space for a new ID. */
+ if (error) {
+ error = -ENOSPC;
+ goto err_out;
+ }
+
+ newid->id = myid;
+ newid->depth = depth;
+ *ret = newid;
+ return 0;
+err_out:
+ kfree(newid);
+ return error;
+
+}
+
+
+static int __init cgroup_subsys_init_idr(struct cgroup_subsys *ss)
+{
+ struct css_id *newid;
+ struct cgroup_subsys_state *rootcss;
+ int err = -ENOMEM;
+
+ spin_lock_init(&ss->id_lock);
+ idr_init(&ss->idr);
+
+ rootcss = init_css_set.subsys[ss->subsys_id];
+ err = __get_and_prepare_newid(ss, 0, &newid);
+ if (err)
+ return err;
+
+ newid->stack[0] = newid->id;
+ newid->css = rootcss;
+ rootcss->id = newid;
+ return 0;
+}
+
+int css_id_alloc(struct cgroup_subsys *ss,
+ struct cgroup_subsys_state *parent,
+ struct cgroup_subsys_state *css)
+{
+ int i, depth = 0;
+ struct css_id *cssid, *parent_id = NULL;
+ int error;
+
+ if (parent) {
+ parent_id = parent->id;
+ depth = parent_id->depth + 1;
+ }
+
+ error = __get_and_prepare_newid(ss, depth, &cssid);
+ if (error)
+ return error;
+
+ for (i = 0; i < depth; i++)
+ cssid->stack[i] = parent_id->stack[i];
+ cssid->stack[depth] = cssid->id;
+
+ rcu_assign_pointer(cssid->css, css);
+ rcu_assign_pointer(css->id, cssid);
+
+ return 0;
+}
+
+/**
+ * css_lookup - lookup css by id
+ * @id: the id of cgroup to be looked up
+ *
+ * Returns pointer to css if there is valid css with id, NULL if not.
+ * Should be called under rcu_read_lock()
+ */
+
+struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
+{
+ struct cgroup_subsys_state *css = NULL;
+ struct css_id *cssid = NULL;
+
+ BUG_ON(!ss->use_id);
+ rcu_read_lock();
+ cssid = idr_find(&ss->idr, id);
+
+ if (unlikely(!cssid))
+ goto out;
+
+ css = rcu_dereference(cssid->css);
+out:
+ rcu_read_unlock();
+ return css;
+}
+
+/**
+ * css_get_next - lookup next cgroup under specified hierarchy.
+ * @ss: pointer to subsystem
+ * @id: current position of iteration.
+ * @root: pointer to css. search tree under this.
+ * @foundid: position of found object.
+ *
+ * Search next css under the specified hierarchy of rootid. Calling under
+ * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
+ */
+struct cgroup_subsys_state *
+css_get_next(struct cgroup_subsys *ss, int id,
+ struct cgroup_subsys_state *root, int *foundid)
+{
+ struct cgroup_subsys_state *ret = NULL;
+ struct css_id *tmp;
+ int tmpid;
+ int rootid = css_id(root);
+ int depth = css_depth(root);
+
+ if (!rootid)
+ return NULL;
+
+ BUG_ON(!ss->use_id);
+ rcu_read_lock();
+ tmpid = id;
+ while (1) {
+ /* scan next entry from bitmap(tree) */
+ spin_lock(&ss->id_lock);
+ tmp = idr_get_next(&ss->idr, &tmpid);
+ spin_unlock(&ss->id_lock);
+
+ if (!tmp) {
+ ret = NULL;
+ break;
+ }
+ if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
+ ret = rcu_dereference(tmp->css);
+ /* Sanity check and check hierarchy */
+ if (ret && !css_is_removed(ret))
+ break;
+ }
+ tmpid = tmpid + 1;
+ }
+
+ rcu_read_unlock();
+ *foundid = tmpid;
+ return ret;
+}
+
Index: mmotm-2.6.28-Dec15/include/linux/idr.h
===================================================================
--- mmotm-2.6.28-Dec15.orig/include/linux/idr.h
+++ mmotm-2.6.28-Dec15/include/linux/idr.h
@@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
int idr_for_each(struct idr *idp,
int (*fn)(int id, void *p, void *data), void *data);
+void *idr_get_next(struct idr *idp, int *nextid);
void *idr_replace(struct idr *idp, void *ptr, int id);
void idr_remove(struct idr *idp, int id);
void idr_remove_all(struct idr *idp);
Index: mmotm-2.6.28-Dec15/lib/idr.c
===================================================================
--- mmotm-2.6.28-Dec15.orig/lib/idr.c
+++ mmotm-2.6.28-Dec15/lib/idr.c
@@ -579,6 +579,52 @@ int idr_for_each(struct idr *idp,
EXPORT_SYMBOL(idr_for_each);
/**
+ * idr_get_next - lookup the next object with an id >= the given id
+ * @idp: idr handle
+ * @nextidp: pointer to lookup key
+ *
+ * Returns a pointer to the registered object with the smallest id that is
+ * greater than or equal to *nextidp; updates *nextidp to that id.
+ */
+
+void *idr_get_next(struct idr *idp, int *nextidp)
+{
+ struct idr_layer *p, *pa[MAX_LEVEL];
+ struct idr_layer **paa = &pa[0];
+ int id = *nextidp;
+ int n, max;
+
+ /* find first ent */
+ n = idp->layers * IDR_BITS;
+ max = 1 << n;
+ p = rcu_dereference(idp->top);
+ if (!p)
+ return NULL;
+
+ while (id < max) {
+ while (n > 0 && p) {
+ n -= IDR_BITS;
+ *paa++ = p;
+ p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
+ }
+
+ if (p) {
+ *nextidp = id;
+ return p;
+ }
+
+ id += 1 << n;
+ while (n < fls(id)) {
+ n += IDR_BITS;
+ p = *--paa;
+ }
+ }
+ return NULL;
+}
+
+
+
+/**
* idr_replace - replace pointer for given id
* @idp: idr handle
* @ptr: pointer you want associated with the id
--
* [PATCH 8/9] memcg: hierarchical memory reclaim by round-robin.
2008-12-16 9:09 [PATCH] memcg updates (2008/12/16) KAMEZAWA Hiroyuki
` (6 preceding siblings ...)
2008-12-16 9:19 ` [PATCH 7/9] cgroup: Support CSS ID KAMEZAWA Hiroyuki
@ 2008-12-16 9:21 ` KAMEZAWA Hiroyuki
2008-12-16 9:22 ` [PATCH 9/9] memcg : fix OOM killer under hierarchy KAMEZAWA Hiroyuki
8 siblings, 0 replies; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-16 9:21 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, nishimura, lizf
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Use CSS IDs in memcg.

Assign a CSS ID to each memcg and use css_get_next() for scanning the
hierarchy. Assume the following tree:

	group_A (ID=3)
		/01 (ID=4)
			/0A (ID=7)
		/02 (ID=10)
	group_B (ID=5)

A task in group_A/01/0A hits the limit at group_A. Reclaim will then be
done in the following order (round-robin):

	group_A(3) -> group_A/01(4) -> group_A/01/0A(7) -> group_A/02(10)
	-> group_A(3) -> .....

Round-robin by ID: the last visited cgroup is recorded, and reclaim
restarts from it the next time it runs.
(A smarter algorithm could be implemented.)
No cgroup_mutex or hierarchy_mutex is required.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 214 ++++++++++++++++++++------------------------------------
1 file changed, 78 insertions(+), 136 deletions(-)
Index: mmotm-2.6.28-Dec15/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec15.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec15/mm/memcontrol.c
@@ -154,9 +154,10 @@ struct mem_cgroup {
/*
* While reclaiming in a hiearchy, we cache the last child we
- * reclaimed from. Protected by hierarchy_mutex
+ * reclaimed from.
*/
- struct mem_cgroup *last_scanned_child;
+ int last_scanned_child;
+ unsigned long scan_age;
/*
* Should the accounting and control be hierarchical, per subtree?
*/
@@ -576,103 +577,6 @@ unsigned long mem_cgroup_isolate_pages(u
#define mem_cgroup_from_res_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)
-/*
- * This routine finds the DFS walk successor. This routine should be
- * called with hierarchy_mutex held
- */
-static struct mem_cgroup *
-mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
-{
- struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
-
- curr_cgroup = curr->css.cgroup;
- root_cgroup = root_mem->css.cgroup;
-
- if (!list_empty(&curr_cgroup->children)) {
- /*
- * Walk down to children
- */
- mem_cgroup_put(curr);
- cgroup = list_entry(curr_cgroup->children.next,
- struct cgroup, sibling);
- curr = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(curr);
- goto done;
- }
-
-visit_parent:
- if (curr_cgroup == root_cgroup) {
- mem_cgroup_put(curr);
- curr = root_mem;
- mem_cgroup_get(curr);
- goto done;
- }
-
- /*
- * Goto next sibling
- */
- if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
- mem_cgroup_put(curr);
- cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
- sibling);
- curr = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(curr);
- goto done;
- }
-
- /*
- * Go up to next parent and next parent's sibling if need be
- */
- curr_cgroup = curr_cgroup->parent;
- goto visit_parent;
-
-done:
- root_mem->last_scanned_child = curr;
- return curr;
-}
-
-/*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
- */
-static struct mem_cgroup *
-mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
-{
- struct cgroup *cgroup;
- struct mem_cgroup *ret;
- bool obsolete;
-
- obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
-
- /*
- * Scan all children under the mem_cgroup mem
- */
- mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
- if (list_empty(&root_mem->css.cgroup->children)) {
- ret = root_mem;
- goto done;
- }
-
- if (!root_mem->last_scanned_child || obsolete) {
-
- if (obsolete)
- mem_cgroup_put(root_mem->last_scanned_child);
-
- cgroup = list_first_entry(&root_mem->css.cgroup->children,
- struct cgroup, sibling);
- ret = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(ret);
- } else
- ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
- root_mem);
-
-done:
- root_mem->last_scanned_child = ret;
- mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
- return ret;
-}
-
static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
{
if (do_swap_account) {
@@ -702,6 +606,48 @@ static unsigned int get_swappiness(struc
}
/*
+ * Visit the first child (need not be the first child as per the ordering
+ * of the cgroup list, since we track last_scanned_child) of @mem and use
+ * that to reclaim free pages from.
+ */
+static struct mem_cgroup *
+mem_cgroup_select_victim(struct mem_cgroup *root_mem)
+{
+ struct mem_cgroup *ret = NULL;
+ struct cgroup_subsys_state *css;
+ int nextid, found;
+
+ if (!root_mem->use_hierarchy) {
+ spin_lock(&root_mem->reclaim_param_lock);
+ root_mem->scan_age++;
+ spin_unlock(&root_mem->reclaim_param_lock);
+ css_get(&root_mem->css);
+ ret = root_mem;
+ }
+
+ while (!ret) {
+ rcu_read_lock();
+ nextid = root_mem->last_scanned_child + 1;
+ css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
+ &found);
+ if (css && css_tryget(css))
+ ret = container_of(css, struct mem_cgroup, css);
+
+ rcu_read_unlock();
+ /* Updates scanning parameter */
+ spin_lock(&root_mem->reclaim_param_lock);
+ if (!css) {
+ root_mem->last_scanned_child = 0;
+ root_mem->scan_age++;
+ } else
+ root_mem->last_scanned_child = found;
+ spin_unlock(&root_mem->reclaim_param_lock);
+ }
+
+ return ret;
+}
+
+/*
* Dance down the hierarchy if needed to reclaim memory. We remember the
* last child we reclaimed from, so that we don't end up penalizing
* one child extensively based on its position in the children list.
@@ -711,40 +657,24 @@ static unsigned int get_swappiness(struc
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
gfp_t gfp_mask, bool noswap)
{
- struct mem_cgroup *next_mem;
- int ret = 0;
-
- /*
- * Reclaim unconditionally and don't check for return value.
- * We need to reclaim in the current group and down the tree.
- * One might think about checking for children before reclaiming,
- * but there might be left over accounting, even after children
- * have left.
- */
- ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
- get_swappiness(root_mem));
- if (mem_cgroup_check_under_limit(root_mem))
- return 0;
- if (!root_mem->use_hierarchy)
- return ret;
-
- next_mem = mem_cgroup_get_first_node(root_mem);
-
- while (next_mem != root_mem) {
- if (mem_cgroup_is_obsolete(next_mem)) {
- mem_cgroup_put(next_mem);
- next_mem = mem_cgroup_get_first_node(root_mem);
- continue;
- }
- ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
- get_swappiness(next_mem));
+ struct mem_cgroup *victim;
+ unsigned long start_time;
+ int ret, total = 0;
+ /*
+ * Reclaim pages from cgroups under root_mem, in round robin.
+ */
+ start_time = root_mem->scan_age;
+
+ while (time_after((start_time + 2UL), root_mem->scan_age)) {
+ victim = mem_cgroup_select_victim(root_mem);
+ ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
+ get_swappiness(victim));
+ css_put(&victim->css);
+ total += ret;
if (mem_cgroup_check_under_limit(root_mem))
- return 0;
- mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
- next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
- mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
+ return 1 + total;
}
- return ret;
+ return total;
}
bool mem_cgroup_oom_called(struct task_struct *task)
@@ -1274,7 +1204,6 @@ __mem_cgroup_uncharge_common(struct page
default:
break;
}
-
res_counter_uncharge(&mem->res, PAGE_SIZE);
if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
res_counter_uncharge(&mem->memsw, PAGE_SIZE);
@@ -2116,6 +2048,8 @@ static void __mem_cgroup_free(struct mem
{
int node;
+ free_css_id(&mem_cgroup_subsys, &mem->css);
+
for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);
@@ -2153,11 +2087,12 @@ static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
struct mem_cgroup *mem, *parent;
+ long error = -ENOMEM;
int node;
mem = mem_cgroup_alloc();
if (!mem)
- return ERR_PTR(-ENOMEM);
+ return ERR_PTR(error);
for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
@@ -2178,16 +2113,22 @@ mem_cgroup_create(struct cgroup_subsys *
res_counter_init(&mem->res, NULL);
res_counter_init(&mem->memsw, NULL);
}
- mem->last_scanned_child = NULL;
+ mem->last_scanned_child = 0;
+ mem->scan_age = 0;
spin_lock_init(&mem->reclaim_param_lock);
- if (parent)
+ if (parent) {
mem->swappiness = get_swappiness(parent);
+ error = css_id_alloc(&mem_cgroup_subsys,
+ &parent->css, &mem->css);
+ if (error)
+ goto free_out;
+ }
atomic_set(&mem->refcnt, 1);
return &mem->css;
free_out:
__mem_cgroup_free(mem);
- return ERR_PTR(-ENOMEM);
+ return ERR_PTR(error);
}
static void mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
@@ -2238,6 +2179,7 @@ struct cgroup_subsys mem_cgroup_subsys =
.populate = mem_cgroup_populate,
.attach = mem_cgroup_move_task,
.early_init = 0,
+ .use_id = 1,
};
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
--
* [PATCH 9/9] memcg : fix OOM killer under hierarchy
2008-12-16 9:09 [PATCH] memcg updates (2008/12/16) KAMEZAWA Hiroyuki
` (7 preceding siblings ...)
2008-12-16 9:21 ` [PATCH 8/9] memcg: hierarchical memory reclaim by round-robin KAMEZAWA Hiroyuki
@ 2008-12-16 9:22 ` KAMEZAWA Hiroyuki
8 siblings, 0 replies; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-16 9:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, nishimura, lizf
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
The current memcg OOM killer has two problems when hierarchy is used.

Assume the following tree:

	Group_A/	use_hierarchy = 1, limit=1G
		01/	no limit
		02/	no limit
		03/	no limit

In this case, the sum of memory usage from 01, 02 and 03 is limited to 1G
(by Group_A). Assume a task in Group_A/01 causes an OOM: in this case,
select_bad_process() will select a process in Group_A itself and never
scans 01, 02 or 03. This patch fixes that behavior.

Also, to avoid calling oom_kill twice, the mem_cgroup_oom_called() hook is
used in pagefault_out_of_memory(). This checks the timestamp of the most
recent OOM in the memcg; that timestamp should be updated throughout the
hierarchy.

Changelog:
- added documentation about an easy OOM-kill test.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Documentation/controllers/memcg_test.txt | 14 +++++++
include/linux/memcontrol.h | 4 +-
mm/memcontrol.c | 55 +++++++++++++++++++++++++++++--
mm/oom_kill.c | 4 +-
4 files changed, 71 insertions(+), 6 deletions(-)
Index: mmotm-2.6.28-Dec15/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec15.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec15/mm/memcontrol.c
@@ -399,12 +399,31 @@ void mem_cgroup_move_lists(struct page *
mem_cgroup_add_lru_list(page, to);
}
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
+static int
+mm_match_cgroup_hierarchy(struct mm_struct *mm, struct mem_cgroup *mem)
+{
+ struct mem_cgroup *curr;
+ int ret;
+
+ if (!mm)
+ return 0;
+ rcu_read_lock();
+ curr = mem_cgroup_from_task(mm->owner);
+ if (mem->use_hierarchy)
+ ret = css_is_ancestor(&curr->css, &mem->css);
+ else
+ ret = (curr == mem);
+ rcu_read_unlock();
+ return ret;
+}
+
+int task_in_mem_cgroup_hierarchy(struct task_struct *task,
+ struct mem_cgroup *mem)
{
int ret;
task_lock(task);
- ret = task->mm && mm_match_cgroup(task->mm, mem);
+ ret = mm_match_cgroup_hierarchy(task->mm, mem);
task_unlock(task);
return ret;
}
@@ -677,6 +696,36 @@ static int mem_cgroup_hierarchical_recla
return total;
}
+/*
+ * Update last_oom_jiffies of hierarchy.
+ */
+void mem_cgroup_update_oom_jiffies(struct mem_cgroup *mem)
+{
+ struct mem_cgroup *cur;
+ struct cgroup_subsys_state *css;
+ int id, found;
+
+ if (!mem->use_hierarchy) {
+ mem->last_oom_jiffies = jiffies;
+ return;
+ }
+
+ id = 0;
+ rcu_read_lock();
+ while (1) {
+ css = css_get_next(&mem_cgroup_subsys, id, &mem->css, &found);
+ if (!css)
+ break;
+ if (css_tryget(css)) {
+ cur = container_of(css, struct mem_cgroup, css);
+ cur->last_oom_jiffies = jiffies;
+ css_put(css);
+ }
+ id = found + 1;
+ }
+ rcu_read_unlock();
+ return;
+}
bool mem_cgroup_oom_called(struct task_struct *task)
{
bool ret = false;
@@ -773,7 +822,7 @@ static int __mem_cgroup_try_charge(struc
mutex_lock(&memcg_tasklist);
mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
mutex_unlock(&memcg_tasklist);
- mem_over_limit->last_oom_jiffies = jiffies;
+ mem_cgroup_update_oom_jiffies(mem_over_limit);
}
goto nomem;
}
Index: mmotm-2.6.28-Dec15/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Dec15.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Dec15/include/linux/memcontrol.h
@@ -65,7 +65,9 @@ extern unsigned long mem_cgroup_isolate_
struct mem_cgroup *mem_cont,
int active, int file);
extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
+
+int task_in_mem_cgroup_hierarchy(struct task_struct *task,
+ struct mem_cgroup *mem);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
Index: mmotm-2.6.28-Dec15/mm/oom_kill.c
===================================================================
--- mmotm-2.6.28-Dec15.orig/mm/oom_kill.c
+++ mmotm-2.6.28-Dec15/mm/oom_kill.c
@@ -220,7 +220,7 @@ static struct task_struct *select_bad_pr
/* skip the init task */
if (is_global_init(p))
continue;
- if (mem && !task_in_mem_cgroup(p, mem))
+ if (mem && !task_in_mem_cgroup_hierarchy(p, mem))
continue;
/*
@@ -292,7 +292,7 @@ static void dump_tasks(const struct mem_
*/
if (!p->mm)
continue;
- if (mem && !task_in_mem_cgroup(p, mem))
+ if (mem && !task_in_mem_cgroup_hierarchy(p, mem))
continue;
if (!thread_group_leader(p))
continue;
Index: mmotm-2.6.28-Dec15/Documentation/controllers/memcg_test.txt
===================================================================
--- mmotm-2.6.28-Dec15.orig/Documentation/controllers/memcg_test.txt
+++ mmotm-2.6.28-Dec15/Documentation/controllers/memcg_test.txt
@@ -340,3 +340,17 @@ Under below explanation, we assume CONFI
# mount -t cgroup none /cgroup -t cpuset,memory,cpu,devices
and do task move, mkdir, rmdir etc...under this.
+
+ 9.7 OOM-KILL
+ If a memcg runs out of memory, the OOM killer should kill a task in
+ that memcg. select_bad_process() should take the hierarchy into
+ account, and the OOM killer itself shouldn't call panic_on_oom.
+
+ It's not difficult to cause OOM under memcg by setting memsw.limit
+ as follows:
+ # echo 50M > memory.limit_in_bytes
+ # echo 50M > memory.memsw.limit_in_bytes
+ and running a program that mallocs 51M (a minimal example follows
+ this patch).
+ (Alternatively, do swapoff and then malloc().)
+
+
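For reference, a minimal version of the malloc(51M) program mentioned
above could look like this; a sketch, not part of the patch, which
touches every page so the memcg actually charges the memory:

	#include <stdlib.h>
	#include <string.h>

	#define SZ (51UL * 1024 * 1024)	/* just above the 50M limit */

	int main(void)
	{
		char *p = malloc(SZ);

		if (!p)
			return 1;
		memset(p, 1, SZ);	/* fault in each page so it is charged */
		return 0;
	}

Run it from a task attached to the cgroup; with memsw.limit_in_bytes
also set to 50M the pages cannot escape to swap, so the memcg OOM path
triggers.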
* Re: [PATCH 7/9] cgroup: Support CSS ID
2008-12-16 9:19 ` [PATCH 7/9] cgroup: Support CSS ID KAMEZAWA Hiroyuki
@ 2008-12-16 10:24 ` Paul Menage
2008-12-16 11:09 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 15+ messages in thread
From: Paul Menage @ 2008-12-16 10:24 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, lizf
On Tue, Dec 16, 2008 at 1:19 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Patch for Per-CSS ID and private hierarchy code.
>
> This patch tries to assign a ID to each css. Attach unique ID to each
> css and provides following functions.
>
> - css_lookup(subsys, id)
> returns struct cgroup of id.
> - css_get_next(subsys, id, rootid, depth, foundid)
> returns the next cgroup under "root" by scanning bitmap (not by tree-walk)
Basic approach looks great - but there are a lot of typos in comments.
>
> When cgrou_subsys->use_id is set, id field and bitmap for css is maintained.
When cgroup_subsys.use_id is set, an id is maintained for each css
(via an idr bitmap)
> kernel/cgroup.c just parepare
The cgroups framework only prepares:
> - css_id of root css for subsys
> - alloc/free id functions.
> So, each subsys should allocate ID in attach() callback if necessary.
>
> There is several reasons to develop this.
> - Saving space .... For example, memcg's swap_cgroup is array of
> pointers to cgroup. But it is not necessary to be very fast.
> By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
> reduce much amount of memory usage.
>
> - Scanning without lock.
> CSS_ID provides "scan id under this ROOT" function. By this, scanning
> css under root can be written without locks.
> ex)
> do {
> rcu_read_lock();
> next = cgroup_get_next(subsys, id, root, &found);
> /* check sanity of next here */
> css_tryget();
> rcu_read_unlock();
> id = found + 1
> } while(...)
>
> Characteristics:
> - Each css has unique ID under subsys.
> - Lifetime of ID is controlled by subsys.
> - css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
> - Allowed ID is 1-65535, ID 0 is UNUSED ID.
>
> + /*
> + * set 1 if subsys uses ID. ID is not available before cgroup_init()
Make this a bool rather than an int?
> + * (not available in early_init time.
> + */
> + int use_id;
> #define MAX_CGROUP_TYPE_NAMELEN 32
> const char *name;
>
> + * CSS ID is a ID for all css struct under subsys. Only works when
> + * cgroup_subsys->use_id != 0. It can be used for look up and scanning
> + * Cgroup ID is assined at cgroup allocation (create) and removed
assined -> assigned
> + * when refcnt to ID goes down to 0. Refcnt is inremented when subsys want to
> + * avoid reuse of ID for persistent objects.
Although the CSS ID is RCU-safe, the subsystem may increment its
refcount when it wishes to avoid reuse of that ID for a different CSS
while it holds the reference outside of an RCU section.
> In usual, refcnt to ID will be 0
> + * when cgroup is removed.
In the normal case, the refcount to the ID will be 0 when the cgroup is removed.
> + *
> + * Note: At using ID, max depth of the hierarchy is determined by
When using ID
> + * cgroup_subsys->max_id_depth.
> + */
Is this comment stale? There's no cgroup_subsys.max_id_depth in this patch.
> +
> +/* called at create() */
If the subsystem has specified use_id=true, is there any reason not to
automatically allocate the ID on its behalf?
> -
> +#include <linux/idr.h>
This is already included in cgroup.h
> + * The cgroup to whiech this ID points. If cgroup is removed,
"to which"
Mention RCU-safety of the cgroup pointer?
> + */
> + unsigned short stack[0]; /* Length of this field is defined by depth */
/* Array of length (depth+1) */
> +int css_is_ancestor(struct cgroup_subsys_state *css,
> + struct cgroup_subsys_state *root)
> +{
> + struct css_id *id = css->id;
> + struct css_id *ans = root->id;
It might be clearer to name the css pointers "child" and "root" and
the id pointers "child_id" and "root_id".
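With that renaming, the body of the check would presumably reduce to a
single compare against the hierarchy stack. A sketch, assuming the
stack[] layout described in the changelog (ancestor IDs recorded per
depth):

	int css_is_ancestor(struct cgroup_subsys_state *child,
			    struct cgroup_subsys_state *root)
	{
		struct css_id *child_id = child->id;
		struct css_id *root_id = root->id;

		/* root is an ancestor (or the css itself) iff root's own
		 * ID appears at root's depth in child's stack. */
		return child_id->depth >= root_id->depth &&
		       child_id->stack[root_id->depth] == root_id->id;
	}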
> +static int __get_and_prepare_newid(struct cgroup_subsys *ss,
> + int depth, struct css_id **ret)
> +{
> + struct css_id *newid;
> + int myid, error, size;
> +
> + BUG_ON(!ss->use_id);
> +
> + size = sizeof(struct css_id) + sizeof(unsigned short) * (depth + 1);
> + newid = kzalloc(size, GFP_KERNEL);
> + if (!newid)
> + return -ENOMEM;
> + /* get id */
> + if (unlikely(!idr_pre_get(&ss->idr, GFP_KERNEL))) {
> + error = -ENOMEM;
> + goto err_out;
> + }
Is this safe? If the only place that we allocated ids was in
cgroup_create() then it should be fine since allocation is
synchronized. But if the subsystem can allocate at other times as
well, then theoretically two threads could get past the idr_pre_get()
stage and one of them could exhaust the pre-allocated objects.
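For what it's worth, the usual way to tolerate concurrent allocators
with this idr API is to retry on -EAGAIN rather than treating every
failure as -ENOSPC. A sketch, assuming the locking quoted below:

	do {
		if (unlikely(!idr_pre_get(&ss->idr, GFP_KERNEL)))
			return -ENOMEM;
		spin_lock(&ss->id_lock);
		/* Don't use 0 */
		error = idr_get_new_above(&ss->idr, newid, 1, &myid);
		spin_unlock(&ss->id_lock);
		/* -EAGAIN means another thread used up our preallocation */
	} while (error == -EAGAIN);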
> + spin_lock(&ss->id_lock);
> + /* Don't use 0 */
> + error = idr_get_new_above(&ss->idr, newid, 1, &myid);
> + spin_unlock(&ss->id_lock);
> +
> + /* Returns error when there are no free spaces for new ID.*/
> + if (error) {
> + error = -ENOSPC;
> + goto err_out;
> + }
> +
> + newid->id = myid;
> + newid->depth = depth;
> + *ret = newid;
> + return 0;
> +err_out:
> + kfree(newid);
> + return error;
> +
> +}
> +
> +
> +static int __init cgroup_subsys_init_idr(struct cgroup_subsys *ss)
> +{
> + struct css_id *newid;
> + struct cgroup_subsys_state *rootcss;
> + int err = -ENOMEM;
> +
> + spin_lock_init(&ss->id_lock);
> + idr_init(&ss->idr);
> +
> + rootcss = init_css_set.subsys[ss->subsys_id];
> + err = __get_and_prepare_newid(ss, 0, &newid);
> + if (err)
> + return err;
> +
> + newid->stack[0] = newid->id;
> + newid->css = rootcss;
> + rootcss->id = newid;
> + return 0;
> +}
> +
> + * css_lookup - lookup css by id
> + * @id: the id of cgroup to be looked up
> + *
> + * Returns pointer to css if there is valid css with id, NULL if not.
> + * Should be called under rcu_read_lock()
> + */
> +
> +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
> +{
> + struct cgroup_subsys_state *css = NULL;
> + struct css_id *cssid = NULL;
> +
> + BUG_ON(!ss->use_id);
> + rcu_read_lock();
Why do we need an additional rcu_read_lock() here? Since we've
required that the caller be under rcu_read_lock()?
> + if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
> + ret = rcu_dereference(tmp->css);
> + /* Sanity check and check hierarchy */
> + if (ret && !css_is_removed(ret))
> + break;
Is there much point checking for css_is_removed here? The caller will
have to check it anyway since we're not synchronized against cgroup
removal.
> + }
> + tmpid = tmpid + 1;
Comment here?
Paul
* Re: [PATCH 7/9] cgroup: Support CSS ID
2008-12-16 10:24 ` Paul Menage
@ 2008-12-16 11:09 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-16 11:09 UTC (permalink / raw)
To: Paul Menage; +Cc: KAMEZAWA Hiroyuki, linux-mm, balbir, nishimura, lizf
Paul Menage said:
> On Tue, Dec 16, 2008 at 1:19 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> Patch for Per-CSS ID and private hierarchy code.
>>
>> This patch tries to assign a ID to each css. Attach unique ID to each
>> css and provides following functions.
>>
>> - css_lookup(subsys, id)
>> returns struct cgroup of id.
>> - css_get_next(subsys, id, rootid, depth, foundid)
>> returns the next cgroup under "root" by scanning bitmap (not by
>> tree-walk)
>
> Basic approach looks great - but there are a lot of typos in comments.
>
thanks, and sorry :(
>>
>> When cgrou_subsys->use_id is set, id field and bitmap for css is
>> maintained.
>
> When cgroup_subsys.use_id is set, an id is maintained for each css
> (via an idr bitmap)
>
will do
>> kernel/cgroup.c just parepare
>
> The cgroups framework only prepares:
>
will do
>> - css_id of root css for subsys
>> - alloc/free id functions.
>> So, each subsys should allocate ID in attach() callback if necessary.
>>
>> There is several reasons to develop this.
>> - Saving space .... For example, memcg's swap_cgroup is array of
>> pointers to cgroup. But it is not necessary to be very fast.
>> By replacing pointers(8bytes per ent) to ID (2byes per ent), we
>> can
>> reduce much amount of memory usage.
>>
>> - Scanning without lock.
>> CSS_ID provides "scan id under this ROOT" function. By this,
>> scanning
>> css under root can be written without locks.
>> ex)
>> do {
>> rcu_read_lock();
>> next = cgroup_get_next(subsys, id, root, &found);
>> /* check sanity of next here */
>> css_tryget();
>> rcu_read_unlock();
>> id = found + 1
>> } while(...)
>>
>> Characteristics:
>> - Each css has unique ID under subsys.
>> - Lifetime of ID is controlled by subsys.
>> - css ID contains "ID" and "Depth in hierarchy" and stack of
>> hierarchy
>> - Allowed ID is 1-65535, ID 0 is UNUSED ID.
>>
>> + /*
>> + * set 1 if subsys uses ID. ID is not available before
>> cgroup_init()
>
> Make this a bool rather than an int?
>
ok, will use bool.
>> + * (not available in early_init time.
>> + */
>> + int use_id;
>> #define MAX_CGROUP_TYPE_NAMELEN 32
>> const char *name;
>>
>
>> + * CSS ID is a ID for all css struct under subsys. Only works when
>> + * cgroup_subsys->use_id != 0. It can be used for look up and scanning
>> + * Cgroup ID is assined at cgroup allocation (create) and removed
>
> assined -> assigned
>
will fix
>> + * when refcnt to ID goes down to 0. Refcnt is inremented when subsys
>> want to
>> + * avoid reuse of ID for persistent objects.
>
> Although the CSS ID is RCU-safe, the subsystem may increment its
> refcount when it wishes to avoid reuse of that ID for a different CSS
> while it holds the reference outside of an RCU section.
>
>> In usual, refcnt to ID will be 0
>> + * when cgroup is removed.
>
> In the normal case, the refcount to the ID will be 0 when the cgroup is
> removed.
>
will fix.
>> + *
>> + * Note: At using ID, max depth of the hierarchy is determined by
>
> When using ID
>
>> + * cgroup_subsys->max_id_depth.
>> + */
>
> Is this comment stale? There's no cgroup_subsys.max_id_depth in this
> patch.
>
Ah, stale...
>> +
>> +/* called at create() */
>
> If the subsystem has specified use_id=true, is there any reason not to
> automatically allocate the ID on its behalf?
>
Hmm. Because "free" is called by the subsys, I moved the "create" calls
to the subsys as well. (free doesn't have to be called at destroy().)
>> -
>> +#include <linux/idr.h>
>
> This is already included in cgroup.h
>
Ah, thanks. will check again.
>> + * The cgroup to whiech this ID points. If cgroup is removed,
>
> "to which"
>
will fix..
> Mention RCU-safety of the cgroup pointer?
>
ok, will mention that.
>> + */
>> + unsigned short stack[0]; /* Length of this field is defined by
>> depth */
>
> /* Array of length (depth+1) */
>
ok
>> +int css_is_ancestor(struct cgroup_subsys_state *css,
>> + struct cgroup_subsys_state *root)
>> +{
>> + struct css_id *id = css->id;
>> + struct css_id *ans = root->id;
>
> It might be clearer to name the css pointers "child" and "root" and
> the id pointers "child_id" and "root_id".
>
ok. will change.
>> +static int __get_and_prepare_newid(struct cgroup_subsys *ss,
>> + int depth, struct css_id **ret)
>> +{
>> + struct css_id *newid;
>> + int myid, error, size;
>> +
>> + BUG_ON(!ss->use_id);
>> +
>> + size = sizeof(struct css_id) + sizeof(unsigned short) * (depth +
>> 1);
>> + newid = kzalloc(size, GFP_KERNEL);
>> + if (!newid)
>> + return -ENOMEM;
>> + /* get id */
>> + if (unlikely(!idr_pre_get(&ss->idr, GFP_KERNEL))) {
>> + error = -ENOMEM;
>> + goto err_out;
>> + }
>
> Is this safe? If the only place that we allocated ids was in
> cgroup_create() then it should be fine since allocation is
> synchronized. But if the subsystem can allocate at other times as
> well, then theoretically two threads could get past the idr_pre_get()
> stage and one of them could exhaust the pre-allocated objects.
>
Maybe you're right. I'll fix this, or move "create" into cgroup.c
rather than leaving it to each subsys.
>> + spin_lock(&ss->id_lock);
>> + /* Don't use 0 */
>> + error = idr_get_new_above(&ss->idr, newid, 1, &myid);
>> + spin_unlock(&ss->id_lock);
>> +
>> + /* Returns error when there are no free spaces for new ID.*/
>> + if (error) {
>> + error = -ENOSPC;
>> + goto err_out;
>> + }
>> +
>> + newid->id = myid;
>> + newid->depth = depth;
>> + *ret = newid;
>> + return 0;
>> +err_out:
>> + kfree(newid);
>> + return error;
>> +
>> +}
>> +
>> +
>> +static int __init cgroup_subsys_init_idr(struct cgroup_subsys *ss)
>> +{
>> + struct css_id *newid;
>> + struct cgroup_subsys_state *rootcss;
>> + int err = -ENOMEM;
>> +
>> + spin_lock_init(&ss->id_lock);
>> + idr_init(&ss->idr);
>> +
>> + rootcss = init_css_set.subsys[ss->subsys_id];
>> + err = __get_and_prepare_newid(ss, 0, &newid);
>> + if (err)
>> + return err;
>> +
>> + newid->stack[0] = newid->id;
>> + newid->css = rootcss;
>> + rootcss->id = newid;
>> + return 0;
>> +}
>> +
>
>> + * css_lookup - lookup css by id
>> + * @id: the id of cgroup to be looked up
>> + *
>> + * Returns pointer to css if there is valid css with id, NULL if not.
>> + * Should be called under rcu_read_lock()
>> + */
>> +
>> +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int
>> id)
>> +{
>> + struct cgroup_subsys_state *css = NULL;
>> + struct css_id *cssid = NULL;
>> +
>> + BUG_ON(!ss->use_id);
>> + rcu_read_lock();
>
> Why do we need an additional rcu_read_lock() here? Since we've
> required that the caller be under rcu_read_lock()?
>
Just because I couldn't find a macro to check something like
==
BUG_ON(rcu_read_lock_is_not_held)
==
I'll look at the RCU code again.
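In any case, rcu_read_lock() nests, so the inner section inside
css_lookup() is at worst redundant, never wrong. A caller-side sketch
(do_work() is hypothetical):

	rcu_read_lock();		/* caller's section */
	css = css_lookup(ss, id);	/* nests its own rcu_read_lock() */
	if (css && css_tryget(css)) {
		do_work(css);
		css_put(css);
	}
	rcu_read_unlock();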
>> + if (tmp->depth >= depth && tmp->stack[depth] == rootid)
>> {
>> + ret = rcu_dereference(tmp->css);
>> + /* Sanity check and check hierarchy */
>> + if (ret && !css_is_removed(ret))
>> + break;
>
> Is there much point checking for css_is_removed here? The caller will
> have to check it anyway since we're not synchronized against cgroup
> removal.
>
Ok, will remove this.
>> + }
>> + tmpid = tmpid + 1;
>
> Comment here?
>
will do.
Thank you for the review!
-Kame
* Re: [PATCH 1/9] fix error path of mem_cgroup_create and refnct handling
2008-12-16 9:10 ` [PATCH 1/9] fix error path of mem_cgroup_create and refnct handling KAMEZAWA Hiroyuki
@ 2008-12-17 2:26 ` Daisuke Nishimura
0 siblings, 0 replies; 15+ messages in thread
From: Daisuke Nishimura @ 2008-12-17 2:26 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, lizf, nishimura
On Tue, 16 Dec 2008 18:10:41 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> 1. Fix double-free BUG in error route of mem_cgroup_create().
> mem_cgroup_free() itself frees per-zone-info.
> 2. Making refcnt of memcg simple.
> Add 1 refcnt at creation and call free when refcnt goes down to 0.
>
I like this change.
> Singed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Thanks,
Daisuke Nishimura.
> ---
> Index: mmotm-2.6.28-Dec15/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28-Dec15.orig/mm/memcontrol.c
> +++ mmotm-2.6.28-Dec15/mm/memcontrol.c
> @@ -2087,14 +2087,10 @@ static struct mem_cgroup *mem_cgroup_all
> * Removal of cgroup itself succeeds regardless of refs from swap.
> */
>
> -static void mem_cgroup_free(struct mem_cgroup *mem)
> +static void __mem_cgroup_free(struct mem_cgroup *mem)
> {
> int node;
>
> - if (atomic_read(&mem->refcnt) > 0)
> - return;
> -
> -
> for_each_node_state(node, N_POSSIBLE)
> free_mem_cgroup_per_zone_info(mem, node);
>
> @@ -2111,11 +2107,8 @@ static void mem_cgroup_get(struct mem_cg
>
> static void mem_cgroup_put(struct mem_cgroup *mem)
> {
> - if (atomic_dec_and_test(&mem->refcnt)) {
> - if (!mem->obsolete)
> - return;
> - mem_cgroup_free(mem);
> - }
> + if (atomic_dec_and_test(&mem->refcnt))
> + __mem_cgroup_free(mem);
> }
>
>
> @@ -2165,12 +2158,10 @@ mem_cgroup_create(struct cgroup_subsys *
>
> if (parent)
> mem->swappiness = get_swappiness(parent);
> -
> + atomic_set(&mem->refcnt, 1);
> return &mem->css;
> free_out:
> - for_each_node_state(node, N_POSSIBLE)
> - free_mem_cgroup_per_zone_info(mem, node);
> - mem_cgroup_free(mem);
> + __mem_cgroup_free(mem);
> return ERR_PTR(-ENOMEM);
> }
>
> @@ -2185,7 +2176,7 @@ static void mem_cgroup_pre_destroy(struc
> static void mem_cgroup_destroy(struct cgroup_subsys *ss,
> struct cgroup *cont)
> {
> - mem_cgroup_free(mem_cgroup_from_cont(cont));
> + mem_cgroup_put(mem_cgroup_from_cont(cont));
> }
>
> static int mem_cgroup_populate(struct cgroup_subsys *ss,
>
* Re: [PATCH 6/9] memcg: use css_tryget()
2008-12-16 9:17 ` [PATCH 6/9] memcg: use css_tryget() KAMEZAWA Hiroyuki
@ 2008-12-18 1:50 ` Daisuke Nishimura
2008-12-18 2:03 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 15+ messages in thread
From: Daisuke Nishimura @ 2008-12-18 1:50 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, lizf, nishimura
On Tue, 16 Dec 2008 18:17:39 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> From:KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Use css_tryget() in memcg.
>
> css_tryget() is newly added; with it we can check whether a css is
> alive and take a refcnt on it in a safe way.
> ("alive" here means "rmdir/destroy" has not been called.)
>
> This patch replaces css_get() with css_tryget() in places where I
> cannot explain why css_get() is safe, and removes the memcg->obsolete
> flag.
>
> Changelog (v0) -> (v1):
> - fixed css_ref leak bug at swap-in.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
(snip)
> +/*
> + * While swap-in, try_charge -> commit or cancel, the page is locked.
> + * And when try_charge() successfully returns, one refcnt to memcg without
> + * struct page_cgroup is acquired. This refcnt will be consumed by
> + * "commit()" or removed by "cancel()"
> + */
> int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> struct page *page,
> gfp_t mask, struct mem_cgroup **ptr)
> {
> struct mem_cgroup *mem;
> swp_entry_t ent;
> + int ret;
>
> if (mem_cgroup_disabled())
> return 0;
> @@ -1089,10 +1115,15 @@ int mem_cgroup_try_charge_swapin(struct
> ent.val = page_private(page);
>
> mem = lookup_swap_cgroup(ent);
> - if (!mem || mem->obsolete)
> + if (!mem)
> + goto charge_cur_mm;
> + if (!css_tryget(&mem->css))
> goto charge_cur_mm;
I hadn't noticed the bug here that existed in the RFC version.
Actually, I found a problem at rmdir (couldn't remove the directory
because of a refcnt leak) while testing the RFC version, and had been
digging into it. I've confirmed it is fixed in this version.
This version looks good to me, and I think this patch is definitely
needed to remove the buggy "obsolete" flag.
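For reference, the charge protocol described in the quoted comment
would be used from the swap-in path roughly as below. A sketch: the
mapping step is hypothetical, and the commit/cancel entry points are
assumed to be named as in the comment:

	struct mem_cgroup *ptr = NULL;
	int ret;

	ret = mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr);
	if (ret)
		return ret;	/* nothing charged, no css ref held */
	ret = map_page(page);	/* map_page() is hypothetical */
	if (ret)
		mem_cgroup_cancel_charge_swapin(ptr);	/* drop the ref */
	else
		mem_cgroup_commit_charge_swapin(page, ptr); /* consume it */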
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Thanks,
Daisuke Nishimura.
* Re: [PATCH 6/9] memcg: use css_tryget()
2008-12-18 1:50 ` Daisuke Nishimura
@ 2008-12-18 2:03 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-18 2:03 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm, menage, balbir, lizf
On Thu, 18 Dec 2008 10:50:27 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Tue, 16 Dec 2008 18:17:39 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > From:KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Use css_tryget() in memcg.
> >
> > css_tryget() is newly added; with it we can check whether a css is
> > alive and take a refcnt on it in a safe way.
> > ("alive" here means "rmdir/destroy" has not been called.)
> >
> > This patch replaces css_get() with css_tryget() in places where I
> > cannot explain why css_get() is safe, and removes the
> > memcg->obsolete flag.
> >
> > Changelog (v0) -> (v1):
> > - fixed css_ref leak bug at swap-in.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> (snip)
>
> > +/*
> > + * While swap-in, try_charge -> commit or cancel, the page is locked.
> > + * And when try_charge() successfully returns, one refcnt to memcg without
> > + * struct page_cgroup is acquired. This refcnt will be consumed by
> > + * "commit()" or removed by "cancel()"
> > + */
> > int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> > struct page *page,
> > gfp_t mask, struct mem_cgroup **ptr)
> > {
> > struct mem_cgroup *mem;
> > swp_entry_t ent;
> > + int ret;
> >
> > if (mem_cgroup_disabled())
> > return 0;
> > @@ -1089,10 +1115,15 @@ int mem_cgroup_try_charge_swapin(struct
> > ent.val = page_private(page);
> >
> > mem = lookup_swap_cgroup(ent);
> > - if (!mem || mem->obsolete)
> > + if (!mem)
> > + goto charge_cur_mm;
> > + if (!css_tryget(&mem->css))
> > goto charge_cur_mm;
> I hadn't noticed the bug here that existed in the RFC version.
>
> Actually, I found a problem at rmdir (couldn't remove the directory
> because of a refcnt leak) while testing the RFC version, and had been
> digging into it. I've confirmed it is fixed in this version.
>
> This version looks good to me, and I think this patch is definitely
> needed to remove the buggy "obsolete" flag.
>
> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
>
ya, thanks again :)
-Kame
>
> Thanks,
> Daisuke Nishimura.
>
Thread overview: 15+ messages
2008-12-16 9:09 [PATCH] memcg updates (2008/12/16) KAMEZAWA Hiroyuki
2008-12-16 9:10 ` [PATCH 1/9] fix error path of mem_cgroup_create and refnct handling KAMEZAWA Hiroyuki
2008-12-17 2:26 ` Daisuke Nishimura
2008-12-16 9:12 ` [PATCH 2/9] cgroup hierarchy mutex KAMEZAWA Hiroyuki
2008-12-16 9:13 ` [PATCH 3/9] use hierarchy mutex in memcg KAMEZAWA Hiroyuki
2008-12-16 9:14 ` [PATCH 4/9] cgroup add css_tryget() KAMEZAWA Hiroyuki
2008-12-16 9:15 ` [PATCH 5/9] Add css_is_removed KAMEZAWA Hiroyuki
2008-12-16 9:17 ` [PATCH 6/9] memcg: use css_tryget() KAMEZAWA Hiroyuki
2008-12-18 1:50 ` Daisuke Nishimura
2008-12-18 2:03 ` KAMEZAWA Hiroyuki
2008-12-16 9:19 ` [PATCH 7/9] cgroup: Support CSS ID KAMEZAWA Hiroyuki
2008-12-16 10:24 ` Paul Menage
2008-12-16 11:09 ` KAMEZAWA Hiroyuki
2008-12-16 9:21 ` [PATCH 8/9] memcg: hierarchical memory reclaim by round-robin KAMEZAWA Hiroyuki
2008-12-16 9:22 ` [PATCH 9/9] memcg : fix OOM killer under hierarchy KAMEZAWA Hiroyuki