* [PATCH 0/3] cgroup id and scanning without cgroup_lock
@ 2008-12-01 5:59 KAMEZAWA Hiroyuki
2008-12-01 6:02 ` [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt KAMEZAWA Hiroyuki
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-01 5:59 UTC (permalink / raw)
To: linux-mm; +Cc: lizf, menage, balbir, nishimura, linux-kernel, akpm
This is a series of patches against mmotm-Nov29.
(It passed a simple test.)
Now, memcg supports hierarchy. But walking the cgroup tree in an intelligent way
while locking/unlocking cgroup_mutex seems to cause more trouble than expected.
And I want to reduce the memory usage of swap_cgroup, which currently uses an
array of pointers.
This patch series provides
- a cgroup_id per cgroup object.
- lookup of struct cgroup by cgroup_id.
- scanning of all cgroups under a subtree by cgroup_id, without taking the
  mutex (a rough usage sketch follows this list).
- a css_tryget() function.
- a fix to the semantics of notify_on_release. (I think this is a valid fix.)
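As a rough sketch, the pieces are meant to combine like this (the function
names are the ones introduced by the patches below; nextid, rootid, depth and
foundid are just illustrative locals, and memcg is only an example user):

        rcu_read_lock();
        cgrp = cgroup_get_next(nextid, rootid, depth, &foundid);
        if (cgrp) {
                mem = mem_cgroup_from_cont(cgrp);
                /* pin it; this fails if the cgroup is already being destroyed */
                if (!css_tryget(&mem->css))
                        mem = NULL;
        }
        rcu_read_unlock();
        /* ... use "mem" if it is non-NULL, then css_put(&mem->css) ... */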
Many changes since v1. (But I wonder whether some more work may be needed.)
BTW, I know a fair number of patches against memcg have been posted recently.
If necessary, I'll prepare a weekly update queue again (Wednesday) and
pick up all the patches posted to linux-mm into my queue.
Thanks,
-Kame
* [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt
2008-12-01 5:59 [PATCH 0/3] cgroup id and scanning without cgroup_lock KAMEZAWA Hiroyuki
@ 2008-12-01 6:02 ` KAMEZAWA Hiroyuki
2008-12-02 6:15 ` Li Zefan
2008-12-03 3:44 ` Li Zefan
2008-12-01 6:03 ` [PATCH 2/3] cgroup: cgroup ID and scanning under RCU KAMEZAWA Hiroyuki
` (2 subsequent siblings)
3 siblings, 2 replies; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-01 6:02 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, lizf, menage, balbir, nishimura, linux-kernel, akpm
Now, the final check of refcnt is done after pre_destroy(), so rmdir() can fail
after pre_destroy().
memcg sets mem->obsolete to 1 at pre_destroy(), and this is buggy.
Several ways to fix this can be considered. This is one idea.
Fortunately, the only user of css_get()/css_put() is memcg, for now.
And the assumptions made about css->refcnt in cgroup.c seem a bit complicated.
I'd like to reuse that refcnt.
This patch changes the usage and behavior of css->refcnt as follows:
- css->refcnt is initialized to 1.
- after pre_destroy() and before destroy(), try to drop css->refcnt to 0.
- css_tryget() is added. It succeeds only when css->refcnt > 0.
- css_under_removal() is added. It checks css->refcnt == 0, i.e. whether
  this cgroup is under destroy() or not.
- css_put() is changed so that it no longer calls notify_on_release().
  According to the documentation, notify_on_release() is called when there are
  no tasks/children in a cgroup. In the implementation, notify_on_release is
  not called if css->refcnt > 0.
  This is problematic: memcg holds css->refcnt for each page even when
  there are no tasks, so the release handler would never be called.
  But now rmdir()/pre_destroy() of memcg works well, and checking
  css->refcnt is not (and shouldn't be) necessary for notification.
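For illustration, the calling pattern a subsystem is expected to follow under
the new semantics looks roughly like this (a sketch mirroring the memcontrol.c
hunks below, not additional code in this patch):

        mem = lookup_swap_cgroup(ent);
        /* the memcg may already be on its way out; try to pin it first */
        if (mem && css_tryget(&mem->css)) {
                /* refcnt > 0 is guaranteed here, so destroy() cannot run yet */
                ... use mem ...
                css_put(&mem->css);     /* drop the extra count */
        }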
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/cgroup.h | 21 +++++++++++++++++--
kernel/cgroup.c | 53 +++++++++++++++++++++++++++++++++++--------------
mm/memcontrol.c | 40 +++++++++++++++++++++++++-----------
3 files changed, 85 insertions(+), 29 deletions(-)
Index: mmotm-2.6.28-Nov29/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Nov29.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Nov29/include/linux/cgroup.h
@@ -54,7 +54,9 @@ struct cgroup_subsys_state {
/* State maintained by the cgroup system to allow
* subsystems to be "busy". Should be accessed via css_get()
- * and css_put() */
+ * and css_put(). If this value is 0, css is now under removal and
+ * destroy() will be called soon. (and there is no roll-back.)
+ */
atomic_t refcnt;
@@ -86,7 +88,22 @@ extern void __css_put(struct cgroup_subs
static inline void css_put(struct cgroup_subsys_state *css)
{
if (!test_bit(CSS_ROOT, &css->flags))
- __css_put(css);
+ atomic_dec(&css->refcnt);
+}
+
+/* returns not-zero if success */
+static inline int css_tryget(struct cgroup_subsys_state *css)
+{
+ if (!test_bit(CSS_ROOT, &css->flags))
+ return atomic_inc_not_zero(&css->refcnt);
+ return 1;
+}
+
+static inline bool css_under_removal(struct cgroup_subsys_state *css)
+{
+ if (test_bit(CSS_ROOT, &css->flags))
+ return false;
+ return atomic_read(&css->refcnt) == 0;
}
/* bits in struct cgroup flags field */
Index: mmotm-2.6.28-Nov29/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Nov29.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Nov29/kernel/cgroup.c
@@ -589,6 +589,32 @@ static void cgroup_call_pre_destroy(stru
return;
}
+/*
+ * Try to set all subsys's refcnt to be 0.
+ * css->refcnt==0 means this subsys will be destroy()'d.
+ */
+static bool cgroup_set_subsys_removed(struct cgroup *cgrp)
+{
+ struct cgroup_subsys *ss;
+ struct cgroup_subsys_state *css, *tmp;
+
+ for_each_subsys(cgrp->root, ss) {
+ css = cgrp->subsys[ss->subsys_id];
+ if (!atomic_dec_and_test(&css->refcnt))
+ goto rollback;
+ }
+ return true;
+rollback:
+ for_each_subsys(cgrp->root, ss) {
+ tmp = cgrp->subsys[ss->subsys_id];
+ atomic_inc(&tmp->refcnt);
+ if (tmp == css)
+ break;
+ }
+ return false;
+}
+
+
static void cgroup_diput(struct dentry *dentry, struct inode *inode)
{
/* is dentry a directory ? if so, kfree() associated cgroup */
@@ -2310,7 +2336,7 @@ static void init_cgroup_css(struct cgrou
struct cgroup *cgrp)
{
css->cgroup = cgrp;
- atomic_set(&css->refcnt, 0);
+ atomic_set(&css->refcnt, 1);
css->flags = 0;
if (cgrp == dummytop)
set_bit(CSS_ROOT, &css->flags);
@@ -2438,7 +2464,7 @@ static int cgroup_has_css_refs(struct cg
* matter, since it can only happen if the cgroup
* has been deleted and hence no longer needs the
* release agent to be called anyway. */
- if (css && atomic_read(&css->refcnt))
+ if (css && (atomic_read(&css->refcnt) > 1))
return 1;
}
return 0;
@@ -2465,7 +2491,8 @@ static int cgroup_rmdir(struct inode *un
/*
* Call pre_destroy handlers of subsys. Notify subsystems
- * that rmdir() request comes.
+ * that rmdir() request comes. pre_destroy() is expected to drop all
+ * extra refcnt to css. (css->refcnt == 1)
*/
cgroup_call_pre_destroy(cgrp);
@@ -2479,8 +2506,15 @@ static int cgroup_rmdir(struct inode *un
return -EBUSY;
}
+ /* last check ! */
+ if (!cgroup_set_subsys_removed(cgrp)) {
+ mutex_unlock(&cgroup_mutex);
+ return -EBUSY;
+ }
+
spin_lock(&release_list_lock);
set_bit(CGRP_REMOVED, &cgrp->flags);
+
if (!list_empty(&cgrp->release_list))
list_del(&cgrp->release_list);
spin_unlock(&release_list_lock);
@@ -3003,7 +3037,7 @@ static void check_for_release(struct cgr
/* All of these checks rely on RCU to keep the cgroup
* structure alive */
if (cgroup_is_releasable(cgrp) && !atomic_read(&cgrp->count)
- && list_empty(&cgrp->children) && !cgroup_has_css_refs(cgrp)) {
+ && list_empty(&cgrp->children)) {
/* Control Group is currently removeable. If it's not
* already queued for a userspace notification, queue
* it now */
@@ -3020,17 +3054,6 @@ static void check_for_release(struct cgr
}
}
-void __css_put(struct cgroup_subsys_state *css)
-{
- struct cgroup *cgrp = css->cgroup;
- rcu_read_lock();
- if (atomic_dec_and_test(&css->refcnt) && notify_on_release(cgrp)) {
- set_bit(CGRP_RELEASABLE, &cgrp->flags);
- check_for_release(cgrp);
- }
- rcu_read_unlock();
-}
-
/*
* Notify userspace when a cgroup is released, by running the
* configured release agent with the name of the cgroup (path
Index: mmotm-2.6.28-Nov29/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov29.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov29/mm/memcontrol.c
@@ -154,7 +154,6 @@ struct mem_cgroup {
*/
bool use_hierarchy;
unsigned long last_oom_jiffies;
- int obsolete;
atomic_t refcnt;
/*
* statistics. This must be placed at the end of memcg.
@@ -540,8 +539,14 @@ mem_cgroup_get_first_node(struct mem_cgr
{
struct cgroup *cgroup;
struct mem_cgroup *ret;
- bool obsolete = (root_mem->last_scanned_child &&
- root_mem->last_scanned_child->obsolete);
+ struct mem_cgroup *last_scan = root_mem->last_scanned_child;
+ bool obsolete = false;
+
+ if (last_scan) {
+ if (css_under_removal(&last_scan->css))
+ obsolete = true;
+ } else
+ obsolete = true;
/*
* Scan all children under the mem_cgroup mem
@@ -598,7 +603,7 @@ static int mem_cgroup_hierarchical_recla
next_mem = mem_cgroup_get_first_node(root_mem);
while (next_mem != root_mem) {
- if (next_mem->obsolete) {
+ if (css_under_removal(&next_mem->css)) {
mem_cgroup_put(next_mem);
cgroup_lock();
next_mem = mem_cgroup_get_first_node(root_mem);
@@ -985,6 +990,7 @@ int mem_cgroup_try_charge_swapin(struct
{
struct mem_cgroup *mem;
swp_entry_t ent;
+ int ret;
if (mem_cgroup_disabled())
return 0;
@@ -1003,10 +1009,18 @@ int mem_cgroup_try_charge_swapin(struct
ent.val = page_private(page);
mem = lookup_swap_cgroup(ent);
- if (!mem || mem->obsolete)
+ /*
+ * Because we can't assume "mem" is alive now, use tryget() and
+ * drop extra count later
+ */
+ if (!mem || !css_tryget(&mem->css))
goto charge_cur_mm;
*ptr = mem;
- return __mem_cgroup_try_charge(NULL, mask, ptr, true);
+ ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
+ /* drop extra count */
+ css_put(&mem->css);
+
+ return ret;
charge_cur_mm:
if (unlikely(!mm))
mm = &init_mm;
@@ -1037,14 +1051,16 @@ int mem_cgroup_cache_charge_swapin(struc
ent.val = page_private(page);
if (do_swap_account) {
mem = lookup_swap_cgroup(ent);
- if (mem && mem->obsolete)
+ if (mem && !css_tryget(&mem->css))
mem = NULL;
if (mem)
mm = NULL;
}
ret = mem_cgroup_charge_common(page, mm, mask,
MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
-
+ /* drop extra ref */
+ if (mem)
+ css_put(&mem->css);
if (!ret && do_swap_account) {
/* avoid double counting */
mem = swap_cgroup_record(ent, NULL);
@@ -1886,8 +1902,8 @@ static struct mem_cgroup *mem_cgroup_all
* the number of reference from swap_cgroup and free mem_cgroup when
* it goes down to 0.
*
- * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and
- * entry which points to this memcg will be ignore at swapin.
+ * When mem_cgroup is destroyed, css_under_removal() is true and entry which
+ * points to this memcg will be ignore at swapin.
*
* Removal of cgroup itself succeeds regardless of refs from swap.
*/
@@ -1917,7 +1933,7 @@ static void mem_cgroup_get(struct mem_cg
static void mem_cgroup_put(struct mem_cgroup *mem)
{
if (atomic_dec_and_test(&mem->refcnt)) {
- if (!mem->obsolete)
+ if (!css_under_removal(&mem->css))
return;
mem_cgroup_free(mem);
}
@@ -1980,7 +1996,7 @@ static void mem_cgroup_pre_destroy(struc
struct cgroup *cont)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
- mem->obsolete = 1;
+ /* dentry's mutex makes this safe. */
mem_cgroup_force_empty(mem, false);
}
* [PATCH 2/3] cgroup: cgroup ID and scanning under RCU.
2008-12-01 5:59 [PATCH 0/3] cgroup id and scanning without cgroup_lock KAMEZAWA Hiroyuki
2008-12-01 6:02 ` [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt KAMEZAWA Hiroyuki
@ 2008-12-01 6:03 ` KAMEZAWA Hiroyuki
2008-12-01 6:04 ` [PATCH 3/3] memcg: change hierarchy management to use scan by cgroup ID KAMEZAWA Hiroyuki
2008-12-01 6:24 ` [PATCH 0/3] cgroup id and scanning without cgroup_lock Balbir Singh
3 siblings, 0 replies; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-01 6:03 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, lizf, menage, balbir, nishimura, linux-kernel, akpm
A patch for cgroup ID and hierarchy code.
This patch attaches a unique ID to each cgroup and provides the following
functions.
- cgroup_lookup(id)
  returns the struct cgroup with that id.
- cgroup_get_next(id, rootid, depth, foundid)
  returns the next cgroup under "root" by scanning a bitmap (not by tree-walk).
- cgroup_id_put/getref()
  used when a subsystem wants to prevent reuse of an ID.
There are several reasons to develop this.
- While trying to implement hierarchy support in the memory cgroup, we have to
  implement "walk under hierarchy" code.
  Right now it consists of cgroup_lock and tree up/down code. Because the
  memory cgroup has to do hierarchy walks with intelligent processing in other
  places as well, we'd like to reuse the "walk" code.
  But taking cgroup_lock while walking the tree can cause deadlocks, so an
  easier way is helpful.
- SwapCgroup uses an array of pointers to record the owner of each swap entry.
  With IDs, we can reduce this to a "short" or "int" (sketched just after this
  list). This means an ID is useful for cutting the space consumed by pointers
  when the access cost is not a problem.
  (I hear bio-cgroup will use the same kind of ...)
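Roughly, the swap_cgroup change this enables would be turning something like

        struct mem_cgroup *map[NR_ENTRIES];     /* one pointer per swap entry */

into

        unsigned short map[NR_ENTRIES];         /* one cgroup ID per swap entry */

where map[] records the owning cgroup's ID instead of a pointer. (These are
illustrative declarations only, not the actual swap_cgroup layout.)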
Example) OOM-Killer under hierarchy.
        do {
                rcu_read_lock();
                next = cgroup_get_next(id, rootid, depth, &foundid);
                /* check sanity of next here and pin it */
                if (next && !css_tryget(...))
                        next = NULL;
                rcu_read_unlock();
                if (!next)
                        break;
                cgroup_scan_tasks(...); /* e.g. run select_bad_process */
                /* record the score here... */
                id = foundid + 1;
        } while (1);
Characteristics:
- Each cgroup gets a new ID when it is created.
- A cgroup ID contains an "ID", a "depth in the tree" and a hierarchy code.
- The hierarchy code is an array of the IDs of the cgroup's ancestors
  (a made-up example follows this list).
- ID 0 is reserved as the unused ID.
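As a concrete (made-up) example of the hierarchy code: suppose the mount's
top_cgroup got ID 1 and the nested children /A and /A/B got IDs 4 and 7.
Then:

        top_cgroup : depth=0  hierarchy_code={1}
        /A         : depth=1  hierarchy_code={1,4}
        /A/B       : depth=2  hierarchy_code={1,4,7}

Scanning under /A passes rootid=4 and depth=1 to cgroup_get_next(); both /A
and /A/B match because their hierarchy_code[1] == 4, while the top cgroup and
any sibling subtree of /A are skipped.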
Consideration:
- I'd like to use a "short" for cgroup_id to save space...
- Is MAX_DEPTH too small? (Making this depend on a boot option is easy.)
TODO:
- Documentation.
Changelog (v1) -> (v2):
- Design change: expose only the ID (an integer) outside of cgroup.c.
- moved the cgroup ID definition from include/ to kernel/cgroup.c.
- struct cgroup_id is freed by RCU.
- changed the interface from a pointer to an "int".
- kill_sb() is handled.
- ID 0 is used as the unused ID.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/cgroup.h | 28 ++++-
include/linux/idr.h | 1
kernel/cgroup.c | 272 ++++++++++++++++++++++++++++++++++++++++++++++++-
lib/idr.c | 46 ++++++++
4 files changed, 342 insertions(+), 5 deletions(-)
Index: mmotm-2.6.28-Nov29/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Nov29.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Nov29/include/linux/cgroup.h
@@ -22,6 +22,7 @@ struct cgroupfs_root;
struct cgroup_subsys;
struct inode;
struct cgroup;
+struct cgroup_id;
extern int cgroup_init_early(void);
extern int cgroup_init(void);
@@ -63,6 +64,12 @@ struct cgroup_subsys_state {
unsigned long flags;
};
+/*
+ * Cgroup ID for *internal* identification and lookup. For user-land,"path"
+ * of cgroup works well.
+ */
+#define MAX_CGROUP_DEPTH (10)
+
/* bits in struct cgroup_subsys_state flags field */
enum {
CSS_ROOT, /* This CSS is the root of the subsystem */
@@ -162,6 +169,9 @@ struct cgroup {
int pids_use_count;
/* Length of the current tasks_pids array */
int pids_length;
+
+ /* Cgroup ID */
+ struct cgroup_id *id;
};
/* A css_set is a structure holding pointers to a set of
@@ -346,7 +356,6 @@ struct cgroup_subsys {
struct cgroup *cgrp);
void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
-
int subsys_id;
int active;
int disabled;
@@ -410,6 +419,23 @@ void cgroup_iter_end(struct cgroup *cgrp
int cgroup_scan_tasks(struct cgroup_scanner *scan);
int cgroup_attach_task(struct cgroup *, struct task_struct *);
+/*
+ * For supporting cgroup lookup and hierarchy management.
+ * Giving Flat view of cgroup hierarchy rather than tree.
+ */
+/* An interface for usual lookup */
+struct cgroup *cgroup_lookup(int id);
+/* get next cgroup under tree (for scan) */
+struct cgroup *
+cgroup_get_next(int id, int rootid, int depth, int *foundid);
+/* get id and depth of cgroup */
+int cgroup_id(struct cgroup *cgroup);
+int cgroup_depth(struct cgroup *cgroup);
+/* For delayed freeing of IDs */
+void cgroup_id_getref(int id);
+void cgroup_id_putref(int id);
+bool cgroup_id_is_obsolete(int id);
+
#else /* !CONFIG_CGROUPS */
static inline int cgroup_init_early(void) { return 0; }
Index: mmotm-2.6.28-Nov29/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Nov29.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Nov29/kernel/cgroup.c
@@ -46,7 +46,7 @@
#include <linux/cgroupstats.h>
#include <linux/hash.h>
#include <linux/namei.h>
-
+#include <linux/idr.h>
#include <asm/atomic.h>
static DEFINE_MUTEX(cgroup_mutex);
@@ -545,6 +545,253 @@ void cgroup_unlock(void)
}
/*
+ * CGROUP ID
+ */
+struct cgroup_id {
+ struct cgroup *myself;
+ unsigned int id;
+ unsigned int depth;
+ atomic_t refcnt;
+ struct rcu_head rcu_head;
+ unsigned int hierarchy_code[MAX_CGROUP_DEPTH];
+};
+
+void free_cgroupid_cb(struct rcu_head *head)
+{
+ struct cgroup_id *id;
+
+ id = container_of(head, struct cgroup_id, rcu_head);
+ kfree(id);
+}
+
+void free_cgroupid(struct cgroup_id *id)
+{
+ call_rcu(&id->rcu_head, free_cgroupid_cb);
+}
+
+/*
+ * Cgroup ID and lookup functions.
+ * cgid->myself pointer is safe under rcu_read_lock() because d_put() of
+ * cgroup, which finally frees cgroup pointer, uses rcu_synchronize().
+ */
+static DEFINE_IDR(cgroup_idr);
+DEFINE_SPINLOCK(cgroup_idr_lock);
+
+static int cgrouproot_setup_idr(struct cgroupfs_root *root)
+{
+ struct cgroup_id *newid;
+ int err = -ENOMEM;
+ int myid;
+
+ newid = kzalloc(sizeof(*newid), GFP_KERNEL);
+ if (!newid)
+ goto out;
+ if (!idr_pre_get(&cgroup_idr, GFP_KERNEL))
+ goto free_out;
+
+ spin_lock_irq(&cgroup_idr_lock);
+ err = idr_get_new_above(&cgroup_idr, newid, 1, &myid);
+ spin_unlock_irq(&cgroup_idr_lock);
+
+ /* This one is new idr....*/
+ BUG_ON(err);
+ newid->id = myid;
+ newid->depth = 0;
+ newid->hierarchy_code[0] = myid;
+ atomic_set(&newid->refcnt, 1);
+ rcu_assign_pointer(newid->myself, &root->top_cgroup);
+ root->top_cgroup.id = newid;
+ return 0;
+
+free_out:
+ kfree(newid);
+out:
+ return err;
+}
+
+/*
+ * should be called while "cgrp" is valid.
+ */
+int cgroup_id(struct cgroup *cgrp)
+{
+ if (cgrp->id)
+ return cgrp->id->id;
+ return 0;
+}
+
+int cgroup_depth(struct cgroup *cgrp)
+{
+ if (cgrp->id)
+ return cgrp->id->depth;
+ return 0;
+}
+
+static int cgroup_prepare_id(struct cgroup *parent, struct cgroup_id **id)
+{
+ struct cgroup_id *newid;
+ int myid, error;
+
+ /* check depth */
+ if (parent->id->depth + 1 >= MAX_CGROUP_DEPTH)
+ return -ENOSPC;
+ newid = kzalloc(sizeof(*newid), GFP_KERNEL);
+ if (!newid)
+ return -ENOMEM;
+ /* get id */
+ if (unlikely(!idr_pre_get(&cgroup_idr, GFP_KERNEL))) {
+ error = -ENOMEM;
+ goto err_out;
+ }
+ spin_lock_irq(&cgroup_idr_lock);
+ /* Don't use 0 */
+ error = idr_get_new_above(&cgroup_idr, newid, 1, &myid);
+ spin_unlock_irq(&cgroup_idr_lock);
+ if (error)
+ goto err_out;
+
+ newid->id = myid;
+ atomic_set(&newid->refcnt, 1);
+ *id = newid;
+ return 0;
+err_out:
+ kfree(newid);
+ return error;
+}
+
+
+static void cgroup_id_attach(struct cgroup_id *cgid,
+ struct cgroup *cg, struct cgroup *parent)
+{
+ struct cgroup_id *parent_id = parent->id;
+ int i;
+
+ cgid->depth = parent_id->depth + 1;
+ /* Inherit hierarchy code from parent */
+ for (i = 0; i < cgid->depth; i++) {
+ cgid->hierarchy_code[i] =
+ parent_id->hierarchy_code[i];
+ cgid->hierarchy_code[cgid->depth] = cgid->id;
+ }
+ rcu_assign_pointer(cgid->myself, cg);
+ cg->id = cgid;
+
+ return;
+}
+static void cgroup_id_put(int id)
+{
+ struct cgroup_id *cgid;
+ unsigned long flags;
+
+ rcu_read_lock();
+ cgid = idr_find(&cgroup_idr, id);
+ BUG_ON(!cgid);
+ if (atomic_dec_and_test(&cgid->refcnt)) {
+ spin_lock_irqsave(&cgroup_idr_lock, flags);
+ idr_remove(&cgroup_idr, cgid->id);
+ spin_unlock_irq(&cgroup_idr_lock);
+ free_cgroupid(cgid);
+ }
+ rcu_read_unlock();
+}
+
+static void cgroup_id_detach(struct cgroup *cg)
+{
+ rcu_assign_pointer(cg->id->myself, NULL);
+ cgroup_id_put(cg->id->id);
+}
+
+void cgroup_id_getref(int id)
+{
+ struct cgroup_id *cgid;
+
+ rcu_read_lock();
+ cgid = idr_find(&cgroup_idr, id);
+ if (cgid)
+ atomic_inc(&cgid->refcnt);
+ rcu_read_unlock();
+}
+
+void cgroup_id_putref(int id)
+{
+ cgroup_id_put(id);
+}
+/**
+ * cgroup_lookup - lookup cgroup by id
+ * @id: the id of cgroup to be looked up
+ *
+ * Returns pointer to cgroup if there is valid cgroup with id, NULL if not.
+ * Should be called under rcu_read_lock() or cgroup_lock.
+ * If subsys is not used, returns NULL.
+ */
+
+struct cgroup *cgroup_lookup(int id)
+{
+ struct cgroup *cgrp = NULL;
+ struct cgroup_id *cgid = NULL;
+
+ rcu_read_lock();
+ cgid = idr_find(&cgroup_idr, id);
+
+ if (unlikely(!cgid))
+ goto out;
+
+ cgrp = rcu_dereference(cgid->myself);
+ if (unlikely(!cgrp || cgroup_is_removed(cgrp)))
+ cgrp = NULL;
+out:
+ rcu_read_unlock();
+ return cgrp;
+}
+
+/**
+ * cgroup_get_next - lookup next cgroup under specified hierarchy.
+ * @id: current position of iteration.
+ * @rootid: search tree under this.
+ * @depth: depth of root id.
+ * @foundid: position of found object.
+ *
+ * Search next cgroup under the specified hierarchy. If "cur" is NULL,
+ * start from root cgroup. Called under rcu_read_lock() or cgroup_lock()
+ * is necessary (to access a found cgroup.).
+ * If subsys is not used, returns NULL. If used, it's guaranteed that there is
+ * a used cgroup ID (root).
+ */
+struct cgroup *
+cgroup_get_next(int id, int rootid, int depth, int *foundid)
+{
+ struct cgroup *ret = NULL;
+ struct cgroup_id *tmp;
+ int tmpid;
+ unsigned long flags;
+
+ rcu_read_lock();
+ tmpid = id;
+ while (1) {
+ /* scan next entry from bitmap(tree) */
+ spin_lock_irqsave(&cgroup_idr_lock, flags);
+ tmp = idr_get_next(&cgroup_idr, &tmpid);
+ spin_unlock_irqrestore(&cgroup_idr_lock, flags);
+
+ if (!tmp) {
+ ret = NULL;
+ break;
+ }
+
+ if (tmp->hierarchy_code[depth] == rootid) {
+ ret = rcu_dereference(tmp->myself);
+ /* Sanity check and check hierarchy */
+ if (ret && !cgroup_is_removed(ret))
+ break;
+ }
+ tmpid = tmpid + 1;
+ }
+
+ rcu_read_unlock();
+ *foundid = tmpid;
+ return ret;
+}
+
+/*
* A couple of forward declarations required, due to cyclic reference loop:
* cgroup_mkdir -> cgroup_create -> cgroup_populate_dir ->
* cgroup_add_file -> cgroup_create_file -> cgroup_dir_inode_operations
@@ -1039,6 +1286,13 @@ static int cgroup_get_sb(struct file_sys
mutex_unlock(&inode->i_mutex);
goto drop_new_super;
}
+ /* Setup Cgroup ID for this fs */
+ ret = cgrouproot_setup_idr(root);
+ if (ret) {
+ mutex_unlock(&cgroup_mutex);
+ mutex_unlock(&inode->i_mutex);
+ goto drop_new_super;
+ }
ret = rebind_subsystems(root, root->subsys_bits);
if (ret == -EBUSY) {
@@ -1125,9 +1379,10 @@ static void cgroup_kill_sb(struct super_
list_del(&root->root_list);
root_count--;
-
+ if (root->top_cgroup.id)
+ cgroup_id_detach(&root->top_cgroup);
mutex_unlock(&cgroup_mutex);
-
+ synchronize_rcu();
kfree(root);
kill_litter_super(sb);
}
@@ -2360,11 +2615,18 @@ static long cgroup_create(struct cgroup
int err = 0;
struct cgroup_subsys *ss;
struct super_block *sb = root->sb;
+ struct cgroup_id *cgid = NULL;
cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
if (!cgrp)
return -ENOMEM;
+ err = cgroup_prepare_id(parent, &cgid);
+ if (err) {
+ kfree(cgrp);
+ return err;
+ }
+
/* Grab a reference on the superblock so the hierarchy doesn't
* get deleted on unmount if there are child cgroups. This
* can be done outside cgroup_mutex, since the sb can't
@@ -2404,7 +2666,7 @@ static long cgroup_create(struct cgroup
err = cgroup_populate_dir(cgrp);
/* If err < 0, we have a half-filled directory - oh well ;) */
-
+ cgroup_id_attach(cgid, cgrp, parent);
mutex_unlock(&cgroup_mutex);
mutex_unlock(&cgrp->dentry->d_inode->i_mutex);
@@ -2512,6 +2774,8 @@ static int cgroup_rmdir(struct inode *un
return -EBUSY;
}
+ cgroup_id_detach(cgrp);
+
spin_lock(&release_list_lock);
set_bit(CGRP_REMOVED, &cgrp->flags);
Index: mmotm-2.6.28-Nov29/include/linux/idr.h
===================================================================
--- mmotm-2.6.28-Nov29.orig/include/linux/idr.h
+++ mmotm-2.6.28-Nov29/include/linux/idr.h
@@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
int idr_for_each(struct idr *idp,
int (*fn)(int id, void *p, void *data), void *data);
+void *idr_get_next(struct idr *idp, int *nextid);
void *idr_replace(struct idr *idp, void *ptr, int id);
void idr_remove(struct idr *idp, int id);
void idr_remove_all(struct idr *idp);
Index: mmotm-2.6.28-Nov29/lib/idr.c
===================================================================
--- mmotm-2.6.28-Nov29.orig/lib/idr.c
+++ mmotm-2.6.28-Nov29/lib/idr.c
@@ -573,6 +573,52 @@ int idr_for_each(struct idr *idp,
EXPORT_SYMBOL(idr_for_each);
/**
+ * idr_get_next - lookup next object of id to given id.
+ * @idp: idr handle
+ * @id: pointer to lookup key
+ *
+ * Returns pointer to registered object with id, which is next number to
+ * given id.
+ */
+
+void *idr_get_next(struct idr *idp, int *nextidp)
+{
+ struct idr_layer *p, *pa[MAX_LEVEL];
+ struct idr_layer **paa = &pa[0];
+ int id = *nextidp;
+ int n, max;
+
+ /* find first ent */
+ n = idp->layers * IDR_BITS;
+ max = 1 << n;
+ p = rcu_dereference(idp->top);
+ if (!p)
+ return NULL;
+
+ while (id < max) {
+ while (n > 0 && p) {
+ n -= IDR_BITS;
+ *paa++ = p;
+ p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
+ }
+
+ if (p) {
+ *nextidp = id;
+ return p;
+ }
+
+ id += 1 << n;
+ while (n < fls(id)) {
+ n += IDR_BITS;
+ p = *--paa;
+ }
+ }
+ return NULL;
+}
+
+
+
+/**
* idr_replace - replace pointer for given id
* @idp: idr handle
* @ptr: pointer you want associated with the id
* [PATCH 3/3] memcg: change hierarchy management to use scan by cgroup ID
2008-12-01 5:59 [PATCH 0/3] cgroup id and scanning without cgroup_lock KAMEZAWA Hiroyuki
2008-12-01 6:02 ` [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt KAMEZAWA Hiroyuki
2008-12-01 6:03 ` [PATCH 2/3] cgroup: cgroup ID and scanning under RCU KAMEZAWA Hiroyuki
@ 2008-12-01 6:04 ` KAMEZAWA Hiroyuki
2008-12-01 6:24 ` [PATCH 0/3] cgroup id and scanning without cgroup_lock Balbir Singh
3 siblings, 0 replies; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-01 6:04 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, lizf, menage, balbir, nishimura, linux-kernel, akpm
Implement hierarchy reclaim by cgroup_id.
TODO:
- memsw support isn't good yet. (Maybe using Nishimura's patch is better.)
What changes:
- reclaim is no longer done by the tree-walk algorithm.
- mem_cgroup->last_scan_child is an ID, not a pointer.
- no cgroup_lock.
- the scanning order is simply defined by ID order
  (round-robin logic; a condensed sketch follows this list).
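Condensed from the patch below, the round-robin victim selection is roughly
(rootid/depth describe root_mem's own cgroup):

        nextid = root_mem->last_scan_child + 1;
        cgroup = cgroup_get_next(nextid, rootid, depth, &found);
        if (cgroup) {
                root_mem->last_scan_child = found;      /* remember position */
                victim = mem_cgroup_from_cont(cgroup);
                if (!css_tryget(&victim->css))          /* may already be dying */
                        victim = NULL;
        } else {
                /* wrapped around the subtree once */
                root_mem->scan_age++;
                root_mem->last_scan_child = 0;
        }

and the reclaim loop repeats this until the limit is satisfied or scan_age has
advanced twice.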
Changelog: v1 -> v2
- make use of css_tryget();
- count # of loops rather than remembering position.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 176 +++++++++++++++++---------------------------------------
1 file changed, 55 insertions(+), 121 deletions(-)
Index: mmotm-2.6.28-Nov29/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Nov29.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Nov29/mm/memcontrol.c
@@ -146,9 +146,11 @@ struct mem_cgroup {
/*
* While reclaiming in a hiearchy, we cache the last child we
- * reclaimed from. Protected by cgroup_lock()
+ * reclaimed from.
*/
- struct mem_cgroup *last_scanned_child;
+ spinlock_t scan_lock;
+ int last_scan_child;
+ unsigned long scan_age;
/*
* Should the accounting and control be hierarchical, per subtree?
*/
@@ -475,104 +477,44 @@ unsigned long mem_cgroup_isolate_pages(u
container_of(counter, struct mem_cgroup, member)
/*
- * This routine finds the DFS walk successor. This routine should be
- * called with cgroup_mutex held
+ * This routine select next memcg by ID. Using RCU and tryget().
+ * No cgroup_mutex is required.
*/
static struct mem_cgroup *
-mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
+mem_cgroup_select_victim(struct mem_cgroup *root_mem)
{
- struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
-
- curr_cgroup = curr->css.cgroup;
- root_cgroup = root_mem->css.cgroup;
-
- if (!list_empty(&curr_cgroup->children)) {
- /*
- * Walk down to children
- */
- mem_cgroup_put(curr);
- cgroup = list_entry(curr_cgroup->children.next,
- struct cgroup, sibling);
- curr = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(curr);
- goto done;
- }
-
-visit_parent:
- if (curr_cgroup == root_cgroup) {
- mem_cgroup_put(curr);
- curr = root_mem;
- mem_cgroup_get(curr);
- goto done;
- }
-
- /*
- * Goto next sibling
- */
- if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
- mem_cgroup_put(curr);
- cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
- sibling);
- curr = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(curr);
- goto done;
- }
-
- /*
- * Go up to next parent and next parent's sibling if need be
- */
- curr_cgroup = curr_cgroup->parent;
- goto visit_parent;
-
-done:
- root_mem->last_scanned_child = curr;
- return curr;
-}
-
-/*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
- */
-static struct mem_cgroup *
-mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
-{
- struct cgroup *cgroup;
+ struct cgroup *cgroup, *root_cgroup;
struct mem_cgroup *ret;
- struct mem_cgroup *last_scan = root_mem->last_scanned_child;
- bool obsolete = false;
+ int nextid, rootid, depth, found;
+ unsigned long flags;
- if (last_scan) {
- if (css_under_removal(&last_scan->css))
- obsolete = true;
- } else
- obsolete = true;
+ root_cgroup = root_mem->css.cgroup;
+ rootid = cgroup_id(root_cgroup);
+ depth = cgroup_depth(root_cgroup);
+ found = 0;
+ ret = NULL;
+ rcu_read_lock();
- /*
- * Scan all children under the mem_cgroup mem
- */
- cgroup_lock();
- if (list_empty(&root_mem->css.cgroup->children)) {
- ret = root_mem;
- goto done;
+ while (!ret) {
+ /* ID:0 is not used by cgroup-id */
+ nextid = root_mem->last_scan_child + 1;
+ cgroup = cgroup_get_next(nextid, rootid, depth, &found);
+ if (cgroup) {
+ spin_lock_irqsave(&root_mem->scan_lock, flags);
+ root_mem->last_scan_child = found;
+ spin_unlock_irqrestore(&root_mem->scan_lock, flags);
+ ret = mem_cgroup_from_cont(cgroup);
+ if (!css_tryget(&ret->css))
+ ret = NULL;
+ } else {
+ spin_lock_irqsave(&root_mem->scan_lock, flags);
+ root_mem->scan_age++;
+ root_mem->last_scan_child = 0;
+ spin_unlock_irqrestore(&root_mem->scan_lock, flags);
+ }
}
+ rcu_read_unlock();
- if (!root_mem->last_scanned_child || obsolete) {
-
- if (obsolete)
- mem_cgroup_put(root_mem->last_scanned_child);
-
- cgroup = list_first_entry(&root_mem->css.cgroup->children,
- struct cgroup, sibling);
- ret = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(ret);
- } else
- ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
- root_mem);
-
-done:
- root_mem->last_scanned_child = ret;
- cgroup_unlock();
return ret;
}
@@ -586,37 +528,25 @@ done:
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
gfp_t gfp_mask, bool noswap)
{
- struct mem_cgroup *next_mem;
+ struct mem_cgroup *victim;
+ unsigned long start_age;
int ret = 0;
+ int total = 0;
- /*
- * Reclaim unconditionally and don't check for return value.
- * We need to reclaim in the current group and down the tree.
- * One might think about checking for children before reclaiming,
- * but there might be left over accounting, even after children
- * have left.
- */
- ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap);
- if (res_counter_check_under_limit(&root_mem->res))
- return 0;
-
- next_mem = mem_cgroup_get_first_node(root_mem);
-
- while (next_mem != root_mem) {
- if (css_under_removal(&next_mem->css)) {
- mem_cgroup_put(next_mem);
- cgroup_lock();
- next_mem = mem_cgroup_get_first_node(root_mem);
- cgroup_unlock();
- continue;
- }
- ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap);
+ start_age = root_mem->scan_age;
+ /* allows 2 times of loops */
+ while (time_after((start_age + 2UL), root_mem->scan_age)) {
+ victim = mem_cgroup_select_victim(root_mem);
+ ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap);
+ css_put(&victim->css);
if (res_counter_check_under_limit(&root_mem->res))
- return 0;
- cgroup_lock();
- next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
- cgroup_unlock();
+ return 1;
+ total += ret;
}
+ ret = total;
+ if (res_counter_check_under_limit(&root_mem->res))
+ ret = 1;
+
return ret;
}
@@ -705,6 +635,8 @@ static int __mem_cgroup_try_charge(struc
ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
noswap);
+ if (ret)
+ continue;
/*
* try_to_free_mem_cgroup_pages() might not give us a full
@@ -1981,8 +1913,9 @@ mem_cgroup_create(struct cgroup_subsys *
res_counter_init(&mem->res, NULL);
res_counter_init(&mem->memsw, NULL);
}
-
- mem->last_scanned_child = NULL;
+ spin_lock_init(&mem->scan_lock);
+ mem->last_scan_child = 0;
+ mem->scan_age = 0;
return &mem->css;
free_out:
* Re: [PATCH 0/3] cgroup id and scanning without cgroup_lock
2008-12-01 5:59 [PATCH 0/3] cgroup id and scanning without cgroup_lock KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2008-12-01 6:04 ` [PATCH 3/3] memcg: change hierarchy management to use scan by cgroup ID KAMEZAWA Hiroyuki
@ 2008-12-01 6:24 ` Balbir Singh
2008-12-01 7:52 ` KAMEZAWA Hiroyuki
3 siblings, 1 reply; 14+ messages in thread
From: Balbir Singh @ 2008-12-01 6:24 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, lizf, menage, nishimura, linux-kernel, akpm
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-12-01 14:59:07]:
> This is a series of patches againse mmotm-Nov29
> (passed easy test)
>
> Now, memcg supports hierarhcy. But walking cgroup tree in intellegent way
> with lock/unlock cgroup_mutex seems to have troubles rather than expected.
> And, I want to reduce the memory usage of swap_cgroup, which uses array of
> pointers.
>
> This patch series provides
> - cgroup_id per cgroup object.
> - lookup struct cgroup by cgroup_id
> - scan all cgroup under tree by cgroup_id. without mutex.
> - css_tryget() function.
> - fixes semantics of notify_on_release. (I think this is valid fix.)
>
> Many changes since v1. (But I wonder some more work may be neeeded.)
>
> BTW, I know there are some amount of patches against memcg are posted recently.
> If necessary, I'll prepare Weekly-update queue again (Wednesday) and
> picks all patches to linux-mm in my queue.
>
Thanks for the offer, I've just come back from foss.in. I need to look at the
locking issue with cgroup_lock() that was reported and also review/test
the other patches.
--
Balbir
* Re: [PATCH 0/3] cgroup id and scanning without cgroup_lock
2008-12-01 6:24 ` [PATCH 0/3] cgroup id and scanning without cgroup_lock Balbir Singh
@ 2008-12-01 7:52 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-01 7:52 UTC (permalink / raw)
To: balbir; +Cc: linux-mm, lizf, menage, nishimura, linux-kernel, akpm
On Mon, 1 Dec 2008 11:54:29 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-12-01 14:59:07]:
>
> > This is a series of patches againse mmotm-Nov29
> > (passed easy test)
> >
> > Now, memcg supports hierarhcy. But walking cgroup tree in intellegent way
> > with lock/unlock cgroup_mutex seems to have troubles rather than expected.
> > And, I want to reduce the memory usage of swap_cgroup, which uses array of
> > pointers.
> >
> > This patch series provides
> > - cgroup_id per cgroup object.
> > - lookup struct cgroup by cgroup_id
> > - scan all cgroup under tree by cgroup_id. without mutex.
> > - css_tryget() function.
> > - fixes semantics of notify_on_release. (I think this is valid fix.)
> >
> > Many changes since v1. (But I wonder some more work may be neeeded.)
> >
> > BTW, I know there are some amount of patches against memcg are posted recently.
> > If necessary, I'll prepare Weekly-update queue again (Wednesday) and
> > picks all patches to linux-mm in my queue.
> >
>
> Thanks for the offer, I've just come back from foss.in. I need to look
> athe locking issue with cgroup_lock() reported and also review/test
> the other patches.
>
Hmm, after reading the mailing list again, it seems better to do some serialization.
I'll pick up some patches and post a queue tomorrow.
Thanks,
-Kame
* Re: [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt
2008-12-01 6:02 ` [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt KAMEZAWA Hiroyuki
@ 2008-12-02 6:15 ` Li Zefan
2008-12-02 6:21 ` KAMEZAWA Hiroyuki
2008-12-03 3:44 ` Li Zefan
1 sibling, 1 reply; 14+ messages in thread
From: Li Zefan @ 2008-12-02 6:15 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, nishimura, linux-kernel, akpm
KAMEZAWA Hiroyuki wrote:
> Now, final check of refcnt is done after pre_destroy(), so rmdir() can fail
> after pre_destroy().
> memcg set mem->obsolete to be 1 at pre_destroy and this is buggy..
>
> Several ways to fix this can be considered. This is an idea.
>
I don't see the difference between css_under_removal() in this patch and
cgroup_is_removed(), which is already available.
The CGRP_REMOVED flag is set in cgroup_rmdir() once it's confirmed that rmdir
can be successfully performed.
So mem->obsolete could be replaced with:

bool mem_cgroup_is_obsolete(struct mem_cgroup *mem)
{
        return cgroup_is_removed(mem->css.cgroup);
}

Or am I missing something?
* Re: [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt
2008-12-02 6:15 ` Li Zefan
@ 2008-12-02 6:21 ` KAMEZAWA Hiroyuki
2008-12-02 6:56 ` Li Zefan
0 siblings, 1 reply; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-02 6:21 UTC (permalink / raw)
To: Li Zefan; +Cc: linux-mm, menage, balbir, nishimura, linux-kernel, akpm
On Tue, 02 Dec 2008 14:15:23 +0800
Li Zefan <lizf@cn.fujitsu.com> wrote:
> KAMEZAWA Hiroyuki wrote:
> > Now, final check of refcnt is done after pre_destroy(), so rmdir() can fail
> > after pre_destroy().
> > memcg set mem->obsolete to be 1 at pre_destroy and this is buggy..
> >
> > Several ways to fix this can be considered. This is an idea.
> >
>
> I don't see what's the difference with css_under_removal() in this patch and
> cgroup_is_removed() which is currently available.
>
> CGRP_REMOVED flag is set in cgroup_rmdir() when it's confirmed that rmdir can
> be sucessfully performed.
>
> So mem->obsolete can be replaced with:
>
> bool mem_cgroup_is_obsolete(struct mem_cgroup *mem)
> {
> return cgroup_is_removed(mem->css.cgroup);
> }
>
> Or am I missing something?
>
Yes.
1. "cgroup" and "css" are different objects.
2. A css object may not be freed at destroy() (as the current memcg does).
Some css objects cannot be freed even when there are no tasks, because of
references from some persistent object or a temporary refcnt.
Please consider css_under_removal() a kind of css_tryget() which doesn't
increase any refcnt.
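A small sketch of the difference (illustration only):
==
        /* pins the css: while this reference is held, destroy() cannot start */
        if (css_tryget(&mem->css)) {
                ... use mem safely ...
                css_put(&mem->css);
        }

        /* read-only check: no reference is taken, so the state may change
         * right after the test; it is only a hint */
        if (css_under_removal(&mem->css))
                ... treat mem as obsolete ...
==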
Thanks,
-Kame
* Re: [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt
2008-12-02 6:21 ` KAMEZAWA Hiroyuki
@ 2008-12-02 6:56 ` Li Zefan
2008-12-02 7:13 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 14+ messages in thread
From: Li Zefan @ 2008-12-02 6:56 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, nishimura, linux-kernel, akpm
KAMEZAWA Hiroyuki wrote:
> On Tue, 02 Dec 2008 14:15:23 +0800
> Li Zefan <lizf@cn.fujitsu.com> wrote:
>
>> KAMEZAWA Hiroyuki wrote:
>>> Now, final check of refcnt is done after pre_destroy(), so rmdir() can fail
>>> after pre_destroy().
>>> memcg set mem->obsolete to be 1 at pre_destroy and this is buggy..
>>>
>>> Several ways to fix this can be considered. This is an idea.
>>>
>> I don't see what's the difference with css_under_removal() in this patch and
>> cgroup_is_removed() which is currently available.
>>
>> CGRP_REMOVED flag is set in cgroup_rmdir() when it's confirmed that rmdir can
>> be sucessfully performed.
>>
>> So mem->obsolete can be replaced with:
>>
>> bool mem_cgroup_is_obsolete(struct mem_cgroup *mem)
>> {
>> return cgroup_is_removed(mem->css.cgroup);
>> }
>>
>> Or am I missing something?
>>
> Yes.
> 1. "cgroup" and "css" object are different object.
> 2. css object may not be freed at destroy() (as current memcg does.)
>
> Some of css objects cannot be freed even when there are no tasks because
> of reference from some persistent object or temporal refcnt.
>
I just noticed mem_cgroup has its own refcnt now. The memcg code has changed
so dramatically that I haven't caught up with it. Thanks for the explanation.
But I have another doubt:
void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
{
        struct mem_cgroup *memcg;

        memcg = __mem_cgroup_uncharge_common(page,
                                MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
        /* record memcg information */
        if (do_swap_account && memcg) {
                swap_cgroup_record(ent, memcg);
                mem_cgroup_get(memcg);
        }
}
In the above code, is it possible that memcg is freed before mem_cgroup_get()
increases memcg->refcnt?
> Please consider css_under_removal() as a kind of css_tryget() which doesn't
> increase any refcnt.
>
> Thanks,
> -Kame
>
* Re: [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt
2008-12-02 6:56 ` Li Zefan
@ 2008-12-02 7:13 ` KAMEZAWA Hiroyuki
2008-12-02 7:31 ` Li Zefan
0 siblings, 1 reply; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-02 7:13 UTC (permalink / raw)
To: Li Zefan; +Cc: linux-mm, menage, balbir, nishimura, linux-kernel, akpm
On Tue, 02 Dec 2008 14:56:52 +0800
Li Zefan <lizf@cn.fujitsu.com> wrote:
> KAMEZAWA Hiroyuki wrote:
> > On Tue, 02 Dec 2008 14:15:23 +0800
> > Li Zefan <lizf@cn.fujitsu.com> wrote:
> >
> >> KAMEZAWA Hiroyuki wrote:
> >>> Now, final check of refcnt is done after pre_destroy(), so rmdir() can fail
> >>> after pre_destroy().
> >>> memcg set mem->obsolete to be 1 at pre_destroy and this is buggy..
> >>>
> >>> Several ways to fix this can be considered. This is an idea.
> >>>
> >> I don't see what's the difference with css_under_removal() in this patch and
> >> cgroup_is_removed() which is currently available.
> >>
> >> CGRP_REMOVED flag is set in cgroup_rmdir() when it's confirmed that rmdir can
> >> be sucessfully performed.
> >>
> >> So mem->obsolete can be replaced with:
> >>
> >> bool mem_cgroup_is_obsolete(struct mem_cgroup *mem)
> >> {
> >> return cgroup_is_removed(mem->css.cgroup);
> >> }
> >>
> >> Or am I missing something?
> >>
> > Yes.
> > 1. "cgroup" and "css" object are different object.
> > 2. css object may not be freed at destroy() (as current memcg does.)
> >
> > Some of css objects cannot be freed even when there are no tasks because
> > of reference from some persistent object or temporal refcnt.
> >
>
> I just noticed mem_cgroup has its own refcnt now. The memcg code has changed
> dramatically that I don't catch up with it. Thx for the explanation.
>
> But I have another doubt:
>
> void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> {
> struct mem_cgroup *memcg;
>
> memcg = __mem_cgroup_uncharge_common(page,
> MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
> /* record memcg information */
> if (do_swap_account && memcg) {
> swap_cgroup_record(ent, memcg);
> mem_cgroup_get(memcg);
> }
> }
>
> In the above code, is it possible that memcg is freed before mem_cgroup_get()
> increases memcg->refcnt?
>
Thank you for looking into this. It is maybe possible.
In this case,
1. "the page" belonged to the memcg before uncharge().
2. but it's not guaranteed that the memcg is still alive after uncharge.
OK, maybe css_tryget() can change this to:
==
        rcu_read_lock();
        memcg = __mem_cgroup_uncharge_common(page,
                                MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
        if (do_swap_account && memcg && css_tryget(&memcg->css)) {
                swap_cgroup_record(ent, memcg);
                mem_cgroup_get(memcg);
                css_put(&memcg->css);
        }
        rcu_read_unlock();
==
How about this?
Thanks,
-Kame
* Re: [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt
2008-12-02 7:13 ` KAMEZAWA Hiroyuki
@ 2008-12-02 7:31 ` Li Zefan
2008-12-02 7:39 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 14+ messages in thread
From: Li Zefan @ 2008-12-02 7:31 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, nishimura, linux-kernel, akpm
KAMEZAWA Hiroyuki wrote:
> On Tue, 02 Dec 2008 14:56:52 +0800
> Li Zefan <lizf@cn.fujitsu.com> wrote:
>
>> KAMEZAWA Hiroyuki wrote:
>>> On Tue, 02 Dec 2008 14:15:23 +0800
>>> Li Zefan <lizf@cn.fujitsu.com> wrote:
>>>
>>>> KAMEZAWA Hiroyuki wrote:
>>>>> Now, final check of refcnt is done after pre_destroy(), so rmdir() can fail
>>>>> after pre_destroy().
>>>>> memcg set mem->obsolete to be 1 at pre_destroy and this is buggy..
>>>>>
>>>>> Several ways to fix this can be considered. This is an idea.
>>>>>
>>>> I don't see what's the difference with css_under_removal() in this patch and
>>>> cgroup_is_removed() which is currently available.
>>>>
>>>> CGRP_REMOVED flag is set in cgroup_rmdir() when it's confirmed that rmdir can
>>>> be sucessfully performed.
>>>>
>>>> So mem->obsolete can be replaced with:
>>>>
>>>> bool mem_cgroup_is_obsolete(struct mem_cgroup *mem)
>>>> {
>>>> return cgroup_is_removed(mem->css.cgroup);
>>>> }
>>>>
>>>> Or am I missing something?
>>>>
>>> Yes.
>>> 1. "cgroup" and "css" are different objects.
>>> 2. a css object may not be freed at destroy() (as is the case with the current memcg).
>>>
>>> Some css objects cannot be freed even when there are no tasks, because
>>> of references from persistent objects or temporary refcounts.
>>>
>> I just noticed mem_cgroup has its own refcnt now. The memcg code has changed
>> so dramatically that I haven't caught up with it. Thanks for the explanation.
>>
>> But I have another doubt:
>>
>> void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
>> {
>> struct mem_cgroup *memcg;
>>
>> memcg = __mem_cgroup_uncharge_common(page,
>> MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
>> /* record memcg information */
>> if (do_swap_account && memcg) {
>> swap_cgroup_record(ent, memcg);
>> mem_cgroup_get(memcg);
>> }
>> }
>>
>> In the above code, is it possible that memcg is freed before mem_cgroup_get()
>> increases memcg->refcnt?
>>
> Thank you for looking into this. It may be possible.
>
> In this case,
> 1. "the page" belonged to the memcg before uncharge().
> 2. but it is not guaranteed that the memcg is still alive after uncharge().
>
> OK, maybe css_tryget() can change this to:
> ==
> rcu_read_lock();
> memcg = __mem_cgroup_uncharge_common(page,
> MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
> if (do_swap_account && memcg && css_tryget(&memcg->css)) {
> swap_cgroup_record(ent, memcg);
> mem_cgroup_get(memcg);
> css_put(&memcg->css);
> }
> rcu_read_unlock();
> ==
> How about this ?
>
Seems OK to me. Another way to fix this is to not call css_put() if we want
to use the memcg returned from __mem_cgroup_uncharge_common(); I think this
is more reasonable:
--- a/mm/memcontrol.c.orig 2008-12-02 15:20:55.000000000 +0800
+++ b/mm/memcontrol.c 2008-12-02 15:28:07.000000000 +0800
@@ -1110,8 +1110,9 @@ void mem_cgroup_cancel_charge_swapin(str
/*
* uncharge if !page_mapped(page)
*/
-static struct mem_cgroup *
-__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
+static void
+__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
+ struct mem_cgroup **memcg)
{
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
@@ -1163,13 +1164,16 @@ __mem_cgroup_uncharge_common(struct page
mz = page_cgroup_zoneinfo(pc);
unlock_page_cgroup(pc);
- css_put(&mem->css);
+ /* don't dec refcnt, since the caller want to use this memcg */
+ if (memcg)
+ *memcg = mem;
+ else
+ css_put(&mem->css);
- return mem;
+ return;
unlock_out:
unlock_page_cgroup(pc);
- return NULL;
}
void mem_cgroup_uncharge_page(struct page *page)
@@ -1197,12 +1201,13 @@ void mem_cgroup_uncharge_swapcache(struc
{
struct mem_cgroup *memcg;
- memcg = __mem_cgroup_uncharge_common(page,
- MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
+ __mem_cgroup_uncharge_common(page,
+ MEM_CGROUP_CHARGE_TYPE_SWAPOUT, &memcg);
/* record memcg information */
if (do_swap_account && memcg) {
swap_cgroup_record(ent, memcg);
mem_cgroup_get(memcg);
+ css_put(&memcg->css);
}
}
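(With that change applied, the swapcache path would read roughly as the sketch below. memcg is initialized to NULL here so that the early-return path of __mem_cgroup_uncharge_common(), which does not write to *memcg in the diff above, is still handled; that initialization is an extra added for illustration.)
==
void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
{
	struct mem_cgroup *memcg = NULL;	/* stays NULL if nothing was uncharged */

	__mem_cgroup_uncharge_common(page,
				MEM_CGROUP_CHARGE_TYPE_SWAPOUT, &memcg);
	/* record memcg information; the css reference kept by
	 * __mem_cgroup_uncharge_common() keeps memcg alive here */
	if (do_swap_account && memcg) {
		swap_cgroup_record(ent, memcg);
		mem_cgroup_get(memcg);
		css_put(&memcg->css);	/* drop the reference kept above */
	}
}
==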
* Re: [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt
2008-12-02 7:31 ` Li Zefan
@ 2008-12-02 7:39 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-02 7:39 UTC (permalink / raw)
To: Li Zefan; +Cc: linux-mm, menage, balbir, nishimura, linux-kernel, akpm
On Tue, 02 Dec 2008 15:31:16 +0800
Li Zefan <lizf@cn.fujitsu.com> wrote:
> KAMEZAWA Hiroyuki wrote:
> > On Tue, 02 Dec 2008 14:56:52 +0800
> > Li Zefan <lizf@cn.fujitsu.com> wrote:
> >
> >> KAMEZAWA Hiroyuki wrote:
> >>> On Tue, 02 Dec 2008 14:15:23 +0800
> >>> Li Zefan <lizf@cn.fujitsu.com> wrote:
> >>>
> >>>> KAMEZAWA Hiroyuki wrote:
> >>>>> Now, final check of refcnt is done after pre_destroy(), so rmdir() can fail
> >>>>> after pre_destroy().
> >>>>> memcg set mem->obsolete to be 1 at pre_destroy and this is buggy..
> >>>>>
> >>>>> Several ways to fix this can be considered. This is an idea.
> >>>>>
> >>>> I don't see the difference between css_under_removal() in this patch and
> >>>> cgroup_is_removed(), which is currently available.
> >>>>
> >>>> The CGRP_REMOVED flag is set in cgroup_rmdir() when it's confirmed that rmdir can
> >>>> be successfully performed.
> >>>>
> >>>> So mem->obsolete can be replaced with:
> >>>>
> >>>> bool mem_cgroup_is_obsolete(struct mem_cgroup *mem)
> >>>> {
> >>>> return cgroup_is_removed(mem->css.cgroup);
> >>>> }
> >>>>
> >>>> Or am I missing something?
> >>>>
> >>> Yes.
> >>> 1. "cgroup" and "css" are different objects.
> >>> 2. a css object may not be freed at destroy() (as is the case with the current memcg).
> >>>
> >>> Some css objects cannot be freed even when there are no tasks, because
> >>> of references from persistent objects or temporary refcounts.
> >>>
> >> I just noticed mem_cgroup has its own refcnt now. The memcg code has changed
> >> so dramatically that I haven't caught up with it. Thanks for the explanation.
> >>
> >> But I have another doubt:
> >>
> >> void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> >> {
> >> struct mem_cgroup *memcg;
> >>
> >> memcg = __mem_cgroup_uncharge_common(page,
> >> MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
> >> /* record memcg information */
> >> if (do_swap_account && memcg) {
> >> swap_cgroup_record(ent, memcg);
> >> mem_cgroup_get(memcg);
> >> }
> >> }
> >>
> >> In the above code, is it possible that memcg is freed before mem_cgroup_get()
> >> increases memcg->refcnt?
> >>
> > Thank you for looking into this. It may be possible.
> >
> > In this case,
> > 1. "the page" belonged to the memcg before uncharge().
> > 2. but it is not guaranteed that the memcg is still alive after uncharge().
> >
> > OK, maybe css_tryget() can change this to:
> > ==
> > rcu_read_lock();
> > memcg = __mem_cgroup_uncharge_common(page,
> > MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
> > if (do_swap_account && memcg && css_tryget(&memcg->css)) {
> > swap_cgroup_record(ent, memcg);
> > mem_cgroup_get(memcg);
> > css_put(&memcg->css);
> > }
> > rcu_read_unlock();
> > ==
> > How about this ?
> >
>
> Seems OK to me. Another way to fix this is to not call css_put() if we want
> to use the memcg returned from __mem_cgroup_uncharge_common(); I think this
> is more reasonable:
>
but more complicated ;) Hmm...
Anyway, I'll queue a fix for the next weekly update.
Thank you for pointing this out.
-Kame
> --- a/mm/memcontrol.c.orig 2008-12-02 15:20:55.000000000 +0800
> +++ b/mm/memcontrol.c 2008-12-02 15:28:07.000000000 +0800
> @@ -1110,8 +1110,9 @@ void mem_cgroup_cancel_charge_swapin(str
> /*
> * uncharge if !page_mapped(page)
> */
> -static struct mem_cgroup *
> -__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> +static void
> +__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
> + struct mem_cgroup **memcg)
> {
> struct page_cgroup *pc;
> struct mem_cgroup *mem = NULL;
> @@ -1163,13 +1164,16 @@ __mem_cgroup_uncharge_common(struct page
> mz = page_cgroup_zoneinfo(pc);
> unlock_page_cgroup(pc);
>
> - css_put(&mem->css);
> + /* don't dec refcnt, since the caller want to use this memcg */
> + if (memcg)
> + *memcg = mem;
> + else
> + css_put(&mem->css);
>
> - return mem;
> + return;
>
> unlock_out:
> unlock_page_cgroup(pc);
> - return NULL;
> }
>
> void mem_cgroup_uncharge_page(struct page *page)
> @@ -1197,12 +1201,13 @@ void mem_cgroup_uncharge_swapcache(struc
> {
> struct mem_cgroup *memcg;
>
> - memcg = __mem_cgroup_uncharge_common(page,
> - MEM_CGROUP_CHARGE_TYPE_SWAPOUT);
> + __mem_cgroup_uncharge_common(page,
> + MEM_CGROUP_CHARGE_TYPE_SWAPOUT, &memcg);
> /* record memcg information */
> if (do_swap_account && memcg) {
> swap_cgroup_record(ent, memcg);
> mem_cgroup_get(memcg);
> + css_put(&memcg->css);
> }
> }
>
>
>
* Re: [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt
2008-12-01 6:02 ` [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt KAMEZAWA Hiroyuki
2008-12-02 6:15 ` Li Zefan
@ 2008-12-03 3:44 ` Li Zefan
2008-12-03 3:54 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 14+ messages in thread
From: Li Zefan @ 2008-12-03 3:44 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, menage, balbir, nishimura, linux-kernel, akpm
> +/*
> + * Try to set all subsys's refcnt to be 0.
> + * css->refcnt==0 means this subsys will be destroy()'d.
> + */
> +static bool cgroup_set_subsys_removed(struct cgroup *cgrp)
> +{
> + struct cgroup_subsys *ss;
> + struct cgroup_subsys_state *css, *tmp;
> +
> + for_each_subsys(cgrp->root, ss) {
> + css = cgrp->subsys[ss->subsys_id];
> + if (!atomic_dec_and_test(&css->refcnt))
> + goto rollback;
> + }
> + return true;
> +rollback:
> + for_each_subsys(cgrp->root, ss) {
> + tmp = cgrp->subsys[ss->subsys_id];
> + atomic_inc(&tmp->refcnt);
> + if (tmp == css)
> + break;
> + }
> + return false;
> +}
> +
This function may return false, which then causes rmdir() to fail. So css_tryget(subsys1)
returning 0 doesn't necessarily mean subsys1->destroy() will be called,
if subsys2's css refcnt is > 1 when cgroup_set_subsys_removed() is called.
Will this bring up bugs or problems?
* Re: [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt
2008-12-03 3:44 ` Li Zefan
@ 2008-12-03 3:54 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-03 3:54 UTC (permalink / raw)
To: Li Zefan; +Cc: linux-mm, menage, balbir, nishimura, linux-kernel, akpm
On Wed, 03 Dec 2008 11:44:36 +0800
Li Zefan <lizf@cn.fujitsu.com> wrote:
> > +/*
> > + * Try to set all subsys's refcnt to be 0.
> > + * css->refcnt==0 means this subsys will be destroy()'d.
> > + */
> > +static bool cgroup_set_subsys_removed(struct cgroup *cgrp)
> > +{
> > + struct cgroup_subsys *ss;
> > + struct cgroup_subsys_state *css, *tmp;
> > +
> > + for_each_subsys(cgrp->root, ss) {
> > + css = cgrp->subsys[ss->subsys_id];
> > + if (!atomic_dec_and_test(&css->refcnt))
> > + goto rollback;
> > + }
> > + return true;
> > +rollback:
> > + for_each_subsys(cgrp->root, ss) {
> > + tmp = cgrp->subsys[ss->subsys_id];
> > + atomic_inc(&tmp->refcnt);
> > + if (tmp == css)
> > + break;
> > + }
> > + return false;
> > +}
> > +
>
> This function may return false, which then causes rmdir() to fail. So css_tryget(subsys1)
> returning 0 doesn't necessarily mean subsys1->destroy() will be called,
> if subsys2's css refcnt is > 1 when cgroup_set_subsys_removed() is called.
>
> Will this bring up bugs or problems?
>
The only current user of css_get() is memcg, so there is no problem now.
"css_tryget() fails" means rmdir() has been called against this cgroup, so it is not
so troublesome in general, I think (the user will retry rmdir()).
To be honest, I don't want to return -EBUSY; I would rather wait for success in the
kernel and go back to pre_destroy() for this temporary race.
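(A purely illustrative sketch of that idea -- retrying inside the kernel rather than returning -EBUSY -- might look like the loop below. cgroup_set_subsys_removed() is the function from this patch; the loop itself and its placement in the rmdir path are hypothetical.)
==
	/* hypothetical: in the rmdir path, wait for temporary references
	 * to go away instead of failing with -EBUSY */
	for (;;) {
		cgroup_call_pre_destroy(cgrp);	/* ask subsystems to drop their charges */
		if (cgroup_set_subsys_removed(cgrp))
			break;			/* every css refcnt reached 0 */
		/* a temporary reference (e.g. css_tryget()) won the race; retry */
		schedule_timeout_interruptible(1);
	}
==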
Thanks,
-Kame
Thread overview: 14+ messages (end of thread; newest: 2008-12-03 3:55 UTC)
2008-12-01 5:59 [PATCH 0/3] cgroup id and scanning without cgroup_lock KAMEZAWA Hiroyuki
2008-12-01 6:02 ` [PATCH 1/3] cgroup: fix pre_destroy and semantics of css->refcnt KAMEZAWA Hiroyuki
2008-12-02 6:15 ` Li Zefan
2008-12-02 6:21 ` KAMEZAWA Hiroyuki
2008-12-02 6:56 ` Li Zefan
2008-12-02 7:13 ` KAMEZAWA Hiroyuki
2008-12-02 7:31 ` Li Zefan
2008-12-02 7:39 ` KAMEZAWA Hiroyuki
2008-12-03 3:44 ` Li Zefan
2008-12-03 3:54 ` KAMEZAWA Hiroyuki
2008-12-01 6:03 ` [PATCH 2/3] cgroup: cgroup ID and scanning under RCU KAMEZAWA Hiroyuki
2008-12-01 6:04 ` [PATCH 3/3] memcg: change hierarhcy managenemt to use scan by cgroup ID KAMEZAWA Hiroyuki
2008-12-01 6:24 ` [PATCH 0/3] cgroup id and scanning without cgroup_lock Balbir Singh
2008-12-01 7:52 ` KAMEZAWA Hiroyuki