* [RFC][PATCH 0/4] cgroup ID and css refcnt change and memcg hierarchy (2008/12/05)
@ 2008-12-05 8:26 KAMEZAWA Hiroyuki
2008-12-05 8:28 ` [RFC][PATCH 1/4] New css->refcnt implementation KAMEZAWA Hiroyuki
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-05 8:26 UTC (permalink / raw)
To: linux-mm; +Cc: nishimura, balbir, lizf, linux-kernel, menage
This is a patch set on top of mmotm-2.6.28-Dec03.
Still RFC. I'm considering whether I can make this simpler....
Major changes from the previous version:
- css->refcnt is unified.
  I think a distributed refcnt is a crazy idea...
- addressed the comments on the previous version.
- the OOM-kill handler is fixed. (this was broken by hierarchy)
I may not be able to reply quickly over the weekend, sorry.
After this series, memcg's hierarchical reclaim will look like:
==
static struct mem_cgroup *
mem_cgroup_select_victim(struct mem_cgroup *root_mem)
{
        struct cgroup *cgroup, *root_cgroup;
        struct mem_cgroup *ret = NULL;  /* must start as NULL so the scan loop below runs */
        int nextid, rootid, depth, found;

        root_cgroup = root_mem->css.cgroup;
        rootid = cgroup_id(root_cgroup);
        depth = cgroup_depth(root_cgroup);
        found = 0;

        rcu_read_lock();
        if (!root_mem->use_hierarchy) {
                spin_lock(&root_mem->reclaim_param_lock);
                root_mem->scan_age++;
                spin_unlock(&root_mem->reclaim_param_lock);
                css_get(&root_mem->css);
                ret = root_mem;
        }

        while (!ret) {
                /* ID:0 is not used by cgroup-id */
                nextid = root_mem->last_scanned_child + 1;
                cgroup = cgroup_get_next(nextid, rootid, depth, &found);
                if (cgroup) {
                        spin_lock(&root_mem->reclaim_param_lock);
                        root_mem->last_scanned_child = found;
                        spin_unlock(&root_mem->reclaim_param_lock);
                        ret = mem_cgroup_from_cont(cgroup);
                        if (!css_tryget(&ret->css))
                                ret = NULL;
                } else {
                        /* end of scan: wrap around and bump the scan age */
                        spin_lock(&root_mem->reclaim_param_lock);
                        root_mem->scan_age++;
                        root_mem->last_scanned_child = 0;
                        spin_unlock(&root_mem->reclaim_param_lock);
                }
        }
        rcu_read_unlock();
        return ret;
}

/*
 * root_mem is the original ancestor that we've been reclaiming from.
 * root_mem cannot be freed while walking because there are children.
 */
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
                                           gfp_t gfp_mask, bool noswap)
{
        struct mem_cgroup *victim;
        unsigned long start_age;
        int ret = 0;
        int total = 0;

        start_age = root_mem->scan_age;
        /* allows visiting everything twice (under this memcg, ->scan_age is shared.) */
        while (time_after((start_age + 2UL), root_mem->scan_age)) {
                victim = mem_cgroup_select_victim(root_mem);
                ret = try_to_free_mem_cgroup_pages(victim,
                                gfp_mask, noswap, get_swappiness(victim));
                css_put(&victim->css);
                if (mem_cgroup_check_under_limit(root_mem))
                        return 1;
                total += ret;
        }

        ret = total;
        if (mem_cgroup_check_under_limit(root_mem))
                ret = 1;
        return ret;
}
==
This can be reused for soft-limit or other fancy features.
Regards,
-Kame
* [RFC][PATCH 1/4] New css->refcnt implementation.
2008-12-05 8:26 [RFC][PATCH 0/4] cgroup ID and css refcnt change and memcg hierarchy (2008/12/05) KAMEZAWA Hiroyuki
@ 2008-12-05 8:28 ` KAMEZAWA Hiroyuki
2008-12-05 9:39 ` Paul Menage
2008-12-05 8:29 ` [RFC][PATCH 2/4] cgroup ID KAMEZAWA Hiroyuki
` (2 subsequent siblings)
3 siblings, 1 reply; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-05 8:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, balbir, lizf, linux-kernel, menage
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
Currently, the final refcnt check is done after pre_destroy(), so rmdir() can
still fail after pre_destroy() has run. But memcg sets mem->obsolete to 1 in
pre_destroy(), so a cgroup whose rmdir() then fails is left wrongly marked as
obsolete. This is a bug, so removing the memcg->obsolete flag is sane.
But there is no interface to confirm whether a "css" is obsolete or not, i.e.
there is no flag to check whether we may still increase the css refcnt!
This refcnt is hard to use...
Fortunately, memcg is currently the only user of css_get()/css_put(), so the
impact of changing these semantics is minimal.
This patch changes the css->refcnt rules as follows (a short usage sketch
follows this list):
- css->refcnt is no longer a private counter; it just points to
  css->cgroup->css_refcnt.
  (a subsystem can still keep a private counter by its own routines and can
  have a pre_destroy() handler)
- css_refcnt is initialized to 1.
- css_tryget() is added. It succeeds only while css_refcnt > 0.
- after pre_destroy() and before destroy(), we try to drop css_refcnt to 0.
- after css_refcnt goes down to 0, css->refcnt is redirected to a dummy
  counter (so that css_tryget() keeps failing).
- css_under_removal() is added. It checks css_refcnt == 0, which means this
  cgroup is between pre_destroy() and destroy() and there is no rollback.
- css_put() is changed not to call notify_on_release().
  Per the documentation, notify_on_release() is called when there are no
  tasks/children in the cgroup. In the implementation, notify_on_release is
  not called while css->refcnt > 0.
  This is a problem: memcg holds a css reference for every charged page even
  when there are no tasks, so the release handler would never be called.
  But rmdir()/pre_destroy() of memcg now works well, and checking css->refcnt
  is not (and shouldn't be) necessary for the notification.
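For illustration, a minimal sketch of the usage pattern this enables (not part
of the patch; "my_state" and "lookup_my_state()" are hypothetical names):
==
/* a subsystem state embedding the shared-refcnt css */
struct my_state {
        struct cgroup_subsys_state css;
        /* ... subsystem-private fields ... */
};

/* look an object up under RCU and pin it only if its cgroup is still alive */
static struct my_state *get_live_state(int key)
{
        struct my_state *s;

        rcu_read_lock();
        s = lookup_my_state(key);               /* may point at a dying cgroup */
        if (s && !css_tryget(&s->css))          /* fails once css_refcnt dropped to 0 */
                s = NULL;
        rcu_read_unlock();
        return s;                               /* caller must css_put(&s->css) */
}
==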
Changelog: v1 -> v2
- changed css->refcnt to be a pointer.
- added cgroup->css_refcnt.
- added CONFIG_DEBUG_CGROUP_SUBSYS.
- refreshed memcg's private refcnt handling; we'll revisit this.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
---
include/linux/cgroup.h | 55 ++++++++++++++++++++++++--------
kernel/cgroup.c | 83 ++++++++++++++++++++-----------------------------
lib/Kconfig.debug | 8 ++++
mm/memcontrol.c | 54 ++++++++++++++++++-------------
4 files changed, 116 insertions(+), 84 deletions(-)
Index: mmotm-2.6.28-Dec03/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Dec03.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Dec03/include/linux/cgroup.h
@@ -51,13 +51,8 @@ struct cgroup_subsys_state {
* for subsystems that want to know about the cgroup
* hierarchy structure */
struct cgroup *cgroup;
-
- /* State maintained by the cgroup system to allow
- * subsystems to be "busy". Should be accessed via css_get()
- * and css_put() */
-
- atomic_t refcnt;
-
+ /* refcnt that all subsys shares to show *cgroup is alive or not. */
+ atomic_t *refcnt;
unsigned long flags;
};
@@ -66,6 +61,12 @@ enum {
CSS_ROOT, /* This CSS is the root of the subsystem */
};
+#ifdef CONFIG_DEBUG_CGROUP_SUBSYS
+#define CGROUP_SUBSYS_BUG_ON(cond) BUG_ON(cond)
+#else
+#define CGROUP_SUBSYS_BUG_ON(cond) do {} while (0)
+#endif
+
/*
* Call css_get() to hold a reference on the cgroup;
*
@@ -73,20 +74,44 @@ enum {
static inline void css_get(struct cgroup_subsys_state *css)
{
+ atomic_t *ref = css->refcnt;
/* We don't need to reference count the root state */
- if (!test_bit(CSS_ROOT, &css->flags))
- atomic_inc(&css->refcnt);
+ if (test_bit(CSS_ROOT, &css->flags))
+ return;
+
+ CGROUP_SUBSYS_BUG_ON(ref != &css->cgroup->css_refcnt);
+ atomic_inc(ref);
}
/*
- * css_put() should be called to release a reference taken by
- * css_get()
+ * css_put() should be called to release a reference taken by css_get()
*/
-extern void __css_put(struct cgroup_subsys_state *css);
static inline void css_put(struct cgroup_subsys_state *css)
{
- if (!test_bit(CSS_ROOT, &css->flags))
- __css_put(css);
+ atomic_t *ref = css->refcnt;
+
+ if (test_bit(CSS_ROOT, &css->flags))
+ return;
+
+ CGROUP_SUBSYS_BUG_ON(ref != &css->cgroup->css_refcnt);
+ atomic_dec(ref);
+}
+
+/*
+ * Returns a value other than 0 at success.
+ */
+static inline int css_tryget(struct cgroup_subsys_state *css)
+{
+ if (test_bit(CSS_ROOT, &css->flags))
+ return 1;
+ return atomic_inc_not_zero(css->refcnt);
+}
+
+static inline bool css_under_removal(struct cgroup_subsys_state *css)
+{
+ if (test_bit(CSS_ROOT, &css->flags))
+ return false;
+ return atomic_read(css->refcnt) == 0;
}
/* bits in struct cgroup flags field */
@@ -145,6 +170,8 @@ struct cgroup {
int pids_use_count;
/* Length of the current tasks_pids array */
int pids_length;
+ /* for css_get/put */
+ atomic_t css_refcnt;
};
/* A css_set is a structure holding pointers to a set of
Index: mmotm-2.6.28-Dec03/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Dec03.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Dec03/kernel/cgroup.c
@@ -110,6 +110,8 @@ static int root_count;
/* dummytop is a shorthand for the dummy hierarchy's top cgroup */
#define dummytop (&rootnode.top_cgroup)
+static atomic_t dummy_css_refcnt; /* should be 0 forever. */
+
/* This flag indicates whether tasks in the fork and exit paths should
* check for fork/exit handlers to call. This avoids us having to do
* extra work in the fork/exit path if none of the subsystems need to
@@ -589,6 +591,27 @@ static void cgroup_call_pre_destroy(stru
return;
}
+/*
+ * Try to set subsys's refcnt to be 0.
+ */
+static int cgroup_set_subsys_removed(struct cgroup *cgrp)
+{
+ /* can refcnt goes down to 0 ? */
+ if (atomic_dec_and_test(&cgrp->css_refcnt)) {
+ struct cgroup_subsys *ss;
+ struct cgroup_subsys_state *css;
+ /* replace refcnt with dummy */
+ for_each_subsys(cgrp->root, ss) {
+ css = cgrp->subsys[ss->subsys_id];
+ css->refcnt = &dummy_css_refcnt;
+ }
+ return true;
+ } else
+ atomic_inc(&cgrp->css_refcnt);
+
+ return false;
+}
+
static void cgroup_diput(struct dentry *dentry, struct inode *inode)
{
/* is dentry a directory ? if so, kfree() associated cgroup */
@@ -869,6 +892,7 @@ static void init_cgroup_housekeeping(str
INIT_LIST_HEAD(&cgrp->css_sets);
INIT_LIST_HEAD(&cgrp->release_list);
init_rwsem(&cgrp->pids_mutex);
+ atomic_set(&cgrp->css_refcnt, 1);
}
static void init_cgroup_root(struct cgroupfs_root *root)
{
@@ -2310,7 +2334,7 @@ static void init_cgroup_css(struct cgrou
struct cgroup *cgrp)
{
css->cgroup = cgrp;
- atomic_set(&css->refcnt, 0);
+ css->refcnt = &cgrp->css_refcnt;
css->flags = 0;
if (cgrp == dummytop)
set_bit(CSS_ROOT, &css->flags);
@@ -2413,37 +2437,6 @@ static int cgroup_mkdir(struct inode *di
return cgroup_create(c_parent, dentry, mode | S_IFDIR);
}
-static int cgroup_has_css_refs(struct cgroup *cgrp)
-{
- /* Check the reference count on each subsystem. Since we
- * already established that there are no tasks in the
- * cgroup, if the css refcount is also 0, then there should
- * be no outstanding references, so the subsystem is safe to
- * destroy. We scan across all subsystems rather than using
- * the per-hierarchy linked list of mounted subsystems since
- * we can be called via check_for_release() with no
- * synchronization other than RCU, and the subsystem linked
- * list isn't RCU-safe */
- int i;
- for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
- struct cgroup_subsys *ss = subsys[i];
- struct cgroup_subsys_state *css;
- /* Skip subsystems not in this hierarchy */
- if (ss->root != cgrp->root)
- continue;
- css = cgrp->subsys[ss->subsys_id];
- /* When called from check_for_release() it's possible
- * that by this point the cgroup has been removed
- * and the css deleted. But a false-positive doesn't
- * matter, since it can only happen if the cgroup
- * has been deleted and hence no longer needs the
- * release agent to be called anyway. */
- if (css && atomic_read(&css->refcnt))
- return 1;
- }
- return 0;
-}
-
static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
{
struct cgroup *cgrp = dentry->d_fsdata;
@@ -2465,16 +2458,21 @@ static int cgroup_rmdir(struct inode *un
/*
* Call pre_destroy handlers of subsys. Notify subsystems
- * that rmdir() request comes.
+ * that rmdir() request comes. pre_destroy() is expected to drop all
+ * extra refcnt to css. (css->refcnt == 1)
*/
cgroup_call_pre_destroy(cgrp);
mutex_lock(&cgroup_mutex);
parent = cgrp->parent;
- if (atomic_read(&cgrp->count)
- || !list_empty(&cgrp->children)
- || cgroup_has_css_refs(cgrp)) {
+ if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
+ mutex_unlock(&cgroup_mutex);
+ return -EBUSY;
+ }
+
+ /* last check ! */
+ if (!cgroup_set_subsys_removed(cgrp)) {
mutex_unlock(&cgroup_mutex);
return -EBUSY;
}
@@ -3003,7 +3001,7 @@ static void check_for_release(struct cgr
/* All of these checks rely on RCU to keep the cgroup
* structure alive */
if (cgroup_is_releasable(cgrp) && !atomic_read(&cgrp->count)
- && list_empty(&cgrp->children) && !cgroup_has_css_refs(cgrp)) {
+ && list_empty(&cgrp->children)) {
/* Control Group is currently removeable. If it's not
* already queued for a userspace notification, queue
* it now */
@@ -3020,17 +3018,6 @@ static void check_for_release(struct cgr
}
}
-void __css_put(struct cgroup_subsys_state *css)
-{
- struct cgroup *cgrp = css->cgroup;
- rcu_read_lock();
- if (atomic_dec_and_test(&css->refcnt) && notify_on_release(cgrp)) {
- set_bit(CGRP_RELEASABLE, &cgrp->flags);
- check_for_release(cgrp);
- }
- rcu_read_unlock();
-}
-
/*
* Notify userspace when a cgroup is released, by running the
* configured release agent with the name of the cgroup (path
Index: mmotm-2.6.28-Dec03/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec03.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec03/mm/memcontrol.c
@@ -165,7 +165,6 @@ struct mem_cgroup {
*/
bool use_hierarchy;
unsigned long last_oom_jiffies;
- int obsolete;
atomic_t refcnt;
unsigned int swappiness;
@@ -594,8 +593,14 @@ mem_cgroup_get_first_node(struct mem_cgr
{
struct cgroup *cgroup;
struct mem_cgroup *ret;
- bool obsolete = (root_mem->last_scanned_child &&
- root_mem->last_scanned_child->obsolete);
+ struct mem_cgroup *last_scan = root_mem->last_scanned_child;
+ bool obsolete = false;
+
+ if (last_scan) {
+ if (css_under_removal(&last_scan->css))
+ obsolete = true;
+ } else
+ obsolete = true;
/*
* Scan all children under the mem_cgroup mem
@@ -683,7 +688,7 @@ static int mem_cgroup_hierarchical_recla
next_mem = mem_cgroup_get_first_node(root_mem);
while (next_mem != root_mem) {
- if (next_mem->obsolete) {
+ if (css_under_removal(&next_mem->css)) {
mem_cgroup_put(next_mem);
cgroup_lock();
next_mem = mem_cgroup_get_first_node(root_mem);
@@ -744,14 +749,13 @@ static int __mem_cgroup_try_charge(struc
if (likely(!*memcg)) {
rcu_read_lock();
mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
- if (unlikely(!mem)) {
+ if (unlikely(!mem) || !css_tryget(&mem->css)) {
rcu_read_unlock();
return 0;
}
/*
* For every charge from the cgroup, increment reference count
*/
- css_get(&mem->css);
*memcg = mem;
rcu_read_unlock();
} else {
@@ -1067,6 +1071,7 @@ int mem_cgroup_try_charge_swapin(struct
{
struct mem_cgroup *mem;
swp_entry_t ent;
+ int ret;
if (mem_cgroup_disabled())
return 0;
@@ -1085,10 +1090,18 @@ int mem_cgroup_try_charge_swapin(struct
ent.val = page_private(page);
mem = lookup_swap_cgroup(ent);
- if (!mem || mem->obsolete)
+ /*
+ * Because we can't assume "mem" is alive now, use tryget() and
+ * drop extra count later
+ */
+ if (!mem || !css_tryget(&mem->css))
goto charge_cur_mm;
*ptr = mem;
- return __mem_cgroup_try_charge(NULL, mask, ptr, true);
+ ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
+ /* drop extra count */
+ css_put(&mem->css);
+
+ return ret;
charge_cur_mm:
if (unlikely(!mm))
mm = &init_mm;
@@ -1119,14 +1132,16 @@ int mem_cgroup_cache_charge_swapin(struc
ent.val = page_private(page);
if (do_swap_account) {
mem = lookup_swap_cgroup(ent);
- if (mem && mem->obsolete)
+ if (mem && !css_tryget(&mem->css))
mem = NULL;
if (mem)
mm = NULL;
}
ret = mem_cgroup_charge_common(page, mm, mask,
MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
-
+ /* drop extra ref */
+ if (mem)
+ css_put(&mem->css);
if (!ret && do_swap_account) {
/* avoid double counting */
mem = swap_cgroup_record(ent, NULL);
@@ -1411,11 +1426,10 @@ int mem_cgroup_shrink_usage(struct mm_st
rcu_read_lock();
mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
- if (unlikely(!mem)) {
+ if (unlikely(!mem) || !css_tryget(&mem->css)) {
rcu_read_unlock();
return 0;
}
- css_get(&mem->css);
rcu_read_unlock();
do {
@@ -2058,6 +2072,7 @@ static struct mem_cgroup *mem_cgroup_all
if (mem)
memset(mem, 0, size);
+ atomic_set(&mem->refcnt, 1);
return mem;
}
@@ -2069,8 +2084,8 @@ static struct mem_cgroup *mem_cgroup_all
* the number of reference from swap_cgroup and free mem_cgroup when
* it goes down to 0.
*
- * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and
- * entry which points to this memcg will be ignore at swapin.
+ * When mem_cgroup is destroyed, css_under_removal() is true and entry which
+ * points to this memcg will be ignore at swapin.
*
* Removal of cgroup itself succeeds regardless of refs from swap.
*/
@@ -2079,10 +2094,6 @@ static void mem_cgroup_free(struct mem_c
{
int node;
- if (atomic_read(&mem->refcnt) > 0)
- return;
-
-
for_each_node_state(node, N_POSSIBLE)
free_mem_cgroup_per_zone_info(mem, node);
@@ -2100,8 +2111,7 @@ static void mem_cgroup_get(struct mem_cg
static void mem_cgroup_put(struct mem_cgroup *mem)
{
if (atomic_dec_and_test(&mem->refcnt)) {
- if (!mem->obsolete)
- return;
+ BUG(!css_under_removal(&mem->css));
mem_cgroup_free(mem);
}
}
@@ -2167,14 +2177,14 @@ static void mem_cgroup_pre_destroy(struc
struct cgroup *cont)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
- mem->obsolete = 1;
+ /* dentry's mutex makes this safe. */
mem_cgroup_force_empty(mem, false);
}
static void mem_cgroup_destroy(struct cgroup_subsys *ss,
struct cgroup *cont)
{
- mem_cgroup_free(mem_cgroup_from_cont(cont));
+ mem_cgroup_put(mem_cgroup_from_cont(cont));
}
static int mem_cgroup_populate(struct cgroup_subsys *ss,
Index: mmotm-2.6.28-Dec03/lib/Kconfig.debug
===================================================================
--- mmotm-2.6.28-Dec03.orig/lib/Kconfig.debug
+++ mmotm-2.6.28-Dec03/lib/Kconfig.debug
@@ -353,6 +353,14 @@ config DEBUG_LOCK_ALLOC
spin_lock_init()/mutex_init()/etc., or whether there is any lock
held during task exit.
+config DEBUG_CGROUP_SUBSYS
+ bool "Debugging Cgroup Subsystems"
+ depends on DEBUG_KERNEL && CGROUP
+ help
+ This feature will check cgroup_subsys_state behavior. Currently, this
+ checks reference count management.
+
+
config PROVE_LOCKING
bool "Lock debugging: prove locking correctness"
depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
* [RFC][PATCH 2/4] cgroup ID
2008-12-05 8:26 [RFC][PATCH 0/4] cgroup ID and css refcnt change and memcg hierarchy (2008/12/05) KAMEZAWA Hiroyuki
2008-12-05 8:28 ` [RFC][PATCH 1/4] New css->refcnt implementation KAMEZAWA Hiroyuki
@ 2008-12-05 8:29 ` KAMEZAWA Hiroyuki
2008-12-05 11:11 ` Paul Menage
2008-12-05 8:31 ` [RFC][PATCH 3/4] memcg: hierarchical reclaim KAMEZAWA Hiroyuki
2008-12-05 8:32 ` [RFC][PATCH 4/4] fix oom kill under hierarchy KAMEZAWA Hiroyuki
3 siblings, 1 reply; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-05 8:29 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, balbir, lizf, linux-kernel, menage
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Patch for cgroup ID and hierarchy code.
This patch assigns an ID to each cgroup. It attaches a unique ID to every
cgroup and provides the following functions (a short usage sketch follows the
list).
- cgroup_lookup(id)
  returns the struct cgroup for the given id.
- cgroup_get_next(id, rootid, depth, foundid)
  returns the next cgroup under "root" by scanning the ID bitmap (not by
  tree-walk).
- cgroup_id_tryget()/cgroup_id_put()
  used when a subsystem wants to prevent reuse of an ID.
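For illustration, a minimal sketch (not part of the patch) of how a subsystem
could scan every cgroup under a given root with these functions; "root" is
assumed to be a valid struct cgroup * and "visit()" is a hypothetical callback:
==
        struct cgroup *pos;
        int rootid, depth, nextid, found;

        rootid = cgroup_id(root);
        depth = cgroup_depth(root);
        nextid = 0;                     /* ID 0 is never used, so start here */

        rcu_read_lock();
        while ((pos = cgroup_get_next(nextid, rootid, depth, &found))) {
                visit(pos);             /* pos is only valid under rcu_read_lock() */
                nextid = found + 1;     /* continue just after the cgroup found */
        }
        rcu_read_unlock();
==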
There are several reasons to develop this.
- While trying to implement hierarchy in the memory cgroup, we have to
  implement "walk under hierarchy" code. Right now that consists of
  cgroup_lock plus tree up-down code. Because the memory cgroup has to walk
  the hierarchy in other places as well, with more intelligent processing,
  we'll reuse this "walk" code. But taking cgroup_lock while walking the tree
  can cause deadlocks, so an easier way is helpful.
- SwapCgroup uses an array of pointers to record the owner of each swap entry.
  With an ID, we can reduce each entry to a "short" or an "int". So the ID is
  useful for reducing the space consumed by pointers when the extra lookup
  cost is not a problem.
  (I hear bio-cgroup will use the same kind of...)
Characteristics:
- Each cgroup gets a new ID when created.
- A cgroup ID contains an "ID", a "depth in tree" and a hierarchy code.
- The hierarchy code is an array of the IDs of all ancestors.
- ID 0 is the UNUSED ID.
Consideration:
- I'd like to use "short" for cgroup_id to save space...
- Is MAX_DEPTH too small? (making this depend on a boot option is easy.)
TODO:
- Documentation.
Changelog (v2) -> (v3)
- removed cgroup_id_getref().
- added cgroup_id_tryget().
Changelog (v1) -> (v2):
- Design change: show only ID(integer) to outside of cgroup.c
- moved cgroup ID definition from include/ to kernel/cgroup.c
- struct cgroup_id is freed by RCU.
- changed interface from pointer to "int"
- kill_sb() is handled.
- ID 0 as unused ID.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/cgroup.h | 24 ++++
include/linux/idr.h | 1
kernel/cgroup.c | 285 ++++++++++++++++++++++++++++++++++++++++++++++++-
lib/idr.c | 46 +++++++
4 files changed, 352 insertions(+), 4 deletions(-)
Index: mmotm-2.6.28-Dec03/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Dec03.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Dec03/include/linux/cgroup.h
@@ -22,6 +22,7 @@ struct cgroupfs_root;
struct cgroup_subsys;
struct inode;
struct cgroup;
+struct cgroup_id;
extern int cgroup_init_early(void);
extern int cgroup_init(void);
@@ -56,6 +57,12 @@ struct cgroup_subsys_state {
unsigned long flags;
};
+/*
+ * Cgroup ID for *internal* identification and lookup. For user-land,"path"
+ * of cgroup works well.
+ */
+#define MAX_CGROUP_DEPTH (10)
+
/* bits in struct cgroup_subsys_state flags field */
enum {
CSS_ROOT, /* This CSS is the root of the subsystem */
@@ -172,6 +179,8 @@ struct cgroup {
int pids_length;
/* for css_get/put */
atomic_t css_refcnt;
+ /* for cgroup ID */
+ struct cgroup_id *id;
};
/* A css_set is a structure holding pointers to a set of
@@ -420,6 +429,21 @@ void cgroup_iter_end(struct cgroup *cgrp
int cgroup_scan_tasks(struct cgroup_scanner *scan);
int cgroup_attach_task(struct cgroup *, struct task_struct *);
+/*
+ * For supporting cgroup lookup and hierarchy management.
+ */
+/* An interface for usual lookup */
+struct cgroup *cgroup_lookup(int id);
+/* get next cgroup under tree (for scan) */
+struct cgroup *
+cgroup_get_next(int id, int rootid, int depth, int *foundid);
+/* get id and depth of cgroup */
+int cgroup_id(struct cgroup *cgroup);
+int cgroup_depth(struct cgroup *cgroup);
+/* For delayed freeing of IDs */
+int cgroup_id_tryget(int id);
+void cgroup_id_put(int id);
+
#else /* !CONFIG_CGROUPS */
static inline int cgroup_init_early(void) { return 0; }
Index: mmotm-2.6.28-Dec03/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Dec03.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Dec03/kernel/cgroup.c
@@ -46,7 +46,7 @@
#include <linux/cgroupstats.h>
#include <linux/hash.h>
#include <linux/namei.h>
-
+#include <linux/idr.h>
#include <asm/atomic.h>
static DEFINE_MUTEX(cgroup_mutex);
@@ -547,6 +547,266 @@ void cgroup_unlock(void)
}
/*
+ * CGROUP ID
+ */
+struct cgroup_id {
+ struct cgroup *myself;
+ unsigned int id;
+ unsigned int depth;
+ atomic_t refcnt;
+ struct rcu_head rcu_head;
+ unsigned int hierarchy_code[MAX_CGROUP_DEPTH];
+};
+
+void free_cgroupid_cb(struct rcu_head *head)
+{
+ struct cgroup_id *id;
+
+ id = container_of(head, struct cgroup_id, rcu_head);
+ kfree(id);
+}
+
+void free_cgroupid(struct cgroup_id *id)
+{
+ call_rcu(&id->rcu_head, free_cgroupid_cb);
+}
+
+/*
+ * Cgroup ID and lookup functions.
+ * cgid->myself pointer is safe under rcu_read_lock() because d_put() of
+ * cgroup, which finally frees cgroup pointer, uses rcu_synchronize().
+ */
+static DEFINE_IDR(cgroup_idr);
+DEFINE_SPINLOCK(cgroup_idr_lock);
+
+static int cgrouproot_setup_idr(struct cgroupfs_root *root)
+{
+ struct cgroup_id *newid;
+ int err = -ENOMEM;
+ int myid;
+
+ newid = kzalloc(sizeof(*newid), GFP_KERNEL);
+ if (!newid)
+ goto out;
+ if (!idr_pre_get(&cgroup_idr, GFP_KERNEL))
+ goto free_out;
+
+ spin_lock_irq(&cgroup_idr_lock);
+ err = idr_get_new_above(&cgroup_idr, newid, 1, &myid);
+ spin_unlock_irq(&cgroup_idr_lock);
+
+ /* This one is new idr....*/
+ BUG_ON(err);
+ newid->id = myid;
+ newid->depth = 0;
+ newid->hierarchy_code[0] = myid;
+ atomic_set(&newid->refcnt, 1);
+ rcu_assign_pointer(newid->myself, &root->top_cgroup);
+ root->top_cgroup.id = newid;
+ return 0;
+
+free_out:
+ kfree(newid);
+out:
+ return err;
+}
+
+/*
+ * should be called while "cgrp" is valid.
+ */
+int cgroup_id(struct cgroup *cgrp)
+{
+ struct cgroup_id *cgid = rcu_dereference(cgrp->id);
+
+ if (cgid)
+ return cgid->id;
+ return 0;
+}
+
+int cgroup_depth(struct cgroup *cgrp)
+{
+ struct cgroup_id *cgid = rcu_dereference(cgrp->id);
+
+ if (cgid)
+ return cgid->depth;
+ return 0;
+}
+
+static int cgroup_prepare_id(struct cgroup *parent, struct cgroup_id **id)
+{
+ struct cgroup_id *newid;
+ int myid, error;
+
+ /* check depth */
+ if (parent->id->depth + 1 >= MAX_CGROUP_DEPTH)
+ return -ENOSPC;
+ newid = kzalloc(sizeof(*newid), GFP_KERNEL);
+ if (!newid)
+ return -ENOMEM;
+ /* get id */
+ if (unlikely(!idr_pre_get(&cgroup_idr, GFP_KERNEL))) {
+ error = -ENOMEM;
+ goto err_out;
+ }
+ spin_lock_irq(&cgroup_idr_lock);
+ /* Don't use 0 */
+ error = idr_get_new_above(&cgroup_idr, newid, 1, &myid);
+ spin_unlock_irq(&cgroup_idr_lock);
+ if (error)
+ goto err_out;
+
+ newid->id = myid;
+ atomic_set(&newid->refcnt, 1);
+ *id = newid;
+ return 0;
+err_out:
+ kfree(newid);
+ return error;
+}
+
+
+static void cgroup_id_attach(struct cgroup_id *cgid,
+ struct cgroup *cg, struct cgroup *parent)
+{
+ struct cgroup_id *parent_id = rcu_dereference(parent->id);
+ int i;
+
+ cgid->depth = parent_id->depth + 1;
+ /* Inherit hierarchy code from parent */
+ for (i = 0; i < cgid->depth; i++) {
+ cgid->hierarchy_code[i] =
+ parent_id->hierarchy_code[i];
+ cgid->hierarchy_code[cgid->depth] = cgid->id;
+ }
+ rcu_assign_pointer(cgid->myself, cg);
+ rcu_assign_pointer(cg->id, cgid);
+
+ return;
+}
+
+void cgroup_id_put(int id)
+{
+ struct cgroup_id *cgid;
+ unsigned long flags;
+
+ rcu_read_lock();
+ cgid = idr_find(&cgroup_idr, id);
+ BUG_ON(!cgid);
+ if (atomic_dec_and_test(&cgid->refcnt)) {
+ spin_lock_irqsave(&cgroup_idr_lock, flags);
+ idr_remove(&cgroup_idr, cgid->id);
+ spin_unlock_irq(&cgroup_idr_lock);
+ free_cgroupid(cgid);
+ }
+ rcu_read_unlock();
+}
+
+static void cgroup_id_detach(struct cgroup *cg)
+{
+ struct cgroup_id *id = rcu_dereference(cg->id);
+
+ rcu_assign_pointer(id->myself, NULL);
+ cgroup_id_put(id->id);
+ rcu_assign_pointer(cg->id, NULL);
+}
+/**
+ * cgroup_id_tryget() -- try to get refcnt of ID.
+ * @id: the ID to be kept for a while.
+ *
+ * Increment refcnt of ID and prevent reuse. Useful for subsys which remember
+ * cgroup by ID rather than pointer to struct cgroup (or subsys). Returns
+ * value other than zero at success.
+ */
+int cgroup_id_tryget(int id)
+{
+ struct cgroup_id *cgid;
+ int ret = 0;
+
+ rcu_read_lock();
+ cgid = idr_find(&cgroup_idr, id);
+ if (cgid)
+ ret = atomic_inc_not_zero(&cgid->refcnt);
+ rcu_read_unlock();
+ return ret;
+}
+
+/**
+ * cgroup_lookup - lookup cgroup by id
+ * @id: the id of cgroup to be looked up
+ *
+ * Returns pointer to cgroup if there is valid cgroup with id, NULL if not.
+ * Should be called under rcu_read_lock() or cgroup_lock.
+ * If subsys is not used, returns NULL.
+ */
+
+struct cgroup *cgroup_lookup(int id)
+{
+ struct cgroup *cgrp = NULL;
+ struct cgroup_id *cgid = NULL;
+
+ rcu_read_lock();
+ cgid = idr_find(&cgroup_idr, id);
+
+ if (unlikely(!cgid))
+ goto out;
+
+ cgrp = rcu_dereference(cgid->myself);
+ if (unlikely(!cgrp || cgroup_is_removed(cgrp)))
+ cgrp = NULL;
+out:
+ rcu_read_unlock();
+ return cgrp;
+}
+
+/**
+ * cgroup_get_next - lookup next cgroup under specified hierarchy.
+ * @id: current position of iteration.
+ * @rootid: search tree under this.
+ * @depth: depth of root id.
+ * @foundid: position of found object.
+ *
+ * Search next cgroup under the specified hierarchy. If "cur" is NULL,
+ * start from root cgroup. Called under rcu_read_lock() or cgroup_lock()
+ * is necessary (to access a found cgroup.).
+ * If subsys is not used, returns NULL. If used, it's guaranteed that there is
+ * a used cgroup ID (root).
+ */
+struct cgroup *
+cgroup_get_next(int id, int rootid, int depth, int *foundid)
+{
+ struct cgroup *ret = NULL;
+ struct cgroup_id *tmp;
+ int tmpid;
+ unsigned long flags;
+
+ rcu_read_lock();
+ tmpid = id;
+ while (1) {
+ /* scan next entry from bitmap(tree) */
+ spin_lock_irqsave(&cgroup_idr_lock, flags);
+ tmp = idr_get_next(&cgroup_idr, &tmpid);
+ spin_unlock_irqrestore(&cgroup_idr_lock, flags);
+
+ if (!tmp) {
+ ret = NULL;
+ break;
+ }
+
+ if (tmp->hierarchy_code[depth] == rootid) {
+ ret = rcu_dereference(tmp->myself);
+ /* Sanity check and check hierarchy */
+ if (ret && !cgroup_is_removed(ret))
+ break;
+ }
+ tmpid = tmpid + 1;
+ }
+
+ rcu_read_unlock();
+ *foundid = tmpid;
+ return ret;
+}
+
+/*
* A couple of forward declarations required, due to cyclic reference loop:
* cgroup_mkdir -> cgroup_create -> cgroup_populate_dir ->
* cgroup_add_file -> cgroup_create_file -> cgroup_dir_inode_operations
@@ -1037,6 +1297,13 @@ static int cgroup_get_sb(struct file_sys
mutex_unlock(&inode->i_mutex);
goto drop_new_super;
}
+ /* Setup Cgroup ID for this fs */
+ ret = cgrouproot_setup_idr(root);
+ if (ret) {
+ mutex_unlock(&cgroup_mutex);
+ mutex_unlock(&inode->i_mutex);
+ goto drop_new_super;
+ }
ret = rebind_subsystems(root, root->subsys_bits);
if (ret == -EBUSY) {
@@ -1123,9 +1390,10 @@ static void cgroup_kill_sb(struct super_
list_del(&root->root_list);
root_count--;
-
+ if (root->top_cgroup.id)
+ cgroup_id_detach(&root->top_cgroup);
mutex_unlock(&cgroup_mutex);
-
+ synchronize_rcu();
kfree(root);
kill_litter_super(sb);
}
@@ -2358,11 +2626,18 @@ static long cgroup_create(struct cgroup
int err = 0;
struct cgroup_subsys *ss;
struct super_block *sb = root->sb;
+ struct cgroup_id *cgid = NULL;
cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
if (!cgrp)
return -ENOMEM;
+ err = cgroup_prepare_id(parent, &cgid);
+ if (err) {
+ kfree(cgrp);
+ return err;
+ }
+
/* Grab a reference on the superblock so the hierarchy doesn't
* get deleted on unmount if there are child cgroups. This
* can be done outside cgroup_mutex, since the sb can't
@@ -2402,7 +2677,7 @@ static long cgroup_create(struct cgroup
err = cgroup_populate_dir(cgrp);
/* If err < 0, we have a half-filled directory - oh well ;) */
-
+ cgroup_id_attach(cgid, cgrp, parent);
mutex_unlock(&cgroup_mutex);
mutex_unlock(&cgrp->dentry->d_inode->i_mutex);
@@ -2477,6 +2752,8 @@ static int cgroup_rmdir(struct inode *un
return -EBUSY;
}
+ cgroup_id_detach(cgrp);
+
spin_lock(&release_list_lock);
set_bit(CGRP_REMOVED, &cgrp->flags);
if (!list_empty(&cgrp->release_list))
Index: mmotm-2.6.28-Dec03/include/linux/idr.h
===================================================================
--- mmotm-2.6.28-Dec03.orig/include/linux/idr.h
+++ mmotm-2.6.28-Dec03/include/linux/idr.h
@@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
int idr_for_each(struct idr *idp,
int (*fn)(int id, void *p, void *data), void *data);
+void *idr_get_next(struct idr *idp, int *nextid);
void *idr_replace(struct idr *idp, void *ptr, int id);
void idr_remove(struct idr *idp, int id);
void idr_remove_all(struct idr *idp);
Index: mmotm-2.6.28-Dec03/lib/idr.c
===================================================================
--- mmotm-2.6.28-Dec03.orig/lib/idr.c
+++ mmotm-2.6.28-Dec03/lib/idr.c
@@ -573,6 +573,52 @@ int idr_for_each(struct idr *idp,
EXPORT_SYMBOL(idr_for_each);
/**
+ * idr_get_next - lookup next object of id to given id.
+ * @idp: idr handle
+ * @id: pointer to lookup key
+ *
+ * Returns pointer to registered object with id, which is next number to
+ * given id.
+ */
+
+void *idr_get_next(struct idr *idp, int *nextidp)
+{
+ struct idr_layer *p, *pa[MAX_LEVEL];
+ struct idr_layer **paa = &pa[0];
+ int id = *nextidp;
+ int n, max;
+
+ /* find first ent */
+ n = idp->layers * IDR_BITS;
+ max = 1 << n;
+ p = rcu_dereference(idp->top);
+ if (!p)
+ return NULL;
+
+ while (id < max) {
+ while (n > 0 && p) {
+ n -= IDR_BITS;
+ *paa++ = p;
+ p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
+ }
+
+ if (p) {
+ *nextidp = id;
+ return p;
+ }
+
+ id += 1 << n;
+ while (n < fls(id)) {
+ n += IDR_BITS;
+ p = *--paa;
+ }
+ }
+ return NULL;
+}
+
+
+
+/**
* idr_replace - replace pointer for given id
* @idp: idr handle
* @ptr: pointer you want associated with the id
* [RFC][PATCH 3/4] memcg: hierarchical reclaim
2008-12-05 8:26 [RFC][PATCH 0/4] cgroup ID and css refcnt change and memcg hierarchy (2008/12/05) KAMEZAWA Hiroyuki
2008-12-05 8:28 ` [RFC][PATCH 1/4] New css->refcnt implementation KAMEZAWA Hiroyuki
2008-12-05 8:29 ` [RFC][PATCH 2/4] cgroup ID KAMEZAWA Hiroyuki
@ 2008-12-05 8:31 ` KAMEZAWA Hiroyuki
2008-12-05 8:32 ` [RFC][PATCH 4/4] fix oom kill under hierarchy KAMEZAWA Hiroyuki
3 siblings, 0 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-05 8:31 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, balbir, lizf, linux-kernel, menage
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Implement hierarchical reclaim by cgroup_id.
What changes:
- Page reclaim is no longer done by a tree-walk algorithm.
- mem_cgroup->last_scanned_child is changed to be an ID, not a pointer.
- no cgroup_lock; the walk is done under RCU.
- the scanning order is simply defined by ID order.
  (Scan by round-robin logic; a short worked example follows.)
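As a worked example (hypothetical IDs, assuming use_hierarchy is set): with a
root memcg whose cgroup ID is 3 and children with IDs 4, 7 and 9, successive
calls to mem_cgroup_select_victim() return the memcgs with IDs 3, 4, 7, 9
(skipping any whose css_tryget() fails), then the scan wraps around,
->scan_age is bumped and the walk restarts at 3. mem_cgroup_hierarchical_reclaim()
keeps selecting victims until ->scan_age has advanced by 2, i.e. for roughly
two full passes, or until the group is back under its limit.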
Changelog: v2 -> v3
- fixed use_hierarchy==0 case
Changelog: v1 -> v2
- make use of css_tryget();
- count # of loops rather than remembering position.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
mm/memcontrol.c | 215 +++++++++++++++++++-------------------------------------
1 file changed, 74 insertions(+), 141 deletions(-)
---
Index: mmotm-2.6.28-Dec03/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec03.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec03/mm/memcontrol.c
@@ -157,9 +157,10 @@ struct mem_cgroup {
/*
* While reclaiming in a hiearchy, we cache the last child we
- * reclaimed from. Protected by cgroup_lock()
+ * reclaimed from.
*/
- struct mem_cgroup *last_scanned_child;
+ int last_scanned_child;
+ unsigned long scan_age;
/*
* Should the accounting and control be hierarchical, per subtree?
*/
@@ -525,108 +526,69 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}
-#define mem_cgroup_from_res_counter(counter, member) \
- container_of(counter, struct mem_cgroup, member)
-
-/*
- * This routine finds the DFS walk successor. This routine should be
- * called with cgroup_mutex held
- */
-static struct mem_cgroup *
-mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
+static unsigned int get_swappiness(struct mem_cgroup *memcg)
{
- struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
-
- curr_cgroup = curr->css.cgroup;
- root_cgroup = root_mem->css.cgroup;
-
- if (!list_empty(&curr_cgroup->children)) {
- /*
- * Walk down to children
- */
- mem_cgroup_put(curr);
- cgroup = list_entry(curr_cgroup->children.next,
- struct cgroup, sibling);
- curr = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(curr);
- goto done;
- }
-
-visit_parent:
- if (curr_cgroup == root_cgroup) {
- mem_cgroup_put(curr);
- curr = root_mem;
- mem_cgroup_get(curr);
- goto done;
- }
+ struct cgroup *cgrp = memcg->css.cgroup;
+ unsigned int swappiness;
- /*
- * Goto next sibling
- */
- if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
- mem_cgroup_put(curr);
- cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
- sibling);
- curr = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(curr);
- goto done;
- }
+ /* root ? */
+ if (cgrp->parent == NULL)
+ return vm_swappiness;
- /*
- * Go up to next parent and next parent's sibling if need be
- */
- curr_cgroup = curr_cgroup->parent;
- goto visit_parent;
+ spin_lock(&memcg->reclaim_param_lock);
+ swappiness = memcg->swappiness;
+ spin_unlock(&memcg->reclaim_param_lock);
-done:
- root_mem->last_scanned_child = curr;
- return curr;
+ return swappiness;
}
+#define mem_cgroup_from_res_counter(counter, member) \
+ container_of(counter, struct mem_cgroup, member)
+
/*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
+ * This routine select next memcg by ID. Using RCU and tryget().
+ * No cgroup_mutex is required.
*/
static struct mem_cgroup *
-mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
+mem_cgroup_select_victim(struct mem_cgroup *root_mem)
{
- struct cgroup *cgroup;
+ struct cgroup *cgroup, *root_cgroup;
struct mem_cgroup *ret;
- struct mem_cgroup *last_scan = root_mem->last_scanned_child;
- bool obsolete = false;
+ int nextid, rootid, depth, found;
- if (last_scan) {
- if (css_under_removal(&last_scan->css))
- obsolete = true;
- } else
- obsolete = true;
+ root_cgroup = root_mem->css.cgroup;
+ rootid = cgroup_id(root_cgroup);
+ depth = cgroup_depth(root_cgroup);
+ found = 0;
- /*
- * Scan all children under the mem_cgroup mem
- */
- cgroup_lock();
- if (list_empty(&root_mem->css.cgroup->children)) {
+ rcu_read_lock();
+ if (!root_mem->use_hierarchy) {
+ spin_lock(&root_mem->reclaim_param_lock);
+ root_mem->scan_age++;
+ spin_unlock(&root_mem->reclaim_param_lock);
+ css_get(&root_mem->css);
ret = root_mem;
- goto done;
}
- if (!root_mem->last_scanned_child || obsolete) {
-
- if (obsolete)
- mem_cgroup_put(root_mem->last_scanned_child);
-
- cgroup = list_first_entry(&root_mem->css.cgroup->children,
- struct cgroup, sibling);
- ret = mem_cgroup_from_cont(cgroup);
- mem_cgroup_get(ret);
- } else
- ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
- root_mem);
-
-done:
- root_mem->last_scanned_child = ret;
- cgroup_unlock();
+ while (!ret) {
+ /* ID:0 is not used by cgroup-id */
+ nextid = root_mem->last_scanned_child + 1;
+ cgroup = cgroup_get_next(nextid, rootid, depth, &found);
+ if (cgroup) {
+ spin_lock(&root_mem->reclaim_param_lock);
+ root_mem->last_scanned_child = found;
+ spin_unlock(&root_mem->reclaim_param_lock);
+ ret = mem_cgroup_from_cont(cgroup);
+ if (!css_tryget(&ret->css))
+ ret = NULL;
+ } else {
+ spin_lock(&root_mem->reclaim_param_lock);
+ root_mem->scan_age++;
+ root_mem->last_scanned_child = 0;
+ spin_unlock(&root_mem->reclaim_param_lock);
+ }
+ }
+ rcu_read_unlock();
return ret;
}
@@ -642,67 +604,35 @@ static bool mem_cgroup_check_under_limit
return false;
}
-static unsigned int get_swappiness(struct mem_cgroup *memcg)
-{
- struct cgroup *cgrp = memcg->css.cgroup;
- unsigned int swappiness;
-
- /* root ? */
- if (cgrp->parent == NULL)
- return vm_swappiness;
-
- spin_lock(&memcg->reclaim_param_lock);
- swappiness = memcg->swappiness;
- spin_unlock(&memcg->reclaim_param_lock);
-
- return swappiness;
-}
/*
- * Dance down the hierarchy if needed to reclaim memory. We remember the
- * last child we reclaimed from, so that we don't end up penalizing
- * one child extensively based on its position in the children list.
- *
* root_mem is the original ancestor that we've been reclaim from.
+ * root_mem cannot be freed while walking because there are children.
*/
static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
gfp_t gfp_mask, bool noswap)
{
- struct mem_cgroup *next_mem;
+ struct mem_cgroup *victim;
+ unsigned long start_age;
int ret = 0;
+ int total = 0;
- /*
- * Reclaim unconditionally and don't check for return value.
- * We need to reclaim in the current group and down the tree.
- * One might think about checking for children before reclaiming,
- * but there might be left over accounting, even after children
- * have left.
- */
- ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
- get_swappiness(root_mem));
- if (mem_cgroup_check_under_limit(root_mem))
- return 0;
- if (!root_mem->use_hierarchy)
- return ret;
-
- next_mem = mem_cgroup_get_first_node(root_mem);
-
- while (next_mem != root_mem) {
- if (css_under_removal(&next_mem->css)) {
- mem_cgroup_put(next_mem);
- cgroup_lock();
- next_mem = mem_cgroup_get_first_node(root_mem);
- cgroup_unlock();
- continue;
- }
- ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
- get_swappiness(next_mem));
+ start_age = root_mem->scan_age;
+ /* allows visit twice (under this memcg, ->scan_age is shared.) */
+ while (time_after((start_age + 2UL), root_mem->scan_age)) {
+ victim = mem_cgroup_select_victim(root_mem);
+ ret = try_to_free_mem_cgroup_pages(victim,
+ gfp_mask, noswap, get_swappiness(victim));
+ css_put(&victim->css);
if (mem_cgroup_check_under_limit(root_mem))
- return 0;
- cgroup_lock();
- next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
- cgroup_unlock();
+ return 1;
+ total += ret;
}
+
+ ret = total;
+ if (mem_cgroup_check_under_limit(root_mem))
+ ret = 1;
+
return ret;
}
@@ -790,7 +720,6 @@ static int __mem_cgroup_try_charge(struc
ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
noswap);
-
/*
* try_to_free_mem_cgroup_pages() might not give us a full
* picture of reclaim. Some pages are reclaimed and might be
@@ -804,7 +733,8 @@ static int __mem_cgroup_try_charge(struc
if (!nr_retries--) {
if (oom) {
- mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
+ mem_cgroup_out_of_memory(mem_over_limit,
+ gfp_mask);
mem_over_limit->last_oom_jiffies = jiffies;
}
goto nomem;
@@ -2111,7 +2041,7 @@ static void mem_cgroup_get(struct mem_cg
static void mem_cgroup_put(struct mem_cgroup *mem)
{
if (atomic_dec_and_test(&mem->refcnt)) {
- BUG(!css_under_removal(&mem->css));
+ BUG_ON(!css_under_removal(&mem->css));
mem_cgroup_free(mem);
}
}
@@ -2159,7 +2089,8 @@ mem_cgroup_create(struct cgroup_subsys *
res_counter_init(&mem->memsw, NULL);
}
mem_cgroup_set_inactive_ratio(mem);
- mem->last_scanned_child = NULL;
+ mem->last_scanned_child = 0;
+ mem->scan_age = 0;
spin_lock_init(&mem->reclaim_param_lock);
if (parent)
* [RFC][PATCH 4/4] fix oom kill under hierarchy
2008-12-05 8:26 [RFC][PATCH 0/4] cgroup ID and css refcnt change and memcg hierarchy (2008/12/05) KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2008-12-05 8:31 ` [RFC][PATCH 3/4] memcg: hierarchical reclaim KAMEZAWA Hiroyuki
@ 2008-12-05 8:32 ` KAMEZAWA Hiroyuki
3 siblings, 0 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-05 8:32 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, balbir, lizf, linux-kernel, menage
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
The current oom-kill by memcg cannot handle hierarchy in a correct way.
This is a trial, please review.
After this:
- the OOM killer checks the hierarchy and can kill a bad process anywhere
  under it. (a worked example of the ancestor check this relies on follows)
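The hierarchy check relies on the hierarchy_code array from patch 2. As a
worked example (hypothetical IDs), the following sketch shows why
cgroup_check_ancestor() is a single array lookup rather than a tree walk:
==
/*
 * hierarchy_code records the IDs of a cgroup and all of its ancestors,
 * indexed by depth, e.g.:
 *
 *   root  : id 1, depth 0, hierarchy_code = { 1 }
 *   /A    : id 3, depth 1, hierarchy_code = { 1, 3 }
 *   /A/B  : id 7, depth 2, hierarchy_code = { 1, 3, 7 }
 *
 * "Is cur under ans?" then needs no tree walk at all:
 */
static inline int is_under(struct cgroup_id *cur, struct cgroup_id *ans)
{
        /* for cur = B, ans = A: hierarchy_code[1] == 3 -> yes */
        return cur->hierarchy_code[ans->depth] == ans->id;
}
==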
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/cgroup.h | 1
include/linux/memcontrol.h | 7 +++--
kernel/cgroup.c | 14 ++++++++++
mm/memcontrol.c | 59 ++++++++++++++++++++++++++++++++++++++++++---
mm/oom_kill.c | 4 +--
5 files changed, 77 insertions(+), 8 deletions(-)
Index: mmotm-2.6.28-Dec03/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec03.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec03/mm/memcontrol.c
@@ -380,12 +380,38 @@ void mem_cgroup_move_lists(struct page *
mem_cgroup_add_lru_list(page, to);
}
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
+static int mm_under_cgroup(struct mm_struct *mm, struct mem_cgroup *mcg)
+{
+ struct cgroup *cg, *check;
+ int ret = 0;
+
+ if (!mm)
+ return 0;
+
+ rcu_read_lock();
+
+ VM_BUG_ON(css_under_removal(&mcg->css));
+
+ cg = task_cgroup(rcu_dereference(mm->owner), mem_cgroup_subsys_id);
+ check = mcg->css.cgroup;
+
+ if (!mcg->use_hierarchy)
+ ret = (cg == check);
+ else
+ ret = cgroup_check_ancestor(cg, check);
+ rcu_read_unlock();
+
+ return ret;
+}
+
+
+int task_in_mem_cgroup_hierarchy(struct task_struct *task,
+ struct mem_cgroup *mem)
{
int ret;
task_lock(task);
- ret = task->mm && mm_match_cgroup(task->mm, mem);
+ ret = mm_under_cgroup(task->mm, mem);
task_unlock(task);
return ret;
}
@@ -636,6 +662,33 @@ static int mem_cgroup_hierarchical_recla
return ret;
}
+void update_oom_jiffies(struct mem_cgroup *mem)
+{
+ struct cgroup *cgroup, *rootcg;
+ struct mem_cgroup *tmp;
+ int rootid, nextid, depth, found;
+
+ if (!mem->use_hierarchy) {
+ mem->last_oom_jiffies = jiffies;
+ return;
+ }
+ rootcg = mem->css.cgroup;
+ rootid = cgroup_id(rootcg);
+ depth = cgroup_depth(rootcg);
+ nextid = 0;
+ rcu_read_lock();
+ while (1) {
+ cgroup = cgroup_get_next(nextid, rootid, depth, &found);
+
+ if (!cgroup)
+ break;
+ tmp = mem_cgroup_from_cont(cgroup);
+ tmp->last_oom_jiffies = jiffies;
+ nextid = found + 1;
+ }
+ rcu_read_unlock();
+}
+
bool mem_cgroup_oom_called(struct task_struct *task)
{
bool ret = false;
@@ -735,7 +788,7 @@ static int __mem_cgroup_try_charge(struc
if (oom) {
mem_cgroup_out_of_memory(mem_over_limit,
gfp_mask);
- mem_over_limit->last_oom_jiffies = jiffies;
+ update_oom_jiffies(mem_over_limit);
}
goto nomem;
}
Index: mmotm-2.6.28-Dec03/mm/oom_kill.c
===================================================================
--- mmotm-2.6.28-Dec03.orig/mm/oom_kill.c
+++ mmotm-2.6.28-Dec03/mm/oom_kill.c
@@ -220,7 +220,7 @@ static struct task_struct *select_bad_pr
/* skip the init task */
if (is_global_init(p))
continue;
- if (mem && !task_in_mem_cgroup(p, mem))
+ if (mem && !task_in_mem_cgroup_hierarchy(p, mem))
continue;
/*
@@ -292,7 +292,7 @@ static void dump_tasks(const struct mem_
*/
if (!p->mm)
continue;
- if (mem && !task_in_mem_cgroup(p, mem))
+ if (mem && !task_in_mem_cgroup_hierarchy(p, mem))
continue;
if (!thread_group_leader(p))
continue;
Index: mmotm-2.6.28-Dec03/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Dec03.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Dec03/include/linux/cgroup.h
@@ -440,6 +440,7 @@ cgroup_get_next(int id, int rootid, int
/* get id and depth of cgroup */
int cgroup_id(struct cgroup *cgroup);
int cgroup_depth(struct cgroup *cgroup);
+int cgroup_check_ancestor(struct cgroup *cur, struct cgroup *ans);
/* For delayed freeing of IDs */
int cgroup_id_tryget(int id);
void cgroup_id_put(int id);
Index: mmotm-2.6.28-Dec03/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Dec03.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Dec03/kernel/cgroup.c
@@ -805,6 +805,20 @@ cgroup_get_next(int id, int rootid, int
*foundid = tmpid;
return ret;
}
+/*
+ * called under RCU. (or some routine which prevent freeing cgroup)
+ */
+int cgroup_check_ancestor(struct cgroup *cur, struct cgroup *ans)
+{
+ struct cgroup_id *cid = rcu_dereference(cur->id);
+ struct cgroup_id *ansid = rcu_dereference(ans->id);
+
+ if (!cid || !ansid)
+ return 0;
+ if (cid->hierarchy_code[ansid->depth] == ansid->id)
+ return 1;
+ return 0;
+}
/*
* A couple of forward declarations required, due to cyclic reference loop:
Index: mmotm-2.6.28-Dec03/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Dec03.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Dec03/include/linux/memcontrol.h
@@ -67,7 +67,8 @@ extern unsigned long mem_cgroup_isolate_
struct mem_cgroup *mem_cont,
int active, int file);
extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
+int task_in_mem_cgroup_hierarchy(struct task_struct *task,
+ struct mem_cgroup *mem);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -198,8 +199,8 @@ static inline int mm_match_cgroup(struct
return 1;
}
-static inline int task_in_mem_cgroup(struct task_struct *task,
- const struct mem_cgroup *mem)
+static inline int task_in_mem_cgroup_hierarchy(struct task_struct *task,
+ struct mem_cgroup *mem)
{
return 1;
}
* Re: [RFC][PATCH 1/4] New css->refcnt implementation.
2008-12-05 8:28 ` [RFC][PATCH 1/4] New css->refcnt implementation KAMEZAWA Hiroyuki
@ 2008-12-05 9:39 ` Paul Menage
2008-12-05 11:24 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 11+ messages in thread
From: Paul Menage @ 2008-12-05 9:39 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, balbir, lizf, linux-kernel
On Fri, Dec 5, 2008 at 12:28 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
>
> Now, the last check of refcnt is done after pre_destroy(), so rmdir() can fail
> after pre_destroy(). But memcg set mem->obsolete to be 1 at pre_destroy.
> This is a bug. So, removing memcg->obsolete flag is sane.
>
> But there is no interface to confirm "css" is oboslete or not. I.e. there is
> no flag to check whether we can increase css_refcnt or not!
The basic rule is that you're only supposed to increment the css
refcount if you have:
- a reference to a task in the cgroup (that is pinned via task_lock()
so it can't be moved away)
or
- an existing reference to the css
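For instance, a minimal sketch of the first case (hypothetical subsystem code;
"p" is a struct task_struct * and "my_subsys_id" is just a placeholder):
==
        struct cgroup_subsys_state *css;

        task_lock(p);                                   /* p can't change cgroups now */
        css = task_subsys_state(p, my_subsys_id);       /* stable while p is locked */
        css_get(css);                                   /* take our own reference */
        task_unlock(p);
        /* ... use css ..., then css_put(css) when done */
==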
>
> This patch changes this css->refcnt rule as following
> - css->refcnt is no longer private counter, just point to
> css->cgroup->css_refcnt.
The reason I didn't do this is that I'd like to keep the ref counts
separate to make it possible to add/remove subsystems from a hiearchy
- if they're all mingled into a single refcount, it's impossible to
tell if a particular subsystem has refcounts.
>
> - css_put() is changed not to call notify_on_release().
>
> From documentation, notify_on_release() is called when there is no
> tasks/children in cgroup. On implementation, notify_on_release is
> not called if css->refcnt > 0.
The documentation is a little inaccurate - it's called when the cgroup
is removable. In the original cpusets this implied that there were no
tasks or children; in cgroups, a refcount can keep the group alive
too, so it's right to not call notify_on_release if there are
remaining refcounts.
> This is problem. Memcg has css->refcnt by each page even when
> there are no tasks. Release handler will be never called.
Right, because it can't remove the dir if there are still refcounts.
Early in the development of cgroups I did have a css refcount scheme
similar to what you have, with tryget, etc, but still with separate
refcounts for each subsystem; I got rid of it since it seemed more
complicated than we needed at the time. But I'll see if I can dig it
up.
Paul
* Re: [RFC][PATCH 2/4] cgroup ID
2008-12-05 8:29 ` [RFC][PATCH 2/4] cgroup ID KAMEZAWA Hiroyuki
@ 2008-12-05 11:11 ` Paul Menage
2008-12-05 11:50 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 11+ messages in thread
From: Paul Menage @ 2008-12-05 11:11 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, balbir, lizf, linux-kernel
Hi Kamezawa,
I definitely agree with the idea of being able to traverse the cgroup
hierarchy without doing a cgroup_lock() and I've included some
comments below. But having said that, maybe there's a simpler
solution?
A while ago I posted some patches that added a per-hierarchy lock
which could be taken to prevent creation or destruction of cgroups in
a given hierarchy; it was lighter-weight than the full cgroup_lock().
Is that sufficient to avoid the deadlock that you mentioned in your
patch description?
The idea of having a short id for each cgroup to save space in the
swap cgroup sounds sensible - but I'm not sure that we need the RCU
support to make the id persist beyond the lifetime of the cgroup
itself.
On Fri, Dec 5, 2008 at 12:29 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> +/*
> + * Cgroup ID for *internal* identification and lookup. For user-land,"path"
> + * of cgroup works well.
> + */
This comment seems misplaced and possibly unnecessary. Should it be
with the struct cgroup_id definition in cgroup.c?
>
> +/*
> + * For supporting cgroup lookup and hierarchy management.
> + */
A lot more commenting would be useful here.
> +/* An interface for usual lookup */
> +struct cgroup *cgroup_lookup(int id);
> +/* get next cgroup under tree (for scan) */
> +struct cgroup *
> +cgroup_get_next(int id, int rootid, int depth, int *foundid);
> +/* get id and depth of cgroup */
> +int cgroup_id(struct cgroup *cgroup);
> +int cgroup_depth(struct cgroup *cgroup);
> +/* For delayed freeing of IDs */
> +int cgroup_id_tryget(int id);
> +void cgroup_id_put(int id);
> +
> #else /* !CONFIG_CGROUPS */
>
> /*
> + * CGROUP ID
> + */
More comments needed about the exact semantics of these fields.
> +struct cgroup_id {
> + struct cgroup *myself;
Can you call this cgroup for consistency with other struct cgroup pointers?
> + unsigned int id;
> + unsigned int depth;
> + atomic_t refcnt;
> + struct rcu_head rcu_head;
> + unsigned int hierarchy_code[MAX_CGROUP_DEPTH];
How about "stack" for this array?
> +};
> +
> +void free_cgroupid_cb(struct rcu_head *head)
> +{
> + struct cgroup_id *id;
> +
> + id = container_of(head, struct cgroup_id, rcu_head);
> + kfree(id);
> +}
> +
> +void free_cgroupid(struct cgroup_id *id)
> +{
> + call_rcu(&id->rcu_head, free_cgroupid_cb);
> +}
> +
Rather than having a separate RCU callback for the cgroup_id
structure, how about marking it as "dead" when you unlink the cgroup
from the tree, and freeing it in the cgroup_diput() callback at the
same time the struct cgroup is freed? Or is the issue that you need
the id to persist longer than the cgroup itself, to prevent re-use?
> +static DEFINE_IDR(cgroup_idr);
> +DEFINE_SPINLOCK(cgroup_idr_lock);
Any reason to not have a separate idr and idr_lock per hierarchy?
> +
> +static int cgrouproot_setup_idr(struct cgroupfs_root *root)
> +{
> + struct cgroup_id *newid;
> + int err = -ENOMEM;
> + int myid;
> +
> + newid = kzalloc(sizeof(*newid), GFP_KERNEL);
> + if (!newid)
> + goto out;
> + if (!idr_pre_get(&cgroup_idr, GFP_KERNEL))
> + goto free_out;
> +
> + spin_lock_irq(&cgroup_idr_lock);
> + err = idr_get_new_above(&cgroup_idr, newid, 1, &myid);
> + spin_unlock_irq(&cgroup_idr_lock);
> +
> + /* This one is new idr....*/
> + BUG_ON(err);
There's really no way this can fail?
> +/*
> + * should be called while "cgrp" is valid.
> + */
Can you be more specific here? Clearly calling a function with a
pointer to an object that might have been freed is a bad idea; if
that's all you mean then I don't think it needs to be called out in a
comment.
> +static int cgroup_prepare_id(struct cgroup *parent, struct cgroup_id **id)
> +{
> + struct cgroup_id *newid;
> + int myid, error;
> +
> + /* check depth */
> + if (parent->id->depth + 1 >= MAX_CGROUP_DEPTH)
> + return -ENOSPC;
> + newid = kzalloc(sizeof(*newid), GFP_KERNEL);
> + if (!newid)
> + return -ENOMEM;
> + /* get id */
> + if (unlikely(!idr_pre_get(&cgroup_idr, GFP_KERNEL))) {
> + error = -ENOMEM;
> + goto err_out;
> + }
> + spin_lock_irq(&cgroup_idr_lock);
> + /* Don't use 0 */
> + error = idr_get_new_above(&cgroup_idr, newid, 1, &myid);
> + spin_unlock_irq(&cgroup_idr_lock);
> + if (error)
> + goto err_out;
This code is pretty similar to a big chunk of cgrouproot_setup_idr() -
can they share the common code?
> +static void cgroup_id_attach(struct cgroup_id *cgid,
> + struct cgroup *cg, struct cgroup *parent)
> +{
> + struct cgroup_id *parent_id = rcu_dereference(parent->id);
It doesn't seem as though it should be necessary to rcu_dereference()
parent->id - parent can't be going away in this case.
> + int i;
> +
> + cgid->depth = parent_id->depth + 1;
> + /* Inherit hierarchy code from parent */
> + for (i = 0; i < cgid->depth; i++) {
> + cgid->hierarchy_code[i] =
> + parent_id->hierarchy_code[i];
> + cgid->hierarchy_code[cgid->depth] = cgid->id;
I think this line is supposed to be outside the for() loop.
Paul
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC][PATCH 1/4] New css->refcnt implementation.
2008-12-05 9:39 ` Paul Menage
@ 2008-12-05 11:24 ` KAMEZAWA Hiroyuki
2008-12-10 9:00 ` Paul Menage
0 siblings, 1 reply; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-05 11:24 UTC (permalink / raw)
To: Paul Menage
Cc: KAMEZAWA Hiroyuki, linux-mm, nishimura, balbir, lizf, linux-kernel
Thank you for the comments.
Paul Menage said:
> On Fri, Dec 5, 2008 at 12:28 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
>>
>> Now, the last check of the refcnt is done after pre_destroy(), so rmdir()
>> can fail after pre_destroy(). But memcg sets mem->obsolete to 1 at
>> pre_destroy(). This is a bug, so removing the memcg->obsolete flag is sane.
>>
>> But there is no interface to confirm whether a "css" is obsolete or not,
>> i.e. there is no flag to check whether we can still increase the css refcnt!
>
> The basic rule is that you're only supposed to increment the css
> refcount if you have:
>
> - a reference to a task in the cgroup (that is pinned via task_lock()
> so it can't be moved away)
> or
> - an existing reference to the css
>
My problem is that css_get() can still be called after pre_destroy(), even
once the css's refcnt has dropped to 0.
>>
>> This patch changes the css->refcnt rule as follows:
>> - css->refcnt is no longer a private counter; it just points to
>> css->cgroup->css_refcnt.
>
> The reason I didn't do this is that I'd like to keep the ref counts
> separate to make it possible to add/remove subsystems from a hierarchy
> - if they're all mingled into a single refcount, it's impossible to
> tell if a particular subsystem has refcounts.
>
That's not a problem; please look at memcg, which keeps its own refcnt,
and the memcg state is not destroyed at destroy() now.
What I want is an atomic check for "Can I still access css->cgroup?"
I once tried something like this:
--
	rcu_read_lock();
	css_get(&memcg->css);
	if (cgroup_is_removed(memcg->css.cgroup)) {
		css_put(&memcg->css);
		rcu_read_unlock();
		return 0;
	}
	rcu_read_unlock();
--
But this does not seem to work.
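What I would really like is a tryget-style primitive so that the check and
the get happen as one atomic step; roughly like this (just a sketch;
css_tryget() here is the helper proposed in patch 1/4, not an existing
upstream API):
==
	rcu_read_lock();
	if (!css_tryget(&memcg->css)) {
		/* the cgroup is already on its way out; give up */
		rcu_read_unlock();
		return 0;
	}
	rcu_read_unlock();
	/* memcg is pinned from here on; release with css_put(&memcg->css) */
==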
>>
>> - css_put() is changed not to call notify_on_release().
>>
>> From the documentation, notify_on_release() is called when there are
>> no tasks/children in the cgroup. In the implementation, notify_on_release
>> is not called if css->refcnt > 0.
>
> The documentation is a little inaccurate - it's called when the cgroup
> is removable. In the original cpusets this implied that there were no
> tasks or children; in cgroups, a refcount can keep the group alive
> too, so it's right to not call notify_on_release if there are
> remaining refcounts.
>
>> This is a problem. Memcg holds css->refcnt for each page even when
>> there are no tasks, so the release handler will never be called.
>
> Right, because it can't remove the dir if there are still refcounts.
>
> Early in the development of cgroups I did have a css refcount scheme
> similar to what you have, with tryget, etc, but still with separate
> refcounts for each subsystem; I got rid of it since it seemed more
> complicated than we needed at the time. But I'll see if I can dig it
> up.
O.K.
I won't do extra work on this "notify handler" for a while,
but please note that it is broken now.
Hmm... removing the memcg->obsolete flag is still difficult, because
I can't guarantee that memcg->css.cgroup is a valid pointer.
Can I add a CSS_REMOVED flag (set after the CGROUP_REMOVED flag)?
Then I should be able to find some workaround.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC][PATCH 2/4] cgroup ID
2008-12-05 11:11 ` Paul Menage
@ 2008-12-05 11:50 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-05 11:50 UTC (permalink / raw)
To: Paul Menage
Cc: KAMEZAWA Hiroyuki, linux-mm, nishimura, balbir, lizf, linux-kernel
Paul Menage said:
> Hi Kamezawa,
>
> I definitely agree with the idea of being able to traverse the cgroup
> hierarchy without doing a cgroup_lock() and I've included some
> comments below. But having said that, maybe there's a simpler
> solution?
>
> A while ago I posted some patches that added a per-hierarchy lock
> which could be taken to prevent creation or destruction of cgroups in
> a given hierarchy; it was lighter-weight than the full cgroup_lock().
> Is that sufficient to avoid the deadlock that you mentioned in your
> patch description?
>
Yes, maybe. I just think hierarchy-walk support at the cgroup level is the
proper way; the current cgroup_lock() is too coarse.
One concern I have is that an ID is much more robust for scanning than a
list, which can be modified under us. Can you design a hierarchy walk whose
caller can sleep between steps? Like the following:
==
next = cgroup_get_next(cur, root);
.......
cur = next;
sleep;
continue from next
==
memcg's code currently remembers "next" by holding its own memcg refcount,
but that is an unclear way to do it.
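To make the question concrete, here is a sketch of the kind of walk I have
in mind, using the cgroup_get_next()/cgroup_id()/cgroup_depth() interfaces
from this patch (the scan callback is just a placeholder; error handling is
omitted):
==
	int rootid = cgroup_id(root);
	int depth = cgroup_depth(root);
	int nextid = 0;
	int found;
	struct cgroup *cg;

	while (1) {
		rcu_read_lock();
		/* ID 0 is never used, so start the scan at nextid + 1 */
		cg = cgroup_get_next(nextid + 1, rootid, depth, &found);
		if (!cg) {
			rcu_read_unlock();
			break;
		}
		nextid = found;	/* position survives as a plain integer */
		/*
		 * Pin the subsystem state here (e.g. with css_tryget())
		 * if cg will be used after dropping rcu_read_lock().
		 */
		rcu_read_unlock();

		do_scan_one(cg);	/* placeholder callback; may sleep */
	}
==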
Another concern is that using css_get() to do the above would block rmdir()
of the cgroup for what looks to the user like an "unknown reason". I don't
want rmdir(next) to return -EBUSY merely because someone remembered "next"
via css_get().
> The idea of having a short id for each cgroup to save space in the
> swap cgroup sounds sensible - but I'm not sure that we need the RCU
> support to make the id persist beyond the lifetime of the cgroup
> itself.
>
Ah, maybe my explanation was not enough.
Right now I don't clear the swap_cgroup entries at pre_destroy(), because
scanning all swap_cgroup entries is heavy... so the ID stays recorded even
after the cgroup has been removed.
If memcg scanned all swap_cgroup entries at destroy(), it wouldn't need
this extended ID lifetime. Hmm... what is difficult here is that the cost
of that scan depends on the size of swap.
> On Fri, Dec 5, 2008 at 12:29 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>> +/*
>> + * Cgroup ID for *internal* identification and lookup. For
>> user-land,"path"
>> + * of cgroup works well.
>> + */
>
> This comment seems misplaced and possibly unnecessary. Should it be
> with the struct cgroup_id definition in cgroup.c?
>
yes...
>>
>> +/*
>> + * For supporting cgroup lookup and hierarchy management.
>> + */
>
> A lot more commenting would be useful here.
>
will do.
>> +/* An interface for usual lookup */
>> +struct cgroup *cgroup_lookup(int id);
>> +/* get next cgroup under tree (for scan) */
>> +struct cgroup *
>> +cgroup_get_next(int id, int rootid, int depth, int *foundid);
>> +/* get id and depth of cgroup */
>> +int cgroup_id(struct cgroup *cgroup);
>> +int cgroup_depth(struct cgroup *cgroup);
>> +/* For delayed freeing of IDs */
>> +int cgroup_id_tryget(int id);
>> +void cgroup_id_put(int id);
>> +
>> #else /* !CONFIG_CGROUPS */
>>
>> /*
>> + * CGROUP ID
>> + */
>
> More comments needed about the exact semantics of these fields.
>
Sure, I will add.
>> +struct cgroup_id {
>> + struct cgroup *myself;
>
> Can you call this cgroup for consistency with other struct cgroup
> pointers?
>
Ok, rename that.
>> + unsigned int id;
>> + unsigned int depth;
>> + atomic_t refcnt;
>> + struct rcu_head rcu_head;
>> + unsigned int hierarchy_code[MAX_CGROUP_DEPTH];
>
> How about "stack" for this array?
>
Ah, ok. hierarchy_code is too long...
>> +};
>> +
>> +void free_cgroupid_cb(struct rcu_head *head)
>> +{
>> + struct cgroup_id *id;
>> +
>> + id = container_of(head, struct cgroup_id, rcu_head);
>> + kfree(id);
>> +}
>> +
>> +void free_cgroupid(struct cgroup_id *id)
>> +{
>> + call_rcu(&id->rcu_head, free_cgroupid_cb);
>> +}
>> +
>
> Rather than having a separate RCU callback for the cgroup_id
> structure, how about marking it as "dead" when you unlink the cgroup
> from the tree, and freeing it in the cgroup_diput() callback at the
> same time the struct cgroup is freed? Or is the issue that you need
> the id to persist longer than the cgroup itself, to prevent re-use?
>
Yes... for swap_cgroup, I'd like to prevent reuse.
But it depends on the design choice: if memcg cleared all records about
swap at destroy(), this ID could be freed along with the cgroup.
>> +static DEFINE_IDR(cgroup_idr);
>> +DEFINE_SPINLOCK(cgroup_idr_lock);
>
> Any reason to not have a separate idr and idr_lock per hierarchy?
>
That's because checking whether the cgroup_idr itself is still alive makes
the whole code complex (because of the persistence of IDs).
I tried putting this into the cgroupfs_root struct and... found it got
complicated.
>> +
>> +static int cgrouproot_setup_idr(struct cgroupfs_root *root)
>> +{
>> + struct cgroup_id *newid;
>> + int err = -ENOMEM;
>> + int myid;
>> +
>> + newid = kzalloc(sizeof(*newid), GFP_KERNEL);
>> + if (!newid)
>> + goto out;
>> + if (!idr_pre_get(&cgroup_idr, GFP_KERNEL))
>> + goto free_out;
>> +
>> + spin_lock_irq(&cgroup_idr_lock);
>> + err = idr_get_new_above(&cgroup_idr, newid, 1, &myid);
>> + spin_unlock_irq(&cgroup_idr_lock);
>> +
>> + /* This one is new idr....*/
>> + BUG_ON(err);
>
> There's really no way this can fail?
>
I think it can't fail here, unless the id count goes over INT_MAX.
>> +/*
>> + * should be called while "cgrp" is valid.
>> + */
>
> Can you be more specific here? Clearly calling a function with a
> pointer to an object that might have been freed is a bad idea; if
> that's all you mean then I don't think it needs to be called out in a
> comment.
>
Hmm, "Should be called when !cgroup_is_removed()" ?
But removing above comment is maybe sane...
>> +static int cgroup_prepare_id(struct cgroup *parent, struct cgroup_id
>> **id)
>> +{
>> + struct cgroup_id *newid;
>> + int myid, error;
>> +
>> + /* check depth */
>> + if (parent->id->depth + 1 >= MAX_CGROUP_DEPTH)
>> + return -ENOSPC;
>> + newid = kzalloc(sizeof(*newid), GFP_KERNEL);
>> + if (!newid)
>> + return -ENOMEM;
>> + /* get id */
>> + if (unlikely(!idr_pre_get(&cgroup_idr, GFP_KERNEL))) {
>> + error = -ENOMEM;
>> + goto err_out;
>> + }
>> + spin_lock_irq(&cgroup_idr_lock);
>> + /* Don't use 0 */
>> + error = idr_get_new_above(&cgroup_idr, newid, 1, &myid);
>> + spin_unlock_irq(&cgroup_idr_lock);
>> + if (error)
>> + goto err_out;
>
> This code is pretty similar to a big chunk of cgrouproot_setup_idr() -
> can they share the common code?
>
Yes, I'll try.
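As a rough idea (a sketch only; the helper name cgroup_id_alloc() is
hypothetical), both cgrouproot_setup_idr() and cgroup_prepare_id() could
funnel through something like this, keeping the depth check and the
root-specific setup in the callers:
==
/* Allocate a cgroup_id and an idr slot for it; returns 0 or -errno. */
static int cgroup_id_alloc(struct cgroup_id **idp)
{
	struct cgroup_id *newid;
	int myid, error;

	newid = kzalloc(sizeof(*newid), GFP_KERNEL);
	if (!newid)
		return -ENOMEM;

	if (!idr_pre_get(&cgroup_idr, GFP_KERNEL)) {
		kfree(newid);
		return -ENOMEM;
	}

	spin_lock_irq(&cgroup_idr_lock);
	/* Don't use 0 */
	error = idr_get_new_above(&cgroup_idr, newid, 1, &myid);
	spin_unlock_irq(&cgroup_idr_lock);
	if (error) {
		kfree(newid);
		return error;
	}

	newid->id = myid;
	*idp = newid;
	return 0;
}
==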
>> +static void cgroup_id_attach(struct cgroup_id *cgid,
>> + struct cgroup *cg, struct cgroup *parent)
>> +{
>> + struct cgroup_id *parent_id = rcu_dereference(parent->id);
>
> It doesn't seem as though it should be necessary to rcu_dereference()
> parent->id - parent can't be going away in this case.
>
yes. will fix.
>> + int i;
>> +
>> + cgid->depth = parent_id->depth + 1;
>> + /* Inherit hierarchy code from parent */
>> + for (i = 0; i < cgid->depth; i++) {
>> + cgid->hierarchy_code[i] =
>> + parent_id->hierarchy_code[i];
>> + cgid->hierarchy_code[cgid->depth] = cgid->id;
>
> I think this line is supposed to be outside the for() loop.
>
yes...
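For reference, a sketch of the corrected inheritance with the assignment
moved out of the loop (same names as in the current patch):
==
	cgid->depth = parent_id->depth + 1;
	/* inherit the ancestor chain (including the parent itself)... */
	for (i = 0; i < cgid->depth; i++)
		cgid->hierarchy_code[i] = parent_id->hierarchy_code[i];
	/* ...then record this cgroup as the last element */
	cgid->hierarchy_code[cgid->depth] = cgid->id;
==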
Thank you for review.
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC][PATCH 1/4] New css->refcnt implementation.
2008-12-05 11:24 ` KAMEZAWA Hiroyuki
@ 2008-12-10 9:00 ` Paul Menage
2008-12-10 10:23 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 11+ messages in thread
From: Paul Menage @ 2008-12-10 9:00 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura, balbir, lizf, linux-kernel
On Fri, Dec 5, 2008 at 3:24 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> The basic rule is that you're only supposed to increment the css
>> refcount if you have:
>>
>> - a reference to a task in the cgroup (that is pinned via task_lock()
>> so it can't be moved away)
>> or
>> - an existing reference to the css
>>
> My problem is that css_get() can still be called after pre_destroy(), even
> once the css's refcnt has dropped to 0.
But where are you getting the reference from in order to do css_get()?
Which call in mem cgroup are you concerned about?
Paul
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC][PATCH 1/4] New css->refcnt implementation.
2008-12-10 9:00 ` Paul Menage
@ 2008-12-10 10:23 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-10 10:23 UTC (permalink / raw)
To: Paul Menage
Cc: KAMEZAWA Hiroyuki, linux-mm, nishimura, balbir, lizf, linux-kernel
Paul Menage said:
> On Fri, Dec 5, 2008 at 3:24 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>> The basic rule is that you're only supposed to increment the css
>>> refcount if you have:
>>>
>>> - a reference to a task in the cgroup (that is pinned via task_lock()
>>> so it can't be moved away)
>>> or
>>> - an existing reference to the css
>>>
>> My problem is that css_get() can still be called after pre_destroy(), even
>> once the css's refcnt has dropped to 0.
>
> But where are you getting the reference from in order to do css_get()?
> Which call in mem cgroup are you concerned about?
>
mem_cgroup_try_charge_swapin() and the other swap-in paths.
In those functions, the following happens:
==
take a swp_entry that is being swapped in
look up the swap_cgroup record for that swp_entry
get the memcg from the swap_cgroup record
charge() against the memcg found there; charge() will do css_get()
==
Currently, mem_cgroup->obsolete is used to keep this from ever happening,
but the mem_cgroup->obsolete flag is broken now. I'm looking for an
alternative (see the other patches; I know there are several ways to go).
*AND* any kind of hierarchy-tree-walk algorithm may call css_get() against
a cgroup that is under rmdir() if it is not marked as REMOVED.
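To make the swap-in path concrete, in rough pseudo-C (the function names
here are illustrative only, not the exact memcg helpers):
==
	/* simplified swap-in charge path, illustrative names */
	memcg = lookup_swap_record(ent);	/* memcg recorded at swap-out */
	if (memcg) {
		/*
		 * Nothing pins memcg at this point: its cgroup may already
		 * have gone through pre_destroy() and the css refcnt may be
		 * 0, yet the charge below will still css_get() it.
		 */
		charge(memcg, page);		/* does css_get(&memcg->css) */
	}
==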
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2008-12-10 10:23 UTC | newest]
Thread overview: 11+ messages
2008-12-05 8:26 [RFC][PATCH 0/4] cgroup ID and css refcnt change and memcg hierarchy (2008/12/05) KAMEZAWA Hiroyuki
2008-12-05 8:28 ` [RFC][PATCH 1/4] New css->refcnt implementation KAMEZAWA Hiroyuki
2008-12-05 9:39 ` Paul Menage
2008-12-05 11:24 ` KAMEZAWA Hiroyuki
2008-12-10 9:00 ` Paul Menage
2008-12-10 10:23 ` KAMEZAWA Hiroyuki
2008-12-05 8:29 ` [RFC][PATCH 2/4] cgroup ID KAMEZAWA Hiroyuki
2008-12-05 11:11 ` Paul Menage
2008-12-05 11:50 ` KAMEZAWA Hiroyuki
2008-12-05 8:31 ` [RFC][PATCH 3/4] memcg: hierachical reclaim KAMEZAWA Hiroyuki
2008-12-05 8:32 ` [RFC][PATCH 4/4] fix oom kill under hierarchy KAMEZAWA Hiroyuki