linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] CGroups: Hierarchy locking/refcount changes
@ 2008-12-16 11:30 menage
  2008-12-16 11:30 ` [PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: menage @ 2008-12-16 11:30 UTC (permalink / raw)
  To: akpm, kamezawa.hiroyu, lizf; +Cc: linux-kernel, linux-mm

These patches introduce new locking/refcount support for cgroups to
reduce the need for subsystems to call cgroup_lock(). This will
ultimately allow the atomicity of cgroup_rmdir() (which was removed
recently) to be restored.

These three patches give:

1/3 - introduce a per-subsystem hierarchy_mutex which a subsystem can
     use to prevent changes to its own cgroup tree

2/3 - use hierarchy_mutex in place of calling cgroup_lock() in the
     memory controller

3/3 - introduce a css_tryget() function similar to the one recently
      proposed by Kamezawa, but avoiding spurious refcount failures in
      the event of a race between a css_tryget() and an unsuccessful
      cgroup_rmdir()

Future patches will likely involve:

- using hierarchy mutex in place of cgroup_lock() in more subsystems
 where appropriate

- restoring the atomicity of cgroup_rmdir() with respect to cgroup_create()

Signed-off-by: Paul Menage <menage@google.com>

--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex
  2008-12-16 11:30 [PATCH 0/3] CGroups: Hierarchy locking/refcount changes menage
@ 2008-12-16 11:30 ` menage
  2008-12-17  5:16   ` KAMEZAWA Hiroyuki
  2008-12-16 11:30 ` [PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller menage
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: menage @ 2008-12-16 11:30 UTC (permalink / raw)
  To: akpm, kamezawa.hiroyu, lizf; +Cc: linux-kernel, linux-mm

[-- Attachment #1: cgroup_hierarchy_lock.patch --]
[-- Type: text/plain, Size: 5598 bytes --]

This patch adds a hierarchy_mutex to the cgroup_subsys object that
protects changes to the hierarchy observed by that subsystem. It is
taken by the cgroup subsystem (in addition to cgroup_mutex) for the
following operations:

- linking a cgroup into that subsystem's cgroup tree
- unlinking a cgroup from that subsystem's cgroup tree
- moving the subsystem to/from a hierarchy (including across the
  bind() callback)

Thus if the subsystem holds its own hierarchy_mutex, it can safely
traverse its own hierarchy.

Signed-off-by: Paul Menage <menage@google.com>

---

 Documentation/cgroups/cgroups.txt |    2 +-
 include/linux/cgroup.h            |   17 ++++++++++++++++-
 kernel/cgroup.c                   |   37 +++++++++++++++++++++++++++++++++++--
 3 files changed, 52 insertions(+), 4 deletions(-)

Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
===================================================================
--- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h
+++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
@@ -337,8 +337,23 @@ struct cgroup_subsys {
 #define MAX_CGROUP_TYPE_NAMELEN 32
 	const char *name;
 
+	/*
+	 * Protects sibling/children links of cgroups in this
+	 * hierarchy, plus protects which hierarchy (or none) the
+	 * subsystem is a part of (i.e. root/sibling).  To avoid
+	 * potential deadlocks, the following operations should not be
+	 * undertaken while holding any hierarchy_mutex:
+	 *
+	 * - allocating memory
+	 * - initiating hotplug events
+	 */
+	struct mutex hierarchy_mutex;
+
+	/*
+	 * Link to parent, and list entry in parent's children.
+	 * Protected by this->hierarchy_mutex and cgroup_lock()
+	 */
 	struct cgroupfs_root *root;
-
 	struct list_head sibling;
 };
 
Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
===================================================================
--- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c
+++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
@@ -714,23 +714,26 @@ static int rebind_subsystems(struct cgro
 			BUG_ON(cgrp->subsys[i]);
 			BUG_ON(!dummytop->subsys[i]);
 			BUG_ON(dummytop->subsys[i]->cgroup != dummytop);
+			mutex_lock(&ss->hierarchy_mutex);
 			cgrp->subsys[i] = dummytop->subsys[i];
 			cgrp->subsys[i]->cgroup = cgrp;
 			list_move(&ss->sibling, &root->subsys_list);
 			ss->root = root;
 			if (ss->bind)
 				ss->bind(ss, cgrp);
-
+			mutex_unlock(&ss->hierarchy_mutex);
 		} else if (bit & removed_bits) {
 			/* We're removing this subsystem */
 			BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]);
 			BUG_ON(cgrp->subsys[i]->cgroup != cgrp);
+			mutex_lock(&ss->hierarchy_mutex);
 			if (ss->bind)
 				ss->bind(ss, dummytop);
 			dummytop->subsys[i]->cgroup = dummytop;
 			cgrp->subsys[i] = NULL;
 			subsys[i]->root = &rootnode;
 			list_move(&ss->sibling, &rootnode.subsys_list);
+			mutex_unlock(&ss->hierarchy_mutex);
 		} else if (bit & final_bits) {
 			/* Subsystem state should already exist */
 			BUG_ON(!cgrp->subsys[i]);
@@ -2326,6 +2329,29 @@ static void init_cgroup_css(struct cgrou
 	cgrp->subsys[ss->subsys_id] = css;
 }
 
+static void cgroup_lock_hierarchy(struct cgroupfs_root *root)
+{
+	/* We need to take each hierarchy_mutex in a consistent order */
+	int i;
+
+	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+		struct cgroup_subsys *ss = subsys[i];
+		if (ss->root == root)
+			mutex_lock_nested(&ss->hierarchy_mutex, i);
+	}
+}
+
+static void cgroup_unlock_hierarchy(struct cgroupfs_root *root)
+{
+	int i;
+
+	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+		struct cgroup_subsys *ss = subsys[i];
+		if (ss->root == root)
+			mutex_unlock(&ss->hierarchy_mutex);
+	}
+}
+
 /*
  * cgroup_create - create a cgroup
  * @parent: cgroup that will be parent of the new cgroup
@@ -2374,7 +2400,9 @@ static long cgroup_create(struct cgroup 
 		init_cgroup_css(css, ss, cgrp);
 	}
 
+	cgroup_lock_hierarchy(root);
 	list_add(&cgrp->sibling, &cgrp->parent->children);
+	cgroup_unlock_hierarchy(root);
 	root->number_of_cgroups++;
 
 	err = cgroup_create_dir(cgrp, dentry, mode);
@@ -2492,8 +2520,12 @@ static int cgroup_rmdir(struct inode *un
 	if (!list_empty(&cgrp->release_list))
 		list_del(&cgrp->release_list);
 	spin_unlock(&release_list_lock);
-	/* delete my sibling from parent->children */
+
+	cgroup_lock_hierarchy(cgrp->root);
+	/* delete this cgroup from parent->children */
 	list_del(&cgrp->sibling);
+	cgroup_unlock_hierarchy(cgrp->root);
+
 	spin_lock(&cgrp->dentry->d_lock);
 	d = dget(cgrp->dentry);
 	spin_unlock(&d->d_lock);
@@ -2535,6 +2567,7 @@ static void __init cgroup_init_subsys(st
 	 * need to invoke fork callbacks here. */
 	BUG_ON(!list_empty(&init_task.tasks));
 
+	mutex_init(&ss->hierarchy_mutex);
 	ss->active = 1;
 }
 
Index: hierarchy_lock-mmotm-2008-12-09/Documentation/cgroups/cgroups.txt
===================================================================
--- hierarchy_lock-mmotm-2008-12-09.orig/Documentation/cgroups/cgroups.txt
+++ hierarchy_lock-mmotm-2008-12-09/Documentation/cgroups/cgroups.txt
@@ -528,7 +528,7 @@ example in cpusets, no task may attach b
 up.
 
 void bind(struct cgroup_subsys *ss, struct cgroup *root)
-(cgroup_mutex held by caller)
+(cgroup_mutex and ss->hierarchy_mutex held by caller)
 
 Called when a cgroup subsystem is rebound to a different hierarchy
 and root cgroup. Currently this will only involve movement between

--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller
  2008-12-16 11:30 [PATCH 0/3] CGroups: Hierarchy locking/refcount changes menage
  2008-12-16 11:30 ` [PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
@ 2008-12-16 11:30 ` menage
  2008-12-17  5:17   ` KAMEZAWA Hiroyuki
  2008-12-16 11:30 ` [PATCH 3/3] CGroups: Add css_tryget() menage
  2008-12-17 22:03 ` [PATCH 0/3] CGroups: Hierarchy locking/refcount changes Andrew Morton
  3 siblings, 1 reply; 11+ messages in thread
From: menage @ 2008-12-16 11:30 UTC (permalink / raw)
  To: akpm, kamezawa.hiroyu, lizf; +Cc: linux-kernel, linux-mm

[-- Attachment #1: mm-destroy-fix.patch --]
[-- Type: text/plain, Size: 2495 bytes --]

This patch updates the memory controller to use its hierarchy_mutex
rather than calling cgroup_lock() to protected against
cgroup_mkdir()/cgroup_rmdir() from occurring in its hierarchy.

Signed-off-by: Paul Menage <menage@google.com>

---
 mm/memcontrol.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

Index: hierarchy_lock-mmotm-2008-12-09/mm/memcontrol.c
===================================================================
--- hierarchy_lock-mmotm-2008-12-09.orig/mm/memcontrol.c
+++ hierarchy_lock-mmotm-2008-12-09/mm/memcontrol.c
@@ -154,7 +154,7 @@ struct mem_cgroup {
 
 	/*
 	 * While reclaiming in a hiearchy, we cache the last child we
-	 * reclaimed from. Protected by cgroup_lock()
+	 * reclaimed from. Protected by hierarchy_mutex
 	 */
 	struct mem_cgroup *last_scanned_child;
 	/*
@@ -529,7 +529,7 @@ unsigned long mem_cgroup_isolate_pages(u
 
 /*
  * This routine finds the DFS walk successor. This routine should be
- * called with cgroup_mutex held
+ * called with hierarchy_mutex held
  */
 static struct mem_cgroup *
 mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
@@ -598,7 +598,7 @@ mem_cgroup_get_first_node(struct mem_cgr
 	/*
 	 * Scan all children under the mem_cgroup mem
 	 */
-	cgroup_lock();
+	mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
 	if (list_empty(&root_mem->css.cgroup->children)) {
 		ret = root_mem;
 		goto done;
@@ -619,7 +619,7 @@ mem_cgroup_get_first_node(struct mem_cgr
 
 done:
 	root_mem->last_scanned_child = ret;
-	cgroup_unlock();
+	mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
 	return ret;
 }
 
@@ -683,18 +683,16 @@ static int mem_cgroup_hierarchical_recla
 	while (next_mem != root_mem) {
 		if (next_mem->obsolete) {
 			mem_cgroup_put(next_mem);
-			cgroup_lock();
 			next_mem = mem_cgroup_get_first_node(root_mem);
-			cgroup_unlock();
 			continue;
 		}
 		ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
 						   get_swappiness(next_mem));
 		if (mem_cgroup_check_under_limit(root_mem))
 			return 0;
-		cgroup_lock();
+		mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
 		next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
-		cgroup_unlock();
+		mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
 	}
 	return ret;
 }

--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 3/3] CGroups: Add css_tryget()
  2008-12-16 11:30 [PATCH 0/3] CGroups: Hierarchy locking/refcount changes menage
  2008-12-16 11:30 ` [PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
  2008-12-16 11:30 ` [PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller menage
@ 2008-12-16 11:30 ` menage
  2008-12-17  5:19   ` KAMEZAWA Hiroyuki
  2008-12-17 22:07   ` Andrew Morton
  2008-12-17 22:03 ` [PATCH 0/3] CGroups: Hierarchy locking/refcount changes Andrew Morton
  3 siblings, 2 replies; 11+ messages in thread
From: menage @ 2008-12-16 11:30 UTC (permalink / raw)
  To: akpm, kamezawa.hiroyu, lizf; +Cc: linux-kernel, linux-mm

[-- Attachment #1: cgroup_refcnt.patch --]
[-- Type: text/plain, Size: 6843 bytes --]

This patch adds css_tryget(), that obtains a counted reference on a
CSS.  It is used in situations where the caller has a "weak" reference
to the CSS, i.e. one that does not protect the cgroup from removal via
a reference count, but would instead be cleaned up by a destroy()
callback.

css_tryget() will return true on success, or false if the cgroup is
being removed.

This is similar to Kamezawa Hiroyuki's patch from a week or two ago,
but with the difference that in the event of css_tryget() racing with
a cgroup_rmdir(), css_tryget() will only return false if the cgroup
really does get removed.

This implementation is done by biasing css->refcnt, so that a refcnt
of 1 means "releasable" and 0 means "released or releasing". In the
event of a race, css_tryget() distinguishes between "released" and
"releasing" by checking for the CSS_REMOVED flag in css->flags.

Signed-off-by: Paul Menage <menage@google.com>

---
 include/linux/cgroup.h |   38 +++++++++++++++++++++++++-----
 kernel/cgroup.c        |   61 ++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 88 insertions(+), 11 deletions(-)

Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
===================================================================
--- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h
+++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
@@ -52,9 +52,9 @@ struct cgroup_subsys_state {
 	 * hierarchy structure */
 	struct cgroup *cgroup;
 
-	/* State maintained by the cgroup system to allow
-	 * subsystems to be "busy". Should be accessed via css_get()
-	 * and css_put() */
+	/* State maintained by the cgroup system to allow subsystems
+	 * to be "busy". Should be accessed via css_get(),
+	 * css_tryget() and and css_put(). */
 
 	atomic_t refcnt;
 
@@ -64,11 +64,14 @@ struct cgroup_subsys_state {
 /* bits in struct cgroup_subsys_state flags field */
 enum {
 	CSS_ROOT, /* This CSS is the root of the subsystem */
+	CSS_REMOVED, /* This CSS is dead */
 };
 
 /*
- * Call css_get() to hold a reference on the cgroup;
- *
+ * Call css_get() to hold a reference on the css; it can be used
+ * for a reference obtained via:
+ * - an existing ref-counted reference to the css
+ * - task->cgroups for a locked task
  */
 
 static inline void css_get(struct cgroup_subsys_state *css)
@@ -77,9 +80,32 @@ static inline void css_get(struct cgroup
 	if (!test_bit(CSS_ROOT, &css->flags))
 		atomic_inc(&css->refcnt);
 }
+
+static inline bool css_is_removed(struct cgroup_subsys_state *css)
+{
+	return test_bit(CSS_REMOVED, &css->flags);
+}
+
+/*
+ * Call css_tryget() to take a reference on a css if your existing
+ * (known-valid) reference isn't already ref-counted. Returns false if
+ * the css has been destroyed.
+ */
+
+static inline bool css_tryget(struct cgroup_subsys_state *css)
+{
+	if (test_bit(CSS_ROOT, &css->flags))
+		return true;
+	while (!atomic_inc_not_zero(&css->refcnt)) {
+		if (test_bit(CSS_REMOVED, &css->flags))
+			return false;
+	}
+	return true;
+}
+
 /*
  * css_put() should be called to release a reference taken by
- * css_get()
+ * css_get() or css_tryget()
  */
 
 extern void __css_put(struct cgroup_subsys_state *css);
Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
===================================================================
--- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c
+++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
@@ -2321,7 +2321,7 @@ static void init_cgroup_css(struct cgrou
 			       struct cgroup *cgrp)
 {
 	css->cgroup = cgrp;
-	atomic_set(&css->refcnt, 0);
+	atomic_set(&css->refcnt, 1);
 	css->flags = 0;
 	if (cgrp == dummytop)
 		set_bit(CSS_ROOT, &css->flags);
@@ -2453,7 +2453,7 @@ static int cgroup_has_css_refs(struct cg
 {
 	/* Check the reference count on each subsystem. Since we
 	 * already established that there are no tasks in the
-	 * cgroup, if the css refcount is also 0, then there should
+	 * cgroup, if the css refcount is also 1, then there should
 	 * be no outstanding references, so the subsystem is safe to
 	 * destroy. We scan across all subsystems rather than using
 	 * the per-hierarchy linked list of mounted subsystems since
@@ -2474,12 +2474,62 @@ static int cgroup_has_css_refs(struct cg
 		 * matter, since it can only happen if the cgroup
 		 * has been deleted and hence no longer needs the
 		 * release agent to be called anyway. */
-		if (css && atomic_read(&css->refcnt))
+		if (css && (atomic_read(&css->refcnt) > 1))
 			return 1;
 	}
 	return 0;
 }
 
+/*
+ * Atomically mark all (or else none) of the cgroup's CSS objects as
+ * CSS_REMOVED. Return true on success, or false if the cgroup has
+ * busy subsystems. Call with cgroup_mutex held
+ */
+
+static int cgroup_clear_css_refs(struct cgroup *cgrp)
+{
+	struct cgroup_subsys *ss;
+	unsigned long flags;
+	bool failed = false;
+	local_irq_save(flags);
+	for_each_subsys(cgrp->root, ss) {
+		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+		int refcnt;
+		do {
+			/* We can only remove a CSS with a refcnt==1 */
+			refcnt = atomic_read(&css->refcnt);
+			if (refcnt > 1) {
+				failed = true;
+				goto done;
+			}
+			BUG_ON(!refcnt);
+			/*
+			 * Drop the refcnt to 0 while we check other
+			 * subsystems. This will cause any racing
+			 * css_tryget() to spin until we set the
+			 * CSS_REMOVED bits or abort
+			 */
+		} while (atomic_cmpxchg(&css->refcnt, refcnt, 0) != refcnt);
+	}
+ done:
+	for_each_subsys(cgrp->root, ss) {
+		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+		if (failed) {
+			/*
+			 * Restore old refcnt if we previously managed
+			 * to clear it from 1 to 0
+			 */
+			if (!atomic_read(&css->refcnt))
+				atomic_set(&css->refcnt, 1);
+		} else {
+			/* Commit the fact that the CSS is removed */
+			set_bit(CSS_REMOVED, &css->flags);
+		}
+	}
+	local_irq_restore(flags);
+	return !failed;
+}
+
 static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
 {
 	struct cgroup *cgrp = dentry->d_fsdata;
@@ -2510,7 +2560,7 @@ static int cgroup_rmdir(struct inode *un
 
 	if (atomic_read(&cgrp->count)
 	    || !list_empty(&cgrp->children)
-	    || cgroup_has_css_refs(cgrp)) {
+	    || !cgroup_clear_css_refs(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
@@ -3065,7 +3115,8 @@ void __css_put(struct cgroup_subsys_stat
 {
 	struct cgroup *cgrp = css->cgroup;
 	rcu_read_lock();
-	if (atomic_dec_and_test(&css->refcnt) && notify_on_release(cgrp)) {
+	if ((atomic_dec_return(&css->refcnt) == 1) &&
+	    notify_on_release(cgrp)) {
 		set_bit(CGRP_RELEASABLE, &cgrp->flags);
 		check_for_release(cgrp);
 	}

--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex
  2008-12-16 11:30 ` [PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
@ 2008-12-17  5:16   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-17  5:16 UTC (permalink / raw)
  To: menage; +Cc: akpm, lizf, linux-kernel, linux-mm

On Tue, 16 Dec 2008 03:30:56 -0800
menage@google.com wrote:

> This patch adds a hierarchy_mutex to the cgroup_subsys object that
> protects changes to the hierarchy observed by that subsystem. It is
> taken by the cgroup subsystem (in addition to cgroup_mutex) for the
> following operations:
> 
> - linking a cgroup into that subsystem's cgroup tree
> - unlinking a cgroup from that subsystem's cgroup tree
> - moving the subsystem to/from a hierarchy (including across the
>   bind() callback)
> 
> Thus if the subsystem holds its own hierarchy_mutex, it can safely
> traverse its own hierarchy.
> 
> Signed-off-by: Paul Menage <menage@google.com>
> 
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



> ---
> 
>  Documentation/cgroups/cgroups.txt |    2 +-
>  include/linux/cgroup.h            |   17 ++++++++++++++++-
>  kernel/cgroup.c                   |   37 +++++++++++++++++++++++++++++++++++--
>  3 files changed, 52 insertions(+), 4 deletions(-)
> 
> Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h
> +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> @@ -337,8 +337,23 @@ struct cgroup_subsys {
>  #define MAX_CGROUP_TYPE_NAMELEN 32
>  	const char *name;
>  
> +	/*
> +	 * Protects sibling/children links of cgroups in this
> +	 * hierarchy, plus protects which hierarchy (or none) the
> +	 * subsystem is a part of (i.e. root/sibling).  To avoid
> +	 * potential deadlocks, the following operations should not be
> +	 * undertaken while holding any hierarchy_mutex:
> +	 *
> +	 * - allocating memory
> +	 * - initiating hotplug events
> +	 */
> +	struct mutex hierarchy_mutex;
> +
> +	/*
> +	 * Link to parent, and list entry in parent's children.
> +	 * Protected by this->hierarchy_mutex and cgroup_lock()
> +	 */
>  	struct cgroupfs_root *root;
> -
>  	struct list_head sibling;
>  };
>  
> Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c
> +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> @@ -714,23 +714,26 @@ static int rebind_subsystems(struct cgro
>  			BUG_ON(cgrp->subsys[i]);
>  			BUG_ON(!dummytop->subsys[i]);
>  			BUG_ON(dummytop->subsys[i]->cgroup != dummytop);
> +			mutex_lock(&ss->hierarchy_mutex);
>  			cgrp->subsys[i] = dummytop->subsys[i];
>  			cgrp->subsys[i]->cgroup = cgrp;
>  			list_move(&ss->sibling, &root->subsys_list);
>  			ss->root = root;
>  			if (ss->bind)
>  				ss->bind(ss, cgrp);
> -
> +			mutex_unlock(&ss->hierarchy_mutex);
>  		} else if (bit & removed_bits) {
>  			/* We're removing this subsystem */
>  			BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]);
>  			BUG_ON(cgrp->subsys[i]->cgroup != cgrp);
> +			mutex_lock(&ss->hierarchy_mutex);
>  			if (ss->bind)
>  				ss->bind(ss, dummytop);
>  			dummytop->subsys[i]->cgroup = dummytop;
>  			cgrp->subsys[i] = NULL;
>  			subsys[i]->root = &rootnode;
>  			list_move(&ss->sibling, &rootnode.subsys_list);
> +			mutex_unlock(&ss->hierarchy_mutex);
>  		} else if (bit & final_bits) {
>  			/* Subsystem state should already exist */
>  			BUG_ON(!cgrp->subsys[i]);
> @@ -2326,6 +2329,29 @@ static void init_cgroup_css(struct cgrou
>  	cgrp->subsys[ss->subsys_id] = css;
>  }
>  
> +static void cgroup_lock_hierarchy(struct cgroupfs_root *root)
> +{
> +	/* We need to take each hierarchy_mutex in a consistent order */
> +	int i;
> +
> +	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> +		struct cgroup_subsys *ss = subsys[i];
> +		if (ss->root == root)
> +			mutex_lock_nested(&ss->hierarchy_mutex, i);
> +	}
> +}
> +
> +static void cgroup_unlock_hierarchy(struct cgroupfs_root *root)
> +{
> +	int i;
> +
> +	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> +		struct cgroup_subsys *ss = subsys[i];
> +		if (ss->root == root)
> +			mutex_unlock(&ss->hierarchy_mutex);
> +	}
> +}
> +
>  /*
>   * cgroup_create - create a cgroup
>   * @parent: cgroup that will be parent of the new cgroup
> @@ -2374,7 +2400,9 @@ static long cgroup_create(struct cgroup 
>  		init_cgroup_css(css, ss, cgrp);
>  	}
>  
> +	cgroup_lock_hierarchy(root);
>  	list_add(&cgrp->sibling, &cgrp->parent->children);
> +	cgroup_unlock_hierarchy(root);
>  	root->number_of_cgroups++;
>  
>  	err = cgroup_create_dir(cgrp, dentry, mode);
> @@ -2492,8 +2520,12 @@ static int cgroup_rmdir(struct inode *un
>  	if (!list_empty(&cgrp->release_list))
>  		list_del(&cgrp->release_list);
>  	spin_unlock(&release_list_lock);
> -	/* delete my sibling from parent->children */
> +
> +	cgroup_lock_hierarchy(cgrp->root);
> +	/* delete this cgroup from parent->children */
>  	list_del(&cgrp->sibling);
> +	cgroup_unlock_hierarchy(cgrp->root);
> +
>  	spin_lock(&cgrp->dentry->d_lock);
>  	d = dget(cgrp->dentry);
>  	spin_unlock(&d->d_lock);
> @@ -2535,6 +2567,7 @@ static void __init cgroup_init_subsys(st
>  	 * need to invoke fork callbacks here. */
>  	BUG_ON(!list_empty(&init_task.tasks));
>  
> +	mutex_init(&ss->hierarchy_mutex);
>  	ss->active = 1;
>  }
>  
> Index: hierarchy_lock-mmotm-2008-12-09/Documentation/cgroups/cgroups.txt
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/Documentation/cgroups/cgroups.txt
> +++ hierarchy_lock-mmotm-2008-12-09/Documentation/cgroups/cgroups.txt
> @@ -528,7 +528,7 @@ example in cpusets, no task may attach b
>  up.
>  
>  void bind(struct cgroup_subsys *ss, struct cgroup *root)
> -(cgroup_mutex held by caller)
> +(cgroup_mutex and ss->hierarchy_mutex held by caller)
>  
>  Called when a cgroup subsystem is rebound to a different hierarchy
>  and root cgroup. Currently this will only involve movement between
> 
> --
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller
  2008-12-16 11:30 ` [PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller menage
@ 2008-12-17  5:17   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-17  5:17 UTC (permalink / raw)
  To: menage; +Cc: akpm, lizf, linux-kernel, linux-mm

On Tue, 16 Dec 2008 03:30:57 -0800
menage@google.com wrote:

> This patch updates the memory controller to use its hierarchy_mutex
> rather than calling cgroup_lock() to protected against
> cgroup_mkdir()/cgroup_rmdir() from occurring in its hierarchy.
> 
> Signed-off-by: Paul Menage <menage@google.com>
> 
Not doing any special test but passed usual test.

Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


> ---
>  mm/memcontrol.c |   14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> Index: hierarchy_lock-mmotm-2008-12-09/mm/memcontrol.c
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/mm/memcontrol.c
> +++ hierarchy_lock-mmotm-2008-12-09/mm/memcontrol.c
> @@ -154,7 +154,7 @@ struct mem_cgroup {
>  
>  	/*
>  	 * While reclaiming in a hiearchy, we cache the last child we
> -	 * reclaimed from. Protected by cgroup_lock()
> +	 * reclaimed from. Protected by hierarchy_mutex
>  	 */
>  	struct mem_cgroup *last_scanned_child;
>  	/*
> @@ -529,7 +529,7 @@ unsigned long mem_cgroup_isolate_pages(u
>  
>  /*
>   * This routine finds the DFS walk successor. This routine should be
> - * called with cgroup_mutex held
> + * called with hierarchy_mutex held
>   */
>  static struct mem_cgroup *
>  mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> @@ -598,7 +598,7 @@ mem_cgroup_get_first_node(struct mem_cgr
>  	/*
>  	 * Scan all children under the mem_cgroup mem
>  	 */
> -	cgroup_lock();
> +	mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
>  	if (list_empty(&root_mem->css.cgroup->children)) {
>  		ret = root_mem;
>  		goto done;
> @@ -619,7 +619,7 @@ mem_cgroup_get_first_node(struct mem_cgr
>  
>  done:
>  	root_mem->last_scanned_child = ret;
> -	cgroup_unlock();
> +	mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
>  	return ret;
>  }
>  
> @@ -683,18 +683,16 @@ static int mem_cgroup_hierarchical_recla
>  	while (next_mem != root_mem) {
>  		if (next_mem->obsolete) {
>  			mem_cgroup_put(next_mem);
> -			cgroup_lock();
>  			next_mem = mem_cgroup_get_first_node(root_mem);
> -			cgroup_unlock();
>  			continue;
>  		}
>  		ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
>  						   get_swappiness(next_mem));
>  		if (mem_cgroup_check_under_limit(root_mem))
>  			return 0;
> -		cgroup_lock();
> +		mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
>  		next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
> -		cgroup_unlock();
> +		mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
>  	}
>  	return ret;
>  }
> 
> --
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] CGroups: Add css_tryget()
  2008-12-16 11:30 ` [PATCH 3/3] CGroups: Add css_tryget() menage
@ 2008-12-17  5:19   ` KAMEZAWA Hiroyuki
  2008-12-17 22:07   ` Andrew Morton
  1 sibling, 0 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-17  5:19 UTC (permalink / raw)
  To: menage; +Cc: akpm, lizf, linux-kernel, linux-mm

On Tue, 16 Dec 2008 03:30:58 -0800
menage@google.com wrote:

> This patch adds css_tryget(), that obtains a counted reference on a
> CSS.  It is used in situations where the caller has a "weak" reference
> to the CSS, i.e. one that does not protect the cgroup from removal via
> a reference count, but would instead be cleaned up by a destroy()
> callback.
> 
> css_tryget() will return true on success, or false if the cgroup is
> being removed.
> 
> This is similar to Kamezawa Hiroyuki's patch from a week or two ago,
> but with the difference that in the event of css_tryget() racing with
> a cgroup_rmdir(), css_tryget() will only return false if the cgroup
> really does get removed.
> 
> This implementation is done by biasing css->refcnt, so that a refcnt
> of 1 means "releasable" and 0 means "released or releasing". In the
> event of a race, css_tryget() distinguishes between "released" and
> "releasing" by checking for the CSS_REMOVED flag in css->flags.
> 
> Signed-off-by: Paul Menage <menage@google.com>
> 
mkdir/rmdir works well. I'll write the user of this patch "css_tryget()" 
in memcg.

Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


> ---
>  include/linux/cgroup.h |   38 +++++++++++++++++++++++++-----
>  kernel/cgroup.c        |   61 ++++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 88 insertions(+), 11 deletions(-)
> 
> Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h
> +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> @@ -52,9 +52,9 @@ struct cgroup_subsys_state {
>  	 * hierarchy structure */
>  	struct cgroup *cgroup;
>  
> -	/* State maintained by the cgroup system to allow
> -	 * subsystems to be "busy". Should be accessed via css_get()
> -	 * and css_put() */
> +	/* State maintained by the cgroup system to allow subsystems
> +	 * to be "busy". Should be accessed via css_get(),
> +	 * css_tryget() and and css_put(). */
>  
>  	atomic_t refcnt;
>  
> @@ -64,11 +64,14 @@ struct cgroup_subsys_state {
>  /* bits in struct cgroup_subsys_state flags field */
>  enum {
>  	CSS_ROOT, /* This CSS is the root of the subsystem */
> +	CSS_REMOVED, /* This CSS is dead */
>  };
>  
>  /*
> - * Call css_get() to hold a reference on the cgroup;
> - *
> + * Call css_get() to hold a reference on the css; it can be used
> + * for a reference obtained via:
> + * - an existing ref-counted reference to the css
> + * - task->cgroups for a locked task
>   */
>  
>  static inline void css_get(struct cgroup_subsys_state *css)
> @@ -77,9 +80,32 @@ static inline void css_get(struct cgroup
>  	if (!test_bit(CSS_ROOT, &css->flags))
>  		atomic_inc(&css->refcnt);
>  }
> +
> +static inline bool css_is_removed(struct cgroup_subsys_state *css)
> +{
> +	return test_bit(CSS_REMOVED, &css->flags);
> +}
> +
> +/*
> + * Call css_tryget() to take a reference on a css if your existing
> + * (known-valid) reference isn't already ref-counted. Returns false if
> + * the css has been destroyed.
> + */
> +
> +static inline bool css_tryget(struct cgroup_subsys_state *css)
> +{
> +	if (test_bit(CSS_ROOT, &css->flags))
> +		return true;
> +	while (!atomic_inc_not_zero(&css->refcnt)) {
> +		if (test_bit(CSS_REMOVED, &css->flags))
> +			return false;
> +	}
> +	return true;
> +}
> +
>  /*
>   * css_put() should be called to release a reference taken by
> - * css_get()
> + * css_get() or css_tryget()
>   */
>  
>  extern void __css_put(struct cgroup_subsys_state *css);
> Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c
> +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> @@ -2321,7 +2321,7 @@ static void init_cgroup_css(struct cgrou
>  			       struct cgroup *cgrp)
>  {
>  	css->cgroup = cgrp;
> -	atomic_set(&css->refcnt, 0);
> +	atomic_set(&css->refcnt, 1);
>  	css->flags = 0;
>  	if (cgrp == dummytop)
>  		set_bit(CSS_ROOT, &css->flags);
> @@ -2453,7 +2453,7 @@ static int cgroup_has_css_refs(struct cg
>  {
>  	/* Check the reference count on each subsystem. Since we
>  	 * already established that there are no tasks in the
> -	 * cgroup, if the css refcount is also 0, then there should
> +	 * cgroup, if the css refcount is also 1, then there should
>  	 * be no outstanding references, so the subsystem is safe to
>  	 * destroy. We scan across all subsystems rather than using
>  	 * the per-hierarchy linked list of mounted subsystems since
> @@ -2474,12 +2474,62 @@ static int cgroup_has_css_refs(struct cg
>  		 * matter, since it can only happen if the cgroup
>  		 * has been deleted and hence no longer needs the
>  		 * release agent to be called anyway. */
> -		if (css && atomic_read(&css->refcnt))
> +		if (css && (atomic_read(&css->refcnt) > 1))
>  			return 1;
>  	}
>  	return 0;
>  }
>  
> +/*
> + * Atomically mark all (or else none) of the cgroup's CSS objects as
> + * CSS_REMOVED. Return true on success, or false if the cgroup has
> + * busy subsystems. Call with cgroup_mutex held
> + */
> +
> +static int cgroup_clear_css_refs(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys *ss;
> +	unsigned long flags;
> +	bool failed = false;
> +	local_irq_save(flags);
> +	for_each_subsys(cgrp->root, ss) {
> +		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> +		int refcnt;
> +		do {
> +			/* We can only remove a CSS with a refcnt==1 */
> +			refcnt = atomic_read(&css->refcnt);
> +			if (refcnt > 1) {
> +				failed = true;
> +				goto done;
> +			}
> +			BUG_ON(!refcnt);
> +			/*
> +			 * Drop the refcnt to 0 while we check other
> +			 * subsystems. This will cause any racing
> +			 * css_tryget() to spin until we set the
> +			 * CSS_REMOVED bits or abort
> +			 */
> +		} while (atomic_cmpxchg(&css->refcnt, refcnt, 0) != refcnt);
> +	}
> + done:
> +	for_each_subsys(cgrp->root, ss) {
> +		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> +		if (failed) {
> +			/*
> +			 * Restore old refcnt if we previously managed
> +			 * to clear it from 1 to 0
> +			 */
> +			if (!atomic_read(&css->refcnt))
> +				atomic_set(&css->refcnt, 1);
> +		} else {
> +			/* Commit the fact that the CSS is removed */
> +			set_bit(CSS_REMOVED, &css->flags);
> +		}
> +	}
> +	local_irq_restore(flags);
> +	return !failed;
> +}
> +
>  static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
>  {
>  	struct cgroup *cgrp = dentry->d_fsdata;
> @@ -2510,7 +2560,7 @@ static int cgroup_rmdir(struct inode *un
>  
>  	if (atomic_read(&cgrp->count)
>  	    || !list_empty(&cgrp->children)
> -	    || cgroup_has_css_refs(cgrp)) {
> +	    || !cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;
>  	}
> @@ -3065,7 +3115,8 @@ void __css_put(struct cgroup_subsys_stat
>  {
>  	struct cgroup *cgrp = css->cgroup;
>  	rcu_read_lock();
> -	if (atomic_dec_and_test(&css->refcnt) && notify_on_release(cgrp)) {
> +	if ((atomic_dec_return(&css->refcnt) == 1) &&
> +	    notify_on_release(cgrp)) {
>  		set_bit(CGRP_RELEASABLE, &cgrp->flags);
>  		check_for_release(cgrp);
>  	}
> 
> --
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] CGroups: Hierarchy locking/refcount changes
  2008-12-16 11:30 [PATCH 0/3] CGroups: Hierarchy locking/refcount changes menage
                   ` (2 preceding siblings ...)
  2008-12-16 11:30 ` [PATCH 3/3] CGroups: Add css_tryget() menage
@ 2008-12-17 22:03 ` Andrew Morton
  2008-12-19  2:06   ` Paul Menage
  3 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2008-12-17 22:03 UTC (permalink / raw)
  To: menage; +Cc: kamezawa.hiroyu, lizf, linux-kernel, linux-mm

On Tue, 16 Dec 2008 03:30:55 -0800
menage@google.com wrote:

> These patches introduce new locking/refcount support for cgroups to
> reduce the need for subsystems to call cgroup_lock(). This will
> ultimately allow the atomicity of cgroup_rmdir() (which was removed
> recently) to be restored.

OK, they merged OK.  We're accumulating rather a lot of cgroups work.

I have a question mark over these:

cgroups-make-root_list-contains-active-hierarchies-only.patch
cgroups-add-inactive-subsystems-to-rootnodesubsys_list.patch
cgroups-add-inactive-subsystems-to-rootnodesubsys_list-fix.patch
cgroups-introduce-link_css_set-to-remove-duplicate-code.patch
cgroups-introduce-link_css_set-to-remove-duplicate-code-fix.patch

it wasn't clear to me whether you still had issues with them, or
whether updates were expected?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] CGroups: Add css_tryget()
  2008-12-16 11:30 ` [PATCH 3/3] CGroups: Add css_tryget() menage
  2008-12-17  5:19   ` KAMEZAWA Hiroyuki
@ 2008-12-17 22:07   ` Andrew Morton
  2008-12-17 22:48     ` Paul Menage
  1 sibling, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2008-12-17 22:07 UTC (permalink / raw)
  To: menage; +Cc: kamezawa.hiroyu, lizf, linux-kernel, linux-mm

On Tue, 16 Dec 2008 03:30:58 -0800
menage@google.com wrote:

> This patch adds css_tryget(), that obtains a counted reference on a
> CSS.  It is used in situations where the caller has a "weak" reference
> to the CSS, i.e. one that does not protect the cgroup from removal via
> a reference count, but would instead be cleaned up by a destroy()
> callback.
> 
> css_tryget() will return true on success, or false if the cgroup is
> being removed.
> 
> This is similar to Kamezawa Hiroyuki's patch from a week or two ago,
> but with the difference that in the event of css_tryget() racing with
> a cgroup_rmdir(), css_tryget() will only return false if the cgroup
> really does get removed.
> 
> This implementation is done by biasing css->refcnt, so that a refcnt
> of 1 means "releasable" and 0 means "released or releasing". In the
> event of a race, css_tryget() distinguishes between "released" and
> "releasing" by checking for the CSS_REMOVED flag in css->flags.
> 
> ...
>
> --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h
> +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> @@ -52,9 +52,9 @@ struct cgroup_subsys_state {
>  	 * hierarchy structure */
>  	struct cgroup *cgroup;
>  
> -	/* State maintained by the cgroup system to allow
> -	 * subsystems to be "busy". Should be accessed via css_get()
> -	 * and css_put() */
> +	/* State maintained by the cgroup system to allow subsystems
> +	 * to be "busy". Should be accessed via css_get(),
> +	 * css_tryget() and and css_put(). */

nanonit.  This layout:

	/*
	 * State maintained by the cgroup system to allow subsystems
	 * to be "busy". Should be accessed via css_get(),
	 * css_tryget() and and css_put().
	 */

is conventional/preferred.

>  	atomic_t refcnt;
>  
> @@ -64,11 +64,14 @@ struct cgroup_subsys_state {
>  /* bits in struct cgroup_subsys_state flags field */
>  enum {
>  	CSS_ROOT, /* This CSS is the root of the subsystem */
> +	CSS_REMOVED, /* This CSS is dead */
>  };
>  
>  /*
> - * Call css_get() to hold a reference on the cgroup;
> - *
> + * Call css_get() to hold a reference on the css; it can be used
> + * for a reference obtained via:
> + * - an existing ref-counted reference to the css
> + * - task->cgroups for a locked task
>   */
>  
>  static inline void css_get(struct cgroup_subsys_state *css)
> @@ -77,9 +80,32 @@ static inline void css_get(struct cgroup
>  	if (!test_bit(CSS_ROOT, &css->flags))
>  		atomic_inc(&css->refcnt);
>  }
> +
> +static inline bool css_is_removed(struct cgroup_subsys_state *css)
> +{
> +	return test_bit(CSS_REMOVED, &css->flags);
> +}
> +
> +/*
> + * Call css_tryget() to take a reference on a css if your existing
> + * (known-valid) reference isn't already ref-counted. Returns false if
> + * the css has been destroyed.
> + */
> +
> +static inline bool css_tryget(struct cgroup_subsys_state *css)
> +{
> +	if (test_bit(CSS_ROOT, &css->flags))
> +		return true;
> +	while (!atomic_inc_not_zero(&css->refcnt)) {
> +		if (test_bit(CSS_REMOVED, &css->flags))
> +			return false;
> +	}
> +	return true;
> +}

This looks too large to inline.

We should have a cpu_relax() in the loop?

And possibly a cond_resched().

It would be better if these polling loops didn't exist at all, of
course.  But I guess if you could work out a way of doing that, this
patch wouldn't exist.

>
>  ...
>
> +/*
> + * Atomically mark all (or else none) of the cgroup's CSS objects as
> + * CSS_REMOVED. Return true on success, or false if the cgroup has
> + * busy subsystems. Call with cgroup_mutex held
> + */
> +
> +static int cgroup_clear_css_refs(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys *ss;
> +	unsigned long flags;
> +	bool failed = false;
> +	local_irq_save(flags);

please put a blank line between end-of-locals and start-of-code.

> +	for_each_subsys(cgrp->root, ss) {
> +		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> +		int refcnt;
> +		do {
> +			/* We can only remove a CSS with a refcnt==1 */
> +			refcnt = atomic_read(&css->refcnt);
> +			if (refcnt > 1) {
> +				failed = true;
> +				goto done;
> +			}
> +			BUG_ON(!refcnt);
> +			/*
> +			 * Drop the refcnt to 0 while we check other
> +			 * subsystems. This will cause any racing
> +			 * css_tryget() to spin until we set the
> +			 * CSS_REMOVED bits or abort
> +			 */
> +		} while (atomic_cmpxchg(&css->refcnt, refcnt, 0) != refcnt);

This loop also should have a cpu_relax(), I think?

> +	}
> + done:
> +	for_each_subsys(cgrp->root, ss) {
> +		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> +		if (failed) {
> +			/*
> +			 * Restore old refcnt if we previously managed
> +			 * to clear it from 1 to 0
> +			 */
> +			if (!atomic_read(&css->refcnt))
> +				atomic_set(&css->refcnt, 1);
> +		} else {
> +			/* Commit the fact that the CSS is removed */
> +			set_bit(CSS_REMOVED, &css->flags);
> +		}
> +	}
> +	local_irq_restore(flags);
> +	return !failed;
> +}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] CGroups: Add css_tryget()
  2008-12-17 22:07   ` Andrew Morton
@ 2008-12-17 22:48     ` Paul Menage
  0 siblings, 0 replies; 11+ messages in thread
From: Paul Menage @ 2008-12-17 22:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: kamezawa.hiroyu, lizf, linux-kernel, linux-mm

On Wed, Dec 17, 2008 at 2:07 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>        /*
>         * State maintained by the cgroup system to allow subsystems
>         * to be "busy". Should be accessed via css_get(),
>         * css_tryget() and and css_put().
>         */
>
> is conventional/preferred.

Oops, will fix.

>>  static inline void css_get(struct cgroup_subsys_state *css)
>> @@ -77,9 +80,32 @@ static inline void css_get(struct cgroup
>>       if (!test_bit(CSS_ROOT, &css->flags))
>>               atomic_inc(&css->refcnt);
>>  }
>> +
>> +static inline bool css_is_removed(struct cgroup_subsys_state *css)
>> +{
>> +     return test_bit(CSS_REMOVED, &css->flags);
>> +}
>> +
>> +/*
>> + * Call css_tryget() to take a reference on a css if your existing
>> + * (known-valid) reference isn't already ref-counted. Returns false if
>> + * the css has been destroyed.
>> + */
>> +
>> +static inline bool css_tryget(struct cgroup_subsys_state *css)
>> +{
>> +     if (test_bit(CSS_ROOT, &css->flags))
>> +             return true;
>> +     while (!atomic_inc_not_zero(&css->refcnt)) {
>> +             if (test_bit(CSS_REMOVED, &css->flags))
>> +                     return false;
>> +     }
>> +     return true;
>> +}
>
> This looks too large to inline.
>
> We should have a cpu_relax() in the loop?

Sounds reasonable.

>
> And possibly a cond_resched().

No, we don't want to reschedule. These are pseudo spin locks rather
than psuedo mutexes. And the "hold time" is extremely short.

>
> It would be better if these polling loops didn't exist at all, of
> course.  But I guess if you could work out a way of doing that, this
> patch wouldn't exist.

It would certainly be possible to implement it as a spinlock and a
count, and do:

css_get() {
  spin_lock(&css->lock);
  css->count++;
  spin_unlock(&css->lock);
}

css_tryget() {
  spin_lock(&css->lock);
  if (css->count > 1) {
    css->count++; result = true;
  } else {
    result = false;
  }
  spin_unlock(&css->lock);
}

and implement the cgroups side of it as

for each subsystem {
  spin_lock(&css->lock);
  if (css->count == 1) {
    css->count = 0;
  } else {
    success = false;
  }
}
for each subsystem {
  if (!success && css->count == 0) {
    css->count = 1;
  }
  spin_unlock(&css->lock);
}

Functionally that would be identical - the only downside is that's an
extra atomic operation in the fast path of css_get() and css_tryget(),
which some people had objected to in the past when I proposed similar
patches.

Hmm. Thinking about it, this is very similar to the rwlock_t logic,
and I could probably implement css_get() and css_tryget() via
read_lock() and the clear_css_refs() side via write_trylock(). Which
would be pretty much the same as the original patch, except using
conventional primitives. Big downside would be that we would be
limited to RW_LOCK_BIAS refcounts, or about 16M, versus the 2B that we
get with regular atomics.

Paul
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] CGroups: Hierarchy locking/refcount changes
  2008-12-17 22:03 ` [PATCH 0/3] CGroups: Hierarchy locking/refcount changes Andrew Morton
@ 2008-12-19  2:06   ` Paul Menage
  0 siblings, 0 replies; 11+ messages in thread
From: Paul Menage @ 2008-12-19  2:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: kamezawa.hiroyu, lizf, linux-kernel, linux-mm

On Wed, Dec 17, 2008 at 2:03 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> cgroups-make-root_list-contains-active-hierarchies-only.patch
> cgroups-add-inactive-subsystems-to-rootnodesubsys_list.patch
> cgroups-add-inactive-subsystems-to-rootnodesubsys_list-fix.patch
> cgroups-introduce-link_css_set-to-remove-duplicate-code.patch
> cgroups-introduce-link_css_set-to-remove-duplicate-code-fix.patch
>
> it wasn't clear to me whether you still had issues with them, or
> whether updates were expected?

I think that with the fix patches they should be fine.

Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2008-12-19  2:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-16 11:30 [PATCH 0/3] CGroups: Hierarchy locking/refcount changes menage
2008-12-16 11:30 ` [PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
2008-12-17  5:16   ` KAMEZAWA Hiroyuki
2008-12-16 11:30 ` [PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller menage
2008-12-17  5:17   ` KAMEZAWA Hiroyuki
2008-12-16 11:30 ` [PATCH 3/3] CGroups: Add css_tryget() menage
2008-12-17  5:19   ` KAMEZAWA Hiroyuki
2008-12-17 22:07   ` Andrew Morton
2008-12-17 22:48     ` Paul Menage
2008-12-17 22:03 ` [PATCH 0/3] CGroups: Hierarchy locking/refcount changes Andrew Morton
2008-12-19  2:06   ` Paul Menage

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox