linux-mm.kvack.org archive mirror
* [RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v2)
@ 2008-11-08  9:10 Balbir Singh
  2008-11-08  9:10 ` [RFC][mm] [PATCH 1/4] Memory cgroup hierarchy documentation (v2) Balbir Singh
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Balbir Singh @ 2008-11-08  9:10 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki

This patchset follows several iterations of the memory controller hierarchy
patches: the hardwall approach by Kamezawa-San [1], and version 1 of this
patchset at [2].

The current approach is based on [2] and has the following properties:

1. Hierarchies are very natural in a filesystem like the cgroup filesystem,
   and multi-level tree hierarchies have been supported in filesystems for a
   long time. When the feature is turned on, we honor hierarchies, so that
   the root accounts for the resource usage of all its children and limits
   can be set at any point in the hierarchy. Any memory cgroup is constrained
   by the limits along its path to the root: the total usage of all children
   of a node cannot exceed the limit of the node.
2. The hierarchy feature is selectable and is off by default.
3. Hierarchies are expensive, and the trade-off is depth versus performance.
   They can also be completely turned off.
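
The limit semantics in point 1 can be sketched as a small userspace model
(illustrative names only, not the kernel API): every charge propagates to
all ancestors, so a child can never exceed a limit set anywhere along its
path to the root, and a failed charge leaves nothing accounted.

```c
#include <assert.h>
#include <stddef.h>

struct counter {
	unsigned long usage;
	unsigned long limit;
	struct counter *parent;
};

/* Returns 0 on success, -1 if any counter on the path to the root would
 * exceed its limit; on failure all partial charges are rolled back. */
static int charge(struct counter *c, unsigned long val)
{
	struct counter *p, *u;

	for (p = c; p != NULL; p = p->parent) {
		if (p->usage + val > p->limit) {
			for (u = c; u != p; u = u->parent)	/* roll back */
				u->usage -= val;
			return -1;
		}
		p->usage += val;
	}
	return 0;
}
```

So a charge against cgroup e in the diagram is limited by e's limit, c's
limit, and root's limit, even if e's own limit is generous.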

The patches are against 2.6.28-rc2-mm1 and were tested in a KVM instance
with SMP and swap turned on.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>

v2..v1
======
1. The hierarchy is now selectable per-subtree
2. The features file has been renamed to use_hierarchy
3. Reclaim now holds the cgroup lock and performs a recursive walk and reclaim over the hierarchy

Series
------

memcg-hierarchy-documentation.patch
resource-counters-hierarchy-support.patch
memcg-hierarchical-reclaim.patch
memcg-add-hierarchy-selector.patch

Reviews? Comments?

References

1. http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-06/msg05417.html
2. http://kerneltrap.org/mailarchive/linux-kernel/2008/4/19/1513644/thread

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC][mm] [PATCH 1/4] Memory cgroup hierarchy documentation (v2)
  2008-11-08  9:10 [RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v2) Balbir Singh
@ 2008-11-08  9:10 ` Balbir Singh
  2008-11-08  9:10 ` [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v2) Balbir Singh
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Balbir Singh @ 2008-11-08  9:10 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki


Documentation updates for hierarchy support

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/controllers/memory.txt |   37 +++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff -puN Documentation/controllers/memory.txt~memcg-hierarchy-documentation Documentation/controllers/memory.txt
--- linux-2.6.28-rc2/Documentation/controllers/memory.txt~memcg-hierarchy-documentation	2008-11-08 14:09:29.000000000 +0530
+++ linux-2.6.28-rc2-balbir/Documentation/controllers/memory.txt	2008-11-08 14:09:29.000000000 +0530
@@ -245,6 +245,43 @@ cgroup might have some charge associated
 tasks have migrated away from it. Such charges are automatically dropped at
 rmdir() if there are no tasks.
 
+5. Hierarchy support
+
+The memory controller supports a deep hierarchy and hierarchical accounting.
+The hierarchy is created by creating the appropriate cgroups in the
+cgroup filesystem. Consider, for example, the following cgroup filesystem
+hierarchy:
+
+		root
+	     /  |   \
+           /	|    \
+	  a	b	c
+			| \
+			|  \
+			d   e
+
+In the diagram above, with hierarchical accounting enabled, all memory
+usage of e is accounted to its ancestors up to the root (i.e., c and root)
+that have memory.use_hierarchy enabled. If one of the ancestors goes over
+its limit, the reclaim algorithm reclaims from the tasks in the ancestor
+and the children of the ancestor.
+
+5.1 Enabling hierarchical accounting and reclaim
+
+The memory controller disables the hierarchy feature by default. Support
+can be enabled by writing 1 to the memory.use_hierarchy file of the root cgroup:
+
+# echo 1 > memory.use_hierarchy
+
+The feature can be disabled by
+
+# echo 0 > memory.use_hierarchy
+
+NOTE1: Enabling/disabling will fail if the cgroup already has other
+cgroups created below it.
+
+NOTE2: This feature can be enabled/disabled per subtree.
+
 5. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
_

-- 
	Balbir


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v2)
  2008-11-08  9:10 [RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v2) Balbir Singh
  2008-11-08  9:10 ` [RFC][mm] [PATCH 1/4] Memory cgroup hierarchy documentation (v2) Balbir Singh
@ 2008-11-08  9:10 ` Balbir Singh
  2008-11-11  3:04   ` KAMEZAWA Hiroyuki
  2008-11-08  9:11 ` [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2) Balbir Singh
  2008-11-08  9:11 ` [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2) Balbir Singh
  3 siblings, 1 reply; 15+ messages in thread
From: Balbir Singh @ 2008-11-08  9:10 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki

Add support for building hierarchies in resource counters. Cgroups allow us
to build a deep hierarchy, but we currently don't link the resource counters
belonging to the memory controller's control groups in the same fashion
as the corresponding cgroup entries in the cgroup hierarchy. This patch
provides the infrastructure for resource counters that have the same hierarchy
as their cgroup counterparts.

This set of patches is based on the resource counter hierarchy patches posted
by Pavel Emelianov.

NOTE: Building hierarchies is expensive; deeper hierarchies imply charging
all the way up to the root. Since hierarchies are known to be expensive,
the user needs to be careful and aware of the trade-offs before creating
very deep ones.
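
As a rough userspace sketch of the charging scheme this patch implements
(locking omitted; the kernel code takes each counter's spinlock with IRQs
disabled), a failed charge reports which ancestor hit its limit via
*limit_fail_at and undoes the partial charges:

```c
#include <assert.h>
#include <stddef.h>

struct res_counter {
	unsigned long usage;
	unsigned long limit;
	struct res_counter *parent;
};

static int res_counter_charge_locked(struct res_counter *c, unsigned long val)
{
	if (c->usage + val > c->limit)
		return -1;
	c->usage += val;
	return 0;
}

static void res_counter_uncharge_locked(struct res_counter *c, unsigned long val)
{
	c->usage -= val;
}

/* Walk from the charged counter to the root; on failure, record the
 * counter that was over its limit and undo the charges made so far,
 * from the starting counter up to (not including) the failing one. */
static int res_counter_charge(struct res_counter *counter, unsigned long val,
			      struct res_counter **limit_fail_at)
{
	struct res_counter *c, *u;

	*limit_fail_at = NULL;
	for (c = counter; c != NULL; c = c->parent) {
		if (res_counter_charge_locked(c, val) < 0) {
			*limit_fail_at = c;
			for (u = counter; u != c; u = u->parent)
				res_counter_uncharge_locked(u, val);
			return -1;
		}
	}
	return 0;
}
```

The charge/undo walk is why deep hierarchies are expensive: every page
charge touches one counter per level.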


Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/res_counter.h |    8 ++++++--
 kernel/res_counter.c        |   42 ++++++++++++++++++++++++++++++++++--------
 mm/memcontrol.c             |    9 ++++++---
 3 files changed, 46 insertions(+), 13 deletions(-)

diff -puN include/linux/res_counter.h~resource-counters-hierarchy-support include/linux/res_counter.h
--- linux-2.6.28-rc2/include/linux/res_counter.h~resource-counters-hierarchy-support	2008-11-08 14:09:31.000000000 +0530
+++ linux-2.6.28-rc2-balbir/include/linux/res_counter.h	2008-11-08 14:09:31.000000000 +0530
@@ -43,6 +43,10 @@ struct res_counter {
 	 * the routines below consider this to be IRQ-safe
 	 */
 	spinlock_t lock;
+	/*
+	 * Parent counter, used for hierarchical resource accounting
+	 */
+	struct res_counter *parent;
 };
 
 /**
@@ -87,7 +91,7 @@ enum {
  * helpers for accounting
  */
 
-void res_counter_init(struct res_counter *counter);
+void res_counter_init(struct res_counter *counter, struct res_counter *parent);
 
 /*
  * charge - try to consume more resource.
@@ -103,7 +107,7 @@ void res_counter_init(struct res_counter
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
 int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val);
+		unsigned long val, struct res_counter **limit_fail_at);
 
 /*
  * uncharge - tell that some portion of the resource is released
diff -puN kernel/res_counter.c~resource-counters-hierarchy-support kernel/res_counter.c
--- linux-2.6.28-rc2/kernel/res_counter.c~resource-counters-hierarchy-support	2008-11-08 14:09:31.000000000 +0530
+++ linux-2.6.28-rc2-balbir/kernel/res_counter.c	2008-11-08 14:09:31.000000000 +0530
@@ -15,10 +15,11 @@
 #include <linux/uaccess.h>
 #include <linux/mm.h>
 
-void res_counter_init(struct res_counter *counter)
+void res_counter_init(struct res_counter *counter, struct res_counter *parent)
 {
 	spin_lock_init(&counter->lock);
 	counter->limit = (unsigned long long)LLONG_MAX;
+	counter->parent = parent;
 }
 
 int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -34,14 +35,34 @@ int res_counter_charge_locked(struct res
 	return 0;
 }
 
-int res_counter_charge(struct res_counter *counter, unsigned long val)
+int res_counter_charge(struct res_counter *counter, unsigned long val,
+			struct res_counter **limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
+	struct res_counter *c, *u;
 
-	spin_lock_irqsave(&counter->lock, flags);
-	ret = res_counter_charge_locked(counter, val);
-	spin_unlock_irqrestore(&counter->lock, flags);
+	*limit_fail_at = NULL;
+	local_irq_save(flags);
+	for (c = counter; c != NULL; c = c->parent) {
+		spin_lock(&c->lock);
+		ret = res_counter_charge_locked(c, val);
+		spin_unlock(&c->lock);
+		if (ret < 0) {
+			*limit_fail_at = c;
+			goto undo;
+		}
+	}
+	ret = 0;
+	goto done;
+undo:
+	for (u = counter; u != c; u = u->parent) {
+		spin_lock(&u->lock);
+		res_counter_uncharge_locked(u, val);
+		spin_unlock(&u->lock);
+	}
+done:
+	local_irq_restore(flags);
 	return ret;
 }
 
@@ -56,10 +77,15 @@ void res_counter_uncharge_locked(struct 
 void res_counter_uncharge(struct res_counter *counter, unsigned long val)
 {
 	unsigned long flags;
+	struct res_counter *c;
 
-	spin_lock_irqsave(&counter->lock, flags);
-	res_counter_uncharge_locked(counter, val);
-	spin_unlock_irqrestore(&counter->lock, flags);
+	local_irq_save(flags);
+	for (c = counter; c != NULL; c = c->parent) {
+		spin_lock(&c->lock);
+		res_counter_uncharge_locked(c, val);
+		spin_unlock(&c->lock);
+	}
+	local_irq_restore(flags);
 }
 
 
diff -puN mm/memcontrol.c~resource-counters-hierarchy-support mm/memcontrol.c
--- linux-2.6.28-rc2/mm/memcontrol.c~resource-counters-hierarchy-support	2008-11-08 14:09:31.000000000 +0530
+++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-08 14:09:31.000000000 +0530
@@ -485,6 +485,7 @@ int mem_cgroup_try_charge(struct mm_stru
 {
 	struct mem_cgroup *mem;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct res_counter *fail_res;
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
@@ -510,7 +511,7 @@ int mem_cgroup_try_charge(struct mm_stru
 	}
 
 
-	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
+	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
 		if (!(gfp_mask & __GFP_WAIT))
 			goto nomem;
 
@@ -1175,18 +1176,20 @@ static void mem_cgroup_free(struct mem_c
 static struct cgroup_subsys_state *
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
-	struct mem_cgroup *mem;
+	struct mem_cgroup *mem, *parent;
 	int node;
 
 	if (unlikely((cont->parent) == NULL)) {
 		mem = &init_mem_cgroup;
+		parent = NULL;
 	} else {
 		mem = mem_cgroup_alloc();
+		parent = mem_cgroup_from_cont(cont->parent);
 		if (!mem)
 			return ERR_PTR(-ENOMEM);
 	}
 
-	res_counter_init(&mem->res);
+	res_counter_init(&mem->res, parent ? &parent->res : NULL);
 
 	for_each_node_state(node, N_POSSIBLE)
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
_

-- 
	Balbir


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2)
  2008-11-08  9:10 [RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v2) Balbir Singh
  2008-11-08  9:10 ` [RFC][mm] [PATCH 1/4] Memory cgroup hierarchy documentation (v2) Balbir Singh
  2008-11-08  9:10 ` [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v2) Balbir Singh
@ 2008-11-08  9:11 ` Balbir Singh
  2008-11-11  3:06   ` KAMEZAWA Hiroyuki
  2008-11-08  9:11 ` [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2) Balbir Singh
  3 siblings, 1 reply; 15+ messages in thread
From: Balbir Singh @ 2008-11-08  9:11 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki

This patch introduces hierarchical reclaim. When an ancestor goes over its
limit, the charging routine points to the ancestor that is above its limit.
The reclaim process then starts from the last scanned child of that ancestor
and reclaims until the ancestor goes back below its limit.
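
The "last scanned child" rotation can be sketched as follows. The kernel
walks the sibling list under cgroup_lock; here an index into an array of
children stands in for that, and the names are illustrative only:

```c
#include <assert.h>

struct node {
	int nr_children;
	int last_scanned_child;	/* -1 when no child has been scanned yet */
};

/* Return the index of the next child to reclaim from, resuming after
 * the cached position so reclaim pressure is spread round-robin and no
 * child is penalized merely for its position in the sibling list. */
static int next_child(struct node *n)
{
	int next;

	if (n->last_scanned_child < 0)
		next = 0;
	else
		next = (n->last_scanned_child + 1) % n->nr_children;
	n->last_scanned_child = next;
	return next;
}
```

Without the cached position, every over-limit event would start reclaim
from the first child in the list, starving it relative to its siblings.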

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 mm/memcontrol.c |  152 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 128 insertions(+), 24 deletions(-)

diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
--- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim	2008-11-08 14:09:32.000000000 +0530
+++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-08 14:09:32.000000000 +0530
@@ -132,6 +132,11 @@ struct mem_cgroup {
 	 * statistics.
 	 */
 	struct mem_cgroup_stat stat;
+	/*
+	 * While reclaiming in a hierarchy, we cache the last child we
+	 * reclaimed from.
+	 */
+	struct mem_cgroup *last_scanned_child;
 };
 static struct mem_cgroup init_mem_cgroup;
 
@@ -467,6 +472,124 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
+static struct mem_cgroup *
+mem_cgroup_from_res_counter(struct res_counter *counter)
+{
+	return container_of(counter, struct mem_cgroup, res);
+}
+
+/*
+ * Dance down the hierarchy if needed to reclaim memory. We remember the
+ * last child we reclaimed from, so that we don't end up penalizing
+ * one child extensively based on its position in the children list.
+ *
+ * root_mem is the original ancestor that we've been reclaiming from.
+ */
+static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
+						struct mem_cgroup *root_mem,
+						gfp_t gfp_mask)
+{
+	struct cgroup *cg_current, *cgroup;
+	struct mem_cgroup *mem_child;
+	int ret = 0;
+
+	/*
+	 * Reclaim unconditionally and don't check for return value.
+	 * We need to reclaim in the current group and down the tree.
+	 * One might think about checking for children before reclaiming,
+	 * but there might be left over accounting, even after children
+	 * have left.
+	 */
+	try_to_free_mem_cgroup_pages(mem, gfp_mask);
+
+	if (res_counter_check_under_limit(&root_mem->res))
+		return 0;
+
+	if (list_empty(&mem->css.cgroup->children))
+		return 0;
+
+	/*
+	 * Scan all children under the mem_cgroup mem
+	 */
+	if (!mem->last_scanned_child)
+		cgroup = list_first_entry(&mem->css.cgroup->children,
+				struct cgroup, sibling);
+	else
+		cgroup = mem->last_scanned_child->css.cgroup;
+
+	cg_current = cgroup;
+	cgroup_lock();
+
+	do {
+		struct list_head *next;
+
+		mem_child = mem_cgroup_from_cont(cgroup);
+		cgroup_unlock();
+
+		ret = mem_cgroup_hierarchical_reclaim(mem_child, root_mem,
+							gfp_mask);
+		mem->last_scanned_child = mem_child;
+
+		cgroup_lock();
+		if (res_counter_check_under_limit(&root_mem->res)) {
+			ret = 0;
+			goto done;
+		}
+
+		/*
+		 * Since we gave up the lock, it is time to
+		 * start from last cgroup
+		 */
+		cgroup = mem->last_scanned_child->css.cgroup;
+		next = cgroup->sibling.next;
+
+		if (next == &cg_current->parent->children)
+			cgroup = list_first_entry(&mem->css.cgroup->children,
+							struct cgroup, sibling);
+		else
+			cgroup = container_of(next, struct cgroup, sibling);
+	} while (cgroup != cg_current);
+
+done:
+	cgroup_unlock();
+	return ret;
+}
+
+/*
+ * Charge memory cgroup mem and check if it is over its limit. If so, reclaim
+ * from mem.
+ */
+static int mem_cgroup_charge_and_reclaim(struct mem_cgroup *mem, gfp_t gfp_mask)
+{
+	int ret = 0;
+	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct res_counter *fail_res;
+	struct mem_cgroup *mem_over_limit;
+
+	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
+		if (!(gfp_mask & __GFP_WAIT))
+			goto out;
+
+		/*
+		 * Is one of our ancestors over their limit?
+		 */
+		if (fail_res)
+			mem_over_limit = mem_cgroup_from_res_counter(fail_res);
+		else
+			mem_over_limit = mem;
+
+		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit,
+							mem_over_limit,
+							gfp_mask);
+
+		if (!nr_retries--) {
+			mem_cgroup_out_of_memory(mem, gfp_mask);
+			goto out;
+		}
+	}
+out:
+	return ret;
+}
 
 /**
  * mem_cgroup_try_charge - get charge of PAGE_SIZE.
@@ -484,8 +607,7 @@ int mem_cgroup_try_charge(struct mm_stru
 			gfp_t gfp_mask, struct mem_cgroup **memcg)
 {
 	struct mem_cgroup *mem;
-	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct res_counter *fail_res;
+
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
@@ -510,29 +632,9 @@ int mem_cgroup_try_charge(struct mm_stru
 		css_get(&mem->css);
 	}
 
+	if (mem_cgroup_charge_and_reclaim(mem, gfp_mask))
+		goto nomem;
 
-	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
-		if (!(gfp_mask & __GFP_WAIT))
-			goto nomem;
-
-		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
-			continue;
-
-		/*
-		 * try_to_free_mem_cgroup_pages() might not give us a full
-		 * picture of reclaim. Some pages are reclaimed and might be
-		 * moved to swap cache or just unmapped from the cgroup.
-		 * Check the limit again to see if the reclaim reduced the
-		 * current usage of the cgroup before giving up
-		 */
-		if (res_counter_check_under_limit(&mem->res))
-			continue;
-
-		if (!nr_retries--) {
-			mem_cgroup_out_of_memory(mem, gfp_mask);
-			goto nomem;
-		}
-	}
 	return 0;
 nomem:
 	css_put(&mem->css);
@@ -1195,6 +1297,8 @@ mem_cgroup_create(struct cgroup_subsys *
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
 			goto free_out;
 
+	mem->last_scanned_child = NULL;
+
 	return &mem->css;
 free_out:
 	for_each_node_state(node, N_POSSIBLE)
_

-- 
	Balbir


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2)
  2008-11-08  9:10 [RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v2) Balbir Singh
                   ` (2 preceding siblings ...)
  2008-11-08  9:11 ` [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2) Balbir Singh
@ 2008-11-08  9:11 ` Balbir Singh
  2008-11-08  9:39   ` Li Zefan
  2008-11-11  3:10   ` KAMEZAWA Hiroyuki
  3 siblings, 2 replies; 15+ messages in thread
From: Balbir Singh @ 2008-11-08  9:11 UTC (permalink / raw)
  To: linux-mm
  Cc: YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Nick Piggin,
	David Rientjes, Pavel Emelianov, Dhaval Giani, Balbir Singh,
	Andrew Morton, KAMEZAWA Hiroyuki

Don't enable multiple hierarchy support by default. This patch introduces
a use_hierarchy file that can be set to enable the nested hierarchy
feature. The feature can only be toggled while the cgroup for which it is
being set has no children.
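
A minimal sketch of the write handler's validation logic, with the two
identical branches of the patch folded into one check. The struct and
the error constants below are stand-ins for illustration, not the
kernel's types:

```c
#include <assert.h>

struct mem_cgroup_stub {
	unsigned long use_hierarchy;
	int nr_children;	/* stand-in for list_empty(&cont->children) */
};

#define EBUSY_ERR	(-16)	/* mirrors -EBUSY */
#define EINVAL_ERR	(-22)	/* mirrors -EINVAL */

/* Only 0 and 1 are valid values, and the flag may only change while
 * the cgroup has no children; flipping it with children present would
 * leave their resource counters linked (or not) inconsistently. */
static int hierarchy_write(struct mem_cgroup_stub *mem, unsigned long long val)
{
	if (val != 0 && val != 1)
		return EINVAL_ERR;
	if (mem->nr_children != 0)
		return EBUSY_ERR;
	mem->use_hierarchy = val;
	return 0;
}
```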

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 mm/memcontrol.c |   43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff -puN mm/memcontrol.c~memcg-add-hierarchy-selector mm/memcontrol.c
--- linux-2.6.28-rc2/mm/memcontrol.c~memcg-add-hierarchy-selector	2008-11-08 14:09:33.000000000 +0530
+++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-08 14:10:53.000000000 +0530
@@ -137,6 +137,11 @@ struct mem_cgroup {
 	 * reclaimed from.
 	 */
 	struct mem_cgroup *last_scanned_child;
+	/*
+	 * Should the accounting and control be hierarchical, per subtree?
+	 */
+	unsigned long use_hierarchy;
+
 };
 static struct mem_cgroup init_mem_cgroup;
 
@@ -1079,6 +1084,33 @@ out:
 	return ret;
 }
 
+static u64 mem_cgroup_hierarchy_read(struct cgroup *cont, struct cftype *cft)
+{
+	return mem_cgroup_from_cont(cont)->use_hierarchy;
+}
+
+static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
+					u64 val)
+{
+	int retval = 0;
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+	if (val == 1) {
+		if (list_empty(&cont->children))
+			mem->use_hierarchy = val;
+		else
+			retval = -EBUSY;
+	} else if (val == 0) {
+		if (list_empty(&cont->children))
+			mem->use_hierarchy = val;
+		else
+			retval = -EBUSY;
+	} else
+		retval = -EINVAL;
+
+	return retval;
+}
+
 static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 {
 	return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
@@ -1213,6 +1245,11 @@ static struct cftype mem_cgroup_files[] 
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
+	{
+		.name = "use_hierarchy",
+		.write_u64 = mem_cgroup_hierarchy_write,
+		.read_u64 = mem_cgroup_hierarchy_read,
+	},
 };
 
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1289,9 +1326,13 @@ mem_cgroup_create(struct cgroup_subsys *
 		parent = mem_cgroup_from_cont(cont->parent);
 		if (!mem)
 			return ERR_PTR(-ENOMEM);
+		mem->use_hierarchy = parent->use_hierarchy;
 	}
 
-	res_counter_init(&mem->res, parent ? &parent->res : NULL);
+	if (parent && parent->use_hierarchy)
+		res_counter_init(&mem->res, &parent->res);
+	else
+		res_counter_init(&mem->res, NULL);
 
 	for_each_node_state(node, N_POSSIBLE)
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
_

-- 
	Balbir


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2)
  2008-11-08  9:11 ` [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2) Balbir Singh
@ 2008-11-08  9:39   ` Li Zefan
  2008-11-08  9:49     ` Balbir Singh
  2008-11-11  3:10   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 15+ messages in thread
From: Li Zefan @ 2008-11-08  9:39 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton, KAMEZAWA Hiroyuki

> +static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
> +					u64 val)
> +{
> +	int retval = 0;
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> +
> +	if (val == 1) {
> +		if (list_empty(&cont->children))

cgroup_lock should be held before checking cont->children.

> +			mem->use_hierarchy = val;
> +		else
> +			retval = -EBUSY;
> +	} else if (val == 0) {

And code duplicate.

> +		if (list_empty(&cont->children))
> +			mem->use_hierarchy = val;
> +		else
> +			retval = -EBUSY;
> +	} else
> +		retval = -EINVAL;
> +
> +	return retval;
> +}
> +


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2)
  2008-11-08  9:39   ` Li Zefan
@ 2008-11-08  9:49     ` Balbir Singh
  0 siblings, 0 replies; 15+ messages in thread
From: Balbir Singh @ 2008-11-08  9:49 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton, KAMEZAWA Hiroyuki

Li Zefan wrote:
>> +static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
>> +					u64 val)
>> +{
>> +	int retval = 0;
>> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
>> +
>> +	if (val == 1) {
>> +		if (list_empty(&cont->children))
> 
> cgroup_lock should be held before checking cont->children.
> 

Good point, I'll look at that aspect

>> +			mem->use_hierarchy = val;
>> +		else
>> +			retval = -EBUSY;
>> +	} else if (val == 0) {
> 
> And code duplicate.

Yes, this can be optimized better. I'll fix that in v3.

> 
>> +		if (list_empty(&cont->children))
>> +			mem->use_hierarchy = val;
>> +		else
>> +			retval = -EBUSY;
>> +	} else
>> +		retval = -EINVAL;
>> +
>> +	return retval;
>> +}
>> +
> 


-- 
	Balbir


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v2)
  2008-11-08  9:10 ` [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v2) Balbir Singh
@ 2008-11-11  3:04   ` KAMEZAWA Hiroyuki
  2008-11-11  4:52     ` Balbir Singh
  0 siblings, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-11  3:04 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Sat, 08 Nov 2008 14:40:47 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> 
> Add support for building hierarchies in resource counters. Cgroups allows us
> to build a deep hierarchy, but we currently don't link the resource counters
> belonging to the memory controller control groups, in the same fashion
> as the corresponding cgroup entries in the cgroup hierarchy. This patch
> provides the infrastructure for resource counters that have the same hiearchy
> as their cgroup counter parts.
> 
> These set of patches are based on the resource counter hiearchy patches posted
> by Pavel Emelianov.
> 
> NOTE: Building hiearchies is expensive, deeper hierarchies imply charging
> the all the way up to the root. It is known that hiearchies are expensive,
> so the user needs to be careful and aware of the trade-offs before creating
> very deep ones.
> 
Do you have numbers?

> 
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
> 
>  include/linux/res_counter.h |    8 ++++++--
>  kernel/res_counter.c        |   42 ++++++++++++++++++++++++++++++++++--------
>  mm/memcontrol.c             |    9 ++++++---
>  3 files changed, 46 insertions(+), 13 deletions(-)
> 
> diff -puN include/linux/res_counter.h~resource-counters-hierarchy-support include/linux/res_counter.h
> --- linux-2.6.28-rc2/include/linux/res_counter.h~resource-counters-hierarchy-support	2008-11-08 14:09:31.000000000 +0530
> +++ linux-2.6.28-rc2-balbir/include/linux/res_counter.h	2008-11-08 14:09:31.000000000 +0530
> @@ -43,6 +43,10 @@ struct res_counter {
>  	 * the routines below consider this to be IRQ-safe
>  	 */
>  	spinlock_t lock;
> +	/*
> +	 * Parent counter, used for hierarchial resource accounting
> +	 */
> +	struct res_counter *parent;
>  };
>  
>  /**
> @@ -87,7 +91,7 @@ enum {
>   * helpers for accounting
>   */
>  
> -void res_counter_init(struct res_counter *counter);
> +void res_counter_init(struct res_counter *counter, struct res_counter *parent);
>  
>  /*
>   * charge - try to consume more resource.
> @@ -103,7 +107,7 @@ void res_counter_init(struct res_counter
>  int __must_check res_counter_charge_locked(struct res_counter *counter,
>  		unsigned long val);
>  int __must_check res_counter_charge(struct res_counter *counter,
> -		unsigned long val);
> +		unsigned long val, struct res_counter **limit_fail_at);
>  
>  /*
>   * uncharge - tell that some portion of the resource is released
> diff -puN kernel/res_counter.c~resource-counters-hierarchy-support kernel/res_counter.c
> --- linux-2.6.28-rc2/kernel/res_counter.c~resource-counters-hierarchy-support	2008-11-08 14:09:31.000000000 +0530
> +++ linux-2.6.28-rc2-balbir/kernel/res_counter.c	2008-11-08 14:09:31.000000000 +0530
> @@ -15,10 +15,11 @@
>  #include <linux/uaccess.h>
>  #include <linux/mm.h>
>  
> -void res_counter_init(struct res_counter *counter)
> +void res_counter_init(struct res_counter *counter, struct res_counter *parent)
>  {
>  	spin_lock_init(&counter->lock);
>  	counter->limit = (unsigned long long)LLONG_MAX;
> +	counter->parent = parent;
>  }
>  
>  int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> @@ -34,14 +35,34 @@ int res_counter_charge_locked(struct res
>  	return 0;
>  }
>  
> -int res_counter_charge(struct res_counter *counter, unsigned long val)
> +int res_counter_charge(struct res_counter *counter, unsigned long val,
> +			struct res_counter **limit_fail_at)
>  {
>  	int ret;
>  	unsigned long flags;
> +	struct res_counter *c, *u;
>  
> -	spin_lock_irqsave(&counter->lock, flags);
> -	ret = res_counter_charge_locked(counter, val);
> -	spin_unlock_irqrestore(&counter->lock, flags);
> +	*limit_fail_at = NULL;
> +	local_irq_save(flags);
> +	for (c = counter; c != NULL; c = c->parent) {
> +		spin_lock(&c->lock);
> +		ret = res_counter_charge_locked(c, val);
> +		spin_unlock(&c->lock);
> +		if (ret < 0) {
> +			*limit_fail_at = c;
> +			goto undo;
> +		}
> +	}
> +	ret = 0;
> +	goto done;
> +undo:
> +	for (u = counter; u != c; u = u->parent) {
> +		spin_lock(&u->lock);
> +		res_counter_uncharge_locked(u, val);
> +		spin_unlock(&u->lock);
> +	}
> +done:
> +	local_irq_restore(flags);
>  	return ret;
>  }
>  
IMHO, dividing the function into

  - res_counter_charge() for res_counters which don't need hierarchy
  - res_counter_charge_hierarchy() for res_counters with hierarchy

will reduce the footprint for users other than memcg.

Are all users of res_counter forced to use the hierarchy version?
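
One way to read that suggestion, as a userspace sketch (the hierarchical
function name and types are hypothetical, not from the posted patch):
keep the old flat charge for controllers that never build hierarchies,
and make only opted-in callers pay for the parent walk.

```c
#include <assert.h>
#include <stddef.h>

struct res_counter {
	unsigned long usage, limit;
	struct res_counter *parent;
};

/* flat charge: touches exactly one counter, as before the patch */
static int res_counter_charge(struct res_counter *c, unsigned long val)
{
	if (c->usage + val > c->limit)
		return -1;
	c->usage += val;
	return 0;
}

/* hierarchical variant: walks to the root and rolls back on failure;
 * only callers that use hierarchies (e.g. memcg) would call this */
static int res_counter_charge_hierarchy(struct res_counter *counter,
					unsigned long val)
{
	struct res_counter *c, *u;

	for (c = counter; c != NULL; c = c->parent) {
		if (res_counter_charge(c, val) < 0) {
			for (u = counter; u != c; u = u->parent)
				u->usage -= val;
			return -1;
		}
	}
	return 0;
}
```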

Thanks,
-Kame


> @@ -56,10 +77,15 @@ void res_counter_uncharge_locked(struct 
>  void res_counter_uncharge(struct res_counter *counter, unsigned long val)
>  {
>  	unsigned long flags;
> +	struct res_counter *c;
>  
> -	spin_lock_irqsave(&counter->lock, flags);
> -	res_counter_uncharge_locked(counter, val);
> -	spin_unlock_irqrestore(&counter->lock, flags);
> +	local_irq_save(flags);
> +	for (c = counter; c != NULL; c = c->parent) {
> +		spin_lock(&c->lock);
> +		res_counter_uncharge_locked(c, val);
> +		spin_unlock(&c->lock);
> +	}
> +	local_irq_restore(flags);
>  }
>  
>  
> diff -puN mm/memcontrol.c~resource-counters-hierarchy-support mm/memcontrol.c
> --- linux-2.6.28-rc2/mm/memcontrol.c~resource-counters-hierarchy-support	2008-11-08 14:09:31.000000000 +0530
> +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-08 14:09:31.000000000 +0530
> @@ -485,6 +485,7 @@ int mem_cgroup_try_charge(struct mm_stru
>  {
>  	struct mem_cgroup *mem;
>  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	struct res_counter *fail_res;
>  	/*
>  	 * We always charge the cgroup the mm_struct belongs to.
>  	 * The mm_struct's mem_cgroup changes on task migration if the
> @@ -510,7 +511,7 @@ int mem_cgroup_try_charge(struct mm_stru
>  	}
>  
>  
> -	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
> +	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
>  		if (!(gfp_mask & __GFP_WAIT))
>  			goto nomem;
>  
> @@ -1175,18 +1176,20 @@ static void mem_cgroup_free(struct mem_c
>  static struct cgroup_subsys_state *
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> -	struct mem_cgroup *mem;
> +	struct mem_cgroup *mem, *parent;
>  	int node;
>  
>  	if (unlikely((cont->parent) == NULL)) {
>  		mem = &init_mem_cgroup;
> +		parent = NULL;
>  	} else {
>  		mem = mem_cgroup_alloc();
> +		parent = mem_cgroup_from_cont(cont->parent);
>  		if (!mem)
>  			return ERR_PTR(-ENOMEM);
>  	}
>  
> -	res_counter_init(&mem->res);
> +	res_counter_init(&mem->res, parent ? &parent->res : NULL);
>  
>  	for_each_node_state(node, N_POSSIBLE)
>  		if (alloc_mem_cgroup_per_zone_info(mem, node))
> _
> 
> -- 
> 	Balbir
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2)
  2008-11-08  9:11 ` [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2) Balbir Singh
@ 2008-11-11  3:06   ` KAMEZAWA Hiroyuki
  2008-11-11  4:47     ` Balbir Singh
  0 siblings, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-11  3:06 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Sat, 08 Nov 2008 14:41:00 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> 
> This patch introduces hierarchical reclaim. When an ancestor goes over its
> limit, the charging routine points to the parent that is above its limit.
> The reclaim process then starts from the last scanned child of the ancestor
> and reclaims until the ancestor goes below its limit.
> 
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
> 
>  mm/memcontrol.c |  152 +++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 128 insertions(+), 24 deletions(-)
> 
> diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
> --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim	2008-11-08 14:09:32.000000000 +0530
> +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-08 14:09:32.000000000 +0530
> @@ -132,6 +132,11 @@ struct mem_cgroup {
>  	 * statistics.
>  	 */
>  	struct mem_cgroup_stat stat;
> +	/*
> +	 * While reclaiming in a hierarchy, we cache the last child we
> +	 * reclaimed from.
> +	 */
> +	struct mem_cgroup *last_scanned_child;
>  };
>  static struct mem_cgroup init_mem_cgroup;
>  
> @@ -467,6 +472,124 @@ unsigned long mem_cgroup_isolate_pages(u
>  	return nr_taken;
>  }
>  
> +static struct mem_cgroup *
> +mem_cgroup_from_res_counter(struct res_counter *counter)
> +{
> +	return container_of(counter, struct mem_cgroup, res);
> +}
> +
> +/*
> + * Dance down the hierarchy if needed to reclaim memory. We remember the
> + * last child we reclaimed from, so that we don't end up penalizing
> + * one child extensively based on its position in the children list.
> + *
> + * root_mem is the original ancestor that we've been reclaiming from.
> + */
> +static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
> +						struct mem_cgroup *root_mem,
> +						gfp_t gfp_mask)
> +{
> +	struct cgroup *cg_current, *cgroup;
> +	struct mem_cgroup *mem_child;
> +	int ret = 0;
> +
> +	/*
> +	 * Reclaim unconditionally and don't check for return value.
> +	 * We need to reclaim in the current group and down the tree.
> +	 * One might think about checking for children before reclaiming,
> +	 * but there might be left over accounting, even after children
> +	 * have left.
> +	 */
> +	try_to_free_mem_cgroup_pages(mem, gfp_mask);
> +
> +	if (res_counter_check_under_limit(&root_mem->res))
> +		return 0;
> +
> +	if (list_empty(&mem->css.cgroup->children))
> +		return 0;
> +
> +	/*
> +	 * Scan all children under the mem_cgroup mem
> +	 */
> +	if (!mem->last_scanned_child)
> +		cgroup = list_first_entry(&mem->css.cgroup->children,
> +				struct cgroup, sibling);
> +	else
> +		cgroup = mem->last_scanned_child->css.cgroup;
> +

Who guarantees that last_scanned_child is still accessible at this point?

Thanks,
-Kame
> +	cg_current = cgroup;
> +	cgroup_lock();
> +
> +	do {
> +		struct list_head *next;
> +
> +		mem_child = mem_cgroup_from_cont(cgroup);
> +		cgroup_unlock();
> +
> +		ret = mem_cgroup_hierarchical_reclaim(mem_child, root_mem,
> +							gfp_mask);
> +		mem->last_scanned_child = mem_child;
> +
> +		cgroup_lock();
> +		if (res_counter_check_under_limit(&root_mem->res)) {
> +			ret = 0;
> +			goto done;
> +		}
> +
> +		/*
> +		 * Since we gave up the lock, it is time to
> +		 * start from last cgroup
> +		 */
> +		cgroup = mem->last_scanned_child->css.cgroup;
> +		next = cgroup->sibling.next;
> +
> +		if (next == &cg_current->parent->children)
> +			cgroup = list_first_entry(&mem->css.cgroup->children,
> +							struct cgroup, sibling);
> +		else
> +			cgroup = container_of(next, struct cgroup, sibling);
> +	} while (cgroup != cg_current);
> +
> +done:
> +	cgroup_unlock();
> +	return ret;
> +}
> +
> +/*
> + * Charge memory cgroup mem and check if it is over its limit. If so, reclaim
> + * from mem.
> + */
> +static int mem_cgroup_charge_and_reclaim(struct mem_cgroup *mem, gfp_t gfp_mask)
> +{
> +	int ret = 0;
> +	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	struct res_counter *fail_res;
> +	struct mem_cgroup *mem_over_limit;
> +
> +	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
> +		if (!(gfp_mask & __GFP_WAIT))
> +			goto out;
> +
> +		/*
> +		 * Is one of our ancestors over their limit?
> +		 */
> +		if (fail_res)
> +			mem_over_limit = mem_cgroup_from_res_counter(fail_res);
> +		else
> +			mem_over_limit = mem;
> +
> +		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit,
> +							mem_over_limit,
> +							gfp_mask);
> +
> +		if (!nr_retries--) {
> +			mem_cgroup_out_of_memory(mem, gfp_mask);
> +			goto out;
> +		}
> +	}
> +out:
> +	return ret;
> +}
>  
>  /**
>   * mem_cgroup_try_charge - get charge of PAGE_SIZE.
> @@ -484,8 +607,7 @@ int mem_cgroup_try_charge(struct mm_stru
>  			gfp_t gfp_mask, struct mem_cgroup **memcg)
>  {
>  	struct mem_cgroup *mem;
> -	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> -	struct res_counter *fail_res;
> +
>  	/*
>  	 * We always charge the cgroup the mm_struct belongs to.
>  	 * The mm_struct's mem_cgroup changes on task migration if the
> @@ -510,29 +632,9 @@ int mem_cgroup_try_charge(struct mm_stru
>  		css_get(&mem->css);
>  	}
>  
> +	if (mem_cgroup_charge_and_reclaim(mem, gfp_mask))
> +		goto nomem;
>  
> -	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))) {
> -		if (!(gfp_mask & __GFP_WAIT))
> -			goto nomem;
> -
> -		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> -			continue;
> -
> -		/*
> -		 * try_to_free_mem_cgroup_pages() might not give us a full
> -		 * picture of reclaim. Some pages are reclaimed and might be
> -		 * moved to swap cache or just unmapped from the cgroup.
> -		 * Check the limit again to see if the reclaim reduced the
> -		 * current usage of the cgroup before giving up
> -		 */
> -		if (res_counter_check_under_limit(&mem->res))
> -			continue;
> -
> -		if (!nr_retries--) {
> -			mem_cgroup_out_of_memory(mem, gfp_mask);
> -			goto nomem;
> -		}
> -	}
>  	return 0;
>  nomem:
>  	css_put(&mem->css);
> @@ -1195,6 +1297,8 @@ mem_cgroup_create(struct cgroup_subsys *
>  		if (alloc_mem_cgroup_per_zone_info(mem, node))
>  			goto free_out;
>  
> +	mem->last_scanned_child = NULL;
> +
>  	return &mem->css;
>  free_out:
>  	for_each_node_state(node, N_POSSIBLE)
> _
> 
> -- 
> 	Balbir


* Re: [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2)
  2008-11-08  9:11 ` [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2) Balbir Singh
  2008-11-08  9:39   ` Li Zefan
@ 2008-11-11  3:10   ` KAMEZAWA Hiroyuki
  2008-11-11  5:01     ` Balbir Singh
  1 sibling, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-11  3:10 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Sat, 08 Nov 2008 14:41:13 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> 
> Don't enable multiple hierarchy support by default. This patch introduces
> a features element that can be set to enable the nested depth hierarchy
> feature. This feature can only be enabled when the cgroup for which the
> feature is enabled has no children.
> 
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
> 
IMHO, this permission check is not enough.

I think the following is the sane way:
==
   When parent->use_hierarchy==1:
	my->use_hierarchy must be "1" and cannot be turned to "0", even with no children.
   When parent->use_hierarchy==0:
	my->use_hierarchy can be either "0" or "1";
	this value can be changed only if we have no children.
==

Thanks,
-Kame
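[As an illustration of the rules above, a userspace sketch of the validation they imply could look like this. The struct is a hypothetical stand-in for struct mem_cgroup, and "has_children" replaces the cgroup children-list check; the error choice for the pinned-flag case is an assumption:]

```c
/* Minimal stand-in for struct mem_cgroup, for illustration only. */
struct memcg {
	const struct memcg *parent;
	int use_hierarchy;
	int has_children;
};

/* Returns 0 on success, -16 (-EBUSY) or -22 (-EINVAL) on failure. */
static int hierarchy_write(struct memcg *mem, unsigned long val)
{
	if (val != 0 && val != 1)
		return -22;			/* -EINVAL */
	/* Rule 1: under a hierarchical parent, the flag is pinned to 1. */
	if (mem->parent && mem->parent->use_hierarchy && val == 0)
		return -16;			/* -EBUSY (assumed) */
	/* Rule 2: otherwise the flag may only change while childless. */
	if (mem->has_children)
		return -16;			/* -EBUSY */
	mem->use_hierarchy = (int)val;
	return 0;
}
```

[The key difference from the patch as posted is the first check: even a childless cgroup cannot opt out of the hierarchy once its parent has opted in.]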


>  mm/memcontrol.c |   43 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 42 insertions(+), 1 deletion(-)
> 
> diff -puN mm/memcontrol.c~memcg-add-hierarchy-selector mm/memcontrol.c
> --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-add-hierarchy-selector	2008-11-08 14:09:33.000000000 +0530
> +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-08 14:10:53.000000000 +0530
> @@ -137,6 +137,11 @@ struct mem_cgroup {
>  	 * reclaimed from.
>  	 */
>  	struct mem_cgroup *last_scanned_child;
> +	/*
> +	 * Should the accounting and control be hierarchical, per subtree?
> +	 */
> +	unsigned long use_hierarchy;
> +
>  };
>  static struct mem_cgroup init_mem_cgroup;
>  
> @@ -1079,6 +1084,33 @@ out:
>  	return ret;
>  }
>  
> +static u64 mem_cgroup_hierarchy_read(struct cgroup *cont, struct cftype *cft)
> +{
> +	return mem_cgroup_from_cont(cont)->use_hierarchy;
> +}
> +
> +static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
> +					u64 val)
> +{
> +	int retval = 0;
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> +
> +	if (val == 1) {
> +		if (list_empty(&cont->children))
> +			mem->use_hierarchy = val;
> +		else
> +			retval = -EBUSY;
> +	} else if (val == 0) {
> +		if (list_empty(&cont->children))
> +			mem->use_hierarchy = val;
> +		else
> +			retval = -EBUSY;
> +	} else
> +		retval = -EINVAL;
> +
> +	return retval;
> +}
> +
>  static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
>  {
>  	return res_counter_read_u64(&mem_cgroup_from_cont(cont)->res,
> @@ -1213,6 +1245,11 @@ static struct cftype mem_cgroup_files[] 
>  		.name = "stat",
>  		.read_map = mem_control_stat_show,
>  	},
> +	{
> +		.name = "use_hierarchy",
> +		.write_u64 = mem_cgroup_hierarchy_write,
> +		.read_u64 = mem_cgroup_hierarchy_read,
> +	},
>  };
>  
>  static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
> @@ -1289,9 +1326,13 @@ mem_cgroup_create(struct cgroup_subsys *
>  		parent = mem_cgroup_from_cont(cont->parent);
>  		if (!mem)
>  			return ERR_PTR(-ENOMEM);
> +		mem->use_hierarchy = parent->use_hierarchy;
>  	}
>  
> -	res_counter_init(&mem->res, parent ? &parent->res : NULL);
> +	if (parent && parent->use_hierarchy)
> +		res_counter_init(&mem->res, &parent->res);
> +	else
> +		res_counter_init(&mem->res, NULL);
>  
>  	for_each_node_state(node, N_POSSIBLE)
>  		if (alloc_mem_cgroup_per_zone_info(mem, node))
> _
> 
> -- 
> 	Balbir
> 


* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2)
  2008-11-11  3:06   ` KAMEZAWA Hiroyuki
@ 2008-11-11  4:47     ` Balbir Singh
  2008-11-11  5:01       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 15+ messages in thread
From: Balbir Singh @ 2008-11-11  4:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Sat, 08 Nov 2008 14:41:00 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> This patch introduces hierarchical reclaim. When an ancestor goes over its
>> limit, the charging routine points to the parent that is above its limit.
>> The reclaim process then starts from the last scanned child of the ancestor
>> and reclaims until the ancestor goes below its limit.
>>
>> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>> ---
>>
>>  mm/memcontrol.c |  152 +++++++++++++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 128 insertions(+), 24 deletions(-)
>>
>> diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
>> --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim	2008-11-08 14:09:32.000000000 +0530
>> +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-08 14:09:32.000000000 +0530
>> @@ -132,6 +132,11 @@ struct mem_cgroup {
>>  	 * statistics.
>>  	 */
>>  	struct mem_cgroup_stat stat;
>> +	/*
>> +	 * While reclaiming in a hierarchy, we cache the last child we
>> +	 * reclaimed from.
>> +	 */
>> +	struct mem_cgroup *last_scanned_child;
>>  };
>>  static struct mem_cgroup init_mem_cgroup;
>>  
>> @@ -467,6 +472,124 @@ unsigned long mem_cgroup_isolate_pages(u
>>  	return nr_taken;
>>  }
>>  
>> +static struct mem_cgroup *
>> +mem_cgroup_from_res_counter(struct res_counter *counter)
>> +{
>> +	return container_of(counter, struct mem_cgroup, res);
>> +}
>> +
>> +/*
>> + * Dance down the hierarchy if needed to reclaim memory. We remember the
>> + * last child we reclaimed from, so that we don't end up penalizing
>> + * one child extensively based on its position in the children list.
>> + *
>> + * root_mem is the original ancestor that we've been reclaiming from.
>> + */
>> +static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
>> +						struct mem_cgroup *root_mem,
>> +						gfp_t gfp_mask)
>> +{
>> +	struct cgroup *cg_current, *cgroup;
>> +	struct mem_cgroup *mem_child;
>> +	int ret = 0;
>> +
>> +	/*
>> +	 * Reclaim unconditionally and don't check for return value.
>> +	 * We need to reclaim in the current group and down the tree.
>> +	 * One might think about checking for children before reclaiming,
>> +	 * but there might be left over accounting, even after children
>> +	 * have left.
>> +	 */
>> +	try_to_free_mem_cgroup_pages(mem, gfp_mask);
>> +
>> +	if (res_counter_check_under_limit(&root_mem->res))
>> +		return 0;
>> +
>> +	if (list_empty(&mem->css.cgroup->children))
>> +		return 0;
>> +
>> +	/*
>> +	 * Scan all children under the mem_cgroup mem
>> +	 */
>> +	if (!mem->last_scanned_child)
>> +		cgroup = list_first_entry(&mem->css.cgroup->children,
>> +				struct cgroup, sibling);
>> +	else
>> +		cgroup = mem->last_scanned_child->css.cgroup;
>> +
> 
> Who guarantees that last_scanned_child is still accessible at this point?
> 

Good catch! I'll fix this in mem_cgroup_destroy. It'll need some locking around
it as well.

-- 
	Balbir


* Re: [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v2)
  2008-11-11  3:04   ` KAMEZAWA Hiroyuki
@ 2008-11-11  4:52     ` Balbir Singh
  0 siblings, 0 replies; 15+ messages in thread
From: Balbir Singh @ 2008-11-11  4:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Sat, 08 Nov 2008 14:40:47 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> Add support for building hierarchies in resource counters. Cgroups allow us
>> to build a deep hierarchy, but we currently don't link the resource counters
>> belonging to the memory controller control groups in the same fashion
>> as the corresponding cgroup entries in the cgroup hierarchy. This patch
>> provides the infrastructure for resource counters that have the same hierarchy
>> as their cgroup counterparts.
>>
>> This set of patches is based on the resource counter hierarchy patches posted
>> by Pavel Emelianov.
>>
>> NOTE: Building hierarchies is expensive; deeper hierarchies imply charging
>> all the way up to the root. It is known that hierarchies are expensive,
>> so the user needs to be careful and aware of the trade-offs before creating
>> very deep ones.
>>
> Do you have numbers?
> 

I am getting those numbers; I'll post them soon.

-- 
	Balbir


* Re: [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2)
  2008-11-11  3:10   ` KAMEZAWA Hiroyuki
@ 2008-11-11  5:01     ` Balbir Singh
  0 siblings, 0 replies; 15+ messages in thread
From: Balbir Singh @ 2008-11-11  5:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
> On Sat, 08 Nov 2008 14:41:13 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> Don't enable multiple hierarchy support by default. This patch introduces
>> a features element that can be set to enable the nested depth hierarchy
>> feature. This feature can only be enabled when the cgroup for which the
>> feature is enabled has no children.
>>
>> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>> ---
>>
> IMHO, this permission check is not enough.
> 
> I think the following is the sane way:
> ==
>    When parent->use_hierarchy==1:
> 	my->use_hierarchy must be "1" and cannot be turned to "0", even with no children.
>    When parent->use_hierarchy==0:
> 	my->use_hierarchy can be either "0" or "1";
> 	this value can be changed only if we have no children.
> ==

Sounds reasonable, will fix in v3.

-- 
	Balbir


* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2)
  2008-11-11  4:47     ` Balbir Singh
@ 2008-11-11  5:01       ` KAMEZAWA Hiroyuki
  2008-11-11  6:24         ` Balbir Singh
  0 siblings, 1 reply; 15+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-11-11  5:01 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

On Tue, 11 Nov 2008 10:17:27 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Sat, 08 Nov 2008 14:41:00 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> >> This patch introduces hierarchical reclaim. When an ancestor goes over its
> >> limit, the charging routine points to the parent that is above its limit.
> >> The reclaim process then starts from the last scanned child of the ancestor
> >> and reclaims until the ancestor goes below its limit.
> >>
> >> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> >> ---
> >>
> >>  mm/memcontrol.c |  152 +++++++++++++++++++++++++++++++++++++++++++++++---------
> >>  1 file changed, 128 insertions(+), 24 deletions(-)
> >>
> >> diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
> >> --- linux-2.6.28-rc2/mm/memcontrol.c~memcg-hierarchical-reclaim	2008-11-08 14:09:32.000000000 +0530
> >> +++ linux-2.6.28-rc2-balbir/mm/memcontrol.c	2008-11-08 14:09:32.000000000 +0530
> >> @@ -132,6 +132,11 @@ struct mem_cgroup {
> >>  	 * statistics.
> >>  	 */
> >>  	struct mem_cgroup_stat stat;
> >> +	/*
> >> +	 * While reclaiming in a hierarchy, we cache the last child we
> >> +	 * reclaimed from.
> >> +	 */
> >> +	struct mem_cgroup *last_scanned_child;
> >>  };
> >>  static struct mem_cgroup init_mem_cgroup;
> >>  
> >> @@ -467,6 +472,124 @@ unsigned long mem_cgroup_isolate_pages(u
> >>  	return nr_taken;
> >>  }
> >>  
> >> +static struct mem_cgroup *
> >> +mem_cgroup_from_res_counter(struct res_counter *counter)
> >> +{
> >> +	return container_of(counter, struct mem_cgroup, res);
> >> +}
> >> +
> >> +/*
> >> + * Dance down the hierarchy if needed to reclaim memory. We remember the
> >> + * last child we reclaimed from, so that we don't end up penalizing
> >> + * one child extensively based on its position in the children list.
> >> + *
> >> + * root_mem is the original ancestor that we've been reclaiming from.
> >> + */
> >> +static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *mem,
> >> +						struct mem_cgroup *root_mem,
> >> +						gfp_t gfp_mask)
> >> +{
> >> +	struct cgroup *cg_current, *cgroup;
> >> +	struct mem_cgroup *mem_child;
> >> +	int ret = 0;
> >> +
> >> +	/*
> >> +	 * Reclaim unconditionally and don't check for return value.
> >> +	 * We need to reclaim in the current group and down the tree.
> >> +	 * One might think about checking for children before reclaiming,
> >> +	 * but there might be left over accounting, even after children
> >> +	 * have left.
> >> +	 */
> >> +	try_to_free_mem_cgroup_pages(mem, gfp_mask);
> >> +
> >> +	if (res_counter_check_under_limit(&root_mem->res))
> >> +		return 0;
> >> +
> >> +	if (list_empty(&mem->css.cgroup->children))
> >> +		return 0;
> >> +
> >> +	/*
> >> +	 * Scan all children under the mem_cgroup mem
> >> +	 */
> >> +	if (!mem->last_scanned_child)
> >> +		cgroup = list_first_entry(&mem->css.cgroup->children,
> >> +				struct cgroup, sibling);
> >> +	else
> >> +		cgroup = mem->last_scanned_child->css.cgroup;
> >> +
> > 
> > Who guarantees that last_scanned_child is still accessible at this point?
> > 
> 
> Good catch! I'll fix this in mem_cgroup_destroy. It'll need some locking around
> it as well.
> 
Please see the mem+swap controller's refcnt-to-memcg patches, which delay freeing the memcg.
They should be a useful hint.

Thanks,
-Kame


> -- 
> 	Balbir
> 


* Re: [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2)
  2008-11-11  5:01       ` KAMEZAWA Hiroyuki
@ 2008-11-11  6:24         ` Balbir Singh
  0 siblings, 0 replies; 15+ messages in thread
From: Balbir Singh @ 2008-11-11  6:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
	Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani,
	Andrew Morton

KAMEZAWA Hiroyuki wrote:
>>
> please see mem+swap controller's refcnt-to-memcg for delaying free of memcg.
> it will be a hint.
> 

I'll integrate with those patches later. I see a memcg->swapref, but we don't
need to delay deletion for last_scanned_child; we only need to make sure that
the parent does not hold any stale entries.

-- 
	Balbir


end of thread, other threads:[~2008-11-11  6:24 UTC | newest]

Thread overview: 15+ messages
2008-11-08  9:10 [RFC][mm][PATCH 0/4] Memory cgroup hierarchy introduction (v2) Balbir Singh
2008-11-08  9:10 ` [RFC][mm] [PATCH 1/4] Memory cgroup hierarchy documentation (v2) Balbir Singh
2008-11-08  9:10 ` [RFC][mm] [PATCH 2/4] Memory cgroup resource counters for hierarchy (v2) Balbir Singh
2008-11-11  3:04   ` KAMEZAWA Hiroyuki
2008-11-11  4:52     ` Balbir Singh
2008-11-08  9:11 ` [RFC][mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v2) Balbir Singh
2008-11-11  3:06   ` KAMEZAWA Hiroyuki
2008-11-11  4:47     ` Balbir Singh
2008-11-11  5:01       ` KAMEZAWA Hiroyuki
2008-11-11  6:24         ` Balbir Singh
2008-11-08  9:11 ` [RFC][mm] [PATCH 4/4] Memory cgroup hierarchy feature selector (v2) Balbir Singh
2008-11-08  9:39   ` Li Zefan
2008-11-08  9:49     ` Balbir Singh
2008-11-11  3:10   ` KAMEZAWA Hiroyuki
2008-11-11  5:01     ` Balbir Singh
