linux-mm.kvack.org archive mirror
* [RFC][PATCH 0/2] memcg: hierarchy support (v3)
@ 2008-06-04  4:58 KAMEZAWA Hiroyuki
  2008-06-04  5:01 ` [RFC][PATCH 1/2] memcg: res_counter hierarchy KAMEZAWA Hiroyuki
                   ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-06-04  4:58 UTC (permalink / raw)
  To: linux-mm; +Cc: LKML, balbir, menage, xemul, yamamoto

Hi, this is the third version.

While the code changes themselves are small, the whole _tone_ of the code
has changed. I'm in no hurry; any comments are welcome.

Based on 2.6.26-rc2-mm1 + the memcg patches in the -mm queue.

Changes from v2:
 - Named it the HardWall policy.
 - Rewrote the code for readability; renamed some functions.
 - Added documentation text.
 - Added support for the hierarchy_model parameter.
   Now, no_hierarchy and hardwall_hierarchy are implemented.

HardWall Policy:
  - Designed for strict resource isolation under a hierarchy.
    Automatic load balancing between cgroups can break the user's
    assumptions even if it is implemented very well.
  - The parent's charge covers all of its children:
      parent->usage = resource used by itself + resource moved to children.
      Of course, parent->limit >= parent->usage.
  - When a child's limit is set, the resource moves.
  - There is no automatic resource moving between parent <-> child.

Example)
  1) Assume a cgroup with a 1GB limit (and no tasks belong to it yet).
     - group_A limit=1G,usage=0M.

  2) create group B, C under A.
     - group A limit=1G, usage=0M
          - group B limit=0M, usage=0M.
          - group C limit=0M, usage=0M.

  3) increase group B's limit to 300M.
     - group A limit=1G, usage=300M.
          - group B limit=300M, usage=0M.
          - group C limit=0M, usage=0M.

  4) increase group C's limit to 500M
     - group A limit=1G, usage=800M.
          - group B limit=300M, usage=0M.
          - group C limit=500M, usage=0M.

  5) reduce group B's limit to 100M
     - group A limit=1G, usage=600M.
          - group B limit=100M, usage=0M.
          - group C limit=500M, usage=0M.


Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

* Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy
  2008-06-01  0:35               ` kamezawa.hiroyu
@ 2008-06-02  6:16 Balbir Singh
  2008-05-31 17:18 ` Balbir Singh
  0 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2008-06-02  6:16 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: linux-mm, LKML, xemul, menage, yamamoto, lizf

Hi, Kamezawa-san,

kamezawa.hiroyu@jp.fujitsu.com wrote:
> 
> It's not a problem. We're not developing a world-wide ecosystem.
> It's good that there are several development groups; that's a path to evolution.
> Something popular will become the de facto standard.
> What we have to do is provide proper interfaces that allow a fair race.
> 

I did not claim that we were developing an ecosystem either :)
My point is that we should not confuse *Linux* users. Let's do the common/useful
stuff in the kernel and make it easy for users to use the cgroup subsystem.

>>> Here is an example. (just an example...)
>>> Please point out if I'm misunderstanding "share".
>>>
>>> root_level/                   = limit 1G.
>>>           /child_A = share=30
>>>           /child_B = share=15
>>>           /child_C = share=5
>>> (and assume there is no process under root_level, to make the explanation
>>> easy.)
>>> 0. At first, before starting to use memory, set all kernel_memory_limit.
>>> root_level.limit = 1G
>>>   child_A.limit=64M,usage=0
>>>   child_B.limit=64M,usage=0
>>>   child_C.limit=64M,usage=0
>>>   free_resource=808M 
>>>
>> This sounds incorrect, since the limits should be proportional to shares.
>> If the maximum shares in the root were 100 (ideally we want higher
>> resolution than that), then
>>
>> child_A.limit = .3 * 1G
>> child_B.limit = .15 * 1G
>>
>> and so on
>>
> The above just shows the parameters passed to the kernel.
> From the user's view, the memory limitation is A:B:C = 6:3:1 if memory is
> fully used. (In the above case, usage=0.)
> 
> In general, "share" works only when the total usage reaches the limit.
> (See how the cpu scheduler works.)
> When the usage doesn't reach the limit, there is no limitation.
> 

If you are implying that shares imply a soft limit, I agree. But the only
parameter in the kernel right now is the hard limit. We need to add soft
limit support.

>>> 1. Next, a process in child_C starts to run and uses 600M of memory.
>>> root_level.limit = 1G
>>>   child_A.limit=64M
>>>   child_B.limit=64M
>>>   child_C.limit=600M,usage=600M
>>>   free_resource=272M
>>>
>> How is that feasible? Its limit was 64M; how did it bump up to 600M? If you
>> want something like that, child_C should have no limits.
> 
> The middleware just acts when child_C.failcnt rises:
> echo 64M > child_C/memory.limit_in_bytes
> and periodically checks A, B, and C, allowing C to use what it wants because
> A and B don't want memory.
> 
>>> 2. Now, a process in child_A starts to run and uses 800M of memory.
>>>   child_A.limit=800M,usage=800M
>>>   child_B.limit=64M,usage=0M
>>>   child_C.limit=136M,usage=136M
>>>   free_resource=0, A:C=6:1
>>>
>> Not sure I understand this step
>>
> The middleware notices that usage in A is growing and moves resources to A:
> 
> echo (current child_C's limit - 64M) > child_C
> echo (current child_A's limit + 64M) > child_A
> 
> Do the above step by step, in a loop, until A:C = 6:1.
> (64M is just an example.)
> 
>>> 3. Finally, a process in child_B starts and uses 500M of memory.
>>>   child_A.limit=600M,usage=600M
>>>   child_B.limit=300M,usage=300M
>>>   child_C.limit=100M,usage=100M
>>>   free_resource=0, A:B:C=6:3:1
>>>
>> Not sure I understand this step
>>
> echo (current child_C's limit - 64M) > child_C
> echo (current child_A's limit - 64M) > child_A
> echo (current child_B's limit + 64M) > child_B
> Do the above step by step, in a loop, until A:B:C = 6:3:1.
> 
> 
>>> 4. One more: a process in A exits.
>>>   child_A.limit=64M, usage=0M
>>>   child_B.limit=500M, usage=500M
>>>   child_C.limit=436M, usage=436M
>>>   free_resource=0, B:C=3:1 (but B just wants to use 500M)
>>>
>> Not sure I understand this step
>>
> The middleware can notice that memory pressure from child_A has dropped:
> 
> echo (current child_A's limit - 64M) > child_A
> echo (current child_C's limit + 64M) > child_C
> echo (current child_B's limit + 64M) > child_B
> Do the above step by step, in a loop, until B:C = 3:1, while avoiding
> wasted resources.
> 
> 
> 
>>> This is only an example; the middleware can do more precise "limit"
>>> control by checking the statistics of the memory controller hierarchy,
>>> based on its own policy.
>>>
>>> What I am thinking about now is what kind of statistics/notifiers/controls
>>> are necessary to implement shares in middleware. How precisely and quickly
>>> the middleware can work depends on the interfaces.
>>> Maybe the middleware should know "how fast the application runs now" via
>>> some kind of check or a co-operative interface with the application.
>>> But I'm not sure how the kernel can help with that.
>> I am not sure if I understand your proposal at this point.
>>
> 
> The most important point is that memory.limit_in_bytes is _just_ a
> notification asking the kernel to limit the memory usage of a process
> group temporarily. It changes often.
> Based on the user's request to the middleware (a share or a limit),
> the middleware sets limit_in_bytes to a suitable value and changes it
> dynamically and periodically.
> 

Why don't we add soft limits, so that we don't have to go to the kernel and
change limits frequently? One missing piece in the memory controller is that we
don't shrink memory when limits change or when tasks move. I think soft limits
are a better solution.

Thanks for patiently explaining all of this.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy
  2008-05-30  1:45         ` [RFC][PATCH 1/2] memcg: res_counter hierarchy KAMEZAWA Hiroyuki
@ 2008-06-02  2:15 YAMAMOTO Takashi
  1 sibling, 0 replies; 38+ messages in thread
From: YAMAMOTO Takashi @ 2008-06-02  2:15 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: linux-mm, linux-kernel, balbir, xemul, menage, lizf

> @@ -135,13 +138,118 @@ ssize_t res_counter_write(struct res_cou
>  		if (*end != '\0')
>  			goto out_free;
>  	}
> -	spin_lock_irqsave(&counter->lock, flags);
> -	val = res_counter_member(counter, member);
> -	*val = tmp;
> -	spin_unlock_irqrestore(&counter->lock, flags);
> -	ret = nbytes;
> +	if (member != RES_LIMIT || !callback) {

is there any reason to check member != RES_LIMIT here,
rather than in callers?

> +/*
> + * Move resource to its parent.
> + *   child->limit -= val.
> + *   parent->usage -= val.
> + *   parent->limit -= val.

s/limit/for_children/

> + */
> +
> +int res_counter_repay_resource(struct res_counter *child,
> +				struct res_counter *parent,
> +				unsigned long long val,
> +				res_shrink_callback_t callback, int retry)

Can you reduce the gratuitous differences between
res_counter_borrow_resource and res_counter_repay_resource?
E.g. 'success' vs 'done', and how 'retry' is decremented.

YAMAMOTO Takashi



end of thread, other threads:[~2008-06-12  5:00 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-06-04  4:58 [RFC][PATCH 0/2] memcg: hierarchy support (v3) KAMEZAWA Hiroyuki
2008-06-04  5:01 ` [RFC][PATCH 1/2] memcg: res_counter hierarchy KAMEZAWA Hiroyuki
2008-06-04  6:54   ` Li Zefan
2008-06-04  7:03     ` KAMEZAWA Hiroyuki
2008-06-04  7:20   ` YAMAMOTO Takashi
2008-06-04  7:32     ` KAMEZAWA Hiroyuki
2008-06-04  8:59   ` Paul Menage
2008-06-04  9:18     ` KAMEZAWA Hiroyuki
2008-06-09  9:48   ` Balbir Singh
2008-06-09 10:20     ` KAMEZAWA Hiroyuki
2008-06-09 10:37       ` Balbir Singh
2008-06-09 12:02       ` kamezawa.hiroyu
2008-06-11 23:24   ` Randy Dunlap
2008-06-12  4:59     ` KAMEZAWA Hiroyuki
2008-06-04  5:03 ` [RFC][PATCH 2/2] memcg: hardwall hierarhcy for memcg KAMEZAWA Hiroyuki
2008-06-04  6:42   ` Li Zefan
2008-06-04  6:54     ` KAMEZAWA Hiroyuki
2008-06-04  8:59   ` Paul Menage
2008-06-04  9:26     ` KAMEZAWA Hiroyuki
2008-06-04 12:53       ` Daisuke Nishimura
2008-06-04 12:32   ` Daisuke Nishimura
2008-06-05  0:04     ` KAMEZAWA Hiroyuki
2008-06-09 10:56   ` Balbir Singh
2008-06-09 12:09   ` kamezawa.hiroyu
2008-06-11 23:24   ` Randy Dunlap
2008-06-12  5:00     ` KAMEZAWA Hiroyuki
2008-06-04  8:59 ` [RFC][PATCH 0/2] memcg: hierarchy support (v3) Paul Menage
2008-06-04  9:15   ` KAMEZAWA Hiroyuki
2008-06-04  9:15     ` Paul Menage
2008-06-04  9:31       ` KAMEZAWA Hiroyuki
2008-06-09  9:30 ` Balbir Singh
2008-06-09  9:55   ` KAMEZAWA Hiroyuki
2008-06-09 10:33     ` Balbir Singh
  -- strict thread matches above, loose matches on Subject: below --
2008-06-02  6:16 [RFC][PATCH 1/2] memcg: res_counter hierarchy Balbir Singh
2008-05-31 17:18 ` Balbir Singh
2008-05-31 11:20   ` Balbir Singh
2008-05-30 22:20     ` Balbir Singh
2008-05-30  1:43       ` [RFC][PATCH 0/2] memcg: simple hierarchy (v2) KAMEZAWA Hiroyuki
2008-05-30  1:45         ` [RFC][PATCH 1/2] memcg: res_counter hierarchy KAMEZAWA Hiroyuki
2008-05-31  1:59           ` kamezawa.hiroyu
2008-05-31 14:47             ` kamezawa.hiroyu
2008-06-01  0:35               ` kamezawa.hiroyu
2008-06-02  9:48                 ` kamezawa.hiroyu
2008-06-02  9:52           ` kamezawa.hiroyu
2008-06-02  2:15 YAMAMOTO Takashi
