linux-mm.kvack.org archive mirror
* Help Resource Counters Scale Better (v2)
@ 2009-08-07 22:12 Balbir Singh
  2009-08-08  1:11 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2009-08-07 22:12 UTC (permalink / raw)
  To: Andrew Morton, andi.kleen, Prarit Bhargava
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, lizf, menage,
	Pavel Emelianov, linux-kernel, linux-mm

Enhancement: For scalability move the resource counter to a percpu counter

Changelog v2->v1

From: Balbir Singh <balbir@linux.vnet.ibm.com>

1. Updated Documentation (cgroups.txt and resource_counters.txt)
2. Added the notion of tolerance to resource counter initialization

I posted an early RFI of the patches and got several responses stating
that this would be a nice-to-have patch and that we should work on it
right away.

This patch changes the usage field of a resource counter to a percpu
counter. The counter is incremented with local irq disabled. The other
fields are still protected by the spin lock for write.

This patch adds a fuzziness factor to the hard limit: since the value we
read can be off from the true value (by up to the batch value), limit
enforcement is no longer exact. This can be addressed by adding a
strict/non-strict checking mode; the intention is to turn off strict
checking for root (since we can't set limits on it anyway).
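
To illustrate the semantics (a sketch only, not code from the patch):
reads on the charge path use percpu_counter_read_positive(), which
returns only the folded global count, so the value seen can lag the
true usage by up to roughly (number of cpus) * (the batch passed to
__percpu_counter_add()). Something like:

	#include <linux/percpu_counter.h>

	/*
	 * Sketch: approximate limit check. percpu_counter_read_positive()
	 * ignores the unfolded per-cpu deltas, so the result may be off
	 * by up to (number of cpus) * (per-cpu batch).
	 */
	static bool usage_under_limit_fuzzy(struct percpu_counter *usage,
					    unsigned long long limit)
	{
		return (unsigned long long)
			percpu_counter_read_positive(usage) < limit;
	}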

I tested this patch on my x86_64 box with a regular test for hard
limits and a page fault program. I also enabled lockdep, and lock_stat
clearly shows that the contention on counter->lock is down quite
significantly. I tested these patches against an older mmotm, but
they should apply cleanly to the 6th August mmotm as well.

Please test/review the patches.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/cgroups/memory.txt           |    3 ++
 Documentation/cgroups/resource_counter.txt |   11 +++++-
 include/linux/res_counter.h                |   48 ++++++++++++++++-----------
 kernel/res_counter.c                       |   50 +++++++++++++++++-----------
 mm/memcontrol.c                            |   31 +++++++++++++----
 5 files changed, 96 insertions(+), 47 deletions(-)


diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f25..8f86537 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -13,6 +13,9 @@ c. Provides *zero overhead* for non memory controller users
 d. Provides a double LRU: global memory pressure causes reclaim from the
    global LRU; a cgroup on hitting a limit, reclaims from the per
    cgroup LRU
+   NOTE: One can no longer rely on the exact limit. Since we've moved
+   to using percpu_counters for resource counters, there is always going
+   to be a fuzziness factor depending on the batch value.
 
 Benefits and Purpose of the memory controller
 
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..d3b276b 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -12,12 +12,15 @@ to work with it.
 
 1. Crucial parts of the res_counter structure
 
- a. unsigned long long usage
+ a. percpu_counter usage
 
  	The usage value shows the amount of a resource that is consumed
 	by a group at a given time. The units of measurement should be
 	determined by the controller that uses this counter. E.g. it can
 	be bytes, items or any other unit the controller operates on.
+	NOTE: being a percpu_counter, the way to read the correct value
+	at all times makes it unscalable and reading it scalably makes
+	the value a little unreliable :)
 
  b. unsigned long long max_usage
 
@@ -48,7 +51,8 @@ to work with it.
 2. Basic accounting routines
 
  a. void res_counter_init(struct res_counter *rc,
-				struct res_counter *rc_parent)
+				struct res_counter *rc_parent,
+				unsigned long tolerance)
 
  	Initializes the resource counter. As usual, should be the first
 	routine called for a new counter.
@@ -57,6 +61,9 @@ to work with it.
 	child -> parent relationship directly in the res_counter structure,
 	NULL can be used to define no relationship.
 
+	The tolerance is used to control the batching behaviour of percpu
+	counters
+
  c. int res_counter_charge(struct res_counter *rc, unsigned long val,
 				struct res_counter **limit_fail_at)
 
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 731af71..2d412d7 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -14,6 +14,7 @@
  */
 
 #include <linux/cgroup.h>
+#include <linux/percpu_counter.h>
 
 /*
  * The core object. the cgroup that wishes to account for some
@@ -23,10 +24,6 @@
 
 struct res_counter {
 	/*
-	 * the current resource consumption level
-	 */
-	unsigned long long usage;
-	/*
 	 * the maximal value of the usage from the counter creation
 	 */
 	unsigned long long max_usage;
@@ -48,6 +45,14 @@ struct res_counter {
 	 */
 	spinlock_t lock;
 	/*
+	 * the current resource consumption level
+	 */
+	struct percpu_counter usage;
+	/*
+	 * Tolerance for the percpu_counter (usage) above
+	 */
+	unsigned long usage_tolerance;
+	/*
 	 * Parent counter, used for hierarchial resource accounting
 	 */
 	struct res_counter *parent;
@@ -98,7 +103,8 @@ enum {
  * helpers for accounting
  */
 
-void res_counter_init(struct res_counter *counter, struct res_counter *parent);
+void res_counter_init(struct res_counter *counter, struct res_counter *parent,
+			unsigned long usage_tolerance);
 
 /*
  * charge - try to consume more resource.
@@ -133,7 +139,8 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
 
 static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 {
-	if (cnt->usage < cnt->limit)
+	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
+	if (usage < cnt->limit)
 		return true;
 
 	return false;
@@ -141,7 +148,8 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 
 static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
 {
-	if (cnt->usage < cnt->soft_limit)
+	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
+	if (usage < cnt->soft_limit)
 		return true;
 
 	return false;
@@ -157,15 +165,15 @@ static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
 static inline unsigned long long
 res_counter_soft_limit_excess(struct res_counter *cnt)
 {
-	unsigned long long excess;
-	unsigned long flags;
+	unsigned long long excess, usage;
 
-	spin_lock_irqsave(&cnt->lock, flags);
-	if (cnt->usage <= cnt->soft_limit)
+	usage = percpu_counter_read_positive(&cnt->usage);
+	preempt_disable();
+	if (usage <= cnt->soft_limit)
 		excess = 0;
 	else
-		excess = cnt->usage - cnt->soft_limit;
-	spin_unlock_irqrestore(&cnt->lock, flags);
+		excess = usage - cnt->soft_limit;
+	preempt_enable();
 	return excess;
 }
 
@@ -178,9 +186,9 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
 	bool ret;
 	unsigned long flags;
 
-	spin_lock_irqsave(&cnt->lock, flags);
+	local_irq_save(flags);
 	ret = res_counter_limit_check_locked(cnt);
-	spin_unlock_irqrestore(&cnt->lock, flags);
+	local_irq_restore(flags);
 	return ret;
 }
 
@@ -189,18 +197,19 @@ static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
 	bool ret;
 	unsigned long flags;
 
-	spin_lock_irqsave(&cnt->lock, flags);
+	local_irq_save(flags);
 	ret = res_counter_soft_limit_check_locked(cnt);
-	spin_unlock_irqrestore(&cnt->lock, flags);
+	local_irq_restore(flags);
 	return ret;
 }
 
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
+	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
 
 	spin_lock_irqsave(&cnt->lock, flags);
-	cnt->max_usage = cnt->usage;
+	cnt->max_usage = usage;
 	spin_unlock_irqrestore(&cnt->lock, flags);
 }
 
@@ -217,10 +226,11 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
 		unsigned long long limit)
 {
 	unsigned long flags;
+	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
 	int ret = -EBUSY;
 
 	spin_lock_irqsave(&cnt->lock, flags);
-	if (cnt->usage <= limit) {
+	if (usage <= limit) {
 		cnt->limit = limit;
 		ret = 0;
 	}
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 88faec2..ae83168 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -15,24 +15,34 @@
 #include <linux/uaccess.h>
 #include <linux/mm.h>
 
-void res_counter_init(struct res_counter *counter, struct res_counter *parent)
+void res_counter_init(struct res_counter *counter, struct res_counter *parent,
+			unsigned long usage_tolerance)
 {
 	spin_lock_init(&counter->lock);
+	percpu_counter_init(&counter->usage, 0);
 	counter->limit = RESOURCE_MAX;
 	counter->soft_limit = RESOURCE_MAX;
 	counter->parent = parent;
+	counter->usage_tolerance = usage_tolerance;
 }
 
 int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 {
-	if (counter->usage + val > counter->limit) {
+	unsigned long long usage;
+
+	usage = percpu_counter_read_positive(&counter->usage);
+	if (usage + val > counter->limit) {
 		counter->failcnt++;
 		return -ENOMEM;
 	}
 
-	counter->usage += val;
-	if (counter->usage > counter->max_usage)
-		counter->max_usage = counter->usage;
+	__percpu_counter_add(&counter->usage, val, nr_cpu_ids *
+				counter->usage_tolerance);
+	if (usage + val > counter->max_usage) {
+		spin_lock(&counter->lock);
+		counter->max_usage = (usage + val);
+		spin_unlock(&counter->lock);
+	}
 	return 0;
 }
 
@@ -49,7 +59,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
 		*soft_limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
-		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		/*
 		 * With soft limits, we return the highest ancestor
@@ -58,7 +67,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
 		if (soft_limit_fail_at &&
 			!res_counter_soft_limit_check_locked(c))
 			*soft_limit_fail_at = c;
-		spin_unlock(&c->lock);
 		if (ret < 0) {
 			*limit_fail_at = c;
 			goto undo;
@@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
 	goto done;
 undo:
 	for (u = counter; u != c; u = u->parent) {
-		spin_lock(&u->lock);
 		res_counter_uncharge_locked(u, val);
-		spin_unlock(&u->lock);
 	}
 done:
 	local_irq_restore(flags);
@@ -79,10 +85,13 @@ done:
 
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
 {
-	if (WARN_ON(counter->usage < val))
-		val = counter->usage;
+	unsigned long long usage;
+
+	usage = percpu_counter_read_positive(&counter->usage);
+	if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
+		val = usage;
 
-	counter->usage -= val;
+	percpu_counter_sub(&counter->usage, val);
 }
 
 void res_counter_uncharge(struct res_counter *counter, unsigned long val,
@@ -93,12 +102,10 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
 
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
-		spin_lock(&c->lock);
 		if (was_soft_limit_excess)
 			*was_soft_limit_excess =
 				!res_counter_soft_limit_check_locked(c);
 		res_counter_uncharge_locked(c, val);
-		spin_unlock(&c->lock);
 	}
 	local_irq_restore(flags);
 }
@@ -108,8 +115,6 @@ static inline unsigned long long *
 res_counter_member(struct res_counter *counter, int member)
 {
 	switch (member) {
-	case RES_USAGE:
-		return &counter->usage;
 	case RES_MAX_USAGE:
 		return &counter->max_usage;
 	case RES_LIMIT:
@@ -128,11 +133,15 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
 		const char __user *userbuf, size_t nbytes, loff_t *pos,
 		int (*read_strategy)(unsigned long long val, char *st_buf))
 {
-	unsigned long long *val;
+	unsigned long long *val, usage_val;
 	char buf[64], *s;
 
 	s = buf;
-	val = res_counter_member(counter, member);
+	if (member == RES_USAGE) {
+		usage_val = percpu_counter_read_positive(&counter->usage);
+		val = &usage_val;
+	} else
+		val = res_counter_member(counter, member);
 	if (read_strategy)
 		s += read_strategy(*val, s);
 	else
@@ -143,7 +152,10 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
 
 u64 res_counter_read_u64(struct res_counter *counter, int member)
 {
-	return *res_counter_member(counter, member);
+	if (member == RES_USAGE)
+		return percpu_counter_read_positive(&counter->usage);
+	else
+		return *res_counter_member(counter, member);
 }
 
 int res_counter_memparse_write_strategy(const char *buf,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 48a38e1..17d305d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -58,6 +58,19 @@ static DEFINE_MUTEX(memcg_tasklist);	/* can be hold under cgroup_mutex */
 #define SOFTLIMIT_EVENTS_THRESH (1000)
 
 /*
+ * To help resource counters scale, we take a step back
+ * and allow the counters to be scalable and set a
+ * batch value such that every addition does not cause
+ * global synchronization. The side-effect will be visible
+ * on limit enforcement, where due to this fuzziness,
+ * we will lose out on inforcing a limit when the usage
+ * exceeds the limit. The plan however in the long run
+ * is to allow this value to be controlled. We will
+ * probably add a new control file for it.
+ */
+#define MEM_CGROUP_RES_ERR_TOLERANCE (4 * PAGE_SIZE)
+
+/*
  * Statistics for memory cgroup.
  */
 enum mem_cgroup_stat_index {
@@ -2340,7 +2353,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *mem, bool free_all)
 	if (free_all)
 		goto try_to_free;
 move_account:
-	while (mem->res.usage > 0) {
+	while (res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
 		ret = -EBUSY;
 		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
 			goto out;
@@ -2383,7 +2396,7 @@ try_to_free:
 	lru_add_drain_all();
 	/* try to free all pages in this cgroup */
 	shrink = 1;
-	while (nr_retries && mem->res.usage > 0) {
+	while (nr_retries && res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
 		int progress;
 
 		if (signal_pending(current)) {
@@ -2401,7 +2414,7 @@ try_to_free:
 	}
 	lru_add_drain();
 	/* try move_account...there may be some *locked* pages. */
-	if (mem->res.usage)
+	if (res_counter_read_u64(&mem->res, RES_USAGE))
 		goto move_account;
 	ret = 0;
 	goto out;
@@ -3019,8 +3032,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	}
 
 	if (parent && parent->use_hierarchy) {
-		res_counter_init(&mem->res, &parent->res);
-		res_counter_init(&mem->memsw, &parent->memsw);
+		res_counter_init(&mem->res, &parent->res,
+			MEM_CGROUP_RES_ERR_TOLERANCE);
+		res_counter_init(&mem->memsw, &parent->memsw,
+			MEM_CGROUP_RES_ERR_TOLERANCE);
 		/*
 		 * We increment refcnt of the parent to ensure that we can
 		 * safely access it on res_counter_charge/uncharge.
@@ -3029,8 +3044,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		 */
 		mem_cgroup_get(parent);
 	} else {
-		res_counter_init(&mem->res, NULL);
-		res_counter_init(&mem->memsw, NULL);
+		res_counter_init(&mem->res, NULL,
+			MEM_CGROUP_RES_ERR_TOLERANCE);
+		res_counter_init(&mem->memsw, NULL,
+			MEM_CGROUP_RES_ERR_TOLERANCE);
 	}
 	mem->last_scanned_child = 0;
 	spin_lock_init(&mem->reclaim_param_lock);


-- 
	Balbir


* Re: Help Resource Counters Scale Better (v2)
  2009-08-07 22:12 Help Resource Counters Scale Better (v2) Balbir Singh
@ 2009-08-08  1:11 ` KAMEZAWA Hiroyuki
  2009-08-08  6:05   ` Balbir Singh
  0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-08  1:11 UTC (permalink / raw)
  To: balbir
  Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, lizf, menage, Pavel Emelianov, linux-kernel,
	linux-mm

Balbir Singh wrote:
> Enhancement: For scalability move the resource counter to a percpu counter
>
> Changelog v2->v1
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> 1. Updated Documentation (cgroups.txt and resource_counters.txt)
> 2. Added the notion of tolerance to resource counter initialization
>
Looks better, but I have a few concerns/nitpicks.


> I tested this patch on my x86_64 box with a regular test for hard
> limits and a page fault program. I also enabled lockdep and lock_stat
> clearly shows that the contention on counter->lock is down quite
> significantly. I tested these patches against an older mmotm, but
> this should apply cleanly to the 6th August mmotm as well.
>
It's always helpful if the numbers are shown.


> diff --git a/Documentation/cgroups/memory.txt
> b/Documentation/cgroups/memory.txt
> index b871f25..8f86537 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -13,6 +13,9 @@ c. Provides *zero overhead* for non memory controller
> users
>  d. Provides a double LRU: global memory pressure causes reclaim from the
>     global LRU; a cgroup on hitting a limit, reclaims from the per
>     cgroup LRU
> +   NOTE: One can no longer rely on the exact limit. Since we've moved
> +   to using percpu_counters for resource counters, there is always going
> +   to be a fuzziness factor depending on the batch value.
>
This text is written from the implementation's point of view.
Please explain it for people who are reading about memcg for the first time.


>  Benefits and Purpose of the memory controller
>
> diff --git a/Documentation/cgroups/resource_counter.txt
> b/Documentation/cgroups/resource_counter.txt
> index 95b24d7..d3b276b 100644
> --- a/Documentation/cgroups/resource_counter.txt
> +++ b/Documentation/cgroups/resource_counter.txt
> @@ -12,12 +12,15 @@ to work with it.
>
>  1. Crucial parts of the res_counter structure
>
> - a. unsigned long long usage
> + a. percpu_counter usage
>
>   	The usage value shows the amount of a resource that is consumed
>  	by a group at a given time. The units of measurement should be
>  	determined by the controller that uses this counter. E.g. it can
>  	be bytes, items or any other unit the controller operates on.
> +	NOTE: being a percpu_counter, the way to read the correct value
> +	at all times makes it unscalable and reading it scalably makes
> +	the value a little unreliable :)
>
ditto.

>   b. unsigned long long max_usage
>
> @@ -48,7 +51,8 @@ to work with it.
>  2. Basic accounting routines
>
>   a. void res_counter_init(struct res_counter *rc,
> -				struct res_counter *rc_parent)
> +				struct res_counter *rc_parent,
> +				unsigned long tolerance)
>
>   	Initializes the resource counter. As usual, should be the first
>  	routine called for a new counter.
> @@ -57,6 +61,9 @@ to work with it.
>  	child -> parent relationship directly in the res_counter structure,
>  	NULL can be used to define no relationship.
>
> +	The tolerance is used to control the batching behaviour of percpu
> +	counters
> +
This description is ambiguous.
What is the system's total tolerance ?
    tolerance ?
    nr_online_cpus * tolerance ?
    MAX_CPUS * tolerance ?


>   c. int res_counter_charge(struct res_counter *rc, unsigned long val,
>  				struct res_counter **limit_fail_at)
>
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index 731af71..2d412d7 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -14,6 +14,7 @@
>   */
>
>  #include <linux/cgroup.h>
> +#include <linux/percpu_counter.h>
>
>  /*
>   * The core object. the cgroup that wishes to account for some
> @@ -23,10 +24,6 @@
>
>  struct res_counter {
>  	/*
> -	 * the current resource consumption level
> -	 */
> -	unsigned long long usage;
> -	/*
>  	 * the maximal value of the usage from the counter creation
>  	 */
>  	unsigned long long max_usage;
> @@ -48,6 +45,14 @@ struct res_counter {
>  	 */
>  	spinlock_t lock;
>  	/*
> +	 * the current resource consumption level
> +	 */
> +	struct percpu_counter usage;
> +	/*
> +	 * Tolerance for the percpu_counter (usage) above
> +	 */
> +	unsigned long usage_tolerance;
> +	/*
>  	 * Parent counter, used for hierarchial resource accounting
>  	 */
>  	struct res_counter *parent;
> @@ -98,7 +103,8 @@ enum {
>   * helpers for accounting
>   */
>
> -void res_counter_init(struct res_counter *counter, struct res_counter
> *parent);
> +void res_counter_init(struct res_counter *counter, struct res_counter
> *parent,
> +			unsigned long usage_tolerance);
>
>  /*
>   * charge - try to consume more resource.
> @@ -133,7 +139,8 @@ void res_counter_uncharge(struct res_counter *counter,
> unsigned long val,
>
>  static inline bool res_counter_limit_check_locked(struct res_counter
> *cnt)
>  {
> -	if (cnt->usage < cnt->limit)
> +	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> +	if (usage < cnt->limit)
>  		return true;
>
Hmm. In memcg, this function is not used on a busy path but on an
important path, to check whether usage is under the limit (and whether
to continue reclaim).

Can't we add res_counter_check_locked_exact(), which uses
percpu_counter_sum(), later?
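
Something like this, perhaps (just a sketch, the name is only an example):

	static inline bool
	res_counter_check_locked_exact(struct res_counter *cnt)
	{
		/*
		 * percpu_counter_sum_positive() folds in every cpu's
		 * local delta: exact, but it has to visit all cpus
		 * under the counter's internal lock.
		 */
		return percpu_counter_sum_positive(&cnt->usage) < cnt->limit;
	}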

>  	return false;
> @@ -141,7 +148,8 @@ static inline bool
> res_counter_limit_check_locked(struct res_counter *cnt)
>
>  static inline bool res_counter_soft_limit_check_locked(struct res_counter
> *cnt)
>  {
> -	if (cnt->usage < cnt->soft_limit)
> +	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> +	if (usage < cnt->soft_limit)
>  		return true;
>
>  	return false;
> @@ -157,15 +165,15 @@ static inline bool
> res_counter_soft_limit_check_locked(struct res_counter *cnt)
>  static inline unsigned long long
>  res_counter_soft_limit_excess(struct res_counter *cnt)
>  {
> -	unsigned long long excess;
> -	unsigned long flags;
> +	unsigned long long excess, usage;
>
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	if (cnt->usage <= cnt->soft_limit)
> +	usage = percpu_counter_read_positive(&cnt->usage);
> +	preempt_disable();
> +	if (usage <= cnt->soft_limit)
>  		excess = 0;
>  	else
> -		excess = cnt->usage - cnt->soft_limit;
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> +		excess = usage - cnt->soft_limit;
> +	preempt_enable();
>  	return excess;
>  }
It's not clear why this part uses preempt_disable() instead of
irqsave(). Could you add a comment?
(*AND* it seems the caller disables irqs....)


>
> @@ -178,9 +186,9 @@ static inline bool
> res_counter_check_under_limit(struct res_counter *cnt)
>  	bool ret;
>  	unsigned long flags;
>
> -	spin_lock_irqsave(&cnt->lock, flags);
> +	local_irq_save(flags);
>  	ret = res_counter_limit_check_locked(cnt);
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> +	local_irq_restore(flags);
>  	return ret;
>  }
>
> @@ -189,18 +197,19 @@ static inline bool
> res_counter_check_under_soft_limit(struct res_counter *cnt)
>  	bool ret;
>  	unsigned long flags;
>
> -	spin_lock_irqsave(&cnt->lock, flags);
> +	local_irq_save(flags);
>  	ret = res_counter_soft_limit_check_locked(cnt);
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> +	local_irq_restore(flags);
>  	return ret;
>  }
>
>  static inline void res_counter_reset_max(struct res_counter *cnt)
>  {
>  	unsigned long flags;
> +	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
>
>  	spin_lock_irqsave(&cnt->lock, flags);
> -	cnt->max_usage = cnt->usage;
> +	cnt->max_usage = usage;
>  	spin_unlock_irqrestore(&cnt->lock, flags);
>  }
>
> @@ -217,10 +226,11 @@ static inline int res_counter_set_limit(struct
> res_counter *cnt,
>  		unsigned long long limit)
>  {
>  	unsigned long flags;
> +	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
>  	int ret = -EBUSY;
>
>  	spin_lock_irqsave(&cnt->lock, flags);
> -	if (cnt->usage <= limit) {
> +	if (usage <= limit) {
>  		cnt->limit = limit;
>  		ret = 0;
>  	}

For the same reason as the limit check, I want the correct number here.
percpu_counter_sum() is better.


> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 88faec2..ae83168 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -15,24 +15,34 @@
>  #include <linux/uaccess.h>
>  #include <linux/mm.h>
>
> -void res_counter_init(struct res_counter *counter, struct res_counter
> *parent)
> +void res_counter_init(struct res_counter *counter, struct res_counter
> *parent,
> +			unsigned long usage_tolerance)
>  {
>  	spin_lock_init(&counter->lock);
> +	percpu_counter_init(&counter->usage, 0);
>  	counter->limit = RESOURCE_MAX;
>  	counter->soft_limit = RESOURCE_MAX;
>  	counter->parent = parent;
> +	counter->usage_tolerance = usage_tolerance;
>  }
>
>  int res_counter_charge_locked(struct res_counter *counter, unsigned long
> val)
>  {
> -	if (counter->usage + val > counter->limit) {
> +	unsigned long long usage;
> +
> +	usage = percpu_counter_read_positive(&counter->usage);
> +	if (usage + val > counter->limit) {
>  		counter->failcnt++;
>  		return -ENOMEM;
>  	}
>
> -	counter->usage += val;
> -	if (counter->usage > counter->max_usage)
> -		counter->max_usage = counter->usage;
> +	__percpu_counter_add(&counter->usage, val, nr_cpu_ids *
> +				counter->usage_tolerance);
> +	if (usage + val > counter->max_usage) {
> +		spin_lock(&counter->lock);
> +		counter->max_usage = (usage + val);
> +		spin_unlock(&counter->lock);
> +	}
Hmm...irq is already off here ?


>  	return 0;
>  }
>
> @@ -49,7 +59,6 @@ int res_counter_charge(struct res_counter *counter,
> unsigned long val,
>  		*soft_limit_fail_at = NULL;
>  	local_irq_save(flags);
>  	for (c = counter; c != NULL; c = c->parent) {
> -		spin_lock(&c->lock);
>  		ret = res_counter_charge_locked(c, val);
>  		/*
>  		 * With soft limits, we return the highest ancestor
> @@ -58,7 +67,6 @@ int res_counter_charge(struct res_counter *counter,
> unsigned long val,
>  		if (soft_limit_fail_at &&
>  			!res_counter_soft_limit_check_locked(c))
>  			*soft_limit_fail_at = c;
> -		spin_unlock(&c->lock);
>  		if (ret < 0) {
>  			*limit_fail_at = c;
>  			goto undo;
> @@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter,
> unsigned long val,
>  	goto done;
>  undo:
>  	for (u = counter; u != c; u = u->parent) {
> -		spin_lock(&u->lock);
>  		res_counter_uncharge_locked(u, val);
> -		spin_unlock(&u->lock);
>  	}
>  done:

When using the hierarchy, the tolerance at the root node will be bigger.
Please add a note about this to the documentation.


>  	local_irq_restore(flags);
> @@ -79,10 +85,13 @@ done:
>
>  void res_counter_uncharge_locked(struct res_counter *counter, unsigned
> long val)
>  {
> -	if (WARN_ON(counter->usage < val))
> -		val = counter->usage;
> +	unsigned long long usage;
> +
> +	usage = percpu_counter_read_positive(&counter->usage);
> +	if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
> +		val = usage;
Is this correct? (Or do we need this WARN_ON at all?)
Hmm. percpu_counter is cpu-hotplug aware, so nr_cpu_ids is not the
right bound, but nr_online_cpus() is heavy.. hmm.


>
> -	counter->usage -= val;
> +	percpu_counter_sub(&counter->usage, val);
>  }
>
>  void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> @@ -93,12 +102,10 @@ void res_counter_uncharge(struct res_counter
> *counter, unsigned long val,
>
>  	local_irq_save(flags);
>  	for (c = counter; c != NULL; c = c->parent) {
> -		spin_lock(&c->lock);
>  		if (was_soft_limit_excess)
>  			*was_soft_limit_excess =
>  				!res_counter_soft_limit_check_locked(c);
>  		res_counter_uncharge_locked(c, val);
> -		spin_unlock(&c->lock);
>  	}
>  	local_irq_restore(flags);
>  }
> @@ -108,8 +115,6 @@ static inline unsigned long long *
>  res_counter_member(struct res_counter *counter, int member)
>  {
>  	switch (member) {
> -	case RES_USAGE:
> -		return &counter->usage;
>  	case RES_MAX_USAGE:
>  		return &counter->max_usage;
>  	case RES_LIMIT:
> @@ -128,11 +133,15 @@ ssize_t res_counter_read(struct res_counter
> *counter, int member,
>  		const char __user *userbuf, size_t nbytes, loff_t *pos,
>  		int (*read_strategy)(unsigned long long val, char *st_buf))
>  {
> -	unsigned long long *val;
> +	unsigned long long *val, usage_val;
>  	char buf[64], *s;
>
>  	s = buf;
> -	val = res_counter_member(counter, member);
> +	if (member == RES_USAGE) {
> +		usage_val = percpu_counter_read_positive(&counter->usage);
> +		val = &usage_val;
> +	} else
> +		val = res_counter_member(counter, member);
>  	if (read_strategy)
>  		s += read_strategy(*val, s);


>  	else
> @@ -143,7 +152,10 @@ ssize_t res_counter_read(struct res_counter *counter,
> int member,
>
>  u64 res_counter_read_u64(struct res_counter *counter, int member)
>  {
> -	return *res_counter_member(counter, member);
> +	if (member == RES_USAGE)
> +		return percpu_counter_read_positive(&counter->usage);
> +	else
> +		return *res_counter_member(counter, member);
>  }

>
>  int res_counter_memparse_write_strategy(const char *buf,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 48a38e1..17d305d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -58,6 +58,19 @@ static DEFINE_MUTEX(memcg_tasklist);	/* can be hold
> under cgroup_mutex */
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>
>  /*
> + * To help resource counters scale, we take a step back
> + * and allow the counters to be scalable and set a
> + * batch value such that every addition does not cause
> + * global synchronization. The side-effect will be visible
> + * on limit enforcement, where due to this fuzziness,
> + * we will lose out on inforcing a limit when the usage
> + * exceeds the limit. The plan however in the long run
> + * is to allow this value to be controlled. We will
> + * probably add a new control file for it.
> + */
> +#define MEM_CGROUP_RES_ERR_TOLERANCE (4 * PAGE_SIZE)

Considering the percpu counter's extra overhead, this number is too small, IMO.

> +
> +/*
>   * Statistics for memory cgroup.
>   */
>  enum mem_cgroup_stat_index {
> @@ -2340,7 +2353,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup
> *mem, bool free_all)
>  	if (free_all)
>  		goto try_to_free;
>  move_account:
> -	while (mem->res.usage > 0) {
> +	while (res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
>  		ret = -EBUSY;
>  		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
>  			goto out;
> @@ -2383,7 +2396,7 @@ try_to_free:
>  	lru_add_drain_all();
>  	/* try to free all pages in this cgroup */
>  	shrink = 1;
> -	while (nr_retries && mem->res.usage > 0) {
> +	while (nr_retries && res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
>  		int progress;
>
>  		if (signal_pending(current)) {
> @@ -2401,7 +2414,7 @@ try_to_free:
>  	}
>  	lru_add_drain();
>  	/* try move_account...there may be some *locked* pages. */
> -	if (mem->res.usage)
> +	if (res_counter_read_u64(&mem->res, RES_USAGE))
>  		goto move_account;
>  	ret = 0;
>  	goto out;
> @@ -3019,8 +3032,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct
> cgroup *cont)
>  	}
>
>  	if (parent && parent->use_hierarchy) {
> -		res_counter_init(&mem->res, &parent->res);
> -		res_counter_init(&mem->memsw, &parent->memsw);
> +		res_counter_init(&mem->res, &parent->res,
> +			MEM_CGROUP_RES_ERR_TOLERANCE);
> +		res_counter_init(&mem->memsw, &parent->memsw,
> +			MEM_CGROUP_RES_ERR_TOLERANCE);
>  		/*
>  		 * We increment refcnt of the parent to ensure that we can
>  		 * safely access it on res_counter_charge/uncharge.
> @@ -3029,8 +3044,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct
> cgroup *cont)
>  		 */
>  		mem_cgroup_get(parent);
>  	} else {
> -		res_counter_init(&mem->res, NULL);
> -		res_counter_init(&mem->memsw, NULL);
> +		res_counter_init(&mem->res, NULL,
> +			MEM_CGROUP_RES_ERR_TOLERANCE);
> +		res_counter_init(&mem->memsw, NULL,
> +			MEM_CGROUP_RES_ERR_TOLERANCE);
>  	}
>  	mem->last_scanned_child = 0;
>  	spin_lock_init(&mem->reclaim_param_lock);
>

Thanks,
-Kame

>
> --
> 	Balbir
>



* Re: Help Resource Counters Scale Better (v2)
  2009-08-08  1:11 ` KAMEZAWA Hiroyuki
@ 2009-08-08  6:05   ` Balbir Singh
  2009-08-08  7:38     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2009-08-08  6:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
	lizf, menage, Pavel Emelianov, linux-kernel, linux-mm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-08 10:11:40]:

> Balbir Singh wrote:
> > Enhancement: For scalability move the resource counter to a percpu counter
> >
> > Changelog v2->v1
> >
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> >
> > 1. Updated Documentation (cgroups.txt and resource_counters.txt)
> > 2. Added the notion of tolerance to resource counter initialization
> >
> looks better ..but a few concerns/nitpicks.
> 
> 
> > I tested this patch on my x86_64 box with a regular test for hard
> > limits and a page fault program. I also enabled lockdep and lock_stat
> > clearly shows that the contention on counter->lock is down quite
> > significantly. I tested these patches against an older mmotm, but
> > this should apply cleanly to the 6th August mmotm as well.
> >
> It's always helpful if the numbers are shown.
>

Sure, but please do remember I did these tests on a small box. I'll
try and get hold of a bigger box for more results. This was with
a simple memcg test.

Before

                          &counter->lock:          4139           4195           0.64           7.22        2072.51         144272       17989058           0.50          26.78     3573647.35
                          --------------
                          &counter->lock           2016          [<ffffffff81087ac8>] res_counter_charge+0x57/0x10c
                          &counter->lock           2179          [<ffffffff81087a2f>] res_counter_uncharge+0x39/0x7b
                          --------------
                          &counter->lock           1931          [<ffffffff81087ac8>] res_counter_charge+0x57/0x10c
                          &counter->lock           2264          [<ffffffff81087a2f>] res_counter_uncharge+0x39/0x7b

After

                                   key#2:           467            475            0.61           4.74         216.92          16860        8578542           0.56          16.26     1707604.41
                                   -----
                                   key#2            475          [<ffffffff811f7347>] __percpu_counter_add+0x56/0xad
                                   -----
                                   key#2            475          [<ffffffff811f7347>] __percpu_counter_add+0x56/0xad



 
> 
> > diff --git a/Documentation/cgroups/memory.txt
> > b/Documentation/cgroups/memory.txt
> > index b871f25..8f86537 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -13,6 +13,9 @@ c. Provides *zero overhead* for non memory controller
> > users
> >  d. Provides a double LRU: global memory pressure causes reclaim from the
> >     global LRU; a cgroup on hitting a limit, reclaims from the per
> >     cgroup LRU
> > +   NOTE: One can no longer rely on the exact limit. Since we've moved
> > +   to using percpu_counters for resource counters, there is always going
> > +   to be a fuzziness factor depending on the batch value.
> >
> This text is just from our view.
> Please explain to people who reads memcg for the first time.
> 
> 
> >  Benefits and Purpose of the memory controller
> >
> > diff --git a/Documentation/cgroups/resource_counter.txt
> > b/Documentation/cgroups/resource_counter.txt
> > index 95b24d7..d3b276b 100644
> > --- a/Documentation/cgroups/resource_counter.txt
> > +++ b/Documentation/cgroups/resource_counter.txt
> > @@ -12,12 +12,15 @@ to work with it.
> >
> >  1. Crucial parts of the res_counter structure
> >
> > - a. unsigned long long usage
> > + a. percpu_counter usage
> >
> >   	The usage value shows the amount of a resource that is consumed
> >  	by a group at a given time. The units of measurement should be
> >  	determined by the controller that uses this counter. E.g. it can
> >  	be bytes, items or any other unit the controller operates on.
> > +	NOTE: being a percpu_counter, the way to read the correct value
> > +	at all times makes it unscalable and reading it scalably makes
> > +	the value a little unreliable :)
> >
> ditto.

OK.. I'll enhance the documentation


> 
> >   b. unsigned long long max_usage
> >
> > @@ -48,7 +51,8 @@ to work with it.
> >  2. Basic accounting routines
> >
> >   a. void res_counter_init(struct res_counter *rc,
> > -				struct res_counter *rc_parent)
> > +				struct res_counter *rc_parent,
> > +				unsigned long tolerance)
> >
> >   	Initializes the resource counter. As usual, should be the first
> >  	routine called for a new counter.
> > @@ -57,6 +61,9 @@ to work with it.
> >  	child -> parent relationship directly in the res_counter structure,
> >  	NULL can be used to define no relationship.
> >
> > +	The tolerance is used to control the batching behaviour of percpu
> > +	counters
> > +
> This description is ambiguous.
> What is the system's total tolerance ?
>     tolerance ?
>     nr_online_cpus * tolerance ?
>     MAX_CPUS * tolerance ?
> 

I'll add more text to explain

> 
> >   c. int res_counter_charge(struct res_counter *rc, unsigned long val,
> >  				struct res_counter **limit_fail_at)
> >
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index 731af71..2d412d7 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -14,6 +14,7 @@
> >   */
> >
> >  #include <linux/cgroup.h>
> > +#include <linux/percpu_counter.h>
> >
> >  /*
> >   * The core object. the cgroup that wishes to account for some
> > @@ -23,10 +24,6 @@
> >
> >  struct res_counter {
> >  	/*
> > -	 * the current resource consumption level
> > -	 */
> > -	unsigned long long usage;
> > -	/*
> >  	 * the maximal value of the usage from the counter creation
> >  	 */
> >  	unsigned long long max_usage;
> > @@ -48,6 +45,14 @@ struct res_counter {
> >  	 */
> >  	spinlock_t lock;
> >  	/*
> > +	 * the current resource consumption level
> > +	 */
> > +	struct percpu_counter usage;
> > +	/*
> > +	 * Tolerance for the percpu_counter (usage) above
> > +	 */
> > +	unsigned long usage_tolerance;
> > +	/*
> >  	 * Parent counter, used for hierarchial resource accounting
> >  	 */
> >  	struct res_counter *parent;
> > @@ -98,7 +103,8 @@ enum {
> >   * helpers for accounting
> >   */
> >
> > -void res_counter_init(struct res_counter *counter, struct res_counter
> > *parent);
> > +void res_counter_init(struct res_counter *counter, struct res_counter
> > *parent,
> > +			unsigned long usage_tolerance);
> >
> >  /*
> >   * charge - try to consume more resource.
> > @@ -133,7 +139,8 @@ void res_counter_uncharge(struct res_counter *counter,
> > unsigned long val,
> >
> >  static inline bool res_counter_limit_check_locked(struct res_counter
> > *cnt)
> >  {
> > -	if (cnt->usage < cnt->limit)
> > +	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> > +	if (usage < cnt->limit)
> >  		return true;
> >
> Hmm. In memcg, this function is not used for busy pass but used for
> important pass to check usage under limit (and continue reclaim)
> 
> Can't we add res_clounter_check_locked_exact(), which use
> percpu_counter_sum() later ?

We can, but I want to do it in parts, once I add the policy for
strict/no-strict checking. It is on my mind, but I want to work on the
overhead, since I've heard from many people that we need to resolve
this first.


> 
> >  	return false;
> > @@ -141,7 +148,8 @@ static inline bool
> > res_counter_limit_check_locked(struct res_counter *cnt)
> >
> >  static inline bool res_counter_soft_limit_check_locked(struct res_counter
> > *cnt)
> >  {
> > -	if (cnt->usage < cnt->soft_limit)
> > +	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> > +	if (usage < cnt->soft_limit)
> >  		return true;
> >
> >  	return false;
> > @@ -157,15 +165,15 @@ static inline bool
> > res_counter_soft_limit_check_locked(struct res_counter *cnt)
> >  static inline unsigned long long
> >  res_counter_soft_limit_excess(struct res_counter *cnt)
> >  {
> > -	unsigned long long excess;
> > -	unsigned long flags;
> > +	unsigned long long excess, usage;
> >
> > -	spin_lock_irqsave(&cnt->lock, flags);
> > -	if (cnt->usage <= cnt->soft_limit)
> > +	usage = percpu_counter_read_positive(&cnt->usage);
> > +	preempt_disable();
> > +	if (usage <= cnt->soft_limit)
> >  		excess = 0;
> >  	else
> > -		excess = cnt->usage - cnt->soft_limit;
> > -	spin_unlock_irqrestore(&cnt->lock, flags);
> > +		excess = usage - cnt->soft_limit;
> > +	preempt_enable();
> >  	return excess;
> >  }
> It's not clear why this part uses preempt_disable() instead of
> irqsave(). Could you add comment ?
> (*AND* it seems the caller disable irq....)
>

Is it? The caller is mem_cgroup_update_tree, which is called outside
of IRQ context. In mem_cgroup_insert_exceeded(), it is under the lock.

 
> 
> >
> > @@ -178,9 +186,9 @@ static inline bool
> > res_counter_check_under_limit(struct res_counter *cnt)
> >  	bool ret;
> >  	unsigned long flags;
> >
> > -	spin_lock_irqsave(&cnt->lock, flags);
> > +	local_irq_save(flags);
> >  	ret = res_counter_limit_check_locked(cnt);
> > -	spin_unlock_irqrestore(&cnt->lock, flags);
> > +	local_irq_restore(flags);
> >  	return ret;
> >  }
> >
> > @@ -189,18 +197,19 @@ static inline bool
> > res_counter_check_under_soft_limit(struct res_counter *cnt)
> >  	bool ret;
> >  	unsigned long flags;
> >
> > -	spin_lock_irqsave(&cnt->lock, flags);
> > +	local_irq_save(flags);
> >  	ret = res_counter_soft_limit_check_locked(cnt);
> > -	spin_unlock_irqrestore(&cnt->lock, flags);
> > +	local_irq_restore(flags);
> >  	return ret;
> >  }
> >
> >  static inline void res_counter_reset_max(struct res_counter *cnt)
> >  {
> >  	unsigned long flags;
> > +	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> >
> >  	spin_lock_irqsave(&cnt->lock, flags);
> > -	cnt->max_usage = cnt->usage;
> > +	cnt->max_usage = usage;
> >  	spin_unlock_irqrestore(&cnt->lock, flags);
> >  }
> >
> > @@ -217,10 +226,11 @@ static inline int res_counter_set_limit(struct
> > res_counter *cnt,
> >  		unsigned long long limit)
> >  {
> >  	unsigned long flags;
> > +	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> >  	int ret = -EBUSY;
> >
> >  	spin_lock_irqsave(&cnt->lock, flags);
> > -	if (cnt->usage <= limit) {
> > +	if (usage <= limit) {
> >  		cnt->limit = limit;
> >  		ret = 0;
> >  	}
> 
> For the same reason to check_limit, I want correct number here.
> percpu_counter_sum() is better.
>

I'll add that when we do strict accounting. Are you suggesting that
resource_counter_set_limit should use strict accounting?
 
> 
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index 88faec2..ae83168 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -15,24 +15,34 @@
> >  #include <linux/uaccess.h>
> >  #include <linux/mm.h>
> >
> > -void res_counter_init(struct res_counter *counter, struct res_counter
> > *parent)
> > +void res_counter_init(struct res_counter *counter, struct res_counter
> > *parent,
> > +			unsigned long usage_tolerance)
> >  {
> >  	spin_lock_init(&counter->lock);
> > +	percpu_counter_init(&counter->usage, 0);
> >  	counter->limit = RESOURCE_MAX;
> >  	counter->soft_limit = RESOURCE_MAX;
> >  	counter->parent = parent;
> > +	counter->usage_tolerance = usage_tolerance;
> >  }
> >
> >  int res_counter_charge_locked(struct res_counter *counter, unsigned long
> > val)
> >  {
> > -	if (counter->usage + val > counter->limit) {
> > +	unsigned long long usage;
> > +
> > +	usage = percpu_counter_read_positive(&counter->usage);
> > +	if (usage + val > counter->limit) {
> >  		counter->failcnt++;
> >  		return -ENOMEM;
> >  	}
> >
> > -	counter->usage += val;
> > -	if (counter->usage > counter->max_usage)
> > -		counter->max_usage = counter->usage;
> > +	__percpu_counter_add(&counter->usage, val, nr_cpu_ids *
> > +				counter->usage_tolerance);
> > +	if (usage + val > counter->max_usage) {
> > +		spin_lock(&counter->lock);
> > +		counter->max_usage = (usage + val);
> > +		spin_unlock(&counter->lock);
> > +	}
> Hmm...irq is already off here ?

Yes

> 
> 
> >  	return 0;
> >  }
> >
> > @@ -49,7 +59,6 @@ int res_counter_charge(struct res_counter *counter,
> > unsigned long val,
> >  		*soft_limit_fail_at = NULL;
> >  	local_irq_save(flags);
> >  	for (c = counter; c != NULL; c = c->parent) {
> > -		spin_lock(&c->lock);
> >  		ret = res_counter_charge_locked(c, val);
> >  		/*
> >  		 * With soft limits, we return the highest ancestor
> > @@ -58,7 +67,6 @@ int res_counter_charge(struct res_counter *counter,
> > unsigned long val,
> >  		if (soft_limit_fail_at &&
> >  			!res_counter_soft_limit_check_locked(c))
> >  			*soft_limit_fail_at = c;
> > -		spin_unlock(&c->lock);
> >  		if (ret < 0) {
> >  			*limit_fail_at = c;
> >  			goto undo;
> > @@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter,
> > unsigned long val,
> >  	goto done;
> >  undo:
> >  	for (u = counter; u != c; u = u->parent) {
> > -		spin_lock(&u->lock);
> >  		res_counter_uncharge_locked(u, val);
> > -		spin_unlock(&u->lock);
> >  	}
> >  done:
> 
> When using hierarchy, tolerance to root node will be bigger.
> Please write this attention to the document.
> 

No.. I don't think so.

Irrespective of the hierarchy, we do the following for each counter:

1. Add to the per-cpu count; once the local delta reaches the batch
   count, fold it into the global sum (roughly as in the sketch below).

I don't think the hierarchy should affect it.. no?
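
The folding logic of __percpu_counter_add(fbc, amount, batch) is roughly
(paraphrased from memory, not the exact source):

	s32 *pcount = per_cpu_ptr(fbc->counters, get_cpu());
	s64 count = *pcount + amount;

	if (count >= batch || count <= -batch) {
		spin_lock(&fbc->lock);
		fbc->count += count;	/* fold local delta into global sum */
		*pcount = 0;
		spin_unlock(&fbc->lock);
	} else {
		*pcount = count;	/* stay per-cpu, no shared state touched */
	}
	put_cpu();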


> 
> >  	local_irq_restore(flags);
> > @@ -79,10 +85,13 @@ done:
> >
> >  void res_counter_uncharge_locked(struct res_counter *counter, unsigned
> > long val)
> >  {
> > -	if (WARN_ON(counter->usage < val))
> > -		val = counter->usage;
> > +	unsigned long long usage;
> > +
> > +	usage = percpu_counter_read_positive(&counter->usage);
> > +	if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
> > +		val = usage;
> Is this correct ? (or do we need this WARN_ON ?)
> Hmm. percpu_counter is cpu-hotplug aware. Then,
> nr_cpu_ids is not correct. but nr_onlie_cpus() is heavy..hmm.
>

OK.. so the deal is: even though it is hotplug aware, the batch count is
a heuristic and I don't want to do heavy math for it. nr_cpu_ids is
larger, but it is also lightweight in terms of computation.
 
> 
> >
> > -	counter->usage -= val;
> > +	percpu_counter_sub(&counter->usage, val);
> >  }
> >
> >  void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > @@ -93,12 +102,10 @@ void res_counter_uncharge(struct res_counter
> > *counter, unsigned long val,
> >
> >  	local_irq_save(flags);
> >  	for (c = counter; c != NULL; c = c->parent) {
> > -		spin_lock(&c->lock);
> >  		if (was_soft_limit_excess)
> >  			*was_soft_limit_excess =
> >  				!res_counter_soft_limit_check_locked(c);
> >  		res_counter_uncharge_locked(c, val);
> > -		spin_unlock(&c->lock);
> >  	}
> >  	local_irq_restore(flags);
> >  }
> > @@ -108,8 +115,6 @@ static inline unsigned long long *
> >  res_counter_member(struct res_counter *counter, int member)
> >  {
> >  	switch (member) {
> > -	case RES_USAGE:
> > -		return &counter->usage;
> >  	case RES_MAX_USAGE:
> >  		return &counter->max_usage;
> >  	case RES_LIMIT:
> > @@ -128,11 +133,15 @@ ssize_t res_counter_read(struct res_counter
> > *counter, int member,
> >  		const char __user *userbuf, size_t nbytes, loff_t *pos,
> >  		int (*read_strategy)(unsigned long long val, char *st_buf))
> >  {
> > -	unsigned long long *val;
> > +	unsigned long long *val, usage_val;
> >  	char buf[64], *s;
> >
> >  	s = buf;
> > -	val = res_counter_member(counter, member);
> > +	if (member == RES_USAGE) {
> > +		usage_val = percpu_counter_read_positive(&counter->usage);
> > +		val = &usage_val;
> > +	} else
> > +		val = res_counter_member(counter, member);
> >  	if (read_strategy)
> >  		s += read_strategy(*val, s);
> 
> 
> >  	else
> > @@ -143,7 +152,10 @@ ssize_t res_counter_read(struct res_counter *counter,
> > int member,
> >
> >  u64 res_counter_read_u64(struct res_counter *counter, int member)
> >  {
> > -	return *res_counter_member(counter, member);
> > +	if (member == RES_USAGE)
> > +		return percpu_counter_read_positive(&counter->usage);
> > +	else
> > +		return *res_counter_member(counter, member);
> >  }
> 
> >
> >  int res_counter_memparse_write_strategy(const char *buf,
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 48a38e1..17d305d 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -58,6 +58,19 @@ static DEFINE_MUTEX(memcg_tasklist);	/* can be hold
> > under cgroup_mutex */
> >  #define SOFTLIMIT_EVENTS_THRESH (1000)
> >
> >  /*
> > + * To help resource counters scale, we take a step back
> > + * and allow the counters to be scalable and set a
> > + * batch value such that every addition does not cause
> > + * global synchronization. The side-effect will be visible
> > + * on limit enforcement, where due to this fuzziness,
> > + * we will lose out on inforcing a limit when the usage
> > + * exceeds the limit. The plan however in the long run
> > + * is to allow this value to be controlled. We will
> > + * probably add a new control file for it.
> > + */
> > +#define MEM_CGROUP_RES_ERR_TOLERANCE (4 * PAGE_SIZE)
> 
> Considering percpu counter's extra overhead. This number is too small, IMO.
>

OK.. the reason I kept it that way is that on ppc64 PAGE_SIZE is
now 64k. Maybe we should pick a standard size like 64k and stick with
it. What do you think?
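
For example (just a sketch; the exact value is what we are discussing):

	/* per-cpu slack, independent of PAGE_SIZE */
	#define MEM_CGROUP_RES_ERR_TOLERANCE	(64 * 1024)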

 
> > +
> > +/*
> >   * Statistics for memory cgroup.
> >   */
> >  enum mem_cgroup_stat_index {
> > @@ -2340,7 +2353,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup
> > *mem, bool free_all)
> >  	if (free_all)
> >  		goto try_to_free;
> >  move_account:
> > -	while (mem->res.usage > 0) {
> > +	while (res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
> >  		ret = -EBUSY;
> >  		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
> >  			goto out;
> > @@ -2383,7 +2396,7 @@ try_to_free:
> >  	lru_add_drain_all();
> >  	/* try to free all pages in this cgroup */
> >  	shrink = 1;
> > -	while (nr_retries && mem->res.usage > 0) {
> > +	while (nr_retries && res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
> >  		int progress;
> >
> >  		if (signal_pending(current)) {
> > @@ -2401,7 +2414,7 @@ try_to_free:
> >  	}
> >  	lru_add_drain();
> >  	/* try move_account...there may be some *locked* pages. */
> > -	if (mem->res.usage)
> > +	if (res_counter_read_u64(&mem->res, RES_USAGE))
> >  		goto move_account;
> >  	ret = 0;
> >  	goto out;
> > @@ -3019,8 +3032,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct
> > cgroup *cont)
> >  	}
> >
> >  	if (parent && parent->use_hierarchy) {
> > -		res_counter_init(&mem->res, &parent->res);
> > -		res_counter_init(&mem->memsw, &parent->memsw);
> > +		res_counter_init(&mem->res, &parent->res,
> > +			MEM_CGROUP_RES_ERR_TOLERANCE);
> > +		res_counter_init(&mem->memsw, &parent->memsw,
> > +			MEM_CGROUP_RES_ERR_TOLERANCE);
> >  		/*
> >  		 * We increment refcnt of the parent to ensure that we can
> >  		 * safely access it on res_counter_charge/uncharge.
> > @@ -3029,8 +3044,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct
> > cgroup *cont)
> >  		 */
> >  		mem_cgroup_get(parent);
> >  	} else {
> > -		res_counter_init(&mem->res, NULL);
> > -		res_counter_init(&mem->memsw, NULL);
> > +		res_counter_init(&mem->res, NULL,
> > +			MEM_CGROUP_RES_ERR_TOLERANCE);
> > +		res_counter_init(&mem->memsw, NULL,
> > +			MEM_CGROUP_RES_ERR_TOLERANCE);
> >  	}
> >  	mem->last_scanned_child = 0;
> >  	spin_lock_init(&mem->reclaim_param_lock);
> >
> 
> Thanks,
> -Kame

Thanks for the detailed review!

-- 
	Balbir


* Re: Help Resource Counters Scale Better (v2)
  2009-08-08  6:05   ` Balbir Singh
@ 2009-08-08  7:38     ` KAMEZAWA Hiroyuki
  2009-08-09 12:15       ` Help Resource Counters Scale Better (v3) Balbir Singh
  0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-08  7:38 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, andi.kleen, Prarit Bhargava,
	KOSAKI Motohiro, lizf, menage, Pavel Emelianov, linux-kernel,
	linux-mm

Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-08
> 10:11:40]:
>
>> Balbir Singh wrote:

>> >  static inline bool res_counter_limit_check_locked(struct res_counter
>> > *cnt)
>> >  {
>> > -	if (cnt->usage < cnt->limit)
>> > +	unsigned long long usage =
>> percpu_counter_read_positive(&cnt->usage);
>> > +	if (usage < cnt->limit)
>> >  		return true;
>> >
>> Hmm. In memcg, this function is not used for busy pass but used for
>> important pass to check usage under limit (and continue reclaim)
>>
>> Can't we add res_clounter_check_locked_exact(), which use
>> percpu_counter_sum() later ?
>
> We can, but I want to do it in parts, once I add the policy for
> strict/no-strict checking. It is on my mind, but I want to work on the
> overhead, since I've heard from many people that we need to resolve
> this first.
>
ok.

>> >  	spin_lock_irqsave(&cnt->lock, flags);
>> > -	if (cnt->usage <= limit) {
>> > +	if (usage <= limit) {
>> >  		cnt->limit = limit;
>> >  		ret = 0;
>> >  	}
>>
>> For the same reason to check_limit, I want correct number here.
>> percpu_counter_sum() is better.
>>
>
> I'll add that when we do strict accounting. Are you suggesting that
> resource_counter_set_limit should use strict accounting?

Yes, I think so.
...and I'd like to add "mem_cgroup_reduce_usage" or some similar call
to do reclaim-on-demand, later.

I wonder whether it's OK to add error tolerance to memcg, but I want
some interface to do a "sync", especially when we measure the size of
the working set.

I like your current direction of achieving better performance.
But I wonder how users can see synchronized numbers without tolerance;
that will be necessary for high-end users.
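
For example, some exact read path like this (hypothetical name, just a
sketch built on the existing res_counter_read_u64()):

	/* hypothetical: exact, synchronized read of the usage */
	u64 res_counter_read_u64_sync(struct res_counter *counter, int member)
	{
		if (member == RES_USAGE)
			return percpu_counter_sum_positive(&counter->usage);
		return *res_counter_member(counter, member);
	}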

>> >  			goto undo;
>> > @@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter,
>> > unsigned long val,
>> >  	goto done;
>> >  undo:
>> >  	for (u = counter; u != c; u = u->parent) {
>> > -		spin_lock(&u->lock);
>> >  		res_counter_uncharge_locked(u, val);
>> > -		spin_unlock(&u->lock);
>> >  	}
>> >  done:

>> When using hierarchy, tolerance to root node will be bigger.
>> Please write this attention to the document.
>>
>
> No.. I don't think so..
>
> Irrespective of hierarchy, we do the following
>
> 1. Add, if the sum reaches batch count, we sum and save
>
> I don't think hierarchy should affect it.. no?
>
Hmm, maybe I'm misunderstanding. Let me brainstorm...

In the following hierarchy,

   A/01
    /02
    /03/X
       /Y
       /Z
 the sum of the tolerances of X+Y+Z is limited by the tolerance of 03.
 the sum of the tolerances of 01+02+03 is limited by the tolerance of A.

Ah, ok. I'm wrong. Hmm...


>
>>
>> >  	local_irq_restore(flags);
>> > @@ -79,10 +85,13 @@ done:
>> >
>> >  void res_counter_uncharge_locked(struct res_counter *counter,
>> unsigned
>> > long val)
>> >  {
>> > -	if (WARN_ON(counter->usage < val))
>> > -		val = counter->usage;
>> > +	unsigned long long usage;
>> > +
>> > +	usage = percpu_counter_read_positive(&counter->usage);
>> > +	if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
>> > +		val = usage;
>> Is this correct ? (or do we need this WARN_ON ?)
>> Hmm. percpu_counter is cpu-hotplug aware. Then,
>> nr_cpu_ids is not correct. but nr_onlie_cpus() is heavy..hmm.
>>
>
> OK.. so the deal is, even though it is aware, batch count is a
> heuristic and I don't want to do heavy math in it. nr_cpu_ids is
> larger, but also light weight in terms of computation.
>
Yes... I wonder whether there is a _variable_ that gives nr_online_cpus
without a bitmap scan...


>> >  /*
>> > + * To help resource counters scale, we take a step back
>> > + * and allow the counters to be scalable and set a
>> > + * batch value such that every addition does not cause
>> > + * global synchronization. The side-effect will be visible
>> > + * on limit enforcement, where due to this fuzziness,
>> > + * we will lose out on inforcing a limit when the usage
>> > + * exceeds the limit. The plan however in the long run
>> > + * is to allow this value to be controlled. We will
>> > + * probably add a new control file for it.
>> > + */
>> > +#define MEM_CGROUP_RES_ERR_TOLERANCE (4 * PAGE_SIZE)
>>
>> Considering percpu counter's extra overhead. This number is too small,
>> IMO.
>>
>
> OK.. the reason I kept it that way is because on ppc64 PAGE_SIZE is
> now 64k. May be we should pick a standard size like 64k and stick with
> it. What do you think?
>
I think 64k is reasonable as long as there is no monster machine with
4096 cpus... But even with 4096 cpus,
64k*4096 = 256M... still a small amount for a monster machine..

Hmm...I think you can add CONFIG_MEMCG_PCPU_TOLERANCE and
set the default value to 64k (of course, you can do this in another patch).
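
For example, a rough sketch of what that could look like
(CONFIG_MEMCG_PCPU_TOLERANCE is hypothetical here, not part of the
posted patches):

/*
 * Sketch only: let Kconfig override the per-cpu tolerance (in KB),
 * falling back to the 64k default suggested above.
 */
#ifdef CONFIG_MEMCG_PCPU_TOLERANCE
#define MEM_CGROUP_RES_ERR_TOLERANCE	(CONFIG_MEMCG_PCPU_TOLERANCE * 1024)
#else
#define MEM_CGROUP_RES_ERR_TOLERANCE	(64 * 1024)
#endif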

On laptop/desktop, 4cpus
 4*64k=256k

On volume-zone server, 8-16,32cpus
 32*64k=2M

On high-end 64-256cpu machine in these days,
 256*64k=16M

maybe not so bad. I'm not sure how many 1024-cpu machines will
be used in the next ten years..

I want a percpu counter with flexible batching for minimizing tolerance.
It will be my homework.

Thanks,
-Kame


64kx256 = 16M ...maybe reasonable.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Help Resource Counters Scale Better (v3)
  2009-08-08  7:38     ` KAMEZAWA Hiroyuki
@ 2009-08-09 12:15       ` Balbir Singh
  2009-08-10  0:32         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2009-08-09 12:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
	lizf, menage, Pavel Emelianov, linux-kernel, linux-mm

Hi, 

Thanks for the detailed review; here is v3 of the patches against
the 6th August mmotm. I've documented the TODOs as well. If there are
no major objections, I would like this to be included in mmotm
for more testing. Any test reports on a large machine would be highly
appreciated.

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Changelog v3->v2

1. Added more documentation and comments
2. Made the check in mem_cgroup_set_limit strict
3. Increased tolerance per cpu to 64KB.
4. Still have the WARN_ON(); I've kept it for debugging
   purposes. Maybe we should make it conditional on
   DEBUG_VM

Changelog v2->v1

1. Updated Documentation (cgroups.txt and resource_counters.txt)
2. Added the notion of tolerance to resource counter initialization

Enhancement: For scalability move the resource counter to a percpu counter

This patch changes the usage field of a resource counter to a percpu
counter. The counter is incremented with local irq disabled. The other
fields are still protected by the spin lock for write.

This patch adds a fuzziness factor to the hard limit; since the value we read
could be off from the real value (by the batch value), this can be fixed
by adding a strict/non-strict accounting check. The intention is
to turn off strict checking for root (since we can't set limits on
it anyway).

I tested this patch on my x86_64 box with a regular test for hard
limits and a page fault program.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/cgroups/memory.txt           |   22 ++++++++++++
 Documentation/cgroups/resource_counter.txt |   20 +++++++++--
 include/linux/res_counter.h                |   52 ++++++++++++++++++----------
 kernel/res_counter.c                       |   50 +++++++++++++++++----------
 mm/memcontrol.c                            |   32 +++++++++++++----
 5 files changed, 128 insertions(+), 48 deletions(-)


diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f25..a24dab7 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -13,6 +13,9 @@ c. Provides *zero overhead* for non memory controller users
 d. Provides a double LRU: global memory pressure causes reclaim from the
    global LRU; a cgroup on hitting a limit, reclaims from the per
    cgroup LRU
+   NOTE: One can no longer rely on the exact limit. Since we've moved
+   to using percpu_counters for resource counters, there is always going
+   to be a fuzziness factor depending on the batch value.
 
 Benefits and Purpose of the memory controller
 
@@ -422,6 +425,25 @@ NOTE2: It is recommended to set the soft limit always below the hard limit,
 4. Start reclamation in the background when the limit is
    not yet hit but the usage is getting closer
 
+9. Scalability Tradeoff
+
+As documented in Documentation/cgroups/resource_counter.txt, we've
+moved over to percpu counters for accounting usage. What does this
+mean for the end user?
+
+1. It means better performance
+2. It means that the usage reported does not necessarily reflect
+   reality. Because percpu counters sync only every so often (see the
+   batch value in the code), the value reported might be off the
+   real value by an amount proportional to the specified tolerance.
+   The tolerance value is currently stored internally.
+
+TODOs
+
+1. Move tolerance to a config option
+2. Add support for strict/non-strict accounting
+
+
 Summary
 
 Overall, the memory controller has been a stable controller and has been
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..a43ea60 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -12,12 +12,15 @@ to work with it.
 
 1. Crucial parts of the res_counter structure
 
- a. unsigned long long usage
+ a. percpu_counter usage
 
  	The usage value shows the amount of a resource that is consumed
 	by a group at a given time. The units of measurement should be
 	determined by the controller that uses this counter. E.g. it can
 	be bytes, items or any other unit the controller operates on.
+	NOTE: being a percpu_counter, reading the exact value at all
+	times is not scalable, while reading it scalably makes the
+	value a little unreliable :)
 
  b. unsigned long long max_usage
 
@@ -39,16 +42,24 @@ to work with it.
  	The failcnt stands for "failures counter". This is the number of
 	resource allocation attempts that failed.
 
- c. spinlock_t lock
+ e. spinlock_t lock
 
  	Protects changes of the above values.
 
+ f. unsigned long tolerance
+
+	This value is used to keep track of the amount of error that might
+	be tolerated by the resource counter. See the NOTE in (a) above.
+	The tolerance value is per cpu, hence the total error at any time
+	can be nr_cpu_ids * tolerance.
+
 
 
 2. Basic accounting routines
 
  a. void res_counter_init(struct res_counter *rc,
-				struct res_counter *rc_parent)
+				struct res_counter *rc_parent,
+				unsigned long tolerance)
 
  	Initializes the resource counter. As usual, should be the first
 	routine called for a new counter.
@@ -57,6 +68,9 @@ to work with it.
 	child -> parent relationship directly in the res_counter structure,
 	NULL can be used to define no relationship.
 
+	The tolerance is used to control the batching behaviour of percpu
+	counters. Please see details in Section 1, item f above.
+
  c. int res_counter_charge(struct res_counter *rc, unsigned long val,
 				struct res_counter **limit_fail_at)
 
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 731af71..3728c0d 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -14,6 +14,7 @@
  */
 
 #include <linux/cgroup.h>
+#include <linux/percpu_counter.h>
 
 /*
  * The core object. the cgroup that wishes to account for some
@@ -23,10 +24,6 @@
 
 struct res_counter {
 	/*
-	 * the current resource consumption level
-	 */
-	unsigned long long usage;
-	/*
 	 * the maximal value of the usage from the counter creation
 	 */
 	unsigned long long max_usage;
@@ -48,6 +45,14 @@ struct res_counter {
 	 */
 	spinlock_t lock;
 	/*
+	 * the current resource consumption level
+	 */
+	struct percpu_counter usage;
+	/*
+	 * Tolerance for the percpu_counter (usage) above
+	 */
+	unsigned long usage_tolerance;
+	/*
 	 * Parent counter, used for hierarchial resource accounting
 	 */
 	struct res_counter *parent;
@@ -98,7 +103,8 @@ enum {
  * helpers for accounting
  */
 
-void res_counter_init(struct res_counter *counter, struct res_counter *parent);
+void res_counter_init(struct res_counter *counter, struct res_counter *parent,
+			unsigned long usage_tolerance);
 
 /*
  * charge - try to consume more resource.
@@ -133,7 +139,8 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
 
 static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 {
-	if (cnt->usage < cnt->limit)
+	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
+	if (usage < cnt->limit)
 		return true;
 
 	return false;
@@ -141,7 +148,8 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 
 static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
 {
-	if (cnt->usage < cnt->soft_limit)
+	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
+	if (usage < cnt->soft_limit)
 		return true;
 
 	return false;
@@ -157,15 +165,19 @@ static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
 static inline unsigned long long
 res_counter_soft_limit_excess(struct res_counter *cnt)
 {
-	unsigned long long excess;
-	unsigned long flags;
+	unsigned long long excess, usage;
 
-	spin_lock_irqsave(&cnt->lock, flags);
-	if (cnt->usage <= cnt->soft_limit)
+	usage = percpu_counter_read_positive(&cnt->usage);
+	/*
+	 * Not all callers call with irq's disabled, make
+	 * sure we read out something sensible.
+	 */
+	preempt_disable();
+	if (usage <= cnt->soft_limit)
 		excess = 0;
 	else
-		excess = cnt->usage - cnt->soft_limit;
-	spin_unlock_irqrestore(&cnt->lock, flags);
+		excess = usage - cnt->soft_limit;
+	preempt_enable();
 	return excess;
 }
 
@@ -178,9 +190,9 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
 	bool ret;
 	unsigned long flags;
 
-	spin_lock_irqsave(&cnt->lock, flags);
+	local_irq_save(flags);
 	ret = res_counter_limit_check_locked(cnt);
-	spin_unlock_irqrestore(&cnt->lock, flags);
+	local_irq_restore(flags);
 	return ret;
 }
 
@@ -189,18 +201,19 @@ static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
 	bool ret;
 	unsigned long flags;
 
-	spin_lock_irqsave(&cnt->lock, flags);
+	local_irq_save(flags);
 	ret = res_counter_soft_limit_check_locked(cnt);
-	spin_unlock_irqrestore(&cnt->lock, flags);
+	local_irq_restore(flags);
 	return ret;
 }
 
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
+	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
 
 	spin_lock_irqsave(&cnt->lock, flags);
-	cnt->max_usage = cnt->usage;
+	cnt->max_usage = usage;
 	spin_unlock_irqrestore(&cnt->lock, flags);
 }
 
@@ -217,10 +230,11 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
 		unsigned long long limit)
 {
 	unsigned long flags;
+	unsigned long long usage = percpu_counter_sum_positive(&cnt->usage);
 	int ret = -EBUSY;
 
 	spin_lock_irqsave(&cnt->lock, flags);
-	if (cnt->usage <= limit) {
+	if (usage <= limit) {
 		cnt->limit = limit;
 		ret = 0;
 	}
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 88faec2..ae83168 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -15,24 +15,34 @@
 #include <linux/uaccess.h>
 #include <linux/mm.h>
 
-void res_counter_init(struct res_counter *counter, struct res_counter *parent)
+void res_counter_init(struct res_counter *counter, struct res_counter *parent,
+			unsigned long usage_tolerance)
 {
 	spin_lock_init(&counter->lock);
+	percpu_counter_init(&counter->usage, 0);
 	counter->limit = RESOURCE_MAX;
 	counter->soft_limit = RESOURCE_MAX;
 	counter->parent = parent;
+	counter->usage_tolerance = usage_tolerance;
 }
 
 int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
 {
-	if (counter->usage + val > counter->limit) {
+	unsigned long long usage;
+
+	usage = percpu_counter_read_positive(&counter->usage);
+	if (usage + val > counter->limit) {
 		counter->failcnt++;
 		return -ENOMEM;
 	}
 
-	counter->usage += val;
-	if (counter->usage > counter->max_usage)
-		counter->max_usage = counter->usage;
+	__percpu_counter_add(&counter->usage, val, nr_cpu_ids *
+				counter->usage_tolerance);
+	if (usage + val > counter->max_usage) {
+		spin_lock(&counter->lock);
+		counter->max_usage = (usage + val);
+		spin_unlock(&counter->lock);
+	}
 	return 0;
 }
 
@@ -49,7 +59,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
 		*soft_limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
-		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
 		/*
 		 * With soft limits, we return the highest ancestor
@@ -58,7 +67,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
 		if (soft_limit_fail_at &&
 			!res_counter_soft_limit_check_locked(c))
 			*soft_limit_fail_at = c;
-		spin_unlock(&c->lock);
 		if (ret < 0) {
 			*limit_fail_at = c;
 			goto undo;
@@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
 	goto done;
 undo:
 	for (u = counter; u != c; u = u->parent) {
-		spin_lock(&u->lock);
 		res_counter_uncharge_locked(u, val);
-		spin_unlock(&u->lock);
 	}
 done:
 	local_irq_restore(flags);
@@ -79,10 +85,13 @@ done:
 
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
 {
-	if (WARN_ON(counter->usage < val))
-		val = counter->usage;
+	unsigned long long usage;
+
+	usage = percpu_counter_read_positive(&counter->usage);
+	if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
+		val = usage;
 
-	counter->usage -= val;
+	percpu_counter_sub(&counter->usage, val);
 }
 
 void res_counter_uncharge(struct res_counter *counter, unsigned long val,
@@ -93,12 +102,10 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
 
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
-		spin_lock(&c->lock);
 		if (was_soft_limit_excess)
 			*was_soft_limit_excess =
 				!res_counter_soft_limit_check_locked(c);
 		res_counter_uncharge_locked(c, val);
-		spin_unlock(&c->lock);
 	}
 	local_irq_restore(flags);
 }
@@ -108,8 +115,6 @@ static inline unsigned long long *
 res_counter_member(struct res_counter *counter, int member)
 {
 	switch (member) {
-	case RES_USAGE:
-		return &counter->usage;
 	case RES_MAX_USAGE:
 		return &counter->max_usage;
 	case RES_LIMIT:
@@ -128,11 +133,15 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
 		const char __user *userbuf, size_t nbytes, loff_t *pos,
 		int (*read_strategy)(unsigned long long val, char *st_buf))
 {
-	unsigned long long *val;
+	unsigned long long *val, usage_val;
 	char buf[64], *s;
 
 	s = buf;
-	val = res_counter_member(counter, member);
+	if (member == RES_USAGE) {
+		usage_val = percpu_counter_read_positive(&counter->usage);
+		val = &usage_val;
+	} else
+		val = res_counter_member(counter, member);
 	if (read_strategy)
 		s += read_strategy(*val, s);
 	else
@@ -143,7 +152,10 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
 
 u64 res_counter_read_u64(struct res_counter *counter, int member)
 {
-	return *res_counter_member(counter, member);
+	if (member == RES_USAGE)
+		return percpu_counter_read_positive(&counter->usage);
+	else
+		return *res_counter_member(counter, member);
 }
 
 int res_counter_memparse_write_strategy(const char *buf,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 48a38e1..36d46aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -58,6 +58,20 @@ static DEFINE_MUTEX(memcg_tasklist);	/* can be hold under cgroup_mutex */
 #define SOFTLIMIT_EVENTS_THRESH (1000)
 
 /*
+ * To help resource counters scale, we take a step back
+ * and allow the counters to be scalable and set a
+ * batch value such that every addition does not cause
+ * global synchronization. The side-effect will be visible
+ * on limit enforcement, where due to this fuzziness,
+ * we will lose out on enforcing a limit when the usage
+ * exceeds the limit. The plan however in the long run
+ * is to allow this value to be controlled. We will
+ * probably add a new control file for it. This will be
+ * moved to a config option later.
+ */
+#define MEM_CGROUP_RES_ERR_TOLERANCE (64 * 1024)
+
+/*
  * Statistics for memory cgroup.
  */
 enum mem_cgroup_stat_index {
@@ -2340,7 +2354,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *mem, bool free_all)
 	if (free_all)
 		goto try_to_free;
 move_account:
-	while (mem->res.usage > 0) {
+	while (res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
 		ret = -EBUSY;
 		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
 			goto out;
@@ -2383,7 +2397,7 @@ try_to_free:
 	lru_add_drain_all();
 	/* try to free all pages in this cgroup */
 	shrink = 1;
-	while (nr_retries && mem->res.usage > 0) {
+	while (nr_retries && res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
 		int progress;
 
 		if (signal_pending(current)) {
@@ -2401,7 +2415,7 @@ try_to_free:
 	}
 	lru_add_drain();
 	/* try move_account...there may be some *locked* pages. */
-	if (mem->res.usage)
+	if (res_counter_read_u64(&mem->res, RES_USAGE))
 		goto move_account;
 	ret = 0;
 	goto out;
@@ -3019,8 +3033,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	}
 
 	if (parent && parent->use_hierarchy) {
-		res_counter_init(&mem->res, &parent->res);
-		res_counter_init(&mem->memsw, &parent->memsw);
+		res_counter_init(&mem->res, &parent->res,
+			MEM_CGROUP_RES_ERR_TOLERANCE);
+		res_counter_init(&mem->memsw, &parent->memsw,
+			MEM_CGROUP_RES_ERR_TOLERANCE);
 		/*
 		 * We increment refcnt of the parent to ensure that we can
 		 * safely access it on res_counter_charge/uncharge.
@@ -3029,8 +3045,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		 */
 		mem_cgroup_get(parent);
 	} else {
-		res_counter_init(&mem->res, NULL);
-		res_counter_init(&mem->memsw, NULL);
+		res_counter_init(&mem->res, NULL,
+			MEM_CGROUP_RES_ERR_TOLERANCE);
+		res_counter_init(&mem->memsw, NULL,
+			MEM_CGROUP_RES_ERR_TOLERANCE);
 	}
 	mem->last_scanned_child = 0;
 	spin_lock_init(&mem->reclaim_param_lock);

-- 
	Balbir


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Help Resource Counters Scale Better (v3)
  2009-08-09 12:15       ` Help Resource Counters Scale Better (v3) Balbir Singh
@ 2009-08-10  0:32         ` KAMEZAWA Hiroyuki
  2009-08-10  0:43           ` KAMEZAWA Hiroyuki
  2009-08-10  5:30           ` Balbir Singh
  0 siblings, 2 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-10  0:32 UTC (permalink / raw)
  To: balbir
  Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
	lizf, menage, Pavel Emelianov, linux-kernel, linux-mm

On Sun, 9 Aug 2009 17:45:30 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> Hi, 
> 
> Thanks for the detailed review, here is v3 of the patches against
> mmotm 6th August. I've documented the TODOs as well. If there are
> no major objections, I would like this to be included in mmotm
> for more testing. Any test reports on a large machine would be highly
> appreciated.
> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Changelog v3->v2
> 
> 1. Added more documentation and comments
> 2. Made the check in mem_cgroup_set_limit strict
> 3. Increased tolerance per cpu to 64KB.
> 4. Still have the WARN_ON(), I've kept it for debugging
>    purposes, may be we should make it a conditional with
>    DEBUG_VM
> 
Because I'll be absent for a while, I won't give a Reviewed-by or Acked-by now.

Before leaving, I'd like to write some concerns here.

1. you use res_counter_read_positive() in force_empty. It seems force_empty can
   go into an infinite loop. plz check. (especially when some pages are freed or swapped-in
   on another cpu while force_empty runs.)

2. In the near future, we'll see 256 or 1024 cpus on a system, anyway.
   Assume a 1024-cpu system: 64k*1024 = 64M of tolerance.
   Can't we calculate the max tolerance as follows?

   tolerance = min(64k * num_online_cpus(), limit_in_bytes/100);
   tolerance /= num_online_cpus();
   per_cpu_tolerance = min(16k, tolerance);

   I think automatic runtime adjustment of the tolerance will eventually be necessary,
   but the above will not be very bad because we can guarantee 1% tolerance.
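
   Transcribed into C, that heuristic would look roughly like this
   (sketch only; the helper name is illustrative, and it assumes the
   usual kernel min()/do_div() helpers):

   /* cap total error at min(64KB * online cpus, 1% of limit),
    * spread it per cpu, and never exceed 16KB per cpu */
   static unsigned long pick_per_cpu_tolerance(unsigned long long limit_in_bytes)
   {
           unsigned long long tol;

           tol = min(64ULL * 1024 * num_online_cpus(), limit_in_bytes / 100);
           do_div(tol, num_online_cpus());
           return min_t(unsigned long, 16 * 1024, tol);
   }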

Thx,
-Kame



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Help Resource Counters Scale Better (v3)
  2009-08-10  0:32         ` KAMEZAWA Hiroyuki
@ 2009-08-10  0:43           ` KAMEZAWA Hiroyuki
  2009-08-10  5:22             ` Balbir Singh
  2009-08-10  5:30           ` Balbir Singh
  1 sibling, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-10  0:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir, Andrew Morton, andi.kleen, Prarit Bhargava,
	KOSAKI Motohiro, lizf, menage, Pavel Emelianov, linux-kernel,
	linux-mm

On Mon, 10 Aug 2009 09:32:29 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> 1. you use res_counter_read_positive() in force_empty. It seems force_empty can
>    go into infinite loop. plz check. (especially when some pages are freed or swapped-in
>    in other cpu while force_empry runs.)
> 
> 2. In near future, we'll see 256 or 1024 cpus on a system, anyway.
>    Assume 1024cpu system, 64k*1024=64M is a tolerance.
>    Can't we calculate max-tolerane as following ?
>   
>    tolerance = min(64k * num_online_cpus(), limit_in_bytes/100);
>    tolerance /= num_online_cpus();
>    per_cpu_tolerance = min(16k, tolelance);
> 
>    I think automatic runtine adjusting of tolerance will be finally necessary,
>    but above will not be very bad because we can guarantee 1% tolerance.
> 

Sorry, one more.

3. As I requested when you pushed the softlimit changes to mmotm, plz consider
   implementing a check-and-notify gadget for res_counter.
   See: http://marc.info/?l=linux-mm&m=124753058921677&w=2

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Help Resource Counters Scale Better (v3)
  2009-08-10  0:43           ` KAMEZAWA Hiroyuki
@ 2009-08-10  5:22             ` Balbir Singh
  0 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2009-08-10  5:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
	lizf, menage, Pavel Emelianov, linux-kernel, linux-mm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-10 09:43:44]:

> On Mon, 10 Aug 2009 09:32:29 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > 1. you use res_counter_read_positive() in force_empty. It seems force_empty can
> >    go into infinite loop. plz check. (especially when some pages are freed or swapped-in
> >    in other cpu while force_empry runs.)
> > 
> > 2. In near future, we'll see 256 or 1024 cpus on a system, anyway.
> >    Assume 1024cpu system, 64k*1024=64M is a tolerance.
> >    Can't we calculate max-tolerane as following ?
> >   
> >    tolerance = min(64k * num_online_cpus(), limit_in_bytes/100);
> >    tolerance /= num_online_cpus();
> >    per_cpu_tolerance = min(16k, tolelance);
> > 
> >    I think automatic runtine adjusting of tolerance will be finally necessary,
> >    but above will not be very bad because we can guarantee 1% tolerance.
> > 
> 
> Sorry, one more.
> 
> 3. As I requested when you pushed softlimit changes to mmotom, plz consider
>    to implement a way to check-and-notify gadget to res_counter.
>    See: http://marc.info/?l=linux-mm&m=124753058921677&w=2
>

Yes, I will do that, but only after the scaling, since this is more
important at the moment. 

-- 
	Balbir


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Help Resource Counters Scale Better (v3)
  2009-08-10  0:32         ` KAMEZAWA Hiroyuki
  2009-08-10  0:43           ` KAMEZAWA Hiroyuki
@ 2009-08-10  5:30           ` Balbir Singh
  2009-08-10  5:45             ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2009-08-10  5:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
	lizf, menage, Pavel Emelianov, linux-kernel, linux-mm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-10 09:32:29]:

> On Sun, 9 Aug 2009 17:45:30 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > Hi, 
> > 
> > Thanks for the detailed review, here is v3 of the patches against
> > mmotm 6th August. I've documented the TODOs as well. If there are
> > no major objections, I would like this to be included in mmotm
> > for more testing. Any test reports on a large machine would be highly
> > appreciated.
> > 
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > 
> > Changelog v3->v2
> > 
> > 1. Added more documentation and comments
> > 2. Made the check in mem_cgroup_set_limit strict
> > 3. Increased tolerance per cpu to 64KB.
> > 4. Still have the WARN_ON(), I've kept it for debugging
> >    purposes, may be we should make it a conditional with
> >    DEBUG_VM
> > 
> Because I'll be absent for a while, I don't give any Reviewed-by or Acked-by, now.
> 
> Before leaving, I'd like to write some concerns here.
> 
> 1. you use res_counter_read_positive() in force_empty. It seems force_empty can
>    go into infinite loop. plz check. (especially when some pages are freed or swapped-in
>    in other cpu while force_empry runs.)

OK.. so you want me to use _sum_positive(), will do. In all my testing
using the stress scripts I have, I found no issues with force_empty so
far. But I'll change over.

> 
> 2. In near future, we'll see 256 or 1024 cpus on a system, anyway.
>    Assume 1024cpu system, 64k*1024=64M is a tolerance.
>    Can't we calculate max-tolerane as following ?
>   
>    tolerance = min(64k * num_online_cpus(), limit_in_bytes/100);
>    tolerance /= num_online_cpus();
>    per_cpu_tolerance = min(16k, tolelance);
> 
>    I think automatic runtine adjusting of tolerance will be finally necessary,
>    but above will not be very bad because we can guarantee 1% tolerance.
> 

I agree that automatic tuning will be necessary, but I want to go with the
CONFIG_MEM_CGROUP_RES_TOLERANCE approach you suggested earlier, since
num_online_cpus() with CPU hotplug can be a bit of a moving target, and
with power management and CPUs going idle, we really don't want to
count those, etc. For now, use a simple nr_cpu_ids * tolerance and then
get feedback, since it is a heuristic. Again, limit_in_bytes can
change; maybe some of this needs to go into the resize_limit and
set_limit paths. Right now, I want to keep it simple and see if
others can see the benefits of this patch. Then add some more
heuristics based on your suggestion.

Do you agree?



-- 
	Balbir


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Help Resource Counters Scale Better (v3)
  2009-08-10  5:30           ` Balbir Singh
@ 2009-08-10  5:45             ` KAMEZAWA Hiroyuki
  2009-08-10  6:22               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-10  5:45 UTC (permalink / raw)
  To: balbir
  Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
	lizf, menage, Pavel Emelianov, linux-kernel, linux-mm

On Mon, 10 Aug 2009 11:00:25 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-10 09:32:29]:
> 
> > On Sun, 9 Aug 2009 17:45:30 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > Hi, 
> > > 
> > > Thanks for the detailed review, here is v3 of the patches against
> > > mmotm 6th August. I've documented the TODOs as well. If there are
> > > no major objections, I would like this to be included in mmotm
> > > for more testing. Any test reports on a large machine would be highly
> > > appreciated.
> > > 
> > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > > 
> > > Changelog v3->v2
> > > 
> > > 1. Added more documentation and comments
> > > 2. Made the check in mem_cgroup_set_limit strict
> > > 3. Increased tolerance per cpu to 64KB.
> > > 4. Still have the WARN_ON(), I've kept it for debugging
> > >    purposes, may be we should make it a conditional with
> > >    DEBUG_VM
> > > 
> > Because I'll be absent for a while, I don't give any Reviewed-by or Acked-by, now.
> > 
> > Before leaving, I'd like to write some concerns here.
> > 
> > 1. you use res_counter_read_positive() in force_empty. It seems force_empty can
> >    go into infinite loop. plz check. (especially when some pages are freed or swapped-in
> >    in other cpu while force_empry runs.)
> 
> OK.. so you want me to use _sum_positive(), will do. In all my testing
> using the stress scripts I have, I found no issues with force_empty so
> far. But I'll change over.
> 
Thanks. Things around force_empty are very sensitive ;(



> > 
> > 2. In near future, we'll see 256 or 1024 cpus on a system, anyway.
> >    Assume 1024cpu system, 64k*1024=64M is a tolerance.
> >    Can't we calculate max-tolerane as following ?
> >   
> >    tolerance = min(64k * num_online_cpus(), limit_in_bytes/100);
> >    tolerance /= num_online_cpus();
> >    per_cpu_tolerance = min(16k, tolelance);
> > 
> >    I think automatic runtine adjusting of tolerance will be finally necessary,
> >    but above will not be very bad because we can guarantee 1% tolerance.
> > 
> 
> I agree that automatic tuning will be necessary, but I want to go the
> CONFIG_MEM_CGROUP_RES_TOLERANCE approach you suggested earlier, since
> num_online_cpus() with CPU hotplug can be a bit of a game play and
> with Power Management and CPUs going idle, we really don't want to
> count those, etc. For now a simple nr_cpu_ids * tolerance and then
> get feedback, since it is a heuristic. Again, limit_in_bytes can
> change, may be some of this needs to go into resize_limit and
> set_limit paths. Right now, I want to keep it simple and see if
> others can see the benefits of this patch. Then add some more
> heuristics based on your suggestion.
> 
> Do you agree?

Ok. Config is enough at this stage.

The last advice before merge is that it's better to show the numbers or
ask someone who has many cpus to measure the benefits. Then, Andrew can
know how beneficial this is.
(My box has 8 cpus. But maybe your IBM colleague has some bigger one)

In my experience (from my own old trial),
 - lock contention itself is low, not high.
 - but cacheline misses and ping-pong are very, very frequent.

Then, this patch has some benefit logically, but in general
file I/O, swapin/swapout, page allocation/initialization etc. dominate
the performance of usual apps. You'll have to be careful to select apps
to measure the benefits of this patch by application performance.
(And this is why I don't feel the same urgency as you do)

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Help Resource Counters Scale Better (v3)
  2009-08-10  5:45             ` KAMEZAWA Hiroyuki
@ 2009-08-10  6:22               ` KAMEZAWA Hiroyuki
  2009-08-10  7:41                 ` Balbir Singh
  2009-08-10  8:36                 ` Balbir Singh
  0 siblings, 2 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-10  6:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir, Andrew Morton, andi.kleen, Prarit Bhargava,
	KOSAKI Motohiro, lizf, menage, Pavel Emelianov, linux-kernel,
	linux-mm

On Mon, 10 Aug 2009 14:45:59 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > Do you agree?
> 
> Ok. Config is enough at this stage.
> 
> The last advice for merge is, it's better to show the numbers or
> ask someone who have many cpus to measure benefits. Then, Andrew can
> know how this is benefical.
> (My box has 8 cpus. But maybe your IBM collaegue has some bigger one)
> 
> In my experience (in my own old trial),
>  - lock contention itself is low. not high.
>  - but cacheline-miss, pingpong is very very frequent.
> 
> Then, this patch has some benefit logically but, in general,
> File-I/O, swapin-swapout, page-allocation/initalize etc..dominates
> the performance of usual apps. You'll have to be careful to select apps
> to measure the benfits of this patch by application performance.
> (And this is why I don't feel so much emergency as you do)
> 

Why I say "I want to see the numbers" again and again is that
this is performance improvement with _bad side effect_.
If this is an emergent trouble, and need fast-track, which requires us
"fix small problems later", plz say so. 

I have no objection to this approach itself because I can't think of
something better, now. percpu-counter's error tolerance is a generic
problem and we'll have to visit this anyway.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Help Resource Counters Scale Better (v3)
  2009-08-10  6:22               ` KAMEZAWA Hiroyuki
@ 2009-08-10  7:41                 ` Balbir Singh
  2009-08-10  8:36                 ` Balbir Singh
  1 sibling, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2009-08-10  7:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
	lizf, menage, Pavel Emelianov, linux-kernel, linux-mm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-10 15:22:05]:

> On Mon, 10 Aug 2009 14:45:59 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > > Do you agree?
> > 
> > Ok. Config is enough at this stage.
> > 
> > The last advice for merge is, it's better to show the numbers or
> > ask someone who have many cpus to measure benefits. Then, Andrew can
> > know how this is benefical.
> > (My box has 8 cpus. But maybe your IBM collaegue has some bigger one)
> > 
> > In my experience (in my own old trial),
> >  - lock contention itself is low. not high.
> >  - but cacheline-miss, pingpong is very very frequent.
> > 
> > Then, this patch has some benefit logically but, in general,
> > File-I/O, swapin-swapout, page-allocation/initalize etc..dominates
> > the performance of usual apps. You'll have to be careful to select apps
> > to measure the benfits of this patch by application performance.
> > (And this is why I don't feel so much emergency as you do)
> > 
> 
> Why I say "I want to see the numbers" again and again is that
> this is performance improvement with _bad side effect_.
> If this is an emergent trouble, and need fast-track, which requires us
> "fix small problems later", plz say so. 
> 


Yes, this is an urgent problem; I've gotten reports of the lock
showing up on 16- to 64-way systems.

> I have no objection to this approach itself because I can't think of
> something better, now. percpu-counter's error tolerance is a generic
> problem and we'll have to visit this anyway.
>

Yes, my plan is to later add a strict/non-strict accounting layer
and allow users to choose, keeping root non-strict since we don't
support setting limits on root now.
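
As a rough illustration of that layer (sketch only; the strict flag and
the helper name are made up, while the two percpu_counter reads are the
ones already used in the patch):

/*
 * Sketch: a limit check that can trade accuracy for speed.  Strict
 * callers fold all per-cpu deltas; non-strict callers (e.g. root)
 * accept the batched, slightly stale value.
 */
static inline bool res_counter_under_limit_mode(struct res_counter *cnt,
						bool strict)
{
	u64 usage;

	usage = strict ? percpu_counter_sum_positive(&cnt->usage) :
			 percpu_counter_read_positive(&cnt->usage);
	return usage < cnt->limit;
}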

-- 
	Balbir


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Help Resource Counters Scale Better (v3)
  2009-08-10  6:22               ` KAMEZAWA Hiroyuki
  2009-08-10  7:41                 ` Balbir Singh
@ 2009-08-10  8:36                 ` Balbir Singh
  1 sibling, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2009-08-10  8:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
	lizf, menage, Pavel Emelianov, linux-kernel, linux-mm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-10 15:22:05]:

> On Mon, 10 Aug 2009 14:45:59 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > > Do you agree?
> > 
> > Ok. Config is enough at this stage.
> > 
> > The last advice for merge is, it's better to show the numbers or
> > ask someone who have many cpus to measure benefits. Then, Andrew can
> > know how this is benefical.
> > (My box has 8 cpus. But maybe your IBM collaegue has some bigger one)
> > 
> > In my experience (in my own old trial),
> >  - lock contention itself is low. not high.
> >  - but cacheline-miss, pingpong is very very frequent.
> > 
> > Then, this patch has some benefit logically but, in general,
> > File-I/O, swapin-swapout, page-allocation/initalize etc..dominates
> > the performance of usual apps. You'll have to be careful to select apps
> > to measure the benfits of this patch by application performance.
> > (And this is why I don't feel so much emergency as you do)
> > 
> 
> Why I say "I want to see the numbers" again and again is that
> this is performance improvement with _bad side effect_.
> If this is an emergent trouble, and need fast-track, which requires us
> "fix small problems later", plz say so. 
>

OK... I finally got a bigger machine (24 CPUs). I ran a simple
program called parallel_pagefault, which does page faults in parallel
(runs on every other CPU): it allocates 10K pages, touches the
data allocated, unmaps, and repeats the process. I ran the program
for 300 seconds. With the patch, I was able to fault in nearly twice
the number of pages that I could without the patch. I used the
perf tool from tools/perf in the kernel tree.
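
For reference, here is a minimal user-space sketch of this kind of test
(it is not the actual parallel_pagefault program; CPU pinning and most
error handling are left out for brevity):

#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define NR_PAGES	10000
#define RUN_SECONDS	300

/* map NR_PAGES, touch every page, unmap, repeat for RUN_SECONDS */
static void fault_loop(long page_size)
{
	size_t len = (size_t)NR_PAGES * page_size;
	time_t end = time(NULL) + RUN_SECONDS;

	while (time(NULL) < end) {
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			exit(1);
		for (size_t off = 0; off < len; off += page_size)
			buf[off] = 1;		/* fault the page in */
		munmap(buf, len);
	}
	exit(0);
}

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

	for (long i = 0; i < ncpus; i += 2)	/* one worker per two cpus */
		if (fork() == 0)
			fault_loop(page_size);
	while (wait(NULL) > 0)
		;
	return 0;
}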

With patch

 Performance counter stats for '/home/balbir/parallel_pagefault':

 7188177.405648  task-clock-msecs         #     23.926 CPUs 
         423130  context-switches         #      0.000 M/sec
            210  CPU-migrations           #      0.000 M/sec
       49851597  page-faults              #      0.007 M/sec
  5900210219604  cycles                   #    820.821 M/sec
   424658049425  instructions             #      0.072 IPC  
     7867744369  cache-references         #      1.095 M/sec
     2882370051  cache-misses             #      0.401 M/sec

  300.431591843  seconds time elapsed

Without Patch

 Performance counter stats for '/home/balbir/parallel_pagefault':

 7192804.124144  task-clock-msecs         #     23.937 CPUs 
         424691  context-switches         #      0.000 M/sec
            267  CPU-migrations           #      0.000 M/sec
       28498113  page-faults              #      0.004 M/sec
  5826093739340  cycles                   #    809.989 M/sec
   408883496292  instructions             #      0.070 IPC  
     7057079452  cache-references         #      0.981 M/sec
     3036086243  cache-misses             #      0.422 M/sec

  300.485365680  seconds time elapsed


-- 
	Balbir


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2009-08-10  8:36 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-07 22:12 Help Resource Counters Scale Better (v2) Balbir Singh
2009-08-08  1:11 ` KAMEZAWA Hiroyuki
2009-08-08  6:05   ` Balbir Singh
2009-08-08  7:38     ` KAMEZAWA Hiroyuki
2009-08-09 12:15       ` Help Resource Counters Scale Better (v3) Balbir Singh
2009-08-10  0:32         ` KAMEZAWA Hiroyuki
2009-08-10  0:43           ` KAMEZAWA Hiroyuki
2009-08-10  5:22             ` Balbir Singh
2009-08-10  5:30           ` Balbir Singh
2009-08-10  5:45             ` KAMEZAWA Hiroyuki
2009-08-10  6:22               ` KAMEZAWA Hiroyuki
2009-08-10  7:41                 ` Balbir Singh
2009-08-10  8:36                 ` Balbir Singh
