* Help Resource Counters Scale Better (v2)
@ 2009-08-07 22:12 Balbir Singh
2009-08-08 1:11 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2009-08-07 22:12 UTC (permalink / raw)
To: Andrew Morton, andi.kleen, Prarit Bhargava
Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, lizf, menage,
Pavel Emelianov, linux-kernel, linux-mm
Enhancement: For scalability move the resource counter to a percpu counter
Changelog v2->v1
From: Balbir Singh <balbir@linux.vnet.ibm.com>
1. Updated Documentation (cgroups.txt and resource_counters.txt)
2. Added the notion of tolerance to resource counter initialization
I posted an early RFI of these patches and got several responses
stating that this would be a nice-to-have change and that we should
work on it right away.
This patch changes the usage field of a resource counter to a percpu
counter. The counter is incremented with local irqs disabled. The other
fields are still protected by the spinlock for writes.
This patch adds a fuzziness factor to the hard limit: since the value
we read can be off from the true value (by up to the batch value),
enforcement is no longer exact. This can be addressed by adding a
strict/non-strict checking mode; the intention is to turn off strict
checking for root (since we can't set limits on it anyway).
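For context, percpu_counter provides both a cheap, approximate read and
an exact but more expensive sum; the strict/non-strict split would just
choose between the two. A minimal sketch of the difference (illustrative
only, assuming counter is a struct res_counter * as modified below):

	/* cheap read: may lag the true value, bounded by the batch */
	u64 approx = percpu_counter_read_positive(&counter->usage);

	/* exact read: folds in every cpu's local delta under the
	 * percpu_counter's internal lock, so it scales poorly */
	u64 exact = percpu_counter_sum_positive(&counter->usage);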
I tested this patch on my x86_64 box with a regular test for hard
limits and a page-fault program. I also enabled lockdep, and lock_stat
clearly shows that contention on counter->lock is down quite
significantly. I tested these patches against an older mmotm, but they
should apply cleanly to the 6th August mmotm as well.
Please test/review the patches.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
Documentation/cgroups/memory.txt | 3 ++
Documentation/cgroups/resource_counter.txt | 11 +++++-
include/linux/res_counter.h | 48 ++++++++++++++++-----------
kernel/res_counter.c | 50 +++++++++++++++++-----------
mm/memcontrol.c | 31 +++++++++++++----
5 files changed, 96 insertions(+), 47 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f25..8f86537 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -13,6 +13,9 @@ c. Provides *zero overhead* for non memory controller users
d. Provides a double LRU: global memory pressure causes reclaim from the
global LRU; a cgroup on hitting a limit, reclaims from the per
cgroup LRU
+ NOTE: One can no longer rely on the exact limit. Since we've moved
+ to using percpu_counters for resource counters, there is always going
+ to be a fuzziness factor depending on the batch value.
Benefits and Purpose of the memory controller
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..d3b276b 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -12,12 +12,15 @@ to work with it.
1. Crucial parts of the res_counter structure
- a. unsigned long long usage
+ a. percpu_counter usage
The usage value shows the amount of a resource that is consumed
by a group at a given time. The units of measurement should be
determined by the controller that uses this counter. E.g. it can
be bytes, items or any other unit the controller operates on.
+ NOTE: being a percpu_counter, the way to read the correct value
+ at all times makes it unscalable and reading it scalably makes
+ the value a little unreliable :)
b. unsigned long long max_usage
@@ -48,7 +51,8 @@ to work with it.
2. Basic accounting routines
a. void res_counter_init(struct res_counter *rc,
- struct res_counter *rc_parent)
+ struct res_counter *rc_parent,
+ unsigned long tolerance)
Initializes the resource counter. As usual, should be the first
routine called for a new counter.
@@ -57,6 +61,9 @@ to work with it.
child -> parent relationship directly in the res_counter structure,
NULL can be used to define no relationship.
+ The tolerance is used to control the batching behaviour of percpu
+ counters
+
c. int res_counter_charge(struct res_counter *rc, unsigned long val,
struct res_counter **limit_fail_at)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 731af71..2d412d7 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -14,6 +14,7 @@
*/
#include <linux/cgroup.h>
+#include <linux/percpu_counter.h>
/*
* The core object. the cgroup that wishes to account for some
@@ -23,10 +24,6 @@
struct res_counter {
/*
- * the current resource consumption level
- */
- unsigned long long usage;
- /*
* the maximal value of the usage from the counter creation
*/
unsigned long long max_usage;
@@ -48,6 +45,14 @@ struct res_counter {
*/
spinlock_t lock;
/*
+ * the current resource consumption level
+ */
+ struct percpu_counter usage;
+ /*
+ * Tolerance for the percpu_counter (usage) above
+ */
+ unsigned long usage_tolerance;
+ /*
* Parent counter, used for hierarchial resource accounting
*/
struct res_counter *parent;
@@ -98,7 +103,8 @@ enum {
* helpers for accounting
*/
-void res_counter_init(struct res_counter *counter, struct res_counter *parent);
+void res_counter_init(struct res_counter *counter, struct res_counter *parent,
+ unsigned long usage_tolerance);
/*
* charge - try to consume more resource.
@@ -133,7 +139,8 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
{
- if (cnt->usage < cnt->limit)
+ unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
+ if (usage < cnt->limit)
return true;
return false;
@@ -141,7 +148,8 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
{
- if (cnt->usage < cnt->soft_limit)
+ unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
+ if (usage < cnt->soft_limit)
return true;
return false;
@@ -157,15 +165,15 @@ static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
static inline unsigned long long
res_counter_soft_limit_excess(struct res_counter *cnt)
{
- unsigned long long excess;
- unsigned long flags;
+ unsigned long long excess, usage;
- spin_lock_irqsave(&cnt->lock, flags);
- if (cnt->usage <= cnt->soft_limit)
+ usage = percpu_counter_read_positive(&cnt->usage);
+ preempt_disable();
+ if (usage <= cnt->soft_limit)
excess = 0;
else
- excess = cnt->usage - cnt->soft_limit;
- spin_unlock_irqrestore(&cnt->lock, flags);
+ excess = usage - cnt->soft_limit;
+ preempt_enable();
return excess;
}
@@ -178,9 +186,9 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
bool ret;
unsigned long flags;
- spin_lock_irqsave(&cnt->lock, flags);
+ local_irq_save(flags);
ret = res_counter_limit_check_locked(cnt);
- spin_unlock_irqrestore(&cnt->lock, flags);
+ local_irq_restore(flags);
return ret;
}
@@ -189,18 +197,19 @@ static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
bool ret;
unsigned long flags;
- spin_lock_irqsave(&cnt->lock, flags);
+ local_irq_save(flags);
ret = res_counter_soft_limit_check_locked(cnt);
- spin_unlock_irqrestore(&cnt->lock, flags);
+ local_irq_restore(flags);
return ret;
}
static inline void res_counter_reset_max(struct res_counter *cnt)
{
unsigned long flags;
+ unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
spin_lock_irqsave(&cnt->lock, flags);
- cnt->max_usage = cnt->usage;
+ cnt->max_usage = usage;
spin_unlock_irqrestore(&cnt->lock, flags);
}
@@ -217,10 +226,11 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
unsigned long long limit)
{
unsigned long flags;
+ unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
int ret = -EBUSY;
spin_lock_irqsave(&cnt->lock, flags);
- if (cnt->usage <= limit) {
+ if (usage <= limit) {
cnt->limit = limit;
ret = 0;
}
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 88faec2..ae83168 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -15,24 +15,34 @@
#include <linux/uaccess.h>
#include <linux/mm.h>
-void res_counter_init(struct res_counter *counter, struct res_counter *parent)
+void res_counter_init(struct res_counter *counter, struct res_counter *parent,
+ unsigned long usage_tolerance)
{
spin_lock_init(&counter->lock);
+ percpu_counter_init(&counter->usage, 0);
counter->limit = RESOURCE_MAX;
counter->soft_limit = RESOURCE_MAX;
counter->parent = parent;
+ counter->usage_tolerance = usage_tolerance;
}
int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
{
- if (counter->usage + val > counter->limit) {
+ unsigned long long usage;
+
+ usage = percpu_counter_read_positive(&counter->usage);
+ if (usage + val > counter->limit) {
counter->failcnt++;
return -ENOMEM;
}
- counter->usage += val;
- if (counter->usage > counter->max_usage)
- counter->max_usage = counter->usage;
+ __percpu_counter_add(&counter->usage, val, nr_cpu_ids *
+ counter->usage_tolerance);
+ if (usage + val > counter->max_usage) {
+ spin_lock(&counter->lock);
+ counter->max_usage = (usage + val);
+ spin_unlock(&counter->lock);
+ }
return 0;
}
@@ -49,7 +59,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
*soft_limit_fail_at = NULL;
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
- spin_lock(&c->lock);
ret = res_counter_charge_locked(c, val);
/*
* With soft limits, we return the highest ancestor
@@ -58,7 +67,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
if (soft_limit_fail_at &&
!res_counter_soft_limit_check_locked(c))
*soft_limit_fail_at = c;
- spin_unlock(&c->lock);
if (ret < 0) {
*limit_fail_at = c;
goto undo;
@@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
goto done;
undo:
for (u = counter; u != c; u = u->parent) {
- spin_lock(&u->lock);
res_counter_uncharge_locked(u, val);
- spin_unlock(&u->lock);
}
done:
local_irq_restore(flags);
@@ -79,10 +85,13 @@ done:
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
{
- if (WARN_ON(counter->usage < val))
- val = counter->usage;
+ unsigned long long usage;
+
+ usage = percpu_counter_read_positive(&counter->usage);
+ if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
+ val = usage;
- counter->usage -= val;
+ percpu_counter_sub(&counter->usage, val);
}
void res_counter_uncharge(struct res_counter *counter, unsigned long val,
@@ -93,12 +102,10 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
- spin_lock(&c->lock);
if (was_soft_limit_excess)
*was_soft_limit_excess =
!res_counter_soft_limit_check_locked(c);
res_counter_uncharge_locked(c, val);
- spin_unlock(&c->lock);
}
local_irq_restore(flags);
}
@@ -108,8 +115,6 @@ static inline unsigned long long *
res_counter_member(struct res_counter *counter, int member)
{
switch (member) {
- case RES_USAGE:
- return &counter->usage;
case RES_MAX_USAGE:
return &counter->max_usage;
case RES_LIMIT:
@@ -128,11 +133,15 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
const char __user *userbuf, size_t nbytes, loff_t *pos,
int (*read_strategy)(unsigned long long val, char *st_buf))
{
- unsigned long long *val;
+ unsigned long long *val, usage_val;
char buf[64], *s;
s = buf;
- val = res_counter_member(counter, member);
+ if (member == RES_USAGE) {
+ usage_val = percpu_counter_read_positive(&counter->usage);
+ val = &usage_val;
+ } else
+ val = res_counter_member(counter, member);
if (read_strategy)
s += read_strategy(*val, s);
else
@@ -143,7 +152,10 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
u64 res_counter_read_u64(struct res_counter *counter, int member)
{
- return *res_counter_member(counter, member);
+ if (member == RES_USAGE)
+ return percpu_counter_read_positive(&counter->usage);
+ else
+ return *res_counter_member(counter, member);
}
int res_counter_memparse_write_strategy(const char *buf,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 48a38e1..17d305d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -58,6 +58,19 @@ static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
#define SOFTLIMIT_EVENTS_THRESH (1000)
/*
+ * To help resource counters scale, we take a step back
+ * and allow the counters to be scalable and set a
+ * batch value such that every addition does not cause
+ * global synchronization. The side-effect will be visible
+ * on limit enforcement, where due to this fuzziness,
+ * we will lose out on enforcing a limit when the usage
+ * exceeds the limit. The plan however in the long run
+ * is to allow this value to be controlled. We will
+ * probably add a new control file for it.
+ */
+#define MEM_CGROUP_RES_ERR_TOLERANCE (4 * PAGE_SIZE)
+
+/*
* Statistics for memory cgroup.
*/
enum mem_cgroup_stat_index {
@@ -2340,7 +2353,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *mem, bool free_all)
if (free_all)
goto try_to_free;
move_account:
- while (mem->res.usage > 0) {
+ while (res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
ret = -EBUSY;
if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
goto out;
@@ -2383,7 +2396,7 @@ try_to_free:
lru_add_drain_all();
/* try to free all pages in this cgroup */
shrink = 1;
- while (nr_retries && mem->res.usage > 0) {
+ while (nr_retries && res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
int progress;
if (signal_pending(current)) {
@@ -2401,7 +2414,7 @@ try_to_free:
}
lru_add_drain();
/* try move_account...there may be some *locked* pages. */
- if (mem->res.usage)
+ if (res_counter_read_u64(&mem->res, RES_USAGE))
goto move_account;
ret = 0;
goto out;
@@ -3019,8 +3032,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
}
if (parent && parent->use_hierarchy) {
- res_counter_init(&mem->res, &parent->res);
- res_counter_init(&mem->memsw, &parent->memsw);
+ res_counter_init(&mem->res, &parent->res,
+ MEM_CGROUP_RES_ERR_TOLERANCE);
+ res_counter_init(&mem->memsw, &parent->memsw,
+ MEM_CGROUP_RES_ERR_TOLERANCE);
/*
* We increment refcnt of the parent to ensure that we can
* safely access it on res_counter_charge/uncharge.
@@ -3029,8 +3044,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
*/
mem_cgroup_get(parent);
} else {
- res_counter_init(&mem->res, NULL);
- res_counter_init(&mem->memsw, NULL);
+ res_counter_init(&mem->res, NULL,
+ MEM_CGROUP_RES_ERR_TOLERANCE);
+ res_counter_init(&mem->memsw, NULL,
+ MEM_CGROUP_RES_ERR_TOLERANCE);
}
mem->last_scanned_child = 0;
spin_lock_init(&mem->reclaim_param_lock);
--
Balbir
* Re: Help Resource Counters Scale Better (v2)
2009-08-07 22:12 Help Resource Counters Scale Better (v2) Balbir Singh
@ 2009-08-08 1:11 ` KAMEZAWA Hiroyuki
2009-08-08 6:05 ` Balbir Singh
0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-08 1:11 UTC (permalink / raw)
To: balbir
Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, lizf, menage, Pavel Emelianov, linux-kernel,
linux-mm
Balbir Singh wrote:
> Enhancement: For scalability move the resource counter to a percpu counter
>
> Changelog v2->v1
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> 1. Updated Documentation (cgroups.txt and resource_counters.txt)
> 2. Added the notion of tolerance to resource counter initialization
>
Looks better... but I have a few concerns/nitpicks.
> I tested this patch on my x86_64 box with a regular test for hard
> limits and a page fault program. I also enabled lockdep and lock_stat
> clearly shows that the contention on counter->lock is down quite
> significantly. I tested these patches against an older mmotm, but
> this should apply cleanly to the 6th August mmotm as well.
>
It's always helpful if the numbers are shown.
> diff --git a/Documentation/cgroups/memory.txt
> b/Documentation/cgroups/memory.txt
> index b871f25..8f86537 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -13,6 +13,9 @@ c. Provides *zero overhead* for non memory controller
> users
> d. Provides a double LRU: global memory pressure causes reclaim from the
> global LRU; a cgroup on hitting a limit, reclaims from the per
> cgroup LRU
> + NOTE: One can no longer rely on the exact limit. Since we've moved
> + to using percpu_counters for resource counters, there is always going
> + to be a fuzziness factor depending on the batch value.
>
This text is written from our (developer) point of view. Please
explain it for people who are reading about memcg for the first time.
> Benefits and Purpose of the memory controller
>
> diff --git a/Documentation/cgroups/resource_counter.txt
> b/Documentation/cgroups/resource_counter.txt
> index 95b24d7..d3b276b 100644
> --- a/Documentation/cgroups/resource_counter.txt
> +++ b/Documentation/cgroups/resource_counter.txt
> @@ -12,12 +12,15 @@ to work with it.
>
> 1. Crucial parts of the res_counter structure
>
> - a. unsigned long long usage
> + a. percpu_counter usage
>
> The usage value shows the amount of a resource that is consumed
> by a group at a given time. The units of measurement should be
> determined by the controller that uses this counter. E.g. it can
> be bytes, items or any other unit the controller operates on.
> + NOTE: being a percpu_counter, the way to read the correct value
> + at all times makes it unscalable and reading it scalably makes
> + the value a little unreliable :)
>
ditto.
> b. unsigned long long max_usage
>
> @@ -48,7 +51,8 @@ to work with it.
> 2. Basic accounting routines
>
> a. void res_counter_init(struct res_counter *rc,
> - struct res_counter *rc_parent)
> + struct res_counter *rc_parent,
> + unsigned long tolerance)
>
> Initializes the resource counter. As usual, should be the first
> routine called for a new counter.
> @@ -57,6 +61,9 @@ to work with it.
> child -> parent relationship directly in the res_counter structure,
> NULL can be used to define no relationship.
>
> + The tolerance is used to control the batching behaviour of percpu
> + counters
> +
This description is ambiguous.
What is the system's total tolerance?
tolerance?
nr_online_cpus * tolerance?
MAX_CPUS * tolerance?
> c. int res_counter_charge(struct res_counter *rc, unsigned long val,
> struct res_counter **limit_fail_at)
>
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index 731af71..2d412d7 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -14,6 +14,7 @@
> */
>
> #include <linux/cgroup.h>
> +#include <linux/percpu_counter.h>
>
> /*
> * The core object. the cgroup that wishes to account for some
> @@ -23,10 +24,6 @@
>
> struct res_counter {
> /*
> - * the current resource consumption level
> - */
> - unsigned long long usage;
> - /*
> * the maximal value of the usage from the counter creation
> */
> unsigned long long max_usage;
> @@ -48,6 +45,14 @@ struct res_counter {
> */
> spinlock_t lock;
> /*
> + * the current resource consumption level
> + */
> + struct percpu_counter usage;
> + /*
> + * Tolerance for the percpu_counter (usage) above
> + */
> + unsigned long usage_tolerance;
> + /*
> * Parent counter, used for hierarchial resource accounting
> */
> struct res_counter *parent;
> @@ -98,7 +103,8 @@ enum {
> * helpers for accounting
> */
>
> -void res_counter_init(struct res_counter *counter, struct res_counter
> *parent);
> +void res_counter_init(struct res_counter *counter, struct res_counter
> *parent,
> + unsigned long usage_tolerance);
>
> /*
> * charge - try to consume more resource.
> @@ -133,7 +139,8 @@ void res_counter_uncharge(struct res_counter *counter,
> unsigned long val,
>
> static inline bool res_counter_limit_check_locked(struct res_counter
> *cnt)
> {
> - if (cnt->usage < cnt->limit)
> + unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> + if (usage < cnt->limit)
> return true;
>
Hmm. In memcg, this function is not used on a busy path but on an
important path, to check whether usage is under the limit (and whether
to continue reclaim).
Can't we add a res_counter_check_locked_exact(), which uses
percpu_counter_sum(), later?
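For illustration, the exact variant I have in mind would be roughly the
following (just a sketch; the helper does not exist yet and the name is
mine):

static inline bool res_counter_check_locked_exact(struct res_counter *cnt)
{
	/*
	 * percpu_counter_sum_positive() folds in every cpu's local
	 * delta, so the comparison is exact, at the cost of touching
	 * all cpus.
	 */
	u64 usage = percpu_counter_sum_positive(&cnt->usage);

	return usage < cnt->limit;
}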
> return false;
> @@ -141,7 +148,8 @@ static inline bool
> res_counter_limit_check_locked(struct res_counter *cnt)
>
> static inline bool res_counter_soft_limit_check_locked(struct res_counter
> *cnt)
> {
> - if (cnt->usage < cnt->soft_limit)
> + unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> + if (usage < cnt->soft_limit)
> return true;
>
> return false;
> @@ -157,15 +165,15 @@ static inline bool
> res_counter_soft_limit_check_locked(struct res_counter *cnt)
> static inline unsigned long long
> res_counter_soft_limit_excess(struct res_counter *cnt)
> {
> - unsigned long long excess;
> - unsigned long flags;
> + unsigned long long excess, usage;
>
> - spin_lock_irqsave(&cnt->lock, flags);
> - if (cnt->usage <= cnt->soft_limit)
> + usage = percpu_counter_read_positive(&cnt->usage);
> + preempt_disable();
> + if (usage <= cnt->soft_limit)
> excess = 0;
> else
> - excess = cnt->usage - cnt->soft_limit;
> - spin_unlock_irqrestore(&cnt->lock, flags);
> + excess = usage - cnt->soft_limit;
> + preempt_enable();
> return excess;
> }
It's not clear why this part uses preempt_disable() instead of
irqsave(). Could you add a comment?
(*AND* it seems the caller disables irqs....)
>
> @@ -178,9 +186,9 @@ static inline bool
> res_counter_check_under_limit(struct res_counter *cnt)
> bool ret;
> unsigned long flags;
>
> - spin_lock_irqsave(&cnt->lock, flags);
> + local_irq_save(flags);
> ret = res_counter_limit_check_locked(cnt);
> - spin_unlock_irqrestore(&cnt->lock, flags);
> + local_irq_restore(flags);
> return ret;
> }
>
> @@ -189,18 +197,19 @@ static inline bool
> res_counter_check_under_soft_limit(struct res_counter *cnt)
> bool ret;
> unsigned long flags;
>
> - spin_lock_irqsave(&cnt->lock, flags);
> + local_irq_save(flags);
> ret = res_counter_soft_limit_check_locked(cnt);
> - spin_unlock_irqrestore(&cnt->lock, flags);
> + local_irq_restore(flags);
> return ret;
> }
>
> static inline void res_counter_reset_max(struct res_counter *cnt)
> {
> unsigned long flags;
> + unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
>
> spin_lock_irqsave(&cnt->lock, flags);
> - cnt->max_usage = cnt->usage;
> + cnt->max_usage = usage;
> spin_unlock_irqrestore(&cnt->lock, flags);
> }
>
> @@ -217,10 +226,11 @@ static inline int res_counter_set_limit(struct
> res_counter *cnt,
> unsigned long long limit)
> {
> unsigned long flags;
> + unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> int ret = -EBUSY;
>
> spin_lock_irqsave(&cnt->lock, flags);
> - if (cnt->usage <= limit) {
> + if (usage <= limit) {
> cnt->limit = limit;
> ret = 0;
> }
For the same reason as the limit check, I want the correct number
here. percpu_counter_sum() is better.
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 88faec2..ae83168 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -15,24 +15,34 @@
> #include <linux/uaccess.h>
> #include <linux/mm.h>
>
> -void res_counter_init(struct res_counter *counter, struct res_counter
> *parent)
> +void res_counter_init(struct res_counter *counter, struct res_counter
> *parent,
> + unsigned long usage_tolerance)
> {
> spin_lock_init(&counter->lock);
> + percpu_counter_init(&counter->usage, 0);
> counter->limit = RESOURCE_MAX;
> counter->soft_limit = RESOURCE_MAX;
> counter->parent = parent;
> + counter->usage_tolerance = usage_tolerance;
> }
>
> int res_counter_charge_locked(struct res_counter *counter, unsigned long
> val)
> {
> - if (counter->usage + val > counter->limit) {
> + unsigned long long usage;
> +
> + usage = percpu_counter_read_positive(&counter->usage);
> + if (usage + val > counter->limit) {
> counter->failcnt++;
> return -ENOMEM;
> }
>
> - counter->usage += val;
> - if (counter->usage > counter->max_usage)
> - counter->max_usage = counter->usage;
> + __percpu_counter_add(&counter->usage, val, nr_cpu_ids *
> + counter->usage_tolerance);
> + if (usage + val > counter->max_usage) {
> + spin_lock(&counter->lock);
> + counter->max_usage = (usage + val);
> + spin_unlock(&counter->lock);
> + }
Hmm... irq is already off here?
> return 0;
> }
>
> @@ -49,7 +59,6 @@ int res_counter_charge(struct res_counter *counter,
> unsigned long val,
> *soft_limit_fail_at = NULL;
> local_irq_save(flags);
> for (c = counter; c != NULL; c = c->parent) {
> - spin_lock(&c->lock);
> ret = res_counter_charge_locked(c, val);
> /*
> * With soft limits, we return the highest ancestor
> @@ -58,7 +67,6 @@ int res_counter_charge(struct res_counter *counter,
> unsigned long val,
> if (soft_limit_fail_at &&
> !res_counter_soft_limit_check_locked(c))
> *soft_limit_fail_at = c;
> - spin_unlock(&c->lock);
> if (ret < 0) {
> *limit_fail_at = c;
> goto undo;
> @@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter,
> unsigned long val,
> goto done;
> undo:
> for (u = counter; u != c; u = u->parent) {
> - spin_lock(&u->lock);
> res_counter_uncharge_locked(u, val);
> - spin_unlock(&u->lock);
> }
> done:
When using hierarchy, the tolerance at the root node will be bigger.
Please add a note about this to the documentation.
> local_irq_restore(flags);
> @@ -79,10 +85,13 @@ done:
>
> void res_counter_uncharge_locked(struct res_counter *counter, unsigned
> long val)
> {
> - if (WARN_ON(counter->usage < val))
> - val = counter->usage;
> + unsigned long long usage;
> +
> + usage = percpu_counter_read_positive(&counter->usage);
> + if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
> + val = usage;
Is this correct? (or do we need this WARN_ON?)
Hmm. percpu_counter is cpu-hotplug aware, so nr_cpu_ids is not quite
right, but num_online_cpus() is heavy... hmm.
>
> - counter->usage -= val;
> + percpu_counter_sub(&counter->usage, val);
> }
>
> void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> @@ -93,12 +102,10 @@ void res_counter_uncharge(struct res_counter
> *counter, unsigned long val,
>
> local_irq_save(flags);
> for (c = counter; c != NULL; c = c->parent) {
> - spin_lock(&c->lock);
> if (was_soft_limit_excess)
> *was_soft_limit_excess =
> !res_counter_soft_limit_check_locked(c);
> res_counter_uncharge_locked(c, val);
> - spin_unlock(&c->lock);
> }
> local_irq_restore(flags);
> }
> @@ -108,8 +115,6 @@ static inline unsigned long long *
> res_counter_member(struct res_counter *counter, int member)
> {
> switch (member) {
> - case RES_USAGE:
> - return &counter->usage;
> case RES_MAX_USAGE:
> return &counter->max_usage;
> case RES_LIMIT:
> @@ -128,11 +133,15 @@ ssize_t res_counter_read(struct res_counter
> *counter, int member,
> const char __user *userbuf, size_t nbytes, loff_t *pos,
> int (*read_strategy)(unsigned long long val, char *st_buf))
> {
> - unsigned long long *val;
> + unsigned long long *val, usage_val;
> char buf[64], *s;
>
> s = buf;
> - val = res_counter_member(counter, member);
> + if (member == RES_USAGE) {
> + usage_val = percpu_counter_read_positive(&counter->usage);
> + val = &usage_val;
> + } else
> + val = res_counter_member(counter, member);
> if (read_strategy)
> s += read_strategy(*val, s);
> else
> @@ -143,7 +152,10 @@ ssize_t res_counter_read(struct res_counter *counter,
> int member,
>
> u64 res_counter_read_u64(struct res_counter *counter, int member)
> {
> - return *res_counter_member(counter, member);
> + if (member == RES_USAGE)
> + return percpu_counter_read_positive(&counter->usage);
> + else
> + return *res_counter_member(counter, member);
> }
>
> int res_counter_memparse_write_strategy(const char *buf,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 48a38e1..17d305d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -58,6 +58,19 @@ static DEFINE_MUTEX(memcg_tasklist); /* can be hold
> under cgroup_mutex */
> #define SOFTLIMIT_EVENTS_THRESH (1000)
>
> /*
> + * To help resource counters scale, we take a step back
> + * and allow the counters to be scalable and set a
> + * batch value such that every addition does not cause
> + * global synchronization. The side-effect will be visible
> + * on limit enforcement, where due to this fuzziness,
> + * we will lose out on inforcing a limit when the usage
> + * exceeds the limit. The plan however in the long run
> + * is to allow this value to be controlled. We will
> + * probably add a new control file for it.
> + */
> +#define MEM_CGROUP_RES_ERR_TOLERANCE (4 * PAGE_SIZE)
Considering the percpu counter's extra overhead, this number is too
small, IMO.
> +
> +/*
> * Statistics for memory cgroup.
> */
> enum mem_cgroup_stat_index {
> @@ -2340,7 +2353,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup
> *mem, bool free_all)
> if (free_all)
> goto try_to_free;
> move_account:
> - while (mem->res.usage > 0) {
> + while (res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
> ret = -EBUSY;
> if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
> goto out;
> @@ -2383,7 +2396,7 @@ try_to_free:
> lru_add_drain_all();
> /* try to free all pages in this cgroup */
> shrink = 1;
> - while (nr_retries && mem->res.usage > 0) {
> + while (nr_retries && res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
> int progress;
>
> if (signal_pending(current)) {
> @@ -2401,7 +2414,7 @@ try_to_free:
> }
> lru_add_drain();
> /* try move_account...there may be some *locked* pages. */
> - if (mem->res.usage)
> + if (res_counter_read_u64(&mem->res, RES_USAGE))
> goto move_account;
> ret = 0;
> goto out;
> @@ -3019,8 +3032,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct
> cgroup *cont)
> }
>
> if (parent && parent->use_hierarchy) {
> - res_counter_init(&mem->res, &parent->res);
> - res_counter_init(&mem->memsw, &parent->memsw);
> + res_counter_init(&mem->res, &parent->res,
> + MEM_CGROUP_RES_ERR_TOLERANCE);
> + res_counter_init(&mem->memsw, &parent->memsw,
> + MEM_CGROUP_RES_ERR_TOLERANCE);
> /*
> * We increment refcnt of the parent to ensure that we can
> * safely access it on res_counter_charge/uncharge.
> @@ -3029,8 +3044,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct
> cgroup *cont)
> */
> mem_cgroup_get(parent);
> } else {
> - res_counter_init(&mem->res, NULL);
> - res_counter_init(&mem->memsw, NULL);
> + res_counter_init(&mem->res, NULL,
> + MEM_CGROUP_RES_ERR_TOLERANCE);
> + res_counter_init(&mem->memsw, NULL,
> + MEM_CGROUP_RES_ERR_TOLERANCE);
> }
> mem->last_scanned_child = 0;
> spin_lock_init(&mem->reclaim_param_lock);
>
Thanks,
-Kame
>
> --
> Balbir
>
* Re: Help Resource Counters Scale Better (v2)
2009-08-08 1:11 ` KAMEZAWA Hiroyuki
@ 2009-08-08 6:05 ` Balbir Singh
2009-08-08 7:38 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2009-08-08 6:05 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
lizf, menage, Pavel Emelianov, linux-kernel, linux-mm
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-08 10:11:40]:
> Balbir Singh wrote:
> > Enhancement: For scalability move the resource counter to a percpu counter
> >
> > Changelog v2->v1
> >
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> >
> > 1. Updated Documentation (cgroups.txt and resource_counters.txt)
> > 2. Added the notion of tolerance to resource counter initialization
> >
> looks better ..but a few concerns/nitpicks.
>
>
> > I tested this patch on my x86_64 box with a regular test for hard
> > limits and a page fault program. I also enabled lockdep and lock_stat
> > clearly shows that the contention on counter->lock is down quite
> > significantly. I tested these patches against an older mmotm, but
> > this should apply cleanly to the 6th August mmotm as well.
> >
> It's always helpful if the numbers are shown.
>
Sure, but please do remember that I did these tests on a small box.
I'll try to get hold of a bigger box for more results. This was with a
simple memcg test.
Before
&counter->lock: 4139 4195
0.64 7.22 2072.51 144272 17989058
0.50 26.78 3573647.35
--------------
&counter->lock 2016
[<ffffffff81087ac8>] res_counter_charge+0x57/0x10c
&counter->lock 2179
[<ffffffff81087a2f>] res_counter_uncharge+0x39/0x7b
--------------
&counter->lock 1931
[<ffffffff81087ac8>] res_counter_charge+0x57/0x10c
&counter->lock 2264
[<ffffffff81087a2f>] res_counter_uncharge+0x39/0x7b
After
key#2: 467 475
0.61 4.74 216.92 16860 8578542
0.56 16.26 1707604.41
-----
key#2 475
[<ffffffff811f7347>] __percpu_counter_add+0x56/0xad
-----
key#2 475
[<ffffffff811f7347>] __percpu_counter_add+0x56/0xad
>
> > diff --git a/Documentation/cgroups/memory.txt
> > b/Documentation/cgroups/memory.txt
> > index b871f25..8f86537 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -13,6 +13,9 @@ c. Provides *zero overhead* for non memory controller
> > users
> > d. Provides a double LRU: global memory pressure causes reclaim from the
> > global LRU; a cgroup on hitting a limit, reclaims from the per
> > cgroup LRU
> > + NOTE: One can no longer rely on the exact limit. Since we've moved
> > + to using percpu_counters for resource counters, there is always going
> > + to be a fuzziness factor depending on the batch value.
> >
> This text is just from our view.
> Please explain to people who reads memcg for the first time.
>
>
> > Benefits and Purpose of the memory controller
> >
> > diff --git a/Documentation/cgroups/resource_counter.txt
> > b/Documentation/cgroups/resource_counter.txt
> > index 95b24d7..d3b276b 100644
> > --- a/Documentation/cgroups/resource_counter.txt
> > +++ b/Documentation/cgroups/resource_counter.txt
> > @@ -12,12 +12,15 @@ to work with it.
> >
> > 1. Crucial parts of the res_counter structure
> >
> > - a. unsigned long long usage
> > + a. percpu_counter usage
> >
> > The usage value shows the amount of a resource that is consumed
> > by a group at a given time. The units of measurement should be
> > determined by the controller that uses this counter. E.g. it can
> > be bytes, items or any other unit the controller operates on.
> > + NOTE: being a percpu_counter, the way to read the correct value
> > + at all times makes it unscalable and reading it scalably makes
> > + the value a little unreliable :)
> >
> ditto.
OK.. I'll enhance the documentation
>
> > b. unsigned long long max_usage
> >
> > @@ -48,7 +51,8 @@ to work with it.
> > 2. Basic accounting routines
> >
> > a. void res_counter_init(struct res_counter *rc,
> > - struct res_counter *rc_parent)
> > + struct res_counter *rc_parent,
> > + unsigned long tolerance)
> >
> > Initializes the resource counter. As usual, should be the first
> > routine called for a new counter.
> > @@ -57,6 +61,9 @@ to work with it.
> > child -> parent relationship directly in the res_counter structure,
> > NULL can be used to define no relationship.
> >
> > + The tolerance is used to control the batching behaviour of percpu
> > + counters
> > +
> This description is ambiguous.
> What is the system's total tolerance ?
> tolerance ?
> nr_online_cpus * tolerance ?
> MAX_CPUS * tolerance ?
>
I'll add more text to explain
>
> > c. int res_counter_charge(struct res_counter *rc, unsigned long val,
> > struct res_counter **limit_fail_at)
> >
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index 731af71..2d412d7 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -14,6 +14,7 @@
> > */
> >
> > #include <linux/cgroup.h>
> > +#include <linux/percpu_counter.h>
> >
> > /*
> > * The core object. the cgroup that wishes to account for some
> > @@ -23,10 +24,6 @@
> >
> > struct res_counter {
> > /*
> > - * the current resource consumption level
> > - */
> > - unsigned long long usage;
> > - /*
> > * the maximal value of the usage from the counter creation
> > */
> > unsigned long long max_usage;
> > @@ -48,6 +45,14 @@ struct res_counter {
> > */
> > spinlock_t lock;
> > /*
> > + * the current resource consumption level
> > + */
> > + struct percpu_counter usage;
> > + /*
> > + * Tolerance for the percpu_counter (usage) above
> > + */
> > + unsigned long usage_tolerance;
> > + /*
> > * Parent counter, used for hierarchial resource accounting
> > */
> > struct res_counter *parent;
> > @@ -98,7 +103,8 @@ enum {
> > * helpers for accounting
> > */
> >
> > -void res_counter_init(struct res_counter *counter, struct res_counter
> > *parent);
> > +void res_counter_init(struct res_counter *counter, struct res_counter
> > *parent,
> > + unsigned long usage_tolerance);
> >
> > /*
> > * charge - try to consume more resource.
> > @@ -133,7 +139,8 @@ void res_counter_uncharge(struct res_counter *counter,
> > unsigned long val,
> >
> > static inline bool res_counter_limit_check_locked(struct res_counter
> > *cnt)
> > {
> > - if (cnt->usage < cnt->limit)
> > + unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> > + if (usage < cnt->limit)
> > return true;
> >
> Hmm. In memcg, this function is not used for busy pass but used for
> important pass to check usage under limit (and continue reclaim)
>
> Can't we add res_clounter_check_locked_exact(), which use
> percpu_counter_sum() later ?
We can, but I want to do it in parts, once I add the policy for
strict/non-strict checking. It is on my mind, but I want to work on the
overhead, since I've heard from many people that we need to resolve
that first.
>
> > return false;
> > @@ -141,7 +148,8 @@ static inline bool
> > res_counter_limit_check_locked(struct res_counter *cnt)
> >
> > static inline bool res_counter_soft_limit_check_locked(struct res_counter
> > *cnt)
> > {
> > - if (cnt->usage < cnt->soft_limit)
> > + unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> > + if (usage < cnt->soft_limit)
> > return true;
> >
> > return false;
> > @@ -157,15 +165,15 @@ static inline bool
> > res_counter_soft_limit_check_locked(struct res_counter *cnt)
> > static inline unsigned long long
> > res_counter_soft_limit_excess(struct res_counter *cnt)
> > {
> > - unsigned long long excess;
> > - unsigned long flags;
> > + unsigned long long excess, usage;
> >
> > - spin_lock_irqsave(&cnt->lock, flags);
> > - if (cnt->usage <= cnt->soft_limit)
> > + usage = percpu_counter_read_positive(&cnt->usage);
> > + preempt_disable();
> > + if (usage <= cnt->soft_limit)
> > excess = 0;
> > else
> > - excess = cnt->usage - cnt->soft_limit;
> > - spin_unlock_irqrestore(&cnt->lock, flags);
> > + excess = usage - cnt->soft_limit;
> > + preempt_enable();
> > return excess;
> > }
> It's not clear why this part uses preempt_disable() instead of
> irqsave(). Could you add comment ?
> (*AND* it seems the caller disable irq....)
>
Is it? The caller is mem_cgroup_update_tree(), which is called outside
of IRQ context. In mem_cgroup_insert_exceeded(), it is under the lock.
>
> >
> > @@ -178,9 +186,9 @@ static inline bool
> > res_counter_check_under_limit(struct res_counter *cnt)
> > bool ret;
> > unsigned long flags;
> >
> > - spin_lock_irqsave(&cnt->lock, flags);
> > + local_irq_save(flags);
> > ret = res_counter_limit_check_locked(cnt);
> > - spin_unlock_irqrestore(&cnt->lock, flags);
> > + local_irq_restore(flags);
> > return ret;
> > }
> >
> > @@ -189,18 +197,19 @@ static inline bool
> > res_counter_check_under_soft_limit(struct res_counter *cnt)
> > bool ret;
> > unsigned long flags;
> >
> > - spin_lock_irqsave(&cnt->lock, flags);
> > + local_irq_save(flags);
> > ret = res_counter_soft_limit_check_locked(cnt);
> > - spin_unlock_irqrestore(&cnt->lock, flags);
> > + local_irq_restore(flags);
> > return ret;
> > }
> >
> > static inline void res_counter_reset_max(struct res_counter *cnt)
> > {
> > unsigned long flags;
> > + unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> >
> > spin_lock_irqsave(&cnt->lock, flags);
> > - cnt->max_usage = cnt->usage;
> > + cnt->max_usage = usage;
> > spin_unlock_irqrestore(&cnt->lock, flags);
> > }
> >
> > @@ -217,10 +226,11 @@ static inline int res_counter_set_limit(struct
> > res_counter *cnt,
> > unsigned long long limit)
> > {
> > unsigned long flags;
> > + unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> > int ret = -EBUSY;
> >
> > spin_lock_irqsave(&cnt->lock, flags);
> > - if (cnt->usage <= limit) {
> > + if (usage <= limit) {
> > cnt->limit = limit;
> > ret = 0;
> > }
>
> For the same reason to check_limit, I want correct number here.
> percpu_counter_sum() is better.
>
I'll add that when we do strict accounting. Are you suggesting that
resource_counter_set_limit should use strict accounting?
>
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index 88faec2..ae83168 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -15,24 +15,34 @@
> > #include <linux/uaccess.h>
> > #include <linux/mm.h>
> >
> > -void res_counter_init(struct res_counter *counter, struct res_counter
> > *parent)
> > +void res_counter_init(struct res_counter *counter, struct res_counter
> > *parent,
> > + unsigned long usage_tolerance)
> > {
> > spin_lock_init(&counter->lock);
> > + percpu_counter_init(&counter->usage, 0);
> > counter->limit = RESOURCE_MAX;
> > counter->soft_limit = RESOURCE_MAX;
> > counter->parent = parent;
> > + counter->usage_tolerance = usage_tolerance;
> > }
> >
> > int res_counter_charge_locked(struct res_counter *counter, unsigned long
> > val)
> > {
> > - if (counter->usage + val > counter->limit) {
> > + unsigned long long usage;
> > +
> > + usage = percpu_counter_read_positive(&counter->usage);
> > + if (usage + val > counter->limit) {
> > counter->failcnt++;
> > return -ENOMEM;
> > }
> >
> > - counter->usage += val;
> > - if (counter->usage > counter->max_usage)
> > - counter->max_usage = counter->usage;
> > + __percpu_counter_add(&counter->usage, val, nr_cpu_ids *
> > + counter->usage_tolerance);
> > + if (usage + val > counter->max_usage) {
> > + spin_lock(&counter->lock);
> > + counter->max_usage = (usage + val);
> > + spin_unlock(&counter->lock);
> > + }
> Hmm...irq is already off here ?
Yes
>
>
> > return 0;
> > }
> >
> > @@ -49,7 +59,6 @@ int res_counter_charge(struct res_counter *counter,
> > unsigned long val,
> > *soft_limit_fail_at = NULL;
> > local_irq_save(flags);
> > for (c = counter; c != NULL; c = c->parent) {
> > - spin_lock(&c->lock);
> > ret = res_counter_charge_locked(c, val);
> > /*
> > * With soft limits, we return the highest ancestor
> > @@ -58,7 +67,6 @@ int res_counter_charge(struct res_counter *counter,
> > unsigned long val,
> > if (soft_limit_fail_at &&
> > !res_counter_soft_limit_check_locked(c))
> > *soft_limit_fail_at = c;
> > - spin_unlock(&c->lock);
> > if (ret < 0) {
> > *limit_fail_at = c;
> > goto undo;
> > @@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter,
> > unsigned long val,
> > goto done;
> > undo:
> > for (u = counter; u != c; u = u->parent) {
> > - spin_lock(&u->lock);
> > res_counter_uncharge_locked(u, val);
> > - spin_unlock(&u->lock);
> > }
> > done:
>
> When using hierarchy, tolerance to root node will be bigger.
> Please write this attention to the document.
>
No, I don't think so.
Irrespective of hierarchy, we do the following:
1. Add locally; if the local sum reaches the batch count, we sum and
save.
I don't think hierarchy should affect it, no?
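To be concrete, the batching I'm relying on is roughly the following
(a simplified paraphrase of __percpu_counter_add(fbc, amount, batch) in
lib/percpu_counter.c, not the exact code):

	s32 *pcount;
	s64 count;
	int cpu = get_cpu();

	pcount = per_cpu_ptr(fbc->counters, cpu);
	count = *pcount + amount;
	if (count >= batch || count <= -batch) {
		/* local delta reached the batch: fold it into the
		 * shared count under fbc->lock, reset the local delta */
		spin_lock(&fbc->lock);
		fbc->count += count;
		*pcount = 0;
		spin_unlock(&fbc->lock);
	} else {
		/* stay per-cpu: no shared cacheline is touched */
		*pcount = count;
	}
	put_cpu();

So each res_counter folds independently; the hierarchy only determines
how many counters a charge touches, not how large any one counter's
error can grow.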
>
> > local_irq_restore(flags);
> > @@ -79,10 +85,13 @@ done:
> >
> > void res_counter_uncharge_locked(struct res_counter *counter, unsigned
> > long val)
> > {
> > - if (WARN_ON(counter->usage < val))
> > - val = counter->usage;
> > + unsigned long long usage;
> > +
> > + usage = percpu_counter_read_positive(&counter->usage);
> > + if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
> > + val = usage;
> Is this correct ? (or do we need this WARN_ON ?)
> Hmm. percpu_counter is cpu-hotplug aware. Then,
> nr_cpu_ids is not correct. but nr_onlie_cpus() is heavy..hmm.
>
OK, so the deal is: even though it is hotplug-aware, the batch count is
a heuristic and I don't want to do heavy math in it. nr_cpu_ids is
larger, but also lightweight in terms of computation.
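Roughly, the tradeoff being discussed is (illustrative only; tol stands
in for the per-counter usage_tolerance):

	/* nr_cpu_ids: reads a plain variable, but over-counts when
	 * some possible cpus are offline */
	s32 batch_ids = nr_cpu_ids * tol;

	/* num_online_cpus(): tracks hotplug exactly, but has to weigh
	 * the cpu_online_mask bitmap on every call */
	s32 batch_online = num_online_cpus() * tol;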
>
> >
> > - counter->usage -= val;
> > + percpu_counter_sub(&counter->usage, val);
> > }
> >
> > void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > @@ -93,12 +102,10 @@ void res_counter_uncharge(struct res_counter
> > *counter, unsigned long val,
> >
> > local_irq_save(flags);
> > for (c = counter; c != NULL; c = c->parent) {
> > - spin_lock(&c->lock);
> > if (was_soft_limit_excess)
> > *was_soft_limit_excess =
> > !res_counter_soft_limit_check_locked(c);
> > res_counter_uncharge_locked(c, val);
> > - spin_unlock(&c->lock);
> > }
> > local_irq_restore(flags);
> > }
> > @@ -108,8 +115,6 @@ static inline unsigned long long *
> > res_counter_member(struct res_counter *counter, int member)
> > {
> > switch (member) {
> > - case RES_USAGE:
> > - return &counter->usage;
> > case RES_MAX_USAGE:
> > return &counter->max_usage;
> > case RES_LIMIT:
> > @@ -128,11 +133,15 @@ ssize_t res_counter_read(struct res_counter
> > *counter, int member,
> > const char __user *userbuf, size_t nbytes, loff_t *pos,
> > int (*read_strategy)(unsigned long long val, char *st_buf))
> > {
> > - unsigned long long *val;
> > + unsigned long long *val, usage_val;
> > char buf[64], *s;
> >
> > s = buf;
> > - val = res_counter_member(counter, member);
> > + if (member == RES_USAGE) {
> > + usage_val = percpu_counter_read_positive(&counter->usage);
> > + val = &usage_val;
> > + } else
> > + val = res_counter_member(counter, member);
> > if (read_strategy)
> > s += read_strategy(*val, s);
>
>
> > else
> > @@ -143,7 +152,10 @@ ssize_t res_counter_read(struct res_counter *counter,
> > int member,
> >
> > u64 res_counter_read_u64(struct res_counter *counter, int member)
> > {
> > - return *res_counter_member(counter, member);
> > + if (member == RES_USAGE)
> > + return percpu_counter_read_positive(&counter->usage);
> > + else
> > + return *res_counter_member(counter, member);
> > }
>
> >
> > int res_counter_memparse_write_strategy(const char *buf,
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 48a38e1..17d305d 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -58,6 +58,19 @@ static DEFINE_MUTEX(memcg_tasklist); /* can be hold
> > under cgroup_mutex */
> > #define SOFTLIMIT_EVENTS_THRESH (1000)
> >
> > /*
> > + * To help resource counters scale, we take a step back
> > + * and allow the counters to be scalable and set a
> > + * batch value such that every addition does not cause
> > + * global synchronization. The side-effect will be visible
> > + * on limit enforcement, where due to this fuzziness,
> > + * we will lose out on inforcing a limit when the usage
> > + * exceeds the limit. The plan however in the long run
> > + * is to allow this value to be controlled. We will
> > + * probably add a new control file for it.
> > + */
> > +#define MEM_CGROUP_RES_ERR_TOLERANCE (4 * PAGE_SIZE)
>
> Considering percpu counter's extra overhead. This number is too small, IMO.
>
OK, the reason I kept it that way is that on ppc64 PAGE_SIZE is now
64k. Maybe we should pick a standard size like 64k and stick with it.
What do you think?
> > +
> > +/*
> > * Statistics for memory cgroup.
> > */
> > enum mem_cgroup_stat_index {
> > @@ -2340,7 +2353,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup
> > *mem, bool free_all)
> > if (free_all)
> > goto try_to_free;
> > move_account:
> > - while (mem->res.usage > 0) {
> > + while (res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
> > ret = -EBUSY;
> > if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
> > goto out;
> > @@ -2383,7 +2396,7 @@ try_to_free:
> > lru_add_drain_all();
> > /* try to free all pages in this cgroup */
> > shrink = 1;
> > - while (nr_retries && mem->res.usage > 0) {
> > + while (nr_retries && res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
> > int progress;
> >
> > if (signal_pending(current)) {
> > @@ -2401,7 +2414,7 @@ try_to_free:
> > }
> > lru_add_drain();
> > /* try move_account...there may be some *locked* pages. */
> > - if (mem->res.usage)
> > + if (res_counter_read_u64(&mem->res, RES_USAGE))
> > goto move_account;
> > ret = 0;
> > goto out;
> > @@ -3019,8 +3032,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct
> > cgroup *cont)
> > }
> >
> > if (parent && parent->use_hierarchy) {
> > - res_counter_init(&mem->res, &parent->res);
> > - res_counter_init(&mem->memsw, &parent->memsw);
> > + res_counter_init(&mem->res, &parent->res,
> > + MEM_CGROUP_RES_ERR_TOLERANCE);
> > + res_counter_init(&mem->memsw, &parent->memsw,
> > + MEM_CGROUP_RES_ERR_TOLERANCE);
> > /*
> > * We increment refcnt of the parent to ensure that we can
> > * safely access it on res_counter_charge/uncharge.
> > @@ -3029,8 +3044,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct
> > cgroup *cont)
> > */
> > mem_cgroup_get(parent);
> > } else {
> > - res_counter_init(&mem->res, NULL);
> > - res_counter_init(&mem->memsw, NULL);
> > + res_counter_init(&mem->res, NULL,
> > + MEM_CGROUP_RES_ERR_TOLERANCE);
> > + res_counter_init(&mem->memsw, NULL,
> > + MEM_CGROUP_RES_ERR_TOLERANCE);
> > }
> > mem->last_scanned_child = 0;
> > spin_lock_init(&mem->reclaim_param_lock);
> >
>
> Thanks,
> -Kame
Thanks for the detailed review!
--
Balbir
* Re: Help Resource Counters Scale Better (v2)
2009-08-08 6:05 ` Balbir Singh
@ 2009-08-08 7:38 ` KAMEZAWA Hiroyuki
2009-08-09 12:15 ` Help Resource Counters Scale Better (v3) Balbir Singh
0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-08 7:38 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, Andrew Morton, andi.kleen, Prarit Bhargava,
KOSAKI Motohiro, lizf, menage, Pavel Emelianov, linux-kernel,
linux-mm
Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-08
> 10:11:40]:
>
>> Balbir Singh wrote:
>> > static inline bool res_counter_limit_check_locked(struct res_counter
>> > *cnt)
>> > {
>> > - if (cnt->usage < cnt->limit)
>> > + unsigned long long usage =
>> percpu_counter_read_positive(&cnt->usage);
>> > + if (usage < cnt->limit)
>> > return true;
>> >
>> Hmm. In memcg, this function is not used for busy pass but used for
>> important pass to check usage under limit (and continue reclaim)
>>
>> Can't we add res_clounter_check_locked_exact(), which use
>> percpu_counter_sum() later ?
>
> We can, but I want to do it in parts, once I add the policy for
> strict/no-strict checking. It is on my mind, but I want to work on the
> overhead, since I've heard from many people that we need to resolve
> this first.
>
ok.
>> > spin_lock_irqsave(&cnt->lock, flags);
>> > - if (cnt->usage <= limit) {
>> > + if (usage <= limit) {
>> > cnt->limit = limit;
>> > ret = 0;
>> > }
>>
>> For the same reason to check_limit, I want correct number here.
>> percpu_counter_sum() is better.
>>
>
> I'll add that when we do strict accounting. Are you suggesting that
> resource_counter_set_limit should use strict accounting?
Yes, I think so (a rough sketch of what I mean is below).
And I'd like to add a "mem_cgroup_reduce_usage" or similar call to do
reclaim-on-demand, later.
I wonder whether it's OK to add error tolerance to memcg, but I want
some interface to do a "sync", especially when we measure the size of
the working set.
I like your current direction of achieving better performance, but I
wonder how users can see synchronous numbers without the tolerance;
that will be necessary for high-end users.
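For set_limit specifically, something along these lines is what I mean
(only a sketch; the _strict name is made up, and it simply swaps in
percpu_counter_sum_positive() for the cheap read):

static inline int res_counter_set_limit_strict(struct res_counter *cnt,
					       unsigned long long limit)
{
	unsigned long flags;
	int ret = -EBUSY;

	spin_lock_irqsave(&cnt->lock, flags);
	/*
	 * Exact sum, so we don't accept a new limit that the real usage
	 * already exceeds just because the cheap read lagged behind.
	 */
	if (percpu_counter_sum_positive(&cnt->usage) <= limit) {
		cnt->limit = limit;
		ret = 0;
	}
	spin_unlock_irqrestore(&cnt->lock, flags);
	return ret;
}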
>> > goto undo;
>> > @@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter,
>> > unsigned long val,
>> > goto done;
>> > undo:
>> > for (u = counter; u != c; u = u->parent) {
>> > - spin_lock(&u->lock);
>> > res_counter_uncharge_locked(u, val);
>> > - spin_unlock(&u->lock);
>> > }
>> > done:
>> When using hierarchy, tolerance to root node will be bigger.
>> Please write this attention to the document.
>>
>
> No.. I don't think so..
>
> Irrespective of hierarchy, we do the following
>
> 1. Add, if the sum reaches batch count, we sum and save
>
> I don't think hierarchy should affect it.. no?
>
Hmm, maybe I'm misunderstanding. Let me brainstorm...
In the following hierarchy,
A/01
/02
/03/X
/Y
/Z
the sum of the tolerances of X+Y+Z is limited by the tolerance of 03,
and the sum of the tolerances of 01+02+03 is limited by the tolerance
of A.
Ah, OK. I'm wrong. Hmm...
>
>>
>> > local_irq_restore(flags);
>> > @@ -79,10 +85,13 @@ done:
>> >
>> > void res_counter_uncharge_locked(struct res_counter *counter,
>> unsigned
>> > long val)
>> > {
>> > - if (WARN_ON(counter->usage < val))
>> > - val = counter->usage;
>> > + unsigned long long usage;
>> > +
>> > + usage = percpu_counter_read_positive(&counter->usage);
>> > + if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
>> > + val = usage;
>> Is this correct ? (or do we need this WARN_ON ?)
>> Hmm. percpu_counter is cpu-hotplug aware. Then,
>> nr_cpu_ids is not correct. but nr_onlie_cpus() is heavy..hmm.
>>
>
> OK.. so the deal is, even though it is aware, batch count is a
> heuristic and I don't want to do heavy math in it. nr_cpu_ids is
> larger, but also light weight in terms of computation.
>
Yes... I wonder whether there is a _variable_ that gives the number of
online cpus without a bitmap scan...
>> > /*
>> > + * To help resource counters scale, we take a step back
>> > + * and allow the counters to be scalable and set a
>> > + * batch value such that every addition does not cause
>> > + * global synchronization. The side-effect will be visible
>> > + * on limit enforcement, where due to this fuzziness,
>> > + * we will lose out on inforcing a limit when the usage
>> > + * exceeds the limit. The plan however in the long run
>> > + * is to allow this value to be controlled. We will
>> > + * probably add a new control file for it.
>> > + */
>> > +#define MEM_CGROUP_RES_ERR_TOLERANCE (4 * PAGE_SIZE)
>>
>> Considering percpu counter's extra overhead. This number is too small,
>> IMO.
>>
>
> OK.. the reason I kept it that way is because on ppc64 PAGE_SIZE is
> now 64k. May be we should pick a standard size like 64k and stick with
> it. What do you think?
>
I think 64k is reasonable as long as there is no monster machine with
4096 cpus... But even with 4096 cpus,
64k*4096 = 256M... still a small amount for a monster machine.
Hmm... I think you can add a CONFIG_MEMCG_PCPU_TOLERANCE option and set
the default value to 64k (a rough sketch is below; of course, you can
do this in another patch).
On a laptop/desktop with 4 cpus:
4*64k = 256k
On a volume-zone server with 8-16 or 32 cpus:
32*64k = 2M
On a high-end 64-256 cpu machine these days:
256*64k = 16M
Maybe not so bad. I'm not sure how many 1024-cpu machines will be used
in the next ten years...
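A rough sketch of the config option I have in mind (everything here is
hypothetical; the value is in KB):

/*
 * Hypothetical: let Kconfig choose the per-cpu tolerance instead of a
 * hard-coded constant, e.g.
 *
 *	config MEMCG_PCPU_TOLERANCE
 *		int "Per-cpu usage tolerance for memcg (KB)"
 *		default 64
 */
#define MEM_CGROUP_RES_ERR_TOLERANCE	(CONFIG_MEMCG_PCPU_TOLERANCE << 10)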
I want a percpu counter with flexible batching for minimizing tolerance.
It will be my homework.
Thanks,
-Kame
* Help Resource Counters Scale Better (v3)
2009-08-08 7:38 ` KAMEZAWA Hiroyuki
@ 2009-08-09 12:15 ` Balbir Singh
2009-08-10 0:32 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2009-08-09 12:15 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
lizf, menage, Pavel Emelianov, linux-kernel, linux-mm
Hi,
Thanks for the detailed review. Here is v3 of the patches, against the
6th August mmotm. I've documented the TODOs as well. If there are no
major objections, I would like this to be included in mmotm for more
testing. Any test reports on a large machine would be highly
appreciated.
From: Balbir Singh <balbir@linux.vnet.ibm.com>
Changelog v3->v2
1. Added more documentation and comments
2. Made the check in mem_cgroup_set_limit strict
3. Increased tolerance per cpu to 64KB.
4. Still have the WARN_ON(); I've kept it for debugging purposes.
Maybe we should make it conditional on DEBUG_VM (a rough sketch is
below).
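For item 4, the conditional version could look something like this
(a sketch only, reusing the check from the patch below):

	/* keep the sanity check, but only on debug builds */
	#ifdef CONFIG_DEBUG_VM
	WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val);
	#endif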
Changelog v2->v1
1. Updated Documentation (cgroups.txt and resource_counters.txt)
2. Added the notion of tolerance to resource counter initialization
Enhancement: For scalability move the resource counter to a percpu counter
This patch changes the usage field of a resource counter to a percpu
counter. The counter is incremented with local irq disabled. The other
fields are still protected by the spin lock for write.
This patch adds a fuzziness factor to the hard limit: since the value we read
could be off from the real value (by up to the batch value), this can be
addressed by adding a strict/non-strict check. The intention is
to turn off strict checking for root (since we can't set limits on
it anyway).
I tested this patch on my x86_64 box with a regular test for hard
limits and a page fault program.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
Documentation/cgroups/memory.txt | 22 ++++++++++++
Documentation/cgroups/resource_counter.txt | 20 +++++++++--
include/linux/res_counter.h | 52 ++++++++++++++++++----------
kernel/res_counter.c | 50 +++++++++++++++++----------
mm/memcontrol.c | 32 +++++++++++++----
5 files changed, 128 insertions(+), 48 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f25..a24dab7 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -13,6 +13,9 @@ c. Provides *zero overhead* for non memory controller users
d. Provides a double LRU: global memory pressure causes reclaim from the
global LRU; a cgroup on hitting a limit, reclaims from the per
cgroup LRU
+ NOTE: One can no longer rely on the exact limit. Since we've moved
+ to using percpu_counters for resource counters, there is always going
+ to be a fuzziness factor depending on the batch value.
Benefits and Purpose of the memory controller
@@ -422,6 +425,25 @@ NOTE2: It is recommended to set the soft limit always below the hard limit,
4. Start reclamation in the background when the limit is
not yet hit but the usage is getting closer
+9. Scalability Tradeoff
+
+As documented in Documentation/cgroups/resource_counter.txt, we've
+moved over to percpu counters for accounting usage. What does this
+mean for the end user?
+
+1. It means better performance
+2. It means that the usage reported does not necessarily reflect
+ reality. Because percpu counters do a sync only so often (see the
+ batch value in the code), the value reported might be off the
+ real value by an amount proportional to the specified tolerance.
+ The tolerance value is currently stored internally.
+
+TODOs
+
+1. Move tolerance to a config option
+2. Add support for strict/non-strict accounting
+
+
Summary
Overall, the memory controller has been a stable controller and has been
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..a43ea60 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -12,12 +12,15 @@ to work with it.
1. Crucial parts of the res_counter structure
- a. unsigned long long usage
+ a. percpu_counter usage
The usage value shows the amount of a resource that is consumed
by a group at a given time. The units of measurement should be
determined by the controller that uses this counter. E.g. it can
be bytes, items or any other unit the controller operates on.
+ NOTE: being a percpu_counter, reading the exact value at all times
+ is unscalable, and reading it scalably makes the value a little
+ unreliable :)
b. unsigned long long max_usage
@@ -39,16 +42,24 @@ to work with it.
The failcnt stands for "failures counter". This is the number of
resource allocation attempts that failed.
- c. spinlock_t lock
+ e. spinlock_t lock
Protects changes of the above values.
+ f. unsigned long tolerance
+
+ This value specifies the amount of error that can be tolerated
+ by the resource counter. See the NOTE in (a) above.
+ The tolerance value is per cpu, hence the total error at any time
+ can be nr_cpu_ids * tolerance.
+
2. Basic accounting routines
a. void res_counter_init(struct res_counter *rc,
- struct res_counter *rc_parent)
+ struct res_counter *rc_parent,
+ unsigned long tolerance)
Initializes the resource counter. As usual, should be the first
routine called for a new counter.
@@ -57,6 +68,9 @@ to work with it.
child -> parent relationship directly in the res_counter structure,
NULL can be used to define no relationship.
+ The tolerance is used to control the batching behaviour of percpu
+ counters. Please see details in Section 1, item f above.
+
c. int res_counter_charge(struct res_counter *rc, unsigned long val,
struct res_counter **limit_fail_at)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 731af71..3728c0d 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -14,6 +14,7 @@
*/
#include <linux/cgroup.h>
+#include <linux/percpu_counter.h>
/*
* The core object. the cgroup that wishes to account for some
@@ -23,10 +24,6 @@
struct res_counter {
/*
- * the current resource consumption level
- */
- unsigned long long usage;
- /*
* the maximal value of the usage from the counter creation
*/
unsigned long long max_usage;
@@ -48,6 +45,14 @@ struct res_counter {
*/
spinlock_t lock;
/*
+ * the current resource consumption level
+ */
+ struct percpu_counter usage;
+ /*
+ * Tolerance for the percpu_counter (usage) above
+ */
+ unsigned long usage_tolerance;
+ /*
* Parent counter, used for hierarchial resource accounting
*/
struct res_counter *parent;
@@ -98,7 +103,8 @@ enum {
* helpers for accounting
*/
-void res_counter_init(struct res_counter *counter, struct res_counter *parent);
+void res_counter_init(struct res_counter *counter, struct res_counter *parent,
+ unsigned long usage_tolerance);
/*
* charge - try to consume more resource.
@@ -133,7 +139,8 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
{
- if (cnt->usage < cnt->limit)
+ unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
+ if (usage < cnt->limit)
return true;
return false;
@@ -141,7 +148,8 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
{
- if (cnt->usage < cnt->soft_limit)
+ unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
+ if (usage < cnt->soft_limit)
return true;
return false;
@@ -157,15 +165,19 @@ static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
static inline unsigned long long
res_counter_soft_limit_excess(struct res_counter *cnt)
{
- unsigned long long excess;
- unsigned long flags;
+ unsigned long long excess, usage;
- spin_lock_irqsave(&cnt->lock, flags);
- if (cnt->usage <= cnt->soft_limit)
+ usage = percpu_counter_read_positive(&cnt->usage);
+ /*
+ * Not all callers call with irq's disabled, make
+ * sure we read out something sensible.
+ */
+ preempt_disable();
+ if (usage <= cnt->soft_limit)
excess = 0;
else
- excess = cnt->usage - cnt->soft_limit;
- spin_unlock_irqrestore(&cnt->lock, flags);
+ excess = usage - cnt->soft_limit;
+ preempt_enable();
return excess;
}
@@ -178,9 +190,9 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
bool ret;
unsigned long flags;
- spin_lock_irqsave(&cnt->lock, flags);
+ local_irq_save(flags);
ret = res_counter_limit_check_locked(cnt);
- spin_unlock_irqrestore(&cnt->lock, flags);
+ local_irq_restore(flags);
return ret;
}
@@ -189,18 +201,19 @@ static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
bool ret;
unsigned long flags;
- spin_lock_irqsave(&cnt->lock, flags);
+ local_irq_save(flags);
ret = res_counter_soft_limit_check_locked(cnt);
- spin_unlock_irqrestore(&cnt->lock, flags);
+ local_irq_restore(flags);
return ret;
}
static inline void res_counter_reset_max(struct res_counter *cnt)
{
unsigned long flags;
+ unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
spin_lock_irqsave(&cnt->lock, flags);
- cnt->max_usage = cnt->usage;
+ cnt->max_usage = usage;
spin_unlock_irqrestore(&cnt->lock, flags);
}
@@ -217,10 +230,11 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
unsigned long long limit)
{
unsigned long flags;
+ unsigned long long usage = percpu_counter_sum_positive(&cnt->usage);
int ret = -EBUSY;
spin_lock_irqsave(&cnt->lock, flags);
- if (cnt->usage <= limit) {
+ if (usage <= limit) {
cnt->limit = limit;
ret = 0;
}
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 88faec2..ae83168 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -15,24 +15,34 @@
#include <linux/uaccess.h>
#include <linux/mm.h>
-void res_counter_init(struct res_counter *counter, struct res_counter *parent)
+void res_counter_init(struct res_counter *counter, struct res_counter *parent,
+ unsigned long usage_tolerance)
{
spin_lock_init(&counter->lock);
+ percpu_counter_init(&counter->usage, 0);
counter->limit = RESOURCE_MAX;
counter->soft_limit = RESOURCE_MAX;
counter->parent = parent;
+ counter->usage_tolerance = usage_tolerance;
}
int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
{
- if (counter->usage + val > counter->limit) {
+ unsigned long long usage;
+
+ usage = percpu_counter_read_positive(&counter->usage);
+ if (usage + val > counter->limit) {
counter->failcnt++;
return -ENOMEM;
}
- counter->usage += val;
- if (counter->usage > counter->max_usage)
- counter->max_usage = counter->usage;
+ __percpu_counter_add(&counter->usage, val, nr_cpu_ids *
+ counter->usage_tolerance);
+ if (usage + val > counter->max_usage) {
+ spin_lock(&counter->lock);
+ counter->max_usage = (usage + val);
+ spin_unlock(&counter->lock);
+ }
return 0;
}
@@ -49,7 +59,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
*soft_limit_fail_at = NULL;
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
- spin_lock(&c->lock);
ret = res_counter_charge_locked(c, val);
/*
* With soft limits, we return the highest ancestor
@@ -58,7 +67,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
if (soft_limit_fail_at &&
!res_counter_soft_limit_check_locked(c))
*soft_limit_fail_at = c;
- spin_unlock(&c->lock);
if (ret < 0) {
*limit_fail_at = c;
goto undo;
@@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
goto done;
undo:
for (u = counter; u != c; u = u->parent) {
- spin_lock(&u->lock);
res_counter_uncharge_locked(u, val);
- spin_unlock(&u->lock);
}
done:
local_irq_restore(flags);
@@ -79,10 +85,13 @@ done:
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
{
- if (WARN_ON(counter->usage < val))
- val = counter->usage;
+ unsigned long long usage;
+
+ usage = percpu_counter_read_positive(&counter->usage);
+ if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
+ val = usage;
- counter->usage -= val;
+ percpu_counter_sub(&counter->usage, val);
}
void res_counter_uncharge(struct res_counter *counter, unsigned long val,
@@ -93,12 +102,10 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
- spin_lock(&c->lock);
if (was_soft_limit_excess)
*was_soft_limit_excess =
!res_counter_soft_limit_check_locked(c);
res_counter_uncharge_locked(c, val);
- spin_unlock(&c->lock);
}
local_irq_restore(flags);
}
@@ -108,8 +115,6 @@ static inline unsigned long long *
res_counter_member(struct res_counter *counter, int member)
{
switch (member) {
- case RES_USAGE:
- return &counter->usage;
case RES_MAX_USAGE:
return &counter->max_usage;
case RES_LIMIT:
@@ -128,11 +133,15 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
const char __user *userbuf, size_t nbytes, loff_t *pos,
int (*read_strategy)(unsigned long long val, char *st_buf))
{
- unsigned long long *val;
+ unsigned long long *val, usage_val;
char buf[64], *s;
s = buf;
- val = res_counter_member(counter, member);
+ if (member == RES_USAGE) {
+ usage_val = percpu_counter_read_positive(&counter->usage);
+ val = &usage_val;
+ } else
+ val = res_counter_member(counter, member);
if (read_strategy)
s += read_strategy(*val, s);
else
@@ -143,7 +152,10 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
u64 res_counter_read_u64(struct res_counter *counter, int member)
{
- return *res_counter_member(counter, member);
+ if (member == RES_USAGE)
+ return percpu_counter_read_positive(&counter->usage);
+ else
+ return *res_counter_member(counter, member);
}
int res_counter_memparse_write_strategy(const char *buf,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 48a38e1..36d46aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -58,6 +58,20 @@ static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
#define SOFTLIMIT_EVENTS_THRESH (1000)
/*
+ * To help resource counters scale, we take a step back
+ * and allow the counters to be scalable and set a
+ * batch value such that every addition does not cause
+ * global synchronization. The side-effect will be visible
+ * on limit enforcement, where due to this fuzziness,
+ * we will lose out on enforcing a limit when the usage
+ * exceeds the limit. The plan however in the long run
+ * is to allow this value to be controlled. We will
+ * probably add a new control file for it. This will be
+ * moved to a config option later.
+ */
+#define MEM_CGROUP_RES_ERR_TOLERANCE (64 * 1024)
+
+/*
* Statistics for memory cgroup.
*/
enum mem_cgroup_stat_index {
@@ -2340,7 +2354,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *mem, bool free_all)
if (free_all)
goto try_to_free;
move_account:
- while (mem->res.usage > 0) {
+ while (res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
ret = -EBUSY;
if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
goto out;
@@ -2383,7 +2397,7 @@ try_to_free:
lru_add_drain_all();
/* try to free all pages in this cgroup */
shrink = 1;
- while (nr_retries && mem->res.usage > 0) {
+ while (nr_retries && res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
int progress;
if (signal_pending(current)) {
@@ -2401,7 +2415,7 @@ try_to_free:
}
lru_add_drain();
/* try move_account...there may be some *locked* pages. */
- if (mem->res.usage)
+ if (res_counter_read_u64(&mem->res, RES_USAGE))
goto move_account;
ret = 0;
goto out;
@@ -3019,8 +3033,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
}
if (parent && parent->use_hierarchy) {
- res_counter_init(&mem->res, &parent->res);
- res_counter_init(&mem->memsw, &parent->memsw);
+ res_counter_init(&mem->res, &parent->res,
+ MEM_CGROUP_RES_ERR_TOLERANCE);
+ res_counter_init(&mem->memsw, &parent->memsw,
+ MEM_CGROUP_RES_ERR_TOLERANCE);
/*
* We increment refcnt of the parent to ensure that we can
* safely access it on res_counter_charge/uncharge.
@@ -3029,8 +3045,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
*/
mem_cgroup_get(parent);
} else {
- res_counter_init(&mem->res, NULL);
- res_counter_init(&mem->memsw, NULL);
+ res_counter_init(&mem->res, NULL,
+ MEM_CGROUP_RES_ERR_TOLERANCE);
+ res_counter_init(&mem->memsw, NULL,
+ MEM_CGROUP_RES_ERR_TOLERANCE);
}
mem->last_scanned_child = 0;
spin_lock_init(&mem->reclaim_param_lock);
--
Balbir
* Re: Help Resource Counters Scale Better (v3)
2009-08-09 12:15 ` Help Resource Counters Scale Better (v3) Balbir Singh
@ 2009-08-10 0:32 ` KAMEZAWA Hiroyuki
2009-08-10 0:43 ` KAMEZAWA Hiroyuki
2009-08-10 5:30 ` Balbir Singh
0 siblings, 2 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-10 0:32 UTC (permalink / raw)
To: balbir
Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
lizf, menage, Pavel Emelianov, linux-kernel, linux-mm
On Sun, 9 Aug 2009 17:45:30 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Hi,
>
> Thanks for the detailed review, here is v3 of the patches against
> mmotm 6th August. I've documented the TODOs as well. If there are
> no major objections, I would like this to be included in mmotm
> for more testing. Any test reports on a large machine would be highly
> appreciated.
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> Changelog v3->v2
>
> 1. Added more documentation and comments
> 2. Made the check in mem_cgroup_set_limit strict
> 3. Increased tolerance per cpu to 64KB.
> 4. Still have the WARN_ON(); I've kept it for debugging
> purposes, maybe we should make it conditional on
> DEBUG_VM
>
Because I'll be absent for a while, I won't give a Reviewed-by or Acked-by now.
Before leaving, I'd like to note some concerns here.
1. You use res_counter_read_positive() in force_empty. It seems force_empty can
go into an infinite loop. plz check. (Especially when some pages are freed or
swapped in on another cpu while force_empty runs.)
2. In the near future, we'll see 256 or 1024 cpus on a system anyway.
Assume a 1024-cpu system: 64k*1024 = 64M is the tolerance.
Can't we calculate the max tolerance as follows?
tolerance = min(64k * num_online_cpus(), limit_in_bytes/100);
tolerance /= num_online_cpus();
per_cpu_tolerance = min(16k, tolerance);
I think automatic runtime adjustment of the tolerance will eventually be necessary,
but the above will not be very bad because we can guarantee a 1% tolerance.
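In C, the above would look roughly like this (sketch only, untested;
the helper name is made up):

        static unsigned long memcg_per_cpu_tolerance(u64 limit_in_bytes)
        {
                /* cap the total error at 1% of the limit, or 64k per online cpu */
                u64 tol = min_t(u64, 64ULL * 1024 * num_online_cpus(),
                                limit_in_bytes / 100);

                do_div(tol, num_online_cpus());         /* back to a per-cpu value */
                return min_t(u64, 16 * 1024, tol);      /* never more than 16k per cpu */
        }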
Thx,
-Kame
* Re: Help Resource Counters Scale Better (v3)
2009-08-10 0:32 ` KAMEZAWA Hiroyuki
@ 2009-08-10 0:43 ` KAMEZAWA Hiroyuki
2009-08-10 5:22 ` Balbir Singh
2009-08-10 5:30 ` Balbir Singh
1 sibling, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-10 0:43 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: balbir, Andrew Morton, andi.kleen, Prarit Bhargava,
KOSAKI Motohiro, lizf, menage, Pavel Emelianov, linux-kernel,
linux-mm
On Mon, 10 Aug 2009 09:32:29 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 1. You use res_counter_read_positive() in force_empty. It seems force_empty can
> go into an infinite loop. plz check. (Especially when some pages are freed or
> swapped in on another cpu while force_empty runs.)
>
> 2. In the near future, we'll see 256 or 1024 cpus on a system anyway.
> Assume a 1024-cpu system: 64k*1024 = 64M is the tolerance.
> Can't we calculate the max tolerance as follows?
>
> tolerance = min(64k * num_online_cpus(), limit_in_bytes/100);
> tolerance /= num_online_cpus();
> per_cpu_tolerance = min(16k, tolerance);
>
> I think automatic runtime adjustment of the tolerance will eventually be necessary,
> but the above will not be very bad because we can guarantee a 1% tolerance.
>
Sorry, one more.
3. As I requested when you pushed the softlimit changes to mmotm, plz consider
implementing a check-and-notify gadget for res_counter.
See: http://marc.info/?l=linux-mm&m=124753058921677&w=2
Thanks,
-Kame
* Re: Help Resource Counters Scale Better (v3)
2009-08-10 0:43 ` KAMEZAWA Hiroyuki
@ 2009-08-10 5:22 ` Balbir Singh
0 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2009-08-10 5:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
lizf, menage, Pavel Emelianov, linux-kernel, linux-mm
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-10 09:43:44]:
> On Mon, 10 Aug 2009 09:32:29 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > 1. You use res_counter_read_positive() in force_empty. It seems force_empty can
> > go into an infinite loop. plz check. (Especially when some pages are freed or
> > swapped in on another cpu while force_empty runs.)
> >
> > 2. In the near future, we'll see 256 or 1024 cpus on a system anyway.
> > Assume a 1024-cpu system: 64k*1024 = 64M is the tolerance.
> > Can't we calculate the max tolerance as follows?
> >
> > tolerance = min(64k * num_online_cpus(), limit_in_bytes/100);
> > tolerance /= num_online_cpus();
> > per_cpu_tolerance = min(16k, tolerance);
> >
> > I think automatic runtime adjustment of the tolerance will eventually be necessary,
> > but the above will not be very bad because we can guarantee a 1% tolerance.
> >
>
> Sorry, one more.
>
> 3. As I requested when you pushed the softlimit changes to mmotm, plz consider
> implementing a check-and-notify gadget for res_counter.
> See: http://marc.info/?l=linux-mm&m=124753058921677&w=2
>
Yes, I will do that, but only after the scaling, since this is more
important at the moment.
--
Balbir
* Re: Help Resource Counters Scale Better (v3)
2009-08-10 0:32 ` KAMEZAWA Hiroyuki
2009-08-10 0:43 ` KAMEZAWA Hiroyuki
@ 2009-08-10 5:30 ` Balbir Singh
2009-08-10 5:45 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2009-08-10 5:30 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
lizf, menage, Pavel Emelianov, linux-kernel, linux-mm
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-10 09:32:29]:
> On Sun, 9 Aug 2009 17:45:30 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > Hi,
> >
> > Thanks for the detailed review, here is v3 of the patches against
> > mmotm 6th August. I've documented the TODOs as well. If there are
> > no major objections, I would like this to be included in mmotm
> > for more testing. Any test reports on a large machine would be highly
> > appreciated.
> >
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> >
> > Changelog v3->v2
> >
> > 1. Added more documentation and comments
> > 2. Made the check in mem_cgroup_set_limit strict
> > 3. Increased tolerance per cpu to 64KB.
> > 4. Still have the WARN_ON(); I've kept it for debugging
> > purposes, maybe we should make it conditional on
> > DEBUG_VM
> >
> Because I'll be absent for a while, I won't give a Reviewed-by or Acked-by now.
>
> Before leaving, I'd like to note some concerns here.
>
> 1. You use res_counter_read_positive() in force_empty. It seems force_empty can
> go into an infinite loop. plz check. (Especially when some pages are freed or
> swapped in on another cpu while force_empty runs.)
OK.. so you want me to use _sum_positive(), will do. In all my testing
using the stress scripts I have, I found no issues with force_empty so
far. But I'll change over.
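i.e. the termination checks in mem_cgroup_force_empty() would become
something like this (untested sketch):

        -       while (res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
        +       while (percpu_counter_sum_positive(&mem->res.usage) > 0) {

percpu_counter_sum_positive() folds in every cpu's pending delta, so the
loop is guaranteed to see the usage reach zero; it is heavier, but
force_empty is a slow path anyway.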
>
> 2. In the near future, we'll see 256 or 1024 cpus on a system anyway.
> Assume a 1024-cpu system: 64k*1024 = 64M is the tolerance.
> Can't we calculate the max tolerance as follows?
>
> tolerance = min(64k * num_online_cpus(), limit_in_bytes/100);
> tolerance /= num_online_cpus();
> per_cpu_tolerance = min(16k, tolerance);
>
> I think automatic runtime adjustment of the tolerance will eventually be necessary,
> but the above will not be very bad because we can guarantee a 1% tolerance.
>
I agree that automatic tuning will be necessary, but I want to go with the
CONFIG_MEM_CGROUP_RES_TOLERANCE approach you suggested earlier, since
num_online_cpus() with CPU hotplug can be a bit of a moving target, and
with power management and CPUs going idle we really don't want to
count those, etc. For now, a simple nr_cpu_ids * tolerance, and then
get feedback, since it is a heuristic. Also, limit_in_bytes can
change, so maybe some of this needs to go into the resize_limit and
set_limit paths. Right now, I want to keep it simple and see if
others can see the benefits of this patch, then add more
heuristics based on your suggestion.
Do you agree?
--
Balbir
* Re: Help Resource Counters Scale Better (v3)
2009-08-10 5:30 ` Balbir Singh
@ 2009-08-10 5:45 ` KAMEZAWA Hiroyuki
2009-08-10 6:22 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-10 5:45 UTC (permalink / raw)
To: balbir
Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
lizf, menage, Pavel Emelianov, linux-kernel, linux-mm
On Mon, 10 Aug 2009 11:00:25 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-10 09:32:29]:
>
> > On Sun, 9 Aug 2009 17:45:30 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > > Hi,
> > >
> > > Thanks for the detailed review, here is v3 of the patches against
> > > mmotm 6th August. I've documented the TODOs as well. If there are
> > > no major objections, I would like this to be included in mmotm
> > > for more testing. Any test reports on a large machine would be highly
> > > appreciated.
> > >
> > > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > >
> > > Changelog v3->v2
> > >
> > > 1. Added more documentation and comments
> > > 2. Made the check in mem_cgroup_set_limit strict
> > > 3. Increased tolerance per cpu to 64KB.
> > > 4. Still have the WARN_ON(); I've kept it for debugging
> > > purposes, maybe we should make it conditional on
> > > DEBUG_VM
> > >
> > Because I'll be absent for a while, I won't give a Reviewed-by or Acked-by now.
> >
> > Before leaving, I'd like to note some concerns here.
> >
> > 1. You use res_counter_read_positive() in force_empty. It seems force_empty can
> > go into an infinite loop. plz check. (Especially when some pages are freed or
> > swapped in on another cpu while force_empty runs.)
>
> OK.. so you want me to use _sum_positive(), will do. In all my testing
> using the stress scripts I have, I found no issues with force_empty so
> far. But I'll change over.
>
Thanks. Things around force_empty are very sensitive ;(
> >
> > 2. In the near future, we'll see 256 or 1024 cpus on a system anyway.
> > Assume a 1024-cpu system: 64k*1024 = 64M is the tolerance.
> > Can't we calculate the max tolerance as follows?
> >
> > tolerance = min(64k * num_online_cpus(), limit_in_bytes/100);
> > tolerance /= num_online_cpus();
> > per_cpu_tolerance = min(16k, tolerance);
> >
> > I think automatic runtime adjustment of the tolerance will eventually be necessary,
> > but the above will not be very bad because we can guarantee a 1% tolerance.
> >
>
> I agree that automatic tuning will be necessary, but I want to go with the
> CONFIG_MEM_CGROUP_RES_TOLERANCE approach you suggested earlier, since
> num_online_cpus() with CPU hotplug can be a bit of a moving target, and
> with power management and CPUs going idle we really don't want to
> count those, etc. For now, a simple nr_cpu_ids * tolerance, and then
> get feedback, since it is a heuristic. Also, limit_in_bytes can
> change, so maybe some of this needs to go into the resize_limit and
> set_limit paths. Right now, I want to keep it simple and see if
> others can see the benefits of this patch, then add more
> heuristics based on your suggestion.
>
> Do you agree?
Ok. Config is enough at this stage.
The last advice for the merge: it's better to show the numbers, or
ask someone who has many cpus to measure the benefits. Then Andrew can
see how beneficial this is.
(My box has 8 cpus. But maybe your IBM colleague has a bigger one.)
In my experience (from my own old trial),
- lock contention itself is low, not high.
- but cacheline misses and ping-pong are very, very frequent.
So this patch has some benefit logically, but in general
file I/O, swap-in/swap-out, page allocation/initialization etc. dominate
the performance of usual apps. You'll have to be careful selecting apps
to measure the benefits of this patch by application performance.
(And this is why I don't feel as much urgency as you do.)
Thanks,
-Kame
* Re: Help Resource Counters Scale Better (v3)
2009-08-10 5:45 ` KAMEZAWA Hiroyuki
@ 2009-08-10 6:22 ` KAMEZAWA Hiroyuki
2009-08-10 7:41 ` Balbir Singh
2009-08-10 8:36 ` Balbir Singh
0 siblings, 2 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-10 6:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: balbir, Andrew Morton, andi.kleen, Prarit Bhargava,
KOSAKI Motohiro, lizf, menage, Pavel Emelianov, linux-kernel,
linux-mm
On Mon, 10 Aug 2009 14:45:59 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Do you agree?
>
> Ok. Config is enough at this stage.
>
> The last advice for the merge: it's better to show the numbers, or
> ask someone who has many cpus to measure the benefits. Then Andrew can
> see how beneficial this is.
> (My box has 8 cpus. But maybe your IBM colleague has a bigger one.)
>
> In my experience (from my own old trial),
> - lock contention itself is low, not high.
> - but cacheline misses and ping-pong are very, very frequent.
>
> So this patch has some benefit logically, but in general
> file I/O, swap-in/swap-out, page allocation/initialization etc. dominate
> the performance of usual apps. You'll have to be careful selecting apps
> to measure the benefits of this patch by application performance.
> (And this is why I don't feel as much urgency as you do.)
>
The reason I say "I want to see the numbers" again and again is that
this is a performance improvement with a _bad side effect_.
If this is an urgent problem that needs the fast track, which requires us
to "fix small problems later", plz say so.
I have no objection to this approach itself, because I can't think of
anything better right now. The percpu counter's error tolerance is a generic
problem and we'll have to revisit it anyway.
Thanks,
-Kame
* Re: Help Resource Counters Scale Better (v3)
2009-08-10 6:22 ` KAMEZAWA Hiroyuki
@ 2009-08-10 7:41 ` Balbir Singh
2009-08-10 8:36 ` Balbir Singh
1 sibling, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2009-08-10 7:41 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
lizf, menage, Pavel Emelianov, linux-kernel, linux-mm
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-10 15:22:05]:
> On Mon, 10 Aug 2009 14:45:59 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > > Do you agree?
> >
> > Ok. Config is enough at this stage.
> >
> > The last advice for the merge: it's better to show the numbers, or
> > ask someone who has many cpus to measure the benefits. Then Andrew can
> > see how beneficial this is.
> > (My box has 8 cpus. But maybe your IBM colleague has a bigger one.)
> >
> > In my experience (from my own old trial),
> > - lock contention itself is low, not high.
> > - but cacheline misses and ping-pong are very, very frequent.
> >
> > So this patch has some benefit logically, but in general
> > file I/O, swap-in/swap-out, page allocation/initialization etc. dominate
> > the performance of usual apps. You'll have to be careful selecting apps
> > to measure the benefits of this patch by application performance.
> > (And this is why I don't feel as much urgency as you do.)
> >
>
> The reason I say "I want to see the numbers" again and again is that
> this is a performance improvement with a _bad side effect_.
> If this is an urgent problem that needs the fast track, which requires us
> to "fix small problems later", plz say so.
>
Yes, this is an urgent problem; I've gotten reports of the lock
showing up on 16- to 64-way systems.
> I have no objection to this approach itself, because I can't think of
> anything better right now. The percpu counter's error tolerance is a generic
> problem and we'll have to revisit it anyway.
>
Yes, my plan is to later add a strict/non-strict accounting layer
and allow users to choose, keeping root non-strict since we don't
support setting limits on root now.
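Roughly, the limit check would then look like this (very much a sketch;
the strict flag does not exist yet and the name is invented):

        static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
        {
                u64 usage;

                if (cnt->strict)        /* hypothetical per-counter flag, off for root */
                        usage = percpu_counter_sum_positive(&cnt->usage);
                else
                        usage = percpu_counter_read_positive(&cnt->usage);

                return usage < cnt->limit;
        }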
--
Balbir
* Re: Help Resource Counters Scale Better (v3)
2009-08-10 6:22 ` KAMEZAWA Hiroyuki
2009-08-10 7:41 ` Balbir Singh
@ 2009-08-10 8:36 ` Balbir Singh
1 sibling, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2009-08-10 8:36 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, andi.kleen, Prarit Bhargava, KOSAKI Motohiro,
lizf, menage, Pavel Emelianov, linux-kernel, linux-mm
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-10 15:22:05]:
> On Mon, 10 Aug 2009 14:45:59 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > > Do you agree?
> >
> > Ok. Config is enough at this stage.
> >
> > The last advice for the merge: it's better to show the numbers, or
> > ask someone who has many cpus to measure the benefits. Then Andrew can
> > see how beneficial this is.
> > (My box has 8 cpus. But maybe your IBM colleague has a bigger one.)
> >
> > In my experience (from my own old trial),
> > - lock contention itself is low, not high.
> > - but cacheline misses and ping-pong are very, very frequent.
> >
> > So this patch has some benefit logically, but in general
> > file I/O, swap-in/swap-out, page allocation/initialization etc. dominate
> > the performance of usual apps. You'll have to be careful selecting apps
> > to measure the benefits of this patch by application performance.
> > (And this is why I don't feel as much urgency as you do.)
> >
>
> The reason I say "I want to see the numbers" again and again is that
> this is a performance improvement with a _bad side effect_.
> If this is an urgent problem that needs the fast track, which requires us
> to "fix small problems later", plz say so.
>
OK... I finally got a bigger machine (24 CPUs). I ran a simple
program called parallel_pagefault, which does page faults in parallel
(it runs on every other CPU), allocates 10K pages, touches the
allocated data, unmaps and repeats the process. I ran the program
for 300 seconds. With the patch, I was able to fault in nearly twice
the number of pages as without the patch. I used the perf tool
from tools/perf in the kernel.
With patch
Performance counter stats for '/home/balbir/parallel_pagefault':
7188177.405648 task-clock-msecs # 23.926 CPUs
423130 context-switches # 0.000 M/sec
210 CPU-migrations # 0.000 M/sec
49851597 page-faults # 0.007 M/sec
5900210219604 cycles # 820.821 M/sec
424658049425 instructions # 0.072 IPC
7867744369 cache-references # 1.095 M/sec
2882370051 cache-misses # 0.401 M/sec
300.431591843 seconds time elapsed
Without patch
Performance counter stats for '/home/balbir/parallel_pagefault':
7192804.124144 task-clock-msecs # 23.937 CPUs
424691 context-switches # 0.000 M/sec
267 CPU-migrations # 0.000 M/sec
28498113 page-faults # 0.004 M/sec
5826093739340 cycles # 809.989 M/sec
408883496292 instructions # 0.070 IPC
7057079452 cache-references # 0.981 M/sec
3036086243 cache-misses # 0.422 M/sec
300.485365680 seconds time elapsed
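For reference, the core of parallel_pagefault is essentially the loop
below (simplified sketch; the real program also pins one copy to every
other cpu and the parent kills the workers after 300 seconds):

        #include <stdlib.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define NR_PAGES        10240   /* roughly the 10K pages mentioned above */

        int main(void)
        {
                long psize = sysconf(_SC_PAGESIZE);
                size_t len = NR_PAGES * psize;

                for (;;) {
                        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                        if (buf == MAP_FAILED)
                                exit(1);
                        for (size_t off = 0; off < len; off += psize)
                                buf[off] = 1;   /* touch each page -> page fault */
                        munmap(buf, len);       /* unmap and repeat */
                }
                return 0;
        }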
--
Balbir
Thread overview: 13+ messages
2009-08-07 22:12 Help Resource Counters Scale Better (v2) Balbir Singh
2009-08-08 1:11 ` KAMEZAWA Hiroyuki
2009-08-08 6:05 ` Balbir Singh
2009-08-08 7:38 ` KAMEZAWA Hiroyuki
2009-08-09 12:15 ` Help Resource Counters Scale Better (v3) Balbir Singh
2009-08-10 0:32 ` KAMEZAWA Hiroyuki
2009-08-10 0:43 ` KAMEZAWA Hiroyuki
2009-08-10 5:22 ` Balbir Singh
2009-08-10 5:30 ` Balbir Singh
2009-08-10 5:45 ` KAMEZAWA Hiroyuki
2009-08-10 6:22 ` KAMEZAWA Hiroyuki
2009-08-10 7:41 ` Balbir Singh
2009-08-10 8:36 ` Balbir Singh