* [PATCH][mmotm] Help Root Memory Cgroup Resource Counters Scale Better (v5)
@ 2009-08-13 6:55 Balbir Singh
2009-08-13 7:26 ` Daisuke Nishimura
0 siblings, 1 reply; 8+ messages in thread
From: Balbir Singh @ 2009-08-13 6:55 UTC (permalink / raw)
To: Andrew Morton
Cc: lizf, nishimura, KAMEZAWA Hiroyuki, menage, xemul, prarit,
andi.kleen, linux-kernel, linux-mm
Hi, Andrew,
Here is the new drop for the scalability patch; could you please
replace the one currently in mmotm with this one.
NOTE: The subject has changed (Daisuke felt the old subject was not
appropriate) and I've incorporated hierarchy-related changes and
tested them.
From: Balbir Singh <balbir@linux.vnet.ibm.com>
Enhancement: Remove the overhead of root based resource counter accounting
This patch reduces the resource counter overhead (mostly spinlock
contention) associated with the root cgroup. It is one of several
patches intended to reduce mem cgroup overhead. I had posted other
approaches earlier (including one using percpu counters); those
patches are a natural addition and will be added iteratively
on top of this one.
The patch stops resource counter accounting for the root cgroup.
The data for display is derived from the statistics we maintain
via mem_cgroup_charge_statistics() (which is more scalable).
What happens today is that we do double accounting: once using
res_counter_charge() and once using mem_cgroup_charge_statistics().
For the root, since we no longer implement limits, we don't
need to track every charge via res_counter_charge(), check
whether the limit has been exceeded, and reclaim.
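To illustrate the idea, here is a minimal user-space sketch (deliberately
simplified C, not the kernel code; the toy_memcg structure and the
toy_try_charge() helper are invented for this example, and the real change
is in the diff further down). The charge path skips the lock-protected
resource counter for the root group and relies only on the statistics:

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* Toy stand-ins for the kernel objects; illustrative only. */
struct toy_memcg {
	bool is_root;
	long res_usage;		/* what res_counter_charge() would maintain (bytes) */
	long stat_pages;	/* what the per-cpu statistics maintain (pages) */
};

/* Charge-path sketch: skip the lock-protected resource counter for the
 * root group and rely only on the cheaper statistics. */
static int toy_try_charge(struct toy_memcg *mem)
{
	if (!mem->is_root)
		mem->res_usage += PAGE_SIZE;	/* res_counter_charge() */
	mem->stat_pages += 1;			/* mem_cgroup_charge_statistics() */
	return 0;
}

int main(void)
{
	struct toy_memcg root = { .is_root = true, .res_usage = 0, .stat_pages = 0 };
	int i;

	for (i = 0; i < 3; i++)
		toy_try_charge(&root);
	printf("res_counter usage: %ld bytes, statistics: %ld pages\n",
	       root.res_usage, root.stat_pages);
	return 0;
}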
The main mem->res usage_in_bytes can be derived by summing
the cache and rss usage data from memory statistics
(MEM_CGROUP_STAT_RSS and MEM_CGROUP_STAT_CACHE). However, for
mem->memsw usage_in_bytes, we need additional data about
swapped-out memory. This patch adds a MEM_CGROUP_STAT_SWAPOUT
statistic and uses it along with MEM_CGROUP_STAT_RSS and
MEM_CGROUP_STAT_CACHE to derive the memsw data. This data is
computed recursively when hierarchy is enabled.
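To make the arithmetic concrete, here is a minimal user-space sketch (the
sample_stats structure and the numbers are made up for illustration; it
assumes 4K pages, and in the kernel the per-group values are gathered
recursively via mem_cgroup_walk_tree()):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assume 4K pages for the example */

/* Hypothetical per-hierarchy totals, in pages. */
struct sample_stats {
	uint64_t rss;
	uint64_t cache;
	uint64_t swapout;
};

int main(void)
{
	struct sample_stats root = { .rss = 1024, .cache = 2048, .swapout = 16 };

	/* memory.usage_in_bytes for the root: (RSS + CACHE) pages, in bytes */
	uint64_t mem_usage = (root.rss + root.cache) << PAGE_SHIFT;

	/* memory.memsw.usage_in_bytes additionally counts swapped-out pages */
	uint64_t memsw_usage =
		(root.rss + root.cache + root.swapout) << PAGE_SHIFT;

	printf("usage_in_bytes:       %llu\n", (unsigned long long)mem_usage);
	printf("memsw.usage_in_bytes: %llu\n", (unsigned long long)memsw_usage);
	return 0;
}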
The test results I see on a 24-way system show that
1. The lock contention disappears from /proc/lock_stat
2. The results of the test are comparable to running with
cgroup_disable=memory.
Here is a sample of my program runs
Without Patch
Performance counter stats for '/home/balbir/parallel_pagefault':
7192804.124144 task-clock-msecs # 23.937 CPUs
424691 context-switches # 0.000 M/sec
267 CPU-migrations # 0.000 M/sec
28498113 page-faults # 0.004 M/sec
5826093739340 cycles # 809.989 M/sec
408883496292 instructions # 0.070 IPC
7057079452 cache-references # 0.981 M/sec
3036086243 cache-misses # 0.422 M/sec
300.485365680 seconds time elapsed
With cgroup_disable=memory
Performance counter stats for '/home/balbir/parallel_pagefault':
7182183.546587 task-clock-msecs # 23.915 CPUs
425458 context-switches # 0.000 M/sec
203 CPU-migrations # 0.000 M/sec
92545093 page-faults # 0.013 M/sec
6034363609986 cycles # 840.185 M/sec
437204346785 instructions # 0.072 IPC
6636073192 cache-references # 0.924 M/sec
2358117732 cache-misses # 0.328 M/sec
300.320905827 seconds time elapsed
With this patch applied
Performance counter stats for '/home/balbir/parallel_pagefault':
7191619.223977 task-clock-msecs # 23.955 CPUs
422579 context-switches # 0.000 M/sec
88 CPU-migrations # 0.000 M/sec
91946060 page-faults # 0.013 M/sec
5957054385619 cycles # 828.333 M/sec
1058117350365 instructions # 0.178 IPC
9161776218 cache-references # 1.274 M/sec
1920494280 cache-misses # 0.267 M/sec
300.218764862 seconds time elapsed
Data from Prarit (kernel compile with make -j64 on a 64
CPU/32G machine)
For a single run
Without patch
real 27m8.988s
user 87m24.916s
sys 382m6.037s
With patch
real 4m18.607s
user 84m58.943s
sys 50m52.682s
With config turned off
real 4m54.972s
user 90m13.456s
sys 50m19.711s
NOTE: The data looks counterintuitive because performance with the
patch is better even than with the config turned off. We probably need
more runs, but so far all testing has shown that the patches definitely
help.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
mm/memcontrol.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++---------
1 files changed, 99 insertions(+), 20 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 48a38e1..c8fd6f4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -70,6 +70,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
+ MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
MEM_CGROUP_STAT_NSTATS,
};
@@ -478,6 +479,19 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
return mz;
}
+static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
+ bool charge)
+{
+ int val = (charge) ? 1 : -1;
+ struct mem_cgroup_stat *stat = &mem->stat;
+ struct mem_cgroup_stat_cpu *cpustat;
+ int cpu = get_cpu();
+
+ cpustat = &stat->cpustat[cpu];
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SWAPOUT, val);
+ put_cpu();
+}
+
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
struct page_cgroup *pc,
bool charge)
@@ -1281,9 +1295,11 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
VM_BUG_ON(css_is_removed(&mem->css));
while (1) {
- int ret;
+ int ret = 0;
unsigned long flags = 0;
+ if (mem_cgroup_is_root(mem))
+ goto done;
ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
&soft_fail_res);
if (likely(!ret)) {
@@ -1343,6 +1359,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
if (mem_cgroup_soft_limit_check(mem_over_soft_limit))
mem_cgroup_update_tree(mem_over_soft_limit, page);
}
+done:
return 0;
nomem:
css_put(&mem->css);
@@ -1415,9 +1432,12 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
lock_page_cgroup(pc);
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
- res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
- if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+ if (!mem_cgroup_is_root(mem)) {
+ res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+ if (do_swap_account)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE,
+ NULL);
+ }
css_put(&mem->css);
return;
}
@@ -1494,7 +1514,8 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
if (pc->mem_cgroup != from)
goto out;
- res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
+ if (!mem_cgroup_is_root(from))
+ res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
mem_cgroup_charge_statistics(from, pc, false);
page = pc->page;
@@ -1513,7 +1534,7 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
1);
}
- if (do_swap_account)
+ if (do_swap_account && !mem_cgroup_is_root(from))
res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
css_put(&from->css);
@@ -1584,9 +1605,11 @@ uncharge:
/* drop extra refcnt by try_charge() */
css_put(&parent->css);
/* uncharge if move fails */
- res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
- if (do_swap_account)
- res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
+ if (!mem_cgroup_is_root(parent)) {
+ res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
+ if (do_swap_account)
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
+ }
return ret;
}
@@ -1775,7 +1798,10 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
* This recorded memcg can be obsolete one. So, avoid
* calling css_tryget
*/
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
+ if (!mem_cgroup_is_root(memcg))
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE,
+ NULL);
+ mem_cgroup_swap_statistics(memcg, false);
mem_cgroup_put(memcg);
}
rcu_read_unlock();
@@ -1800,9 +1826,11 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
return;
if (!mem)
return;
- res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
- if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+ if (!mem_cgroup_is_root(mem)) {
+ res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+ if (do_swap_account)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+ }
css_put(&mem->css);
}
@@ -1855,9 +1883,14 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
break;
}
- res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
- if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+ if (!mem_cgroup_is_root(mem)) {
+ res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
+ if (do_swap_account &&
+ (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+ }
+ if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT && mem_cgroup_is_root(mem))
+ mem_cgroup_swap_statistics(mem, true);
mem_cgroup_charge_statistics(mem, pc, false);
ClearPageCgroupUsed(pc);
@@ -1948,7 +1981,9 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
* We uncharge this because swap is freed.
* This memcg can be obsolete one. We avoid calling css_tryget
*/
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
+ if (!mem_cgroup_is_root(memcg))
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
+ mem_cgroup_swap_statistics(memcg, false);
mem_cgroup_put(memcg);
}
rcu_read_unlock();
@@ -2451,20 +2486,64 @@ static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
return retval;
}
+struct mem_cgroup_idx_data {
+ s64 val;
+ enum mem_cgroup_stat_index idx;
+};
+
+static int
+mem_cgroup_get_idx_stat(struct mem_cgroup *mem, void *data)
+{
+ struct mem_cgroup_idx_data *d = data;
+ d->val += mem_cgroup_read_stat(&mem->stat, d->idx);
+ return 0;
+}
+
+static void
+mem_cgroup_get_recursive_idx_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_stat_index idx, s64 *val)
+{
+ struct mem_cgroup_idx_data d;
+ d.idx = idx;
+ d.val = 0;
+ mem_cgroup_walk_tree(mem, &d, mem_cgroup_get_idx_stat);
+ *val = d.val;
+}
+
static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
- u64 val = 0;
+ u64 idx_val, val;
int type, name;
type = MEMFILE_TYPE(cft->private);
name = MEMFILE_ATTR(cft->private);
switch (type) {
case _MEM:
- val = res_counter_read_u64(&mem->res, name);
+ if (name == RES_USAGE && mem_cgroup_is_root(mem)) {
+ mem_cgroup_get_recursive_idx_stat(mem,
+ MEM_CGROUP_STAT_CACHE, &idx_val);
+ val = idx_val;
+ mem_cgroup_get_recursive_idx_stat(mem,
+ MEM_CGROUP_STAT_RSS, &idx_val);
+ val += idx_val;
+ val <<= PAGE_SHIFT;
+ } else
+ val = res_counter_read_u64(&mem->res, name);
break;
case _MEMSWAP:
- val = res_counter_read_u64(&mem->memsw, name);
+ if (name == RES_USAGE && mem_cgroup_is_root(mem)) {
+ mem_cgroup_get_recursive_idx_stat(mem,
+ MEM_CGROUP_STAT_CACHE, &idx_val);
+ val = idx_val;
+ mem_cgroup_get_recursive_idx_stat(mem,
+ MEM_CGROUP_STAT_RSS, &idx_val);
+ val += idx_val;
+ mem_cgroup_get_recursive_idx_stat(mem,
+ MEM_CGROUP_STAT_SWAPOUT, &idx_val);
+ val <<= PAGE_SHIFT;
+ } else
+ val = res_counter_read_u64(&mem->memsw, name);
break;
default:
BUG();
--
Balbir
* Re: [PATCH][mmotm] Help Root Memory Cgroup Resource Counters Scale Better (v5)
From: Daisuke Nishimura @ 2009-08-13 7:26 UTC (permalink / raw)
To: balbir
Cc: Andrew Morton, lizf, KAMEZAWA Hiroyuki, menage, xemul, prarit,
andi.kleen, nishimura, linux-kernel, linux-mm

> @@ -1855,9 +1883,14 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> break;
> }
>
> - res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
> - if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> - res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
> + if (!mem_cgroup_is_root(mem)) {
> + res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
> + if (do_swap_account &&
> + (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
> + }
> + if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT && mem_cgroup_is_root(mem))
> + mem_cgroup_swap_statistics(mem, true);
Hmm, if we don't count MEM_CGROUP_STAT_SWAPOUT properly for groups other than
the root, memsw.usage_in_bytes of the root would be incorrect in the
use_hierarchy==1 case, right?
> mem_cgroup_charge_statistics(mem, pc, false);
>
> ClearPageCgroupUsed(pc);
Thanks,
Daisuke Nishimura.
* [UPDATED][PATCH][mmotm] Help Root Memory Cgroup Resource Counters Scale Better (v5)
From: Balbir Singh @ 2009-08-13 8:02 UTC (permalink / raw)
To: Daisuke Nishimura
Cc: Andrew Morton, lizf, KAMEZAWA Hiroyuki, menage, xemul, prarit,
andi.kleen, linux-kernel, linux-mm

* nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-08-13 16:26:40]:
> > + if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT && mem_cgroup_is_root(mem))
> > + mem_cgroup_swap_statistics(mem, true);
> Hmm, if we don't count MEM_CGROUP_STAT_SWAPOUT properly for groups other than
> the root, memsw.usage_in_bytes of the root would be incorrect in the
> use_hierarchy==1 case, right?

OK, does this look better

@@ -1855,9 +1883,14 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
break;
}
- res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
- if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+ if (!mem_cgroup_is_root(mem)) {
+ res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
+ if (do_swap_account &&
+ (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+ }
+ if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+ mem_cgroup_swap_statistics(mem, true);
mem_cgroup_charge_statistics(mem, pc, false);
ClearPageCgroupUsed(pc);

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
--
Balbir
* Re: [UPDATED][PATCH][mmotm] Help Root Memory Cgroup Resource Counters Scale Better (v5)
From: Ingo Molnar @ 2009-08-13 8:35 UTC (permalink / raw)
To: Balbir Singh
Cc: Daisuke Nishimura, Andrew Morton, lizf, KAMEZAWA Hiroyuki, menage,
xemul, prarit, andi.kleen, linux-kernel, linux-mm

* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Without Patch
>
> Performance counter stats for '/home/balbir/parallel_pagefault':
>
> 5826093739340 cycles # 809.989 M/sec
> 408883496292 instructions # 0.070 IPC
> 7057079452 cache-references # 0.981 M/sec
> 3036086243 cache-misses # 0.422 M/sec
>
> With this patch applied
>
> Performance counter stats for '/home/balbir/parallel_pagefault':
>
> 5957054385619 cycles # 828.333 M/sec
> 1058117350365 instructions # 0.178 IPC
> 9161776218 cache-references # 1.274 M/sec
> 1920494280 cache-misses # 0.267 M/sec
Nice how the instruction count and the IPC value increased, and the
cache-miss count decreased.
Btw., a 'perf stat' suggestion: you can also make use of built-in
error bars via repeating parallel_pagefault N times:
aldebaran:~> perf stat --repeat 3 /bin/ls
Performance counter stats for '/bin/ls' (3 runs):
1.108886 task-clock-msecs # 0.875 CPUs ( +- 4.316% )
0 context-switches # 0.000 M/sec ( +- 0.000% )
0 CPU-migrations # 0.000 M/sec ( +- 0.000% )
254 page-faults # 0.229 M/sec ( +- 0.000% )
3461896 cycles # 3121.958 M/sec ( +- 3.508% )
3044445 instructions # 0.879 IPC ( +- 0.134% )
21213 cache-references # 19.130 M/sec ( +- 1.612% )
2610 cache-misses # 2.354 M/sec ( +- 39.640% )
0.001267355 seconds time elapsed ( +- 4.762% )
that way even small changes in metrics can be identified as positive
effects of a patch, if the improvement is beyond the error
percentage that perf reports.
For example in the /bin/ls numbers i cited above, the 'instructions'
value can be trusted up to 99.8% (with a ~0.13% noise), while say
the cache-misses value can not really be trusted, as it has 40% of
noise. (Increasing the repeat count will drive down the noise level
- at the cost of longer measurement time.)
For your patch the improvement is so drastic that this isn't needed -
but the error estimations can be quite useful for more borderline
improvements. (and they are also useful in finding and proving small
performance regressions)
Ingo
* Re: [UPDATED][PATCH][mmotm] Help Root Memory Cgroup Resource Counters Scale Better (v5)
From: Balbir Singh @ 2009-08-13 8:43 UTC (permalink / raw)
To: Ingo Molnar
Cc: Daisuke Nishimura, Andrew Morton, lizf, KAMEZAWA Hiroyuki, menage,
xemul, prarit, andi.kleen, linux-kernel, linux-mm

* Ingo Molnar <mingo@elte.hu> [2009-08-13 10:35:24]:
> Btw., a 'perf stat' suggestion: you can also make use of built-in
> error bars via repeating parallel_pagefault N times:
Thanks for the tip, let me try and use the repeats feature. BTW, nice
work on the perf counters, I was pleasantly surprised to see a
wonderful tool in the kernel with a good set of options and detailed
analysis capabilities.
--
Balbir
* Re: [UPDATED][PATCH][mmotm] Help Root Memory Cgroup Resource Counters Scale Better (v5)
From: Balbir Singh @ 2009-08-14 2:01 UTC (permalink / raw)
To: Ingo Molnar
Cc: Daisuke Nishimura, Andrew Morton, lizf, KAMEZAWA Hiroyuki, menage,
xemul, prarit, andi.kleen, linux-kernel, linux-mm

* Ingo Molnar <mingo@elte.hu> [2009-08-13 10:35:24]:
> Btw., a 'perf stat' suggestion: you can also make use of built-in
> error bars via repeating parallel_pagefault N times:
Ingo, with the repeat experiments I see
7192354.545647 task-clock-msecs # 23.955 CPUs ( +- 0.002% )
425627 context-switches # 0.000 M/sec ( +- 0.333% )
155 CPU-migrations # 0.000 M/sec ( +- 10.897% )
95336481 page-faults # 0.013 M/sec ( +- 0.085% )
5951929070187 cycles # 827.536 M/sec ( +- 0.009% )
1058312583796 instructions # 0.178 IPC ( +- 0.076% )
9616609083 cache-references # 1.337 M/sec ( +- 2.536% )
1952367514 cache-misses # 0.271 M/sec ( +- 0.156% )
300.246532761 seconds time elapsed ( +- 0.002% )
Except for the CPU migrations and the cache references, all the
other parameters seem to be well within an acceptable error range.
--
Balbir
* Re: [UPDATED][PATCH][mmotm] Help Root Memory Cgroup Resource Counters Scale Better (v5)
From: Ingo Molnar @ 2009-08-15 14:26 UTC (permalink / raw)
To: Balbir Singh
Cc: Daisuke Nishimura, Andrew Morton, lizf, KAMEZAWA Hiroyuki, menage,
xemul, prarit, andi.kleen, linux-kernel, linux-mm

* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Except for the CPU migrations and the cache references, all the
> other parameters seem to be well within an acceptable error range.
Yeah, nice!
Ingo
* Re: [UPDATED][PATCH][mmotm] Help Root Memory Cgroup Resource Counters Scale Better (v5)
From: Daisuke Nishimura @ 2009-08-13 11:12 UTC (permalink / raw)
To: balbir
Cc: Andrew Morton, lizf, KAMEZAWA Hiroyuki, menage, xemul, prarit,
andi.kleen, d-nishimura, Daisuke Nishimura, linux-kernel, linux-mm

On Thu, 13 Aug 2009 13:32:06 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > Hmm, if we don't count MEM_CGROUP_STAT_SWAPOUT properly for groups other
> > than the root, memsw.usage_in_bytes of the root would be incorrect in the
> > use_hierarchy==1 case, right?
>
> OK, does this look better
Looks good to me.
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Thanks,
Daisuke Nishimura.