linux-mm.kvack.org archive mirror
* [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat
@ 2009-09-30 10:04 KAMEZAWA Hiroyuki
  2009-09-30 10:09 ` [RFC][PATCH 1/2] percpu " KAMEZAWA Hiroyuki
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-30 10:04 UTC (permalink / raw)
  To: linux-mm; +Cc: balbir, nishimura, linux-kernel

Hi,

In the current implementation, memcg uses its own percpu counters for counting
events and the number of RSS and CACHE pages. The counter is maintained per cpu
without any of the synchronization that vm_stat[] or percpu_counter have, so it
is an update-is-fast-but-read-is-slow counter.

Because "read" for these counter was only done by memory.stat file, I thought
read-side-slowness was acceptable. Amount of memory usage, which affects
memory limit check, can be read by memory.usage_in_bytes. It's maintained
by res_counter.

But in the current -rc, the root memcg's memory usage is calculated from this
per cpu counter, and the read-side slowness may be a problem if it is read
frequently.

And, from recent discussion, I wonder whether we should maintain NR_DIRTY etc.
in memcg. If so, a slow-read counter will not match our requirements, I guess.
I want a counter like vm_stat[] in memcg.

These 2 patches introduce a counter like vm_stat[] for memcg.
This is just an idea-level implementation, but I think it is not so bad.

I confirmed this patch works well. I'm now thinking about how to test performance...

Any comments are welcome.
This patch is on top of mmotm + some of my own patches... so this is just an RFC.

Regards,
-Kame



* [RFC][PATCH 1/2] percpu array counter like vmstat
  2009-09-30 10:04 [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat KAMEZAWA Hiroyuki
@ 2009-09-30 10:09 ` KAMEZAWA Hiroyuki
  2009-09-30 23:31   ` KAMEZAWA Hiroyuki
  2009-09-30 10:12 ` [RFC][PATCH 2/2] memcg: use generic percpu array_counter KAMEZAWA Hiroyuki
  2009-10-01  0:45 ` [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat Daisuke Nishimura
  2 siblings, 1 reply; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-30 10:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, linux-kernel

This patch implements a percpu array counter. Unlike percpu_counter, its
design is based on vm_stat[], the array counter for zone statistics.
It is an array of percpu counters.

The user can define a counter array as
	struct foobar {
		.....
		DEFINE_ARRAY_COUNTER(name, NR_ELEMENTS);
		....
	}
and there will be an array of counters with NR_ELEMENTS entries.

A macro GET_ARC() is also provided. Users can read this counter with

	array_counter_read(GET_ARC(&foobar.name), NAME_OF_ELEMENT)

and can add a value with

	array_counter_add(GET_ARC(&foobar.name), NAME_OF_ELEMENT, val)
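
To make the interface concrete, here is a minimal usage sketch (struct foobar,
the FOOBAR_STAT_* indices and the helper functions are made-up names for
illustration, not part of this patch):

	#include <linux/init.h>
	#include <linux/percpu_counter.h>

	enum foobar_stat {
		FOOBAR_STAT_FOO,
		FOOBAR_STAT_BAR,
		NR_FOOBAR_STATS,
	};

	struct foobar {
		/* ... other fields ... */
		DEFINE_ARRAY_COUNTER(stat, NR_FOOBAR_STATS);
	};

	static struct foobar fb;

	static int __init foobar_setup(void)
	{
		/* allocates the percpu part; returns -ENOMEM on failure */
		return array_counter_init(GET_ARC(&fb.stat), NR_FOOBAR_STATS);
	}

	static void foobar_account_foo(int val)
	{
		/* batched per cpu; folded into a global atomic when the batch is exceeded */
		array_counter_add(GET_ARC(&fb.stat), FOOBAR_STAT_FOO, val);
	}

	static long foobar_read_foo(void)
	{
		/* cheap read of the global value; no summing over all cpus */
		return array_counter_read(GET_ARC(&fb.stat), FOOBAR_STAT_FOO);
	}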

My first purpose in writing this is to replace memcg's private percpu
counter array with it. (Then, memcg can use a generic percpu object.)

I placed this counter in the same files as lib/percpu_counter.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/percpu_counter.h |  108 +++++++++++++++++++++++++++++++++++++++++
 lib/percpu_counter.c           |   83 +++++++++++++++++++++++++++++++
 2 files changed, 191 insertions(+)

Index: mmotm-2.6.31-Sep28/include/linux/percpu_counter.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/percpu_counter.h
+++ mmotm-2.6.31-Sep28/include/linux/percpu_counter.h
@@ -77,6 +77,67 @@ static inline s64 percpu_counter_read_po
 	return 1;
 }
 
+/*
+ * A counter for implementing a counter array like vmstat[]. Usage and behavior
+ * are similar to percpu_counter, but this is better suited to handling multiple
+ * statistics. The idea comes from the vm_stat[] array implementation.
+ * atomic_long_t implies this is a 32bit counter on 32bit architectures, and
+ * this is intentional. If you want 64bit, please use percpu_counter. This does
+ * not provide a function like percpu_counter_sum().
+ *
+ * To avoid unnecessary accesses, it's recommended to use this via the macros
+ * below. You can embed this counter in a struct as follows.
+ *
+ *  struct xxxx {
+ *    ......
+ *   DEFINE_ARRAY_COUNTER(counter, NR_MY_ELEMENTS);
+ *    .....
+ *  };
+ * Then, your counter array is defined within the above struct xxxx.
+ * In many cases, the address of counter[idx] can be calculated by the compiler easily.
+ *
+ * To access this, the GET_ARC() macro is provided. It can be used in the
+ * following style.
+ *    array_counter_add(GET_ARC(&object->counter), idx, num);
+ *    array_counter_read(GET_ARC(&object->counter), idx)
+ */
+
+struct pad_array_counter { /* Don't use this struct directly */
+	s8 batch;
+	s8 *counters;
+#ifdef CONFIG_HOTPLUG_CPU
+	struct list_head list;
+#endif
+	int elements;
+} ____cacheline_aligned_in_smp;
+
+struct array_counter {
+	struct pad_array_counter v;
+	atomic_long_t	counters[0];
+};
+/* For static size definitions */
+
+#define DEFINE_ARRAY_COUNTER(name, elements) \
+	struct {\
+		struct array_counter ac;\
+		long __counters[(elements)];} name;
+
+#define GET_ARC(x)	(&(x)->ac)
+
+#define INIT_ARC(x,s) do {		\
+	memset((x), 0, sizeof(*(x)));	\
+	array_counter_init(&(x)->ac, (s));\
+} while (0)
+
+extern int array_counter_init(struct array_counter *ac, int size);
+extern void array_counter_destroy(struct array_counter *ac);
+extern void array_counter_add(struct array_counter *ac, int idx, int val);
+
+static inline long array_counter_read(struct array_counter *ac,int idx)
+{
+	return atomic_long_read(&ac->counters[idx]);
+}
+
 #else
 
 struct percpu_counter {
@@ -129,6 +190,44 @@ static inline s64 percpu_counter_sum(str
 	return percpu_counter_read(fbc);
 }
 
+struct array_counter {
+	int elements;
+	long counters[0];
+};
+/* For static size definitions (please see CONFIG_SMP case) */
+#define DEFINE_ARRAY_COUNTER(name, elements) \
+	struct {\
+		struct array_counter ac;\
+		long __counters[elements];\
+	}name;
+#define GET_ARC(x)	(&(x)->ac)
+#define INIT_ARC(x,s) do {		\
+	memset((x), 0, sizeof(*(x)));	\
+	array_counter_init(&(x)->ac, (s));\
+} while (0)
+
+static inline void
+array_counter_init(struct array_counter *ac, int size)
+{
+	ac->elements = size;
+}
+static inline void array_counter_destroy(struct array_counter *ac)
+{
+	/* nothing to do */
+}
+
+static inline
+void array_counter_add(struct array_counter *ac, int idx, int val)
+{
+	ac->counters[idx] += val;
+}
+
+static inline
+long array_counter_read(struct array_counter *ac, int idx)
+{
+	return ac->counters[idx];
+}
+
 #endif	/* CONFIG_SMP */
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
@@ -146,4 +245,13 @@ static inline void percpu_counter_sub(st
 	percpu_counter_add(fbc, -amount);
 }
 
+static inline void array_counter_inc(struct array_counter *ac, int idx)
+{
+	array_counter_add(ac, idx, 1);
+}
+
+static inline void array_counter_dec(struct array_counter *ac, int idx)
+{
+	array_counter_add(ac, idx, -1);
+}
 #endif /* _LINUX_PERCPU_COUNTER_H */
Index: mmotm-2.6.31-Sep28/lib/percpu_counter.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/lib/percpu_counter.c
+++ mmotm-2.6.31-Sep28/lib/percpu_counter.c
@@ -144,3 +144,86 @@ static int __init percpu_counter_startup
 	return 0;
 }
 module_init(percpu_counter_startup);
+
+
+static LIST_HEAD(array_counters);
+static DEFINE_MUTEX(array_counters_lock);
+
+int array_counter_init(struct array_counter *ac, int size)
+{
+	ac->v.elements = size;
+	ac->v.counters = alloc_percpu(s8);
+	if (!ac->v.counters)
+		return -ENOMEM;
+	ac->v.batch = percpu_counter_batch;
+
+#ifdef CONFIG_HOTPLUG_CPU
+	mutex_lock(&array_counters_lock);
+	list_add(&ac->v.list, &array_counters);
+	mutex_unlock(&array_counters_lock);
+#endif
+	return 0;
+}
+
+void array_counter_destroy(struct array_counter *ac)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+	mutex_lock(&array_counters_lock);
+	list_del(&ac->v.list);
+	mutex_unlock(&array_counters_lock);
+#endif
+	free_percpu(ac->v.counters);
+	ac->v.counters = NULL;
+}
+
+void array_counter_add(struct array_counter *ac, int idx, int val)
+{
+	s8 *pcount;
+	long count;
+	int cpu = get_cpu();
+
+	pcount = per_cpu_ptr(ac->v.counters, cpu);
+	count = pcount[idx] + val;
+	if ((count >= ac->v.batch) || (-count >= ac->v.batch)) {
+		atomic_long_add(count, &ac->counters[idx]);
+		pcount[idx] = 0;
+	} else
+		pcount[idx] = (s8)count;
+	put_cpu();
+}
+
+
+static int __cpuinit array_counter_hotcpu_callback(struct notifier_block *nb,
+					unsigned long action, void *hcpu)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+	unsigned int cpu;
+	struct pad_array_counter *pac;
+	int idx;
+	if (action != CPU_DEAD)
+		return NOTIFY_OK;
+
+	cpu = (unsigned long)hcpu;
+	mutex_lock(&array_counters_lock);
+	list_for_each_entry(pac, &array_counters, list) {
+		s8 *pcount;
+		struct array_counter *ac;
+
+		ac = container_of(pac, struct array_counter, v);
+		pcount = per_cpu_ptr(pac->counters, cpu);
+		for (idx = 0; idx < pac->elements; idx++){
+			atomic_long_add(pcount[idx], &ac->counters[idx]);
+			pcount[idx] = 0;
+		}
+	}
+	mutex_unlock(&array_counters_lock);
+#endif
+	return NOTIFY_OK;
+}
+
+static int __init array_counter_startup(void)
+{
+	hotcpu_notifier(array_counter_hotcpu_callback, 0);
+	return 0;
+}
+module_init(array_counter_startup);



* [RFC][PATCH 2/2] memcg: use generic percpu array_counter
  2009-09-30 10:04 [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat KAMEZAWA Hiroyuki
  2009-09-30 10:09 ` [RFC][PATCH 1/2] percpu " KAMEZAWA Hiroyuki
@ 2009-09-30 10:12 ` KAMEZAWA Hiroyuki
  2009-10-01  0:45 ` [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat Daisuke Nishimura
  2 siblings, 0 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-30 10:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, linux-kernel

Replace memcg's private percpu counter with array_counter().
I also made the counter array index names shorter for easier reading.
With this patch, one of memcg's dirty parts goes away.

Side effect:
This makes the event counter a global counter, not a per-cpu counter.
Maybe a good side effect for softlimit updates.


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |  216 +++++++++++++++++++-------------------------------------
 1 file changed, 75 insertions(+), 141 deletions(-)

Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep28/mm/memcontrol.c
@@ -39,6 +39,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_cgroup.h>
 #include <linux/cpu.h>
+#include <linux/percpu_counter.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -56,7 +57,6 @@ static int really_do_swap_account __init
 #endif
 
 static DEFINE_MUTEX(memcg_tasklist);	/* can be hold under cgroup_mutex */
-#define SOFTLIMIT_EVENTS_THRESH (1000)
 
 /*
  * Statistics for memory cgroup.
@@ -65,67 +65,17 @@ enum mem_cgroup_stat_index {
 	/*
 	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
 	 */
-	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_MAPPED_FILE,  /* # of pages charged as file rss */
-	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
-	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
-	MEM_CGROUP_STAT_EVENTS,	/* sum of pagein + pageout for internal use */
-	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEMCG_STAT_CACHE, 	   /* # of pages charged as cache */
+	MEMCG_STAT_RSS,	   /* # of pages charged as anon rss */
+	MEMCG_STAT_MAPPED_FILE,  /* # of pages charged as file rss */
+	MEMCG_STAT_PGPGIN,	/* # of pages paged in */
+	MEMCG_STAT_PGPGOUT,	/* # of pages paged out */
+	MEMCG_STAT_EVENTS,	/* sum of pagein + pageout for internal use */
+	MEMCG_STAT_SWAPOUT, /* # of pages, swapped out */
 
-	MEM_CGROUP_STAT_NSTATS,
+	MEMCG_STAT_NSTATS,
 };
 
-struct mem_cgroup_stat_cpu {
-	s64 count[MEM_CGROUP_STAT_NSTATS];
-} ____cacheline_aligned_in_smp;
-
-struct mem_cgroup_stat {
-	struct mem_cgroup_stat_cpu cpustat[0];
-};
-
-static inline void
-__mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
-				enum mem_cgroup_stat_index idx)
-{
-	stat->count[idx] = 0;
-}
-
-static inline s64
-__mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
-				enum mem_cgroup_stat_index idx)
-{
-	return stat->count[idx];
-}
-
-/*
- * For accounting under irq disable, no need for increment preempt count.
- */
-static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
-		enum mem_cgroup_stat_index idx, int val)
-{
-	stat->count[idx] += val;
-}
-
-static s64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
-		enum mem_cgroup_stat_index idx)
-{
-	int cpu;
-	s64 ret = 0;
-	for_each_possible_cpu(cpu)
-		ret += stat->cpustat[cpu].count[idx];
-	return ret;
-}
-
-static s64 mem_cgroup_local_usage(struct mem_cgroup_stat *stat)
-{
-	s64 ret;
-
-	ret = mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_CACHE);
-	ret += mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_RSS);
-	return ret;
-}
-
 /*
  * per-zone information in memory controller.
  */
@@ -223,13 +173,13 @@ struct mem_cgroup {
 
 	unsigned int	swappiness;
 
+	unsigned long		softlimit_prev_event;
+#define SOFTLIMIT_EVENTS_THRESH (4000)
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
 
-	/*
-	 * statistics. This must be placed at the end of memcg.
-	 */
-	struct mem_cgroup_stat stat;
+	/* statistics. using percpu array counter */
+	DEFINE_ARRAY_COUNTER(stat, MEMCG_STAT_NSTATS);
 };
 
 /*
@@ -369,20 +319,17 @@ mem_cgroup_remove_exceeded(struct mem_cg
 
 static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
 {
-	bool ret = false;
-	int cpu;
-	s64 val;
-	struct mem_cgroup_stat_cpu *cpustat;
-
-	cpu = get_cpu();
-	cpustat = &mem->stat.cpustat[cpu];
-	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
-	if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
-		__mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
-		ret = true;
+	unsigned long val;
+	val = (unsigned long)
+		array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_EVENTS);
+
+	if (unlikely(time_after(val, mem->softlimit_prev_event +
+					SOFTLIMIT_EVENTS_THRESH))) {
+		/* there can be a race, but we don't need to be exact here. */
+		mem->softlimit_prev_event = val;
+		return true;
 	}
-	put_cpu();
-	return ret;
+	return false;
 }
 
 static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
@@ -480,14 +427,10 @@ mem_cgroup_largest_soft_limit_node(struc
 static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
 					 bool charge)
 {
-	int val = (charge) ? 1 : -1;
-	struct mem_cgroup_stat *stat = &mem->stat;
-	struct mem_cgroup_stat_cpu *cpustat;
-	int cpu = get_cpu();
-
-	cpustat = &stat->cpustat[cpu];
-	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SWAPOUT, val);
-	put_cpu();
+	if (charge)
+		array_counter_inc(GET_ARC(&mem->stat), MEMCG_STAT_SWAPOUT);
+	else
+		array_counter_dec(GET_ARC(&mem->stat), MEMCG_STAT_SWAPOUT);
 }
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
@@ -495,24 +438,28 @@ static void mem_cgroup_charge_statistics
 					 bool charge)
 {
 	int val = (charge) ? 1 : -1;
-	struct mem_cgroup_stat *stat = &mem->stat;
-	struct mem_cgroup_stat_cpu *cpustat;
-	int cpu = get_cpu();
+	struct array_counter *ac = GET_ARC(&mem->stat);
 
-	cpustat = &stat->cpustat[cpu];
 	if (PageCgroupCache(pc))
-		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
+		array_counter_add(ac, MEMCG_STAT_CACHE, val);
 	else
-		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_RSS, val);
+		array_counter_add(ac, MEMCG_STAT_RSS, val);
 
 	if (charge)
-		__mem_cgroup_stat_add_safe(cpustat,
-				MEM_CGROUP_STAT_PGPGIN_COUNT, 1);
+		array_counter_inc(ac, MEMCG_STAT_PGPGIN);
 	else
-		__mem_cgroup_stat_add_safe(cpustat,
-				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
-	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
-	put_cpu();
+		array_counter_inc(ac, MEMCG_STAT_PGPGOUT);
+	array_counter_inc(ac, MEMCG_STAT_EVENTS);
+}
+
+/* This function ignores hierarchy */
+static unsigned long mem_cgroup_my_usage(struct mem_cgroup *mem)
+{
+	unsigned long ret;
+
+	ret = array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_RSS);
+	ret += array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_CACHE);
+	return ret;
 }
 
 static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
@@ -1164,7 +1111,7 @@ static int mem_cgroup_hierarchical_recla
 				}
 			}
 		}
-		if (!mem_cgroup_local_usage(&victim->stat)) {
+		if (!mem_cgroup_my_usage(victim)) {
 			/* this cgroup's local usage == 0 */
 			css_put(&victim->css);
 			continue;
@@ -1230,9 +1177,6 @@ static void record_last_oom(struct mem_c
 void mem_cgroup_update_mapped_file_stat(struct page *page, int val)
 {
 	struct mem_cgroup *mem;
-	struct mem_cgroup_stat *stat;
-	struct mem_cgroup_stat_cpu *cpustat;
-	int cpu;
 	struct page_cgroup *pc;
 
 	if (!page_is_file_cache(page))
@@ -1250,14 +1194,7 @@ void mem_cgroup_update_mapped_file_stat(
 	if (!PageCgroupUsed(pc))
 		goto done;
 
-	/*
-	 * Preemption is already disabled, we don't need get_cpu()
-	 */
-	cpu = smp_processor_id();
-	stat = &mem->stat;
-	cpustat = &stat->cpustat[cpu];
-
-	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
+	array_counter_add(GET_ARC(&mem->stat), MEMCG_STAT_MAPPED_FILE, val);
 done:
 	unlock_page_cgroup(pc);
 }
@@ -1598,9 +1535,6 @@ static int mem_cgroup_move_account(struc
 	int nid, zid;
 	int ret = -EBUSY;
 	struct page *page;
-	int cpu;
-	struct mem_cgroup_stat *stat;
-	struct mem_cgroup_stat_cpu *cpustat;
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON(PageLRU(pc->page));
@@ -1625,18 +1559,10 @@ static int mem_cgroup_move_account(struc
 
 	page = pc->page;
 	if (page_is_file_cache(page) && page_mapped(page)) {
-		cpu = smp_processor_id();
 		/* Update mapped_file data for mem_cgroup "from" */
-		stat = &from->stat;
-		cpustat = &stat->cpustat[cpu];
-		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
-						-1);
-
+		array_counter_dec(GET_ARC(&from->stat), MEMCG_STAT_MAPPED_FILE);
 		/* Update mapped_file data for mem_cgroup "to" */
-		stat = &to->stat;
-		cpustat = &stat->cpustat[cpu];
-		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
-						1);
+		array_counter_inc(GET_ARC(&to->stat), MEMCG_STAT_MAPPED_FILE);
 	}
 
 	if (do_swap_account && !mem_cgroup_is_root(from))
@@ -2687,7 +2613,7 @@ static int
 mem_cgroup_get_idx_stat(struct mem_cgroup *mem, void *data)
 {
 	struct mem_cgroup_idx_data *d = data;
-	d->val += mem_cgroup_read_stat(&mem->stat, d->idx);
+	d->val += array_counter_read(GET_ARC(&mem->stat), d->idx);
 	return 0;
 }
 
@@ -2714,10 +2640,10 @@ static u64 mem_cgroup_read(struct cgroup
 	case _MEM:
 		if (name == RES_USAGE && mem_cgroup_is_root(mem)) {
 			mem_cgroup_get_recursive_idx_stat(mem,
-				MEM_CGROUP_STAT_CACHE, &idx_val);
+				MEMCG_STAT_CACHE, &idx_val);
 			val = idx_val;
 			mem_cgroup_get_recursive_idx_stat(mem,
-				MEM_CGROUP_STAT_RSS, &idx_val);
+				MEMCG_STAT_RSS, &idx_val);
 			val += idx_val;
 			val <<= PAGE_SHIFT;
 		} else
@@ -2726,13 +2652,13 @@ static u64 mem_cgroup_read(struct cgroup
 	case _MEMSWAP:
 		if (name == RES_USAGE && mem_cgroup_is_root(mem)) {
 			mem_cgroup_get_recursive_idx_stat(mem,
-				MEM_CGROUP_STAT_CACHE, &idx_val);
+				MEMCG_STAT_CACHE, &idx_val);
 			val = idx_val;
 			mem_cgroup_get_recursive_idx_stat(mem,
-				MEM_CGROUP_STAT_RSS, &idx_val);
+				MEMCG_STAT_RSS, &idx_val);
 			val += idx_val;
 			mem_cgroup_get_recursive_idx_stat(mem,
-				MEM_CGROUP_STAT_SWAPOUT, &idx_val);
+				MEMCG_STAT_SWAPOUT, &idx_val);
 			val <<= PAGE_SHIFT;
 		} else
 			val = res_counter_read_u64(&mem->memsw, name);
@@ -2892,18 +2818,19 @@ static int mem_cgroup_get_local_stat(str
 	s64 val;
 
 	/* per cpu stat */
-	val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_CACHE);
+	val = array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_CACHE);
 	s->stat[MCS_CACHE] += val * PAGE_SIZE;
-	val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
+	val = array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_RSS);
 	s->stat[MCS_RSS] += val * PAGE_SIZE;
-	val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_MAPPED_FILE);
+	val = array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_MAPPED_FILE);
 	s->stat[MCS_MAPPED_FILE] += val * PAGE_SIZE;
-	val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGIN_COUNT);
+	val = array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_PGPGIN);
 	s->stat[MCS_PGPGIN] += val;
-	val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
+	val = array_counter_read(GET_ARC(&mem->stat), MEMCG_STAT_PGPGOUT);
 	s->stat[MCS_PGPGOUT] += val;
 	if (do_swap_account) {
-		val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_SWAPOUT);
+		val = array_counter_read(GET_ARC(&mem->stat),
+			MEMCG_STAT_SWAPOUT);
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
 
@@ -3162,16 +3089,10 @@ static void free_mem_cgroup_per_zone_inf
 	kfree(mem->info.nodeinfo[node]);
 }
 
-static int mem_cgroup_size(void)
-{
-	int cpustat_size = nr_cpu_ids * sizeof(struct mem_cgroup_stat_cpu);
-	return sizeof(struct mem_cgroup) + cpustat_size;
-}
-
 static struct mem_cgroup *mem_cgroup_alloc(void)
 {
 	struct mem_cgroup *mem;
-	int size = mem_cgroup_size();
+	int size = sizeof(struct mem_cgroup);
 
 	if (size < PAGE_SIZE)
 		mem = kmalloc(size, GFP_KERNEL);
@@ -3180,6 +3101,17 @@ static struct mem_cgroup *mem_cgroup_all
 
 	if (mem)
 		memset(mem, 0, size);
+	if (!mem)
+		return NULL;
+
+	if (array_counter_init(GET_ARC(&mem->stat), MEMCG_STAT_NSTATS)) {
+		/* can't allocate percpu counter */
+		if (size < PAGE_SIZE)
+			kfree(mem);
+		else
+			vfree(mem);
+		mem = NULL;
+	}
 	return mem;
 }
 
@@ -3204,7 +3136,9 @@ static void __mem_cgroup_free(struct mem
 	for_each_node_state(node, N_POSSIBLE)
 		free_mem_cgroup_per_zone_info(mem, node);
 
-	if (mem_cgroup_size() < PAGE_SIZE)
+	array_counter_destroy(GET_ARC(&mem->stat));
+
+	if (sizeof(*mem) < PAGE_SIZE)
 		kfree(mem);
 	else
 		vfree(mem);



* Re: [RFC][PATCH 1/2] percpu array counter like vmstat
  2009-09-30 10:09 ` [RFC][PATCH 1/2] percpu " KAMEZAWA Hiroyuki
@ 2009-09-30 23:31   ` KAMEZAWA Hiroyuki
  2009-09-30 23:56     ` nishimura
  0 siblings, 1 reply; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-30 23:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, linux-kernel

On Wed, 30 Sep 2009 19:09:43 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> +int array_counter_init(struct array_counter *ac, int size)
> +{
> +	ac->v.elements = size;
> +	ac->v.counters = alloc_percpu(s8);
This is a bug, of course...
It should be:
ac->v.counters = __alloc_percpu(size, __alignof__(char));

Regards,
-Kame



* Re: [RFC][PATCH 1/2] percpu array counter like vmstat
  2009-09-30 23:31   ` KAMEZAWA Hiroyuki
@ 2009-09-30 23:56     ` nishimura
  0 siblings, 0 replies; 7+ messages in thread
From: nishimura @ 2009-09-30 23:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura, linux-kernel

> On Wed, 30 Sep 2009 19:09:43 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
>> +int array_counter_init(struct array_counter *ac, int size)
>> +{
>> +	ac->v.elements = size;
>> +	ac->v.counters = alloc_percpu(s8);
> This is a bug, of course...
Yes, I was confused at that point and was about to point it out :)

> should be
> ac->v.counters = __alloc_percpu(size, __alignof__(char));
> 
__alloc_percpu(size * sizeof(s8), __alignof__(s8)) would be better?
There is no actual difference, though.
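
For reference, a sketch of array_counter_init() with that change folded in
(assuming one s8 per element in the percpu area; the CPU-hotplug list
registration stays as in the patch):

	int array_counter_init(struct array_counter *ac, int size)
	{
		ac->v.elements = size;
		/* one s8 batch-local counter per element, per cpu */
		ac->v.counters = __alloc_percpu(size * sizeof(s8),
						__alignof__(s8));
		if (!ac->v.counters)
			return -ENOMEM;
		ac->v.batch = percpu_counter_batch;
		/* CPU-hotplug list registration as in the original patch */
		return 0;
	}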


Thanks,
Daisuke Nishimura.



* Re: [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat
  2009-09-30 10:04 [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat KAMEZAWA Hiroyuki
  2009-09-30 10:09 ` [RFC][PATCH 1/2] percpu " KAMEZAWA Hiroyuki
  2009-09-30 10:12 ` [RFC][PATCH 2/2] memcg: use generic percpu array_counter KAMEZAWA Hiroyuki
@ 2009-10-01  0:45 ` Daisuke Nishimura
  2009-10-01  1:29   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 7+ messages in thread
From: Daisuke Nishimura @ 2009-10-01  0:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir, linux-kernel

On Wed, 30 Sep 2009 19:04:17 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Hi,
> 
> In the current implementation, memcg uses its own percpu counters for counting
> events and the number of RSS and CACHE pages. The counter is maintained per cpu
> without any of the synchronization that vm_stat[] or percpu_counter have, so it
> is an update-is-fast-but-read-is-slow counter.
> 
> Because these counters were only read via the memory.stat file, I thought the
> read-side slowness was acceptable. The amount of memory usage, which affects
> the memory limit check, can be read via memory.usage_in_bytes, which is
> maintained by res_counter.
> 
> But in the current -rc, the root memcg's memory usage is calculated from this
> per cpu counter, and the read-side slowness may be a problem if it is read
> frequently.
> 
> And, from recent discussion, I wonder whether we should maintain NR_DIRTY etc.
> in memcg. If so, a slow-read counter will not match our requirements, I guess.
> I want a counter like vm_stat[] in memcg.
> 
I see your concern.

But IMHO, it would be better to explain why we need a new percpu array counter
instead of using an array of percpu_counters (size, or consolidation of related
counters?), IOW, what the benefit of the percpu array counter is.


Thanks,
Daisuke Nishimura.

> These 2 patches introduce a counter like vm_stat[] for memcg.
> This is just an idea-level implementation, but I think it is not so bad.
> 
> I confirmed this patch works well. I'm now thinking about how to test performance...
> 
> Any comments are welcome.
> This patch is on top of mmotm + some of my own patches... so this is just an RFC.
> 
> Regards,
> -Kame
> 



* Re: [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat
  2009-10-01  0:45 ` [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat Daisuke Nishimura
@ 2009-10-01  1:29   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-01  1:29 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, balbir, linux-kernel

On Thu, 1 Oct 2009 09:45:14 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Wed, 30 Sep 2009 19:04:17 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Hi,
> > 
> > In the current implementation, memcg uses its own percpu counters for counting
> > events and the number of RSS and CACHE pages. The counter is maintained per cpu
> > without any of the synchronization that vm_stat[] or percpu_counter have, so it
> > is an update-is-fast-but-read-is-slow counter.
> > 
> > Because these counters were only read via the memory.stat file, I thought the
> > read-side slowness was acceptable. The amount of memory usage, which affects
> > the memory limit check, can be read via memory.usage_in_bytes, which is
> > maintained by res_counter.
> > 
> > But in the current -rc, the root memcg's memory usage is calculated from this
> > per cpu counter, and the read-side slowness may be a problem if it is read
> > frequently.
> > 
> > And, from recent discussion, I wonder whether we should maintain NR_DIRTY etc.
> > in memcg. If so, a slow-read counter will not match our requirements, I guess.
> > I want a counter like vm_stat[] in memcg.
> > 
> I see your concern.
> 
> But IMHO, it would be better to explain why we need a new percpu array counter
> instead of using an array of percpu_counters (size, or consolidation of related
> counters?), IOW, what the benefit of the percpu array counter is.
> 
Ok.
  An array of 4 percpu_counters means a struct like the following:

     lock                4bytes (int)
     count               8bytes
     list_head           16bytes
     pointer to percpu   8bytes
     lock                ,,,
     count
     list_head
     pointer to percpu
     lock
     count
     list_head
     pointer to percpu
     lock
     count
     list_head
     pointer to percpu

    36 x 4 = 144 bytes, and this has 4 spinlocks. 2 cache lines.
    4 spinlocks means that if one of the "batch" counts expires on a cpu, all
    of the cache above will be invalidated. Most of the read-only data will be lost.

    Aligning each percpu_counter to a cache line to avoid false sharing
    means this will use 4 cache lines + the percpu area.
    That's bad.

  An array counter with 4 entries is:
     s8 batch            4bytes (will be aligned)
     pointer to percpu   8bytes
     elements            4bytes.
     list head           16bytes
     ==== cacheline aligned here== 128bytes.
     atomic_long_t       4 x 8 == 32bytes
     ==== should be aligned to a cacheline? maybe yes ====

  Then, this will occupy 2 cache lines + the percpu area.
  There is no false sharing in the read-only area.
  All writes are done with one (locked) access.

Hmm.. I may have to think more about archs which do not have atomic_xxx ops.

Considering that sets of counters can be updated at once, an array of
percpu_counters is not a good choice, I think.
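
Roughly, in C, the two layouts compared above would look like this
(stats_a/stats_b are made-up names; the sizes are the estimates above, on 64bit):

	/* (a) an array of 4 percpu_counters: 4 spinlocks, ~144 bytes of shared state */
	struct stats_a {
		struct percpu_counter c[4];	/* each: spinlock, s64 count,
						   list_head, percpu pointer */
	};

	/* (b) one array counter with 4 entries: one shared header, then the atomics */
	struct stats_b {
		DEFINE_ARRAY_COUNTER(c, 4);	/* batch, percpu pointer, elements and
						   list_head padded to a cache line,
						   then 4 x atomic_long_t */
	};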

Thanks,
-Kame




end of thread, other threads:[~2009-10-01  1:06 UTC | newest]

Thread overview: 7+ messages
2009-09-30 10:04 [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat KAMEZAWA Hiroyuki
2009-09-30 10:09 ` [RFC][PATCH 1/2] percpu " KAMEZAWA Hiroyuki
2009-09-30 23:31   ` KAMEZAWA Hiroyuki
2009-09-30 23:56     ` nishimura
2009-09-30 10:12 ` [RFC][PATCH 2/2] memcg: use generic percpu array_counter KAMEZAWA Hiroyuki
2009-10-01  0:45 ` [RFC][PATCH 0/2] memcg: replace memcg's per cpu status counter with array counter like vmstat Daisuke Nishimura
2009-10-01  1:29   ` KAMEZAWA Hiroyuki
