linux-mm.kvack.org archive mirror
* [RFC][mmotm][PATCH] percpu mm struct counter cache
@ 2009-12-03  1:28 KAMEZAWA Hiroyuki
  2009-12-03  1:32 ` KAMEZAWA Hiroyuki
  2009-12-03 15:11 ` Minchan Kim
  0 siblings, 2 replies; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-03  1:28 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, cl, akpm, minchan.kim, yanmin_zhang

Christoph's mm_counter+percpu implementation scales well for updates, but
the read side had some problems. Inspired by that, I tried to write a
percpu-cache counter + synchronization method. My own tiny benchmark shows
some improvement, but this patch's hooks may summon other troubles...

For now, I am starting by sharing the code here. Any comments are welcome.
(In particular, moving the hooks somewhere better is my concern.)
My test program will be posted in reply to this mail.

Regards,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This patch implements light-weight per-mm statistics.
Currently, when the split pagetable lock is used, the per-mm statistics
are maintained as atomic_long_t values. This costs one atomic_inc()
under the page table lock, and if a multi-threaded program runs and shares
an mm_struct, this tends to cause cache misses + atomic ops.

This patch adds a per-cpu cache for the mm statistics and syncs it periodically.
The cached information is synchronized back into mm_struct at
  - tick
  - context switch
if there is any difference.
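
In short, each cpu keeps a small cache of pending deltas for the mm it is
currently running, and those deltas are folded back into mm_struct at the
sync points above. A condensed sketch of what the patch below implements:

	struct pcp_mm_cache {
		struct mm_struct *mm;
		long counters[NR_MM_STATS];	/* pending deltas, not totals */
	};
	static DEFINE_PER_CPU(struct pcp_mm_cache, curr_mmc);

	/* page fault path: no atomic op when the fault hits the cached mm */
	if (likely(percpu_read(curr_mmc.mm) == mm))
		percpu_add(curr_mmc.counters[member], val);
	else
		atomic_long_add(val, &mm->counters[member]);

	/* at tick / context switch: add the cached deltas back and clear them */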

A tiny test program on an x86-64 4core/2socket machine shows (small) improvements.
The test program measures the # of page faults on cpus 0 and 4.
(When all 8 cpus are used, most of the time is spent on the spinlock and you
 can't see the benefits of this patch..)

[Before Patch]
Performance counter stats for './multi-fault 2' (5 runs):

       44282223  page-faults                ( +-   0.912% )
     1015540330  cache-references           ( +-   1.701% )
      210497140  cache-misses               ( +-   0.731% )
 29262804803383988  bus-cycles                 ( +-   0.003% )

   60.003401467  seconds time elapsed   ( +-   0.004% )

 4.75 cache-misses/fault
 660825108.2 bus-cycles/fault

[After Patch]
Performance counter stats for './multi-fault 2' (5 runs):

       45543398  page-faults                ( +-   0.499% )
     1031865896  cache-references           ( +-   2.720% )
      184901499  cache-misses               ( +-   0.626% )
 29261889737265056  bus-cycles                 ( +-   0.002% )

   60.001218501  seconds time elapsed   ( +-   0.000% )

 4.05 cache-misses/fault
 642505632.5 bus-cycles/fault
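
(The derived figures above are just ratios of the perf counters; e.g. for the
 before-patch run, 210497140 cache-misses / 44282223 page-faults ~= 4.75 and
 29262804803383988 bus-cycles / 44282223 page-faults ~= 660825108.2.)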

Note: to enable split-pagetable-lock, you have to disable SPINLOCK_DEBUG.

This patch moves the mm_counter definitions from sched.h to mm.h + memory.c,
so the total patch size looks big.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/proc/task_mmu.c       |    4 -
 include/linux/mm.h       |   55 ++++++++++++++++++
 include/linux/mm_types.h |   23 +++++--
 include/linux/sched.h    |   54 ------------------
 kernel/exit.c            |    4 +
 kernel/fork.c            |    3 -
 kernel/sched.c           |    1 
 kernel/timer.c           |    2 
 mm/filemap_xip.c         |    2 
 mm/fremap.c              |    2 
 mm/memory.c              |  139 ++++++++++++++++++++++++++++++++++++++++++-----
 mm/oom_kill.c            |    4 -
 mm/rmap.c                |   10 +--
 mm/swapfile.c            |    2 
 14 files changed, 217 insertions(+), 88 deletions(-)

Index: mmotm-2.6.32-Nov24/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.32-Nov24.orig/include/linux/mm_types.h
+++ mmotm-2.6.32-Nov24/include/linux/mm_types.h
@@ -24,11 +24,6 @@ struct address_space;
 
 #define USE_SPLIT_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
 
-#if USE_SPLIT_PTLOCKS
-typedef atomic_long_t mm_counter_t;
-#else  /* !USE_SPLIT_PTLOCKS */
-typedef unsigned long mm_counter_t;
-#endif /* !USE_SPLIT_PTLOCKS */
 
 /*
  * Each physical page in the system has a struct page associated with
@@ -199,6 +194,21 @@ struct core_state {
 	struct completion startup;
 };
 
+/*
+ * for per-mm accounting.
+ */
+#if USE_SPLIT_PTLOCKS
+typedef atomic_long_t mm_counter_t;
+#else  /* !USE_SPLIT_PTLOCKS */
+typedef unsigned long mm_counter_t;
+#endif /* !USE_SPLIT_PTLOCKS */
+
+enum {
+	MM_FILEPAGES,
+	MM_ANONPAGES,
+	NR_MM_STATS,
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -226,8 +236,7 @@ struct mm_struct {
 	/* Special counters, in some configurations protected by the
 	 * page_table_lock, in other configurations by being atomic.
 	 */
-	mm_counter_t _file_rss;
-	mm_counter_t _anon_rss;
+	mm_counter_t counters[NR_MM_STATS];
 
 	unsigned long hiwater_rss;	/* High-watermark of RSS usage */
 	unsigned long hiwater_vm;	/* High-water virtual memory usage */
Index: mmotm-2.6.32-Nov24/include/linux/sched.h
===================================================================
--- mmotm-2.6.32-Nov24.orig/include/linux/sched.h
+++ mmotm-2.6.32-Nov24/include/linux/sched.h
@@ -385,60 +385,6 @@ arch_get_unmapped_area_topdown(struct fi
 extern void arch_unmap_area(struct mm_struct *, unsigned long);
 extern void arch_unmap_area_topdown(struct mm_struct *, unsigned long);
 
-#if USE_SPLIT_PTLOCKS
-/*
- * The mm counters are not protected by its page_table_lock,
- * so must be incremented atomically.
- */
-#define set_mm_counter(mm, member, value) atomic_long_set(&(mm)->_##member, value)
-#define get_mm_counter(mm, member) ((unsigned long)atomic_long_read(&(mm)->_##member))
-#define add_mm_counter(mm, member, value) atomic_long_add(value, &(mm)->_##member)
-#define inc_mm_counter(mm, member) atomic_long_inc(&(mm)->_##member)
-#define dec_mm_counter(mm, member) atomic_long_dec(&(mm)->_##member)
-
-#else  /* !USE_SPLIT_PTLOCKS */
-/*
- * The mm counters are protected by its page_table_lock,
- * so can be incremented directly.
- */
-#define set_mm_counter(mm, member, value) (mm)->_##member = (value)
-#define get_mm_counter(mm, member) ((mm)->_##member)
-#define add_mm_counter(mm, member, value) (mm)->_##member += (value)
-#define inc_mm_counter(mm, member) (mm)->_##member++
-#define dec_mm_counter(mm, member) (mm)->_##member--
-
-#endif /* !USE_SPLIT_PTLOCKS */
-
-#define get_mm_rss(mm)					\
-	(get_mm_counter(mm, file_rss) + get_mm_counter(mm, anon_rss))
-#define update_hiwater_rss(mm)	do {			\
-	unsigned long _rss = get_mm_rss(mm);		\
-	if ((mm)->hiwater_rss < _rss)			\
-		(mm)->hiwater_rss = _rss;		\
-} while (0)
-#define update_hiwater_vm(mm)	do {			\
-	if ((mm)->hiwater_vm < (mm)->total_vm)		\
-		(mm)->hiwater_vm = (mm)->total_vm;	\
-} while (0)
-
-static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
-{
-	return max(mm->hiwater_rss, get_mm_rss(mm));
-}
-
-static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
-					 struct mm_struct *mm)
-{
-	unsigned long hiwater_rss = get_mm_hiwater_rss(mm);
-
-	if (*maxrss < hiwater_rss)
-		*maxrss = hiwater_rss;
-}
-
-static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
-{
-	return max(mm->hiwater_vm, mm->total_vm);
-}
 
 extern void set_dumpable(struct mm_struct *mm, int value);
 extern int get_dumpable(struct mm_struct *mm);
Index: mmotm-2.6.32-Nov24/mm/memory.c
===================================================================
--- mmotm-2.6.32-Nov24.orig/mm/memory.c
+++ mmotm-2.6.32-Nov24/mm/memory.c
@@ -121,6 +121,115 @@ static int __init init_zero_pfn(void)
 }
 core_initcall(init_zero_pfn);
 
+
+#if USE_SPLIT_PTLOCKS
+
+struct pcp_mm_cache {
+	struct mm_struct *mm;
+	long counters[NR_MM_STATS];
+};
+static DEFINE_PER_CPU(struct pcp_mm_cache, curr_mmc);
+/*
+ * The mm counters are not protected by its page_table_lock,
+ * so must be incremented atomically.
+ */
+void set_mm_counter(struct mm_struct *mm, int member, long value)
+{
+	atomic_long_set(&mm->counters[member], value);
+}
+
+unsigned long get_mm_counter(struct mm_struct *mm, int member)
+{
+	long ret = atomic_long_read(&mm->counters[member]);
+	if (ret < 0)
+		return 0;
+	return ret;
+}
+
+void add_mm_counter(struct mm_struct *mm, int member, long value)
+{
+	atomic_long_add(value, &mm->counters[member]);
+}
+
+/*
+ * Always called under pte_lock....irq off, mm != curr_mmc.mm if called
+ * by get_user_pages() etc.
+ */
+static void
+add_mm_counter_fast(struct mm_struct *mm, int member, long val)
+{
+	if (likely(percpu_read(curr_mmc.mm) == mm))
+		percpu_add(curr_mmc.counters[member], val);
+	else
+		add_mm_counter(mm, member, val);
+}
+
+/* Called by not-preemptable context */
+void sync_tsk_mm_counters(void)
+{
+	struct pcp_mm_cache *cache = &per_cpu(curr_mmc, smp_processor_id());
+	int i;
+
+	if (!cache->mm)
+		return;
+
+	for (i = 0; i < NR_MM_STATS; i++) {
+		if (!cache->counters[i])
+			continue;
+		add_mm_counter(cache->mm, i, cache->counters[i]);
+		cache->counters[i] = 0;
+	}
+}
+
+void prepare_mm_switch(struct task_struct *prev, struct task_struct *next)
+{
+	if (prev->mm == next->mm)
+		return;
+	/* If task is exited, sync is already done and prev->mm is NULL */
+	if (prev->mm)
+		sync_tsk_mm_counters();
+	percpu_write(curr_mmc.mm, next->mm);
+}
+
+#else  /* !USE_SPLIT_PTLOCKS */
+/*
+ * The mm counters are protected by its page_table_lock,
+ * so can be incremented directly.
+ */
+void set_mm_counter(struct mm_struct *mm, int member, long value)
+{
+	mm->counters[member] = value;
+}
+
+unsigned long get_mm_counter(struct mm_struct *mm, int member)
+{
+	return mm->counters[member];
+}
+
+void add_mm_counter(struct mm_struct *mm, int member, long val)
+{
+	mm->counters[member] += val;
+}
+
+void sync_tsk_mm_counters(void)
+{
+}
+
+#define add_mm_counter_fast(mm, member, val) add_mm_counter(mm, member, val)
+
+#endif /* !USE_SPLIT_PTLOCKS */
+/* Special asynchronous routine for page fault path */
+#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, 1)
+#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, -1)
+
+void init_mm_counters(struct mm_struct *mm)
+{
+	int i;
+
+	for (i = 0; i < NR_MM_STATS; i++)
+		set_mm_counter(mm, i, 0);
+}
+
 /*
  * If a p?d_bad entry is found while walking page tables, report
  * the error, before resetting entry to p?d_none.  Usually (but
@@ -378,10 +487,11 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
 
 static inline void add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss)
 {
+	/* use synchronous updates here */
 	if (file_rss)
-		add_mm_counter(mm, file_rss, file_rss);
+		add_mm_counter(mm, MM_FILEPAGES, file_rss);
 	if (anon_rss)
-		add_mm_counter(mm, anon_rss, anon_rss);
+		add_mm_counter(mm, MM_ANONPAGES, anon_rss);
 }
 
 /*
@@ -632,7 +742,10 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page);
-		rss[PageAnon(page)]++;
+		if (PageAnon(page))
+			rss[MM_ANONPAGES]++;
+		else
+			rss[MM_FILEPAGES]++;
 	}
 
 out_set_pte:
@@ -648,7 +761,7 @@ static int copy_pte_range(struct mm_stru
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
-	int rss[2];
+	int rss[NR_MM_STATS];
 	swp_entry_t entry = (swp_entry_t){0};
 
 again:
@@ -688,7 +801,7 @@ again:
 	arch_leave_lazy_mmu_mode();
 	spin_unlock(src_ptl);
 	pte_unmap_nested(orig_src_pte);
-	add_mm_rss(dst_mm, rss[0], rss[1]);
+	add_mm_rss(dst_mm, rss[MM_FILEPAGES], rss[MM_ANONPAGES]);
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
 	cond_resched();
 
@@ -1527,7 +1640,7 @@ static int insert_page(struct vm_area_st
 
 	/* Ok, finally just insert the thing.. */
 	get_page(page);
-	inc_mm_counter(mm, file_rss);
+	inc_mm_counter_fast(mm, MM_FILEPAGES);
 	page_add_file_rmap(page);
 	set_pte_at(mm, addr, pte, mk_pte(page, prot));
 
@@ -2163,11 +2276,11 @@ gotten:
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
-				dec_mm_counter(mm, file_rss);
-				inc_mm_counter(mm, anon_rss);
+				dec_mm_counter_fast(mm, MM_FILEPAGES);
+				inc_mm_counter_fast(mm, MM_ANONPAGES);
 			}
 		} else
-			inc_mm_counter(mm, anon_rss);
+			inc_mm_counter_fast(mm, MM_ANONPAGES);
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2600,7 +2713,7 @@ static int do_swap_page(struct mm_struct
 	 * discarded at swap_free().
 	 */
 
-	inc_mm_counter(mm, anon_rss);
+	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -2684,7 +2797,7 @@ static int do_anonymous_page(struct mm_s
 	if (!pte_none(*page_table))
 		goto release;
 
-	inc_mm_counter(mm, anon_rss);
+	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
 setpte:
 	set_pte_at(mm, address, page_table, entry);
@@ -2838,10 +2951,10 @@ static int __do_fault(struct mm_struct *
 		if (flags & FAULT_FLAG_WRITE)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		if (anon) {
-			inc_mm_counter(mm, anon_rss);
+			inc_mm_counter_fast(mm, MM_ANONPAGES);
 			page_add_new_anon_rmap(page, vma, address);
 		} else {
-			inc_mm_counter(mm, file_rss);
+			inc_mm_counter_fast(mm, MM_FILEPAGES);
 			page_add_file_rmap(page);
 			if (flags & FAULT_FLAG_WRITE) {
 				dirty_page = page;
Index: mmotm-2.6.32-Nov24/kernel/fork.c
===================================================================
--- mmotm-2.6.32-Nov24.orig/kernel/fork.c
+++ mmotm-2.6.32-Nov24/kernel/fork.c
@@ -452,8 +452,7 @@ static struct mm_struct * mm_init(struct
 		(current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
 	mm->core_state = NULL;
 	mm->nr_ptes = 0;
-	set_mm_counter(mm, file_rss, 0);
-	set_mm_counter(mm, anon_rss, 0);
+	init_mm_counters(mm);
 	spin_lock_init(&mm->page_table_lock);
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
Index: mmotm-2.6.32-Nov24/mm/fremap.c
===================================================================
--- mmotm-2.6.32-Nov24.orig/mm/fremap.c
+++ mmotm-2.6.32-Nov24/mm/fremap.c
@@ -40,7 +40,7 @@ static void zap_pte(struct mm_struct *mm
 			page_remove_rmap(page);
 			page_cache_release(page);
 			update_hiwater_rss(mm);
-			dec_mm_counter(mm, file_rss);
+			dec_mm_counter(mm, MM_FILEPAGES);
 		}
 	} else {
 		if (!pte_file(pte))
Index: mmotm-2.6.32-Nov24/mm/rmap.c
===================================================================
--- mmotm-2.6.32-Nov24.orig/mm/rmap.c
+++ mmotm-2.6.32-Nov24/mm/rmap.c
@@ -815,9 +815,9 @@ int try_to_unmap_one(struct page *page, 
 
 	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
 		if (PageAnon(page))
-			dec_mm_counter(mm, anon_rss);
+			dec_mm_counter(mm, MM_ANONPAGES);
 		else
-			dec_mm_counter(mm, file_rss);
+			dec_mm_counter(mm, MM_FILEPAGES);
 		set_pte_at(mm, address, pte,
 				swp_entry_to_pte(make_hwpoison_entry(page)));
 	} else if (PageAnon(page)) {
@@ -839,7 +839,7 @@ int try_to_unmap_one(struct page *page, 
 					list_add(&mm->mmlist, &init_mm.mmlist);
 				spin_unlock(&mmlist_lock);
 			}
-			dec_mm_counter(mm, anon_rss);
+			dec_mm_counter(mm, MM_ANONPAGES);
 		} else if (PAGE_MIGRATION) {
 			/*
 			 * Store the pfn of the page in a special migration
@@ -857,7 +857,7 @@ int try_to_unmap_one(struct page *page, 
 		entry = make_migration_entry(page, pte_write(pteval));
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
 	} else
-		dec_mm_counter(mm, file_rss);
+		dec_mm_counter(mm, MM_FILEPAGES);
 
 	page_remove_rmap(page);
 	page_cache_release(page);
@@ -995,7 +995,7 @@ static int try_to_unmap_cluster(unsigned
 
 		page_remove_rmap(page);
 		page_cache_release(page);
-		dec_mm_counter(mm, file_rss);
+		dec_mm_counter(mm, MM_FILEPAGES);
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
Index: mmotm-2.6.32-Nov24/mm/swapfile.c
===================================================================
--- mmotm-2.6.32-Nov24.orig/mm/swapfile.c
+++ mmotm-2.6.32-Nov24/mm/swapfile.c
@@ -839,7 +839,7 @@ static int unuse_pte(struct vm_area_stru
 		goto out;
 	}
 
-	inc_mm_counter(vma->vm_mm, anon_rss);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
 	get_page(page);
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
Index: mmotm-2.6.32-Nov24/kernel/timer.c
===================================================================
--- mmotm-2.6.32-Nov24.orig/kernel/timer.c
+++ mmotm-2.6.32-Nov24/kernel/timer.c
@@ -1200,6 +1200,8 @@ void update_process_times(int user_tick)
 	account_process_tick(p, user_tick);
 	run_local_timers();
 	rcu_check_callbacks(cpu, user_tick);
+	/* sync cached mm stat information */
+	sync_tsk_mm_counters();
 	printk_tick();
 	scheduler_tick();
 	run_posix_cpu_timers(p);
Index: mmotm-2.6.32-Nov24/mm/filemap_xip.c
===================================================================
--- mmotm-2.6.32-Nov24.orig/mm/filemap_xip.c
+++ mmotm-2.6.32-Nov24/mm/filemap_xip.c
@@ -194,7 +194,7 @@ retry:
 			flush_cache_page(vma, address, pte_pfn(*pte));
 			pteval = ptep_clear_flush_notify(vma, address, pte);
 			page_remove_rmap(page);
-			dec_mm_counter(mm, file_rss);
+			dec_mm_counter(mm, MM_FILEPAGES);
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			page_cache_release(page);
Index: mmotm-2.6.32-Nov24/include/linux/mm.h
===================================================================
--- mmotm-2.6.32-Nov24.orig/include/linux/mm.h
+++ mmotm-2.6.32-Nov24/include/linux/mm.h
@@ -863,6 +863,61 @@ extern int mprotect_fixup(struct vm_area
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
 
+/* For per-mm stat accounting */
+extern void set_mm_counter(struct mm_struct *mm, int member, long value);
+extern unsigned long get_mm_counter(struct mm_struct *mm, int member);
+extern void add_mm_counter(struct mm_struct *mm, int member, long value);
+extern void sync_tsk_mm_counters(void);
+extern void init_mm_counters(struct mm_struct *mm);
+
+#if USE_SPLIT_PTLOCKS
+extern void prepare_mm_switch(struct task_struct *prev,
+				 struct task_struct *next);
+#else
+static inline void prepare_mm_switch(struct task_struct *prev,
+				struct task_struct *next)
+{
+}
+#endif
+
+#define inc_mm_counter(mm, member) add_mm_counter((mm), (member), 1)
+#define dec_mm_counter(mm, member) add_mm_counter((mm), (member), -1)
+
+#define get_mm_rss(mm)			\
+	(get_mm_counter(mm, MM_FILEPAGES) +\
+	 get_mm_counter(mm, MM_ANONPAGES))
+
+#define update_hiwater_rss(mm)	do {			\
+	unsigned long _rss = get_mm_rss(mm);		\
+	if ((mm)->hiwater_rss < _rss)			\
+		(mm)->hiwater_rss = _rss;		\
+} while (0)
+
+#define update_hiwater_vm(mm)	do {			\
+	if ((mm)->hiwater_vm < (mm)->total_vm)		\
+		(mm)->hiwater_vm = (mm)->total_vm;	\
+} while (0)
+
+static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
+{
+	return max(mm->hiwater_rss, get_mm_rss(mm));
+}
+
+static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
+					 struct mm_struct *mm)
+{
+	unsigned long hiwater_rss = get_mm_hiwater_rss(mm);
+
+	if (*maxrss < hiwater_rss)
+		*maxrss = hiwater_rss;
+}
+
+static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
+{
+	return max(mm->hiwater_vm, mm->total_vm);
+}
+
+
 /*
  * doesn't attempt to fault and will return short.
  */
Index: mmotm-2.6.32-Nov24/fs/proc/task_mmu.c
===================================================================
--- mmotm-2.6.32-Nov24.orig/fs/proc/task_mmu.c
+++ mmotm-2.6.32-Nov24/fs/proc/task_mmu.c
@@ -65,11 +65,11 @@ unsigned long task_vsize(struct mm_struc
 int task_statm(struct mm_struct *mm, int *shared, int *text,
 	       int *data, int *resident)
 {
-	*shared = get_mm_counter(mm, file_rss);
+	*shared = get_mm_counter(mm, MM_FILEPAGES);
 	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
 								>> PAGE_SHIFT;
 	*data = mm->total_vm - mm->shared_vm;
-	*resident = *shared + get_mm_counter(mm, anon_rss);
+	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
 	return mm->total_vm;
 }
 
Index: mmotm-2.6.32-Nov24/mm/oom_kill.c
===================================================================
--- mmotm-2.6.32-Nov24.orig/mm/oom_kill.c
+++ mmotm-2.6.32-Nov24/mm/oom_kill.c
@@ -400,8 +400,8 @@ static void __oom_kill_task(struct task_
 		       "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		       task_pid_nr(p), p->comm,
 		       K(p->mm->total_vm),
-		       K(get_mm_counter(p->mm, anon_rss)),
-		       K(get_mm_counter(p->mm, file_rss)));
+		       K(get_mm_counter(p->mm, MM_ANONPAGES)),
+		       K(get_mm_counter(p->mm, MM_FILEPAGES)));
 	task_unlock(p);
 
 	/*
Index: mmotm-2.6.32-Nov24/kernel/exit.c
===================================================================
--- mmotm-2.6.32-Nov24.orig/kernel/exit.c
+++ mmotm-2.6.32-Nov24/kernel/exit.c
@@ -681,6 +681,10 @@ static void exit_mm(struct task_struct *
 	}
 	atomic_inc(&mm->mm_count);
 	BUG_ON(mm != tsk->active_mm);
+	/* drop cached information */
+	preempt_disable();
+	sync_tsk_mm_counters();
+	preempt_enable();
 	/* more a memory barrier than a real lock */
 	task_lock(tsk);
 	tsk->mm = NULL;
Index: mmotm-2.6.32-Nov24/kernel/sched.c
===================================================================
--- mmotm-2.6.32-Nov24.orig/kernel/sched.c
+++ mmotm-2.6.32-Nov24/kernel/sched.c
@@ -2858,6 +2858,7 @@ context_switch(struct rq *rq, struct tas
 	trace_sched_switch(rq, prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
+	prepare_mm_switch(prev, next);
 	/*
 	 * For paravirt, this is coupled with an exit in switch_to to
 	 * combine the page table reload and the switch backend into



* Re: [RFC][mmotm][PATCH] percpu mm struct counter cache
  2009-12-03  1:28 [RFC][mmotm][PATCH] percpu mm struct counter cache KAMEZAWA Hiroyuki
@ 2009-12-03  1:32 ` KAMEZAWA Hiroyuki
  2009-12-03 15:11 ` Minchan Kim
  1 sibling, 0 replies; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-03  1:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, cl, akpm, minchan.kim, yanmin_zhang

This is the test program I used for measuring page fault cost.
It creates threads on cpus 0,4,8,... (because I use 4-core cpus)
and causes page faults on each cpu in parallel.
If you run too many threads, spin_lock costs will dominate everything...
I used just 2 threads in the measurements, on cpus 0 and 4.
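
(The numbers in the previous mail come from perf stat; the "(5 runs)" output
 suggests an invocation roughly like
 "perf stat -r 5 -e page-faults,cache-references,cache-misses,bus-cycles ./multi-fault 2".)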

==
/*
 * multi-fault.c :: causes 60 secs of parallel page faults from multiple threads.
 * % gcc -O2 -o multi-fault multi-fault.c -lpthread
 * % ./multi-fault <# of threads>
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>	/* atoi() */
#include <unistd.h>	/* sleep() */
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>

#define CORE_PER_SOCK	4
#define NR_THREADS	8
pthread_t threads[NR_THREADS];
/*
 * For avoiding contention in page table lock, FAULT area is
 * sparse. If FAULT_LENGTH is too large for your cpus, decrease it.
 */
#define MMAP_LENGTH	(8 * 1024 * 1024)
#define FAULT_LENGTH	(2 * 1024 * 1024)
void *mmap_area[NR_THREADS];
#define PAGE_SIZE	4096

pthread_barrier_t barrier;
int name[NR_THREADS];

void segv_handler(int sig)
{
	sleep(100);
}
void *worker(void *data)
{
	cpu_set_t set;
	int cpu;

	cpu = *(int *)data;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);

	cpu /= CORE_PER_SOCK;	/* map the pinned cpu back to this thread's area index */

	while (1) {
		char *c;
		char *start = mmap_area[cpu];
		char *end = mmap_area[cpu] + FAULT_LENGTH;
		pthread_barrier_wait(&barrier);
		//printf("fault into %p-%p\n",start, end);

		for (c = start; c < end; c += PAGE_SIZE)
			*c = 0;
		pthread_barrier_wait(&barrier);

		madvise(start, FAULT_LENGTH, MADV_DONTNEED);
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	int i, ret;
	unsigned int num;

	if (argc < 2)
		return 0;

	num = atoi(argv[1]);	
	pthread_barrier_init(&barrier, NULL, num);

	mmap_area[0] = mmap(NULL, MMAP_LENGTH * num, PROT_WRITE|PROT_READ,
				MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
	for (i = 1; i < num; i++) {
		mmap_area[i] = mmap_area[i - 1]+ MMAP_LENGTH;
	}

	for (i = 0; i < num; ++i) {
		name[i] = i * CORE_PER_SOCK;
		ret = pthread_create(&threads[i], NULL, worker, &name[i]);
		if (ret < 0) {
			perror("pthread create");
			return 0;
		}
	}
	sleep(60);
	return 0;
}



* Re: [RFC][mmotm][PATCH] percpu mm struct counter cache
  2009-12-03  1:28 [RFC][mmotm][PATCH] percpu mm struct counter cache KAMEZAWA Hiroyuki
  2009-12-03  1:32 ` KAMEZAWA Hiroyuki
@ 2009-12-03 15:11 ` Minchan Kim
  2009-12-04  0:18   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 6+ messages in thread
From: Minchan Kim @ 2009-12-03 15:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, cl, akpm, yanmin_zhang

Hi, Kame.

KAMEZAWA Hiroyuki wrote:
> Christophs's mm_counter+percpu implemtation has scalability at updates but
> read-side had some problems. Inspired by that, I tried to write percpu-cache
> counter + synchronization method. My own tiny benchmark shows something good
> but this patch's hooks may summon other troubles...
> 
> Now, I start from sharing codes here. Any comments are welcome.
> (Especially, moving hooks to somewhere better is my concern.)
> My test proram will be posted in reply to this mail.
> 
> Regards,
> -Kame
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> This patch is for implemanting light-weight per-mm statistics.
> Now, when split-pagetable-lock is used, statistics per mm struct
> is maintainer by atomic_long_t value. This costs one atomic_inc()
> under page_table_lock and if multi-thread program runs and shares
> mm_struct, this tend to cause cache-miss+atomic_ops.

Do both cases pay the (page_table_lock + atomic inc) cost?

AFAIK,
if we don't use the split lock, we pay just the page_table_lock spinlock.
If we use the split lock, we pay just the atomic op cost + the page->ptl lock.
In the split-lock case, ptl lock contention for rss accounting is small, I think.

If I am wrong, could you write the changelog more clearly?


> 
> This patch adds per-cpu mm statistics cache and sync it in periodically.
> Cached Information are synchronized into mm_struct at
>   - tick
>   - context_switch.
>   if there is difference.

Should we sync the mm statistics periodically?
Couldn't we sync the statistics only when we need them?
e.g. in get_mm_counter.
I am not sure it's possible. :)
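
To make the trade-off concrete: a sync-on-read get_mm_counter would have to
walk every cpu's cache, roughly like the hypothetical sketch below (it only
reuses the pcp_mm_cache/curr_mmc names from the patch; it is not in the patch):

	/* hypothetical sync-on-read variant, not part of the posted patch */
	unsigned long get_mm_counter_sync(struct mm_struct *mm, int member)
	{
		long ret = atomic_long_read(&mm->counters[member]);
		int cpu;

		for_each_possible_cpu(cpu) {
			struct pcp_mm_cache *cache = &per_cpu(curr_mmc, cpu);

			if (cache->mm == mm)
				ret += cache->counters[member];
		}
		return ret < 0 ? 0 : ret;
	}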

> 
> Tiny test progam on x86-64/4core/2socket machine shows (small) improvements.
> This test program measures # of page faults on cpu  0 and 4.
> (Using all 8cpus, most of time is used for spinlock and you can't see
>  benefits of this patch..)
> 
> [Before Patch]
> Performance counter stats for './multi-fault 2' (5 runs):
> 
>        44282223  page-faults                ( +-   0.912% )
>      1015540330  cache-references           ( +-   1.701% )
>       210497140  cache-misses               ( +-   0.731% )
>  29262804803383988  bus-cycles                 ( +-   0.003% )
> 
>    60.003401467  seconds time elapsed   ( +-   0.004% )
> 
>  4.75 miss/faults
>  660825108.1564714580837551899777 bus-cycles/faults
> 
> [After Patch]
> Performance counter stats for './multi-fault 2' (5 runs):
> 
>        45543398  page-faults                ( +-   0.499% )
>      1031865896  cache-references           ( +-   2.720% )
>       184901499  cache-misses               ( +-   0.626% )
>  29261889737265056  bus-cycles                 ( +-   0.002% )
> 
>    60.001218501  seconds time elapsed   ( +-   0.000% )
> 
>  4.05 miss/faults
>  642505632.5 bus-cycles/faults
> 
> Note: to enable split-pagetable-lock, you have to disable SPINLOCK_DEBUG.
> 
> This patch moves mm_counter definitions to mm.h+memory.c from sched.h.
> So, total patch size seems to be big.

What's your goal/benefit?
Do you cut down atomic operations with the (cache and sync) method?

Please write down your goal/benefit. :)

> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  fs/proc/task_mmu.c       |    4 -
>  include/linux/mm.h       |   55 ++++++++++++++++++
>  include/linux/mm_types.h |   23 +++++--
>  include/linux/sched.h    |   54 ------------------
>  kernel/exit.c            |    4 +
>  kernel/fork.c            |    3 -
>  kernel/sched.c           |    1 
>  kernel/timer.c           |    2 
>  mm/filemap_xip.c         |    2 
>  mm/fremap.c              |    2 
>  mm/memory.c              |  139 ++++++++++++++++++++++++++++++++++++++++++-----
>  mm/oom_kill.c            |    4 -
>  mm/rmap.c                |   10 +--
>  mm/swapfile.c            |    2 
>  14 files changed, 217 insertions(+), 88 deletions(-)
> 
> Index: mmotm-2.6.32-Nov24/include/linux/mm_types.h
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/include/linux/mm_types.h
> +++ mmotm-2.6.32-Nov24/include/linux/mm_types.h
> @@ -24,11 +24,6 @@ struct address_space;
>  
>  #define USE_SPLIT_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
>  
> -#if USE_SPLIT_PTLOCKS
> -typedef atomic_long_t mm_counter_t;
> -#else  /* !USE_SPLIT_PTLOCKS */
> -typedef unsigned long mm_counter_t;
> -#endif /* !USE_SPLIT_PTLOCKS */
>  
>  /*
>   * Each physical page in the system has a struct page associated with
> @@ -199,6 +194,21 @@ struct core_state {
>  	struct completion startup;
>  };
>  
> +/*
> + * for per-mm accounting.
> + */
> +#if USE_SPLIT_PTLOCKS
> +typedef atomic_long_t mm_counter_t;
> +#else  /* !USE_SPLIT_PTLOCKS */
> +typedef unsigned long mm_counter_t;
> +#endif /* !USE_SPLIT_PTLOCKS */
> +
> +enum {
> +	MM_FILEPAGES,
> +	MM_ANONPAGES,
> +	NR_MM_STATS,
> +};
> +
>  struct mm_struct {
>  	struct vm_area_struct * mmap;		/* list of VMAs */
>  	struct rb_root mm_rb;
> @@ -226,8 +236,7 @@ struct mm_struct {
>  	/* Special counters, in some configurations protected by the
>  	 * page_table_lock, in other configurations by being atomic.
>  	 */
> -	mm_counter_t _file_rss;
> -	mm_counter_t _anon_rss;
> +	mm_counter_t counters[NR_MM_STATS];
>  
>  	unsigned long hiwater_rss;	/* High-watermark of RSS usage */
>  	unsigned long hiwater_vm;	/* High-water virtual memory usage */
> Index: mmotm-2.6.32-Nov24/include/linux/sched.h
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/include/linux/sched.h
> +++ mmotm-2.6.32-Nov24/include/linux/sched.h
> @@ -385,60 +385,6 @@ arch_get_unmapped_area_topdown(struct fi
>  extern void arch_unmap_area(struct mm_struct *, unsigned long);
>  extern void arch_unmap_area_topdown(struct mm_struct *, unsigned long);
>  
> -#if USE_SPLIT_PTLOCKS
> -/*
> - * The mm counters are not protected by its page_table_lock,
> - * so must be incremented atomically.
> - */
> -#define set_mm_counter(mm, member, value) atomic_long_set(&(mm)->_##member, value)
> -#define get_mm_counter(mm, member) ((unsigned long)atomic_long_read(&(mm)->_##member))
> -#define add_mm_counter(mm, member, value) atomic_long_add(value, &(mm)->_##member)
> -#define inc_mm_counter(mm, member) atomic_long_inc(&(mm)->_##member)
> -#define dec_mm_counter(mm, member) atomic_long_dec(&(mm)->_##member)
> -
> -#else  /* !USE_SPLIT_PTLOCKS */
> -/*
> - * The mm counters are protected by its page_table_lock,
> - * so can be incremented directly.
> - */
> -#define set_mm_counter(mm, member, value) (mm)->_##member = (value)
> -#define get_mm_counter(mm, member) ((mm)->_##member)
> -#define add_mm_counter(mm, member, value) (mm)->_##member += (value)
> -#define inc_mm_counter(mm, member) (mm)->_##member++
> -#define dec_mm_counter(mm, member) (mm)->_##member--
> -
> -#endif /* !USE_SPLIT_PTLOCKS */
> -
> -#define get_mm_rss(mm)					\
> -	(get_mm_counter(mm, file_rss) + get_mm_counter(mm, anon_rss))
> -#define update_hiwater_rss(mm)	do {			\
> -	unsigned long _rss = get_mm_rss(mm);		\
> -	if ((mm)->hiwater_rss < _rss)			\
> -		(mm)->hiwater_rss = _rss;		\
> -} while (0)
> -#define update_hiwater_vm(mm)	do {			\
> -	if ((mm)->hiwater_vm < (mm)->total_vm)		\
> -		(mm)->hiwater_vm = (mm)->total_vm;	\
> -} while (0)
> -
> -static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
> -{
> -	return max(mm->hiwater_rss, get_mm_rss(mm));
> -}
> -
> -static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
> -					 struct mm_struct *mm)
> -{
> -	unsigned long hiwater_rss = get_mm_hiwater_rss(mm);
> -
> -	if (*maxrss < hiwater_rss)
> -		*maxrss = hiwater_rss;
> -}
> -
> -static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
> -{
> -	return max(mm->hiwater_vm, mm->total_vm);
> -}
>  
>  extern void set_dumpable(struct mm_struct *mm, int value);
>  extern int get_dumpable(struct mm_struct *mm);
> Index: mmotm-2.6.32-Nov24/mm/memory.c
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/mm/memory.c
> +++ mmotm-2.6.32-Nov24/mm/memory.c
> @@ -121,6 +121,115 @@ static int __init init_zero_pfn(void)
>  }
>  core_initcall(init_zero_pfn);
>  
> +
> +#ifdef USE_SPLIT_PTLOCKS
> +
> +struct pcp_mm_cache {
> +	struct mm_struct *mm;
> +	long counters[NR_MM_STATS];
> +};
> +static DEFINE_PER_CPU(struct pcp_mm_cache, curr_mmc);
> +/*
> + * The mm counters are not protected by its page_table_lock,
> + * so must be incremented atomically.
> + */
> +void set_mm_counter(struct mm_struct *mm, int member, long value)
> +{
> +	atomic_long_set(&mm->counters[member], value);
> +}
> +
> +unsigned long get_mm_counter(struct mm_struct *mm, int member)
> +{
> +	long ret = atomic_long_read(&mm->counters[member]);

In which case do we get a negative 'ret'?

> +	if (ret < 0)
> +		return 0;
> +	return ret;
> +}
> +
> +void add_mm_counter(struct mm_struct *mm, int member, long value)
> +{
> +	atomic_long_add(value, &mm->counters[member]);
> +}
> +
> +/*
> + * Always called under pte_lock....irq off, mm != curr_mmc.mm if called
> + * by get_user_pages() etc.
> + */
> +static void
> +add_mm_counter_fast(struct mm_struct *mm, int member, long val)
> +{
> +	if (likely(percpu_read(curr_mmc.mm) == mm))
> +		percpu_add(curr_mmc.counters[member], val);
> +	else
> +		add_mm_counter(mm, member, val);
> +}
> +
> +/* Called by not-preemptable context */
        	non-preemptible
> +void sync_tsk_mm_counters(void)
> +{
> +	struct pcp_mm_cache *cache = &per_cpu(curr_mmc, smp_processor_id());
> +	int i;
> +
> +	if (!cache->mm)
> +		return;
> +
> +	for (i = 0; i < NR_MM_STATS; i++) {
> +		if (!cache->counters[i])
> +			continue;
> +		add_mm_counter(cache->mm, i, cache->counters[i]);
> +		cache->counters[i] = 0;
> +	}
> +}
> +
> +void prepare_mm_switch(struct task_struct *prev, struct task_struct *next)
> +{
> +	if (prev->mm == next->mm)
> +		return;
> +	/* If task is exited, sync is already done and prev->mm is NULL */
> +	if (prev->mm)
> +		sync_tsk_mm_counters();
> +	percpu_write(curr_mmc.mm, next->mm);
> +}

A further optimization:
in the (A -> kernel thread -> A) case we wouldn't need the sync at all,
but only if we update the statistics when we need them, as I suggested.

> +
> +#else  /* !USE_SPLIT_PTLOCKS */
> +/*
> + * The mm counters are protected by its page_table_lock,
> + * so can be incremented directly.
> + */
> +void set_mm_counter(struct mm_struct *mm, int member, long value)
> +{
> +	mm->counters[member] = value;
> +}
> +
> +unsigned long get_mm_counter(struct mm_struct *mm, int member)
> +{
> +	return mm->counters[member];
> +}
> +
> +void add_mm_counter(struct mm_struct *mm, int member, long val)
> +{
> +	mm->counters[member] += val;
> +}
> +
> +void sync_tsk_mm_counters(struct task_struct *tsk)
> +{
> +}
> +
> +#define add_mm_counter_fast(mm, member, val) add_mm_counter(mm, member, val)
> +
> +#endif /* !USE_SPLIT_PTLOCKS */
> +/* Special asynchronous routine for page fault path */
> +#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, 1)
> +#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, -1)
> +
> +void init_mm_counters(struct mm_struct *mm)
> +{
> +	int i;
> +
> +	for (i = 0; i < NR_MM_STATS; i++)
> +		set_mm_counter(mm, i, 0);
> +}
> +
>  /*
>   * If a p?d_bad entry is found while walking page tables, report
>   * the error, before resetting entry to p?d_none.  Usually (but
> @@ -378,10 +487,11 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
>  
>  static inline void add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss)
>  {
> +	/* use synchronous updates here */
>  	if (file_rss)
> -		add_mm_counter(mm, file_rss, file_rss);
> +		add_mm_counter(mm, MM_FILEPAGES, file_rss);

We could split the MM_[FILE|ANON]PAGES conversion out into a separate patch.
Things like rss[0] and rss[1] were not good.


>  	if (anon_rss)
> -		add_mm_counter(mm, anon_rss, anon_rss);
> +		add_mm_counter(mm, MM_ANONPAGES, anon_rss);
>  }
>  
>  /*
> @@ -632,7 +742,10 @@ copy_one_pte(struct mm_struct *dst_mm, s
>  	if (page) {
>  		get_page(page);
>  		page_dup_rmap(page);
> -		rss[PageAnon(page)]++;
> +		if (PageAnon(page))
> +			rss[MM_ANONPAGES]++;
> +		else
> +			rss[MM_FILEPAGES]++;
>  	}
>  
>  out_set_pte:
> @@ -648,7 +761,7 @@ static int copy_pte_range(struct mm_stru
>  	pte_t *src_pte, *dst_pte;
>  	spinlock_t *src_ptl, *dst_ptl;
>  	int progress = 0;
> -	int rss[2];
> +	int rss[NR_MM_STATS];
>  	swp_entry_t entry = (swp_entry_t){0};
>  
>  again:
> @@ -688,7 +801,7 @@ again:
>  	arch_leave_lazy_mmu_mode();
>  	spin_unlock(src_ptl);
>  	pte_unmap_nested(orig_src_pte);
> -	add_mm_rss(dst_mm, rss[0], rss[1]);
> +	add_mm_rss(dst_mm, rss[MM_FILEPAGES], rss[MM_ANONPAGES]);
>  	pte_unmap_unlock(orig_dst_pte, dst_ptl);
>  	cond_resched();
>  
> @@ -1527,7 +1640,7 @@ static int insert_page(struct vm_area_st
>  
>  	/* Ok, finally just insert the thing.. */
>  	get_page(page);
> -	inc_mm_counter(mm, file_rss);
> +	inc_mm_counter_fast(mm, MM_FILEPAGES);
>  	page_add_file_rmap(page);
>  	set_pte_at(mm, addr, pte, mk_pte(page, prot));
>  
> @@ -2163,11 +2276,11 @@ gotten:
>  	if (likely(pte_same(*page_table, orig_pte))) {
>  		if (old_page) {
>  			if (!PageAnon(old_page)) {
> -				dec_mm_counter(mm, file_rss);
> -				inc_mm_counter(mm, anon_rss);
> +				dec_mm_counter_fast(mm, MM_FILEPAGES);
> +				inc_mm_counter_fast(mm, MM_ANONPAGES);
>  			}
>  		} else
> -			inc_mm_counter(mm, anon_rss);
> +			inc_mm_counter_fast(mm, MM_ANONPAGES);
>  		flush_cache_page(vma, address, pte_pfn(orig_pte));
>  		entry = mk_pte(new_page, vma->vm_page_prot);
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> @@ -2600,7 +2713,7 @@ static int do_swap_page(struct mm_struct
>  	 * discarded at swap_free().
>  	 */
>  
> -	inc_mm_counter(mm, anon_rss);
> +	inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	pte = mk_pte(page, vma->vm_page_prot);
>  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> @@ -2684,7 +2797,7 @@ static int do_anonymous_page(struct mm_s
>  	if (!pte_none(*page_table))
>  		goto release;
>  
> -	inc_mm_counter(mm, anon_rss);
> +	inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	page_add_new_anon_rmap(page, vma, address);
>  setpte:
>  	set_pte_at(mm, address, page_table, entry);
> @@ -2838,10 +2951,10 @@ static int __do_fault(struct mm_struct *
>  		if (flags & FAULT_FLAG_WRITE)
>  			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  		if (anon) {
> -			inc_mm_counter(mm, anon_rss);
> +			inc_mm_counter_fast(mm, MM_ANONPAGES);
>  			page_add_new_anon_rmap(page, vma, address);
>  		} else {
> -			inc_mm_counter(mm, file_rss);
> +			inc_mm_counter_fast(mm, MM_FILEPAGES);
>  			page_add_file_rmap(page);
>  			if (flags & FAULT_FLAG_WRITE) {
>  				dirty_page = page;
> Index: mmotm-2.6.32-Nov24/kernel/fork.c
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/kernel/fork.c
> +++ mmotm-2.6.32-Nov24/kernel/fork.c
> @@ -452,8 +452,7 @@ static struct mm_struct * mm_init(struct
>  		(current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
>  	mm->core_state = NULL;
>  	mm->nr_ptes = 0;
> -	set_mm_counter(mm, file_rss, 0);
> -	set_mm_counter(mm, anon_rss, 0);
> +	init_mm_counters(mm);
>  	spin_lock_init(&mm->page_table_lock);
>  	mm->free_area_cache = TASK_UNMAPPED_BASE;
>  	mm->cached_hole_size = ~0UL;
> Index: mmotm-2.6.32-Nov24/mm/fremap.c
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/mm/fremap.c
> +++ mmotm-2.6.32-Nov24/mm/fremap.c
> @@ -40,7 +40,7 @@ static void zap_pte(struct mm_struct *mm
>  			page_remove_rmap(page);
>  			page_cache_release(page);
>  			update_hiwater_rss(mm);
> -			dec_mm_counter(mm, file_rss);
> +			dec_mm_counter(mm, MM_FILEPAGES);
>  		}
>  	} else {
>  		if (!pte_file(pte))
> Index: mmotm-2.6.32-Nov24/mm/rmap.c
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/mm/rmap.c
> +++ mmotm-2.6.32-Nov24/mm/rmap.c
> @@ -815,9 +815,9 @@ int try_to_unmap_one(struct page *page, 
>  
>  	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
>  		if (PageAnon(page))
> -			dec_mm_counter(mm, anon_rss);
> +			dec_mm_counter(mm, MM_ANONPAGES);
>  		else
> -			dec_mm_counter(mm, file_rss);
> +			dec_mm_counter(mm, MM_FILEPAGES);
>  		set_pte_at(mm, address, pte,
>  				swp_entry_to_pte(make_hwpoison_entry(page)));
>  	} else if (PageAnon(page)) {
> @@ -839,7 +839,7 @@ int try_to_unmap_one(struct page *page, 
>  					list_add(&mm->mmlist, &init_mm.mmlist);
>  				spin_unlock(&mmlist_lock);
>  			}
> -			dec_mm_counter(mm, anon_rss);
> +			dec_mm_counter(mm, MM_ANONPAGES);
>  		} else if (PAGE_MIGRATION) {
>  			/*
>  			 * Store the pfn of the page in a special migration
> @@ -857,7 +857,7 @@ int try_to_unmap_one(struct page *page, 
>  		entry = make_migration_entry(page, pte_write(pteval));
>  		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
>  	} else
> -		dec_mm_counter(mm, file_rss);
> +		dec_mm_counter(mm, MM_FILEPAGES);
>  
>  	page_remove_rmap(page);
>  	page_cache_release(page);
> @@ -995,7 +995,7 @@ static int try_to_unmap_cluster(unsigned
>  
>  		page_remove_rmap(page);
>  		page_cache_release(page);
> -		dec_mm_counter(mm, file_rss);
> +		dec_mm_counter(mm, MM_FILEPAGES);
>  		(*mapcount)--;
>  	}
>  	pte_unmap_unlock(pte - 1, ptl);
> Index: mmotm-2.6.32-Nov24/mm/swapfile.c
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/mm/swapfile.c
> +++ mmotm-2.6.32-Nov24/mm/swapfile.c
> @@ -839,7 +839,7 @@ static int unuse_pte(struct vm_area_stru
>  		goto out;
>  	}
>  
> -	inc_mm_counter(vma->vm_mm, anon_rss);
> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);

Why can't we use inc_mm_counter_fast here?

>  	get_page(page);
>  	set_pte_at(vma->vm_mm, addr, pte,
>  		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
> Index: mmotm-2.6.32-Nov24/kernel/timer.c
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/kernel/timer.c
> +++ mmotm-2.6.32-Nov24/kernel/timer.c
> @@ -1200,6 +1200,8 @@ void update_process_times(int user_tick)
>  	account_process_tick(p, user_tick);
>  	run_local_timers();
>  	rcu_check_callbacks(cpu, user_tick);
> +	/* sync cached mm stat information */
> +	sync_tsk_mm_counters();
>  	printk_tick();
>  	scheduler_tick();
>  	run_posix_cpu_timers(p);
> Index: mmotm-2.6.32-Nov24/mm/filemap_xip.c
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/mm/filemap_xip.c
> +++ mmotm-2.6.32-Nov24/mm/filemap_xip.c
> @@ -194,7 +194,7 @@ retry:
>  			flush_cache_page(vma, address, pte_pfn(*pte));
>  			pteval = ptep_clear_flush_notify(vma, address, pte);
>  			page_remove_rmap(page);
> -			dec_mm_counter(mm, file_rss);
> +			dec_mm_counter(mm, MM_FILEPAGES);
>  			BUG_ON(pte_dirty(pteval));
>  			pte_unmap_unlock(pte, ptl);
>  			page_cache_release(page);
> Index: mmotm-2.6.32-Nov24/include/linux/mm.h
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/include/linux/mm.h
> +++ mmotm-2.6.32-Nov24/include/linux/mm.h
> @@ -863,6 +863,61 @@ extern int mprotect_fixup(struct vm_area
>  			  struct vm_area_struct **pprev, unsigned long start,
>  			  unsigned long end, unsigned long newflags);
>  
> +/* For per-mm stat accounting */
> +extern void set_mm_counter(struct mm_struct *mm, int member, long value);
> +extern unsigned long get_mm_counter(struct mm_struct *mm, int member);
> +extern void add_mm_counter(struct mm_struct *mm, int member, long value);
> +extern void sync_tsk_mm_counters(void);
> +extern void init_mm_counters(struct mm_struct *mm);
> +
> +#ifdef USE_SPLIT_PTLOCKS
> +extern void prepare_mm_switch(struct task_struct *prev,
> +				 struct task_struct *next);
> +#else
> +static inline prepare_mm_switch(struct task_struct *prev,
> +				struct task_struct *next)
> +{
> +}
> +#endif
> +
> +#define inc_mm_counter(mm, member) add_mm_counter((mm), (member), 1)
> +#define dec_mm_counter(mm, member) add_mm_counter((mm), (member), -1)
> +
> +#define get_mm_rss(mm)			\
> +	(get_mm_counter(mm, MM_FILEPAGES) +\
> +	 get_mm_counter(mm, MM_ANONPAGES))
> +
> +#define update_hiwater_rss(mm)	do {			\
> +	unsigned long _rss = get_mm_rss(mm);		\
> +	if ((mm)->hiwater_rss < _rss)			\
> +		(mm)->hiwater_rss = _rss;		\
> +} while (0)
> +
> +#define update_hiwater_vm(mm)	do {			\
> +	if ((mm)->hiwater_vm < (mm)->total_vm)		\
> +		(mm)->hiwater_vm = (mm)->total_vm;	\
> +} while (0)
> +
> +static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
> +{
> +	return max(mm->hiwater_rss, get_mm_rss(mm));
> +}
> +
> +static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
> +					 struct mm_struct *mm)
> +{
> +	unsigned long hiwater_rss = get_mm_hiwater_rss(mm);
> +
> +	if (*maxrss < hiwater_rss)
> +		*maxrss = hiwater_rss;
> +}
> +
> +static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
> +{
> +	return max(mm->hiwater_vm, mm->total_vm);
> +}
> +
> +
>  /*
>   * doesn't attempt to fault and will return short.
>   */
> Index: mmotm-2.6.32-Nov24/fs/proc/task_mmu.c
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/fs/proc/task_mmu.c
> +++ mmotm-2.6.32-Nov24/fs/proc/task_mmu.c
> @@ -65,11 +65,11 @@ unsigned long task_vsize(struct mm_struc
>  int task_statm(struct mm_struct *mm, int *shared, int *text,
>  	       int *data, int *resident)
>  {
> -	*shared = get_mm_counter(mm, file_rss);
> +	*shared = get_mm_counter(mm, MM_FILEPAGES);
>  	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>  								>> PAGE_SHIFT;
>  	*data = mm->total_vm - mm->shared_vm;
> -	*resident = *shared + get_mm_counter(mm, anon_rss);
> +	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
>  	return mm->total_vm;
>  }
>  
> Index: mmotm-2.6.32-Nov24/mm/oom_kill.c
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/mm/oom_kill.c
> +++ mmotm-2.6.32-Nov24/mm/oom_kill.c
> @@ -400,8 +400,8 @@ static void __oom_kill_task(struct task_
>  		       "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
>  		       task_pid_nr(p), p->comm,
>  		       K(p->mm->total_vm),
> -		       K(get_mm_counter(p->mm, anon_rss)),
> -		       K(get_mm_counter(p->mm, file_rss)));
> +		       K(get_mm_counter(p->mm, MM_ANONPAGES)),
> +		       K(get_mm_counter(p->mm, MM_FILEPAGES)));
>  	task_unlock(p);
>  
>  	/*
> Index: mmotm-2.6.32-Nov24/kernel/exit.c
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/kernel/exit.c
> +++ mmotm-2.6.32-Nov24/kernel/exit.c
> @@ -681,6 +681,10 @@ static void exit_mm(struct task_struct *
>  	}
>  	atomic_inc(&mm->mm_count);
>  	BUG_ON(mm != tsk->active_mm);
> +	/* drop cached information */
> +	preempt_disable();
> +	sync_tsk_mm_counters();
> +	preempt_enable();

How about using (get|put)_cpu in sync_tsk_mm_counters?
They disable and enable preemption for you.
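
A minimal sketch of that variant, keeping the loop from the patch as-is:

	/* sketch of the suggested get_cpu()/put_cpu() variant, not in the patch */
	void sync_tsk_mm_counters(void)
	{
		struct pcp_mm_cache *cache;
		int i, cpu;

		cpu = get_cpu();		/* disables preemption */
		cache = &per_cpu(curr_mmc, cpu);
		if (cache->mm) {
			for (i = 0; i < NR_MM_STATS; i++) {
				if (!cache->counters[i])
					continue;
				add_mm_counter(cache->mm, i, cache->counters[i]);
				cache->counters[i] = 0;
			}
		}
		put_cpu();			/* enables preemption again */
	}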

>  	/* more a memory barrier than a real lock */
>  	task_lock(tsk);
>  	tsk->mm = NULL;
> Index: mmotm-2.6.32-Nov24/kernel/sched.c
> ===================================================================
> --- mmotm-2.6.32-Nov24.orig/kernel/sched.c
> +++ mmotm-2.6.32-Nov24/kernel/sched.c
> @@ -2858,6 +2858,7 @@ context_switch(struct rq *rq, struct tas
>  	trace_sched_switch(rq, prev, next);
>  	mm = next->mm;
>  	oldmm = prev->active_mm;
> +	prepare_mm_switch(prev, next);
>  	/*
>  	 * For paravirt, this is coupled with an exit in switch_to to
>  	 * combine the page table reload and the switch backend into
> 

I think the code is not bad, but I don't know how effective this patch is in practice.
Thanks for the good effort, Kame. :)



* Re: [RFC][mmotm][PATCH] percpu mm struct counter cache
  2009-12-03 15:11 ` Minchan Kim
@ 2009-12-04  0:18   ` KAMEZAWA Hiroyuki
  2009-12-04  0:49     ` Minchan Kim
  0 siblings, 1 reply; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-04  0:18 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, cl, akpm, yanmin_zhang

On Fri, 04 Dec 2009 00:11:02 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> Hi, Kame.
> 
> KAMEZAWA Hiroyuki wrote:
> > Christophs's mm_counter+percpu implemtation has scalability at updates but
> > read-side had some problems. Inspired by that, I tried to write percpu-cache
> > counter + synchronization method. My own tiny benchmark shows something good
> > but this patch's hooks may summon other troubles...
> > 
> > Now, I start from sharing codes here. Any comments are welcome.
> > (Especially, moving hooks to somewhere better is my concern.)
> > My test proram will be posted in reply to this mail.
> > 
> > Regards,
> > -Kame
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > This patch is for implemanting light-weight per-mm statistics.
> > Now, when split-pagetable-lock is used, statistics per mm struct
> > is maintainer by atomic_long_t value. This costs one atomic_inc()
> > under page_table_lock and if multi-thread program runs and shares
> > mm_struct, this tend to cause cache-miss+atomic_ops.
> 
> Both cases are (page_table_lock + atomic inc) cost?
> 
> AFAIK, 
> If we don't use split lock, we get the just spinlock of page_table_lock. 
yes.

> If we use split lock, we get the just atomic_op cost + page->ptl lock.
yes. now.

> In case of split lock, ptl lock contention for rss accounting is little, I think.
> 
> If I am wrong, could you write down changelog more clearly?
> 
AFAIK, you're right. 


> 
> > 
> > This patch adds per-cpu mm statistics cache and sync it in periodically.
> > Cached Information are synchronized into mm_struct at
> >   - tick
> >   - context_switch.
> >   if there is difference.
> 
> Should we sync mm statistics periodically?
> Couldn't we sync statistics when we need it?
> ex) get_mm_counter.
> I am not sure it's possible. :)

For this counter, the read-side cost is important.
Here is my reply to Christoph's per-cpu-mm-counter, which gathers the
information at get_mm_counter time:
http://marc.info/?l=linux-mm&m=125747002917101&w=2

Making the read side of this counter slower means making ps or top slower.
IMO, ps and top are already too slow, and making them slower is very bad.


> 
> > 
> > Tiny test progam on x86-64/4core/2socket machine shows (small) improvements.
> > This test program measures # of page faults on cpu  0 and 4.
> > (Using all 8cpus, most of time is used for spinlock and you can't see
> >  benefits of this patch..)
> > 
> > [Before Patch]
> > Performance counter stats for './multi-fault 2' (5 runs):
> > 
> >        44282223  page-faults                ( +-   0.912% )
> >      1015540330  cache-references           ( +-   1.701% )
> >       210497140  cache-misses               ( +-   0.731% )
> >  29262804803383988  bus-cycles                 ( +-   0.003% )
> > 
> >    60.003401467  seconds time elapsed   ( +-   0.004% )
> > 
> >  4.75 miss/faults
> >  660825108.1564714580837551899777 bus-cycles/faults
> > 
> > [After Patch]
> > Performance counter stats for './multi-fault 2' (5 runs):
> > 
> >        45543398  page-faults                ( +-   0.499% )
> >      1031865896  cache-references           ( +-   2.720% )
> >       184901499  cache-misses               ( +-   0.626% )
> >  29261889737265056  bus-cycles                 ( +-   0.002% )
> > 
> >    60.001218501  seconds time elapsed   ( +-   0.000% )
> > 
> >  4.05 miss/faults
> >  642505632.5 bus-cycles/faults
> > 
> > Note: to enable split-pagetable-lock, you have to disable SPINLOCK_DEBUG.
> > 
> > This patch moves mm_counter definitions to mm.h+memory.c from sched.h.
> > So, total patch size seems to be big.
> 
> What's your goal/benefit?
> You cut down atomic operations with (cache and sync) method?
> 
> Please, write down the your goal/benefit. :)
> 
Sorry.

My goal is to add more counters, like swap_usage or lowmem_rss_usage,
etc. Adding them means adding more cache misses.
Once we have a cache-hit, no-atomic-ops counter, adding statistics will be
much easier.

And considering relaxing mmap_sem as in my speculative-page-fault patch,
this mm_counter will be another heavy cache-miss point.


> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

> > +/*
> > + * The mm counters are not protected by its page_table_lock,
> > + * so must be incremented atomically.
> > + */
> > +void set_mm_counter(struct mm_struct *mm, int member, long value)
> > +{
> > +	atomic_long_set(&mm->counters[member], value);
> > +}
> > +
> > +unsigned long get_mm_counter(struct mm_struct *mm, int member)
> > +{
> > +	long ret = atomic_long_read(&mm->counters[member]);
> 
> Which case do we get the minus 'ret'?
> 
When a process is heavily swapped out and no "sync" has happened yet,
we can get a negative value. A file-map, fault, munmap sequence in a short
time can also make it negative. (The decrements hit mm->counters directly
while the matching increments may still be sitting in a per-cpu cache.)

In this patch, dec_mm_counter() is not used very much,
but I'll add more users when adding the swap_usage counter.




> > +	if (ret < 0)
> > +		return 0;
> > +	return ret;
> > +}
> > +
> > +void add_mm_counter(struct mm_struct *mm, int member, long value)
> > +{
> > +	atomic_long_add(value, &mm->counters[member]);
> > +}
> > +
> > +/*
> > + * Always called under pte_lock....irq off, mm != curr_mmc.mm if called
> > + * by get_user_pages() etc.
> > + */
> > +static void
> > +add_mm_counter_fast(struct mm_struct *mm, int member, long val)
> > +{
> > +	if (likely(percpu_read(curr_mmc.mm) == mm))
> > +		percpu_add(curr_mmc.counters[member], val);
> > +	else
> > +		add_mm_counter(mm, member, val);
> > +}
> > +
> > +/* Called by not-preemptable context */
>         	non-preemptible
> > +void sync_tsk_mm_counters(void)
> > +{
> > +	struct pcp_mm_cache *cache = &per_cpu(curr_mmc, smp_processor_id());
> > +	int i;
> > +
> > +	if (!cache->mm)
> > +		return;
> > +
> > +	for (i = 0; i < NR_MM_STATS; i++) {
> > +		if (!cache->counters[i])
> > +			continue;
> > +		add_mm_counter(cache->mm, i, cache->counters[i]);
> > +		cache->counters[i] = 0;
> > +	}
> > +}
> > +
> > +void prepare_mm_switch(struct task_struct *prev, struct task_struct *next)
> > +{
> > +	if (prev->mm == next->mm)
> > +		return;
> > +	/* If task is exited, sync is already done and prev->mm is NULL */
> > +	if (prev->mm)
> > +		sync_tsk_mm_counters();
> > +	percpu_write(curr_mmc.mm, next->mm);
> > +}
> 
> Further optimization.
> In case of (A-> kernel thread -> A), we don't need sync only if
> we update statistics when we need it as i suggested.
> 
Hmm. I'll check whether the following can work or not.
==
       if (next->mm == &init_mm)
		return;
       if (prev->mm == &init_mm) {
		if (percpu_read(curr_mmc.mm) == next->mm)
			return;
	}
==

> > +
> > +#else  /* !USE_SPLIT_PTLOCKS */
> > +/*
> > + * The mm counters are protected by its page_table_lock,
> > + * so can be incremented directly.
> > + */
> > +void set_mm_counter(struct mm_struct *mm, int member, long value)
> > +{
> > +	mm->counters[member] = value;
> > +}
> > +
> > +unsigned long get_mm_counter(struct mm_struct *mm, int member)
> > +{
> > +	return mm->counters[member];
> > +}
> > +
> > +void add_mm_counter(struct mm_struct *mm, int member, long val)
> > +{
> > +	mm->counters[member] += val;
> > +}
> > +
> > +void sync_tsk_mm_counters(struct task_struct *tsk)
> > +{
> > +}
> > +
> > +#define add_mm_counter_fast(mm, member, val) add_mm_counter(mm, member, val)
> > +
> > +#endif /* !USE_SPLIT_PTLOCKS */
> > +/* Special asynchronous routine for page fault path */
> > +#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, 1)
> > +#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, -1)
> > +
> > +void init_mm_counters(struct mm_struct *mm)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < NR_MM_STATS; i++)
> > +		set_mm_counter(mm, i, 0);
> > +}
> > +
> >  /*
> >   * If a p?d_bad entry is found while walking page tables, report
> >   * the error, before resetting entry to p?d_none.  Usually (but
> > @@ -378,10 +487,11 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
> >  
> >  static inline void add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss)
> >  {
> > +	/* use synchronous updates here */
> >  	if (file_rss)
> > -		add_mm_counter(mm, file_rss, file_rss);
> > +		add_mm_counter(mm, MM_FILEPAGES, file_rss);
> 
> We can split the MM_[FILEP|ANON]AGES conversion out into separate patches.
> Things like rss[0] and rss[1] were not good. 
> 
Ah, ok. clean-up first.

> 
> >  	if (anon_rss)
> > -		add_mm_counter(mm, anon_rss, anon_rss);
> > +		add_mm_counter(mm, MM_ANONPAGES, anon_rss);
> >  }
> >  
> >  /*
> > @@ -632,7 +742,10 @@ copy_one_pte(struct mm_struct *dst_mm, s
> >  	if (page) {
> >  		get_page(page);
> >  		page_dup_rmap(page);
> > -		rss[PageAnon(page)]++;
> > +		if (PageAnon(page))
> > +			rss[MM_ANONPAGES]++;
> > +		else
> > +			rss[MM_FILEPAGES]++;
> >  	}
> >  
> >  out_set_pte:
> > @@ -648,7 +761,7 @@ static int copy_pte_range(struct mm_stru
> >  	pte_t *src_pte, *dst_pte;
> >  	spinlock_t *src_ptl, *dst_ptl;
> >  	int progress = 0;
> > -	int rss[2];
> > +	int rss[NR_MM_STATS];
> >  	swp_entry_t entry = (swp_entry_t){0};
> >  
> >  again:
> > @@ -688,7 +801,7 @@ again:
> >  	arch_leave_lazy_mmu_mode();
> >  	spin_unlock(src_ptl);
> >  	pte_unmap_nested(orig_src_pte);
> > -	add_mm_rss(dst_mm, rss[0], rss[1]);
> > +	add_mm_rss(dst_mm, rss[MM_FILEPAGES], rss[MM_ANONPAGES]);
> >  	pte_unmap_unlock(orig_dst_pte, dst_ptl);
> >  	cond_resched();
> >  
> > @@ -1527,7 +1640,7 @@ static int insert_page(struct vm_area_st
> >  
> >  	/* Ok, finally just insert the thing.. */
> >  	get_page(page);
> > -	inc_mm_counter(mm, file_rss);
> > +	inc_mm_counter_fast(mm, MM_FILEPAGES);
> >  	page_add_file_rmap(page);
> >  	set_pte_at(mm, addr, pte, mk_pte(page, prot));
> >  
> > @@ -2163,11 +2276,11 @@ gotten:
> >  	if (likely(pte_same(*page_table, orig_pte))) {
> >  		if (old_page) {
> >  			if (!PageAnon(old_page)) {
> > -				dec_mm_counter(mm, file_rss);
> > -				inc_mm_counter(mm, anon_rss);
> > +				dec_mm_counter_fast(mm, MM_FILEPAGES);
> > +				inc_mm_counter_fast(mm, MM_ANONPAGES);
> >  			}
> >  		} else
> > -			inc_mm_counter(mm, anon_rss);
> > +			inc_mm_counter_fast(mm, MM_ANONPAGES);
> >  		flush_cache_page(vma, address, pte_pfn(orig_pte));
> >  		entry = mk_pte(new_page, vma->vm_page_prot);
> >  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > @@ -2600,7 +2713,7 @@ static int do_swap_page(struct mm_struct
> >  	 * discarded at swap_free().
> >  	 */
> >  
> > -	inc_mm_counter(mm, anon_rss);
> > +	inc_mm_counter_fast(mm, MM_ANONPAGES);
> >  	pte = mk_pte(page, vma->vm_page_prot);
> >  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> >  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > @@ -2684,7 +2797,7 @@ static int do_anonymous_page(struct mm_s
> >  	if (!pte_none(*page_table))
> >  		goto release;
> >  
> > -	inc_mm_counter(mm, anon_rss);
> > +	inc_mm_counter_fast(mm, MM_ANONPAGES);
> >  	page_add_new_anon_rmap(page, vma, address);
> >  setpte:
> >  	set_pte_at(mm, address, page_table, entry);
> > @@ -2838,10 +2951,10 @@ static int __do_fault(struct mm_struct *
> >  		if (flags & FAULT_FLAG_WRITE)
> >  			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> >  		if (anon) {
> > -			inc_mm_counter(mm, anon_rss);
> > +			inc_mm_counter_fast(mm, MM_ANONPAGES);
> >  			page_add_new_anon_rmap(page, vma, address);
> >  		} else {
> > -			inc_mm_counter(mm, file_rss);
> > +			inc_mm_counter_fast(mm, MM_FILEPAGES);
> >  			page_add_file_rmap(page);
> >  			if (flags & FAULT_FLAG_WRITE) {
> >  				dirty_page = page;
> > Index: mmotm-2.6.32-Nov24/kernel/fork.c
> > ===================================================================
> > --- mmotm-2.6.32-Nov24.orig/kernel/fork.c
> > +++ mmotm-2.6.32-Nov24/kernel/fork.c
> > @@ -452,8 +452,7 @@ static struct mm_struct * mm_init(struct
> >  		(current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
> >  	mm->core_state = NULL;
> >  	mm->nr_ptes = 0;
> > -	set_mm_counter(mm, file_rss, 0);
> > -	set_mm_counter(mm, anon_rss, 0);
> > +	init_mm_counters(mm);
> >  	spin_lock_init(&mm->page_table_lock);
> >  	mm->free_area_cache = TASK_UNMAPPED_BASE;
> >  	mm->cached_hole_size = ~0UL;
> > Index: mmotm-2.6.32-Nov24/mm/fremap.c
> > ===================================================================
> > --- mmotm-2.6.32-Nov24.orig/mm/fremap.c
> > +++ mmotm-2.6.32-Nov24/mm/fremap.c
> > @@ -40,7 +40,7 @@ static void zap_pte(struct mm_struct *mm
> >  			page_remove_rmap(page);
> >  			page_cache_release(page);
> >  			update_hiwater_rss(mm);
> > -			dec_mm_counter(mm, file_rss);
> > +			dec_mm_counter(mm, MM_FILEPAGES);
> >  		}
> >  	} else {
> >  		if (!pte_file(pte))
> > Index: mmotm-2.6.32-Nov24/mm/rmap.c
> > ===================================================================
> > --- mmotm-2.6.32-Nov24.orig/mm/rmap.c
> > +++ mmotm-2.6.32-Nov24/mm/rmap.c
> > @@ -815,9 +815,9 @@ int try_to_unmap_one(struct page *page, 
> >  
> >  	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> >  		if (PageAnon(page))
> > -			dec_mm_counter(mm, anon_rss);
> > +			dec_mm_counter(mm, MM_ANONPAGES);
> >  		else
> > -			dec_mm_counter(mm, file_rss);
> > +			dec_mm_counter(mm, MM_FILEPAGES);
> >  		set_pte_at(mm, address, pte,
> >  				swp_entry_to_pte(make_hwpoison_entry(page)));
> >  	} else if (PageAnon(page)) {
> > @@ -839,7 +839,7 @@ int try_to_unmap_one(struct page *page, 
> >  					list_add(&mm->mmlist, &init_mm.mmlist);
> >  				spin_unlock(&mmlist_lock);
> >  			}
> > -			dec_mm_counter(mm, anon_rss);
> > +			dec_mm_counter(mm, MM_ANONPAGES);
> >  		} else if (PAGE_MIGRATION) {
> >  			/*
> >  			 * Store the pfn of the page in a special migration
> > @@ -857,7 +857,7 @@ int try_to_unmap_one(struct page *page, 
> >  		entry = make_migration_entry(page, pte_write(pteval));
> >  		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
> >  	} else
> > -		dec_mm_counter(mm, file_rss);
> > +		dec_mm_counter(mm, MM_FILEPAGES);
> >  
> >  	page_remove_rmap(page);
> >  	page_cache_release(page);
> > @@ -995,7 +995,7 @@ static int try_to_unmap_cluster(unsigned
> >  
> >  		page_remove_rmap(page);
> >  		page_cache_release(page);
> > -		dec_mm_counter(mm, file_rss);
> > +		dec_mm_counter(mm, MM_FILEPAGES);
> >  		(*mapcount)--;
> >  	}
> >  	pte_unmap_unlock(pte - 1, ptl);
> > Index: mmotm-2.6.32-Nov24/mm/swapfile.c
> > ===================================================================
> > --- mmotm-2.6.32-Nov24.orig/mm/swapfile.c
> > +++ mmotm-2.6.32-Nov24/mm/swapfile.c
> > @@ -839,7 +839,7 @@ static int unuse_pte(struct vm_area_stru
> >  		goto out;
> >  	}
> >  
> > -	inc_mm_counter(vma->vm_mm, anon_rss);
> > +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> 
> Why can't we use inc_mm_counter_fast in here?
> 
This vma->vm_mm isn't current->mm in many cases, I think.


> >  	get_page(page);
> >  	set_pte_at(vma->vm_mm, addr, pte,
> >  		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
> > Index: mmotm-2.6.32-Nov24/kernel/timer.c
> > ===================================================================
> > --- mmotm-2.6.32-Nov24.orig/kernel/timer.c
> > +++ mmotm-2.6.32-Nov24/kernel/timer.c
> > @@ -1200,6 +1200,8 @@ void update_process_times(int user_tick)
> >  	account_process_tick(p, user_tick);
> >  	run_local_timers();
> >  	rcu_check_callbacks(cpu, user_tick);
> > +	/* sync cached mm stat information */
> > +	sync_tsk_mm_counters();
> >  	printk_tick();
> >  	scheduler_tick();
> >  	run_posix_cpu_timers(p);
> > Index: mmotm-2.6.32-Nov24/mm/filemap_xip.c
> > ===================================================================
> > --- mmotm-2.6.32-Nov24.orig/mm/filemap_xip.c
> > +++ mmotm-2.6.32-Nov24/mm/filemap_xip.c
> > @@ -194,7 +194,7 @@ retry:
> >  			flush_cache_page(vma, address, pte_pfn(*pte));
> >  			pteval = ptep_clear_flush_notify(vma, address, pte);
> >  			page_remove_rmap(page);
> > -			dec_mm_counter(mm, file_rss);
> > +			dec_mm_counter(mm, MM_FILEPAGES);
> >  			BUG_ON(pte_dirty(pteval));
> >  			pte_unmap_unlock(pte, ptl);
> >  			page_cache_release(page);
> > Index: mmotm-2.6.32-Nov24/include/linux/mm.h
> > ===================================================================
> > --- mmotm-2.6.32-Nov24.orig/include/linux/mm.h
> > +++ mmotm-2.6.32-Nov24/include/linux/mm.h
> > @@ -863,6 +863,61 @@ extern int mprotect_fixup(struct vm_area
> >  			  struct vm_area_struct **pprev, unsigned long start,
> >  			  unsigned long end, unsigned long newflags);
> >  
> > +/* For per-mm stat accounting */
> > +extern void set_mm_counter(struct mm_struct *mm, int member, long value);
> > +extern unsigned long get_mm_counter(struct mm_struct *mm, int member);
> > +extern void add_mm_counter(struct mm_struct *mm, int member, long value);
> > +extern void sync_tsk_mm_counters(void);
> > +extern void init_mm_counters(struct mm_struct *mm);
> > +
> > +#ifdef USE_SPLIT_PTLOCKS
> > +extern void prepare_mm_switch(struct task_struct *prev,
> > +				 struct task_struct *next);
> > +#else
> > +static inline void prepare_mm_switch(struct task_struct *prev,
> > +				struct task_struct *next)
> > +{
> > +}
> > +#endif
> > +
> > +#define inc_mm_counter(mm, member) add_mm_counter((mm), (member), 1)
> > +#define dec_mm_counter(mm, member) add_mm_counter((mm), (member), -1)
> > +
> > +#define get_mm_rss(mm)			\
> > +	(get_mm_counter(mm, MM_FILEPAGES) +\
> > +	 get_mm_counter(mm, MM_ANONPAGES))
> > +
> > +#define update_hiwater_rss(mm)	do {			\
> > +	unsigned long _rss = get_mm_rss(mm);		\
> > +	if ((mm)->hiwater_rss < _rss)			\
> > +		(mm)->hiwater_rss = _rss;		\
> > +} while (0)
> > +
> > +#define update_hiwater_vm(mm)	do {			\
> > +	if ((mm)->hiwater_vm < (mm)->total_vm)		\
> > +		(mm)->hiwater_vm = (mm)->total_vm;	\
> > +} while (0)
> > +
> > +static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
> > +{
> > +	return max(mm->hiwater_rss, get_mm_rss(mm));
> > +}
> > +
> > +static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
> > +					 struct mm_struct *mm)
> > +{
> > +	unsigned long hiwater_rss = get_mm_hiwater_rss(mm);
> > +
> > +	if (*maxrss < hiwater_rss)
> > +		*maxrss = hiwater_rss;
> > +}
> > +
> > +static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
> > +{
> > +	return max(mm->hiwater_vm, mm->total_vm);
> > +}
> > +
> > +
> >  /*
> >   * doesn't attempt to fault and will return short.
> >   */
> > Index: mmotm-2.6.32-Nov24/fs/proc/task_mmu.c
> > ===================================================================
> > --- mmotm-2.6.32-Nov24.orig/fs/proc/task_mmu.c
> > +++ mmotm-2.6.32-Nov24/fs/proc/task_mmu.c
> > @@ -65,11 +65,11 @@ unsigned long task_vsize(struct mm_struc
> >  int task_statm(struct mm_struct *mm, int *shared, int *text,
> >  	       int *data, int *resident)
> >  {
> > -	*shared = get_mm_counter(mm, file_rss);
> > +	*shared = get_mm_counter(mm, MM_FILEPAGES);
> >  	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
> >  								>> PAGE_SHIFT;
> >  	*data = mm->total_vm - mm->shared_vm;
> > -	*resident = *shared + get_mm_counter(mm, anon_rss);
> > +	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
> >  	return mm->total_vm;
> >  }
> >  
> > Index: mmotm-2.6.32-Nov24/mm/oom_kill.c
> > ===================================================================
> > --- mmotm-2.6.32-Nov24.orig/mm/oom_kill.c
> > +++ mmotm-2.6.32-Nov24/mm/oom_kill.c
> > @@ -400,8 +400,8 @@ static void __oom_kill_task(struct task_
> >  		       "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
> >  		       task_pid_nr(p), p->comm,
> >  		       K(p->mm->total_vm),
> > -		       K(get_mm_counter(p->mm, anon_rss)),
> > -		       K(get_mm_counter(p->mm, file_rss)));
> > +		       K(get_mm_counter(p->mm, MM_ANONPAGES)),
> > +		       K(get_mm_counter(p->mm, MM_FILEPAGES)));
> >  	task_unlock(p);
> >  
> >  	/*
> > Index: mmotm-2.6.32-Nov24/kernel/exit.c
> > ===================================================================
> > --- mmotm-2.6.32-Nov24.orig/kernel/exit.c
> > +++ mmotm-2.6.32-Nov24/kernel/exit.c
> > @@ -681,6 +681,10 @@ static void exit_mm(struct task_struct *
> >  	}
> >  	atomic_inc(&mm->mm_count);
> >  	BUG_ON(mm != tsk->active_mm);
> > +	/* drop cached information */
> > +	preempt_disable();
> > +	sync_tsk_mm_counters();
> > +	preempt_enable();
> 
> How about (get|put)_cpu in sync_tsk_mm_counters?
> It disables and enables preemption.
> 
I'll add sync_tsk_mm_counters_safe().
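Something like this (just a sketch; exit_mm() would then call it instead of
open-coding preempt_disable()/preempt_enable()):
==
/* safe to call from preemptible context */
void sync_tsk_mm_counters_safe(void)
{
	preempt_disable();
	sync_tsk_mm_counters();
	preempt_enable();
}
==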



> >  	/* more a memory barrier than a real lock */
> >  	task_lock(tsk);
> >  	tsk->mm = NULL;
> > Index: mmotm-2.6.32-Nov24/kernel/sched.c
> > ===================================================================
> > --- mmotm-2.6.32-Nov24.orig/kernel/sched.c
> > +++ mmotm-2.6.32-Nov24/kernel/sched.c
> > @@ -2858,6 +2858,7 @@ context_switch(struct rq *rq, struct tas
> >  	trace_sched_switch(rq, prev, next);
> >  	mm = next->mm;
> >  	oldmm = prev->active_mm;
> > +	prepare_mm_switch(prev, next);
> >  	/*
> >  	 * For paravirt, this is coupled with an exit in switch_to to
> >  	 * combine the page table reload and the switch backend into
> > 
> 
> I think the code is not bad, but I don't know how effective this patch is in practice.
Maybe the benefit of this patch itself is not clear at this point.
I'll post it together with a "more counters" patch (swap_usage, lowmem_rss
usage counters) next time. Adding more counters without atomic_ops will seem
attractive.

> Thanks for good effort. Kame. :)
> 

Thank you for review.
-Kame


* Re: [RFC][mmotm][PATCH] percpu mm struct counter cache
  2009-12-04  0:18   ` KAMEZAWA Hiroyuki
@ 2009-12-04  0:49     ` Minchan Kim
  2009-12-04  1:00       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 6+ messages in thread
From: Minchan Kim @ 2009-12-04  0:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, cl, akpm, yanmin_zhang

On Fri, Dec 4, 2009 at 9:18 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 04 Dec 2009 00:11:02 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> Hi, Kame.
>>
>> KAMEZAWA Hiroyuki wrote:
>> > Christophs's mm_counter+percpu implemtation has scalability at updates but
>> > read-side had some problems. Inspired by that, I tried to write percpu-cache
>> > counter + synchronization method. My own tiny benchmark shows something good
>> > but this patch's hooks may summon other troubles...
>> >
>> > Now, I start from sharing codes here. Any comments are welcome.
>> > (Especially, moving hooks to somewhere better is my concern.)
>> > My test proram will be posted in reply to this mail.
>> >
>> > Regards,
>> > -Kame
>> > ==
>> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >
>> > This patch is for implemanting light-weight per-mm statistics.
>> > Now, when split-pagetable-lock is used, statistics per mm struct
>> > is maintainer by atomic_long_t value. This costs one atomic_inc()
>> > under page_table_lock and if multi-thread program runs and shares
>> > mm_struct, this tend to cause cache-miss+atomic_ops.
>>
>> Both cases are (page_table_lock + atomic inc) cost?
>>
>> AFAIK,
>> If we don't use split lock, we get the just spinlock of page_table_lock.
> yes.
>
>> If we use split lock, we get the just atomic_op cost + page->ptl lock.
> yes. now.
>
>> In case of split lock, ptl lock contention for rss accounting is little, I think.
>>
>> If I am wrong, could you write down changelog more clearly?
>>
> AFAIK, you're right.
>
>
>>
>> >
>> > This patch adds per-cpu mm statistics cache and sync it in periodically.
>> > Cached Information are synchronized into mm_struct at
>> >   - tick
>> >   - context_switch.
>> >   if there is difference.
>>
>> Should we sync mm statistics periodically?
>> Couldn't we sync statistics when we need it?
>> ex) get_mm_counter.
>> I am not sure it's possible. :)
>
> For this counter, read-side cost is important.
> My reply to Christoph's per-cpu-mm-counter, which gathers information at
> get_mm_counter.
> http://marc.info/?l=linux-mm&m=125747002917101&w=2
>
> Making read-side of this counter slower means making ps or top slower.
> IMO, ps or top is already too slow and making them even slower is very bad.

Also, we don't want to introduce a regression in the no-split-ptl-lock system.
Right now, the tick update cost is zero in the no-split-ptl-lock system,
but task switching gets a little more expensive due to the compare instruction.
As you know, task switching is a rather costly function.
I mind the additional overhead in the no-split-ptl-lock system.
I think we can remove the overhead completely.

>
>>
>> >
>> > Tiny test progam on x86-64/4core/2socket machine shows (small) improvements.
>> > This test program measures # of page faults on cpu  0 and 4.
>> > (Using all 8cpus, most of time is used for spinlock and you can't see
>> >  benefits of this patch..)
>> >
>> > [Before Patch]
>> > Performance counter stats for './multi-fault 2' (5 runs):
>> >
>> >        44282223  page-faults                ( +-   0.912% )
>> >      1015540330  cache-references           ( +-   1.701% )
>> >       210497140  cache-misses               ( +-   0.731% )
>> >  29262804803383988  bus-cycles                 ( +-   0.003% )
>> >
>> >    60.003401467  seconds time elapsed   ( +-   0.004% )
>> >
>> >  4.75 miss/faults
>> >  660825108.1564714580837551899777 bus-cycles/faults
>> >
>> > [After Patch]
>> > Performance counter stats for './multi-fault 2' (5 runs):
>> >
>> >        45543398  page-faults                ( +-   0.499% )
>> >      1031865896  cache-references           ( +-   2.720% )
>> >       184901499  cache-misses               ( +-   0.626% )
>> >  29261889737265056  bus-cycles                 ( +-   0.002% )
>> >
>> >    60.001218501  seconds time elapsed   ( +-   0.000% )
>> >
>> >  4.05 miss/faults
>> >  642505632.5 bus-cycles/faults
>> >
>> > Note: to enable split-pagetable-lock, you have to disable SPINLOCK_DEBUG.
>> >
>> > This patch moves mm_counter definitions to mm.h+memory.c from sched.h.
>> > So, total patch size seems to be big.
>>
>> What's your goal/benefit?
>> You cut down atomic operations with (cache and sync) method?
>>
>> Please, write down the your goal/benefit. :)
>>
> Sorry.

No problem. :)

>
> My goal is adding more counters like swap_usage or lowmem_rss_usage,
> etc. Adding them means I'll add more cache-misses.
> Once we have a cache-hit + no-atomic-ops counter, adding more statistics
> will be much easier.

Yep. It would be better to add this to the changelog.

> And considering relaxing mmap_sem as in my speculative-page-fault patch,
> this mm_counter will become another heavy cache-miss point.
>
>
>> >
>> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
>> > +/*
>> > + * The mm counters are not protected by its page_table_lock,
>> > + * so must be incremented atomically.
>> > + */
>> > +void set_mm_counter(struct mm_struct *mm, int member, long value)
>> > +{
>> > +   atomic_long_set(&mm->counters[member], value);
>> > +}
>> > +
>> > +unsigned long get_mm_counter(struct mm_struct *mm, int member)
>> > +{
>> > +   long ret = atomic_long_read(&mm->counters[member]);
>>
>> Which case do we get the minus 'ret'?
>>
> When a process is heavily swapped out and no "sync" happens,
> we can read a negative value. A file-map, fault, munmap sequence done in a
> short time can also make it negative.

Yes, please add this description as a comment.
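Something along these lines would be enough, just as a sketch of the wording
(the exact text is up to you):
==
/*
 * Reads can race with per-cpu cached updates that have not been folded
 * back yet (e.g. decrements from swap-out, or a quick file-map, fault,
 * munmap sequence), so the value can be transiently negative.  Clamp it
 * to 0; the next tick or context switch syncs the cache.
 */
unsigned long get_mm_counter(struct mm_struct *mm, int member)
{
	long ret = atomic_long_read(&mm->counters[member]);

	if (ret < 0)
		return 0;
	return ret;
}
==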

> In this patch, dec_mm_counter() is not used very much yet,
> but I'll add more users when adding the swap_usage counter.
>
>
>
>
>> > +   if (ret < 0)
>> > +           return 0;
>> > +   return ret;
>> > +}
>> > +
>> > +void add_mm_counter(struct mm_struct *mm, int member, long value)
>> > +{
>> > +   atomic_long_add(value, &mm->counters[member]);
>> > +}
>> > +
>> > +/*
>> > + * Always called under pte_lock....irq off, mm != curr_mmc.mm if called
>> > + * by get_user_pages() etc.
>> > + */
>> > +static void
>> > +add_mm_counter_fast(struct mm_struct *mm, int member, long val)
>> > +{
>> > +   if (likely(percpu_read(curr_mmc.mm) == mm))
>> > +           percpu_add(curr_mmc.counters[member], val);
>> > +   else
>> > +           add_mm_counter(mm, member, val);
>> > +}
>> > +
>> > +/* Called by not-preemptable context */
>>               non-preemptible
>> > +void sync_tsk_mm_counters(void)
>> > +{
>> > +   struct pcp_mm_cache *cache = &per_cpu(curr_mmc, smp_processor_id());
>> > +   int i;
>> > +
>> > +   if (!cache->mm)
>> > +           return;
>> > +
>> > +   for (i = 0; i < NR_MM_STATS; i++) {
>> > +           if (!cache->counters[i])
>> > +                   continue;
>> > +           add_mm_counter(cache->mm, i, cache->counters[i]);
>> > +           cache->counters[i] = 0;
>> > +   }
>> > +}
>> > +
>> > +void prepare_mm_switch(struct task_struct *prev, struct task_struct *next)
>> > +{
>> > +   if (prev->mm == next->mm)
>> > +           return;
>> > +   /* If task is exited, sync is already done and prev->mm is NULL */
>> > +   if (prev->mm)
>> > +           sync_tsk_mm_counters();
>> > +   percpu_write(curr_mmc.mm, next->mm);
>> > +}
>>
>> Further optimization.
>> In case of (A -> kernel thread -> A), we don't need the sync if
>> we update statistics only when we need them, as I suggested.
>>
> Hmm. I'll check whether the following can work or not.
> ==
>       if (next->mm == &init_mm)
>                return;
>       if (prev->mm == &init_mm) {
>                if (percpu_read(curr_mmc.mm) == next->mm)
>                        return;
>        }
> ==
If next->mm is NULL, it's a kernel thread.
You can use this rule.
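In other words, something like this (only a sketch, not tested; it also drops
the explicit prev checks):
==
void prepare_mm_switch(struct task_struct *prev, struct task_struct *next)
{
	/* switching to a kernel thread: keep the current cache as it is */
	if (!next->mm)
		return;
	/* covers prev->mm == next->mm and "kernel thread -> same mm" */
	if (percpu_read(curr_mmc.mm) == next->mm)
		return;
	/* the cache belongs to some other mm: fold it back before reuse */
	sync_tsk_mm_counters();
	percpu_write(curr_mmc.mm, next->mm);
}
==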

As I suggested, I want to remove this compare overhead in the non-split-ptl system.


>> > +
>> > +#else  /* !USE_SPLIT_PTLOCKS */
>> > +/*
>> > + * The mm counters are protected by its page_table_lock,
>> > + * so can be incremented directly.
>> > + */
>> > +void set_mm_counter(struct mm_struct *mm, int member, long value)
>> > +{
>> > +   mm->counters[member] = value;
>> > +}
>> > +
>> > +unsigned long get_mm_counter(struct mm_struct *mm, int member)
>> > +{
>> > +   return mm->counters[member];
..
<snip>
..
>> >     pte_unmap_unlock(pte - 1, ptl);
>> > Index: mmotm-2.6.32-Nov24/mm/swapfile.c
>> > ===================================================================
>> > --- mmotm-2.6.32-Nov24.orig/mm/swapfile.c
>> > +++ mmotm-2.6.32-Nov24/mm/swapfile.c
>> > @@ -839,7 +839,7 @@ static int unuse_pte(struct vm_area_stru
>> >             goto out;
>> >     }
>> >
>> > -   inc_mm_counter(vma->vm_mm, anon_rss);
>> > +   add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>>
>> Why can't we use inc_mm_counter_fast in here?
>>
> This vma->vm_mm isn't current->mm in many cases, I think.

I missed that point. Thanks.

>
>
>> >     get_page(page);
>> >     set_pte_at(vma->vm_mm, addr, pte,
>> >                pte_mkold(mk_pte(page, vma->vm_page_prot)));
>> > Index: mmotm-2.6.32-Nov24/kernel/timer.c
>> > ===================================================================
>> > --- mmotm-2.6.32-Nov24.orig/kernel/timer.c
>> > +++ mmotm-2.6.32-Nov24/kernel/timer.c
>> > @@ -1200,6 +1200,8 @@ void update_process_times(int user_tick)
>> >     account_process_tick(p, user_tick);
>> >     run_local_timers();
..
<snip>
..
     /*
>> >      * For paravirt, this is coupled with an exit in switch_to to
>> >      * combine the page table reload and the switch backend into
>> >
>>
>> I think the code is not bad, but I don't know how effective this patch is in practice.
> Maybe the benefit of this patch itself is not clear at this point.
> I'll post it together with a "more counters" patch (swap_usage, lowmem_rss
> usage counters) next time. Adding more counters without atomic_ops will seem
> attractive.

I agree.

>> Thanks for good effort. Kame. :)
>>
>
> Thank you for review.
> -Kame
>
>



-- 
Kind regards,
Minchan Kim


* Re: [RFC][mmotm][PATCH] percpu mm struct counter cache
  2009-12-04  0:49     ` Minchan Kim
@ 2009-12-04  1:00       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 6+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-04  1:00 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, cl, akpm, yanmin_zhang

On Fri, 4 Dec 2009 09:49:17 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Fri, Dec 4, 2009 at 9:18 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Making read-side of this counter slower means making ps or top slower.
> > IMO, ps or top is already too slow and making them even slower is very bad.
> 
> Also, we don't want to introduce a regression in the no-split-ptl-lock system.
> Right now, the tick update cost is zero in the no-split-ptl-lock system,
yes.
> but task switching gets a little more expensive due to the compare instruction.
Ah, 

+#ifdef USE_SPLIT_PTLOCKS
+extern void prepare_mm_switch(struct task_struct *prev,
+				 struct task_struct *next);
+#else
+static inline void prepare_mm_switch(struct task_struct *prev,
+				struct task_struct *next)
+{
+}
+#endif

makes the cost zero.

> As you know, task switching is a rather costly function.
yes.

> I mind the additional overhead in the no-split-ptl-lock system.
yes. here. 
> I think we can remove the overhead completely.
> 

I have another version of this patch, which switches curr_mmc.mm lazily
in a page fault, but it requires some complicated rules.
I'll try that again rather than adding hooks in context_switch().
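The core of that lazy version is roughly the following (sketch only; the
complicated part, clearing curr_mmc.mm safely at mm teardown, is not shown):
==
/* called under pte_lock (irqs off), as in the posted patch */
static void
add_mm_counter_fast(struct mm_struct *mm, int member, long val)
{
	if (percpu_read(curr_mmc.mm) != mm) {
		if (mm != current->mm) {
			/* foreign mm (get_user_pages() etc.): atomic path */
			add_mm_counter(mm, member, val);
			return;
		}
		/* adopt the cache here instead of in context_switch() */
		sync_tsk_mm_counters();
		percpu_write(curr_mmc.mm, mm);
	}
	percpu_add(curr_mmc.counters[member], val);
}
==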

BTW, I'm wondering whether to export "curr_mmc" to other files. Maybe
there will be more information that is nice to cache per cpu+mm.
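If it goes that way, only the declaration needs to move to a header, roughly
like this (the field layout is a sketch from memory, not the exact patch):
==
/* e.g. in linux/mm_types.h */
struct pcp_mm_cache {
	struct mm_struct *mm;
	long counters[NR_MM_STATS];
};
DECLARE_PER_CPU(struct pcp_mm_cache, curr_mmc);
==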

Thanks,
-Kame

