* [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1
From: KAMEZAWA Hiroyuki @ 2008-08-20 9:53 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: LKML, balbir, yamamoto, nishimura, ryov, linux-mm
Hi, this is a patch set for lockless page_cgroup.
I dropped the patches related to the mem+swap controller to make review easier.
(I'm rewriting those, too.)
Changes from current -mm are:
- operations on page_cgroup->flags are now atomic.
- lock_page_cgroup() is removed.
- page->page_cgroup is changed from unsigned long to struct page_cgroup *.
- page_cgroup is freed by RCU.
- To avoid a race, the charge/uncharge against mm/memory.c::insert_page() is
omitted. That path is usually used for mapping a device's pages. (I think...)
In my quick test, performance improved a little. But the real benefit of this
patch set is that it allows access to page_cgroup without taking a lock. I
think this is good for Yamamoto's dirty-page tracking for memcg.
For the I/O tracking people, I added a header file that allows access to
page_cgroup from outside memcontrol.c.
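As a reference for reviewers, the intended lockless access pattern (quoted
from patch 4/7; the PcgObsolete() check comes from patch 2/7) is roughly:

	rcu_read_lock();
	pc = page_get_page_cgroup(page);
	if (pc && !PcgObsolete(pc)) {
		/* pc cannot be freed while we are inside this
		 * RCU read-side section */
		......
	}
	rcu_read_unlock();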
The base kernel is a recent mmtom. Any comments are welcome.
This is still under test; I have to do a long-run test before removing "RFC".
Patches [1-4] are the core logic.
[1/7] page_cgroup_atomic_flags.patch
[2/7] delayed_batch_freeing_of_page_cgroup.patch
[3/7] freeing page_cgroup by rcu.patch
[4/7] lockess page_cgroup.patch
[5/7] add prefetch patch
[6/7] make-mapping-null-before-calling-uncharge.patch
[7/7] adding page_cgroup.h header file.patch
Thanks,
-Kame
* [RFC][PATCH -mm 1/7] memcg: page_cgroup_atomic_flags.patch
From: KAMEZAWA Hiroyuki @ 2008-08-20 9:55 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: LKML, balbir, yamamoto, nishimura, ryov, linux-mm
This patch converts page_cgroup->flags to atomic bit operations and defines
functions (and macros) to access it.
This patch by itself makes memcg slower, but its final purpose is to remove
lock_page_cgroup() and allow fast/easy access to page_cgroup.
Before the memory resource controller can be modified further, these atomic
operations on the flags are necessary.
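For illustration, TESTPCGFLAG(Cache, CACHE) from the diff below expands to
roughly the following (hand-expanded here, not part of the patch):

	static inline int PcgCache(struct page_cgroup *pc)
	{
		return test_bit(Pcg_CACHE, &pc->flags);
	}

so call sites change from open-coded mask tests like
"pc->flags & PAGE_CGROUP_FLAG_CACHE" to PcgCache(pc). Writers get the atomic
set_bit()/clear_bit() variants, plus the non-atomic __set_bit() ones for
places where the page_cgroup is not yet visible to anyone else.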
Changelog (preview) -> (v1):
- patch ordering is changed.
- Added macros to define the Test/Set/Clear bit functions.
- made the names of flags shorter.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 108 +++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 77 insertions(+), 31 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -158,12 +158,57 @@ struct page_cgroup {
struct list_head lru; /* per cgroup LRU list */
struct page *page;
struct mem_cgroup *mem_cgroup;
- int flags;
+ unsigned long flags;
};
-#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
-#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
+
+enum {
+ /* flags for mem_cgroup */
+ Pcg_CACHE, /* charged as cache */
+ /* flags for LRU placement */
+ Pcg_ACTIVE, /* page is active in this cgroup */
+ Pcg_FILE, /* page is file system backed */
+ Pcg_UNEVICTABLE, /* page is unevictable */
+};
+
+#define TESTPCGFLAG(uname, lname) \
+static inline int Pcg##uname(struct page_cgroup *pc) \
+ { return test_bit(Pcg_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname) \
+static inline void SetPcg##uname(struct page_cgroup *pc)\
+ { set_bit(Pcg_##lname, &pc->flags); }
+
+#define CLEARPCGFLAG(uname, lname) \
+static inline void ClearPcg##uname(struct page_cgroup *pc) \
+ { clear_bit(Pcg_##lname, &pc->flags); }
+
+#define __SETPCGFLAG(uname, lname) \
+static inline void __SetPcg##uname(struct page_cgroup *pc)\
+ { __set_bit(Pcg_##lname, &pc->flags); }
+
+#define __CLEARPCGFLAG(uname, lname) \
+static inline void __ClearPcg##uname(struct page_cgroup *pc) \
+ { __clear_bit(Pcg_##lname, &pc->flags); }
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+__SETPCGFLAG(Cache, CACHE)
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+__SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+__SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
+
static int page_cgroup_nid(struct page_cgroup *pc)
{
@@ -184,14 +229,15 @@ enum charge_type {
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
-static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, int flags,
- bool charge)
+static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
+ struct page_cgroup *pc,
+ bool charge)
{
int val = (charge)? 1 : -1;
struct mem_cgroup_stat *stat = &mem->stat;
VM_BUG_ON(!irqs_disabled());
- if (flags & PAGE_CGROUP_FLAG_CACHE)
+ if (PcgCache(pc))
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
else
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
@@ -284,18 +330,18 @@ static void __mem_cgroup_remove_list(str
{
int lru = LRU_BASE;
- if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+ if (PcgUnevictable(pc))
lru = LRU_UNEVICTABLE;
else {
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ if (PcgActive(pc))
lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ if (PcgFile(pc))
lru += LRU_FILE;
}
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
- mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
+ mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
list_del(&pc->lru);
}
@@ -304,27 +350,27 @@ static void __mem_cgroup_add_list(struct
{
int lru = LRU_BASE;
- if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+ if (PcgUnevictable(pc))
lru = LRU_UNEVICTABLE;
else {
- if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+ if (PcgActive(pc))
lru += LRU_ACTIVE;
- if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+ if (PcgFile(pc))
lru += LRU_FILE;
}
MEM_CGROUP_ZSTAT(mz, lru) += 1;
list_add(&pc->lru, &mz->lists[lru]);
- mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
+ mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
}
static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
{
struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
- int active = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
- int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
- int unevictable = pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE;
+ int active = PcgActive(pc);
+ int file = PcgFile(pc);
+ int unevictable = PcgUnevictable(pc);
enum lru_list from = unevictable ? LRU_UNEVICTABLE :
(LRU_FILE * !!file + !!active);
@@ -334,14 +380,14 @@ static void __mem_cgroup_move_lists(stru
MEM_CGROUP_ZSTAT(mz, from) -= 1;
if (is_unevictable_lru(lru)) {
- pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
- pc->flags |= PAGE_CGROUP_FLAG_UNEVICTABLE;
+ ClearPcgActive(pc);
+ SetPcgUnevictable(pc);
} else {
if (is_active_lru(lru))
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+ SetPcgActive(pc);
else
- pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
- pc->flags &= ~PAGE_CGROUP_FLAG_UNEVICTABLE;
+ ClearPcgActive(pc);
+ ClearPcgUnevictable(pc);
}
MEM_CGROUP_ZSTAT(mz, lru) += 1;
@@ -569,18 +615,19 @@ static int mem_cgroup_charge_common(stru
pc->mem_cgroup = mem;
pc->page = page;
+ pc->flags = 0;
/*
* If a page is accounted as a page cache, insert to inactive list.
* If anon, insert to active list.
*/
if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
- pc->flags = PAGE_CGROUP_FLAG_CACHE;
+ __SetPcgCache(pc);
if (page_is_file_cache(page))
- pc->flags |= PAGE_CGROUP_FLAG_FILE;
+ __SetPcgFile(pc);
else
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+ __SetPcgActive(pc);
} else
- pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
+ __SetPcgActive(pc);
lock_page_cgroup(page);
if (unlikely(page_get_page_cgroup(page))) {
@@ -688,8 +735,7 @@ __mem_cgroup_uncharge_common(struct page
VM_BUG_ON(pc->page != page);
if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
- && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
- || page_mapped(page)))
+ && ((PcgCache(pc) || page_mapped(page))))
goto unlock;
mz = page_cgroup_zoneinfo(pc);
@@ -739,7 +785,7 @@ int mem_cgroup_prepare_migration(struct
if (pc) {
mem = pc->mem_cgroup;
css_get(&mem->css);
- if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
+ if (PcgCache(pc))
ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
}
unlock_page_cgroup(page);
* [RFC][PATCH -mm 2/7] memcg: delayed_batch_freeing_of_page_cgroup.patch
From: KAMEZAWA Hiroyuki @ 2008-08-20 9:59 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: LKML, balbir, yamamoto, nishimura, ryov, linux-mm
Free page_cgroup at mem_cgroup_uncharge() in a lazy way.
In mem_cgroup_uncharge_common(), we no longer free the page_cgroup directly;
we just link it to a per-cpu free queue and free it later, once a threshold
is reached.
This is a base patch for the freeing-page_cgroup-by-RCU patch.
This patch depends on page_cgroup_atomic_flags.patch.
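The key ordering on the uncharge side, condensed from the diff below: the
page_cgroup is marked OBSOLETE and unlinked from the page before it is
queued, so any path that still finds it can tell that it is dead:

	SetPcgObsolete(pc);			/* mark it dead first */
	page_assign_page_cgroup(page, NULL);	/* unlink from the page */
	res_counter_uncharge(&mem->res, PAGE_SIZE);
	free_obsolete_page_cgroup(pc);		/* push onto the per-cpu queue,
						 * drained once MEMCG_LRU_THRESH
						 * entries have piled up */

Everyone else must now tolerate obsolete entries, i.e. test
"pc && !PcgObsolete(pc)" before using a page_cgroup they looked up.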
Changelog: (preview) -> (v1)
- Clean up.
- renamed functions
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 103 insertions(+), 12 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -159,11 +159,13 @@ struct page_cgroup {
struct page *page;
struct mem_cgroup *mem_cgroup;
unsigned long flags;
+ struct page_cgroup *next;
};
enum {
/* flags for mem_cgroup */
Pcg_CACHE, /* charged as cache */
+ Pcg_OBSOLETE, /* this page cgroup is invalid (unused) */
/* flags for LRU placement */
Pcg_ACTIVE, /* page is active in this cgroup */
Pcg_FILE, /* page is file system backed */
@@ -194,6 +196,10 @@ static inline void __ClearPcg##uname(str
TESTPCGFLAG(Cache, CACHE)
__SETPCGFLAG(Cache, CACHE)
+/* No "Clear" routine for OBSOLETE flag */
+TESTPCGFLAG(Obsolete, OBSOLETE);
+SETPCGFLAG(Obsolete, OBSOLETE);
+
/* LRU management flags (from global-lru definition) */
TESTPCGFLAG(File, FILE)
SETPCGFLAG(File, FILE)
@@ -220,6 +226,18 @@ static enum zone_type page_cgroup_zid(st
return page_zonenum(pc->page);
}
+/*
+ * Per-cpu slot for freeing page_cgroups in a lazy manner.
+ * All page_cgroups linked to this list are OBSOLETE.
+ */
+struct mem_cgroup_sink_list {
+ int count;
+ struct page_cgroup *next;
+};
+DEFINE_PER_CPU(struct mem_cgroup_sink_list, memcg_sink_list);
+#define MEMCG_LRU_THRESH (16)
+
+
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -427,7 +445,7 @@ void mem_cgroup_move_lists(struct page *
return;
pc = page_get_page_cgroup(page);
- if (pc) {
+ if (pc && !PcgObsolete(pc)) {
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_move_lists(pc, lru);
@@ -520,6 +538,10 @@ unsigned long mem_cgroup_isolate_pages(u
list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
if (scan >= nr_to_scan)
break;
+
+ if (PcgObsolete(pc))
+ continue;
+
page = pc->page;
if (unlikely(!PageLRU(page)))
@@ -552,6 +574,81 @@ unsigned long mem_cgroup_isolate_pages(u
}
/*
+ * Free obsolete page_cgroups which are linked to the per-cpu drop list.
+ */
+
+static void __free_obsolete_page_cgroup(void)
+{
+ struct mem_cgroup *memcg;
+ struct page_cgroup *pc, *next;
+ struct mem_cgroup_per_zone *mz, *page_mz;
+ struct mem_cgroup_sink_list *mcsl;
+ unsigned long flags;
+
+ mcsl = &get_cpu_var(memcg_sink_list);
+ next = mcsl->next;
+ mcsl->next = NULL;
+ mcsl->count = 0;
+ put_cpu_var(memcg_sink_list);
+
+ mz = NULL;
+
+ local_irq_save(flags);
+ while (next) {
+ pc = next;
+ VM_BUG_ON(!PcgObsolete(pc));
+ next = pc->next;
+ prefetch(next);
+ page_mz = page_cgroup_zoneinfo(pc);
+ memcg = pc->mem_cgroup;
+ if (page_mz != mz) {
+ if (mz)
+ spin_unlock(&mz->lru_lock);
+ mz = page_mz;
+ spin_lock(&mz->lru_lock);
+ }
+ __mem_cgroup_remove_list(mz, pc);
+ css_put(&memcg->css);
+ kmem_cache_free(page_cgroup_cache, pc);
+ }
+ if (mz)
+ spin_unlock(&mz->lru_lock);
+ local_irq_restore(flags);
+}
+
+static void free_obsolete_page_cgroup(struct page_cgroup *pc)
+{
+ int count;
+ struct mem_cgroup_sink_list *mcsl;
+
+ mcsl = &get_cpu_var(memcg_sink_list);
+ pc->next = mcsl->next;
+ mcsl->next = pc;
+ count = ++mcsl->count;
+ put_cpu_var(memcg_sink_list);
+ if (count >= MEMCG_LRU_THRESH)
+ __free_obsolete_page_cgroup();
+}
+
+/*
+ * Used when freeing memory resource controller to remove all
+ * page_cgroup (in obsolete list).
+ */
+static DEFINE_MUTEX(memcg_force_drain_mutex);
+
+static void mem_cgroup_local_force_drain(struct work_struct *work)
+{
+ __free_obsolete_page_cgroup();
+}
+
+static void mem_cgroup_all_force_drain(void)
+{
+ mutex_lock(&memcg_force_drain_mutex);
+ schedule_on_each_cpu(mem_cgroup_local_force_drain);
+ mutex_unlock(&memcg_force_drain_mutex);
+}
+
+/*
* Charge the memory controller for page usage.
* Return
* 0 if the charge was successful
@@ -616,6 +713,7 @@ static int mem_cgroup_charge_common(stru
pc->mem_cgroup = mem;
pc->page = page;
pc->flags = 0;
+ pc->next = NULL;
/*
* If a page is accounted as a page cache, insert to inactive list.
* If anon, insert to active list.
@@ -718,8 +816,6 @@ __mem_cgroup_uncharge_common(struct page
{
struct page_cgroup *pc;
struct mem_cgroup *mem;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
if (mem_cgroup_subsys.disabled)
return;
@@ -737,20 +833,14 @@ __mem_cgroup_uncharge_common(struct page
if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
&& ((PcgCache(pc) || page_mapped(page))))
goto unlock;
-
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
-
+ mem = pc->mem_cgroup;
+ SetPcgObsolete(pc);
page_assign_page_cgroup(page, NULL);
unlock_page_cgroup(page);
- mem = pc->mem_cgroup;
res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
+ free_obsolete_page_cgroup(pc);
- kmem_cache_free(page_cgroup_cache, pc);
return;
unlock:
unlock_page_cgroup(page);
@@ -943,6 +1033,7 @@ static int mem_cgroup_force_empty(struct
}
}
ret = 0;
+ mem_cgroup_all_force_drain();
out:
css_put(&mem->css);
return ret;
* [RFC][PATCH -mm 3/7] memcg: freeing page_cgroup by rcu.patch
From: KAMEZAWA Hiroyuki @ 2008-08-20 10:03 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: LKML, balbir, yamamoto, nishimura, ryov, linux-mm
With delayed_batch_freeing_of_page_cgroup.patch, page_cgroup can be freed
lazily. After this patch, page_cgroup is freed via RCU, which makes
page_cgroup RCU-safe. This is necessary for the lockless page_cgroup patch.
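In outline, the drain path now detaches a whole per-cpu list and hands it to
an RCU callback (condensed from the diff below). The callback cannot run
before every pre-existing RCU read-side critical section has finished, which
is what makes the later lockless lookups safe:

	/* detach this CPU's whole obsolete list... */
	work->list = mcsl->next;
	mcsl->next = NULL;
	mcsl->count = 0;
	/* ...and free it only after a grace period has elapsed */
	call_rcu(&work->head, __free_obsolete_page_cgroup_cb);

	/* in the callback, for each page_cgroup on work->list: */
	__mem_cgroup_remove_list(mz, pc);
	css_put(&memcg->css);
	kmem_cache_free(page_cgroup_cache, pc);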
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 44 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 36 insertions(+), 8 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -577,19 +577,23 @@ unsigned long mem_cgroup_isolate_pages(u
* Free obsolete page_cgroups which are linked to the per-cpu drop list.
*/
-static void __free_obsolete_page_cgroup(void)
+struct page_cgroup_rcu_work {
+ struct rcu_head head;
+ struct page_cgroup *list;
+};
+
+static void __free_obsolete_page_cgroup_cb(struct rcu_head *head)
{
struct mem_cgroup *memcg;
struct page_cgroup *pc, *next;
struct mem_cgroup_per_zone *mz, *page_mz;
- struct mem_cgroup_sink_list *mcsl;
+ struct page_cgroup_rcu_work *work;
unsigned long flags;
- mcsl = &get_cpu_var(memcg_sink_list);
- next = mcsl->next;
- mcsl->next = NULL;
- mcsl->count = 0;
- put_cpu_var(memcg_sink_list);
+
+ work = container_of(head, struct page_cgroup_rcu_work, head);
+ next = work->list;
+ kfree(work);
mz = NULL;
@@ -616,6 +620,26 @@ static void __free_obsolete_page_cgroup(
local_irq_restore(flags);
}
+static int __free_obsolete_page_cgroup(void)
+{
+ struct page_cgroup_rcu_work *work;
+ struct mem_cgroup_sink_list *mcsl;
+
+ work = kmalloc(sizeof(*work), GFP_ATOMIC);
+ if (!work)
+ return -ENOMEM;
+ INIT_RCU_HEAD(&work->head);
+
+ mcsl = &get_cpu_var(memcg_sink_list);
+ work->list = mcsl->next;
+ mcsl->next = NULL;
+ mcsl->count = 0;
+ put_cpu_var(memcg_sink_list);
+
+ call_rcu(&work->head, __free_obsolete_page_cgroup_cb);
+ return 0;
+}
+
static void free_obsolete_page_cgroup(struct page_cgroup *pc)
{
int count;
@@ -638,13 +662,17 @@ static DEFINE_MUTEX(memcg_force_drain_mu
static void mem_cgroup_local_force_drain(struct work_struct *work)
{
- __free_obsolete_page_cgroup();
+ int ret;
+ do {
+ ret = __free_obsolete_page_cgroup();
+ } while (ret);
}
static void mem_cgroup_all_force_drain(void)
{
mutex_lock(&memcg_force_drain_mutex);
schedule_on_each_cpu(mem_cgroup_local_force_drain);
+ synchronize_rcu();
mutex_unlock(&memcg_force_drain_mutex);
}
* [RFC][PATCH -mm 4/7] memcg: lockless page_cgroup
From: KAMEZAWA Hiroyuki @ 2008-08-20 10:04 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: LKML, balbir, yamamoto, nishimura, ryov, linux-mm
To remove lock_page_cgroup(), we have to confirm there is no race.
Anon pages:
* Pages are charged/uncharged only when first mapped/last unmapped;
page_mapcount() handles that.
(And... the pte lock is always held in any racy case.)
Swap pages:
* There would be a race because the charge is done before lock_page().
This patch moves mem_cgroup_charge() under lock_page().
File pages (not shmem):
* Pages are charged/uncharged only when they are added to/removed from the
radix-tree. In this case, the page is always locked.
install_page():
* Is it worth charging this special mapped page, which is (maybe) not on
the LRU? I think not.
I removed the charge/uncharge from install_page().
Page migration:
* We precharge the new page and map it back under lock_page(). This should
be treated as a special case.
Freeing of page_cgroup is done under RCU.
After this patch, page_cgroup can be accessed via:
**
	rcu_read_lock();
	pc = page_get_page_cgroup(page);
	if (pc && !PcgObsolete(pc)) {
		......
	}
	rcu_read_unlock();
**
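The matching update side, as a sketch combining this patch with 2/7 and 3/7:

	/* publish: pairs with the rcu_dereference() in
	 * page_get_page_cgroup() */
	rcu_assign_pointer(page->page_cgroup, pc);

	/* retire: mark dead, unpublish, then free only after a grace
	 * period, so the reader pattern above never touches freed memory */
	SetPcgObsolete(pc);
	page_assign_page_cgroup(page, NULL);
	free_obsolete_page_cgroup(pc);	/* eventually call_rcu() + free */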
This is now under test. Don't apply if you're not brave.
Changelog: (preview) -> (v1)
- Added comments.
- Fixed page migration.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/mm_types.h | 2
mm/memcontrol.c | 119 +++++++++++++++++------------------------------
mm/memory.c | 16 +-----
3 files changed, 51 insertions(+), 86 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -137,20 +137,6 @@ struct mem_cgroup {
static struct mem_cgroup init_mem_cgroup;
/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock. We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin). But since
- * bit_spin_lock doesn't actually set that lock bit in a non-debug
- * uniprocessor kernel, we should avoid setting it here too.
- */
-#define PAGE_CGROUP_LOCK_BIT 0x0
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
-#else
-#define PAGE_CGROUP_LOCK 0x0
-#endif
-
-/*
* A page_cgroup page is associated with every page descriptor. The
* page_cgroup helps us identify information about the cgroup
*/
@@ -312,35 +298,14 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
}
-static inline int page_cgroup_locked(struct page *page)
-{
- return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
{
- VM_BUG_ON(!page_cgroup_locked(page));
- page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
+ rcu_assign_pointer(page->page_cgroup, pc);
}
struct page_cgroup *page_get_page_cgroup(struct page *page)
{
- return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void lock_page_cgroup(struct page *page)
-{
- bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
- return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
- bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+ return rcu_dereference(page->page_cgroup);
}
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
@@ -434,16 +399,7 @@ void mem_cgroup_move_lists(struct page *
if (mem_cgroup_subsys.disabled)
return;
- /*
- * We cannot lock_page_cgroup while holding zone's lru_lock,
- * because other holders of lock_page_cgroup can be interrupted
- * with an attempt to rotate_reclaimable_page. But we cannot
- * safely get to page_cgroup without it, so just try_lock it:
- * mem_cgroup_isolate_pages allows for page left on wrong list.
- */
- if (!try_lock_page_cgroup(page))
- return;
-
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
if (pc && !PcgObsolete(pc)) {
mz = page_cgroup_zoneinfo(pc);
@@ -451,7 +407,7 @@ void mem_cgroup_move_lists(struct page *
__mem_cgroup_move_lists(pc, lru);
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
- unlock_page_cgroup(page);
+ rcu_read_unlock();
}
/*
@@ -755,14 +711,9 @@ static int mem_cgroup_charge_common(stru
} else
__SetPcgActive(pc);
- lock_page_cgroup(page);
- if (unlikely(page_get_page_cgroup(page))) {
- unlock_page_cgroup(page);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
- kmem_cache_free(page_cgroup_cache, pc);
- goto done;
- }
+ /* Double counting race condition ? */
+ VM_BUG_ON(page_get_page_cgroup(page));
+
page_assign_page_cgroup(page, pc);
mz = page_cgroup_zoneinfo(pc);
@@ -770,8 +721,6 @@ static int mem_cgroup_charge_common(stru
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
- unlock_page_cgroup(page);
-done:
return 0;
out:
css_put(&mem->css);
@@ -796,6 +745,28 @@ int mem_cgroup_charge(struct page *page,
return 0;
if (unlikely(!mm))
mm = &init_mm;
+ /*
+ * Check for the pre-charged case of an anonymous page,
+ * i.e. page migration.
+ *
+ * Under page migration, the new page (the target of migration) is
+ * charged before being mapped, and page->mapping points to an anon_vma.
+ * Check here whether we've already charged this page or not.
+ *
+ * In that case, we don't charge the newly allocated page again.
+ * The page should be locked to avoid races.
+ */
+ if (PageAnon(page)) {
+ struct page_cgroup *pc;
+ VM_BUG_ON(!PageLocked(page));
+ rcu_read_lock();
+ pc = page_get_page_cgroup(page);
+ if (pc && !PcgObsolete(pc)) {
+ rcu_read_unlock();
+ return 0;
+ }
+ rcu_read_unlock();
+ }
return mem_cgroup_charge_common(page, mm, gfp_mask,
MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
}
@@ -813,20 +784,21 @@ int mem_cgroup_cache_charge(struct page
*
* For GFP_NOWAIT case, the page may be pre-charged before calling
* add_to_page_cache(). (See shmem.c) check it here and avoid to call
- * charge twice. (It works but has to pay a bit larger cost.)
+ * charge twice.
+ *
+ * Note: page migration doesn't call add_to_page_cache(). We can ignore
+ * the case.
*/
if (!(gfp_mask & __GFP_WAIT)) {
struct page_cgroup *pc;
-
- lock_page_cgroup(page);
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
- if (pc) {
+ if (pc && !PcgObsolete(pc)) {
VM_BUG_ON(pc->page != page);
VM_BUG_ON(!pc->mem_cgroup);
- unlock_page_cgroup(page);
+ rcu_read_unlock();
return 0;
}
- unlock_page_cgroup(page);
+ rcu_read_unlock();
}
if (unlikely(!mm))
@@ -851,27 +823,26 @@ __mem_cgroup_uncharge_common(struct page
/*
* Check if our page_cgroup is valid
*/
- lock_page_cgroup(page);
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
- if (unlikely(!pc))
- goto unlock;
+ if (unlikely(!pc) || PcgObsolete(pc))
+ goto out;
VM_BUG_ON(pc->page != page);
if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
&& ((PcgCache(pc) || page_mapped(page))))
- goto unlock;
+ goto out;
mem = pc->mem_cgroup;
SetPcgObsolete(pc);
page_assign_page_cgroup(page, NULL);
- unlock_page_cgroup(page);
res_counter_uncharge(&mem->res, PAGE_SIZE);
free_obsolete_page_cgroup(pc);
+out:
+ rcu_read_unlock();
return;
-unlock:
- unlock_page_cgroup(page);
}
void mem_cgroup_uncharge_page(struct page *page)
@@ -898,15 +869,15 @@ int mem_cgroup_prepare_migration(struct
if (mem_cgroup_subsys.disabled)
return 0;
- lock_page_cgroup(page);
+ rcu_read_lock();
pc = page_get_page_cgroup(page);
- if (pc) {
+ if (pc && !PcgObsolete(pc)) {
mem = pc->mem_cgroup;
css_get(&mem->css);
if (PcgCache(pc))
ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
}
- unlock_page_cgroup(page);
+ rcu_read_unlock();
if (mem) {
ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
ctype, mem);
Index: mmtom-2.6.27-rc3+/mm/memory.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memory.c
+++ mmtom-2.6.27-rc3+/mm/memory.c
@@ -1323,18 +1323,14 @@ static int insert_page(struct vm_area_st
pte_t *pte;
spinlock_t *ptl;
- retval = mem_cgroup_charge(page, mm, GFP_KERNEL);
- if (retval)
- goto out;
-
retval = -EINVAL;
if (PageAnon(page))
- goto out_uncharge;
+ goto out;
retval = -ENOMEM;
flush_dcache_page(page);
pte = get_locked_pte(mm, addr, &ptl);
if (!pte)
- goto out_uncharge;
+ goto out;
retval = -EBUSY;
if (!pte_none(*pte))
goto out_unlock;
@@ -1350,8 +1346,6 @@ static int insert_page(struct vm_area_st
return retval;
out_unlock:
pte_unmap_unlock(pte, ptl);
-out_uncharge:
- mem_cgroup_uncharge_page(page);
out:
return retval;
}
@@ -2325,16 +2319,16 @@ static int do_swap_page(struct mm_struct
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
}
+ lock_page(page);
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
ret = VM_FAULT_OOM;
+ unlock_page(page);
goto out;
}
mark_page_accessed(page);
- lock_page(page);
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
/*
* Back out if somebody else already faulted in this pte.
Index: mmtom-2.6.27-rc3+/include/linux/mm_types.h
===================================================================
--- mmtom-2.6.27-rc3+.orig/include/linux/mm_types.h
+++ mmtom-2.6.27-rc3+/include/linux/mm_types.h
@@ -93,7 +93,7 @@ struct page {
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- unsigned long page_cgroup;
+ struct page_cgroup *page_cgroup;
#endif
#ifdef CONFIG_KMEMCHECK
* [RFC][PATCH -mm 5/7] memcg: prefetch mem cgroup per zone
From: KAMEZAWA Hiroyuki @ 2008-08-20 10:05 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: LKML, balbir, yamamoto, nishimura, ryov, linux-mm
Address of "mz" can be calculated in early stage.
prefetch it (we always do spin_lock later.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -694,6 +694,8 @@ static int mem_cgroup_charge_common(stru
}
}
+ mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zonenum(page));
+ prefetchw(mz);
pc->mem_cgroup = mem;
pc->page = page;
pc->flags = 0;
@@ -716,7 +718,6 @@ static int mem_cgroup_charge_common(stru
page_assign_page_cgroup(page, pc);
- mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_add_list(mz, pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
* [RFC][PATCH -mm 6/7] memcg: make-mapping-null-before-calling-uncharge.patch
From: KAMEZAWA Hiroyuki @ 2008-08-20 10:07 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: LKML, balbir, yamamoto, nishimura, ryov, linux-mm
This patch tries to make page->mapping NULL before
mem_cgroup_uncharge_cache_page() is called.
"page->mapping == NULL" is a good check for whether the page is still in the
radix-tree or not.
This patch also adds a VM_BUG_ON() to mem_cgroup_uncharge_cache_page().
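The resulting ordering in __remove_from_page_cache(), condensed from the
diff below:

	radix_tree_delete(&mapping->page_tree, page->index);
	page->mapping = NULL;		/* the page has left the page cache */
	mapping->nrpages--;
	__dec_zone_page_state(page, NR_FILE_PAGES);
	BUG_ON(page_mapped(page));
	mem_cgroup_uncharge_cache_page(page);	/* can rely on !page->mapping */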
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/filemap.c | 2 +-
mm/memcontrol.c | 1 +
mm/migrate.c | 11 +++++++++--
3 files changed, 11 insertions(+), 3 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/filemap.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/filemap.c
+++ mmtom-2.6.27-rc3+/mm/filemap.c
@@ -116,12 +116,12 @@ void __remove_from_page_cache(struct pag
{
struct address_space *mapping = page->mapping;
- mem_cgroup_uncharge_cache_page(page);
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
BUG_ON(page_mapped(page));
+ mem_cgroup_uncharge_cache_page(page);
/*
* Some filesystems seem to re-dirty the page even after
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -854,6 +854,7 @@ void mem_cgroup_uncharge_page(struct pag
void mem_cgroup_uncharge_cache_page(struct page *page)
{
VM_BUG_ON(page_mapped(page));
+ VM_BUG_ON(page->mapping);
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
Index: mmtom-2.6.27-rc3+/mm/migrate.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/migrate.c
+++ mmtom-2.6.27-rc3+/mm/migrate.c
@@ -330,8 +330,6 @@ static int migrate_page_move_mapping(str
__inc_zone_page_state(newpage, NR_FILE_PAGES);
spin_unlock_irq(&mapping->tree_lock);
- if (!PageSwapCache(newpage))
- mem_cgroup_uncharge_cache_page(page);
return 0;
}
@@ -379,6 +377,15 @@ static void migrate_page_copy(struct pag
ClearPagePrivate(page);
set_page_private(page, 0);
- page->mapping = NULL;
+ /* page->mapping contains a flag for PageAnon() */
+ if (PageAnon(page)) {
+ /* This page is uncharged at try_to_unmap(). */
+ page->mapping = NULL;
+ } else {
+ /* Obsolete file cache should be uncharged */
+ page->mapping = NULL;
+ mem_cgroup_uncharge_cache_page(page);
+ }
/*
* If any waiters have accumulated on the new page then
* [RFC][PATCH -mm 7/7] memcg: add page_cgroup.h header file
From: KAMEZAWA Hiroyuki @ 2008-08-20 10:08 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: LKML, balbir, yamamoto, nishimura, ryov, linux-mm
Experimental... I wonder whether this is enough for potential users.
==
page_cgroup is a struct for accounting each page under the memory resource
controller. Currently it's only used inside memcontrol.c, but there are
possible users of this struct (now).
(*) Because page_cgroup is an extended/on-demand mem_map by nature,
there are people who want to use it for recording information.
If there are no users, this patch is not necessary.
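A sketch of how an outside user (e.g. the I/O tracking work) might use this
header; the helper below is my illustration, not part of the patch:

	#include <linux/page_cgroup.h>

	/* illustration only (not in the patch): whether @page is
	 * currently charged to some mem_cgroup as page cache */
	static int page_charged_as_cache(struct page *page)
	{
		struct page_cgroup *pc;
		int ret = 0;

		rcu_read_lock();
		pc = page_get_page_cgroup(page);
		if (pc && !PcgObsolete(pc))
			ret = PcgCache(pc);
		rcu_read_unlock();
		return ret;
	}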
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/page_cgroup.h | 100 ++++++++++++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 82 ------------------------------------
2 files changed, 101 insertions(+), 81 deletions(-)
Index: mmtom-2.6.27-rc3+/include/linux/page_cgroup.h
===================================================================
--- /dev/null
+++ mmtom-2.6.27-rc3+/include/linux/page_cgroup.h
@@ -0,0 +1,100 @@
+#ifndef __LINUX_PAGE_CGROUP_H
+#define __LINUX_PAGE_CGROUP_H
+
+#include <linux/bitops.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/rcupdate.h>
+
+/*
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup.
+ *
+ * This is pointed to from struct page by the page->page_cgroup pointer.
+ * This pointer is safe under RCU. If a page_cgroup is marked as
+ * Obsolete, don't access it.
+ *
+ * Typical way to access page_cgroup is following.
+ *
+ * rcu_read_lock();
+ * pc = page_get_page_cgroup(page);
+ * if (pc && !PcgObsolete(pc)) {
+ * ......
+ * }
+ * rcu_read_unlock();
+ *
+ */
+struct page_cgroup {
+ struct list_head lru; /* per zone/memcg LRU list */
+ struct page *page; /* the page this accounts for */
+ struct mem_cgroup *mem_cgroup; /* belongs to this mem_cgroup */
+ unsigned long flags;
+ struct page_cgroup *next;
+};
+
+enum {
+ /* flags for mem_cgroup */
+ Pcg_CACHE, /* charged as cache */
+ Pcg_OBSOLETE, /* this page cgroup is invalid (unused) */
+ /* flags for LRU placement */
+ Pcg_ACTIVE, /* page is active in this cgroup */
+ Pcg_FILE, /* page is file system backed */
+ Pcg_UNEVICTABLE, /* page is unevictable */
+};
+
+#define TESTPCGFLAG(uname, lname) \
+static inline int Pcg##uname(struct page_cgroup *pc) \
+ { return test_bit(Pcg_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname) \
+static inline void SetPcg##uname(struct page_cgroup *pc)\
+ { set_bit(Pcg_##lname, &pc->flags); }
+
+#define CLEARPCGFLAG(uname, lname) \
+static inline void ClearPcg##uname(struct page_cgroup *pc) \
+ { clear_bit(Pcg_##lname, &pc->flags); }
+
+#define __SETPCGFLAG(uname, lname) \
+static inline void __SetPcg##uname(struct page_cgroup *pc)\
+ { __set_bit(Pcg_##lname, &pc->flags); }
+
+#define __CLEARPCGFLAG(uname, lname) \
+static inline void __ClearPcg##uname(struct page_cgroup *pc) \
+ { __clear_bit(Pcg_##lname, &pc->flags); }
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+__SETPCGFLAG(Cache, CACHE)
+
+/* No "Clear" routine for OBSOLETE flag */
+TESTPCGFLAG(Obsolete, OBSOLETE);
+SETPCGFLAG(Obsolete, OBSOLETE);
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+__SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+__SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
+
+
+static int page_cgroup_nid(struct page_cgroup *pc)
+{
+ return page_to_nid(pc->page);
+}
+
+static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
+{
+ return page_zonenum(pc->page);
+}
+
+struct page_cgroup *page_get_page_cgroup(struct page *page)
+{
+ return rcu_dereference(page->page_cgroup);
+}
+
+
+#endif
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -33,7 +33,7 @@
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
-
+#include <linux/page_cgroup.h>
#include <asm/uaccess.h>
struct cgroup_subsys mem_cgroup_subsys __read_mostly;
@@ -136,81 +136,6 @@ struct mem_cgroup {
};
static struct mem_cgroup init_mem_cgroup;
-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
- struct list_head lru; /* per cgroup LRU list */
- struct page *page;
- struct mem_cgroup *mem_cgroup;
- unsigned long flags;
- struct page_cgroup *next;
-};
-
-enum {
- /* flags for mem_cgroup */
- Pcg_CACHE, /* charged as cache */
- Pcg_OBSOLETE, /* this page cgroup is invalid (unused) */
- /* flags for LRU placement */
- Pcg_ACTIVE, /* page is active in this cgroup */
- Pcg_FILE, /* page is file system backed */
- Pcg_UNEVICTABLE, /* page is unevictable */
-};
-
-#define TESTPCGFLAG(uname, lname) \
-static inline int Pcg##uname(struct page_cgroup *pc) \
- { return test_bit(Pcg_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname) \
-static inline void SetPcg##uname(struct page_cgroup *pc)\
- { set_bit(Pcg_##lname, &pc->flags); }
-
-#define CLEARPCGFLAG(uname, lname) \
-static inline void ClearPcg##uname(struct page_cgroup *pc) \
- { clear_bit(Pcg_##lname, &pc->flags); }
-
-#define __SETPCGFLAG(uname, lname) \
-static inline void __SetPcg##uname(struct page_cgroup *pc)\
- { __set_bit(Pcg_##lname, &pc->flags); }
-
-#define __CLEARPCGFLAG(uname, lname) \
-static inline void __ClearPcg##uname(struct page_cgroup *pc) \
- { __clear_bit(Pcg_##lname, &pc->flags); }
-
-/* Cache flag is set only once (at allocation) */
-TESTPCGFLAG(Cache, CACHE)
-__SETPCGFLAG(Cache, CACHE)
-
-/* No "Clear" routine for OBSOLETE flag */
-TESTPCGFLAG(Obsolete, OBSOLETE);
-SETPCGFLAG(Obsolete, OBSOLETE);
-
-/* LRU management flags (from global-lru definition) */
-TESTPCGFLAG(File, FILE)
-SETPCGFLAG(File, FILE)
-__SETPCGFLAG(File, FILE)
-CLEARPCGFLAG(File, FILE)
-
-TESTPCGFLAG(Active, ACTIVE)
-SETPCGFLAG(Active, ACTIVE)
-__SETPCGFLAG(Active, ACTIVE)
-CLEARPCGFLAG(Active, ACTIVE)
-
-TESTPCGFLAG(Unevictable, UNEVICTABLE)
-SETPCGFLAG(Unevictable, UNEVICTABLE)
-CLEARPCGFLAG(Unevictable, UNEVICTABLE)
-
-
-static int page_cgroup_nid(struct page_cgroup *pc)
-{
- return page_to_nid(pc->page);
-}
-
-static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
-{
- return page_zonenum(pc->page);
-}
/*
* per-cpu slot for freeing page_cgroup in lazy manner.
@@ -303,11 +228,6 @@ static void page_assign_page_cgroup(stru
rcu_assign_pointer(page->page_cgroup, pc);
}
-struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
- return rcu_dereference(page->page_cgroup);
-}
-
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
* Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1
From: KAMEZAWA Hiroyuki @ 2008-08-20 10:41 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: LKML, balbir, yamamoto, nishimura, ryov, linux-mm
On Wed, 20 Aug 2008 18:53:06 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Hi, this is a patch set for lockless page_cgroup.
> [...]
Known problem: force_empty is broken... so rmdir will get stuck in a
nightmare. It's because of patch 2/7.
It will be fixed in the next version.
Thanks,
-Kame
* Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1
From: KAMEZAWA Hiroyuki @ 2008-08-20 11:00 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: LKML, balbir, yamamoto, nishimura, ryov, linux-mm
On Wed, 20 Aug 2008 19:41:08 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> [...]
> Known problem: force_empty is broken... so rmdir will get stuck in a
> nightmare. It's because of patch 2/7.
> It will be fixed in the next version.
>
This is a quick fix, but I think I can find a better solution...
==
Because removal from the LRU is delayed, mz->lru will never become empty
until someone kicks a drain. This patch rotates the LRU during force_empty
so that the page_cgroups can be freed.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)
Index: mmtom-2.6.27-rc3+/mm/memcontrol.c
===================================================================
--- mmtom-2.6.27-rc3+.orig/mm/memcontrol.c
+++ mmtom-2.6.27-rc3+/mm/memcontrol.c
@@ -893,34 +893,45 @@ static void mem_cgroup_force_empty_list(
struct mem_cgroup_per_zone *mz,
enum lru_list lru)
{
- struct page_cgroup *pc;
+ struct page_cgroup *pc, *tmp;
struct page *page;
int count = FORCE_UNCHARGE_BATCH;
unsigned long flags;
struct list_head *list;
+ int drain, rotate;
list = &mz->lists[lru];
spin_lock_irqsave(&mz->lru_lock, flags);
+ rotate = 0;
while (!list_empty(list)) {
pc = list_entry(list->prev, struct page_cgroup, lru);
- page = pc->page;
- get_page(page);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
- /*
- * Check if this page is on LRU. !LRU page can be found
- * if it's under page migration.
- */
- if (PageLRU(page)) {
- __mem_cgroup_uncharge_common(page,
- MEM_CGROUP_CHARGE_TYPE_FORCE);
- put_page(page);
+ drain = PcgObsolete(pc);
+ if (drain) {
+ /* Skip this */
+ list_move(&pc->lru, list);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ rotate++;
+ if (rotate > MEMCG_LRU_THRESH/2) {
+ mem_cgroup_all_force_drain();
+ rotate = 0;
+ }
+ cond_resched();
+ } else {
+ page = pc->page;
+ get_page(page);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+ /*
+ * Check if this page is on LRU. !LRU page can be found
+ * if it's under page migration.
+ */
+ if (PageLRU(page)) {
+ __mem_cgroup_uncharge_common(page,
+ MEM_CGROUP_CHARGE_TYPE_FORCE);
+ }
+ put_page(page);
if (--count <= 0) {
count = FORCE_UNCHARGE_BATCH;
cond_resched();
}
- } else
- cond_resched();
+ }
spin_lock_irqsave(&mz->lru_lock, flags);
}
spin_unlock_irqrestore(&mz->lru_lock, flags);
@@ -954,7 +965,6 @@ static int mem_cgroup_force_empty(struct
}
}
ret = 0;
- mem_cgroup_all_force_drain();
out:
css_put(&mem->css);
return ret;
* Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1
From: Hirokazu Takahashi @ 2008-08-20 11:33 UTC (permalink / raw)
To: kamezawa.hiroyu; +Cc: linux-kernel, balbir, yamamoto, nishimura, ryov, linux-mm
Hi,
> Hi, this is a patch set for lockless page_cgroup.
> [...]
> For the I/O tracking people, I added a header file that allows access to
> page_cgroup from outside memcontrol.c.
Thanks, Kame.
It is good news that the page-tracking framework is being opened up.
I think I can send you some feedback to make it more generic.
Thanks,
Hirokazu Takahashi.
* Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1
From: KAMEZAWA Hiroyuki @ 2008-08-21 2:17 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: LKML, balbir, yamamoto, nishimura, ryov, linux-mm
On Wed, 20 Aug 2008 20:00:06 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> [...]
> This is a quick fix, but I think I can find a better solution...
> ==
> Because removal from the LRU is delayed, mz->lru will never become empty
> until someone kicks a drain. This patch rotates the LRU during force_empty
> so that the page_cgroups can be freed.
>
I'd like to rewrite force_empty to move all usage to the "default" cgroup.
There are some reasons:
1. The current force_empty creates a live page which has no page_cgroup.
This is bad for any routine which wants to access a page_cgroup from its
page, and this behavior will become a race-condition issue in the future.
2. We can see the amount of out-of-control usage in the default cgroup.
But to do this, I'll have to avoid "hitting the limit" in the default cgroup.
I'm now wondering whether to make it impossible to set a limit on the default
cgroup. (I will show this as a patch in the next version of the series.)
Does anyone have an idea?
Thanks,
-Kame
* Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1
From: Balbir Singh @ 2008-08-21 3:36 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: LKML, yamamoto, nishimura, ryov, linux-mm
KAMEZAWA Hiroyuki wrote:
> [...]
> I'd like to rewrite force_empty to move all usage to the "default" cgroup.
> There are some reasons:
>
> 1. The current force_empty creates a live page which has no page_cgroup.
> This is bad for any routine which wants to access a page_cgroup from its
> page, and this behavior will become a race-condition issue in the future.
> 2. We can see the amount of out-of-control usage in the default cgroup.
>
> But to do this, I'll have to avoid "hitting the limit" in the default
> cgroup. I'm now wondering whether to make it impossible to set a limit on
> the default cgroup. (I will show this as a patch in the next version of
> the series.) Does anyone have an idea?
>
Hi, Kamezawa-San,
The definition of the default cgroup would be the root cgroup, right? I would
like to implement hierarchies correctly in order to define the default cgroup
(it could be the parent of the child cgroup, for example).
--
Balbir
* Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1
2008-08-21 2:17 ` KAMEZAWA Hiroyuki
2008-08-21 3:36 ` Balbir Singh
@ 2008-08-21 3:54 ` Daisuke Nishimura
1 sibling, 0 replies; 18+ messages in thread
From: Daisuke Nishimura @ 2008-08-21 3:54 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, LKML, balbir, yamamoto, ryov, linux-mm
On Thu, 21 Aug 2008 11:17:40 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 20 Aug 2008 20:00:06 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Wed, 20 Aug 2008 19:41:08 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Wed, 20 Aug 2008 18:53:06 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > >
> > > > Hi, this is a patch set for lockless page_cgroup.
> > > >
> > > > I dropped the patches related to the mem+swap controller for easier review.
> > > > (I'm rewriting them, too.)
> > > >
> > > > Changes from the current -mm are:
> > > > - page_cgroup->flags operations are now atomic.
> > > > - lock_page_cgroup() is removed.
> > > > - page->page_cgroup is changed from unsigned long to struct page_cgroup*
> > > > - page_cgroup is freed by RCU.
> > > > - To avoid races, charge/uncharge against mm/memory.c::insert_page() is
> > > > omitted. This is usually used for mapping a device's pages. (I think...)
> > > >
> > > > In my quick test, performance is improved a little. But the benefit of this
> > > > patch is to allow access to page_cgroup without a lock. I think this is good
> > > > for Yamamoto's dirty page tracking for memcg.
> > > > For the I/O tracking people, I added a header file allowing access to
> > > > page_cgroup from outside memcontrol.c.
> > > >
> > > > The base kernel is a recent mmtom. Any comments are welcome.
> > > > This is still under test. I have to do a long-run test before removing "RFC".
> > > >
> > > Known problem: force_empty is broken... so rmdir will get stuck.
> > > This is because of patch 2/7.
> > > It will be fixed in the next version.
> > >
> >
> > This is a quick fix, but I think I can find a better solution.
> > ==
> > Because removal from the LRU is delayed, mz->lru will never be empty until
> > someone kicks a drain. This patch rotates the LRU during force_empty so that
> > page_cgroups can be freed.
> >
>
> I'd like to rewrite force_empty to move all usage to the "default" cgroup.
> There are some reasons:
>
> 1. The current force_empty creates a live page which has no page_cgroup.
> This is bad for routines which want to access a page_cgroup from its page,
> and this behavior will be a source of race conditions in the future.
I agree that the current force_empty is not good on this point.
> 2. We can see the amount of out-of-control usage in the default cgroup.
>
> But to do this, I'll have to avoid "hitting the limit" in the default cgroup.
> I'm now considering making it impossible to set a limit on the default cgroup.
> (It will show up as a patch in the next version of the series.)
> Does anyone have an idea?
>
I don't have a strong objection to making the default cgroup unlimited
and moving usage to the default cgroup.
But I think this is related to hierarchy support, as Balbir-san says.
And making the default cgroup unlimited would not be so strange if
hierarchy is supported.
Thanks,
Daisuke Nishimura.
* Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1
2008-08-21 3:36 ` Balbir Singh
@ 2008-08-21 3:58 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-21 3:58 UTC (permalink / raw)
To: balbir; +Cc: LKML, yamamoto, nishimura, ryov, linux-mm
On Thu, 21 Aug 2008 09:06:53 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > I'd like to rewrite force_empty to move all usage to the "default" cgroup.
> > There are some reasons:
> >
> > 1. The current force_empty creates a live page which has no page_cgroup.
> > This is bad for routines which want to access a page_cgroup from its page,
> > and this behavior will be a source of race conditions in the future.
> > 2. We can see the amount of out-of-control usage in the default cgroup.
> >
> > But to do this, I'll have to avoid "hitting the limit" in the default cgroup.
> > I'm now considering making it impossible to set a limit on the default cgroup.
> > (It will show up as a patch in the next version of the series.)
> > Does anyone have an idea?
> >
>
> Hi, Kamezawa-San,
>
> The definition of the default cgroup would be the root cgroup, right? I would
> like to implement hierarchies correctly in order to define the default cgroup
> (it could be the parent of a child cgroup, for example).
>
Ah yes, the "root" cgroup, for now.
I need a trash-can cgroup somewhere for force_empty. Being accounted in a
trash can is better than being accounted by no one. Once we change the
behavior, we have other choices of improvements:
1. move accounting information to the parent cgroup.
2. move accounting information to a user-defined trash-can cgroup.
As a first step, I'd like to start from the "root" cgroup. We can improve the
behavior in a step-by-step manner, as we've done before.
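To make the "no limit on the default cgroup" part concrete, here is a
minimal sketch of refusing a limit on the root cgroup. This is only an
illustration: the handler name and signature below are assumptions, not
the actual -mm write path.
==
/*
 * Sketch (assumed handler, not the posted code): reject any attempt to
 * set a limit on the root (default) cgroup, so force_empty can always
 * move charges into it without failing.
 */
static int mem_cgroup_write_limit(struct cgroup *cont, struct cftype *cft,
				  u64 val)
{
	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);

	/* The trash-can cgroup must never hit its limit. */
	if (mem == &init_mem_cgroup)
		return -EINVAL;

	return res_counter_set_limit(&mem->res, val);
}
==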
Thanks,
-Kame
* Re: [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1
2008-08-20 11:00 ` KAMEZAWA Hiroyuki
2008-08-21 2:17 ` KAMEZAWA Hiroyuki
@ 2008-08-21 8:34 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-21 8:34 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: LKML, balbir, yamamoto, nishimura, ryov, linux-mm
On Wed, 20 Aug 2008 20:00:06 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Known problem: force_empty is broken... so rmdir will get stuck.
> > This is because of patch 2/7.
> > It will be fixed in the next version.
> >
>
This is a new routine for force_empty. It assumes init_mem_cgroup has no limit.
(Lockless page_cgroup is also applied.)
I think this routine is generic enough to be enhanced for hierarchy support in
the future. The move_account() routine can also be used for other purposes
(for example, move_task).
==
int mem_cgroup_move_account(struct page *page, struct page_cgroup *pc,
			    struct mem_cgroup *from, struct mem_cgroup *to)
{
	struct mem_cgroup_per_zone *from_mz, *to_mz;
	int nid, zid;
	int ret = 1;

	VM_BUG_ON(to->no_limit == 0);
	VM_BUG_ON(!irqs_disabled());

	nid = page_to_nid(page);
	zid = page_zonenum(page);
	from_mz = mem_cgroup_zoneinfo(from, nid, zid);
	to_mz = mem_cgroup_zoneinfo(to, nid, zid);

	if (res_counter_charge(&to->res, PAGE_SIZE)) {
		/* Now, we assume no_limit...no failure here. */
		return ret;
	}
	if (spin_trylock(&to_mz->lru_lock)) {
		__mem_cgroup_remove_list(from_mz, pc);
		css_put(&from->css);
		res_counter_uncharge(&from->res, PAGE_SIZE);
		pc->mem_cgroup = to;
		css_get(&to->css);
		__mem_cgroup_add_list(to_mz, pc);
		ret = 0;
		spin_unlock(&to_mz->lru_lock);
	} else {
		res_counter_uncharge(&to->res, PAGE_SIZE);
	}
	return ret;
}
/*
 * This routine moves all accounting to the root cgroup.
 */
static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
					struct mem_cgroup_per_zone *mz,
					enum lru_list lru)
{
	struct page_cgroup *pc;
	unsigned long flags;
	struct list_head *list;
	int drain = 0;

	list = &mz->lists[lru];

	spin_lock_irqsave(&mz->lru_lock, flags);
	while (!list_empty(list)) {
		pc = list_entry(list->prev, struct page_cgroup, lru);
		if (PcgObsolete(pc)) {
			list_move(&pc->lru, list);
			/*
			 * This page_cgroup may remain on this list until
			 * we drain it.
			 */
			if (drain++ > MEMCG_LRU_THRESH / 2) {
				spin_unlock_irqrestore(&mz->lru_lock, flags);
				mem_cgroup_all_force_drain();
				yield();
				drain = 0;
				spin_lock_irqsave(&mz->lru_lock, flags);
			}
			continue;
		}
		if (mem_cgroup_move_account(pc->page, pc,
					    mem, &init_mem_cgroup)) {
			/* Some conflict; put it back and retry later. */
			list_move(&pc->lru, list);
			spin_unlock_irqrestore(&mz->lru_lock, flags);
			yield();
			spin_lock_irqsave(&mz->lru_lock, flags);
		}
		if (atomic_read(&mem->css.cgroup->count) > 0)
			break;
	}
	spin_unlock_irqrestore(&mz->lru_lock, flags);
}
==
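For reference, here is a sketch of how the entry point could drive the
routine above. It assumes the split-LRU helpers from the mmtom base
(for_each_lru, mem_cgroup_zoneinfo) are available, and is an
illustration, not part of the posted series.
==
/* Sketch of the rewritten force_empty entry point (assumptions above). */
static int mem_cgroup_force_empty(struct mem_cgroup *mem)
{
	int node, zid;
	enum lru_list lru;

	css_get(&mem->css);
	while (mem->res.usage > 0) {
		for_each_node_state(node, N_POSSIBLE) {
			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
				struct mem_cgroup_per_zone *mz;

				mz = mem_cgroup_zoneinfo(mem, node, zid);
				for_each_lru(lru)
					mem_cgroup_force_empty_list(mem,
								    mz, lru);
			}
		}
		/* Let the delayed page_cgroup freeing catch up. */
		cond_resched();
	}
	css_put(&mem->css);
	return 0;
}
==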
* Re: [RFC][PATCH -mm 6/7] memcg: make-mapping-null-before-calling-uncharge.patch
2008-08-20 10:07 ` [RFC][PATCH -mm 6/7] memcg: make-mapping-null-before-calling-uncharge.patch KAMEZAWA Hiroyuki
@ 2008-08-22 4:57 ` Daisuke Nishimura
2008-08-22 5:48 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 18+ messages in thread
From: Daisuke Nishimura @ 2008-08-22 4:57 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: nishimura, LKML, balbir, yamamoto, ryov, linux-mm
> @@ -379,6 +377,15 @@ static void migrate_page_copy(struct pag
> ClearPagePrivate(page);
> set_page_private(page, 0);
> page->mapping = NULL;
You forgot to remove this line :)
Thanks,
Daisuke Nishimura.
> + /* page->mapping contains a flag for PageAnon() */
> + if (PageAnon(page)) {
> + /* This page is uncharged at try_to_unmap(). */
> + page->mapping = NULL;
> + } else {
> + /* Obsolete file cache should be uncharged */
> + page->mapping = NULL;
> + mem_cgroup_uncharge_cache_page(page);
> + }
>
> /*
> * If any waiters have accumulated on the new page then
>
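With the leftover assignment dropped, the tail of migrate_page_copy()
reconstructed from the hunk above would read:
==
	/* Reconstructed from the posted hunk, duplicate assignment removed. */
	ClearPagePrivate(page);
	set_page_private(page, 0);
	/* page->mapping contains a flag for PageAnon() */
	if (PageAnon(page)) {
		/* This page is uncharged at try_to_unmap(). */
		page->mapping = NULL;
	} else {
		/* Obsolete file cache should be uncharged. */
		page->mapping = NULL;
		mem_cgroup_uncharge_cache_page(page);
	}
==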
* Re: [RFC][PATCH -mm 6/7] memcg: make-mapping-null-before-calling-uncharge.patch
2008-08-22 4:57 ` Daisuke Nishimura
@ 2008-08-22 5:48 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 18+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-22 5:48 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: LKML, balbir, yamamoto, ryov, linux-mm
On Fri, 22 Aug 2008 13:57:43 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > @@ -379,6 +377,15 @@ static void migrate_page_copy(struct pag
> > ClearPagePrivate(page);
> > set_page_private(page, 0);
> > page->mapping = NULL;
> You forgot to remove this line :)
>
Ouch, thanks.
-Kame
> Thanks,
> Daisuke Nishimura.
>
> > + /* page->mapping contains a flag for PageAnon() */
> > + if (PageAnon(page)) {
> > + /* This page is uncharged at try_to_unmap(). */
> > + page->mapping = NULL;
> > + } else {
> > + /* Obsolete file cache should be uncharged */
> > + page->mapping = NULL;
> > + mem_cgroup_uncharge_cache_page(page);
> > + }
> >
> > /*
> > * If any waiters have accumulated on the new page then
> >
>
Thread overview: 18+ messages
[not found] <20080819173014.17358c17.kamezawa.hiroyu@jp.fujitsu.com>
2008-08-20 9:53 ` [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1 KAMEZAWA Hiroyuki
2008-08-20 9:55 ` [RFC][PATCH -mm 1/7] memcg: page_cgroup_atomic_flags.patch KAMEZAWA Hiroyuki
2008-08-20 9:59 ` [RFC][PATCH -mm 2/7] memcg: delayed_batch_freeing_of_page_cgroup.patch KAMEZAWA Hiroyuki
2008-08-20 10:03 ` [RFC][PATCH -mm 3/7] memcg: freeing page_cgroup by rcu.patch KAMEZAWA Hiroyuki
2008-08-20 10:04 ` [RFC][PATCH -mm 4/7] memcg: lockless page_cgroup KAMEZAWA Hiroyuki
2008-08-20 10:05 ` [RFC][PATCH -mm 5/7] memcg: prefetch mem cgroup per zone KAMEZAWA Hiroyuki
2008-08-20 10:07 ` [RFC][PATCH -mm 6/7] memcg: make-mapping-null-before-calling-uncharge.patch KAMEZAWA Hiroyuki
2008-08-22 4:57 ` Daisuke Nishimura
2008-08-22 5:48 ` KAMEZAWA Hiroyuki
2008-08-20 10:08 ` [RFC][PATCH -mm 7/7] memcg: add page_cgroup.h header file KAMEZAWA Hiroyuki
2008-08-20 10:41 ` [RFC][PATCH -mm 0/7] memcg: lockless page_cgroup v1 KAMEZAWA Hiroyuki
2008-08-20 11:00 ` KAMEZAWA Hiroyuki
2008-08-21 2:17 ` KAMEZAWA Hiroyuki
2008-08-21 3:36 ` Balbir Singh
2008-08-21 3:58 ` KAMEZAWA Hiroyuki
2008-08-21 3:54 ` Daisuke Nishimura
2008-08-21 8:34 ` KAMEZAWA Hiroyuki
2008-08-20 11:33 ` Hirokazu Takahashi