[PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim.

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim.
@ 2009-09-29  6:01 KAMEZAWA Hiroyuki
  2009-09-29  6:03 ` [PATCH 2/2] memcg: reduce check for softlimit excess KAMEZAWA Hiroyuki
  2009-09-29  6:11 ` [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim Balbir Singh
  0 siblings, 2 replies; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-29  6:01 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, balbir, nishimura

No major changes in this patch for 3 weeks.
While testing, I found a few css->refcnt bug in softlimit.(and posted patches)
But it seems no more (easy) ones.

Andrew, this patch modifies bahavior of memcg's softlimit when it is used
under memcg's hierarchical memory management. I think this is a right way but
don't think current behavior is buggy.
plz take this as for better memcg's behavior.

This patch is based on
  mmotm-Sep28 +
  http://marc.info/?l=linux-kernel&m=125412897121988&w=2
  http://marc.info/?l=linux-kernel&m=125419390519405&w=2

In above. "[BUGFIX][PATCH][rc1] memcg: fix refcnt goes to minus" is a bugfix.

==
This patch clean up/fixes for memcg's uncharge soft limit path.

Problems:
  Now, res_counter_charge()/uncharge() handles softlimit information at
  charge/uncharge and softlimit-check is done when event counter per memcg
  goes over limit. Now, event counter per memcg is updated only when
  memory usage is over soft limit. Here, considering hierarchical memcg
  management, ancesotors should be taken care of.

  Now, ancerstors(hierarchy) are handled in charge() but not in uncharge().
  This is not good.

  Prolems:
  1. memcg's event counter incremented only when softlimit hits. That's bad.
     It makes event counter hard to be reused for other purpose.

  2. At uncharge, only the lowest level rescounter is handled. This is bug.
     Because ancesotor's event counter is not incremented, children should
     take care of them.

  3. res_counter_uncharge()'s 3rd argument is NULL in most case.
     ops under res_counter->lock should be small. No "if" sentense is better.

Fixes:
  * Removed soft_limit_xx poitner and checks in charge and uncharge.
    Do-check-only-when-necessary scheme works enough well without them.

  * make event-counter of memcg incremented at every charge/uncharge.
    (per-cpu area will be accessed soon anyway)

  * All ancestors are checked at soft-limit-check. This is necessary because
    ancesotor's event counter may never be modified. Then, they should be
    checked at the same time.

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 include/linux/res_counter.h |    6 --
 kernel/res_counter.c        |   18 ------
 mm/memcontrol.c             |  115 +++++++++++++++++++-------------------------
 3 files changed, 55 insertions(+), 84 deletions(-)

Index: mmotm-2.6.31-Sep28/kernel/res_counter.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/kernel/res_counter.c
+++ mmotm-2.6.31-Sep28/kernel/res_counter.c
@@ -37,27 +37,17 @@ int res_counter_charge_locked(struct res
 }
 
 int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at,
-			struct res_counter **soft_limit_fail_at)
+			struct res_counter **limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
 	struct res_counter *c, *u;
 
 	*limit_fail_at = NULL;
-	if (soft_limit_fail_at)
-		*soft_limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
-		/*
-		 * With soft limits, we return the highest ancestor
-		 * that exceeds its soft limit
-		 */
-		if (soft_limit_fail_at &&
-			!res_counter_soft_limit_check_locked(c))
-			*soft_limit_fail_at = c;
 		spin_unlock(&c->lock);
 		if (ret < 0) {
 			*limit_fail_at = c;
@@ -85,8 +75,7 @@ void res_counter_uncharge_locked(struct 
 	counter->usage -= val;
 }
 
-void res_counter_uncharge(struct res_counter *counter, unsigned long val,
-				bool *was_soft_limit_excess)
+void res_counter_uncharge(struct res_counter *counter, unsigned long val)
 {
 	unsigned long flags;
 	struct res_counter *c;
@@ -94,9 +83,6 @@ void res_counter_uncharge(struct res_cou
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
 		spin_lock(&c->lock);
-		if (was_soft_limit_excess)
-			*was_soft_limit_excess =
-				!res_counter_soft_limit_check_locked(c);
 		res_counter_uncharge_locked(c, val);
 		spin_unlock(&c->lock);
 	}
Index: mmotm-2.6.31-Sep28/include/linux/res_counter.h
===================================================================
--- mmotm-2.6.31-Sep28.orig/include/linux/res_counter.h
+++ mmotm-2.6.31-Sep28/include/linux/res_counter.h
@@ -114,8 +114,7 @@ void res_counter_init(struct res_counter
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
 int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at,
-		struct res_counter **soft_limit_at);
+		unsigned long val, struct res_counter **limit_fail_at);
 
 /*
  * uncharge - tell that some portion of the resource is released
@@ -128,8 +127,7 @@ int __must_check res_counter_charge(stru
  */
 
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val,
-				bool *was_soft_limit_excess);
+void res_counter_uncharge(struct res_counter *counter, unsigned long val);
 
 static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 {
Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep28/mm/memcontrol.c
@@ -353,16 +353,6 @@ __mem_cgroup_remove_exceeded(struct mem_
 }
 
 static void
-mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
-{
-	spin_lock(&mctz->lock);
-	__mem_cgroup_insert_exceeded(mem, mz, mctz);
-	spin_unlock(&mctz->lock);
-}
-
-static void
 mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
 				struct mem_cgroup_per_zone *mz,
 				struct mem_cgroup_tree_per_zone *mctz)
@@ -392,34 +382,40 @@ static bool mem_cgroup_soft_limit_check(
 
 static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
 {
-	unsigned long long prev_usage_in_excess, new_usage_in_excess;
-	bool updated_tree = false;
+	unsigned long long new_usage_in_excess;
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup_tree_per_zone *mctz;
-
-	mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zonenum(page));
+	int nid = page_to_nid(page);
+	int zid = page_zonenum(page);
 	mctz = soft_limit_tree_from_page(page);
 
 	/*
-	 * We do updates in lazy mode, mem's are removed
-	 * lazily from the per-zone, per-node rb tree
+	 * Necessary to update all ancestors when hierarchy is used.
+	 * because their event counter is not touched.
 	 */
-	prev_usage_in_excess = mz->usage_in_excess;
-
-	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
-	if (prev_usage_in_excess) {
-		mem_cgroup_remove_exceeded(mem, mz, mctz);
-		updated_tree = true;
-	}
-	if (!new_usage_in_excess)
-		goto done;
-	mem_cgroup_insert_exceeded(mem, mz, mctz);
-
-done:
-	if (updated_tree) {
-		spin_lock(&mctz->lock);
-		mz->usage_in_excess = new_usage_in_excess;
-		spin_unlock(&mctz->lock);
+	for (; mem; mem = parent_mem_cgroup(mem)) {
+		mz = mem_cgroup_zoneinfo(mem, nid, zid);
+		new_usage_in_excess =
+			res_counter_soft_limit_excess(&mem->res);
+		/*
+		 * We have to update the tree if mz is on RB-tree or
+		 * mem is over its softlimit.
+		 */
+		if (new_usage_in_excess || mz->on_tree) {
+			spin_lock(&mctz->lock);
+			/* if on-tree, remove it */
+			if (mz->on_tree)
+				__mem_cgroup_remove_exceeded(mem, mz, mctz);
+			/*
+			 * if over soft limit, insert again. mz->usage_in_excess
+			 * will be updated properly.
+			 */
+			if (new_usage_in_excess)
+				__mem_cgroup_insert_exceeded(mem, mz, mctz);
+			else
+				mz->usage_in_excess = 0;
+			spin_unlock(&mctz->lock);
+		}
 	}
 }
 
@@ -1271,9 +1267,9 @@ static int __mem_cgroup_try_charge(struc
 			gfp_t gfp_mask, struct mem_cgroup **memcg,
 			bool oom, struct page *page)
 {
-	struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
+	struct mem_cgroup *mem, *mem_over_limit;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct res_counter *fail_res, *soft_fail_res = NULL;
+	struct res_counter *fail_res;
 
 	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
 		/* Don't account this! */
@@ -1305,17 +1301,16 @@ static int __mem_cgroup_try_charge(struc
 
 		if (mem_cgroup_is_root(mem))
 			goto done;
-		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
-						&soft_fail_res);
+		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
 		if (likely(!ret)) {
 			if (!do_swap_account)
 				break;
 			ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
-							&fail_res, NULL);
+							&fail_res);
 			if (likely(!ret))
 				break;
 			/* mem+swap counter fails */
-			res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+			res_counter_uncharge(&mem->res, PAGE_SIZE);
 			flags |= MEM_CGROUP_RECLAIM_NOSWAP;
 			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
 									memsw);
@@ -1354,16 +1349,11 @@ static int __mem_cgroup_try_charge(struc
 		}
 	}
 	/*
-	 * Insert just the ancestor, we should trickle down to the correct
-	 * cgroup for reclaim, since the other nodes will be below their
-	 * soft limit
-	 */
-	if (soft_fail_res) {
-		mem_over_soft_limit =
-			mem_cgroup_from_res_counter(soft_fail_res, res);
-		if (mem_cgroup_soft_limit_check(mem_over_soft_limit))
-			mem_cgroup_update_tree(mem_over_soft_limit, page);
-	}
+	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
+	 * if they exceeds softlimit.
+	 */
+	if (mem_cgroup_soft_limit_check(mem))
+		mem_cgroup_update_tree(mem, page);
 done:
 	return 0;
 nomem:
@@ -1438,10 +1428,9 @@ static void __mem_cgroup_commit_charge(s
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
 		if (!mem_cgroup_is_root(mem)) {
-			res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+			res_counter_uncharge(&mem->res, PAGE_SIZE);
 			if (do_swap_account)
-				res_counter_uncharge(&mem->memsw, PAGE_SIZE,
-							NULL);
+				res_counter_uncharge(&mem->memsw, PAGE_SIZE);
 		}
 		css_put(&mem->css);
 		return;
@@ -1520,7 +1509,7 @@ static int mem_cgroup_move_account(struc
 		goto out;
 
 	if (!mem_cgroup_is_root(from))
-		res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
+		res_counter_uncharge(&from->res, PAGE_SIZE);
 	mem_cgroup_charge_statistics(from, pc, false);
 
 	page = pc->page;
@@ -1540,7 +1529,7 @@ static int mem_cgroup_move_account(struc
 	}
 
 	if (do_swap_account && !mem_cgroup_is_root(from))
-		res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
+		res_counter_uncharge(&from->memsw, PAGE_SIZE);
 	css_put(&from->css);
 
 	css_get(&to->css);
@@ -1611,9 +1600,9 @@ uncharge:
 	css_put(&parent->css);
 	/* uncharge if move fails */
 	if (!mem_cgroup_is_root(parent)) {
-		res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
+		res_counter_uncharge(&parent->res, PAGE_SIZE);
 		if (do_swap_account)
-			res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
+			res_counter_uncharge(&parent->memsw, PAGE_SIZE);
 	}
 	return ret;
 }
@@ -1804,8 +1793,7 @@ __mem_cgroup_commit_charge_swapin(struct
 			 * calling css_tryget
 			 */
 			if (!mem_cgroup_is_root(memcg))
-				res_counter_uncharge(&memcg->memsw, PAGE_SIZE,
-							NULL);
+				res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
 			mem_cgroup_swap_statistics(memcg, false);
 			mem_cgroup_put(memcg);
 		}
@@ -1832,9 +1820,9 @@ void mem_cgroup_cancel_charge_swapin(str
 	if (!mem)
 		return;
 	if (!mem_cgroup_is_root(mem)) {
-		res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
 		if (do_swap_account)
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
 	}
 	css_put(&mem->css);
 }
@@ -1849,7 +1837,6 @@ __mem_cgroup_uncharge_common(struct page
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
-	bool soft_limit_excess = false;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1889,10 +1876,10 @@ __mem_cgroup_uncharge_common(struct page
 	}
 
 	if (!mem_cgroup_is_root(mem)) {
-		res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
 		if (do_swap_account &&
 				(ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
 	}
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		mem_cgroup_swap_statistics(mem, true);
@@ -1909,7 +1896,7 @@ __mem_cgroup_uncharge_common(struct page
 	mz = page_cgroup_zoneinfo(pc);
 	unlock_page_cgroup(pc);
 
-	if (soft_limit_excess && mem_cgroup_soft_limit_check(mem))
+	if (mem_cgroup_soft_limit_check(mem))
 		mem_cgroup_update_tree(mem, page);
 	/* at swapout, this memcg will be accessed to record to swap */
 	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
@@ -1987,7 +1974,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
 		 * This memcg can be obsolete one. We avoid calling css_tryget
 		 */
 		if (!mem_cgroup_is_root(memcg))
-			res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
+			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
 		mem_cgroup_swap_statistics(memcg, false);
 		mem_cgroup_put(memcg);
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 2/2] memcg: reduce check for softlimit excess.
  2009-09-29  6:01 [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim KAMEZAWA Hiroyuki
@ 2009-09-29  6:03 ` KAMEZAWA Hiroyuki
  2009-09-29  6:11 ` [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim Balbir Singh
  1 sibling, 0 replies; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-29  6:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, akpm, balbir, nishimura

In charge/uncharge/reclaim path, usage_in_excess is calculated repeatedly and
it takes res_counter's spin_lock every time.

This patch removes unnecessary calls for res_count_soft_limit_excess.

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

Index: mmotm-2.6.31-Sep28/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep28.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep28/mm/memcontrol.c
@@ -313,7 +313,8 @@ soft_limit_tree_from_page(struct page *p
 static void
 __mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
 				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
+				struct mem_cgroup_tree_per_zone *mctz,
+				unsigned long long new_usage_in_excess)
 {
 	struct rb_node **p = &mctz->rb_root.rb_node;
 	struct rb_node *parent = NULL;
@@ -322,7 +323,9 @@ __mem_cgroup_insert_exceeded(struct mem_
 	if (mz->on_tree)
 		return;
 
-	mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+	mz->usage_in_excess = new_usage_in_excess;
+	if (!mz->usage_in_excess)
+		return;
 	while (*p) {
 		parent = *p;
 		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
@@ -382,7 +385,7 @@ static bool mem_cgroup_soft_limit_check(
 
 static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
 {
-	unsigned long long new_usage_in_excess;
+	unsigned long long excess;
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup_tree_per_zone *mctz;
 	int nid = page_to_nid(page);
@@ -395,25 +398,21 @@ static void mem_cgroup_update_tree(struc
 	 */
 	for (; mem; mem = parent_mem_cgroup(mem)) {
 		mz = mem_cgroup_zoneinfo(mem, nid, zid);
-		new_usage_in_excess =
-			res_counter_soft_limit_excess(&mem->res);
+		excess = res_counter_soft_limit_excess(&mem->res);
 		/*
 		 * We have to update the tree if mz is on RB-tree or
 		 * mem is over its softlimit.
 		 */
-		if (new_usage_in_excess || mz->on_tree) {
+		if (excess || mz->on_tree) {
 			spin_lock(&mctz->lock);
 			/* if on-tree, remove it */
 			if (mz->on_tree)
 				__mem_cgroup_remove_exceeded(mem, mz, mctz);
 			/*
-			 * if over soft limit, insert again. mz->usage_in_excess
-			 * will be updated properly.
+			 * Insert again. mz->usage_in_excess will be updated.
+			 * If excess is 0, no tree ops.
 			 */
-			if (new_usage_in_excess)
-				__mem_cgroup_insert_exceeded(mem, mz, mctz);
-			else
-				mz->usage_in_excess = 0;
+			__mem_cgroup_insert_exceeded(mem, mz, mctz, excess);
 			spin_unlock(&mctz->lock);
 		}
 	}
@@ -2221,6 +2220,7 @@ unsigned long mem_cgroup_soft_limit_recl
 	unsigned long reclaimed;
 	int loop = 0;
 	struct mem_cgroup_tree_per_zone *mctz;
+	unsigned long long excess;
 
 	if (order > 0)
 		return 0;
@@ -2272,9 +2272,8 @@ unsigned long mem_cgroup_soft_limit_recl
 					break;
 			} while (1);
 		}
-		mz->usage_in_excess =
-			res_counter_soft_limit_excess(&mz->mem->res);
 		__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
+		excess = res_counter_soft_limit_excess(&mz->mem->res);
 		/*
 		 * One school of thought says that we should not add
 		 * back the node to the tree if reclaim returns 0.
@@ -2283,8 +2282,8 @@ unsigned long mem_cgroup_soft_limit_recl
 		 * memory to reclaim from. Consider this as a longer
 		 * term TODO.
 		 */
-		if (mz->usage_in_excess)
-			__mem_cgroup_insert_exceeded(mz->mem, mz, mctz);
+		/* If excess == 0, no tree ops */
+		__mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
 		spin_unlock(&mctz->lock);
 		css_put(&mz->mem->css);
 		loop++;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim.
  2009-09-29  6:01 [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim KAMEZAWA Hiroyuki
  2009-09-29  6:03 ` [PATCH 2/2] memcg: reduce check for softlimit excess KAMEZAWA Hiroyuki
@ 2009-09-29  6:11 ` Balbir Singh
  2009-09-29  7:21   ` KAMEZAWA Hiroyuki
  2009-09-29  9:33   ` KAMEZAWA Hiroyuki
  1 sibling, 2 replies; 8+ messages in thread
From: Balbir Singh @ 2009-09-29  6:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, akpm, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-29 15:01:41]:

> No major changes in this patch for 3 weeks.
> While testing, I found a few css->refcnt bug in softlimit.(and posted patches)
> But it seems no more (easy) ones.
>

Kamezawa-San, this worries me, could you please confirm if you are
able to see this behaviour without your patches applied as well? I am
doing some more stress tests on my side.
 
-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim.
  2009-09-29  6:11 ` [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim Balbir Singh
@ 2009-09-29  7:21   ` KAMEZAWA Hiroyuki
  2009-09-29  7:24     ` KAMEZAWA Hiroyuki
  2009-09-29  9:33   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-29  7:21 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, akpm, nishimura

On Tue, 29 Sep 2009 11:41:32 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-29 15:01:41]:
> 
> > No major changes in this patch for 3 weeks.
> > While testing, I found a few css->refcnt bug in softlimit.(and posted patches)
> > But it seems no more (easy) ones.
> >
> 
> Kamezawa-San, this worries me, could you please confirm if you are
> able to see this behaviour without your patches applied as well?

will try just with BUG_ON() for css->refcnt patch.
But it happend only once even with my patch set.
I found the potential bug by review. I checked all css_get/put/tryget
and it's an only candidates.

> I am doing some more stress tests on my side.
>  
"a few" includes my patch and Nishimura-san's patch for refcnt
which fixes css->refcnt leak. Which are already in -rc.

As I said (in merge plan), I have some concerns on softlimit. But it's
not from softlimit code, but from memcg's nature and complication especially
with hierarchy. These 2 years history shows there are tons of race condition.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim.
  2009-09-29  7:21   ` KAMEZAWA Hiroyuki
@ 2009-09-29  7:24     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-29  7:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, linux-mm, akpm, nishimura

On Tue, 29 Sep 2009 16:21:22 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 29 Sep 2009 11:41:32 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-29 15:01:41]:
> > 
> > > No major changes in this patch for 3 weeks.
> > > While testing, I found a few css->refcnt bug in softlimit.(and posted patches)
> > > But it seems no more (easy) ones.
> > >
> > 
> > Kamezawa-San, this worries me, could you please confirm if you are
> > able to see this behaviour without your patches applied as well?
> 
> will try just with BUG_ON() for css->refcnt patch.
> But it happend only once even with my patch set.
> I found the potential bug by review. I checked all css_get/put/tryget
> and it's an only candidates.

Ah, but I never deny this pactch's influence.

-Kame


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim.
  2009-09-29  6:11 ` [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim Balbir Singh
  2009-09-29  7:21   ` KAMEZAWA Hiroyuki
@ 2009-09-29  9:33   ` KAMEZAWA Hiroyuki
  2009-09-29 11:13     ` Balbir Singh
  2009-09-29 17:06     ` Balbir Singh
  1 sibling, 2 replies; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-29  9:33 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, akpm, nishimura

On Tue, 29 Sep 2009 11:41:32 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-29 15:01:41]:
> 
> > No major changes in this patch for 3 weeks.
> > While testing, I found a few css->refcnt bug in softlimit.(and posted patches)
> > But it seems no more (easy) ones.
> >
> 
> Kamezawa-San, this worries me, could you please confirm if you are
> able to see this behaviour without your patches applied as well? I am
> doing some more stress tests on my side.
>  
I found an easy way to reprocue. And yes, it can happen without this series.

==
#!/bin/bash -x

mount -tcgroup none /cgroups -omemory
mkdir /cgroups/A

while true;do
        mkdir /cgroups/A/01
        echo 3M > /cgroups/A/01/memory.soft_limit_in_bytes
        echo $$ > /cgroups/A/01/tasks
        dd if=/dev/zero of=./tmpfile bs=4096 count=1024
        rm ./tmpfile
        sync
        sleep 1
        echo $$ > /cgroups/A/tasks
        rmdir /cgroups/A/01
done
==
Run this scipt under memory pressure.
Then folloiwng happens. refcnt goes bad. (WARN_ON is my css_refcnt patch's one)

My patch fixes this as far as I tested. 

Bye,
-Kame
==
ep 29 18:24:57 localhost kernel: [  253.756803] Modules linked in: sco bridge stp bnep l2cap crc16 bluetooth rfkill iptable_filter ip_tables
ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log dm_multipath dm_mod uinput ppdev i2c_i8
01 pcspkr i2c_core bnx2 sg e1000e parport_pc parport button shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unlo
aded: microcode]
Sep 29 18:24:57 localhost kernel: [  253.756846] Pid: 3561, comm: rmdir Not tainted 2.6.32-rc2 #5
Sep 29 18:24:57 localhost kernel: [  253.756849] Call Trace:
Sep 29 18:24:57 localhost kernel: [  253.756853]  [<ffffffff810a62d6>] ? __css_put+0x106/0x120
Sep 29 18:24:57 localhost kernel: [  253.756859]  [<ffffffff810502a0>] warn_slowpath_common+0x80/0xd0
Sep 29 18:24:57 localhost kernel: [  253.756863]  [<ffffffff81050304>] warn_slowpath_null+0x14/0x20
Sep 29 18:24:57 localhost kernel: [  253.756866]  [<ffffffff810a62d6>] __css_put+0x106/0x120
Sep 29 18:24:57 localhost kernel: [  253.756870]  [<ffffffff810a61d0>] ? __css_put+0x0/0x120
Sep 29 18:24:57 localhost kernel: [  253.756875]  [<ffffffff81132f42>] mem_cgroup_force_empty+0x7c2/0x870
Sep 29 18:24:57 localhost kernel: [  253.756880]  [<ffffffff8108a45d>] ? trace_hardirqs_on+0xd/0x10
Sep 29 18:24:57 localhost kernel: [  253.756884]  [<ffffffff81133004>] mem_cgroup_pre_destroy+0x14/0x20
Sep 29 18:24:57 localhost kernel: [  253.756887]  [<ffffffff810a7531>] cgroup_rmdir+0xb1/0x4e0
Sep 29 18:24:57 localhost kernel: [  253.756892]  [<ffffffff81076890>] ? autoremove_wake_function+0x0/0x40
Sep 29 18:24:57 localhost kernel: [  253.756897]  [<ffffffff81147a21>] vfs_rmdir+0x131/0x160
Sep 29 18:24:57 localhost kernel: [  253.756901]  [<ffffffff8114a453>] do_rmdir+0x113/0x130
Sep 29 18:24:57 localhost kernel: [  253.756907]  [<ffffffff8100c9e9>] ? retint_swapgs+0xe/0x13
Sep 29 18:24:57 localhost kernel: [  253.756912]  [<ffffffff810b82b2>] ? audit_syscall_entry+0x202/0x230
Sep 29 18:24:57 localhost kernel: [  253.756915]  [<ffffffff8114a4c6>] sys_rmdir+0x16/0x20
Sep 29 18:24:57 localhost kernel: [  253.756919]  [<ffffffff8100bf9b>] system_call_fastpath+0x16/0x1b
Sep 29 18:24:57 localhost kernel: [  253.756922] ---[ end trace 871ca24f2f871b2a ]---
Sep 29 18:25:00 localhost kernel: [  256.363888] ------------[ cut here ]------------
Sep 29 18:25:00 localhost kernel: [  256.363895] kernel BUG at kernel/cgroup.c:3057!
Sep 29 18:25:00 localhost kernel: [  256.363899] invalid opcode: 0000 [#1] SMP
Sep 29 18:25:00 localhost kernel: [  256.363904] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
Sep 29 18:25:00 localhost kernel: [  256.363907] CPU 4
Sep 29 18:25:00 localhost kernel: [  256.363910] Modules linked in: sco bridge stp bnep l2cap crc16 bluetooth rfkill iptable_filter ip_tables
ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log dm_multipath dm_mod uinput ppdev i2c_i8
01 pcspkr i2c_core bnx2 sg e1000e parport_pc parport button shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unlo
aded: microcode]
Sep 29 18:25:00 localhost kernel: [  256.363962] Pid: 3574, comm: rmdir Tainted: G        W  2.6.32-rc2 #5 PRIMERGY
Sep 29 18:25:00 localhost kernel: [  256.363965] RIP: 0010:[<ffffffff810a7740>]  [<ffffffff810a7740>] cgroup_rmdir+0x2c0/0x4e0
Sep 29 18:25:00 localhost kernel: [  256.363977] RSP: 0018:ffff880620c7bdf8  EFLAGS: 00010046
Sep 29 18:25:00 localhost kernel: [  256.363980] RAX: 0000000000000003 RBX: ffff880621e7bf08 RCX: ffff8806131bd800
Sep 29 18:25:00 localhost kernel: [  256.363983] RDX: 0000000000000000 RSI: ffff880621e7c000 RDI: ffffffff818c1940
Sep 29 18:25:00 localhost kernel: [  256.363986] RBP: ffff880620c7be68 R08: ffff880621e7c020 R09: 0000000000000000
Sep 29 18:25:00 localhost kernel: [  256.363989] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88061e773800
Sep 29 18:25:00 localhost kernel: [  256.363993] R13: 0000000000000282 R14: ffff88061e773820 R15: ffff88062206e800
Sep 29 18:25:00 localhost kernel: [  256.363996] FS:  00007fc7eb91f6f0(0000) GS:ffff880050200000(0000) knlGS:0000000000000000
Sep 29 18:25:00 localhost kernel: [  256.364000] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep 29 18:25:00 localhost kernel: [  256.364003] CR2: 0000003f6b0d9cd0 CR3: 0000000622c60000 CR4: 00000000000006e0
Sep 29 18:25:00 localhost kernel: [  256.364007] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 29 18:25:00 localhost kernel: [  256.364011] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 29 18:25:00 localhost kernel: [  256.364014] Process rmdir (pid: 3574, threadinfo ffff880620c7a000, task ffff880622de2860)
Sep 29 18:25:00 localhost kernel: [  256.364017] Stack:
Sep 29 18:25:00 localhost kernel: [  256.364019]  0000000000000000 ffff880622de2860 0000000000000000 ffff880622de2860
Sep 29 18:25:00 localhost kernel: [  256.364024] <0> ffffffff81076890 ffffffff8181ebe0 ffffffff8181ebe0 ffff88061ebd1000
Sep 29 18:25:00 localhost kernel: [  256.364030] <0> 0000000000000000 0000000000000000 ffff880614158dc8 ffff88061ebd1000
Sep 29 18:25:00 localhost kernel: [  256.364038] Call Trace:
Sep 29 18:25:00 localhost kernel: [  256.364046]  [<ffffffff81076890>] ? autoremove_wake_function+0x0/0x40
Sep 29 18:25:00 localhost kernel: [  256.364053]  [<ffffffff81147a21>] vfs_rmdir+0x131/0x160
Sep 29 18:25:00 localhost kernel: [  256.364057]  [<ffffffff8114a453>] do_rmdir+0x113/0x130
Sep 29 18:25:00 localhost kernel: [  256.364064]  [<ffffffff8100c9e9>] ? retint_swapgs+0xe/0x13
Sep 29 18:25:00 localhost kernel: [  256.364071]  [<ffffffff810b82b2>] ? audit_syscall_entry+0x202/0x230
Sep 29 18:25:00 localhost kernel: [  256.364076]  [<ffffffff8114a4c6>] sys_rmdir+0x16/0x20
Sep 29 18:25:00 localhost kernel: [  256.364080]  [<ffffffff8100bf9b>] system_call_fastpath+0x16/0x1b
Sep 29 18:25:00 localhost kernel: [  256.364083] Code: 1f 40 00 48 8b 87 18 01 00 00 49 8b 74 24 68 48 8d b8 e8 fe ff ff e9 8f fe ff ff e8 1b
2d fe ff 41 55 9d eb 87 66 0f 1f 44 00 00 <0f> 0b eb fe 0f 1f 40 00 f0 41 80 24 24 f7 48 83 c4 48 5b 41 5c
Sep 29 18:25:00 localhost kernel: [  256.364141] RIP  [<ffffffff810a7740>] cgroup_rmdir+0x2c0/0x4e0
Sep 29 18:25:00 localhost kernel: [  256.364146]  RSP <ffff880620c7bdf8>
Sep 29 18:25:00 localhost kernel: [  256.364152] ---[ end trace 871ca24f2f871b2b ]---

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim.
  2009-09-29  9:33   ` KAMEZAWA Hiroyuki
@ 2009-09-29 11:13     ` Balbir Singh
  2009-09-29 17:06     ` Balbir Singh
  1 sibling, 0 replies; 8+ messages in thread
From: Balbir Singh @ 2009-09-29 11:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, akpm, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-29 18:33:21]:

> On Tue, 29 Sep 2009 11:41:32 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-29 15:01:41]:
> > 
> > > No major changes in this patch for 3 weeks.
> > > While testing, I found a few css->refcnt bug in softlimit.(and posted patches)
> > > But it seems no more (easy) ones.
> > >
> > 
> > Kamezawa-San, this worries me, could you please confirm if you are
> > able to see this behaviour without your patches applied as well? I am
> > doing some more stress tests on my side.
> >  
> I found an easy way to reprocue. And yes, it can happen without this series.
> 
> ==
> #!/bin/bash -x
> 
> mount -tcgroup none /cgroups -omemory
> mkdir /cgroups/A
> 
> while true;do
>         mkdir /cgroups/A/01
>         echo 3M > /cgroups/A/01/memory.soft_limit_in_bytes
>         echo $$ > /cgroups/A/01/tasks
>         dd if=/dev/zero of=./tmpfile bs=4096 count=1024
>         rm ./tmpfile
>         sync
>         sleep 1
>         echo $$ > /cgroups/A/tasks
>         rmdir /cgroups/A/01
> done
> ==
> Run this scipt under memory pressure.
> Then folloiwng happens. refcnt goes bad. (WARN_ON is my css_refcnt patch's one)
>

Excellent script, i was able to reproduce the problem with the WARN_ON
patch applied. I am going to try the fix now.

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim.
  2009-09-29  9:33   ` KAMEZAWA Hiroyuki
  2009-09-29 11:13     ` Balbir Singh
@ 2009-09-29 17:06     ` Balbir Singh
  1 sibling, 0 replies; 8+ messages in thread
From: Balbir Singh @ 2009-09-29 17:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, akpm, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-29 18:33:21]:

> On Tue, 29 Sep 2009 11:41:32 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-29 15:01:41]:
> > 
> > > No major changes in this patch for 3 weeks.
> > > While testing, I found a few css->refcnt bug in softlimit.(and posted patches)
> > > But it seems no more (easy) ones.
> > >
> > 
> > Kamezawa-San, this worries me, could you please confirm if you are
> > able to see this behaviour without your patches applied as well? I am
> > doing some more stress tests on my side.
> >  
> I found an easy way to reprocue. And yes, it can happen without this series.
>

Kamezawa-San,

Yes, your fix does work and the machine no longer gives a
BUG_ON()/WARN_ON(). Thanks for the analysis and fix.
 
-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-09-29 17:05 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-29  6:01 [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim KAMEZAWA Hiroyuki
2009-09-29  6:03 ` [PATCH 2/2] memcg: reduce check for softlimit excess KAMEZAWA Hiroyuki
2009-09-29  6:11 ` [PATCH 1/2] memcg: some modification to softlimit under hierarchical memory reclaim Balbir Singh
2009-09-29  7:21   ` KAMEZAWA Hiroyuki
2009-09-29  7:24     ` KAMEZAWA Hiroyuki
2009-09-29  9:33   ` KAMEZAWA Hiroyuki
2009-09-29 11:13     ` Balbir Singh
2009-09-29 17:06     ` Balbir Singh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox