[RFC][BUGFIX] memcg: rmdir doesn't return

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC][BUGFIX] memcg: rmdir doesn't return
@ 2009-06-12  5:33 Daisuke Nishimura
  2009-06-12  6:19 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: Daisuke Nishimura @ 2009-06-12  5:33 UTC (permalink / raw)
  To: linux-mm; +Cc: KAMEZAWA Hiroyuki, Balbir Singh, Li Zefan, Daisuke Nishimura

Hi.

I found a problem about rmdir: rmdir doesn't return(or take a very very long time).
Actually, I found this problem long ago, but I've not had enough time to
track it down until the stale swap cache problem has been fixed.

The cause of this problem is the commit ec64f51545fffbc4cb968f0cea56341a4b07e85a
(cgroup: fix frequent -EBUSY at rmdir) and memcg's behavior about swap-in.

The commit introduced cgroup_rmdir_waitq and make rmdir wait until someone
(who will decrement css->refcnt to 1) wake it up.
But even after we have succeeded pre_destroy, which means mem.usage has
become 0, a process which has moved to another cgroup from the cgroup being removed
can increment mem.usage(and css->refcnt as a result) by doing swap-in.
This css->refcnt won't be dropped, that is the rmdir process won't be woken up,
until the owner process frees the page.

So, just "waking up after a while" by a patch below can fix this problem.

===
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3737a68..2fe9645 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2722,7 +2722,7 @@ again:

 	if (!cgroup_clear_css_refs(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
-		schedule();
+		schedule_timeout(HZ/10);	/* don't wait forever */
 		finish_wait(&cgroup_rmdir_waitq, &wait);
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		if (signal_pending(current))
===

But, is there any reason why we should charge a NEW swap-in'ed page to
"the group to which the swap has been charged", not to "the group in which
the process is now" ?
I agree that we should uncharge "swap" at swap-in from "the group to which
the swap has been charged", but IIUC, memcg before/without mem+swap controller behaves
as the latter about the charge of a swap-in'ed page.

I've confirmed that a patch below can also fix this rmdir problem.

===
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ceb6f2..dbece65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1063,7 +1063,7 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)

 static struct mem_cgroup *try_get_mem_cgroup_from_swapcache(struct page *page)
 {
-	struct mem_cgroup *mem;
+	struct mem_cgroup *mem = NULL;
 	struct page_cgroup *pc;
 	unsigned short id;
 	swp_entry_t ent;
@@ -1079,14 +1079,6 @@ static struct mem_cgroup *try_get_mem_cgroup_from_swapcache(struct page *page)
 		mem = pc->mem_cgroup;
 		if (mem && !css_tryget(&mem->css))
 			mem = NULL;
-	} else {
-		ent.val = page_private(page);
-		id = lookup_swap_cgroup(ent);
-		rcu_read_lock();
-		mem = mem_cgroup_lookup(id);
-		if (mem && !css_tryget(&mem->css))
-			mem = NULL;
-		rcu_read_unlock();
 	}
 	unlock_page_cgroup(pc);
 	return mem;
===

Any suggestions ?

Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-12  5:33 [RFC][BUGFIX] memcg: rmdir doesn't return Daisuke Nishimura
@ 2009-06-12  6:19 ` KAMEZAWA Hiroyuki
  2009-06-15  2:50   ` Daisuke Nishimura
  0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-12  6:19 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Li Zefan

On Fri, 12 Jun 2009 14:33:46 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> Hi.
> 
> I found a problem about rmdir: rmdir doesn't return(or take a very very long time).
> Actually, I found this problem long ago, but I've not had enough time to
> track it down until the stale swap cache problem has been fixed.
> 
> The cause of this problem is the commit ec64f51545fffbc4cb968f0cea56341a4b07e85a
> (cgroup: fix frequent -EBUSY at rmdir) and memcg's behavior about swap-in.
> 
> The commit introduced cgroup_rmdir_waitq and make rmdir wait until someone
> (who will decrement css->refcnt to 1) wake it up.
> But even after we have succeeded pre_destroy, which means mem.usage has
> become 0, a process which has moved to another cgroup from the cgroup being removed
> can increment mem.usage(and css->refcnt as a result) by doing swap-in.
> This css->refcnt won't be dropped, that is the rmdir process won't be woken up,
> until the owner process frees the page.
> 
> So, just "waking up after a while" by a patch below can fix this problem.
> 
> ===
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 3737a68..2fe9645 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2722,7 +2722,7 @@ again:
>  
>  	if (!cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
> -		schedule();
> +		schedule_timeout(HZ/10);	/* don't wait forever */
>  		finish_wait(&cgroup_rmdir_waitq, &wait);
>  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		if (signal_pending(current))
> ===
> 
This is not a choice, maybe.



> But, is there any reason why we should charge a NEW swap-in'ed page to
> "the group to which the swap has been charged", not to "the group in which
> the process is now" ?
> I agree that we should uncharge "swap" at swap-in from "the group to which
> the swap has been charged", but IIUC, memcg before/without mem+swap controller behaves
> as the latter about the charge of a swap-in'ed page.
> 
I have no objection to this direction. But this implies the resouce usage
can be moved from a cgroup to other silently.
But this bahavior is not different from behavior of page caches, I think
this one is a choice.

This happens only when swapped-out pages are swapped-in by a process in other
cgroup. Maybe rare case.



> I've confirmed that a patch below can also fix this rmdir problem.
> 
> ===
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6ceb6f2..dbece65 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1063,7 +1063,7 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
>  
>  static struct mem_cgroup *try_get_mem_cgroup_from_swapcache(struct page *page)
>  {
> -	struct mem_cgroup *mem;
> +	struct mem_cgroup *mem = NULL;
>  	struct page_cgroup *pc;
>  	unsigned short id;
>  	swp_entry_t ent;
> @@ -1079,14 +1079,6 @@ static struct mem_cgroup *try_get_mem_cgroup_from_swapcache(struct page *page)
>  		mem = pc->mem_cgroup;
>  		if (mem && !css_tryget(&mem->css))
>  			mem = NULL;
> -	} else {
> -		ent.val = page_private(page);
> -		id = lookup_swap_cgroup(ent);
> -		rcu_read_lock();
> -		mem = mem_cgroup_lookup(id);
> -		if (mem && !css_tryget(&mem->css))
> -			mem = NULL;
> -		rcu_read_unlock();
>  	}
>  	unlock_page_cgroup(pc);
>  	return mem;
> ===
> 
> 
> Any suggestions ?
> 

After this,  swap-cache behavior will be highly complecated ;(

 - If swap-cache is newly swapped-in, it's charged to current user and resource
   usage moves.
 - If swap-cache is used (or unmapped recently), it's charged to old user and
   resource usage don't move.

Then, my suggestion is here.
==
} else {
	ent.val = page_private(page);
	id = lookup_swap_cgroup(ent);
	rcu_read_lock();
	mem = mem_cgroup_lookup(id);
	if (mem) {
		if (css_tryget(mem->css)) {
			/*
			 * If no processes in this cgroup, accounting back to
			 * this cgroup seems silly and prevents RMDIR.
			 */
			struct cgroup *cg = mem->css.cgroup;
			if (!atomic_read(&cg->count) && list_empty(&cg->children)) {
				css_put(&mem->css);
				mem = NULL;
			}
	}
	rcu_read_unlock();
 }
==

nonsense ?

Thanks,
-Kame





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-12  6:19 ` KAMEZAWA Hiroyuki
@ 2009-06-15  2:50   ` Daisuke Nishimura
  2009-06-15  3:02     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: Daisuke Nishimura @ 2009-06-15  2:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Balbir Singh, Li Zefan, Daisuke Nishimura

On Fri, 12 Jun 2009 15:19:24 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 12 Jun 2009 14:33:46 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > Hi.
> > 
> > I found a problem about rmdir: rmdir doesn't return(or take a very very long time).
> > Actually, I found this problem long ago, but I've not had enough time to
> > track it down until the stale swap cache problem has been fixed.
> > 
> > The cause of this problem is the commit ec64f51545fffbc4cb968f0cea56341a4b07e85a
> > (cgroup: fix frequent -EBUSY at rmdir) and memcg's behavior about swap-in.
> > 
> > The commit introduced cgroup_rmdir_waitq and make rmdir wait until someone
> > (who will decrement css->refcnt to 1) wake it up.
> > But even after we have succeeded pre_destroy, which means mem.usage has
> > become 0, a process which has moved to another cgroup from the cgroup being removed
> > can increment mem.usage(and css->refcnt as a result) by doing swap-in.
> > This css->refcnt won't be dropped, that is the rmdir process won't be woken up,
> > until the owner process frees the page.
> > 
> > So, just "waking up after a while" by a patch below can fix this problem.
> > 
> > ===
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index 3737a68..2fe9645 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -2722,7 +2722,7 @@ again:
> >  
> >  	if (!cgroup_clear_css_refs(cgrp)) {
> >  		mutex_unlock(&cgroup_mutex);
> > -		schedule();
> > +		schedule_timeout(HZ/10);	/* don't wait forever */
> >  		finish_wait(&cgroup_rmdir_waitq, &wait);
> >  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> >  		if (signal_pending(current))
> > ===
> > 
> This is not a choice, maybe.
> 
> 
> 
> > But, is there any reason why we should charge a NEW swap-in'ed page to
> > "the group to which the swap has been charged", not to "the group in which
> > the process is now" ?
> > I agree that we should uncharge "swap" at swap-in from "the group to which
> > the swap has been charged", but IIUC, memcg before/without mem+swap controller behaves
> > as the latter about the charge of a swap-in'ed page.
> > 
> I have no objection to this direction. But this implies the resouce usage
> can be moved from a cgroup to other silently.
> But this bahavior is not different from behavior of page caches, I think
> this one is a choice.
> 
> This happens only when swapped-out pages are swapped-in by a process in other
> cgroup. Maybe rare case.
> 
It would be rare case if rmdir is executed after all the tasks have exited,
but I think it would not be so rare if we do "move tasks to another cgroup
and rmdir the old cgroup". 

> 
> 
> > I've confirmed that a patch below can also fix this rmdir problem.
> > 
> > ===
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 6ceb6f2..dbece65 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1063,7 +1063,7 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
> >  
> >  static struct mem_cgroup *try_get_mem_cgroup_from_swapcache(struct page *page)
> >  {
> > -	struct mem_cgroup *mem;
> > +	struct mem_cgroup *mem = NULL;
> >  	struct page_cgroup *pc;
> >  	unsigned short id;
> >  	swp_entry_t ent;
> > @@ -1079,14 +1079,6 @@ static struct mem_cgroup *try_get_mem_cgroup_from_swapcache(struct page *page)
> >  		mem = pc->mem_cgroup;
> >  		if (mem && !css_tryget(&mem->css))
> >  			mem = NULL;
> > -	} else {
> > -		ent.val = page_private(page);
> > -		id = lookup_swap_cgroup(ent);
> > -		rcu_read_lock();
> > -		mem = mem_cgroup_lookup(id);
> > -		if (mem && !css_tryget(&mem->css))
> > -			mem = NULL;
> > -		rcu_read_unlock();
> >  	}
> >  	unlock_page_cgroup(pc);
> >  	return mem;
> > ===
> > 
> > 
> > Any suggestions ?
> > 
> 
> After this,  swap-cache behavior will be highly complecated ;(
> 
>  - If swap-cache is newly swapped-in, it's charged to current user and resource
>    usage moves.
>  - If swap-cache is used (or unmapped recently), it's charged to old user and
>    resource usage don't move.
> 
> Then, my suggestion is here.
> ==
> } else {
> 	ent.val = page_private(page);
> 	id = lookup_swap_cgroup(ent);
> 	rcu_read_lock();
> 	mem = mem_cgroup_lookup(id);
> 	if (mem) {
> 		if (css_tryget(mem->css)) {
> 			/*
> 			 * If no processes in this cgroup, accounting back to
> 			 * this cgroup seems silly and prevents RMDIR.
> 			 */
> 			struct cgroup *cg = mem->css.cgroup;
> 			if (!atomic_read(&cg->count) && list_empty(&cg->children)) {
> 				css_put(&mem->css);
> 				mem = NULL;
> 			}
> 	}
> 	rcu_read_unlock();
>  }
> ==
> 
Thank you for your suggestion.
To be honest, I think swap cache behavior would be complicated anyway :(

I prefer my change because the behavior would become consistent with
the case we don't use mem+swap controller and with the behavior of page cache.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-15  2:50   ` Daisuke Nishimura
@ 2009-06-15  3:02     ` KAMEZAWA Hiroyuki
  2009-06-15  8:17       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-15  3:02 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Li Zefan

On Mon, 15 Jun 2009 11:50:21 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:


> > Then, my suggestion is here.
> > ==
> > } else {
> > 	ent.val = page_private(page);
> > 	id = lookup_swap_cgroup(ent);
> > 	rcu_read_lock();
> > 	mem = mem_cgroup_lookup(id);
> > 	if (mem) {
> > 		if (css_tryget(mem->css)) {
> > 			/*
> > 			 * If no processes in this cgroup, accounting back to
> > 			 * this cgroup seems silly and prevents RMDIR.
> > 			 */
> > 			struct cgroup *cg = mem->css.cgroup;
> > 			if (!atomic_read(&cg->count) && list_empty(&cg->children)) {
> > 				css_put(&mem->css);
> > 				mem = NULL;
> > 			}
> > 	}
> > 	rcu_read_unlock();
> >  }
> > ==
> > 
> Thank you for your suggestion.
> To be honest, I think swap cache behavior would be complicated anyway :(
> 
> I prefer my change because the behavior would become consistent with
> the case we don't use mem+swap controller and with the behavior of page cache.
> 
I don't like implict resource move. I'll try some today. plz see it.
_But_ this case just happens when swap is shared between cgroups and _very_ heavy
swap-in continues very long. I don't think this is a fatal and BUG.

But ok, maybe wake-up path is not enough.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-15  3:02     ` KAMEZAWA Hiroyuki
@ 2009-06-15  8:17       ` KAMEZAWA Hiroyuki
  2009-06-16  2:47         ` Daisuke Nishimura
  0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-15  8:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Li Zefan

On Mon, 15 Jun 2009 12:02:13 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> I don't like implict resource move. I'll try some today. plz see it.
> _But_ this case just happens when swap is shared between cgroups and _very_ heavy
> swap-in continues very long. I don't think this is a fatal and BUG.
> 
> But ok, maybe wake-up path is not enough.
> 
Here.
Anyway, there is an unfortunate complexity in cgroup's rmdir() path.
I think this will remove all concern in
	pre_destroy -> check -> start rmdir path
if subsys is aware of what they does.
Usual subsys just consider "tasks" and no extra references I hope.
If your test result is good, I'll post again (after merge window ?).

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Cgroup is designed for do some work against _tasks_. But when it comes to
memcg, a cgroup can be obtained by something other...i.e. page and swap entry.
Then, pre_destroy at el. are provided. Historically, there are some races
around this...this is new one.

Now, rmdir() path uses following logic.

	pre_destroy();	   # drop all css->refcnt to be 0.
	lock cgroup mutex  # no new task after this
	check cgroup has no tasks.
	check cgroup has no children.
	check css refcnt 
	(*)	if refcnt is not 0, sleep and wait for refcnt goes down to 0.

The logic (*) assumes the refcnt will goes down soon, but in some case(memcg),
it's better to call pre_destroy() again if pre_destroy() can handle it.
(The most unfortunate in above logic is that we can't have some trustable
 lock in this path..but..we may never be able to do.)

This patch adds ss->restart_rmdir() callback to subsys and allow immediate
retry of pre_destroy() if necessary.

Reported-by: Daisuke Nishimura  <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Index: linux-2.6.30.org/include/linux/cgroup.h
===================================================================
--- linux-2.6.30.org.orig/include/linux/cgroup.h
+++ linux-2.6.30.org/include/linux/cgroup.h
@@ -374,6 +374,7 @@ struct cgroup_subsys {
 	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
 						  struct cgroup *cgrp);
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
+	bool (*restart_rmdir)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss,
 			  struct cgroup *cgrp, struct task_struct *tsk);
Index: linux-2.6.30.org/kernel/cgroup.c
===================================================================
--- linux-2.6.30.org.orig/kernel/cgroup.c
+++ linux-2.6.30.org/kernel/cgroup.c
@@ -635,6 +635,23 @@ static int cgroup_call_pre_destroy(struc
 		}
 	return ret;
 }
+/*
+ * Check we have to restart rmdir immediately or not. Because we don't have any
+ * system which prevents "new reference comes after pre_destroy", we checks
+ * whether we have to call pre_destroy() again or not.
+ * i.e. if css_get()'s refcnt is not a temporal one, we can't expect css_put()
+ * is called and need to call pre_destroy().
+ */
+static bool cgroup_need_restart_rmdir(struct cgroup *cgrp)
+{
+	struct cgroup_subsys *ss;
+
+	for_each_subsys(cgrp->root, ss)
+		if (ss->restart_rmdir)
+			if (ss->restart_rmdir(ss, cgrp))
+				return true;
+	return false;
+}
 
 static void free_cgroup_rcu(struct rcu_head *obj)
 {
@@ -2705,7 +2722,8 @@ again:
 
 	if (!cgroup_clear_css_refs(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
-		schedule();
+		if (!cgroup_need_restart_rmdir(cgrp))
+			schedule();
 		finish_wait(&cgroup_rmdir_waitq, &wait);
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		if (signal_pending(current))
Index: linux-2.6.30.org/mm/memcontrol.c
===================================================================
--- linux-2.6.30.org.orig/mm/memcontrol.c
+++ linux-2.6.30.org/mm/memcontrol.c
@@ -2462,6 +2462,18 @@ static int mem_cgroup_pre_destroy(struct
 	return mem_cgroup_force_empty(mem, false);
 }
 
+static bool mem_cgroup_restart_rmdir(struct cgroup_subsys *ss,
+					struct cgroup *cont)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	unsigned long long usage;
+
+	usage = res_counter_read_u64(&mem->res, RES_USAGE);
+	if (usage)/* some charge after pre_destroy() (via swap)....*/
+		return true;
+	return false;
+}
+
 static void mem_cgroup_destroy(struct cgroup_subsys *ss,
 				struct cgroup *cont)
 {
@@ -2501,6 +2513,7 @@ struct cgroup_subsys mem_cgroup_subsys =
 	.subsys_id = mem_cgroup_subsys_id,
 	.create = mem_cgroup_create,
 	.pre_destroy = mem_cgroup_pre_destroy,
+	.restart_rmdir = mem_cgroup_restart_rmdir,
 	.destroy = mem_cgroup_destroy,
 	.populate = mem_cgroup_populate,
 	.attach = mem_cgroup_move_task,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-15  8:17       ` KAMEZAWA Hiroyuki
@ 2009-06-16  2:47         ` Daisuke Nishimura
  2009-06-16  5:00           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: Daisuke Nishimura @ 2009-06-16  2:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Balbir Singh, Li Zefan, Daisuke Nishimura

On Mon, 15 Jun 2009 17:17:15 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 15 Jun 2009 12:02:13 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > I don't like implict resource move. I'll try some today. plz see it.
> > _But_ this case just happens when swap is shared between cgroups and _very_ heavy
> > swap-in continues very long. I don't think this is a fatal and BUG.
> > 
> > But ok, maybe wake-up path is not enough.
> > 
> Here.
> Anyway, there is an unfortunate complexity in cgroup's rmdir() path.
> I think this will remove all concern in
> 	pre_destroy -> check -> start rmdir path
> if subsys is aware of what they does.
> Usual subsys just consider "tasks" and no extra references I hope.
> If your test result is good, I'll post again (after merge window ?).
> 
Thank you for your patch.

At first, I thought this problem can be solved by this direction, but
there is a race window yet.

The root cause of this problem is that mem.usage can be incremented
by swap-in behavior of memcg even after it has become 0 once.
So, mem.usage can also be incremented between cgroup_need_restart_rmdir()
and schedule().
I can see rmdir being locked up actually in my test.

hmm, sleeping until being waken up might not be good if we don't change
swap-in behavior of memcg in some way.


Thanks,
Daisuke Nishimura.

> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Cgroup is designed for do some work against _tasks_. But when it comes to
> memcg, a cgroup can be obtained by something other...i.e. page and swap entry.
> Then, pre_destroy at el. are provided. Historically, there are some races
> around this...this is new one.
> 
> Now, rmdir() path uses following logic.
> 
> 	pre_destroy();	   # drop all css->refcnt to be 0.
> 	lock cgroup mutex  # no new task after this
> 	check cgroup has no tasks.
> 	check cgroup has no children.
> 	check css refcnt 
> 	(*)	if refcnt is not 0, sleep and wait for refcnt goes down to 0.
> 
> The logic (*) assumes the refcnt will goes down soon, but in some case(memcg),
> it's better to call pre_destroy() again if pre_destroy() can handle it.
> (The most unfortunate in above logic is that we can't have some trustable
>  lock in this path..but..we may never be able to do.)
> 
> This patch adds ss->restart_rmdir() callback to subsys and allow immediate
> retry of pre_destroy() if necessary.
> 
> Reported-by: Daisuke Nishimura  <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> Index: linux-2.6.30.org/include/linux/cgroup.h
> ===================================================================
> --- linux-2.6.30.org.orig/include/linux/cgroup.h
> +++ linux-2.6.30.org/include/linux/cgroup.h
> @@ -374,6 +374,7 @@ struct cgroup_subsys {
>  	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
>  						  struct cgroup *cgrp);
>  	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> +	bool (*restart_rmdir)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	int (*can_attach)(struct cgroup_subsys *ss,
>  			  struct cgroup *cgrp, struct task_struct *tsk);
> Index: linux-2.6.30.org/kernel/cgroup.c
> ===================================================================
> --- linux-2.6.30.org.orig/kernel/cgroup.c
> +++ linux-2.6.30.org/kernel/cgroup.c
> @@ -635,6 +635,23 @@ static int cgroup_call_pre_destroy(struc
>  		}
>  	return ret;
>  }
> +/*
> + * Check we have to restart rmdir immediately or not. Because we don't have any
> + * system which prevents "new reference comes after pre_destroy", we checks
> + * whether we have to call pre_destroy() again or not.
> + * i.e. if css_get()'s refcnt is not a temporal one, we can't expect css_put()
> + * is called and need to call pre_destroy().
> + */
> +static bool cgroup_need_restart_rmdir(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys *ss;
> +
> +	for_each_subsys(cgrp->root, ss)
> +		if (ss->restart_rmdir)
> +			if (ss->restart_rmdir(ss, cgrp))
> +				return true;
> +	return false;
> +}
>  
>  static void free_cgroup_rcu(struct rcu_head *obj)
>  {
> @@ -2705,7 +2722,8 @@ again:
>  
>  	if (!cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
> -		schedule();
> +		if (!cgroup_need_restart_rmdir(cgrp))
> +			schedule();
>  		finish_wait(&cgroup_rmdir_waitq, &wait);
>  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		if (signal_pending(current))
> Index: linux-2.6.30.org/mm/memcontrol.c
> ===================================================================
> --- linux-2.6.30.org.orig/mm/memcontrol.c
> +++ linux-2.6.30.org/mm/memcontrol.c
> @@ -2462,6 +2462,18 @@ static int mem_cgroup_pre_destroy(struct
>  	return mem_cgroup_force_empty(mem, false);
>  }
>  
> +static bool mem_cgroup_restart_rmdir(struct cgroup_subsys *ss,
> +					struct cgroup *cont)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> +	unsigned long long usage;
> +
> +	usage = res_counter_read_u64(&mem->res, RES_USAGE);
> +	if (usage)/* some charge after pre_destroy() (via swap)....*/
> +		return true;
> +	return false;
> +}
> +
>  static void mem_cgroup_destroy(struct cgroup_subsys *ss,
>  				struct cgroup *cont)
>  {
> @@ -2501,6 +2513,7 @@ struct cgroup_subsys mem_cgroup_subsys =
>  	.subsys_id = mem_cgroup_subsys_id,
>  	.create = mem_cgroup_create,
>  	.pre_destroy = mem_cgroup_pre_destroy,
> +	.restart_rmdir = mem_cgroup_restart_rmdir,
>  	.destroy = mem_cgroup_destroy,
>  	.populate = mem_cgroup_populate,
>  	.attach = mem_cgroup_move_task,
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-16  2:47         ` Daisuke Nishimura
@ 2009-06-16  5:00           ` KAMEZAWA Hiroyuki
  2009-06-16  6:38             ` Daisuke Nishimura
  0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-16  5:00 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Li Zefan

On Tue, 16 Jun 2009 11:47:35 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Mon, 15 Jun 2009 17:17:15 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 15 Jun 2009 12:02:13 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > I don't like implict resource move. I'll try some today. plz see it.
> > > _But_ this case just happens when swap is shared between cgroups and _very_ heavy
> > > swap-in continues very long. I don't think this is a fatal and BUG.
> > > 
> > > But ok, maybe wake-up path is not enough.
> > > 
> > Here.
> > Anyway, there is an unfortunate complexity in cgroup's rmdir() path.
> > I think this will remove all concern in
> > 	pre_destroy -> check -> start rmdir path
> > if subsys is aware of what they does.
> > Usual subsys just consider "tasks" and no extra references I hope.
> > If your test result is good, I'll post again (after merge window ?).
> > 
> Thank you for your patch.
> 
> At first, I thought this problem can be solved by this direction, but
> there is a race window yet.
> 
> The root cause of this problem is that mem.usage can be incremented
> by swap-in behavior of memcg even after it has become 0 once.
> So, mem.usage can also be incremented between cgroup_need_restart_rmdir()
> and schedule().
> I can see rmdir being locked up actually in my test.
> 
> hmm, sleeping until being waken up might not be good if we don't change
> swap-in behavior of memcg in some way.
> 
Or, invalidate all refs from swap_cgroup in force_empty().
Fixed one is attached.

Why I don't like "charge to current process" at swap-in is that a user cannot
expect how the resource usage will change. It will be random.

In this meaning, I wanted to set "owner" of file-caches. But file-caches are
used in more explict way than swap and the user can be aware of the usage
easier than swap cache.(and files are expected to be shared in its nature.)

The patch itself will require some more work.
What I feel difficut in cgroup's rmdir() is
==
	pre_destroy();   => pre_destroy() reduces css's refcnt to be 0.
	CGROUP_WAIT_ON_RMDIR is set
	if (check css's refcnt again)
	{
		sleep and retry
	}
==
css_tryget() check CSS_IS_REMOVED but CSS_IS_REMOVED is set only when
css->refcnt goes down to be 0. Hmm.

I think my patch itself is not so bad. But the scheme is dirty in general.

Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/cgroup.h |   12 ++++++++++++
 kernel/cgroup.c        |   22 ++++++++++++++++++----
 mm/memcontrol.c        |   22 ++++++++++++++++++++--
 3 files changed, 50 insertions(+), 6 deletions(-)

Index: linux-2.6.30.org/kernel/cgroup.c
===================================================================
--- linux-2.6.30.org.orig/kernel/cgroup.c
+++ linux-2.6.30.org/kernel/cgroup.c
@@ -636,6 +636,20 @@ static int cgroup_call_pre_destroy(struc
 	return ret;
 }
 
+static int cgroup_retry_rmdir(struct cgroup *cgrp)
+{
+	struct cgroup_subsys *ss;
+	int ret = 0;
+
+	for_each_subsys(cgrp->root, ss)
+		if (ss->pre_destroy) {
+			ret = ss->retry_rmdir(ss, cgrp);
+			if (ret)
+				break;
+		}
+	return ret;
+}
+
 static void free_cgroup_rcu(struct rcu_head *obj)
 {
 	struct cgroup *cgrp = container_of(obj, struct cgroup, rcu_head);
@@ -737,10 +751,9 @@ static void cgroup_d_remove_dir(struct d
  */
 DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
 
-static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
+void __cgroup_wakeup_rmdir_waiters(struct cgroup *cgrp)
 {
-	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
-		wake_up_all(&cgroup_rmdir_waitq);
+	wake_up_all(&cgroup_rmdir_waitq);
 }
 
 static int rebind_subsystems(struct cgroupfs_root *root,
@@ -2705,7 +2718,8 @@ again:
 
 	if (!cgroup_clear_css_refs(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
-		schedule();
+		if (!cgroup_retry_rmdir(cgrp))
+			schedule();
 		finish_wait(&cgroup_rmdir_waitq, &wait);
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		if (signal_pending(current))
Index: linux-2.6.30.org/include/linux/cgroup.h
===================================================================
--- linux-2.6.30.org.orig/include/linux/cgroup.h
+++ linux-2.6.30.org/include/linux/cgroup.h
@@ -366,6 +366,17 @@ int cgroup_task_count(const struct cgrou
 int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
 
 /*
+ * Wake up rmdir() waiter if the subsys requires to call pre_destroy() again to
+ * make css's refcnt to be 0 and allow rmdir() go ahead.
+ */
+void __cgroup_wakeup_rmdir_waiters(struct cgroup *cgrp);
+static inline void cgroup_wakeup_rmdir_waiters(struct cgroup *cgrp)
+{
+	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
+		__cgroup_wakeup_rmdir_waiters(cgrp);
+}
+
+/*
  * Control Group subsystem type.
  * See Documentation/cgroups/cgroups.txt for details
  */
@@ -374,6 +385,7 @@ struct cgroup_subsys {
 	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
 						  struct cgroup *cgrp);
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
+	int (*rmdir_retry)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss,
 			  struct cgroup *cgrp, struct task_struct *tsk);
Index: linux-2.6.30.org/mm/memcontrol.c
===================================================================
--- linux-2.6.30.org.orig/mm/memcontrol.c
+++ linux-2.6.30.org/mm/memcontrol.c
@@ -1367,8 +1367,12 @@ __mem_cgroup_commit_charge_swapin(struct
 		}
 		rcu_read_unlock();
 	}
-	/* add this page(page_cgroup) to the LRU we want. */
-
+	/*
+	 * At swap-in, it's not guaranteed that "ptr" includes a task and
+	 * some thread may be waiting for rmdir(). Wake it up and allow to
+	 * retry pre_destroy();
+	 */
+	cgroup_wakeup_rmdir_waiters(ptr->css.cgroup);
 }
 
 void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
@@ -2462,6 +2466,18 @@ static int mem_cgroup_pre_destroy(struct
 	return mem_cgroup_force_empty(mem, false);
 }
 
+static int mem_cgroup_retry_rmdir(struct cgroup_subsys *ss,
+				  struct cgroup *cont)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	unsigned long long usage;
+
+	usage = res_counter_read_u64(&mem->res, RES_USAGE);
+	if (usage) /* some chagers after pre_destroy() */
+		return 1;
+	return 0;
+}
+
 static void mem_cgroup_destroy(struct cgroup_subsys *ss,
 				struct cgroup *cont)
 {
@@ -2496,11 +2512,13 @@ static void mem_cgroup_move_task(struct 
 	mutex_unlock(&memcg_tasklist);
 }
 
+
 struct cgroup_subsys mem_cgroup_subsys = {
 	.name = "memory",
 	.subsys_id = mem_cgroup_subsys_id,
 	.create = mem_cgroup_create,
 	.pre_destroy = mem_cgroup_pre_destroy,
+	.retry_rmdir = mem_cgroup_retry_rmdir,
 	.destroy = mem_cgroup_destroy,
 	.populate = mem_cgroup_populate,
 	.attach = mem_cgroup_move_task,


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-16  5:00           ` KAMEZAWA Hiroyuki
@ 2009-06-16  6:38             ` Daisuke Nishimura
  2009-06-16  6:48               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: Daisuke Nishimura @ 2009-06-16  6:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Balbir Singh, Li Zefan, Daisuke Nishimura

On Tue, 16 Jun 2009 14:00:50 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 16 Jun 2009 11:47:35 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Mon, 15 Jun 2009 17:17:15 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Mon, 15 Jun 2009 12:02:13 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > I don't like implict resource move. I'll try some today. plz see it.
> > > > _But_ this case just happens when swap is shared between cgroups and _very_ heavy
> > > > swap-in continues very long. I don't think this is a fatal and BUG.
> > > > 
> > > > But ok, maybe wake-up path is not enough.
> > > > 
> > > Here.
> > > Anyway, there is an unfortunate complexity in cgroup's rmdir() path.
> > > I think this will remove all concern in
> > > 	pre_destroy -> check -> start rmdir path
> > > if subsys is aware of what they does.
> > > Usual subsys just consider "tasks" and no extra references I hope.
> > > If your test result is good, I'll post again (after merge window ?).
> > > 
> > Thank you for your patch.
> > 
> > At first, I thought this problem can be solved by this direction, but
> > there is a race window yet.
> > 
> > The root cause of this problem is that mem.usage can be incremented
> > by swap-in behavior of memcg even after it has become 0 once.
> > So, mem.usage can also be incremented between cgroup_need_restart_rmdir()
> > and schedule().
> > I can see rmdir being locked up actually in my test.
> > 
> > hmm, sleeping until being waken up might not be good if we don't change
> > swap-in behavior of memcg in some way.
> > 
> Or, invalidate all refs from swap_cgroup in force_empty().
> Fixed one is attached.
> 
> Why I don't like "charge to current process" at swap-in is that a user cannot
> expect how the resource usage will change. It will be random.
> 
> In this meaning, I wanted to set "owner" of file-caches. But file-caches are
> used in more explict way than swap and the user can be aware of the usage
> easier than swap cache.(and files are expected to be shared in its nature.)
> 
> The patch itself will require some more work.
> What I feel difficut in cgroup's rmdir() is
> ==
> 	pre_destroy();   => pre_destroy() reduces css's refcnt to be 0.
> 	CGROUP_WAIT_ON_RMDIR is set
> 	if (check css's refcnt again)
> 	{
> 		sleep and retry
> 	}
> ==
> css_tryget() check CSS_IS_REMOVED but CSS_IS_REMOVED is set only when
> css->refcnt goes down to be 0. Hmm.
> 
> I think my patch itself is not so bad. But the scheme is dirty in general.
> 
> Thanks,
> -Kame
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Looks good except for:

> @@ -374,6 +385,7 @@ struct cgroup_subsys {
>  	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
>  						  struct cgroup *cgrp);
>  	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> +	int (*rmdir_retry)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	int (*can_attach)(struct cgroup_subsys *ss,
>  			  struct cgroup *cgrp, struct task_struct *tsk);
s/rmdir_retry/retry_rmdir

It has been working well so far, but I will continue to test for more long time.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-16  6:38             ` Daisuke Nishimura
@ 2009-06-16  6:48               ` KAMEZAWA Hiroyuki
  2009-06-16  8:44                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-16  6:48 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, Balbir Singh, Li Zefan

On Tue, 16 Jun 2009 15:38:10 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Tue, 16 Jun 2009 14:00:50 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 16 Jun 2009 11:47:35 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > On Mon, 15 Jun 2009 17:17:15 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > On Mon, 15 Jun 2009 12:02:13 +0900
> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > I don't like implict resource move. I'll try some today. plz see it.
> > > > > _But_ this case just happens when swap is shared between cgroups and _very_ heavy
> > > > > swap-in continues very long. I don't think this is a fatal and BUG.
> > > > > 
> > > > > But ok, maybe wake-up path is not enough.
> > > > > 
> > > > Here.
> > > > Anyway, there is an unfortunate complexity in cgroup's rmdir() path.
> > > > I think this will remove all concern in
> > > > 	pre_destroy -> check -> start rmdir path
> > > > if subsys is aware of what they does.
> > > > Usual subsys just consider "tasks" and no extra references I hope.
> > > > If your test result is good, I'll post again (after merge window ?).
> > > > 
> > > Thank you for your patch.
> > > 
> > > At first, I thought this problem can be solved by this direction, but
> > > there is a race window yet.
> > > 
> > > The root cause of this problem is that mem.usage can be incremented
> > > by swap-in behavior of memcg even after it has become 0 once.
> > > So, mem.usage can also be incremented between cgroup_need_restart_rmdir()
> > > and schedule().
> > > I can see rmdir being locked up actually in my test.
> > > 
> > > hmm, sleeping until being waken up might not be good if we don't change
> > > swap-in behavior of memcg in some way.
> > > 
> > Or, invalidate all refs from swap_cgroup in force_empty().
> > Fixed one is attached.
> > 
> > Why I don't like "charge to current process" at swap-in is that a user cannot
> > expect how the resource usage will change. It will be random.
> > 
> > In this meaning, I wanted to set "owner" of file-caches. But file-caches are
> > used in more explict way than swap and the user can be aware of the usage
> > easier than swap cache.(and files are expected to be shared in its nature.)
> > 
> > The patch itself will require some more work.
> > What I feel difficut in cgroup's rmdir() is
> > ==
> > 	pre_destroy();   => pre_destroy() reduces css's refcnt to be 0.
> > 	CGROUP_WAIT_ON_RMDIR is set
> > 	if (check css's refcnt again)
> > 	{
> > 		sleep and retry
> > 	}
> > ==
> > css_tryget() check CSS_IS_REMOVED but CSS_IS_REMOVED is set only when
> > css->refcnt goes down to be 0. Hmm.
> > 
> > I think my patch itself is not so bad. But the scheme is dirty in general.
> > 
> > Thanks,
> > -Kame
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Looks good except for:
> 
> > @@ -374,6 +385,7 @@ struct cgroup_subsys {
> >  	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
> >  						  struct cgroup *cgrp);
> >  	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> > +	int (*rmdir_retry)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> >  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> >  	int (*can_attach)(struct cgroup_subsys *ss,
> >  			  struct cgroup *cgrp, struct task_struct *tsk);
> s/rmdir_retry/retry_rmdir
> 
> It has been working well so far, but I will continue to test for more long time.
> 
Thank you. I'd like to find out more clean fix, keeping this as an option.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-16  6:48               ` KAMEZAWA Hiroyuki
@ 2009-06-16  8:44                 ` KAMEZAWA Hiroyuki
  2009-06-17  4:56                   ` Balbir Singh
  2009-06-18  3:03                   ` Daisuke Nishimura
  0 siblings, 2 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-16  8:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-mm, Balbir Singh, Li Zefan

On Tue, 16 Jun 2009 15:48:20 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > It has been working well so far, but I will continue to test for more long time.
> > 
> Thank you. I'd like to find out more clean fix, keeping this as an option.
> 
This is cleaned up version. works well in following test.

==
1. mount -t tmpfs /dev/null /mnt/tmpfs
2. mount -t cgroup /dev/null /mnt/cgroups -o memory
3. mkdir /mnt/cgroups/A/
4. echo $$ > /mnt/cgroups/A/tasks
5. echo 4M > /mnt/cgroups/A/memory.limit_in_bytes
5. dd if=/dev/zero of=/mnt/tmpfs/testfile bs=1024 count=30000
 => 26M of swap.
6. echo $$ > /mnt/cgroups/tasks
 => group "A" is empty now.
7-a. while true; do cat /mnt/tmpfs/testfile > /dev/null;done

In ohter shell.
7-b. rmdir /mnt/cgroups/A
==
Of course, you have more compliated ones..

the patch seems a bit long but most of patch is comment..

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

In general, cgroup is for tasks and cgroup subsys(css)'s refcnt is maintained
per objects related to tasks. But memcg counts refcnt per pages because
the pointer to css is recorded per page. And more, hierarchy-management support
requires safe css refcnt management while a rmdir() is ongoing.

css_tryget()/css_put() is for such purpose and works well.

But, frequent css_put()/get() tends to prevent rmdir() and users can see
EBUSY very often. To fix that, waitqueue-for-rmdir was introduced and
rmdir() can work in synchronous way with cgroup subsytems. But this logic
expects "refcnt obtained by css_tryget() is temporal and will go down to
0 soon, then rmdir() will wake up soon."

But memcg's swapin code breaks the assumption. (But necessary...)
This patch try to reuse another anotation of CGRP_WAIT_ON_RMDIR flag to
check whether cgroup is under rmdir if memcg got a *not termporal* refcnt 
by css_tryget().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/cgroup.h |    7 +++++++
 kernel/cgroup.c        |   46 +++++++++++++++++++++++++++-------------------
 mm/memcontrol.c        |   11 +++++++++--
 3 files changed, 43 insertions(+), 21 deletions(-)

Index: linux-2.6.30.org/include/linux/cgroup.h
===================================================================
--- linux-2.6.30.org.orig/include/linux/cgroup.h
+++ linux-2.6.30.org/include/linux/cgroup.h
@@ -365,6 +365,13 @@ int cgroup_task_count(const struct cgrou
 /* Return true if cgrp is a descendant of the task's cgroup */
 int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
 
+void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp);
+static inline void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
+{
+	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
+		__cgroup_wakeup_rmdir_waiters(cgrp);
+}
+
 /*
  * Control Group subsystem type.
  * See Documentation/cgroups/cgroups.txt for details
Index: linux-2.6.30.org/kernel/cgroup.c
===================================================================
--- linux-2.6.30.org.orig/kernel/cgroup.c
+++ linux-2.6.30.org/kernel/cgroup.c
@@ -737,10 +737,9 @@ static void cgroup_d_remove_dir(struct d
  */
 DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
 
-static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
+void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
 {
-	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
-		wake_up_all(&cgroup_rmdir_waitq);
+	wake_up_all(&cgroup_rmdir_waitq);
 }
 
 static int rebind_subsystems(struct cgroupfs_root *root,
@@ -2667,13 +2666,27 @@ static int cgroup_rmdir(struct inode *un
 
 	/* the vfs holds both inode->i_mutex already */
 again:
+	/*
+	 * css_put/get is provided for subsys to grab refcnt to css. In typical
+	 * case, subsystem has no reference after pre_destroy(). But, under
+	 * hierarchy management, some *temporal* refcnt can be hold.
+	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
+	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
+	 * is called when css_put() is called and refcnt goes down to 0.
+	 *
+	 * Subsys can check CGRP_WAIT_ON_RMDIR bit by itself to know
+	 * it's under ongoing rmdir() or not. Because css_tryget() returns false
+	 * only after css->refcnt returns 0, checking this bit is useful when
+	 * css' refcnt seems to be not temporal.
+	 */
+	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
+	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
+
 	mutex_lock(&cgroup_mutex);
-	if (atomic_read(&cgrp->count) != 0) {
-		mutex_unlock(&cgroup_mutex);
-		return -EBUSY;
-	}
-	if (!list_empty(&cgrp->children)) {
+	if (atomic_read(&cgrp->count) != 0 || !list_empty(&cgrp->children)) {
 		mutex_unlock(&cgroup_mutex);
+		finish_wait(&cgroup_rmdir_waitq, &wait);
+		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		return -EBUSY;
 	}
 	mutex_unlock(&cgroup_mutex);
@@ -2683,25 +2696,20 @@ again:
 	 * that rmdir() request comes.
 	 */
 	ret = cgroup_call_pre_destroy(cgrp);
-	if (ret)
+	if (ret) {
+		finish_wait(&cgroup_rmdir_waitq, &wait);
+		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		return ret;
+	}
 
 	mutex_lock(&cgroup_mutex);
 	parent = cgrp->parent;
 	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
 		mutex_unlock(&cgroup_mutex);
+		finish_wait(&cgroup_rmdir_waitq, &wait);
+		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		return -EBUSY;
 	}
-	/*
-	 * css_put/get is provided for subsys to grab refcnt to css. In typical
-	 * case, subsystem has no reference after pre_destroy(). But, under
-	 * hierarchy management, some *temporal* refcnt can be hold.
-	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
-	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
-	 * is called when css_put() is called and refcnt goes down to 0.
-	 */
-	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
-	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
 
 	if (!cgroup_clear_css_refs(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
Index: linux-2.6.30.org/mm/memcontrol.c
===================================================================
--- linux-2.6.30.org.orig/mm/memcontrol.c
+++ linux-2.6.30.org/mm/memcontrol.c
@@ -1338,6 +1338,7 @@ __mem_cgroup_commit_charge_swapin(struct
 		return;
 	if (!ptr)
 		return;
+	css_get(&ptr->css);
 	pc = lookup_page_cgroup(page);
 	mem_cgroup_lru_del_before_commit_swapcache(page);
 	__mem_cgroup_commit_charge(ptr, pc, ctype);
@@ -1367,8 +1368,14 @@ __mem_cgroup_commit_charge_swapin(struct
 		}
 		rcu_read_unlock();
 	}
-	/* add this page(page_cgroup) to the LRU we want. */
-
+	/*
+	 * Because we charged against a cgroup which is obtained by record
+	 * in swap_cgroup, not by task, there is a possibility that someone is
+	 * waiting for rmdir. This happens when a swap entry is shared
+	 * among cgroups. After wakeup, pre_destroy() will be called again.
+	 */
+	cgroup_wakeup_rmdir_waiters(&ptr->css.cgroup);
+	css_put(&ptr->css);
 }
 
 void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-16  8:44                 ` KAMEZAWA Hiroyuki
@ 2009-06-17  4:56                   ` Balbir Singh
  2009-06-17  5:11                     ` KAMEZAWA Hiroyuki
  2009-06-18  3:03                   ` Daisuke Nishimura
  1 sibling, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2009-06-17  4:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-mm, Li Zefan

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-06-16 17:44:36]:

> On Tue, 16 Jun 2009 15:48:20 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > > It has been working well so far, but I will continue to test for more long time.
> > > 
> > Thank you. I'd like to find out more clean fix, keeping this as an option.
> > 
> This is cleaned up version. works well in following test.
> 
> ==
> 1. mount -t tmpfs /dev/null /mnt/tmpfs
> 2. mount -t cgroup /dev/null /mnt/cgroups -o memory
> 3. mkdir /mnt/cgroups/A/
> 4. echo $$ > /mnt/cgroups/A/tasks
> 5. echo 4M > /mnt/cgroups/A/memory.limit_in_bytes
> 5. dd if=/dev/zero of=/mnt/tmpfs/testfile bs=1024 count=30000
>  => 26M of swap.
> 6. echo $$ > /mnt/cgroups/tasks
>  => group "A" is empty now.
> 7-a. while true; do cat /mnt/tmpfs/testfile > /dev/null;done
> 
> In ohter shell.
> 7-b. rmdir /mnt/cgroups/A
> ==
> Of course, you have more compliated ones..
> 
> the patch seems a bit long but most of patch is comment..
> 
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> In general, cgroup is for tasks and cgroup subsys(css)'s refcnt is maintained
> per objects related to tasks. But memcg counts refcnt per pages because
> the pointer to css is recorded per page. And more, hierarchy-management support
> requires safe css refcnt management while a rmdir() is ongoing.
> 
> css_tryget()/css_put() is for such purpose and works well.
> 
> But, frequent css_put()/get() tends to prevent rmdir() and users can see
> EBUSY very often. To fix that, waitqueue-for-rmdir was introduced and
> rmdir() can work in synchronous way with cgroup subsytems. But this logic
> expects "refcnt obtained by css_tryget() is temporal and will go down to
> 0 soon, then rmdir() will wake up soon."
> 
> But memcg's swapin code breaks the assumption. (But necessary...)
> This patch try to reuse another anotation of CGRP_WAIT_ON_RMDIR flag to
> check whether cgroup is under rmdir if memcg got a *not termporal* refcnt 
> by css_tryget().
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/cgroup.h |    7 +++++++
>  kernel/cgroup.c        |   46 +++++++++++++++++++++++++++-------------------
>  mm/memcontrol.c        |   11 +++++++++--
>  3 files changed, 43 insertions(+), 21 deletions(-)
> 
> Index: linux-2.6.30.org/include/linux/cgroup.h
> ===================================================================
> --- linux-2.6.30.org.orig/include/linux/cgroup.h
> +++ linux-2.6.30.org/include/linux/cgroup.h
> @@ -365,6 +365,13 @@ int cgroup_task_count(const struct cgrou
>  /* Return true if cgrp is a descendant of the task's cgroup */
>  int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
> 
> +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp);
> +static inline void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> +{
> +	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> +		__cgroup_wakeup_rmdir_waiters(cgrp);
> +}
> +
>  /*
>   * Control Group subsystem type.
>   * See Documentation/cgroups/cgroups.txt for details
> Index: linux-2.6.30.org/kernel/cgroup.c
> ===================================================================
> --- linux-2.6.30.org.orig/kernel/cgroup.c
> +++ linux-2.6.30.org/kernel/cgroup.c
> @@ -737,10 +737,9 @@ static void cgroup_d_remove_dir(struct d
>   */
>  DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> 
> -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
>  {
> -	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> -		wake_up_all(&cgroup_rmdir_waitq);
> +	wake_up_all(&cgroup_rmdir_waitq);
>  }
> 
>  static int rebind_subsystems(struct cgroupfs_root *root,
> @@ -2667,13 +2666,27 @@ static int cgroup_rmdir(struct inode *un
> 
>  	/* the vfs holds both inode->i_mutex already */
>  again:
> +	/*
> +	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> +	 * case, subsystem has no reference after pre_destroy(). But, under
> +	 * hierarchy management, some *temporal* refcnt can be hold.
> +	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> +	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> +	 * is called when css_put() is called and refcnt goes down to 0.
> +	 *
> +	 * Subsys can check CGRP_WAIT_ON_RMDIR bit by itself to know
> +	 * it's under ongoing rmdir() or not. Because css_tryget() returns false
> +	 * only after css->refcnt returns 0, checking this bit is useful when
> +	 * css' refcnt seems to be not temporal.
> +	 */
> +	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> +	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> +
>  	mutex_lock(&cgroup_mutex);
> -	if (atomic_read(&cgrp->count) != 0) {
> -		mutex_unlock(&cgroup_mutex);
> -		return -EBUSY;
> -	}
> -	if (!list_empty(&cgrp->children)) {
> +	if (atomic_read(&cgrp->count) != 0 || !list_empty(&cgrp->children)) {
>  		mutex_unlock(&cgroup_mutex);
> +		finish_wait(&cgroup_rmdir_waitq, &wait);
> +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		return -EBUSY;
>  	}
>  	mutex_unlock(&cgroup_mutex);
> @@ -2683,25 +2696,20 @@ again:
>  	 * that rmdir() request comes.
>  	 */
>  	ret = cgroup_call_pre_destroy(cgrp);
> -	if (ret)
> +	if (ret) {
> +		finish_wait(&cgroup_rmdir_waitq, &wait);
> +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		return ret;
> +	}
> 
>  	mutex_lock(&cgroup_mutex);
>  	parent = cgrp->parent;
>  	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
>  		mutex_unlock(&cgroup_mutex);
> +		finish_wait(&cgroup_rmdir_waitq, &wait);
> +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		return -EBUSY;
>  	}
> -	/*
> -	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> -	 * case, subsystem has no reference after pre_destroy(). But, under
> -	 * hierarchy management, some *temporal* refcnt can be hold.
> -	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> -	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> -	 * is called when css_put() is called and refcnt goes down to 0.
> -	 */
> -	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> -	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> 
>  	if (!cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
> Index: linux-2.6.30.org/mm/memcontrol.c
> ===================================================================
> --- linux-2.6.30.org.orig/mm/memcontrol.c
> +++ linux-2.6.30.org/mm/memcontrol.c
> @@ -1338,6 +1338,7 @@ __mem_cgroup_commit_charge_swapin(struct
>  		return;
>  	if (!ptr)
>  		return;
> +	css_get(&ptr->css);
>  	pc = lookup_page_cgroup(page);
>  	mem_cgroup_lru_del_before_commit_swapcache(page);
>  	__mem_cgroup_commit_charge(ptr, pc, ctype);
> @@ -1367,8 +1368,14 @@ __mem_cgroup_commit_charge_swapin(struct
>  		}
>  		rcu_read_unlock();
>  	}
> -	/* add this page(page_cgroup) to the LRU we want. */
> -
> +	/*
> +	 * Because we charged against a cgroup which is obtained by record
> +	 * in swap_cgroup, not by task, there is a possibility that someone is
> +	 * waiting for rmdir. This happens when a swap entry is shared
> +	 * among cgroups. After wakeup, pre_destroy() will be called again.
> +	 */
> +	cgroup_wakeup_rmdir_waiters(&ptr->css.cgroup);
> +	css_put(&ptr->css);
>  }
> 
>  void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> 
>

I am a little confused by the infrastructure. I think it is OK to
return -EBUSY and then use notify_on_release to figure out when to
free the cgroup. Are we moving away from the design?
 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-17  4:56                   ` Balbir Singh
@ 2009-06-17  5:11                     ` KAMEZAWA Hiroyuki
  2009-06-17  5:49                       ` Balbir Singh
  0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-17  5:11 UTC (permalink / raw)
  To: balbir; +Cc: Daisuke Nishimura, linux-mm, Li Zefan

On Wed, 17 Jun 2009 10:26:43 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-06-16 17:44:36]:
> 
> > On Tue, 16 Jun 2009 15:48:20 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > > It has been working well so far, but I will continue to test for more long time.
> > > > 
> > > Thank you. I'd like to find out more clean fix, keeping this as an option.
> > > 
> > This is cleaned up version. works well in following test.
> > 
> > ==
> > 1. mount -t tmpfs /dev/null /mnt/tmpfs
> > 2. mount -t cgroup /dev/null /mnt/cgroups -o memory
> > 3. mkdir /mnt/cgroups/A/
> > 4. echo $$ > /mnt/cgroups/A/tasks
> > 5. echo 4M > /mnt/cgroups/A/memory.limit_in_bytes
> > 5. dd if=/dev/zero of=/mnt/tmpfs/testfile bs=1024 count=30000
> >  => 26M of swap.
> > 6. echo $$ > /mnt/cgroups/tasks
> >  => group "A" is empty now.
> > 7-a. while true; do cat /mnt/tmpfs/testfile > /dev/null;done
> > 
> > In ohter shell.
> > 7-b. rmdir /mnt/cgroups/A
> > ==
> > Of course, you have more compliated ones..
> > 
> > the patch seems a bit long but most of patch is comment..
> > 
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > In general, cgroup is for tasks and cgroup subsys(css)'s refcnt is maintained
> > per objects related to tasks. But memcg counts refcnt per pages because
> > the pointer to css is recorded per page. And more, hierarchy-management support
> > requires safe css refcnt management while a rmdir() is ongoing.
> > 
> > css_tryget()/css_put() is for such purpose and works well.
> > 
> > But, frequent css_put()/get() tends to prevent rmdir() and users can see
> > EBUSY very often. To fix that, waitqueue-for-rmdir was introduced and
> > rmdir() can work in synchronous way with cgroup subsytems. But this logic
> > expects "refcnt obtained by css_tryget() is temporal and will go down to
> > 0 soon, then rmdir() will wake up soon."
> > 
> > But memcg's swapin code breaks the assumption. (But necessary...)
> > This patch try to reuse another anotation of CGRP_WAIT_ON_RMDIR flag to
> > check whether cgroup is under rmdir if memcg got a *not termporal* refcnt 
> > by css_tryget().
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/cgroup.h |    7 +++++++
> >  kernel/cgroup.c        |   46 +++++++++++++++++++++++++++-------------------
> >  mm/memcontrol.c        |   11 +++++++++--
> >  3 files changed, 43 insertions(+), 21 deletions(-)
> > 
> > Index: linux-2.6.30.org/include/linux/cgroup.h
> > ===================================================================
> > --- linux-2.6.30.org.orig/include/linux/cgroup.h
> > +++ linux-2.6.30.org/include/linux/cgroup.h
> > @@ -365,6 +365,13 @@ int cgroup_task_count(const struct cgrou
> >  /* Return true if cgrp is a descendant of the task's cgroup */
> >  int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
> > 
> > +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp);
> > +static inline void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> > +{
> > +	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > +		__cgroup_wakeup_rmdir_waiters(cgrp);
> > +}
> > +
> >  /*
> >   * Control Group subsystem type.
> >   * See Documentation/cgroups/cgroups.txt for details
> > Index: linux-2.6.30.org/kernel/cgroup.c
> > ===================================================================
> > --- linux-2.6.30.org.orig/kernel/cgroup.c
> > +++ linux-2.6.30.org/kernel/cgroup.c
> > @@ -737,10 +737,9 @@ static void cgroup_d_remove_dir(struct d
> >   */
> >  DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> > 
> > -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> > +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> >  {
> > -	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > -		wake_up_all(&cgroup_rmdir_waitq);
> > +	wake_up_all(&cgroup_rmdir_waitq);
> >  }
> > 
> >  static int rebind_subsystems(struct cgroupfs_root *root,
> > @@ -2667,13 +2666,27 @@ static int cgroup_rmdir(struct inode *un
> > 
> >  	/* the vfs holds both inode->i_mutex already */
> >  again:
> > +	/*
> > +	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> > +	 * case, subsystem has no reference after pre_destroy(). But, under
> > +	 * hierarchy management, some *temporal* refcnt can be hold.
> > +	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > +	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > +	 * is called when css_put() is called and refcnt goes down to 0.
> > +	 *
> > +	 * Subsys can check CGRP_WAIT_ON_RMDIR bit by itself to know
> > +	 * it's under ongoing rmdir() or not. Because css_tryget() returns false
> > +	 * only after css->refcnt returns 0, checking this bit is useful when
> > +	 * css' refcnt seems to be not temporal.
> > +	 */
> > +	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > +	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> > +
> >  	mutex_lock(&cgroup_mutex);
> > -	if (atomic_read(&cgrp->count) != 0) {
> > -		mutex_unlock(&cgroup_mutex);
> > -		return -EBUSY;
> > -	}
> > -	if (!list_empty(&cgrp->children)) {
> > +	if (atomic_read(&cgrp->count) != 0 || !list_empty(&cgrp->children)) {
> >  		mutex_unlock(&cgroup_mutex);
> > +		finish_wait(&cgroup_rmdir_waitq, &wait);
> > +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> >  		return -EBUSY;
> >  	}
> >  	mutex_unlock(&cgroup_mutex);
> > @@ -2683,25 +2696,20 @@ again:
> >  	 * that rmdir() request comes.
> >  	 */
> >  	ret = cgroup_call_pre_destroy(cgrp);
> > -	if (ret)
> > +	if (ret) {
> > +		finish_wait(&cgroup_rmdir_waitq, &wait);
> > +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> >  		return ret;
> > +	}
> > 
> >  	mutex_lock(&cgroup_mutex);
> >  	parent = cgrp->parent;
> >  	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> >  		mutex_unlock(&cgroup_mutex);
> > +		finish_wait(&cgroup_rmdir_waitq, &wait);
> > +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> >  		return -EBUSY;
> >  	}
> > -	/*
> > -	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> > -	 * case, subsystem has no reference after pre_destroy(). But, under
> > -	 * hierarchy management, some *temporal* refcnt can be hold.
> > -	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > -	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > -	 * is called when css_put() is called and refcnt goes down to 0.
> > -	 */
> > -	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > -	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> > 
> >  	if (!cgroup_clear_css_refs(cgrp)) {
> >  		mutex_unlock(&cgroup_mutex);
> > Index: linux-2.6.30.org/mm/memcontrol.c
> > ===================================================================
> > --- linux-2.6.30.org.orig/mm/memcontrol.c
> > +++ linux-2.6.30.org/mm/memcontrol.c
> > @@ -1338,6 +1338,7 @@ __mem_cgroup_commit_charge_swapin(struct
> >  		return;
> >  	if (!ptr)
> >  		return;
> > +	css_get(&ptr->css);
> >  	pc = lookup_page_cgroup(page);
> >  	mem_cgroup_lru_del_before_commit_swapcache(page);
> >  	__mem_cgroup_commit_charge(ptr, pc, ctype);
> > @@ -1367,8 +1368,14 @@ __mem_cgroup_commit_charge_swapin(struct
> >  		}
> >  		rcu_read_unlock();
> >  	}
> > -	/* add this page(page_cgroup) to the LRU we want. */
> > -
> > +	/*
> > +	 * Because we charged against a cgroup which is obtained by record
> > +	 * in swap_cgroup, not by task, there is a possibility that someone is
> > +	 * waiting for rmdir. This happens when a swap entry is shared
> > +	 * among cgroups. After wakeup, pre_destroy() will be called again.
> > +	 */
> > +	cgroup_wakeup_rmdir_waiters(&ptr->css.cgroup);
> > +	css_put(&ptr->css);
> >  }
> > 
> >  void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > 
> >
> 
> I am a little confused by the infrastructure. I think it is OK to
> return -EBUSY and then use notify_on_release to figure out when to
> free the cgroup. Are we moving away from the design?
>  

When you use cgroup with _hierarchy_ in real world, rmdir returns -EBUSY very
very often. (Because hierarchy dramatically increase chance to css_get() to empty
cgroup.)

rmdir against cgroup should succeed if there are no tasks. It's well documented.
Without this rmdir returns -EBUSY because of the kernel's internal reason which
the user can't do anything against. It is too bad. So, I added waitq.

Have you seen a shell scirpt which checks rmdir()'s return code ?
I've never seen and don't like following scripts.
==
while [ -d /cgroup/A/B ] ; do
	# we can remove this directly without error only when we're lucky.
	rmdir /cgroup/A/B; 
	if [ -d /cgroup/A/B ]; then
		tasks = (< /cgroup/A/B/tasks)
		if [ -n $file ] && exit then  # do we really busy ?
		# check other things...	
	fi
done
==

Thanks,
-Kame







--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-17  5:11                     ` KAMEZAWA Hiroyuki
@ 2009-06-17  5:49                       ` Balbir Singh
  2009-06-17  6:27                         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2009-06-17  5:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-mm, Li Zefan

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-06-17 14:11:09]:

> On Wed, 17 Jun 2009 10:26:43 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-06-16 17:44:36]:
> > 
> > > On Tue, 16 Jun 2009 15:48:20 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > > It has been working well so far, but I will continue to test for more long time.
> > > > > 
> > > > Thank you. I'd like to find out more clean fix, keeping this as an option.
> > > > 
> > > This is cleaned up version. works well in following test.
> > > 
> > > ==
> > > 1. mount -t tmpfs /dev/null /mnt/tmpfs
> > > 2. mount -t cgroup /dev/null /mnt/cgroups -o memory
> > > 3. mkdir /mnt/cgroups/A/
> > > 4. echo $$ > /mnt/cgroups/A/tasks
> > > 5. echo 4M > /mnt/cgroups/A/memory.limit_in_bytes
> > > 5. dd if=/dev/zero of=/mnt/tmpfs/testfile bs=1024 count=30000
> > >  => 26M of swap.
> > > 6. echo $$ > /mnt/cgroups/tasks
> > >  => group "A" is empty now.
> > > 7-a. while true; do cat /mnt/tmpfs/testfile > /dev/null;done
> > > 
> > > In ohter shell.
> > > 7-b. rmdir /mnt/cgroups/A
> > > ==
> > > Of course, you have more compliated ones..
> > > 
> > > the patch seems a bit long but most of patch is comment..
> > > 
> > > ==
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > > In general, cgroup is for tasks and cgroup subsys(css)'s refcnt is maintained
> > > per objects related to tasks. But memcg counts refcnt per pages because
> > > the pointer to css is recorded per page. And more, hierarchy-management support
> > > requires safe css refcnt management while a rmdir() is ongoing.
> > > 
> > > css_tryget()/css_put() is for such purpose and works well.
> > > 
> > > But, frequent css_put()/get() tends to prevent rmdir() and users can see
> > > EBUSY very often. To fix that, waitqueue-for-rmdir was introduced and
> > > rmdir() can work in synchronous way with cgroup subsytems. But this logic
> > > expects "refcnt obtained by css_tryget() is temporal and will go down to
> > > 0 soon, then rmdir() will wake up soon."
> > > 
> > > But memcg's swapin code breaks the assumption. (But necessary...)
> > > This patch try to reuse another anotation of CGRP_WAIT_ON_RMDIR flag to
> > > check whether cgroup is under rmdir if memcg got a *not termporal* refcnt 
> > > by css_tryget().
> > > 
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > ---
> > >  include/linux/cgroup.h |    7 +++++++
> > >  kernel/cgroup.c        |   46 +++++++++++++++++++++++++++-------------------
> > >  mm/memcontrol.c        |   11 +++++++++--
> > >  3 files changed, 43 insertions(+), 21 deletions(-)
> > > 
> > > Index: linux-2.6.30.org/include/linux/cgroup.h
> > > ===================================================================
> > > --- linux-2.6.30.org.orig/include/linux/cgroup.h
> > > +++ linux-2.6.30.org/include/linux/cgroup.h
> > > @@ -365,6 +365,13 @@ int cgroup_task_count(const struct cgrou
> > >  /* Return true if cgrp is a descendant of the task's cgroup */
> > >  int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
> > > 
> > > +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp);
> > > +static inline void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> > > +{
> > > +	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > > +		__cgroup_wakeup_rmdir_waiters(cgrp);
> > > +}
> > > +
> > >  /*
> > >   * Control Group subsystem type.
> > >   * See Documentation/cgroups/cgroups.txt for details
> > > Index: linux-2.6.30.org/kernel/cgroup.c
> > > ===================================================================
> > > --- linux-2.6.30.org.orig/kernel/cgroup.c
> > > +++ linux-2.6.30.org/kernel/cgroup.c
> > > @@ -737,10 +737,9 @@ static void cgroup_d_remove_dir(struct d
> > >   */
> > >  DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> > > 
> > > -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> > > +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> > >  {
> > > -	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > > -		wake_up_all(&cgroup_rmdir_waitq);
> > > +	wake_up_all(&cgroup_rmdir_waitq);
> > >  }
> > > 
> > >  static int rebind_subsystems(struct cgroupfs_root *root,
> > > @@ -2667,13 +2666,27 @@ static int cgroup_rmdir(struct inode *un
> > > 
> > >  	/* the vfs holds both inode->i_mutex already */
> > >  again:
> > > +	/*
> > > +	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> > > +	 * case, subsystem has no reference after pre_destroy(). But, under
> > > +	 * hierarchy management, some *temporal* refcnt can be hold.
> > > +	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > > +	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > > +	 * is called when css_put() is called and refcnt goes down to 0.
> > > +	 *
> > > +	 * Subsys can check CGRP_WAIT_ON_RMDIR bit by itself to know
> > > +	 * it's under ongoing rmdir() or not. Because css_tryget() returns false
> > > +	 * only after css->refcnt returns 0, checking this bit is useful when
> > > +	 * css' refcnt seems to be not temporal.
> > > +	 */
> > > +	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > +	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> > > +
> > >  	mutex_lock(&cgroup_mutex);
> > > -	if (atomic_read(&cgrp->count) != 0) {
> > > -		mutex_unlock(&cgroup_mutex);
> > > -		return -EBUSY;
> > > -	}
> > > -	if (!list_empty(&cgrp->children)) {
> > > +	if (atomic_read(&cgrp->count) != 0 || !list_empty(&cgrp->children)) {
> > >  		mutex_unlock(&cgroup_mutex);
> > > +		finish_wait(&cgroup_rmdir_waitq, &wait);
> > > +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > >  		return -EBUSY;
> > >  	}
> > >  	mutex_unlock(&cgroup_mutex);
> > > @@ -2683,25 +2696,20 @@ again:
> > >  	 * that rmdir() request comes.
> > >  	 */
> > >  	ret = cgroup_call_pre_destroy(cgrp);
> > > -	if (ret)
> > > +	if (ret) {
> > > +		finish_wait(&cgroup_rmdir_waitq, &wait);
> > > +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > >  		return ret;
> > > +	}
> > > 
> > >  	mutex_lock(&cgroup_mutex);
> > >  	parent = cgrp->parent;
> > >  	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> > >  		mutex_unlock(&cgroup_mutex);
> > > +		finish_wait(&cgroup_rmdir_waitq, &wait);
> > > +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > >  		return -EBUSY;
> > >  	}
> > > -	/*
> > > -	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> > > -	 * case, subsystem has no reference after pre_destroy(). But, under
> > > -	 * hierarchy management, some *temporal* refcnt can be hold.
> > > -	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > > -	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > > -	 * is called when css_put() is called and refcnt goes down to 0.
> > > -	 */
> > > -	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > -	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> > > 
> > >  	if (!cgroup_clear_css_refs(cgrp)) {
> > >  		mutex_unlock(&cgroup_mutex);
> > > Index: linux-2.6.30.org/mm/memcontrol.c
> > > ===================================================================
> > > --- linux-2.6.30.org.orig/mm/memcontrol.c
> > > +++ linux-2.6.30.org/mm/memcontrol.c
> > > @@ -1338,6 +1338,7 @@ __mem_cgroup_commit_charge_swapin(struct
> > >  		return;
> > >  	if (!ptr)
> > >  		return;
> > > +	css_get(&ptr->css);
> > >  	pc = lookup_page_cgroup(page);
> > >  	mem_cgroup_lru_del_before_commit_swapcache(page);
> > >  	__mem_cgroup_commit_charge(ptr, pc, ctype);
> > > @@ -1367,8 +1368,14 @@ __mem_cgroup_commit_charge_swapin(struct
> > >  		}
> > >  		rcu_read_unlock();
> > >  	}
> > > -	/* add this page(page_cgroup) to the LRU we want. */
> > > -
> > > +	/*
> > > +	 * Because we charged against a cgroup which is obtained by record
> > > +	 * in swap_cgroup, not by task, there is a possibility that someone is
> > > +	 * waiting for rmdir. This happens when a swap entry is shared
> > > +	 * among cgroups. After wakeup, pre_destroy() will be called again.
> > > +	 */
> > > +	cgroup_wakeup_rmdir_waiters(&ptr->css.cgroup);
> > > +	css_put(&ptr->css);
> > >  }
> > > 
> > >  void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > > 
> > >
> > 
> > I am a little confused by the infrastructure. I think it is OK to
> > return -EBUSY and then use notify_on_release to figure out when to
> > free the cgroup. Are we moving away from the design?
> >  
> 
> When you use cgroup with _hierarchy_ in real world, rmdir returns -EBUSY very
> very often. (Because hierarchy dramatically increase chance to css_get() to empty
> cgroup.)
> 

I completely understand that part, but I thought we had solved this
problem through notify_on_release and release_agent. I am not against
it, but I want to know why release_agent will not solve this problem
(is it too loose for control?)

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-17  5:49                       ` Balbir Singh
@ 2009-06-17  6:27                         ` KAMEZAWA Hiroyuki
  2009-06-17  7:35                           ` Balbir Singh
  0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-17  6:27 UTC (permalink / raw)
  To: balbir; +Cc: Daisuke Nishimura, linux-mm, Li Zefan

On Wed, 17 Jun 2009 11:19:55 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-06-17 14:11:09]:
> 
> > On Wed, 17 Jun 2009 10:26:43 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-06-16 17:44:36]:
> > > 
> > > > On Tue, 16 Jun 2009 15:48:20 +0900
> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > 
> > > > > > It has been working well so far, but I will continue to test for more long time.
> > > > > > 
> > > > > Thank you. I'd like to find out more clean fix, keeping this as an option.
> > > > > 
> > > > This is cleaned up version. works well in following test.
> > > > 
> > > > ==
> > > > 1. mount -t tmpfs /dev/null /mnt/tmpfs
> > > > 2. mount -t cgroup /dev/null /mnt/cgroups -o memory
> > > > 3. mkdir /mnt/cgroups/A/
> > > > 4. echo $$ > /mnt/cgroups/A/tasks
> > > > 5. echo 4M > /mnt/cgroups/A/memory.limit_in_bytes
> > > > 5. dd if=/dev/zero of=/mnt/tmpfs/testfile bs=1024 count=30000
> > > >  => 26M of swap.
> > > > 6. echo $$ > /mnt/cgroups/tasks
> > > >  => group "A" is empty now.
> > > > 7-a. while true; do cat /mnt/tmpfs/testfile > /dev/null;done
> > > > 
> > > > In ohter shell.
> > > > 7-b. rmdir /mnt/cgroups/A
> > > > ==
> > > > Of course, you have more compliated ones..
> > > > 
> > > > the patch seems a bit long but most of patch is comment..
> > > > 
> > > > ==
> > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > 
> > > > In general, cgroup is for tasks and cgroup subsys(css)'s refcnt is maintained
> > > > per objects related to tasks. But memcg counts refcnt per pages because
> > > > the pointer to css is recorded per page. And more, hierarchy-management support
> > > > requires safe css refcnt management while a rmdir() is ongoing.
> > > > 
> > > > css_tryget()/css_put() is for such purpose and works well.
> > > > 
> > > > But, frequent css_put()/get() tends to prevent rmdir() and users can see
> > > > EBUSY very often. To fix that, waitqueue-for-rmdir was introduced and
> > > > rmdir() can work in synchronous way with cgroup subsytems. But this logic
> > > > expects "refcnt obtained by css_tryget() is temporal and will go down to
> > > > 0 soon, then rmdir() will wake up soon."
> > > > 
> > > > But memcg's swapin code breaks the assumption. (But necessary...)
> > > > This patch try to reuse another anotation of CGRP_WAIT_ON_RMDIR flag to
> > > > check whether cgroup is under rmdir if memcg got a *not termporal* refcnt 
> > > > by css_tryget().
> > > > 
> > > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > ---
> > > >  include/linux/cgroup.h |    7 +++++++
> > > >  kernel/cgroup.c        |   46 +++++++++++++++++++++++++++-------------------
> > > >  mm/memcontrol.c        |   11 +++++++++--
> > > >  3 files changed, 43 insertions(+), 21 deletions(-)
> > > > 
> > > > Index: linux-2.6.30.org/include/linux/cgroup.h
> > > > ===================================================================
> > > > --- linux-2.6.30.org.orig/include/linux/cgroup.h
> > > > +++ linux-2.6.30.org/include/linux/cgroup.h
> > > > @@ -365,6 +365,13 @@ int cgroup_task_count(const struct cgrou
> > > >  /* Return true if cgrp is a descendant of the task's cgroup */
> > > >  int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
> > > > 
> > > > +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp);
> > > > +static inline void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> > > > +{
> > > > +	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > > > +		__cgroup_wakeup_rmdir_waiters(cgrp);
> > > > +}
> > > > +
> > > >  /*
> > > >   * Control Group subsystem type.
> > > >   * See Documentation/cgroups/cgroups.txt for details
> > > > Index: linux-2.6.30.org/kernel/cgroup.c
> > > > ===================================================================
> > > > --- linux-2.6.30.org.orig/kernel/cgroup.c
> > > > +++ linux-2.6.30.org/kernel/cgroup.c
> > > > @@ -737,10 +737,9 @@ static void cgroup_d_remove_dir(struct d
> > > >   */
> > > >  DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> > > > 
> > > > -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> > > > +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> > > >  {
> > > > -	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > > > -		wake_up_all(&cgroup_rmdir_waitq);
> > > > +	wake_up_all(&cgroup_rmdir_waitq);
> > > >  }
> > > > 
> > > >  static int rebind_subsystems(struct cgroupfs_root *root,
> > > > @@ -2667,13 +2666,27 @@ static int cgroup_rmdir(struct inode *un
> > > > 
> > > >  	/* the vfs holds both inode->i_mutex already */
> > > >  again:
> > > > +	/*
> > > > +	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> > > > +	 * case, subsystem has no reference after pre_destroy(). But, under
> > > > +	 * hierarchy management, some *temporal* refcnt can be hold.
> > > > +	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > > > +	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > > > +	 * is called when css_put() is called and refcnt goes down to 0.
> > > > +	 *
> > > > +	 * Subsys can check CGRP_WAIT_ON_RMDIR bit by itself to know
> > > > +	 * it's under ongoing rmdir() or not. Because css_tryget() returns false
> > > > +	 * only after css->refcnt returns 0, checking this bit is useful when
> > > > +	 * css' refcnt seems to be not temporal.
> > > > +	 */
> > > > +	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > > +	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> > > > +
> > > >  	mutex_lock(&cgroup_mutex);
> > > > -	if (atomic_read(&cgrp->count) != 0) {
> > > > -		mutex_unlock(&cgroup_mutex);
> > > > -		return -EBUSY;
> > > > -	}
> > > > -	if (!list_empty(&cgrp->children)) {
> > > > +	if (atomic_read(&cgrp->count) != 0 || !list_empty(&cgrp->children)) {
> > > >  		mutex_unlock(&cgroup_mutex);
> > > > +		finish_wait(&cgroup_rmdir_waitq, &wait);
> > > > +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > >  		return -EBUSY;
> > > >  	}
> > > >  	mutex_unlock(&cgroup_mutex);
> > > > @@ -2683,25 +2696,20 @@ again:
> > > >  	 * that rmdir() request comes.
> > > >  	 */
> > > >  	ret = cgroup_call_pre_destroy(cgrp);
> > > > -	if (ret)
> > > > +	if (ret) {
> > > > +		finish_wait(&cgroup_rmdir_waitq, &wait);
> > > > +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > >  		return ret;
> > > > +	}
> > > > 
> > > >  	mutex_lock(&cgroup_mutex);
> > > >  	parent = cgrp->parent;
> > > >  	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> > > >  		mutex_unlock(&cgroup_mutex);
> > > > +		finish_wait(&cgroup_rmdir_waitq, &wait);
> > > > +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > >  		return -EBUSY;
> > > >  	}
> > > > -	/*
> > > > -	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> > > > -	 * case, subsystem has no reference after pre_destroy(). But, under
> > > > -	 * hierarchy management, some *temporal* refcnt can be hold.
> > > > -	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > > > -	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > > > -	 * is called when css_put() is called and refcnt goes down to 0.
> > > > -	 */
> > > > -	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > > -	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> > > > 
> > > >  	if (!cgroup_clear_css_refs(cgrp)) {
> > > >  		mutex_unlock(&cgroup_mutex);
> > > > Index: linux-2.6.30.org/mm/memcontrol.c
> > > > ===================================================================
> > > > --- linux-2.6.30.org.orig/mm/memcontrol.c
> > > > +++ linux-2.6.30.org/mm/memcontrol.c
> > > > @@ -1338,6 +1338,7 @@ __mem_cgroup_commit_charge_swapin(struct
> > > >  		return;
> > > >  	if (!ptr)
> > > >  		return;
> > > > +	css_get(&ptr->css);
> > > >  	pc = lookup_page_cgroup(page);
> > > >  	mem_cgroup_lru_del_before_commit_swapcache(page);
> > > >  	__mem_cgroup_commit_charge(ptr, pc, ctype);
> > > > @@ -1367,8 +1368,14 @@ __mem_cgroup_commit_charge_swapin(struct
> > > >  		}
> > > >  		rcu_read_unlock();
> > > >  	}
> > > > -	/* add this page(page_cgroup) to the LRU we want. */
> > > > -
> > > > +	/*
> > > > +	 * Because we charged against a cgroup which is obtained by record
> > > > +	 * in swap_cgroup, not by task, there is a possibility that someone is
> > > > +	 * waiting for rmdir. This happens when a swap entry is shared
> > > > +	 * among cgroups. After wakeup, pre_destroy() will be called again.
> > > > +	 */
> > > > +	cgroup_wakeup_rmdir_waiters(&ptr->css.cgroup);
> > > > +	css_put(&ptr->css);
> > > >  }
> > > > 
> > > >  void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > > > 
> > > >
> > > 
> > > I am a little confused by the infrastructure. I think it is OK to
> > > return -EBUSY and then use notify_on_release to figure out when to
> > > free the cgroup. Are we moving away from the design?
> > >  
> > 
> > When you use cgroup with _hierarchy_ in real world, rmdir returns -EBUSY very
> > very often. (Because hierarchy dramatically increase chance to css_get() to empty
> > cgroup.)
> > 
> 
> I completely understand that part, but I thought we had solved this
> problem through notify_on_release and release_agent. I am not against
> it, but I want to know why release_agent will not solve this problem
> (is it too loose for control?)
> 

release_agent is not useful with memcg by following reason.

1. release_agent() is called when css->count get to be 0.
2. until force_empty() is called, css->count > 0.
3. to call force_empty, rmdir or memory.force_empty should be kicked.
4. Then, release_agent doesn't work for automatic rmdir.

You have to change release_agent definition/implimentation to rmdir memcg's
directory automatically.

And even if release_agent() is called, it will do rmdir and see -EBUSY.
I don't think multiple calls of release_agent against a cgroup is a sane
activity. 

But under current implementation, release_agent can be called multiple times.
(This is other topic, maybe.)



Thanks,
-Kame











--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-17  6:27                         ` KAMEZAWA Hiroyuki
@ 2009-06-17  7:35                           ` Balbir Singh
  2009-06-17  9:05                             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2009-06-17  7:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-mm, Li Zefan

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-06-17 15:27:48]:

> On Wed, 17 Jun 2009 11:19:55 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-06-17 14:11:09]:
> > 
> > > On Wed, 17 Jun 2009 10:26:43 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-06-16 17:44:36]:
> > > > 
> > > > > On Tue, 16 Jun 2009 15:48:20 +0900
> > > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > 
> > > > > > > It has been working well so far, but I will continue to test for more long time.
> > > > > > > 
> > > > > > Thank you. I'd like to find out more clean fix, keeping this as an option.
> > > > > > 
> > > > > This is cleaned up version. works well in following test.
> > > > > 
> > > > > ==
> > > > > 1. mount -t tmpfs /dev/null /mnt/tmpfs
> > > > > 2. mount -t cgroup /dev/null /mnt/cgroups -o memory
> > > > > 3. mkdir /mnt/cgroups/A/
> > > > > 4. echo $$ > /mnt/cgroups/A/tasks
> > > > > 5. echo 4M > /mnt/cgroups/A/memory.limit_in_bytes
> > > > > 5. dd if=/dev/zero of=/mnt/tmpfs/testfile bs=1024 count=30000
> > > > >  => 26M of swap.
> > > > > 6. echo $$ > /mnt/cgroups/tasks
> > > > >  => group "A" is empty now.
> > > > > 7-a. while true; do cat /mnt/tmpfs/testfile > /dev/null;done
> > > > > 
> > > > > In ohter shell.
> > > > > 7-b. rmdir /mnt/cgroups/A
> > > > > ==
> > > > > Of course, you have more compliated ones..
> > > > > 
> > > > > the patch seems a bit long but most of patch is comment..
> > > > > 
> > > > > ==
> > > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > 
> > > > > In general, cgroup is for tasks and cgroup subsys(css)'s refcnt is maintained
> > > > > per objects related to tasks. But memcg counts refcnt per pages because
> > > > > the pointer to css is recorded per page. And more, hierarchy-management support
> > > > > requires safe css refcnt management while a rmdir() is ongoing.
> > > > > 
> > > > > css_tryget()/css_put() is for such purpose and works well.
> > > > > 
> > > > > But, frequent css_put()/get() tends to prevent rmdir() and users can see
> > > > > EBUSY very often. To fix that, waitqueue-for-rmdir was introduced and
> > > > > rmdir() can work in synchronous way with cgroup subsytems. But this logic
> > > > > expects "refcnt obtained by css_tryget() is temporal and will go down to
> > > > > 0 soon, then rmdir() will wake up soon."
> > > > > 
> > > > > But memcg's swapin code breaks the assumption. (But necessary...)
> > > > > This patch try to reuse another anotation of CGRP_WAIT_ON_RMDIR flag to
> > > > > check whether cgroup is under rmdir if memcg got a *not termporal* refcnt 
> > > > > by css_tryget().
> > > > > 
> > > > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > ---
> > > > >  include/linux/cgroup.h |    7 +++++++
> > > > >  kernel/cgroup.c        |   46 +++++++++++++++++++++++++++-------------------
> > > > >  mm/memcontrol.c        |   11 +++++++++--
> > > > >  3 files changed, 43 insertions(+), 21 deletions(-)
> > > > > 
> > > > > Index: linux-2.6.30.org/include/linux/cgroup.h
> > > > > ===================================================================
> > > > > --- linux-2.6.30.org.orig/include/linux/cgroup.h
> > > > > +++ linux-2.6.30.org/include/linux/cgroup.h
> > > > > @@ -365,6 +365,13 @@ int cgroup_task_count(const struct cgrou
> > > > >  /* Return true if cgrp is a descendant of the task's cgroup */
> > > > >  int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
> > > > > 
> > > > > +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp);
> > > > > +static inline void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> > > > > +{
> > > > > +	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > > > > +		__cgroup_wakeup_rmdir_waiters(cgrp);
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Control Group subsystem type.
> > > > >   * See Documentation/cgroups/cgroups.txt for details
> > > > > Index: linux-2.6.30.org/kernel/cgroup.c
> > > > > ===================================================================
> > > > > --- linux-2.6.30.org.orig/kernel/cgroup.c
> > > > > +++ linux-2.6.30.org/kernel/cgroup.c
> > > > > @@ -737,10 +737,9 @@ static void cgroup_d_remove_dir(struct d
> > > > >   */
> > > > >  DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> > > > > 
> > > > > -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> > > > > +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> > > > >  {
> > > > > -	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > > > > -		wake_up_all(&cgroup_rmdir_waitq);
> > > > > +	wake_up_all(&cgroup_rmdir_waitq);
> > > > >  }
> > > > > 
> > > > >  static int rebind_subsystems(struct cgroupfs_root *root,
> > > > > @@ -2667,13 +2666,27 @@ static int cgroup_rmdir(struct inode *un
> > > > > 
> > > > >  	/* the vfs holds both inode->i_mutex already */
> > > > >  again:
> > > > > +	/*
> > > > > +	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> > > > > +	 * case, subsystem has no reference after pre_destroy(). But, under
> > > > > +	 * hierarchy management, some *temporal* refcnt can be hold.
> > > > > +	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > > > > +	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > > > > +	 * is called when css_put() is called and refcnt goes down to 0.
> > > > > +	 *
> > > > > +	 * Subsys can check CGRP_WAIT_ON_RMDIR bit by itself to know
> > > > > +	 * it's under ongoing rmdir() or not. Because css_tryget() returns false
> > > > > +	 * only after css->refcnt returns 0, checking this bit is useful when
> > > > > +	 * css' refcnt seems to be not temporal.
> > > > > +	 */
> > > > > +	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > > > +	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> > > > > +
> > > > >  	mutex_lock(&cgroup_mutex);
> > > > > -	if (atomic_read(&cgrp->count) != 0) {
> > > > > -		mutex_unlock(&cgroup_mutex);
> > > > > -		return -EBUSY;
> > > > > -	}
> > > > > -	if (!list_empty(&cgrp->children)) {
> > > > > +	if (atomic_read(&cgrp->count) != 0 || !list_empty(&cgrp->children)) {
> > > > >  		mutex_unlock(&cgroup_mutex);
> > > > > +		finish_wait(&cgroup_rmdir_waitq, &wait);
> > > > > +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > > >  		return -EBUSY;
> > > > >  	}
> > > > >  	mutex_unlock(&cgroup_mutex);
> > > > > @@ -2683,25 +2696,20 @@ again:
> > > > >  	 * that rmdir() request comes.
> > > > >  	 */
> > > > >  	ret = cgroup_call_pre_destroy(cgrp);
> > > > > -	if (ret)
> > > > > +	if (ret) {
> > > > > +		finish_wait(&cgroup_rmdir_waitq, &wait);
> > > > > +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > > >  		return ret;
> > > > > +	}
> > > > > 
> > > > >  	mutex_lock(&cgroup_mutex);
> > > > >  	parent = cgrp->parent;
> > > > >  	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> > > > >  		mutex_unlock(&cgroup_mutex);
> > > > > +		finish_wait(&cgroup_rmdir_waitq, &wait);
> > > > > +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > > >  		return -EBUSY;
> > > > >  	}
> > > > > -	/*
> > > > > -	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> > > > > -	 * case, subsystem has no reference after pre_destroy(). But, under
> > > > > -	 * hierarchy management, some *temporal* refcnt can be hold.
> > > > > -	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > > > > -	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > > > > -	 * is called when css_put() is called and refcnt goes down to 0.
> > > > > -	 */
> > > > > -	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > > > -	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> > > > > 
> > > > >  	if (!cgroup_clear_css_refs(cgrp)) {
> > > > >  		mutex_unlock(&cgroup_mutex);
> > > > > Index: linux-2.6.30.org/mm/memcontrol.c
> > > > > ===================================================================
> > > > > --- linux-2.6.30.org.orig/mm/memcontrol.c
> > > > > +++ linux-2.6.30.org/mm/memcontrol.c
> > > > > @@ -1338,6 +1338,7 @@ __mem_cgroup_commit_charge_swapin(struct
> > > > >  		return;
> > > > >  	if (!ptr)
> > > > >  		return;
> > > > > +	css_get(&ptr->css);
> > > > >  	pc = lookup_page_cgroup(page);
> > > > >  	mem_cgroup_lru_del_before_commit_swapcache(page);
> > > > >  	__mem_cgroup_commit_charge(ptr, pc, ctype);
> > > > > @@ -1367,8 +1368,14 @@ __mem_cgroup_commit_charge_swapin(struct
> > > > >  		}
> > > > >  		rcu_read_unlock();
> > > > >  	}
> > > > > -	/* add this page(page_cgroup) to the LRU we want. */
> > > > > -
> > > > > +	/*
> > > > > +	 * Because we charged against a cgroup which is obtained by record
> > > > > +	 * in swap_cgroup, not by task, there is a possibility that someone is
> > > > > +	 * waiting for rmdir. This happens when a swap entry is shared
> > > > > +	 * among cgroups. After wakeup, pre_destroy() will be called again.
> > > > > +	 */
> > > > > +	cgroup_wakeup_rmdir_waiters(&ptr->css.cgroup);
> > > > > +	css_put(&ptr->css);
> > > > >  }
> > > > > 
> > > > >  void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> > > > > 
> > > > >
> > > > 
> > > > I am a little confused by the infrastructure. I think it is OK to
> > > > return -EBUSY and then use notify_on_release to figure out when to
> > > > free the cgroup. Are we moving away from the design?
> > > >  
> > > 
> > > When you use cgroup with _hierarchy_ in real world, rmdir returns -EBUSY very
> > > very often. (Because hierarchy dramatically increase chance to css_get() to empty
> > > cgroup.)
> > > 
> > 
> > I completely understand that part, but I thought we had solved this
> > problem through notify_on_release and release_agent. I am not against
> > it, but I want to know why release_agent will not solve this problem
> > (is it too loose for control?)
> > 
> 
> release_agent is not useful with memcg by following reason.
> 
> 1. release_agent() is called when css->count get to be 0.
> 2. until force_empty() is called, css->count > 0.
> 3. to call force_empty, rmdir or memory.force_empty should be kicked.
> 4. Then, release_agent doesn't work for automatic rmdir.
> 
> You have to change release_agent definition/implimentation to rmdir memcg's
> directory automatically.
> 
> And even if release_agent() is called, it will do rmdir and see -EBUSY.

Because of hierarchy? But we need to cleanup hierarchy before rmdir()
no?

> I don't think multiple calls of release_agent against a cgroup is a sane
> activity. 
> 
> But under current implementation, release_agent can be called multiple times.
> (This is other topic, maybe.)
>

I was hoping to use release agent in the following manner

1. write  a script to do a force_empty() followed by rmdir()

I am not against using the waitq, just seeing the top level design
implications and end user semantics as a result of the patch.
 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-17  7:35                           ` Balbir Singh
@ 2009-06-17  9:05                             ` KAMEZAWA Hiroyuki
  2009-06-17  9:24                               ` Balbir Singh
  0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-17  9:05 UTC (permalink / raw)
  To: balbir; +Cc: Daisuke Nishimura, linux-mm, Li Zefan

On Wed, 17 Jun 2009 13:05:21 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-06-17 15:27:48]:

> > And even if release_agent() is called, it will do rmdir and see -EBUSY.
> 
> Because of hierarchy? But we need to cleanup hierarchy before rmdir()
> no?
> 

Assume following (I think my patch in git explains this.)

/cgroup/A/01
	 /02
	 /03
	 /04
A and 01,02,03,04 is under hierarchy.

Now, 04 has no task and it can be removed by rmdir.
Case 1) 01,02,03 hits memory limit heavily and hirerchical memory recalim
walks. In this case, 04's css refcnt is got/put very often.
Case 2) read statistics of cgroup/A very frequently, this means
css_put/get is called very often agatinst 04.

Case 3)....

04's refcnt is put/get when other group under hierarchy is busy and
rmdir against 04 returns -EBUSY in some amount of possiblitly.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-17  9:05                             ` KAMEZAWA Hiroyuki
@ 2009-06-17  9:24                               ` Balbir Singh
  0 siblings, 0 replies; 19+ messages in thread
From: Balbir Singh @ 2009-06-17  9:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Daisuke Nishimura, linux-mm, Li Zefan

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-06-17 18:05:55]:

> On Wed, 17 Jun 2009 13:05:21 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-06-17 15:27:48]:
> 
> > > And even if release_agent() is called, it will do rmdir and see -EBUSY.
> > 
> > Because of hierarchy? But we need to cleanup hierarchy before rmdir()
> > no?
> > 
> 
> Assume following (I think my patch in git explains this.)
> 
> /cgroup/A/01
> 	 /02
> 	 /03
> 	 /04
> A and 01,02,03,04 is under hierarchy.
> 
> Now, 04 has no task and it can be removed by rmdir.
> Case 1) 01,02,03 hits memory limit heavily and hirerchical memory recalim
> walks. In this case, 04's css refcnt is got/put very often.
> Case 2) read statistics of cgroup/A very frequently, this means
> css_put/get is called very often agatinst 04.
> 
> Case 3)....
> 
> 04's refcnt is put/get when other group under hierarchy is busy and
> rmdir against 04 returns -EBUSY in some amount of possiblitly.
>

Yes, agreed! We did design hierarchy that way. Thanks for the detailed
explanation!
 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-16  8:44                 ` KAMEZAWA Hiroyuki
  2009-06-17  4:56                   ` Balbir Singh
@ 2009-06-18  3:03                   ` Daisuke Nishimura
  2009-06-18  3:21                     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 19+ messages in thread
From: Daisuke Nishimura @ 2009-06-18  3:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Balbir Singh, Li Zefan, linux-mm, Daisuke Nishimura

On Tue, 16 Jun 2009 17:44:36 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 16 Jun 2009 15:48:20 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > > It has been working well so far, but I will continue to test for more long time.
> > > 
> > Thank you. I'd like to find out more clean fix, keeping this as an option.
> > 
> This is cleaned up version. works well in following test.
> 
> ==
> 1. mount -t tmpfs /dev/null /mnt/tmpfs
> 2. mount -t cgroup /dev/null /mnt/cgroups -o memory
> 3. mkdir /mnt/cgroups/A/
> 4. echo $$ > /mnt/cgroups/A/tasks
> 5. echo 4M > /mnt/cgroups/A/memory.limit_in_bytes
> 5. dd if=/dev/zero of=/mnt/tmpfs/testfile bs=1024 count=30000
>  => 26M of swap.
> 6. echo $$ > /mnt/cgroups/tasks
>  => group "A" is empty now.
> 7-a. while true; do cat /mnt/tmpfs/testfile > /dev/null;done
> 
> In ohter shell.
> 7-b. rmdir /mnt/cgroups/A
> ==
> Of course, you have more compliated ones..
> 
> the patch seems a bit long but most of patch is comment..
> 
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> In general, cgroup is for tasks and cgroup subsys(css)'s refcnt is maintained
> per objects related to tasks. But memcg counts refcnt per pages because
> the pointer to css is recorded per page. And more, hierarchy-management support
> requires safe css refcnt management while a rmdir() is ongoing.
> 
> css_tryget()/css_put() is for such purpose and works well.
> 
> But, frequent css_put()/get() tends to prevent rmdir() and users can see
> EBUSY very often. To fix that, waitqueue-for-rmdir was introduced and
> rmdir() can work in synchronous way with cgroup subsytems. But this logic
> expects "refcnt obtained by css_tryget() is temporal and will go down to
> 0 soon, then rmdir() will wake up soon."
> 
> But memcg's swapin code breaks the assumption. (But necessary...)
> This patch try to reuse another anotation of CGRP_WAIT_ON_RMDIR flag to
> check whether cgroup is under rmdir if memcg got a *not termporal* refcnt 
> by css_tryget().
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/cgroup.h |    7 +++++++
>  kernel/cgroup.c        |   46 +++++++++++++++++++++++++++-------------------
>  mm/memcontrol.c        |   11 +++++++++--
>  3 files changed, 43 insertions(+), 21 deletions(-)
> 
> Index: linux-2.6.30.org/include/linux/cgroup.h
> ===================================================================
> --- linux-2.6.30.org.orig/include/linux/cgroup.h
> +++ linux-2.6.30.org/include/linux/cgroup.h
> @@ -365,6 +365,13 @@ int cgroup_task_count(const struct cgrou
>  /* Return true if cgrp is a descendant of the task's cgroup */
>  int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
>  
> +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp);
> +static inline void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> +{
> +	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> +		__cgroup_wakeup_rmdir_waiters(cgrp);
> +}
> +
>  /*
>   * Control Group subsystem type.
>   * See Documentation/cgroups/cgroups.txt for details
> Index: linux-2.6.30.org/kernel/cgroup.c
> ===================================================================
> --- linux-2.6.30.org.orig/kernel/cgroup.c
> +++ linux-2.6.30.org/kernel/cgroup.c
> @@ -737,10 +737,9 @@ static void cgroup_d_remove_dir(struct d
>   */
>  DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
>  
> -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
>  {
> -	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> -		wake_up_all(&cgroup_rmdir_waitq);
> +	wake_up_all(&cgroup_rmdir_waitq);
>  }
>  
>  static int rebind_subsystems(struct cgroupfs_root *root,
> @@ -2667,13 +2666,27 @@ static int cgroup_rmdir(struct inode *un
>  
>  	/* the vfs holds both inode->i_mutex already */
>  again:
> +	/*
> +	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> +	 * case, subsystem has no reference after pre_destroy(). But, under
> +	 * hierarchy management, some *temporal* refcnt can be hold.
> +	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> +	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> +	 * is called when css_put() is called and refcnt goes down to 0.
> +	 *
> +	 * Subsys can check CGRP_WAIT_ON_RMDIR bit by itself to know
> +	 * it's under ongoing rmdir() or not. Because css_tryget() returns false
> +	 * only after css->refcnt returns 0, checking this bit is useful when
> +	 * css' refcnt seems to be not temporal.
> +	 */
> +	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> +	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> +
I'm sorry if I misunderstand something.

Preparing waitq here means force_empty would be called with TASK_INTERRUPTIBLE,
so current can sleep with TASK_INTRRUPTIBLE by cond_resched().

Can we ensure that it can be waken up, especially in case we are not under
memory pressure ?

>  	mutex_lock(&cgroup_mutex);
> -	if (atomic_read(&cgrp->count) != 0) {
> -		mutex_unlock(&cgroup_mutex);
> -		return -EBUSY;
> -	}
> -	if (!list_empty(&cgrp->children)) {
> +	if (atomic_read(&cgrp->count) != 0 || !list_empty(&cgrp->children)) {
>  		mutex_unlock(&cgroup_mutex);
> +		finish_wait(&cgroup_rmdir_waitq, &wait);
> +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		return -EBUSY;
>  	}
>  	mutex_unlock(&cgroup_mutex);
> @@ -2683,25 +2696,20 @@ again:
>  	 * that rmdir() request comes.
>  	 */
>  	ret = cgroup_call_pre_destroy(cgrp);
> -	if (ret)
> +	if (ret) {
> +		finish_wait(&cgroup_rmdir_waitq, &wait);
> +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		return ret;
> +	}
>  
>  	mutex_lock(&cgroup_mutex);
>  	parent = cgrp->parent;
>  	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
>  		mutex_unlock(&cgroup_mutex);
> +		finish_wait(&cgroup_rmdir_waitq, &wait);
> +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		return -EBUSY;
>  	}
> -	/*
> -	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> -	 * case, subsystem has no reference after pre_destroy(). But, under
> -	 * hierarchy management, some *temporal* refcnt can be hold.
> -	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> -	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> -	 * is called when css_put() is called and refcnt goes down to 0.
> -	 */
> -	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> -	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
>  
>  	if (!cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
> Index: linux-2.6.30.org/mm/memcontrol.c
> ===================================================================
> --- linux-2.6.30.org.orig/mm/memcontrol.c
> +++ linux-2.6.30.org/mm/memcontrol.c
> @@ -1338,6 +1338,7 @@ __mem_cgroup_commit_charge_swapin(struct
>  		return;
>  	if (!ptr)
>  		return;
> +	css_get(&ptr->css);
What's the purpose of this css_get ?
Can you add a comment ?

>  	pc = lookup_page_cgroup(page);
>  	mem_cgroup_lru_del_before_commit_swapcache(page);
>  	__mem_cgroup_commit_charge(ptr, pc, ctype);
> @@ -1367,8 +1368,14 @@ __mem_cgroup_commit_charge_swapin(struct
>  		}
>  		rcu_read_unlock();
>  	}
> -	/* add this page(page_cgroup) to the LRU we want. */
> -
> +	/*
> +	 * Because we charged against a cgroup which is obtained by record
> +	 * in swap_cgroup, not by task, there is a possibility that someone is
> +	 * waiting for rmdir. This happens when a swap entry is shared
> +	 * among cgroups. After wakeup, pre_destroy() will be called again.
> +	 */
> +	cgroup_wakeup_rmdir_waiters(&ptr->css.cgroup);
'&' must be removed here.

> +	css_put(&ptr->css);
>  }
>  
>  void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> 
> 


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][BUGFIX] memcg: rmdir doesn't return
  2009-06-18  3:03                   ` Daisuke Nishimura
@ 2009-06-18  3:21                     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-18  3:21 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: KAMEZAWA Hiroyuki, Balbir Singh, Li Zefan, linux-mm

Daisuke Nishimura wrote:
> On Tue, 16 Jun 2009 17:44:36 +0900, KAMEZAWA Hiroyuki
>> +	/*
>> +	 * css_put/get is provided for subsys to grab refcnt to css. In
>> typical
>> +	 * case, subsystem has no reference after pre_destroy(). But, under
>> +	 * hierarchy management, some *temporal* refcnt can be hold.
>> +	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
>> +	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
>> +	 * is called when css_put() is called and refcnt goes down to 0.
>> +	 *
>> +	 * Subsys can check CGRP_WAIT_ON_RMDIR bit by itself to know
>> +	 * it's under ongoing rmdir() or not. Because css_tryget() returns
>> false
>> +	 * only after css->refcnt returns 0, checking this bit is useful when
>> +	 * css' refcnt seems to be not temporal.
>> +	 */
>> +	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>> +	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
>> +
> I'm sorry if I misunderstand something.
>
> Preparing waitq here means force_empty would be called with
> TASK_INTERRUPTIBLE,
> so current can sleep with TASK_INTRRUPTIBLE by cond_resched().
>
Ah...you're right.

> Can we ensure that it can be waken up, especially in case we are not under
> memory pressure ?
>
Hmm. I'll modify here.

lag between
 pre_destroy-> check css's ref -> sleep
 css_tryget -> charge to res_counter
is an enemy anyway. Adding "retry_rmdir()" as previous one is a choice..
(I wonder we should stop css_get/put against page_cgroup ...
 but that change will be too large for bugfix)

>>  	mutex_lock(&cgroup_mutex);
>> -	if (atomic_read(&cgrp->count) != 0) {
>> -		mutex_unlock(&cgroup_mutex);
>> -		return -EBUSY;
>> -	}
>> -	if (!list_empty(&cgrp->children)) {
>> +	if (atomic_read(&cgrp->count) != 0 || !list_empty(&cgrp->children)) {
>>  		mutex_unlock(&cgroup_mutex);
>> +		finish_wait(&cgroup_rmdir_waitq, &wait);
>> +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>>  		return -EBUSY;
>>  	}
>>  	mutex_unlock(&cgroup_mutex);
>> @@ -2683,25 +2696,20 @@ again:
>>  	 * that rmdir() request comes.
>>  	 */
>>  	ret = cgroup_call_pre_destroy(cgrp);
>> -	if (ret)
>> +	if (ret) {
>> +		finish_wait(&cgroup_rmdir_waitq, &wait);
>> +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>>  		return ret;
>> +	}
>>
>>  	mutex_lock(&cgroup_mutex);
>>  	parent = cgrp->parent;
>>  	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
>>  		mutex_unlock(&cgroup_mutex);
>> +		finish_wait(&cgroup_rmdir_waitq, &wait);
>> +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>>  		return -EBUSY;
>>  	}
>> -	/*
>> -	 * css_put/get is provided for subsys to grab refcnt to css. In
>> typical
>> -	 * case, subsystem has no reference after pre_destroy(). But, under
>> -	 * hierarchy management, some *temporal* refcnt can be hold.
>> -	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
>> -	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
>> -	 * is called when css_put() is called and refcnt goes down to 0.
>> -	 */
>> -	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>> -	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
>>
>>  	if (!cgroup_clear_css_refs(cgrp)) {
>>  		mutex_unlock(&cgroup_mutex);
>> Index: linux-2.6.30.org/mm/memcontrol.c
>> ===================================================================
>> --- linux-2.6.30.org.orig/mm/memcontrol.c
>> +++ linux-2.6.30.org/mm/memcontrol.c
>> @@ -1338,6 +1338,7 @@ __mem_cgroup_commit_charge_swapin(struct
>>  		return;
>>  	if (!ptr)
>>  		return;
>> +	css_get(&ptr->css);
> What's the purpose of this css_get ?
> Can you add a comment ?
>
memcg's css->refcnt can be go down to 0 while commit. So, access to
memcg->css.cgroup can be invalid.


>>  	pc = lookup_page_cgroup(page);
>>  	mem_cgroup_lru_del_before_commit_swapcache(page);
>>  	__mem_cgroup_commit_charge(ptr, pc, ctype);
>> @@ -1367,8 +1368,14 @@ __mem_cgroup_commit_charge_swapin(struct
>>  		}
>>  		rcu_read_unlock();
>>  	}
>> -	/* add this page(page_cgroup) to the LRU we want. */
>> -
>> +	/*
>> +	 * Because we charged against a cgroup which is obtained by record
>> +	 * in swap_cgroup, not by task, there is a possibility that someone is
>> +	 * waiting for rmdir. This happens when a swap entry is shared
>> +	 * among cgroups. After wakeup, pre_destroy() will be called again.
>> +	 */
>> +	cgroup_wakeup_rmdir_waiters(&ptr->css.cgroup);
> '&' must be removed here.
>
maybe reflesh miss, sorry.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2009-06-18  3:20 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-12  5:33 [RFC][BUGFIX] memcg: rmdir doesn't return Daisuke Nishimura
2009-06-12  6:19 ` KAMEZAWA Hiroyuki
2009-06-15  2:50   ` Daisuke Nishimura
2009-06-15  3:02     ` KAMEZAWA Hiroyuki
2009-06-15  8:17       ` KAMEZAWA Hiroyuki
2009-06-16  2:47         ` Daisuke Nishimura
2009-06-16  5:00           ` KAMEZAWA Hiroyuki
2009-06-16  6:38             ` Daisuke Nishimura
2009-06-16  6:48               ` KAMEZAWA Hiroyuki
2009-06-16  8:44                 ` KAMEZAWA Hiroyuki
2009-06-17  4:56                   ` Balbir Singh
2009-06-17  5:11                     ` KAMEZAWA Hiroyuki
2009-06-17  5:49                       ` Balbir Singh
2009-06-17  6:27                         ` KAMEZAWA Hiroyuki
2009-06-17  7:35                           ` Balbir Singh
2009-06-17  9:05                             ` KAMEZAWA Hiroyuki
2009-06-17  9:24                               ` Balbir Singh
2009-06-18  3:03                   ` Daisuke Nishimura
2009-06-18  3:21                     ` KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox