linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: linux-mm <linux-mm@kvack.org>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	Li Zefan <lizf@cn.fujitsu.com>,
	Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Subject: Re: [RFC][BUGFIX] memcg: rmdir doesn't return
Date: Tue, 16 Jun 2009 11:47:35 +0900	[thread overview]
Message-ID: <20090616114735.c7a91b8b.nishimura@mxp.nes.nec.co.jp> (raw)
In-Reply-To: <20090615171715.53743dce.kamezawa.hiroyu@jp.fujitsu.com>

On Mon, 15 Jun 2009 17:17:15 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 15 Jun 2009 12:02:13 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > I don't like implict resource move. I'll try some today. plz see it.
> > _But_ this case just happens when swap is shared between cgroups and _very_ heavy
> > swap-in continues very long. I don't think this is a fatal and BUG.
> > 
> > But ok, maybe wake-up path is not enough.
> > 
> Here.
> Anyway, there is an unfortunate complexity in cgroup's rmdir() path.
> I think this will remove all concern in
> 	pre_destroy -> check -> start rmdir path
> if subsys is aware of what they does.
> Usual subsys just consider "tasks" and no extra references I hope.
> If your test result is good, I'll post again (after merge window ?).
> 
Thank you for your patch.

At first, I thought this problem can be solved by this direction, but
there is a race window yet.

The root cause of this problem is that mem.usage can be incremented
by swap-in behavior of memcg even after it has become 0 once.
So, mem.usage can also be incremented between cgroup_need_restart_rmdir()
and schedule().
I can see rmdir being locked up actually in my test.

hmm, sleeping until being waken up might not be good if we don't change
swap-in behavior of memcg in some way.


Thanks,
Daisuke Nishimura.

> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Cgroup is designed for do some work against _tasks_. But when it comes to
> memcg, a cgroup can be obtained by something other...i.e. page and swap entry.
> Then, pre_destroy at el. are provided. Historically, there are some races
> around this...this is new one.
> 
> Now, rmdir() path uses following logic.
> 
> 	pre_destroy();	   # drop all css->refcnt to be 0.
> 	lock cgroup mutex  # no new task after this
> 	check cgroup has no tasks.
> 	check cgroup has no children.
> 	check css refcnt 
> 	(*)	if refcnt is not 0, sleep and wait for refcnt goes down to 0.
> 
> The logic (*) assumes the refcnt will goes down soon, but in some case(memcg),
> it's better to call pre_destroy() again if pre_destroy() can handle it.
> (The most unfortunate in above logic is that we can't have some trustable
>  lock in this path..but..we may never be able to do.)
> 
> This patch adds ss->restart_rmdir() callback to subsys and allow immediate
> retry of pre_destroy() if necessary.
> 
> Reported-by: Daisuke Nishimura  <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> Index: linux-2.6.30.org/include/linux/cgroup.h
> ===================================================================
> --- linux-2.6.30.org.orig/include/linux/cgroup.h
> +++ linux-2.6.30.org/include/linux/cgroup.h
> @@ -374,6 +374,7 @@ struct cgroup_subsys {
>  	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
>  						  struct cgroup *cgrp);
>  	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> +	bool (*restart_rmdir)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	int (*can_attach)(struct cgroup_subsys *ss,
>  			  struct cgroup *cgrp, struct task_struct *tsk);
> Index: linux-2.6.30.org/kernel/cgroup.c
> ===================================================================
> --- linux-2.6.30.org.orig/kernel/cgroup.c
> +++ linux-2.6.30.org/kernel/cgroup.c
> @@ -635,6 +635,23 @@ static int cgroup_call_pre_destroy(struc
>  		}
>  	return ret;
>  }
> +/*
> + * Check we have to restart rmdir immediately or not. Because we don't have any
> + * system which prevents "new reference comes after pre_destroy", we checks
> + * whether we have to call pre_destroy() again or not.
> + * i.e. if css_get()'s refcnt is not a temporal one, we can't expect css_put()
> + * is called and need to call pre_destroy().
> + */
> +static bool cgroup_need_restart_rmdir(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys *ss;
> +
> +	for_each_subsys(cgrp->root, ss)
> +		if (ss->restart_rmdir)
> +			if (ss->restart_rmdir(ss, cgrp))
> +				return true;
> +	return false;
> +}
>  
>  static void free_cgroup_rcu(struct rcu_head *obj)
>  {
> @@ -2705,7 +2722,8 @@ again:
>  
>  	if (!cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
> -		schedule();
> +		if (!cgroup_need_restart_rmdir(cgrp))
> +			schedule();
>  		finish_wait(&cgroup_rmdir_waitq, &wait);
>  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		if (signal_pending(current))
> Index: linux-2.6.30.org/mm/memcontrol.c
> ===================================================================
> --- linux-2.6.30.org.orig/mm/memcontrol.c
> +++ linux-2.6.30.org/mm/memcontrol.c
> @@ -2462,6 +2462,18 @@ static int mem_cgroup_pre_destroy(struct
>  	return mem_cgroup_force_empty(mem, false);
>  }
>  
> +static bool mem_cgroup_restart_rmdir(struct cgroup_subsys *ss,
> +					struct cgroup *cont)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> +	unsigned long long usage;
> +
> +	usage = res_counter_read_u64(&mem->res, RES_USAGE);
> +	if (usage)/* some charge after pre_destroy() (via swap)....*/
> +		return true;
> +	return false;
> +}
> +
>  static void mem_cgroup_destroy(struct cgroup_subsys *ss,
>  				struct cgroup *cont)
>  {
> @@ -2501,6 +2513,7 @@ struct cgroup_subsys mem_cgroup_subsys =
>  	.subsys_id = mem_cgroup_subsys_id,
>  	.create = mem_cgroup_create,
>  	.pre_destroy = mem_cgroup_pre_destroy,
> +	.restart_rmdir = mem_cgroup_restart_rmdir,
>  	.destroy = mem_cgroup_destroy,
>  	.populate = mem_cgroup_populate,
>  	.attach = mem_cgroup_move_task,
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2009-06-16  3:12 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-06-12  5:33 Daisuke Nishimura
2009-06-12  6:19 ` KAMEZAWA Hiroyuki
2009-06-15  2:50   ` Daisuke Nishimura
2009-06-15  3:02     ` KAMEZAWA Hiroyuki
2009-06-15  8:17       ` KAMEZAWA Hiroyuki
2009-06-16  2:47         ` Daisuke Nishimura [this message]
2009-06-16  5:00           ` KAMEZAWA Hiroyuki
2009-06-16  6:38             ` Daisuke Nishimura
2009-06-16  6:48               ` KAMEZAWA Hiroyuki
2009-06-16  8:44                 ` KAMEZAWA Hiroyuki
2009-06-17  4:56                   ` Balbir Singh
2009-06-17  5:11                     ` KAMEZAWA Hiroyuki
2009-06-17  5:49                       ` Balbir Singh
2009-06-17  6:27                         ` KAMEZAWA Hiroyuki
2009-06-17  7:35                           ` Balbir Singh
2009-06-17  9:05                             ` KAMEZAWA Hiroyuki
2009-06-17  9:24                               ` Balbir Singh
2009-06-18  3:03                   ` Daisuke Nishimura
2009-06-18  3:21                     ` KAMEZAWA Hiroyuki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090616114735.c7a91b8b.nishimura@mxp.nes.nec.co.jp \
    --to=nishimura@mxp.nes.nec.co.jp \
    --cc=balbir@linux.vnet.ibm.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-mm@kvack.org \
    --cc=lizf@cn.fujitsu.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox