From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id E72706B004F for ; Tue, 16 Jun 2009 04:44:48 -0400 (EDT) Received: from m1.gw.fujitsu.co.jp ([10.0.50.71]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id n5G8kBHb010471 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Tue, 16 Jun 2009 17:46:11 +0900 Received: from smail (m1 [127.0.0.1]) by outgoing.m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 1205845DD7D for ; Tue, 16 Jun 2009 17:46:11 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (s1.gw.fujitsu.co.jp [10.0.50.91]) by m1.gw.fujitsu.co.jp (Postfix) with ESMTP id DA20445DD7B for ; Tue, 16 Jun 2009 17:46:10 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id A66BCE0800C for ; Tue, 16 Jun 2009 17:46:10 +0900 (JST) Received: from m105.s.css.fujitsu.com (m105.s.css.fujitsu.com [10.249.87.105]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 72032E08006 for ; Tue, 16 Jun 2009 17:46:08 +0900 (JST) Date: Tue, 16 Jun 2009 17:44:36 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [RFC][BUGFIX] memcg: rmdir doesn't return Message-Id: <20090616174436.5a4b6577.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20090616154820.c9065809.kamezawa.hiroyu@jp.fujitsu.com> References: <20090612143346.68e1f006.nishimura@mxp.nes.nec.co.jp> <20090612151924.2d305ce8.kamezawa.hiroyu@jp.fujitsu.com> <20090615115021.c79444cb.nishimura@mxp.nes.nec.co.jp> <20090615120213.e9a3bd1d.kamezawa.hiroyu@jp.fujitsu.com> <20090615171715.53743dce.kamezawa.hiroyu@jp.fujitsu.com> <20090616114735.c7a91b8b.nishimura@mxp.nes.nec.co.jp> <20090616140050.4172f988.kamezawa.hiroyu@jp.fujitsu.com> <20090616153810.fd710c5b.nishimura@mxp.nes.nec.co.jp> <20090616154820.c9065809.kamezawa.hiroyu@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: KAMEZAWA Hiroyuki Cc: Daisuke Nishimura , linux-mm , Balbir Singh , Li Zefan List-ID: On Tue, 16 Jun 2009 15:48:20 +0900 KAMEZAWA Hiroyuki wrote: > > It has been working well so far, but I will continue to test for more long time. > > > Thank you. I'd like to find out more clean fix, keeping this as an option. > This is cleaned up version. works well in following test. == 1. mount -t tmpfs /dev/null /mnt/tmpfs 2. mount -t cgroup /dev/null /mnt/cgroups -o memory 3. mkdir /mnt/cgroups/A/ 4. echo $$ > /mnt/cgroups/A/tasks 5. echo 4M > /mnt/cgroups/A/memory.limit_in_bytes 5. dd if=/dev/zero of=/mnt/tmpfs/testfile bs=1024 count=30000 => 26M of swap. 6. echo $$ > /mnt/cgroups/tasks => group "A" is empty now. 7-a. while true; do cat /mnt/tmpfs/testfile > /dev/null;done In ohter shell. 7-b. rmdir /mnt/cgroups/A == Of course, you have more compliated ones.. the patch seems a bit long but most of patch is comment.. == From: KAMEZAWA Hiroyuki In general, cgroup is for tasks and cgroup subsys(css)'s refcnt is maintained per objects related to tasks. But memcg counts refcnt per pages because the pointer to css is recorded per page. And more, hierarchy-management support requires safe css refcnt management while a rmdir() is ongoing. css_tryget()/css_put() is for such purpose and works well. But, frequent css_put()/get() tends to prevent rmdir() and users can see EBUSY very often. To fix that, waitqueue-for-rmdir was introduced and rmdir() can work in synchronous way with cgroup subsytems. But this logic expects "refcnt obtained by css_tryget() is temporal and will go down to 0 soon, then rmdir() will wake up soon." But memcg's swapin code breaks the assumption. (But necessary...) This patch try to reuse another anotation of CGRP_WAIT_ON_RMDIR flag to check whether cgroup is under rmdir if memcg got a *not termporal* refcnt by css_tryget(). Signed-off-by: KAMEZAWA Hiroyuki --- include/linux/cgroup.h | 7 +++++++ kernel/cgroup.c | 46 +++++++++++++++++++++++++++------------------- mm/memcontrol.c | 11 +++++++++-- 3 files changed, 43 insertions(+), 21 deletions(-) Index: linux-2.6.30.org/include/linux/cgroup.h =================================================================== --- linux-2.6.30.org.orig/include/linux/cgroup.h +++ linux-2.6.30.org/include/linux/cgroup.h @@ -365,6 +365,13 @@ int cgroup_task_count(const struct cgrou /* Return true if cgrp is a descendant of the task's cgroup */ int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task); +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp); +static inline void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp) +{ + if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))) + __cgroup_wakeup_rmdir_waiters(cgrp); +} + /* * Control Group subsystem type. * See Documentation/cgroups/cgroups.txt for details Index: linux-2.6.30.org/kernel/cgroup.c =================================================================== --- linux-2.6.30.org.orig/kernel/cgroup.c +++ linux-2.6.30.org/kernel/cgroup.c @@ -737,10 +737,9 @@ static void cgroup_d_remove_dir(struct d */ DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq); -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp) +void __cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp) { - if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))) - wake_up_all(&cgroup_rmdir_waitq); + wake_up_all(&cgroup_rmdir_waitq); } static int rebind_subsystems(struct cgroupfs_root *root, @@ -2667,13 +2666,27 @@ static int cgroup_rmdir(struct inode *un /* the vfs holds both inode->i_mutex already */ again: + /* + * css_put/get is provided for subsys to grab refcnt to css. In typical + * case, subsystem has no reference after pre_destroy(). But, under + * hierarchy management, some *temporal* refcnt can be hold. + * To avoid returning -EBUSY to a user, waitqueue is used. If subsys + * is really busy, it should return -EBUSY at pre_destroy(). wake_up + * is called when css_put() is called and refcnt goes down to 0. + * + * Subsys can check CGRP_WAIT_ON_RMDIR bit by itself to know + * it's under ongoing rmdir() or not. Because css_tryget() returns false + * only after css->refcnt returns 0, checking this bit is useful when + * css' refcnt seems to be not temporal. + */ + set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); + prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE); + mutex_lock(&cgroup_mutex); - if (atomic_read(&cgrp->count) != 0) { - mutex_unlock(&cgroup_mutex); - return -EBUSY; - } - if (!list_empty(&cgrp->children)) { + if (atomic_read(&cgrp->count) != 0 || !list_empty(&cgrp->children)) { mutex_unlock(&cgroup_mutex); + finish_wait(&cgroup_rmdir_waitq, &wait); + clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); return -EBUSY; } mutex_unlock(&cgroup_mutex); @@ -2683,25 +2696,20 @@ again: * that rmdir() request comes. */ ret = cgroup_call_pre_destroy(cgrp); - if (ret) + if (ret) { + finish_wait(&cgroup_rmdir_waitq, &wait); + clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); return ret; + } mutex_lock(&cgroup_mutex); parent = cgrp->parent; if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) { mutex_unlock(&cgroup_mutex); + finish_wait(&cgroup_rmdir_waitq, &wait); + clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); return -EBUSY; } - /* - * css_put/get is provided for subsys to grab refcnt to css. In typical - * case, subsystem has no reference after pre_destroy(). But, under - * hierarchy management, some *temporal* refcnt can be hold. - * To avoid returning -EBUSY to a user, waitqueue is used. If subsys - * is really busy, it should return -EBUSY at pre_destroy(). wake_up - * is called when css_put() is called and refcnt goes down to 0. - */ - set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags); - prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE); if (!cgroup_clear_css_refs(cgrp)) { mutex_unlock(&cgroup_mutex); Index: linux-2.6.30.org/mm/memcontrol.c =================================================================== --- linux-2.6.30.org.orig/mm/memcontrol.c +++ linux-2.6.30.org/mm/memcontrol.c @@ -1338,6 +1338,7 @@ __mem_cgroup_commit_charge_swapin(struct return; if (!ptr) return; + css_get(&ptr->css); pc = lookup_page_cgroup(page); mem_cgroup_lru_del_before_commit_swapcache(page); __mem_cgroup_commit_charge(ptr, pc, ctype); @@ -1367,8 +1368,14 @@ __mem_cgroup_commit_charge_swapin(struct } rcu_read_unlock(); } - /* add this page(page_cgroup) to the LRU we want. */ - + /* + * Because we charged against a cgroup which is obtained by record + * in swap_cgroup, not by task, there is a possibility that someone is + * waiting for rmdir. This happens when a swap entry is shared + * among cgroups. After wakeup, pre_destroy() will be called again. + */ + cgroup_wakeup_rmdir_waiters(&ptr->css.cgroup); + css_put(&ptr->css); } void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org