From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from zps38.corp.google.com (zps38.corp.google.com [172.25.146.38])
	by smtp-out.google.com with ESMTP id m2SB1b2P014853
	for ; Fri, 28 Mar 2008 04:01:37 -0700
Received: from py-out-1112.google.com (pyea73.prod.google.com [10.34.153.73])
	by zps38.corp.google.com with ESMTP id m2SB1aUh006807
	for ; Fri, 28 Mar 2008 04:01:37 -0700
Received: by py-out-1112.google.com with SMTP id a73so357649pye.33
	for ; Fri, 28 Mar 2008 04:01:36 -0700 (PDT)
Message-ID: <6599ad830803280401r68d30e91waaea8eb1de36eb52@mail.gmail.com>
Date: Fri, 28 Mar 2008 04:01:35 -0700
From: "Paul Menage"
Subject: Re: [-mm] Add an owner to the mm_struct (v2)
In-Reply-To: <20080328082316.6961.29044.sendpatchset@localhost.localdomain>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <20080328082316.6961.29044.sendpatchset@localhost.localdomain>
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Balbir Singh
Cc: Pavel Emelianov, Hugh Dickins, Sudhir Kumar,
	YAMAMOTO Takashi, lizf@cn.fujitsu.com, linux-kernel@vger.kernel.org,
	taka@valinux.co.jp, linux-mm@kvack.org, David Rientjes,
	Andrew Morton, KAMEZAWA Hiroyuki
List-ID:

On Fri, Mar 28, 2008 at 1:23 AM, Balbir Singh wrote:
>
>  diff -puN include/linux/mm_types.h~memory-controller-add-mm-owner include/linux/mm_types.h
>  --- linux-2.6.25-rc5/include/linux/mm_types.h~memory-controller-add-mm-owner	2008-03-28 09:30:47.000000000 +0530
>  +++ linux-2.6.25-rc5-balbir/include/linux/mm_types.h	2008-03-28 12:26:59.000000000 +0530
>  @@ -227,8 +227,10 @@ struct mm_struct {
>  	/* aio bits */
>  	rwlock_t		ioctx_list_lock;
>  	struct kioctx		*ioctx_list;
>  -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  -	struct mem_cgroup *mem_cgroup;
>  +#ifdef CONFIG_MM_OWNER
>  +	spinlock_t owner_lock;
>  +	struct task_struct *owner;	/* The thread group leader that */
>  +					/* owns the mm_struct.		*/
>  #endif

I'm not convinced that we need the spinlock.
Just use the simple rule that you can only modify mm->owner if:

 - mm->owner points to current
 - the new owner is a user of mm
 - you hold task_lock() for the new owner (which is necessary anyway
   to ensure that the new owner's mm doesn't change while you're
   updating mm->owner)

and I think everything is fine without an additional lock.

>
>  #ifdef CONFIG_PROC_FS
>  diff -puN kernel/fork.c~memory-controller-add-mm-owner kernel/fork.c
>  --- linux-2.6.25-rc5/kernel/fork.c~memory-controller-add-mm-owner	2008-03-28 09:30:47.000000000 +0530
>  +++ linux-2.6.25-rc5-balbir/kernel/fork.c	2008-03-28 12:33:12.000000000 +0530
>  @@ -359,6 +359,7 @@ static struct mm_struct * mm_init(struct
>  	mm->free_area_cache = TASK_UNMAPPED_BASE;
>  	mm->cached_hole_size = ~0UL;
>  	mm_init_cgroup(mm, p);
>  +	mm_init_owner(mm, p);
>
>  	if (likely(!mm_alloc_pgd(mm))) {
>  		mm->def_flags = 0;
>  @@ -995,6 +996,27 @@ static void rt_mutex_init_task(struct ta
>  #endif
>  }
>
>  +#ifdef CONFIG_MM_OWNER
>  +void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
>  +{
>  +	spin_lock_init(&mm->owner_lock);
>  +	mm->owner = p;
>  +}
>  +
>  +void mm_fork_init_owner(struct task_struct *p)
>  +{
>  +	struct mm_struct *mm = get_task_mm(p);

Do we need this? p->mm can't go away if we're in the middle of
forking it.

>  +	if (!mm)
>  +		return;
>  +
>  +	spin_lock(&mm->owner);

I suspect that you meant this to be spin_lock(&mm->owner_lock).

>  +	if (mm->owner != p)
>  +		rcu_assign_pointer(mm->owner, p->group_leader);
>  +	spin_unlock(&mm->owner);
>  +	mmput(mm);
>  +}
>  +#endif /* CONFIG_MM_OWNER */
>  +
>  /*
>   * This creates a new process as a copy of the old one,
>   * but does not actually start it yet.
>  @@ -1357,6 +1379,10 @@ static struct task_struct *copy_process(
>  	write_unlock_irq(&tasklist_lock);
>  	proc_fork_connector(p);
>  	cgroup_post_fork(p);
>  +
>  +	if (!(clone_flags & CLONE_VM) && (p != p->group_leader))
>  +		mm_fork_init_owner(p);
>  +

I'm not sure I understand what this is doing.
I read it as "if p has its own mm and p is a child thread, set
p->mm->owner to p->group_leader". But by definition if p has its own
mm, then p->group_leader->mm will be different to p->mm, therefore
we'd end up with mm->owner->mm != mm, which seems very bad. What's
the intention of this bit of code?

>  	return p;
>
>  bad_fork_free_pid:
>  diff -puN include/linux/memcontrol.h~memory-controller-add-mm-owner include/linux/memcontrol.h
>  --- linux-2.6.25-rc5/include/linux/memcontrol.h~memory-controller-add-mm-owner	2008-03-28 09:30:47.000000000 +0530
>  +++ linux-2.6.25-rc5-balbir/include/linux/memcontrol.h	2008-03-28 09:30:47.000000000 +0530
>  @@ -29,6 +29,7 @@ struct mm_struct;
>
>  extern void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p);
>  extern void mm_free_cgroup(struct mm_struct *mm);
>  +extern void mem_cgroup_fork_init(struct task_struct *p);
>
>  #define page_reset_bad_cgroup(page) ((page)->page_cgroup = 0)
>
>  @@ -49,7 +50,7 @@ extern void mem_cgroup_out_of_memory(str
>  int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>
>  #define mm_match_cgroup(mm, cgroup) \
>  -	((cgroup) == rcu_dereference((mm)->mem_cgroup))
>  +	((cgroup) == mem_cgroup_from_task((mm)->owner))
>
>  extern int mem_cgroup_prepare_migration(struct page *page);
>  extern void mem_cgroup_end_migration(struct page *page);
>  @@ -72,6 +73,8 @@ extern long mem_cgroup_calc_reclaim_acti
>  extern long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
>  					struct zone *zone, int priority);
>
>  +extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>  +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  static inline void mm_init_cgroup(struct mm_struct *mm,
>  					struct task_struct *p)
>  @@ -82,6 +85,10 @@ static inline void mm_free_cgroup(struct
>  {
>  }
>
>  +static inline void mem_cgroup_fork_init(struct task_struct *p)
>  +{
>  +}
>  +

Is this stale?
>  static inline void page_reset_bad_cgroup(struct page *page)
>  {
>  }
>  @@ -172,6 +179,11 @@ static inline long mem_cgroup_calc_recla
>  {
>  	return 0;
>  }
>  +
>  +static void mm_free_fork_cgroup(struct task_struct *p)
>  +{
>  +}
>  +

And this?

>  #endif /* CONFIG_CGROUP_MEM_CONT */
>  -static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
>  +struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
>  {
>  	return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
>  				struct mem_cgroup, css);

I think it would be better to make this static inline in the header
file - it's just two indexed dereferences, so hardly worth the
function call overhead.

>  @@ -250,12 +250,17 @@ void mm_init_cgroup(struct mm_struct *mm
>
>  	mem = mem_cgroup_from_task(p);
>  	css_get(&mem->css);
>  -	mm->mem_cgroup = mem;
>  }
>
>  void mm_free_cgroup(struct mm_struct *mm)
>  {
>  -	css_put(&mm->mem_cgroup->css);
>  +	struct mem_cgroup *mem;
>  +
>  +	/*
>  +	 * TODO: Should we assign mm->owner to NULL here?

No, controller code shouldn't be changing mm->owner. And surely we
don't need mm_init_cgroup() and mm_free_cgroup() any longer?

>
>  -	rcu_read_lock();
>  -	mem = rcu_dereference(mm->mem_cgroup);
>  +	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
>  	/*
>  	 * For every charge from the cgroup, increment reference count
>  	 */
>  	css_get(&mem->css);
>  -	rcu_read_unlock();

Why is it OK to take away the rcu_read_lock() here? We're still doing
an rcu_dereference().

>
>  	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
>  		if (!(gfp_mask & __GFP_WAIT))
>  @@ -990,8 +994,8 @@ mem_cgroup_create(struct cgroup_subsys *
>
>  	if (unlikely((cont->parent) == NULL)) {
>  		mem = &init_mem_cgroup;
>  -		init_mm.mem_cgroup = mem;
>  		page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
>  +		init_mm.owner = &init_task;

This shouldn't be in here - it should be in the core code that sets
up init_mm.
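(Returning to mem_cgroup_from_task() for a moment: the static inline
version I'm suggesting would be something along these lines - an
untested sketch, assuming memcontrol.h already has the cgroup
subsystem declarations in scope:)

```c
/*
 * Untested sketch: same body as the current out-of-line function,
 * just moved into memcontrol.h so callers get the two indexed
 * dereferences inline instead of a function call.
 */
static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
{
	return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
			    struct mem_cgroup, css);
}
```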
>
>  +#ifdef CONFIG_MM_OWNER
>  +extern void mm_update_next_owner(struct mm_struct *mm, struct task_struct *p);
>  +extern void mm_fork_init_owner(struct task_struct *p);
>  +extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
>  +#else
>  +static inline void
>  +mm_update_next_owner(struct mm_struct *mm, struct task_struct *p)
>  +{
>  +}
>  +
>  +static inline void mm_fork_init_owner(struct task_struct *p)
>  +{
>  +}
>  +
>  +static inline void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
>  +{
>  +}
>  +#endif /* CONFIG_MM_OWNER */
>  +
>  #endif /* __KERNEL__ */
>
>  #endif
>  diff -puN kernel/exit.c~memory-controller-add-mm-owner kernel/exit.c
>  --- linux-2.6.25-rc5/kernel/exit.c~memory-controller-add-mm-owner	2008-03-28 09:30:47.000000000 +0530
>  +++ linux-2.6.25-rc5-balbir/kernel/exit.c	2008-03-28 12:35:39.000000000 +0530
>  @@ -579,6 +579,71 @@ void exit_fs(struct task_struct *tsk)
>
>  EXPORT_SYMBOL_GPL(exit_fs);
>
>  +#ifdef CONFIG_MM_OWNER
>  +/*
>  + * Task p is exiting and it owned p, so lets find a new owner for it
>  + */
>  +static inline int
>  +mm_need_new_owner(struct mm_struct *mm, struct task_struct *p)
>  +{
>  +	int ret;
>  +
>  +	rcu_read_lock();
>  +	ret = (mm && (rcu_dereference(mm->owner) == p) &&
>  +	       (atomic_read(&mm->mm_users) > 1));
>  +	rcu_read_unlock();
>  +	return ret;

The only way that rcu_read_lock() helps here is if mm freeing is
protected by RCU, which I don't think is the case. But as long as
p==current, there's no race, since no other process will re-point
mm->owner at themselves, so mm can't go away anyway since we have a
reference to it that we're going to be dropping soon.

Is there ever a case where we'd want to call this on anything other
than current? It would simplify the code to just refer to current
rather than tsk.
>  +}
>  +
>  +void mm_update_next_owner(struct mm_struct *mm, struct task_struct *p)
>  +{
>  +	struct task_struct *c, *g;
>  +
>  +	/*
>  +	 * This should not be called for init_task
>  +	 */
>  +	BUG_ON(p == p->parent);

I'd be inclined to make this BUG_ON(p != current), or just have p as
a local variable initialized from current. (If you're trying to save
multiple calls to current on arches where it's not just a simple
register.)

>  +
>  +	if (!mm_need_new_owner(mm, p))
>  +		return;
>  +
>  +	/*
>  +	 * Search in the children
>  +	 */
>  +	list_for_each_entry(c, &p->children, sibling) {
>  +		if (c->mm == p->mm)
>  +			goto assign_new_owner;
>  +	}

We need to keep checking mm_need_new_owner() since it can become
false if the only other user of the mm exits at the same time that we
do. (In which case there's nothing to do.)

>  +	 * Search through everything else. We should not get
>  +	 * here often
>  +	 */
>  +	for_each_process(c) {
>  +		g = c;
>  +		do {
>  +			if (c->mm && (c->mm == p->mm))
>  +				goto assign_new_owner;
>  +		} while ((c = next_thread(c)) != g);
>  +	}

Is there a reason to not code this as for_each_thread?

>  +
>  +	BUG();
>  +
>  +assign_new_owner:
>  +	spin_lock(&mm->owner_lock);
>  +	rcu_assign_pointer(mm->owner, c);
>  +	spin_unlock(&mm->owner_lock);
>  +}

This can break if c is also exiting and has passed the call to
mm_update_next_owner() by the time we assign mm->owner. That's why my
original suggested version had a function like:

static inline void try_give_mm_ownership(struct task_struct *task,
					 struct mm_struct *mm)
{
	if (task->mm != mm)
		return;
	task_lock(task);
	if (task->mm == mm) {
		mm->owner = task;
	}
	task_unlock(task);
}

i.e. determining that a task is a valid candidate and updating the
owner pointer has to be done in the same critical section.

Also, looking forward to when we have the virtual AS limits
controller, in the (unlikely?)
event that the new owner is in a different virtual AS limit control
group, this code will need to be able to handle shifting the
mm->total_mm from the old AS cgroup to the new one. That's the
"fiddly layer violation" that I mentioned earlier. It might be
cleaner to be able to specify on a per-subsystem basis whether we
require that all users of an mm be in the same cgroup.

>  config CGROUP_MEM_RES_CTLR
>  	bool "Memory Resource Controller for Control Groups"
>  -	depends on CGROUPS && RESOURCE_COUNTERS
>  +	depends on CGROUPS && RESOURCE_COUNTERS && MM_OWNER

Maybe this should select MM_OWNER rather than depending on it?

Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org