[PATCH] mm owner: fix race between swapoff and exit

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm owner: fix race between swapoff and exit
@ 2008-09-25  0:25 Hugh Dickins, Balbir Singh
  2008-09-26 10:58 ` Jiri Slaby
  2008-09-28 22:09 ` Hugh Dickins, Balbir Singh
  0 siblings, 2 replies; 7+ messages in thread
From: Hugh Dickins, Balbir Singh @ 2008-09-25  0:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Balbir Singh, Jiri Slaby, Daisuke Nishimura,
	KAMEZAWA Hiroyuki, Paul Menage, linux-mm

There's a race between mm->owner assignment and swapoff, more easily
seen when task slab poisoning is turned on.  The condition occurs when
try_to_unuse() runs in parallel with an exiting task.  A similar race
can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
or ptrace or page migration.

CPU0                                    CPU1
                                        try_to_unuse
                                        looks at mm = task0->mm
                                        increments mm->mm_users
task 0 exits
mm->owner needs to be updated, but no
new owner is found (mm_users > 1, but
no other task has task->mm = task0->mm)
mm_update_next_owner() leaves
                                        mmput(mm) decrements mm->mm_users
task0 freed
                                        dereferencing mm->owner fails

The fix is to notify the subsystem via mm_owner_changed callback(),
if no new owner is found, by specifying the new task as NULL.

Jiri Slaby:
mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
must be set after that, so as not to pass NULL as old owner causing oops.

Daisuke Nishimura:
mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
and its callers need to take account of this situation to avoid oops.

Hugh Dickins:
Lockdep warning and hang below exec_mmap() when testing these patches.
exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
so exec_mmap() now needs to do the same.  And with that repositioning,
there's now no point in mm_need_new_owner() allowing for NULL mm.

Reported-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
Andrew, this patch folds together the following three from mmotm:
mm-owner-fix-race-between-swap-and-exit.patch
mm-owner-fix-race-between-swap-and-exit-fix.patch
mm-owner-fix-race-between-swap-and-exit-fix-fix.patch
which tests just before KS showed now to be needed in 2.6.27.

Together with Hugh's already-folded-in contribution to:
memrlimit-cgroup-mm-owner-callback-changes-to-add-task-info.patch
which is now needed even without memrlimits, given the above changes.

The regression fixed is actually a regression since 2.6.25;
but let's take a breath before rushing an equivalent fix to stable.

 fs/exec.c       |    2 +-
 kernel/cgroup.c |    5 +++--
 kernel/exit.c   |   12 ++++++++++--
 mm/memcontrol.c |   17 +++++++++++++++++
 4 files changed, 31 insertions(+), 5 deletions(-)

--- 2.6.27-rc7/fs/exec.c	2008-07-29 04:24:47.000000000 +0100
+++ linux/fs/exec.c	2008-09-24 17:17:32.000000000 +0100
@@ -752,11 +752,11 @@ static int exec_mmap(struct mm_struct *m
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
 	task_unlock(tsk);
-	mm_update_next_owner(old_mm);
 	arch_pick_mmap_layout(mm);
 	if (old_mm) {
 		up_read(&old_mm->mmap_sem);
 		BUG_ON(active_mm != old_mm);
+		mm_update_next_owner(old_mm);
 		mmput(old_mm);
 		return 0;
 	}
--- 2.6.27-rc7/kernel/cgroup.c	2008-08-06 08:36:20.000000000 +0100
+++ linux/kernel/cgroup.c	2008-09-24 17:17:32.000000000 +0100
@@ -2738,14 +2738,15 @@ void cgroup_fork_callbacks(struct task_s
  */
 void cgroup_mm_owner_callbacks(struct task_struct *old, struct task_struct *new)
 {
-	struct cgroup *oldcgrp, *newcgrp;
+	struct cgroup *oldcgrp, *newcgrp = NULL;
 
 	if (need_mm_owner_callback) {
 		int i;
 		for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
 			oldcgrp = task_cgroup(old, ss->subsys_id);
-			newcgrp = task_cgroup(new, ss->subsys_id);
+			if (new)
+				newcgrp = task_cgroup(new, ss->subsys_id);
 			if (oldcgrp == newcgrp)
 				continue;
 			if (ss->mm_owner_changed)
--- 2.6.27-rc7/kernel/exit.c	2008-09-10 07:37:25.000000000 +0100
+++ linux/kernel/exit.c	2008-09-24 17:17:32.000000000 +0100
@@ -583,8 +583,6 @@ mm_need_new_owner(struct mm_struct *mm, 
 	 * If there are other users of the mm and the owner (us) is exiting
 	 * we need to find a new owner to take on the responsibility.
 	 */
-	if (!mm)
-		return 0;
 	if (atomic_read(&mm->mm_users) <= 1)
 		return 0;
 	if (mm->owner != p)
@@ -627,6 +625,16 @@ retry:
 	} while_each_thread(g, c);
 
 	read_unlock(&tasklist_lock);
+	/*
+	 * We found no owner yet mm_users > 1: this implies that we are
+	 * most likely racing with swapoff (try_to_unuse()) or /proc or
+	 * ptrace or page migration (get_task_mm()).  Mark owner as NULL,
+	 * so that subsystems can understand the callback and take action.
+	 */
+	down_write(&mm->mmap_sem);
+	cgroup_mm_owner_callbacks(mm->owner, NULL);
+	mm->owner = NULL;
+	up_write(&mm->mmap_sem);
 	return;
 
 assign_new_owner:
--- 2.6.27-rc7/mm/memcontrol.c	2008-08-13 04:14:51.000000000 +0100
+++ linux/mm/memcontrol.c	2008-09-24 17:17:32.000000000 +0100
@@ -250,6 +250,14 @@ static struct mem_cgroup *mem_cgroup_fro
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 {
+	/*
+	 * mm_update_next_owner() may clear mm->owner to NULL
+	 * if it races with swapoff, page migration, etc.
+	 * So this can be called with p == NULL.
+	 */
+	if (unlikely(!p))
+		return NULL;
+
 	return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
 				struct mem_cgroup, css);
 }
@@ -549,6 +557,11 @@ static int mem_cgroup_charge_common(stru
 	if (likely(!memcg)) {
 		rcu_read_lock();
 		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+		if (unlikely(!mem)) {
+			rcu_read_unlock();
+			kmem_cache_free(page_cgroup_cache, pc);
+			return 0;
+		}
 		/*
 		 * For every charge from the cgroup, increment reference count
 		 */
@@ -801,6 +814,10 @@ int mem_cgroup_shrink_usage(struct mm_st
 
 	rcu_read_lock();
 	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!mem)) {
+		rcu_read_unlock();
+		return 0;
+	}
 	css_get(&mem->css);
 	rcu_read_unlock();
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm owner: fix race between swapoff and exit
  2008-09-25  0:25 [PATCH] mm owner: fix race between swapoff and exit Hugh Dickins, Balbir Singh
@ 2008-09-26 10:58 ` Jiri Slaby
  2008-09-26 12:02   ` Balbir Singh
  2008-09-26 13:36   ` Hugh Dickins
  2008-09-28 22:09 ` Hugh Dickins, Balbir Singh
  1 sibling, 2 replies; 7+ messages in thread
From: Jiri Slaby @ 2008-09-26 10:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Andrew Morton, Balbir Singh, Daisuke Nishimura,
	KAMEZAWA Hiroyuki, Paul Menage, linux-mm

Hugh Dickins napsal(a):
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> There's a race between mm->owner assignment and swapoff, more easily
> seen when task slab poisoning is turned on.  The condition occurs when
> try_to_unuse() runs in parallel with an exiting task.  A similar race
> can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
> or ptrace or page migration.
> 
> CPU0                                    CPU1
>                                         try_to_unuse
>                                         looks at mm = task0->mm
>                                         increments mm->mm_users
> task 0 exits
> mm->owner needs to be updated, but no
> new owner is found (mm_users > 1, but
> no other task has task->mm = task0->mm)
> mm_update_next_owner() leaves
>                                         mmput(mm) decrements mm->mm_users
> task0 freed
>                                         dereferencing mm->owner fails
> 
> The fix is to notify the subsystem via mm_owner_changed callback(),
> if no new owner is found, by specifying the new task as NULL.
> 
> Jiri Slaby:
> mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
> must be set after that, so as not to pass NULL as old owner causing oops.
> 
> Daisuke Nishimura:
> mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
> and its callers need to take account of this situation to avoid oops.

What about
memrlimit-setup-the-memrlimit-controller-mm_owner-fix
? It adds check for `old' being NULL.

BTW there is also mm->owner = NULL; movement in the patch to the line before
the callbacks are invoked which I don't understand much (why to inform
anybody about NULL->NULL change?), but the first hunk seems reasonable to me.

[...]
> --- 2.6.27-rc7/kernel/cgroup.c	2008-08-06 08:36:20.000000000 +0100
> +++ linux/kernel/cgroup.c	2008-09-24 17:17:32.000000000 +0100
> @@ -2738,14 +2738,15 @@ void cgroup_fork_callbacks(struct task_s
>   */
>  void cgroup_mm_owner_callbacks(struct task_struct *old, struct task_struct *new)
>  {
> -	struct cgroup *oldcgrp, *newcgrp;
> +	struct cgroup *oldcgrp, *newcgrp = NULL;
>  
>  	if (need_mm_owner_callback) {
>  		int i;
>  		for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
>  			struct cgroup_subsys *ss = subsys[i];
>  			oldcgrp = task_cgroup(old, ss->subsys_id);
> -			newcgrp = task_cgroup(new, ss->subsys_id);
> +			if (new)
> +				newcgrp = task_cgroup(new, ss->subsys_id);
>  			if (oldcgrp == newcgrp)
>  				continue;
>  			if (ss->mm_owner_changed)
> --- 2.6.27-rc7/kernel/exit.c	2008-09-10 07:37:25.000000000 +0100
> +++ linux/kernel/exit.c	2008-09-24 17:17:32.000000000 +0100
> @@ -627,6 +625,16 @@ retry:
>  	} while_each_thread(g, c);
>  
>  	read_unlock(&tasklist_lock);
> +	/*
> +	 * We found no owner yet mm_users > 1: this implies that we are
> +	 * most likely racing with swapoff (try_to_unuse()) or /proc or
> +	 * ptrace or page migration (get_task_mm()).  Mark owner as NULL,
> +	 * so that subsystems can understand the callback and take action.
> +	 */
> +	down_write(&mm->mmap_sem);
> +	cgroup_mm_owner_callbacks(mm->owner, NULL);
> +	mm->owner = NULL;
> +	up_write(&mm->mmap_sem);
>  	return;
>  
>  assign_new_owner:

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm owner: fix race between swapoff and exit
  2008-09-26 10:58 ` Jiri Slaby
@ 2008-09-26 12:02   ` Balbir Singh
  2008-09-26 13:36   ` Hugh Dickins
  1 sibling, 0 replies; 7+ messages in thread
From: Balbir Singh @ 2008-09-26 12:02 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, Daisuke Nishimura,
	KAMEZAWA Hiroyuki, Paul Menage, linux-mm

Jiri Slaby wrote:
> Hugh Dickins napsal(a):
>> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>>
>> There's a race between mm->owner assignment and swapoff, more easily
>> seen when task slab poisoning is turned on.  The condition occurs when
>> try_to_unuse() runs in parallel with an exiting task.  A similar race
>> can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
>> or ptrace or page migration.
>>
>> CPU0                                    CPU1
>>                                         try_to_unuse
>>                                         looks at mm = task0->mm
>>                                         increments mm->mm_users
>> task 0 exits
>> mm->owner needs to be updated, but no
>> new owner is found (mm_users > 1, but
>> no other task has task->mm = task0->mm)
>> mm_update_next_owner() leaves
>>                                         mmput(mm) decrements mm->mm_users
>> task0 freed
>>                                         dereferencing mm->owner fails
>>
>> The fix is to notify the subsystem via mm_owner_changed callback(),
>> if no new owner is found, by specifying the new task as NULL.
>>
>> Jiri Slaby:
>> mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
>> must be set after that, so as not to pass NULL as old owner causing oops.
>>
>> Daisuke Nishimura:
>> mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
>> and its callers need to take account of this situation to avoid oops.
> 
> What about
> memrlimit-setup-the-memrlimit-controller-mm_owner-fix
> ? It adds check for `old' being NULL.
> 

The memrlimit patches are not yet in mainline (they are in -mm).

> BTW there is also mm->owner = NULL; movement in the patch to the line before
> the callbacks are invoked which I don't understand much (why to inform
> anybody about NULL->NULL change?), but the first hunk seems reasonable to me.
> 

It is not in the hunk being posted for inclusion in 2.6.27-rc

> [...]
>> --- 2.6.27-rc7/kernel/cgroup.c	2008-08-06 08:36:20.000000000 +0100
>> +++ linux/kernel/cgroup.c	2008-09-24 17:17:32.000000000 +0100
>> @@ -2738,14 +2738,15 @@ void cgroup_fork_callbacks(struct task_s
>>   */
>>  void cgroup_mm_owner_callbacks(struct task_struct *old, struct task_struct *new)
>>  {
>> -	struct cgroup *oldcgrp, *newcgrp;
>> +	struct cgroup *oldcgrp, *newcgrp = NULL;
>>  
>>  	if (need_mm_owner_callback) {
>>  		int i;
>>  		for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
>>  			struct cgroup_subsys *ss = subsys[i];
>>  			oldcgrp = task_cgroup(old, ss->subsys_id);
>> -			newcgrp = task_cgroup(new, ss->subsys_id);
>> +			if (new)
>> +				newcgrp = task_cgroup(new, ss->subsys_id);
>>  			if (oldcgrp == newcgrp)
>>  				continue;
>>  			if (ss->mm_owner_changed)
>> --- 2.6.27-rc7/kernel/exit.c	2008-09-10 07:37:25.000000000 +0100
>> +++ linux/kernel/exit.c	2008-09-24 17:17:32.000000000 +0100
>> @@ -627,6 +625,16 @@ retry:
>>  	} while_each_thread(g, c);
>>  
>>  	read_unlock(&tasklist_lock);
>> +	/*
>> +	 * We found no owner yet mm_users > 1: this implies that we are
>> +	 * most likely racing with swapoff (try_to_unuse()) or /proc or
>> +	 * ptrace or page migration (get_task_mm()).  Mark owner as NULL,
>> +	 * so that subsystems can understand the callback and take action.
>> +	 */
>> +	down_write(&mm->mmap_sem);
>> +	cgroup_mm_owner_callbacks(mm->owner, NULL);
>> +	mm->owner = NULL;
>> +	up_write(&mm->mmap_sem);
>>  	return;
>>  
>>  assign_new_owner:
> 


-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm owner: fix race between swapoff and exit
  2008-09-26 10:58 ` Jiri Slaby
  2008-09-26 12:02   ` Balbir Singh
@ 2008-09-26 13:36   ` Hugh Dickins
  2008-10-02 23:11     ` Andrew Morton
  1 sibling, 1 reply; 7+ messages in thread
From: Hugh Dickins @ 2008-09-26 13:36 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Linus Torvalds, Andrew Morton, Balbir Singh, Daisuke Nishimura,
	KAMEZAWA Hiroyuki, Li Zefan, Paul Menage, linux-mm

On Fri, 26 Sep 2008, Jiri Slaby wrote:
> Hugh Dickins napsal(a):
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > The fix is to notify the subsystem via mm_owner_changed callback(),
> > if no new owner is found, by specifying the new task as NULL.
> > 
> > Jiri Slaby:
> > mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
> > must be set after that, so as not to pass NULL as old owner causing oops.
> > 
> > Daisuke Nishimura:
> > mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
> > and its callers need to take account of this situation to avoid oops.
> 
> What about
> memrlimit-setup-the-memrlimit-controller-mm_owner-fix
> ? It adds check for `old' being NULL.

Good question, thanks for noticing that.

The true answer is that I didn't even notice that one.
In order to fix the oops I was seeing on 2.6.27-rc6,
Balbir pointed me to 2.6.27-rc5-mm1's
mm-owner-fix-race-between-swap-and-exit.patch
and I immediately added in your
mm-owner-fix-race-between-swap-and-exit-fix.patch
but those weren't enough, it needed Daisuke-san's
mm-owner-fix-race-between-swap-and-exit-fix-fix.patch
from mmotm, and then lockdep and hang showed me
memrlimit-cgroup-mm-owner-callback-changes-to-add-task-info.patch
also needed.  At which point we seemed to be in the clear: phew!

But let's pretend that I had noticed the one you're indicating,
rather than hurriedly looking at it now you've pointed it out ;)

It isn't necessary for 2.6.27 because actually the whole of the
cgroup_mm_owner_callbacks stuff is irrelevant to 2.6.27 - there's
nothing in 2.6.27 to set need_mm_owner_callback, so it will never
hit the code which the first hunk modifies.  memrlimits would set
it, but they remain in -mm not in mainline.

This work has been poorly factored: I think the story is that
mm->owner was invented for memrlimit, then got used in memcgroup,
that use went forward into 2.6.26 but memrlimit has stayed behind;
yet the mm->owner work in 2.6.26 and 2.6.27 contains callback code
which only makes sense if memrlimit (or something else) were there.

When putting together the posted patch from its constituents, I did
briefly wonder whether to rip out all the cgroup_mm_owner_callbacks
stuff, but decided it's safer at this stage to stick with known and
tested patches from -mm.  I guess I could have omitted your patch
from the mix, on the same grounds that I've just cited (2.6.27 never
reaches the affected code), but it looks a lot more sensible with
yours in there.  2.6.27 should be okay with the patch I posted.

> BTW there is also mm->owner = NULL; movement in the patch to the line before
> the callbacks are invoked which I don't understand much (why to inform
> anybody about NULL->NULL change?), but the first hunk seems reasonable to me.

You draw attention to the second hunk of
memrlimit-setup-the-memrlimit-controller-mm_owner-fix
(shown below).  It's just nonsense, isn't it, reverting the fix you
already made?  Perhaps it's not the patch Balbir and Zefan actually
submitted, but a mismerge of that with the fluctuating state of
all these accumulated fixes in the mm tree, and nobody properly
tested the issue in question on the resulting tree.

Or is the whole patch pointless, the first hunk just an attempt
to handle the nonsense of the second hunk?

I wish there were a lot more care and a lot less churn in this area.

Hugh

> From: Balbir Singh <balbir@linux.vnet.ibm.com> and
>       Li Zefan <lizf@cn.fujitsu.com>
> 
> This patch allows mm->owner to be NULL when mm_owner callback is called.
> Without this patch, (for example) you can see panic while you do migrate
> a set of task, which calls fork/exit.
> 
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  kernel/cgroup.c |    5 +++--
>  kernel/exit.c   |    2 +-
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff -puN kernel/cgroup.c~memrlimit-setup-the-memrlimit-controller-mm_owner-fix kernel/cgroup.c
> --- a/kernel/cgroup.c~memrlimit-setup-the-memrlimit-controller-mm_owner-fix
> +++ a/kernel/cgroup.c
> @@ -2761,13 +2761,14 @@ void cgroup_fork_callbacks(struct task_s
>   */
>  void cgroup_mm_owner_callbacks(struct task_struct *old, struct task_struct *new)
>  {
> -	struct cgroup *oldcgrp, *newcgrp = NULL;
> +	struct cgroup *oldcgrp = NULL, *newcgrp = NULL;
>  
>  	if (need_mm_owner_callback) {
>  		int i;
>  		for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
>  			struct cgroup_subsys *ss = subsys[i];
> -			oldcgrp = task_cgroup(old, ss->subsys_id);
> +			if (old)
> +				oldcgrp = task_cgroup(old, ss->subsys_id);
>  			if (new)
>  				newcgrp = task_cgroup(new, ss->subsys_id);
>  			if (oldcgrp == newcgrp)
> diff -puN kernel/exit.c~memrlimit-setup-the-memrlimit-controller-mm_owner-fix kernel/exit.c
> --- a/kernel/exit.c~memrlimit-setup-the-memrlimit-controller-mm_owner-fix
> +++ a/kernel/exit.c
> @@ -641,8 +641,8 @@ retry:
>  	 * the callback and take action
>  	 */
>  	down_write(&mm->mmap_sem);
> -	cgroup_mm_owner_callbacks(mm->owner, NULL);
>  	mm->owner = NULL;
> +	cgroup_mm_owner_callbacks(mm->owner, NULL);
>  	up_write(&mm->mmap_sem);
>  	return;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] mm owner: fix race between swapoff and exit
  2008-09-25  0:25 [PATCH] mm owner: fix race between swapoff and exit Hugh Dickins, Balbir Singh
  2008-09-26 10:58 ` Jiri Slaby
@ 2008-09-28 22:09 ` Hugh Dickins, Balbir Singh
  1 sibling, 0 replies; 7+ messages in thread
From: Hugh Dickins, Balbir Singh @ 2008-09-28 22:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Balbir Singh, Jiri Slaby, Daisuke Nishimura,
	KAMEZAWA Hiroyuki, Paul Menage, linux-mm

There's a race between mm->owner assignment and swapoff, more easily
seen when task slab poisoning is turned on.  The condition occurs when
try_to_unuse() runs in parallel with an exiting task.  A similar race
can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
or ptrace or page migration.

CPU0                                    CPU1
                                        try_to_unuse
                                        looks at mm = task0->mm
                                        increments mm->mm_users
task 0 exits
mm->owner needs to be updated, but no
new owner is found (mm_users > 1, but
no other task has task->mm = task0->mm)
mm_update_next_owner() leaves
                                        mmput(mm) decrements mm->mm_users
task0 freed
                                        dereferencing mm->owner fails

The fix is to notify the subsystem via mm_owner_changed callback(),
if no new owner is found, by specifying the new task as NULL.

Jiri Slaby:
mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
must be set after that, so as not to pass NULL as old owner causing oops.

Daisuke Nishimura:
mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
and its callers need to take account of this situation to avoid oops.

Hugh Dickins:
Lockdep warning and hang below exec_mmap() when testing these patches.
exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
so exec_mmap() now needs to do the same.  And with that repositioning,
there's now no point in mm_need_new_owner() allowing for NULL mm.

Reported-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
Andrew, this patch folds together the following three from mmotm:
mm-owner-fix-race-between-swap-and-exit.patch
mm-owner-fix-race-between-swap-and-exit-fix.patch
mm-owner-fix-race-between-swap-and-exit-fix-fix.patch
which tests just before KS showed now to be needed in 2.6.27.

Together with Hugh's already-folded-in contribution to:
memrlimit-cgroup-mm-owner-callback-changes-to-add-task-info.patch
which is now needed even without memrlimits, given the above changes.

The regression fixed is actually a regression since 2.6.25;
but let's take a breath before rushing an equivalent fix to stable.

 fs/exec.c       |    2 +-
 kernel/cgroup.c |    5 +++--
 kernel/exit.c   |   12 ++++++++++--
 mm/memcontrol.c |   17 +++++++++++++++++
 4 files changed, 31 insertions(+), 5 deletions(-)

--- 2.6.27-rc7/fs/exec.c	2008-07-29 04:24:47.000000000 +0100
+++ linux/fs/exec.c	2008-09-24 17:17:32.000000000 +0100
@@ -752,11 +752,11 @@ static int exec_mmap(struct mm_struct *m
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
 	task_unlock(tsk);
-	mm_update_next_owner(old_mm);
 	arch_pick_mmap_layout(mm);
 	if (old_mm) {
 		up_read(&old_mm->mmap_sem);
 		BUG_ON(active_mm != old_mm);
+		mm_update_next_owner(old_mm);
 		mmput(old_mm);
 		return 0;
 	}
--- 2.6.27-rc7/kernel/cgroup.c	2008-08-06 08:36:20.000000000 +0100
+++ linux/kernel/cgroup.c	2008-09-24 17:17:32.000000000 +0100
@@ -2738,14 +2738,15 @@ void cgroup_fork_callbacks(struct task_s
  */
 void cgroup_mm_owner_callbacks(struct task_struct *old, struct task_struct *new)
 {
-	struct cgroup *oldcgrp, *newcgrp;
+	struct cgroup *oldcgrp, *newcgrp = NULL;
 
 	if (need_mm_owner_callback) {
 		int i;
 		for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
 			struct cgroup_subsys *ss = subsys[i];
 			oldcgrp = task_cgroup(old, ss->subsys_id);
-			newcgrp = task_cgroup(new, ss->subsys_id);
+			if (new)
+				newcgrp = task_cgroup(new, ss->subsys_id);
 			if (oldcgrp == newcgrp)
 				continue;
 			if (ss->mm_owner_changed)
--- 2.6.27-rc7/kernel/exit.c	2008-09-10 07:37:25.000000000 +0100
+++ linux/kernel/exit.c	2008-09-24 17:17:32.000000000 +0100
@@ -583,8 +583,6 @@ mm_need_new_owner(struct mm_struct *mm, 
 	 * If there are other users of the mm and the owner (us) is exiting
 	 * we need to find a new owner to take on the responsibility.
 	 */
-	if (!mm)
-		return 0;
 	if (atomic_read(&mm->mm_users) <= 1)
 		return 0;
 	if (mm->owner != p)
@@ -627,6 +625,16 @@ retry:
 	} while_each_thread(g, c);
 
 	read_unlock(&tasklist_lock);
+	/*
+	 * We found no owner yet mm_users > 1: this implies that we are
+	 * most likely racing with swapoff (try_to_unuse()) or /proc or
+	 * ptrace or page migration (get_task_mm()).  Mark owner as NULL,
+	 * so that subsystems can understand the callback and take action.
+	 */
+	down_write(&mm->mmap_sem);
+	cgroup_mm_owner_callbacks(mm->owner, NULL);
+	mm->owner = NULL;
+	up_write(&mm->mmap_sem);
 	return;
 
 assign_new_owner:
--- 2.6.27-rc7/mm/memcontrol.c	2008-08-13 04:14:51.000000000 +0100
+++ linux/mm/memcontrol.c	2008-09-24 17:17:32.000000000 +0100
@@ -250,6 +250,14 @@ static struct mem_cgroup *mem_cgroup_fro
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 {
+	/*
+	 * mm_update_next_owner() may clear mm->owner to NULL
+	 * if it races with swapoff, page migration, etc.
+	 * So this can be called with p == NULL.
+	 */
+	if (unlikely(!p))
+		return NULL;
+
 	return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
 				struct mem_cgroup, css);
 }
@@ -549,6 +557,11 @@ static int mem_cgroup_charge_common(stru
 	if (likely(!memcg)) {
 		rcu_read_lock();
 		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+		if (unlikely(!mem)) {
+			rcu_read_unlock();
+			kmem_cache_free(page_cgroup_cache, pc);
+			return 0;
+		}
 		/*
 		 * For every charge from the cgroup, increment reference count
 		 */
@@ -801,6 +814,10 @@ int mem_cgroup_shrink_usage(struct mm_st
 
 	rcu_read_lock();
 	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!mem)) {
+		rcu_read_unlock();
+		return 0;
+	}
 	css_get(&mem->css);
 	rcu_read_unlock();
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm owner: fix race between swapoff and exit
  2008-09-26 13:36   ` Hugh Dickins
@ 2008-10-02 23:11     ` Andrew Morton
  2008-10-03  5:10       ` Balbir Singh
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2008-10-02 23:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: jirislaby, torvalds, balbir, nishimura, kamezawa.hiroyuki, lizf,
	menage, linux-mm

On Fri, 26 Sep 2008 14:36:55 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:

> > BTW there is also mm->owner = NULL; movement in the patch to the line before
> > the callbacks are invoked which I don't understand much (why to inform
> > anybody about NULL->NULL change?), but the first hunk seems reasonable to me.
> 
> You draw attention to the second hunk of
> memrlimit-setup-the-memrlimit-controller-mm_owner-fix
> (shown below).  It's just nonsense, isn't it, reverting the fix you
> already made?  Perhaps it's not the patch Balbir and Zefan actually
> submitted, but a mismerge of that with the fluctuating state of
> all these accumulated fixes in the mm tree, and nobody properly
> tested the issue in question on the resulting tree.
> 
> Or is the whole patch pointless, the first hunk just an attempt
> to handle the nonsense of the second hunk?
> 
> I wish there were a lot more care and a lot less churn in this area.

I really don't see those patches going anywhere and they are, to some
extent, getting in the way of real work.

I'm thinking lets-drop-them-all thoughts.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm owner: fix race between swapoff and exit
  2008-10-02 23:11     ` Andrew Morton
@ 2008-10-03  5:10       ` Balbir Singh
  0 siblings, 0 replies; 7+ messages in thread
From: Balbir Singh @ 2008-10-03  5:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, jirislaby, torvalds, nishimura, kamezawa.hiroyuki,
	lizf, menage, linux-mm

Andrew Morton wrote:
> On Fri, 26 Sep 2008 14:36:55 +0100 (BST)
> Hugh Dickins <hugh@veritas.com> wrote:
> 
>>> BTW there is also mm->owner = NULL; movement in the patch to the line before
>>> the callbacks are invoked which I don't understand much (why to inform
>>> anybody about NULL->NULL change?), but the first hunk seems reasonable to me.
>> You draw attention to the second hunk of
>> memrlimit-setup-the-memrlimit-controller-mm_owner-fix
>> (shown below).  It's just nonsense, isn't it, reverting the fix you
>> already made?  Perhaps it's not the patch Balbir and Zefan actually
>> submitted, but a mismerge of that with the fluctuating state of
>> all these accumulated fixes in the mm tree, and nobody properly
>> tested the issue in question on the resulting tree.
>>
>> Or is the whole patch pointless, the first hunk just an attempt
>> to handle the nonsense of the second hunk?
>>
>> I wish there were a lot more care and a lot less churn in this area.
> 
> I really don't see those patches going anywhere and they are, to some
> extent, getting in the way of real work.
> 
> I'm thinking lets-drop-them-all thoughts.

Andrew,

There has been some discussion around memrlimits, the main argument against
those patches by Dave Hansen and Paul Menage has been that no application can
deal with mmap()/malloc() failures. My argument has been that applications that
can deal with them should not be penalized and we have no overcommit support for
cgroups (I don't mind the back port that Andrea did for overcommit support).
I've listed the pros and cons in a separate set of emails to lkml. The
discussion can be found at
http://kerneltrap.org/mailarchive/linux-kernel/2008/8/19/2988814

Although, I find it to be useful and non-users can decide not to enable any
limits, if we are not going to build consensus on this feature, we might as well
drop it :(

-- 
	Balbir

PS: When we do the mlock controller, we'll probably redo some of the
infrastructure that we have memrlimits, but at the moment other things are
keeping me occupied.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2008-10-03  5:10 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-09-25  0:25 [PATCH] mm owner: fix race between swapoff and exit Hugh Dickins, Balbir Singh
2008-09-26 10:58 ` Jiri Slaby
2008-09-26 12:02   ` Balbir Singh
2008-09-26 13:36   ` Hugh Dickins
2008-10-02 23:11     ` Andrew Morton
2008-10-03  5:10       ` Balbir Singh
2008-09-28 22:09 ` Hugh Dickins, Balbir Singh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox