* [PATCH 0/2] userfaultfd: use RCU to free the task struct when fork fails
@ 2019-03-25 22:56 Andrea Arcangeli
2019-03-25 22:56 ` [PATCH 1/2] " Andrea Arcangeli
2019-03-25 22:56 ` [PATCH 2/2] mm: change mm_update_next_owner() to update mm->owner with WRITE_ONCE Andrea Arcangeli
0 siblings, 2 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2019-03-25 22:56 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, zhong jiang, syzkaller-bugs,
syzbot+cbb52e396df3e565ab02, Mike Rapoport, Mike Kravetz,
Peter Xu, Dmitry Vyukov
Hello,
this fixes a race condition between memcg and UFFD_EVENT_FORK that was
reproduced on aarch64 with qemu by syzkaller.
While at it I also added WRITE_ONCE in more places that should use it
(in theory) to pair with the rcu_dereference issued in the
rcu_read_lock critical section.
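For illustration, the store/load pairing in question looks roughly like
this (a minimal sketch with hypothetical helper names, not the exact
kernel code; mm->owner only exists with CONFIG_MEMCG=y):

#include <linux/mm_types.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

/* Writer: publish/unpublish the owner with a single-copy-atomic store. */
static void clear_owner_sketch(struct mm_struct *mm)
{
	WRITE_ONCE(mm->owner, NULL);	/* pairs with rcu_dereference() */
}

/* Reader: only dereference the owner inside the RCU read-side section. */
static bool owner_is_current_sketch(struct mm_struct *mm)
{
	bool ret;

	rcu_read_lock();
	ret = rcu_dereference(mm->owner) == current;
	rcu_read_unlock();
	return ret;
}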
Andrea Arcangeli (2):
userfaultfd: use RCU to free the task struct when fork fails
mm: change mm_update_next_owner() to update mm->owner with WRITE_ONCE
kernel/exit.c | 6 +++---
kernel/fork.c | 34 ++++++++++++++++++++++++++++++++--
2 files changed, 35 insertions(+), 5 deletions(-)
* [PATCH 1/2] userfaultfd: use RCU to free the task struct when fork fails
2019-03-25 22:56 [PATCH 0/2] userfaultfd: use RCU to free the task struct when fork fails Andrea Arcangeli
@ 2019-03-25 22:56 ` Andrea Arcangeli
2019-03-26 8:07 ` Kirill Tkhai
2019-03-26 8:56 ` Michal Hocko
2019-03-25 22:56 ` [PATCH 2/2] mm: change mm_update_next_owner() to update mm->owner with WRITE_ONCE Andrea Arcangeli
1 sibling, 2 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2019-03-25 22:56 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, zhong jiang, syzkaller-bugs,
syzbot+cbb52e396df3e565ab02, Mike Rapoport, Mike Kravetz,
Peter Xu, Dmitry Vyukov
MEMCG depends on the task structure not to be freed under
rcu_read_lock() in get_mem_cgroup_from_mm() after it dereferences
mm->owner.
An alternate possible fix would be to defer the delivery of the
userfaultfd contexts to the monitor until after fork() is guaranteed
to succeed. Such a change would require more changes because it would
create a strict ordering dependency where the uffd methods would need
to be called beyond the last potentially failing branch in order to be
safe. This solution, by contrast, only adds a dependency to common code:
set mm->owner to NULL and free the task struct that was pointed to by
mm->owner with RCU, if fork ends up failing. The userfaultfd
methods can still be called anywhere during the fork runtime and the
monitor will keep discarding orphaned "mm" coming from failed forks in
userland.
This race condition couldn't trigger if CONFIG_MEMCG was set =n at
build time.
Fixes: 893e26e61d04 ("userfaultfd: non-cooperative: Add fork() event")
Cc: stable@kernel.org
Tested-by: zhong jiang <zhongjiang@huawei.com>
Reported-by: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
kernel/fork.c | 34 ++++++++++++++++++++++++++++++++--
1 file changed, 32 insertions(+), 2 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa210b..a19790e27afd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -952,6 +952,15 @@ static void mm_init_aio(struct mm_struct *mm)
#endif
}
+static __always_inline void mm_clear_owner(struct mm_struct *mm,
+ struct task_struct *p)
+{
+#ifdef CONFIG_MEMCG
+ if (mm->owner == p)
+ WRITE_ONCE(mm->owner, NULL);
+#endif
+}
+
static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
{
#ifdef CONFIG_MEMCG
@@ -1331,6 +1340,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
free_pt:
/* don't put binfmt in mmput, we haven't got module yet */
mm->binfmt = NULL;
+ mm_init_owner(mm, NULL);
mmput(mm);
fail_nomem:
@@ -1662,6 +1672,24 @@ static inline void rcu_copy_process(struct task_struct *p)
#endif /* #ifdef CONFIG_TASKS_RCU */
}
+#ifdef CONFIG_MEMCG
+static void __delayed_free_task(struct rcu_head *rhp)
+{
+ struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
+
+ free_task(tsk);
+}
+#endif /* CONFIG_MEMCG */
+
+static __always_inline void delayed_free_task(struct task_struct *tsk)
+{
+#ifdef CONFIG_MEMCG
+ call_rcu(&tsk->rcu, __delayed_free_task);
+#else /* CONFIG_MEMCG */
+ free_task(tsk);
+#endif /* CONFIG_MEMCG */
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -2123,8 +2151,10 @@ static __latent_entropy struct task_struct *copy_process(
bad_fork_cleanup_namespaces:
exit_task_namespaces(p);
bad_fork_cleanup_mm:
- if (p->mm)
+ if (p->mm) {
+ mm_clear_owner(p->mm, p);
mmput(p->mm);
+ }
bad_fork_cleanup_signal:
if (!(clone_flags & CLONE_THREAD))
free_signal_struct(p->signal);
@@ -2155,7 +2185,7 @@ static __latent_entropy struct task_struct *copy_process(
bad_fork_free:
p->state = TASK_DEAD;
put_task_stack(p);
- free_task(p);
+ delayed_free_task(p);
fork_out:
spin_lock_irq(&current->sighand->siglock);
hlist_del_init(&delayed.node);
* [PATCH 2/2] mm: change mm_update_next_owner() to update mm->owner with WRITE_ONCE
2019-03-25 22:56 [PATCH 0/2] userfaultfd: use RCU to free the task struct when fork fails Andrea Arcangeli
2019-03-25 22:56 ` [PATCH 1/2] " Andrea Arcangeli
@ 2019-03-25 22:56 ` Andrea Arcangeli
1 sibling, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2019-03-25 22:56 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, zhong jiang, syzkaller-bugs,
syzbot+cbb52e396df3e565ab02, Mike Rapoport, Mike Kravetz,
Peter Xu, Dmitry Vyukov
The RCU reader uses rcu_dereference() inside rcu_read_lock critical
sections, so the writer shall use WRITE_ONCE. Just a cleanup, we still
rely on gcc to emit atomic writes in other places.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
kernel/exit.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index 2166c2d92ddc..8361a560cd1d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -422,7 +422,7 @@ void mm_update_next_owner(struct mm_struct *mm)
* freed task structure.
*/
if (atomic_read(&mm->mm_users) <= 1) {
- mm->owner = NULL;
+ WRITE_ONCE(mm->owner, NULL);
return;
}
@@ -462,7 +462,7 @@ void mm_update_next_owner(struct mm_struct *mm)
* most likely racing with swapoff (try_to_unuse()) or /proc or
* ptrace or page migration (get_task_mm()). Mark owner as NULL.
*/
- mm->owner = NULL;
+ WRITE_ONCE(mm->owner, NULL);
return;
assign_new_owner:
@@ -483,7 +483,7 @@ void mm_update_next_owner(struct mm_struct *mm)
put_task_struct(c);
goto retry;
}
- mm->owner = c;
+ WRITE_ONCE(mm->owner, c);
task_unlock(c);
put_task_struct(c);
}
* Re: [PATCH 1/2] userfaultfd: use RCU to free the task struct when fork fails
2019-03-25 22:56 ` [PATCH 1/2] " Andrea Arcangeli
@ 2019-03-26 8:07 ` Kirill Tkhai
2019-03-26 8:18 ` Kirill Tkhai
2019-03-26 8:56 ` Michal Hocko
1 sibling, 1 reply; 13+ messages in thread
From: Kirill Tkhai @ 2019-03-26 8:07 UTC (permalink / raw)
To: Andrea Arcangeli, Andrew Morton
Cc: linux-mm, zhong jiang, syzkaller-bugs,
syzbot+cbb52e396df3e565ab02, Mike Rapoport, Mike Kravetz,
Peter Xu, Dmitry Vyukov
On 26.03.2019 01:56, Andrea Arcangeli wrote:
> MEMCG depends on the task structure not to be freed under
> rcu_read_lock() in get_mem_cgroup_from_mm() after it dereferences
> mm->owner.
>
> An alternate possible fix would be to defer the delivery of the
> userfaultfd contexts to the monitor until after fork() is guaranteed
> to succeed. Such a change would require more changes because it would
> create a strict ordering dependency where the uffd methods would need
> to be called beyond the last potentially failing branch in order to be
> safe. This solution as opposed only adds the dependency to common code
> to set mm->owner to NULL and to free the task struct that was pointed
> by mm->owner with RCU, if fork ends up failing. The userfaultfd
> methods can still be called anywhere during the fork runtime and the
> monitor will keep discarding orphaned "mm" coming from failed forks in
> userland.
>
> This race condition couldn't trigger if CONFIG_MEMCG was set =n at
> build time.
>
> Fixes: 893e26e61d04 ("userfaultfd: non-cooperative: Add fork() event")
> Cc: stable@kernel.org
> Tested-by: zhong jiang <zhongjiang@huawei.com>
> Reported-by: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> kernel/fork.c | 34 ++++++++++++++++++++++++++++++++--
> 1 file changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9dcd18aa210b..a19790e27afd 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -952,6 +952,15 @@ static void mm_init_aio(struct mm_struct *mm)
> #endif
> }
>
> +static __always_inline void mm_clear_owner(struct mm_struct *mm,
> + struct task_struct *p)
> +{
> +#ifdef CONFIG_MEMCG
> + if (mm->owner == p)
> + WRITE_ONCE(mm->owner, NULL);
> +#endif
> +}
> +
> static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
> {
> #ifdef CONFIG_MEMCG
> @@ -1331,6 +1340,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
> free_pt:
> /* don't put binfmt in mmput, we haven't got module yet */
> mm->binfmt = NULL;
> + mm_init_owner(mm, NULL);
> mmput(mm);
>
> fail_nomem:
> @@ -1662,6 +1672,24 @@ static inline void rcu_copy_process(struct task_struct *p)
> #endif /* #ifdef CONFIG_TASKS_RCU */
> }
>
> +#ifdef CONFIG_MEMCG
> +static void __delayed_free_task(struct rcu_head *rhp)
> +{
> + struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
> +
> + free_task(tsk);
> +}
> +#endif /* CONFIG_MEMCG */
> +
> +static __always_inline void delayed_free_task(struct task_struct *tsk)
> +{
> +#ifdef CONFIG_MEMCG
> + call_rcu(&tsk->rcu, __delayed_free_task);
> +#else /* CONFIG_MEMCG */
> + free_task(tsk);
> +#endif /* CONFIG_MEMCG */
> +}
> +
> /*
> * This creates a new process as a copy of the old one,
> * but does not actually start it yet.
> @@ -2123,8 +2151,10 @@ static __latent_entropy struct task_struct *copy_process(
> bad_fork_cleanup_namespaces:
> exit_task_namespaces(p);
> bad_fork_cleanup_mm:
> - if (p->mm)
> + if (p->mm) {
> + mm_clear_owner(p->mm, p);
> mmput(p->mm);
> + }
> bad_fork_cleanup_signal:
> if (!(clone_flags & CLONE_THREAD))
> free_signal_struct(p->signal);
> @@ -2155,7 +2185,7 @@ static __latent_entropy struct task_struct *copy_process(
> bad_fork_free:
> p->state = TASK_DEAD;
> put_task_stack(p);
> - free_task(p);
> + delayed_free_task(p);
Can't call_rcu(&p->rcu, delayed_put_task_struct) be used instead this?
Kirill
* Re: [PATCH 1/2] userfaultfd: use RCU to free the task struct when fork fails
2019-03-26 8:07 ` Kirill Tkhai
@ 2019-03-26 8:18 ` Kirill Tkhai
2019-03-26 8:19 ` Kirill Tkhai
0 siblings, 1 reply; 13+ messages in thread
From: Kirill Tkhai @ 2019-03-26 8:18 UTC (permalink / raw)
To: Andrea Arcangeli, Andrew Morton
Cc: linux-mm, zhong jiang, syzkaller-bugs,
syzbot+cbb52e396df3e565ab02, Mike Rapoport, Mike Kravetz,
Peter Xu, Dmitry Vyukov
On 26.03.2019 11:07, Kirill Tkhai wrote:
> On 26.03.2019 01:56, Andrea Arcangeli wrote:
>> MEMCG depends on the task structure not to be freed under
>> rcu_read_lock() in get_mem_cgroup_from_mm() after it dereferences
>> mm->owner.
>>
>> An alternate possible fix would be to defer the delivery of the
>> userfaultfd contexts to the monitor until after fork() is guaranteed
>> to succeed. Such a change would require more changes because it would
>> create a strict ordering dependency where the uffd methods would need
>> to be called beyond the last potentially failing branch in order to be
>> safe. This solution as opposed only adds the dependency to common code
>> to set mm->owner to NULL and to free the task struct that was pointed
>> by mm->owner with RCU, if fork ends up failing. The userfaultfd
>> methods can still be called anywhere during the fork runtime and the
>> monitor will keep discarding orphaned "mm" coming from failed forks in
>> userland.
>>
>> This race condition couldn't trigger if CONFIG_MEMCG was set =n at
>> build time.
>>
>> Fixes: 893e26e61d04 ("userfaultfd: non-cooperative: Add fork() event")
>> Cc: stable@kernel.org
>> Tested-by: zhong jiang <zhongjiang@huawei.com>
>> Reported-by: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>> ---
>> kernel/fork.c | 34 ++++++++++++++++++++++++++++++++--
>> 1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 9dcd18aa210b..a19790e27afd 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -952,6 +952,15 @@ static void mm_init_aio(struct mm_struct *mm)
>> #endif
>> }
>>
>> +static __always_inline void mm_clear_owner(struct mm_struct *mm,
>> + struct task_struct *p)
>> +{
>> +#ifdef CONFIG_MEMCG
>> + if (mm->owner == p)
>> + WRITE_ONCE(mm->owner, NULL);
>> +#endif
>> +}
>> +
>> static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
>> {
>> #ifdef CONFIG_MEMCG
>> @@ -1331,6 +1340,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
>> free_pt:
>> /* don't put binfmt in mmput, we haven't got module yet */
>> mm->binfmt = NULL;
>> + mm_init_owner(mm, NULL);
>> mmput(mm);
>>
>> fail_nomem:
>> @@ -1662,6 +1672,24 @@ static inline void rcu_copy_process(struct task_struct *p)
>> #endif /* #ifdef CONFIG_TASKS_RCU */
>> }
>>
>> +#ifdef CONFIG_MEMCG
>> +static void __delayed_free_task(struct rcu_head *rhp)
>> +{
>> + struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
>> +
>> + free_task(tsk);
>> +}
>> +#endif /* CONFIG_MEMCG */
>> +
>> +static __always_inline void delayed_free_task(struct task_struct *tsk)
>> +{
>> +#ifdef CONFIG_MEMCG
>> + call_rcu(&tsk->rcu, __delayed_free_task);
>> +#else /* CONFIG_MEMCG */
>> + free_task(tsk);
>> +#endif /* CONFIG_MEMCG */
>> +}
>> +
>> /*
>> * This creates a new process as a copy of the old one,
>> * but does not actually start it yet.
>> @@ -2123,8 +2151,10 @@ static __latent_entropy struct task_struct *copy_process(
>> bad_fork_cleanup_namespaces:
>> exit_task_namespaces(p);
>> bad_fork_cleanup_mm:
>> - if (p->mm)
>> + if (p->mm) {
>> + mm_clear_owner(p->mm, p);
>> mmput(p->mm);
>> + }
>> bad_fork_cleanup_signal:
>> if (!(clone_flags & CLONE_THREAD))
>> free_signal_struct(p->signal);
>> @@ -2155,7 +2185,7 @@ static __latent_entropy struct task_struct *copy_process(
>> bad_fork_free:
>> p->state = TASK_DEAD;
>> put_task_stack(p);
>> - free_task(p);
>> + delayed_free_task(p);
>
> Can't call_rcu(&p->rcu, delayed_put_task_struct) be used instead this?
I mean:
refcount_set(&tsk->usage, 2);
call_rcu(&p->rcu, delayed_put_task_struct);
And:
diff --git a/kernel/fork.c b/kernel/fork.c
index 3c516c6f7ce4..27cdf61b51a1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -715,7 +715,9 @@ static inline void put_signal_struct(struct signal_struct *sig)
void __put_task_struct(struct task_struct *tsk)
{
- WARN_ON(!tsk->exit_state);
+ if (!tsk->exit_state)
+ /* Cleanup of copy_process() */
+ goto free;
WARN_ON(refcount_read(&tsk->usage));
WARN_ON(tsk == current);
@@ -727,6 +729,7 @@ void __put_task_struct(struct task_struct *tsk)
put_signal_struct(tsk->signal);
if (!profile_handoff_task(tsk))
+free:
free_task(tsk);
}
EXPORT_SYMBOL_GPL(__put_task_struct);
* Re: [PATCH 1/2] userfaultfd: use RCU to free the task struct when fork fails
2019-03-26 8:18 ` Kirill Tkhai
@ 2019-03-26 8:19 ` Kirill Tkhai
0 siblings, 0 replies; 13+ messages in thread
From: Kirill Tkhai @ 2019-03-26 8:19 UTC (permalink / raw)
To: Andrea Arcangeli, Andrew Morton
Cc: linux-mm, zhong jiang, syzkaller-bugs,
syzbot+cbb52e396df3e565ab02, Mike Rapoport, Mike Kravetz,
Peter Xu, Dmitry Vyukov
On 26.03.2019 11:18, Kirill Tkhai wrote:
> On 26.03.2019 11:07, Kirill Tkhai wrote:
>> On 26.03.2019 01:56, Andrea Arcangeli wrote:
>>> MEMCG depends on the task structure not to be freed under
>>> rcu_read_lock() in get_mem_cgroup_from_mm() after it dereferences
>>> mm->owner.
>>>
>>> An alternate possible fix would be to defer the delivery of the
>>> userfaultfd contexts to the monitor until after fork() is guaranteed
>>> to succeed. Such a change would require more changes because it would
>>> create a strict ordering dependency where the uffd methods would need
>>> to be called beyond the last potentially failing branch in order to be
>>> safe. This solution as opposed only adds the dependency to common code
>>> to set mm->owner to NULL and to free the task struct that was pointed
>>> by mm->owner with RCU, if fork ends up failing. The userfaultfd
>>> methods can still be called anywhere during the fork runtime and the
>>> monitor will keep discarding orphaned "mm" coming from failed forks in
>>> userland.
>>>
>>> This race condition couldn't trigger if CONFIG_MEMCG was set =n at
>>> build time.
>>>
>>> Fixes: 893e26e61d04 ("userfaultfd: non-cooperative: Add fork() event")
>>> Cc: stable@kernel.org
>>> Tested-by: zhong jiang <zhongjiang@huawei.com>
>>> Reported-by: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
>>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>>> ---
>>> kernel/fork.c | 34 ++++++++++++++++++++++++++++++++--
>>> 1 file changed, 32 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 9dcd18aa210b..a19790e27afd 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -952,6 +952,15 @@ static void mm_init_aio(struct mm_struct *mm)
>>> #endif
>>> }
>>>
>>> +static __always_inline void mm_clear_owner(struct mm_struct *mm,
>>> + struct task_struct *p)
>>> +{
>>> +#ifdef CONFIG_MEMCG
>>> + if (mm->owner == p)
>>> + WRITE_ONCE(mm->owner, NULL);
>>> +#endif
>>> +}
>>> +
>>> static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
>>> {
>>> #ifdef CONFIG_MEMCG
>>> @@ -1331,6 +1340,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
>>> free_pt:
>>> /* don't put binfmt in mmput, we haven't got module yet */
>>> mm->binfmt = NULL;
>>> + mm_init_owner(mm, NULL);
>>> mmput(mm);
>>>
>>> fail_nomem:
>>> @@ -1662,6 +1672,24 @@ static inline void rcu_copy_process(struct task_struct *p)
>>> #endif /* #ifdef CONFIG_TASKS_RCU */
>>> }
>>>
>>> +#ifdef CONFIG_MEMCG
>>> +static void __delayed_free_task(struct rcu_head *rhp)
>>> +{
>>> + struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
>>> +
>>> + free_task(tsk);
>>> +}
>>> +#endif /* CONFIG_MEMCG */
>>> +
>>> +static __always_inline void delayed_free_task(struct task_struct *tsk)
>>> +{
>>> +#ifdef CONFIG_MEMCG
>>> + call_rcu(&tsk->rcu, __delayed_free_task);
>>> +#else /* CONFIG_MEMCG */
>>> + free_task(tsk);
>>> +#endif /* CONFIG_MEMCG */
>>> +}
>>> +
>>> /*
>>> * This creates a new process as a copy of the old one,
>>> * but does not actually start it yet.
>>> @@ -2123,8 +2151,10 @@ static __latent_entropy struct task_struct *copy_process(
>>> bad_fork_cleanup_namespaces:
>>> exit_task_namespaces(p);
>>> bad_fork_cleanup_mm:
>>> - if (p->mm)
>>> + if (p->mm) {
>>> + mm_clear_owner(p->mm, p);
>>> mmput(p->mm);
>>> + }
>>> bad_fork_cleanup_signal:
>>> if (!(clone_flags & CLONE_THREAD))
>>> free_signal_struct(p->signal);
>>> @@ -2155,7 +2185,7 @@ static __latent_entropy struct task_struct *copy_process(
>>> bad_fork_free:
>>> p->state = TASK_DEAD;
>>> put_task_stack(p);
>>> - free_task(p);
>>> + delayed_free_task(p);
>>
>> Can't call_rcu(&p->rcu, delayed_put_task_struct) be used instead this?
>
> I mean:
>
> refcount_set(&tsk->usage, 2);
I.e., refcount_set(&tsk->usage, 1);
> call_rcu(&p->rcu, delayed_put_task_struct);
>
> And:
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3c516c6f7ce4..27cdf61b51a1 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -715,7 +715,9 @@ static inline void put_signal_struct(struct signal_struct *sig)
>
> void __put_task_struct(struct task_struct *tsk)
> {
> - WARN_ON(!tsk->exit_state);
> + if (!tsk->exit_state)
> + /* Cleanup of copy_process() */
> + goto free;
> WARN_ON(refcount_read(&tsk->usage));
> WARN_ON(tsk == current);
>
> @@ -727,6 +729,7 @@ void __put_task_struct(struct task_struct *tsk)
> put_signal_struct(tsk->signal);
>
> if (!profile_handoff_task(tsk))
> +free:
> free_task(tsk);
> }
> EXPORT_SYMBOL_GPL(__put_task_struct);
>
* Re: [PATCH 1/2] userfaultfd: use RCU to free the task struct when fork fails
2019-03-25 22:56 ` [PATCH 1/2] " Andrea Arcangeli
2019-03-26 8:07 ` Kirill Tkhai
@ 2019-03-26 8:56 ` Michal Hocko
2019-03-27 0:16 ` Andrea Arcangeli
1 sibling, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2019-03-26 8:56 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, linux-mm, zhong jiang, syzkaller-bugs,
syzbot+cbb52e396df3e565ab02, Mike Rapoport, Mike Kravetz,
Peter Xu, Dmitry Vyukov
On Mon 25-03-19 18:56:35, Andrea Arcangeli wrote:
> MEMCG depends on the task structure not to be freed under
> rcu_read_lock() in get_mem_cgroup_from_mm() after it dereferences
> mm->owner.
Please state the actual problem. Your cover letter mentions a race
condition. Please make it explicit in the changelog.
> An alternate possible fix would be to defer the delivery of the
> userfaultfd contexts to the monitor until after fork() is guaranteed
> to succeed. Such a change would require more changes because it would
> create a strict ordering dependency where the uffd methods would need
> to be called beyond the last potentially failing branch in order to be
> safe.
How much more changes are we talking about? Because ...
> This solution as opposed only adds the dependency to common code
> to set mm->owner to NULL and to free the task struct that was pointed
> by mm->owner with RCU, if fork ends up failing. The userfaultfd
> methods can still be called anywhere during the fork runtime and the
> monitor will keep discarding orphaned "mm" coming from failed forks in
> userland.
... this is adding a subtle hack that might break in the future because
copy_process error paths are far from trivial and quite error prone
IMHO. I am not opposed to the patch in principle but I would really like
to see what kind of solutions we are comparing here.
> This race condition couldn't trigger if CONFIG_MEMCG was set =n at
> build time.
All the CONFIG_MEMCG is just ugly as hell. Can we reduce that please?
E.g. use if (IS_ENABLED(CONFIG_MEMCG)) where appropriate?
[...]
> +static __always_inline void mm_clear_owner(struct mm_struct *mm,
> + struct task_struct *p)
> +{
> +#ifdef CONFIG_MEMCG
> + if (mm->owner == p)
> + WRITE_ONCE(mm->owner, NULL);
> +#endif
How can we ever hit this warning and what does that mean?
--
Michal Hocko
SUSE Labs
* Re: [PATCH 1/2] userfaultfd: use RCU to free the task struct when fork fails
2019-03-26 8:56 ` Michal Hocko
@ 2019-03-27 0:16 ` Andrea Arcangeli
2019-03-27 8:49 ` Michal Hocko
0 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2019-03-27 0:16 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-mm, zhong jiang, syzkaller-bugs,
syzbot+cbb52e396df3e565ab02, Mike Rapoport, Mike Kravetz,
Peter Xu, Dmitry Vyukov
On Tue, Mar 26, 2019 at 09:56:43AM +0100, Michal Hocko wrote:
> On Mon 25-03-19 18:56:35, Andrea Arcangeli wrote:
> > MEMCG depends on the task structure not to be freed under
> > rcu_read_lock() in get_mem_cgroup_from_mm() after it dereferences
> > mm->owner.
>
> Please state the actual problem. Your cover letter mentiones a race
> condition. Please make it explicit in the changelog.
The actual problem is that the task structure is freed while
get_mem_cgroup_from_mm() holds rcu_read_lock() and dereferences
mm->owner.
I thought the breakage of RCU is pretty clear, but we could add a
description of the race like I did in the original thread:
https://lkml.kernel.org/r/000000000000601367057a095de4@google.com
https://lkml.kernel.org/r/20190316194222.GA29767@redhat.com
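To make the window concrete, here is a simplified sketch of the reader
side (modeled on get_mem_cgroup_from_mm(); the real function
additionally takes a reference on the returned css, so this is not the
exact upstream code). The task_struct reached through mm->owner must
stay valid until rcu_read_unlock(), which is exactly what the
failed-fork path did not guarantee:

static struct mem_cgroup *get_mem_cgroup_from_mm_sketch(struct mm_struct *mm)
{
	struct mem_cgroup *memcg = root_mem_cgroup;
	struct task_struct *task;

	rcu_read_lock();
	task = rcu_dereference(mm->owner);
	if (task)
		/*
		 * use-after-free if "task" was freed by a failed fork
		 * without waiting for an RCU grace period
		 */
		memcg = mem_cgroup_from_task(task);
	rcu_read_unlock();
	return memcg;
}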
> > An alternate possible fix would be to defer the delivery of the
> > userfaultfd contexts to the monitor until after fork() is guaranteed
> > to succeed. Such a change would require more changes because it would
> > create a strict ordering dependency where the uffd methods would need
> > to be called beyond the last potentially failing branch in order to be
> > safe.
>
> How much more changes are we talking about? Because ...
I haven't implemented it, but I can theorize. It would require a new
hooking point, with information accumulated in RAM and passed from the
current hooking point to the new one, and with the delivery of such
information to the uffd monitor (the fd reader) held off until the new
hooking point is invoked. The new hooking point would need to be
invoked after fork can no longer fail.
We already accumulate some information in RAM there, but the first
delivery happens at a point where fork can still fail.
> > This solution as opposed only adds the dependency to common code
> > to set mm->owner to NULL and to free the task struct that was pointed
> > by mm->owner with RCU, if fork ends up failing. The userfaultfd
> > methods can still be called anywhere during the fork runtime and the
> > monitor will keep discarding orphaned "mm" coming from failed forks in
> > userland.
>
> ... this is adding a subtle hack that might break in the future because
> copy_process error paths are far from trivial and quite error prone
> IMHO. I am not opposed to the patch in principle but I would really like
> to see what kind of solutions we are comparing here.
The rule of clearing mm->owner and then freeing the mm->owner memory
with call_rcu is already followed everywhere else. See for example
mm_update_next_owner(), which sets mm->owner to NULL and only then
invokes put_task_struct, which frees the memory pointed to by the old
value of mm->owner using RCU.
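Spelled out as a minimal sketch (hypothetical helper name;
__delayed_free_task is the callback added by this patch), the ordering
rule is simply "unpublish first, free after a grace period":

static void retire_mm_owner_sketch(struct mm_struct *mm,
				   struct task_struct *old_owner)
{
	/* step 1: unpublish, so new rcu_dereference() readers see NULL */
	WRITE_ONCE(mm->owner, NULL);
	/* step 2: free only after all pre-existing RCU readers are done */
	call_rcu(&old_owner->rcu, __delayed_free_task);
}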
The "subtle hack" already happens at every exit when MEMCG=y. All the
patch does is to extend the "subtle hack" to the fork failure path too
which it didn't follow the rule and it didn't clear mm->owner and it
just freed the task struct without waiting for a RCU grace period. In
fact like pointed out by Kirill Tkhai we could reuse
delayed_put_task_struct method that is already used by exit, except it
does more than freeing the task structure and it relies on refcounters
to be initialized so I thought the free_task -> call_rcu( free_task)
conversion was simpler and more obviously safe. Sharing the other
method only looked a complication that requires syncing up the
refcounts.
I think the only conceptual simplification possible would be again to
add a new hooking point and more buildup of information until fork
cannot fail, but in implementation terms I doubt the fix will become
smaller or simpler that way.
> > This race condition couldn't trigger if CONFIG_MEMCG was set =n at
> > build time.
>
> All the CONFIG_MEMCG is just ugly as hell. Can we reduce that please?
> E.g. use if (IS_ENABLED(CONFIG_MEMCG)) where appropriate?
There's just one place where I could use that instead of #ifdef.
> > +static __always_inline void mm_clear_owner(struct mm_struct *mm,
> > + struct task_struct *p)
> > +{
> > +#ifdef CONFIG_MEMCG
> > + if (mm->owner == p)
> > + WRITE_ONCE(mm->owner, NULL);
> > +#endif
>
> How can we ever hit this warning and what does that mean?
Which warning?
* Re: [PATCH 1/2] userfaultfd: use RCU to free the task struct when fork fails
2019-03-27 0:16 ` Andrea Arcangeli
@ 2019-03-27 8:49 ` Michal Hocko
2019-04-26 4:48 ` Andrew Morton
2019-04-29 3:57 ` [PATCH 1/1 v2] " Andrea Arcangeli
0 siblings, 2 replies; 13+ messages in thread
From: Michal Hocko @ 2019-03-27 8:49 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, linux-mm, zhong jiang, syzkaller-bugs,
syzbot+cbb52e396df3e565ab02, Mike Rapoport, Mike Kravetz,
Peter Xu, Dmitry Vyukov
On Tue 26-03-19 20:16:16, Andrea Arcangeli wrote:
> On Tue, Mar 26, 2019 at 09:56:43AM +0100, Michal Hocko wrote:
> > On Mon 25-03-19 18:56:35, Andrea Arcangeli wrote:
> > > MEMCG depends on the task structure not to be freed under
> > > rcu_read_lock() in get_mem_cgroup_from_mm() after it dereferences
> > > mm->owner.
> >
> > Please state the actual problem. Your cover letter mentiones a race
> > condition. Please make it explicit in the changelog.
>
> The actual problem is the task structure is freed while
> get_mem_cgroup_from_mm() holds rcu_read_lock() and dereferences
> mm->owner.
>
> I thought the breakage of RCU is pretty clear, but we could add a
> description of the race like I did in the original thread:
>
> https://lkml.kernel.org/r/000000000000601367057a095de4@google.com
> https://lkml.kernel.org/r/20190316194222.GA29767@redhat.com
Yes please. That really belongs to the changelog. You do not expect
people chasing long email threads or code to figure that out, right?
> > > An alternate possible fix would be to defer the delivery of the
> > > userfaultfd contexts to the monitor until after fork() is guaranteed
> > > to succeed. Such a change would require more changes because it would
> > > create a strict ordering dependency where the uffd methods would need
> > > to be called beyond the last potentially failing branch in order to be
> > > safe.
> >
> > How much more changes are we talking about? Because ...
>
> I haven't implemented but I can theorize. It should require a new
> hooking point and information being accumulated in RAM and passed from
> the current hooking point to the new hooking point and to hold off the
> delivery of such information to the uffd monitor (the fd reader),
> until the new hooking point is invoked. The new hooking point would
> need to be invoked after fork cannot fail anymore.
>
> We already accumulate some information in RAM there, but the first
> delivery happens at a point where fork can still fail.
I am sorry but this is not really clear to me. What is the problem with
postponing the hooking point, and how much more data are we talking
about here?
> > > This solution as opposed only adds the dependency to common code
> > > to set mm->owner to NULL and to free the task struct that was pointed
> > > by mm->owner with RCU, if fork ends up failing. The userfaultfd
> > > methods can still be called anywhere during the fork runtime and the
> > > monitor will keep discarding orphaned "mm" coming from failed forks in
> > > userland.
> >
> > ... this is adding a subtle hack that might break in the future because
> > copy_process error paths are far from trivial and quite error prone
> > IMHO. I am not opposed to the patch in principle but I would really like
> > to see what kind of solutions we are comparing here.
>
> The rule of clearing mm->owner and then freeing the mm->owner memory
> with call_rcu is already followed everywhere else. See for example
> mm_update_next_owner() that sets mm->owner to NULL and only then
> invokes put_task_struct which frees the memory pointed by the old
> value of mm->owner using RCU.
>
> The "subtle hack" already happens at every exit when MEMCG=y. All the
> patch does is to extend the "subtle hack" to the fork failure path too
> which it didn't follow the rule and it didn't clear mm->owner and it
> just freed the task struct without waiting for a RCU grace period. In
> fact like pointed out by Kirill Tkhai we could reuse
> delayed_put_task_struct method that is already used by exit, except it
> does more than freeing the task structure and it relies on refcounters
> to be initialized so I thought the free_task -> call_rcu( free_task)
> conversion was simpler and more obviously safe. Sharing the other
> method only looked a complication that requires syncing up the
> refcounts.
>
> I think the only conceptual simplification possible would be again to
> add a new hooking point and more buildup of information until fork
> cannot fail, but in implementation terms I doubt the fix will become
> smaller or simpler that way.
Well, in general I prefer the code to be memcg neutral as much as
possible. We might have this subtle dependency with memcg now but this
is not specific to memcg in general. Therefore, if there is a way to
make a userfault specific fix then I would prefer it. If that is not
feasible then fair enough.
JFYI, getting rid of mm->owner is a long term plan. This is just too
ugly to live. Easier said than done, unfortunately.
> > > This race condition couldn't trigger if CONFIG_MEMCG was set =n at
> > > build time.
> >
> > All the CONFIG_MEMCG is just ugly as hell. Can we reduce that please?
> > E.g. use if (IS_ENABLED(CONFIG_MEMCG)) where appropriate?
>
> There's just one place where I could use that instead of #ifdef.
OK, I can see it now. Is there any strong reason to make the delayed
freeing conditional? Dropping the condition would spare at least part
of the ugliness.
> > > +static __always_inline void mm_clear_owner(struct mm_struct *mm,
> > > + struct task_struct *p)
> > > +{
> > > +#ifdef CONFIG_MEMCG
> > > + if (mm->owner == p)
> > > + WRITE_ONCE(mm->owner, NULL);
> > > +#endif
> >
> > How can we ever hit this warning and what does that mean?
>
> Which warning?
A brain fart, I would have sworn that I've seen WARN_ON_ONCE. Sorry
about the confusion.
--
Michal Hocko
SUSE Labs
* Re: [PATCH 1/2] userfaultfd: use RCU to free the task struct when fork fails
2019-03-27 8:49 ` Michal Hocko
@ 2019-04-26 4:48 ` Andrew Morton
2019-04-29 3:57 ` [PATCH 1/1 v2] " Andrea Arcangeli
1 sibling, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2019-04-26 4:48 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrea Arcangeli, linux-mm, zhong jiang, syzkaller-bugs,
syzbot+cbb52e396df3e565ab02, Mike Rapoport, Mike Kravetz,
Peter Xu, Dmitry Vyukov
This patch is presently stuck. AFAICT we just need a changelog update
to reflect Michal's observations?
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: userfaultfd: use RCU to free the task struct when fork fails
MEMCG depends on the task structure not to be freed under rcu_read_lock()
in get_mem_cgroup_from_mm() after it dereferences mm->owner.
An alternate possible fix would be to defer the delivery of the
userfaultfd contexts to the monitor until after fork() is guaranteed to
succeed. Such a change would require more changes because it would create
a strict ordering dependency where the uffd methods would need to be
called beyond the last potentially failing branch in order to be safe.
This solution, by contrast, only adds a dependency to common code: set
mm->owner to NULL and free the task struct that was pointed to by
mm->owner with RCU, if fork ends up failing. The userfaultfd methods can
still be called anywhere during the fork runtime and the monitor will keep
discarding orphaned "mm" coming from failed forks in userland.
This race condition couldn't trigger if CONFIG_MEMCG was set =n at build
time.
Link: http://lkml.kernel.org/r/20190325225636.11635-2-aarcange@redhat.com
Fixes: 893e26e61d04 ("userfaultfd: non-cooperative: Add fork() event")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: zhong jiang <zhongjiang@huawei.com>
Reported-by: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Cc: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
kernel/fork.c | 34 ++++++++++++++++++++++++++++++++--
1 file changed, 32 insertions(+), 2 deletions(-)
--- a/kernel/fork.c~userfaultfd-use-rcu-to-free-the-task-struct-when-fork-fails
+++ a/kernel/fork.c
@@ -952,6 +952,15 @@ static void mm_init_aio(struct mm_struct
#endif
}
+static __always_inline void mm_clear_owner(struct mm_struct *mm,
+ struct task_struct *p)
+{
+#ifdef CONFIG_MEMCG
+ if (mm->owner == p)
+ WRITE_ONCE(mm->owner, NULL);
+#endif
+}
+
static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
{
#ifdef CONFIG_MEMCG
@@ -1331,6 +1340,7 @@ static struct mm_struct *dup_mm(struct t
free_pt:
/* don't put binfmt in mmput, we haven't got module yet */
mm->binfmt = NULL;
+ mm_init_owner(mm, NULL);
mmput(mm);
fail_nomem:
@@ -1662,6 +1672,24 @@ static inline void rcu_copy_process(stru
#endif /* #ifdef CONFIG_TASKS_RCU */
}
+#ifdef CONFIG_MEMCG
+static void __delayed_free_task(struct rcu_head *rhp)
+{
+ struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
+
+ free_task(tsk);
+}
+#endif /* CONFIG_MEMCG */
+
+static __always_inline void delayed_free_task(struct task_struct *tsk)
+{
+#ifdef CONFIG_MEMCG
+ call_rcu(&tsk->rcu, __delayed_free_task);
+#else /* CONFIG_MEMCG */
+ free_task(tsk);
+#endif /* CONFIG_MEMCG */
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -2123,8 +2151,10 @@ bad_fork_cleanup_io:
bad_fork_cleanup_namespaces:
exit_task_namespaces(p);
bad_fork_cleanup_mm:
- if (p->mm)
+ if (p->mm) {
+ mm_clear_owner(p->mm, p);
mmput(p->mm);
+ }
bad_fork_cleanup_signal:
if (!(clone_flags & CLONE_THREAD))
free_signal_struct(p->signal);
@@ -2155,7 +2185,7 @@ bad_fork_cleanup_count:
bad_fork_free:
p->state = TASK_DEAD;
put_task_stack(p);
- free_task(p);
+ delayed_free_task(p);
fork_out:
spin_lock_irq(&current->sighand->siglock);
hlist_del_init(&delayed.node);
_
* [PATCH 1/1 v2] userfaultfd: use RCU to free the task struct when fork fails
2019-03-27 8:49 ` Michal Hocko
2019-04-26 4:48 ` Andrew Morton
@ 2019-04-29 3:57 ` Andrea Arcangeli
2019-04-29 6:36 ` zhong jiang
1 sibling, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2019-04-29 3:57 UTC (permalink / raw)
To: Andrew Morton
Cc: Michal Hocko, linux-mm, zhong jiang, syzkaller-bugs,
syzbot+cbb52e396df3e565ab02, Mike Rapoport, Mike Kravetz,
Peter Xu, Dmitry Vyukov
The task structure is freed while get_mem_cgroup_from_mm() holds
rcu_read_lock() and dereferences mm->owner.
  get_mem_cgroup_from_mm()                failing fork()
  ----                                    ---
  task = mm->owner
                                          mm->owner = NULL;
                                          free(task)
  if (task) *task; /* use after free */
The fix consists in freeing the task with RCU also in the fork failure
case, exactly like it always happens for the regular exit(2)
path. That is enough to make the rcu_read_lock hold in
get_mem_cgroup_from_mm() (left side above) effective to avoid a use
after free when dereferencing the task structure.
An alternate possible fix would be to defer the delivery of the
userfaultfd contexts to the monitor until after fork() is guaranteed
to succeed. Such a change would require more changes because it would
create a strict ordering dependency where the uffd methods would need
to be called beyond the last potentially failing branch in order to be
safe. This solution, by contrast, only adds a dependency to common code:
set mm->owner to NULL and free the task struct that was pointed to by
mm->owner with RCU, if fork ends up failing. The userfaultfd
methods can still be called anywhere during the fork runtime and the
monitor will keep discarding orphaned "mm" coming from failed forks in
userland.
This race condition couldn't trigger if CONFIG_MEMCG was set =n at
build time.
v2: improved commit header and reduced #ifdef material suggested by
Michal Hocko.
Fixes: 893e26e61d04 ("userfaultfd: non-cooperative: Add fork() event")
Cc: stable@kernel.org
Tested-by: zhong jiang <zhongjiang@huawei.com>
Reported-by: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
kernel/fork.c | 31 +++++++++++++++++++++++++++++--
1 file changed, 29 insertions(+), 2 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa210b..2628f3773ca8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -952,6 +952,15 @@ static void mm_init_aio(struct mm_struct *mm)
#endif
}
+static __always_inline void mm_clear_owner(struct mm_struct *mm,
+ struct task_struct *p)
+{
+#ifdef CONFIG_MEMCG
+ if (mm->owner == p)
+ WRITE_ONCE(mm->owner, NULL);
+#endif
+}
+
static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
{
#ifdef CONFIG_MEMCG
@@ -1331,6 +1340,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
free_pt:
/* don't put binfmt in mmput, we haven't got module yet */
mm->binfmt = NULL;
+ mm_init_owner(mm, NULL);
mmput(mm);
fail_nomem:
@@ -1662,6 +1672,21 @@ static inline void rcu_copy_process(struct task_struct *p)
#endif /* #ifdef CONFIG_TASKS_RCU */
}
+static void __delayed_free_task(struct rcu_head *rhp)
+{
+ struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
+
+ free_task(tsk);
+}
+
+static __always_inline void delayed_free_task(struct task_struct *tsk)
+{
+ if (IS_ENABLED(CONFIG_MEMCG))
+ call_rcu(&tsk->rcu, __delayed_free_task);
+ else
+ free_task(tsk);
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -2123,8 +2148,10 @@ static __latent_entropy struct task_struct *copy_process(
bad_fork_cleanup_namespaces:
exit_task_namespaces(p);
bad_fork_cleanup_mm:
- if (p->mm)
+ if (p->mm) {
+ mm_clear_owner(p->mm, p);
mmput(p->mm);
+ }
bad_fork_cleanup_signal:
if (!(clone_flags & CLONE_THREAD))
free_signal_struct(p->signal);
@@ -2155,7 +2182,7 @@ static __latent_entropy struct task_struct *copy_process(
bad_fork_free:
p->state = TASK_DEAD;
put_task_stack(p);
- free_task(p);
+ delayed_free_task(p);
fork_out:
spin_lock_irq(&current->sighand->siglock);
hlist_del_init(&delayed.node);
* Re: [PATCH 1/1 v2] userfaultfd: use RCU to free the task struct when fork fails
2019-04-29 3:57 ` [PATCH 1/1 v2] " Andrea Arcangeli
@ 2019-04-29 6:36 ` zhong jiang
2019-04-29 14:37 ` Andrea Arcangeli
0 siblings, 1 reply; 13+ messages in thread
From: zhong jiang @ 2019-04-29 6:36 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Michal Hocko, linux-mm, syzkaller-bugs,
syzbot+cbb52e396df3e565ab02, Mike Rapoport, Mike Kravetz,
Peter Xu, Dmitry Vyukov
On 2019/4/29 11:57, Andrea Arcangeli wrote:
> The task structure is freed while get_mem_cgroup_from_mm() holds
> rcu_read_lock() and dereferences mm->owner.
>
>   get_mem_cgroup_from_mm()                failing fork()
>   ----                                    ---
>   task = mm->owner
>                                           mm->owner = NULL;
>                                           free(task)
>   if (task) *task; /* use after free */
>
> The fix consists in freeing the task with RCU also in the fork failure
> case, exactly like it always happens for the regular exit(2)
> path. That is enough to make the rcu_read_lock hold in
> get_mem_cgroup_from_mm() (left side above) effective to avoid a use
> after free when dereferencing the task structure.
>
> An alternate possible fix would be to defer the delivery of the
> userfaultfd contexts to the monitor until after fork() is guaranteed
> to succeed. Such a change would require more changes because it would
> create a strict ordering dependency where the uffd methods would need
> to be called beyond the last potentially failing branch in order to be
> safe. This solution as opposed only adds the dependency to common code
> to set mm->owner to NULL and to free the task struct that was pointed
> by mm->owner with RCU, if fork ends up failing. The userfaultfd
> methods can still be called anywhere during the fork runtime and the
> monitor will keep discarding orphaned "mm" coming from failed forks in
> userland.
>
> This race condition couldn't trigger if CONFIG_MEMCG was set =n at
> build time.
>
> v2: improved commit header and reduced #ifdef material suggested by
> Michal Hocko.
>
> Fixes: 893e26e61d04 ("userfaultfd: non-cooperative: Add fork() event")
> Cc: stable@kernel.org
> Tested-by: zhong jiang <zhongjiang@huawei.com>
> Reported-by: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> kernel/fork.c | 31 +++++++++++++++++++++++++++++--
> 1 file changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9dcd18aa210b..2628f3773ca8 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -952,6 +952,15 @@ static void mm_init_aio(struct mm_struct *mm)
> #endif
> }
>
> +static __always_inline void mm_clear_owner(struct mm_struct *mm,
> + struct task_struct *p)
> +{
> +#ifdef CONFIG_MEMCG
> + if (mm->owner == p)
> + WRITE_ONCE(mm->owner, NULL);
> +#endif
> +}
> +
> static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
> {
> #ifdef CONFIG_MEMCG
> @@ -1331,6 +1340,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
> free_pt:
> /* don't put binfmt in mmput, we haven't got module yet */
> mm->binfmt = NULL;
> + mm_init_owner(mm, NULL);
> mmput(mm);
>
> fail_nomem:
> @@ -1662,6 +1672,21 @@ static inline void rcu_copy_process(struct task_struct *p)
> #endif /* #ifdef CONFIG_TASKS_RCU */
> }
>
> +static void __delayed_free_task(struct rcu_head *rhp)
> +{
> + struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
> +
> + free_task(tsk);
> +}
If CONFIG_MEMCG is disabled, __delayed_free_task will not be used.
Thanks,
zhong jiang
> +static __always_inline void delayed_free_task(struct task_struct *tsk)
> +{
> + if (IS_ENABLED(CONFIG_MEMCG))
> + call_rcu(&tsk->rcu, __delayed_free_task);
> + else
> + free_task(tsk);
> +}
> +
> /*
> * This creates a new process as a copy of the old one,
> * but does not actually start it yet.
> @@ -2123,8 +2148,10 @@ static __latent_entropy struct task_struct *copy_process(
> bad_fork_cleanup_namespaces:
> exit_task_namespaces(p);
> bad_fork_cleanup_mm:
> - if (p->mm)
> + if (p->mm) {
> + mm_clear_owner(p->mm, p);
> mmput(p->mm);
> + }
> bad_fork_cleanup_signal:
> if (!(clone_flags & CLONE_THREAD))
> free_signal_struct(p->signal);
> @@ -2155,7 +2182,7 @@ static __latent_entropy struct task_struct *copy_process(
> bad_fork_free:
> p->state = TASK_DEAD;
> put_task_stack(p);
> - free_task(p);
> + delayed_free_task(p);
> fork_out:
> spin_lock_irq(&current->sighand->siglock);
> hlist_del_init(&delayed.node);
>
> .
>
* Re: [PATCH 1/1 v2] userfaultfd: use RCU to free the task struct when fork fails
2019-04-29 6:36 ` zhong jiang
@ 2019-04-29 14:37 ` Andrea Arcangeli
0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2019-04-29 14:37 UTC (permalink / raw)
To: zhong jiang
Cc: Andrew Morton, Michal Hocko, linux-mm, syzkaller-bugs,
syzbot+cbb52e396df3e565ab02, Mike Rapoport, Mike Kravetz,
Peter Xu, Dmitry Vyukov
Hello,
On Mon, Apr 29, 2019 at 02:36:28PM +0800, zhong jiang wrote:
> If CONFIG_MEMCG is disabled, __delayed_free_task will not be used.
Yes, the compiler optimizes that away at build time.
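For the record, this works because IS_ENABLED() expands to a
compile-time constant, so the dead branch is eliminated and a static
helper whose only caller sits in that branch is dropped as well, while
the syntactic reference still avoids an unused-function warning. A
minimal illustration with a hypothetical CONFIG_FOO and struct thing
(not code from this series):

#include <linux/kconfig.h>
#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct thing {
	struct rcu_head rcu;
	int data;
};

static void __delayed_free_thing(struct rcu_head *rhp)
{
	struct thing *t = container_of(rhp, struct thing, rcu);

	kfree(t);
}

static inline void delayed_free_thing(struct thing *t)
{
	if (IS_ENABLED(CONFIG_FOO))	/* constant 0 when CONFIG_FOO=n */
		call_rcu(&t->rcu, __delayed_free_thing);
	else
		kfree(t);
}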
Thanks,
Andrea
Thread overview: 13 messages
2019-03-25 22:56 [PATCH 0/2] userfaultfd: use RCU to free the task struct when fork fails Andrea Arcangeli
2019-03-25 22:56 ` [PATCH 1/2] " Andrea Arcangeli
2019-03-26 8:07 ` Kirill Tkhai
2019-03-26 8:18 ` Kirill Tkhai
2019-03-26 8:19 ` Kirill Tkhai
2019-03-26 8:56 ` Michal Hocko
2019-03-27 0:16 ` Andrea Arcangeli
2019-03-27 8:49 ` Michal Hocko
2019-04-26 4:48 ` Andrew Morton
2019-04-29 3:57 ` [PATCH 1/1 v2] " Andrea Arcangeli
2019-04-29 6:36 ` zhong jiang
2019-04-29 14:37 ` Andrea Arcangeli
2019-03-25 22:56 ` [PATCH 2/2] mm: change mm_update_next_owner() to update mm->owner with WRITE_ONCE Andrea Arcangeli