linux-mm.kvack.org archive mirror
* [PATCH v2] mm: optimize the redundant loop of mm_update_next_owner()
@ 2024-06-20 15:27 alexjlzheng
  2024-06-20 17:30 ` Oleg Nesterov
  2024-06-27  7:44 ` Michal Hocko
  0 siblings, 2 replies; 7+ messages in thread
From: alexjlzheng @ 2024-06-20 15:27 UTC (permalink / raw)
  To: akpm, brauner
  Cc: axboe, oleg, tandersen, willy, mjguzik, alexjlzheng,
	linux-kernel, linux-mm

From: Jinliang Zheng <alexjlzheng@tencent.com>

When mm_update_next_owner() races with swapoff (try_to_unuse()), /proc,
ptrace or page migration (get_task_mm()), the loop cannot find any
task_struct whose mm_struct matches the target mm_struct, yet it still
walks the whole task list.

If the above race is combined with the stress-ng-zombie and stress-ng-dup
tests, such a long loop can easily cause a hard lockup in write_lock_irq()
for tasklist_lock.

Recognize this situation in advance and exit early.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
Changelog:

V2: Fix mm_update_owner_next() to mm_update_next_owner() in comment
---
 kernel/exit.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/exit.c b/kernel/exit.c
index f95a2c1338a8..81fcee45d630 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
 	 * Search through everything else, we should not get here often.
 	 */
 	for_each_process(g) {
+		if (atomic_read(&mm->mm_users) <= 1)
+			break;
 		if (g->flags & PF_KTHREAD)
 			continue;
 		for_each_thread(g, c) {
-- 
2.39.3




* Re: [PATCH v2] mm: optimize the redundant loop of mm_update_next_owner()
  2024-06-20 15:27 [PATCH v2] mm: optimize the redundant loop of mm_update_next_owner() alexjlzheng
@ 2024-06-20 17:30 ` Oleg Nesterov
  2024-06-21  8:50   ` Michal Hocko
  2024-06-27  7:44 ` Michal Hocko
  1 sibling, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2024-06-20 17:30 UTC (permalink / raw)
  To: alexjlzheng, Michal Hocko, Eric W. Biederman
  Cc: akpm, brauner, axboe, tandersen, willy, mjguzik, alexjlzheng,
	linux-kernel, linux-mm

Can't review, I forgot everything about mm_update_next_owner().
So I am sorry for the noise I am going to add, feel free to ignore.
Just in case, I see nothing wrong in this patch.

On 06/20, alexjlzheng@gmail.com wrote:
>
> When mm_update_next_owner() races with swapoff (try_to_unuse()), /proc,
> ptrace or page migration (get_task_mm()), the loop cannot find any
> task_struct whose mm_struct matches the target mm_struct, yet it still
> walks the whole task list.
>
> If the above race is combined with the stress-ng-zombie and stress-ng-dup
> tests, such a long loop can easily cause a hard lockup in write_lock_irq()
> for tasklist_lock.
>
> Recognize this situation in advance and exit early.

But this patch won't help if (say) ptrace_access_vm() sleeps while
for_each_process() tries to find another owner, right?
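
For context: such a reader pins the mm only via get_task_mm()/mmget(),
which bumps mm_users without the caller's ->mm ever pointing at it.
Roughly, paraphrasing kernel/fork.c:

	struct mm_struct *get_task_mm(struct task_struct *task)
	{
		struct mm_struct *mm;

		task_lock(task);
		mm = task->mm;
		if (mm) {
			if (task->flags & PF_KTHREAD)
				mm = NULL;
			else
				mmget(mm);	/* mm_users++, no task's ->mm changes */
		}
		task_unlock(task);
		return mm;
	}

While such a temporary reference is held, mm_users stays above 1, so the
proposed break never fires and the loop still walks the whole task list
without finding anyone.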

> @@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
>  	 * Search through everything else, we should not get here often.
>  	 */
>  	for_each_process(g) {
> +		if (atomic_read(&mm->mm_users) <= 1)
> +			break;

I think this deserves a comment to explain that this is an optimization
for the case where we race with a pending mmput(). mm_update_next_owner()
checks mm_users at the start.
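
Something like this, perhaps (the wording is only a suggestion):

	for_each_process(g) {
		/*
		 * The other users did their mmput() while we were
		 * scanning: only the current (exiting/execing) owner's
		 * reference is left, so no task can have ->mm == mm
		 * anymore and searching the rest of the list is pointless.
		 */
		if (atomic_read(&mm->mm_users) <= 1)
			break;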

And. Can we drop tasklist and use rcu_read_lock() before for_each_process?
Yes, this will probably need more changes even if possible...
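
Very roughly, the scan itself could become something like this (just a
sketch; the unlock/retry handling in assign_new_owner would need to be
adapted as well):

	rcu_read_lock();
	for_each_process(g) {
		if (g->flags & PF_KTHREAD)
			continue;
		for_each_thread(g, c) {
			if (c->mm != mm)
				continue;
			/*
			 * Pin the candidate before leaving the RCU section;
			 * assign_new_owner would then take task_lock(c) and
			 * re-check c->mm, as it already does today.
			 */
			get_task_struct(c);
			rcu_read_unlock();
			goto assign_new_owner;
		}
	}
	rcu_read_unlock();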


Or even better. Can't we finally kill mm_update_next_owner() and turn the
ugly mm->owner into mm->mem_cgroup?

Michal, Eric, iirc you had the patch(es) which do this?

Oleg.




* Re: [PATCH v2] mm: optimize the redundant loop of mm_update_next_owner()
  2024-06-20 17:30 ` Oleg Nesterov
@ 2024-06-21  8:50   ` Michal Hocko
  2024-06-25 22:21     ` Andrew Morton
                       ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Michal Hocko @ 2024-06-21  8:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: alexjlzheng, Eric W. Biederman, akpm, brauner, axboe, tandersen,
	willy, mjguzik, alexjlzheng, linux-kernel, linux-mm

On Thu 20-06-24 19:30:19, Oleg Nesterov wrote:
> Can't review, I forgot everything about mm_update_next_owner().
> So I am sorry for the noise I am going to add, feel free to ignore.
> Just in case, I see nothing wrong in this patch.
> 
> On 06/20, alexjlzheng@gmail.com wrote:
> >
> > When mm_update_next_owner() races with swapoff (try_to_unuse()), /proc,
> > ptrace or page migration (get_task_mm()), the loop cannot find any
> > task_struct whose mm_struct matches the target mm_struct, yet it still
> > walks the whole task list.
> >
> > If the above race is combined with the stress-ng-zombie and stress-ng-dup
> > tests, such a long loop can easily cause a hard lockup in write_lock_irq()
> > for tasklist_lock.
> >
> > Recognize this situation in advance and exit early.
> 
> But this patch won't help if (say) ptrace_access_vm() sleeps while
> for_each_process() tries to find another owner, right?
> 
> > @@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
> >  	 * Search through everything else, we should not get here often.
> >  	 */
> >  	for_each_process(g) {
> > +		if (atomic_read(&mm->mm_users) <= 1)
> > +			break;
> 
> I think this deserves a comment to explain that this is an optimization
> for the case where we race with a pending mmput(). mm_update_next_owner()
> checks mm_users at the start.
> 
> And. Can we drop tasklist and use rcu_read_lock() before for_each_process?
> Yes, this will probably need more changes even if possible...
> 
> 
> Or even better. Can't we finally kill mm_update_next_owner() and turn the
> ugly mm->owner into mm->mem_cgroup?

Yes, dropping the mm->owner should be the way to go. Replacing that by
mem_cgroup sounds like an improvement. I have a vague recollection that
this has some traps on the way, e.g. tasks sharing the mm but living in
different cgroups. Things have changed since the last time I checked,
and for example memcg charge migration on task move will be deprecated
soon, so chances are that there are fewer roadblocks on the way.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v2] mm: optimize the redundant loop of mm_update_next_owner()
  2024-06-21  8:50   ` Michal Hocko
@ 2024-06-25 22:21     ` Andrew Morton
  2024-06-26  6:43     ` Jinliang Zheng
  2024-06-26 15:23     ` Oleg Nesterov
  2 siblings, 0 replies; 7+ messages in thread
From: Andrew Morton @ 2024-06-25 22:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleg Nesterov, alexjlzheng, Eric W. Biederman, brauner, axboe,
	tandersen, willy, mjguzik, alexjlzheng, linux-kernel, linux-mm

On Fri, 21 Jun 2024 10:50:10 +0200 Michal Hocko <mhocko@suse.com> wrote:

> On Thu 20-06-24 19:30:19, Oleg Nesterov wrote:
> > Can't review, I forgot everything about mm_update_next_owner().
> > So I am sorry for the noise I am going to add, feel free to ignore.
> > Just in case, I see nothing wrong in this patch.
> > 
> > I think this deserves a comment to explain that this is an optimization
> > for the case where we race with a pending mmput(). mm_update_next_owner()
> > checks mm_users at the start.
> > 
> > And. Can we drop tasklist and use rcu_read_lock() before for_each_process?
> > Yes, this will probably need more changes even if possible...
> > 
> > 
> > Or even better. Can't we finally kill mm_update_next_owner() and turn the
> > ugly mm->owner into mm->mem_cgroup?
> 
> Yes, dropping the mm->owner should be the way to go. Replacing that by
> mem_cgroup sounds like an improvement. I have a vague recollection that
> this has some traps on the way, e.g. tasks sharing the mm but living in
> different cgroups. Things have changed since the last time I checked,
> and for example memcg charge migration on task move will be deprecated
> soon, so chances are that there are fewer roadblocks on the way.

I think this was alexjlzheng's first kernel contribution and as such we
might not be hearing from him(?) again.  And that's OK, thanks for the
bug report - it helps Linux.

Meanwhile we have a stalled patch in mm-unstable.  If someone could add
this issue to their todo list, that would be great.




* Re: [PATCH v2] mm: optimize the redundant loop of mm_update_next_owner()
  2024-06-21  8:50   ` Michal Hocko
  2024-06-25 22:21     ` Andrew Morton
@ 2024-06-26  6:43     ` Jinliang Zheng
  2024-06-26 15:23     ` Oleg Nesterov
  2 siblings, 0 replies; 7+ messages in thread
From: Jinliang Zheng @ 2024-06-26  6:43 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, alexjlzheng, alexjlzheng, axboe, brauner, ebiederm,
	linux-kernel, linux-mm, mjguzik, oleg, tandersen, willy

On Fri, 21 Jun 2024 10:50:10 +0200, Michal Hocko wrote:
> On Thu 20-06-24 19:30:19, Oleg Nesterov wrote:
> > Can't review, I forgot everything about mm_update_next_owner().
> > So I am sorry for the noise I am going to add, feel free to ignore.
> > Just in case, I see nothing wrong in this patch.
> > 
> > On 06/20, alexjlzheng@gmail.com wrote:
> > >
> > > When mm_update_next_owner() races with swapoff (try_to_unuse()), /proc,
> > > ptrace or page migration (get_task_mm()), the loop cannot find any
> > > task_struct whose mm_struct matches the target mm_struct, yet it still
> > > walks the whole task list.
> > >
> > > If the above race is combined with the stress-ng-zombie and stress-ng-dup
> > > tests, such a long loop can easily cause a hard lockup in write_lock_irq()
> > > for tasklist_lock.
> > >
> > > Recognize this situation in advance and exit early.
> > 
> > But this patch won't help if (say) ptrace_access_vm() sleeps while
> > for_each_process() tries to find another owner, right?
> > 
> > > @@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
> > >  	 * Search through everything else, we should not get here often.
> > >  	 */
> > >  	for_each_process(g) {
> > > +		if (atomic_read(&mm->mm_users) <= 1)
> > > +			break;
> > 
> > I think this deserves a comment to explain that this is an optimization
> > for the case where we race with a pending mmput(). mm_update_next_owner()
> > checks mm_users at the start.
> > 
> > And. Can we drop tasklist and use rcu_read_lock() before for_each_process?
> > Yes, this will probably need more changes even if possible...
> > 
> > 
> > Or even better. Can't we finally kill mm_update_next_owner() and turn the
> > ugly mm->owner into mm->mem_cgroup?
> 
> Yes, dropping the mm->owner should be the way to go. Replacing that by
> mem_cgroup sounds like an improvement. I have a vague recollection that

Sorry for the late reply.

Replacing that by mem_cgroup may be a good idea, and using an RCU read lock
looks good too. But until that larger rework is implemented, I recommend
using this patch to alleviate the problem. Both [PATCH] and [PATCH v2] are
acceptable; they differ only in the commit log.

Thanks for your reply. :)
Jinliang Zheng

> this has some traps on the way, e.g. tasks sharing the mm but living in
> different cgroups. Things have changed since the last time I checked,
> and for example memcg charge migration on task move will be deprecated
> soon, so chances are that there are fewer roadblocks on the way.
> -- 
> Michal Hocko
> SUSE Labs



* Re: [PATCH v2] mm: optimize the redundant loop of mm_update_next_owner()
  2024-06-21  8:50   ` Michal Hocko
  2024-06-25 22:21     ` Andrew Morton
  2024-06-26  6:43     ` Jinliang Zheng
@ 2024-06-26 15:23     ` Oleg Nesterov
  2 siblings, 0 replies; 7+ messages in thread
From: Oleg Nesterov @ 2024-06-26 15:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: alexjlzheng, Eric W. Biederman, akpm, brauner, axboe, tandersen,
	willy, mjguzik, alexjlzheng, linux-kernel, linux-mm

On 06/21, Michal Hocko wrote:
>
> On Thu 20-06-24 19:30:19, Oleg Nesterov wrote:
> >
> > Or even better. Can't we finally kill mm_update_next_owner() and turn the
> > ugly mm->owner into mm->mem_cgroup?
>
> Yes, dropping the mm->owner should be the way to go. Replacing that by
> mem_cgroup sounds like an improvement. I have a vague recollection that
> this has some traps on the way, e.g. tasks sharing the mm but living in
> different cgroups. Things have changed since the last time I checked,
> and for example memcg charge migration on task move will be deprecated
> soon, so chances are that there are fewer roadblocks on the way.

OK, thanks...

So if we can't do this right now, can we at least clean it up? To me it looks
just ugly.

We don't need get/put_task_struct. The "retry" logic is obviously suboptimal.
The search in the children/siblings doesn't handle zombie leaders.
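
For reference, those two passes only check ->mm of the group leaders linked
on the children lists; roughly:

	list_for_each_entry(c, &p->children, sibling) {
		if (c->mm == mm)
			goto assign_new_owner;
	}
	list_for_each_entry(c, &p->real_parent->children, sibling) {
		if (c->mm == mm)
			goto assign_new_owner;
	}

A zombie leader sitting on those lists has c->mm == NULL even when its live
sub-threads still use the mm, so its group is never found here and we fall
through to the expensive for_each_process() scan.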

I'll send 2 (hopefully simple) patches in a minute, could you review? I have
no idea how to test them...

Oleg.




* Re: [PATCH v2] mm: optimize the redundant loop of mm_update_next_owner()
  2024-06-20 15:27 [PATCH v2] mm: optimize the redundant loop of mm_update_next_owner() alexjlzheng
  2024-06-20 17:30 ` Oleg Nesterov
@ 2024-06-27  7:44 ` Michal Hocko
  1 sibling, 0 replies; 7+ messages in thread
From: Michal Hocko @ 2024-06-27  7:44 UTC (permalink / raw)
  To: alexjlzheng
  Cc: akpm, brauner, axboe, oleg, tandersen, willy, mjguzik,
	alexjlzheng, linux-kernel, linux-mm

On Thu 20-06-24 23:27:45, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> When mm_update_next_owner() races with swapoff (try_to_unuse()), /proc,
> ptrace or page migration (get_task_mm()), the loop cannot find any
> task_struct whose mm_struct matches the target mm_struct, yet it still
> walks the whole task list.
> 
> If the above race is combined with the stress-ng-zombie and stress-ng-dup
> tests, such a long loop can easily cause a hard lockup in write_lock_irq()
> for tasklist_lock.
> 
> Recognize this situation in advance and exit early.
> 
> Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>

Even if this is not really a full fix, it is a useful stopgap that catches
at least some cases.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> Changelog:
> 
> V2: Fix mm_update_owner_next() to mm_update_next_owner() in comment
> ---
>  kernel/exit.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index f95a2c1338a8..81fcee45d630 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
>  	 * Search through everything else, we should not get here often.
>  	 */
>  	for_each_process(g) {
> +		if (atomic_read(&mm->mm_users) <= 1)
> +			break;
>  		if (g->flags & PF_KTHREAD)
>  			continue;
>  		for_each_thread(g, c) {
> -- 
> 2.39.3
> 

-- 
Michal Hocko
SUSE Labs



Thread overview: 7+ messages
2024-06-20 15:27 [PATCH v2] mm: optimize the redundant loop of mm_update_next_owner() alexjlzheng
2024-06-20 17:30 ` Oleg Nesterov
2024-06-21  8:50   ` Michal Hocko
2024-06-25 22:21     ` Andrew Morton
2024-06-26  6:43     ` Jinliang Zheng
2024-06-26 15:23     ` Oleg Nesterov
2024-06-27  7:44 ` Michal Hocko
