Re: [PATCH 1/2] mm: fix the racy mm->locked

linux-mm.kvack.org archive mirror

* Re: [PATCH 1/2] mm: fix the racy mm->locked_vm change in
       [not found] <20150929182756.GA21740@redhat.com>
@ 2015-10-01  3:01 ` Hugh Dickins
  2015-10-01 14:49   ` Oleg Nesterov
  0 siblings, 1 reply; 3+ messages in thread
From: Hugh Dickins @ 2015-10-01  3:01 UTC
  To: Oleg Nesterov
  Cc: Andrew Morton, Andrey Konovalov, Davidlohr Bueso, Hugh Dickins,
	Kirill A. Shutemov, Sasha Levin, Vlastimil Babka,
	Andrea Arcangeli, Michel Lespinasse, linux-kernel, linux-mm

On Tue, 29 Sep 2015, Oleg Nesterov wrote:

> "mm->locked_vm += grow" and vm_stat_account() in acct_stack_growth()
> are not safe; multiple threads using the same ->mm can do this at the
> same time trying to expans different vma's under down_read(mmap_sem).
                      expand
> This means that one of the "locked_vm += grow" changes can be lost
> and we can miss munlock_vma_pages_all() later.

From the Cc list, I guess you are thinking this might be the fix to
the "Bad page state (mlocked)" issues Andrey and Sasha have reported.

I've not been able to explain those from the direction in which
I was thinking (despite giving it more hours of thought meanwhile),
so I am glad you're looking at it from a very different direction,
and hope you're right with this.
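
The lost update described in the quoted text can be sketched in
userspace. This is only an illustration, not kernel code: struct
mm_model and both function names are hypothetical, and the
interleaving of the two threads is replayed by hand in a single
thread rather than produced by real concurrency.

```c
#include <assert.h>

/* Hypothetical stand-in for the single mm_struct field involved. */
struct mm_model {
	unsigned long locked_vm;
};

/*
 * "mm->locked_vm += grow" is a load, an add and a store.  Replay the
 * interleaving in which both threads load before either one stores.
 */
static unsigned long racy_growth(struct mm_model *mm,
				 unsigned long grow_a, unsigned long grow_b)
{
	unsigned long seen_a = mm->locked_vm;	/* thread A loads */
	unsigned long seen_b = mm->locked_vm;	/* thread B loads */

	mm->locked_vm = seen_a + grow_a;	/* thread A stores */
	mm->locked_vm = seen_b + grow_b;	/* thread B stores; A's update is lost */
	return mm->locked_vm;
}

/*
 * If each update runs with a lock held, the two read-modify-write
 * sequences cannot overlap, so they behave as if run back to back.
 */
static unsigned long serialized_growth(struct mm_model *mm,
				       unsigned long grow_a, unsigned long grow_b)
{
	mm->locked_vm += grow_a;	/* thread A, lock held */
	mm->locked_vm += grow_b;	/* thread B, lock held */
	return mm->locked_vm;
}
```

Starting from locked_vm == 10 with grows of 1 and 2, racy_growth()
ends at 12 while serialized_growth() ends at 13.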

> 
> Move this code into the caller(s) under mm->page_table_lock. All other
> updates to ->locked_vm hold mmap_sem for writing.

So it looks like Andrea and I broke this back in v2.6.7: page_table_lock
was used here before then, and we thought the anon_vma lock was better.

Confession: from that time until today, I thought MAP_GROWSDOWN was
one of those flags (say, like MAP_DENYWRITE) which the kernel accepts
from userspace but ignores; I thought ia64 was the only architecture
on which an mm might contain more than one VM_GROWS* vma (excepting
the case where the original gets split; but surely stack would have
its anon_vma allocated by then, and shared across the split).  It's
only this patch of yours that leads me to calc_vm_flag_bits(), and
to how Michel brought page_table_lock back here to guard vma_gap.

> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Hugh Dickins <hughd@google.com>

with some hesitation.  I don't like very much that the preliminary
mm->locked_vm + grow check is still done without complete locking,
so racing threads could get more locked_vm than they're permitted;
but I'm not sure that we care enough to put page_table_lock back
over all of that (and security_vm_enough_memory wants to have final
say on whether to go ahead); even if it was that way years ago.
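
A rough userspace sketch of that remaining window, under the
assumption that the limit check is a plain "locked_vm + grow <= limit"
comparison: struct mm_model and the function names are hypothetical,
and the two racing threads are again replayed by hand in one thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the single mm_struct field involved. */
struct mm_model {
	unsigned long locked_vm;	/* pages currently locked */
};

/* Preliminary check only: would locked_vm + grow stay within limit? */
static bool may_grow(const struct mm_model *mm, unsigned long grow,
		     unsigned long limit)
{
	return mm->locked_vm + grow <= limit;
}

/*
 * Both threads run the check before either one commits.  Each commit
 * is individually safe, but both were approved against the same value.
 */
static unsigned long racing_check_then_commit(struct mm_model *mm,
					      unsigned long grow,
					      unsigned long limit)
{
	bool ok_a = may_grow(mm, grow, limit);	/* thread A checks */
	bool ok_b = may_grow(mm, grow, limit);	/* thread B checks the same value */

	if (ok_a)
		mm->locked_vm += grow;	/* thread A commits */
	if (ok_b)
		mm->locked_vm += grow;	/* thread B commits as well */
	return mm->locked_vm;
}

/* Check and commit inside one critical section per thread. */
static unsigned long serialized_check_and_commit(struct mm_model *mm,
						 unsigned long grow,
						 unsigned long limit)
{
	if (may_grow(mm, grow, limit))	/* thread A */
		mm->locked_vm += grow;
	if (may_grow(mm, grow, limit))	/* thread B sees A's update */
		mm->locked_vm += grow;
	return mm->locked_vm;
}
```

With locked_vm == 8, grow == 2 and limit == 10, the racing variant
ends at 12 (over the limit) and the serialized variant ends at 10.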

(And if we did care, shouldn't __vm_enough_memory() be using
percpu_counter_compare instead of percpu_counter_read_positive?
but that's a digression.)
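
For readers unfamiliar with the distinction, here is a simplified
userspace model of it, not the kernel implementation: the struct, the
model_* names and the fixed NR_CPUS_MODEL are all hypothetical.  It
assumes a shared count plus per-CPU deltas that have not yet been
folded in, with each delta bounded by a batch size.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS_MODEL 4

/* Simplified model: a shared count plus per-CPU deltas not yet folded in. */
struct pcpu_counter_model {
	long long count;
	long long delta[NR_CPUS_MODEL];
};

/* Cheap read: ignores pending per-CPU deltas, clamps negatives to zero. */
static long long model_read_positive(const struct pcpu_counter_model *c)
{
	return c->count > 0 ? c->count : 0;
}

/* Exact read: folds in every per-CPU delta. */
static long long model_sum(const struct pcpu_counter_model *c)
{
	long long sum = c->count;

	for (int i = 0; i < NR_CPUS_MODEL; i++)
		sum += c->delta[i];
	return sum;
}

/*
 * Compare against rhs.  The cheap count is trusted only when it is
 * further from rhs than the worst-case error (batch per CPU);
 * otherwise fall back to the exact sum.  Returns -1, 0 or 1.
 */
static int model_compare(const struct pcpu_counter_model *c, long long rhs,
			 long long batch)
{
	long long count = c->count;

	if (llabs(count - rhs) <= batch * NR_CPUS_MODEL)
		count = model_sum(c);
	if (count > rhs)
		return 1;
	if (count < rhs)
		return -1;
	return 0;
}
```

With count == 100 and four pending deltas of 5, the cheap read
reports 100, which looks below a limit of 110, while the exact sum is
120 and model_compare() with batch 8 reports that the limit has been
exceeded.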

It would be even nicer if we could kill these expand_stack()
anomalies once and for all, with down_write of mmap_sem here too.
But can't be done without revisiting every architecture's mm/fault.c,
which I have no stomach for at this time, and probably you neither.

Let's accept that your patch is a significant improvement,
and hope that it fixes the "Bad page state (mlocked)".

> ---
>  mm/mmap.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8393580..4efdc37 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2138,10 +2138,6 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns
>  	if (security_vm_enough_memory_mm(mm, grow))
>  		return -ENOMEM;
>  
> -	/* Ok, everything looks good - let it rip */
> -	if (vma->vm_flags & VM_LOCKED)
> -		mm->locked_vm += grow;
> -	vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow);
>  	return 0;
>  }
>  
> @@ -2202,6 +2198,10 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  				 * against concurrent vma expansions.
>  				 */
>  				spin_lock(&vma->vm_mm->page_table_lock);
> +				if (vma->vm_flags & VM_LOCKED)
> +					vma->vm_mm->locked_vm += grow;
> +				vm_stat_account(vma->vm_mm, vma->vm_flags,
> +						vma->vm_file, grow);
>  				anon_vma_interval_tree_pre_update_vma(vma);
>  				vma->vm_end = address;
>  				anon_vma_interval_tree_post_update_vma(vma);
> @@ -2273,6 +2273,10 @@ int expand_downwards(struct vm_area_struct *vma,
>  				 * against concurrent vma expansions.
>  				 */
>  				spin_lock(&vma->vm_mm->page_table_lock);
> +				if (vma->vm_flags & VM_LOCKED)
> +					vma->vm_mm->locked_vm += grow;
> +				vm_stat_account(vma->vm_mm, vma->vm_flags,
> +						vma->vm_file, grow);
>  				anon_vma_interval_tree_pre_update_vma(vma);
>  				vma->vm_start = address;
>  				vma->vm_pgoff -= grow;
> -- 
> 2.4.3
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH 1/2] mm: fix the racy mm->locked_vm change in
  2015-10-01  3:01 ` [PATCH 1/2] mm: fix the racy mm->locked_vm change in Hugh Dickins
@ 2015-10-01 14:49   ` Oleg Nesterov
  2015-10-01 18:34     ` Hugh Dickins
  0 siblings, 1 reply; 3+ messages in thread
From: Oleg Nesterov @ 2015-10-01 14:49 UTC
  To: Hugh Dickins
  Cc: Andrew Morton, Andrey Konovalov, Davidlohr Bueso,
	Kirill A. Shutemov, Sasha Levin, Vlastimil Babka,
	Andrea Arcangeli, Michel Lespinasse, linux-kernel, linux-mm

On 09/30, Hugh Dickins wrote:
>
> On Tue, 29 Sep 2015, Oleg Nesterov wrote:
>
> > "mm->locked_vm += grow" and vm_stat_account() in acct_stack_growth()
> > are not safe; multiple threads using the same ->mm can do this at the
> > same time trying to expans different vma's under down_read(mmap_sem).
>                       expand
> > This means that one of the "locked_vm += grow" changes can be lost
> > and we can miss munlock_vma_pages_all() later.
>
> From the Cc list, I guess you are thinking this might be the fix to
> the "Bad page state (mlocked)" issues Andrey and Sasha have reported.

Yes, I found this when I tried to explain this problem, but I doubt
this change can fix it... Firstly I think it is very unlikely that
trinity hits this race. And even if mm->locked_vm is wrongly equal
to zero in exit_mmap(), it seems that page_remove_rmap() should do
clear_page_mlock(). But I do not understand this code enough. So if
this patch can actually help I would really like to know why ;)

And of course this can not explain other traces which look like
mm->mmap corruption.

> Acked-by: Hugh Dickins <hughd@google.com>

Thanks!

> with some hesitation.  I don't like very much that the preliminary
> mm->locked_vm + grow check is still done without complete locking,
> so racing threads could get more locked_vm than they're permitted;
> but I'm not sure that we care enough to put page_table_lock back
> over all of that (and security_vm_enough_memory wants to have final
> say on whether to go ahead); even if it was that way years ago.

Yes. Plus all these RLIMIT_MEMLOCK/etc and security_* checks assume
that we are going to expand current->mm, but this is not necessarily
true. Debugger or sys_process_vm_* can expand a foreign vma.

Oleg.

* Re: [PATCH 1/2] mm: fix the racy mm->locked_vm change in
  2015-10-01 14:49   ` Oleg Nesterov
@ 2015-10-01 18:34     ` Hugh Dickins
  0 siblings, 0 replies; 3+ messages in thread
From: Hugh Dickins @ 2015-10-01 18:34 UTC
  To: Oleg Nesterov
  Cc: Hugh Dickins, Andrew Morton, Andrey Konovalov, Davidlohr Bueso,
	Kirill A. Shutemov, Sasha Levin, Vlastimil Babka,
	Andrea Arcangeli, Michel Lespinasse, linux-kernel, linux-mm

On Thu, 1 Oct 2015, Oleg Nesterov wrote:
> On 09/30, Hugh Dickins wrote:
> >
> > On Tue, 29 Sep 2015, Oleg Nesterov wrote:
> >
> > > "mm->locked_vm += grow" and vm_stat_account() in acct_stack_growth()
> > > are not safe; multiple threads using the same ->mm can do this at the
> > > same time trying to expans different vma's under down_read(mmap_sem).
> >                       expand
> > > This means that one of the "locked_vm += grow" changes can be lost
> > > and we can miss munlock_vma_pages_all() later.
> >
> > From the Cc list, I guess you are thinking this might be the fix to
> > the "Bad page state (mlocked)" issues Andrey and Sasha have reported.
> 
> Yes, I found this when I tried to explain this problem, but I doubt
> this change can fix it... Firstly I think it is very unlikely that
> trinity hits this race. And even if mm->locked_vm is wrongly equal
> to zero in exit_mmap(), it seems that page_remove_rmap() should do
> clear_page_mlock().

Oh yes, good point, a subsequent clear_page_mlock(), in unmapping
this address space, or later unmapping from another, ought to clear
it before the page ever gets freed.

> But I do not understand this code enough. So if
> this patch can actually help I would really like to know why ;)

I doubt any of us understand it very well: mlock+munlock have
over the years become so much more grotesque than the uninitiated
would expect.

> 
> And of course this can not explain other traces which look like
> mm->mmap corruption.
> 
> > Acked-by: Hugh Dickins <hughd@google.com>
> 
> Thanks!
> 
> > with some hesitation.  I don't like very much that the preliminary
> > mm->locked_vm + grow check is still done without complete locking,
> > so racing threads could get more locked_vm than they're permitted;
> > but I'm not sure that we care enough to put page_table_lock back
> > over all of that (and security_vm_enough_memory wants to have final
> > say on whether to go ahead); even if it was that way years ago.
> 
> Yes. Plus all these RLIMIT_MEMLOCK/etc and security_* checks assume
> that we are going to expand current->mm, but this is not necessarily
> true. Debugger or sys_process_vm_* can expand a foreign vma.

Right, I'd forgotten all about that aspect: yes, none of us ever took
expand_stack()'s "current" assumptions seriously enough to rework its
interface with all the architectures, so that's another argument for
sticking for now with the patch you already have here - thanks.

Hugh
