linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mel Gorman <mel@csn.ul.ie>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Minchan Kim <minchan.kim@gmail.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Christoph Lameter <cl@linux.com>, Rik van Riel <riel@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
Date: Wed, 28 Apr 2010 18:34:17 +0100	[thread overview]
Message-ID: <20100428173416.GJ15815@csn.ul.ie> (raw)
In-Reply-To: <20100428162305.GX510@random.random>

On Wed, Apr 28, 2010 at 06:23:05PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 04:55:58PM +0100, Mel Gorman wrote:
> > Frankly, I don't understand why it was safe to drop the lock either.
> > Maybe it was a mistake but I still haven't convinced myself I fully
> > understand the subtleties of the anon_vma changes.
> 
> I understand the design but I'm also unsure about the details. it's
> just this lock that gets splitted and when you update the
> vm_start/pgoff with only the vma->anon_vma->lock, the vma may be
> queued in multiple other anon_vmas, and you're only serializing
> and safe for the pages that have page_mapcount 1, and point to the
> local anon_vma == vma->anon_vma, not any other shared page.
> 
> The removal of the vma->anon_vma->lock from vma_adjust just seems an
> unrelated mistake to me too, but I don't know for sure why yet.
> Basically vma_adjust needs the anon_vma lock like expand_downards has.
> 

Well, in the easiest case, the details of the VMA (particularly vm_start
and vm_pgoff) can confuse callers of vma_address during rmap_walk. In the
case of migration, it will return other false positives or negatives.

At best I'm fuzzy on the details though.

> After you fix vma_adjust to be as safe as expand_downards you've also
> to take care of the rmap_walk that may run on a page->mapping =
> anon_vma that isn't the vma->anon_vma and you're not taking that
> anon_vma->lock of the shared page, when you change the vma
> vm_pgoff/vm_start.

Is this not what the try-lock-different-vmas-or-backoff-and-retry logic
in patch 2 is doing or am I missing something else?

> If rmap_walk finds to find a pte, becauase of that,
> migrate will crash.
> 
> > Temporarily at least until I figured out if execve was the only problem. The
> 
> For aa.git it is sure enough. And as long as you only see the migrate
> crash in execve it's also sure enough.
> 

I can only remember seeing the execve-related crash but I'd rather the
locking was correct too.

Problem is, I've seen at least one crash due to execve() even with the
check made in try_to_unmap_anon to not migrate within the temporary
stack. I'm not sure how it could have happened.

> > locking in vma_adjust didn't look like the prime reason for the crash
> > but the lack of locking there is still very odd.
> 
> And I think it needs fixing to be safe.
> 
> > 
> > > I agree dropping patch 1 but
> > > to me the having to take all the anon_vma locks for every
> > > vma->anon_vma->lock that we walk seems a must, otherwise
> > > expand_downwards and vma_adjust won't be ok, plus we need to re-add
> > > the anon_vma lock to vma_adjust, it can't be safe to alter vm_pgoff
> > > and vm_start outside of the anon_vma->lock. Or I am mistaken?
> > > 
> > 
> > No, you're not. If nothing else, vma_address can return the wrong value because
> > the VMAs vm_start and vm_pgoff were in the process of being updated but not
> > fully updated. It's hard to see how vma_address would return the wrong value
> > and miss a migration PTE as a result but it's a possibility.  It's probably
> > a lot more important for transparent hugepage support.
> 
> For the rmap_walk itself, migrate and split_huge_page are identical,
> the main problem of transparent hugepage support is that I used the
> anon_vma->lock in a wider way and taken well before the rmap_walk, so
> I'm screwed in a worse way than migrate.
> 

Ok.

> So I may have to change from anon_vma->lock to the compound_lock in
> wait_split_huge_page(). But I'll still need the restarting loop of
> anon_vma locks then inside the two rmap_walk run by split_huge_page.
> Problem is, I would have preferred to do this locking change later as
> a pure optimization than as a requirement for merging and running
> stable, as it'll make things slightly more complex.
> 
> BTW, if we were to share the lock across all anon_vmas as I mentioned
> in prev email, and just reduce the chain length, then it'd solve all
> issues for rmap_walk in migrate and also for THP completely.
> 

It might be where we end up eventually. I'm simply loathe to introduce
another set of rules to anon_vma locking if it can be at all avoided.

> > > Patch 2 wouldn't help the swapops crash we reproduced because at that
> > > point the anon_vma of the stack is the local one, it's just after
> > > execve.
> > > 
> > > vma_adjust and expand_downards would alter vm_pgoff and vm_start while
> > > taking only the vma->anon_vma->lock where the vma->anon_vma is the
> > > _local_ one of the vma. 
> > 
> > True, although in the case of expand_downwards, it's highly unlikely that
> > there is also a migration PTE to be cleaned up. It's hard to see how a
> > migration PTE would be left behind in that case but it still looks wrong to
> > be updating the VMA fields without locking.
> 
> Every time we fail to find the PTE, it can also mean try_to_unmap just
> failed to instantiate the migration pte leading to random memory
> corruption in migrate.

How so? The old PTE should have been left in place, the page count of
the page remain positive and migration not occur.

> If a task fork and the some of the stack pages
> at the bottom of the stack are shared, but the top of the stack isn't
> shared (so the vma->anon_vma->lock only protects the top and not the
> bottom) migrate should be able to silently random corrupt memory right
> now because of this.
> 
> > > But a vma in mainline can be indexed in
> > > infinite anon_vmas, so to prevent breaking migration
> > > vma_adjust/expand_downards the rmap_walk would need to take _all_
> > > anon_vma->locks for every anon_vma that the vma is indexed into.
> > 
> > I felt this would be too heavy in the common case which is why I made
> > rmap_walk() do the try-lock-or-back-off instead because rmap_walk is typically
> > in far less critical paths.
> 
> If we could take all locks, it'd make life easier as it'd already
> implement the "shared lock" but without sharing it. It won't provide
> much runtime benefit though (just rmap_walk will be even slower than
> real shared lock, and vma_adjust/expand_downards will be slightly faster).
> 

Because the list could be very large, it would make more sense to
introduce the shared lock if this is what was required.

> > > Or
> > > alternatively like you implemented rmap_walk would need to check if
> > > the vma we found in the rmap_walk is different from the original
> > > anon_vma and to take the vma->anon_vma->lock (so taking the
> > > anon_vma->lock of the local anon_vma of every vma) before it can
> > > actually read the vma->vm_pgoff/vm_start inside vma_address.
> > > 
> > 
> > To be absolutly sure, yes this is required. I don't think we've been hitting
> > this exact problem in these tests but it still is a better plan than adjusting
> > VMA details without locks.
> 
> We've not been hitting it unless somebody crashed with random
> corruption.
> 

Not that I've seen. Still just the crashes within execve.

> > > If the above is right it also means the new anon-vma changes also break
> > > the whole locking of transparent hugepage, see wait_split_huge_page,
> > > it does a spin_unlock_wait(&anon_vma->lock) thinking that waiting the
> > > "local" anon-vma is enough, when in fact the hugepage may be shared
> > > and belonging to the parent parent_vma->anon_vma and not to the local
> > > one of the last child that is waiting on the wrong lock. So I may have
> > > to rewrite this part of the thp locking to solve this. And for me it's
> > > not enough to just taking more locks inside the rmap walks inside
> > > split_huge_page as I used the anon_vma lock outside too.
> > > 
> > 
> > No fun. That potentially could be a lot of locks to take to split the
> > page.
> 
> compound_lock should be able to replace it in a more granular way, but
> it's not exactly the time I was looking to apply scalar optimization
> to THP as it may introduce other issues I can't foresee right now.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2010-04-28 17:34 UTC|newest]

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-04-27 21:30 [PATCH 0/3] Fix migration races in rmap_walk() V2 Mel Gorman
2010-04-27 21:30 ` [PATCH 1/3] mm,migration: During fork(), wait for migration to end if migration PTE is encountered Mel Gorman
2010-04-27 22:22   ` Andrea Arcangeli
2010-04-27 23:52     ` KAMEZAWA Hiroyuki
2010-04-28  0:18       ` Andrea Arcangeli
2010-04-28  0:19         ` Andrea Arcangeli
2010-04-28  0:28           ` KAMEZAWA Hiroyuki
2010-04-28  0:59             ` Andrea Arcangeli
2010-04-28  8:24       ` Mel Gorman
2010-04-27 21:30 ` [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
2010-04-27 23:10   ` Andrea Arcangeli
2010-04-28  9:15     ` Mel Gorman
2010-04-28 15:35       ` Andrea Arcangeli
2010-04-28 15:39         ` Andrea Arcangeli
2010-04-28 15:55         ` Mel Gorman
2010-04-28 16:23           ` Andrea Arcangeli
2010-04-28 17:34             ` Mel Gorman [this message]
2010-04-28 17:58               ` Andrea Arcangeli
2010-04-28 17:47             ` [RFC PATCH] take all anon_vma locks in anon_vma_lock Rik van Riel
2010-04-28 18:03               ` Andrea Arcangeli
2010-04-28 18:09                 ` Rik van Riel
2010-04-28 18:25               ` [RFC PATCH -v2] " Rik van Riel
2010-04-28 19:07                 ` Mel Gorman
2010-04-28 20:17                 ` [RFC PATCH -v3] " Rik van Riel
2010-04-28 20:57                   ` Rik van Riel
2010-04-29  0:28                     ` Minchan Kim
2010-04-29  2:10                       ` Rik van Riel
2010-04-29  2:55                         ` Minchan Kim
2010-04-29  6:42                           ` Minchan Kim
2010-04-29 15:39                           ` Rik van Riel
2010-04-29  7:37                       ` Mel Gorman
2010-04-29  8:15                     ` Mel Gorman
2010-04-29  8:32                       ` Minchan Kim
2010-04-29  8:44                         ` Mel Gorman
2010-04-27 21:30 ` [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved Mel Gorman
2010-04-27 22:30   ` Andrea Arcangeli
2010-04-27 22:58     ` Andrea Arcangeli
2010-04-28  0:39       ` KAMEZAWA Hiroyuki
2010-04-28  1:05         ` Andrea Arcangeli
2010-04-28  1:09           ` Andrea Arcangeli
2010-04-28  1:18           ` KAMEZAWA Hiroyuki
2010-04-28  1:36             ` Andrea Arcangeli
2010-04-28  1:29       ` KAMEZAWA Hiroyuki
2010-04-28  1:44         ` Andrea Arcangeli
2010-04-28  2:12           ` KAMEZAWA Hiroyuki
2010-04-28  2:42             ` Andrea Arcangeli
2010-04-28  2:49               ` KAMEZAWA Hiroyuki
2010-04-28  7:28                 ` KAMEZAWA Hiroyuki
2010-04-28 10:48                   ` Mel Gorman
2010-04-28  0:03   ` KAMEZAWA Hiroyuki
2010-04-28  0:08     ` Andrea Arcangeli
2010-04-28  0:36       ` KAMEZAWA Hiroyuki
2010-04-28  8:30   ` KAMEZAWA Hiroyuki
2010-04-28 14:46     ` Andrea Arcangeli
2010-04-27 22:27 ` [PATCH 0/3] Fix migration races in rmap_walk() V2 Christoph Lameter
2010-04-27 22:32   ` Andrea Arcangeli
2010-04-28  0:13     ` KAMEZAWA Hiroyuki
2010-04-28  0:20       ` Andrea Arcangeli
2010-04-28 14:23         ` Mel Gorman
2010-04-28 14:57           ` Mel Gorman
2010-04-28 15:16             ` Andrea Arcangeli
2010-04-28 15:23               ` Mel Gorman
2010-04-28 15:45                 ` Andrea Arcangeli
2010-04-28 20:40                   ` Andrea Arcangeli
2010-04-28 21:05                     ` Andrea Arcangeli
2010-04-28  9:17     ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100428173416.GJ15815@csn.ul.ie \
    --to=mel@csn.ul.ie \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=cl@linux.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minchan.kim@gmail.com \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox