From: Mel Gorman <mel@csn.ul.ie>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Linux-MM <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>,
Minchan Kim <minchan.kim@gmail.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Christoph Lameter <cl@linux.com>, Rik van Riel <riel@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
Date: Wed, 28 Apr 2010 18:34:17 +0100 [thread overview]
Message-ID: <20100428173416.GJ15815@csn.ul.ie> (raw)
In-Reply-To: <20100428162305.GX510@random.random>
On Wed, Apr 28, 2010 at 06:23:05PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 04:55:58PM +0100, Mel Gorman wrote:
> > Frankly, I don't understand why it was safe to drop the lock either.
> > Maybe it was a mistake but I still haven't convinced myself I fully
> > understand the subtleties of the anon_vma changes.
>
> I understand the design but I'm also unsure about the details. it's
> just this lock that gets splitted and when you update the
> vm_start/pgoff with only the vma->anon_vma->lock, the vma may be
> queued in multiple other anon_vmas, and you're only serializing
> and safe for the pages that have page_mapcount 1, and point to the
> local anon_vma == vma->anon_vma, not any other shared page.
>
> The removal of the vma->anon_vma->lock from vma_adjust just seems an
> unrelated mistake to me too, but I don't know for sure why yet.
> Basically vma_adjust needs the anon_vma lock like expand_downards has.
>
Well, in the easiest case, the details of the VMA (particularly vm_start
and vm_pgoff) can confuse callers of vma_address during rmap_walk. In the
case of migration, it will return other false positives or negatives.
At best I'm fuzzy on the details though.
> After you fix vma_adjust to be as safe as expand_downards you've also
> to take care of the rmap_walk that may run on a page->mapping =
> anon_vma that isn't the vma->anon_vma and you're not taking that
> anon_vma->lock of the shared page, when you change the vma
> vm_pgoff/vm_start.
Is this not what the try-lock-different-vmas-or-backoff-and-retry logic
in patch 2 is doing or am I missing something else?
> If rmap_walk finds to find a pte, becauase of that,
> migrate will crash.
>
> > Temporarily at least until I figured out if execve was the only problem. The
>
> For aa.git it is sure enough. And as long as you only see the migrate
> crash in execve it's also sure enough.
>
I can only remember seeing the execve-related crash but I'd rather the
locking was correct too.
Problem is, I've seen at least one crash due to execve() even with the
check made in try_to_unmap_anon to not migrate within the temporary
stack. I'm not sure how it could have happened.
> > locking in vma_adjust didn't look like the prime reason for the crash
> > but the lack of locking there is still very odd.
>
> And I think it needs fixing to be safe.
>
> >
> > > I agree dropping patch 1 but
> > > to me the having to take all the anon_vma locks for every
> > > vma->anon_vma->lock that we walk seems a must, otherwise
> > > expand_downwards and vma_adjust won't be ok, plus we need to re-add
> > > the anon_vma lock to vma_adjust, it can't be safe to alter vm_pgoff
> > > and vm_start outside of the anon_vma->lock. Or I am mistaken?
> > >
> >
> > No, you're not. If nothing else, vma_address can return the wrong value because
> > the VMAs vm_start and vm_pgoff were in the process of being updated but not
> > fully updated. It's hard to see how vma_address would return the wrong value
> > and miss a migration PTE as a result but it's a possibility. It's probably
> > a lot more important for transparent hugepage support.
>
> For the rmap_walk itself, migrate and split_huge_page are identical,
> the main problem of transparent hugepage support is that I used the
> anon_vma->lock in a wider way and taken well before the rmap_walk, so
> I'm screwed in a worse way than migrate.
>
Ok.
> So I may have to change from anon_vma->lock to the compound_lock in
> wait_split_huge_page(). But I'll still need the restarting loop of
> anon_vma locks then inside the two rmap_walk run by split_huge_page.
> Problem is, I would have preferred to do this locking change later as
> a pure optimization than as a requirement for merging and running
> stable, as it'll make things slightly more complex.
>
> BTW, if we were to share the lock across all anon_vmas as I mentioned
> in prev email, and just reduce the chain length, then it'd solve all
> issues for rmap_walk in migrate and also for THP completely.
>
It might be where we end up eventually. I'm simply loathe to introduce
another set of rules to anon_vma locking if it can be at all avoided.
> > > Patch 2 wouldn't help the swapops crash we reproduced because at that
> > > point the anon_vma of the stack is the local one, it's just after
> > > execve.
> > >
> > > vma_adjust and expand_downards would alter vm_pgoff and vm_start while
> > > taking only the vma->anon_vma->lock where the vma->anon_vma is the
> > > _local_ one of the vma.
> >
> > True, although in the case of expand_downwards, it's highly unlikely that
> > there is also a migration PTE to be cleaned up. It's hard to see how a
> > migration PTE would be left behind in that case but it still looks wrong to
> > be updating the VMA fields without locking.
>
> Every time we fail to find the PTE, it can also mean try_to_unmap just
> failed to instantiate the migration pte leading to random memory
> corruption in migrate.
How so? The old PTE should have been left in place, the page count of
the page remain positive and migration not occur.
> If a task fork and the some of the stack pages
> at the bottom of the stack are shared, but the top of the stack isn't
> shared (so the vma->anon_vma->lock only protects the top and not the
> bottom) migrate should be able to silently random corrupt memory right
> now because of this.
>
> > > But a vma in mainline can be indexed in
> > > infinite anon_vmas, so to prevent breaking migration
> > > vma_adjust/expand_downards the rmap_walk would need to take _all_
> > > anon_vma->locks for every anon_vma that the vma is indexed into.
> >
> > I felt this would be too heavy in the common case which is why I made
> > rmap_walk() do the try-lock-or-back-off instead because rmap_walk is typically
> > in far less critical paths.
>
> If we could take all locks, it'd make life easier as it'd already
> implement the "shared lock" but without sharing it. It won't provide
> much runtime benefit though (just rmap_walk will be even slower than
> real shared lock, and vma_adjust/expand_downards will be slightly faster).
>
Because the list could be very large, it would make more sense to
introduce the shared lock if this is what was required.
> > > Or
> > > alternatively like you implemented rmap_walk would need to check if
> > > the vma we found in the rmap_walk is different from the original
> > > anon_vma and to take the vma->anon_vma->lock (so taking the
> > > anon_vma->lock of the local anon_vma of every vma) before it can
> > > actually read the vma->vm_pgoff/vm_start inside vma_address.
> > >
> >
> > To be absolutly sure, yes this is required. I don't think we've been hitting
> > this exact problem in these tests but it still is a better plan than adjusting
> > VMA details without locks.
>
> We've not been hitting it unless somebody crashed with random
> corruption.
>
Not that I've seen. Still just the crashes within execve.
> > > If the above is right it also means the new anon-vma changes also break
> > > the whole locking of transparent hugepage, see wait_split_huge_page,
> > > it does a spin_unlock_wait(&anon_vma->lock) thinking that waiting the
> > > "local" anon-vma is enough, when in fact the hugepage may be shared
> > > and belonging to the parent parent_vma->anon_vma and not to the local
> > > one of the last child that is waiting on the wrong lock. So I may have
> > > to rewrite this part of the thp locking to solve this. And for me it's
> > > not enough to just taking more locks inside the rmap walks inside
> > > split_huge_page as I used the anon_vma lock outside too.
> > >
> >
> > No fun. That potentially could be a lot of locks to take to split the
> > page.
>
> compound_lock should be able to replace it in a more granular way, but
> it's not exactly the time I was looking to apply scalar optimization
> to THP as it may introduce other issues I can't foresee right now.
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2010-04-28 17:34 UTC|newest]
Thread overview: 66+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-04-27 21:30 [PATCH 0/3] Fix migration races in rmap_walk() V2 Mel Gorman
2010-04-27 21:30 ` [PATCH 1/3] mm,migration: During fork(), wait for migration to end if migration PTE is encountered Mel Gorman
2010-04-27 22:22 ` Andrea Arcangeli
2010-04-27 23:52 ` KAMEZAWA Hiroyuki
2010-04-28 0:18 ` Andrea Arcangeli
2010-04-28 0:19 ` Andrea Arcangeli
2010-04-28 0:28 ` KAMEZAWA Hiroyuki
2010-04-28 0:59 ` Andrea Arcangeli
2010-04-28 8:24 ` Mel Gorman
2010-04-27 21:30 ` [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
2010-04-27 23:10 ` Andrea Arcangeli
2010-04-28 9:15 ` Mel Gorman
2010-04-28 15:35 ` Andrea Arcangeli
2010-04-28 15:39 ` Andrea Arcangeli
2010-04-28 15:55 ` Mel Gorman
2010-04-28 16:23 ` Andrea Arcangeli
2010-04-28 17:34 ` Mel Gorman [this message]
2010-04-28 17:58 ` Andrea Arcangeli
2010-04-28 17:47 ` [RFC PATCH] take all anon_vma locks in anon_vma_lock Rik van Riel
2010-04-28 18:03 ` Andrea Arcangeli
2010-04-28 18:09 ` Rik van Riel
2010-04-28 18:25 ` [RFC PATCH -v2] " Rik van Riel
2010-04-28 19:07 ` Mel Gorman
2010-04-28 20:17 ` [RFC PATCH -v3] " Rik van Riel
2010-04-28 20:57 ` Rik van Riel
2010-04-29 0:28 ` Minchan Kim
2010-04-29 2:10 ` Rik van Riel
2010-04-29 2:55 ` Minchan Kim
2010-04-29 6:42 ` Minchan Kim
2010-04-29 15:39 ` Rik van Riel
2010-04-29 7:37 ` Mel Gorman
2010-04-29 8:15 ` Mel Gorman
2010-04-29 8:32 ` Minchan Kim
2010-04-29 8:44 ` Mel Gorman
2010-04-27 21:30 ` [PATCH 3/3] mm,migration: Remove straggling migration PTEs when page tables are being moved after the VMA has already moved Mel Gorman
2010-04-27 22:30 ` Andrea Arcangeli
2010-04-27 22:58 ` Andrea Arcangeli
2010-04-28 0:39 ` KAMEZAWA Hiroyuki
2010-04-28 1:05 ` Andrea Arcangeli
2010-04-28 1:09 ` Andrea Arcangeli
2010-04-28 1:18 ` KAMEZAWA Hiroyuki
2010-04-28 1:36 ` Andrea Arcangeli
2010-04-28 1:29 ` KAMEZAWA Hiroyuki
2010-04-28 1:44 ` Andrea Arcangeli
2010-04-28 2:12 ` KAMEZAWA Hiroyuki
2010-04-28 2:42 ` Andrea Arcangeli
2010-04-28 2:49 ` KAMEZAWA Hiroyuki
2010-04-28 7:28 ` KAMEZAWA Hiroyuki
2010-04-28 10:48 ` Mel Gorman
2010-04-28 0:03 ` KAMEZAWA Hiroyuki
2010-04-28 0:08 ` Andrea Arcangeli
2010-04-28 0:36 ` KAMEZAWA Hiroyuki
2010-04-28 8:30 ` KAMEZAWA Hiroyuki
2010-04-28 14:46 ` Andrea Arcangeli
2010-04-27 22:27 ` [PATCH 0/3] Fix migration races in rmap_walk() V2 Christoph Lameter
2010-04-27 22:32 ` Andrea Arcangeli
2010-04-28 0:13 ` KAMEZAWA Hiroyuki
2010-04-28 0:20 ` Andrea Arcangeli
2010-04-28 14:23 ` Mel Gorman
2010-04-28 14:57 ` Mel Gorman
2010-04-28 15:16 ` Andrea Arcangeli
2010-04-28 15:23 ` Mel Gorman
2010-04-28 15:45 ` Andrea Arcangeli
2010-04-28 20:40 ` Andrea Arcangeli
2010-04-28 21:05 ` Andrea Arcangeli
2010-04-28 9:17 ` Mel Gorman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100428173416.GJ15815@csn.ul.ie \
--to=mel@csn.ul.ie \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=cl@linux.com \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=minchan.kim@gmail.com \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox