From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux-foundation.org>,
Nick Piggin <nickpiggin@yahoo.com.au>,
Linus Torvalds <torvalds@linux-foundation.org>,
hugh <hugh@veritas.com>,
Paul E McKenney <paulmck@linux.vnet.ibm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Ingo Molnar <mingo@elte.hu>, linux-mm <linux-mm@kvack.org>
Subject: Re: [RFC] another crazy idea to get rid of mmap_sem in faults
Date: Mon, 01 Dec 2008 10:27:18 -0500
Message-ID: <1228145238.18834.29.camel@lts-notebook>
In-Reply-To: <1228142895.7140.43.camel@twins>

On Mon, 2008-12-01 at 15:48 +0100, Peter Zijlstra wrote:
> On Mon, 2008-12-01 at 08:00 -0600, Christoph Lameter wrote:
> > On Fri, 28 Nov 2008, Peter Zijlstra wrote:
> >
> > > Pagefault concurrency with mmap() is undefined at best (any sane
> > > application will start using memory after it's been mmap'ed and stop
> > > using it before it unmaps it).
> >
> > mmap_sem in pagefaults is mainly used to serialize various
> > modifications to the address space structures while faults are processed.
>
> Well, yeah, mainly the vmas, right? We need to ensure the vma we fault
> into doesn't go away or change under us.
>
> Does mmap_sem protect more than the vma's and their RB tree?
>
> > This is of course all mmap related but stuff like forking can
> > occur concurrently in a multithreaded application. The COW mechanism is
> > tied up with this too.
>
> Hohumm, fork().. good point.
>
> I'll ponder the COW stuff, but since that too holds on to the PTL I
> think we're good.
>
> > > If we do not freeze the vm map like we normally do but use a lockless
> > > vma lookup we're left with the unmap race (you're unlikely to find the
> > > vma before insertion anyway).
> >
> > Then you will need to use RCU for the vmas in general. We already use
> > RCU for the anonymous vma. Extend that to all vmas?
>
> RCU cannot be used, since we need to be able to sleep in a fault in
> order to do IO. So we need to extend SRCU - but since we have all the
> experience of preemptible RCU to draw from I think that should not be an
> issue.
>
> > > I think we can close that race by marking a vma 'dead' before we do the
> > > pte unmap, this means that once we have the pte lock in the fault
> > > handler we can validate the vma (it cannot go away after all, because
> > > the unmap will block on it).
> >
> > The anonymous VMAs already have refcounts and vm_area_struct also for the
> > !MM case. So maybe you could get to the notion of a "dead" vma easily.
>
> !MMU case?, yes I was thinking to abuse that ;-)
>
> > > Therefore, we can do the fault optimistically with any sane vma we get
> > > until the point we want to insert the PTE, at which point we have to
> > > take the PTL and validate the vma is still good.
> >
> > How would this sync with other operations that need to take mmap_sem?
>
> What other ops? mmap/munmap/mremap/madvise etc.?
>
> The idea is that any change to a vma (which would require exclusive
> mmap_sem) will replace the vma - marking the old one dead and SRCU free
> it.
mlock(), mprotect() and mbind() [others?] can also split/merge vmas
under exclusive mmap_sem to accommodate the changed attributes.
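
To make the split/merge point concrete, here is a minimal userspace sketch,
assuming Linux with /proc mounted. It maps three anonymous pages, changes the
protection of the middle page with mprotect(), and counts how many
/proc/self/maps entries overlap the range before and after. The helper names
count_vmas() and vma_split_demo() are made up for this illustration and are
not kernel or libc interfaces.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count /proc/self/maps entries overlapping [start, end); -1 on error. */
static int count_vmas(uintptr_t start, uintptr_t end)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char line[512];
	int n = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		unsigned long lo, hi;

		/* Each entry starts with "start-end" in hex. */
		if (sscanf(line, "%lx-%lx", &lo, &hi) == 2 &&
		    lo < end && hi > start)
			n++;
	}
	fclose(f);
	return n;
}

/*
 * counts[0]: entries overlapping the fresh three-page mapping
 * counts[1]: after making the middle page read-only (split expected)
 * counts[2]: after restoring read-write (merge expected)
 * Returns 0 on success, -1 on failure.
 */
static int vma_split_demo(int counts[3])
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len;
	uintptr_t start, end;
	char *p;

	assert(page > 0);
	len = 3 * (size_t)page;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	start = (uintptr_t)p;
	end = start + len;

	counts[0] = count_vmas(start, end);
	if (mprotect(p + page, (size_t)page, PROT_READ) != 0)
		goto fail;
	counts[1] = count_vmas(start, end);
	if (mprotect(p + page, (size_t)page, PROT_READ | PROT_WRITE) != 0)
		goto fail;
	counts[2] = count_vmas(start, end);

	munmap(p, len);
	return 0;
fail:
	munmap(p, len);
	return -1;
}
```

Calling vma_split_demo() from a small main() and printing the three counts
should show one entry growing to three after the first mprotect(). Whether
the pieces merge back into a single entry afterwards is up to the kernel's
merge logic, so the sketch only reports that count.
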
>
> All non-exclusive users can already handle others.
>
> Stuff like merge/split is interesting because that might invalidate a
> vma while the fault stays valid.
>
> This means we have to have a more complex vma validation, something
> along the lines of:
>
> /*
>  * Finds a valid vma
>  */
> struct vm_area_struct *find_vma(mm, addr)
> {
> again:
> 	rcu_read_lock(); /* solely for the lookup structure */
> 	vma = tree_lookup(&mm->vma_tree, addr); /* vma is srcu guarded */
> 	rcu_read_unlock();
> 	if (vma && vma_is_dead(vma))
> 		goto again;
>
> 	return vma;
> }
>
> /*
>  * validates if a previously obtained vma is still valid,
>  * synchronizes with vma against PTL.
>  */
> int validate_vma(mm, vma, addr)
> {
> 	ASSERT_PTL_LOCKED(mm, addr);
>
> 	if (!vma_is_dead(vma))
> 		return 1;
>
> 	vma2 = find_vma(mm, addr);
>
> 	if (/*old vma fault is still valid in vma2*/)
> 		return 1;
>
> 	return 0;
> }
>
> Merge:
>
>   mark both vmas dead
>   grow the left to cover both
>   remove the right
>   replace the left with a new alive one
>
>   (Munge PTEs)
>
> Split:
>
>   mark the vma dead
>   insert the new fragment (always the right-most)
>   replace the left with a new smaller.
>
>   (Munge PTEs)
>
> Where we basically use the re-try in the lookup to wait for a valid vma
> to appear while never having the lookup return NULL (which would make
> the fault fail and sigbus).
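
A minimal single-threaded userspace model of the split case quoted above may
help when reasoning about which faults stay valid. Everything in it is an
assumption made for illustration: struct vma, struct mm, the prot field, and
the helpers are simplified stand-ins, not the kernel's definitions. In
particular, "old vma fault is still valid" is modelled here as "a live vma
covers the address with the same prot", which is only one possible reading of
the placeholder in validate_vma(). Dead vmas stay in the pool and are never
reused, standing in for the deferred (SRCU) free. There is no retry loop,
since nothing runs concurrently in this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for vm_area_struct. */
struct vma {
	unsigned long start, end;	/* covers [start, end) */
	int prot;			/* stand-in for the vma attributes */
	bool dead;			/* set before the vma is replaced */
};

#define MAX_VMAS 16

/* Stand-in for the address space: a flat pool instead of a tree. */
struct mm {
	struct vma pool[MAX_VMAS];
	size_t used;
};

static struct vma *vma_new(struct mm *mm, unsigned long start,
			   unsigned long end, int prot)
{
	struct vma *v;

	assert(mm->used < MAX_VMAS);
	v = &mm->pool[mm->used++];
	v->start = start;
	v->end = end;
	v->prot = prot;
	v->dead = false;
	return v;
}

/* Lookup that only ever returns a live vma, or NULL if none covers addr. */
static struct vma *find_live_vma(struct mm *mm, unsigned long addr)
{
	for (size_t i = 0; i < mm->used; i++) {
		struct vma *v = &mm->pool[i];

		if (!v->dead && v->start <= addr && addr < v->end)
			return v;
	}
	return NULL;
}

/* In the quoted scheme this runs with the PTL held; no lock is modelled. */
static bool validate_vma(struct mm *mm, const struct vma *old,
			 unsigned long addr)
{
	const struct vma *cur;

	if (!old->dead)
		return true;
	cur = find_live_vma(mm, addr);
	/* Assumed meaning of "still valid": same attributes at addr. */
	return cur && cur->prot == old->prot;
}

/* Split: mark dead, insert the right fragment, replace the left. */
static void split_vma(struct mm *mm, struct vma *old, unsigned long at,
		      int right_prot)
{
	assert(!old->dead && old->start < at && at < old->end);
	old->dead = true;
	vma_new(mm, at, old->end, right_prot);
	vma_new(mm, old->start, at, old->prot);
}
```

With this reading, a fault that looked up the vma before an mprotect()-style
split still validates if its address lands in the unchanged left part, and
has to be retried if it lands in the fragment whose attributes changed.
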
>
> > > I'm sure there are many fun details to work out, even if the above idea
> > > is found solid, amongst them is extending srcu to provide call_srcu(),
> > > and implement an RCU friendly tree structure.
> >
> > srcu may have too much of an overhead for this.
>
> Then we need to fix that ;-) But surely SRCU is cheaper than mmap_sem.
>
> > > [ hmm, while writing this it occurred to me this might mean we have to
> > > srcu free the page table pages :/ ]
> >
> > The page tables cannot be immediately reused then (quicklists etc).
>
> I think I was wrong there, we don't do speculative PTE locks, so we
> should be good here.
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .