Re: [RFC] another crazy idea to get rid of mmap_sem in faults


From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux-foundation.org>,
	Nick Piggin <nickpiggin@yahoo.com.au>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	hugh <hugh@veritas.com>,
	Paul E McKenney <paulmck@linux.vnet.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Ingo Molnar <mingo@elte.hu>, linux-mm <linux-mm@kvack.org>
Subject: Re: [RFC] another crazy idea to get rid of mmap_sem in faults
Date: Mon, 01 Dec 2008 10:27:18 -0500
Message-ID: <1228145238.18834.29.camel@lts-notebook>
In-Reply-To: <1228142895.7140.43.camel@twins>

On Mon, 2008-12-01 at 15:48 +0100, Peter Zijlstra wrote:
> On Mon, 2008-12-01 at 08:00 -0600, Christoph Lameter wrote:
> > On Fri, 28 Nov 2008, Peter Zijlstra wrote:
> > 
> > > Pagefault concurrency with mmap() is undefined at best (any sane
> > > application will start using memory after it's been mmap'ed and stop
> > > using it before it unmaps it).
> > 
> > mmap_sem in pagefaults is mainly used to serialize various
> > modifications to the address space structures while faults are processed.
> 
> Well, yeah, mainly the vmas, right? We need to ensure the vma we fault
> into doesn't go away or change under us.
> 
> Does mmap_sem protect more than the vmas and their RB tree?
> 
> > This is of course all mmap related but stuff like forking can
> > occur concurrently in a multithreaded application. The COW mechanism is
> > tied up with this too.
> 
> Hohumm, fork().. good point.
> 
> I'll ponder the COW stuff, but since that too holds on to the PTL I
> think we're good.
> 
> > > If we do not freeze the vm map like we normally do but use a lockless
> > > vma lookup we're left with the unmap race (you're unlikely to find the
> > > vma before insertion anyway).
> > 
> > Then you will need to use RCU for the vmas in general. We already use
> > RCU for the anonymous vma. Extend that to all vmas?
> 
> RCU cannot be used, since we need to be able to sleep in a fault in
> order to do IO. So we need to extend SRCU - but since we have all the
> experience of preemptible RCU to draw from I think that should not be an
> issue.
> 
> > > I think we can close that race by marking a vma 'dead' before we do the
> > > pte unmap, this means that once we have the pte lock in the fault
> > > handler we can validate the vma (it cannot go away after all, because
> > > the unmap will block on it).
> > 
> > The anonymous VMAs already have refcounts and vm_area_struct also for the
> > !MM case. So maybe you could get to the notion of a "dead" vma easily.
> 
> !MMU case?, yes I was thinking to abuse that ;-)
> 
> > > Therefore, we can do the fault optimistically with any sane vma we get
> > > until the point we want to insert the PTE, at which point we have to
> > > take the PTL and validate the vma is still good.
> > 
> > How would this sync with other operations that need to take mmap_sem?
> 
> What other ops? mmap/munmap/mremap/madvise etc.?
> 
> The idea is that any change to a vma (which would require exclusive
> mmap_sem) will replace the vma - marking the old one dead and SRCU free
> it.

mlock(), mprotect() and mbind() [others?] can also split/merge vmas
under exclusive mmap_sem to accommodate the changed attributes.
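
To make the split case concrete, here is a minimal user-space sketch of an
attribute change on a sub-range, done in the "replace, don't modify" style
described above. Everything in it is an assumption for illustration:
struct toy_vma, its prot field and toy_change_prot() are hypothetical
stand-ins, not kernel types or functions. It only shows that one old vma
can turn into up to three live replacements while the old one is marked
dead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for vm_area_struct; not the kernel type. */
struct toy_vma {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
	unsigned long prot;	/* stand-in for the attribute being changed */
	bool dead;		/* set once the vma has been replaced */
};

/*
 * Apply new_prot to [start, end) inside *old. Rather than editing *old in
 * place, mark it dead and write up to three live replacements into out[].
 * Returns the number of replacements written, or 0 if the range does not
 * fit inside *old (in which case *old is left untouched).
 */
static size_t toy_change_prot(struct toy_vma *old, unsigned long start,
			      unsigned long end, unsigned long new_prot,
			      struct toy_vma out[3])
{
	size_t n = 0;

	assert(old != NULL && !old->dead);

	if (start >= end || start < old->start || end > old->end)
		return 0;

	old->dead = true;	/* lockless readers must now look up again */

	if (old->start < start)	/* untouched head keeps the old attribute */
		out[n++] = (struct toy_vma){ old->start, start, old->prot, false };

	out[n++] = (struct toy_vma){ start, end, new_prot, false };

	if (end < old->end)	/* untouched tail keeps the old attribute */
		out[n++] = (struct toy_vma){ end, old->end, old->prot, false };

	return n;
}
```

A change in the middle of a vma yields three pieces, a change at either
edge yields two, and a change covering the whole vma yields a single
replacement.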

> 
> All non-exclusive users can already handle others.
> 
> Stuff like merge/split is interesting because that might invalidate a
> vma while the fault stays valid.
> 
> This means we have to have a more complex vma validation, something
> along the lines of:
> 
> /*
>  * Finds a valid vma
>  */
> struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> {
> 	struct vm_area_struct *vma;
> 
> again:
> 	rcu_read_lock(); /* solely for the lookup structure */
> 	vma = tree_lookup(&mm->vma_tree, addr); /* vma is srcu guarded */
> 	rcu_read_unlock();
> 	if (vma && vma_is_dead(vma))
> 		goto again; /* wait for the live replacement to appear */
> 
> 	return vma;
> }
> 
> /*
>  * validates if a previously obtained vma is still valid,
>  * synchronizes with vma against PTL.
>  */
> int validate_vma(struct mm_struct *mm, struct vm_area_struct *vma,
> 		 unsigned long addr)
> {
> 	struct vm_area_struct *vma2;
> 
> 	ASSERT_PTL_LOCKED(mm, addr);
> 
> 	if (!vma_is_dead(vma))
> 		return 1;
> 
> 	vma2 = find_vma(mm, addr);
> 
> 	if (/*old vma fault is still valid in vma2*/)
> 		return 1;
> 
> 	return 0;
> }
> 
> Merge:
> 
>   mark both vmas dead
>   grow the left to cover both
>   remove the right
>   replace the left with a new alive one
> 
>   (Munge PTEs)
> 
> Split:
> 
>   mark the vma dead
>   insert the new fragment (always the right-most)
>   replace the left with a new, smaller one
> 
>   (Munge PTEs)
> 
> Where we basically use the re-try in the lookup to wait for a valid vma
> to appear while never having the lookup return NULL (which would make
> the fault fail and sigbus).
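
A rough single-threaded user-space model of the lookup, validate and split
steps quoted above may help when reasoning about them. All names here
(toy_vma, toy_mm, toy_find_vma, toy_validate_vma, toy_split) are
hypothetical, and two simplifications are assumptions on my part: with no
concurrent writer, the "goto again" retry collapses to skipping dead
entries, and the elided "old vma fault is still valid in vma2" test is
reduced to "some live vma covers addr". No RCU, SRCU or PTL is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these are kernel types. */
struct toy_vma {
	unsigned long start, end;	/* covers [start, end) */
	bool dead;			/* set before the vma is replaced */
};

struct toy_mm {
	struct toy_vma *vmas[8];	/* flat array standing in for the tree */
	size_t nr;
};

/* Return the first live vma covering addr, or NULL if none does. */
static struct toy_vma *toy_find_vma(const struct toy_mm *mm, unsigned long addr)
{
	for (size_t i = 0; i < mm->nr; i++) {
		struct toy_vma *v = mm->vmas[i];

		if (!v->dead && addr >= v->start && addr < v->end)
			return v;
	}
	return NULL;
}

/*
 * Meant to be called with the (imaginary) PTL held. A fault started
 * against *vma is still good if *vma is alive, or if a live replacement
 * covers addr.
 */
static bool toy_validate_vma(const struct toy_mm *mm,
			     const struct toy_vma *vma, unsigned long addr)
{
	if (!vma->dead)
		return true;

	return toy_find_vma(mm, addr) != NULL;
}

/* Split *old at 'at': mark it dead, then publish two live halves. */
static void toy_split(struct toy_mm *mm, struct toy_vma *old, unsigned long at,
		      struct toy_vma *left, struct toy_vma *right)
{
	assert(at > old->start && at < old->end && mm->nr + 2 <= 8);

	old->dead = true;
	*left = (struct toy_vma){ old->start, at, false };
	*right = (struct toy_vma){ at, old->end, false };
	mm->vmas[mm->nr++] = left;
	mm->vmas[mm->nr++] = right;
}
```

In this model a fault that looked up the old vma before the split still
validates afterwards, because the right-hand fragment covers the faulting
address; an unmap would instead leave no live vma there and validation
would fail.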
> 
> > > I'm sure there are many fun details to work out, even if the above idea
> > > is found solid, amongst them is extending srcu to provide call_srcu(),
> > > and implementing an RCU friendly tree structure.
> > 
> > srcu may have too much of an overhead for this.
> 
> Then we need to fix that ;-) But surely SRCU is cheaper than mmap_sem.
> 
> > > [ hmm, while writing this it occurred to me this might mean we have to
> > >   srcu free the page table pages :/ ]
> > 
> > The page tables cannot immediately be reused then (quicklists etc).
> 
> I think I was wrong there, we don't do speculative PTE locks, so we
> should be good here.
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


