* [RFC] another crazy idea to get rid of mmap_sem in faults
From: Peter Zijlstra @ 2008-11-28 15:42 UTC
To: Nick Piggin, Linus Torvalds, hugh
Cc: Paul E McKenney, Andrew Morton, Ingo Molnar, linux-mm
Hi,
While pondering the page_fault retry stuff, I came up with the following
idea.
Pagefault concurrency with mmap() is undefined at best (any sane
application will start using memory after it's been mmap'ed and stop
using it before it unmaps it).
The only thing we need to ensure is that we don't insert a PTE into the
wrong map in case some app does something stupid.
If we do not freeze the vm map like we normally do, but use a lockless
vma lookup, we're left with the unmap race (you're unlikely to find the
vma before its insertion anyway).
I think we can close that race by marking a vma 'dead' before we do the
pte unmap; this means that once we have the pte lock in the fault
handler we can validate the vma (it cannot go away after all, because
the unmap will block on that pte lock).
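As a sketch only (vma_remove(), vma_free_rcu() and mm_srcu are the
hypothetical helpers from the patch below, and call_srcu() does not
exist yet), the unmap side would order things roughly like this:

	/* unmap side, per vma */
	rb_erase(&vma->vm_rb, &mm->mm_rb);	/* unlink from the lookup structure */
	vma_remove(vma);			/* mark it 'dead' (vm_usage -> 0) */
	unmap_region(mm, vma, ...);		/* zap PTEs; this takes each PTL, so it
						 * serializes against faults holding it */
	call_srcu(&mm_srcu, &vma->rcu_head, vma_free_rcu); /* free once readers drain */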
Therefore, we can do the fault optimistically with whatever sane vma we
find, up to the point where we want to insert the PTE; there we have to
take the PTL and validate that the vma is still good.
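In (very rough) code, the fault side then becomes something like the
following, where find_vma_lockless() stands in for whatever lockless
lookup we end up with and vma_is_dead() is the helper from the patch
below:

	idx = srcu_read_lock(&mm_srcu);		/* the vma cannot be freed under us */
retry:
	vma = find_vma_lockless(mm, address);
	if (!vma)
		goto bad_area;
	if (vma_is_dead(vma))
		goto retry;			/* raced with unmap/split/merge */

	/* ... do the fault work, possibly sleeping for IO ... */

	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
	if (vma_is_dead(vma)) {
		/* unmap marks the vma dead before zapping PTEs and
		 * blocks on this PTL, so a dead vma means we raced */
		pte_unmap_unlock(page_table, ptl);
		goto retry;
	}
	set_pte_at(mm, address, page_table, entry);
	pte_unmap_unlock(page_table, ptl);
	srcu_read_unlock(&mm_srcu, idx);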
[ Of course getting the PTL after the unmap might instantiate the upper
page tables for naught, and we ought to clean them up again if we
raced; can this be done race-free? ]
I'm sure there are many fun details to work out, even if the above idea
is found solid, among them extending SRCU to provide call_srcu() and
implementing an RCU-friendly tree structure.
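A call_srcu() would presumably mirror call_rcu(); a hypothetical
signature (to repeat: this does not exist yet):

	void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
		       void (*func)(struct rcu_head *head));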
[ hmm, while writing this it occurred to me this might mean we have to
srcu free the page table pages :/ ]
The patch below is very rough; it doesn't compile, nor does it attempt
to be correct. Its only purpose is to illustrate the idea more clearly.
NOT-Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 63e9f7c..ba0eeeb 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -591,6 +591,7 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
unsigned long address;
int write, si_code;
int fault;
+ int idx;
unsigned long *stackend;
#ifdef CONFIG_X86_64
@@ -600,7 +601,6 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
tsk = current;
mm = tsk->mm;
- prefetchw(&mm->mmap_sem);
/* get the address */
address = read_cr2();
@@ -683,32 +683,35 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
goto bad_area_nosemaphore;
again:
+ idx = srcu_read_lock(&mm_srcu);
+
+retry:
/*
- * When running in the kernel we expect faults to occur only to
- * addresses in user space. All other faults represent errors in the
- * kernel and should generate an OOPS. Unfortunately, in the case of an
- * erroneous fault occurring in a code path which already holds mmap_sem
- * we will deadlock attempting to validate the fault against the
- * address space. Luckily the kernel only validly references user
- * space from well defined areas of code, which are listed in the
- * exceptions table.
+ * this should be lockless, except RB trees suck!
+ *
+ * we want:
+ * srcu_read_lock(); // guard vmas
+ * rcu_read_lock(); // guard the lookup structure
+ * vma = find_vma(mm, address);
+ * rcu_read_unlock();
*
- * As the vast majority of faults will be valid we will only perform
- * the source reference check when there is a possibility of a deadlock.
- * Attempt to lock the address space, if we cannot we then validate the
- * source. If this is invalid we can skip the address space check,
- * thus avoiding the deadlock.
+ * do the fault
+ *
+ * srcu_read_unlock(); // release vma
*/
- if (!down_read_trylock(&mm->mmap_sem)) {
- if ((error_code & PF_USER) == 0 &&
- !search_exception_tables(regs->ip))
- goto bad_area_nosemaphore;
- down_read(&mm->mmap_sem);
- }
-
+ down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
+ up_read(&mm->mmap_sem);
+
if (!vma)
goto bad_area;
+ if (vma_is_dead(vma)) {
+ /*
+ * We lost a race against a concurrent modification. Retry the
+ * lookup which should now obtain a NULL or valid vma.
+ */
+ goto retry;
+ }
if (vma->vm_start <= address)
goto good_area;
if (!(vma->vm_flags & VM_GROWSDOWN))
@@ -723,6 +726,9 @@ again:
if (address + 65536 + 32 * sizeof(unsigned long) < regs->sp)
goto bad_area;
}
+ /*
+ * XXX this might still need mmap_sem...
+ */
if (expand_stack(vma, address))
goto bad_area;
/*
@@ -775,7 +781,7 @@ good_area:
tsk->thread.screen_bitmap |= 1 << bit;
}
#endif
- up_read(&mm->mmap_sem);
+ srcu_read_unlock(&mm_srcu, idx);
return;
/*
@@ -783,7 +789,7 @@ good_area:
* Fix it, but check if it's kernel or user first..
*/
bad_area:
- up_read(&mm->mmap_sem);
+ srcu_read_unlock(&mm_srcu, idx);
bad_area_nosemaphore:
/* User mode accesses just cause a SIGSEGV */
@@ -883,7 +889,7 @@ no_context:
* us unable to handle the page fault gracefully.
*/
out_of_memory:
- up_read(&mm->mmap_sem);
+ srcu_read_unlock(&mm_srcu, idx);
if (is_global_init(tsk)) {
yield();
/*
@@ -899,7 +905,7 @@ out_of_memory:
goto no_context;
do_sigbus:
- up_read(&mm->mmap_sem);
+ srcu_read_unlock(&mm_srcu, idx);
/* Kernel mode? Handle exceptions or die */
if (!(error_code & PF_USER))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffee2f7..c9f1727 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -13,6 +13,7 @@
#include <linux/prio_tree.h>
#include <linux/debug_locks.h>
#include <linux/mm_types.h>
+#include <linux/srcu.h>
struct mempolicy;
struct anon_vma;
@@ -145,6 +146,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_WRITE 0x01 /* Fault was a write access */
#define FAULT_FLAG_NONLINEAR 0x02 /* Fault was via a nonlinear mapping */
+extern struct srcu_struct mm_srcu;
/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
@@ -1132,6 +1134,25 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
extern unsigned long do_brk(unsigned long, unsigned long);
+static inline int vma_is_dead(struct vm_area_struct *vma)
+{
+ return atomic_read(&vma->vm_usage) == 0;
+}
+
+static inline void vma_get(struct vm_area_struct *vma)
+{
+ BUG_ON(vma_is_dead(vma));
+ atomic_inc(&vma->vm_usage);
+}
+
+extern void __vma_put(struct vm_area_struct *vma);
+
+static inline void vma_put(struct vm_area_struct *vma)
+{
+ if (!atomic_dec_return(&vma->vm_usage))
+ __vma_put(vma);
+}
+
/* filemap.c */
extern unsigned long page_unuse(struct page *);
extern void truncate_inode_pages(struct address_space *, loff_t);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0a48058..d5b46a8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -159,12 +159,11 @@ struct vm_area_struct {
void * vm_private_data; /* was vm_pte (shared mem) */
unsigned long vm_truncate_count;/* truncate_count or restart_addr */
-#ifndef CONFIG_MMU
atomic_t vm_usage; /* refcount (VMAs shared if !MMU) */
-#endif
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
+ struct rcu_head rcu_head;
};
struct core_thread {
diff --git a/kernel/fork.c b/kernel/fork.c
index afb376d..b4daad6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -136,6 +136,8 @@ struct kmem_cache *vm_area_cachep;
/* SLAB cache for mm_struct structures (tsk->mm) */
static struct kmem_cache *mm_cachep;
+struct srcu_struct mm_srcu;
+
void free_task(struct task_struct *tsk)
{
prop_local_destroy_single(&tsk->dirties);
@@ -1477,6 +1479,8 @@ void __init proc_caches_init(void)
mm_cachep = kmem_cache_create("mm_struct",
sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
+
+ init_srcu_struct(&mm_srcu);
}
/*
diff --git a/mm/memory.c b/mm/memory.c
index fc031d6..dc2475a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1829,6 +1829,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
* sleep if it needs to.
*/
page_cache_get(old_page);
+ vma_get(vma);
pte_unmap_unlock(page_table, ptl);
if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
@@ -1842,6 +1843,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
page_table = pte_offset_map_lock(mm, pmd, address,
&ptl);
+ vma_put(vma);
page_cache_release(old_page);
if (!pte_same(*page_table, orig_pte))
goto unlock;
@@ -1869,6 +1871,7 @@ reuse:
*/
page_cache_get(old_page);
gotten:
+ vma_get(vma);
pte_unmap_unlock(page_table, ptl);
if (unlikely(anon_vma_prepare(vma)))
@@ -1896,6 +1899,7 @@ gotten:
* Re-check the pte - we dropped the lock
*/
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ vma_put(vma);
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
@@ -2410,6 +2414,17 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (vma_is_dead(vma)) {
+ /*
+ * by holding the ptl we pin whatever vma is covering this
+ * memory region; if at this point the vma we got is dead,
+ * we've lost the race and need to bail.
+ *
+ * XXX should we re-do the lookup and check for merged vmas?
+ */
+ pte_unmap_unlock(pte, ptl);
+ return VM_FAULT_SIGBUS;
+ }
if (!pte_none(*page_table))
goto release;
inc_mm_counter(mm, anon_rss);
@@ -2543,6 +2558,17 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (vma_is_dead(vma)) {
+ /*
+ * by holding the ptl we pin whatever vma is covering this
+ * memory region; if at this point the vma we got is dead,
+ * we've lost the race and need to bail.
+ *
+ * XXX should we re-do the lookup and check for merged vmas?
+ */
+ pte_unmap_unlock(pte, ptl);
+ return VM_FAULT_SIGBUS;
+ }
/*
* This silly early PAGE_DIRTY setting removes a race
@@ -2690,6 +2716,17 @@ static inline int handle_pte_fault(struct mm_struct *mm,
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
+ if (vma_is_dead(vma)) {
+ /*
+ * by holding the ptl we pin whatever vma is covering this
+ * memory region; if at this point the vma we got is dead,
+ * we've lost the race and need to bail.
+ *
+ * XXX should we re-do the lookup and check for merged vmas?
+ */
+ pte_unmap_unlock(pte, ptl);
+ return VM_FAULT_SIGBUS;
+ }
if (unlikely(!pte_same(*pte, entry)))
goto unlock;
if (write_access) {
diff --git a/mm/mmap.c b/mm/mmap.c
index d4855a6..2b2d454 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -43,6 +43,48 @@
#define arch_rebalance_pgtables(addr, len) (addr)
#endif
+void __vma_put(struct vm_area_struct *vma)
+{
+ /*
+ * we should never free a vma through here!
+ */
+ BUG();
+}
+
+void vma_free_rcu(struct rcu_head *rcu)
+{
+ struct vm_area_struct *vma =
+ container_of(rcu, struct vm_area_struct, rcu_head);
+
+ kmem_cache_free(vm_area_cachep, vma);
+}
+
+void vma_free(struct vm_area_struct *vma)
+{
+ VM_BUG_ON(!vma_is_dead(vma));
+ /*
+ * Yeah, I know, this doesn't exist...
+ */
+ call_srcu(&mm_srcu, &vma->rcu_head, vma_free_rcu);
+}
+
+/*
+ * mark the vma unused before we zap the PTEs; that way, holding the PTE lock
+ * will block the unmap and guarantee vma validity.
+ */
+void vma_remove(struct vm_area_struct *vma)
+{
+ /*
+ * XXX might need to be a blocking wait? that would
+ * complicate vma_adjust.
+ *
+ * better to get rid of vma_get/put; do_wp_page() might be able
+ * to compare PTEs and bail, it'll just re-take the fault.
+ */
+ while (atomic_cmpxchg(&vma->vm_usage, 1, 0))
+ cpu_relax();
+}
+
static void unmap_region(struct mm_struct *mm,
struct vm_area_struct *vma, struct vm_area_struct *prev,
unsigned long start, unsigned long end);
@@ -241,7 +283,7 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
removed_exe_file_vma(vma->vm_mm);
}
mpol_put(vma_policy(vma));
- kmem_cache_free(vm_area_cachep, vma);
+ vma_free(vma);
return next;
}
@@ -407,6 +449,7 @@ __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
struct rb_node **rb_link, struct rb_node *rb_parent)
{
+ atomic_set(&vma->vm_usage, 1);
rb_link_node(&vma->vm_rb, rb_parent, rb_link);
rb_insert_color(&vma->vm_rb, &mm->mm_rb);
}
@@ -492,6 +535,7 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
{
prev->vm_next = vma->vm_next;
rb_erase(&vma->vm_rb, &mm->mm_rb);
+ vma_remove(vma);
if (mm->mmap_cache == vma)
mm->mmap_cache = prev;
}
@@ -644,7 +688,7 @@ again: remove_next = 1 + (end > next->vm_end);
}
mm->map_count--;
mpol_put(vma_policy(next));
- kmem_cache_free(vm_area_cachep, next);
+ vma_free(next);
/*
* In mprotect's case 6 (see comments on vma_merge),
* we must remove another next too. It would clutter
@@ -1210,7 +1254,7 @@ munmap_back:
if (file && vma_merge(mm, prev, addr, vma->vm_end,
vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
mpol_put(vma_policy(vma));
- kmem_cache_free(vm_area_cachep, vma);
+ vma_free(vma);
fput(file);
if (vm_flags & VM_EXECUTABLE)
removed_exe_file_vma(mm);
@@ -1247,7 +1291,7 @@ unmap_and_free_vma:
unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
charged = 0;
free_vma:
- kmem_cache_free(vm_area_cachep, vma);
+ vma_free(vma);
unacct_error:
if (charged)
vm_unacct_memory(charged);
@@ -1801,6 +1845,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
insertion_point = (prev ? &prev->vm_next : &mm->mmap);
do {
rb_erase(&vma->vm_rb, &mm->mm_rb);
+ vma_remove(vma);
mm->map_count--;
tail_vma = vma;
vma = vma->vm_next;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [RFC] another crazy idea to get rid of mmap_sem in faults
From: Paul E. McKenney @ 2008-11-28 17:04 UTC
To: Peter Zijlstra
Cc: Nick Piggin, Linus Torvalds, hugh, Andrew Morton, Ingo Molnar, linux-mm

On Fri, Nov 28, 2008 at 04:42:39PM +0100, Peter Zijlstra wrote:
> Hi,
>
> While pondering the page_fault retry stuff, I came up with the following
> idea.
>
> Pagefault concurrency with mmap() is undefined at best (any sane
> application will start using memory after it's been mmap'ed and stop
> using it before it unmaps it).

I don't know that it matters, but thought I should point out the
historical exception to this rule. This would be applications
maintaining large sparse data structures, using SIGSEGV to instantiate
new data (preferably via a separate thread that does the actual
mapping).

That said, I would agree that such applications are the exception
rather than the rule, and I would -not- recommend optimizing for them.
However, any page fault running concurrently with an mmap() can indeed
be legally handed a SIGSEGV, given that the mmap() has not yet
completed.

> The only thing we need to ensure is that we don't insert a PTE into the
> wrong map in case some app does something stupid.
>
> If we do not freeze the vm map like we normally do, but use a lockless
> vma lookup, we're left with the unmap race (you're unlikely to find the
> vma before its insertion anyway).
>
> I think we can close that race by marking a vma 'dead' before we do the
> pte unmap; this means that once we have the pte lock in the fault
> handler we can validate the vma (it cannot go away after all, because
> the unmap will block on that pte lock).
>
> Therefore, we can do the fault optimistically with whatever sane vma we
> find, up to the point where we want to insert the PTE; there we have to
> take the PTL and validate that the vma is still good.

This makes sense to me, though I cannot claim full knowledge of the
current VM.

> [ Of course getting the PTL after the unmap might instantiate the upper
> page tables for naught, and we ought to clean them up again if we
> raced; can this be done race-free? ]
>
> I'm sure there are many fun details to work out, even if the above idea
> is found solid, among them extending SRCU to provide call_srcu() and
> implementing an RCU-friendly tree structure.
>
> [ hmm, while writing this it occurred to me this might mean we have to
> srcu free the page table pages :/ ]

In theory, a call_srcu() could be provided, but the user of such a
thing would need to be very careful to somehow avoid situations where
some reader was blocked indefinitely, permitting arbitrarily large
numbers of callbacks from being generated... Like when paging over NFS
in face of possible network outages. ;-)

							Thanx, Paul

> The patch below is very rough; it doesn't compile, nor does it attempt
> to be correct. Its only purpose is to illustrate the idea more clearly.

[ patch snipped ]
* Re: [RFC] another crazy idea to get rid of mmap_sem in faults
From: Linus Torvalds @ 2008-11-30 19:27 UTC
To: Peter Zijlstra
Cc: Nick Piggin, hugh, Paul E McKenney, Andrew Morton, Ingo Molnar, linux-mm

On Fri, 28 Nov 2008, Peter Zijlstra wrote:
>
> While pondering the page_fault retry stuff, I came up with the following
> idea.

I don't know if your idea is any good, but this part of your patch is
utter crap:

- if (!down_read_trylock(&mm->mmap_sem)) {
- if ((error_code & PF_USER) == 0 &&
- !search_exception_tables(regs->ip))
- goto bad_area_nosemaphore;
- down_read(&mm->mmap_sem);
- }
-
+ down_read(&mm->mmap_sem);

because the reason we do a down_read_trylock() is not because of any lock
order issue or anything like that, but really really fundamental: we want
to be able to print an oops, instead of deadlocking, if we take a page
fault in kernel code while holding/waiting-for the mmap_sem for writing.

... and I don't even see the reason why you did that change anyway, since
it seems to be totally independent of all the other locking changes.

		Linus
* Re: [RFC] another crazy idea to get rid of mmap_sem in faults
From: Peter Zijlstra @ 2008-11-30 19:42 UTC
To: Linus Torvalds
Cc: Nick Piggin, hugh, Paul E McKenney, Andrew Morton, Ingo Molnar, linux-mm

On Sun, 2008-11-30 at 11:27 -0800, Linus Torvalds wrote:
> On Fri, 28 Nov 2008, Peter Zijlstra wrote:
> >
> > While pondering the page_fault retry stuff, I came up with the following
> > idea.
>
> I don't know if your idea is any good, but this part of your patch is
> utter crap:
>
> - if (!down_read_trylock(&mm->mmap_sem)) {
> - if ((error_code & PF_USER) == 0 &&
> - !search_exception_tables(regs->ip))
> - goto bad_area_nosemaphore;
> - down_read(&mm->mmap_sem);
> - }
> -
> + down_read(&mm->mmap_sem);
>
> because the reason we do a down_read_trylock() is not because of any lock
> order issue or anything like that, but really really fundamental: we want
> to be able to print an oops, instead of deadlocking, if we take a page
> fault in kernel code while holding/waiting-for the mmap_sem for writing.
>
> ... and I don't even see the reason why you did that change anyway, since
> it seems to be totally independent of all the other locking changes.

Yes, the patch sucks, and you replying to it in such detail makes me
think I should have never attached it..

I know why we do trylock; the only reason I did that change is to place
the down/up right around the find_vma(). If you'd read the changed
comment you'd see I'd intended that to become rcu_read_{,un}lock(),
however we currently cannot do that since RB trees are not compatible
with lockless lookups (due to the tree rotations).

Please consider the idea of lockless vma lookup and synchronizing
against the PTE lock.

If that primary idea seems feasible, I'll continue working on it and
try to tackle further obstacles. One non-trivial issue is
splitting/merging vmas against this lockless lookup - I do have a
solution, but I'm not particularly fond of it. Other issues include
replacing the RB tree with something that is suited for lockless
lookups (B+ trees for example) and extending SRCU to suit the new
requirements.

Anyway, please don't take the patch too seriously (and again, yes, it's
utter shite), but consider the idea of getting rid of the mmap_sem
usage in the regular fault path as outlined (currently I'm still not
sure how to get rid of it in the stack extend case).
* Re: [RFC] another crazy idea to get rid of mmap_sem in faults
From: Hugh Dickins @ 2008-12-01 22:55 UTC
To: Peter Zijlstra
Cc: Linus Torvalds, Nick Piggin, Paul E McKenney, Andrew Morton, Ingo Molnar, linux-mm

On Sun, 30 Nov 2008, Peter Zijlstra wrote:
>
> Please consider the idea of lockless vma lookup and synchronizing
> against the PTE lock.
>
> If that primary idea seems feasible, I'll continue working on it and
> try to tackle further obstacles.

I've not studied any of your details (you'll be relieved to know ;),
but this does seem to me like a very promising direction, and fun too!

It is consistent with the nature of faulting (go back and try it again
if any difficulty encountered, just don't take eternity), and the way
we already grab a snapshot of the pte, make decisions based upon that,
get the lock we need, then check pte_same().

I imagine you'll need something like vma_same(), with a sequence count
in the vma (perhaps you already said as much).

Good luck with it!
Hugh
* Re: [RFC] another crazy idea to get rid of mmap_sem in faults
From: Christoph Lameter @ 2008-12-01 14:00 UTC
To: Peter Zijlstra
Cc: Nick Piggin, Linus Torvalds, hugh, Paul E McKenney, Andrew Morton, Ingo Molnar, linux-mm

On Fri, 28 Nov 2008, Peter Zijlstra wrote:

> Pagefault concurrency with mmap() is undefined at best (any sane
> application will start using memory after it's been mmap'ed and stop
> using it before it unmaps it).

mmap_sem in pagefaults is mainly used to serialize various
modifications to the address space structures while faults are
processed. This is of course all mmap related, but stuff like forking
can occur concurrently in a multithreaded application. The COW
mechanism is tied up with this too.

> If we do not freeze the vm map like we normally do, but use a lockless
> vma lookup, we're left with the unmap race (you're unlikely to find the
> vma before its insertion anyway).

Then you will need to use RCU for the vmas in general. We already use
RCU for the anonymous vma. Extend that to all vmas?

> I think we can close that race by marking a vma 'dead' before we do the
> pte unmap; this means that once we have the pte lock in the fault
> handler we can validate the vma (it cannot go away after all, because
> the unmap will block on that pte lock).

The anonymous VMAs already have refcounts, and vm_area_struct also for
the !MM case. So maybe you could get to the notion of a "dead" vma
easily.

> Therefore, we can do the fault optimistically with whatever sane vma we
> find, up to the point where we want to insert the PTE; there we have to
> take the PTL and validate that the vma is still good.

How would this sync with other operations that need to take mmap_sem?

> I'm sure there are many fun details to work out, even if the above idea
> is found solid, among them extending SRCU to provide call_srcu() and
> implementing an RCU-friendly tree structure.

srcu may have too much of an overhead for this.

> [ hmm, while writing this it occurred to me this might mean we have to
> srcu free the page table pages :/ ]

The page tables cannot then be immediately reused (quicklists etc).
* Re: [RFC] another crazy idea to get rid of mmap_sem in faults
From: Peter Zijlstra @ 2008-12-01 14:48 UTC
To: Christoph Lameter
Cc: Nick Piggin, Linus Torvalds, hugh, Paul E McKenney, Andrew Morton, Ingo Molnar, linux-mm

On Mon, 2008-12-01 at 08:00 -0600, Christoph Lameter wrote:
> On Fri, 28 Nov 2008, Peter Zijlstra wrote:
>
> > Pagefault concurrency with mmap() is undefined at best (any sane
> > application will start using memory after it's been mmap'ed and stop
> > using it before it unmaps it).
>
> mmap_sem in pagefaults is mainly used to serialize various
> modifications to the address space structures while faults are
> processed.

Well, yeah, mainly the vmas, right? We need to ensure the vma we fault
into doesn't go away or change under us.

Does mmap_sem protect more than the vmas and their RB tree?

> This is of course all mmap related, but stuff like forking can
> occur concurrently in a multithreaded application. The COW mechanism
> is tied up with this too.

Hohumm, fork().. good point.

I'll ponder the COW stuff, but since that too holds on to the PTL I
think we're good.

> > If we do not freeze the vm map like we normally do, but use a lockless
> > vma lookup, we're left with the unmap race (you're unlikely to find the
> > vma before its insertion anyway).
>
> Then you will need to use RCU for the vmas in general. We already use
> RCU for the anonymous vma. Extend that to all vmas?

RCU cannot be used, since we need to be able to sleep in a fault in
order to do IO. So we need to extend SRCU - but since we have all the
experience of preemptible RCU to draw from, I think that should not be
an issue.

> > I think we can close that race by marking a vma 'dead' before we do the
> > pte unmap; this means that once we have the pte lock in the fault
> > handler we can validate the vma (it cannot go away after all, because
> > the unmap will block on that pte lock).
>
> The anonymous VMAs already have refcounts, and vm_area_struct also for
> the !MM case. So maybe you could get to the notion of a "dead" vma
> easily.

!MMU case? Yes, I was thinking to abuse that ;-)

> > Therefore, we can do the fault optimistically with whatever sane vma we
> > find, up to the point where we want to insert the PTE; there we have to
> > take the PTL and validate that the vma is still good.
>
> How would this sync with other operations that need to take mmap_sem?

What other ops? mmap/munmap/mremap/madvise etc.?

The idea is that any change to a vma (which would require exclusive
mmap_sem) will replace the vma - marking the old one dead and SRCU
freeing it.

All non-exclusive users can already handle others.

Stuff like merge/split is interesting because that might invalidate a
vma while the fault stays valid.

This means we have to have a more complex vma validation, something
along the lines of:

/*
 * Finds a valid vma
 */
struct vm_area_struct *find_vma(mm, addr)
{
again:
	rcu_read_lock(); /* solely for the lookup structure */
	vma = tree_lookup(&mm->vma_tree, addr); /* vma is srcu guarded */
	rcu_read_unlock();
	if (vma && vma_is_dead(vma))
		goto again;

	return vma;
}

/*
 * validates if a previously obtained vma is still valid,
 * synchronizes with vma against PTL.
 */
int validate_vma(mm, vma, addr)
{
	ASSERT_PTL_LOCKED(mm, addr);

	if (!vma_is_dead(vma))
		return 1;

	vma2 = find_vma(mm, addr);

	if (/* old vma fault is still valid in vma2 */)
		return 1;

	return 0;
}

Merge:

	mark both vmas dead
	grow the left to cover both
	remove the right
	replace the left with a new alive one

	(Munge PTEs)

Split:

	mark the vma dead
	insert the new fragment (always the right-most)
	replace the left with a new, smaller one

	(Munge PTEs)

Where we basically use the re-try in the lookup to wait for a valid vma
to appear, while never having the lookup return NULL (which would make
the fault fail and SIGBUS).

> > I'm sure there are many fun details to work out, even if the above idea
> > is found solid, among them extending SRCU to provide call_srcu() and
> > implementing an RCU-friendly tree structure.
>
> srcu may have too much of an overhead for this.

Then we need to fix that ;-) But surely SRCU is cheaper than mmap_sem.

> > [ hmm, while writing this it occurred to me this might mean we have to
> > srcu free the page table pages :/ ]
>
> The page tables cannot then be immediately reused (quicklists etc).

I think I was wrong there; we don't do speculative PTE locks, so we
should be good here.
* Re: [RFC] another crazy idea to get rid of mmap_sem in faults
From: Christoph Lameter @ 2008-12-01 15:06 UTC
To: Peter Zijlstra
Cc: Nick Piggin, Linus Torvalds, hugh, Paul E McKenney, Andrew Morton, Ingo Molnar, linux-mm

On Mon, 1 Dec 2008, Peter Zijlstra wrote:

> Well, yeah, mainly the vmas, right? We need to ensure the vma we fault
> into doesn't go away or change under us.
>
> Does mmap_sem protect more than the vmas and their RB tree?

It may protect portions of the mm_struct?

> I'll ponder the COW stuff, but since that too holds on to the PTL I
> think we're good.

Copy pte range is called under mmap_sem when forking.

> > Then you will need to use RCU for the vmas in general. We already use
> > RCU for the anonymous vma. Extend that to all vmas?
>
> RCU cannot be used, since we need to be able to sleep in a fault in
> order to do IO. So we need to extend SRCU - but since we have all the
> experience of preemptible RCU to draw from, I think that should not be
> an issue.

Then do not hold it across sleeping portions. If you have to sleep then
start another rcu section and recheck the situation before proceeding.

> > !MM case. So maybe you could get to the notion of a "dead" vma easily.
>
> !MMU case? Yes, I was thinking to abuse that ;-)

Right.

> > How would this sync with other operations that need to take mmap_sem?
>
> What other ops? mmap/munmap/mremap/madvise etc.?

Look for down_write operations on mmap_sem. Things like set_brk etc.

> The idea is that any change to a vma (which would require exclusive
> mmap_sem) will replace the vma - marking the old one dead and SRCU
> freeing it.

That is a classic RCU approach. The problem then is that operations may
continue on the stale object. You need to make sure that no
modifications can occur to the original VMA while it exists in two
versions.

> struct vm_area_struct *find_vma(mm, addr)
> {
> again:
> 	rcu_read_lock(); /* solely for the lookup structure */
> 	vma = tree_lookup(&mm->vma_tree, addr); /* vma is srcu guarded */
> 	rcu_read_unlock();
> 	if (vma && vma_is_dead(vma))
> 		goto again;
>
> 	return vma;

Don't you need to check and increase the refcount in the rcu section
(in the absence of any other locking) to make sure that the object
still exists when you refer to it next?

> /*
>  * validates if a previously obtained vma is still valid,
>  * synchronizes with vma against PTL.
>  */
> int validate_vma(mm, vma, addr)
> {
> 	ASSERT_PTL_LOCKED(mm, addr);
>
> 	if (!vma_is_dead(vma))
> 		return 1;
>
> 	vma2 = find_vma(mm, addr);
>
> 	if (/* old vma fault is still valid in vma2 */)
> 		return 1;
>
> 	return 0;
> }

This can only be done while the refcount is elevated.

> > srcu may have too much of an overhead for this.
>
> Then we need to fix that ;-) But surely SRCU is cheaper than mmap_sem.

Holding off frees for a long time (sleeping???) is usually bad for
cache hot behavior. It introduces cacheline refetches. Avoid if
possible.
* Re: [RFC] another crazy idea to get rid of mmap_sem in faults
From: Peter Zijlstra @ 2008-12-01 15:28 UTC
To: Christoph Lameter
Cc: Nick Piggin, Linus Torvalds, hugh, Paul E McKenney, Andrew Morton, Ingo Molnar, linux-mm

On Mon, 2008-12-01 at 09:06 -0600, Christoph Lameter wrote:
> On Mon, 1 Dec 2008, Peter Zijlstra wrote:
>
> > Well, yeah, mainly the vmas, right? We need to ensure the vma we fault
> > into doesn't go away or change under us.
> >
> > Does mmap_sem protect more than the vmas and their RB tree?
>
> It may protect portions of the mm_struct?

Maybe; I'll see which parts of it the faults use.

> > I'll ponder the COW stuff, but since that too holds on to the PTL I
> > think we're good.
>
> Copy pte range is called under mmap_sem when forking.

Obviously, but if we do faults without mmap_sem there's a race; we need
to do the right thing.

> > RCU cannot be used, since we need to be able to sleep in a fault in
> > order to do IO. So we need to extend SRCU - but since we have all the
> > experience of preemptible RCU to draw from, I think that should not be
> > an issue.
>
> Then do not hold it across sleeping portions. If you have to sleep then
> start another rcu section and recheck the situation before proceeding.

I'd prefer to start out with SRCU and maybe try that later.

> > > How would this sync with other operations that need to take mmap_sem?
> >
> > What other ops? mmap/munmap/mremap/madvise etc.?
>
> Look for down_write operations on mmap_sem. Things like set_brk etc.

Sure, I thought you had something specific in mind. The only relevant
thing for faults is vmas (afaict), and any operation that changes those
either fudges PTEs and thus needs the PTE lock, or leaves the fault
valid, in which case the extended validation should find it so.

> > struct vm_area_struct *find_vma(mm, addr)
> > {
> > again:
> > 	rcu_read_lock(); /* solely for the lookup structure */
> > 	vma = tree_lookup(&mm->vma_tree, addr); /* vma is srcu guarded */
> > 	rcu_read_unlock();
> > 	if (vma && vma_is_dead(vma))
> > 		goto again;
> >
> > 	return vma;
>
> Don't you need to check and increase the refcount in the rcu section
> (in the absence of any other locking) to make sure that the object
> still exists when you refer to it next?

/* vma is srcu guarded */

Taking a ref on the vma is just as bad as using mmap_sem - we might as
well not bother with the whole ordeal in that case.

> > int validate_vma(mm, vma, addr)
>
> [ ... ]
>
> This can only be done while the refcount is elevated.

Like above, we're still in the one SRCU section.

The basic idea is to replace down_read/up_read with
srcu_read_lock/srcu_read_unlock for faults, since the latter scales (or
can scale) but a single lock cannot.

> > > srcu may have too much of an overhead for this.
> >
> > Then we need to fix that ;-) But surely SRCU is cheaper than mmap_sem.
>
> Holding off frees for a long time (sleeping???) is usually bad for
> cache hot behavior. It introduces cacheline refetches. Avoid if
> possible.

Well, what do you rather have, mmap_sem cacheline bouncing and
contention? Do you think the small vma slab will suffer significantly
from the relatively expensive exclusive mmap_sem operations?
* Re: [RFC] another crazy idea to get rid of mmap_sem in faults
From: Paul E. McKenney @ 2008-12-01 16:22 UTC
To: Christoph Lameter
Cc: Peter Zijlstra, Nick Piggin, Linus Torvalds, hugh, Andrew Morton, Ingo Molnar, linux-mm

On Mon, Dec 01, 2008 at 09:06:10AM -0600, Christoph Lameter wrote:
> On Mon, 1 Dec 2008, Peter Zijlstra wrote:
> > > srcu may have too much of an overhead for this.
> >
> > Then we need to fix that ;-) But surely SRCU is cheaper than mmap_sem.
>
> Holding off frees for a long time (sleeping???) is usually bad for
> cache hot behavior. It introduces cacheline refetches. Avoid if
> possible.

This is a classic tradeoff. Cache misses from mmap_sem vs. cache misses
from accessing newly allocated memory that was freed via RCU (and thus
was ejected from the CPU cache).

This is one reason why RCU is specialized for read-mostly situations.
In read-mostly situations, the savings due to avoiding cache-thrashy
locking primitives can outweigh the cache-cold effects due to memory
waiting for an RCU grace period to elapse. Plus RCU does not suffer so
much from lock contention (for exclusive locks or write-acquisitions of
reader-writer locks) or cache thrashing (for locks in general and also
most atomic operations).

Your mileage may vary, so use the best tool for the job. ;-)

							Thanx, Paul
* Re: [RFC] another crazy idea to get rid of mmap_sem in faults
From: Lee Schermerhorn @ 2008-12-01 15:27 UTC
To: Peter Zijlstra
Cc: Christoph Lameter, Nick Piggin, Linus Torvalds, hugh, Paul E McKenney, Andrew Morton, Ingo Molnar, linux-mm

On Mon, 2008-12-01 at 15:48 +0100, Peter Zijlstra wrote:

[ ... ]

> The idea is that any change to a vma (which would require exclusive
> mmap_sem) will replace the vma - marking the old one dead and SRCU
> freeing it.

mlock(), mprotect() and mbind() [others?] can also split/merge vmas
under exclusive mmap_sem to accommodate the changed attributes.

[ rest of quote snipped ]
* Re: [RFC] another crazy idea to get rid of mmap_sem in faults
From: Peter Zijlstra @ 2008-12-01 15:33 UTC
To: Lee Schermerhorn
Cc: Christoph Lameter, Nick Piggin, Linus Torvalds, hugh, Paul E McKenney, Andrew Morton, Ingo Molnar, linux-mm

On Mon, 2008-12-01 at 10:27 -0500, Lee Schermerhorn wrote:
> mlock(), mprotect() and mbind() [others?] can also split/merge vmas
> under exclusive mmap_sem to accommodate the changed attributes.

Right, all those should be OK as described in the Split/Merge scenarios.