* [patch 1/5] Move tlb handling into free_pgtables()
2008-06-26 0:36 [patch 0/5] [RFC] Conversion of reverse map locks to semaphores Christoph Lameter
@ 2008-06-26 0:36 ` Christoph Lameter
2008-06-26 0:36 ` [patch 2/5] Move tlb flushing inside of unmap vmas Christoph Lameter
` (5 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2008-06-26 0:36 UTC (permalink / raw)
To: linux-mm; +Cc: apw, Hugh Dickins, holt, steiner
[-- Attachment #1: move_tlb_flushing_into_free_pgtables --]
[-- Type: text/plain, Size: 4030 bytes --]
Move the tlb_gather_mmu()/tlb_finish_mmu() calls into free_pgtables() so
that the tlb gather/finish is done for each vma separately.
This may add a number of tlb flushes, depending on how many vmas survive
the coalescing scan.
The first argument to free_pgtables() (the mmu_gather pointer) can then be dropped.
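For illustration, a caller such as unmap_region() then ends up looking
roughly like this (a sketch only; the actual hunks are in the diff below):

	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
	vm_unacct_memory(nr_accounted);
	tlb_finish_mmu(tlb, start, end);
	/*
	 * free_pgtables() now runs its own tlb_gather_mmu()/tlb_finish_mmu()
	 * pair for each vma, so no mmu_gather is passed in anymore:
	 */
	free_pgtables(vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
			next ? next->vm_start : 0);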
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mm.h | 4 ++--
mm/memory.c | 13 +++++++++----
mm/mmap.c | 6 +++---
3 files changed, 14 insertions(+), 9 deletions(-)
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2008-06-09 20:20:58.770505275 -0700
+++ linux-2.6/include/linux/mm.h 2008-06-09 20:22:59.845841566 -0700
@@ -772,8 +772,8 @@ int walk_page_range(const struct mm_stru
void *private);
void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
- unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+ unsigned long ceiling);
int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
void unmap_mapping_range(struct address_space *mapping,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2008-06-09 20:20:58.806841642 -0700
+++ linux-2.6/mm/memory.c 2008-06-09 20:22:59.845841566 -0700
@@ -271,9 +271,11 @@ void free_pgd_range(struct mmu_gather **
} while (pgd++, addr = next, addr != end);
}
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
- unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+ unsigned long ceiling)
{
+ struct mmu_gather *tlb;
+
while (vma) {
struct vm_area_struct *next = vma->vm_next;
unsigned long addr = vma->vm_start;
@@ -285,7 +287,8 @@ void free_pgtables(struct mmu_gather **t
unlink_file_vma(vma);
if (is_vm_hugetlb_page(vma)) {
- hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+ tlb = tlb_gather_mmu(vma->vm_mm, 0);
+ hugetlb_free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
} else {
/*
@@ -298,9 +301,11 @@ void free_pgtables(struct mmu_gather **t
anon_vma_unlink(vma);
unlink_file_vma(vma);
}
- free_pgd_range(tlb, addr, vma->vm_end,
+ tlb = tlb_gather_mmu(vma->vm_mm, 0);
+ free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
}
+ tlb_finish_mmu(tlb, addr, vma->vm_end);
vma = next;
}
}
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-06-09 20:20:58.821591534 -0700
+++ linux-2.6/mm/mmap.c 2008-06-09 20:22:59.845841566 -0700
@@ -1762,9 +1762,9 @@ static void unmap_region(struct mm_struc
update_hiwater_rss(mm);
unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
- next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
+ free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+ next? next->vm_start: 0);
}
/*
@@ -2062,8 +2062,8 @@ void exit_mmap(struct mm_struct *mm)
/* Use -1 here to ensure all VMAs in the mm are unmapped */
end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+ free_pgtables(vma, FIRST_USER_ADDRESS, 0);
/*
* Walk the list again, actually closing and freeing it,
--
* [patch 2/5] Move tlb flushing inside of unmap vmas
2008-06-26 0:36 [patch 0/5] [RFC] Conversion of reverse map locks to semaphores Christoph Lameter
2008-06-26 0:36 ` [patch 1/5] Move tlb handling into free_pgtables() Christoph Lameter
@ 2008-06-26 0:36 ` Christoph Lameter
2008-06-26 0:36 ` [patch 3/5] Add capability to check if rwsems are contended Christoph Lameter
` (4 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2008-06-26 0:36 UTC (permalink / raw)
To: linux-mm; +Cc: apw, Hugh Dickins, holt, steiner
[-- Attachment #1: move_tlb_flushing_into_unmap_vmas --]
[-- Type: text/plain, Size: 6448 bytes --]
Move the tlb flushing inside of unmap_vmas(). This saves us from passing
a pointer to the mmu_gather structure around and simplifies the callers.
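With the gather/flush handled internally, a caller like zap_page_range()
reduces to a thin wrapper, roughly as sketched here (the real change is in
the mm/memory.c hunk below):

unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
		unsigned long size, struct zap_details *details)
{
	unsigned long nr_accounted = 0;

	/*
	 * unmap_vmas() now does lru_add_drain(), tlb_gather_mmu(),
	 * update_hiwater_rss() and tlb_finish_mmu() itself.
	 */
	return unmap_vmas(vma, address, address + size, &nr_accounted, details);
}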
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mm.h | 3 +--
mm/memory.c | 40 ++++++++++++++++------------------------
mm/mmap.c | 13 +++----------
3 files changed, 20 insertions(+), 36 deletions(-)
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2008-06-09 20:22:59.845841566 -0700
+++ linux-2.6/include/linux/mm.h 2008-06-09 20:25:24.210091782 -0700
@@ -744,8 +744,7 @@ struct page *vm_normal_page(struct vm_ar
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
- struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2008-06-09 20:22:59.845841566 -0700
+++ linux-2.6/mm/memory.c 2008-06-09 20:28:07.516464953 -0700
@@ -848,7 +848,6 @@ static unsigned long unmap_page_range(st
/**
* unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
* @vma: the starting vma
* @start_addr: virtual address at which to start unmapping
* @end_addr: virtual address at which to end unmapping
@@ -860,20 +859,13 @@ static unsigned long unmap_page_range(st
* Unmap all pages in the vma list.
*
* We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
*
* Only addresses between `start' and `end' will be unmapped.
*
* The VMA list must be sorted in ascending virtual address order.
- *
- * unmap_vmas() assumes that the caller will flush the whole unmapped address
- * range after unmap_vmas() returns. So the only responsibility here is to
- * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
- * drops the lock and schedules.
*/
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
- struct vm_area_struct *vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *details)
{
@@ -882,7 +874,14 @@ unsigned long unmap_vmas(struct mmu_gath
int tlb_start_valid = 0;
unsigned long start = start_addr;
spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
- int fullmm = (*tlbp)->fullmm;
+ int fullmm;
+ struct mmu_gather *tlb;
+ struct mm_struct *mm = vma->vm_mm;
+
+ lru_add_drain();
+ tlb = tlb_gather_mmu(mm, 0);
+ update_hiwater_rss(mm);
+ fullmm = tlb->fullmm;
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;
@@ -909,7 +908,7 @@ unsigned long unmap_vmas(struct mmu_gath
(HPAGE_SIZE / PAGE_SIZE);
start = end;
} else
- start = unmap_page_range(*tlbp, vma,
+ start = unmap_page_range(tlb, vma,
start, end, &zap_work, details);
if (zap_work > 0) {
@@ -917,22 +916,23 @@ unsigned long unmap_vmas(struct mmu_gath
break;
}
- tlb_finish_mmu(*tlbp, tlb_start, start);
+ tlb_finish_mmu(tlb, tlb_start, start);
if (need_resched() ||
(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
if (i_mmap_lock) {
- *tlbp = NULL;
+ tlb = NULL;
goto out;
}
cond_resched();
}
- *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
+ tlb = tlb_gather_mmu(mm, fullmm);
tlb_start_valid = 0;
zap_work = ZAP_BLOCK_SIZE;
}
}
+ tlb_finish_mmu(tlb, start_addr, end_addr);
out:
return start; /* which is now the end (or restart) address */
}
@@ -947,18 +947,10 @@ out:
unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
{
- struct mm_struct *mm = vma->vm_mm;
- struct mmu_gather *tlb;
unsigned long end = address + size;
unsigned long nr_accounted = 0;
- lru_add_drain();
- tlb = tlb_gather_mmu(mm, 0);
- update_hiwater_rss(mm);
- end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
- if (tlb)
- tlb_finish_mmu(tlb, address, end);
- return end;
+ return unmap_vmas(vma, address, end, &nr_accounted, details);
}
/*
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-06-09 20:22:59.845841566 -0700
+++ linux-2.6/mm/mmap.c 2008-06-09 20:25:24.210091782 -0700
@@ -1754,15 +1754,10 @@ static void unmap_region(struct mm_struc
unsigned long start, unsigned long end)
{
struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
- struct mmu_gather *tlb;
unsigned long nr_accounted = 0;
- lru_add_drain();
- tlb = tlb_gather_mmu(mm, 0);
- update_hiwater_rss(mm);
- unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
+ unmap_vmas(vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- tlb_finish_mmu(tlb, start, end);
free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
}
@@ -2047,7 +2042,6 @@ EXPORT_SYMBOL(do_brk);
/* Release all mmaps. */
void exit_mmap(struct mm_struct *mm)
{
- struct mmu_gather *tlb;
struct vm_area_struct *vma = mm->mmap;
unsigned long nr_accounted = 0;
unsigned long end;
@@ -2057,12 +2051,11 @@ void exit_mmap(struct mm_struct *mm)
lru_add_drain();
flush_cache_mm(mm);
- tlb = tlb_gather_mmu(mm, 1);
+
/* Don't update_hiwater_rss(mm) here, do_exit already did */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
- end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+ end = unmap_vmas(vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
- tlb_finish_mmu(tlb, 0, end);
free_pgtables(vma, FIRST_USER_ADDRESS, 0);
/*
--
* [patch 3/5] Add capability to check if rwsems are contended.
2008-06-26 0:36 [patch 0/5] [RFC] Conversion of reverse map locks to semaphores Christoph Lameter
2008-06-26 0:36 ` [patch 1/5] Move tlb handling into free_pgtables() Christoph Lameter
2008-06-26 0:36 ` [patch 2/5] Move tlb flushing inside of unmap vmas Christoph Lameter
@ 2008-06-26 0:36 ` Christoph Lameter
2008-06-26 0:36 ` [patch 4/5] Convert i_mmap_lock to a rw semaphore Christoph Lameter
` (3 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2008-06-26 0:36 UTC (permalink / raw)
To: linux-mm; +Cc: apw, Andrea Arcangeli, Hugh Dickins, holt, steiner
[-- Attachment #1: rwsem_is_contended --]
[-- Type: text/plain, Size: 3451 bytes --]
Add a function to rw semaphores to check if there are any processes
waiting for the semaphore. Add rwsem_needbreak() to sched.h; it works
in the same way as spin_needbreak().
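The intended use is in long lock-holding paths that want to back off when
another task is waiting, in the same style as spin_needbreak(). An
illustrative sketch, using the i_mmap_sem that a later patch introduces:

	/* inside a long traversal that holds mapping->i_mmap_sem for reading */
	if (need_resched() || rwsem_needbreak(&mapping->i_mmap_sem)) {
		up_read(&mapping->i_mmap_sem);
		cond_resched();
		down_read(&mapping->i_mmap_sem);
	}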
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
---
include/linux/rwsem.h | 2 ++
include/linux/sched.h | 9 +++++++++
lib/rwsem-spinlock.c | 12 ++++++++++++
lib/rwsem.c | 12 ++++++++++++
4 files changed, 35 insertions(+)
Index: linux-2.6/include/linux/rwsem.h
===================================================================
--- linux-2.6.orig/include/linux/rwsem.h 2008-06-09 20:20:59.037591344 -0700
+++ linux-2.6/include/linux/rwsem.h 2008-06-09 20:28:47.359341232 -0700
@@ -57,6 +57,8 @@ extern void up_write(struct rw_semaphore
*/
extern void downgrade_write(struct rw_semaphore *sem);
+extern int rwsem_is_contended(struct rw_semaphore *sem);
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/*
* nested locking. NOTE: rwsems are not allowed to recurse
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h 2008-06-09 20:20:59.045591415 -0700
+++ linux-2.6/include/linux/sched.h 2008-06-09 20:28:47.389841510 -0700
@@ -2071,6 +2071,15 @@ static inline int spin_needbreak(spinloc
#endif
}
+static inline int rwsem_needbreak(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_PREEMPT
+ return rwsem_is_contended(sem);
+#else
+ return 0;
+#endif
+}
+
/*
* Reevaluate whether the task has signals pending delivery.
* Wake the task if so.
Index: linux-2.6/lib/rwsem-spinlock.c
===================================================================
--- linux-2.6.orig/lib/rwsem-spinlock.c 2008-06-09 20:20:59.053591561 -0700
+++ linux-2.6/lib/rwsem-spinlock.c 2008-06-09 20:28:47.402091148 -0700
@@ -305,6 +305,18 @@ void __downgrade_write(struct rw_semapho
spin_unlock_irqrestore(&sem->wait_lock, flags);
}
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+ /*
+ * Racy check for an empty list. False positives or negatives
+ * would be okay. False positive may cause a useless dropping of
+ * locks. False negatives may cause locks to be held a bit
+ * longer until the next check.
+ */
+ return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
EXPORT_SYMBOL(__init_rwsem);
EXPORT_SYMBOL(__down_read);
EXPORT_SYMBOL(__down_read_trylock);
Index: linux-2.6/lib/rwsem.c
===================================================================
--- linux-2.6.orig/lib/rwsem.c 2008-06-09 20:20:59.061591425 -0700
+++ linux-2.6/lib/rwsem.c 2008-06-09 20:28:47.402091148 -0700
@@ -251,6 +251,18 @@ asmregparm struct rw_semaphore *rwsem_do
return sem;
}
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+ /*
+ * Racy check for an empty list. False positives or negatives
+ * would be okay. False positive may cause a useless dropping of
+ * locks. False negatives may cause locks to be held a bit
+ * longer until the next check.
+ */
+ return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
EXPORT_SYMBOL(rwsem_down_read_failed);
EXPORT_SYMBOL(rwsem_down_write_failed);
EXPORT_SYMBOL(rwsem_wake);
--
* [patch 4/5] Convert i_mmap_lock to a rw semaphore
2008-06-26 0:36 [patch 0/5] [RFC] Conversion of reverse map locks to semaphores Christoph Lameter
` (2 preceding siblings ...)
2008-06-26 0:36 ` [patch 3/5] Add capability to check if rwsems are contended Christoph Lameter
@ 2008-06-26 0:36 ` Christoph Lameter
2008-06-26 0:36 ` [patch 5/5] Convert anon_vma spinlock to " Christoph Lameter
` (2 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2008-06-26 0:36 UTC (permalink / raw)
To: linux-mm; +Cc: apw, Hugh Dickins, holt, steiner
[-- Attachment #1: rwsem_conversion --]
[-- Type: text/plain, Size: 20541 bytes --]
The conversion to an rwsem allows notifier callbacks during rmap traversal
for files. An rw-style lock also allows concurrent walking of the
reverse map, so that multiple processors can expire pages in the same memory
area of the same process. So it increases the potential concurrency.
This is a reversal of a 2004 patch that converted the lock to a spinlock.
See http://osdir.com/ml/kernel.mm/2004-05/msg00046.html. The numbers there
could so far not be replicated, and the test used is not accessible to me.
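The conversion pattern is mechanical: reverse map walkers that only
traverse the prio tree take the semaphore for reading, while code that
modifies the tree or the lists takes it for writing. Roughly (an
illustrative sketch using page_referenced_one() from the existing rmap
code, not an additional hunk):

	/* rmap walker, e.g. page_referenced_file() or try_to_unmap_file() */
	down_read(&mapping->i_mmap_sem);
	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
		referenced += page_referenced_one(page, vma, &mapcount);
	up_read(&mapping->i_mmap_sem);

	/* modifier, e.g. vma_link(), unlink_file_vma(), dup_mmap() */
	down_write(&mapping->i_mmap_sem);
	__remove_shared_vm_struct(vma, file, mapping);
	up_write(&mapping->i_mmap_sem);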
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
Documentation/vm/locking | 2 +-
arch/x86/mm/hugetlbpage.c | 4 ++--
fs/hugetlbfs/inode.c | 4 ++--
fs/inode.c | 2 +-
include/linux/fs.h | 2 +-
include/linux/mm.h | 2 +-
kernel/fork.c | 4 ++--
mm/filemap.c | 8 ++++----
mm/filemap_xip.c | 4 ++--
mm/fremap.c | 4 ++--
mm/hugetlb.c | 10 +++++-----
mm/memory.c | 22 +++++++++++-----------
mm/migrate.c | 4 ++--
mm/mmap.c | 16 ++++++++--------
mm/mremap.c | 4 ++--
mm/rmap.c | 20 +++++++++-----------
16 files changed, 55 insertions(+), 57 deletions(-)
Index: linux-2.6/Documentation/vm/locking
===================================================================
--- linux-2.6.orig/Documentation/vm/locking 2008-06-09 20:30:52.033591432 -0700
+++ linux-2.6/Documentation/vm/locking 2008-06-09 20:31:07.174091873 -0700
@@ -66,7 +66,7 @@ in some cases it is not really needed. E
expand_stack(), it is hard to come up with a destructive scenario without
having the vmlist protection in this case.
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_sem and the kmem cache
c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
pagemap_lru_lock spinlocks, and no code asks for memory with these locks
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c 2008-06-09 20:30:52.042091449 -0700
+++ linux-2.6/arch/x86/mm/hugetlbpage.c 2008-06-09 20:31:07.174091873 -0700
@@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str
if (!vma_shareable(vma, addr))
return;
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str
put_page(virt_to_page(spte));
spin_unlock(&mm->page_table_lock);
out:
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
}
/*
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c 2008-06-09 20:30:52.049591425 -0700
+++ linux-2.6/fs/hugetlbfs/inode.c 2008-06-09 20:31:07.174091873 -0700
@@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino
pgoff = offset >> PAGE_SHIFT;
i_size_write(inode, offset);
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
if (!prio_tree_empty(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
truncate_hugepages(inode, offset);
return 0;
}
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c 2008-06-09 20:30:52.053591401 -0700
+++ linux-2.6/fs/inode.c 2008-06-09 20:31:07.174091873 -0700
@@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode
INIT_LIST_HEAD(&inode->i_devices);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
rwlock_init(&inode->i_data.tree_lock);
- spin_lock_init(&inode->i_data.i_mmap_lock);
+ init_rwsem(&inode->i_data.i_mmap_sem);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2008-06-09 20:30:52.061591310 -0700
+++ linux-2.6/include/linux/fs.h 2008-06-09 20:31:07.174091873 -0700
@@ -502,7 +502,7 @@ struct address_space {
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
- spinlock_t i_mmap_lock; /* protect tree, count, list */
+ struct rw_semaphore i_mmap_sem; /* protect tree, count, list */
unsigned int truncate_count; /* Cover race condition with truncate */
unsigned long nrpages; /* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2008-06-09 20:31:03.358091899 -0700
+++ linux-2.6/include/linux/mm.h 2008-06-09 20:31:07.174091873 -0700
@@ -735,7 +735,7 @@ struct zap_details {
struct address_space *check_mapping; /* Check page->mapping if set */
pgoff_t first_index; /* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
- spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */
+ struct rw_semaphore *i_mmap_sem; /* For unmap_mapping_range: */
unsigned long truncate_count; /* Compare vm_truncate_count */
};
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2008-06-09 20:30:52.077591441 -0700
+++ linux-2.6/kernel/fork.c 2008-06-09 20:31:07.174091873 -0700
@@ -297,12 +297,12 @@ static int dup_mmap(struct mm_struct *mm
atomic_dec(&inode->i_writecount);
/* insert tmp into the share list, just after mpnt */
- spin_lock(&file->f_mapping->i_mmap_lock);
+ down_write(&file->f_mapping->i_mmap_sem);
tmp->vm_truncate_count = mpnt->vm_truncate_count;
flush_dcache_mmap_lock(file->f_mapping);
vma_prio_tree_add(tmp, mpnt);
flush_dcache_mmap_unlock(file->f_mapping);
- spin_unlock(&file->f_mapping->i_mmap_lock);
+ up_write(&file->f_mapping->i_mmap_sem);
}
/*
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c 2008-06-09 20:30:52.081591170 -0700
+++ linux-2.6/mm/filemap.c 2008-06-09 20:31:07.178091142 -0700
@@ -61,16 +61,16 @@ generic_file_direct_IO(int rw, struct ki
/*
* Lock ordering:
*
- * ->i_mmap_lock (vmtruncate)
+ * ->i_mmap_sem (vmtruncate)
* ->private_lock (__free_pte->__set_page_dirty_buffers)
* ->swap_lock (exclusive_swap_page, others)
* ->mapping->tree_lock
*
* ->i_mutex
- * ->i_mmap_lock (truncate->unmap_mapping_range)
+ * ->i_mmap_sem (truncate->unmap_mapping_range)
*
* ->mmap_sem
- * ->i_mmap_lock
+ * ->i_mmap_sem
* ->page_table_lock or pte_lock (various, mainly in memory.c)
* ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock)
*
@@ -87,7 +87,7 @@ generic_file_direct_IO(int rw, struct ki
* ->sb_lock (fs/fs-writeback.c)
* ->mapping->tree_lock (__sync_single_inode)
*
- * ->i_mmap_lock
+ * ->i_mmap_sem
* ->anon_vma.lock (vma_adjust)
*
* ->anon_vma.lock
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c 2008-06-09 20:30:52.089591419 -0700
+++ linux-2.6/mm/filemap_xip.c 2008-06-09 20:31:07.178091142 -0700
@@ -178,7 +178,7 @@ __xip_unmap (struct address_space * mapp
if (!page)
return;
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
mm = vma->vm_mm;
address = vma->vm_start +
@@ -196,7 +196,7 @@ __xip_unmap (struct address_space * mapp
page_cache_release(page);
}
}
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
}
/*
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c 2008-06-09 20:30:52.093591211 -0700
+++ linux-2.6/mm/fremap.c 2008-06-09 20:31:07.178091142 -0700
@@ -205,13 +205,13 @@ asmlinkage long sys_remap_file_pages(uns
}
goto out;
}
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
flush_dcache_mmap_lock(mapping);
vma->vm_flags |= VM_NONLINEAR;
vma_prio_tree_remove(vma, &mapping->i_mmap);
vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
flush_dcache_mmap_unlock(mapping);
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
}
err = populate_range(mm, vma, start, size, pgoff);
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c 2008-06-09 20:30:52.097591289 -0700
+++ linux-2.6/mm/hugetlb.c 2008-06-09 20:31:07.178091142 -0700
@@ -813,7 +813,7 @@ void __unmap_hugepage_range(struct vm_ar
struct page *page;
struct page *tmp;
/*
- * A page gathering list, protected by per file i_mmap_lock. The
+ * A page gathering list, protected by per file i_mmap_sem. The
* lock is used to avoid list corruption from multiple unmapping
* of the same page since we are using page->lru.
*/
@@ -861,9 +861,9 @@ void unmap_hugepage_range(struct vm_area
* do nothing in this case.
*/
if (vma->vm_file) {
- spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+ down_write(&vma->vm_file->f_mapping->i_mmap_sem);
__unmap_hugepage_range(vma, start, end);
- spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+ up_write(&vma->vm_file->f_mapping->i_mmap_sem);
}
}
@@ -1108,7 +1108,7 @@ void hugetlb_change_protection(struct vm
BUG_ON(address >= end);
flush_cache_range(vma, address, end);
- spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+ down_write(&vma->vm_file->f_mapping->i_mmap_sem);
spin_lock(&mm->page_table_lock);
for (; address < end; address += HPAGE_SIZE) {
ptep = huge_pte_offset(mm, address);
@@ -1123,7 +1123,7 @@ void hugetlb_change_protection(struct vm
}
}
spin_unlock(&mm->page_table_lock);
- spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+ up_write(&vma->vm_file->f_mapping->i_mmap_sem);
flush_tlb_range(vma, start, end);
}
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2008-06-09 20:31:03.358091899 -0700
+++ linux-2.6/mm/memory.c 2008-06-09 20:32:02.084217242 -0700
@@ -873,7 +873,7 @@ unsigned long unmap_vmas(struct vm_area_
unsigned long tlb_start = 0; /* For tlb_finish_mmu */
int tlb_start_valid = 0;
unsigned long start = start_addr;
- spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+ struct rw_semaphore *i_mmap_sem = details? details->i_mmap_sem: NULL;
int fullmm;
struct mmu_gather *tlb;
struct mm_struct *mm = vma->vm_mm;
@@ -919,8 +919,8 @@ unsigned long unmap_vmas(struct vm_area_
tlb_finish_mmu(tlb, tlb_start, start);
if (need_resched() ||
- (i_mmap_lock && spin_needbreak(i_mmap_lock))) {
- if (i_mmap_lock) {
+ (i_mmap_sem && rwsem_needbreak(i_mmap_sem))) {
+ if (i_mmap_sem) {
tlb = NULL;
goto out;
}
@@ -1820,7 +1820,7 @@ unwritable_page:
/*
* Helper functions for unmap_mapping_range().
*
- * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __
+ * __ Notes on dropping i_mmap_sem to reduce latency while unmapping __
*
* We have to restart searching the prio_tree whenever we drop the lock,
* since the iterator is only valid while the lock is held, and anyway
@@ -1839,7 +1839,7 @@ unwritable_page:
* can't efficiently keep all vmas in step with mapping->truncate_count:
* so instead reset them all whenever it wraps back to 0 (then go to 1).
* mapping->truncate_count and vma->vm_truncate_count are protected by
- * i_mmap_lock.
+ * i_mmap_sem.
*
* In order to make forward progress despite repeatedly restarting some
* large vma, note the restart_addr from unmap_vmas when it breaks out:
@@ -1889,7 +1889,7 @@ again:
restart_addr = zap_page_range(vma, start_addr,
end_addr - start_addr, details);
- need_break = need_resched() || spin_needbreak(details->i_mmap_lock);
+ need_break = need_resched() || rwsem_needbreak(details->i_mmap_sem);
if (restart_addr >= end_addr) {
/* We have now completed this vma: mark it so */
@@ -1903,9 +1903,9 @@ again:
goto again;
}
- spin_unlock(details->i_mmap_lock);
+ up_write(details->i_mmap_sem);
cond_resched();
- spin_lock(details->i_mmap_lock);
+ down_write(details->i_mmap_sem);
return -EINTR;
}
@@ -1999,9 +1999,9 @@ void unmap_mapping_range(struct address_
details.last_index = hba + hlen - 1;
if (details.last_index < details.first_index)
details.last_index = ULONG_MAX;
- details.i_mmap_lock = &mapping->i_mmap_lock;
+ details.i_mmap_sem = &mapping->i_mmap_sem;
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
/* Protect against endless unmapping loops */
mapping->truncate_count++;
@@ -2016,7 +2016,7 @@ void unmap_mapping_range(struct address_
unmap_mapping_range_tree(&mapping->i_mmap, &details);
if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
}
EXPORT_SYMBOL(unmap_mapping_range);
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c 2008-06-09 20:30:52.113591393 -0700
+++ linux-2.6/mm/migrate.c 2008-06-09 20:31:07.178091142 -0700
@@ -211,12 +211,12 @@ static void remove_file_migration_ptes(s
if (!mapping)
return;
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
remove_migration_pte(vma, old, new);
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
}
/*
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-06-09 20:31:03.358091899 -0700
+++ linux-2.6/mm/mmap.c 2008-06-09 20:31:07.178091142 -0700
@@ -186,7 +186,7 @@ error:
}
/*
- * Requires inode->i_mapping->i_mmap_lock
+ * Requires inode->i_mapping->i_mmap_sem
*/
static void __remove_shared_vm_struct(struct vm_area_struct *vma,
struct file *file, struct address_space *mapping)
@@ -214,9 +214,9 @@ void unlink_file_vma(struct vm_area_stru
if (file) {
struct address_space *mapping = file->f_mapping;
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
__remove_shared_vm_struct(vma, file, mapping);
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
}
}
@@ -448,7 +448,7 @@ static void vma_link(struct mm_struct *m
mapping = vma->vm_file->f_mapping;
if (mapping) {
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
vma->vm_truncate_count = mapping->truncate_count;
}
anon_vma_lock(vma);
@@ -458,7 +458,7 @@ static void vma_link(struct mm_struct *m
anon_vma_unlock(vma);
if (mapping)
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
mm->map_count++;
validate_mm(mm);
@@ -545,7 +545,7 @@ again: remove_next = 1 + (end > next->
mapping = file->f_mapping;
if (!(vma->vm_flags & VM_NONLINEAR))
root = &mapping->i_mmap;
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
if (importer &&
vma->vm_truncate_count != next->vm_truncate_count) {
/*
@@ -629,7 +629,7 @@ again: remove_next = 1 + (end > next->
if (anon_vma)
spin_unlock(&anon_vma->lock);
if (mapping)
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
if (remove_next) {
if (file) {
@@ -2070,7 +2070,7 @@ void exit_mmap(struct mm_struct *mm)
/* Insert vm structure into process list sorted by address
* and into the inode's i_mmap tree. If vm_file is non-NULL
- * then i_mmap_lock is taken here.
+ * then i_mmap_sem is taken here.
*/
int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
{
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c 2008-06-09 20:30:52.126091748 -0700
+++ linux-2.6/mm/mremap.c 2008-06-09 20:31:07.178091142 -0700
@@ -83,7 +83,7 @@ static void move_ptes(struct vm_area_str
* and we propagate stale pages into the dst afterward.
*/
mapping = vma->vm_file->f_mapping;
- spin_lock(&mapping->i_mmap_lock);
+ down_write(&mapping->i_mmap_sem);
if (new_vma->vm_truncate_count &&
new_vma->vm_truncate_count != vma->vm_truncate_count)
new_vma->vm_truncate_count = 0;
@@ -115,7 +115,7 @@ static void move_ptes(struct vm_area_str
pte_unmap_nested(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
if (mapping)
- spin_unlock(&mapping->i_mmap_lock);
+ up_write(&mapping->i_mmap_sem);
}
#define LATENCY_LIMIT (64 * PAGE_SIZE)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c 2008-06-09 20:30:52.133591430 -0700
+++ linux-2.6/mm/rmap.c 2008-06-09 20:31:07.178091142 -0700
@@ -24,7 +24,7 @@
* inode->i_alloc_sem (vmtruncate_range)
* mm->mmap_sem
* page->flags PG_locked (lock_page)
- * mapping->i_mmap_lock
+ * mapping->i_mmap_sem
* anon_vma->lock
* mm->page_table_lock or pte_lock
* zone->lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -365,14 +365,14 @@ static int page_referenced_file(struct p
* The page lock not only makes sure that page->mapping cannot
* suddenly be NULLified by truncation, it makes sure that the
* structure at mapping cannot be freed and reused yet,
- * so we can safely take mapping->i_mmap_lock.
+ * so we can safely take mapping->i_mmap_sem.
*/
BUG_ON(!PageLocked(page));
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
/*
- * i_mmap_lock does not stabilize mapcount at all, but mapcount
+ * i_mmap_sem does not stabilize mapcount at all, but mapcount
* is more likely to be accurate if we note it after spinning.
*/
mapcount = page_mapcount(page);
@@ -395,7 +395,7 @@ static int page_referenced_file(struct p
break;
}
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
return referenced;
}
@@ -478,12 +478,12 @@ static int page_mkclean_file(struct addr
BUG_ON(PageAnon(page));
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
if (vma->vm_flags & VM_SHARED)
ret += page_mkclean_one(page, vma);
}
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
return ret;
}
@@ -914,7 +914,7 @@ static int try_to_unmap_file(struct page
unsigned long max_nl_size = 0;
unsigned int mapcount;
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
ret = try_to_unmap_one(page, vma, migration);
if (ret == SWAP_FAIL || !page_mapped(page))
@@ -951,7 +951,6 @@ static int try_to_unmap_file(struct page
mapcount = page_mapcount(page);
if (!mapcount)
goto out;
- cond_resched_lock(&mapping->i_mmap_lock);
max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
if (max_nl_cursor == 0)
@@ -973,7 +972,6 @@ static int try_to_unmap_file(struct page
}
vma->vm_private_data = (void *) max_nl_cursor;
}
- cond_resched_lock(&mapping->i_mmap_lock);
max_nl_cursor += CLUSTER_SIZE;
} while (max_nl_cursor <= max_nl_size);
@@ -985,7 +983,7 @@ static int try_to_unmap_file(struct page
list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
vma->vm_private_data = NULL;
out:
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
return ret;
}
--
* [patch 5/5] Convert anon_vma spinlock to rw semaphore
2008-06-26 0:36 [patch 0/5] [RFC] Conversion of reverse map locks to semaphores Christoph Lameter
` (3 preceding siblings ...)
2008-06-26 0:36 ` [patch 4/5] Convert i_mmap_lock to a rw semaphore Christoph Lameter
@ 2008-06-26 0:36 ` Christoph Lameter
2008-06-26 1:05 ` Andrea Arcangeli
2008-06-26 0:53 ` [patch 0/5] [RFC] Conversion of reverse map locks to semaphores Andrea Arcangeli
2008-06-27 8:53 ` Peter Zijlstra
6 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2008-06-26 0:36 UTC (permalink / raw)
To: linux-mm; +Cc: apw, Hugh Dickins, holt, steiner
[-- Attachment #1: anon_vma_sem --]
[-- Type: text/plain, Size: 10572 bytes --]
Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows sleeping functions to be called from reverse map traversal, as
needed for the notifier callbacks, which introduces additional concurrency.

RCU is used in some contexts (try_to_unmap) to guarantee the presence of
the anon_vma while we acquire the anon_vma lock. We cannot take a
semaphore within an RCU critical section. Add a refcount to the anon_vma
structure which allows us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty issue
in page migration where we fudged by using RCU over a long code path to
guarantee the existence of the anon_vma. I think this is a bug because
the anon_vma may become empty and get scheduled to be freed even though
it is accessed again when the migration entries are removed.

The refcount in general allows a shortening of RCU critical sections since
we can do an rcu_read_unlock() after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.
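The resulting speculative lookup on the rmap side then looks roughly like
this (a sketch of what grab_anon_vma_page()/page_lock_anon_vma() below do,
combined with a page_referenced_anon() style walk):

	rcu_read_lock();
	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
	if (!atomic_inc_not_zero(&anon_vma->refcount))
		anon_vma = NULL;		/* already on its way to being freed */
	rcu_read_unlock();			/* the RCU section can end here */

	if (anon_vma) {
		down_read(&anon_vma->sem);	/* sleeping is fine now */
		list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
			referenced += page_referenced_one(page, vma, &mapcount);
		up_read(&anon_vma->sem);
		put_anon_vma(anon_vma);		/* may free the anon_vma */
	}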
Prerequisite: i_mmap_lock->sem conversion patches applied.
However:
- Atomic overhead increases in situations where a new reference
to the anon_vma has to be established or removed. Overhead also increases
when a speculative reference is used (try_to_unmap,
page_mkclean, page migration).
- There is the potential for more frequent processor change due to up_xxx
letting waiting tasks run first.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/rmap.h | 20 ++++++++++++++++---
mm/migrate.c | 26 ++++++++++---------------
mm/mmap.c | 4 +--
mm/rmap.c | 53 +++++++++++++++++++++++++++++----------------------
4 files changed, 61 insertions(+), 42 deletions(-)
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h 2008-06-13 11:20:52.973643292 -0700
+++ linux-2.6/include/linux/rmap.h 2008-06-13 11:21:40.401643310 -0700
@@ -25,7 +25,8 @@
* pointing to this anon_vma once its vma list is empty.
*/
struct anon_vma {
- spinlock_t lock; /* Serialize access to vma list */
+ atomic_t refcount; /* vmas on the list */
+ struct rw_semaphore sem;/* Serialize access to vma list */
struct list_head head; /* List of private "related" vmas */
};
@@ -43,18 +44,31 @@ static inline void anon_vma_free(struct
kmem_cache_free(anon_vma_cachep, anon_vma);
}
+struct anon_vma *grab_anon_vma_page(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+ atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+ if (atomic_dec_and_test(&anon_vma->refcount))
+ anon_vma_free(anon_vma);
+}
+
static inline void anon_vma_lock(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
- spin_lock(&anon_vma->lock);
+ down_write(&anon_vma->sem);
}
static inline void anon_vma_unlock(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
- spin_unlock(&anon_vma->lock);
+ up_write(&anon_vma->sem);
}
/*
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c 2008-06-13 11:20:52.977643185 -0700
+++ linux-2.6/mm/migrate.c 2008-06-13 11:21:40.438642920 -0700
@@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s
return;
/*
- * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+ * We hold either the mmap_sem lock or a reference on the
+ * anon_vma. So no need to call page_lock_anon_vma.
*/
anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
- spin_lock(&anon_vma->lock);
+ down_read(&anon_vma->sem);
list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
remove_migration_pte(vma, old, new);
- spin_unlock(&anon_vma->lock);
+ up_read(&anon_vma->sem);
}
/*
@@ -630,7 +631,7 @@ static int unmap_and_move(new_page_t get
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, &result);
- int rcu_locked = 0;
+ struct anon_vma *anon_vma = NULL;
int charge = 0;
if (!newpage)
@@ -654,16 +655,14 @@ static int unmap_and_move(new_page_t get
}
/*
* By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
- * we cannot notice that anon_vma is freed while we migrates a page.
+ * we cannot notice that anon_vma is freed while we migrate a page.
* This rcu_read_lock() delays freeing anon_vma pointer until the end
* of migration. File cache pages are no problem because of page_lock()
* File Caches may use write_page() or lock_page() in migration, then,
* just care Anon page here.
*/
- if (PageAnon(page)) {
- rcu_read_lock();
- rcu_locked = 1;
- }
+ if (PageAnon(page))
+ anon_vma = grab_anon_vma_page(page);
/*
* Corner case handling:
@@ -681,10 +680,7 @@ static int unmap_and_move(new_page_t get
if (!PageAnon(page) && PagePrivate(page)) {
/*
* Go direct to try_to_free_buffers() here because
- * a) that's what try_to_release_page() would do anyway
- * b) we may be under rcu_read_lock() here, so we can't
- * use GFP_KERNEL which is what try_to_release_page()
- * needs to be effective.
+ * that's what try_to_release_page() would do anyway
*/
try_to_free_buffers(page);
}
@@ -705,8 +701,8 @@ static int unmap_and_move(new_page_t get
} else if (charge)
mem_cgroup_end_migration(newpage);
rcu_unlock:
- if (rcu_locked)
- rcu_read_unlock();
+ if (anon_vma)
+ put_anon_vma(anon_vma);
unlock:
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-06-13 11:20:52.985643796 -0700
+++ linux-2.6/mm/mmap.c 2008-06-13 11:21:40.438642920 -0700
@@ -573,7 +573,7 @@ again: remove_next = 1 + (end > next->
if (vma->anon_vma)
anon_vma = vma->anon_vma;
if (anon_vma) {
- spin_lock(&anon_vma->lock);
+ down_write(&anon_vma->sem);
/*
* Easily overlooked: when mprotect shifts the boundary,
* make sure the expanding vma has anon_vma set if the
@@ -627,7 +627,7 @@ again: remove_next = 1 + (end > next->
}
if (anon_vma)
- spin_unlock(&anon_vma->lock);
+ up_write(&anon_vma->sem);
if (mapping)
up_write(&mapping->i_mmap_sem);
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c 2008-06-13 11:20:53.017643368 -0700
+++ linux-2.6/mm/rmap.c 2008-06-13 11:21:40.438642920 -0700
@@ -68,7 +68,7 @@ int anon_vma_prepare(struct vm_area_stru
if (anon_vma) {
allocated = NULL;
locked = anon_vma;
- spin_lock(&locked->lock);
+ down_write(&locked->sem);
} else {
anon_vma = anon_vma_alloc();
if (unlikely(!anon_vma))
@@ -80,6 +80,7 @@ int anon_vma_prepare(struct vm_area_stru
/* page_table_lock to protect against threads */
spin_lock(&mm->page_table_lock);
if (likely(!vma->anon_vma)) {
+ get_anon_vma(anon_vma);
vma->anon_vma = anon_vma;
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
allocated = NULL;
@@ -87,7 +88,7 @@ int anon_vma_prepare(struct vm_area_stru
spin_unlock(&mm->page_table_lock);
if (locked)
- spin_unlock(&locked->lock);
+ up_write(&locked->sem);
if (unlikely(allocated))
anon_vma_free(allocated);
}
@@ -98,14 +99,17 @@ void __anon_vma_merge(struct vm_area_str
{
BUG_ON(vma->anon_vma != next->anon_vma);
list_del(&next->anon_vma_node);
+ put_anon_vma(vma->anon_vma);
}
void __anon_vma_link(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
- if (anon_vma)
+ if (anon_vma) {
+ get_anon_vma(anon_vma);
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+ }
}
void anon_vma_link(struct vm_area_struct *vma)
@@ -113,36 +117,32 @@ void anon_vma_link(struct vm_area_struct
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma) {
- spin_lock(&anon_vma->lock);
+ get_anon_vma(anon_vma);
+ down_write(&anon_vma->sem);
list_add_tail(&vma->anon_vma_node, &anon_vma->head);
- spin_unlock(&anon_vma->lock);
+ up_write(&anon_vma->sem);
}
}
void anon_vma_unlink(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
- int empty;
if (!anon_vma)
return;
- spin_lock(&anon_vma->lock);
+ down_write(&anon_vma->sem);
list_del(&vma->anon_vma_node);
-
- /* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head);
- spin_unlock(&anon_vma->lock);
-
- if (empty)
- anon_vma_free(anon_vma);
+ up_write(&anon_vma->sem);
+ put_anon_vma(anon_vma);
}
static void anon_vma_ctor(struct kmem_cache *cachep, void *data)
{
struct anon_vma *anon_vma = data;
- spin_lock_init(&anon_vma->lock);
+ init_rwsem(&anon_vma->sem);
+ atomic_set(&anon_vma->refcount, 0);
INIT_LIST_HEAD(&anon_vma->head);
}
@@ -156,9 +156,9 @@ void __init anon_vma_init(void)
* Getting a lock on a stable anon_vma from a page off the LRU is
* tricky: page_lock_anon_vma rely on RCU to guard against the races.
*/
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *grab_anon_vma_page(struct page *page)
{
- struct anon_vma *anon_vma;
+ struct anon_vma *anon_vma = NULL;
unsigned long anon_mapping;
rcu_read_lock();
@@ -169,17 +169,26 @@ static struct anon_vma *page_lock_anon_v
goto out;
anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
- spin_lock(&anon_vma->lock);
- return anon_vma;
+ if (!atomic_inc_not_zero(&anon_vma->refcount))
+ anon_vma = NULL;
out:
rcu_read_unlock();
- return NULL;
+ return anon_vma;
+}
+
+static struct anon_vma *page_lock_anon_vma(struct page *page)
+{
+ struct anon_vma *anon_vma = grab_anon_vma_page(page);
+
+ if (anon_vma)
+ down_read(&anon_vma->sem);
+ return anon_vma;
}
static void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
- spin_unlock(&anon_vma->lock);
- rcu_read_unlock();
+ up_read(&anon_vma->sem);
+ put_anon_vma(anon_vma);
}
/*
--
* Re: [patch 5/5] Convert anon_vma spinlock to rw semaphore
2008-06-26 0:36 ` [patch 5/5] Convert anon_vma spinlock to " Christoph Lameter
@ 2008-06-26 1:05 ` Andrea Arcangeli
2008-06-26 17:23 ` Christoph Lameter
0 siblings, 1 reply; 14+ messages in thread
From: Andrea Arcangeli @ 2008-06-26 1:05 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, apw, Hugh Dickins, holt, steiner
On Wed, Jun 25, 2008 at 05:36:37PM -0700, Christoph Lameter wrote:
> However:
> - Atomic overhead increases in situations where a new reference
> to the anon_vma has to be established or removed. Overhead also increases
> when a speculative reference is used (try_to_unmap,
> page_mkclean, page migration).
> - There is the potential for more frequent processor change due to up_xxx
> letting waiting tasks run first.
You dropped the benchmark numbers from the comment, that was useful
data. You may want to re-run the benchmark on different hardware just
to be sure it was valid though (just to be sure it's a significant
regression for AIM).
> void __anon_vma_link(struct vm_area_struct *vma)
> {
> struct anon_vma *anon_vma = vma->anon_vma;
>
> - if (anon_vma)
> + if (anon_vma) {
> + get_anon_vma(anon_vma);
> list_add_tail(&vma->anon_vma_node, &anon_vma->head);
> + }
> }
Last time I checked this code the above get_anon_vma was superfluous.
Below is a quote of the email where I already pointed this out once, in
the middle of the mmu notifier email flooding, so it's fair enough
that it got lost in the noise ;).
I recommend optimizing this, re-running the benchmark and seeing whether
my optimization makes the -10% slowdown go away in AIM. If it does, then
it's surely more reasonable to merge those unconditionally. Unless we
can prove there is no slowdown on small SMP, I doubt it's ok to merge this
one unconditionally (and I also doubt my optimization will fix AIM, as it
only removes an atomic op for each vma in fork, and similarly during vma
teardown).
Thanks!
------------
Secondly we don't need to increase the refcount in fork() when we
queue the vma-copy in the anon_vma. You should init the refcount to 1
when the anon_vma is allocated, remove the atomic_inc from all code
(except when down_read_trylock fails) and then change anon_vma_unlink
to:
up_write(&anon_vma->sem);
if (empty)
put_anon_vma(anon_vma);
While the down_read_trylock surely won't help in AIM, the second
change will reduce the overhead in the VM core fast paths a bit by
avoiding refcount changes and instead checking list_empty() the same
way the current code does. I really like how I designed the garbage
collection through list_empty(); it's efficient and I'd like to
keep it.
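Spelled out, the tail of anon_vma_unlink() would then look roughly like
this (assuming anon_vma_alloc() initializes the refcount to 1):

	down_write(&anon_vma->sem);
	list_del(&vma->anon_vma_node);
	empty = list_empty(&anon_vma->head);
	up_write(&anon_vma->sem);

	if (empty)
		put_anon_vma(anon_vma);	/* drops the initial reference, may free */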
--
* Re: [patch 5/5] Convert anon_vma spinlock to rw semaphore
2008-06-26 1:05 ` Andrea Arcangeli
@ 2008-06-26 17:23 ` Christoph Lameter
2008-06-26 17:27 ` Andrea Arcangeli
0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2008-06-26 17:23 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-mm, apw, Hugh Dickins, holt, steiner
On Thu, 26 Jun 2008, Andrea Arcangeli wrote:
> You dropped the benchmark numbers from the comment, that was useful
> data. You may want to re-run the benchmark on different hardware just
> to be sure it was valid though (just to be sure it's a significant
> regression for AIM).
I could not reproduce it with the recent versions. The degradation was
less than expected.
--
* Re: [patch 5/5] Convert anon_vma spinlock to rw semaphore
2008-06-26 17:23 ` Christoph Lameter
@ 2008-06-26 17:27 ` Andrea Arcangeli
0 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2008-06-26 17:27 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, apw, Hugh Dickins, holt, steiner
On Thu, Jun 26, 2008 at 10:23:17AM -0700, Christoph Lameter wrote:
> On Thu, 26 Jun 2008, Andrea Arcangeli wrote:
>
> > You dropped the benchmark numbers from the comment, that was useful
> > data. You may want to re-run the benchmark on different hardware just
> > to be sure it was valid though (just to be sure it's a significant
> > regression for AIM).
>
> I could not reproduce it with the recent versions. The degradation was
> less than expected.
That's very encouraging! That info plus the removal of the superfluous
atomic ops in the fork fast path sounds quite reasonable. To be more
certain, we need to run this series through some extensive benchmark,
like the one that found the preemptive BKL was hurting performance a lot.
Thanks for the info!
--
* Re: [patch 0/5] [RFC] Conversion of reverse map locks to semaphores
2008-06-26 0:36 [patch 0/5] [RFC] Conversion of reverse map locks to semaphores Christoph Lameter
` (4 preceding siblings ...)
2008-06-26 0:36 ` [patch 5/5] Convert anon_vma spinlock to " Christoph Lameter
@ 2008-06-26 0:53 ` Andrea Arcangeli
2008-06-26 17:19 ` Christoph Lameter
2008-06-27 8:53 ` Peter Zijlstra
6 siblings, 1 reply; 14+ messages in thread
From: Andrea Arcangeli @ 2008-06-26 0:53 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, apw, Hugh Dickins, holt, steiner
Ah great, so you're maintaining those! Just a moment before seeing
this post, I uploaded them too at this URL:
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc7/mmu-notifier-v18/
Since you're taking care of this yourself, I should feel free to remove
those from my patchset, right?
With a VM hat on, I don't think those lock changes should be
unconditional... especially for the anon-vma case, where the common case
is a small critical section. But then, with a KVM hat on, those won't make
the slightest difference to my current interesting workload, so it's
truly not my concern if those go in as-is.
--
* Re: [patch 0/5] [RFC] Conversion of reverse map locks to semaphores
2008-06-26 0:53 ` [patch 0/5] [RFC] Conversion of reverse map locks to semaphores Andrea Arcangeli
@ 2008-06-26 17:19 ` Christoph Lameter
0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2008-06-26 17:19 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-mm, apw, Hugh Dickins, holt, steiner
On Thu, 26 Jun 2008, Andrea Arcangeli wrote:
> Ah great, so you're maintaining those! Just a moment before seeing
clameter@sgi.com will cease to exist soon. Sorry. Maybe Robin?
--
* Re: [patch 0/5] [RFC] Conversion of reverse map locks to semaphores
2008-06-26 0:36 [patch 0/5] [RFC] Conversion of reverse map locks to semaphores Christoph Lameter
` (5 preceding siblings ...)
2008-06-26 0:53 ` [patch 0/5] [RFC] Conversion of reverse map locks to semaphores Andrea Arcangeli
@ 2008-06-27 8:53 ` Peter Zijlstra
2008-06-27 15:46 ` Christoph Lameter
6 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2008-06-27 8:53 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, apw
On Wed, 2008-06-25 at 17:36 -0700, Christoph Lameter wrote:
> (This is going to be the last patchset that I post from clameter@sgi.com.
> Please use cl@linux-foundation.org in the future.)
Congratulations on the new job I suppose.. :-)
> Having semaphores there instead of spinlocks is useful since it
> allows sleeping in various code paths. That sleeping is useful
> if one wants to implement callbacks to remove external mapping
> (like done in the mmu notifier).
>
> Also it seems that a semaphore helps RT and should avoid busy spinning
> on systems where these locks experience significant contention.
Please be careful with the wording here. Semaphores are evil esp for RT.
But luckily you're referring to a sleeping RW lock, which we call
RW-semaphore (but is not an actual semaphore).
You really scared some people saying this ;-)
> The first patches move tlb flushing around in such a way that
> the _lock's can always be taken in preemptible contexts.
>
> The i_mmap_sem used to be present until someone switched it to a spinlock in
> 2004 due to scaling concerns on NUMA with a benchmark called SDET. I was not
> able to locate that benchmark (but Andy Whitcroft has access and promised me
> some results).
>
> AIM9 results (3 samples) anon_vma conversion not applied:
>
> 5 exec_test 1048.95 1025.50 -23.45 -2.24% Program Loads/second
> 6 fork_test 4775.22 4945.16 169.94 3.56% Task Creations/second
>
> 5 exec_test 1057.00 1019.00 -38.00 -3.60% Program Loads/second
> 6 fork_test 4930.14 4760.00 -170.14 -3.45% Task Creations/second
>
> 5 exec_test 1047.50 1038.96 -8.54 -0.82% Program Loads/second
> 6 fork_test 4760.48 4925.07 164.59 3.46% Task Creations/second
>
> Loads per second seem to have a downward tendency. Task creations are up. Not sure
> how much jitter gets into it.
>
> The old page fault performance test on file backed pages
> (anon_vma conversion not applied, 250k per process):
>
> Before:
> Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
> 0 3 1 1 0.00s 0.08s 0.00s 261555.860 246536.848
> 0 3 2 1 0.00s 0.09s 0.00s 219709.015 357800.362
> 0 3 4 1 0.19s 0.13s 0.01s 67810.629 218846.742
> 0 3 8 1 1.04s 0.21s 0.02s 17548.427 104461.093
>
> After:
> Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
> 0 3 1 1 0.00s 0.09s 0.00s 238813.108 243323.477
> 0 3 2 1 0.00s 0.10s 0.00s 219706.818 354671.772
> 0 3 4 1 0.20s 0.13s 0.00s 64619.728 225528.586
> 0 3 8 1 1.09s 0.22s 0.02s 16644.421 101027.423
>
> A slight performance degradation in most regimes, just 4 processors
> is a bright spot.
Would you have any lockstat output for these locks?
Depending on the contention stats you could try an adaptive spin on the
readers. I doubt adaptive spins on the writer would work out well, with
the natural plenty-ness of readers..
--
* Re: [patch 0/5] [RFC] Conversion of reverse map locks to semaphores
2008-06-27 8:53 ` Peter Zijlstra
@ 2008-06-27 15:46 ` Christoph Lameter
2008-06-27 16:38 ` Peter Zijlstra
0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2008-06-27 15:46 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-mm, apw
On Fri, 27 Jun 2008, Peter Zijlstra wrote:
> > Also it seems that a semaphore helps RT and should avoid busy spinning
> > on systems where these locks experience significant contention.
>
> Please be careful with the wording here. Semaphores are evil esp for RT.
> But luckily you're referring to a sleeping RW lock, which we call
> RW-semaphore (but is not an actual semaphore).
>
> You really scared some people saying this ;-)
Well we use the term semaphore for sleeping locks in the kernel it seems.
Maybe you could get a patch done that renames the struct to
sleeping_rw_lock or so? That would finally clear the air. This is the
second or third time we talk about a semaphore not truly being a
semaphore.
> Depending on the contention stats you could try an adaptive spin on the
> readers. I doubt adaptive spins on the writer would work out well, with
> the natural plenty-ness of readers..
That depends on the frequency of lock taking and the contention. If you
have a rw lock then you would assume that writers are rare so this is
likely okay.
--
* Re: [patch 0/5] [RFC] Conversion of reverse map locks to semaphores
2008-06-27 15:46 ` Christoph Lameter
@ 2008-06-27 16:38 ` Peter Zijlstra
0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2008-06-27 16:38 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, apw, Ingo Molnar, David Howells, Andrew Morton,
Linus Torvalds, Nick Piggin
On Fri, 2008-06-27 at 08:46 -0700, Christoph Lameter wrote:
> On Fri, 27 Jun 2008, Peter Zijlstra wrote:
>
> > > Also it seems that a semaphore helps RT and should avoid busy spinning
> > > on systems where these locks experience significant contention.
> >
> > Please be careful with the wording here. Semaphores are evil esp for RT.
> > But luckily you're referring to a sleeping RW lock, which we call
> > RW-semaphore (but is not an actual semaphore).
> >
> > You really scared some people saying this ;-)
>
> Well we use the term semaphore for sleeping locks in the kernel it seems.
We have an actual mutex implementation, which is oddly enough called a
mutex, not binary-semaphore-with-owner-semantics.
> Maybe you could get a patch done that renames the struct to
> sleeping_rw_lock or so? That would finally clear the air. This is the
> second or third time we talk about a semaphore not truly being a
> semaphore.
Yes indeed. It mainly comes from the fact that some people drop the rw
prefix, creating the impression that they are talking about an actual
semaphore (which we also still have).
About that rename - it's come up before, and while I would not mind
doing such a rename, we've failed to come up with a decent name. I think
people will object to the length of your proposed one.
We could of course go for the oxymoron: rw_mutex, but I think that was
shot down once before.
> > Depending on the contention stats you could try an adaptive spin on the
> > readers. I doubt adaptive spins on the writer would work out well, with
> > the natural plenty-ness of readers..
>
> That depends on the frequency of lock taking and the contention. If you
> have a rw lock then you would assume that writers are rare so this is
> likely okay.
Agreed.
--