* [PATCH 0/2] Fix migration races in rmap_walk() V6
@ 2010-05-06 15:33 Mel Gorman
  2010-05-06 15:33 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
  2010-05-06 15:33 ` [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack Mel Gorman
  0 siblings, 2 replies; 7+ messages in thread
From: Mel Gorman @ 2010-05-06 15:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrea Arcangeli, Rik van Riel,
	Linus Torvalds, Peter Zijlstra, Mel Gorman

Patch 1 of this series is the biggest change. Instead of the trylock+retry
logic, it finds the "root" anon_vma and locks all anon_vmas encountered. As
long as walkers taking multiple locks use the same order, there is no
deadlock. Stress-tests based on compaction have been running a while with
these patches applied without problems.

Changelog since V5
  o Have rmap_walk take anon_vma locks in order starting from the "root"
  o Ensure that mm_take_all_locks locks VMAs in the same order

Changelog since V4
  o Switch back anon_vma locking to put bulk of locking in rmap_walk
  o Fix anon_vma lock ordering in exec vs migration race

Changelog since V3
  o Rediff against the latest upstream tree
  o Improve the patch changelog a little (thanks Peterz)

Changelog since V2
  o Drop fork changes
  o Avoid pages in temporary stacks during exec instead of migration pte
    lazy cleanup
  o Drop locking-related patch and replace with Rik's

Changelog since V1
  o Handle the execve race
  o Be sure that rmap_walk() releases the correct VMA lock
  o Hold the anon_vma lock for the address lookup and the page remap
  o Add reviewed-bys

Broadly speaking, migration works by locking a page, unmapping it, putting
a migration PTE in place that looks like a swap entry, copying the page,
then remapping the page and removing the old migration PTE before unlocking
the page. If a fault occurs in the meantime, the faulting process waits
until migration completes.
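
For reference, the core of that sequence looks roughly like the following.
This is only a condensed sketch of the unmap_and_move()/move_to_new_page()
path in mm/migrate.c around this kernel version; error handling, reference
counting and the file-backed cases are all omitted:

	lock_page(page);			/* faulters block in migration_entry_wait() */
	try_to_unmap(page, TTU_MIGRATION);	/* PTEs become migration (swap-like) entries */
	migrate_page_copy(newpage, page);	/* copy data and page state to the new page */
	remove_migration_ptes(page, newpage);	/* rmap walk: migration entries -> real PTEs */
	unlock_page(page);			/* any waiting faulters now proceed */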

The problem is that there are some races that allow migration PTEs to be
left behind. Migration still completes and the page is unlocked
but later a fault will call migration_entry_to_page() and BUG() because the
page is not locked. It's not possible to just clean up the migration PTE
because the page it points to has been potentially freed and reused. This
series aims to close the races.
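
For reference, the check that blows up is migration_entry_to_page(). From
include/linux/swapops.h in a tree of roughly this vintage, it is approximately:

	static inline struct page *migration_entry_to_page(swp_entry_t entry)
	{
		struct page *p = pfn_to_page(swp_offset(entry));
		/*
		 * Any use of migration entries may only occur while the
		 * corresponding page is locked
		 */
		BUG_ON(!PageLocked(p));
		return p;
	}

A dangling migration PTE points at a page that migration has long since
unlocked (and possibly freed and reused), so the BUG_ON fires.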

Patch 1 of this series is about the locking of anon_vma in migration versus
vma_adjust. While I am not aware of any reproduction cases, it is potentially
racy. This patch is an alternative to Rik's "heavy lock" approach posted
at http://lkml.org/lkml/2010/5/3/155. With the patch, rmap_walk finds the
"root" anon_vma and starts locking from there, locking each new anon_vma
as it finds it. As long as the order is preserved, there is no deadlock.
In vma_adjust, the anon_vma locks are acquired under similar conditions
to 2.6.33 so that walkers will block until VMA changes are complete. The
rmap_walk changes potentially slow down migration and aspects of page
reclaim a little, but these are the less important paths.

Patch 2 of this series addresses the reported swapops bug, which is a race
between migration and execve where pages get migrated from the temporary
stack before it is moved. To avoid migration PTEs being left behind,
a temporary VMA is put in place so that a migration PTE in either the
temporary stack or the relocated stack can be found.

The reproduction case for the races was as follows;

1. Run kernel compilation in a loop
2. Start four processes, each of which creates one mapping. These stress
   different aspects of the problem. The operations they undertake are;
	a) Forks a hundred children, each of which faults the mapping
		Purpose: stress tests migration pte removal
	b) Forks a hundred children, each which punches a hole in the mapping
	   and faults what remains
		Purpose: stress test VMA manipulations during migration
	c) Forks a hundred children, each of which execs and calls echo
		Purpose: stress test the execve race
	d) Size the mapping to be 1.5 times physical memory. Constantly
	   memset it
		Purpose: stress swapping
3. Constantly compact memory using /proc/sys/vm/compact_memory so migration
   is active all the time (a minimal trigger loop is sketched below). In
   theory, you could also force this using sys_move_pages or memory
   hot-remove but it'd be nowhere near as easy to test.
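
The trigger loop referenced in step 3 can be as simple as the following
sketch. It assumes a kernel with the compaction patches applied, so that
/proc/sys/vm/compact_memory exists and any write to it compacts all zones:

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		for (;;) {
			int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);
			if (fd < 0)
				return 1;
			/* any write triggers a full compaction pass */
			if (write(fd, "1", 1) < 0) {
				close(fd);
				return 1;
			}
			close(fd);
			sleep(1);
		}
	}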

Compaction, which is not going to be in 2.6.34, is the easiest way to trigger
these bugs, but in theory the problem also affects memory hot-remove.

There were some concerns with patch 2 that performance would be impacted. To
check if this was the case I ran kernbench, aim9 and sysbench. AIM9 in
particular was of interest as it has an exec microbenchmark.

             kernbench-vanilla    fixraces-v5r1
Elapsed mean     103.40 ( 0.00%)   103.35 ( 0.05%)
Elapsed stddev     0.09 ( 0.00%)     0.13 (-55.72%)
User    mean     313.50 ( 0.00%)   313.15 ( 0.11%)
User    stddev     0.61 ( 0.00%)     0.20 (66.70%)
System  mean      55.50 ( 0.00%)    55.85 (-0.64%)
System  stddev     0.48 ( 0.00%)     0.15 (68.98%)
CPU     mean     356.25 ( 0.00%)   356.50 (-0.07%)
CPU     stddev     0.43 ( 0.00%)     0.50 (-15.47%)

Nothing special there and kernbench is fork+exec heavy. The patched kernel
is slightly faster on wall time but it's well within the noise. System time
is slightly slower but again, it's within the noise.

AIM9
                  aim9-vanilla    fixraces-v5r1
creat-clo     116813.86 ( 0.00%)  117980.34 ( 0.99%)
page_test     270923.33 ( 0.00%)  268668.56 (-0.84%)
brk_test     2551558.07 ( 0.00%) 2649450.00 ( 3.69%)
signal_test   279866.67 ( 0.00%)  279533.33 (-0.12%)
exec_test        226.67 ( 0.00%)     232.67 ( 2.58%)
fork_test       4261.91 ( 0.00%)    4110.98 (-3.67%)
link_test      53534.78 ( 0.00%)   54076.49 ( 1.00%)

So, here exec and fork aren't showing up major worries. exec is faster but
these tests can be so sensitive to starting conditions that I tend not to
read much into them unless there are major differences.

SYSBENCH
              sysbench-vanilla    fixraces-v5r1
           1 14177.73 ( 0.00%) 14218.41 ( 0.29%)
           2 27647.23 ( 0.00%) 27774.14 ( 0.46%)
           3 31395.69 ( 0.00%) 31499.95 ( 0.33%)
           4 49866.54 ( 0.00%) 49713.49 (-0.31%)
           5 49919.58 ( 0.00%) 49524.21 (-0.80%)
           6 49532.97 ( 0.00%) 49397.60 (-0.27%)
           7 49465.79 ( 0.00%) 49384.14 (-0.17%)
           8 49483.33 ( 0.00%) 49186.49 (-0.60%)

These figures also show no differences worth talking about.

While the extra allocation in patch 2 would appear to slow down exec somewhat,
it's not by any amount that matters. As it is in exec, it means that anon_vmas
have likely been freed very recently so the allocation will be cache-hot and
cpu-local. It is possible to special-case migration to avoid migrating pages
in the temporary stack, but fixing it in exec is a more maintainable approach.

 fs/exec.c            |   37 ++++++++++++++++++++--
 include/linux/rmap.h |    2 +
 mm/ksm.c             |   20 ++++++++++--
 mm/mmap.c            |   14 +++++++-
 mm/rmap.c            |   81 +++++++++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 137 insertions(+), 17 deletions(-)


* [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
  2010-05-06 15:33 [PATCH 0/2] Fix migration races in rmap_walk() V6 Mel Gorman
@ 2010-05-06 15:33 ` Mel Gorman
  2010-05-06 15:44   ` Rik van Riel
  2010-05-06 15:59   ` Linus Torvalds
  2010-05-06 15:33 ` [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack Mel Gorman
  1 sibling, 2 replies; 7+ messages in thread
From: Mel Gorman @ 2010-05-06 15:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrea Arcangeli, Rik van Riel,
	Linus Torvalds, Peter Zijlstra, Mel Gorman

vma_adjust() updates anon VMA information without taking locks, unlike what
happened in 2.6.33; in contrast, file-backed mappings use the i_mmap_lock.
This lack of locking can result in races with users of rmap_walk such as
page migration: vma_address() can return -EFAULT for an
address that will soon be valid.  For migration, this potentially leaves
a dangling migration PTE behind which can later cause a BUG_ON to trigger
when the page is faulted in.

With the recent anon_vma changes, there can be more than one anon_vma->lock
to take when walking a list of anon_vma_chains. As rmap_walk begins with
the anon_vma a page is associated with, the order of anon_vmas cannot be
guaranteed and multiple locks cannot be taken without potentially deadlocking.

To resolve this problem, this patch has rmap_walk walk the anon_vma_chain
list but always starting from the "root" anon_vma which is the oldest
anon_vma in the list. It starts by locking the anon_vma lock associated
with a page. It then finds the "root" anon_vma using the anon_vma_chains
"same_vma" list as it is strictly ordered. The root anon_vma lock is taken
and rmap_walk traverses the list. This allows multiple locks to be taken
as the list is always traversed in the same direction.

As spotted by Rik, to avoid any deadlocks versus mmu_notify, the order in
which anon_vmas are locked by mm_take_all_locks is reversed by this patch so that
both rmap_walk and mm_take_all_locks lock anon_vmas in the order of
old to new.

For vma_adjust(), the locking behaviour prior to the anon_vma changes is
restored so that rmap_walk() can be sure of the integrity of the VMA
information and lists when the anon_vma lock is held. With this patch, the
vma->anon_vma->lock is taken if

	a) there is any overlap with the next VMA due to the adjustment
	b) a new VMA is being inserted into the address space
	c) the start of the VMA is being changed so that the
	   relationship between vm_start and vm_pgoff is preserved
	   for vma_address()

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/rmap.h |    2 +
 mm/ksm.c             |   20 ++++++++++--
 mm/mmap.c            |   14 +++++++-
 mm/rmap.c            |   81 +++++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 104 insertions(+), 13 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7721674..6d4d5f7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -121,6 +121,8 @@ int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
+struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma);
+struct anon_vma *page_anon_vma_lock_root(struct page *page);
 void __anon_vma_link(struct vm_area_struct *);
 void anon_vma_free(struct anon_vma *);
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..d16b459 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1668,15 +1668,24 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
 again:
 	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
 		struct anon_vma *anon_vma = rmap_item->anon_vma;
+		struct anon_vma *locked_vma;
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		spin_lock(&anon_vma->lock);
+		anon_vma = anon_vma_lock_root(anon_vma);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
+
+			locked_vma = NULL;
+			if (anon_vma != vma->anon_vma) {
+				locked_vma = vma->anon_vma;
+				spin_lock_nested(&locked_vma->lock, SINGLE_DEPTH_NESTING);
+			}
+
 			if (rmap_item->address < vma->vm_start ||
 			    rmap_item->address >= vma->vm_end)
-				continue;
+				goto next_vma;
+
 			/*
 			 * Initially we examine only the vma which covers this
 			 * rmap_item; but later, if there is still work to do,
@@ -1684,9 +1693,14 @@ again:
 			 * were forked from the original since ksmd passed.
 			 */
 			if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
-				continue;
+				goto next_vma;
 
 			ret = rmap_one(page, vma, rmap_item->address, arg);
+
+next_vma:
+			if (locked_vma)
+				spin_unlock(&locked_vma->lock);
+
 			if (ret != SWAP_AGAIN) {
 				spin_unlock(&anon_vma->lock);
 				goto out;
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..b447d5b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -505,6 +505,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	struct vm_area_struct *next = vma->vm_next;
 	struct vm_area_struct *importer = NULL;
 	struct address_space *mapping = NULL;
+	struct anon_vma *anon_vma = NULL;
 	struct prio_tree_root *root = NULL;
 	struct file *file = vma->vm_file;
 	long adjust_next = 0;
@@ -578,6 +579,11 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	if (vma->anon_vma && (insert || importer || start != vma->vm_start)) {
+		anon_vma = vma->anon_vma;
+		spin_lock(&anon_vma->lock);
+	}
+
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
 		vma_prio_tree_remove(vma, root);
@@ -620,6 +626,9 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
+	if (anon_vma)
+		spin_unlock(&anon_vma->lock);
+
 	if (remove_next) {
 		if (file) {
 			fput(file);
@@ -2556,8 +2565,9 @@ int mm_take_all_locks(struct mm_struct *mm)
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (signal_pending(current))
 			goto out_unlock;
+		/* Lock the anon_vmas in the same order rmap_walk would */
 		if (vma->anon_vma)
-			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+			list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma)
 				vm_lock_anon_vma(mm, avc->anon_vma);
 	}
 
@@ -2620,7 +2630,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
 
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (vma->anon_vma)
-			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+			list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma)
 				vm_unlock_anon_vma(avc->anon_vma);
 		if (vma->vm_file && vma->vm_file->f_mapping)
 			vm_unlock_mapping(vma->vm_file->f_mapping);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..b511710 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -236,6 +236,62 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	return -ENOMEM;
 }
 
+/*
+ * Given an anon_vma, find the root of the chain, lock it and return it.
+ * This must be called with the rcu_read_lock held
+ */
+struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma)
+{
+	struct anon_vma *root_anon_vma;
+	struct anon_vma_chain *avc, *root_avc;
+	struct vm_area_struct *vma;
+
+	/* Lock the same_anon_vma list and make sure we are on a chain */
+	spin_lock(&anon_vma->lock);
+	if (list_empty(&anon_vma->head)) {
+		spin_unlock(&anon_vma->lock);
+		return NULL;
+	}
+
+	/*
+	 * Get the root anon_vma on the list by depending on the ordering
+	 * of the same_vma list setup by __page_set_anon_rmap. Basically
+	 * we are doing
+	 *
+	 * local anon_vma -> local vma -> root vma -> root anon_vma
+	 */
+	avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
+	vma = avc->vma;
+	root_avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+	root_anon_vma = root_avc->anon_vma;
+
+	/* Get the lock of the root anon_vma */
+	if (anon_vma != root_anon_vma) {
+		VM_BUG_ON(!rcu_read_lock_held());
+		spin_unlock(&anon_vma->lock);
+		spin_lock(&root_anon_vma->lock);
+	}
+
+	return root_anon_vma;
+}
+
+/*
+ * From the anon_vma associated with this page, find and lock the
+ * deepest anon_vma on the list. This allows multiple anon_vma locks
+ * to be taken by guaranteeing the locks are taken in the same order
+ */
+struct anon_vma *page_anon_vma_lock_root(struct page *page)
+{
+	struct anon_vma *anon_vma;
+
+	/* Get the local anon_vma */
+	anon_vma = page_anon_vma(page);
+	if (!anon_vma)
+		return NULL;
+
+	return anon_vma_lock_root(anon_vma);
+}
+
 static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
 {
 	struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
@@ -1358,7 +1414,7 @@ int try_to_munlock(struct page *page)
 static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 		struct vm_area_struct *, unsigned long, void *), void *arg)
 {
-	struct anon_vma *anon_vma;
+	struct anon_vma *anon_vma, *locked_vma;
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
@@ -1368,16 +1424,25 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	 * are holding mmap_sem. Users without mmap_sem are required to
 	 * take a reference count to prevent the anon_vma disappearing
 	 */
-	anon_vma = page_anon_vma(page);
+	anon_vma = page_anon_vma_lock_root(page);
 	if (!anon_vma)
 		return ret;
-	spin_lock(&anon_vma->lock);
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
 		struct vm_area_struct *vma = avc->vma;
-		unsigned long address = vma_address(page, vma);
-		if (address == -EFAULT)
-			continue;
-		ret = rmap_one(page, vma, address, arg);
+		unsigned long address;
+
+		locked_vma = NULL;
+		if (anon_vma != vma->anon_vma) {
+			locked_vma = vma->anon_vma;
+			spin_lock_nested(&locked_vma->lock, SINGLE_DEPTH_NESTING);
+		}
+		address = vma_address(page, vma);
+		if (address != -EFAULT)
+			ret = rmap_one(page, vma, address, arg);
+
+		if (locked_vma)
+			spin_unlock(&locked_vma->lock);
+
 		if (ret != SWAP_AGAIN)
 			break;
 	}
-- 
1.6.5


* [PATCH 2/2] mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
  2010-05-06 15:33 [PATCH 0/2] Fix migration races in rmap_walk() V6 Mel Gorman
  2010-05-06 15:33 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
@ 2010-05-06 15:33 ` Mel Gorman
  1 sibling, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2010-05-06 15:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrea Arcangeli, Rik van Riel,
	Linus Torvalds, Peter Zijlstra, Mel Gorman

From: Andrea Arcangeli <aarcange@redhat.com>

Page migration requires rmap to be able to find all migration ptes
created by migration. If the second rmap_walk clearing migration PTEs
misses an entry, it is left dangling causing a BUG_ON to trigger during
fault. For example;

[  511.201534] kernel BUG at include/linux/swapops.h:105!
[  511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[  511.201534] last sysfs file: /sys/block/sde/size
[  511.201534] CPU 0
[  511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[  511.888526]
[  511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[  511.888526] RIP: 0010:[<ffffffff811094ff>]  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[  512.173545] RSP: 0018:ffff880037b979d8  EFLAGS: 00010246
[  512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[  512.329617] RDX: 0000000001a2ba10 RSI: ffffffff818264b8 RDI: 000000000ef45c3e
[  512.380001] RBP: ffff880037b97a08 R08: ffff880078003f00 R09: ffff880037b979e8
[  512.380001] R10: ffffffff8114ddaa R11: 0000000000000246 R12: 0000000037304000
[  512.380001] R13: ffff88007a9ed5c8 R14: f800000000077a2e R15: 000000000ef45c3e
[  512.380001] FS:  00007f3d346866e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[  512.380001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  512.380001] CR2: 00007fff6abec9c1 CR3: 0000000037a15000 CR4: 00000000000006f0
[  512.380001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  513.004775] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  513.068415] Process date (pid: 20431, threadinfo ffff880037b96000, task ffff880078003f00)
[  513.068415] Stack:
[  513.068415]  ffff880037b97b98 ffff880037b97a18 ffff880037b97be8 0000000000000c00
[  513.228068] <0> ffff880037304f60 00007fff6abec9c1 ffff880037b97aa8 ffffffff810e951a
[  513.228068] <0> ffff880037b97a88 0000000000000246 0000000000000000 ffffffff8130c5c2
[  513.228068] Call Trace:
[  513.228068]  [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
[  513.228068]  [<ffffffff8130c5c2>] ? do_page_fault+0x26a/0x46e
[  513.228068]  [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
[  513.720755]  [<ffffffff8130875d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  513.789278]  [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[  513.851506]  [<ffffffff813099b5>] page_fault+0x25/0x30
[  513.851506]  [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[  513.851506]  [<ffffffff811c1e27>] ? strnlen_user+0x3f/0x57
[  513.851506]  [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
[  513.851506]  [<ffffffff8111329b>] search_binary_handler+0x173/0x313
[  513.851506]  [<ffffffff8114c909>] ? load_elf_binary+0x0/0x192b
[  513.851506]  [<ffffffff81114896>] do_execve+0x219/0x30a
[  513.851506]  [<ffffffff8111887f>] ? getname+0x14d/0x1b3
[  513.851506]  [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
[  514.483501]  [<ffffffff8100320a>] stub_execve+0x6a/0xc0
[  514.548357] Code: 74 05 83 f8 1f 75 68 48 b8 ff ff ff ff ff ff ff 07 48 21 c2 48 b8 00 00 00 00 00 ea ff ff 48 6b d2 38 48 8d 1c 02 f6 03 01 75 04 <0f> 0b eb fe 8b 4b 08 48 8d 73 08 85 c9 74 35 8d 41 01 89 4d e0
[  514.704292] RIP  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[  514.808221]  RSP <ffff880037b979d8>
[  514.906179] ---[ end trace 4f88495edc224d6b ]---

This particular BUG_ON is caused by a race between shift_arg_pages and
migration. During exec, a temporary stack is created and later moved to
its final location. If migration selects a page within the temporary stack,
the page tables and migration PTE can be copied to the new location before
rmap_walk is able to find the copy. This leaves a dangling migration PTE
behind that later triggers the bug.

This patch fixes the problem by using two VMAs - one which covers the
temporary stack and the other which covers the new location. This guarantees
that rmap can always find the migration PTE even if it is copied while
rmap_walk is taking place.

[mel@csn.ul.ie: Tested and rewrote changelog]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 fs/exec.c |   37 +++++++++++++++++++++++++++++++++----
 1 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 725d7ef..fd0abff 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -503,7 +504,9 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
+	unsigned long moved_length;
 	struct mmu_gather *tlb;
+	struct vm_area_struct *tmp_vma;
 
 	BUG_ON(new_start > new_end);
 
@@ -515,17 +518,43 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 		return -EFAULT;
 
 	/*
+	 * We need to create a fake temporary vma and index it in the
+	 * anon_vma list in order to allow the pages to be reachable
+	 * at all times by the rmap walk for migrate, while
+	 * move_page_tables() is running.
+	 */
+	tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	if (!tmp_vma)
+		return -ENOMEM;
+	*tmp_vma = *vma;
+	INIT_LIST_HEAD(&tmp_vma->anon_vma_chain);
+	/*
 	 * cover the whole range: [new_start, old_end)
 	 */
-	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
+	tmp_vma->vm_start = new_start;
+	/*
+	 * The tmp_vma destination of the copy (with the new vm_start)
+	 * has to be at the end of the anon_vma list for the rmap_walk
+	 * to find the moved pages at all times.
+	 */
+	if (unlikely(anon_vma_clone(tmp_vma, vma))) {
+		kmem_cache_free(vm_area_cachep, tmp_vma);
 		return -ENOMEM;
+	}
 
 	/*
 	 * move the page tables downwards, on failure we rely on
 	 * process cleanup to remove whatever mess we made.
 	 */
-	if (length != move_page_tables(vma, old_start,
-				       vma, new_start, length))
+	moved_length = move_page_tables(vma, old_start,
+					vma, new_start, length);
+
+	vma->vm_start = new_start;
+	/* rmap walk will already find all pages using the new_start */
+	unlink_anon_vmas(tmp_vma);
+	kmem_cache_free(vm_area_cachep, tmp_vma);
+
+	if (length != moved_length)
 		return -ENOMEM;
 
 	lru_add_drain();
@@ -551,7 +580,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	/*
 	 * Shrink the vma to just the new range.  Always succeeds.
 	 */
-	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+	vma->vm_end = new_end;
 
 	return 0;
 }
-- 
1.6.5


* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
  2010-05-06 15:33 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
@ 2010-05-06 15:44   ` Rik van Riel
  2010-05-06 15:51     ` Mel Gorman
  2010-05-06 15:59   ` Linus Torvalds
  1 sibling, 1 reply; 7+ messages in thread
From: Rik van Riel @ 2010-05-06 15:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrea Arcangeli, Linus Torvalds,
	Peter Zijlstra

On 05/06/2010 11:33 AM, Mel Gorman wrote:

> @@ -1368,16 +1424,25 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
>   	 * are holding mmap_sem. Users without mmap_sem are required to
>   	 * take a reference count to prevent the anon_vma disappearing
>   	 */
> -	anon_vma = page_anon_vma(page);
> +	anon_vma = page_anon_vma_lock_root(page);
>   	if (!anon_vma)
>   		return ret;
> -	spin_lock(&anon_vma->lock);
>   	list_for_each_entry(avc,&anon_vma->head, same_anon_vma) {

One conceptual problem here.  By taking the oldest anon_vma,
instead of the anon_vma of the page, we may end up searching
way too many processes.

Eg. if the page is the page of a child process in a forking
server workload, the above code will end up searching the
parent and all of the siblings - even for a private page, in
the child process's private anon_vma.

For an Apache or Oracle system with 1000 clients (and child
processes), that could be quite a drag - searching 1000 times
as many processes as we should.


* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
  2010-05-06 15:44   ` Rik van Riel
@ 2010-05-06 15:51     ` Mel Gorman
  0 siblings, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2010-05-06 15:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrea Arcangeli, Linus Torvalds,
	Peter Zijlstra

On Thu, May 06, 2010 at 11:44:57AM -0400, Rik van Riel wrote:
> On 05/06/2010 11:33 AM, Mel Gorman wrote:
>
>> @@ -1368,16 +1424,25 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
>>   	 * are holding mmap_sem. Users without mmap_sem are required to
>>   	 * take a reference count to prevent the anon_vma disappearing
>>   	 */
>> -	anon_vma = page_anon_vma(page);
>> +	anon_vma = page_anon_vma_lock_root(page);
>>   	if (!anon_vma)
>>   		return ret;
>> -	spin_lock(&anon_vma->lock);
>>   	list_for_each_entry(avc,&anon_vma->head, same_anon_vma) {
>
> One conceptual problem here.  By taking the oldest anon_vma,
> instead of the anon_vma of the page, we may end up searching
> way too many processes.
>
> Eg. if the page is the page of a child process in a forking
> server workload, the above code will end up searching the
> parent and all of the siblings - even for a private page, in
> the child process's private anon_vma.
>
> For an Apache or Oracle system with 1000 clients (and child
> processes), that could be quite a drag - searching 1000 times
> as many processes as we should.
>

That does indeed suck. If we were always locking the root anon_vma, we'd get
away with it but that would involve introducing RCU into the munmap/mmap/etc
path. Is there any way around this problem or will migration just have to
take it on the chin until anon_vma is reference counted and we can
cheaply lock the root anon_vma?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
  2010-05-06 15:33 ` [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information Mel Gorman
  2010-05-06 15:44   ` Rik van Riel
@ 2010-05-06 15:59   ` Linus Torvalds
  2010-05-06 17:07     ` Mel Gorman
  1 sibling, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2010-05-06 15:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrea Arcangeli, Rik van Riel,
	Peter Zijlstra



On Thu, 6 May 2010, Mel Gorman wrote:
> +		anon_vma = anon_vma_lock_root(anon_vma);
>  		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
>  			vma = vmac->vma;
> +
> +			locked_vma = NULL;
> +			if (anon_vma != vma->anon_vma) {
> +				locked_vma = vma->anon_vma;
> +				spin_lock_nested(&locked_vma->lock, SINGLE_DEPTH_NESTING);
> +			}
> +
>  			if (rmap_item->address < vma->vm_start ||
>  			    rmap_item->address >= vma->vm_end)
> +				goto next_vma;
> +
>  			/*
>  			 * Initially we examine only the vma which covers this
>  			 * rmap_item; but later, if there is still work to do,
> @@ -1684,9 +1693,14 @@ again:
>  			 * were forked from the original since ksmd passed.
>  			 */
>  			if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
> +				goto next_vma;
>  
>  			ret = rmap_one(page, vma, rmap_item->address, arg);
> +
> +next_vma:
> +			if (locked_vma)
> +				spin_unlock(&locked_vma->lock);
> +
>  			if (ret != SWAP_AGAIN) {
>  				spin_unlock(&anon_vma->lock);
>  				goto out;

[ Removed '-' lines to show the actual end result ]

That loop is f*cked up.

In the "goto next_vma" case, it will then test the 'ret' from the 
_previous_ iteration after having unlocked the anon_vma. Which may not 
even exist, if this is the first one.

Yes, yes, 'ret' is initialized to SWAP_AGAIN, so it will work, but it's 
still screwed up. It's just _waiting_ for bugs to be introduced.

Just make the "goto out" case unlock things properly. Have a real exclusive 
error return case that does

		/* normal return */
		return SWAP_AGAIN;

	out:
		if (locked_anon_vma)
			spin_unlock(&locked_anon_vma->lock);
		spin_unlock(&anon_vma->lock);
		return ret;

rather than that horrible crud in the loop itself.

Also, wouldn't it be nicer to make the whole "locked_vma" be something you 
do at the head of the loop, so that you can use "continue" instead of 
"goto next_vma". And then you can do it like this:

	locked_anon_vma = lock_nested_anon_vma(locked_anon_vma, vma->anon_vma, anon_vma);

where we have

   static struct anon_vma *lock_nested_anon_vma(struct anon_vma *prev,
	 struct anon_vma *next, struct anon_vma *root)
   {
	if (prev)
		spin_unlock(&prev->lock);
	if (next == root)
		return NULL;
	spin_lock_nested(&next->lock, SINGLE_DEPTH_NESTING);
	return next;
   }

isn't that _much_ nicer? You get to split the locking off into a function 
of its own, and you unlock the old one before you (potentially) lock the 
new one, _and_ you can just use "continue" to go to the next iteration.

Yes, yes, it means that after the loop you have to unlock that 
'locked_anon_vma', but you have to do that for the early exit case 
_anyway_, so that won't look all that odd. It will certainly look less odd 
than using a status variable from the previous iteration and depending on 
it having a special value.

		Linus


* Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
  2010-05-06 15:59   ` Linus Torvalds
@ 2010-05-06 17:07     ` Mel Gorman
  0 siblings, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2010-05-06 17:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Linux-MM, LKML, Minchan Kim, KAMEZAWA Hiroyuki,
	Christoph Lameter, Andrea Arcangeli, Rik van Riel,
	Peter Zijlstra

On Thu, May 06, 2010 at 08:59:52AM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 6 May 2010, Mel Gorman wrote:
> > +		anon_vma = anon_vma_lock_root(anon_vma);
> >  		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
> >  			vma = vmac->vma;
> > +
> > +			locked_vma = NULL;
> > +			if (anon_vma != vma->anon_vma) {
> > +				locked_vma = vma->anon_vma;
> > +				spin_lock_nested(&locked_vma->lock, SINGLE_DEPTH_NESTING);
> > +			}
> > +
> >  			if (rmap_item->address < vma->vm_start ||
> >  			    rmap_item->address >= vma->vm_end)
> > +				goto next_vma;
> > +
> >  			/*
> >  			 * Initially we examine only the vma which covers this
> >  			 * rmap_item; but later, if there is still work to do,
> > @@ -1684,9 +1693,14 @@ again:
> >  			 * were forked from the original since ksmd passed.
> >  			 */
> >  			if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
> > +				goto next_vma;
> >  
> >  			ret = rmap_one(page, vma, rmap_item->address, arg);
> > +
> > +next_vma:
> > +			if (locked_vma)
> > +				spin_unlock(&locked_vma->lock);
> > +
> >  			if (ret != SWAP_AGAIN) {
> >  				spin_unlock(&anon_vma->lock);
> >  				goto out;
> 
> [ Removed '-' lines to show the actual end result ]
> 
> That loop is f*cked up.
> 
> In the "goto next_vma" case, it will then test the 'ret' from the 
> _previous_ iteration after having unlocked the anon_vma. Which may not 
> even exist, if this is the first one.
> 
> Yes, yes, 'ret' is initialized to SWAP_AGAIN, so it will work, but it's 
> still screwed up.

Yes, it works but ...

> It's just _waiting_ for bugs to be introduced.
> 

This is true too. 

> Just make the "goto out" case unlock things properly. Have a real exclusive 
> error return case that does
> 
> 		/* normal return */
> 		return SWAP_AGAIN;
> 
> 	out:
> 		if (locked_anon_vma)
> 			spin_unlock(&locked_anon_vma->lock);
> 		spin_unlock(&anon_vma->lock);
> 		return ret;
> 
> rather than that horrible crud in the loop itself.
> 
> Also, wouldn't it be nicer to make the whole "locked_vma" be something you 
> do at the head of the loop, so that you can use "continue" instead of 
> "goto next_vma". And then you can do it like this:
> 
> 	locked_anon_vma = lock_nested_anon_vma(locked_anon_vma, vma->anon_vma, anon_vma);
> 

It obscures the unlocking slightly but it does look neater in the main
functions that call lock_nested_anon_vma. I considered for a while whether
there was some macro magic that could be applied but it would be a
delicate obscenity at best.

> where we have
> 
>    static struct anon_vma *lock_nested_anon_vma(struct anon_vma *prev,
> 	 struct anon_vma *next, struct anon_vma *root)
>    {
> 	if (prev)
> 		spin_unlock(&prev->lock);
> 	if (next == root)
> 		return NULL;
> 	spin_lock_nested(&next->lock, SINGLE_DEPTH_NESTING);
> 	return next;
>    }
> 
> isn't that _much_ nicer? You get to split the locking off into a function 
> of its own, and you unlock the old one before you (potentially) lock the 
> new one, _and_ you can just use "continue" to go to the next iteration.
> 
> Yes, yes, it means that after the loop you have to unlock that 
> 'locked_anon_vma', but you have to do that for the early exit case 
> _anyway_, so that won't look all that odd. It will certainly look less odd 
> than using a status variable from the previous iteration and depending on 
> it having a special value.
> 

Can't argue with the logic and it does look a lot neater. This is what the
revised version looks like with that scheme. I changed the name of the locking
function slightly to be similar to the other anon_vma functions but that's
about it.  rmap_walk_ksm still looks somewhat tortured but rmap_walk_anon
is much neater.

==== CUT HERE ====
From: Mel Gorman <mel@csn.ul.ie>
Subject: [PATCH] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information

vma_adjust() is updating anon VMA information without locks being taken.
In contrast, file-backed mappings use the i_mmap_lock and this lack of
locking can result in races with users of rmap_walk such as page migration.
vma_address() can return -EFAULT for an address that will soon be valid.
For migration, this potentially leaves a dangling migration PTE behind
which can later cause a BUG_ON to trigger when the page is faulted in.

With the recent anon_vma changes, there can be more than one anon_vma->lock
to take when walking a list of anon_vma_chains but as the order of anon_vmas
cannot be guaranteed, rmap_walk cannot take multiple locks without
potentially deadlocking.

To resolve this problem, this patch has rmap_walk walk the anon_vma_chain
list but always starting from the "root" anon_vma which is the oldest
anon_vma in the list. It starts by locking the anon_vma lock associated
with a page. It then finds the "root" anon_vma using the anon_vma_chains
"same_vma" list as it is strictly ordered. The root anon_vma lock is taken
and rmap_walk traverses the list. This allows multiple locks to be taken
as the list is always traversed in the same direction.

As spotted by Rik, to avoid any deadlocks versus mmu_notify, the order in
which anon_vmas are locked by mm_take_all_locks is reversed by this patch so that
both rmap_walk and mm_take_all_locks lock anon_vmas in the order of old->new.

For vma_adjust(), the locking behaviour prior to the anon_vma changes is
restored so that rmap_walk() can be sure of the integrity of the VMA
information and lists when the anon_vma lock is held. With this patch, the
vma->anon_vma->lock is taken if

	a) there is any overlap with the next VMA due to the adjustment
	b) a new VMA is being inserted into the address space
	c) the start of the VMA is being changed so that the
	   relationship between vm_start and vm_pgoff is preserved
	   for vma_address()

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/rmap.h |    4 ++
 mm/ksm.c             |   13 ++++++-
 mm/mmap.c            |   14 ++++++-
 mm/rmap.c            |   97 ++++++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 118 insertions(+), 10 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7721674..1dc949f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -121,6 +121,10 @@ int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
+struct anon_vma *anon_vma_lock_nested(struct anon_vma *prev,
+			struct anon_vma *next, struct anon_vma *root);
+struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma);
+struct anon_vma *page_anon_vma_lock_root(struct page *page);
 void __anon_vma_link(struct vm_area_struct *);
 void anon_vma_free(struct anon_vma *);
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..1db8656 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1655,6 +1655,7 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
 {
 	struct stable_node *stable_node;
 	struct hlist_node *hlist;
+	struct anon_vma *nested_anon_vma = NULL;
 	struct rmap_item *rmap_item;
 	int ret = SWAP_AGAIN;
 	int search_new_forks = 0;
@@ -1671,9 +1672,16 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		spin_lock(&anon_vma->lock);
+		anon_vma = anon_vma_lock_root(anon_vma);
+		if (nested_anon_vma) {
+			spin_unlock(&nested_anon_vma->lock);
+			nested_anon_vma = NULL;
+		}
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
+			nested_anon_vma = anon_vma_lock_nested(nested_anon_vma,
+						vma->anon_vma, anon_vma);
+
 			if (rmap_item->address < vma->vm_start ||
 			    rmap_item->address >= vma->vm_end)
 				continue;
@@ -1697,6 +1705,9 @@ again:
 	if (!search_new_forks++)
 		goto again;
 out:
+	if (nested_anon_vma)
+		spin_unlock(&nested_anon_vma->lock);
+
 	return ret;
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..b447d5b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -505,6 +505,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	struct vm_area_struct *next = vma->vm_next;
 	struct vm_area_struct *importer = NULL;
 	struct address_space *mapping = NULL;
+	struct anon_vma *anon_vma = NULL;
 	struct prio_tree_root *root = NULL;
 	struct file *file = vma->vm_file;
 	long adjust_next = 0;
@@ -578,6 +579,11 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	if (vma->anon_vma && (insert || importer || start != vma->vm_start)) {
+		anon_vma = vma->anon_vma;
+		spin_lock(&anon_vma->lock);
+	}
+
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
 		vma_prio_tree_remove(vma, root);
@@ -620,6 +626,9 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
+	if (anon_vma)
+		spin_unlock(&anon_vma->lock);
+
 	if (remove_next) {
 		if (file) {
 			fput(file);
@@ -2556,8 +2565,9 @@ int mm_take_all_locks(struct mm_struct *mm)
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (signal_pending(current))
 			goto out_unlock;
+		/* Lock the anon_vmas in the same order rmap_walk would */
 		if (vma->anon_vma)
-			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+			list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma)
 				vm_lock_anon_vma(mm, avc->anon_vma);
 	}
 
@@ -2620,7 +2630,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
 
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (vma->anon_vma)
-			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+			list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma)
 				vm_unlock_anon_vma(avc->anon_vma);
 		if (vma->vm_file && vma->vm_file->f_mapping)
 			vm_unlock_mapping(vma->vm_file->f_mapping);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..2e65a75 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -236,6 +236,81 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	return -ENOMEM;
 }
 
+/*
+ * When walking an anon_vma_chain and locking each anon_vma encountered,
+ * this function is responsible for checking whether the next anon_vma is
+ * the same as the root, locking it if not and releasing the previous lock
+ * if necessary.
+ *
+ * It is assumed the caller has locked the root anon_vma
+ */
+struct anon_vma *anon_vma_lock_nested(struct anon_vma *prev,
+			struct anon_vma *next, struct anon_vma *root)
+{
+	if (prev)
+		spin_unlock(&prev->lock);
+	if (next == root)
+		return NULL;
+	spin_lock_nested(&next->lock, SINGLE_DEPTH_NESTING);
+	return next;
+}
+
+/*
+ * Given an anon_vma, find the root of the chain, lock it and return the
+ * root. This must be called with the rcu_read_lock held
+ */
+struct anon_vma *anon_vma_lock_root(struct anon_vma *anon_vma)
+{
+	struct anon_vma *root_anon_vma;
+	struct anon_vma_chain *avc, *root_avc;
+	struct vm_area_struct *vma;
+
+	/* Lock the same_anon_vma list and make sure we are on a chain */
+	spin_lock(&anon_vma->lock);
+	if (list_empty(&anon_vma->head)) {
+		spin_unlock(&anon_vma->lock);
+		return NULL;
+	}
+
+	/*
+	 * Get the root anon_vma on the list by depending on the ordering
+	 * of the same_vma list setup by __page_set_anon_rmap. Basically
+	 * we are doing
+	 *
+	 * local anon_vma -> local vma -> root vma -> root anon_vma
+	 */
+	avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
+	vma = avc->vma;
+	root_avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+	root_anon_vma = root_avc->anon_vma;
+
+	/* Get the lock of the root anon_vma */
+	if (anon_vma != root_anon_vma) {
+		VM_BUG_ON(!rcu_read_lock_held());
+		spin_unlock(&anon_vma->lock);
+		spin_lock(&root_anon_vma->lock);
+	}
+
+	return root_anon_vma;
+}
+
+/*
+ * From the anon_vma associated with this page, find and lock the
+ * deepest anon_vma on the list. This allows multiple anon_vma locks
+ * to be taken by guaranteeing the locks are taken in the same order
+ */
+struct anon_vma *page_anon_vma_lock_root(struct page *page)
+{
+	struct anon_vma *anon_vma;
+
+	/* Get the local anon_vma */
+	anon_vma = page_anon_vma(page);
+	if (!anon_vma)
+		return NULL;
+
+	return anon_vma_lock_root(anon_vma);
+}
+
 static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
 {
 	struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
@@ -326,7 +401,7 @@ void page_unlock_anon_vma(struct anon_vma *anon_vma)
  * Returns virtual address or -EFAULT if page's index/offset is not
  * within the range mapped the @vma.
  */
-static inline unsigned long
+static noinline unsigned long
 vma_address(struct page *page, struct vm_area_struct *vma)
 {
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1359,6 +1434,7 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 		struct vm_area_struct *, unsigned long, void *), void *arg)
 {
 	struct anon_vma *anon_vma;
+	struct anon_vma *nested_anon_vma = NULL;
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
@@ -1368,19 +1444,26 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	 * are holding mmap_sem. Users without mmap_sem are required to
 	 * take a reference count to prevent the anon_vma disappearing
 	 */
-	anon_vma = page_anon_vma(page);
+	anon_vma = page_anon_vma_lock_root(page);
 	if (!anon_vma)
 		return ret;
-	spin_lock(&anon_vma->lock);
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
 		struct vm_area_struct *vma = avc->vma;
-		unsigned long address = vma_address(page, vma);
-		if (address == -EFAULT)
-			continue;
-		ret = rmap_one(page, vma, address, arg);
+		unsigned long address;
+
+		nested_anon_vma = anon_vma_lock_nested(nested_anon_vma,
+						vma->anon_vma, anon_vma);
+		address = vma_address(page, vma);
+		if (address != -EFAULT)
+			ret = rmap_one(page, vma, address, arg);
+
 		if (ret != SWAP_AGAIN)
 			break;
 	}
+
+	if (nested_anon_vma)
+		spin_unlock(&nested_anon_vma->lock);
+
 	spin_unlock(&anon_vma->lock);
 	return ret;
 }
