linux-mm.kvack.org archive mirror
* [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
@ 1999-10-15  0:06 Kanoj Sarcar
  1999-10-15 11:58 ` Manfred Spraul
  1999-10-18 19:45 ` Kanoj Sarcar
  0 siblings, 2 replies; 15+ messages in thread
From: Kanoj Sarcar @ 1999-10-15  0:06 UTC (permalink / raw)
  To: torvalds, sct, manfreds, andrea, viro; +Cc: linux-mm, linux-kernel

Linus,

This 2.3 patch fixes the race between the page stealing code doing vma
list traversal and other pieces of code adding/deleting elements
to the list, or otherwise changing fields in the list elements in
ways that might confuse the stealing code. Let me know if things seem
fine, or whether you want me to alter some code.

There will probably be a second, independent part to this patch, 
where vma list scanners (callers of find_vma(), for example) other
than the page stealing code are fixed to grab mmap_sem. Some of
this has already been pointed out by Manfred Spraul.
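
For such callers the fix is simply to bracket the lookup and every use
of the returned vma with mmap_sem, roughly (sketch only):

	down(&mm->mmap_sem);
	vma = find_vma(mm, addr);
	if (vma) {
		/* ... use vma; it cannot change or disappear here ... */
	}
	up(&mm->mmap_sem);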

We should also probably spawn an independent discussion thread
about the driver swapout() method parameter passing, invoked from
try_to_swap_out(). swapout() currently takes the vma as an input,
but the vma might be getting deleted (the documentation which is
part of the patch describes currently how things are protected),
so it might be prudent to pass individual fields of the vma to the
swapout() method, rather than a pointer to the structure. 
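
To make that concrete: today the method is handed the vma itself, as
invoked from try_to_swap_out(); the alternative would pass only
snapshotted fields. The second prototype below is purely hypothetical,
with a parameter list invented for illustration:

	/* today: the driver gets the vma and must not rely on it
	 * across a sleep */
	int (*swapout)(struct vm_area_struct * vma, struct page * page);

	/* hypothetical field-based variant: the vma need not stay
	 * valid while the driver sleeps */
	int (*swapout)(unsigned long start, unsigned long end,
		       unsigned long flags, struct file * file,
		       struct page * page);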

Thanks.

Kanoj

--- Documentation/vm/locking	Thu Oct 14 15:38:03 1999
+++ Documentation/vm/locking	Thu Oct 14 15:44:40 1999
@@ -0,0 +1,83 @@
+Started Oct 1999 by Kanoj Sarcar <kanoj@sgi.com>
+
+The intent of this file is to have an up-to-date, running commentary 
+from different people about how locking and synchronization are done 
+in the Linux vm code.
+
+vmlist_access_lock/vmlist_modify_lock
+--------------------------------------
+
+Page stealers pick processes out of the process pool and scan for 
+the best process to steal pages from. To guarantee the existence 
+of the victim mm, a mm_count inc and a mmdrop are done in swap_out().
+Page stealers hold kernel_lock to protect against a bunch of races.
+The vma list of the victim mm is also scanned by the stealer, 
+and the vmlist_lock is used to preserve list sanity against the
+process adding/deleting to the list. This also guarantees existence
+of the vma. The vma existence guarantee while invoking the driver
+swapout() method in try_to_swap_out() also relies on the fact
+that do_munmap() temporarily gets lock_kernel before decimating
+the vma, thus the swapout() method must snapshot all the vma 
+fields it needs before going to sleep (which will release the
+lock_kernel held by the page stealer). Currently, filemap_swapout
+is the only method that depends on this shaky interlocking.
+
+Any code that modifies the vmlist, or the vm_start/vm_end/
+vm_flags:VM_LOCKED/vm_next of any vma *in the list* must prevent 
+kswapd from looking at the chain. This does not include driver mmap() 
+methods, for example, since the vma is still not in the list.
+
+The rules are:
+1. To modify the vmlist (add/delete or change fields in an element), 
+you must hold mmap_sem to guard against clones doing mmap/munmap/faults, 
+(ie all vm system calls and faults), and from ptrace, swapin due to 
+swap deletion etc.
+2. To modify the vmlist (add/delete or change fields in an element), 
+you must also hold vmlist_modify_lock, to guard against page stealers 
+scanning the list.
+3. To scan the vmlist (find_vma()), you must either 
+        a. grab mmap_sem, which should be done by all cases except 
+	   page stealer.
+or
+        b. grab vmlist_access_lock, only done by page stealer.
+4. While holding the vmlist_modify_lock, you must be able to guarantee
+that no code path will lead to page stealing. A better guarantee is
+to claim non sleepability, which ensures that you are not sleeping
+for a lock, whose holder might in turn be doing page stealing.
+5. You must be able to guarantee that while holding vmlist_modify_lock
+or vmlist_access_lock of mm A, you will not try to get either lock
+for mm B.
+
+The caveats are:
+1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
+The update of mmap_cache is racy (page stealer can race with other code
+that invokes find_vma with mmap_sem held), but that is okay, since it 
+is a hint. This can be fixed, if desired, by having find_vma grab the
+vmlist lock.
+
+
+Code that adds/deletes elements from the vmlist chain:
+1. callers of insert_vm_struct
+2. callers of merge_segments
+3. callers of avl_remove
+
+Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vma's on
+the list:
+1. expand_stack
+2. mprotect
+3. mlock
+4. mremap
+
+It is advisable that changes to vm_start/vm_end be protected, although 
+in some cases it is not really needed. E.g., vm_start is modified by 
+expand_stack(); it is hard to come up with a destructive scenario without 
+having the vmlist protection in this case.
+
+The vmlist lock nests with the inode i_shared_lock and the kmem cache
+c_spinlock spinlocks. This is okay, since code that holds i_shared_lock 
+never asks for memory, and the kmem code asks for pages after dropping
+c_spinlock.
+
+The vmlist lock can be a sleeping or spin lock. In either case, care
+must be taken that it is not held on entry to the driver methods, since
+those methods might sleep or ask for memory, causing deadlocks.
--- /usr/tmp/p_rdiff_a003hF/exec.c	Thu Oct 14 16:35:50 1999
+++ fs/exec.c	Thu Oct 14 09:50:25 1999
@@ -276,7 +276,9 @@
 		mpnt->vm_offset = 0;
 		mpnt->vm_file = NULL;
 		mpnt->vm_private_data = (void *) 0;
+		vmlist_modify_lock(current->mm);
 		insert_vm_struct(current->mm, mpnt);
+		vmlist_modify_unlock(current->mm);
 		current->mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
 	} 
 
--- /usr/tmp/p_rdiff_a003hO/mm.h	Thu Oct 14 16:35:59 1999
+++ include/linux/mm.h	Thu Oct 14 13:45:29 1999
@@ -427,6 +427,12 @@
 #define pgcache_under_min()	(atomic_read(&page_cache_size) * 100 < \
 				page_cache.min_percent * num_physpages)
 
+#define vmlist_access_lock(mm)		down(&mm->vmlist_lock)
+#define vmlist_access_unlock(mm)	up(&mm->vmlist_lock)
+#define vmlist_modify_lock(mm)		vmlist_access_lock(mm)
+#define vmlist_modify_unlock(mm)	vmlist_access_unlock(mm)
+
+
 #endif /* __KERNEL__ */
 
 #endif
--- /usr/tmp/p_rdiff_a003hX/sched.h	Thu Oct 14 16:36:08 1999
+++ include/linux/sched.h	Thu Oct 14 13:45:24 1999
@@ -213,6 +213,7 @@
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
 	int map_count;				/* number of VMAs */
 	struct semaphore mmap_sem;
+	struct semaphore vmlist_lock;		/* protect against kswapd */
 	spinlock_t page_table_lock;
 	unsigned long context;
 	unsigned long start_code, end_code, start_data, end_data;
@@ -235,6 +236,7 @@
 		swapper_pg_dir, 			\
 		ATOMIC_INIT(2), ATOMIC_INIT(1), 1,	\
 		__MUTEX_INITIALIZER(name.mmap_sem),	\
+		__MUTEX_INITIALIZER(name.vmlist_lock),	\
 		SPIN_LOCK_UNLOCKED,			\
 		0,					\
 		0, 0, 0, 0,				\
--- /usr/tmp/p_rdiff_a003he/shm.c	Thu Oct 14 16:36:17 1999
+++ ipc/shm.c	Wed Oct 13 14:31:47 1999
@@ -462,8 +462,10 @@
 	   > (unsigned long) current->rlim[RLIMIT_AS].rlim_cur)
 		return -ENOMEM;
 	current->mm->total_vm += tmp >> PAGE_SHIFT;
+	vmlist_modify_lock(current->mm);
 	insert_vm_struct(current->mm, shmd);
 	merge_segments(current->mm, shmd->vm_start, shmd->vm_end);
+	vmlist_modify_unlock(current->mm);
 
 	return 0;
 }
--- /usr/tmp/p_rdiff_a003hn/fork.c	Thu Oct 14 16:36:26 1999
+++ kernel/fork.c	Thu Oct 14 13:47:37 1999
@@ -303,6 +303,7 @@
 		atomic_set(&mm->mm_users, 1);
 		atomic_set(&mm->mm_count, 1);
 		init_MUTEX(&mm->mmap_sem);
+		init_MUTEX(&mm->vmlist_lock);
 		mm->page_table_lock = SPIN_LOCK_UNLOCKED;
 		mm->pgd = pgd_alloc();
 		if (mm->pgd)
--- /usr/tmp/p_rdiff_a003hw/ptrace.c	Thu Oct 14 16:36:34 1999
+++ kernel/ptrace.c	Wed Oct 13 10:43:38 1999
@@ -80,12 +80,14 @@
 int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write)
 {
 	int copied;
-	struct vm_area_struct * vma = find_extend_vma(tsk, addr);
+	struct vm_area_struct * vma;
 
-	if (!vma)
-		return 0;
-
 	down(&tsk->mm->mmap_sem);
+	vma = find_extend_vma(tsk, addr);
+	if (!vma) {
+		up(&tsk->mm->mmap_sem);
+		return 0;
+	}
 	copied = 0;
 	for (;;) {
 		unsigned long offset = addr & ~PAGE_MASK;
--- /usr/tmp/p_rdiff_a003i5/mlock.c	Thu Oct 14 16:36:43 1999
+++ mm/mlock.c	Tue Oct 12 16:35:25 1999
@@ -13,7 +13,9 @@
 
 static inline int mlock_fixup_all(struct vm_area_struct * vma, int newflags)
 {
+	vmlist_modify_lock(vma->vm_mm);
 	vma->vm_flags = newflags;
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -26,15 +28,17 @@
 	if (!n)
 		return -EAGAIN;
 	*n = *vma;
-	vma->vm_start = end;
 	n->vm_end = end;
-	vma->vm_offset += vma->vm_start - n->vm_start;
 	n->vm_flags = newflags;
 	if (n->vm_file)
 		get_file(n->vm_file);
 	if (n->vm_ops && n->vm_ops->open)
 		n->vm_ops->open(n);
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_offset += end - vma->vm_start;
+	vma->vm_start = end;
 	insert_vm_struct(current->mm, n);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -47,7 +51,6 @@
 	if (!n)
 		return -EAGAIN;
 	*n = *vma;
-	vma->vm_end = start;
 	n->vm_start = start;
 	n->vm_offset += n->vm_start - vma->vm_start;
 	n->vm_flags = newflags;
@@ -55,7 +58,10 @@
 		get_file(n->vm_file);
 	if (n->vm_ops && n->vm_ops->open)
 		n->vm_ops->open(n);
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_end = start;
 	insert_vm_struct(current->mm, n);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -75,10 +81,7 @@
 	*left = *vma;
 	*right = *vma;
 	left->vm_end = start;
-	vma->vm_start = start;
-	vma->vm_end = end;
 	right->vm_start = end;
-	vma->vm_offset += vma->vm_start - left->vm_start;
 	right->vm_offset += right->vm_start - left->vm_start;
 	vma->vm_flags = newflags;
 	if (vma->vm_file)
@@ -88,8 +91,14 @@
 		vma->vm_ops->open(left);
 		vma->vm_ops->open(right);
 	}
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_offset += start - vma->vm_start;
+	vma->vm_start = start;
+	vma->vm_end = end;
+	vma->vm_flags = newflags;
 	insert_vm_struct(current->mm, left);
 	insert_vm_struct(current->mm, right);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -168,7 +177,9 @@
 			break;
 		}
 	}
+	vmlist_modify_lock(current->mm);
 	merge_segments(current->mm, start, end);
+	vmlist_modify_unlock(current->mm);
 	return error;
 }
 
@@ -240,7 +251,9 @@
 		if (error)
 			break;
 	}
+	vmlist_modify_lock(current->mm);
 	merge_segments(current->mm, 0, TASK_SIZE);
+	vmlist_modify_unlock(current->mm);
 	return error;
 }
 
--- /usr/tmp/p_rdiff_a003iE/mmap.c	Thu Oct 14 16:36:52 1999
+++ mm/mmap.c	Thu Oct 14 15:30:30 1999
@@ -324,8 +324,10 @@
 	 */
 	flags = vma->vm_flags;
 	addr = vma->vm_start; /* can addr have changed?? */
+	vmlist_modify_lock(mm);
 	insert_vm_struct(mm, vma);
 	merge_segments(mm, vma->vm_start, vma->vm_end);
+	vmlist_modify_unlock(mm);
 	
 	mm->total_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED) {
@@ -528,11 +530,13 @@
 	}
 
 	/* Work out to one of the ends. */
-	if (end == area->vm_end)
+	if (end == area->vm_end) {
 		area->vm_end = addr;
-	else if (addr == area->vm_start) {
+		vmlist_modify_lock(current->mm);
+	} else if (addr == area->vm_start) {
 		area->vm_offset += (end - area->vm_start);
 		area->vm_start = end;
+		vmlist_modify_lock(current->mm);
 	} else {
 	/* Unmapping a hole: area->vm_start < addr <= end < area->vm_end */
 		/* Add end mapping -- leave beginning for below */
@@ -553,10 +557,12 @@
 		if (mpnt->vm_ops && mpnt->vm_ops->open)
 			mpnt->vm_ops->open(mpnt);
 		area->vm_end = addr;	/* Truncate area */
+		vmlist_modify_lock(current->mm);
 		insert_vm_struct(current->mm, mpnt);
 	}
 
 	insert_vm_struct(current->mm, area);
+	vmlist_modify_unlock(current->mm);
 	return extra;
 }
 
@@ -656,6 +662,7 @@
 
 	npp = (prev ? &prev->vm_next : &mm->mmap);
 	free = NULL;
+	vmlist_modify_lock(mm);
 	for ( ; mpnt && mpnt->vm_start < addr+len; mpnt = *npp) {
 		*npp = mpnt->vm_next;
 		mpnt->vm_next = free;
@@ -663,6 +670,8 @@
 		if (mm->mmap_avl)
 			avl_remove(mpnt, &mm->mmap_avl);
 	}
+	mm->mmap_cache = NULL;	/* Kill the cache. */
+	vmlist_modify_unlock(mm);
 
 	/* Ok - we have the memory areas we should free on the 'free' list,
 	 * so release them, and unmap the page range..
@@ -679,6 +688,11 @@
 		end = end > mpnt->vm_end ? mpnt->vm_end : end;
 		size = end - st;
 
+		/*
+		 * The lock_kernel interlocks with kswapd try_to_swap_out
+		 * invoking a driver swapout() method, and being able to
+		 * guarantee vma existance.
+		 */
 		lock_kernel();
 		if (mpnt->vm_ops && mpnt->vm_ops->unmap)
 			mpnt->vm_ops->unmap(mpnt, st, size);
@@ -703,7 +717,6 @@
 
 	free_pgtables(mm, prev, addr, addr+len);
 
-	mm->mmap_cache = NULL;	/* Kill the cache. */
 	return 0;
 }
 
@@ -787,8 +800,10 @@
 	flags = vma->vm_flags;
 	addr = vma->vm_start;
 
+	vmlist_modify_lock(mm);
 	insert_vm_struct(mm, vma);
 	merge_segments(mm, vma->vm_start, vma->vm_end);
+	vmlist_modify_unlock(mm);
 	
 	mm->total_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED) {
@@ -815,7 +830,9 @@
 
 	release_segments(mm);
 	mpnt = mm->mmap;
+	vmlist_modify_lock(mm);
 	mm->mmap = mm->mmap_avl = mm->mmap_cache = NULL;
+	vmlist_modify_unlock(mm);
 	mm->rss = 0;
 	mm->total_vm = 0;
 	mm->locked_vm = 0;
@@ -911,6 +928,7 @@
 		prev = mpnt;
 		mpnt = mpnt->vm_next;
 	}
+	mm->mmap_cache = NULL;		/* Kill the cache. */
 
 	/* prev and mpnt cycle through the list, as long as
 	 * start_addr < mpnt->vm_end && prev->vm_start < end_addr
@@ -947,7 +965,9 @@
 		if (mpnt->vm_ops && mpnt->vm_ops->close) {
 			mpnt->vm_offset += mpnt->vm_end - mpnt->vm_start;
 			mpnt->vm_start = mpnt->vm_end;
+			vmlist_modify_unlock(mm);
 			mpnt->vm_ops->close(mpnt);
+			vmlist_modify_lock(mm);
 		}
 		mm->map_count--;
 		remove_shared_vm_struct(mpnt);
@@ -956,7 +976,6 @@
 		kmem_cache_free(vm_area_cachep, mpnt);
 		mpnt = prev;
 	}
-	mm->mmap_cache = NULL;		/* Kill the cache. */
 }
 
 void __init vma_init(void)
--- /usr/tmp/p_rdiff_a003iN/mprotect.c	Thu Oct 14 16:37:02 1999
+++ mm/mprotect.c	Wed Oct 13 10:57:02 1999
@@ -82,8 +82,10 @@
 static inline int mprotect_fixup_all(struct vm_area_struct * vma,
 	int newflags, pgprot_t prot)
 {
+	vmlist_modify_lock(vma->vm_mm);
 	vma->vm_flags = newflags;
 	vma->vm_page_prot = prot;
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -97,9 +99,7 @@
 	if (!n)
 		return -ENOMEM;
 	*n = *vma;
-	vma->vm_start = end;
 	n->vm_end = end;
-	vma->vm_offset += vma->vm_start - n->vm_start;
 	n->vm_flags = newflags;
 	n->vm_page_prot = prot;
 	if (n->vm_file)
@@ -106,7 +106,11 @@
 		get_file(n->vm_file);
 	if (n->vm_ops && n->vm_ops->open)
 		n->vm_ops->open(n);
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_offset += end - vma->vm_start;
+	vma->vm_start = end;
 	insert_vm_struct(current->mm, n);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -120,7 +124,6 @@
 	if (!n)
 		return -ENOMEM;
 	*n = *vma;
-	vma->vm_end = start;
 	n->vm_start = start;
 	n->vm_offset += n->vm_start - vma->vm_start;
 	n->vm_flags = newflags;
@@ -129,7 +132,10 @@
 		get_file(n->vm_file);
 	if (n->vm_ops && n->vm_ops->open)
 		n->vm_ops->open(n);
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_end = start;
 	insert_vm_struct(current->mm, n);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -150,13 +156,8 @@
 	*left = *vma;
 	*right = *vma;
 	left->vm_end = start;
-	vma->vm_start = start;
-	vma->vm_end = end;
 	right->vm_start = end;
-	vma->vm_offset += vma->vm_start - left->vm_start;
 	right->vm_offset += right->vm_start - left->vm_start;
-	vma->vm_flags = newflags;
-	vma->vm_page_prot = prot;
 	if (vma->vm_file)
 		atomic_add(2,&vma->vm_file->f_count);
 	if (vma->vm_ops && vma->vm_ops->open) {
@@ -163,8 +164,15 @@
 		vma->vm_ops->open(left);
 		vma->vm_ops->open(right);
 	}
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_offset += start - vma->vm_start;
+	vma->vm_start = start;
+	vma->vm_end = end;
+	vma->vm_flags = newflags;
+	vma->vm_page_prot = prot;
 	insert_vm_struct(current->mm, left);
 	insert_vm_struct(current->mm, right);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -246,7 +254,9 @@
 			break;
 		}
 	}
+	vmlist_modify_lock(current->mm);
 	merge_segments(current->mm, start, end);
+	vmlist_modify_unlock(current->mm);
 out:
 	up(&current->mm->mmap_sem);
 	return error;
--- /usr/tmp/p_rdiff_a003iW/mremap.c	Thu Oct 14 16:37:10 1999
+++ mm/mremap.c	Wed Oct 13 10:58:54 1999
@@ -141,8 +141,10 @@
 				get_file(new_vma->vm_file);
 			if (new_vma->vm_ops && new_vma->vm_ops->open)
 				new_vma->vm_ops->open(new_vma);
+			vmlist_modify_lock(current->mm);
 			insert_vm_struct(current->mm, new_vma);
 			merge_segments(current->mm, new_vma->vm_start, new_vma->vm_end);
+			vmlist_modify_unlock(vma->vm_mm);
 			do_munmap(addr, old_len);
 			current->mm->total_vm += new_len >> PAGE_SHIFT;
 			if (new_vma->vm_flags & VM_LOCKED) {
@@ -220,7 +222,9 @@
 		/* can we just expand the current mapping? */
 		if (max_addr - addr >= new_len) {
 			int pages = (new_len - old_len) >> PAGE_SHIFT;
+			vmlist_modify_lock(vma->vm_mm);
 			vma->vm_end = addr + new_len;
+			vmlist_modify_unlock(vma->vm_mm);
 			current->mm->total_vm += pages;
 			if (vma->vm_flags & VM_LOCKED) {
 				current->mm->locked_vm += pages;
--- /usr/tmp/p_rdiff_a003id/vmscan.c	Thu Oct 14 16:37:19 1999
+++ mm/vmscan.c	Thu Oct 14 14:49:38 1999
@@ -139,6 +139,7 @@
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		flush_tlb_page(vma, address);
 		vma->vm_mm->rss--;
+		vmlist_access_unlock(vma->vm_mm);
 		error = vma->vm_ops->swapout(vma, page);
 		if (!error)
 			goto out_free_success;
@@ -164,6 +165,7 @@
 	spin_unlock(&vma->vm_mm->page_table_lock);
 
 	flush_tlb_page(vma, address);
+	vmlist_access_unlock(vma->vm_mm);
 	swap_duplicate(entry);	/* One for the process, one for the swap cache */
 
 	/* This will also lock the page */
@@ -295,6 +297,7 @@
 	/*
 	 * Find the proper vm-area
 	 */
+	vmlist_access_lock(mm);
 	vma = find_vma(mm, address);
 	if (vma) {
 		if (address < vma->vm_start)
@@ -310,6 +313,7 @@
 			address = vma->vm_start;
 		}
 	}
+	vmlist_access_unlock(mm);
 
 	/* We didn't find anything for the process */
 	mm->swap_cnt = 0;

* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-15  0:06 [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection Kanoj Sarcar
@ 1999-10-15 11:58 ` Manfred Spraul
  1999-10-15 16:38   ` Kanoj Sarcar
  1999-10-18 19:45 ` Kanoj Sarcar
  1 sibling, 1 reply; 15+ messages in thread
From: Manfred Spraul @ 1999-10-15 11:58 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: torvalds, sct, andrea, viro, linux-mm, linux-kernel

Kanoj Sarcar wrote:
> swapout() currently takes the vma as an input,
> but the vma might be getting deleted (the documentation which is
> part of the patch describes currently how things are protected),
> so it might be prudent to pass individual fields of the vma to the
> swapout() method, rather than a pointer to the structure.

Passing the individual fields of the vma is impossible:
only swap_out knows which fields of the vma are important, and which
locking is required (e.g. fget()).

AFAICS, there are only 2 acceptable solutions:
- lock_kernel() as in your patch.
- swap_out() is called with the semaphore held, and it sleeps with the
semaphore. [I prefer this solution: it's the first step towards swapping
without lock_kernel()].

Or: ->swapout() releases the semaphore, or split ->swapout() into 2
parts.


> +               /*
> +                * The lock_kernel interlocks with kswapd try_to_swap_out
> +                * invoking a driver swapout() method, and being able to
> +                * guarantee vma existance.
> +                */
>                 lock_kernel();
>                 if (mpnt->vm_ops && mpnt->vm_ops->unmap)
>                         mpnt->vm_ops->unmap(mpnt, st, size);
> [...]
>         flush_tlb_page(vma, address);
> +       vmlist_access_unlock(vma->vm_mm);
>         swap_duplicate(entry);  /* One for the process, one for the swap cache */
> 
>         /* This will also lock the page */

I thought that the page stealer would call ->swapout() while owning the
vmlist_lock. 
a) there should be no lock-up, because the swapper is never reentered
[PF_MEMALLOC].
b) noone except the swapper is allowed to sleep while owning
vmlist_lock.
c) getting rid of that lock_kernel() call is one of the main aims of the
vmlist_lock.

--
	Manfred


* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-15 11:58 ` Manfred Spraul
@ 1999-10-15 16:38   ` Kanoj Sarcar
  1999-10-15 18:26     ` Manfred Spraul
  0 siblings, 1 reply; 15+ messages in thread
From: Kanoj Sarcar @ 1999-10-15 16:38 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: torvalds, sct, andrea, viro, linux-mm, linux-kernel

> 
> Kanoj Sarcar wrote:
> > swapout() currently takes the vma as an input,
> > but the vma might be getting deleted (the documentation which is
> > part of the patch describes currently how things are protected),
> > so it might be prudent to pass individual fields of the vma to the
> > swapout() method, rather than a pointer to the structure.
> 
> passing the individual fields of the vma is impossible:
> only swap_out knows which field of the vma are important, and which
> locking is required (eg fget()).
> 

Note that currently, the only swapout() method which does anything is
filemap_swapout, and what it really needs from the vma is vm_file. Yes,
changing a driver interface is not the best solution, but a new 2.X
release is the place to do it. If you wanted to be more careful, you 
could define the swapout prototype as swapout(start, end, flags, file).
That *should* be enough for most future 2.3/2.4 drivers. 

There is also a cleaner way to do this. Have a field vm_drvhandle in
the vma, pass that to the swapout routine. Any driver which has a 
swapout() routine will maintain whatever private data it needs to
implement the swapout, tagged with this specific vm_drvhandle value
that it fills in at open/mmap time. 
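
Roughly, with all names hypothetical:

	/* the driver fills vm_drvhandle at open()/mmap() time with
	 * whatever private state its swapout() needs; the vma itself
	 * never has to survive the call */
	int (*swapout)(void * vm_drvhandle, struct page * page);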

> AFAICS, there are only 2 acceptable solutions:
> - lock_kernel() as in your patch.

IMO, we can do better.

> - swap_out() is called with the semaphore held, and it sleeps with the
> semaphore. [I prefer this solution: it's the first step towards swapping
> without lock_kernel()].
>

Look below for why this is not safe.
 
> Or: ->swapout() releases the semaphore, or split ->swapout() into 2
> parts.
> 

This works for filemap_swapout, but you can not expect every regular Joe
driver writer to adhere to this rule. What do you mean by splitting 
swapout() into 2 parts?

> 
> > +               /*
> > +                * The lock_kernel interlocks with kswapd try_to_swap_out
> > +                * invoking a driver swapout() method, and being able to
> > +                * guarantee vma existance.
> > +                */
> >                 lock_kernel();
> >                 if (mpnt->vm_ops && mpnt->vm_ops->unmap)
> >                         mpnt->vm_ops->unmap(mpnt, st, size);
> > [...]
> >         flush_tlb_page(vma, address);
> > +       vmlist_access_unlock(vma->vm_mm);
> >         swap_duplicate(entry);  /* One for the process, one for the swap cache */
> > 
> >         /* This will also lock the page */
> 
> I thought that the page stealer would call ->swapout() while owning the
> vmlist_lock. 
> a) there should be no lock-up, because the swapper is never reentered
> [PF_MEMALLOC].
> b) noone except the swapper is allowed to sleep while owning
> vmlist_lock.

How about this: process A runs short of memory and decides to swap_out its
own mm. It victimizes vma V, whose swapout() routine goes to sleep while
holding A's vmlist lock, waiting for a sleeping lock L. Meanwhile, lock L is
held by process B, which runs short of memory and decides to steal from
A. But A's vmlist lock is held. Deadlock, right?

If you read the documentation which is part of the patch, you will see
why I clearly point out that holding the vmlist_lock across driver methods
is not safe at all.

> c) getting rid of that lock_kernel() call is one of the main aims of the
> vmlist_lock.
> 

Yes, and there's a bunch of ways to achieve that.

And here's one more. Before invoking swapout(), and before losing the 
vmlist_lock in try_to_swap_out, the vma might be marked with a flag
that indicates that swapout() is looking at the vma. do_munmap will 
look at this flag and put itself to sleep on a synchronization 
variable. After swapout() terminates, the page stealer will wake up 
anyone waiting in do_munmap to continue destroying the vma.
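
Very roughly, with the flag and wait queue names invented here for
illustration:

	/* page stealer, still holding the vmlist lock in try_to_swap_out() */
	vma->vm_flags |= VM_SWAPOUT_BUSY;	/* hypothetical flag */
	vmlist_access_unlock(vma->vm_mm);
	error = vma->vm_ops->swapout(vma, page);
	vmlist_access_lock(vma->vm_mm);
	vma->vm_flags &= ~VM_SWAPOUT_BUSY;
	vmlist_access_unlock(vma->vm_mm);
	wake_up(&vma->vm_mm->vma_wait);		/* hypothetical wait queue */

	/* do_munmap(), before decimating the vma; in real code the test
	 * would have to be interlocked with the vmlist lock to avoid a
	 * missed wakeup */
	while (mpnt->vm_flags & VM_SWAPOUT_BUSY)
		sleep_on(&mm->vma_wait);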

This swapout() cleanup is independent of the patch I have already posted,
so the patch should be integrated into 2.3, while we debate how to tackle
the cleanup.

Thanks.

Kanoj

> --
> 	Manfred
> 


* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-15 16:38   ` Kanoj Sarcar
@ 1999-10-15 18:26     ` Manfred Spraul
  1999-10-15 18:43       ` Kanoj Sarcar
  0 siblings, 1 reply; 15+ messages in thread
From: Manfred Spraul @ 1999-10-15 18:26 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: torvalds, sct, andrea, viro, linux-mm, linux-kernel

Kanoj Sarcar wrote:
> If you wanted to be more careful, you
> could define the swapout prototype as swapout(start, end, flags, file).
> That *should* be enough for most future 2.3/2.4 driver.

"file" can go away if you do not call "get_file()" before releasing the
lock.

> 
> > - swap_out() is called with the semaphore held, 
> 
> Look below for why this is not safe.
You are right, this can lock up. The swapper is only protected from
reentrancy on its own stack, not from reentrancy from another thread.

> 
> > Or: ->swapout() releases the semaphore, 
> >
> 
> This works for filemap_swapout, but you can not expect every regular Joe
> driver writer to adhere to this rule.
The result is not a rare lock-up, but it will lock up nearly
immediately. Even Joe would notice that.
[I know this is ugly]

> And here's one more. Before invoking swapout(), and before losing the
> vmlist_lock in try_to_swap_out, the vma might be marked with a flag
> that indicates that swapout() is looking at the vma.
Or: use a multiple reader - single writer semaphore with "starve writer"
policy.
IMO that's cleaner than a semaphore with an attached waitqueue for
do_munmap().


> This swapout() cleanup is independent of the patch I have already posted,
> so the patch should be integrated into 2.3, while we debate how to tackle
> the cleanup.

Ok.

--
	Manfred

* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-15 18:26     ` Manfred Spraul
@ 1999-10-15 18:43       ` Kanoj Sarcar
  1999-10-15 20:47         ` Manfred Spraul
  0 siblings, 1 reply; 15+ messages in thread
From: Kanoj Sarcar @ 1999-10-15 18:43 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: torvalds, sct, andrea, viro, linux-mm, linux-kernel

> 
> Kanoj Sarcar wrote:
> > If you wanted to be more careful, you
> > could define the swapout prototype as swapout(start, end, flags, file).
> > That *should* be enough for most future 2.3/2.4 driver.
> 
> "file" can go away if you do not call "get_file()" before releasing the
> locking.

Absolutely ... you would need to do a get_file() in try_to_swap_out,
then a fput() after return from the swapout() method. I was just
focusing attention on the vma-does-not-have-to-exist point.

> > 
> > This works for filemap_swapout, but you can not expect every regular Joe
> > driver writer to adhere to this rule.
> The result is not a rare lock-up, but it will lock-up nearly
> immediately. Even Joe would notice that.
> [I know this is ugly]

Oh, the lockup will happen only if Joe tests his driver under huge 
memory demand conditions, and then only if a page stealer decides to 
victimize the vma Joe set up. Entirely missable during driver
development, imo. Even if Joe hits the scenario, it would be cruel
for us to ask him to understand how swapping works; it's enough for
him to have to worry about MP issues ...

> 
> > And here's one more. Before invoking swapout(), and before losing the
> > vmlist_lock in try_to_swap_out, the vma might be marked with a flag
> > that indicates that swapout() is looking at the vma.
> Or: use a multiple reader - single writer semaphore with "starve writer"
> policy.
> IMO that's cleaner than a semaphore with an attached waitqueue for
> do_munmap().
> 

Explain ... who are the readers, and who are the writers? I think if you 
are talking about a semaphore lock being held thru out swapout() in the
try_to_swap_out path, you are reduced to the same deadlock I just pointed 
out. I was talking more about a monitor like approach here.

Kanoj

* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-15 18:43       ` Kanoj Sarcar
@ 1999-10-15 20:47         ` Manfred Spraul
  1999-10-15 21:13           ` Kanoj Sarcar
  0 siblings, 1 reply; 15+ messages in thread
From: Manfred Spraul @ 1999-10-15 20:47 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: torvalds, sct, andrea, viro, linux-mm, linux-kernel

Kanoj Sarcar wrote:
> Explain ... who are the readers, and who are the writers? I think if you
> are talking about a semaphore lock being held thru out swapout() in the
> try_to_swap_out path, you are reduced to the same deadlock I just pointed
> out. I was talking more about a monitor like approach here.

The lock is held thru out swapout(), but it is a shared lock: multiple
swapper threads can own it. There should be no lock-up.

reader: swapper. Reentrancy is not a problem because it is a read-lock,
ie shared. The implementation must starve exclusive waiters (ie a reader
is allowed to continue even if a writer is waiting).

write: everyone who changes the vma list. These functions must not sleep
while owning the ERESOURCE (IIRC the NT kernel name) exclusive.

I hope I have not overlooked a detail,
	Manfred

* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-15 20:47         ` Manfred Spraul
@ 1999-10-15 21:13           ` Kanoj Sarcar
  1999-10-15 21:24             ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Kanoj Sarcar @ 1999-10-15 21:13 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: torvalds, sct, andrea, viro, linux-mm, linux-kernel

> 
> Kanoj Sarcar wrote:
> > Explain ... who are the readers, and who are the writers? I think if you
> > are talking about a semaphore lock being held thru out swapout() in the
> > try_to_swap_out path, you are reduced to the same deadlock I just pointed
> > out. I was talking more about a monitor like approach here.
> 
> The lock is held thru out swapout(), but it is a shared lock: multiple
> swapper threads can own it. There should be no lock-up.
> 
> reader: swapper. Reentrancy is not a problem because it is a read-lock,
> ie shared. The implementation must starve exclusive waiters (ie a reader
> is allowed to continue even if a writer is waiting).
> 
> write: everyone who changes the vma list. These functions must not sleep
> while owning the ERESOURCE (IIRC the NT kernel name) exclusive.
> 
> I hope I have not overlocked a detail,
> 	Manfred
> 

With an eye partly towards this implementation, I had the page stealer
code grab vmlist_access_lock, while others get vmlist_modify_lock, 
although in mm.h, both of these reduce to a down() operation.

The reason I am not very keen on this solution either is if you 
consider process A holding vmlist_access_lock of B, going into swapout(),
where it tries to get a (sleeping) driver lock. Meanwhile, process B
has the driver lock, and is trying to grab the vmlist_update_lock on
itself, ie B, maybe to add/delete the vma. I do not think there is
such a driver currently though.

Kanoj

* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-15 21:13           ` Kanoj Sarcar
@ 1999-10-15 21:24             ` Linus Torvalds
  1999-10-15 21:39               ` Kanoj Sarcar
  1999-10-15 23:16               ` Manfred Spraul
  0 siblings, 2 replies; 15+ messages in thread
From: Linus Torvalds @ 1999-10-15 21:24 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Manfred Spraul, sct, andrea, viro, linux-mm, linux-kernel


On Fri, 15 Oct 1999, Kanoj Sarcar wrote:
> 
> The reason I am not very keen on this solution either is if you 
> consider process A holding vmlist_access_lock of B, going into swapout(),
> where it tries to get a (sleeping) driver lock. Meanwhile, process B
> has the driver lock, and is trying to grab the vmlist_update_lock on
> itself, ie B, maybe to add/delete the vma. I do not think there is
> such a driver currently though.

I am convinced that all these games are unnecessary, and that the problem
is fundamentally different. Not fixing up the current code, but just
looking at the problem differently - making the deadlock go away by virtue
of avoiding the critical regions.

I think the suggestion to change the semantics of "swapout" is a good one.
Now we have the mm layer passing down the vma to the IO layer, and that
makes everything more complex. I would certainly agree with just changing
that semantic detail, and changing swapout to something like

	.. hold a spinlock - we can probably just reuse the
	   page_table_lock for this to avoid multiple levels of locking
	   here..

	file = fget(vma->vm_file);
	offset = file->f_offset + (address - vma->vm_start);
	flush_tlb_page(vma, address);
	spin_unlock(&vma->vm_mm->page_table_lock);

	error = file->f_ops->swapout(file, offset, page);
	fput(file);

	...

and then the other requirement would be that whenever the vma chain is
physically modified, you also have to hold the page_table_lock. 
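
For the writers that would just look something like (sketch):

	spin_lock(&mm->page_table_lock);
	insert_vm_struct(mm, vma);
	merge_segments(mm, vma->vm_start, vma->vm_end);
	spin_unlock(&mm->page_table_lock);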

And finally, we rename the "page_table_lock" to the "page_stealer_lock",
and we're all done.

Does anybody see anything fundamentally wrong here? It looks like it
should fix the problem without introducing any new locks, and without
holding any locks across the actual physical swapout activity.

		Linus


* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-15 21:24             ` Linus Torvalds
@ 1999-10-15 21:39               ` Kanoj Sarcar
  1999-10-15 22:04                 ` Linus Torvalds
  1999-10-15 23:16               ` Manfred Spraul
  1 sibling, 1 reply; 15+ messages in thread
From: Kanoj Sarcar @ 1999-10-15 21:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: manfreds, sct, andrea, viro, linux-mm, linux-kernel

> 
> 
> On Fri, 15 Oct 1999, Kanoj Sarcar wrote:
> > 
> > The reason I am not very keen on this solution either is if you 
> > consider process A holding vmlist_access_lock of B, going into swapout(),
> > where it tries to get a (sleeping) driver lock. Meanwhile, process B
> > has the driver lock, and is trying to grab the vmlist_update_lock on
> > itself, ie B, maybe to add/delete the vma. I do not think there is
> > such a driver currently though.
> 
> I am convinced that all these games are unnecessary, and that the problem
> is fundamentally different. Not fixing up the current code, but just
> looking at the problem differently - making the deadlock go away by virtue
> of avoiding the critical regions.
> 
> I think the suggestion to change the semantics of "swapout" is a good one.
> Now we have the mm layer passing down the vma to the IO layer, and that
> makes everything more complex. I would certainly agree with just changing
> that semantic detail, and changing swapout to something like
> 
> 	.. hold a spinlock - we can probably just reuse the
> 	   page_table_lock for this to avoid multiple levels of locking
> 	   here..
> 
> 	file = fget(vma->vm_file);
> 	offset = file->f_offset + (address - vma->vm_start);
> 	flush_tlb_page(vma, address);
> 	spin_unlock(&vma->vm_mm->page_table_lock);
> 
> 	error = file->f_ops->swapout(file, offset, page);
> 	fput(file);
> 
> 	...
> 
> and then the other requirement would be that whenever the vma chain is
> physically modified, you also have to hold the page_table_lock. 
> 
> And finally, we rename the "page_table_lock" to the "page_stealer_lock",
> and we're all done.
> 
> Does anybody see anything fundamentally wrong here? It looks like it
> should fix the problem without introducing any new locks, and without
> holding any locks across the actual physical swapout activity.
>

Linus,

This basically means that you are overloading the page_table_lock to do 
the work of the vmlist_lock in my patch. Thus vmlist_modify_lock/
vmlist_access_lock in my patch could be changed in mm.h to grab the 
page_table_lock. As I mentioned before, moving to a spinlock to 
protect the vma chain will need some changes to the vmscan.c code.

The reason I think most people suggested a different lock, namely vmlist_lock,
is to reduce contention on the page_table_lock, so that all the other
paths like mprotect/mlock/mmap/munmap do not end up grabbing the
page_table_lock which is grabbed in the fault path. Grabbing the
page_table_lock while doing things like merge_segments/insert_vm_struct
would seem to increase the hold time of the spinlock. Pte protection
and vma chain protection are quite distinct things too.

Let me know how you want me to rework the patch. Imo, we should keep
the macros vmlist_modify_lock/vmlist_access_lock, even if we do
decide to overload the page table lock.
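
In mm.h the macros would then simply reduce to something like:

	#define vmlist_access_lock(mm)		spin_lock(&mm->page_table_lock)
	#define vmlist_access_unlock(mm)	spin_unlock(&mm->page_table_lock)
	#define vmlist_modify_lock(mm)		vmlist_access_lock(mm)
	#define vmlist_modify_unlock(mm)	vmlist_access_unlock(mm)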

Once we have put the (modified?) patch in, I can look at the swapout()
interface cleanup.

Kanoj

* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-15 21:39               ` Kanoj Sarcar
@ 1999-10-15 22:04                 ` Linus Torvalds
  1999-10-15 22:32                   ` Kanoj Sarcar
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 1999-10-15 22:04 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: manfreds, sct, andrea, viro, linux-mm, linux-kernel


On Fri, 15 Oct 1999, Kanoj Sarcar wrote:
> 
> This basically means that you are overloading the page_table_lock to do 
> the work of the vmlist_lock in my patch. Thus vmlist_modify_lock/
> vmlist_access_lock in my patch could be changed in mm.h to grab the 
> page_table_lock. As I mentioned before, moving to a spinlock to 
> protect the vma chain will need some changes to the vmscan.c code.

Agreed. 

> The reason I think most people suggested a different lock, namely vmlist_lock,
> is to reduce contention on the page_table_lock, so that all the other
> paths like mprotect/mlock/mmap/munmap do not end up grabbing the
> page_table_lock which is grabbed in the fault path.

There can be no contention on the page_table_lock in the absence of the
page stealer. The reason is simple: every single thing that gets the page
table lock has already gotten the mm lock beforehand, and as such
contention is never an issue.

Contention _only_ occurs for the specific case when somebody is scanning
the page tables in order to free up pages. And at that point it's not
contention any more, at that point it is the thing that protects us from
bad things happening.

As such, the hold time of the spinlock is entirely immaterial, and
coalescing the page table lock and the vmlist lock does not increase
contention, it only decreases the number of locks you have to get. At
least as far as I can see.

> Let me know how you want me to rework the patch. Imo, we should keep
> the macros vmlist_modify_lock/vmlist_access_lock, even if we do
> decide to overload the page table lock.

Don't think of it as overloading the page table lock. Notice how the page
table lock really isn't a page table lock - it really is just "protection
against vmscan", and it's misnamed mainly because the only part we
protected was the page tables (which isn't enough). 

So think of it as a fix to the current protection, and as that fix makes
it protect more than just the page tables (makes it protect everything
that is required), the name should also change.

		Linus


* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-15 22:04                 ` Linus Torvalds
@ 1999-10-15 22:32                   ` Kanoj Sarcar
  0 siblings, 0 replies; 15+ messages in thread
From: Kanoj Sarcar @ 1999-10-15 22:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: manfreds, sct, andrea, viro, linux-mm, linux-kernel

Okay, in the next couple of days, I will try to use the lock
currently known as "page_table_lock" for vma scan protection
in the page stealing code and post the modified patch.

Thanks.

Kanoj

* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-15 21:24             ` Linus Torvalds
  1999-10-15 21:39               ` Kanoj Sarcar
@ 1999-10-15 23:16               ` Manfred Spraul
  1999-10-16  0:44                 ` Linus Torvalds
  1 sibling, 1 reply; 15+ messages in thread
From: Manfred Spraul @ 1999-10-15 23:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kanoj Sarcar, sct, andrea, viro, linux-mm, linux-kernel

Linus Torvalds wrote:
> 
>         .. hold a spinlock - we can probably just reuse the
>            page_table_lock for this to avoid multiple levels of locking
>            here..
> 
>         file = fget(vma->vm_file);
		^^^^^^^
>         offset = file->f_offset + (address - vma->vm_start);
>         flush_tlb_page(vma, address);
>         spin_unlock(&vma->vm_mm->page_table_lock);
> 
>         error = file->f_ops->swapout(file, offset, page);
>         fput(file);
> 
>         ...
> 
> and then the other requirement would be that whenever the vma chain is
> physically modified, you also have to hold the page_table_lock.
> 

What about shm? vma->vm_file is NULL, this would oops.
I think that both "prepare for possible vma removal" and the parameters
which are passed to ->swapout() should be vma-specific: what about a
vm_ops->swapprepare()? This function should not allocate memory, ie
parameter passing should be stack based:

<<<<< mm.h
struct private_data
{
	void* private[4];
};
struct vm_ops
{
...
void (*swapprepare)(struct vm_area_struct * vma, struct page * page,
struct private_data * info);
int (*swapout)(struct private_data * info, struct page * page);
...
};
>>>>>>>>>
<<<<<<<< vmscan.c

if(vma->vm_ops && vma->vm_ops->swapout) {
	int error;
	struct private_data info;
	int (*swapout)(struct private_data *, struct page *);

	pte_clear(page_table);
	swapout = vma->vm_ops->swapout;
	vma->vm_ops->swapprepare(vma,page,&info);
	spin_unlock(&vma->vm_mm->page_table_lock);
	flush_tlb_page(vma, address);
	error = swapout(&info,page);
	...
}
>>>>>>>>>>>>>>>>


--
	Manfred

* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-15 23:16               ` Manfred Spraul
@ 1999-10-16  0:44                 ` Linus Torvalds
  0 siblings, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 1999-10-16  0:44 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Kanoj Sarcar, sct, andrea, viro, linux-mm, linux-kernel


On Sat, 16 Oct 1999, Manfred Spraul wrote:
> 
> What about shm? vma->vm_file is NULL, this would oops.

Well, considering that shm_swapout() currently looks like this:

	static int shm_swapout(struct vm_area_struct * vma, struct page * page)
	{ 
	        return 0;
	}

I don't think the SHM case is all that problematic: we could easily just
have a dummy vma->vm_file there. In fact, it probably should do so anyway:
the SHM code _really_ does not need the private member.

There are strong arguments for saying that if the thing you're mapping
actually _needs_ the vma in order to swap out, then the thing is broken.
SHM certainly used to be horribly broken in this area, but that's no
longer true.

		Linus


* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-15  0:06 [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection Kanoj Sarcar
  1999-10-15 11:58 ` Manfred Spraul
@ 1999-10-18 19:45 ` Kanoj Sarcar
  1999-10-18 20:02   ` Linus Torvalds
  1 sibling, 1 reply; 15+ messages in thread
From: Kanoj Sarcar @ 1999-10-18 19:45 UTC (permalink / raw)
  To: torvalds; +Cc: linux-mm

Linus,

Here is the reworked vma scanning protection patch against 2.3.22.
This patch has to get one less lock in the page stealing path 
compared to the previous patch that I posted. Let me know if this 
looks okay now, and I will send you an incremental swapout() interface 
cleanup patch that we have discussed.

Thanks.

Kanoj


--- Documentation/vm/locking	Mon Oct 18 11:20:13 1999
+++ Documentation/vm/locking	Mon Oct 18 10:59:49 1999
@@ -0,0 +1,88 @@
+Started Oct 1999 by Kanoj Sarcar <kanoj@sgi.com>
+
+The intent of this file is to have an up-to-date, running commentary 
+from different people about how locking and synchronization are done 
+in the Linux vm code.
+
+vmlist_access_lock/vmlist_modify_lock
+--------------------------------------
+
+Page stealers pick processes out of the process pool and scan for 
+the best process to steal pages from. To guarantee the existence 
+of the victim mm, a mm_count inc and a mmdrop are done in swap_out().
+Page stealers hold kernel_lock to protect against a bunch of races.
+The vma list of the victim mm is also scanned by the stealer, 
+and the vmlist_lock is used to preserve list sanity against the
+process adding/deleting to the list. This also guarantees existence
+of the vma. The vma existence guarantee while invoking the driver
+swapout() method in try_to_swap_out() also relies on the fact
+that do_munmap() temporarily gets lock_kernel before decimating
+the vma, thus the swapout() method must snapshot all the vma 
+fields it needs before going to sleep (which will release the
+lock_kernel held by the page stealer). Currently, filemap_swapout
+is the only method that depends on this shaky interlocking.
+
+Any code that modifies the vmlist, or the vm_start/vm_end/
+vm_flags:VM_LOCKED/vm_next of any vma *in the list* must prevent 
+kswapd from looking at the chain. This does not include driver mmap() 
+methods, for example, since the vma is still not in the list.
+
+The rules are:
+1. To modify the vmlist (add/delete or change fields in an element), 
+you must hold mmap_sem to guard against clones doing mmap/munmap/faults, 
+(ie all vm system calls and faults), and from ptrace, swapin due to 
+swap deletion etc.
+2. To modify the vmlist (add/delete or change fields in an element), 
+you must also hold vmlist_modify_lock, to guard against page stealers 
+scanning the list.
+3. To scan the vmlist (find_vma()), you must either 
+        a. grab mmap_sem, which should be done by all cases except 
+	   page stealer.
+or
+        b. grab vmlist_access_lock, only done by page stealer.
+4. While holding the vmlist_modify_lock, you must be able to guarantee
+that no code path will lead to page stealing. A better guarantee is
+to claim non sleepability, which ensures that you are not sleeping
+for a lock, whose holder might in turn be doing page stealing.
+5. You must be able to guarantee that while holding vmlist_modify_lock
+or vmlist_access_lock of mm A, you will not try to get either lock
+for mm B.
+
+The caveats are:
+1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
+The update of mmap_cache is racy (page stealer can race with other code
+that invokes find_vma with mmap_sem held), but that is okay, since it 
+is a hint. This can be fixed, if desired, by having find_vma grab the
+vmlist lock.
+
+
+Code that adds/deletes elements from the vmlist chain:
+1. callers of insert_vm_struct
+2. callers of merge_segments
+3. callers of avl_remove
+
+Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vma's on
+the list:
+1. expand_stack
+2. mprotect
+3. mlock
+4. mremap
+
+It is advisable that changes to vm_start/vm_end be protected, although 
+in some cases it is not really needed. E.g., vm_start is modified by 
+expand_stack(); it is hard to come up with a destructive scenario without 
+having the vmlist protection in this case.
+
+The vmlist lock nests with the inode i_shared_lock and the kmem cache
+c_spinlock spinlocks. This is okay, since code that holds i_shared_lock 
+never asks for memory, and the kmem code asks for pages after dropping
+c_spinlock.
+
+The vmlist lock can be a sleeping or spin lock. In either case, care
+must be taken that it is not held on entry to the driver methods, since
+those methods might sleep or ask for memory, causing deadlocks.
+
+The current implementation of the vmlist lock uses the page_table_lock,
+which is also the spinlock that page stealers use to protect changes to
+the victim process' ptes. Thus we have a reduction in the total number
+of locks. 
--- /usr/tmp/p_rdiff_a004tQ/exec.c	Mon Oct 18 12:25:58 1999
+++ fs/exec.c	Mon Oct 18 10:39:20 1999
@@ -276,7 +276,9 @@
 		mpnt->vm_offset = 0;
 		mpnt->vm_file = NULL;
 		mpnt->vm_private_data = (void *) 0;
+		vmlist_modify_lock(current->mm);
 		insert_vm_struct(current->mm, mpnt);
+		vmlist_modify_unlock(current->mm);
 		current->mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
 	} 
 
--- /usr/tmp/p_rdiff_a004tQ/mm.h	Mon Oct 18 12:26:01 1999
+++ include/linux/mm.h	Mon Oct 18 10:40:34 1999
@@ -427,6 +427,12 @@
 #define pgcache_under_min()	(atomic_read(&page_cache_size) * 100 < \
 				page_cache.min_percent * num_physpages)
 
+#define vmlist_access_lock(mm)		spin_lock(&mm->page_table_lock)
+#define vmlist_access_unlock(mm)	spin_unlock(&mm->page_table_lock)
+#define vmlist_modify_lock(mm)		vmlist_access_lock(mm)
+#define vmlist_modify_unlock(mm)	vmlist_access_unlock(mm)
+
+
 #endif /* __KERNEL__ */
 
 #endif
--- /usr/tmp/p_rdiff_a004tQ/shm.c	Mon Oct 18 12:26:05 1999
+++ ipc/shm.c	Mon Oct 18 10:41:05 1999
@@ -462,8 +462,10 @@
 	   > (unsigned long) current->rlim[RLIMIT_AS].rlim_cur)
 		return -ENOMEM;
 	current->mm->total_vm += tmp >> PAGE_SHIFT;
+	vmlist_modify_lock(current->mm);
 	insert_vm_struct(current->mm, shmd);
 	merge_segments(current->mm, shmd->vm_start, shmd->vm_end);
+	vmlist_modify_unlock(current->mm);
 
 	return 0;
 }
--- /usr/tmp/p_rdiff_a004tQ/ptrace.c	Mon Oct 18 12:26:07 1999
+++ kernel/ptrace.c	Mon Oct 18 10:41:33 1999
@@ -80,12 +80,14 @@
 int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write)
 {
 	int copied;
-	struct vm_area_struct * vma = find_extend_vma(tsk, addr);
+	struct vm_area_struct * vma;
 
-	if (!vma)
-		return 0;
-
 	down(&tsk->mm->mmap_sem);
+	vma = find_extend_vma(tsk, addr);
+	if (!vma) {
+		up(&tsk->mm->mmap_sem);
+		return 0;
+	}
 	copied = 0;
 	for (;;) {
 		unsigned long offset = addr & ~PAGE_MASK;
--- /usr/tmp/p_rdiff_a004tQ/mlock.c	Mon Oct 18 12:26:09 1999
+++ mm/mlock.c	Mon Oct 18 10:42:27 1999
@@ -13,7 +13,9 @@
 
 static inline int mlock_fixup_all(struct vm_area_struct * vma, int newflags)
 {
+	vmlist_modify_lock(vma->vm_mm);
 	vma->vm_flags = newflags;
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -26,15 +28,17 @@
 	if (!n)
 		return -EAGAIN;
 	*n = *vma;
-	vma->vm_start = end;
 	n->vm_end = end;
-	vma->vm_offset += vma->vm_start - n->vm_start;
 	n->vm_flags = newflags;
 	if (n->vm_file)
 		get_file(n->vm_file);
 	if (n->vm_ops && n->vm_ops->open)
 		n->vm_ops->open(n);
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_offset += end - vma->vm_start;
+	vma->vm_start = end;
 	insert_vm_struct(current->mm, n);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -47,7 +51,6 @@
 	if (!n)
 		return -EAGAIN;
 	*n = *vma;
-	vma->vm_end = start;
 	n->vm_start = start;
 	n->vm_offset += n->vm_start - vma->vm_start;
 	n->vm_flags = newflags;
@@ -55,7 +58,10 @@
 		get_file(n->vm_file);
 	if (n->vm_ops && n->vm_ops->open)
 		n->vm_ops->open(n);
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_end = start;
 	insert_vm_struct(current->mm, n);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -75,10 +81,7 @@
 	*left = *vma;
 	*right = *vma;
 	left->vm_end = start;
-	vma->vm_start = start;
-	vma->vm_end = end;
 	right->vm_start = end;
-	vma->vm_offset += vma->vm_start - left->vm_start;
 	right->vm_offset += right->vm_start - left->vm_start;
 	vma->vm_flags = newflags;
 	if (vma->vm_file)
@@ -88,8 +91,14 @@
 		vma->vm_ops->open(left);
 		vma->vm_ops->open(right);
 	}
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_offset += start - vma->vm_start;
+	vma->vm_start = start;
+	vma->vm_end = end;
+	vma->vm_flags = newflags;
 	insert_vm_struct(current->mm, left);
 	insert_vm_struct(current->mm, right);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -168,7 +177,9 @@
 			break;
 		}
 	}
+	vmlist_modify_lock(current->mm);
 	merge_segments(current->mm, start, end);
+	vmlist_modify_unlock(current->mm);
 	return error;
 }
 
@@ -240,7 +251,9 @@
 		if (error)
 			break;
 	}
+	vmlist_modify_lock(current->mm);
 	merge_segments(current->mm, 0, TASK_SIZE);
+	vmlist_modify_unlock(current->mm);
 	return error;
 }
 
--- /usr/tmp/p_rdiff_a004tQ/mmap.c	Mon Oct 18 12:26:12 1999
+++ mm/mmap.c	Mon Oct 18 10:43:45 1999
@@ -323,8 +323,10 @@
 	 */
 	flags = vma->vm_flags;
 	addr = vma->vm_start; /* can addr have changed?? */
+	vmlist_modify_lock(mm);
 	insert_vm_struct(mm, vma);
 	merge_segments(mm, vma->vm_start, vma->vm_end);
+	vmlist_modify_unlock(mm);
 	
 	mm->total_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED) {
@@ -527,11 +529,13 @@
 	}
 
 	/* Work out to one of the ends. */
-	if (end == area->vm_end)
+	if (end == area->vm_end) {
 		area->vm_end = addr;
-	else if (addr == area->vm_start) {
+		vmlist_modify_lock(current->mm);
+	} else if (addr == area->vm_start) {
 		area->vm_offset += (end - area->vm_start);
 		area->vm_start = end;
+		vmlist_modify_lock(current->mm);
 	} else {
 	/* Unmapping a hole: area->vm_start < addr <= end < area->vm_end */
 		/* Add end mapping -- leave beginning for below */
@@ -552,10 +556,12 @@
 		if (mpnt->vm_ops && mpnt->vm_ops->open)
 			mpnt->vm_ops->open(mpnt);
 		area->vm_end = addr;	/* Truncate area */
+		vmlist_modify_lock(current->mm);
 		insert_vm_struct(current->mm, mpnt);
 	}
 
 	insert_vm_struct(current->mm, area);
+	vmlist_modify_unlock(current->mm);
 	return extra;
 }
 
@@ -655,6 +661,7 @@
 
 	npp = (prev ? &prev->vm_next : &mm->mmap);
 	free = NULL;
+	vmlist_modify_lock(mm);
 	for ( ; mpnt && mpnt->vm_start < addr+len; mpnt = *npp) {
 		*npp = mpnt->vm_next;
 		mpnt->vm_next = free;
@@ -662,6 +669,8 @@
 		if (mm->mmap_avl)
 			avl_remove(mpnt, &mm->mmap_avl);
 	}
+	mm->mmap_cache = NULL;	/* Kill the cache. */
+	vmlist_modify_unlock(mm);
 
 	/* Ok - we have the memory areas we should free on the 'free' list,
 	 * so release them, and unmap the page range..
@@ -678,6 +687,11 @@
 		end = end > mpnt->vm_end ? mpnt->vm_end : end;
 		size = end - st;
 
+		/*
+		 * The lock_kernel interlocks with kswapd try_to_swap_out
+		 * invoking a driver swapout() method, and being able to
+		 * guarantee vma existence.
+		 */
 		lock_kernel();
 		if (mpnt->vm_ops && mpnt->vm_ops->unmap)
 			mpnt->vm_ops->unmap(mpnt, st, size);
@@ -702,7 +716,6 @@
 
 	free_pgtables(mm, prev, addr, addr+len);
 
-	mm->mmap_cache = NULL;	/* Kill the cache. */
 	return 0;
 }
 
@@ -786,8 +799,10 @@
 	flags = vma->vm_flags;
 	addr = vma->vm_start;
 
+	vmlist_modify_lock(mm);
 	insert_vm_struct(mm, vma);
 	merge_segments(mm, vma->vm_start, vma->vm_end);
+	vmlist_modify_unlock(mm);
 	
 	mm->total_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED) {
@@ -814,7 +829,9 @@
 
 	release_segments(mm);
 	mpnt = mm->mmap;
+	vmlist_modify_lock(mm);
 	mm->mmap = mm->mmap_avl = mm->mmap_cache = NULL;
+	vmlist_modify_unlock(mm);
 	mm->rss = 0;
 	mm->total_vm = 0;
 	mm->locked_vm = 0;
@@ -910,6 +927,7 @@
 		prev = mpnt;
 		mpnt = mpnt->vm_next;
 	}
+	mm->mmap_cache = NULL;		/* Kill the cache. */
 
 	/* prev and mpnt cycle through the list, as long as
 	 * start_addr < mpnt->vm_end && prev->vm_start < end_addr
@@ -946,7 +964,9 @@
 		if (mpnt->vm_ops && mpnt->vm_ops->close) {
 			mpnt->vm_offset += mpnt->vm_end - mpnt->vm_start;
 			mpnt->vm_start = mpnt->vm_end;
+			vmlist_modify_unlock(mm);
 			mpnt->vm_ops->close(mpnt);
+			vmlist_modify_lock(mm);
 		}
 		mm->map_count--;
 		remove_shared_vm_struct(mpnt);
@@ -955,7 +975,6 @@
 		kmem_cache_free(vm_area_cachep, mpnt);
 		mpnt = prev;
 	}
-	mm->mmap_cache = NULL;		/* Kill the cache. */
 }
 
 void __init vma_init(void)
--- /usr/tmp/p_rdiff_a004tQ/mprotect.c	Mon Oct 18 12:26:16 1999
+++ mm/mprotect.c	Mon Oct 18 10:44:41 1999
@@ -82,8 +82,10 @@
 static inline int mprotect_fixup_all(struct vm_area_struct * vma,
 	int newflags, pgprot_t prot)
 {
+	vmlist_modify_lock(vma->vm_mm);
 	vma->vm_flags = newflags;
 	vma->vm_page_prot = prot;
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -97,9 +99,7 @@
 	if (!n)
 		return -ENOMEM;
 	*n = *vma;
-	vma->vm_start = end;
 	n->vm_end = end;
-	vma->vm_offset += vma->vm_start - n->vm_start;
 	n->vm_flags = newflags;
 	n->vm_page_prot = prot;
 	if (n->vm_file)
@@ -106,7 +106,11 @@
 		get_file(n->vm_file);
 	if (n->vm_ops && n->vm_ops->open)
 		n->vm_ops->open(n);
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_offset += end - vma->vm_start;
+	vma->vm_start = end;
 	insert_vm_struct(current->mm, n);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -120,7 +124,6 @@
 	if (!n)
 		return -ENOMEM;
 	*n = *vma;
-	vma->vm_end = start;
 	n->vm_start = start;
 	n->vm_offset += n->vm_start - vma->vm_start;
 	n->vm_flags = newflags;
@@ -129,7 +132,10 @@
 		get_file(n->vm_file);
 	if (n->vm_ops && n->vm_ops->open)
 		n->vm_ops->open(n);
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_end = start;
 	insert_vm_struct(current->mm, n);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -150,13 +156,8 @@
 	*left = *vma;
 	*right = *vma;
 	left->vm_end = start;
-	vma->vm_start = start;
-	vma->vm_end = end;
 	right->vm_start = end;
-	vma->vm_offset += vma->vm_start - left->vm_start;
 	right->vm_offset += right->vm_start - left->vm_start;
-	vma->vm_flags = newflags;
-	vma->vm_page_prot = prot;
 	if (vma->vm_file)
 		atomic_add(2,&vma->vm_file->f_count);
 	if (vma->vm_ops && vma->vm_ops->open) {
@@ -163,8 +164,15 @@
 		vma->vm_ops->open(left);
 		vma->vm_ops->open(right);
 	}
+	vmlist_modify_lock(vma->vm_mm);
+	vma->vm_offset += start - vma->vm_start;
+	vma->vm_start = start;
+	vma->vm_end = end;
+	vma->vm_flags = newflags;
+	vma->vm_page_prot = prot;
 	insert_vm_struct(current->mm, left);
 	insert_vm_struct(current->mm, right);
+	vmlist_modify_unlock(vma->vm_mm);
 	return 0;
 }
 
@@ -246,7 +254,9 @@
 			break;
 		}
 	}
+	vmlist_modify_lock(current->mm);
 	merge_segments(current->mm, start, end);
+	vmlist_modify_unlock(current->mm);
 out:
 	up(&current->mm->mmap_sem);
 	return error;
--- /usr/tmp/p_rdiff_a004tQ/mremap.c	Mon Oct 18 12:26:18 1999
+++ mm/mremap.c	Mon Oct 18 10:45:11 1999
@@ -141,8 +141,10 @@
 				get_file(new_vma->vm_file);
 			if (new_vma->vm_ops && new_vma->vm_ops->open)
 				new_vma->vm_ops->open(new_vma);
+			vmlist_modify_lock(current->mm);
 			insert_vm_struct(current->mm, new_vma);
 			merge_segments(current->mm, new_vma->vm_start, new_vma->vm_end);
+			vmlist_modify_unlock(vma->vm_mm);
 			do_munmap(addr, old_len);
 			current->mm->total_vm += new_len >> PAGE_SHIFT;
 			if (new_vma->vm_flags & VM_LOCKED) {
@@ -220,7 +222,9 @@
 		/* can we just expand the current mapping? */
 		if (max_addr - addr >= new_len) {
 			int pages = (new_len - old_len) >> PAGE_SHIFT;
+			vmlist_modify_lock(vma->vm_mm);
 			vma->vm_end = addr + new_len;
+			vmlist_modify_unlock(vma->vm_mm);
 			current->mm->total_vm += pages;
 			if (vma->vm_flags & VM_LOCKED) {
 				current->mm->locked_vm += pages;
--- /usr/tmp/p_rdiff_a004tQ/vmscan.c	Mon Oct 18 12:26:20 1999
+++ mm/vmscan.c	Mon Oct 18 10:45:52 1999
@@ -47,9 +47,6 @@
 		goto out_failed;
 
 	page = mem_map + MAP_NR(page_addr);
-	spin_lock(&vma->vm_mm->page_table_lock);
-	if (pte_val(pte) != pte_val(*page_table))
-		goto out_failed_unlock;
 
 	/* Don't look at this pte if it's been accessed recently. */
 	if (pte_young(pte)) {
@@ -59,7 +56,7 @@
 		 */
 		set_pte(page_table, pte_mkold(pte));
 		set_bit(PG_referenced, &page->flags);
-		goto out_failed_unlock;
+		goto out_failed;
 	}
 
 	if (PageReserved(page)
@@ -66,7 +63,7 @@
 	    || PageLocked(page)
 	    || ((gfp_mask & __GFP_DMA) && !PageDMA(page))
 	    || (!(gfp_mask & __GFP_BIGMEM) && PageBIGMEM(page)))
-		goto out_failed_unlock;
+		goto out_failed;
 
 	/*
 	 * Is the page already in the swap cache? If so, then
@@ -84,7 +81,7 @@
 		vma->vm_mm->rss--;
 		flush_tlb_page(vma, address);
 		__free_page(page);
-		goto out_failed_unlock;
+		goto out_failed;
 	}
 
 	/*
@@ -111,7 +108,7 @@
 	 * locks etc.
 	 */
 	if (!(gfp_mask & __GFP_IO))
-		goto out_failed_unlock;
+		goto out_failed;
 
 	/*
 	 * Ok, it's really dirty. That means that
@@ -136,9 +133,9 @@
 	if (vma->vm_ops && vma->vm_ops->swapout) {
 		int error;
 		pte_clear(page_table);
-		spin_unlock(&vma->vm_mm->page_table_lock);
-		flush_tlb_page(vma, address);
 		vma->vm_mm->rss--;
+		flush_tlb_page(vma, address);
+		vmlist_access_unlock(vma->vm_mm);
 		error = vma->vm_ops->swapout(vma, page);
 		if (!error)
 			goto out_free_success;
@@ -154,14 +151,14 @@
 	 */
 	entry = acquire_swap_entry(page);
 	if (!entry)
-		goto out_failed_unlock; /* No swap space left */
+		goto out_failed; /* No swap space left */
 		
 	if (!(page = prepare_bigmem_swapout(page)))
-		goto out_swap_free_unlock;
+		goto out_swap_free;
 
 	vma->vm_mm->rss--;
 	set_pte(page_table, __pte(entry));
-	spin_unlock(&vma->vm_mm->page_table_lock);
+	vmlist_access_unlock(vma->vm_mm);
 
 	flush_tlb_page(vma, address);
 	swap_duplicate(entry);	/* One for the process, one for the swap cache */
@@ -175,13 +172,9 @@
 out_free_success:
 	__free_page(page);
 	return 1;
-out_failed_unlock:
-	spin_unlock(&vma->vm_mm->page_table_lock);
-out_failed:
-	return 0;
-out_swap_free_unlock:
+out_swap_free:
 	swap_free(entry);
-	spin_unlock(&vma->vm_mm->page_table_lock);
+out_failed:
 	return 0;
 
 }
@@ -293,8 +286,10 @@
 	address = mm->swap_address;
 
 	/*
-	 * Find the proper vm-area
+	 * Find the proper vm-area after freezing the vma chain 
+	 * and ptes.
 	 */
+	vmlist_access_lock(mm);
 	vma = find_vma(mm, address);
 	if (vma) {
 		if (address < vma->vm_start)
@@ -310,6 +305,7 @@
 			address = vma->vm_start;
 		}
 	}
+	vmlist_access_unlock(mm);
 
 	/* We didn't find anything for the process */
 	mm->swap_cnt = 0;
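
For readers skimming the diff, here is a minimal illustrative sketch -- not part
of the patch -- of the locking discipline it establishes: a vma-list writer holds
mmap_sem across the whole operation and takes vmlist_modify_lock only around the
actual list/field update, while the page stealer walks the chain under
vmlist_access_lock alone. The vmlist_* primitives and the mmap_sem usage are
taken from the patch; the example_* function names are hypothetical.

/*
 * Illustrative sketch only -- not part of the patch.  A vma-list
 * writer holds mmap_sem across the whole operation to serialize
 * against other modifiers, and takes vmlist_modify_lock just around
 * the update so the page stealer never sees a half-modified chain.
 */
static int example_grow_vma(struct mm_struct *mm,
			    struct vm_area_struct *vma,
			    unsigned long new_end)
{
	down(&mm->mmap_sem);		/* serialize against mmap/munmap/fault paths */
	vmlist_modify_lock(mm);		/* lock out the vma-list scanners */
	vma->vm_end = new_end;		/* a field the page stealer looks at */
	vmlist_modify_unlock(mm);
	up(&mm->mmap_sem);
	return 0;
}

/*
 * The page stealer cannot take mmap_sem, so it walks the chain under
 * vmlist_access_lock alone and must not sleep while holding it.
 */
static void example_scan(struct mm_struct *mm, unsigned long address)
{
	struct vm_area_struct *vma;

	vmlist_access_lock(mm);
	for (vma = find_vma(mm, address); vma; vma = vma->vm_next) {
		/* examine vma->vm_start, vm_end, etc. without sleeping */
	}
	vmlist_access_unlock(mm);
}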

* Re: [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection
  1999-10-18 19:45 ` Kanoj Sarcar
@ 1999-10-18 20:02   ` Linus Torvalds
  0 siblings, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 1999-10-18 20:02 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: linux-mm


On Mon, 18 Oct 1999, Kanoj Sarcar wrote:
> 
> Here is the reworked vma scanning protection patch against 2.3.22.
> This patch has to get one less lock in the page stealing path 
> compared to the previous patch that I posted. Let me know if this 
> looks okay now, and I will send you an incremental swapout() interface 
> cleanup patch that we have discussed.

Looks ok to me now. Thanks,

		Linus


Thread overview: 15+ messages
1999-10-15  0:06 [PATCH] kanoj-mm17-2.3.21 kswapd vma scanning protection Kanoj Sarcar
1999-10-15 11:58 ` Manfred Spraul
1999-10-15 16:38   ` Kanoj Sarcar
1999-10-15 18:26     ` Manfred Spraul
1999-10-15 18:43       ` Kanoj Sarcar
1999-10-15 20:47         ` Manfred Spraul
1999-10-15 21:13           ` Kanoj Sarcar
1999-10-15 21:24             ` Linus Torvalds
1999-10-15 21:39               ` Kanoj Sarcar
1999-10-15 22:04                 ` Linus Torvalds
1999-10-15 22:32                   ` Kanoj Sarcar
1999-10-15 23:16               ` Manfred Spraul
1999-10-16  0:44                 ` Linus Torvalds
1999-10-18 19:45 ` Kanoj Sarcar
1999-10-18 20:02   ` Linus Torvalds
