Subject: VM oops in 2.4.20, attempted patch
From: Dominik Vogt @ 2003-06-20 11:00 UTC
  To: linux-kernel-owner, linux-mm

[-- Attachment #1: Type: text/plain, Size: 4029 bytes --]

There is a bug in the 2.4.20 VM that causes a kernel oops.  The
reason is that the mm_struct of a task is sometimes accessed
while it is being deleted.  The function exec_mmap() in
fs/exec.c begins with this code:

  old_mm = current->mm;
  if (old_mm && atomic_read(&old_mm->mm_users) == 1) {
          mm_release();
          exit_mmap(old_mm);
          return 0;
  }

The logic assumes that if mm_users == 1, no other process can
access the task's mm_struct concurrently.  But that is not true.
A number of places happily use the mm_struct even while it is
being destroyed via the exit_mmap() call in exec_mmap().  There
are definitely oopses caused by the way the proc fs code
accesses the mm_struct.

Example from proc_pid_read_maps(), fs/proc/array.c:

  task_lock(task);
  mm = task->mm;
  if (mm)
          atomic_inc(&mm->mm_users);
  task_unlock(task);
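
To make the window concrete, here is a minimal userspace sketch of
the same check-then-act race, written with plain pthreads and C11
atomics (the names are made up; this is an analogy, not kernel
code).  One thread plays exec_mmap() and tears the structure down
when it sees a count of one; the other plays the proc reader, which
bumps the count only after the first thread has already passed its
check:

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  struct fake_mm {
    atomic_int users;  /* stands in for mm_users */
    int *mappings;     /* stands in for the address space */
  };

  static struct fake_mm fmm;

  /* plays exec_mmap(): "I am the only user, so I may tear it down" */
  static void *exec_side(void *arg)
  {
    if (atomic_load(&fmm.users) == 1) {
      usleep(1000);        /* widen the race window for the demo */
      free(fmm.mappings);  /* plays exit_mmap() */
      fmm.mappings = NULL;
    }
    return NULL;
  }

  /* plays the proc reader: "bump the count, then use the mm" */
  static void *proc_side(void *arg)
  {
    atomic_fetch_add(&fmm.users, 1);  /* too late: the check already ran */
    if (fmm.mappings)
      printf("%d\n", fmm.mappings[0]);  /* may touch freed memory */
    atomic_fetch_sub(&fmm.users, 1);
    return NULL;
  }

  int main(void)
  {
    pthread_t a, b;

    atomic_init(&fmm.users, 1);
    fmm.mappings = malloc(sizeof(int));
    fmm.mappings[0] = 42;
    pthread_create(&a, NULL, exec_side, NULL);
    pthread_create(&b, NULL, proc_side, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
  }

Depending on timing the reader either runs to completion or touches
freed memory; the kernel race has exactly this shape.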

I am not exactly sure about some other code.  The following code
snippets look suspicious, but I am not deep enough into the VM
code to judge whether they pose a problem or not:

access_process_vm(), kernel/ptrace.c:

  /* Worry about races with exit() */
  task_lock(tsk);
  mm = tsk->mm;
  if (mm)
    atomic_inc(&mm->mm_users);
  task_unlock(tsk);
  /* ... code fiddles with mm */

Calls to access_process_vm() eventually originate from either the
ptrace() system call (on some architectures) or the proc fs.  I
believe that this code has the potential to oops the kernel, but I
have no proof.

swap_out(), mm/vmscan.c:

  spin_lock(&mmlist_lock);
  mm = swap_mm;
  while (mm->swap_address == TASK_SIZE || mm == &init_mm)
  {
    mm->swap_address = 0;
    mm = list_entry(mm->mmlist.next, struct mm_struct, mmlist);
    if (mm == swap_mm)
      goto empty;
    swap_mm = mm;
  }
  /* Make sure the mm doesn't disappear when we drop the lock.. */
  atomic_inc(&mm->mm_users);
  spin_unlock(&mmlist_lock);

Since swap_out() can be called by any process, for example via

  ...
  create_buffers()
  free_more_memory()
  try_to_free_pages()
  try_to_free_pages_zone()
  shrink_caches()
  shrink_cache()
  swap_out()

it can happen that an mm_struct is picked for swap-out while it is
being destroyed by execve().

There is more suspicious code related to swapping in
try_to_unuse(), mm/swapfile.c.  Also, mm_users is accessed in an
unsafe fashion in the various architecture-dependent smp.c files.
For example, smp_flush_tlb_mm() in arch/sparc64/kernel/smp.c:

  if (atomic_read(&mm->mm_users) == 1) {
    /* See smp_flush_tlb_page for info about this. */
    mm->cpu_vm_mask = (1UL << cpu);
    goto local_flush_and_out;
  }
  /* ... */

This is probably safe, but it may flush the TLB unnecessarily (or
possibly skip a flush that is actually needed?).

----

Basically, I think the problem is that the original VM design did
not take concurrent access to the mm_struct of a process into
account.  It seems to assume that all processes accessing the
mm_struct are clones.  If that were the case, mm_users == 1 would
mean that only the current process is using the mm_struct, and it
could do with it as it pleases, since other processes could not
spontaneously start using the same structure.  Unfortunately this
is not true, as detailed above.  As a consequence, the lightweight
locking scheme built on the mm_users counter falls apart.
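
For comparison, the check-and-increment can be made safe without any
extra lock by refusing to take a reference once the counter has
reached zero.  A compare-and-swap loop does this; below is a minimal
sketch in C11 atomics (the helper name is made up, though newer
kernels provide this very idiom as atomic_inc_not_zero()):

  #include <stdatomic.h>
  #include <stdbool.h>

  /*
   * Take a reference only while the count is still positive.
   * Returns true if the caller now holds a reference, false if
   * the mm is already on its way out.
   */
  static bool users_inc_not_zero(atomic_int *users)
  {
    int old = atomic_load(users);

    /* retry until we install old + 1 or see the count reach zero */
    while (old > 0) {
      if (atomic_compare_exchange_weak(users, &old, old + 1))
        return true;
    }
    return false;
  }

Of course this only closes the race if the exec side actually drops
the counter to zero before calling exit_mmap(), so that a late
reader sees the zero and backs off.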

----

As for the severity of the problem we are having:  We have a
cron job that triggers an lsof every minute.  Reading the
/proc/<pid>/maps files of all processes sometimes collides with
another job, also run once per minute, that calls execve().  With
about 200 machines we get roughly 100 oops messages per year
caused by this problem.

----

I have attempted to write a patch that fixes the locking problems
without adding too much locking overhead, but I did not get very
far.  It uses a semaphore to protect read/write access to
mm_users without locking the whole mm_struct.  The patch is
attached below for reference.  Do not use it: it deadlocks the
kernel within a couple of seconds.
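
For orientation while reading the diff, this is the protocol the
patch aims for, distilled into a standalone userspace sketch (stub
struct; a pthreads rwlock stands in for the kernel rw_semaphore;
again, this shows only the intended protocol, since the patch as
posted deadlocks):

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stddef.h>

  struct fake_mm {
    atomic_int users;            /* stands in for mm_users */
    pthread_rwlock_t users_sem;  /* stands in for mm_users_sem */
  };

  /* exec side: decide under the write lock whether we are the last user */
  static int is_last_user(struct fake_mm *mm)
  {
    int last;

    pthread_rwlock_wrlock(&mm->users_sem);
    last = (atomic_load(&mm->users) == 1);
    if (last)
      atomic_fetch_sub(&mm->users, 1);  /* count hits 0: mm is dying */
    pthread_rwlock_unlock(&mm->users_sem);
    return last;  /* if true, the caller may tear the mm down */
  }

  /* reader side, as in the patch's mmhold(): reference only a live mm */
  static struct fake_mm *hold(struct fake_mm *mm)
  {
    struct fake_mm *ret = NULL;

    pthread_rwlock_rdlock(&mm->users_sem);
    if (atomic_load(&mm->users) >= 1) {
      atomic_fetch_add(&mm->users, 1);  /* writer is excluded, so safe */
      ret = mm;
    }
    pthread_rwlock_unlock(&mm->users_sem);
    return ret;
  }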

Bye

Dominik ^_^  ^_^

[-- Attachment #2: vmcrash-0.0.4_27_may_2003.patch --]
[-- Type: text/plain, Size: 5778 bytes --]

diff -r -u linux-2.4.20.orig/fs/exec.c linux-2.4.20/fs/exec.c
--- linux-2.4.20.orig/fs/exec.c	Fri Nov 29 00:53:15 2002
+++ linux-2.4.20/fs/exec.c	Thu May 22 10:30:56 2003
@@ -413,12 +413,21 @@
 static int exec_mmap(void)
 {
 	struct mm_struct * mm, * old_mm;
+	int cnt;
 
 	old_mm = current->mm;
-	if (old_mm && atomic_read(&old_mm->mm_users) == 1) {
-		mm_release();
-		exit_mmap(old_mm);
-		return 0;
+	if (old_mm) {
+		down_write(&old_mm->mm_users_sem);
+		if ((cnt = atomic_read(&old_mm->mm_users)) == 1) {
+			atomic_dec(&old_mm->mm_users);
+		}
+		up_write(&old_mm->mm_users_sem);
+		if (cnt == 1) {
+			mm_release();
+			exit_mmap(old_mm);
+			atomic_inc(&old_mm->mm_users);
+			return 0;
+		}
 	}
 
 	mm = mm_alloc();
diff -r -u linux-2.4.20.orig/fs/proc/array.c linux-2.4.20/fs/proc/array.c
--- linux-2.4.20.orig/fs/proc/array.c	Sat Aug  3 02:39:45 2002
+++ linux-2.4.20/fs/proc/array.c	Thu May 22 10:53:28 2003
@@ -281,9 +281,7 @@
 	buffer = task_name(task, buffer);
 	buffer = task_state(task, buffer);
 	task_lock(task);
-	mm = task->mm;
-	if(mm)
-		atomic_inc(&mm->mm_users);
+	mm = mmhold_task(task);
 	task_unlock(task);
 	if (mm) {
 		buffer = task_mem(mm, buffer);
@@ -311,9 +309,7 @@
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
 	task_lock(task);
-	mm = task->mm;
-	if(mm)
-		atomic_inc(&mm->mm_users);
+	mm = mmhold_task(task);
 	if (task->tty) {
 		tty_pgrp = task->tty->pgrp;
 		tty_nr = kdev_t_to_nr(task->tty->device);
@@ -477,9 +473,7 @@
 	int size=0, resident=0, share=0, trs=0, lrs=0, drs=0, dt=0;
 
 	task_lock(task);
-	mm = task->mm;
-	if(mm)
-		atomic_inc(&mm->mm_users);
+	mm = mmhold_task(task);
 	task_unlock(task);
 	if (mm) {
 		struct vm_area_struct * vma;
@@ -619,9 +613,7 @@
 		goto out_free1;
 
 	task_lock(task);
-	mm = task->mm;
-	if (mm)
-		atomic_inc(&mm->mm_users);
+	mm = mmhold_task(task);
 	task_unlock(task);
 	retval = 0;
 	if (!mm)
diff -r -u linux-2.4.20.orig/fs/proc/base.c linux-2.4.20/fs/proc/base.c
--- linux-2.4.20.orig/fs/proc/base.c	Sat Aug  3 02:39:45 2002
+++ linux-2.4.20/fs/proc/base.c	Thu May 22 10:53:56 2003
@@ -60,9 +60,7 @@
 	struct task_struct *task = inode->u.proc_i.task;
 
 	task_lock(task);
-	mm = task->mm;
-	if (mm)
-		atomic_inc(&mm->mm_users);
+	mm = mmhold_task(task);
 	task_unlock(task);
 	if (!mm)
 		goto out;
@@ -129,9 +127,7 @@
 	struct mm_struct *mm;
 	int res = 0;
 	task_lock(task);
-	mm = task->mm;
-	if (mm)
-		atomic_inc(&mm->mm_users);
+	mm = mmhold_task(task);
 	task_unlock(task);
 	if (mm) {
 		int len = mm->env_end - mm->env_start;
@@ -148,9 +144,7 @@
 	struct mm_struct *mm;
 	int res = 0;
 	task_lock(task);
-	mm = task->mm;
-	if (mm)
-		atomic_inc(&mm->mm_users);
+	mm = mmhold_task(task);
 	task_unlock(task);
 	if (mm) {
 		int len = mm->arg_end - mm->arg_start;
@@ -356,9 +350,7 @@
 		return -ENOMEM;
 
 	task_lock(task);
-	mm = task->mm;
-	if (mm)
-		atomic_inc(&mm->mm_users);
+	mm = mmhold_task(task);
 	task_unlock(task);
 	if (!mm)
 		return 0;
diff -r -u linux-2.4.20.orig/include/linux/sched.h linux-2.4.20/include/linux/sched.h
--- linux-2.4.20.orig/include/linux/sched.h	Fri Nov 29 00:53:15 2002
+++ linux-2.4.20/include/linux/sched.h	Thu May 22 11:34:34 2003
@@ -213,6 +213,7 @@
 	struct vm_area_struct * mmap_cache;	/* last find_vma result */
 	pgd_t * pgd;
 	atomic_t mm_users;			/* How many users with user space? */
+	struct rw_semaphore mm_users_sem;	/* protects access to mm_users */
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
@@ -244,6 +245,7 @@
 	mm_rb:		RB_ROOT,			\
 	pgd:		swapper_pg_dir, 		\
 	mm_users:	ATOMIC_INIT(2), 		\
+	mm_users_sem:	__RWSEM_INITIALIZER(name.mm_users_sem), \
 	mm_count:	ATOMIC_INIT(1), 		\
 	mmap_sem:	__RWSEM_INITIALIZER(name.mmap_sem), \
 	page_table_lock: SPIN_LOCK_UNLOCKED, 		\
@@ -769,6 +771,9 @@
 		__mmdrop(mm);
 }
 
+/* increases the usage counter of the mm_struct */
+struct mm_struct * mmhold(struct mm_struct *mm);
+struct mm_struct * mmhold_task(struct task_struct *tsk);
 /* mmput gets rid of the mappings and all user-space */
 extern void mmput(struct mm_struct *);
 /* Remove the current tasks stale references to the old mm_struct */
diff -r -u linux-2.4.20.orig/kernel/fork.c linux-2.4.20/kernel/fork.c
--- linux-2.4.20.orig/kernel/fork.c	Fri Nov 29 00:53:15 2002
+++ linux-2.4.20/kernel/fork.c	Mon May 26 17:17:22 2003
@@ -257,6 +257,45 @@
 }
 
 /*
+ * Test whether the mm_users counter of the mm_struct is
+ * positive and increase it by one if it is.  Do not touch
+ * the counter if it is zero or less, i.e. the mm is being
+ * created or destroyed.
+ *
+ * Returns mm if the mm could be held and NULL otherwise.
+ */
+struct mm_struct * mmhold(struct mm_struct *mm)
+{
+	struct mm_struct *ret_mm;
+
+	down_read(&mm->mm_users_sem);
+	if (likely(atomic_read(&mm->mm_users) >= 1)) {
+		atomic_inc(&mm->mm_users);
+		ret_mm = mm;
+	} else {
+		ret_mm = NULL;
+	}
+	up_read(&mm->mm_users_sem);
+
+	return ret_mm;
+}
+
+/*
+ * Calls mmhold with the task's mm.
+ */
+struct mm_struct * mmhold_task(struct task_struct *tsk)
+{
+	struct mm_struct *mm;
+
+	if (tsk->mm != NULL)
+		mm = mmhold(tsk->mm);
+	else
+		mm = NULL;
+
+	return mm;
+}
+
+/*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
  * mmput. Free the page directory and the mm.
diff -r -u linux-2.4.20.orig/kernel/ptrace.c linux-2.4.20/kernel/ptrace.c
--- linux-2.4.20.orig/kernel/ptrace.c	Sat Aug  3 02:39:46 2002
+++ linux-2.4.20/kernel/ptrace.c	Thu May 22 12:13:24 2003
@@ -135,9 +135,7 @@
 
 	/* Worry about races with exit() */
 	task_lock(tsk);
-	mm = tsk->mm;
-	if (mm)
-		atomic_inc(&mm->mm_users);
+	mm = mmhold_task(tsk);
 	task_unlock(tsk);
 	if (!mm)
 		return 0;
