There is a bug in the 2.4.20 VM that causes a kernel oops. The reason
is that sometimes the mm_struct of a task is accessed while it is
being deleted. The function exec_mmap() in fs/exec.c begins with this
code:

  old_mm = current->mm;
  if (old_mm && atomic_read(&old_mm->mm_users) == 1) {
          mm_release();
          exit_mmap(old_mm);
          return 0;
  }

The logic assumes that if mm_users == 1, no other process can access
the task's mm_struct concurrently. But that is not true. There are a
number of places that happily use the mm_struct even while it is being
destroyed via the exit_mmap() call in exec_mmap().

There are definitely oopses caused by the way the proc fs code
accesses the mm_struct. Example from proc_pid_read_maps() in
fs/proc/array.c:

  task_lock(task);
  mm = task->mm;
  if (mm)
          atomic_inc(&mm->mm_users);
  task_unlock(task);

I am not exactly sure about some other code. The following code
snippets look suspicious, but I am not deep enough into the VM code to
judge whether they pose a problem or not.

access_process_vm(), kernel/ptrace.c:

  /* Worry about races with exit() */
  task_lock(tsk);
  mm = tsk->mm;
  if (mm)
          atomic_inc(&mm->mm_users);
  task_unlock(tsk);
  /* ... code fiddles with mm */

Calls to access_process_vm() eventually originate from either the
ptrace() system call (on some architectures) or the proc fs. I believe
that this code has the potential to make the kernel oops, but I have
no proof.

swap_out(), mm/vmscan.c:

  spin_lock(&mmlist_lock);
  mm = swap_mm;
  while (mm->swap_address == TASK_SIZE || mm == &init_mm) {
          mm->swap_address = 0;
          mm = list_entry(mm->mmlist.next, struct mm_struct, mmlist);
          if (mm == swap_mm)
                  goto empty;
          swap_mm = mm;
  }

  /* Make sure the mm doesn't disappear when we drop the lock.. */
  atomic_inc(&mm->mm_users);
  spin_unlock(&mmlist_lock);

Since swap_out() can be called by any process, for example via
  create_buffers()
    free_more_memory()
      try_to_free_pages()
        try_to_free_pages_zone()
          shrink_caches()
            shrink_cache()
              swap_out()

it might occur that an mm_struct is swapped out while it is being
destroyed by execve().

There is more suspicious code related to swapping in try_to_unuse(),
mm/swapfile.c. Also, mm_users is accessed in an unsafe fashion in the
various architecture dependent smp.c files. For example
smp_flush_tlb_mm(), arch/sparc64/kernel/smp.c:

  if (atomic_read(&mm->mm_users) == 1) {
          /* See smp_flush_tlb_page for info about this. */
          mm->cpu_vm_mask = (1UL << cpu);
          goto local_flush_and_out;
  }
  /* ... */

This is probably safe, but may flush the tlb unnecessarily (or
possibly forget to flush the tlb although it has to?).

----

Basically, I think the problem is that the original VM design did not
take care of concurrent access to the mm_struct of a process. It seems
to assume that all processes accessing the mm_struct are clones. If
that were the case, mm_users == 1 would mean that only the current
process is using the mm_struct, and it could do with it as it pleases,
since other processes could not spontaneously start using the same
structure. Unfortunately this is not true, as detailed above. As a
consequence, the lightweight locking scheme using the mm_users counter
falls apart.

----

As for the severity of the problems we are having: we have a cron job
that triggers an lsof every minute. Reading the /proc/<pid>/maps files
of all processes sometimes collides with another process, also running
once per minute, that calls execve(). With about 200 machines we get
about 100 oops messages per year caused by this problem.

----

I have attempted to write a patch that fixes the locking problems
without adding too much locking overhead, but did not get very far. It
uses a semaphore to protect read/write access to mm_users but does not
lock the whole mm_struct. The patch is attached below for reference.
Do not use it, since it deadlocks the kernel within a couple of
seconds.

Bye

Dominik ^_^  ^_^