Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
       [not found] ` <Pine.LNX.3.96.990127131315.19147A-100000@laser.bogus>
@ 1999-01-27 21:38   ` Stephen C. Tweedie
  1999-01-27 21:45     ` Linus Torvalds
  1999-01-28  1:02     ` Andrea Arcangeli
  0 siblings, 2 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 1999-01-27 21:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm

Hi,

On Wed, 27 Jan 1999 13:15:25 +0100 (CET), Andrea Arcangeli
<andrea@e-mind.com> said:

> - * Fixed a race between mmget() and mmput(): added the mm_lock spinlock in
> - * the task_struct to serialize accesses to the tsk->mm field.
> + * Fixed a race between mmget() and mmput(): added the mm_lock spinlock
> + * to serialize accesses to the tsk->mm field.

I don't buy it, because we've seen these on UP machines too.  Besides,
in all of the fork/exit/procfs code paths which look to be relevant to
the reported oopses, we already hold the global kernel lock by the time
we start fiddling with the mm references.  Adding yet another spinlock
should make no difference at all to the locking.

I think the real problem is in the new procfs code in fs/proc/array.c
calling mmget()/mmput() the way it does.  The get_mm_and_lock() function
tries to be safe, but in a couple of places we do not use it, instead
doing something like:

	get_status() {
		tsk = grab_task(pid);
		task_mem() {
			down(&mm->mmap_sem);
			/* Walk the vma tree */
			up(&mm->mmap_sem);
		}
		release_task(tsk);
	}

grab_task() includes

	if (tsk && tsk->mm && tsk->mm != &init_mm)
		mmget(tsk->mm);

and release_task does

	if (tsk != current && tsk->mm && tsk->mm != &init_mm)
		mmput(tsk->mm);

In other words, we are grabbing a task, pinning the mm_struct, blocking
for the mm semaphore and releasing the task's *current* mm_struct, which
is not necessarily the same as it was before we blocked.  In particular,
if we catch the process during a fork+exec, then it is perfectly
possible for tsk->mm to change here.

Note that this was not a problem in the original version of this patch
which just copied the task_struct in grab_task(), because that way we
would always be releasing the same mm that we grabbed.

Can anybody give a good reason why we need to block in array.c at all?
I don't see why we need the down/up(&mm->mmap_sem) code, since we should
already hold the global kernel lock and the vma tree should remain
intact while we inspect it as long as we do not drop that lock.  If we
do keep that lock intact from beginning to end then we cannot run the
risk of tsk->mm changing out from under our feet.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-27 21:38   ` [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Stephen C. Tweedie
@ 1999-01-27 21:45     ` Linus Torvalds
  1999-01-28  1:02     ` Andrea Arcangeli
  1 sibling, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 1999-01-27 21:45 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrea Arcangeli, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm


On Wed, 27 Jan 1999, Stephen C. Tweedie wrote:
> 
> In other words, we are grabbing a task, pinning the mm_struct, blocking
> for the mm semaphore and releasing the task's *current* mm_struct, which
> is not necessarily the same as it was before we blocked.  In particular,
> if we catch the process during a fork+exec, then it is perfectly
> possible for tsk->mm to change here.

Good point. I reverted part of the array.c patches in the released 2.2.0,
it seems I should really have reverted them all.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-27 21:38   ` [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Stephen C. Tweedie
  1999-01-27 21:45     ` Linus Torvalds
@ 1999-01-28  1:02     ` Andrea Arcangeli
  1999-01-28  2:50       ` Andrea Arcangeli
  1999-01-28 15:05       ` Stephen C. Tweedie
  1 sibling, 2 replies; 34+ messages in thread
From: Andrea Arcangeli @ 1999-01-28  1:02 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm

On Wed, 27 Jan 1999, Stephen C. Tweedie wrote:

> > + * Fixed a race between mmget() and mmput(): added the mm_lock spinlock
> > + * to serialize accesses to the tsk->mm field.
> 
> I don't buy it, because we've seen these on UP machines too.  Besides,

Yes the comment/credits are completly bogus. Excuse me, I was too tired to
understand this last night (and btw I thought it was not sure if it was
happening also on UP).

> in all of the fork/exit/procfs code paths which look to be relevant to
> the reported oopses, we already hold the global kernel lock by the time
> we start fiddling with the mm references.  Adding yet another spinlock
> should make no difference at all to the locking.

I think this too. My new code made tons of sense to me and since when I
finished all my work everything become rock solid, I posted the thing to
the list. I don't like lock_kernel(), but yes I noticed too that
lock_kernel() should be enough (I noticed it today, last night I had too
much lack of sleep to think more about it).

> 	get_status() {
> 		tsk = grab_task(pid);
> 		task_mem() {
> 			down(&mm->mmap_sem);

My reason to reinsert the memcpy() was different than your one. It's
because sys_wait4 don't hold the kernel lock and does _only_ a
spin_lock_irq(tasklist_lock) and then remove the process from the
tasklist, so if we don't want to read_lock(tasklist_lock)  all the time in
array.c we must do the copy of the tsk, to be sure that the task_struct we
are playing with, still exists. But holding the tasklist_lock all the time
looked not safe to me due the down() in the array.c code...

Maybe I've thought stupid/wrong things but with my whole patch applyed the
kernel become rock solid and race-free. I'm sure of this. Otherwise I
would have not posted so sure of myself ;).

Can somebody tell me _exactly_ what the mmap_sem stays for? The kernel is
~always doing a down on the mmap_sem of the process itself.  It's
_useless_ that way. The only place the kernel is doing a down on another
task seems ptrace.c and fs/proc/array.c, so does we have the mmap_sem only
for handling correctly such two cases? 

And is true as I think that doing a down(current->mm->mmap_sem) is
_useless_ if every other place on the kernel only does only the same?

And finally I reask your question: can we at any time play with the ->mm,
mm->vma, pgd, pmd, pte, of a process without helding any semaphore, only
having the big kernel lock held?

Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28  1:02     ` Andrea Arcangeli
@ 1999-01-28  2:50       ` Andrea Arcangeli
  1999-01-28  4:20         ` [patch] fixed both processes in D state and the /proc/ oopses Tom Holroyd
                           ` (3 more replies)
  1999-01-28 15:05       ` Stephen C. Tweedie
  1 sibling, 4 replies; 34+ messages in thread
From: Andrea Arcangeli @ 1999-01-28  2:50 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm

Do you want to know why last night I added a spinlock around mmget/mmput
without thinking twice?  Simply because mm->count was an atomic_t while it
doesn't need to be an atomic_t in first place.

Obviously if it mm->count need to be an atomic_t, we strictly _need_ also
my new mm_lock spinlock in 2.2.1.

So you don't buy my code, but now, I don't buy both all /proc mmget/mmput
sutff and the mm->count atomic_t.

I also avoided all the mmget/mmput stuff in /proc since it _has_ to run
just serialized by the mmget/mmput semaphore (otherwise we need
my mm_lock).

I also removed the semaphore stuff because it seems to me (and to you btw
;) that _all_ places in the mm that does a down(current->mm->mmap_sem),
does always then a lock_kernel().

I also removed all the memcpy, we only need the read_lock(tasklist_lock)
held in SMP because otherwise wait4() could remove the stack of the
process under our eyes as just pointed out in the last email.

Not doing in 2.2.1 the mm->count s/atomic_t/int/ due worry of races will
mean that array.c in 2.2.1 will be not safe enough without my mm_lock
spinlock. Do you understand my point?

I rediffed everything against clean 2.2.0, and I am now running it without
any problem, seems rock solid from a /proc race point of view.

Index: arch/i386/kernel/ldt.c
===================================================================
RCS file: /var/cvs/linux/arch/i386/kernel/ldt.c,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 ldt.c
--- ldt.c	1999/01/18 01:28:57	1.1.1.1
+++ ldt.c	1999/01/27 20:09:21
@@ -83,7 +83,7 @@
 			set_ldt_desc(i, ldt, LDT_ENTRIES);
 			current->tss.ldt = _LDT(i);
 			load_ldt(i);
-			if (atomic_read(&mm->count) > 1)
+			if (mm->count > 1)
 				printk(KERN_WARNING
 					"LDT allocated for cloned task!\n");
 		} else {
Index: fs/binfmt_aout.c
===================================================================
RCS file: /var/cvs/linux/fs/binfmt_aout.c,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 binfmt_aout.c
--- binfmt_aout.c	1999/01/18 01:26:49	1.1.1.1
+++ binfmt_aout.c	1999/01/27 20:09:21
@@ -101,7 +101,7 @@
 #       define START_STACK(u)   (u.start_stack)
 #endif
 
-	if (!current->dumpable || atomic_read(&current->mm->count) != 1)
+	if (!current->dumpable || current->mm->count != 1)
 		return 0;
 	current->dumpable = 0;
 
Index: fs/binfmt_elf.c
===================================================================
RCS file: /var/cvs/linux/fs/binfmt_elf.c,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 binfmt_elf.c
--- binfmt_elf.c	1999/01/18 01:26:49	1.1.1.1
+++ binfmt_elf.c	1999/01/27 20:09:21
@@ -1067,7 +1067,7 @@
 
 	if (!current->dumpable ||
 	    limit < ELF_EXEC_PAGESIZE ||
-	    atomic_read(&current->mm->count) != 1)
+	    current->mm->count != 1)
 		return 0;
 	current->dumpable = 0;
 
Index: fs/exec.c
===================================================================
RCS file: /var/cvs/linux/fs/exec.c,v
retrieving revision 1.1.1.2
diff -u -r1.1.1.2 exec.c
--- exec.c	1999/01/23 16:28:07	1.1.1.2
+++ exec.c	1999/01/28 01:47:52
@@ -378,7 +378,7 @@
 	struct mm_struct * mm, * old_mm;
 	int retval, nr;
 
-	if (atomic_read(&current->mm->count) == 1) {
+	if (current->mm->count == 1) {
 		flush_cache_mm(current->mm);
 		mm_release();
 		release_segments(current->mm);
Index: fs/proc/array.c
===================================================================
RCS file: /var/cvs/linux/fs/proc/array.c,v
retrieving revision 1.1.1.4
diff -u -r1.1.1.4 array.c
--- array.c	1999/01/26 18:30:44	1.1.1.4
+++ array.c	1999/01/28 02:32:48
@@ -388,31 +388,6 @@
 	return sprintf(buffer, "%s\n", saved_command_line);
 }
 
-/*
- * Caller must release_mm the mm_struct later.
- * You don't get any access to init_mm.
- */
-static struct mm_struct *get_mm_and_lock(int pid)
-{
-	struct mm_struct *mm = NULL;
-	struct task_struct *tsk;
-
-	read_lock(&tasklist_lock);
-	tsk = find_task_by_pid(pid);
-	if (tsk && tsk->mm && tsk->mm != &init_mm)
-		mmget(mm = tsk->mm);
-	read_unlock(&tasklist_lock);
-	if (mm != NULL)
-		down(&mm->mmap_sem);
-	return mm;
-}
-
-static void release_mm(struct mm_struct *mm)
-{
-	up(&mm->mmap_sem);
-	mmput(mm);
-}
-
 static unsigned long get_phys_addr(struct mm_struct *mm, unsigned long ptr)
 {
 	pgd_t *page_dir;
@@ -480,27 +455,35 @@
 
 static int get_env(int pid, char * buffer)
 {
-	struct mm_struct *mm;
+	struct task_struct * tsk;
 	int res = 0;
 
-	mm = get_mm_and_lock(pid);
-	if (mm) {
+	read_lock(&tasklist_lock);
+	tsk = find_task_by_pid(pid);
+	if (tsk)
+	{
+		struct mm_struct * mm = tsk->mm;
 		res = get_array(mm, mm->env_start, mm->env_end, buffer);
-		release_mm(mm);
 	}
+	read_unlock(&tasklist_lock);
+
 	return res;
 }
 
 static int get_arg(int pid, char * buffer)
 {
-	struct mm_struct *mm;
+	struct task_struct * tsk;
 	int res = 0;
 
-	mm = get_mm_and_lock(pid);
-	if (mm) {
+	read_lock(&tasklist_lock);
+	tsk = find_task_by_pid(pid);
+	if (tsk)
+	{
+		struct mm_struct * mm = tsk->mm;
 		res = get_array(mm, mm->arg_start, mm->arg_end, buffer);
-		release_mm(mm);
 	}
+	read_unlock(&tasklist_lock);
+
 	return res;
 }
 
@@ -849,39 +832,22 @@
 			    cap_t(p->cap_effective));
 }
 
-static struct task_struct *grab_task(int pid)
-{
-	struct task_struct *tsk = current;
-	if (pid != tsk->pid) {
-		read_lock(&tasklist_lock);
-		tsk = find_task_by_pid(pid);
-		if (tsk && tsk->mm && tsk->mm != &init_mm)
-			mmget(tsk->mm);
-		read_unlock(&tasklist_lock);
-	}	
-	return tsk;
-}
-
-static void release_task(struct task_struct *tsk)
-{
-	if (tsk != current && tsk->mm && tsk->mm != &init_mm)
-		mmput(tsk->mm);
-}
-
 static int get_status(int pid, char * buffer)
 {
 	char * orig = buffer;
-	struct task_struct *tsk;
+	struct task_struct * tsk;
 	
-	tsk = grab_task(pid);
+	read_lock(&tasklist_lock);
+	tsk = find_task_by_pid(pid);
 	if (!tsk)
-		return 0;
+		goto unlock;
 	buffer = task_name(tsk, buffer);
 	buffer = task_state(tsk, buffer);
 	buffer = task_mem(tsk, buffer);
 	buffer = task_sig(tsk, buffer);
 	buffer = task_cap(tsk, buffer);
-	release_task(tsk);
+ unlock:
+	read_unlock(&tasklist_lock);
 	return buffer - orig;
 }
 
@@ -893,21 +859,20 @@
 	int tty_pgrp;
 	sigset_t sigign, sigcatch;
 	char state;
-	int res;
+	int res = 0;
 
-	tsk = grab_task(pid);
+	read_lock(&tasklist_lock);
+	tsk = find_task_by_pid(pid);
 	if (!tsk)
-		return 0;
+		goto unlock;
 	state = *get_task_state(tsk);
 	vsize = eip = esp = 0;
 	if (tsk->mm && tsk->mm != &init_mm) {
 		struct vm_area_struct *vma;
 
-		down(&tsk->mm->mmap_sem);
 		for (vma = tsk->mm->mmap; vma; vma = vma->vm_next) {
 			vsize += vma->vm_end - vma->vm_start;
 		}
-		up(&tsk->mm->mmap_sem);
 		
 		eip = KSTK_EIP(tsk);
 		esp = KSTK_ESP(tsk);
@@ -975,7 +940,8 @@
 		tsk->cnswap,
 		tsk->exit_signal);
 
-	release_task(tsk);
+ unlock:
+	read_unlock(&tasklist_lock);
 	return res;
 }
 		
@@ -1054,11 +1020,14 @@
 
 static int get_statm(int pid, char * buffer)
 {
+	struct task_struct * tsk;
 	int size=0, resident=0, share=0, trs=0, lrs=0, drs=0, dt=0;
-	struct mm_struct *mm;
 
-	mm = get_mm_and_lock(pid);
-	if (mm) {
+	read_lock(&tasklist_lock);
+	tsk = find_task_by_pid(pid);
+	if (tsk)
+	{
+		struct mm_struct * mm = tsk->mm;
 		struct vm_area_struct * vma = mm->mmap;
 
 		while (vma) {
@@ -1080,8 +1049,8 @@
 				drs += pages;
 			vma = vma->vm_next;
 		}
-		release_mm(mm);
 	}
+	read_unlock(&tasklist_lock);
 	return sprintf(buffer,"%d %d %d %d %d %d %d\n",
 		       size, resident, share, trs, lrs, drs, dt);
 }
@@ -1149,7 +1118,7 @@
 		goto getlen_out;
 
 	/* Check whether the mmaps could change if we sleep */
-	volatile_task = (p != current || atomic_read(&p->mm->count) > 1);
+	volatile_task = (p != current || p->mm->count > 1);
 
 	/* decode f_pos */
 	lineno = *ppos >> MAPS_LINE_SHIFT;
@@ -1251,11 +1220,12 @@
 static int get_pidcpu(int pid, char * buffer)
 {
 	struct task_struct * tsk;
-	int i, len;
+	int i, len = 0;
 
-	tsk = grab_task(pid);
+	read_lock(&tasklist_lock);
+	tsk = find_task_by_pid(pid);
 	if (!tsk)
-		return 0;
+		goto unlock;
 
 	len = sprintf(buffer,
 		"cpu  %lu %lu\n",
@@ -1267,8 +1237,9 @@
 			i,
 			tsk->per_cpu_utime[cpu_logical_map(i)],
 			tsk->per_cpu_stime[cpu_logical_map(i)]);
+ unlock:
+	read_unlock(&tasklist_lock);
 
-	release_task(tsk);
 	return len;
 }
 #endif
Index: include/asm-i386/pgtable.h
===================================================================
RCS file: /var/cvs/linux/include/asm-i386/pgtable.h,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 pgtable.h
--- pgtable.h	1999/01/18 01:27:15	1.1.1.1
+++ pgtable.h	1999/01/27 20:09:21
@@ -101,7 +101,7 @@
 static inline void flush_tlb_current_task(void)
 {
 	/* just one copy of this mm? */
-	if (atomic_read(&current->mm->count) == 1)
+	if (current->mm->count == 1)
 		local_flush_tlb();	/* and that's us, so.. */
 	else
 		smp_flush_tlb();
@@ -113,7 +113,7 @@
 
 static inline void flush_tlb_mm(struct mm_struct * mm)
 {
-	if (mm == current->mm && atomic_read(&mm->count) == 1)
+	if (mm == current->mm && mm->count == 1)
 		local_flush_tlb();
 	else
 		smp_flush_tlb();
@@ -122,7 +122,7 @@
 static inline void flush_tlb_page(struct vm_area_struct * vma,
 	unsigned long va)
 {
-	if (vma->vm_mm == current->mm && atomic_read(&current->mm->count) == 1)
+	if (vma->vm_mm == current->mm && current->mm->count == 1)
 		__flush_tlb_one(va);
 	else
 		smp_flush_tlb();
Index: include/linux/sched.h
===================================================================
RCS file: /var/cvs/linux/include/linux/sched.h,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 sched.h
--- sched.h	1999/01/23 16:29:58	1.1.1.3
+++ sched.h	1999/01/28 02:33:01
@@ -164,7 +164,7 @@
 	struct vm_area_struct *mmap_avl;	/* tree of VMAs */
 	struct vm_area_struct *mmap_cache;	/* last find_vma result */
 	pgd_t * pgd;
-	atomic_t count;
+	int count;
 	int map_count;				/* number of VMAs */
 	struct semaphore mmap_sem;
 	unsigned long context;
@@ -184,7 +184,7 @@
 #define INIT_MM {					\
 		&init_mmap, NULL, NULL,			\
 		swapper_pg_dir, 			\
-		ATOMIC_INIT(1), 1,			\
+		1, 1,					\
 		MUTEX,					\
 		0,					\
 		0, 0, 0, 0,				\
@@ -609,11 +609,11 @@
  * Routines for handling mm_structs
  */
 extern struct mm_struct * mm_alloc(void);
-static inline void mmget(struct mm_struct * mm)
+extern inline void mmget(struct mm_struct * mm)
 {
-	atomic_inc(&mm->count);
+	mm->count++;
 }
-extern void mmput(struct mm_struct *);
+extern void FASTCALL(mmput(struct mm_struct *));
 /* Remove the current tasks stale references to the old mm_struct */
 extern void mm_release(void);
 
Index: kernel/fork.c
===================================================================
RCS file: /var/cvs/linux/kernel/fork.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 fork.c
--- fork.c	1999/01/23 16:30:27	1.1.1.3
+++ fork.c	1999/01/28 01:52:53
@@ -261,7 +261,7 @@
 	if (mm) {
 		*mm = *current->mm;
 		init_new_context(mm);
-		atomic_set(&mm->count, 1);
+		mm->count = 1;
 		mm->map_count = 0;
 		mm->def_flags = 0;
 		mm->mmap_sem = MUTEX_LOCKED;
@@ -308,7 +308,7 @@
  */
 void mmput(struct mm_struct *mm)
 {
-	if (atomic_dec_and_test(&mm->count)) {
+	if (!--mm->count) {
 		release_segments(mm);
 		exit_mmap(mm);
 		free_page_tables(mm);


If you see a race in this my new patch, please let me know and probably
you'll give me a good reason to reinsert mm_lock ;) 

Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses
  1999-01-28  2:50       ` Andrea Arcangeli
@ 1999-01-28  4:20         ` Tom Holroyd
  1999-01-28  6:23         ` Tom Holroyd
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 34+ messages in thread
From: Tom Holroyd @ 1999-01-28  4:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Linus Torvalds, werner, mlord,
	David S. Miller, gandalf, adamk, kiracofe.8, ksi, djf-lists,
	Alan Cox, linux-mm

Mmm, yes.  Both of Andrea's patches (the one in arca-4 and the most recent
one posted here) fix the problem with procs stuck in the D state, on my
Alpha PC164LX.

Hmpf.  Even though I have only 1 processor, make "MAKE=make -j5" dep is
consistently faster than make dep.  128M box...

Ah well, I guess I'll try loading some modules and crashing it that way.
:-)

Dr. Tom Holroyd
I would dance and be merry,
Life would be a ding-a-derry,
If I only had a brain.
	-- The Scarecrow

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses
  1999-01-28  2:50       ` Andrea Arcangeli
  1999-01-28  4:20         ` [patch] fixed both processes in D state and the /proc/ oopses Tom Holroyd
@ 1999-01-28  6:23         ` Tom Holroyd
  1999-01-28 15:09         ` [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Stephen C. Tweedie
  1999-01-28 17:36         ` Linus Torvalds
  3 siblings, 0 replies; 34+ messages in thread
From: Tom Holroyd @ 1999-01-28  6:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Linus Torvalds, linux-kernel, werner, mlord,
	David S. Miller, gandalf, adamk, kiracofe.8, ksi, djf-lists,
	Alan Cox, linux-mm

On Thu, 28 Jan 1999, Andrea Arcangeli wrote:

>If you see a race in this my new patch, please let me know and probably
>you'll give me a good reason to reinsert mm_lock ;) 

I spoke too soon.  :(

With that latest patch I was still able to get lots of procs stuck in D,
but it was harder. ^_^;

I was playing with modules, trying to get it to hang that way (very
successful there).  At the end of one experiment there was no crash but
rather a lot of D procs.  Details:

Alpha PC164 (LX).  128M, egcs-1.1.1.

MSDOS configured as a module.  Stick a floppy in the floppy drive.

# mount -o remount,ro /home		; be safe
# mount -o remount,ro /usr
# mformat a:
# mount -t msdos /dev/fd0 /tmp/mnt

now /proc/modules contains:

msdos                  11600   1 (autoclean)
fat                    33656   1 (autoclean) [msdos]

Run this script:
---
#! /bin/sh

while true; do
	cp -av /usr/src/linux/arch/ppc /tmp/mnt/ppc
	ls -lR /tmp/mnt
	rm -rf /tmp/mnt/ppc
done
---

In another window, make MAKE="make -j5" dep.  Now normally, with msdos
as a module, this causes the machine to hang (alt-sysrq unresponsive)
after a few minutes (often after it has started to swap stuff out, but I'm
having trouble narrowing it down more than that).

This last time, I got D procs.  Again, this is with Andrea's latest patch.
Without that patch, the make dep is guaranteed to fail almost right away,
but with it I was able to do this about 4 times before it occured.

 1188  1175 root      1496   664 end       D     0.0  0.5   0:00 make
 1213  1183 root      1208   664 end       D     0.0  0.4   0:00 make
 1224  1177 root      1320   664 end       D     0.0  0.5   0:00 make
 1226  1177 root      1256   664 end       D     0.0  0.4   0:00 make
 1247  1190 root      1272   664 end       D     0.0  0.4   0:00 make

I'll try again with the earlier patch.

Dr. Tom Holroyd
I would dance and be merry,
Life would be a ding-a-derry,
If I only had a brain.
	-- The Scarecrow

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28  1:02     ` Andrea Arcangeli
  1999-01-28  2:50       ` Andrea Arcangeli
@ 1999-01-28 15:05       ` Stephen C. Tweedie
  1 sibling, 0 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 1999-01-28 15:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Linus Torvalds, linux-kernel, werner, mlord,
	David S. Miller, gandalf, adamk, kiracofe.8, ksi, djf-lists,
	tomh, Alan Cox, linux-mm

Hi,

On Thu, 28 Jan 1999 02:02:38 +0100 (CET), Andrea Arcangeli
<andrea@e-mind.com> said:

> I think this too. My new code made tons of sense to me and since when I
> finished all my work everything become rock solid, 

Luck!  Fwiw, I was unable to reproduce the problem at all even on 2.2.0
with your test script.

>> get_status() {
>> tsk = grab_task(pid);
>> task_mem() {
>> down(&mm->mmap_sem);

No spinlock change will fix this race (although the memcpy will).

> My reason to reinsert the memcpy() was different than your one. It's
> because sys_wait4 don't hold the kernel lock and does _only_ a
> spin_lock_irq(tasklist_lock) and then remove the process from the
> tasklist, 

OK.

> Maybe I've thought stupid/wrong things but with my whole patch applyed the
> kernel become rock solid and race-free. I'm sure of this. Otherwise I
> would have not posted so sure of myself ;).

No, because the fork/exec race is still obviously present in get_stat,
and because SMP-only synchronisation fixes cannot fix a problem seen on
UP machines.

> Can somebody tell me _exactly_ what the mmap_sem stays for? 

Any modifications or blocking lookups to the mmap structures, and all
places where we add a new page into the process page tables.  That
includes mmap operations and page faults.

> The kernel is ~always doing a down on the mmap_sem of the process
> itself.  It's _useless_ that way.  The only place the kernel is doing
> a down on another task seems ptrace.c and fs/proc/array.c, so does we
> have the mmap_sem only for handling correctly such two cases?

Not at all.  The whole point of the semaphore is to protect the shared
mm when we have multiple threads all mmaping and page faulting
independently within the same mm_struct.  That's why the semaphore is in
the mm_struct, not in the task_struct.

> And finally I reask your question: can we at any time play with the ->mm,
> mm-> vma, pgd, pmd, pte, of a process without helding any semaphore, only
> having the big kernel lock held?

Yes.  Basically, changing an existing pte needs the kernel lock.  Adding
or modifying (but not removing) a new pte or modifying the vma tree
needs the mm semaphore.  We can unmap ptes in swap_out() without the
semaphore but with only the kernel lock.  We can map new anonymous pages
without the kernel lock but with only the semaphore.  Everything else
needs both.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28  2:50       ` Andrea Arcangeli
  1999-01-28  4:20         ` [patch] fixed both processes in D state and the /proc/ oopses Tom Holroyd
  1999-01-28  6:23         ` Tom Holroyd
@ 1999-01-28 15:09         ` Stephen C. Tweedie
  1999-01-28 17:54           ` Linus Torvalds
  1999-01-28 17:36         ` Linus Torvalds
  3 siblings, 1 reply; 34+ messages in thread
From: Stephen C. Tweedie @ 1999-01-28 15:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Linus Torvalds, linux-kernel, werner, mlord,
	David S. Miller, gandalf, adamk, kiracofe.8, ksi, djf-lists,
	tomh, Alan Cox, linux-mm

Hi,

On Thu, 28 Jan 1999 03:50:39 +0100 (CET), Andrea Arcangeli
<andrea@e-mind.com> said:

> Do you want to know why last night I added a spinlock around mmget/mmput
> without thinking twice?  Simply because mm->count was an atomic_t while it
> doesn't need to be an atomic_t in first place.

Agreed.

> So you don't buy my code, but now, I don't buy both all /proc mmget/mmput
> sutff and the mm->count atomic_t.

Agreed.  mm->count might as well remain atomic because that will help
when we come to apply finer grained locking to the mm, but as far as I
am concerned we may as well drop pretty much all of the mmget/mmput
stuff.  The only race I can still see is the possibility of sys_wait*
removing the task struct while we run on SMP.

> I also removed all the memcpy, we only need the read_lock(tasklist_lock)
> held in SMP because otherwise wait4() could remove the stack of the
> process under our eyes as just pointed out in the last email.

Yep, fine, as long as we keep the tasklist_lock right until the end of
our use of the task struct.

> Not doing in 2.2.1 the mm->count s/atomic_t/int/ due worry of races will
> mean that array.c in 2.2.1 will be not safe enough without my mm_lock
> spinlock. Do you understand my point?

No, because we already have a sufficient spinlock to protect us.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28  2:50       ` Andrea Arcangeli
                           ` (2 preceding siblings ...)
  1999-01-28 15:09         ` [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Stephen C. Tweedie
@ 1999-01-28 17:36         ` Linus Torvalds
  3 siblings, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 1999-01-28 17:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm


On Thu, 28 Jan 1999, Andrea Arcangeli wrote:
>
> Do you want to know why last night I added a spinlock around mmget/mmput
> without thinking twice?  Simply because mm->count was an atomic_t while it
> doesn't need to be an atomic_t in first place.

No. Your argument does not make any sense at all.

Go away, this is not the time to use magic to make kernel patches.

A "atomic_t" _often_ makes sense without having any spinlocks,
_especially_ when used as a reference count. Let me count the ways:

 - when you increase the count because you make a copy, you know that
   you're already a holder of the count, so you don't need any spinlocks
   to protect anything else: you _know_ the area is there.

 - when you decrease the count, you use "atomic_dec_and_test()" because
   you know that the count was > 0, and you know that only _one_ such
   decrementer will get a positive reply for the atomic_dec_and_test. The
   one that is successful doesn't need any locking, because he's now the
   only owner, and should just release everything. 

In short, it has to be atomic, and spinlocks never enter the picture at
all.

Your patches do not make sense. Period. The only thing that makes sense is
to revert the array.c thing to the old one that didn't try to be clever.

		Linus 

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28 15:09         ` [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Stephen C. Tweedie
@ 1999-01-28 17:54           ` Linus Torvalds
  1999-01-28 18:07             ` Stephen C. Tweedie
  1999-01-28 22:04             ` Andrea Arcangeli
  0 siblings, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 1999-01-28 17:54 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrea Arcangeli, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm


On Thu, 28 Jan 1999, Stephen C. Tweedie wrote:

> 
> > Do you want to know why last night I added a spinlock around mmget/mmput
> > without thinking twice?  Simply because mm->count was an atomic_t while it
> > doesn't need to be an atomic_t in first place.
> 
> Agreed.

Incorrect, see my previous email. It may not be strictly necessary right
now due to us probably holding the kernel lock everywhere, but it is
conceptually necessary, and it is _not_ an argument for a spinlock.

The /proc code has to be fixed, but the easy fix is to just revert to the
old one as far as I can see. I shouldn't have accepted the /proc patches
in the first place, and I'm sorry I did.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28 17:54           ` Linus Torvalds
@ 1999-01-28 18:07             ` Stephen C. Tweedie
  1999-01-28 18:17               ` Linus Torvalds
  1999-01-28 22:33               ` Andrea Arcangeli
  1999-01-28 22:04             ` Andrea Arcangeli
  1 sibling, 2 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 1999-01-28 18:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Andrea Arcangeli, linux-kernel, werner,
	mlord, David S. Miller, gandalf, adamk, kiracofe.8, ksi,
	djf-lists, tomh, Alan Cox, linux-mm

Hi,

On Thu, 28 Jan 1999 09:54:07 -0800 (PST), Linus Torvalds
<torvalds@transmeta.com> said:

> On Thu, 28 Jan 1999, Stephen C. Tweedie wrote:

>> > Do you want to know why last night I added a spinlock around mmget/mmput
>> > without thinking twice?  Simply because mm->count was an atomic_t while it
>> > doesn't need to be an atomic_t in first place.
>> Agreed.

> Incorrect, see my previous email. It may not be strictly necessary right
> now due to us probably holding the kernel lock everywhere, but it is
> conceptually necessary, and it is _not_ an argument for a spinlock.

Linus, we are in violent agreement: see my previous email. :)  I agree
with both you and Andrea that the atomic_t is not strictly necessary,
and agree vigorously that removing it is wrong because it will just make
the job of fine-graining the locking ever more harder.  As we relax the
kernel locks, the atomic_t becomes more and more important.

> The /proc code has to be fixed, but the easy fix is to just revert to the
> old one as far as I can see. I shouldn't have accepted the /proc patches
> in the first place, and I'm sorry I did.

Yep, but Andrea did point out what looks like at least one valid race:
sys_wait* on a zombie task can remove and deallocate the task_struct
without taking the global lock.  Reverting the diff is the right thing
for 2.2.1, but once we've done that we may need to look at either
keeping the task lock until we have finished with the task_struct in
array.c, or doing a memcpy on the task while we still have it locked.
That does seem to be a valid fix.

--Stephen

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28 18:07             ` Stephen C. Tweedie
@ 1999-01-28 18:17               ` Linus Torvalds
  1999-01-28 18:25                 ` Stephen C. Tweedie
  1999-01-28 19:23                 ` Alan Cox
  1999-01-28 22:33               ` Andrea Arcangeli
  1 sibling, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 1999-01-28 18:17 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrea Arcangeli, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm

On Thu, 28 Jan 1999, Stephen C. Tweedie wrote:
> 
> Yep, but Andrea did point out what looks like at least one valid race:
> sys_wait* on a zombie task can remove and deallocate the task_struct
> without taking the global lock.  Reverting the diff is the right thing
> for 2.2.1, but once we've done that we may need to look at either
> keeping the task lock until we have finished with the task_struct in
> array.c, or doing a memcpy on the task while we still have it locked.
> That does seem to be a valid fix.

I'd much rather just use some stale "struct task_struct" data.

What we _might_ do in /proc, is to just increment the usage count for the
(double) page that contains the task structure, so that even if a
release() does happen while we're using the page, the page won't get
re-used, and we can essentially look at all the info as it was when it was
released. 

Note that by the time it has become a zombie, it will have released all
the mm stuff anyway, so we don't even have any dangerous stale mm pointers
that we'd follow: we'd only be looking at that one structure. 

Voila, no memcpy(), no silly locks, no problems. 

Anyway, for 2.2.1 I don't even want to be clever. As it is (with the bogus
array.c race "fixes" removed), the page may get freed without any kernel
lock, and we may return _completely_ bogus information, but that is (a) 
extremely unlikely in the first place and (b) basically harmless and
pretty much impossible to exploit. 

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28 18:17               ` Linus Torvalds
@ 1999-01-28 18:25                 ` Stephen C. Tweedie
  1999-01-28 19:23                 ` Alan Cox
  1 sibling, 0 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 1999-01-28 18:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Andrea Arcangeli, linux-kernel, werner,
	mlord, David S. Miller, gandalf, adamk, kiracofe.8, ksi,
	djf-lists, tomh, Alan Cox, linux-mm

Hi,

On Thu, 28 Jan 1999 10:17:37 -0800 (PST), Linus Torvalds
<torvalds@transmeta.com> said:

> I'd much rather just use some stale "struct task_struct" data.

The problem isn't the risk of using stale data: it is the risk of using
complete garbage if the task_struct page gets reused.  The procfs code
does check that tsk->mm is non-zero before following the pointers, but
if there is a non-zero address there then it _will_ be dereferenced
regardless.

> What we _might_ do in /proc, is to just increment the usage count for the
> (double) page that contains the task structure,

That would certainly take care of it.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28 19:23                 ` Alan Cox
@ 1999-01-28 19:11                   ` Linus Torvalds
  1999-01-28 20:11                     ` Alan Cox
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 1999-01-28 19:11 UTC (permalink / raw)
  To: Alan Cox
  Cc: sct, andrea, linux-kernel, werner, mlord, davem, gandalf, adamk,
	kiracofe.8, ksi, djf-lists, tomh, linux-mm

On Thu, 28 Jan 1999, Alan Cox wrote:
> 
> (c) you can check if the thing has disappeared after using it and clear
> the buffer if so.

Yes, but the problem is that when there _is_ stale data (unlikely), you
can actually do the wrong thing before.

Incrementing the page counter should fix all problems, but it's such a
subtle fix (even though it's essentially just a few one-liners) that I'm
not going to do it for 2.2.1, which I want to get out later today. 

Alan, the only patch I don't have that looks likely for 2.2.1 is the
IDE-SCSI thing. Did you have that somewhere?

Right now my 2.2.1 patches are:
 - the stupid off-by-one bug found by Ingo
 - __down_interruptible on alpha
 - move "esstype" to outside a #ifdef MODULE
 - NFSD rename/rmdir fixes
 - revert to old array.c
 - change comment about __PAGE_OFFSET
 - missing "vma = NULL" case for avl find_vma()

Holler now or forever hold your peace, because I'd like to get the thing
out in a few hours. 

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28 18:17               ` Linus Torvalds
  1999-01-28 18:25                 ` Stephen C. Tweedie
@ 1999-01-28 19:23                 ` Alan Cox
  1999-01-28 19:11                   ` Linus Torvalds
  1 sibling, 1 reply; 34+ messages in thread
From: Alan Cox @ 1999-01-28 19:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: sct, andrea, linux-kernel, werner, mlord, davem, gandalf, adamk,
	kiracofe.8, ksi, djf-lists, tomh, alan, linux-mm

> Anyway, for 2.2.1 I don't even want to be clever. As it is (with the bogus
> array.c race "fixes" removed), the page may get freed without any kernel
> lock, and we may return _completely_ bogus information, but that is (a) 
> extremely unlikely in the first place and (b) basically harmless and
> pretty much impossible to exploit. 

(c) you can check if the thing has disappeared after using it and clear
the buffer if so.


Alan

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28 19:11                   ` Linus Torvalds
@ 1999-01-28 20:11                     ` Alan Cox
  0 siblings, 0 replies; 34+ messages in thread
From: Alan Cox @ 1999-01-28 20:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: alan, sct, andrea, linux-kernel, werner, mlord, davem, gandalf,
	adamk, kiracofe.8, ksi, djf-lists, tomh, linux-mm

This is the first of two "people fixes" - this isnt so major the 2nd is

o	Fix Erik Andersen's email


diff -u --new-file --recursive --exclude-from ../exclude linux.vanilla/CREDITS linux.ac/CREDITS
--- linux.vanilla/CREDITS	Sun Jan 24 19:55:28 1999
+++ linux.ac/CREDITS	Wed Jan 27 19:07:38 1999
@@ -49,12 +49,12 @@
 
 N: Erik Andersen
 E: andersee@debian.org
-W: http://www.inconnect.com/~andersen
+W: http://www.xmission.com/~andersen
 P: 1024/FC4CFFED 78 3C 6A 19 FA 5D 92 5A  FB AC 7B A5 A5 E1 FF 8E
 D: Maintainer of ide-cd and Uniform CD-ROM driver, 
 D: ATAPI CD-Changer support, Major 2.1.x CD-ROM update.
 S: 4538 South Carnegie Tech Street
-S: West Valley City, Utah 84120
+S: Salt Lake City, Utah 84120
 S: USA
 
 N: H. Peter Anvin


--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28 17:54           ` Linus Torvalds
  1999-01-28 18:07             ` Stephen C. Tweedie
@ 1999-01-28 22:04             ` Andrea Arcangeli
  1999-01-29  0:17               ` Linus Torvalds
  1 sibling, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 1999-01-28 22:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm

On Thu, 28 Jan 1999, Linus Torvalds wrote:

> Incorrect, see my previous email. It may not be strictly necessary right
> now due to us probably holding the kernel lock everywhere, but it is
> conceptually necessary, and it is _not_ an argument for a spinlock.

If you remove the kernel lock around do_exit() you _need_ my mm_lock
spinlock. You need it to make atomic the decreasing of mm->count and
current->mm = &init_mm. If the two instructions are not atomic you have
_no_ way to know if you can mmget() at any time the mm of a process.

I repeat in another way (just trying to avoid English mistakes): 
decreasing mm->count has to go in sync with updating current->mm,
otherwise you don't know if you can access the mm of a process (and so
also touching mm->count) because the mm of such process could be just been
deallocated (the process could be a zombie).

As far I can see the mm->count will _never_ have 1 reason to be atomic_t
even removing the function lock_kernel() from
/usr/src/linux/include/asm/smplock.h . 

Using an int for mm->count (isntead of atomic_t) is far from be something
of magic. Doing that change it means that I don't expect work by magic,
but I know that it will work fine. It's instead magic (out of the human
reason) how the kernel works now, if you think that we _need_ atomic_t
instead of int. 

It's hard for me to be quiet even if you asked me to go away because I
think you are plain wrong telling that we need mm->count atomic_t for the
future. 

Andrea Arcangeli

PS. You have not hurted me asking me to go away, because I _always_ do the
_best_ I _can_, and if my best is plain wrong I can't change this reality. 
What I care is to do the best I can. Maybe I should really go away and do
something in my sparetime where it's not requested to be clever... I hope
it's not the case though and that you wasn't serious.

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28 18:07             ` Stephen C. Tweedie
  1999-01-28 18:17               ` Linus Torvalds
@ 1999-01-28 22:33               ` Andrea Arcangeli
  1999-01-28 22:53                 ` Linus Torvalds
  1 sibling, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 1999-01-28 22:33 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm

On Thu, 28 Jan 1999, Stephen C. Tweedie wrote:

> Linus, we are in violent agreement: see my previous email. :)  I agree
> with both you and Andrea that the atomic_t is not strictly necessary,
> and agree vigorously that removing it is wrong because it will just make
> the job of fine-graining the locking ever more harder.  As we relax the
> kernel locks, the atomic_t becomes more and more important.

I'm afraid. I still think that it will never be needed even removing
lock_kernel(), because doing that we would need to make atomic the
decreasing of mm->count with current->mm = &init_mm, otherwise we would
not know if we can touch the current->mm of a process at any time. 

It's far from be like the semaphore case where we avoid the spinlock to
not race because we can order our moves differently if we are in down() or
in up(). 

But maybe I am missing something, but it's what I think though.

Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28 22:33               ` Andrea Arcangeli
@ 1999-01-28 22:53                 ` Linus Torvalds
  1999-01-29  1:47                   ` Andrea Arcangeli
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 1999-01-28 22:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm

On Thu, 28 Jan 1999, Andrea Arcangeli wrote:
> 
> I'm afraid. I still think that it will never be needed even removing
> lock_kernel(), because doing that we would need to make atomic the
> decreasing of mm->count with current->mm = &init_mm, otherwise we would
> not know if we can touch the current->mm of a process at any time. 

What?

We can _always_ touch current->mm, because it can _never_ go away from
under us: it will always have a non-zero count exactly because "current"
has a copy of it.

If you want to touch some _other_ process mm pointer, that's when it gets
interesting. Buyer beware.

> But maybe I am missing something, but it's what I think though.

You're missing the fact that whenever we own the mm, we know that NOBODY
else can do anything bad at all, because nobody else can ever think that
the count is zero. Because we hold a reference count to it.

The _only_ interesting case is when multiple threads do "exit()" at the
same time, and then one of them will be the last one - and that last one
will _know_ he is the last one because that's how atomic_dec_and_tes()
works. And once he knows, he no longer has to care about anybody else,
because he knows that nobody else can touch that mm space any more (or it
wouldn't have decremented the counter to zero in the first place). 

So not only does mm->count need to be atomic in the absense of other
locks, but that atomicity is obviously sufficient for allocation purposes. 

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28 22:04             ` Andrea Arcangeli
@ 1999-01-29  0:17               ` Linus Torvalds
  0 siblings, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 1999-01-29  0:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm


On Thu, 28 Jan 1999, Andrea Arcangeli wrote:
>
> If you remove the kernel lock around do_exit() you _need_ my mm_lock
> spinlock. You need it to make atomic the decreasing of mm->count and
> current->mm = &init_mm. If the two instructions are not atomic you have
> _no_ way to know if you can mmget() at any time the mm of a process.

Andrea, just go away.

The two do not _have_ to be atomic, they never had to, and they never
_will_ have to be atomic. You obviously haven't read all my email
explaining why they don't have to be atomic.

> I repeat in another way (just trying to avoid English mistakes): 
> decreasing mm->count has to go in sync with updating current->mm,

No it has not.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-28 22:53                 ` Linus Torvalds
@ 1999-01-29  1:47                   ` Andrea Arcangeli
  1999-01-29 11:20                     ` MOLNAR Ingo
                                       ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Andrea Arcangeli @ 1999-01-29  1:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm

On Thu, 28 Jan 1999, Linus Torvalds wrote:

> If you want to touch some _other_ process mm pointer, that's when it gets
> interesting. Buyer beware.

Infact this is the point. I really think you are missing something. I read
your explanation of why we only need atomic_t but it was not touching some
point I instead thought about.

Ok, I assume you are right. Please take this example: I am writing a nice
kernel module that will collect some nice stats from the kernel.

I don't have the big kernel lock held, I consider that also __exit_mm()
won't have the big kernel lock held.

To collect such stats I need to touch all mm_struct pointed by all tsk->mm
in the kernel.

How can do that without racing? (see below)

> You're missing the fact that whenever we own the mm, we know that NOBODY

Sure, once we have owned an mm with mmget() we are safe, but my __only__
point is: how can I run a mmget() being sure that I am _getting_ an mm
that is not just been deallocated from the process that was owning such mm
(and it was the _only_ onwer) that get killed in the meantime on the other
CPU? Now I am sure because __exit_mm() is running with the kernel lock
held.

Another way to tell the same: "how can I be sure that I am doing an
atomic_inc(&mm->count) on a mm->count that was just > 0, and more
important on an mm that is still allocated? "

Now everything is fine of course, because I only need to get the big
kernel lock (and for this reason atomic_t _right_now_ is not needed) but
as you pointed out, some time in the future we could kill the big kernel
lock to scale better in SMP. OK? At that time according to you we'll
_need_ (and we'll _only_ need) the mm->count atomic_t to play with the mm
of other process without risk of races. 

I return to the kernel module stat colletctor example:

To be sure that the kernel stack of the process will not go away under me
I need to held the tasklist_lock. Ok? So i'll do:

	read_lock(&tasklist_lock);
	tsk = find_task_by_pid(pid);
	if (tsk)
	{
		struct page * page = mem_map + MAP_NR(tsk);
		atomic_inc(&page->count);
	}
	read_unlock(&tasklist_lock);
	mdelay(10000000000000);

So now I can wait all time I want and nobody can free and reuse my task
struct and replace it with garbage under my eyes. OK?

Now I want to play with the tsk->mm of the tsk. OK?

I'll do:

	unlock_kernel();
	^^
	if (tsk->mm && tsk->mm != &init_mm)
	{
		mdelay(2000000000000000000);
		mmget();
	}

but between the check `tsk->mm != &init_mm' and mmget() in the other CPU
the task `tsk' could have run __exit_mm()!!! (you remeber, we don't have
the big kernel lock held in __exit_mm() anymore). So mmget() will reput
the mm->count to 1, but there won't be any kind of mm_struct there!!! We
are increasing a random kernel space doing atomic_inc(&mm->count)!!
The mm_sruct is gone away between tsk->mm != &init_mm and mmget(). 

This is the race I seen two night ago. I seen it because I seen some
report on the list that was Oopsing in kmem_cache_free() called by
mmput(). So I didn't thought two times before adding the mm_lock. Do you
see why now?  Obviously I didn't seen the lock_kernel() (or better I seen
it too late and I am been too lazy to remove the mm_lock and since it was
not harming I left it there, as an advance for the future). 

An Oops in mmput() could be explained very well by the race I am trying to
point out to you:

	task 1				task 2
	-----------			---------------
					mm->count == 1
	if (tsk->mm != &init_mm)
	{
		/* interrupt */
					__do_exit()
					{
						tsk->mm = &init_mm
						mmput(); (mm->count == 0)
					}
		mmget() (mm->count == 1 but there' isn't a mm_struct there!!!)
	}

I can't see how _only_ with the mm->count atomic_t we can avoid this
scenario (you said that in the last emails!!!!). Maybe I only need a
loooong sleep.

Excuse me if I am still here (not gone away yet) but such mmget()/mmput() 
race sound not fixable to me without using a mm_lock spinlock (I am
considering to not have the lock_kernel() in __exit_mm() of course).

Do you see the future race now? Do you see why I tell you that we can't be
race free in looking the mm of _another_ task without a spinlock if
__exit_mm() won't own the big kernel lock?

Obviously if we'll serialize mmput/mmget and the setting of the tsk->mm
using a spinlock, we won't need mm->count atomic_t (as now btw).

Do you agree with me now?

Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-29  1:47                   ` Andrea Arcangeli
@ 1999-01-29 11:20                     ` MOLNAR Ingo
  1999-01-29 12:08                       ` Andrea Arcangeli
  1999-01-29 14:13                     ` Eric W. Biederman
  1999-01-29 18:24                     ` Linus Torvalds
  2 siblings, 1 reply; 34+ messages in thread
From: MOLNAR Ingo @ 1999-01-29 11:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Stephen C. Tweedie, linux-kernel, werner, mlord,
	David S. Miller, gandalf, adamk, kiracofe.8, ksi, djf-lists,
	tomh, Alan Cox, linux-mm

On Fri, 29 Jan 1999, Andrea Arcangeli wrote:

> Another way to tell the same: "how can I be sure that I am doing an
> atomic_inc(&mm->count) on a mm->count that was just > 0, and more
> important on an mm that is still allocated? "

you are misunderstanding how atomic_inc_and_test() works. The processor
guarantees this. This is the crux of SMP atomic operations. How otherwise
could we reliably build read-write spinlocks.

yes, there is no atomic_inc_and_test() yet. (it's a bit tricky to
implement but pretty much analogous to read-write locks, we probably need
to shift values down by one to get the 'just increased from -1 to 0' event
via the zero flag, and get the 'just decreased from 0 to -1' event via the
sign flag.) Also note that this is all fiction yet because we _are_
holding the kernel lock for these situations in 2.2.

-- mingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-29 11:20                     ` MOLNAR Ingo
@ 1999-01-29 12:08                       ` Andrea Arcangeli
  1999-01-29 13:19                         ` MOLNAR Ingo
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 1999-01-29 12:08 UTC (permalink / raw)
  To: MOLNAR Ingo
  Cc: Linus Torvalds, Stephen C. Tweedie, linux-kernel, werner, mlord,
	David S. Miller, gandalf, adamk, kiracofe.8, ksi, djf-lists,
	tomh, Alan Cox, linux-mm

On Fri, 29 Jan 1999, MOLNAR Ingo wrote:

> yes, there is no atomic_inc_and_test() yet. (it's a bit tricky to

_Where_ do you want to run atomic_inc_and_test()? On random kernel data
where incidentally a long time ago there was the just deallocated
and reused (somewhere else) mm_struct?

If we incidentally access the mm_struct and we notice that the mm->count
is zero it menas we are just buggy. 

> sign flag.) Also note that this is all fiction yet because we _are_
> holding the kernel lock for these situations in 2.2.

Sure it's finction, but _all_ complains I get against my s/atomic_t/int/
was about the future.

Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-29 12:08                       ` Andrea Arcangeli
@ 1999-01-29 13:19                         ` MOLNAR Ingo
  1999-01-29 14:14                           ` Andrea Arcangeli
  0 siblings, 1 reply; 34+ messages in thread
From: MOLNAR Ingo @ 1999-01-29 13:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Stephen C. Tweedie, linux-kernel, werner, mlord,
	David S. Miller, gandalf, adamk, kiracofe.8, ksi, djf-lists,
	tomh, Alan Cox, linux-mm

On Fri, 29 Jan 1999, Andrea Arcangeli wrote:

> _Where_ do you want to run atomic_inc_and_test()? On random kernel data

note that in 99% of the cases we need the counter only in the clone(),
exec() and exit() path, for these three cases we know implicitly that it's
a valid buffer. (because we hold a reference to it) [subsequently we dont
need any atomic_inc_and_test thing either for clone+exec+exit] An atomic
counter is just about perfect for those uses, even in the 'no kernel lock'
case. 

I only looked at your last (fairly large) patch, which does not seem to
have _any_ effect at all as far as bugfixes are concerned, except the
array.c change (which again turned out to have nothing to do with the
atomic vs. spinlock thing). 

for 'other process' uses (for cases when we want to dereference an
mm_struct pointer but do not hold a reference to it, eg. /proc or the VM
scanning logic) we will have to do something smart when we remove the
kernel lock in 2.3, but it should not increase the cost of the common case
(clone() + exit()) if possible.

-- mingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-29  1:47                   ` Andrea Arcangeli
  1999-01-29 11:20                     ` MOLNAR Ingo
@ 1999-01-29 14:13                     ` Eric W. Biederman
  1999-01-30 15:42                       ` Andrea Arcangeli
  1999-01-29 18:24                     ` Linus Torvalds
  2 siblings, 1 reply; 34+ messages in thread
From: Eric W. Biederman @ 1999-01-29 14:13 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

>>>>> "AA" == Andrea Arcangeli <andrea@e-mind.com> writes:

AA> On Thu, 28 Jan 1999, Linus Torvalds wrote:

>> You're missing the fact that whenever we own the mm, we know that NOBODY

AA> I return to the kernel module stat colletctor example:

AA> To be sure that the kernel stack of the process will not go away under me
AA> I need to held the tasklist_lock. Ok? So i'll do:

AA> 	read_lock(&tasklist_lock);
AA> 	tsk = find_task_by_pid(pid);
AA> 	if (tsk)
AA> 	{
AA> 		struct page * page = mem_map + MAP_NR(tsk);
AA> 		atomic_inc(&page->count);

Actually in the future we should have something increment the task count or 
similiar.  But I suppose keeping a page count should be enough.

AA> 	}
AA> 	read_unlock(&tasklist_lock);
AA> 	mdelay(10000000000000);

AA> So now I can wait all time I want and nobody can free and reuse my task
AA> struct and replace it with garbage under my eyes. OK?

AA> Now I want to play with the tsk->mm of the tsk. OK?

AA> I'll do:

AA> 	unlock_kernel();
AA> 	^^
AA> 	if (tsk->mm && tsk->mm != &init_mm)
AA> 	{
AA> 		mdelay(2000000000000000000);
AA> 		mmget();
AA> 	}

This would need to say.
	mm = tsk->mm;
	mmget(mm);
	if (mm != &init_mm) {
	/* xyz */
	}

And do_exit & exec would need to say:
     old_mm = tsk->mm;
     tsk->mm = new_mm; /* probably init_mm */
     mmput(old_mm);

There does to be a memory barier there to sychronize reads/writes of cache
data.  I forget off had what kind that needs to be.

The fix is just to never let bad sit in the tsk struct while it is valid.

Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-29 13:19                         ` MOLNAR Ingo
@ 1999-01-29 14:14                           ` Andrea Arcangeli
  1999-01-29 17:46                             ` Theodore Y. Ts'o
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 1999-01-29 14:14 UTC (permalink / raw)
  To: MOLNAR Ingo
  Cc: Stephen C. Tweedie, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm

On Fri, 29 Jan 1999, MOLNAR Ingo wrote:

> note that in 99% of the cases we need the counter only in the clone(),
> exec() and exit() path, for these three cases we know implicitly that it's
> a valid buffer. (because we hold a reference to it) [subsequently we dont
> need any atomic_inc_and_test thing either for clone+exec+exit] An atomic
> counter is just about perfect for those uses, even in the 'no kernel lock'
> case. 

Sure, if you look at my last email to Linus, you'll see that I am _only_
talking about getting the mm of a random process (not the current one!).

Probably I should comment better my patches, but I have really a big
leakage of spare time these days but I _don't_ want to decrease the kernel
hacking (even if Linus asked me to go away two times).

> I only looked at your last (fairly large) patch, which does not seem to
> have _any_ effect at all as far as bugfixes are concerned, except the
> array.c change (which again turned out to have nothing to do with the
> atomic vs. spinlock thing).

Infact, the only bugfix is array.c (as I just pointed out clearly in
bugtraq). Another reason I didn't either checked about lock_kernel() two
night ago before adding mm_lock, is that it was very late, I had a little
time and I seen some bugreport on the list that was oopsing in mmput (so I
thought "yeah, I seen the race!", I thought it was the mmput/current->mm =
&init_mm race). Then I also seen the mm->count as atomic_t so I thought it
was really the case. 

Ah note, there was overhead also in the overhead since checking for
mm->count == 0 in mmget was a nono ;).

> for 'other process' uses (for cases when we want to dereference an
> mm_struct pointer but do not hold a reference to it, eg. /proc or the VM
> scanning logic) we will have to do something smart when we remove the
> kernel lock in 2.3, but it should not increase the cost of the common case
> (clone() + exit()) if possible.

Ok this is a completly different story and I am the first that don't want
to continue it now, but I want to point out that atomic_t is __far__ to be
the only thing we'll nee to make the mm_struct browsing race free (as I
understood from Linus's word).

I am very very happy to see that I was not (completly ;) crazy.

Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-29 14:14                           ` Andrea Arcangeli
@ 1999-01-29 17:46                             ` Theodore Y. Ts'o
  0 siblings, 0 replies; 34+ messages in thread
From: Theodore Y. Ts'o @ 1999-01-29 17:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: MOLNAR Ingo, Stephen C. Tweedie, linux-kernel, werner, mlord,
	David S. Miller, gandalf, adamk, kiracofe.8, ksi, djf-lists,
	tomh, Alan Cox, linux-mm

   > note that in 99% of the cases we need the counter only in the clone(),
   > exec() and exit() path, for these three cases we know implicitly that it's
   > a valid buffer. (because we hold a reference to it) [subsequently we dont
   > need any atomic_inc_and_test thing either for clone+exec+exit] An atomic
   > counter is just about perfect for those uses, even in the 'no kernel lock'
   > case. 

   Sure, if you look at my last email to Linus, you'll see that I am _only_
   talking about getting the mm of a random process (not the current one!).

I think the important thing to remember is that Linus has said that he's
only interested in critical bug fixes at this point.  So things which
make certain other operations conceptually easier in the future are
simply not of interest at this point.  That's probably the cause of the
confusion --- you're trying to solve a general problem, and Linus is
trying to make Linus 2.2 stable.  There's a time for fixing general
problems in the kernel, and that was when we were still in the 2.1
development kernel, and it will happen again once Linus opens the 2.3
tree.

In the meantime, I suggest you save your patches; I have a number of
patches which I'm working on and saving to be sent to Linus once the 2.3
development tree opens up.  

Also, those of us who are experts with our parts of the kernel should
consider spending at least some of time investigating bug reports sent
in by users, and fixing stability problems in the 2.2 kernel, in
addition to working on future changes for 2.3.  The sooner we can get
2.2 stable, the sooner 2.3 can get opened, and the happier Linux users
(our customers!) will be.

							- Ted

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-29  1:47                   ` Andrea Arcangeli
  1999-01-29 11:20                     ` MOLNAR Ingo
  1999-01-29 14:13                     ` Eric W. Biederman
@ 1999-01-29 18:24                     ` Linus Torvalds
  2 siblings, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 1999-01-29 18:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, linux-kernel, werner, mlord, David S. Miller,
	gandalf, adamk, kiracofe.8, ksi, djf-lists, tomh, Alan Cox,
	linux-mm


On Fri, 29 Jan 1999, Andrea Arcangeli wrote:
> 
> > If you want to touch some _other_ process mm pointer, that's when it gets
> > interesting. Buyer beware.
> 
> Infact this is the point. I really think you are missing something. I read
> your explanation of why we only need atomic_t but it was not touching some
> point I instead thought about.
> 
> Ok, I assume you are right. Please take this example: I am writing a nice
> kernel module that will collect some nice stats from the kernel.

And that's where you have problems. You shouldn't do that, and that's why
/proc is such a nasty beast right now. 

If you want to look at other peoples processes, then the onus should be on
_you_ to do all the extra crap that normal processes do not need to do. 
That extra crap can be a number of things, but you shouldn't penalize the
normal path (which is to touch only your own mm space). 

For example, the thing I suspect we'll have to do in the long run for
/proc is:

 - get the process while holding the tasklist lock, and increment the page
   count so that even if the process exists, the page does not get unused.
 - get the kernel lock. Now we know that we're atomic with regard to
   __exit_mm()
 - look at tsk->mm: it it is not init_mm, you're now safe, because
   __exit_mm is called only with the kernel lock held. 

See? No spinlocks.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-29 14:13                     ` Eric W. Biederman
@ 1999-01-30 15:42                       ` Andrea Arcangeli
  1999-01-30 20:32                         ` Eric W. Biederman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 1999-01-30 15:42 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-mm

On 29 Jan 1999, Eric W. Biederman wrote:

> AA> 	unlock_kernel();
> AA> 	^^
> AA> 	if (tsk->mm && tsk->mm != &init_mm)
> AA> 	{
> AA> 		mdelay(2000000000000000000);
> AA> 		mmget();
> AA> 	}
> 
> This would need to say.
> 	mm = tsk->mm;
> 	mmget(mm);
> 	if (mm != &init_mm) {
> 	/* xyz */
> 	}

This is not enough to avoid races. I supposed to _not_ have the big kernel
lock held. The point is _where_ you do mmget() and so _where_ you do
mm->count++. If current!=tsk and you don't have the big kernel lock held,
you can risk to do a mm->count++ on a random kernel memory because mmput()
run from __exit_mm() from the tsk context in the meantime on the other
CPU.

When tsk == current instead you implicit know that you _can't_ race yes,
but this was _not_ the case I was complaining about. 

Tell me if I am misunderstood your email.

Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-30 15:42                       ` Andrea Arcangeli
@ 1999-01-30 20:32                         ` Eric W. Biederman
  1999-01-31  1:00                           ` Andrea Arcangeli
  0 siblings, 1 reply; 34+ messages in thread
From: Eric W. Biederman @ 1999-01-30 20:32 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Eric W. Biederman, linux-mm

>>>>> "AA" == Andrea Arcangeli <andrea@e-mind.com> writes:

AA> On 29 Jan 1999, Eric W. Biederman wrote:
AA> unlock_kernel();
AA> ^^
AA> if (tsk->mm && tsk->mm != &init_mm)
AA> {
AA> mdelay(2000000000000000000);
AA> mmget();
AA> }
>> 
>> This would need to say.
>> mm = tsk->mm;
>> mmget(mm);
>> if (mm != &init_mm) {
>> /* xyz */
>> }

AA> This is not enough to avoid races. I supposed to _not_ have the big kernel
AA> lock held. The point is _where_ you do mmget() and so _where_ you do
AA> mm-> count++. If current!=tsk and you don't have the big kernel lock held,
AA> you can risk to do a mm->count++ on a random kernel memory because mmput()
AA> run from __exit_mm() from the tsk context in the meantime on the other
AA> CPU.

I have a count incremented on the task so the task_struct won't go away.
tsk->mm at any point in time _always_ points to a valid mm.
(A new mm is assigned before the old mm is put)/

That does appear to leave a small race in my code.  That of having a pointer
to a valid mm, which is reallocated before the count can be incremented.

Probably a piece of code like:

mm_struct *fetch_tsk_mm(task_struct *tsk)
{
	unsinged long flags;
	mm_struct *mm;
	save_flags(flags);
	cli();
	do {	
		mm = tsk->mm;
	} while (!atomic_inc_and_test(&mm->count);
	restore_flags(flags);
	return mm;
}
is needed to make sure the count goes up before the mm can be reallocated.

I'm not an expert on locks, so there may be an even cheaper way of implementing
it.

The point is that:
a) An atomic count is sufficient if you know the mm will be valid while you hold it.
b) Making sure the time you CPU spends between getting the valid mm reference
   and incrementing the count on that reference, is smaller than time it takes
   for another CPU to put another mm in the task_struct, decrement the mm count, ensure
   the caches will see the memory writes in the approrpiate order and then reallocate
   the memory.  
   Is sufficient to say the mm will be valid when the count is incremented.

AA> Tell me if I am misunderstood your email.

In part, and in part I misunderstood the problem.

As far as I can tell the race is so unlikely it should never happen in practice.
But also it is so small there should be ample room for small for a large variety of
solutions.  It is an interesting problem, and worth solving well for 2.3.

I am quite certain however that your get_mm_and_lock routine would help not
at all in this case.

Eric

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-30 20:32                         ` Eric W. Biederman
@ 1999-01-31  1:00                           ` Andrea Arcangeli
  1999-01-31  8:36                             ` Eric W. Biederman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 1999-01-31  1:00 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-mm

On 30 Jan 1999, Eric W. Biederman wrote:

> I have a count incremented on the task so the task_struct won't go away.
> tsk->mm at any point in time _always_ points to a valid mm.

You must think this:

	CPU0				CPU1
	/proc				task1
	------------			----------------
	is task1->mm valid?
	answer -> yes!!
	mm = task1->mm
	IRQ14 (disk I/O completed)
	...
					__exit_mm()
					_dealloc_ task1->mm kmem_cache_free(mm)
					task1->mm = &init_mm
	...
	mm->count++
	^^^^^^^^^^^ you are writing data on random kernel space

Right now this can't happen because both /proc and __exit_mm() are
synchronized by the big kernel lock.

> 	do {	
> 		mm = tsk->mm;
> 	} while (!atomic_inc_and_test(&mm->count);

The point is that you can't increment and test a mm->count if you are not
sure that the mm exists on such piece of memory. And if you are sure that
such piece of memory exists you don't need to check it and you can only
increment it ;). Do you see my point now?

Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-31  1:00                           ` Andrea Arcangeli
@ 1999-01-31  8:36                             ` Eric W. Biederman
  1999-01-31 19:16                               ` Andrea Arcangeli
  0 siblings, 1 reply; 34+ messages in thread
From: Eric W. Biederman @ 1999-01-31  8:36 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

>>>>> "AA" == Andrea Arcangeli <andrea@e-mind.com> writes:

AA> On 30 Jan 1999, Eric W. Biederman wrote:

AA> The point is that you can't increment and test a mm->count if you are not
AA> sure that the mm exists on such piece of memory. And if you are sure that
AA> such piece of memory exists you don't need to check it and you can only
AA> increment it ;). Do you see my point now?

The check may be needed if someone is decrementing the count while you are
incrementing.   To remove the need for the check would require a lock
on the task struct.  (So a new pointer isn't written, and subsequently
the old data freed before you increment your count).

Furthermore I am perfectly aware, the race existed in my code, and that
it relied on fast code paths (not the best).  But since it cleared
the interrupts I could if need be garantee on a given machine the code would
always work.

However that is not a very portable way to code and I wouldn't recommend it.

Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-31  8:36                             ` Eric W. Biederman
@ 1999-01-31 19:16                               ` Andrea Arcangeli
  1999-01-31 21:56                                 ` Eric W. Biederman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 1999-01-31 19:16 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-mm

On 31 Jan 1999, Eric W. Biederman wrote:

> The check may be needed if someone is decrementing the count while you are
> incrementing.   To remove the need for the check would require a lock

No. When you are incrementing it you _must_ be sure that mm->count was
just >= 1 and that it will remains >=1 while you are incrementing it.

> on the task struct.  (So a new pointer isn't written, and subsequently

No you can't lock on the task struct. Other processes won't share your
lock otherwise. If other processes doesn't share your lock the lock is
useless.

> Furthermore I am perfectly aware, the race existed in my code, and that

Which code? ;)

> it relied on fast code paths (not the best).  But since it cleared
> the interrupts I could if need be garantee on a given machine the code would
> always work.

You usually don't release a mm inside an irq (so __cli() can't help you to
avoid the race). And it's _only_ a SMP race issue. UP is safe because
do_exit() run outside irq handlers.

I hope to have understood well your email (I had some problem with my
not very good English ;). If not let me know.

Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0]
  1999-01-31 19:16                               ` Andrea Arcangeli
@ 1999-01-31 21:56                                 ` Eric W. Biederman
  0 siblings, 0 replies; 34+ messages in thread
From: Eric W. Biederman @ 1999-01-31 21:56 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

>>>>> "AA" == Andrea Arcangeli <andrea@e-mind.com> writes:

AA> On 31 Jan 1999, Eric W. Biederman wrote:
>> The check may be needed if someone is decrementing the count while you are
>> incrementing.   To remove the need for the check would require a lock

AA> No. When you are incrementing it you _must_ be sure that mm->count was
AA> just >= 1 and that it will remains >=1 while you are incrementing it.

It's possible to do without this.  Not smart terribly smart or portable, but possible.

>> on the task struct.  (So a new pointer isn't written, and subsequently

AA> No you can't lock on the task struct. Other processes won't share your
AA> lock otherwise. If other processes doesn't share your lock the lock is
AA> useless.

You must lock on the task whose mm you are incrementing, or aquire a more general lock.
What you want to keep is the valid pointer from the tsk struct, valid
until you release the count.

>> Furthermore I am perfectly aware, the race existed in my code, and that

AA> Which code? ;)

The snippet for just using the atomic count, several emails ago in this
thread.  I believe I called the sketched subroutine fetch_mm.

>> it relied on fast code paths (not the best).  But since it cleared
>> the interrupts I could if need be garantee on a given machine the code would
>> always work.

AA> You usually don't release a mm inside an irq (so __cli() can't help you to
AA> avoid the race). And it's _only_ a SMP race issue. UP is safe because
AA> do_exit() run outside irq handlers.

But cli() will allow you to have a bounded execution time on a single CPU,
so you can know that another cpu won't have time to deallocate the memory.

AA> I hope to have understood well your email (I had some problem with my
AA> not very good English ;). If not let me know.

Go back and read through the thread slowly.  The trouble seems more
to do with missed points then miscomprehension of english.

I believe my last message was quite clear, though the ones before
it may have been a little muddled.

Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~1999-02-01 21:42 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <Pine.LNX.3.96.990127123207.15486A-100000@laser.bogus>
     [not found] ` <Pine.LNX.3.96.990127131315.19147A-100000@laser.bogus>
1999-01-27 21:38   ` [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Stephen C. Tweedie
1999-01-27 21:45     ` Linus Torvalds
1999-01-28  1:02     ` Andrea Arcangeli
1999-01-28  2:50       ` Andrea Arcangeli
1999-01-28  4:20         ` [patch] fixed both processes in D state and the /proc/ oopses Tom Holroyd
1999-01-28  6:23         ` Tom Holroyd
1999-01-28 15:09         ` [patch] fixed both processes in D state and the /proc/ oopses [Re: [patch] Fixed the race that was oopsing Linux-2.2.0] Stephen C. Tweedie
1999-01-28 17:54           ` Linus Torvalds
1999-01-28 18:07             ` Stephen C. Tweedie
1999-01-28 18:17               ` Linus Torvalds
1999-01-28 18:25                 ` Stephen C. Tweedie
1999-01-28 19:23                 ` Alan Cox
1999-01-28 19:11                   ` Linus Torvalds
1999-01-28 20:11                     ` Alan Cox
1999-01-28 22:33               ` Andrea Arcangeli
1999-01-28 22:53                 ` Linus Torvalds
1999-01-29  1:47                   ` Andrea Arcangeli
1999-01-29 11:20                     ` MOLNAR Ingo
1999-01-29 12:08                       ` Andrea Arcangeli
1999-01-29 13:19                         ` MOLNAR Ingo
1999-01-29 14:14                           ` Andrea Arcangeli
1999-01-29 17:46                             ` Theodore Y. Ts'o
1999-01-29 14:13                     ` Eric W. Biederman
1999-01-30 15:42                       ` Andrea Arcangeli
1999-01-30 20:32                         ` Eric W. Biederman
1999-01-31  1:00                           ` Andrea Arcangeli
1999-01-31  8:36                             ` Eric W. Biederman
1999-01-31 19:16                               ` Andrea Arcangeli
1999-01-31 21:56                                 ` Eric W. Biederman
1999-01-29 18:24                     ` Linus Torvalds
1999-01-28 22:04             ` Andrea Arcangeli
1999-01-29  0:17               ` Linus Torvalds
1999-01-28 17:36         ` Linus Torvalds
1999-01-28 15:05       ` Stephen C. Tweedie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox