Re: 2.2.0-final

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: 2.2.0-final
       [not found] <Pine.LNX.3.95.990120224340.23558G-100000@penguin.transmeta.com>
@ 1999-01-23 20:56 ` Andrea Arcangeli
  1999-01-24  1:51   ` 2.2.0-final Andrea Arcangeli
                     ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Andrea Arcangeli @ 1999-01-23 20:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kernel Mailing List, linux-mm, Stephen C. Tweedie, Rik van Riel

On Wed, 20 Jan 1999, Linus Torvalds wrote:

> In short, before you post a bug-report about 2.2.0-final, I'd like you to

There are three things from me I think should go in before 2.2.0 real
(maybe a normal user would be not too much worried by these two races, it
depends also about the definition on `normal user' ;).

The first is a fix for a potential swapout deadlock I discovered and fixed
some day ago. See my email about the topic with the patch:

On Mon, 18 Jan 1999, Andrea Arcangeli wrote:
> 
> Date: Mon, 18 Jan 1999 21:26:05 +0100 (CET)
> From: Andrea Arcangeli <andrea@e-mind.com>
> To: Zlatko Calusic <Zlatko.Calusic@CARNet.hr>,
>     "Stephen C. Tweedie" <sct@redhat.com>,
>     Linus Torvalds <torvalds@transmeta.com>
> Cc: Linux-MM List <linux-mm@kvack.org>,
>     Linux Kernel List <linux-kernel@vger.rutgers.edu>
> Subject: Re: Removing swap lockmap...
> 
> On 18 Jan 1999, Zlatko Calusic wrote:
> 
> > I removed swap lockmap all together and, to my surprise, I can't
> > produce any ill behaviour on my system, not even under very heavy
> > swapping (in low memory condition).
> 
> Looking at your patch (and so looking at the swap_lockmap code) I found a
> potential deadlock in the current swap_lockmap handling: 
> 
> 	task A				task B
> 	----------			-------------
> 	rw_swap_page_base()
> 	
> 	...if (test_and_set_bit(lockmap))
> 		... run_task_queue()
> 					swap_after_unlock_page()
> 						... clear_bit(lockmap)
> 						.... wakeup(&lock_queue)
> 		...sleep_on(&lock_queue);
> 		deadlocked
> 
> I think it will not harm too much because the window is not too big (but
> not small) and because usually one of the process not yet deadlocked will
> generate IO and will wakeup also the deadlocked process at I/O
> completation time. A very lazy ;) but at the same time obviosly right
> (that should not harm performances at all) fix could be to replace the
> sleep_on() with a sleep_on_timeout(..., 1).
> 
 * patch snipped *
> 
> I think we need the swap_lockmap in the shm case because without swap
> cache a swapin could happen at the same time of the swapout because
> find_in_swap_cache() won't work there. 
> 
> Andrea Arcangeli

Here the fix:

Index: page_io.c
===================================================================
RCS file: /var/cvs/linux/mm/page_io.c,v
retrieving revision 1.1.2.1
diff -u -r1.1.2.1 page_io.c
--- page_io.c	1999/01/18 01:32:53	1.1.2.1
+++ linux/mm/page_io.c	1999/01/18 20:21:41
@@ -88,7 +88,7 @@
 		/* Make sure we are the only process doing I/O with this swap page. */
 		while (test_and_set_bit(offset,p->swap_lockmap)) {
 			run_task_queue(&tq_disk);
-			sleep_on(&lock_queue);
+			sleep_on_timeout(&lock_queue, 1);
 		}
 
 		/* 


----------------------------------------------------------------------

The second thing is the complete race fix for the disable/enable_bh(). 
It's obviously right. Here it is (against 2.2.0-pre8intestingforalan but
should apply clean to your tree too): 

Index: linux/include/asm-i386/softirq.h
diff -u linux/include/asm-i386/softirq.h:1.1.1.1 linux/include/asm-i386/softirq.h:1.1.2.2
--- linux/include/asm-i386/softirq.h:1.1.1.1	Mon Jan 18 02:27:17 1999
+++ linux/include/asm-i386/softirq.h	Wed Jan 20 07:41:42 1999
@@ -9,24 +9,6 @@
 #define get_active_bhs()	(bh_mask & bh_active)
 #define clear_active_bhs(x)	atomic_clear_mask((x),&bh_active)
 
-extern inline void init_bh(int nr, void (*routine)(void))
-{
-	bh_base[nr] = routine;
-	atomic_set(&bh_mask_count[nr], 0);
-	bh_mask |= 1 << nr;
-}
-
-extern inline void remove_bh(int nr)
-{
-	bh_base[nr] = NULL;
-	bh_mask &= ~(1 << nr);
-}
-
-extern inline void mark_bh(int nr)
-{
-	set_bit(nr, &bh_active);
-}
-
 #ifdef __SMP__
 
 /*
@@ -90,21 +72,49 @@
 
 #endif	/* SMP */
 
+extern inline void init_bh(int nr, void (*routine)(void))
+{
+	bh_base[nr] = routine;
+	bh_mask_count[nr] = 0;
+	wmb();
+	bh_mask |= 1 << nr;
+}
+
+extern inline void remove_bh(int nr)
+{
+	bh_mask &= ~(1 << nr);
+	synchronize_bh();
+	bh_base[nr] = NULL;
+}
+
+extern inline void mark_bh(int nr)
+{
+	set_bit(nr, &bh_active);
+}
+
 /*
  * These use a mask count to correctly handle
  * nested disable/enable calls
  */
 extern inline void disable_bh(int nr)
 {
+	unsigned long flags;
+
+	spin_lock_irqsave(&bh_lock, flags);
 	bh_mask &= ~(1 << nr);
-	atomic_inc(&bh_mask_count[nr]);
+	bh_mask_count[nr]++;
+	spin_unlock_irqrestore(&bh_lock, flags);
 	synchronize_bh();
 }
 
 extern inline void enable_bh(int nr)
 {
-	if (atomic_dec_and_test(&bh_mask_count[nr]))
+	unsigned long flags;
+
+	spin_lock_irqsave(&bh_lock, flags);
+	if (!--bh_mask_count[nr])
 		bh_mask |= 1 << nr;
+	spin_unlock_irqrestore(&bh_lock, flags);
 }
 
 #endif	/* __ASM_SOFTIRQ_H */
Index: linux/include/linux/interrupt.h
diff -u linux/include/linux/interrupt.h:1.1.1.1 linux/include/linux/interrupt.h:1.1.2.1
--- linux/include/linux/interrupt.h:1.1.1.1	Mon Jan 18 02:27:09 1999
+++ linux/include/linux/interrupt.h	Mon Jan 18 02:32:58 1999
@@ -17,7 +17,8 @@
 
 extern volatile unsigned char bh_running;
 
-extern atomic_t bh_mask_count[32];
+extern spinlock_t bh_lock;
+extern int bh_mask_count[32];
 extern unsigned long bh_active;
 extern unsigned long bh_mask;
 extern void (*bh_base[32])(void);
Index: linux/kernel/softirq.c
diff -u linux/kernel/softirq.c:1.1.1.1 linux/kernel/softirq.c:1.1.2.1
--- linux/kernel/softirq.c:1.1.1.1	Mon Jan 18 02:27:00 1999
+++ linux/kernel/softirq.c	Mon Jan 18 02:32:52 1999
@@ -20,7 +20,8 @@
 
 /* intr_count died a painless death... -DaveM */
 
-atomic_t bh_mask_count[32];
+spinlock_t bh_lock = SPIN_LOCK_UNLOCKED;
+int bh_mask_count[32];
 unsigned long bh_active = 0;
 unsigned long bh_mask = 0;
 void (*bh_base[32])(void);


----------------------------------------------------------------------

The third thing I disagree is to swapout in cluster when shrink_mmap() 
fails at priority == 6 (or whatever). shrink_mmap() that fails tell
nothing about the state of the VM. We could be with 0 phys RAM but with
some freeable cache but shrink_mmap could still fail at that stage. This
has no trivial fix (I think my new nr_freeable pages balance level will
fix it though) and luckily is mostly a performances issue (even if I
think it's the cause of the VM slowdown after some day of usage).

>From a stableness point of view instead I think that the current
try_to_free_pages() algorithm is not good because we should do only _one_
(and not count-- until swapout fail)  swapout(), if nr_free_pages <
freepages.min.  This because low memory system SWAP_CLUSTER_MAX (aka 32) 
is very major than 10 (minimum of freepages.min). Here a patch: 

Index: vmscan.c
===================================================================
RCS file: /var/cvs/linux/mm/vmscan.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 vmscan.c
--- vmscan.c	1999/01/23 18:52:32	1.1.1.3
+++ linux/mm/vmscan.c	1999/01/23 20:53:11
@@ -487,6 +487,8 @@
 		while (swap_out(priority, gfp_mask)) {
 			if (!--count)
 				goto done;
+			if (nr_free_pages < freepages.min)
+				break;
 		}
 
 		shrink_dcache_memory(priority, gfp_mask);


But NOTE, I _never_ tried this patch (nor tried compiled it), because I am
testing my VM algorithm instead of 2.2.0 ones. Maybe it will harm a bit
performances (not too much though) but looks to me strictly _needed_ to me
for low memory machines. If somebody would try out the system w and w/o
this patch after setting echo 10 >/proc/sys/vm/freepages it would be
interesting. 

------------ Busy-Linus can stop reading here (now ;) ----------------

BTW, I am running now with my new vm that take stable the number of
freeable pages. This VM works greatly here. But I had to change all
bh->b_count++ with bget(bh) and implementing bget() this way: 

extern inline unsigned int bget(struct buffer_head * bh)
{
        buffer_get(bh);
        return ++bh->b_count;
}

where buffer_get() is this:

extern inline void buffer_get(struct buffer_head *bh)
{
        struct page * page = mem_map + MAP_NR(bh->b_data);

        switch (atomic_read(&page->count))
        {
        case 1:
                atomic_inc(&page->count);
                nr_freeable_pages--;
                break;
#if 1 /* PARANOID */
        case 0:
                printk(KERN_ERR "buffer_get: page was unused!\n");
#endif
        }
}

And for b_count-- exists a bput().

Taking uptodate the file cache instead is been very easier (some line
changed and nothing more). Lukily the only b_count++ or b_count-- are in
buffer.c and in ext2fs, other fs has one or two b_count only.

Seeems to works fine and stable here but I still need to do some test
before release it. The only reason I developed nr_freeable_pages is
because I want stable numbers. And to get stable numbers under swapout
shrink_mmap retval is not enough because I could go sometime in the wrong
direction doing the wrong work.  But I can't trust the size of the cache
or of the buffers as a balance factor because they could be all busy or
all freeable... (as pointed out by Stephen).  BTW, Stephen, having b_count
== 0 (as I done) is a good approximation that the buffer is ready to be
freed? I seen in buffer.c that it should be also unlocked, unprotected and
clean to be freeable, but b_count looks like to be the most important
thing, can a driver take locked/dirty/protected for an infinite time a
buffer? 

If I rember well (not sure if we was talking about the same thing) also
Rik suggested to have a nr_freeable_pages, I don't know if the reason he
wanted it is my same one though. 

Comments from MM guys?

Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: 2.2.0-final
  1999-01-23 20:56 ` 2.2.0-final Andrea Arcangeli
@ 1999-01-24  1:51   ` Andrea Arcangeli
  1999-01-24  2:57   ` 2.2.0-final Eric W. Biederman
  1999-01-24 13:16   ` 2.2.0-final Andrea Arcangeli
  2 siblings, 0 replies; 8+ messages in thread
From: Andrea Arcangeli @ 1999-01-24  1:51 UTC (permalink / raw)
  To: Stephen C. Tweedie, Rik van Riel; +Cc: Kernel Mailing List, linux-mm

On Sat, 23 Jan 1999, Andrea Arcangeli wrote:

> where buffer_get() is this:

Just for the record, I cut-and-pasted a wrong buffer_get() (due a
last-minute wrong hack, I noticed it now when I powerup the machine now
;), the right one is this: 

extern inline void buffer_get(struct buffer_head *bh)
{
	struct page * page = mem_map + MAP_NR(bh->b_data);

	switch (atomic_read(&page->count))
	{
	case 1:
		nr_freeable_pages--;
	default:
		atomic_inc(&page->count);
		break;
#if 1 /* PARANOID */
	case 0:
		printk(KERN_ERR "buffer_get: page was unused!\n");
#endif
	}
}

Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: 2.2.0-final
  1999-01-23 20:56 ` 2.2.0-final Andrea Arcangeli
  1999-01-24  1:51   ` 2.2.0-final Andrea Arcangeli
@ 1999-01-24  2:57   ` Eric W. Biederman
  1999-01-24 12:49     ` 2.2.0-final Andrea Arcangeli
  1999-01-24 13:16   ` 2.2.0-final Andrea Arcangeli
  2 siblings, 1 reply; 8+ messages in thread
From: Eric W. Biederman @ 1999-01-24  2:57 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

>>>>> "AA" == Andrea Arcangeli <andrea@e-mind.com> writes:

bh-> b_count++ with bget(bh) and implementing bget() this way: 

AA> extern inline unsigned int bget(struct buffer_head * bh)
AA> {
AA>         buffer_get(bh);
AA>         return ++bh->b_count;
AA> }

AA> where buffer_get() is this:

AA> extern inline void buffer_get(struct buffer_head *bh)
AA> {
AA>         struct page * page = mem_map + MAP_NR(bh->b_data);

AA>         switch (atomic_read(&page->count))
AA>         {
AA>         case 1:
AA>                 atomic_inc(&page->count);
AA>                 nr_freeable_pages--;

This is bogus.   Consider the case when you have 4 buffers per page (common with ext2fs)
You will way underestimate the number of freeable pages.

AA>                 break;
AA> #if 1 /* PARANOID */
AA>         case 0:
AA>                 printk(KERN_ERR "buffer_get: page was unused!\n");
AA> #endif
AA>         }
AA> }

AA> And for b_count-- exists a bput().

AA> Taking uptodate the file cache instead is been very easier (some line
AA> changed and nothing more). Lukily the only b_count++ or b_count-- are in
AA> buffer.c and in ext2fs, other fs has one or two b_count only.

Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: 2.2.0-final
  1999-01-24  2:57   ` 2.2.0-final Eric W. Biederman
@ 1999-01-24 12:49     ` Andrea Arcangeli
  0 siblings, 0 replies; 8+ messages in thread
From: Andrea Arcangeli @ 1999-01-24 12:49 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-mm

On 23 Jan 1999, Eric W. Biederman wrote:

> AA> extern inline void buffer_get(struct buffer_head *bh)
> AA> {
> AA>         struct page * page = mem_map + MAP_NR(bh->b_data);
> 
> AA>         switch (atomic_read(&page->count))
> AA>         {
> AA>         case 1:
> AA>                 atomic_inc(&page->count);
> AA>                 nr_freeable_pages--;
> 
> This is bogus.   Consider the case when you have 4 buffers per page (common with ext2fs)
> You will way underestimate the number of freeable pages.

I will not only understimate: the kernel will not boot at all ;). It's a
last minute wrong hack I did (I replaced some if with the switch statement
and I did a mistake). See my following post. Thanks.

Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: 2.2.0-final
  1999-01-23 20:56 ` 2.2.0-final Andrea Arcangeli
  1999-01-24  1:51   ` 2.2.0-final Andrea Arcangeli
  1999-01-24  2:57   ` 2.2.0-final Eric W. Biederman
@ 1999-01-24 13:16   ` Andrea Arcangeli
  1999-01-24 13:28     ` 2.2.0-final Andrea Arcangeli
  1999-01-24 18:41     ` 2.2.0-final Eric W. Biederman
  2 siblings, 2 replies; 8+ messages in thread
From: Andrea Arcangeli @ 1999-01-24 13:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kernel Mailing List, linux-mm, Stephen C. Tweedie, Rik van Riel

On Sat, 23 Jan 1999, Andrea Arcangeli wrote:

> On Wed, 20 Jan 1999, Linus Torvalds wrote:
> 
> > In short, before you post a bug-report about 2.2.0-final, I'd like you to
> 
> There are three things from me I think should go in before 2.2.0 real

There's a fourth thing I forget to tell yesterday. If all pte are young we
could not be able to swapout while with priority == 0 we must not care
about CPU aging. I hope to have pointed out right and needed things, I
don't want to spam you while you are busy... 

Here the fix:

Index: vmscan.c
===================================================================
RCS file: /var/cvs/linux/mm/vmscan.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 vmscan.c
--- vmscan.c	1999/01/23 18:52:32	1.1.1.3
+++ linux/mm/vmscan.c	1999/01/24 13:12:30
@@ -325,7 +325,7 @@
 	 * Think of swap_cnt as a "shadow rss" - it tells us which process
 	 * we want to page out (always try largest first).
 	 */
-	counter = nr_tasks / (priority+1);
+	counter = (nr_tasks << 1) / (priority+1);
 	if (counter < 1)
 		counter = 1;
 	if (counter > nr_tasks)


Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: 2.2.0-final
  1999-01-24 13:16   ` 2.2.0-final Andrea Arcangeli
@ 1999-01-24 13:28     ` Andrea Arcangeli
  1999-01-24 18:41     ` 2.2.0-final Eric W. Biederman
  1 sibling, 0 replies; 8+ messages in thread
From: Andrea Arcangeli @ 1999-01-24 13:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kernel Mailing List, linux-mm, Stephen C. Tweedie, Rik van Riel

On Sun, 24 Jan 1999, Andrea Arcangeli wrote:

> Here the fix:

Woops the fix was wrong, I forgot that there was a not needed check due me
(just removed from some weeks here, and that's because I forget to
remove it now ;):

The complete fix is this. Excuse me...

Index: vmscan.c
===================================================================
RCS file: /var/cvs/linux/mm/vmscan.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 vmscan.c
--- vmscan.c	1999/01/23 18:52:32	1.1.1.3
+++ vmscan.c	1999/01/24 13:26:24
@@ -325,11 +325,9 @@
 	 * Think of swap_cnt as a "shadow rss" - it tells us which process
 	 * we want to page out (always try largest first).
 	 */
-	counter = nr_tasks / (priority+1);
+	counter = (nr_tasks << 1) / (priority+1);
 	if (counter < 1)
 		counter = 1;
-	if (counter > nr_tasks)
-		counter = nr_tasks;
 
 	for (; counter >= 0; counter--) {
 		assign = 0;



Andrea Arcangeli

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: 2.2.0-final
  1999-01-24 13:16   ` 2.2.0-final Andrea Arcangeli
  1999-01-24 13:28     ` 2.2.0-final Andrea Arcangeli
@ 1999-01-24 18:41     ` Eric W. Biederman
  1999-01-24 23:05       ` 2.2.0-final Andrea Arcangeli
  1 sibling, 1 reply; 8+ messages in thread
From: Eric W. Biederman @ 1999-01-24 18:41 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

>>>>> "AA" == Andrea Arcangeli <andrea@e-mind.com> writes:

AA> On Sat, 23 Jan 1999, Andrea Arcangeli wrote:
>> On Wed, 20 Jan 1999, Linus Torvalds wrote:
>> 
>> > In short, before you post a bug-report about 2.2.0-final, I'd like you to
>> 
>> There are three things from me I think should go in before 2.2.0 real

AA> There's a fourth thing I forget to tell yesterday. If all pte are young we
AA> could not be able to swapout while with priority == 0 we must not care
AA> about CPU aging. I hope to have pointed out right and needed things, I
AA> don't want to spam you while you are busy... 

I don't think this is an issue.  Before we get to calling
swap_out with priority == 0 we have called it with priorities.
6,5,4,3,2,1  Which will have travelled a little over 1.5 times over
the page tables (assuming they can't find anything either).

So it looks doubtful to me that all pte's could be young.

Eric

--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: 2.2.0-final
  1999-01-24 18:41     ` 2.2.0-final Eric W. Biederman
@ 1999-01-24 23:05       ` Andrea Arcangeli
  0 siblings, 0 replies; 8+ messages in thread
From: Andrea Arcangeli @ 1999-01-24 23:05 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-mm

On 24 Jan 1999, Eric W. Biederman wrote:

> I don't think this is an issue.  Before we get to calling
> swap_out with priority == 0 we have called it with priorities.
> 6,5,4,3,2,1  Which will have travelled a little over 1.5 times over
> the page tables (assuming they can't find anything either).

So you think that also shrink_mmap() could have a weight of nr_physpages
instead of physpages << 1, the point is that we could sleep for a _long_
time in swap_shm() (sync I/O in rw_swap_page_nocache()).

Andrea Arcangeli


--
To unsubscribe, send a message with 'unsubscribe linux-mm my@address'
in the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://humbolt.geo.uu.nl/Linux-MM/

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~1999-01-24 23:05 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <Pine.LNX.3.95.990120224340.23558G-100000@penguin.transmeta.com>
1999-01-23 20:56 ` 2.2.0-final Andrea Arcangeli
1999-01-24  1:51   ` 2.2.0-final Andrea Arcangeli
1999-01-24  2:57   ` 2.2.0-final Eric W. Biederman
1999-01-24 12:49     ` 2.2.0-final Andrea Arcangeli
1999-01-24 13:16   ` 2.2.0-final Andrea Arcangeli
1999-01-24 13:28     ` 2.2.0-final Andrea Arcangeli
1999-01-24 18:41     ` 2.2.0-final Eric W. Biederman
1999-01-24 23:05       ` 2.2.0-final Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox