* Re: 2.2.0-final [not found] <Pine.LNX.3.95.990120224340.23558G-100000@penguin.transmeta.com> @ 1999-01-23 20:56 ` Andrea Arcangeli 1999-01-24 1:51 ` 2.2.0-final Andrea Arcangeli ` (2 more replies) 0 siblings, 3 replies; 8+ messages in thread From: Andrea Arcangeli @ 1999-01-23 20:56 UTC (permalink / raw) To: Linus Torvalds Cc: Kernel Mailing List, linux-mm, Stephen C. Tweedie, Rik van Riel On Wed, 20 Jan 1999, Linus Torvalds wrote: > In short, before you post a bug-report about 2.2.0-final, I'd like you to There are three things from me I think should go in before 2.2.0 real (maybe a normal user would be not too much worried by these two races, it depends also about the definition on `normal user' ;). The first is a fix for a potential swapout deadlock I discovered and fixed some day ago. See my email about the topic with the patch: On Mon, 18 Jan 1999, Andrea Arcangeli wrote: > > Date: Mon, 18 Jan 1999 21:26:05 +0100 (CET) > From: Andrea Arcangeli <andrea@e-mind.com> > To: Zlatko Calusic <Zlatko.Calusic@CARNet.hr>, > "Stephen C. Tweedie" <sct@redhat.com>, > Linus Torvalds <torvalds@transmeta.com> > Cc: Linux-MM List <linux-mm@kvack.org>, > Linux Kernel List <linux-kernel@vger.rutgers.edu> > Subject: Re: Removing swap lockmap... > > On 18 Jan 1999, Zlatko Calusic wrote: > > > I removed swap lockmap all together and, to my surprise, I can't > > produce any ill behaviour on my system, not even under very heavy > > swapping (in low memory condition). > > Looking at your patch (and so looking at the swap_lockmap code) I found a > potential deadlock in the current swap_lockmap handling: > > task A task B > ---------- ------------- > rw_swap_page_base() > > ...if (test_and_set_bit(lockmap)) > ... run_task_queue() > swap_after_unlock_page() > ... clear_bit(lockmap) > .... wakeup(&lock_queue) > ...sleep_on(&lock_queue); > deadlocked > > I think it will not harm too much because the window is not too big (but > not small) and because usually one of the process not yet deadlocked will > generate IO and will wakeup also the deadlocked process at I/O > completation time. A very lazy ;) but at the same time obviosly right > (that should not harm performances at all) fix could be to replace the > sleep_on() with a sleep_on_timeout(..., 1). > * patch snipped * > > I think we need the swap_lockmap in the shm case because without swap > cache a swapin could happen at the same time of the swapout because > find_in_swap_cache() won't work there. > > Andrea Arcangeli Here the fix: Index: page_io.c =================================================================== RCS file: /var/cvs/linux/mm/page_io.c,v retrieving revision 1.1.2.1 diff -u -r1.1.2.1 page_io.c --- page_io.c 1999/01/18 01:32:53 1.1.2.1 +++ linux/mm/page_io.c 1999/01/18 20:21:41 @@ -88,7 +88,7 @@ /* Make sure we are the only process doing I/O with this swap page. */ while (test_and_set_bit(offset,p->swap_lockmap)) { run_task_queue(&tq_disk); - sleep_on(&lock_queue); + sleep_on_timeout(&lock_queue, 1); } /* ---------------------------------------------------------------------- The second thing is the complete race fix for the disable/enable_bh(). It's obviously right. Here it is (against 2.2.0-pre8intestingforalan but should apply clean to your tree too): Index: linux/include/asm-i386/softirq.h diff -u linux/include/asm-i386/softirq.h:1.1.1.1 linux/include/asm-i386/softirq.h:1.1.2.2 --- linux/include/asm-i386/softirq.h:1.1.1.1 Mon Jan 18 02:27:17 1999 +++ linux/include/asm-i386/softirq.h Wed Jan 20 07:41:42 1999 @@ -9,24 +9,6 @@ #define get_active_bhs() (bh_mask & bh_active) #define clear_active_bhs(x) atomic_clear_mask((x),&bh_active) -extern inline void init_bh(int nr, void (*routine)(void)) -{ - bh_base[nr] = routine; - atomic_set(&bh_mask_count[nr], 0); - bh_mask |= 1 << nr; -} - -extern inline void remove_bh(int nr) -{ - bh_base[nr] = NULL; - bh_mask &= ~(1 << nr); -} - -extern inline void mark_bh(int nr) -{ - set_bit(nr, &bh_active); -} - #ifdef __SMP__ /* @@ -90,21 +72,49 @@ #endif /* SMP */ +extern inline void init_bh(int nr, void (*routine)(void)) +{ + bh_base[nr] = routine; + bh_mask_count[nr] = 0; + wmb(); + bh_mask |= 1 << nr; +} + +extern inline void remove_bh(int nr) +{ + bh_mask &= ~(1 << nr); + synchronize_bh(); + bh_base[nr] = NULL; +} + +extern inline void mark_bh(int nr) +{ + set_bit(nr, &bh_active); +} + /* * These use a mask count to correctly handle * nested disable/enable calls */ extern inline void disable_bh(int nr) { + unsigned long flags; + + spin_lock_irqsave(&bh_lock, flags); bh_mask &= ~(1 << nr); - atomic_inc(&bh_mask_count[nr]); + bh_mask_count[nr]++; + spin_unlock_irqrestore(&bh_lock, flags); synchronize_bh(); } extern inline void enable_bh(int nr) { - if (atomic_dec_and_test(&bh_mask_count[nr])) + unsigned long flags; + + spin_lock_irqsave(&bh_lock, flags); + if (!--bh_mask_count[nr]) bh_mask |= 1 << nr; + spin_unlock_irqrestore(&bh_lock, flags); } #endif /* __ASM_SOFTIRQ_H */ Index: linux/include/linux/interrupt.h diff -u linux/include/linux/interrupt.h:1.1.1.1 linux/include/linux/interrupt.h:1.1.2.1 --- linux/include/linux/interrupt.h:1.1.1.1 Mon Jan 18 02:27:09 1999 +++ linux/include/linux/interrupt.h Mon Jan 18 02:32:58 1999 @@ -17,7 +17,8 @@ extern volatile unsigned char bh_running; -extern atomic_t bh_mask_count[32]; +extern spinlock_t bh_lock; +extern int bh_mask_count[32]; extern unsigned long bh_active; extern unsigned long bh_mask; extern void (*bh_base[32])(void); Index: linux/kernel/softirq.c diff -u linux/kernel/softirq.c:1.1.1.1 linux/kernel/softirq.c:1.1.2.1 --- linux/kernel/softirq.c:1.1.1.1 Mon Jan 18 02:27:00 1999 +++ linux/kernel/softirq.c Mon Jan 18 02:32:52 1999 @@ -20,7 +20,8 @@ /* intr_count died a painless death... -DaveM */ -atomic_t bh_mask_count[32]; +spinlock_t bh_lock = SPIN_LOCK_UNLOCKED; +int bh_mask_count[32]; unsigned long bh_active = 0; unsigned long bh_mask = 0; void (*bh_base[32])(void); ---------------------------------------------------------------------- The third thing I disagree is to swapout in cluster when shrink_mmap() fails at priority == 6 (or whatever). shrink_mmap() that fails tell nothing about the state of the VM. We could be with 0 phys RAM but with some freeable cache but shrink_mmap could still fail at that stage. This has no trivial fix (I think my new nr_freeable pages balance level will fix it though) and luckily is mostly a performances issue (even if I think it's the cause of the VM slowdown after some day of usage). >From a stableness point of view instead I think that the current try_to_free_pages() algorithm is not good because we should do only _one_ (and not count-- until swapout fail) swapout(), if nr_free_pages < freepages.min. This because low memory system SWAP_CLUSTER_MAX (aka 32) is very major than 10 (minimum of freepages.min). Here a patch: Index: vmscan.c =================================================================== RCS file: /var/cvs/linux/mm/vmscan.c,v retrieving revision 1.1.1.3 diff -u -r1.1.1.3 vmscan.c --- vmscan.c 1999/01/23 18:52:32 1.1.1.3 +++ linux/mm/vmscan.c 1999/01/23 20:53:11 @@ -487,6 +487,8 @@ while (swap_out(priority, gfp_mask)) { if (!--count) goto done; + if (nr_free_pages < freepages.min) + break; } shrink_dcache_memory(priority, gfp_mask); But NOTE, I _never_ tried this patch (nor tried compiled it), because I am testing my VM algorithm instead of 2.2.0 ones. Maybe it will harm a bit performances (not too much though) but looks to me strictly _needed_ to me for low memory machines. If somebody would try out the system w and w/o this patch after setting echo 10 >/proc/sys/vm/freepages it would be interesting. ------------ Busy-Linus can stop reading here (now ;) ---------------- BTW, I am running now with my new vm that take stable the number of freeable pages. This VM works greatly here. But I had to change all bh->b_count++ with bget(bh) and implementing bget() this way: extern inline unsigned int bget(struct buffer_head * bh) { buffer_get(bh); return ++bh->b_count; } where buffer_get() is this: extern inline void buffer_get(struct buffer_head *bh) { struct page * page = mem_map + MAP_NR(bh->b_data); switch (atomic_read(&page->count)) { case 1: atomic_inc(&page->count); nr_freeable_pages--; break; #if 1 /* PARANOID */ case 0: printk(KERN_ERR "buffer_get: page was unused!\n"); #endif } } And for b_count-- exists a bput(). Taking uptodate the file cache instead is been very easier (some line changed and nothing more). Lukily the only b_count++ or b_count-- are in buffer.c and in ext2fs, other fs has one or two b_count only. Seeems to works fine and stable here but I still need to do some test before release it. The only reason I developed nr_freeable_pages is because I want stable numbers. And to get stable numbers under swapout shrink_mmap retval is not enough because I could go sometime in the wrong direction doing the wrong work. But I can't trust the size of the cache or of the buffers as a balance factor because they could be all busy or all freeable... (as pointed out by Stephen). BTW, Stephen, having b_count == 0 (as I done) is a good approximation that the buffer is ready to be freed? I seen in buffer.c that it should be also unlocked, unprotected and clean to be freeable, but b_count looks like to be the most important thing, can a driver take locked/dirty/protected for an infinite time a buffer? If I rember well (not sure if we was talking about the same thing) also Rik suggested to have a nr_freeable_pages, I don't know if the reason he wanted it is my same one though. Comments from MM guys? Andrea Arcangeli -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: 2.2.0-final 1999-01-23 20:56 ` 2.2.0-final Andrea Arcangeli @ 1999-01-24 1:51 ` Andrea Arcangeli 1999-01-24 2:57 ` 2.2.0-final Eric W. Biederman 1999-01-24 13:16 ` 2.2.0-final Andrea Arcangeli 2 siblings, 0 replies; 8+ messages in thread From: Andrea Arcangeli @ 1999-01-24 1:51 UTC (permalink / raw) To: Stephen C. Tweedie, Rik van Riel; +Cc: Kernel Mailing List, linux-mm On Sat, 23 Jan 1999, Andrea Arcangeli wrote: > where buffer_get() is this: Just for the record, I cut-and-pasted a wrong buffer_get() (due a last-minute wrong hack, I noticed it now when I powerup the machine now ;), the right one is this: extern inline void buffer_get(struct buffer_head *bh) { struct page * page = mem_map + MAP_NR(bh->b_data); switch (atomic_read(&page->count)) { case 1: nr_freeable_pages--; default: atomic_inc(&page->count); break; #if 1 /* PARANOID */ case 0: printk(KERN_ERR "buffer_get: page was unused!\n"); #endif } } Andrea Arcangeli -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: 2.2.0-final 1999-01-23 20:56 ` 2.2.0-final Andrea Arcangeli 1999-01-24 1:51 ` 2.2.0-final Andrea Arcangeli @ 1999-01-24 2:57 ` Eric W. Biederman 1999-01-24 12:49 ` 2.2.0-final Andrea Arcangeli 1999-01-24 13:16 ` 2.2.0-final Andrea Arcangeli 2 siblings, 1 reply; 8+ messages in thread From: Eric W. Biederman @ 1999-01-24 2:57 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm >>>>> "AA" == Andrea Arcangeli <andrea@e-mind.com> writes: bh-> b_count++ with bget(bh) and implementing bget() this way: AA> extern inline unsigned int bget(struct buffer_head * bh) AA> { AA> buffer_get(bh); AA> return ++bh->b_count; AA> } AA> where buffer_get() is this: AA> extern inline void buffer_get(struct buffer_head *bh) AA> { AA> struct page * page = mem_map + MAP_NR(bh->b_data); AA> switch (atomic_read(&page->count)) AA> { AA> case 1: AA> atomic_inc(&page->count); AA> nr_freeable_pages--; This is bogus. Consider the case when you have 4 buffers per page (common with ext2fs) You will way underestimate the number of freeable pages. AA> break; AA> #if 1 /* PARANOID */ AA> case 0: AA> printk(KERN_ERR "buffer_get: page was unused!\n"); AA> #endif AA> } AA> } AA> And for b_count-- exists a bput(). AA> Taking uptodate the file cache instead is been very easier (some line AA> changed and nothing more). Lukily the only b_count++ or b_count-- are in AA> buffer.c and in ext2fs, other fs has one or two b_count only. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: 2.2.0-final 1999-01-24 2:57 ` 2.2.0-final Eric W. Biederman @ 1999-01-24 12:49 ` Andrea Arcangeli 0 siblings, 0 replies; 8+ messages in thread From: Andrea Arcangeli @ 1999-01-24 12:49 UTC (permalink / raw) To: Eric W. Biederman; +Cc: linux-mm On 23 Jan 1999, Eric W. Biederman wrote: > AA> extern inline void buffer_get(struct buffer_head *bh) > AA> { > AA> struct page * page = mem_map + MAP_NR(bh->b_data); > > AA> switch (atomic_read(&page->count)) > AA> { > AA> case 1: > AA> atomic_inc(&page->count); > AA> nr_freeable_pages--; > > This is bogus. Consider the case when you have 4 buffers per page (common with ext2fs) > You will way underestimate the number of freeable pages. I will not only understimate: the kernel will not boot at all ;). It's a last minute wrong hack I did (I replaced some if with the switch statement and I did a mistake). See my following post. Thanks. Andrea Arcangeli -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: 2.2.0-final 1999-01-23 20:56 ` 2.2.0-final Andrea Arcangeli 1999-01-24 1:51 ` 2.2.0-final Andrea Arcangeli 1999-01-24 2:57 ` 2.2.0-final Eric W. Biederman @ 1999-01-24 13:16 ` Andrea Arcangeli 1999-01-24 13:28 ` 2.2.0-final Andrea Arcangeli 1999-01-24 18:41 ` 2.2.0-final Eric W. Biederman 2 siblings, 2 replies; 8+ messages in thread From: Andrea Arcangeli @ 1999-01-24 13:16 UTC (permalink / raw) To: Linus Torvalds Cc: Kernel Mailing List, linux-mm, Stephen C. Tweedie, Rik van Riel On Sat, 23 Jan 1999, Andrea Arcangeli wrote: > On Wed, 20 Jan 1999, Linus Torvalds wrote: > > > In short, before you post a bug-report about 2.2.0-final, I'd like you to > > There are three things from me I think should go in before 2.2.0 real There's a fourth thing I forget to tell yesterday. If all pte are young we could not be able to swapout while with priority == 0 we must not care about CPU aging. I hope to have pointed out right and needed things, I don't want to spam you while you are busy... Here the fix: Index: vmscan.c =================================================================== RCS file: /var/cvs/linux/mm/vmscan.c,v retrieving revision 1.1.1.3 diff -u -r1.1.1.3 vmscan.c --- vmscan.c 1999/01/23 18:52:32 1.1.1.3 +++ linux/mm/vmscan.c 1999/01/24 13:12:30 @@ -325,7 +325,7 @@ * Think of swap_cnt as a "shadow rss" - it tells us which process * we want to page out (always try largest first). */ - counter = nr_tasks / (priority+1); + counter = (nr_tasks << 1) / (priority+1); if (counter < 1) counter = 1; if (counter > nr_tasks) Andrea Arcangeli -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: 2.2.0-final 1999-01-24 13:16 ` 2.2.0-final Andrea Arcangeli @ 1999-01-24 13:28 ` Andrea Arcangeli 1999-01-24 18:41 ` 2.2.0-final Eric W. Biederman 1 sibling, 0 replies; 8+ messages in thread From: Andrea Arcangeli @ 1999-01-24 13:28 UTC (permalink / raw) To: Linus Torvalds Cc: Kernel Mailing List, linux-mm, Stephen C. Tweedie, Rik van Riel On Sun, 24 Jan 1999, Andrea Arcangeli wrote: > Here the fix: Woops the fix was wrong, I forgot that there was a not needed check due me (just removed from some weeks here, and that's because I forget to remove it now ;): The complete fix is this. Excuse me... Index: vmscan.c =================================================================== RCS file: /var/cvs/linux/mm/vmscan.c,v retrieving revision 1.1.1.3 diff -u -r1.1.1.3 vmscan.c --- vmscan.c 1999/01/23 18:52:32 1.1.1.3 +++ vmscan.c 1999/01/24 13:26:24 @@ -325,11 +325,9 @@ * Think of swap_cnt as a "shadow rss" - it tells us which process * we want to page out (always try largest first). */ - counter = nr_tasks / (priority+1); + counter = (nr_tasks << 1) / (priority+1); if (counter < 1) counter = 1; - if (counter > nr_tasks) - counter = nr_tasks; for (; counter >= 0; counter--) { assign = 0; Andrea Arcangeli -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: 2.2.0-final 1999-01-24 13:16 ` 2.2.0-final Andrea Arcangeli 1999-01-24 13:28 ` 2.2.0-final Andrea Arcangeli @ 1999-01-24 18:41 ` Eric W. Biederman 1999-01-24 23:05 ` 2.2.0-final Andrea Arcangeli 1 sibling, 1 reply; 8+ messages in thread From: Eric W. Biederman @ 1999-01-24 18:41 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-mm >>>>> "AA" == Andrea Arcangeli <andrea@e-mind.com> writes: AA> On Sat, 23 Jan 1999, Andrea Arcangeli wrote: >> On Wed, 20 Jan 1999, Linus Torvalds wrote: >> >> > In short, before you post a bug-report about 2.2.0-final, I'd like you to >> >> There are three things from me I think should go in before 2.2.0 real AA> There's a fourth thing I forget to tell yesterday. If all pte are young we AA> could not be able to swapout while with priority == 0 we must not care AA> about CPU aging. I hope to have pointed out right and needed things, I AA> don't want to spam you while you are busy... I don't think this is an issue. Before we get to calling swap_out with priority == 0 we have called it with priorities. 6,5,4,3,2,1 Which will have travelled a little over 1.5 times over the page tables (assuming they can't find anything either). So it looks doubtful to me that all pte's could be young. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: 2.2.0-final 1999-01-24 18:41 ` 2.2.0-final Eric W. Biederman @ 1999-01-24 23:05 ` Andrea Arcangeli 0 siblings, 0 replies; 8+ messages in thread From: Andrea Arcangeli @ 1999-01-24 23:05 UTC (permalink / raw) To: Eric W. Biederman; +Cc: linux-mm On 24 Jan 1999, Eric W. Biederman wrote: > I don't think this is an issue. Before we get to calling > swap_out with priority == 0 we have called it with priorities. > 6,5,4,3,2,1 Which will have travelled a little over 1.5 times over > the page tables (assuming they can't find anything either). So you think that also shrink_mmap() could have a weight of nr_physpages instead of physpages << 1, the point is that we could sleep for a _long_ time in swap_shm() (sync I/O in rw_swap_page_nocache()). Andrea Arcangeli -- To unsubscribe, send a message with 'unsubscribe linux-mm my@address' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://humbolt.geo.uu.nl/Linux-MM/ ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~1999-01-24 23:05 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <Pine.LNX.3.95.990120224340.23558G-100000@penguin.transmeta.com>
1999-01-23 20:56 ` 2.2.0-final Andrea Arcangeli
1999-01-24 1:51 ` 2.2.0-final Andrea Arcangeli
1999-01-24 2:57 ` 2.2.0-final Eric W. Biederman
1999-01-24 12:49 ` 2.2.0-final Andrea Arcangeli
1999-01-24 13:16 ` 2.2.0-final Andrea Arcangeli
1999-01-24 13:28 ` 2.2.0-final Andrea Arcangeli
1999-01-24 18:41 ` 2.2.0-final Eric W. Biederman
1999-01-24 23:05 ` 2.2.0-final Andrea Arcangeli
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox