* __GFP_IO && shrink_[d|i]cache_memory()?
From: Ingo Molnar @ 2000-09-24 10:11 UTC
To: Rik van Riel, Roger Larsson; +Cc: Linus Torvalds, MM mailing list, linux-kernel

i've seen a couple of GFP_BUFFER allocation deadlocks in an atypical
system which had lots of RAM allocated to inodes. The reason for the
deadlock is that the shrink_*() functions cannot be called if __GFP_IO is
not set. Nothing else can be freed at that point, so the try_again: loop
in page_alloc() gets into an infinite loop.

as an immediate solution the previous __GFP_WAIT suggestion solves the
deadlock - because the GFP_BUFFER allocator yields the CPU and kswapd can
run and do the dcache/icache shrinking. [i cannot reproduce any deadlocks
after doing this change.]

as a longer term solution, i'm wondering how hard it would be to propagate
gfp_mask into the shrink_*() functions, and prevent recursion similarly to
the swap-out logic? This way even GFP_BUFFER allocators could touch/free
the dcache/icache.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
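A minimal sketch of the flag logic behind this report, assuming the
GFP_BUFFER / GFP_KERNEL / GFP_KSWAPD compositions shown in the mm.h hunk of
the patch later in this thread. The DEMO_ names and the bit values are
hypothetical and chosen only for illustration; the helper functions are not
kernel functions. It shows that a GFP_BUFFER request neither reaches the
dcache/icache shrinkers nor yields the CPU when both tests key on the IO
bit, while a test on the WAIT bit lets it yield.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel flags; bit values are assumed. */
#define DEMO_GFP_WAIT 0x01u
#define DEMO_GFP_HIGH 0x02u
#define DEMO_GFP_IO   0x04u

/* Compositions as listed in the mm.h hunk of the patch in this thread. */
#define DEMO_GFP_BUFFER (DEMO_GFP_HIGH | DEMO_GFP_WAIT)
#define DEMO_GFP_KERNEL (DEMO_GFP_HIGH | DEMO_GFP_WAIT | DEMO_GFP_IO)
#define DEMO_GFP_KSWAPD (DEMO_GFP_IO)

/* Gate described in the report: shrinkers run only if IO is allowed. */
bool old_may_shrink_caches(unsigned int gfp_mask)
{
	return (gfp_mask & DEMO_GFP_IO) != 0;
}

/* Yield test keyed on the IO bit: a GFP_BUFFER caller never yields. */
bool io_keyed_may_yield(unsigned int gfp_mask, bool need_resched)
{
	return need_resched && (gfp_mask & DEMO_GFP_IO) != 0;
}

/* Yield test keyed on the WAIT bit: any caller that may sleep can yield. */
bool wait_keyed_may_yield(unsigned int gfp_mask, bool need_resched)
{
	return need_resched && (gfp_mask & DEMO_GFP_WAIT) != 0;
}
```

With these definitions a GFP_BUFFER mask fails both IO-keyed tests, which
matches the loop described above: nothing gets freed and nobody else gets
to run.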
* Re: __GFP_IO && shrink_[d|i]cache_memory()?
From: Linus Torvalds @ 2000-09-24 18:11 UTC
To: Ingo Molnar; +Cc: Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Ingo Molnar wrote:
>
> as a longer term solution, i'm wondering how hard it would be to propagate
> gfp_mask into the shrink_*() functions, and prevent recursion similarly to
> the swap-out logic? This way even GFP_BUFFER allocators could touch/free
> the dcache/icache.

Well, the gfp_mask actually _is_ propagated already, it's just that if
__GFP_IO isn't set the calls are never done. A trivial patch would move
the __GFP_IO test into the functions (no change in behaviour), and then
slowly move the test down to the proper place.

We should be able to do some SHM swapping even if __GFP_IO isn't set. For
example, I don't think shrinking the inode cache is actually illegal when
GFP_IO isn't set. In fact, it's probably only the buffer cache itself that
has to avoid recursion - the other stuff doesn't actually do any IO.

		Linus
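The "move the test into the functions (no change in behaviour)" step can be
sketched as follows. This is an illustration under assumptions, not kernel
code: DEMO_GFP_IO, its bit value, and every function name here are
hypothetical, and the "shrinker" just returns a number of pages it pretends
to have freed.

```c
#include <assert.h>

/* Hypothetical stand-in for __GFP_IO; the bit value is assumed. */
#define DEMO_GFP_IO 0x04u

/* Pretend shrinker body: reports `work` pages freed. */
int demo_shrink_body(int work)
{
	return work;
}

/* Before: the caller tests the flag and skips the call entirely. */
int reclaim_caller_gated(unsigned int gfp_mask, int work)
{
	int freed = 0;

	if (gfp_mask & DEMO_GFP_IO)
		freed += demo_shrink_body(work);
	return freed;
}

/* After: the callee receives gfp_mask and performs the same test itself. */
int demo_shrink_gated(unsigned int gfp_mask, int work)
{
	if (!(gfp_mask & DEMO_GFP_IO))
		return 0;
	return demo_shrink_body(work);
}

/* The caller no longer needs to know about the flag. */
int reclaim_callee_gated(unsigned int gfp_mask, int work)
{
	return demo_shrink_gated(gfp_mask, work);
}
```

Both variants return the same value for every mask, so the move is a pure
refactoring. Once the test lives in the callee it can later be pushed
further down, next to the code that actually starts IO.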
* Re: __GFP_IO && shrink_[d|i]cache_memory()?
From: Ingo Molnar @ 2000-09-24 18:40 UTC
To: Linus Torvalds; +Cc: Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Linus Torvalds wrote:

> [...] I don't think shrinking the inode cache is actually illegal when
> GFP_IO isn't set. In fact, it's probably only the buffer cache itself
> that has to avoid recursion - the other stuff doesn't actually do any
> IO.

i just found this out by example, i'm running the shrink_[i|d]cache stuff
even if __GFP_IO is not set, and no problems so far. (and much better
balancing behavior)

	Ingo
* Re: __GFP_IO && shrink_[d|i]cache_memory()?
From: Linus Torvalds @ 2000-09-24 18:39 UTC
To: Ingo Molnar; +Cc: Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Ingo Molnar wrote:
>
> i just found this out by example, i'm running the shrink_[i|d]cache stuff
> even if __GFP_IO is not set, and no problems so far. (and much better
> balancing behavior)

Send me the tested patch (and I'd suggest moving the shm_swap() test into
shm_swap() too, so that refill_inactive() gets cleaned up a bit).

		Linus
* Re: __GFP_IO && shrink_[d|i]cache_memory()?
From: Linus Torvalds @ 2000-09-24 18:46 UTC
To: Ingo Molnar; +Cc: Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

[ Sorry to follow up on myself.. ]

On Sun, 24 Sep 2000, Linus Torvalds wrote:
>
> Send me the tested patch (and I'd suggest moving the shm_swap() test into
> shm_swap() too, so that refill_inactive() gets cleaned up a bit).

I think that shm_swap still needs it - it's doing things with
rw_swap_page() that means that we cannot run it without GFP_IO.

HOWEVER, I suspect that in the long run we should move to using the page
cache better by the shm routines, and that might mean that eventually we
can do it even without GFP_IO (and instead let the generic VM routines
handle the actual IO on the swap cache).

So it makes sense to leave shm_swap() behaviour unchanged (ie do nothing
if GFP_IO is not set), but move the GFP_IO test down into shm_swap() so
that it will (a) match the other cases and (b) be easier to change the
GFP_IO logic later on if/when we clean up shm.

		Linus
* Re: __GFP_IO && shrink_[d|i]cache_memory()?
From: Ingo Molnar @ 2000-09-24 18:59 UTC
To: Linus Torvalds; +Cc: Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Linus Torvalds wrote:

> I think that shm_swap still needs it - it's doing things with
> rw_swap_page() that means that we cannot run it without GFP_IO.

yep - i only pushed the test inside, it's functionally equivalent - it
only vanished from refill_inactive(). It's basically now a detail of the
lowlevel swapping functions to honor __GFP_IO.

> So it makes sense to leave shm_swap() behaviour unchanged (ie do
> nothing if GFP_IO is not set), but move the GFP_IO test down into
> shm_swap() so that it will (a) match the other cases and (b) be easier
> to change the GFP_IO logic later on if/when we clean up shm.

yep.

	Ingo
* [patch] vmfixes-2.4.0-test9-B2
From: Ingo Molnar @ 2000-09-24 19:34 UTC
To: Linus Torvalds; +Cc: Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1616 bytes --]

the attached vmfixes-B2 patch adds the following fixes/cleanups:

vmscan.c:

 - check for __GFP_WAIT not __GFP_IO when yielding the CPU. This fixes
   GFP_BUFFER deadlocks. In fact since no caller to do_try_to_free_pages()
   can expect that function to not block, we don't test for __GFP_WAIT
   either. [GFP_KSWAPD is the only caller without __GFP_WAIT set.]

 - do shrink_[d|i]cache_memory() even if !__GFP_IO. This improves balance.

 - push the __GFP_IO test into shm_swap().

 - after shm_swap() do not test for !count but for <= 0, because count
   could be negative if in the future the shrink_ functions return bigger
   than 1, and we could then get into an infinite loop. Same after
   swap_out() and refill_inactive_scan(). No performance penalty, test for
   zero is exchanged with test for sign.

 - kmem_cache_reap() is done within refill_inactive(), so it's unnecessary
   to call it at the beginning of do_try_to_free_pages(). Moved to the
   else branch. (i saw kmem_cache_reap() show up in profiles)

 - (small codestyle cleanup.)

page_alloc.c:

 - in __alloc_pages(), the infinite allocation loop yields the CPU if
   necessary. This prevents a potential lockup on UP, and even on SMP it
   can prevent livelocks. (i saw this happen.)

mm.h:

 - made the GFP_ flag definitions easier to parse for humans :-)

 - remove shrink_mmap() prototype, it doesn't exist anymore.

shm.c:

 - the trivial test for __GFP_IO.

swap_state.c, filemap.c:

 - (shrink_mmap doesn't exist anymore, it's refill_inactive.)

(The patch applies and compiles cleanly, and is tested under various VM
loads i use.)
	Ingo

[-- Attachment #2: Type: TEXT/PLAIN, Size: 7375 bytes --]

--- linux/mm/vmscan.c.orig	Sun Sep 24 11:41:38 2000
+++ linux/mm/vmscan.c	Sun Sep 24 12:20:27 2000
@@ -119,7 +119,7 @@
 		 * our scan.
 		 *
 		 * Basically, this just makes it possible for us to do
-		 * some real work in the future in "shrink_mmap()".
+		 * some real work in the future in "refill_inactive()".
 		 */
 		if (!pte_dirty(pte)) {
 			flush_cache_page(vma, address);
@@ -159,7 +159,7 @@
 		 * NOTE NOTE NOTE! This should just set a
 		 * dirty bit in 'page', and just drop the
 		 * pte. All the hard work would be done by
-		 * shrink_mmap().
+		 * refill_inactive().
 		 *
 		 * That would get rid of a lot of problems.
 		 */
@@ -891,7 +891,7 @@
 	do {
 		made_progress = 0;
 
-		if (current->need_resched && (gfp_mask & __GFP_IO)) {
+		if (current->need_resched) {
 			__set_current_state(TASK_RUNNING);
 			schedule();
 		}
@@ -899,34 +899,32 @@
 		while (refill_inactive_scan(priority, 1) ||
 				swap_out(priority, gfp_mask, idle_time)) {
 			made_progress = 1;
-			if (!--count)
+			if (--count <= 0)
 				goto done;
 		}
 
-		/* Try to get rid of some shared memory pages.. */
-		if (gfp_mask & __GFP_IO) {
-			/*
-			 * don't be too light against the d/i cache since
-			 * shrink_mmap() almost never fail when there's
-			 * really plenty of memory free.
-			 */
-			count -= shrink_dcache_memory(priority, gfp_mask);
-			count -= shrink_icache_memory(priority, gfp_mask);
-			/*
-			 * Not currently working, see fixme in shrink_?cache_memory
-			 * In the inner funtions there is a comment:
-			 * "To help debugging, a zero exit status indicates
-			 *  all slabs were released." (-arca?)
-			 * lets handle it in a primitive but working way...
-			 *	if (count <= 0)
-			 *		goto done;
-			 */
+		/*
+		 * don't be too light against the d/i cache since
+		 * refill_inactive() almost never fail when there's
+		 * really plenty of memory free.
+		 */
+		count -= shrink_dcache_memory(priority, gfp_mask);
+		count -= shrink_icache_memory(priority, gfp_mask);
+		/*
+		 * Not currently working, see fixme in shrink_?cache_memory
+		 * In the inner funtions there is a comment:
+		 * "To help debugging, a zero exit status indicates
+		 *  all slabs were released." (-arca?)
+		 * lets handle it in a primitive but working way...
+		 *	if (count <= 0)
+		 *		goto done;
+		 */
 
-			while (shm_swap(priority, gfp_mask)) {
-				made_progress = 1;
-				if (!--count)
-					goto done;
-			}
+		/* Try to get rid of some shared memory pages.. */
+		while (shm_swap(priority, gfp_mask)) {
+			made_progress = 1;
+			if (--count <= 0)
+				goto done;
 		}
 
 		/*
@@ -934,7 +932,7 @@
 		 */
 		while (swap_out(priority, gfp_mask, 0)) {
 			made_progress = 1;
-			if (!--count)
+			if (--count <= 0)
 				goto done;
 		}
 
@@ -955,9 +953,9 @@
 		priority--;
 	} while (priority >= 0);
 
-	/* Always end on a shrink_mmap.., may sleep... */
+	/* Always end on a refill_inactive.., may sleep... */
 	while (refill_inactive_scan(0, 1)) {
-		if (!--count)
+		if (--count <= 0)
 			goto done;
 	}
 
@@ -970,11 +968,6 @@
 	int ret = 0;
 
 	/*
-	 * First, reclaim unused slab cache memory.
-	 */
-	kmem_cache_reap(gfp_mask);
-
-	/*
 	 * If we're low on free pages, move pages from the
 	 * inactive_dirty list to the inactive_clean list.
 	 *
@@ -992,13 +985,14 @@
 	 * the inode and dentry cache whenever we do this.
 	 */
 	if (free_shortage() || inactive_shortage()) {
-		if (gfp_mask & __GFP_IO) {
-			ret += shrink_dcache_memory(6, gfp_mask);
-			ret += shrink_icache_memory(6, gfp_mask);
-		}
-
+		ret += shrink_dcache_memory(6, gfp_mask);
+		ret += shrink_icache_memory(6, gfp_mask);
 		ret += refill_inactive(gfp_mask, user);
 	} else {
+		/*
+		 * Reclaim unused slab cache memory.
+		 */
+		kmem_cache_reap(gfp_mask);
 		ret = 1;
 	}
 
@@ -1153,9 +1147,8 @@
 {
 	int ret = 1;
 
-	if (gfp_mask & __GFP_WAIT) {
+	if (gfp_mask & __GFP_WAIT)
 		ret = do_try_to_free_pages(gfp_mask, 1);
-	}
 
 	return ret;
 }
--- linux/mm/page_alloc.c.orig	Sun Sep 24 11:44:59 2000
+++ linux/mm/page_alloc.c	Sun Sep 24 11:52:00 2000
@@ -444,6 +444,13 @@
 	 * processes, etc).
 	 */
 	if (gfp_mask & __GFP_WAIT) {
+		/*
+		 * Give other processes a chance to run:
+		 */
+		if (current->need_resched) {
+			__set_current_state(TASK_RUNNING);
+			schedule();
+		}
 		try_to_free_pages(gfp_mask);
 		memory_pressure++;
 		goto try_again;
--- linux/mm/filemap.c.orig	Sun Sep 24 12:20:35 2000
+++ linux/mm/filemap.c	Sun Sep 24 12:20:48 2000
@@ -1925,10 +1925,10 @@
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
  * data it wants to keep.  Be sure to free swap resources too.  The
- * zap_page_range call sets things up for shrink_mmap to actually free
+ * zap_page_range call sets things up for refill_inactive to actually free
  * these pages later if no one else has touched them in the meantime,
  * although we could add these pages to a global reuse list for
- * shrink_mmap to pick up before reclaiming other pages.
+ * refill_inactive to pick up before reclaiming other pages.
  *
  * NB: This interface discards data rather than pushes it out to swap,
  * as some implementations do.  This has performance implications for
--- linux/mm/swap_state.c.orig	Sun Sep 24 12:21:02 2000
+++ linux/mm/swap_state.c	Sun Sep 24 12:21:13 2000
@@ -166,7 +166,7 @@
 			return 0;
 		/*
 		 * Though the "found" page was in the swap cache an instant
-		 * earlier, it might have been removed by shrink_mmap etc.
+		 * earlier, it might have been removed by refill_inactive etc.
 		 * Re search ... Since find_lock_page grabs a reference on
 		 * the page, it can not be reused for anything else, namely
 		 * it can not be associated with another swaphandle, so it
--- linux/include/linux/mm.h.orig	Sun Sep 24 11:46:37 2000
+++ linux/include/linux/mm.h	Sun Sep 24 12:21:54 2000
@@ -441,7 +441,6 @@
 /* filemap.c */
 extern void remove_inode_page(struct page *);
 extern unsigned long page_unuse(struct page *);
-extern int shrink_mmap(int, int);
 extern void truncate_inode_pages(struct address_space *, loff_t);
 
 /* generic vm_area_ops exported for stackable file systems */
@@ -469,11 +468,11 @@
 
 #define GFP_BUFFER	(__GFP_HIGH | __GFP_WAIT)
 #define GFP_ATOMIC	(__GFP_HIGH)
-#define GFP_USER	(__GFP_WAIT | __GFP_IO)
-#define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
+#define GFP_USER	(             __GFP_WAIT | __GFP_IO)
+#define GFP_HIGHUSER	(             __GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
 #define GFP_KERNEL	(__GFP_HIGH | __GFP_WAIT | __GFP_IO)
 #define GFP_NFS		(__GFP_HIGH | __GFP_WAIT | __GFP_IO)
-#define GFP_KSWAPD	(__GFP_IO)
+#define GFP_KSWAPD	(                          __GFP_IO)
 
 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
--- linux/ipc/shm.c.orig	Sun Sep 24 11:45:16 2000
+++ linux/ipc/shm.c	Sun Sep 24 11:53:59 2000
@@ -1536,6 +1536,12 @@
 	int counter;
 	struct page * page_map;
 
+	/*
+	 * Push this inside:
+	 */
+	if (!(gfp_mask & __GFP_IO))
+		return 0;
+
 	zshm_swap(prio, gfp_mask);
 	counter = shm_rss >> prio;
 	if (!counter)
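The reason for exchanging `if (!--count)` with `if (--count <= 0)` can be
checked with a small stand-alone sketch. The function below is hypothetical
and only models the exit test of the reclaim loops: `count` is decremented
once per step, and the caller chooses between the zero test and the sign
test. A start value at or below zero models the case named in the patch
description, where `count -= shrink_..._memory()` has already subtracted
more than was left.

```c
#include <assert.h>

/*
 * Decrement `count` once per step and return the step at which the exit
 * test fires, or -1 if it has not fired within max_steps.
 * use_sign_test == 0 models "if (!--count)",
 * use_sign_test != 0 models "if (--count <= 0)".
 */
int steps_until_done(int count, int use_sign_test, int max_steps)
{
	for (int step = 1; step <= max_steps; step++) {
		--count;
		if (use_sign_test ? (count <= 0) : (count == 0))
			return step;
	}
	return -1;
}
```

For a positive start value both tests fire on the same step. Once `count`
has gone to zero or below, the zero test keeps decrementing away from zero
and never fires within the bound, while the sign test exits on the first
step.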
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Rui Sousa @ 2000-09-24 20:20 UTC
To: Ingo Molnar; +Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Ingo Molnar wrote:

Hi,

Did any of these lead to an infinite loop in swap_out()?

>
> the attached vmfixes-B2 patch adds the following fixes/cleanups:
>

Rui Sousa
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-24 20:24 UTC
To: Ingo Molnar; +Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, Sep 24, 2000 at 09:34:43PM +0200, Ingo Molnar wrote:
> - do shrink_[d|i]cache_memory() even if !__GFP_IO. This improves balance.

It will deadlock. (that same mistake was deadlocking early 2.2.x too btw)

Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Ingo Molnar @ 2000-09-24 20:26 UTC
To: Andrea Arcangeli; +Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Andrea Arcangeli wrote:

> > - do shrink_[d|i]cache_memory() even if !__GFP_IO. This improves balance.
>
> It will deadlock. (that same mistake was deadlocking early 2.2.x too btw)

where will it deadlock?

	Ingo
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-24 21:12 UTC
To: Ingo Molnar; +Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, Sep 24, 2000 at 10:26:11PM +0200, Ingo Molnar wrote:
> where will it deadlock?

ext2_new_block (or whatever that runs getblk with the superblock lock
acquired)->getblk->GFP->shrink_dcache_memory->prune_dcache->
prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->
put_inode->ext2_discard_prealloc->ext2_free_blocks->lock_super->D

Andrea
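The call chain above can be modelled in a few lines. This is a sketch under
assumptions, not ext2 code: every demo_ name and DEMO_ constant is
hypothetical, the lock is a plain flag standing in for a non-recursive
lock_super(), and the long dput/iput path is collapsed into a single
demo_put_inode() call. Instead of actually hanging, the model records that
the lock holder tried to take the same lock again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; GFP_BUFFER carries no IO bit. */
#define DEMO_GFP_IO     0x04u
#define DEMO_GFP_BUFFER 0x03u

struct demo_super {
	bool locked;         /* non-recursive lock: held or not */
	bool would_deadlock; /* set when the holder tries to lock again */
};

void demo_lock_super(struct demo_super *sb)
{
	if (sb->locked) {
		/* a real non-recursive lock would sleep here forever */
		sb->would_deadlock = true;
		return;
	}
	sb->locked = true;
}

void demo_unlock_super(struct demo_super *sb)
{
	sb->locked = false;
}

/* put_inode -> discard_prealloc -> free_blocks needs the superblock lock. */
void demo_put_inode(struct demo_super *sb)
{
	bool acquired = !sb->locked;

	demo_lock_super(sb);
	if (acquired)
		demo_unlock_super(sb);
}

/* Allocation on behalf of getblk(); pruning the dcache ends in put_inode. */
void demo_alloc(struct demo_super *sb, unsigned int gfp_mask,
		bool shrink_without_io)
{
	if (shrink_without_io || (gfp_mask & DEMO_GFP_IO))
		demo_put_inode(sb);
}

/* Block allocation: takes the lock, then allocates with the lock held. */
bool demo_new_block_deadlocks(unsigned int gfp_mask, bool shrink_without_io)
{
	struct demo_super sb = { false, false };

	demo_lock_super(&sb);
	demo_alloc(&sb, gfp_mask, shrink_without_io);
	demo_unlock_super(&sb);
	return sb.would_deadlock;
}
```

With the shrinkers gated on the IO bit, a GFP_BUFFER allocation under the
lock never reaches demo_put_inode(). With the gate removed, the same
allocation re-enters the lock it already holds.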
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Ingo Molnar @ 2000-09-24 21:12 UTC
To: Andrea Arcangeli; +Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Andrea Arcangeli wrote:

> ext2_new_block (or whatever that runs getblk with the superblock lock
> acquired)->getblk->GFP->shrink_dcache_memory->prune_dcache->
> prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->
> put_inode->ext2_discard_prealloc->ext2_free_blocks->lock_super->D

nasty indeed, sigh. Shouldn't ext2_new_block drop the superblock lock in
places where we might block?

	Ingo
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Stephen C. Tweedie @ 2000-09-24 21:43 UTC
To: Ingo Molnar; +Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel, Stephen Tweedie

Hi,

On Sun, Sep 24, 2000 at 11:12:39PM +0200, Ingo Molnar wrote:
>
> > ext2_new_block (or whatever that runs getblk with the superblock lock
> > acquired)->getblk->GFP->shrink_dcache_memory->prune_dcache->
> > prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->
> > put_inode->ext2_discard_prealloc->ext2_free_blocks->lock_super->D
>
> nasty indeed, sigh. Shouldn't ext2_new_block drop the superblock lock in
> places where we might block?

That's only a valid fix if there are no other filesystems, and no
other places in ext2, where we can call GFP with locks which prevent a
put_inode from being incurred. And with the quota case to consider,
you have to avoid calling GFP with a lock against quota file writes
too (and since quota writes may GFP, this would deadlock if there was
any form of serialisation on the quota file). This feels like rather
a lot of new and interesting deadlocks to be introducing so late in
2.4. :-)

Cheers,
 Stephen
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-24 22:13 UTC
To: Stephen C. Tweedie; +Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, Sep 24, 2000 at 10:43:03PM +0100, Stephen C. Tweedie wrote:
> any form of serialisation on the quota file). This feels like rather
> a lot of new and interesting deadlocks to be introducing so late in
> 2.4. :-)

Agreed.

Andrea
* [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: bert hubert @ 2000-09-24 22:36 UTC
To: Andrea Arcangeli; +Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 12:13:42AM +0200, Andrea Arcangeli wrote:
> On Sun, Sep 24, 2000 at 10:43:03PM +0100, Stephen C. Tweedie wrote:
> > any form of serialisation on the quota file). This feels like rather
> > a lot of new and interesting deadlocks to be introducing so late in
> > 2.4. :-)

True. But they also appear to be found and solved at an impressive rate.
These deadlocks are fatal and don't hide in corners, whereas the previous mm
problems used to be very hard to spot and fix, there not being real
showstoppers, except for abysmal performance. [1]

Since Rik's stuff was merged, the number of eyeball hours devoted to MM have
skyrocketed, whereas the previous incarnations had far smaller audiences.
The patches are barely a week in, and look how much has been improved that
hadn't been found by the people working with Rik.

It's tempting to revert the merge, but let's work at it a bit longer. There
are problems, but we are solving them rapidly and both performance and
design of the new MM are pretty pleasing. Let's not waste this opportunity.

Regards,

bert hubert

[1] bad performance is not often attributed to the Linux kernel - people
    just assume that their problem is hard, because they don't have
    experience with other unixes that might outperform us. We may be
    running Solaris and other unices for reference, but your average user
    isn't.

--
PowerDNS                     Versatile DNS Services
Trilab                       The Technology People
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Andrea Arcangeli @ 2000-09-24 23:41 UTC
To: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 12:36:50AM +0200, bert hubert wrote:
> True. But they also appear to be found and solved at an impressive rate.

We're talking about shrink_[id]cache_memory change. That has _nothing_ to do
with the VM changes that happened anywhere between test8 and test9-pre6.

You were talking about a different thing.

> It's tempting to revert the merge, but let's work at it a bit longer. There

Since you're talking about this I'll soon (as soon as I'll finish some other
thing that is just work in progress) release a classzone against latest's
2.4.x. My approach is _quite_ different from the current VM. Current approach
is very imperfect and it's based solely on aging whereas classzone had hooks
into pagefaults paths and all other map/unmap points to have perfect
accounting of the amount of active/inactive stuff. The mapped pages was never
seen by anything except swap_out, if they was mapped (it's not a if page->age
then move into the active list, with classzone the page was _just_ in the
active list in first place since it was mapped).

I consider the current approach the wrong way to go and for this reason I
prefer to spend time porting/improving classzone.

In classzone the aging exists too but it's _completely_ orthogonal to how
rest of the VM works.

classzone had only 1 bit of aging per page to save mem_map_t array so I'll
extend the aging info from 1 bit to 32bit to make it more biased.

This is my humble opinion at least. I may be wrong. I'll let you know
once I'll have a patch I'm happy with and some real life number to prove my
theory. In the meantime if you want to go back to 2.4.0-test1-ac22-class++
to give it a try under swap to see the difference in the behaviour and
compare (Mike said it's still an order of magnitude faster with his "make
-j30 bzImage" testcase and he's always very reliable in his reports).

Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Stephen C. Tweedie @ 2000-09-25 16:24 UTC
To: Andrea Arcangeli; +Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 01:41:37AM +0200, Andrea Arcangeli wrote:
>
> Since you're talking about this I'll soon (as soon as I'll finish some other
> thing that is just work in progress) release a classzone against latest's
> 2.4.x. My approach is _quite_ different from the current VM. Current approach is
> very imperfect and it's based solely on aging whereas classzone had hooks into
> pagefaults paths and all other map/unmap points to have perfect accounting of
> the amount of active/inactive stuff.

Andrea, I'm not quite sure what you're saying here. Could you be a
bit more specific? The current VM _does_ track the amount of
active/inactive stuff. It does so by keeping separate list of active
and inactive stuff. Accounting on memory pressure on these different
lists is used to generate dynamic targets for how many pages we aim to
have on those lists, so aging/reclaim activity is tuned to the current
memory load.

Your other recent complaint, that newly-swapped pages end up on the
wrong end of the LRU lists and can't be reclaimed without cycling the
rest of the pages in shrink_mmap, is also cured in Rik's code, by
placing pages which are queued for swapout on a different list
altogether. I thought we had managed to agree in Ottawa that such a
cure for the old 2.4 VM was desirable.

> The mapped pages was never seen by
> anything except swap_out, if they was mapped (it's not a if page->age then move
> into the active list, with classzone the page was _just_ in the active list in
> first place since it was mapped).

This really seems to be the biggest difference between the two
approaches right now. The FreeBSD folks believe fervently that one of
the main reasons that their VM rocks is that it ages cache pages and
mapped pages at the same rate. Having both on the same aging list
achieves that. Separating the two raises the question of how to
balance the aging of cache vs. swap in a fair manner.

> In classzone the aging exists too but it's _completely_ orthogonal to how rest
> of the VM works.

Umm, that applies to Rik's stuff too!

> This is my humble opinion at least. I may be wrong. I'll let you know
> once I'll have a patch I'm happy with and some real life number to prove my
> theory.

Good, the best theoretical VM in the world can fall apart instantly on
contact with the real world. :-)

Cheers,
 Stephen
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 16:24 ` Stephen C. Tweedie @ 2000-09-25 17:03 ` Andrea Arcangeli 2000-09-25 18:06 ` Stephen C. Tweedie 0 siblings, 1 reply; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 17:03 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 05:24:42PM +0100, Stephen C. Tweedie wrote: > Your other recent complaint, that newly-swapped pages end up on the > wrong end of the LRU lists and can't be reclaimed without cycling the > rest of the pages in shrink_mmap, is also cured in Rik's code, by > placing pages which are queued for swapout on a different list > altogether. I thought we had managed to agree in Ottawa that such a > cure for the old 2.4 VM was desirable. Yes, I seen and the fix looks ok. It's the deactivate_page call when we swapout the anonymous page. I overlooked it at first, I apologise. > > The mapped pages was never seen by anything except swap_out, if they was > > mapped (it's not a if page->age then move into the active list, with > > classzone the page was _just_ in the active list in first place since it > > was mapped). > > This really seems to be the biggest difference between the two > approaches right now. The FreeBSD folks believe fervently that one of Right. And since you move the page into the active list only once you reach it from the cache recycler and you find it with page->age != 0, you also spend time putting those pages back and forth from those LRU lists while in my approch the mapped pages are never seen from the cycle recylcer and no cycle is spent on them. This mean in a pure fs read test with cache pollution going on, there's _no_way_ that classzone touches or notice _any_ mapped page in its path. I think you can't be faster than classzone here. 
When the cache isn't polluted, adding some more bits of aging will let me know better when it's time to unmap/swapout stuff. (it already works this way, but with literally only 1 bit of aging at the moment) > the main reasons that their VM rocks is that it ages cache pages and > mapped pages at the same rate. Having both on the same aging list > achieves that. Separating the two raises the question of how to > balance the aging of cache vs. swap in a fair manner. I believe increasing the aging in the unmapped cache should take care of that fine. (it was working pretty much fine also with only 1 bit of most frequently used aging plus the LRU order of the list) > > In classzone the aging exists too but it's _completely_ orthogonal to how > > rest of the VM works. > > Umm, that applies to Rik's stuff too! I may be overlooking something but where do you notice when a page gets unmapped from the last mapping and put it back into a place that can be reached from shrink_mmap (or whatever the cache recycler is)? Since no mapped page can in any way be freed by the cache recycler (you need to unmap it first from swap_out at the moment), if you reach those pages from the cache recycler in some way it means you're wasting CPU (I couldn't reach any mapped page from the cache recycler in classzone and in fact the mapped pages weren't linked in any LRU at all to save even more CPU). > Good, the best theoretical VM in the world can fall apart instantly on > contact with the real world. :-) :)) Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 17:03 ` Andrea Arcangeli @ 2000-09-25 18:06 ` Stephen C. Tweedie 2000-09-25 19:32 ` Andrea Arcangeli 0 siblings, 1 reply; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-25 18:06 UTC (permalink / raw) To: Andrea Arcangeli Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel Hi, On Mon, Sep 25, 2000 at 07:03:47PM +0200, Andrea Arcangeli wrote: > > > This really seems to be the biggest difference between the two > > approaches right now. The FreeBSD folks believe fervently that one of > > [ aging cache and mapped pages in the same cycle ] > > Right. > > And since you move the page into the active list only once you reach it from > the cache recycler and you find it with page->age != 0, you also spend time > putting those pages back and forth from those LRU lists while in my approch the > mapped pages are never seen from the cycle recylcer and no cycle is spent on > them. This mean in a pure fs read test with cache pollution going on, there's > _no_way_ that classzone touches or notice _any_ mapped page in its path. The "age==0" pages are basically just "pages we are ready to get rid of right away". The alternative to having that inactive list is to do what we do today --- which is to throw away the pages immediately. Having that extra list is simply giving pages a last chance before evicting them. It allows us to run reliably with fewer physically free pages --- we can reap inactive pages with no IO so those pages are as good as free for most purposes. The alternative to moving pages to the inactive list would be freeing them completely. Moving a page back to the active list from inactive is equivalent to avoiding a disk IO to pull in the page from backing store. It's supposed to be an optimisation to save physically freeing things unless we really, really need to. It is _not_ a transition which recently referenced pages encounter. 
> > the main reasons that their VM rocks is that it ages cache pages and > > mapped pages at the same rate. Having both on the same aging list > > achieves that. Separating the two raises the question of how to > > balance the aging of cache vs. swap in a fair manner. > > I believe increasing the aging in the unmapped cache should take care of that > fine. (it was working pretty much fine also with only 1 bit of most > frequently used aging plus the LRU order of the list) Good. One of the problems we always had in the past, though, was that getting the relative aging of cache vs. vmas was easy if you had a small set of test loads, but it was really, really hard to find a balance that didn't show pathological behaviour in the worst cases. > > > In classzone the aging exists too but it's _completly_ orthogonal to how > > > rest of the VM works. > > > > Umm, that applies to Rik's stuff too! > > I may be overlooking something but where do you notice when a page > gets unmapped from the last mapping and put it back into a place > that can be reached from shrink_mmap (or whatever the cache recycler is)? It doesn't --- that is part of the design. The vm scanner propagates referenced bits to the struct page, so the new shrink_mmap can do its aging based on whether a page has been referenced at all recently, not caring whether the reference was a VM reference or a page cache reference. That is done specifically to address the balance issue between VM and filesystem memory pressure. > Since none mapped page can in any way be freed by the cache recycler > (you need to unmap it first from swap_out at the moment) if you > should reach those pages from the cache recyler someway it means > thus you're wasting CPU (I couldn't reach any mapped page from the > cache recylcer in classzone and infact the mapped pages wasn't > linked in any LRU at all to save even more CPU). That's not how the current VM is supposed to work. 
The cache scanner isn't meant to reclaim pages --- it is meant to update the age information on pages, which is not quite the same job. If it finds pages whose age becomes zero, those are shifted to the inactive list, and once that list is large enough (ie. we have enough freeable pages), it can give up. The inactive list then gets physically freed on demand. The fact that we have a common loop in the VM for updating all age information is central to the design, and requires the cache recycler to pass over all those pages. By doing it that way, rather than from the VM scan, we can avoid one of the really bad properties of the old 2.0 aging code --- it means that for shared pages, we only do the aging once per walk over the pages regardless of how many ptes refer to the page. This avoids the nasty worst-case behaviour of having a recently-referenced page thrown out of memory just because there also happened to be a lot of old, unused references to it too. Cheers, Stephen
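A toy C model of the distinction Stephen draws above, offered only as a sketch: it is not kernel code, and every name in it (toy_page, toy_pte, age_per_pte, propagate_referenced, age_per_page) as well as the +3 / -1 aging steps are assumptions made up for illustration. It contrasts aging a page once per pte with first propagating the accessed bits to the page and then aging each page exactly once.

```c
#include <stddef.h>

/* Toy model only: these structures are hypothetical, not the kernel's. */
struct toy_page {
    int age;        /* larger means "keep in memory longer" */
    int referenced; /* per-page referenced bit, set by the VM scan */
};

struct toy_pte {
    struct toy_page *page;
    int accessed;   /* stands in for the hardware accessed bit */
};

/* Per-pte aging: every mapping votes on the page's age separately, so
 * many idle mappings can outvote one mapping that just used the page. */
static void age_per_pte(struct toy_pte *ptes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct toy_page *p = ptes[i].page;

        if (ptes[i].accessed)
            p->age += 3;
        else if (p->age > 0)
            p->age -= 1;
        ptes[i].accessed = 0;
    }
}

/* Step 1 of per-page aging: the VM scan only moves accessed bits from
 * the ptes to the page; it does not touch the age. */
static void propagate_referenced(struct toy_pte *ptes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ptes[i].accessed) {
            ptes[i].page->referenced = 1;
            ptes[i].accessed = 0;
        }
    }
}

/* Step 2 of per-page aging: each page is aged exactly once per pass,
 * no matter how many ptes point at it. */
static void age_per_page(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (pages[i].referenced)
            pages[i].age += 3;
        else if (pages[i].age > 0)
            pages[i].age -= 1;
        pages[i].referenced = 0;
    }
}
```

With one page shared by 100 toy ptes of which a single one was accessed, age_per_pte drives the age down to zero, while propagate_referenced followed by age_per_page raises it. That is the worst case described in the message above, in miniature.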
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 18:06 ` Stephen C. Tweedie @ 2000-09-25 19:32 ` Andrea Arcangeli 2000-09-25 19:26 ` Rik van Riel 2000-09-25 19:54 ` Stephen C. Tweedie 0 siblings, 2 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 19:32 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 07:06:57PM +0100, Stephen C. Tweedie wrote: > Good. One of the problems we always had in the past, though, was that > getting the relative aging of cache vs. vmas was easy if you had a > small set of test loads, but it was really, really hard to find a > balance that didn't show pathological behaviour in the worst cases. Yep, that's not trivial. > > I may be overlooking something but where do you notice when a page > > gets unmapped from the last mapping and put it back into a place > > that can be reached from shrink_mmap (or whatever the cache recycler is)? > > It doesn't --- that is part of the design. The vm scanner propagates And that's the inferior part of the design IMHO. > referenced bits to the struct page, so the new shrink_mmap can do its > aging based on whether a page has been referenced at all recently, not shrink_mmap couldn't care less about pages that it can't do anything with. When it notices it can't do anything it kicks in swap_out. Having shrink_mmap browse the mapped page cache is as useless as having shrink_mmap browse kernel memory and anonymous pages, as it does in 2.2.x as far as I can tell. It's an algorithm complexity problem and it will waste lots of CPU. Now think about this simple real life example. A 2G RAM machine running an executable image of 1.5G, 300M in shm and 200M in cache. No memory pressure, no need to swap anything anytime. Now the application starts to read heavily from disk some gigabytes of data.
Why should shrink_mmap waste a huge amount of time rolling the 384000 mapped pages back and forth between the LRUs? There's no memory pressure, so there's no need to check those mapped pages at all. Classzone will make a huge difference in numbers in this scenario since it will only work on the 300M of cache (it will never see the 1.5G of mapped .text). > caring whether the reference was a VM reference or a page cache > reference. That is done specifically to address the balance issue > between VM and filesystem memory pressure. I think it's not necessary to pay all that huge overhead only to learn when it's time to kick swap_out in. When we're short on unmapped cache we can just start up swap_out. That apparently works. > That's not how the current VM is supposed to work. The cache scanner > isn't meant to reclaim pages --- it is meant to update the age > information on pages, which is not quite the same job. If it finds So it will be the cache scanner (not the recycler) that will waste the CPU cycles. > pages whose age becomes zero, those are shifted to the inactive list, > and once that list is large enough (ie. we have enough freeable > pages), it can give up. The inactive list then gets physically freed > on demand. So in a long cache-polluting read from disk the inactive list will come back empty all the time and so the cache scanner will have to waste the CPU as described. > The fact that we have a common loop in the VM for updating all age > information is central to the design, and requires the cache recycler > to pass over all those pages. By doing it that way, rather than from That's a waste IMHO. We don't need to pass over the mapped pages. > 2.0 aging code --- it means that for shared pages, we only do the > aging once per walk over the pages regardless of how many ptes refer > to the page.
This avoids the nasty worst-case behaviour of having a You'll still refresh the referenced bit too often for those pages because they're referenced multiple times, so it will still be unfair. That said, it's probably not that bad a property since a very shared library is more justified to live in cache than a page that is mapped only once. Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 19:32 ` Andrea Arcangeli @ 2000-09-25 19:26 ` Rik van Riel 2000-09-25 22:28 ` Andrea Arcangeli 2000-09-25 19:54 ` Stephen C. Tweedie 1 sibling, 1 reply; 243+ messages in thread From: Rik van Riel @ 2000-09-25 19:26 UTC (permalink / raw) To: Andrea Arcangeli Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Andrea Arcangeli wrote: > On Mon, Sep 25, 2000 at 07:06:57PM +0100, Stephen C. Tweedie wrote: > > Good. One of the problems we always had in the past, though, was that > > getting the relative aging of cache vs. vmas was easy if you had a > > small set of test loads, but it was really, really hard to find a > > balance that didn't show pathological behaviour in the worst cases. > > Yep, that's not trivial. It is. Just do physical-page based aging (so you age all the pages in the system the same) and the problem is solved. > > > I may be overlooking something but where do you notice when a page > > > gets unmapped from the last mapping and put it back into a place > > > that can be reached from shrink_mmap (or whatever the cache recycler is)? > > > > It doesn't --- that is part of the design. The vm scanner propagates > > And that's the inferior part of the design IMHO. Indeed, but physical page based aging is a definite 2.5 thing ... ;( regards, Rik -- "What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000 http://www.conectiva.com/ http://www.surriel.com/
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 19:26 ` Rik van Riel @ 2000-09-25 22:28 ` Andrea Arcangeli 2000-09-25 22:26 ` Rik van Riel ` (2 more replies) 0 siblings, 3 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 22:28 UTC (permalink / raw) To: Rik van Riel Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 04:26:17PM -0300, Rik van Riel wrote: > > > It doesn't --- that is part of the design. The vm scanner propagates > > > > And that's the inferior part of the design IMHO. > > Indeed, but physical page based aging is a definite > 2.5 thing ... ;( I'm talking about the fact that if you have a file mmapped in 1.5G of RAM test9 will waste time rolling 384000 pages between LRUs, while classzone won't ever see 1 of those pages until you run low on fs cache. Andrea
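For readers checking the 384000 figure: it works out if "1.5G" is read as 1500 MiB and the page size is 4 KiB (the i386 page size). Both readings are inferred from the number itself, not stated in the thread. A minimal C sketch of the arithmetic, with a helper name (pages_in) invented for illustration:

```c
/* Number of pages needed to cover `mib` mebibytes with pages of
 * `page_kib` kibibytes. Hypothetical helper, for the arithmetic only. */
static unsigned long pages_in(unsigned long mib, unsigned long page_kib)
{
    return mib * 1024UL / page_kib;
}
```

pages_in(1500, 4) gives 384000, the figure quoted in the thread. A full 1.5 GiB (1536 MiB) would be 393216 pages, so the conclusion is the same either way: several hundred thousand page visits per pass over the mapping.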
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 22:28 ` Andrea Arcangeli @ 2000-09-25 22:26 ` Rik van Riel 2000-09-25 22:51 ` Andrea Arcangeli 2000-09-25 22:30 ` Linus Torvalds 2000-09-25 22:30 ` Juan J. Quintela 2 siblings, 1 reply; 243+ messages in thread From: Rik van Riel @ 2000-09-25 22:26 UTC (permalink / raw) To: Andrea Arcangeli Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel On Tue, 26 Sep 2000, Andrea Arcangeli wrote: > On Mon, Sep 25, 2000 at 04:26:17PM -0300, Rik van Riel wrote: > > > > It doesn't --- that is part of the design. The vm scanner propagates > > > > > > And that's the inferior part of the design IMHO. > > > > Indeed, but physical page based aging is a definite > > 2.5 thing ... ;( > > I'm talking about the fact that if you have a file mmapped in > 1.5G of RAM test9 will waste time rolling between LRUs 384000 > pages, while classzone won't ever see 1 of those pages until you > run low on fs cache. IMHO this is a minor issue because: 1) you need to do page replacement with shared pages right 2) you don't /want/ to run low on fs cache, you want to have a good balance between the cache(s) and the processes OTOH, if you have a way to keep fair page aging and fix the CPU time issue at the same time, I'd love to see it. regards, Rik -- "What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000 http://www.conectiva.com/ http://www.surriel.com/
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 22:26 ` Rik van Riel @ 2000-09-25 22:51 ` Andrea Arcangeli 0 siblings, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 22:51 UTC (permalink / raw) To: Rik van Riel Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 07:26:56PM -0300, Rik van Riel wrote: > IMHO this is a minor issue because: I don't think it's a minor issue. If you don't have a reschedule point in your equivalent of shrink_mmap and this 1.5G happens to be consecutive in the lru order (quite probable if it's been paged in at a fast rate) then you may even hang in interruptible mode for seconds as soon as somebody starts reading from disk. 2.4.x has to scale to dozens of gigabytes of RAM as there are archs supporting that amount of RAM. > 2) you don't /want/ to run low on fs cache, you want So I can't read more than the size that the fs cache can take? I must be allowed to do that (that's 200 Mbyte of RAM, which can be more than enough if the server mainly generates pollution anyway). Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 22:28 ` Andrea Arcangeli 2000-09-25 22:26 ` Rik van Riel @ 2000-09-25 22:30 ` Linus Torvalds 2000-09-25 23:03 ` Andrea Arcangeli 2000-09-25 22:30 ` Juan J. Quintela 2 siblings, 1 reply; 243+ messages in thread From: Linus Torvalds @ 2000-09-25 22:30 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Stephen C. Tweedie, Ingo Molnar, Roger Larsson, MM mailing list, linux-kernel On Tue, 26 Sep 2000, Andrea Arcangeli wrote: > > I'm talking about the fact that if you have a file mmapped in 1.5G of RAM > test9 will waste time rolling between LRUs 384000 pages, while classzone > won't ever see 1 of those pages until you run low on fs cache. What drugs are you on? Nobody looks at the LRU's until the system is low on memory. Sure, there's some background activity, but what are you talking about? It's only when you're low on memory that _either_ approach starts looking at the LRU list. Linus
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 22:30 ` Linus Torvalds @ 2000-09-25 23:03 ` Andrea Arcangeli 2000-09-25 23:18 ` Linus Torvalds 0 siblings, 1 reply; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 23:03 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Stephen C. Tweedie, Ingo Molnar, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 03:30:10PM -0700, Linus Torvalds wrote: > On Tue, 26 Sep 2000, Andrea Arcangeli wrote: > > > > I'm talking about the fact that if you have a file mmapped in 1.5G of RAM > > test9 will waste time rolling between LRUs 384000 pages, while classzone > > won't ever see 1 of those pages until you run low on fs cache. > > What drugs are you on? Nobody looks at the LRU's until the system is low > on memory. Sure, there's some background activity, but what are you The system is low on memory when you run `free` and you see a value < freepages_high*PAGE_SIZE in the "free" column first row. > talking about? It's only when you're low on memory that _either_ approach > starts looking at the LRU list. The machine will run low on memory as soon as I read 200mbyte from disk. Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 23:03 ` Andrea Arcangeli @ 2000-09-25 23:18 ` Linus Torvalds 2000-09-26 0:32 ` Andrea Arcangeli 0 siblings, 1 reply; 243+ messages in thread From: Linus Torvalds @ 2000-09-25 23:18 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Stephen C. Tweedie, Ingo Molnar, Roger Larsson, MM mailing list, linux-kernel On Tue, 26 Sep 2000, Andrea Arcangeli wrote: > > The machine will run low on memory as soon as I read 200mbyte from disk. So? Yes, at that point we'll do the LRU dance. Then we won't be low on memory any more, and we won't do the LRU dance any more. What's the magic in zoneinfo that makes it not have to do the same thing? Linus
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 23:18 ` Linus Torvalds @ 2000-09-26 0:32 ` Andrea Arcangeli 0 siblings, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-26 0:32 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Stephen C. Tweedie, Ingo Molnar, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 04:18:13PM -0700, Linus Torvalds wrote: > > > On Tue, 26 Sep 2000, Andrea Arcangeli wrote: > > > > The machine will run low on memory as soon as I read 200mbyte from disk. > > So? > > Yes, at that point we'll do the LRU dance. Then we won't be low on memory > any more, and we won't do the LRU dance any more. What's the magic in We'll run low on memory again as soon as we read the next page from disk and so very soon we'll have to roll around all the 1.5G private mapping again. (the program has a file working set larger than 200M) If you want to see some numbers I can produce them. The testcase only needs to do a: truncate(1.5G); mmap(1.5G MAP_PRIVATE); fault in read mode into the mapped 1.5G; measure how long it takes to read N gigabytes from disk. > zoneinfo that makes it not have to do the same thing? The name "classzone" is misleading. The zoneinfo change is not relevant to this case (it started only with the zoneinfo change, that's why it's still called so). What is relevant to this case is how the LRUs have been restructured. To put it simply: as soon as somebody faults in a page from the page cache I remove the page from the LRU. Then at munmap time (zap_page_range) the page is reinserted into the LRU. This avoids shrink_mmap wasting time in the mapped regions, which shrink_mmap can't do anything about anyway. This means that under cache pollution there's not 1 cycle spent browsing those mapped pages, and I know when it's time to swap out as a function of the age of the fs cache (so the system is very efficient during cache pollution, this way the example performs the same as not having any mapping in memory).
The case without memory pressure (where the working set fits in cache) is surely just fine of course. When swap_out unmaps a page and puts it back into the LRU I know that the page has not been touched recently and I consider it to have zero age. (actually it's not a big deal since there's literally only 1 bit of age, so this may change in the future by introducing more bits of age information) Of course all the subtle cases, such as shared read-only anonymous pages added to the swap cache, mapped page cache pages with bhs overlapping them, and some other non-obvious issues, are handled correctly. Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 22:28 ` Andrea Arcangeli 2000-09-25 22:26 ` Rik van Riel 2000-09-25 22:30 ` Linus Torvalds @ 2000-09-25 22:30 ` Juan J. Quintela 2000-09-25 23:00 ` Andrea Arcangeli 2 siblings, 1 reply; 243+ messages in thread From: Juan J. Quintela @ 2000-09-25 22:30 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel >>>>> "andrea" == Andrea Arcangeli <andrea@suse.de> writes: Hi andrea> I'm talking about the fact that if you have a file mmapped in 1.5G of RAM andrea> test9 will waste time rolling between LRUs 384000 pages, while classzone andrea> won't ever see 1 of those pages until you run low on fs cache. Which is completely wrong if the program uses _any not completely_ unusual locality of reference. Think twice about that, it is more probable that you need more than 300MB of filesystem cache than that you have an application that references _randomly_ 1.5GB of data. You need to balance that _always_ :(((((( I think that there is no silver bullet here :( Later, Juan. -- In theory, practice and theory are the same, but in practice they are different -- Larry McVoy
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 22:30 ` Juan J. Quintela @ 2000-09-25 23:00 ` Andrea Arcangeli 0 siblings, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 23:00 UTC (permalink / raw) To: Juan J. Quintela Cc: Rik van Riel, Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel On Tue, Sep 26, 2000 at 12:30:28AM +0200, Juan J. Quintela wrote: > Which is completely wrong if the program uses _any not completely_ > unusual locality of reference. Think twice about that, it is more > probable that you need more than 300MB of filesystem cache than that you > have an application that references _randomly_ 1.5GB of data. You need > to balance that _always_ :(((((( The application doesn't reference 1.5GB of data randomly. Assume there's a big executable 2G in size (and yes I know there are) and I run it. After some hours its RSS is 1.5G. Ok? So now this program also shmgets a 300 Mbyte shm segment. Now this program starts reading and writing terabytes of data that wouldn't fit in cache even if there were 300G of RAM (and this is possible too). Or maybe the program itself uses rawio but then at a certain point you use the machine to run a tar somewhere. Now tell me why this program needs more than 200Mbyte of fs cache if the kernel doesn't waste time on the mapped pages (as in classzone). Andrea
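The classzone behaviour referred to here (and described by Andrea elsewhere in this thread: a page leaves the LRU when it is first mapped and goes back on it when the last mapping is torn down) can be modelled with a toy list in C. This is a sketch under loud assumptions: the names cz_page, cz_lru, cz_map, cz_unmap and cz_scan are invented for illustration and do not come from the classzone patch, and the real code deals with locking, swap cache and buffer heads that this model ignores.

```c
#include <stddef.h>

/* Toy page: LRU links are NULL while the page is off the list. */
struct cz_page {
    struct cz_page *prev, *next;
    int mapcount; /* number of mappings pointing at this page */
};

/* Circular doubly linked list with a sentinel head. */
struct cz_lru {
    struct cz_page head;
};

static void cz_lru_init(struct cz_lru *l)
{
    l->head.prev = l->head.next = &l->head;
    l->head.mapcount = 0;
}

static void cz_lru_add(struct cz_lru *l, struct cz_page *p)
{
    p->next = l->head.next;
    p->prev = &l->head;
    l->head.next->prev = p;
    l->head.next = p;
}

static void cz_lru_del(struct cz_page *p)
{
    p->prev->next = p->next;
    p->next->prev = p->prev;
    p->prev = p->next = NULL;
}

/* First mapping of a page takes it off the LRU, so the cache scanner
 * never has to look at it while it is mapped. */
static void cz_map(struct cz_page *p)
{
    if (p->mapcount++ == 0 && p->next != NULL)
        cz_lru_del(p);
}

/* Dropping the last mapping puts the page back where the cache scanner
 * can reach it. */
static void cz_unmap(struct cz_lru *l, struct cz_page *p)
{
    if (p->mapcount > 0 && --p->mapcount == 0)
        cz_lru_add(l, p);
}

/* Stand-in for the cache scanner: returns how many pages one pass over
 * the LRU has to visit. */
static size_t cz_scan(const struct cz_lru *l)
{
    size_t n = 0;
    for (const struct cz_page *p = l->head.next; p != &l->head; p = p->next)
        n++;
    return n;
}
```

In this model the cost of a scan is proportional to the number of unmapped pages only, which is the CPU saving argued for above. What the model cannot show is the other side of the argument in the thread: whether mapped and unmapped pages still get aged fairly relative to each other.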
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 19:32 ` Andrea Arcangeli 2000-09-25 19:26 ` Rik van Riel @ 2000-09-25 19:54 ` Stephen C. Tweedie 2000-09-25 22:44 ` Andrea Arcangeli 2000-09-26 6:54 ` Christoph Rohland 1 sibling, 2 replies; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-25 19:54 UTC (permalink / raw) To: Andrea Arcangeli Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel Hi, On Mon, Sep 25, 2000 at 09:32:42PM +0200, Andrea Arcangeli wrote: > Having shrink_mmap that browse the mapped page cache is useless > as having shrink_mmap browsing kernel memory and anonymous pages > as it does in 2.2.x as far I can tell. It's an algorithm > complexity problem and it will waste lots of CPU. It's a compromise between CPU cost and Getting It Right. Ignoring the mmap is not a good solution either. > Now think this simple real life example. A 2G RAM machine running an executable > image of 1.5G, 300M in shm and 200M in cache. OK, and here's another simple real life example. A 2GB RAM machine running something like Oracle with a hundred client processes all shm-mapping the same shared memory segment. Oh, and you're also doing lots of file IO. How on earth do you decide what to swap and what to page out in this sort of scenario, where basically the whole of memory is data cache, some of which is mapped and some of which is not? If you don't separate out the propagation of referenced bits from the actual page aging, then every time you pass over the whole VM working set, you're likely to find a handful of live references to some of the shared memory, and a hundred or so references that haven't done anything since last time. Anything that only ages per-pte, not per-page, is simply going to die horribly under such load, and any imbalance between pure filesystem cache and VM pressure will be magnified to the point where one dominates. 
Hence my observation that it's really easy to find special cases where certain optimisations make a ton of sense, but you often lose balance in the process. Cheers, Stephen
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 19:54 ` Stephen C. Tweedie @ 2000-09-25 22:44 ` Andrea Arcangeli 2000-09-25 22:42 ` Rik van Riel 2000-09-26 6:54 ` Christoph Rohland 1 sibling, 1 reply; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 22:44 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 08:54:57PM +0100, Stephen C. Tweedie wrote: > OK, and here's another simple real life example. A 2GB RAM machine > running something like Oracle with a hundred client processes all > shm-mapping the same shared memory segment. Oracle keeps its SHM locked, and it will never run on a machine without enough memory. > Oh, and you're also doing lots of file IO. How on earth do you decide > what to swap and what to page out in this sort of scenario, where > basically the whole of memory is data cache, some of which is mapped > and some of which is not? As I said in the last email, aging on the cache is supposed to do that. Wasting CPU and increasing the complexity of the algorithm is a price that I won't pay just to get the information on when it's time to call swap_out() again. If the cache has no age it means I'd better throw it out instead of swapping/unmapping out stuff, simple? > anything since last time. Anything that only ages per-pte, not > per-page, is simply going to die horribly under such load, and any The aging on the fs cache is done per-page. The per-pte issue happens once we've just taken the difficult decision (that it was time to swap out), and you have the same problem because you don't know the chain of ptes that point to the physical page (so you refresh the referenced bit more often). Once we have the chain of ptes pointing to the page, classzone will only need a real LRU for the mapped pages to use instead of walking pagetables.
Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 22:44 ` Andrea Arcangeli @ 2000-09-25 22:42 ` Rik van Riel 0 siblings, 0 replies; 243+ messages in thread From: Rik van Riel @ 2000-09-25 22:42 UTC (permalink / raw) To: Andrea Arcangeli Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel On Tue, 26 Sep 2000, Andrea Arcangeli wrote: > On Mon, Sep 25, 2000 at 08:54:57PM +0100, Stephen C. Tweedie wrote: > > basically the whole of memory is data cache, some of which is mapped > > and some of which is not? > > As I said in the last email, aging on the cache is supposed to do that. > > Wasting CPU and increasing the complexity of the algorithm is a price > that I won't pay just to get the information on when it's time > to call swap_out() again. You must be joking. Page replacement should be tuned to do good page replacement, not just to be easy on the CPU. (though a heavily thrashing system /is/ easy on the cpu, I'll have to admit that) > If the cache has no age it means I'd better throw it out instead > of swapping/unmapping out stuff, simple? Simple, yes. But completely BOGUS if you don't age the cache and the mapped pages at the same rate! If I age your pages twice as much as my pages, is it still only fair that your pages will be swapped out first? ;) > > anything since last time. Anything that only ages per-pte, not > > per-page, is simply going to die horribly under such load, and any > > The aging on the fs cache is done per-page. And the same should be done for other pages as well. If you don't do that, you'll have big problems keeping page replacement balanced and making the system work well under various loads. regards, Rik -- "What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000 http://www.conectiva.com/ http://www.surriel.com/
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Christoph Rohland @ 2000-09-26 6:54 UTC
To: Stephen C. Tweedie
Cc: Andrea Arcangeli, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

"Stephen C. Tweedie" <sct@redhat.com> writes:
> Hi,
>
> On Mon, Sep 25, 2000 at 09:32:42PM +0200, Andrea Arcangeli wrote:
>
> > Having shrink_mmap that browses the mapped page cache is as useless
> > as having shrink_mmap browsing kernel memory and anonymous pages
> > as it does in 2.2.x as far as I can tell. It's an algorithm
> > complexity problem and it will waste lots of CPU.
>
> It's a compromise between CPU cost and Getting It Right. Ignoring the
> mmap is not a good solution either.
>
> > Now think this simple real life example. A 2G RAM machine running
> > an executable image of 1.5G, 300M in shm and 200M in cache.

Hey that's ridiculous: 1.5G executable image and 300M shm? Take it vice-versa and you are approaching real life.

> OK, and here's another simple real life example. A 2GB RAM machine
> running something like Oracle with a hundred client processes all
> shm-mapping the same shared memory segment.

That sounds much more realistic.

> Oh, and you're also doing lots of file IO. How on earth do you decide
> what to swap and what to page out in this sort of scenario, where
> basically the whole of memory is data cache, some of which is mapped
> and some of which is not?
>
> If you don't separate out the propagation of referenced bits from the
> actual page aging, then every time you pass over the whole VM working
> set, you're likely to find a handful of live references to some of the
> shared memory, and a hundred or so references that haven't done
> anything since last time. Anything that only ages per-pte, not
> per-page, is simply going to die horribly under such load, and any
> imbalance between pure filesystem cache and VM pressure will be
> magnified to the point where one dominates.

Yes, and that's why I stress most of the patch levels with my ipctst program on a highmem machine. It simulates a load like this: a lot of processes attached to shm segments and thrashing them. There were very few kernels which really worked with that load without totally breaking or killing processes _way_ too early.

> Hence my observation that it's really easy to find special cases where
> certain optimisations make a ton of sense, but you often lose balance
> in the process.

O.K. My test case is such a special case, but it is related to real-life transactional load on a high-end server.

Greetings
		Christoph
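[Editorial sketch] Stephen's point about per-pte versus per-page aging can be illustrated with a toy model. This is only a sketch under stated assumptions, not kernel code: the constants `AGE_ADVANCE` and `AGE_MAX` and all function names are hypothetical, and "age up by a constant, age down by halving" is just one plausible aging rule. One shared page is mapped by a hundred PTEs, of which only one was referenced since the last pass.

```python
"""Toy model of one shared page mapped by many PTEs (not kernel code)."""

AGE_ADVANCE = 3   # hypothetical bump applied when a reference is seen
AGE_MAX = 64      # hypothetical upper bound on a page's age


def age_up(age):
    """A referenced page gets older (more valuable), up to a cap."""
    return min(age + AGE_ADVANCE, AGE_MAX)


def age_down(age):
    """An unreferenced page decays by halving."""
    return age // 2


def per_pte_aging(age, referenced_bits):
    """Failure mode: the page is aged once for every PTE that maps it."""
    for referenced in referenced_bits:
        age = age_up(age) if referenced else age_down(age)
    return age


def per_page_aging(age, referenced_bits):
    """Fold all referenced bits into the page first, then age it once."""
    return age_up(age) if any(referenced_bits) else age_down(age)
```

With `bits = [True] + [False] * 99`, the per-pte variant drives a page that is actively used by one process down to age 0, while the per-page variant treats it as referenced and raises its age.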
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Andrea Arcangeli @ 2000-09-26 14:05 UTC
To: Christoph Rohland
Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Tue, Sep 26, 2000 at 08:54:23AM +0200, Christoph Rohland wrote:
> "Stephen C. Tweedie" <sct@redhat.com> writes:
>
> > Hi,
> >
> > On Mon, Sep 25, 2000 at 09:32:42PM +0200, Andrea Arcangeli wrote:
> >
> > > Having shrink_mmap that browses the mapped page cache is as useless
> > > as having shrink_mmap browsing kernel memory and anonymous pages
> > > as it does in 2.2.x as far as I can tell. It's an algorithm
> > > complexity problem and it will waste lots of CPU.
> >
> > It's a compromise between CPU cost and Getting It Right. Ignoring the
> > mmap is not a good solution either.
> >
> > > Now think this simple real life example. A 2G RAM machine running
> > > an executable image of 1.5G, 300M in shm and 200M in cache.
>
> Hey that's ridiculous: 1.5G executable image and 300M shm? Take it
> vice-versa and you are approaching real life.

Could you tell me what's wrong in having an app with a 1.5G mapped executable (or a tiny executable but with a 1.5G shared/private file mapping if you prefer), 300M of shm (or 300M of anonymous memory if you prefer) and 200M as filesystem cache?

The application has a misc I/O load that in some part will run out of the working set, what's wrong with this?

What's ridiculous? Please elaborate.

To emulate that workload we only need to mmap(1.5G, MAP_PRIVATE or MAP_SHARED), fault into it, and run bonnie.

Andrea
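[Editorial sketch] The "mmap, fault into it" half of Andrea's recipe could look roughly like the following. This is a minimal, Unix-only sketch under assumptions: the helper name `fault_in_mapping` is hypothetical, a sparse temporary file stands in for the large executable or data file, and the disk benchmark (bonnie) that would run alongside is not shown.

```python
import mmap
import tempfile


def fault_in_mapping(length, shared=True):
    """Map `length` bytes of a temporary file and read one byte per page.

    Reading a byte from each page forces a page fault, so the whole
    mapping ends up backed by page-cache pages. Returns the number of
    pages touched. Unix-only, because it relies on the `flags` argument.
    """
    flags = mmap.MAP_SHARED if shared else mmap.MAP_PRIVATE
    with tempfile.TemporaryFile() as backing:
        backing.truncate(length)  # sparse file of the requested size
        with mmap.mmap(backing.fileno(), length, flags=flags) as mapping:
            touched = 0
            for offset in range(0, length, mmap.PAGESIZE):
                _ = mapping[offset]  # read fault on this page
                touched += 1
            return touched
```

A real run would pass something like `1536 * 1024 * 1024` for `length`, keep the mapping alive, and start the file I/O load in parallel; the small sizes used for checking the helper only confirm that every page is visited.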
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Christoph Rohland @ 2000-09-26 16:20 UTC
To: Andrea Arcangeli
Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:
> Could you tell me what's wrong in having an app with a 1.5G mapped executable
> (or a tiny executable but with a 1.5G shared/private file mapping if you
> prefer),

O.K. that sounds more reasonable. I was reading image as program text... and a 1.5GB program text is something I have never seen (and hopefully will never see :-)

> 300M of shm (or 300M of anonymous memory if you prefer) and 200M as
> filesystem cache?

I don't really see a reason for fs cache in the application. I think that parallel applications tend to either share mostly all or nothing, but I may be wrong here.

> The application has a misc I/O load that in some part will run out
> of the working set, what's wrong with this?
>
> What's ridiculous? Please elaborate.

I think we fixed this misreading.

But still IMHO you underestimate the importance of shared memory for a lot of applications in the high end. There is not only Oracle out there and most of the shared memory is _not_ locked.

Greetings
		Christoph
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Andrea Arcangeli @ 2000-09-26 17:10 UTC
To: Christoph Rohland
Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Tue, Sep 26, 2000 at 06:20:47PM +0200, Christoph Rohland wrote:
> O.K. that sounds more reasonable. I was reading image as program
> text... and a 1.5GB program text is something I have never seen (and
> hopefully will never see :-)

:)

From the point of view of the complexity of the shrink_mmap algorithm, a 1.5GB .text is completely equal to a 1.5GB MAP_SHARED or a 1.5GB MAP_PRIVATE mapping (it doesn't need to be the .text of the program).

That said, I have heard of real world programs that have a .text larger than 2G (that's why I wasn't very careful to say that it doesn't need to be a 1.5G .text but that any other page-cache mapping that large would have the same effect).

> > 300M of shm (or 300M of anonymous memory if you prefer) and 200M as
> > filesystem cache?
>
> I don't really see a reason for fs cache in the application. I think

In fact the application can as well use rawio.

> that parallel applications tend to either share mostly all or nothing,
> but I may be wrong here.

And then at some point you'll run `find /` or `tar mylatestsources.tar.gz sources/` or updatedb is started up or whatever. And you don't need more than 200M of fs cache for that purpose.

Think of the O(N) complexity that we had in si_meminfo (guess why in 2.4.x `free` says 0 in the shared field). It was making it impossible to run `xosview` on a 10G box (it was stalling for seconds). And si_meminfo was only counting 1 field, not rolling pages around the lru, grabbing locks and dirtying cachelines.

That's a plain complexity/scalability issue as far as I can tell, and classzone solves it completely. When you run tar with your 1.5G shared mapping in memory and you happen to hit the low watermark and you need to recycle some bytes of old cache, you'll run as fast as without the mapping in memory. There will be zero difference in performance. (just like now if you run `free` on a 10G machine it runs as fast as on a 4 MB machine)

> I think we fixed this misreading.

I should have explained things more carefully from the start, sorry.

> But still IMHO you underestimate the importance of shared memory for a
> lot of applications in the high end. There is not only Oracle out
> there and most of the shared memory is _not_ locked.

Well, I wasn't claiming that this optimization is very sensitive for DB applications (at least for DBs that don't use quite big file mappings).

I know Oracle (and most other DBs) are very shm intensive. However the fact you say the shm is not locked in memory is really news to me. I really remembered that the shm was locked.

I also don't see the point of keeping data cache in the swap. Swap involves SMP tlb flushes and all the other big overhead that you could avoid by properly sizing the shm cache and keeping it locked.

Note: having very fast shm swapout/swapin is a very good thing (in fact we introduced readaround of the swapin and moved shm swapout/swapin locking to the swap cache in early 2.3.x exactly for that reason). But I just don't think DBMS needed that.

Note: simulations are a completely different thing (their evolution is not predictable). Simulations can sure thrash shm into swap anytime (but Oracle shouldn't do that AFAIK).

Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Christoph Rohland @ 2000-09-27 8:11 UTC
To: Andrea Arcangeli
Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:
> That said, I have heard of real world programs that have a .text larger than 2G

=:-O

> I know Oracle (and most other DBs) are very shm intensive. However
> the fact you say the shm is not locked in memory is really news to
> me. I really remembered that the shm was locked.

I just checked one oracle system and it did not lock the memory. And I do not think that the other databases do it by default either.

And our application server definitely doesn't do it. And it uses loads of shared memory. We will soon have application servers with 16 GB memory at customer sites which will have the whole memory in shmfs.

> I also don't see the point of keeping data cache in the swap. Swap
> involves SMP tlb flushes and all the other big overhead that you
> could avoid by properly sizing the shm cache and keeping it locked.
>
> Note: having very fast shm swapout/swapin is a very good thing (in fact
> we introduced readaround of the swapin and moved shm swapout/swapin
> locking to the swap cache in early 2.3.x exactly for that
> reason). But I just don't think DBMS needed that.

Nobody should rely on shm swapping for productive use. But you have changing/increasing loads on application servers and all of a sudden you run OOM. In this case the system should behave and it is _very_ good to have a smooth behaviour.

Customers with performance problems very often start with too little memory, but they cannot upgrade until this really big job finishes :-(

Another issue about shm swapping is interactive transactions, where some users have very large contexts and go for a coffee before submitting. This memory can be swapped.

Greetings
		Christoph
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Ingo Molnar @ 2000-09-27 8:28 UTC
To: Christoph Rohland
Cc: Andrea Arcangeli, Stephen C. Tweedie, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On 27 Sep 2000, Christoph Rohland wrote:
> Nobody should rely on shm swapping for productive use. But you have
> changing/increasing loads on application servers and all of a sudden
> you run OOM. In this case the system should behave and it is _very_
> good to have a smooth behaviour.

it might make sense even in production use. If there is some calculation that has to be done only once per month, then sure the customer can decide to wait for it a few hours until it swaps itself ready, instead of buying gigs of RAM just to execute this single operation faster. Uncooperative OOM in such cases is a show-stopper. Or are you saying the same thing? :-)

	Ingo
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Christoph Rohland @ 2000-09-27 9:24 UTC
To: mingo
Cc: Andrea Arcangeli, Stephen C. Tweedie, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Ingo Molnar <mingo@elte.hu> writes:
> On 27 Sep 2000, Christoph Rohland wrote:
>
> > Nobody should rely on shm swapping for productive use. But you have
> > changing/increasing loads on application servers and all of a sudden
> > you run OOM. In this case the system should behave and it is _very_
> > good to have a smooth behaviour.
>
> it might make sense even in production use. If there is some calculation
> that has to be done only once per month, then sure the customer can decide
> to wait for it a few hours until it swaps itself ready, instead of buying
> gigs of RAM just to execute this single operation faster. Uncooperative
> OOM in such cases is a show-stopper. Or are you saying the same thing? :-)

That's what I meant with the coffee break. In a big installation somebody is always drinking coffee :-)

You also often have different loads during daytime and nighttime. Swapping buffers out to the swap disk instead of rereading from the database makes a lot of sense for this.

But a single job should never swap. (It works for two months and then next month you get the big escalation and you would love to have hotplug memory.)

So swapping happens in productive use. But nobody should rely on that too much.

And I completely agree that uncooperative OOM is not acceptable.

Greetings
		Christoph
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Andrea Arcangeli @ 2000-09-27 13:56 UTC
To: Christoph Rohland
Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Wed, Sep 27, 2000 at 10:11:43AM +0200, Christoph Rohland wrote:
> I just checked one oracle system and it did not lock the memory. And I

If that memory is used for I/O cache then such memory should be released when the system runs into swap instead of swapping it out too (otherwise it's not cache anymore and it could be slower than re-reading the real data from disk in rawio).

> Customers with performance problems very often start with too little
> memory, but they cannot upgrade until this really big job finishes :-(
>
> Another issue about shm swapping is interactive transactions, where
> some users have very large contexts and go for a coffee before
> submitting. This memory can be swapped.

Agreed, that's why I said shm performance under swap is very important as well (I'm not underestimating it).

But again: if the shm contains I/O cache it should be released and not swapped out. Swapping out shmfs that contains I/O cache would be exactly like swapping out page-cache.

Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Christoph Rohland @ 2000-09-27 16:56 UTC
To: Andrea Arcangeli
Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:
> On Wed, Sep 27, 2000 at 10:11:43AM +0200, Christoph Rohland wrote:
> > I just checked one oracle system and it did not lock the memory. And I
>
> If that memory is used for I/O cache then such memory should be
> released when the system runs into swap instead of swapping it out
> too (otherwise it's not cache anymore and it could be slower than
> re-reading the real data from disk in rawio).

Yes, but how does the application detect that it should free the mem? Also you often have more overhead reading out of a database than having preprocessed data in swap.

> > Customers with performance problems very often start with too little
> > memory, but they cannot upgrade until this really big job finishes :-(
> >
> > Another issue about shm swapping is interactive transactions, where
> > some users have very large contexts and go for a coffee before
> > submitting. This memory can be swapped.
>
> Agreed, that's why I said shm performance under swap is very important
> as well (I'm not underestimating it).

fine :-)

Greetings
		Christoph
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Andrea Arcangeli @ 2000-09-27 17:42 UTC
To: Christoph Rohland
Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Wed, Sep 27, 2000 at 06:56:42PM +0200, Christoph Rohland wrote:
> Yes, but how does the application detect that it should free the mem?

The trivial way is not to detect it: allow the user to select how much memory the application will use as cache, keep it locked, and then don't care (he will have to decrease the size of the shm by hand if he wants to drop some cache). From the OS point of view it's like not having that RAM at all and there will be zero performance difference compared to thrashing into swap without such memory. (on 2.2.x this is not true because of a complexity problem in shrink_mmap that is solved with the real lru in 2.4.x)

The other way is to have the shm cache shrink dynamically by looking at /proc/meminfo and at the aging of its own cache. Again the user should give a minimum and a maximum of shm cache to keep locked in memory. Then you look at "freemem + cache + buffers - active cache" and you can say when you're going to run into swap. Specifically with classzone you'll run into swap when that value is near zero. So when that value is near zero you know it's time to shrink the shm cache dynamically if it has a low age, otherwise the machine will thrash into swap badly and performance will decrease. (you could start shrinking when that value is below some number of megabytes, again configurable)

You should of course poll /proc/meminfo. (/proc/meminfo works in O(1) in 2.4.x so it's just the overhead of a read syscall)

These DBs using rawio really want to substitute part of the kernel cache functionality, so it's quite natural that they also don't want the kernel to play with their caches while they run. They would need some more interaction with the kernel memory balancing (possibly via async signals) to get their shm reclaimed dynamically more cleanly and efficiently, by registering for this functionality (they could get signals when the machine runs into swap and then the DB chooses whether it is worth releasing some locked cache after looking at /proc/meminfo and the working set of its own caches).

> Also you often have more overhead reading out of a database than
> having preprocessed data in swap.

Yes I see, it of course depends on the kind of cache (if it's very near to the on-disk format then more probably it shouldn't be swapped out).

Andrea
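[Editorial sketch] The polling scheme Andrea describes, computing "freemem + cache + buffers - active cache" from /proc/meminfo and shrinking the application cache when that value gets small, might be sketched as below. Several things here are assumptions rather than anything stated in the thread: the function names are hypothetical, the field names are those of a present-day /proc/meminfo, and mapping "active cache" to the `Active(file)` field is a guess (a 2.4-era kernel exposed different fields).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most values are reported in kB; the unit suffix is ignored.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def reclaimable_headroom_kb(fields, active_key="Active(file)"):
    """freemem + cache + buffers - active cache (active_key is an assumption)."""
    return (fields.get("MemFree", 0) + fields.get("Cached", 0)
            + fields.get("Buffers", 0) - fields.get(active_key, 0))


def should_shrink_cache(fields, threshold_kb):
    """True when the headroom has fallen below the configured threshold."""
    return reclaimable_headroom_kb(fields) < threshold_kb
```

A daemon would call `parse_meminfo(open("/proc/meminfo").read())` periodically and release part of its own locked cache when `should_shrink_cache` returns true; the threshold corresponds to the user-configurable number of megabytes mentioned above.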
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Erik Andersen @ 2000-09-27 18:25 UTC
To: Andrea Arcangeli
Cc: MM mailing list, linux-kernel

On Wed Sep 27, 2000 at 07:42:00PM +0200, Andrea Arcangeli wrote:
>
> You should of course poll the /proc/meminfo. (/proc/meminfo works in O(1) in
> 2.4.x so it's just the overhead of a read syscall)

Or sysinfo(2). Same thing...

 -Erik

--
Erik B. Andersen   email:  andersee@debian.org
--This message was written using 73% post-consumer electrons--
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Andrea Arcangeli @ 2000-09-27 18:55 UTC
To: MM mailing list, linux-kernel

On Wed, Sep 27, 2000 at 12:25:44PM -0600, Erik Andersen wrote:
> Or sysinfo(2). Same thing...

The sysinfo structure doesn't export the number of active pages in the system.

Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Rik van Riel @ 2000-09-28 10:08 UTC
To: Andrea Arcangeli
Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Wed, 27 Sep 2000, Andrea Arcangeli wrote:
> On Wed, Sep 27, 2000 at 10:11:43AM +0200, Christoph Rohland wrote:
> > I just checked one oracle system and it did not lock the memory. And I
>
> If that memory is used for I/O cache then such memory should be
> released when the system runs into swap instead of swapping it
> out too (otherwise it's not cache anymore and it could be slower
> than re-reading the real data from disk in rawio).

It could also be faster. If the database spent half an hour gathering pieces of data from all over the database, it might be faster to keep it in one place in swap so it can be read in again in one swoop. (I had an interesting talk about this with a database person while at OLS)

But that's not the point. If your assertion is true, then the database will probably be using an mlock()ed SHM region and taking care of this itself. But this is not something the OS should prescribe to the application.

If the OS finds that certain SHM pages are used far less than the pages in the I/O cache, then those SHM pages should be swapped out. The system's job is to keep the most used pages of data in memory to minimise the number of page faults happening. Trying to outsmart the application shouldn't (IMHO of course) be part of that job...

> > Customers with performance problems very often start with too little
> > memory, but they cannot upgrade until this really big job finishes :-(
> >
> > Another issue about shm swapping is interactive transactions, where
> > some users have very large contexts and go for a coffee before
> > submitting. This memory can be swapped.
>
> Agreed, that's why I said shm performance under swap is very important
> as well (I'm not underestimating it).
>
> But again: if the shm contains I/O cache it should be released
> and not swapped out. Swapping out shmfs that contains I/O cache
> would be exactly like swapping out page-cache.

The OS has no business knowing what's inside that SHM page.

IF the shm contains I/O cache, maybe you're right. However, until you know that this is the case, optimising for that situation just doesn't make any sense.

(unless the SHM users tell you that this is the normal way they use SHM ... but as Christoph just told us, it isn't)

regards,

Rik
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Rik van Riel @ 2000-09-28 11:16 UTC
To: Andrea Arcangeli
Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Thu, 28 Sep 2000, Rik van Riel wrote:
> On Wed, 27 Sep 2000, Andrea Arcangeli wrote:
> > But again: if the shm contains I/O cache it should be released
> > and not swapped out. Swapping out shmfs that contains I/O cache
> > would be exactly like swapping out page-cache.
>
> The OS has no business knowing what's inside that SHM page.

Hmm, now that I've woken up maybe I should formulate this in a different way.

Andrea, I have the strong impression that your idea of memory balancing is based on the idea that the OS should out-smart the application instead of looking at the usage pattern of the pages in memory.

This is fundamentally different from the idea that the OS should make decisions based on the observed usage patterns of the pages in question, instead of making presumptions based on what kind of cache the page is in.

I've been away for 10 days and have been sitting on a bus all last night so my judgement may be off. I'd certainly like to hear I'm wrong ;)

regards,

Rik
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Andrea Arcangeli @ 2000-09-28 14:52 UTC
To: Rik van Riel
Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Thu, Sep 28, 2000 at 08:16:32AM -0300, Rik van Riel wrote:
> Andrea, I have the strong impression that your idea of
> memory balancing is based on the idea that the OS should
> out-smart the application instead of looking at the usage
> pattern of the pages in memory.

Not sure what you mean by out-smart.

My only point is that the only thing the OS can actually do with such shm is swap it out. If that SHM is not supposed to be swapped out and if the OS I/O cache has more aging than the shm cache, then the OS should tell the DBMS that it's time to shrink some shm pages by freeing them.

> of the pages in question, instead of making presumptions
> based on what kind of cache the page is in.

For the mapped pages we never make presumptions. We always check the accessed bit and that's the most reliable info to know if the page has been accessed recently (it is set by CPU accesses through the pte, not only during page faults or cache hits). With the current design pages mapped multiple times will be overaged a bit, but this can't be fixed until we make a page->pte reverse lookup...

Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
From: Rik van Riel @ 2000-09-29 14:39 UTC
To: Andrea Arcangeli
Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Thu, 28 Sep 2000, Andrea Arcangeli wrote:
> On Thu, Sep 28, 2000 at 08:16:32AM -0300, Rik van Riel wrote:
> > Andrea, I have the strong impression that your idea of
> > memory balancing is based on the idea that the OS should
> > out-smart the application instead of looking at the usage
> > pattern of the pages in memory.
>
> Not sure what you mean by out-smart.
>
> My only point is that the only thing the OS can actually do with
> such shm is swap it out. If that SHM is not supposed to be swapped
> out and if the OS I/O cache has more aging than the shm cache, then
> the OS should tell the DBMS that it's time to shrink some shm pages
> by freeing them.

OK, good to see that we agree on the fact that we should age and swapout all pages equally aggressively.

> > of the pages in question, instead of making presumptions
> > based on what kind of cache the page is in.
>
> For the mapped pages we never make presumptions. We always check
> the accessed bit and that's the most reliable info to know if
> the page has been accessed recently (it is set by CPU accesses
> through the pte, not only during page faults or cache hits).
> With the current design pages mapped multiple times will be
> overaged a bit, but this can't be fixed until we make a page->pte
> reverse lookup...

Indeed.

regards,

Rik
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-29 14:39 ` Rik van Riel @ 2000-09-29 14:55 ` Andrea Arcangeli 2000-09-29 15:40 ` Rik van Riel 0 siblings, 1 reply; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-29 14:55 UTC (permalink / raw) To: Rik van Riel Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Fri, Sep 29, 2000 at 11:39:18AM -0300, Rik van Riel wrote:
> OK, good to see that we agree on the fact that we should age and swap out all pages equally aggressively.

Actually I think we should start looking at the mapped stuff _only_ when the I/O cache aging is relevant. If the I/O cache aging isn't relevant there's no point in looking at the mapped stuff since there's cache pollution going on. It's much less costly to drop a page from the unmapped cache than to play with pagetables, and also having a slow read() is much better than having to fault into the .text areas (because the process is going to be designed in a way that expects read to block so it may do it asynchronously or in a separate thread or whatever). A `cp /dev/zero .` shouldn't swapout/unmap anything.

If the cache is re-used (so if it's useful) that's a completely different issue and in that case unmapping potentially unused stuff is the right thing to do of course.

Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-29 14:55 ` Andrea Arcangeli @ 2000-09-29 15:40 ` Rik van Riel 0 siblings, 0 replies; 243+ messages in thread From: Rik van Riel @ 2000-09-29 15:40 UTC (permalink / raw) To: Andrea Arcangeli Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel On Fri, 29 Sep 2000, Andrea Arcangeli wrote: > On Fri, Sep 29, 2000 at 11:39:18AM -0300, Rik van Riel wrote: > > OK, good to see that we agree on the fact that we > > should age and swapout all pages equally agressively. > > Actually I think we should start looking at the mapped stuff > _only_ when the I/O cache aging is relevant. If the I/O cache > aging isn't relevant there's no point to look at the mapped > stuff since there's cache pollution going on. > If the cache is re-used (so if it's useful) that's completly > different issue and in that case unmapping potentially unused > stuff is the right thing to do of course. This is why I want to do: 1) equal aging of all pages in the system 2) page aging to have properties of both LRU and LFU 3) drop-behind to cope with streaming IO in a good way and maybe: 4) move unmapped pages to the inactive_clean list for immediate reclaiming but put pages which are/were mapped on the inactive_dirty list so we keep it a little bit longer The only way to reliably know if the cache is re-used a lot is by making sure we do the page aging for unmapped and mapped pages the same. If we don't do that, we won't be able to make a sensible comparison between the activity of pages in different places. regards, Rik -- "What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000 http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
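Points 1 and 2 of Rik's list can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the struct, the function names and the constants are invented for illustration and are not taken from any patch in this thread. A referenced page gains age additively, which rewards frequent use (the LFU side), and an unreferenced page has its age halved, so old activity fades quickly (the LRU side). The aging step never looks at whether the page is mapped, which is the "equal aging" part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative constants, chosen for this sketch only. */
enum { AGE_START = 2, AGE_ADVANCE = 3, AGE_MAX = 64 };

/* Hypothetical stand-in for a page descriptor. */
struct sketch_page {
    unsigned age;     /* larger means used more often and more recently */
    bool referenced;  /* stands in for the hardware accessed bit */
    bool mapped;      /* deliberately ignored by the aging step */
};

/* One aging step for one page: additive increase, halving decay. */
static void age_page(struct sketch_page *p)
{
    if (p->referenced) {
        p->age += AGE_ADVANCE;
        if (p->age > AGE_MAX)
            p->age = AGE_MAX;
        p->referenced = false;  /* clear the bit so the next pass starts fresh */
    } else {
        p->age /= 2;
    }
}

/* Apply the same step to every page, mapped or not. */
static void age_all(struct sketch_page *pages, size_t n)
{
    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        age_page(&pages[i]);
}

/* A page whose age has decayed to zero is a candidate for reclaim. */
static bool is_reclaim_candidate(const struct sketch_page *p)
{
    return p->age == 0;
}
```

Because mapped and unmapped pages go through the same step, their ages are directly comparable, which is the comparison Rik argues cannot be made if the two kinds of page are aged differently.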
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-28 10:08 ` Rik van Riel 2000-09-28 11:16 ` Rik van Riel @ 2000-09-28 11:31 ` Ingo Molnar 2000-09-28 14:54 ` Andrea Arcangeli 2000-09-28 14:31 ` Andrea Arcangeli 2 siblings, 1 reply; 243+ messages in thread From: Ingo Molnar @ 2000-09-28 11:31 UTC (permalink / raw) To: Rik van Riel Cc: Andrea Arcangeli, Christoph Rohland, Stephen C. Tweedie, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel On Thu, 28 Sep 2000, Rik van Riel wrote: > The OS has no business knowing what's inside that SHM page. exactly. > IF the shm contains I/O cache, maybe you're right. However, > until you know that this is the case, optimising for that > situation just doesn't make any sense. if the shm contains raw I/O data, then thats flawed application design - an mmap()-ed file should be used instead. Shm is equivalent to shared anonymous pages. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-28 11:31 ` Ingo Molnar @ 2000-09-28 14:54 ` Andrea Arcangeli 2000-09-28 15:13 ` Ingo Molnar 0 siblings, 1 reply; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-28 14:54 UTC (permalink / raw) To: Ingo Molnar Cc: Rik van Riel, Christoph Rohland, Stephen C. Tweedie, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel On Thu, Sep 28, 2000 at 01:31:40PM +0200, Ingo Molnar wrote: > if the shm contains raw I/O data, then thats flawed application design - > an mmap()-ed file should be used instead. Shm is equivalent to shared The DBMS uses shared SCSI disks across multiple hosts on the same SCSI bus and synchronize the distributed cache via TCP. Tell me how to do that with the OS cache and mmap. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-28 14:54 ` Andrea Arcangeli @ 2000-09-28 15:13 ` Ingo Molnar 2000-09-28 15:23 ` Andrea Arcangeli 2000-09-28 16:16 ` Juan J. Quintela 0 siblings, 2 replies; 243+ messages in thread From: Ingo Molnar @ 2000-09-28 15:13 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Christoph Rohland, Stephen C. Tweedie, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Thu, 28 Sep 2000, Andrea Arcangeli wrote:
> The DBMS uses shared SCSI disks across multiple hosts on the same SCSI bus and synchronize the distributed cache via TCP. Tell me how to do that with the OS cache and mmap.

this could be supported by:

1) mlock()-ing the whole mapping.
2) introducing sys_flush(), which flushes pages from the pagecache.
3) doing sys_msync() after dirtying a range and before sending a TCP event.

Whenever the DB-cache-flush-event comes over TCP, it calls sys_flush() for that given virtual address range or file address space range. Sys_flush flushes the page from the pagecache and unmaps the address. Whenever it's needed again by the application it will be faulted in and read from disk.

Can anyone see any problems with the concept of this approach? This can be used for a page-granularity distributed IO cache.

(there are some smaller problems with this approach, like mlock() on a big range can only be done by privileged users, but that's not an issue IMO.)

Ingo
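The user-space side of Ingo's three steps might look roughly like the following. This is a hedged sketch, not code from the thread: mmap(), mlock() and msync() are the real POSIX calls, but the helper names are invented, and sys_flush() was only a proposal, so it appears here as a stub that reports ENOSYS. The TCP notification itself is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Step 1: map the file shared and try to pin it. A failed mlock() (for
 * example because of RLIMIT_MEMLOCK) is reported through *locked rather
 * than treated as fatal. */
static void *map_shared(int fd, size_t len, int *locked)
{
    assert(len > 0);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return MAP_FAILED;
    *locked = (mlock(p, len) == 0);
    return p;
}

/* Step 3: write the dirtied range back before telling peers about it. */
static int publish_range(void *addr, size_t len)
{
    return msync(addr, len, MS_SYNC);
}

/* Step 2: the proposed sys_flush() is hypothetical, so this stub only
 * marks where the call would go. */
static int flush_range_hypothetical(void *addr, size_t len)
{
    (void)addr;
    (void)len;
    errno = ENOSYS;
    return -1;
}

/* Round trip on a scratch file: dirty one page through the mapping,
 * publish it, then read it back through the file descriptor. */
static int demo_roundtrip(void)
{
    char path[] = "/tmp/dbcache-sketch-XXXXXX";
    char buf[4] = {0};
    int locked = 0, rc = -1;
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)len) == 0) {
        char *p = map_shared(fd, len, &locked);
        if (p != MAP_FAILED) {
            memcpy(p, "row1", 4);
            if (publish_range(p, len) == 0 &&
                pread(fd, buf, 4, 0) == 4 &&
                memcmp(buf, "row1", 4) == 0)
                rc = 0;
            munmap(p, len);
        }
    }
    close(fd);
    return rc;
}
```

Note that this sketch, like the proposal, assumes the data lives in a file on a filesystem.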
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-28 15:13 ` Ingo Molnar @ 2000-09-28 15:23 ` Andrea Arcangeli 2000-09-28 16:16 ` Juan J. Quintela 1 sibling, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-28 15:23 UTC (permalink / raw) To: Ingo Molnar Cc: Rik van Riel, Christoph Rohland, Stephen C. Tweedie, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Thu, Sep 28, 2000 at 05:13:59PM +0200, Ingo Molnar wrote:
> Can anyone see any problems with the concept of this approach? This can be

It works only on top of a filesystem while all the clever checkpointing stuff is done internally by the DB (in fact it _needs_ O_SYNC when it works on the fs).

Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-28 15:13 ` Ingo Molnar 2000-09-28 15:23 ` Andrea Arcangeli @ 2000-09-28 16:16 ` Juan J. Quintela 1 sibling, 0 replies; 243+ messages in thread From: Juan J. Quintela @ 2000-09-28 16:16 UTC (permalink / raw) To: mingo Cc: Andrea Arcangeli, Rik van Riel, Christoph Rohland, Stephen C. Tweedie, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel >>>>> "ingo" == Ingo Molnar <mingo@elte.hu> writes: Hi ingo> 2) introducing sys_flush(), which flushes pages from the pagecache. It is not supposed that mincore can do that (yes, just now it is not implemented, but the interface is there to do that)? Just curious. -- In theory, practice and theory are the same, but in practice they are different -- Larry McVoy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
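For reference, mincore() as documented reports which pages of a mapping are resident in memory; it does not evict anything, so on its own it would not cover the flushing half of the proposal. A minimal sketch of the query side follows. The helper name resident_pages() is made up for this example, and _DEFAULT_SOURCE is assumed to be what exposes mincore() on glibc.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count resident pages in [addr, addr + len). addr must be page aligned.
 * Returns the count, or -1 if mincore() or the allocation fails. */
static long resident_pages(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    size_t n = (len + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(n ? n : 1);
    long count = 0;

    if (vec == NULL)
        return -1;
    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }
    /* Only the least significant bit of each byte is defined. */
    for (size_t i = 0; i < n; i++)
        count += vec[i] & 1;
    free(vec);
    return count;
}
```

A caller could use this to check how much of a mapped cache is currently in core before deciding whether to read ahead.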
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-28 10:08 ` Rik van Riel 2000-09-28 11:16 ` Rik van Riel 2000-09-28 11:31 ` Ingo Molnar @ 2000-09-28 14:31 ` Andrea Arcangeli 2 siblings, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-28 14:31 UTC (permalink / raw) To: Rik van Riel Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel On Thu, Sep 28, 2000 at 07:08:51AM -0300, Rik van Riel wrote: > taking care of this itself. But this is not something the OS > should prescribe to the application. Agreed. > (unless the SHM users tell you that this is the normal way > they use SHM ... but as Christoph just told us, it isn't) shm is not used as I/O cache from 90% of the apps out there because normal apps uses the OS cache functionality (90% of those apps doesn't use rawio to share a black box that looks like a scsi disk via SCSI bus connected to other hosts as well). I for sure agree shm swapin/swapout is very important. (we moved shm swapout/swapin to swap cache with readaround for that reason) Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-24 23:41 ` Andrea Arcangeli 2000-09-25 16:24 ` Stephen C. Tweedie @ 2000-09-25 17:21 ` bert hubert 2000-09-25 17:49 ` Andrea Arcangeli 1 sibling, 1 reply; 243+ messages in thread From: bert hubert @ 2000-09-25 17:21 UTC (permalink / raw) To: Andrea Arcangeli Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel > We're talking about shrink_[id]cache_memory change. That have _nothing_ to do > with the VM changes that happened anywhere between test8 and test9-pre6. > > You were talking about a different thing. Ok, sorry. Kernel development is proceding at a furious pace and I sometimes lose track. > I consider the current approch the wrong way to go and for this reason I > prefer to spend time porting/improving classzone. I seem to remember that people were impressed by classzone, but that the implementation was very non-trivial and hard to grok. One of the reasons Rik's vm made it (so far) is that it is pretty straightforward, with all the marks of the right amount of simplicity. > In the meantime if you want to go back to 2.4.0-test1-ac22-class++ to give > it a try under swap to see the difference in the behaviour and compare > (Mike said it's still an order of magnitude faster with his "make -j30 > bzImage" testcase and he's always very reliable in his reports). There is no such thing as 'under swap'. There are lots of loadpatterns that will generate different kinds of memory pressure. Just calling it 'under swap' gives entirely the wrong impression. Although Mike's compile is a relevant benchmark, every VM has cases for which it excels, and cases for which it sucks. This appears to be a general property of VM design. Given knowledge of the algorithms used, you can always dream up a situation where it will fail. It's a bit like writing the halting problem algorithm. 
Same goes the other way around, every VM will have a 'shining benchmark' - hence the invention of benchmarketing. We used to have a bad virtual memory implementation that was sometimes well tuned so a lots of ordinary cases showed acceptable performance. We now have an elegant VM that works reasonably well, but needs more tweaking. What is the point of all this ranting? Think twice before embarking on 'rivaling virtual memory' code. Energies spent on Rik's VM will yield far higher differential improvement. Regards, bert hubert -- PowerDNS Versatile DNS Services Trilab The Technology People 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 17:21 ` bert hubert @ 2000-09-25 17:49 ` Andrea Arcangeli 0 siblings, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 17:49 UTC (permalink / raw) To: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 07:21:48PM +0200, bert hubert wrote: > Ok, sorry. Kernel development is proceding at a furious pace and I sometimes > lose track. No problem :). > I seem to remember that people were impressed by classzone, but that the > implementation was very non-trivial and hard to grok. One of the reasons Yes. Classzone is certainly more complex. > There is no such thing as 'under swap'. There are lots of loadpatterns that > will generate different kinds of memory pressure. Just calling it 'under > swap' gives entirely the wrong impression. Sorry for not being precise. I meant one of those load patterns. > 'rivaling virtual memory' code. Energies spent on Rik's VM will yield far > higher differential improvement. I've spent efforts on classzone as well, and since I think it's way superior approch I'll at least port it on top of 2.4.0-test9 as soon as time permits to generate some number. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-24 22:36 ` [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks bert hubert 2000-09-24 23:41 ` Andrea Arcangeli @ 2000-09-25 15:09 ` Miles Lane 2000-09-25 15:51 ` Stephen C. Tweedie 2 siblings, 0 replies; 243+ messages in thread From: Miles Lane @ 2000-09-25 15:09 UTC (permalink / raw) To: bert hubert Cc: Andrea Arcangeli, Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel bert hubert wrote: > On Mon, Sep 25, 2000 at 12:13:42AM +0200, Andrea Arcangeli wrote: > >> On Sun, Sep 24, 2000 at 10:43:03PM +0100, Stephen C. Tweedie wrote: >> >>> any form of serialisation on the quota file). This feels like rather >>> a lot of new and interesting deadlocks to be introducing so late in >>> 2.4. :-) >> > True. But they also appear to be found and solved at an impressive rate. > These deadlocks are fatal and don't hide in corners, whereas the previous mm > problems used to be very hard to spot and fix, there not being real > showstoppers, except for abysmal performance. [1] > > Since Rik's stuff was merged, the number of eyeball hours devoted to MM have > skyrocketed, whereas the previous incarnations had far smaller audiences. > The patches are barely a week in, and look how much has been improved that > hadn't been found by the people working with Rik. > > It's tempting to revert the merge, but let's work at it a bit longer. There > are problems, but we are solving them rapidly and both performance and > design of the new MM are pretty pleasing. > > Let's not waste this opportunity. I agree. I have seen really fabulous system response since Rik's changes were merged in. I have managed to crash my machine a couple of times (I am working on getting a serial debugging connection set up, since I don't see any OOPS messages), but I think this is not terribly surprising. My impression is that system responsiveness is much improved. Let's hang in there a bit longer. 
We are making rapid progress on testing and fixing. Miles -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-24 22:36 ` [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks bert hubert 2000-09-24 23:41 ` Andrea Arcangeli 2000-09-25 15:09 ` Miles Lane @ 2000-09-25 15:51 ` Stephen C. Tweedie 2000-09-25 16:05 ` Ingo Molnar 2 siblings, 1 reply; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-25 15:51 UTC (permalink / raw) To: Andrea Arcangeli, Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel Hi, On Mon, Sep 25, 2000 at 12:36:50AM +0200, bert hubert wrote: > On Mon, Sep 25, 2000 at 12:13:42AM +0200, Andrea Arcangeli wrote: > > On Sun, Sep 24, 2000 at 10:43:03PM +0100, Stephen C. Tweedie wrote: > > > any form of serialisation on the quota file). This feels like rather > > > a lot of new and interesting deadlocks to be introducing so late in > > > 2.4. :-) > > True. But they also appear to be found and solved at an impressive rate. > These deadlocks are fatal and don't hide in corners, whereas the previous mm > problems used to be very hard to spot and fix, there not being real > showstoppers, except for abysmal performance. [1] Sorry, but in this case you have got a lot more variables than you seem to think. The obvious lock is the ext2 superblock lock, but there are side cases with quota and O_SYNC which are much less commonly triggered. That's not even starting to consider the other dozens of filesystems in the kernel which have to be audited if we change the locking requirements for GFP calls. Cheers, Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 15:51 ` Stephen C. Tweedie @ 2000-09-25 16:05 ` Ingo Molnar 2000-09-25 16:06 ` Alexander Viro 0 siblings, 1 reply; 243+ messages in thread From: Ingo Molnar @ 2000-09-25 16:05 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Stephen C. Tweedie wrote: > Sorry, but in this case you have got a lot more variables than you > seem to think. The obvious lock is the ext2 superblock lock, but > there are side cases with quota and O_SYNC which are much less > commonly triggered. That's not even starting to consider the other > dozens of filesystems in the kernel which have to be audited if we > change the locking requirements for GFP calls. i'd suggest to simply BUG() in schedule() if the superblock lock is held not directly by lock_super. Holding the superblock lock is IMO quite rude anyway (for performance and latency) - is there any place where we hold it for a long time and it's unavoidable? Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 16:05 ` Ingo Molnar @ 2000-09-25 16:06 ` Alexander Viro 2000-09-25 16:20 ` Ingo Molnar 0 siblings, 1 reply; 243+ messages in thread From: Alexander Viro @ 2000-09-25 16:06 UTC (permalink / raw) To: Ingo Molnar Cc: Stephen C. Tweedie, Andrea Arcangeli, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Ingo Molnar wrote: > > On Mon, 25 Sep 2000, Stephen C. Tweedie wrote: > > > Sorry, but in this case you have got a lot more variables than you > > seem to think. The obvious lock is the ext2 superblock lock, but > > there are side cases with quota and O_SYNC which are much less > > commonly triggered. That's not even starting to consider the other > > dozens of filesystems in the kernel which have to be audited if we > > change the locking requirements for GFP calls. > > i'd suggest to simply BUG() in schedule() if the superblock lock is held > not directly by lock_super. Holding the superblock lock is IMO quite rude > anyway (for performance and latency) - is there any place where we hold it > for a long time and it's unavoidable? Ingo, schedule() has no bloody business _knowing_ about superblock locks in the first place. Yes, ext2 should not bother taking it at all. For completely unrelated reasons. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 16:06 ` Alexander Viro @ 2000-09-25 16:20 ` Ingo Molnar 2000-09-25 16:29 ` Andrea Arcangeli 0 siblings, 1 reply; 243+ messages in thread From: Ingo Molnar @ 2000-09-25 16:20 UTC (permalink / raw) To: Alexander Viro Cc: Stephen C. Tweedie, Andrea Arcangeli, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Alexander Viro wrote: > > i'd suggest to simply BUG() in schedule() if the superblock lock is held > > not directly by lock_super. Holding the superblock lock is IMO quite rude > > anyway (for performance and latency) - is there any place where we hold it > > for a long time and it's unavoidable? > > Ingo, schedule() has no bloody business _knowing_ about superblock > locks in the first place. Yes, ext2 should not bother taking it at > all. For completely unrelated reasons. i only suggested this as a debugging helper, instead of the suggested ext2_getblk() BUG() helper. Obviously schedule() has no business knowing about filesystem locks. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks 2000-09-25 16:20 ` Ingo Molnar @ 2000-09-25 16:29 ` Andrea Arcangeli 0 siblings, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 16:29 UTC (permalink / raw) To: Ingo Molnar Cc: Alexander Viro, Stephen C. Tweedie, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 06:20:40PM +0200, Ingo Molnar wrote: > i only suggested this as a debugging helper, instead of the suggested I don't think removing the superlock from all fs is good thing at this stage (I agree with SCT doing it only for ext2 [that's what we mostly care about] would be possible). Who cares if UFS grabs the super lock or not? grep lock_super fs/ext2/*.c is enough and we don't need debugging in the scheduler for that. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 2000-09-24 21:12 ` Ingo Molnar 2000-09-24 21:43 ` Stephen C. Tweedie @ 2000-09-25 4:56 ` Linus Torvalds 2000-09-25 5:19 ` Alexander Viro 1 sibling, 1 reply; 243+ messages in thread From: Linus Torvalds @ 2000-09-25 4:56 UTC (permalink / raw) To: Ingo Molnar Cc: Andrea Arcangeli, Rik van Riel, Roger Larsson, Alexander Viro, MM mailing list, linux-kernel

Hmm..

Thinking some more about this issue, I actually suspect that there's a better solution.

The fact is that GFP_BUFFER is only used for the old-fashioned buffer block allocations, and anything that uses the page cache automatically avoids the whole issue.

As such, from a VM balancing standpoint we would fix the problem equally well by just avoiding using old-fashioned buffer blocks..

Now, I don't believe that the indirect blocks etc of the meta-data is much of an issue - whenever we need to access indirect blocks we're certainly already doing the page cache thing, so the page cache VM pressure should be quite sufficient to keep the VM balanced - regular file access is very much biased towards the page cache, and the meta-data buffer-cache accesses are likely to be a very very small part of the big picture.

The remaining part is the directory handling. THAT is very buffer-cache intensive, as the directory handling hasn't been moved over to the page cache at all for ext2. Doing a large "find" (or even just a "ls -l") will basically do purely buffer cache accesses, first for the directory data and then for the inode data. With no page cache activity to balance things out at all - leading to a potentially quite unbalanced VM that never really had a good chance to get rid of dentries etc.

However, Al Viro already basically has the "directories using the page cache" code pretty much done, so for 2.5.x we'll just do that, and I bet that the VM balancing will improve (as well as performance going up simply just because the page cache is more efficient anyway).

With the directory information in the page cache, there simply aren't any regular operations that depend entirely on the buffer cache any more. Sure, there will still be the inode and indirect blocks, but there just aren't loads that I know of that can put as much pressure on those as on the page cache..

So the proper approach may be to just ignore the current issue with __GFP_IO being a big deal under some loads, because it probably will go away on its own (the superblock lock contention is still an issue, of course, but while somewhat related it's still fairly orthogonal).

Al, if you'd port over the "namei in page-cache" stuff from UFS to ext2, I bet that there would be people interested in seeing whether the above theory is just another of Linus' whimsies, or whether it really does make a difference.. It may not be 2.4.x material, but it won't hurt to have it tested some more anyway.

Comments?

Linus
* Re: [patch] vmfixes-2.4.0-test9-B2 2000-09-25 4:56 ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds @ 2000-09-25 5:19 ` Alexander Viro 2000-09-25 6:06 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 243+ messages in thread From: Alexander Viro @ 2000-09-25 5:19 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Andrea Arcangeli, Rik van Riel, Roger Larsson, Alexander Viro, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Linus Torvalds wrote:
> The remaining part is the directory handling. THAT is very buffer-cache intensive, as the directory handling hasn't been moved over to the page cache at all for ext2. Doing a large "find" (or even just a "ls -l") will basically do purely buffer cache accesses, first for the directory data and then for the inode data. With no page cache activity to balance things out at all - leading to a potentially quite unbalanced VM that never really had a good chance to get rid of dentries etc.

You forgot inode tables themselves.

> Al, if you'd port over the "namei in page-cache" stuff from UFS to ext2, I bet that there would be people interested in seeing whether the above theory is just another of Linus' whimsies, or whether it really does make a difference.. It may not be 2.4.x material, but it won't hurt to have it tested some more anyway. Comments?

I'll do it and post the result tomorrow. I bet that there will be issues I've overlooked (stuff that happens to work on UFS, but needs to be more general for ext2), so it's going as "very alpha", but hey, it's pretty straightforward, so there is a chance to debug it fast. Yes, famous last words and all such...

BTW, we _will_ need it on UFS side in 2.4 anyway. Rationale:
* UFS _does_ fragments, whether we like it or not.
* Reallocating fragments for regular files can not be done by bread()+getblk()+memcpy()+mark_buffer_dirty() - data is in pagecache, so that's an instant death
* to get UFS working with pagecache and not eating filesystems we must do fragment reallocation through pagecache
* it means that we either duplicate the whole mess both for buffer cache (directories) and pagecache (inodes) or move directories to pagecache

The former (pagecache duplicate of the reallocation code) is nasty since we have to separate the current realloc stuff from the code paths where it sits right now anyway - it's merged into the functions used by pagecache side. I.e. we would have to
* do pagecache fragment handling
* rip the buffer-cache fragment handling out
* redo it, so that it would live outside of the path used by pagecache side
* change the callers.

The last couple means more work than switching directories to pagecache.

So some variant of directories in pagecache is needed for 2.4, the question being whether it's UFS-only or we use its port on ext2... BTW, minixfs/sysvfs can also use the thing, but that's another story.

Off to port the bloody thing...
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Linus Torvalds @ 2000-09-25 6:06 UTC
To: Alexander Viro
Cc: Ingo Molnar, Andrea Arcangeli, Rik van Riel, Roger Larsson,
    Alexander Viro, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Alexander Viro wrote:
>
> On Sun, 24 Sep 2000, Linus Torvalds wrote:
>
> > The remaining part is the directory handling. THAT is very buffer-cache
> > intensive, as the directory handling hasn't been moved over to the page
> > cache at all for ext2. Doing a large "find" (or even just a "ls -l") will
> > basically do purely buffer cache accesses, first for the directory data
> > and then for the inode data. With no page cache activity to balance things
> > out at all - leading to a potentially quite unbalanced VM that never
> > really had a good chance to get rid of dentries etc.
>
> You forgot inode tables themselves.

I didn't. That's the "then for the inode data" part.

I'm not claiming that the buffer cache accesses would go away - I'm just
saying that the unbalanced "only buffer cache" case should go away,
because things like "find" and friends will still cause mostly page cache
activity.

(Considering the size of the inode on ext2, I don't know how true this is,
I have to admit. It might still be quite biased towards the buffer cache,
and as such the additional page cache pressure might not be enough to
really cause any major shift in balancing).

> I'll do it and post the result tomorrow. I bet that there will be issues
> I've overlooked (stuff that happens to work on UFS, but needs to be more
> general for ext2), so it's going as "very alpha", but hey, it's pretty
> straightforward, so there is a chance to debug it fast. Yes, famous last
> words and all such...

Sure.

> BTW, we _will_ need it on UFS side in 2.4 anyway. Rationale:

[ reasons removed ]

I have no problem with that. Especially as I suspect the people who use
UFS are more likely to be the technical kind of user who is more inclined
to be able to debug whatever potential problems crop up anyway.

Your point about not duplicating the fragment handling is certainly quite
convincing for the case of UFS.

> So some variant of directories in pagecache is needed for 2.4, the
> question being whether it's UFS-only or we use its port on ext2... BTW,
> minixfs/sysvfs can also use the thing, but that's another story.

Let's plan on UFS-only, for all the prudent reasons.

		Linus
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Alexander Viro @ 2000-09-25 6:17 UTC
To: Linus Torvalds
Cc: Alexander Viro, Ingo Molnar, Andrea Arcangeli, Rik van Riel,
    Roger Larsson, Alexander Viro, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Linus Torvalds wrote:

> I'm not claiming that the buffer cache accesses would go away - I'm just
> saying that the unbalanced "only buffer cache" case should go away,
> because things like "find" and friends will still cause mostly page cache
> activity.
>
> (Considering the size of the inode on ext2, I don't know how true this is,
> I have to admit. It might still be quite biased towards the buffer cache,
> and as such the additional page cache pressure might not be enough to
> really cause any major shift in balancing).

Hrrrmmm... You know, since we don't have to associate struct inode with
every address space and inode table _is_ a linear array, after all... We
might put it into pagecache too. Very few places access the on-disk inode,
so it's not too horrible. All we need is readpage() and that's very easy,
considering the fact that allocation is static. prepare_write() and
commit_write() may be NULL for all I care and writepage() will be easy
too - no holes, no allocation, no nothing. Looks like we need to deal with
ext2_update_inode(), ext2_read_inode() and that's it. Even less intrusive
than directory stuff...

Comments?
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Alexander Viro @ 2000-09-25 21:21 UTC
To: Linus Torvalds
Cc: Theodore Y. Ts'o, Ingo Molnar, Andrea Arcangeli, Rik van Riel,
    Alexander Viro, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Linus Torvalds wrote:

[directories in pagecache on ext2]

> > I'll do it and post the result tomorrow. I bet that there will be issues
> > I've overlooked (stuff that happens to work on UFS, but needs to be more
> > general for ext2), so it's going as "very alpha", but hey, it's pretty
> > straightforward, so there is a chance to debug it fast. Yes, famous last
> > words and all such...
>
> Sure.

All right, I think I've got something that may work. Yes, there were
issues - UFS has the constant directory chunk size (1 sector), while ext2
makes it equal to fs block size. _Bad_ idea, since the sector writes are
atomic and block ones... Oh, well, so ext2 is slightly less robust. It
required some changes, I'll do the initial testing and post the patch once
it passes the trivial tests.

BTW, why on Earth did we do it that way? It has no noticeable effect on
directory fragmentation, it makes code (both in page- and buffer-cache
variants) more complex, it's less robust (by definition - directory layout
may be broken easier)... What was the point? Not that we could do
something about that now (albeit as a ro-compat feature it would be nice),
but I'm curious about the reasons...

						Cheers,
							Al
* [CFT][PATCH] ext2 directories in pagecache
From: Alexander Viro @ 2000-09-26 13:42 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Andrea Arcangeli, Rik van Riel, Roger Larsson,
    Alexander Viro, MM mailing list, linux-kernel, linux-fsdevel

Help in testing is welcome, just keep in mind that it's ext2 we are
talking about. IOW, proceed with care and don't let it loose on the data
you can't easily restore.

Patch moves the directory data into the pagecache. I hope that it's
sufficiently straightforward to be readable. Linus, if you prefer to get
it in the mail - tell and I'll send it (50K unpacked due to
ext2/{dir,namei}.c modifications, so it's too large for the lists).

						Cheers,
							Al
* [CFT][PATCH] ext2 directories in pagecache
From: Alexander Viro @ 2000-09-26 21:29 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Andrea Arcangeli, Rik van Riel, Roger Larsson,
    Alexander Viro, MM mailing list, linux-kernel, linux-fsdevel

be really working (survives assorted builds, does the right thing on
find-based scripts and obvious local tests, yodda, yodda). It certainly
needs more testing, but I would call it (early) beta.

Folks, give it a try - just keep decent backups. Similar code will
have to go into UFS in 2.4 and that (ext2) variant may be of interest for
2.4.<late>/2.5.<early> timeframe.

I'm putting it on ftp.math.psu.edu/pub/viro/ext2-patch-7.gz. Comments and
help in testing are more than welcome.

						Cheers,
							Al
* Re: [CFT][PATCH] ext2 directories in pagecache
From: Marko Kreen @ 2000-09-26 22:16 UTC
To: Alexander Viro
Cc: Linus Torvalds, Ingo Molnar, Andrea Arcangeli, Rik van Riel,
    Roger Larsson, Alexander Viro, MM mailing list, linux-kernel,
    linux-fsdevel

On Tue, Sep 26, 2000 at 05:29:27PM -0400, Alexander Viro wrote:
> Comments and help in testing are more than welcome.

There is something fishy in ext2_empty_dir:

+			/* check for . and .. */
+			if (de->name[0] != '.')
+				goto not_empty;
+			if (!de->name[1]) {
+				if (de->inode !=
+				    le32_to_cpu(inode->i_ino))
+					goto not_empty;
+			} else if (de->name[2])
+				goto not_empty;
+			else if (de->name[1] != '.')
+				goto not_empty;

-- 
marko
* Re: [CFT][PATCH] ext2 directories in pagecache
From: Alexander Viro @ 2000-09-26 22:31 UTC
To: Marko Kreen
Cc: Linus Torvalds, Ingo Molnar, Andrea Arcangeli, Rik van Riel,
    Roger Larsson, Alexander Viro, MM mailing list, linux-kernel,
    linux-fsdevel

On Wed, 27 Sep 2000, Marko Kreen wrote:

> On Tue, Sep 26, 2000 at 05:29:27PM -0400, Alexander Viro wrote:
> > Comments and help in testing are more than welcome.
>
> There is something fishy in ext2_empty_dir:

Why?

> +			/* check for . and .. */
> +			if (de->name[0] != '.')
> +				goto not_empty;

Doesn't start with '.' - definitely not an empty directory

> +			if (!de->name[1]) {

OK, it's {'.','\0'}, aka. ".".

> +				if (de->inode !=
> +				    le32_to_cpu(inode->i_ino))

Consistency check... Aha, I see. Yup, s/le32_to_cpu/cpu_to_le32/. Doesn't
matter on all normal architectures, but yes, it's still wrong.

> +					goto not_empty;

If we have it screwed - leave it as is and don't mess with it. Otherwise -
skip this record, it's all right for empty directory.

> +			} else if (de->name[2])

Starts with '.' and longer than 2 characters? Not empty.

> +				goto not_empty;
> +			else if (de->name[1] != '.')

Starts with '.', 2 characters, but the second isn't '.'? Not empty.

> +				goto not_empty;

Otherwise - skip the record.

So checks are OK, the only thing being that we should use cpu_to_le32()
instead of le32_to_cpu(). Doesn't affect the behaviour right now, but
ought to be fixed anyway.
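The walk-through above can be condensed into a small userspace sketch. This is not the code from the patch: `struct toy_dirent`, `toy_cpu_to_le32()` and `toy_is_dot_or_dotdot()` are hypothetical names invented for illustration, and the sketch assumes a NUL-terminated name, as the `de->name[1]` / `de->name[2]` tests in the quoted hunk suggest. It compares the on-disk (little-endian) inode field against the CPU-order inode number converted with a cpu-to-le helper, which is the direction of conversion settled on in the message above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an on-disk directory entry. */
struct toy_dirent {
	uint32_t inode_le;	/* inode number as stored on disk (little-endian) */
	char     name[256];	/* assumed NUL-terminated for this sketch */
};

/* Portable CPU-order to little-endian conversion (identity on LE hosts). */
static uint32_t toy_cpu_to_le32(uint32_t v)
{
	uint8_t b[4];
	uint32_t out;

	b[0] = (uint8_t)(v & 0xff);
	b[1] = (uint8_t)((v >> 8) & 0xff);
	b[2] = (uint8_t)((v >> 16) & 0xff);
	b[3] = (uint8_t)((v >> 24) & 0xff);
	memcpy(&out, b, sizeof(out));
	return out;
}

/*
 * Returns 1 if the entry may be skipped when testing for an empty
 * directory: either "." pointing back at dir_ino, or "..".
 * Anything else means "not empty" and returns 0.
 */
static int toy_is_dot_or_dotdot(const struct toy_dirent *de, uint32_t dir_ino)
{
	if (de->name[0] != '.')
		return 0;			/* does not start with '.' */
	if (de->name[1] == '\0')		/* the entry is "." */
		return de->inode_le == toy_cpu_to_le32(dir_ino);
	if (de->name[1] != '.')
		return 0;			/* ".x" */
	return de->name[2] == '\0';		/* ".." yes, "..x" no */
}
```

The sketch tests `name[1]` before `name[2]` so that it never reads past the terminator of a one-character name; the outcome for each case matches the explanation given in the reply.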
* Re: [CFT][PATCH] ext2 directories in pagecache
From: Marko Kreen @ 2000-09-26 22:47 UTC
To: Alexander Viro
Cc: Linus Torvalds, Ingo Molnar, Andrea Arcangeli, Rik van Riel,
    Roger Larsson, Alexander Viro, MM mailing list, linux-kernel,
    linux-fsdevel

On Tue, Sep 26, 2000 at 06:31:04PM -0400, Alexander Viro wrote:
> On Wed, 27 Sep 2000, Marko Kreen wrote:
> > There is something fishy in ext2_empty_dir:
>
> Why?
>
> > +			} else if (de->name[2])
>

Sorry, I had a hard day and I should have gone to sleep already...
I did not think (anyway I tried ;) too hard on that [2], it seemed
to me with the following stuff as some copy-paste bug...

-- 
marko
* Re: [CFT][PATCH] ext2 directories in pagecache
From: Ingo Molnar @ 2000-09-27 7:32 UTC
To: Marko Kreen
Cc: Alexander Viro, Linus Torvalds, Andrea Arcangeli, Rik van Riel,
    Roger Larsson, Alexander Viro, MM mailing list, linux-kernel,
    linux-fsdevel

On Wed, 27 Sep 2000, Marko Kreen wrote:

> > Why?
> >
> > > +			} else if (de->name[2])
> >
> Sorry, I had a hard day and I should have gone to sleep already...

hey, you made Alexander notice an endianness bug so it was ok :-)

	Ingo
* Re: [CFT][PATCH] ext2 directories in pagecache
From: Alexander Viro @ 2000-09-27 9:22 UTC
To: Ingo Molnar
Cc: Marko Kreen, Linus Torvalds, Andrea Arcangeli, Rik van Riel,
    Roger Larsson, Alexander Viro, MM mailing list, linux-kernel,
    linux-fsdevel

On Wed, 27 Sep 2000, Ingo Molnar wrote:

>
> On Wed, 27 Sep 2000, Marko Kreen wrote:
>
> > > Why?
> > >
> > > > +			} else if (de->name[2])
> > >
> > Sorry, I had a hard day and I should have gone to sleep already...
>
> hey, you made Alexander notice an endianness bug so it was ok :-)

Definitely. Usually "it looks fishy" feeling should be trusted - if code
is non-obvious it's more likely to contain bugs. How did it go? "The goal
is to write clear code, not clever code". And right now dir.c in the patch
is not clear enough - better than the corresponding code in the tree (esp.
in ext2_readdir()), but still needs cleaning up.

ObFsck: router in the $ORKPLACE apparently deciding that it's a good time
to shit itself and external SCSI on one of the home boxen joining the fun.
Sheesh...
* Re: [CFT][PATCH] ext2 directories in pagecache
From: Andreas Dilger @ 2000-09-26 23:19 UTC
To: Alexander Viro
Cc: Linus Torvalds, Ingo Molnar, Andrea Arcangeli, Rik van Riel,
    Roger Larsson, Alexander Viro, MM mailing list, linux-kernel,
    linux-fsdevel

Al Viro writes:
> Folks, give it a try - just keep decent backups. Similar code will
> have to go into UFS in 2.4 and that (ext2) variant may be of interest for
> 2.4.<late>/2.5.<early> timeframe.

Haven't tested it yet, but just reading over the patch - in ext2_lookup():

	if (dentry->d_name.len > UFS_MAXNAMLEN)
		return ERR_PTR(-ENAMETOOLONG);

should probably be changed back to:

	if (dentry->d_name.len > EXT2_NAME_LEN)
		return ERR_PTR(-ENAMETOOLONG);

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert
* Re: [CFT][PATCH] ext2 directories in pagecache
From: Alexander Viro @ 2000-09-26 23:33 UTC
To: Andreas Dilger
Cc: Alexander Viro, Linus Torvalds, Ingo Molnar, Andrea Arcangeli,
    Rik van Riel, Roger Larsson, Alexander Viro, MM mailing list,
    linux-kernel, linux-fsdevel

On Tue, 26 Sep 2000, Andreas Dilger wrote:

> Al Viro writes:
> > Folks, give it a try - just keep decent backups. Similar code will
> > have to go into UFS in 2.4 and that (ext2) variant may be of interest for
> > 2.4.<late>/2.5.<early> timeframe.
>
> Haven't tested it yet, but just reading over the patch - in ext2_lookup():
>
> 	if (dentry->d_name.len > UFS_MAXNAMLEN)
> 		return ERR_PTR(-ENAMETOOLONG)
>
> should probably be changed back to:
>
> 	if (dentry->d_name.len > EXT2_NAME_LEN)
> 		return ERR_PTR(-ENAMETOOLONG)

Grrr... It shows the ancestry - it's a ported UFS patch. Thanks for
spotting, I'll fix that.
* Re: [CFT][PATCH] ext2 directories in pagecache
From: Alexander Viro @ 2000-09-26 23:44 UTC
To: Alexander Viro
Cc: Andreas Dilger, Linus Torvalds, Ingo Molnar, Andrea Arcangeli,
    Rik van Riel, Roger Larsson, MM mailing list, linux-kernel,
    linux-fsdevel

On Tue, 26 Sep 2000, Alexander Viro wrote:

> On Tue, 26 Sep 2000, Andreas Dilger wrote:
>
> > Al Viro writes:
> > > Folks, give it a try - just keep decent backups. Similar code will
> > > have to go into UFS in 2.4 and that (ext2) variant may be of interest for
> > > 2.4.<late>/2.5.<early> timeframe.
> >
> > Haven't tested it yet, but just reading over the patch - in ext2_lookup():
> >
> > 	if (dentry->d_name.len > UFS_MAXNAMLEN)
> > 		return ERR_PTR(-ENAMETOOLONG)
> >
> > should probably be changed back to:
> >
> > 	if (dentry->d_name.len > EXT2_NAME_LEN)
> > 		return ERR_PTR(-ENAMETOOLONG)
>
> Grrr... It shows the ancestry - it's a ported UFS patch. Thanks for
> spotting, I'll fix that.

Aha. And there was that UFS_LINK_MAX thing. Fixed. OK, new version is on
the same site, URL being
	ftp://ftp.math.psu.edu/pub/viro/ext2-patch-8.gz

Changes: got rid of the remnants of UFS ancestry (EXT2 limits are used;
not that it mattered much, but...), fixed the conversion in
ext2_empty_dir() (cpu_to_le32() instead of le32_to_cpu()).

						Cheers,
							Al
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Linus Torvalds @ 2000-09-25 0:09 UTC
To: Andrea Arcangeli
Cc: Ingo Molnar, Rik van Riel, Roger Larsson, MM mailing list,
    linux-kernel

On Sun, 24 Sep 2000, Andrea Arcangeli wrote:
>
> On Sun, Sep 24, 2000 at 10:26:11PM +0200, Ingo Molnar wrote:
> > where will it deadlock?
>
> ext2_new_block (or whatever that runs getblk with the superlock lock
> acquired)->getblk->GFP->shrink_dcache_memory->prune_dcache->
> prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->
> put_inode->ext2_discard_prealloc->ext2_free_blocks->lock_super->D

Whee..

Good that you remembered (now that you mention it, I recollect that we had
this bug and discussion earlier).

I added a comment to the effect, although I still moved the __GFP_IO test
into the icache and dcache shrink functions, because as with the
shm_swap() thing this is probably something we do want to fix eventually.

The icache shrinker probably has similar problems with clear_inode.

I suspect that it might be a good idea to try to fix this issue, because
it will probably keep coming up otherwise. And it's likely to be fairly
easily debugged, by just making getblk() have some debugging code that
basically says something like

	lock_super()
	{
		.. do the lock ..
+		current->super_locked++;
	}

	unlock_super()
	{
+		if (current->super_locked < 1)
+			BUG();
+		current->super_locked--;
		.. do the unlock ..
	}

	getblk()
	{
+		if (current->super_locked)
+			BUG();
		.. do the getblk ..
	}

and just making it a new rule that you cannot call getblk() with any locks
held.

It should be fairly easy to make the callers well-behaved: the hard part
is probably just enumerating and finding the suckers, which is why the
above debug code would make people aware of it..

(We definitely don't want to wait for the deadlock to happen and trap that
one: the above code will BUG() out in any normal situation regardless of
whether it would actually trigger a deadlock or even allocate memory or
not. Which is what we'd want if we want to fix this).

On the whole, fixing the cases would probably imply dropping the lock,
doing the read, re-acquiring the lock, and then going back and seeing if
maybe somebody else already filled in the bitmap cache or whatever. So not
one-liners by any means, but we'll probably want to do it at some point
(the superblock lock is quite contended right now, and the reason for that
may well be that it's just so badly done for historical reasons).

		Linus
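The debugging rule sketched in the message above can be tried out in userspace. The following is only an illustration under stated assumptions: `struct toy_task` stands in for the kernel's `current`, all `toy_*` names are hypothetical, and the checks return -1 where the kernel sketch would call BUG(), so that the rule can be exercised without crashing.

```c
/* Hypothetical stand-in for the per-task state reached via `current`. */
struct toy_task {
	int super_locked;	/* how many superblock locks this task holds */
};

/* Taking the lock bumps the per-task depth counter. */
static void toy_lock_super(struct toy_task *t)
{
	/* .. the real lock would be taken here .. */
	t->super_locked++;
}

/* Unlocking without holding the lock is reported (BUG() in the sketch). */
static int toy_unlock_super(struct toy_task *t)
{
	if (t->super_locked < 1)
		return -1;
	t->super_locked--;
	/* .. the real lock would be dropped here .. */
	return 0;
}

/*
 * The rule being enforced: the block lookup may not be entered while the
 * task holds the superblock lock, whether or not this particular call
 * would actually have allocated memory and deadlocked.
 */
static int toy_getblk(const struct toy_task *t)
{
	if (t->super_locked)
		return -1;
	/* .. the real lookup would happen here .. */
	return 0;
}
```

The point of checking on every call, rather than only on the path that recurses into the shrinkers, is the one made in the message: offending callers show up in normal operation instead of only under memory pressure.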
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Alexander Viro @ 2000-09-25 0:49 UTC
To: Linus Torvalds
Cc: Andrea Arcangeli, Ingo Molnar, Rik van Riel, Roger Larsson,
    MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Linus Torvalds wrote:

> > ext2_new_block (or whatever that runs getblk with the superlock lock
> > acquired)->getblk->GFP->shrink_dcache_memory->prune_dcache->
> > prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->
> > put_inode->ext2_discard_prealloc->ext2_free_blocks->lock_super->D
>
> Whee..

[snip]

> On the whole, fixing the cases would probably imply dropping the lock,
> doing the read, re-acquiring the lock, and then going back and seeing if
> maybe somebody else already filled in the bitmap cache or whatever. So not
> one-liners by any means, but we'll probably want to do it at some point
> (the superblock lock is quite contended right now, and the reason for that
> may well be that it's just so badly done for historical reasons).

Nope. Solution is to kill the silly "hold super_block lock during the
allocation" completely. Right now the main problem making us use it at all
is the following: dquot_alloc_block() is a blocking operation. If that
gets fixed - that's it. We simply don't need anything more fancy than
rwlock on access to bitmap + rwlock or plain spinlock on access to group
descriptors cache. End of problem.

Remember that off-list thread in July when you asked what could be done
with lock_super()? I did the analysis, all right - list of ext2 races was
a side effect of that. Now we have that crap fixed, so getting rid of
lock_super() in ext2 (in a clear way) is possible. So if you still want
it - tell. ext2 part is very easy, it's quota part that needs serious
work.
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Marcelo Tosatti @ 2000-09-25 0:53 UTC
To: Linus Torvalds
Cc: Andrea Arcangeli, Ingo Molnar, Rik van Riel, Roger Larsson,
    MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Linus Torvalds wrote:

>
>
> On Sun, 24 Sep 2000, Andrea Arcangeli wrote:
> >
> > On Sun, Sep 24, 2000 at 10:26:11PM +0200, Ingo Molnar wrote:
> > > where will it deadlock?
> >
> > ext2_new_block (or whatever that runs getblk with the superlock lock
> > acquired)->getblk->GFP->shrink_dcache_memory->prune_dcache->
> > prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->
> > put_inode->ext2_discard_prealloc->ext2_free_blocks->lock_super->D
>
> Whee..
>
> Good that you remembered (now that you mention it, I recollect that we had
> this bug and discussion earlier).
>
> I added a comment to the effect, although I still moved the __GFP_IO test
> into the icache and dcache shrink functions, because as with the
> shm_swap() thing this is probably something we do want to fix eventually.

Btw, why do we need kmem_cache_shrink() inside shrink_{i,d}cache_memory?

Since refill_inactive and do_try_to_free_pages (the only functions which
call shrink_{i,d}cache_memory) already shrink the SLAB cache (with
kmem_cache_reap), I don't think it's needed.
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-25 1:45 UTC
To: Marcelo Tosatti
Cc: Linus Torvalds, Ingo Molnar, Rik van Riel, Roger Larsson,
    MM mailing list, linux-kernel

On Sun, Sep 24, 2000 at 09:53:33PM -0300, Marcelo Tosatti wrote:
> Btw, why we need kmem_cache_shrink() inside shrink_{i,d}cache_memory ?

Because kmem_cache_free doesn't free anything. It only queues slab objects
into the partial and free part of the cachep slab queue (so that they're
ready to be freed later, and that's what we do in shrink_slab_cache).

> calls shrink_{i,d}cache_memory) already shrink the SLAB cache (with
> kmem_cache_reap), I dont think its needed.

kmem_cache_reap shrinks the slabs at _very_ low frequency. It's worthless
to keep lots of dentries and icache in the slab internal queues until
kmem_cache_reap kicks in again; if we free that memory immediately
instead, kmem_cache_reap will run later and for something more appropriate
to what it was designed for. The [id]cache shrink could release lots of
memory.

Andrea
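The distinction drawn above (freeing an object only updates the slab's bookkeeping, while an explicit shrink is what hands memory back) can be illustrated with a toy model. This is a deliberately simplified userspace sketch, not the kernel's slab allocator: `struct toy_cache` and the `toy_cache_*` functions are hypothetical names, and a "slab" here is reduced to a live-object count.

```c
#define TOY_MAX_SLABS 8

/* Hypothetical toy cache: a slab is reduced to a live-object counter. */
struct toy_cache {
	int in_use[TOY_MAX_SLABS];	/* live objects in each slab */
	int present[TOY_MAX_SLABS];	/* 1 while the slab still owns its pages */
};

/* Freeing an object only updates bookkeeping; the slab keeps its pages. */
static void toy_cache_free(struct toy_cache *c, int slab)
{
	if (c->present[slab] && c->in_use[slab] > 0)
		c->in_use[slab]--;
}

/* Shrinking releases every completely free slab and reports how many. */
static int toy_cache_shrink(struct toy_cache *c)
{
	int i, released = 0;

	for (i = 0; i < TOY_MAX_SLABS; i++) {
		if (c->present[i] && c->in_use[i] == 0) {
			c->present[i] = 0;
			released++;
		}
	}
	return released;
}

/* Number of slabs that still hold on to memory. */
static int toy_cache_slabs(const struct toy_cache *c)
{
	int i, n = 0;

	for (i = 0; i < TOY_MAX_SLABS; i++)
		n += c->present[i];
	return n;
}
```

In this model, pruning many entries without a following shrink leaves `toy_cache_slabs()` unchanged, which mirrors the argument for shrinking right after the dcache/icache prune instead of waiting for a low-frequency reaper.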
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Marcelo Tosatti @ 2000-09-25 2:39 UTC
To: Andrea Arcangeli
Cc: Linus Torvalds, Ingo Molnar, Rik van Riel, Roger Larsson,
    MM mailing list, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1326 bytes --]

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

<snip>

> kmem_cache_reap shrinks the slabs at _very_ low frequency. It's worthless to
> keep lots of dentries and icache into the slab internal queues until
> kmem_cache_reap kicks in again, if we free them such memory immediatly instead
> we'll run kmem_cache_reap later and for something more appropraite for what's
> been designed. The [id]cache shrink could release lots of memory.

I see.

Since we have code which is using GFP_BUFFER allocations to not block but
only shrink the cache (1), I've done a patch to:

- Change kmem_cache_shrink to return the number of freed pages.
- Move __GFP_IO checking from do_try_to_free_pages/refill_inactive to
  {i,d}cache shrink functions (Linus already did this in his tree)
- On the {i,d}cache shrink functions, return the value of
  kmem_cache_shrink() (no need of __GFP_IO for that)

There was a comment on the shrink functions about making
kmem_cache_shrink() work on a GFP_DMA/GFP_HIGHMEM basis to free only the
wanted pages by the current allocation.

GFP_DMA allocations will never reach this code (do_try_to_free_pages is
only called if __GFP_WAIT is set) and GFP_HIGHMEM pages will never be used
as SLAB obj's memory. (please correct me if I'm wrong)

Comments?

(1) Using GFP_BUFFER is wrong, but it's a separate issue.

[-- Attachment #2: Type: TEXT/PLAIN, Size: 4122 bytes --]

diff --exclude-from=exclude -Nur linux.orig/fs/dcache.c linux/fs/dcache.c
--- linux.orig/fs/dcache.c	Sun Sep 24 18:14:24 2000
+++ linux/fs/dcache.c	Sun Sep 24 22:49:16 2000
@@ -556,15 +556,11 @@
 	int count = 0;
 	if (priority)
 		count = dentry_stat.nr_unused / priority;
-	prune_dcache(count);
-	/* FIXME: kmem_cache_shrink here should tell us
-	   the number of pages freed, and it should
-	   work in a __GFP_DMA/__GFP_HIGHMEM behaviour
-	   to free only the interesting pages in
-	   function of the needs of the current allocation. */
-	kmem_cache_shrink(dentry_cache);
 
-	return 0;
+	if(gfp_mask & __GFP_IO)
+		prune_dcache(count);
+
+	return kmem_cache_shrink(dentry_cache);
 }
 
 #define NAME_ALLOC_LEN(len)	((len+16) & ~15)
diff --exclude-from=exclude -Nur linux.orig/fs/inode.c linux/fs/inode.c
--- linux.orig/fs/inode.c	Sun Sep 24 18:14:25 2000
+++ linux/fs/inode.c	Sun Sep 24 22:47:30 2000
@@ -460,15 +460,11 @@
 
 	if (priority)
 		count = inodes_stat.nr_unused / priority;
-	prune_icache(count);
-	/* FIXME: kmem_cache_shrink here should tell us
-	   the number of pages freed, and it should
-	   work in a __GFP_DMA/__GFP_HIGHMEM behaviour
-	   to free only the interesting pages in
-	   function of the needs of the current allocation. */
-	kmem_cache_shrink(inode_cachep);
 
-	return 0;
+	if(gfp_mask & __GFP_IO)
+		prune_icache(count);
+
+	return kmem_cache_shrink(inode_cachep);
 }
 
 /*
diff --exclude-from=exclude -Nur linux.orig/mm/slab.c linux/mm/slab.c
--- linux.orig/mm/slab.c	Sun Sep 24 18:14:04 2000
+++ linux/mm/slab.c	Sun Sep 24 22:46:11 2000
@@ -887,7 +887,7 @@
 static int __kmem_cache_shrink(kmem_cache_t *cachep)
 {
 	slab_t *slabp;
-	int ret;
+	int ret, freed = 0;
 
 	drain_cpu_caches(cachep);
 
@@ -912,8 +912,11 @@
 		spin_unlock_irq(&cachep->spinlock);
 		kmem_slab_destroy(cachep, slabp);
 		spin_lock_irq(&cachep->spinlock);
+
+		freed++;
 	}
-	ret = !list_empty(&cachep->slabs);
+
+	ret = ((1 << cachep->gfporder) * freed);
 	spin_unlock_irq(&cachep->spinlock);
 	return ret;
 }
@@ -923,7 +926,8 @@
  * @cachep: The cache to shrink.
  *
  * Releases as many slabs as possible for a cache.
- * To help debugging, a zero exit status indicates all slabs were released.
+ *
+ * Returns the amount of freed pages.
  */
 int kmem_cache_shrink(kmem_cache_t *cachep)
 {
@@ -962,7 +966,9 @@
 	list_del(&cachep->next);
 	up(&cache_chain_sem);
 
-	if (__kmem_cache_shrink(cachep)) {
+	__kmem_cache_shrink(cachep);
+
+	if (!list_empty(&cachep->slabs)) {
 		printk(KERN_ERR "kmem_cache_destroy: Can't free all objects %p\n",
 		       cachep);
 		down(&cache_chain_sem);
diff --exclude-from=exclude -Nur linux.orig/mm/vmscan.c linux/mm/vmscan.c
--- linux.orig/mm/vmscan.c	Sun Sep 24 18:14:04 2000
+++ linux/mm/vmscan.c	Sun Sep 24 23:09:01 2000
@@ -904,14 +904,16 @@
 	}
 
 	/* Try to get rid of some shared memory pages.. */
-	if (gfp_mask & __GFP_IO) {
-		/*
-		 * don't be too light against the d/i cache since
-		 * shrink_mmap() almost never fail when there's
-		 * really plenty of memory free.
-		 */
-		count -= shrink_dcache_memory(priority, gfp_mask);
-		count -= shrink_icache_memory(priority, gfp_mask);
+
+	/*
+	 * don't be too light against the d/i cache since
+	 * shrink_mmap() almost never fail when there's
+	 * really plenty of memory free.
+	 */
+	count -= shrink_dcache_memory(priority, gfp_mask);
+	count -= shrink_icache_memory(priority, gfp_mask);
+
+	if(gfp_mask & __GFP_IO) {
 		/*
 		 * Not currently working, see fixme in shrink_?cache_memory
 		 * In the inner funtions there is a comment:
@@ -992,10 +994,8 @@
 	 * the inode and dentry cache whenever we do this.
 	 */
 	if (free_shortage() || inactive_shortage()) {
-		if (gfp_mask & __GFP_IO) {
-			ret += shrink_dcache_memory(6, gfp_mask);
-			ret += shrink_icache_memory(6, gfp_mask);
-		}
+		ret += shrink_dcache_memory(6, gfp_mask);
+		ret += shrink_icache_memory(6, gfp_mask);
 		ret += refill_inactive(gfp_mask, user);
 	} else {
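Two ideas from the patch above can be checked in isolation: the page count returned by the shrink is `(1 << gfporder) * freed`, and the prune step is gated on the I/O flag while the shrink itself always runs. The sketch below is a hedged userspace model of just that control flow; `TOY_GFP_IO`, `struct toy_dcache` and both functions are hypothetical, and the flag value is arbitrary rather than the kernel's.

```c
#define TOY_GFP_IO 0x1u		/* hypothetical flag value, not the kernel's */

/* Hypothetical model of a cache that already has some fully free slabs. */
struct toy_dcache {
	unsigned free_slabs;	/* slabs with no live objects */
	unsigned gfporder;	/* each slab spans (1 << gfporder) pages */
	unsigned prune_calls;	/* how often the prune step was allowed to run */
};

/* Pages given back when `slabs` slabs of the given order are destroyed. */
static unsigned toy_pages_freed(unsigned gfporder, unsigned slabs)
{
	return (1u << gfporder) * slabs;
}

/*
 * Mirrors the shape of the patched shrink functions: prune only when the
 * caller may do I/O, but always release free slabs and report the pages.
 */
static unsigned toy_shrink_dcache(struct toy_dcache *d, unsigned gfp_mask)
{
	unsigned pages;

	if (gfp_mask & TOY_GFP_IO)
		d->prune_calls++;	/* stands in for the prune step */

	pages = toy_pages_freed(d->gfporder, d->free_slabs);
	d->free_slabs = 0;
	return pages;
}
```

For example, three free order-1 slabs yield six pages whether or not the I/O flag is set, which is why the patch lets callers without `__GFP_IO` still make progress through the shrink functions.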
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-25 15:36 UTC
To: Marcelo Tosatti
Cc: Linus Torvalds, Ingo Molnar, Rik van Riel, Roger Larsson,
    MM mailing list, linux-kernel

On Sun, Sep 24, 2000 at 11:39:13PM -0300, Marcelo Tosatti wrote:
> - Change kmem_cache_shrink to return the number of freed pages.

I did that too, extending a patch from Mark. I also removed the
first_not_full ugliness, providing a LIFO behaviour to the completely
freed slabs (so kmem_cache_reap removes the oldest completely unused slabs
from the queue, not the most recently used ones with potentially live
cache in the CPU).

> There was a comment on the shrink functions about making
> kmem_cache_shrink() work on a GFP_DMA/GFP_HIGHMEM basis to free only the
> wanted pages by the current allocation.

This is meaningless at the moment because it can't be addressed without
classzone logic in the allocator (classzone means that the allocator will
pass to the memory balancing code the information about _which_ classzone
you have to allocate memory from, so you won't waste time to synchronously
balance unrelated zones).

My patch is here (it isn't going to apply cleanly due to the test9 changes
in do_try_to_free_pages but porting is trivial). It was tested and it was
working for me.

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test7/slab-1

BTW, here there's a fix for a longstanding SMP race (since swap_out and
msync don't run with the big lock) that can corrupt memory:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test5/msync-smp-race-1

Here the fix for another SMP race in establish_pte:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test5/tlb-flush-smp-race-1

The fix for this last bit is ugly but it's safe because Manfred said s390
has a flush_tlb_page that atomically flushes and makes the pte invalid
(a cleaner fix means moving part of establish_pte into the arch inlines).

Andrea
* the new VM
From: Ingo Molnar @ 2000-09-25 10:42 UTC
To: Marcelo Tosatti
Cc: Linus Torvalds, Andrea Arcangeli, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

i'd also like to share my experiences with recent kernels, compared to the 'old VM'. I frequently run high VM load multi-gigabyte systems with a lot of IRQ-side allocations as well, and it's surprising how sensitive these systems' performance is to VM balance, despite gobs of RAM.

- The biggest difference under high allocation load is that the CPU usage of kswapd and the synchronous VM balancing code has decreased significantly. Under previous kernels it was not uncommon to see sudden spikes in kswapd usage, and to see significant CPU time spent in shrink_mmap() & friends. I suspect that this is because the new VM does much less 'guessing' and blind list-walking.

- i'm also happy that __alloc_pages() now 'guarantees' allocation. This i believe could simplify unrelated kernel code significantly. Eg. no need to check for NULL pointers on most allocations, a GFP_KERNEL allocation always succeeds, end of story. This behavior also has the 'nice' side-effect of showing memory imbalance rather forcefully: the system locks up ;-) A GFP_ATOMIC allocation obviously still has the potential to fail, and must be handled properly.

all in all, the new VM balancing code looks really promising, despite all the growing pains.

Ingo
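[Editorial sketch] The trade-off described above, where a GFP_KERNEL-style request keeps retrying until reclaim succeeds while a GFP_ATOMIC-style request may fail at once, can be modelled in a few lines of userspace C. This is only a toy under stated assumptions: the flag value, the page counters and every toy_* name are hypothetical, and the bounded max_tries stands in for what would be an unbounded retry loop, so the "locks up" case is reported instead of actually spinning.

```c
#include <assert.h>

/* Hypothetical flag: the caller may sleep and wait for reclaim. */
#define TOY_GFP_WAIT 0x01u

enum toy_result { TOY_OK, TOY_FAILED, TOY_STUCK };

static int toy_free_count = 0;      /* pages free right now */
static int toy_reclaimable = 1;     /* pages reclaim could still free */

/* Try to turn one reclaimable page into a free page. */
static void toy_reclaim(void)
{
    if (toy_reclaimable > 0) {
        toy_reclaimable--;
        toy_free_count++;
    }
}

/*
 * Without TOY_GFP_WAIT the request fails immediately when nothing is free
 * (the GFP_ATOMIC-like case). With it, the allocator retries after reclaim.
 * A real "guaranteed" allocator would retry forever; here we give up after
 * max_tries and report TOY_STUCK to show where the system would hang.
 */
static enum toy_result toy_alloc_page(unsigned int gfp_mask, int max_tries)
{
    int tries;

    assert(max_tries >= 0);
    if (toy_free_count > 0) {
        toy_free_count--;
        return TOY_OK;
    }
    if (!(gfp_mask & TOY_GFP_WAIT))
        return TOY_FAILED;
    for (tries = 0; tries < max_tries; tries++) {
        toy_reclaim();
        if (toy_free_count > 0) {
            toy_free_count--;
            return TOY_OK;
        }
    }
    return TOY_STUCK;
}
```

With one reclaimable page and nothing free, a non-waiting request fails, the first waiting request succeeds after reclaim, and the second waiting request has nothing left to reclaim and ends up in the stuck state.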
* Re: the new VM
From: Andrea Arcangeli @ 2000-09-25 13:02 UTC
To: Ingo Molnar
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 12:42:09PM +0200, Ingo Molnar wrote:
> believe could simplify unrelated kernel code significantly. Eg. no need to
> check for NULL pointers on most allocations, a GFP_KERNEL allocation
> always succeeds, end of story. This behavior also has the 'nice'

Sorry, I totally disagree. If GFP_KERNEL is guaranteed to succeed, that is a showstopper bug. We also have another showstopper bug in getblk that will be hard to fix because people were used to relying on it and they wrote deadlock-prone code.

You should know that people not running benchmarks and using the machine power for simulations run out of memory all the time. If you put this kind of obvious deadlock into the main kernel allocator you'll screw up the hard work to fix all the other deadlock problems during OOM that has been done so far.

Please fix raid1 instead of making things worse.

Andrea
* Re: the new VM
From: Ingo Molnar @ 2000-09-25 13:02 UTC
To: Andrea Arcangeli
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> Sorry, I totally disagree. If GFP_KERNEL is guaranteed to succeed,
> that is a showstopper bug. [...]

why?

> machine power for simulations run out of memory all the time. If you
> put this kind of obvious deadlock into the main kernel allocator

FYI, i haven't put it there.

Ingo
* Re: the new VM
From: Andrea Arcangeli @ 2000-09-25 13:08 UTC
To: Ingo Molnar
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:02:58PM +0200, Ingo Molnar wrote:
>
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
>
> > Sorry, I totally disagree. If GFP_KERNEL is guaranteed to succeed,
> > that is a showstopper bug. [...]
>
> why?

Because, as you said, the machine can lock up when you run out of memory.

> FYI, i haven't put it there.

Ok.

Andrea
* Re: the new VM
From: Ingo Molnar @ 2000-09-25 13:12 UTC
To: Andrea Arcangeli
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > > Sorry, I totally disagree. If GFP_KERNEL is guaranteed to succeed,
> > > that is a showstopper bug. [...]
> >
> > why?
>
> Because, as you said, the machine can lock up when you run out of memory.

well, i think all kernel-space allocations have to be limited carefully; denying succeeding allocations is not a solution against over-allocation, especially in a multi-user environment.

Ingo
* Re: the new VM
From: Andrea Arcangeli @ 2000-09-25 13:30 UTC
To: Ingo Molnar
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:12:58PM +0200, Ingo Molnar wrote:
> well, i think all kernel-space allocations have to be limited carefully,

When a machine without gigabit ethernet runs oom, it's userspace that allocated the memory via page faults, not the kernel.

And if the careful limit avoids the deadlock in the layer above alloc_pages, then it will also keep alloc_pages from returning NULL and you won't need an infinite loop in the first place (unless the memory balancing is buggy).

GFP should return NULL only if the machine is out of memory. The kernel can be written in a way that never deadlocks when the machine is out of memory, just by checking the GFP retval. I don't think any in-kernel resource limit is necessary to have things reliable and fast.

Most dynamic big caches and kernel data can be shrunk dynamically during memory pressure (perhaps except skbs, and I agree that for skbs on gigabit ethernet the thing is a little different).

Andrea
* Re: the new VM
From: Ingo Molnar @ 2000-09-25 13:39 UTC
To: Andrea Arcangeli
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> And if the careful limit avoids the deadlock in the layer above
> alloc_pages, then it will also keep alloc_pages from returning NULL and
> you won't need an infinite loop in the first place (unless the memory
> balancing is buggy).

yes, i like this property very much because it unearths VM balancing bugs, which plagued us for so long and are so hard to detect.

But statistically it's also possible that try_to_free_pages() frees a page and alloc_pages() done on another CPU (or in IRQ context) 'steals' the page. This can happen, because the VM right now guarantees no straight path from deallocator to allocator. (and it's not necessary to guarantee it, given the varying nature of allocation requests.)

> GFP should return NULL only if the machine is out of memory. The
> kernel can be written in a way that never deadlocks when the machine
> is out of memory, just by checking the GFP retval. I don't think any
> in-kernel resource limit is necessary to have things reliable and
> fast. [...]

Andrea, if you really mean this then you should not be let near the VM balancing code :-)

> Most dynamic big caches and kernel data can be shrunk dynamically
> during memory pressure (perhaps except skbs, and I agree that for skbs
> on gigabit ethernet the thing is a little different).

a big 'except'. You don't need gigabit for that; to the contrary, if the network is slow it's easier to overallocate within the kernel. Ask Alan about how many D.O.S. attacks are possible without implicit or explicit bean counting.

Ingo
* Re: the new VM
From: Andrea Arcangeli @ 2000-09-25 14:04 UTC
To: Ingo Molnar
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:39:51PM +0200, Ingo Molnar wrote:
> Andrea, if you really mean this then you should not be let near the VM
> balancing code :-)

What I mean is that the VM balancing is in the lower layer, which knows nothing about the per-socket gigabit ethernet skb limits; the limit should live at the higher layer. For most code just checking for NULL in GFP is fine (for example do_anonymous_page).

It's the caller (not the VM balancing developer) that shouldn't be let near his code if he allows his code to fill all the physical RAM with his stuff, causing the machine to run OOM.

> > Most dynamic big caches and kernel data can be shrunk dynamically
> > during memory pressure (perhaps except skbs, and I agree that for skbs
> > on gigabit ethernet the thing is a little different).
>
> a big 'except'. You don't need gigabit for that; to the contrary, if the

I talked with Alexey about this and it seems the best way is to have a per-socket reservation of clean cache as a function of the receive window. So we don't need a huge atomic pool, but we can have a special lru with an irq spinlock that is able to shrink cache from irq as well.

> about how many D.O.S. attacks are possible without implicit or
> explicit bean counting.

Again: the bean counting and all the limits happen at the higher layer. I shouldn't know anything about it when I play with the lower layer GFP memory balancing code.

Andrea
* Re: the new VM
From: Ingo Molnar @ 2000-09-25 14:04 UTC
To: Andrea Arcangeli
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> Again: the bean counting and all the limits happen at the higher
> layer. I shouldn't know anything about it when I play with the lower
> layer GFP memory balancing code.

exactly, and this is why if a higher level lets through a GFP_KERNEL, then it *must* succeed. Otherwise either the higher level code is buggy, or the VM balance is buggy, but we want to have clear signs of it.

Ingo
* Re: the new VM
From: Andrea Arcangeli @ 2000-09-25 14:23 UTC
To: Ingo Molnar
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:04:14PM +0200, Ingo Molnar wrote:
> exactly, and this is why if a higher level lets through a GFP_KERNEL, then
> it *must* succeed. Otherwise either the higher level code is buggy, or the
> VM balance is buggy, but we want to have clear signs of it.

I'm not sure if we should restrict the limiting only to the cases that need it. For example, do_anonymous_page looks like a place that could rely on the GFP retval.

Andrea
* Re: the new VM
From: Ingo Molnar @ 2000-09-25 14:27 UTC
To: Andrea Arcangeli
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> I'm not sure if we should restrict the limiting only to the cases that
> need it. For example, do_anonymous_page looks like a place that could
> rely on the GFP retval.

i think an application should not fail due to other applications allocating too much RAM. OOM behavior should be a central thing and based on allocation patterns, not pure luck or unluck. I always found it rude to SIGBUS when some other application is abusing RAM but the oom detector has not yet killed it off.

Ingo
* Re: the new VM
From: Andrea Arcangeli @ 2000-09-25 14:39 UTC
To: Ingo Molnar
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:27:24PM +0200, Ingo Molnar wrote:
> i think an application should not fail due to other applications
> allocating too much RAM. OOM behavior should be a central thing and based

At least Linus's point is that doing perfect accounting (at least on the userspace allocation side) may cause you to waste resources, failing even if you could still run, and I tend to agree with him. We're lazy on that side and that's a global win in most cases.

We are fine-grained with page granularity, not with the mmap granularity. The point is that not all the mmapped regions are going to be paged in. Think of a program that only after 1 hour has done all the calculations that actually allocate all the memory it requested with malloc. Before the hour passes the unused memory can still be used for other things, and that's what the user also expects when he runs `free`.

Andrea
* Re: the new VM
From: Ingo Molnar @ 2000-09-25 14:43 UTC
To: Andrea Arcangeli
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> At least Linus's point is that doing perfect accounting (at least on
> the userspace allocation side) may cause you to waste resources,
> failing even if you could still run, and I tend to agree with him.
> We're lazy on that side and that's a global win in most cases.

well, as i said, i agree that being lazy on the user-space side (which is by far the biggest RAM allocator in a typical system) makes sense - and we can handle it cleanly. Being lazy on the kernel-space side is the default behavior for us kernel hackers :-) but i don't think it's the right thing in the long term.

> We are fine-grained with page granularity, not with the mmap
> granularity. The point is that not all the mmapped regions are going
> to be paged in. Think of a program that only after 1 hour has done all
> the calculations that actually allocate all the memory it requested
> with malloc. Before the hour passes the unused memory can still be used
> for other things, and that's what the user also expects when he runs
> `free`.

i think you've completely missed the fact that i made exactly this point in my previous mail.

	'user-space laziness': correct
	'kernel-space laziness': dangerous

i talked about GFP_KERNEL, not GFP_USER. Even in the case of GFP_USER i believe the right place to oom is via a signal, not in the gfp() case. (because an oom situation in the gfp() case is a completely random and statistical event, which might have no connection at all to the behavior of that given process.)

Ingo
* Re: the new VM
From: Andrea Arcangeli @ 2000-09-25 15:01 UTC
To: Ingo Molnar
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:43:44PM +0200, Ingo Molnar wrote:
> i talked about GFP_KERNEL, not GFP_USER. Even in the case of GFP_USER i

My bad, you're right, I was talking about GFP_USER indeed.

But even for GFP_KERNEL allocations, like the init of a module or any other thing that is statically sized during production, just checking the retval looks to be ok.

> believe the right place to oom is via a signal, not in the gfp() case.

A signal can be trapped and ignored by a malicious task. We had that security problem until 2.2.14 IIRC.

> (because an oom situation in the gfp() case is a completely random and
> statistical event, which might have no connection at all to the behavior
> of that given process.)

I agree we should have more information about the behaviour of the system, and I think a per-task page fault rate should work in practice.

But my question isn't what you do when you're OOM, but _how_ do you notice that you're OOM?

In the GFP_USER case simply checking when GFP fails looks right to me.

Andrea
* Re: the new VM
From: Ingo Molnar @ 2000-09-25 15:10 UTC
To: Andrea Arcangeli
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> A signal can be trapped and ignored by a malicious task. [...]

a SIGKILL? i agree with the 2.2 solution - first a soft signal, and if it's being ignored then a SIGKILL.

> But my question isn't what you do when you're OOM, but _how_ do you
> notice that you're OOM?

good question :-)

> In the GFP_USER case simply checking when GFP fails looks right to me.

i think the GFP_USER case should do the oom logic within __alloc_pages(), by SIGTERM/SIGKILL-ing off abusive processes. Ie. it's *still* an infinite loop (barring the case where *this* process is abusive, but that's a detail).

Ingo
* Re: the new VM
From: Andrea Arcangeli @ 2000-09-25 15:24 UTC
To: Ingo Molnar
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 05:10:43PM +0200, Ingo Molnar wrote:
> a SIGKILL? i agree with the 2.2 solution - first a soft signal, and if
> it's being ignored then a SIGKILL.

Actually we try the soft signal (SIGTERM) only if the task was running with iopl privileges (and that means on alpha and other archs where there isn't iopl we send a SIGKILL to X immediately).

Extending it to all tasks looked a bit risky, because then we would have even less probability of killing the right task, since all tasks would run oom while the first is put to sleep for a while. With X we really prefer to kill another task rather than screw up the console (even when X is the real hog, and X can be made the real hog by any task that allocates huge xshm). Kray reproduces this easily.

> > But my question isn't what you do when you're OOM, but _how_ do you
> > notice that you're OOM?
>
> good question :-)

:))

> i think the GFP_USER case should do the oom logic within __alloc_pages(),

What's the difference from implementing the logic outside alloc_pages? Putting the logic inside doesn't look like clean design to me.

Andrea
* Re: the new VM
From: Ingo Molnar @ 2000-09-25 15:26 UTC
To: Andrea Arcangeli
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > i think the GFP_USER case should do the oom logic within __alloc_pages(),
>
> What's the difference from implementing the logic outside alloc_pages?
> Putting the logic inside doesn't look like clean design to me.

it gives consistency and simplicity. The allocators themselves do not have to care about oom.

Ingo
* Re: the new VM
From: yodaiken @ 2000-09-25 15:22 UTC
To: Ingo Molnar
Cc: Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 05:26:59PM +0200, Ingo Molnar wrote:
>
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
>
> > > i think the GFP_USER case should do the oom logic within __alloc_pages(),
> >
> > What's the difference of implementing the logic outside alloc_pages?
> > Putting the logic inside looks not clean design to me.
>
> it gives consistency and simplicity. The allocators themselves do not have
> to care about oom.

There are many cases where it is simple to do:

	if (alloc(r1) == fail) goto freeall;
	if (alloc(r2) == fail) goto freeall;
	if (alloc(r3) == fail) goto freeall;

And the alloc functions don't know how to "freeall".

Perhaps it would be good to do an alloc_vec allocation in these cases.

	alloc_vec[0].size = n;
	..
	alloc_vec[n].size = 0;
	if (kmalloc_all(alloc_vec) == FAIL)
		return -ENOMEM;
	else
		alloc_vec[i].ptr is the pointer.

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
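[Editorial sketch] The alloc_vec proposal above, allocating every entry of a vector or none of them, can be written out in userspace C. This is a sketch under stated assumptions, not the kernel interface being proposed: malloc stands in for kmalloc, and the names alloc_req, alloc_all, free_all, toy_malloc and toy_budget are all hypothetical. The toy_budget counter exists only so the failure path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* One request in the vector; an entry with size 0 terminates the vector. */
struct alloc_req {
    size_t size;    /* bytes requested */
    void  *ptr;     /* filled in on success, NULL otherwise */
};

/* Fault injection for the sketch: -1 means unlimited, N >= 0 allows N more. */
static long toy_budget = -1;

static void *toy_malloc(size_t size)
{
    if (toy_budget == 0)
        return NULL;            /* simulate running out of memory */
    if (toy_budget > 0)
        toy_budget--;
    return malloc(size);
}

/* Release every entry that currently holds a pointer. */
static void free_all(struct alloc_req *vec)
{
    size_t i;

    for (i = 0; vec[i].size != 0; i++) {
        free(vec[i].ptr);       /* free(NULL) is a no-op */
        vec[i].ptr = NULL;
    }
}

/* Allocate all entries or none. Returns 0 on success, -ENOMEM on failure. */
static int alloc_all(struct alloc_req *vec)
{
    size_t i, j;

    assert(vec != NULL);
    for (i = 0; vec[i].size != 0; i++) {
        vec[i].ptr = toy_malloc(vec[i].size);
        if (vec[i].ptr == NULL) {
            /* Roll back what was already allocated: the "freeall" step. */
            for (j = 0; j < i; j++) {
                free(vec[j].ptr);
                vec[j].ptr = NULL;
            }
            return -ENOMEM;
        }
    }
    return 0;
}
```

The caller then has a single failure check instead of one goto per allocation, and the rollback logic lives in one place.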
* Re: the new VM
From: Pavel Machek @ 2000-09-26 19:10 UTC
To: Andrea Arcangeli, Ingo Molnar
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Hi!

> > i talked about GFP_KERNEL, not GFP_USER. Even in the case of GFP_USER i
>
> My bad, you're right I was talking about GFP_USER indeed.
>
> But even GFP_KERNEL allocations like the init of a module or any other thing
> that is static sized during production just checking the retval
> looks be ok.

Okay, I'm a user on a small machine and I'm doing a stupid thing: I've got 6MB of RAM, and I keep inserting modules. I insert module_1mb.o. Then I insert module_1mb.o. Repeat. How does it end? I think that kmalloc(GFP_KERNEL) *has* to return NULL at some point.

Killing apps is not a solution: if my insmoder is smaller than the module I'm trying to insert, and it happens to be the only process, you just will not be able to kmalloc(sizeof(module), GFP_KERNEL). Will you panic at the end?

								Pavel
--
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org
* Re: the new VM
From: Andrea Arcangeli @ 2000-09-26 20:16 UTC
To: Pavel Machek
Cc: Ingo Molnar, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Tue, Sep 26, 2000 at 09:10:16PM +0200, Pavel Machek wrote:
> Hi!
> > > i talked about GFP_KERNEL, not GFP_USER. Even in the case of GFP_USER i
> >
> > My bad, you're right I was talking about GFP_USER indeed.
> >
> > But even GFP_KERNEL allocations like the init of a module or any other thing
> > that is static sized during production just checking the retval
> > looks be ok.
>
> Okay, I'm a user on a small machine and I'm doing a stupid thing: I've got
> 6MB of RAM, and I keep inserting modules. I insert module_1mb.o. Then I
> insert module_1mb.o. Repeat. How does it end? I think that
> kmalloc(GFP_KERNEL) *has* to return NULL at some point.

I agree, and that's what I have said from the start. GFP_KERNEL must return NULL when the system is truly out of memory or the kernel will deadlock at that time.

In the sentence you quoted I meant that both GFP_USER and most GFP_KERNEL allocations could keep just checking the retval even in the long term and be correct (checking for NULL, which in turn means GFP_KERNEL _will_ return NULL eventually).

There's no need for special resource accounting for many statically sized data structures in the kernel (this accounting is necessary only for some of the dynamic things that grow and shrink during production and that can't be reclaimed synchronously when memory goes low by blocking in the allocator, like pagetables, skbs on gbit ethernet and other things). Not sure if in the end we'll need to account the static parts as well to get the dynamic part right.

Andrea
* Re: the new VM
From: Ingo Molnar @ 2000-09-27 7:42 UTC
To: Pavel Machek
Cc: Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Tue, 26 Sep 2000, Pavel Machek wrote:

> Okay, I'm a user on a small machine and I'm doing a stupid thing: I've got
> 6MB of RAM, and I keep inserting modules. I insert module_1mb.o. Then I
> insert module_1mb.o. Repeat. How does it end? I think that
> kmalloc(GFP_KERNEL) *has* to return NULL at some point.

if a stupid root user keeps inserting bogus modules :-) then that's a problem, no matter what. I can DoS your system if given the right to insert arbitrary size modules, even if kmalloc returns NULL. For such things explicit high-level protection is needed - completely independently of the VM allocation issues. Returning NULL in kmalloc() is just a way to say: 'oops, we screwed up somewhere'. And i'd suggest not working around such screwups by checking for NULL and trying to handle it. I suggest rather fixing those screwups.

the __GFP_SOFT suggestion handles these things nicely.

Ingo
* Re: the new VM
From: yodaiken @ 2000-09-27 12:11 UTC
To: Ingo Molnar
Cc: Pavel Machek, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Wed, Sep 27, 2000 at 09:42:45AM +0200, Ingo Molnar wrote:
>
> On Tue, 26 Sep 2000, Pavel Machek wrote:
> of the VM allocation issues. Returning NULL in kmalloc() is just a way to
> say: 'oops, we screwed up somewhere'. And i'd suggest to not work around

That is not at all how it is currently used in the kernel.

> such screwups by checking for NULL and trying to handle it. I suggest to
> rather fix those screwups.

Kmalloc returns NULL when there is not enough memory to satisfy the request. What's wrong with that?

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
* Re: the new VM 2000-09-27 7:42 ` Ingo Molnar 2000-09-27 12:11 ` yodaiken @ 2000-09-27 14:08 ` Andrea Arcangeli 1 sibling, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-27 14:08 UTC (permalink / raw) To: Ingo Molnar Cc: Pavel Machek, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Wed, Sep 27, 2000 at 09:42:45AM +0200, Ingo Molnar wrote:
> such screwups by checking for NULL and trying to handle it. I suggest to
> rather fix those screwups.

How do you know which is the minimal amount of RAM that allows you not to be in the screwedup state? We for sure need a kind of counter for the special dynamic structures but I'm not sure if that should account the static stuff as well.

Andrea
* Re: the new VM 2000-09-25 14:39 ` Andrea Arcangeli 2000-09-25 14:43 ` Ingo Molnar @ 2000-09-25 16:09 ` Rik van Riel 1 sibling, 0 replies; 243+ messages in thread From: Rik van Riel @ 2000-09-25 16:09 UTC (permalink / raw) To: Andrea Arcangeli Cc: Ingo Molnar, Marcelo Tosatti, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 04:27:24PM +0200, Ingo Molnar wrote:
> > i think an application should not fail due to other applications
> > allocating too much RAM. OOM behavior should be a central thing and based
>
> At least Linus's point is that doing perfect accounting (at
> least on the userspace allocation side) may cause you to waste
> resources, failing even if you could still run and I tend to
> agree with him. We're lazy on that side and that's global win in
> most cases.

OK, so do you guys want my OOM-killer selection code in 2.4? ;)

(that will fix the OOM case in the rare situations where it occurs and do the expected thing most of the time)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/
* Re: the new VM 2000-09-25 14:04 ` Andrea Arcangeli 2000-09-25 14:04 ` Ingo Molnar @ 2000-09-25 14:26 ` Marcelo Tosatti 2000-09-25 14:50 ` Andrea Arcangeli 1 sibling, 1 reply; 243+ messages in thread From: Marcelo Tosatti @ 2000-09-25 14:26 UTC (permalink / raw) To: Andrea Arcangeli Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

<snip>

> I talked with Alexey about this and it seems the best way is to have a
> per-socket reservation of clean cache in function of the receive window. So we
> don't need an huge atomic pool but we can have a special lru with an irq
> spinlock that is able to shrink cache from irq as well.

In the current 2.4 VM code, there is a kernel thread called "kreclaimd".

This thread keeps freeing pages from the inactive clean list when needed (when zone->free_pages < zone->pages_low), making them available for atomic allocations.

Do you consider pages_low pages as a "huge atomic pool" ?
* Re: the new VM 2000-09-25 14:26 ` Marcelo Tosatti @ 2000-09-25 14:50 ` Andrea Arcangeli 0 siblings, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 14:50 UTC (permalink / raw) To: Marcelo Tosatti Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 11:26:48AM -0300, Marcelo Tosatti wrote:
> This thread keeps freeing pages from the inactive clean list when needed
> (when zone->free_pages < zone->pages_low), making them available for
> atomic allocations.

This is flawed. It's the irq that have to shrink the memory itself. It can't certainly reschedule kreclaimd and wait it to do the work.

Increasing the free_pages_min limit is the _only_ alternative to having irqs that are able to shrink clean cache (and hopefully that "feature" will be resurrected soon since it's the only way to go right now).

Andrea
* Re: the new VM 2000-09-25 13:12 ` Ingo Molnar 2000-09-25 13:30 ` Andrea Arcangeli @ 2000-09-25 14:47 ` Alan Cox 2000-09-25 15:16 ` Ingo Molnar 2000-09-25 15:40 ` Stephen C. Tweedie 1 sibling, 2 replies; 243+ messages in thread From: Alan Cox @ 2000-09-25 14:47 UTC (permalink / raw) To: mingo Cc: Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

> > Because as you said the machine can lockup when you run out of memory.
>
> well, i think all kernel-space allocations have to be limited carefully,
> denying succeeding allocations is not a solution against over-allocation,
> especially in a multi-user environment.

GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get everything jammed in kernel space waiting on GFP_KERNEL and if the swapper cannot make space you die.

The alternative approach where it cannot fail has to be at higher levels so you can release other resources that might need freeing for deadlock avoidance before you retry

Alan
* Re: the new VM 2000-09-25 14:47 ` Alan Cox @ 2000-09-25 15:16 ` Ingo Molnar 2000-09-25 15:16 ` the new VMt Alan Cox 2000-09-25 15:48 ` the new VM Andrea Arcangeli 2000-09-25 15:40 ` Stephen C. Tweedie 1 sibling, 2 replies; 243+ messages in thread From: Ingo Molnar @ 2000-09-25 15:16 UTC (permalink / raw) To: Alan Cox Cc: Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Alan Cox wrote:

> GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get
> everything jammed in kernel space waiting on GFP_KERNEL and if the
> swapper cannot make space you die.

if one can get everything jammed waiting for GFP_KERNEL, and not being able to deallocate anything, thats a VM or resource-limit bug. This situation is just 1% RAM away from the 'root cannot log in', situation.

	Ingo
* Re: the new VMt 2000-09-25 15:16 ` Ingo Molnar @ 2000-09-25 15:16 ` Alan Cox 2000-09-25 15:33 ` the new VM Ingo Molnar ` (3 more replies) 2000-09-25 15:48 ` the new VM Andrea Arcangeli 1 sibling, 4 replies; 243+ messages in thread From: Alan Cox @ 2000-09-25 15:16 UTC (permalink / raw) To: mingo Cc: Alan Cox, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

> > GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get
> > everything jammed in kernel space waiting on GFP_KERNEL and if the
> > swapper cannot make space you die.
>
> if one can get everything jammed waiting for GFP_KERNEL, and not being
> able to deallocate anything, thats a VM or resource-limit bug. This
> situation is just 1% RAM away from the 'root cannot log in', situation.

Unless Im missing something here think about this case

2 active processes, no swap

	#1			#2
	kmalloc 32K		kmalloc 16K
	OK			OK
	kmalloc 16K		kmalloc 32K
	block			block

so GFP_KERNEL has to be able to fail - it can wait for I/O in some cases with care, but when we have no pages left something has to give
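The two-process case above can be replayed in a few lines of user-space C. This is only a toy sketch, not kernel code: the `toy_*` names are hypothetical, and the pool size passed in (48K in the usage note) is an assumption, since the message does not say how much memory is free. It shows that with a never-fail policy both second requests are unsatisfiable, while letting one request fail allows the other caller to finish.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code: a pool of free memory counted in KB. */
struct toy_pool {
    size_t free_kb;
};

/* Take kb from the pool; report failure instead of blocking. */
static bool toy_try_alloc(struct toy_pool *p, size_t kb)
{
    if (kb > p->free_kb)
        return false;
    p->free_kb -= kb;
    return true;
}

/* Return kb to the pool. */
static void toy_free(struct toy_pool *p, size_t kb)
{
    p->free_kb += kb;
}

/*
 * Never-fail policy: #1 holds 32K and wants 16K more, #2 holds 16K and
 * wants 32K more. Returns true when neither second request fits, so both
 * callers would sleep forever (there is no swap to reclaim from).
 */
static bool toy_both_block(size_t total_kb)
{
    struct toy_pool p = { .free_kb = total_kb };

    if (!toy_try_alloc(&p, 32) || !toy_try_alloc(&p, 16))
        return false;               /* first round did not fit at all */
    return p.free_kb < 16 && p.free_kb < 32;
}

/*
 * Can-fail policy: #2 sees its 32K request fail, backs out and frees its
 * 16K. Returns true when #1 can then complete its 16K request.
 */
static bool toy_failure_unblocks(size_t total_kb)
{
    struct toy_pool p = { .free_kb = total_kb };

    if (!toy_try_alloc(&p, 32) || !toy_try_alloc(&p, 16))
        return false;
    if (toy_try_alloc(&p, 32))
        return true;                /* enough memory, nobody blocks */
    toy_free(&p, 16);               /* #2 handles the failure */
    return toy_try_alloc(&p, 16);   /* #1 makes progress */
}
```

With an assumed 48K pool, `toy_both_block(48)` reports the mutual block and `toy_failure_unblocks(48)` reports that a failing allocation breaks it; with a larger pool such as 96K nobody blocks in the first place.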
* Re: the new VM 2000-09-25 15:16 ` the new VMt Alan Cox @ 2000-09-25 15:33 ` Ingo Molnar 2000-09-25 15:41 ` the new VMt Andrea Arcangeli ` (2 subsequent siblings) 3 siblings, 0 replies; 243+ messages in thread From: Ingo Molnar @ 2000-09-25 15:33 UTC (permalink / raw) To: Alan Cox Cc: Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Alan Cox wrote:

> Unless Im missing something here think about this case
>
> 2 active processes, no swap
>
>	#1			#2
>	kmalloc 32K		kmalloc 16K
>	OK			OK
>	kmalloc 16K		kmalloc 32K
>	block			block
>
> so GFP_KERNEL has to be able to fail - it can wait for I/O in some
> cases with care, but when we have no pages left something has to give

you are right, i agree that synchronous OOM for higher-order allocations must be preserved (just like ATOMIC allocations). But the overwhelming majority of allocations is done at page granularity.

with multi-page allocations and the need for physically contiguous buffers, the problem cannot be solved.

	Ingo
* Re: the new VMt 2000-09-25 15:16 ` the new VMt Alan Cox 2000-09-25 15:33 ` the new VM Ingo Molnar @ 2000-09-25 15:41 ` Andrea Arcangeli 2000-09-25 16:02 ` Ingo Molnar 2000-09-25 15:42 ` Stephen C. Tweedie 2000-09-25 16:16 ` Rik van Riel 3 siblings, 1 reply; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 15:41 UTC (permalink / raw) To: Alan Cox Cc: mingo, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:16:56PM +0100, Alan Cox wrote:
> Unless Im missing something here think about this case
>
> 2 active processes, no swap
>
>	#1			#2
>	kmalloc 32K		kmalloc 16K
>	OK			OK
>	kmalloc 16K		kmalloc 32K
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>	block			block

Yep, you're not missing anything. That was my complaint about the fact GFP_KERNEL not failing will obviously deadlock the kernel all over the place.

Ingo's point is that the underlined line won't ever happen in the first place because of the resource accounting that will tell the upper layer that they can't try to allocate anything, so they won't enter kmalloc at all. But he's obviously not talking about 2.4.x. (and I'm not sure if that's the right way to go in the general case but certainly it's the right way to go for special cases like skbs with gigabit ethernet)

In 2.4.x GFP_KERNEL not failing is a deadlock as you said.

Andrea
* Re: the new VMt 2000-09-25 15:41 ` the new VMt Andrea Arcangeli @ 2000-09-25 16:02 ` Ingo Molnar 2000-09-25 16:04 ` Andi Kleen ` (2 more replies) 0 siblings, 3 replies; 243+ messages in thread From: Ingo Molnar @ 2000-09-25 16:02 UTC (permalink / raw) To: Andrea Arcangeli Cc: Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> Ingo's point is that the underlined line won't ever happen in the
> first place

please dont misinterpret my point ...

Frankly, how often do we allocate multi-order pages? I've just made quick statistics wrt. how allocation orders are distributed on a more or less typical system:

	(ALLOC ORDER)
	0: 167081
	1: 850
	2: 16
	3: 25
	4: 0
	5: 1
	6: 0
	7: 2
	8: 13
	9: 5

ie. 99.45% of all allocations are single-page! 0.50% is the 8kb task-structure. The rest is 0.05%.

i'm not talking about 4MB contiguous physical allocations having to succeed on a 8MB box. I'm talking about 99% of the simple allocation points not having to worry about a NULL pointer. (not checking for NULL is one of the most common allocation-related bug that beats low-RAM systems.)

	Ingo
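The percentages quoted above can be rechecked from the per-order counts. The following is a small sketch with hypothetical helper names (`order_sum`, `order_share_pct`); the counts themselves are copied from the message.

```c
#include <stddef.h>

/* Allocation counts for orders 0..9, copied from the message above. */
static const unsigned long order_counts[10] = {
    167081, 850, 16, 25, 0, 1, 0, 2, 13, 5
};

/* Sum of the counts for orders lo..hi inclusive (hi is clamped to 9). */
static unsigned long order_sum(size_t lo, size_t hi)
{
    unsigned long sum = 0;

    if (hi > 9)
        hi = 9;
    for (size_t i = lo; i <= hi; i++)
        sum += order_counts[i];
    return sum;
}

/* Percentage of all allocations whose order lies in lo..hi. */
static double order_share_pct(size_t lo, size_t hi)
{
    return 100.0 * (double)order_sum(lo, hi) / (double)order_sum(0, 9);
}
```

The counts add up to 167993 allocations. `order_share_pct(0, 0)` comes out just under 99.46%, `order_share_pct(1, 1)` at roughly 0.51%, and `order_share_pct(2, 9)` at roughly 0.04%, which is close to the 99.45% / 0.50% / 0.05% figures in the message.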
* Re: the new VMt 2000-09-25 16:02 ` Ingo Molnar @ 2000-09-25 16:04 ` Andi Kleen 2000-09-25 16:19 ` Ingo Molnar 2000-09-25 16:11 ` Andrea Arcangeli 2000-09-25 16:53 ` Alan Cox 2 siblings, 1 reply; 243+ messages in thread From: Andi Kleen @ 2000-09-25 16:04 UTC (permalink / raw) To: Ingo Molnar Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 06:02:18PM +0200, Ingo Molnar wrote:
> Frankly, how often do we allocate multi-order pages? I've just made quick
> statistics wrt. how allocation orders are distributed on a more or less
> typical system:
>
>	(ALLOC ORDER)
>	0: 167081
>	1: 850
>	2: 16
>	3: 25
>	4: 0
>	5: 1
>	6: 0
>	7: 2
>	8: 13
>	9: 5
>
> ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
> task-structure. The rest is 0.05%.

An important exception in 2.2/2.4 is NFS with bigger rsize (will be fixed in 2.5, but 2.4 does it this way). For an 8K r/wsize you need reliable (=GFP_ATOMIC) 16K allocations.

Another thing I would worry about are ports with multiple user page sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages but may likely need a 16K kernel stack due to the 64bit stack bloat.

-Andi
* Re: the new VMt 2000-09-25 16:04 ` Andi Kleen @ 2000-09-25 16:19 ` Ingo Molnar 2000-09-25 16:18 ` Andi Kleen 2000-09-25 16:28 ` Rik van Riel 0 siblings, 2 replies; 243+ messages in thread From: Ingo Molnar @ 2000-09-25 16:19 UTC (permalink / raw) To: Andi Kleen Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andi Kleen wrote:

> An important exception in 2.2/2.4 is NFS with bigger rsize (will be fixed
> in 2.5, but 2.4 does it this way). For an 8K r/wsize you need reliable
> (=GFP_ATOMIC) 16K allocations.

the discussion does not affect GFP_ATOMIC - GFP_ATOMIC allocators *must* be prepared to handle occasional oom situations gracefully.

> Another thing I would worry about are ports with multiple user page
> sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> but may likely need a 16K kernel stack due to the 64bit stack bloat.

yep, but these cases are not affected, i think in the order != 0 case we should return NULL if a certain number of iterations did not yield any free page.

	Ingo
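The policy suggested above (keep trying for single pages, give up on order != 0 requests after a bounded number of iterations) might look roughly like the sketch below. It is a user-space toy under stated assumptions: `toy_alloc_pages`, the callback types, the `fake_*` backend and the limit of 8 tries are all hypothetical and are not taken from any kernel source.

```c
#include <stddef.h>

#define TOY_MAX_HIGH_ORDER_TRIES 8   /* hypothetical retry limit */

/* One allocation attempt; returns NULL when nothing is free. */
typedef void *(*toy_try_fn)(unsigned int order, void *ctx);
/* One pass of memory balancing / reclaim. */
typedef void (*toy_reclaim_fn)(void *ctx);

/*
 * Order-0 requests loop until they succeed; higher-order requests
 * return NULL once the retry limit is reached.
 */
static void *toy_alloc_pages(unsigned int order, toy_try_fn try_alloc,
                             toy_reclaim_fn reclaim, void *ctx)
{
    unsigned int tries = 0;

    for (;;) {
        void *page = try_alloc(order, ctx);

        if (page)
            return page;
        if (order != 0 && ++tries >= TOY_MAX_HIGH_ORDER_TRIES)
            return NULL;            /* caller must handle the failure */
        reclaim(ctx);
    }
}

/* Fake backend for illustration: memory appears after N reclaim passes. */
struct fake_mm {
    unsigned int passes_needed;
    unsigned int passes_done;
};

static char fake_page;

static void *fake_try(unsigned int order, void *ctx)
{
    struct fake_mm *mm = ctx;

    (void)order;
    return mm->passes_done >= mm->passes_needed ? (void *)&fake_page : NULL;
}

static void fake_reclaim(void *ctx)
{
    struct fake_mm *mm = ctx;

    mm->passes_done++;
}
```

With a backend that needs 20 reclaim passes, an order-0 request keeps looping and eventually succeeds, while an order-1 request returns NULL after 8 failed attempts. Note that the order-0 loop never terminates if reclaim can make no progress at all, which is the deadlock concern raised earlier in the thread.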
* Re: the new VMt 2000-09-25 16:19 ` Ingo Molnar @ 2000-09-25 16:18 ` Andi Kleen 2000-09-25 16:41 ` Andrea Arcangeli 2000-09-25 20:23 ` Russell King 2000-09-25 16:28 ` Rik van Riel 1 sibling, 2 replies; 243+ messages in thread From: Andi Kleen @ 2000-09-25 16:18 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 06:19:07PM +0200, Ingo Molnar wrote:
> > Another thing I would worry about are ports with multiple user page
> > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > but may likely need a 16K kernel stack due to the 64bit stack bloat.
>
> yep, but these cases are not affected, i think in the order != 0 case we
> should return NULL if a certain number of iterations did not yield any
> free page.

Ok, that would just break fork()

-Andi
* Re: the new VMt 2000-09-25 16:18 ` Andi Kleen @ 2000-09-25 16:41 ` Andrea Arcangeli 2000-09-25 16:35 ` Linus Torvalds 2000-09-25 20:23 ` Russell King 1 sibling, 1 reply; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 16:41 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 06:18:17PM +0200, Andi Kleen wrote:
> On Mon, Sep 25, 2000 at 06:19:07PM +0200, Ingo Molnar wrote:
> > > Another thing I would worry about are ports with multiple user page
> > > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > > but may likely need a 16K kernel stack due to the 64bit stack bloat.
> >
> > yep, but these cases are not affected, i think in the order != 0 case we
> > should return NULL if a certain number of iterations did not yield any
> > free page.
>
> Ok, that would just break fork()

Not sure if I have the whole context (I've not yet received Ingo's email that you're replying to).

Currently we do a memory balancing pass independently of the order of the allocation. Thus we don't do any iteration and the memory balancing is completely order blind (unfortunately it's also zone blind, while at least in 2.2.x the memory balancing knew which zone it had to allocate memory from).

If Ingo suggested more iterations of memory balancing for those cases that should only make things better with respect to fragmentation.

But I'd much prefer to pass not only the classzone from allocator to memory balancing, but _also_ the order of the allocation, and then shrink_mmap will know it doesn't worth to free anything that isn't contiguous on the order of the allocation that we need.

classzone haven't reached this point yet.

Andrea
* Re: the new VMt 2000-09-25 16:41 ` Andrea Arcangeli @ 2000-09-25 16:35 ` Linus Torvalds 2000-09-25 16:41 ` Rik van Riel 2000-09-27 7:14 ` Rusty Russell 0 siblings, 2 replies; 243+ messages in thread From: Linus Torvalds @ 2000-09-25 16:35 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andi Kleen, Ingo Molnar, Alan Cox, Marcelo Tosatti, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
>
> But I'd much prefer to pass not only the classzone from allocator
> to memory balancing, but _also_ the order of the allocation,
> and then shrink_mmap will know it doesn't worth to free anything
> that isn't contiguous on the order of the allocation that we need.

I suspect that the proper way to do this is to just make another gfp_flag, which is basically another hint to the mm layer that we're doing a multi-page allocation and that the MM layer should not try forever to handle it.

In fact, that's independent of whether it is a multi-page allocation or not. It might be something like __GFP_SOFT - you could use it with single pages too.

Thinking about it, we do have it already. It's called !__GFP_HIGH, and it used by all the GFP_USER allocations.

		Linus
* Re: the new VMt 2000-09-25 16:35 ` Linus Torvalds @ 2000-09-25 16:41 ` Rik van Riel 2000-09-25 16:49 ` Linus Torvalds 2000-09-27 7:14 ` Rusty Russell 1 sibling, 1 reply; 243+ messages in thread From: Rik van Riel @ 2000-09-25 16:41 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Andi Kleen, Ingo Molnar, Alan Cox, Marcelo Tosatti, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Linus Torvalds wrote:
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> >
> > But I'd much prefer to pass not only the classzone from allocator
> > to memory balancing, but _also_ the order of the allocation,
> > and then shrink_mmap will know it doesn't worth to free anything
> > that isn't contiguous on the order of the allocation that we need.
>
> I suspect that the proper way to do this is to just make another gfp_flag,
> which is basically another hint to the mm layer that we're doing a multi-
> page allocation and that the MM layer should not try forever to handle it.
>
> In fact, that's independent of whether it is a multi-page
> allocation or not. It might be something like __GFP_SOFT - you
> could use it with single pages too.
>
> Thinking about it, we do have it already. It's called
> !__GFP_HIGH, and it used by all the GFP_USER allocations.

Hmm, I think these two are orthogonal.

__GFP_HIGH means that we are allowed to eat deeper into the free list (maybe needed to avoid a deadlock freeing pages)

__GFP_SOFT would mean "don't bother waiting for free pages", which is something very different...

(I wouldn't want a user process to get killed simply because kswapd is waiting for IO to finish on a swapout, in that case we really do want to sleep for a while)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/
* Re: the new VMt 2000-09-25 16:41 ` Rik van Riel @ 2000-09-25 16:49 ` Linus Torvalds 2000-09-25 17:03 ` Ingo Molnar 2000-09-25 17:15 ` Andrea Arcangeli 0 siblings, 2 replies; 243+ messages in thread From: Linus Torvalds @ 2000-09-25 16:49 UTC (permalink / raw) To: Rik van Riel Cc: Andrea Arcangeli, Andi Kleen, Ingo Molnar, Alan Cox, Marcelo Tosatti, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Rik van Riel wrote:
> >
> > Thinking about it, we do have it already. It's called
> > !__GFP_HIGH, and it used by all the GFP_USER allocations.
>
> Hmm, I think these two are orthogonal.
>
> __GFP_HIGH means that we are allowed to eat deeper into
> the free list (maybe needed to avoid a deadlock freeing
> pages)
>
> __GFP_SOFT would mean "don't bother waiting for free pages",
> which is something very different...

Yes, I'm inclined to agree. Or at least not disagree. I'm more arguing that the order itself may not be the most interesting thing, and that I don't think the balancing has to take the order of the allocation into account - because it should be equivalent to just tell that it's a soft allocation (whether though the current !__GFP_HIGH or through a new __GFP_SOFT with slightly different logic).

		Linus
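One way to picture the distinction drawn above between the two flags is as two independent inputs to the allocator's decision: one selects how deep into the free list a request may go, the other selects whether the request waits or fails when that limit is hit. The sketch below is an illustration only; the `TOY_*` names, flag values, and watermark parameters are hypothetical and do not reflect the 2.4 implementation.

```c
#define TOY_GFP_HIGH 0x1u   /* may eat deeper into the free list */
#define TOY_GFP_SOFT 0x2u   /* do not wait for free pages, fail instead */

enum toy_result {
    TOY_OK,     /* allocation proceeds immediately */
    TOY_WAIT,   /* caller sleeps until memory is freed */
    TOY_FAIL    /* caller gets NULL back */
};

/*
 * Decide what happens to a request given the current number of free
 * pages and two watermarks (pages_min <= pages_low).
 */
static enum toy_result toy_alloc_decision(unsigned long free_pages,
                                          unsigned long pages_low,
                                          unsigned long pages_min,
                                          unsigned int flags)
{
    /* HIGH only moves the floor below which the request is refused. */
    unsigned long floor = (flags & TOY_GFP_HIGH) ? pages_min : pages_low;

    if (free_pages > floor)
        return TOY_OK;

    /* SOFT only chooses between sleeping and failing. */
    return (flags & TOY_GFP_SOFT) ? TOY_FAIL : TOY_WAIT;
}
```

In this toy model a request at 30 free pages with watermarks low=50 and min=10 waits with no flags, proceeds with `TOY_GFP_HIGH`, and fails with `TOY_GFP_SOFT`, so each flag changes the outcome along its own axis.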
* Re: the new VMt 2000-09-25 16:49 ` Linus Torvalds @ 2000-09-25 17:03 ` Ingo Molnar 2000-09-25 17:17 ` Andrea Arcangeli 2000-09-25 17:15 ` Andrea Arcangeli 1 sibling, 1 reply; 243+ messages in thread From: Ingo Molnar @ 2000-09-25 17:03 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrea Arcangeli, Andi Kleen, Alan Cox, Marcelo Tosatti, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Linus Torvalds wrote:

> Yes, I'm inclined to agree. Or at least not disagree. I'm more arguing
> that the order itself may not be the most interesting thing, and that
> I don't think the balancing has to take the order of the allocation
> into account - because it should be equivalent to just tell that it's
> a soft allocation (whether though the current !__GFP_HIGH or through a
> new __GFP_SOFT with slightly different logic).

yep, and there is another problem with pure order-based distinction: if i do kmalloc(5k), and write the code on Alpha and expect it to never fail, shouldnt i expect this to never fail on x86 as well? Along with the fork() failure. __GFP_SOFT solves this all very nicely - the *allocator* decides what allocation policy to follow. Great!

	Ingo
* Re: the new VMt 2000-09-25 17:03 ` Ingo Molnar @ 2000-09-25 17:17 ` Andrea Arcangeli 2000-09-25 17:10 ` Rik van Riel 0 siblings, 1 reply; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 17:17 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Rik van Riel, Andi Kleen, Alan Cox, Marcelo Tosatti, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 07:03:46PM +0200, Ingo Molnar wrote:
> [..] __GFP_SOFT solves this all very nicely [..]

s/very nicely/throwing away lots of useful cache for no one good reason/

Andrea
* Re: the new VMt 2000-09-25 17:17 ` Andrea Arcangeli @ 2000-09-25 17:10 ` Rik van Riel 2000-09-25 17:27 ` Andrea Arcangeli 0 siblings, 1 reply; 243+ messages in thread From: Rik van Riel @ 2000-09-25 17:10 UTC (permalink / raw) To: Andrea Arcangeli Cc: Ingo Molnar, Linus Torvalds, Andi Kleen, Alan Cox, Marcelo Tosatti, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 07:03:46PM +0200, Ingo Molnar wrote:
> > [..] __GFP_SOFT solves this all very nicely [..]
>
> s/very nicely/throwing away lots of useful cache for no one good reason/

Not really. We could fix this by making the page freeing functions smarter and only free the pages we need.

I just don't know if this is worth it for 0.5% of the allocations (and further more, since we allocate the 1-page allocations directly from the cache when we're low on free memory, fragmentation isn't as bad as it used to be with the old VM).

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/
* Re: the new VMt 2000-09-25 17:10 ` Rik van Riel @ 2000-09-25 17:27 ` Andrea Arcangeli 0 siblings, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 17:27 UTC (permalink / raw) To: Rik van Riel Cc: Ingo Molnar, Linus Torvalds, Andi Kleen, Alan Cox, Marcelo Tosatti, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 02:10:07PM -0300, Rik van Riel wrote:
> Not really. We could fix this by making the page freeing
> functions smarter and only free the pages we need.

That's what I proposed in first place in fact. To free large chunk of memory you may have to throw away lots of cache. We're not freeing contiguous cache as we do in 2.2.x.

Andrea
* Re: the new VMt 2000-09-25 16:49 ` Linus Torvalds 2000-09-25 17:03 ` Ingo Molnar @ 2000-09-25 17:15 ` Andrea Arcangeli 1 sibling, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 17:15 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andi Kleen, Ingo Molnar, Alan Cox, Marcelo Tosatti, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 09:49:46AM -0700, Linus Torvalds wrote:
> [..] I
> don't think the balancing has to take the order of the allocation into
> account [..]

Why do you prefer to throw away most of the cache (potentially at fork time) instead of freeing only the few contiguous bits that we need?

Andrea
* Re: the new VMt 2000-09-25 16:35 ` Linus Torvalds 2000-09-25 16:41 ` Rik van Riel @ 2000-09-27 7:14 ` Rusty Russell 1 sibling, 0 replies; 243+ messages in thread From: Rusty Russell @ 2000-09-27 7:14 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Ingo Molnar, Alan Cox, Marcelo Tosatti, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

In message <Pine.LNX.4.10.10009250931570.1739-100000@penguin.transmeta.com> you write:
> I suspect that the proper way to do this is to just make another gfp_flag,
> which is basically another hint to the mm layer that we're doing a multi-
> page allocation and that the MM layer should not try forever to handle it.
>
> In fact, that's independent of whether it is a multi-page allocation or
> not. It might be something like __GFP_SOFT - you could use it with single
> pages too.

That'd be a lovely interface, now wouldn't it? *yecch*

Please consider at least:

	/* Never fails. */
	#define trivial_kmalloc(s) \
		((void)((s) > PAGE_SIZE ? bad_size_##s : __kmalloc((s), GFP_KERNEL)))

	/* Can fail */
	#define kmalloc(s, pri) __kmalloc((s), (pri)|__GFP_SOFT)

Rusty.
--
Hacking time.
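The interface proposed above splits allocation into a size-checked variant for small constant requests and a default variant that is always tagged as soft. A compilable user-space approximation is sketched below. It is not the macro from the message: the `toy_*` names, `TOY_PAGE_SIZE`, and `TOY_GFP_SOFT` are hypothetical, `malloc()` stands in for `__kmalloc()` (so the "never fails" guarantee itself is not modeled), and the size check uses a negative-array-size trick instead of the `bad_size_##s` symbol.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size for the sketch */
#define TOY_GFP_SOFT  0x2u    /* hypothetical "may fail" flag */

/* Records the flags passed to the most recent toy_alloc() call. */
static unsigned int toy_last_flags;

/* Stand-in for __kmalloc(): user-space malloc() that remembers its flags. */
static void *toy_alloc(size_t size, unsigned int flags)
{
    toy_last_flags = flags;
    return malloc(size);
}

/*
 * Variant intended for small constant sizes. The array type gets a
 * negative size when s exceeds one page, so an oversized constant
 * request is rejected at compile time.
 */
#define toy_trivial_kmalloc(s) \
    ((void)sizeof(char[1 - 2 * ((s) > TOY_PAGE_SIZE)]), toy_alloc((s), 0u))

/* Default variant: every request is tagged as soft, so callers must check. */
#define toy_kmalloc(s, pri) toy_alloc((s), (pri) | TOY_GFP_SOFT)
```

A call such as `toy_trivial_kmalloc(128)` compiles and passes no soft flag, `toy_trivial_kmalloc(8192)` does not compile, and `toy_kmalloc(64, 0u)` always carries `TOY_GFP_SOFT`.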
* Re: the new VMt 2000-09-25 16:18 ` Andi Kleen 2000-09-25 16:41 ` Andrea Arcangeli @ 2000-09-25 20:23 ` Russell King 1 sibling, 0 replies; 243+ messages in thread From: Russell King @ 2000-09-25 20:23 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel, linux-mm, torvalds

Andi Kleen writes:
> On Mon, Sep 25, 2000 at 06:19:07PM +0200, Ingo Molnar wrote:
> > > Another thing I would worry about are ports with multiple user page
> > > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > > but may likely need a 16K kernel stack due to the 64bit stack bloat.
> >
> > yep, but these cases are not affected, i think in the order != 0 case we
> > should return NULL if a certain number of iterations did not yield any
> > free page.
>
> Ok, that would just break fork()

Especially so when, on the ARM, the first level page table is 16K, and the page size is 4K. Should Ingo's suggestion happen, we still need a way of allocating 16K aligned chunks of memory for such stuff.

Just a small question... I thought we were discussing 2.4, not possible features for 2.5?

--
Russell King rmk@arm.linux.org.uk
http://www.arm.linux.org.uk/personal/aboutme.html
THE developer of ARM Linux
* Re: the new VMt 2000-09-25 16:19 ` Ingo Molnar 2000-09-25 16:18 ` Andi Kleen @ 2000-09-25 16:28 ` Rik van Riel 1 sibling, 0 replies; 243+ messages in thread From: Rik van Riel @ 2000-09-25 16:28 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Ingo Molnar wrote:
> On Mon, 25 Sep 2000, Andi Kleen wrote:
>
> > Another thing I would worry about are ports with multiple user page
> > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > but may likely need a 16K kernel stack due to the 64bit stack bloat.
>
> yep, but these cases are not affected, i think in the order != 0
> case we should return NULL if a certain number of iterations did
> not yield any free page.

Indeed. You're right here.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/
* Re: the new VMt 2000-09-25 16:02 ` Ingo Molnar 2000-09-25 16:04 ` Andi Kleen @ 2000-09-25 16:11 ` Andrea Arcangeli 2000-09-25 16:22 ` Ingo Molnar 2000-09-25 16:53 ` Alan Cox 2 siblings, 1 reply; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 16:11 UTC (permalink / raw) To: Ingo Molnar Cc: Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 06:02:18PM +0200, Ingo Molnar wrote:
> Frankly, how often do we allocate multi-order pages? I've just made quick

The deadlock Alan pointed out can also happen with single-page allocations
if we put a loop in GFP_KERNEL in 2.4.x-current.

> ie. 99.45% of all allocations are single-page! 0.50% is the 8kb

You're right. That's why it's a waste to have so many orders in the buddy
allocator. Even more so now that the hashtables should be allocated with
the bootmem allocator! :)

Chuck has seen the slowdown from increasing the highest order allocation
in his bench. But of course in 2.2.x we can't avoid that.

Andrea
* Re: the new VMt 2000-09-25 16:11 ` Andrea Arcangeli @ 2000-09-25 16:22 ` Ingo Molnar 2000-09-25 16:17 ` Alexander Viro ` (2 more replies) 0 siblings, 3 replies; 243+ messages in thread From: Ingo Molnar @ 2000-09-25 16:22 UTC (permalink / raw) To: Andrea Arcangeli Cc: Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Andrea Arcangeli wrote: > > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb > > You're right. That's why it's a waste to have so many order in the > buddy allocator. [...] yep, i agree. I'm not sure what the biggest allocation is, some drivers might use megabytes or contiguous RAM? Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 16:22 ` Ingo Molnar @ 2000-09-25 16:17 ` Alexander Viro 2000-09-25 16:36 ` Jeff Garzik 2000-09-25 16:57 ` Alan Cox 2000-09-25 16:33 ` the new VMt Andrea Arcangeli 2000-09-26 8:38 ` Jes Sorensen 2 siblings, 2 replies; 243+ messages in thread From: Alexander Viro @ 2000-09-25 16:17 UTC (permalink / raw) To: Ingo Molnar Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Ingo Molnar wrote: > On Mon, 25 Sep 2000, Andrea Arcangeli wrote: > > > > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb > > > > You're right. That's why it's a waste to have so many order in the > > buddy allocator. [...] > > yep, i agree. I'm not sure what the biggest allocation is, some drivers > might use megabytes or contiguous RAM? Stupidity has no limits... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 16:17 ` Alexander Viro @ 2000-09-25 16:36 ` Jeff Garzik 2000-09-25 16:57 ` Alan Cox 1 sibling, 0 replies; 243+ messages in thread From: Jeff Garzik @ 2000-09-25 16:36 UTC (permalink / raw) To: Alexander Viro; +Cc: Ingo Molnar, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Alexander Viro wrote:
> On Mon, 25 Sep 2000, Ingo Molnar wrote:
> > yep, i agree. I'm not sure what the biggest allocation is, some drivers
> > might use megabytes or contiguous RAM?
> Stupidity has no limits...

Blame the hardware designers... and give me my big allocations. :)

Sound drivers (not mine though, <g>) do stuff like

    order = 20; /* just a made-up high number */
    while ((order-- > 0) && (mem == NULL)) {
        mem = __get_free_pages (GFP_KERNEL, order);
    }
    /* use sound buffer 'mem' */

Older or modern, less-than-cool framegrabbers need tons of contiguous
memory too...

    Jeff
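The "start high and step down" pattern Jeff describes can be sketched in userspace as below. This is an illustration under stated assumptions, not driver code: sketch_get_free_pages() is a hypothetical stand-in that uses malloc() and simply refuses anything above a chosen order, and 4K pages are assumed. One detail the sketch makes explicit is which order finally succeeded; in the loop as quoted, `order` appears to be decremented once more by the loop condition after a successful allocation, so it no longer matches the size of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumes 4K pages */

/* Hypothetical stand-in for __get_free_pages(): fails above max_ok_order. */
static void *sketch_get_free_pages(unsigned order, unsigned max_ok_order)
{
    if (order > max_ok_order)
        return NULL;
    return malloc((size_t)SKETCH_PAGE_SIZE << order);
}

/*
 * Try start_order first, then progressively smaller orders.
 * On success the order actually obtained is stored in *got_order,
 * so the caller knows how large the buffer is.
 */
static void *sketch_alloc_largest(unsigned start_order,
                                  unsigned max_ok_order,
                                  unsigned *got_order)
{
    unsigned order = start_order;

    assert(got_order != NULL);
    for (;;) {
        void *mem = sketch_get_free_pages(order, max_ok_order);

        if (mem != NULL) {
            *got_order = order;
            return mem;
        }
        if (order == 0)
            return NULL;   /* even a single page failed */
        order--;
    }
}
```

A caller asking for order 10 on a system that can only satisfy order 5 would get a 128K buffer and `*got_order == 5`, and would release it with free() in this model.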
* Re: the new VMt 2000-09-25 16:17 ` Alexander Viro 2000-09-25 16:36 ` Jeff Garzik @ 2000-09-25 16:57 ` Alan Cox 2000-09-25 17:01 ` Alexander Viro 1 sibling, 1 reply; 243+ messages in thread From: Alan Cox @ 2000-09-25 16:57 UTC (permalink / raw) To: Alexander Viro Cc: Ingo Molnar, Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

> > yep, i agree. I'm not sure what the biggest allocation is, some drivers
> > might use megabytes or contiguous RAM?
>
> Stupidity has no limits...

Unfortunately it's frequently wired into the hardware to save a few cents on
scatter gather logic.

We need 128K blocks for sound DMA buffers, and for most sound cards they need
to be linear (but not the newer ones thankfully). Some video capture hardware
needs 4Mb but that needs to use bootmem (in 2.2 they use bigmem hacks)
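To relate the sizes mentioned in this thread to buddy-allocator orders, here is a small worked sketch. It assumes 4K pages (as on x86) and the usual convention that an order-n block is 2^n contiguous pages. The helper name sketch_size_to_order() is hypothetical and is not a kernel API.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12u   /* assumes 4K pages */

/* Smallest order whose block (2^order pages) holds at least `bytes`. */
static unsigned sketch_size_to_order(size_t bytes)
{
    size_t pages;
    unsigned order = 0;

    assert(bytes > 0);
    /* Round the byte count up to whole pages. */
    pages = (bytes + ((size_t)1 << SKETCH_PAGE_SHIFT) - 1) >> SKETCH_PAGE_SHIFT;
    /* Round the page count up to the next power of two. */
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

Under these assumptions the 8kb allocations in Ingo's statistics are order 1, a 16K kernel stack or ARM first-level page table is order 2, a 128K sound DMA buffer is order 5, and a 4Mb capture buffer is order 10.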
* Re: the new VMt 2000-09-25 16:57 ` Alan Cox @ 2000-09-25 17:01 ` Alexander Viro 2000-09-25 17:06 ` Alan Cox 0 siblings, 1 reply; 243+ messages in thread From: Alexander Viro @ 2000-09-25 17:01 UTC (permalink / raw) To: Alan Cox Cc: Ingo Molnar, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Alan Cox wrote: > > > yep, i agree. I'm not sure what the biggest allocation is, some drivers > > > might use megabytes or contiguous RAM? > > > > Stupidity has no limits... > > Unfortunately its frequently wired into the hardware to save a few cents on > scatter gather logic. Since when hardware folks became exempt from the rule above? 128K is almost tolerable, there were requests for 64 _mega_bytes... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 17:01 ` Alexander Viro @ 2000-09-25 17:06 ` Alan Cox 2000-09-25 17:31 ` Oliver Xymoron 2000-09-25 19:03 ` the new VMt [4MB+ blocks] Matti Aarnio 0 siblings, 2 replies; 243+ messages in thread From: Alan Cox @ 2000-09-25 17:06 UTC (permalink / raw) To: Alexander Viro Cc: Alan Cox, Ingo Molnar, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel > > > Stupidity has no limits... > > > > Unfortunately its frequently wired into the hardware to save a few cents on > > scatter gather logic. > > Since when hardware folks became exempt from the rule above? 128K is > almost tolerable, there were requests for 64 _mega_bytes... Most cheap ass PCI hardware is built on the basis you can do linear 4Mb allocations. There is a reason for this. You can do that 4Mb allocation on NT or Windows 9x -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 17:06 ` Alan Cox @ 2000-09-25 17:31 ` Oliver Xymoron 2000-09-25 17:51 ` Jeff Garzik 2000-09-25 19:03 ` the new VMt [4MB+ blocks] Matti Aarnio 1 sibling, 1 reply; 243+ messages in thread From: Oliver Xymoron @ 2000-09-25 17:31 UTC (permalink / raw) To: Alan Cox Cc: Alexander Viro, Ingo Molnar, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Alan Cox wrote: > > > > Stupidity has no limits... > > > > > > Unfortunately its frequently wired into the hardware to save a few cents on > > > scatter gather logic. > > > > Since when hardware folks became exempt from the rule above? 128K is > > almost tolerable, there were requests for 64 _mega_bytes... > > Most cheap ass PCI hardware is built on the basis you can do linear 4Mb > allocations. There is a reason for this. You can do that 4Mb allocation on > NT or Windows 9x Sure about that? It's been a while, but I seem to recall NT enforcing a scatter-gather framework on all drivers because it only gave them virtual allocations. For the cheaper cards, the s-g was done by software issuing single span requests to the card. -- "Love the dolphins," she advised him. "Write by W.A.S.T.E.." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 17:31 ` Oliver Xymoron @ 2000-09-25 17:51 ` Jeff Garzik 0 siblings, 0 replies; 243+ messages in thread From: Jeff Garzik @ 2000-09-25 17:51 UTC (permalink / raw) To: Oliver Xymoron; +Cc: MM mailing list, linux-kernel On Mon, 25 Sep 2000, Oliver Xymoron wrote: > Sure about that? It's been a while, but I seem to recall NT enforcing a > scatter-gather framework on all drivers because it only gave them virtual > allocations. For the cheaper cards, the s-g was done by software issuing > single span requests to the card. The Matrox framegrabber guys use some API under NT to allocate megabytes upon megabytes of contiguous memory for DMA. Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt [4MB+ blocks] 2000-09-25 17:06 ` Alan Cox 2000-09-25 17:31 ` Oliver Xymoron @ 2000-09-25 19:03 ` Matti Aarnio 2000-09-25 20:02 ` Stephen Williams 1 sibling, 1 reply; 243+ messages in thread From: Matti Aarnio @ 2000-09-25 19:03 UTC (permalink / raw) To: Alan Cox; +Cc: MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 06:06:11PM +0100, Alan Cox wrote: > > > > Stupidity has no limits... > > > Unfortunately its frequently wired into the hardware to save a few cents on > > > scatter gather logic. > > > > Since when hardware folks became exempt from the rule above? 128K is > > almost tolerable, there were requests for 64 _mega_bytes... > > Most cheap ass PCI hardware is built on the basis you can do linear 4Mb > allocations. There is a reason for this. You can do that 4Mb allocation on > NT or Windows 9x Sure, but intel processors have this neat 4 MB "super-page" feature in the MMU... (as we all well know) Sometimes allocating such monster memory blocks could be supported, but it should not be expected to be *fast*. E.g. if doing it in "reliable" way needs possibly moving currently allocated pages away from memory to create such a hole(s), so be it.. Anybody here who can describe those M$ API calls ? Are they kernel/DDK-only, or userspace ones, or both ? /Matti Aarnio -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt [4MB+ blocks] 2000-09-25 19:03 ` the new VMt [4MB+ blocks] Matti Aarnio @ 2000-09-25 20:02 ` Stephen Williams 0 siblings, 0 replies; 243+ messages in thread From: Stephen Williams @ 2000-09-25 20:02 UTC (permalink / raw) To: Matti Aarnio; +Cc: Alan Cox, MM mailing list, linux-kernel matti.aarnio@zmailer.org said: > Sometimes allocating such monster memory blocks could be supported, > but it should not be expected to be *fast*. E.g. if doing it in > "reliable" way needs possibly moving currently allocated pages > away from memory to create such a hole(s), so be it. matti.aarnio@zmailer.org said: > Anybody here who can describe those M$ API calls ? > Are they kernel/DDK-only, or userspace ones, or both ? NT does indeed support allocating contiguous buffers of memory, which is useful when the hardware in question doesn't do scatter-gather. I have on occasion been compelled to use these routines. (Paradoxically, the requirements in my case came from broken NT mmap support and not from the hardware. Blech!) Anyhow, these routines are indeed slow. And judging by the amount of disk noise I hear when they are called, they do try to kick out pages to make an allocation work. However, even so the M$ calls will eventually fail due to lack of large enough holes, so fragmentation takes its toll. So, they are both slow and unreliable under NT. But drivers that use them tend to be loaded once at boot time, and that's it. -- Steve Williams "The woods are lovely, dark and deep. steve@icarus.com But I have promises to keep, steve@picturel.com and lines to code before I sleep, http://www.picturel.com And lines to code before I sleep." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 16:22 ` Ingo Molnar 2000-09-25 16:17 ` Alexander Viro @ 2000-09-25 16:33 ` Andrea Arcangeli 2000-09-26 8:38 ` Jes Sorensen 2 siblings, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 16:33 UTC (permalink / raw) To: Ingo Molnar Cc: Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 06:22:42PM +0200, Ingo Molnar wrote: > yep, i agree. I'm not sure what the biggest allocation is, some drivers > might use megabytes or contiguous RAM? I'm not sure (we should grep all the drivers to be sure...) but I bet the old 2.2.0 MAX_ORDER #define will work for everything. The fact is that over a certain order there's no hope anyway at runtime and the only big allocations done through the init sequence are for the hashtable. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 16:22 ` Ingo Molnar 2000-09-25 16:17 ` Alexander Viro 2000-09-25 16:33 ` the new VMt Andrea Arcangeli @ 2000-09-26 8:38 ` Jes Sorensen 2000-09-26 8:52 ` Ingo Molnar 2 siblings, 1 reply; 243+ messages in thread From: Jes Sorensen @ 2000-09-26 8:38 UTC (permalink / raw) To: mingo Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel >>>>> "Ingo" == Ingo Molnar <mingo@elte.hu> writes: Ingo> On Mon, 25 Sep 2000, Andrea Arcangeli wrote: >> > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb >> >> You're right. That's why it's a waste to have so many order in the >> buddy allocator. [...] Ingo> yep, i agree. I'm not sure what the biggest allocation is, some Ingo> drivers might use megabytes or contiguous RAM? 9.5KB blocks is common for people running Gigabit Ethernet with Jumbo frames at least. Jes -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-26 8:38 ` Jes Sorensen @ 2000-09-26 8:52 ` Ingo Molnar 2000-09-26 9:02 ` Jes Sorensen 0 siblings, 1 reply; 243+ messages in thread From: Ingo Molnar @ 2000-09-26 8:52 UTC (permalink / raw) To: Jes Sorensen Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On 26 Sep 2000, Jes Sorensen wrote: > 9.5KB blocks is common for people running Gigabit Ethernet with Jumbo > frames at least. yep, although this is more of a Linux limitation, the cards themselves are happy to DMA fragmented buffers as well. (sans some small penalty per new fragment.) Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-26 8:52 ` Ingo Molnar @ 2000-09-26 9:02 ` Jes Sorensen 0 siblings, 0 replies; 243+ messages in thread From: Jes Sorensen @ 2000-09-26 9:02 UTC (permalink / raw) To: mingo Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel >>>>> "Ingo" == Ingo Molnar <mingo@elte.hu> writes: Ingo> On 26 Sep 2000, Jes Sorensen wrote: >> 9.5KB blocks is common for people running Gigabit Ethernet with >> Jumbo frames at least. Ingo> yep, although this is more of a Linux limitation, the cards Ingo> themselves are happy to DMA fragmented buffers as well. (sans Ingo> some small penalty per new fragment.) Hence the reason I have been pushing for the kiobufifying of the skbs ;-) It's even more important for HIPPI with the 65280 bytes MTU. Jes -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 16:02 ` Ingo Molnar 2000-09-25 16:04 ` Andi Kleen 2000-09-25 16:11 ` Andrea Arcangeli @ 2000-09-25 16:53 ` Alan Cox 2 siblings, 0 replies; 243+ messages in thread From: Alan Cox @ 2000-09-25 16:53 UTC (permalink / raw) To: mingo Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel > Frankly, how often do we allocate multi-order pages? I've just made quick > statistics wrt. how allocation orders are distributed on a more or less > typical system: Enough that failures on this crashed older 2.2 kernels because the tcp code ended up looping trying to get memory and the slab allocator couldnt get a new multipage block. Alan -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 15:16 ` the new VMt Alan Cox 2000-09-25 15:33 ` the new VM Ingo Molnar 2000-09-25 15:41 ` the new VMt Andrea Arcangeli @ 2000-09-25 15:42 ` Stephen C. Tweedie 2000-09-25 16:05 ` Andrea Arcangeli ` (2 more replies) 2000-09-25 16:16 ` Rik van Riel 3 siblings, 3 replies; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-25 15:42 UTC (permalink / raw) To: Alan Cox Cc: mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 04:16:56PM +0100, Alan Cox wrote:
>
> Unless Im missing something here think about this case
>
> 2 active processes, no swap
>
> #1                      #2
> kmalloc 32K             kmalloc 16K
> OK                      OK
> kmalloc 16K             kmalloc 32K
> block                   block
>

... and we get two wakeup_kswapd()s.  kswapd has PF_MEMALLOC and so is
able to eat memory which processes #1 and #2 are not allowed to touch.
Progress is made, clean pages are discarded and dirty ones queued for
write, memory becomes free again and the world is a better place.

Or so goes the theory, at least.

--Stephen
* Re: the new VMt 2000-09-25 15:42 ` Stephen C. Tweedie @ 2000-09-25 16:05 ` Andrea Arcangeli 2000-09-25 16:22 ` Rik van Riel 2000-09-25 17:39 ` Stephen C. Tweedie 2000-09-25 16:51 ` Alan Cox 2000-09-25 16:52 ` yodaiken 2 siblings, 2 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 16:05 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Alan Cox, mingo, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> Progress is made, clean pages are discarded and dirty ones queued for

How can you make progress if there isn't swap available and all the
freeable page/buffer cache has already been freed? The deadlock happens
in an OOM condition (not when we can make progress).

Andrea
* Re: the new VMt 2000-09-25 16:05 ` Andrea Arcangeli @ 2000-09-25 16:22 ` Rik van Riel 2000-09-25 16:42 ` Andrea Arcangeli 2000-09-25 17:39 ` Stephen C. Tweedie 1 sibling, 1 reply; 243+ messages in thread From: Rik van Riel @ 2000-09-25 16:22 UTC (permalink / raw) To: Andrea Arcangeli Cc: Stephen C. Tweedie, Alan Cox, mingo, Marcelo Tosatti, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Andrea Arcangeli wrote: > On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote: > > Progress is made, clean pages are discarded and dirty ones queued for > > How can you make progress if there isn't swap avaiable and all the > freeable page/buffer cache is just been freed? The deadlock happens > in OOM condition (not when we can make progress). This is exactly why integrating the OOM killer is on my TODO list. The important difference between the new VM and the old one is that we can't fail while we are not OOM, whereas the old allocator could break down even when we still had enough swap free.... regards, Rik -- "What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000 http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 16:22 ` Rik van Riel @ 2000-09-25 16:42 ` Andrea Arcangeli 0 siblings, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 16:42 UTC (permalink / raw) To: Rik van Riel Cc: Stephen C. Tweedie, Alan Cox, mingo, Marcelo Tosatti, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 01:22:40PM -0300, Rik van Riel wrote:
> whereas the old allocator could break down even when
> we still had enough swap free....

As far as I can see that's a bug that you hid by introducing a deadlock.

Andrea
* Re: the new VMt 2000-09-25 16:05 ` Andrea Arcangeli 2000-09-25 16:22 ` Rik van Riel @ 2000-09-25 17:39 ` Stephen C. Tweedie 1 sibling, 0 replies; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-25 17:39 UTC (permalink / raw) To: Andrea Arcangeli Cc: Stephen C. Tweedie, Alan Cox, mingo, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel Hi, On Mon, Sep 25, 2000 at 06:05:00PM +0200, Andrea Arcangeli wrote: > On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote: > > Progress is made, clean pages are discarded and dirty ones queued for > > How can you make progress if there isn't swap avaiable and all the > freeable page/buffer cache is just been freed? The deadlock happens > in OOM condition (not when we can make progress). Agreed --- this assumes that all pinned, nonswappable pages are subject to resource limiting to prevent them from exhausting the whole of memory. For things like page tables, that means we need beancounter in place for us to be 100% safe. For the no-swap case, that requires an OOM killer. The problem of avoiding filling memory with pinned pages is orthogonal to the problem of managing the unpinned memory. Both are obviously required for a stable system. Cheers, Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 15:42 ` Stephen C. Tweedie 2000-09-25 16:05 ` Andrea Arcangeli @ 2000-09-25 16:51 ` Alan Cox 2000-09-25 17:43 ` Stephen C. Tweedie 2000-09-25 16:52 ` yodaiken 2 siblings, 1 reply; 243+ messages in thread From: Alan Cox @ 2000-09-25 16:51 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

> > 2 active processes, no swap
> >
> > #1                      #2
> > kmalloc 32K             kmalloc 16K
> > OK                      OK
> > kmalloc 16K             kmalloc 32K
> > block                   block
> >
>
> ... and we get two wakeup_kswapd()s.  kswapd has PF_MEMALLOC and so is
> able to eat memory which processes #1 and #2 are not allowed to touch.

'no swap'
* Re: the new VMt 2000-09-25 16:51 ` Alan Cox @ 2000-09-25 17:43 ` Stephen C. Tweedie 2000-09-25 18:13 ` Alan Cox 0 siblings, 1 reply; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-25 17:43 UTC (permalink / raw) To: Alan Cox Cc: Stephen C. Tweedie, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 05:51:49PM +0100, Alan Cox wrote:
> > > 2 active processes, no swap
> > >
> > > #1                      #2
> > > kmalloc 32K             kmalloc 16K
> > > OK                      OK
> > > kmalloc 16K             kmalloc 32K
> > > block                   block
> > >
> >
> > ... and we get two wakeup_kswapd()s.  kswapd has PF_MEMALLOC and so is
> > able to eat memory which processes #1 and #2 are not allowed to touch.
>
> 'no swap'

kswapd is perfectly capable of evicting clean pages and triggering any
necessary writeback of dirty filesystem data at this point, even if
there is no swap.  If there is truly nothing kswapd can do to recover
here, then we are truly OOM.  Otherwise, kswapd should be able to free
the required memory, providing that the PF_MEMALLOC flag allows it to
eat into a reserved set of free pages which nobody else can allocate
once physical free pages gets below a certain threshold.

--Stephen
* Re: the new VMt 2000-09-25 17:43 ` Stephen C. Tweedie @ 2000-09-25 18:13 ` Alan Cox 2000-09-25 18:21 ` Stephen C. Tweedie 0 siblings, 1 reply; 243+ messages in thread From: Alan Cox @ 2000-09-25 18:13 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel > there is no swap. If there is truly nothing kswapd can do to recover > here, then we are truly OOM. Otherwise, kswapd should be able to free Indeed. But we wont fail the kmalloc with a NULL return -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 18:13 ` Alan Cox @ 2000-09-25 18:21 ` Stephen C. Tweedie 2000-09-25 19:09 ` Alan Cox 0 siblings, 1 reply; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-25 18:21 UTC (permalink / raw) To: Alan Cox Cc: Stephen C. Tweedie, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel Hi, On Mon, Sep 25, 2000 at 07:13:27PM +0100, Alan Cox wrote: > > there is no swap. If there is truly nothing kswapd can do to recover > > here, then we are truly OOM. Otherwise, kswapd should be able to free > > Indeed. But we wont fail the kmalloc with a NULL return Isn't that the preferred behaviour, though? If we are completely out of VM on a no-swap machine, we should be killing one of the existing processes rather than preventing any progress and keeping all of the old tasks alive but deadlocked. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 18:21 ` Stephen C. Tweedie @ 2000-09-25 19:09 ` Alan Cox 2000-09-25 19:21 ` Stephen C. Tweedie 0 siblings, 1 reply; 243+ messages in thread From: Alan Cox @ 2000-09-25 19:09 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel > > Indeed. But we wont fail the kmalloc with a NULL return > > Isn't that the preferred behaviour, though? If we are completely out > of VM on a no-swap machine, we should be killing one of the existing > processes rather than preventing any progress and keeping all of the > old tasks alive but deadlocked. Unless Im missing something we wont kill any task in that condition - even a SIGKILL will make no odds as everyone is asleep in kmalloc -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 19:09 ` Alan Cox @ 2000-09-25 19:21 ` Stephen C. Tweedie 0 siblings, 0 replies; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-25 19:21 UTC (permalink / raw) To: Alan Cox Cc: Stephen C. Tweedie, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel Hi, On Mon, Sep 25, 2000 at 08:09:31PM +0100, Alan Cox wrote: > > > Indeed. But we wont fail the kmalloc with a NULL return > > > > Isn't that the preferred behaviour, though? If we are completely out > > of VM on a no-swap machine, we should be killing one of the existing > > processes rather than preventing any progress and keeping all of the > > old tasks alive but deadlocked. > > Unless Im missing something we wont kill any task in that condition - even > a SIGKILL will make no odds as everyone is asleep in kmalloc Right. Eeek. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 15:42 ` Stephen C. Tweedie 2000-09-25 16:05 ` Andrea Arcangeli 2000-09-25 16:51 ` Alan Cox @ 2000-09-25 16:52 ` yodaiken 2000-09-25 17:18 ` Jamie Lokier 2 siblings, 1 reply; 243+ messages in thread From: yodaiken @ 2000-09-25 16:52 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, Sep 25, 2000 at 04:16:56PM +0100, Alan Cox wrote:
> >
> > Unless Im missing something here think about this case
> >
> > 2 active processes, no swap
> >
> > #1                      #2
> > kmalloc 32K             kmalloc 16K
> > OK                      OK
> > kmalloc 16K             kmalloc 32K
> > block                   block
> >
>
> ... and we get two wakeup_kswapd()s.  kswapd has PF_MEMALLOC and so is
> able to eat memory which processes #1 and #2 are not allowed to touch.
> Progress is made, clean pages are discarded and dirty ones queued for
> write, memory becomes free again and the world is a better place.
>
> Or so goes the theory, at least.

from fs/select.c

        walk = out;
        while(nfds > 0) {
                poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
                if (!tmp) {
                        while(out != NULL) {
                                tmp = out->next;
                                free_page((unsigned long)out);
                                out = tmp;
                        }
                        return NULL;
                }
                tmp->nr = 0;
                tmp->entry = (struct poll_table_entry *)(tmp + 1);
                tmp->next = NULL;
                walk->next = tmp;
                walk = tmp;
                nfds -=__MAX_POLL_TABLE_ENTRIES;
        }

>
> --Stephen

---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com  www.rtlinux.com
* Re: the new VMt 2000-09-25 16:52 ` yodaiken @ 2000-09-25 17:18 ` Jamie Lokier 2000-09-25 17:51 ` yodaiken 0 siblings, 1 reply; 243+ messages in thread From: Jamie Lokier @ 2000-09-25 17:18 UTC (permalink / raw) To: yodaiken Cc: Stephen C. Tweedie, Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel yodaiken@fsmlabs.com wrote: > walk = out; > while(nfds > 0) { > poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL); > if (!tmp) { Shouldn't this be GFP_USER? (Which would also conveniently fix the problem Victor's pointing out...) -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 17:18 ` Jamie Lokier @ 2000-09-25 17:51 ` yodaiken 2000-09-25 18:04 ` Jamie Lokier 2000-09-25 18:20 ` Andrea Arcangeli 0 siblings, 2 replies; 243+ messages in thread
From: yodaiken @ 2000-09-25 17:51 UTC (permalink / raw)
To: Jamie Lokier
Cc: yodaiken, Stephen C. Tweedie, Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 07:18:29PM +0200, Jamie Lokier wrote:
> yodaiken@fsmlabs.com wrote:
> >         walk = out;
> >         while(nfds > 0) {
> >                 poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
> >                 if (!tmp) {
>
> Shouldn't this be GFP_USER? (Which would also conveniently fix the
> problem Victor's pointing out...)

It should probably be GFP_ATOMIC, if I understand the mm right.
The algorithm for requesting a collection of resources and freeing all of them
on failure is simple, fast, and robust.

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
* Re: the new VMt 2000-09-25 17:51 ` yodaiken @ 2000-09-25 18:04 ` Jamie Lokier 2000-09-25 18:13 ` yodaiken 2000-09-25 18:20 ` Andrea Arcangeli 1 sibling, 1 reply; 243+ messages in thread From: Jamie Lokier @ 2000-09-25 18:04 UTC (permalink / raw) To: yodaiken Cc: Stephen C. Tweedie, Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel yodaiken@fsmlabs.com wrote: > > yodaiken@fsmlabs.com wrote: > > > walk = out; > > > while(nfds > 0) { > > > poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL); > > > if (!tmp) { > > > > Shouldn't this be GFP_USER? (Which would also conveniently fix the > > problem Victor's pointing out...) > > It should probably be GFP_ATOMIC, if I understand the mm right. Definitely not. GFP_ATOMIC is reserved for things that really can't swap or schedule right now. Use GFP_ATOMIC indiscriminately and you'll have to increase the number of atomic-allocatable pages. > The algorithm for requesting a collection of reources and freeing all > of them on failure is simple, fast, and robust. Allocation is just as fast with GFP_KERNEL/USER, just less likely to fail and less likely to break something else that really needs GFP_ATOMIC allocations. -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 18:04 ` Jamie Lokier @ 2000-09-25 18:13 ` yodaiken 2000-09-25 18:24 ` Stephen C. Tweedie 0 siblings, 1 reply; 243+ messages in thread
From: yodaiken @ 2000-09-25 18:13 UTC (permalink / raw)
To: Jamie Lokier
Cc: yodaiken, Stephen C. Tweedie, Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 08:04:54PM +0200, Jamie Lokier wrote:
> yodaiken@fsmlabs.com wrote:
> > > yodaiken@fsmlabs.com wrote:
> > > >         walk = out;
> > > >         while(nfds > 0) {
> > > >                 poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
> > > >                 if (!tmp) {
> > >
> > > Shouldn't this be GFP_USER? (Which would also conveniently fix the
> > > problem Victor's pointing out...)
> >
> > It should probably be GFP_ATOMIC, if I understand the mm right.
>
> Definitely not. GFP_ATOMIC is reserved for things that really can't
> swap or schedule right now. Use GFP_ATOMIC indiscriminately and you'll
> have to increase the number of atomic-allocatable pages.

Process 1, 2 and 3 all start allocating 20 pages
    process 1 stalls after allocating 19
    some memory is freed and process 2 runs and stalls after allocating 19
    some memory is freed and process 3 runs and stalls after allocating 19
    now 57 pages are locked up in non-swappable kernel space and the system deadlocks OOM.

> > The algorithm for requesting a collection of resources and freeing all
> > of them on failure is simple, fast, and robust.
>
> Allocation is just as fast with GFP_KERNEL/USER, just less likely to

It's not speed, it's deadlock avoidance.

> fail and less likely to break something else that really needs
> GFP_ATOMIC allocations.

My point here is simply that error returns in memory allocation allow
higher level kernel operations to safely marshal a collection of resources
following a safe algorithm that is optimized for the case when there is
no memory shortage and that only starts going to the slow case when the
system is stalling due to memory shortages anyways.

>
> -- Jamie

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
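A minimal user-space sketch of the all-or-nothing pattern Victor describes above, assuming an allocator that returns an error rather than sleeping. The names `page_pool`, `pool_get_page`, `pool_put_page`, `alloc_vec` and `free_vec` are hypothetical and exist only for illustration; `malloc` stands in for the real page allocator, and nothing here is taken from the kernel source.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the page allocator: a pool with a fixed
 * number of pages that fails immediately instead of sleeping. */
struct page_pool {
    size_t free_pages;
};

/* Returns a page, or NULL when the pool is empty (never blocks). */
void *pool_get_page(struct page_pool *pool)
{
    void *page;

    if (pool->free_pages == 0)
        return NULL;
    page = malloc(4096);
    if (page != NULL)
        pool->free_pages--;
    return page;
}

/* Gives one page back to the pool. */
void pool_put_page(struct page_pool *pool, void *page)
{
    assert(page != NULL);
    free(page);
    pool->free_pages++;
}

/* Hypothetical alloc_vec(): fill vec[0..n-1] with pages, or release
 * everything already taken and return -ENOMEM. The caller never holds
 * a partial set while waiting for the rest. */
int alloc_vec(struct page_pool *pool, void **vec, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        vec[i] = pool_get_page(pool);
        if (vec[i] == NULL) {
            while (i > 0)
                pool_put_page(pool, vec[--i]);
            return -ENOMEM;
        }
    }
    return 0;
}

/* Releases a vector previously filled by alloc_vec(). */
void free_vec(struct page_pool *pool, void **vec, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        pool_put_page(pool, vec[i]);
}
```

With a pool of 57 pages and three callers each wanting 20, two calls succeed and the third gets -ENOMEM with the remaining 17 pages still free, instead of three callers each pinning 19 pages and waiting forever.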
* Re: the new VMt 2000-09-25 18:13 ` yodaiken @ 2000-09-25 18:24 ` Stephen C. Tweedie 2000-09-25 18:34 ` yodaiken 0 siblings, 1 reply; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-25 18:24 UTC (permalink / raw) To: yodaiken Cc: Jamie Lokier, Stephen C. Tweedie, Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel Hi, On Mon, Sep 25, 2000 at 12:13:15PM -0600, yodaiken@fsmlabs.com wrote: > > Definitely not. GFP_ATOMIC is reserved for things that really can't > > swap or schedule right now. Use GFP_ATOMIC indiscriminately and you'll > > have to increase the number of atomic-allocatable pages. > > Process 1,2 and 3 all start allocating 20 pages > process 1 stalls after allocating 19 > some memory is freed and process 2 runs and stall after allocating 19 > some memory is free and process 3 runs and stalls after allocating 19 > > now 57 pages are locked up in non-swapable kernel space and the system deadlocks OOM. Or go the beancounter route: process 1 asks "can I pin 20 pages", gets told "yes", and goes allocating them, blocking as necessary until it gets them. Process 2 asks "can *I* pin 20 pages" and the answer is either "not right now", in which case it waits for process 1 to release its reservation, or "no, you've exceeded your user quota" in which case it fails with ENOMEM. (That latter case can protect us against a lot of DoS attacks from local users.) The same accounting really needs to be done for page tables, as that represents one of the biggest sources of unaccounted, unswappable pages which user processes can cause to be created right now. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
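The reservation step Stephen outlines above could be sketched roughly as follows. This is an illustration under stated assumptions, not the beancounter code itself: `struct beancounter`, `struct pin_budget`, `bc_reserve` and `bc_release` are hypothetical names, and the mapping of "not right now" to -EAGAIN is a choice made for this sketch (the message only names ENOMEM for the over-quota case).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-user counter of pinned (unswappable) pages. */
struct beancounter {
    size_t reserved;   /* pages this user has reserved so far */
    size_t quota;      /* most pages this user may ever pin */
};

/* Hypothetical system-wide budget of pinnable pages. */
struct pin_budget {
    size_t reserved;   /* pages reserved by all users together */
    size_t limit;      /* most pages that may be pinned at once */
};

/*
 * Ask "can I pin n pages?" before allocating anything.
 *   0        reservation granted; the caller may now allocate and block
 *   -EAGAIN  "not right now": wait for another reservation to be released
 *   -ENOMEM  the user would exceed its own quota
 */
int bc_reserve(struct pin_budget *sys, struct beancounter *bc, size_t n)
{
    if (n > bc->quota - bc->reserved)
        return -ENOMEM;
    if (n > sys->limit - sys->reserved)
        return -EAGAIN;
    bc->reserved += n;
    sys->reserved += n;
    return 0;
}

/* Drop a reservation once the pages have been freed again. */
void bc_release(struct pin_budget *sys, struct beancounter *bc, size_t n)
{
    assert(n <= bc->reserved && n <= sys->reserved);
    bc->reserved -= n;
    sys->reserved -= n;
}
```

The quota check comes first so that a request that can never succeed fails outright rather than being told to wait.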
* Re: the new VMt 2000-09-25 18:24 ` Stephen C. Tweedie @ 2000-09-25 18:34 ` yodaiken 2000-09-25 18:48 ` Jamie Lokier 2000-09-25 19:25 ` Stephen C. Tweedie 0 siblings, 2 replies; 243+ messages in thread
From: yodaiken @ 2000-09-25 18:34 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: yodaiken, Jamie Lokier, Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 07:24:53PM +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, Sep 25, 2000 at 12:13:15PM -0600, yodaiken@fsmlabs.com wrote:
>
> > > Definitely not. GFP_ATOMIC is reserved for things that really can't
> > > swap or schedule right now. Use GFP_ATOMIC indiscriminately and you'll
> > > have to increase the number of atomic-allocatable pages.
> >
> > Process 1, 2 and 3 all start allocating 20 pages
> >     process 1 stalls after allocating 19
> >     some memory is freed and process 2 runs and stalls after allocating 19
> >     some memory is freed and process 3 runs and stalls after allocating 19
> >
> > now 57 pages are locked up in non-swappable kernel space and the system deadlocks OOM.
>
> Or go the beancounter route: process 1 asks "can I pin 20 pages", gets
> told "yes", and goes allocating them, blocking as necessary until it

So you have a "pre-allocation allocator"? Leads to interesting and hard to detect
bugs with old code that does not pre-allocate or with code that incorrectly pre-allocates
or that blocks on something unrelated

        preallocate 20 pages
        get first
        ask for an inode -- block waiting for an inode

or

        preallocate 20 pages
        if(checkuserpath())return -ENOWAY; /* stranding my pre-allocate */
        else get them pages

What's nice about these is they don't cause errors on test and seem more
difficult to spot than looking for cases where allocated memory gets stranded.

Doesn't the alloc_vec method seem simpler to you?

> gets them. Process 2 asks "can *I* pin 20 pages" and the answer is
> either "not right now", in which case it waits for process 1 to
> release its reservation, or "no, you've exceeded your user quota" in

Or for someone else to free more pages ...

> which case it fails with ENOMEM. (That latter case can protect us
> against a lot of DoS attacks from local users.)

I like ENOMEM anyways.

>
> The same accounting really needs to be done for page tables, as that
> represents one of the biggest sources of unaccounted, unswappable
> pages which user processes can cause to be created right now.

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
* Re: the new VMt 2000-09-25 18:34 ` yodaiken @ 2000-09-25 18:48 ` Jamie Lokier 2000-09-25 19:25 ` Stephen C. Tweedie 1 sibling, 0 replies; 243+ messages in thread From: Jamie Lokier @ 2000-09-25 18:48 UTC (permalink / raw) To: yodaiken Cc: Stephen C. Tweedie, Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel yodaiken@fsmlabs.com wrote: > > Or go the beancounter route: process 1 asks "can I pin 20 pages", gets > > told "yes", and goes allocating them, blocking as necessary until it > > So you have a "pre-allocation allocator"? Leads to interesting and > hard to detect bugs with old code that does not pre-allocate or with > code that incorrectly pre-allocates or that blocks on something > unrelated I agree with Victor. Relying on code that calls gfp to do the correct accounting in advance, to avoid deadlocks, is not at all robust. Even the best programmers will have off by one errors in that, and the rest of us will blindly write code that works all the time, except for the really obscure case when it fails. Ideally do both: see if you can allocate in advance, then try it, but be prepared to back off and return ENOMEM if that fails. -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 18:34 ` yodaiken 2000-09-25 18:48 ` Jamie Lokier @ 2000-09-25 19:25 ` Stephen C. Tweedie 2000-09-25 20:04 ` yodaiken 1 sibling, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 19:25 UTC (permalink / raw)
To: yodaiken
Cc: Stephen C. Tweedie, Jamie Lokier, Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 12:34:56PM -0600, yodaiken@fsmlabs.com wrote:

> > > Process 1, 2 and 3 all start allocating 20 pages
> > > now 57 pages are locked up in non-swappable kernel space and the system deadlocks OOM.
> >
> > Or go the beancounter route: process 1 asks "can I pin 20 pages", gets
> > told "yes", and goes allocating them, blocking as necessary until it
>
> So you have a "pre-allocation allocator"? Leads to interesting and hard to detect
> bugs with old code that does not pre-allocate or with code that incorrectly pre-allocates
> or that blocks on something unrelated

Right, but if the alternative is spurious ENOMEM when we can satisfy
all of the pending requests just as long as they are serialised, is
this a problem?

If you want, wrap it in a "get_free_pagev" call which returns a vector
of pointers to free pages, doing whatever accounting is needed. You
don't have to push all of it to the callers.

However, you just can't escape from the fact that on low memory
machines, we *need* beancounter-style accounting of pinned pages or
we'll be in Deep Trouble (TM). We already have nasty DoS situations
which are embarrassingly easy to reproduce. If we need such
beancounter protection, AND such protection can prevent the situation
you describe, then do we need to go looking for another way of
achieving the same protection?

--Stephen
* Re: the new VMt 2000-09-25 19:25 ` Stephen C. Tweedie @ 2000-09-25 20:04 ` yodaiken 2000-09-25 20:23 ` Alan Cox ` (2 more replies) 0 siblings, 3 replies; 243+ messages in thread
From: yodaiken @ 2000-09-25 20:04 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: yodaiken, Jamie Lokier, Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 08:25:49PM +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, Sep 25, 2000 at 12:34:56PM -0600, yodaiken@fsmlabs.com wrote:
>
> > > > Process 1, 2 and 3 all start allocating 20 pages
> > > > now 57 pages are locked up in non-swappable kernel space and the system deadlocks OOM.
> > >
> > > Or go the beancounter route: process 1 asks "can I pin 20 pages", gets
> > > told "yes", and goes allocating them, blocking as necessary until it
> >
> > So you have a "pre-allocation allocator"? Leads to interesting and hard to detect
> > bugs with old code that does not pre-allocate or with code that incorrectly pre-allocates
> > or that blocks on something unrelated
>
> Right, but if the alternative is spurious ENOMEM when we can satisfy

An ENOMEM is not spurious if there is not enough memory. UNIX does not ask the
OS to do impossible tricks.

> all of the pending requests just as long as they are serialised, is
> this a problem?

I think you are solving the wrong problem. On a small memory machine, the kernel,
utilities, and applications should be configured to use little memory.
BusyBox is better than BeanCount.

> However, you just can't escape from the fact that on low memory
> machines, we *need* beancounter-style accounting of pinned pages or
> we'll be in Deep Trouble (TM). We already have nasty DoS situations

What we need is simple kernel code that does not hold resources
into a possible deadlock situation.

> which are embarrassingly easy to reproduce. If we need such
> beancounter protection, AND such protection can prevent the situation
> you describe, then do we need to go looking for another way of
> achieving the same protection?

On general principles, I don't see any substitute for clean code in the kernel and
my prediction is that if you show me an example of
DoS vulnerability, I can show you a fix that does not require bean counting.
Am I wrong?

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
* Re: the new VMt 2000-09-25 20:04 ` yodaiken @ 2000-09-25 20:23 ` Alan Cox 2000-09-25 20:35 ` yodaiken 2000-09-25 20:32 ` Stephen C. Tweedie 2000-09-25 23:14 ` Erik Andersen 2 siblings, 1 reply; 243+ messages in thread From: Alan Cox @ 2000-09-25 20:23 UTC (permalink / raw) To: yodaiken Cc: Stephen C. Tweedie, Jamie Lokier, Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel > my prediction is that if you show me an example of > DoS vulnerability, I can show you fix that does not require bean counting. > Am I wrong? I think so. Page tables are a good example -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 20:23 ` Alan Cox @ 2000-09-25 20:35 ` yodaiken 2000-09-25 20:46 ` Alan Cox 2000-09-25 20:47 ` Benjamin C.R. LaHaise 0 siblings, 2 replies; 243+ messages in thread From: yodaiken @ 2000-09-25 20:35 UTC (permalink / raw) To: Alan Cox Cc: yodaiken, Stephen C. Tweedie, Jamie Lokier, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 09:23:48PM +0100, Alan Cox wrote: > > my prediction is that if you show me an example of > > DoS vulnerability, I can show you fix that does not require bean counting. > > Am I wrong? > > I think so. Page tables are a good example I'm not too sure of what you have in mind, but if it is "process creates vast virtual space to generate many page table entries -- using mmap" the answer is, virtual address space quotas and mmap should kill the process on low mem for page tables. > > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > Please read the FAQ at http://www.tux.org/lkml/ -- --------------------------------------------------------- Victor Yodaiken Finite State Machine Labs: The RTLinux Company. www.fsmlabs.com www.rtlinux.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 20:35 ` yodaiken @ 2000-09-25 20:46 ` Alan Cox 2000-09-25 21:07 ` yodaiken 2000-09-25 20:47 ` Benjamin C.R. LaHaise 1 sibling, 1 reply; 243+ messages in thread From: Alan Cox @ 2000-09-25 20:46 UTC (permalink / raw) To: yodaiken Cc: Alan Cox, Stephen C. Tweedie, Jamie Lokier, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel > I'm not too sure of what you have in mind, but if it is > "process creates vast virtual space to generate many page table > entries -- using mmap" > the answer is, virtual address space quotas and mmap should kill > the process on low mem for page tables. Those quotas being exactly what beancounter is -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 20:46 ` Alan Cox @ 2000-09-25 21:07 ` yodaiken 2000-09-26 9:54 ` Stephen C. Tweedie 0 siblings, 1 reply; 243+ messages in thread From: yodaiken @ 2000-09-25 21:07 UTC (permalink / raw) To: Alan Cox Cc: yodaiken, Stephen C. Tweedie, Jamie Lokier, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 09:46:35PM +0100, Alan Cox wrote: > > I'm not too sure of what you have in mind, but if it is > > "process creates vast virtual space to generate many page table > > entries -- using mmap" > > the answer is, virtual address space quotas and mmap should kill > > the process on low mem for page tables. > > Those quotas being exactly what beancounter is But that is a function specific counter, not a counter in the alloc code. -- --------------------------------------------------------- Victor Yodaiken Finite State Machine Labs: The RTLinux Company. www.fsmlabs.com www.rtlinux.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 21:07 ` yodaiken @ 2000-09-26 9:54 ` Stephen C. Tweedie 2000-09-26 13:17 ` yodaiken 0 siblings, 1 reply; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-26 9:54 UTC (permalink / raw) To: yodaiken Cc: Alan Cox, Stephen C. Tweedie, Jamie Lokier, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel Hi, On Mon, Sep 25, 2000 at 03:07:44PM -0600, yodaiken@fsmlabs.com wrote: > On Mon, Sep 25, 2000 at 09:46:35PM +0100, Alan Cox wrote: > > > I'm not too sure of what you have in mind, but if it is > > > "process creates vast virtual space to generate many page table > > > entries -- using mmap" > > > the answer is, virtual address space quotas and mmap should kill > > > the process on low mem for page tables. > > > > Those quotas being exactly what beancounter is > > But that is a function specific counter, not a counter in the > alloc code. Beancounter is a framework for user-level accounting. _What_ you account is up to the callers. Maybe this has been a miscommunication, but beancounter is all about allowing callers to account for stuff before allocation, not about having the page allocation functions themselves enforce quotas. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-26 9:54 ` Stephen C. Tweedie @ 2000-09-26 13:17 ` yodaiken 0 siblings, 0 replies; 243+ messages in thread
From: yodaiken @ 2000-09-26 13:17 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: yodaiken, Alan Cox, Jamie Lokier, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Tue, Sep 26, 2000 at 10:54:23AM +0100, Stephen C. Tweedie wrote:
> Beancounter is a framework for user-level accounting. _What_ you
> account is up to the callers. Maybe this has been a miscommunication,
> but beancounter is all about allowing callers to account for stuff
> before allocation, not about having the page allocation functions
> themselves enforce quotas.

Per-user, system-wide and per-process quotas are one thing; a generic
pre-allocate-and-then-allocate scheme seems to me to be an error-prone
way of getting there.

In particular, I think it is dangerous to have a pre-count that is only
approximately tethered to the thing it is counting -- in the memory
allocation we were discussing, you need to make sure that the
pre-allocations are for memory that is really going to be allocated soon
and that it is later correlated with free in some way.

So, to me, a quota bounded
        allocate_page_table(process_id)
makes much more sense than pre-allocate counting, or, even worse, a "smart"
kmalloc that never fails.

If the problem is unaccounted for page-tables then account for page tables
and return a -EYOURPROCESSISOUTOFCONTROL so that calling kernel code can
take the responsible action.

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
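A rough sketch of the quota-bounded allocation Victor argues for above, where the charge and the allocation happen in one call so the count cannot drift away from the pages it counts. Everything here is an assumption made for illustration: `struct proc_account`, the int-plus-out-pointer signature of `allocate_page_table`, the `free_page_table` helper, and the use of -EDQUOT as a real errno standing in for the tongue-in-cheek -EYOURPROCESSISOUTOFCONTROL; `malloc` stands in for the real page allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-process record of page-table pages. */
struct proc_account {
    size_t page_tables;   /* page-table pages currently charged */
    size_t quota;         /* most page-table pages allowed at once */
};

/*
 * Hypothetical quota-bounded allocation: charge and allocate together.
 *   0        *table points at a new page-table page
 *   -EDQUOT  the process is over its page-table quota
 *   -ENOMEM  the underlying allocator has no memory left
 * On failure *table is set to NULL and nothing is charged.
 */
int allocate_page_table(struct proc_account *acct, void **table)
{
    *table = NULL;
    if (acct->page_tables >= acct->quota)
        return -EDQUOT;
    *table = malloc(4096);   /* stand-in for the real page allocator */
    if (*table == NULL)
        return -ENOMEM;
    acct->page_tables++;
    return 0;
}

/* Uncharge and free in the same place, keeping count and pages in step. */
void free_page_table(struct proc_account *acct, void *table)
{
    assert(table != NULL && acct->page_tables > 0);
    free(table);
    acct->page_tables--;
}
```

Because the error is returned to the caller, the calling code decides what the responsible action is, for example failing the system call or signalling the process.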
* Re: the new VMt 2000-09-25 20:35 ` yodaiken 2000-09-25 20:46 ` Alan Cox @ 2000-09-25 20:47 ` Benjamin C.R. LaHaise 2000-09-25 21:12 ` yodaiken 1 sibling, 1 reply; 243+ messages in thread From: Benjamin C.R. LaHaise @ 2000-09-25 20:47 UTC (permalink / raw) To: yodaiken; +Cc: Stephen C. Tweedie, MM mailing list, linux-kernel On Mon, 25 Sep 2000 yodaiken@fsmlabs.com wrote: > On Mon, Sep 25, 2000 at 09:23:48PM +0100, Alan Cox wrote: > > > my prediction is that if you show me an example of > > > DoS vulnerability, I can show you fix that does not require bean counting. > > > Am I wrong? > > > > I think so. Page tables are a good example > > I'm not too sure of what you have in mind, but if it is > "process creates vast virtual space to generate many page table > entries -- using mmap" > the answer is, virtual address space quotas and mmap should kill > the process on low mem for page tables. No. Page tables are not freed after munmap (and for good reason). The counting of page table "beans" is critical. -ben -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 20:47 ` Benjamin C.R. LaHaise @ 2000-09-25 21:12 ` yodaiken 2000-09-26 10:07 ` Stephen C. Tweedie 0 siblings, 1 reply; 243+ messages in thread From: yodaiken @ 2000-09-25 21:12 UTC (permalink / raw) To: Benjamin C.R. LaHaise Cc: yodaiken, Stephen C. Tweedie, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 04:47:21PM -0400, Benjamin C.R. LaHaise wrote: > On Mon, 25 Sep 2000 yodaiken@fsmlabs.com wrote: > > > On Mon, Sep 25, 2000 at 09:23:48PM +0100, Alan Cox wrote: > > > > my prediction is that if you show me an example of > > > > DoS vulnerability, I can show you fix that does not require bean counting. > > > > Am I wrong? > > > > > > I think so. Page tables are a good example > > > > I'm not too sure of what you have in mind, but if it is > > "process creates vast virtual space to generate many page table > > entries -- using mmap" > > the answer is, virtual address space quotas and mmap should kill > > the process on low mem for page tables. > > No. Page tables are not freed after munmap (and for good reason). The > counting of page table "beans" is critical. I've seen the assertion before, reasons would be interesting. -- --------------------------------------------------------- Victor Yodaiken Finite State Machine Labs: The RTLinux Company. www.fsmlabs.com www.rtlinux.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 21:12 ` yodaiken @ 2000-09-26 10:07 ` Stephen C. Tweedie 2000-09-26 13:30 ` yodaiken 0 siblings, 1 reply; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-26 10:07 UTC (permalink / raw) To: yodaiken Cc: Benjamin C.R. LaHaise, Stephen C. Tweedie, MM mailing list, linux-kernel Hi, On Mon, Sep 25, 2000 at 03:12:50PM -0600, yodaiken@fsmlabs.com wrote: > > > > > > I'm not too sure of what you have in mind, but if it is > > > "process creates vast virtual space to generate many page table > > > entries -- using mmap" > > > the answer is, virtual address space quotas and mmap should kill > > > the process on low mem for page tables. > > > > No. Page tables are not freed after munmap (and for good reason). The > > counting of page table "beans" is critical. > > I've seen the assertion before, reasons would be interesting. Reason 1: under DoS attack, you want to target not the process using the most resources, but the *user* using the most resources (else a fork-bomb style attack can work around your OOM-killer algorithms). Reason 2: if you've got tasks stuck in low-level page allocation routines, then you can't immediately kill -9 them, so reactive OOM killing always has vulnerabilities --- to be robust in preventing resource exhaustion you want limits on the use of those resources before they are exhausted --- the necessary accounting being part of what we refer to as "beancounter". --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
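Stephen's first reason can be illustrated with a small sketch: when usage is only compared per process, a user who spreads the load over many small tasks escapes notice, while per-user totals expose it. The `struct task_usage` layout and the helpers `user_total`, `heaviest_task` and `heaviest_user` are hypothetical, and the quadratic scan is chosen for brevity rather than efficiency.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of one task: its owner and the pinned pages
 * charged to it. */
struct task_usage {
    unsigned int uid;
    size_t pinned_pages;
};

/* Sum the pinned pages charged to `uid` across all tasks. */
size_t user_total(const struct task_usage *tasks, size_t n, unsigned int uid)
{
    size_t i, total = 0;

    for (i = 0; i < n; i++)
        if (tasks[i].uid == uid)
            total += tasks[i].pinned_pages;
    return total;
}

/* Index of the single largest task, or n when there are no tasks. */
size_t heaviest_task(const struct task_usage *tasks, size_t n)
{
    size_t i, best = n;

    for (i = 0; i < n; i++)
        if (best == n || tasks[i].pinned_pages > tasks[best].pinned_pages)
            best = i;
    return best;
}

/* uid whose tasks together pin the most pages. *found is set to 0 when
 * there are no tasks, in which case the return value is meaningless. */
unsigned int heaviest_user(const struct task_usage *tasks, size_t n, int *found)
{
    size_t i, total, best_total = 0;
    unsigned int best_uid = 0;

    assert(found != NULL);
    *found = 0;
    for (i = 0; i < n; i++) {
        total = user_total(tasks, n, tasks[i].uid);
        if (!*found || total > best_total) {
            best_total = total;
            best_uid = tasks[i].uid;
            *found = 1;
        }
    }
    return best_uid;
}
```

With one 50-page task owned by uid 1 and eight 10-page tasks owned by uid 2, `heaviest_task` points at uid 1 while `heaviest_user` returns uid 2, which is the fork-bomb case described above.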
* Re: the new VMt 2000-09-26 10:07 ` Stephen C. Tweedie @ 2000-09-26 13:30 ` yodaiken 0 siblings, 0 replies; 243+ messages in thread
From: yodaiken @ 2000-09-26 13:30 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: yodaiken, Benjamin C.R. LaHaise, MM mailing list, linux-kernel

On Tue, Sep 26, 2000 at 11:07:36AM +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, Sep 25, 2000 at 03:12:50PM -0600, yodaiken@fsmlabs.com wrote:
> > > >
> > > > I'm not too sure of what you have in mind, but if it is
> > > > "process creates vast virtual space to generate many page table
> > > > entries -- using mmap"
> > > > the answer is, virtual address space quotas and mmap should kill
> > > > the process on low mem for page tables.
> > >
> > > No. Page tables are not freed after munmap (and for good reason). The
> > > counting of page table "beans" is critical.
> >
> > I've seen the assertion before, reasons would be interesting.
>
> Reason 1: under DoS attack, you want to target not the process using
> the most resources, but the *user* using the most resources (else a
> fork-bomb style attack can work around your OOM-killer algorithms).

Ok.
        if(over_allocated_page_tables(task->uid) ) return ENOMEM;
makes sense in "fork".

I guess the argument here is not about whether accounting is good, it's
about where the accounting should be done. To me the alternatives of

        if(preallocate_pages(page_table_size_for_this_process()) == -1)return error
        then actually allocate making sure to adjust counts if some
        other error turns up and with something taking care of how the pre-allocation
        works while we are sleeping waiting for possibly unrelated resources.

or

        just kmalloc with kmalloc magically juggling resources in some safe way

seem less clear.

> Reason 2: if you've got tasks stuck in low-level page allocation
> routines, then you can't immediately kill -9 them, so reactive OOM
> killing always has vulnerabilities --- to be robust in preventing
> resource exhaustion you want limits on the use of those resources
> before they are exhausted --- the necessary accounting being part of
> what we refer to as "beancounter".

Doesn't the problem really come from low level page allocation at too high a level?
That is, instead of select doing get_free_page, maybe it should do
        get_per_process_page(myprocess)
or even
        get_per_process_file_use_page(myprocess)

Then we could have a config-optional per-process pinned page accounting with
the possibility of doing something sensible in a user-space daemon when
memory is low.

>
> --Stephen

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
* Re: the new VMt 2000-09-25 20:04 ` yodaiken 2000-09-25 20:23 ` Alan Cox @ 2000-09-25 20:32 ` Stephen C. Tweedie 2000-09-26 12:10 ` Mark Hemment 2000-09-25 23:14 ` Erik Andersen 2 siblings, 1 reply; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-25 20:32 UTC (permalink / raw) To: yodaiken Cc: Stephen C. Tweedie, Jamie Lokier, Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel Hi, On Mon, Sep 25, 2000 at 02:04:19PM -0600, yodaiken@fsmlabs.com wrote: > > Right, but if the alternative is spurious ENOMEM when we can satisfy > > An ENOMEM is not spurious if there is not enough memory. UNIX does not ask the > OS to do impossible tricks. Yes, but the ENOMEM _is_ spurious if you actually meant EAGAIN, and if the OS was perfectly capable of doing the retry itself. > > all of the pending requests just as long as they are serialised, is > > this a problem? > > I think you are solving the wrong problem. On a small memory machine, the kernel, > utilities, and applications should be configured to use little memory. > BusyBox is better than BeanCount. Any box is a small memory machine if you get the wrong workload on it, and the DoS attacks which are possible without beancounting let any user bring even a large system to its knees right now. If solving that problem also means that small memory machines do the right thing on their own rather than requiring specific manual configuration, then it sounds like a good aim. > > However, you just can't escape from the fact that on low memory > > machinnes, we *need* beancounter-style accounting of pinned pages or > > we'll be in Deep Trouble (TM). We already have nasty DoS situations > > What we need is simple kernel code that does not hold resources > into a possible deadlock situation. 
<nod> > On general principles, I don't see any substitute for clean code in the kernel and > my prediction is that if you show me an example of > DoS vulnerability, I can show you fix that does not require bean counting. > Am I wrong? If you have a user forking multiple processes and exhausting some resource, then at some point you have to do something about it. Let's say it's page tables, just for argument's sake, because those are currently non-swappable, but even if you make those swappable there are plenty of other resources it might be (eg. data shoved down unix domain sockets if you want another example). So you have run out of physical memory --- what do you do about it? The important observation here is that in a multi-user environment, simply denying further allocations isn't good enough --- unless you revoke those existing allocations you have DoS. And you can't fairly revoke existing allocations without knowing WHICH user has exhausted the memory (which requires beancounter-style resource tracking), AND having mechanisms in place to revoke all of the possible resources which might be involved (eg unix domain socket datagrams). kill -9 might solve that latter problem but it doesn't help in identifying who to kill. --Stephen > > > > > > -- > --------------------------------------------------------- > Victor Yodaiken > Finite State Machine Labs: The RTLinux Company. > www.fsmlabs.com www.rtlinux.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
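The point about knowing WHICH user has exhausted memory can be illustrated with a small sketch. It compares two ways of picking a culprit from the same per-task usage figures: the single largest task, and the user whose tasks together hold the most pages. The struct task_usage layout and both function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

struct task_usage {
    int  pid;
    int  uid;
    long pages;   /* pinned pages charged to this task */
};

/* Per-process view: pid of the single largest task, or -1 if n == 0. */
int heaviest_task(const struct task_usage *t, size_t n)
{
    int best_pid = -1;
    long best = -1;
    size_t i;

    assert(t != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (t[i].pages > best) {
            best = t[i].pages;
            best_pid = t[i].pid;
        }
    }
    return best_pid;
}

/*
 * Per-user view: uid whose tasks together hold the most pages,
 * or -1 if n == 0. Quadratic, which is fine for a sketch.
 */
int heaviest_user(const struct task_usage *t, size_t n)
{
    int best_uid = -1;
    long best_total = -1;
    size_t i, j;

    assert(t != NULL || n == 0);
    for (i = 0; i < n; i++) {
        long total = 0;

        for (j = 0; j < n; j++)
            if (t[j].uid == t[i].uid)
                total += t[j].pages;
        if (total > best_total) {
            best_total = total;
            best_uid = t[i].uid;
        }
    }
    return best_uid;
}
```

With one 300-page task owned by uid 500 and four 100-page tasks owned by uid 666, heaviest_task() points at the uid 500 task while heaviest_user() points at uid 666, which is the fork-bomb case where per-process accounting picks the wrong target.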
* Re: the new VMt 2000-09-25 20:32 ` Stephen C. Tweedie @ 2000-09-26 12:10 ` Mark Hemment 2000-09-27 10:13 ` Andrey Savochkin 0 siblings, 1 reply; 243+ messages in thread From: Mark Hemment @ 2000-09-26 12:10 UTC (permalink / raw) To: Stephen C. Tweedie Cc: yodaiken, Jamie Lokier, Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, 25 Sep 2000, Stephen C. Tweedie wrote:
> So you have run out of physical memory --- what do you do about it?

Why let the system get into the state where it is necessary to kill a process?

Per-user/task resource counters should prevent unprivileged users from soaking up too many resources. That is the DoS protection.

So an OOM is possibly:

  1) A privileged, legally resource hungry, app(s) has taken all the memory. Could be too important to simply kill (it should exit gracefully).
  2) Simply too many tasks*(memory-requirements-of-each-task).

Ignoring allocations done by the kernel, the situation comes down to the fact that the system has over committed its memory resources. ie. it has sold too many tickets for the number of seats in the plane, and all the passengers have turned up. (note, I use the term "memory" and not "physical memory", I'm including swap space).

Why not protect the system from over committing its memory resources?

It is possible to do true, system wide, resource counting of physical memory and swap space, and to deny a fork() or mmap() which would cause over committing of memory resources if everyone cashed in their requirements.

Named pages (those which came from a file) are the simplest to handle. If dirty, they already have allocated backing store, so we know there is somewhere to put them when memory is low. How many named pages need to be held in physical memory at any one instance for the system to function? Only a few, although if you reach that state, the system will be thrashing itself to death.

Anonymous and copied (those faulted from a write to a MAP_PRIVATE mapping with PROT_WRITE) pages can be stored in either physical memory or on swap. To avoid getting into the OOM situation, when these mappings are created the system needs to check that it has (and will have, in the future) space for every page that _could_ be allocated for the mapping - ie. work out the worst case (including page-tables). This space could be on swap or in physical memory. It is the accounting which needs to be done, not the actual allocation (and not even the decision of where to store the page when allocated - that is made much later, when it needs to be). If a machine has 2GB of RAM, a 1MB swap, and 1GB of dirty anon or copied pages, that is fine.

I'm stressing this point, as the scheme of reserving space for an (as yet) unallocated page is sometimes referred to as "eager swap allocation" (or some such similar term). This is confusing. People then start to believe they need backing store for each anon/copied page. You don't. You simply need somewhere to store it, and that could be a physical page. It is all in the accounting. :)

Allocations made by the kernel, for the kernel, are (obviously) pinned memory. To ensure kernel allocations do not completely exhaust physical memory (or cause physical memory to be over committed if the worst case occurs), they need to be limited. How to limit? As a first guess (and this is only a guess):

  1) don't let kernel allocations exceed 25% of physical memory (tunable)
  2) don't let kernel allocations succeed if they would cause over commitment.

Both conditions would need to pass before an allocation could succeed. This does need much more thought. Should some tuning be per subsystem? I don't know.... Perhaps 1) isn't needed. I'm not sure.

Because of 2), the total physical memory accounted for anon/copied pages needs to have a high watermark. Otherwise, in the accounting, the system could allow too much physical memory to be reserved for these types of pages (there doesn't need to be space on swap for each anon/copied page, just space somewhere - a watermark would prevent too much of this being physical memory). Note, this doesn't mean start swapping earlier - remember, this is accounting of anon/copied pages to avoid over commitment.

For named pages, the page cache needs to have a reserved number of physical pages (ie. how small is it allowed to get, before pruning stops). Again, these reserved pages are in the accounting.

mlock()ed pages also need accounting, to prevent over commitment of physical memory.

All fun.

The disadvantages:

  1) Extra code to do the accounting. This shouldn't be too heavy.

  2) mmap(MAP_ANON)/mmap(MAP_PRIVATE) with PROT_WRITE can fail more readily. Programs which expect to memory map areas (which would create anon/copied pages when written to) will see an increased failure rate in mmap(). This can be very annoying, especially when you know the mapping will be used sparsely.

     One solution is to add a new mmap() flag, which tells the kernel to let this mmap() exceed the actual resources. With such a flag, the mmap() will be allowed, but the task should expect to be killed if memory is exhausted. (It could be possible for the kernel to deliver a SIGDANGER signal to such a task, as in AIX, to give it a chance of reducing its requirements on the system or to exit gracefully.)

     Another solution is to allow the strict resource accounting to be overridden on a global basis. Say, by allowing the system to over commit the memory resources by 10%. This does remove the absolute protection, but leaves some in place. The OOM killer would come into play if the system did over commit. Those who don't need/want protection could set the over commit to some large value. 500%?

  3) fork() failures. There is the problem of a large(ish) process wanting to run a small program. Say, a shell wanting to run a simple utility. Because of the memory resource accounting, the fork() is disallowed as the newly created child could (in theory) write to mmap()ed areas, creating anon/copied pages which would cause the kernel to (in the worst case) be OOM for user-pages. Given that the child will almost immediately do an exec(), which could well succeed, this is frustrating. Again, a small over commit kludge would reduce (but not eliminate) this occurrence.

     An idea from a colleague is to allow such a fork() to succeed, but to run the child process in a "container". Inside the container, the child is allowed to perform operations which would be expected before an exec(). Such operations could be closing file descriptors. However, if it tries to do something which would _seriously_ affect the state of the system (such as remove a file), then it is killed. ie. give it a chance to do an exec(). This could be done by running with an alternative system call table for the child process, which refers to bounce functions within the kernel where the checks are done (ie. don't load the common code path with the checks). This could be tricky to do, and there could well be a few system (library?) calls which would make it impossible. However, if it could be achieved, it would remove one of the most annoying "features" of over commitment protection.

This sort of protection isn't to prevent DoS attacks; as said above, they need to be handled on a per user/task level. This protection is to protect against asynchronous failures on page faults due to OOM, and to make them synchronous (from mmap(), fork(), mlock(), etc) where programs are expected to test for an error code. There isn't much an application can do with a synchronous memory failure; sleep and try again, release some of its own resources, or exit gracefully.

Anyway, I've skipped over a lot of interesting details (and problems). This stuff isn't new. Some commercial OSes have this type of protection.

Comments?

Mark
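The accounting scheme described in this message can be sketched as a single system-wide counter of worst-case pages that is checked at mmap()/fork() time, with a global over commit percentage as the escape hatch. The code below is a minimal user-space illustration under stated assumptions: struct commit_state, commit_reserve() and commit_release() are hypothetical names, and the sketch ignores the kernel-allocation limit, the watermark and the page-cache reserve that the message also mentions.

```c
#include <assert.h>
#include <errno.h>

struct commit_state {
    long ram_pages;        /* physical memory, in pages */
    long swap_pages;       /* swap space, in pages */
    long committed;        /* worst-case pages promised so far */
    int  overcommit_pct;   /* 0 = strict accounting, 10 = allow 10% over */
};

/* Total pages that may be promised: RAM plus swap, plus the allowed excess. */
static long commit_limit(const struct commit_state *s)
{
    long total = s->ram_pages + s->swap_pages;

    return total + total * s->overcommit_pct / 100;
}

/*
 * Called when a mapping is created (or at fork()): reserve the worst
 * case up front. Only the count changes; no page is allocated and no
 * decision is made about RAM versus swap.
 */
int commit_reserve(struct commit_state *s, long pages)
{
    assert(pages >= 0);
    if (s->committed + pages > commit_limit(s))
        return -ENOMEM;    /* fail synchronously, at mmap()/fork() time */
    s->committed += pages;
    return 0;
}

/* Called at munmap()/exit(): give the reservation back. */
void commit_release(struct commit_state *s, long pages)
{
    assert(pages >= 0 && pages <= s->committed);
    s->committed -= pages;
}
```

Taking the example from the message with 4KB pages, 2GB of RAM is 524288 pages and 1MB of swap is 256 pages, so reserving 1GB (262144 pages) of anon/copied pages succeeds even though swap alone could never hold them: the reservation only needs space somewhere.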
* Re: the new VMt 2000-09-26 12:10 ` Mark Hemment @ 2000-09-27 10:13 ` Andrey Savochkin 2000-09-27 12:55 ` Hugh Dickins 0 siblings, 1 reply; 243+ messages in thread From: Andrey Savochkin @ 2000-09-27 10:13 UTC (permalink / raw) To: Mark Hemment Cc: yodaiken, Jamie Lokier, Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel, Stephen C. Tweedie Hello, On Tue, Sep 26, 2000 at 01:10:30PM +0100, Mark Hemment wrote: > > On Mon, 25 Sep 2000, Stephen C. Tweedie wrote: > > So you have run out of physical memory --- what do you do about it? > > Why let the system get into the state where it is neccessary to kill a > process? > Per-user/task resource counters should prevent unprivileged users from > soaking up too many resources. That is the DoS protection. > [snip] > It is possible to do true, system wide, resource counting of physical > memory and swap space, and to deny a fork() or mmap() which would cause > over committing of memoy resources if everyone cashed in their > requirements. [snip] People use overcommitting not because they are fans of the idea. Overcommitting simply is the _efficient_ way of resource sharing. It's a waste of resources to reserve memory+swap for the case that every running process decides to modify libc code (and, thus, should receive its private copy of the pages). A real waste! I always agree to take the risk of some applications being killed in such a case of all processes turning crazy. The approach I believe in is: - ensure that accidental or intentional madness of applications of one user may cause only limited damage to other users; and - introduce a way to tell the kernel that some applications should be saved longer than others when troubles begin and ways to set up some guaranteed amounts for important processes. Certainly, a lot of processes may consume more than their guarantee until bad things start to happen. 
Then the rules of user protection and killing order apply. That's how I develop the resource control in the beancounter patch ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/UserBeancounter.html#s7 Best regards Andrey -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-27 10:13 ` Andrey Savochkin @ 2000-09-27 12:55 ` Hugh Dickins 2000-09-28 3:25 ` Andrey Savochkin 0 siblings, 1 reply; 243+ messages in thread From: Hugh Dickins @ 2000-09-27 12:55 UTC (permalink / raw) To: Andrey Savochkin; +Cc: Mark Hemment, MM mailing list, linux-kernel On Wed, 27 Sep 2000, Andrey Savochkin wrote: > > It's a waste of resources to reserve memory+swap for the case that every > running process decides to modify libc code (and, thus, should receive its > private copy of the pages). A real waste! A real waste indeed, but a bad example: libc code is mapped read-only, so nobody would recommend reserving memory+swap for private mods to it. Of course, a process might choose to mprotect it writable at some time, that would be when to refuse if overcommitted. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-27 12:55 ` Hugh Dickins @ 2000-09-28 3:25 ` Andrey Savochkin 0 siblings, 0 replies; 243+ messages in thread From: Andrey Savochkin @ 2000-09-28 3:25 UTC (permalink / raw) To: Hugh Dickins; +Cc: Mark Hemment, MM mailing list, linux-kernel Hello, On Wed, Sep 27, 2000 at 01:55:52PM +0100, Hugh Dickins wrote: > On Wed, 27 Sep 2000, Andrey Savochkin wrote: > > > > It's a waste of resources to reserve memory+swap for the case that every > > running process decides to modify libc code (and, thus, should receive its > > private copy of the pages). A real waste! > > A real waste indeed, but a bad example: libc code is mapped read-only, > so nobody would recommend reserving memory+swap for private mods to it. > Of course, a process might choose to mprotect it writable at some time, > that would be when to refuse if overcommitted. Returning error from mprotect() call for private mappings? It wouldn't be what people expect... The other example where overcommit makes sense is fork() (not vfork) and immediate exec in one of the threads. Best regards Andrey -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 20:04 ` yodaiken 2000-09-25 20:23 ` Alan Cox 2000-09-25 20:32 ` Stephen C. Tweedie @ 2000-09-25 23:14 ` Erik Andersen 2000-09-26 15:17 ` yodaiken 2 siblings, 1 reply; 243+ messages in thread From: Erik Andersen @ 2000-09-25 23:14 UTC (permalink / raw) To: yodaiken; +Cc: MM mailing list, linux-kernel On Mon Sep 25, 2000 at 02:04:19PM -0600, yodaiken@fsmlabs.com wrote: > > > all of the pending requests just as long as they are serialised, is > > this a problem? > > I think you are solving the wrong problem. On a small memory machine, the kernel, > utilities, and applications should be configured to use little memory. > BusyBox is better than BeanCount. > Granted that smaller apps can help -- for a particular workload. But while I am very partial to BusyBox (in fact I am about to cut a new release) I can assure you that OOM is easily possible even when your user space is tiny. I do it all the time. There are mallocs in busybox and when under memory pressure, the kernel still tends to fall over... -Erik -- Erik B. Andersen email: andersee@debian.org --This message was written using 73% post-consumer electrons-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 23:14 ` Erik Andersen @ 2000-09-26 15:17 ` yodaiken 2000-09-26 16:04 ` Stephen C. Tweedie 0 siblings, 1 reply; 243+ messages in thread From: yodaiken @ 2000-09-26 15:17 UTC (permalink / raw) To: yodaiken, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 05:14:11PM -0600, Erik Andersen wrote: > On Mon Sep 25, 2000 at 02:04:19PM -0600, yodaiken@fsmlabs.com wrote: > > > > > all of the pending requests just as long as they are serialised, is > > > this a problem? > > > > I think you are solving the wrong problem. On a small memory machine, the kernel, > > utilities, and applications should be configured to use little memory. > > BusyBox is better than BeanCount. > > > > Granted that smaller apps can help -- for a particular workload. But while I > am very partial to BusyBox (in fact I am about to cut a new release) I can > assure you that OOM is easily possible even when your user space is tiny. I do > it all the time. There are mallocs in busybox and when under memory pressure, > the kernel still tends to fall over... Operating systems cannot make more memory appear by magic. The question is really about the best strategy for dealing with low memory. In my opinion, the OS should not try to out-think physical limitations. Instead, the OS should take as little space as possible and provide the ability for user level clever management of space. In a truly embedded system, there can easily be a user level root process that watches memory usage and prevents DOS attacks -- if the OS provides settable enforced quotas etc. -- --------------------------------------------------------- Victor Yodaiken Finite State Machine Labs: The RTLinux Company. www.fsmlabs.com www.rtlinux.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-26 15:17 ` yodaiken @ 2000-09-26 16:04 ` Stephen C. Tweedie 2000-09-26 17:02 ` Erik Andersen 0 siblings, 1 reply; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-26 16:04 UTC (permalink / raw) To: yodaiken; +Cc: MM mailing list, linux-kernel Hi, On Tue, Sep 26, 2000 at 09:17:44AM -0600, yodaiken@fsmlabs.com wrote: > Operating systems cannot make more memory appear by magic. > The question is really about the best strategy for dealing with low memory. In my > opinion, the OS should not try to out-think physical limitations. Instead, the OS > should take as little space as possible and provide the ability for user level > clever management of space. In a truly embedded system, there can easily be a user level > root process that watches memory usage and prevents DOS attacks -- if the OS provides > settable enforced quotas etc. Agreed, absolutely. The beancounter is one approach to those quotas, and has the advantage of allowing per-user as well as per-process quotas. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-26 16:04 ` Stephen C. Tweedie @ 2000-09-26 17:02 ` Erik Andersen 2000-09-26 17:08 ` Stephen C. Tweedie 0 siblings, 1 reply; 243+ messages in thread From: Erik Andersen @ 2000-09-26 17:02 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: yodaiken, MM mailing list, linux-kernel On Tue Sep 26, 2000 at 05:04:06PM +0100, Stephen C. Tweedie wrote: > Hi, > > On Tue, Sep 26, 2000 at 09:17:44AM -0600, yodaiken@fsmlabs.com wrote: > > > Operating systems cannot make more memory appear by magic. > > The question is really about the best strategy for dealing with low memory. In my > > opinion, the OS should not try to out-think physical limitations. Instead, the OS > > should take as little space as possible and provide the ability for user level > > clever management of space. In a truly embedded system, there can easily be a user level > > root process that watches memory usage and prevents DOS attacks -- if the OS provides > > settable enforced quotas etc. > > Agreed, absolutely. The beancounter is one approach to those quotas, > and has the advantage of allowing per-user as well as per-process > quotas. Another approach would be to let user space turn off overcommit. That way, user space can be assured there will be no surprises... -Erik -- Erik B. Andersen email: andersee@debian.org --This message was written using 73% post-consumer electrons-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-26 17:02 ` Erik Andersen @ 2000-09-26 17:08 ` Stephen C. Tweedie 2000-09-26 17:45 ` Erik Andersen 2000-09-26 21:13 ` Eric Lowe 0 siblings, 2 replies; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-26 17:08 UTC (permalink / raw) To: Stephen C. Tweedie, yodaiken, MM mailing list, linux-kernel Hi, On Tue, Sep 26, 2000 at 11:02:48AM -0600, Erik Andersen wrote: > Another approach would be to let user space turn off overcommit. No. Overcommit only applies to pageable memory. Beancounter is really needed for non-pageable resources such as page tables and mlock()ed pages. Cheers, Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-26 17:08 ` Stephen C. Tweedie @ 2000-09-26 17:45 ` Erik Andersen 2000-09-27 10:20 ` Andrey Savochkin 2000-09-26 21:13 ` Eric Lowe 1 sibling, 1 reply; 243+ messages in thread From: Erik Andersen @ 2000-09-26 17:45 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: yodaiken, MM mailing list, linux-kernel

On Tue Sep 26, 2000 at 06:08:20PM +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, Sep 26, 2000 at 11:02:48AM -0600, Erik Andersen wrote:
>
> > Another approach would be to let user space turn off overcommit.
>
> No. Overcommit only applies to pageable memory. Beancounter is
> really needed for non-pageable resources such as page tables and
> mlock()ed pages.

I think we do agree here, though we are having problems with semantics. "Overcommit" to me is the same thing that Mark Hemment described earlier in this thread -- the "fact that the system has over committed its memory resources. ie. it has sold too many tickets for the number of seats in the plane, and all the passengers have turned up." Basically any case where too many tickets have been sold (applied to the entire system, and all subsystems).

To extend the airplane metaphor a bit past credibility... When an airline sells too many tickets, it bribes people to get off the plane. The kernel, in the same situation, tends to fall over, or starts kicking off pilots and flight attendants.

If the Beancounter patch lets the kernel count "passengers", classify them (with user hinting) so the pilot and flight attendants (init, X, or whatever) always stay on the plane, and has some sane predictable mechanism for booting non-privileged passengers, then I am all for it.

How does one provide the kernel with hints as to which processes are sacred? Where does one find this beancounter patch? How much weight does it add to the kernel?

 -Erik

--
Erik B. Andersen   email:  andersee@debian.org
--This message was written using 73% post-consumer electrons--
* Re: the new VMt 2000-09-26 17:45 ` Erik Andersen @ 2000-09-27 10:20 ` Andrey Savochkin 0 siblings, 0 replies; 243+ messages in thread From: Andrey Savochkin @ 2000-09-27 10:20 UTC (permalink / raw) To: Erik Andersen; +Cc: Stephen C. Tweedie, yodaiken, MM mailing list, linux-kernel On Tue, Sep 26, 2000 at 11:45:02AM -0600, Erik Andersen wrote: [snip] > "Overcommit" to me is the same things as Mark Hemment stated earlier in this > thread -- the "fact that the system has over committed its memory resources. > ie. it has sold too many tickets for the number of seats in the plane, and all > the passengers have turned up." Basically any case where too many tickets > have been sold (applied to the entire system, and all subsystems). [snip] > If the Beancounter patch lets the kernel count "passengers", classify them > (with user hinting) so the pilot and flight attendants (init, X, or whatever) > always stay on the plane, and has some sane predictable mechanism for booting > non-priveledged passengers, then I am all for it. That's exactly what I'm doing. > How does one provide the kernel with hints as to which processes are sacred? > Where does one find this beancounter patch? How much weight does it add to > the kernel? ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/UserBeancounter.html The current version has some drawbacks, and one of them is the performance. Memory accounting is implemented as a kernel thread which goes through page tables of processes (similar to kswapd), and it appears to consume 1-5% of CPU (depending on number of processes). I consider it unacceptable, and have started reimplementation of the process memory accounting from the beginning. Best regards Andrey -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-26 17:08 ` Stephen C. Tweedie 2000-09-26 17:45 ` Erik Andersen @ 2000-09-26 21:13 ` Eric Lowe 1 sibling, 0 replies; 243+ messages in thread From: Eric Lowe @ 2000-09-26 21:13 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: yodaiken, MM mailing list, linux-kernel Hello, > > Another approach would be to let user space turn off overcommit. > > No. Overcommit only applies to pageable memory. Beancounter is > really needed for non-pageable resources such as page tables and > mlock()ed pages. > In addition to beancounter, do you think pageable page tables are something we want to tackle in 2.5.x? 4MB page mappings on x86 could be cool too, as an option... -- Eric Lowe FibreChannel Software Engineer, Systran Corporation elowe@systran.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VMt 2000-09-25 17:51 ` yodaiken 2000-09-25 18:04 ` Jamie Lokier @ 2000-09-25 18:20 ` Andrea Arcangeli 1 sibling, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 18:20 UTC (permalink / raw) To: yodaiken Cc: Jamie Lokier, Stephen C. Tweedie, Alan Cox, mingo, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 11:51:39AM -0600, yodaiken@fsmlabs.com wrote:
> It should probably be GFP_ATOMIC, if I understand the mm right.

poll_wait is called from the f_op->poll callback from select just before a sleep, and since it's allowed to sleep too it should be GFP_KERNEL (not ATOMIC). Using GFP_ATOMIC where GFP_KERNEL can be used is a bug, and it can lead to failed allocations even while there's a huge amount of freeable/recyclable cache.

The reason it's GFP_KERNEL rather than GFP_USER is that the memory isn't allocated in userspace. On a solid VM the only difference between GFP_USER and GFP_KERNEL happens to be when the machine runs truly out of memory. In 2.4.x GFP_KERNEL should probably be changed not to short the PF_MEMALLOC atomic queue when memory balancing fails (then they would be equal).

> The algorithm for requesting a collection of resources and freeing all of them
> on failure is simple, fast, and robust.

Yes, I tend to like that style too because it's obviously safe and it obviously can't deadlock during oom.

Andrea
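The "request a collection of resources and free all of them on failure" style can be sketched in plain C as follows. This is a user-space illustration only: a byte budget stands in for free memory so that failure can actually be exercised, and try_alloc(), give_back(), acquire_all() and release_all() are names invented for the sketch. The property that matters is that the allocator fails instead of sleeping, and that the caller never keeps a partial set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption for the sketch: a byte budget stands in for free memory. */
static size_t budget = 48 * 1024;

/* Hand out memory only while the budget lasts; never sleeps or retries. */
static void *try_alloc(size_t size)
{
    void *p;

    if (size > budget)
        return NULL;
    p = malloc(size);
    if (p != NULL)
        budget -= size;
    return p;
}

static void give_back(void *p, size_t size)
{
    free(p);
    budget += size;
}

/*
 * Acquire n buffers of the given sizes, or none of them.
 * Returns 0 on success. On failure everything obtained so far is
 * freed, every slot is left NULL, and -1 is returned.
 */
int acquire_all(void *bufs[], const size_t sizes[], size_t n)
{
    size_t i;

    assert(bufs != NULL && sizes != NULL);
    for (i = 0; i < n; i++)
        bufs[i] = NULL;
    for (i = 0; i < n; i++) {
        bufs[i] = try_alloc(sizes[i]);
        if (bufs[i] == NULL) {
            /* Roll back in reverse order so nothing stays pinned. */
            while (i > 0) {
                i--;
                give_back(bufs[i], sizes[i]);
                bufs[i] = NULL;
            }
            return -1;
        }
    }
    return 0;
}

/* Release a set previously obtained with acquire_all(). */
void release_all(void *bufs[], const size_t sizes[], size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        give_back(bufs[i], sizes[i]);
        bufs[i] = NULL;
    }
}
```

Because a failed caller holds nothing while it decides what to do next, two callers can never end up each waiting for memory the other one is sitting on.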
* Re: the new VMt 2000-09-25 15:16 ` the new VMt Alan Cox ` (2 preceding siblings ...) 2000-09-25 15:42 ` Stephen C. Tweedie @ 2000-09-25 16:16 ` Rik van Riel 2000-09-25 16:55 ` Alan Cox 3 siblings, 1 reply; 243+ messages in thread From: Rik van Riel @ 2000-09-25 16:16 UTC (permalink / raw) To: Alan Cox Cc: mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Alan Cox wrote:

> > > GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get
> > > everything jammed in kernel space waiting on GFP_KERNEL and if the
> > > swapper cannot make space you die.
> >
> > if one can get everything jammed waiting for GFP_KERNEL, and not being
> > able to deallocate anything, thats a VM or resource-limit bug. This
> > situation is just 1% RAM away from the 'root cannot log in', situation.
>
> Unless Im missing something here think about this case
>
> 2 active processes, no swap
>
> #1                  #2
> kmalloc 32K         kmalloc 16K
> OK                  OK
> kmalloc 16K         kmalloc 32K
> block               block
>
> so GFP_KERNEL has to be able to fail - it can wait for I/O in
> some cases with care, but when we have no pages left something
> has to give

The trick here is to:

1) keep some reserved pages around for PF_MEMALLOC tasks
   (we need this anyway)
2) set PF_MEMALLOC on the task you're killing for OOM,
   that way this task will either get the memory or
   fail (note that PF_MEMALLOC tasks don't wait)

This way the OOM-killed task will be able to exit quickly
and the rest of the system will not get killed as a side
effect.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
* Re: the new VMt 2000-09-25 16:16 ` Rik van Riel @ 2000-09-25 16:55 ` Alan Cox 0 siblings, 0 replies; 243+ messages in thread From: Alan Cox @ 2000-09-25 16:55 UTC (permalink / raw) To: Rik van Riel Cc: Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel > > kmalloc 16K kmalloc 32K > > block block > > > 2) set PF_MEMALLOC on the task you're killing for OOM, > that way this task will either get the memory or > fail (note that PF_MEMALLOC tasks don't wait) Nobody is out of memory at this point. Everyone is in kernel space blocking for someone else. There is also no further allocation after this deadlock point to cause a kill -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
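Alan's two-process case can be replayed with a toy userspace model. Everything here is an assumption made for illustration: a fixed 48K pool with no swap, the names toy_task, toy_alloc() and run_scenario(), and the simplification that a "blocked" task is one whose request does not fit. It is not how the kernel allocator works; it only shows why an allocation that may fail breaks the cycle while one that must succeed does not.

```c
#include <assert.h>
#include <stdbool.h>

enum { POOL_KB = 48 };       /* assumed total memory, no swap */

struct toy_task {
	int held_kb;         /* memory this task currently holds */
};

static int free_kb;

static bool toy_alloc(struct toy_task *t, int kb)
{
	if (kb > free_kb)
		return false;
	free_kb -= kb;
	t->held_kb += kb;
	return true;
}

static void toy_release_all(struct toy_task *t)
{
	free_kb += t->held_kb;
	t->held_kb = 0;
}

/*
 * Replays the sequence from the mail. Returns how many of the two
 * tasks manage to complete. With may_fail == false a request that
 * does not fit is treated as sleeping forever.
 */
static int run_scenario(bool may_fail)
{
	struct toy_task t1 = { 0 }, t2 = { 0 };
	int completed = 0;
	bool t1_ok, t2_ok;

	free_kb = POOL_KB;
	toy_alloc(&t1, 32);              /* #1: kmalloc 32K, OK */
	toy_alloc(&t2, 16);              /* #2: kmalloc 16K, OK */

	t1_ok = toy_alloc(&t1, 16);      /* #1: kmalloc 16K */
	t2_ok = toy_alloc(&t2, 32);      /* #2: kmalloc 32K */

	if (!t1_ok && !t2_ok && !may_fail)
		return 0;                /* both wait, nobody ever frees */

	/* #1 gets an error, unwinds and gives back what it held. */
	if (!t1_ok)
		toy_release_all(&t1);

	/* #2 retries and can now finish. */
	if (t2_ok || toy_alloc(&t2, 32)) {
		completed++;
		toy_release_all(&t2);
	}

	/* #1 restarts from scratch once memory is available again. */
	if (t1_ok || (toy_alloc(&t1, 32) && toy_alloc(&t1, 16))) {
		completed++;
		toy_release_all(&t1);
	}
	return completed;
}
```

Calling run_scenario(false) models the "block / block" row of the table, and run_scenario(true) models the behaviour Alan is asking for.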
* Re: the new VM 2000-09-25 15:16 ` Ingo Molnar 2000-09-25 15:16 ` the new VMt Alan Cox @ 2000-09-25 15:48 ` Andrea Arcangeli 1 sibling, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 15:48 UTC (permalink / raw)
To: Ingo Molnar
Cc: Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 05:16:06PM +0200, Ingo Molnar wrote:
> situation is just 1% RAM away from the 'root cannot log in', situation.

The "root cannot log in" case is a little different. In that case you only need to press SYSRQ+E (or at worst +I).

If all tasks in the system are hanging in the GFP loop, SYSRQ+I won't solve the deadlock. OK, you can add a signal check in the memory balancing code, but that looks like an ugly hack, and it shows the difference between the two cases (the one Alan pointed out is a real deadlock; the current one is a kind of livelock that can go away at any time, while the deadlock can reach the point where it can't be recovered without a hack from an irq somewhere).

Andrea
* Re: the new VM 2000-09-25 14:47 ` Alan Cox 2000-09-25 15:16 ` Ingo Molnar @ 2000-09-25 15:40 ` Stephen C. Tweedie 2000-09-25 16:01 ` Andrea Arcangeli 1 sibling, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 15:40 UTC (permalink / raw)
To: Alan Cox
Cc: mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 03:47:03PM +0100, Alan Cox wrote:
>
> GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get everything
> jammed in kernel space waiting on GFP_KERNEL and if the swapper cannot make
> space you die.

We already have PF_MEMALLOC to provide a last-chance allocation pool which only the swapper can eat into. The critical thing is to avoid having the swapper itself deadlock. Everything revolves around that. Once you can make that guarantee, it's perfectly safe to make GFP_KERNEL succeed for other callers, just as long as you have enough beancounting in place in those callers.

Right now, the biggest obstacle to this is the GFP_ATOMIC behaviour:

        /*
         * Final phase: allocate anything we can!
         *
         * This is basically reserved for PF_MEMALLOC and
         * GFP_ATOMIC allocations...
         */

Allowing GFP_ATOMIC to eat PF_MEMALLOC's last-chance pages is the wrong thing to do if we want to guarantee swapper progress under extreme load.

Cheers,
 Stephen
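Stephen's objection can be illustrated with a toy model of a last-chance reserve. This is a sketch under assumptions, not the 2.4 page allocator: toy_zone, min_pages, the three allocation classes and the atomic_may_use_reserve switch are hypothetical names chosen here to contrast the two policies.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical allocation classes for this sketch. */
enum alloc_class { ALLOC_NORMAL, ALLOC_ATOMIC, ALLOC_MEMALLOC };

struct toy_zone {
	int free_pages;
	int min_pages;   /* pages at or below this level are the reserve */
};

/*
 * Take one page. Only callers allowed into the reserve may bring
 * free_pages below min_pages. atomic_may_use_reserve selects between
 * the behaviour criticised in the mail (true) and the stricter one.
 */
static bool toy_alloc_page(struct toy_zone *z, enum alloc_class c,
			   bool atomic_may_use_reserve)
{
	int floor = z->min_pages;

	if (c == ALLOC_MEMALLOC ||
	    (c == ALLOC_ATOMIC && atomic_may_use_reserve))
		floor = 0;

	if (z->free_pages <= floor)
		return false;
	z->free_pages--;
	return true;
}

/* Allocate until refusal; returns the number of pages obtained. */
static int toy_drain(struct toy_zone *z, enum alloc_class c,
		     bool atomic_may_use_reserve)
{
	int n = 0;

	while (toy_alloc_page(z, c, atomic_may_use_reserve))
		n++;
	return n;
}
```

With the switch on, a burst of atomic allocations leaves nothing for the task that is supposed to free memory; with it off, the reserve is still there when that task needs it.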
* Re: the new VM 2000-09-25 15:40 ` Stephen C. Tweedie @ 2000-09-25 16:01 ` Andrea Arcangeli 0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 16:01 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Alan Cox, mingo, Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:40:44PM +0100, Stephen C. Tweedie wrote:
> Allowing GFP_ATOMIC to eat PF_MEMALLOC's last-chance pages is the
> wrong thing to do if we want to guarantee swapper progress under
> extreme load.

You're definitely right. We at least need a guarantee that there is memory to allocate the bhs on top of the swap cache while we attempt to swap out one page (that path can't fail at the moment).

Andrea
* Re: the new VM 2000-09-25 13:08 ` Andrea Arcangeli 2000-09-25 13:12 ` Ingo Molnar @ 2000-09-25 14:37 ` Rik van Riel 2000-09-25 20:34 ` Christoph Rohland 1 sibling, 1 reply; 243+ messages in thread
From: Rik van Riel @ 2000-09-25 14:37 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Ingo Molnar, Marcelo Tosatti, Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 03:02:58PM +0200, Ingo Molnar wrote:
> > On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> >
> > > Sorry I totally disagree. If GFP_KERNEL allocations are guaranteed to
> > > succeed that is a showstopper bug. [...]
> >
> > why?
>
> Because as you said the machine can lockup when you run out of memory.

The fix for this is to kill a user process when you're OOM (you need to do this anyway).

The last few allocations of the "condemned" process can come from the reserved pages and the process we killed will exit just fine.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/               http://www.surriel.com/
* Re: the new VM 2000-09-25 14:37 ` Rik van Riel @ 2000-09-25 20:34 ` Christoph Rohland 2000-10-06 16:14 ` Rik van Riel 0 siblings, 1 reply; 243+ messages in thread
From: Christoph Rohland @ 2000-09-25 20:34 UTC (permalink / raw)
To: Rik van Riel; +Cc: MM mailing list, linux-kernel

Hi Rik,

Rik van Riel <riel@conectiva.com.br> writes:
> > Because as you said the machine can lockup when you run out of memory.
>
> The fix for this is to kill a user process when you're OOM
> (you need to do this anyway).
>
> The last few allocations of the "condemned" process can come
> from the reserved pages and the process we killed will exit just
> fine.

It's slightly offtopic, but you should think about detached shm segments in your OOM killer. As many of the high end applications like databases and e.g. SAP have most of their memory in shm segments, you easily end up killing a lot of processes without freeing a lot of memory. I see this often in my shm tests.

Greetings
                Christoph
* Re: the new VM 2000-09-25 20:34 ` Christoph Rohland @ 2000-10-06 16:14 ` Rik van Riel 2000-10-09 7:37 ` Christoph Rohland 0 siblings, 1 reply; 243+ messages in thread
From: Rik van Riel @ 2000-10-06 16:14 UTC (permalink / raw)
To: Christoph Rohland; +Cc: MM mailing list, linux-kernel

[replying to a really old email now that I've started work on integrating the OOM handler]

On 25 Sep 2000, Christoph Rohland wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
>
> > > Because as you said the machine can lockup when you run out of memory.
> >
> > The fix for this is to kill a user process when you're OOM
> > (you need to do this anyway).
> >
> > The last few allocations of the "condemned" process can come
> > from the reserved pages and the process we killed will exit just
> > fine.
>
> It's slightly offtopic, but you should think about detached shm
> segments in your OOM killer. As many of the high end
> applications like databases and e.g. SAP have most of the memory
> in shm segments you easily end up killing a lot of processes
> without freeing a lot of memory. I see this often in my shm
> tests.

Hmmm, could you help me with drawing up a selection algorithm on how to choose which SHM segment to destroy when we run OOM?

The criteria would be about the same as with normal programs:

1) minimise the amount of work lost
2) try to protect 'innocent' stuff
3) try to kill only one thing
4) don't surprise the user, but choose something that
   the user will expect to be killed/destroyed

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/               http://www.surriel.com/
* Re: the new VM 2000-10-06 16:14 ` Rik van Riel @ 2000-10-09 7:37 ` Christoph Rohland 0 siblings, 0 replies; 243+ messages in thread
From: Christoph Rohland @ 2000-10-09 7:37 UTC (permalink / raw)
To: Rik van Riel; +Cc: MM mailing list, linux-kernel

Rik van Riel <riel@conectiva.com.br> writes:
> Hmmm, could you help me with drawing up a selection algorithm
> on how to choose which SHM segment to destroy when we run OOM?
>
> The criteria would be about the same as with normal programs:
>
> 1) minimise the amount of work lost
> 2) try to protect 'innocent' stuff
> 3) try to kill only one thing
> 4) don't surprise the user, but choose something that
>    the user will expect to be killed/destroyed

First, we only kill segments with no attachees. There are circumstances under normal load where you have these. (SAP R/3 will do this all the time on Linux 2.4)

So perhaps we could signal shm that we killed a process and let it try to find a segment where this process was the last attachee. This would be a good candidate.

If this does not help either, we could do two different things:

1) kill the biggest nonattached segment
2) kill the segment which was longest detached

Greetings
                Christoph
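One way to read Christoph's proposal is as a small selection function. The sketch below is a userspace illustration with hypothetical names (toy_shm_seg, pick_victim(), last_pid); it is not the kernel's shm code. It implements the "last attachee was the killed process" preference and then fallback 1) from the mail, the biggest nonattached segment. Fallback 2), the longest-detached segment, would only need a detach timestamp compared in place of the size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-segment bookkeeping for this sketch. */
struct toy_shm_seg {
	int id;
	size_t bytes;
	int nattch;     /* number of current attachees */
	int last_pid;   /* last process that detached from the segment */
};

/*
 * Choose a segment to destroy after killed_pid was OOM-killed.
 * Attached segments are never candidates. Returns NULL if every
 * segment still has attachees.
 */
static const struct toy_shm_seg *
pick_victim(const struct toy_shm_seg *segs, size_t n, int killed_pid)
{
	const struct toy_shm_seg *biggest = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct toy_shm_seg *s = &segs[i];

		if (s->nattch != 0)
			continue;               /* somebody still uses it */
		if (s->last_pid == killed_pid)
			return s;               /* orphaned by the kill */
		if (!biggest || s->bytes > biggest->bytes)
			biggest = s;            /* fallback: biggest detached */
	}
	return biggest;
}
```

Returning the first segment orphaned by the killed process keeps the choice close to Rik's criterion 4): the memory that goes away belonged to the thing that was just killed.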
* Re: the new VM 2000-09-25 13:02 ` Andrea Arcangeli 2000-09-25 13:02 ` Ingo Molnar @ 2000-09-25 13:04 ` Ingo Molnar 2000-09-25 13:19 ` Andrea Arcangeli 1 sibling, 1 reply; 243+ messages in thread From: Ingo Molnar @ 2000-09-25 13:04 UTC (permalink / raw) To: Andrea Arcangeli Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Andrea Arcangeli wrote: > Please fix raid1 instead of making things worse. huh, what do you mean? Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VM 2000-09-25 13:04 ` Ingo Molnar @ 2000-09-25 13:19 ` Andrea Arcangeli 2000-09-25 13:18 ` Ingo Molnar 2000-09-25 13:21 ` Ingo Molnar 0 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 13:19 UTC (permalink / raw)
To: Ingo Molnar
Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:04:10PM +0200, Ingo Molnar wrote:
>
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
>
> > Please fix raid1 instead of making things worse.
>
> huh, what do you mean?

I mean this:

        while (!( /* FIXME: now we are rather fault tolerant than nice */
        mirror_bh[i] = kmalloc (sizeof (struct buffer_head), GFP_KERNEL)
        ) )

I've seen that in the 2.4.0-test9-pre6 raid1 code the above is gone (this looks very promising :), it is at least proof that some care about the deadlock has been taken) and you instead sleep on a waitqueue now. While it's not obvious at all that sleeping on the waitqueue is not deadlock prone (for example getblk sleeps on a waitqueue but it's deadlock prone too), at least it's not an infinite loop anymore and that's still better.

Is it safe to sleep on the waitqueue in the kmalloc fail path in raid1?

Andrea
* Re: the new VM 2000-09-25 13:19 ` Andrea Arcangeli @ 2000-09-25 13:18 ` Ingo Molnar 2000-09-25 13:21 ` Ingo Molnar 1 sibling, 0 replies; 243+ messages in thread From: Ingo Molnar @ 2000-09-25 13:18 UTC (permalink / raw) To: Andrea Arcangeli Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Andrea Arcangeli wrote: > > huh, what do you mean? > > I mean this: > > while (!( /* FIXME: now we are rather fault tolerant than nice */ this is fixed in 2.4. The 2.2 RAID code is frozen, and has known limitations (ie. due to the above RAID1 cannot be used as a swap-device). Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VM 2000-09-25 13:19 ` Andrea Arcangeli 2000-09-25 13:18 ` Ingo Molnar @ 2000-09-25 13:21 ` Ingo Molnar 2000-09-25 13:31 ` Andrea Arcangeli 1 sibling, 1 reply; 243+ messages in thread From: Ingo Molnar @ 2000-09-25 13:21 UTC (permalink / raw) To: Andrea Arcangeli Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Andrea Arcangeli wrote: > Is it safe to sleep on the waitqueue in the kmalloc fail path in > raid1? yes. every RAID1-bh has a bound lifetime. (bound by worst-case IO latencies) Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VM 2000-09-25 13:21 ` Ingo Molnar @ 2000-09-25 13:31 ` Andrea Arcangeli 2000-09-25 13:47 ` Ingo Molnar 0 siblings, 1 reply; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 13:31 UTC (permalink / raw) To: Ingo Molnar Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 03:21:01PM +0200, Ingo Molnar wrote: > yes. every RAID1-bh has a bound lifetime. (bound by worst-case IO > latencies) Very good! Many thanks Ingo. Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VM 2000-09-25 13:31 ` Andrea Arcangeli @ 2000-09-25 13:47 ` Ingo Molnar 2000-09-25 14:04 ` Andrea Arcangeli 0 siblings, 1 reply; 243+ messages in thread From: Ingo Molnar @ 2000-09-25 13:47 UTC (permalink / raw) To: Andrea Arcangeli Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Andrea Arcangeli wrote: > > yes. every RAID1-bh has a bound lifetime. (bound by worst-case IO > > latencies) > > Very good! Many thanks Ingo. this was actually coded/fixed by Neil Brown - so the kudos go to him! Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: the new VM 2000-09-25 13:47 ` Ingo Molnar @ 2000-09-25 14:04 ` Andrea Arcangeli 0 siblings, 0 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 14:04 UTC (permalink / raw) To: Ingo Molnar Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, Sep 25, 2000 at 03:47:57PM +0200, Ingo Molnar wrote: > this was actually coded/fixed by Neil Brown - so the kudos go to him! Indeed :). Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 2000-09-25 0:09 ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds 2000-09-25 0:49 ` Alexander Viro 2000-09-25 0:53 ` Marcelo Tosatti @ 2000-09-25 1:31 ` Andrea Arcangeli 2000-09-25 1:27 ` Alexander Viro 2000-09-25 10:13 ` Ingo Molnar 2 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 1:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, Sep 24, 2000 at 05:09:40PM -0700, Linus Torvalds wrote:
> [..] as with the
> shm_swap() thing this is probably something we do want to fix eventually.

Both shm_swap and regular rw_swap_cache have the same deadlock problem w.r.t. __GFP_IO. We could do that on a raw device, but if we swap on top of the filesystem then we could have deadlock problems again. Really, since with the swapfile the blocks are already allocated with ext2, we should not deadlock (but maybe some other fs has a lock_super in the get_block path anyway). Thus it's safer not to swap out anything when __GFP_IO is not set.

Also some linux/net/* code is using (or better, abusing, since __GFP_IO originally was only meant as a deadlock avoidance thing, not a thing to only shrink the clean cache) GFP_BUFFER to not block (so actually we would hurt networking too by causing _any_ kind of block in a GFP_BUFFER allocation). It would have been better to introduce a new flag for allocations that must not block for latency requirements but that still want to shrink the clean cache (instead of finishing the atomic queue). This is trivially fixable by grepping for GFP_BUFFER.

> The icache shrinker probably has similar problems with clear_inode.

Yep. And it sure does blocking I/O because it has to sync the dirty inodes.

> I suspect that it might be a good idea to try to fix this issue, because
> it will probably keep coming up otherwise. And it's likely to be fairly
> easily debugged, by just making getblk() have some debugging code that
> basically says something like
>
>       lock_super()
>       {
>               .. do the lock ..
> +             current->super_locked++;
>       }
>
>       unlock_super()
>       {
> +             if (current->super_locked < 1)
> +                     BUG();
> +             current->super_locked--;
>               .. do the unlock ..
>       }
>
>       getblk()
>       {
> +             if (current->super_locked)
> +                     BUG();
>               .. do the getblk ..
>       }

BTW (running offtopic), I collected such information in 2.2.x too (but for another reason).

        ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.18pre9/VM-global-2.2.18pre9-6.bz2

I trapped all the downs on the inode semaphore in the same way (I called it current->fs_locks for both down and superlock).

I'm using that information to know whether any lock is held in the context of the task, and so whether I can do I/O without risking a deadlock on any inode semaphore or on any superblock lock. With that change I could then also use GFP_KERNEL in getblk in 2.2.x (I admit at first I did that :), but then I preferred to stay on the safe side for things like loop that _have_ to work in 2.2.x :).

So now we know when we can writepage a dirty MAP_SHARED page in swap_out, and we do it from the task that is trying to allocate memory, so the task that is trying to allocate memory will block waiting for some dirty buffer to be written in writepage->wakeup_bdflush(1).

In 2.2.x (as we do in 2.4.x) we _need_ to write out the page ourselves from swapout (not async queueing into kpiod) because kpiod is completely asynchronous, and so without this change GFP was returning, we were allocating memory again, and we were entering GFP again, all at a fast rate. In the meantime kpiod was still blocked in mark_buffer_dirty->wakeup_bdflush(1), and then the tasks allocating memory (which thought they had made some progress because they had queued many pages into kpiod) were getting killed.

Of course then I also killed kpiod since it wasn't necessary anymore, and now MAP_SHARED segments don't kill tasks anymore.

> and just making it a new rule that you cannot call getblk() with any locks
> held.

Yes, I see it would certainly trap the deadlock cases.

> (the superblock lock is quite contended right now, and the reason for that

Right (on a large fs it is going to be quite painful for scalability) and the BUG would have the benefit of partly solving it.

I'm thinking that dropping the superblock lock completely wouldn't be much more difficult than this mid stage. The only cases where we block in critical sections protected by the superblock lock are in getblk/bread (bread calls getblk), ll_rw_block and mark_buffer_dirty. Once we drop the lock for the first cases it should not be more difficult to drop it completely.

Not sure if this is the right moment for those changes though, I'm not worried about ext2 but about the other non-networked fses that nobody uses regularly.

Andrea
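The debugging rule Linus proposes in the quoted text (count superblock locks per task, and trap any getblk() call made while the count is non-zero) can be tried out in userspace. This is a sketch with assumptions clearly different from the kernel: toy_task stands in for current, the toy_* functions do no real locking or I/O, and a rule_violations counter replaces BUG() so that the checks can be exercised without aborting.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the per-task state that the mail hangs off "current". */
struct toy_task {
	int super_locked;      /* superblock locks held by this task */
	int rule_violations;   /* where the kernel version would BUG() */
};

static void toy_lock_super(struct toy_task *t)
{
	/* .. do the lock .. */
	t->super_locked++;
}

static bool toy_unlock_super(struct toy_task *t)
{
	if (t->super_locked < 1) {
		t->rule_violations++;   /* unlock without a matching lock */
		return false;
	}
	t->super_locked--;
	/* .. do the unlock .. */
	return true;
}

static bool toy_getblk(struct toy_task *t)
{
	if (t->super_locked) {
		t->rule_violations++;   /* getblk() with the lock held */
		return false;
	}
	/* .. do the getblk .. */
	return true;
}
```

Andrea's current->fs_locks variant described above follows the same idea, with the counter also covering the inode semaphore.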
* Re: [patch] vmfixes-2.4.0-test9-B2 2000-09-25 1:31 ` [patch] vmfixes-2.4.0-test9-B2 Andrea Arcangeli @ 2000-09-25 1:27 ` Alexander Viro 2000-09-25 2:02 ` Andrea Arcangeli 2000-09-25 10:13 ` Ingo Molnar 1 sibling, 1 reply; 243+ messages in thread From: Alexander Viro @ 2000-09-25 1:27 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linus Torvalds, Ingo Molnar, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Andrea Arcangeli wrote: > I'm thinking that dropping the superblock lock completly wouldn't be much more > difficult than this mid stage. The only cases where we block in critical > sections protected by the superblock lock is in getblk/bread (bread calls > getblk) and ll_rw_block and mark_buffer_dirty. Once we drop the lock for the > first cases it should not be more difficult to drop it completly. ext2_new_block->dquot_alloc_block->lock_dquot ext2_new_block->dquot_alloc_block->check_bdq->print_warning->tty_write_message > Not sure if this is the right moment for those changes though, I'm not worried > about ext2 but about the other non-netoworked fses that nobody uses regularly. So help testing the patches to them. Arrgh... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 2000-09-25 1:27 ` Alexander Viro @ 2000-09-25 2:02 ` Andrea Arcangeli 2000-09-25 2:01 ` Alexander Viro 2000-09-25 13:47 ` Stephen C. Tweedie 0 siblings, 2 replies; 243+ messages in thread From: Andrea Arcangeli @ 2000-09-25 2:02 UTC (permalink / raw) To: Alexander Viro Cc: Linus Torvalds, Ingo Molnar, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Sun, Sep 24, 2000 at 09:27:39PM -0400, Alexander Viro wrote: > So help testing the patches to them. Arrgh... I think I'd better fix the bugs that I know about before testing patches that tries to remove the superblock_lock at this stage. I guess you should re-read the email from DaveM of two days ago. Then I've a problem: I've no idea how could I test adfs/affs/efs/hfs/hpfs/qnx4/sysv/udf. If you send me by email or point out the URL where I can find the source of the mkfs for all the above fs I will try to add the tests in the regression test suite as soon as time permits so the computer will do that job for me (that will be useful regardless of the super-lock issue). (if the mkfses are in common packages like mkfs.minix and mkfs.bfs no need to send them of course) Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 2000-09-25 2:02 ` Andrea Arcangeli @ 2000-09-25 2:01 ` Alexander Viro 2000-09-25 13:47 ` Stephen C. Tweedie 1 sibling, 0 replies; 243+ messages in thread
From: Alexander Viro @ 2000-09-25 2:01 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Ingo Molnar, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> On Sun, Sep 24, 2000 at 09:27:39PM -0400, Alexander Viro wrote:
> > So help testing the patches to them. Arrgh...
>
> I think I'd better fix the bugs that I know about before testing patches that
> try to remove the superblock_lock at this stage. I guess you should
> re-read the email from DaveM of two days ago.

Erm... Did you miss the fact that minixfs/sysvfs/UFS are chock-full of fs-corrupting races? The patch for minixfs has been posted 3 times during the last couple of weeks, each time with [CFT] in the subject. So far - 0 (zero) responses. I'm way past the stage when I gave a damn - it works here and if I do not receive any bug reports it will go to Linus on Tuesday. And no, that stuff has nothing to do with lock_super(). But unless people test the patches posted on l-k and fsdevel - too fscking bad, stuff _will_ break.
* Re: [patch] vmfixes-2.4.0-test9-B2 2000-09-25 2:02 ` Andrea Arcangeli 2000-09-25 2:01 ` Alexander Viro @ 2000-09-25 13:47 ` Stephen C. Tweedie 1 sibling, 0 replies; 243+ messages in thread From: Stephen C. Tweedie @ 2000-09-25 13:47 UTC (permalink / raw) To: Andrea Arcangeli Cc: Alexander Viro, Linus Torvalds, Ingo Molnar, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel Hi, On Mon, Sep 25, 2000 at 04:02:30AM +0200, Andrea Arcangeli wrote: > On Sun, Sep 24, 2000 at 09:27:39PM -0400, Alexander Viro wrote: > > So help testing the patches to them. Arrgh... > > I think I'd better fix the bugs that I know about before testing patches that > tries to remove the superblock_lock at this stage. Right. If we're introducing new deadlock possibilities, then sure we can fix the obvious cases in ext2, but it will be next to impossible to do a thorough audit of all of the other filesystems. Adding in the new shrink_icache loop into the VFS just feels too dangerous right now. Of course, that doesn't mean we shouldn't remove the excessive superblock locking from ext2 --- rather, it is simply more robust to keep the two issues separate. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 2000-09-25 1:31 ` [patch] vmfixes-2.4.0-test9-B2 Andrea Arcangeli 2000-09-25 1:27 ` Alexander Viro @ 2000-09-25 10:13 ` Ingo Molnar 2000-09-25 12:58 ` Andrea Arcangeli 1 sibling, 1 reply; 243+ messages in thread From: Ingo Molnar @ 2000-09-25 10:13 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel On Mon, 25 Sep 2000, Andrea Arcangeli wrote: > Not sure if this is the right moment for those changes though, I'm not > worried about ext2 but about the other non-netoworked fses that nobody > uses regularly. it *is* the right moment to clean these issues up. These kinds of things are what made the 2.2 VM a mess (everybody added his easy improvements, without solving some of the conceptual problems), and frankly, instead of yet another elevator algorithm we need a squeaky clean VM balancer above all. Please help identifying, fixing, debugging and testing these VM balancing issues. This is tough work and it needs to be done. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 243+ messages in thread
* Re: [patch] vmfixes-2.4.0-test9-B2 2000-09-25 10:13 ` Ingo Molnar @ 2000-09-25 12:58 ` Andrea Arcangeli 2000-09-25 13:10 ` Ingo Molnar 0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 12:58 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 12:13:08PM +0200, Ingo Molnar wrote:
>
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
>
> > Not sure if this is the right moment for those changes though, I'm not
> > worried about ext2 but about the other non-networked fses that nobody
> > uses regularly.
>
> it *is* the right moment to clean these issues up. These kinds of things

I'm talking about the removal of the superblock lock from the filesystems. Note: I don't have problems with the removal of the superblock lock even if done at this stage. I'm not the one who can choose those things, it's Linus's responsibility to take the final decision for the official tree, but don't ask me to test patches that remove the superblock lock _at_this_stage_ before I can run a stable and fast 2.4.x, because I won't do that. Period.

> yet another elevator algorithm we need a squeaky clean VM balancer above

FYI: My current tree (based on 2.4.0-test8-pre5) delivers 16mbyte/sec in the tiobench write test, compared to clean 2.4.0-test8-pre5 which delivers 8mbyte/sec, with only blkdev layer changes in between the two kernels (and no, that's not a matter of the elevator, since there are no seeks in the test and I've not changed the elevator sorting algorithm during the bench).

Also, I found the reason for your hang: it's the TASK_EXCLUSIVE in wait_for_request. The high part of the queue is reserved for reads. Now if a read completes and it wakes up a write you'll hang.

If you think I should delay those fixes to do something else, I don't agree, sorry.

> all. Please help identifying, fixing, debugging and testing these VM
> balancing issues. This is tough work and it needs to be done.

I had an alternative VM, which I prefer from a design standpoint. I'll improve it and I'll maintain it.

Andrea
* Re: [patch] vmfixes-2.4.0-test9-B2 2000-09-25 12:58 ` Andrea Arcangeli @ 2000-09-25 13:10 ` Ingo Molnar 2000-09-25 13:49 ` Jens Axboe 2000-09-25 13:56 ` Andrea Arcangeli 0 siblings, 2 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 13:10 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > yet another elevator algorithm we need a squeaky clean VM balancer above
>
> FYI: My current tree (based on 2.4.0-test8-pre5) delivers 16mbyte/sec
> in the tiobench write test compared to clean 2.4.0-test8-pre5 that
> delivers 8mbyte/sec

great! I'm happy we have a fine-tuned elevator again.

> Also, I found the reason for your hang: it's the TASK_EXCLUSIVE in
> wait_for_request. The high part of the queue is reserved for reads.
> Now if a read completes and it wakes up a write you'll hang.

yep. But i don't understand why this makes any difference - the waitqueue wakeup is FIFO, so any other request will eventually arrive. Could you explain this bug a bit better?

> If you think I should delay those fixes to do something else I don't
> agree sorry.

no, i never meant it. I find it very good that those half-done changes are cleaned up and the remaining bugs / performance problems are eliminated - the first reports about bad write performance came right after the original elevator patches went in, about 6 months ago.

        Ingo
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Jens Axboe @ 2000-09-25 13:49 UTC
To: Ingo Molnar
Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25 2000, Ingo Molnar wrote:
> > If you think I should delay those fixes to do something else I don't
> > agree sorry.
>
> no, i never ment it. I find it very good that those half-done changes are

The changes made were never half-done. The recent bug fixes have
mainly been to remove cruft from the earlier elevator and to fix a bug
where the elevator insert would screw up a bit. So I'd call that fine
tuning or adjusting, not fixing half-done stuff.

> cleaned up and the remaining bugs / performance problems are eliminated -

Of course

> the first reports about bad write performance came right after the
> original elevator patches went in, about 6 months ago.

And a new elevator was introduced some months ago to solve this.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Ingo Molnar @ 2000-09-25 14:11 UTC
To: Jens Axboe
Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Jens Axboe wrote:

> The changes made were never half-done. The recent bug fixes have
> mainly been to remove cruft from the earlier elevator and fixing a bug
> where the elevator insert would screw up a bit. So I'd call that fine
> tuning or adjusting, not fixing half-done stuff.

sorry i did not mean to offend you - unadjusted and unfixed stuff hanging
around in the kernel for months is 'half done' for me.

> > the first reports about bad write performance came right after the
> > original elevator patches went in, about 6 months ago.
>
> And a new elevator was introduced some months ago to solve this.

and these are still not solved in the vanilla kernel, as recent complaints
on l-k prove.

Ingo

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Jens Axboe @ 2000-09-25 14:05 UTC
To: Ingo Molnar
Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25 2000, Ingo Molnar wrote:
> > The changes made were never half-done. The recent bug fixes have
> > mainly been to remove cruft from the earlier elevator and fixing a bug
> > where the elevator insert would screw up a bit. So I'd call that fine
> > tuning or adjusting, not fixing half-done stuff.
>
> sorry i did not mean to offend you - unadjusted and unfixed stuff hanging
> around in the kernel for months is 'half done' for me.

No offense taken, I just tried to explain my view. And in light of the
bad test2, I'd like the new changes to not have any "issues". So this
work has been going on for the last month or so, and I think we are
finally getting to agreement on what needs to be done now and how. WIP.

> > And a new elevator was introduced some months ago to solve this.
>
> and these are still not solved in the vanilla kernel, as recent complaints
> on l-k prove.

Different problems, though :(. However, I believe they are solved in
Andrea and my current tree. Just needs the final cleaning, more later.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Linus Torvalds @ 2000-09-25 16:46 UTC
To: Ingo Molnar
Cc: Jens Axboe, Andrea Arcangeli, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Ingo Molnar wrote:
> >
> > And a new elevator was introduced some months ago to solve this.
>
> and these are still not solved in the vanilla kernel, as recent complaints
> on l-k prove.

THE ELEVATOR IS PROBABLY NOT THE PROBLEM.

People blame the elevator for bad IO performance. But the elevator is just
doing what it's told to do - and if it is told to do something bad, it will
do something bad.

The "something bad" is doing things like writing out 4 discontiguous pages,
waiting a while, and then writing out 4 more discontiguous pages. There's
nothing the elevator can do for that case - except just ignore the write
requests completely, and wait for more requests to come in. Which it
certainly could do, but that's really a policy question and should be
handled at a higher level. The elevator doesn't know if there are going to
be more writes.

In short, I bet that the problem is at least partly that bdflush is broken,
and doesn't do a good job of streaming writes. It's probably been broken to
get low latencies, and in order to avoid "choppy" behaviour. But the
elevator works _best_ with choppy behaviour, when there's a BIG stream of
requests at a time.

Blaming the elevator is unfair and unrealistic. Look at the performance
reports - there was a good test-case that showed that read-performance was
fine but that writes to different parts of the filesystem just suck. Which
is _exactly_ what you'd expect if the elevator was fine but the writes were
blocked up by higher levels.

Linus

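The batching argument in the message above can be illustrated with a small user-space sketch. It is not kernel code and not from the thread: it assumes a toy "elevator" that sorts whatever block numbers are currently queued and merges adjacent ones into a single request. The names `count_requests` and `dispatch_in_batches` are hypothetical. The same set of dirty blocks produces more, smaller requests when it is handed over a few blocks at a time than when it arrives as one big stream.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for block numbers */
static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/*
 * Toy "elevator": sort the queued blocks and merge runs of adjacent
 * block numbers. Returns how many requests would reach the disk.
 * Assumes the block numbers are distinct. Sorts the array in place.
 */
static size_t count_requests(long *blocks, size_t n)
{
    size_t i, requests;

    if (n == 0)
        return 0;
    qsort(blocks, n, sizeof(*blocks), cmp_long);
    requests = 1;
    for (i = 1; i < n; i++)
        if (blocks[i] != blocks[i - 1] + 1)
            requests++;        /* gap: a new request starts here */
    return requests;
}

/*
 * Feed the blocks to the toy elevator `batch` at a time, the way a
 * writer that flushes a few pages and then waits would do. Returns the
 * total number of requests over all batches.
 */
static size_t dispatch_in_batches(const long *blocks, size_t n, size_t batch)
{
    size_t total = 0, off, len, i;
    long *tmp;

    assert(batch > 0);
    tmp = malloc(batch * sizeof(*tmp));
    assert(tmp != NULL);
    for (off = 0; off < n; off += len) {
        len = (n - off < batch) ? n - off : batch;
        for (i = 0; i < len; i++)
            tmp[i] = blocks[off + i];
        total += count_requests(tmp, len);
    }
    free(tmp);
    return total;
}
```

With the eight blocks 0, 4, 1, 5, 2, 6, 3, 7, a single batch of eight merges into one request, while batches of four or two leave gaps that the sort cannot close. This only models merging; it says nothing about seek cost or latency.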
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Ingo Molnar @ 2000-09-25 17:05 UTC
To: Linus Torvalds
Cc: Jens Axboe, Andrea Arcangeli, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Linus Torvalds wrote:

> Blaming the elevator is unfair and unrealistic. [...]

yep - and Jens i'm sorry about the outburst. Until a bug is found it's
unrealistic to blame anything.

Ingo

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-25 17:23 UTC
To: Ingo Molnar
Cc: Linus Torvalds, Jens Axboe, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 07:05:02PM +0200, Ingo Molnar wrote:
> yep - and Jens i'm sorry about the outburst. Until a bug is found it's
> unrealistic to blame anything.

I think the only bug that may be to blame in the elevator is the EXCLUSIVE
wakeup thing (and I've not benchmarked it alone to see if it makes any real
world performance difference, but for sure its behaviour wasn't
intentional). Anything else related to the elevator internals should
perform better than the old elevator (aka the 2.2.15 one). The new elevator
ordering algorithm returns me much better numbers than the CSCAN one with
tiobench.

Also consider that the latency control at the moment is completely disabled
by default, so there are no barriers unless you change that with elvtune.

Also I'm using -r 250 and -w 500 and it doesn't really change anything in
the numbers compared to too big values (but it fixes the starvation
problem).

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-25 14:20 UTC
To: Jens Axboe
Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:49:52PM +0200, Jens Axboe wrote:
> And a new elevator was introduced some months ago to solve this.

And now that I've done some benchmarks it seems the major optimization
consists in the implementation of the new _ordering_ algorithm in test2,
not really from the removal of the more fine-grained latency control (that
said, I'm not going to reintroduce the previous latency control; the
current one doesn't provide great latency but it's ok).

As soon as I patch my tree with Peter's perfect CSCAN ordering (that only
changes the ordering algorithm), tiotest performance drops significantly in
the 2-thread-reading case. elvtune settings don't matter, that's only a
matter of the ordering.

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Jens Axboe @ 2000-09-25 14:11 UTC
To: Andrea Arcangeli
Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > And a new elevator was introduced some months ago to solve this.
>
> And now that I done some benchmark it seems the major optimization consists in
> the implementation of the new _ordering_ algorithm in test2, not really from
> the removal of the more finegrined latency control (said that I'm not going to
> reintroduce the previous latency control, the current one doesn't provide
> great latency but it's ok).

Yes, I found this the greatest improvement too.

> As soon I patch my tree with Peter's perfect CSCAN ordering (that only changes
> the ordering algorithm), tiotest performance drops significantly in the
> 2-thread-reading case. elvtune settings doesn't matter, that's only a matter
> of the ordering.

Interesting. I haven't done any serious benching with the CSCAN introduction
in elevator_linus, I'll try that too.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-25 14:33 UTC
To: Jens Axboe
Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:11:34PM +0200, Jens Axboe wrote:
> Interesting. I haven't done any serious benching with the CSCAN introduction
> in elevator_linus, I'll try that too.

With only that change, the performance decreased reproducibly from 16 to 14
mbyte/sec in the read test with 2 threads.

So far I'm testing only IDE with LVM striping on two equally fast disks on
separate IDE channels.

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-25 13:56 UTC
To: Ingo Molnar
Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:10:51PM +0200, Ingo Molnar wrote:
> yep. But i dont understand why this makes any difference - the waitqueue

It makes a difference because your sleeping reads won't get the wakeup
even while they could queue their reserved read request (they have to wait
for the FIFO to roll or for some write to complete).

> wakeup is FIFO, so any other request will eventually arrive. Could you
> explain this bug a bit better?

Well it may not explain an infinite hang because, as you say, the write that
got the spurious wakeup will unplug the queue and after some time the reads
will be woken up. So maybe that wasn't the reason of your hangs, because I
remember your problem looked more like an infinite hang that was only solved
by kflushd writing some more stuff and unplugging the queue as a side effect
(however I'm not sure since I never experienced those myself).

But I hope if it wasn't that one it's the below fix that will help:

Index: mm/filemap.c
===================================================================
RCS file: /home/andrea/cvs/linux/mm/filemap.c,v
retrieving revision 1.1.1.5.2.3
retrieving revision 1.1.1.5.2.4
diff -u -r1.1.1.5.2.3 -r1.1.1.5.2.4
--- mm/filemap.c	2000/09/21 03:11:53	1.1.1.5.2.3
+++ mm/filemap.c	2000/09/25 03:33:31	1.1.1.5.2.4
@@ -622,8 +622,8 @@
 	add_wait_queue(&page->wait, &wait);
 	do {
-		sync_page(page);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		sync_page(page);
 		if (!PageLocked(page))
 			break;
 		schedule();
Index: fs/buffer.c
===================================================================
RCS file: /home/andrea/cvs/linux/fs/buffer.c,v
retrieving revision 1.1.1.5.2.1
retrieving revision 1.1.1.5.2.2
diff -u -r1.1.1.5.2.1 -r1.1.1.5.2.2
--- fs/buffer.c	2000/09/06 19:57:51	1.1.1.5.2.1
+++ fs/buffer.c	2000/09/25 03:33:30	1.1.1.5.2.2
@@ -147,8 +147,8 @@
 	atomic_inc(&bh->b_count);
 	add_wait_queue(&bh->b_wait, &wait);
 	do {
-		run_task_queue(&tq_disk);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		run_task_queue(&tq_disk);
 		if (!buffer_locked(bh))
 			break;
 		schedule();

Think if the buffer returns locked between
set_task_state(tsk, TASK_UNINTERRUPTIBLE) and if (!buffer_locked(bh)). The
window is very small but it looks like a genuine window for a deadlock. (and
this one could sure explain infinite hangs in read... even if it looks even
less realistic than the EXCLUSIVE task thing)

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Ingo Molnar @ 2000-09-25 13:57 UTC
To: Andrea Arcangeli
Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> -		sync_page(page);
>  		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> +		sync_page(page);

> -		run_task_queue(&tq_disk);
>  		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> +		run_task_queue(&tq_disk);

these look like genuine fixes, but i don't think they can explain the hangs
i had yesterday - those were simple VM deadlocks. I don't see any deadlocks
today - but i'm running the unsafe B2 variant of the vmfixes patch. (and i
have no swapping enabled which simplifies my VM setup.)

but one of these two fixes could explain the slowdown i saw on and off for
quite some time, seeing very bad read performance occasionally. (do you
remember my sched.c tq_disk hack?)

Ingo

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-25 14:13 UTC
To: Ingo Molnar
Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:57:31PM +0200, Ingo Molnar wrote:
> i had yesterday - those were simple VM deadlocks. I dont see any deadlocks

Definitely. They can't explain anything about the VM deadlocks. I was
_only_ talking about the blkdev hangs that caused you to unplug the
queue at each reschedule in tux and that Eric reported to me for the SG
driver (and I very much hope that with EXCLUSIVE gone away and the
wait_on_* fixed those hangs will go away because I don't see anything else
wrong at this moment).

> but one of these two fixes could explain the slowdown i saw on and off for
> quite some time, seeing very bad read performance occasionally. (do you
> remember my sched.c tq_disc hack?)

Exactly, that's the only thing I was talking about in this subthread.

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Jens Axboe @ 2000-09-25 14:08 UTC
To: Andrea Arcangeli
Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > i had yesterday - those were simple VM deadlocks. I dont see any deadlocks
>
> Definitely. They can't explain anything about the VM deadlocks. I was
> _only_ talking about the blkdev hangs that caused you to unplug the
> queue at each reschedule in tux and that Eric reported me for the SG
> driver (and I very much hope that with EXCLUSIVE gone away and the
> wait_on_* fixed those hangs will go away because I don't see anything else
> wrong at this moment).

The sg problem was different. When sg queues a request, it invokes the
request_fn to handle it. But if the queue is currently plugged, the
scsi_request_fn will not do anything.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-25 14:29 UTC
To: Jens Axboe
Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:08:38PM +0200, Jens Axboe wrote:
> The sg problem was different. When sg queues a request, it invokes the
> request_fn to handle it. But if the queue is currently plugged, the
> scsi_request_fn will not do anything.

That will explain it, yes. In the same way, for correctness those should
also be converted from request_fn to generic_unplug_device, right? (this
will also avoid calling request_fn again spuriously because the device is
still in the tq_disk queue even when the I/O generated by the below
request_fn completed)

	if (major >= COMPAQ_SMART2_MAJOR+0 && major <= COMPAQ_SMART2_MAJOR+7)
		(q->request_fn)(q);
	if (major >= DAC960_MAJOR+0 && major <= DAC960_MAJOR+7)
		(q->request_fn)(q);

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Jens Axboe @ 2000-09-25 14:18 UTC
To: Andrea Arcangeli
Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > The sg problem was different. When sg queues a request, it invokes the
> > request_fn to handle it. But if the queue is currently plugged, the
> > scsi_request_fn will not do anything.
>
> That will explain it, yes. In the same way for correctness also those should
> be converted from request_fn to generic_unplug_device, right? (this

Yes, that would be the right fix. However, then we also need some way
of inserting requests in the queue and letting it plug when appropriate.
The scsi layer currently "manually" does a list_add on the queue itself,
which doesn't look too healthy.

> will also avoid to recall spurious request_fn because the device is still
> in the tq_disk queue even when the I/O generated by the below request_fn
> completed)
>
> 	if (major >= COMPAQ_SMART2_MAJOR+0 && major <= COMPAQ_SMART2_MAJOR+7)
> 		(q->request_fn)(q);
> 	if (major >= DAC960_MAJOR+0 && major <= DAC960_MAJOR+7)
> 		(q->request_fn)(q);

AFAIR, Eric tried to talk to the Compaq folks (and Leonard too, I dunno)
about why they want this. What came of it, I don't know.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-25 14:47 UTC
To: Jens Axboe
Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:18:54PM +0200, Jens Axboe wrote:
> The scsi layer currently "manually" does a list_add on the queue itself,
> which doesn't look too healthy.

It's grabbing the io_request_lock so it looks healthy for now :)

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Jens Axboe @ 2000-09-25 21:28 UTC
To: Andrea Arcangeli
Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > The scsi layer currently "manually" does a list_add on the queue itself,
> > which doesn't look too healthy.
>
> It's grabbing the io_request_lock so it looks healthy for now :)

It's safe alright, but if we want to do the generic_unplug_device instead
of just hitting the request_fn (which might do anything anyway), it would
be nicer to expose this part of the block layer (i.e. have a general way
of queueing a request to the request_queue). But I guess just

	q->plug_device_fn(q, ...);
	list_add(...)
	generic_unplug_device(q);

would suffice in scsi_lib for now.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

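The plug / add / unplug sequence discussed in this subthread can be pictured with a toy model. This is a user-space sketch under stated assumptions, not the 2.4 block layer: `struct toy_queue` and the `toy_*` functions are hypothetical names, and the only behaviour modelled is the one described in the thread, namely that the driver's request function does nothing while the queue is plugged.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy request queue; every name here is hypothetical, not the 2.4 API. */
struct toy_queue {
    bool plugged;    /* while set, the driver must not start I/O */
    int  pending;    /* requests sitting on the queue */
    int  completed;  /* requests the "driver" has processed */
};

/*
 * Driver entry point. Like the behaviour described for scsi_request_fn
 * in the thread, it returns without doing anything while plugged.
 */
static void toy_request_fn(struct toy_queue *q)
{
    if (q->plugged)
        return;
    q->completed += q->pending;
    q->pending = 0;
}

/* Plug the queue so that requests can accumulate. */
static void toy_plug(struct toy_queue *q)
{
    q->plugged = true;
}

/* Queue one request (stands in for the list_add step). */
static void toy_add_request(struct toy_queue *q)
{
    q->pending++;
}

/* Clear the plug first, then run the driver. */
static void toy_unplug(struct toy_queue *q)
{
    q->plugged = false;
    toy_request_fn(q);
    assert(q->pending == 0);  /* nothing is left behind after an unplug */
}
```

In this model, calling `toy_request_fn()` directly on a plugged queue makes no progress, which mirrors the sg symptom described earlier in the thread, while `toy_unplug()` always drains the queue. Locking (the io_request_lock mentioned above) is deliberately left out.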
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-25 22:14 UTC
To: Jens Axboe
Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 11:28:55PM +0200, Jens Axboe wrote:
> 	q->plug_device_fn(q, ...);
> 	list_add(...)
> 	generic_unplug_device(q);
>
> would suffice in scsi_lib for now.

It looks sane to me.

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Ingo Molnar @ 2000-09-25 14:13 UTC
To: Andrea Arcangeli
Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> I was _only_ talking about the blkdev hangs [...]

i guess this was just miscommunication. It never 'hung', it just performed
reads with 20k/sec or so. (without any writes being done in the
background.) A 'hang' for me is a deadlock or lockup, not a slowdown.

> that caused you to unplug the queue at each reschedule in tux and that
> Eric reported me for the SG driver (and I very much hope that with
> EXCLUSIVE gone away and the wait_on_* fixed those hangs will go away
> because I don't see anything else wrong at this moment).

okay, i'll test this.

Ingo

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Ingo Molnar @ 2000-09-25 14:29 UTC
To: Andrea Arcangeli
Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> driver (and I very much hope that with EXCLUSIVE gone away and the
> wait_on_* fixed those hangs will go away because I don't see anything else
> wrong at this moment).

the EXCLUSIVE thing only optimizes the wakeup, it's not semantic! How is it
better to let 100 processes race for one freed-up request slot?
There is no guarantee at all that the reader will win. If reads and writes
racing for request slots ever becomes a problem then we should introduce a
separate read and write waitqueue.

the EXCLUSIVE thing was noticed by Dimitris i think, and it makes tons of
(performance) sense.

Ingo

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-25 14:46 UTC
To: Ingo Molnar
Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:29:42PM +0200, Ingo Molnar wrote:
> There is no guarantee at all that the reader will win. If reads and writes
> racing for request slots ever becomes a problem then we should introduce a
> separate read and write waitqueue.

I agree. However here I also have an in-flight per-queue limit of locked
stuff (otherwise with 512k-sized requests on scsi I could fill 128mbyte of
RAM with locked requests in a few seconds, and I don't want to decrease the
size of the queue because it has to be large for aggressive reordering when
the requests are 4k each). This in-flight per-queue limit is actually a
non-exclusive wakeup and it triggers more often than the request shortage
(because most of the time writes are consecutive), and so having two
waitqueues, with reads registering themselves in both, shouldn't be a very
significant improvement at the moment (I should first care about a wake-one
in-flight-limit-per-queue wakeup :).

> the EXCLUSIVE thing was noticed by Dimitris i think, and it makes tons of

Actually I'm the one who introduced the EXCLUSIVE thing there and I audited
_all_ the device drivers to check they do 1 wakeup for each 1 request they
release before sending it off to Linus. But I never thought (until some
days ago) about the fact that if a read completes a reserved request the
write won't be able to accept it. So long term we'll do two wake-one queues
with reads registered in both.

Andrea

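The wake-one problem with a read-reserved slot, and the split-queue idea proposed as the long-term fix, can be sketched as a tiny deterministic model. This is a user-space illustration only, with hypothetical names (`enum waiter`, `wake_one_shared`, `wake_one_split`). It assumes a FIFO of sleeping waiters, a freed slot that is either general or reserved for reads, and that a writer cannot take a read-reserved slot. It models a single wakeup, not the later unplug that, as noted in the thread, eventually lets the reads proceed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum waiter { WAIT_READ, WAIT_WRITE };

/* A reader may take any freed slot; a writer only a general one. */
static bool can_use(enum waiter w, bool slot_read_only)
{
    return w == WAIT_READ || !slot_read_only;
}

/*
 * One shared wake-one queue: only the head waiter is woken. Returns the
 * index of the waiter that takes the slot, or -1 if the wakeup went to
 * a waiter that cannot use it (the slot stays idle for now).
 */
static int wake_one_shared(const enum waiter *q, size_t n, bool slot_read_only)
{
    assert(q != NULL || n == 0);
    if (n == 0)
        return -1;
    return can_use(q[0], slot_read_only) ? 0 : -1;
}

/*
 * Split queues, modelled over the same FIFO: a read-reserved slot wakes
 * the first reader, a general slot wakes the first waiter of either
 * kind (readers count as registered in both queues).
 */
static int wake_one_split(const enum waiter *q, size_t n, bool slot_read_only)
{
    size_t i;

    assert(q != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (can_use(q[i], slot_read_only))
            return (int)i;
    return -1;
}
```

With a writer sleeping ahead of a reader, a freed read-reserved slot is wasted by the shared queue but goes straight to the reader with split queues. For a general slot both variants wake the head of the FIFO, so the single-wakeup benefit is kept.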
* Re: [patch] vmfixes-2.4.0-test9-B2
From: Ingo Molnar @ 2000-09-25 14:53 UTC
To: Andrea Arcangeli
Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > the EXCLUSIVE thing was noticed by Dimitris i think, and it makes tons of
>
> Actually I'm the one who introduced the EXCLUSIVE thing there and I audited

sorry - i said it was *noticed* by Dimitris. (and sent to l-k IIRC)

Ingo

* Re: [patch] vmfixes-2.4.0-test9-B2
From: Andrea Arcangeli @ 2000-09-25 15:02 UTC
To: Ingo Molnar
Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:53:05PM +0200, Ingo Molnar wrote:
> sorry - i said it was *noticed* by Dimitris. (and sent to l-k IIRC)

I didn't know.

Andrea

* Re: __GFP_IO && shrink_[d|i]cache_memory()?
From: Stephen C. Tweedie @ 2000-09-24 21:38 UTC
To: Ingo Molnar
Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel, Stephen Tweedie

Hi,

On Sun, Sep 24, 2000 at 08:40:05PM +0200, Ingo Molnar wrote:
> On Sun, 24 Sep 2000, Linus Torvalds wrote:
>
> > [...] I don't think shrinking the inode cache is actually illegal when
> > GPF_IO isn't set. In fact, it's probably only the buffer cache itself
> > that has to avoid recursion - the other stuff doesn't actually do any
> > IO.
>
> i just found this out by example, i'm running the shrink_[i|d]cache stuff
> even if __GFP_IO is not set, and no problems so far. (and much better
> balancing behavior)

Careful --- I found out to my cost that there are hidden recursions
here. ext3 was bitten once by the fact that shrink_icache does a
quota drop, and that involves quota writeback if it was the last inode
on that particular quota struct.

shrinking the icache _usually_ involves no IO, but the quota case is
an exception which a lot of developers won't encounter during testing.

Cheers,
 Stephen

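One way to picture the combination of "move the __GFP_IO test into the shrink functions" and the quota caveat above is a shrinker that takes the allocation mask and skips only those objects whose release could start I/O. The following is a minimal user-space sketch under that assumption. It is neither ext3 nor 2.4 VM code: `TOY_GFP_IO`, `struct toy_inode` and `toy_shrink_icache` are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_GFP_IO 0x1u  /* hypothetical stand-in for __GFP_IO */

struct toy_inode {
    bool in_use;         /* still referenced: never freed here */
    bool last_on_quota;  /* freeing it would trigger quota writeback */
    bool freed;
};

/*
 * Free unused inodes. When the caller's mask does not allow I/O, skip
 * the inodes whose release would need I/O instead of skipping the whole
 * shrink pass. Returns the number of inodes freed by this call.
 */
static size_t toy_shrink_icache(struct toy_inode *inodes, size_t n,
                                unsigned gfp_mask)
{
    size_t i, freed = 0;

    assert(inodes != NULL || n == 0);
    for (i = 0; i < n; i++) {
        struct toy_inode *ino = &inodes[i];

        if (ino->in_use || ino->freed)
            continue;
        if (ino->last_on_quota && !(gfp_mask & TOY_GFP_IO))
            continue;    /* would recurse into I/O: leave it for later */
        ino->freed = true;
        freed++;
    }
    return freed;
}
```

A caller without the I/O bit still reclaims the plain unused inodes, and a later call with the bit set picks up the quota-bound one. Whether a real filesystem can tell in advance which inodes fall into the second group is exactly the kind of hidden dependency the message above warns about, so the `last_on_quota` flag is an idealisation.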
* Re: __GFP_IO && shrink_[d|i]cache_memory()?
  2000-09-24 21:38 ` __GFP_IO && shrink_[d|i]cache_memory()? Stephen C. Tweedie
@ 2000-09-24 23:20 ` Alan Cox
  0 siblings, 0 replies; 243+ messages in thread
From: Alan Cox @ 2000-09-24 23:20 UTC
To: Stephen C. Tweedie
Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

> quota drop, and that involves quota writeback if it was the last inode
> on that particular quota struct.
>
> shrinking the icache _usually_ involves no IO, but the quota case is
> an exception which a lot of developers won't encounter during testing.

We've had a history of weird quota deadlocks in 2.0 and earlier 2.2. Is
there a reason quota block writeback cannot be queued or handled by a
thread ?

Alan
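[Editorial sketch] The "queue it and let a thread handle it" idea raised in this message could take roughly the following shape. Whether this is workable in the kernel is exactly the open question here, so the code below is only a userspace model under stated assumptions: `toy_dquot`, `toy_wb_queue`, `toy_queue_quota_writeback()` and `toy_drain_quota_writeback()` are hypothetical names, the queue is a fixed-size array, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 16  /* hypothetical fixed queue size */

struct toy_dquot {
    int dirty;   /* needs writing back */
    int queued;  /* already on the writeback queue */
};

struct toy_wb_queue {
    struct toy_dquot *items[TOY_QUEUE_MAX];
    size_t count;
};

/* Allocation-path side: never performs IO, only records the work.
 * Returns 0 on success (or nothing to do), -1 if the queue is full. */
static int toy_queue_quota_writeback(struct toy_wb_queue *wq,
                                     struct toy_dquot *dq)
{
    if (!dq->dirty || dq->queued)
        return 0;
    if (wq->count == TOY_QUEUE_MAX)
        return -1;
    wq->items[wq->count++] = dq;
    dq->queued = 1;
    return 0;
}

/* Worker side: runs in a context where IO is allowed.
 * Returns the number of quota structs written back. */
static int toy_drain_quota_writeback(struct toy_wb_queue *wq)
{
    int written = 0;

    assert(wq->count <= TOY_QUEUE_MAX);
    for (size_t i = 0; i < wq->count; i++) {
        struct toy_dquot *dq = wq->items[i];

        dq->dirty = 0;   /* models the actual block write */
        dq->queued = 0;
        written++;
    }
    wq->count = 0;
    return written;
}
```

The point of the split is that the caller without `__GFP_IO` only touches memory, while the write happens later from a context that cannot recurse into the allocator's reclaim path. The full-queue return value marks the place where a real design would need a fallback.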