__GFP_IO && shrink_[d|i]cache

linux-mm.kvack.org archive mirror

* __GFP_IO && shrink_[d|i]cache_memory()?
@ 2000-09-24 10:11 Ingo Molnar
  2000-09-24 18:11 ` Linus Torvalds
  0 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-24 10:11 UTC (permalink / raw)
  To: Rik van Riel, Roger Larsson; +Cc: Linus Torvalds, MM mailing list, linux-kernel

i've seen a couple of GFP_BUFFER allocation deadlocks in an atypical
system which had lots of RAM allocated to inodes. The reason for the
deadlock is that the shrink_*() functions are not called if __GFP_IO is
not set. Nothing else can be freed at that point, so the try_again: loop
in __alloc_pages() (page_alloc.c) never terminates.

as an immediate solution the previous __GFP_WAIT suggestion solves the
deadlock - because the GFP_BUFFER allocator yields the CPU and kswapd can
run and do the dcache/icache shrinking. [i cannot reproduce any deadlocks
after doing this change.]

as a longer term solution, i'm wondering how hard it would be to propagate
gfp_mask into the shrink_*() functions, and prevent recursion similarly to
the swap-out logic? This way even GFP_BUFFER allocators could touch/free
the dcache/icache.
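
The failure mode described above can be illustrated with a small userspace
model. This is only a sketch, not kernel code: the TOY_* bit values, the
toy_* names and the retry bound are assumptions made for illustration. Only
the composition of GFP_BUFFER (no __GFP_IO) and GFP_KERNEL (with __GFP_IO)
follows the mm.h definitions quoted in the patch in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values are assumed for illustration; only the combinations matter. */
#define TOY_GFP_WAIT   0x01
#define TOY_GFP_HIGH   0x02
#define TOY_GFP_IO     0x04
#define TOY_GFP_BUFFER (TOY_GFP_HIGH | TOY_GFP_WAIT)
#define TOY_GFP_KERNEL (TOY_GFP_HIGH | TOY_GFP_WAIT | TOY_GFP_IO)

struct toy_vm {
    int free_pages;   /* pages immediately available */
    int icache_pages; /* pages held by the inode cache (hypothetical) */
};

/* Pre-patch behaviour: the caches are only shrunk when __GFP_IO is set. */
int toy_try_to_free(struct toy_vm *vm, unsigned int gfp_mask)
{
    if (!(gfp_mask & TOY_GFP_IO))
        return 0;           /* nothing else is reclaimable in this model */
    if (vm->icache_pages == 0)
        return 0;
    vm->icache_pages--;
    vm->free_pages++;
    return 1;
}

/*
 * Model of the try_again: loop. The real loop has no retry bound; the
 * bound here only keeps the demonstration finite.
 */
bool toy_alloc_page(struct toy_vm *vm, unsigned int gfp_mask, int max_retries)
{
    assert(max_retries >= 0);
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        if (vm->free_pages > 0) {
            vm->free_pages--;
            return true;
        }
        if (!(gfp_mask & TOY_GFP_WAIT))
            return false;   /* non-waiting callers fail instead of retrying */
        toy_try_to_free(vm, gfp_mask);
    }
    return false;           /* the real loop would still be spinning */
}
```

With all memory sitting in the modelled inode cache, a TOY_GFP_BUFFER
request exhausts any retry bound without freeing a page, while a
TOY_GFP_KERNEL request succeeds after one reclaim pass.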

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/


* Re: __GFP_IO && shrink_[d|i]cache_memory()?
  2000-09-24 10:11 __GFP_IO && shrink_[d|i]cache_memory()? Ingo Molnar
@ 2000-09-24 18:11 ` Linus Torvalds
  2000-09-24 18:40   ` Ingo Molnar
  0 siblings, 1 reply; 243+ messages in thread
From: Linus Torvalds @ 2000-09-24 18:11 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Ingo Molnar wrote:
> 
> as a longer term solution, i'm wondering how hard it would be to propagate
> gfp_mask into the shrink_*() functions, and prevent recursion similarly to
> the swap-out logic? This way even GFP_BUFFER allocators could touch/free
> the dcache/icache.

Well, the gfp_mask actually _is_ propagated already, it's just that if
__GFP_IO isn't set the calls are never done.

A trivial patch would move the __GFP_IO test into the functions (no change
in behaviour), and then slowly move the test down to the proper place. We
should be able to do some SHM swapping even if __GFP_IO isn't set. For
example, I don't think shrinking the inode cache is actually illegal when
GFP_IO isn't set. In fact, it's probably only the buffer cache itself that
has to avoid recursion - the other stuff doesn't actually do any IO.
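
The "no change in behaviour" step can be sketched in plain C as follows.
This is a hedged illustration, not the kernel code: the toy_* functions and
the TOY_* bit values are hypothetical stand-ins, and the reclaim routine is a
dummy. It only shows that testing the flag in the caller and testing it
inside the callee give the same result for every mask.

```c
#include <assert.h>

/* Assumed bit values, for illustration only. */
#define TOY_GFP_WAIT 0x01
#define TOY_GFP_IO   0x04

/* Stand-in for a reclaim routine that may start IO; returns pages freed. */
int toy_reclaim_unchecked(int priority)
{
    assert(priority >= 0);
    return priority > 0 ? 1 : 0;
}

/* Before: the caller decides whether the routine may run at all. */
int toy_caller_checks(int priority, unsigned int gfp_mask)
{
    if (gfp_mask & TOY_GFP_IO)
        return toy_reclaim_unchecked(priority);
    return 0;
}

/* After: the routine receives gfp_mask and honours the IO bit itself. */
int toy_reclaim(int priority, unsigned int gfp_mask)
{
    if (!(gfp_mask & TOY_GFP_IO))
        return 0;
    return toy_reclaim_unchecked(priority);
}

/* The caller now calls unconditionally and stays free of flag tests. */
int toy_caller_unconditional(int priority, unsigned int gfp_mask)
{
    return toy_reclaim(priority, gfp_mask);
}
```

Once the test lives inside the routine, it can later be moved further down
to the exact operation that must not recurse, without touching any caller.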

		Linus


* Re: __GFP_IO && shrink_[d|i]cache_memory()?
  2000-09-24 18:11 ` Linus Torvalds
@ 2000-09-24 18:40   ` Ingo Molnar
  2000-09-24 18:39     ` Linus Torvalds
  2000-09-24 21:38     ` __GFP_IO && shrink_[d|i]cache_memory()? Stephen C. Tweedie
  0 siblings, 2 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-24 18:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Linus Torvalds wrote:

> [...] I don't think shrinking the inode cache is actually illegal when
> GFP_IO isn't set. In fact, it's probably only the buffer cache itself
> that has to avoid recursion - the other stuff doesn't actually do any
> IO.

i just found this out by example, i'm running the shrink_[i|d]cache stuff
even if __GFP_IO is not set, and no problems so far. (and much better
balancing behavior)

	Ingo


* Re: __GFP_IO && shrink_[d|i]cache_memory()?
  2000-09-24 18:40   ` Ingo Molnar
@ 2000-09-24 18:39     ` Linus Torvalds
  2000-09-24 18:46       ` Linus Torvalds
  2000-09-24 21:38     ` __GFP_IO && shrink_[d|i]cache_memory()? Stephen C. Tweedie
  1 sibling, 1 reply; 243+ messages in thread
From: Linus Torvalds @ 2000-09-24 18:39 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Rik van Riel, Roger Larsson, MM mailing list, linux-kernel


On Sun, 24 Sep 2000, Ingo Molnar wrote:
> 
> i just found this out by example, i'm running the shrink_[i|d]cache stuff
> even if __GFP_IO is not set, and no problems so far. (and much better
> balancing behavior)

Send me the tested patch (and I'd suggest moving the shm_swap() test into
shm_swap() too, so that refill_inactive() gets cleaned up a bit).

		Linus


* Re: __GFP_IO && shrink_[d|i]cache_memory()?
  2000-09-24 18:39     ` Linus Torvalds
@ 2000-09-24 18:46       ` Linus Torvalds
  2000-09-24 18:59         ` Ingo Molnar
  2000-09-24 19:34         ` [patch] vmfixes-2.4.0-test9-B2 Ingo Molnar
  0 siblings, 2 replies; 243+ messages in thread
From: Linus Torvalds @ 2000-09-24 18:46 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

[ Sorry to follow up on myself.. ]

On Sun, 24 Sep 2000, Linus Torvalds wrote:
> 
> Send me the tested patch (and I'd suggest moving the shm_swap() test into
> shm_swap() too, so that refill_inactive() gets cleaned up a bit).

I think that shm_swap still needs it - it's doing things with
rw_swap_page() that mean that we cannot run it without GFP_IO.

HOWEVER, I suspect that in the long run we should move to making better
use of the page cache in the shm routines, and that might mean that
eventually we can do it even without GFP_IO (and instead let the generic
VM routines handle the actual IO on the swap cache).

So it makes sense to leave shm_swap() behaviour unchanged (ie do nothing
if GFP_IO is not set), but move the GFP_IO test down into shm_swap() so
that it will (a) match the other cases and (b) be easier to change the
GFP_IO logic later on if/when we clean up shm.

		Linus


* Re: __GFP_IO && shrink_[d|i]cache_memory()?
  2000-09-24 18:46       ` Linus Torvalds
@ 2000-09-24 18:59         ` Ingo Molnar
  2000-09-24 19:34         ` [patch] vmfixes-2.4.0-test9-B2 Ingo Molnar
  1 sibling, 0 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-24 18:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Linus Torvalds wrote:

> I think that shm_swap still needs it - it's doing things with
> rw_swap_page() that mean that we cannot run it without GFP_IO.

yep - i only pushed the test inside, it's functionally equivalent - it
only vanished from refill_inactive(). It's basically now a detail of the
lowlevel swapping functions to honor __GFP_IO.

> So it makes sense to leave shm_swap() behaviour unchanged (ie do
> nothing if GFP_IO is not set), but move the GFP_IO test down into
> shm_swap() so that it will (a) match the other cases and (b) be easier
> to change the GFP_IO logic later on if/when we clean up shm.

yep.

	Ingo


* [patch] vmfixes-2.4.0-test9-B2
  2000-09-24 18:46       ` Linus Torvalds
  2000-09-24 18:59         ` Ingo Molnar
@ 2000-09-24 19:34         ` Ingo Molnar
  2000-09-24 20:20           ` Rui Sousa
  2000-09-24 20:24           ` Andrea Arcangeli
  1 sibling, 2 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-24 19:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1616 bytes --]


the attached vmfixes-B2 patch adds the following fixes/cleanups:

vmscan.c:

 - check for __GFP_WAIT not __GFP_IO when yielding the CPU. This fixes
   GFP_BUFFER deadlocks. In fact since no caller to do_try_to_free_pages()
   can expect that function to not block, we don't test for __GFP_WAIT
   either. [GFP_KSWAPD is the only caller without __GFP_WAIT set.]

 - do shrink_[d|i]cache_memory() even if !__GFP_IO. This improves balance.

 - push the __GFP_IO test into shm_swap().

 - after shm_swap() do not test for !count but for <= 0, because count
   could be negative if in the future the shrink_ functions return bigger
   than 1, and we could then get into an infinite loop. Same after
   swap_out() and refill_inactive_scan(). No performance penalty, test
   for zero is exchanged with test for sign.

 - kmem_cache_reap() is done within refill_inactive(), so it's
   unnecessary to call it at the beginning of do_try_to_free_pages().
   Moved to the else branch. (i saw kmem_cache_reap() show up in profiles)

 - (small codestyle cleanup.)
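
The reasoning behind the "<= 0" item above can be checked with a small
standalone sketch. The function name and parameters are hypothetical; it
models only the countdown, assuming a shrink function that subtracts more
than count has left, and it uses an iteration bound so that the
non-terminating case stays finite.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Runs a countdown of the form used in the reclaim loops. 'overshoot'
 * models a shrink function that reports more freed pages than count has
 * left. Returns the number of loop iterations before the exit test fires,
 * or -1 if it has not fired after max_iters iterations.
 */
int toy_iterations_until_done(int count, int overshoot, bool sign_test,
                              int max_iters)
{
    assert(max_iters >= 0);
    count -= overshoot;     /* like: count -= shrink_dcache_memory(...) */
    for (int i = 1; i <= max_iters; i++) {
        --count;
        if (sign_test ? (count <= 0) : (count == 0))
            return i;       /* like: goto done */
    }
    return -1;
}
```

Starting from count = 2 with an overshoot of 3, the sign test exits on the
first iteration, while the test for zero never fires within the bound
because count is already negative and only moves further from zero.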


page_alloc.c:

 - in __alloc_pages(), the infinite allocation loop yields the CPU if
   necessary. This prevents a potential lockup on UP, and even on SMP it
   can prevent livelocks. (i saw this happen.)

mm.h:

 - made the GFP_ flag definitions easier to parse for humans :-)

 - remove shrink_mmap() prototype, it doesn't exist anymore.

shm.c:

 - the trivial test for __GFP_IO.

swap_state.c, filemap.c:

 - (shrink_mmap doesn't exist anymore, it's refill_inactive.)

(The patch applies and compiles cleanly, and is tested under various VM
loads i use.)

	Ingo

[-- Attachment #2: Type: TEXT/PLAIN, Size: 7375 bytes --]

--- linux/mm/vmscan.c.orig	Sun Sep 24 11:41:38 2000
+++ linux/mm/vmscan.c	Sun Sep 24 12:20:27 2000
@@ -119,7 +119,7 @@
 	 * our scan.
 	 *
 	 * Basically, this just makes it possible for us to do
-	 * some real work in the future in "shrink_mmap()".
+	 * some real work in the future in "refill_inactive()".
 	 */
 	if (!pte_dirty(pte)) {
 		flush_cache_page(vma, address);
@@ -159,7 +159,7 @@
 	 * NOTE NOTE NOTE! This should just set a
 	 * dirty bit in 'page', and just drop the
 	 * pte. All the hard work would be done by
-	 * shrink_mmap().
+	 * refill_inactive().
 	 *
 	 * That would get rid of a lot of problems.
 	 */
@@ -891,7 +891,7 @@
 	do {
 		made_progress = 0;
 
-		if (current->need_resched && (gfp_mask & __GFP_IO)) {
+		if (current->need_resched) {
 			__set_current_state(TASK_RUNNING);
 			schedule();
 		}
@@ -899,34 +899,32 @@
 		while (refill_inactive_scan(priority, 1) ||
 				swap_out(priority, gfp_mask, idle_time)) {
 			made_progress = 1;
-			if (!--count)
+			if (--count <= 0)
 				goto done;
 		}
 
-		/* Try to get rid of some shared memory pages.. */
-		if (gfp_mask & __GFP_IO) {
-			/*
-			 * don't be too light against the d/i cache since
-		   	 * shrink_mmap() almost never fail when there's
-		   	 * really plenty of memory free. 
-			 */
-			count -= shrink_dcache_memory(priority, gfp_mask);
-			count -= shrink_icache_memory(priority, gfp_mask);
-			/*
-			 * Not currently working, see fixme in shrink_?cache_memory
-			 * In the inner funtions there is a comment:
-			 * "To help debugging, a zero exit status indicates
-			 *  all slabs were released." (-arca?)
-			 * lets handle it in a primitive but working way...
-			 *	if (count <= 0)
-			 *		goto done;
-			 */
+		/*
+		 * don't be too light against the d/i cache since
+	   	 * refill_inactive() almost never fail when there's
+	   	 * really plenty of memory free. 
+		 */
+		count -= shrink_dcache_memory(priority, gfp_mask);
+		count -= shrink_icache_memory(priority, gfp_mask);
+		/*
+		 * Not currently working, see fixme in shrink_?cache_memory
+		 * In the inner funtions there is a comment:
+		 * "To help debugging, a zero exit status indicates
+		 *  all slabs were released." (-arca?)
+		 * lets handle it in a primitive but working way...
+		 *	if (count <= 0)
+		 *		goto done;
+		 */
 
-			while (shm_swap(priority, gfp_mask)) {
-				made_progress = 1;
-				if (!--count)
-					goto done;
-			}
+		/* Try to get rid of some shared memory pages.. */
+		while (shm_swap(priority, gfp_mask)) {
+			made_progress = 1;
+			if (--count <= 0)
+				goto done;
 		}
 
 		/*
@@ -934,7 +932,7 @@
 		 */
 		while (swap_out(priority, gfp_mask, 0)) {
 			made_progress = 1;
-			if (!--count)
+			if (--count <= 0)
 				goto done;
 		}
 
@@ -955,9 +953,9 @@
 			priority--;
 	} while (priority >= 0);
 
-	/* Always end on a shrink_mmap.., may sleep... */
+	/* Always end on a refill_inactive.., may sleep... */
 	while (refill_inactive_scan(0, 1)) {
-		if (!--count)
+		if (--count <= 0)
 			goto done;
 	}
 
@@ -970,11 +968,6 @@
 	int ret = 0;
 
 	/*
-	 * First, reclaim unused slab cache memory.
-	 */
-	kmem_cache_reap(gfp_mask);
-
-	/*
 	 * If we're low on free pages, move pages from the
 	 * inactive_dirty list to the inactive_clean list.
 	 *
@@ -992,13 +985,14 @@
 	 * the inode and dentry cache whenever we do this.
 	 */
 	if (free_shortage() || inactive_shortage()) {
-		if (gfp_mask & __GFP_IO) {
-			ret += shrink_dcache_memory(6, gfp_mask);
-			ret += shrink_icache_memory(6, gfp_mask);
-		}
-
+		ret += shrink_dcache_memory(6, gfp_mask);
+		ret += shrink_icache_memory(6, gfp_mask);
 		ret += refill_inactive(gfp_mask, user);
 	} else {
+		/*
+		 * Reclaim unused slab cache memory.
+		 */
+		kmem_cache_reap(gfp_mask);
 		ret = 1;
 	}
 
@@ -1153,9 +1147,8 @@
 {
 	int ret = 1;
 
-	if (gfp_mask & __GFP_WAIT) {
+	if (gfp_mask & __GFP_WAIT)
 		ret = do_try_to_free_pages(gfp_mask, 1);
-	}
 
 	return ret;
 }
--- linux/mm/page_alloc.c.orig	Sun Sep 24 11:44:59 2000
+++ linux/mm/page_alloc.c	Sun Sep 24 11:52:00 2000
@@ -444,6 +444,13 @@
 		 * processes, etc).
 		 */
 		if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Give other processes a chance to run:
+			 */
+			if (current->need_resched) {
+				__set_current_state(TASK_RUNNING);
+				schedule();
+			}
 			try_to_free_pages(gfp_mask);
 			memory_pressure++;
 			goto try_again;
--- linux/mm/filemap.c.orig	Sun Sep 24 12:20:35 2000
+++ linux/mm/filemap.c	Sun Sep 24 12:20:48 2000
@@ -1925,10 +1925,10 @@
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
  * data it wants to keep.  Be sure to free swap resources too.  The
- * zap_page_range call sets things up for shrink_mmap to actually free
+ * zap_page_range call sets things up for refill_inactive to actually free
  * these pages later if no one else has touched them in the meantime,
  * although we could add these pages to a global reuse list for
- * shrink_mmap to pick up before reclaiming other pages.
+ * refill_inactive to pick up before reclaiming other pages.
  *
  * NB: This interface discards data rather than pushes it out to swap,
  * as some implementations do.  This has performance implications for
--- linux/mm/swap_state.c.orig	Sun Sep 24 12:21:02 2000
+++ linux/mm/swap_state.c	Sun Sep 24 12:21:13 2000
@@ -166,7 +166,7 @@
 			return 0;
 		/*
 		 * Though the "found" page was in the swap cache an instant
-		 * earlier, it might have been removed by shrink_mmap etc.
+		 * earlier, it might have been removed by refill_inactive etc.
 		 * Re search ... Since find_lock_page grabs a reference on
 		 * the page, it can not be reused for anything else, namely
 		 * it can not be associated with another swaphandle, so it
--- linux/include/linux/mm.h.orig	Sun Sep 24 11:46:37 2000
+++ linux/include/linux/mm.h	Sun Sep 24 12:21:54 2000
@@ -441,7 +441,6 @@
 /* filemap.c */
 extern void remove_inode_page(struct page *);
 extern unsigned long page_unuse(struct page *);
-extern int shrink_mmap(int, int);
 extern void truncate_inode_pages(struct address_space *, loff_t);
 
 /* generic vm_area_ops exported for stackable file systems */
@@ -469,11 +468,11 @@
 
 #define GFP_BUFFER	(__GFP_HIGH | __GFP_WAIT)
 #define GFP_ATOMIC	(__GFP_HIGH)
-#define GFP_USER	(__GFP_WAIT | __GFP_IO)
-#define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
+#define GFP_USER	(             __GFP_WAIT | __GFP_IO)
+#define GFP_HIGHUSER	(             __GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
 #define GFP_KERNEL	(__GFP_HIGH | __GFP_WAIT | __GFP_IO)
 #define GFP_NFS		(__GFP_HIGH | __GFP_WAIT | __GFP_IO)
-#define GFP_KSWAPD	(__GFP_IO)
+#define GFP_KSWAPD	(                          __GFP_IO)
 
 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
--- linux/ipc/shm.c.orig	Sun Sep 24 11:45:16 2000
+++ linux/ipc/shm.c	Sun Sep 24 11:53:59 2000
@@ -1536,6 +1536,12 @@
 	int counter;
 	struct page * page_map;
 
+	/*
+	 * Push this inside:
+	 */
+	if (!(gfp_mask & __GFP_IO))
+		return 0;
+
 	zshm_swap(prio, gfp_mask);
 	counter = shm_rss >> prio;
 	if (!counter)


* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-24 19:34         ` [patch] vmfixes-2.4.0-test9-B2 Ingo Molnar
@ 2000-09-24 20:20           ` Rui Sousa
  2000-09-24 20:24           ` Andrea Arcangeli
  1 sibling, 0 replies; 243+ messages in thread
From: Rui Sousa @ 2000-09-24 20:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Sun, 24 Sep 2000, Ingo Molnar wrote:

Hi,

Did any of these lead to an infinite loop in swap_out()?

> 
> the attached vmfixes-B2 patch adds the following fixes/cleanups:
> 


Rui Sousa


* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-24 19:34         ` [patch] vmfixes-2.4.0-test9-B2 Ingo Molnar
  2000-09-24 20:20           ` Rui Sousa
@ 2000-09-24 20:24           ` Andrea Arcangeli
  2000-09-24 20:26             ` Ingo Molnar
  1 sibling, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-24 20:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Sun, Sep 24, 2000 at 09:34:43PM +0200, Ingo Molnar wrote:
>  - do shrink_[d|i]cache_memory() even if !__GFP_IO. This improves balance.

It will deadlock. (that same mistake was deadlocking early 2.2.x too btw)

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-24 20:24           ` Andrea Arcangeli
@ 2000-09-24 20:26             ` Ingo Molnar
  2000-09-24 21:12               ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-24 20:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Sun, 24 Sep 2000, Andrea Arcangeli wrote:

> >  - do shrink_[d|i]cache_memory() even if !__GFP_IO. This improves balance.
> 
> It will deadlock. (that same mistake was deadlocking early 2.2.x too btw)

where will it deadlock?

	Ingo


* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-24 20:26             ` Ingo Molnar
@ 2000-09-24 21:12               ` Andrea Arcangeli
  2000-09-24 21:12                 ` Ingo Molnar
  2000-09-25  0:09                 ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds
  0 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-24 21:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Sun, Sep 24, 2000 at 10:26:11PM +0200, Ingo Molnar wrote:
> where will it deadlock?

ext2_new_block (or whatever that runs getblk with the superlock lock
acquired)->getblk->GFP->shrink_dcache_memory->prune_dcache->
prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->
put_inode->ext2_discard_prealloc->ext2_free_blocks->lock_super->D
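
The call chain above can be reduced to a small userspace model. This is a
sketch under stated assumptions, not kernel code: the toy_* functions and
TOY_* bit values are hypothetical, the lock is a plain flag that records a
second acquisition instead of sleeping, and the buffer allocation is assumed
to use GFP_BUFFER (no __GFP_IO), as the start of the thread suggests.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit values, for illustration only. */
#define TOY_GFP_WAIT   0x01
#define TOY_GFP_HIGH   0x02
#define TOY_GFP_IO     0x04
#define TOY_GFP_BUFFER (TOY_GFP_HIGH | TOY_GFP_WAIT)

struct toy_super {
    bool locked;     /* non-recursive lock, like the superblock lock */
    bool would_hang; /* set when the holder tries to take the lock again */
};

void toy_lock_super(struct toy_super *sb)
{
    if (sb->locked) {
        sb->would_hang = true; /* a kernel task would sleep here forever */
        return;
    }
    sb->locked = true;
}

void toy_unlock_super(struct toy_super *sb)
{
    assert(sb->locked);
    sb->locked = false;
}

/* Models put_inode -> discard_prealloc -> free_blocks -> lock_super. */
void toy_put_inode(struct toy_super *sb)
{
    toy_lock_super(sb);
    if (!sb->would_hang)
        toy_unlock_super(sb);
}

/* Models getblk -> allocation -> (maybe) dcache shrinking -> put_inode. */
void toy_getblk(struct toy_super *sb, unsigned int gfp_mask,
                bool shrink_without_io)
{
    if ((gfp_mask & TOY_GFP_IO) || shrink_without_io)
        toy_put_inode(sb);
}

/* Models a caller that allocates a buffer with the lock already held. */
bool toy_new_block_hangs(bool shrink_without_io)
{
    struct toy_super sb = { false, false };

    toy_lock_super(&sb);
    toy_getblk(&sb, TOY_GFP_BUFFER, shrink_without_io);
    if (!sb.would_hang)
        toy_unlock_super(&sb);
    return sb.would_hang;
}
```

In this model toy_new_block_hangs(false) completes, because the missing IO
bit keeps the allocation away from the dcache, while
toy_new_block_hangs(true) re-enters the lock it already holds.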

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-24 21:12               ` Andrea Arcangeli
@ 2000-09-24 21:12                 ` Ingo Molnar
  2000-09-24 21:43                   ` Stephen C. Tweedie
  2000-09-25  4:56                   ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds
  2000-09-25  0:09                 ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds
  1 sibling, 2 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-24 21:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Sun, 24 Sep 2000, Andrea Arcangeli wrote:

> ext2_new_block (or whatever that runs getblk with the superlock lock
> acquired)->getblk->GFP->shrink_dcache_memory->prune_dcache->
> prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->
> put_inode->ext2_discard_prealloc->ext2_free_blocks->lock_super->D

nasty indeed, sigh. Shouldn't ext2_new_block drop the superblock lock in
places where we might block?

	Ingo


* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-24 21:12                 ` Ingo Molnar
@ 2000-09-24 21:43                   ` Stephen C. Tweedie
  2000-09-24 22:13                     ` Andrea Arcangeli
  2000-09-25  4:56                   ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds
  1 sibling, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-24 21:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel, Stephen Tweedie

Hi,

On Sun, Sep 24, 2000 at 11:12:39PM +0200, Ingo Molnar wrote:
> 
> > ext2_new_block (or whatever that runs getblk with the superlock lock
> > acquired)->getblk->GFP->shrink_dcache_memory->prune_dcache->
> > prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->
> > put_inode->ext2_discard_prealloc->ext2_free_blocks->lock_super->D
> 
> nasty indeed, sigh. Shouldn't ext2_new_block drop the superblock lock in
> places where we might block?

That's only a valid fix if there are no other filesystems, and no
other places in ext2, where we can call GFP with locks which prevent a
put_inode from being incurred.  And with the quota case to consider,
you have to avoid calling GFP with a lock against quota file writes
too (and since quota writes may GFP, this would deadlock if there was
any form of serialisation on the quota file).  This feels like rather
a lot of new and interesting deadlocks to be introducing so late in
2.4.  :-)

Cheers,
 Stephen

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-24 21:43                   ` Stephen C. Tweedie
@ 2000-09-24 22:13                     ` Andrea Arcangeli
  2000-09-24 22:36                       ` [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks bert hubert
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-24 22:13 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Sun, Sep 24, 2000 at 10:43:03PM +0100, Stephen C. Tweedie wrote:
> any form of serialisation on the quota file).  This feels like rather
> a lot of new and interesting deadlocks to be introducing so late in
> 2.4.  :-)

Agreed.

Andrea

* [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-24 22:13                     ` Andrea Arcangeli
@ 2000-09-24 22:36                       ` bert hubert
  2000-09-24 23:41                         ` Andrea Arcangeli
                                           ` (2 more replies)
  0 siblings, 3 replies; 243+ messages in thread
From: bert hubert @ 2000-09-24 22:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 12:13:42AM +0200, Andrea Arcangeli wrote:
> On Sun, Sep 24, 2000 at 10:43:03PM +0100, Stephen C. Tweedie wrote:
> > any form of serialisation on the quota file).  This feels like rather
> > a lot of new and interesting deadlocks to be introducing so late in
> > 2.4.  :-)

True. But they also appear to be found and solved at an impressive rate.
These deadlocks are fatal and don't hide in corners, whereas the previous mm
problems used to be very hard to spot and fix, there not being real
showstoppers, except for abysmal performance. [1]

Since Rik's stuff was merged, the number of eyeball hours devoted to MM has
skyrocketed, whereas the previous incarnations had far smaller audiences.
The patches are barely a week in, and look how much has been improved that
hadn't been found by the people working with Rik.

It's tempting to revert the merge, but let's work at it a bit longer. There
are problems, but we are solving them rapidly and both performance and
design of the new MM are pretty pleasing.

Let's not waste this opportunity.

Regards,

bert hubert

[1] bad performance is not often attributed to the Linux kernel - people
just assume that their problem is hard, because they don't have experience
with other unixes that might outperform us. We may be running Solaris and
other unices for reference, but your average user isn't.

-- 
PowerDNS                     Versatile DNS Services  
Trilab                       The Technology People   
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet


* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-24 22:36                       ` [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks bert hubert
@ 2000-09-24 23:41                         ` Andrea Arcangeli
  2000-09-25 16:24                           ` Stephen C. Tweedie
  2000-09-25 17:21                           ` bert hubert
  2000-09-25 15:09                         ` Miles Lane
  2000-09-25 15:51                         ` Stephen C. Tweedie
  2 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-24 23:41 UTC (permalink / raw)
  To: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 12:36:50AM +0200, bert hubert wrote:
> True. But they also appear to be found and solved at an impressive rate.

We're talking about the shrink_[id]cache_memory change. That has _nothing_ to do
with the VM changes that happened anywhere between test8 and test9-pre6.

You were talking about a different thing.

> It's tempting to revert the merge, but let's work at it a bit longer. There

Since you're talking about this, I'll soon (as soon as I finish some other
thing that is still work in progress) release a classzone against the latest
2.4.x. My approach is _quite_ different from the current VM. The current approach is
very imperfect and it's based solely on aging, whereas classzone had hooks into
the pagefault paths and all other map/unmap points to have perfect accounting of
the amount of active/inactive stuff. The mapped pages were never seen by
anything except swap_out, if they were mapped (it's not an "if page->age then move
into the active list"; with classzone the page was _already_ in the active list in
the first place since it was mapped).

I consider the current approach the wrong way to go and for this reason I prefer
to spend time porting/improving classzone.

In classzone the aging exists too but it's _completely_ orthogonal to how the rest
of the VM works. classzone had only 1 bit of aging per page to save space in the
mem_map_t array, so I'll extend the aging info from 1 bit to 32 bits to make it
more biased.

This is my humble opinion at least. I may be wrong. I'll let you know
once I have a patch I'm happy with and some real-life numbers to prove my
theory.

In the meantime you may want to go back to 2.4.0-test1-ac22-class++ and give it a
try under swap, to see the difference in behaviour and compare (Mike said
it's still an order of magnitude faster with his "make -j30 bzImage" testcase
and he's always very reliable in his reports).

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-24 23:41                         ` Andrea Arcangeli
@ 2000-09-25 16:24                           ` Stephen C. Tweedie
  2000-09-25 17:03                             ` Andrea Arcangeli
  2000-09-25 17:21                           ` bert hubert
  1 sibling, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 16:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 01:41:37AM +0200, Andrea Arcangeli wrote:
> 
> Since you're talking about this, I'll soon (as soon as I finish some other
> thing that is still work in progress) release a classzone against the latest
> 2.4.x. My approach is _quite_ different from the current VM. The current approach is
> very imperfect and it's based solely on aging, whereas classzone had hooks into
> the pagefault paths and all other map/unmap points to have perfect accounting of
> the amount of active/inactive stuff.

Andrea, I'm not quite sure what you're saying here.  Could you be a
bit more specific?

The current VM _does_ track the amount of active/inactive stuff.  It
does so by keeping separate lists of active and inactive stuff.
Accounting of memory pressure on these different lists is used to
generate dynamic targets for how many pages we aim to have on those
lists, so aging/reclaim activity is tuned to the current memory load.

Your other recent complaint, that newly-swapped pages end up on the
wrong end of the LRU lists and can't be reclaimed without cycling the
rest of the pages in shrink_mmap, is also cured in Rik's code, by
placing pages which are queued for swapout on a different list
altogether.  I thought we had managed to agree in Ottawa that such a
cure for the old 2.4 VM was desirable.

> The mapped pages were never seen by
> anything except swap_out, if they were mapped (it's not an "if page->age then move
> into the active list"; with classzone the page was _already_ in the active list in
> the first place since it was mapped).

This really seems to be the biggest difference between the two
approaches right now.  The FreeBSD folks believe fervently that one of
the main reasons that their VM rocks is that it ages cache pages and
mapped pages at the same rate.  Having both on the same aging list
achieves that.  Separating the two raises the question of how to
balance the aging of cache vs. swap in a fair manner.

> In classzone the aging exists too but it's _completely_ orthogonal to how the
> rest of the VM works.

Umm, that applies to Rik's stuff too!

> This is my humble opinion at least. I may be wrong. I'll let you know
> once I have a patch I'm happy with and some real life numbers to prove my
> theory.

Good, the best theoretical VM in the world can fall apart instantly on
contact with the real world. :-)

Cheers, 
 Stephen

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 16:24                           ` Stephen C. Tweedie
@ 2000-09-25 17:03                             ` Andrea Arcangeli
  2000-09-25 18:06                               ` Stephen C. Tweedie
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 17:03 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 05:24:42PM +0100, Stephen C. Tweedie wrote:
> Your other recent complaint, that newly-swapped pages end up on the
> wrong end of the LRU lists and can't be reclaimed without cycling the
> rest of the pages in shrink_mmap, is also cured in Rik's code, by
> placing pages which are queued for swapout on a different list
> altogether.  I thought we had managed to agree in Ottawa that such a
> cure for the old 2.4 VM was desirable.

Yes, I've seen it and the fix looks ok. It's the deactivate_page call when
we swap out the anonymous page. I overlooked it at first, I apologise.

> > The mapped pages were never seen by anything except swap_out, if they were
> > mapped (it's not an "if page->age then move into the active list"; with
> > classzone the page was _already_ in the active list in the first place, since
> > it was mapped).
> 
> This really seems to be the biggest difference between the two
> approaches right now.  The FreeBSD folks believe fervently that one of

Right.

And since you move the page into the active list only once you reach it from
the cache recycler and you find it with page->age != 0, you also spend time
moving those pages back and forth between those LRU lists, while in my approach
the mapped pages are never seen by the cache recycler and no cycle is spent on
them. This means that in a pure fs read test with cache pollution going on,
there's _no_way_ that classzone touches or notices _any_ mapped page in its path.

I think you can't be faster than classzone here.

When the cache isn't polluted, adding a few more bits of aging will let me know
better when it's time to unmap/swap out stuff. (It already works this way, but
with literally only 1 bit of aging at the moment.)

> the main reasons that their VM rocks is that it ages cache pages and
> mapped pages at the same rate.  Having both on the same aging list
> achieves that.  Separating the two raises the question of how to
> balance the aging of cache vs. swap in a fair manner.

I believe increasing the aging in the unmapped cache should take care of that
fine. (it was working pretty much fine also with only 1 bit of most
frequently used aging plus the LRU order of the list)

> > In classzone the aging exists too but it's _completely_ orthogonal to how
> > the rest of the VM works.
> 
> Umm, that applies to Rik's stuff too!

I may be overlooking something but where do you notice when a page
gets unmapped from the last mapping and put it back into a place
that can be reached from shrink_mmap (or whatever the cache recycler is)?

Since no mapped page can in any way be freed by the cache recycler
(you need to unmap it first from swap_out at the moment), if you
do reach those pages from the cache recycler in some way, it means
you're wasting CPU (I couldn't reach any mapped page from the
cache recycler in classzone, and in fact the mapped pages weren't
linked in any LRU at all, to save even more CPU).

> Good, the best theoretical VM in the world can fall apart instantly on
> contact with the real world. :-)

:))

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 17:03                             ` Andrea Arcangeli
@ 2000-09-25 18:06                               ` Stephen C. Tweedie
  2000-09-25 19:32                                 ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 18:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 07:03:47PM +0200, Andrea Arcangeli wrote:
> 
> > This really seems to be the biggest difference between the two
> > approaches right now.  The FreeBSD folks believe fervently that one of
> > [ aging cache and mapped pages in the same cycle ]
> 
> Right.
> 
> And since you move the page into the active list only once you reach it from
> the cache recycler and you find it with page->age != 0, you also spend time
> moving those pages back and forth between those LRU lists, while in my approach
> the mapped pages are never seen by the cache recycler and no cycle is spent on
> them. This means that in a pure fs read test with cache pollution going on,
> there's _no_way_ that classzone touches or notices _any_ mapped page in its path.

The "age==0" pages are basically just "pages we are ready to get rid
of right away".  The alternative to having that inactive list is to do
what we do today --- which is to throw away the pages immediately.
Having that extra list is simply giving pages a last chance before
evicting them.  It allows us to run reliably with fewer physically
free pages --- we can reap inactive pages with no IO so those pages
are as good as free for most purposes.

The alternative to moving pages to the inactive list would be freeing
them completely.  Moving a page back to the active list from inactive
is equivalent to avoiding a disk IO to pull in the page from backing
store.  It's supposed to be an optimisation to save physically
freeing things unless we really, really need to.  It is _not_ a
transition which recently referenced pages encounter.

> > the main reasons that their VM rocks is that it ages cache pages and
> > mapped pages at the same rate.  Having both on the same aging list
> > achieves that.  Separating the two raises the question of how to
> > balance the aging of cache vs. swap in a fair manner.
> 
> I believe increasing the aging in the unmapped cache should take care of that
> fine. (it was working pretty much fine also with only 1 bit of most
> frequently used aging plus the LRU order of the list)

Good.  One of the problems we always had in the past, though, was that
getting the relative aging of cache vs. vmas was easy if you had a
small set of test loads, but it was really, really hard to find a
balance that didn't show pathological behaviour in the worst cases.

> > > In classzone the aging exists too but it's _completely_ orthogonal to how
> > > the rest of the VM works.
> > 
> > Umm, that applies to Rik's stuff too!
> 
> I may be overlooking something but where do you notice when a page
> gets unmapped from the last mapping and put it back into a place
> that can be reached from shrink_mmap (or whatever the cache recycler is)?

It doesn't --- that is part of the design.  The vm scanner propagates
referenced bits to the struct page, so the new shrink_mmap can do its
aging based on whether a page has been referenced at all recently, not
caring whether the reference was a VM reference or a page cache
reference.  That is done specifically to address the balance issue
between VM and filesystem memory pressure.

> Since no mapped page can in any way be freed by the cache recycler
> (you need to unmap it first from swap_out at the moment), if you
> do reach those pages from the cache recycler in some way, it means
> you're wasting CPU (I couldn't reach any mapped page from the
> cache recycler in classzone, and in fact the mapped pages weren't
> linked in any LRU at all, to save even more CPU).

That's not how the current VM is supposed to work.  The cache scanner
isn't meant to reclaim pages --- it is meant to update the age
information on pages, which is not quite the same job.  If it finds
pages whose age becomes zero, those are shifted to the inactive list,
and once that list is large enough (ie. we have enough freeable
pages), it can give up.  The inactive list then gets physically freed
on demand.
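
A minimal userspace sketch of the scan described above, not kernel code. The
struct layout, the name age_pages, the "+3 on reference, halve otherwise"
constants and the array standing in for the lists are all assumptions made
for illustration; only the behaviour (update ages, shift age-zero pages to
the inactive list, give up once that list is large enough) comes from the
text.

```c
#include <assert.h>
#include <stddef.h>

enum list_id { ACTIVE, INACTIVE };

struct page {
    int age;           /* hypothetical aging counter */
    int referenced;    /* set if the page was touched since the last scan */
    enum list_id list; /* which list the page currently sits on */
};

/*
 * One aging pass over the active pages.  Pages whose age reaches zero are
 * shifted to the inactive list; the scan gives up as soon as the inactive
 * list holds inactive_target pages.  Returns the inactive page count.
 */
size_t age_pages(struct page *pages, size_t n, size_t inactive_target)
{
    size_t inactive = 0;

    assert(pages != NULL || n == 0);

    /* Count what is already inactive before doing any work. */
    for (size_t i = 0; i < n; i++)
        if (pages[i].list == INACTIVE)
            inactive++;

    for (size_t i = 0; i < n && inactive < inactive_target; i++) {
        struct page *p = &pages[i];

        if (p->list != ACTIVE)
            continue;

        if (p->referenced) {
            p->age += 3;        /* assumed "age up" step */
            p->referenced = 0;
        } else {
            p->age /= 2;        /* assumed "age down" step */
        }

        if (p->age == 0) {      /* as good as free: no IO needed to reap it */
            p->list = INACTIVE;
            inactive++;
        }
    }
    return inactive;
}
```

Note that the scan itself frees nothing; in this sketch a separate consumer
would take pages off the inactive list on demand.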

The fact that we have a common loop in the VM for updating all age
information is central to the design, and requires the cache recycler
to pass over all those pages.  By doing it that way, rather than from
the VM scan, we can avoid one of the really bad properties of the old
2.0 aging code --- it means that for shared pages, we only do the
aging once per walk over the pages regardless of how many ptes refer
to the page.  This avoids the nasty worst-case behaviour of having a
recently-referenced page thrown out of memory just because there also
happened to be a lot of old, unused references to it too. 
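
To make the worst case concrete, here is a small sketch contrasting the two
aging styles for one shared page. The function names and the AGE_UP/AGE_DOWN
constants are hypothetical; the point is only that per-pte aging lets many
stale references outvote one live reference, while per-page aging folds all
referenced bits together and ages once.

```c
#include <assert.h>
#include <stddef.h>

#define AGE_UP   3   /* assumed increment for a referenced page */
#define AGE_DOWN 1   /* assumed decrement for an unreferenced page */

static int clamp_age(int age)
{
    return age < 0 ? 0 : age;
}

/* Old style: the page age is adjusted once for every pte that maps it. */
int age_per_pte(int age, const int *pte_referenced, size_t nptes)
{
    assert(pte_referenced != NULL || nptes == 0);
    for (size_t i = 0; i < nptes; i++)
        age = clamp_age(age + (pte_referenced[i] ? AGE_UP : -AGE_DOWN));
    return age;
}

/* Per-page style: propagate the referenced bits to the page, then age once. */
int age_per_page(int age, const int *pte_referenced, size_t nptes)
{
    int referenced = 0;

    assert(pte_referenced != NULL || nptes == 0);
    for (size_t i = 0; i < nptes; i++)
        referenced |= pte_referenced[i];
    return clamp_age(age + (referenced ? AGE_UP : -AGE_DOWN));
}
```

With, say, a hundred ptes of which only the first was recently referenced,
age_per_pte drives a starting age of 5 down to zero, whereas age_per_page
raises it.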

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 18:06                               ` Stephen C. Tweedie
@ 2000-09-25 19:32                                 ` Andrea Arcangeli
  2000-09-25 19:26                                   ` Rik van Riel
  2000-09-25 19:54                                   ` Stephen C. Tweedie
  0 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 19:32 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 07:06:57PM +0100, Stephen C. Tweedie wrote:
> Good.  One of the problems we always had in the past, though, was that
> getting the relative aging of cache vs. vmas was easy if you had a
> small set of test loads, but it was really, really hard to find a
> balance that didn't show pathological behaviour in the worst cases.

Yep, that's not trivial.

> > I may be overlooking something but where do you notice when a page
> > gets unmapped from the last mapping and put it back into a place
> > that can be reached from shrink_mmap (or whatever the cache recycler is)?
> 
> It doesn't --- that is part of the design.  The vm scanner propagates

And that's the inferior part of the design IMHO.

> referenced bits to the struct page, so the new shrink_mmap can do its
> aging based on whether a page has been referenced at all recently, not

shrink_mmap couldn't care less about pages that it can't do anything
with. When it notices it can't do anything, it kicks in swap_out.

Having shrink_mmap browse the mapped page cache is as useless
as having shrink_mmap browse kernel memory and anonymous pages,
as it does in 2.2.x as far as I can tell. It's an algorithmic
complexity problem and it will waste lots of CPU.

Now think about this simple real life example. A 2G RAM machine running an
executable image of 1.5G, 300M in shm and 200M in cache.

No memory pressure, no need to swap anything anytime.

Now the application starts to read some gigabytes of data heavily from disk.

Why should shrink_mmap waste a huge amount of time rolling the 384000 mapped
pages back and forth between the LRUs? There's no memory pressure; there's
no need to check those mapped pages at all.

Classzone will make a huge difference in numbers in this scenario since
it will only work on the 300M of cache (it will never see the 1.5G of
mapped .text).

> caring whether the reference was a VM reference or a page cache
> reference.  That is done specifically to address the balance issue
> between VM and filesystem memory pressure.

I think it's not necessary to pay all that huge overhead to only learn
when it's time to kick swap_out in. When we're short in unmapped cache
we can just startup swap_out. That apparently works.

> That's not how the current VM is supposed to work.  The cache scanner
> isn't meant to reclaim pages --- it is meant to update the age
> information on pages, which is not quite the same job.  If it finds

So it will be the cache scanner (not the recycler) that will waste the CPU
cycles.

> pages whose age becomes zero, those are shifted to the inactive list,
> and once that list is large enough (ie. we have enough freeable
> pages), it can give up.  The inactive list then gets physically freed
> on demand.

So in a long cache-polluting read from disk the inactive list will come back
empty all the time, and so the cache scanner will have to waste the CPU as
described.

> The fact that we have a common loop in the VM for updating all age
> information is central to the design, and requires the cache recycler
> to pass over all those pages.  By doing it that way, rather than from

That's a waste IMHO. We don't need to pass over the mapped pages.

> 2.0 aging code --- it means that for shared pages, we only do the
> aging once per walk over the pages regardless of how many ptes refer
> to the page.  This avoids the nasty worst-case behaviour of having a

You'll still refresh the referenced bit too often for those pages because
they're referenced multiple times, so it will still be unfair. That said, it's
probably not that bad a property, since a heavily shared library is more
justified in living in cache than a page that is mapped only once.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 19:32                                 ` Andrea Arcangeli
@ 2000-09-25 19:26                                   ` Rik van Riel
  2000-09-25 22:28                                     ` Andrea Arcangeli
  2000-09-25 19:54                                   ` Stephen C. Tweedie
  1 sibling, 1 reply; 243+ messages in thread
From: Rik van Riel @ 2000-09-25 19:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 07:06:57PM +0100, Stephen C. Tweedie wrote:
> > Good.  One of the problems we always had in the past, though, was that
> > getting the relative aging of cache vs. vmas was easy if you had a
> > small set of test loads, but it was really, really hard to find a
> > balance that didn't show pathological behaviour in the worst cases.
> 
> Yep, that's not trivial.

It is. Just do physical-page based aging (so you age all the
pages in the system the same) and the problem is solved.

> > > I may be overlooking something but where do you notice when a page
> > > gets unmapped from the last mapping and put it back into a place
> > > that can be reached from shrink_mmap (or whatever the cache recycler is)?
> > 
> > It doesn't --- that is part of the design.  The vm scanner propagates
> 
> And that's the inferior part of the design IMHO.

Indeed, but physical page based aging is a definite
2.5 thing ... ;(

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 19:26                                   ` Rik van Riel
@ 2000-09-25 22:28                                     ` Andrea Arcangeli
  2000-09-25 22:26                                       ` Rik van Riel
                                                         ` (2 more replies)
  0 siblings, 3 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 22:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:26:17PM -0300, Rik van Riel wrote:
> > > It doesn't --- that is part of the design.  The vm scanner propagates
> > 
> > And that's the inferior part of the design IMHO.
> 
> Indeed, but physical page based aging is a definite
> 2.5 thing ... ;(

I'm talking about the fact that if you have a file mmapped in 1.5G of RAM
test9 will waste time rolling between LRUs 384000 pages, while classzone
won't ever see 1 of those pages until you run low on fs cache.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 22:28                                     ` Andrea Arcangeli
@ 2000-09-25 22:26                                       ` Rik van Riel
  2000-09-25 22:51                                         ` Andrea Arcangeli
  2000-09-25 22:30                                       ` Linus Torvalds
  2000-09-25 22:30                                       ` Juan J. Quintela
  2 siblings, 1 reply; 243+ messages in thread
From: Rik van Riel @ 2000-09-25 22:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson,
	MM mailing list, linux-kernel

On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 04:26:17PM -0300, Rik van Riel wrote:
> > > > It doesn't --- that is part of the design.  The vm scanner propagates
> > > 
> > > And that's the inferior part of the design IMHO.
> > 
> > Indeed, but physical page based aging is a definite
> > 2.5 thing ... ;(
> 
> I'm talking about the fact that if you have a file mmapped in
> 1.5G of RAM test9 will waste time rolling between LRUs 384000
> pages, while classzone won't ever see 1 of those pages until you
> run low on fs cache.

IMHO this is a minor issue because:
1) you need to do page replacement with shared pages
   right
2) you don't /want/ to run low on fs cache, you want
   to have a good balance between the cache(s) and
   the processes

OTOH, if you have a way to keep fair page aging and
fix the CPU time issue at the same time, I'd love
to see it.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 22:26                                       ` Rik van Riel
@ 2000-09-25 22:51                                         ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 22:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 07:26:56PM -0300, Rik van Riel wrote:
> IMHO this is a minor issue because:

I don't think it's a minor issue.

If you don't have a reschedule point in your equivalent of shrink_mmap and this
1.5G happens to be consecutive in the lru order (quite probable if it's
been paged in at a fast rate) then you may even hang in interruptible mode for
seconds as soon as somebody starts reading from disk. 2.4.x has to scale to
dozens of gigabytes of RAM, as there are archs supporting that amount of RAM.

> 2) you don't /want/ to run low on fs cache, you want

So I can't read more than the size that the fs cache can take? I must be
allowed to do that (it's 200 Mbyte of RAM, which can be more than enough
if the server mainly generates pollution anyway).

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 22:28                                     ` Andrea Arcangeli
  2000-09-25 22:26                                       ` Rik van Riel
@ 2000-09-25 22:30                                       ` Linus Torvalds
  2000-09-25 23:03                                         ` Andrea Arcangeli
  2000-09-25 22:30                                       ` Juan J. Quintela
  2 siblings, 1 reply; 243+ messages in thread
From: Linus Torvalds @ 2000-09-25 22:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Stephen C. Tweedie, Ingo Molnar, Roger Larsson,
	MM mailing list, linux-kernel


On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> 
> I'm talking about the fact that if you have a file mmapped in 1.5G of RAM
> test9 will waste time rolling between LRUs 384000 pages, while classzone
> won't ever see 1 of those pages until you run low on fs cache.

What drugs are you on? Nobody looks at the LRU's until the system is low
on memory. Sure, there's some background activity, but what are you
talking about? It's only when you're low on memory that _either_ approach
starts looking at the LRU list.

		Linus


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 22:30                                       ` Linus Torvalds
@ 2000-09-25 23:03                                         ` Andrea Arcangeli
  2000-09-25 23:18                                           ` Linus Torvalds
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 23:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Stephen C. Tweedie, Ingo Molnar, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:30:10PM -0700, Linus Torvalds wrote:
> On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> > 
> > I'm talking about the fact that if you have a file mmapped in 1.5G of RAM
> > test9 will waste time rolling between LRUs 384000 pages, while classzone
> > won't ever see 1 of those pages until you run low on fs cache.
> 
> What drugs are you on? Nobody looks at the LRU's until the system is low
> on memory. Sure, there's some background activity, but what are you

The system is low on memory when you run `free` and you see a value
< freepages_high*PAGE_SIZE in the "free" column first row.

> talking about? It's only when you're low on memory that _either_ approach
> starts looking at the LRU list.

The machine will run low on memory as soon as I read 200mbyte from disk.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 23:03                                         ` Andrea Arcangeli
@ 2000-09-25 23:18                                           ` Linus Torvalds
  2000-09-26  0:32                                             ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Linus Torvalds @ 2000-09-25 23:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Stephen C. Tweedie, Ingo Molnar, Roger Larsson,
	MM mailing list, linux-kernel


On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> 
> The machine will run low on memory as soon as I read 200mbyte from disk.

So? 

Yes, at that point we'll do the LRU dance. Then we won't be low on memory
any more, and we won't do the LRU dance any more. What's the magic in
zoneinfo that makes it not have to do the same thing?

		Linus


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 23:18                                           ` Linus Torvalds
@ 2000-09-26  0:32                                             ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-26  0:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Stephen C. Tweedie, Ingo Molnar, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:18:13PM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> > 
> > The machine will run low on memory as soon as I read 200mbyte from disk.
> 
> So? 
> 
> Yes, at that point we'll do the LRU dance. Then we won't be low on memory
> any more, and we won't do the LRU dance any more. What's the magic in

We'll run low on memory again as soon as we read the next page from disk and so
very soon we'll have to roll around all the 1.5G private mapping again.  (the
program has a file working set larger than 200M)

If you want to see some numbers I can produce them. The testcase only needs
to do:

	truncate(1.5G)
	mmap(1.5G MAP_PRIVATE)
	fault in read mode into the mapped 1.5G
	measure how long it takes to read N Giga from disk
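
The four steps above could be sketched in userspace C roughly as follows.
This is only an illustration under assumptions: the helper names
map_and_fault and time_read are hypothetical, the sizes and paths are left
as parameters (the text uses 1.5G and "N Giga"), and only standard POSIX
calls are used.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Steps 1-3: size the file with ftruncate(), map it MAP_PRIVATE and touch
 * every page in read mode.  Returns the mapping (caller munmap()s it) or
 * NULL on failure; *sum receives the sum of the bytes read so the loads
 * cannot be optimised away.
 */
void *map_and_fault(const char *path, size_t len, unsigned long *sum)
{
    assert(sum != NULL);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }

    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned long acc = 0;
    for (size_t off = 0; off < len; off += page)
        acc += p[off];              /* one read fault per page */
    *sum = acc;
    return p;
}

/*
 * Step 4: read a file sequentially and time it.  Returns elapsed seconds,
 * or -1.0 on error; *bytes_read receives the number of bytes read.
 */
double time_read(const char *path, size_t *bytes_read)
{
    static char buf[1 << 16];
    struct timespec t0, t1;
    size_t total = 0;
    ssize_t n;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += (size_t)n;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    if (n < 0)
        return -1.0;

    *bytes_read = total;
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

For a meaningful measurement the file passed to time_read would have to be a
separate data file larger than the free page cache, read while the mapping
returned by map_and_fault is still in place.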

> zoneinfo that makes it not have to do the same thing?

The name "classzone" is misleading. The zoneinfo change is not relevant to this
case (it started only with the zoneinfo change, that's why it's still called so).

What matters for this case is how the LRUs have been restructured.

Put simply, as soon as somebody faults in a pagecache page I remove the
page from the LRU. Then at munmap time (zap_page_range) the page is reinserted
into the LRU.

This keeps shrink_mmap from wasting time in the mapped regions, which shrink_mmap
can't do anything about anyway. This means that under cache pollution
not a single cycle is spent browsing those mapped pages, and I know when it's time
to swap out as a function of the age of the fs cache (so the system is very
efficient during cache pollution; this way the example performs the same as not
having any mapping in memory). The case without memory pressure (where the
working set fits in cache) is of course just fine.

When swap_out unmaps a page and puts it back into the LRU, I know that the
page has not been touched recently and I consider it to have zero age. (Actually
it's not a big deal since there's literally only 1 bit of age, so
this may change in the future by introducing more bits of info for the age.)
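
A toy model of the LRU handling just described, assuming a per-page mapping
count. The names (page_mapped, page_unmapped, lru_scan_cost) and the fields
are made up for illustration and are not taken from the classzone patch; the
behaviour modelled is only what the text states: a faulted-in page leaves the
LRU, the last unmap puts it back, and an unmap done by swap_out resets the
age to zero.

```c
#include <assert.h>
#include <stddef.h>

struct page {
    int mapcount;  /* number of mappings of this page (assumed field) */
    int on_lru;    /* 1 if the cache recycler can reach the page */
    int age;       /* the single bit of age mentioned in the text */
};

/* Fault path: the first mapping takes the page off the LRU. */
void page_mapped(struct page *p)
{
    if (p->mapcount++ == 0)
        p->on_lru = 0;
}

/*
 * munmap (zap_page_range) or swap_out path: dropping the last mapping puts
 * the page back on the LRU.  When swap_out did the unmapping the page is
 * known to be cold, so its age is reset to zero.
 */
void page_unmapped(struct page *p, int from_swap_out)
{
    assert(p->mapcount > 0);
    if (--p->mapcount == 0) {
        p->on_lru = 1;
        if (from_swap_out)
            p->age = 0;
    }
}

/* How many pages a recycler pass has to look at: only those on the LRU. */
size_t lru_scan_cost(const struct page *pages, size_t n)
{
    size_t visited = 0;

    for (size_t i = 0; i < n; i++)
        if (pages[i].on_lru)
            visited++;
    return visited;
}
```

In this model the cost of a recycler pass depends only on the number of
unmapped cache pages, however large the mapped set grows.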

Of course all the subtle cases of shared read-only anonymous pages added to the
swap cache, and page cache mapped but with bhs overlapped on it, and some other
non-obvious issues are handled correctly.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 22:28                                     ` Andrea Arcangeli
  2000-09-25 22:26                                       ` Rik van Riel
  2000-09-25 22:30                                       ` Linus Torvalds
@ 2000-09-25 22:30                                       ` Juan J. Quintela
  2000-09-25 23:00                                         ` Andrea Arcangeli
  2 siblings, 1 reply; 243+ messages in thread
From: Juan J. Quintela @ 2000-09-25 22:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Stephen C. Tweedie, Ingo Molnar, Linus Torvalds,
	Roger Larsson, MM mailing list, linux-kernel

>>>>> "andrea" == Andrea Arcangeli <andrea@suse.de> writes:

Hi

andrea> I'm talking about the fact that if you have a file mmapped in 1.5G of RAM
andrea> test9 will waste time rolling between LRUs 384000 pages, while classzone
andrea> won't ever see 1 of those pages until you run low on fs cache.

Which is completely wrong if the program uses _any not completely_
unusual locality of reference.  Think twice about that, it is more
probable that you need more than 300MB of filesystem cache than that you
have an application that references _randomly_ 1.5GB of data.  You need
to balance that _always_ :((((((

I think that there is no silver bullet here :(

Later, Juan.

-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 22:30                                       ` Juan J. Quintela
@ 2000-09-25 23:00                                         ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 23:00 UTC (permalink / raw)
  To: Juan J. Quintela
  Cc: Rik van Riel, Stephen C. Tweedie, Ingo Molnar, Linus Torvalds,
	Roger Larsson, MM mailing list, linux-kernel

On Tue, Sep 26, 2000 at 12:30:28AM +0200, Juan J. Quintela wrote:
> Which is completely wrong if the program uses _any not completely_
> unusual locality of reference.  Think twice about that, it is more
> probable that you need more than 300MB of filesystem cache than that you
> have an application that references _randomly_ 1.5GB of data.  You need
> to balance that _always_ :((((((

The application doesn't reference 1.5GB of data randomly. Assume
there's a big 2G executable (and yes I know there are) and I run it.
After some hours its RSS is 1.5G. Ok?

So now this program also shmgets a 300 Mbyte shm segment.

Now this program starts reading and writing terabytes of data that
wouldn't fit in cache even if there were 300G of RAM (and
this is possible too). Or maybe the program itself uses rawio
but then at a certain point you use the machine to run a tar somewhere.

Now tell me why this program needs more than 200Mbyte of fs cache
if the kernel doesn't waste time on the mapped pages (as in
classzone).

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 19:32                                 ` Andrea Arcangeli
  2000-09-25 19:26                                   ` Rik van Riel
@ 2000-09-25 19:54                                   ` Stephen C. Tweedie
  2000-09-25 22:44                                     ` Andrea Arcangeli
  2000-09-26  6:54                                     ` Christoph Rohland
  1 sibling, 2 replies; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 19:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 09:32:42PM +0200, Andrea Arcangeli wrote:

> Having shrink_mmap browse the mapped page cache is as useless
> as having shrink_mmap browse kernel memory and anonymous pages,
> as it does in 2.2.x as far as I can tell. It's an algorithmic
> complexity problem and it will waste lots of CPU.

It's a compromise between CPU cost and Getting It Right.  Ignoring the
mmap is not a good solution either.

> Now think this simple real life example. A 2G RAM machine running an executable
> image of 1.5G, 300M in shm and 200M in cache.

OK, and here's another simple real life example.  A 2GB RAM machine
running something like Oracle with a hundred client processes all
shm-mapping the same shared memory segment.

Oh, and you're also doing lots of file IO.  How on earth do you decide
what to swap and what to page out in this sort of scenario, where
basically the whole of memory is data cache, some of which is mapped
and some of which is not?

If you don't separate out the propagation of referenced bits from the
actual page aging, then every time you pass over the whole VM working
set, you're likely to find a handful of live references to some of the
shared memory, and a hundred or so references that haven't done
anything since last time.  Anything that only ages per-pte, not
per-page, is simply going to die horribly under such load, and any
imbalance between pure filesystem cache and VM pressure will be
magnified to the point where one dominates.
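
The effect described above can be sketched with a toy model. This is only an
illustration under stated assumptions: the constants AGE_ADV and AGE_DECL and
both update rules are hypothetical and are not taken from any kernel version.
It compares adjusting a page's age once per pte against folding all the pte
referenced bits into a single per-page bit before aging once.

```python
# Toy model only: AGE_ADV, AGE_DECL and the update rules are hypothetical,
# not taken from any kernel version.
AGE_ADV = 3    # added to the age when a reference is seen
AGE_DECL = 1   # subtracted from the age when no reference is seen


def age_per_pte(age, referenced_bits):
    """Adjust the page age once for every pte that maps the page."""
    for referenced in referenced_bits:
        if referenced:
            age += AGE_ADV
        else:
            age = max(0, age - AGE_DECL)
    return age


def age_per_page(age, referenced_bits):
    """Fold all pte referenced bits into one per-page bit, then age once."""
    if any(referenced_bits):
        return age + AGE_ADV
    return max(0, age - AGE_DECL)


# 100 processes map the same shm page; only 3 touched it since the last scan.
bits = [True] * 3 + [False] * 97
print(age_per_pte(10, bits))    # the 97 idle mappings drag the age down
print(age_per_page(10, bits))   # one live reference is enough to age up
```

In this model the page that is actively used by three clients looks cold to
the per-pte scheme simply because 97 other mappings were idle.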

Hence my observation that it's really easy to find special cases where
certain optimisations make a ton of sense, but you often lose balance
in the process.  

Cheers,
 Stephen

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 19:54                                   ` Stephen C. Tweedie
@ 2000-09-25 22:44                                     ` Andrea Arcangeli
  2000-09-25 22:42                                       ` Rik van Riel
  2000-09-26  6:54                                     ` Christoph Rohland
  1 sibling, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 22:44 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 08:54:57PM +0100, Stephen C. Tweedie wrote:
> OK, and here's another simple real life example.  A 2GB RAM machine
> running something like Oracle with a hundred client processes all
> shm-mapping the same shared memory segment.

Oracle keeps its SHM locked, and it will never run on a machine without
enough memory.

> Oh, and you're also doing lots of file IO.  How on earth do you decide
> what to swap and what to page out in this sort of scenario, where
> basically the whole of memory is data cache, some of which is mapped
> and some of which is not?

As I said in the last email, aging on the cache is supposed to do that.

Wasting CPU and increasing the complexity of the algorithm is a price
that I won't pay just to get the information on when it's time
to call swap_out() again.

If the cache has no age it means I'd better throw it out instead
of swapping/unmapping stuff out, simple?

> anything since last time.  Anything that only ages per-pte, not
> per-page, is simply going to die horribly under such load, and any

The aging on the fs cache is done per-page.

The per-pte issue arises once we have taken the difficult decision (that it is
time to swap out), and you have the same problem because you don't know the
chain of ptes that point to the physical page (so you refresh the referenced
bit more often). Once we have the chain of ptes pointing to the page,
classzone will only need a real lru for the mapped pages, to use instead of
walking pagetables.

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 22:44                                     ` Andrea Arcangeli
@ 2000-09-25 22:42                                       ` Rik van Riel
  0 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2000-09-25 22:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Roger Larsson,
	MM mailing list, linux-kernel

On Tue, 26 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 08:54:57PM +0100, Stephen C. Tweedie wrote:

> > basically the whole of memory is data cache, some of which is mapped
> > and some of which is not?
> 
> As I said in the last email, aging on the cache is supposed to do that.
> 
> Wasting CPU and increasing the complexity of the algorithm is a price
> that I won't pay just to get the information on when it's time
> to call swap_out() again.

You must be joking. Page replacement should be tuned to
do good page replacement, not just to be easy on the CPU.
(though a heavily thrashing system /is/ easy on the cpu,
I'll have to admit that)

> If the cache has no age it means I'd better throw it out instead
> of swapping/unmapping stuff out, simple?

Simple, yes. But completely BOGUS if you don't age the cache
and the mapped pages at the same rate!

If I age your pages twice as much as my pages, is it still
only fair that your pages will be swapped out first? ;)
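
A small simulation makes the point about unequal aging rates concrete. It is
a sketch under assumptions: the tick model, the test-and-clear referenced bit
and the adv/decl constants are all hypothetical. Both pages are touched
equally often; the only difference is how often the scanner visits them.

```python
def final_age(scan_interval, ref_interval, ticks, age=0, adv=3, decl=1):
    """Age of a page touched every ref_interval ticks and scanned every
    scan_interval ticks (hypothetical constants, toy model)."""
    referenced = False
    for tick in range(1, ticks + 1):
        if tick % ref_interval == 0:
            referenced = True            # an access sets the referenced bit
        if tick % scan_interval == 0:    # the scanner visits this page
            if referenced:
                age += adv
            else:
                age = max(0, age - decl)
            referenced = False           # test-and-clear
    return age


# Identical access pattern, but "your" page is scanned twice as often.
yours = final_age(scan_interval=1, ref_interval=4, ticks=40)
mine = final_age(scan_interval=2, ref_interval=4, ticks=40)
print(yours, mine)
```

The page that is scanned twice as often ends up with the lower age, so it
would be evicted first even though both pages are used equally.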

> > anything since last time.  Anything that only ages per-pte, not
> > per-page, is simply going to die horribly under such load, and any
> 
> The aging on the fs cache is done per-page.

And the same should be done for other pages as well.
If you don't do that, you'll have big problems keeping
page replacement balanced and making the system work well
under various loads.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 19:54                                   ` Stephen C. Tweedie
  2000-09-25 22:44                                     ` Andrea Arcangeli
@ 2000-09-26  6:54                                     ` Christoph Rohland
  2000-09-26 14:05                                       ` Andrea Arcangeli
  1 sibling, 1 reply; 243+ messages in thread
From: Christoph Rohland @ 2000-09-26  6:54 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrea Arcangeli, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

"Stephen C. Tweedie" <sct@redhat.com> writes:

> Hi,
> 
> On Mon, Sep 25, 2000 at 09:32:42PM +0200, Andrea Arcangeli wrote:
> 
> > Having shrink_mmap browse the mapped page cache is as useless
> > as having shrink_mmap browse kernel memory and anonymous pages,
> > as it does in 2.2.x as far as I can tell. It's an algorithmic
> > complexity problem and it will waste lots of CPU.
> 
> It's a compromise between CPU cost and Getting It Right.  Ignoring the
> mmap is not a good solution either.
> 
> > Now think this simple real life example. A 2G RAM machine running
> > an executable image of 1.5G, 300M in shm and 200M in cache.

Hey that's ridiculous: 1.5G executable image and 300M shm? Take it
vice-versa and you are approaching real life.

> OK, and here's another simple real life example.  A 2GB RAM machine
> running something like Oracle with a hundred client processes all
> shm-mapping the same shared memory segment.

That sounds much more realistic.

> Oh, and you're also doing lots of file IO.  How on earth do you decide
> what to swap and what to page out in this sort of scenario, where
> basically the whole of memory is data cache, some of which is mapped
> and some of which is not?
> 
> If you don't separate out the propagation of referenced bits from the
> actual page aging, then every time you pass over the whole VM working
> set, you're likely to find a handful of live references to some of the
> shared memory, and a hundred or so references that haven't done
> anything since last time.  Anything that only ages per-pte, not
> per-page, is simply going to die horribly under such load, and any
> imbalance between pure filesystem cache and VM pressure will be
> magnified to the point where one dominates.

Yes, and that's why I stress most of the patch levels with my ipctst
program on a highmem machine. It simulates a load like this: a lot
of processes attached to shm segments and trashing them. There were
very few kernels which really worked with that load without totally
breaking or killing processes _way_ too early.

> Hence my observation that it's really easy to find special cases where
> certain optimisations make a ton of sense, but you often lose balance
> in the process.  

O.K. My test case is such a special case, but it is related to real-life
transactional load on a high-end server.

Greetings
		Christoph


* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-26  6:54                                     ` Christoph Rohland
@ 2000-09-26 14:05                                       ` Andrea Arcangeli
  2000-09-26 16:20                                         ` Christoph Rohland
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-26 14:05 UTC (permalink / raw)
  To: Christoph Rohland
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Tue, Sep 26, 2000 at 08:54:23AM +0200, Christoph Rohland wrote:
> "Stephen C. Tweedie" <sct@redhat.com> writes:
> 
> > Hi,
> > 
> > On Mon, Sep 25, 2000 at 09:32:42PM +0200, Andrea Arcangeli wrote:
> > 
> > > Having shrink_mmap browse the mapped page cache is as useless
> > > as having shrink_mmap browse kernel memory and anonymous pages,
> > > as it does in 2.2.x as far as I can tell. It's an algorithmic
> > > complexity problem and it will waste lots of CPU.
> > 
> > It's a compromise between CPU cost and Getting It Right.  Ignoring the
> > mmap is not a good solution either.
> > 
> > > Now think this simple real life example. A 2G RAM machine running
> > > an executable image of 1.5G, 300M in shm and 200M in cache.
> 
> Hey that's ridiculous: 1.5G executable image and 300M shm? Take it
> vice-versa and you are approaching real life.

Could you tell me what's wrong with having an app with a 1.5G mapped executable
(or a tiny executable but with a 1.5G shared/private file mapping if you
prefer), 300M of shm (or 300M of anonymous memory if you prefer) and 200M as
filesystem cache?

The application has a miscellaneous I/O load, part of which falls outside
the working set. What's wrong with this?

What's ridiculous? Please elaborate.

To emulate that workload we only need to mmap(1.5G, MAP_PRIVATE or MAP_SHARED),
fault into it, and run bonnie.
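
The first half of that recipe might look roughly like the sketch below. It is
an assumption-laden illustration: an anonymous mapping stands in for the file
mapping discussed above (to exercise the page cache you would map a real file
instead), the helper name fault_in is hypothetical, and the size is kept tiny
so the sketch is harmless to run.

```python
import mmap


def fault_in(size_bytes, shared=False):
    """Map size_bytes of anonymous memory and write to every page of it."""
    kind = mmap.MAP_SHARED if shared else mmap.MAP_PRIVATE
    region = mmap.mmap(-1, size_bytes, flags=kind | mmap.MAP_ANONYMOUS)
    for offset in range(0, size_bytes, mmap.PAGESIZE):
        region[offset] = 1   # one write per page forces the page fault
    return region


# Tiny size for illustration; raise it toward 1.5G on a real test box.
region = fault_in(64 * mmap.PAGESIZE)
```

With the mapping held resident, the I/O benchmark is then run in parallel to
generate the cache pressure.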

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-26 14:05                                       ` Andrea Arcangeli
@ 2000-09-26 16:20                                         ` Christoph Rohland
  2000-09-26 17:10                                           ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Christoph Rohland @ 2000-09-26 16:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:

> Could you tell me what's wrong in having an app with a 1.5G mapped executable
> (or a tiny executable but with a 1.5G shared/private file mapping if you
> prefer),

O.K. that sounds more reasonable. I was reading image as program
text... and a 1.5GB program text is something I have never seen (and
hopefully will never see :-)

> 300M of shm (or 300M of anonymous memory if you prefer) and 200M as
> filesystem cache?

I don't really see a reason for fs cache in the application. I think
that parallel applications tend to share either almost everything or nothing,
but I may be wrong here.

> The application have a misc I/O load that in some part will run out
> of the working set, what's wrong with this?
> 
> What's ridiculous? Please elaborate.

I think we fixed this misreading. 

But still IMHO you underestimate the importance of shared memory for a
lot of applications in the high end. There is not only Oracle out
there and most of the shared memory is _not_ locked.

Greetings
		Christoph

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-26 16:20                                         ` Christoph Rohland
@ 2000-09-26 17:10                                           ` Andrea Arcangeli
  2000-09-27  8:11                                             ` Christoph Rohland
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-26 17:10 UTC (permalink / raw)
  To: Christoph Rohland
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Tue, Sep 26, 2000 at 06:20:47PM +0200, Christoph Rohland wrote:
> O.K. that sounds more reasonable. I was reading image as program
> text... and a 1.5GB program text is something I have never seen (and
> hopefully will never see :-)

:)

From the point of view of shrink_mmap's algorithmic complexity, a 1.5GB .text is
completely equivalent to a 1.5GB MAP_SHARED or MAP_PRIVATE mapping
(it doesn't need to be the .text of the program).

That said, I have heard of real-world programs that have a .text larger than 2G
(that's why I wasn't very careful to say that it doesn't need to be a 1.5G
.text, and that any other page-cache mapping that large would have the same
effect).

> > 300M of shm (or 300M of anonymous memory if you prefer) and 200M as
> > filesystem cache?
> 
> I don't really see a reason for fs cache in the application. I think

In fact the application may as well use rawio.

> that parallel applications tend to either share mostly all or nothing,
> but I may be wrong here.

And then at some point you'll run `find /` or `tar mylatestsources.tar.gz
sources/`, or updatedb starts up, or whatever. And you don't need more
than 200M of fs cache for that purpose.

Think of the O(N) complexity that we had in si_meminfo (guess why in 2.4.x
`free` says 0 in the shared field). It made it impossible to run `xosview` on a
10G box (it stalled for seconds).

And si_meminfo was only counting one field, not rolling pages around
the lru, grabbing locks and dirtying cachelines.

That's a plain complexity/scalability issue as far as I can tell, and classzone
solves it completely.  When you run tar with your 1.5G shared mapping in memory
and you happen to hit the low watermark and need to recycle some bytes of
old cache, you'll run as fast as without the mapping in memory. There will be
zero difference in performance.  (Just as `free` now runs as fast on a 10G
machine as on a 4 Mbyte machine.)

> I think we fixed this misreading. 

I should have explained things more carefully from the start, sorry.

> But still IMHO you underestimate the importance of shared memory for a
> lot of applications in the high end. There is not only Oracle out
> there and most of the shared memory is _not_ locked.

Well, I wasn't claiming that this optimization matters much for DB
applications (at least for DBs that don't use fairly big file mappings).

I know Oracle (and most other DBs) are very shm intensive.  However, the fact
that you say the shm is not locked in memory is really news to me. I remembered
the shm being locked.

I also don't see the point of keeping data cache in swap. Swap involves SMP
tlb flushes and all the other big overhead that you could avoid by sizing
the shm cache properly and keeping it locked.

Note: having very fast shm swapout/swapin is a very good thing (in fact we
introduced readaround of the swapin and moved shm swapout/swapin locking to the
swap cache in early 2.3.x exactly for that reason). But I just don't think
DBMSes needed that.

Note: simulations are a completely different thing (their evolution is not
predictable). Simulations can certainly thrash shm into swap at any time (but
Oracle shouldn't do that AFAIK).

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-26 17:10                                           ` Andrea Arcangeli
@ 2000-09-27  8:11                                             ` Christoph Rohland
  2000-09-27  8:28                                               ` Ingo Molnar
  2000-09-27 13:56                                               ` Andrea Arcangeli
  0 siblings, 2 replies; 243+ messages in thread
From: Christoph Rohland @ 2000-09-27  8:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:

> Said that I heard of real world programs that have a .text larger than 2G

=:-O

> I know Oracle (and most other DBs) are very shm intensive.  However,
> the fact that you say the shm is not locked in memory is really news to
> me. I remembered the shm being locked.

I just checked one Oracle system and it did not lock the memory. And I
do not think that the other databases do it by default either.

And our application server definitely doesn't do it. And it uses loads
of shared memory. We will soon have application servers with 16 GB of
memory at customer sites which will have the whole memory in shmfs.

> I also don't see the point of keeping data cache in swap. Swap
> involves SMP tlb flushes and all the other big overhead that you
> could avoid by sizing the shm cache properly and keeping it locked.
> 
> Note: having very fast shm swapout/swapin is a very good thing (in fact
> we introduced readaround of the swapin and moved shm swapout/swapin
> locking to the swap cache in early 2.3.x exactly for that
> reason). But I just don't think DBMSes needed that.

Nobody should rely on shm swapping for production use. But you have
changing/increasing loads on application servers and all of a sudden
you run OOM. In this case the system should behave and it is _very_
good to have smooth behaviour.

Customers with performance problems very often start with too little
memory, but they cannot upgrade until this really big job finishes :-(

Another issue about shm swapping is interactive transactions, where
some users have very large contexts and go for a coffee before
submitting. This memory can be swapped. 

Greetings
		Christoph

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-27  8:11                                             ` Christoph Rohland
@ 2000-09-27  8:28                                               ` Ingo Molnar
  2000-09-27  9:24                                                 ` Christoph Rohland
  2000-09-27 13:56                                               ` Andrea Arcangeli
  1 sibling, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-27  8:28 UTC (permalink / raw)
  To: Christoph Rohland
  Cc: Andrea Arcangeli, Stephen C. Tweedie, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On 27 Sep 2000, Christoph Rohland wrote:

> Nobody should rely on shm swapping for production use. But you have
> changing/increasing loads on application servers and all of a sudden
> you run OOM. In this case the system should behave and it is _very_
> good to have smooth behaviour.

it might make sense even in production use. If there is some calculation
that has to be done only once per month, then sure the customer can decide
to wait for it a few hours until it swaps itself ready, instead of buying
gigs of RAM just to execute this single operation faster. Uncooperative
OOM in such cases is a show-stopper. Or are you saying the same thing? :-)

	Ingo


* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-27  8:28                                               ` Ingo Molnar
@ 2000-09-27  9:24                                                 ` Christoph Rohland
  0 siblings, 0 replies; 243+ messages in thread
From: Christoph Rohland @ 2000-09-27  9:24 UTC (permalink / raw)
  To: mingo
  Cc: Andrea Arcangeli, Stephen C. Tweedie, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Ingo Molnar <mingo@elte.hu> writes:

> On 27 Sep 2000, Christoph Rohland wrote:
> 
> > Nobody should rely on shm swapping for production use. But you have
> > changing/increasing loads on application servers and all of a sudden
> > you run OOM. In this case the system should behave and it is _very_
> > good to have smooth behaviour.
> 
> it might make sense even in production use. If there is some calculation
> that has to be done only once per month, then sure the customer can decide
> to wait for it a few hours until it swaps itself ready, instead of buying
> gigs of RAM just to execute this single operation faster. Uncooperative
> OOM in such cases is a show-stopper. Or are you saying the same thing? :-)

That's what I meant with the coffee break. In a big installation
somebody is always drinking coffee :-)
 
You also often have different loads during daytime and
nighttime. Swapping buffers out to swap disk instead of rereading from
the database makes a lot of sense for this. But a single job should
never swap. (It works for two months and then next month you get the
big escalation and you would love to have hotplug memory)

So swapping happens in production use. But nobody should rely on
that too much.

And I completely agree that uncooperative OOM is not acceptable.

Greetings
		Christoph

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-27  8:11                                             ` Christoph Rohland
  2000-09-27  8:28                                               ` Ingo Molnar
@ 2000-09-27 13:56                                               ` Andrea Arcangeli
  2000-09-27 16:56                                                 ` Christoph Rohland
  2000-09-28 10:08                                                 ` Rik van Riel
  1 sibling, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-27 13:56 UTC (permalink / raw)
  To: Christoph Rohland
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Wed, Sep 27, 2000 at 10:11:43AM +0200, Christoph Rohland wrote:
> I just checked one oracle system and it did not lock the memory. And I

If that memory is used for I/O cache then it should be released when the
system runs into swap instead of being swapped out too (otherwise it's not cache
anymore and it could be slower than re-reading the real data from disk with
rawio).

> Customers with performance problems very often start with too little
> memory, but they cannot upgrade until this really big job finishes :-(
> 
> Another issue about shm swapping is interactive transactions, where
> some users have very large contexts and go for a coffee before
> submitting. This memory can be swapped. 

Agreed, that's why I said shm performance under swap is very important
as well (I'm not underestimating it).

But again: if the shm contains I/O cache it should be released and not swapped
out.  Swapping out shmfs that contains I/O cache would be exactly like swapping
out page-cache.

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-27 13:56                                               ` Andrea Arcangeli
@ 2000-09-27 16:56                                                 ` Christoph Rohland
  2000-09-27 17:42                                                   ` Andrea Arcangeli
  2000-09-28 10:08                                                 ` Rik van Riel
  1 sibling, 1 reply; 243+ messages in thread
From: Christoph Rohland @ 2000-09-27 16:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:

> On Wed, Sep 27, 2000 at 10:11:43AM +0200, Christoph Rohland wrote:
> > I just checked one oracle system and it did not lock the memory. And I
> 
> If that memory is used for I/O cache then it should be
> released when the system runs into swap instead of being swapped out
> too (otherwise it's not cache anymore and it could be slower than
> re-reading the real data from disk with rawio).

Yes, but how does the application detect that it should free the mem?
Also you often have more overhead reading out of a database than
having preprocessed data in swap.

> > Customers with performance problems very often start with too little
> > memory, but they cannot upgrade until this really big job finishes :-(
> > 
> > Another issue about shm swapping is interactive transactions, where
> > some users have very large contexts and go for a coffee before
> > submitting. This memory can be swapped. 
> 
> Agreed, that's why I said shm performance under swap is very important
> as well (I'm not underestimating it).

fine :-)

Greetings
		Christoph

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-27 16:56                                                 ` Christoph Rohland
@ 2000-09-27 17:42                                                   ` Andrea Arcangeli
  2000-09-27 18:25                                                     ` Erik Andersen
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-27 17:42 UTC (permalink / raw)
  To: Christoph Rohland
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Wed, Sep 27, 2000 at 06:56:42PM +0200, Christoph Rohland wrote:
> Yes, but how does the application detect that it should free the mem?

The trivial way is not to detect it: allow the user to select how much
memory to use as cache, keep it locked, and then don't care (he will
have to decrease the size of the shm by hand if he wants to drop some cache).
From the OS point of view it's like not having that RAM at all, and there will
be zero performance difference compared to thrashing into swap without such
memory. (On 2.2.x this is not true because of a complexity problem in shrink_mmap
that is solved with the real lru in 2.4.x.)

The other way is to have the shm cache shrink dynamically by looking at
/proc/meminfo and at the aging of its own cache. Again the user
should specify a minimum and a maximum of shm cache to keep locked in memory. Then
you look at "freemem + cache + buffers - active cache" and you can tell when
you're going to run into swap. Specifically with classzone you'll run into swap
when that value is near zero. So when that value is near zero you know it's
time to shrink the shm cache dynamically if it has a low age, otherwise the
machine will thrash into swap badly and performance will decrease. (You could
start shrinking when that value is below some number of Mbytes, again configurable
via a form.)

You should of course poll the /proc/meminfo. (/proc/meminfo works in O(1) in
2.4.x so it's just the overhead of a read syscall)
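
A userspace poller along these lines could be sketched as follows. This is a
minimal sketch, not anything from the thread: the helper names are
hypothetical, and mapping "active cache" to the `Active(file)` field is an
assumption based on the field names exposed by modern kernels (the 2.4-era
/proc/meminfo layout differed). The sample text stands in for the contents of
/proc/meminfo so the example does not depend on the machine it runs on.

```python
def parse_meminfo(text):
    """Return a dict of field name -> value in kB from /proc/meminfo text."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_headroom_kb(fields, active_cache_field="Active(file)"):
    """freemem + cache + buffers - active cache, in kB.
    The choice of active_cache_field is an assumption, see above."""
    return (fields["MemFree"] + fields["Cached"] + fields["Buffers"]
            - fields.get(active_cache_field, 0))


def should_shrink(fields, threshold_kb):
    """True when the headroom has dropped below the configured threshold."""
    return swap_headroom_kb(fields) < threshold_kb


# Made-up sample values for illustration.
SAMPLE = """MemTotal:        2048000 kB
MemFree:           20000 kB
Buffers:           10000 kB
Cached:           200000 kB
Active(file):     190000 kB
"""

fields = parse_meminfo(SAMPLE)
print(swap_headroom_kb(fields))        # 20000 + 200000 + 10000 - 190000
print(should_shrink(fields, 50000))
```

On a live system the text would come from reading /proc/meminfo on each poll,
and the threshold would be the user-configurable value mentioned above.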

These DBs using rawio really want to replace part of the kernel cache
functionality, so it's quite natural that they also don't want the kernel to
play with their caches while they run. They would need some more interaction
with the kernel memory balancing (possibly via async signals) to get their shm
reclaimed dynamically more cleanly and efficiently by registering for this
functionality (they could get signals when the machine runs into swap, and then
the DB chooses whether it is worth releasing some locked cache after looking at
/proc/meminfo and the working set of its own caches).

> Also you often have more overhead reading out of a database than
> having preprocessed data in swap.

Yes I see, it of course depends on the kind of cache (if it's very close to the
on-disk format then it more probably shouldn't be swapped out).

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-27 17:42                                                   ` Andrea Arcangeli
@ 2000-09-27 18:25                                                     ` Erik Andersen
  2000-09-27 18:55                                                       ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Erik Andersen @ 2000-09-27 18:25 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: MM mailing list, linux-kernel

On Wed Sep 27, 2000 at 07:42:00PM +0200, Andrea Arcangeli wrote:
> 
> You should of course poll the /proc/meminfo. (/proc/meminfo works in O(1) in
> 2.4.x so it's just the overhead of a read syscall)

Or sysinfo(2).  Same thing...

 -Erik

--
Erik B. Andersen   email:  andersee@debian.org
--This message was written using 73% post-consumer electrons--

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-27 18:25                                                     ` Erik Andersen
@ 2000-09-27 18:55                                                       ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-27 18:55 UTC (permalink / raw)
  To: MM mailing list, linux-kernel

On Wed, Sep 27, 2000 at 12:25:44PM -0600, Erik Andersen wrote:
> Or sysinfo(2).  Same thing...

sysinfo structure doesn't export the number of active pages in the system.
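
A minimal sketch of the /proc/meminfo route follows. It assumes the
"Key:  value kB" line format and a field named "Active", as found on current
kernels; the helper names meminfo_kb_from_text() and meminfo_kb() are
hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up "key:" at the start of a line in meminfo-style text.
 * Returns the numeric value that follows, or -1 if the key is absent. */
long meminfo_kb_from_text(const char *text, const char *key)
{
    const char *line = text;
    size_t klen;

    assert(text != NULL && key != NULL);
    klen = strlen(key);
    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long value;
            if (sscanf(line + klen + 1, "%ld", &value) == 1)
                return value;
            return -1;
        }
        line = strchr(line, '\n');   /* advance to the next line */
        if (line != NULL)
            line++;
    }
    return -1;
}

/* Read /proc/meminfo once and look up a single key. */
long meminfo_kb(const char *key)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/meminfo", "r");

    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return meminfo_kb_from_text(buf, key);
}
```

Calling meminfo_kb("Active") would then give the figure that sysinfo(2) does
not report.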

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-27 13:56                                               ` Andrea Arcangeli
  2000-09-27 16:56                                                 ` Christoph Rohland
@ 2000-09-28 10:08                                                 ` Rik van Riel
  2000-09-28 11:16                                                   ` Rik van Riel
                                                                     ` (2 more replies)
  1 sibling, 3 replies; 243+ messages in thread
From: Rik van Riel @ 2000-09-28 10:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Wed, 27 Sep 2000, Andrea Arcangeli wrote:
> On Wed, Sep 27, 2000 at 10:11:43AM +0200, Christoph Rohland wrote:
> > I just checked one oracle system and it did not lock the memory. And I
> 
> If that memory is used for I/O cache then such memory should be
> released when the system runs into swap instead of swapping it
> out too (otherwise it's not cache anymore and it could be slower
> than re-reading from disk the real data in rawio).

It could also be faster. If the database spent half an hour
gathering pieces of data from all over the database, it might
be faster to keep it in one place in swap so it can be read
in again in one swoop.  (I had an interesting talk about this
with a database person while at OLS)

But that's not the point. If your assertion is true, then the
database will probably be using an mlock()ed SHM region and
taking care of this itself. But this is not something the OS
should prescribe to the application.

If the OS finds that certain SHM pages are used far less than
the pages in the I/O cache, then those SHM pages should be
swapped out. The system's job is to keep the most used pages
of data in memory to minimise the amount of page faults
happening. Trying to outsmart the application shouldn't (IMHO
of course) be part of that job...

> > Customers with performance problems very often start with too little
> > memory, but they cannot upgrade until this really big job finishes :-(
> > 
> > Another issue about shm swapping is interactive transactions, where
> > some users have very large contexts and go for a coffee before
> > submitting. This memory can be swapped. 
> 
> Agreed, that's why I said shm performance under swap is very important
> as well (I'm not underestimating it).
> 
> But again: if the shm contains I/O cache it should be released
> and not swapped out.  Swapping out shmfs that contains I/O cache
> would be exactly like swapping out page-cache.

The OS has no business knowing what's inside that SHM page.
IF the shm contains I/O cache, maybe you're right. However,
until you know that this is the case, optimising for that
situation just doesn't make any sense.

(unless the SHM users tell you that this is the normal way
they use SHM ... but as Christoph just told us, it isn't)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-28 10:08                                                 ` Rik van Riel
@ 2000-09-28 11:16                                                   ` Rik van Riel
  2000-09-28 14:52                                                     ` Andrea Arcangeli
  2000-09-28 11:31                                                   ` Ingo Molnar
  2000-09-28 14:31                                                   ` Andrea Arcangeli
  2 siblings, 1 reply; 243+ messages in thread
From: Rik van Riel @ 2000-09-28 11:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Thu, 28 Sep 2000, Rik van Riel wrote:
> On Wed, 27 Sep 2000, Andrea Arcangeli wrote:

> > But again: if the shm contains I/O cache it should be released
> > and not swapped out.  Swapping out shmfs that contains I/O cache
> > would be exactly like swapping out page-cache.
> 
> The OS has no business knowing what's inside that SHM page.

Hmm, now that I've woken up, maybe I should formulate this in a
different way.

Andrea, I have the strong impression that your idea of
memory balancing is based on the idea that the OS should
out-smart the application instead of looking at the usage
pattern of the pages in memory.

This is fundamentally different from the idea that the OS
should make decisions based on the observed usage patterns
of the pages in question, instead of making presumptions
based on what kind of cache the page is in.

I've been away for 10 days and have been sitting on a bus
all last night so my judgement may be off. I'd certainly
like to hear I'm wrong ;)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-28 11:16                                                   ` Rik van Riel
@ 2000-09-28 14:52                                                     ` Andrea Arcangeli
  2000-09-29 14:39                                                       ` Rik van Riel
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-28 14:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Thu, Sep 28, 2000 at 08:16:32AM -0300, Rik van Riel wrote:
> Andrea, I have the strong impression that your idea of
> memory balancing is based on the idea that the OS should
> out-smart the application instead of looking at the usage
> pattern of the pages in memory.

Not sure what you mean with out-smart.

My only point is that the OS actually can only swap out such shm. If that
SHM is not supposed to be swapped out and if the OS I/O cache has more aging
than the shm cache, then the OS should tell the DBMS that it's time to shrink
some shm pages by freeing them.

> of the pages in question, instead of making presumptions
> based on what kind of cache the page is in.

For the mapped pages we never make presumptions. We always check the accessed
bit, and that's the most reliable way to know whether the page has been accessed
recently (it is set by CPU accesses through the pte, not only during page faults
or cache hits).  With the current design, pages mapped multiple times will be
overaged a bit, but this can't be fixed until we have a page->pte reverse
lookup...

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-28 14:52                                                     ` Andrea Arcangeli
@ 2000-09-29 14:39                                                       ` Rik van Riel
  2000-09-29 14:55                                                         ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Rik van Riel @ 2000-09-29 14:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Thu, 28 Sep 2000, Andrea Arcangeli wrote:
> On Thu, Sep 28, 2000 at 08:16:32AM -0300, Rik van Riel wrote:
> > Andrea, I have the strong impression that your idea of
> > memory balancing is based on the idea that the OS should
> > out-smart the application instead of looking at the usage
> > pattern of the pages in memory.
> 
> Not sure what you mean with out-smart.
> 
> My only point is that the OS actually can only swap out such shm.
> If that SHM is not supposed to be swapped out and if the OS I/O
> cache has more aging than the shm cache, then the OS should
> tell the DBMS that it's time to shrink some shm pages by freeing
> them.

OK, good to see that we agree on the fact that we
should age and swap out all pages equally aggressively.

> > of the pages in question, instead of making presumptions
> > based on what kind of cache the page is in.
> 
> For the mapped pages we never make presumptions. We always check
> the accessed bit, and that's the most reliable way to know whether
> the page has been accessed recently (it is set by CPU accesses
> through the pte, not only during page faults or cache hits).
> With the current design, pages mapped multiple times will be
> overaged a bit, but this can't be fixed until we have a page->pte
> reverse lookup...

Indeed.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-29 14:39                                                       ` Rik van Riel
@ 2000-09-29 14:55                                                         ` Andrea Arcangeli
  2000-09-29 15:40                                                           ` Rik van Riel
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-29 14:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Fri, Sep 29, 2000 at 11:39:18AM -0300, Rik van Riel wrote:
> OK, good to see that we agree on the fact that we
> should age and swap out all pages equally aggressively.

Actually I think we should start looking at the mapped stuff _only_ when the
I/O cache aging is relevant. If the I/O cache aging isn't relevant there's no
point in looking at the mapped stuff since there's cache pollution going on. It's
much less costly to drop a page from the unmapped cache than to play with
pagetables, and a slow read() is much better than having to fault
into the .text areas (because the process is going to be designed in a way that
expects read to block, so it may do it asynchronously or in a separate thread or
whatever). A `cp /dev/zero .` shouldn't swap out/unmap anything.

If the cache is re-used (so if it's useful) that's a completely different issue,
and in that case unmapping potentially unused stuff is the right thing to do, of
course.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-29 14:55                                                         ` Andrea Arcangeli
@ 2000-09-29 15:40                                                           ` Rik van Riel
  0 siblings, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2000-09-29 15:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Fri, 29 Sep 2000, Andrea Arcangeli wrote:
> On Fri, Sep 29, 2000 at 11:39:18AM -0300, Rik van Riel wrote:
> > OK, good to see that we agree on the fact that we
> > should age and swap out all pages equally aggressively.
> 
> Actually I think we should start looking at the mapped stuff
> _only_ when the I/O cache aging is relevant. If the I/O cache
> aging isn't relevant there's no point in looking at the mapped
> stuff since there's cache pollution going on.

> If the cache is re-used (so if it's useful) that's a completely
> different issue and in that case unmapping potentially unused
> stuff is the right thing to do of course.

This is why I want to do:

1) equal aging of all pages in the system
2) page aging to have properties of both LRU and LFU
3) drop-behind to cope with streaming IO in a good way

and maybe:
4) move unmapped pages to the inactive_clean list for
   immediate reclaiming but put pages which are/were
   mapped on the inactive_dirty list so we keep it a
   little bit longer


The only way to reliably know if the cache is re-used a
lot is by making sure we do the page aging for unmapped
and mapped pages the same. If we don't do that, we won't
be able to make a sensible comparison between the activity
of pages in different places.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-28 10:08                                                 ` Rik van Riel
  2000-09-28 11:16                                                   ` Rik van Riel
@ 2000-09-28 11:31                                                   ` Ingo Molnar
  2000-09-28 14:54                                                     ` Andrea Arcangeli
  2000-09-28 14:31                                                   ` Andrea Arcangeli
  2 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-28 11:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Christoph Rohland, Stephen C. Tweedie,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Thu, 28 Sep 2000, Rik van Riel wrote:

> The OS has no business knowing what's inside that SHM page.

exactly.

> IF the shm contains I/O cache, maybe you're right. However,
> until you know that this is the case, optimising for that
> situation just doesn't make any sense.

if the shm contains raw I/O data, then that's flawed application design -
an mmap()-ed file should be used instead. Shm is equivalent to shared
anonymous pages.

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-28 11:31                                                   ` Ingo Molnar
@ 2000-09-28 14:54                                                     ` Andrea Arcangeli
  2000-09-28 15:13                                                       ` Ingo Molnar
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-28 14:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Christoph Rohland, Stephen C. Tweedie,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Thu, Sep 28, 2000 at 01:31:40PM +0200, Ingo Molnar wrote:
> if the shm contains raw I/O data, then that's flawed application design -
> an mmap()-ed file should be used instead. Shm is equivalent to shared

The DBMS uses shared SCSI disks across multiple hosts on the same SCSI bus
and synchronizes the distributed cache via TCP. Tell me how to do that
with the OS cache and mmap.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-28 14:54                                                     ` Andrea Arcangeli
@ 2000-09-28 15:13                                                       ` Ingo Molnar
  2000-09-28 15:23                                                         ` Andrea Arcangeli
  2000-09-28 16:16                                                         ` Juan J. Quintela
  0 siblings, 2 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-28 15:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Christoph Rohland, Stephen C. Tweedie,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Thu, 28 Sep 2000, Andrea Arcangeli wrote:

> The DBMS uses shared SCSI disks across multiple hosts on the same SCSI
> bus and synchronizes the distributed cache via TCP. Tell me how to do
> that with the OS cache and mmap.

this could be supported by:

1) mlock()-ing the whole mapping.

2) introducing sys_flush(), which flushes pages from the pagecache.

3) doing sys_msync() after dirtying a range and before sending a TCP
   event.

Whenever the DB-cache-flush-event comes over TCP, it calls sys_flush() for
that given virtual address range or file address space range. Sys_flush
flushes the page from the pagecache and unmaps the address. Whenever it's
needed again by the application it will be faulted in and read from disk.

Can anyone see any problems with the concept of this approach? This can be
used for a page-granularity distributed IO cache.

(there are some smaller problems with this approach, like mlock() on a big
range can only be done by privileged users, but that's not an issue IMO.)
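
A user-space approximation of steps 2) and 3) is sketched below. sys_flush()
is only a proposal in this thread and does not exist, so flush_range() stands
in for it with madvise(MADV_DONTNEED) plus posix_fadvise(POSIX_FADV_DONTNEED),
which is an assumption about how one might imitate it and gives weaker
guarantees than the proposal asks for. The function names are hypothetical.
Because a locked range rejects MADV_DONTNEED, the stand-in unlocks the range
first, which is where it departs from step 1).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>     /* for callers: mkstemp() */
#include <string.h>     /* for callers: memcpy(), memcmp() */
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>     /* for callers: ftruncate(), unlink() */

/* Map len bytes of fd as a shared, writable mapping. NULL on failure. */
void *map_shared(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Step 3: write a dirtied range back before sending the TCP event. */
int publish_range(void *addr, size_t len)
{
    return msync(addr, len, MS_SYNC);
}

/* Stand-in for the proposed sys_flush(): drop the mapping's pages and
 * ask the kernel to drop the cached file pages, so the next access
 * faults the data back in. Returns 0 on success, -1 on failure. */
int flush_range(int fd, void *addr, off_t off, size_t len)
{
    assert(addr != NULL);
    (void)munlock(addr, len);   /* harmless if the range was not locked */
    if (madvise(addr, len, MADV_DONTNEED) != 0)
        return -1;
    /* posix_fadvise() returns an error number rather than setting errno. */
    return posix_fadvise(fd, off, (off_t)len, POSIX_FADV_DONTNEED) == 0 ? 0 : -1;
}
```

A writer would call publish_range() after dirtying a range and before
notifying its peers; a peer that receives the notification would call
flush_range() on the same range and re-read on its next access.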

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-28 15:13                                                       ` Ingo Molnar
@ 2000-09-28 15:23                                                         ` Andrea Arcangeli
  2000-09-28 16:16                                                         ` Juan J. Quintela
  1 sibling, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-28 15:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Christoph Rohland, Stephen C. Tweedie,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Thu, Sep 28, 2000 at 05:13:59PM +0200, Ingo Molnar wrote:
> Can anyone see any problems with the concept of this approach? This can be

It works only on top of a filesystem while all the checkpointing clever stuff
is done internally by the DB (in fact it _needs_ O_SYNC when it works on the
fs).

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-28 15:13                                                       ` Ingo Molnar
  2000-09-28 15:23                                                         ` Andrea Arcangeli
@ 2000-09-28 16:16                                                         ` Juan J. Quintela
  1 sibling, 0 replies; 243+ messages in thread
From: Juan J. Quintela @ 2000-09-28 16:16 UTC (permalink / raw)
  To: mingo
  Cc: Andrea Arcangeli, Rik van Riel, Christoph Rohland,
	Stephen C. Tweedie, Linus Torvalds, Roger Larsson,
	MM mailing list, linux-kernel

>>>>> "ingo" == Ingo Molnar <mingo@elte.hu> writes:

Hi

ingo> 2) introducing sys_flush(), which flushes pages from the pagecache.

Isn't mincore supposed to be able to do that (yes, right now it is not
implemented, but the interface is there to do that)?

Just curious.

-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-28 10:08                                                 ` Rik van Riel
  2000-09-28 11:16                                                   ` Rik van Riel
  2000-09-28 11:31                                                   ` Ingo Molnar
@ 2000-09-28 14:31                                                   ` Andrea Arcangeli
  2 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-28 14:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Christoph Rohland, Stephen C. Tweedie, Ingo Molnar,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Thu, Sep 28, 2000 at 07:08:51AM -0300, Rik van Riel wrote:
> taking care of this itself. But this is not something the OS
> should prescribe to the application.

Agreed.

> (unless the SHM users tell you that this is the normal way
> they use SHM ... but as Christoph just told us, it isn't)

shm is not used as I/O cache by 90% of the apps out there because normal apps
use the OS cache functionality (90% of those apps don't use rawio to share a
black box that looks like a scsi disk via SCSI bus connected to other hosts as
well).

I for sure agree shm swapin/swapout is very important. (we moved shm
swapout/swapin to swap cache with readaround for that reason)

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-24 23:41                         ` Andrea Arcangeli
  2000-09-25 16:24                           ` Stephen C. Tweedie
@ 2000-09-25 17:21                           ` bert hubert
  2000-09-25 17:49                             ` Andrea Arcangeli
  1 sibling, 1 reply; 243+ messages in thread
From: bert hubert @ 2000-09-25 17:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

> We're talking about shrink_[id]cache_memory change. That have _nothing_ to do
> with the VM changes that happened anywhere between test8 and test9-pre6.
> 
> You were talking about a different thing.

Ok, sorry. Kernel development is proceeding at a furious pace and I sometimes
lose track. 

> I consider the current approach the wrong way to go and for this reason I
> prefer to spend time porting/improving classzone.

I seem to remember that people were impressed by classzone, but that the
implementation was very non-trivial and hard to grok. One of the reasons
Rik's vm made it (so far) is that it is pretty straightforward, with all the
marks of the right amount of simplicity. 

> In the meantime if you want to go back to 2.4.0-test1-ac22-class++ to give
> it a try under swap to see the difference in the behaviour and compare
> (Mike said it's still an order of magnitude faster with his "make -j30
> bzImage" testcase and he's always very reliable in his reports).

There is no such thing as 'under swap'. There are lots of loadpatterns that
will generate different kinds of memory pressure. Just calling it 'under
swap' gives entirely the wrong impression. 

Although Mike's compile is a relevant benchmark, every VM has cases for
which it excels, and cases for which it sucks. This appears to be a general
property of VM design. 

Given knowledge of the algorithms used, you can always dream up a situation
where it will fail. It's a bit like writing the halting problem algorithm.
Same goes the other way around, every VM will have a 'shining benchmark' -
hence the invention of benchmarketing.

We used to have a bad virtual memory implementation that was sometimes well
tuned so that lots of ordinary cases showed acceptable performance. We now have
an elegant VM that works reasonably well, but needs more tweaking.

What is the point of all this ranting? Think twice before embarking on
'rivaling virtual memory' code. Energies spent on Rik's VM will yield far
higher differential improvement. 

Regards,

bert hubert

-- 
PowerDNS                     Versatile DNS Services  
Trilab                       The Technology People   
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 17:21                           ` bert hubert
@ 2000-09-25 17:49                             ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 17:49 UTC (permalink / raw)
  To: Stephen C. Tweedie, Ingo Molnar, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 07:21:48PM +0200, bert hubert wrote:
> Ok, sorry. Kernel development is proceeding at a furious pace and I sometimes
> lose track. 

No problem :).

> I seem to remember that people were impressed by classzone, but that the
> implementation was very non-trivial and hard to grok. One of the reasons

Yes. Classzone is certainly more complex.

> There is no such thing as 'under swap'. There are lots of loadpatterns that
> will generate different kinds of memory pressure. Just calling it 'under
> swap' gives entirely the wrong impression. 

Sorry for not being precise. I meant one of those load patterns.

> 'rivaling virtual memory' code. Energies spent on Rik's VM will yield far
> higher differential improvement. 

I've spent effort on classzone as well, and since I think it's a far superior
approach I'll at least port it on top of 2.4.0-test9 as soon as time
permits to generate some numbers.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-24 22:36                       ` [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks bert hubert
  2000-09-24 23:41                         ` Andrea Arcangeli
@ 2000-09-25 15:09                         ` Miles Lane
  2000-09-25 15:51                         ` Stephen C. Tweedie
  2 siblings, 0 replies; 243+ messages in thread
From: Miles Lane @ 2000-09-25 15:09 UTC (permalink / raw)
  To: bert hubert
  Cc: Andrea Arcangeli, Stephen C. Tweedie, Ingo Molnar,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

bert hubert wrote:

> On Mon, Sep 25, 2000 at 12:13:42AM +0200, Andrea Arcangeli wrote:
> 
>> On Sun, Sep 24, 2000 at 10:43:03PM +0100, Stephen C. Tweedie wrote:
>> 
>>> any form of serialisation on the quota file).  This feels like rather
>>> a lot of new and interesting deadlocks to be introducing so late in
>>> 2.4.  :-)
>> 
> True. But they also appear to be found and solved at an impressive rate.
> These deadlocks are fatal and don't hide in corners, whereas the previous mm
> problems used to be very hard to spot and fix, there not being real
> showstoppers, except for abysmal performance. [1]
> 
> Since Rik's stuff was merged, the number of eyeball hours devoted to MM have
> skyrocketed, whereas the previous incarnations had far smaller audiences.
> The patches are barely a week in, and look how much has been improved that
> hadn't been found by the people working with Rik.
> 
> It's tempting to revert the merge, but let's work at it a bit longer. There
> are problems, but we are solving them rapidly and both performance and
> design of the new MM are pretty pleasing.
> 
> Let's not waste this opportunity.

I agree.  I have seen really fabulous system response since Rik's changes
were merged in.  I have managed to crash my machine a couple of times (I am
working on getting a serial debugging connection set up, since I don't see
any OOPS messages), but I think this is not terribly surprising.  My
impression is that system responsiveness is much improved.  Let's hang in
there a bit longer.  We are making rapid progress on testing and fixing.

       Miles              


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-24 22:36                       ` [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks bert hubert
  2000-09-24 23:41                         ` Andrea Arcangeli
  2000-09-25 15:09                         ` Miles Lane
@ 2000-09-25 15:51                         ` Stephen C. Tweedie
  2000-09-25 16:05                           ` Ingo Molnar
  2 siblings, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 15:51 UTC (permalink / raw)
  To: Andrea Arcangeli, Stephen C. Tweedie, Ingo Molnar,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

Hi,

On Mon, Sep 25, 2000 at 12:36:50AM +0200, bert hubert wrote:
> On Mon, Sep 25, 2000 at 12:13:42AM +0200, Andrea Arcangeli wrote:
> > On Sun, Sep 24, 2000 at 10:43:03PM +0100, Stephen C. Tweedie wrote:
> > > any form of serialisation on the quota file).  This feels like rather
> > > a lot of new and interesting deadlocks to be introducing so late in
> > > 2.4.  :-)
> 
> True. But they also appear to be found and solved at an impressive rate.
> These deadlocks are fatal and don't hide in corners, whereas the previous mm
> problems used to be very hard to spot and fix, there not being real
> showstoppers, except for abysmal performance. [1]

Sorry, but in this case you have got a lot more variables than you
seem to think.  The obvious lock is the ext2 superblock lock, but
there are side cases with quota and O_SYNC which are much less
commonly triggered.  That's not even starting to consider the other
dozens of filesystems in the kernel which have to be audited if we
change the locking requirements for GFP calls.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 15:51                         ` Stephen C. Tweedie
@ 2000-09-25 16:05                           ` Ingo Molnar
  2000-09-25 16:06                             ` Alexander Viro
  0 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 16:05 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Stephen C. Tweedie wrote:

> Sorry, but in this case you have got a lot more variables than you
> seem to think.  The obvious lock is the ext2 superblock lock, but
> there are side cases with quota and O_SYNC which are much less
> commonly triggered.  That's not even starting to consider the other
> dozens of filesystems in the kernel which have to be audited if we
> change the locking requirements for GFP calls.

i'd suggest to simply BUG() in schedule() if the superblock lock is held
not directly by lock_super. Holding the superblock lock is IMO quite rude
anyway (for performance and latency) - is there any place where we hold it
for a long time and it's unavoidable?

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 16:05                           ` Ingo Molnar
@ 2000-09-25 16:06                             ` Alexander Viro
  2000-09-25 16:20                               ` Ingo Molnar
  0 siblings, 1 reply; 243+ messages in thread
From: Alexander Viro @ 2000-09-25 16:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Stephen C. Tweedie, Andrea Arcangeli, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel


On Mon, 25 Sep 2000, Ingo Molnar wrote:

> 
> On Mon, 25 Sep 2000, Stephen C. Tweedie wrote:
> 
> > Sorry, but in this case you have got a lot more variables than you
> > seem to think.  The obvious lock is the ext2 superblock lock, but
> > there are side cases with quota and O_SYNC which are much less
> > commonly triggered.  That's not even starting to consider the other
> > dozens of filesystems in the kernel which have to be audited if we
> > change the locking requirements for GFP calls.
> 
> i'd suggest to simply BUG() in schedule() if the superblock lock is held
> not directly by lock_super. Holding the superblock lock is IMO quite rude
> anyway (for performance and latency) - is there any place where we hold it
> for a long time and it's unavoidable?

Ingo, schedule() has no bloody business _knowing_ about superblock locks
in the first place. Yes, ext2 should not bother taking it at all. For
completely unrelated reasons.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 16:06                             ` Alexander Viro
@ 2000-09-25 16:20                               ` Ingo Molnar
  2000-09-25 16:29                                 ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 16:20 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Stephen C. Tweedie, Andrea Arcangeli, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Alexander Viro wrote:

> > i'd suggest to simply BUG() in schedule() if the superblock lock is held
> > not directly by lock_super. Holding the superblock lock is IMO quite rude
> > anyway (for performance and latency) - is there any place where we hold it
> > for a long time and it's unavoidable?
> 
> Ingo, schedule() has no bloody business _knowing_ about superblock
> locks in the first place. Yes, ext2 should not bother taking it at
> all. For completely unrelated reasons.

i only suggested this as a debugging helper, instead of the suggested
ext2_getblk() BUG() helper. Obviously schedule() has no business knowing
about filesystem locks.
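
A rough user-space sketch of the debugging-helper idea, for illustration only. It assumes a hypothetical per-task counter of held superblock locks; `task_debug`, `debug_lock_super()`, `debug_unlock_super()` and `debug_may_sleep()` are made-up names, not kernel interfaces, and the special case of sleeping inside lock_super itself is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task debug state; not a real kernel structure. */
struct task_debug {
    int super_locks_held;   /* superblock locks this task currently holds */
};

/* Hypothetical wrapper: record that a superblock lock was taken. */
static void debug_lock_super(struct task_debug *t)
{
    t->super_locks_held++;
}

/* Hypothetical wrapper: record that a superblock lock was released. */
static void debug_unlock_super(struct task_debug *t)
{
    assert(t->super_locks_held > 0);
    t->super_locks_held--;
}

/*
 * Would be called at a point where the task is about to sleep.
 * Returns true when sleeping is fine, false when a superblock lock is
 * still held, which is where a debugging build could BUG().
 */
static bool debug_may_sleep(const struct task_debug *t)
{
    return t->super_locks_held == 0;
}
```

Keeping the counter in the lock wrappers rather than in the scheduler is one way to get the same check without schedule() knowing anything about filesystem locks.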

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks
  2000-09-25 16:20                               ` Ingo Molnar
@ 2000-09-25 16:29                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 16:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexander Viro, Stephen C. Tweedie, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 06:20:40PM +0200, Ingo Molnar wrote:
> i only suggested this as a debugging helper, instead of the suggested

I don't think removing the super lock from all filesystems is a good thing at
this stage (I agree with SCT that doing it only for ext2 [that's what we mostly
care about] would be possible). Who cares if UFS grabs the super lock or not?

grep lock_super fs/ext2/*.c is enough and we don't need debugging in the
scheduler for that.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-24 21:12                 ` Ingo Molnar
  2000-09-24 21:43                   ` Stephen C. Tweedie
@ 2000-09-25  4:56                   ` Linus Torvalds
  2000-09-25  5:19                     ` Alexander Viro
  1 sibling, 1 reply; 243+ messages in thread
From: Linus Torvalds @ 2000-09-25  4:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Rik van Riel, Roger Larsson, Alexander Viro,
	MM mailing list, linux-kernel

Hmm..

 Thinking some more about this issue, I actually suspect that there's a
better solution.

The fact is that GFP_BUFFER is only used for the old-fashioned buffer
block allocations, and anything that uses the page cache automatically
avoids the whole issue. As such, from a VM balancing standpoint we would
fix the problem equally well by just avoiding using old-fashioned buffer
blocks..

Now, I don't believe that the indirect blocks etc of the meta-data is much
of an issue - whenever we need to access indirect blocks we're certainly
already doing the page cache thing, so the page cache VM pressure should
be quite sufficient to keep the VM balanced - regular file access is very
much biased towards the page cache, and the meta-data buffer-cache
accesses are likely to be a very very small part of the big picture.

The remaining part is the directory handling. THAT is very buffer-cache
intensive, as the directory handling hasn't been moved over to the page
cache at all for ext2. Doing a large "find" (or even just a "ls -l") will
basically do purely buffer cache accesses, first for the directory data
and then for the inode data. With no page cache activity to balance things
out at all - leading to a potentially quite unbalanced VM that never
really had a good chance to get rid of dentries etc.

However, Al Viro already basically has the "directories using the page
cache" code pretty much done, so for 2.5.x we'll just do that, and I bet
that the VM balancing will improve (as well as performance going up simply
just because the page cache is more efficient anyway). With the directory
information in the page cache, there simply aren't any regular operations
that depend entirely on the buffer cache any more.

Sure, there will still be the inode and indirect blocks, but there just
aren't loads that I know of that can put as much pressure on those as on
the page cache..

So the proper approach may be to just ignore the current issue with
__GFP_IO being a big deal under some loads, because it probably will go
away on its own (the superblock lock contention is still an issue, of
course, but while somewhat related it's still fairly orthogonal). 

Al, if you'd port over the "namei in page-cache" stuff from UFS to ext2, I
bet that there would be people interested in seeing whether the above
theory is just another of Linus's whimsies, or whether it really does make
a difference.. It may not be 2.4.x material, but it won't hurt to have it
tested some more anyway. Comments?

		Linus


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  4:56                   ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds
@ 2000-09-25  5:19                     ` Alexander Viro
  2000-09-25  6:06                       ` Linus Torvalds
                                         ` (2 more replies)
  0 siblings, 3 replies; 243+ messages in thread
From: Alexander Viro @ 2000-09-25  5:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andrea Arcangeli, Rik van Riel, Roger Larsson,
	Alexander Viro, MM mailing list, linux-kernel


On Sun, 24 Sep 2000, Linus Torvalds wrote:

> The remaining part is the directory handling. THAT is very buffer-cache
> intensive, as the directory handling hasn't been moved over to the page
> cache at all for ext2. Doing a large "find" (or even just a "ls -l") will
> basically do purely buffer cache accesses, first for the directory data
> and then for the inode data. With no page cache activity to balance things
> out at all - leading to a potentially quite unbalanced VM that never
> really had a good chance to get rid of dentries etc.

You forgot inode tables themselves.

> Al, if you'd port over the "namei in page-cache" stuff from UFS to ext2, I
> bet that there would be people interested in seeing whether the above
> theory is just another of Linus's whimsies, or whether it really does make
> a difference.. It may not be 2.4.x material, but it won't hurt to have it
> tested some more anyway. Comments?

I'll do it and post the result tomorrow. I bet that there will be issues
I've overlooked (stuff that happens to work on UFS, but needs to be more
general for ext2), so it's going as "very alpha", but hey, it's pretty
straightforward, so there is a chance to debug it fast. Yes, famous last
words and all such...

BTW, we _will_ need it on UFS side in 2.4 anyway. Rationale:
	* UFS _does_ fragments, whether we like it or not.
	* Reallocating fragments for regular files can not be done by
bread()+getblk()+memcpy()+mark_buffer_dirty() - data is in pagecache, so
that's an instant death
	* to get UFS working with pagecache and not eating filesystems we
must do fragment reallocation through pagecache
	* it means that we either duplicate the whole mess both for buffer
cache (directories) and pagecache (inodes) or move directories to
pagecache
	The former (pagecache duplicate of the reallocation code) is
nasty since we have to separate the current realloc stuff from the code
paths where it sits right now anyway - it's merged into the functions
used by pagecache side. I.e. we would have to
	* do pagecache fragment handling
	* rip the buffer-cache fragment handling out
	* redo it, so that it would live outside of the path used by
pagecache side
	* change the callers.
The last couple means more work than switching directories to pagecache. 

	So some variant of directories in pagecache is needed for 2.4, the
question being whether it's UFS-only or we use its port on ext2... BTW,
minixfs/sysvfs can also use the thing, but that's another story.

	Off to port the bloody thing...


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  5:19                     ` Alexander Viro
@ 2000-09-25  6:06                       ` Linus Torvalds
  2000-09-25  6:17                         ` Alexander Viro
  2000-09-25 21:21                         ` Alexander Viro
  2000-09-26 13:42                       ` [CFT][PATCH] ext2 directories in pagecache Alexander Viro
  2000-09-26 21:29                       ` Alexander Viro
  2 siblings, 2 replies; 243+ messages in thread
From: Linus Torvalds @ 2000-09-25  6:06 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Ingo Molnar, Andrea Arcangeli, Rik van Riel, Roger Larsson,
	Alexander Viro, MM mailing list, linux-kernel


On Mon, 25 Sep 2000, Alexander Viro wrote:
> 
> 
> On Sun, 24 Sep 2000, Linus Torvalds wrote:
> 
> > The remaining part is the directory handling. THAT is very buffer-cache
> > intensive, as the directory handling hasn't been moved over to the page
> > cache at all for ext2. Doing a large "find" (or even just a "ls -l") will
> > basically do purely buffer cache accesses, first for the directory data
> > and then for the inode data. With no page cache activity to balance things
> > out at all - leading to a potentially quite unbalanced VM that never
> > really had a good chance to get rid of dentries etc.
> 
> You forgot inode tables themselves.

I didn't. That's the "then for the inode data" part.

I'm not claiming that the buffer cache accesses would go away - I'm just
saying that the unbalanced "only buffer cache" case should go away,
because things like "find" and friends will still cause mostly page cache
activity.

(Considering the size of the inode on ext2, I don't know how true this is,
I have to admit. It might still be quite biased towards the buffer cache,
and as such the additional page cache pressure might not be enough to
really cause any major shift in balancing).

> I'll do it and post the result tomorrow. I bet that there will be issues
> I've overlooked (stuff that happens to work on UFS, but needs to be more
> general for ext2), so it's going as "very alpha", but hey, it's pretty
> straightforward, so there is a chance to debug it fast. Yes, famous last
> words and all such...

Sure.

> BTW, we _will_ need it on UFS side in 2.4 anyway. Rationale:

[ reasons removed ]

I have no problem with that. Especially as I suspect the people who use
UFS are more likely to be the technical kind of user who is more inclined
to be able to debug whatever potential problems crop up anyway. Your point
about not duplicating the fragment handling is certainly quite convincing
for the case of UFS.

> 	So some variant of directories in pagecache is needed for 2.4, the
> question being whether it's UFS-only or we use its port on ext2... BTW,
> minixfs/sysvfs can also use the thing, but that's another story.

Let's plan on UFS-only, for all the prudent reasons.

		Linus


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  6:06                       ` Linus Torvalds
@ 2000-09-25  6:17                         ` Alexander Viro
  2000-09-25 21:21                         ` Alexander Viro
  1 sibling, 0 replies; 243+ messages in thread
From: Alexander Viro @ 2000-09-25  6:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, Ingo Molnar, Andrea Arcangeli, Rik van Riel,
	Roger Larsson, Alexander Viro, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Linus Torvalds wrote:

> I'm not claiming that the buffer cache accesses would go away - I'm just
> saying that the unbalanced "only buffer cache" case should go away,
> because things like "find" and friends will still cause mostly page cache
> activity.
> 
> (Considering the size of the inode on ext2, I don't know how true this is,
> I have to admit. It might still be quite biased towards the buffer cache,
> and as such the additional page cache pressure might not be enough to
> really cause any major shift in balancing).

Hrrrmmm... You know, since we don't have to associate struct inode with every
address space and inode table _is_ a linear array, after all... We
might put it into pagecache too. Very few places access the on-disk
inode, so it's not too horrible. All we need is readpage() and that's
very easy, considering the fact that allocation is static. prepare_write()
and commit_write() may be NULL for all I care and writepage() will
be easy too - no holes, no allocation, no nothing. Looks like we need to deal
with ext2_update_inode(), ext2_read_inode() and that's it. Even less
intrusive than directory stuff...
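
To illustrate why a static, linear inode table makes readpage() simple, here is a minimal sketch of the arithmetic that maps an inode number to its slot in the table. The struct and function names are hypothetical, and the parameters stand in for values that ext2 reads from the superblock; only the arithmetic (inode numbers starting at 1, fixed-size slots) is taken from the on-disk layout.

```c
#include <assert.h>
#include <stdint.h>

/* Where one on-disk inode sits inside its block group's inode table. */
struct inode_location {
    uint32_t group;           /* block group that owns the inode */
    uint32_t block_in_table;  /* block index from the start of that table */
    uint32_t offset;          /* byte offset inside that block */
};

/* Hypothetical helper: pure arithmetic, no allocation and no holes. */
static struct inode_location
locate_inode(uint32_t ino, uint32_t inodes_per_group,
             uint32_t inode_size, uint32_t block_size)
{
    struct inode_location loc;
    uint32_t index;

    /* ext2 inode numbers start at 1 */
    assert(ino >= 1 && inodes_per_group > 0 && block_size >= inode_size);

    loc.group = (ino - 1) / inodes_per_group;
    index = (ino - 1) % inodes_per_group;   /* slot within the group */
    loc.block_in_table = (index * inode_size) / block_size;
    loc.offset = (index * inode_size) % block_size;
    return loc;
}
```

Because every slot can be computed this way, a page of the table can be filled without any block allocation decisions, which is the property the paragraph above relies on.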

Comments?


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  6:06                       ` Linus Torvalds
  2000-09-25  6:17                         ` Alexander Viro
@ 2000-09-25 21:21                         ` Alexander Viro
  1 sibling, 0 replies; 243+ messages in thread
From: Alexander Viro @ 2000-09-25 21:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Y. Ts'o, Ingo Molnar, Andrea Arcangeli,
	Rik van Riel, Alexander Viro, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Linus Torvalds wrote:

[directories in pagecache on ext2]

> > I'll do it and post the result tomorrow. I bet that there will be issues
> > I've overlooked (stuff that happens to work on UFS, but needs to be more
> > general for ext2), so it's going as "very alpha", but hey, it's pretty
> > straightforward, so there is a chance to debug it fast. Yes, famous last
> > words and all such...
> 
> Sure.

All right, I think I've got something that may work. Yes, there were issues -
UFS has a constant directory chunk size (1 sector), while ext2 makes it
equal to the fs block size. _Bad_ idea, since sector writes are atomic and
block ones are not... Oh, well, so ext2 is slightly less robust. It required
some changes, I'll do the initial testing and post the patch once it passes
the trivial tests.

BTW, why on earth had we done it that way? It has no noticeable effect
on directory fragmentation, it makes code (both in page- and buffer-cache
variants) more complex, it's less robust (by definition - directory layout
may be broken easier)... What was the point?

Not that we could do something about that now (albeit as a ro-compat feature
it would be nice), but I'm curious about the reasons...
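
A small sketch of what "directory chunk size" means for entry validation, under the assumption that a record must never cross a chunk boundary. The function names are hypothetical; the 8-byte fixed header and the 4-byte rounding follow the ext2 directory entry layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed part of an ext2 directory entry: inode, rec_len, name_len, type. */
#define DIRENT_HEADER 8u

/* Smallest legal record length for a name, rounded up to 4 bytes. */
static uint32_t min_rec_len(uint32_t name_len)
{
    return (DIRENT_HEADER + name_len + 3u) & ~3u;
}

/*
 * Hypothetical check: is the record at byte 'offset' well formed and
 * fully contained in one chunk? chunk_size would be 512 for a
 * sector-sized chunk and the fs block size for ext2.
 */
static bool entry_fits_chunk(size_t offset, uint32_t rec_len,
                             uint32_t name_len, uint32_t chunk_size)
{
    assert(chunk_size > 0);
    if (rec_len < min_rec_len(name_len) || (rec_len & 3u) != 0)
        return false;
    /* first and last byte of the record must share a chunk */
    return offset / chunk_size == (offset + rec_len - 1) / chunk_size;
}
```

With a 512-byte chunk a 12-byte record at offset 504 is rejected, while with a 4096-byte chunk the same record is accepted; the larger chunk is what makes a torn block write able to damage more of the directory layout.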
							Cheers,
								Al


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [CFT][PATCH] ext2 directories in pagecache
  2000-09-25  5:19                     ` Alexander Viro
  2000-09-25  6:06                       ` Linus Torvalds
@ 2000-09-26 13:42                       ` Alexander Viro
  2000-09-26 21:29                       ` Alexander Viro
  2 siblings, 0 replies; 243+ messages in thread
From: Alexander Viro @ 2000-09-26 13:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andrea Arcangeli, Rik van Riel, Roger Larsson,
	Alexander Viro, MM mailing list, linux-kernel, linux-fsdevel

Help in testing is welcome, just keep in mind that it's ext2 we are
talking about. IOW, proceed with care and don't let it loose on the data
you can't easily restore.
	Patch moves the directory data into the pagecache. I hope that
it's sufficiently straightforward to be readable.
	Linus, if you prefer to get it in the mail - tell me and I'll send it
(50K unpacked due to ext2/{dir,namei}.c modifications, so it's too large
for the lists).
							Cheers,
								Al



^ permalink raw reply	[flat|nested] 243+ messages in thread

* [CFT][PATCH] ext2 directories in pagecache
  2000-09-25  5:19                     ` Alexander Viro
  2000-09-25  6:06                       ` Linus Torvalds
  2000-09-26 13:42                       ` [CFT][PATCH] ext2 directories in pagecache Alexander Viro
@ 2000-09-26 21:29                       ` Alexander Viro
  2000-09-26 22:16                         ` Marko Kreen
  2000-09-26 23:19                         ` Andreas Dilger
  2 siblings, 2 replies; 243+ messages in thread
From: Alexander Viro @ 2000-09-26 21:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andrea Arcangeli, Rik van Riel, Roger Larsson,
	Alexander Viro, MM mailing list, linux-kernel, linux-fsdevel

be really working (survives assorted builds, does the right thing on
find-based scripts and obvious local tests, yodda, yodda). It certainly
needs more testing, but I would call it (early) beta.

	Folks, give it a try - just keep decent backups. Similar code will
have to go into UFS in 2.4 and that (ext2) variant may be of interest for
2.4.<late>/2.5.<early> timeframe.

	I'm putting it on ftp.math.psu.edu/pub/viro/ext2-patch-7.gz.
Comments and help in testing are more than welcome.
							Cheers,
								Al



^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [CFT][PATCH] ext2 directories in pagecache
  2000-09-26 21:29                       ` Alexander Viro
@ 2000-09-26 22:16                         ` Marko Kreen
  2000-09-26 22:31                           ` Alexander Viro
  2000-09-26 23:19                         ` Andreas Dilger
  1 sibling, 1 reply; 243+ messages in thread
From: Marko Kreen @ 2000-09-26 22:16 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, Ingo Molnar, Andrea Arcangeli, Rik van Riel,
	Roger Larsson, Alexander Viro, MM mailing list, linux-kernel,
	linux-fsdevel

On Tue, Sep 26, 2000 at 05:29:27PM -0400, Alexander Viro wrote:
> Comments and help in testing are more than welcome.

There is something fishy in ext2_empty_dir:

+                               /* check for . and .. */
+                               if (de->name[0] != '.')
+                                       goto not_empty;
+                               if (!de->name[1]) {
+                                       if (de->inode !=
+                                           le32_to_cpu(inode->i_ino))
+                                               goto not_empty;
+                               } else if (de->name[2])
+                                       goto not_empty;
+                               else if (de->name[1] != '.')
+                                       goto not_empty;


-- 
marko


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [CFT][PATCH] ext2 directories in pagecache
  2000-09-26 22:16                         ` Marko Kreen
@ 2000-09-26 22:31                           ` Alexander Viro
  2000-09-26 22:47                             ` Marko Kreen
  0 siblings, 1 reply; 243+ messages in thread
From: Alexander Viro @ 2000-09-26 22:31 UTC (permalink / raw)
  To: Marko Kreen
  Cc: Linus Torvalds, Ingo Molnar, Andrea Arcangeli, Rik van Riel,
	Roger Larsson, Alexander Viro, MM mailing list, linux-kernel,
	linux-fsdevel


On Wed, 27 Sep 2000, Marko Kreen wrote:

> On Tue, Sep 26, 2000 at 05:29:27PM -0400, Alexander Viro wrote:
> > Comments and help in testing are more than welcome.
> 
> There is something fishy in ext2_empty_dir:

Why?

> +                               /* check for . and .. */
> +                               if (de->name[0] != '.')
> +                                       goto not_empty;

Doesn't start with '.' - definitely not an empty directory


> +                               if (!de->name[1]) {

OK, it's {'.','\0'}, aka. ".".

> +                                       if (de->inode !=
> +                                           le32_to_cpu(inode->i_ino))

Consistency check... Aha, I see. Yup, s/le32_to_cpu/cpu_to_le32/. Doesn't
matter on all normal architectures, but yes, it's still wrong.

> +                                               goto not_empty;

If we have it screwed - leave it as is and don't mess with it.
Otherwise - skip this record, it's all right for empty directory.

> +                               } else if (de->name[2])

Starts with '.' and longer than 2 characters? Not empty.

> +                                       goto not_empty;
> +                               else if (de->name[1] != '.')

Starts with '.', 2 characters, but the second isn't '.'? Not empty.

> +                                       goto not_empty;

Otherwise - skip the record.

	So checks are OK, the only thing being that we should use
cpu_to_le32() instead of le32_to_cpu(). Doesn't affect the behaviour right
now, but ought to be fixed anyway.
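
The walk-through above can be condensed into a stand-alone sketch. `dirent_sketch`, `cpu_to_le32_sketch()` and `is_dot_entry()` are illustrative names, not the patch's code: the struct keeps only the two fields the check reads, and the conversion helper is a portable stand-in for the kernel's cpu_to_le32(). The on-disk inode field is little-endian while the in-core inode number is in CPU order, so the CPU value is the one to convert before comparing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for cpu_to_le32(): lay the value out low byte first. */
static uint32_t cpu_to_le32_sketch(uint32_t v)
{
    uint8_t b[4] = {
        (uint8_t)(v & 0xff), (uint8_t)((v >> 8) & 0xff),
        (uint8_t)((v >> 16) & 0xff), (uint8_t)((v >> 24) & 0xff)
    };
    uint32_t out;

    memcpy(&out, b, sizeof out);
    return out;
}

/* Simplified entry: just what the check looks at. */
struct dirent_sketch {
    uint32_t inode;   /* little-endian, as stored on disk */
    char name[4];     /* room for ".", ".." and a terminator */
};

/*
 * Returns true when the entry is a valid "." or ".." that an emptiness
 * scan may skip, false when it means "not empty" (or a damaged ".").
 */
static bool is_dot_entry(const struct dirent_sketch *de, uint32_t dir_ino)
{
    assert(de != NULL);
    if (de->name[0] != '.')
        return false;
    if (!de->name[1])   /* "." must point back at the directory itself */
        return de->inode == cpu_to_le32_sketch(dir_ino);
    if (de->name[2])    /* starts with '.', longer than two characters */
        return false;
    return de->name[1] == '.';   /* ".." */
}
```

On a little-endian host the helper is the identity, which matches the remark that the mix-up has no effect on normal architectures; the two conversions also perform the same byte swap on big-endian hosts, so the fix is about using the right direction for the types involved.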


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [CFT][PATCH] ext2 directories in pagecache
  2000-09-26 22:31                           ` Alexander Viro
@ 2000-09-26 22:47                             ` Marko Kreen
  2000-09-27  7:32                               ` Ingo Molnar
  0 siblings, 1 reply; 243+ messages in thread
From: Marko Kreen @ 2000-09-26 22:47 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, Ingo Molnar, Andrea Arcangeli, Rik van Riel,
	Roger Larsson, Alexander Viro, MM mailing list, linux-kernel,
	linux-fsdevel

On Tue, Sep 26, 2000 at 06:31:04PM -0400, Alexander Viro wrote:
> On Wed, 27 Sep 2000, Marko Kreen wrote:
> > There is something fishy in ext2_empty_dir:
> 
> Why?
> 
> > +                               } else if (de->name[2])
> 
Sorry, I had a hard day and I should have gone to sleep already...
I did not think too hard (though I tried ;) about that [2]; together with
the lines that follow, it looked to me like some copy-paste bug...

-- 
marko


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [CFT][PATCH] ext2 directories in pagecache
  2000-09-26 22:47                             ` Marko Kreen
@ 2000-09-27  7:32                               ` Ingo Molnar
  2000-09-27  9:22                                 ` Alexander Viro
  0 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-27  7:32 UTC (permalink / raw)
  To: Marko Kreen
  Cc: Alexander Viro, Linus Torvalds, Andrea Arcangeli, Rik van Riel,
	Roger Larsson, Alexander Viro, MM mailing list, linux-kernel,
	linux-fsdevel

On Wed, 27 Sep 2000, Marko Kreen wrote:

> > Why?
> > 
> > > +                               } else if (de->name[2])
> > 
> Sorry, I had a hard day and I should have gone to sleep already...

hey, you made Alexander notice an endianness bug so it was ok :-)

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [CFT][PATCH] ext2 directories in pagecache
  2000-09-27  7:32                               ` Ingo Molnar
@ 2000-09-27  9:22                                 ` Alexander Viro
  0 siblings, 0 replies; 243+ messages in thread
From: Alexander Viro @ 2000-09-27  9:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marko Kreen, Linus Torvalds, Andrea Arcangeli, Rik van Riel,
	Roger Larsson, Alexander Viro, MM mailing list, linux-kernel,
	linux-fsdevel

On Wed, 27 Sep 2000, Ingo Molnar wrote:

> 
> On Wed, 27 Sep 2000, Marko Kreen wrote:
> 
> > > Why?
> > > 
> > > > +                               } else if (de->name[2])
> > > 
> > Sorry, I had a hard day and I should have gone to sleep already...
> 
> hey, you made Alexander notice an endianness bug so it was ok :-)

Definitely. Usually "it looks fishy" feeling should be trusted - if code
is non-obvious it's more likely to contain bugs.

How did it go? "The goal is to write clear code, not clever code". And right
now dir.c in the patch is not clear enough - better than the corresponding
code in the tree (esp. in ext2_readdir()), but still needs cleaning up.

ObFsck: router in the $ORKPLACE apparently deciding that it's a good time
to shit itself and external SCSI on one of the home boxen joining the
fun. Sheesh...


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [CFT][PATCH] ext2 directories in pagecache
  2000-09-26 21:29                       ` Alexander Viro
  2000-09-26 22:16                         ` Marko Kreen
@ 2000-09-26 23:19                         ` Andreas Dilger
  2000-09-26 23:33                           ` Alexander Viro
  1 sibling, 1 reply; 243+ messages in thread
From: Andreas Dilger @ 2000-09-26 23:19 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, Ingo Molnar, Andrea Arcangeli, Rik van Riel,
	Roger Larsson, Alexander Viro, MM mailing list, linux-kernel,
	linux-fsdevel

Al Viro writes:
> 	Folks, give it a try - just keep decent backups. Similar code will
> have to go into UFS in 2.4 and that (ext2) variant may be of interest for
> 2.4.<late>/2.5.<early> timeframe.

Haven't tested it yet, but just reading over the patch - in ext2_lookup():

        if (dentry->d_name.len > UFS_MAXNAMLEN)
                return ERR_PTR(-ENAMETOOLONG);

should probably be changed back to:

        if (dentry->d_name.len > EXT2_NAME_LEN)
                return ERR_PTR(-ENAMETOOLONG);

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [CFT][PATCH] ext2 directories in pagecache
  2000-09-26 23:19                         ` Andreas Dilger
@ 2000-09-26 23:33                           ` Alexander Viro
  2000-09-26 23:44                             ` Alexander Viro
  0 siblings, 1 reply; 243+ messages in thread
From: Alexander Viro @ 2000-09-26 23:33 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Alexander Viro, Linus Torvalds, Ingo Molnar, Andrea Arcangeli,
	Rik van Riel, Roger Larsson, Alexander Viro, MM mailing list,
	linux-kernel, linux-fsdevel



On Tue, 26 Sep 2000, Andreas Dilger wrote:

> Al Viro writes:
> > 	Folks, give it a try - just keep decent backups. Similar code will
> > have to go into UFS in 2.4 and that (ext2) variant may be of interest for
> > 2.4.<late>/2.5.<early> timeframe.
> 
> Haven't tested it yet, but just reading over the patch - in ext2_lookup():
> 
>         if (dentry->d_name.len > UFS_MAXNAMLEN)
>                 return ERR_PTR(-ENAMETOOLONG);
> 
> should probably be changed back to:
> 
>         if (dentry->d_name.len > EXT2_NAME_LEN)
>                 return ERR_PTR(-ENAMETOOLONG);

Grrr... It shows the ancestry - it's a ported UFS patch. Thanks for spotting,
I'll fix that.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [CFT][PATCH] ext2 directories in pagecache
  2000-09-26 23:33                           ` Alexander Viro
@ 2000-09-26 23:44                             ` Alexander Viro
  0 siblings, 0 replies; 243+ messages in thread
From: Alexander Viro @ 2000-09-26 23:44 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andreas Dilger, Linus Torvalds, Ingo Molnar, Andrea Arcangeli,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel,
	linux-fsdevel


On Tue, 26 Sep 2000, Alexander Viro wrote:

> On Tue, 26 Sep 2000, Andreas Dilger wrote:
> 
> > Al Viro writes:
> > > 	Folks, give it a try - just keep decent backups. Similar code will
> > > have to go into UFS in 2.4 and that (ext2) variant may be of interest for
> > > 2.4.<late>/2.5.<early> timeframe.
> > 
> > Haven't tested it yet, but just reading over the patch - in ext2_lookup():
> > 
> >         if (dentry->d_name.len > UFS_MAXNAMLEN)
> >                 return ERR_PTR(-ENAMETOOLONG);
> > 
> > should probably be changed back to:
> > 
> >         if (dentry->d_name.len > EXT2_NAME_LEN)
> >                 return ERR_PTR(-ENAMETOOLONG);
> 
> Grrr... It shows the ancestry - it's a ported UFS patch. Thanks for spotting,
> I'll fix that.

Aha. And there was that UFS_LINK_MAX thing. Fixed. OK, new version is on
the same site, URL being ftp://ftp.math.psu.edu/pub/viro/ext2-patch-8.gz

	Changes: got rid of the remnants of UFS ancestry (EXT2 limits are
used; not that it mattered much, but...), fixed the conversion in
ext2_empty_dir() (cpu_to_le32() instead of le32_to_cpu()).
							Cheers,
								Al

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-24 21:12               ` Andrea Arcangeli
  2000-09-24 21:12                 ` Ingo Molnar
@ 2000-09-25  0:09                 ` Linus Torvalds
  2000-09-25  0:49                   ` Alexander Viro
                                     ` (2 more replies)
  1 sibling, 3 replies; 243+ messages in thread
From: Linus Torvalds @ 2000-09-25  0:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Andrea Arcangeli wrote:
>
> On Sun, Sep 24, 2000 at 10:26:11PM +0200, Ingo Molnar wrote:
> > where will it deadlock?
> 
> ext2_new_block (or whatever that runs getblk with the superlock lock
> acquired)->getblk->GFP->shrink_dcache_memory->prune_dcache->
> prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->
> put_inode->ext2_discard_prealloc->ext2_free_blocks->lock_super->D

Whee..

Good that you remembered (now that you mention it, I recollect that we had
this bug and discussion earlier).

I added a comment to that effect, although I still moved the __GFP_IO test
into the icache and dcache shrink functions, because as with the
shm_swap() thing this is probably something we do want to fix eventually.

The icache shrinker probably has similar problems with clear_inode.

I suspect that it might be a good idea to try to fix this issue, because
it will probably keep coming up otherwise. And it's likely to be fairly
easily debugged, by just making getblk() have some debugging code that
basically says something like

	lock_super()
	{
		.. do the lock ..
+		current->super_locked++;
	}

	unlock_super()
	{
+		if (current->super_locked < 1)
+			BUG();
+		current->super_locked--;
		.. do the unlock ..
	}

	getblk()
	{
+		if (current->super_locked)
+			BUG();
		.. do the getblk ..
	}

and just making it a new rule that you cannot call getblk() with any locks
held.

It should be fairly easy to make the callers well-behaved: the hard part
is probably just enumerating and finding the suckers, which is why the
above debug code would make people aware of it..

(We definitely don't want to wait for the deadlock to happen and trap that
one: the above code will BUG() out in any normal situation regardless of
whether it would actually trigger a deadlock or even allocate memory or
not. Which is what we'd want if we want to fix this).
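
The debugging rule above can be sketched in userspace C. This is only a
minimal model under stated assumptions: the toy_* names are hypothetical,
a global struct stands in for current->super_locked, and the getblk check
returns -1 where the kernel version would BUG(), so that the sketch can be
exercised without aborting.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-task counter (current->super_locked in the mail). */
struct toy_task {
	int super_locked;
};

struct toy_task toy_current;

void toy_lock_super(void)
{
	/* .. the real lock would be taken here .. */
	toy_current.super_locked++;
}

void toy_unlock_super(void)
{
	assert(toy_current.super_locked >= 1);	/* BUG() in the mail */
	toy_current.super_locked--;
	/* .. the real lock would be released here .. */
}

/* Returns 0 when the call is allowed, -1 where the mail's version would BUG(). */
int toy_getblk(void)
{
	if (toy_current.super_locked)
		return -1;
	/* .. the real getblk work would go here .. */
	return 0;
}
```

The point of the counter is that the check fires on every call made with
the lock held, whether or not that particular call would have allocated
memory or deadlocked.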

On the whole, fixing the cases would probably imply dropping the lock,
doing the read, re-acquiring the lock, and then going back and seeing if
maybe somebody else already filled in the bitmap cache or whatever. So not
one-liners by any means, but we'll probably want to do it at some point
(the superblock lock is quite contended right now, and the reason for that
may well be that it's just so badly done for historical reasons).
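
The drop/read/re-acquire/recheck pattern described above might look roughly
like this in userspace C. It is a sketch only: a pthread mutex stands in for
the superblock lock, and toy_bitmap_cache, toy_slow_read() and
toy_get_bitmap() are hypothetical names, not ext2 code.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical cache slot protected by a lock (stand-in for the bitmap cache). */
struct toy_bitmap_cache {
	pthread_mutex_t lock;
	int loaded;	/* nonzero once data is valid */
	int data;
	int installs;	/* how many times a read result was installed */
};

/* Hypothetical blocking read; in the kernel this is where memory could be allocated. */
int toy_slow_read(void)
{
	return 42;
}

int toy_get_bitmap(struct toy_bitmap_cache *c)
{
	int fresh, result;

	pthread_mutex_lock(&c->lock);
	if (!c->loaded) {
		pthread_mutex_unlock(&c->lock);	/* drop the lock before blocking */
		fresh = toy_slow_read();
		pthread_mutex_lock(&c->lock);	/* re-acquire */
		if (!c->loaded) {		/* recheck: somebody else may have filled it */
			c->data = fresh;
			c->loaded = 1;
			c->installs++;
		}
	}
	result = c->data;
	pthread_mutex_unlock(&c->lock);
	return result;
}
```

The second check after re-acquiring the lock is what makes this more than a
one-liner: the work done while unlocked may turn out to be redundant and has
to be thrown away.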

		Linus

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  0:09                 ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds
@ 2000-09-25  0:49                   ` Alexander Viro
  2000-09-25  0:53                   ` Marcelo Tosatti
  2000-09-25  1:31                   ` [patch] vmfixes-2.4.0-test9-B2 Andrea Arcangeli
  2 siblings, 0 replies; 243+ messages in thread
From: Alexander Viro @ 2000-09-25  0:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Ingo Molnar, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Linus Torvalds wrote:

> > ext2_new_block (or whatever that runs getblk with the superlock lock
> > acquired)->getblk->GFP->shrink_dcache_memory->prune_dcache->
> > prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->
> > put_inode->ext2_discard_prealloc->ext2_free_blocks->lock_super->D
> 
> Whee..
[snip] 

> On the whole, fixing the cases would probably imply dropping the lock,
> doing the read, re-acquiring the lock, and then going back and seeing if
> maybe somebody else already filled in the bitmap cache or whatever. So not
> one-liners by any means, but we'll probably want to do it at some point
> (the superblock lock is quite contended right now, and the reason for that
> may well be that it's just so badly done for historical reasons).

Nope. Solution is to kill the silly "hold super_block lock during the
allocation" completely. Right now the main problem making us use it at all
is the following: dquot_alloc_block() is a blocking operation. If that
gets fixed - that's it. We simply don't need anything more fancy than
rwlock on access to bitmap + rwlock or plain spinlock on access to group
descriptors cache. End of problem.

Remember that off-list thread in July when you asked what could be done
with lock_super()? I did the analysis, all right - list of ext2 races was
a side effect of that. Now we have that crap fixed, so getting rid of
lock_super() in ext2 (in a clear way) is possible. So if you still want it -
tell. ext2 part is very easy, it's quota part that needs serious work.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  0:09                 ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds
  2000-09-25  0:49                   ` Alexander Viro
@ 2000-09-25  0:53                   ` Marcelo Tosatti
  2000-09-25  1:45                     ` Andrea Arcangeli
  2000-09-25 10:42                     ` the new VM Ingo Molnar
  2000-09-25  1:31                   ` [patch] vmfixes-2.4.0-test9-B2 Andrea Arcangeli
  2 siblings, 2 replies; 243+ messages in thread
From: Marcelo Tosatti @ 2000-09-25  0:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Ingo Molnar, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Sun, 24 Sep 2000, Linus Torvalds wrote:

> 
> 
> On Sun, 24 Sep 2000, Andrea Arcangeli wrote:
> >
> > On Sun, Sep 24, 2000 at 10:26:11PM +0200, Ingo Molnar wrote:
> > > where will it deadlock?
> > 
> > ext2_new_block (or whatever that runs getblk with the superlock lock
> > acquired)->getblk->GFP->shrink_dcache_memory->prune_dcache->
> > prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->
> > put_inode->ext2_discard_prealloc->ext2_free_blocks->lock_super->D
> 
> Whee..
> 
> Good that you remembered (now that you mention it, I recollect that we had
> this bug and discussion earlier).
> 
> I added a comment to that effect, although I still moved the __GFP_IO test
> into the icache and dcache shrink functions, because as with the
> shm_swap() thing this is probably something we do want to fix eventually.

Btw, why do we need kmem_cache_shrink() inside shrink_{i,d}cache_memory?

Since refill_inactive and do_try_to_free_pages (the only functions which
call shrink_{i,d}cache_memory) already shrink the SLAB cache (with
kmem_cache_reap), I don't think it's needed.



^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  0:53                   ` Marcelo Tosatti
@ 2000-09-25  1:45                     ` Andrea Arcangeli
  2000-09-25  2:39                       ` Marcelo Tosatti
  2000-09-25 10:42                     ` the new VM Ingo Molnar
  1 sibling, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25  1:45 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Linus Torvalds, Ingo Molnar, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Sun, Sep 24, 2000 at 09:53:33PM -0300, Marcelo Tosatti wrote:
> Btw, why do we need kmem_cache_shrink() inside shrink_{i,d}cache_memory?

Because kmem_cache_free doesn't free anything. It only queues slab
objects into the partial and free part of the cachep slab queue (so that
they're ready to be freed later, and that's what we do in shrink_slab_cache).

> call shrink_{i,d}cache_memory) already shrink the SLAB cache (with
> kmem_cache_reap), I don't think it's needed.

kmem_cache_reap shrinks the slabs at _very_ low frequency. It's worthless to
keep lots of dentries and inodes in the slab internal queues until
kmem_cache_reap kicks in again. If we free that memory immediately instead,
kmem_cache_reap will run later and for something more appropriate to what it
was designed for. The [id]cache shrink could release lots of memory.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  1:45                     ` Andrea Arcangeli
@ 2000-09-25  2:39                       ` Marcelo Tosatti
  2000-09-25 15:36                         ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Marcelo Tosatti @ 2000-09-25  2:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Ingo Molnar, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1326 bytes --]


On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

<snip>

> kmem_cache_reap shrinks the slabs at _very_ low frequency. It's worthless to
> keep lots of dentries and inodes in the slab internal queues until
> kmem_cache_reap kicks in again. If we free that memory immediately instead,
> kmem_cache_reap will run later and for something more appropriate to what it
> was designed for. The [id]cache shrink could release lots of memory.

I see.

Since we have code which is using GFP_BUFFER allocations to not block but
only shrink the cache (1), I've done a patch to:

- Change kmem_cache_shrink to return the number of freed pages. 

- Move __GFP_IO checking from do_try_to_free_pages/refill_inactive to
{i,d}cache shrink functions (Linus already did this in his tree)

- On the {i,d}cache shrink functions, return the value of
kmem_cache_shrink() (no need of __GFP_IO for that)
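
A toy model of the first item, and of the earlier point that freeing an
object only queues it until a shrink pass releases whole slabs. Everything
here is a hypothetical simplification (fixed array of slabs, toy_* names);
only the page count, (1 << gfporder) times the number of slabs released,
follows the attached patch.

```c
#include <assert.h>

#define TOY_MAX_SLABS 8

/* Hypothetical, heavily simplified cache: a fixed array of slabs. */
struct toy_cache {
	unsigned int gfporder;			/* each slab spans 1 << gfporder pages */
	unsigned int inuse[TOY_MAX_SLABS];	/* live objects per slab */
	int present[TOY_MAX_SLABS];		/* slab still holds its pages */
};

/* Freeing an object only updates bookkeeping; no page is given back here. */
void toy_cache_free(struct toy_cache *c, int slab)
{
	if (c->present[slab] && c->inuse[slab] > 0)
		c->inuse[slab]--;
}

/* Release every completely free slab and report how many pages came back. */
int toy_cache_shrink(struct toy_cache *c)
{
	int i, freed = 0;

	for (i = 0; i < TOY_MAX_SLABS; i++) {
		if (c->present[i] && c->inuse[i] == 0) {
			c->present[i] = 0;
			freed++;
		}
	}
	return (1 << c->gfporder) * freed;
}
```

Returning a page count rather than a flag is what lets the callers add the
result to their running total of freed pages.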


There was a comment on the shrink functions about making
kmem_cache_shrink() work on a GFP_DMA/GFP_HIGHMEM basis to free only the
pages wanted by the current allocation. 

GFP_DMA allocations will never reach this code (do_try_to_free_pages is
only called if __GFP_WAIT is set) and GFP_HIGHMEM pages will never be used
as SLAB obj's memory. (please correct me if I'm wrong)


Comments?


(1) Using GFP_BUFFER is wrong, but it's a separate issue. 

[-- Attachment #2: Type: TEXT/PLAIN, Size: 4122 bytes --]

diff --exclude-from=exclude -Nur linux.orig/fs/dcache.c linux/fs/dcache.c
--- linux.orig/fs/dcache.c	Sun Sep 24 18:14:24 2000
+++ linux/fs/dcache.c	Sun Sep 24 22:49:16 2000
@@ -556,15 +556,11 @@
 	int count = 0;
 	if (priority)
 		count = dentry_stat.nr_unused / priority;
-	prune_dcache(count);
-	/* FIXME: kmem_cache_shrink here should tell us
-	   the number of pages freed, and it should
-	   work in a __GFP_DMA/__GFP_HIGHMEM behaviour
-	   to free only the interesting pages in
-	   function of the needs of the current allocation. */
-	kmem_cache_shrink(dentry_cache);
 
-	return 0;
+	if(gfp_mask & __GFP_IO)
+		prune_dcache(count);
+
+	return kmem_cache_shrink(dentry_cache);
 }
 
 #define NAME_ALLOC_LEN(len)	((len+16) & ~15)
diff --exclude-from=exclude -Nur linux.orig/fs/inode.c linux/fs/inode.c
--- linux.orig/fs/inode.c	Sun Sep 24 18:14:25 2000
+++ linux/fs/inode.c	Sun Sep 24 22:47:30 2000
@@ -460,15 +460,11 @@
 		
 	if (priority)
 		count = inodes_stat.nr_unused / priority;
-	prune_icache(count);
-	/* FIXME: kmem_cache_shrink here should tell us
-	   the number of pages freed, and it should
-	   work in a __GFP_DMA/__GFP_HIGHMEM behaviour
-	   to free only the interesting pages in
-	   function of the needs of the current allocation. */
-	kmem_cache_shrink(inode_cachep);
 
-	return 0;
+	if(gfp_mask & __GFP_IO) 
+		prune_icache(count);
+
+	return kmem_cache_shrink(inode_cachep);
 }
 
 /*
diff --exclude-from=exclude -Nur linux.orig/mm/slab.c linux/mm/slab.c
--- linux.orig/mm/slab.c	Sun Sep 24 18:14:04 2000
+++ linux/mm/slab.c	Sun Sep 24 22:46:11 2000
@@ -887,7 +887,7 @@
 static int __kmem_cache_shrink(kmem_cache_t *cachep)
 {
 	slab_t *slabp;
-	int ret;
+	int ret, freed = 0;
 
 	drain_cpu_caches(cachep);
 
@@ -912,8 +912,11 @@
 		spin_unlock_irq(&cachep->spinlock);
 		kmem_slab_destroy(cachep, slabp);
 		spin_lock_irq(&cachep->spinlock);
+
+		freed++;
 	}
-	ret = !list_empty(&cachep->slabs);
+
+	ret = ((1 << cachep->gfporder) * freed);
 	spin_unlock_irq(&cachep->spinlock);
 	return ret;
 }
@@ -923,7 +926,8 @@
  * @cachep: The cache to shrink.
  *
  * Releases as many slabs as possible for a cache.
- * To help debugging, a zero exit status indicates all slabs were released.
+ *
+ * Returns the amount of freed pages.
  */
 int kmem_cache_shrink(kmem_cache_t *cachep)
 {
@@ -962,7 +966,9 @@
 	list_del(&cachep->next);
 	up(&cache_chain_sem);
 
-	if (__kmem_cache_shrink(cachep)) {
+	__kmem_cache_shrink(cachep); 
+	
+	if (!list_empty(&cachep->slabs)) {
 		printk(KERN_ERR "kmem_cache_destroy: Can't free all objects %p\n",
 		       cachep);
 		down(&cache_chain_sem);
diff --exclude-from=exclude -Nur linux.orig/mm/vmscan.c linux/mm/vmscan.c
--- linux.orig/mm/vmscan.c	Sun Sep 24 18:14:04 2000
+++ linux/mm/vmscan.c	Sun Sep 24 23:09:01 2000
@@ -904,14 +904,16 @@
 		}
 
 		/* Try to get rid of some shared memory pages.. */
-		if (gfp_mask & __GFP_IO) {
-			/*
-			 * don't be too light against the d/i cache since
-		   	 * shrink_mmap() almost never fail when there's
-		   	 * really plenty of memory free. 
-			 */
-			count -= shrink_dcache_memory(priority, gfp_mask);
-			count -= shrink_icache_memory(priority, gfp_mask);
+
+		/*
+		 * don't be too light against the d/i cache since
+	   	 * shrink_mmap() almost never fail when there's
+	   	 * really plenty of memory free. 
+		 */
+		count -= shrink_dcache_memory(priority, gfp_mask);
+		count -= shrink_icache_memory(priority, gfp_mask);
+
+		if(gfp_mask & __GFP_IO) {
 			/*
 			 * Not currently working, see fixme in shrink_?cache_memory
 			 * In the inner funtions there is a comment:
@@ -992,10 +994,8 @@
 	 * the inode and dentry cache whenever we do this.
 	 */
 	if (free_shortage() || inactive_shortage()) {
-		if (gfp_mask & __GFP_IO) {
-			ret += shrink_dcache_memory(6, gfp_mask);
-			ret += shrink_icache_memory(6, gfp_mask);
-		}
+		ret += shrink_dcache_memory(6, gfp_mask);
+		ret += shrink_icache_memory(6, gfp_mask);
 
 		ret += refill_inactive(gfp_mask, user);
 	} else {

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  2:39                       ` Marcelo Tosatti
@ 2000-09-25 15:36                         ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 15:36 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Linus Torvalds, Ingo Molnar, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Sun, Sep 24, 2000 at 11:39:13PM -0300, Marcelo Tosatti wrote:
> - Change kmem_cache_shrink to return the number of freed pages. 

I did that too, extending a patch from Mark. I also removed the first_not_full
ugliness, providing a LIFO behaviour to the completely freed slabs (so
kmem_cache_reap removes the oldest completely unused slabs from the queue, not
the most recently used ones with potentially live cache in the CPU). 

> There was a comment on the shrink functions about making
> kmem_cache_shrink() work on a GFP_DMA/GFP_HIGHMEM basis to free only the
> pages wanted by the current allocation. 

This is meaningless at the moment because it can't be addressed without
classzone logic in the allocator (classzone means that the allocator will pass
to the memory balancing code the information about _which_ classzone you have
to allocate memory from, so you won't waste time to synchronously balance
unrelated zones).

My patch is here (it isn't going to apply cleanly due to the test9 changes
in do_try_to_free_pages but porting is trivial). It was tested and
it was working for me.

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test7/slab-1

BTW, here there's a fix for a longstanding SMP race (since swap_out and msync
don't run with the big lock) that can corrupt memory: 

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test5/msync-smp-race-1

Here is the fix for another SMP race in establish_pte:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test5/tlb-flush-smp-race-1

The fix for this last bit is ugly but it's safe because Manfred said s390 has
a flush_tlb_page that atomically flushes and makes the pte invalid (a cleaner
fix means moving part of establish_pte into the arch inlines).

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* the new VM
  2000-09-25  0:53                   ` Marcelo Tosatti
  2000-09-25  1:45                     ` Andrea Arcangeli
@ 2000-09-25 10:42                     ` Ingo Molnar
  2000-09-25 13:02                       ` Andrea Arcangeli
  1 sibling, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 10:42 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Linus Torvalds, Andrea Arcangeli, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

i'd also like to share my experiences with recent kernels, compared to the
'old VM'. I frequently run high VM load multi-gigabyte systems with a lot
of IRQ-side allocations as well, and it's surprising how sensitive these
systems' performance is to VM balance, despite gobs of RAM.

- The biggest difference under high allocation load is that the CPU usage
of kswapd and the synchronous VM balancing code has decreased
significantly. Under previous kernels it was not uncommon to see sudden
spikes in kswapd usage, and to see significant CPU time spent in
shrink_mmap() & friends. I suspect that this is because the new VM does
much less 'guessing' and blind list-walking.

- i'm also happy that __alloc_pages() now 'guarantees' allocation. This i
believe could simplify unrelated kernel code significantly. Eg. no need to
check for NULL pointers on most allocations, a GFP_KERNEL allocation
always succeeds, end of story. This behavior also has the 'nice'
side-effect of showing memory imbalance rather forcefully: the system
locks up ;-) A GFP_ATOMIC allocation obviously still has the potential to
fail, and must be handled properly.
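
The difference between the two allocation classes can be modelled in a few
lines of C. This is a sketch under assumptions, not __alloc_pages(): the
toy_* names and counters are hypothetical, TOY_WAIT stands in for
__GFP_WAIT, and the case where a real "guaranteed" allocator would spin
forever is reported as TOY_WOULD_HANG so that the sketch terminates.

```c
#include <assert.h>

#define TOY_WAIT	0x1	/* caller may wait for reclaim (stand-in for __GFP_WAIT) */

enum toy_result { TOY_OK, TOY_FAIL, TOY_WOULD_HANG };

int toy_free_pages;	/* pages available right now */
int toy_reclaimable;	/* pages that a reclaim pass could still free */

/* Hypothetical reclaim step: moves one page from reclaimable to free. */
int toy_try_to_free_pages(void)
{
	if (toy_reclaimable == 0)
		return 0;
	toy_reclaimable--;
	toy_free_pages++;
	return 1;
}

enum toy_result toy_alloc_page(unsigned int flags)
{
	for (;;) {
		if (toy_free_pages > 0) {
			toy_free_pages--;
			return TOY_OK;
		}
		if (!(flags & TOY_WAIT))
			return TOY_FAIL;	/* atomic caller: must handle failure */
		if (!toy_try_to_free_pages())
			return TOY_WOULD_HANG;	/* a looping allocator would lock up here */
	}
}
```

With no free pages and one reclaimable page, an atomic request fails at
once, a waiting request succeeds after one reclaim pass, and the next
waiting request hits the lock-up case.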

all in all, the new VM balancing code looks really promising, despite all
the growing pains.

	Ingo

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 10:42                     ` the new VM Ingo Molnar
@ 2000-09-25 13:02                       ` Andrea Arcangeli
  2000-09-25 13:02                         ` Ingo Molnar
  2000-09-25 13:04                         ` Ingo Molnar
  0 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 13:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 12:42:09PM +0200, Ingo Molnar wrote:
> believe could simplify unrelated kernel code significantly. Eg. no need to
> check for NULL pointers on most allocations, a GFP_KERNEL allocation
> always succeeds, end of story. This behavior also has the 'nice'

Sorry I totally disagree. If GFP_KERNEL allocations are guaranteed to succeed,
that is a showstopper bug. We also have another showstopper bug in getblk that
will be hard to fix because people got used to relying on it and wrote
deadlock-prone code.

You should know that people not running benchmarks and using the machine
power for simulations run out of memory all the time. If you put this kind of
obvious deadlock into the main kernel allocator you'll screw up the hard work
that has been done so far to fix all the other deadlock problems during OOM.
Please fix raid1 instead of making things worse.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 13:02                       ` Andrea Arcangeli
@ 2000-09-25 13:02                         ` Ingo Molnar
  2000-09-25 13:08                           ` Andrea Arcangeli
  2000-09-25 13:04                         ` Ingo Molnar
  1 sibling, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 13:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> Sorry I totally disagree. If GFP_KERNEL allocations are guaranteed to succeed
> that is a showstopper bug. [...]

why?

> machine power for simulations run out of memory all the time. If you
> put this kind of obvious deadlock into the main kernel allocator

FYI, i haven't put it there.

	Ingo

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 13:02                         ` Ingo Molnar
@ 2000-09-25 13:08                           ` Andrea Arcangeli
  2000-09-25 13:12                             ` Ingo Molnar
  2000-09-25 14:37                             ` Rik van Riel
  0 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 13:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:02:58PM +0200, Ingo Molnar wrote:
> 
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> 
> > Sorry I totally disagree. If GFP_KERNEL allocations are guaranteed to succeed
> > that is a showstopper bug. [...]
> 
> why?

Because as you said the machine can lock up when you run out of memory.

> FYI, i haven't put it there.

Ok.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 13:08                           ` Andrea Arcangeli
@ 2000-09-25 13:12                             ` Ingo Molnar
  2000-09-25 13:30                               ` Andrea Arcangeli
  2000-09-25 14:47                               ` Alan Cox
  2000-09-25 14:37                             ` Rik van Riel
  1 sibling, 2 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 13:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > > Sorry I totally disagree. If GFP_KERNEL allocations are guaranteed to succeed
> > > that is a showstopper bug. [...]
> > 
> > why?
> 
> Because as you said the machine can lock up when you run out of memory.

well, i think all kernel-space allocations have to be limited carefully,
denying succeeding allocations is not a solution against over-allocation,
especially in a multi-user environment.

	Ingo

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 13:12                             ` Ingo Molnar
@ 2000-09-25 13:30                               ` Andrea Arcangeli
  2000-09-25 13:39                                 ` Ingo Molnar
  2000-09-25 14:47                               ` Alan Cox
  1 sibling, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 13:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:12:58PM +0200, Ingo Molnar wrote:
> well, i think all kernel-space allocations have to be limited carefully,

When a machine without a gigabit ethernet runs OOM it's userspace that
allocated the memory via page faults, not the kernel.

And if the careful limit avoids the deadlock in the layer above alloc_pages,
then it will also keep alloc_pages from returning NULL and you won't need an
infinite loop in the first place (unless the memory balancing is buggy).

GFP should return NULL only if the machine is out of memory. The kernel can be
written in a way that never deadlocks when the machine is out of memory just
by checking the GFP retval. I don't think any in-kernel resource limit is
necessary to have things reliable and fast. Most dynamic big caches and kernel
data can be shrunk dynamically during memory pressure (perhaps except skbs
and I agree that for skbs on gigabit ethernet the thing is a little different).

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 13:30                               ` Andrea Arcangeli
@ 2000-09-25 13:39                                 ` Ingo Molnar
  2000-09-25 14:04                                   ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 13:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> And if the careful limit avoids the deadlock in the layer above
> alloc_pages, then it will also keep alloc_pages from returning NULL and
> you won't need an infinite loop in the first place (unless the memory
> balancing is buggy).

yes i like this property very much because it unearths VM balancing bugs,
which plagued us for so long and are so hard to detect. But statistically
it's also possible that try_to_free_pages() frees a page and alloc_pages()
done on another CPU (or in IRQ context) 'steals' the page. This can
happen, because the VM right now guarantees no straight path from
deallocator to allocator. (and it's not necessary to guarantee it, given
the varying nature of allocation requests.)

> GFP should return NULL only if the machine is out of memory. The
> kernel can be written in a way that never deadlocks when the machine
> is out of memory just checking the GFP retval. I don't think any
> in-kernel resource limit is necessary to have things reliable and
> fast. [...]

Andrea, if you really mean this then you should not be let near the VM
balancing code :-)

> Most dynamic big caches and kernel data can be shrunk dynamically
> during memory pressure (perhaps except skbs and I agree that for skbs
> on gigabit ethernet the thing is a little different).

a big 'except'. You don't need gigabit for that, to the contrary, if the
network is slow it's easier to overallocate within the kernel. Ask Alan
about how many D.O.S. attacks are possible without implicit or
explicit bean counting.

	Ingo

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 13:39                                 ` Ingo Molnar
@ 2000-09-25 14:04                                   ` Andrea Arcangeli
  2000-09-25 14:04                                     ` Ingo Molnar
  2000-09-25 14:26                                     ` Marcelo Tosatti
  0 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 14:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:39:51PM +0200, Ingo Molnar wrote:
> Andrea, if you really mean this then you should not be let near the VM
> balancing code :-)

What I mean is that the VM balancing is in the lower layer, which knows nothing
about the per-socket gigabit ethernet skb limits; the limit should live at the
higher layer. For most code just checking for NULL in GFP is fine (for example
do_anonymous_page). It's the caller (not the VM balancing developer) who
shouldn't be let near his code if he allows it to fill all the physical
RAM with his stuff, causing the machine to run OOM.

> > Most dynamic big caches and kernel data can be shrunk dynamically
> > during memory pressure (perhaps except skbs and I agree that for skbs
> > on gigabit ethernet the thing is a little different).
> 
> a big 'except'. You don't need gigabit for that, to the contrary, if the

I talked with Alexey about this and it seems the best way is to have a
per-socket reservation of clean cache as a function of the receive window.  So
we don't need a huge atomic pool but we can have a special lru with an irq
spinlock that is able to shrink cache from irq as well.

> about how many D.O.S. attacks are possible without implicit or
> explicit bean counting.

Again: the bean counting and all the limits happen at the higher layer.  I
shouldn't know anything about it when I play with the lower layer GFP memory
balancing code.
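
A rough illustration of that layering, as a userspace C sketch: the higher
layer charges each request against its own limit before it ever reaches the
allocator, and the allocator (plain malloc() here) knows nothing about the
limit. struct toy_sock and the toy_sock_* helpers are hypothetical names,
not networking code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-socket accounting kept by the higher layer. */
struct toy_sock {
	size_t charged;	/* bytes currently accounted to this socket */
	size_t limit;	/* most this socket may hold at once */
};

/* Check the limit first; only then go to the lower-layer allocator. */
void *toy_sock_alloc(struct toy_sock *sk, size_t size)
{
	void *p;

	if (size > sk->limit - sk->charged)
		return NULL;		/* over the limit: allocator never sees it */
	p = malloc(size);
	if (p)
		sk->charged += size;	/* charge only what was really allocated */
	return p;
}

/* Give the memory back and undo the charge. */
void toy_sock_free(struct toy_sock *sk, void *p, size_t size)
{
	free(p);
	sk->charged -= size;
}
```

Because charged never exceeds limit, the subtraction in the limit check
cannot wrap around.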

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 14:04                                   ` Andrea Arcangeli
@ 2000-09-25 14:04                                     ` Ingo Molnar
  2000-09-25 14:23                                       ` Andrea Arcangeli
  2000-09-25 14:26                                     ` Marcelo Tosatti
  1 sibling, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 14:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> Again: the bean counting and all the limit happens at the higher
> layer.  I shouldn't know anything about it when I play with the lower
> layer GFP memory balancing code.

exactly, and this is why if a higher level lets through a GFP_KERNEL, then
it *must* succeed. Otherwise either the higher level code is buggy, or the
VM balance is buggy, but we want to have clear signs of it.

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 14:04                                     ` Ingo Molnar
@ 2000-09-25 14:23                                       ` Andrea Arcangeli
  2000-09-25 14:27                                         ` Ingo Molnar
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 14:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:04:14PM +0200, Ingo Molnar wrote:
> exactly, and this is why if a higher level lets through a GFP_KERNEL, then
> it *must* succeed. Otherwise either the higher level code is buggy, or the
> VM balance is buggy, but we want to have clear signs of it.

I'm not sure we should restrict the limiting only to the cases that need
it. For example do_anonymous_page looks like a place that could rely on the
GFP retval.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 14:23                                       ` Andrea Arcangeli
@ 2000-09-25 14:27                                         ` Ingo Molnar
  2000-09-25 14:39                                           ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 14:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> I'm not sure if we should restrict the limiting only to the cases that
> needs them. For example do_anonymous_page looks a place that could
> rely on the GFP retval.

i think an application should not fail due to other applications
allocating too much RAM. OOM behavior should be a central thing and based
on allocation patterns, not pure luck or unluck. I always found it rude to
SIGBUS when some other application is abusing RAM but the oom detector has
not yet killed it off.

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 14:27                                         ` Ingo Molnar
@ 2000-09-25 14:39                                           ` Andrea Arcangeli
  2000-09-25 14:43                                             ` Ingo Molnar
  2000-09-25 16:09                                             ` Rik van Riel
  0 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 14:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:27:24PM +0200, Ingo Molnar wrote:
> i think an application should not fail due to other applications
> allocating too much RAM. OOM behavior should be a central thing and based

At least Linus's point is that doing perfect accounting (at least on the
userspace allocation side) may cause you to waste resources, failing even if
you could still run, and I tend to agree with him. We're lazy on that
side and that's a global win in most cases.

We are fine-grained at page granularity, not at mmap granularity.

The point is that not all the mmapped regions are going to be paged in.  Think
of a program that only after 1 hour has done all the calculations that actually
use all the memory it requested with malloc.  Before the hour passes the unused
memory can still be used for other things and that's what the user also
expects when he runs `free`.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 14:39                                           ` Andrea Arcangeli
@ 2000-09-25 14:43                                             ` Ingo Molnar
  2000-09-25 15:01                                               ` Andrea Arcangeli
  2000-09-25 16:09                                             ` Rik van Riel
  1 sibling, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 14:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> At least Linus's point is that doing perfect accounting (at least on
> the userspace allocation side) may cause you to waste resources,
> failing even if you could still run and I tend to agree with him.
> We're lazy on that side and that's global win in most cases.

well, as i said, i agree that being lazy on the user-space side (which is
by far the biggest RAM allocator in a typical system) makes sense - and we
can handle it cleanly.

Being lazy on the kernel-space side is the default behavior for us kernel
hackers :-) but i dont think it's the right thing in the long term.

> We are finegrined with page granularity, not with the mmap
> granularity. The point is that not all the mmapped regions are going
> to be pagedin. Think a program that only after 1 hour did all the
> calculations that allocated all the memory it requested with malloc.  
> Before the hour passes the unused memory can still be used for other
> things and that's what the user also expects when he runs `free`.

i think you've completely missed the fact that i made exactly this point
in my previous mail.

	'user-space laziness': correct
	'kernel-space laziness': dangerous

i talked about GFP_KERNEL, not GFP_USER. Even in the case of GFP_USER i
believe the right place to oom is via a signal, not in the gfp() case.
(because oom situation in the gfp() case is a completely random and
statistical event, which might have no connection at all to the behavior
of that given process.)

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 14:43                                             ` Ingo Molnar
@ 2000-09-25 15:01                                               ` Andrea Arcangeli
  2000-09-25 15:10                                                 ` Ingo Molnar
  2000-09-26 19:10                                                 ` Pavel Machek
  0 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 15:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:43:44PM +0200, Ingo Molnar wrote:
> i talked about GFP_KERNEL, not GFP_USER. Even in the case of GFP_USER i

My bad, you're right I was talking about GFP_USER indeed.

But even for GFP_KERNEL allocations, like the init of a module or any other
thing that is statically sized during production, just checking the retval
looks to be ok.

> believe the right place to oom is via a signal, not in the gfp() case.

A signal can be trapped and ignored by a malicious task. We had that security
problem until 2.2.14 IIRC.

> (because oom situation in the gfp() case is a completely random and
> statistical event, which might have no connection at all to the behavior
> of that given process.)

I agree we should have more information about the behaviour of the system
and I think a per-task page fault rate should work in practice.

But my question isn't what you do when you're OOM, but is _how_ do you
notice that you're OOM?

In the GFP_USER case simply checking when GFP fails looks right to me.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 15:01                                               ` Andrea Arcangeli
@ 2000-09-25 15:10                                                 ` Ingo Molnar
  2000-09-25 15:24                                                   ` Andrea Arcangeli
  2000-09-26 19:10                                                 ` Pavel Machek
  1 sibling, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 15:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> Signal can be trapped and ignored by malicious task. [...]

a SIGKILL? i agree with the 2.2 solution - first a soft signal, and if
it's being ignored then a SIGKILL.

> But my question isn't what you do when you're OOM, but is _how_ do you
> notice that you're OOM?

good question :-)

> In the GFP_USER case simply checking when GFP fails looks right to me.

i think the GFP_USER case should do the oom logic within __alloc_pages(),
by SIGTERM/SIGKILL-ing off abusive processes. Ie. it's *still* an infinite
loop (barring the case where *this* process is abusive, but thats a
detail).
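
The "oom logic inside the allocator" idea can be sketched as a toy user-space
model. Nothing below is kernel code: struct task_model, struct system_model,
pick_victim() and alloc_page_or_kill() are hypothetical names, and choosing the
task that holds the most pages as the victim is only an assumption made for the
illustration. The loop retries until a page is available and gives up only when
the caller itself would be the victim.

```c
#include <stddef.h>

/* Toy stand-in for a process: how many pages it holds, and whether it lives. */
struct task_model {
	unsigned long rss_pages;
	int alive;
};

/* Toy stand-in for the machine: a free-page counter plus the task table. */
struct system_model {
	unsigned long free_pages;
	struct task_model *tasks;
	size_t ntasks;
};

/* Assumed policy: the live task holding the most pages; NULL if none holds any. */
static struct task_model *pick_victim(struct system_model *s)
{
	struct task_model *victim = NULL;

	for (size_t i = 0; i < s->ntasks; i++) {
		struct task_model *t = &s->tasks[i];

		if (!t->alive || t->rss_pages == 0)
			continue;
		if (victim == NULL || t->rss_pages > victim->rss_pages)
			victim = t;
	}
	return victim;
}

/*
 * Allocate one page for 'caller'. Instead of returning failure when memory
 * is exhausted, kill a victim, reclaim its pages and retry. Returns 0 on
 * success, -1 only when no victim exists or the caller is the victim.
 */
static int alloc_page_or_kill(struct system_model *s, struct task_model *caller)
{
	for (;;) {
		if (s->free_pages > 0) {
			s->free_pages--;
			caller->rss_pages++;
			return 0;
		}

		struct task_model *victim = pick_victim(s);

		if (victim == NULL || victim == caller)
			return -1;

		/* Models the SIGTERM/SIGKILL step: the victim's pages come back. */
		victim->alive = 0;
		s->free_pages += victim->rss_pages;
		victim->rss_pages = 0;
	}
}
```

In this model a caller never sees a failed allocation because of somebody else's
memory usage; the failure is delivered to the hog instead.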

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 15:10                                                 ` Ingo Molnar
@ 2000-09-25 15:24                                                   ` Andrea Arcangeli
  2000-09-25 15:26                                                     ` Ingo Molnar
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 15:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 05:10:43PM +0200, Ingo Molnar wrote:
> a SIGKILL? i agree with the 2.2 solution - first a soft signal, and if
> it's being ignored then a SIGKILL.

Actually we try the soft signal (SIGTERM) only if the task was running
with iopl privileges (and that means on alpha and other archs where
there isn't the iopl we send a SIGKILL to X immediately).

Extending it to all tasks looked like a somewhat risky solution, because then
we would have even less probability of killing the right task, since all tasks
would run oom while the first is put to sleep for a while. With X we really
prefer to kill another task rather than screw up the console (even when X is
the real hog, and X can be made the real hog by any task that allocates huge
xshm). Kray reproduces this easily.

> > But my question isn't what you do when you're OOM, but is _how_ do you
> > notice that you're OOM?
> 
> good question :-)

:))

> i think the GFP_USER case should do the oom logic within __alloc_pages(),

What's the difference from implementing the logic outside alloc_pages? Putting
the logic inside doesn't look like clean design to me.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 15:24                                                   ` Andrea Arcangeli
@ 2000-09-25 15:26                                                     ` Ingo Molnar
  2000-09-25 15:22                                                       ` yodaiken
  0 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 15:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > i think the GFP_USER case should do the oom logic within __alloc_pages(),
> 
> What's the difference of implementing the logic outside alloc_pages?
> Putting the logic inside looks not clean design to me.

it gives consistency and simplicity. The allocators themselves do not have
to care about oom.

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 15:26                                                     ` Ingo Molnar
@ 2000-09-25 15:22                                                       ` yodaiken
  0 siblings, 0 replies; 243+ messages in thread
From: yodaiken @ 2000-09-25 15:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 05:26:59PM +0200, Ingo Molnar wrote:
> 
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> 
> > > i think the GFP_USER case should do the oom logic within __alloc_pages(),
> > 
> > What's the difference of implementing the logic outside alloc_pages?
> > Putting the logic inside looks not clean design to me.
> 
> it gives consistency and simplicity. The allocators themselves do not have
> to care about oom.


There are many cases where it is simple to do:
        
          if( alloc(r1) == fail) goto freeall
          if( alloc(r2) == fail) goto freeall
          if( alloc(r3) == fail) goto freeall

And the alloc functions don't know how to "freeall".

Perhaps it would be good to do an alloc_vec allocation in these cases.
      alloc_vec[0].size = n;
      ..
      alloc_vec[n].size = 0;
      if(kmalloc_all(alloc_vec) == FAIL) return -ENOMEM;
      else  alloc_vec[i].ptr is the pointer.
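
The alloc_vec idea above could be sketched in user space roughly as follows.
This is only an illustration under stated assumptions: struct alloc_req,
alloc_all() and free_all() are hypothetical names, malloc()/free() stand in for
kmalloc()/kfree(), and the vector is terminated by an entry of size 0 as in the
pseudocode.

```c
#include <stddef.h>
#include <stdlib.h>

/* One request in the vector; an entry with size 0 terminates the list. */
struct alloc_req {
	size_t size;
	void *ptr;
};

/* Release every buffer in the vector and clear the pointers. */
static void free_all(struct alloc_req *vec)
{
	for (size_t i = 0; vec[i].size != 0; i++) {
		free(vec[i].ptr);	/* free(NULL) is a no-op */
		vec[i].ptr = NULL;
	}
}

/*
 * All-or-nothing allocation: returns 0 when every request was satisfied,
 * -1 when any request failed (in which case nothing stays allocated).
 */
static int alloc_all(struct alloc_req *vec)
{
	/* Start from a known state so free_all() is safe on partial failure. */
	for (size_t i = 0; vec[i].size != 0; i++)
		vec[i].ptr = NULL;

	for (size_t i = 0; vec[i].size != 0; i++) {
		vec[i].ptr = malloc(vec[i].size);
		if (vec[i].ptr == NULL) {
			free_all(vec);
			return -1;
		}
	}
	return 0;
}
```

A caller fills in the sizes, calls alloc_all() once, and returns its own error
code (for example -ENOMEM) when the call reports failure; the "freeall" step
then lives in one place instead of in every caller.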




-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 15:01                                               ` Andrea Arcangeli
  2000-09-25 15:10                                                 ` Ingo Molnar
@ 2000-09-26 19:10                                                 ` Pavel Machek
  2000-09-26 20:16                                                   ` Andrea Arcangeli
  2000-09-27  7:42                                                   ` Ingo Molnar
  1 sibling, 2 replies; 243+ messages in thread
From: Pavel Machek @ 2000-09-26 19:10 UTC (permalink / raw)
  To: Andrea Arcangeli, Ingo Molnar
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

Hi!
> > i talked about GFP_KERNEL, not GFP_USER. Even in the case of GFP_USER i
> 
> My bad, you're right I was talking about GFP_USER indeed.
> 
> But even GFP_KERNEL allocations like the init of a module or any other thing
> that is static sized during production just checking the retval
> looks be ok.

Okay, I'm user on small machine and I'm doing stupid thing: I've got
6MB ram, and I keep inserting modules. I insert module_1mb.o. Then I
insert module_1mb.o. Repeat. How does it end? I think that
kmalloc(GFP_KERNEL) *has* to return NULL at some point. 

Killing apps is not a solution: if my insmoder is smaller than the module
I'm trying to insert, and it happens to be the only process, you just
will not be able to kmalloc(sizeof(module), GFP_KERNEL). Will you
panic at the end?

								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-26 19:10                                                 ` Pavel Machek
@ 2000-09-26 20:16                                                   ` Andrea Arcangeli
  2000-09-27  7:42                                                   ` Ingo Molnar
  1 sibling, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-26 20:16 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Tue, Sep 26, 2000 at 09:10:16PM +0200, Pavel Machek wrote:
> Hi!
> > > i talked about GFP_KERNEL, not GFP_USER. Even in the case of GFP_USER i
> > 
> > My bad, you're right I was talking about GFP_USER indeed.
> > 
> > But even GFP_KERNEL allocations like the init of a module or any other thing
> > that is static sized during production just checking the retval
> > looks be ok.
> 
> Okay, I'm user on small machine and I'm doing stupid thing: I've got
> 6MB ram, and I keep inserting modules. I insert module_1mb.o. Then I
> insert module_1mb.o. Repeat. How does it end? I think that
> kmalloc(GFP_KERNEL) *has* to return NULL at some point. 

I agree, and that's what I have said from the start. GFP_KERNEL must return
NULL when the system is truly out of memory or the kernel will deadlock at that
time. In the sentence you quoted I meant that both GFP_USER and most GFP_KERNEL
callers could simply keep checking the retval, even in the long term, and be
correct (checking for NULL, which in turn means GFP_KERNEL _will_ return NULL
eventually).

There's no need for special resource accounting for many statically sized data
structures in the kernel (this accounting is necessary only for some of the
dynamic things that grow and shrink during production and that can't be
reclaimed synchronously when memory goes low by blocking in the allocator, like
pagetables, skbs on gbit ethernet and other things).

Not sure if in the end we'll also need to account the static parts to
get the dynamic part right.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-26 19:10                                                 ` Pavel Machek
  2000-09-26 20:16                                                   ` Andrea Arcangeli
@ 2000-09-27  7:42                                                   ` Ingo Molnar
  2000-09-27 12:11                                                     ` yodaiken
  2000-09-27 14:08                                                     ` Andrea Arcangeli
  1 sibling, 2 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-27  7:42 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Tue, 26 Sep 2000, Pavel Machek wrote:

> Okay, I'm user on small machine and I'm doing stupid thing: I've got
> 6MB ram, and I keep inserting modules. I insert module_1mb.o. Then I
> insert module_1mb.o. Repeat. How does it end? I think that
> kmalloc(GFP_KERNEL) *has* to return NULL at some point.

if a stupid root user keeps inserting bogus modules :-) then thats a
problem, no matter what. I can DoS your system if given the right to
insert arbitrary size modules, even if kmalloc returns NULL. For such
things explicit highlevel protection is needed - completely independently
of the VM allocation issues. Returning NULL in kmalloc() is just a way to
say: 'oops, we screwed up somewhere'. And i'd suggest to not work around
such screwups by checking for NULL and trying to handle it. I suggest to
rather fix those screwups.

the __GFP_SOFT suggestion handles these things nicely.

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-27  7:42                                                   ` Ingo Molnar
@ 2000-09-27 12:11                                                     ` yodaiken
  2000-09-27 14:08                                                     ` Andrea Arcangeli
  1 sibling, 0 replies; 243+ messages in thread
From: yodaiken @ 2000-09-27 12:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Wed, Sep 27, 2000 at 09:42:45AM +0200, Ingo Molnar wrote:
> 
> On Tue, 26 Sep 2000, Pavel Machek wrote:
> of the VM allocation issues. Returning NULL in kmalloc() is just a way to
> say: 'oops, we screwed up somewhere'. And i'd suggest to not work around

That is not at all how it is currently used in the kernel. 

> such screwups by checking for NULL and trying to handle it. I suggest to
> rather fix those screwups.

Kmalloc returns null when there is not enough memory to satisfy the request. What's
wrong with that?


-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-27  7:42                                                   ` Ingo Molnar
  2000-09-27 12:11                                                     ` yodaiken
@ 2000-09-27 14:08                                                     ` Andrea Arcangeli
  1 sibling, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-27 14:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Wed, Sep 27, 2000 at 09:42:45AM +0200, Ingo Molnar wrote:
> such screwups by checking for NULL and trying to handle it. I suggest to
> rather fix those screwups.

How do you know what the minimal amount of RAM is that keeps you out of
the screwed-up state?

We for sure need some kind of counter for the special dynamic structures but
I'm not sure if it should account for the static stuff as well.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 14:39                                           ` Andrea Arcangeli
  2000-09-25 14:43                                             ` Ingo Molnar
@ 2000-09-25 16:09                                             ` Rik van Riel
  1 sibling, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2000-09-25 16:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Marcelo Tosatti, Linus Torvalds, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 04:27:24PM +0200, Ingo Molnar wrote:
> > i think an application should not fail due to other applications
> > allocating too much RAM. OOM behavior should be a central thing and based
> 
> At least Linus's point is that doing perfect accounting (at
> least on the userspace allocation side) may cause you to waste
> resources, failing even if you could still run and I tend to
> agree with him. We're lazy on that side and that's global win in
> most cases.

OK, so do you guys want my OOM-killer selection code
in 2.4? ;)

(that will fix the OOM case in the rare situations
where it occurs and do the expected thing most of the
time)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 14:04                                   ` Andrea Arcangeli
  2000-09-25 14:04                                     ` Ingo Molnar
@ 2000-09-25 14:26                                     ` Marcelo Tosatti
  2000-09-25 14:50                                       ` Andrea Arcangeli
  1 sibling, 1 reply; 243+ messages in thread
From: Marcelo Tosatti @ 2000-09-25 14:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

<snip>

> I talked with Alexey about this and it seems the best way is to have a
> per-socket reservation of clean cache in function of the receive window.  So we
> don't need an huge atomic pool but we can have a special lru with an irq
> spinlock that is able to shrink cache from irq as well.

In the current 2.4 VM code, there is a kernel thread called
"kreclaimd".

This thread keeps freeing pages from the inactive clean list when needed
(when zone->free_pages < zone->pages_low), making them available for
atomic allocations.

Do you consider pages_low pages as a "huge atomic pool" ? 
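
The watermark check described above can be modelled in a few lines of user-space
C. This is a toy model, not the 2.4 kreclaimd code: struct zone_model and both
functions are hypothetical, the field names free_pages and pages_low follow the
text above, and inactive_clean_pages is an assumed counter standing in for the
inactive clean list.

```c
/* Toy model of a zone: only the counters that matter for the check. */
struct zone_model {
	unsigned long free_pages;		/* pages on the free list */
	unsigned long pages_low;		/* low watermark */
	unsigned long inactive_clean_pages;	/* reclaimable without I/O */
};

/* The condition from the text: free_pages has dropped below pages_low. */
static int zone_needs_reclaim(const struct zone_model *z)
{
	return z->free_pages < z->pages_low;
}

/*
 * Move clean pages to the free list until the watermark is met or the
 * clean list is empty. Returns the number of pages moved.
 */
static unsigned long reclaim_clean_pages(struct zone_model *z)
{
	unsigned long moved = 0;

	while (zone_needs_reclaim(z) && z->inactive_clean_pages > 0) {
		z->inactive_clean_pages--;
		z->free_pages++;
		moved++;
	}
	return moved;
}
```

In this model the reserve available to atomic allocations is bounded by
pages_low, and it can run dry if the clean list is exhausted before the
watermark is reached.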


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 14:26                                     ` Marcelo Tosatti
@ 2000-09-25 14:50                                       ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 14:50 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 11:26:48AM -0300, Marcelo Tosatti wrote:
> This thread keeps freeing pages from the inactive clean list when needed
> (when zone->free_pages < zone->pages_low), making them available for
> atomic allocations.

This is flawed. It's the irq that has to shrink the memory itself. It certainly
can't reschedule kreclaimd and wait for it to do the work.

Increasing the free_pages_min limit is the _only_ alternative to having
irqs that are able to shrink clean cache (and hopefully that "feature"
will be resurrected soon since it's the only way to go right now). 

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 13:12                             ` Ingo Molnar
  2000-09-25 13:30                               ` Andrea Arcangeli
@ 2000-09-25 14:47                               ` Alan Cox
  2000-09-25 15:16                                 ` Ingo Molnar
  2000-09-25 15:40                                 ` Stephen C. Tweedie
  1 sibling, 2 replies; 243+ messages in thread
From: Alan Cox @ 2000-09-25 14:47 UTC (permalink / raw)
  To: mingo
  Cc: Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

> > Because as you said the machine can lockup when you run out of memory.
> 
> well, i think all kernel-space allocations have to be limited carefully,
> denying succeeding allocations is not a solution against over-allocation,
> especially in a multi-user environment.

GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get everything
jammed in kernel space waiting on GFP_KERNEL and if the swapper cannot make
space you die.

The alternative approach where it cannot fail has to be at higher levels so
you can release other resources that might need freeing for deadlock avoidance
before you retry


Alan



^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 14:47                               ` Alan Cox
@ 2000-09-25 15:16                                 ` Ingo Molnar
  2000-09-25 15:16                                   ` the new VMt Alan Cox
  2000-09-25 15:48                                   ` the new VM Andrea Arcangeli
  2000-09-25 15:40                                 ` Stephen C. Tweedie
  1 sibling, 2 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 15:16 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Alan Cox wrote:

> GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get
> everything jammed in kernel space waiting on GFP_KERNEL and if the
> swapper cannot make space you die.

if one can get everything jammed waiting for GFP_KERNEL, and not being
able to deallocate anything, thats a VM or resource-limit bug. This
situation is just 1% RAM away from the 'root cannot log in', situation.

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 15:16                                 ` Ingo Molnar
@ 2000-09-25 15:16                                   ` Alan Cox
  2000-09-25 15:33                                     ` the new VM Ingo Molnar
                                                       ` (3 more replies)
  2000-09-25 15:48                                   ` the new VM Andrea Arcangeli
  1 sibling, 4 replies; 243+ messages in thread
From: Alan Cox @ 2000-09-25 15:16 UTC (permalink / raw)
  To: mingo
  Cc: Alan Cox, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

> > GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get
> > everything jammed in kernel space waiting on GFP_KERNEL and if the
> > swapper cannot make space you die.
> 
> if one can get everything jammed waiting for GFP_KERNEL, and not being
> able to deallocate anything, thats a VM or resource-limit bug. This
> situation is just 1% RAM away from the 'root cannot log in', situation.

Unless Im missing something here think about this case

2 active processes, no swap

#1					#2
kmalloc 32K				kmalloc 16K
OK					OK
kmalloc 16K				kmalloc 32K
block					block

so GFP_KERNEL has to be able to fail - it can wait for I/O in some cases with
care, but when we have no pages left something has to give
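
The interleaving above can be replayed with a toy pool in user-space C. The
sketch makes two assumptions that are not in the text: the pool holds exactly
48K, and only the byte count matters (no contiguity). struct pool, pool_alloc(),
pool_free() and replay_scenario() are hypothetical names. The point it shows is
that once both second requests find the pool empty, progress is only possible
if one of them is allowed to fail and unwind.

```c
#include <stddef.h>

/* Toy memory pool: only tracks how many bytes are still free. */
struct pool {
	size_t free_bytes;
};

/* Non-blocking allocation: 0 on success, -1 when the pool cannot satisfy it. */
static int pool_alloc(struct pool *p, size_t size)
{
	if (size > p->free_bytes)
		return -1;
	p->free_bytes -= size;
	return 0;
}

static void pool_free(struct pool *p, size_t size)
{
	p->free_bytes += size;
}

/*
 * Replays the two-process interleaving with an assumed 48K pool.
 * Returns 1 if process #2 can finish after #1 backs off, 0 otherwise.
 */
static int replay_scenario(void)
{
	struct pool p = { 48 * 1024 };

	if (pool_alloc(&p, 32 * 1024) != 0)	/* #1: kmalloc 32K -> OK */
		return 0;
	if (pool_alloc(&p, 16 * 1024) != 0)	/* #2: kmalloc 16K -> OK */
		return 0;

	/*
	 * The pool is now empty. With an allocator that may only block,
	 * both of the following requests would wait on each other forever.
	 */
	int p1_second = pool_alloc(&p, 16 * 1024);	/* #1: kmalloc 16K */
	int p2_second = pool_alloc(&p, 32 * 1024);	/* #2: kmalloc 32K */

	if (p1_second == 0 || p2_second == 0)
		return 0;

	/* #1 sees the failure, unwinds and releases its 32K ... */
	pool_free(&p, 32 * 1024);

	/* ... which lets #2 retry and complete. */
	return pool_alloc(&p, 32 * 1024) == 0;
}
```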



^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VM
  2000-09-25 15:16                                   ` the new VMt Alan Cox
@ 2000-09-25 15:33                                     ` Ingo Molnar
  2000-09-25 15:41                                     ` the new VMt Andrea Arcangeli
                                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 15:33 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Alan Cox wrote:

> Unless I'm missing something here, think about this case:
> 
> 2 active processes, no swap
> 
> #1					#2
> kmalloc 32K				kmalloc 16K
> OK					OK
> kmalloc 16K				kmalloc 32K
> block					block
> 
> so GFP_KERNEL has to be able to fail - it can wait for I/O in some
> cases with care, but when we have no pages left something has to give

you are right, i agree that synchronous OOM for higher-order allocations
must be preserved (just like ATOMIC allocations). But the overwhelming
majority of allocations is done at page granularity.

with multi-page allocations and the need for physically contiguous
buffers, the problem cannot be solved.

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 15:16                                   ` the new VMt Alan Cox
  2000-09-25 15:33                                     ` the new VM Ingo Molnar
@ 2000-09-25 15:41                                     ` Andrea Arcangeli
  2000-09-25 16:02                                       ` Ingo Molnar
  2000-09-25 15:42                                     ` Stephen C. Tweedie
  2000-09-25 16:16                                     ` Rik van Riel
  3 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 15:41 UTC (permalink / raw)
  To: Alan Cox
  Cc: mingo, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:16:56PM +0100, Alan Cox wrote:
> Unless I'm missing something here, think about this case:
> 
> 2 active processes, no swap
> 
> #1					#2
> kmalloc 32K				kmalloc 16K
> OK					OK
> kmalloc 16K				kmalloc 32K
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> block					block

Yep, you're not missing anything. That was my complaint: GFP_KERNEL
not failing will obviously deadlock the kernel all over the place.

Ingo's point is that the underlined line won't ever happen in the first place
because of the resource accounting that will tell the upper layer that they
can't try to allocate anything, so they won't enter kmalloc at all. But he's
obviously not talking about 2.4.x. (and I'm not sure if that's the right
way to go in the general case but certainly it's the right way to go for
special cases like skbs with gigabit ethernet)

In 2.4.x GFP_KERNEL not failing is a deadlock as you said.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 15:41                                     ` the new VMt Andrea Arcangeli
@ 2000-09-25 16:02                                       ` Ingo Molnar
  2000-09-25 16:04                                         ` Andi Kleen
                                                           ` (2 more replies)
  0 siblings, 3 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 16:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> Ingo's point is that the underlined line won't ever happen in the
> first place

please dont misinterpret my point ...

Frankly, how often do we allocate multi-order pages? I've just made quick
statistics wrt. how allocation orders are distributed on a more or less
typical system:

	(ALLOC ORDER)
	0: 167081
	1: 850
	2: 16
	3: 25
	4: 0
	5: 1
	6: 0
	7: 2
	8: 13
	9: 5

ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
task-structure. The rest is 0.05%.
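
The percentages can be re-derived from the table with integer arithmetic.
The sketch below is an illustration only: order_counts[], count_range() and
share_hundredths() are names invented here, and the shares are truncated to
hundredths of a percent. Under those assumptions order 0 comes out at 99.45%
and order 1 at 0.50%, while orders 2 to 9 together come out nearer 0.04%
than 0.05%.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ORDERS 10

/* Allocation counts per order (0..9), copied from the table above. */
static const unsigned long long order_counts[NR_ORDERS] = {
    167081, 850, 16, 25, 0, 1, 0, 2, 13, 5
};

/* Sum of the counts for orders lo..hi inclusive. */
static unsigned long long count_range(size_t lo, size_t hi)
{
    unsigned long long sum = 0;

    assert(lo <= hi && hi < NR_ORDERS);
    for (size_t i = lo; i <= hi; i++)
        sum += order_counts[i];
    return sum;
}

/* Share of orders lo..hi in hundredths of a percent, truncated. */
static unsigned long long share_hundredths(size_t lo, size_t hi)
{
    return count_range(lo, hi) * 10000ULL / count_range(0, NR_ORDERS - 1);
}
```

For example, share_hundredths(0, 0) gives the single-page share and
share_hundredths(2, 9) the combined share of everything above order 1.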

i'm not talking about 4MB contiguous physical allocations having to
succeed on a 8MB box. I'm talking about 99% of the simple allocation
points not having to worry about a NULL pointer. (not checking for NULL is
one of the most common allocation-related bugs that hits low-RAM systems.)

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:02                                       ` Ingo Molnar
@ 2000-09-25 16:04                                         ` Andi Kleen
  2000-09-25 16:19                                           ` Ingo Molnar
  2000-09-25 16:11                                         ` Andrea Arcangeli
  2000-09-25 16:53                                         ` Alan Cox
  2 siblings, 1 reply; 243+ messages in thread
From: Andi Kleen @ 2000-09-25 16:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 06:02:18PM +0200, Ingo Molnar wrote:
> Frankly, how often do we allocate multi-order pages? I've just made quick
> statistics wrt. how allocation orders are distributed on a more or less
> typical system:
> 
> 	(ALLOC ORDER)
> 	0: 167081
> 	1: 850
> 	2: 16
> 	3: 25
> 	4: 0
> 	5: 1
> 	6: 0
> 	7: 2
> 	8: 13
> 	9: 5
> 
> ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
> task-structure. The rest is 0.05%.

An important exception in 2.2/2.4 is NFS with bigger rsize (will be fixed
in 2.5, but 2.4 does it this way). For an 8K r/wsize you need reliable 
(=GFP_ATOMIC) 16K allocations.  

Another thing I would worry about are ports with multiple user page sizes in 2.5.
Another ugly case is the x86-64 port which has 4K pages but may likely need
a 16K kernel stack due to the 64bit stack bloat.


-Andi

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:04                                         ` Andi Kleen
@ 2000-09-25 16:19                                           ` Ingo Molnar
  2000-09-25 16:18                                             ` Andi Kleen
  2000-09-25 16:28                                             ` Rik van Riel
  0 siblings, 2 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 16:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andi Kleen wrote:

> An important exception in 2.2/2.4 is NFS with bigger rsize (will be fixed
> in 2.5, but 2.4 does it this way). For an 8K r/wsize you need reliable 
> (=GFP_ATOMIC) 16K allocations.  

the discussion does not affect GFP_ATOMIC - GFP_ATOMIC allocators *must*
be prepared to handle occasional oom situations gracefully.

> Another thing I would worry about are ports with multiple user page
> sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> but may likely need a 16K kernel stack due to the 64bit stack bloat.

yep, but these cases are not affected, i think in the order != 0 case we
should return NULL if a certain number of iterations did not yield any
free page.
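
A rough user-space sketch of that policy, under loud assumptions: struct
fake_mm, fake_try_alloc(), fake_reclaim(), alloc_sketch() and the retry
budget MAX_TRIES are all hypothetical stand-ins, not the 2.4 page allocator.
The idea illustrated is only that order-0 requests keep looping through
reclaim, while order != 0 requests return NULL once the budget is used up.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRIES 4  /* hypothetical retry budget for order != 0 */

/* Fake memory state: allocation works once enough reclaim passes ran. */
struct fake_mm {
    unsigned int reclaims_needed;
    unsigned int reclaims_done;
};

static char fake_page;  /* dummy object standing in for a page */

static void *fake_try_alloc(struct fake_mm *mm)
{
    return mm->reclaims_done >= mm->reclaims_needed ? &fake_page : NULL;
}

static void fake_reclaim(struct fake_mm *mm)
{
    mm->reclaims_done++;
}

/* order == 0: retry until it works; order != 0: give up after MAX_TRIES. */
static void *alloc_sketch(struct fake_mm *mm, unsigned int order)
{
    unsigned int tries = 0;

    assert(mm != NULL);
    for (;;) {
        void *page = fake_try_alloc(mm);

        if (page)
            return page;
        if (order != 0 && ++tries >= MAX_TRIES)
            return NULL;  /* caller must handle the failure */
        fake_reclaim(mm);
    }
}

/* Returns 1 if the request succeeded, 0 if it returned NULL. */
static int demo(unsigned int order, unsigned int reclaims_needed)
{
    struct fake_mm mm = { reclaims_needed, 0 };

    return alloc_sketch(&mm, order) != NULL;
}
```

In this model demo(1, 10) fails because ten reclaim passes exceed the
budget, while demo(0, 10) simply keeps going until the page appears.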

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:19                                           ` Ingo Molnar
@ 2000-09-25 16:18                                             ` Andi Kleen
  2000-09-25 16:41                                               ` Andrea Arcangeli
  2000-09-25 20:23                                               ` Russell King
  2000-09-25 16:28                                             ` Rik van Riel
  1 sibling, 2 replies; 243+ messages in thread
From: Andi Kleen @ 2000-09-25 16:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrea Arcangeli, Alan Cox, Marcelo Tosatti,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, Sep 25, 2000 at 06:19:07PM +0200, Ingo Molnar wrote:
> > Another thing I would worry about are ports with multiple user page
> > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > but may likely need a 16K kernel stack due to the 64bit stack bloat.
> 
> yep, but these cases are not affected, i think in the order != 0 case we
> should return NULL if a certain number of iterations did not yield any
> free page.

Ok, that would just break fork()

-Andi

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:18                                             ` Andi Kleen
@ 2000-09-25 16:41                                               ` Andrea Arcangeli
  2000-09-25 16:35                                                 ` Linus Torvalds
  2000-09-25 20:23                                               ` Russell King
  1 sibling, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 16:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Alan Cox, Marcelo Tosatti, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 06:18:17PM +0200, Andi Kleen wrote:
> On Mon, Sep 25, 2000 at 06:19:07PM +0200, Ingo Molnar wrote:
> > > Another thing I would worry about are ports with multiple user page
> > > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > > but may likely need a 16K kernel stack due to the 64bit stack bloat.
> > 
> > yep, but these cases are not affected, i think in the order != 0 case we
> > should return NULL if a certain number of iterations did not yield any
> > free page.
> 
> Ok, that would just break fork()

Not sure if I have the whole context (I've not yet received Ingo's email
that you're replying to).

Currently we do a memory balancing pass independently of the order of the
allocation. Thus we don't do any iteration and the memory balancing
is completely order blind (unfortunately it's also zone blind, while
at least in 2.2.x the memory balancing knew which zone it had
to allocate memory from).

If Ingo suggested more iterations of memory balancing for those cases,
that should only make things better with respect to fragmentation.

But I'd much prefer to pass not only the classzone from allocator
to memory balancing, but _also_ the order of the allocation,
and then shrink_mmap will know it isn't worth freeing anything
that isn't contiguous at the order of the allocation that we need.

classzone hasn't reached this point yet.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:41                                               ` Andrea Arcangeli
@ 2000-09-25 16:35                                                 ` Linus Torvalds
  2000-09-25 16:41                                                   ` Rik van Riel
  2000-09-27  7:14                                                   ` Rusty Russell
  0 siblings, 2 replies; 243+ messages in thread
From: Linus Torvalds @ 2000-09-25 16:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andi Kleen, Ingo Molnar, Alan Cox, Marcelo Tosatti, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel


On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> 
> But I'd much prefer to pass not only the classzone from allocator
> to memory balancing, but _also_ the order of the allocation,
> and then shrink_mmap will know it isn't worth freeing anything
> that isn't contiguous at the order of the allocation that we need.

I suspect that the proper way to do this is to just make another gfp_flag,
which is basically another hint to the mm layer that we're doing a multi-
page allocation and that the MM layer should not try forever to handle it.

In fact, that's independent of whether it is a multi-page allocation or
not. It might be something like __GFP_SOFT - you could use it with single
pages too. 

Thinking about it, we do have it already. It's called !__GFP_HIGH, and it
is used by all the GFP_USER allocations.

		Linus


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:35                                                 ` Linus Torvalds
@ 2000-09-25 16:41                                                   ` Rik van Riel
  2000-09-25 16:49                                                     ` Linus Torvalds
  2000-09-27  7:14                                                   ` Rusty Russell
  1 sibling, 1 reply; 243+ messages in thread
From: Rik van Riel @ 2000-09-25 16:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Andi Kleen, Ingo Molnar, Alan Cox,
	Marcelo Tosatti, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Linus Torvalds wrote:
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > 
> > But I'd much prefer to pass not only the classzone from allocator
> > to memory balancing, but _also_ the order of the allocation,
> > and then shrink_mmap will know it isn't worth freeing anything
> > that isn't contiguous at the order of the allocation that we need.
> 
> I suspect that the proper way to do this is to just make another gfp_flag,
> which is basically another hint to the mm layer that we're doing a multi-
> page allocation and that the MM layer should not try forever to handle it.
> 
> In fact, that's independent of whether it is a multi-page
> allocation or not. It might be something like __GFP_SOFT - you
> could use it with single pages too.
> 
> Thinking about it, we do have it already. It's called
> !__GFP_HIGH, and it is used by all the GFP_USER allocations.

Hmm, I think these two are orthogonal.

__GFP_HIGH means that we are allowed to eat deeper into
the free list (maybe needed to avoid a deadlock freeing
pages)

__GFP_SOFT would mean "don't bother waiting for free pages",
which is something very different...

(I wouldn't want a user process to get killed simply because
kswapd is waiting for IO to finish on a swapout, in that case
we really do want to sleep for a while)
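
The distinction can be sketched as two independent bits. Everything here is
an assumption made for illustration: SK_GFP_HIGH and SK_GFP_SOFT are not the
kernel's definitions, decide() is a made-up helper, and halving the reserve
for "high" requests is an arbitrary choice. One bit changes how deep into
the free list a request may reach; the other changes what happens once that
limit is hit.

```c
#include <assert.h>

/* Illustrative bit values only; not the kernel's actual flag definitions. */
#define SK_GFP_HIGH 0x1u  /* may eat deeper into the free-page reserve */
#define SK_GFP_SOFT 0x2u  /* fail instead of waiting for pages to be freed */

enum outcome { ALLOC_OK, ALLOC_FAIL, ALLOC_SLEEP };

/*
 * free_pages: pages currently free; min_free: reserve that normal requests
 * must leave untouched. "High" requests may go down to half of it (assumed).
 */
static enum outcome decide(unsigned int flags, unsigned int free_pages,
                           unsigned int min_free)
{
    unsigned int floor = (flags & SK_GFP_HIGH) ? min_free / 2 : min_free;

    assert(min_free > 0);
    if (free_pages > floor)
        return ALLOC_OK;
    /* Below the floor: the soft bit alone picks between failing and sleeping. */
    return (flags & SK_GFP_SOFT) ? ALLOC_FAIL : ALLOC_SLEEP;
}
```

With 15 free pages and a reserve of 20, a plain request sleeps, a soft one
fails and a high one succeeds, which is the sense in which the two bits do
not overlap in this sketch.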

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:41                                                   ` Rik van Riel
@ 2000-09-25 16:49                                                     ` Linus Torvalds
  2000-09-25 17:03                                                       ` Ingo Molnar
  2000-09-25 17:15                                                       ` Andrea Arcangeli
  0 siblings, 2 replies; 243+ messages in thread
From: Linus Torvalds @ 2000-09-25 16:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Andi Kleen, Ingo Molnar, Alan Cox,
	Marcelo Tosatti, Roger Larsson, MM mailing list, linux-kernel


On Mon, 25 Sep 2000, Rik van Riel wrote:
> > 
> > Thinking about it, we do have it already. It's called
> > !__GFP_HIGH, and it is used by all the GFP_USER allocations.
> 
> Hmm, I think these two are orthogonal.
> 
> __GFP_HIGH means that we are allowed to eat deeper into
> the free list (maybe needed to avoid a deadlock freeing
> pages)
> 
> __GFP_SOFT would mean "don't bother waiting for free pages",
> which is something very different...

Yes, I'm inclined to agree. Or at least not disagree. I'm more arguing
that the order itself may not be the most interesting thing, and that I
don't think the balancing has to take the order of the allocation into
account - because it should be equivalent to just telling it that it's a soft
allocation (whether through the current !__GFP_HIGH or through a new
__GFP_SOFT with slightly different logic).

		Linus


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:49                                                     ` Linus Torvalds
@ 2000-09-25 17:03                                                       ` Ingo Molnar
  2000-09-25 17:17                                                         ` Andrea Arcangeli
  2000-09-25 17:15                                                       ` Andrea Arcangeli
  1 sibling, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 17:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Andrea Arcangeli, Andi Kleen, Alan Cox,
	Marcelo Tosatti, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Linus Torvalds wrote:

> Yes, I'm inclined to agree. Or at least not disagree. I'm more arguing
> that the order itself may not be the most interesting thing, and that
> I don't think the balancing has to take the order of the allocation
> into account - because it should be equivalent to just telling it that
> it's a soft allocation (whether through the current !__GFP_HIGH or
> through a new __GFP_SOFT with slightly different logic).

yep, and there is another problem with pure order-based distinction: if i
do kmalloc(5k), and write the code on Alpha and expect it to never fail,
shouldn't i expect this to never fail on x86 as well? Along with the fork()
failure. __GFP_SOFT solves this all very nicely - the *allocator* decides
what allocation policy to follow. Great!
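
The portability problem can be made concrete by computing the page order a
request would need. This is a sketch that assumes 8K pages on Alpha and 4K
pages on x86 and ignores how the slab allocator rounds sizes; order_for() is
a made-up helper, not a kernel function. Under those assumptions the same 5K
request is order 0 on one architecture and order 1 on the other, so a rule
keyed on order would treat it differently.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) >= bytes. */
static unsigned int order_for(size_t bytes, size_t page_size)
{
    unsigned int order = 0;
    size_t span = page_size;

    assert(bytes > 0 && page_size > 0);
    while (span < bytes) {
        span <<= 1;
        order++;
    }
    return order;
}
```

Here order_for(5 * 1024, 8192) is 0 while order_for(5 * 1024, 4096) is 1.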

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 17:03                                                       ` Ingo Molnar
@ 2000-09-25 17:17                                                         ` Andrea Arcangeli
  2000-09-25 17:10                                                           ` Rik van Riel
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 17:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Andi Kleen, Alan Cox,
	Marcelo Tosatti, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 07:03:46PM +0200, Ingo Molnar wrote:
> [..] __GFP_SOFT solves this all very nicely [..]

s/very nicely/throwing away lots of useful cache for no one good reason/

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 17:17                                                         ` Andrea Arcangeli
@ 2000-09-25 17:10                                                           ` Rik van Riel
  2000-09-25 17:27                                                             ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Rik van Riel @ 2000-09-25 17:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Linus Torvalds, Andi Kleen, Alan Cox,
	Marcelo Tosatti, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 07:03:46PM +0200, Ingo Molnar wrote:
> > [..] __GFP_SOFT solves this all very nicely [..]
> 
> s/very nicely/throwing away lots of useful cache for no one good reason/

Not really. We could fix this by making the page freeing
functions smarter and only free the pages we need.

I just don't know if this is worth it for 0.5% of the 
allocations (and furthermore, since we allocate the
1-page allocations directly from the cache when we're
low on free memory, fragmentation isn't as bad as it
used to be with the old VM).

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 17:10                                                           ` Rik van Riel
@ 2000-09-25 17:27                                                             ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 17:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Linus Torvalds, Andi Kleen, Alan Cox,
	Marcelo Tosatti, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 02:10:07PM -0300, Rik van Riel wrote:
> Not really. We could fix this by making the page freeing
> functions smarter and only free the pages we need.

That's what I proposed in the first place, in fact.

To free a large chunk of memory you may have to throw away lots of cache. We're
not freeing contiguous cache as we do in 2.2.x.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:49                                                     ` Linus Torvalds
  2000-09-25 17:03                                                       ` Ingo Molnar
@ 2000-09-25 17:15                                                       ` Andrea Arcangeli
  1 sibling, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 17:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Andi Kleen, Ingo Molnar, Alan Cox, Marcelo Tosatti,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 09:49:46AM -0700, Linus Torvalds wrote:
> [..] I
> don't think the balancing has to take the order of the allocation into
> account [..]

Why do you prefer to throw away most of the cache (potentially at fork time)
instead of freeing only the few contiguous bits that we need?

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:35                                                 ` Linus Torvalds
  2000-09-25 16:41                                                   ` Rik van Riel
@ 2000-09-27  7:14                                                   ` Rusty Russell
  1 sibling, 0 replies; 243+ messages in thread
From: Rusty Russell @ 2000-09-27  7:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Ingo Molnar, Alan Cox, Marcelo Tosatti, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

In message <Pine.LNX.4.10.10009250931570.1739-100000@penguin.transmeta.com> you
 write:
> I suspect that the proper way to do this is to just make another gfp_flag,
> which is basically another hint to the mm layer that we're doing a multi-
> page allocation and that the MM layer should not try forever to handle it.
> 
> In fact, that's independent of whether it is a multi-page allocation or
> not. It might be something like __GFP_SOFT - you could use it with single
> pages too. 

That'd be a lovely interface, now wouldn't it?

*yecch*

Please consider at least:

/* Never fails. */
#define trivial_kmalloc(s)	\
	 ((void *)((s) > PAGE_SIZE ? bad_size_##s : __kmalloc((s), GFP_KERNEL)))

/* Can fail */
#define kmalloc(s, pri) __kmalloc((s), (pri)|__GFP_SOFT)

Rusty.
--
Hacking time.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:18                                             ` Andi Kleen
  2000-09-25 16:41                                               ` Andrea Arcangeli
@ 2000-09-25 20:23                                               ` Russell King
  1 sibling, 0 replies; 243+ messages in thread
From: Russell King @ 2000-09-25 20:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, torvalds

Andi Kleen writes:
> On Mon, Sep 25, 2000 at 06:19:07PM +0200, Ingo Molnar wrote:
> > > Another thing I would worry about are ports with multiple user page
> > > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > > but may likely need a 16K kernel stack due to the 64bit stack bloat.
> > 
> > yep, but these cases are not affected, i think in the order != 0 case we
> > should return NULL if a certain number of iterations did not yield any
> > free page.
> 
> Ok, that would just break fork()

Especially so when, on the ARM, the first level page table is 16K, and the
page size is 4K.  Should Ingo's suggestion happen, we still need a way
of allocating 16K aligned chunks of memory for such stuff.

Just a small question... I thought we were discussing 2.4, not possible
features for 2.5?
   _____
  |_____| ------------------------------------------------- ---+---+-
  |   |         Russell King        rmk@arm.linux.org.uk      --- ---
  | | | | http://www.arm.linux.org.uk/personal/aboutme.html   /  /  |
  | +-+-+                                                     --- -+-
  /   |               THE developer of ARM Linux              |+| /|\
 /  | | |                                                     ---  |
    +-+-+ -------------------------------------------------  /\\\  |

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:19                                           ` Ingo Molnar
  2000-09-25 16:18                                             ` Andi Kleen
@ 2000-09-25 16:28                                             ` Rik van Riel
  1 sibling, 0 replies; 243+ messages in thread
From: Rik van Riel @ 2000-09-25 16:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Andrea Arcangeli, Alan Cox, Marcelo Tosatti,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Ingo Molnar wrote:
> On Mon, 25 Sep 2000, Andi Kleen wrote:
> 
> > Another thing I would worry about are ports with multiple user page
> > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > but may likely need a 16K kernel stack due to the 64bit stack bloat.
> 
> yep, but these cases are not affected, i think in the order != 0
> case we should return NULL if a certain number of iterations did
> not yield any free page.

Indeed. You're right here.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:02                                       ` Ingo Molnar
  2000-09-25 16:04                                         ` Andi Kleen
@ 2000-09-25 16:11                                         ` Andrea Arcangeli
  2000-09-25 16:22                                           ` Ingo Molnar
  2000-09-25 16:53                                         ` Alan Cox
  2 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 16:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 06:02:18PM +0200, Ingo Molnar wrote:
> Frankly, how often do we allocate multi-order pages? I've just made quick

The deadlock Alan pointed out can also happen with single-page allocations
if we put a loop in GFP_KERNEL in 2.4.x-current.

> ie. 99.45% of all allocations are single-page! 0.50% is the 8kb

You're right. That's why it's a waste to have so many orders in the
buddy allocator. Even more now that the hashtables should be allocated
with the bootmem allocator! :) Chuck saw the slowdown from increasing
the highest order allocation in his bench. But of course in 2.2.x we can't
avoid that.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:11                                         ` Andrea Arcangeli
@ 2000-09-25 16:22                                           ` Ingo Molnar
  2000-09-25 16:17                                             ` Alexander Viro
                                                               ` (2 more replies)
  0 siblings, 3 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
> 
> You're right. That's why it's a waste to have so many orders in the
> buddy allocator. [...]

yep, i agree. I'm not sure what the biggest allocation is, some drivers
might use megabytes of contiguous RAM?

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:22                                           ` Ingo Molnar
@ 2000-09-25 16:17                                             ` Alexander Viro
  2000-09-25 16:36                                               ` Jeff Garzik
  2000-09-25 16:57                                               ` Alan Cox
  2000-09-25 16:33                                             ` the new VMt Andrea Arcangeli
  2000-09-26  8:38                                             ` Jes Sorensen
  2 siblings, 2 replies; 243+ messages in thread
From: Alexander Viro @ 2000-09-25 16:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel


On Mon, 25 Sep 2000, Ingo Molnar wrote:

> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> 
> > > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
> > 
> > You're right. That's why it's a waste to have so many order in the
> > buddy allocator. [...]
> 
> yep, i agree. I'm not sure what the biggest allocation is, some drivers
> might use megabytes or contiguous RAM?

Stupidity has no limits...


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:17                                             ` Alexander Viro
@ 2000-09-25 16:36                                               ` Jeff Garzik
  2000-09-25 16:57                                               ` Alan Cox
  1 sibling, 0 replies; 243+ messages in thread
From: Jeff Garzik @ 2000-09-25 16:36 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Ingo Molnar, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Alexander Viro wrote:
> On Mon, 25 Sep 2000, Ingo Molnar wrote:
> > yep, i agree. I'm not sure what the biggest allocation is, some drivers
> > might use megabytes or contiguous RAM?

> Stupidity has no limits...

Blame the hardware designers... and give me my big allocations. :)

Sound drivers (not mine though, <g>) do stuff like

	order = 20; /* just a made-up high number */
	while ((order-- > 0) && (mem == NULL)) {
		mem = __get_free_pages (GFP_KERNEL, order);
	}
	/* use sound buffer 'mem' */

Older or modern, less-than-cool framegrabbers need tons of contiguous
memory too...

	Jeff
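
A minimal userspace sketch of the fallback pattern above, for anyone who wants to play with it outside the kernel. Everything named here is an assumption made for illustration: mock_get_free_pages() is a hypothetical stand-in for __get_free_pages() that simply refuses anything above SIM_MAX_ORDER, and SIM_PAGE_SIZE assumes 4 KB pages. Unlike the loop quoted above, it reports the order it actually obtained, so the caller knows how big the buffer is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SIM_PAGE_SIZE 4096UL  /* assumed page size */
#define SIM_MAX_ORDER 5       /* hypothetical: the mock refuses larger orders */

/* Hypothetical stand-in for __get_free_pages(); fails above SIM_MAX_ORDER. */
static void *mock_get_free_pages(unsigned int order)
{
    if (order > SIM_MAX_ORDER)
        return NULL;
    return malloc(SIM_PAGE_SIZE << order);
}

/*
 * Try orders from start_order down to 0 and return the first block that
 * can be allocated. On success, *got_order holds the order obtained.
 */
static void *alloc_largest(unsigned int start_order, unsigned int *got_order)
{
    unsigned int order = start_order;

    assert(got_order != NULL);
    for (;;) {
        void *mem = mock_get_free_pages(order);

        if (mem != NULL) {
            *got_order = order;
            return mem;
        }
        if (order == 0)
            return NULL;  /* even a single page failed */
        order--;
    }
}
```

Calling alloc_largest(19, &got) against this mock walks down from order 19 and settles on order 5, i.e. a 128K buffer with 4 KB pages. The real allocator would of course do reclaim work on each failed attempt, which is part of why this pattern is expensive.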




^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:17                                             ` Alexander Viro
  2000-09-25 16:36                                               ` Jeff Garzik
@ 2000-09-25 16:57                                               ` Alan Cox
  2000-09-25 17:01                                                 ` Alexander Viro
  1 sibling, 1 reply; 243+ messages in thread
From: Alan Cox @ 2000-09-25 16:57 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Ingo Molnar, Andrea Arcangeli, Alan Cox, Marcelo Tosatti,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

> > yep, i agree. I'm not sure what the biggest allocation is, some drivers
> > might use megabytes or contiguous RAM?
> 
> Stupidity has no limits...

Unfortunately it's frequently wired into the hardware to save a few cents on
scatter-gather logic.

We need 128K blocks for sound DMA buffers, and for most sound cards they need
to be linear (but not the newer ones, thankfully). Some video capture hardware
needs 4MB, but that needs to use bootmem (in 2.2 they use bigmem hacks).


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:57                                               ` Alan Cox
@ 2000-09-25 17:01                                                 ` Alexander Viro
  2000-09-25 17:06                                                   ` Alan Cox
  0 siblings, 1 reply; 243+ messages in thread
From: Alexander Viro @ 2000-09-25 17:01 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel


On Mon, 25 Sep 2000, Alan Cox wrote:

> > > yep, i agree. I'm not sure what the biggest allocation is, some drivers
> > > might use megabytes or contiguous RAM?
> > 
> > Stupidity has no limits...
> 
> Unfortunately its frequently wired into the hardware to save a few cents on
> scatter gather logic.

Since when did hardware folks become exempt from the rule above? 128K is
almost tolerable, but there were requests for 64 _mega_bytes...


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 17:01                                                 ` Alexander Viro
@ 2000-09-25 17:06                                                   ` Alan Cox
  2000-09-25 17:31                                                     ` Oliver Xymoron
  2000-09-25 19:03                                                     ` the new VMt [4MB+ blocks] Matti Aarnio
  0 siblings, 2 replies; 243+ messages in thread
From: Alan Cox @ 2000-09-25 17:06 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Alan Cox, Ingo Molnar, Andrea Arcangeli, Marcelo Tosatti,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

> > > Stupidity has no limits...
> > 
> > Unfortunately its frequently wired into the hardware to save a few cents on
> > scatter gather logic.
> 
> Since when hardware folks became exempt from the rule above? 128K is
> almost tolerable, there were requests for 64 _mega_bytes...

Most cheap ass PCI hardware is built on the basis that you can do linear 4MB
allocations. There is a reason for this. You can do that 4MB allocation on
NT or Windows 9x.



^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 17:06                                                   ` Alan Cox
@ 2000-09-25 17:31                                                     ` Oliver Xymoron
  2000-09-25 17:51                                                       ` Jeff Garzik
  2000-09-25 19:03                                                     ` the new VMt [4MB+ blocks] Matti Aarnio
  1 sibling, 1 reply; 243+ messages in thread
From: Oliver Xymoron @ 2000-09-25 17:31 UTC (permalink / raw)
  To: Alan Cox
  Cc: Alexander Viro, Ingo Molnar, Andrea Arcangeli, Marcelo Tosatti,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, 25 Sep 2000, Alan Cox wrote:

> > > > Stupidity has no limits...
> > > 
> > > Unfortunately its frequently wired into the hardware to save a few cents on
> > > scatter gather logic.
> > 
> > Since when hardware folks became exempt from the rule above? 128K is
> > almost tolerable, there were requests for 64 _mega_bytes...
> 
> Most cheap ass PCI hardware is built on the basis you can do linear 4Mb 
> allocations. There is a reason for this. You can do that 4Mb allocation on
> NT or Windows 9x

Sure about that? It's been a while, but I seem to recall NT enforcing a
scatter-gather framework on all drivers because it only gave them virtual
allocations. For the cheaper cards, the s-g was done by software issuing
single span requests to the card.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 17:31                                                     ` Oliver Xymoron
@ 2000-09-25 17:51                                                       ` Jeff Garzik
  0 siblings, 0 replies; 243+ messages in thread
From: Jeff Garzik @ 2000-09-25 17:51 UTC (permalink / raw)
  To: Oliver Xymoron; +Cc: MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Oliver Xymoron wrote:
> Sure about that? It's been a while, but I seem to recall NT enforcing a
> scatter-gather framework on all drivers because it only gave them virtual
> allocations. For the cheaper cards, the s-g was done by software issuing
> single span requests to the card.

The Matrox framegrabber guys use some API under NT to allocate
megabytes upon megabytes of contiguous memory for DMA.

	Jeff




^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt [4MB+ blocks]
  2000-09-25 17:06                                                   ` Alan Cox
  2000-09-25 17:31                                                     ` Oliver Xymoron
@ 2000-09-25 19:03                                                     ` Matti Aarnio
  2000-09-25 20:02                                                       ` Stephen Williams
  1 sibling, 1 reply; 243+ messages in thread
From: Matti Aarnio @ 2000-09-25 19:03 UTC (permalink / raw)
  To: Alan Cox; +Cc: MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 06:06:11PM +0100, Alan Cox wrote:
> > > > Stupidity has no limits...
> > > Unfortunately its frequently wired into the hardware to save a few cents on
> > > scatter gather logic.
> > 
> > Since when hardware folks became exempt from the rule above? 128K is
> > almost tolerable, there were requests for 64 _mega_bytes...
> 
> Most cheap ass PCI hardware is built on the basis you can do linear 4Mb 
> allocations. There is a reason for this. You can do that 4Mb allocation on
> NT or Windows 9x

	Sure, but intel processors have this neat 4 MB "super-page"
	feature in the MMU...  (as we all well know)

	Sometimes allocating such monster memory blocks could be supported,
	but it should not be expected to be *fast*.  E.g. if doing it in a
	"reliable" way means moving currently allocated pages out of the
	way to create such a hole, so be it.


	Anybody here who can describe those M$ API calls ?
	Are they kernel/DDK-only, or userspace ones, or both ?

/Matti Aarnio

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt [4MB+ blocks]
  2000-09-25 19:03                                                     ` the new VMt [4MB+ blocks] Matti Aarnio
@ 2000-09-25 20:02                                                       ` Stephen Williams
  0 siblings, 0 replies; 243+ messages in thread
From: Stephen Williams @ 2000-09-25 20:02 UTC (permalink / raw)
  To: Matti Aarnio; +Cc: Alan Cox, MM mailing list, linux-kernel

matti.aarnio@zmailer.org said:
> Sometimes allocating such monster memory blocks could be supported,
> 	but it should not be expected to be *fast*.  E.g. if doing it in
> 	"reliable" way needs possibly moving currently allocated pages
> 	away from memory to create such a hole(s), so be it.


matti.aarnio@zmailer.org said:
> Anybody here who can describe those M$ API calls ?
> 	Are they kernel/DDK-only, or userspace ones, or both ?

NT does indeed support allocating contiguous buffers of memory, which is
useful when the hardware in question doesn't do scatter-gather. I have
on occasion been compelled to use these routines. (Paradoxically, the
requirements in my case came from broken NT mmap support and not from the
hardware. Blech!)

Anyhow, these routines are indeed slow. And judging by the amount of disk
noise I hear when they are called, they do try to kick out pages to make
an allocation work. However, even so the M$ calls will eventually fail due
to lack of large enough holes, so fragmentation takes its toll.

So, they are both slow and unreliable under NT. But drivers that use them
tend to be loaded once at boot time, and that's it.
-- 
Steve Williams                "The woods are lovely, dark and deep.
steve@icarus.com              But I have promises to keep,
steve@picturel.com            and lines to code before I sleep,
http://www.picturel.com       And lines to code before I sleep."



^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:22                                           ` Ingo Molnar
  2000-09-25 16:17                                             ` Alexander Viro
@ 2000-09-25 16:33                                             ` Andrea Arcangeli
  2000-09-26  8:38                                             ` Jes Sorensen
  2 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 16:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 06:22:42PM +0200, Ingo Molnar wrote:
> yep, i agree. I'm not sure what the biggest allocation is, some drivers
> might use megabytes or contiguous RAM?

I'm not sure (we should grep all the drivers to be sure...) but I bet the old
2.2.0 MAX_ORDER #define will work for everything.

The fact is that over a certain order there's no hope anyway at runtime
and the only big allocations done through the init sequence are for
the hashtable.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:22                                           ` Ingo Molnar
  2000-09-25 16:17                                             ` Alexander Viro
  2000-09-25 16:33                                             ` the new VMt Andrea Arcangeli
@ 2000-09-26  8:38                                             ` Jes Sorensen
  2000-09-26  8:52                                               ` Ingo Molnar
  2 siblings, 1 reply; 243+ messages in thread
From: Jes Sorensen @ 2000-09-26  8:38 UTC (permalink / raw)
  To: mingo
  Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

>>>>> "Ingo" == Ingo Molnar <mingo@elte.hu> writes:

Ingo> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

>> > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
>> 
>> You're right. That's why it's a waste to have so many order in the
>> buddy allocator. [...]

Ingo> yep, i agree. I'm not sure what the biggest allocation is, some
Ingo> drivers might use megabytes or contiguous RAM?

9.5KB blocks are common for people running Gigabit Ethernet with Jumbo
frames at least.

Jes

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-26  8:38                                             ` Jes Sorensen
@ 2000-09-26  8:52                                               ` Ingo Molnar
  2000-09-26  9:02                                                 ` Jes Sorensen
  0 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-26  8:52 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On 26 Sep 2000, Jes Sorensen wrote:

> 9.5KB blocks is common for people running Gigabit Ethernet with Jumbo
> frames at least.

yep, although this is more of a Linux limitation, the cards themselves are
happy to DMA fragmented buffers as well. (sans some small penalty per new
fragment.)

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-26  8:52                                               ` Ingo Molnar
@ 2000-09-26  9:02                                                 ` Jes Sorensen
  0 siblings, 0 replies; 243+ messages in thread
From: Jes Sorensen @ 2000-09-26  9:02 UTC (permalink / raw)
  To: mingo
  Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

>>>>> "Ingo" == Ingo Molnar <mingo@elte.hu> writes:

Ingo> On 26 Sep 2000, Jes Sorensen wrote:

>> 9.5KB blocks is common for people running Gigabit Ethernet with
>> Jumbo frames at least.

Ingo> yep, although this is more of a Linux limitation, the cards
Ingo> themselves are happy to DMA fragmented buffers as well. (sans
Ingo> some small penalty per new fragment.)

Hence the reason I have been pushing for the kiobufifying of the skbs ;-)
It's even more important for HIPPI with the 65280 bytes MTU.

Jes
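
To put numbers on the sizes mentioned in this subthread, here is a small sketch of how a request size maps onto a buddy-allocator order. It assumes 4 KB pages (SIM_PAGE_SIZE) and counts only the sizes quoted above, not any header overhead a real skb would add; size_to_order() and order_waste() are illustrative helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_PAGE_SIZE 4096UL  /* assumed 4 KB pages */

/* Smallest order such that (SIM_PAGE_SIZE << order) >= size. */
static unsigned int size_to_order(size_t size)
{
    unsigned int order = 0;
    size_t block = SIM_PAGE_SIZE;

    assert(size > 0);
    while (block < size) {
        block <<= 1;
        order++;
    }
    return order;
}

/* Bytes left unused when size is rounded up to a power-of-two block. */
static size_t order_waste(size_t size)
{
    return (SIM_PAGE_SIZE << size_to_order(size)) - size;
}
```

Under these assumptions a 9.5KB jumbo-frame buffer (9728 bytes) needs an order-2 block (16K, with 6656 bytes unused), a 65280-byte HIPPI payload needs order 4, a 128K sound buffer needs order 5 and a 4MB block needs order 10. That rounding is one reason fragmented (scatter-gather) buffers look attractive for the network cases.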

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:02                                       ` Ingo Molnar
  2000-09-25 16:04                                         ` Andi Kleen
  2000-09-25 16:11                                         ` Andrea Arcangeli
@ 2000-09-25 16:53                                         ` Alan Cox
  2 siblings, 0 replies; 243+ messages in thread
From: Alan Cox @ 2000-09-25 16:53 UTC (permalink / raw)
  To: mingo
  Cc: Andrea Arcangeli, Alan Cox, Marcelo Tosatti, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

> Frankly, how often do we allocate multi-order pages? I've just made quick
> statistics wrt. how allocation orders are distributed on a more or less
> typical system:

Often enough that such failures crashed older 2.2 kernels: the TCP code
ended up looping, trying to get memory, and the slab allocator couldn't get
a new multipage block.

Alan


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 15:16                                   ` the new VMt Alan Cox
  2000-09-25 15:33                                     ` the new VM Ingo Molnar
  2000-09-25 15:41                                     ` the new VMt Andrea Arcangeli
@ 2000-09-25 15:42                                     ` Stephen C. Tweedie
  2000-09-25 16:05                                       ` Andrea Arcangeli
                                                         ` (2 more replies)
  2000-09-25 16:16                                     ` Rik van Riel
  3 siblings, 3 replies; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 15:42 UTC (permalink / raw)
  To: Alan Cox
  Cc: mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 04:16:56PM +0100, Alan Cox wrote:
> 
> Unless Im missing something here think about this case
> 
> 2 active processes, no swap
> 
> #1					#2
> kmalloc 32K				kmalloc 16K
> OK					OK
> kmalloc 16K				kmalloc 32K
> block					block
> 

... and we get two wakeup_kswapd()s.  kswapd has PF_MEMALLOC and so is
able to eat memory which processes #1 and #2 are not allowed to touch.
Progress is made, clean pages are discarded and dirty ones queued for
write, memory becomes free again and the world is a better place.

Or so goes the theory, at least.

--Stephen

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 15:42                                     ` Stephen C. Tweedie
@ 2000-09-25 16:05                                       ` Andrea Arcangeli
  2000-09-25 16:22                                         ` Rik van Riel
  2000-09-25 17:39                                         ` Stephen C. Tweedie
  2000-09-25 16:51                                       ` Alan Cox
  2000-09-25 16:52                                       ` yodaiken
  2 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 16:05 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, mingo, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> Progress is made, clean pages are discarded and dirty ones queued for

How can you make progress if there isn't swap available and all the
freeable page/buffer cache has already been freed? The deadlock happens
in the OOM condition (not when we can make progress).

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:05                                       ` Andrea Arcangeli
@ 2000-09-25 16:22                                         ` Rik van Riel
  2000-09-25 16:42                                           ` Andrea Arcangeli
  2000-09-25 17:39                                         ` Stephen C. Tweedie
  1 sibling, 1 reply; 243+ messages in thread
From: Rik van Riel @ 2000-09-25 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Alan Cox, mingo, Marcelo Tosatti,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> > Progress is made, clean pages are discarded and dirty ones queued for
> 
> How can you make progress if there isn't swap avaiable and all the
> freeable page/buffer cache is just been freed? The deadlock happens
> in OOM condition (not when we can make progress).

This is exactly why integrating the OOM killer is on
my TODO list.

The important difference between the new VM and the
old one is that we can't fail while we are not OOM,
whereas the old allocator could break down even when
we still had enough swap free....

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:22                                         ` Rik van Riel
@ 2000-09-25 16:42                                           ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 16:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stephen C. Tweedie, Alan Cox, mingo, Marcelo Tosatti,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 01:22:40PM -0300, Rik van Riel wrote:
> whereas the old allocator could break down even when
> we still had enough swap free....

As far as I can see, that's a bug that you hid by introducing a deadlock.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:05                                       ` Andrea Arcangeli
  2000-09-25 16:22                                         ` Rik van Riel
@ 2000-09-25 17:39                                         ` Stephen C. Tweedie
  1 sibling, 0 replies; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 17:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Alan Cox, mingo, Marcelo Tosatti,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

Hi,

On Mon, Sep 25, 2000 at 06:05:00PM +0200, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> > Progress is made, clean pages are discarded and dirty ones queued for
> 
> How can you make progress if there isn't swap avaiable and all the
> freeable page/buffer cache is just been freed? The deadlock happens
> in OOM condition (not when we can make progress).

Agreed --- this assumes that all pinned, nonswappable pages are
subject to resource limiting to prevent them from exhausting the whole
of memory.  For things like page tables, that means we need a
beancounter in place for us to be 100% safe.  For the no-swap case,
that requires an OOM killer.

The problem of avoiding filling memory with pinned pages is orthogonal
to the problem of managing the unpinned memory.  Both are obviously
required for a stable system.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 15:42                                     ` Stephen C. Tweedie
  2000-09-25 16:05                                       ` Andrea Arcangeli
@ 2000-09-25 16:51                                       ` Alan Cox
  2000-09-25 17:43                                         ` Stephen C. Tweedie
  2000-09-25 16:52                                       ` yodaiken
  2 siblings, 1 reply; 243+ messages in thread
From: Alan Cox @ 2000-09-25 16:51 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

> > 2 active processes, no swap
> > 
> > #1					#2
> > kmalloc 32K				kmalloc 16K
> > OK					OK
> > kmalloc 16K				kmalloc 32K
> > block					block
> > 
> 
> ... and we get two wakeup_kswapd()s.  kswapd has PF_MEMALLOC and so is
> able to eat memory which processes #1 and #2 are not allowed to touch.

'no swap'


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 16:51                                       ` Alan Cox
@ 2000-09-25 17:43                                         ` Stephen C. Tweedie
  2000-09-25 18:13                                           ` Alan Cox
  0 siblings, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 17:43 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stephen C. Tweedie, mingo, Andrea Arcangeli, Marcelo Tosatti,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

Hi,

On Mon, Sep 25, 2000 at 05:51:49PM +0100, Alan Cox wrote:
> > > 2 active processes, no swap
> > > 
> > > #1					#2
> > > kmalloc 32K				kmalloc 16K
> > > OK					OK
> > > kmalloc 16K				kmalloc 32K
> > > block					block
> > > 
> > 
> > ... and we get two wakeup_kswapd()s.  kswapd has PF_MEMALLOC and so is
> > able to eat memory which processes #1 and #2 are not allowed to touch.
> 
> 'no swap'

kswapd is perfectly capable of evicting clean pages and triggering any
necessary writeback of dirty filesystem data at this point, even if
there is no swap.  If there is truly nothing kswapd can do to recover
here, then we are truly OOM.  Otherwise, kswapd should be able to free
the required memory, providing that the PF_MEMALLOC flag allows it to
eat into a reserved set of free pages which nobody else can allocate
once the number of physical free pages drops below a certain threshold.

--Stephen 
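
The reserve idea described here can be sketched in a few lines of userspace C. This is only a toy model under stated assumptions: struct sim_pool, sim_alloc() and SIM_PF_MEMALLOC are hypothetical names modelled on the description above, not the kernel's actual data structures or allocation path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_PF_MEMALLOC 0x1u  /* hypothetical flag, modelled on PF_MEMALLOC */

struct sim_pool {
    unsigned long free_pages;  /* pages currently free */
    unsigned long reserve;     /* pages only SIM_PF_MEMALLOC callers may use */
};

/*
 * Take nr pages from the pool. Ordinary callers must leave `reserve`
 * pages untouched; SIM_PF_MEMALLOC callers may drain the pool to zero.
 * Returns true on success, false if the request would have to wait.
 */
static bool sim_alloc(struct sim_pool *pool, unsigned long nr,
                      unsigned int flags)
{
    unsigned long floor;

    assert(pool != NULL);
    floor = (flags & SIM_PF_MEMALLOC) ? 0 : pool->reserve;
    if (pool->free_pages < nr || pool->free_pages - nr < floor)
        return false;
    pool->free_pages -= nr;
    return true;
}
```

With 10 free pages and a reserve of 4, an ordinary caller can take 6 pages and is then refused, while a caller carrying the flag can still take pages out of the remaining 4. Whether that is enough to guarantee progress depends on the reclaimer never needing more than the reserve, which is exactly the point under dispute in this thread.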

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 17:43                                         ` Stephen C. Tweedie
@ 2000-09-25 18:13                                           ` Alan Cox
  2000-09-25 18:21                                             ` Stephen C. Tweedie
  0 siblings, 1 reply; 243+ messages in thread
From: Alan Cox @ 2000-09-25 18:13 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

> there is no swap.  If there is truly nothing kswapd can do to recover
> here, then we are truly OOM.  Otherwise, kswapd should be able to free

Indeed. But we won't fail the kmalloc with a NULL return.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 18:13                                           ` Alan Cox
@ 2000-09-25 18:21                                             ` Stephen C. Tweedie
  2000-09-25 19:09                                               ` Alan Cox
  0 siblings, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 18:21 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stephen C. Tweedie, mingo, Andrea Arcangeli, Marcelo Tosatti,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

Hi,

On Mon, Sep 25, 2000 at 07:13:27PM +0100, Alan Cox wrote:
> > there is no swap.  If there is truly nothing kswapd can do to recover
> > here, then we are truly OOM.  Otherwise, kswapd should be able to free
> 
> Indeed. But we wont fail the kmalloc with a NULL return

Isn't that the preferred behaviour, though?  If we are completely out
of VM on a no-swap machine, we should be killing one of the existing
processes rather than preventing any progress and keeping all of the
old tasks alive but deadlocked.

--Stephen


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 18:21                                             ` Stephen C. Tweedie
@ 2000-09-25 19:09                                               ` Alan Cox
  2000-09-25 19:21                                                 ` Stephen C. Tweedie
  0 siblings, 1 reply; 243+ messages in thread
From: Alan Cox @ 2000-09-25 19:09 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

> > Indeed. But we wont fail the kmalloc with a NULL return
> 
> Isn't that the preferred behaviour, though?  If we are completely out
> of VM on a no-swap machine, we should be killing one of the existing
> processes rather than preventing any progress and keeping all of the
> old tasks alive but deadlocked.

Unless I'm missing something, we won't kill any task in that condition - even
a SIGKILL will make no odds, as everyone is asleep in kmalloc.
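
The two-process case from earlier in the thread can be checked with a toy model. It assumes 48K of memory in total, nothing reclaimable, and that a task whose allocation fails unwinds and releases what it already holds; struct sim_task and both helpers are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sim_task {
    unsigned long held_kb;    /* memory the task already holds */
    unsigned long wanted_kb;  /* request it is sleeping on, 0 if none */
};

/* True if at least one sleeping request fits in the currently free memory. */
static bool any_can_proceed(const struct sim_task *tasks, size_t n,
                            unsigned long free_kb)
{
    assert(tasks != NULL);
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].wanted_kb != 0 && tasks[i].wanted_kb <= free_kb)
            return true;
    }
    return false;
}

/*
 * Model a failed allocation (NULL return) for task `victim`: the task
 * drops its request and releases what it holds. Returns the new free total.
 */
static unsigned long fail_and_release(struct sim_task *tasks, size_t victim,
                                      unsigned long free_kb)
{
    assert(tasks != NULL);
    free_kb += tasks[victim].held_kb;
    tasks[victim].held_kb = 0;
    tasks[victim].wanted_kb = 0;
    return free_kb;
}
```

With task #1 holding 32K and wanting 16K, task #2 holding 16K and wanting 32K, and nothing free, any_can_proceed() reports that neither sleeper can be woken. Failing #1's request and letting it unwind frees 32K, which is enough for #2. An allocator that never returns NULL has no such exit, and killing a task does not help if the task never leaves the allocator to notice the signal.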



^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 19:09                                               ` Alan Cox
@ 2000-09-25 19:21                                                 ` Stephen C. Tweedie
  0 siblings, 0 replies; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 19:21 UTC (permalink / raw)
  To: Alan Cox
  Cc: Stephen C. Tweedie, mingo, Andrea Arcangeli, Marcelo Tosatti,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

Hi,

On Mon, Sep 25, 2000 at 08:09:31PM +0100, Alan Cox wrote:
> > > Indeed. But we wont fail the kmalloc with a NULL return
> > 
> > Isn't that the preferred behaviour, though?  If we are completely out
> > of VM on a no-swap machine, we should be killing one of the existing
> > processes rather than preventing any progress and keeping all of the
> > old tasks alive but deadlocked.
> 
> Unless Im missing something we wont kill any task in that condition - even
> a SIGKILL will make no odds as everyone is asleep in kmalloc

Right.  Eeek.


* Re: the new VMt
  2000-09-25 15:42                                     ` Stephen C. Tweedie
  2000-09-25 16:05                                       ` Andrea Arcangeli
  2000-09-25 16:51                                       ` Alan Cox
@ 2000-09-25 16:52                                       ` yodaiken
  2000-09-25 17:18                                         ` Jamie Lokier
  2 siblings, 1 reply; 243+ messages in thread
From: yodaiken @ 2000-09-25 16:52 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti,
	Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Mon, Sep 25, 2000 at 04:16:56PM +0100, Alan Cox wrote:
> > 
> > Unless Im missing something here think about this case
> > 
> > 2 active processes, no swap
> > 
> > #1					#2
> > kmalloc 32K				kmalloc 16K
> > OK					OK
> > kmalloc 16K				kmalloc 32K
> > block					block
> > 
> 
> ... and we get two wakeup_kswapd()s.  kswapd has PF_MEMALLOC and so is
> able to eat memory which processes #1 and #2 are not allowed to touch.
> Progress is made, clean pages are discarded and dirty ones queued for
> write, memory becomes free again and the world is a better place.
> 
> Or so goes the theory, at least.

From fs/select.c:

        walk = out;
        while(nfds > 0) {
                poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
                if (!tmp) {
                        while(out != NULL) {
                                tmp = out->next;
                                free_page((unsigned long)out);
                                out = tmp;
                        }
                        return NULL;
                }
                tmp->nr = 0;
                tmp->entry = (struct poll_table_entry *)(tmp + 1);
                tmp->next = NULL;
                walk->next = tmp;
                walk = tmp;
                nfds -= __MAX_POLL_TABLE_ENTRIES;
        }


> 
> --Stephen
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> Please read the FAQ at http://www.tux.org/lkml/

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


* Re: the new VMt
  2000-09-25 16:52                                       ` yodaiken
@ 2000-09-25 17:18                                         ` Jamie Lokier
  2000-09-25 17:51                                           ` yodaiken
  0 siblings, 1 reply; 243+ messages in thread
From: Jamie Lokier @ 2000-09-25 17:18 UTC (permalink / raw)
  To: yodaiken
  Cc: Stephen C. Tweedie, Alan Cox, mingo, Andrea Arcangeli,
	Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

yodaiken@fsmlabs.com wrote:
>    walk = out;
>         while(nfds > 0) {
>                 poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
>                 if (!tmp) {

Shouldn't this be GFP_USER?  (Which would also conveniently fix the
problem Victor's pointing out...)

-- Jamie

* Re: the new VMt
  2000-09-25 17:18                                         ` Jamie Lokier
@ 2000-09-25 17:51                                           ` yodaiken
  2000-09-25 18:04                                             ` Jamie Lokier
  2000-09-25 18:20                                             ` Andrea Arcangeli
  0 siblings, 2 replies; 243+ messages in thread
From: yodaiken @ 2000-09-25 17:51 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: yodaiken, Stephen C. Tweedie, Alan Cox, mingo, Andrea Arcangeli,
	Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 07:18:29PM +0200, Jamie Lokier wrote:
> yodaiken@fsmlabs.com wrote:
> >    walk = out;
> >         while(nfds > 0) {
> >                 poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
> >                 if (!tmp) {
> 
> Shouldn't this be GFP_USER?  (Which would also conveniently fix the
> problem Victor's pointing out...)

It should probably be GFP_ATOMIC, if I understand the mm right. 

The algorithm for requesting a collection of resources and freeing all of them
on failure is simple, fast, and robust.
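
For illustration, a minimal user-space sketch of that all-or-nothing pattern,
in the spirit of the fs/select.c loop quoted above. The counted pool, the
helper names and the page size are assumptions made up for this example; it is
not kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096   /* assumed page size, illustration only */

/* Hypothetical stand-in for the page allocator: a counted pool. */
struct page_pool {
        size_t free_pages;
};

/* Take one page, or return NULL when the pool is exhausted. */
static void *pool_get_page(struct page_pool *pool)
{
        void *page;

        if (pool->free_pages == 0)
                return NULL;
        page = malloc(SKETCH_PAGE_SIZE);
        if (page != NULL)
                pool->free_pages--;
        return page;
}

/* Hand one page back to the pool. */
static void pool_put_page(struct page_pool *pool, void *page)
{
        assert(page != NULL);
        free(page);
        pool->free_pages++;
}

/*
 * Request n pages. If any single request fails, release everything
 * obtained so far and report failure, so that nothing stays pinned.
 * Returns 0 on success and -1 on failure.
 */
static int get_pages_or_fail(struct page_pool *pool, void **vec, size_t n)
{
        size_t i;

        for (i = 0; i < n; i++) {
                vec[i] = pool_get_page(pool);
                if (vec[i] == NULL) {
                        while (i > 0)
                                pool_put_page(pool, vec[--i]);
                        return -1;
                }
        }
        return 0;
}
```

With 19 pages in the pool, a request for 20 fails and leaves all 19 pages
available to other callers again.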


              

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


* Re: the new VMt
  2000-09-25 17:51                                           ` yodaiken
@ 2000-09-25 18:04                                             ` Jamie Lokier
  2000-09-25 18:13                                               ` yodaiken
  2000-09-25 18:20                                             ` Andrea Arcangeli
  1 sibling, 1 reply; 243+ messages in thread
From: Jamie Lokier @ 2000-09-25 18:04 UTC (permalink / raw)
  To: yodaiken
  Cc: Stephen C. Tweedie, Alan Cox, mingo, Andrea Arcangeli,
	Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

yodaiken@fsmlabs.com wrote:
> > yodaiken@fsmlabs.com wrote:
> > >    walk = out;
> > >         while(nfds > 0) {
> > >                 poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
> > >                 if (!tmp) {
> > 
> > Shouldn't this be GFP_USER?  (Which would also conveniently fix the
> > problem Victor's pointing out...)
> 
> It should probably be GFP_ATOMIC, if I understand the mm right. 

Definitely not.  GFP_ATOMIC is reserved for things that really can't
swap or schedule right now.  Use GFP_ATOMIC indiscriminately and you'll
have to increase the number of atomic-allocatable pages.
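
A toy model of why that is, offered only as a sketch: ordinary allocations
must leave a reserve untouched, while atomic ones may dip into it. The struct,
the field names and the single reserve threshold are assumptions for
illustration; the real allocator's logic is more involved.

```c
#include <assert.h>

/* Hypothetical zone: a free-page count plus a floor kept for atomic callers. */
struct zone_sketch {
        int free_pages;
        int atomic_reserve;
};

/*
 * Take one page. Ordinary callers must leave atomic_reserve pages
 * untouched (in the kernel they would sleep or reclaim instead);
 * atomic callers may dip into the reserve because they cannot wait.
 * Returns 0 on success, -1 if the request cannot be satisfied now.
 */
static int take_page(struct zone_sketch *z, int atomic)
{
        int floor = atomic ? 0 : z->atomic_reserve;

        assert(z->free_pages >= 0);
        if (z->free_pages <= floor)
                return -1;
        z->free_pages--;
        return 0;
}
```

If every caller passed atomic = 1, the reserve would drain as fast as ordinary
memory, and the callers that truly cannot wait would start failing.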

> The algorithm for requesting a collection of resources and freeing all
> of them on failure is simple, fast, and robust.

Allocation is just as fast with GFP_KERNEL/USER, just less likely to
fail and less likely to break something else that really needs
GFP_ATOMIC allocations.

-- Jamie

* Re: the new VMt
  2000-09-25 18:04                                             ` Jamie Lokier
@ 2000-09-25 18:13                                               ` yodaiken
  2000-09-25 18:24                                                 ` Stephen C. Tweedie
  0 siblings, 1 reply; 243+ messages in thread
From: yodaiken @ 2000-09-25 18:13 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: yodaiken, Stephen C. Tweedie, Alan Cox, mingo, Andrea Arcangeli,
	Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 08:04:54PM +0200, Jamie Lokier wrote:
> yodaiken@fsmlabs.com wrote:
> > > yodaiken@fsmlabs.com wrote:
> > > >    walk = out;
> > > >         while(nfds > 0) {
> > > >                 poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
> > > >                 if (!tmp) {
> > > 
> > > Shouldn't this be GFP_USER?  (Which would also conveniently fix the
> > > problem Victor's pointing out...)
> > 
> > It should probably be GFP_ATOMIC, if I understand the mm right. 
> 
> Definitely not.  GFP_ATOMIC is reserved for things that really can't
> swap or schedule right now.  Use GFP_ATOMIC indiscriminately and you'll
> have to increase the number of atomic-allocatable pages.

Processes 1, 2 and 3 all start allocating 20 pages:
      process 1 stalls after allocating 19
      some memory is freed and process 2 runs and stalls after allocating 19
      some memory is freed and process 3 runs and stalls after allocating 19

Now 57 pages are locked up in non-swappable kernel space and the system
deadlocks OOM.
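
The arithmetic can be checked with a small simulation. This is a sketch under
simplifying assumptions (a fixed pool, one page taken per turn, round-robin
scheduling, nobody else freeing memory), and all names are hypothetical. It
contrasts allocators that block while holding pages with allocators that get
an error return and release what they hold:

```c
#include <assert.h>

enum { NPROC = 3, NEED = 20 };
enum { RUNNING, COMPLETED, FAILED };

struct outcome {
        int completed;  /* processes that got all NEED pages */
        int failed;     /* processes that saw an error return */
        int pinned;     /* pages still held by stuck processes */
};

static struct outcome simulate(int pool, int fail_fast)
{
        int held[NPROC] = { 0 };
        int state[NPROC] = { RUNNING };
        struct outcome out = { 0, 0, 0 };
        int progress = 1;
        int p;

        assert(pool >= 0);
        while (progress) {
                progress = 0;
                for (p = 0; p < NPROC; p++) {
                        if (state[p] != RUNNING)
                                continue;
                        if (pool == 0) {
                                if (fail_fast) {
                                        /* error return: give everything back */
                                        pool += held[p];
                                        held[p] = 0;
                                        state[p] = FAILED;
                                        out.failed++;
                                        progress = 1;
                                }
                                continue;       /* blocking: hold and wait */
                        }
                        held[p]++;
                        pool--;
                        progress = 1;
                        if (held[p] == NEED) {
                                /* done with the pages: release them */
                                pool += held[p];
                                held[p] = 0;
                                state[p] = COMPLETED;
                                out.completed++;
                        }
                }
        }
        for (p = 0; p < NPROC; p++)
                out.pinned += held[p];
        return out;
}
```

Under these assumptions simulate(57, 0) ends with all 57 pages pinned and no
process finished, while simulate(57, 1) ends with one process seeing an error,
two completing, and nothing pinned.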

    
        
> > The algorithm for requesting a collection of resources and freeing all
> > of them on failure is simple, fast, and robust.
> 
> Allocation is just as fast with GFP_KERNEL/USER, just less likely to

It's not speed, it's deadlock avoidance. 

> fail and less likely to break something else that really needs
> GFP_ATOMIC allocations.

My point here is simply that error returns from memory allocation let
higher-level kernel operations marshal a collection of resources safely.
The algorithm is optimized for the case where there is no memory shortage,
and only falls back to the slow path when the system is already stalling
due to memory shortages anyway.



> 
> -- Jamie

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


* Re: the new VMt
  2000-09-25 18:13                                               ` yodaiken
@ 2000-09-25 18:24                                                 ` Stephen C. Tweedie
  2000-09-25 18:34                                                   ` yodaiken
  0 siblings, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 18:24 UTC (permalink / raw)
  To: yodaiken
  Cc: Jamie Lokier, Stephen C. Tweedie, Alan Cox, mingo,
	Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 12:13:15PM -0600, yodaiken@fsmlabs.com wrote:

> > Definitely not.  GFP_ATOMIC is reserved for things that really can't
> > swap or schedule right now.  Use GFP_ATOMIC indiscriminately and you'll
> > have to increase the number of atomic-allocatable pages.
> 
> Processes 1, 2 and 3 all start allocating 20 pages:
>       process 1 stalls after allocating 19
>       some memory is freed and process 2 runs and stalls after allocating 19
>       some memory is freed and process 3 runs and stalls after allocating 19
>
> Now 57 pages are locked up in non-swappable kernel space and the system
> deadlocks OOM.

Or go the beancounter route: process 1 asks "can I pin 20 pages", gets
told "yes", and goes allocating them, blocking as necessary until it
gets them.  Process 2 asks "can *I* pin 20 pages" and the answer is
either "not right now", in which case it waits for process 1 to
release its reservation, or "no, you've exceeded your user quota" in
which case it fails with ENOMEM.  (That latter case can protect us
against a lot of DoS attacks from local users.)

The same accounting really needs to be done for page tables, as that
represents one of the biggest sources of unaccounted, unswappable
pages which user processes can cause to be created right now.
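
As a rough sketch of the "can I pin 20 pages" question and its three possible
answers: the struct layout, the function names and the split into a
system-wide budget plus a per-user quota are assumptions for illustration
only, not the actual beancounter code.

```c
#include <assert.h>
#include <errno.h>

struct pin_budget {          /* system-wide limit on pinned pages */
        long limit;
        long reserved;
};

struct user_account {        /* hypothetical per-user quota */
        long quota;
        long reserved;
};

/*
 * Ask to pin n pages before allocating any of them.
 *   0        reservation granted
 *   -EAGAIN  "not right now": wait for another reservation to be released
 *   -ENOMEM  the user's own quota would be exceeded
 */
static int reserve_pinned(struct pin_budget *sys, struct user_account *u, long n)
{
        assert(n > 0);
        if (u->reserved + n > u->quota)
                return -ENOMEM;
        if (sys->reserved + n > sys->limit)
                return -EAGAIN;
        sys->reserved += n;
        u->reserved += n;
        return 0;
}

/* Drop a reservation once the pinned pages have been freed. */
static void release_pinned(struct pin_budget *sys, struct user_account *u, long n)
{
        assert(n > 0 && n <= u->reserved && n <= sys->reserved);
        sys->reserved -= n;
        u->reserved -= n;
}
```

A caller that gets -EAGAIN would sleep until some other holder calls
release_pinned(); only the quota case is reported to the user as ENOMEM.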

--Stephen

* Re: the new VMt
  2000-09-25 18:24                                                 ` Stephen C. Tweedie
@ 2000-09-25 18:34                                                   ` yodaiken
  2000-09-25 18:48                                                     ` Jamie Lokier
  2000-09-25 19:25                                                     ` Stephen C. Tweedie
  0 siblings, 2 replies; 243+ messages in thread
From: yodaiken @ 2000-09-25 18:34 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: yodaiken, Jamie Lokier, Alan Cox, mingo, Andrea Arcangeli,
	Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 07:24:53PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Mon, Sep 25, 2000 at 12:13:15PM -0600, yodaiken@fsmlabs.com wrote:
> 
> > > Definitely not.  GFP_ATOMIC is reserved for things that really can't
> > > swap or schedule right now.  Use GFP_ATOMIC indiscriminately and you'll
> > > have to increase the number of atomic-allocatable pages.
> > 
> > Processes 1, 2 and 3 all start allocating 20 pages:
> >       process 1 stalls after allocating 19
> >       some memory is freed and process 2 runs and stalls after allocating 19
> >       some memory is freed and process 3 runs and stalls after allocating 19
> >
> > Now 57 pages are locked up in non-swappable kernel space and the system
> > deadlocks OOM.
> 
> Or go the beancounter route: process 1 asks "can I pin 20 pages", gets
> told "yes", and goes allocating them, blocking as necessary until it

So you have a "pre-allocation allocator"?  That leads to interesting and
hard-to-detect bugs with old code that does not pre-allocate, with code that
pre-allocates incorrectly, or with code that blocks on something unrelated:

           preallocate 20 pages
           get first
           ask for an inode -- block waiting for an inode

or

           preallocate 20 pages
           if (checkuserpath()) return -ENOWAY; /* stranding my pre-allocation */
           else get them pages

What's nice about these is they don't cause errors on test and seem more
difficult to spot than looking for cases where allocated memory gets stranded.
Doesn't the alloc_vec method seem simpler to you?

> gets them.  Process 2 asks "can *I* pin 20 pages" and the answer is
> either "not right now", in which case it waits for process 1 to
> release its reservation, or "no, you've exceeded your user quota" in

Or for someone else to free more pages ... 

> which case it fails with ENOMEM.  (That latter case can protect us
> against a lot of DoS attacks from local users.)

I like ENOMEM anyways.

> 
> The same accounting really needs to be done for page tables, as that
> represents one of the biggest sources of unaccounted, unswappable
> pages which user processes can cause to be created right now.



-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


* Re: the new VMt
  2000-09-25 18:34                                                   ` yodaiken
@ 2000-09-25 18:48                                                     ` Jamie Lokier
  2000-09-25 19:25                                                     ` Stephen C. Tweedie
  1 sibling, 0 replies; 243+ messages in thread
From: Jamie Lokier @ 2000-09-25 18:48 UTC (permalink / raw)
  To: yodaiken
  Cc: Stephen C. Tweedie, Alan Cox, mingo, Andrea Arcangeli,
	Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

yodaiken@fsmlabs.com wrote:
> > Or go the beancounter route: process 1 asks "can I pin 20 pages", gets
> > told "yes", and goes allocating them, blocking as necessary until it
> 
> So you have a "pre-allocation allocator"?  Leads to interesting and
> hard to detect bugs with old code that does not pre-allocate or with
> code that incorrectly pre-allocates or that blocks on something
> unrelated

I agree with Victor.  Relying on code that calls gfp to do the correct
accounting in advance, to avoid deadlocks, is not at all robust.  Even
the best programmers will have off by one errors in that, and the rest
of us will blindly write code that works all the time, except for the
really obscure case when it fails.

Ideally do both: see if you can allocate in advance, then try it,
but be prepared to back off and return ENOMEM if that fails.

-- Jamie

* Re: the new VMt
  2000-09-25 18:34                                                   ` yodaiken
  2000-09-25 18:48                                                     ` Jamie Lokier
@ 2000-09-25 19:25                                                     ` Stephen C. Tweedie
  2000-09-25 20:04                                                       ` yodaiken
  1 sibling, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 19:25 UTC (permalink / raw)
  To: yodaiken
  Cc: Stephen C. Tweedie, Jamie Lokier, Alan Cox, mingo,
	Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 12:34:56PM -0600, yodaiken@fsmlabs.com wrote:

> > > Processes 1, 2 and 3 all start allocating 20 pages
> > >     now 57 pages are locked up in non-swappable kernel space and the system deadlocks OOM.
> > 
> > Or go the beancounter route: process 1 asks "can I pin 20 pages", gets
> > told "yes", and goes allocating them, blocking as necessary until it
> 
> So you have a "pre-allocation allocator"?  Leads to interesting and hard to detect
> bugs with old code that does not pre-allocate or with code that incorrectly pre-allocates
> or that blocks on something unrelated

Right, but if the alternative is spurious ENOMEM when we can satisfy
all of the pending requests just as long as they are serialised, is
this a problem?

If you want, wrap it in a "get_free_pagev" call which returns a vector
of pointers to free pages, doing whatever accounting is needed.  You
don't have to push all of it to the callers.
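
A minimal user-space sketch of what such a wrapper could look like, with the
accounting hidden inside the call. The name get_free_pagev_sketch, the budget
structure and the use of malloc() are all assumptions for illustration; no
such interface is claimed to exist in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096   /* assumed page size, illustration only */

/* Hypothetical accounting state for pinned pages. */
struct page_budget {
        size_t limit;
        size_t in_use;
};

/*
 * Account for n pages first, then fill vec[0..n-1] with page pointers.
 * On any failure the accounting is undone and nothing is left allocated.
 * Returns 0 on success or -ENOMEM.
 */
static int get_free_pagev_sketch(struct page_budget *b, void **vec, size_t n)
{
        size_t i;

        assert(b->in_use <= b->limit);
        if (n > b->limit - b->in_use)
                return -ENOMEM;         /* or wait, depending on policy */
        b->in_use += n;

        for (i = 0; i < n; i++) {
                vec[i] = malloc(SKETCH_PAGE_SIZE);
                if (vec[i] == NULL) {
                        while (i > 0)
                                free(vec[--i]);
                        b->in_use -= n;
                        return -ENOMEM;
                }
        }
        return 0;
}

/* Free a vector obtained above and return its pages to the budget. */
static void put_free_pagev_sketch(struct page_budget *b, void **vec, size_t n)
{
        size_t i;

        assert(n <= b->in_use);
        for (i = 0; i < n; i++)
                free(vec[i]);
        b->in_use -= n;
}
```

Callers only see one call that either returns the whole vector or fails; the
bookkeeping stays in one place.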

However, you just can't escape from the fact that on low memory
machines, we *need* beancounter-style accounting of pinned pages or
we'll be in Deep Trouble (TM).  We already have nasty DoS situations
which are embarrassingly easy to reproduce.  If we need such
beancounter protection, AND such protection can prevent the situation
you describe, then do we need to go looking for another way of
achieving the same protection?

--Stephen

* Re: the new VMt
  2000-09-25 19:25                                                     ` Stephen C. Tweedie
@ 2000-09-25 20:04                                                       ` yodaiken
  2000-09-25 20:23                                                         ` Alan Cox
                                                                           ` (2 more replies)
  0 siblings, 3 replies; 243+ messages in thread
From: yodaiken @ 2000-09-25 20:04 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: yodaiken, Jamie Lokier, Alan Cox, mingo, Andrea Arcangeli,
	Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 08:25:49PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Mon, Sep 25, 2000 at 12:34:56PM -0600, yodaiken@fsmlabs.com wrote:
> 
> > > > Processes 1, 2 and 3 all start allocating 20 pages
> > > >     now 57 pages are locked up in non-swappable kernel space and the system deadlocks OOM.
> > > 
> > > Or go the beancounter route: process 1 asks "can I pin 20 pages", gets
> > > told "yes", and goes allocating them, blocking as necessary until it
> > 
> > So you have a "pre-allocation allocator"?  Leads to interesting and hard to detect
> > bugs with old code that does not pre-allocate or with code that incorrectly pre-allocates
> > or that blocks on something unrelated
> 
> Right, but if the alternative is spurious ENOMEM when we can satisfy

An ENOMEM is not spurious if there is not enough memory. UNIX does not ask the
OS to do impossible tricks.

> all of the pending requests just as long as they are serialised, is
> this a problem?

I think you are solving the wrong problem. On a small memory machine, the kernel,
utilities, and applications should be configured to use little memory.  
BusyBox is better than BeanCount. 


> However, you just can't escape from the fact that on low memory
> machines, we *need* beancounter-style accounting of pinned pages or
> we'll be in Deep Trouble (TM).  We already have nasty DoS situations

What we need is simple kernel code that does not hold on to resources
when it is heading into a possible deadlock situation.

> which are embarrassingly easy to reproduce.  If we need such
> beancounter protection, AND such protection can prevent the situation
> you describe, then do we need to go looking for another way of
> achieving the same protection?


On general principles, I don't see any substitute for clean code in the kernel,
and my prediction is that if you show me an example of a
DoS vulnerability, I can show you a fix that does not require bean counting.
Am I wrong?





-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


* Re: the new VMt
  2000-09-25 20:04                                                       ` yodaiken
@ 2000-09-25 20:23                                                         ` Alan Cox
  2000-09-25 20:35                                                           ` yodaiken
  2000-09-25 20:32                                                         ` Stephen C. Tweedie
  2000-09-25 23:14                                                         ` Erik Andersen
  2 siblings, 1 reply; 243+ messages in thread
From: Alan Cox @ 2000-09-25 20:23 UTC (permalink / raw)
  To: yodaiken
  Cc: Stephen C. Tweedie, Jamie Lokier, Alan Cox, mingo,
	Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

> my prediction is that if you show me an example of a
> DoS vulnerability, I can show you a fix that does not require bean counting.
> Am I wrong?

I think so. Page tables are a good example



* Re: the new VMt
  2000-09-25 20:23                                                         ` Alan Cox
@ 2000-09-25 20:35                                                           ` yodaiken
  2000-09-25 20:46                                                             ` Alan Cox
  2000-09-25 20:47                                                             ` Benjamin C.R. LaHaise
  0 siblings, 2 replies; 243+ messages in thread
From: yodaiken @ 2000-09-25 20:35 UTC (permalink / raw)
  To: Alan Cox
  Cc: yodaiken, Stephen C. Tweedie, Jamie Lokier, mingo,
	Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 09:23:48PM +0100, Alan Cox wrote:
> > my prediction is that if you show me an example of a
> > DoS vulnerability, I can show you a fix that does not require bean counting.
> > Am I wrong?
> 
> I think so. Page tables are a good example

I'm not too sure of what you have in mind, but if it is
     "process creates vast virtual space to generate many page table
      entries -- using mmap"
the answer is, virtual address space quotas and mmap should kill 
the process on low mem for page tables.


-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


* Re: the new VMt
  2000-09-25 20:35                                                           ` yodaiken
@ 2000-09-25 20:46                                                             ` Alan Cox
  2000-09-25 21:07                                                               ` yodaiken
  2000-09-25 20:47                                                             ` Benjamin C.R. LaHaise
  1 sibling, 1 reply; 243+ messages in thread
From: Alan Cox @ 2000-09-25 20:46 UTC (permalink / raw)
  To: yodaiken
  Cc: Alan Cox, Stephen C. Tweedie, Jamie Lokier, mingo,
	Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

> I'm not too sure of what you have in mind, but if it is
>      "process creates vast virtual space to generate many page table
>       entries -- using mmap"
> the answer is, virtual address space quotas and mmap should kill 
> the process on low mem for page tables.

Those quotas being exactly what beancounter is


* Re: the new VMt
  2000-09-25 20:46                                                             ` Alan Cox
@ 2000-09-25 21:07                                                               ` yodaiken
  2000-09-26  9:54                                                                 ` Stephen C. Tweedie
  0 siblings, 1 reply; 243+ messages in thread
From: yodaiken @ 2000-09-25 21:07 UTC (permalink / raw)
  To: Alan Cox
  Cc: yodaiken, Stephen C. Tweedie, Jamie Lokier, mingo,
	Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 09:46:35PM +0100, Alan Cox wrote:
> > I'm not too sure of what you have in mind, but if it is
> >      "process creates vast virtual space to generate many page table
> >       entries -- using mmap"
> > the answer is, virtual address space quotas and mmap should kill 
> > the process on low mem for page tables.
> 
> Those quotas being exactly what beancounter is

But that is a function-specific counter, not a counter in the
alloc code.


-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


* Re: the new VMt
  2000-09-25 21:07                                                               ` yodaiken
@ 2000-09-26  9:54                                                                 ` Stephen C. Tweedie
  2000-09-26 13:17                                                                   ` yodaiken
  0 siblings, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-26  9:54 UTC (permalink / raw)
  To: yodaiken
  Cc: Alan Cox, Stephen C. Tweedie, Jamie Lokier, mingo,
	Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 03:07:44PM -0600, yodaiken@fsmlabs.com wrote:
> On Mon, Sep 25, 2000 at 09:46:35PM +0100, Alan Cox wrote:
> > > I'm not too sure of what you have in mind, but if it is
> > >      "process creates vast virtual space to generate many page table
> > >       entries -- using mmap"
> > > the answer is, virtual address space quotas and mmap should kill 
> > > the process on low mem for page tables.
> > 
> > Those quotas being exactly what beancounter is
> 
> But that is a function specific counter, not a counter in the 
> alloc code.

Beancounter is a framework for user-level accounting.  _What_ you
account is up to the callers.  Maybe this has been a miscommunication,
but beancounter is all about allowing callers to account for stuff
before allocation, not about having the page allocation functions
themselves enforce quotas.

--Stephen

* Re: the new VMt
  2000-09-26  9:54                                                                 ` Stephen C. Tweedie
@ 2000-09-26 13:17                                                                   ` yodaiken
  0 siblings, 0 replies; 243+ messages in thread
From: yodaiken @ 2000-09-26 13:17 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: yodaiken, Alan Cox, Jamie Lokier, mingo, Andrea Arcangeli,
	Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Tue, Sep 26, 2000 at 10:54:23AM +0100, Stephen C. Tweedie wrote:
> Beancounter is a framework for user-level accounting.  _What_ you
> account is up to the callers.  Maybe this has been a miscommunication,
> but beancounter is all about allowing callers to account for stuff
> before allocation, not about having the page allocation functions
> themselves enforce quotas.

Per-user, system-wide, and per-process quotas are one thing; a generic
pre-allocate-and-then-allocate scheme seems to me to be an error-prone
way of getting there. In particular, I think it is dangerous to have a
pre-count that is only approximately tethered to the thing it is counting.
In the memory allocation we were discussing, you need to make sure that the
pre-allocations are for memory that is really going to be allocated soon,
and that each one is later correlated with a free in some way.

So, to me, a quota-bounded allocate_page_table(process_id) makes much more
sense than pre-allocate counting, or, even worse, a "smart" kmalloc that
never fails. If the problem is unaccounted-for page tables, then account for
page tables and return a -EYOURPROCESSISOUTOFCONTROL so that the calling
kernel code can take the responsible action.
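
Roughly what I have in mind, as a user-space sketch only. The account
structure, the _sketch function names and the use of -ENOMEM in place of the
joke errno above are assumptions for illustration, not existing kernel
interfaces:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096   /* assumed page size, illustration only */

struct proc_account {
        unsigned long page_tables;        /* page-table pages charged so far */
        unsigned long page_table_quota;   /* hypothetical per-process limit */
};

/*
 * Allocate one page-table page on behalf of a process, charging its
 * account. Returns 0 and stores the page in *table, or -ENOMEM when
 * the quota is exhausted or no memory is available.
 */
static int allocate_page_table_sketch(struct proc_account *acct, void **table)
{
        void *page;

        assert(acct != NULL && table != NULL);
        if (acct->page_tables >= acct->page_table_quota)
                return -ENOMEM;
        page = calloc(1, SKETCH_PAGE_SIZE);
        if (page == NULL)
                return -ENOMEM;
        acct->page_tables++;
        *table = page;
        return 0;
}

/* Uncharge and free a page-table page, e.g. at process exit. */
static void free_page_table_sketch(struct proc_account *acct, void *table)
{
        assert(acct->page_tables > 0);
        free(table);
        acct->page_tables--;
}
```

The count is tied to the allocation and the free themselves, so it cannot
drift away from what it is counting.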

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


* Re: the new VMt
  2000-09-25 20:35                                                           ` yodaiken
  2000-09-25 20:46                                                             ` Alan Cox
@ 2000-09-25 20:47                                                             ` Benjamin C.R. LaHaise
  2000-09-25 21:12                                                               ` yodaiken
  1 sibling, 1 reply; 243+ messages in thread
From: Benjamin C.R. LaHaise @ 2000-09-25 20:47 UTC (permalink / raw)
  To: yodaiken; +Cc: Stephen C. Tweedie, MM mailing list, linux-kernel

On Mon, 25 Sep 2000 yodaiken@fsmlabs.com wrote:

> On Mon, Sep 25, 2000 at 09:23:48PM +0100, Alan Cox wrote:
> > > my prediction is that if you show me an example of 
> > > DoS vulnerability,  I can show you fix that does not require bean counting.
> > > Am I wrong?
> > 
> > I think so. Page tables are a good example
> 
> I'm not too sure of what you have in mind, but if it is
>      "process creates vast virtual space to generate many page table
>       entries -- using mmap"
> the answer is, virtual address space quotas and mmap should kill 
> the process on low mem for page tables.

No.  Page tables are not freed after munmap (and for good reason).  The
counting of page table "beans" is critical.

		-ben


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 20:47                                                             ` Benjamin C.R. LaHaise
@ 2000-09-25 21:12                                                               ` yodaiken
  2000-09-26 10:07                                                                 ` Stephen C. Tweedie
  0 siblings, 1 reply; 243+ messages in thread
From: yodaiken @ 2000-09-25 21:12 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise
  Cc: yodaiken, Stephen C. Tweedie, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:47:21PM -0400, Benjamin C.R. LaHaise wrote:
> On Mon, 25 Sep 2000 yodaiken@fsmlabs.com wrote:
> 
> > On Mon, Sep 25, 2000 at 09:23:48PM +0100, Alan Cox wrote:
> > > > my prediction is that if you show me an example of 
> > > > DoS vulnerability,  I can show you fix that does not require bean counting.
> > > > Am I wrong?
> > > 
> > > I think so. Page tables are a good example
> > 
> > I'm not too sure of what you have in mind, but if it is
> >      "process creates vast virtual space to generate many page table
> >       entries -- using mmap"
> > the answer is, virtual address space quotas and mmap should kill 
> > the process on low mem for page tables.
> 
> No.  Page tables are not freed after munmap (and for good reason).  The
> counting of page table "beans" is critical.

I've seen the assertion before, reasons would be interesting.


-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 21:12                                                               ` yodaiken
@ 2000-09-26 10:07                                                                 ` Stephen C. Tweedie
  2000-09-26 13:30                                                                   ` yodaiken
  0 siblings, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-26 10:07 UTC (permalink / raw)
  To: yodaiken
  Cc: Benjamin C.R. LaHaise, Stephen C. Tweedie, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 03:12:50PM -0600, yodaiken@fsmlabs.com wrote:
> > > 
> > > I'm not too sure of what you have in mind, but if it is
> > >      "process creates vast virtual space to generate many page table
> > >       entries -- using mmap"
> > > the answer is, virtual address space quotas and mmap should kill 
> > > the process on low mem for page tables.
> > 
> > No.  Page tables are not freed after munmap (and for good reason).  The
> > counting of page table "beans" is critical.
> 
> I've seen the assertion before, reasons would be interesting.

Reason 1: under DoS attack, you want to target not the process using
the most resources, but the *user* using the most resources (else a
fork-bomb style attack can work around your OOM-killer algorithms).

Reason 2: if you've got tasks stuck in low-level page allocation
routines, then you can't immediately kill -9 them, so reactive OOM
killing always has vulnerabilities --- to be robust in preventing
resource exhaustion you want limits on the use of those resources
before they are exhausted --- the necessary accounting being part of
what we refer to as "beancounter".

--Stephen

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-26 10:07                                                                 ` Stephen C. Tweedie
@ 2000-09-26 13:30                                                                   ` yodaiken
  0 siblings, 0 replies; 243+ messages in thread
From: yodaiken @ 2000-09-26 13:30 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: yodaiken, Benjamin C.R. LaHaise, MM mailing list, linux-kernel

On Tue, Sep 26, 2000 at 11:07:36AM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Mon, Sep 25, 2000 at 03:12:50PM -0600, yodaiken@fsmlabs.com wrote:
> > > > 
> > > > I'm not too sure of what you have in mind, but if it is
> > > >      "process creates vast virtual space to generate many page table
> > > >       entries -- using mmap"
> > > > the answer is, virtual address space quotas and mmap should kill 
> > > > the process on low mem for page tables.
> > > 
> > > No.  Page tables are not freed after munmap (and for good reason).  The
> > > counting of page table "beans" is critical.
> > 
> > I've seen the assertion before, reasons would be interesting.
> 
> Reason 1: under DoS attack, you want to target not the process using
> the most resources, but the *user* using the most resources (else a
> fork-bomb style attack can work around your OOM-killer algorithms).

Ok.
      if (over_allocated_page_tables(task->uid)) return -ENOMEM;

makes sense in "fork".   I guess the argument here is not about whether
accounting is good, it's about where the accounting should be done. To me
the alternatives of

      if(preallocate_pages(page_table_size_for_this_process()) == -1)return error
         then actually allocate making sure to adjust counts if some other
         error turns up and with something taking care of how the pre-allocation
         works while we are sleeping waiting for possibly unrelated resources.

or
      just kmalloc with kmalloc magically juggling resources in some safe way


seem less clear.

> Reason 2: if you've got tasks stuck in low-level page allocation
> routines, then you can't immediately kill -9 them, so reactive OOM
> killing always has vulnerabilities --- to be robust in preventing
> resource exhaustion you want limits on the use of those resources
> before they are exhausted --- the necessary accounting being part of
> what we refer to as "beancounter".

Doesn't the problem really come from doing low-level page allocation at too
high a level? That is, instead of select doing get_free_page, maybe it should do
get_per_process_page(myprocess) or even get_per_process_file_use_page(myprocess).
Then we could have config-optional per-process pinned-page accounting, with the
possibility of doing something sensible in a user-space daemon when memory is low.

> 
> --Stephen

-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 20:04                                                       ` yodaiken
  2000-09-25 20:23                                                         ` Alan Cox
@ 2000-09-25 20:32                                                         ` Stephen C. Tweedie
  2000-09-26 12:10                                                           ` Mark Hemment
  2000-09-25 23:14                                                         ` Erik Andersen
  2 siblings, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 20:32 UTC (permalink / raw)
  To: yodaiken
  Cc: Stephen C. Tweedie, Jamie Lokier, Alan Cox, mingo,
	Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 02:04:19PM -0600, yodaiken@fsmlabs.com wrote:

> > Right, but if the alternative is spurious ENOMEM when we can satisfy
> 
> An ENOMEM is not spurious if there is not enough memory. UNIX does not ask the
> OS to do impossible tricks.

Yes, but the ENOMEM _is_ spurious if you actually meant EAGAIN, and if
the OS was perfectly capable of doing the retry itself.

> > all of the pending requests just as long as they are serialised, is
> > this a problem?
> 
> I think you are solving the wrong problem. On a small memory machine, the kernel,
> utilities, and applications should be configured to use little memory.  
> BusyBox is better than BeanCount. 

Any box is a small memory machine if you get the wrong workload on it,
and the DoS attacks which are possible without beancounting let any
user bring even a large system to its knees right now.  If solving
that problem also means that small memory machines do the right thing
on their own rather than requiring specific manual configuration, then
it sounds like a good aim.

> > However, you just can't escape from the fact that on low memory
> > machines, we *need* beancounter-style accounting of pinned pages or
> > we'll be in Deep Trouble (TM).  We already have nasty DoS situations
> 
> What we need is simple kernel code that does not hold resources
> into a  possible deadlock situation. 

<nod>

> On general principles, I don't see any substitute for clean code in the kernel and
> my prediction is that if you show me an example of 
> DoS vulnerability,  I can show you fix that does not require bean counting.
> Am I wrong?

If you have a user forking multiple processes and exhausting some
resource, then at some point you have to do something about it.  Let's
say it's page tables, just for argument's sake, because those are
currently non-swappable, but even if you make those swappable there
are plenty of other resources it might be (eg. data shoved down unix
domain sockets if you want another example).

So you have run out of physical memory --- what do you do about it?
The important observation here is that in a multi-user environment,
simply denying further allocations isn't good enough --- unless you
revoke those existing allocations you have DoS.  And you can't fairly
revoke existing allocations without knowing WHICH user has exhausted
the memory (which requires beancounter-style resource tracking), AND
having mechanisms in place to revoke all of the possible resources
which might be involved (eg unix domain socket datagrams).  kill -9
might solve that latter problem but it doesn't help in identifying who
to kill.
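
To make the "WHICH user" point concrete, here is a small user-space sketch
in C of per-user accounting. Everything in it (struct bean_table, the
beans_* functions, MAX_USERS) is hypothetical and invented for this
example; it is not the beancounter patch. Every task of a user charges the
same counter, so forking more processes does not hide the usage.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_USERS 8   /* arbitrary table size for the sketch */

/* Hypothetical per-user counter of pinned (non-swappable) pages. */
struct user_beans {
    unsigned int uid;
    size_t pinned_pages;
};

struct bean_table {
    struct user_beans users[MAX_USERS];
    size_t nr_users;
};

/* Find the entry for uid, creating it if needed; NULL if the table is full. */
static struct user_beans *beans_lookup(struct bean_table *t, unsigned int uid)
{
    size_t i;

    for (i = 0; i < t->nr_users; i++)
        if (t->users[i].uid == uid)
            return &t->users[i];
    if (t->nr_users == MAX_USERS)
        return NULL;
    t->users[t->nr_users].uid = uid;
    t->users[t->nr_users].pinned_pages = 0;
    return &t->users[t->nr_users++];
}

/* Charge pages to a user, whichever of the user's tasks pinned them. */
static int beans_charge(struct bean_table *t, unsigned int uid, size_t pages)
{
    struct user_beans *u = beans_lookup(t, uid);

    if (u == NULL)
        return -1;
    u->pinned_pages += pages;
    return 0;
}

/* Drop a previous charge. */
static void beans_uncharge(struct bean_table *t, unsigned int uid, size_t pages)
{
    struct user_beans *u = beans_lookup(t, uid);

    assert(u != NULL && u->pinned_pages >= pages);
    u->pinned_pages -= pages;
}

/* Return the user with the largest charge, or NULL if the table is empty. */
static const struct user_beans *beans_heaviest_user(const struct bean_table *t)
{
    const struct user_beans *worst = NULL;
    size_t i;

    for (i = 0; i < t->nr_users; i++)
        if (worst == NULL || t->users[i].pinned_pages > worst->pinned_pages)
            worst = &t->users[i];
    return worst;
}
```

With a hundred tasks of one user each pinning a single page, the table
still shows that user as the heaviest, which per-process accounting alone
would not.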

--Stephen

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 20:32                                                         ` Stephen C. Tweedie
@ 2000-09-26 12:10                                                           ` Mark Hemment
  2000-09-27 10:13                                                             ` Andrey Savochkin
  0 siblings, 1 reply; 243+ messages in thread
From: Mark Hemment @ 2000-09-26 12:10 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: yodaiken, Jamie Lokier, Alan Cox, mingo, Andrea Arcangeli,
	Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

Hi,

On Mon, 25 Sep 2000, Stephen C. Tweedie wrote: 
> So you have run out of physical memory --- what do you do about it?

  Why let the system get into the state where it is necessary to kill a
process?
  Per-user/task resource counters should prevent unprivileged users from
soaking up too many resources.  That is the DoS protection.

  So an OOM is possibly:
	1) A privileged, legally resource hungry, app(s) has taken all
	   the memory.  Could be too important to simply kill (it
	   should exit gracefully).
	2) Simply too many tasks*(memory-requirements-of-each-task).

  Ignoring allocations done by the kernel, the situation comes down to the
fact that the system has over committed its memory resources.  ie. it has
sold too many tickets for the number of seats in the plane, and all the
passengers have turned up.
 (note, I use the term "memory" and not "physical memory", I'm including
swap space).

  Why not protect the system from over committing its memory resources?

  It is possible to do true, system wide, resource counting of physical
memory and swap space, and to deny a fork() or mmap() which would cause
over committing of memory resources if everyone cashed in their
requirements.

  Named pages (those which came from a file) are the simplest to
handle.  If dirty, they already have allocated backing store, so we know
there is somewhere to put them when memory is low.
  How many named pages need to be held in physical memory at any one
instance for the system to function?  Only a few, although if you reach
that state, the system will be thrashing itself to death.

  Anonymous and copied (those faulted from a write to a writable,
i.e. PROT_WRITE, MAP_PRIVATE mapping) pages can be stored in either physical
memory or on swap.  To avoid getting into the OOM situation, when these
mappings are created the system needs to check that it has (and will have,
in the future) space for every page that _could_ be allocated for the
mapping - ie. work out the worst case (including page-tables).
  This space could be on swap or in physical memory.  It is the accounting
which needs to be done, not the actual allocation (and not even the
decision of where to store the page when allocated - that is made much
later, when it needs to be).  If a machine has 2GB of RAM, a 1MB
swap, and 1GB of dirty anon or copied pages, that is fine.
  I'm stressing this point, as the scheme of reserving space for an (as
yet) unallocated page is sometimes referred to as "eager swap
allocation" (or some similar term).  This is confusing.  People then
start to believe they need backing store for each anon/copied page.  You
don't.  You simply need somewhere to store it, and that could be a
physical page.  It is all in the accounting. :)
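
A minimal sketch in C of the accounting described above, under stated
assumptions: struct commit_state and the commit_* functions are
hypothetical names for this example, all quantities are in pages, and
kernel allocations are ignored. A reservation only has to fit somewhere in
RAM plus swap; nothing is allocated and no swap slot is chosen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical system-wide accounting state, all values in pages. */
struct commit_state {
    uint64_t ram_pages;        /* physical memory */
    uint64_t swap_pages;       /* swap space */
    uint64_t committed_pages;  /* worst-case anon/copied pages promised so far */
};

/*
 * Reserve room for the worst case of a new mapping.
 * Returns false if honouring every promise could exceed RAM + swap;
 * the caller would then fail the mmap() or fork() synchronously.
 */
static bool commit_reserve(struct commit_state *s, uint64_t pages)
{
    uint64_t capacity = s->ram_pages + s->swap_pages;

    assert(s->committed_pages <= capacity);
    if (pages > capacity - s->committed_pages)
        return false;
    s->committed_pages += pages;
    return true;
}

/* Give a reservation back, e.g. on munmap() or exit. */
static void commit_release(struct commit_state *s, uint64_t pages)
{
    assert(pages <= s->committed_pages);
    s->committed_pages -= pages;
}
```

With 4KB pages, the 2GB RAM / 1MB swap machine from the text has 524288 +
256 pages of capacity, so a 1GB (262144 page) reservation succeeds even
though swap is far smaller than the reservation.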

  Allocations made by the kernel, for the kernel, are (obviously) pinned
memory.  To ensure kernel allocations do not completely exhaust physical
memory (or cause physical memory to be over committed if the worst case
occurs), they need to be limited.
  How to limit?
  As a first guess (and this is only a guess):
	1) don't let kernel allocations exceed 25% of physical memory
	   (tunable)
	2) don't let kernel allocations succeed if they would cause
	   over commitment.
  Both conditions would need to pass before an allocation could succeed.
  This does need much more thought.  Should some tuning be per subsystem?
I don't know....
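
One possible reading of the two conditions, as a C sketch. The structure
and function names are hypothetical, and the interpretation of condition 2
(pinned kernel pages plus the user-side reservation must still fit in RAM
plus swap) is an assumption made for this example, not something the
message spells out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical global accounting state, all values in pages. */
struct mem_account {
    uint64_t ram_pages;
    uint64_t swap_pages;
    uint64_t kernel_pages;          /* pinned kernel allocations so far */
    uint64_t user_committed_pages;  /* worst-case anon/copied reservation */
    unsigned int kernel_limit_pct;  /* tunable; 25 in the guess above */
};

/* Both conditions must pass before a kernel allocation may go ahead. */
static bool kernel_alloc_allowed(const struct mem_account *a, uint64_t pages)
{
    uint64_t kernel_after = a->kernel_pages + pages;

    assert(a->kernel_limit_pct <= 100);

    /* Condition 1: stay within the tunable share of physical memory. */
    if (kernel_after * 100 > a->ram_pages * a->kernel_limit_pct)
        return false;

    /* Condition 2: do not push the system into over commitment. */
    if (kernel_after + a->user_committed_pages > a->ram_pages + a->swap_pages)
        return false;

    return true;
}
```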

  Perhaps 1) isn't needed.  I'm not sure.

  Because of 2), the total physical memory accounted for anon/copied
pages needs to have a high watermark.  Otherwise, in the accounting, the
system could allow too much physical memory to be reserved for these
types of pages (there doesn't need to be space on swap for each
anon/copied page, just space somewhere - a watermark would prevent too
much of this being physical memory).  Note, this doesn't mean start
swapping earlier - remember, this is accounting of anon/copied pages to
avoid over commitment.
  For named pages, the page cache needs to have a reserved number of
physical pages (ie. how small is it allowed to get, before pruning
stops).  Again, these reserved pages are in the accounting.

 mlock()ed pages also need accounting, to prevent over commitment of
physical memory.  All fun.

  The disadvantages;

1) Extra code to do the accounting.
	This shouldn't be too heavy.

2) mmap(MAP_ANON)/writable mmap(MAP_PRIVATE) can fail more readily.

	Programs which expect to memory map areas (which would create
	anon/copied pages when written to) will see an increased failure
	rate in mmap().  This can be very annoying, especially when you
	know the mapping will be used sparsely.

	One solution is to add a new mmap() flag, which tells the kernel
	to let this mmap() exceed the actual resources.
	With such a flag, the mmap() will be allowed, but the task should
	expect to be killed if memory is exhausted.  (It could be
	possible for the kernel to deliver a SIGDANGER signal to such a
	task, as in AIX, to give it a chance of reducing its requirements
	on the system or to exit gracefully.)

	Another solution is to allow the strict resource accounting to be
	overridden on a global basis.  Say, by allowing the system to
	over commit the memory resources by 10%. This does remove the
	absolute protection, but leaves some in place.  The OOM killer
	would come into play if the system did over commit.
	Those who don't need/want protection, could set the over commit to
	some large value.  500%?

3) fork() failures.

	There is the problem of a large(ish) process wanting to run a
	small program.  Say, a shell wanting to run a simple utility.

	Because of the memory resource accounting, the fork() is
	disallowed as the newly created child could (in theory) write to
	mmap()ed areas, creating anon/copied pages which would cause the
	kernel to (in the worst case) be OOM for user-pages.  Given that
	the child will almost immediately do an exec(), which could well
	succeed, this is frustrating.

	Again, a small over commit kludge would reduce (but not
	eliminate) this occurrence.

	An idea from a colleague, is to allow such a fork() to succeed,
	but to run the child process in a "container".
	Inside the container, the child is allowed to perform operations
	which would be expected before an exec().  Such operations could
	be closing file descriptors.  However, if it tries to do something
	which would _seriously_ affect the state of the system (such as
	remove a file), then it is killed.  i.e. give it a chance to
	do an exec().  This could be done by running with an alternative
	system call table for the child process, which refers to bounce
	functions within the kernel where the checks are done (ie. don't
	load the common code path with the checks).
	This could be tricky to do, and there could well be a few
	system (library?) calls which would make it impossible.  However,
	if it could be achieved, it would remove one of the most annoying
	"features" of over commitment protection.

  This sort of protection isn't to prevent DoS attacks; as said above,
they need to be on a per user/task level.  This protection is to protect
against asynchronous failures on page faults due to OOM, and to make
them synchronous (from mmap(), fork(), mlock(), etc) where programs
are expected to test for an error code.
  There isn't much an application can do with a synchronous memory
failure; sleep and try again, release some of its own resources, or exit
gracefully.
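
The global over-commit allowance mentioned among the disadvantages (10%,
or something like 500% for those who want no protection) could be sketched
in C as follows. struct commit_policy and the commit_* functions are
hypothetical names for this example; capacity_pages is assumed to be RAM
plus swap in pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accounting state with a global over-commit allowance. */
struct commit_policy {
    uint64_t capacity_pages;     /* RAM + swap, in pages */
    unsigned int overcommit_pct; /* 0 = strict, 10 = allow 10% extra, ... */
    uint64_t committed_pages;    /* worst-case pages promised so far */
};

/* Largest total reservation the policy will accept. */
static uint64_t commit_limit(const struct commit_policy *p)
{
    return p->capacity_pages + p->capacity_pages * p->overcommit_pct / 100;
}

/* Synchronous check made at mmap()/fork()/mlock() time. */
static bool commit_try_reserve(struct commit_policy *p, uint64_t pages)
{
    uint64_t limit = commit_limit(p);

    assert(p->committed_pages <= limit);
    if (pages > limit - p->committed_pages)
        return false;   /* caller returns an error code to the program */
    p->committed_pages += pages;
    return true;
}

/* True once promises exceed real capacity: an OOM killer may be needed. */
static bool commit_is_overcommitted(const struct commit_policy *p)
{
    return p->committed_pages > p->capacity_pages;
}
```

With overcommit_pct set to 0 every failure is synchronous; any non-zero
value trades some of that guarantee for fewer refused mmap() and fork()
calls.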

  Anyway, I've skipped over a lot of interesting details (and problems).
  This stuff isn't new.  Some commercial OSes have this type of protection.

  Comments?

Mark


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-26 12:10                                                           ` Mark Hemment
@ 2000-09-27 10:13                                                             ` Andrey Savochkin
  2000-09-27 12:55                                                               ` Hugh Dickins
  0 siblings, 1 reply; 243+ messages in thread
From: Andrey Savochkin @ 2000-09-27 10:13 UTC (permalink / raw)
  To: Mark Hemment
  Cc: yodaiken, Jamie Lokier, Alan Cox, mingo, Andrea Arcangeli,
	Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel, Stephen C. Tweedie

Hello,

On Tue, Sep 26, 2000 at 01:10:30PM +0100, Mark Hemment wrote:
> 
> On Mon, 25 Sep 2000, Stephen C. Tweedie wrote: 
> > So you have run out of physical memory --- what do you do about it?
> 
>   Why let the system get into the state where it is necessary to kill a
> process?
>   Per-user/task resource counters should prevent unprivileged users from
> soaking up too many resources.  That is the DoS protection.
> 
[snip]
>   It is possible to do true, system wide, resource counting of physical
> memory and swap space, and to deny a fork() or mmap() which would cause
> over committing of memory resources if everyone cashed in their
> requirements.
[snip]

People use overcommitting not because they are fans of the idea.
Overcommitting is simply the _efficient_ way of sharing resources.
It's a waste of resources to reserve memory+swap for the case that every
running process decides to modify libc code (and, thus, should receive its
private copy of the pages).   A real waste!
I am always willing to take the risk of some applications being killed in
the case of all processes going crazy.

The approach I believe in is:
 - ensure that accidental or intentional madness of applications of one user
   may cause only limited damage to other users; and
 - introduce a way to tell the kernel that some applications should be
   saved longer than others when troubles begin and ways to set up some
   guaranteed amounts for important processes.
Certainly, a lot of processes may consume more than their guarantee until
bad things start to happen.  Then the rules of user protection and killing
order apply.
That's how I am developing the resource control in the beancounter patch
ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/UserBeancounter.html#s7

Best regards
		Andrey

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-27 10:13                                                             ` Andrey Savochkin
@ 2000-09-27 12:55                                                               ` Hugh Dickins
  2000-09-28  3:25                                                                 ` Andrey Savochkin
  0 siblings, 1 reply; 243+ messages in thread
From: Hugh Dickins @ 2000-09-27 12:55 UTC (permalink / raw)
  To: Andrey Savochkin; +Cc: Mark Hemment, MM mailing list, linux-kernel

On Wed, 27 Sep 2000, Andrey Savochkin wrote:
> 
> It's a waste of resources to reserve memory+swap for the case that every
> running process decides to modify libc code (and, thus, should receive its
> private copy of the pages).   A real waste!

A real waste indeed, but a bad example: libc code is mapped read-only,
so nobody would recommend reserving memory+swap for private mods to it.
Of course, a process might choose to mprotect it writable at some time,
that would be when to refuse if overcommitted.

Hugh


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-27 12:55                                                               ` Hugh Dickins
@ 2000-09-28  3:25                                                                 ` Andrey Savochkin
  0 siblings, 0 replies; 243+ messages in thread
From: Andrey Savochkin @ 2000-09-28  3:25 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Mark Hemment, MM mailing list, linux-kernel

Hello,

On Wed, Sep 27, 2000 at 01:55:52PM +0100, Hugh Dickins wrote:
> On Wed, 27 Sep 2000, Andrey Savochkin wrote:
> > 
> > It's a waste of resources to reserve memory+swap for the case that every
> > running process decides to modify libc code (and, thus, should receive its
> > private copy of the pages).   A real waste!
> 
> A real waste indeed, but a bad example: libc code is mapped read-only,
> so nobody would recommend reserving memory+swap for private mods to it.
> Of course, a process might choose to mprotect it writable at some time,
> that would be when to refuse if overcommitted.

Returning an error from an mprotect() call for private mappings?
It wouldn't be what people expect...

The other example where overcommit makes sense is fork() (not vfork) and
immediate exec in one of the threads.

Best regards
		Andrey

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 20:04                                                       ` yodaiken
  2000-09-25 20:23                                                         ` Alan Cox
  2000-09-25 20:32                                                         ` Stephen C. Tweedie
@ 2000-09-25 23:14                                                         ` Erik Andersen
  2000-09-26 15:17                                                           ` yodaiken
  2 siblings, 1 reply; 243+ messages in thread
From: Erik Andersen @ 2000-09-25 23:14 UTC (permalink / raw)
  To: yodaiken; +Cc: MM mailing list, linux-kernel

On Mon Sep 25, 2000 at 02:04:19PM -0600, yodaiken@fsmlabs.com wrote:
> 
> > all of the pending requests just as long as they are serialised, is
> > this a problem?
> 
> I think you are solving the wrong problem. On a small memory machine, the kernel,
> utilities, and applications should be configured to use little memory.  
> BusyBox is better than BeanCount. 
> 

Granted that smaller apps can help -- for a particular workload.  But while I
am very partial to BusyBox (in fact I am about to cut a new release) I can
assure you that OOM is easily possible even when your user space is tiny.  I do
it all the time.  There are mallocs in busybox and when under memory pressure,
the kernel still tends to fall over...

 -Erik

--
Erik B. Andersen   email:  andersee@debian.org
--This message was written using 73% post-consumer electrons--

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-25 23:14                                                         ` Erik Andersen
@ 2000-09-26 15:17                                                           ` yodaiken
  2000-09-26 16:04                                                             ` Stephen C. Tweedie
  0 siblings, 1 reply; 243+ messages in thread
From: yodaiken @ 2000-09-26 15:17 UTC (permalink / raw)
  To: yodaiken, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 05:14:11PM -0600, Erik Andersen wrote:
> On Mon Sep 25, 2000 at 02:04:19PM -0600, yodaiken@fsmlabs.com wrote:
> > 
> > > all of the pending requests just as long as they are serialised, is
> > > this a problem?
> > 
> > I think you are solving the wrong problem. On a small memory machine, the kernel,
> > utilities, and applications should be configured to use little memory.  
> > BusyBox is better than BeanCount. 
> > 
> 
> Granted that smaller apps can help -- for a particular workload.  But while I
> am very partial to BusyBox (in fact I am about to cut a new release) I can
> assure you that OOM is easily possible even when your user space is tiny.  I do
> it all the time.  There are mallocs in busybox and when under memory pressure,
> the kernel still tends to fall over...

Operating systems cannot make more memory appear by magic.
The question is really about the best strategy for dealing with low memory. In my
opinion, the OS should not try to out-think physical limitations. Instead, the OS
should take as little space as possible and provide the means for clever user-level
management of space. In a truly embedded system, there can easily be a user-level
root process that watches memory usage and prevents DoS attacks, if the OS provides
settable, enforced quotas etc.


-- 
---------------------------------------------------------
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-26 15:17                                                           ` yodaiken
@ 2000-09-26 16:04                                                             ` Stephen C. Tweedie
  2000-09-26 17:02                                                               ` Erik Andersen
  0 siblings, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-26 16:04 UTC (permalink / raw)
  To: yodaiken; +Cc: MM mailing list, linux-kernel

Hi,

On Tue, Sep 26, 2000 at 09:17:44AM -0600, yodaiken@fsmlabs.com wrote:

> Operating systems cannot make more memory appear by magic.
> The question is really about the best strategy for dealing with low memory. In my
> opinion, the OS should not try to out-think physical limitations. Instead, the OS 
> should take as little space as possible and provide the ability for user level 
> clever management of space. In a truly embedded system, there can easily be a user level
> root process that watches memory usage and prevents DOS attacks -- if the OS provides
> settable enforced quotas etc. 

Agreed, absolutely.  The beancounter is one approach to those quotas,
and has the advantage of allowing per-user as well as per-process
quotas.

--Stephen

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: the new VMt
  2000-09-26 16:04                                                             ` Stephen C. Tweedie
@ 2000-09-26 17:02                                                               ` Erik Andersen
  2000-09-26 17:08                                                                 ` Stephen C. Tweedie
  0 siblings, 1 reply; 243+ messages in thread
From: Erik Andersen @ 2000-09-26 17:02 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: yodaiken, MM mailing list, linux-kernel

On Tue Sep 26, 2000 at 05:04:06PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Tue, Sep 26, 2000 at 09:17:44AM -0600, yodaiken@fsmlabs.com wrote:
> 
> > Operating systems cannot make more memory appear by magic.
> > The question is really about the best strategy for dealing with low memory. In my
> > opinion, the OS should not try to out-think physical limitations. Instead, the OS 
> > should take as little space as possible and provide the ability for user level 
> > clever management of space. In a truly embedded system, there can easily be a user level
> > root process that watches memory usage and prevents DOS attacks -- if the OS provides
> > settable enforced quotas etc. 
> 
> Agreed, absolutely.  The beancounter is one approach to those quotas,
> and has the advantage of allowing per-user as well as per-process
> quotas.

Another approach would be to let user space turn off overcommit.  
That way, user space can be assured there will be no surprises...

 -Erik

--
Erik B. Andersen   email:  andersee@debian.org
--This message was written using 73% post-consumer electrons--

* Re: the new VMt
  2000-09-26 17:02                                                               ` Erik Andersen
@ 2000-09-26 17:08                                                                 ` Stephen C. Tweedie
  2000-09-26 17:45                                                                   ` Erik Andersen
  2000-09-26 21:13                                                                   ` Eric Lowe
  0 siblings, 2 replies; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-26 17:08 UTC (permalink / raw)
  To: Stephen C. Tweedie, yodaiken, MM mailing list, linux-kernel

Hi,

On Tue, Sep 26, 2000 at 11:02:48AM -0600, Erik Andersen wrote:

> Another approach would be to let user space turn off overcommit.  

No.  Overcommit only applies to pageable memory.  Beancounter is
really needed for non-pageable resources such as page tables and
mlock()ed pages.

Cheers,
 Stephen

* Re: the new VMt
  2000-09-26 17:08                                                                 ` Stephen C. Tweedie
@ 2000-09-26 17:45                                                                   ` Erik Andersen
  2000-09-27 10:20                                                                     ` Andrey Savochkin
  2000-09-26 21:13                                                                   ` Eric Lowe
  1 sibling, 1 reply; 243+ messages in thread
From: Erik Andersen @ 2000-09-26 17:45 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: yodaiken, MM mailing list, linux-kernel

On Tue Sep 26, 2000 at 06:08:20PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Tue, Sep 26, 2000 at 11:02:48AM -0600, Erik Andersen wrote:
> 
> > Another approach would be to let user space turn off overcommit.  
> 
> No.  Overcommit only applies to pageable memory.  Beancounter is
> really needed for non-pageable resources such as page tables and
> mlock()ed pages.

I think we do agree here, though we are having problems with semantics.

"Overcommit" to me is the same thing as Mark Hemment stated earlier in this
thread -- the "fact that the system has over committed its memory resources.
ie. it has sold too many tickets for the number of seats in the plane, and all
the passengers have turned up."   Basically any case where too many tickets
have been sold (applied to the entire system, and all subsystems).

To extend the airplane metaphor a bit past credibility...

When an airline sells too many tickets, it bribes people to get off the plane.
For the kernel, it tends to fall over, or starts kicking off pilots and flight
attendants.

If the Beancounter patch lets the kernel count "passengers", classify them
(with user hinting) so the pilot and flight attendants (init, X, or whatever)
always stay on the plane, and has some sane predictable mechanism for booting
non-privileged passengers, then I am all for it.  

How does one provide the kernel with hints as to which processes are sacred?
Where does one find this beancounter patch?   How much weight does it add to
the kernel? 

 -Erik

--
Erik B. Andersen   email:  andersee@debian.org
--This message was written using 73% post-consumer electrons--

* Re: the new VMt
  2000-09-26 17:45                                                                   ` Erik Andersen
@ 2000-09-27 10:20                                                                     ` Andrey Savochkin
  0 siblings, 0 replies; 243+ messages in thread
From: Andrey Savochkin @ 2000-09-27 10:20 UTC (permalink / raw)
  To: Erik Andersen; +Cc: Stephen C. Tweedie, yodaiken, MM mailing list, linux-kernel

On Tue, Sep 26, 2000 at 11:45:02AM -0600, Erik Andersen wrote:
[snip]
> "Overcommit" to me is the same thing as Mark Hemment stated earlier in this
> thread -- the "fact that the system has over committed its memory resources.
> ie. it has sold too many tickets for the number of seats in the plane, and all
> the passengers have turned up."   Basically any case where too many tickets
> have been sold (applied to the entire system, and all subsystems).
[snip]
> If the Beancounter patch lets the kernel count "passengers", classify them
> (with user hinting) so the pilot and flight attendants (init, X, or whatever)
> always stay on the plane, and has some sane predictable mechanism for booting
> non-privileged passengers, then I am all for it.  

That's exactly what I'm doing.

> How does one provide the kernel with hints as to which processes are sacred?
> Where does one find this beancounter patch?   How much weight does it add to
> the kernel? 

ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/UserBeancounter.html

The current version has some drawbacks, and one of them is performance.
Memory accounting is implemented as a kernel thread which goes through the
page tables of processes (similar to kswapd), and it appears to consume 1-5%
of CPU (depending on the number of processes).  I consider that unacceptable,
and have started reimplementing the process memory accounting from scratch.

Best regards
		Andrey

* Re: the new VMt
  2000-09-26 17:08                                                                 ` Stephen C. Tweedie
  2000-09-26 17:45                                                                   ` Erik Andersen
@ 2000-09-26 21:13                                                                   ` Eric Lowe
  1 sibling, 0 replies; 243+ messages in thread
From: Eric Lowe @ 2000-09-26 21:13 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: yodaiken, MM mailing list, linux-kernel

Hello,

> > Another approach would be to let user space turn off overcommit.  
> 
> No.  Overcommit only applies to pageable memory.  Beancounter is
> really needed for non-pageable resources such as page tables and
> mlock()ed pages.
> 

In addition to beancounter, do you think pageable page tables are
something we want to tackle in 2.5.x?  4MB page mappings on x86
could be cool too, as an option...

--
Eric Lowe
FibreChannel Software Engineer, Systran Corporation
elowe@systran.com



* Re: the new VMt
  2000-09-25 17:51                                           ` yodaiken
  2000-09-25 18:04                                             ` Jamie Lokier
@ 2000-09-25 18:20                                             ` Andrea Arcangeli
  1 sibling, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 18:20 UTC (permalink / raw)
  To: yodaiken
  Cc: Jamie Lokier, Stephen C. Tweedie, Alan Cox, mingo,
	Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 11:51:39AM -0600, yodaiken@fsmlabs.com wrote:
> It should probably be GFP_ATOMIC, if I understand the mm right. 

poll_wait is called from the f_op->poll callback from select just before
a sleep, and since it's allowed to sleep too it should be GFP_KERNEL
(not ATOMIC). Using GFP_ATOMIC where GFP_KERNEL can be used is a bug,
and it can lead to failed allocations even while there's a huge amount
of freeable/recyclable cache.

The reason it's GFP_KERNEL rather than GFP_USER is that the memory
isn't allocated in userspace.

On a solid VM the only difference between GFP_USER and GFP_KERNEL happens to be
when the machine runs truly out of memory. In 2.4.x GFP_KERNEL should probably
be changed not to short the PF_MEMALLOC atomic queue when memory balancing
fails (then they would be equal).

> The algorithm for requesting a collection of resources and freeing all of them
>  on failure is simple, fast, and robust. 

Yes, I tend to like that style too because it's obviously safe and it obviously
can't deadlock during oom.
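
A minimal userspace sketch of that request-everything-or-free-everything
style follows. It is an illustration only, not kernel code: the toy pool and
the names pool_take, pool_give and take_all_or_nothing are hypothetical. The
point it tries to show is that a caller never sleeps while holding a partial
set of resources, so it either gets the whole set or returns an error with
nothing held.

```c
#include <assert.h>
#include <stddef.h>

/* Toy pool standing in for free memory; purely illustrative. */
struct pool {
	size_t free_kb;
};

/* Non-blocking grab: fails immediately instead of sleeping. */
static int pool_take(struct pool *p, size_t kb)
{
	if (kb > p->free_kb)
		return -1;
	p->free_kb -= kb;
	return 0;
}

/* Return a previously taken amount to the pool. */
static void pool_give(struct pool *p, size_t kb)
{
	p->free_kb += kb;
}

/*
 * Take every request in reqs[] or none of them.
 * Returns 0 on success, -1 after rolling back on failure.
 */
int take_all_or_nothing(struct pool *p, const size_t *reqs, size_t n)
{
	assert(p != NULL && (reqs != NULL || n == 0));

	for (size_t i = 0; i < n; i++) {
		if (pool_take(p, reqs[i]) != 0) {
			/* Undo everything obtained so far, newest first. */
			while (i > 0)
				pool_give(p, reqs[--i]);
			return -1;
		}
	}
	return 0;
}
```

With a 48K pool, a caller asking for {32, 16} gets both or neither, and a
caller that cannot be satisfied leaves the pool exactly as it found it, so
it can back off and retry later instead of blocking.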

Andrea

* Re: the new VMt
  2000-09-25 15:16                                   ` the new VMt Alan Cox
                                                       ` (2 preceding siblings ...)
  2000-09-25 15:42                                     ` Stephen C. Tweedie
@ 2000-09-25 16:16                                     ` Rik van Riel
  2000-09-25 16:55                                       ` Alan Cox
  3 siblings, 1 reply; 243+ messages in thread
From: Rik van Riel @ 2000-09-25 16:16 UTC (permalink / raw)
  To: Alan Cox
  Cc: mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Alan Cox wrote:

> > > GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get
> > > everything jammed in kernel space waiting on GFP_KERNEL and if the
> > > swapper cannot make space you die.
> > 
> > if one can get everything jammed waiting for GFP_KERNEL, and not being
> > able to deallocate anything, thats a VM or resource-limit bug. This
> > situation is just 1% RAM away from the 'root cannot log in', situation.
> 
> Unless Im missing something here think about this case
> 
> 2 active processes, no swap
> 
> #1					#2
> kmalloc 32K				kmalloc 16K
> OK					OK
> kmalloc 16K				kmalloc 32K
> block					block
> 
> so GFP_KERNEL has to be able to fail - it can wait for I/O in
> some cases with care, but when we have no pages left something
> has to give

The trick here is to:
1) keep some reserved pages around for PF_MEMALLOC tasks
   (we need this anyway)
2) set PF_MEMALLOC on the task you're killing for OOM,
   that way this task will either get the memory or
   fail (note that PF_MEMALLOC tasks don't wait)

This way the OOM-killed task will be able to exit quickly
and the rest of the system will not get killed as a side
effect.
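
The two points above can be modelled in a few lines of userspace C. This is
only a sketch under stated assumptions: TASK_MEMALLOC stands in for
PF_MEMALLOC, and zone_model, model_alloc and the three result codes are
hypothetical names, not the real allocator. Ordinary callers must leave the
reserve untouched (and would have to wait), while a flagged task may dip into
the reserve and either succeeds or fails at once.

```c
#include <assert.h>
#include <stddef.h>

#define TASK_MEMALLOC 0x1u	/* stands in for PF_MEMALLOC (assumption) */

enum alloc_result { ALLOC_OK, ALLOC_WOULD_WAIT, ALLOC_FAIL };

struct zone_model {
	size_t free_pages;
	size_t reserved_pages;	/* only TASK_MEMALLOC may go below this */
};

enum alloc_result model_alloc(struct zone_model *z, unsigned int task_flags,
			      size_t pages)
{
	assert(z != NULL);

	if (task_flags & TASK_MEMALLOC) {
		/* Privileged task: may use the reserve, never waits. */
		if (pages > z->free_pages)
			return ALLOC_FAIL;
		z->free_pages -= pages;
		return ALLOC_OK;
	}

	/* Ordinary task: may only use what lies above the reserve. */
	if (z->free_pages < z->reserved_pages ||
	    pages > z->free_pages - z->reserved_pages)
		return ALLOC_WOULD_WAIT;

	z->free_pages -= pages;
	return ALLOC_OK;
}
```

In this model, marking the OOM victim with TASK_MEMALLOC lets its last few
allocations come out of the reserve, so it can run its exit path instead of
queueing up behind everyone else.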

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


* Re: the new VMt
  2000-09-25 16:16                                     ` Rik van Riel
@ 2000-09-25 16:55                                       ` Alan Cox
  0 siblings, 0 replies; 243+ messages in thread
From: Alan Cox @ 2000-09-25 16:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, mingo, Andrea Arcangeli, Marcelo Tosatti,
	Linus Torvalds, Roger Larsson, MM mailing list, linux-kernel

> > kmalloc 16K				kmalloc 32K
> > block					block
> > 
> 2) set PF_MEMALLOC on the task you're killing for OOM,
>    that way this task will either get the memory or
>    fail (note that PF_MEMALLOC tasks don't wait)

Nobody is out of memory at this point. Everyone is in kernel space blocking
for someone else. There is also no further allocation after this deadlock 
point to cause a kill


* Re: the new VM
  2000-09-25 15:16                                 ` Ingo Molnar
  2000-09-25 15:16                                   ` the new VMt Alan Cox
@ 2000-09-25 15:48                                   ` Andrea Arcangeli
  1 sibling, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 15:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 05:16:06PM +0200, Ingo Molnar wrote:
> situation is just 1% RAM away from the 'root cannot log in', situation.

The "root cannot log in" case is a little different. Just think that in the
"root cannot log in" case you only need to press SYSRQ+E (or at worst +I).

If all tasks in the system are hanging in the GFP loop, SYSRQ+I won't solve
the deadlock.

OK, you can add a signal check in the memory balancing code, but that looks
like an ugly hack, and it shows the difference between the two cases: the one
Alan pointed out is a real deadlock, while the current one is a kind of
livelock that can go away at any time. The deadlock can reach the point where
it can't be recovered without a hack from an irq somewhere.

Andrea

* Re: the new VM
  2000-09-25 14:47                               ` Alan Cox
  2000-09-25 15:16                                 ` Ingo Molnar
@ 2000-09-25 15:40                                 ` Stephen C. Tweedie
  2000-09-25 16:01                                   ` Andrea Arcangeli
  1 sibling, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 15:40 UTC (permalink / raw)
  To: Alan Cox
  Cc: mingo, Andrea Arcangeli, Marcelo Tosatti, Linus Torvalds,
	Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 03:47:03PM +0100, Alan Cox wrote:
> 
> GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get everything
> jammed in kernel space waiting on GFP_KERNEL and if the swapper cannot make
> space you die.

We already have PF_MEMALLOC to provide a last-chance allocation pool
which only the swapper can eat into. 

The critical thing is to avoid having the swapper itself deadlock.
Everything revolves around that.  Once you can make that guarantee,
it's perfectly safe to make GFP_KERNEL succeed for other callers, just
as long as you have enough beancounting in place in those callers.

Right now, the biggest obstacle to this is the GFP_ATOMIC behaviour:

	/*
	 * Final phase: allocate anything we can!
	 *
	 * This is basically reserved for PF_MEMALLOC and
	 * GFP_ATOMIC allocations...
	 */

Allowing GFP_ATOMIC to eat PF_MEMALLOC's last-chance pages is the
wrong thing to do if we want to guarantee swapper progress under
extreme load.

Cheers,
 Stephen

* Re: the new VM
  2000-09-25 15:40                                 ` Stephen C. Tweedie
@ 2000-09-25 16:01                                   ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 16:01 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alan Cox, mingo, Marcelo Tosatti, Linus Torvalds, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:40:44PM +0100, Stephen C. Tweedie wrote:
> Allowing GFP_ATOMIC to eat PF_MEMALLOC's last-chance pages is the
> wrong thing to do if we want to guarantee swapper progress under
> extreme load.

You're definitely right. We at least need a guarantee of the memory to
allocate the bhs on top of the swap cache while we attempt to swap out one page
(that path can't fail at the moment).

Andrea

* Re: the new VM
  2000-09-25 13:08                           ` Andrea Arcangeli
  2000-09-25 13:12                             ` Ingo Molnar
@ 2000-09-25 14:37                             ` Rik van Riel
  2000-09-25 20:34                               ` Christoph Rohland
  1 sibling, 1 reply; 243+ messages in thread
From: Rik van Riel @ 2000-09-25 14:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Marcelo Tosatti, Linus Torvalds, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 03:02:58PM +0200, Ingo Molnar wrote:
> > On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > 
> > > Sorry I totally disagree. If GFP_KERNEL is guaranteed to succeed
> > > that is a showstopper bug. [...]
> > 
> > why?
> 
> Because as you said the machine can lockup when you run out of memory.

The fix for this is to kill a user process when you're OOM
(you need to do this anyway).

The last few allocations of the "condemned" process can come
from the reserved pages and the process we killed will exit just
fine.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


* Re: the new VM
  2000-09-25 14:37                             ` Rik van Riel
@ 2000-09-25 20:34                               ` Christoph Rohland
  2000-10-06 16:14                                 ` Rik van Riel
  0 siblings, 1 reply; 243+ messages in thread
From: Christoph Rohland @ 2000-09-25 20:34 UTC (permalink / raw)
  To: Rik van Riel; +Cc: MM mailing list, linux-kernel

Hi Rik,

Rik van Riel <riel@conectiva.com.br> writes:

> > Because as you said the machine can lockup when you run out of memory.
> 
> The fix for this is to kill a user process when you're OOM
> (you need to do this anyway).
> 
> The last few allocations of the "condemned" process can come
> from the reserved pages and the process we killed will exit just
> fine.

It's slightly offtopic, but you should think about detached shm
segments in your OOM killer. As many of the high end applications like
databases and e.g. SAP have most of the memory in shm segments you
easily end up killing a lot of processes without freeing a lot of
memory. I see this often in my shm tests.

Greetings
                Christoph


* Re: the new VM
  2000-09-25 20:34                               ` Christoph Rohland
@ 2000-10-06 16:14                                 ` Rik van Riel
  2000-10-09  7:37                                   ` Christoph Rohland
  0 siblings, 1 reply; 243+ messages in thread
From: Rik van Riel @ 2000-10-06 16:14 UTC (permalink / raw)
  To: Christoph Rohland; +Cc: MM mailing list, linux-kernel

[replying to a really old email now that I've started work
 on integrating the OOM handler]

On 25 Sep 2000, Christoph Rohland wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
> 
> > > Because as you said the machine can lockup when you run out of memory.
> > 
> > The fix for this is to kill a user process when you're OOM
> > (you need to do this anyway).
> > 
> > The last few allocations of the "condemned" process can come
> > from the reserved pages and the process we killed will exit just
> > fine.
> 
> It's slightly offtopic, but you should think about detached shm
> segments in your OOM killer. As many of the high end
> applications like databases and e.g. SAP have most of the memory
> in shm segments you easily end up killing a lot of processes
> without freeing a lot of memory. I see this often in my shm
> tests.

Hmmm, could you help me with drawing up a selection algorithm
on how to choose which SHM segment to destroy when we run OOM?

The criteria would be about the same as with normal programs:

1) minimise the amount of work lost
2) try to protect 'innocent' stuff
3) try to kill only one thing
4) don't surprise the user, but choose something that
   the user will expect to be killed/destroyed

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/


* Re: the new VM
  2000-10-06 16:14                                 ` Rik van Riel
@ 2000-10-09  7:37                                   ` Christoph Rohland
  0 siblings, 0 replies; 243+ messages in thread
From: Christoph Rohland @ 2000-10-09  7:37 UTC (permalink / raw)
  To: Rik van Riel; +Cc: MM mailing list, linux-kernel

Rik van Riel <riel@conectiva.com.br> writes:

> Hmmm, could you help me with drawing up a selection algorithm
> on how to choose which SHM segment to destroy when we run OOM?
> 
> The criteria would be about the same as with normal programs:
> 
> 1) minimise the amount of work lost
> 2) try to protect 'innocent' stuff
> 3) try to kill only one thing
> 4) don't surprise the user, but choose something that
>    the user will expect to be killed/destroyed

First we only kill segments with no attachees. There are circumstances
under normal load where you have these. (SAP R/3 will do this all the
time on Linux 2.4) 

So perhaps we could signal shm that we killed a process and let it try
to find a segment where this process was the last attachee. This would
be a good candidate.

If this does not help either we could do two different things:
1) kill the biggest nonattached segment
2) kill the segment which was longest detached
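
The selection order described above could be sketched like this in plain C.
It is a hedged illustration only: shm_seg_model and pick_shm_victim are
hypothetical names, and the fields only loosely mirror what the kernel tracks
per segment. The sketch considers only segments with no attachees, prefers
one whose last attachee was the process just killed, and otherwise falls back
to option 1 (the biggest unattached segment).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-segment bookkeeping, not the kernel's structure. */
struct shm_seg_model {
	int    nattch;		/* number of current attachees */
	int    last_detach_pid;	/* pid of the last process to detach */
	size_t bytes;		/* segment size */
};

/*
 * Return the index of the segment to destroy, or -1 if every
 * segment is still attached somewhere.
 */
int pick_shm_victim(const struct shm_seg_model *segs, size_t n,
		    int killed_pid)
{
	int biggest = -1;

	assert(segs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		/* First rule: never touch a segment that is in use. */
		if (segs[i].nattch != 0)
			continue;

		/* Best candidate: the killed process was the last attachee. */
		if (segs[i].last_detach_pid == killed_pid)
			return (int)i;

		/* Fallback (option 1): remember the biggest unattached one. */
		if (biggest < 0 || segs[i].bytes > segs[biggest].bytes)
			biggest = (int)i;
	}
	return biggest;
}
```

Option 2 would be the same loop with a detach timestamp compared in place of
the size.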

Greetings
		Christoph


* Re: the new VM
  2000-09-25 13:02                       ` Andrea Arcangeli
  2000-09-25 13:02                         ` Ingo Molnar
@ 2000-09-25 13:04                         ` Ingo Molnar
  2000-09-25 13:19                           ` Andrea Arcangeli
  1 sibling, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 13:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> Please fix raid1 instead of making things worse.

huh, what do you mean?

	Ingo


* Re: the new VM
  2000-09-25 13:04                         ` Ingo Molnar
@ 2000-09-25 13:19                           ` Andrea Arcangeli
  2000-09-25 13:18                             ` Ingo Molnar
  2000-09-25 13:21                             ` Ingo Molnar
  0 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 13:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:04:10PM +0200, Ingo Molnar wrote:
> 
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> 
> > Please fix raid1 instead of making things worse.
> 
> huh, what do you mean?

I mean this:

		while (!( /* FIXME: now we are rather fault tolerant than nice */
		mirror_bh[i] = kmalloc (sizeof (struct buffer_head), GFP_KERNEL)
		) )

I've seen that in the 2.4.0-test9-pre6 raid1 code the above is gone (this
looks very promising :), it is at least proof that some care is being taken
about the deadlock) and you instead sleep on a waitqueue now. While it's not
obvious at all that sleeping on the waitqueue is not deadlock prone (for
example getblk sleeps on a waitqueue but it's deadlock prone too), at least
it's not an infinite loop anymore and that's still better.

Is it safe to sleep on the waitqueue in the kmalloc fail path in raid1?

Andrea

* Re: the new VM
  2000-09-25 13:19                           ` Andrea Arcangeli
@ 2000-09-25 13:18                             ` Ingo Molnar
  2000-09-25 13:21                             ` Ingo Molnar
  1 sibling, 0 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 13:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > huh, what do you mean?
> 
> I mean this:
> 
> 		while (!( /* FIXME: now we are rather fault tolerant than nice */

this is fixed in 2.4. The 2.2 RAID code is frozen, and has known
limitations (ie. due to the above RAID1 cannot be used as a swap-device).

	Ingo


* Re: the new VM
  2000-09-25 13:19                           ` Andrea Arcangeli
  2000-09-25 13:18                             ` Ingo Molnar
@ 2000-09-25 13:21                             ` Ingo Molnar
  2000-09-25 13:31                               ` Andrea Arcangeli
  1 sibling, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 13:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> Is it safe to sleep on the waitqueue in the kmalloc fail path in
> raid1?

yes. every RAID1-bh has a bound lifetime. (bound by worst-case IO
latencies)

	Ingo


* Re: the new VM
  2000-09-25 13:21                             ` Ingo Molnar
@ 2000-09-25 13:31                               ` Andrea Arcangeli
  2000-09-25 13:47                                 ` Ingo Molnar
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 13:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:21:01PM +0200, Ingo Molnar wrote:
> yes. every RAID1-bh has a bound lifetime. (bound by worst-case IO
> latencies)

Very good! Many thanks Ingo.

Andrea

* Re: the new VM
  2000-09-25 13:31                               ` Andrea Arcangeli
@ 2000-09-25 13:47                                 ` Ingo Molnar
  2000-09-25 14:04                                   ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 13:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > yes. every RAID1-bh has a bound lifetime. (bound by worst-case IO
> > latencies)
> 
> Very good! Many thanks Ingo.

this was actually coded/fixed by Neil Brown - so the kudos go to him!

	Ingo


* Re: the new VM
  2000-09-25 13:47                                 ` Ingo Molnar
@ 2000-09-25 14:04                                   ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 14:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Marcelo Tosatti, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:47:57PM +0200, Ingo Molnar wrote:
> this was actually coded/fixed by Neil Brown - so the kudos go to him!

Indeed :).

Andrea

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  0:09                 ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds
  2000-09-25  0:49                   ` Alexander Viro
  2000-09-25  0:53                   ` Marcelo Tosatti
@ 2000-09-25  1:31                   ` Andrea Arcangeli
  2000-09-25  1:27                     ` Alexander Viro
  2000-09-25 10:13                     ` Ingo Molnar
  2 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25  1:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Rik van Riel, Roger Larsson, MM mailing list, linux-kernel

On Sun, Sep 24, 2000 at 05:09:40PM -0700, Linus Torvalds wrote:
> [..] as with the
> shm_swap() thing this is probably something we do want to fix eventually.

both shm_swap and regular rw_swap_cache have the same deadlock problem
w.r.t. __GFP_IO. We could do that on a raw device, but if we swap on top of the
filesystem then we could have deadlock problems again.  Really since with the
swapfile blocks are just allocated with ext2 we should not deadlock (but maybe
some other fs have a lock_super in the get_block path anyway). Thus it's safer
not to swapout anything when __GFP_IO is not set.

Also some linux/net/* code is using (or better abusing since __GFP_IO
originally was only meant as a deadlock avoidance thing not a thing
to only shrink the clean cache) GFP_BUFFER to not block (so actually
we would hurt networking too by causing _any_ kind of block in a GFP_BUFFER
allocation).

It would have been better to introduce a new flag for allocations that must not
block for latency requirements but that still want to shrink the clean cache
(instead of finishing the atomic queue). This is trivially fixable by grepping
for GFP_BUFFER.

> The icache shrinker probably has similar problems with clear_inode.

Yep. And it sure does blocking I/O because it has to sync the dirty
inodes.
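
The rule discussed above could be sketched roughly as follows. This is a toy
userspace model, not kernel code: the flag values, struct reclaim_plan and
plan_reclaim() are assumptions made up for illustration, and the only thing
taken from the discussion is that swap-out and icache shrinking may block on
I/O while dropping clean cache does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the kernel's values. */
#define SK_GFP_WAIT 0x1u	/* caller may sleep */
#define SK_GFP_IO   0x2u	/* caller may start I/O and take fs locks */

/* What a reclaim pass is allowed to try for a given allocation. */
struct reclaim_plan {
	bool drop_clean_cache;	/* never blocks, always allowed */
	bool swap_out;		/* may reach lock_super via get_block */
	bool shrink_icache;	/* has to sync dirty inodes */
};

struct reclaim_plan plan_reclaim(unsigned int gfp_mask)
{
	struct reclaim_plan plan = { true, false, false };

	/* Anything that can block on I/O is only tried with the IO bit. */
	if (gfp_mask & SK_GFP_IO) {
		/* Sketch assumption: a caller allowed to do I/O may also sleep. */
		assert(gfp_mask & SK_GFP_WAIT);
		plan.swap_out = true;
		plan.shrink_icache = true;
	}
	return plan;
}
```

In this model a GFP_BUFFER-like mask (SK_GFP_WAIT only) can still drop clean
cache but never swaps out or touches the icache.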

> I suspect that it might be a good idea to try to fix this issue, because
> it will probably keep coming up otherwise. And it's likely to be fairly
> easily debugged, by just making getblk() have some debugging code that
> basically says something like
> 
> 	lock_super()
> 	{
> 		.. do the lock ..
> +		current->super_locked++;
> 	}
> 
> 	unlock_super()
> 	{
> +		if (current->super_locked < 1)
> +			BUG();
> +		current->super_locked--;
> 		.. do the unlock ..
> 	}
> 
> 	getblk()
> 	{
> +		if (current->super_locked)
> +			BUG();
> 		.. do the getblk ..
> 	}

BTW (running offtopic), I collected such information in 2.2.x too (but for
another reason).

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.18pre9/VM-global-2.2.18pre9-6.bz2

I trapped all the down on the inode semaphore in the same way (I called it
current->fs_locks for both down and superlock).

I'm using such information to know if there's any lock held in the context
of the task to know if I can do I/O or not without risking to deadlock
on any inode semaphore or on any superblock lock.
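
A minimal userspace sketch of that bookkeeping, assuming a per-task counter
called fs_locks as described above. struct task, fs_lock_acquire(),
fs_lock_release() and can_do_io() are illustrative names, not the ones used in
the 2.2.x patch, and the assert() stands in for the BUG() check quoted earlier.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task state; fs_locks counts the
 * superblock locks and inode semaphores this task currently holds. */
struct task {
	int fs_locks;
};

/* Would wrap lock_super() and down() on the inode semaphore. */
void fs_lock_acquire(struct task *tsk)
{
	/* .. take the real lock here .. */
	tsk->fs_locks++;
}

/* Would wrap unlock_super() and up() on the inode semaphore. */
void fs_lock_release(struct task *tsk)
{
	/* Releasing a lock we never counted is a bug. */
	assert(tsk->fs_locks > 0);
	tsk->fs_locks--;
	/* .. drop the real lock here .. */
}

/* The allocator may start blocking I/O on behalf of this task only if
 * the task holds no filesystem lock that the I/O path could need. */
bool can_do_io(const struct task *tsk)
{
	return tsk->fs_locks == 0;
}
```

A task that has called fs_lock_acquire() once gets false from can_do_io()
until the matching fs_lock_release().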

With that change I could then also use GFP_KERNEL in getblk in 2.2.x (I admit
at first I did that :), but then I preferred to stay on the safe side
for things like loop that _have_ to work in 2.2.x :).

So now we know when we can writepage a dirty MAP_SHARED page in swap_out and we
do it from the task that is trying to allocate memory, so the task that is
trying to allocate memory will block waiting for some dirty buffer to be written in
writepage->wakeup_bdflush(1).

In 2.2.x (as we do in 2.4.x) we _need_ to write out the page ourselves from
swapout (not async queueing into kpiod) because kpiod is completely asynchronous
and so without this change GFP was returning, we were allocating memory again,
and we were entering GFP again, all at a fast rate.  In the meantime kpiod was
still blocked in mark_buffer_dirty->wakeup_bdflush(1) and then the tasks
allocating memory (which thought they had made some progress because they had
queued many pages into kpiod) were getting killed.

Of course then I also killed kpiod since it wasn't necessary anymore and now
MAP_SHARED segments don't kill tasks anymore.

> and just making it a new rule that you cannot call getblk() with any locks
> held.

Yes I see it would certainly trap the deadlock cases.

> (the superblock lock is quite contended right now, and the reason for that

Right (on a large fs it is going to be quite painful for scalability) and the
BUG would have the benefit of partly solving it.

I'm thinking that dropping the superblock lock completely wouldn't be much more
difficult than this mid stage.  The only cases where we block in critical
sections protected by the superblock lock are in getblk/bread (bread calls
getblk) and ll_rw_block and mark_buffer_dirty.  Once we drop the lock for the
first cases it should not be more difficult to drop it completely.

Not sure if this is the right moment for those changes though, I'm not worried
about ext2 but about the other non-networked fses that nobody uses regularly.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  1:31                   ` [patch] vmfixes-2.4.0-test9-B2 Andrea Arcangeli
@ 2000-09-25  1:27                     ` Alexander Viro
  2000-09-25  2:02                       ` Andrea Arcangeli
  2000-09-25 10:13                     ` Ingo Molnar
  1 sibling, 1 reply; 243+ messages in thread
From: Alexander Viro @ 2000-09-25  1:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Ingo Molnar, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel


On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> I'm thinking that dropping the superblock lock completely wouldn't be much more
> difficult than this mid stage.  The only cases where we block in critical
> sections protected by the superblock lock are in getblk/bread (bread calls
> getblk) and ll_rw_block and mark_buffer_dirty.  Once we drop the lock for the
> first cases it should not be more difficult to drop it completely.

ext2_new_block->dquot_alloc_block->lock_dquot

ext2_new_block->dquot_alloc_block->check_bdq->print_warning->tty_write_message


> Not sure if this is the right moment for those changes though, I'm not worried
> about ext2 but about the other non-networked fses that nobody uses regularly.

So help testing the patches to them. Arrgh...


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  1:27                     ` Alexander Viro
@ 2000-09-25  2:02                       ` Andrea Arcangeli
  2000-09-25  2:01                         ` Alexander Viro
  2000-09-25 13:47                         ` Stephen C. Tweedie
  0 siblings, 2 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25  2:02 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, Ingo Molnar, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Sun, Sep 24, 2000 at 09:27:39PM -0400, Alexander Viro wrote:
> So help testing the patches to them. Arrgh...

I think I'd better fix the bugs that I know about before testing patches that
try to remove the superblock_lock at this stage. I guess you should
re-read the email from DaveM of two days ago.

Then I've a problem: I've no idea how I could test
adfs/affs/efs/hfs/hpfs/qnx4/sysv/udf.  If you send me by email or point out the
URL where I can find the source of the mkfs for all the above fs I will try to
add the tests in the regression test suite as soon as time permits so the
computer will do that job for me (that will be useful regardless of the
super-lock issue).

(if the mkfses are in common packages like mkfs.minix and mkfs.bfs no need to
send them of course)

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  2:02                       ` Andrea Arcangeli
@ 2000-09-25  2:01                         ` Alexander Viro
  2000-09-25 13:47                         ` Stephen C. Tweedie
  1 sibling, 0 replies; 243+ messages in thread
From: Alexander Viro @ 2000-09-25  2:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Ingo Molnar, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> On Sun, Sep 24, 2000 at 09:27:39PM -0400, Alexander Viro wrote:
> > So help testing the patches to them. Arrgh...
> 
> I think I'd better fix the bugs that I know about before testing patches that
> try to remove the superblock_lock at this stage. I guess you should
> re-read the email from DaveM of two days ago.

Erm... Did you miss the fact that minixfs/sysvfs/UFS are chock-full of
fs-corrupting races? Patch for minixfs had been posted 3 times during the
last couple of weeks, each time with [CFT] in subject. So far - 0
(zero) responses. I'm way past the stage when I gave a damn - it works
here and if I will not receive any bug reports it will go to Linus on
Tuesday.

And no, that stuff has nothing to do with lock_super(). But unless people will
test the patches posted on l-k and fsdevel - too fscking bad, stuff _will_
break.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  2:02                       ` Andrea Arcangeli
  2000-09-25  2:01                         ` Alexander Viro
@ 2000-09-25 13:47                         ` Stephen C. Tweedie
  1 sibling, 0 replies; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-25 13:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alexander Viro, Linus Torvalds, Ingo Molnar, Rik van Riel,
	Roger Larsson, MM mailing list, linux-kernel

Hi,

On Mon, Sep 25, 2000 at 04:02:30AM +0200, Andrea Arcangeli wrote:
> On Sun, Sep 24, 2000 at 09:27:39PM -0400, Alexander Viro wrote:
> > So help testing the patches to them. Arrgh...
> 
> I think I'd better fix the bugs that I know about before testing patches that
> try to remove the superblock_lock at this stage.

Right.  If we're introducing new deadlock possibilities, then sure we
can fix the obvious cases in ext2, but it will be next to impossible
to do a thorough audit of all of the other filesystems.  Adding in the
new shrink_icache loop into the VFS just feels too dangerous right
now.

Of course, that doesn't mean we shouldn't remove the excessive
superblock locking from ext2 --- rather, it is simply more robust to
keep the two issues separate.

--Stephen

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25  1:31                   ` [patch] vmfixes-2.4.0-test9-B2 Andrea Arcangeli
  2000-09-25  1:27                     ` Alexander Viro
@ 2000-09-25 10:13                     ` Ingo Molnar
  2000-09-25 12:58                       ` Andrea Arcangeli
  1 sibling, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 10:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> Not sure if this is the right moment for those changes though, I'm not
> worried about ext2 but about the other non-networked fses that nobody
> uses regularly.

it *is* the right moment to clean these issues up. These kinds of things
are what made the 2.2 VM a mess (everybody added his easy improvements,
without solving some of the conceptual problems), and frankly, instead of
yet another elevator algorithm we need a squeaky clean VM balancer above
all. Please help identifying, fixing, debugging and testing these VM
balancing issues. This is tough work and it needs to be done.

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 10:13                     ` Ingo Molnar
@ 2000-09-25 12:58                       ` Andrea Arcangeli
  2000-09-25 13:10                         ` Ingo Molnar
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 12:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, Sep 25, 2000 at 12:13:08PM +0200, Ingo Molnar wrote:
> 
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> 
> > Not sure if this is the right moment for those changes though, I'm not
> > worried about ext2 but about the other non-networked fses that nobody
> > uses regularly.
> 
> it *is* the right moment to clean these issues up. These kinds of things

I'm talking about the removal of the superblock lock from the filesystems.

Note: I don't have problems with the removal of the superblock lock even if
done at this stage, I'm not the one who can choose those things, it's Linus's
responsibility to take the final decision for the official tree, but don't ask
me to test patches that remove the superblock lock _at_this_stage_ before I
can run a stable and fast 2.4.x because I won't do that. Period.

> yet another elevator algorithm we need a squeaky clean VM balancer above

FYI: My current tree (based on 2.4.0-test8-pre5) delivers 16mbyte/sec in the
tiobench write test compared to clean 2.4.0-test8-pre5 that delivers 8mbyte/sec
instead with only blkdev layer changes in between the two kernels (and no
that's not a matter of the elevator since there are no seeks in the test
and I've not changed the elevator sorting algorithm during the bench).

Also I found the reason for your hang, it's the TASK_EXCLUSIVE in
wait_for_request. The high part of the queue is reserved for reads.
Now if a read completes and it wakes up a write you'll hang.

If you think I should delay those fixes to do something else I don't agree
sorry.

> all. Please help identifying, fixing, debugging and testing these VM
> balancing issues. This is tough work and it needs to be done.

I had an alternative VM, that I prefer from a design standpoint, I'll improve
it and I'll maintain it.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 12:58                       ` Andrea Arcangeli
@ 2000-09-25 13:10                         ` Ingo Molnar
  2000-09-25 13:49                           ` Jens Axboe
  2000-09-25 13:56                           ` Andrea Arcangeli
  0 siblings, 2 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 13:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > yet another elevator algorithm we need a squeaky clean VM balancer above
> 
> FYI: My current tree (based on 2.4.0-test8-pre5) delivers 16mbyte/sec
> in the tiobench write test compared to clean 2.4.0-test8-pre5 that
> delivers 8mbyte/sec

great! I'm happy we have a fine-tuned elevator again.

> Also I found the reason for your hang, it's the TASK_EXCLUSIVE in
> wait_for_request. The high part of the queue is reserved for reads.
> Now if a read completes and it wakes up a write you'll hang.

yep. But i dont understand why this makes any difference - the waitqueue
wakeup is FIFO, so any other request will eventually arrive. Could you
explain this bug a bit better?

> If you think I should delay those fixes to do something else I don't
> agree sorry.

no, i never meant it. I find it very good that those half-done changes are
cleaned up and the remaining bugs / performance problems are eliminated -
the first reports about bad write performance came right after the
original elevator patches went in, about 6 months ago.

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 13:10                         ` Ingo Molnar
@ 2000-09-25 13:49                           ` Jens Axboe
  2000-09-25 14:11                             ` Ingo Molnar
  2000-09-25 14:20                             ` Andrea Arcangeli
  2000-09-25 13:56                           ` Andrea Arcangeli
  1 sibling, 2 replies; 243+ messages in thread
From: Jens Axboe @ 2000-09-25 13:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25 2000, Ingo Molnar wrote:
> > If you think I should delay those fixes to do something else I don't
> > agree sorry.
> 
> no, i never meant it. I find it very good that those half-done changes are

The changes made were never half-done. The recent bug fixes have
mainly been to remove cruft from the earlier elevator and fixing a bug
where the elevator insert would screw up a bit. So I'd call that fine
tuning or adjusting, not fixing half-done stuff.

> cleaned up and the remaining bugs / performance problems are eliminated -

Of course

> the first reports about bad write performance came right after the
> original elevator patches went in, about 6 months ago.

And a new elevator was introduced some months ago to solve this.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 13:49                           ` Jens Axboe
@ 2000-09-25 14:11                             ` Ingo Molnar
  2000-09-25 14:05                               ` Jens Axboe
  2000-09-25 16:46                               ` Linus Torvalds
  2000-09-25 14:20                             ` Andrea Arcangeli
  1 sibling, 2 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 14:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Jens Axboe wrote:

> The changes made were never half-done. The recent bug fixes have
> mainly been to remove cruft from the earlier elevator and fixing a bug
> where the elevator insert would screw up a bit. So I'd call that fine
> tuning or adjusting, not fixing half-done stuff.

sorry i did not mean to offend you - unadjusted and unfixed stuff hanging
around in the kernel for months is 'half done' for me.

> > the first reports about bad write performance came right after the
> > original elevator patches went in, about 6 months ago.
> 
> And a new elevator was introduced some months ago to solve this.

and these are still not solved in the vanilla kernel, as recent complaints
on l-k prove.

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:11                             ` Ingo Molnar
@ 2000-09-25 14:05                               ` Jens Axboe
  2000-09-25 16:46                               ` Linus Torvalds
  1 sibling, 0 replies; 243+ messages in thread
From: Jens Axboe @ 2000-09-25 14:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrea Arcangeli, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25 2000, Ingo Molnar wrote:
> > The changes made were never half-done. The recent bug fixes have
> > mainly been to remove cruft from the earlier elevator and fixing a bug
> > where the elevator insert would screw up a bit. So I'd call that fine
> > tuning or adjusting, not fixing half-done stuff.
> 
> sorry i did not mean to offend you - unadjusted and unfixed stuff hanging
> around in the kernel for months is 'half done' for me.

No offense taken, I just tried to explain my view. And in light of
the bad test2, I'd like the new changes to not have any "issues". So
this work has been going on for the last month or so, and I think we are
finally getting to agreement on what needs to be done now and how. WIP.

> > And a new elevator was introduced some months ago to solve this.
> 
> and these are still not solved in the vanilla kernel, as recent complaints
> on l-k prove.

Different problems, though :(. However, I believe they are solved in
Andrea and my current tree. Just needs the final cleaning, more later.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:11                             ` Ingo Molnar
  2000-09-25 14:05                               ` Jens Axboe
@ 2000-09-25 16:46                               ` Linus Torvalds
  2000-09-25 17:05                                 ` Ingo Molnar
  1 sibling, 1 reply; 243+ messages in thread
From: Linus Torvalds @ 2000-09-25 16:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jens Axboe, Andrea Arcangeli, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Ingo Molnar wrote:
> > 
> > And a new elevator was introduced some months ago to solve this.
> 
> and these are still not solved in the vanilla kernel, as recent complaints
> on l-k prove.

THE ELEVATOR IS PROBABLY NOT THE PROBLEM.

People blame the elevator for bad IO performance. But the elevator is just
doing what it's told to do - and if it is told to do something bad, it
will do something bad.

The "something bad" is doing things like writing out 4 discontiguous
pages, waiting a while, and then writing out 4 more discontiguous pages.

There's nothing the elevator can do for that case - except just ignore the
write requests completely, and wait for more requests to come in. Which it
certainly could do, but that's really a policy question and should be
handled at a higher level. The elevator doesn't know if there is going to
be more writes.

In short, I bet that the problem is at least partly that bdflush is
broken, and doesn't do a good job of streaming writes. It's probably been
broken to get low latencies, and in order to avoid "choppy" behaviour. But
the elevator works _best_ with choppy behaviour, when there's a BIG stream
of requests at a time.
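
A toy model of that argument, under heavy assumptions: block numbers stand in
for pages, the only thing the "elevator" does is sort one batch and merge
adjacent blocks, and requests_after_merge() is a made-up helper, not the
kernel's elevator code. It only illustrates that merging can work on what is
in the queue at the same time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for block numbers. */
static int cmp_block(const void *a, const void *b)
{
	long x = *(const long *)a, y = *(const long *)b;

	return (x > y) - (x < y);
}

/* Number of disk requests left after sorting one batch and merging
 * blocks that are adjacent on disk.  Returns -1 if out of memory. */
int requests_after_merge(const long *blocks, size_t n)
{
	long *sorted;
	size_t i;
	int requests = 1;

	assert(blocks != NULL || n == 0);
	if (n == 0)
		return 0;

	sorted = malloc(n * sizeof(*sorted));
	if (!sorted)
		return -1;
	memcpy(sorted, blocks, n * sizeof(*sorted));
	qsort(sorted, n, sizeof(*sorted), cmp_block);

	/* Every gap between neighbours starts a new request. */
	for (i = 1; i < n; i++)
		if (sorted[i] != sorted[i - 1] + 1)
			requests++;

	free(sorted);
	return requests;
}
```

With blocks {10, 30, 50, 70} in one batch and {11, 31, 51, 71} in a later one,
each batch stays at 4 requests, 8 in total. Handing all 8 blocks over at once
lets the model merge them down to 4.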

Blaming the elevator is unfair and unrealistic. Look at the performance
reports - there was a good test-case that showed that read-performance was
fine but that writes to different parts of the filesystem just suck. Which
is _exactly_ what you'd expect if the elevator was fine but the writes
were blocked up by higher levels.

		Linus


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 16:46                               ` Linus Torvalds
@ 2000-09-25 17:05                                 ` Ingo Molnar
  2000-09-25 17:23                                   ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 17:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Andrea Arcangeli, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, 25 Sep 2000, Linus Torvalds wrote:

> Blaming the elevator is unfair and unrealistic. [...]

yep - and Jens i'm sorry about the outburst. Until a bug is found it's
unrealistic to blame anything.

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 17:05                                 ` Ingo Molnar
@ 2000-09-25 17:23                                   ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 17:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Jens Axboe, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 07:05:02PM +0200, Ingo Molnar wrote:
> yep - and Jens i'm sorry about the outburst. Until a bug is found it's
> unrealistic to blame anything.

I think the only bug maybe to blame in the elevator is the EXCLUSIVE wakeup
thing (and I've not benchmarked it alone to see if it makes any real world
performance difference but for sure its behaviour wasn't intentional). Anything
else related to the elevator internals should perform better than the old
elevator (aka the 2.2.15 one). The new elevator ordering algorithm returns me
much better numbers than the CSCAN one with tiobench. Also consider the latency
control at the moment is completely disabled by default so there are no barriers
unless you change that with elvtune.

Also I'm using -r 250 and -w 500 and it doesn't change really anything in the
numbers compared to too big values (but it fixes the starvation problem).

Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 13:49                           ` Jens Axboe
  2000-09-25 14:11                             ` Ingo Molnar
@ 2000-09-25 14:20                             ` Andrea Arcangeli
  2000-09-25 14:11                               ` Jens Axboe
  1 sibling, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 14:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 03:49:52PM +0200, Jens Axboe wrote:
> And a new elevator was introduced some months ago to solve this.

And now that I have done some benchmarks it seems the major optimization consists in
the implementation of the new _ordering_ algorithm in test2, not really from
the removal of the more fine-grained latency control (that said, I'm not going to
reintroduce the previous latency control, the current one doesn't provide great
latency but it's ok).

As soon as I patch my tree with Peter's perfect CSCAN ordering (that only changes
the ordering algorithm), tiotest performance drops significantly in the
2-thread-reading case. elvtune settings don't matter, that's only a matter of
the ordering.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:20                             ` Andrea Arcangeli
@ 2000-09-25 14:11                               ` Jens Axboe
  2000-09-25 14:33                                 ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Jens Axboe @ 2000-09-25 14:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > And a new elevator was introduced some months ago to solve this.
> 
> And now that I have done some benchmarks it seems the major optimization consists in
> the implementation of the new _ordering_ algorithm in test2, not really from
> the removal of the more fine-grained latency control (that said, I'm not going to
> reintroduce the previous latency control, the current one doesn't provide
> great latency but it's ok).

Yes, I found this the greatest improvement too.

> As soon as I patch my tree with Peter's perfect CSCAN ordering (that only changes
> the ordering algorithm), tiotest performance drops significantly in the
> 2-thread-reading case. elvtune settings don't matter, that's only a matter
> of the ordering.

Interesting. I haven't done any serious benching with the CSCAN introduction
in elevator_linus, I'll try that too.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:11                               ` Jens Axboe
@ 2000-09-25 14:33                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 14:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:11:34PM +0200, Jens Axboe wrote:
> Interesting. I haven't done any serious benching with the CSCAN introduction
> in elevator_linus, I'll try that too.

Changing only that, the performance decreased reproducibly from 16 to 14
mbyte/sec in the read test with 2 threads.

So far I'm testing only IDE with LVM striping on two equally fast disks on
separate IDE channels.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 13:10                         ` Ingo Molnar
  2000-09-25 13:49                           ` Jens Axboe
@ 2000-09-25 13:56                           ` Andrea Arcangeli
  2000-09-25 13:57                             ` Ingo Molnar
  1 sibling, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 13:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, Sep 25, 2000 at 03:10:51PM +0200, Ingo Molnar wrote:
> yep. But i dont understand why this makes any difference - the waitqueue

It makes a difference because your sleeping reads won't get the wakeup
even while they could queue their reserved read request (they have
to wait for the FIFO to roll or for some write to complete).
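
A toy model of the effect, not the real wait_for_request code: the waiters
are an array in FIFO order, one request slot has just been freed, and the
names req_type, can_use_slot(), wake_exclusive() and wake_all() are invented
for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum req_type { REQ_READ, REQ_WRITE };

/* A waiter can take the freed slot unless it is a write and the slot
 * lies in the part of the queue reserved for reads. */
bool can_use_slot(enum req_type type, bool slot_reserved_for_reads)
{
	return type == REQ_READ || !slot_reserved_for_reads;
}

/* Exclusive wakeup: only the first sleeper in FIFO order is woken.
 * Returns the index of the waiter that queues its request, or -1 if
 * the wakeup was wasted and everybody is asleep again. */
int wake_exclusive(const enum req_type *waiters, size_t n,
		   bool slot_reserved_for_reads)
{
	assert(waiters != NULL || n == 0);
	if (n == 0)
		return -1;
	return can_use_slot(waiters[0], slot_reserved_for_reads) ? 0 : -1;
}

/* Non-exclusive wakeup: every sleeper re-checks, first fit wins. */
int wake_all(const enum req_type *waiters, size_t n,
	     bool slot_reserved_for_reads)
{
	size_t i;

	assert(waiters != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (can_use_slot(waiters[i], slot_reserved_for_reads))
			return (int)i;
	return -1;
}
```

With a write sleeping at the head of the queue and a read behind it, a freed
slot from the reserved part gives -1 from wake_exclusive() (the write cannot
use it and the read never hears about it), while wake_all() lets the read at
index 1 go ahead.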

> wakeup is FIFO, so any other request will eventually arrive. Could you
> explain this bug a bit better?

Well it may not explain an infinite hang because as you say the write that got
the spurious wakeup will unplug the queue and after some time the reads will be
woken up. So maybe that wasn't the reason for your hangs because I remember your
problem looked more like an infinite hang that was only solved by kflushd
writing some more stuff and unplugging the queue as a side effect (however I'm
not sure since I never experienced those myself).

But I hope if it wasn't that one it's the below fix that will help:

Index: mm/filemap.c
===================================================================
RCS file: /home/andrea/cvs/linux/mm/filemap.c,v
retrieving revision 1.1.1.5.2.3
retrieving revision 1.1.1.5.2.4
diff -u -r1.1.1.5.2.3 -r1.1.1.5.2.4
--- mm/filemap.c	2000/09/21 03:11:53	1.1.1.5.2.3
+++ mm/filemap.c	2000/09/25 03:33:31	1.1.1.5.2.4
@@ -622,8 +622,8 @@

 	add_wait_queue(&page->wait, &wait);
 	do {
-		sync_page(page);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		sync_page(page);
 		if (!PageLocked(page))
 			break;
 		schedule();
Index: fs/buffer.c
===================================================================
RCS file: /home/andrea/cvs/linux/fs/buffer.c,v
retrieving revision 1.1.1.5.2.1
retrieving revision 1.1.1.5.2.2
diff -u -r1.1.1.5.2.1 -r1.1.1.5.2.2
--- fs/buffer.c	2000/09/06 19:57:51	1.1.1.5.2.1
+++ fs/buffer.c	2000/09/25 03:33:30	1.1.1.5.2.2
@@ -147,8 +147,8 @@
 	atomic_inc(&bh->b_count);
 	add_wait_queue(&bh->b_wait, &wait);
 	do {
-		run_task_queue(&tq_disk);
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		run_task_queue(&tq_disk);
 		if (!buffer_locked(bh))
 			break;
 		schedule();

Think if the buffer returns locked between set_task_state(tsk,
TASK_UNINTERRUPTIBLE) and if (!buffer_locked(bh)). The window is very small but
it looks like a genuine window for a deadlock. (and this one could sure explain
infinite hangs in read... even if it looks even less realistic than the
EXCLUSIVE task thing)

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 13:56                           ` Andrea Arcangeli
@ 2000-09-25 13:57                             ` Ingo Molnar
  2000-09-25 14:13                               ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 13:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> -		sync_page(page);
>  		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> +		sync_page(page);

> -		run_task_queue(&tq_disk);
>  		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> +		run_task_queue(&tq_disk);

these look like genuine fixes, but i don't think they can explain the hangs
i had yesterday - those were simple VM deadlocks. I don't see any deadlocks
today - but i'm running the unsafe B2 variant of the vmfixes patch. (and i
have no swapping enabled which simplifies my VM setup.)

but one of these two fixes could explain the slowdown i saw on and off for
quite some time, seeing very bad read performance occasionally. (do you
remember my sched.c tq_disk hack?)

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 13:57                             ` Ingo Molnar
@ 2000-09-25 14:13                               ` Andrea Arcangeli
  2000-09-25 14:08                                 ` Jens Axboe
                                                   ` (2 more replies)
  0 siblings, 3 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 14:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, Sep 25, 2000 at 03:57:31PM +0200, Ingo Molnar wrote:
> i had yesterday - those were simple VM deadlocks. I don't see any deadlocks

Definitely. They can't explain anything about the VM deadlocks. I was
_only_ talking about the blkdev hangs that caused you to unplug the
queue at each reschedule in tux and that Eric reported to me for the SG
driver (and I very much hope that with EXCLUSIVE gone and the
wait_on_* fixed, those hangs will go away, because I don't see anything else
wrong at this moment).

> but one of these two fixes could explain the slowdown i saw on and off for
> quite some time, seeing very bad read performance occasionally. (do you
> remember my sched.c tq_disk hack?)

Exactly, that's the only thing I was talking about in this subthread.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:13                               ` Andrea Arcangeli
@ 2000-09-25 14:08                                 ` Jens Axboe
  2000-09-25 14:29                                   ` Andrea Arcangeli
  2000-09-25 14:13                                 ` Ingo Molnar
  2000-09-25 14:29                                 ` Ingo Molnar
  2 siblings, 1 reply; 243+ messages in thread
From: Jens Axboe @ 2000-09-25 14:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > i had yesterday - those were simple VM deadlocks. I don't see any deadlocks
> 
> Definitely. They can't explain anything about the VM deadlocks. I was
> _only_ talking about the blkdev hangs that caused you to unplug the
> queue at each reschedule in tux and that Eric reported me for the SG
> driver (and I very much hope that with EXCLUSIVE gone away and the
> wait_on_* fixed those hangs will go away because I don't see anything else
> wrong at this moment).

The sg problem was different. When sg queues a request, it invokes the
request_fn to handle it. But if the queue is currently plugged, the
scsi_request_fn will not do anything.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:08                                 ` Jens Axboe
@ 2000-09-25 14:29                                   ` Andrea Arcangeli
  2000-09-25 14:18                                     ` Jens Axboe
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 14:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:08:38PM +0200, Jens Axboe wrote:
> The sg problem was different. When sg queues a request, it invokes the
> request_fn to handle it. But if the queue is currently plugged, the
> scsi_request_fn will not do anything.

That would explain it, yes. For correctness, shouldn't the calls below also
be converted from request_fn to generic_unplug_device in the same way? (That
would also avoid spurious request_fn calls, which happen because the device is
still in the tq_disk queue even after the I/O generated by the request_fn
below has completed.)

	if (major >= COMPAQ_SMART2_MAJOR+0 && major <= COMPAQ_SMART2_MAJOR+7)
		(q->request_fn)(q);
	if (major >= DAC960_MAJOR+0 && major <= DAC960_MAJOR+7)
		(q->request_fn)(q);

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:29                                   ` Andrea Arcangeli
@ 2000-09-25 14:18                                     ` Jens Axboe
  2000-09-25 14:47                                       ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Jens Axboe @ 2000-09-25 14:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > The sg problem was different. When sg queues a request, it invokes the
> > request_fn to handle it. But if the queue is currently plugged, the
> > scsi_request_fn will not do anything.
> 
> That will explain it, yes. In the same way for correctness also those should
> be converted from request_fn to generic_unplug_device, right? (this

Yes, that would be the right fix. However, then we also need some
way of inserting requests into the queue and letting it plug when appropriate.
The scsi layer currently "manually" does a list_add on the queue itself,
which doesn't look too healthy.

> will also avoid to recall spurious request_fn because the device is still
> in the tq_disk queue even when the I/O generated by the below request_fn
> completed)
> 
> 	if (major >= COMPAQ_SMART2_MAJOR+0 && major <= COMPAQ_SMART2_MAJOR+7)
> 		(q->request_fn)(q);
> 	if (major >= DAC960_MAJOR+0 && major <= DAC960_MAJOR+7)
> 		(q->request_fn)(q);

AFAIR, Eric tried to talk to the Compaq folks (and Leonard too, I dunno)
about why they want this. What came of it, I don't know.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:18                                     ` Jens Axboe
@ 2000-09-25 14:47                                       ` Andrea Arcangeli
  2000-09-25 21:28                                         ` Jens Axboe
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 14:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 04:18:54PM +0200, Jens Axboe wrote:
> The scsi layer currently "manually" does a list_add on the queue itself,
> which doesn't look too healthy.

It's grabbing the io_request_lock so it looks healthy for now :)

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:47                                       ` Andrea Arcangeli
@ 2000-09-25 21:28                                         ` Jens Axboe
  2000-09-25 22:14                                           ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Jens Axboe @ 2000-09-25 21:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25 2000, Andrea Arcangeli wrote:
> > The scsi layer currently "manually" does a list_add on the queue itself,
> > which doesn't look too healthy.
> 
> It's grabbing the io_request_lock so it looks healthy for now :)

It's safe all right, but if we want to call generic_unplug_device
instead of just hitting the request_fn (which might do anything
anyway), it would be nicer to expose this part of the block layer
(i.e. have a general way of queueing a request to the request_queue).
But I guess just

q->plug_device_fn(q, ...);
list_add(...)
generic_unplug_device(q);

would suffice in scsi_lib for now.

-- 
* Jens Axboe <axboe@suse.de>
* SuSE Labs

^ permalink raw reply	[flat|nested] 243+ messages in thread
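
The plug / list_add / unplug sequence from the message above, and the sg
problem it is meant to avoid, can be illustrated with a small userspace toy
model. This is only a sketch under stated assumptions: struct toy_queue and
all toy_* functions are hypothetical names invented here, not kernel APIs,
and the model assumes that the request function is a no-op while the queue is
plugged and that unplugging clears the plug and then runs the request
function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a plugged request queue; every name here is illustrative. */
struct toy_queue {
	bool plugged;     /* true while the queue is collecting requests */
	int  pending;     /* requests sitting on the queue */
	int  dispatched;  /* requests handed to the "hardware" */
};

/* Assumed behaviour: the request function does nothing while plugged. */
static void toy_request_fn(struct toy_queue *q)
{
	if (q->plugged)
		return;
	q->dispatched += q->pending;
	q->pending = 0;
}

/* Stand-in for q->plug_device_fn(): start collecting requests. */
static void toy_plug(struct toy_queue *q)
{
	q->plugged = true;
}

/* Stand-in for the manual list_add() onto the queue. */
static void toy_add_request(struct toy_queue *q)
{
	assert(q != NULL);
	q->pending++;
}

/* Stand-in for generic_unplug_device(): clear the plug, then dispatch. */
static void toy_unplug(struct toy_queue *q)
{
	q->plugged = false;
	toy_request_fn(q);
}
```

In this model, calling toy_request_fn() directly on a plugged queue leaves
the request pending, while toy_plug(), toy_add_request(), toy_unplug() in
that order always dispatches it.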

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 21:28                                         ` Jens Axboe
@ 2000-09-25 22:14                                           ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 22:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

On Mon, Sep 25, 2000 at 11:28:55PM +0200, Jens Axboe wrote:
> q->plug_device_fn(q, ...);
> list_add(...)
> generic_unplug_device(q);
> 
> would suffice in scsi_lib for now.

It looks sane to me.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:13                               ` Andrea Arcangeli
  2000-09-25 14:08                                 ` Jens Axboe
@ 2000-09-25 14:13                                 ` Ingo Molnar
  2000-09-25 14:29                                 ` Ingo Molnar
  2 siblings, 0 replies; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 14:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> I was _only_ talking about the blkdev hangs [...]

i guess this was just miscommunication. It never 'hung', it just performed
reads at 20k/sec or so. (without any writes being done in the
background.) A 'hang' for me is a deadlock or lockup, not a slowdown.

> that caused you to unplug the queue at each reschedule in tux and that
> Eric reported me for the SG driver (and I very much hope that with
> EXCLUSIVE gone away and the wait_on_* fixed those hangs will go away
> because I don't see anything else wrong at this moment).

okay, i'll test this.

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:13                               ` Andrea Arcangeli
  2000-09-25 14:08                                 ` Jens Axboe
  2000-09-25 14:13                                 ` Ingo Molnar
@ 2000-09-25 14:29                                 ` Ingo Molnar
  2000-09-25 14:46                                   ` Andrea Arcangeli
  2 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 14:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> driver (and I very much hope that with EXCLUSIVE gone away and the
> wait_on_* fixed those hangs will go away because I don't see anything else
> wrong at this moment).

the EXCLUSIVE thing only optimizes the wakeup, it's not semantic! How
is it better to let 100 processes race for one freed-up request slot?
There is no guarantee at all that the reader will win. If reads and writes
racing for request slots ever becomes a problem then we should introduce a
separate read and write waitqueue.

the EXCLUSIVE thing was noticed by Dimitris i think, and it makes tons of
(performance) sense.

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread
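
The performance argument in the message above can be made concrete with a
small counting model. This is a sketch only: wasted_wakeups() is a
hypothetical helper, and it assumes that every woken waiter is able to use a
freed slot (the follow-up about reserved requests questions exactly that
assumption).

```c
#include <assert.h>

/*
 * Count the waiters that are woken but find no free request slot.
 * exclusive != 0 models a wake-one wakeup: one waiter per freed slot.
 * exclusive == 0 models a wake-all wakeup: every waiter races.
 */
static int wasted_wakeups(int waiters, int freed, int exclusive)
{
	assert(waiters >= 0 && freed >= 0);

	int woken = waiters;
	if (exclusive && freed < waiters)
		woken = freed;

	/* At most one woken waiter can be served per freed slot. */
	int served = freed < woken ? freed : woken;

	return woken - served;
}
```

With 100 waiters and one freed slot, the wake-all case wakes 99 processes
for nothing, while the wake-one case wakes none needlessly.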

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:29                                 ` Ingo Molnar
@ 2000-09-25 14:46                                   ` Andrea Arcangeli
  2000-09-25 14:53                                     ` Ingo Molnar
  0 siblings, 1 reply; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 14:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, Sep 25, 2000 at 04:29:42PM +0200, Ingo Molnar wrote:
> There is no guarantee at all that the reader will win. If reads and writes
> racing for request slots ever becomes a problem then we should introduce a
> separate read and write waitqueue.

I agree. However, here I also have an in-flight per-queue limit on locked stuff
(otherwise, with 512k-sized requests on scsi, I could fill 128mbyte of RAM with
locked buffers within a few seconds, and I don't want to decrease the size of
the queue because it has to be large for aggressive reordering when the
requests are 4k each). This in-flight per-queue limit currently uses a
non-exclusive wakeup, and it triggers more often than the request shortage
(because most of the time writes are consecutive), so having two waitqueues,
with the reads registering themselves in both, shouldn't be a very significant
improvement at the moment (I should first take care of a wake-one wakeup for
the in-flight per-queue limit :).

> the EXCLUSIVE thing was noticed by Dimitris i think, and it makes tons of

Actually I'm the one who introduced the EXCLUSIVE thing there and I audited
_all_ the device drivers to check they do 1 wakeup for each 1 request they
release before sending it off to Linus. But I never thought (until a few days
ago) about the fact that if a read completes a reserved request the write won't
be able to accept it.

So long term we'll do two wake-one queues with reads registered in both.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread
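
The reserved-request problem and the proposed "two wake-one queues with
reads registered in both" can be sketched as a toy model. Everything below is
hypothetical: the enum, the function names, and the assumption that a freed
slot is either usable by anyone or reserved for reads. The waiters are given
in FIFO order.

```c
#include <assert.h>

enum waiter_kind { READER, WRITER };

/*
 * One wake-one queue shared by readers and writers: only the head is woken.
 * Returns the index of the waiter that takes the slot, or -1 if the wakeup
 * is wasted because the head is a writer and the slot is read-only.
 */
static int single_queue_wake_one(const enum waiter_kind *q, int n,
				 int slot_read_only)
{
	assert(n >= 0);
	if (n == 0)
		return -1;
	if (slot_read_only && q[0] == WRITER)
		return -1;
	return 0;
}

/*
 * Two wake-one queues: a general queue holding everyone and a read queue
 * holding only readers. A read-only slot wakes the head of the read queue,
 * any other slot wakes the head of the general queue.
 */
static int two_queue_wake_one(const enum waiter_kind *q, int n,
			      int slot_read_only)
{
	assert(n >= 0);
	for (int i = 0; i < n; i++)
		if (!slot_read_only || q[i] == READER)
			return i;
	return -1;
}
```

With a writer at the head and a reader behind it, a freed read-only slot is
lost in the single-queue model but reaches the reader in the two-queue model.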

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:46                                   ` Andrea Arcangeli
@ 2000-09-25 14:53                                     ` Ingo Molnar
  2000-09-25 15:02                                       ` Andrea Arcangeli
  0 siblings, 1 reply; 243+ messages in thread
From: Ingo Molnar @ 2000-09-25 14:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > the EXCLUSIVE thing was noticed by Dimitris i think, and it makes tons of
> 
> Actually I'm the one who introduced the EXCLUSIVE thing there and I audited

sorry - i said it was *noticed* by Dimitris. (and sent to l-k IIRC)

	Ingo


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [patch] vmfixes-2.4.0-test9-B2
  2000-09-25 14:53                                     ` Ingo Molnar
@ 2000-09-25 15:02                                       ` Andrea Arcangeli
  0 siblings, 0 replies; 243+ messages in thread
From: Andrea Arcangeli @ 2000-09-25 15:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel

On Mon, Sep 25, 2000 at 04:53:05PM +0200, Ingo Molnar wrote:
> sorry - i said it was *noticed* by Dimitris. (and sent to l-k IIRC)

I didn't know.

Andrea

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: __GFP_IO && shrink_[d|i]cache_memory()?
  2000-09-24 18:40   ` Ingo Molnar
  2000-09-24 18:39     ` Linus Torvalds
@ 2000-09-24 21:38     ` Stephen C. Tweedie
  2000-09-24 23:20       ` Alan Cox
  1 sibling, 1 reply; 243+ messages in thread
From: Stephen C. Tweedie @ 2000-09-24 21:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Roger Larsson, MM mailing list,
	linux-kernel, Stephen Tweedie

Hi,

On Sun, Sep 24, 2000 at 08:40:05PM +0200, Ingo Molnar wrote:
> On Sun, 24 Sep 2000, Linus Torvalds wrote:
> 
> > [...] I don't think shrinking the inode cache is actually illegal when
> > GFP_IO isn't set. In fact, it's probably only the buffer cache itself
> > that has to avoid recursion - the other stuff doesn't actually do any
> > IO.
> 
> i just found this out by example, i'm running the shrink_[i|d]cache stuff
> even if __GFP_IO is not set, and no problems so far. (and much better
> balancing behavior)

Careful --- I found out to my cost that there are hidden recursions
here.  ext3 was bitten once by the fact that shrink_icache does a
quota drop, and that involves quota writeback if it was the last inode
on that particular quota struct.

shrinking the icache _usually_ involves no IO, but the quota case is
an exception which a lot of developers won't encounter during testing.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 243+ messages in thread
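
One way to combine the quota caveat above with the idea of passing gfp_mask
into the shrink functions is to skip only those inodes whose release would
start I/O. The following is a userspace sketch under that assumption, not the
kernel's implementation: TOY_GFP_IO, struct toy_inode and toy_shrink_icache()
are hypothetical names, and the needs_quota_writeback flag stands in for "last
inode on a quota struct that needs writeback".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_GFP_IO 0x1u  /* stand-in for __GFP_IO; the value is arbitrary */

struct toy_inode {
	bool in_use;                 /* still referenced, cannot be freed */
	bool needs_quota_writeback;  /* freeing it would start quota I/O */
	bool freed;
};

/*
 * Free every unused inode that can be released under gfp_mask.
 * Inodes whose release would trigger quota writeback are skipped
 * unless the caller allows I/O. Returns the number of inodes freed.
 */
static int toy_shrink_icache(struct toy_inode *inodes, size_t n,
			     unsigned int gfp_mask)
{
	int count = 0;

	assert(inodes != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (inodes[i].in_use || inodes[i].freed)
			continue;
		if (inodes[i].needs_quota_writeback && !(gfp_mask & TOY_GFP_IO))
			continue;
		inodes[i].freed = true;
		count++;
	}
	return count;
}
```

A caller without TOY_GFP_IO still reclaims the common case, and the inode
that needs quota writeback is left for a caller that is allowed to do I/O.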

* Re: __GFP_IO && shrink_[d|i]cache_memory()?
  2000-09-24 21:38     ` __GFP_IO && shrink_[d|i]cache_memory()? Stephen C. Tweedie
@ 2000-09-24 23:20       ` Alan Cox
  0 siblings, 0 replies; 243+ messages in thread
From: Alan Cox @ 2000-09-24 23:20 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Ingo Molnar, Linus Torvalds, Rik van Riel, Roger Larsson,
	MM mailing list, linux-kernel

> quota drop, and that involves quota writeback if it was the last inode
> on that particular quota struct.
> 
> shrinking the icache _usually_ involves no IO, but the quota case is
> an exception which a lot of developers won't encounter during testing.

We've had a history of weird quota deadlocks in 2.0 and earlier 2.2. Is there
a reason quota block writeback cannot be queued or handled by a thread?

Alan


^ permalink raw reply	[flat|nested] 243+ messages in thread

end of thread, other threads:[~2000-10-09  7:37 UTC | newest]

Thread overview: 243+ messages
2000-09-24 10:11 __GFP_IO && shrink_[d|i]cache_memory()? Ingo Molnar
2000-09-24 18:11 ` Linus Torvalds
2000-09-24 18:40   ` Ingo Molnar
2000-09-24 18:39     ` Linus Torvalds
2000-09-24 18:46       ` Linus Torvalds
2000-09-24 18:59         ` Ingo Molnar
2000-09-24 19:34         ` [patch] vmfixes-2.4.0-test9-B2 Ingo Molnar
2000-09-24 20:20           ` Rui Sousa
2000-09-24 20:24           ` Andrea Arcangeli
2000-09-24 20:26             ` Ingo Molnar
2000-09-24 21:12               ` Andrea Arcangeli
2000-09-24 21:12                 ` Ingo Molnar
2000-09-24 21:43                   ` Stephen C. Tweedie
2000-09-24 22:13                     ` Andrea Arcangeli
2000-09-24 22:36                       ` [patch] vmfixes-2.4.0-test9-B2 - fixing deadlocks bert hubert
2000-09-24 23:41                         ` Andrea Arcangeli
2000-09-25 16:24                           ` Stephen C. Tweedie
2000-09-25 17:03                             ` Andrea Arcangeli
2000-09-25 18:06                               ` Stephen C. Tweedie
2000-09-25 19:32                                 ` Andrea Arcangeli
2000-09-25 19:26                                   ` Rik van Riel
2000-09-25 22:28                                     ` Andrea Arcangeli
2000-09-25 22:26                                       ` Rik van Riel
2000-09-25 22:51                                         ` Andrea Arcangeli
2000-09-25 22:30                                       ` Linus Torvalds
2000-09-25 23:03                                         ` Andrea Arcangeli
2000-09-25 23:18                                           ` Linus Torvalds
2000-09-26  0:32                                             ` Andrea Arcangeli
2000-09-25 22:30                                       ` Juan J. Quintela
2000-09-25 23:00                                         ` Andrea Arcangeli
2000-09-25 19:54                                   ` Stephen C. Tweedie
2000-09-25 22:44                                     ` Andrea Arcangeli
2000-09-25 22:42                                       ` Rik van Riel
2000-09-26  6:54                                     ` Christoph Rohland
2000-09-26 14:05                                       ` Andrea Arcangeli
2000-09-26 16:20                                         ` Christoph Rohland
2000-09-26 17:10                                           ` Andrea Arcangeli
2000-09-27  8:11                                             ` Christoph Rohland
2000-09-27  8:28                                               ` Ingo Molnar
2000-09-27  9:24                                                 ` Christoph Rohland
2000-09-27 13:56                                               ` Andrea Arcangeli
2000-09-27 16:56                                                 ` Christoph Rohland
2000-09-27 17:42                                                   ` Andrea Arcangeli
2000-09-27 18:25                                                     ` Erik Andersen
2000-09-27 18:55                                                       ` Andrea Arcangeli
2000-09-28 10:08                                                 ` Rik van Riel
2000-09-28 11:16                                                   ` Rik van Riel
2000-09-28 14:52                                                     ` Andrea Arcangeli
2000-09-29 14:39                                                       ` Rik van Riel
2000-09-29 14:55                                                         ` Andrea Arcangeli
2000-09-29 15:40                                                           ` Rik van Riel
2000-09-28 11:31                                                   ` Ingo Molnar
2000-09-28 14:54                                                     ` Andrea Arcangeli
2000-09-28 15:13                                                       ` Ingo Molnar
2000-09-28 15:23                                                         ` Andrea Arcangeli
2000-09-28 16:16                                                         ` Juan J. Quintela
2000-09-28 14:31                                                   ` Andrea Arcangeli
2000-09-25 17:21                           ` bert hubert
2000-09-25 17:49                             ` Andrea Arcangeli
2000-09-25 15:09                         ` Miles Lane
2000-09-25 15:51                         ` Stephen C. Tweedie
2000-09-25 16:05                           ` Ingo Molnar
2000-09-25 16:06                             ` Alexander Viro
2000-09-25 16:20                               ` Ingo Molnar
2000-09-25 16:29                                 ` Andrea Arcangeli
2000-09-25  4:56                   ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds
2000-09-25  5:19                     ` Alexander Viro
2000-09-25  6:06                       ` Linus Torvalds
2000-09-25  6:17                         ` Alexander Viro
2000-09-25 21:21                         ` Alexander Viro
2000-09-26 13:42                       ` [CFT][PATCH] ext2 directories in pagecache Alexander Viro
2000-09-26 21:29                       ` Alexander Viro
2000-09-26 22:16                         ` Marko Kreen
2000-09-26 22:31                           ` Alexander Viro
2000-09-26 22:47                             ` Marko Kreen
2000-09-27  7:32                               ` Ingo Molnar
2000-09-27  9:22                                 ` Alexander Viro
2000-09-26 23:19                         ` Andreas Dilger
2000-09-26 23:33                           ` Alexander Viro
2000-09-26 23:44                             ` Alexander Viro
2000-09-25  0:09                 ` [patch] vmfixes-2.4.0-test9-B2 Linus Torvalds
2000-09-25  0:49                   ` Alexander Viro
2000-09-25  0:53                   ` Marcelo Tosatti
2000-09-25  1:45                     ` Andrea Arcangeli
2000-09-25  2:39                       ` Marcelo Tosatti
2000-09-25 15:36                         ` Andrea Arcangeli
2000-09-25 10:42                     ` the new VM Ingo Molnar
2000-09-25 13:02                       ` Andrea Arcangeli
2000-09-25 13:02                         ` Ingo Molnar
2000-09-25 13:08                           ` Andrea Arcangeli
2000-09-25 13:12                             ` Ingo Molnar
2000-09-25 13:30                               ` Andrea Arcangeli
2000-09-25 13:39                                 ` Ingo Molnar
2000-09-25 14:04                                   ` Andrea Arcangeli
2000-09-25 14:04                                     ` Ingo Molnar
2000-09-25 14:23                                       ` Andrea Arcangeli
2000-09-25 14:27                                         ` Ingo Molnar
2000-09-25 14:39                                           ` Andrea Arcangeli
2000-09-25 14:43                                             ` Ingo Molnar
2000-09-25 15:01                                               ` Andrea Arcangeli
2000-09-25 15:10                                                 ` Ingo Molnar
2000-09-25 15:24                                                   ` Andrea Arcangeli
2000-09-25 15:26                                                     ` Ingo Molnar
2000-09-25 15:22                                                       ` yodaiken
2000-09-26 19:10                                                 ` Pavel Machek
2000-09-26 20:16                                                   ` Andrea Arcangeli
2000-09-27  7:42                                                   ` Ingo Molnar
2000-09-27 12:11                                                     ` yodaiken
2000-09-27 14:08                                                     ` Andrea Arcangeli
2000-09-25 16:09                                             ` Rik van Riel
2000-09-25 14:26                                     ` Marcelo Tosatti
2000-09-25 14:50                                       ` Andrea Arcangeli
2000-09-25 14:47                               ` Alan Cox
2000-09-25 15:16                                 ` Ingo Molnar
2000-09-25 15:16                                   ` the new VMt Alan Cox
2000-09-25 15:33                                     ` the new VM Ingo Molnar
2000-09-25 15:41                                     ` the new VMt Andrea Arcangeli
2000-09-25 16:02                                       ` Ingo Molnar
2000-09-25 16:04                                         ` Andi Kleen
2000-09-25 16:19                                           ` Ingo Molnar
2000-09-25 16:18                                             ` Andi Kleen
2000-09-25 16:41                                               ` Andrea Arcangeli
2000-09-25 16:35                                                 ` Linus Torvalds
2000-09-25 16:41                                                   ` Rik van Riel
2000-09-25 16:49                                                     ` Linus Torvalds
2000-09-25 17:03                                                       ` Ingo Molnar
2000-09-25 17:17                                                         ` Andrea Arcangeli
2000-09-25 17:10                                                           ` Rik van Riel
2000-09-25 17:27                                                             ` Andrea Arcangeli
2000-09-25 17:15                                                       ` Andrea Arcangeli
2000-09-27  7:14                                                   ` Rusty Russell
2000-09-25 20:23                                               ` Russell King
2000-09-25 16:28                                             ` Rik van Riel
2000-09-25 16:11                                         ` Andrea Arcangeli
2000-09-25 16:22                                           ` Ingo Molnar
2000-09-25 16:17                                             ` Alexander Viro
2000-09-25 16:36                                               ` Jeff Garzik
2000-09-25 16:57                                               ` Alan Cox
2000-09-25 17:01                                                 ` Alexander Viro
2000-09-25 17:06                                                   ` Alan Cox
2000-09-25 17:31                                                     ` Oliver Xymoron
2000-09-25 17:51                                                       ` Jeff Garzik
2000-09-25 19:03                                                     ` the new VMt [4MB+ blocks] Matti Aarnio
2000-09-25 20:02                                                       ` Stephen Williams
2000-09-25 16:33                                             ` the new VMt Andrea Arcangeli
2000-09-26  8:38                                             ` Jes Sorensen
2000-09-26  8:52                                               ` Ingo Molnar
2000-09-26  9:02                                                 ` Jes Sorensen
2000-09-25 16:53                                         ` Alan Cox
2000-09-25 15:42                                     ` Stephen C. Tweedie
2000-09-25 16:05                                       ` Andrea Arcangeli
2000-09-25 16:22                                         ` Rik van Riel
2000-09-25 16:42                                           ` Andrea Arcangeli
2000-09-25 17:39                                         ` Stephen C. Tweedie
2000-09-25 16:51                                       ` Alan Cox
2000-09-25 17:43                                         ` Stephen C. Tweedie
2000-09-25 18:13                                           ` Alan Cox
2000-09-25 18:21                                             ` Stephen C. Tweedie
2000-09-25 19:09                                               ` Alan Cox
2000-09-25 19:21                                                 ` Stephen C. Tweedie
2000-09-25 16:52                                       ` yodaiken
2000-09-25 17:18                                         ` Jamie Lokier
2000-09-25 17:51                                           ` yodaiken
2000-09-25 18:04                                             ` Jamie Lokier
2000-09-25 18:13                                               ` yodaiken
2000-09-25 18:24                                                 ` Stephen C. Tweedie
2000-09-25 18:34                                                   ` yodaiken
2000-09-25 18:48                                                     ` Jamie Lokier
2000-09-25 19:25                                                     ` Stephen C. Tweedie
2000-09-25 20:04                                                       ` yodaiken
2000-09-25 20:23                                                         ` Alan Cox
2000-09-25 20:35                                                           ` yodaiken
2000-09-25 20:46                                                             ` Alan Cox
2000-09-25 21:07                                                               ` yodaiken
2000-09-26  9:54                                                                 ` Stephen C. Tweedie
2000-09-26 13:17                                                                   ` yodaiken
2000-09-25 20:47                                                             ` Benjamin C.R. LaHaise
2000-09-25 21:12                                                               ` yodaiken
2000-09-26 10:07                                                                 ` Stephen C. Tweedie
2000-09-26 13:30                                                                   ` yodaiken
2000-09-25 20:32                                                         ` Stephen C. Tweedie
2000-09-26 12:10                                                           ` Mark Hemment
2000-09-27 10:13                                                             ` Andrey Savochkin
2000-09-27 12:55                                                               ` Hugh Dickins
2000-09-28  3:25                                                                 ` Andrey Savochkin
2000-09-25 23:14                                                         ` Erik Andersen
2000-09-26 15:17                                                           ` yodaiken
2000-09-26 16:04                                                             ` Stephen C. Tweedie
2000-09-26 17:02                                                               ` Erik Andersen
2000-09-26 17:08                                                                 ` Stephen C. Tweedie
2000-09-26 17:45                                                                   ` Erik Andersen
2000-09-27 10:20                                                                     ` Andrey Savochkin
2000-09-26 21:13                                                                   ` Eric Lowe
2000-09-25 18:20                                             ` Andrea Arcangeli
2000-09-25 16:16                                     ` Rik van Riel
2000-09-25 16:55                                       ` Alan Cox
2000-09-25 15:48                                   ` the new VM Andrea Arcangeli
2000-09-25 15:40                                 ` Stephen C. Tweedie
2000-09-25 16:01                                   ` Andrea Arcangeli
2000-09-25 14:37                             ` Rik van Riel
2000-09-25 20:34                               ` Christoph Rohland
2000-10-06 16:14                                 ` Rik van Riel
2000-10-09  7:37                                   ` Christoph Rohland
2000-09-25 13:04                         ` Ingo Molnar
2000-09-25 13:19                           ` Andrea Arcangeli
2000-09-25 13:18                             ` Ingo Molnar
2000-09-25 13:21                             ` Ingo Molnar
2000-09-25 13:31                               ` Andrea Arcangeli
2000-09-25 13:47                                 ` Ingo Molnar
2000-09-25 14:04                                   ` Andrea Arcangeli
2000-09-25  1:31                   ` [patch] vmfixes-2.4.0-test9-B2 Andrea Arcangeli
2000-09-25  1:27                     ` Alexander Viro
2000-09-25  2:02                       ` Andrea Arcangeli
2000-09-25  2:01                         ` Alexander Viro
2000-09-25 13:47                         ` Stephen C. Tweedie
2000-09-25 10:13                     ` Ingo Molnar
2000-09-25 12:58                       ` Andrea Arcangeli
2000-09-25 13:10                         ` Ingo Molnar
2000-09-25 13:49                           ` Jens Axboe
2000-09-25 14:11                             ` Ingo Molnar
2000-09-25 14:05                               ` Jens Axboe
2000-09-25 16:46                               ` Linus Torvalds
2000-09-25 17:05                                 ` Ingo Molnar
2000-09-25 17:23                                   ` Andrea Arcangeli
2000-09-25 14:20                             ` Andrea Arcangeli
2000-09-25 14:11                               ` Jens Axboe
2000-09-25 14:33                                 ` Andrea Arcangeli
2000-09-25 13:56                           ` Andrea Arcangeli
2000-09-25 13:57                             ` Ingo Molnar
2000-09-25 14:13                               ` Andrea Arcangeli
2000-09-25 14:08                                 ` Jens Axboe
2000-09-25 14:29                                   ` Andrea Arcangeli
2000-09-25 14:18                                     ` Jens Axboe
2000-09-25 14:47                                       ` Andrea Arcangeli
2000-09-25 21:28                                         ` Jens Axboe
2000-09-25 22:14                                           ` Andrea Arcangeli
2000-09-25 14:13                                 ` Ingo Molnar
2000-09-25 14:29                                 ` Ingo Molnar
2000-09-25 14:46                                   ` Andrea Arcangeli
2000-09-25 14:53                                     ` Ingo Molnar
2000-09-25 15:02                                       ` Andrea Arcangeli
2000-09-24 21:38     ` __GFP_IO && shrink_[d|i]cache_memory()? Stephen C. Tweedie
2000-09-24 23:20       ` Alan Cox
