* More info: 2.1.108 page cache performance on low memory
From: Stephen C. Tweedie @ 1998-07-13 16:53 UTC
To: linux-mm
Cc: Rik van Riel, Ingo Molnar, Benjamin LaHaise, Alan Cox, Linus Torvalds, Stephen Tweedie

Hi all,

OK, a bit more benchmarking is showing bad problems with page ageing.
I've been running 2.1 with a big ramdisk and without, with page ageing
and without.  The results for a simple compile job (make a few
dependency files, then compile four .c files) look like this:

2.0.34, 6MB RAM:                  1:22

2.1.108, 16MB RAM, 10MB ramdisk:
  With page cache ageing:         not usable (swap death during boot)
  Without cache ageing:           8:47

2.1.108, 6MB RAM:
  With page cache ageing:         4:14
  Without cache ageing:           3:22

So we can see that on these low memory configurations, the page cache
ageing is a definite performance loss.  The situation with the ramdisk
is VERY markedly worse, which I think we can attribute to an overly
large page cache caused by the %age-of-physical-memory tuning
parameters; I'll be following this up to check (that's easy, since
those parameters are sysctl-able).  This is not an artificial
situation: fixing the page cache limits as a percentage of physical
pages is just not going to work if large numbers of those pages can be
locked down for particular purposes.  Effectively we're reducing the
size of the page pool without the VM taking it into account.

Performance sucks overall compared to 2.0.  That may well be due to
the extra memory lost to the inode and dirent caches on 2.1, which
tend to grow much more than they did before; it may be that we can
address that without too much pain.  It is certainly possible to trim
back the kernel's caching of unused inodes/dirents, and although a
self-tuning system will be necessary in the long term, putting bounds
on these caches will at least let us see if this is where things are
going wrong.

I'll be experimenting a bit more to try to identify just where the
performance is disappearing here.  However you look at it, things look
pretty grim on 2.1 right now on low memory machines.

--Stephen
* Re: More info: 2.1.108 page cache performance on low memory
From: Eric W. Biederman @ 1998-07-13 18:08 UTC
To: Stephen C. Tweedie
Cc: linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@redhat.com> writes:

ST> OK, a bit more benchmarking is showing bad problems with page ageing.
ST> I've been running 2.1 with a big ramdisk and without, with page ageing
ST> and without.  The results for a simple compile job (make a few
ST> dependency files, then compile four .c files) look like this:

ST> 2.0.34, 6MB RAM:                  1:22

ST> 2.1.108, 16MB RAM, 10MB ramdisk:
ST>   With page cache ageing:         not usable (swap death during boot)
ST>   Without cache ageing:           8:47

ST> 2.1.108, 6MB RAM:
ST>   With page cache ageing:         4:14
ST>   Without cache ageing:           3:22

O.k.  Just a few thoughts.

1) We have a minimum size for the buffer cache as a percentage of
   physical pages.  Setting the minimum to 0% may help.

2) If we play with an LRU list, it may be most practical to use the
   page->next and page->prev fields for the list, and for
   truncate_inode_pages && invalidate_inode_pages do something like:

	for (i = 0; i < inode->i_size; i += PAGE_SIZE) {
		page = find_in_page_cache(inode, i);
		if (page)
			/* remove it */ ;
	}

   and remove the inode->i_pages list.  This should be roughly
   equivalent to the bforgets needed by truncate anyway, so it should
   impose no large performance penalty.

Personally I think it is broken to set the limits of cache sizes
(buffer & page) to anything besides max=100%, min=0% by default.
But now that we have this hand-tuning option in addition to auto
tuning, we should experiment with it as well.

Eric
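For reference, the buffer cache minimum Eric mentions in (1) is
enforced in mm/filemap.c's reclaim loop; this is the check as it
stands in 2.1 (quoted from the mm/filemap.c hunk of the patch posted
later in this thread; buffermem counts bytes, num_physpages counts
pages):

	/* Refuse to swap out all buffer pages */
	if ((buffermem >> PAGE_SHIFT) * 100 <
	    (buffer_mem.min_percent * num_physpages))
		goto next;

Setting min_percent to 0 through /proc/sys/vm/buffermem (the sysctl is
a plain three-integer vector: min, borrow, max percent) disables that
floor without recompiling.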
* Re: More info: 2.1.108 page cache performance on low memory
From: Zlatko Calusic @ 1998-07-13 18:29 UTC
To: Eric W. Biederman
Cc: Stephen C. Tweedie, linux-mm

ebiederm+eric@npwt.net (Eric W. Biederman) writes:

> >>>>> "ST" == Stephen C Tweedie <sct@redhat.com> writes:
>
> ST> 2.0.34, 6MB RAM:                  1:22
>
> ST> 2.1.108, 16MB RAM, 10MB ramdisk:
> ST>   With page cache ageing:         not usable (swap death during boot)
> ST>   Without cache ageing:           8:47
>
> ST> 2.1.108, 6MB RAM:
> ST>   With page cache ageing:         4:14
> ST>   Without cache ageing:           3:22

I agree that ageing of the page cache has a bad impact on performance.
Benchmarking disks reveals a much lower read speed, mostly thanks to
needless, excessive swapping: pages get swapped out only to be swapped
back in a few seconds later (the page cache likes to take 90% of
memory when copying large files).  This produces lots of redundant
head movement (not to mention copying pages...) which effectively cuts
read speed in half.

I have personally been running a system with a heavily patched VM
subsystem for at least the last three months.  The sad thing is that
my patch mostly undoes the latest changes. :(

Just to mention: I have 64MB of physical memory, and my machine is
definitely not memory starved, but it also suffers from some of the
recent VM changes.

> O.k.  Just a few thoughts.
> 1) We have a minimum size for the buffer cache as a percentage of
>    physical pages.  Setting the minimum to 0% may help.
[...]
> Personally I think it is broken to set the limits of cache sizes
> (buffer & page) to anything besides max=100%, min=0% by default.

Exactly.  That (removing the cache limits) is one of my favourite
changes.  Free memory == unused memory == bad policy!  There is no
reason why any of the caches should not utilize all of the free memory
at any given moment.  But we must be very careful to swap out only
unneeded pages if we decide to enlarge the cache at the expense of
text and data pages.

> But now that we have this hand-tuning option in addition to auto
> tuning, we should experiment with it as well.

If anybody wants to see, I can provide benchmark results, but I'm not
prepared to compile another kernel image if nobody's interested. :)

Regards!
--
Posted by Zlatko Calusic        E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
10 out of 5 doctors feel it's OK to be skitzo!
* Re: More info: 2.1.108 page cache performance on low memory
From: Stephen C. Tweedie @ 1998-07-14 17:32 UTC
To: Zlatko.Calusic
Cc: Eric W. Biederman, Stephen C. Tweedie, linux-mm

Hi,

On 13 Jul 1998 20:29:33 +0200, Zlatko Calusic <Zlatko.Calusic@CARNet.hr>
said:

> I agree that ageing of the page cache has a bad impact on
> performance.

> Just to mention: I have 64MB of physical memory, and my machine is
> definitely not memory starved, but it also suffers from some of the
> recent VM changes.

Yep.  Has anybody else got observations about what sort of
configurations are helped or hindered by the current 2.1 changes?

> That (removing the cache limits) is one of my favourite changes.
> Free memory == unused memory == bad policy!
> There is no reason why any of the caches should not utilize all of
> the free memory at any given moment.

The existing limits don't affect the ability of the cache to grow;
they just give a target bound for the cache when we start trying to
get pages back for something else.

> If anybody wants to see, I can provide benchmark results, but I'm
> not prepared to compile another kernel image if nobody's
> interested. :)

Well, I've been compiling kernels all day for this. :)  Any
information you can give will help, but for now it does look as if
backing out the cache ageing is a necessary first step.

--Stephen
* Re: More info: 2.1.108 page cache performance on low memory
From: Zlatko Calusic @ 1998-07-16 12:31 UTC
To: Stephen C. Tweedie
Cc: Eric W. Biederman, linux-mm

"Stephen C. Tweedie" <sct@redhat.com> writes:

> Well, I've been compiling kernels all day for this. :)  Any
> information you can give will help, but for now it does look as if
> backing out the cache ageing is a necessary first step.

OK, here we go:

Official 2.1.108:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          200  4552 65.5  5011 20.5  2570 21.1  5643 74.4  4077 14.9  84.8  2.9
                           ^^^^                             ^^^^

Patched 2.1.108 (no page aging, no cache limits, modified slab, etc...
see below):

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          200  6449 89.7  7450 31.4  2605 22.2  6052 80.5  7269 27.3 105.4  2.9
                           ^^^^                             ^^^^

I'm appending the patch that produced the results above.  I don't
claim my work is suitable for anything; it is just part of my Linux MM
exploration, testing and simplifying things.  But it has worked stably
and fast for me for the last few months, and has survived all the
torture testing I've put on it.  YMMV, of course.

Test platform is a P166MMX, 64MB RAM, aic7xxx, Fujitsu M2954ESP.  The
results are completely reproducible.

Regards,
------------------------------------------------------------
diff -urN --exclude-from=exclude linux-old/Documentation/sysctl/vm.txt linux/Documentation/sysctl/vm.txt
--- linux-old/Documentation/sysctl/vm.txt	Fri Jun 26 19:44:26 1998
+++ linux/Documentation/sysctl/vm.txt	Tue Jul 14 21:32:56 1998
@@ -15,13 +15,9 @@
 Currently, these files are in /proc/sys/vm:
 - bdflush
-- buffermem
 - freepages
-- kswapd
 - overcommit_memory
-- pagecache
 - swapctl
-- swapout_interval
 
 ==============================================================
@@ -90,80 +86,23 @@
 age_super is for filesystem metadata.
 
 ==============================================================
-buffermem:
-
-The three values in this file correspond to the values in
-the struct buffer_mem. It controls how much memory should
-be used for buffer memory. The percentage is calculated
-as a percentage of total system memory.
-
-The values are:
-min_percent    -- this is the minimum percentage of memory
-                  that should be spent on buffer memory
-borrow_percent -- when Linux is short on memory, and the
-                  buffer cache uses more memory, free pages
-                  are stolen from it
-max_percent    -- this is the maximum amount of memory that
-                  can be used for buffer memory
-
-==============================================================
 
 freepages:
 
 This file contains the values in the struct freepages.
 That struct contains three members: min, low and high.
 
-Although the goal of the Linux memory management subsystem
-is to avoid fragmentation and make large chunks of free
-memory (so that we can hand out DMA buffers and such), there
-still are some page-based limits in the system, mainly to
-make sure we don't waste too much memory trying to get large
-free area's.
-
 The meaning of the numbers is:
 
 freepages.min	When the number of free pages in the system
 		reaches this number, only the kernel can
 		allocate more memory.
-freepages.low	If memory is too fragmented, the swapout
-		daemon is started, except when the number
-		of free pages is larger than freepages.low.
-freepages.high	The swapping daemon exits when memory is
-		sufficiently defragmented, when the number
-		of free pages reaches freepages.high or when
-		it has tried the maximum number of times.
-
-==============================================================
-
-kswapd:
-
-Kswapd is the kernel swapout daemon. That is, kswapd is that
-piece of the kernel that frees memory when it gets fragmented
-or full. Since every system is different, you'll probably want
-some control over this piece of the system.
-
-The numbers in this page correspond to the numbers in the
-struct pager_daemon {tries_base, tries_min, swap_cluster
-};
-The tries_base and swap_cluster probably have the
-largest influence on system performance.
-
-tries_base	The maximum number of pages kswapd tries to
-		free in one round is calculated from this
-		number. Usually this number will be divided
-		by 4 or 8 (see mm/vmscan.c), so it isn't as
-		big as it looks.
-		When you need to increase the bandwidth to/from
-		swap, you'll want to increase this number.
-tries_min	This is the minimum number of times kswapd
-		tries to free a page each time it is called.
-		Basically it's just there to make sure that
-		kswapd frees some pages even when it's being
-		called with minimum priority.
-swap_cluster	This is the number of pages kswapd writes in
-		one turn. You want this large so that kswapd
-		does it's I/O in large chunks and the disk
-		doesn't have to seek often, but you don't want
-		it to be too large since that would flood the
-		request queue.
+freepages.low	When the number of free pages drops below
+		this number, swapping daemon (kswapd) is
+		woken up.
+freepages.high	This is kswapd's target, when there are more
+		free pages than this number, kswapd will stop
+		running.
 
 ==============================================================
@@ -206,18 +145,6 @@
 
 ==============================================================
 
-pagecache:
-
-This file does exactly the same as buffermem, only this
-file controls the struct page_cache, and thus controls
-the amount of memory allowed for memory mapping of files.
-
-You don't want the minimum level to be too low, otherwise
-your system might thrash when memory is tight or fragmentation
-is high...
-
-==============================================================
-
 swapctl:
 
 This file contains no less than 8 variables.
@@ -273,15 +200,3 @@
 process pages in order to satisfy buffer memory demands,
 you might want to either increase sc_bufferout_weight, or
 decrease the value of sc_pageout_weight.
-
-==============================================================
-
-swapout_interval:
-
-The single value in this file controls the amount of time
-between successive wakeups of kswapd when nr_free_pages is
-between free_pages_low and free_pages_high. The default value
-of HZ/4 is usually right, but when kswapd can't keep up with
-the number of allocations in your system, you might want to
-decrease this number.
-
diff -urN --exclude-from=exclude linux-old/fs/buffer.c linux/fs/buffer.c
--- linux-old/fs/buffer.c	Fri Jun 26 19:44:35 1998
+++ linux/fs/buffer.c	Tue Jul 14 21:32:56 1998
@@ -704,7 +704,7 @@
 		 * of other sizes, this is necessary now that we
 		 * no longer have the lav code.
 		 */
-		try_to_free_buffer(bh,&bh,1);
+		try_to_free_buffer(bh, &bh);
 		if (!bh)
 			break;
 		continue;
@@ -733,9 +733,7 @@
 	/* We are going to try to locate this much memory. */
 	needed = bdf_prm.b_un.nrefill * size;
 
-	while ((nr_free_pages > freepages.min*2) &&
-	       (buffermem >> PAGE_SHIFT) * 100 < (buffer_mem.max_percent * num_physpages) &&
-	       grow_buffers(GFP_BUFFER, size)) {
+	while ((nr_free_pages > freepages.low) && grow_buffers(GFP_BUFFER, size)) {
 		obtained += PAGE_SIZE;
 		if (obtained >= needed)
 			return;
@@ -1646,8 +1644,7 @@
 * try_to_free_buffer() checks if all the buffers on this particular page
 * are unused, and free's the page if so.
 */
-int try_to_free_buffer(struct buffer_head * bh, struct buffer_head ** bhp,
-		       int priority)
+int try_to_free_buffer(struct buffer_head * bh, struct buffer_head ** bhp)
 {
 	unsigned long page;
 	struct buffer_head * tmp, * p;
@@ -1659,11 +1656,9 @@
 	do {
 		if (!tmp)
 			return 0;
-		if (tmp->b_count || buffer_protected(tmp) ||
-		    buffer_dirty(tmp) || buffer_locked(tmp) ||
-		    buffer_waiting(tmp))
-			return 0;
-		if (priority && buffer_touched(tmp))
+		if (tmp->b_count || buffermem < PAGE_SIZE * freepages.low ||
+		    buffer_protected(tmp) || buffer_dirty(tmp) || buffer_locked(tmp)
+		    || buffer_waiting(tmp) || buffer_touched(tmp))
 			return 0;
 		tmp = tmp->b_this_page;
 	} while (tmp != bh);
diff -urN --exclude-from=exclude linux-old/include/linux/fs.h linux/include/linux/fs.h
--- linux-old/include/linux/fs.h	Thu May 21 01:21:42 1998
+++ linux/include/linux/fs.h	Tue Jul 14 21:32:56 1998
@@ -707,7 +707,7 @@
 extern void refile_buffer(struct buffer_head * buf);
 extern void set_writetime(struct buffer_head * buf, int flag);
-extern int try_to_free_buffer(struct buffer_head*, struct buffer_head**, int);
+extern int try_to_free_buffer(struct buffer_head*, struct buffer_head**);
 
 extern int nr_buffers;
 extern int buffermem;
diff -urN --exclude-from=exclude linux-old/include/linux/mm.h linux/include/linux/mm.h
--- linux-old/include/linux/mm.h	Thu Jul 2 20:07:56 1998
+++ linux/include/linux/mm.h	Tue Jul 14 21:32:56 1998
@@ -253,23 +253,6 @@
 
 /* memory.c & swap.c*/
 
-/*
- * This traverses "nr" memory size lists,
- * and returns true if there is enough memory.
- *
- * For example, we want to keep on waking up
- * kswapd every once in a while until the highest
- * memory order has an entry (ie nr == 0), but
- * we want to do it in the background.
- *
- * We want to do it in the foreground only if
- * none of the three highest lists have enough
- * memory. Random number.
- */
-extern int free_memory_available(int nr);
-#define kswapd_continue()	(!free_memory_available(3))
-#define kswapd_wakeup()		(!free_memory_available(0))
-
 #define free_page(addr) free_pages((addr),0)
 extern void FASTCALL(free_pages(unsigned long addr, unsigned long order));
 extern void FASTCALL(__free_page(struct page *));
diff -urN --exclude-from=exclude linux-old/include/linux/swap.h linux/include/linux/swap.h
--- linux-old/include/linux/swap.h	Tue Jun 16 23:29:10 1998
+++ linux/include/linux/swap.h	Tue Jul 14 21:32:56 1998
@@ -50,7 +50,7 @@
 extern int shm_swap (int, int);
 
 /* linux/mm/vmscan.c */
-extern int try_to_free_page(int);
+extern void try_to_free_page(int);
 
 /* linux/mm/page_io.c */
 extern void rw_swap_page(int, unsigned long, char *, int);
@@ -92,17 +92,6 @@
 * swap cache stuff (in linux/mm/swap_state.c)
 */
 
-#define SWAP_CACHE_INFO
-
-#ifdef SWAP_CACHE_INFO
-extern unsigned long swap_cache_add_total;
-extern unsigned long swap_cache_add_success;
-extern unsigned long swap_cache_del_total;
-extern unsigned long swap_cache_del_success;
-extern unsigned long swap_cache_find_total;
-extern unsigned long swap_cache_find_success;
-#endif
-
 extern inline unsigned long in_swap_cache(struct page *page)
 {
 	if (PageSwapCache(page))
@@ -126,21 +115,6 @@
 	if (PageFreeAfter(page))
 		count--;
 	return (count > 1);
-}
-
-/*
- * When we're freeing pages from a user application, we want
- * to cluster swapouts too. -- Rik.
- * linux/mm/page_alloc.c
- */
-static inline int try_to_free_pages(int gfp_mask, int count)
-{
-	int retval = 0;
-	while (count--) {
-		if (try_to_free_page(gfp_mask))
-			retval = 1;
-	}
-	return retval;
 }
 
 /*
diff -urN --exclude-from=exclude linux-old/include/linux/swapctl.h linux/include/linux/swapctl.h
--- linux-old/include/linux/swapctl.h	Thu May 21 01:21:43 1998
+++ linux/include/linux/swapctl.h	Tue Jul 14 21:32:56 1998
@@ -31,16 +31,6 @@
 typedef swapstat_v1 swapstat_t;
 extern swapstat_t swapstats;
 
-typedef struct buffer_mem_v1
-{
-	unsigned int	min_percent;
-	unsigned int	borrow_percent;
-	unsigned int	max_percent;
-} buffer_mem_v1;
-typedef buffer_mem_v1 buffer_mem_t;
-extern buffer_mem_t buffer_mem;
-extern buffer_mem_t page_cache;
-
 typedef struct freepages_v1
 {
 	unsigned int	min;
@@ -49,15 +39,6 @@
 } freepages_v1;
 typedef freepages_v1 freepages_t;
 extern freepages_t freepages;
-
-typedef struct pager_daemon_v1
-{
-	unsigned int	tries_base;
-	unsigned int	tries_min;
-	unsigned int	swap_cluster;
-} pager_daemon_v1;
-typedef pager_daemon_v1 pager_daemon_t;
-extern pager_daemon_t pager_daemon;
 
 #define SC_VERSION	1
 #define SC_MAX_VERSION	1
diff -urN --exclude-from=exclude linux-old/include/linux/sysctl.h linux/include/linux/sysctl.h
--- linux-old/include/linux/sysctl.h	Tue Jun 16 23:29:10 1998
+++ linux/include/linux/sysctl.h	Tue Jul 14 21:32:56 1998
@@ -74,13 +74,9 @@
 enum
 {
 	VM_SWAPCTL=1,		/* struct: Set vm swapping control */
-	VM_SWAPOUT,		/* int: Background pageout interval */
 	VM_FREEPG,		/* struct: Set free page thresholds */
 	VM_BDFLUSH,		/* struct: Control buffer cache flushing */
 	VM_OVERCOMMIT_MEMORY,	/* Turn off the virtual memory safety limit */
-	VM_BUFFERMEM,		/* struct: Set buffer memory thresholds */
-	VM_PAGECACHE,		/* struct: Set cache memory thresholds */
-	VM_PAGERDAEMON,		/* struct: Control kswapd behaviour */
 	VM_PGT_CACHE		/* struct: Set page table cache parameters */
 };
diff -urN --exclude-from=exclude linux-old/kernel/sysctl.c linux/kernel/sysctl.c
--- linux-old/kernel/sysctl.c	Tue Jun 16 23:29:11 1998
+++ linux/kernel/sysctl.c	Tue Jul 14 21:32:56 1998
@@ -7,7 +7,7 @@
 * Added hooks for /proc/sys/net (minor, minor patch), 96/4/1, Mike Shaver.
 * Added kernel/java-{interpreter,appletviewer}, 96/5/10, Mike Shaver.
 * Dynamic registration fixes, Stephen Tweedie.
- * Added kswapd-interval, ctrl-alt-del, printk stuff, 1/8/97, Chris Horn.
+ * Added ctrl-alt-del, printk stuff, 1/8/97, Chris Horn.
 * Made sysctl support optional via CONFIG_SYSCTL, 1/10/97, Chris Horn.
 */
@@ -37,7 +37,7 @@
 
 /* External variables not in a header file. */
 extern int panic_timeout;
-extern int console_loglevel, C_A_D, swapout_interval;
+extern int console_loglevel, C_A_D;
 extern int bdf_prm[], bdflush_min[], bdflush_max[];
 extern char binfmt_java_interpreter[], binfmt_java_appletviewer[];
 extern int sysctl_overcommit_memory;
@@ -191,21 +191,12 @@
 static ctl_table vm_table[] = {
 	{VM_SWAPCTL, "swapctl", &swap_control, sizeof(swap_control_t),
 	 0644, NULL, &proc_dointvec},
-	{VM_SWAPOUT, "swapout_interval",
-	 &swapout_interval, sizeof(int), 0644, NULL, &proc_dointvec},
 	{VM_FREEPG, "freepages",
 	 &freepages, sizeof(freepages_t), 0644, NULL, &proc_dointvec},
 	{VM_BDFLUSH, "bdflush", &bdf_prm, 9*sizeof(int), 0600, NULL,
-	 &proc_dointvec_minmax, &sysctl_intvec, NULL,
-	 &bdflush_min, &bdflush_max},
+	 &proc_dointvec_minmax, &sysctl_intvec, NULL, &bdflush_min, &bdflush_max},
 	{VM_OVERCOMMIT_MEMORY, "overcommit_memory", &sysctl_overcommit_memory,
 	 sizeof(sysctl_overcommit_memory), 0644, NULL, &proc_dointvec},
-	{VM_BUFFERMEM, "buffermem",
-	 &buffer_mem, sizeof(buffer_mem_t), 0644, NULL, &proc_dointvec},
-	{VM_PAGECACHE, "pagecache",
-	 &page_cache, sizeof(buffer_mem_t), 0644, NULL, &proc_dointvec},
-	{VM_PAGERDAEMON, "kswapd",
-	 &pager_daemon, sizeof(pager_daemon_t), 0644, NULL, &proc_dointvec},
 	{VM_PGT_CACHE, "pagetable_cache",
 	 &pgt_cache_water, 2*sizeof(int), 0600, NULL, &proc_dointvec},
 	{0}
diff -urN --exclude-from=exclude linux-old/mm/filemap.c linux/mm/filemap.c
--- linux-old/mm/filemap.c	Thu Jul 2 20:07:56 1998
+++ linux/mm/filemap.c	Tue Jul 14 21:32:56 1998
@@ -150,10 +150,6 @@
 			}
 			tmp = tmp->b_this_page;
 		} while (tmp != bh);
-
-		/* Refuse to swap out all buffer pages */
-		if ((buffermem >> PAGE_SHIFT) * 100 < (buffer_mem.min_percent * num_physpages))
-			goto next;
 	}
 
 	/* We can't throw away shared pages, but we do mark
@@ -164,15 +160,11 @@
 
 	switch (atomic_read(&page->count)) {
 	case 1:
+		/* If it has been referenced recently, don't free it */
+		if (test_and_clear_bit(PG_referenced, &page->flags))
+			break;
 		/* is it a swap-cache or page-cache page? */
 		if (page->inode) {
-			if (test_and_clear_bit(PG_referenced, &page->flags)) {
-				touch_page(page);
-				break;
-			}
-			age_page(page);
-			if (page->age || page_cache_size * 100 < (page_cache.min_percent * num_physpages))
-				break;
 			if (PageSwapCache(page)) {
 				delete_from_swap_cache(page);
 				return 1;
@@ -182,13 +174,8 @@
 			__free_page(page);
 			return 1;
 		}
-		/* It's not a cache page, so we don't do aging.
-		 * If it has been referenced recently, don't free it */
-		if (test_and_clear_bit(PG_referenced, &page->flags))
-			break;
-
 		/* is it a buffer cache page? */
-		if ((gfp_mask & __GFP_IO) && bh && try_to_free_buffer(bh, &bh, 6))
+		if ((gfp_mask & __GFP_IO) && bh && try_to_free_buffer(bh, &bh))
 			return 1;
 		break;
diff -urN --exclude-from=exclude linux-old/mm/page_alloc.c linux/mm/page_alloc.c
--- linux-old/mm/page_alloc.c	Fri Jun 26 19:44:38 1998
+++ linux/mm/page_alloc.c	Tue Jul 14 21:32:56 1998
@@ -100,53 +100,6 @@
 */
 spinlock_t page_alloc_lock = SPIN_LOCK_UNLOCKED;
 
-/*
- * This routine is used by the kernel swap daemon to determine
- * whether we have "enough" free pages. It is fairly arbitrary,
- * but this had better return false if any reasonable "get_free_page()"
- * allocation could currently fail..
- *
- * This will return zero if no list was found, non-zero
- * if there was memory (the bigger, the better).
- */
-int free_memory_available(int nr)
-{
-	int retval = 0;
-	unsigned long flags;
-	struct free_area_struct * list;
-
-	/*
-	 * If we have more than about 3% to 5% of all memory free,
-	 * consider it to be good enough for anything.
-	 * It may not be, due to fragmentation, but we
-	 * don't want to keep on forever trying to find
-	 * free unfragmented memory.
-	 * Added low/high water marks to avoid thrashing -- Rik.
-	 */
-	if (nr_free_pages > (nr ? freepages.low : freepages.high))
-		return nr+1;
-
-	list = free_area + NR_MEM_LISTS;
-	spin_lock_irqsave(&page_alloc_lock, flags);
-	/* We fall through the loop if the list contains one
-	 * item. -- thanks to Colin Plumb <colin@nyx.net>
-	 */
-	do {
-		list--;
-		/* Empty list? Bad - we need more memory */
-		if (list->next == memory_head(list))
-			break;
-		/* One item on the list? Look further */
-		if (list->next->next == memory_head(list))
-			continue;
-		/* More than one item? We're ok */
-		retval = nr + 1;
-		break;
-	} while (--nr >= 0);
-	spin_unlock_irqrestore(&page_alloc_lock, flags);
-	return retval;
-}
-
 static inline void free_pages_ok(unsigned long map_nr, unsigned long order)
 {
 	struct free_area_struct *area = free_area + order;
@@ -215,30 +168,6 @@
 */
 #define MARK_USED(index, order, area) \
 	change_bit((index) >> (1+(order)), (area)->map)
-#define CAN_DMA(x) (PageDMA(x))
-#define ADDRESS(x) (PAGE_OFFSET + ((x) << PAGE_SHIFT))
-#define RMQUEUE(order, maxorder, dma) \
-do { struct free_area_struct * area = free_area+order; \
-     unsigned long new_order = order; \
-	do { struct page *prev = memory_head(area), *ret = prev->next; \
-		while (memory_head(area) != ret) { \
-			if (new_order >= maxorder && ret->next == prev) \
-				break; \
-			if (!dma || CAN_DMA(ret)) { \
-				unsigned long map_nr = ret->map_nr; \
-				(prev->next = ret->next)->prev = prev; \
-				MARK_USED(map_nr, new_order, area); \
-				nr_free_pages -= 1 << order; \
-				EXPAND(ret, map_nr, order, new_order, area); \
-				spin_unlock_irqrestore(&page_alloc_lock, flags); \
-				return ADDRESS(map_nr); \
-			} \
-			prev = ret; \
-			ret = ret->next; \
-		} \
-		new_order++; area++; \
-	} while (new_order < NR_MEM_LISTS); \
-} while (0)
 
 #define EXPAND(map,index,low,high,area) \
 do { unsigned long size = 1 << high; \
@@ -255,18 +184,11 @@
 
 unsigned long __get_free_pages(int gfp_mask, unsigned long order)
 {
-	unsigned long flags, maxorder;
+	unsigned long flags, new_order, extra = 0;
+	struct free_area_struct *area;
 
 	if (order >= NR_MEM_LISTS)
-		goto nopage;
-
-	/*
-	 * "maxorder" is the highest order number that we're allowed
-	 * to empty in order to find a free page..
-	 */
-	maxorder = NR_MEM_LISTS-1;
-	if (gfp_mask & __GFP_HIGH)
-		maxorder = NR_MEM_LISTS;
+		return 0;
 
 	if (in_interrupt() && (gfp_mask & __GFP_WAIT)) {
 		static int count = 0;
@@ -277,18 +199,39 @@
 		}
 	}
 
-	for (;;) {
-		spin_lock_irqsave(&page_alloc_lock, flags);
-		RMQUEUE(order, maxorder, (gfp_mask & GFP_DMA));
-		spin_unlock_irqrestore(&page_alloc_lock, flags);
-		if (!(gfp_mask & __GFP_WAIT))
-			break;
-		if (!try_to_free_pages(gfp_mask, SWAP_CLUSTER_MAX))
-			break;
-		gfp_mask &= ~__GFP_WAIT;	/* go through this only once */
-		maxorder = NR_MEM_LISTS;	/* Allow anything this time */
+repeat:
+	if ((gfp_mask & __GFP_WAIT))
+		if (extra || (nr_free_pages < freepages.min && !(gfp_mask & __GFP_MED)))
+			while (nr_free_pages + atomic_read(&nr_async_pages) <
+			       freepages.low + extra)
+				try_to_free_page(gfp_mask);
+	new_order = order;
+	area = free_area + order;
+	spin_lock_irqsave(&page_alloc_lock, flags);
+	do {
+		struct page *prev = memory_head(area), *ret;
+
+		while (memory_head(area) != (ret = prev->next)) {
+			if (!(gfp_mask & GFP_DMA) || PageDMA(ret)) {
+				unsigned long map_nr = ret->map_nr;
+
+				(prev->next = ret->next)->prev = prev;
+				MARK_USED(map_nr, new_order, area);
+				nr_free_pages -= 1 << order;
+				EXPAND(ret, map_nr, order, new_order, area);
+				spin_unlock_irqrestore(&page_alloc_lock, flags);
+				return PAGE_OFFSET + (map_nr << PAGE_SHIFT);
+			}
+			prev = ret;
+		}
+		new_order++;
+		area++;
+	} while (new_order < NR_MEM_LISTS);
+	spin_unlock_irqrestore(&page_alloc_lock, flags);
+	if (gfp_mask & __GFP_WAIT) {
+		extra += SWAP_CLUSTER_MAX;
+		goto repeat;
 	}
-nopage:
 	return 0;
 }
@@ -315,9 +258,6 @@
 	}
 	spin_unlock_irqrestore(&page_alloc_lock, flags);
 	printk("= %lukB)\n", total);
-#ifdef SWAP_CACHE_INFO
-	show_swap_cache_info();
-#endif
 }
 
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
@@ -340,14 +280,14 @@
 	 * that we don't waste too much memory on large systems.
 	 * This is totally arbitrary.
 	 */
-	i = (end_mem - PAGE_OFFSET) >> (PAGE_SHIFT+7);
+	i = (end_mem - PAGE_OFFSET) >> (PAGE_SHIFT + 7);
 	if (i < 48)
 		i = 48;
 	if (i > 256)
 		i = 256;
 	freepages.min = i;
 	freepages.low = i << 1;
-	freepages.high = freepages.low + i;
+	freepages.high = i << 2;
 	mem_map = (mem_map_t *) LONG_ALIGN(start_mem);
 	p = mem_map + MAP_NR(end_mem);
 	start_mem = LONG_ALIGN((unsigned long) p);
diff -urN --exclude-from=exclude linux-old/mm/slab.c linux/mm/slab.c
--- linux-old/mm/slab.c	Fri Jun 26 19:44:38 1998
+++ linux/mm/slab.c	Tue Jul 14 21:32:56 1998
@@ -308,12 +308,12 @@
 #define SLAB_MAX_GFP_ORDER	5	/* 32 pages */
 
 /* the 'preferred' minimum num of objs per slab - maybe less for large objs */
-#define SLAB_MIN_OBJS_PER_SLAB	4
+#define SLAB_MIN_OBJS_PER_SLAB	1
 
 /* If the num of objs per slab is <= SLAB_MIN_OBJS_PER_SLAB,
 * then the page order must be less than this before trying the next order.
 */
-#define SLAB_BREAK_GFP_ORDER	2
+#define SLAB_BREAK_GFP_ORDER	1
 
 /* Macros for storing/retrieving the cachep and or slab from the
 * global 'mem_map'. With off-slab bufctls, these are used to find the
diff -urN --exclude-from=exclude linux-old/mm/swap.c linux/mm/swap.c
--- linux-old/mm/swap.c	Fri Jun 26 19:44:38 1998
+++ linux/mm/swap.c	Tue Jul 14 21:32:56 1998
@@ -10,7 +10,6 @@
 *	linux/Documentation/sysctl/vm.txt.
 *	Started 18.12.91
 *	Swap aging added 23.2.95, Stephen Tweedie.
- *	Buffermem limits added 12.3.98, Rik van Riel.
 */
 
 #include <linux/mm.h>
@@ -36,8 +35,8 @@
 /*
 * We identify three levels of free memory. We never let free mem
 * fall below the freepages.min except for atomic allocations. We
- * start background swapping if we fall below freepages.high free
- * pages, and we begin intensive swapping below freepages.low.
+ * start background swapping if we fall below freepages.low free
+ * pages, and we begin intensive swapping below freepages.min.
 *
 * These values are there to keep GCC from complaining. Actual
 * initialization is done in mm/page_alloc.c or arch/sparc(64)/mm/init.c.
@@ -45,7 +44,7 @@
 freepages_t freepages = {
 	48,	/* freepages.min */
 	96,	/* freepages.low */
-	144	/* freepages.high */
+	192	/* freepages.high */
 };
 
 /* We track the number of pages currently being asynchronously swapped
@@ -65,21 +64,3 @@
 };
 
 swapstat_t swapstats = {0};
-
-buffer_mem_t buffer_mem = {
-	3,	/* minimum percent buffer */
-	10,	/* borrow percent buffer */
-	30	/* maximum percent buffer */
-};
-
-buffer_mem_t page_cache = {
-	10,	/* minimum percent page cache */
-	30,	/* borrow percent page cache */
-	75	/* maximum */
-};
-
-pager_daemon_t pager_daemon = {
-	512,	/* base number for calculating the number of tries */
-	SWAP_CLUSTER_MAX,	/* minimum number of tries */
-	SWAP_CLUSTER_MAX,	/* do swap I/O in clusters of this size */
-};
diff -urN --exclude-from=exclude linux-old/mm/swap_state.c linux/mm/swap_state.c
--- linux-old/mm/swap_state.c	Tue Mar 10 19:51:02 1998
+++ linux/mm/swap_state.c	Tue Jul 14 21:32:56 1998
@@ -24,14 +24,6 @@
 #include <asm/bitops.h>
 #include <asm/pgtable.h>
 
-#ifdef SWAP_CACHE_INFO
-unsigned long swap_cache_add_total = 0;
-unsigned long swap_cache_add_success = 0;
-unsigned long swap_cache_del_total = 0;
-unsigned long swap_cache_del_success = 0;
-unsigned long swap_cache_find_total = 0;
-unsigned long swap_cache_find_success = 0;
-
 /*
 * Keep a reserved false inode which we will use to mark pages in the
 * page cache are acting as swap cache instead of file cache.
@@ -43,21 +35,8 @@
 */
 struct inode swapper_inode;
-
-void show_swap_cache_info(void)
-{
-	printk("Swap cache: add %ld/%ld, delete %ld/%ld, find %ld/%ld\n",
-		swap_cache_add_total, swap_cache_add_success,
-		swap_cache_del_total, swap_cache_del_success,
-		swap_cache_find_total, swap_cache_find_success);
-}
-#endif
-
 int add_to_swap_cache(struct page *page, unsigned long entry)
 {
-#ifdef SWAP_CACHE_INFO
-	swap_cache_add_total++;
-#endif
 #ifdef DEBUG_SWAP
 	printk("DebugVM: add_to_swap_cache(%08lx count %d, entry %08lx)\n",
 		page_address(page), atomic_read(&page->count), entry);
@@ -78,9 +57,6 @@
 	page->offset = entry;
 	add_page_to_hash_queue(page, &swapper_inode, entry);
 	add_page_to_inode_queue(&swapper_inode, page);
-#ifdef SWAP_CACHE_INFO
-	swap_cache_add_success++;
-#endif
 	return 1;
 }
@@ -168,14 +144,9 @@
 
 long find_in_swap_cache(struct page *page)
 {
-#ifdef SWAP_CACHE_INFO
-	swap_cache_find_total++;
-#endif
 	if (PageSwapCache (page)) {
 		long entry = page->offset;
-#ifdef SWAP_CACHE_INFO
-		swap_cache_find_success++;
-#endif
+
 		remove_from_swap_cache (page);
 		return entry;
 	}
@@ -184,14 +155,8 @@
 
 int delete_from_swap_cache(struct page *page)
 {
-#ifdef SWAP_CACHE_INFO
-	swap_cache_del_total++;
-#endif
 	if (PageSwapCache (page)) {
 		long entry = page->offset;
-#ifdef SWAP_CACHE_INFO
-		swap_cache_del_success++;
-#endif
 #ifdef DEBUG_SWAP
 		printk("DebugVM: delete_from_swap_cache(%08lx count %d, "
 			"entry %08lx)\n",
@@ -297,4 +262,3 @@
 #endif
 	return new_page;
 }
-
diff -urN --exclude-from=exclude linux-old/mm/vmscan.c linux/mm/vmscan.c
--- linux-old/mm/vmscan.c	Fri Jun 26 19:44:38 1998
+++ linux/mm/vmscan.c	Tue Jul 14 21:32:56 1998
@@ -29,17 +29,6 @@
 #include <asm/pgtable.h>
 
 /*
- * When are we next due for a page scan?
- */
-static unsigned long next_swap_jiffies = 0;
-
-/*
- * How often do we do a pageout scan during normal conditions?
- * Default is four times a second.
- */
-int swapout_interval = HZ / 4;
-
-/*
 * The wait queue for waking up the pageout daemon:
 */
 static struct wait_queue * kswapd_wait = NULL;
@@ -444,61 +433,39 @@
 * to be. This works out OK, because we now do proper aging on page
 * contents.
 */
-static inline int do_try_to_free_page(int gfp_mask)
+void try_to_free_page(int gfp_mask)
 {
 	static int state = 0;
-	int i=6;
-	int stop;
+	int prio = 6;
+
+	lock_kernel();
 
 	/* Always trim SLAB caches when memory gets low. */
 	kmem_cache_reap(gfp_mask);
 
-	/* We try harder if we are waiting .. */
-	stop = 3;
-	if (gfp_mask & __GFP_WAIT)
-		stop = 0;
-
-	if (((buffermem >> PAGE_SHIFT) * 100 > buffer_mem.borrow_percent * num_physpages)
-	    || (page_cache_size * 100 > page_cache.borrow_percent * num_physpages))
-		state = 0;
-
-	switch (state) {
-		do {
+	for (prio = 6; prio >= 0; prio--) {
+		switch (state) {
 		case 0:
-			if (shrink_mmap(i, gfp_mask))
-				return 1;
+			if (shrink_mmap(prio, gfp_mask))
+				goto out;
 			state = 1;
 		case 1:
-			if ((gfp_mask & __GFP_IO) && shm_swap(i, gfp_mask))
-				return 1;
+			if ((gfp_mask & __GFP_IO) && shm_swap(prio, gfp_mask))
+				goto out;
 			state = 2;
 		case 2:
-			if (swap_out(i, gfp_mask))
-				return 1;
+			if (swap_out(prio, gfp_mask))
+				goto out;
 			state = 3;
 		case 3:
-			shrink_dcache_memory(i, gfp_mask);
+			shrink_dcache_memory(prio, gfp_mask);
 			state = 0;
-		i--;
-		} while ((i - stop) >= 0);
-	}
-	return 0;
-}
-
-/*
- * This is REALLY ugly.
- *
- * We need to make the locks finer granularity, but right
- * now we need this so that we can do page allocations
- * without holding the kernel lock etc.
- */
-int try_to_free_page(int gfp_mask)
-{
-	int retval;
-
-	lock_kernel();
-	retval = do_try_to_free_page(gfp_mask);
-	unlock_kernel();
-	return retval;
+		}
+	}
+ out:
+	unlock_kernel();
+	if (atomic_read(&nr_async_pages) >= SWAP_CLUSTER_MAX)
+		run_task_queue(&tq_disk);
 }
 
 /*
@@ -547,54 +514,16 @@
 	init_swap_timer();
 	add_wait_queue(&kswapd_wait, &wait);
-	while (1) {
-		int tries;
-		int tried = 0;
-
+	for (;;) {
 		current->state = TASK_INTERRUPTIBLE;
 		flush_signals(current);
-		run_task_queue(&tq_disk);
 		schedule();
 		swapstats.wakeups++;
 
-		/*
-		 * Do the background pageout: be
-		 * more aggressive if we're really
-		 * low on free memory.
-		 *
-		 * We try page_daemon.tries_base times, divided by
-		 * an 'urgency factor'. In practice this will mean
-		 * a value of pager_daemon.tries_base / 8 or 4 = 64
-		 * or 128 pages at a time.
-		 * This gives us 64 (or 128) * 4k * 4 (times/sec) =
-		 * 1 (or 2) MB/s swapping bandwidth in low-priority
-		 * background paging. This number rises to 8 MB/s
-		 * when the priority is highest (but then we'll be
-		 * woken up more often and the rate will be even
-		 * higher).
-		 */
-		tries = pager_daemon.tries_base >> free_memory_available(3);
-
-		while (tries--) {
-			int gfp_mask;
-
-			if (++tried > pager_daemon.tries_min && free_memory_available(0))
-				break;
-			gfp_mask = __GFP_IO;
-			try_to_free_page(gfp_mask);
-			/*
-			 * Syncing large chunks is faster than swapping
-			 * synchronously (less head movement). -- Rik.
-			 */
-			if (atomic_read(&nr_async_pages) >= pager_daemon.swap_cluster)
-				run_task_queue(&tq_disk);
-
-		}
-	}
-	/* As if we could ever get here - maybe we want to make this killable */
-	remove_wait_queue(&kswapd_wait, &wait);
-	unlock_kernel();
-	return 0;
+		while (nr_free_pages + atomic_read(&nr_async_pages) < freepages.high)
+			try_to_free_page(nr_free_pages < freepages.min ?
+					 (__GFP_IO | __GFP_WAIT) : __GFP_IO);
+	}
 }
 
 /*
@@ -602,38 +531,9 @@
 */
 void swap_tick(void)
 {
-	unsigned long now, want;
-	int want_wakeup = 0;
-
-	want = next_swap_jiffies;
-	now = jiffies;
-
-	/*
-	 * Examine the memory queues. Mark memory low
-	 * if there is nothing available in the three
-	 * highest queues.
-	 *
-	 * Schedule for wakeup if there isn't lots
-	 * of free memory.
-	 */
-	switch (free_memory_available(3)) {
-	case 0:
-		want = now;
-		/* Fall through */
-	case 1 ... 3:
-		want_wakeup = 1;
-	default:
-	}
-
-	if ((long) (now - want) >= 0) {
-		if (want_wakeup || (num_physpages * buffer_mem.max_percent) < (buffermem >> PAGE_SHIFT) * 100
-		    || (num_physpages * page_cache.max_percent < page_cache_size * 100)) {
-			/* Set the next wake-up time */
-			next_swap_jiffies = now + swapout_interval;
-			wake_up(&kswapd_wait);
-		}
-	}
-	timer_active |= (1<<SWAP_TIMER);
+	if (nr_free_pages < freepages.low)
+		wake_up(&kswapd_wait);
+	timer_active |= (1 << SWAP_TIMER);
 }
 
 /*
@@ -644,5 +544,5 @@
 {
 	timer_table[SWAP_TIMER].expires = 0;
 	timer_table[SWAP_TIMER].fn = swap_tick;
-	timer_active |= (1<<SWAP_TIMER);
+	timer_active |= (1 << SWAP_TIMER);
 }
--
Posted by Zlatko Calusic        E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
Unix _IS_ user friendly - it's just selective about who its friends are!
* Re: More info: 2.1.108 page cache performance on low memory
From: Stephen C. Tweedie @ 1998-07-14 17:30 UTC
To: Eric W. Biederman
Cc: Stephen C. Tweedie, linux-mm

Hi,

On 13 Jul 1998 13:08:56 -0500, ebiederm+eric@npwt.net (Eric
W. Biederman) said:

> 1) We have a minimum size for the buffer cache as a percentage of
>    physical pages.  Setting the minimum to 0% may help.
...
> Personally I think it is broken to set the limits of cache sizes
> (buffer & page) to anything besides max=100%, min=0% by default.

Yep; I disabled those limits for the benchmarks I announced.
Disabling the ageing but keeping the limits in place still resulted in
a performance loss.

> 2) If we play with an LRU list, it may be most practical to use the
>    page->next and page->prev fields for the list, and for
>    truncate_inode_pages && invalidate_inode_pages

Yikes --- for large files the proposal that we

>    do something like:
>	for (i = 0; i < inode->i_size; i += PAGE_SIZE) {
>		page = find_in_page_cache(inode, i);
>		if (page)
>			/* remove it */ ;
>	}

will be disastrous.  No, I think we still need the per-inode page
lists.  When we eventually get an fsync() which works through the page
cache, this will become even more important.

--Stephen
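To make the cost difference concrete: with the per-inode list kept, a
truncate-style teardown only touches pages that are actually in the
cache.  A minimal sketch, using the list/hash naming visible in the
patch earlier in this thread (the remove_* helpers are assumed
counterparts of add_page_to_inode_queue()/add_page_to_hash_queue(), so
treat this as an illustration rather than the real 2.1 code):

	/* O(pages actually cached), independent of inode->i_size;
	 * the offset-probing loop above is O(i_size / PAGE_SIZE)
	 * even when almost nothing is cached. */
	void truncate_all_inode_pages(struct inode *inode)
	{
		struct page *page;

		while ((page = inode->i_pages) != NULL) {
			remove_page_from_inode_queue(page); /* off i_pages  */
			remove_page_from_hash_queue(page);  /* off the hash */
			page->inode = NULL;
			__free_page(page);                  /* drop cache ref */
		}
	}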
* Re: More info: 2.1.108 page cache performance on low memory
From: Eric W. Biederman @ 1998-07-18 1:10 UTC
To: Stephen C. Tweedie
Cc: Eric W. Biederman, linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@redhat.com> writes:

ST> Yikes --- for large files the proposal that we
[...]
ST> will be disastrous.  No, I think we still need the per-inode page
ST> lists.  When we eventually get an fsync() which works through the
ST> page cache, this will become even more important.

Duh.  Ext2 only does this with the block cache on a real truncate;
when an inode is closed it doesn't need to do that.  Sorry, I thought
I had a precedent for that algorithm.

O.k., scratch that idea.

So I guess an LRU list for pages will require that we increase the
size of struct page.  I guess it makes sense if we can ultimately:
a) use it for every page on the system, a la the swap cache;
b) remove the buffer cache, which should provide the necessary
   expansion room, so we won't ultimately use more space;
c) use it for an LRU of dirty pages;
d) avoid fragmenting memory with slabs...

I hate considering expanding struct page after all of the work that
has gone into shrinking it lately....

And for writes it looks like I'll need a write time too, for best
performance.  I've written the code, I just haven't tested it yet.

Zlatko, could I talk you into setting the defines in mm.h so that
shmfs will use those, and reporting whether bonnie improves?

Eric

p.s. Everyone please excuse any slow replies; I'm in the middle of
moving and I can't read my mail too often.
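A sketch of the two struct page extensions Eric describes; the field
names here are illustrative only, not taken from any posted patch:

	/* include/linux/mm.h, hypothetically extended */
	typedef struct page {
		/* ... existing 2.1 fields: next, prev, inode, offset,
		 *     next_hash, count, flags, age, map_nr, buffers ... */
		struct page *lru_next, *lru_prev; /* global LRU, replaces age */
		unsigned long writetime;          /* when a dirty page is due */
	} mem_map_t;

Three extra words per physical page is roughly 12 bytes per 4K page on
x86, i.e. about 0.3% of memory, which is why growing struct page again
is so unattractive.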
* Re: More info: 2.1.108 page cache performance on low memory
From: Zlatko Calusic @ 1998-07-18 13:28 UTC
To: Eric W. Biederman
Cc: Stephen C. Tweedie, linux-mm

ebiederm+eric@npwt.net (Eric W. Biederman) writes:

[...]
> And for writes it looks like I'll need a write time too, for best
> performance.  I've written the code, I just haven't tested it yet.
>
> Zlatko, could I talk you into setting the defines in mm.h so that
> shmfs will use those, and reporting whether bonnie improves?

When it comes to benchmarking, I'm always prepared. :)  It's just that
I didn't completely understand what you are trying to do, but if you
have a prepared patch, I'll gladly test it.

BTW, looking at 2.1.109, I'm very pleased with the changes made in the
mm/ directory.  Finally, free_memory_available is simple, readable and
efficient. ;)

Next week I will test some ideas which could possibly improve things
WITH page aging.

I must admit, after all the criticism I have heaped on page aging,
that I believe it's the right way to go, but it should be done
properly.  Performance should be better, not worse.

Regards,
--
Posted by Zlatko Calusic        E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
Any sufficiently advanced bug is indistinguishable from a feature.
* Re: More info: 2.1.108 page cache performance on low memory
From: Eric W. Biederman @ 1998-07-18 16:40 UTC
To: Zlatko.Calusic
Cc: Stephen C. Tweedie, linux-mm

>>>>> "ZC" == Zlatko Calusic <Zlatko.Calusic@CARNet.hr> writes:

Let me just step back a second so I can be clear:

A) The idea proposed by Stephen was that perhaps we could use Least
Recently Used lists instead of page aging.  It's effectively the same
thing, but shrink_mmap can find the old pages much, much faster by
simply following a linked list.

B) This idea intrigues me because I have about the same problem in
handling generic dirty pages.  In cloning bdflush for the page cache I
discovered two fields I would need to add to struct page to do an
exact cloning job: a page write time, and LRU list pointers for dirty
pages.  I went ahead and implemented them, but also implemented an
alternative, which is the default.

So I'm terribly interested in any discussion of LRU lists.  As soon as
I get the time I'll even implement the more general case.  Mostly I
just need to get my computer moved to where I am, so I can code when I
have free time. :)

What I have now is controlled by the defines I added to
include/linux/mm.h with my shmfs patches:

#undef USE_PG_FLUSHTIME   (This tells sync_old_pages when to stop)
#undef USE_PG_DIRTY_LIST  (Define this for a first pass at an LRU list
                           for dirty pages)

If nothing else it's worth trying, to see if it improves my write
times, which fall way behind the read times on Zlatko's benchmark. :(

If I can talk Zlatko or someone into looking at these, it would be
nice.  I really need to get my own copy of bonnie and a few other
benchmarks...

ZC> Next week I will test some ideas which could possibly improve
ZC> things WITH page aging.

ZC> I must admit, after all the criticism I have heaped on page aging,
ZC> that I believe it's the right way to go, but it should be done
ZC> properly.  Performance should be better, not worse.

Agreed.  We should look very carefully, though, to see if any aging
solution increases fragmentation.  According to Stephen the current
one does, and this may be a natural result of aging and not just of a
single implementation. :(

Eric
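A minimal sketch of idea (A), assuming a global doubly linked page
LRU; lru_tail(), lru_make_young() and lru_remove() are hypothetical
helpers, and the removal steps mirror what shrink_mmap() already does
in 2.1:

	/* Reclaim from the cold end of the LRU instead of scanning
	 * mem_map and decrementing page->age. */
	static int shrink_mmap_lru(int count, int gfp_mask)
	{
		struct page *page;

		while (count-- > 0 && (page = lru_tail()) != NULL) {
			if (atomic_read(&page->count) != 1 ||
			    test_and_clear_bit(PG_referenced, &page->flags)) {
				lru_make_young(page); /* busy or recently used */
				continue;
			}
			lru_remove(page);
			if (PageSwapCache(page)) {
				delete_from_swap_cache(page);
				return 1;
			}
			remove_page_from_hash_queue(page);
			remove_page_from_inode_queue(page);
			__free_page(page);
			return 1;
		}
		return 0;
	}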
* Re: More info: 2.1.108 page cache performance on low memory
From: Zlatko Calusic @ 1998-07-20 9:15 UTC
To: Eric W. Biederman
Cc: Stephen C. Tweedie, linux-mm

ebiederm+eric@npwt.net (Eric W. Biederman) writes:

> A) The idea proposed by Stephen was that perhaps we could use Least
> Recently Used lists instead of page aging.  It's effectively the same
> thing, but shrink_mmap can find the old pages much, much faster by
> simply following a linked list.

Well, it looks like a good idea.

> B) This idea intrigues me because I have about the same problem in
> handling generic dirty pages.  In cloning bdflush for the page cache
> I discovered two fields I would need to add to struct page to do an
> exact cloning job: a page write time, and LRU list pointers for dirty
> pages.  I went ahead and implemented them, but also implemented an
> alternative, which is the default.

I don't know how much impact adding a few fields to struct page has on
performance.  Why don't you just add those two fields, so we can see
what happens?

I don't know if it's easy, but we should probably get rid of the
buffer cache completely at some point in time.  It's hard to balance
things between two caches, not to mention the other memory objects in
the kernel.  If the page cache is ever to replace the buffer cache, it
will definitely need some parts of the already established mechanisms
and data types that the buffer cache has now.

On the other side, I must admit that I didn't see any more
fragmentation with page aging.  It's just that memory gets used in
weird ways when it's on, and there's lots of unneeded swapping.

Then again, I have made some changes that make my system very stable
wrt memory fragmentation:

#define SLAB_MIN_OBJS_PER_SLAB	1
#define SLAB_BREAK_GFP_ORDER	1

in mm/slab.c.  I discussed this privately with the slab maintainer,
Mark Hemment, who pointed out that with this setting the slab is
probably not as efficient as it could be.  Also, slack is bigger,
obviously.  I didn't completely understand all the reasons why this
could be slower, and I must admit that I can't see any bad impact on
performance.  I did really lots of benchmarking.  5.5MB/sec through
two 100Mbps NICs, via the router and straight to a cheap IDE disk on a
low-end Pentium, is not what you'd call bad performance. :)

But the system is much more stable, and it is now very *very* hard to
get that annoying "Couldn't get a free page..." message, whereas
before (with the default setup) it was as easy as clicking a button in
Netscape.  I even have some custom scripts that make lots of FTP
connections to fast sites, as that was proven to block my system quite
easily before.

> So I'm terribly interested in any discussion of LRU lists.  As soon
> as I get the time I'll even implement the more general case.  Mostly
> I just need to get my computer moved to where I am, so I can code
> when I have free time. :)

I hope you have found a nice place to live, so that you can get happy
and make loads of great code. :)

> What I have now is controlled by the defines I added to
> include/linux/mm.h with my shmfs patches:
>
> #undef USE_PG_FLUSHTIME   (This tells sync_old_pages when to stop)
> #undef USE_PG_DIRTY_LIST  (Define this for a first pass at an LRU
>                            list for dirty pages)
>
> If nothing else it's worth trying, to see if it improves my write
> times, which fall way behind the read times on Zlatko's benchmark. :(

As I already said, it will be my pleasure to test things and give my
comments.  I have spent lots of time tweaking here and there,
measuring not only performance but stability, too.  Half a year ago my
system was really unstable, thanks to memory fragmentation.  I would
occasionally be logged in via XDM and have to kill the whole session
(Ctrl-Alt-BS), because everything would stall after the initial
"Couldn't get a free page...".  Then I got annoyed with that and tried
to find a solution, or at least a workaround... :)

> If I can talk Zlatko or someone into looking at these, it would be
> nice.  I really need to get my own copy of bonnie and a few other
> benchmarks...

I'll send you a copy of the bonnie source in another private mail.

> Agreed.  We should look very carefully, though, to see if any aging
> solution increases fragmentation.  According to Stephen the current
> one does, and this may be a natural result of aging and not just of
> a single implementation. :(

Speaking of low memory machines, I think that inode memory is a much
bigger problem there.  I had the opportunity to test the 2.1.x series
on a 5MB 386DX40, and the system runs nowhere near perfectly. :(

Regards,
--
Posted by Zlatko Calusic        E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
"640K ought to be enough for anybody."  Bill Gates '81
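For readers without mm/slab.c at hand: the two defines bound how many
pages (as a power-of-two "order") a slab may span while it holds few
objects.  A simplified rendering of the sizing logic in
kmem_cache_create(), not the verbatim code:

	/* Grow the slab while too few objects fit, but never past
	 * SLAB_BREAK_GFP_ORDER.  With both defines set to 1, almost
	 * every cache ends up with order-0 or order-1 (one- or
	 * two-page) slabs, which are far easier to allocate on a
	 * fragmented machine than the order-2+ slabs the defaults
	 * allow. */
	static unsigned long slab_gfporder(size_t objsize)
	{
		unsigned long order = 0;

		while ((PAGE_SIZE << order) / objsize <= SLAB_MIN_OBJS_PER_SLAB &&
		       order < SLAB_BREAK_GFP_ORDER)
			order++;
		return order;
	}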
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-20  9:15 ` Zlatko Calusic
@ 1998-07-22 10:40   ` Stephen C. Tweedie
  1998-07-23 10:06     ` Zlatko Calusic
  0 siblings, 1 reply; 46+ messages in thread
From: Stephen C. Tweedie @ 1998-07-22 10:40 UTC (permalink / raw)
  To: Zlatko.Calusic; +Cc: Eric W. Biederman, Stephen C. Tweedie, linux-mm

Hi,

On 20 Jul 1998 11:15:12 +0200, Zlatko Calusic
<Zlatko.Calusic@CARNet.hr> said:

> I don't know if it's easy, but we probably should get rid of the
> buffer cache completely at some point. It's hard to balance things
> between two caches, not to mention other memory objects in the kernel.

No, we need the buffer cache for all sorts of things.  You'd have to
reinvent it if you got rid of it, since it is the main mechanism by
which we can reliably label IO for the block device driver layer, and
we also cache non-page-aligned filesystem metadata there.

> Then again, I have made some changes that make my system very stable
> wrt memory fragmentation:

> #define SLAB_MIN_OBJS_PER_SLAB 1
> #define SLAB_BREAK_GFP_ORDER 1

The SLAB_BREAK_GFP_ORDER one is the important one on low memory
configurations.  I need to use this setting to get 2.1.110 to work at
all with NFS in low memory.

> I discussed this privately with slab maintainer Mark Hemment, who
> pointed out that with this setting slab is probably not as efficient
> as it could be. Also, slack is bigger, obviously.

Correct, but then the main user of these larger packets is networking,
where the memory is typically short-lived anyway.

> But the system is much more stable, and it is now very *very* hard to
> get that annoying "Couldn't get a free page..." message, whereas
> before (with the default setup) it was as easy as clicking a button
> in Netscape.

I can still reproduce it if I let the inode cache grow too large: it
behaves really badly and seems to lock up rather a lot of memory.
Still chasing this one; it's a killer right now.

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-22 10:40 ` Stephen C. Tweedie
@ 1998-07-23 10:06   ` Zlatko Calusic
  1998-07-23 12:22     ` Stephen C. Tweedie
  1998-07-26 14:49     ` Eric W Biederman
  0 siblings, 2 replies; 46+ messages in thread
From: Zlatko Calusic @ 1998-07-23 10:06 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Eric W. Biederman, linux-mm

"Stephen C. Tweedie" <sct@redhat.com> writes:

> Hi,
>
> On 20 Jul 1998 11:15:12 +0200, Zlatko Calusic <Zlatko.Calusic@CARNet.hr>
> said:
>
> > I don't know if it's easy, but we probably should get rid of the
> > buffer cache completely at some point. It's hard to balance things
> > between two caches, not to mention other memory objects in the kernel.
>
> No, we need the buffer cache for all sorts of things.  You'd have to
> reinvent it if you got rid of it, since it is the main mechanism by
> which we can reliably label IO for the block device driver layer, and
> we also cache non-page-aligned filesystem metadata there.

Yes, I'm aware of lots of problems that would need to be resolved in
order to get rid of the buffer cache (probably just to reinvent it, as
you said :)). But, then again, if I understand you completely, we will
always have the buffer cache as it is implemented now?!

Non-page-aligned filesystem metadata really looks like a hard problem
to solve without the buffer cache mechanism, that's out of the
question, but is there any possibility that we will introduce some
logic to use a somewhat improved page cache with buffer head
functionality (or similar) that will allow us to use the page cache in
a similar way to how we use the buffer cache now?

Even though I haven't investigated it much, I still see Eric's work on
adding dirty page functionality as a step toward this.

Disclaimer: I really don't see myself as any kind of expert in this
area. But that's one more motivation for me to try to understand
things that I don't have a grip on presently. :) I've been browsing
the Linux source actively for the last 12 months, as time permitted.
The MM area is by far the most interesting to me. But I'm still
learning.

> > Then again, I have made some changes that make my system very stable
> > wrt memory fragmentation:
>
> > #define SLAB_MIN_OBJS_PER_SLAB 1
> > #define SLAB_BREAK_GFP_ORDER 1
>
> The SLAB_BREAK_GFP_ORDER one is the important one on low memory
> configurations.  I need to use this setting to get 2.1.110 to work at
> all with NFS in low memory.
>
> > I discussed this privately with slab maintainer Mark Hemment, who
> > pointed out that with this setting slab is probably not as efficient
> > as it could be. Also, slack is bigger, obviously.
>
> Correct, but then the main user of these larger packets is networking,
> where the memory is typically short-lived anyway.

Two days ago, I rebooted unpatched 2.1.110 with mem=32m, just to find
it dead today:

I left at cca 22:00h on Jul 21.

Jul 21 22:16:43 atlas kernel: eth0: media is 100Mb/s full duplex.
Jul 21 22:34:31 atlas kernel: eth0: Insufficient memory; nuking packet.
Jul 21 22:34:44 atlas last message repeated 174 times
Jul 22 16:03:40 atlas kernel: eth0: media is TP full duplex.
Jul 22 16:03:43 atlas kernel: eth0: media is unconnected, link down or incompatible connection.
...

Being used to patching every kernel I download, I had forgotten how
unstable official kernels are. And that's not good. :( The machine's
only task, when I'm not logged in, is to transfer mail (fetchmail +
sendmail).
> > But the system is much more stable, and it is now very *very* hard
> > to get that annoying "Couldn't get a free page..." message, whereas
> > before (with the default setup) it was as easy as clicking a button
> > in Netscape.
>
> I can still reproduce it if I let the inode cache grow too large: it
> behaves really badly and seems to lock up rather a lot of memory.
> Still chasing this one; it's a killer right now.

My observations with low memory machines led me to the conclusion that
inode memory grows monotonically until it takes cca 1.5MB of
unswappable memory. That is around half of the usable memory on a 5MB
machine. You seconded that in a private mail you sent me in January.

Is there any possibility that we could use the slab allocator for
inode allocation/deallocation?

Regards,
--
Posted by Zlatko Calusic          E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
             So much time, and so little to do.

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
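What allocating inodes from a slab-style cache would buy is that every
inode comes out of pages carved into equal-sized slots, so a freed
slot can only ever be reused by another inode instead of fragmenting
the general allocation pools. A user-space model of that idea (object
size and names are illustrative; the kernel's slab in mm/slab.c is far
more elaborate):

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096
#define OBJ_SIZE  512		/* pretend struct inode is 512 bytes */
#define PER_PAGE  (PAGE_SIZE / OBJ_SIZE)

struct free_obj { struct free_obj *next; };

static struct free_obj *free_list;

/* Grab one whole page and carve it into OBJ_SIZE slots. */
static int cache_grow(void)
{
	char *page = malloc(PAGE_SIZE);	/* stands in for __get_free_page() */
	int i;

	if (!page)
		return -1;
	for (i = 0; i < PER_PAGE; i++) {
		struct free_obj *o = (struct free_obj *)(page + i * OBJ_SIZE);
		o->next = free_list;
		free_list = o;
	}
	return 0;
}

static void *cache_alloc(void)
{
	struct free_obj *o;

	if (!free_list && cache_grow() < 0)
		return NULL;
	o = free_list;
	free_list = o->next;
	return o;
}

static void cache_free(void *obj)
{
	struct free_obj *o = obj;

	o->next = free_list;	/* the slot can only hold another inode */
	free_list = o;
}

int main(void)
{
	void *a = cache_alloc(), *b = cache_alloc();

	cache_free(a);
	printf("freed slot recycled: %s\n", cache_alloc() == a ? "yes" : "no");
	cache_free(b);
	return 0;
}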
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-23 10:06 ` Zlatko Calusic
@ 1998-07-23 12:22   ` Stephen C. Tweedie
  1998-07-23 14:07     ` Zlatko Calusic
  1998-07-26 14:49   ` Eric W Biederman
  1 sibling, 1 reply; 46+ messages in thread
From: Stephen C. Tweedie @ 1998-07-23 12:22 UTC (permalink / raw)
  To: Zlatko.Calusic; +Cc: Stephen C. Tweedie, Eric W. Biederman, linux-mm

Hi,

On 23 Jul 1998 12:06:05 +0200, Zlatko Calusic
<Zlatko.Calusic@CARNet.hr> said:

> Yes, I'm aware of lots of problems that would need to be resolved in
> order to get rid of the buffer cache (probably just to reinvent it, as
> you said :)). But, then again, if I understand you completely, we will
> always have the buffer cache as it is implemented now?!

I don't see any pressing need to replace it.  Changing the
_management_ of the buffer cache, and doing things like modifying the
file write paths, are different issues which we probably should do.

Ultimately we need synchronised access to individual blocks of a block
device.  We need something which can talk directly to the block device
drivers.  Once you have that in place, with a suitable form of
buffering added, you have something that necessarily looks
sufficiently like the buffer cache that I can't see a need to get rid
of the current one.  That doesn't mean we can't improve the current
system, but improving and replacing are two very different things.

> Non-page-aligned filesystem metadata really looks like a hard problem
> to solve without the buffer cache mechanism, that's out of the
> question, but is there any possibility that we will introduce some
> logic to use a somewhat improved page cache with buffer head
> functionality (or similar) that will allow us to use the page cache
> in a similar way to how we use the buffer cache now?

We still need a way to go to the block device drivers.  As you say, we
still need the buffer_head.  We _already_ have a way of using
buffer_heads without full buffers allocated in the cache (the swapper
uses such temporary buffer_heads, for example).  We also need
mechanisms for things like loop devices and RAID.  There's a lot going
on in the buffer cache!

> Two days ago, I rebooted unpatched 2.1.110 with mem=32m, just to find
> it dead today:

> I left at cca 22:00h on Jul 21.

> Jul 21 22:16:43 atlas kernel: eth0: media is 100Mb/s full duplex.
> Jul 21 22:34:31 atlas kernel: eth0: Insufficient memory; nuking
> packet.

I've got a fix for some of the (serious) fragmentation problems in
110.  111-pre1 with the fixes is looking really, really good.  Post
with patch to follow.

> My observations with low memory machines led me to the conclusion
> that inode memory grows monotonically until it takes cca 1.5MB of
> unswappable memory. That is around half of the usable memory on a
> 5MB machine. You seconded that in a private mail you sent me in
> January.

Does this still happen?  My own tests show 110 behaving very much
better in this respect.

> Is there any possibility that we could use the slab allocator for
> inode allocation/deallocation?

Yes.  I'll have to benchmark to see how much better it gets, but (a)
110 seems to need it less anyway, and (b) it opens up a whole new pile
of synchronisation problems in fs/inode.c, which can currently make
the assumption that an inode structure can move between lists but can
never actually die if the inode spinlock is dropped.

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
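The role Stephen describes, a buffer_head as the unit of I/O the block
device layer understands, can be sketched in a few lines. This is a
user-space model only: the field names echo the kernel's struct
buffer_head, but submit_block() merely stands in for ll_rw_block(),
and the "temporary buffer_heads" trick is simplified to a stack
variable reused per block:

#include <stdio.h>

#define PAGE_SIZE  4096
#define BLOCK_SIZE 1024

/* Just enough state to label one block of I/O for a driver. */
struct buffer_head {
	int           b_dev;		/* device the block lives on */
	unsigned long b_blocknr;	/* block number on that device */
	unsigned long b_size;		/* block size in bytes */
	char         *b_data;		/* where the data sits in memory */
};

/* Stand-in for ll_rw_block(): a real driver would queue the request. */
static void submit_block(struct buffer_head *bh)
{
	printf("dev %d: block %lu (%lu bytes) at %p\n",
	       bh->b_dev, bh->b_blocknr, bh->b_size, (void *)bh->b_data);
}

/* Like the swapper's temporary buffer_heads: describe one page-cache
 * page as several block-sized I/Os without allocating cache buffers. */
static void rw_page(int dev, unsigned long first_block, char *page)
{
	struct buffer_head bh;
	unsigned long i;

	for (i = 0; i < PAGE_SIZE / BLOCK_SIZE; i++) {
		bh.b_dev = dev;
		bh.b_blocknr = first_block + i;
		bh.b_size = BLOCK_SIZE;
		bh.b_data = page + i * BLOCK_SIZE;
		submit_block(&bh);
	}
}

int main(void)
{
	static char page[PAGE_SIZE];

	rw_page(3, 1000, page);	/* one 4k page = four 1k blocks */
	return 0;
}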
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-23 12:22 ` Stephen C. Tweedie
@ 1998-07-23 14:07   ` Zlatko Calusic
  1998-07-23 17:18     ` Stephen C. Tweedie
  0 siblings, 1 reply; 46+ messages in thread
From: Zlatko Calusic @ 1998-07-23 14:07 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Eric W. Biederman, linux-mm

"Stephen C. Tweedie" <sct@redhat.com> writes:

> Hi,
>
> On 23 Jul 1998 12:06:05 +0200, Zlatko Calusic <Zlatko.Calusic@CARNet.hr>
> said:
>
> > Yes, I'm aware of lots of problems that would need to be resolved in
> > order to get rid of the buffer cache (probably just to reinvent it,
> > as you said :)). But, then again, if I understand you completely, we
> > will always have the buffer cache as it is implemented now?!
>
> I don't see any pressing need to replace it.  Changing the
> _management_ of the buffer cache, and doing things like modifying the
> file write paths, are different issues which we probably should do.
>
> Ultimately we need synchronised access to individual blocks of a block
> device.  We need something which can talk directly to the block device
> drivers.  Once you have that in place, with a suitable form of
> buffering added, you have something that necessarily looks sufficiently
> like the buffer cache that I can't see a need to get rid of the current
> one.  That doesn't mean we can't improve the current system, but
> improving and replacing are two very different things.

OK, I understand. I needed to hear the opinion of someone who *really*
knows how things work. One of the things that influenced me is the
text at:

http://www.caip.rutgers.edu/~davem/vfsmm.html

but I can't (and won't) pretend that I understand everything mentioned
there. :)

Strangely enough, I think I never explained why *I* think integrating
buffer cache functionality into the page cache would be a good thing.
Since the two caches are very different, I'm not sure memory
management can be fair enough in some cases.

Take a simple example: two I/O-bound applications, where one is
accessing a raw partition (e.g. fsck) and the other uses the
filesystem (web, ftp...). The question is, how do I know that the MM
is fair? Maybe the page cache grows too large at the expense of the
buffer cache, so fsck runs much slower than it could. Or if the buffer
cache grows faster (which is not the case, IMO) then the web server
would be fast, but fsck (or some database accessing a raw partition)
could take a penalty.

Integrating both caches could help in these cases, which are not
uncommon (isn't Linux a beautiful multitasker? :)). All this is a
consequence of the buffer cache buffering raw blocks (including FS
metadata), and the page cache buffering FS data.

BUT! If you say the buffer cache won't go, then I believe you, just to
make that clear. :) And thanks for the explanation. I hope my bad
English doesn't give you too much trouble understanding.

> > Non-page-aligned filesystem metadata really looks like a hard
> > problem to solve without the buffer cache mechanism, that's out of
> > the question, but is there any possibility that we will introduce
> > some logic to use a somewhat improved page cache with buffer head
> > functionality (or similar) that will allow us to use the page cache
> > in a similar way to how we use the buffer cache now?
>
> We still need a way to go to the block device drivers.  As you say, we
> still need the buffer_head.  We _already_ have a way of using
> buffer_heads without full buffers allocated in the cache (the swapper
> uses such temporary buffer_heads, for example).
> We also need mechanisms for things like loop devices and RAID.
> There's a lot going on in the buffer cache!

No doubt! I never tried to underestimate the buffer cache's complexity
and functionality. :)

> > Two days ago, I rebooted unpatched 2.1.110 with mem=32m, just to
> > find it dead today:
>
> > I left at cca 22:00h on Jul 21.
>
> > Jul 21 22:16:43 atlas kernel: eth0: media is 100Mb/s full duplex.
> > Jul 21 22:34:31 atlas kernel: eth0: Insufficient memory; nuking
> > packet.
>
> I've got a fix for some of the (serious) fragmentation problems in
> 110.  111-pre1 with the fixes is looking really, really good.  Post
> with patch to follow.

Nice, I will test it right away.

> > My observations with low memory machines led me to the conclusion
> > that inode memory grows monotonically until it takes cca 1.5MB of
> > unswappable memory. That is around half of the usable memory on a
> > 5MB machine. You seconded that in a private mail you sent me in
> > January.
>
> Does this still happen?  My own tests show 110 behaving very much
> better in this respect.

Huh, I owe an apology here. My tests on the lowmem machine took place
around New Year. I have that 386DX/40 with 5MB at home, but I'm rarely
home. :) So everything I said reflects the situation from 7 months
ago. I haven't done any tests in the meantime. Here at work, Linux is
installed on more appropriate hardware. :)

> > Is there any possibility that we could use the slab allocator for
> > inode allocation/deallocation?
>
> Yes.  I'll have to benchmark to see how much better it gets, but (a)
> 110 seems to need it less anyway, and (b) it opens up a whole new pile
> of synchronisation problems in fs/inode.c, which can currently make
> the assumption that an inode structure can move between lists but can
> never actually die if the inode spinlock is dropped.

Wish you luck!

Regards,
--
Posted by Zlatko Calusic          E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
                    Don't mess with Murphy.

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-23 14:07 ` Zlatko Calusic
@ 1998-07-23 17:18   ` Stephen C. Tweedie
  1998-07-23 19:33     ` Zlatko Calusic
  0 siblings, 1 reply; 46+ messages in thread
From: Stephen C. Tweedie @ 1998-07-23 17:18 UTC (permalink / raw)
  To: Zlatko.Calusic; +Cc: Stephen C. Tweedie, Eric W. Biederman, linux-mm

Hi,

On 23 Jul 1998 16:07:23 +0200, Zlatko Calusic
<Zlatko.Calusic@CARNet.hr> said:

> Strangely enough, I think I never explained why *I* think integrating
> buffer cache functionality into the page cache would be a good thing.
> Since the two caches are very different, I'm not sure memory
> management can be fair enough in some cases.

> Take a simple example: two I/O-bound applications, where one is
> accessing a raw partition (e.g. fsck) and the other uses the
> filesystem (web, ftp...). The question is, how do I know that the MM
> is fair? Maybe the page cache grows too large at the expense of the
> buffer cache, so fsck runs much slower than it could. Or if the
> buffer cache grows faster (which is not the case, IMO) then the web
> server would be fast, but fsck (or some database accessing a raw
> partition) could take a penalty.

There's a single loop in shrink_mmap() which treats both buffer-cache
pages and page-cache pages identically.  It just propagates the buffer
referenced bits into the page's PG_referenced bit before doing any
ageing on the page.  It should be fair enough.  There are other issues
concerning things like locked and dirty buffers which complicate the
issue, but they are not sufficient reason to throw away the buffer
cache!

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
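A toy version of the loop Stephen describes, assuming a page becomes
reclaimable once its age reaches zero (the real shrink_mmap in
mm/filemap.c handles locking, mapping counts and much more; the
constants here are made up):

#include <stdio.h>

#define PAGES        8
#define PG_AGE_START 3

struct page {
	int age;		/* decays on each shrink_mmap pass */
	int pg_referenced;	/* PG_referenced bit */
	int buf_referenced;	/* referenced bit of an attached buffer */
	int has_buffers;	/* buffer-cache page? */
};

/* One pass: fold buffer referenced bits into PG_referenced, then age.
 * Referenced pages get their age topped up; untouched pages decay.
 * Buffer-cache and page-cache pages go through the very same test. */
static void shrink_mmap_pass(struct page *pages, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		struct page *p = &pages[i];

		if (p->has_buffers && p->buf_referenced) {
			p->pg_referenced = 1;
			p->buf_referenced = 0;
		}
		if (p->pg_referenced) {
			p->age = PG_AGE_START;
			p->pg_referenced = 0;
		} else if (p->age > 0) {
			p->age--;
			if (p->age == 0)
				printf("page %d now reclaimable\n", i);
		}
	}
}

int main(void)
{
	struct page pages[PAGES] = { { 0 } };
	int i, pass;

	for (i = 0; i < PAGES; i++)
		pages[i] = (struct page){ .age = PG_AGE_START,
					  .has_buffers = (i % 2) };
	pages[1].buf_referenced = 1;	/* touched via the buffer cache */

	for (pass = 0; pass < 4; pass++)
		shrink_mmap_pass(pages, PAGES);
	return 0;
}

Running it shows the buffer-referenced page surviving one pass longer
than the untouched ones, which is the fairness property in question.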
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-23 17:18 ` Stephen C. Tweedie
@ 1998-07-23 19:33   ` Zlatko Calusic
  1998-07-27 10:57     ` Stephen C. Tweedie
  0 siblings, 1 reply; 46+ messages in thread
From: Zlatko Calusic @ 1998-07-23 19:33 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Eric W. Biederman, werner, linux-mm

"Stephen C. Tweedie" <sct@redhat.com> writes:

> Hi,
>
> On 23 Jul 1998 16:07:23 +0200, Zlatko Calusic <Zlatko.Calusic@CARNet.hr>
> said:
>
> > Strangely enough, I think I never explained why *I* think integrating
> > buffer cache functionality into the page cache would be a good thing.
> > Since the two caches are very different, I'm not sure memory
> > management can be fair enough in some cases.
>
> > Take a simple example: two I/O-bound applications, where one is
> > accessing a raw partition (e.g. fsck) and the other uses the
> > filesystem (web, ftp...). The question is, how do I know that the MM
> > is fair? Maybe the page cache grows too large at the expense of the
> > buffer cache, so fsck runs much slower than it could. Or if the
> > buffer cache grows faster (which is not the case, IMO) then the web
> > server would be fast, but fsck (or some database accessing a raw
> > partition) could take a penalty.
>
> There's a single loop in shrink_mmap() which treats both buffer-cache
> pages and page-cache pages identically.  It just propagates the buffer
> referenced bits into the page's PG_referenced bit before doing any
> ageing on the page.  It should be fair enough.  There are other issues
> concerning things like locked and dirty buffers which complicate the
> issue, but they are not sufficient reason to throw away the buffer
> cache!

Hm, I know how shrink_mmap works, but I never looked at it that way.
My eyes are wide open now. It seems none of my reasons are valid, so I
will forget about my ideas for a while. :)

In the meantime, I applied the same benchmark I was already running to
a kernel with Werner's lowmem patch applied, and the results are
interesting. Performance is very similar to that with my change, but
there are some differences. With Werner's patch, kernel behaviour is
slightly less aggressive still:

 procs                  memory    swap        io    system         cpu
 r b w  swpd  free  buff cache  si  so   bi  bo   in   cs  us  sy  id
 0 0 0     0  6492  4548 23100   0   0  179  16  219  157  23   9  68
 0 0 0     0  6492  4548 23100   0   0    0   2  108    9   0   0 100
 1 0 0    84  1384  1964 31168   0   8 6051   3  229  222   1  24  74
 1 0 0   128  1200  1964 31404   0   4 6630   3  238  237   1  25  75
 1 0 0   476  1024  1964 31928   0  35 6802   9  240  241   1  26  73
 1 0 0  1764  1316  1964 32932   0 129 6522  33  240  233   1  23  76
 1 0 0  2584  1172  1964 33896   0  82 6392  21  237  227   1  23  76
 1 0 0  3384  1284  1964 34584   0  80 6330  21  234  224   1  24  75
 1 0 0  4100  1232  1964 35352   0  72 6365  19  234  228   0  23  76
 1 0 0  4164  1432  1964 35236   0   6 6176   2  229  223   1  24  75
 1 0 0  4220  1136  1964 35580   0   6 7331   2  250  258   2  27  71
 1 0 0  4892  1284  1964 36096   0  67 7417  18  255  261   2  28  70
 1 0 0  4940  1532  1964 35896   0   5 7460   2  252  258   1  28  71
 1 0 0  4980  1540  1964 35932   0   4 7307   2  251  256   2  27  72
 0 0 0  4996  1536  1964 35984   0   2 1496   2  140   66   0   5  95
 0 0 0  4996  1536  1964 35984   0   0    0   1  102    7   0   0 100

So whichever solution finds its way into the official kernel will make
me happy. :)

Thank you for your thoughts and opinions!
Wish you a nice weekend (at that wedding, is it yours?) :)
--
Posted by Zlatko Calusic          E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
     Crime doesn't pay... does that mean my job is a crime?

--
This is a majordomo managed list.
To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-23 19:33 ` Zlatko Calusic
@ 1998-07-27 10:57   ` Stephen C. Tweedie
  0 siblings, 0 replies; 46+ messages in thread
From: Stephen C. Tweedie @ 1998-07-27 10:57 UTC (permalink / raw)
  To: Zlatko.Calusic; +Cc: Stephen C. Tweedie, Eric W. Biederman, werner, linux-mm

> In the meantime, I applied the same benchmark I was already running
> to a kernel with Werner's lowmem patch applied, and the results are
> interesting. Performance is very similar to that with my change, but
> there are some differences. With Werner's patch, kernel behaviour is
> slightly less aggressive still:

OK, time to look at a bigger set of benchmarks for this.  If it helps
this case, it needs to be considered for 2.2.

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-23 10:06 ` Zlatko Calusic
  1998-07-23 12:22   ` Stephen C. Tweedie
@ 1998-07-26 14:49   ` Eric W Biederman
  1998-07-27 11:02     ` Stephen C. Tweedie
  1 sibling, 1 reply; 46+ messages in thread
From: Eric W Biederman @ 1998-07-26 14:49 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Stephen C. Tweedie, linux-mm

On 23 Jul 1998, Zlatko Calusic wrote:

> "Stephen C. Tweedie" <sct@redhat.com> writes:
>
> > Hi,
> >
> > On 20 Jul 1998 11:15:12 +0200, Zlatko Calusic <Zlatko.Calusic@CARNet.hr>
> > said:
> >
> > > I don't know if it's easy, but we probably should get rid of the
> > > buffer cache completely at some point. It's hard to balance things
> > > between two caches, not to mention other memory objects in the
> > > kernel.
> >
> > No, we need the buffer cache for all sorts of things.  You'd have to
> > reinvent it if you got rid of it, since it is the main mechanism by
> > which we can reliably label IO for the block device driver layer, and
> > we also cache non-page-aligned filesystem metadata there.
>
> Even though I haven't investigated it much, I still see Eric's work on
> adding dirty page functionality as a step toward this.

From where I sit it looks completely possible to give the buffer cache
a fake inode, and have it use the same mechanisms that I have
developed for handling other dirty data in the page cache.  It should
also be possible in this effort to simplify the buffer_head structure.

As time permits I'll move in that direction.

Eric
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
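Eric's fake-inode trick is essentially a keying convention: the page
cache looks pages up by (inode, offset), so giving the block device
one pseudo inode lets device blocks live in the same hash as file
pages. A minimal model of such a lookup (the hash shape and all names
are invented for illustration):

#include <stdio.h>

#define HASH_SIZE 64

struct inode { long i_ino; };

struct page {
	struct inode *inode;	/* which object the page belongs to */
	unsigned long offset;	/* byte offset within that object */
	struct page  *next_hash;
};

static struct page *page_hash[HASH_SIZE];

/* The buffer cache gets one fake inode; its "offset" is then just
 * the byte offset on the block device. */
static struct inode blkdev_inode = { -1 };

static unsigned int hashfn(struct inode *inode, unsigned long offset)
{
	return ((unsigned long)inode->i_ino ^ (offset >> 12)) % HASH_SIZE;
}

static void add_page(struct page *p)
{
	unsigned int h = hashfn(p->inode, p->offset);

	p->next_hash = page_hash[h];
	page_hash[h] = p;
}

static struct page *find_page(struct inode *inode, unsigned long offset)
{
	struct page *p = page_hash[hashfn(inode, offset)];

	for (; p; p = p->next_hash)
		if (p->inode == inode && p->offset == offset)
			return p;
	return NULL;
}

int main(void)
{
	struct inode file = { 42 };
	struct page fp = { &file, 8192 }, dp = { &blkdev_inode, 8192 };

	add_page(&fp);
	add_page(&dp);	/* device data and file data share one cache */
	printf("file page: %p, device page: %p\n",
	       (void *)find_page(&file, 8192),
	       (void *)find_page(&blkdev_inode, 8192));
	return 0;
}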
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-26 14:49 ` Eric W Biederman
@ 1998-07-27 11:02   ` Stephen C. Tweedie
  1998-08-02  5:19     ` Eric W Biederman
  0 siblings, 1 reply; 46+ messages in thread
From: Stephen C. Tweedie @ 1998-07-27 11:02 UTC (permalink / raw)
  To: ebiederm+eric; +Cc: Zlatko Calusic, Stephen C. Tweedie, linux-mm

Hi,

On Sun, 26 Jul 1998 09:49:02 -0500 (CDT), Eric W Biederman
<eric@flinx.npwt.net> said:

> From where I sit it looks completely possible to give the buffer cache
> a fake inode, and have it use the same mechanisms that I have
> developed for handling other dirty data in the page cache.  It should
> also be possible in this effort to simplify the buffer_head structure.

> As time permits I'll move in that direction.

You'd still have to persuade people that it's a good idea.  I'm not
convinced.

The reason for having things in the page cache is for fast lookup.
For this to make sense for the buffer cache, you'd have to align the
buffer cache on page boundaries, but buffers on disk are not naturally
aligned this way.  You'd end up wasting a lot of space as perhaps only
a few of the buffers in any page were useful, and you'd also have to
keep track of which buffers within the page were valid/dirty.

We *need* a mechanism which is block-aligned, not page-aligned.  The
buffer cache is a good way of doing it.  Forcing block device caching
into a page-aligned cache is not necessarily going to simplify things.

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-27 11:02 ` Stephen C. Tweedie
@ 1998-08-02  5:19   ` Eric W Biederman
  1998-08-17 13:57     ` Stephen C. Tweedie
  1998-08-17 15:35     ` Stephen C. Tweedie
  0 siblings, 2 replies; 46+ messages in thread
From: Eric W Biederman @ 1998-08-02 5:19 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Zlatko Calusic, linux-mm

On Mon, 27 Jul 1998, Stephen C. Tweedie wrote:

> Hi,
>
> On Sun, 26 Jul 1998 09:49:02 -0500 (CDT), Eric W Biederman
> <eric@flinx.npwt.net> said:
>
> > From where I sit it looks completely possible to give the buffer
> > cache a fake inode, and have it use the same mechanisms that I have
> > developed for handling other dirty data in the page cache.  It
> > should also be possible in this effort to simplify the buffer_head
> > structure.
>
> > As time permits I'll move in that direction.
>
> You'd still have to persuade people that it's a good idea.  I'm not
> convinced.
>
> The reason for having things in the page cache is for fast lookup.
> For this to make sense for the buffer cache, you'd have to align the
> buffer cache on page boundaries, but buffers on disk are not naturally
> aligned this way.  You'd end up wasting a lot of space as perhaps only
> a few of the buffers in any page were useful, and you'd also have to
> keep track of which buffers within the page were valid/dirty.

That wasn't actually how I was envisioning it.  Though it is a
possibility I have kicked around.  For direct device I/O and mmapping
of devices it is exactly how we should do it.

What I was envisioning is using a single write-out daemon instead of 2
(one for buffer cache, one for page cache).  Using the same tests in
shrink_mmap.  Reducing the size of a buffer_head by a lot because
consolidating the two would reduce the number of lists needed.  To sit
the buffer cache upon a single pseudo inode, and keep its current
hashing scheme.

In general allowing the management to be consolidated between the two,
but nothing more.

At this point it is not a major point, but the buffer cache is quite
likely to shrink into something barely noticeable, assuming regular
files will buffer their writes themselves in the page cache,
preventing double buffering.  When the buffer cache becomes a shrunken
appendage then we will know what we really need it for, and how much
of a performance hit we will take, and we can worry about it then.

> We *need* a mechanism which is block-aligned, not page-aligned.  The
> buffer cache is a good way of doing it.  Forcing block device caching
> into a page-aligned cache is not necessarily going to simplify things.

The page-aligned property is only a matter of the (inode, offset) hash
table, and virtually nothing else really cares.  shrink_mmap and
pgflush, the most universal parts of the page cache, do not.

Eric
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
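Eric's single write-out daemon reduces, at its core, to walking one
LRU of dirty pages and flushing whatever has been dirty longer than
some threshold; that is exactly what the page writetime and dirty-list
pointers he mentioned earlier would support. A user-space model, with
the flush and the clock faked and all names illustrative:

#include <stdio.h>

struct page {
	unsigned long writetime;	/* when the page was first dirtied */
	struct page  *prev, *next;	/* LRU list of dirty pages */
};

static struct page *dirty_head, *dirty_tail;	/* oldest at head */

static void mark_dirty(struct page *p, unsigned long now)
{
	p->writetime = now;
	p->next = NULL;
	p->prev = dirty_tail;
	if (dirty_tail)
		dirty_tail->next = p;
	else
		dirty_head = p;
	dirty_tail = p;
}

/* One pgflush/bdflush-style pass: pages sit in dirtying order, so we
 * can stop at the first page that is still too young. */
static void sync_old_pages(unsigned long now, unsigned long max_age)
{
	while (dirty_head && now - dirty_head->writetime >= max_age) {
		struct page *p = dirty_head;

		dirty_head = p->next;
		if (dirty_head)
			dirty_head->prev = NULL;
		else
			dirty_tail = NULL;
		printf("flushing page dirtied at t=%lu\n", p->writetime);
	}
}

int main(void)
{
	struct page a, b, c;

	mark_dirty(&a, 0);
	mark_dirty(&b, 3);
	mark_dirty(&c, 9);
	sync_old_pages(10, 5);	/* flushes a and b, leaves c dirty */
	return 0;
}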
* Re: More info: 2.1.108 page cache performance on low memory
  1998-08-02  5:19 ` Eric W Biederman
@ 1998-08-17 13:57   ` Stephen C. Tweedie
  1998-08-17 15:35   ` Stephen C. Tweedie
  1 sibling, 0 replies; 46+ messages in thread
From: Stephen C. Tweedie @ 1998-08-17 13:57 UTC (permalink / raw)
  To: ebiederm+eric; +Cc: Stephen C. Tweedie, Zlatko Calusic, linux-mm

Hi,

Sorry, I'm just back from 2 weeks on holiday.

On Sun, 2 Aug 1998 00:19:52 -0500 (CDT), Eric W Biederman
<eric@flinx.npwt.net> said:

>> We *need* a mechanism which is block-aligned, not page-aligned.  The
>> buffer cache is a good way of doing it.  Forcing block device caching
>> into a page-aligned cache is not necessarily going to simplify things.

> The page-aligned property is only a matter of the (inode, offset) hash
> table, and virtually nothing else really cares.  shrink_mmap and
> pgflush, the most universal parts of the page cache, do not.

Any mmap()able files *need* to be page aligned in cache.  Internal
filesystem accesses are always block aligned, not page aligned.
That's the conflict.

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory
  1998-08-02  5:19 ` Eric W Biederman
  1998-08-17 13:57   ` Stephen C. Tweedie
@ 1998-08-17 15:35   ` Stephen C. Tweedie
  1998-08-20 12:40     ` Eric W. Biederman
  1 sibling, 1 reply; 46+ messages in thread
From: Stephen C. Tweedie @ 1998-08-17 15:35 UTC (permalink / raw)
  To: ebiederm+eric; +Cc: Stephen C. Tweedie, Zlatko Calusic, linux-mm

Hi,

On Sun, 2 Aug 1998 00:19:52 -0500 (CDT), Eric W Biederman
<eric@flinx.npwt.net> said:

> What I was envisioning is using a single write-out daemon instead of 2
> (one for buffer cache, one for page cache).  Using the same tests in
> shrink_mmap.  Reducing the size of a buffer_head by a lot because
> consolidating the two would reduce the number of lists needed.  To sit
> the buffer cache upon a single pseudo inode, and keep its current
> hashing scheme.

The only reason we currently have two daemons is that we need one for
writing dirty memory and another for reclaiming clean memory.  That
way, even when we stall for disk writes, we are still able to reclaim
free memory via shrink_mmap().  The kswapd daemon and the
shrink_mmap() code already treat the page cache and the buffer cache
the same.

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory
  1998-08-17 15:35 ` Stephen C. Tweedie
@ 1998-08-20 12:40   ` Eric W. Biederman
  0 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 1998-08-20 12:40 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Zlatko Calusic, linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@redhat.com> writes:

ST> Hi,
ST> On Sun, 2 Aug 1998 00:19:52 -0500 (CDT), Eric W Biederman
ST> <eric@flinx.npwt.net> said:

>> What I was envisioning is using a single write-out daemon instead of 2
>> (one for buffer cache, one for page cache).  Using the same tests in
>> shrink_mmap.  Reducing the size of a buffer_head by a lot because
>> consolidating the two would reduce the number of lists needed.  To sit
>> the buffer cache upon a single pseudo inode, and keep its current
>> hashing scheme.

ST> The only reason we currently have two daemons

But I have 3.
One for writing dirty data in the buffer cache.  bdflush
One for writing dirty data in the page cache.  pgflush
One for reclaiming clean memory  kswapd

I would like to merge bdflush and pgflush in the long run if I can.
Since pgflush is more generic than bdflush it should be doable.  This
happens to give a degree of page cache and buffer cache unification as
a side effect of setting up the buffer cache to use pgflush.

ST> is that we need one for writing dirty memory and another for
ST> reclaiming clean memory.  That way, even when we stall for disk
ST> writes, we are still able to reclaim free memory via shrink_mmap().
ST> The kswapd daemon and the shrink_mmap() code already treat the page
ST> cache and the buffer cache the same.

I was talking about integrating my ``dirty data in the page cache''
code with the rest of the kernel.  Hopefully for early 2.3.

My apologies for being so unclear that you totally missed what I was
talking about.

Eric
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-18 16:40 ` Eric W. Biederman
  1998-07-20  9:15   ` Zlatko Calusic
@ 1998-07-20 15:58   ` Stephen C. Tweedie
  1998-07-22 10:36   ` Stephen C. Tweedie
  2 siblings, 0 replies; 46+ messages in thread
From: Stephen C. Tweedie @ 1998-07-20 15:58 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Zlatko.Calusic, Stephen C. Tweedie, linux-mm

Hi,

On 18 Jul 1998 11:40:20 -0500, ebiederm+eric@npwt.net (Eric W.
Biederman) said:

> >>>>> "ZC" == Zlatko Calusic <Zlatko.Calusic@CARNet.hr> writes:

> Let me just step back a second so I can be clear:

> A) The idea proposed by Stephen was that perhaps we could use Least
> Recently Used lists instead of page aging.  It's effectively the same
> thing but shrink_mmap can find the old pages much much faster, by
> simply following a linked list.

> B) This idea intrigues me because with the handling of generic dirty
> pages I have about the same problem.  In cloning bdflush for the page
> cache I discovered two fields I would need to add to struct page to do
> an exact cloning job.  A page writetime, and LRU list pointers for
> dirty pages.  I went ahead and implemented them, but also implemented
> an alternative, which is the default.

We already have all of the inode's pages on a linked list.  Extending
that to have two separate lists, one for clean pages and one for
dirty, would be cheap and would not have the extra memory overhead.

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
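Stephen's suggestion amounts to splitting the inode's single page list
in two and moving a page between the halves when its dirty state
changes, reusing the linkage the page already has. A user-space sketch
under that assumption (the field names clean_pages/dirty_pages are
hypothetical, not the kernel's):

#include <stdio.h>

struct page {
	int          dirty;
	struct page *next;	/* linkage on one of the inode's lists */
};

struct inode {
	struct page *clean_pages;	/* hypothetical split of i_pages */
	struct page *dirty_pages;
};

static void list_push(struct page **list, struct page *p)
{
	p->next = *list;
	*list = p;
}

static struct page *list_pop(struct page **list)
{
	struct page *p = *list;

	if (p)
		*list = p->next;
	return p;
}

/* Dirtying a page just moves it to the other list, so a writeback
 * daemon never has to scan clean pages at all.  (A real version
 * would unlink the page from wherever it currently sits; here we
 * assume it was just popped off the clean list.) */
static void set_page_dirty(struct inode *inode, struct page *p)
{
	p->dirty = 1;
	list_push(&inode->dirty_pages, p);
}

int main(void)
{
	struct inode ino = { NULL, NULL };
	struct page p1 = { 0, NULL }, p2 = { 0, NULL };

	list_push(&ino.clean_pages, &p1);
	list_push(&ino.clean_pages, &p2);
	set_page_dirty(&ino, list_pop(&ino.clean_pages));
	printf("dirty head dirty=%d, clean head dirty=%d\n",
	       ino.dirty_pages->dirty, ino.clean_pages->dirty);
	return 0;
}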
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-18 16:40 ` Eric W. Biederman 1998-07-20 9:15 ` Zlatko Calusic 1998-07-20 15:58 ` Stephen C. Tweedie @ 1998-07-22 10:36 ` Stephen C. Tweedie 1998-07-22 18:01 ` Rik van Riel 2 siblings, 1 reply; 46+ messages in thread From: Stephen C. Tweedie @ 1998-07-22 10:36 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Zlatko.Calusic, Stephen C. Tweedie, linux-mm Hi, On 18 Jul 1998 11:40:20 -0500, ebiederm+eric@npwt.net (Eric W. Biederman) said: > Agreed. We should look very carefully though to see if any aging > solution increases fragmentation. According to Stephen the current > one does, and this may be a natural result of aging and not just a > single implementation :( No no no! The current VM has two separate but related problems. First is that it keeps too much cache in low memory configurations, and that appears to be much much better in 2.1.109 and 110. Second is the fragmentation issue, but that's a lot harder to address I'm afraid. I have a zoned allocator now working which does help enormously: it's the first time my VM-test 2.1 configuration has _ever_ been able to run successfully with 8k NFS. However, the zoned allocation can use memory less efficiently: the odd free pages in the paged zone cannot be used by non-paged users and vice versa, so overall performance may suffer. Right now I'm cleaning the code up for a release against 2.1.110 so that we can start testing. --Stephen -- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-22 10:36 ` Stephen C. Tweedie @ 1998-07-22 18:01 ` Rik van Riel 1998-07-23 10:59 ` Stephen C. Tweedie 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 1998-07-22 18:01 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Eric W. Biederman, Zlatko.Calusic, linux-mm On Wed, 22 Jul 1998, Stephen C. Tweedie wrote: > successfully with 8k NFS. However, the zoned allocation can use memory > less efficiently: the odd free pages in the paged zone cannot be used by > non-paged users and vice versa, so overall performance may suffer. > Right now I'm cleaning the code up for a release against 2.1.110 so > that we can start testing. Hmm, I'm curious as to what categories your allocator divides memory users in. Is it just plain swappable vs. non-swappable or is it fragmentation-causing vs. fragmentation sensitive or something entirely different? Btw, I'm working on version 2 of my zone allocator design right now. Maybe we want the complex but complete version for 2.3... Rik. +-------------------------------------------------------------------+ | Linux memory management tour guide. H.H.vanRiel@phys.uu.nl | | Scouting Vries cubscout leader. http://www.phys.uu.nl/~riel/ | +-------------------------------------------------------------------+ -- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-22 18:01 ` Rik van Riel @ 1998-07-23 10:59 ` Stephen C. Tweedie 0 siblings, 0 replies; 46+ messages in thread From: Stephen C. Tweedie @ 1998-07-23 10:59 UTC (permalink / raw) To: Rik van Riel Cc: Stephen C. Tweedie, Eric W. Biederman, Zlatko.Calusic, linux-mm Hi, On Wed, 22 Jul 1998 20:01:51 +0200 (CEST), Rik van Riel <H.H.vanRiel@phys.uu.nl> said: > On Wed, 22 Jul 1998, Stephen C. Tweedie wrote: >> successfully with 8k NFS. However, the zoned allocation can use memory >> less efficiently: the odd free pages in the paged zone cannot be used by >> non-paged users and vice versa, so overall performance may suffer. >> Right now I'm cleaning the code up for a release against 2.1.110 so >> that we can start testing. > Hmm, I'm curious as to what categories your allocator > divides memory users in. Is it just plain swappable > vs. non-swappable Yes, and so far it seems to work pretty well. > or is it fragmentation-causing vs. fragmentation sensitive or > something entirely different? As long as there are enough higher-order free pages to go around, the fragmentation distinction is not so important. The problem of course is that the more different zone types we have, the less efficiently we can use memory, so I really just want a minimal solution which does something about fragmentation for non-swappable allocations. --Stephen -- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
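The swappable/non-swappable split Stephen describes can be modelled as
two pools drawn from one budget, with every allocation tagged by zone;
the inefficiency he mentions shows up as free pages stranded in the
wrong pool. A toy version (the policy and names are invented, not his
patch):

#include <stdio.h>

enum zone { ZONE_SWAPPABLE, ZONE_FIXED, NR_ZONES };

#define PAGES_PER_ZONE 4

static int zone_free[NR_ZONES] = { PAGES_PER_ZONE, PAGES_PER_ZONE };

/* Allocate from the requested zone only: a free page in the paged
 * zone is of no use to a non-paged caller, which is exactly the
 * efficiency cost mentioned above. */
static int alloc_page_zone(enum zone z)
{
	if (zone_free[z] == 0)
		return -1;	/* would have succeeded in a unified pool */
	zone_free[z]--;
	return 0;
}

int main(void)
{
	int i;

	for (i = 0; i < PAGES_PER_ZONE; i++)
		alloc_page_zone(ZONE_FIXED);	/* e.g. network buffers */

	printf("next fixed alloc: %s, swappable pages still free: %d\n",
	       alloc_page_zone(ZONE_FIXED) ? "fails" : "ok",
	       zone_free[ZONE_SWAPPABLE]);
	return 0;
}

The upside, per Stephen's report, is that unswappable allocations stop
fragmenting the memory that higher-order users such as 8k NFS need.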
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-18 13:28 ` Zlatko Calusic
  1998-07-18 16:40   ` Eric W. Biederman
@ 1998-07-22 10:33   ` Stephen C. Tweedie
  1998-07-23 10:59     ` Zlatko Calusic
  1 sibling, 1 reply; 46+ messages in thread
From: Stephen C. Tweedie @ 1998-07-22 10:33 UTC (permalink / raw)
  To: Zlatko.Calusic; +Cc: Eric W. Biederman, Stephen C. Tweedie, linux-mm

Hi,

On 18 Jul 1998 15:28:17 +0200, Zlatko Calusic
<Zlatko.Calusic@CARNet.hr> said:

> I must admit, after all the criticism I have aimed at page aging,
> that I believe it's the right way to go, but it should be done
> properly. Performance should be better, not worse.

Let me say one thing clearly: I'm not against page ageing (I
implemented it in the first place for the swapper), I'm against the
bad tuning it introduced.  *IF* we can fix that, then keep the ageing,
sure.  However, we need to fix it _completely_.  The non-cache-ageing
scheme at least has the advantage that we understand its behaviour, so
fiddling too much this close to 2.2 is not necessarily a good idea.
2.1.110, for example, now fails to boot for me in low memory
configurations because it cannot keep enough higher order pages free
for 4k NFS to work, never mind 8k.

That's the danger: we need to introduce new schemes like this at the
beginning of the development cycle for a new kernel, not the end.

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-22 10:33 ` Stephen C. Tweedie
@ 1998-07-23 10:59   ` Zlatko Calusic
  1998-07-23 12:23     ` Stephen C. Tweedie
  ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Zlatko Calusic @ 1998-07-23 10:59 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Eric W. Biederman, linux-mm

"Stephen C. Tweedie" <sct@redhat.com> writes:

> Hi,
>
> On 18 Jul 1998 15:28:17 +0200, Zlatko Calusic <Zlatko.Calusic@CARNet.hr>
> said:
>
> > I must admit, after all the criticism I have aimed at page aging,
> > that I believe it's the right way to go, but it should be done
> > properly. Performance should be better, not worse.
>
> Let me say one thing clearly: I'm not against page ageing (I
> implemented it in the first place for the swapper), I'm against the
> bad tuning it introduced.  *IF* we can fix that, then keep the ageing,
> sure.  However, we need to fix it _completely_.  The non-cache-ageing
> scheme at least has the advantage that we understand its behaviour, so
> fiddling too much this close to 2.2 is not necessarily a good idea.
> 2.1.110, for example, now fails to boot for me in low memory
> configurations because it cannot keep enough higher order pages free
> for 4k NFS to work, never mind 8k.
>
> That's the danger: we need to introduce new schemes like this at the
> beginning of the development cycle for a new kernel, not the end.

Cool! Then we agree on all topics. :)

As promised, I did some testing and I may have a solution (big words,
yeah! :)).

As I see it, the page cache seems too persistent (it grows out of
bounds) when we age pages in it.

One wrong way of fixing it is to limit the page cache size, IMNSHO.

I tried the other way, to age the page cache harder, and it looks like
it works very well. The patch is simple, so simple that I can't
understand why nobody has suggested (something like) it yet.

--- filemap.c.virgin	Tue Jul 21 18:41:30 1998
+++ filemap.c	Thu Jul 23 12:14:43 1998
@@ -171,6 +171,11 @@
 			touch_page(page);
 			break;
 		}
+		/* Age named pages aggressively, so page cache
+		 * doesn't grow too fast. -zcalusic
+		 */
+		age_page(page);
+		age_page(page);
 		age_page(page);
 		if (page->age)
 			break;

After lots of testing, I am quite pleased with the performance with
that small change. Where, using the official kernel, copying a few
hundred MB of data to /dev/null would swap out cca 20MB (and keep
swapping constantly, thus killing performance), now it swaps out only
5MB, probably exactly those pages that are not needed anyway. And that
is something that I like about aging.

I can provide thorough benchmark data, if needed.

If I put only two age_page()s, there's still too much swapping for my
taste. With three age_page()s, read performance is as expected, and we
still manage memory more efficiently than without page aging.

The patch applies cleanly to 2.1.110.

Comments?

Regards,
--
Posted by Zlatko Calusic          E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
       Don't steal - the government hates competition...

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
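The arithmetic behind the patch is easy to see in isolation: touching
a page raises its age, each age_page() call lowers it, and shrink_mmap
only reclaims at age zero. The constants below are assumptions made
for illustration, not the real values from that era's
include/linux/swap.h:

#include <stdio.h>

/* Illustrative constants in the spirit of 2.1's ageing code. */
#define PAGE_INITIAL_AGE 3
#define PAGE_AGE_DECL    1

struct page { int age; };

static void touch_page(struct page *p) { p->age = PAGE_INITIAL_AGE; }

static void age_page(struct page *p)
{
	if (p->age > PAGE_AGE_DECL)
		p->age -= PAGE_AGE_DECL;
	else
		p->age = 0;
}

int main(void)
{
	struct page p;
	int passes;

	/* stock kernel: one age_page() per shrink_mmap pass */
	touch_page(&p);
	for (passes = 0; p.age; passes++)
		age_page(&p);
	printf("1 x age_page: reclaimable after %d passes\n", passes);

	/* with the patch: three age_page() calls per pass */
	touch_page(&p);
	for (passes = 0; p.age; passes++) {
		age_page(&p);
		age_page(&p);
		age_page(&p);
	}
	printf("3 x age_page: reclaimable after %d passes\n", passes);
	return 0;
}

With these numbers an untouched page becomes reclaimable after one
pass instead of three, which matches the reduced swapping Zlatko
reports.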
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-23 10:59 ` Zlatko Calusic
@ 1998-07-23 12:23   ` Stephen C. Tweedie
  1998-07-23 15:06     ` Zlatko Calusic
  1998-07-23 17:12   ` Stephen C. Tweedie
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 46+ messages in thread
From: Stephen C. Tweedie @ 1998-07-23 12:23 UTC (permalink / raw)
  To: Zlatko.Calusic; +Cc: Stephen C. Tweedie, Eric W. Biederman, linux-mm

Hi,

On 23 Jul 1998 12:59:38 +0200, Zlatko Calusic
<Zlatko.Calusic@CARNet.hr> said:

> As promised, I did some testing and I may have a solution (big words,
> yeah! :)).

> As I see it, the page cache seems too persistent (it grows out of
> bounds) when we age pages in it.

Not on 110, it looks.  On low memory, .110 seems to be even better
than .108 without the page ageing.  It is looking very good right now.

> I can provide thorough benchmark data, if needed.

Please do, but is this on .110?

--Stephen
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-23 12:23 ` Stephen C. Tweedie
@ 1998-07-23 15:06   ` Zlatko Calusic
  1998-07-23 15:17     ` Benjamin C.R. LaHaise
  0 siblings, 1 reply; 46+ messages in thread
From: Zlatko Calusic @ 1998-07-23 15:06 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Eric W. Biederman, linux-mm

"Stephen C. Tweedie" <sct@redhat.com> writes:

> Hi,
>
> On 23 Jul 1998 12:59:38 +0200, Zlatko Calusic <Zlatko.Calusic@CARNet.hr>
> said:
>
> > As promised, I did some testing and I may have a solution (big
> > words, yeah! :)).
>
> > As I see it, the page cache seems too persistent (it grows out of
> > bounds) when we age pages in it.
>
> Not on 110, it looks.  On low memory, .110 seems to be even better
> than .108 without the page ageing.  It is looking very good right now.
>
> > I can provide thorough benchmark data, if needed.
>
> Please do, but is this on .110?

Yes, this is on .110.

Benchmarking methodology: compile the kernel, reboot, fire up XDM, a
few xterms, XEmacs and Netscape. In one xterm run vmstat 10, in
another copy 800MB worth of .mp3s :) to /dev/null (nothing special
changes if I copy them to another directory).

Official kernel: 1 x age_page() in shrink_mmap():

 procs                  memory    swap        io    system         cpu
 r b w  swpd  free  buff cache  si  so   bi  bo   in   cs  us  sy  id
 1 0 0     0  6832  4292 22740   0   0  182  15  220  155  25   9  66
 0 0 0     0  6860  4292 22744   0   0    0   5  112   10   0   0 100
 1 0 0   428  1380  1964 31260   0  43 5579  13  221  202   1  22  77
 1 0 0  2472  1428  1964 33256  10 209 5742  53  232  211   2  24  75
 1 0 0  5200  3500  1988 33928   5 273 6017  70  236  216   2  25  73
 1 0 0  7012  2940  1964 36292   6 181 6318  46  243  224   2  27  71
 1 0 0 11036  1084  1964 42168   6 402 5910 101  240  212   1  27  72
 1 0 0 12572  3832  2028 40900   6 154 5939  39  239  211   1  23  76
 1 0 0 14288 11336  1964 35180  10 172 5863  44  233  209   1  24  75
 1 0 0 17484  1188  1964 48552  29 320 5076  81  229  189   1  23  76
 1 0 0 18588 10640  1964 40176  42 111 4668  29  217  187   1  18  81
 1 0 0 21988  1576  1964 52636  43 342 5434  86  240  204   1  22  77
 1 0 0 23524 13676  1964 42076  47 154 5652  39  236  222   1  22  77
 1 1 0 23812  1284  1992 54728  41  31 5915   9  234  230   1  25  74
 1 0 0 24076 24324  2028 31916  40  30 6106   8  239  226   1  24  75
 1 0 0 24092 16064  2028 40188  48   7 5869   3  235  226   1  22  77
 0 0 0 24020  1540  2000 54724  30   0 2356   1  162  114   0  11  89
 0 0 0 23980  1536  2000 54688   8   0    2   0  104   19   0   0 100

24MB swapped out, lots of swapouts and swapins!!! There would be much
more swap activity if I were actually using Netscape or XEmacs during
the I/O, but in both tests I wasn't! I forgot to put "time" before cp
:(, but... 15 lines x 10 sec = cca 150 seconds to copy the files.

Also, notice that I'm not memory starved (starting with cca 7 + 4 + 23
= 34 MB for caches to use). In the last minute, the system had swapped
out practically everything it could, so it started to fight for every
other page, effectively losing time (~30 pages out, ~40 pages in,
every second). Too bad. :(
Patched with the small patch I posted: 3 x age_page() in shrink_mmap():

 procs                  memory    swap        io    system         cpu
 r b w  swpd  free  buff cache  si  so   bi  bo   in   cs  us  sy  id
 1 0 0     0  7072  4292 22768   0   0  172  15  217  153  23   9  68
 0 0 0     0  7048  4292 22736   0   0    0   2  109   11   0   0 100
 1 0 0    76  1044  1964 31444   0   8 5899   4  228  219   1  20  79
 1 0 0   116  6076  1964 26432   0   4 6665   2  243  241   2  27  71
 1 0 0   132  6492  2028 25980   0   2 6723   1  239  238   1  25  75
 1 0 0   488  6816  2028 26016   0  36 6671  10  240  233   1  25  74
 1 0 0  1288  1240  1964 32460   0  80 6163  21  232  220   1  23  76
 1 0 0  2152  1536  1964 33028   0  86 6234  22  233  223   1  24  76
 1 0 0  3008  1384  1964 34032   0  86 6313  22  235  229   1  22  77
 1 0 0  3084  1488  1964 34008   0   8 6135   3  229  223   1  22  77
 1 0 0  4816  1128  1964 36096   0 173 6778  44  247  237   2  25  73
 1 0 0  5912  1172  1964 37152   0 110 7103  28  252  252   1  29  70
 1 0 0  6904  1536  1964 37780   0  99 7247  26  250  252   1  27  72
 1 0 0  8348  3704  2028 36988   0 144 7095  37  255  243   1  25  73
 0 0 0  9164 14980  2028 26608   1  82 3278  22  173  120   1  13  86
 0 0 0  9164 14980  2028 26608   0   0    0   0  102    6   0   0 100

The first thing to notice is only 10MB on swap (good). Second, and
more important, the system was *not* swapping things in at all,
because only pages that really belonged in swap (unneeded ones) were
swapped out. Copying finished in 13 x 10 = ~130 seconds.

Conclusion: better I/O performance, and a better feel when using
applications (I didn't have to wait for Netscape or XEmacs to come
back from swap when I started using them for real).

I was very careful to do exactly the same sequence in both tests!
I think it is obvious from the first line of those vmstat reports.

Anything I forgot to test? :)
--
Posted by Zlatko Calusic          E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
     Remember that you are unique. Just like everyone else.

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-23 15:06 ` Zlatko Calusic
@ 1998-07-23 15:17   ` Benjamin C.R. LaHaise
  1998-07-23 15:25     ` Zlatko Calusic
  0 siblings, 1 reply; 46+ messages in thread
From: Benjamin C.R. LaHaise @ 1998-07-23 15:17 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Stephen C. Tweedie, Eric W. Biederman, linux-mm

On 23 Jul 1998, Zlatko Calusic wrote:

> I was very careful to do exactly the same sequence in both tests!
> I think it is obvious from the first line of those vmstat reports.
>
> Anything I forgot to test? :)

Yeap! ;-) Could you try Werner Fink's lowmem.patch -- it changes the
MAX_PAGE_AGE mechanism to have a dynamic upper limit which is lower on
systems with less memory...  That should have a similar effect to the
multiple invocations of age_page that you tried.

		-ben
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-23 15:17 ` Benjamin C.R. LaHaise
@ 1998-07-23 15:25   ` Zlatko Calusic
  1998-07-23 17:27     ` Benjamin C.R. LaHaise
  0 siblings, 1 reply; 46+ messages in thread
From: Zlatko Calusic @ 1998-07-23 15:25 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise; +Cc: linux-mm

"Benjamin C.R. LaHaise" <blah@kvack.org> writes:

> On 23 Jul 1998, Zlatko Calusic wrote:
>
> > I was very careful to do exactly the same sequence in both tests!
> > I think it is obvious from the first line of those vmstat reports.
> >
> > Anything I forgot to test? :)
>
> Yeap! ;-) Could you try Werner Fink's lowmem.patch -- it changes the
> MAX_PAGE_AGE mechanism to have a dynamic upper limit which is lower on
> systems with less memory...  That should have a similar effect to the
> multiple invocations of age_page that you tried.

Not really! :) I'm trying and trying, but every time...

While trying to retrieve the URL:
http://riemann.suse.de/~werner/patches/

The following error was encountered:

    ERROR 308 -- Cannot connect to the original site

This means that: The remote site may be down.

Could you please send me a copy, since I don't know how long the host
will be down?

Regards,
--
Posted by Zlatko Calusic          E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
                    Don't mess with Murphy.

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory
  1998-07-23 15:25 ` Zlatko Calusic
@ 1998-07-23 17:27   ` Benjamin C.R. LaHaise
  1998-07-23 19:17     ` Dr. Werner Fink
  0 siblings, 1 reply; 46+ messages in thread
From: Benjamin C.R. LaHaise @ 1998-07-23 17:27 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: linux-mm

On 23 Jul 1998, Zlatko Calusic wrote:

> Could you please send me a copy, since I don't know how long the host
> will be down?

Okay, there's now a copy at
http://www.kvack.org/~blah/patches/werner-lowmem.patch-2.1.110.gz (~5k).

		-ben
--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-23 17:27 ` Benjamin C.R. LaHaise @ 1998-07-23 19:17 ` Dr. Werner Fink 0 siblings, 0 replies; 46+ messages in thread From: Dr. Werner Fink @ 1998-07-23 19:17 UTC (permalink / raw) To: linux-mm

On Thu, Jul 23, 1998 at 01:27:53PM -0400, Benjamin C.R. LaHaise wrote:
> On 23 Jul 1998, Zlatko Calusic wrote:
>
> > Could you please send me a copy, since I don't know for how long host
> > will be down?
>
> Okay, there's now a copy at
> http://www.kvack.org/~blah/patches/werner-lowmem.patch-2.1.110.gz (~5k).

One remark ... Bill's (to be exact Bill Hawes <whawes@star.net>) patch of
a dynamic number of inodes is more elegant than mine included in this
patch :-)

Werner
--------------------------------------------------------------------------
Hi Bill,

I tried running a test similar to what I think you're using for your
"rust series", and on 2.1.109 I see very little change in compile time
after doing a big find. After booting into 8M, a compile of
net-tools-1.45 takes 83 seconds the first time, 89 seconds after doing a
"find /usr -type f" (about 53,000 files on my system.) Not a speed-up,
but a much smaller change than the typical numbers you've been seeing.
Subsequent finds don't have much effect; compile times remain in the
range of 84-89 sec.

My kernel is heavily patched :-), but I think the relative lack of rust
may be largely due to setting inode-max to scale with memory size. For
an 8M system I have inode-max set to 1024, which nicely limits the
fraction of both inode and dcache memory.

If you don't mind trying some further experiments, could you try 2.1.109
with either the attached patch, or just an
echo 1024 >/proc/sys/fs/inode-max
right after boot. The patch makes this automatic and also preallocates
the inodes so that there's no fragmentation effect, but the important
part is probably to just get the limit right.

Hope this helps a bit ...

Regards,
Bill

[attachment: inode_prealloc109-patch]

--- linux-2.1.109/include/linux/fs.h.old	Fri Jul 17 09:28:55 1998
+++ linux-2.1.109/include/linux/fs.h	Fri Jul 17 09:33:33 1998
@@ -46,7 +46,17 @@
 /* And dynamically-tunable limits and defaults: */
 extern int max_inodes;
 extern int max_files, nr_files, nr_free_files;
-#define NR_INODE 4096	/* This should no longer be bigger than NR_FILE */
+/*
+ * Make the default inode limit scale with memory size
+ * up to a limit. (A 32M system gets 4096 inodes.)
+ *
+ * Note: NR_INODE may be larger than NR_FILE, as unused
+ * inodes are still useful for preserving page cache.
+ */
+#define NR_INODE_MAX 16384
+#define NR_INODE(pages) \
+	(((pages) >> 1) <= NR_INODE_MAX ? ((pages) >> 1) : NR_INODE_MAX)
+
 #define NR_FILE 4096	/* this can well be larger on a larger system */
 #define NR_RESERVED_FILES 10 /* reserved for root */
--- linux-2.1.109/fs/inode.c.old	Fri Jul  3 10:32:32 1998
+++ linux-2.1.109/fs/inode.c	Fri Jul 17 10:05:55 1998
@@ -20,8 +20,12 @@
  * Famous last words.
  */
 
+/* for sizing the inode limit */
+extern unsigned long num_physpages;
+
 #define INODE_PARANOIA 1
 /* #define INODE_DEBUG 1 */
+#define INODE_PREALLOC 1	/* make a CONFIG option */
 
 /*
  * Inode lookup is no longer as critical as it used to be:
@@ -65,7 +69,8 @@
 	int dummy[4];
 } inodes_stat = {0, 0, 0,};
 
-int max_inodes = NR_INODE;
+/* Initialized in inode_init() */
+int max_inodes;
 
 /*
  * Put the inode on the super block's dirty list.
@@ -737,15 +791,34 @@
  */
 void inode_init(void)
 {
-	int i;
 	struct list_head *head = inode_hashtable;
+	int i = HASH_SIZE;
 
-	i = HASH_SIZE;
 	do {
 		INIT_LIST_HEAD(head);
 		head++;
 		i--;
 	} while (i);
+
+	/*
+	 * Initialize the default maximum based on memory size.
+	 */
+	max_inodes = NR_INODE(num_physpages);
+
+#ifdef INODE_PREALLOC
+	/*
+	 * Preallocate the inodes to avoid memory fragmentation.
+	 */
+	spin_lock(&inode_lock);
+	while (inodes_stat.nr_inodes < max_inodes) {
+		struct inode *inode = grow_inodes();
+		if (!inode)
+			break;	/* break, not goto: don't leak inode_lock */
+		list_add(&inode->i_list, &inode_unused);
+		inodes_stat.nr_free_inodes++;
+	}
+	spin_unlock(&inode_lock);
+#endif
 }
 
 /* This belongs in file_table.c, not here... */
-- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
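For a concrete feel for what the NR_INODE() scaling above works out to, here is a small stand-alone sketch; it assumes 4K pages (i386), and the memory sizes in the loop are illustrative, not taken from the thread:

/* Stand-alone check of the NR_INODE() scaling in the patch above.
 * Assumes 4K pages (i386); the memory sizes below are illustrative. */
#include <stdio.h>

#define NR_INODE_MAX 16384
#define NR_INODE(pages) \
	(((pages) >> 1) <= NR_INODE_MAX ? ((pages) >> 1) : NR_INODE_MAX)

int main(void)
{
	unsigned long mb;

	for (mb = 8; mb <= 256; mb <<= 1) {
		unsigned long pages = mb * 1024 * 1024 / 4096;
		printf("%4luM RAM -> %6lu pages -> max_inodes = %5lu\n",
		       mb, pages, (unsigned long) NR_INODE(pages));
	}
	return 0;
}

An 8M box gets the 1024 inodes Bill mentions setting by hand, a 32M box gets the 4096 from the comment, and the limit saturates at 16384 from 128M upward.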
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-23 10:59 ` Zlatko Calusic 1998-07-23 12:23 ` Stephen C. Tweedie @ 1998-07-23 17:12 ` Stephen C. Tweedie 1998-07-23 17:42 ` Zlatko Calusic 1998-07-23 19:12 ` Dr. Werner Fink 1998-07-23 19:51 ` Rik van Riel 3 siblings, 1 reply; 46+ messages in thread From: Stephen C. Tweedie @ 1998-07-23 17:12 UTC (permalink / raw) To: Zlatko.Calusic; +Cc: Stephen C. Tweedie, Eric W. Biederman, linux-mm

Hi,

On 23 Jul 1998 12:59:38 +0200, Zlatko Calusic <Zlatko.Calusic@CARNet.hr> said:

> As I see it, the page cache seems too persistent (it grows out of
> bounds) when we age pages in it.

> One wrong way of fixing it is to limit page cache size, IMNSHO.

I_my_NSHO, it's an awful way to fix it: adding yet another rule to the
VM is not progress, it's making things worse!

> I tried the other way, to age page cache harder, and it looks like it
> works very well. Patch is simple, so simple that I can't understand
> nobody suggested (something like) it yet.

It has been suggested before, and that's why a lot of people have
reported great success by having page ageing removed: it essentially
lets pages age faster by limiting the number of ageing passes required
to remove a page (in effect it reduces the age value down to the page's
single PG_referenced bit).

And yes, it should work fine.

--Stephen
-- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
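Stephen's point can be made concrete with a toy model of the two schemes; the constants below (MAX_PAGE_AGE, PAGE_ADVANCE, PAGE_DECLINE) are illustrative assumptions, not the real 2.1 values:

/* Toy model: multi-pass ageing vs. a single referenced bit.
 * Constants are illustrative assumptions, not the 2.1 values. */
#include <stdio.h>

#define MAX_PAGE_AGE 64
#define PAGE_ADVANCE 3
#define PAGE_DECLINE 1

int main(void)
{
	int age = 0, touches;

	/* With ageing: every touch raises the age, and each reclaim
	 * pass drains only PAGE_DECLINE, so a page touched a few
	 * times survives many passes after it goes cold. */
	for (touches = 1; touches <= 8; touches++) {
		age += PAGE_ADVANCE;
		if (age > MAX_PAGE_AGE)
			age = MAX_PAGE_AGE;
		printf("touched %d times -> age %2d -> survives %2d passes\n",
		       touches, age, age / PAGE_DECLINE);
	}

	/* With only PG_referenced: one pass clears the bit, the next
	 * can evict -- at most two passes, whatever the history. */
	printf("referenced bit only -> survives at most 2 passes\n");
	return 0;
}

Removing the ageing passes collapses the whole table to the two-pass case, which is exactly the behaviour people reported as a win on low memory.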
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-23 17:12 ` Stephen C. Tweedie @ 1998-07-23 17:42 ` Zlatko Calusic 0 siblings, 0 replies; 46+ messages in thread From: Zlatko Calusic @ 1998-07-23 17:42 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Eric W. Biederman, linux-mm

"Stephen C. Tweedie" <sct@redhat.com> writes:

> Hi,
>
> On 23 Jul 1998 12:59:38 +0200, Zlatko Calusic <Zlatko.Calusic@CARNet.hr>
> said:
>
> > As I see it, the page cache seems too persistent (it grows out of
> > bounds) when we age pages in it.
>
> > One wrong way of fixing it is to limit page cache size, IMNSHO.
>
> I_my_NSHO, it's an awful way to fix it: adding yet another rule to the
> VM is not progress, it's making things worse!
>

Good, we agree. :)

> > I tried the other way, to age page cache harder, and it looks like it
> > works very well. Patch is simple, so simple that I can't understand
> > nobody suggested (something like) it yet.
>
> It has been suggested before, and that's why a lot of people have
> reported great success by having page ageing removed: it essentially
> lets pages age faster by limiting the number of ageing passes required
> to remove a page (in effect it reduces the age value down to the page's
> single PG_referenced bit).
>
> And yes, it should work fine.
>

Yep! Exactly that. If only my English were good enough to explain it as
easily and precisely as you do. :)

As I already said (or at least tried to :)) there's nothing wrong with
the idea of page aging, it's just that the current implementation is not
very good. So I would like page aging to stay, but with my or some
similar change that will make things work well and smoothly.

Thanks to Benjamin, I'm going to download Werner's patch and see how his
idea performs. In a minute. :)

Regards,
--
Posted by Zlatko Calusic E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
If you don't think women are explosive, drop one!
-- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-23 10:59 ` Zlatko Calusic 1998-07-23 12:23 ` Stephen C. Tweedie 1998-07-23 17:12 ` Stephen C. Tweedie @ 1998-07-23 19:12 ` Dr. Werner Fink 1998-07-27 10:40 ` Stephen C. Tweedie 1998-07-23 19:51 ` Rik van Riel 3 siblings, 1 reply; 46+ messages in thread From: Dr. Werner Fink @ 1998-07-23 19:12 UTC (permalink / raw) To: linux-mm

On Thu, Jul 23, 1998 at 12:59:38PM +0200, Zlatko Calusic wrote:
>
> I tried the other way, to age page cache harder, and it looks like it
> works very well. Patch is simple, so simple that I can't understand
> nobody suggested (something like) it yet.
>
>
> --- filemap.c.virgin	Tue Jul 21 18:41:30 1998
> +++ filemap.c	Thu Jul 23 12:14:43 1998
> @@ -171,6 +171,11 @@
>  			touch_page(page);
>  			break;
>  		}
> +		/* Age named pages aggressively, so page cache
> +		 * doesn't grow too fast. -zcalusic
> +		 */
> +		age_page(page);
> +		age_page(page);
>  		age_page(page);
>  		if (page->age)
>  			break;
>

I've something similar ... cut&paste (no tabs) ... which would only do
less graduated ageing on small systems.

-------------------------------------------------------------------------------
diff -urN linux-2.1.110/include/linux/swapctl.h linux/include/linux/swapctl.h
--- linux-2.1.110/include/linux/swapctl.h	Tue Jul 21 02:32:01 1998
+++ linux/include/linux/swapctl.h	Wed Jul 22 18:04:28 1998
@@ -94,12 +94,26 @@
 	return n;
 }
 
+extern int pgcache_max_age;
+extern void do_pgcache_max_age(void);
+
 static inline void touch_page(struct page *page)
 {
-	if (page->age < (MAX_PAGE_AGE - PAGE_ADVANCE))
+	int max_age = MAX_PAGE_AGE;
+
+	if (atomic_read(&page->count) == 1) {
+		static int save_max_age = 0;
+		if (save_max_age != max_age) {
+			save_max_age = max_age;
+			do_pgcache_max_age();
+		}
+		max_age = pgcache_max_age;
+	}
+
+	if (page->age < (max_age - PAGE_ADVANCE))
 		page->age += PAGE_ADVANCE;
 	else
-		page->age = MAX_PAGE_AGE;
+		page->age = max_age;
 }
 
 static inline void age_page(struct page *page)
diff -urN linux-2.1.110/include/linux/swapctl.h linux/include/linux/swapctl.h
--- linux-2.1.110/include/linux/swapctl.h	Tue Jul 21 02:32:01 1998
+++ linux/include/linux/swapctl.h	Wed Jul 22 18:04:28 1998
@@ -94,12 +94,26 @@
 	return n;
 }
 
+extern int pgcache_max_age;
+extern void do_pgcache_max_age(void);
+
 static inline void touch_page(struct page *page)
 {
-	if (page->age < (MAX_PAGE_AGE - PAGE_ADVANCE))
+	int max_age = MAX_PAGE_AGE;
+
+	if (atomic_read(&page->count) == 1) {
+		static int save_max_age = 0;
+		if (save_max_age != max_age) {
+			save_max_age = max_age;
+			do_pgcache_max_age();
+		}
+		max_age = pgcache_max_age;
+	}
+
+	if (page->age < (max_age - PAGE_ADVANCE))
 		page->age += PAGE_ADVANCE;
 	else
-		page->age = MAX_PAGE_AGE;
+		page->age = max_age;
}
 
 static inline void age_page(struct page *page)
-------------------------------------------------------------------------------

Werner
-- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
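The two patches shrink the same quantity from different ends: Zlatko's drains extra age on each reclaim scan over named pages, while Werner's lowers the ceiling the age can reach for pages held only by the cache (count == 1). A rough back-of-the-envelope comparison, again with assumed constants and an assumed value for the lowered pgcache_max_age:

/* Rough comparison of the two approaches above.  MAX_PAGE_AGE,
 * PAGE_DECLINE and the lowered pgcache_max_age are assumed values. */
#include <stdio.h>

#define MAX_PAGE_AGE 64
#define PAGE_DECLINE 1

int main(void)
{
	int pgcache_max_age = MAX_PAGE_AGE / 4;	/* assumed small-box cap */

	/* Scans a fully-touched, now-cold page survives before its
	 * age reaches 0 and it becomes reclaimable: */
	printf("stock ageing:      %d scans\n", MAX_PAGE_AGE / PAGE_DECLINE);
	printf("triple age_page(): %d scans\n",
	       MAX_PAGE_AGE / (3 * PAGE_DECLINE));
	printf("lowered max age:   %d scans\n",
	       pgcache_max_age / PAGE_DECLINE);
	return 0;
}

Either way a cold page-cache page becomes reclaimable after far fewer scans, which is what keeps the cache from squeezing out process pages; the difference is that Werner's cap can be retuned per machine, while the triple call is compiled in.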
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-23 19:12 ` Dr. Werner Fink @ 1998-07-27 10:40 ` Stephen C. Tweedie 0 siblings, 0 replies; 46+ messages in thread From: Stephen C. Tweedie @ 1998-07-27 10:40 UTC (permalink / raw) To: Dr. Werner Fink; +Cc: linux-mm, Stephen Tweedie Hi Werner, On Thu, 23 Jul 1998 21:12:22 +0200, "Dr. Werner Fink" <werner@suse.de> said: > I've something similar ... cut&paste (no tabs) ... which would only do > less graduated ageing on small systems. > ---------------------------------------------------------------------------- > [patch follows] Interesting, but the patch included just two copies of the diff to swapctl.h and no definition of the new do_pgcache_max_age() function. Could you post a complete patch, please?! Thanks, Stephen -- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-23 10:59 ` Zlatko Calusic ` (2 preceding siblings ...) 1998-07-23 19:12 ` Dr. Werner Fink @ 1998-07-23 19:51 ` Rik van Riel 1998-07-24 11:21 ` Zlatko Calusic 3 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 1998-07-23 19:51 UTC (permalink / raw) To: Zlatko Calusic; +Cc: Stephen C. Tweedie, Eric W. Biederman, linux-mm

On 23 Jul 1998, Zlatko Calusic wrote:

> One wrong way of fixing it is to limit page cache size, IMNSHO.
>
> I tried the other way, to age page cache harder, and it looks like it
> works very well. Patch is simple, so simple that I can't understand
> nobody suggested (something like) it yet.

These solutions are somewhat the same, but yours may take a little
less computational power, with the tradeoff that it is very inflexible.

> --- filemap.c.virgin	Tue Jul 21 18:41:30 1998
> +++ filemap.c	Thu Jul 23 12:14:43 1998
> +		age_page(page);
> +		age_page(page);
>  		age_page(page);

> If I put only two age_page()s, there's still too much swapping for my
> taste.

> With three age_page()s, read performance is as expected, and still we
> manage memory more efficiently than without page aging.

This only proves that three age_page()s are a good number for _your_
computer and your workload.

> Comments?

As Stephen put it so nicely when I (in a bad mood) proposed another
artificial limit:
" O no, another arbitrary limit in the kernel! "

And another one of Stephen's wisdoms (heavily paraphrased!):
" Good solutions are dynamic and/or self-tuning "
[Sorry Stephen, this was VERY heavily paraphrased :)]

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide. H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader. http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
-- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-23 19:51 ` Rik van Riel @ 1998-07-24 11:21 ` Zlatko Calusic 1998-07-24 14:25 ` Rik van Riel 0 siblings, 1 reply; 46+ messages in thread From: Zlatko Calusic @ 1998-07-24 11:21 UTC (permalink / raw) To: Rik van Riel Cc: Zlatko Calusic, Stephen C. Tweedie, Eric W. Biederman, linux-mm

Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:

> On 23 Jul 1998, Zlatko Calusic wrote:
>
> > One wrong way of fixing it is to limit page cache size, IMNSHO.
> >
> > I tried the other way, to age page cache harder, and it looks like it
> > works very well. Patch is simple, so simple that I can't understand
> > nobody suggested (something like) it yet.
>
> These solutions are somewhat the same, but yours may take
> a little less computational power, with the tradeoff that
> it is very inflexible.

Same? Not in your wildest dream. :)

Limiting means putting an "arbitrary" limit on it. Then the page cache
would NEVER grow above that limit. That's how the buffer cache works at
present. It never grows above circa 30% of installed physical memory.
That means lots of unused memory... I don't like it. Many times, no
matter how heavy the I/O, the last 20MB (for example, but this happens
in many real cases) are free, unused, WASTED. I see that only on two
OSes, NT and recent 2.1.x Linuces. I know I can change that limit in
/proc/sys... but I was always wondering why the default is set so low.

With harder aging you're NOT limiting the size of the page cache. You
just tell the subsystem to be polite, but if you have lots of memory,
that memory will be instantly used by the cache. That's FUNDAMENTALLY
different from limiting.

Triple aging has all the good characteristics of aging. Why do you
think it is inflexible?

> > --- filemap.c.virgin	Tue Jul 21 18:41:30 1998
> > +++ filemap.c	Thu Jul 23 12:14:43 1998
> > +		age_page(page);
> > +		age_page(page);
> >  		age_page(page);
> > If I put only two age_page()s, there's still too much swapping for my
> > taste.
> > With three age_page()s, read performance is as expected, and still we
> > manage memory more efficiently than without page aging.
>
> This only proves that three age_page()s are a good number
> for _your_ computer and your workload.

Could be. So I'd like to see other people's benchmarks. I hope I'm not
the only speed freak around. :)

I will post another, completely different set of benchmarks today.
Under different initial conditions, so as to simulate different
machines and loads.

> > Comments?
>
> As Stephen put it so nicely when I (in a bad mood) proposed
> another artificial limit:
> " O no, another arbitrary limit in the kernel! "

I couldn't agree more. I like sane defaults. And simple solutions, more
than anything.

> And another one of Stephen's wisdoms (heavily paraphrased!):
> " Good solutions are dynamic and/or self-tuning "
> [Sorry Stephen, this was VERY heavily paraphrased :)]

Agreed, but only if that self-tuning does not take more code than the
core functionality itself. :)

I'm very satisfied with the changes (in .109 I think)
free_memory_available() went through. The old function was far too
complicated, not useful at all, and unreadable.

Regards,
--
Posted by Zlatko Calusic E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
File not found. Should I fake it? (Y/N)
-- This is a majordomo managed list.
To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-24 11:21 ` Zlatko Calusic @ 1998-07-24 14:25 ` Rik van Riel 1998-07-24 17:01 ` Zlatko Calusic 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 1998-07-24 14:25 UTC (permalink / raw) To: Zlatko Calusic; +Cc: Stephen C. Tweedie, Eric W. Biederman, Linux MM

On 24 Jul 1998, Zlatko Calusic wrote:
> Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:
>
> > These solutions are somewhat the same, but yours may take
> > a little less computational power, with the tradeoff that
> > it is very inflexible.
>
> Same? Not in your wildest dream. :)
>
> Limiting means putting an "arbitrary" limit on it. Then the page cache
> would NEVER grow above that limit.

There's also a 'soft limit', or borrow percentage. Ultimately the
minimum and maximum percentages should be 0 and 100% respectively.

> Triple aging has all the good characteristics of aging.
> Why do you think it is inflexible?

Because there's no way to tune the 'priority' of the page aging.
It could be good to do triple aging, but it could be a non-optimal
number on other machines ... and there's no way to get out of it!

> I will post another, completely different set of benchmarks today.
> Under different initial conditions, so as to simulate different
> machines and loads.

Good, I like this. You will probably get somewhat different results
with this...

Oh, and changing the code to:

	int i;
	for (i = page_cache_penalty; i--; )
		age_page(page);

and making page_cache_penalty sysctl tunable will certainly make your
tests easier...

> I'm very satisfied with the changes (in .109 I think)
> free_memory_available() went through. The old function was far too
> complicated, not useful at all, and unreadable.

It _was_ useful; it has always been useful to test for the amount of
memory fragmentation.

In fact, Linus himself said (when free_memory_available() was
introduced in 2.1.89) that he would not accept any function which used
the amount of free pages.

After some protests (by me) Linus managed to explain to us exactly
_why_ we should test for fragmentation; I suggest we all go through the
archives again and reread the arguments...

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide. H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader. http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
-- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
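A minimal sketch of the sysctl hookup Rik is hinting at; the ctl_name value, the variable, and its placement are all hypothetical, and the ctl_table layout is written from memory of the 2.1-era interface, so treat this as an illustration rather than a drop-in patch:

/* Hypothetical sketch -- names and numbers invented for illustration. */
int page_cache_penalty = 3;	/* Zlatko's triple aging as the default */

/* entry added to vm_table[] in kernel/sysctl.c (assumed 2.1-era layout): */
	{VM_PAGECACHE_PENALTY, "pagecache_penalty", &page_cache_penalty,
	 sizeof(int), 0644, NULL, &proc_dointvec},

/* and in the scan loop in mm/filemap.c, replacing the fixed calls: */
	{
		int i;
		for (i = page_cache_penalty; i--; )
			age_page(page);
	}

With that in place, echo 1 >/proc/sys/vm/pagecache_penalty would restore the stock behaviour without a recompile, and testers could try different penalties on the fly.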
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-24 14:25 ` Rik van Riel @ 1998-07-24 17:01 ` Zlatko Calusic 1998-07-24 21:55 ` Rik van Riel 0 siblings, 1 reply; 46+ messages in thread From: Zlatko Calusic @ 1998-07-24 17:01 UTC (permalink / raw) To: Rik van Riel; +Cc: Stephen C. Tweedie, Eric W. Biederman, Linux MM

Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:

> On 24 Jul 1998, Zlatko Calusic wrote:
> > Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:
> >
> > > These solutions are somewhat the same, but yours may take
> > > a little less computational power, with the tradeoff that
> > > it is very inflexible.
> >
> > Same? Not in your wildest dream. :)
> >
> > Limiting means putting an "arbitrary" limit on it. Then the page cache
> > would NEVER grow above that limit.
>
> There's also a 'soft limit', or borrow percentage. Ultimately
> the minimum and maximum percentages should be 0 and 100%
> respectively.

Could you elaborate on the "borrow" percentage? I have some trouble
understanding what that could be.

> > Triple aging has all the good characteristics of aging.
> > Why do you think it is inflexible?
>
> Because there's no way to tune the 'priority' of the page aging.
> It could be good to do triple aging, but it could be a non-optimal
> number on other machines ... and there's no way to get out of it!

Yes, you're right here. See below...

> > I will post another, completely different set of benchmarks today.
> > Under different initial conditions, so as to simulate different
> > machines and loads.
>
> Good, I like this. You will probably get somewhat different
> results with this...
>
> Oh, and changing the code to:
>
> 	int i;
> 	for (i = page_cache_penalty; i--; )
> 		age_page(page);
>
> and making page_cache_penalty sysctl tunable will certainly
> make your tests easier...

Yes, I wanted to do something like this, but then again, I was too lazy
to further complicate things. So I was just recompiling the kernel and
rebooting (to do testing), since only one file (filemap.c) was really
recompiled and the whole operation took no more than a few minutes. :)
Code like that is easy to put in the kernel, but only if people think it
would be a good idea. And then the final question remains: what should
the default value be?

But I also think that too many configurable parameters cause trouble
too. If you have 100 variables to configure one subsystem in the
kernel, where do you start? I like solutions that work well by
themselves. Autotuning. With not too much logic in them. :)

> > I'm very satisfied with the changes (in .109 I think)
> > free_memory_available() went through. The old function was far too
> > complicated, not useful at all, and unreadable.
>
> It _was_ useful; it has always been useful to test for the
> amount of memory fragmentation.

Whoops, here I don't share your opinion. Checking memory fragmentation
and then acting accordingly (in kswapd) seems like a good idea, but,
unfortunately, I am now pretty sure it is NOT. And there is one and only
one reason: throwing pages out of memory at random (blindly). You know
it, too.

I came to this conclusion many months ago, with my first patch, which
aimed to solve the fragmentation problem. My first idea was to make
sure we have at least one 128KB chunk. It ended with many lockups and
kswapd deadlocks. Then I tried to make a few 16KB chunks available and
performance still sucked. To get a few 16KB chunks the system would
happily swap out my whole memory. Thanks, not again.
I used it for a while, only to prevent network lockups.

The old (<= 2.1.108) free_memory_available() was practically that, but
with a limit applied, which effectively worked like: "Oh, no, memory
fragmented, swap out, swap out, oh no, too much swapped out, never mind
fragmentation, stop swapping." So it didn't work. And it was definitely
overcomplicated. Obviously, everybody tried hard to do the right thing,
where the right thing could not be done. Wrong place to search for a
solution.

Stephen's new patch looks promising. It has some new logic in it which
hasn't been tried before. I already tested it, and the results are not
bad. But I can't say it is the final solution either, since I can still
easily produce a memory shortage with many simultaneous network
connections, even on an unloaded 64MB machine. So, lots of work to be
done for 2.4. :)

> In fact, Linus himself said (when free_memory_available()
> was introduced in 2.1.89) that he would not accept any
> function which used the amount of free pages.
>
> After some protests (by me) Linus managed to explain to us
> exactly _why_ we should test for fragmentation; I suggest
> we all go through the archives again and reread the arguments...

Yeah, I remember. That was the time I started patching my kernels with
every new release. That was the time I went for another 32MB to solve my
problems. :(

I'm lagging far behind on the linux-kernel list (~3000 posts) and it
seems like I missed some good discussion about Linux MM (I read about it
on http://lwn.net/). Now I hope I can still catch up on it all, and then
spend some time testing and coding. :)

Regards,
--
Posted by Zlatko Calusic E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
P.S. That Linux-MM page you're doing kicks ass. I just never had the
opportunity to tell you that I really like it. :)
-- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-24 17:01 ` Zlatko Calusic @ 1998-07-24 21:55 ` Rik van Riel 1998-07-25 13:05 ` Zlatko Calusic 1998-07-27 10:54 ` Stephen C. Tweedie 0 siblings, 2 replies; 46+ messages in thread From: Rik van Riel @ 1998-07-24 21:55 UTC (permalink / raw) To: Zlatko Calusic; +Cc: Stephen C. Tweedie, Eric W. Biederman, Linux MM

On 24 Jul 1998, Zlatko Calusic wrote:

> > There's also a 'soft limit', or borrow percentage. Ultimately
> > the minimum and maximum percentages should be 0 and 100%
> > respectively.
>
> Could you elaborate on the "borrow" percentage? I have some trouble
> understanding what that could be.

It's an idea I stole from Digital Unix :)

Basically, the cache is allowed to grow boundlessly, but when memory is
short it is reclaimed until it reaches the borrow percentage.

The philosophy behind it is that caching the disk doesn't make much
sense beyond a certain point.

It's a primitive idea, but it seems to have saved Andrea's machine
quite well (with the additional patch).

I admit your patch (multiple aging) should work even better, but in
order to do that, we probably want to make it auto-tuning on the borrow
percentage:

- if page_cache_size > borrow + 5% --> add aging loop
- if loads_of_disk_io and almost thrashing [*] --> remove aging loop

[*] this thrashing can be measured by testing the cache hit/miss
rate; if it falls below (say) 50% we could consider it thrashing.

(50% should be a good rate for an aging cache, and the number of loops
is trimmed quickly enough when we grow anyway. This could make a nice,
somewhat self-adjusting trimming mechanism. Expect a patch soon...)

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide. H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader. http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
-- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
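As described, 'borrow' is not a ceiling but a reclaim target. A minimal stand-alone sketch of the decision, with a hypothetical percentage and helper name (not code from any kernel):

/* Sketch of the 'borrow' idea described above -- the name and the
 * percentage are hypothetical. */

#define PAGE_CACHE_BORROW_PCT 15	/* assumed soft target */

/* Called only when free memory is short: should reclaim prefer
 * page-cache pages?  There is no hard ceiling -- with memory to
 * spare the cache may grow toward 100% -- but under shortage we
 * shrink it back toward the borrow percentage. */
static int cache_over_borrow(unsigned long page_cache_size,
			     unsigned long num_physpages)
{
	unsigned long borrow = num_physpages * PAGE_CACHE_BORROW_PCT / 100;

	return page_cache_size > borrow;
}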
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-24 21:55 ` Rik van Riel @ 1998-07-25 13:05 ` Zlatko Calusic 1998-07-27 10:54 ` Stephen C. Tweedie 0 siblings, 0 replies; 46+ messages in thread From: Zlatko Calusic @ 1998-07-25 13:05 UTC (permalink / raw) To: Rik van Riel; +Cc: Stephen C. Tweedie, Eric W. Biederman, Linux MM

Rik van Riel <H.H.vanRiel@phys.uu.nl> writes:

> On 24 Jul 1998, Zlatko Calusic wrote:
>
> > > There's also a 'soft limit', or borrow percentage. Ultimately
> > > the minimum and maximum percentages should be 0 and 100%
> > > respectively.
> >
> > Could you elaborate on the "borrow" percentage? I have some trouble
> > understanding what that could be.
>
> It's an idea I stole from Digital Unix :)
>
> Basically, the cache is allowed to grow boundlessly, but when memory
> is short it is reclaimed until it reaches the borrow percentage.

OK, I get it now. Looks good.

> The philosophy behind it is that caching the disk doesn't make much
> sense beyond a certain point.

I mostly agree.

> It's a primitive idea, but it seems to have saved Andrea's
> machine quite well (with the additional patch).
>
> I admit your patch (multiple aging) should work even better,
> but in order to do that, we probably want to make it auto-tuning
> on the borrow percentage:
>
> - if page_cache_size > borrow + 5% --> add aging loop
> - if loads_of_disk_io and almost thrashing [*] --> remove aging loop

Yes, something like this could be worthwhile. I observed some strange
patterns of behaviour with the aging loop: sometimes the system is
still too aggressive, and sometimes you can't tell if it's working at
all. Probably some debugging and profiling code should be added to see
what's going on there.

> [*] this thrashing can be measured by testing the cache hit/miss
> rate; if it falls below (say) 50% we could consider it thrashing.

That probably wouldn't work as well as you expect. The problem is
again that arbitrary 50%. I had code in the kernel that reported the
buffer/page cache hit ratio and was surprised that for both caches it
was > 90%. And that was on a 5MB machine. Can you imagine? :)

> (50% should be a good rate for an aging cache, and the number of
> loops is trimmed quickly enough when we grow anyway. This could
> make a nice, somewhat self-adjusting trimming mechanism. Expect
> a patch soon...)

I'll be glad to test a patch, but I'm not that convinced that this is
really a good idea. But then again, I have nothing against it.

Keep up the good work!
--
Posted by Zlatko Calusic E-mail: <Zlatko.Calusic@CARNet.hr>
---------------------------------------------------------------------
Crime doesn't pay... does that mean my job is a crime?
-- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
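For reference, the kind of instrumentation Zlatko mentions amounts to a pair of counters in the lookup path; this sketch uses hypothetical names and placement, not his actual patch:

/* Hypothetical hit/miss counters for the page-cache lookup path. */
static unsigned long pgcache_hits, pgcache_misses;

/* in the lookup (e.g. where filemap.c searches the cache for a page): */
	if (page) {
		pgcache_hits++;		/* found in the page cache */
		/* ... use the cached page ... */
	} else {
		pgcache_misses++;	/* must read from disk */
		/* ... allocate a page and start the read ... */
	}

/* hit ratio = pgcache_hits * 100 / (pgcache_hits + pgcache_misses),
 * which could be reported via a /proc file or a printk. */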
* Re: More info: 2.1.108 page cache performance on low memory 1998-07-24 21:55 ` Rik van Riel 1998-07-25 13:05 ` Zlatko Calusic @ 1998-07-27 10:54 ` Stephen C. Tweedie 1 sibling, 0 replies; 46+ messages in thread From: Stephen C. Tweedie @ 1998-07-27 10:54 UTC (permalink / raw) To: Rik van Riel Cc: Zlatko Calusic, Stephen C. Tweedie, Eric W. Biederman, Linux MM

Hi,

On Fri, 24 Jul 1998 23:55:10 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> I admit your patch (multiple aging) should work even better,
> but in order to do that, we probably want to make it auto-tuning
> on the borrow percentage:

> - if page_cache_size > borrow + 5% --> add aging loop

<Bzzt> wrong answer...

> - if loads_of_disk_io and almost thrashing [*] --> remove aging loop

Yep, much better.

> [*] this thrashing can be measured by testing the cache hit/miss
> rate; if it falls below (say) 50% we could consider it thrashing.

Adding even more rules based on the actual cache size is a bad thing,
since it enforces an arbitrary limit which does not depend on what the
system load is right now. Making it adapt to the current load is ALWAYS
going to be a better way of doing things.

--Stephen
-- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org ^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread, other threads:[~1998-08-20 14:30 UTC | newest] Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 1998-07-13 16:53 More info: 2.1.108 page cache performance on low memory Stephen C. Tweedie 1998-07-13 18:08 ` Eric W. Biederman 1998-07-13 18:29 ` Zlatko Calusic 1998-07-14 17:32 ` Stephen C. Tweedie 1998-07-16 12:31 ` Zlatko Calusic 1998-07-14 17:30 ` Stephen C. Tweedie 1998-07-18 1:10 ` Eric W. Biederman 1998-07-18 13:28 ` Zlatko Calusic 1998-07-18 16:40 ` Eric W. Biederman 1998-07-20 9:15 ` Zlatko Calusic 1998-07-22 10:40 ` Stephen C. Tweedie 1998-07-23 10:06 ` Zlatko Calusic 1998-07-23 12:22 ` Stephen C. Tweedie 1998-07-23 14:07 ` Zlatko Calusic 1998-07-23 17:18 ` Stephen C. Tweedie 1998-07-23 19:33 ` Zlatko Calusic 1998-07-27 10:57 ` Stephen C. Tweedie 1998-07-26 14:49 ` Eric W Biederman 1998-07-27 11:02 ` Stephen C. Tweedie 1998-08-02 5:19 ` Eric W Biederman 1998-08-17 13:57 ` Stephen C. Tweedie 1998-08-17 15:35 ` Stephen C. Tweedie 1998-08-20 12:40 ` Eric W. Biederman 1998-07-20 15:58 ` Stephen C. Tweedie 1998-07-22 10:36 ` Stephen C. Tweedie 1998-07-22 18:01 ` Rik van Riel 1998-07-23 10:59 ` Stephen C. Tweedie 1998-07-22 10:33 ` Stephen C. Tweedie 1998-07-23 10:59 ` Zlatko Calusic 1998-07-23 12:23 ` Stephen C. Tweedie 1998-07-23 15:06 ` Zlatko Calusic 1998-07-23 15:17 ` Benjamin C.R. LaHaise 1998-07-23 15:25 ` Zlatko Calusic 1998-07-23 17:27 ` Benjamin C.R. LaHaise 1998-07-23 19:17 ` Dr. Werner Fink 1998-07-23 17:12 ` Stephen C. Tweedie 1998-07-23 17:42 ` Zlatko Calusic 1998-07-23 19:12 ` Dr. Werner Fink 1998-07-27 10:40 ` Stephen C. Tweedie 1998-07-23 19:51 ` Rik van Riel 1998-07-24 11:21 ` Zlatko Calusic 1998-07-24 14:25 ` Rik van Riel 1998-07-24 17:01 ` Zlatko Calusic 1998-07-24 21:55 ` Rik van Riel 1998-07-25 13:05 ` Zlatko Calusic 1998-07-27 10:54 ` Stephen C. Tweedie
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox