[PATCH][RFC] slabnow

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH][RFC] slabnow
@ 2002-09-07 14:06 Ed Tomlinson
  2002-09-08 20:45 ` Daniel Phillips
       [not found] ` <200209081142.02839.tomlins@cam.org>
  0 siblings, 2 replies; 14+ messages in thread
From: Ed Tomlinson @ 2002-09-07 14:06 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Rik van Riel

Hi,

Andrew took a good look at slablru and asked a few deep questions.  One was why does
slab not release pages immediately?  Since Rik explained that using a lazy reclaim of free
pages from the lru worked badly, it does not make much sense to use a lazy reclaim of
slab pages either...

Second question was can you do this without the lru?  He then suggested we think about
seeks.  If we assume a lru page takes a seek to recreate and a slab object also takes a
one to recreate we can use the percentage of pages reclaimed to drive the slab shrinking.

This has some major implications.  If it works well slab.c will get gutted.  We will no longer
need *shrink* calls in slab nor will kmem_cache_reap do anything and the slabs_free list
can go too...  This version of the patch defers the slab.c cleanup.

Here is my implementation.  There is one thing missing from it.  In shrink_cache we 
need to avoid shrinking when we are not working with ZONE_DMA or ZONE_NORMAL.  I
am not sure the best way to test this.  Andrew?  Also I need to find a better name for
nr_used_zone_pages, which should tell us the number of pages used by ZONE_DMA and
ZONE_NORMAL.

This is against Linus bk at cset 1.575.1.45 (Thusday evening).  Its been tested on UP
without highmem - it needs a the zone test for highmem to work correctly.  Testing
used:

find / -name "*" > /dev/null
multiple tiobenchs
dbench on reiserfs and tmpfs
gimp working with massive tifs
plus my normal workstation load

As always comments very welcome.

Ed Tomlinson

--------- slabasap_A0
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.580   -> 1.584  
#	  include/linux/mm.h	1.77    -> 1.78   
#	     mm/page_alloc.c	1.96    -> 1.97   
#	         fs/dcache.c	1.29    -> 1.31   
#	         mm/vmscan.c	1.100   -> 1.102  
#	          fs/dquot.c	1.44    -> 1.46   
#	           mm/slab.c	1.26    -> 1.27   
#	          fs/inode.c	1.67    -> 1.69   
#	include/linux/dcache.h	1.15    -> 1.16   
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/09/05	ed@oscar.et.ca	1.581
# free slab pages asap
# --------------------------------------------
# 02/09/07	ed@oscar.et.ca	1.584
# Here we assume one reclaimed page takes one seek to recreate.  We also
# assume a dentry or inode also takes a seek to rebuild.  With this in 
# mind we trim the cache by the same percentage we trim the lru.
# --------------------------------------------
#
diff -Nru a/fs/dcache.c b/fs/dcache.c
--- a/fs/dcache.c	Sat Sep  7 09:30:46 2002
+++ b/fs/dcache.c	Sat Sep  7 09:30:46 2002
@@ -573,19 +572,11 @@
 
 /*
  * This is called from kswapd when we think we need some
- * more memory, but aren't really sure how much. So we
- * carefully try to free a _bit_ of our dcache, but not
- * too much.
- *
- * Priority:
- *   1 - very urgent: shrink everything
- *  ...
- *   6 - base-level: try to shrink a bit.
+ * more memory. 
  */
-int shrink_dcache_memory(int priority, unsigned int gfp_mask)
+int shrink_dcache_memory(int ratio, unsigned int gfp_mask)
 {
-	int count = 0;
-
+	int entries = dentry_stat.nr_dentry / ratio + 1;
 	/*
 	 * Nasty deadlock avoidance.
 	 *
@@ -600,11 +591,8 @@
 	if (!(gfp_mask & __GFP_FS))
 		return 0;
 
-	count = dentry_stat.nr_unused / priority;
-
-	prune_dcache(count);
-	kmem_cache_shrink(dentry_cache);
-	return 0;
+	prune_dcache(entries);
+	return entries;
 }
 
 #define NAME_ALLOC_LEN(len)	((len+16) & ~15)
diff -Nru a/fs/dquot.c b/fs/dquot.c
--- a/fs/dquot.c	Sat Sep  7 09:30:46 2002
+++ b/fs/dquot.c	Sat Sep  7 09:30:46 2002
@@ -480,26 +480,17 @@
 
 /*
  * This is called from kswapd when we think we need some
- * more memory, but aren't really sure how much. So we
- * carefully try to free a _bit_ of our dqcache, but not
- * too much.
- *
- * Priority:
- *   1 - very urgent: shrink everything
- *   ...
- *   6 - base-level: try to shrink a bit.
+ * more memory
  */
 
-int shrink_dqcache_memory(int priority, unsigned int gfp_mask)
+int shrink_dqcache_memory(int ratio, unsigned int gfp_mask)
 {
-	int count = 0;
+	entries = dqstats.allocated_dquots / ratio + 1;
 
 	lock_kernel();
-	count = dqstats.free_dquots / priority;
-	prune_dqcache(count);
+	prune_dqcache(entries);
 	unlock_kernel();
-	kmem_cache_shrink(dquot_cachep);
-	return 0;
+	return entries;
 }
 
 /*
diff -Nru a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c	Sat Sep  7 09:30:46 2002
+++ b/fs/inode.c	Sat Sep  7 09:30:46 2002
@@ -415,19 +415,11 @@
 
 /*
  * This is called from kswapd when we think we need some
- * more memory, but aren't really sure how much. So we
- * carefully try to free a _bit_ of our icache, but not
- * too much.
- *
- * Priority:
- *   1 - very urgent: shrink everything
- *  ...
- *   6 - base-level: try to shrink a bit.
+ * more memory. 
  */
-int shrink_icache_memory(int priority, int gfp_mask)
+int shrink_icache_memory(int ratio, unsigned int gfp_mask)
 {
-	int count = 0;
-
+	int entries = inodes_stat.nr_inodes / ratio + 1;
 	/*
 	 * Nasty deadlock avoidance..
 	 *
@@ -438,12 +430,10 @@
 	if (!(gfp_mask & __GFP_FS))
 		return 0;
 
-	count = inodes_stat.nr_unused / priority;
-
-	prune_icache(count);
-	kmem_cache_shrink(inode_cachep);
-	return 0;
+	prune_icache(entries);
+	return entries;
 }
+EXPORT_SYMBOL(shrink_icache_memory);
 
 /*
  * Called with the inode lock held.
diff -Nru a/include/linux/dcache.h b/include/linux/dcache.h
--- a/include/linux/dcache.h	Sat Sep  7 09:30:46 2002
+++ b/include/linux/dcache.h	Sat Sep  7 09:30:46 2002
@@ -186,7 +186,7 @@
 extern void prune_dcache(int);
 
 /* icache memory management (defined in linux/fs/inode.c) */
-extern int shrink_icache_memory(int, int);
+extern int shrink_icache_memory(int, unsigned int);
 extern void prune_icache(int);
 
 /* quota cache memory management (defined in linux/fs/dquot.c) */
diff -Nru a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h	Sat Sep  7 09:30:46 2002
+++ b/include/linux/mm.h	Sat Sep  7 09:30:46 2002
@@ -498,6 +498,7 @@
 
 extern struct page * vmalloc_to_page(void *addr);
 extern unsigned long get_page_cache_size(void);
+extern unsigned int nr_used_zone_pages(void);
 
 #endif /* __KERNEL__ */
 
diff -Nru a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c	Sat Sep  7 09:30:46 2002
+++ b/mm/page_alloc.c	Sat Sep  7 09:30:46 2002
@@ -486,6 +486,19 @@
 	return sum;
 }
 
+unsigned int nr_used_zone_pages(void)
+{
+	pg_data_t *pgdat;
+	unsigned int pages = 0;
+
+	for_each_pgdat(pgdat) {
+		pages += pgdat->node_zones[ZONE_DMA].nr_active;
+		pages += pgdat->node_zones[ZONE_NORMAL].nr_inactive;
+	}
+
+	return pages;
+}
+
 static unsigned int nr_free_zone_pages(int offset)
 {
 	pg_data_t *pgdat;
diff -Nru a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c	Sat Sep  7 09:30:46 2002
+++ b/mm/slab.c	Sat Sep  7 09:30:46 2002
@@ -1500,7 +1500,11 @@
 		if (unlikely(!--slabp->inuse)) {
 			/* Was partial or full, now empty. */
 			list_del(&slabp->list);
-			list_add(&slabp->list, &cachep->slabs_free);
+/*			list_add(&slabp->list, &cachep->slabs_free); 		*/
+			if (unlikely(list_empty(&cachep->slabs_partial)))
+				list_add(&slabp->list, &cachep->slabs_partial);
+			else
+				kmem_slab_destroy(cachep, slabp);
 		} else if (unlikely(inuse == cachep->num)) {
 			/* Was full. */
 			list_del(&slabp->list);
@@ -1969,7 +1973,7 @@
 	}
 	list_for_each(q,&cachep->slabs_partial) {
 		slabp = list_entry(q, slab_t, list);
-		if (slabp->inuse == cachep->num || !slabp->inuse)
+		if (slabp->inuse == cachep->num)
 			BUG();
 		active_objs += slabp->inuse;
 		active_slabs++;
diff -Nru a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c	Sat Sep  7 09:30:46 2002
+++ b/mm/vmscan.c	Sat Sep  7 09:30:46 2002
@@ -464,11 +464,13 @@
 	unsigned int gfp_mask, int nr_pages)
 {
 	unsigned long ratio;
-	int max_scan;
+	int max_scan, nr_pages_in, pages, a, b;
 
-	/* This is bogus for ZONE_HIGHMEM? */
-	if (kmem_cache_reap(gfp_mask) >= nr_pages)
-  		return 0;
+	if (nr_pages <= 0)
+		return 0;
+
+	pages = nr_used_zone_pages();
+	nr_pages_in = nr_pages;
 
 	/*
 	 * Try to keep the active list 2/3 of the size of the cache.  And
@@ -483,7 +485,7 @@
 	ratio = (unsigned long)nr_pages * zone->nr_active /
 				((zone->nr_inactive | 1) * 2);
 	atomic_add(ratio+1, &zone->refill_counter);
-	if (atomic_read(&zone->refill_counter) > SWAP_CLUSTER_MAX) {
+	while (atomic_read(&zone->refill_counter) > SWAP_CLUSTER_MAX) {
 		atomic_sub(SWAP_CLUSTER_MAX, &zone->refill_counter);
 		refill_inactive_zone(zone, SWAP_CLUSTER_MAX);
 	}
@@ -492,18 +494,27 @@
 	nr_pages = shrink_cache(nr_pages, zone,
 				gfp_mask, priority, max_scan);
 
+	/*
+	 * Here we assume it costs one seek to replace a lru page and that
+	 * it also takes a seek to recreate a cache object.  With this in
+	 * mind we age equal percentages of the lru and ageable caches.
+	 * This should balance the seeks generated by these structures.
+	 */
+	if (likely(nr_pages_in > nr_pages)) {
+		ratio = pages / (nr_pages_in-nr_pages);
+		shrink_dcache_memory(ratio, gfp_mask);
+
+		/* After aging the dcache, age inodes too .. */
+		shrink_icache_memory(ratio, gfp_mask);
+#ifdef CONFIG_QUOTA
+		shrink_dqcache_memory(ratio, gfp_mask);
+#endif
+	}
+
 	if (nr_pages <= 0)
 		return 0;
 
 	wakeup_bdflush();
-
-	shrink_dcache_memory(priority, gfp_mask);
-
-	/* After shrinking the dcache, get rid of unused inodes too .. */
-	shrink_icache_memory(1, gfp_mask);
-#ifdef CONFIG_QUOTA
-	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
-#endif
 
 	return nr_pages;
 }

---------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] slabnow
  2002-09-07 14:06 [PATCH][RFC] slabnow Ed Tomlinson
@ 2002-09-08 20:45 ` Daniel Phillips
  2002-09-09  4:59   ` Andrew Morton
       [not found] ` <200209081142.02839.tomlins@cam.org>
  1 sibling, 1 reply; 14+ messages in thread
From: Daniel Phillips @ 2002-09-08 20:45 UTC (permalink / raw)
  To: Ed Tomlinson, linux-mm; +Cc: Andrew Morton, Rik van Riel

On Saturday 07 September 2002 16:06, Ed Tomlinson wrote:
> +	/*
> +	 * Here we assume it costs one seek to replace a lru page and that
> +	 * it also takes a seek to recreate a cache object.  With this in
> +	 * mind we age equal percentages of the lru and ageable caches.
> +	 * This should balance the seeks generated by these structures.
> +	 */

It's not a reliable assumption, for example, what is the relation between an 
skbuff and a disk seek?  Still, a better model need not be much different 
from what you've done - simply pick an arbitrary unit against which you 
estimate the average cost of replacing one unit of a given slab, and plug 
that a ratios against that into each slab header.  Make the default 
1*DISK_SEEK_COST and you will get the current behaviour, then tune the things 
that are obviously broken.

By the way, ratio multiplication is a *basic* and essential tool of any
control system worthy of the name, it's too bad Linux is so horribly
lacking in that department.  (Life without double-length intermediate
muldiv results is not worth living.)

-- 
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] slabnow
  2002-09-08 20:45 ` Daniel Phillips
@ 2002-09-09  4:59   ` Andrew Morton
  2002-09-09  5:14     ` Daniel Phillips
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2002-09-09  4:59 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Ed Tomlinson, linux-mm, Andrew Morton, Rik van Riel

Daniel Phillips wrote:
> 
> On Saturday 07 September 2002 16:06, Ed Tomlinson wrote:
> > +     /*
> > +      * Here we assume it costs one seek to replace a lru page and that
> > +      * it also takes a seek to recreate a cache object.  With this in
> > +      * mind we age equal percentages of the lru and ageable caches.
> > +      * This should balance the seeks generated by these structures.
> > +      */
> 
> It's not a reliable assumption, for example, what is the relation between an
> skbuff and a disk seek?

Of course.  But all the caches in question happen to be filesystem
ones.  I can live with that...
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] slabnow
  2002-09-09  4:59   ` Andrew Morton
@ 2002-09-09  5:14     ` Daniel Phillips
  0 siblings, 0 replies; 14+ messages in thread
From: Daniel Phillips @ 2002-09-09  5:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ed Tomlinson, linux-mm, Andrew Morton, Rik van Riel

On Monday 09 September 2002 06:59, Andrew Morton wrote:
> Daniel Phillips wrote:
> > 
> > On Saturday 07 September 2002 16:06, Ed Tomlinson wrote:
> > > +     /*
> > > +      * Here we assume it costs one seek to replace a lru page and that
> > > +      * it also takes a seek to recreate a cache object.  With this in
> > > +      * mind we age equal percentages of the lru and ageable caches.
> > > +      * This should balance the seeks generated by these structures.
> > > +      */
> > 
> > It's not a reliable assumption, for example, what is the relation between an
> > skbuff and a disk seek?
> 
> Of course.  But all the caches in question happen to be filesystem
> ones.  I can live with that...

Sure.  And we know what to do about it when it does get around to
breaking.

-- 
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

[parent not found: <200209081142.02839.tomlins@cam.org>]

* Re: [PATCH] slabasap-mm5_A2
       [not found] ` <200209081142.02839.tomlins@cam.org>
@ 2002-09-08 20:56   ` Andrew Morton
  2002-09-08 21:08     ` Andrew Morton
                       ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Andrew Morton @ 2002-09-08 20:56 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: linux-mm

Ed Tomlinson wrote:
> 
> Hi,
> 
> Here is a rewritten slablru - this time its not using the lru...  If changes long standing slab
> behavior.  Now slab.c releases pages as soon as possible.  This was done since we noticed
> that slablru was taking a long time to release the pages it freed - from other vm experiences
> this is not a good thing.

Right.  There remains the issue that we're ripping away constructed
objects from slabs which have constructors, as Stephen points out.

I doubt if that matters.  slab constructors just initialise stuff.
If the memory is in cache then the initialisation is negligible.
If the memory is not in cache then the initialisation will pull
it into cache, which is something which we needed to do anyway.  And
unless the slab's access pattern is extremely LIFO, chances are that
most allocations will come in from part-filled slab pages anyway.

And other such waffly words ;)  I'll do the global LIFO page hotlists
soonl; that'll fix it up.

 
> In this patch I have tried to make as few changes as possible.

Thanks.  I've shuffled the patching sequence (painful), and diddled
a few things.  We actually do have the "number of scanned pages"
in there, so we can use that.  I agree that the ratio should be 
nr_scanned/total rather than nr_reclaimed/total.   This way, if
nr_reclaimed < nr_scanned (page reclaim is in trouble) then we
put more pressure on slabs.

>   With this in mind I am using
> the percentage of the active+inactive pages reclaimed to recover the same percentage of the
> pruneable caches.  In slablru the affect was to age the pruneable caches by percentage of
> the active+inactive pages scanned - this could be done but required more code so I went
> used pages reclaimed.  The same choise was made about accounting of pages freed by
> the shrink_<something>_memory calls.
> 
> There is also a question as to if we should only use the ZONE_DMA and ZONE_NORMAL to
> drive the cache shrinking.  Talk with Rik on irc convinced me to go with the choise that
> required less code, so we use all zones.

OK.  We could do with a `gimme_the_direct_addressed_classzone' utility
anyway.  It is currently open-coded in fs/buffer.c:free_more_memory().
We can just pull that out of there and use memclass() on it for this.

> To apply the patch to mm5 use the follow procedure:
> copy the two slablru patch and discard all but the vmscan changes.
> replace the slablru patch with the just created patches that just hit vmscan
> after applying the mm5 patches apply the following patch to adjust vmscan and add slabasap.
> 
> This passes the normal group of tests I apply to my patches (mm4 stalled force watchdog to
> reboot).   The varient for bk linus also survives these tests.
> 
> I have seen some unexpected messages from the kde artsd daemon when I left kde running all
> night.  This may imply we want to have slab be a little less aggressive freeing high order slabs.
> Would like to see if other have problems though - it could just be debian and kde 3.0.3 (which
> is not offical yet).

hm.
 
> Please let me know if you want any changes or the addition of any of the options mentioned.
> 

In here:

        int entries = inodes_stat.nr_inodes / ratio + 1;

what is the "+ 1" for?  If it is to avoid divide-by-zero then
it needs parentheses.  I added the "+ 1" to the call site to cover that.

>From a quick test, the shrinking rate seems quite reasonable to
me.  mem=512m, with twenty megs of ext2 inodes in core, a `dd'
of one gigabyte (twice the size of memory) steadily pushed the
ext2 inodes down to 2.5 megs (although total memory was still
9 megs - internal fragmentation of the slab).

A second 1G dd pushed it down to 1M/3M.

A third 1G dd pushed it down to .25M/1.25M

Seems OK.

A few things we should do later:

- We're calling prune_icache with a teeny number of inodes, many times.
  Would be better to batch that up a bit.

- It would be nice to bring back the pruner callbacks.  The post-facto
  hook registration thing will be fine.  Hit me with a stick for making
  you change the kmem_cache_create() prototype.  Sorry about that.

If we have the pruner callbacks then vmscan can just do:

	kmem_shrink_stuff(ratio);

and then kmem_shrink_stuff() can do:

	cachep->nr_to_prune += cacheb->inuse / ratio;
	if (cachep->nr_to_prune > cachep->prune_batch) {
		int prune = cachep->nr_to_prune;

		cachep->nr_to_prune = 0;
		(*cachep->pruner)(nr_to_prune);
	}

But let's get the current code settled in before doing these
refinements.

There are some usage patterns in which the dentry/inode aging
might be going wrong.  Try, with mem=512m

	cp -a linux a
	cp -a linux b
	cp -a linux c

etc.

Possibly the inode/dentry cache is just being FIFO here and is doing
exactly the wrong thing.  But the dcache referenced-bit logic should
cause the inodes in `linux' to be pinned with this test, so that 
should be OK.  Dunno.

The above test will be hurt a bit by the aggressively lowered (10%)
background writeback threshold - more reads competing with writes.
Maybe I should not kick off background writeback until the dirty
threshold reaches 30% if there are reads queued against the device.
That's easy enough to do.

drop-behind should help here too.

 fs/dcache.c            |   21 +++++----------------
 fs/dquot.c             |   19 +++++--------------
 fs/inode.c             |   24 +++++++-----------------
 include/linux/dcache.h |    2 +-
 include/linux/mm.h     |    1 +
 mm/page_alloc.c        |   11 +++++++++++
 mm/slab.c              |    8 ++++++--
 mm/vmscan.c            |   28 +++++++++++++++++++---------
 8 files changed, 55 insertions(+), 59 deletions(-)

--- 2.5.33/fs/dcache.c~slabasap	Sun Sep  8 12:42:41 2002
+++ 2.5.33-akpm/fs/dcache.c	Sun Sep  8 12:42:43 2002
@@ -573,19 +573,11 @@ void shrink_dcache_anon(struct list_head
 
 /*
  * This is called from kswapd when we think we need some
- * more memory, but aren't really sure how much. So we
- * carefully try to free a _bit_ of our dcache, but not
- * too much.
- *
- * Priority:
- *   1 - very urgent: shrink everything
- *  ...
- *   6 - base-level: try to shrink a bit.
+ * more memory. 
  */
-int shrink_dcache_memory(int priority, unsigned int gfp_mask)
+int shrink_dcache_memory(int ratio, unsigned int gfp_mask)
 {
-	int count = 0;
-
+	int entries = dentry_stat.nr_dentry / ratio + 1;
 	/*
 	 * Nasty deadlock avoidance.
 	 *
@@ -600,11 +592,8 @@ int shrink_dcache_memory(int priority, u
 	if (!(gfp_mask & __GFP_FS))
 		return 0;
 
-	count = dentry_stat.nr_unused / priority;
-
-	prune_dcache(count);
-	kmem_cache_shrink(dentry_cache);
-	return 0;
+	prune_dcache(entries);
+	return entries;
 }
 
 #define NAME_ALLOC_LEN(len)	((len+16) & ~15)
--- 2.5.33/fs/dquot.c~slabasap	Sun Sep  8 12:42:41 2002
+++ 2.5.33-akpm/fs/dquot.c	Sun Sep  8 12:42:43 2002
@@ -480,26 +480,17 @@ static void prune_dqcache(int count)
 
 /*
  * This is called from kswapd when we think we need some
- * more memory, but aren't really sure how much. So we
- * carefully try to free a _bit_ of our dqcache, but not
- * too much.
- *
- * Priority:
- *   1 - very urgent: shrink everything
- *   ...
- *   6 - base-level: try to shrink a bit.
+ * more memory
  */
 
-int shrink_dqcache_memory(int priority, unsigned int gfp_mask)
+int shrink_dqcache_memory(int ratio, unsigned int gfp_mask)
 {
-	int count = 0;
+	entries = dqstats.allocated_dquots / ratio + 1;
 
 	lock_kernel();
-	count = dqstats.free_dquots / priority;
-	prune_dqcache(count);
+	prune_dqcache(entries);
 	unlock_kernel();
-	kmem_cache_shrink(dquot_cachep);
-	return 0;
+	return entries;
 }
 
 /*
--- 2.5.33/fs/inode.c~slabasap	Sun Sep  8 12:42:41 2002
+++ 2.5.33-akpm/fs/inode.c	Sun Sep  8 13:10:15 2002
@@ -409,7 +409,7 @@ void prune_icache(int goal)
 	struct list_head *entry, *freeable = &list;
 	int count;
 	struct inode * inode;
-
+printk("prune_icache(%d/%d)\n", goal, inodes_stat.nr_unused);
 	spin_lock(&inode_lock);
 
 	count = 0;
@@ -442,19 +442,11 @@ void prune_icache(int goal)
 
 /*
  * This is called from kswapd when we think we need some
- * more memory, but aren't really sure how much. So we
- * carefully try to free a _bit_ of our icache, but not
- * too much.
- *
- * Priority:
- *   1 - very urgent: shrink everything
- *  ...
- *   6 - base-level: try to shrink a bit.
+ * more memory. 
  */
-int shrink_icache_memory(int priority, int gfp_mask)
+int shrink_icache_memory(int ratio, unsigned int gfp_mask)
 {
-	int count = 0;
-
+	int entries = inodes_stat.nr_inodes / ratio + 1;
 	/*
 	 * Nasty deadlock avoidance..
 	 *
@@ -465,12 +457,10 @@ int shrink_icache_memory(int priority, i
 	if (!(gfp_mask & __GFP_FS))
 		return 0;
 
-	count = inodes_stat.nr_unused / priority;
-
-	prune_icache(count);
-	kmem_cache_shrink(inode_cachep);
-	return 0;
+	prune_icache(entries);
+	return entries;
 }
+EXPORT_SYMBOL(shrink_icache_memory);
 
 /*
  * Called with the inode lock held.
--- 2.5.33/include/linux/dcache.h~slabasap	Sun Sep  8 12:42:41 2002
+++ 2.5.33-akpm/include/linux/dcache.h	Sun Sep  8 12:42:43 2002
@@ -186,7 +186,7 @@ extern int shrink_dcache_memory(int, uns
 extern void prune_dcache(int);
 
 /* icache memory management (defined in linux/fs/inode.c) */
-extern int shrink_icache_memory(int, int);
+extern int shrink_icache_memory(int, unsigned int);
 extern void prune_icache(int);
 
 /* quota cache memory management (defined in linux/fs/dquot.c) */
--- 2.5.33/include/linux/mm.h~slabasap	Sun Sep  8 12:42:41 2002
+++ 2.5.33-akpm/include/linux/mm.h	Sun Sep  8 12:42:43 2002
@@ -509,6 +509,7 @@ extern struct vm_area_struct *find_exten
 
 extern struct page * vmalloc_to_page(void *addr);
 extern unsigned long get_page_cache_size(void);
+extern unsigned int nr_used_zone_pages(void);
 
 #endif /* __KERNEL__ */
 
--- 2.5.33/mm/page_alloc.c~slabasap	Sun Sep  8 12:42:41 2002
+++ 2.5.33-akpm/mm/page_alloc.c	Sun Sep  8 12:42:43 2002
@@ -487,6 +487,17 @@ unsigned int nr_free_pages(void)
 	return sum;
 }
 
+unsigned int nr_used_zone_pages(void)
+{
+	unsigned int pages = 0;
+	struct zone *zone;
+
+	for_each_zone(zone)
+		pages += zone->nr_active + zone->nr_inactive;
+
+	return pages;
+}
+
 static unsigned int nr_free_zone_pages(int offset)
 {
 	pg_data_t *pgdat;
--- 2.5.33/mm/slab.c~slabasap	Sun Sep  8 12:42:41 2002
+++ 2.5.33-akpm/mm/slab.c	Sun Sep  8 12:42:43 2002
@@ -1502,7 +1502,11 @@ static inline void kmem_cache_free_one(k
 		if (unlikely(!--slabp->inuse)) {
 			/* Was partial or full, now empty. */
 			list_del(&slabp->list);
-			list_add(&slabp->list, &cachep->slabs_free);
+/*			list_add(&slabp->list, &cachep->slabs_free); 		*/
+			if (unlikely(list_empty(&cachep->slabs_partial)))
+				list_add(&slabp->list, &cachep->slabs_partial);
+			else
+				kmem_slab_destroy(cachep, slabp);
 		} else if (unlikely(inuse == cachep->num)) {
 			/* Was full. */
 			list_del(&slabp->list);
@@ -1971,7 +1975,7 @@ static int s_show(struct seq_file *m, vo
 	}
 	list_for_each(q,&cachep->slabs_partial) {
 		slabp = list_entry(q, slab_t, list);
-		if (slabp->inuse == cachep->num || !slabp->inuse)
+		if (slabp->inuse == cachep->num)
 			BUG();
 		active_objs += slabp->inuse;
 		active_slabs++;
--- 2.5.33/mm/vmscan.c~slabasap	Sun Sep  8 12:42:41 2002
+++ 2.5.33-akpm/mm/vmscan.c	Sun Sep  8 13:10:24 2002
@@ -71,6 +71,10 @@
 #define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
 #endif
 
+#ifndef CONFIG_QUOTA
+#define shrink_dqcache_memory(ratio, gfp_mask) do { } while (0)
+#endif
+
 /* Must be called with page's pte_chain_lock held. */
 static inline int page_mapping_inuse(struct page * page)
 {
@@ -566,10 +570,6 @@ shrink_zone(struct zone *zone, int max_s
 {
 	unsigned long ratio;
 
-	/* This is bogus for ZONE_HIGHMEM? */
-	if (kmem_cache_reap(gfp_mask) >= nr_pages)
-  		return 0;
-
 	/*
 	 * Try to keep the active list 2/3 of the size of the cache.  And
 	 * make sure that refill_inactive is given a decent number of pages.
@@ -597,6 +597,8 @@ shrink_caches(struct zone *classzone, in
 {
 	struct zone *first_classzone;
 	struct zone *zone;
+	int ratio;
+	int pages = nr_used_zone_pages();
 
 	first_classzone = classzone->zone_pgdat->node_zones;
 	for (zone = classzone; zone >= first_classzone; zone--) {
@@ -626,11 +628,19 @@ shrink_caches(struct zone *classzone, in
 		*total_scanned += max_scan;
 	}
 
-	shrink_dcache_memory(priority, gfp_mask);
-	shrink_icache_memory(1, gfp_mask);
-#ifdef CONFIG_QUOTA
-	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
-#endif
+	/*
+	 * Here we assume it costs one seek to replace a lru page and that
+	 * it also takes a seek to recreate a cache object.  With this in
+	 * mind we age equal percentages of the lru and ageable caches.
+	 * This should balance the seeks generated by these structures.
+	 *
+	 * NOTE: for now I do this for all zones.  If we find this is too
+	 * aggressive on large boxes we may want to exculude ZONE_HIGHMEM
+	 */
+	ratio = (pages / *total_scanned) + 1;
+	shrink_dcache_memory(ratio, gfp_mask);
+	shrink_icache_memory(ratio, gfp_mask);
+	shrink_dqcache_memory(ratio, gfp_mask);
 	return nr_pages;
 }
 

.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] slabasap-mm5_A2
  2002-09-08 20:56   ` [PATCH] slabasap-mm5_A2 Andrew Morton
@ 2002-09-08 21:08     ` Andrew Morton
  2002-09-08 21:14     ` Ed Tomlinson
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2002-09-08 21:08 UTC (permalink / raw)
  To: Ed Tomlinson, linux-mm

Andrew Morton wrote:
> 
> ...
> If we have the pruner callbacks then vmscan can just do:
> 
>         kmem_shrink_stuff(ratio);
> 
> and then kmem_shrink_stuff() can do:
> 
>         cachep->nr_to_prune += cacheb->inuse / ratio;
>         if (cachep->nr_to_prune > cachep->prune_batch) {
>                 int prune = cachep->nr_to_prune;
> 
>                 cachep->nr_to_prune = 0;
>                 (*cachep->pruner)(nr_to_prune);
>         }

OK, I get it.  `ratio' here can easily be 100,000, which would
result in zero objects to prune all the time.  So it's

	cachep->nr_to_prune += cacheb->inuse / ratio + 1;

yes?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] slabasap-mm5_A2
  2002-09-08 20:56   ` [PATCH] slabasap-mm5_A2 Andrew Morton
  2002-09-08 21:08     ` Andrew Morton
@ 2002-09-08 21:14     ` Ed Tomlinson
  2002-09-08 21:47       ` Andrew Morton
  2002-09-09  9:29     ` Stephen C. Tweedie
  2002-09-09 21:33     ` Ed Tomlinson
  3 siblings, 1 reply; 14+ messages in thread
From: Ed Tomlinson @ 2002-09-08 21:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm

On September 8, 2002 04:56 pm, Andrew Morton wrote:
> Ed Tomlinson wrote:
> > Hi,
> >
> > Here is a rewritten slablru - this time its not using the lru...  If
> > changes long standing slab behavior.  Now slab.c releases pages as soon
> > as possible.  This was done since we noticed that slablru was taking a
> > long time to release the pages it freed - from other vm experiences this
> > is not a good thing.
>
> Right.  There remains the issue that we're ripping away constructed
> objects from slabs which have constructors, as Stephen points out.

I have a small optimization coded in slab.  If there are not any free
slab objects I do not free the page.   If we have problems with high
order slabs we can change this to be if we do not have <n> objects
do not free it.

> I doubt if that matters.  slab constructors just initialise stuff.
> If the memory is in cache then the initialisation is negligible.
> If the memory is not in cache then the initialisation will pull
> it into cache, which is something which we needed to do anyway.  And
> unless the slab's access pattern is extremely LIFO, chances are that
> most allocations will come in from part-filled slab pages anyway.
>
> And other such waffly words ;)  I'll do the global LIFO page hotlists
> soonl; that'll fix it up.
>
> > In this patch I have tried to make as few changes as possible.
>
> Thanks.  I've shuffled the patching sequence (painful), and diddled
> a few things.  We actually do have the "number of scanned pages"
> in there, so we can use that.  I agree that the ratio should be
> nr_scanned/total rather than nr_reclaimed/total.   This way, if
> nr_reclaimed < nr_scanned (page reclaim is in trouble) then we
> put more pressure on slabs.

OK will change this.  This also means the changes to prune functions
made for slablru will come back - they convert these fuctions so they
age <n> object rather than purge <n>.

> >   With this in mind I am using
> > the percentage of the active+inactive pages reclaimed to recover the same
> > percentage of the pruneable caches.  In slablru the affect was to age the
> > pruneable caches by percentage of the active+inactive pages scanned -
> > this could be done but required more code so I went used pages reclaimed.
> >  The same choise was made about accounting of pages freed by the
> > shrink_<something>_memory calls.
> >
> > There is also a question as to if we should only use the ZONE_DMA and
> > ZONE_NORMAL to drive the cache shrinking.  Talk with Rik on irc convinced
> > me to go with the choise that required less code, so we use all zones.
>
> OK.  We could do with a `gimme_the_direct_addressed_classzone' utility
> anyway.  It is currently open-coded in fs/buffer.c:free_more_memory().
> We can just pull that out of there and use memclass() on it for this.

Ah thanks.  Was wondering the best way to do this.  Will read the code.

> > To apply the patch to mm5 use the follow procedure:
> > copy the two slablru patch and discard all but the vmscan changes.
> > replace the slablru patch with the just created patches that just hit
> > vmscan after applying the mm5 patches apply the following patch to adjust
> > vmscan and add slabasap.
> >
> > This passes the normal group of tests I apply to my patches (mm4 stalled
> > force watchdog to reboot).   The varient for bk linus also survives these
> > tests.
> >
> > I have seen some unexpected messages from the kde artsd daemon when I
> > left kde running all night.  This may imply we want to have slab be a
> > little less aggressive freeing high order slabs. Would like to see if
> > other have problems though - it could just be debian and kde 3.0.3 (which
> > is not offical yet).
>
> hm.

Yeah.  There no messages any logs about high order allocations failing...

> > Please let me know if you want any changes or the addition of any of the
> > options mentioned.
>
> In here:
>
>         int entries = inodes_stat.nr_inodes / ratio + 1;
>
> what is the "+ 1" for?  If it is to avoid divide-by-zero then
> it needs parentheses.  I added the "+ 1" to the call site to cover that.

It was so we would alway do some work.  ratio should never end up
as zero.  Its based on (used pages in all zones) / (reclaimed in one zone)
so we are safe.

> From a quick test, the shrinking rate seems quite reasonable to
> me.  mem=512m, with twenty megs of ext2 inodes in core, a `dd'
> of one gigabyte (twice the size of memory) steadily pushed the
> ext2 inodes down to 2.5 megs (although total memory was still
> 9 megs - internal fragmentation of the slab).
>
> A second 1G dd pushed it down to 1M/3M.
>
> A third 1G dd pushed it down to .25M/1.25M
>
> Seems OK.
>
> A few things we should do later:
>
> - We're calling prune_icache with a teeny number of inodes, many times.
>   Would be better to batch that up a bit.

Why not move the prunes to try_to_free_pages?   The should help a little to get 
bigger batches of pages as will using the number of scanned pages.  

> - It would be nice to bring back the pruner callbacks.  The post-facto
>   hook registration thing will be fine.  Hit me with a stick for making
>   you change the kmem_cache_create() prototype.  Sorry about that.

I still have to code from slablru so this is not as painfull as the first time.

> If we have the pruner callbacks then vmscan can just do:
>
> 	kmem_shrink_stuff(ratio);
>
> and then kmem_shrink_stuff() can do:
>
> 	cachep->nr_to_prune += cacheb->inuse / ratio;
> 	if (cachep->nr_to_prune > cachep->prune_batch) {
> 		int prune = cachep->nr_to_prune;
>
> 		cachep->nr_to_prune = 0;
> 		(*cachep->pruner)(nr_to_prune);
> 	}

The callbacks also make it easier to setup ageable caches and quickly
integrate them into the vm.

> But let's get the current code settled in before doing these
> refinements.

I can get the aging changes to you real fast if you want them.  I initially
coded it this way then pull the changes to reduce the code...  see below

The other thing we want to be careful with is to make sure the lack of
free page accounting is detected by oom - we definitly do not want to
oom when slab has freed memory by try_to_free_pages does not
realize it..

> There are some usage patterns in which the dentry/inode aging
> might be going wrong.  Try, with mem=512m
>
> 	cp -a linux a
> 	cp -a linux b
> 	cp -a linux c
>
> etc.
>
> Possibly the inode/dentry cache is just being FIFO here and is doing
> exactly the wrong thing.  But the dcache referenced-bit logic should
> cause the inodes in `linux' to be pinned with this test, so that
> should be OK.  Dunno.
>
> The above test will be hurt a bit by the aggressively lowered (10%)
> background writeback threshold - more reads competing with writes.
> Maybe I should not kick off background writeback until the dirty
> threshold reaches 30% if there are reads queued against the device.
> That's easy enough to do.
>
> drop-behind should help here too.

This converts the prunes in inode and dcache to age <n> entries rather
than purge them.  Think this is the more correct behavior.  Code is from
slablru.

diff -Nru a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c	Wed Aug 28 17:47:15 2002
+++ b/fs/inode.c	Wed Aug 28 17:47:15 2002
@@ -388,10 +388,11 @@
 
 	count = 0;
 	entry = inode_unused.prev;
-	while (entry != &inode_unused)
-	{
+	for(; goal; goal--) {
 		struct list_head *tmp = entry;
 
+		if (entry == &inode_unused)
+			break;
 		entry = entry->prev;
 		inode = INODE(tmp);
 		if (inode->i_state & (I_FREEING|I_CLEAR|I_LOCK))
@@ -405,8 +406,6 @@
 		list_add(tmp, freeable);
 		inode->i_state |= I_FREEING;
 		count++;
-		if (!--goal)
-			break;
 	}
 	inodes_stat.nr_unused -= count;
 	spin_unlock(&inode_lock);

diff -Nru a/fs/dcache.c b/fs/dcache.c
--- a/fs/dcache.c	Wed Aug 28 17:47:13 2002
+++ b/fs/dcache.c	Wed Aug 28 17:47:13 2002
@@ -329,12 +328,11 @@
 void prune_dcache(int count)
 {
 	spin_lock(&dcache_lock);
-	for (;;) {
+	for (; count ; count--) {
 		struct dentry *dentry;
 		struct list_head *tmp;
 
 		tmp = dentry_unused.prev;
-
 		if (tmp == &dentry_unused)
 			break;
 		list_del_init(tmp);
@@ -349,12 +347,8 @@
 		dentry_stat.nr_unused--;
 
 		/* Unused dentry with a count? */
-		if (atomic_read(&dentry->d_count))
-			BUG();
-
+		BUG_ON(atomic_read(&dentry->d_count));
 		prune_one_dentry(dentry);
-		if (!--count)
-			break;
 	}
 	spin_unlock(&dcache_lock);
 }



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] slabasap-mm5_A2
  2002-09-08 21:14     ` Ed Tomlinson
@ 2002-09-08 21:47       ` Andrew Morton
  2002-09-08 21:48         ` Ed Tomlinson
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2002-09-08 21:47 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: linux-mm

Ed Tomlinson wrote:
> 
> On September 8, 2002 04:56 pm, Andrew Morton wrote:
> > Ed Tomlinson wrote:
> > > Hi,
> > >
> > > Here is a rewritten slablru - this time its not using the lru...  If
> > > changes long standing slab behavior.  Now slab.c releases pages as soon
> > > as possible.  This was done since we noticed that slablru was taking a
> > > long time to release the pages it freed - from other vm experiences this
> > > is not a good thing.
> >
> > Right.  There remains the issue that we're ripping away constructed
> > objects from slabs which have constructors, as Stephen points out.
> 
> I have a small optimization coded in slab.  If there are not any free
> slab objects I do not free the page.   If we have problems with high
> order slabs we can change this to be if we do not have <n> objects
> do not free it.

OK.

> > I doubt if that matters.  slab constructors just initialise stuff.
> > If the memory is in cache then the initialisation is negligible.
> > If the memory is not in cache then the initialisation will pull
> > it into cache, which is something which we needed to do anyway.  And
> > unless the slab's access pattern is extremely LIFO, chances are that
> > most allocations will come in from part-filled slab pages anyway.
> >
> > And other such waffly words ;)  I'll do the global LIFO page hotlists
> > soonl; that'll fix it up.
> >
> > > In this patch I have tried to make as few changes as possible.
> >
> > Thanks.  I've shuffled the patching sequence (painful), and diddled
> > a few things.  We actually do have the "number of scanned pages"
> > in there, so we can use that.  I agree that the ratio should be
> > nr_scanned/total rather than nr_reclaimed/total.   This way, if
> > nr_reclaimed < nr_scanned (page reclaim is in trouble) then we
> > put more pressure on slabs.
> 
> OK will change this.  This also means the changes to prune functions
> made for slablru will come back - they convert these fuctions so they
> age <n> object rather than purge <n>.

That would make the slab pruning less aggressive than the code I'm
testing now.  I'm not sure it needs that change.  Not sure...
 
> > >   With this in mind I am using
> > > the percentage of the active+inactive pages reclaimed to recover the same
> > > percentage of the pruneable caches.  In slablru the affect was to age the
> > > pruneable caches by percentage of the active+inactive pages scanned -
> > > this could be done but required more code so I went used pages reclaimed.
> > >  The same choise was made about accounting of pages freed by the
> > > shrink_<something>_memory calls.
> > >
> > > There is also a question as to if we should only use the ZONE_DMA and
> > > ZONE_NORMAL to drive the cache shrinking.  Talk with Rik on irc convinced
> > > me to go with the choise that required less code, so we use all zones.
> >
> > OK.  We could do with a `gimme_the_direct_addressed_classzone' utility
> > anyway.  It is currently open-coded in fs/buffer.c:free_more_memory().
> > We can just pull that out of there and use memclass() on it for this.
> 
> Ah thanks.  Was wondering the best way to do this.  Will read the code.

Then again, shrinking slab harder for big highmem machines is good ;)
 
> ...
> > From a quick test, the shrinking rate seems quite reasonable to
> > me.  mem=512m, with twenty megs of ext2 inodes in core, a `dd'
> > of one gigabyte (twice the size of memory) steadily pushed the
> > ext2 inodes down to 2.5 megs (although total memory was still
> > 9 megs - internal fragmentation of the slab).
> >
> > A second 1G dd pushed it down to 1M/3M.
> >
> > A third 1G dd pushed it down to .25M/1.25M
> >
> > Seems OK.
> >
> > A few things we should do later:
> >
> > - We're calling prune_icache with a teeny number of inodes, many times.
> >   Would be better to batch that up a bit.
> 
> Why not move the prunes to try_to_free_pages?   The should help a little to get
> bigger batches of pages as will using the number of scanned pages.

But the prunes are miles too small at present.  We go into try_to_free_pages()
and reclaim 32 pages.  And we also call into prune_cache() and free about 0.3
pages.  It's out of whack.  I'd suggest not calling out to the pruner until
we want at least several pages' worth of objects.
 
> ...
> > But let's get the current code settled in before doing these
> > refinements.
> 
> I can get the aging changes to you real fast if you want them.  I initially
> coded it this way then pull the changes to reduce the code...  see below

No rush.
 
> The other thing we want to be careful with is to make sure the lack of
> free page accounting is detected by oom - we definitly do not want to
> oom when slab has freed memory by try_to_free_pages does not
> realize it..

How much memory are we talking about here?  Not much I think?
 
> > There are some usage patterns in which the dentry/inode aging
> > might be going wrong.  Try, with mem=512m
> >
> >       cp -a linux a
> >       cp -a linux b
> >       cp -a linux c
> >
> > etc.
> >
> > Possibly the inode/dentry cache is just being FIFO here and is doing
> > exactly the wrong thing.  But the dcache referenced-bit logic should
> > cause the inodes in `linux' to be pinned with this test, so that
> > should be OK.  Dunno.
> >
> > The above test will be hurt a bit by the aggressively lowered (10%)
> > background writeback threshold - more reads competing with writes.
> > Maybe I should not kick off background writeback until the dirty
> > threshold reaches 30% if there are reads queued against the device.
> > That's easy enough to do.
> >
> > drop-behind should help here too.
> 
> This converts the prunes in inode and dcache to age <n> entries rather
> than purge them.  Think this is the more correct behavior.  Code is from
> slablru.

Makes sense (I think).
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] slabasap-mm5_A2
  2002-09-08 21:47       ` Andrew Morton
@ 2002-09-08 21:48         ` Ed Tomlinson
  2002-09-08 22:46           ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Ed Tomlinson @ 2002-09-08 21:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm

On September 8, 2002 05:47 pm, Andrew Morton wrote:
> Ed Tomlinson wrote:

This was getting a little massive so I trimmed.

> > I have a small optimization coded in slab.  If there are not any free
> > slab objects I do not free the page.   If we have problems with high
> > order slabs we can change this to be if we do not have <n> objects
> > do not free it.
>
> OK.

> > OK will change this.  This also means the changes to prune functions
> > made for slablru will come back - they convert these fuctions so they
> > age <n> object rather than purge <n>.
>
> That would make the slab pruning less aggressive than the code I'm
> testing now.  I'm not sure it needs that change.  Not sure...

Well without this change slab sometimes really get hurt.  It took me a while
to figure out what was happening in when I coded slablru.  In any case you
do have the code to fix it.

> > Ah thanks.  Was wondering the best way to do this.  Will read the code.
>
> Then again, shrinking slab harder for big highmem machines is good ;)

That was Rik's comment too...  Just figured it best to mention the options.

> But the prunes are miles too small at present.  We go into
> try_to_free_pages() and reclaim 32 pages.  And we also call into
> prune_cache() and free about 0.3 pages.  It's out of whack.  I'd suggest
> not calling out to the pruner until we want at least several pages' worth
> of objects.

Agreed.  I had not quite digested your last comments when I wrote this.  
Once we are happy I will readd the callbacks (using a second call to set
the callback - btw I have some nice oak hiking sticks here...) and fix this 
as you sugested.

> > The other thing we want to be careful with is to make sure the lack of
> > free page accounting is detected by oom - we definitly do not want to
> > oom when slab has freed memory by try_to_free_pages does not
> > realize it..
>
> How much memory are we talking about here?  Not much I think?

Usually not much.  I do know that when Rik added my slab accounting to rmap
the number of oom reports dropped.  We just need to be aware there is a 
hole and there might be a small problem.

> > This converts the prunes in inode and dcache to age <n> entries rather
> > than purge them.  Think this is the more correct behavior.  Code is from
> > slablru.
>
> Makes sense (I think).

As I mentioned above I needed this to make slablru stable...  Might be since you
now limit the number of pages scanned to 2*nr_pages we can get away without
this - not at all sure though.  Going back the basics.  Without this are we not 
devaluating seeks required to rebuild slab objects vs lru pages?

Ed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] slabasap-mm5_A2
  2002-09-08 21:48         ` Ed Tomlinson
@ 2002-09-08 22:46           ` Andrew Morton
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2002-09-08 22:46 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: linux-mm

Ed Tomlinson wrote:
> 
> ...
> > That would make the slab pruning less aggressive than the code I'm
> > testing now.  I'm not sure it needs that change.  Not sure...
> 
> Well without this change slab sometimes really get hurt.  It took me a while
> to figure out what was happening in when I coded slablru.  In any case you
> do have the code to fix it.

Yup, let's run with that.
 
> > > Ah thanks.  Was wondering the best way to do this.  Will read the code.
> >
> > Then again, shrinking slab harder for big highmem machines is good ;)
> 
> That was Rik's comment too...  Just figured it best to mention the options.
> 
> > But the prunes are miles too small at present.  We go into
> > try_to_free_pages() and reclaim 32 pages.  And we also call into
> > prune_cache() and free about 0.3 pages.  It's out of whack.  I'd suggest
> > not calling out to the pruner until we want at least several pages' worth
> > of objects.
> 
> Agreed.  I had not quite digested your last comments when I wrote this.
> Once we are happy I will readd the callbacks (using a second call to set
> the callback - btw I have some nice oak hiking sticks here...) and fix this
> as you sugested.

Thanks.  Shrinking seems to work well now.  Plus, if we need, we have
a single, nice linear knob with whichto twiddle the aging: just scale
the ratio up and down.
 
> > > The other thing we want to be careful with is to make sure the lack of
> > > free page accounting is detected by oom - we definitly do not want to
> > > oom when slab has freed memory by try_to_free_pages does not
> > > realize it..
> >
> > How much memory are we talking about here?  Not much I think?
> 
> Usually not much.  I do know that when Rik added my slab accounting to rmap
> the number of oom reports dropped.  We just need to be aware there is a
> hole and there might be a small problem.
> 
> > > This converts the prunes in inode and dcache to age <n> entries rather
> > > than purge them.  Think this is the more correct behavior.  Code is from
> > > slablru.
> >
> > Makes sense (I think).
> 
> As I mentioned above I needed this to make slablru stable...  Might be since you
> now limit the number of pages scanned to 2*nr_pages we can get away without
> this - not at all sure though.  Going back the basics.  Without this are we not
> devaluating seeks required to rebuild slab objects vs lru pages?

Yes, we are.  It's a relatively small deal anyway.  The success rate in
reclaiming pages coming off the tail of the LRU is currently in the 50%
to 90% range, depending on what's going on.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] slabasap-mm5_A2
  2002-09-08 20:56   ` [PATCH] slabasap-mm5_A2 Andrew Morton
  2002-09-08 21:08     ` Andrew Morton
  2002-09-08 21:14     ` Ed Tomlinson
@ 2002-09-09  9:29     ` Stephen C. Tweedie
  2002-09-09 21:33     ` Ed Tomlinson
  3 siblings, 0 replies; 14+ messages in thread
From: Stephen C. Tweedie @ 2002-09-09  9:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ed Tomlinson, linux-mm

Hi,

On Sun, Sep 08, 2002 at 01:56:26PM -0700, Andrew Morton wrote:

> Right.  There remains the issue that we're ripping away constructed
> objects from slabs which have constructors, as Stephen points out.
> 
> I doubt if that matters.  slab constructors just initialise stuff.
> If the memory is in cache then the initialisation is negligible.
> If the memory is not in cache then the initialisation will pull
> it into cache, which is something which we needed to do anyway.  And
> unless the slab's access pattern is extremely LIFO, chances are that
> most allocations will come in from part-filled slab pages anyway.

I'm not sure this was right back when cache lines were 32 or 64 bytes:
the constructor stuff really could have helped to avoid pulling the
whole object into cache, especially for largish data structures like
buffer_heads where initialisation often only touches a few header
cache lines, and the rest is only needed once we submit it for IO.
(Of course, the bh lifespan was never sufficiently well examined for
anyone to actuall code that: to many places left fields in undefined
states so we all just assumed that bhes would be zeroed on allocation
all the time.)

But now that cache lines are typically at least 128 bytes on modern
CPUs, the gain from constructors is much less obvious.  There's so
much false aliasing in the cache that we'll probably need the whole
object in cache on allocation most of the time.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] slabasap-mm5_A2
  2002-09-08 20:56   ` [PATCH] slabasap-mm5_A2 Andrew Morton
                       ` (2 preceding siblings ...)
  2002-09-09  9:29     ` Stephen C. Tweedie
@ 2002-09-09 21:33     ` Ed Tomlinson
  2002-09-09 22:07       ` Andrew Morton
  3 siblings, 1 reply; 14+ messages in thread
From: Ed Tomlinson @ 2002-09-09 21:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm

Hi Andrew,

Found three oops when checking this afternoon's log.  Looks like *total_scanned can
be zero...

how about;

ratio = pages > *total_scanned ? pages / (*total_scanned | 1) : 1;

Ed

> +	 * NOTE: for now I do this for all zones.  If we find this is too
> +	 * aggressive on large boxes we may want to exculude ZONE_HIGHMEM
> +	 */
> +	ratio = (pages / *total_scanned) + 1;
> +	shrink_dcache_memory(ratio, gfp_mask);
> +	shrink_icache_memory(ratio, gfp_mask);




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] slabasap-mm5_A2
  2002-09-09 21:33     ` Ed Tomlinson
@ 2002-09-09 22:07       ` Andrew Morton
  2002-09-09 22:28         ` Ed Tomlinson
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2002-09-09 22:07 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: linux-mm

Ed Tomlinson wrote:
> 
> Hi Andrew,
> 
> Found three oops when checking this afternoon's log.  Looks like *total_scanned can
> be zero...
> 
> how about;
> 
> ratio = pages > *total_scanned ? pages / (*total_scanned | 1) : 1;
> 

Yup, thanks.  I went the "+ 1" route ;)

Found another dumb bug in there too.  In refill_inactive_zone:

        while (nr_pages && !list_empty(&zone->active_list)) {
		...
                if (page_count(page) == 0) {
                        /* It is currently in pagevec_release() */
                        SetPageLRU(page);
                        list_add(&page->lru, &zone->active_list);
                        continue;
                }
                page_cache_get(page);
                list_add(&page->lru, &l_hold);
                nr_pages--;
        }

does not terminate if the active list consists entirely of zero-count
pages.  I'm not sure how I managed to abuse the system into that
state, but I did, and received a visit from the NMI watchdog for my
sins.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] slabasap-mm5_A2
  2002-09-09 22:07       ` Andrew Morton
@ 2002-09-09 22:28         ` Ed Tomlinson
  0 siblings, 0 replies; 14+ messages in thread
From: Ed Tomlinson @ 2002-09-09 22:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm

On September 9, 2002 06:07 pm, Andrew Morton wrote:
> Ed Tomlinson wrote:
> > Hi Andrew,
> >
> > Found three oops when checking this afternoon's log.  Looks like
> > *total_scanned can be zero...
> >
> > how about;
> >
> > ratio = pages > *total_scanned ? pages / (*total_scanned | 1) : 1;
>
> Yup, thanks.  I went the "+ 1" route ;)

I used the above to catch two situations.  One was if 

                to_reclaim = zone->pages_high - zone->free_pages;
                if (to_reclaim < 0)
                        continue;       /* zone has enough memory */

triggered an left *total_scanned zero, and second if to catch the case
when pages > *total_scanned.  Think both might happen.  In the oops,
think *total_scanned was zero.

Ed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2002-09-09 22:28 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-09-07 14:06 [PATCH][RFC] slabnow Ed Tomlinson
2002-09-08 20:45 ` Daniel Phillips
2002-09-09  4:59   ` Andrew Morton
2002-09-09  5:14     ` Daniel Phillips
     [not found] ` <200209081142.02839.tomlins@cam.org>
2002-09-08 20:56   ` [PATCH] slabasap-mm5_A2 Andrew Morton
2002-09-08 21:08     ` Andrew Morton
2002-09-08 21:14     ` Ed Tomlinson
2002-09-08 21:47       ` Andrew Morton
2002-09-08 21:48         ` Ed Tomlinson
2002-09-08 22:46           ` Andrew Morton
2002-09-09  9:29     ` Stephen C. Tweedie
2002-09-09 21:33     ` Ed Tomlinson
2002-09-09 22:07       ` Andrew Morton
2002-09-09 22:28         ` Ed Tomlinson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox