* Re: MM patches against 2.5.31
@ 2002-08-26 22:09 Ed Tomlinson
2002-08-26 23:58 ` Andrew Morton
0 siblings, 1 reply; 46+ messages in thread
From: Ed Tomlinson @ 2002-08-26 22:09 UTC (permalink / raw)
To: linux-mm, linux-kernel; +Cc: Andrew Morton, Christian Ehrhardt, Daniel Phillips
This seems to have been missed:
Linus Torvalds wrote:
> In article <3D6989F7.9ED1948A@zip.com.au>,
> Andrew Morton <akpm@zip.com.au> wrote:
>>
>>What I'm inclined to do there is to change __page_cache_release()
>>to not attempt to free the page at all. Just let it sit on the
>>LRU until page reclaim encounters it. With the anon-free-via-pagevec
>>patch, very, very, very few pages actually get their final release in
>>__page_cache_release() - zero on uniprocessor, I expect.
>
> If you do this, then I would personally suggest a conceptually different
> approach: make the LRU list count towards the page count. That will
> _automatically_ result in what you describe - if a page is on the LRU
> list, then "freeing" it will always just decrement the count, and the
> _real_ free comes from walking the LRU list and considering count==1 to
> be trivially freeable.
>
> That way you don't have to have separate functions for releasing
> different kinds of pages (we've seen how nasty that was from a
> maintenance standpoint already with the "put_page vs
> page_cache_release" thing).
>
> Ehh?
This should work if every structure takes its lock before removing its reference
(i.e. before testing and/or removing an LRU reference we take zone->lru_lock;
for slabs, cachep->spinlock; etc.). It's a bit of an audit task to make sure the
various locks are taken (and documented), though.
By letting the actual free be lazy as Linus suggests, things should simplify nicely.
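To make that concrete, here is a stand-alone toy model of the scheme (plain C,
not kernel code - the struct, names and the singly-linked list walk are all just
illustrative): the LRU holds a reference of its own, every release becomes a bare
decrement, and the real free happens lazily when the walk finds count==1.

#include <stdio.h>
#include <stdlib.h>

struct page {
    int count;                      /* references, including the LRU's own */
    struct page *lru_next;
};

static struct page *lru_head;

static void lru_add(struct page *page)
{
    page->count++;                  /* LRU membership takes a reference */
    page->lru_next = lru_head;
    lru_head = page;
}

/* One release path for every kind of page - no put_page vs
 * page_cache_release split. Never frees; the LRU still holds one. */
static void put_page(struct page *page)
{
    page->count--;
}

/* Reclaim: count==1 means only the LRU reference remains, so the
 * page is trivially freeable - this is the "real" free. */
static void shrink_lru(void)
{
    struct page **pp = &lru_head;

    while (*pp) {
        struct page *page = *pp;

        if (page->count == 1) {
            *pp = page->lru_next;   /* unlink from the LRU */
            free(page);
        } else {
            pp = &page->lru_next;
        }
    }
}

int main(void)
{
    struct page *page = calloc(1, sizeof(*page));

    page->count = 1;                /* caller's reference */
    lru_add(page);                  /* count == 2 */
    put_page(page);                 /* caller drops it; still on LRU */
    shrink_lru();                   /* lazily freed here */
    printf("lru empty: %d\n", lru_head == NULL);
    return 0;
}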
comments,
Ed Tomlinson
* Re: MM patches against 2.5.31
2002-08-26 22:09 MM patches against 2.5.31 Ed Tomlinson
@ 2002-08-26 23:58 ` Andrew Morton
2002-08-27 0:13 ` Rik van Riel
2002-08-28 17:06 ` slablru for 2.5.32-mm1 Ed Tomlinson
0 siblings, 2 replies; 46+ messages in thread
From: Andrew Morton @ 2002-08-26 23:58 UTC (permalink / raw)
To: Ed Tomlinson; +Cc: linux-mm, linux-kernel, Christian Ehrhardt, Daniel Phillips
Ed Tomlinson wrote:
>
> This seems to have been missed:
Still thinking about it.
> Linus Torvalds wrote:
>
> > In article <3D6989F7.9ED1948A@zip.com.au>,
> > Andrew Morton <akpm@zip.com.au> wrote:
> >>
> >>What I'm inclined to do there is to change __page_cache_release()
> >>to not attempt to free the page at all. Just let it sit on the
> >>LRU until page reclaim encounters it. With the anon-free-via-pagevec
> >>patch, very, very, very few pages actually get their final release in
> >>__page_cache_release() - zero on uniprocessor, I expect.
> >
> > If you do this, then I would personally suggest a conceptually different
> > approach: make the LRU list count towards the page count. That will
> > _automatically_ result in what you describe - if a page is on the LRU
> > list, then "freeing" it will always just decrement the count, and the
> > _real_ free comes from walking the LRU list and considering count==1 to
> > be trivially freeable.
> >
> > That way you don't have to have separate functions for releasing
> > different kinds of pages (we've seen how nasty that was from a
> > maintenance standpoint already with the "put_page vs
> > page_cache_release" thing).
> >
> > Ehh?
>
> This should work if every structure takes its lock before removing its reference
> (i.e. before testing and/or removing an LRU reference we take zone->lru_lock;
> for slabs, cachep->spinlock; etc.). It's a bit of an audit task to make sure the
> various locks are taken (and documented), though.
>
> By letting the actual free be lazy as Linus suggests, things should simplify nicely.
Well we wouldn't want to leave tons of free pages on the LRU - the
VM would needlessly reclaim pagecache before finding the free pages. And
higher-order page allocations could suffer.
If we go for explicit lru removal in truncate and zap_pte_range
then this approach may be best. Still thinking about it.
* Re: MM patches against 2.5.31
2002-08-26 23:58 ` Andrew Morton
@ 2002-08-27 0:13 ` Rik van Riel
2002-08-28 17:06 ` slablru for 2.5.32-mm1 Ed Tomlinson
1 sibling, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2002-08-27 0:13 UTC (permalink / raw)
To: Andrew Morton
Cc: Ed Tomlinson, linux-mm, linux-kernel, Christian Ehrhardt,
Daniel Phillips
On Mon, 26 Aug 2002, Andrew Morton wrote:
> Well we wouldn't want to leave tons of free pages on the LRU - the VM
> would needlessly reclaim pagecache before finding the free pages. And
> higher-order page allocations could suffer.
We did this with the swap cache in 2.4 kernels before 2.4.7 and
it was an absolute disaster.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
* slablru for 2.5.32-mm1
2002-08-26 23:58 ` Andrew Morton
2002-08-27 0:13 ` Rik van Riel
@ 2002-08-28 17:06 ` Ed Tomlinson
2002-08-28 21:24 ` Andrew Morton
` (2 more replies)
1 sibling, 3 replies; 46+ messages in thread
From: Ed Tomlinson @ 2002-08-28 17:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
Hi Andrew
Here is slablru for 32-mm1. It is based on a version ported to 31ish-mm1 and should be
stable. It's been booted UP (32-mm1) and SMP-on-UP (31ish-mm1 only) and works as expected.
A typical test cycle involved:
find / -name "*" > /dev/null
edit a large tiff with the gimp
run dbench a few times with the dbench dir on tmpfs (trying to use gimp too)
run dbench a few times from a reiserfs dir (trying to use gimp too)
use the box for news/mail, apt-get update/upgrade etc, wait a few hours and repeat
31ish-mm1 survived a day of this, 32-mm1 is sending this message after one cycle.
Andrew, what do you think about adding slablru to your experimental dir?
There is also a version for virgin 2.5.32; anyone wanting it should email me - one big
patch eats enough bandwidth.
One interesting change in this version: we only add the first page of a slab to the LRU,
and the reference-bit-setting logic for slabs has been modified to set the bit on that first page.
Pagevec created a bit of a problem for slablru: how do we know the order of a slab
page when it's being freed? My solution is to use 3 bits in page->flags to save the
order there, and free_pages_ok was modified to take the order from page->flags. This
was implemented in a minimal fashion. I think wli is working on a more elaborate version of
this - fleshed out, it could be used to support large pages in the VM.
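To illustrate the encoding, a minimal stand-alone sketch (the bit position, mask
and helper names below are invented for the example - they are not the ones the
patch actually uses):

#include <assert.h>
#include <stdio.h>

#define PG_ORDER_SHIFT  24                          /* assumed-free flag bits */
#define PG_ORDER_MASK   (7UL << PG_ORDER_SHIFT)     /* 3 bits: orders 0..7 */

static void set_page_order(unsigned long *flags, unsigned int order)
{
    assert(order <= 7);
    *flags = (*flags & ~PG_ORDER_MASK) |
             ((unsigned long)order << PG_ORDER_SHIFT);
}

/* free_pages_ok() can recover the order from the page itself, so a
 * pagevec full of slab pages needs no out-of-band order information. */
static unsigned int page_order(unsigned long flags)
{
    return (flags & PG_ORDER_MASK) >> PG_ORDER_SHIFT;
}

int main(void)
{
    unsigned long flags = 0;

    set_page_order(&flags, 2);      /* e.g. a 4-page slab */
    printf("order=%u\n", page_order(flags));
    return 0;
}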
Second topic.
I have also included an optimisation for vmscan. I found that the current code would reduce
the inactive list to almost nothing when applications create large numbers of active pages
very quickly (e.g. the gimp loading and editing large 20MB+ tiffs). This patch reduces that
problem. Conversely, always allowing nr_pages to be scanned caused the active list to be
reduced to almost nothing when something like the gimp exited while another task was adding
lots of pages to the inactive list; that is fixed here too. I do wonder whether
zone->refill_counter, as implemented, is a great idea. Do we really need/want to remember to
scan the active list if it has massively decreased in size because some app exited? Maybe
some sort of decay logic should be used...
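For instance (purely a strawman - the constants and names here are invented, this
is not in the patch), halving the remembered debt each time it is consulted would
stop one huge exit from dominating later scans:

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32

static int refill_counter;

/* Returns how many active pages to move to the inactive list. */
static int refill_work(int ratio)
{
    int work = 0;

    refill_counter -= refill_counter / 2;   /* decay stale debt */
    refill_counter += ratio;
    while (refill_counter > SWAP_CLUSTER_MAX) {
        refill_counter -= SWAP_CLUSTER_MAX;
        work += SWAP_CLUSTER_MAX;
    }
    return work;
}

int main(void)
{
    int work;

    work = refill_work(100);
    printf("work=%d counter=%d\n", work, refill_counter);
    work = refill_work(5);
    printf("work=%d counter=%d\n", work, refill_counter);
    return 0;
}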
Comments?
Ed Tomlinson
---------------
diff -Nru a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c Tue Aug 27 09:59:26 2002
+++ b/mm/vmscan.c Tue Aug 27 09:59:26 2002
@@ -492,11 +492,14 @@
* active list.
*/
ratio = (unsigned long)nr_pages * zone->nr_active /
- ((zone->nr_inactive | 1) * 2);
- atomic_add(ratio+1, &zone->refill_counter);
- if (atomic_read(&zone->refill_counter) > SWAP_CLUSTER_MAX) {
+ ((zone->nr_inactive | 1) * 2) + 1;
+ atomic_add(ratio, &zone->refill_counter);
+ max_scan = nr_pages < ratio ? nr_pages : ratio;
+ while ((atomic_read(&zone->refill_counter) > SWAP_CLUSTER_MAX) &&
+ (max_scan > 0)) {
atomic_sub(SWAP_CLUSTER_MAX, &zone->refill_counter);
refill_inactive_zone(zone, SWAP_CLUSTER_MAX);
+ max_scan -= SWAP_CLUSTER_MAX;
}
max_scan = zone->nr_inactive / priority;
---------------
---------------
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.523 -> 1.524
# include/asm-generic/tlb.h 1.12 -> 1.13
# include/asm-s390x/tlb.h 1.3 -> 1.4
# fs/jfs/jfs_umount.c 1.3 -> 1.4
# include/linux/kernel.h 1.19 -> 1.20
# include/asm-ppc/pgalloc.h 1.8 -> 1.9
# include/asm-ppc/hardirq.h 1.12 -> 1.13
# fs/ext2/dir.c 1.15 -> 1.16
# include/asm-ppc/tlb.h 1.3 -> 1.4
# include/asm-ppc64/page.h 1.10 -> 1.11
# include/linux/pagemap.h 1.24 -> 1.25
# include/linux/mmzone.h 1.13 -> 1.14
# kernel/ksyms.c 1.122 -> 1.123
# include/asm-cris/cache.h 1.1 -> 1.2
# include/asm-ia64/pgalloc.h 1.13 -> 1.14
# include/linux/swap.h 1.53 -> 1.54
# include/asm-arm/tlb.h 1.3 -> 1.4
# include/linux/buffer_head.h 1.26 -> 1.27
# include/linux/ext3_fs.h 1.7 -> 1.8
# include/asm-i386/cache.h 1.1 -> 1.2
# include/asm-ppc64/pgalloc.h 1.6 -> 1.7
# arch/alpha/mm/numa.c 1.5 -> 1.6
# include/linux/mm.h 1.71 -> 1.72
# include/asm-i386/pgalloc.h 1.16 -> 1.17
# include/asm-mips64/mmzone.h 1.3 -> 1.4
# mm/highmem.c 1.30 -> 1.31
# mm/page_alloc.c 1.89 -> 1.90
# include/asm-s390/cache.h 1.3 -> 1.4
# include/asm-sparc/pgalloc.h 1.9 -> 1.10
# arch/arm/mm/init.c 1.15 -> 1.16
# include/asm-ia64/spinlock.h 1.4 -> 1.5
# include/linux/ext3_jbd.h 1.4 -> 1.5
# fs/ext3/Makefile 1.3 -> 1.4
# include/asm-i386/highmem.h 1.7 -> 1.8
# arch/i386/Config.help 1.12 -> 1.13
# include/linux/sched.h 1.84 -> 1.85
# include/linux/writeback.h 1.10 -> 1.11
# include/asm-ppc64/cache.h 1.3 -> 1.4
# fs/driverfs/inode.c 1.48 -> 1.49
# mm/page-writeback.c 1.32 -> 1.33
# fs/smbfs/inode.c 1.27 -> 1.28
# include/asm-s390x/cache.h 1.3 -> 1.4
# mm/vmscan.c 1.96 -> 1.97
# include/asm-x86_64/tlb.h 1.3 -> 1.4
# fs/proc/proc_misc.c 1.34 -> 1.35
# include/asm-s390x/pgalloc.h 1.6 -> 1.7
# include/asm-parisc/cache.h 1.1 -> 1.2
# include/linux/page-flags.h 1.16 -> 1.17
# fs/jfs/jfs_imap.c 1.9 -> 1.10
# include/asm-i386/page.h 1.15 -> 1.16
# arch/i386/kernel/Makefile 1.21 -> 1.22
# include/asm-sparc64/tlb.h 1.5 -> 1.6
# arch/sparc/mm/sun4c.c 1.16 -> 1.17
# arch/i386/mm/init.c 1.23 -> 1.24
# fs/minix/dir.c 1.11 -> 1.12
# fs/mpage.c 1.13 -> 1.14
# include/asm-i386/spinlock.h 1.7 -> 1.8
# include/linux/preempt.h 1.5 -> 1.6
# kernel/suspend.c 1.17 -> 1.18
# fs/jfs/jfs_metapage.c 1.15 -> 1.16
# include/linux/gfp.h 1.2 -> 1.3
# include/linux/fs.h 1.157 -> 1.158
# arch/i386/mm/Makefile 1.3 -> 1.4
# arch/mips64/sgi-ip27/ip27-memory.c 1.3 -> 1.4
# fs/sysv/dir.c 1.12 -> 1.13
# include/asm-i386/semaphore.h 1.5 -> 1.6
# include/asm-alpha/pgtable.h 1.13 -> 1.14
# arch/ppc/mm/fault.c 1.11 -> 1.12
# include/asm-i386/tlb.h 1.5 -> 1.6
# mm/numa.c 1.8 -> 1.9
# include/linux/backing-dev.h 1.1 -> 1.2
# mm/bootmem.c 1.11 -> 1.12
# include/asm-mips/cache.h 1.2 -> 1.3
# arch/i386/config.in 1.46 -> 1.47
# include/linux/rwsem.h 1.8 -> 1.9
# fs/reiserfs/inode.c 1.63 -> 1.64
# fs/ramfs/inode.c 1.24 -> 1.25
# arch/sparc/mm/fault.c 1.8 -> 1.9
# mm/filemap.c 1.127 -> 1.128
# include/asm-ppc64/tlb.h 1.2 -> 1.3
# fs/namei.c 1.54 -> 1.55
# include/asm-alpha/mmzone.h 1.1 -> 1.2
# arch/sparc64/mm/init.c 1.33 -> 1.34
# mm/msync.c 1.7 -> 1.8
# fs/nfsd/vfs.c 1.36 -> 1.37
# mm/swap.c 1.28 -> 1.29
# include/asm-alpha/cache.h 1.3 -> 1.4
# include/asm-m68k/sun3_pgalloc.h 1.6 -> 1.7
# include/asm-ia64/tlb.h 1.7 -> 1.8
# include/asm-i386/pci.h 1.15 -> 1.16
# mm/shmem.c 1.70 -> 1.71
# include/asm-alpha/tlb.h 1.2 -> 1.3
# include/asm-sparc/tlb.h 1.2 -> 1.3
# include/asm-ppc/highmem.h 1.7 -> 1.8
# mm/swap_state.c 1.41 -> 1.42
# fs/super.c 1.82 -> 1.83
# fs/fat/inode.c 1.42 -> 1.43
# include/asm-mips64/cache.h 1.2 -> 1.3
# include/asm-m68k/cache.h 1.2 -> 1.3
# fs/jfs/super.c 1.21 -> 1.22
# arch/cris/mm/init.c 1.9 -> 1.10
# include/asm-ia64/semaphore.h 1.4 -> 1.5
# fs/affs/file.c 1.20 -> 1.21
# include/asm-i386/io.h 1.16 -> 1.17
# arch/i386/kernel/setup.c 1.59 -> 1.60
# kernel/printk.c 1.13 -> 1.14
# include/linux/spinlock.h 1.18 -> 1.19
# include/asm-ia64/cache.h 1.3 -> 1.4
# fs/fs-writeback.c 1.18 -> 1.19
# include/asm-sparc64/cache.h 1.3 -> 1.4
# fs/jfs/jfs_logmgr.c 1.29 -> 1.30
# include/asm-x86_64/pgalloc.h 1.4 -> 1.5
# kernel/acct.c 1.12 -> 1.13
# include/linux/rwsem-spinlock.h 1.7 -> 1.8
# drivers/block/loop.c 1.54 -> 1.55
# fs/buffer.c 1.138 -> 1.139
# arch/sparc/mm/srmmu.c 1.19 -> 1.20
# fs/ext3/namei.c 1.20 -> 1.21
# fs/jfs/jfs_txnmgr.c 1.21 -> 1.22
# fs/jffs/inode-v23.c 1.35 -> 1.36
# drivers/block/rd.c 1.45 -> 1.46
# include/asm-s390/pgalloc.h 1.7 -> 1.8
# include/asm-i386/setup.h 1.1 -> 1.2
# include/asm-sparc/cache.h 1.2 -> 1.3
# fs/nfs/file.c 1.17 -> 1.18
# fs/Config.in 1.33 -> 1.34
# fs/ext3/inode.c 1.32 -> 1.33
# include/asm-sh/cache.h 1.2 -> 1.3
# fs/reiserfs/stree.c 1.31 -> 1.32
# include/asm-i386/pgtable.h 1.17 -> 1.18
# fs/nfs/inode.c 1.47 -> 1.48
# drivers/scsi/scsi_scan.c 1.21 -> 1.22
# arch/sh/mm/init.c 1.9 -> 1.10
# include/asm-s390/tlb.h 1.3 -> 1.4
# fs/jfs/jfs_dmap.c 1.8 -> 1.9
# include/asm-x86_64/cache.h 1.1 -> 1.2
# include/linux/bootmem.h 1.2 -> 1.3
# fs/smbfs/file.c 1.17 -> 1.18
# fs/ext3/super.c 1.29 -> 1.30
# include/asm-arm/memory.h 1.4 -> 1.5
# mm/rmap.c 1.10 -> 1.11
# include/asm-ppc64/mmzone.h 1.2 -> 1.3
# include/asm-m68k/tlb.h 1.3 -> 1.4
# include/asm-ppc/cache.h 1.5 -> 1.6
# drivers/net/ns83820.c 1.14 -> 1.15
# include/asm-mips64/pgtable.h 1.8 -> 1.9
# include/asm-arm/cache.h 1.1 -> 1.2
# include/linux/pagevec.h 1.2 -> 1.3
# fs/inode.c 1.67 -> 1.68
# include/asm-sparc/highmem.h 1.4 -> 1.5
# fs/jffs2/file.c 1.14 -> 1.15
# include/linux/dcache.h 1.14 -> 1.15
# include/asm-sparc/hardirq.h 1.5 -> 1.6
# arch/i386/mm/pgtable.c 1.2 -> 1.3
# include/asm-m68k/motorola_pgalloc.h 1.7 -> 1.8
# fs/reiserfs/tail_conversion.c 1.22 -> 1.23
# fs/udf/inode.c 1.21 -> 1.22
# include/linux/cache.h 1.4 -> 1.5
# (new) -> 1.1 fs/ext3/hash.c
# (new) -> 1.1 include/asm-i386/max_numnodes.h
# (new) -> 1.1 include/asm-i386/mmzone.h
# (new) -> 1.1 include/asm-i386/numaq.h
# (new) -> 1.1 arch/i386/kernel/numaq.c
# (new) -> 1.1 arch/i386/mm/discontig.c
# (new) -> 1.1 include/linux/mm_inline.h
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/08/28 ed@oscar.et.ca 1.524
# 2.5.32-mm1
# --------------------------------------------
#
diff -Nru a/arch/alpha/mm/numa.c b/arch/alpha/mm/numa.c
--- a/arch/alpha/mm/numa.c Wed Aug 28 07:37:36 2002
+++ b/arch/alpha/mm/numa.c Wed Aug 28 07:37:36 2002
@@ -294,7 +294,7 @@
zones_size[ZONE_DMA] = dma_local_pfn;
zones_size[ZONE_NORMAL] = (end_pfn - start_pfn) - dma_local_pfn;
}
- free_area_init_node(nid, NODE_DATA(nid), NULL, zones_size, start_pfn<<PAGE_SHIFT, NULL);
+ free_area_init_node(nid, NODE_DATA(nid), NULL, zones_size, start_pfn, NULL);
lmax_mapnr = PLAT_NODE_DATA_STARTNR(nid) + PLAT_NODE_DATA_SIZE(nid);
if (lmax_mapnr > max_mapnr) {
max_mapnr = lmax_mapnr;
@@ -371,7 +371,7 @@
totalram_pages += free_all_bootmem_node(NODE_DATA(nid));
lmem_map = NODE_MEM_MAP(nid);
- pfn = NODE_DATA(nid)->node_start_paddr >> PAGE_SHIFT;
+ pfn = NODE_DATA(nid)->node_start_pfn;
for (i = 0; i < PLAT_NODE_DATA_SIZE(nid); i++, pfn++)
if (page_is_ram(pfn) && PageReserved(lmem_map+i))
reservedpages++;
diff -Nru a/arch/arm/mm/init.c b/arch/arm/mm/init.c
--- a/arch/arm/mm/init.c Wed Aug 28 07:37:36 2002
+++ b/arch/arm/mm/init.c Wed Aug 28 07:37:36 2002
@@ -512,7 +512,7 @@
arch_adjust_zones(node, zone_size, zhole_size);
free_area_init_node(node, pgdat, 0, zone_size,
- bdata->node_boot_start, zhole_size);
+ bdata->node_boot_start >> PAGE_SHIFT, zhole_size);
}
/*
diff -Nru a/arch/cris/mm/init.c b/arch/cris/mm/init.c
--- a/arch/cris/mm/init.c Wed Aug 28 07:37:37 2002
+++ b/arch/cris/mm/init.c Wed Aug 28 07:37:37 2002
@@ -345,7 +345,7 @@
* mem_map page array.
*/
- free_area_init_node(0, 0, 0, zones_size, PAGE_OFFSET, 0);
+ free_area_init_node(0, 0, 0, zones_size, PAGE_OFFSET >> PAGE_SHIFT, 0);
}
diff -Nru a/arch/i386/Config.help b/arch/i386/Config.help
--- a/arch/i386/Config.help Wed Aug 28 07:37:36 2002
+++ b/arch/i386/Config.help Wed Aug 28 07:37:36 2002
@@ -41,7 +41,7 @@
486, 586, Pentiums, and various instruction-set-compatible chips by
AMD, Cyrix, and others.
-CONFIG_MULTIQUAD
+CONFIG_X86_NUMAQ
This option is used for getting Linux to run on a (IBM/Sequent) NUMA
multiquad box. This changes the way that processors are bootstrapped,
and uses Clustered Logical APIC addressing mode instead of Flat Logical.
diff -Nru a/arch/i386/config.in b/arch/i386/config.in
--- a/arch/i386/config.in Wed Aug 28 07:37:37 2002
+++ b/arch/i386/config.in Wed Aug 28 07:37:37 2002
@@ -166,7 +166,22 @@
define_bool CONFIG_X86_IO_APIC y
fi
else
- bool 'Multiquad NUMA system' CONFIG_MULTIQUAD
+ bool 'Multi-node NUMA system support' CONFIG_X86_NUMA
+ if [ "$CONFIG_X86_NUMA" = "y" ]; then
+ #Platform Choices
+ bool 'Multiquad (IBM/Sequent) NUMAQ support' CONFIG_X86_NUMAQ
+ if [ "$CONFIG_X86_NUMAQ" = "y" ]; then
+ define_bool CONFIG_MULTIQUAD y
+ fi
+ # Common NUMA Features
+ if [ "$CONFIG_X86_NUMAQ" = "y" ]; then
+ bool 'Numa Memory Allocation Support' CONFIG_NUMA
+ if [ "$CONFIG_NUMA" = "y" ]; then
+ define_bool CONFIG_DISCONTIGMEM y
+ define_bool CONFIG_HAVE_ARCH_BOOTMEM_NODE y
+ fi
+ fi
+ fi
fi
bool 'Machine Check Exception' CONFIG_X86_MCE
diff -Nru a/arch/i386/kernel/Makefile b/arch/i386/kernel/Makefile
--- a/arch/i386/kernel/Makefile Wed Aug 28 07:37:36 2002
+++ b/arch/i386/kernel/Makefile Wed Aug 28 07:37:36 2002
@@ -25,6 +25,7 @@
obj-$(CONFIG_X86_LOCAL_APIC) += mpparse.o apic.o nmi.o
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_X86_NUMAQ) += numaq.o
ifdef CONFIG_VISWS
obj-y += setup-visws.o
obj-$(CONFIG_X86_VISWS_APIC) += visws_apic.o
diff -Nru a/arch/i386/kernel/numaq.c b/arch/i386/kernel/numaq.c
--- /dev/null Wed Dec 31 16:00:00 1969
+++ b/arch/i386/kernel/numaq.c Wed Aug 28 07:37:38 2002
@@ -0,0 +1,145 @@
+/*
+ * Written by: Patricia Gaughen, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <gone@us.ibm.com>
+ */
+
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/bootmem.h>
+#include <linux/mmzone.h>
+#include <asm/numaq.h>
+
+/* These are needed before the pgdat's are created */
+unsigned long node_start_pfn[MAX_NUMNODES];
+unsigned long node_end_pfn[MAX_NUMNODES];
+
+#define MB_TO_PAGES(addr) ((addr) << (20 - PAGE_SHIFT))
+
+/*
+ * Function: smp_dump_qct()
+ *
+ * Description: gets memory layout from the quad config table. This
+ * function also increments numnodes with the number of nodes (quads)
+ * present.
+ */
+static void __init smp_dump_qct(void)
+{
+ int node;
+ struct eachquadmem *eq;
+ struct sys_cfg_data *scd =
+ (struct sys_cfg_data *)__va(SYS_CFG_DATA_PRIV_ADDR);
+
+ numnodes = 0;
+ for(node = 0; node < MAX_NUMNODES; node++) {
+ if(scd->quads_present31_0 & (1 << node)) {
+ numnodes++;
+ eq = &scd->eq[node];
+ /* Convert to pages */
+ node_start_pfn[node] = MB_TO_PAGES(
+ eq->hi_shrd_mem_start - eq->priv_mem_size);
+ node_end_pfn[node] = MB_TO_PAGES(
+ eq->hi_shrd_mem_start + eq->hi_shrd_mem_size);
+ }
+ }
+}
+
+/*
+ * -----------------------------------------
+ *
+ * functions related to physnode_map
+ *
+ * -----------------------------------------
+ */
+/*
+ * physnode_map keeps track of the physical memory layout of the
+ * numaq nodes on a 256Mb break (each element of the array will
+ * represent 256Mb of memory and will be marked by the node id. so,
+ * if the first gig is on node 0, and the second gig is on node 1
+ * physnode_map will contain:
+ * physnode_map[0-3] = 0;
+ * physnode_map[4-7] = 1;
+ * physnode_map[8- ] = -1;
+ */
+int physnode_map[MAX_ELEMENTS] = { [0 ... (MAX_ELEMENTS - 1)] = -1};
+
+#define MB_TO_ELEMENT(x) (x >> ELEMENT_REPRESENTS)
+#define PA_TO_MB(pa) (pa >> 20) /* assumption: a physical address is in bytes */
+
+int pa_to_nid(u64 pa)
+{
+ int nid;
+
+ nid = physnode_map[MB_TO_ELEMENT(PA_TO_MB(pa))];
+
+ /* the physical address passed in is not in the map for the system */
+ if (nid == -1)
+ BUG();
+
+ return nid;
+}
+
+int pfn_to_nid(unsigned long pfn)
+{
+ return pa_to_nid(((u64)pfn) << PAGE_SHIFT);
+}
+
+/*
+ * for each node mark the regions
+ * TOPOFMEM = hi_shrd_mem_start + hi_shrd_mem_size
+ *
+ * need to be very careful to not mark 1024+ as belonging
+ * to node 0. will want 1027 to show as belonging to node 1
+ * example:
+ * TOPOFMEM = 1024
+ * 1024 >> 8 = 4 (subtract 1 for starting at 0]
+ * tmpvar = TOPOFMEM - 256 = 768
+ * 1024 >> 8 = 4 (subtract 1 for starting at 0]
+ *
+ */
+static void __init initialize_physnode_map(void)
+{
+ int nid;
+ unsigned int topofmem, cur;
+ struct eachquadmem *eq;
+ struct sys_cfg_data *scd =
+ (struct sys_cfg_data *)__va(SYS_CFG_DATA_PRIV_ADDR);
+
+
+ for(nid = 0; nid < numnodes; nid++) {
+ if(scd->quads_present31_0 & (1 << nid)) {
+ eq = &scd->eq[nid];
+ cur = eq->hi_shrd_mem_start;
+ topofmem = eq->hi_shrd_mem_start + eq->hi_shrd_mem_size;
+ while (cur < topofmem) {
+ physnode_map[cur >> 8] = nid;
+ cur += (ELEMENT_REPRESENTS - 1);
+ }
+ }
+ }
+}
+
+void __init get_memcfg_numaq(void)
+{
+ smp_dump_qct();
+ initialize_physnode_map();
+}
diff -Nru a/arch/i386/kernel/setup.c b/arch/i386/kernel/setup.c
--- a/arch/i386/kernel/setup.c Wed Aug 28 07:37:37 2002
+++ b/arch/i386/kernel/setup.c Wed Aug 28 07:37:37 2002
@@ -36,6 +36,7 @@
#include <linux/highmem.h>
#include <asm/e820.h>
#include <asm/mpspec.h>
+#include <asm/setup.h>
/*
* Machine setup..
@@ -83,35 +84,10 @@
unsigned long saved_videomode;
-/*
- * This is set up by the setup-routine at boot-time
- */
-#define PARAM ((unsigned char *)empty_zero_page)
-#define SCREEN_INFO (*(struct screen_info *) (PARAM+0))
-#define EXT_MEM_K (*(unsigned short *) (PARAM+2))
-#define ALT_MEM_K (*(unsigned long *) (PARAM+0x1e0))
-#define E820_MAP_NR (*(char*) (PARAM+E820NR))
-#define E820_MAP ((struct e820entry *) (PARAM+E820MAP))
-#define APM_BIOS_INFO (*(struct apm_bios_info *) (PARAM+0x40))
-#define DRIVE_INFO (*(struct drive_info_struct *) (PARAM+0x80))
-#define SYS_DESC_TABLE (*(struct sys_desc_table_struct*)(PARAM+0xa0))
-#define MOUNT_ROOT_RDONLY (*(unsigned short *) (PARAM+0x1F2))
-#define RAMDISK_FLAGS (*(unsigned short *) (PARAM+0x1F8))
-#define VIDEO_MODE (*(unsigned short *) (PARAM+0x1FA))
-#define ORIG_ROOT_DEV (*(unsigned short *) (PARAM+0x1FC))
-#define AUX_DEVICE_INFO (*(unsigned char *) (PARAM+0x1FF))
-#define LOADER_TYPE (*(unsigned char *) (PARAM+0x210))
-#define KERNEL_START (*(unsigned long *) (PARAM+0x214))
-#define INITRD_START (*(unsigned long *) (PARAM+0x218))
-#define INITRD_SIZE (*(unsigned long *) (PARAM+0x21c))
-#define COMMAND_LINE ((char *) (PARAM+2048))
-#define COMMAND_LINE_SIZE 256
-
#define RAMDISK_IMAGE_START_MASK 0x07FF
#define RAMDISK_PROMPT_FLAG 0x8000
#define RAMDISK_LOAD_FLAG 0x4000
-
static char command_line[COMMAND_LINE_SIZE];
char saved_command_line[COMMAND_LINE_SIZE];
@@ -592,72 +568,13 @@
}
}
-void __init setup_arch(char **cmdline_p)
-{
- unsigned long bootmap_size, low_mem_size;
- unsigned long start_pfn, max_low_pfn;
- int i;
-
- early_cpu_init();
-
-#ifdef CONFIG_VISWS
- visws_get_board_type_and_rev();
-#endif
-
- ROOT_DEV = ORIG_ROOT_DEV;
- drive_info = DRIVE_INFO;
- screen_info = SCREEN_INFO;
- apm_info.bios = APM_BIOS_INFO;
- saved_videomode = VIDEO_MODE;
- printk("Video mode to be used for restore is %lx\n", saved_videomode);
- if( SYS_DESC_TABLE.length != 0 ) {
- MCA_bus = SYS_DESC_TABLE.table[3] &0x2;
- machine_id = SYS_DESC_TABLE.table[0];
- machine_submodel_id = SYS_DESC_TABLE.table[1];
- BIOS_revision = SYS_DESC_TABLE.table[2];
- }
- aux_device_present = AUX_DEVICE_INFO;
-
-#ifdef CONFIG_BLK_DEV_RAM
- rd_image_start = RAMDISK_FLAGS & RAMDISK_IMAGE_START_MASK;
- rd_prompt = ((RAMDISK_FLAGS & RAMDISK_PROMPT_FLAG) != 0);
- rd_doload = ((RAMDISK_FLAGS & RAMDISK_LOAD_FLAG) != 0);
-#endif
- setup_memory_region();
-
- if (!MOUNT_ROOT_RDONLY)
- root_mountflags &= ~MS_RDONLY;
- init_mm.start_code = (unsigned long) &_text;
- init_mm.end_code = (unsigned long) &_etext;
- init_mm.end_data = (unsigned long) &_edata;
- init_mm.brk = (unsigned long) &_end;
-
- code_resource.start = virt_to_phys(&_text);
- code_resource.end = virt_to_phys(&_etext)-1;
- data_resource.start = virt_to_phys(&_etext);
- data_resource.end = virt_to_phys(&_edata)-1;
-
- parse_mem_cmdline(cmdline_p);
-
-#define PFN_UP(x) (((x) + PAGE_SIZE-1) >> PAGE_SHIFT)
-#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
-#define PFN_PHYS(x) ((x) << PAGE_SHIFT)
-
/*
- * Reserved space for vmalloc and iomap - defined in asm/page.h
+ * Find the highest page frame number we have available
*/
-#define MAXMEM_PFN PFN_DOWN(MAXMEM)
-#define MAX_NONPAE_PFN (1 << 20)
-
- /*
- * partially used pages are not usable - thus
- * we are rounding upwards:
- */
- start_pfn = PFN_UP(__pa(&_end));
+void __init find_max_pfn(void)
+{
+ int i;
- /*
- * Find the highest page frame number we have available
- */
max_pfn = 0;
for (i = 0; i < e820.nr_map; i++) {
unsigned long start, end;
@@ -671,10 +588,15 @@
if (end > max_pfn)
max_pfn = end;
}
+}
+
+/*
+ * Determine low and high memory ranges:
+ */
+unsigned long __init find_max_low_pfn(void)
+{
+ unsigned long max_low_pfn;
- /*
- * Determine low and high memory ranges:
- */
max_low_pfn = max_pfn;
if (max_low_pfn > MAXMEM_PFN) {
if (highmem_pages == -1)
@@ -724,28 +646,20 @@
printk(KERN_ERR "ignoring highmem size on non-highmem kernel!\n");
#endif
}
+ return max_low_pfn;
+}
-#ifdef CONFIG_HIGHMEM
- highstart_pfn = highend_pfn = max_pfn;
- if (max_pfn > max_low_pfn) {
- highstart_pfn = max_low_pfn;
- }
- printk(KERN_NOTICE "%ldMB HIGHMEM available.\n",
- pages_to_mb(highend_pfn - highstart_pfn));
-#endif
- printk(KERN_NOTICE "%ldMB LOWMEM available.\n",
- pages_to_mb(max_low_pfn));
- /*
- * Initialize the boot-time allocator (with low memory only):
- */
- bootmap_size = init_bootmem(start_pfn, max_low_pfn);
+#ifndef CONFIG_DISCONTIGMEM
+/*
+ * Register fully available low RAM pages with the bootmem allocator.
+ */
+static void __init register_bootmem_low_pages(unsigned long max_low_pfn)
+{
+ int i;
- /*
- * Register fully available low RAM pages with the bootmem allocator.
- */
for (i = 0; i < e820.nr_map; i++) {
unsigned long curr_pfn, last_pfn, size;
- /*
+ /*
* Reserve usable low memory
*/
if (e820.map[i].type != E820_RAM)
@@ -774,6 +688,39 @@
size = last_pfn - curr_pfn;
free_bootmem(PFN_PHYS(curr_pfn), PFN_PHYS(size));
}
+}
+
+static unsigned long __init setup_memory(void)
+{
+ unsigned long bootmap_size, start_pfn, max_low_pfn;
+
+ /*
+ * partially used pages are not usable - thus
+ * we are rounding upwards:
+ */
+ start_pfn = PFN_UP(__pa(&_end));
+
+ find_max_pfn();
+
+ max_low_pfn = find_max_low_pfn();
+
+#ifdef CONFIG_HIGHMEM
+ highstart_pfn = highend_pfn = max_pfn;
+ if (max_pfn > max_low_pfn) {
+ highstart_pfn = max_low_pfn;
+ }
+ printk(KERN_NOTICE "%ldMB HIGHMEM available.\n",
+ pages_to_mb(highend_pfn - highstart_pfn));
+#endif
+ printk(KERN_NOTICE "%ldMB LOWMEM available.\n",
+ pages_to_mb(max_low_pfn));
+ /*
+ * Initialize the boot-time allocator (with low memory only):
+ */
+ bootmap_size = init_bootmem(start_pfn, max_low_pfn);
+
+ register_bootmem_low_pages(max_low_pfn);
+
/*
* Reserve the bootmem bitmap itself as well. We do this in two
* steps (first step was init_bootmem()) because this catches
@@ -809,6 +756,7 @@
*/
find_smp_config();
#endif
+
#ifdef CONFIG_BLK_DEV_INITRD
if (LOADER_TYPE && INITRD_START) {
if (INITRD_START + INITRD_SIZE <= (max_low_pfn << PAGE_SHIFT)) {
@@ -826,32 +774,21 @@
}
}
#endif
+ return max_low_pfn;
+}
+#else
+extern unsigned long setup_memory(void);
+#endif /* !CONFIG_DISCONTIGMEM */
- /*
- * NOTE: before this point _nobody_ is allowed to allocate
- * any memory using the bootmem allocator.
- */
-
-#ifdef CONFIG_SMP
- smp_alloc_memory(); /* AP processor realmode stacks in low memory*/
-#endif
- paging_init();
-#ifdef CONFIG_ACPI_BOOT
- /*
- * Parse the ACPI tables for possible boot-time SMP configuration.
- */
- acpi_boot_init(*cmdline_p);
-#endif
-#ifdef CONFIG_X86_LOCAL_APIC
- if (smp_found_config)
- get_smp_config();
-#endif
-
+/*
+ * Request address space for all standard RAM and ROM resources
+ * and also for regions reported as reserved by the e820.
+ */
+static void __init register_memory(unsigned long max_low_pfn)
+{
+ unsigned long low_mem_size;
+ int i;
- /*
- * Request address space for all standard RAM and ROM resources
- * and also for regions reported as reserved by the e820.
- */
probe_roms();
for (i = 0; i < e820.nr_map; i++) {
struct resource *res;
@@ -888,6 +825,76 @@
low_mem_size = ((max_low_pfn << PAGE_SHIFT) + 0xfffff) & ~0xfffff;
if (low_mem_size > pci_mem_start)
pci_mem_start = low_mem_size;
+}
+
+void __init setup_arch(char **cmdline_p)
+{
+ unsigned long max_low_pfn;
+
+ early_cpu_init();
+
+#ifdef CONFIG_VISWS
+ visws_get_board_type_and_rev();
+#endif
+
+ ROOT_DEV = ORIG_ROOT_DEV;
+ drive_info = DRIVE_INFO;
+ screen_info = SCREEN_INFO;
+ apm_info.bios = APM_BIOS_INFO;
+ saved_videomode = VIDEO_MODE;
+ printk("Video mode to be used for restore is %lx\n", saved_videomode);
+ if( SYS_DESC_TABLE.length != 0 ) {
+ MCA_bus = SYS_DESC_TABLE.table[3] &0x2;
+ machine_id = SYS_DESC_TABLE.table[0];
+ machine_submodel_id = SYS_DESC_TABLE.table[1];
+ BIOS_revision = SYS_DESC_TABLE.table[2];
+ }
+ aux_device_present = AUX_DEVICE_INFO;
+
+#ifdef CONFIG_BLK_DEV_RAM
+ rd_image_start = RAMDISK_FLAGS & RAMDISK_IMAGE_START_MASK;
+ rd_prompt = ((RAMDISK_FLAGS & RAMDISK_PROMPT_FLAG) != 0);
+ rd_doload = ((RAMDISK_FLAGS & RAMDISK_LOAD_FLAG) != 0);
+#endif
+ setup_memory_region();
+
+ if (!MOUNT_ROOT_RDONLY)
+ root_mountflags &= ~MS_RDONLY;
+ init_mm.start_code = (unsigned long) &_text;
+ init_mm.end_code = (unsigned long) &_etext;
+ init_mm.end_data = (unsigned long) &_edata;
+ init_mm.brk = (unsigned long) &_end;
+
+ code_resource.start = virt_to_phys(&_text);
+ code_resource.end = virt_to_phys(&_etext)-1;
+ data_resource.start = virt_to_phys(&_etext);
+ data_resource.end = virt_to_phys(&_edata)-1;
+
+ parse_mem_cmdline(cmdline_p);
+
+ max_low_pfn = setup_memory();
+
+ /*
+ * NOTE: before this point _nobody_ is allowed to allocate
+ * any memory using the bootmem allocator.
+ */
+
+#ifdef CONFIG_SMP
+ smp_alloc_memory(); /* AP processor realmode stacks in low memory*/
+#endif
+ paging_init();
+#ifdef CONFIG_ACPI_BOOT
+ /*
+ * Parse the ACPI tables for possible boot-time SMP configuration.
+ */
+ acpi_boot_init(*cmdline_p);
+#endif
+#ifdef CONFIG_X86_LOCAL_APIC
+ if (smp_found_config)
+ get_smp_config();
+#endif
+
+ register_memory(max_low_pfn);
#ifdef CONFIG_VT
#if defined(CONFIG_VGA_CONSOLE)
diff -Nru a/arch/i386/mm/Makefile b/arch/i386/mm/Makefile
--- a/arch/i386/mm/Makefile Wed Aug 28 07:37:37 2002
+++ b/arch/i386/mm/Makefile Wed Aug 28 07:37:37 2002
@@ -10,6 +10,7 @@
O_TARGET := mm.o
obj-y := init.o pgtable.o fault.o ioremap.o extable.o pageattr.o
+obj-$(CONFIG_DISCONTIGMEM) += discontig.o
export-objs := pageattr.o
include $(TOPDIR)/Rules.make
diff -Nru a/arch/i386/mm/discontig.c b/arch/i386/mm/discontig.c
--- /dev/null Wed Dec 31 16:00:00 1969
+++ b/arch/i386/mm/discontig.c Wed Aug 28 07:37:38 2002
@@ -0,0 +1,294 @@
+/*
+ * Written by: Patricia Gaughen, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <gone@us.ibm.com>
+ */
+
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/bootmem.h>
+#include <linux/mmzone.h>
+#include <linux/highmem.h>
+#ifdef CONFIG_BLK_DEV_RAM
+#include <linux/blk.h>
+#endif
+#include <asm/e820.h>
+#include <asm/setup.h>
+
+struct pglist_data *node_data[MAX_NUMNODES];
+bootmem_data_t node0_bdata;
+
+extern unsigned long find_max_low_pfn(void);
+extern void find_max_pfn(void);
+extern void one_highpage_init(struct page *, int, int);
+
+extern unsigned long node_start_pfn[], node_end_pfn[];
+extern struct e820map e820;
+extern char _end;
+extern unsigned long highend_pfn, highstart_pfn;
+extern unsigned long max_low_pfn;
+extern unsigned long totalram_pages;
+extern unsigned long totalhigh_pages;
+
+/*
+ * Find the highest page frame number we have available for the node
+ */
+static void __init find_max_pfn_node(int nid)
+{
+ if (node_start_pfn[nid] >= node_end_pfn[nid])
+ BUG();
+ if (node_end_pfn[nid] > max_pfn)
+ node_end_pfn[nid] = max_pfn;
+}
+
+/*
+ * Allocate memory for the pg_data_t via a crude pre-bootmem method
+ * We ought to relocate these onto their own node later on during boot.
+ */
+static void __init allocate_pgdat(int nid)
+{
+ unsigned long node_datasz;
+
+ node_datasz = PFN_UP(sizeof(struct pglist_data));
+ NODE_DATA(nid) = (pg_data_t *)(__va(min_low_pfn << PAGE_SHIFT));
+ min_low_pfn += node_datasz;
+}
+
+/*
+ * Register fully available low RAM pages with the bootmem allocator.
+ */
+static void __init register_bootmem_low_pages(unsigned long system_max_low_pfn)
+{
+ int i;
+
+ for (i = 0; i < e820.nr_map; i++) {
+ unsigned long curr_pfn, last_pfn, size;
+ /*
+ * Reserve usable low memory
+ */
+ if (e820.map[i].type != E820_RAM)
+ continue;
+ /*
+ * We are rounding up the start address of usable memory:
+ */
+ curr_pfn = PFN_UP(e820.map[i].addr);
+ if (curr_pfn >= system_max_low_pfn)
+ continue;
+ /*
+ * ... and at the end of the usable range downwards:
+ */
+ last_pfn = PFN_DOWN(e820.map[i].addr + e820.map[i].size);
+
+ if (last_pfn > system_max_low_pfn)
+ last_pfn = system_max_low_pfn;
+
+ /*
+ * .. finally, did all the rounding and playing
+ * around just make the area go away?
+ */
+ if (last_pfn <= curr_pfn)
+ continue;
+
+ size = last_pfn - curr_pfn;
+ free_bootmem_node(NODE_DATA(0), PFN_PHYS(curr_pfn), PFN_PHYS(size));
+ }
+}
+
+unsigned long __init setup_memory(void)
+{
+ int nid;
+ unsigned long bootmap_size, system_start_pfn, system_max_low_pfn;
+
+ get_memcfg_numa();
+
+ /*
+ * partially used pages are not usable - thus
+ * we are rounding upwards:
+ */
+ system_start_pfn = min_low_pfn = PFN_UP(__pa(&_end));
+
+ find_max_pfn();
+ system_max_low_pfn = max_low_pfn = find_max_low_pfn();
+
+#ifdef CONFIG_HIGHMEM
+ highstart_pfn = highend_pfn = max_pfn;
+ if (max_pfn > system_max_low_pfn) {
+ highstart_pfn = system_max_low_pfn;
+ }
+ printk(KERN_NOTICE "%ldMB HIGHMEM available.\n",
+ pages_to_mb(highend_pfn - highstart_pfn));
+#endif
+ printk(KERN_NOTICE "%ldMB LOWMEM available.\n",
+ pages_to_mb(system_max_low_pfn));
+
+ for (nid = 0; nid < numnodes; nid++)
+ allocate_pgdat(nid);
+ for (nid = 0; nid < numnodes; nid++)
+ find_max_pfn_node(nid);
+
+ NODE_DATA(0)->bdata = &node0_bdata;
+
+ /*
+ * Initialize the boot-time allocator (with low memory only):
+ */
+ bootmap_size = init_bootmem_node(NODE_DATA(0), min_low_pfn, 0, system_max_low_pfn);
+
+ register_bootmem_low_pages(system_max_low_pfn);
+
+ /*
+ * Reserve the bootmem bitmap itself as well. We do this in two
+ * steps (first step was init_bootmem()) because this catches
+ * the (very unlikely) case of us accidentally initializing the
+ * bootmem allocator with an invalid RAM area.
+ */
+ reserve_bootmem_node(NODE_DATA(0), HIGH_MEMORY, (PFN_PHYS(min_low_pfn) +
+ bootmap_size + PAGE_SIZE-1) - (HIGH_MEMORY));
+
+ /*
+ * reserve physical page 0 - it's a special BIOS page on many boxes,
+ * enabling clean reboots, SMP operation, laptop functions.
+ */
+ reserve_bootmem_node(NODE_DATA(0), 0, PAGE_SIZE);
+
+ /*
+ * But first pinch a few for the stack/trampoline stuff
+ * FIXME: Don't need the extra page at 4K, but need to fix
+ * trampoline before removing it. (see the GDT stuff)
+ */
+ reserve_bootmem_node(NODE_DATA(0), PAGE_SIZE, PAGE_SIZE);
+
+#ifdef CONFIG_ACPI_SLEEP
+ /*
+ * Reserve low memory region for sleep support.
+ */
+ acpi_reserve_bootmem();
+#endif
+
+ /*
+ * Find and reserve possible boot-time SMP configuration:
+ */
+ find_smp_config();
+
+ /*insert other nodes into pgdat_list*/
+ for (nid = 1; nid < numnodes; nid++){
+ NODE_DATA(nid)->pgdat_next = pgdat_list;
+ pgdat_list = NODE_DATA(nid);
+ }
+
+
+#ifdef CONFIG_BLK_DEV_INITRD
+ if (LOADER_TYPE && INITRD_START) {
+ if (INITRD_START + INITRD_SIZE <= (system_max_low_pfn << PAGE_SHIFT)) {
+ reserve_bootmem_node(NODE_DATA(0), INITRD_START, INITRD_SIZE);
+ initrd_start =
+ INITRD_START ? INITRD_START + PAGE_OFFSET : 0;
+ initrd_end = initrd_start+INITRD_SIZE;
+ }
+ else {
+ printk(KERN_ERR "initrd extends beyond end of memory "
+ "(0x%08lx > 0x%08lx)\ndisabling initrd\n",
+ INITRD_START + INITRD_SIZE,
+ system_max_low_pfn << PAGE_SHIFT);
+ initrd_start = 0;
+ }
+ }
+#endif
+ return system_max_low_pfn;
+}
+
+void __init zone_sizes_init(void)
+{
+ int nid;
+
+ for (nid = 0; nid < numnodes; nid++) {
+ unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
+ unsigned int max_dma;
+
+ unsigned long low = max_low_pfn;
+ unsigned long start = node_start_pfn[nid];
+ unsigned long high = node_end_pfn[nid];
+
+ max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
+
+ if (start > low) {
+#ifdef CONFIG_HIGHMEM
+ zones_size[ZONE_HIGHMEM] = high - start;
+#endif
+ } else {
+ if (low < max_dma)
+ zones_size[ZONE_DMA] = low;
+ else {
+ zones_size[ZONE_DMA] = max_dma;
+ zones_size[ZONE_NORMAL] = low - max_dma;
+#ifdef CONFIG_HIGHMEM
+ zones_size[ZONE_HIGHMEM] = high - low;
+#endif
+ }
+ }
+ free_area_init_node(nid, NODE_DATA(nid), 0, zones_size, start, 0);
+ }
+ return;
+}
+
+void __init set_highmem_pages_init(int bad_ppro)
+{
+#ifdef CONFIG_HIGHMEM
+ int nid;
+
+ for (nid = 0; nid < numnodes; nid++) {
+ unsigned long node_pfn, node_high_size, zone_start_pfn;
+ struct page * zone_mem_map;
+
+ node_high_size = NODE_DATA(nid)->node_zones[ZONE_HIGHMEM].size;
+ zone_mem_map = NODE_DATA(nid)->node_zones[ZONE_HIGHMEM].zone_mem_map;
+ zone_start_pfn = NODE_DATA(nid)->node_zones[ZONE_HIGHMEM].zone_start_pfn;
+
+ printk("Initializing highpages for node %d\n", nid);
+ for (node_pfn = 0; node_pfn < node_high_size; node_pfn++) {
+ one_highpage_init((struct page *)(zone_mem_map + node_pfn),
+ zone_start_pfn + node_pfn, bad_ppro);
+ }
+ }
+ totalram_pages += totalhigh_pages;
+#endif
+}
+
+void __init set_max_mapnr_init(void)
+{
+#ifdef CONFIG_HIGHMEM
+ unsigned long lmax_mapnr;
+ int nid;
+
+ highmem_start_page = mem_map + NODE_DATA(0)->node_zones[ZONE_HIGHMEM].zone_start_mapnr;
+ num_physpages = highend_pfn;
+
+ for (nid = 0; nid < numnodes; nid++) {
+ lmax_mapnr = node_startnr(nid) + node_size(nid);
+ if (lmax_mapnr > max_mapnr) {
+ max_mapnr = lmax_mapnr;
+ }
+ }
+
+#else
+ max_mapnr = num_physpages = max_low_pfn;
+#endif
+}
diff -Nru a/arch/i386/mm/init.c b/arch/i386/mm/init.c
--- a/arch/i386/mm/init.c Wed Aug 28 07:37:36 2002
+++ b/arch/i386/mm/init.c Wed Aug 28 07:37:36 2002
@@ -213,29 +213,34 @@
pkmap_page_table = pte;
}
+void __init one_highpage_init(struct page *page, int pfn, int bad_ppro)
+{
+ if (!page_is_ram(pfn)) {
+ SetPageReserved(page);
+ return;
+ }
+ if (bad_ppro && page_kills_ppro(pfn)) {
+ SetPageReserved(page);
+ return;
+ }
+ ClearPageReserved(page);
+ set_bit(PG_highmem, &page->flags);
+ atomic_set(&page->count, 1);
+ __free_page(page);
+ totalhigh_pages++;
+}
+
+#ifndef CONFIG_DISCONTIGMEM
void __init set_highmem_pages_init(int bad_ppro)
{
int pfn;
- for (pfn = highstart_pfn; pfn < highend_pfn; pfn++) {
- struct page *page = mem_map + pfn;
-
- if (!page_is_ram(pfn)) {
- SetPageReserved(page);
- continue;
- }
- if (bad_ppro && page_kills_ppro(pfn))
- {
- SetPageReserved(page);
- continue;
- }
- ClearPageReserved(page);
- set_bit(PG_highmem, &page->flags);
- atomic_set(&page->count, 1);
- __free_page(page);
- totalhigh_pages++;
- }
+ for (pfn = highstart_pfn; pfn < highend_pfn; pfn++)
+ one_highpage_init(pfn_to_page(pfn), pfn, bad_ppro);
totalram_pages += totalhigh_pages;
}
+#else
+extern void set_highmem_pages_init(int);
+#endif /* !CONFIG_DISCONTIGMEM */
#else
#define kmap_init() do { } while (0)
@@ -309,6 +314,7 @@
flush_tlb_all();
}
+#ifndef CONFIG_DISCONTIGMEM
void __init zone_sizes_init(void)
{
unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
@@ -329,6 +335,9 @@
}
free_area_init(zones_size);
}
+#else
+extern void zone_sizes_init(void);
+#endif /* !CONFIG_DISCONTIGMEM */
/*
* paging_init() sets up the page tables - note that the first 8MB are
@@ -405,7 +414,23 @@
printk("Ok.\n");
}
}
-
+
+#ifndef CONFIG_DISCONTIGMEM
+static void __init set_max_mapnr_init(void)
+{
+#ifdef CONFIG_HIGHMEM
+ highmem_start_page = pfn_to_page(highstart_pfn);
+ max_mapnr = num_physpages = highend_pfn;
+#else
+ max_mapnr = num_physpages = max_low_pfn;
+#endif
+}
+#define __free_all_bootmem() free_all_bootmem()
+#else
+#define __free_all_bootmem() free_all_bootmem_node(NODE_DATA(0))
+extern void set_max_mapnr_init(void);
+#endif /* !CONFIG_DISCONTIGMEM */
+
void __init mem_init(void)
{
extern int ppro_with_ram_bug(void);
@@ -418,26 +443,22 @@
bad_ppro = ppro_with_ram_bug();
-#ifdef CONFIG_HIGHMEM
- highmem_start_page = mem_map + highstart_pfn;
- max_mapnr = num_physpages = highend_pfn;
-#else
- max_mapnr = num_physpages = max_low_pfn;
-#endif
+ set_max_mapnr_init();
+
high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
/* clear the zero-page */
memset(empty_zero_page, 0, PAGE_SIZE);
/* this will put all low memory onto the freelists */
- totalram_pages += free_all_bootmem();
+ totalram_pages += __free_all_bootmem();
reservedpages = 0;
for (tmp = 0; tmp < max_low_pfn; tmp++)
/*
* Only count reserved RAM pages
*/
- if (page_is_ram(tmp) && PageReserved(mem_map+tmp))
+ if (page_is_ram(tmp) && PageReserved(pfn_to_page(tmp)))
reservedpages++;
set_highmem_pages_init(bad_ppro);
diff -Nru a/arch/i386/mm/pgtable.c b/arch/i386/mm/pgtable.c
--- a/arch/i386/mm/pgtable.c Wed Aug 28 07:37:38 2002
+++ b/arch/i386/mm/pgtable.c Wed Aug 28 07:37:38 2002
@@ -22,24 +22,26 @@
void show_mem(void)
{
- int i, total = 0, reserved = 0;
+ int pfn, total = 0, reserved = 0;
int shared = 0, cached = 0;
int highmem = 0;
+ struct page *page;
printk("Mem-info:\n");
show_free_areas();
printk("Free swap: %6dkB\n",nr_swap_pages<<(PAGE_SHIFT-10));
- i = max_mapnr;
- while (i-- > 0) {
+ pfn = max_mapnr;
+ while (pfn-- > 0) {
+ page = pfn_to_page(pfn);
total++;
- if (PageHighMem(mem_map+i))
+ if (PageHighMem(page))
highmem++;
- if (PageReserved(mem_map+i))
+ if (PageReserved(page))
reserved++;
- else if (PageSwapCache(mem_map+i))
+ else if (PageSwapCache(page))
cached++;
- else if (page_count(mem_map+i))
- shared += page_count(mem_map+i) - 1;
+ else if (page_count(page))
+ shared += page_count(page) - 1;
}
printk("%d pages of RAM\n", total);
printk("%d pages of HIGHMEM\n",highmem);
diff -Nru a/arch/mips64/sgi-ip27/ip27-memory.c b/arch/mips64/sgi-ip27/ip27-memory.c
--- a/arch/mips64/sgi-ip27/ip27-memory.c Wed Aug 28 07:37:37 2002
+++ b/arch/mips64/sgi-ip27/ip27-memory.c Wed Aug 28 07:37:37 2002
@@ -253,7 +253,7 @@
zones_size[ZONE_DMA] = end_pfn + 1 - start_pfn;
free_area_init_node(node, NODE_DATA(node), 0, zones_size,
- start_pfn << PAGE_SHIFT, 0);
+ start_pfn, 0);
if ((PLAT_NODE_DATA_STARTNR(node) +
PLAT_NODE_DATA_SIZE(node)) > pagenr)
pagenr = PLAT_NODE_DATA_STARTNR(node) +
diff -Nru a/arch/ppc/mm/fault.c b/arch/ppc/mm/fault.c
--- a/arch/ppc/mm/fault.c Wed Aug 28 07:37:37 2002
+++ b/arch/ppc/mm/fault.c Wed Aug 28 07:37:37 2002
@@ -102,7 +102,7 @@
#endif /* !CONFIG_4xx */
#endif /* CONFIG_XMON || CONFIG_KGDB */
- if (in_interrupt() || mm == NULL) {
+ if (in_atomic() || mm == NULL) {
bad_page_fault(regs, address, SIGSEGV);
return;
}
diff -Nru a/arch/sh/mm/init.c b/arch/sh/mm/init.c
--- a/arch/sh/mm/init.c Wed Aug 28 07:37:37 2002
+++ b/arch/sh/mm/init.c Wed Aug 28 07:37:37 2002
@@ -123,11 +123,11 @@
zones_size[ZONE_DMA] = max_dma - start_pfn;
zones_size[ZONE_NORMAL] = low - max_dma;
}
- free_area_init_node(0, NODE_DATA(0), 0, zones_size, __MEMORY_START, 0);
+ free_area_init_node(0, NODE_DATA(0), 0, zones_size, __MEMORY_START >> PAGE_SHIFT, 0);
#ifdef CONFIG_DISCONTIGMEM
zones_size[ZONE_DMA] = __MEMORY_SIZE_2ND >> PAGE_SHIFT;
zones_size[ZONE_NORMAL] = 0;
- free_area_init_node(1, NODE_DATA(1), 0, zones_size, __MEMORY_START_2ND, 0);
+ free_area_init_node(1, NODE_DATA(1), 0, zones_size, __MEMORY_START_2ND >> PAGE_SHIFT, 0);
#endif
}
}
diff -Nru a/arch/sparc/mm/fault.c b/arch/sparc/mm/fault.c
--- a/arch/sparc/mm/fault.c Wed Aug 28 07:37:37 2002
+++ b/arch/sparc/mm/fault.c Wed Aug 28 07:37:37 2002
@@ -233,7 +233,7 @@
* If we're in an interrupt or have no user
* context, we must not take the fault..
*/
- if (in_interrupt() || !mm)
+ if (in_atomic() || !mm)
goto no_context;
down_read(&mm->mmap_sem);
diff -Nru a/arch/sparc/mm/srmmu.c b/arch/sparc/mm/srmmu.c
--- a/arch/sparc/mm/srmmu.c Wed Aug 28 07:37:37 2002
+++ b/arch/sparc/mm/srmmu.c Wed Aug 28 07:37:37 2002
@@ -1311,7 +1311,7 @@
zholes_size[ZONE_HIGHMEM] = npages - calc_highpages();
free_area_init_node(0, NULL, NULL, zones_size,
- phys_base, zholes_size);
+ phys_base >> PAGE_SHIFT, zholes_size);
}
/* P3: easy to fix, todo. Current code is utterly broken, though. */
diff -Nru a/arch/sparc/mm/sun4c.c b/arch/sparc/mm/sun4c.c
--- a/arch/sparc/mm/sun4c.c Wed Aug 28 07:37:36 2002
+++ b/arch/sparc/mm/sun4c.c Wed Aug 28 07:37:36 2002
@@ -2074,7 +2074,7 @@
zholes_size[ZONE_HIGHMEM] = npages - calc_highpages();
free_area_init_node(0, NULL, NULL, zones_size,
- phys_base, zholes_size);
+ phys_base >> PAGE_SHIFT, zholes_size);
}
cnt = 0;
diff -Nru a/arch/sparc64/mm/init.c b/arch/sparc64/mm/init.c
--- a/arch/sparc64/mm/init.c Wed Aug 28 07:37:37 2002
+++ b/arch/sparc64/mm/init.c Wed Aug 28 07:37:37 2002
@@ -1559,7 +1559,7 @@
zholes_size[ZONE_DMA] = npages - pages_avail;
free_area_init_node(0, NULL, NULL, zones_size,
- phys_base, zholes_size);
+ phys_base >> PAGE_SHIFT, zholes_size);
}
device_scan();
diff -Nru a/drivers/block/loop.c b/drivers/block/loop.c
--- a/drivers/block/loop.c Wed Aug 28 07:37:37 2002
+++ b/drivers/block/loop.c Wed Aug 28 07:37:37 2002
@@ -210,8 +210,7 @@
goto fail;
if (aops->prepare_write(file, page, offset, offset+size))
goto unlock;
- kaddr = page_address(page);
- flush_dcache_page(page);
+ kaddr = kmap(page);
transfer_result = lo_do_transfer(lo, WRITE, kaddr + offset, data, size, IV);
if (transfer_result) {
/*
@@ -221,6 +220,8 @@
printk(KERN_ERR "loop: transfer error block %ld\n", index);
memset(kaddr + offset, 0, size);
}
+ flush_dcache_page(page);
+ kunmap(page);
if (aops->commit_write(file, page, offset, offset+size))
goto unlock;
if (transfer_result)
diff -Nru a/drivers/block/rd.c b/drivers/block/rd.c
--- a/drivers/block/rd.c Wed Aug 28 07:37:37 2002
+++ b/drivers/block/rd.c Wed Aug 28 07:37:37 2002
@@ -45,12 +45,14 @@
#include <linux/config.h>
#include <linux/string.h>
#include <linux/slab.h>
-#include <asm/atomic.h>
+#include <linux/highmem.h>
#include <linux/bio.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/devfs_fs_kernel.h>
#include <linux/buffer_head.h> /* for invalidate_bdev() */
+#include <linux/backing-dev.h>
+
#include <asm/uaccess.h>
/*
@@ -73,10 +75,7 @@
int initrd_below_start_ok;
#endif
-/* Various static variables go here. Most are used only in the RAM disk code.
- */
-
-static unsigned long rd_length[NUM_RAMDISKS]; /* Size of RAM disks in bytes */
+static unsigned long rd_length[NUM_RAMDISKS]; /* Size of RAM disks in bytes */
static int rd_kbsize[NUM_RAMDISKS]; /* Size in blocks of 1024 bytes */
static devfs_handle_t devfs_handle;
static struct block_device *rd_bdev[NUM_RAMDISKS];/* Protected device data */
@@ -87,7 +86,7 @@
* architecture-specific setup routine (from the stored boot sector
* information).
*/
-int rd_size = CONFIG_BLK_DEV_RAM_SIZE; /* Size of the RAM disks */
+
/*
* It would be very desiderable to have a soft-blocksize (that in the case
* of the ramdisk driver is also the hardblocksize ;) of PAGE_SIZE because
@@ -101,68 +100,46 @@
*/
int rd_blocksize = BLOCK_SIZE; /* blocksize of the RAM disks */
+/* Size of the RAM disks */
+int rd_size = (CONFIG_BLK_DEV_RAM_SIZE + (PAGE_CACHE_SIZE >> 10) - 1) &
+ (PAGE_CACHE_MASK >> 10);
+
/*
* Copyright (C) 2000 Linus Torvalds.
* 2000 Transmeta Corp.
* aops copied from ramfs.
*/
-static int ramdisk_readpage(struct file *file, struct page * page)
-{
- if (!PageUptodate(page)) {
- memset(kmap(page), 0, PAGE_CACHE_SIZE);
- kunmap(page);
- flush_dcache_page(page);
- SetPageUptodate(page);
- }
- unlock_page(page);
- return 0;
-}
-static int ramdisk_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+static void wipe_page(struct page *page)
{
if (!PageUptodate(page)) {
- void *addr = page_address(page);
- memset(addr, 0, PAGE_CACHE_SIZE);
+ void *kaddr = kmap_atomic(page, KM_USER0);
+
+ memset(kaddr, 0, PAGE_CACHE_SIZE);
flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
SetPageUptodate(page);
}
- SetPageDirty(page);
- return 0;
}
-static int ramdisk_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+static int
+rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector, int minor)
{
- return 0;
-}
-
-static struct address_space_operations ramdisk_aops = {
- readpage: ramdisk_readpage,
- writepage: fail_writepage,
- prepare_write: ramdisk_prepare_write,
- commit_write: ramdisk_commit_write,
-};
-
-static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec,
- sector_t sector, int minor)
-{
- struct address_space * mapping;
+ struct address_space *mapping;
unsigned long index;
unsigned int vec_offset;
int offset, size, err;
err = 0;
mapping = rd_bdev[minor]->bd_inode->i_mapping;
-
index = sector >> (PAGE_CACHE_SHIFT - 9);
offset = (sector << 9) & ~PAGE_CACHE_MASK;
size = vec->bv_len;
vec_offset = vec->bv_offset;
do {
+ struct page *page;
int count;
- struct page * page;
- char * src, * dst;
- int unlock = 0;
count = PAGE_CACHE_SIZE - offset;
if (count > size)
@@ -172,50 +149,39 @@
page = find_get_page(mapping, index);
if (!page) {
page = grab_cache_page(mapping, index);
- err = -ENOMEM;
- if (!page)
+ if (!page) {
+ err = -ENOMEM;
goto out;
- err = 0;
-
- if (!PageUptodate(page)) {
- memset(kmap(page), 0, PAGE_CACHE_SIZE);
- kunmap(page);
- SetPageUptodate(page);
}
-
- unlock = 1;
+ wipe_page(page);
+ set_page_dirty(page);
+ unlock_page(page);
}
-
- index++;
-
- if (rw == READ) {
- src = kmap(page);
- src += offset;
- dst = kmap(vec->bv_page) + vec_offset;
- } else {
- dst = kmap(page);
- dst += offset;
- src = kmap(vec->bv_page) + vec_offset;
+ if (page != vec->bv_page || vec_offset != offset) {
+ if (rw == READ) {
+ char *src = kmap_atomic(page, KM_USER0);
+ char *dst = kmap_atomic(vec->bv_page, KM_USER1);
+
+ memcpy(dst + vec_offset, src + offset, count);
+ flush_dcache_page(vec->bv_page);
+ kunmap_atomic(src, KM_USER0);
+ kunmap_atomic(dst, KM_USER1);
+ } else {
+ char *src = kmap_atomic(vec->bv_page, KM_USER0);
+ char *dst = kmap_atomic(page, KM_USER1);
+
+ memcpy(dst + offset, src + vec_offset, count);
+ flush_dcache_page(page);
+ kunmap_atomic(vec->bv_page, KM_USER0);
+ kunmap_atomic(page, KM_USER1);
+ }
}
+ page_cache_release(page);
offset = 0;
vec_offset += count;
-
- memcpy(dst, src, count);
-
- kunmap(page);
- kunmap(vec->bv_page);
-
- if (rw == READ) {
- flush_dcache_page(page);
- } else {
- SetPageDirty(page);
- }
- if (unlock)
- unlock_page(page);
- __free_page(page);
+ index++;
} while (size);
-
- out:
+out:
return err;
}
@@ -243,42 +209,41 @@
* 19-JAN-1998 Richard Gooch <rgooch@atnf.csiro.au> Added devfs support
*
*/
-static int rd_make_request(request_queue_t * q, struct bio *sbh)
+static int rd_make_request(request_queue_t * q, struct bio *bio)
{
unsigned int minor;
unsigned long offset, len;
- int rw = sbh->bi_rw;
+ int rw = bio->bi_rw;
- minor = minor(to_kdev_t(sbh->bi_bdev->bd_dev));
+ minor = minor(to_kdev_t(bio->bi_bdev->bd_dev));
if (minor >= NUM_RAMDISKS)
goto fail;
- offset = sbh->bi_sector << 9;
- len = sbh->bi_size;
+ offset = bio->bi_sector << 9;
+ len = bio->bi_size;
if ((offset + len) > rd_length[minor])
goto fail;
- if (rw==READA)
- rw=READ;
- if ((rw != READ) && (rw != WRITE)) {
- printk(KERN_INFO "RAMDISK: bad command: %d\n", rw);
- goto fail;
- }
+ if (rw == READA)
+ rw = READ;
+ BUG_ON((rw != READ) && (rw != WRITE));
- if (rd_blkdev_bio_IO(sbh, minor))
+ if (rd_blkdev_bio_IO(bio, minor))
goto fail;
- set_bit(BIO_UPTODATE, &sbh->bi_flags);
- sbh->bi_end_io(sbh);
+ set_bit(BIO_UPTODATE, &bio->bi_flags);
+ bio->bi_end_io(bio);
return 0;
fail:
- bio_io_error(sbh);
+ bio_io_error(bio);
return 0;
}
-static int rd_ioctl(struct inode *inode, struct file *file, unsigned int cmd, unsigned long arg)
+static int
+rd_ioctl(struct inode *inode, struct file *file,
+ unsigned int cmd, unsigned long arg)
{
int error = -EINVAL;
unsigned int minor;
@@ -325,7 +290,6 @@
return count;
}
-
static int initrd_release(struct inode *inode,struct file *file)
{
extern void free_initrd_mem(unsigned long, unsigned long);
@@ -343,14 +307,19 @@
return 0;
}
-
static struct file_operations initrd_fops = {
- read: initrd_read,
- release: initrd_release,
+ .read = initrd_read,
+ .release = initrd_release,
};
#endif
+struct address_space_operations ramdisk_aops;
+
+static struct backing_dev_info rd_backing_dev_info = {
+ .ra_pages = 0, /* No readahead */
+ .memory_backed = 1, /* Does not contribute to dirty memory */
+};
static int rd_open(struct inode * inode, struct file * filp)
{
@@ -375,21 +344,23 @@
* Immunize device against invalidate_buffers() and prune_icache().
*/
if (rd_bdev[unit] == NULL) {
- rd_bdev[unit] = bdget(kdev_t_to_nr(inode->i_rdev));
- rd_bdev[unit]->bd_openers++;
- rd_bdev[unit]->bd_block_size = rd_blocksize;
- rd_bdev[unit]->bd_inode->i_mapping->a_ops = &ramdisk_aops;
- rd_bdev[unit]->bd_inode->i_size = rd_length[unit];
- rd_bdev[unit]->bd_queue = &blk_dev[MAJOR_NR].request_queue;
- }
+ struct block_device *b = bdget(kdev_t_to_nr(inode->i_rdev));
+ rd_bdev[unit] = b;
+ b->bd_openers++;
+ b->bd_block_size = rd_blocksize;
+ b->bd_inode->i_mapping->a_ops = &ramdisk_aops;
+ b->bd_inode->i_mapping->backing_dev_info = &rd_backing_dev_info;
+ b->bd_inode->i_size = rd_length[unit];
+ b->bd_queue = &blk_dev[MAJOR_NR].request_queue;
+ }
return 0;
}
static struct block_device_operations rd_bd_op = {
- owner: THIS_MODULE,
- open: rd_open,
- ioctl: rd_ioctl,
+ .owner = THIS_MODULE,
+ .open = rd_open,
+ .ioctl = rd_ioctl,
};
/* Before freeing the module, invalidate all of the protected buffers! */
@@ -411,6 +382,19 @@
blk_clear(MAJOR_NR);
}
+/*
+ * If someone writes a ramdisk page with submit_bh(), we have a dirty page
+ * with clean buffers. try_to_free_buffers() will then propagate the buffer
+ * cleanness up into page-cleaness and the VM will evict the page.
+ *
+ * To stop that happening, the ramdisk address_space has a ->releasepage()
+ * which always fails.
+ */
+static int fail_releasepage(struct page *page, int offset)
+{
+ return 0;
+}
+
/* This is the registration and initialization section of the RAM disk driver */
static int __init rd_init (void)
{
@@ -422,6 +406,9 @@
rd_blocksize);
rd_blocksize = BLOCK_SIZE;
}
+ ramdisk_aops = def_blk_aops;
+ ramdisk_aops.writepage = fail_writepage;
+ ramdisk_aops.releasepage = fail_releasepage;
if (register_blkdev(MAJOR_NR, "ramdisk", &rd_bd_op)) {
printk("RAMDISK: Could not get major %d", MAJOR_NR);
@@ -447,18 +434,18 @@
#ifdef CONFIG_BLK_DEV_INITRD
/* We ought to separate initrd operations here */
- register_disk(NULL, mk_kdev(MAJOR_NR,INITRD_MINOR), 1, &rd_bd_op, rd_size<<1);
+ register_disk(NULL, mk_kdev(MAJOR_NR,INITRD_MINOR),
+ 1, &rd_bd_op, rd_size<<1);
devfs_register(devfs_handle, "initrd", DEVFS_FL_DEFAULT, MAJOR_NR,
INITRD_MINOR, S_IFBLK | S_IRUSR, &rd_bd_op, NULL);
#endif
- blk_size[MAJOR_NR] = rd_kbsize; /* Size of the RAM disk in kB */
+ blk_size[MAJOR_NR] = rd_kbsize; /* Size of the RAM disk in kB */
/* rd_size is given in kB */
printk("RAMDISK driver initialized: "
"%d RAM disks of %dK size %d blocksize\n",
NUM_RAMDISKS, rd_size, rd_blocksize);
-
return 0;
}
@@ -487,9 +474,8 @@
#endif
/* options - modular */
-MODULE_PARM (rd_size, "1i");
+MODULE_PARM(rd_size, "1i");
MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
MODULE_PARM (rd_blocksize, "i");
MODULE_PARM_DESC(rd_blocksize, "Blocksize of each RAM disk in bytes.");
-
MODULE_LICENSE("GPL");
diff -Nru a/drivers/net/ns83820.c b/drivers/net/ns83820.c
--- a/drivers/net/ns83820.c Wed Aug 28 07:37:38 2002
+++ b/drivers/net/ns83820.c Wed Aug 28 07:37:38 2002
@@ -1081,7 +1081,7 @@
frag->page_offset,
frag->size, PCI_DMA_TODEVICE);
dprintk("frag: buf=%08Lx page=%08lx offset=%08lx\n",
- (long long)buf, (long)(frag->page - mem_map),
+ (long long)buf, (long) page_to_pfn(frag->page),
frag->page_offset);
len = frag->size;
frag++;
diff -Nru a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
--- a/drivers/scsi/scsi_scan.c Wed Aug 28 07:37:37 2002
+++ b/drivers/scsi/scsi_scan.c Wed Aug 28 07:37:37 2002
@@ -1379,6 +1379,12 @@
printk(KERN_INFO "scsi: unknown device type %d\n", sdev->type);
}
+ /*
+ * scsi_alloc_sdev did this, but do it again now that the device
+ * type is known, so the correct bounce limit can be set
+ */
+ scsi_initialize_merge_fn(sdev);
+
sdev->random = (sdev->type == TYPE_TAPE) ? 0 : 1;
print_inquiry(inq_result);
diff -Nru a/fs/Config.in b/fs/Config.in
--- a/fs/Config.in Wed Aug 28 07:37:37 2002
+++ b/fs/Config.in Wed Aug 28 07:37:37 2002
@@ -32,6 +32,7 @@
# dep_tristate ' Journal Block Device support (JBD for ext3)' CONFIG_JBD $CONFIG_EXT3_FS
define_bool CONFIG_JBD $CONFIG_EXT3_FS
dep_mbool ' JBD (ext3) debugging support' CONFIG_JBD_DEBUG $CONFIG_JBD
+dep_mbool ' Ext3 hashed index (htree) support' CONFIG_EXT3_INDEX $CONFIG_JBD
# msdos file systems
tristate 'DOS FAT fs support' CONFIG_FAT_FS
diff -Nru a/fs/affs/file.c b/fs/affs/file.c
--- a/fs/affs/file.c Wed Aug 28 07:37:37 2002
+++ b/fs/affs/file.c Wed Aug 28 07:37:37 2002
@@ -27,6 +27,7 @@
#include <linux/fs.h>
#include <linux/amigaffs.h>
#include <linux/mm.h>
+#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/buffer_head.h>
@@ -518,6 +519,7 @@
pr_debug("AFFS: read_page(%u, %ld, %d, %d)\n", (u32)inode->i_ino, page->index, from, to);
if (from > to || to > PAGE_CACHE_SIZE)
BUG();
+ kmap(page);
data = page_address(page);
bsize = AFFS_SB(sb)->s_data_blksize;
tmp = (page->index << PAGE_CACHE_SHIFT) + from;
@@ -537,6 +539,8 @@
from += tmp;
boff = 0;
}
+ flush_dcache_page(page);
+ kunmap(page);
return 0;
}
@@ -656,7 +660,11 @@
return err;
}
if (to < PAGE_CACHE_SIZE) {
- memset(page_address(page) + to, 0, PAGE_CACHE_SIZE - to);
+ char *kaddr = kmap_atomic(page, KM_USER0);
+
+ memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
+ flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
if (size > offset + to) {
if (size < offset + PAGE_CACHE_SIZE)
tmp = size & ~PAGE_CACHE_MASK;
diff -Nru a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c Wed Aug 28 07:37:37 2002
+++ b/fs/buffer.c Wed Aug 28 07:37:37 2002
@@ -22,6 +22,7 @@
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/smp_lock.h>
#include <linux/blkdev.h>
@@ -307,10 +308,7 @@
/* We need to protect against concurrent writers.. */
down(&inode->i_sem);
- ret = filemap_fdatawait(inode->i_mapping);
- err = filemap_fdatawrite(inode->i_mapping);
- if (!ret)
- ret = err;
+ ret = filemap_fdatawrite(inode->i_mapping);
err = file->f_op->fsync(file, dentry, 0);
if (!ret)
ret = err;
@@ -345,10 +343,7 @@
goto out_putf;
down(&inode->i_sem);
- ret = filemap_fdatawait(inode->i_mapping);
- err = filemap_fdatawrite(inode->i_mapping);
- if (!ret)
- ret = err;
+ ret = filemap_fdatawrite(inode->i_mapping);
err = file->f_op->fsync(file, dentry, 1);
if (!ret)
ret = err;
@@ -396,14 +391,21 @@
head = page_buffers(page);
bh = head;
do {
- if (bh->b_blocknr == block) {
+ if (bh->b_blocknr == block && buffer_mapped(bh)) {
ret = bh;
get_bh(bh);
goto out_unlock;
}
bh = bh->b_this_page;
} while (bh != head);
- buffer_error();
+ /*
+ * This path can happen if the page had some unmapped buffers, which
+ * will have b_blocknr == -1: for example, when a ramdisk mapping's
+ * page was brought partially uptodate by mkfs and
+ * unmap_underlying_metadata searches for blocks in a part of the page
+ * which mkfs didn't touch.
+ *
+ * buffer_error();
+ */
out_unlock:
spin_unlock(&bd_mapping->private_lock);
page_cache_release(page);
@@ -469,7 +471,7 @@
*/
static void free_more_memory(void)
{
- zone_t *zone;
+ struct zone *zone;
zone = contig_page_data.node_zonelists[GFP_NOFS & GFP_ZONEMASK].zones[0];
@@ -1517,7 +1519,7 @@
* @offset: the index of the truncation point
*
* block_invalidatepage() is called when all or part of the page has become
- * invalidatedby a truncate operation.
+ * invalidated by a truncate operation.
*
* block_invalidatepage() does not have to release all buffers, but it must
* ensure that no dirty buffer is left outside @offset and that no I/O
@@ -1648,11 +1650,18 @@
* the page lock, whoever dirtied the buffers may decide to clean them
* again at any time. We handle that by only looking at the buffer
* state inside lock_buffer().
+ *
+ * If block_write_full_page() is called for regular writeback
+ * (called_for_sync() is false) then it will return -EAGAIN for a locked
+ * buffer. This can only happen if someone has written the buffer directly,
+ * with submit_bh(). At the address_space level PageWriteback prevents this
+ * contention from occurring.
*/
static int __block_write_full_page(struct inode *inode,
struct page *page, get_block_t *get_block)
{
int err;
+ int ret = 0;
unsigned long block;
unsigned long last_block;
struct buffer_head *bh, *head;
@@ -1663,8 +1672,6 @@
last_block = (inode->i_size - 1) >> inode->i_blkbits;
if (!page_has_buffers(page)) {
- if (S_ISBLK(inode->i_mode))
- buffer_error();
if (!PageUptodate(page))
buffer_error();
create_empty_buffers(page, 1 << inode->i_blkbits,
@@ -1724,7 +1731,14 @@
do {
get_bh(bh);
if (buffer_mapped(bh) && buffer_dirty(bh)) {
- lock_buffer(bh);
+ if (called_for_sync()) {
+ lock_buffer(bh);
+ } else {
+ if (test_set_buffer_locked(bh)) {
+ ret = -EAGAIN;
+ continue;
+ }
+ }
if (test_clear_buffer_dirty(bh)) {
if (!buffer_uptodate(bh))
buffer_error();
@@ -1733,8 +1747,7 @@
unlock_buffer(bh);
}
}
- bh = bh->b_this_page;
- } while (bh != head);
+ } while ((bh = bh->b_this_page) != head);
BUG_ON(PageWriteback(page));
SetPageWriteback(page); /* Keeps try_to_free_buffers() away */
@@ -1774,7 +1787,10 @@
SetPageUptodate(page);
end_page_writeback(page);
}
+ if (err == 0)
+ return ret;
return err;
+
recover:
/*
* ENOSPC, or some other error. We may already have added some
@@ -1786,7 +1802,8 @@
bh = head;
/* Recovery: lock and submit the mapped buffers */
do {
- if (buffer_mapped(bh)) {
+ get_bh(bh);
+ if (buffer_mapped(bh) && buffer_dirty(bh)) {
lock_buffer(bh);
mark_buffer_async_write(bh);
} else {
@@ -1796,21 +1813,21 @@
*/
clear_buffer_dirty(bh);
}
- bh = bh->b_this_page;
- } while (bh != head);
+ } while ((bh = bh->b_this_page) != head);
+ SetPageError(page);
+ BUG_ON(PageWriteback(page));
+ SetPageWriteback(page);
+ unlock_page(page);
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
- set_buffer_uptodate(bh);
clear_buffer_dirty(bh);
submit_bh(WRITE, bh);
nr_underway++;
}
+ put_bh(bh);
bh = next;
} while (bh != head);
- BUG_ON(PageWriteback(page));
- SetPageWriteback(page);
- unlock_page(page);
goto done;
}
@@ -1822,7 +1839,6 @@
int err = 0;
unsigned blocksize, bbits;
struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
- char *kaddr = kmap(page);
BUG_ON(!PageLocked(page));
BUG_ON(from > PAGE_CACHE_SIZE);
@@ -1863,13 +1879,19 @@
set_buffer_uptodate(bh);
continue;
}
- if (block_end > to)
- memset(kaddr+to, 0, block_end-to);
- if (block_start < from)
- memset(kaddr+block_start,
- 0, from-block_start);
- if (block_end > to || block_start < from)
+ if (block_end > to || block_start < from) {
+ void *kaddr;
+
+ kaddr = kmap_atomic(page, KM_USER0);
+ if (block_end > to)
+ memset(kaddr+to, 0,
+ block_end-to);
+ if (block_start < from)
+ memset(kaddr+block_start,
+ 0, from-block_start);
flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
+ }
continue;
}
}
@@ -1908,10 +1930,14 @@
if (block_start >= to)
break;
if (buffer_new(bh)) {
+ void *kaddr;
+
clear_buffer_new(bh);
if (buffer_uptodate(bh))
buffer_error();
+ kaddr = kmap_atomic(page, KM_USER0);
memset(kaddr+block_start, 0, bh->b_size);
+ kunmap_atomic(kaddr, KM_USER0);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
}
@@ -1997,9 +2023,10 @@
SetPageError(page);
}
if (!buffer_mapped(bh)) {
- memset(kmap(page) + i*blocksize, 0, blocksize);
+ void *kaddr = kmap_atomic(page, KM_USER0);
+ memset(kaddr + i * blocksize, 0, blocksize);
flush_dcache_page(page);
- kunmap(page);
+ kunmap_atomic(kaddr, KM_USER0);
set_buffer_uptodate(bh);
continue;
}
@@ -2107,7 +2134,7 @@
long status;
unsigned zerofrom;
unsigned blocksize = 1 << inode->i_blkbits;
- char *kaddr;
+ void *kaddr;
while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) {
status = -ENOMEM;
@@ -2129,12 +2156,12 @@
PAGE_CACHE_SIZE, get_block);
if (status)
goto out_unmap;
- kaddr = page_address(new_page);
+ kaddr = kmap_atomic(new_page, KM_USER0);
memset(kaddr+zerofrom, 0, PAGE_CACHE_SIZE-zerofrom);
flush_dcache_page(new_page);
+ kunmap_atomic(kaddr, KM_USER0);
__block_commit_write(inode, new_page,
zerofrom, PAGE_CACHE_SIZE);
- kunmap(new_page);
unlock_page(new_page);
page_cache_release(new_page);
}
@@ -2159,21 +2186,20 @@
status = __block_prepare_write(inode, page, zerofrom, to, get_block);
if (status)
goto out1;
- kaddr = page_address(page);
if (zerofrom < offset) {
+ kaddr = kmap_atomic(page, KM_USER0);
memset(kaddr+zerofrom, 0, offset-zerofrom);
flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
__block_commit_write(inode, page, zerofrom, offset);
}
return 0;
out1:
ClearPageUptodate(page);
- kunmap(page);
return status;
out_unmap:
ClearPageUptodate(new_page);
- kunmap(new_page);
unlock_page(new_page);
page_cache_release(new_page);
out:
@@ -2185,10 +2211,8 @@
{
struct inode *inode = page->mapping->host;
int err = __block_prepare_write(inode, page, from, to, get_block);
- if (err) {
+ if (err)
ClearPageUptodate(page);
- kunmap(page);
- }
return err;
}
@@ -2196,7 +2220,6 @@
{
struct inode *inode = page->mapping->host;
__block_commit_write(inode,page,from,to);
- kunmap(page);
return 0;
}
@@ -2206,7 +2229,6 @@
struct inode *inode = page->mapping->host;
loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
__block_commit_write(inode,page,from,to);
- kunmap(page);
if (pos > inode->i_size) {
inode->i_size = pos;
mark_inode_dirty(inode);
@@ -2223,6 +2245,7 @@
struct inode *inode = mapping->host;
struct page *page;
struct buffer_head *bh;
+ void *kaddr;
int err;
blocksize = 1 << inode->i_blkbits;
@@ -2275,9 +2298,10 @@
goto unlock;
}
- memset(kmap(page) + offset, 0, length);
+ kaddr = kmap_atomic(page, KM_USER0);
+ memset(kaddr + offset, 0, length);
flush_dcache_page(page);
- kunmap(page);
+ kunmap_atomic(kaddr, KM_USER0);
mark_buffer_dirty(bh);
err = 0;
@@ -2297,7 +2321,7 @@
struct inode * const inode = page->mapping->host;
const unsigned long end_index = inode->i_size >> PAGE_CACHE_SHIFT;
unsigned offset;
- char *kaddr;
+ void *kaddr;
/* Is the page fully inside i_size? */
if (page->index < end_index)
@@ -2317,10 +2341,10 @@
* the page size, the remaining memory is zeroed when mapped, and
* writes to that region are not written out to the file."
*/
- kaddr = kmap(page);
+ kaddr = kmap_atomic(page, KM_USER0);
memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
flush_dcache_page(page);
- kunmap(page);
+ kunmap_atomic(kaddr, KM_USER0);
return __block_write_full_page(inode, page, get_block);
}
@@ -2495,7 +2519,7 @@
* This only applies in the rare case where try_to_free_buffers
* succeeds but the page is not freed.
*/
- ClearPageDirty(page);
+ clear_page_dirty(page);
}
spin_unlock(&mapping->private_lock);
out:
@@ -2537,9 +2561,45 @@
static kmem_cache_t *bh_cachep;
static mempool_t *bh_mempool;
+/*
+ * Once the number of bh's in the machine exceeds this level, we start
+ * stripping them in writeback.
+ */
+static int max_buffer_heads;
+
+int buffer_heads_over_limit;
+
+struct bh_accounting {
+ int nr; /* Number of live bh's */
+ int ratelimit; /* Limit cacheline bouncing */
+};
+
+static DEFINE_PER_CPU(struct bh_accounting, bh_accounting) = {0, 0};
+
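+/*
+ * recalc_bh_state() folds the per-cpu counts into the global
+ * buffer_heads_over_limit flag.  Summing the counters on every
+ * alloc/free would bounce cachelines between CPUs, so each CPU only
+ * triggers a recalculation once per 4096 operations; the flag is
+ * therefore approximate, which is acceptable for this purpose.
+ */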
+static void recalc_bh_state(void)
+{
+ int i;
+ int tot = 0;
+
+ if (__get_cpu_var(bh_accounting).ratelimit++ < 4096)
+ return;
+ __get_cpu_var(bh_accounting).ratelimit = 0;
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_possible(i))
+ continue;
+ tot += per_cpu(bh_accounting, i).nr;
+ }
+ buffer_heads_over_limit = (tot > max_buffer_heads);
+}
+
struct buffer_head *alloc_buffer_head(void)
{
- return mempool_alloc(bh_mempool, GFP_NOFS);
+ struct buffer_head *ret = mempool_alloc(bh_mempool, GFP_NOFS);
+ if (ret) {
+ __get_cpu_var(bh_accounting).nr++;
+ recalc_bh_state();
+ }
+ return ret;
}
EXPORT_SYMBOL(alloc_buffer_head);
@@ -2547,10 +2607,13 @@
{
BUG_ON(!list_empty(&bh->b_assoc_buffers));
mempool_free(bh, bh_mempool);
+ __get_cpu_var(bh_accounting).nr--;
+ recalc_bh_state();
}
EXPORT_SYMBOL(free_buffer_head);
-static void init_buffer_head(void *data, kmem_cache_t *cachep, unsigned long flags)
+static void
+init_buffer_head(void *data, kmem_cache_t *cachep, unsigned long flags)
{
if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
SLAB_CTOR_CONSTRUCTOR) {
@@ -2577,12 +2640,19 @@
void __init buffer_init(void)
{
int i;
+ int nrpages;
bh_cachep = kmem_cache_create("buffer_head",
sizeof(struct buffer_head), 0,
- SLAB_HWCACHE_ALIGN, init_buffer_head, NULL);
+ 0, init_buffer_head, NULL);
bh_mempool = mempool_create(MAX_UNUSED_BUFFERS, bh_mempool_alloc,
bh_mempool_free, NULL);
for (i = 0; i < ARRAY_SIZE(bh_wait_queue_heads); i++)
init_waitqueue_head(&bh_wait_queue_heads[i].wqh);
+
+ /*
+ * Limit the bh occupancy to 10% of ZONE_NORMAL
+ */
+ nrpages = (nr_free_buffer_pages() * 10) / 100;
+ max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
}
diff -Nru a/fs/driverfs/inode.c b/fs/driverfs/inode.c
--- a/fs/driverfs/inode.c Wed Aug 28 07:37:36 2002
+++ b/fs/driverfs/inode.c Wed Aug 28 07:37:36 2002
@@ -32,6 +32,7 @@
#include <linux/namei.h>
#include <linux/module.h>
#include <linux/slab.h>
+#include <linux/backing-dev.h>
#include <linux/driverfs_fs.h>
#include <asm/uaccess.h>
@@ -56,12 +57,19 @@
static spinlock_t mount_lock = SPIN_LOCK_UNLOCKED;
static int mount_count = 0;
+static struct backing_dev_info driverfs_backing_dev_info = {
+ .ra_pages = 0, /* No readahead */
+ .memory_backed = 1, /* Does not contribute to dirty memory */
+};
+
static int driverfs_readpage(struct file *file, struct page * page)
{
if (!PageUptodate(page)) {
- memset(kmap(page), 0, PAGE_CACHE_SIZE);
- kunmap(page);
+ void *kaddr = kmap_atomic(page, KM_USER0);
+
+ memset(kaddr, 0, PAGE_CACHE_SIZE);
flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
SetPageUptodate(page);
}
unlock_page(page);
@@ -70,10 +78,12 @@
static int driverfs_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
{
- void *addr = kmap(page);
if (!PageUptodate(page)) {
- memset(addr, 0, PAGE_CACHE_SIZE);
+ void *kaddr = kmap_atomic(page, KM_USER0);
+
+ memset(kaddr, 0, PAGE_CACHE_SIZE);
flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
SetPageUptodate(page);
}
return 0;
@@ -85,7 +95,6 @@
loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
set_page_dirty(page);
- kunmap(page);
if (pos > inode->i_size)
inode->i_size = pos;
return 0;
@@ -105,6 +114,7 @@
inode->i_rdev = NODEV;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
inode->i_mapping->a_ops = &driverfs_aops;
+ inode->i_mapping->backing_dev_info = &driverfs_backing_dev_info;
switch (mode & S_IFMT) {
default:
init_special_inode(inode, mode, dev);
diff -Nru a/fs/ext2/dir.c b/fs/ext2/dir.c
--- a/fs/ext2/dir.c Wed Aug 28 07:37:36 2002
+++ b/fs/ext2/dir.c Wed Aug 28 07:37:36 2002
@@ -571,8 +571,8 @@
struct page *page = grab_cache_page(mapping, 0);
unsigned chunk_size = ext2_chunk_size(inode);
struct ext2_dir_entry_2 * de;
- char *base;
int err;
+ void *kaddr;
if (!page)
return -ENOMEM;
@@ -581,22 +581,21 @@
unlock_page(page);
goto fail;
}
- base = page_address(page);
-
- de = (struct ext2_dir_entry_2 *) base;
+ kaddr = kmap_atomic(page, KM_USER0);
+ de = (struct ext2_dir_entry_2 *)kaddr;
de->name_len = 1;
de->rec_len = cpu_to_le16(EXT2_DIR_REC_LEN(1));
memcpy (de->name, ".\0\0", 4);
de->inode = cpu_to_le32(inode->i_ino);
ext2_set_de_type (de, inode);
- de = (struct ext2_dir_entry_2 *) (base + EXT2_DIR_REC_LEN(1));
+ de = (struct ext2_dir_entry_2 *)(kaddr + EXT2_DIR_REC_LEN(1));
de->name_len = 2;
de->rec_len = cpu_to_le16(chunk_size - EXT2_DIR_REC_LEN(1));
de->inode = cpu_to_le32(parent->i_ino);
memcpy (de->name, "..\0", 4);
ext2_set_de_type (de, inode);
-
+ kunmap_atomic(kaddr, KM_USER0);
err = ext2_commit_chunk(page, 0, chunk_size);
fail:
page_cache_release(page);
diff -Nru a/fs/ext3/Makefile b/fs/ext3/Makefile
--- a/fs/ext3/Makefile Wed Aug 28 07:37:36 2002
+++ b/fs/ext3/Makefile Wed Aug 28 07:37:36 2002
@@ -5,6 +5,6 @@
obj-$(CONFIG_EXT3_FS) += ext3.o
ext3-objs := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
- ioctl.o namei.o super.o symlink.o
+ ioctl.o namei.o super.o symlink.o hash.o
include $(TOPDIR)/Rules.make
diff -Nru a/fs/ext3/hash.c b/fs/ext3/hash.c
--- /dev/null Wed Dec 31 16:00:00 1969
+++ b/fs/ext3/hash.c Wed Aug 28 07:37:38 2002
@@ -0,0 +1,117 @@
+/*
+ * linux/fs/ext3/hash.c
+ *
+ * By Stephen C. Tweedie, 2002
+ * Copyright (C) 2002, Red Hat, Inc.
+ *
+ * This file is released under the GPL v2.
+ *
+ * MD4 hash from drivers/char/random.c,
+ * Copyright Theodore Ts'o, 1994, 1995, 1996, 1997, 1998, 1999. All
+ * rights reserved.
+ *
+ * Hash Tree Directory indexing porting
+ * Christopher Li, 2002
+ */
+
+#include <linux/fs.h>
+#include <linux/jbd.h>
+#include <linux/sched.h>
+#include <linux/ext3_fs.h>
+
+
+/* F, G and H are basic MD4 functions: selection, majority, parity */
+#define F(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))
+#define G(x, y, z) (((x) & (y)) + (((x) ^ (y)) & (z)))
+#define H(x, y, z) ((x) ^ (y) ^ (z))
+
+/*
+ * The generic round function. The application is so specific that
+ * we don't bother protecting all the arguments with parens, as is generally
+ * good macro practice, in favor of extra legibility.
+ * Rotation is separate from addition to prevent recomputation
+ */
+#define ROUND(f, a, b, c, d, x, s) \
+ (a += f(b, c, d) + x, a = (a << s) | (a >> (32-s)))
+#define K1 0
+#define K2 013240474631UL
+#define K3 015666365641UL
+
+/*
+ * Basic cut-down MD4 transform. Returns only 32 bits of result.
+ */
+static __u32 halfMD4Transform (__u32 buf[4], __u32 const in[8])
+{
+ __u32 a = buf[0], b = buf[1], c = buf[2], d = buf[3];
+
+ /* Round 1 */
+ ROUND(F, a, b, c, d, in[0] + K1, 3);
+ ROUND(F, d, a, b, c, in[1] + K1, 7);
+ ROUND(F, c, d, a, b, in[2] + K1, 11);
+ ROUND(F, b, c, d, a, in[3] + K1, 19);
+ ROUND(F, a, b, c, d, in[4] + K1, 3);
+ ROUND(F, d, a, b, c, in[5] + K1, 7);
+ ROUND(F, c, d, a, b, in[6] + K1, 11);
+ ROUND(F, b, c, d, a, in[7] + K1, 19);
+
+ /* Round 2 */
+ ROUND(G, a, b, c, d, in[1] + K2, 3);
+ ROUND(G, d, a, b, c, in[3] + K2, 5);
+ ROUND(G, c, d, a, b, in[5] + K2, 9);
+ ROUND(G, b, c, d, a, in[7] + K2, 13);
+ ROUND(G, a, b, c, d, in[0] + K2, 3);
+ ROUND(G, d, a, b, c, in[2] + K2, 5);
+ ROUND(G, c, d, a, b, in[4] + K2, 9);
+ ROUND(G, b, c, d, a, in[6] + K2, 13);
+
+ /* Round 3 */
+ ROUND(H, a, b, c, d, in[3] + K3, 3);
+ ROUND(H, d, a, b, c, in[7] + K3, 9);
+ ROUND(H, c, d, a, b, in[2] + K3, 11);
+ ROUND(H, b, c, d, a, in[6] + K3, 15);
+ ROUND(H, a, b, c, d, in[1] + K3, 3);
+ ROUND(H, d, a, b, c, in[5] + K3, 9);
+ ROUND(H, c, d, a, b, in[0] + K3, 11);
+ ROUND(H, b, c, d, a, in[4] + K3, 15);
+
+ /* Mix the new digest into the existing digest buffer (the
+ version in random.c does not do this --- it uses a static
+ digest seed for each hash.) */
+
+ buf[0] += a, buf[1] += b, buf[2] += c, buf[3] += d;
+
+ return buf[1] + b; /* "most hashed" word */
+ /* Alternative: return sum of all words? */
+}
+
+
+__u32 ext3_make_halfMD4_hash(const char *p, int len)
+{
+ __u32 buf[4];
+ __u32 hash = 0;
+
+ /* Initial MD4 digest seed, from the MD4 docs: */
+ buf[0] = 0x67452301;
+ buf[1] = 0xefcdab89;
+ buf[2] = 0x98badcfe;
+ buf[3] = 0x10325476;
+
+ while (len) {
+ if (len < 32) {
+ char pad_buffer[32];
+ /* Need to pad the input to 8 words for the hash */
+ memcpy(pad_buffer, p, len);
+ memset(pad_buffer+len, 0, 32-len);
+ hash = halfMD4Transform(buf, (__u32 *) pad_buffer);
+ break;
+ }
+
+ hash = halfMD4Transform(buf, (__u32 *) p);
+ len -= 32;
+ p += 32;
+ continue;
+ }
+
+ return hash;
+}
+
diff -Nru a/fs/ext3/inode.c b/fs/ext3/inode.c
--- a/fs/ext3/inode.c Wed Aug 28 07:37:37 2002
+++ b/fs/ext3/inode.c Wed Aug 28 07:37:37 2002
@@ -734,9 +734,9 @@
* The BKL may not be held on entry here. Be sure to take it early.
*/
-static int ext3_get_block_handle(handle_t *handle, struct inode *inode,
- sector_t iblock,
- struct buffer_head *bh_result, int create)
+static int
+ext3_get_block_handle(handle_t *handle, struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create, int extend_disksize)
{
int err = -EIO;
int offsets[4];
@@ -818,16 +818,17 @@
if (err)
goto cleanup;
- new_size = inode->i_size;
- /*
- * This is not racy against ext3_truncate's modification of i_disksize
- * because VM/VFS ensures that the file cannot be extended while
- * truncate is in progress. It is racy between multiple parallel
- * instances of get_block, but we have the BKL.
- */
- if (new_size > ei->i_disksize)
- ei->i_disksize = new_size;
-
+ if (extend_disksize) {
+ /*
+ * This is not racy against ext3_truncate's modification of
+ * i_disksize because VM/VFS ensures that the file cannot be
+ * extended while truncate is in progress. It is racy between
+ * multiple parallel instances of get_block, but we have BKL.
+ */
+ new_size = inode->i_size;
+ if (new_size > ei->i_disksize)
+ ei->i_disksize = new_size;
+ }
set_buffer_new(bh_result);
goto got_it;
@@ -851,10 +852,43 @@
handle = ext3_journal_current_handle();
J_ASSERT(handle != 0);
}
- ret = ext3_get_block_handle(handle, inode, iblock, bh_result, create);
+ ret = ext3_get_block_handle(handle, inode, iblock,
+ bh_result, create, 1);
return ret;
}
+#define DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32)
+
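+/*
+ * Block mapper for direct I/O: if the current transaction is running
+ * low on buffer credits, try to extend it by DIO_CREDITS, otherwise
+ * restart it, then map the block.  extend_disksize is 0 because
+ * ext3_direct_IO() updates i_disksize itself.
+ */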
+static int
+ext3_direct_io_get_blocks(struct inode *inode, sector_t iblock,
+ unsigned long max_blocks, struct buffer_head *bh_result,
+ int create)
+{
+ handle_t *handle = journal_current_handle();
+ int ret = 0;
+
+ lock_kernel();
+ if (handle && handle->h_buffer_credits <= EXT3_RESERVE_TRANS_BLOCKS) {
+ /*
+ * Getting low on buffer credits...
+ */
+ if (!ext3_journal_extend(handle, DIO_CREDITS)) {
+ /*
+ * Couldn't extend the transaction. Start a new one
+ */
+ ret = ext3_journal_restart(handle, DIO_CREDITS);
+ }
+ }
+ if (ret == 0)
+ ret = ext3_get_block_handle(handle, inode, iblock,
+ bh_result, create, 0);
+ if (ret == 0)
+ bh_result->b_size = (1 << inode->i_blkbits);
+ unlock_kernel();
+ return ret;
+}
+
+
/*
* `handle' can be NULL if create is zero
*/
@@ -869,7 +903,7 @@
dummy.b_state = 0;
dummy.b_blocknr = -1000;
buffer_trace_init(&dummy.b_history);
- *errp = ext3_get_block_handle(handle, inode, block, &dummy, create);
+ *errp = ext3_get_block_handle(handle, inode, block, &dummy, create, 1);
if (!*errp && buffer_mapped(&dummy)) {
struct buffer_head *bh;
bh = sb_getblk(inode->i_sb, dummy.b_blocknr);
@@ -1048,16 +1082,6 @@
if (ext3_should_journal_data(inode)) {
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, do_journal_get_write_access);
- if (ret) {
- /*
- * We're going to fail this prepare_write(),
- * so commit_write() will not be called.
- * We need to undo block_prepare_write()'s kmap().
- * AKPM: Do we need to clear PageUptodate? I don't
- * think so.
- */
- kunmap(page);
- }
}
prepare_write_failed:
if (ret)
@@ -1117,7 +1141,6 @@
from, to, &partial, commit_write_fn);
if (!partial)
SetPageUptodate(page);
- kunmap(page);
if (pos > inode->i_size)
inode->i_size = pos;
EXT3_I(inode)->i_state |= EXT3_STATE_JDATA;
@@ -1128,17 +1151,8 @@
}
/* Be careful here if generic_commit_write becomes a
* required invocation after block_prepare_write. */
- if (ret == 0) {
+ if (ret == 0)
ret = generic_commit_write(file, page, from, to);
- } else {
- /*
- * block_prepare_write() was called, but we're not
- * going to call generic_commit_write(). So we
- * need to perform generic_commit_write()'s kunmap
- * by hand.
- */
- kunmap(page);
- }
}
if (inode->i_size > EXT3_I(inode)->i_disksize) {
EXT3_I(inode)->i_disksize = inode->i_size;
@@ -1344,10 +1358,11 @@
/*
* We have to fail this writepage to avoid cross-fs transactions.
- * Put the page back on mapping->dirty_pages, but leave its buffer's
- * dirty state as-is.
+ * Return EAGAIN so the caller will put the page back on
+ * mapping->dirty_pages. The page's buffers' dirty state will be left
+ * as-is.
*/
- __set_page_dirty_nobuffers(page);
+ ret = -EAGAIN;
unlock_page(page);
return ret;
}
@@ -1376,17 +1391,83 @@
return journal_try_to_free_buffers(journal, page, wait);
}
+/*
+ * If the O_DIRECT write will extend the file then add this inode to the
+ * orphan list. So recovery will truncate it back to the original size
+ * if the machine crashes during the write.
+ *
+ * If the O_DIRECT write is instantiating holes inside i_size and the machine
+ * crashes then stale disk data _may_ be exposed inside the file.
+ */
+static int ext3_direct_IO(int rw, struct inode *inode, char *buf,
+ loff_t offset, size_t count)
+{
+ struct ext3_inode_info *ei = EXT3_I(inode);
+ handle_t *handle = NULL;
+ int ret;
+ int orphan = 0;
+
+ if (rw == WRITE) {
+ loff_t final_size = offset + count;
+
+ lock_kernel();
+ handle = ext3_journal_start(inode, DIO_CREDITS);
+ unlock_kernel();
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+ if (final_size > inode->i_size) {
+ lock_kernel();
+ ret = ext3_orphan_add(handle, inode);
+ unlock_kernel();
+ if (ret)
+ goto out_stop;
+ orphan = 1;
+ ei->i_disksize = inode->i_size;
+ }
+ }
+
+ ret = generic_direct_IO(rw, inode, buf, offset,
+ count, ext3_direct_io_get_blocks);
+
+out_stop:
+ if (handle) {
+ int err;
+
+ lock_kernel();
+ if (orphan)
+ ext3_orphan_del(handle, inode);
+ if (orphan && ret > 0) {
+ loff_t end = offset + ret;
+ if (end > inode->i_size) {
+ ei->i_disksize = end;
+ inode->i_size = end;
+ err = ext3_mark_inode_dirty(handle, inode);
+ if (!ret)
+ ret = err;
+ }
+ }
+ err = ext3_journal_stop(handle, inode);
+ if (ret == 0)
+ ret = err;
+ unlock_kernel();
+ }
+out:
+ return ret;
+}
struct address_space_operations ext3_aops = {
- .readpage = ext3_readpage, /* BKL not held. Don't need */
- .readpages = ext3_readpages, /* BKL not held. Don't need */
- .writepage = ext3_writepage, /* BKL not held. We take it */
+ .readpage = ext3_readpage, /* BKL not held. Don't need */
+ .readpages = ext3_readpages, /* BKL not held. Don't need */
+ .writepage = ext3_writepage, /* BKL not held. We take it */
.sync_page = block_sync_page,
.prepare_write = ext3_prepare_write, /* BKL not held. We take it */
.commit_write = ext3_commit_write, /* BKL not held. We take it */
.bmap = ext3_bmap, /* BKL held */
.invalidatepage = ext3_invalidatepage, /* BKL not held. Don't need */
.releasepage = ext3_releasepage, /* BKL not held. Don't need */
+ .direct_IO = ext3_direct_IO, /* BKL not held. Don't need */
};
/* For writeback mode, we can use mpage_writepages() */
@@ -1405,9 +1486,9 @@
}
struct address_space_operations ext3_writeback_aops = {
- .readpage = ext3_readpage, /* BKL not held. Don't need */
- .readpages = ext3_readpages, /* BKL not held. Don't need */
- .writepage = ext3_writepage, /* BKL not held. We take it */
+ .readpage = ext3_readpage, /* BKL not held. Don't need */
+ .readpages = ext3_readpages, /* BKL not held. Don't need */
+ .writepage = ext3_writepage, /* BKL not held. We take it */
.writepages = ext3_writepages, /* BKL not held. Don't need */
.sync_page = block_sync_page,
.prepare_write = ext3_prepare_write, /* BKL not held. We take it */
@@ -1415,6 +1496,7 @@
.bmap = ext3_bmap, /* BKL held */
.invalidatepage = ext3_invalidatepage, /* BKL not held. Don't need */
.releasepage = ext3_releasepage, /* BKL not held. Don't need */
+ .direct_IO = ext3_direct_IO, /* BKL not held. Don't need */
};
/*
@@ -1433,6 +1515,7 @@
struct page *page;
struct buffer_head *bh;
int err;
+ void *kaddr;
blocksize = inode->i_sb->s_blocksize;
length = offset & (blocksize - 1);
@@ -1488,10 +1571,11 @@
if (err)
goto unlock;
}
-
- memset(kmap(page) + offset, 0, length);
+
+ kaddr = kmap_atomic(page, KM_USER0);
+ memset(kaddr + offset, 0, length);
flush_dcache_page(page);
- kunmap(page);
+ kunmap_atomic(kaddr, KM_USER0);
BUFFER_TRACE(bh, "zeroed end of block");
diff -Nru a/fs/ext3/namei.c b/fs/ext3/namei.c
--- a/fs/ext3/namei.c Wed Aug 28 07:37:37 2002
+++ b/fs/ext3/namei.c Wed Aug 28 07:37:37 2002
@@ -16,9 +16,14 @@
* David S. Miller (davem@caip.rutgers.edu), 1995
* Directory entry file type support and forward compatibility hooks
* for B-tree directories by Theodore Ts'o (tytso@mit.edu), 1998
+ * Hash Tree Directory indexing (c)
+ * Daniel Phillips, 2001
+ * Hash Tree Directory indexing porting
+ * Christopher Li, 2002
*/
#include <linux/fs.h>
+#include <linux/pagemap.h>
#include <linux/jbd.h>
#include <linux/time.h>
#include <linux/ext3_fs.h>
@@ -39,6 +44,414 @@
#define NAMEI_RA_SIZE (NAMEI_RA_CHUNKS * NAMEI_RA_BLOCKS)
#define NAMEI_RA_INDEX(c,b) (((c) * NAMEI_RA_BLOCKS) + (b))
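+/*
+ * Append a new block to a directory: grow i_size and i_disksize by one
+ * block, read the new block in (create == 1) and take journal write
+ * access on it.  Returns the buffer_head, or NULL with *err set.
+ */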
+static struct buffer_head *ext3_append(handle_t *handle,
+ struct inode *inode,
+ u32 *block, int *err)
+{
+ struct buffer_head *bh;
+
+ *block = inode->i_size >> inode->i_sb->s_blocksize_bits;
+
+ if ((bh = ext3_bread(handle, inode, *block, 1, err))) {
+ inode->i_size += inode->i_sb->s_blocksize;
+ EXT3_I(inode)->i_disksize = inode->i_size;
+ ext3_journal_get_write_access(handle,bh);
+ }
+ return bh;
+}
+
+#ifndef assert
+#define assert(test) J_ASSERT(test)
+#endif
+
+#ifndef swap
+#define swap(x, y) do { typeof(x) z = x; x = y; y = z; } while (0)
+#endif
+
+typedef struct { u32 v; } le_u32;
+typedef struct { u16 v; } le_u16;
+
+#define dxtrace_on(command) command
+#define dxtrace_off(command)
+#define dxtrace dxtrace_off
+
+struct fake_dirent
+{
+ /*le*/u32 inode;
+ /*le*/u16 rec_len;
+ u8 name_len;
+ u8 file_type;
+};
+
+struct dx_countlimit
+{
+ le_u16 limit;
+ le_u16 count;
+};
+
+struct dx_entry
+{
+ le_u32 hash;
+ le_u32 block;
+};
+
+/*
+ * dx_root_info is laid out so that if it should somehow get overlaid by a
+ * dirent the two low bits of the hash version will be zero. Therefore, the
+ * hash version mod 4 should never be 0. Sincerely, the paranoia department.
+ */
+
+struct dx_root
+{
+ struct fake_dirent dot;
+ char dot_name[4];
+ struct fake_dirent dotdot;
+ char dotdot_name[4];
+ struct dx_root_info
+ {
+ le_u32 reserved_zero;
+ u8 hash_version;
+ u8 info_length; /* 8 */
+ u8 indirect_levels;
+ u8 unused_flags;
+ }
+ info;
+ struct dx_entry entries[0];
+};
+
+struct dx_node
+{
+ struct fake_dirent fake;
+ struct dx_entry entries[0];
+};
+
+
+struct dx_frame
+{
+ struct buffer_head *bh;
+ struct dx_entry *entries;
+ struct dx_entry *at;
+};
+
+struct dx_map_entry
+{
+ u32 hash;
+ u32 offs;
+};
+
+typedef struct ext3_dir_entry_2 ext3_dirent;
+
+
+#ifdef CONFIG_EXT3_INDEX
+#if 0
+static inline unsigned dx_get_block (struct dx_entry *entry);
+static void dx_set_block (struct dx_entry *entry, unsigned value);
+static inline unsigned dx_get_hash (struct dx_entry *entry);
+static void dx_set_hash (struct dx_entry *entry, unsigned value);
+static unsigned dx_get_count (struct dx_entry *entries);
+static unsigned dx_get_limit (struct dx_entry *entries);
+static void dx_set_count (struct dx_entry *entries, unsigned value);
+static void dx_set_limit (struct dx_entry *entries, unsigned value);
+static unsigned dx_root_limit (struct inode *dir, unsigned infosize);
+static unsigned dx_node_limit (struct inode *dir);
+static struct dx_frame *dx_probe (struct inode *dir, u32 hash, struct dx_frame *frame);
+static void dx_release (struct dx_frame *frames);
+static int dx_make_map (ext3_dirent *de, int size, struct dx_map_entry map[]);
+static void dx_sort_map(struct dx_map_entry *map, unsigned count);
+static ext3_dirent *dx_copy_dirents (char *from, char *to,
+ struct dx_map_entry *map, int count);
+static void dx_insert_block (struct dx_frame *frame, u32 hash, u32 block);
+#endif
+/*
+ * Future: use high four bits of block for coalesce-on-delete flags
+ * Mask them off for now.
+ */
+
+static inline unsigned dx_get_block (struct dx_entry *entry)
+{
+ return le32_to_cpu(entry->block.v) & 0x00ffffff;
+}
+
+static inline void dx_set_block (struct dx_entry *entry, unsigned value)
+{
+ entry->block.v = cpu_to_le32(value);
+}
+
+static inline unsigned dx_get_hash (struct dx_entry *entry)
+{
+ return le32_to_cpu(entry->hash.v);
+}
+
+static inline void dx_set_hash (struct dx_entry *entry, unsigned value)
+{
+ entry->hash.v = cpu_to_le32(value);
+}
+
+static inline unsigned dx_get_count (struct dx_entry *entries)
+{
+ return le16_to_cpu(((struct dx_countlimit *) entries)->count.v);
+}
+
+static inline unsigned dx_get_limit (struct dx_entry *entries)
+{
+ return le16_to_cpu(((struct dx_countlimit *) entries)->limit.v);
+}
+
+static inline void dx_set_count (struct dx_entry *entries, unsigned value)
+{
+ ((struct dx_countlimit *) entries)->count.v = cpu_to_le16(value);
+}
+
+static inline void dx_set_limit (struct dx_entry *entries, unsigned value)
+{
+ ((struct dx_countlimit *) entries)->limit.v = cpu_to_le16(value);
+}
+
+static inline unsigned dx_root_limit (struct inode *dir, unsigned infosize)
+{
+ unsigned entry_space = dir->i_sb->s_blocksize - EXT3_DIR_REC_LEN(1) -
+ EXT3_DIR_REC_LEN(2) - infosize;
+ return 0? 20: entry_space / sizeof(struct dx_entry);
+}
+
+static inline unsigned dx_node_limit (struct inode *dir)
+{
+ unsigned entry_space = dir->i_sb->s_blocksize - EXT3_DIR_REC_LEN(0);
+ return 0? 22: entry_space / sizeof(struct dx_entry);
+}
+
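+/*
+ * The low bit of each hash is reserved as a continuation marker:
+ * do_split() records hash2 + continued in the index, and lookups
+ * compare against (hash & -2), so entries whose identical hashes
+ * spill into the next leaf block can still be found.  Masking with
+ * ~1UL keeps real hashes even.
+ */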
+static inline __u32 dx_hash(const char *p, int len)
+{
+ return ext3_make_halfMD4_hash(p, len) & ~1UL;
+}
+
+/*
+ * Debug
+ */
+struct stats
+{
+ unsigned names;
+ unsigned space;
+ unsigned bcount;
+};
+
+static struct stats dx_show_leaf (ext3_dirent *de, int size, int show_names)
+{
+ unsigned names = 0, space = 0;
+ char *base = (char *) de;
+ printk("names: ");
+ while ((char *) de < base + size)
+ {
+ if (de->inode)
+ {
+ if (show_names)
+ {
+ int len = de->name_len;
+ char *name = de->name;
+ while (len--) printk("%c", *name++);
+ printk(":%x.%u ", dx_hash (de->name, de->name_len), ((char *) de - base));
+ }
+ space += EXT3_DIR_REC_LEN(de->name_len);
+ names++;
+ }
+ de = (ext3_dirent *) ((char *) de + le16_to_cpu(de->rec_len));
+ }
+ printk("(%i)\n", names);
+ return (struct stats) { names, space, 1 };
+}
+
+struct stats dx_show_entries (struct inode *dir, struct dx_entry *entries, int levels)
+{
+ unsigned blocksize = dir->i_sb->s_blocksize;
+ unsigned count = dx_get_count (entries), names = 0, space = 0, i;
+ unsigned bcount = 0;
+ struct buffer_head *bh;
+ int err;
+ printk("%i indexed blocks...\n", count);
+ for (i = 0; i < count; i++, entries++)
+ {
+ u32 block = dx_get_block(entries), hash = i? dx_get_hash(entries): 0;
+ u32 range = i < count - 1? (dx_get_hash(entries + 1) - hash): ~hash;
+ struct stats stats;
+ printk("%s%3u:%03u hash %8x/%8x ",levels?"":" ", i, block, hash, range);
+ if (!(bh = ext3_bread (NULL,dir, block, 0,&err))) continue;
+ stats = levels?
+ dx_show_entries (dir, ((struct dx_node *) bh->b_data)->entries, levels - 1):
+ dx_show_leaf ((ext3_dirent *) bh->b_data, blocksize, 0);
+ names += stats.names;
+ space += stats.space;
+ bcount += stats.bcount;
+ brelse (bh);
+ }
+ if (bcount)
+ printk("%snames %u, fullness %u (%u%%)\n", levels?"":" ",
+ names, space/bcount,(space/bcount)*100/blocksize);
+ return (struct stats) { names, space, bcount};
+}
+
+/*
+ * Probe for a directory leaf block to search
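+ *
+ * Fills the frame array with the path from the index root down to the
+ * leaf whose hash range covers @hash: each frame holds the index block
+ * (bh), its entry array (entries) and the matching entry (at).  With
+ * one indirect level, two frames are filled.  Returns the deepest
+ * frame on success, NULL on failure; the caller must dx_release() the
+ * frames when done.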
+ */
+
+static struct dx_frame *dx_probe (struct inode *dir, u32 hash, struct dx_frame *frame)
+{
+ unsigned count, indirect;
+ struct dx_entry *at, *entries, *p, *q, *m;
+ struct dx_root *root;
+ struct buffer_head *bh;
+ int err;
+ if (!(bh = ext3_bread (NULL,dir, 0, 0,&err)))
+ goto fail;
+ root = (struct dx_root *) bh->b_data;
+ if (root->info.hash_version != DX_HASH_HALF_MD4) {
+ ext3_warning(dir->i_sb, __FUNCTION__,
+ "Unrecognised inode hash code %d",
+ root->info.hash_version);
+ goto fail;
+ }
+
+ if (root->info.unused_flags & 1) {
+ ext3_warning(dir->i_sb, __FUNCTION__,
+ "Unimplemented inode hash flags: %#06x",
+ root->info.unused_flags);
+ goto fail;
+ }
+
+ if ((indirect = root->info.indirect_levels) > 1) {
+ ext3_warning(dir->i_sb, __FUNCTION__,
+ "Unimplemented inode hash depth: %#06x",
+ root->info.indirect_levels);
+ goto fail;
+ }
+
+ entries = (struct dx_entry *) (((char *) &root->info) + root->info.info_length);
+ assert (dx_get_limit(entries) == dx_root_limit(dir, root->info.info_length));
+ dxtrace (printk("Look up %x", hash));
+ while (1)
+ {
+ count = dx_get_count(entries);
+ assert (count && count <= dx_get_limit(entries));
+ p = entries + 1;
+ q = entries + count - 1;
+ while (p <= q)
+ {
+ m = p + (q - p)/2;
+ dxtrace(printk("."));
+ if (dx_get_hash(m) > hash)
+ q = m - 1;
+ else
+ p = m + 1;
+ }
+
+ if (0) // linear search cross check
+ {
+ unsigned n = count - 1;
+ at = entries;
+ while (n--)
+ {
+ dxtrace(printk(","));
+ if (dx_get_hash(++at) > hash)
+ {
+ at--;
+ break;
+ }
+ }
+ assert (at == p - 1);
+ }
+
+ at = p - 1;
+ dxtrace(printk(" %x->%u\n", at == entries? 0: dx_get_hash(at), dx_get_block(at)));
+ frame->bh = bh;
+ frame->entries = entries;
+ frame->at = at;
+ if (!indirect--) return frame;
+ if (!(bh = ext3_bread (NULL,dir, dx_get_block(at), 0,&err)))
+ goto fail2;
+ at = entries = ((struct dx_node *) bh->b_data)->entries;
+ assert (dx_get_limit(entries) == dx_node_limit (dir));
+ frame++;
+ }
+fail2:
+ brelse(frame->bh);
+fail:
+ return NULL;
+}
+
+static void dx_release (struct dx_frame *frames)
+{
+ if (((struct dx_root *) frames[0].bh->b_data)->info.indirect_levels)
+ brelse (frames[1].bh);
+ brelse (frames[0].bh);
+}
+
+/*
+ * Directory block splitting, compacting
+ */
+
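+/*
+ * Build a map of (hash, offset-in-block) pairs for every dirent in a
+ * leaf block, for sorting by dx_sort_map() ahead of a split.
+ */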
+static int dx_make_map (ext3_dirent *de, int size, struct dx_map_entry map[])
+{
+ int count = 0;
+ char *base = (char *) de;
+ while ((char *) de < base + size)
+ {
+ map[count].hash = dx_hash (de->name, de->name_len);
+ map[count].offs = (u32) ((char *) de - base);
+ de = (ext3_dirent *) ((char *) de + le16_to_cpu(de->rec_len));
+ count++;
+ }
+ return count;
+}
+
+static void dx_sort_map (struct dx_map_entry *map, unsigned count)
+{
+ struct dx_map_entry *p, *q, *top = map + count - 1;
+ int more;
+ /* Combsort until bubble sort doesn't suck */
+ while (count > 2)
+ {
+ count = count*10/13;
+ if (count - 9 < 2) /* 9, 10 -> 11 */
+ count = 11;
+ for (p = top, q = p - count; q >= map; p--, q--)
+ if (p->hash < q->hash)
+ swap(*p, *q);
+ }
+ /* Garden variety bubble sort */
+ do {
+ more = 0;
+ q = top;
+ while (q-- > map)
+ {
+ if (q[1].hash >= q[0].hash)
+ continue;
+ swap(*(q+1), *q);
+ more = 1;
+ }
+ } while(more);
+}
+
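+/*
+ * Insert a new (hash, block) pair into the index immediately after
+ * frame->at, sliding the later entries up by one slot.
+ */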
+static void dx_insert_block(struct dx_frame *frame, u32 hash, u32 block)
+{
+ struct dx_entry *entries = frame->entries;
+ struct dx_entry *old = frame->at, *new = old + 1;
+ int count = dx_get_count(entries);
+
+ assert(count < dx_get_limit(entries));
+ assert(old < entries + count);
+ memmove(new + 1, new, (char *)(entries + count) - (char *)(new));
+ dx_set_hash(new, hash);
+ dx_set_block(new, block);
+ dx_set_count(entries, count + 1);
+}
+#endif
+
+
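+/*
+ * Called wherever a directory is modified: if the filesystem was
+ * mounted without index support, clear the inode's EXT3_INDEX_FL so
+ * the directory is subsequently treated as a plain linear directory.
+ */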
+static void ext3_update_dx_flag(struct inode *inode)
+{
+ if (!test_opt(inode->i_sb, INDEX))
+ EXT3_I(inode)->i_flags &= ~EXT3_INDEX_FL;
+}
+
/*
* NOTE! unlike strncmp, ext3_match returns 1 for success, 0 for failure.
*
@@ -96,6 +509,15 @@
}
/*
+ * p is at least 6 bytes before the end of page
+ */
+static inline ext3_dirent *ext3_next_entry(ext3_dirent *p)
+{
+ return (ext3_dirent *)((char*)p + le16_to_cpu(p->rec_len));
+}
+
+
+/*
* ext3_find_entry()
*
* finds an entry in the specified directory with the wanted name. It
@@ -106,6 +528,8 @@
* The returned buffer_head has ->b_count elevated. The caller is expected
* to brelse() it when appropriate.
*/
+
+
static struct buffer_head * ext3_find_entry (struct dentry *dentry,
struct ext3_dir_entry_2 ** res_dir)
{
@@ -120,9 +544,78 @@
int num = 0;
int nblocks, i, err;
struct inode *dir = dentry->d_parent->d_inode;
-
+ int namelen;
+ const u8 *name;
+ unsigned blocksize;
+#ifdef CONFIG_EXT3_INDEX
+ ext3_dirent *de, *top;
+#endif
*res_dir = NULL;
sb = dir->i_sb;
+ blocksize = sb->s_blocksize;
+ namelen = dentry->d_name.len;
+ name = dentry->d_name.name;
+ if (namelen > EXT3_NAME_LEN)
+ return NULL;
+#ifdef CONFIG_EXT3_INDEX
+ if (ext3_dx && is_dx(dir)) {
+ u32 hash = dx_hash (name, namelen);
+ struct dx_frame frames[2], *frame;
+ if (!(frame = dx_probe (dir, hash, frames)))
+ return NULL;
+dxnext:
+ block = dx_get_block(frame->at);
+ if (!(bh = ext3_bread (NULL,dir, block, 0, &err)))
+ goto dxfail;
+ de = (ext3_dirent *) bh->b_data;
+ top = (ext3_dirent *) ((char *) de + blocksize -
+ EXT3_DIR_REC_LEN(0));
+ for (; de < top; de = ext3_next_entry(de))
+ if (ext3_match (namelen, name, de)) {
+ if (!ext3_check_dir_entry("ext3_find_entry",
+ dir, de, bh,
+ (block<<EXT3_BLOCK_SIZE_BITS(sb))
+ +((char *)de - bh->b_data))) {
+ brelse (bh);
+ goto dxfail;
+ }
+ *res_dir = de;
+ goto dxfound;
+ }
+ brelse (bh);
+ /* Same hash continues in next block? Search on. */
+ if (++(frame->at) == frame->entries + dx_get_count(frame->entries))
+ {
+ struct buffer_head *bh2;
+ if (frame == frames)
+ goto dxfail;
+ if (++(frames->at) == frames->entries + dx_get_count(frames->entries))
+ goto dxfail;
+ /* should omit read if not continued */
+ if (!(bh2 = ext3_bread (NULL, dir,
+ dx_get_block(frames->at),
+ 0, &err)))
+ goto dxfail;
+ brelse (frame->bh);
+ frame->bh = bh2;
+ frame->at = frame->entries = ((struct dx_node *) bh2->b_data)->entries;
+ /* Subtle: the 0th entry has the count, find the hash in frame above */
+ if ((dx_get_hash(frames->at) & -2) == hash)
+ goto dxnext;
+ goto dxfail;
+ }
+ if ((dx_get_hash(frame->at) & -2) == hash)
+ goto dxnext;
+dxfail:
+ dxtrace(printk("%s not found\n", name));
+ dx_release (frames);
+ return NULL;
+dxfound:
+ dx_release (frames);
+ return bh;
+
+ }
+#endif
nblocks = dir->i_size >> EXT3_BLOCK_SIZE_BITS(sb);
start = EXT3_I(dir)->i_dir_start_lookup;
@@ -281,6 +774,88 @@
de->file_type = ext3_type_by_mode[(mode & S_IFMT)>>S_SHIFT];
}
+#ifdef CONFIG_EXT3_INDEX
+static ext3_dirent *
+dx_copy_dirents (char *from, char *to, struct dx_map_entry *map, int count)
+{
+ unsigned rec_len = 0;
+
+ while (count--) {
+ ext3_dirent *de = (ext3_dirent *) (from + map->offs);
+ rec_len = EXT3_DIR_REC_LEN(de->name_len);
+ memcpy (to, de, rec_len);
+ ((ext3_dirent *) to)->rec_len = cpu_to_le16(rec_len);
+ to += rec_len;
+ map++;
+ }
+ return (ext3_dirent *) (to - rec_len);
+}
+
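+/*
+ * Split a full leaf block: allocate a new block with ext3_append(),
+ * map and hash-sort the existing dirents, move the upper half into the
+ * new block, and hook the new block into the index via
+ * dx_insert_block().  Returns a dirent within the block which should
+ * receive the new entry, or NULL (with *bh released) on failure.
+ */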
+static ext3_dirent *do_split(handle_t *handle, struct inode *dir,
+ struct buffer_head **bh,struct dx_frame *frame,
+ u32 hash, int *error)
+{
+ unsigned blocksize = dir->i_sb->s_blocksize;
+ unsigned count, continued;
+ struct buffer_head *bh2;
+ u32 newblock;
+ unsigned MAX_DX_MAP = PAGE_CACHE_SIZE/EXT3_DIR_REC_LEN(1) + 1;
+ u32 hash2;
+ struct dx_map_entry map[MAX_DX_MAP];
+ char *data1 = (*bh)->b_data, *data2, *data3;
+ unsigned split;
+ ext3_dirent *de, *de2;
+
+ bh2 = ext3_append (handle, dir, &newblock, error);
+ if (!(bh2))
+ {
+ brelse(*bh);
+ *bh = NULL;
+ return (ext3_dirent *)bh2;
+ }
+
+ BUFFER_TRACE(*bh, "get_write_access");
+ ext3_journal_get_write_access(handle, *bh);
+ BUFFER_TRACE(frame->bh, "get_write_access");
+ ext3_journal_get_write_access(handle, frame->bh);
+
+ data2 = bh2->b_data;
+
+ count = dx_make_map ((ext3_dirent *) data1, blocksize, map);
+ split = count/2; // need to adjust to actual middle
+ dx_sort_map (map, count);
+ hash2 = map[split].hash;
+ continued = hash2 == map[split - 1].hash;
+ dxtrace(printk("Split block %i at %x, %i/%i\n",
+ dx_get_block(frame->at), hash2, split, count-split));
+
+ /* Fancy dance to stay within two buffers */
+ de2 = dx_copy_dirents (data1, data2, map + split, count - split);
+ data3 = (char *) de2 + le16_to_cpu(de2->rec_len);
+ de = dx_copy_dirents (data1, data3, map, split);
+ memcpy(data1, data3, (char *) de + le16_to_cpu(de->rec_len) - data3);
+ de = (ext3_dirent *) ((char *) de - data3 + data1); // relocate de
+ de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+ de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+ dxtrace(dx_show_leaf ((ext3_dirent *) data1, blocksize, 1));
+ dxtrace(dx_show_leaf ((ext3_dirent *) data2, blocksize, 1));
+
+ /* Which block gets the new entry? */
+ if (hash >= hash2)
+ {
+ swap(*bh, bh2);
+ de = de2;
+ }
+ dx_insert_block (frame, hash2 + continued, newblock);
+ ext3_journal_dirty_metadata (handle, bh2);
+ brelse (bh2);
+ ext3_journal_dirty_metadata (handle, frame->bh);
+ dxtrace(dx_show_index ("frame", frame->entries));
+ return de;
+}
+#endif
+
+
/*
* ext3_add_entry()
*
@@ -295,6 +870,7 @@
/*
* AKPM: the journalling code here looks wrong on the error paths
*/
+
static int ext3_add_entry (handle_t *handle, struct dentry *dentry,
struct inode *inode)
{
@@ -302,115 +878,283 @@
const char *name = dentry->d_name.name;
int namelen = dentry->d_name.len;
unsigned long offset;
- unsigned short rec_len;
struct buffer_head * bh;
- struct ext3_dir_entry_2 * de, * de1;
- struct super_block * sb;
+ ext3_dirent *de;
+ struct super_block * sb = dir->i_sb;
int retval;
+ unsigned short reclen = EXT3_DIR_REC_LEN(namelen);
- sb = dir->i_sb;
+ unsigned blocksize = sb->s_blocksize;
+ unsigned nlen, rlen;
+ u32 block, blocks;
+ char *top;
if (!namelen)
return -EINVAL;
- bh = ext3_bread (handle, dir, 0, 0, &retval);
- if (!bh)
- return retval;
- rec_len = EXT3_DIR_REC_LEN(namelen);
- offset = 0;
- de = (struct ext3_dir_entry_2 *) bh->b_data;
- while (1) {
- if ((char *)de >= sb->s_blocksize + bh->b_data) {
- brelse (bh);
- bh = NULL;
- bh = ext3_bread (handle, dir,
- offset >> EXT3_BLOCK_SIZE_BITS(sb), 1, &retval);
- if (!bh)
- return retval;
- if (dir->i_size <= offset) {
- if (dir->i_size == 0) {
- brelse(bh);
- return -ENOENT;
+#ifdef CONFIG_EXT3_INDEX
+ if (ext3_dx && is_dx(dir)) {
+ struct dx_frame frames[2], *frame;
+ struct dx_entry *entries, *at;
+ u32 hash;
+ char *data1;
+
+ hash = dx_hash (name, namelen);
+ frame = dx_probe (dir, hash, frames); // do something if null
+ entries = frame->entries;
+ at = frame->at;
+
+ if (!(bh = ext3_bread (handle,dir, dx_get_block(frame->at), 0,&retval)))
+ goto dxfail1;
+
+ BUFFER_TRACE(bh, "get_write_access");
+ ext3_journal_get_write_access(handle, bh);
+
+ data1 = bh->b_data;
+ de = (ext3_dirent *) data1;
+ top = data1 + (0? 200: blocksize);
+ while ((char *) de < top)
+ {
+ /* FIXME: check EEXIST and dir */
+ nlen = EXT3_DIR_REC_LEN(de->name_len);
+ rlen = le16_to_cpu(de->rec_len);
+ if ((de->inode? rlen - nlen: rlen) >= reclen)
+ goto dx_add;
+ de = (ext3_dirent *) ((char *) de + rlen);
+ }
+ /* Block full, should compress but for now just split */
+ dxtrace(printk("using %u of %u node entries\n",
+ dx_get_count(entries), dx_get_limit(entries)));
+ /* Need to split index? */
+ if (dx_get_count(entries) == dx_get_limit(entries))
+ {
+ u32 newblock;
+ unsigned icount = dx_get_count(entries);
+ int levels = frame - frames;
+ struct dx_entry *entries2;
+ struct dx_node *node2;
+ struct buffer_head *bh2;
+ if (levels && dx_get_count(frames->entries) == dx_get_limit(frames->entries))
+ goto dxfull;
+ bh2 = ext3_append (handle, dir, &newblock, &retval);
+ if (!(bh2))
+ goto dxfail2;
+ node2 = (struct dx_node *)(bh2->b_data);
+ entries2 = node2->entries;
+ node2->fake.rec_len = cpu_to_le16(blocksize);
+ node2->fake.inode = 0;
+ BUFFER_TRACE(frame->bh, "get_write_access");
+ ext3_journal_get_write_access(handle, frame->bh);
+ if (levels)
+ {
+ unsigned icount1 = icount/2, icount2 = icount - icount1;
+ unsigned hash2 = dx_get_hash(entries + icount1);
+ dxtrace(printk("Split index %i/%i\n", icount1, icount2));
+
+ BUFFER_TRACE(frame->bh, "get_write_access"); /* index root */
+ ext3_journal_get_write_access(handle, frames[0].bh);
+
+ memcpy ((char *) entries2, (char *) (entries + icount1),
+ icount2 * sizeof(struct dx_entry));
+ dx_set_count (entries, icount1);
+ dx_set_count (entries2, icount2);
+ dx_set_limit (entries2, dx_node_limit(dir));
+
+ /* Which index block gets the new entry? */
+ if (at - entries >= icount1) {
+ frame->at = at = at - entries - icount1 + entries2;
+ frame->entries = entries = entries2;
+ swap(frame->bh, bh2);
}
-
- ext3_debug ("creating next block\n");
-
- BUFFER_TRACE(bh, "get_write_access");
- ext3_journal_get_write_access(handle, bh);
- de = (struct ext3_dir_entry_2 *) bh->b_data;
- de->inode = 0;
- de->rec_len = le16_to_cpu(sb->s_blocksize);
- EXT3_I(dir)->i_disksize =
- dir->i_size = offset + sb->s_blocksize;
- EXT3_I(dir)->i_flags &= ~EXT3_INDEX_FL;
- ext3_mark_inode_dirty(handle, dir);
+ dx_insert_block (frames + 0, hash2, newblock);
+ dxtrace(dx_show_index ("node", frames[1].entries));
+ dxtrace(dx_show_index ("node",
+ ((struct dx_node *) bh2->b_data)->entries));
+ ext3_journal_dirty_metadata(handle, bh2);
+ brelse (bh2);
} else {
-
- ext3_debug ("skipping to next block\n");
-
- de = (struct ext3_dir_entry_2 *) bh->b_data;
+ dxtrace(printk("Creating second level index...\n"));
+ memcpy((char *) entries2, (char *) entries,
+ icount * sizeof(struct dx_entry));
+ dx_set_limit(entries2, dx_node_limit(dir));
+
+ /* Set up root */
+ dx_set_count(entries, 1);
+ dx_set_block(entries + 0, newblock);
+ ((struct dx_root *) frames[0].bh->b_data)->info.indirect_levels = 1;
+
+ /* Add new access path frame */
+ frame = frames + 1;
+ frame->at = at = at - entries + entries2;
+ frame->entries = entries = entries2;
+ frame->bh = bh2;
+ ext3_journal_get_write_access(handle, frame->bh);
}
+ ext3_journal_dirty_metadata(handle, frames[0].bh);
}
- if (!ext3_check_dir_entry ("ext3_add_entry", dir, de, bh,
- offset)) {
- brelse (bh);
- return -ENOENT;
- }
- if (ext3_match (namelen, name, de)) {
+ de = do_split(handle, dir, &bh, frame, hash, &retval);
+ dx_release (frames);
+ if (!(de))
+ goto fail;
+ nlen = EXT3_DIR_REC_LEN(de->name_len);
+ rlen = le16_to_cpu(de->rec_len);
+ goto add;
+
+dx_add:
+ dx_release (frames);
+ goto add;
+
+dxfull:
+ ext3_warning(sb, __FUNCTION__, "Directory index full!\n");
+ retval = -ENOSPC;
+dxfail2:
+ brelse(bh);
+dxfail1:
+ dx_release (frames);
+ goto fail1;
+ }
+#endif
+ blocks = dir->i_size >> sb->s_blocksize_bits;
+ for (block = 0, offset = 0; block < blocks; block++) {
+ bh = ext3_bread(handle, dir, block, 0, &retval);
+ if(!bh)
+ return retval;
+ de = (ext3_dirent *)bh->b_data;
+ top = bh->b_data + blocksize - reclen;
+ while ((char *) de <= top) {
+ if (!ext3_check_dir_entry("ext3_add_entry", dir, de,
+ bh, offset)) {
+ brelse (bh);
+ return -EIO;
+ }
+ if (ext3_match (namelen, name, de)) {
brelse (bh);
return -EEXIST;
- }
- if ((le32_to_cpu(de->inode) == 0 &&
- le16_to_cpu(de->rec_len) >= rec_len) ||
- (le16_to_cpu(de->rec_len) >=
- EXT3_DIR_REC_LEN(de->name_len) + rec_len)) {
- BUFFER_TRACE(bh, "get_write_access");
- ext3_journal_get_write_access(handle, bh);
- /* By now the buffer is marked for journaling */
- offset += le16_to_cpu(de->rec_len);
- if (le32_to_cpu(de->inode)) {
- de1 = (struct ext3_dir_entry_2 *) ((char *) de +
- EXT3_DIR_REC_LEN(de->name_len));
- de1->rec_len =
- cpu_to_le16(le16_to_cpu(de->rec_len) -
- EXT3_DIR_REC_LEN(de->name_len));
- de->rec_len = cpu_to_le16(
- EXT3_DIR_REC_LEN(de->name_len));
- de = de1;
}
- de->file_type = EXT3_FT_UNKNOWN;
- if (inode) {
- de->inode = cpu_to_le32(inode->i_ino);
- ext3_set_de_type(dir->i_sb, de, inode->i_mode);
- } else
- de->inode = 0;
- de->name_len = namelen;
- memcpy (de->name, name, namelen);
- /*
- * XXX shouldn't update any times until successful
- * completion of syscall, but too many callers depend
- * on this.
- *
- * XXX similarly, too many callers depend on
- * ext3_new_inode() setting the times, but error
- * recovery deletes the inode, so the worst that can
- * happen is that the times are slightly out of date
- * and/or different from the directory change time.
- */
- dir->i_mtime = dir->i_ctime = CURRENT_TIME;
- EXT3_I(dir)->i_flags &= ~EXT3_INDEX_FL;
- ext3_mark_inode_dirty(handle, dir);
- dir->i_version = ++event;
- BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
- ext3_journal_dirty_metadata(handle, bh);
+ nlen = EXT3_DIR_REC_LEN(de->name_len);
+ rlen = le16_to_cpu(de->rec_len);
+ if ((de->inode? rlen - nlen: rlen) >= reclen)
+ goto add;
+ de = (ext3_dirent *)((char *)de + rlen);
+ offset += rlen;
+ }
+#ifdef CONFIG_EXT3_INDEX
+ if (ext3_dx && blocks == 1 && test_opt(sb, INDEX))
+ goto dx_make_index;
+#endif
+ brelse(bh);
+ }
+ bh = ext3_append(handle, dir, &block, &retval);
+ if (!bh)
+ return retval;
+ de = (ext3_dirent *) bh->b_data;
+ de->inode = 0;
+ de->rec_len = cpu_to_le16(rlen = blocksize);
+ nlen = 0;
+ goto add;
+
+add:
+ BUFFER_TRACE(bh, "get_write_access");
+ ext3_journal_get_write_access(handle, bh);
+ /* By now the buffer is marked for journaling */
+ if (de->inode) {
+ ext3_dirent *de1 = (ext3_dirent *)((char *)de + nlen);
+ de1->rec_len = cpu_to_le16(rlen - nlen);
+ de->rec_len = cpu_to_le16(nlen);
+ de = de1;
+ }
+ de->file_type = EXT3_FT_UNKNOWN;
+ if (inode) {
+ de->inode = cpu_to_le32(inode->i_ino);
+ ext3_set_de_type(dir->i_sb, de, inode->i_mode);
+ } else
+ de->inode = 0;
+ de->name_len = namelen;
+ memcpy (de->name, name, namelen);
+ /*
+ * XXX shouldn't update any times until successful
+ * completion of syscall, but too many callers depend
+ * on this.
+ *
+ * XXX similarly, too many callers depend on
+ * ext3_new_inode() setting the times, but error
+ * recovery deletes the inode, so the worst that can
+ * happen is that the times are slightly out of date
+ * and/or different from the directory change time.
+ */
+ dir->i_mtime = dir->i_ctime = CURRENT_TIME;
+ ext3_update_dx_flag(dir);
+ ext3_mark_inode_dirty(handle, dir);
+ dir->i_version = ++event;
+ BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
+ ext3_journal_dirty_metadata(handle, bh);
+ brelse(bh);
+ return 0;
+
+#ifdef CONFIG_EXT3_INDEX
+dx_make_index:
+ {
+ struct buffer_head *bh2;
+ struct dx_root *root;
+ struct dx_frame frames[2], *frame;
+ struct dx_entry *entries;
+ ext3_dirent *de2;
+ char *data1;
+ unsigned len;
+ u32 hash;
+
+ dxtrace(printk("Creating index\n"));
+ ext3_journal_get_write_access(handle, bh);
+ root = (struct dx_root *) bh->b_data;
+
+ EXT3_I(dir)->i_flags |= EXT3_INDEX_FL;
+ bh2 = ext3_append (handle, dir, &block, &retval);
+ if (!(bh2))
+ {
brelse(bh);
- return 0;
+ return retval;
}
- offset += le16_to_cpu(de->rec_len);
- de = (struct ext3_dir_entry_2 *)
- ((char *) de + le16_to_cpu(de->rec_len));
+ data1 = bh2->b_data;
+
+ /* The 0th block becomes the root, move the dirents out */
+ de = (ext3_dirent *) &root->info;
+ len = ((char *) root) + blocksize - (char *) de;
+ memcpy (data1, de, len);
+ de = (ext3_dirent *) data1;
+ top = data1 + len;
+ while ((char *) (de2 = ext3_next_entry(de)) < top)
+ de = de2;
+ de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+ /* Initialize the root; the dot dirents already exist */
+ de = (ext3_dirent *) (&root->dotdot);
+ de->rec_len = cpu_to_le16(blocksize - EXT3_DIR_REC_LEN(2));
+ memset (&root->info, 0, sizeof(root->info));
+ root->info.info_length = sizeof(root->info);
+ root->info.hash_version = DX_HASH_HALF_MD4;
+ entries = root->entries;
+ dx_set_block (entries, 1);
+ dx_set_count (entries, 1);
+ dx_set_limit (entries, dx_root_limit(dir, sizeof(root->info)));
+
+ /* Initialize as for dx_probe */
+ hash = dx_hash (name, namelen);
+ frame = frames;
+ frame->entries = entries;
+ frame->at = entries;
+ frame->bh = bh;
+ bh = bh2;
+ de = do_split(handle,dir, &bh, frame, hash, &retval);
+ dx_release (frames);
+ if (!de)
+ return retval;
+ nlen = EXT3_DIR_REC_LEN(de->name_len);
+ rlen = le16_to_cpu(de->rec_len);
+ goto add;
}
- brelse (bh);
- return -ENOSPC;
+fail1:
+ return retval;
+fail:
+ return -ENOENT;
+#endif
}
/*
@@ -496,7 +1240,8 @@
int err;
lock_kernel();
- handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS + 3);
+ handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS +
+ EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3);
if (IS_ERR(handle)) {
unlock_kernel();
return PTR_ERR(handle);
@@ -530,7 +1275,8 @@
int err;
lock_kernel();
- handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS + 3);
+ handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS +
+ EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3);
if (IS_ERR(handle)) {
unlock_kernel();
return PTR_ERR(handle);
@@ -563,7 +1309,8 @@
return -EMLINK;
lock_kernel();
- handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS + 3);
+ handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS +
+ EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3);
if (IS_ERR(handle)) {
unlock_kernel();
return PTR_ERR(handle);
@@ -615,7 +1362,7 @@
if (err)
goto out_no_entry;
dir->i_nlink++;
- EXT3_I(dir)->i_flags &= ~EXT3_INDEX_FL;
+ ext3_update_dx_flag(dir);
ext3_mark_inode_dirty(handle, dir);
d_instantiate(dentry, inode);
out_stop:
@@ -894,7 +1641,7 @@
ext3_mark_inode_dirty(handle, inode);
dir->i_nlink--;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- EXT3_I(dir)->i_flags &= ~EXT3_INDEX_FL;
+ ext3_update_dx_flag(dir);
ext3_mark_inode_dirty(handle, dir);
end_rmdir:
@@ -944,7 +1691,7 @@
if (retval)
goto end_unlink;
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- EXT3_I(dir)->i_flags &= ~EXT3_INDEX_FL;
+ ext3_update_dx_flag(dir);
ext3_mark_inode_dirty(handle, dir);
inode->i_nlink--;
if (!inode->i_nlink)
@@ -972,7 +1719,8 @@
return -ENAMETOOLONG;
lock_kernel();
- handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS + 5);
+ handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS +
+ EXT3_INDEX_EXTRA_TRANS_BLOCKS + 5);
if (IS_ERR(handle)) {
unlock_kernel();
return PTR_ERR(handle);
@@ -1033,7 +1781,8 @@
return -EMLINK;
}
- handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS);
+ handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS +
+ EXT3_INDEX_EXTRA_TRANS_BLOCKS);
if (IS_ERR(handle)) {
unlock_kernel();
return PTR_ERR(handle);
@@ -1073,7 +1822,8 @@
old_bh = new_bh = dir_bh = NULL;
lock_kernel();
- handle = ext3_journal_start(old_dir, 2 * EXT3_DATA_TRANS_BLOCKS + 2);
+ handle = ext3_journal_start(old_dir, 2 * EXT3_DATA_TRANS_BLOCKS +
+ EXT3_INDEX_EXTRA_TRANS_BLOCKS + 2);
if (IS_ERR(handle)) {
unlock_kernel();
return PTR_ERR(handle);
@@ -1157,7 +1907,7 @@
new_inode->i_ctime = CURRENT_TIME;
}
old_dir->i_ctime = old_dir->i_mtime = CURRENT_TIME;
- EXT3_I(old_dir)->i_flags &= ~EXT3_INDEX_FL;
+ ext3_update_dx_flag(old_dir);
if (dir_bh) {
BUFFER_TRACE(dir_bh, "get_write_access");
ext3_journal_get_write_access(handle, dir_bh);
@@ -1169,7 +1919,7 @@
new_inode->i_nlink--;
} else {
new_dir->i_nlink++;
- EXT3_I(new_dir)->i_flags &= ~EXT3_INDEX_FL;
+ ext3_update_dx_flag(new_dir);
ext3_mark_inode_dirty(handle, new_dir);
}
}
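
A note on the "add:" path in the ext3 directory code above: when an existing
entry's record claims more space than its name needs, the new entry is carved
out of that slack.  A minimal standalone sketch of the arithmetic, with
simplified types and byte-order conversions omitted (REC_LEN here is a
stand-in for EXT3_DIR_REC_LEN):

/*
 * Sketch of the dirent split performed at the "add:" label: the old
 * record shrinks to the space its name needs, and the surplus becomes
 * the record for the entry being inserted.
 */
struct dirent_sketch {
	unsigned short rec_len;		/* bytes this record spans */
	unsigned char name_len;		/* bytes of name in use */
};

/* 8-byte header plus name, rounded up to a 4-byte boundary */
#define REC_LEN(name_len)	(((name_len) + 8 + 3) & ~3)

static struct dirent_sketch *split_for_insert(struct dirent_sketch *de)
{
	unsigned nlen = REC_LEN(de->name_len);
	unsigned rlen = de->rec_len;
	struct dirent_sketch *de1 = (struct dirent_sketch *)((char *)de + nlen);

	de1->rec_len = rlen - nlen;	/* new entry takes the slack */
	de->rec_len = nlen;		/* old entry shrinks to fit */
	return de1;
}
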
diff -Nru a/fs/ext3/super.c b/fs/ext3/super.c
--- a/fs/ext3/super.c Wed Aug 28 07:37:38 2002
+++ b/fs/ext3/super.c Wed Aug 28 07:37:38 2002
@@ -443,12 +443,16 @@
return;
}
-static kmem_cache_t * ext3_inode_cachep;
+static kmem_cache_t *ext3_inode_cachep;
+/*
+ * Called inside transaction, so use GFP_NOFS
+ */
static struct inode *ext3_alloc_inode(struct super_block *sb)
{
struct ext3_inode_info *ei;
- ei = (struct ext3_inode_info *)kmem_cache_alloc(ext3_inode_cachep, SLAB_KERNEL);
+
+ ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
if (!ei)
return NULL;
return &ei->vfs_inode;
@@ -579,6 +583,12 @@
"EXT3 Check option not supported\n");
#endif
}
+ else if (!strcmp (this_char, "index"))
+#ifdef CONFIG_EXT3_INDEX
+ set_opt (*mount_options, INDEX);
+#else
+ printk("EXT3 index option not supported\n");
+#endif
else if (!strcmp (this_char, "debug"))
set_opt (*mount_options, DEBUG);
else if (!strcmp (this_char, "errors")) {
@@ -757,6 +767,12 @@
es->s_mtime = cpu_to_le32(CURRENT_TIME);
ext3_update_dynamic_rev(sb);
EXT3_SET_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
+
+ if (test_opt(sb, INDEX))
+ EXT3_SET_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_DIR_INDEX);
+ else if (EXT3_HAS_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_DIR_INDEX))
+ set_opt (EXT3_SB(sb)->s_mount_opt, INDEX);
+
ext3_commit_super (sb, es, 1);
if (test_opt (sb, DEBUG))
printk (KERN_INFO
@@ -767,6 +783,7 @@
EXT3_BLOCKS_PER_GROUP(sb),
EXT3_INODES_PER_GROUP(sb),
sbi->s_mount_opt);
+
printk(KERN_INFO "EXT3 FS " EXT3FS_VERSION ", " EXT3FS_DATE " on %s, ",
sb->s_id);
if (EXT3_SB(sb)->s_journal->j_inode == NULL) {
@@ -940,6 +957,7 @@
res = (512LL << 32) - (1 << bits);
return res;
}
+
static int ext3_fill_super (struct super_block *sb, void *data, int silent)
{
diff -Nru a/fs/fat/inode.c b/fs/fat/inode.c
--- a/fs/fat/inode.c Wed Aug 28 07:37:37 2002
+++ b/fs/fat/inode.c Wed Aug 28 07:37:37 2002
@@ -982,11 +982,24 @@
{
return block_read_full_page(page,fat_get_block);
}
-static int fat_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
+
+static int
+fat_prepare_write(struct file *file, struct page *page,
+ unsigned from, unsigned to)
{
+ kmap(page);
return cont_prepare_write(page,from,to,fat_get_block,
&MSDOS_I(page->mapping->host)->mmu_private);
}
+
+static int
+fat_commit_write(struct file *file, struct page *page,
+ unsigned from, unsigned to)
+{
+ kunmap(page);
+ return generic_commit_write(file, page, from, to);
+}
+
static int _fat_bmap(struct address_space *mapping, long block)
{
return generic_block_bmap(mapping,block,fat_get_block);
@@ -996,7 +1009,7 @@
writepage: fat_writepage,
sync_page: block_sync_page,
prepare_write: fat_prepare_write,
- commit_write: generic_commit_write,
+ commit_write: fat_commit_write,
bmap: _fat_bmap
};
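
The fat hunk above shows the pattern several filesystems in this patch need
now that generic_file_write() no longer kmaps the page on their behalf: take
the kernel mapping in prepare_write, drop it in commit_write.  As a hedged
sketch (do_fill_page and do_commit_page are hypothetical stand-ins for the
real cont_prepare_write/generic_commit_write calls):

/*
 * Sketch of the kmap balance: prepare_write pins a kernel virtual
 * mapping for the (possibly highmem) page, commit_write releases it.
 */
static int sketch_prepare_write(struct page *page, unsigned from, unsigned to)
{
	kmap(page);				/* mapping held across the write */
	return do_fill_page(page, from, to);	/* hypothetical helper */
}

static int sketch_commit_write(struct page *page, unsigned from, unsigned to)
{
	kunmap(page);				/* balances the kmap() above */
	return do_commit_page(page, from, to);	/* hypothetical helper */
}
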
diff -Nru a/fs/fs-writeback.c b/fs/fs-writeback.c
--- a/fs/fs-writeback.c Wed Aug 28 07:37:37 2002
+++ b/fs/fs-writeback.c Wed Aug 28 07:37:37 2002
@@ -134,8 +134,6 @@
struct address_space *mapping = inode->i_mapping;
struct super_block *sb = inode->i_sb;
- list_move(&inode->i_list, &sb->s_locked_inodes);
-
BUG_ON(inode->i_state & I_LOCK);
/* Set I_LOCK, reset I_DIRTY */
@@ -163,12 +161,12 @@
if (inode->i_state & I_DIRTY) { /* Redirtied */
list_add(&inode->i_list, &sb->s_dirty);
} else {
- if (!list_empty(&mapping->dirty_pages)) {
+ if (!list_empty(&mapping->dirty_pages) ||
+ !list_empty(&mapping->io_pages)) {
/* Not a whole-file writeback */
mapping->dirtied_when = orig_dirtied_when;
inode->i_state |= I_DIRTY_PAGES;
- list_add_tail(&inode->i_list,
- &sb->s_dirty);
+ list_add_tail(&inode->i_list, &sb->s_dirty);
} else if (atomic_read(&inode->i_count)) {
list_add(&inode->i_list, &inode_in_use);
} else {
@@ -205,7 +203,7 @@
* If older_than_this is non-NULL, then only write out mappings which
* had their first dirtying at a time earlier than *older_than_this.
*
- * If we're a pdlfush thread, then implement pdlfush collision avoidance
+ * If we're a pdflush thread, then implement pdflush collision avoidance
* against the entire list.
*
* WB_SYNC_HOLD is a hack for sys_sync(): reattach the inode to sb->s_dirty so
@@ -221,6 +219,11 @@
* FIXME: this linear search could get expensive with many filesystems. But
* how to fix? We need to go from an address_space to all inodes which share
* a queue with that address_space.
+ *
+ * The inodes to be written are parked on sb->s_io. They are moved back onto
+ * sb->s_dirty as they are selected for writing. This way, none can be missed
+ * on the writer throttling path, and we get decent balancing between many
+ * throttled threads: we don't want them all piling up on __wait_on_inode.
*/
static void
sync_sb_inodes(struct backing_dev_info *single_bdi, struct super_block *sb,
@@ -241,7 +244,7 @@
if (single_bdi && mapping->backing_dev_info != single_bdi) {
if (sb != blockdev_superblock)
break; /* inappropriate superblock */
- list_move(&inode->i_list, &inode->i_sb->s_dirty);
+ list_move(&inode->i_list, &sb->s_dirty);
continue; /* not this blockdev */
}
@@ -263,10 +266,11 @@
BUG_ON(inode->i_state & I_FREEING);
__iget(inode);
+ list_move(&inode->i_list, &sb->s_dirty);
__writeback_single_inode(inode, really_sync, nr_to_write);
if (sync_mode == WB_SYNC_HOLD) {
mapping->dirtied_when = jiffies;
- list_move(&inode->i_list, &inode->i_sb->s_dirty);
+ list_move(&inode->i_list, &sb->s_dirty);
}
if (current_is_pdflush())
writeback_release(bdi);
@@ -278,9 +282,8 @@
}
out:
/*
- * Put the rest back, in the correct order.
+ * Leave any unwritten inodes on s_io.
*/
- list_splice_init(&sb->s_io, sb->s_dirty.prev);
return;
}
@@ -302,7 +305,7 @@
spin_lock(&sb_lock);
sb = sb_entry(super_blocks.prev);
for (; sb != sb_entry(&super_blocks); sb = sb_entry(sb->s_list.prev)) {
- if (!list_empty(&sb->s_dirty)) {
+ if (!list_empty(&sb->s_dirty) || !list_empty(&sb->s_io)) {
spin_unlock(&sb_lock);
sync_sb_inodes(bdi, sb, sync_mode, nr_to_write,
older_than_this);
@@ -321,7 +324,7 @@
* Note:
* We don't need to grab a reference to superblock here. If it has non-empty
* ->s_dirty it hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until both ->s_dirty and ->s_locked_inodes are
+ * past sync_inodes_sb() until both the ->s_dirty and ->s_io lists are
* empty. Since __sync_single_inode() regains inode_lock before it finally moves
* inode from superblock lists we are OK.
*
@@ -352,19 +355,6 @@
sync_mode, older_than_this);
}
-static void __wait_on_locked(struct list_head *head)
-{
- struct list_head * tmp;
- while ((tmp = head->prev) != head) {
- struct inode *inode = list_entry(tmp, struct inode, i_list);
- __iget(inode);
- spin_unlock(&inode_lock);
- __wait_on_inode(inode);
- iput(inode);
- spin_lock(&inode_lock);
- }
-}
-
/*
* writeback and wait upon the filesystem's dirty inodes. The caller will
* do this in two passes - one to write, and one to wait. WB_SYNC_HOLD is
@@ -384,8 +374,6 @@
spin_lock(&inode_lock);
sync_sb_inodes(NULL, sb, wait ? WB_SYNC_ALL : WB_SYNC_HOLD,
&nr_to_write, NULL);
- if (wait)
- __wait_on_locked(&sb->s_locked_inodes);
spin_unlock(&inode_lock);
}
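
The writeback rework above replaces the old splice-everything-back step with
the s_io parking described in the sync_sb_inodes comment.  Reduced to a
skeleton (locking, the blockdev special case and pdflush collision avoidance
omitted; write_one_inode is a hypothetical stand-in for
__writeback_single_inode):

static void sketch_sync_inodes(struct super_block *sb, int *nr_to_write)
{
	/* Fill s_io once; later passes drain whatever is left on it */
	if (list_empty(&sb->s_io))
		list_splice_init(&sb->s_dirty, &sb->s_io);

	while (!list_empty(&sb->s_io)) {
		struct inode *inode = list_entry(sb->s_io.prev,
						 struct inode, i_list);

		/* Move back to s_dirty *before* writing: a throttled
		 * writer scanning s_io can then never miss it. */
		list_move(&inode->i_list, &sb->s_dirty);
		write_one_inode(inode, nr_to_write);	/* hypothetical */
		if (*nr_to_write <= 0)
			break;		/* the rest stay parked on s_io */
	}
}
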
diff -Nru a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c Wed Aug 28 07:37:38 2002
+++ b/fs/inode.c Wed Aug 28 07:37:38 2002
@@ -193,6 +193,8 @@
*/
void __iget(struct inode * inode)
{
+ assert_locked(&inode_lock);
+
if (atomic_read(&inode->i_count)) {
atomic_inc(&inode->i_count);
return;
@@ -321,7 +323,6 @@
busy |= invalidate_list(&inode_unused, sb, &throw_away);
busy |= invalidate_list(&sb->s_dirty, sb, &throw_away);
busy |= invalidate_list(&sb->s_io, sb, &throw_away);
- busy |= invalidate_list(&sb->s_locked_inodes, sb, &throw_away);
spin_unlock(&inode_lock);
dispose_list(&throw_away);
@@ -995,11 +996,6 @@
remove_inode_dquot_ref(inode, type, &tofree_head);
}
list_for_each(act_head, &sb->s_io) {
- inode = list_entry(act_head, struct inode, i_list);
- if (IS_QUOTAINIT(inode))
- remove_inode_dquot_ref(inode, type, &tofree_head);
- }
- list_for_each(act_head, &sb->s_locked_inodes) {
inode = list_entry(act_head, struct inode, i_list);
if (IS_QUOTAINIT(inode))
remove_inode_dquot_ref(inode, type, &tofree_head);
diff -Nru a/fs/jffs/inode-v23.c b/fs/jffs/inode-v23.c
--- a/fs/jffs/inode-v23.c Wed Aug 28 07:37:37 2002
+++ b/fs/jffs/inode-v23.c Wed Aug 28 07:37:37 2002
@@ -47,6 +47,7 @@
#include <linux/stat.h>
#include <linux/blkdev.h>
#include <linux/quotaops.h>
+#include <linux/highmem.h>
#include <linux/smp_lock.h>
#include <asm/semaphore.h>
#include <asm/byteorder.h>
@@ -751,7 +752,6 @@
get_page(page);
/* Don't SetPageLocked(page), should be locked already */
- buf = page_address(page);
ClearPageUptodate(page);
ClearPageError(page);
@@ -760,8 +760,10 @@
read_len = 0;
result = 0;
-
offset = page->index << PAGE_CACHE_SHIFT;
+
+ kmap(page);
+ buf = page_address(page);
if (offset < inode->i_size) {
read_len = min_t(long, inode->i_size - offset, PAGE_SIZE);
r = jffs_read_data(f, buf, offset, read_len);
@@ -779,6 +781,8 @@
/* This handles the case of partial or no read in above */
if(read_len < PAGE_SIZE)
memset(buf + read_len, 0, PAGE_SIZE - read_len);
+ flush_dcache_page(page);
+ kunmap(page);
D3(printk (KERN_NOTICE "readpage(): up biglock\n"));
up(&c->fmc->biglock);
@@ -788,9 +792,8 @@
}else {
SetPageUptodate(page);
}
- flush_dcache_page(page);
- put_page(page);
+ page_cache_release(page);
D3(printk("jffs_readpage(): Leaving...\n"));
diff -Nru a/fs/jffs2/file.c b/fs/jffs2/file.c
--- a/fs/jffs2/file.c Wed Aug 28 07:37:38 2002
+++ b/fs/jffs2/file.c Wed Aug 28 07:37:38 2002
@@ -17,6 +17,7 @@
#include <linux/fs.h>
#include <linux/time.h>
#include <linux/pagemap.h>
+#include <linux/highmem.h>
#include <linux/crc32.h>
#include <linux/jffs2.h>
#include "nodelist.h"
@@ -381,9 +382,10 @@
ri->isize = (uint32_t)inode->i_size;
ri->atime = ri->ctime = ri->mtime = CURRENT_TIME;
- /* We rely on the fact that generic_file_write() currently kmaps the page for us. */
+ kmap(pg);
ret = jffs2_write_inode_range(c, f, ri, page_address(pg) + start,
(pg->index << PAGE_CACHE_SHIFT) + start, end - start, &writtenlen);
+ kunmap(pg);
if (ret) {
/* There was an error writing. */
diff -Nru a/fs/jfs/jfs_dmap.c b/fs/jfs/jfs_dmap.c
--- a/fs/jfs/jfs_dmap.c Wed Aug 28 07:37:37 2002
+++ b/fs/jfs/jfs_dmap.c Wed Aug 28 07:37:37 2002
@@ -325,7 +325,6 @@
/*
* write out dirty pages of bmap
*/
- filemap_fdatawait(ipbmap->i_mapping);
filemap_fdatawrite(ipbmap->i_mapping);
filemap_fdatawait(ipbmap->i_mapping);
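
This hunk and the similar jfs, nfs, nfsd and smbfs hunks below all make the
same mechanical fix: the old code called filemap_fdatawait() before
filemap_fdatawrite(), i.e. it waited on writeback that had not been started
yet.  The intended idiom is simply start-then-wait:

/*
 * Sketch of the corrected flush idiom: queue I/O for every dirty
 * page first, then wait for all of it to complete.  The deleted
 * leading fdatawait only waited on I/O that happened to be in
 * flight already, and was pure overhead otherwise.
 */
static void sketch_flush_mapping(struct address_space *mapping)
{
	filemap_fdatawrite(mapping);	/* start writeback */
	filemap_fdatawait(mapping);	/* wait for it */
}
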
diff -Nru a/fs/jfs/jfs_imap.c b/fs/jfs/jfs_imap.c
--- a/fs/jfs/jfs_imap.c Wed Aug 28 07:37:36 2002
+++ b/fs/jfs/jfs_imap.c Wed Aug 28 07:37:36 2002
@@ -281,7 +281,6 @@
/*
* write out dirty pages of imap
*/
- filemap_fdatawait(ipimap->i_mapping);
filemap_fdatawrite(ipimap->i_mapping);
filemap_fdatawait(ipimap->i_mapping);
@@ -595,7 +594,6 @@
jERROR(1, ("diFreeSpecial called with NULL ip!\n"));
return;
}
- filemap_fdatawait(ip->i_mapping);
filemap_fdatawrite(ip->i_mapping);
filemap_fdatawait(ip->i_mapping);
truncate_inode_pages(ip->i_mapping, 0);
diff -Nru a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
--- a/fs/jfs/jfs_logmgr.c Wed Aug 28 07:37:37 2002
+++ b/fs/jfs/jfs_logmgr.c Wed Aug 28 07:37:37 2002
@@ -965,9 +965,6 @@
* We need to make sure all of the "written" metapages
* actually make it to disk
*/
- filemap_fdatawait(sbi->ipbmap->i_mapping);
- filemap_fdatawait(sbi->ipimap->i_mapping);
- filemap_fdatawait(sbi->direct_inode->i_mapping);
filemap_fdatawrite(sbi->ipbmap->i_mapping);
filemap_fdatawrite(sbi->ipimap->i_mapping);
filemap_fdatawrite(sbi->direct_inode->i_mapping);
diff -Nru a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
--- a/fs/jfs/jfs_metapage.c Wed Aug 28 07:37:36 2002
+++ b/fs/jfs/jfs_metapage.c Wed Aug 28 07:37:36 2002
@@ -459,7 +459,6 @@
if (rc) {
jERROR(1, ("prepare_write return %d!\n", rc));
ClearPageUptodate(mp->page);
- kunmap(mp->page);
unlock_page(mp->page);
clear_bit(META_dirty, &mp->flag);
return;
diff -Nru a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c
--- a/fs/jfs/jfs_txnmgr.c Wed Aug 28 07:37:37 2002
+++ b/fs/jfs/jfs_txnmgr.c Wed Aug 28 07:37:37 2002
@@ -1165,7 +1165,6 @@
*
* if ((!S_ISDIR(ip->i_mode))
* && (tblk->flag & COMMIT_DELETE) == 0) {
- * filemap_fdatawait(ip->i_mapping);
* filemap_fdatawrite(ip->i_mapping);
* filemap_fdatawait(ip->i_mapping);
* }
diff -Nru a/fs/jfs/jfs_umount.c b/fs/jfs/jfs_umount.c
--- a/fs/jfs/jfs_umount.c Wed Aug 28 07:37:36 2002
+++ b/fs/jfs/jfs_umount.c Wed Aug 28 07:37:36 2002
@@ -112,7 +112,6 @@
* Make sure all metadata makes it to disk before we mark
* the superblock as clean
*/
- filemap_fdatawait(sbi->direct_inode->i_mapping);
filemap_fdatawrite(sbi->direct_inode->i_mapping);
filemap_fdatawait(sbi->direct_inode->i_mapping);
@@ -159,7 +158,6 @@
*/
dbSync(sbi->ipbmap);
diSync(sbi->ipimap);
- filemap_fdatawait(sbi->direct_inode->i_mapping);
filemap_fdatawrite(sbi->direct_inode->i_mapping);
filemap_fdatawait(sbi->direct_inode->i_mapping);
diff -Nru a/fs/jfs/super.c b/fs/jfs/super.c
--- a/fs/jfs/super.c Wed Aug 28 07:37:37 2002
+++ b/fs/jfs/super.c Wed Aug 28 07:37:37 2002
@@ -146,7 +146,6 @@
* We need to clean out the direct_inode pages since this inode
* is not in the inode hash.
*/
- filemap_fdatawait(sbi->direct_inode->i_mapping);
filemap_fdatawrite(sbi->direct_inode->i_mapping);
filemap_fdatawait(sbi->direct_inode->i_mapping);
truncate_inode_pages(sbi->direct_mapping, 0);
@@ -362,7 +361,6 @@
jERROR(1, ("jfs_umount failed with return code %d\n", rc));
}
out_mount_failed:
- filemap_fdatawait(sbi->direct_inode->i_mapping);
filemap_fdatawrite(sbi->direct_inode->i_mapping);
filemap_fdatawait(sbi->direct_inode->i_mapping);
truncate_inode_pages(sbi->direct_mapping, 0);
diff -Nru a/fs/minix/dir.c b/fs/minix/dir.c
--- a/fs/minix/dir.c Wed Aug 28 07:37:36 2002
+++ b/fs/minix/dir.c Wed Aug 28 07:37:36 2002
@@ -7,6 +7,7 @@
*/
#include "minix.h"
+#include <linux/highmem.h>
#include <linux/smp_lock.h>
typedef struct minix_dir_entry minix_dirent;
@@ -261,7 +262,7 @@
{
struct address_space *mapping = page->mapping;
struct inode *inode = (struct inode*)mapping->host;
- char *kaddr = (char*)page_address(page);
+ char *kaddr = page_address(page);
unsigned from = (char*)de - kaddr;
unsigned to = from + minix_sb(inode->i_sb)->s_dirsize;
int err;
@@ -286,7 +287,7 @@
struct page *page = grab_cache_page(mapping, 0);
struct minix_sb_info * sbi = minix_sb(inode->i_sb);
struct minix_dir_entry * de;
- char *base;
+ char *kaddr;
int err;
if (!page)
@@ -297,15 +298,16 @@
goto fail;
}
- base = (char*)page_address(page);
- memset(base, 0, PAGE_CACHE_SIZE);
+ kaddr = kmap_atomic(page, KM_USER0);
+ memset(kaddr, 0, PAGE_CACHE_SIZE);
- de = (struct minix_dir_entry *) base;
+ de = (struct minix_dir_entry *)kaddr;
de->inode = inode->i_ino;
strcpy(de->name,".");
de = minix_next_entry(de, sbi);
de->inode = dir->i_ino;
strcpy(de->name,"..");
+ kunmap_atomic(kaddr, KM_USER0);
err = dir_commit_chunk(page, 0, 2 * sbi->s_dirsize);
fail:
diff -Nru a/fs/mpage.c b/fs/mpage.c
--- a/fs/mpage.c Wed Aug 28 07:37:36 2002
+++ b/fs/mpage.c Wed Aug 28 07:37:36 2002
@@ -19,6 +19,7 @@
#include <linux/highmem.h>
#include <linux/prefetch.h>
#include <linux/mpage.h>
+#include <linux/writeback.h>
#include <linux/pagevec.h>
/*
@@ -459,6 +460,9 @@
clear_buffer_dirty(bh);
bh = bh->b_this_page;
} while (bh != head);
+
+ if (buffer_heads_over_limit)
+ try_to_free_buffers(page);
}
bvec = &bio->bi_io_vec[bio->bi_idx++];
@@ -530,6 +534,7 @@
sector_t last_block_in_bio = 0;
int ret = 0;
int done = 0;
+ int sync = called_for_sync();
struct pagevec pvec;
int (*writepage)(struct page *);
@@ -546,7 +551,7 @@
struct page *page = list_entry(mapping->io_pages.prev,
struct page, list);
list_del(&page->list);
- if (PageWriteback(page)) {
+ if (PageWriteback(page) && !sync) {
if (PageDirty(page)) {
list_add(&page->list, &mapping->dirty_pages);
continue;
@@ -565,8 +570,11 @@
lock_page(page);
+ if (sync)
+ wait_on_page_writeback(page);
+
if (page->mapping && !PageWriteback(page) &&
- TestClearPageDirty(page)) {
+ test_clear_page_dirty(page)) {
if (writepage) {
ret = (*writepage)(page);
} else {
@@ -579,6 +587,10 @@
pagevec_deactivate_inactive(&pvec);
page = NULL;
}
+ if (ret == -EAGAIN && page) {
+ __set_page_dirty_nobuffers(page);
+ ret = 0;
+ }
if (ret || (nr_to_write && --(*nr_to_write) <= 0))
done = 1;
} else {
@@ -590,9 +602,8 @@
write_lock(&mapping->page_lock);
}
/*
- * Put the rest back, in the correct order.
+ * Leave any remaining dirty pages on ->io_pages
*/
- list_splice_init(&mapping->io_pages, mapping->dirty_pages.prev);
write_unlock(&mapping->page_lock);
pagevec_deactivate_inactive(&pvec);
if (bio)
diff -Nru a/fs/namei.c b/fs/namei.c
--- a/fs/namei.c Wed Aug 28 07:37:37 2002
+++ b/fs/namei.c Wed Aug 28 07:37:37 2002
@@ -2200,8 +2200,9 @@
err = mapping->a_ops->prepare_write(NULL, page, 0, len-1);
if (err)
goto fail_map;
- kaddr = page_address(page);
+ kaddr = kmap_atomic(page, KM_USER0);
memcpy(kaddr, symname, len-1);
+ kunmap_atomic(kaddr, KM_USER0);
mapping->a_ops->commit_write(NULL, page, 0, len-1);
/*
* Notice that we are _not_ going to block here - end of page is
diff -Nru a/fs/nfs/file.c b/fs/nfs/file.c
--- a/fs/nfs/file.c Wed Aug 28 07:37:37 2002
+++ b/fs/nfs/file.c Wed Aug 28 07:37:37 2002
@@ -279,10 +279,7 @@
* Flush all pending writes before doing anything
* with locks..
*/
- status = filemap_fdatawait(inode->i_mapping);
- status2 = filemap_fdatawrite(inode->i_mapping);
- if (!status)
- status = status2;
+ status = filemap_fdatawrite(inode->i_mapping);
down(&inode->i_sem);
status2 = nfs_wb_all(inode);
if (!status)
@@ -308,7 +305,6 @@
*/
out_ok:
if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->fl_type != F_UNLCK) {
- filemap_fdatawait(inode->i_mapping);
filemap_fdatawrite(inode->i_mapping);
down(&inode->i_sem);
nfs_wb_all(inode); /* we may have slept */
diff -Nru a/fs/nfs/inode.c b/fs/nfs/inode.c
--- a/fs/nfs/inode.c Wed Aug 28 07:37:37 2002
+++ b/fs/nfs/inode.c Wed Aug 28 07:37:37 2002
@@ -775,7 +775,6 @@
if (!S_ISREG(inode->i_mode))
attr->ia_valid &= ~ATTR_SIZE;
- filemap_fdatawait(inode->i_mapping);
filemap_fdatawrite(inode->i_mapping);
error = nfs_wb_all(inode);
filemap_fdatawait(inode->i_mapping);
diff -Nru a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
--- a/fs/nfsd/vfs.c Wed Aug 28 07:37:37 2002
+++ b/fs/nfsd/vfs.c Wed Aug 28 07:37:37 2002
@@ -501,7 +501,6 @@
struct inode *inode = dp->d_inode;
int (*fsync) (struct file *, struct dentry *, int);
- filemap_fdatawait(inode->i_mapping);
filemap_fdatawrite(inode->i_mapping);
if (fop && (fsync = fop->fsync))
fsync(filp, dp, 0);
diff -Nru a/fs/proc/proc_misc.c b/fs/proc/proc_misc.c
--- a/fs/proc/proc_misc.c Wed Aug 28 07:37:36 2002
+++ b/fs/proc/proc_misc.c Wed Aug 28 07:37:36 2002
@@ -27,6 +27,7 @@
#include <linux/ioport.h>
#include <linux/config.h>
#include <linux/mm.h>
+#include <linux/mmzone.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/slab.h>
@@ -41,7 +42,8 @@
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/io.h>
-
+#include <asm/pgalloc.h>
+#include <asm/tlb.h>
#define LOAD_INT(x) ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
@@ -134,8 +136,20 @@
struct sysinfo i;
int len, committed;
struct page_state ps;
+ int cpu;
+ unsigned long inactive;
+ unsigned long active;
+ unsigned long flushes = 0;
+ unsigned long non_flushes = 0;
+
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ flushes += mmu_gathers[cpu].flushes;
+ non_flushes += mmu_gathers[cpu].avoided_flushes;
+ }
get_page_state(&ps);
+ get_zone_counts(&active, &inactive);
+
/*
* display in kilobytes.
*/
@@ -165,14 +179,16 @@
"Writeback: %8lu kB\n"
"Committed_AS: %8u kB\n"
"PageTables: %8lu kB\n"
- "ReverseMaps: %8lu\n",
+ "ReverseMaps: %8lu\n"
+ "TLB flushes: %8lu\n"
+ "non flushes: %8lu\n",
K(i.totalram),
K(i.freeram),
K(i.sharedram),
K(ps.nr_pagecache-swapper_space.nrpages),
K(swapper_space.nrpages),
- K(ps.nr_active),
- K(ps.nr_inactive),
+ K(active),
+ K(inactive),
K(i.totalhigh),
K(i.freehigh),
K(i.totalram-i.totalhigh),
@@ -183,7 +199,9 @@
K(ps.nr_writeback),
K(committed),
K(ps.nr_page_table_pages),
- ps.nr_reverse_maps
+ ps.nr_reverse_maps,
+ flushes,
+ non_flushes
);
return proc_calc_metrics(page, start, off, count, eof, len);
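
The new TLB statistics are plain per-CPU counters, incremented without locks
inside the mmu_gather code; /proc/meminfo just sums the array at read time
and accepts a slightly fuzzy snapshot.  The summation, isolated:

/*
 * Sketch: totalling unlocked per-CPU counters for a /proc report.
 * Races with concurrent increments only make the totals slightly
 * stale, which is acceptable for statistics.
 */
static void sketch_sum_tlb_stats(unsigned long *flushes,
				 unsigned long *avoided)
{
	int cpu;

	*flushes = *avoided = 0;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		*flushes += mmu_gathers[cpu].flushes;
		*avoided += mmu_gathers[cpu].avoided_flushes;
	}
}
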
diff -Nru a/fs/ramfs/inode.c b/fs/ramfs/inode.c
--- a/fs/ramfs/inode.c Wed Aug 28 07:37:37 2002
+++ b/fs/ramfs/inode.c Wed Aug 28 07:37:37 2002
@@ -26,9 +26,11 @@
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
+#include <linux/highmem.h>
#include <linux/init.h>
#include <linux/string.h>
#include <linux/smp_lock.h>
+#include <linux/backing-dev.h>
#include <asm/uaccess.h>
@@ -40,6 +42,11 @@
static struct file_operations ramfs_file_operations;
static struct inode_operations ramfs_dir_inode_operations;
+static struct backing_dev_info ramfs_backing_dev_info = {
+ .ra_pages = 0, /* No readahead */
+ .memory_backed = 1, /* Does not contribute to dirty memory */
+};
+
/*
* Read a page. Again trivial. If it didn't already exist
* in the page cache, it is zero-filled.
@@ -47,8 +54,10 @@
static int ramfs_readpage(struct file *file, struct page * page)
{
if (!PageUptodate(page)) {
- memset(kmap(page), 0, PAGE_CACHE_SIZE);
- kunmap(page);
+ char *kaddr = kmap_atomic(page, KM_USER0);
+
+ memset(kaddr, 0, PAGE_CACHE_SIZE);
+ kunmap_atomic(kaddr, KM_USER0);
flush_dcache_page(page);
SetPageUptodate(page);
}
@@ -58,13 +67,15 @@
static int ramfs_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
{
- void *addr = kmap(page);
if (!PageUptodate(page)) {
- memset(addr, 0, PAGE_CACHE_SIZE);
+ char *kaddr = kmap_atomic(page, KM_USER0);
+
+ memset(kaddr, 0, PAGE_CACHE_SIZE);
flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
SetPageUptodate(page);
}
- SetPageDirty(page);
+ set_page_dirty(page);
return 0;
}
@@ -73,7 +84,6 @@
struct inode *inode = page->mapping->host;
loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
- kunmap(page);
if (pos > inode->i_size)
inode->i_size = pos;
return 0;
@@ -91,6 +101,7 @@
inode->i_blocks = 0;
inode->i_rdev = NODEV;
inode->i_mapping->a_ops = &ramfs_aops;
+ inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
default:
diff -Nru a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c Wed Aug 28 07:37:37 2002
+++ b/fs/reiserfs/inode.c Wed Aug 28 07:37:37 2002
@@ -7,6 +7,7 @@
#include <linux/reiserfs_fs.h>
#include <linux/smp_lock.h>
#include <linux/pagemap.h>
+#include <linux/highmem.h>
#include <asm/uaccess.h>
#include <asm/unaligned.h>
#include <linux/buffer_head.h>
@@ -1692,8 +1693,6 @@
if (error)
goto unlock ;
- kunmap(page) ; /* mapped by block_prepare_write */
-
head = page_buffers(page) ;
bh = head;
do {
@@ -1788,10 +1787,13 @@
length = offset & (blocksize - 1) ;
/* if we are not on a block boundary */
if (length) {
+ char *kaddr;
+
length = blocksize - length ;
- memset((char *)kmap(page) + offset, 0, length) ;
+ kaddr = kmap_atomic(page, KM_USER0) ;
+ memset(kaddr + offset, 0, length) ;
flush_dcache_page(page) ;
- kunmap(page) ;
+ kunmap_atomic(kaddr, KM_USER0) ;
if (buffer_mapped(bh) && bh->b_blocknr != 0) {
mark_buffer_dirty(bh) ;
}
@@ -1941,23 +1943,25 @@
struct buffer_head *arr[PAGE_CACHE_SIZE/512] ;
int nr = 0 ;
- if (!page_has_buffers(page)) {
+ if (!page_has_buffers(page))
block_prepare_write(page, 0, 0, NULL) ;
- kunmap(page) ;
- }
+
/* last page in the file, zero out any contents past the
** last byte in the file
*/
if (page->index >= end_index) {
+ char *kaddr;
+
last_offset = inode->i_size & (PAGE_CACHE_SIZE - 1) ;
/* no file contents in this page */
if (page->index >= end_index + 1 || !last_offset) {
error = -EIO ;
goto fail ;
}
- memset((char *)kmap(page)+last_offset, 0, PAGE_CACHE_SIZE-last_offset) ;
+ kaddr = kmap_atomic(page, KM_USER0);
+ memset(kaddr + last_offset, 0, PAGE_CACHE_SIZE-last_offset) ;
flush_dcache_page(page) ;
- kunmap(page) ;
+ kunmap_atomic(kaddr, KM_USER0) ;
}
head = page_buffers(page) ;
bh = head ;
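
The reiserfs hunks here, like the ramfs, minix and namei ones above, convert
short page touches from kmap() to kmap_atomic(): atomic kmaps are per-CPU
fixmap slots, far cheaper than the global kmap pool, with the rule that
nothing may sleep between map and unmap.  The canonical bracket:

/*
 * Sketch of the kmap_atomic bracket for a brief page access.
 * KM_USER0 is the slot used for this kind of short-lived mapping;
 * no scheduling is allowed while the slot is held.
 */
static void sketch_zero_page(struct page *page)
{
	char *kaddr = kmap_atomic(page, KM_USER0);

	memset(kaddr, 0, PAGE_CACHE_SIZE);
	flush_dcache_page(page);
	kunmap_atomic(kaddr, KM_USER0);	/* pass the address, not the page */
}
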
diff -Nru a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
--- a/fs/reiserfs/stree.c Wed Aug 28 07:37:37 2002
+++ b/fs/reiserfs/stree.c Wed Aug 28 07:37:37 2002
@@ -1284,15 +1284,15 @@
**
** p_s_un_bh is from the page cache (all unformatted nodes are
** from the page cache) and might be a highmem page. So, we
- ** can't use p_s_un_bh->b_data. But, the page has already been
- ** kmapped, so we can use page_address()
+ ** can't use p_s_un_bh->b_data.
** -clm
*/
- data = page_address(p_s_un_bh->b_page) ;
+ data = kmap_atomic(p_s_un_bh->b_page, KM_USER0);
off = ((le_ih_k_offset (&s_ih) - 1) & (PAGE_CACHE_SIZE - 1));
memcpy(data + off,
B_I_PITEM(PATH_PLAST_BUFFER(p_s_path), &s_ih), n_ret_value);
+ kunmap_atomic(data, KM_USER0); /* unmap the mapped address, not the page */
}
/* Perform balancing after all resources have been collected at once. */
diff -Nru a/fs/reiserfs/tail_conversion.c b/fs/reiserfs/tail_conversion.c
--- a/fs/reiserfs/tail_conversion.c Wed Aug 28 07:37:38 2002
+++ b/fs/reiserfs/tail_conversion.c Wed Aug 28 07:37:38 2002
@@ -122,11 +122,12 @@
}
/* if we've copied bytes from disk into the page, we need to zero
** out the unused part of the block (it was not up to date before)
- ** the page is still kmapped (by whoever called reiserfs_get_block)
*/
if (up_to_date_bh) {
unsigned pgoff = (tail_offset + total_tail - 1) & (PAGE_CACHE_SIZE - 1);
- memset(page_address(unbh->b_page) + pgoff, 0, n_blk_size - total_tail) ;
+ char *kaddr = kmap_atomic(up_to_date_bh->b_page, KM_USER0);
+ memset(kaddr + pgoff, 0, n_blk_size - total_tail) ;
+ kunmap_atomic(kaddr, KM_USER0);
}
REISERFS_I(inode)->i_first_direct_byte = U32_MAX;
diff -Nru a/fs/smbfs/file.c b/fs/smbfs/file.c
--- a/fs/smbfs/file.c Wed Aug 28 07:37:38 2002
+++ b/fs/smbfs/file.c Wed Aug 28 07:37:38 2002
@@ -352,7 +352,6 @@
/* We must flush any dirty pages now as we won't be able to
write anything after close. mmap can trigger this.
"openers" should perhaps include mmap'ers ... */
- filemap_fdatawait(inode->i_mapping);
filemap_fdatawrite(inode->i_mapping);
filemap_fdatawait(inode->i_mapping);
smb_close(inode);
diff -Nru a/fs/smbfs/inode.c b/fs/smbfs/inode.c
--- a/fs/smbfs/inode.c Wed Aug 28 07:37:36 2002
+++ b/fs/smbfs/inode.c Wed Aug 28 07:37:36 2002
@@ -650,7 +650,6 @@
DENTRY_PATH(dentry),
(long) inode->i_size, (long) attr->ia_size);
- filemap_fdatawait(inode->i_mapping);
filemap_fdatawrite(inode->i_mapping);
filemap_fdatawait(inode->i_mapping);
diff -Nru a/fs/super.c b/fs/super.c
--- a/fs/super.c Wed Aug 28 07:37:37 2002
+++ b/fs/super.c Wed Aug 28 07:37:37 2002
@@ -58,7 +58,6 @@
}
INIT_LIST_HEAD(&s->s_dirty);
INIT_LIST_HEAD(&s->s_io);
- INIT_LIST_HEAD(&s->s_locked_inodes);
INIT_LIST_HEAD(&s->s_files);
INIT_LIST_HEAD(&s->s_instances);
INIT_LIST_HEAD(&s->s_anon);
diff -Nru a/fs/sysv/dir.c b/fs/sysv/dir.c
--- a/fs/sysv/dir.c Wed Aug 28 07:37:37 2002
+++ b/fs/sysv/dir.c Wed Aug 28 07:37:37 2002
@@ -14,6 +14,7 @@
*/
#include <linux/pagemap.h>
+#include <linux/highmem.h>
#include <linux/smp_lock.h>
#include "sysv.h"
@@ -273,6 +274,7 @@
if (!page)
return -ENOMEM;
+ kmap(page);
err = mapping->a_ops->prepare_write(NULL, page, 0, 2 * SYSV_DIRSIZE);
if (err) {
unlock_page(page);
@@ -291,6 +293,7 @@
err = dir_commit_chunk(page, 0, 2 * SYSV_DIRSIZE);
fail:
+ kunmap(page);
page_cache_release(page);
return err;
}
diff -Nru a/fs/udf/inode.c b/fs/udf/inode.c
--- a/fs/udf/inode.c Wed Aug 28 07:37:38 2002
+++ b/fs/udf/inode.c Wed Aug 28 07:37:38 2002
@@ -208,7 +208,8 @@
mark_buffer_dirty_inode(bh, inode);
udf_release_data(bh);
- inode->i_data.a_ops->writepage(page);
+ if (inode->i_data.a_ops->writepage(page) == -EAGAIN)
+ __set_page_dirty_nobuffers(page);
page_cache_release(page);
mark_inode_dirty(inode);
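
This udf hunk and the mpage_writepages change above both rely on the new
-EAGAIN convention: a writepage() that cannot make progress right now returns
-EAGAIN, and the caller re-dirties the page at the address_space level so a
later writeback pass retries.  In outline:

/*
 * Sketch of the -EAGAIN retry convention for writepage callers.
 */
static int sketch_write_one_page(struct address_space *mapping,
				 struct page *page)
{
	int ret = mapping->a_ops->writepage(page);

	if (ret == -EAGAIN) {
		/* fs couldn't write it now: re-dirty and call it success */
		__set_page_dirty_nobuffers(page);
		ret = 0;
	}
	return ret;
}
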
diff -Nru a/include/asm-alpha/cache.h b/include/asm-alpha/cache.h
--- a/include/asm-alpha/cache.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-alpha/cache.h Wed Aug 28 07:37:37 2002
@@ -20,5 +20,6 @@
#define L1_CACHE_ALIGN(x) (((x)+(L1_CACHE_BYTES-1))&~(L1_CACHE_BYTES-1))
#define SMP_CACHE_BYTES L1_CACHE_BYTES
+#define L1_CACHE_SHIFT_MAX 6 /* largest L1 which this arch supports */
#endif
diff -Nru a/include/asm-alpha/mmzone.h b/include/asm-alpha/mmzone.h
--- a/include/asm-alpha/mmzone.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-alpha/mmzone.h Wed Aug 28 07:37:37 2002
@@ -52,14 +52,14 @@
#if 1
#define PLAT_NODE_DATA_LOCALNR(p, n) \
- (((p) - PLAT_NODE_DATA(n)->gendata.node_start_paddr) >> PAGE_SHIFT)
+ (((p) >> PAGE_SHIFT) - PLAT_NODE_DATA(n)->gendata.node_start_pfn)
#else
static inline unsigned long
PLAT_NODE_DATA_LOCALNR(unsigned long p, int n)
{
unsigned long temp;
- temp = p - PLAT_NODE_DATA(n)->gendata.node_start_paddr;
- return (temp >> PAGE_SHIFT);
+ temp = p >> PAGE_SHIFT;
+ return temp - PLAT_NODE_DATA(n)->gendata.node_start_pfn;
}
#endif
@@ -96,7 +96,7 @@
* and returns the kaddr corresponding to first physical page in the
* node's mem_map.
*/
-#define LOCAL_BASE_ADDR(kaddr) ((unsigned long)__va(NODE_DATA(KVADDR_TO_NID(kaddr))->node_start_paddr))
+#define LOCAL_BASE_ADDR(kaddr) ((unsigned long)__va(NODE_DATA(KVADDR_TO_NID(kaddr))->node_start_pfn << PAGE_SHIFT))
#define LOCAL_MAP_NR(kvaddr) \
(((unsigned long)(kvaddr)-LOCAL_BASE_ADDR(kvaddr)) >> PAGE_SHIFT)
diff -Nru a/include/asm-alpha/pgtable.h b/include/asm-alpha/pgtable.h
--- a/include/asm-alpha/pgtable.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-alpha/pgtable.h Wed Aug 28 07:37:37 2002
@@ -195,8 +195,8 @@
#define PAGE_TO_PA(page) ((page - mem_map) << PAGE_SHIFT)
#else
#define PAGE_TO_PA(page) \
- ((((page)-(page)->zone->zone_mem_map) << PAGE_SHIFT) \
- + (page)->zone->zone_start_paddr)
+ ((( (page) - (page)->zone->zone_mem_map ) \
+ + (page)->zone->zone_start_pfn) << PAGE_SHIFT)
#endif
#ifndef CONFIG_DISCONTIGMEM
@@ -216,7 +216,7 @@
unsigned long pfn; \
\
pfn = ((unsigned long)((page)-(page)->zone->zone_mem_map)) << 32; \
- pfn += (page)->zone->zone_start_paddr << (32-PAGE_SHIFT); \
+ pfn += (page)->zone->zone_start_pfn << 32; \
pte_val(pte) = pfn | pgprot_val(pgprot); \
\
pte; \
diff -Nru a/include/asm-alpha/tlb.h b/include/asm-alpha/tlb.h
--- a/include/asm-alpha/tlb.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-alpha/tlb.h Wed Aug 28 07:37:37 2002
@@ -3,13 +3,13 @@
#define tlb_start_vma(tlb, vma) do { } while (0)
#define tlb_end_vma(tlb, vma) do { } while (0)
-#define tlb_remove_tlb_entry(tlb, pte, addr) do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, pte, addr) do { } while (0)
#define tlb_flush(tlb) flush_tlb_mm((tlb)->mm)
#include <asm-generic/tlb.h>
-#define pte_free_tlb(tlb,pte) pte_free(pte)
-#define pmd_free_tlb(tlb,pmd) pmd_free(pmd)
+#define __pte_free_tlb(tlb,pte) pte_free(pte)
+#define __pmd_free_tlb(tlb,pmd) pmd_free(pmd)
#endif
diff -Nru a/include/asm-arm/cache.h b/include/asm-arm/cache.h
--- a/include/asm-arm/cache.h Wed Aug 28 07:37:38 2002
+++ b/include/asm-arm/cache.h Wed Aug 28 07:37:38 2002
@@ -16,4 +16,6 @@
__section__(".data.cacheline_aligned")))
#endif
+#define L1_CACHE_SHIFT_MAX 5 /* largest L1 which this arch supports */
+
#endif
diff -Nru a/include/asm-arm/memory.h b/include/asm-arm/memory.h
--- a/include/asm-arm/memory.h Wed Aug 28 07:37:38 2002
+++ b/include/asm-arm/memory.h Wed Aug 28 07:37:38 2002
@@ -80,8 +80,8 @@
* around in memory.
*/
#define page_to_pfn(page) \
- (((page) - page_zone(page)->zone_mem_map) \
- + (page_zone(page)->zone_start_paddr >> PAGE_SHIFT))
+ (( (page) - page_zone(page)->zone_mem_map) \
+ + page_zone(page)->zone_start_pfn)
#define pfn_to_page(pfn) \
(PFN_TO_MAPBASE(pfn) + LOCAL_MAP_NR((pfn) << PAGE_SHIFT))
diff -Nru a/include/asm-arm/tlb.h b/include/asm-arm/tlb.h
--- a/include/asm-arm/tlb.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-arm/tlb.h Wed Aug 28 07:37:36 2002
@@ -11,11 +11,11 @@
#define tlb_end_vma(tlb,vma) \
flush_tlb_range(vma, vma->vm_start, vma->vm_end)
-#define tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
#include <asm-generic/tlb.h>
-#define pmd_free_tlb(tlb, pmd) pmd_free(pmd)
-#define pte_free_tlb(tlb, pte) pte_free(pte)
+#define __pmd_free_tlb(tlb, pmd) pmd_free(pmd)
+#define __pte_free_tlb(tlb, pte) pte_free(pte)
#endif
diff -Nru a/include/asm-cris/cache.h b/include/asm-cris/cache.h
--- a/include/asm-cris/cache.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-cris/cache.h Wed Aug 28 07:37:36 2002
@@ -7,4 +7,6 @@
#define L1_CACHE_BYTES 32
+#define L1_CACHE_SHIFT_MAX 5 /* largest L1 which this arch supports */
+
#endif /* _ASM_CACHE_H */
diff -Nru a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
--- a/include/asm-generic/tlb.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-generic/tlb.h Wed Aug 28 07:37:36 2002
@@ -36,9 +36,12 @@
typedef struct free_pte_ctx {
struct mm_struct *mm;
unsigned int nr; /* set to ~0U means fast mode */
+ unsigned int need_flush;/* Really unmapped some ptes? */
unsigned int fullmm; /* non-zero means full mm flush */
unsigned long freed;
struct page * pages[FREE_PTE_NR];
+ unsigned long flushes; /* stats: count of real flushes */
+ unsigned long avoided_flushes; /* stats: count of avoided flushes */
} mmu_gather_t;
/* Users of the generic TLB shootdown code must declare this storage space. */
@@ -66,13 +69,18 @@
{
unsigned long nr;
+ if (!tlb->need_flush) {
+ tlb->avoided_flushes++;
+ return;
+ }
+ tlb->need_flush = 0;
+ tlb->flushes++;
+
tlb_flush(tlb);
nr = tlb->nr;
if (!tlb_fast_mode(tlb)) {
- unsigned long i;
+ free_pages_and_swap_cache(tlb->pages, tlb->nr);
tlb->nr = 0;
- for (i=0; i < nr; i++)
- free_page_and_swap_cache(tlb->pages[i]);
}
}
@@ -103,6 +111,7 @@
*/
static inline void tlb_remove_page(mmu_gather_t *tlb, struct page *page)
{
+ tlb->need_flush = 1;
if (tlb_fast_mode(tlb)) {
free_page_and_swap_cache(page);
return;
@@ -112,5 +121,29 @@
tlb_flush_mmu(tlb, 0, 0);
}
-#endif /* _ASM_GENERIC__TLB_H */
+/**
+ * tlb_remove_tlb_entry - remember a pte unmapping for later tlb invalidation.
+ *
+ * Record the fact that ptes were really unmapped in ->need_flush, so we can
+ * later optimise away the tlb invalidate. This helps when userspace is
+ * unmapping already-unmapped pages, which happens quite a lot.
+ */
+#define tlb_remove_tlb_entry(tlb, ptep, address) \
+ do { \
+ tlb->need_flush = 1; \
+ __tlb_remove_tlb_entry(tlb, ptep, address); \
+ } while (0)
+
+#define pte_free_tlb(tlb, ptep) \
+ do { \
+ tlb->need_flush = 1; \
+ __pte_free_tlb(tlb, ptep); \
+ } while (0)
+
+#define pmd_free_tlb(tlb, pmdp) \
+ do { \
+ tlb->need_flush = 1; \
+ __pmd_free_tlb(tlb, pmdp); \
+ } while (0)
+#endif /* _ASM_GENERIC__TLB_H */
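
The asm-generic/tlb.h change is the heart of the flush-avoidance work: every
per-arch hook is renamed with a __ prefix and wrapped so that any operation
which really unmapped something sets ->need_flush, and the flush point
becomes a counted no-op when nothing did.  The control flow, stripped of the
page batching:

/*
 * Skeleton of the lazy TLB flush: mutators set need_flush, the
 * flush point tests it.  munmap() of already-unmapped ranges, a
 * common userspace pattern, then costs no TLB invalidate at all.
 */
static inline void sketch_flush_mmu(mmu_gather_t *tlb)
{
	if (!tlb->need_flush) {
		tlb->avoided_flushes++;		/* nothing was unmapped */
		return;
	}
	tlb->need_flush = 0;
	tlb->flushes++;
	tlb_flush(tlb);				/* the expensive invalidate */
}
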
diff -Nru a/include/asm-i386/cache.h b/include/asm-i386/cache.h
--- a/include/asm-i386/cache.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-i386/cache.h Wed Aug 28 07:37:36 2002
@@ -10,4 +10,6 @@
#define L1_CACHE_SHIFT (CONFIG_X86_L1_CACHE_SHIFT)
#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
+#define L1_CACHE_SHIFT_MAX 7 /* largest L1 which this arch supports */
+
#endif
diff -Nru a/include/asm-i386/highmem.h b/include/asm-i386/highmem.h
--- a/include/asm-i386/highmem.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-i386/highmem.h Wed Aug 28 07:37:36 2002
@@ -81,7 +81,7 @@
enum fixed_addresses idx;
unsigned long vaddr;
- preempt_disable();
+ inc_preempt_count();
if (page < highmem_start_page)
return page_address(page);
@@ -119,7 +119,7 @@
__flush_tlb_one(vaddr);
#endif
- preempt_enable();
+ dec_preempt_count();
}
#endif /* __KERNEL__ */
diff -Nru a/include/asm-i386/io.h b/include/asm-i386/io.h
--- a/include/asm-i386/io.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-i386/io.h Wed Aug 28 07:37:37 2002
@@ -96,11 +96,7 @@
/*
* Change "struct page" to physical address.
*/
-#ifdef CONFIG_HIGHMEM64G
-#define page_to_phys(page) ((u64)(page - mem_map) << PAGE_SHIFT)
-#else
-#define page_to_phys(page) ((page - mem_map) << PAGE_SHIFT)
-#endif
+#define page_to_phys(page) ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT)
extern void * __ioremap(unsigned long offset, unsigned long size, unsigned long flags);
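
Going through page_to_pfn() instead of subtracting mem_map directly makes the
one page_to_phys definition correct for both the contiguous mem_map and the
new discontigmem per-node maps, and the dma_addr_t cast keeps addresses above
4GB intact under PAE without the old HIGHMEM64G special case.  The equivalent
open-coded form:

/*
 * Sketch: the pfn-based conversion is layout-neutral.  With one
 * global mem_map it equals (page - mem_map) << PAGE_SHIFT; with
 * discontigmem only the pfn form is right, since pages live in
 * per-node arrays.
 */
static inline dma_addr_t sketch_page_to_phys(struct page *page)
{
	return (dma_addr_t)page_to_pfn(page) << PAGE_SHIFT;
}
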
diff -Nru a/include/asm-i386/max_numnodes.h b/include/asm-i386/max_numnodes.h
--- /dev/null Wed Dec 31 16:00:00 1969
+++ b/include/asm-i386/max_numnodes.h Wed Aug 28 07:37:38 2002
@@ -0,0 +1,12 @@
+#ifndef _ASM_MAX_NUMNODES_H
+#define _ASM_MAX_NUMNODES_H
+
+#include <linux/config.h>
+
+#ifdef CONFIG_X86_NUMAQ
+#include <asm/numaq.h>
+#else
+#define MAX_NUMNODES 1
+#endif /* CONFIG_X86_NUMAQ */
+
+#endif /* _ASM_MAX_NUMNODES_H */
diff -Nru a/include/asm-i386/mmzone.h b/include/asm-i386/mmzone.h
--- /dev/null Wed Dec 31 16:00:00 1969
+++ b/include/asm-i386/mmzone.h Wed Aug 28 07:37:38 2002
@@ -0,0 +1,78 @@
+/*
+ * Written by Pat Gaughen (gone@us.ibm.com) Mar 2002
+ *
+ */
+
+#ifndef _ASM_MMZONE_H_
+#define _ASM_MMZONE_H_
+
+#ifdef CONFIG_DISCONTIGMEM
+
+#ifdef CONFIG_X86_NUMAQ
+#include <asm/numaq.h>
+#else
+#define pa_to_nid(pa) (0)
+#define pfn_to_nid(pfn) (0)
+#ifdef CONFIG_NUMA
+#define _cpu_to_node(cpu) 0
+#endif /* CONFIG_NUMA */
+#endif /* CONFIG_X86_NUMAQ */
+
+#ifdef CONFIG_NUMA
+#define numa_node_id() _cpu_to_node(smp_processor_id())
+#endif /* CONFIG_NUMA */
+
+extern struct pglist_data *node_data[];
+
+/*
+ * Following are macros that are specific to this numa platform.
+ */
+#define reserve_bootmem(addr, size) \
+ reserve_bootmem_node(NODE_DATA(0), (addr), (size))
+#define alloc_bootmem(x) \
+ __alloc_bootmem_node(NODE_DATA(0), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
+#define alloc_bootmem_low(x) \
+ __alloc_bootmem_node(NODE_DATA(0), (x), SMP_CACHE_BYTES, 0)
+#define alloc_bootmem_pages(x) \
+ __alloc_bootmem_node(NODE_DATA(0), (x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS))
+#define alloc_bootmem_low_pages(x) \
+ __alloc_bootmem_node(NODE_DATA(0), (x), PAGE_SIZE, 0)
+#define alloc_bootmem_node(ignore, x) \
+ __alloc_bootmem_node(NODE_DATA(0), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
+#define alloc_bootmem_pages_node(ignore, x) \
+ __alloc_bootmem_node(NODE_DATA(0), (x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS))
+#define alloc_bootmem_low_pages_node(ignore, x) \
+ __alloc_bootmem_node(NODE_DATA(0), (x), PAGE_SIZE, 0)
+
+#define node_startnr(nid) (node_data[nid]->node_start_mapnr)
+#define node_size(nid) (node_data[nid]->node_size)
+#define node_localnr(pfn, nid) ((pfn) - node_data[nid]->node_start_pfn)
+
+/*
+ * Following are macros that each numa implementation must define.
+ */
+
+/*
+ * Given a kernel address, find the home node of the underlying memory.
+ */
+#define kvaddr_to_nid(kaddr) pa_to_nid(__pa(kaddr))
+
+/*
+ * Return a pointer to the node data for node n.
+ */
+#define NODE_DATA(nid) (node_data[nid])
+
+#define node_mem_map(nid) (NODE_DATA(nid)->node_mem_map)
+#define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn)
+
+#define local_mapnr(kvaddr) \
+ ( (__pa(kvaddr) >> PAGE_SHIFT) - node_start_pfn(kvaddr_to_nid(kvaddr)) )
+
+#define kern_addr_valid(kaddr) test_bit(local_mapnr(kaddr), \
+ NODE_DATA(kvaddr_to_nid(kaddr))->valid_addr_bitmap)
+
+#define pfn_to_page(pfn) (node_mem_map(pfn_to_nid(pfn)) + node_localnr(pfn, pfn_to_nid(pfn)))
+#define page_to_pfn(page) ((page - page_zone(page)->zone_mem_map) + page_zone(page)->zone_start_pfn)
+#define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
+#endif /* CONFIG_DISCONTIGMEM */
+#endif /* _ASM_MMZONE_H_ */
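
For clarity, the discontigmem pfn_to_page() defined above is a two-step
lookup, pfn to owning node and then offset into that node's private mem_map
slice; written out as a function (same macros as above, expressed with the
pglist_data fields they expand to):

/*
 * Sketch: pfn_to_page on discontigmem, expanded.  A global pfn is
 * the owning node's start pfn plus a local offset into the node's
 * own mem_map array.
 */
static inline struct page *sketch_pfn_to_page(unsigned long pfn)
{
	int nid = pfn_to_nid(pfn);		/* which node owns this pfn */

	return NODE_DATA(nid)->node_mem_map +
		(pfn - NODE_DATA(nid)->node_start_pfn);
}
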
diff -Nru a/include/asm-i386/numaq.h b/include/asm-i386/numaq.h
--- /dev/null Wed Dec 31 16:00:00 1969
+++ b/include/asm-i386/numaq.h Wed Aug 28 07:37:38 2002
@@ -0,0 +1,177 @@
+/*
+ * Written by: Patricia Gaughen, IBM Corporation
+ *
+ * Copyright (C) 2002, IBM Corp.
+ *
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT. See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ * Send feedback to <gone@us.ibm.com>
+ */
+
+#ifndef NUMAQ_H
+#define NUMAQ_H
+
+#ifdef CONFIG_X86_NUMAQ
+
+#include <asm/smpboot.h>
+
+/*
+ * for now assume that 64Gb is max amount of RAM for whole system
+ * 64Gb * 1024Mb/Gb = 65536 Mb
+ * 65536 Mb / 256Mb = 256
+ */
+#define MAX_ELEMENTS 256
+#define ELEMENT_REPRESENTS 8 /* 256 Mb */
+
+#define MAX_NUMNODES 8
+#ifdef CONFIG_NUMA
+#define _cpu_to_node(cpu) (cpu_to_logical_apicid(cpu) >> 4)
+#endif /* CONFIG_NUMA */
+extern int pa_to_nid(u64);
+extern int pfn_to_nid(unsigned long);
+extern void get_memcfg_numaq(void);
+#define get_memcfg_numa() get_memcfg_numaq()
+
+/*
+ * SYS_CFG_DATA_PRIV_ADDR, struct eachquadmem, and struct sys_cfg_data are the
+ */
+#define SYS_CFG_DATA_PRIV_ADDR 0x0009d000 /* place for scd in private quad space */
+
+/*
+ * Communication area for each processor on lynxer-processor tests.
+ *
+ * NOTE: If you change the size of this eachproc structure you need
+ * to change the definition for EACH_QUAD_SIZE.
+ */
+struct eachquadmem {
+ unsigned int priv_mem_start; /* Starting address of this */
+ /* quad's private memory. */
+ /* This is always 0. */
+ /* In MB. */
+ unsigned int priv_mem_size; /* Size of this quad's */
+ /* private memory. */
+ /* In MB. */
+ unsigned int low_shrd_mem_strp_start;/* Starting address of this */
+ /* quad's low shared block */
+ /* (untranslated). */
+ /* In MB. */
+ unsigned int low_shrd_mem_start; /* Starting address of this */
+ /* quad's low shared memory */
+ /* (untranslated). */
+ /* In MB. */
+ unsigned int low_shrd_mem_size; /* Size of this quad's low */
+ /* shared memory. */
+ /* In MB. */
+ unsigned int lmmio_copb_start; /* Starting address of this */
+ /* quad's local memory */
+ /* mapped I/O in the */
+ /* compatibility OPB. */
+ /* In MB. */
+ unsigned int lmmio_copb_size; /* Size of this quad's local */
+ /* memory mapped I/O in the */
+ /* compatibility OPB. */
+ /* In MB. */
+ unsigned int lmmio_nopb_start; /* Starting address of this */
+ /* quad's local memory */
+ /* mapped I/O in the */
+ /* non-compatibility OPB. */
+ /* In MB. */
+ unsigned int lmmio_nopb_size; /* Size of this quad's local */
+ /* memory mapped I/O in the */
+ /* non-compatibility OPB. */
+ /* In MB. */
+ unsigned int io_apic_0_start; /* Starting address of I/O */
+ /* APIC 0. */
+ unsigned int io_apic_0_sz; /* Size I/O APIC 0. */
+ unsigned int io_apic_1_start; /* Starting address of I/O */
+ /* APIC 1. */
+ unsigned int io_apic_1_sz; /* Size I/O APIC 1. */
+ unsigned int hi_shrd_mem_start; /* Starting address of this */
+ /* quad's high shared memory.*/
+ /* In MB. */
+ unsigned int hi_shrd_mem_size; /* Size of this quad's high */
+ /* shared memory. */
+ /* In MB. */
+ unsigned int mps_table_addr; /* Address of this quad's */
+ /* MPS tables from BIOS, */
+ /* in system space.*/
+ unsigned int lcl_MDC_pio_addr; /* Port-I/O address for */
+ /* local access of MDC. */
+ unsigned int rmt_MDC_mmpio_addr; /* MM-Port-I/O address for */
+ /* remote access of MDC. */
+ unsigned int mm_port_io_start; /* Starting address of this */
+ /* quad's memory mapped Port */
+ /* I/O space. */
+ unsigned int mm_port_io_size; /* Size of this quad's memory*/
+ /* mapped Port I/O space. */
+ unsigned int mm_rmt_io_apic_start; /* Starting address of this */
+ /* quad's memory mapped */
+ /* remote I/O APIC space. */
+ unsigned int mm_rmt_io_apic_size; /* Size of this quad's memory*/
+ /* mapped remote I/O APIC */
+ /* space. */
+ unsigned int mm_isa_start; /* Starting address of this */
+ /* quad's memory mapped ISA */
+ /* space (contains MDC */
+ /* memory space). */
+ unsigned int mm_isa_size; /* Size of this quad's memory*/
+ /* mapped ISA space (contains*/
+ /* MDC memory space). */
+ unsigned int rmt_qmi_addr; /* Remote addr to access QMI.*/
+ unsigned int lcl_qmi_addr; /* Local addr to access QMI. */
+};
+
+/*
+ * Note: This structure must be NOT be changed unless the multiproc and
+ * OS are changed to reflect the new structure.
+ */
+struct sys_cfg_data {
+ unsigned int quad_id;
+ unsigned int bsp_proc_id; /* Boot Strap Processor in this quad. */
+ unsigned int scd_version; /* Version number of this table. */
+ unsigned int first_quad_id;
+ unsigned int quads_present31_0; /* 1 bit for each quad */
+ unsigned int quads_present63_32; /* 1 bit for each quad */
+ unsigned int config_flags;
+ unsigned int boot_flags;
+ unsigned int csr_start_addr; /* Absolute value (not in MB) */
+ unsigned int csr_size; /* Absolute value (not in MB) */
+ unsigned int lcl_apic_start_addr; /* Absolute value (not in MB) */
+ unsigned int lcl_apic_size; /* Absolute value (not in MB) */
+ unsigned int low_shrd_mem_base; /* 0 or 512MB or 1GB */
+ unsigned int low_shrd_mem_quad_offset; /* 0,128M,256M,512M,1G */
+ /* may not be totally populated */
+ unsigned int split_mem_enbl; /* 0 for no low shared memory */
+ unsigned int mmio_sz; /* Size of total system memory mapped I/O */
+ /* (in MB). */
+ unsigned int quad_spin_lock; /* Spare location used for quad */
+ /* bringup. */
+ unsigned int nonzero55; /* For checksumming. */
+ unsigned int nonzeroaa; /* For checksumming. */
+ unsigned int scd_magic_number;
+ unsigned int system_type;
+ unsigned int checksum;
+ /*
+ * memory configuration area for each quad
+ */
+ struct eachquadmem eq[MAX_NUMNODES]; /* indexed by quad id */
+};
+
+#endif /* CONFIG_X86_NUMAQ */
+#endif /* NUMAQ_H */
+
diff -Nru a/include/asm-i386/page.h b/include/asm-i386/page.h
--- a/include/asm-i386/page.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-i386/page.h Wed Aug 28 07:37:36 2002
@@ -134,8 +134,10 @@
#define MAXMEM ((unsigned long)(-PAGE_OFFSET-VMALLOC_RESERVE))
#define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
+#ifndef CONFIG_DISCONTIGMEM
#define pfn_to_page(pfn) (mem_map + (pfn))
#define page_to_pfn(page) ((unsigned long)((page) - mem_map))
+#endif /* !CONFIG_DISCONTIGMEM */
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
#define pfn_valid(pfn) ((pfn) < max_mapnr)
diff -Nru a/include/asm-i386/pci.h b/include/asm-i386/pci.h
--- a/include/asm-i386/pci.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-i386/pci.h Wed Aug 28 07:37:37 2002
@@ -109,7 +109,7 @@
if (direction == PCI_DMA_NONE)
BUG();
- return (dma_addr_t)(page - mem_map) * PAGE_SIZE + offset;
+ return (dma_addr_t)(page_to_pfn(page)) * PAGE_SIZE + offset;
}
static inline void pci_unmap_page(struct pci_dev *hwdev, dma_addr_t dma_address,
@@ -238,9 +238,7 @@
static __inline__ struct page *
pci_dac_dma_to_page(struct pci_dev *pdev, dma64_addr_t dma_addr)
{
- unsigned long poff = (dma_addr >> PAGE_SHIFT);
-
- return mem_map + poff;
+ return pfn_to_page(dma_addr >> PAGE_SHIFT);
}
static __inline__ unsigned long
diff -Nru a/include/asm-i386/pgalloc.h b/include/asm-i386/pgalloc.h
--- a/include/asm-i386/pgalloc.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-i386/pgalloc.h Wed Aug 28 07:37:36 2002
@@ -13,7 +13,7 @@
static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
{
set_pmd(pmd, __pmd(_PAGE_TABLE +
- ((unsigned long long)(pte - mem_map) <<
+ ((unsigned long long)page_to_pfn(pte) <<
(unsigned long long) PAGE_SHIFT)));
}
/*
@@ -37,7 +37,7 @@
}
-#define pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
+#define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
/*
* allocating and freeing a pmd is trivial: the 1-entry pmd is
@@ -47,7 +47,7 @@
#define pmd_alloc_one(mm, addr) ({ BUG(); ((pmd_t *)2); })
#define pmd_free(x) do { } while (0)
-#define pmd_free_tlb(tlb,x) do { } while (0)
+#define __pmd_free_tlb(tlb,x) do { } while (0)
#define pgd_populate(mm, pmd, pte) BUG()
#define check_pgt_cache() do { } while (0)
diff -Nru a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
--- a/include/asm-i386/pgtable.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-i386/pgtable.h Wed Aug 28 07:37:37 2002
@@ -234,8 +234,9 @@
#define pmd_page_kernel(pmd) \
((unsigned long) __va(pmd_val(pmd) & PAGE_MASK))
-#define pmd_page(pmd) \
- (mem_map + (pmd_val(pmd) >> PAGE_SHIFT))
+#ifndef CONFIG_DISCONTIGMEM
+#define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
+#endif /* !CONFIG_DISCONTIGMEM */
#define pmd_large(pmd) \
((pmd_val(pmd) & (_PAGE_PSE|_PAGE_PRESENT)) == (_PAGE_PSE|_PAGE_PRESENT))
@@ -280,7 +281,9 @@
#endif /* !__ASSEMBLY__ */
+#ifndef CONFIG_DISCONTIGMEM
#define kern_addr_valid(addr) (1)
+#endif /* !CONFIG_DISCONTIGMEM */
#define io_remap_page_range remap_page_range
diff -Nru a/include/asm-i386/semaphore.h b/include/asm-i386/semaphore.h
--- a/include/asm-i386/semaphore.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-i386/semaphore.h Wed Aug 28 07:37:37 2002
@@ -40,6 +40,7 @@
#include <asm/atomic.h>
#include <linux/wait.h>
#include <linux/rwsem.h>
+#include <linux/config.h>
struct semaphore {
atomic_t count;
@@ -55,6 +56,12 @@
, (int)&(name).__magic
#else
# define __SEM_DEBUG_INIT(name)
+#endif
+
+#ifdef CONFIG_DEBUG_SPINLOCK
+# define assert_sem_held(sem) BUG_ON(!down_trylock(sem))
+#else
+# define assert_sem_held(sem) do { } while(0)
#endif
#define __SEMAPHORE_INITIALIZER(name,count) \
diff -Nru a/include/asm-i386/setup.h b/include/asm-i386/setup.h
--- a/include/asm-i386/setup.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-i386/setup.h Wed Aug 28 07:37:37 2002
@@ -6,5 +6,38 @@
#ifndef _i386_SETUP_H
#define _i386_SETUP_H
+#define PFN_UP(x) (((x) + PAGE_SIZE-1) >> PAGE_SHIFT)
+#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
+#define PFN_PHYS(x) ((x) << PAGE_SHIFT)
+
+/*
+ * Reserved space for vmalloc and iomap - defined in asm/page.h
+ */
+#define MAXMEM_PFN PFN_DOWN(MAXMEM)
+#define MAX_NONPAE_PFN (1 << 20)
+
+/*
+ * This is set up by the setup-routine at boot-time
+ */
+#define PARAM ((unsigned char *)empty_zero_page)
+#define SCREEN_INFO (*(struct screen_info *) (PARAM+0))
+#define EXT_MEM_K (*(unsigned short *) (PARAM+2))
+#define ALT_MEM_K (*(unsigned long *) (PARAM+0x1e0))
+#define E820_MAP_NR (*(char*) (PARAM+E820NR))
+#define E820_MAP ((struct e820entry *) (PARAM+E820MAP))
+#define APM_BIOS_INFO (*(struct apm_bios_info *) (PARAM+0x40))
+#define DRIVE_INFO (*(struct drive_info_struct *) (PARAM+0x80))
+#define SYS_DESC_TABLE (*(struct sys_desc_table_struct*)(PARAM+0xa0))
+#define MOUNT_ROOT_RDONLY (*(unsigned short *) (PARAM+0x1F2))
+#define RAMDISK_FLAGS (*(unsigned short *) (PARAM+0x1F8))
+#define VIDEO_MODE (*(unsigned short *) (PARAM+0x1FA))
+#define ORIG_ROOT_DEV (*(unsigned short *) (PARAM+0x1FC))
+#define AUX_DEVICE_INFO (*(unsigned char *) (PARAM+0x1FF))
+#define LOADER_TYPE (*(unsigned char *) (PARAM+0x210))
+#define KERNEL_START (*(unsigned long *) (PARAM+0x214))
+#define INITRD_START (*(unsigned long *) (PARAM+0x218))
+#define INITRD_SIZE (*(unsigned long *) (PARAM+0x21c))
+#define COMMAND_LINE ((char *) (PARAM+2048))
+#define COMMAND_LINE_SIZE 256
#endif /* _i386_SETUP_H */
diff -Nru a/include/asm-i386/spinlock.h b/include/asm-i386/spinlock.h
--- a/include/asm-i386/spinlock.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-i386/spinlock.h Wed Aug 28 07:37:36 2002
@@ -157,6 +157,7 @@
#define RW_LOCK_UNLOCKED (rwlock_t) { RW_LOCK_BIAS RWLOCK_MAGIC_INIT }
#define rwlock_init(x) do { *(x) = RW_LOCK_UNLOCKED; } while(0)
+#define rwlock_is_locked(x) ((x)->lock != RW_LOCK_BIAS)
/*
* On x86, we implement read-write locks as a 32-bit counter
diff -Nru a/include/asm-i386/tlb.h b/include/asm-i386/tlb.h
--- a/include/asm-i386/tlb.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-i386/tlb.h Wed Aug 28 07:37:37 2002
@@ -7,7 +7,7 @@
*/
#define tlb_start_vma(tlb, vma) do { } while (0)
#define tlb_end_vma(tlb, vma) do { } while (0)
-#define tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
/*
* .. because we flush the whole mm when it
diff -Nru a/include/asm-ia64/cache.h b/include/asm-ia64/cache.h
--- a/include/asm-ia64/cache.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-ia64/cache.h Wed Aug 28 07:37:37 2002
@@ -12,6 +12,8 @@
#define L1_CACHE_SHIFT CONFIG_IA64_L1_CACHE_SHIFT
#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
+#define L1_CACHE_SHIFT_MAX 7 /* largest L1 which this arch supports */
+
#ifdef CONFIG_SMP
# define SMP_CACHE_SHIFT L1_CACHE_SHIFT
# define SMP_CACHE_BYTES L1_CACHE_BYTES
diff -Nru a/include/asm-ia64/pgalloc.h b/include/asm-ia64/pgalloc.h
--- a/include/asm-ia64/pgalloc.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-ia64/pgalloc.h Wed Aug 28 07:37:36 2002
@@ -108,7 +108,7 @@
++pgtable_cache_size;
}
-#define pmd_free_tlb(tlb, pmd) pmd_free(pmd)
+#define __pmd_free_tlb(tlb, pmd) pmd_free(pmd)
static inline void
pmd_populate (struct mm_struct *mm, pmd_t *pmd_entry, struct page *pte)
@@ -154,7 +154,7 @@
free_page((unsigned long) pte);
}
-#define pte_free_tlb(tlb, pte) tlb_remove_page((tlb), (pte))
+#define __pte_free_tlb(tlb, pte) tlb_remove_page((tlb), (pte))
extern void check_pgt_cache (void);
diff -Nru a/include/asm-ia64/semaphore.h b/include/asm-ia64/semaphore.h
--- a/include/asm-ia64/semaphore.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-ia64/semaphore.h Wed Aug 28 07:37:37 2002
@@ -6,6 +6,7 @@
* Copyright (C) 1998-2000 David Mosberger-Tang <davidm@hpl.hp.com>
*/
+#include <linux/config.h>
#include <linux/wait.h>
#include <linux/rwsem.h>
@@ -24,6 +25,12 @@
# define __SEM_DEBUG_INIT(name) , (long) &(name).__magic
#else
# define __SEM_DEBUG_INIT(name)
+#endif
+
+#ifdef CONFIG_DEBUG_SPINLOCK
+# define assert_sem_held(sem) BUG_ON(!down_trylock(sem))
+#else
+# define assert_sem_held(sem) do { } while(0)
#endif
#define __SEMAPHORE_INITIALIZER(name,count) \
diff -Nru a/include/asm-ia64/spinlock.h b/include/asm-ia64/spinlock.h
--- a/include/asm-ia64/spinlock.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-ia64/spinlock.h Wed Aug 28 07:37:36 2002
@@ -109,6 +109,7 @@
#define RW_LOCK_UNLOCKED (rwlock_t) { 0, 0 }
#define rwlock_init(x) do { *(x) = RW_LOCK_UNLOCKED; } while(0)
+#define rwlock_is_locked(x) ((x)->read_counter != 0 || (x)->write_lock != 0)
#define _raw_read_lock(rw) \
do { \
diff -Nru a/include/asm-ia64/tlb.h b/include/asm-ia64/tlb.h
--- a/include/asm-ia64/tlb.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-ia64/tlb.h Wed Aug 28 07:37:37 2002
@@ -172,7 +172,7 @@
* PTE, not just those pointing to (normal) physical memory.
*/
static inline void
-tlb_remove_tlb_entry (mmu_gather_t *tlb, pte_t *ptep, unsigned long address)
+__tlb_remove_tlb_entry (mmu_gather_t *tlb, pte_t *ptep, unsigned long address)
{
if (tlb->start_addr == ~0UL)
tlb->start_addr = address;
diff -Nru a/include/asm-m68k/cache.h b/include/asm-m68k/cache.h
--- a/include/asm-m68k/cache.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-m68k/cache.h Wed Aug 28 07:37:37 2002
@@ -8,4 +8,6 @@
#define L1_CACHE_SHIFT 4
#define L1_CACHE_BYTES (1<< L1_CACHE_SHIFT)
+#define L1_CACHE_SHIFT_MAX 4 /* largest L1 which this arch supports */
+
#endif
diff -Nru a/include/asm-m68k/motorola_pgalloc.h b/include/asm-m68k/motorola_pgalloc.h
--- a/include/asm-m68k/motorola_pgalloc.h Wed Aug 28 07:37:38 2002
+++ b/include/asm-m68k/motorola_pgalloc.h Wed Aug 28 07:37:38 2002
@@ -55,7 +55,7 @@
__free_page(page);
}
-static inline void pte_free_tlb(mmu_gather_t *tlb, struct page *page)
+static inline void __pte_free_tlb(mmu_gather_t *tlb, struct page *page)
{
cache_page(kmap(page));
kunmap(page);
@@ -73,7 +73,7 @@
return free_pointer_table(pmd);
}
-static inline int pmd_free_tlb(mmu_gather_t *tlb, pmd_t *pmd)
+static inline int __pmd_free_tlb(mmu_gather_t *tlb, pmd_t *pmd)
{
return free_pointer_table(pmd);
}
diff -Nru a/include/asm-m68k/sun3_pgalloc.h b/include/asm-m68k/sun3_pgalloc.h
--- a/include/asm-m68k/sun3_pgalloc.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-m68k/sun3_pgalloc.h Wed Aug 28 07:37:37 2002
@@ -31,7 +31,7 @@
__free_page(page);
}
-static inline void pte_free_tlb(mmu_gather_t *tlb, struct page *page)
+static inline void __pte_free_tlb(mmu_gather_t *tlb, struct page *page)
{
tlb_remove_page(tlb, page);
}
@@ -76,7 +76,7 @@
* inside the pgd, so has no extra memory associated with it.
*/
#define pmd_free(x) do { } while (0)
-#define pmd_free_tlb(tlb, x) do { } while (0)
+#define __pmd_free_tlb(tlb, x) do { } while (0)
static inline void pgd_free(pgd_t * pgd)
{
diff -Nru a/include/asm-m68k/tlb.h b/include/asm-m68k/tlb.h
--- a/include/asm-m68k/tlb.h Wed Aug 28 07:37:38 2002
+++ b/include/asm-m68k/tlb.h Wed Aug 28 07:37:38 2002
@@ -7,7 +7,7 @@
*/
#define tlb_start_vma(tlb, vma) do { } while (0)
#define tlb_end_vma(tlb, vma) do { } while (0)
-#define tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
/*
* .. because we flush the whole mm when it
diff -Nru a/include/asm-mips/cache.h b/include/asm-mips/cache.h
--- a/include/asm-mips/cache.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-mips/cache.h Wed Aug 28 07:37:37 2002
@@ -35,5 +35,6 @@
#endif
#define SMP_CACHE_BYTES L1_CACHE_BYTES
+#define L1_CACHE_SHIFT_MAX 5 /* largest L1 which this arch supports */
#endif /* _ASM_CACHE_H */
diff -Nru a/include/asm-mips64/cache.h b/include/asm-mips64/cache.h
--- a/include/asm-mips64/cache.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-mips64/cache.h Wed Aug 28 07:37:37 2002
@@ -11,5 +11,6 @@
/* bytes per L1 cache line */
#define L1_CACHE_BYTES (1 << CONFIG_L1_CACHE_SHIFT)
+#define L1_CACHE_SHIFT_MAX 7 /* largest L1 which this arch supports */
#endif /* _ASM_CACHE_H */
diff -Nru a/include/asm-mips64/mmzone.h b/include/asm-mips64/mmzone.h
--- a/include/asm-mips64/mmzone.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-mips64/mmzone.h Wed Aug 28 07:37:36 2002
@@ -27,7 +27,7 @@
#define PLAT_NODE_DATA_STARTNR(n) (PLAT_NODE_DATA(n)->gendata.node_start_mapnr)
#define PLAT_NODE_DATA_SIZE(n) (PLAT_NODE_DATA(n)->gendata.node_size)
#define PLAT_NODE_DATA_LOCALNR(p, n) \
- (((p) - PLAT_NODE_DATA(n)->gendata.node_start_paddr) >> PAGE_SHIFT)
+ (((p) >> PAGE_SHIFT) - PLAT_NODE_DATA(n)->gendata.node_start_pfn)
#define numa_node_id() cputocnode(current->processor)
diff -Nru a/include/asm-mips64/pgtable.h b/include/asm-mips64/pgtable.h
--- a/include/asm-mips64/pgtable.h Wed Aug 28 07:37:38 2002
+++ b/include/asm-mips64/pgtable.h Wed Aug 28 07:37:38 2002
@@ -484,8 +484,8 @@
#define PAGE_TO_PA(page) ((page - mem_map) << PAGE_SHIFT)
#else
#define PAGE_TO_PA(page) \
- ((((page)-(page)->zone->zone_mem_map) << PAGE_SHIFT) \
- + ((page)->zone->zone_start_paddr))
+ (( ((page)-(page)->zone->zone_mem_map) + \
+ (page)->zone->zone_start_pfn) << PAGE_SHIFT)
#endif
#define mk_pte(page, pgprot) \
({ \
diff -Nru a/include/asm-parisc/cache.h b/include/asm-parisc/cache.h
--- a/include/asm-parisc/cache.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-parisc/cache.h Wed Aug 28 07:37:36 2002
@@ -34,6 +34,7 @@
#define L1_CACHE_ALIGN(x) (((x)+(L1_CACHE_BYTES-1))&~(L1_CACHE_BYTES-1))
#define SMP_CACHE_BYTES L1_CACHE_BYTES
+#define L1_CACHE_SHIFT_MAX 5 /* largest L1 which this arch supports */
#define __cacheline_aligned __attribute__((__aligned__(L1_CACHE_BYTES)))
diff -Nru a/include/asm-ppc/cache.h b/include/asm-ppc/cache.h
--- a/include/asm-ppc/cache.h Wed Aug 28 07:37:38 2002
+++ b/include/asm-ppc/cache.h Wed Aug 28 07:37:38 2002
@@ -29,6 +29,7 @@
#define L1_CACHE_BYTES L1_CACHE_LINE_SIZE
#define L1_CACHE_SHIFT LG_L1_CACHE_LINE_SIZE
#define SMP_CACHE_BYTES L1_CACHE_BYTES
+#define L1_CACHE_SHIFT_MAX 7 /* largest L1 which this arch supports */
#define L1_CACHE_ALIGN(x) (((x)+(L1_CACHE_BYTES-1))&~(L1_CACHE_BYTES-1))
#define L1_CACHE_PAGES 8
diff -Nru a/include/asm-ppc/hardirq.h b/include/asm-ppc/hardirq.h
--- a/include/asm-ppc/hardirq.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-ppc/hardirq.h Wed Aug 28 07:37:36 2002
@@ -85,8 +85,10 @@
#define irq_enter() (preempt_count() += HARDIRQ_OFFSET)
#if CONFIG_PREEMPT
+# define in_atomic() (preempt_count() != kernel_locked())
# define IRQ_EXIT_OFFSET (HARDIRQ_OFFSET-1)
#else
+# define in_atomic() (preempt_count() != 0)
# define IRQ_EXIT_OFFSET HARDIRQ_OFFSET
#endif
#define irq_exit() \
diff -Nru a/include/asm-ppc/highmem.h b/include/asm-ppc/highmem.h
--- a/include/asm-ppc/highmem.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-ppc/highmem.h Wed Aug 28 07:37:37 2002
@@ -88,6 +88,7 @@
unsigned int idx;
unsigned long vaddr;
+ inc_preempt_count();
if (page < highmem_start_page)
return page_address(page);
@@ -122,6 +123,7 @@
pte_clear(kmap_pte+idx);
flush_tlb_page(0, vaddr);
#endif
+ dec_preempt_count();
}
#endif /* __KERNEL__ */
diff -Nru a/include/asm-ppc/pgalloc.h b/include/asm-ppc/pgalloc.h
--- a/include/asm-ppc/pgalloc.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-ppc/pgalloc.h Wed Aug 28 07:37:36 2002
@@ -20,7 +20,7 @@
*/
#define pmd_alloc_one(mm,address) ({ BUG(); ((pmd_t *)2); })
#define pmd_free(x) do { } while (0)
-#define pmd_free_tlb(tlb,x) do { } while (0)
+#define __pmd_free_tlb(tlb,x) do { } while (0)
#define pgd_populate(mm, pmd, pte) BUG()
#define pmd_populate_kernel(mm, pmd, pte) \
@@ -33,7 +33,7 @@
extern void pte_free_kernel(pte_t *pte);
extern void pte_free(struct page *pte);
-#define pte_free_tlb(tlb, pte) pte_free((pte))
+#define __pte_free_tlb(tlb, pte) pte_free((pte))
#define check_pgt_cache() do { } while (0)
diff -Nru a/include/asm-ppc/tlb.h b/include/asm-ppc/tlb.h
--- a/include/asm-ppc/tlb.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-ppc/tlb.h Wed Aug 28 07:37:36 2002
@@ -34,7 +34,7 @@
extern void flush_hash_entry(struct mm_struct *mm, pte_t *ptep,
unsigned long address);
-static inline void tlb_remove_tlb_entry(mmu_gather_t *tlb, pte_t *ptep,
+static inline void __tlb_remove_tlb_entry(mmu_gather_t *tlb, pte_t *ptep,
unsigned long address)
{
if (pte_val(*ptep) & _PAGE_HASHPTE)
@@ -50,7 +50,7 @@
#define tlb_finish_arch(tlb) do { } while (0)
#define tlb_start_vma(tlb, vma) do { } while (0)
#define tlb_end_vma(tlb, vma) do { } while (0)
-#define tlb_remove_tlb_entry(tlb, pte, address) do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, pte, address) do { } while (0)
#define tlb_flush(tlb) flush_tlb_mm((tlb)->mm)
/* Get the generic bits... */
diff -Nru a/include/asm-ppc64/cache.h b/include/asm-ppc64/cache.h
--- a/include/asm-ppc64/cache.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-ppc64/cache.h Wed Aug 28 07:37:36 2002
@@ -12,5 +12,6 @@
#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
#define SMP_CACHE_BYTES L1_CACHE_BYTES
+#define L1_CACHE_SHIFT_MAX 7 /* largest L1 which this arch supports */
#endif
diff -Nru a/include/asm-ppc64/mmzone.h b/include/asm-ppc64/mmzone.h
--- a/include/asm-ppc64/mmzone.h Wed Aug 28 07:37:38 2002
+++ b/include/asm-ppc64/mmzone.h Wed Aug 28 07:37:38 2002
@@ -31,7 +31,7 @@
(PLAT_NODE_DATA(n)->gendata.node_start_mapnr)
#define PLAT_NODE_DATA_SIZE(n) (PLAT_NODE_DATA(n)->gendata.node_size)
#define PLAT_NODE_DATA_LOCALNR(p, n) \
- (((p) - PLAT_NODE_DATA(n)->gendata.node_start_paddr) >> PAGE_SHIFT)
+ (((p) >> PAGE_SHIFT) - PLAT_NODE_DATA(n)->gendata.node_start_pfn)
#ifdef CONFIG_DISCONTIGMEM
@@ -67,7 +67,7 @@
* node's mem_map.
*/
#define LOCAL_BASE_ADDR(kaddr) \
- ((unsigned long)__va(NODE_DATA(KVADDR_TO_NID(kaddr))->node_start_paddr))
+ ((unsigned long)__va(NODE_DATA(KVADDR_TO_NID(kaddr))->node_start_pfn << PAGE_SHIFT))
#define LOCAL_MAP_NR(kvaddr) \
(((unsigned long)(kvaddr)-LOCAL_BASE_ADDR(kvaddr)) >> PAGE_SHIFT)
diff -Nru a/include/asm-ppc64/page.h b/include/asm-ppc64/page.h
--- a/include/asm-ppc64/page.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-ppc64/page.h Wed Aug 28 07:37:36 2002
@@ -207,7 +207,7 @@
#ifdef CONFIG_DISCONTIGMEM
#define page_to_pfn(page) \
((page) - page_zone(page)->zone_mem_map + \
- (page_zone(page)->zone_start_paddr >> PAGE_SHIFT))
+ (page_zone(page)->zone_start_pfn))
#define pfn_to_page(pfn) discontigmem_pfn_to_page(pfn)
#else
#define pfn_to_page(pfn) (mem_map + (pfn))
diff -Nru a/include/asm-ppc64/pgalloc.h b/include/asm-ppc64/pgalloc.h
--- a/include/asm-ppc64/pgalloc.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-ppc64/pgalloc.h Wed Aug 28 07:37:36 2002
@@ -53,7 +53,7 @@
free_page((unsigned long)pmd);
}
-#define pmd_free_tlb(tlb, pmd) pmd_free(pmd)
+#define __pmd_free_tlb(tlb, pmd) pmd_free(pmd)
#define pmd_populate_kernel(mm, pmd, pte) pmd_set(pmd, pte)
#define pmd_populate(mm, pmd, pte_page) \
@@ -88,7 +88,7 @@
}
#define pte_free(pte_page) pte_free_kernel(page_address(pte_page))
-#define pte_free_tlb(tlb, pte) pte_free(pte)
+#define __pte_free_tlb(tlb, pte) pte_free(pte)
#define check_pgt_cache() do { } while (0)
diff -Nru a/include/asm-ppc64/tlb.h b/include/asm-ppc64/tlb.h
--- a/include/asm-ppc64/tlb.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-ppc64/tlb.h Wed Aug 28 07:37:37 2002
@@ -40,7 +40,7 @@
extern struct ppc64_tlb_batch ppc64_tlb_batch[NR_CPUS];
-static inline void tlb_remove_tlb_entry(mmu_gather_t *tlb, pte_t *ptep,
+static inline void __tlb_remove_tlb_entry(mmu_gather_t *tlb, pte_t *ptep,
unsigned long address)
{
int cpu = smp_processor_id();
diff -Nru a/include/asm-s390/cache.h b/include/asm-s390/cache.h
--- a/include/asm-s390/cache.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-s390/cache.h Wed Aug 28 07:37:36 2002
@@ -13,5 +13,6 @@
#define L1_CACHE_BYTES 256
#define L1_CACHE_SHIFT 8
+#define L1_CACHE_SHIFT_MAX 8 /* largest L1 which this arch supports */
#endif
diff -Nru a/include/asm-s390/pgalloc.h b/include/asm-s390/pgalloc.h
--- a/include/asm-s390/pgalloc.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-s390/pgalloc.h Wed Aug 28 07:37:37 2002
@@ -49,7 +49,7 @@
*/
#define pmd_alloc_one(mm,address) ({ BUG(); ((pmd_t *)2); })
#define pmd_free(x) do { } while (0)
-#define pmd_free_tlb(tlb,x) do { } while (0)
+#define __pmd_free_tlb(tlb,x) do { } while (0)
#define pgd_populate(mm, pmd, pte) BUG()
static inline void
@@ -107,7 +107,7 @@
__free_page(pte);
}
-#define pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
+#define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
/*
* This establishes kernel virtual mappings (e.g., as a result of a
diff -Nru a/include/asm-s390/tlb.h b/include/asm-s390/tlb.h
--- a/include/asm-s390/tlb.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-s390/tlb.h Wed Aug 28 07:37:37 2002
@@ -7,7 +7,7 @@
*/
#define tlb_start_vma(tlb, vma) do { } while (0)
#define tlb_end_vma(tlb, vma) do { } while (0)
-#define tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
/*
* .. because we flush the whole mm when it
diff -Nru a/include/asm-s390x/cache.h b/include/asm-s390x/cache.h
--- a/include/asm-s390x/cache.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-s390x/cache.h Wed Aug 28 07:37:36 2002
@@ -13,5 +13,6 @@
#define L1_CACHE_BYTES 256
#define L1_CACHE_SHIFT 8
+#define L1_CACHE_SHIFT_MAX 8 /* largest L1 which this arch supports */
#endif
diff -Nru a/include/asm-s390x/pgalloc.h b/include/asm-s390x/pgalloc.h
--- a/include/asm-s390x/pgalloc.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-s390x/pgalloc.h Wed Aug 28 07:37:36 2002
@@ -68,7 +68,7 @@
free_pages((unsigned long) pmd, 2);
}
-#define pmd_free_tlb(tlb,pmd) pmd_free(pmd)
+#define __pmd_free_tlb(tlb,pmd) pmd_free(pmd)
static inline void
pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
@@ -123,7 +123,7 @@
__free_page(pte);
}
-#define pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
+#define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
/*
* This establishes kernel virtual mappings (e.g., as a result of a
diff -Nru a/include/asm-s390x/tlb.h b/include/asm-s390x/tlb.h
--- a/include/asm-s390x/tlb.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-s390x/tlb.h Wed Aug 28 07:37:36 2002
@@ -7,7 +7,7 @@
*/
#define tlb_start_vma(tlb, vma) do { } while (0)
#define tlb_end_vma(tlb, vma) do { } while (0)
-#define tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
/*
* .. because we flush the whole mm when it
diff -Nru a/include/asm-sh/cache.h b/include/asm-sh/cache.h
--- a/include/asm-sh/cache.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-sh/cache.h Wed Aug 28 07:37:37 2002
@@ -14,4 +14,6 @@
#define L1_CACHE_BYTES 32
#endif
+#define L1_CACHE_SHIFT_MAX 5 /* largest L1 which this arch supports */
+
#endif /* __ASM_SH_CACHE_H */
diff -Nru a/include/asm-sparc/cache.h b/include/asm-sparc/cache.h
--- a/include/asm-sparc/cache.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-sparc/cache.h Wed Aug 28 07:37:37 2002
@@ -13,6 +13,7 @@
#define L1_CACHE_SHIFT 5
#define L1_CACHE_BYTES 32
#define L1_CACHE_ALIGN(x) ((((x)+(L1_CACHE_BYTES-1))&~(L1_CACHE_BYTES-1)))
+#define L1_CACHE_SHIFT_MAX 5 /* largest L1 which this arch supports */
#define SMP_CACHE_BYTES 32
diff -Nru a/include/asm-sparc/hardirq.h b/include/asm-sparc/hardirq.h
--- a/include/asm-sparc/hardirq.h Wed Aug 28 07:37:38 2002
+++ b/include/asm-sparc/hardirq.h Wed Aug 28 07:37:38 2002
@@ -113,6 +113,12 @@
#define irq_exit() br_read_unlock(BR_GLOBALIRQ_LOCK)
#endif
+#if CONFIG_PREEMPT
+# define in_atomic() (preempt_count() != kernel_locked())
+#else
+# define in_atomic() (preempt_count() != 0)
+#endif
+
#ifndef CONFIG_SMP
#define synchronize_irq() barrier()
diff -Nru a/include/asm-sparc/highmem.h b/include/asm-sparc/highmem.h
--- a/include/asm-sparc/highmem.h Wed Aug 28 07:37:38 2002
+++ b/include/asm-sparc/highmem.h Wed Aug 28 07:37:38 2002
@@ -83,6 +83,7 @@
unsigned long idx;
unsigned long vaddr;
+ inc_preempt_count();
if (page < highmem_start_page)
return page_address(page);
@@ -142,6 +143,7 @@
flush_tlb_all();
#endif
#endif
+ dec_preempt_count();
}
#endif /* __KERNEL__ */
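The inc_preempt_count()/dec_preempt_count() pairs added to the atomic kmap
paths here (and in asm-ppc/highmem.h above) mean the new in_atomic() test can
catch code which sleeps inside an atomic kmap even on non-preemptible kernels.
The usage rule is unchanged; roughly (page/buf/len assumed from context):

	char *vaddr = kmap_atomic(page, KM_USER0);	/* raises preempt count */
	memcpy(vaddr, buf, len);			/* no sleeping in here */
	kunmap_atomic(vaddr, KM_USER0);			/* lowers it again */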
diff -Nru a/include/asm-sparc/pgalloc.h b/include/asm-sparc/pgalloc.h
--- a/include/asm-sparc/pgalloc.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-sparc/pgalloc.h Wed Aug 28 07:37:36 2002
@@ -47,7 +47,7 @@
#define free_pmd_fast(pmd) BTFIXUP_CALL(free_pmd_fast)(pmd)
#define pmd_free(pmd) free_pmd_fast(pmd)
-#define pmd_free_tlb(tlb, pmd) pmd_free(pmd)
+#define __pmd_free_tlb(tlb, pmd) pmd_free(pmd)
BTFIXUPDEF_CALL(void, pmd_populate, pmd_t *, struct page *)
#define pmd_populate(MM, PMD, PTE) BTFIXUP_CALL(pmd_populate)(PMD, PTE)
@@ -64,6 +64,6 @@
BTFIXUPDEF_CALL(void, pte_free, struct page *)
#define pte_free(pte) BTFIXUP_CALL(pte_free)(pte)
-#define pte_free_tlb(tlb, pte) pte_free(pte)
+#define __pte_free_tlb(tlb, pte) pte_free(pte)
#endif /* _SPARC_PGALLOC_H */
diff -Nru a/include/asm-sparc/tlb.h b/include/asm-sparc/tlb.h
--- a/include/asm-sparc/tlb.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-sparc/tlb.h Wed Aug 28 07:37:37 2002
@@ -11,7 +11,7 @@
flush_tlb_range(vma, vma->vm_start, vma->vm_end); \
} while (0)
-#define tlb_remove_tlb_entry(tlb, pte, address) \
+#define __tlb_remove_tlb_entry(tlb, pte, address) \
do { } while (0)
#define tlb_flush(tlb) \
diff -Nru a/include/asm-sparc64/cache.h b/include/asm-sparc64/cache.h
--- a/include/asm-sparc64/cache.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-sparc64/cache.h Wed Aug 28 07:37:37 2002
@@ -9,6 +9,7 @@
#define L1_CACHE_BYTES 32 /* Two 16-byte sub-blocks per line. */
#define L1_CACHE_ALIGN(x) (((x)+(L1_CACHE_BYTES-1))&~(L1_CACHE_BYTES-1))
+#define L1_CACHE_SHIFT_MAX 5 /* largest L1 which this arch supports */
#define SMP_CACHE_BYTES_SHIFT 6
#define SMP_CACHE_BYTES (1 << SMP_CACHE_BYTES_SHIFT) /* L2 cache line size. */
diff -Nru a/include/asm-sparc64/tlb.h b/include/asm-sparc64/tlb.h
--- a/include/asm-sparc64/tlb.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-sparc64/tlb.h Wed Aug 28 07:37:36 2002
@@ -16,12 +16,12 @@
flush_tlb_range(vma, vma->vm_start, vma->vm_end); \
} while (0)
-#define tlb_remove_tlb_entry(tlb, ptep, address) \
+#define __tlb_remove_tlb_entry(tlb, ptep, address) \
do { } while (0)
#include <asm-generic/tlb.h>
-#define pmd_free_tlb(tlb, pmd) pmd_free(pmd)
-#define pte_free_tlb(tlb, pte) pte_free(pte)
+#define __pmd_free_tlb(tlb, pmd) pmd_free(pmd)
+#define __pte_free_tlb(tlb, pte) pte_free(pte)
#endif /* _SPARC64_TLB_H */
diff -Nru a/include/asm-x86_64/cache.h b/include/asm-x86_64/cache.h
--- a/include/asm-x86_64/cache.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-x86_64/cache.h Wed Aug 28 07:37:37 2002
@@ -9,5 +9,6 @@
/* L1 cache line size */
#define L1_CACHE_SHIFT (CONFIG_X86_L1_CACHE_SHIFT)
#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
+#define L1_CACHE_SHIFT_MAX 6 /* largest L1 which this arch supports */
#endif
diff -Nru a/include/asm-x86_64/pgalloc.h b/include/asm-x86_64/pgalloc.h
--- a/include/asm-x86_64/pgalloc.h Wed Aug 28 07:37:37 2002
+++ b/include/asm-x86_64/pgalloc.h Wed Aug 28 07:37:37 2002
@@ -75,7 +75,7 @@
__free_page(pte);
}
-#define pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
-#define pmd_free_tlb(tlb,x) do { } while (0)
+#define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
+#define __pmd_free_tlb(tlb,x) do { } while (0)
#endif /* _X86_64_PGALLOC_H */
diff -Nru a/include/asm-x86_64/tlb.h b/include/asm-x86_64/tlb.h
--- a/include/asm-x86_64/tlb.h Wed Aug 28 07:37:36 2002
+++ b/include/asm-x86_64/tlb.h Wed Aug 28 07:37:36 2002
@@ -4,7 +4,7 @@
#define tlb_start_vma(tlb, vma) do { } while (0)
#define tlb_end_vma(tlb, vma) do { } while (0)
-#define tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
#define tlb_flush(tlb) flush_tlb_mm((tlb)->mm)
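Every architecture renames tlb_remove_tlb_entry() and the pte/pmd freeing hooks
to a double-underscore form here; the idea is that the unprefixed names become
generic wrappers (in asm-generic/tlb.h, outside this hunk) which add common
bookkeeping before calling into the arch hook. A sketch only, along these lines:

	#define tlb_remove_tlb_entry(tlb, ptep, address)		\
		do {							\
			tlb->need_flush = 1;				\
			__tlb_remove_tlb_entry(tlb, ptep, address);	\
		} while (0)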
diff -Nru a/include/linux/backing-dev.h b/include/linux/backing-dev.h
--- a/include/linux/backing-dev.h Wed Aug 28 07:37:37 2002
+++ b/include/linux/backing-dev.h Wed Aug 28 07:37:37 2002
@@ -19,6 +19,7 @@
struct backing_dev_info {
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
unsigned long state; /* Always use atomic bitops on this */
+ int memory_backed; /* Cannot clean pages with writepage */
};
extern struct backing_dev_info default_backing_dev_info;
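memory_backed marks mappings whose pages cannot be cleaned by writing them out
(the page itself is the only storage), so the VM knows writepage() is pointless
against them. A driver for such a device would be expected to set it up front;
a hypothetical example:

	static struct backing_dev_info ramdisk_bdi = {
		.ra_pages	= 0,	/* readahead is pointless */
		.memory_backed	= 1,	/* pages cannot be cleaned */
	};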
diff -Nru a/include/linux/bootmem.h b/include/linux/bootmem.h
--- a/include/linux/bootmem.h Wed Aug 28 07:37:38 2002
+++ b/include/linux/bootmem.h Wed Aug 28 07:37:38 2002
@@ -36,9 +36,10 @@
extern unsigned long __init bootmem_bootmap_pages (unsigned long);
extern unsigned long __init init_bootmem (unsigned long addr, unsigned long memend);
-extern void __init reserve_bootmem (unsigned long addr, unsigned long size);
extern void __init free_bootmem (unsigned long addr, unsigned long size);
extern void * __init __alloc_bootmem (unsigned long size, unsigned long align, unsigned long goal);
+#ifndef CONFIG_HAVE_ARCH_BOOTMEM_NODE
+extern void __init reserve_bootmem (unsigned long addr, unsigned long size);
#define alloc_bootmem(x) \
__alloc_bootmem((x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
#define alloc_bootmem_low(x) \
@@ -47,6 +48,7 @@
__alloc_bootmem((x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS))
#define alloc_bootmem_low_pages(x) \
__alloc_bootmem((x), PAGE_SIZE, 0)
+#endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */
extern unsigned long __init free_all_bootmem (void);
extern unsigned long __init init_bootmem_node (pg_data_t *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn);
@@ -54,11 +56,13 @@
extern void __init free_bootmem_node (pg_data_t *pgdat, unsigned long addr, unsigned long size);
extern unsigned long __init free_all_bootmem_node (pg_data_t *pgdat);
extern void * __init __alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal);
+#ifndef CONFIG_HAVE_ARCH_BOOTMEM_NODE
#define alloc_bootmem_node(pgdat, x) \
__alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
#define alloc_bootmem_pages_node(pgdat, x) \
__alloc_bootmem_node((pgdat), (x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS))
#define alloc_bootmem_low_pages_node(pgdat, x) \
__alloc_bootmem_node((pgdat), (x), PAGE_SIZE, 0)
+#endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */
#endif /* _LINUX_BOOTMEM_H */
diff -Nru a/include/linux/buffer_head.h b/include/linux/buffer_head.h
--- a/include/linux/buffer_head.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/buffer_head.h Wed Aug 28 07:37:36 2002
@@ -167,6 +167,7 @@
struct buffer_head *alloc_buffer_head(void);
void free_buffer_head(struct buffer_head * bh);
void FASTCALL(unlock_buffer(struct buffer_head *bh));
+extern int buffer_heads_over_limit;
/*
* Generic address_space_operations implementations for buffer_head-backed
diff -Nru a/include/linux/cache.h b/include/linux/cache.h
--- a/include/linux/cache.h Wed Aug 28 07:37:38 2002
+++ b/include/linux/cache.h Wed Aug 28 07:37:38 2002
@@ -44,4 +44,13 @@
#endif /* CONFIG_SMP */
#endif
+#if !defined(____cacheline_maxaligned_in_smp)
+#if defined(CONFIG_SMP)
+#define ____cacheline_maxaligned_in_smp \
+ __attribute__((__aligned__(1 << (L1_CACHE_SHIFT_MAX))))
+#else
+#define ____cacheline_maxaligned_in_smp
+#endif
+#endif
+
#endif /* __LINUX_CACHE_H */
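____cacheline_maxaligned_in_smp pads out to the largest line size the
architecture can be configured with (L1_CACHE_SHIFT_MAX), which is what generic
code needs when a structure must avoid false sharing on every build of that
arch. struct zone below uses it; the general shape:

	struct stats {				/* hypothetical */
		spinlock_t	lock;		/* hot, written by all CPUs */
	} ____cacheline_maxaligned_in_smp;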
diff -Nru a/include/linux/dcache.h b/include/linux/dcache.h
--- a/include/linux/dcache.h Wed Aug 28 07:37:38 2002
+++ b/include/linux/dcache.h Wed Aug 28 07:37:38 2002
@@ -181,8 +181,6 @@
extern void shrink_dcache_anon(struct list_head *);
extern int d_invalidate(struct dentry *);
-#define shrink_dcache() prune_dcache(0)
-struct zone_struct;
/* dcache memory management */
extern int shrink_dcache_memory(int, unsigned int);
extern void prune_dcache(int);
diff -Nru a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
--- a/include/linux/ext3_fs.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/ext3_fs.h Wed Aug 28 07:37:36 2002
@@ -341,6 +341,7 @@
#define EXT3_MOUNT_WRITEBACK_DATA 0x0C00 /* No data ordering */
#define EXT3_MOUNT_UPDATE_JOURNAL 0x1000 /* Update the journal format */
#define EXT3_MOUNT_NO_UID32 0x2000 /* Disable 32-bit UIDs */
+#define EXT3_MOUNT_INDEX 0x4000 /* Enable directory index */
/* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
@@ -580,6 +581,28 @@
#define EXT3_DIR_ROUND (EXT3_DIR_PAD - 1)
#define EXT3_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT3_DIR_ROUND) & \
~EXT3_DIR_ROUND)
+/*
+ * Hash Tree Directory indexing
+ * (c) Daniel Phillips, 2001
+ */
+
+#ifdef CONFIG_EXT3_INDEX
+ enum {ext3_dx = 1};
+ #define is_dx(dir) (EXT3_I(dir)->i_flags & EXT3_INDEX_FL)
+#define EXT3_DIR_LINK_MAX(dir) (!is_dx(dir) && (dir)->i_nlink >= EXT3_LINK_MAX)
+#define EXT3_DIR_LINK_EMPTY(dir) ((dir)->i_nlink == 2 || (dir)->i_nlink == 1)
+#else
+ enum {ext3_dx = 0};
+ #define is_dx(dir) 0
+#define EXT3_DIR_LINK_MAX(dir) ((dir)->i_nlink >= EXT3_LINK_MAX)
+#define EXT3_DIR_LINK_EMPTY(dir) ((dir)->i_nlink == 2)
+#endif
+
+/* Legal values for the dx_root hash_version field: */
+
+#define DX_HASH_HALF_MD4 1
+
+extern __u32 ext3_make_halfMD4_hash(const char *p, int len);
#ifdef __KERNEL__
/*
@@ -631,6 +654,7 @@
extern unsigned long ext3_count_free_inodes (struct super_block *);
extern void ext3_check_inodes_bitmap (struct super_block *);
extern unsigned long ext3_count_free (struct buffer_head *, unsigned);
+
/* inode.c */
extern struct buffer_head * ext3_getblk (handle_t *, struct inode *, long, int, int *);
diff -Nru a/include/linux/ext3_jbd.h b/include/linux/ext3_jbd.h
--- a/include/linux/ext3_jbd.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/ext3_jbd.h Wed Aug 28 07:37:36 2002
@@ -63,6 +63,8 @@
#define EXT3_RESERVE_TRANS_BLOCKS 12
+#define EXT3_INDEX_EXTRA_TRANS_BLOCKS 8
+
int
ext3_mark_iloc_dirty(handle_t *handle,
struct inode *inode,
diff -Nru a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/fs.h Wed Aug 28 07:37:37 2002
@@ -655,7 +655,6 @@
struct list_head s_dirty; /* dirty inodes */
struct list_head s_io; /* parked for writeback */
- struct list_head s_locked_inodes;/* inodes being synced */
struct list_head s_anon; /* anonymous dentries for (nfs) exporting */
struct list_head s_files;
diff -Nru a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/gfp.h Wed Aug 28 07:37:36 2002
@@ -40,7 +40,7 @@
* virtual kernel addresses to the allocated page(s).
*/
extern struct page * FASTCALL(_alloc_pages(unsigned int gfp_mask, unsigned int order));
-extern struct page * FASTCALL(__alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist));
+extern struct page * FASTCALL(__alloc_pages(unsigned int gfp_mask, unsigned int order, struct zonelist *zonelist));
extern struct page * alloc_pages_node(int nid, unsigned int gfp_mask, unsigned int order);
static inline struct page * alloc_pages(unsigned int gfp_mask, unsigned int order)
diff -Nru a/include/linux/kernel.h b/include/linux/kernel.h
--- a/include/linux/kernel.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/kernel.h Wed Aug 28 07:37:36 2002
@@ -7,6 +7,10 @@
#ifdef __KERNEL__
+#if __GNUC__ <= 2 && __GNUC_MINOR__ < 95
+#define __func__ __FUNCTION__
+#endif
+
#include <stdarg.h>
#include <linux/linkage.h>
#include <linux/stddef.h>
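With this fallback in place, code can use the standard C99 __func__ spelling
everywhere instead of gcc's __FUNCTION__ extension, e.g.:

	printk(KERN_DEBUG "%s: unexpected state\n", __func__);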
diff -Nru a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/mm.h Wed Aug 28 07:37:36 2002
@@ -157,7 +157,7 @@
struct address_space *mapping; /* The inode (or ...) we belong to. */
unsigned long index; /* Our offset within mapping. */
struct list_head lru; /* Pageout list, eg. active_list;
- protected by pagemap_lru_lock !! */
+ protected by zone->lru_lock !! */
union {
struct pte_chain * chain; /* Reverse pte mapping pointer.
* protected by PG_chainlock */
@@ -182,6 +182,12 @@
};
/*
+ * FIXME: take this include out, include page-flags.h in
+ * files which need it (119 of them)
+ */
+#include <linux/page-flags.h>
+
+/*
* Methods to modify the page usage count.
*
* What counts for a page usage:
@@ -198,14 +204,16 @@
#define put_page_testzero(p) atomic_dec_and_test(&(p)->count)
#define page_count(p) atomic_read(&(p)->count)
#define set_page_count(p,v) atomic_set(&(p)->count, v)
+
extern void FASTCALL(__page_cache_release(struct page *));
-#define put_page(p) \
- do { \
- if (!PageReserved(p) && put_page_testzero(p)) \
- __page_cache_release(p); \
- } while (0)
void FASTCALL(__free_pages_ok(struct page *page, unsigned int order));
+static inline void put_page(struct page *page)
+{
+ if (!PageReserved(page) && put_page_testzero(page))
+ __page_cache_release(page);
+}
+
/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
@@ -256,22 +264,16 @@
*/
/*
- * FIXME: take this include out, include page-flags.h in
- * files which need it (119 of them)
- */
-#include <linux/page-flags.h>
-
-/*
* The zone field is never updated after free_area_init_core()
* sets it, so none of the operations on it need to be atomic.
*/
#define NODE_SHIFT 4
#define ZONE_SHIFT (BITS_PER_LONG - 8)
-struct zone_struct;
-extern struct zone_struct *zone_table[];
+struct zone;
+extern struct zone *zone_table[];
-static inline zone_t *page_zone(struct page *page)
+static inline struct zone *page_zone(struct page *page)
{
return zone_table[page->flags >> ZONE_SHIFT];
}
@@ -310,8 +312,8 @@
#else /* CONFIG_HIGHMEM || WANT_PAGE_VIRTUAL */
#define page_address(page) \
- __va( (((page) - page_zone(page)->zone_mem_map) << PAGE_SHIFT) \
- + page_zone(page)->zone_start_paddr)
+ __va( ( ((page) - page_zone(page)->zone_mem_map) \
+ + page_zone(page)->zone_start_pfn) << PAGE_SHIFT)
#endif /* CONFIG_HIGHMEM || WANT_PAGE_VIRTUAL */
@@ -392,7 +394,7 @@
extern void free_area_init(unsigned long * zones_size);
extern void free_area_init_node(int nid, pg_data_t *pgdat, struct page *pmap,
- unsigned long * zones_size, unsigned long zone_start_paddr,
+ unsigned long * zones_size, unsigned long zone_start_pfn,
unsigned long *zholes_size);
extern void mem_init(void);
extern void show_mem(void);
@@ -449,7 +451,6 @@
return 0;
}
-struct zone_t;
/* filemap.c */
extern unsigned long page_unuse(struct page *);
extern void truncate_inode_pages(struct address_space *, loff_t);
diff -Nru a/include/linux/mm_inline.h b/include/linux/mm_inline.h
--- /dev/null Wed Dec 31 16:00:00 1969
+++ b/include/linux/mm_inline.h Wed Aug 28 07:37:38 2002
@@ -0,0 +1,40 @@
+
+static inline void
+add_page_to_active_list(struct zone *zone, struct page *page)
+{
+ list_add(&page->lru, &zone->active_list);
+ zone->nr_active++;
+}
+
+static inline void
+add_page_to_inactive_list(struct zone *zone, struct page *page)
+{
+ list_add(&page->lru, &zone->inactive_list);
+ zone->nr_inactive++;
+}
+
+static inline void
+del_page_from_active_list(struct zone *zone, struct page *page)
+{
+ list_del(&page->lru);
+ zone->nr_active--;
+}
+
+static inline void
+del_page_from_inactive_list(struct zone *zone, struct page *page)
+{
+ list_del(&page->lru);
+ zone->nr_inactive--;
+}
+
+static inline void
+del_page_from_lru(struct zone *zone, struct page *page)
+{
+ list_del(&page->lru);
+ if (PageActive(page)) {
+ ClearPageActive(page);
+ zone->nr_active--;
+ } else {
+ zone->nr_inactive--;
+ }
+}
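These helpers replace the old global-list macros in swap.h (deleted below) and
keep the new per-zone nr_active/nr_inactive counters coherent. All of them
require zone->lru_lock; a sketch of a typical activation path, not taken from
this patch:

	spin_lock_irq(&zone->lru_lock);
	if (PageLRU(page) && !PageActive(page)) {
		del_page_from_inactive_list(zone, page);
		SetPageActive(page);
		add_page_to_active_list(zone, page);
	}
	spin_unlock_irq(&zone->lru_lock);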
diff -Nru a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/mmzone.h Wed Aug 28 07:37:36 2002
@@ -8,6 +8,8 @@
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/wait.h>
+#include <linux/cache.h>
+#include <asm/atomic.h>
/*
* Free memory management - zoned buddy allocator.
@@ -27,6 +29,21 @@
struct pglist_data;
/*
+ * zone->lock and zone->lru_lock are two of the hottest locks in the kernel.
+ * So add a wild amount of padding here to ensure that they fall into separate
+ * cachelines. There are very few zone structures in the machine, so space
+ * consumption is not a concern here.
+ */
+#if defined(CONFIG_SMP)
+struct zone_padding {
+ int x;
+} ____cacheline_maxaligned_in_smp;
+#define ZONE_PADDING(name) struct zone_padding name;
+#else
+#define ZONE_PADDING(name)
+#endif
+
+/*
* On machines where it is needed (eg PCs) we divide physical memory
* into multiple physical zones. On a PC we have 3 zones:
*
@@ -34,7 +51,8 @@
* ZONE_NORMAL 16-896 MB direct mapped by the kernel
* ZONE_HIGHMEM > 896 MB only page cache and user processes
*/
-typedef struct zone_struct {
+
+struct zone {
/*
* Commonly accessed fields:
*/
@@ -43,6 +61,17 @@
unsigned long pages_min, pages_low, pages_high;
int need_balance;
+ ZONE_PADDING(_pad1_)
+
+ spinlock_t lru_lock;
+ struct list_head active_list;
+ struct list_head inactive_list;
+ atomic_t refill_counter;
+ unsigned long nr_active;
+ unsigned long nr_inactive;
+
+ ZONE_PADDING(_pad2_)
+
/*
* free areas of different sizes
*/
@@ -81,7 +110,8 @@
*/
struct pglist_data *zone_pgdat;
struct page *zone_mem_map;
- unsigned long zone_start_paddr;
+ /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
+ unsigned long zone_start_pfn;
unsigned long zone_start_mapnr;
/*
@@ -89,7 +119,7 @@
*/
char *name;
unsigned long size;
-} zone_t;
+} ____cacheline_maxaligned_in_smp;
#define ZONE_DMA 0
#define ZONE_NORMAL 1
@@ -107,16 +137,16 @@
* so despite the zonelist table being relatively big, the cache
* footprint of this construct is very small.
*/
-typedef struct zonelist_struct {
- zone_t * zones [MAX_NR_ZONES+1]; // NULL delimited
-} zonelist_t;
+struct zonelist {
+ struct zone *zones[MAX_NR_ZONES+1]; // NULL delimited
+};
#define GFP_ZONEMASK 0x0f
/*
* The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
* (mostly NUMA machines?) to denote a higher-level memory zone than the
- * zone_struct denotes.
+ * zone denotes.
*
* On NUMA machines, each NUMA node would have a pg_data_t to describe
* its memory layout.
@@ -126,13 +156,13 @@
*/
struct bootmem_data;
typedef struct pglist_data {
- zone_t node_zones[MAX_NR_ZONES];
- zonelist_t node_zonelists[GFP_ZONEMASK+1];
+ struct zone node_zones[MAX_NR_ZONES];
+ struct zonelist node_zonelists[GFP_ZONEMASK+1];
int nr_zones;
struct page *node_mem_map;
unsigned long *valid_addr_bitmap;
struct bootmem_data *bdata;
- unsigned long node_start_paddr;
+ unsigned long node_start_pfn;
unsigned long node_start_mapnr;
unsigned long node_size;
int node_id;
@@ -142,7 +172,8 @@
extern int numnodes;
extern pg_data_t *pgdat_list;
-static inline int memclass(zone_t *pgzone, zone_t *classzone)
+static inline int
+memclass(struct zone *pgzone, struct zone *classzone)
{
if (pgzone->zone_pgdat != classzone->zone_pgdat)
return 0;
@@ -156,10 +187,10 @@
* prototypes for the discontig memory code.
*/
struct page;
-extern void show_free_areas_core(pg_data_t *pgdat);
-extern void free_area_init_core(int nid, pg_data_t *pgdat, struct page **gmap,
+void free_area_init_core(int nid, pg_data_t *pgdat, struct page **gmap,
unsigned long *zones_size, unsigned long paddr, unsigned long *zholes_size,
struct page *pmap);
+void get_zone_counts(unsigned long *active, unsigned long *inactive);
extern pg_data_t contig_page_data;
@@ -181,7 +212,7 @@
* next_zone - helper magic for for_each_zone()
* Thanks to William Lee Irwin III for this piece of ingenuity.
*/
-static inline zone_t * next_zone(zone_t * zone)
+static inline struct zone *next_zone(struct zone *zone)
{
pg_data_t *pgdat = zone->zone_pgdat;
@@ -198,7 +229,7 @@
/**
* for_each_zone - helper macro to iterate over all memory zones
- * @zone - pointer to zone_t variable
+ * @zone - pointer to struct zone variable
*
* The user only needs to declare the zone variable, for_each_zone
* fills it in. This basically means for_each_zone() is an
@@ -206,7 +237,7 @@
*
* for (pgdat = pgdat_list; pgdat; pgdat = pgdat->node_next)
* for (i = 0; i < MAX_NR_ZONES; ++i) {
- * zone_t * z = pgdat->node_zones + i;
+ * struct zone * z = pgdat->node_zones + i;
* ...
* }
* }
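With the zone_t typedef gone, users spell out the struct tag, and the per-zone
LRU counters can be read directly; e.g. a sketch of a stats walk:

	struct zone *zone;

	for_each_zone(zone)
		printk("%s: %lu active, %lu inactive\n",
			zone->name, zone->nr_active, zone->nr_inactive);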
diff -Nru a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/page-flags.h Wed Aug 28 07:37:36 2002
@@ -28,7 +28,7 @@
*
* Note that the referenced bit, the page->lru list_head and the active,
* inactive_dirty and inactive_clean lists are protected by the
- * pagemap_lru_lock, and *NOT* by the usual PG_locked bit!
+ * zone->lru_lock, and *NOT* by the usual PG_locked bit!
*
* PG_error is set to indicate that an I/O error occurred on this page.
*
@@ -52,7 +52,7 @@
#define PG_referenced 2
#define PG_uptodate 3
-#define PG_dirty_dontuse 4
+#define PG_dirty 4
#define PG_lru 5
#define PG_active 6
#define PG_slab 7 /* slab debug (Suparna wants this) */
@@ -76,8 +76,6 @@
unsigned long nr_dirty;
unsigned long nr_writeback;
unsigned long nr_pagecache;
- unsigned long nr_active; /* on active_list LRU */
- unsigned long nr_inactive; /* on inactive_list LRU */
unsigned long nr_page_table_pages;
unsigned long nr_reverse_maps;
} ____cacheline_aligned_in_smp page_states[NR_CPUS];
@@ -122,37 +120,11 @@
#define SetPageUptodate(page) set_bit(PG_uptodate, &(page)->flags)
#define ClearPageUptodate(page) clear_bit(PG_uptodate, &(page)->flags)
-#define PageDirty(page) test_bit(PG_dirty_dontuse, &(page)->flags)
-#define SetPageDirty(page) \
- do { \
- if (!test_and_set_bit(PG_dirty_dontuse, \
- &(page)->flags)) \
- inc_page_state(nr_dirty); \
- } while (0)
-#define TestSetPageDirty(page) \
- ({ \
- int ret; \
- ret = test_and_set_bit(PG_dirty_dontuse, \
- &(page)->flags); \
- if (!ret) \
- inc_page_state(nr_dirty); \
- ret; \
- })
-#define ClearPageDirty(page) \
- do { \
- if (test_and_clear_bit(PG_dirty_dontuse, \
- &(page)->flags)) \
- dec_page_state(nr_dirty); \
- } while (0)
-#define TestClearPageDirty(page) \
- ({ \
- int ret; \
- ret = test_and_clear_bit(PG_dirty_dontuse, \
- &(page)->flags); \
- if (ret) \
- dec_page_state(nr_dirty); \
- ret; \
- })
+#define PageDirty(page) test_bit(PG_dirty, &(page)->flags)
+#define SetPageDirty(page) set_bit(PG_dirty, &(page)->flags)
+#define TestSetPageDirty(page) test_and_set_bit(PG_dirty, &(page)->flags)
+#define ClearPageDirty(page) clear_bit(PG_dirty, &(page)->flags)
+#define TestClearPageDirty(page) test_and_clear_bit(PG_dirty, &(page)->flags)
#define SetPageLRU(page) set_bit(PG_lru, &(page)->flags)
#define PageLRU(page) test_bit(PG_lru, &(page)->flags)
@@ -163,6 +135,7 @@
#define SetPageActive(page) set_bit(PG_active, &(page)->flags)
#define ClearPageActive(page) clear_bit(PG_active, &(page)->flags)
#define TestClearPageActive(page) test_and_clear_bit(PG_active, &(page)->flags)
+#define TestSetPageActive(page) test_and_set_bit(PG_active, &(page)->flags)
#define PageSlab(page) test_bit(PG_slab, &(page)->flags)
#define SetPageSlab(page) set_bit(PG_slab, &(page)->flags)
@@ -265,5 +238,12 @@
*/
extern struct address_space swapper_space;
#define PageSwapCache(page) ((page)->mapping == &swapper_space)
+
+int test_clear_page_dirty(struct page *page);
+
+static inline void clear_page_dirty(struct page *page)
+{
+ test_clear_page_dirty(page);
+}
#endif /* PAGE_FLAGS_H */
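PG_dirty is a plain bit flag again: the nr_dirty accounting which the old
Set/ClearPageDirty macros did inline moves into set_page_dirty() and the new
test_clear_page_dirty(), so it lives in one place. Callers that relied on the
accounting side effect switch to clear_page_dirty(), as mm/filemap.c does
below; a writeout path would use it roughly like:

	if (test_clear_page_dirty(page))	/* accounting handled inside */
		start_writeout(page);		/* hypothetical writeout step */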
diff -Nru a/include/linux/pagemap.h b/include/linux/pagemap.h
--- a/include/linux/pagemap.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/pagemap.h Wed Aug 28 07:37:36 2002
@@ -22,13 +22,9 @@
#define PAGE_CACHE_MASK PAGE_MASK
#define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
-#define page_cache_get(x) get_page(x)
-
-static inline void page_cache_release(struct page *page)
-{
- if (!PageReserved(page) && put_page_testzero(page))
- __page_cache_release(page);
-}
+#define page_cache_get(page) get_page(page)
+#define page_cache_release(page) put_page(page)
+void release_pages(struct page **pages, int nr);
static inline struct page *page_cache_alloc(struct address_space *x)
{
diff -Nru a/include/linux/pagevec.h b/include/linux/pagevec.h
--- a/include/linux/pagevec.h Wed Aug 28 07:37:38 2002
+++ b/include/linux/pagevec.h Wed Aug 28 07:37:38 2002
@@ -21,6 +21,7 @@
void __pagevec_lru_del(struct pagevec *pvec);
void lru_add_drain(void);
void pagevec_deactivate_inactive(struct pagevec *pvec);
+void pagevec_strip(struct pagevec *pvec);
static inline void pagevec_init(struct pagevec *pvec)
{
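pagevec_strip() joins the other batched operations; the point of pagevecs
throughout this patch is to take hot locks such as zone->lru_lock once per
batch of pages instead of once per page. The usual calling pattern, assuming
the existing pagevec_add()/pagevec_release() helpers:

	struct pagevec pvec;
	struct page *page;

	pagevec_init(&pvec);
	while ((page = next_page_somehow()) != NULL) {	/* hypothetical source */
		if (!pagevec_add(&pvec, page))		/* vector full: flush */
			pagevec_release(&pvec);
	}
	pagevec_release(&pvec);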
diff -Nru a/include/linux/preempt.h b/include/linux/preempt.h
--- a/include/linux/preempt.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/preempt.h Wed Aug 28 07:37:36 2002
@@ -48,22 +48,12 @@
preempt_check_resched(); \
} while (0)
-#define inc_preempt_count_non_preempt() do { } while (0)
-#define dec_preempt_count_non_preempt() do { } while (0)
-
#else
#define preempt_disable() do { } while (0)
#define preempt_enable_no_resched() do { } while (0)
#define preempt_enable() do { } while (0)
#define preempt_check_resched() do { } while (0)
-
-/*
- * Sometimes we want to increment the preempt count, but we know that it's
- * already incremented if the kernel is compiled for preemptibility.
- */
-#define inc_preempt_count_non_preempt() inc_preempt_count()
-#define dec_preempt_count_non_preempt() dec_preempt_count()
#endif
diff -Nru a/include/linux/rwsem-spinlock.h b/include/linux/rwsem-spinlock.h
--- a/include/linux/rwsem-spinlock.h Wed Aug 28 07:37:37 2002
+++ b/include/linux/rwsem-spinlock.h Wed Aug 28 07:37:37 2002
@@ -46,6 +46,14 @@
#define __RWSEM_DEBUG_INIT /* */
#endif
+#ifdef CONFIG_DEBUG_SPINLOCK
+#define assert_rwsem_held_for_write(rwsem) BUG_ON(__down_read_trylock(rwsem))
+#define assert_rwsem_held_for_read(rwsem) BUG_ON(__down_write_trylock(rwsem))
+#else
+#define assert_rwsem_held_for_write(rwsem) do { } while(0)
+#define assert_rwsem_held_for_read(rwsem) do { } while(0)
+#endif
+
#define __RWSEM_INITIALIZER(name) \
{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT }
diff -Nru a/include/linux/rwsem.h b/include/linux/rwsem.h
--- a/include/linux/rwsem.h Wed Aug 28 07:37:37 2002
+++ b/include/linux/rwsem.h Wed Aug 28 07:37:37 2002
@@ -7,6 +7,7 @@
#ifndef _LINUX_RWSEM_H
#define _LINUX_RWSEM_H
+#include <linux/config.h>
#include <linux/linkage.h>
#define RWSEM_DEBUG 0
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/sched.h Wed Aug 28 07:37:36 2002
@@ -405,6 +405,7 @@
#define PF_FREEZE 0x00010000 /* this task should be frozen for suspend */
#define PF_IOTHREAD 0x00020000 /* this thread is needed for doing I/O to swap */
#define PF_FROZEN 0x00040000 /* frozen for system suspend */
+#define PF_SYNC 0x00080000 /* performing fsync(), etc */
/*
* Ptrace flags
diff -Nru a/include/linux/spinlock.h b/include/linux/spinlock.h
--- a/include/linux/spinlock.h Wed Aug 28 07:37:37 2002
+++ b/include/linux/spinlock.h Wed Aug 28 07:37:37 2002
@@ -78,7 +78,19 @@
#define _raw_write_lock(lock) (void)(lock)
#define _raw_write_unlock(lock) do { } while(0)
-#endif /* !SMP */
+#endif /* !CONFIG_SMP */
+
+/*
+ * Simple lock assertions for debugging and documenting where locks need
+ * to be held.
+ */
+#if defined(CONFIG_DEBUG_SPINLOCK) && defined(CONFIG_SMP)
+#define assert_locked(lock) BUG_ON(!spin_is_locked(lock))
+#define assert_rw_locked(lock) BUG_ON(!rwlock_is_locked(lock))
+#else
+#define assert_locked(lock) do { } while(0)
+#define assert_rw_locked(lock) do { } while(0)
+#endif /* CONFIG_DEBUG_SPINLOCK && CONFIG_SMP */
/*
* Define the various spin_lock and rw_lock methods. Note we define these
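assert_locked() and assert_rw_locked() cost nothing outside
CONFIG_DEBUG_SPINLOCK+SMP builds and make locking preconditions
self-documenting; kernel/acct.c below is the first user. The intended pattern:

	static void update_totals(void)			/* hypothetical */
	{
		assert_locked(&acct_globals.lock);	/* caller must hold it */
		/* ... touch state guarded by acct_globals.lock ... */
	}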
diff -Nru a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/swap.h Wed Aug 28 07:37:36 2002
@@ -139,7 +139,7 @@
struct vm_area_struct;
struct sysinfo;
struct address_space;
-struct zone_t;
+struct zone;
/* linux/mm/rmap.c */
extern int FASTCALL(page_referenced(struct page *));
@@ -163,7 +163,7 @@
/* linux/mm/vmscan.c */
extern wait_queue_head_t kswapd_wait;
-extern int FASTCALL(try_to_free_pages(zone_t *, unsigned int, unsigned int));
+extern int try_to_free_pages(struct zone *, unsigned int, unsigned int);
/* linux/mm/page_io.c */
int swap_readpage(struct file *file, struct page *page);
@@ -182,6 +182,7 @@
extern int move_from_swap_cache(struct page *page, unsigned long index,
struct address_space *mapping);
extern void free_page_and_swap_cache(struct page *page);
+extern void free_pages_and_swap_cache(struct page **pages, int nr);
extern struct page * lookup_swap_cache(swp_entry_t);
extern struct page * read_swap_cache_async(swp_entry_t);
@@ -209,54 +210,7 @@
asmlinkage long sys_swapoff(const char *);
asmlinkage long sys_swapon(const char *, int);
-extern spinlock_t _pagemap_lru_lock;
-
extern void FASTCALL(mark_page_accessed(struct page *));
-
-/*
- * List add/del helper macros. These must be called
- * with the pagemap_lru_lock held!
- */
-#define DEBUG_LRU_PAGE(page) \
-do { \
- if (!PageLRU(page)) \
- BUG(); \
- if (PageActive(page)) \
- BUG(); \
-} while (0)
-
-#define __add_page_to_active_list(page) \
-do { \
- list_add(&(page)->lru, &active_list); \
- inc_page_state(nr_active); \
-} while (0)
-
-#define add_page_to_active_list(page) \
-do { \
- DEBUG_LRU_PAGE(page); \
- SetPageActive(page); \
- __add_page_to_active_list(page); \
-} while (0)
-
-#define add_page_to_inactive_list(page) \
-do { \
- DEBUG_LRU_PAGE(page); \
- list_add(&(page)->lru, &inactive_list); \
- inc_page_state(nr_inactive); \
-} while (0)
-
-#define del_page_from_active_list(page) \
-do { \
- list_del(&(page)->lru); \
- ClearPageActive(page); \
- dec_page_state(nr_active); \
-} while (0)
-
-#define del_page_from_inactive_list(page) \
-do { \
- list_del(&(page)->lru); \
- dec_page_state(nr_inactive); \
-} while (0)
extern spinlock_t swaplock;
diff -Nru a/include/linux/writeback.h b/include/linux/writeback.h
--- a/include/linux/writeback.h Wed Aug 28 07:37:36 2002
+++ b/include/linux/writeback.h Wed Aug 28 07:37:36 2002
@@ -72,4 +72,13 @@
read-only. */
+/*
+ * Tell the writeback paths that they are being called for a "data integrity"
+ * operation such as fsync().
+ */
+static inline int called_for_sync(void)
+{
+ return current->flags & PF_SYNC;
+}
+
#endif /* WRITEBACK_H */
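called_for_sync() is how writeback paths distinguish data-integrity writeout
(fsync() and friends, run with PF_SYNC set as filemap_fdatawrite() does below)
from background memory cleansing. A consumer would use it roughly like this
sketch:

	if (PageWriteback(page)) {
		if (called_for_sync())
			wait_on_page_writeback(page);	/* must not skip it */
		else
			return 0;			/* cleansing: skip it */
	}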
diff -Nru a/kernel/acct.c b/kernel/acct.c
--- a/kernel/acct.c Wed Aug 28 07:37:37 2002
+++ b/kernel/acct.c Wed Aug 28 07:37:37 2002
@@ -160,6 +160,8 @@
{
struct file *old_acct = NULL;
+ assert_locked(&acct_globals.lock);
+
if (acct_globals.file) {
old_acct = acct_globals.file;
del_timer(&acct_globals.timer);
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.c Wed Aug 28 07:37:36 2002
+++ b/kernel/ksyms.c Wed Aug 28 07:37:36 2002
@@ -133,6 +133,7 @@
EXPORT_SYMBOL(get_user_pages);
/* filesystem internal functions */
+EXPORT_SYMBOL_GPL(def_blk_aops);
EXPORT_SYMBOL(def_blk_fops);
EXPORT_SYMBOL(update_atime);
EXPORT_SYMBOL(get_fs_type);
diff -Nru a/kernel/printk.c b/kernel/printk.c
--- a/kernel/printk.c Wed Aug 28 07:37:37 2002
+++ b/kernel/printk.c Wed Aug 28 07:37:37 2002
@@ -353,6 +353,8 @@
unsigned long cur_index, start_print;
static int msg_level = -1;
+ assert_sem_held(&console_sem);
+
if (((long)(start - end)) > 0)
BUG();
diff -Nru a/kernel/suspend.c b/kernel/suspend.c
--- a/kernel/suspend.c Wed Aug 28 07:37:36 2002
+++ b/kernel/suspend.c Wed Aug 28 07:37:36 2002
@@ -468,31 +468,33 @@
{
int chunk_size;
int nr_copy_pages = 0;
- int loop;
+ int pfn;
+ struct page *page;
if (max_mapnr != num_physpages)
panic("mapnr is not expected");
- for (loop = 0; loop < max_mapnr; loop++) {
- if (PageHighMem(mem_map+loop))
+ for (pfn = 0; pfn < max_mapnr; pfn++) {
+ page = pfn_to_page(pfn);
+ if (PageHighMem(page))
panic("Swsusp not supported on highmem boxes. Send 1GB of RAM to <pavel@ucw.cz> and try again ;-).");
- if (!PageReserved(mem_map+loop)) {
- if (PageNosave(mem_map+loop))
+ if (!PageReserved(page)) {
+ if (PageNosave(page))
continue;
- if ((chunk_size=is_head_of_free_region(mem_map+loop))!=0) {
- loop += chunk_size - 1;
+ if ((chunk_size=is_head_of_free_region(page))!=0) {
+ pfn += chunk_size - 1;
continue;
}
- } else if (PageReserved(mem_map+loop)) {
- BUG_ON (PageNosave(mem_map+loop));
+ } else if (PageReserved(page)) {
+ BUG_ON (PageNosave(page));
/*
* Just copy whole code segment. Hopefully it is not that big.
*/
- if (ADDRESS(loop) >= (unsigned long)
- &__nosave_begin && ADDRESS(loop) <
+ if (ADDRESS(pfn) >= (unsigned long)
+ &__nosave_begin && ADDRESS(pfn) <
(unsigned long)&__nosave_end) {
- PRINTK("[nosave %x]", ADDRESS(loop));
+ PRINTK("[nosave %x]", ADDRESS(pfn));
continue;
}
/* Hmm, perhaps copying all reserved pages is not too healthy as they may contain
@@ -501,7 +503,7 @@
nr_copy_pages++;
if (pagedir_p) {
- pagedir_p->orig_address = ADDRESS(loop);
+ pagedir_p->orig_address = ADDRESS(pfn);
copy_page(pagedir_p->address, pagedir_p->orig_address);
pagedir_p++;
}
diff -Nru a/mm/bootmem.c b/mm/bootmem.c
--- a/mm/bootmem.c Wed Aug 28 07:37:37 2002
+++ b/mm/bootmem.c Wed Aug 28 07:37:37 2002
@@ -318,10 +318,12 @@
return(init_bootmem_core(&contig_page_data, start, 0, pages));
}
+#ifndef CONFIG_HAVE_ARCH_BOOTMEM_NODE
void __init reserve_bootmem (unsigned long addr, unsigned long size)
{
reserve_bootmem_core(contig_page_data.bdata, addr, size);
}
+#endif /* !CONFIG_HAVE_ARCH_BOOTMEM_NODE */
void __init free_bootmem (unsigned long addr, unsigned long size)
{
diff -Nru a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c Wed Aug 28 07:37:37 2002
+++ b/mm/filemap.c Wed Aug 28 07:37:37 2002
@@ -61,7 +61,6 @@
* ->inode_lock (__mark_inode_dirty)
* ->sb_lock (fs/fs-writeback.c)
*/
-spinlock_t _pagemap_lru_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
/*
* Remove a page from the page cache and free it. Caller has to make
@@ -182,7 +181,7 @@
if (PagePrivate(page))
do_invalidatepage(page, 0);
- ClearPageDirty(page);
+ clear_page_dirty(page);
ClearPageUptodate(page);
remove_from_page_cache(page);
page_cache_release(page);
@@ -281,7 +280,7 @@
for (curr = head->next; curr != head; curr = curr->next) {
page = list_entry(curr, struct page, list);
if (page->index > start)
- ClearPageDirty(page);
+ clear_page_dirty(page);
}
}
@@ -349,7 +348,7 @@
} else
unlocked = 0;
- ClearPageDirty(page);
+ clear_page_dirty(page);
ClearPageUptodate(page);
}
@@ -465,31 +464,38 @@
SetPageReferenced(page);
}
- /* Set the page dirty again, unlock */
- set_page_dirty(page);
unlock_page(page);
- return 0;
+ return -EAGAIN; /* It will be set dirty again */
}
EXPORT_SYMBOL(fail_writepage);
/**
- * filemap_fdatawrite - walk the list of dirty pages of the given address space
- * and writepage() all of them.
- *
- * @mapping: address space structure to write
+ * filemap_fdatawrite - start writeback against all of a mapping's dirty pages
+ * @mapping: address space structure to write
*
+ * This is a "data integrity" operation, as opposed to a regular memory
+ * cleansing writeback. The difference between these two operations is that
+ * if a dirty page/buffer is encountered, it must be waited upon, and not just
+ * skipped over.
+ *
+ * The PF_SYNC flag is set across this operation and the various functions
+ * which care about this distinction must use called_for_sync() to find out
+ * which behaviour they should implement.
*/
int filemap_fdatawrite(struct address_space *mapping)
{
- return do_writepages(mapping, NULL);
+ int ret;
+
+ current->flags |= PF_SYNC;
+ ret = do_writepages(mapping, NULL);
+ current->flags &= ~PF_SYNC;
+ return ret;
}
/**
- * filemap_fdatawait - walk the list of locked pages of the given address space
- * and wait for all of them.
- *
- * @mapping: address space structure to wait for
- *
+ * filemap_fdatawait - walk the list of locked pages of the given address
+ * space and wait for all of them.
+ * @mapping: address space structure to wait for
*/
int filemap_fdatawait(struct address_space * mapping)
{
@@ -498,8 +504,9 @@
write_lock(&mapping->page_lock);
while (!list_empty(&mapping->locked_pages)) {
- struct page *page = list_entry(mapping->locked_pages.next, struct page, list);
+ struct page *page;
+ page = list_entry(mapping->locked_pages.next,struct page,list);
list_del(&page->list);
if (PageDirty(page))
list_add(&page->list, &mapping->dirty_pages);
@@ -550,8 +557,8 @@
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
SetPageLocked(page);
- ClearPageDirty(page);
___add_to_page_cache(page, mapping, offset);
+ ClearPageDirty(page);
} else {
page_cache_release(page);
}
@@ -611,7 +618,7 @@
*/
static inline wait_queue_head_t *page_waitqueue(struct page *page)
{
- const zone_t *zone = page_zone(page);
+ const struct zone *zone = page_zone(page);
return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}
@@ -1029,7 +1036,53 @@
UPDATE_ATIME(inode);
}
-int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
+/*
+ * Fault a userspace page into pagetables. Return non-zero on a fault.
+ *
+ * FIXME: this assumes that two userspace pages are always sufficient. That's
+ * not true if PAGE_CACHE_SIZE > PAGE_SIZE.
+ */
+static inline int fault_in_pages_writeable(char *uaddr, int size)
+{
+ int ret;
+
+ /*
+ * Writing zeroes into userspace here is OK, because we know that if
+ * the zero gets there, we'll be overwriting it.
+ */
+ ret = __put_user(0, uaddr);
+ if (ret == 0) {
+ char *end = uaddr + size - 1;
+
+ /*
+ * If the page was already mapped, this will get a cache miss
+ * for sure, so try to avoid doing it.
+ */
+ if (((unsigned long)uaddr & PAGE_MASK) !=
+ ((unsigned long)end & PAGE_MASK))
+ ret = __put_user(0, end);
+ }
+ return ret;
+}
+
+static inline int fault_in_pages_readable(const char *uaddr, int size)
+{
+ volatile char c;
+ int ret;
+
+ ret = __get_user(c, (char *)uaddr);
+ if (ret == 0) {
+ const char *end = uaddr + size - 1;
+
+ if (((unsigned long)uaddr & PAGE_MASK) !=
+ ((unsigned long)end & PAGE_MASK))
+ ret = __get_user(c, (char *)end);
+ }
+ return ret;
+}
+
+int file_read_actor(read_descriptor_t *desc, struct page *page,
+ unsigned long offset, unsigned long size)
{
char *kaddr;
unsigned long left, count = desc->count;
@@ -1037,14 +1090,29 @@
if (size > count)
size = count;
+ /*
+ * Faults on the destination of a read are common, so do it before
+ * taking the kmap.
+ */
+ if (!fault_in_pages_writeable(desc->buf, size)) {
+ kaddr = kmap_atomic(page, KM_USER0);
+ left = __copy_to_user(desc->buf, kaddr + offset, size);
+ kunmap_atomic(kaddr, KM_USER0);
+ if (left == 0)
+ goto success;
+ printk("%s: Unexpected page fault\n", __FUNCTION__);
+ }
+
+ /* Do it the slow way */
kaddr = kmap(page);
left = __copy_to_user(desc->buf, kaddr + offset, size);
kunmap(page);
-
+
if (left) {
size -= left;
desc->error = -EFAULT;
}
+success:
desc->count = count - size;
desc->written += size;
desc->buf += size;
@@ -1151,14 +1219,15 @@
{
struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
unsigned long max;
- struct page_state ps;
+ unsigned long active;
+ unsigned long inactive;
if (!mapping || !mapping->a_ops || !mapping->a_ops->readpage)
return -EINVAL;
/* Limit it to a sane percentage of the inactive list.. */
- get_page_state(&ps);
- max = ps.nr_inactive / 2;
+ get_zone_counts(&active, &inactive);
+ max = inactive / 2;
if (nr > max)
nr = max;
@@ -1830,6 +1899,29 @@
}
}
+static inline int
+filemap_copy_from_user(struct page *page, unsigned long offset,
+ const char *buf, unsigned bytes, int fault_expected)
+{
+ char *kaddr;
+ int left;
+
+ kaddr = kmap_atomic(page, KM_USER0);
+ left = __copy_from_user(kaddr + offset, buf, bytes);
+ kunmap_atomic(kaddr, KM_USER0);
+
+ if (left != 0) {
+ if (!fault_expected)
+ printk("%s: Unexpected page fault\n", __FUNCTION__);
+
+ /* Do it the slow way */
+ kaddr = kmap(page);
+ left = __copy_from_user(kaddr + offset, buf, bytes);
+ kunmap(page);
+ }
+ return left;
+}
+
/*
* Write to a file through the page cache.
*
@@ -1982,7 +2074,7 @@
unsigned long index;
unsigned long offset;
long page_fault;
- char *kaddr;
+ int fault_expected; /* This is just debug */
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
@@ -1996,10 +2088,7 @@
* same page as we're writing to, without it being marked
* up-to-date.
*/
- { volatile unsigned char dummy;
- __get_user(dummy, buf);
- __get_user(dummy, buf+bytes-1);
- }
+ fault_expected = fault_in_pages_readable(buf, bytes);
page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
if (!page) {
@@ -2007,22 +2096,20 @@
break;
}
- kaddr = kmap(page);
status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status)) {
/*
* prepare_write() may have instantiated a few blocks
* outside i_size. Trim these off again.
*/
- kunmap(page);
unlock_page(page);
page_cache_release(page);
if (pos + bytes > inode->i_size)
vmtruncate(inode, inode->i_size);
break;
}
- page_fault = __copy_from_user(kaddr + offset, buf, bytes);
- flush_dcache_page(page);
+ page_fault = filemap_copy_from_user(page, offset,
+ buf, bytes, fault_expected);
status = a_ops->commit_write(file, page, offset, offset+bytes);
if (unlikely(page_fault)) {
status = -EFAULT;
@@ -2037,7 +2124,6 @@
buf += status;
}
}
- kunmap(page);
if (!PageReferenced(page))
SetPageReferenced(page);
unlock_page(page);
diff -Nru a/mm/highmem.c b/mm/highmem.c
--- a/mm/highmem.c Wed Aug 28 07:37:36 2002
+++ b/mm/highmem.c Wed Aug 28 07:37:36 2002
@@ -383,7 +383,7 @@
/*
* is destination page below bounce pfn?
*/
- if ((page - page_zone(page)->zone_mem_map) + (page_zone(page)->zone_start_paddr >> PAGE_SHIFT) < pfn)
+ if ((page - page_zone(page)->zone_mem_map) + (page_zone(page)->zone_start_pfn) < pfn)
continue;
/*
diff -Nru a/mm/msync.c b/mm/msync.c
--- a/mm/msync.c Wed Aug 28 07:37:37 2002
+++ b/mm/msync.c Wed Aug 28 07:37:37 2002
@@ -145,10 +145,7 @@
int err;
down(&inode->i_sem);
- ret = filemap_fdatawait(inode->i_mapping);
- err = filemap_fdatawrite(inode->i_mapping);
- if (!ret)
- ret = err;
+ ret = filemap_fdatawrite(inode->i_mapping);
if (flags & MS_SYNC) {
if (file->f_op && file->f_op->fsync) {
err = file->f_op->fsync(file, file->f_dentry, 1);
diff -Nru a/mm/numa.c b/mm/numa.c
--- a/mm/numa.c Wed Aug 28 07:37:37 2002
+++ b/mm/numa.c Wed Aug 28 07:37:37 2002
@@ -22,11 +22,11 @@
* Should be invoked with parameters (0, 0, unsigned long *[], start_paddr).
*/
void __init free_area_init_node(int nid, pg_data_t *pgdat, struct page *pmap,
- unsigned long *zones_size, unsigned long zone_start_paddr,
+ unsigned long *zones_size, unsigned long zone_start_pfn,
unsigned long *zholes_size)
{
free_area_init_core(0, &contig_page_data, &mem_map, zones_size,
- zone_start_paddr, zholes_size, pmap);
+ zone_start_pfn, zholes_size, pmap);
}
#endif /* !CONFIG_DISCONTIGMEM */
@@ -48,7 +48,7 @@
* Nodes can be initialized in parallel, in no particular order.
*/
void __init free_area_init_node(int nid, pg_data_t *pgdat, struct page *pmap,
- unsigned long *zones_size, unsigned long zone_start_paddr,
+ unsigned long *zones_size, unsigned long zone_start_pfn,
unsigned long *zholes_size)
{
int i, size = 0;
@@ -57,7 +57,7 @@
if (mem_map == NULL)
mem_map = (struct page *)PAGE_OFFSET;
- free_area_init_core(nid, pgdat, &discard, zones_size, zone_start_paddr,
+ free_area_init_core(nid, pgdat, &discard, zones_size, zone_start_pfn,
zholes_size, pmap);
pgdat->node_id = nid;
diff -Nru a/mm/page-writeback.c b/mm/page-writeback.c
--- a/mm/page-writeback.c Wed Aug 28 07:37:36 2002
+++ b/mm/page-writeback.c Wed Aug 28 07:37:36 2002
@@ -38,7 +38,12 @@
* After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
* will look to see if it needs to force writeback or throttling.
*/
-static int ratelimit_pages = 32;
+static long ratelimit_pages = 32;
+
+/*
+ * The total number of pages in the machine.
+ */
+static long total_pages;
/*
* When balance_dirty_pages decides that the caller needs to perform some
@@ -60,17 +65,17 @@
/*
* Start background writeback (via pdflush) at this level
*/
-int dirty_background_ratio = 40;
+int dirty_background_ratio = 10;
/*
* The generator of dirty data starts async writeback at this level
*/
-int dirty_async_ratio = 50;
+int dirty_async_ratio = 40;
/*
* The generator of dirty data performs sync writeout at this level
*/
-int dirty_sync_ratio = 60;
+int dirty_sync_ratio = 50;
/*
* The interval between `kupdate'-style writebacks, in centiseconds
@@ -107,18 +112,17 @@
*/
void balance_dirty_pages(struct address_space *mapping)
{
- const int tot = nr_free_pagecache_pages();
struct page_state ps;
- int background_thresh, async_thresh, sync_thresh;
+ long background_thresh, async_thresh, sync_thresh;
unsigned long dirty_and_writeback;
struct backing_dev_info *bdi;
get_page_state(&ps);
dirty_and_writeback = ps.nr_dirty + ps.nr_writeback;
- background_thresh = (dirty_background_ratio * tot) / 100;
- async_thresh = (dirty_async_ratio * tot) / 100;
- sync_thresh = (dirty_sync_ratio * tot) / 100;
+ background_thresh = (dirty_background_ratio * total_pages) / 100;
+ async_thresh = (dirty_async_ratio * total_pages) / 100;
+ sync_thresh = (dirty_sync_ratio * total_pages) / 100;
bdi = mapping->backing_dev_info;
if (dirty_and_writeback > sync_thresh) {
@@ -171,13 +175,14 @@
*/
static void background_writeout(unsigned long _min_pages)
{
- const int tot = nr_free_pagecache_pages();
- const int background_thresh = (dirty_background_ratio * tot) / 100;
long min_pages = _min_pages;
+ long background_thresh;
int nr_to_write;
CHECK_EMERGENCY_SYNC
+ background_thresh = (dirty_background_ratio * total_pages) / 100;
+
do {
struct page_state ps;
@@ -269,7 +274,7 @@
static void set_ratelimit(void)
{
- ratelimit_pages = nr_free_pagecache_pages() / (num_online_cpus() * 32);
+ ratelimit_pages = total_pages / (num_online_cpus() * 32);
if (ratelimit_pages < 16)
ratelimit_pages = 16;
if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
@@ -288,8 +293,29 @@
.next = NULL,
};
+/*
+ * If the machine has a large highmem:lowmem ratio then scale back the default
+ * dirty memory thresholds: allowing too much dirty highmem pins an excessive
+ * number of buffer_heads.
+ */
static int __init page_writeback_init(void)
{
+ long buffer_pages = nr_free_buffer_pages();
+ long correction;
+
+ total_pages = nr_free_pagecache_pages();
+
+ correction = (100 * 4 * buffer_pages) / total_pages;
+
+ if (correction < 100) {
+ dirty_background_ratio *= correction;
+ dirty_background_ratio /= 100;
+ dirty_async_ratio *= correction;
+ dirty_async_ratio /= 100;
+ dirty_sync_ratio *= correction;
+ dirty_sync_ratio /= 100;
+ }
+
init_timer(&wb_timer);
wb_timer.expires = jiffies + (dirty_writeback_centisecs * HZ) / 100;
wb_timer.data = 0;
@@ -350,10 +376,15 @@
#if 0
if (!PageWriteback(page) && PageDirty(page)) {
lock_page(page);
- if (!PageWriteback(page) && TestClearPageDirty(page))
- page->mapping->a_ops->writepage(page);
- else
+ if (!PageWriteback(page) && test_clear_page_dirty(page)) {
+ int ret;
+
+ ret = page->mapping->a_ops->writepage(page);
+ if (ret == -EAGAIN)
+ __set_page_dirty_nobuffers(page);
+ } else {
unlock_page(page);
+ }
}
#endif
}
@@ -390,11 +421,15 @@
write_lock(&mapping->page_lock);
list_del(&page->list);
- if (TestClearPageDirty(page)) {
+ if (test_clear_page_dirty(page)) {
list_add(&page->list, &mapping->locked_pages);
page_cache_get(page);
write_unlock(&mapping->page_lock);
ret = mapping->a_ops->writepage(page);
+ if (ret == -EAGAIN) {
+ __set_page_dirty_nobuffers(page);
+ ret = 0;
+ }
if (ret == 0 && wait) {
wait_on_page_writeback(page);
if (PageError(page))
@@ -478,6 +513,8 @@
if (!TestSetPageDirty(page)) {
write_lock(&mapping->page_lock);
if (page->mapping) { /* Race with truncate? */
+ if (!mapping->backing_dev_info->memory_backed)
+ inc_page_state(nr_dirty);
list_del(&page->list);
list_add(&page->list, &mapping->dirty_pages);
}
@@ -514,6 +551,8 @@
if (mapping) {
write_lock(&mapping->page_lock);
if (page->mapping) { /* Race with truncate? */
+ if (!mapping->backing_dev_info->memory_backed)
+ inc_page_state(nr_dirty);
list_del(&page->list);
list_add(&page->list, &mapping->dirty_pages);
}
@@ -525,4 +564,18 @@
}
EXPORT_SYMBOL(__set_page_dirty_nobuffers);
+/*
+ * Clear a page's dirty flag, while caring for dirty memory accounting.
+ * Returns true if the page was previously dirty.
+ */
+int test_clear_page_dirty(struct page *page)
+{
+ if (TestClearPageDirty(page)) {
+ struct address_space *mapping = page->mapping;
+ if (mapping && !mapping->backing_dev_info->memory_backed)
+ dec_page_state(nr_dirty);
+ return 1;
+ }
+ return 0;
+}
diff -Nru a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c Wed Aug 28 07:37:36 2002
+++ b/mm/page_alloc.c Wed Aug 28 07:37:36 2002
@@ -27,15 +27,13 @@
unsigned long totalram_pages;
unsigned long totalhigh_pages;
int nr_swap_pages;
-LIST_HEAD(active_list);
-LIST_HEAD(inactive_list);
pg_data_t *pgdat_list;
/*
* Used by page_zone() to look up the address of the struct zone whose
* id is encoded in the upper bits of page->flags
*/
-zone_t *zone_table[MAX_NR_ZONES*MAX_NR_NODES];
+struct zone *zone_table[MAX_NR_ZONES*MAX_NR_NODES];
EXPORT_SYMBOL(zone_table);
static char *zone_names[MAX_NR_ZONES] = { "DMA", "Normal", "HighMem" };
@@ -46,11 +44,11 @@
/*
* Temporary debugging check for pages not lying within a given zone.
*/
-static inline int bad_range(zone_t *zone, struct page *page)
+static inline int bad_range(struct zone *zone, struct page *page)
{
- if (page - mem_map >= zone->zone_start_mapnr + zone->size)
+ if (page_to_pfn(page) >= zone->zone_start_pfn + zone->size)
return 1;
- if (page - mem_map < zone->zone_start_mapnr)
+ if (page_to_pfn(page) < zone->zone_start_pfn)
return 1;
if (zone != page_zone(page))
return 1;
@@ -85,7 +83,7 @@
unsigned long index, page_idx, mask, flags;
free_area_t *area;
struct page *base;
- zone_t *zone;
+ struct zone *zone;
KERNEL_STAT_ADD(pgfree, 1<<order);
@@ -93,7 +91,6 @@
BUG_ON(PagePrivate(page));
BUG_ON(page->mapping != NULL);
BUG_ON(PageLocked(page));
- BUG_ON(PageLRU(page));
BUG_ON(PageActive(page));
BUG_ON(PageWriteback(page));
BUG_ON(page->pte.chain != NULL);
@@ -155,7 +152,8 @@
#define MARK_USED(index, order, area) \
__change_bit((index) >> (1+(order)), (area)->map)
-static inline struct page * expand (zone_t *zone, struct page *page,
+static inline struct page *
+expand(struct zone *zone, struct page *page,
unsigned long index, int low, int high, free_area_t * area)
{
unsigned long size = 1 << high;
@@ -186,14 +184,14 @@
BUG_ON(PageActive(page));
BUG_ON(PageDirty(page));
BUG_ON(PageWriteback(page));
+ BUG_ON(page->pte.chain != NULL);
page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
1 << PG_referenced | 1 << PG_arch_1 |
1 << PG_checked);
set_page_count(page, 1);
}
-static FASTCALL(struct page * rmqueue(zone_t *zone, unsigned int order));
-static struct page * rmqueue(zone_t *zone, unsigned int order)
+static struct page *rmqueue(struct zone *zone, unsigned int order)
{
free_area_t * area = zone->free_area + order;
unsigned int curr_order = order;
@@ -236,7 +234,7 @@
#ifdef CONFIG_SOFTWARE_SUSPEND
int is_head_of_free_region(struct page *page)
{
- zone_t *zone = page_zone(page);
+ struct zone *zone = page_zone(page);
unsigned long flags;
int order;
list_t *curr;
@@ -266,7 +264,7 @@
#endif
static /* inline */ struct page *
-balance_classzone(zone_t * classzone, unsigned int gfp_mask,
+balance_classzone(struct zone* classzone, unsigned int gfp_mask,
unsigned int order, int * freed)
{
struct page * page = NULL;
@@ -321,10 +319,12 @@
/*
* This is the 'heart' of the zoned buddy allocator:
*/
-struct page * __alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist)
+struct page *
+__alloc_pages(unsigned int gfp_mask, unsigned int order,
+ struct zonelist *zonelist)
{
unsigned long min;
- zone_t **zones, *classzone;
+ struct zone **zones, *classzone;
struct page * page;
int freed, i;
@@ -338,11 +338,11 @@
/* Go through the zonelist once, looking for a zone with enough free */
min = 1UL << order;
for (i = 0; zones[i] != NULL; i++) {
- zone_t *z = zones[i];
+ struct zone *z = zones[i];
/* the incremental min is allegedly to discourage fallback */
min += z->pages_low;
- if (z->free_pages > min) {
+ if (z->free_pages > min || z->free_pages >= z->pages_high) {
page = rmqueue(z, order);
if (page)
return page;
@@ -359,13 +359,13 @@
min = 1UL << order;
for (i = 0; zones[i] != NULL; i++) {
unsigned long local_min;
- zone_t *z = zones[i];
+ struct zone *z = zones[i];
local_min = z->pages_min;
if (gfp_mask & __GFP_HIGH)
local_min >>= 2;
min += local_min;
- if (z->free_pages > min) {
+ if (z->free_pages > min || z->free_pages >= z->pages_high) {
page = rmqueue(z, order);
if (page)
return page;
@@ -378,7 +378,7 @@
if (current->flags & (PF_MEMALLOC | PF_MEMDIE)) {
/* go through the zonelist yet again, ignoring mins */
for (i = 0; zones[i] != NULL; i++) {
- zone_t *z = zones[i];
+ struct zone *z = zones[i];
page = rmqueue(z, order);
if (page)
@@ -405,10 +405,10 @@
/* go through the zonelist yet one more time */
min = 1UL << order;
for (i = 0; zones[i] != NULL; i++) {
- zone_t *z = zones[i];
+ struct zone *z = zones[i];
min += z->pages_min;
- if (z->free_pages > min) {
+ if (z->free_pages > min || z->free_pages >= z->pages_high) {
page = rmqueue(z, order);
if (page)
return page;
@@ -478,7 +478,7 @@
unsigned int nr_free_pages(void)
{
unsigned int sum = 0;
- zone_t *zone;
+ struct zone *zone;
for_each_zone(zone)
sum += zone->free_pages;
@@ -492,9 +492,9 @@
unsigned int sum = 0;
for_each_pgdat(pgdat) {
- zonelist_t *zonelist = pgdat->node_zonelists + offset;
- zone_t **zonep = zonelist->zones;
- zone_t *zone;
+ struct zonelist *zonelist = pgdat->node_zonelists + offset;
+ struct zone **zonep = zonelist->zones;
+ struct zone *zone;
for (zone = *zonep++; zone; zone = *zonep++) {
unsigned long size = zone->size;
@@ -559,13 +559,23 @@
ret->nr_dirty += ps->nr_dirty;
ret->nr_writeback += ps->nr_writeback;
ret->nr_pagecache += ps->nr_pagecache;
- ret->nr_active += ps->nr_active;
- ret->nr_inactive += ps->nr_inactive;
ret->nr_page_table_pages += ps->nr_page_table_pages;
ret->nr_reverse_maps += ps->nr_reverse_maps;
}
}
+void get_zone_counts(unsigned long *active, unsigned long *inactive)
+{
+ struct zone *zone;
+
+ *active = 0;
+ *inactive = 0;
+ for_each_zone(zone) {
+ *active += zone->nr_active;
+ *inactive += zone->nr_inactive;
+ }
+}
+
unsigned long get_page_cache_size(void)
{
struct page_state ps;
@@ -602,8 +612,11 @@
pg_data_t *pgdat;
struct page_state ps;
int type;
+ unsigned long active;
+ unsigned long inactive;
get_page_state(&ps);
+ get_zone_counts(&active, &inactive);
printk("Free pages: %6dkB (%6dkB HighMem)\n",
K(nr_free_pages()),
@@ -611,22 +624,28 @@
for (pgdat = pgdat_list; pgdat; pgdat = pgdat->pgdat_next)
for (type = 0; type < MAX_NR_ZONES; ++type) {
- zone_t *zone = &pgdat->node_zones[type];
- printk("Zone:%s "
- "freepages:%6lukB "
- "min:%6lukB "
- "low:%6lukB "
- "high:%6lukB\n",
+ struct zone *zone = &pgdat->node_zones[type];
+ printk("Zone:%s"
+ " freepages:%6lukB"
+ " min:%6lukB"
+ " low:%6lukB"
+ " high:%6lukB"
+ " active:%6lukB"
+ " inactive:%6lukB"
+ "\n",
zone->name,
K(zone->free_pages),
K(zone->pages_min),
K(zone->pages_low),
- K(zone->pages_high));
+ K(zone->pages_high),
+ K(zone->nr_active),
+ K(zone->nr_inactive)
+ );
}
printk("( Active:%lu inactive:%lu dirty:%lu writeback:%lu free:%u )\n",
- ps.nr_active,
- ps.nr_inactive,
+ active,
+ inactive,
ps.nr_dirty,
ps.nr_writeback,
nr_free_pages());
@@ -634,7 +653,7 @@
for (pgdat = pgdat_list; pgdat; pgdat = pgdat->pgdat_next)
for (type = 0; type < MAX_NR_ZONES; type++) {
list_t *elem;
- zone_t *zone = &pgdat->node_zones[type];
+ struct zone *zone = &pgdat->node_zones[type];
unsigned long nr, flags, order, total = 0;
if (!zone->size)
@@ -663,8 +682,8 @@
int i, j, k;
for (i = 0; i <= GFP_ZONEMASK; i++) {
- zonelist_t *zonelist;
- zone_t *zone;
+ struct zonelist *zonelist;
+ struct zone *zone;
zonelist = pgdat->node_zonelists + i;
memset(zonelist, 0, sizeof(*zonelist));
@@ -754,7 +773,7 @@
* - clear the memory bitmaps
*/
void __init free_area_init_core(int nid, pg_data_t *pgdat, struct page **gmap,
- unsigned long *zones_size, unsigned long zone_start_paddr,
+ unsigned long *zones_size, unsigned long zone_start_pfn,
unsigned long *zholes_size, struct page *lmem_map)
{
unsigned long i, j;
@@ -762,13 +781,10 @@
unsigned long totalpages, offset, realtotalpages;
const unsigned long zone_required_alignment = 1UL << (MAX_ORDER-1);
- BUG_ON(zone_start_paddr & ~PAGE_MASK);
-
totalpages = 0;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- unsigned long size = zones_size[i];
- totalpages += size;
- }
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ totalpages += zones_size[i];
+
realtotalpages = totalpages;
if (zholes_size)
for (i = 0; i < MAX_NR_ZONES; i++)
@@ -791,13 +807,13 @@
}
*gmap = pgdat->node_mem_map = lmem_map;
pgdat->node_size = totalpages;
- pgdat->node_start_paddr = zone_start_paddr;
+ pgdat->node_start_pfn = zone_start_pfn;
pgdat->node_start_mapnr = (lmem_map - mem_map);
pgdat->nr_zones = 0;
offset = lmem_map - mem_map;
for (j = 0; j < MAX_NR_ZONES; j++) {
- zone_t *zone = pgdat->node_zones + j;
+ struct zone *zone = pgdat->node_zones + j;
unsigned long mask;
unsigned long size, realsize;
@@ -806,13 +822,19 @@
if (zholes_size)
realsize -= zholes_size[j];
- printk("zone(%lu): %lu pages.\n", j, size);
+ printk(" %s zone: %lu pages\n", zone_names[j], realsize);
zone->size = size;
zone->name = zone_names[j];
- zone->lock = SPIN_LOCK_UNLOCKED;
+ spin_lock_init(&zone->lock);
+ spin_lock_init(&zone->lru_lock);
zone->zone_pgdat = pgdat;
zone->free_pages = 0;
zone->need_balance = 0;
+ INIT_LIST_HEAD(&zone->active_list);
+ INIT_LIST_HEAD(&zone->inactive_list);
+ atomic_set(&zone->refill_counter, 0);
+ zone->nr_active = 0;
+ zone->nr_inactive = 0;
if (!size)
continue;
@@ -843,9 +865,9 @@
zone->zone_mem_map = mem_map + offset;
zone->zone_start_mapnr = offset;
- zone->zone_start_paddr = zone_start_paddr;
+ zone->zone_start_pfn = zone_start_pfn;
- if ((zone_start_paddr >> PAGE_SHIFT) & (zone_required_alignment-1))
+ if ((zone_start_pfn) & (zone_required_alignment-1))
printk("BUG: wrong zone alignment, it will crash\n");
/*
@@ -860,8 +882,12 @@
SetPageReserved(page);
INIT_LIST_HEAD(&page->list);
if (j != ZONE_HIGHMEM)
- set_page_address(page, __va(zone_start_paddr));
- zone_start_paddr += PAGE_SIZE;
+ /*
+ * The shift left won't overflow because
+ * ZONE_NORMAL is below 4G.
+ */
+ set_page_address(page, __va(zone_start_pfn << PAGE_SHIFT));
+ zone_start_pfn++;
}
offset += size;
diff -Nru a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c Wed Aug 28 07:37:38 2002
+++ b/mm/rmap.c Wed Aug 28 07:37:38 2002
@@ -14,7 +14,7 @@
/*
* Locking:
* - the page->pte.chain is protected by the PG_chainlock bit,
- * which nests within the pagemap_lru_lock, then the
+ * which nests within the zone->lru_lock, then the
* mm->page_table_lock, and then the page lock.
* - because swapout locking is opposite to the locking order
* in the page fault path, the swapout path uses trylocks
@@ -260,7 +260,7 @@
* table entry mapping a page. Because locking order here is opposite
* to the locking order used by the page fault path, we use trylocks.
* Locking:
- * pagemap_lru_lock page_launder()
+ * zone->lru_lock page_launder()
* page lock page_launder(), trylock
* pte_chain_lock page_launder()
* mm->page_table_lock try_to_unmap_one(), trylock
@@ -328,7 +328,7 @@
* @page: the page to get unmapped
*
* Tries to remove all the page table entries which are mapping this
- * page, used in the pageout path. Caller must hold pagemap_lru_lock
+ * page, used in the pageout path. Caller must hold zone->lru_lock
* and the page lock. Return values are:
*
* SWAP_SUCCESS - we succeeded in removing all mappings
diff -Nru a/mm/shmem.c b/mm/shmem.c
--- a/mm/shmem.c Wed Aug 28 07:37:37 2002
+++ b/mm/shmem.c Wed Aug 28 07:37:37 2002
@@ -29,6 +29,7 @@
#include <linux/string.h>
#include <linux/slab.h>
#include <linux/smp_lock.h>
+#include <linux/backing-dev.h>
#include <linux/shmem_fs.h>
#include <asm/uaccess.h>
@@ -56,6 +57,11 @@
static struct inode_operations shmem_dir_inode_operations;
static struct vm_operations_struct shmem_vm_ops;
+static struct backing_dev_info shmem_backing_dev_info = {
+ .ra_pages = 0, /* No readahead */
+ .memory_backed = 1, /* Does not contribute to dirty memory */
+};
+
LIST_HEAD (shmem_inodes);
static spinlock_t shmem_ilock = SPIN_LOCK_UNLOCKED;
atomic_t shmem_nrpages = ATOMIC_INIT(0); /* Not used right now */
@@ -789,6 +795,7 @@
inode->i_blocks = 0;
inode->i_rdev = NODEV;
inode->i_mapping->a_ops = &shmem_aops;
+ inode->i_mapping->backing_dev_info = &shmem_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
info = SHMEM_I(inode);
spin_lock_init (&info->lock);
diff -Nru a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c Wed Aug 28 07:37:37 2002
+++ b/mm/swap.c Wed Aug 28 07:37:37 2002
@@ -19,31 +19,28 @@
#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/init.h>
+#include <linux/mm_inline.h>
+#include <linux/buffer_head.h>
#include <linux/prefetch.h>
/* How many pages do we try to swap or page in/out together? */
int page_cluster;
/*
- * Move an inactive page to the active list.
+ * FIXME: speed this up?
*/
-static inline void activate_page_nolock(struct page * page)
+void activate_page(struct page *page)
{
+ struct zone *zone = page_zone(page);
+
+ spin_lock_irq(&zone->lru_lock);
if (PageLRU(page) && !PageActive(page)) {
- del_page_from_inactive_list(page);
- add_page_to_active_list(page);
+ del_page_from_inactive_list(zone, page);
+ SetPageActive(page);
+ add_page_to_active_list(zone, page);
KERNEL_STAT_INC(pgactivate);
}
-}
-
-/*
- * FIXME: speed this up?
- */
-void activate_page(struct page * page)
-{
- spin_lock_irq(&_pagemap_lru_lock);
- activate_page_nolock(page);
- spin_unlock_irq(&_pagemap_lru_lock);
+ spin_unlock_irq(&zone->lru_lock);
}
/**
@@ -77,66 +74,71 @@
void __page_cache_release(struct page *page)
{
unsigned long flags;
+ struct zone *zone = page_zone(page);
- spin_lock_irqsave(&_pagemap_lru_lock, flags);
- if (TestClearPageLRU(page)) {
- if (PageActive(page))
- del_page_from_active_list(page);
- else
- del_page_from_inactive_list(page);
- }
+ spin_lock_irqsave(&zone->lru_lock, flags);
+ if (TestClearPageLRU(page))
+ del_page_from_lru(zone, page);
if (page_count(page) != 0)
page = NULL;
- spin_unlock_irqrestore(&_pagemap_lru_lock, flags);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
if (page)
__free_pages_ok(page, 0);
}
/*
* Batched page_cache_release(). Decrement the reference count on all the
- * pagevec's pages. If it fell to zero then remove the page from the LRU and
+ * passed pages. If it fell to zero then remove the page from the LRU and
* free it.
*
- * Avoid taking pagemap_lru_lock if possible, but if it is taken, retain it
+ * Avoid taking zone->lru_lock if possible, but if it is taken, retain it
* for the remainder of the operation.
*
* The locking in this function is against shrink_cache(): we recheck the
* page count inside the lock to see whether shrink_cache grabbed the page
* via the LRU. If it did, give up: shrink_cache will free it.
- *
- * This function reinitialises the caller's pagevec.
*/
-void __pagevec_release(struct pagevec *pvec)
+void release_pages(struct page **pages, int nr)
{
int i;
- int lock_held = 0;
struct pagevec pages_to_free;
+ struct zone *zone = NULL;
pagevec_init(&pages_to_free);
- for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
+ for (i = 0; i < nr; i++) {
+ struct page *page = pages[i];
+ struct zone *pagezone;
if (PageReserved(page) || !put_page_testzero(page))
continue;
- if (!lock_held) {
- spin_lock_irq(&_pagemap_lru_lock);
- lock_held = 1;
+ pagezone = page_zone(page);
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
}
-
- if (TestClearPageLRU(page)) {
- if (PageActive(page))
- del_page_from_active_list(page);
- else
- del_page_from_inactive_list(page);
+ if (TestClearPageLRU(page))
+ del_page_from_lru(zone, page);
+ if (page_count(page) == 0) {
+ if (!pagevec_add(&pages_to_free, page)) {
+ spin_unlock_irq(&zone->lru_lock);
+ pagevec_free(&pages_to_free);
+ pagevec_init(&pages_to_free);
+ spin_lock_irq(&zone->lru_lock);
+ }
}
- if (page_count(page) == 0)
- pagevec_add(&pages_to_free, page);
}
- if (lock_held)
- spin_unlock_irq(&_pagemap_lru_lock);
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
pagevec_free(&pages_to_free);
+}
+
+void __pagevec_release(struct pagevec *pvec)
+{
+ release_pages(pvec->pages, pagevec_count(pvec));
pagevec_init(pvec);
}
@@ -169,24 +171,27 @@
void pagevec_deactivate_inactive(struct pagevec *pvec)
{
int i;
- int lock_held = 0;
+ struct zone *zone = NULL;
if (pagevec_count(pvec) == 0)
return;
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
+ struct zone *pagezone = page_zone(page);
- if (!lock_held) {
+ if (pagezone != zone) {
if (PageActive(page) || !PageLRU(page))
continue;
- spin_lock_irq(&_pagemap_lru_lock);
- lock_held = 1;
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
}
if (!PageActive(page) && PageLRU(page))
- list_move(&page->lru, &inactive_list);
+ list_move(&page->lru, &pagezone->inactive_list);
}
- if (lock_held)
- spin_unlock_irq(&_pagemap_lru_lock);
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
__pagevec_release(pvec);
}
@@ -197,16 +202,24 @@
void __pagevec_lru_add(struct pagevec *pvec)
{
int i;
+ struct zone *zone = NULL;
- spin_lock_irq(&_pagemap_lru_lock);
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
+ struct zone *pagezone = page_zone(page);
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
if (TestSetPageLRU(page))
BUG();
- add_page_to_inactive_list(page);
+ add_page_to_inactive_list(zone, page);
}
- spin_unlock_irq(&_pagemap_lru_lock);
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
pagevec_release(pvec);
}
@@ -217,20 +230,42 @@
void __pagevec_lru_del(struct pagevec *pvec)
{
int i;
+ struct zone *zone = NULL;
- spin_lock_irq(&_pagemap_lru_lock);
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
+ struct zone *pagezone = page_zone(page);
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
if (!TestClearPageLRU(page))
BUG();
- if (PageActive(page))
- del_page_from_active_list(page);
- else
- del_page_from_inactive_list(page);
+ del_page_from_lru(zone, page);
}
- spin_unlock_irq(&_pagemap_lru_lock);
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
pagevec_release(pvec);
+}
+
+/*
+ * Try to drop buffers from the pages in a pagevec
+ */
+void pagevec_strip(struct pagevec *pvec)
+{
+ int i;
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+
+ if (PagePrivate(page) && !TestSetPageLocked(page)) {
+ try_to_release_page(page, 0);
+ unlock_page(page);
+ }
+ }
}
/*
diff -Nru a/mm/swap_state.c b/mm/swap_state.c
--- a/mm/swap_state.c Wed Aug 28 07:37:37 2002
+++ b/mm/swap_state.c Wed Aug 28 07:37:37 2002
@@ -13,6 +13,7 @@
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/smp_lock.h>
+#include <linux/backing-dev.h>
#include <linux/buffer_head.h> /* block_sync_page() */
#include <asm/pgtable.h>
@@ -25,20 +26,26 @@
.i_mapping = &swapper_space,
};
+static struct backing_dev_info swap_backing_dev_info = {
+ .ra_pages = 0, /* No readahead */
+ .memory_backed = 1, /* Does not contribute to dirty memory */
+};
+
extern struct address_space_operations swap_aops;
struct address_space swapper_space = {
- .page_tree = RADIX_TREE_INIT(GFP_ATOMIC),
- .page_lock = RW_LOCK_UNLOCKED,
- .clean_pages = LIST_HEAD_INIT(swapper_space.clean_pages),
- .dirty_pages = LIST_HEAD_INIT(swapper_space.dirty_pages),
- .io_pages = LIST_HEAD_INIT(swapper_space.io_pages),
- .locked_pages = LIST_HEAD_INIT(swapper_space.locked_pages),
- .host = &swapper_inode,
- .a_ops = &swap_aops,
- .i_shared_lock = SPIN_LOCK_UNLOCKED,
- .private_lock = SPIN_LOCK_UNLOCKED,
- .private_list = LIST_HEAD_INIT(swapper_space.private_list),
+ .page_tree = RADIX_TREE_INIT(GFP_ATOMIC),
+ .page_lock = RW_LOCK_UNLOCKED,
+ .clean_pages = LIST_HEAD_INIT(swapper_space.clean_pages),
+ .dirty_pages = LIST_HEAD_INIT(swapper_space.dirty_pages),
+ .io_pages = LIST_HEAD_INIT(swapper_space.io_pages),
+ .locked_pages = LIST_HEAD_INIT(swapper_space.locked_pages),
+ .host = &swapper_inode,
+ .a_ops = &swap_aops,
+ .backing_dev_info = &swap_backing_dev_info,
+ .i_shared_lock = SPIN_LOCK_UNLOCKED,
+ .private_lock = SPIN_LOCK_UNLOCKED,
+ .private_list = LIST_HEAD_INIT(swapper_space.private_list),
};
#define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
@@ -292,26 +299,53 @@
return err;
}
+
/*
- * Perform a free_page(), also freeing any swap cache associated with
- * this page if it is the last user of the page. Can not do a lock_page,
- * as we are holding the page_table_lock spinlock.
+ * If we are the only user, then try to free up the swap cache.
+ *
+ * It's ok to check for PageSwapCache without the page lock
+ * here because we are going to recheck again inside
+ * exclusive_swap_page() _with_ the lock.
+ * - Marcelo
*/
-void free_page_and_swap_cache(struct page *page)
+static inline void free_swap_cache(struct page *page)
{
- /*
- * If we are the only user, then try to free up the swap cache.
- *
- * Its ok to check for PageSwapCache without the page lock
- * here because we are going to recheck again inside
- * exclusive_swap_page() _with_ the lock.
- * - Marcelo
- */
if (PageSwapCache(page) && !TestSetPageLocked(page)) {
remove_exclusive_swap_page(page);
unlock_page(page);
}
+}
+
+/*
+ * Perform a free_page(), also freeing any swap cache associated with
+ * this page if it is the last user of the page. Can not do a lock_page,
+ * as we are holding the page_table_lock spinlock.
+ */
+void free_page_and_swap_cache(struct page *page)
+{
+ free_swap_cache(page);
page_cache_release(page);
+}
+
+/*
+ * Passed an array of pages, drop them all from swapcache and then release
+ * them. They are removed from the LRU and freed if this is their last use.
+ */
+void free_pages_and_swap_cache(struct page **pages, int nr)
+{
+ const int chunk = 16;
+ struct page **pagep = pages;
+
+ while (nr) {
+ int todo = min(chunk, nr);
+ int i;
+
+ for (i = 0; i < todo; i++)
+ free_swap_cache(pagep[i]);
+ release_pages(pagep, todo);
+ pagep += todo;
+ nr -= todo;
+ }
}
/*
diff -Nru a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c Wed Aug 28 07:37:36 2002
+++ b/mm/vmscan.c Wed Aug 28 07:37:36 2002
@@ -23,6 +23,7 @@
#include <linux/writeback.h>
#include <linux/suspend.h>
#include <linux/buffer_head.h> /* for try_to_release_page() */
+#include <linux/mm_inline.h>
#include <linux/pagevec.h>
#include <asm/pgalloc.h>
@@ -93,7 +94,7 @@
}
static /* inline */ int
-shrink_list(struct list_head *page_list, int nr_pages, zone_t *classzone,
+shrink_list(struct list_head *page_list, int nr_pages,
unsigned int gfp_mask, int priority, int *max_scan)
{
struct address_space *mapping;
@@ -109,8 +110,6 @@
page = list_entry(page_list->prev, struct page, lru);
list_del(&page->lru);
- if (!memclass(page_zone(page), classzone))
- goto keep;
if (TestSetPageLocked(page))
goto keep;
@@ -264,7 +263,7 @@
}
/*
- * pagemap_lru_lock is heavily contented. We relieve it by quickly privatising
+ * zone->lru_lock is heavily contended. We relieve it by quickly privatising
* a batch of pages and working on them outside the lock. Any pages which were
* not freed will be added back to the LRU.
*
@@ -275,7 +274,7 @@
* in the kernel (apart from the copy_*_user functions).
*/
static /* inline */ int
-shrink_cache(int nr_pages, zone_t *classzone,
+shrink_cache(int nr_pages, struct zone *zone,
unsigned int gfp_mask, int priority, int max_scan)
{
LIST_HEAD(page_list);
@@ -292,15 +291,17 @@
pagevec_init(&pvec);
lru_add_drain();
- spin_lock_irq(&_pagemap_lru_lock);
+ spin_lock_irq(&zone->lru_lock);
while (max_scan > 0 && nr_pages > 0) {
struct page *page;
int n = 0;
- while (n < nr_to_process && !list_empty(&inactive_list)) {
- page = list_entry(inactive_list.prev, struct page, lru);
+ while (n < nr_to_process && !list_empty(&zone->inactive_list)) {
+ page = list_entry(zone->inactive_list.prev,
+ struct page, lru);
- prefetchw_prev_lru_page(page, &inactive_list, flags);
+ prefetchw_prev_lru_page(page,
+ &zone->inactive_list, flags);
if (!TestClearPageLRU(page))
BUG();
@@ -308,28 +309,28 @@
if (page_count(page) == 0) {
/* It is currently in pagevec_release() */
SetPageLRU(page);
- list_add(&page->lru, &inactive_list);
+ list_add(&page->lru, &zone->inactive_list);
continue;
}
list_add(&page->lru, &page_list);
page_cache_get(page);
n++;
}
- spin_unlock_irq(&_pagemap_lru_lock);
+ zone->nr_inactive -= n;
+ spin_unlock_irq(&zone->lru_lock);
if (list_empty(&page_list))
goto done;
max_scan -= n;
- mod_page_state(nr_inactive, -n);
KERNEL_STAT_ADD(pgscan, n);
- nr_pages = shrink_list(&page_list, nr_pages, classzone,
+ nr_pages = shrink_list(&page_list, nr_pages,
gfp_mask, priority, &max_scan);
if (nr_pages <= 0 && list_empty(&page_list))
goto done;
- spin_lock_irq(&_pagemap_lru_lock);
+ spin_lock_irq(&zone->lru_lock);
/*
* Put back any unfreeable pages.
*/
@@ -339,17 +340,17 @@
BUG();
list_del(&page->lru);
if (PageActive(page))
- __add_page_to_active_list(page);
+ add_page_to_active_list(zone, page);
else
- add_page_to_inactive_list(page);
+ add_page_to_inactive_list(zone, page);
if (!pagevec_add(&pvec, page)) {
- spin_unlock_irq(&_pagemap_lru_lock);
+ spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
- spin_lock_irq(&_pagemap_lru_lock);
+ spin_lock_irq(&zone->lru_lock);
}
}
}
- spin_unlock_irq(&_pagemap_lru_lock);
+ spin_unlock_irq(&zone->lru_lock);
done:
pagevec_release(&pvec);
return nr_pages;
@@ -362,9 +363,9 @@
* processes, from rmap.
*
* If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold pagemap_lru_lock across the whole operation. But if
+ * appropriate to hold zone->lru_lock across the whole operation. But if
* the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop pagemap_lru_lock around each page. It's impossible to balance
+ * should drop zone->lru_lock around each page. It's impossible to balance
* this, so instead we remove the pages from the LRU while processing them.
* It is safe to rely on PG_active against the non-LRU pages in here because
* nobody will play with that bit on a non-LRU page.
@@ -372,7 +373,8 @@
* The downside is that we have to touch page->count against each page.
* But we had to alter page->flags anyway.
*/
-static /* inline */ void refill_inactive(const int nr_pages_in)
+static /* inline */ void
+refill_inactive_zone(struct zone *zone, const int nr_pages_in)
{
int pgdeactivate = 0;
int nr_pages = nr_pages_in;
@@ -383,24 +385,24 @@
struct pagevec pvec;
lru_add_drain();
- spin_lock_irq(&_pagemap_lru_lock);
- while (nr_pages && !list_empty(&active_list)) {
- page = list_entry(active_list.prev, struct page, lru);
- prefetchw_prev_lru_page(page, &active_list, flags);
+ spin_lock_irq(&zone->lru_lock);
+ while (nr_pages && !list_empty(&zone->active_list)) {
+ page = list_entry(zone->active_list.prev, struct page, lru);
+ prefetchw_prev_lru_page(page, &zone->active_list, flags);
if (!TestClearPageLRU(page))
BUG();
list_del(&page->lru);
if (page_count(page) == 0) {
/* It is currently in pagevec_release() */
SetPageLRU(page);
- list_add(&page->lru, &active_list);
+ list_add(&page->lru, &zone->active_list);
continue;
}
page_cache_get(page);
list_add(&page->lru, &l_hold);
nr_pages--;
}
- spin_unlock_irq(&_pagemap_lru_lock);
+ spin_unlock_irq(&zone->lru_lock);
while (!list_empty(&l_hold)) {
page = list_entry(l_hold.prev, struct page, lru);
@@ -419,7 +421,7 @@
}
pagevec_init(&pvec);
- spin_lock_irq(&_pagemap_lru_lock);
+ spin_lock_irq(&zone->lru_lock);
while (!list_empty(&l_inactive)) {
page = list_entry(l_inactive.prev, struct page, lru);
prefetchw_prev_lru_page(page, &l_inactive, flags);
@@ -427,44 +429,51 @@
BUG();
if (!TestClearPageActive(page))
BUG();
- list_move(&page->lru, &inactive_list);
+ list_move(&page->lru, &zone->inactive_list);
if (!pagevec_add(&pvec, page)) {
- spin_unlock_irq(&_pagemap_lru_lock);
+ spin_unlock_irq(&zone->lru_lock);
+ if (buffer_heads_over_limit)
+ pagevec_strip(&pvec);
__pagevec_release(&pvec);
- spin_lock_irq(&_pagemap_lru_lock);
+ spin_lock_irq(&zone->lru_lock);
}
}
+ if (buffer_heads_over_limit) {
+ spin_unlock_irq(&zone->lru_lock);
+ pagevec_strip(&pvec);
+ pagevec_release(&pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
while (!list_empty(&l_active)) {
page = list_entry(l_active.prev, struct page, lru);
prefetchw_prev_lru_page(page, &l_active, flags);
if (TestSetPageLRU(page))
BUG();
BUG_ON(!PageActive(page));
- list_move(&page->lru, &active_list);
+ list_move(&page->lru, &zone->active_list);
if (!pagevec_add(&pvec, page)) {
- spin_unlock_irq(&_pagemap_lru_lock);
+ spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
- spin_lock_irq(&_pagemap_lru_lock);
+ spin_lock_irq(&zone->lru_lock);
}
}
- spin_unlock_irq(&_pagemap_lru_lock);
+ zone->nr_active -= pgdeactivate;
+ zone->nr_inactive += pgdeactivate;
+ spin_unlock_irq(&zone->lru_lock);
pagevec_release(&pvec);
- mod_page_state(nr_active, -pgdeactivate);
- mod_page_state(nr_inactive, pgdeactivate);
KERNEL_STAT_ADD(pgscan, nr_pages_in - nr_pages);
KERNEL_STAT_ADD(pgdeactivate, pgdeactivate);
}
static /* inline */ int
-shrink_caches(zone_t *classzone, int priority,
- unsigned int gfp_mask, int nr_pages)
+shrink_zone(struct zone *zone, int priority,
+ unsigned int gfp_mask, int nr_pages)
{
unsigned long ratio;
- struct page_state ps;
int max_scan;
- static atomic_t nr_to_refill = ATOMIC_INIT(0);
+ /* This is bogus for ZONE_HIGHMEM? */
if (kmem_cache_reap(gfp_mask) >= nr_pages)
return 0;
@@ -478,17 +487,16 @@
* just to make sure that the kernel will slowly sift through the
* active list.
*/
- get_page_state(&ps);
- ratio = (unsigned long)nr_pages * ps.nr_active /
- ((ps.nr_inactive | 1) * 2);
- atomic_add(ratio+1, &nr_to_refill);
- if (atomic_read(&nr_to_refill) > SWAP_CLUSTER_MAX) {
- atomic_sub(SWAP_CLUSTER_MAX, &nr_to_refill);
- refill_inactive(SWAP_CLUSTER_MAX);
+ ratio = (unsigned long)nr_pages * zone->nr_active /
+ ((zone->nr_inactive | 1) * 2);
+ atomic_add(ratio+1, &zone->refill_counter);
+ if (atomic_read(&zone->refill_counter) > SWAP_CLUSTER_MAX) {
+ atomic_sub(SWAP_CLUSTER_MAX, &zone->refill_counter);
+ refill_inactive_zone(zone, SWAP_CLUSTER_MAX);
}
- max_scan = ps.nr_inactive / priority;
- nr_pages = shrink_cache(nr_pages, classzone,
+ max_scan = zone->nr_inactive / priority;
+ nr_pages = shrink_cache(nr_pages, zone,
gfp_mask, priority, max_scan);
if (nr_pages <= 0)
@@ -507,7 +515,31 @@
return nr_pages;
}
-int try_to_free_pages(zone_t *classzone, unsigned int gfp_mask, unsigned int order)
+static int
+shrink_caches(struct zone *classzone, int priority,
+ int gfp_mask, int nr_pages)
+{
+ struct zone *first_classzone;
+ struct zone *zone;
+
+ first_classzone = classzone->zone_pgdat->node_zones;
+ zone = classzone;
+ while (zone >= first_classzone) {
+ if (zone->free_pages <= zone->pages_high) {
+ nr_pages = shrink_zone(zone, priority,
+ gfp_mask, nr_pages);
+ }
+ zone--;
+ }
+ return nr_pages;
+}
+
+/*
+ * This is the main entry point to page reclaim.
+ */
+int
+try_to_free_pages(struct zone *classzone,
+ unsigned int gfp_mask, unsigned int order)
{
int priority = DEF_PRIORITY;
int nr_pages = SWAP_CLUSTER_MAX;
@@ -515,24 +547,20 @@
KERNEL_STAT_INC(pageoutrun);
do {
- nr_pages = shrink_caches(classzone, priority, gfp_mask, nr_pages);
+ nr_pages = shrink_caches(classzone, priority,
+ gfp_mask, nr_pages);
if (nr_pages <= 0)
return 1;
} while (--priority);
-
- /*
- * Hmm.. Cache shrink failed - time to kill something?
- * Mhwahahhaha! This is the part I really like. Giggle.
- */
out_of_memory();
return 0;
}
DECLARE_WAIT_QUEUE_HEAD(kswapd_wait);
-static int check_classzone_need_balance(zone_t * classzone)
+static int check_classzone_need_balance(struct zone *classzone)
{
- zone_t * first_classzone;
+ struct zone *first_classzone;
first_classzone = classzone->zone_pgdat->node_zones;
while (classzone >= first_classzone) {
@@ -546,7 +574,7 @@
static int kswapd_balance_pgdat(pg_data_t * pgdat)
{
int need_more_balance = 0, i;
- zone_t * zone;
+ struct zone *zone;
for (i = pgdat->nr_zones-1; i >= 0; i--) {
zone = pgdat->node_zones + i;
@@ -584,7 +612,7 @@
static int kswapd_can_sleep_pgdat(pg_data_t * pgdat)
{
- zone_t * zone;
+ struct zone *zone;
int i;
for (i = pgdat->nr_zones-1; i >= 0; i--) {
---------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: slablru for 2.5.32-mm1
2002-08-28 17:06 ` slablru for 2.5.32-mm1 Ed Tomlinson
@ 2002-08-28 21:24 ` Andrew Morton
2002-08-28 22:23 ` Rik van Riel
2002-09-02 5:26 ` Andrew Morton
2002-09-02 6:50 ` Andrew Morton
2 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2002-08-28 21:24 UTC (permalink / raw)
To: Ed Tomlinson; +Cc: linux-mm
Ed Tomlinson wrote:
>
> Hi Andrew
>
> Here is slablru for 32-mm1. This is based on a version ported to 31ish-mm1. It should be
> stable. It's been booted as UP (32-mm1) and SMP on UP (31ish-mm1 only) and works as expected.
Cool. But the diff adds tons of stuff which is already added by -mm1.
I suspect you diffed against 2.5.31 base?
> A typical test cycle involved:
> find / -name "*" > /dev/null
> edit a large tif with the gimp
> run dbench a few times with the dbench dir on tmpfs (trying to use gimp too)
> run dbench a few times from a reiserfs dir (trying to use gimp too)
> use the box for news/mail, apt-get update/upgrade etc, wait a few hours and repeat
>
> 31ish-mm1 survived a day of this, 32-mm1 is sending this message after one cycle.
>
> Andrew, what do you think about adding slablru to your experimental dir?
No probs.
> There is also a version for virgin 2.5.32, anyone wanting it should email me - one big
> patch eats enough bandwidth.
>
> One interesting change in this version. We only add the first page of a slab to the lru. The
> reference bit setting logic for slabs has been modified to set the bit on the first page.
> Pagevec created a little bit of a problem for slablru. How do we know the order of the
> slab page when it's being freed? My solution is to use 3 bits in page->flags and save the
> order there. Then free_pages_ok was modified to take the order from page->flags. This
> was implemented in a minimal fashion. I think Wli is working on a more elaborate version of
> this - fleshed out, it could be used to support large pages in the vm.
hm. What happened to the idea of walking mem_map[], looking for continuation
pages? (This would need to be done via pfn_to_page(), I guess).
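For reference, the order-in-flags trick Ed describes comes down to
something like the sketch below. The bit position, width and helper
names are illustrative only - the real patch has to pick bits that
are actually free in page->flags, and may do this differently:

	/*
	 * Sketch: stash the allocation order of a slab's first page in
	 * three spare page->flags bits so free_pages_ok() can recover
	 * it later.  Only safe to do non-atomically while nobody else
	 * can see the page (i.e. at allocation/free time).
	 */
	#define PG_order_shift	21	/* assumed-free flag bits */
	#define PG_order_mask	7UL	/* 3 bits: orders 0..7 */

	static inline void set_page_order(struct page *page, unsigned long order)
	{
		page->flags &= ~(PG_order_mask << PG_order_shift);
		page->flags |= (order & PG_order_mask) << PG_order_shift;
	}

	static inline unsigned long page_order(struct page *page)
	{
		return (page->flags >> PG_order_shift) & PG_order_mask;
	}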
> Second topic.
>
> I have also included an optimisation for vmscan. I found that the current code would reduce
> the inactive list to almost nothing when applications create large numbers of active pages very
> quickly run (ie. gimp loading and editing large 20m+ tiffs). This reduces the problem. Always
> allowing nr_pages to be scanned caused the active list to be reduced to almost nothing when
> something like gimp exited and we had another task adding lots to the inactive list. This
> is fixed here too. I do wonder if zone->refill_counter, as implemented, is a great idea. Do
> we really need/want to remember to scan the active list if it has massively decreased in size
> because some app exited? Maybe some sort of decay logic should be used...
>
Well the refill counter thingy is just an optimisation: rather than calling refill_inactive()
lots of times to just grab two or three pages, we wait until it builds up to 32, and then
go deactivate 32 pages.
But ugh, it's a bit broken. Yup, you're right. Need to s/if/while/ in shrink_zone().
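Concretely, the s/if/while/ drains a large backlog in SWAP_CLUSTER_MAX
sized batches instead of one batch per call (sketch against the
shrink_zone() hunk quoted earlier in this thread):

	atomic_add(ratio + 1, &zone->refill_counter);
	while (atomic_read(&zone->refill_counter) > SWAP_CLUSTER_MAX) {
		atomic_sub(SWAP_CLUSTER_MAX, &zone->refill_counter);
		refill_inactive_zone(zone, SWAP_CLUSTER_MAX);
	}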
But we do need to slowly sift through the active list even when the inactive
list is enormously bigger. Otherwise, completely dead pages will remain in-core
forever if there's a lot of pagecache activity going on.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: slablru for 2.5.32-mm1
2002-08-28 21:24 ` Andrew Morton
@ 2002-08-28 22:23 ` Rik van Riel
0 siblings, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2002-08-28 22:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ed Tomlinson, linux-mm
On Wed, 28 Aug 2002, Andrew Morton wrote:
> But we do need to slowly sift through the active list even when the
> inactive list is enormously bigger. Otherwise, completely dead pages
> will remain in-core forever if there's a lot of pagecache activity going
> on.
Doesn't that just indicate we want to get rid of use-once
and replace it with something slightly more predictable?
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: slablru for 2.5.32-mm1
2002-08-28 17:06 ` slablru for 2.5.32-mm1 Ed Tomlinson
2002-08-28 21:24 ` Andrew Morton
@ 2002-09-02 5:26 ` Andrew Morton
2002-09-02 15:00 ` Ed Tomlinson
2002-09-02 6:50 ` Andrew Morton
2 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2002-09-02 5:26 UTC (permalink / raw)
To: Ed Tomlinson; +Cc: linux-mm
Ed,
I've taken a pass through this. It's nice and simple. Clever, too.
This is a first pass at changelogging it - I'd appreciate comments
on accuracy, things which should be added. Also a few questions and
comments for you, please.
Ed Tomlinson's patch which puts slab pages on the LRU.
- The patch is huge because it adds another argument to
kmem_cache_create(): the address of that slab's application-specific
"pruner" function.
- shrink_icache_memory(), shrink_dcache_memory() and
shrink_dqcache_memory() are removed. We now have dcache, inode and
dquot "pruner" functions which are called from the VM to trim the
relevant cache.
- Ed originally had all slab pages on the LRU, including higher-order
ones.
We ended up deciding to not do that - there are not really any
interesting slab caches which use high-order allocations, and it
didn't seem worth the (minor) complexity of managing higher-order
pages on the LRU.
If, at some time in the future, the VM at large becomes aware of
higher-order pages then we can bring this back. But it doesn't seem
justifiable purely for slablru.
The patch which brings back the higher-order slabs is available.
Description of algorithm
========================
The pages which back slab objects may be manually marked as referenced
via kmem_touch_page(), which simply sets PG_referenced. It _could_ use
mark_page_accessed(), but doesn't. So slab pages will always remain on
the inactive list. (what's the thinking here?)
kmem_touch_page() is only ever used when a new object is allocated
within a page. It is not used in application-specific places such as
dcache lookup. But I think it could be. (what's the thinking here?)
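For reference, the touch itself is trivial - a sketch, using the
helper name from the description above:

	static inline void kmem_touch_page(struct page *page)
	{
		/*
		 * mark_page_accessed() would also promote the page to
		 * the active list; just setting the bit keeps slab
		 * pages on the inactive list, as noted above.
		 */
		SetPageReferenced(page);
	}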
During page reclaim, when a slab page reaches the tail of the LRU we
look to see if the page is referenced. If it is not referenced, has no
pruner callback, and if it has no active slab entries then we directly
reclaim the page. (Why don't we reclaim it even if it is referenced?)
If the page does have some active entries and it has a pruner callback,
then we count them up and record the count within the slab structure.
So at the end of a shrink_cache() run, each slab has within it a record
of how many live objects-within-pages were encountered at the tail of
the LRU.
In shrink_zone(), after running shrink_cache(), we take a pass across
all the slab caches, in kmem_do_prunes(). This will call out to the
pruner functions and ask them to release a number of objects, where
that number is the number of objects which were found at the tail of
the LRU.
So the net effect is that the pruning pressure against a particular
slab is driven by the number of objects which are encountered at the
tail of the LRU.
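kmem_do_prunes() itself is not quoted in this mail; conceptually it is
something like the sketch below. The per-cache field names and the use
of cache_chain_sem are assumptions - the real code is in Ed's mm/slab.c
changes:

	static void kmem_do_prunes(int gfp_mask)
	{
		struct list_head *p;

		down(&cache_chain_sem);
		list_for_each(p, &cache_chain) {
			kmem_cache_t *cachep =
				list_entry(p, kmem_cache_t, next);
			int entries = cachep->lru_entries;

			cachep->lru_entries = 0;  /* counted at the LRU tail */
			if (entries && cachep->pruner)
				cachep->pruner(cachep, entries, gfp_mask);
		}
		up(&cache_chain_sem);
	}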
Comments
========
Ed, this code can be sped up a bit, I think. We can make
kmem_count_page() return a boolean back to shrink_cache(), telling it
whether it needs to call kmem_do_prunes() at all. Often, there won't
be any work to do in there, and taking that semaphore can be quite
costly.
The code as-is will even run kmem_do_prunes() when we're examining
ZONE_HIGHMEM, which certainly won't have any slab pages. This boolean
will fix that too.
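A sketch of that boolean, assuming kmem_count_page() is changed to
return nonzero iff it recorded live objects for some cache:

	int need_prune = 0;

	/* inside the shrink_cache()/shrink_list() page loop: */
	if (PageSlab(page))
		need_prune |= kmem_count_page(page);

	/* after the loop -- skip the semaphore when nothing was seen,
	   which also covers ZONE_HIGHMEM, where no slab pages live: */
	if (need_prune)
		kmem_do_prunes(gfp_mask);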
I reverted your
BUG_ON(smp_call_function(func, arg, 1, 1));
to
if (smp_call_function(func, arg, 1, 1))
BUG();
because BUG_ONs should not have side-effects. Someone may want to
compile a kernel which has a stubbed-out BUG_ON()...
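The hazard, spelled out: if BUG_ON() is ever stubbed out, its argument
is no longer evaluated and the cross-CPU call silently vanishes
(CONFIG_NO_BUG here is a hypothetical stand-in for such a config):

	#ifdef CONFIG_NO_BUG
	#define BUG_ON(x)	do { } while (0)	/* x never evaluated */
	#endif

	BUG_ON(smp_call_function(func, arg, 1, 1));	/* call disappears */

	if (smp_call_function(func, arg, 1, 1))		/* call always made */
		BUG();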
__kmem_cache_shrink_locked() is calling kmem_slab_destroy() with local
irqs disabled, which is a change from the previous behaviour. Is this
deliberate?
The patch does a zillion BUG->BUG_ON conversions in slab.c, which is a
bit unfortunate, because it makes it a bit confusing to review. Let's
do that in a standalone patch next time ;)
arch/arm/mach-arc/mm.c | 4
arch/cris/drivers/usb-host.c | 2
arch/i386/mm/init.c | 2
drivers/block/ll_rw_blk.c | 2
drivers/ieee1394/eth1394.c | 2
drivers/ieee1394/ieee1394_core.c | 2
drivers/md/raid5.c | 2
drivers/scsi/scsi.c | 2
drivers/usb/host/uhci-hcd.c | 2
fs/adfs/super.c | 2
fs/affs/super.c | 2
fs/aio.c | 4
fs/bfs/inode.c | 1
fs/bio.c | 4
fs/block_dev.c | 4
fs/buffer.c | 2
fs/char_dev.c | 4
fs/coda/inode.c | 3
fs/dcache.c | 46 +----
fs/devfs/base.c | 2
fs/dnotify.c | 2
fs/dquot.c | 20 --
fs/efs/super.c | 1
fs/ext2/super.c | 2
fs/ext3/super.c | 2
fs/fat/inode.c | 1
fs/fcntl.c | 2
fs/freevxfs/vxfs_super.c | 5
fs/hfs/super.c | 1
fs/hpfs/super.c | 2
fs/inode.c | 33 +--
fs/intermezzo/dcache.c | 2
fs/isofs/inode.c | 2
fs/jbd/journal.c | 1
fs/jbd/revoke.c | 4
fs/jffs/inode-v23.c | 4
fs/jffs2/malloc.c | 14 -
fs/jffs2/super.c | 2
fs/jfs/jfs_metapage.c | 2
fs/jfs/super.c | 2
fs/locks.c | 2
fs/minix/inode.c | 2
fs/namespace.c | 2
fs/ncpfs/inode.c | 2
fs/nfs/inode.c | 2
fs/nfs/pagelist.c | 2
fs/nfs/read.c | 2
fs/nfs/write.c | 2
fs/ntfs/super.c | 9 -
fs/proc/inode.c | 2
fs/proc/proc_misc.c | 2
fs/qnx4/inode.c | 2
fs/reiserfs/super.c | 2
fs/romfs/inode.c | 2
fs/smbfs/inode.c | 2
fs/smbfs/request.c | 2
fs/sysv/inode.c | 2
fs/udf/super.c | 2
fs/ufs/super.c | 2
include/linux/dcache.h | 5
include/linux/page-flags.h | 1
include/linux/slab.h | 26 ++
kernel/fork.c | 12 -
kernel/signal.c | 2
kernel/user.c | 2
lib/radix-tree.c | 2
mm/page_alloc.c | 1
mm/rmap.c | 1
mm/shmem.c | 2
mm/slab.c | 287 +++++++++++++++++++++------------
mm/swap.c | 18 +-
mm/vmscan.c | 41 +++-
net/atm/clip.c | 2
net/bluetooth/af_bluetooth.c | 2
net/core/neighbour.c | 2
net/core/skbuff.c | 2
net/core/sock.c | 2
net/decnet/dn_route.c | 2
net/decnet/dn_table.c | 2
net/ipv4/af_inet.c | 6
net/ipv4/fib_hash.c | 2
net/ipv4/inetpeer.c | 2
net/ipv4/ipmr.c | 2
net/ipv4/netfilter/ip_conntrack_core.c | 2
net/ipv4/route.c | 2
net/ipv4/tcp.c | 6
net/ipv6/af_inet6.c | 6
net/ipv6/ip6_fib.c | 2
net/ipv6/route.c | 2
net/socket.c | 2
net/unix/af_unix.c | 2
91 files changed, 401 insertions(+), 292 deletions(-)
--- 2.5.33/arch/arm/mach-arc/mm.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/arch/arm/mach-arc/mm.c Sun Sep 1 21:23:12 2002
@@ -173,13 +173,13 @@ void __init pgtable_cache_init(void)
{
pte_cache = kmem_cache_create("pte-cache",
sizeof(pte_t) * PTRS_PER_PTE,
- 0, 0, pte_cache_ctor, NULL);
+ 0, 0, NULL, pte_cache_ctor, NULL);
if (!pte_cache)
BUG();
pgd_cache = kmem_cache_create("pgd-cache", MEMC_TABLE_SIZE +
sizeof(pgd_t) * PTRS_PER_PGD,
- 0, 0, pgd_cache_ctor, NULL);
+ 0, 0, NULL, pgd_cache_ctor, NULL);
if (!pgd_cache)
BUG();
}
--- 2.5.33/arch/cris/drivers/usb-host.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/arch/cris/drivers/usb-host.c Sun Sep 1 21:23:12 2002
@@ -2330,7 +2330,7 @@ static int __init etrax_usb_hc_init(void
hc = kmalloc(sizeof(etrax_hc_t), GFP_KERNEL);
/* We use kmem_cache_* to make sure that all DMA desc. are dword aligned */
- usb_desc_cache = kmem_cache_create("usb_desc_cache", sizeof(USB_EP_Desc_t), 0, 0, 0, 0);
+ usb_desc_cache = kmem_cache_create("usb_desc_cache", sizeof(USB_EP_Desc_t), 0, 0, NULL, NULL, NULL);
if (!usb_desc_cache) {
panic("USB Desc Cache allocation failed !!!\n");
}
--- 2.5.33/arch/i386/mm/init.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/arch/i386/mm/init.c Sun Sep 1 21:30:42 2002
@@ -504,7 +504,7 @@ void __init pgtable_cache_init(void)
* PAE pgds must be 16-byte aligned:
*/
pae_pgd_cachep = kmem_cache_create("pae_pgd", 32, 0,
- SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!pae_pgd_cachep)
panic("init_pae(): Cannot alloc pae_pgd SLAB cache");
}
--- 2.5.33/drivers/block/ll_rw_blk.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/drivers/block/ll_rw_blk.c Sun Sep 1 21:23:33 2002
@@ -2052,7 +2052,7 @@ int __init blk_dev_init(void)
request_cachep = kmem_cache_create("blkdev_requests",
sizeof(struct request),
- 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ 0, SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!request_cachep)
panic("Can't create request pool slab cache\n");
--- 2.5.33/drivers/ieee1394/eth1394.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/drivers/ieee1394/eth1394.c Sun Sep 1 21:23:12 2002
@@ -715,7 +715,7 @@ static struct hpsb_highlevel_ops hl_ops
static int __init ether1394_init_module (void)
{
packet_task_cache = kmem_cache_create("packet_task", sizeof(struct packet_task),
- 0, 0, NULL, NULL);
+ 0, 0, NULL, NULL, NULL);
/* Register ourselves as a highlevel driver */
hl_handle = hpsb_register_highlevel (ETHER1394_DRIVER_NAME, &hl_ops);
--- 2.5.33/drivers/ieee1394/ieee1394_core.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/drivers/ieee1394/ieee1394_core.c Sun Sep 1 21:23:12 2002
@@ -971,7 +971,7 @@ struct proc_dir_entry *ieee1394_procfs_e
static int __init ieee1394_init(void)
{
hpsb_packet_cache = kmem_cache_create("hpsb_packet", sizeof(struct hpsb_packet),
- 0, 0, NULL, NULL);
+ 0, 0, NULL, NULL, NULL);
ieee1394_devfs_handle = devfs_mk_dir(NULL, "ieee1394", NULL);
--- 2.5.33/drivers/md/raid5.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/drivers/md/raid5.c Sun Sep 1 21:23:12 2002
@@ -277,7 +277,7 @@ static int grow_stripes(raid5_conf_t *co
sc = kmem_cache_create(conf->cache_name,
sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
- 0, 0, NULL, NULL);
+ 0, 0, NULL, NULL, NULL);
if (!sc)
return 1;
conf->slab_cache = sc;
--- 2.5.33/drivers/scsi/scsi.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/drivers/scsi/scsi.c Sun Sep 1 21:23:12 2002
@@ -2532,7 +2532,7 @@ static int __init init_scsi(void)
struct scsi_host_sg_pool *sgp = scsi_sg_pools + i;
int size = sgp->size * sizeof(struct scatterlist);
- sgp->slab = kmem_cache_create(sgp->name, size, 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ sgp->slab = kmem_cache_create(sgp->name, size, 0, SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!sgp->slab)
panic("SCSI: can't init sg slab\n");
--- 2.5.33/drivers/usb/host/uhci-hcd.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/drivers/usb/host/uhci-hcd.c Sun Sep 1 21:23:12 2002
@@ -2512,7 +2512,7 @@ static int __init uhci_hcd_init(void)
#endif
uhci_up_cachep = kmem_cache_create("uhci_urb_priv",
- sizeof(struct urb_priv), 0, 0, NULL, NULL);
+ sizeof(struct urb_priv), 0, 0, NULL, NULL, NULL);
if (!uhci_up_cachep)
goto up_failed;
--- 2.5.33/fs/adfs/super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/adfs/super.c Sun Sep 1 21:23:12 2002
@@ -234,7 +234,7 @@ static int init_inodecache(void)
adfs_inode_cachep = kmem_cache_create("adfs_inode_cache",
sizeof(struct adfs_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (adfs_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/affs/super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/affs/super.c Sun Sep 1 21:23:12 2002
@@ -117,7 +117,7 @@ static int init_inodecache(void)
affs_inode_cachep = kmem_cache_create("affs_inode_cache",
sizeof(struct affs_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (affs_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/aio.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/aio.c Sun Sep 1 21:23:12 2002
@@ -65,12 +65,12 @@ LIST_HEAD(fput_head);
static int __init aio_setup(void)
{
kiocb_cachep = kmem_cache_create("kiocb", sizeof(struct kiocb),
- 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ 0, SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!kiocb_cachep)
panic("unable to create kiocb cache\n");
kioctx_cachep = kmem_cache_create("kioctx", sizeof(struct kioctx),
- 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ 0, SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!kioctx_cachep)
panic("unable to create kioctx cache");
--- 2.5.33/fs/bfs/inode.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/bfs/inode.c Sun Sep 1 21:23:12 2002
@@ -240,6 +240,7 @@ static int init_inodecache(void)
bfs_inode_cachep = kmem_cache_create("bfs_inode_cache",
sizeof(struct bfs_inode_info),
0, SLAB_HWCACHE_ALIGN,
+ age_icache_memory,
init_once, NULL);
if (bfs_inode_cachep == NULL)
return -ENOMEM;
--- 2.5.33/fs/bio.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/bio.c Sun Sep 1 21:23:12 2002
@@ -470,7 +470,7 @@ static void __init biovec_init_pool(void
bp->size, size);
bp->slab = kmem_cache_create(bp->name, size, 0,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!bp->slab)
panic("biovec: can't init slab cache\n");
bp->pool = mempool_create(BIO_POOL_SIZE, slab_pool_alloc,
@@ -484,7 +484,7 @@ static void __init biovec_init_pool(void
static int __init init_bio(void)
{
bio_slab = kmem_cache_create("bio", sizeof(struct bio), 0,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!bio_slab)
panic("bio: can't create slab cache\n");
bio_pool = mempool_create(BIO_POOL_SIZE, slab_pool_alloc, slab_pool_free, bio_slab);
--- 2.5.33/fs/block_dev.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/block_dev.c Sun Sep 1 21:23:12 2002
@@ -249,8 +249,8 @@ void __init bdev_cache_init(void)
bdev_cachep = kmem_cache_create("bdev_cache",
sizeof(struct block_device),
- 0, SLAB_HWCACHE_ALIGN, init_once,
- NULL);
+ 0, SLAB_HWCACHE_ALIGN,
+ NULL, init_once, NULL);
if (!bdev_cachep)
panic("Cannot create bdev_cache SLAB cache");
err = register_filesystem(&bd_type);
--- 2.5.33/fs/buffer.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/buffer.c Sun Sep 1 21:23:12 2002
@@ -2645,7 +2645,7 @@ void __init buffer_init(void)
bh_cachep = kmem_cache_create("buffer_head",
sizeof(struct buffer_head), 0,
- 0, init_buffer_head, NULL);
+ 0, NULL, init_buffer_head, NULL);
bh_mempool = mempool_create(MAX_UNUSED_BUFFERS, bh_mempool_alloc,
bh_mempool_free, NULL);
for (i = 0; i < ARRAY_SIZE(bh_wait_queue_heads); i++)
--- 2.5.33/fs/char_dev.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/char_dev.c Sun Sep 1 21:23:12 2002
@@ -46,8 +46,8 @@ void __init cdev_cache_init(void)
cdev_cachep = kmem_cache_create("cdev_cache",
sizeof(struct char_device),
- 0, SLAB_HWCACHE_ALIGN, init_once,
- NULL);
+ 0, SLAB_HWCACHE_ALIGN, NULL,
+ init_once, NULL);
if (!cdev_cachep)
panic("Cannot create cdev_cache SLAB cache");
}
--- 2.5.33/fs/coda/inode.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/coda/inode.c Sun Sep 1 21:23:12 2002
@@ -73,9 +73,10 @@ int coda_init_inodecache(void)
coda_inode_cachep = kmem_cache_create("coda_inode_cache",
sizeof(struct coda_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (coda_inode_cachep == NULL)
return -ENOMEM;
+
return 0;
}
--- 2.5.33/fs/dcache.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/dcache.c Sun Sep 1 21:23:12 2002
@@ -123,8 +123,7 @@ repeat:
return;
/* dput on a free dentry? */
- if (!list_empty(&dentry->d_lru))
- BUG();
+ BUG_ON(!list_empty(&dentry->d_lru));
/*
* AV: ->d_delete() is _NOT_ allowed to block now.
*/
@@ -329,12 +328,11 @@ static inline void prune_one_dentry(stru
void prune_dcache(int count)
{
spin_lock(&dcache_lock);
- for (;;) {
+ for (; count; count--) {
struct dentry *dentry;
struct list_head *tmp;
tmp = dentry_unused.prev;
-
if (tmp == &dentry_unused)
break;
list_del_init(tmp);
@@ -349,12 +347,8 @@ void prune_dcache(int count)
dentry_stat.nr_unused--;
/* Unused dentry with a count? */
- if (atomic_read(&dentry->d_count))
- BUG();
-
+ BUG_ON(atomic_read(&dentry->d_count));
prune_one_dentry(dentry);
- if (!--count)
- break;
}
spin_unlock(&dcache_lock);
}
@@ -573,19 +567,10 @@ void shrink_dcache_anon(struct list_head
/*
* This is called from kswapd when we think we need some
- * more memory, but aren't really sure how much. So we
- * carefully try to free a _bit_ of our dcache, but not
- * too much.
- *
- * Priority:
- * 1 - very urgent: shrink everything
- * ...
- * 6 - base-level: try to shrink a bit.
+ * more memory.
*/
-int shrink_dcache_memory(int priority, unsigned int gfp_mask)
+int age_dcache_memory(kmem_cache_t *cachep, int entries, int gfp_mask)
{
- int count = 0;
-
/*
* Nasty deadlock avoidance.
*
@@ -600,11 +585,11 @@ int shrink_dcache_memory(int priority, u
if (!(gfp_mask & __GFP_FS))
return 0;
- count = dentry_stat.nr_unused / priority;
+ if (entries > dentry_stat.nr_unused)
+ entries = dentry_stat.nr_unused;
- prune_dcache(count);
- kmem_cache_shrink(dentry_cache);
- return 0;
+ prune_dcache(entries);
+ return entries;
}
#define NAME_ALLOC_LEN(len) ((len+16) & ~15)
@@ -686,7 +671,7 @@ struct dentry * d_alloc(struct dentry *
void d_instantiate(struct dentry *entry, struct inode * inode)
{
- if (!list_empty(&entry->d_alias)) BUG();
+ BUG_ON(!list_empty(&entry->d_alias));
spin_lock(&dcache_lock);
if (inode)
list_add(&entry->d_alias, &inode->i_dentry);
@@ -985,7 +970,7 @@ void d_delete(struct dentry * dentry)
void d_rehash(struct dentry * entry)
{
struct list_head *list = d_hash(entry->d_parent, entry->d_name.hash);
- if (!list_empty(&entry->d_hash)) BUG();
+ BUG_ON(!list_empty(&entry->d_hash));
spin_lock(&dcache_lock);
list_add(&entry->d_hash, list);
spin_unlock(&dcache_lock);
@@ -1341,7 +1326,7 @@ static void __init dcache_init(unsigned
sizeof(struct dentry),
0,
SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ age_dcache_memory, NULL, NULL);
if (!dentry_cache)
panic("Cannot create dentry cache");
@@ -1401,22 +1386,23 @@ void __init vfs_caches_init(unsigned lon
{
names_cachep = kmem_cache_create("names_cache",
PATH_MAX, 0,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!names_cachep)
panic("Cannot create names SLAB cache");
filp_cachep = kmem_cache_create("filp",
sizeof(struct file), 0,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if(!filp_cachep)
panic("Cannot create filp SLAB cache");
#if defined (CONFIG_QUOTA)
dquot_cachep = kmem_cache_create("dquot",
sizeof(struct dquot), sizeof(unsigned long) * 4,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, age_dqcache_memory, NULL, NULL);
if (!dquot_cachep)
panic("Cannot create dquot SLAB cache");
+
#endif
dcache_init(mempages);
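
(The dcache conversion above is the template for the icache and quota
cache below: the old shrink_*_memory(priority, gfp_mask) guessed a count
from the priority and always returned 0, while age_*_memory(cachep,
entries, gfp_mask) is told how many objects were touched since the last
prune, clamps that to the unused count and reports back how much work it
did. The explicit kmem_cache_shrink() calls go away because empty slabs
are now reclaimed off the LRU by kmem_shrink_slab(). Roughly, with stat
and prune_cache standing in for the per-cache variants:

	/* old interface */
	count = stat.nr_unused / priority;
	prune_cache(count);
	kmem_cache_shrink(cachep);
	return 0;

	/* new interface */
	if (entries > stat.nr_unused)
		entries = stat.nr_unused;
	prune_cache(entries);
	return entries;
)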
--- 2.5.33/fs/devfs/base.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/devfs/base.c Sun Sep 1 21:23:12 2002
@@ -3455,7 +3455,7 @@ static int __init init_devfs_fs (void)
DEVFS_NAME, DEVFS_VERSION);
devfsd_buf_cache = kmem_cache_create ("devfsd_event",
sizeof (struct devfsd_buf_entry),
- 0, 0, NULL, NULL);
+ 0, 0, NULL, NULL, NULL);
if (!devfsd_buf_cache) OOPS ("(): unable to allocate event slab\n");
#ifdef CONFIG_DEVFS_DEBUG
devfs_debug = devfs_debug_init;
--- 2.5.33/fs/dnotify.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/dnotify.c Sun Sep 1 21:23:12 2002
@@ -149,7 +149,7 @@ void __inode_dir_notify(struct inode *in
static int __init dnotify_init(void)
{
dn_cache = kmem_cache_create("dnotify cache",
- sizeof(struct dnotify_struct), 0, 0, NULL, NULL);
+ sizeof(struct dnotify_struct), 0, 0, NULL, NULL, NULL);
if (!dn_cache)
panic("cannot create dnotify slab cache");
return 0;
--- 2.5.33/fs/dquot.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/dquot.c Sun Sep 1 21:23:12 2002
@@ -480,26 +480,18 @@ static void prune_dqcache(int count)
/*
* This is called from kswapd when we think we need some
- * more memory, but aren't really sure how much. So we
- * carefully try to free a _bit_ of our dqcache, but not
- * too much.
- *
- * Priority:
- * 1 - very urgent: shrink everything
- * ...
- * 6 - base-level: try to shrink a bit.
+ * more memory.
*/
-int shrink_dqcache_memory(int priority, unsigned int gfp_mask)
+int age_dqcache_memory(kmem_cache_t *cachep, int entries, int gfp_mask)
{
- int count = 0;
+ if (entries > dqstats.free_dquots)
+ entries = dqstats.free_dquots;
lock_kernel();
- count = dqstats.free_dquots / priority;
- prune_dqcache(count);
+ prune_dqcache(entries);
unlock_kernel();
- kmem_cache_shrink(dquot_cachep);
- return 0;
+ return entries;
}
/*
--- 2.5.33/fs/efs/super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/efs/super.c Sun Sep 1 21:23:12 2002
@@ -58,6 +58,7 @@ static int init_inodecache(void)
efs_inode_cachep = kmem_cache_create("efs_inode_cache",
sizeof(struct efs_inode_info),
0, SLAB_HWCACHE_ALIGN,
+ age_icache_memory,
init_once, NULL);
if (efs_inode_cachep == NULL)
return -ENOMEM;
--- 2.5.33/fs/ext2/super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/ext2/super.c Sun Sep 1 21:23:12 2002
@@ -181,7 +181,7 @@ static int init_inodecache(void)
ext2_inode_cachep = kmem_cache_create("ext2_inode_cache",
sizeof(struct ext2_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (ext2_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/ext3/super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/ext3/super.c Sun Sep 1 21:30:41 2002
@@ -480,7 +480,7 @@ static int init_inodecache(void)
ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
sizeof(struct ext3_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (ext3_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/fat/inode.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/fat/inode.c Sun Sep 1 21:23:12 2002
@@ -597,6 +597,7 @@ int __init fat_init_inodecache(void)
fat_inode_cachep = kmem_cache_create("fat_inode_cache",
sizeof(struct msdos_inode_info),
0, SLAB_HWCACHE_ALIGN,
+ age_icache_memory,
init_once, NULL);
if (fat_inode_cachep == NULL)
return -ENOMEM;
--- 2.5.33/fs/fcntl.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/fcntl.c Sun Sep 1 21:23:12 2002
@@ -559,7 +559,7 @@ void kill_fasync(struct fasync_struct **
static int __init fasync_init(void)
{
fasync_cache = kmem_cache_create("fasync_cache",
- sizeof(struct fasync_struct), 0, 0, NULL, NULL);
+ sizeof(struct fasync_struct), 0, 0, NULL, NULL, NULL);
if (!fasync_cache)
panic("cannot create fasync slab cache");
return 0;
--- 2.5.33/fs/freevxfs/vxfs_super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/freevxfs/vxfs_super.c Sun Sep 1 21:23:12 2002
@@ -246,9 +246,10 @@ static int __init
vxfs_init(void)
{
vxfs_inode_cachep = kmem_cache_create("vxfs_inode",
- sizeof(struct vxfs_inode_info), 0, 0, NULL, NULL);
- if (vxfs_inode_cachep)
+ sizeof(struct vxfs_inode_info), 0, 0, age_icache_memory, NULL, NULL);
+ if (vxfs_inode_cachep) {
return (register_filesystem(&vxfs_fs_type));
+ }
return -ENOMEM;
}
--- 2.5.33/fs/hfs/super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/hfs/super.c Sun Sep 1 21:23:12 2002
@@ -72,6 +72,7 @@ static int init_inodecache(void)
hfs_inode_cachep = kmem_cache_create("hfs_inode_cache",
sizeof(struct hfs_inode_info),
0, SLAB_HWCACHE_ALIGN,
+ age_icache_memory,
init_once, NULL);
if (hfs_inode_cachep == NULL)
return -ENOMEM;
--- 2.5.33/fs/hpfs/super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/hpfs/super.c Sun Sep 1 21:23:12 2002
@@ -186,7 +186,7 @@ static int init_inodecache(void)
hpfs_inode_cachep = kmem_cache_create("hpfs_inode_cache",
sizeof(struct hpfs_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (hpfs_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/inode.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/inode.c Sun Sep 1 21:30:44 2002
@@ -388,10 +388,11 @@ void prune_icache(int goal)
count = 0;
entry = inode_unused.prev;
- while (entry != &inode_unused)
- {
+ for (; goal; goal--) {
struct list_head *tmp = entry;
+ if (entry == &inode_unused)
+ break;
entry = entry->prev;
inode = INODE(tmp);
if (inode->i_state & (I_FREEING|I_CLEAR|I_LOCK))
@@ -405,8 +406,6 @@ void prune_icache(int goal)
list_add(tmp, freeable);
inode->i_state |= I_FREEING;
count++;
- if (!--goal)
- break;
}
inodes_stat.nr_unused -= count;
spin_unlock(&inode_lock);
@@ -416,19 +415,10 @@ void prune_icache(int goal)
/*
* This is called from kswapd when we think we need some
- * more memory, but aren't really sure how much. So we
- * carefully try to free a _bit_ of our icache, but not
- * too much.
- *
- * Priority:
- * 1 - very urgent: shrink everything
- * ...
- * 6 - base-level: try to shrink a bit.
+ * more memory.
*/
-int shrink_icache_memory(int priority, int gfp_mask)
+int age_icache_memory(kmem_cache_t *cachep, int entries, int gfp_mask)
{
- int count = 0;
-
/*
* Nasty deadlock avoidance..
*
@@ -439,12 +429,13 @@ int shrink_icache_memory(int priority, i
if (!(gfp_mask & __GFP_FS))
return 0;
- count = inodes_stat.nr_unused / priority;
+ if (entries > inodes_stat.nr_unused)
+ entries = inodes_stat.nr_unused;
- prune_icache(count);
- kmem_cache_shrink(inode_cachep);
- return 0;
+ prune_icache(entries);
+ return entries;
}
+EXPORT_SYMBOL(age_icache_memory);
/*
* Called with the inode lock held.
@@ -1103,8 +1094,8 @@ void __init inode_init(unsigned long mem
/* inode slab cache */
inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode),
- 0, SLAB_HWCACHE_ALIGN, init_once,
- NULL);
+ 0, SLAB_HWCACHE_ALIGN, age_icache_memory,
+ init_once, NULL);
if (!inode_cachep)
panic("cannot create inode slab cache");
}
--- 2.5.33/fs/intermezzo/dcache.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/intermezzo/dcache.c Sun Sep 1 21:23:12 2002
@@ -127,7 +127,7 @@ void presto_init_ddata_cache(void)
kmem_cache_create("presto_cache",
sizeof(struct presto_dentry_data), 0,
SLAB_HWCACHE_ALIGN, NULL,
- NULL);
+ NULL, NULL);
EXIT;
}
--- 2.5.33/fs/isofs/inode.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/isofs/inode.c Sun Sep 1 21:23:12 2002
@@ -111,7 +111,7 @@ static int init_inodecache(void)
isofs_inode_cachep = kmem_cache_create("isofs_inode_cache",
sizeof(struct iso_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (isofs_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/jbd/journal.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/jbd/journal.c Sun Sep 1 21:23:12 2002
@@ -1569,6 +1569,7 @@ static int journal_init_journal_head_cac
sizeof(struct journal_head),
0, /* offset */
0, /* flags */
+ NULL, /* pruner */
NULL, /* ctor */
NULL); /* dtor */
retval = 0;
--- 2.5.33/fs/jbd/revoke.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/jbd/revoke.c Sun Sep 1 21:23:12 2002
@@ -163,13 +163,13 @@ int __init journal_init_revoke_caches(vo
{
revoke_record_cache = kmem_cache_create("revoke_record",
sizeof(struct jbd_revoke_record_s),
- 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ 0, SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (revoke_record_cache == 0)
return -ENOMEM;
revoke_table_cache = kmem_cache_create("revoke_table",
sizeof(struct jbd_revoke_table_s),
- 0, 0, NULL, NULL);
+ 0, 0, NULL, NULL, NULL);
if (revoke_table_cache == 0) {
kmem_cache_destroy(revoke_record_cache);
revoke_record_cache = NULL;
--- 2.5.33/fs/jffs2/malloc.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/jffs2/malloc.c Sun Sep 1 21:23:12 2002
@@ -37,43 +37,43 @@ int __init jffs2_create_slab_caches(void
{
full_dnode_slab = kmem_cache_create("jffs2_full_dnode",
sizeof(struct jffs2_full_dnode),
- 0, JFFS2_SLAB_POISON, NULL, NULL);
+ 0, JFFS2_SLAB_POISON, NULL, NULL, NULL);
if (!full_dnode_slab)
goto err;
raw_dirent_slab = kmem_cache_create("jffs2_raw_dirent",
sizeof(struct jffs2_raw_dirent),
- 0, JFFS2_SLAB_POISON, NULL, NULL);
+ 0, JFFS2_SLAB_POISON, NULL, NULL, NULL);
if (!raw_dirent_slab)
goto err;
raw_inode_slab = kmem_cache_create("jffs2_raw_inode",
sizeof(struct jffs2_raw_inode),
- 0, JFFS2_SLAB_POISON, NULL, NULL);
+ 0, JFFS2_SLAB_POISON, NULL, NULL, NULL);
if (!raw_inode_slab)
goto err;
tmp_dnode_info_slab = kmem_cache_create("jffs2_tmp_dnode",
sizeof(struct jffs2_tmp_dnode_info),
- 0, JFFS2_SLAB_POISON, NULL, NULL);
+ 0, JFFS2_SLAB_POISON, NULL, NULL, NULL);
if (!tmp_dnode_info_slab)
goto err;
raw_node_ref_slab = kmem_cache_create("jffs2_raw_node_ref",
sizeof(struct jffs2_raw_node_ref),
- 0, JFFS2_SLAB_POISON, NULL, NULL);
+ 0, JFFS2_SLAB_POISON, NULL, NULL, NULL);
if (!raw_node_ref_slab)
goto err;
node_frag_slab = kmem_cache_create("jffs2_node_frag",
sizeof(struct jffs2_node_frag),
- 0, JFFS2_SLAB_POISON, NULL, NULL);
+ 0, JFFS2_SLAB_POISON, NULL, NULL, NULL);
if (!node_frag_slab)
goto err;
inode_cache_slab = kmem_cache_create("jffs2_inode_cache",
sizeof(struct jffs2_inode_cache),
- 0, JFFS2_SLAB_POISON, NULL, NULL);
+ 0, JFFS2_SLAB_POISON, NULL, NULL, NULL);
if (inode_cache_slab)
return 0;
err:
--- 2.5.33/fs/jffs2/super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/jffs2/super.c Sun Sep 1 21:23:12 2002
@@ -299,7 +299,7 @@ static int __init init_jffs2_fs(void)
jffs2_inode_cachep = kmem_cache_create("jffs2_i",
sizeof(struct jffs2_inode_info),
0, SLAB_HWCACHE_ALIGN,
- jffs2_i_init_once, NULL);
+ age_icache_memory, jffs2_i_init_once, NULL);
if (!jffs2_inode_cachep) {
printk(KERN_ERR "JFFS2 error: Failed to initialise inode cache\n");
return -ENOMEM;
--- 2.5.33/fs/jffs/inode-v23.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/jffs/inode-v23.c Sun Sep 1 21:23:12 2002
@@ -1799,9 +1799,9 @@ init_jffs_fs(void)
jffs_proc_root = proc_mkdir("jffs", proc_root_fs);
#endif
fm_cache = kmem_cache_create("jffs_fm", sizeof(struct jffs_fm),
- 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ 0, SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
node_cache = kmem_cache_create("jffs_node",sizeof(struct jffs_node),
- 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ 0, SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
return register_filesystem(&jffs_fs_type);
}
--- 2.5.33/fs/jfs/super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/jfs/super.c Sun Sep 1 21:23:12 2002
@@ -457,7 +457,7 @@ static int __init init_jfs_fs(void)
jfs_inode_cachep =
kmem_cache_create("jfs_ip", sizeof(struct jfs_inode_info), 0, 0,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (jfs_inode_cachep == NULL)
return -ENOMEM;
--- 2.5.33/fs/locks.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/locks.c Sun Sep 1 21:23:12 2002
@@ -1892,7 +1892,7 @@ int lock_may_write(struct inode *inode,
static int __init filelock_init(void)
{
filelock_cache = kmem_cache_create("file_lock_cache",
- sizeof(struct file_lock), 0, 0, init_once, NULL);
+ sizeof(struct file_lock), 0, 0, NULL, init_once, NULL);
if (!filelock_cache)
panic("cannot create file lock slab cache");
return 0;
--- 2.5.33/fs/minix/inode.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/minix/inode.c Sun Sep 1 21:23:12 2002
@@ -79,7 +79,7 @@ static int init_inodecache(void)
minix_inode_cachep = kmem_cache_create("minix_inode_cache",
sizeof(struct minix_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (minix_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/namespace.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/namespace.c Sun Sep 1 21:23:12 2002
@@ -1046,7 +1046,7 @@ void __init mnt_init(unsigned long mempa
int i;
mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct vfsmount),
- 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ 0, SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!mnt_cache)
panic("Cannot create vfsmount cache");
--- 2.5.33/fs/ncpfs/inode.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/ncpfs/inode.c Sun Sep 1 21:23:12 2002
@@ -69,7 +69,7 @@ static int init_inodecache(void)
ncp_inode_cachep = kmem_cache_create("ncp_inode_cache",
sizeof(struct ncp_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (ncp_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/nfs/inode.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/nfs/inode.c Sun Sep 1 21:23:12 2002
@@ -1317,7 +1317,7 @@ int nfs_init_inodecache(void)
nfs_inode_cachep = kmem_cache_create("nfs_inode_cache",
sizeof(struct nfs_inode),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (nfs_inode_cachep == NULL)
return -ENOMEM;
--- 2.5.33/fs/nfs/pagelist.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/nfs/pagelist.c Sun Sep 1 21:23:12 2002
@@ -492,7 +492,7 @@ int nfs_init_nfspagecache(void)
nfs_page_cachep = kmem_cache_create("nfs_page",
sizeof(struct nfs_page),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
if (nfs_page_cachep == NULL)
return -ENOMEM;
--- 2.5.33/fs/nfs/read.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/nfs/read.c Sun Sep 1 21:23:12 2002
@@ -497,7 +497,7 @@ int nfs_init_readpagecache(void)
nfs_rdata_cachep = kmem_cache_create("nfs_read_data",
sizeof(struct nfs_read_data),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
if (nfs_rdata_cachep == NULL)
return -ENOMEM;
--- 2.5.33/fs/nfs/write.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/nfs/write.c Sun Sep 1 21:23:12 2002
@@ -1317,7 +1317,7 @@ int nfs_init_writepagecache(void)
nfs_wdata_cachep = kmem_cache_create("nfs_write_data",
sizeof(struct nfs_write_data),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
if (nfs_wdata_cachep == NULL)
return -ENOMEM;
--- 2.5.33/fs/ntfs/super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/ntfs/super.c Sun Sep 1 21:23:12 2002
@@ -1667,7 +1667,7 @@ static int __init init_ntfs_fs(void)
ntfs_attr_ctx_cache = kmem_cache_create(ntfs_attr_ctx_cache_name,
sizeof(attr_search_context), 0 /* offset */,
- SLAB_HWCACHE_ALIGN, NULL /* ctor */, NULL /* dtor */);
+ SLAB_HWCACHE_ALIGN, NULL, NULL /* ctor */, NULL /* dtor */);
if (!ntfs_attr_ctx_cache) {
printk(KERN_CRIT "NTFS: Failed to create %s!\n",
ntfs_attr_ctx_cache_name);
@@ -1676,7 +1676,7 @@ static int __init init_ntfs_fs(void)
ntfs_name_cache = kmem_cache_create(ntfs_name_cache_name,
(NTFS_MAX_NAME_LEN+1) * sizeof(uchar_t), 0,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!ntfs_name_cache) {
printk(KERN_CRIT "NTFS: Failed to create %s!\n",
ntfs_name_cache_name);
@@ -1684,7 +1684,8 @@ static int __init init_ntfs_fs(void)
}
ntfs_inode_cache = kmem_cache_create(ntfs_inode_cache_name,
- sizeof(ntfs_inode), 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ sizeof(ntfs_inode), 0, SLAB_HWCACHE_ALIGN,
+ age_icache_memory, NULL, NULL);
if (!ntfs_inode_cache) {
printk(KERN_CRIT "NTFS: Failed to create %s!\n",
ntfs_inode_cache_name);
@@ -1693,7 +1694,7 @@ static int __init init_ntfs_fs(void)
ntfs_big_inode_cache = kmem_cache_create(ntfs_big_inode_cache_name,
sizeof(big_ntfs_inode), 0, SLAB_HWCACHE_ALIGN,
- ntfs_big_inode_init_once, NULL);
+ age_icache_memory, ntfs_big_inode_init_once, NULL);
if (!ntfs_big_inode_cache) {
printk(KERN_CRIT "NTFS: Failed to create %s!\n",
ntfs_big_inode_cache_name);
--- 2.5.33/fs/proc/inode.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/proc/inode.c Sun Sep 1 21:23:12 2002
@@ -123,7 +123,7 @@ int __init proc_init_inodecache(void)
proc_inode_cachep = kmem_cache_create("proc_inode_cache",
sizeof(struct proc_inode),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (proc_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/qnx4/inode.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/qnx4/inode.c Sun Sep 1 21:23:12 2002
@@ -544,7 +544,7 @@ static int init_inodecache(void)
qnx4_inode_cachep = kmem_cache_create("qnx4_inode_cache",
sizeof(struct qnx4_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (qnx4_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/reiserfs/super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/reiserfs/super.c Sun Sep 1 21:23:12 2002
@@ -434,7 +434,7 @@ static int init_inodecache(void)
reiserfs_inode_cachep = kmem_cache_create("reiser_inode_cache",
sizeof(struct reiserfs_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (reiserfs_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/romfs/inode.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/romfs/inode.c Sun Sep 1 21:23:12 2002
@@ -577,7 +577,7 @@ static int init_inodecache(void)
romfs_inode_cachep = kmem_cache_create("romfs_inode_cache",
sizeof(struct romfs_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (romfs_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/smbfs/inode.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/smbfs/inode.c Sun Sep 1 21:23:12 2002
@@ -78,7 +78,7 @@ static int init_inodecache(void)
smb_inode_cachep = kmem_cache_create("smb_inode_cache",
sizeof(struct smb_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (smb_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/smbfs/request.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/smbfs/request.c Sun Sep 1 21:23:12 2002
@@ -37,7 +37,7 @@ int smb_init_request_cache(void)
req_cachep = kmem_cache_create("smb_request",
sizeof(struct smb_request), 0,
SMB_SLAB_DEBUG | SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
if (req_cachep == NULL)
return -ENOMEM;
--- 2.5.33/fs/sysv/inode.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/sysv/inode.c Sun Sep 1 21:23:12 2002
@@ -325,7 +325,7 @@ int __init sysv_init_icache(void)
{
sysv_inode_cachep = kmem_cache_create("sysv_inode_cache",
sizeof(struct sysv_inode_info), 0,
- SLAB_HWCACHE_ALIGN, init_once, NULL);
+ SLAB_HWCACHE_ALIGN, age_icache_memory, init_once, NULL);
if (!sysv_inode_cachep)
return -ENOMEM;
return 0;
--- 2.5.33/fs/udf/super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/udf/super.c Sun Sep 1 21:23:12 2002
@@ -141,7 +141,7 @@ static int init_inodecache(void)
udf_inode_cachep = kmem_cache_create("udf_inode_cache",
sizeof(struct udf_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (udf_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/fs/ufs/super.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/ufs/super.c Sun Sep 1 21:23:12 2002
@@ -1028,7 +1028,7 @@ static int init_inodecache(void)
ufs_inode_cachep = kmem_cache_create("ufs_inode_cache",
sizeof(struct ufs_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (ufs_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/include/linux/dcache.h~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/include/linux/dcache.h Sun Sep 1 21:23:12 2002
@@ -182,16 +182,11 @@ extern void shrink_dcache_anon(struct li
extern int d_invalidate(struct dentry *);
/* dcache memory management */
-extern int shrink_dcache_memory(int, unsigned int);
extern void prune_dcache(int);
/* icache memory management (defined in linux/fs/inode.c) */
-extern int shrink_icache_memory(int, int);
extern void prune_icache(int);
-/* quota cache memory management (defined in linux/fs/dquot.c) */
-extern int shrink_dqcache_memory(int, unsigned int);
-
/* only used at mount-time */
extern struct dentry * d_alloc_root(struct inode *);
--- 2.5.33/include/linux/page-flags.h~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/include/linux/page-flags.h Sun Sep 1 21:23:35 2002
@@ -78,6 +78,7 @@ extern struct page_state {
unsigned long nr_pagecache;
unsigned long nr_page_table_pages;
unsigned long nr_reverse_maps;
+ unsigned long nr_slab;
} ____cacheline_aligned_in_smp page_states[NR_CPUS];
extern void get_page_state(struct page_state *ret);
--- 2.5.33/include/linux/slab.h~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/include/linux/slab.h Sun Sep 1 22:12:50 2002
@@ -49,12 +49,32 @@ typedef struct kmem_cache_s kmem_cache_t
extern void kmem_cache_init(void);
extern void kmem_cache_sizes_init(void);
+typedef int (*kmem_pruner_t)(kmem_cache_t *, int, int);
+
extern kmem_cache_t *kmem_find_general_cachep(size_t, int gfpflags);
-extern kmem_cache_t *kmem_cache_create(const char *, size_t, size_t, unsigned long,
- void (*)(void *, kmem_cache_t *, unsigned long),
- void (*)(void *, kmem_cache_t *, unsigned long));
+extern kmem_cache_t *kmem_cache_create(const char *, size_t, size_t,
+ unsigned long, kmem_pruner_t,
+ void (*)(void *, kmem_cache_t *, unsigned long),
+ void (*)(void *, kmem_cache_t *, unsigned long));
extern int kmem_cache_destroy(kmem_cache_t *);
extern int kmem_cache_shrink(kmem_cache_t *);
+
+extern int kmem_do_prunes(int);
+extern int kmem_count_page(struct page *, int);
+#define kmem_touch_page(addr) SetPageReferenced(virt_to_page(addr))
+
+/* shrink a slab */
+extern int kmem_shrink_slab(struct page *);
+
+/* dcache prune ( defined in linux/fs/dcache.c) */
+extern int age_dcache_memory(kmem_cache_t *, int, int);
+
+/* icache prune (defined in linux/fs/inode.c) */
+extern int age_icache_memory(kmem_cache_t *, int, int);
+
+/* quota cache prune (defined in linux/fs/dquot.c) */
+extern int age_dqcache_memory(kmem_cache_t *, int, int);
+
extern void *kmem_cache_alloc(kmem_cache_t *, int);
extern void kmem_cache_free(kmem_cache_t *, void *);
extern unsigned int kmem_cache_size(kmem_cache_t *);
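
For reference, registering a pruner under the new API looks roughly like
this (foo, foo_stat and prune_foo_cache are made-up names for
illustration, not part of the patch):

	static int age_foo_memory(kmem_cache_t *cachep, int entries,
				  int gfp_mask)
	{
		/* age at most the number of currently unused entries */
		if (entries > foo_stat.nr_unused)
			entries = foo_stat.nr_unused;
		prune_foo_cache(entries);
		return entries;		/* report work actually done */
	}

	foo_cachep = kmem_cache_create("foo_cache", sizeof(struct foo),
					0, SLAB_HWCACHE_ALIGN,
					age_foo_memory,	/* pruner */
					init_once,	/* ctor */
					NULL);		/* dtor */

Caches with nothing to age just pass NULL for the pruner, which is what
the bulk of the conversions in this patch do.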
--- 2.5.33/kernel/fork.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/kernel/fork.c Sun Sep 1 21:23:12 2002
@@ -112,7 +112,7 @@ void __init fork_init(unsigned long memp
task_struct_cachep =
kmem_cache_create("task_struct",
sizeof(struct task_struct),0,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!task_struct_cachep)
panic("fork_init(): cannot create task_struct SLAB cache");
@@ -940,31 +940,31 @@ void __init proc_caches_init(void)
{
sigact_cachep = kmem_cache_create("signal_act",
sizeof(struct signal_struct), 0,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!sigact_cachep)
panic("Cannot create signal action SLAB cache");
files_cachep = kmem_cache_create("files_cache",
sizeof(struct files_struct), 0,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!files_cachep)
panic("Cannot create files SLAB cache");
fs_cachep = kmem_cache_create("fs_cache",
sizeof(struct fs_struct), 0,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!fs_cachep)
panic("Cannot create fs_struct SLAB cache");
vm_area_cachep = kmem_cache_create("vm_area_struct",
sizeof(struct vm_area_struct), 0,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if(!vm_area_cachep)
panic("vma_init: Cannot alloc vm_area_struct SLAB cache");
mm_cachep = kmem_cache_create("mm_struct",
sizeof(struct mm_struct), 0,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if(!mm_cachep)
panic("vma_init: Cannot alloc mm_struct SLAB cache");
}
--- 2.5.33/kernel/signal.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/kernel/signal.c Sun Sep 1 21:23:12 2002
@@ -43,7 +43,7 @@ void __init signals_init(void)
kmem_cache_create("sigqueue",
sizeof(struct sigqueue),
__alignof__(struct sigqueue),
- SIG_SLAB_DEBUG, NULL, NULL);
+ SIG_SLAB_DEBUG, NULL, NULL, NULL);
if (!sigqueue_cachep)
panic("signals_init(): cannot create sigqueue SLAB cache");
}
--- 2.5.33/kernel/user.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/kernel/user.c Sun Sep 1 21:23:12 2002
@@ -118,7 +118,7 @@ static int __init uid_cache_init(void)
uid_cachep = kmem_cache_create("uid_cache", sizeof(struct user_struct),
0,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if(!uid_cachep)
panic("Cannot create uid taskcount SLAB cache\n");
--- 2.5.33/lib/radix-tree.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/lib/radix-tree.c Sun Sep 1 21:23:12 2002
@@ -293,7 +293,7 @@ void __init radix_tree_init(void)
{
radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
sizeof(struct radix_tree_node), 0,
- SLAB_HWCACHE_ALIGN, radix_tree_node_ctor, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, radix_tree_node_ctor, NULL);
if (!radix_tree_node_cachep)
panic ("Failed to create radix_tree_node cache\n");
radix_tree_node_pool = mempool_create(512, radix_tree_node_pool_alloc,
--- 2.5.33/mm/page_alloc.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/mm/page_alloc.c Sun Sep 1 21:30:44 2002
@@ -561,6 +561,7 @@ void get_page_state(struct page_state *r
ret->nr_pagecache += ps->nr_pagecache;
ret->nr_page_table_pages += ps->nr_page_table_pages;
ret->nr_reverse_maps += ps->nr_reverse_maps;
+ ret->nr_slab += ps->nr_slab;
}
}
--- 2.5.33/mm/rmap.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/mm/rmap.c Sun Sep 1 21:23:12 2002
@@ -521,6 +521,7 @@ void __init pte_chain_init(void)
sizeof(struct pte_chain),
0,
0,
+ NULL,
pte_chain_ctor,
NULL);
--- 2.5.33/mm/shmem.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/mm/shmem.c Sun Sep 1 21:23:12 2002
@@ -1514,7 +1514,7 @@ static int init_inodecache(void)
shmem_inode_cachep = kmem_cache_create("shmem_inode_cache",
sizeof(struct shmem_inode_info),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ age_icache_memory, init_once, NULL);
if (shmem_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/mm/slab.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/mm/slab.c Sun Sep 1 22:11:59 2002
@@ -77,6 +77,7 @@
#include <linux/init.h>
#include <linux/compiler.h>
#include <linux/seq_file.h>
+#include <linux/pagemap.h>
#include <asm/uaccess.h>
/*
@@ -215,6 +216,8 @@ struct kmem_cache_s {
kmem_cache_t *slabp_cache;
unsigned int growing;
unsigned int dflags; /* dynamic flags */
+ kmem_pruner_t pruner; /* shrink callback */
+ int count; /* count used to trigger shrink */
/* constructor func */
void (*ctor)(void *, kmem_cache_t *, unsigned long);
@@ -256,7 +259,7 @@ struct kmem_cache_s {
#define OFF_SLAB(x) ((x)->flags & CFLGS_OFF_SLAB)
#define OPTIMIZE(x) ((x)->flags & CFLGS_OPTIMIZE)
-#define GROWN(x) ((x)->dlags & DFLGS_GROWN)
+#define GROWN(x) ((x)->dflags & DFLGS_GROWN)
#if STATS
#define STATS_INC_ACTIVE(x) ((x)->num_active++)
@@ -412,6 +415,56 @@ static int g_cpucache_up;
static void enable_cpucache (kmem_cache_t *cachep);
static void enable_all_cpucaches (void);
#endif
+
+/*
+ * Used by shrink_cache to determine caches that need pruning.
+ */
+int kmem_count_page(struct page *page, int ref)
+{
+ kmem_cache_t *cachep = GET_PAGE_CACHE(page);
+ slab_t *slabp = GET_PAGE_SLAB(page);
+ int ret = 0;
+
+ spin_lock_irq(&cachep->spinlock);
+ if (cachep->pruner != NULL) {
+ cachep->count += slabp->inuse;
+ ret = !slabp->inuse;
+ } else {
+ ret = !ref && !slabp->inuse;
+ }
+ spin_unlock_irq(&cachep->spinlock);
+ return ret;
+}
+
+
+/* Call the prune functions to age pruneable caches */
+int kmem_do_prunes(int gfp_mask)
+{
+ struct list_head *p;
+ int nr;
+
+ if (gfp_mask & __GFP_WAIT)
+ down(&cache_chain_sem);
+ else
+ if (down_trylock(&cache_chain_sem))
+ return 0;
+
+ list_for_each(p,&cache_chain) {
+ kmem_cache_t *cachep = list_entry(p, kmem_cache_t, next);
+ if (cachep->pruner != NULL) {
+ spin_lock_irq(&cachep->spinlock);
+ nr = cachep->count;
+ cachep->count = 0;
+ spin_unlock_irq(&cachep->spinlock);
+ if (nr > 0)
+ (*cachep->pruner)(cachep, nr, gfp_mask);
+
+ }
+ }
+ up(&cache_chain_sem);
+ return 1;
+}
+
/* Cal the num objs, wastage, and bytes left over for a given slab size. */
static void kmem_cache_estimate (unsigned long gfporder, size_t size,
@@ -451,8 +504,7 @@ void __init kmem_cache_init(void)
kmem_cache_estimate(0, cache_cache.objsize, 0,
&left_over, &cache_cache.num);
- if (!cache_cache.num)
- BUG();
+ BUG_ON(!cache_cache.num);
cache_cache.colour = left_over/cache_cache.colour_off;
cache_cache.colour_next = 0;
@@ -477,12 +529,10 @@ void __init kmem_cache_sizes_init(void)
* eliminates "false sharing".
* Note for systems short on memory removing the alignment will
* allow tighter packing of the smaller caches. */
- if (!(sizes->cs_cachep =
+ BUG_ON(!(sizes->cs_cachep =
kmem_cache_create(cache_names[sizes-cache_sizes].name,
- sizes->cs_size,
- 0, SLAB_HWCACHE_ALIGN, NULL, NULL))) {
- BUG();
- }
+ sizes->cs_size,
+ 0, SLAB_HWCACHE_ALIGN, NULL, NULL, NULL)));
/* Inc off-slab bufctl limit until the ceiling is hit. */
if (!(OFF_SLAB(sizes->cs_cachep))) {
@@ -490,11 +540,10 @@ void __init kmem_cache_sizes_init(void)
offslab_limit /= 2;
}
sizes->cs_dmacachep = kmem_cache_create(
- cache_names[sizes-cache_sizes].name_dma,
+ cache_names[sizes-cache_sizes].name_dma,
sizes->cs_size, 0,
- SLAB_CACHE_DMA|SLAB_HWCACHE_ALIGN, NULL, NULL);
- if (!sizes->cs_dmacachep)
- BUG();
+ SLAB_CACHE_DMA|SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
+ BUG_ON(!sizes->cs_dmacachep);
sizes++;
} while (sizes->cs_size);
}
@@ -510,7 +559,8 @@ int __init kmem_cpucache_init(void)
__initcall(kmem_cpucache_init);
-/* Interface to system's page allocator. No need to hold the cache-lock.
+/*
+ * Interface to system's page allocator. No need to hold the cache-lock.
*/
static inline void * kmem_getpages (kmem_cache_t *cachep, unsigned long flags)
{
@@ -532,7 +582,6 @@ static inline void * kmem_getpages (kmem
return addr;
}
-/* Interface to system's page release. */
static inline void kmem_freepages (kmem_cache_t *cachep, void *addr)
{
unsigned long i = (1<<cachep->gfporder);
@@ -545,9 +594,16 @@ static inline void kmem_freepages (kmem_
*/
while (i--) {
ClearPageSlab(page);
+ dec_page_state(nr_slab);
page++;
}
- free_pages((unsigned long)addr, cachep->gfporder);
+
+ /* free slab pages, page count is decremented by page_cache_release */
+ page = virt_to_page(addr);
+ if (cachep->gfporder)
+ free_pages((unsigned long)addr, cachep->gfporder);
+ else
+ page_cache_release(page);
}
#if DEBUG
@@ -577,6 +633,7 @@ static inline int kmem_check_poison_obj
}
#endif
+
/* Destroy all the objs in a slab, and release the mem back to the system.
* Before calling the slab must have been unlinked from the cache.
* The cache-lock is not held/needed.
@@ -593,11 +650,9 @@ static void kmem_slab_destroy (kmem_cach
void* objp = slabp->s_mem+cachep->objsize*i;
#if DEBUG
if (cachep->flags & SLAB_RED_ZONE) {
- if (*((unsigned long*)(objp)) != RED_MAGIC1)
- BUG();
- if (*((unsigned long*)(objp + cachep->objsize
- -BYTES_PER_WORD)) != RED_MAGIC1)
- BUG();
+ BUG_ON(*((unsigned long*)(objp)) != RED_MAGIC1);
+ BUG_ON(*((unsigned long*)(objp + cachep->objsize
+ -BYTES_PER_WORD)) != RED_MAGIC1);
objp += BYTES_PER_WORD;
}
#endif
@@ -607,9 +662,8 @@ static void kmem_slab_destroy (kmem_cach
if (cachep->flags & SLAB_RED_ZONE) {
objp -= BYTES_PER_WORD;
}
- if ((cachep->flags & SLAB_POISON) &&
- kmem_check_poison_obj(cachep, objp))
- BUG();
+ BUG_ON((cachep->flags & SLAB_POISON) &&
+ kmem_check_poison_obj(cachep, objp));
#endif
}
}
@@ -625,6 +679,7 @@ static void kmem_slab_destroy (kmem_cach
* @size: The size of objects to be created in this cache.
* @offset: The offset to use within the page.
* @flags: SLAB flags
+ * @thepruner: a callback to prune entries for ageable caches
* @ctor: A constructor for the objects.
* @dtor: A destructor for the objects.
*
@@ -654,7 +709,8 @@ static void kmem_slab_destroy (kmem_cach
*/
kmem_cache_t *
kmem_cache_create (const char *name, size_t size, size_t offset,
- unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long),
+ unsigned long flags, kmem_pruner_t thepruner,
+ void (*ctor)(void*, kmem_cache_t *, unsigned long),
void (*dtor)(void*, kmem_cache_t *, unsigned long))
{
const char *func_nm = KERN_ERR "kmem_create: ";
@@ -664,13 +720,12 @@ kmem_cache_create (const char *name, siz
/*
* Sanity checks... these are all serious usage bugs.
*/
- if ((!name) ||
+ BUG_ON((!name) ||
in_interrupt() ||
(size < BYTES_PER_WORD) ||
(size > (1<<MAX_OBJ_ORDER)*PAGE_SIZE) ||
(dtor && !ctor) ||
- (offset < 0 || offset > size))
- BUG();
+ (offset < 0 || offset > size));
#if DEBUG
if ((flags & SLAB_DEBUG_INITIAL) && !ctor) {
@@ -700,8 +755,7 @@ kmem_cache_create (const char *name, siz
* Always checks flags, a caller might be expecting debug
* support which isn't available.
*/
- if (flags & ~CREATE_MASK)
- BUG();
+ BUG_ON(flags & ~CREATE_MASK);
/* Get cache's description obj. */
cachep = (kmem_cache_t *) kmem_cache_alloc(&cache_cache, SLAB_KERNEL);
@@ -816,6 +870,8 @@ next:
flags |= CFLGS_OPTIMIZE;
cachep->flags = flags;
+ cachep->pruner = thepruner;
+ cachep->count = 0;
cachep->gfpflags = 0;
if (flags & SLAB_CACHE_DMA)
cachep->gfpflags |= GFP_DMA;
@@ -958,15 +1014,14 @@ static void drain_cpu_caches(kmem_cache_
#define drain_cpu_caches(cachep) do { } while (0)
#endif
-static int __kmem_cache_shrink(kmem_cache_t *cachep)
+
+/*
+ * Worker function for freeing slab caches; returns the number of slabs freed.
+ */
+static int __kmem_cache_shrink_locked(kmem_cache_t *cachep)
{
slab_t *slabp;
- int ret;
-
- drain_cpu_caches(cachep);
-
- spin_lock_irq(&cachep->spinlock);
-
+ int ret = 0;
/* If the cache is growing, stop shrinking. */
while (!cachep->growing) {
struct list_head *p;
@@ -977,16 +1032,29 @@ static int __kmem_cache_shrink(kmem_cach
slabp = list_entry(cachep->slabs_free.prev, slab_t, list);
#if DEBUG
- if (slabp->inuse)
- BUG();
+ BUG_ON(slabp->inuse);
#endif
list_del(&slabp->list);
- spin_unlock_irq(&cachep->spinlock);
+ spin_unlock(&cachep->spinlock);
kmem_slab_destroy(cachep, slabp);
- spin_lock_irq(&cachep->spinlock);
+ ret++;
+ spin_lock(&cachep->spinlock);
}
- ret = !list_empty(&cachep->slabs_full) || !list_empty(&cachep->slabs_partial);
+ return ret;
+}
+
+
+static int __kmem_cache_shrink(kmem_cache_t *cachep)
+{
+ int ret;
+
+ drain_cpu_caches(cachep);
+
+ spin_lock_irq(&cachep->spinlock);
+ __kmem_cache_shrink_locked(cachep);
+ ret = !list_empty(&cachep->slabs_full) ||
+ !list_empty(&cachep->slabs_partial);
spin_unlock_irq(&cachep->spinlock);
return ret;
}
@@ -1000,12 +1068,47 @@ static int __kmem_cache_shrink(kmem_cach
*/
int kmem_cache_shrink(kmem_cache_t *cachep)
{
- if (!cachep || in_interrupt() || !is_chained_kmem_cache(cachep))
- BUG();
-
+ BUG_ON(!cachep || in_interrupt() || !is_chained_kmem_cache(cachep));
return __kmem_cache_shrink(cachep);
}
+
+/*
+ * Used by shrink_cache to try to shrink a cache. The actual
+ * free is deferred via a pagevec in shrink_list.
+ * - if the shrink works, we return the number of pages shrunk
+ * - if the shrink fails because the slab is in use, we return 0
+ * - the page count is decremented by __pagevec_release_nonlru,
+ *   called with the page lock bit set.
+ */
+int kmem_shrink_slab(struct page *page)
+{
+ kmem_cache_t *cachep = GET_PAGE_CACHE(page);
+ slab_t *slabp = GET_PAGE_SLAB(page);
+ unsigned int ret = 0;
+
+ spin_lock_irq(&cachep->spinlock);
+ if (!slabp->inuse) {
+ if (!cachep->growing) {
+ unsigned int i = (1<<cachep->gfporder);
+
+ list_del(&slabp->list);
+ ret = i;
+ while (i--) {
+ ClearPageSlab(page);
+ dec_page_state(nr_slab);
+ page++;
+ }
+ goto out;
+ }
+ BUG_ON(PageActive(page));
+ }
+out:
+ spin_unlock_irq(&cachep->spinlock);
+ return ret;
+}
+
+
/**
* kmem_cache_destroy - delete a cache
* @cachep: the cache to destroy
@@ -1023,8 +1126,7 @@ int kmem_cache_shrink(kmem_cache_t *cach
*/
int kmem_cache_destroy (kmem_cache_t * cachep)
{
- if (!cachep || in_interrupt() || cachep->growing)
- BUG();
+ BUG_ON(!cachep || in_interrupt() || cachep->growing);
/* Find the cache in the chain of caches. */
down(&cache_chain_sem);
@@ -1112,11 +1214,9 @@ static inline void kmem_cache_init_objs
/* need to poison the objs */
kmem_poison_obj(cachep, objp);
if (cachep->flags & SLAB_RED_ZONE) {
- if (*((unsigned long*)(objp)) != RED_MAGIC1)
- BUG();
- if (*((unsigned long*)(objp + cachep->objsize -
- BYTES_PER_WORD)) != RED_MAGIC1)
- BUG();
+ BUG_ON(*((unsigned long*)(objp)) != RED_MAGIC1);
+ BUG_ON(*((unsigned long*)(objp + cachep->objsize -
+ BYTES_PER_WORD)) != RED_MAGIC1);
}
#endif
slab_bufctl(slabp)[i] = i+1;
@@ -1142,8 +1242,7 @@ static int kmem_cache_grow (kmem_cache_t
/* Be lazy and only check for valid flags here,
* keeping it out of the critical path in kmem_cache_alloc().
*/
- if (flags & ~(SLAB_DMA|SLAB_LEVEL_MASK|SLAB_NO_GROW))
- BUG();
+ BUG_ON(flags & ~(SLAB_DMA|SLAB_LEVEL_MASK|SLAB_NO_GROW));
if (flags & SLAB_NO_GROW)
return 0;
@@ -1153,8 +1252,7 @@ static int kmem_cache_grow (kmem_cache_t
* in kmem_cache_alloc(). If a caller is seriously mis-behaving they
* will eventually be caught here (where it matters).
*/
- if (in_interrupt() && (flags & __GFP_WAIT))
- BUG();
+ BUG_ON(in_interrupt() && (flags & __GFP_WAIT));
ctor_flags = SLAB_CTOR_CONSTRUCTOR;
local_flags = (flags & SLAB_LEVEL_MASK);
@@ -1197,15 +1295,24 @@ static int kmem_cache_grow (kmem_cache_t
goto opps1;
/* Nasty!!!!!! I hope this is OK. */
- i = 1 << cachep->gfporder;
page = virt_to_page(objp);
+ i = 1 << cachep->gfporder;
do {
SET_PAGE_CACHE(page, cachep);
SET_PAGE_SLAB(page, slabp);
SetPageSlab(page);
+ inc_page_state(nr_slab);
page++;
} while (--i);
+ /*
+ * Add to the LRU only after the page is fully set up - we can be
+ * called in interrupt context here.
+ */
+ if (!cachep->gfporder) {
+ page = virt_to_page(objp);
+ lru_cache_add(page);
+ }
+
kmem_cache_init_objs(cachep, slabp, ctor_flags);
spin_lock_irqsave(&cachep->spinlock, save_flags);
@@ -1219,7 +1326,8 @@ static int kmem_cache_grow (kmem_cache_t
spin_unlock_irqrestore(&cachep->spinlock, save_flags);
return 1;
opps1:
- kmem_freepages(cachep, objp);
+ /* do not use kmem_freepages - these pages are not on the LRU yet... */
+ free_pages((unsigned long)objp, cachep->gfporder);
failed:
spin_lock_irqsave(&cachep->spinlock, save_flags);
cachep->growing--;
@@ -1241,15 +1349,12 @@ static int kmem_extra_free_checks (kmem_
int i;
unsigned int objnr = (objp-slabp->s_mem)/cachep->objsize;
- if (objnr >= cachep->num)
- BUG();
- if (objp != slabp->s_mem + objnr*cachep->objsize)
- BUG();
+ BUG_ON(objnr >= cachep->num);
+ BUG_ON(objp != slabp->s_mem + objnr*cachep->objsize);
/* Check slab's freelist to see if this obj is there. */
for (i = slabp->free; i != BUFCTL_END; i = slab_bufctl(slabp)[i]) {
- if (i == objnr)
- BUG();
+ BUG_ON(i == objnr);
}
return 0;
}
@@ -1258,11 +1363,9 @@ static int kmem_extra_free_checks (kmem_
static inline void kmem_cache_alloc_head(kmem_cache_t *cachep, int flags)
{
if (flags & SLAB_DMA) {
- if (!(cachep->gfpflags & GFP_DMA))
- BUG();
+ BUG_ON(!(cachep->gfpflags & GFP_DMA));
} else {
- if (cachep->gfpflags & GFP_DMA)
- BUG();
+ BUG_ON(cachep->gfpflags & GFP_DMA);
}
}
@@ -1284,18 +1387,16 @@ static inline void * kmem_cache_alloc_on
list_del(&slabp->list);
list_add(&slabp->list, &cachep->slabs_full);
}
+ kmem_touch_page(objp);
#if DEBUG
if (cachep->flags & SLAB_POISON)
- if (kmem_check_poison_obj(cachep, objp))
- BUG();
+ BUG_ON(kmem_check_poison_obj(cachep, objp));
if (cachep->flags & SLAB_RED_ZONE) {
/* Set alloc red-zone, and check old one. */
- if (xchg((unsigned long *)objp, RED_MAGIC2) !=
- RED_MAGIC1)
- BUG();
- if (xchg((unsigned long *)(objp+cachep->objsize -
- BYTES_PER_WORD), RED_MAGIC2) != RED_MAGIC1)
- BUG();
+ BUG_ON(xchg((unsigned long *)objp, RED_MAGIC2) !=
+ RED_MAGIC1);
+ BUG_ON(xchg((unsigned long *)(objp+cachep->objsize -
+ BYTES_PER_WORD), RED_MAGIC2) != RED_MAGIC1);
objp += BYTES_PER_WORD;
}
#endif
@@ -1473,13 +1574,11 @@ static inline void kmem_cache_free_one(k
if (cachep->flags & SLAB_RED_ZONE) {
objp -= BYTES_PER_WORD;
- if (xchg((unsigned long *)objp, RED_MAGIC1) != RED_MAGIC2)
- /* Either write before start, or a double free. */
- BUG();
- if (xchg((unsigned long *)(objp+cachep->objsize -
- BYTES_PER_WORD), RED_MAGIC1) != RED_MAGIC2)
- /* Either write past end, or a double free. */
- BUG();
+ BUG_ON(xchg((unsigned long *)objp, RED_MAGIC1) != RED_MAGIC2);
+ /* Either write before start, or a double free. */
+ BUG_ON(xchg((unsigned long *)(objp+cachep->objsize -
+ BYTES_PER_WORD), RED_MAGIC1) != RED_MAGIC2);
+ /* Either write past end, or a double free. */
}
if (cachep->flags & SLAB_POISON)
kmem_poison_obj(cachep, objp);
@@ -1617,8 +1716,7 @@ void kmem_cache_free (kmem_cache_t *cach
unsigned long flags;
#if DEBUG
CHECK_PAGE(objp);
- if (cachep != GET_PAGE_CACHE(virt_to_page(objp)))
- BUG();
+ BUG_ON(cachep != GET_PAGE_CACHE(virt_to_page(objp)));
#endif
local_irq_save(flags);
@@ -1823,23 +1921,18 @@ int kmem_cache_reap (int gfp_mask)
while (p != &searchp->slabs_free) {
slabp = list_entry(p, slab_t, list);
#if DEBUG
- if (slabp->inuse)
- BUG();
+ BUG_ON(slabp->inuse);
#endif
full_free++;
p = p->next;
}
/*
- * Try to avoid slabs with constructors and/or
- * more than one page per slab (as it can be difficult
- * to get high orders from gfp()).
+ * Try to avoid slabs with constructors
*/
pages = full_free * (1<<searchp->gfporder);
if (searchp->ctor)
pages = (pages*4+1)/5;
- if (searchp->gfporder)
- pages = (pages*4+1)/5;
if (pages > best_pages) {
best_cachep = searchp;
best_len = full_free;
@@ -1876,8 +1969,7 @@ perfect:
break;
slabp = list_entry(p,slab_t,list);
#if DEBUG
- if (slabp->inuse)
- BUG();
+ BUG_ON(slabp->inuse);
#endif
list_del(&slabp->list);
STATS_INC_REAPED(best_cachep);
@@ -1962,22 +2054,19 @@ static int s_show(struct seq_file *m, vo
num_slabs = 0;
list_for_each(q,&cachep->slabs_full) {
slabp = list_entry(q, slab_t, list);
- if (slabp->inuse != cachep->num)
- BUG();
+ BUG_ON(slabp->inuse != cachep->num);
active_objs += cachep->num;
active_slabs++;
}
list_for_each(q,&cachep->slabs_partial) {
slabp = list_entry(q, slab_t, list);
- if (slabp->inuse == cachep->num || !slabp->inuse)
- BUG();
+ BUG_ON(slabp->inuse == cachep->num || !slabp->inuse);
active_objs += slabp->inuse;
active_slabs++;
}
list_for_each(q,&cachep->slabs_free) {
slabp = list_entry(q, slab_t, list);
- if (slabp->inuse)
- BUG();
+ BUG_ON(slabp->inuse);
num_slabs++;
}
num_slabs+=active_slabs;
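
To summarise what the slab.c changes set up (a sketch of the call flow,
not code from the patch):

	kmem_cache_grow()
		lru_cache_add(page);		/* order-0 slabs only */

	shrink_list()				/* scanning the LRU */
		kmem_count_page(page, ref);	/* feeds cachep->count;
						   true if slab is empty */
		kmem_shrink_slab(page);		/* unlink an empty slab; the
						   pages are then freed via
						   the pagevec */

	shrink_zone()
		kmem_do_prunes(gfp_mask);	/* hand the accumulated
						   counts to each pruner */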
--- 2.5.33/mm/swap.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/mm/swap.c Sun Sep 1 21:23:12 2002
@@ -46,25 +46,34 @@ void activate_page(struct page *page)
/**
* lru_cache_add: add a page to the page lists
* @page: the page to add
+ *
+ * Can be called from interrupt context by slab, so protect against that.
*/
static struct pagevec lru_add_pvecs[NR_CPUS];
void lru_cache_add(struct page *page)
{
- struct pagevec *pvec = &lru_add_pvecs[get_cpu()];
+ unsigned long flags;
+ struct pagevec *pvec;
page_cache_get(page);
+ pvec = &lru_add_pvecs[get_cpu()];
+ local_irq_save(flags);
if (!pagevec_add(pvec, page))
__pagevec_lru_add(pvec);
+ local_irq_restore(flags);
put_cpu();
}
void lru_add_drain(void)
{
+ unsigned long flags;
struct pagevec *pvec = &lru_add_pvecs[get_cpu()];
+ local_irq_save(flags);
if (pagevec_count(pvec))
__pagevec_lru_add(pvec);
+ local_irq_restore(flags);
put_cpu();
}
@@ -202,6 +211,7 @@ void pagevec_deactivate_inactive(struct
void __pagevec_lru_add(struct pagevec *pvec)
{
int i;
+ unsigned long flags = 0; /* avoid uninitialised var warning */
struct zone *zone = NULL;
for (i = 0; i < pagevec_count(pvec); i++) {
@@ -210,16 +220,16 @@ void __pagevec_lru_add(struct pagevec *p
if (pagezone != zone) {
if (zone)
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
zone = pagezone;
- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irqsave(&zone->lru_lock, flags);
}
if (TestSetPageLRU(page))
BUG();
add_page_to_inactive_list(zone, page);
}
if (zone)
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
pagevec_release(pvec);
}
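
The irq protection above is needed because lru_cache_add() is now
reachable from interrupt context via kmem_cache_grow() (say, a
GFP_ATOMIC allocation growing a cache). A sketch of the re-entry the old
code would have allowed:

	lru_cache_add()			/* process context, per-cpu
					   pagevec half updated... */
	  <interrupt>
	    kmem_cache_alloc(cachep, GFP_ATOMIC)
	      kmem_cache_grow()
	        lru_cache_add()		/* re-enters the same per-cpu
					   pagevec */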
--- 2.5.33/mm/vmscan.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/mm/vmscan.c Sun Sep 1 21:46:20 2002
@@ -114,10 +114,26 @@ shrink_list(struct list_head *page_list,
if (TestSetPageLocked(page))
goto keep;
-
BUG_ON(PageActive(page));
+
+ /*
+ * For slab pages, use kmem_count_page to increment the aging
+ * counter for the cache and to tell us if we should try to
+ * free the slab. Use kmem_shrink_slab to free the slab and
+ * stop if we are done.
+ */
+ if (PageSlab(page)) {
+ int ref = TestClearPageReferenced(page);
+ if (kmem_count_page(page, ref)) {
+ if (kmem_shrink_slab(page))
+ goto free_ref;
+ }
+ goto keep_locked;
+ }
+
may_enter_fs = (gfp_mask & __GFP_FS) ||
(PageSwapCache(page) && (gfp_mask & __GFP_IO));
+
if (PageWriteback(page)) {
if (may_enter_fs)
wait_on_page_writeback(page); /* throttling */
@@ -238,6 +254,7 @@ shrink_list(struct list_head *page_list,
__remove_from_page_cache(page);
write_unlock(&mapping->page_lock);
}
+free_ref:
__put_page(page); /* The pagecache ref */
free_it:
unlock_page(page);
@@ -473,10 +490,6 @@ shrink_zone(struct zone *zone, int prior
unsigned long ratio;
int max_scan;
- /* This is bogus for ZONE_HIGHMEM? */
- if (kmem_cache_reap(gfp_mask) >= nr_pages)
- return 0;
-
/*
* Try to keep the active list 2/3 of the size of the cache. And
* make sure that refill_inactive is given a decent number of pages.
@@ -498,20 +511,12 @@ shrink_zone(struct zone *zone, int prior
max_scan = zone->nr_inactive / priority;
nr_pages = shrink_cache(nr_pages, zone,
gfp_mask, priority, max_scan);
+ kmem_do_prunes(gfp_mask);
if (nr_pages <= 0)
return 0;
wakeup_bdflush();
-
- shrink_dcache_memory(priority, gfp_mask);
-
- /* After shrinking the dcache, get rid of unused inodes too .. */
- shrink_icache_memory(1, gfp_mask);
-#ifdef CONFIG_QUOTA
- shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
-#endif
-
return nr_pages;
}
@@ -552,6 +557,14 @@ try_to_free_pages(struct zone *classzone
if (nr_pages <= 0)
return 1;
} while (--priority);
+
+ /*
+ * perform full reap before concluding we are oom
+ */
+ nr_pages -= kmem_cache_reap(gfp_mask);
+ if (nr_pages <= 0)
+ return 1;
+
out_of_memory();
return 0;
}
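
Net effect on the reclaim path, as a sketch of the new ordering:

	shrink_zone():
		shrink_cache();			/* the LRU scan now sees
						   slab pages too */
		kmem_do_prunes(gfp_mask);	/* age dcache/icache/dqcache
						   by actual use */

	try_to_free_pages():
		/* only after all priorities have failed */
		kmem_cache_reap(gfp_mask);	/* full reap as a last
						   resort before oom */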
--- 2.5.33/net/atm/clip.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/atm/clip.c Sun Sep 1 21:23:12 2002
@@ -751,5 +751,5 @@ void atm_clip_init(void)
{
clip_tbl.lock = RW_LOCK_UNLOCKED;
clip_tbl.kmem_cachep = kmem_cache_create(clip_tbl.id,
- clip_tbl.entry_size, 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+ clip_tbl.entry_size, 0, SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
}
--- 2.5.33/net/bluetooth/af_bluetooth.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/bluetooth/af_bluetooth.c Sun Sep 1 21:23:12 2002
@@ -328,7 +328,7 @@ static int __init bluez_init(void)
/* Init socket cache */
bluez_sock_cache = kmem_cache_create("bluez_sock",
sizeof(struct bluez_sock), 0,
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!bluez_sock_cache) {
BT_ERR("BlueZ socket cache creation failed");
--- 2.5.33/net/core/neighbour.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/core/neighbour.c Sun Sep 1 21:23:12 2002
@@ -1146,7 +1146,7 @@ void neigh_table_init(struct neigh_table
(tbl->entry_size +
15) & ~15,
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
#ifdef CONFIG_SMP
tasklet_init(&tbl->gc_task, SMP_TIMER_NAME(neigh_periodic_timer),
(unsigned long)tbl);
--- 2.5.33/net/core/skbuff.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/core/skbuff.c Sun Sep 1 21:23:12 2002
@@ -1204,7 +1204,7 @@ void __init skb_init(void)
sizeof(struct sk_buff),
0,
SLAB_HWCACHE_ALIGN,
- skb_headerinit, NULL);
+ NULL, skb_headerinit, NULL);
if (!skbuff_head_cache)
panic("cannot create skbuff cache");
--- 2.5.33/net/core/sock.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/core/sock.c Sun Sep 1 21:23:12 2002
@@ -633,7 +633,7 @@ void sk_free(struct sock *sk)
void __init sk_init(void)
{
sk_cachep = kmem_cache_create("sock", sizeof(struct sock), 0,
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!sk_cachep)
printk(KERN_CRIT "sk_init: Cannot create sock SLAB cache!");
--- 2.5.33/net/decnet/dn_route.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/decnet/dn_route.c Sun Sep 1 21:23:12 2002
@@ -1244,7 +1244,7 @@ void __init dn_route_init(void)
dn_dst_ops.kmem_cachep = kmem_cache_create("dn_dst_cache",
sizeof(struct dn_route),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
if (!dn_dst_ops.kmem_cachep)
panic("DECnet: Failed to allocate dn_dst_cache\n");
--- 2.5.33/net/decnet/dn_table.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/decnet/dn_table.c Sun Sep 1 21:23:12 2002
@@ -888,7 +888,7 @@ void __init dn_fib_table_init(void)
dn_hash_kmem = kmem_cache_create("dn_fib_info_cache",
sizeof(struct dn_fib_info),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
}
void __exit dn_fib_table_cleanup(void)
--- 2.5.33/net/ipv4/af_inet.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/ipv4/af_inet.c Sun Sep 1 21:23:12 2002
@@ -1142,13 +1142,13 @@ static int __init inet_init(void)
tcp_sk_cachep = kmem_cache_create("tcp_sock",
sizeof(struct tcp_sock), 0,
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
udp_sk_cachep = kmem_cache_create("udp_sock",
sizeof(struct udp_sock), 0,
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
raw4_sk_cachep = kmem_cache_create("raw4_sock",
sizeof(struct raw_sock), 0,
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!tcp_sk_cachep || !udp_sk_cachep || !raw4_sk_cachep)
printk(KERN_CRIT
"inet_init: Can't create protocol sock SLAB caches!\n");
--- 2.5.33/net/ipv4/fib_hash.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/ipv4/fib_hash.c Sun Sep 1 21:23:12 2002
@@ -899,7 +899,7 @@ struct fib_table * __init fib_hash_init(
fn_hash_kmem = kmem_cache_create("ip_fib_hash",
sizeof(struct fib_node),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
tb = kmalloc(sizeof(struct fib_table) + sizeof(struct fn_hash), GFP_KERNEL);
if (tb == NULL)
--- 2.5.33/net/ipv4/inetpeer.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/ipv4/inetpeer.c Sun Sep 1 21:23:12 2002
@@ -125,7 +125,7 @@ void __init inet_initpeers(void)
peer_cachep = kmem_cache_create("inet_peer_cache",
sizeof(struct inet_peer),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
/* All the timers, started at system startup tend
to synchronize. Perturb it a bit.
--- 2.5.33/net/ipv4/ipmr.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/ipv4/ipmr.c Sun Sep 1 21:23:12 2002
@@ -1750,7 +1750,7 @@ void __init ip_mr_init(void)
mrt_cachep = kmem_cache_create("ip_mrt_cache",
sizeof(struct mfc_cache),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
init_timer(&ipmr_expire_timer);
ipmr_expire_timer.function=ipmr_expire_process;
register_netdevice_notifier(&ip_mr_notifier);
--- 2.5.33/net/ipv4/netfilter/ip_conntrack_core.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/ipv4/netfilter/ip_conntrack_core.c Sun Sep 1 21:23:12 2002
@@ -1444,7 +1444,7 @@ int __init ip_conntrack_init(void)
ip_conntrack_cachep = kmem_cache_create("ip_conntrack",
sizeof(struct ip_conntrack), 0,
- SLAB_HWCACHE_ALIGN, NULL, NULL);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!ip_conntrack_cachep) {
printk(KERN_ERR "Unable to create ip_conntrack slab cache\n");
goto err_free_hash;
--- 2.5.33/net/ipv4/route.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/ipv4/route.c Sun Sep 1 21:23:12 2002
@@ -2472,7 +2472,7 @@ void __init ip_rt_init(void)
ipv4_dst_ops.kmem_cachep = kmem_cache_create("ip_dst_cache",
sizeof(struct rtable),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
if (!ipv4_dst_ops.kmem_cachep)
panic("IP: failed to allocate ip_dst_cache\n");
--- 2.5.33/net/ipv4/tcp.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/ipv4/tcp.c Sun Sep 1 21:23:12 2002
@@ -2569,21 +2569,21 @@ void __init tcp_init(void)
tcp_openreq_cachep = kmem_cache_create("tcp_open_request",
sizeof(struct open_request),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
if (!tcp_openreq_cachep)
panic("tcp_init: Cannot alloc open_request cache.");
tcp_bucket_cachep = kmem_cache_create("tcp_bind_bucket",
sizeof(struct tcp_bind_bucket),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
if (!tcp_bucket_cachep)
panic("tcp_init: Cannot alloc tcp_bind_bucket cache.");
tcp_timewait_cachep = kmem_cache_create("tcp_tw_bucket",
sizeof(struct tcp_tw_bucket),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
if (!tcp_timewait_cachep)
panic("tcp_init: Cannot alloc tcp_tw_bucket cache.");
--- 2.5.33/net/ipv6/af_inet6.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/ipv6/af_inet6.c Sun Sep 1 21:23:12 2002
@@ -655,13 +655,13 @@ static int __init inet6_init(void)
/* allocate our sock slab caches */
tcp6_sk_cachep = kmem_cache_create("tcp6_sock",
sizeof(struct tcp6_sock), 0,
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
udp6_sk_cachep = kmem_cache_create("udp6_sock",
sizeof(struct udp6_sock), 0,
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
raw6_sk_cachep = kmem_cache_create("raw6_sock",
sizeof(struct raw6_sock), 0,
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!tcp6_sk_cachep || !udp6_sk_cachep || !raw6_sk_cachep)
printk(KERN_CRIT __FUNCTION__
": Can't create protocol sock SLAB caches!\n");
--- 2.5.33/net/ipv6/ip6_fib.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/ipv6/ip6_fib.c Sun Sep 1 21:23:12 2002
@@ -1218,7 +1218,7 @@ void __init fib6_init(void)
fib6_node_kmem = kmem_cache_create("fib6_nodes",
sizeof(struct fib6_node),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
}
#ifdef MODULE
--- 2.5.33/net/ipv6/route.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/ipv6/route.c Sun Sep 1 21:23:12 2002
@@ -1919,7 +1919,7 @@ void __init ip6_route_init(void)
ip6_dst_ops.kmem_cachep = kmem_cache_create("ip6_dst_cache",
sizeof(struct rt6_info),
0, SLAB_HWCACHE_ALIGN,
- NULL, NULL);
+ NULL, NULL, NULL);
fib6_init();
#ifdef CONFIG_PROC_FS
proc_net_create("ipv6_route", 0, rt6_proc_info);
--- 2.5.33/net/socket.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/socket.c Sun Sep 1 21:23:12 2002
@@ -305,7 +305,7 @@ static int init_inodecache(void)
sock_inode_cachep = kmem_cache_create("sock_inode_cache",
sizeof(struct socket_alloc),
0, SLAB_HWCACHE_ALIGN,
- init_once, NULL);
+ NULL, init_once, NULL);
if (sock_inode_cachep == NULL)
return -ENOMEM;
return 0;
--- 2.5.33/net/unix/af_unix.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/net/unix/af_unix.c Sun Sep 1 21:23:12 2002
@@ -1893,7 +1893,7 @@ static int __init af_unix_init(void)
/* allocate our sock slab cache */
unix_sk_cachep = kmem_cache_create("unix_sock",
sizeof(struct unix_sock), 0,
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN, NULL, NULL, NULL);
if (!unix_sk_cachep)
printk(KERN_CRIT
"af_unix_init: Cannot create unix_sock SLAB cache!\n");
--- 2.5.33/fs/proc/proc_misc.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/proc/proc_misc.c Sun Sep 1 21:30:44 2002
@@ -177,6 +177,7 @@ static int meminfo_read_proc(char *page,
"SwapFree: %8lu kB\n"
"Dirty: %8lu kB\n"
"Writeback: %8lu kB\n"
+ "Slab: %8lu kB\n"
"Committed_AS: %8u kB\n"
"PageTables: %8lu kB\n"
"ReverseMaps: %8lu\n"
@@ -197,6 +198,7 @@ static int meminfo_read_proc(char *page,
K(i.freeswap),
K(ps.nr_dirty),
K(ps.nr_writeback),
+ K(ps.nr_slab),
K(committed),
K(ps.nr_page_table_pages),
ps.nr_reverse_maps,
--- 2.5.33/fs/jfs/jfs_metapage.c~slablru Sun Sep 1 21:23:12 2002
+++ 2.5.33-akpm/fs/jfs/jfs_metapage.c Sun Sep 1 21:23:12 2002
@@ -143,7 +143,7 @@ int __init metapage_init(void)
* Allocate the metapage structures
*/
metapage_cache = kmem_cache_create("jfs_mp", sizeof(metapage_t), 0, 0,
- init_once, NULL);
+ NULL, init_once, NULL);
if (metapage_cache == NULL)
return -ENOMEM;
.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: slablru for 2.5.32-mm1
2002-08-28 17:06 ` slablru for 2.5.32-mm1 Ed Tomlinson
2002-08-28 21:24 ` Andrew Morton
2002-09-02 5:26 ` Andrew Morton
@ 2002-09-02 6:50 ` Andrew Morton
2 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2002-09-02 6:50 UTC (permalink / raw)
To: Ed Tomlinson; +Cc: linux-mm
[-- Attachment #1: Type: text/plain, Size: 1736 bytes --]
hm. Doing a bit more testing...
mem=512m, then build the inode and dentry caches up a bit:
ext2_inode_cache: 20483KB 20483KB 100.0
buffer_head: 6083KB 6441KB 94.43
dentry_cache: 4885KB 4885KB 100.0
(using wli's bloatmeter, attached here).
Now,
dd if=/dev/zero of=foo bs=1M count=2000
ext2_inode_cache: 3789KB 8148KB 46.50
buffer_head: 6469KB 6503KB 99.47
size-512: 1450KB 1500KB 96.66
This took quite a long time to start dropping, and the machine
still has 27 megabytes in slab.
Which kinda surprises me, given my (probably wrong) description of the
algorithm. I'd have expected the caches to be pruned a lot faster and
further than this. Not that it's necessarily a bad thing, but maybe we
should be shrinking a little faster. What are your thoughts on this?
Also, I note that age_dcache_memory is being called for lots of
tiny little shrinkings:
Breakpoint 1, age_dcache_memory (cachep=0xc1911e48, entries=1, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc1911e48, entries=2, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc1911e48, entries=4, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc1911e48, entries=12, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc1911e48, entries=21, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc1911e48, entries=42, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc1911e48, entries=10, gfp_mask=464) at dcache.c:585
I'd suggest that we batch these up a bit: call the pruner less
frequently, but with larger request sizes, to save a few cycles.
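A rough sketch of that batching (prune_thresh is a hypothetical per-cache
field here; the slablru-speedup patch further down this thread adds one
with the same shape):

	/* sketch only, untested: defer pruning until a cache has
	 * gathered at least prune_thresh objects */
	spin_lock_irq(&cachep->spinlock);
	nr = 0;
	if (cachep->count >= cachep->prune_thresh) {
		nr = cachep->count;
		cachep->count = 0;
	}
	spin_unlock_irq(&cachep->spinlock);
	if (nr > 0)
		(*cachep->pruner)(cachep, nr, gfp_mask);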
[-- Attachment #2: bloatmeter --]
[-- Type: text/plain, Size: 134 bytes --]
#!/bin/sh
while true
do
clear
grep -v '^slabinfo' /proc/slabinfo \
| bloatmon \
| sort -r -n +2 \
| head -22
sleep 5
done
[-- Attachment #3: bloatmon --]
[-- Type: text/plain, Size: 413 bytes --]
#!/usr/bin/awk -f
BEGIN {
printf "%18s %8s %8s %8s\n", "cache", "active", "alloc", "%util";
}
{
if ($3 != 0.0) {
pct = 100.0 * $2 / $3;
frac = (10000.0 * $2 / $3) % 100;
} else {
pct = 100.0;
frac = 0.0;
}
active = ($2 * $4)/1024;
alloc = ($3 * $4)/1024;
if ((alloc - active) < 1.0) {
pct = 100.0;
frac = 0.0;
}
printf "%18s: %8dKB %8dKB %3d.%-2d\n", $1, active, alloc, pct, frac;
}
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: slablru for 2.5.32-mm1
2002-09-02 5:26 ` Andrew Morton
@ 2002-09-02 15:00 ` Ed Tomlinson
2002-09-02 18:35 ` Andrew Morton
0 siblings, 1 reply; 46+ messages in thread
From: Ed Tomlinson @ 2002-09-02 15:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On September 2, 2002 01:26 am, Andrew Morton wrote:
> Ed, this code can be sped up a bit, I think. We can make
> kmem_count_page() return a boolean back to shrink_cache(), telling it
> whether it needs to call kmem_do_prunes() at all. Often, there won't
> be any work to do in there, and taking that semaphore can be quite
> costly.
>
> The code as-is will even run kmem_do_prunes() when we're examining
> ZONE_HIGHMEM, which certainly won't have any slab pages. This boolean
> will fix that too.
How about this? I have modified things so we only try for the sem if there
is work to do. It also always uses a down_trylock - if we cannot do the prune
now, later is ok too...
Lightly tested.
Comments?
Ed
-----------
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.531 -> 1.533
# mm/slab.c 1.28 -> 1.30
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/09/02 ed@oscar.et.ca 1.532
# optimization. lets only take the sem if we have work to do.
# --------------------------------------------
# 02/09/02 ed@oscar.et.ca 1.533
# more optimizations and a correction
# --------------------------------------------
#
diff -Nru a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c Mon Sep 2 10:54:33 2002
+++ b/mm/slab.c Mon Sep 2 10:54:33 2002
@@ -403,6 +403,9 @@
/* Place maintainer for reaping. */
static kmem_cache_t *clock_searchp = &cache_cache;
+static int pruner_flag;
+#define PRUNE_GATE 0
+
#define cache_chain (cache_cache.next)
#ifdef CONFIG_SMP
@@ -427,6 +430,8 @@
spin_lock_irq(&cachep->spinlock);
if (cachep->pruner != NULL) {
cachep->count += slabp->inuse;
+ if (cachep->count)
+ set_bit(PRUNE_GATE, &pruner_flag);
ret = !slabp->inuse;
} else
ret = !ref && !slabp->inuse;
@@ -441,11 +446,13 @@
struct list_head *p;
int nr;
- if (gfp_mask & __GFP_WAIT)
- down(&cache_chain_sem);
- else
- if (down_trylock(&cache_chain_sem))
- return 0;
+ if (!test_and_clear_bit(PRUNE_GATE, &pruner_flag))
+ return 0;
+
+ if (down_trylock(&cache_chain_sem)) {
+ set_bit(PRUNE_GATE, &pruner_flag);
+ return 0;
+ }
list_for_each(p,&cache_chain) {
kmem_cache_t *cachep = list_entry(p, kmem_cache_t, next);
-----------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: slablru for 2.5.32-mm1
2002-09-02 15:00 ` Ed Tomlinson
@ 2002-09-02 18:35 ` Andrew Morton
2002-09-02 19:09 ` Ed Tomlinson
0 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2002-09-02 18:35 UTC (permalink / raw)
To: Ed Tomlinson; +Cc: linux-mm
Ed Tomlinson wrote:
>
> On September 2, 2002 01:26 am, Andrew Morton wrote:
> > Ed, this code can be sped up a bit, I think. We can make
> > kmem_count_page() return a boolean back to shrink_cache(), telling it
> > whether it needs to call kmem_do_prunes() at all. Often, there won't
> > be any work to do in there, and taking that semaphore can be quite
> > costly.
> >
> > The code as-is will even run kmem_do_prunes() when we're examining
> > ZONE_HIGHMEM, which certainly won't have any slab pages. This boolean
> > will fix that too.
>
> How about this? I have modified things so we only try for the sem if there
> is work to do. It also always uses a down_trylock - if we cannot do the prune
> now, later is ok too...
>
well... Using a global like that is a bit un-linuxy. (bitops
are only defined on longs, btw...)
How about this one? It does both: tells the caller whether or
not to perform the shrink, and defers the pruning until we
have at least a page's worth of objects to be pruned.
Also, make sure that only the CPU which was responsible for
the transition-past-threshold is told to do some pruning. Reduces
the possibility of two CPUs running the prune.
Also, when we make the sweep across the to-be-pruned caches, only
prune the ones which are over threshold.
Breakpoint 1, age_dcache_memory (cachep=0xc189f66c, entries=396, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc189f66c, entries=66, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc189f66c, entries=429, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc189f66c, entries=66, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc189f66c, entries=264, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc189f66c, entries=198, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc189f66c, entries=66, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc189f66c, entries=429, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc189f66c, entries=66, gfp_mask=464) at dcache.c:585
Breakpoint 1, age_dcache_memory (cachep=0xc189f66c, entries=66, gfp_mask=464) at dcache.c:585
It'll make things a bit lumpier. Under high internal fragmentation
we'll suddenly release great gobs of pages, but I think it'll average
out OK.
What sayest thou?
include/linux/slab.h | 2 +-
mm/slab.c | 23 +++++++++++++++++++----
mm/vmscan.c | 22 ++++++++++++----------
3 files changed, 32 insertions(+), 15 deletions(-)
--- 2.5.33/mm/vmscan.c~slablru-speedup Mon Sep 2 11:07:33 2002
+++ 2.5.33-akpm/mm/vmscan.c Mon Sep 2 11:07:33 2002
@@ -95,8 +95,8 @@ static inline int is_page_cache_freeable
}
static /* inline */ int
-shrink_list(struct list_head *page_list, int nr_pages,
- unsigned int gfp_mask, int priority, int *max_scan)
+shrink_list(struct list_head *page_list, int nr_pages, unsigned int gfp_mask,
+ int priority, int *max_scan, int *prunes_needed)
{
struct address_space *mapping;
LIST_HEAD(ret_pages);
@@ -124,7 +124,7 @@ shrink_list(struct list_head *page_list,
*/
if (PageSlab(page)) {
int ref = TestClearPageReferenced(page);
- if (kmem_count_page(page, ref)) {
+ if (kmem_count_page(page, ref, prunes_needed)) {
if (kmem_shrink_slab(page))
goto free_ref;
}
@@ -292,8 +292,8 @@ keep:
* in the kernel (apart from the copy_*_user functions).
*/
static /* inline */ int
-shrink_cache(int nr_pages, struct zone *zone,
- unsigned int gfp_mask, int priority, int max_scan)
+shrink_cache(int nr_pages, struct zone *zone, unsigned int gfp_mask,
+ int priority, int max_scan, int *prunes_needed)
{
LIST_HEAD(page_list);
struct pagevec pvec;
@@ -342,8 +342,8 @@ shrink_cache(int nr_pages, struct zone *
max_scan -= n;
KERNEL_STAT_ADD(pgscan, n);
- nr_pages = shrink_list(&page_list, nr_pages,
- gfp_mask, priority, &max_scan);
+ nr_pages = shrink_list(&page_list, nr_pages, gfp_mask,
+ priority, &max_scan, prunes_needed);
if (nr_pages <= 0 && list_empty(&page_list))
goto done;
@@ -489,6 +489,7 @@ shrink_zone(struct zone *zone, int prior
{
unsigned long ratio;
int max_scan;
+ int prunes_needed = 0;
/*
* Try to keep the active list 2/3 of the size of the cache. And
@@ -509,9 +510,10 @@ shrink_zone(struct zone *zone, int prior
}
max_scan = zone->nr_inactive / priority;
- nr_pages = shrink_cache(nr_pages, zone,
- gfp_mask, priority, max_scan);
- kmem_do_prunes(gfp_mask);
+ nr_pages = shrink_cache(nr_pages, zone, gfp_mask,
+ priority, max_scan, &prunes_needed);
+ if (prunes_needed)
+ kmem_do_prunes(gfp_mask);
if (nr_pages <= 0)
return 0;
--- 2.5.33/mm/slab.c~slablru-speedup Mon Sep 2 11:07:33 2002
+++ 2.5.33-akpm/mm/slab.c Mon Sep 2 11:30:27 2002
@@ -217,7 +217,8 @@ struct kmem_cache_s {
unsigned int growing;
unsigned int dflags; /* dynamic flags */
kmem_pruner_t pruner; /* shrink callback */
- int count; /* count used to trigger shrink */
+ int count; /* nr of objects to be pruned */
+ int prune_thresh; /* threshold triggers pruning */
/* constructor func */
void (*ctor)(void *, kmem_cache_t *, unsigned long);
@@ -418,8 +419,11 @@ static void enable_all_cpucaches (void);
/*
* Used by shrink_cache to determine caches that need pruning.
+ *
+ * If this particular call to kmem_count_page takes a slab over its to-be-pruned
+ * threshold then we tell the caller that kmem_do_prunes() needs to be called.
*/
-int kmem_count_page(struct page *page, int ref)
+int kmem_count_page(struct page *page, int ref, int *prunes_needed)
{
kmem_cache_t *cachep = GET_PAGE_CACHE(page);
slab_t *slabp = GET_PAGE_SLAB(page);
@@ -427,7 +431,12 @@ int kmem_count_page(struct page *page, i
spin_lock_irq(&cachep->spinlock);
if (cachep->pruner != NULL) {
+ int old_count = cachep->count;
+
cachep->count += slabp->inuse;
+ if (old_count < cachep->prune_thresh &&
+ cachep->count >= cachep->prune_thresh)
+ *prunes_needed = 1;
ret = !slabp->inuse;
} else {
ret = !ref && !slabp->inuse;
@@ -453,8 +462,11 @@ int kmem_do_prunes(int gfp_mask)
kmem_cache_t *cachep = list_entry(p, kmem_cache_t, next);
if (cachep->pruner != NULL) {
spin_lock_irq(&cachep->spinlock);
- nr = cachep->count;
- cachep->count = 0;
+ nr = 0;
+ if (cachep->count >= cachep->prune_thresh) {
+ nr = cachep->count;
+ cachep->count = 0;
+ }
spin_unlock_irq(&cachep->spinlock);
if (nr > 0)
(*cachep->pruner)(cachep, nr, gfp_mask);
@@ -872,6 +884,9 @@ next:
cachep->flags = flags;
cachep->pruner = thepruner;
cachep->count = 0;
+ cachep->prune_thresh = 0;
+ if (thepruner)
+ cachep->prune_thresh = PAGE_SIZE / size;
cachep->gfpflags = 0;
if (flags & SLAB_CACHE_DMA)
cachep->gfpflags |= GFP_DMA;
--- 2.5.33/include/linux/slab.h~slablru-speedup Mon Sep 2 11:07:33 2002
+++ 2.5.33-akpm/include/linux/slab.h Mon Sep 2 11:07:33 2002
@@ -60,7 +60,7 @@ extern int kmem_cache_destroy(kmem_cache
extern int kmem_cache_shrink(kmem_cache_t *);
extern int kmem_do_prunes(int);
-extern int kmem_count_page(struct page *, int);
+extern int kmem_count_page(struct page *page, int ref, int *prunes_needed);
#define kmem_touch_page(addr) SetPageReferenced(virt_to_page(addr));
/* shrink a slab */
.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: slablru for 2.5.32-mm1
2002-09-02 18:35 ` Andrew Morton
@ 2002-09-02 19:09 ` Ed Tomlinson
2002-09-02 19:51 ` Andrew Morton
0 siblings, 1 reply; 46+ messages in thread
From: Ed Tomlinson @ 2002-09-02 19:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On September 2, 2002 02:35 pm, Andrew Morton wrote:
> Ed Tomlinson wrote:
> > On September 2, 2002 01:26 am, Andrew Morton wrote:
> > > Ed, this code can be sped up a bit, I think. We can make
> > > kmem_count_page() return a boolean back to shrink_cache(), telling it
> > > whether it needs to call kmem_do_prunes() at all. Often, there won't
> > > be any work to do in there, and taking that semaphore can be quite
> > > costly.
> > >
> > > The code as-is will even run kmem_do_prunes() when we're examining
> > > ZONE_HIGHMEM, which certainly won't have any slab pages. This boolean
> > > will fix that too.
> >
> > How about this? I have modified things so we only try for the sem if
> > there is work to do. It also always uses a down_trylock - if we cannot
> > do the prune now, later is ok too...
>
> well... Using a global like that is a bit un-linuxy. (bitops
> are only defined on longs, btw...)
ah. learn something every day.
> How about this one? It does both: tells the caller whether or
> not to perform the shrink, and defers the pruning until we
> have at least a page's worth of objects to be pruned.
I thought about doing something like your patch. I wanted to avoid
semi-magic numbers (why a page's worth of objects? why not two or
three...). I would rather see something like my patch, maybe coded
in a more stylish way, used. If we want a bigger batch I would
move kmem_do_prunes up into try_to_free_pages. This way the
code is simpler, the vmscan changes for slablru are smaller, and nothing
magic is involved.
> Also, make sure that only the CPU which was responsible for
> the transition-past-threshold is told to do some pruning. Reduces
> the possibility of two CPUs running the prune.
With my code it is possible that two cpus could prune, but it is very unlikely.
> Also, when we make the sweep across the to-be-pruned caches, only
> prune the ones which are over threshold.
If kmem_do_prunes is moved to try_to_free_pages it's not quite as
hot a call. Since it now never waits (with my patch) I doubt it is going
to show up as something that needs tuning...
How about this? What does it show if you breakpoint it? How would
you make it prettier linux-wise? (compiled, untested)
Ed
-----
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.531 -> 1.534
# mm/vmscan.c 1.98 -> 1.99
# mm/slab.c 1.28 -> 1.31
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/09/02 ed@oscar.et.ca 1.532
# optimization. lets only take the sem if we have work to do.
# --------------------------------------------
# 02/09/02 ed@oscar.et.ca 1.533
# more optimizations and a correction
# --------------------------------------------
# 02/09/02 ed@oscar.et.ca 1.534
# more optimizing
# --------------------------------------------
#
diff -Nru a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c Mon Sep 2 15:05:01 2002
+++ b/mm/slab.c Mon Sep 2 15:05:01 2002
@@ -403,6 +403,9 @@
/* Place maintainer for reaping. */
static kmem_cache_t *clock_searchp = &cache_cache;
+static long pruner_flag;
+#define PRUNE_GATE 0
+
#define cache_chain (cache_cache.next)
#ifdef CONFIG_SMP
@@ -427,6 +430,8 @@
spin_lock_irq(&cachep->spinlock);
if (cachep->pruner != NULL) {
cachep->count += slabp->inuse;
+ if (cachep->count)
+ set_bit(PRUNE_GATE, &pruner_flag);
ret = !slabp->inuse;
} else
ret = !ref && !slabp->inuse;
@@ -441,11 +446,13 @@
struct list_head *p;
int nr;
- if (gfp_mask & __GFP_WAIT)
- down(&cache_chain_sem);
- else
- if (down_trylock(&cache_chain_sem))
- return 0;
+ if (!test_and_clear_bit(PRUNE_GATE, &pruner_flag))
+ return 0;
+
+ if (down_trylock(&cache_chain_sem)) {
+ set_bit(PRUNE_GATE, &pruner_flag);
+ return 0;
+ }
list_for_each(p,&cache_chain) {
kmem_cache_t *cachep = list_entry(p, kmem_cache_t, next);
diff -Nru a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c Mon Sep 2 15:05:01 2002
+++ b/mm/vmscan.c Mon Sep 2 15:05:01 2002
@@ -510,8 +510,6 @@
max_scan = zone->nr_inactive / priority;
nr_pages = shrink_cache(nr_pages, zone,
gfp_mask, priority, max_scan);
- kmem_do_prunes(gfp_mask);
-
if (nr_pages <= 0)
return 0;
@@ -549,6 +547,8 @@
int nr_pages = SWAP_CLUSTER_MAX;
KERNEL_STAT_INC(pageoutrun);
+
+ kmem_do_prunes(gfp_mask);
do {
nr_pages = shrink_caches(classzone, priority,
-----
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: slablru for 2.5.32-mm1
2002-09-02 19:09 ` Ed Tomlinson
@ 2002-09-02 19:51 ` Andrew Morton
0 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2002-09-02 19:51 UTC (permalink / raw)
To: Ed Tomlinson; +Cc: linux-mm
Ed Tomlinson wrote:
>
> ...
> I thought about doing something like your patch. I wanted to avoid
> semi-magic numbers (why a page worth of objects? why not two or
> three...).
Well, it's just an efficiency heuristic...
> I would rather see something like my patch, maybe coded
> in a more stylish way, used. If we want to get bigger batch I would
> move the kmem_do_prunes up into try_to_free_pages. This way the
> code is simpler, vmscan changes for slablru are smaller, and nothing
> magic is involved.
Doesn't make much difference, afaict. Generally, the first pass
through shrink_caches() frees a sufficient number of pages, so
the before- and after- code are equivalent.
And because there is only one flag, we still attempt to prune all
caches which have a pruner, when it's quite possible that just one
of them has a decent amount of stuff. Probably a minor issue though.
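A per-cache gate would be one way around that; a minimal sketch, with
DFLGS_PRUNE_PENDING as a made-up dflags bit (not in either patch):

	/* sketch: only sweep caches that were marked when they
	 * crossed their threshold */
	list_for_each(p, &cache_chain) {
		kmem_cache_t *cachep = list_entry(p, kmem_cache_t, next);
		int nr = 0;

		spin_lock_irq(&cachep->spinlock);
		if (cachep->dflags & DFLGS_PRUNE_PENDING) {
			cachep->dflags &= ~DFLGS_PRUNE_PENDING;
			nr = cachep->count;
			cachep->count = 0;
		}
		spin_unlock_irq(&cachep->spinlock);
		if (nr > 0)
			(*cachep->pruner)(cachep, nr, gfp_mask);
	}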
I wouldn't be too fussed about the extent of changes in vmscan.c.
One day, when the kernel is perfect, all of that file will have the
inlines turned on and the whole of page reclaim becomes one big
function.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-28 22:39 ` Andrew Morton
@ 2002-08-28 22:57 ` Daniel Phillips
0 siblings, 0 replies; 46+ messages in thread
From: Daniel Phillips @ 2002-08-28 22:57 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christian Ehrhardt, lkml, linux-mm
On Thursday 29 August 2002 00:39, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > ...
> > So there's no question that the race is lurking in 2.4. I noticed several
> > more paths besides the one above that look suspicious as well. The bottom
> > line is, 2.4 needs a fix along the lines of my suggestion or Christian's,
> > something that can actually be proved.
> >
> > It's a wonder that this problem manifests so rarely in practice.
>
> I sort-of glanced through the 2.4 paths and it appears that in all of the
> places where it could do a page_cache_get/release, that would never happen
> because of other parts of the page state.
>
> Like: it can't be in pagecache, so we won't run writepage, and
> it can't have buffers, so we won't run try_to_release_page().
>
> Of course, I might have missed a path. And, well, generally: ugh.
I think it is happening. I just went sifting through the archives
for 'oops' and '2.4'. The first one I found was:
2.4.18-xfs (xfs related?) oops report
which fits the description nicely.
The race I showed actually causes the page->count to go negative, avoiding
a double free on a technicality. That doesn't make me feel much better about
it. Have you got a BUG_ON(!page_count(page)) in put_page_testzero? I think
we might see some action.
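For 2.4 that would be a one-line addition to the existing macro (a
sketch, untested):

	#define put_page_testzero(p)				\
		({						\
			BUG_ON(!page_count(p));			\
			atomic_dec_and_test(&(p)->count);	\
		})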
--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-28 22:04 ` Daniel Phillips
@ 2002-08-28 22:39 ` Andrew Morton
2002-08-28 22:57 ` Daniel Phillips
0 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2002-08-28 22:39 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm
Daniel Phillips wrote:
>
> ...
> So there's no question that the race is lurking in 2.4. I noticed several
> more paths besides the one above that look suspicious as well. The bottom
> line is, 2.4 needs a fix along the lines of my suggestion or Christian's,
> something that can actually be proved.
>
> It's a wonder that this problem manifests so rarely in practice.
I sort-of glanced through the 2.4 paths and it appears that in all of the
places where it could do a page_cache_get/release, that would never happen
because of other parts of the page state.
Like: it can't be in pagecache, so we won't run writepage, and
it can't have buffers, so we won't run try_to_release_page().
Of course, I might have missed a path. And, well, generally: ugh.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-28 21:03 ` Andrew Morton
@ 2002-08-28 22:04 ` Daniel Phillips
2002-08-28 22:39 ` Andrew Morton
0 siblings, 1 reply; 46+ messages in thread
From: Daniel Phillips @ 2002-08-28 22:04 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christian Ehrhardt, lkml, linux-mm
On Wednesday 28 August 2002 23:03, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > Going right back to basics, what do you suppose is wrong with the 2.4
> > strategy of always doing the lru removal in free_pages_ok?
>
> That's equivalent to what we have at present, which is:
>
> if (put_page_testzero(page)) {
> /* window here */
> lru_cache_del(page);
> __free_pages_ok(page, 0);
> }
>
> versus:
>
> spin_lock(lru lock);
> page = list_entry(lru, ...);
> if (page_count(page) == 0)
> continue;
> /* window here */
> page_cache_get(page);
> page_cache_release(page); /* double-free */
Indeed it is. In 2.4.19 we have:
(vmscan.c: shrink_cache)                    (page_alloc.c: __free_pages)

365 if (unlikely(!page_count(page)))
366         continue;
                                            444 if (!PageReserved(page) && put_page_testzero(page))
    [many twisty paths, all different]
511 /* effectively free the page here */
512 page_cache_release(page);
                                            445         __free_pages_ok(page, order);
                                                [free it again just to make sure]
So there's no question that the race is lurking in 2.4. I noticed several
more paths besides the one above that look suspicious as well. The bottom
line is, 2.4 needs a fix along the lines of my suggestion or Christian's,
something that can actually be proved.
It's a wonder that this problem manifests so rarely in practice.
--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-28 20:41 ` Daniel Phillips
@ 2002-08-28 21:03 ` Andrew Morton
2002-08-28 22:04 ` Daniel Phillips
0 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2002-08-28 21:03 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm
Daniel Phillips wrote:
>
> Going right back to basics, what do you suppose is wrong with the 2.4
> strategy of always doing the lru removal in free_pages_ok?
That's equivalent to what we have at present, which is:
if (put_page_testzero(page)) {
/* window here */
lru_cache_del(page);
__free_pages_ok(page, 0);
}
versus:
spin_lock(lru lock);
page = list_entry(lru, ...);
if (page_count(page) == 0)
continue;
/* window here */
page_cache_get(page);
page_cache_release(page); /* double-free */
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-28 13:14 ` Christian Ehrhardt
2002-08-28 17:18 ` Daniel Phillips
@ 2002-08-28 20:41 ` Daniel Phillips
2002-08-28 21:03 ` Andrew Morton
1 sibling, 1 reply; 46+ messages in thread
From: Daniel Phillips @ 2002-08-28 20:41 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm
Going right back to basics, what do you suppose is wrong with the 2.4
strategy of always doing the lru removal in free_pages_ok?
--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-28 17:18 ` Daniel Phillips
@ 2002-08-28 17:42 ` Andrew Morton
0 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2002-08-28 17:42 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm
Daniel Phillips wrote:
>
> On Wednesday 28 August 2002 15:14, Christian Ehrhardt wrote:
> > Side note: The BUG in __pagevec_lru_del seems strange. refill_inactive
> > or shrink_cache could have removed the page from the lru before
> > __pagevec_lru_del acquired the lru lock.
>
> It's suspect all right. If there's a chain of assumptions that proves
> the page is always on the lru at that point, I haven't seen it yet.
Yeah. __pagevec_lru_del is only used by invalidate_inode_pages.
A very simple solution is to just delete it.
untested code:
include/linux/pagevec.h | 7 -------
mm/filemap.c | 10 +++++-----
mm/swap.c | 28 ----------------------------
3 files changed, 5 insertions(+), 40 deletions(-)
--- 2.5.32/mm/filemap.c~pagevec_lru_del Wed Aug 28 09:51:51 2002
+++ 2.5.32-akpm/mm/filemap.c Wed Aug 28 09:51:51 2002
@@ -116,10 +116,10 @@ void invalidate_inode_pages(struct inode
struct list_head *head, *curr;
struct page * page;
struct address_space *mapping = inode->i_mapping;
- struct pagevec lru_pvec;
+ struct pagevec pvec;
head = &mapping->clean_pages;
- pagevec_init(&lru_pvec);
+ pagevec_init(&pvec);
write_lock(&mapping->page_lock);
curr = head->next;
@@ -143,8 +143,8 @@ void invalidate_inode_pages(struct inode
__remove_from_page_cache(page);
unlock_page(page);
- if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_del(&lru_pvec);
+ if (!pagevec_add(&pvec, page))
+ __pagevec_release(&pvec);
continue;
unlock:
unlock_page(page);
@@ -152,7 +152,7 @@ unlock:
}
write_unlock(&mapping->page_lock);
- pagevec_lru_del(&lru_pvec);
+ pagevec_release(&pvec);
}
static int do_invalidatepage(struct page *page, unsigned long offset)
--- 2.5.32/include/linux/pagevec.h~pagevec_lru_del Wed Aug 28 09:51:51 2002
+++ 2.5.32-akpm/include/linux/pagevec.h Wed Aug 28 09:51:51 2002
@@ -18,7 +18,6 @@ void __pagevec_release(struct pagevec *p
void __pagevec_release_nonlru(struct pagevec *pvec);
void __pagevec_free(struct pagevec *pvec);
void __pagevec_lru_add(struct pagevec *pvec);
-void __pagevec_lru_del(struct pagevec *pvec);
void lru_add_drain(void);
void pagevec_deactivate_inactive(struct pagevec *pvec);
@@ -69,9 +68,3 @@ static inline void pagevec_lru_add(struc
if (pagevec_count(pvec))
__pagevec_lru_add(pvec);
}
-
-static inline void pagevec_lru_del(struct pagevec *pvec)
-{
- if (pagevec_count(pvec))
- __pagevec_lru_del(pvec);
-}
--- 2.5.32/mm/swap.c~pagevec_lru_del Wed Aug 28 09:51:51 2002
+++ 2.5.32-akpm/mm/swap.c Wed Aug 28 09:51:58 2002
@@ -214,34 +214,6 @@ void __pagevec_lru_add(struct pagevec *p
}
/*
- * Remove the passed pages from the LRU, then drop the caller's refcount on
- * them. Reinitialises the caller's pagevec.
- */
-void __pagevec_lru_del(struct pagevec *pvec)
-{
- int i;
- struct zone *zone = NULL;
-
- for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
- struct zone *pagezone = page_zone(page);
-
- if (pagezone != zone) {
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- zone = pagezone;
- spin_lock_irq(&zone->lru_lock);
- }
- if (!TestClearPageLRU(page))
- BUG();
- del_page_from_lru(zone, page);
- }
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- pagevec_release(pvec);
-}
-
-/*
* Perform any setup for the swap system
*/
void __init swap_setup(void)
.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-28 13:14 ` Christian Ehrhardt
@ 2002-08-28 17:18 ` Daniel Phillips
2002-08-28 17:42 ` Andrew Morton
2002-08-28 20:41 ` Daniel Phillips
1 sibling, 1 reply; 46+ messages in thread
From: Daniel Phillips @ 2002-08-28 17:18 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm
On Wednesday 28 August 2002 15:14, Christian Ehrhardt wrote:
> Side note: The BUG in __pagevec_lru_del seems strange. refill_inactive
> or shrink_cache could have removed the page from the lru before
> __pagevec_lru_del acquired the lru lock.
It's suspect all right. If there's a chain of assumptions that proves
the page is always on the lru at that point, I haven't seen it yet.
--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-27 16:48 ` Daniel Phillips
@ 2002-08-28 13:14 ` Christian Ehrhardt
2002-08-28 17:18 ` Daniel Phillips
2002-08-28 20:41 ` Daniel Phillips
0 siblings, 2 replies; 46+ messages in thread
From: Christian Ehrhardt @ 2002-08-28 13:14 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm
On Tue, Aug 27, 2002 at 06:48:50PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 22:58, Christian Ehrhardt wrote:
> > > Nope, still don't see it. Whoever hits put_page_testzero frees the page,
> > > secure in the knowledge that there are no other references to it.
> >
> > Well yes, but we cannot remove the page from the lru atomically
> > at page_cache_release time if we follow your proposal. If you think we can,
> > show me your implementation of page_cache_release and I'll show
> > you where the races are (unless you do everything under the lru_lock
> > of course).
>
> void page_cache_release(struct page *page)
> {
> spin_lock(&pagemap_lru_lock);
> if (PageLRU(page) && page_count(page) == 2) {
> __lru_cache_del(page);
> atomic_dec(&page->count);
> }
> spin_unlock(&pagemap_lru_lock);
> if (put_page_testzero(page))
> __free_pages_ok(page, 0);
> }
>
> This allows the following benign race, with initial page count = 3:
> [ ...]
> Neither holder of a page reference sees the count at 2, and so the page
> is left on the lru with count = 1. This won't happen often and such
> pages will be recovered from the cold end of the list in due course.
Ok, agreed. I think this will work but taking the lru lock each time
is probably not a good idea.
> We could also do this:
>
> void page_cache_release(struct page *page)
> {
> if (page_count(page) == 2) {
> spin_lock(&pagemap_lru_lock);
> if (PageLRU(page) && page_count(page) == 2) {
> __lru_cache_del(page);
> atomic_dec(&page->count);
> }
> spin_unlock(&pagemap_lru_lock);
> }
> if (put_page_testzero(page))
> __free_pages_ok(page, 0);
> }
>
> Which avoids taking the lru lock sometimes in exchange for widening the
> hole through which pages can end up with count = 1 on the lru list.
This sounds like something that is worth trying. I missed that one.
Side note: The BUG in __pagevec_lru_del seems strange. refill_inactive
or shrink_cache could have removed the page from the lru before
__pagevec_lru_del acquired the lru lock.
regards Christian
--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-27 9:22 ` Christian Ehrhardt
@ 2002-08-27 19:19 ` Andrew Morton
0 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2002-08-27 19:19 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Daniel Phillips, lkml, linux-mm
Christian Ehrhardt wrote:
>
> ...
> So what we want CPU B to do instead is
>
> spin_lock(lru_lock);
> page = list_entry(lru)
>
> START ATOMIC
> page_cache_get(page);
> res = (page_count (page) == 1)
> END ATOMIC
>
> if (res) {
> atomic_dec (&page->count);
> continue; /* with next page */
> }
> ...
> page_cache_release (page);
>
> I.e. we want to detect _atomically_ that we just raised the page count
> from zero to one. My patch actually has a solution that implements the
> needed atomic operation above by means of the atomic functions that we
> currently have on all archs (it's called get_page_testzero and
> should probably be called get_page_testone).
> The more I think about this the more I think this is the way to go.
>
Yes, I think that would provide a minimal fix to the problem.
(I'd prefer a solution in which presence on the LRU contributes
to page->count, because that means I can dump a load of expensive
page_cache_get-inside-lru-lock instances, but whatever)
You had:
-#define put_page_testzero(p) atomic_dec_and_test(&(p)->count)
-#define page_count(p) atomic_read(&(p)->count)
-#define set_page_count(p,v) atomic_set(&(p)->count, v)
+#define put_page_testzero(p) atomic_add_negative(-1, &(p)->count)
+#define page_count(p) (1+atomic_read(&(p)->count))
+#define set_page_count(p,v) atomic_set(&(p)->count, v-1)
+#define get_page_testzero(p) atomic_inc_and_test(&(p)->count)
So the page count is actually offset by -1, and that is hidden by
the macros. Fair enough.
atomic_add_negative() is not implemented on quite a number of
architectures (sparc64, mips, ppc, sh, cris, 68k, alpha..), so
some legwork is needed there. Looks to be pretty simple though;
alpha, ppc and others already have atomic_add_return().
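For instance, on an arch that already has atomic_add_return(), the
missing primitive would be trivial (sketch):

	static __inline__ int atomic_add_negative(int i, atomic_t *v)
	{
		return atomic_add_return(i, v) < 0;
	}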
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 20:58 ` Christian Ehrhardt
@ 2002-08-27 16:48 ` Daniel Phillips
2002-08-28 13:14 ` Christian Ehrhardt
0 siblings, 1 reply; 46+ messages in thread
From: Daniel Phillips @ 2002-08-27 16:48 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm
On Monday 26 August 2002 22:58, Christian Ehrhardt wrote:
> On Mon, Aug 26, 2002 at 10:09:38PM +0200, Daniel Phillips wrote:
> > On Monday 26 August 2002 22:00, Christian Ehrhardt wrote:
> > > On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> > > > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > > > > + * does not increase the page count.
> > > > > >
> > > > > > Please remind me... why should it not?
> > > > >
> > > > > Pages that are only on the lru but not referenced by anyone are of no
> > > > > use and we want to free them immediately. If we leave them on the lru
> > > > > list with a page count of 1, someone else will have to walk the lru
> > > > > list and remove pages that are only on the lru.
> > > >
> > > > I don't understand this argument. Suppose lru list membership is worth a
> > > > page count of one. Then anyone who finds a page by way of the lru list can
> > >
> > > This does fix the double free problem but think of a typical anonymous
> > > page at exit. The page is on the lru list and there is one reference held
> > > by the pte. According to your scheme the pte reference would be freed
> > > (obviously due to the exit) but the page would remain on the lru list.
> > > However, there is no point in leaving the page on the lru list at all.
> >
> > If you want the page off the lru list at that point (which you probably do)
> > then you take the lru lock and put_page_testzero.
>
> Could you clarify what you mean by "at that point"? Especially how
> do you plan to test for "this point"? Besides, it is illegal to use
> the page after put_page_testzero (unless put_page_testzero returns true).
> > > If you think about who is going to remove the page from the lru you'll
> > > see the problem.
> >
> > Nope, still don't see it. Whoever hits put_page_testzero frees the page,
> > secure in the knowledge that there are no other references to it.
>
> Well yes, but we cannot remove the page from the lru atomically
> at page_cache_release time if we follow your proposal. If you think we can,
> show me your implementation of page_cache_release and I'll show
> you where the races are (unless you do everything under the lru_lock
> of course).
void page_cache_release(struct page *page)
{
spin_lock(&pagemap_lru_lock);
if (PageLRU(page) && page_count(page) == 2) {
__lru_cache_del(page);
atomic_dec(&page->count);
}
spin_unlock(&pagemap_lru_lock);
if (put_page_testzero(page))
__free_pages_ok(page, 0);
}
This allows the following benign race, with initial page count = 3:
CPU A:                                      CPU B:

spin_lock(&pagemap_lru_lock);
if (PageLRU(page) && page_count(page) == 2)     /* false */
spin_unlock(&pagemap_lru_lock);
                                            spin_lock(&pagemap_lru_lock);
                                            if (PageLRU(page) && page_count(page) == 2)     /* false */
                                            spin_unlock(&pagemap_lru_lock);
if (put_page_testzero(page))                /* count 3 -> 2: false */
        __free_pages_ok(page, 0);
                                            if (put_page_testzero(page))    /* count 2 -> 1: false */
                                                    __free_pages_ok(page, 0);
Neither holder of a page reference sees the count at 2, and so the page
is left on the lru with count = 1. This won't happen often and such
pages will be recovered from the cold end of the list in due course.
The important question is: can this code ever remove a page from the lru
erroneously, leaving somebody holding a reference to a non-lru page? In
other words, can the test PageLRU(page) && page_count(page) == 2 return
a false positive? Well, when this test is true we can account for both
references: the one we own, and the one the lru list owns. Since
we hold the lru lock, the latter won't change. Nobody else has the
right to increment the page count, since they must inherit that right
from somebody who holds a reference, and there are none.
We could also do this:
void page_cache_release(struct page *page)
{
if (page_count(page) == 2) {
spin_lock(&pagemap_lru_lock);
if (PageLRU(page) && page_count(page) == 2) {
__lru_cache_del(page);
atomic_dec(&page->count);
}
spin_unlock(&pagemap_lru_lock);
}
if (put_page_testzero(page))
__free_pages_ok(page, 0);
}
Which avoids taking the lru lock sometimes in exchange for widening the
hole through which pages can end up with count = 1 on the lru list.
Let's run this through your race detector and see what happens.
--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 19:24 ` Andrew Morton
2002-08-26 19:34 ` Daniel Phillips
2002-08-26 19:48 ` Christian Ehrhardt
@ 2002-08-27 9:22 ` Christian Ehrhardt
2002-08-27 19:19 ` Andrew Morton
2 siblings, 1 reply; 46+ messages in thread
From: Christian Ehrhardt @ 2002-08-27 9:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: Daniel Phillips, lkml, linux-mm
On Mon, Aug 26, 2002 at 12:24:50PM -0700, Andrew Morton wrote:
> The flaw is in doing the put_page_testzero() outside of any locking
> which would prevent other CPUs from finding and "rescuing" the zero-refcount
> page.
>
> CPUA:
> if (put_page_testzero()) {
> /* Here's the window */
> spin_lock(lru_lock);
> list_del(page->lru);
>
> CPUB:
>
> spin_lock(lru_lock);
> page = list_entry(lru);
> page_cache_get(page); /* If this goes from 0->1, we die */
> ...
> page_cache_release(page); /* double free */
So what we want CPU B to do instead is
spin_lock(lru_lock);
page = list_entry(lru)
START ATOMIC
page_cache_get(page);
res = (page_count (page) == 1)
END ATOMIC
if (res) {
atomic_dec (&page->count);
continue; /* with next page */
}
...
page_cache_release (page);
I.e. we want to detect _atomically_ that we just raised the page count
from zero to one. My patch actually has a solution that implements the
needed atomic operation above by means of the atomic functions that we
currently have on all archs (it's called get_page_testzero and
should probably be called get_page_testone).
The more I think about this the more I think this is the way to go.
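With the offset-by-one count from my patch, the whole START ATOMIC/END
ATOMIC block above collapses into one primitive; roughly (sketch):

	/* a true count of 0 is stored as -1, so atomic_inc_and_test()
	 * fires exactly on the 0 -> 1 transition */
	#define get_page_testone(p)	atomic_inc_and_test(&(p)->count)

	if (get_page_testone(page)) {
		/* we resurrected a page already headed for
		 * __free_pages_ok(): undo and skip it */
		atomic_dec(&page->count);
		continue;	/* with next page */
	}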
regards Christian
--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-27 3:42 ` Benjamin LaHaise
@ 2002-08-27 4:37 ` Andrew Morton
0 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2002-08-27 4:37 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: Daniel Phillips, Christian Ehrhardt, lkml, linux-mm
Benjamin LaHaise wrote:
>
> On Mon, Aug 26, 2002 at 02:31:57PM -0700, Andrew Morton wrote:
> > I like the magical-removal-just-before-free, and my gut feel is that
> > it'll provide a cleaner end result.
>
> For the record, I'd rather see explicit removal everywhere. We received
> a number of complaints along the lines of "I run my app immediately after
> system startup, and it's fast, but the second time it's slower" due to
> the lazy page reclaim in early 2.4. Until there's a way to make LRU
> scanning faster than page allocation, it can't be lazy.
>
I think that's what Rik was referring to.
But here, "explicit removal" refers to running lru_cache_del() prior
to the final put_page, rather than within the context of the final
put_page. So it's a different thing.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 21:31 ` Andrew Morton
@ 2002-08-27 3:42 ` Benjamin LaHaise
2002-08-27 4:37 ` Andrew Morton
0 siblings, 1 reply; 46+ messages in thread
From: Benjamin LaHaise @ 2002-08-27 3:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: Daniel Phillips, Christian Ehrhardt, lkml, linux-mm
On Mon, Aug 26, 2002 at 02:31:57PM -0700, Andrew Morton wrote:
> I like the magical-removal-just-before-free, and my gut feel is that
> it'll provide a cleaner end result.
For the record, I'd rather see explicit removal everywhere. We received
a number of complaints along the lines of "I run my app immediately after
system startup, and it's fast, but the second time it's slower" due to
the lazy page reclaim in early 2.4. Until there's a way to make LRU
scanning faster than page allocation, it can't be lazy.
-ben
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 20:09 ` Daniel Phillips
2002-08-26 20:58 ` Christian Ehrhardt
@ 2002-08-26 21:31 ` Andrew Morton
2002-08-27 3:42 ` Benjamin LaHaise
1 sibling, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2002-08-26 21:31 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm
Daniel Phillips wrote:
>
> ...
> > If you think about who is going to remove the page from the lru you'll
> > see the problem.
>
> Nope, still don't see it. Whoever hits put_page_testzero frees the page,
> secure in the knowledge that there are no other references to it.
Sure. But this requires that the caller of page_cache_release() has
previously removed the page from the LRU. We (used to) do that for truncate
and page reclaim. But we did not do that for anon pages.
For anon pages, we perform magical LRU removal when the page refcount
goes to zero.
The fact that we performed explicit removal in one place, and magical removal
in the other was unfortunate. I nuked the explicit removal and made it
all magical (explicit removal in truncate_complete_page() was wrong anyway - the
page could have been rescued and anonymised by a concurrent pagefault and must
stay on the LRU).
Possibly, we could go back to explicit removal everywhere. Haven't
really looked at that, but I suspect we're back to a similar problem:
how do you unracily determine whether the page should be removed from
the LRU? Take ->page_table_lock and look at page_count(page)? Worried.
I like the magical-removal-just-before-free, and my gut feel is that
it'll provide a cleaner end result.
Making presence on the LRU contribute to page->count is attractive,
if only because it removes some irritating and expensive page_cache_gets
and puts from shrink_cache and refill_inactive. But for it to be useful,
we must perform explicit removal everywhere.
Making presence on the LRU contribute to page->count doesn't fundamentally
change anything of course - it offsets the current problems by one.
Then again, it would remove all page_cache_gets/releases from vmscan.c
and may thus make the race go away. That's a bit of a timebomb though.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 20:09 ` Daniel Phillips
@ 2002-08-26 20:58 ` Christian Ehrhardt
2002-08-27 16:48 ` Daniel Phillips
2002-08-26 21:31 ` Andrew Morton
1 sibling, 1 reply; 46+ messages in thread
From: Christian Ehrhardt @ 2002-08-26 20:58 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm
On Mon, Aug 26, 2002 at 10:09:38PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 22:00, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > > > + * does not increase the page count.
> > > > >
> > > > > Please remind me... why should it not?
> > > >
> > > > Pages that are only on the lru but not referenced by anyone are of no
> > > > use and we want to free them immediately. If we leave them on the lru
> > > > list with a page count of 1, someone else will have to walk the lru
> > > > list and remove pages that are only on the lru.
> > >
> > > I don't understand this argument. Suppose lru list membership is worth a
> > > page count of one. Then anyone who finds a page by way of the lru list can
> >
> > This does fix the double free problem but think of a typical anonymous
> > page at exit. The page is on the lru list and there is one reference held
> > by the pte. According to your scheme the pte reference would be freed
> > (obviously due to the exit) but the page would remain on the lru list.
> > However, there is no point in leaving the page on the lru list at all.
>
> If you want the page off the lru list at that point (which you probably do)
> then you take the lru lock and put_page_testzero.
Could you clarify what you mean by "at that point"? Especially how
do you plan to test for "this point"? Besides, it is illegal to use
the page after put_page_testzero (unless put_page_testzero returns true).
> > If you think about who is going to remove the page from the lru you'll
> > see the problem.
>
> Nope, still don't see it. Whoever hits put_page_testzero frees the page,
> secure in the knowledge that there are no other references to it.
Well yes, but we cannot remove the page from the lru atomically
at page_cache_release time if we follow your proposal. If you think we can,
show me your implementation of page_cache_release and I'll show
you where the races are (unless you do everything under the lru_lock
of course).
regards Christian
--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 20:00 ` Christian Ehrhardt
@ 2002-08-26 20:09 ` Daniel Phillips
2002-08-26 20:58 ` Christian Ehrhardt
2002-08-26 21:31 ` Andrew Morton
0 siblings, 2 replies; 46+ messages in thread
From: Daniel Phillips @ 2002-08-26 20:09 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm
On Monday 26 August 2002 22:00, Christian Ehrhardt wrote:
> On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > > + * does not increase the page count.
> > > >
> > > > Please remind me... why should it not?
> > >
> > > Pages that are only on the lru but not referenced by anyone are of no
> > > use and we want to free them immediately. If we leave them on the lru
> > > list with a page count of 1, someone else will have to walk the lru
> > > list and remove pages that are only on the lru.
> >
> > I don't understand this argument. Suppose lru list membership is worth a
> > page count of one. Then anyone who finds a page by way of the lru list can
>
> This does fix the double free problem but think of a typical anonymous
> page at exit. The page is on the lru list and there is one reference held
> by the pte. According to your scheme the pte reference would be freed
> (obviously due to the exit) but the page would remain on the lru list.
> However, there is no point in leaving the page on the lru list at all.
If you want the page off the lru list at that point (which you probably do)
then you take the lru lock and put_page_testzero.
> If you think about who is going to remove the page from the lru you'll
> see the problem.
Nope, still don't see it. Whoever hits put_page_testzero frees the page,
secure in the knowledge that there are no other references to it.
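A sketch of what that looks like (assuming lru membership holds one
reference; names as in current 2.5):

	/* whoever delists the page drops the lru's reference
	 * explicitly, under the lru lock */
	spin_lock(&pagemap_lru_lock);
	if (PageLRU(page)) {
		__lru_cache_del(page);
		if (put_page_testzero(page))	/* the lru's reference */
			__free_pages_ok(page, 0);
	}
	spin_unlock(&pagemap_lru_lock);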
--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 17:56 ` Daniel Phillips
2002-08-26 19:24 ` Andrew Morton
@ 2002-08-26 20:00 ` Christian Ehrhardt
2002-08-26 20:09 ` Daniel Phillips
1 sibling, 1 reply; 46+ messages in thread
From: Christian Ehrhardt @ 2002-08-26 20:00 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm
On Mon, Aug 26, 2002 at 07:56:52PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > + * A special Problem is the lru lists. Presence on one of these lists
> > > > + * does not increase the page count.
> > >
> > > Please remind me... why should it not?
> >
> > Pages that are only on the lru but not referenced by anyone are of no
> > use and we want to free them immediately. If we leave them on the lru
> > list with a page count of 1, someone else will have to walk the lru
> > list and remove pages that are only on the lru.
>
> I don't understand this argument. Suppose lru list membership is worth a
> page count of one. Then anyone who finds a page by way of the lru list can
This does fix the double free problem but think of a typical anonymous
page at exit. The page is on the lru list and there is one reference held
by the pte. According to your scheme the pte reference would be freed
(obviously due to the exit) but the page would remain on the lru list.
However, there is no point in leaving the page on the lru list at all.
If you think about who is going to remove the page from the lru you'll
see the problem.
regards Christian
--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 19:24 ` Andrew Morton
2002-08-26 19:34 ` Daniel Phillips
@ 2002-08-26 19:48 ` Christian Ehrhardt
2002-08-27 9:22 ` Christian Ehrhardt
2 siblings, 0 replies; 46+ messages in thread
From: Christian Ehrhardt @ 2002-08-26 19:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: Daniel Phillips, lkml, linux-mm
On Mon, Aug 26, 2002 at 12:24:50PM -0700, Andrew Morton wrote:
> The flaw is in doing the put_page_testzero() outside of any locking
Well, one could argue that doing the put_page_testzero outside of any
locking is a feature.
> [ ... ]
>
> 2.5.31-mm1 has tests which make this race enormously improbable [1],
> but it's still there.
Agreed. Both on the improbable and on the still there part.
> It's that `put' outside the lock which is the culprit. Normally, we
> handle that with atomic_dec_and_lock() (inodes) or by manipulating
> the refcount inside an area which has exclusion (page presence in
> pagecache).
>
> The sane, sensible and sucky way is to always take the lock:
>
> page_cache_release(page)
> {
> 	spin_lock(lru_lock);
> 	if (put_page_testzero(page)) {
> 		lru_cache_del(page);
> 		__free_pages_ok(page, 0);
> 	}
> 	spin_unlock(lru_lock);
> }
That would probably solve the problem.
> Because this provides exclusion from another CPU discovering the page
> via the LRU.
>
> So taking the above as the design principle, how can we speed it up?
> How to avoid taking the lock in every page_cache_release()? Maybe:
>
> page_cache_release(page)
> {
> 	if (page_count(page) == 1) {
> 		spin_lock(lru_lock);
> 		if (put_page_testzero(page)) {
> 			if (PageLRU(page))
> 				__lru_cache_del(page);
> 			__free_pages_ok(page, 0);
> 		}
> 		spin_unlock(lru_lock);
> 	} else {
> 		atomic_dec(&page->count);
> 	}
> }
However, this is an incredibly bad idea if the page is NOT on the lru.
Think of two instances of page_cache_release racing against each other.
This could result in a leaked page which is not on the LRU.
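Concretely, a hypothetical interleaving (page not on the LRU, count
starting at 2, both CPUs in the page_cache_release above):

	CPU A: page_count(page) == 2	/* != 1, takes the else branch */
	CPU B: page_count(page) == 2	/* != 1, takes the else branch */
	CPU A: atomic_dec(&page->count)	/* count == 1 */
	CPU B: atomic_dec(&page->count)	/* count == 0 */

Nobody calls __free_pages_ok, and because the page has no LRU entry,
page reclaim can never find it again.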
> This is nice and quick, but racy. Two concurrent page_cache_releases
> will create a zero-ref unfreed page which is on the LRU. These are
> rare, and can be mopped up in page reclaim.
>
> The above code will also work for pages which aren't on the LRU. It will
> take the lock unnecessarily for (say) slab pages. But if we put slab pages
> on the LRU then I suspect there are so few non-LRU pages left that it isn't
> worth bothering about this.
No it will not work. See above.
> [1] The race requires that the CPU running page_cache_release find a
> five instruction window against the CPU running shrink_cache. And
> that they be operating against the same page. And that the CPU
> running __page_cache_release() then take an interrupt in a 3-4
> instruction window. And that the interrupt take longer than the
> runtime for shrink_list. And that the page be the first page in
> the pagevec.
The interrupt can also be a preemption which might easily take long
enough. But I agree that the race is now rare. The real problem is
that the locking rules don't guarantee that there are no other racy
paths that we both missed.
regards Christian
--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 19:24 ` Andrew Morton
@ 2002-08-26 19:34 ` Daniel Phillips
2002-08-26 19:48 ` Christian Ehrhardt
2002-08-27 9:22 ` Christian Ehrhardt
2 siblings, 0 replies; 46+ messages in thread
From: Daniel Phillips @ 2002-08-26 19:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christian Ehrhardt, lkml, linux-mm
On Monday 26 August 2002 21:24, Andrew Morton wrote:
> Daniel Phillips wrote:
> > On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > > + * A special problem is the lru lists. Presence on one of these lists
> > > > > + * does not increase the page count.
> > > >
> > > > Please remind me... why should it not?
> > >
> > > Pages that are only on the lru but not referenced by anyone are of no
> > > use and we want to free them immediately. If we leave them on the lru
> > > list with a page count of 1, someone else will have to walk the lru
> > > list and remove pages that are only on the lru.
> >
> > I don't understand this argument. Suppose lru list membership is worth a
> > page count of one. Then anyone who finds a page by way of the lru list can
> > safely put_page_testzero and remove the page from the lru list. Anyone who
> > finds a page by way of a page table can likewise put_page_testzero and clear
> > the pte, or remove the mapping and pass the page to Andrew's pagevec
> > machinery, which will eventually do the put_page_testzero. Anyone who
> > removes a page from a radix tree will also do a put_page_testzero. Exactly
> > one of those paths will result in the page count reaching zero, which tells
> > us nobody else holds a reference and it's time for __free_pages_ok. The page
> > is thus freed immediately as soon as there are no more references to it, and
> > does not hang around on the lru list.
> >
> > Nobody has to lock against the page count. Each put_page_testzero caller
> > only locks the data structure from which it's removing the reference.
> >
> > This seems so simple, what is the flaw?
>
> The flaw is in doing the put_page_testzero() outside of any locking
> which would prevent other CPUs from finding and "rescuing" the zero-refcount
> page.
>
> CPUA:
> 	if (put_page_testzero(page)) {
> 		/* Here's the window */
> 		spin_lock(lru_lock);
> 		list_del(&page->lru);
According to my assumption that lru list membership is (should be) worth one
page count, if testzero triggers here the page is not on the lru.
> CPUB:
>
> 	spin_lock(lru_lock);
> 	page = list_entry(lru);
> 	page_cache_get(page);	/* If this goes from 0->1, we die */
It can't: because you found the page on the lru, its count must be at
least one (again, according to the assumption above).
> 	...
> 	page_cache_release(page);	/* double free */
I'd like to jump in and chase more solutions with you, but the above doesn't
prove your point, so I'm not ready to reject this one yet.
--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 17:56 ` Daniel Phillips
@ 2002-08-26 19:24 ` Andrew Morton
2002-08-26 19:34 ` Daniel Phillips
` (2 more replies)
2002-08-26 20:00 ` Christian Ehrhardt
1 sibling, 3 replies; 46+ messages in thread
From: Andrew Morton @ 2002-08-26 19:24 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Christian Ehrhardt, lkml, linux-mm
Daniel Phillips wrote:
>
> On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> > On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > > + * A special problem is the lru lists. Presence on one of these lists
> > > > + * does not increase the page count.
> > >
> > > Please remind me... why should it not?
> >
> > Pages that are only on the lru but not referenced by anyone are of no
> > use and we want to free them immediately. If we leave them on the lru
> > list with a page count of 1, someone else will have to walk the lru
> > list and remove pages that are only on the lru.
>
> I don't understand this argument. Suppose lru list membership is worth a
> page count of one. Then anyone who finds a page by way of the lru list can
> safely put_page_testzero and remove the page from the lru list. Anyone who
> finds a page by way of a page table can likewise put_page_testzero and clear
> the pte, or remove the mapping and pass the page to Andrew's pagevec
> machinery, which will eventually do the put_page_testzero. Anyone who
> removes a page from a radix tree will also do a put_page_testzero. Exactly
> one of those paths will result in the page count reaching zero, which tells
> us nobody else holds a reference and it's time for __free_pages_ok. The page
> is thus freed immediately as soon as there are no more references to it, and
> does not hang around on the lru list.
>
> Nobody has to lock against the page count. Each put_page_testzero caller
> only locks the data structure from which it's removing the reference.
>
> This seems so simple, what is the flaw?
The flaw is in doing the put_page_testzero() outside of any locking
which would prevent other CPUs from finding and "rescuing" the zero-refcount
page.
CPUA:
	if (put_page_testzero(page)) {
		/* Here's the window */
		spin_lock(lru_lock);
		list_del(&page->lru);

CPUB:
	spin_lock(lru_lock);
	page = list_entry(lru);
	page_cache_get(page);	/* If this goes from 0->1, we die */
	...
	page_cache_release(page);	/* double free */
2.5.31-mm1 has tests which make this race enormously improbable [1],
but it's still there.
It's that `put' outside the lock which is the culprit. Normally, we
handle that with atomic_dec_and_lock() (inodes) or by manipulating
the refcount inside an area which has exclusion (page presence in
pagecache).
The sane, sensible and sucky way is to always take the lock:
page_cache_release(page)
{
	spin_lock(lru_lock);
	if (put_page_testzero(page)) {
		lru_cache_del(page);
		__free_pages_ok(page, 0);
	}
	spin_unlock(lru_lock);
}
Because this provides exclusion from another CPU discovering the page
via the LRU.
So taking the above as the design principle, how can we speed it up?
How to avoid taking the lock in every page_cache_release()? Maybe:
page_cache_release(page)
{
	if (page_count(page) == 1) {
		spin_lock(lru_lock);
		if (put_page_testzero(page)) {
			if (PageLRU(page))
				__lru_cache_del(page);
			__free_pages_ok(page, 0);
		}
		spin_unlock(lru_lock);
	} else {
		atomic_dec(&page->count);
	}
}
This is nice and quick, but racy. Two concurrent page_cache_releases
will create a zero-ref unfreed page which is on the LRU. These are
rare, and can be mopped up in page reclaim.
The above code will also work for pages which aren't on the LRU. It will
take the lock unnecessarily for (say) slab pages. But if we put slab pages
on the LRU then I suspect there are so few non-LRU pages left that it isn't
worth bothering about this.
[1] The race requires that the CPU running page_cache_release find a
five instruction window against the CPU running shrink_cache. And
that they be operating against the same page. And that the CPU
running __page_cache_release() then take an interrupt in a 3-4
instruction window. And that the interrupt take longer than the
runtime for shrink_list. And that the page be the first page in
the pagevec.
It's a heat-death-of-the-universe race, but even if it were to be
ignored, the current code is too complex.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 15:29 ` Christian Ehrhardt
@ 2002-08-26 17:56 ` Daniel Phillips
2002-08-26 19:24 ` Andrew Morton
2002-08-26 20:00 ` Christian Ehrhardt
0 siblings, 2 replies; 46+ messages in thread
From: Daniel Phillips @ 2002-08-26 17:56 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: Andrew Morton, lkml, linux-mm
On Monday 26 August 2002 17:29, Christian Ehrhardt wrote:
> On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> > On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > > + * A special problem is the lru lists. Presence on one of these lists
> > > + * does not increase the page count.
> >
> > Please remind me... why should it not?
>
> Pages that are only on the lru but not referenced by anyone are of no
> use and we want to free them immediately. If we leave them on the lru
> list with a page count of 1, someone else will have to walk the lru
> list and remove pages that are only on the lru.
I don't understand this argument. Suppose lru list membership is worth a
page count of one. Then anyone who finds a page by way of the lru list can
safely put_page_testzero and remove the page from the lru list. Anyone who
finds a page by way of a page table can likewise put_page_testzero and clear
the pte, or remove the mapping and pass the page to Andrew's pagevec
machinery, which will eventually do the put_page_testzero. Anyone who
removes a page from a radix tree will also do a put_page_testzero. Exactly
one of those paths will result in the page count reaching zero, which tells
us nobody else holds a reference and it's time for __free_pages_ok. The page
is thus freed immediately as soon as there are no more references to it, and
does not hang around on the lru list.
Nobody has to lock against the page count. Each put_page_testzero caller
only locks the data structure from which it's removing the reference.
This seems so simple, what is the flaw?
--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 14:22 ` Daniel Phillips
@ 2002-08-26 15:29 ` Christian Ehrhardt
2002-08-26 17:56 ` Daniel Phillips
0 siblings, 1 reply; 46+ messages in thread
From: Christian Ehrhardt @ 2002-08-26 15:29 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, lkml, linux-mm
On Mon, Aug 26, 2002 at 04:22:50PM +0200, Daniel Phillips wrote:
> On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> > + * A special problem is the lru lists. Presence on one of these lists
> > + * does not increase the page count.
>
> Please remind me... why should it not?
Pages that are only on the lru but not referenced by anyone are of no
use and we want to free them immediately. If we leave them on the lru
list with a page count of 1, someone else will have to walk the lru
list and remove pages that are only on the lru. One could argue that
try_to_free_pages could do this but try_to_free_pages will process the
pages in lru order and push out other pages first.
The next suggestion that comes to mind is: Let's have some magic in
page_cache_release that will remove the page from the lru list if
it is actually dead. However, this raises the question: How do we detect
that a page is now dead? The answer is something along the lines of
	if (put_page_testzero (page)) {
		__free_pages_ok (page, 0);
		return;
	}
	spin_lock_irq(&pagemap_lru_lock);
	if (PageLRU(page) && (page_count(page) == 1)) {
		lru_cache_del (page);
		spin_unlock_irq(&pagemap_lru_lock);
		page_cache_release (page);
		return;
	}
	spin_unlock_irq(&pagemap_lru_lock);
	return;
The sad truth is that this solution has all the same races that
we have now, plus it makes the fast path (decreasing the page count
to something non-zero) slower. One problem in the above would be
that the last reference might well not be due to the lru
cache, i.e. at the time we call PageLRU(page) the page might
have been freed by another processor.
I know the idea is appealing (see one of my earlier mails on the
subject ;-) ) but it doesn't solve the problem.
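To make the last point concrete, a hypothetical interleaving (page
count starts at 2; the lru holds no count in the current scheme):

	CPU A: put_page_testzero(page)	/* count 2 -> 1, not zero */
	CPU B: put_page_testzero(page)	/* count 1 -> 0 */
	CPU B: takes the page off the lru and frees it
	CPU A: PageLRU(page)		/* touches a page that is gone */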
regards Christian Ehrhardt
--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 9:10 ` Christian Ehrhardt
@ 2002-08-26 14:22 ` Daniel Phillips
2002-08-26 15:29 ` Christian Ehrhardt
0 siblings, 1 reply; 46+ messages in thread
From: Daniel Phillips @ 2002-08-26 14:22 UTC (permalink / raw)
To: Christian Ehrhardt, Andrew Morton; +Cc: lkml, linux-mm
On Monday 26 August 2002 11:10, Christian Ehrhardt wrote:
> + * A special problem is the lru lists. Presence on one of these lists
> + * does not increase the page count.
Please remind me... why should it not?
--
Daniel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 1:52 ` Andrew Morton
@ 2002-08-26 9:10 ` Christian Ehrhardt
2002-08-26 14:22 ` Daniel Phillips
0 siblings, 1 reply; 46+ messages in thread
From: Christian Ehrhardt @ 2002-08-26 9:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml, linux-mm
On Sun, Aug 25, 2002 at 06:52:55PM -0700, Andrew Morton wrote:
> Christian Ehrhardt wrote:
> >
> > On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
> > >
> > > I've uploaded a rollup of pending fixes and feature work
> > > against 2.5.31 to
> > >
> > > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
> > >
> > > The rolled up patch there is suitable for ongoing testing and
> > > development. The individual patches are in the broken-out/
> > > directory and should all be documented.
> >
> > Sorry, but we still have the page release race in multiple places.
> > Look at the following (page starts with page_count == 1):
> >
>
> So we do. It's a hugely improbable race, so there's no huge rush
> to fix it.
Actually the race seems to happen in real life (it does explain the
pte.chain != NULL BUG) and it is not that improbable with preempt.
> Looks like the same race is present in -ac kernels,
> actually, if add_to_swap() fails. Also perhaps 2.4 is exposed if
> swap_writepage() is synchronous, and page reclaim races with
> zap_page_range. ho-hum.
I didn't check each kernel, but I agree that most of the recent kernels
have the potential for this race. Your tree just happened to be the
one where I found it first.
> What I'm inclined to do there is to change __page_cache_release()
> to not attempt to free the page at all. Just let it sit on the
> LRU until page reclaim encounters it. With the anon-free-via-pagevec
> patch, very, very, very few pages actually get their final release in
> __page_cache_release() - zero on uniprocessor, I expect.
It's not just a problem with __page_cache_release, but yes, it seems
to be an SMP-only race.
> And change pagevec_release() to take the LRU lock before dropping
> the refcount on the pages.
>
> That means there will need to be two flavours of pagevec_release():
> one which expects the pages to become freeable (and takes the LRU
> lock in anticipation of this). And one which doesn't expect the
> pages to become freeable. The latter will leave the occasional
> zero-count page on the LRU, as above.
>
> Sound sane?
So the rules would be:
* if you bring the page count to zero, call __free_pages_ok unless the
page is on the lru.
* if someone (page reclaim) walking the lru finds a page with page count zero,
remove it from the lru and call __free_pages_ok.
This requires that ANYTHING that ends up calling put_page_testzero
must happen under the lru lock. This doesn't sound like a good idea,
but it seems to be possible to do it race-free.
I'd actually go for the following (patch below, in an "it compiles"
state but otherwise untested, to illustrate what I'm talking about):
The basic idea is to move the logic into page_cache_get. The rules
would be:
1. if you bring the page count to zero (using put_page_testzero) it
is your job to call __free_pages_ok eventually. Before doing so
make sure that the page is no longer on the lru.
2. You may only call page_cache_get to duplicate an existing reference,
i.e. page_cache_get could be made to BUG_ON(page_count(page) == 0).
3. If you got a pointer to a page without holding a reference (this
is only allowed to happen if we found the pointer on an lru list)
call page_cache_get_lru UNDER the lru lock and just leave the page
alone if that would resurrect the page. page_cache_get_lru would
basically look like this (implementation details below):
int page_cache_get_lru (struct page * page) {
	if (!atomic_inc_and_test_for_one (&page->count))
		return 1;
	atomic_dec (&page->count);
	return 0;
}
Proof of correctness:
A page is called dead if its page count has reached zero at some point
(no matter what the page count currently is). Once a page is dead there
can be at most two pointers to the page: one held by the lru and the
other one held by the thread freeing the page. Any thread accessing the
page via the lru list will first call page_cache_get_lru under the lru
lock; the thread freeing the page will not read the page count anymore.
As page_cache_get_lru will not resurrect the page, there will never
be a page count != 0 visible outside the lru lock on a dead page.
This means that each thread trying to access the dead page via the lru
list will detect that the page is dead and leave it alone. It follows
that each page is freed at most once.
Suppose a page could be leaked under these rules. This would require
someone calling __put_page (or atomic_dec (&page->count)) to bring the
page count to zero on a page that is not already dead. However, the only place
where this happens is in page_cache_get_lru and it only happens if the
page is already dead.
Now let's look at the ugly part: implementation.
The basic problem is that we don't have an atomic_inc_and_test_for_one
function and it is unlikely that we'll get one on all architectures. The
solution (and this is the ugly part) is a change in the semantics of
page->count. The value -1 now means no reference, 0 means one reference etc.
This way we can define
put_page_testzero as atomic_add_negative (-1, &page->count); and
get_page_testzero as atomic_inc_and_test (&page->count);
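Spelled out, assuming the patch below (this table is my summary, not
part of the patch):

	raw page->count		page_count()	meaning
	      -1		     0		dead, no references
	       0		     1		one reference
	       1		     2		two references

	put_page_testzero: atomic_add_negative(-1, &page->count) is
	true when the raw value goes negative, i.e. the count drops
	from 1 to 0.
	get_page_testzero: atomic_inc_and_test(&page->count) is true
	when the raw value becomes 0, i.e. we just resurrected a dead
	page (count 0 -> 1).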
Here's the promised (untested) patch against bk7 to illustrate what
I'm talking about:
diff -ur linux-2.5.31-bk7/include/linux/mm.h linux-2.5.31-cae/include/linux/mm.h
--- linux-2.5.31-bk7/include/linux/mm.h Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/include/linux/mm.h Sun Aug 25 21:40:57 2002
@@ -184,6 +184,11 @@
/*
* Methods to modify the page usage count.
*
+ * NOTE: Real page counts start at -1 for no reference. This is a hack
+ * to be able to implement get_page_testzero with the existing portable
+ * atomic functions. The value exposed via set_page_count and page_count
+ * is (1+page->count).
+ *
* What counts for a page usage:
* - cache mapping (page->mapping)
* - private data (page->private)
@@ -192,12 +197,25 @@
*
* Also, many kernel routines increase the page count before a critical
* routine so they can be sure the page doesn't go away from under them.
+ *
+ * A special problem is the lru lists. Presence on one of these lists
+ * does not increase the page count. The FIRST thread that brings the
+ * page count back to zero is responsible to remove the page from the
+ * lru list and actually free it (__free_pages_ok). This means that we
+ * can only get a reference to a page that is on a lru list, if this
+ * page is not already dead, i.e. about to be removed from the lru list.
+ * To do this we call get_page_testzero which will increment the page
+ * count and return true if we just resurrected the page, i.e. the real
+ * page->count is now zero, indicating one user. In this case we drop
+ * the reference again using __put_page. Both calls must happen under
+ * the lru lock.
*/
#define get_page(p) atomic_inc(&(p)->count)
#define __put_page(p) atomic_dec(&(p)->count)
-#define put_page_testzero(p) atomic_dec_and_test(&(p)->count)
-#define page_count(p) atomic_read(&(p)->count)
-#define set_page_count(p,v) atomic_set(&(p)->count, v)
+#define put_page_testzero(p) atomic_add_negative(-1, &(p)->count)
+#define page_count(p) (1+atomic_read(&(p)->count))
+#define set_page_count(p,v) atomic_set(&(p)->count, v-1)
+#define get_page_testzero(p) atomic_inc_and_test(&(p)->count)
extern void FASTCALL(__page_cache_release(struct page *));
#define put_page(p) \
do { \
diff -ur linux-2.5.31-bk7/include/linux/pagemap.h linux-2.5.31-cae/include/linux/pagemap.h
--- linux-2.5.31-bk7/include/linux/pagemap.h Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/include/linux/pagemap.h Sun Aug 25 21:56:30 2002
@@ -22,7 +22,43 @@
#define PAGE_CACHE_MASK PAGE_MASK
#define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
-#define page_cache_get(x) get_page(x)
+/*
+ * Get a reference to the page. This function must not be called on
+ * a dead page, i.e. a page that has page count zero. If the page is
+ * still on a lru_list use page_cache_get_lru instead.
+ */
+static inline void page_cache_get (struct page * page)
+{
+ BUG_ON(page_count(page) == 0);
+ get_page(page);
+}
+
+/*
+ * Try to get a reference to page that we found on an lru list.
+ * The lru lists may contain pages with page count zero. We must
+ * not take a reference to such a page because it is already about
+ * to be freed (once it is off the lru lists). If we'd take a reference
+ * the page would eventually be freed twice.
+ *
+ * The return value is true if we successfully incremented the page count.
+ *
+ * This function must be called with the lru lock held.
+ */
+static inline int page_cache_get_lru (struct page * page)
+{
+ /*
+ * Yes there is a window where the page count is not zero
+ * even though the page is dead. This is one of the reasons
+ * why the caller must hold the lru lock. Due to the lru_lock
+ * only the thread that is about to free the page can have
+ * a reference to this page. This thread will not test the
+ * page count anymore.
+ */
+ if (!get_page_testzero (page))
+ return 1;
+ __put_page (page);
+ return 0;
+}
static inline void page_cache_release(struct page *page)
{
diff -ur linux-2.5.31-bk7/mm/swap.c linux-2.5.31-cae/mm/swap.c
--- linux-2.5.31-bk7/mm/swap.c Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/mm/swap.c Sun Aug 25 11:28:55 2002
@@ -77,7 +77,6 @@
void __page_cache_release(struct page *page)
{
unsigned long flags;
- BUG_ON(page_count(page) != 0);
spin_lock_irqsave(&_pagemap_lru_lock, flags);
if (TestClearPageLRU(page)) {
@@ -86,11 +85,8 @@
else
del_page_from_inactive_list(page);
}
- if (page_count(page) != 0)
- page = NULL;
spin_unlock_irqrestore(&_pagemap_lru_lock, flags);
- if (page)
- __free_pages_ok(page, 0);
+ __free_pages_ok(page, 0);
}
/*
@@ -131,8 +127,7 @@
else
del_page_from_inactive_list(page);
}
- if (page_count(page) == 0)
- pagevec_add(&pages_to_free, page);
+ pagevec_add(&pages_to_free, page);
}
if (lock_held)
spin_unlock_irq(&_pagemap_lru_lock);
diff -ur linux-2.5.31-bk7/mm/vmscan.c linux-2.5.31-cae/mm/vmscan.c
--- linux-2.5.31-bk7/mm/vmscan.c Sun Aug 25 18:30:38 2002
+++ linux-2.5.31-cae/mm/vmscan.c Sun Aug 25 21:44:07 2002
@@ -92,6 +92,10 @@
return page_count(page) - !!PagePrivate(page) == 2;
}
+/*
+ * The caller must hold a reference to each page in the list. We drop
+ * this reference if and only if we remove the page from the page_list.
+ */
static /* inline */ int
shrink_list(struct list_head *page_list, int nr_pages, zone_t *classzone,
unsigned int gfp_mask, int priority, int *max_scan)
@@ -295,24 +299,23 @@
spin_lock_irq(&_pagemap_lru_lock);
while (max_scan > 0 && nr_pages > 0) {
struct page *page;
+ struct list_head * curr;
int n = 0;
- while (n < nr_to_process && !list_empty(&inactive_list)) {
- page = list_entry(inactive_list.prev, struct page, lru);
+ curr = inactive_list.prev;
+ while (n < nr_to_process && (&inactive_list != curr)) {
+ page = list_entry(curr, struct page, lru);
- prefetchw_prev_lru_page(page, &inactive_list, flags);
+ prefetchw_prev_lru_page(page, curr, flags);
+ curr = curr->prev;
+ /* Is the page already dead ? */
+ if (!page_cache_get_lru (page))
+ continue;
if (!TestClearPageLRU(page))
BUG();
list_del(&page->lru);
- if (page_count(page) == 0) {
- /* It is currently in pagevec_release() */
- SetPageLRU(page);
- list_add(&page->lru, &inactive_list);
- continue;
- }
list_add(&page->lru, &page_list);
- page_cache_get(page);
n++;
}
spin_unlock_irq(&_pagemap_lru_lock);
@@ -381,15 +384,19 @@
LIST_HEAD(l_active); /* Pages to go onto the active_list */
struct page *page;
struct pagevec pvec;
+ struct list_head *curr;
lru_add_drain();
spin_lock_irq(&_pagemap_lru_lock);
- while (nr_pages && !list_empty(&active_list)) {
- page = list_entry(active_list.prev, struct page, lru);
- prefetchw_prev_lru_page(page, &active_list, flags);
+ curr = active_list.prev;
+ while (nr_pages && (&active_list != curr)) {
+ page = list_entry(curr, struct page, lru);
+ prefetchw_prev_lru_page(page, curr, flags);
+ curr = curr->prev;
+ if (!page_cache_get_lru (page))
+ continue;
if (!TestClearPageLRU(page))
BUG();
- page_cache_get(page);
list_move(&page->lru, &l_hold);
nr_pages--;
}
--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 2:32 ` Andrew Morton
@ 2002-08-26 3:06 ` Steven Cole
0 siblings, 0 replies; 46+ messages in thread
From: Steven Cole @ 2002-08-26 3:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Martin J. Bligh, lkml, linux-mm
On Sun, 2002-08-25 at 20:32, Andrew Morton wrote:
> "Martin J. Bligh" wrote:
> >
> > >> > kjournald: page allocation failure. order:0, mode:0x0
> > >>
> > >> I've seen this before, but am curious how we ever passed
> > >> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> > >> that does this?
> > >
> > > Could be anywhere, really. A network interrupt doing GFP_ATOMIC
> > > while kjournald is executing. A radix-tree node allocation
> > > on the add-to-swap path perhaps. (The swapout failure messages
> > > aren't supposed to come out, but mempool_alloc() stomps on the
> > > caller's setting of PF_NOWARN.)
> > >
> > > Or:
> > >
> > > mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
> > > 89
> >
> > No, GFP_ATOMIC is not 0:
> >
>
> It's mempool_alloc(GFP_NOIO) or such. mempool_alloc() strips
> __GFP_WAIT|__GFP_IO on the first attempt.
>
> It also disables the printk, so maybe I just dunno ;) show_stack()
> would tell.
>
The "kjournald: page allocation failure. order:0, mode:0x0" message and
"pdflush: page allocation failure. order:0, mode:0x0" occurred only once
each on my dual p3 scsi ext3 test box running 2.5.31-mm1. So, I added
something like this:
--- page_alloc.c.orig	Thu Aug 22 17:27:32 2002
+++ page_alloc.c	Thu Aug 22 17:29:24 2002
@@ -388,6 +388,8 @@
 			printk("%s: page allocation failure."
 				" order:%d, mode:0x%x\n",
 				current->comm, order, gfp_mask);
+			if (gfp_mask == 0)
+				BUG();
 		}
 		return NULL;
 	}
and continued testing on Friday with no repeats of the "page allocation failure"
messages. I obtained a second dual p3 ext3 test box (ide this time) and left both
boxes running 2.5.31-mm1 and the dbench 1..128 stress test scripted to rerun many
times over the weekend. Due to a couple of firewalls, I can't look at those boxes
from here, but I'll let you know what happened in about 10 to 11 hours.
Cheers,
Steven
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 2:08 ` Martin J. Bligh
@ 2002-08-26 2:32 ` Andrew Morton
2002-08-26 3:06 ` Steven Cole
0 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2002-08-26 2:32 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Steven Cole, lkml, linux-mm
"Martin J. Bligh" wrote:
>
> >> > kjournald: page allocation failure. order:0, mode:0x0
> >>
> >> I've seen this before, but am curious how we ever passed
> >> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> >> that does this?
> >
> > Could be anywhere, really. A network interrupt doing GFP_ATOMIC
> > while kjournald is executing. A radix-tree node allocation
> > on the add-to-swap path perhaps. (The swapout failure messages
> > aren't supposed to come out, but mempool_alloc() stomps on the
> > caller's setting of PF_NOWARN.)
> >
> > Or:
> >
> > mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
> > 89
>
> No, GFP_ATOMIC is not 0:
>
It's mempool_alloc(GFP_NOIO) or such. mempool_alloc() strips
__GFP_WAIT|__GFP_IO on the first attempt.
It also disables the printk, so maybe I just dunno ;) show_stack()
would tell.
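Worked through with the flag values quoted elsewhere in this thread
(my arithmetic, assuming 2.5.31 headers):

	GFP_NOIO           == __GFP_WAIT == 0x10
	first-attempt mask == GFP_NOIO & ~(__GFP_WAIT | __GFP_IO)
	                   == 0x10 & ~(0x10 | 0x40) == 0x0

which is exactly the mode:0x0 in the report.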
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-22 16:06 ` Martin J. Bligh
2002-08-22 19:45 ` Steven Cole
@ 2002-08-26 2:15 ` Andrew Morton
2002-08-26 2:08 ` Martin J. Bligh
1 sibling, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2002-08-26 2:15 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Steven Cole, lkml, linux-mm
"Martin J. Bligh" wrote:
>
> > kjournald: page allocation failure. order:0, mode:0x0
>
> I've seen this before, but am curious how we ever passed
> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> that does this?
Could be anywhere, really. A network interrupt doing GFP_ATOMIC
while kjournald is executing. A radix-tree node allocation
on the add-to-swap path perhaps. (The swapout failure messages
aren't supposed to come out, but mempool_alloc() stomps on the
caller's setting of PF_NOWARN.)
Or:
mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
89
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-26 2:15 ` Andrew Morton
@ 2002-08-26 2:08 ` Martin J. Bligh
2002-08-26 2:32 ` Andrew Morton
0 siblings, 1 reply; 46+ messages in thread
From: Martin J. Bligh @ 2002-08-26 2:08 UTC (permalink / raw)
To: Andrew Morton; +Cc: Steven Cole, lkml, linux-mm
>> > kjournald: page allocation failure. order:0, mode:0x0
>>
>> I've seen this before, but am curious how we ever passed
>> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
>> that does this?
>
> Could be anywhere, really. A network interrupt doing GFP_ATOMIC
> while kjournald is executing. A radix-tree node allocation
> on the add-to-swap path perhaps. (The swapout failure messages
> aren't supposed to come out, but mempool_alloc() stomps on the
> caller's setting of PF_NOWARN.)
>
> Or:
>
> mnm:/usr/src/25> grep -r GFP_ATOMIC drivers/scsi/*.c | wc -l
> 89
No, GFP_ATOMIC is not 0:
#define __GFP_HIGH 0x20 /* Should access emergency pools? */
#define GFP_ATOMIC (__GFP_HIGH)
Looking at all the options:
#define __GFP_WAIT 0x10 /* Can wait and reschedule? */
#define __GFP_HIGH 0x20 /* Should access emergency pools? */
#define __GFP_IO 0x40 /* Can start low memory physical IO? */
#define __GFP_HIGHIO 0x80 /* Can start high mem physical IO? */
#define __GFP_FS 0x100 /* Can call down to low-level FS? */
What worries me is that 0 seems to mean "you can't do anything
to try and free it, but you can't access the emergency pools either".
Seems doomed to failure to me. And the standard sets we have are
#define GFP_NOHIGHIO ( __GFP_WAIT | __GFP_IO)
#define GFP_NOIO ( __GFP_WAIT)
#define GFP_NOFS ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO)
#define GFP_ATOMIC (__GFP_HIGH)
#define GFP_USER ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_HIGHUSER ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS | __GFP_HIGHMEM)
#define GFP_KERNEL ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_NFS ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_KSWAPD ( __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
So I think someone's screwed something up, and this is accidental.
Or I'm just totally misunderstanding this, which is perfectly
possible.
M.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-22 11:28 ` Christian Ehrhardt
@ 2002-08-26 1:52 ` Andrew Morton
2002-08-26 9:10 ` Christian Ehrhardt
0 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2002-08-26 1:52 UTC (permalink / raw)
To: Christian Ehrhardt; +Cc: lkml, linux-mm
Christian Ehrhardt wrote:
>
> On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
> >
> > I've uploaded a rollup of pending fixes and feature work
> > against 2.5.31 to
> >
> > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
> >
> > The rolled up patch there is suitable for ongoing testing and
> > development. The individual patches are in the broken-out/
> > directory and should all be documented.
>
> Sorry, but we still have the page release race in multiple places.
> Look at the following (page starts with page_count == 1):
>
So we do. It's a hugely improbable race, so there's no huge rush
to fix it. Looks like the same race is present in -ac kernels,
actually, if add_to_swap() fails. Also perhaps 2.4 is exposed if
swap_writepage() is synchronous, and page reclaim races with
zap_page_range. ho-hum.
What I'm inclined to do there is to change __page_cache_release()
to not attempt to free the page at all. Just let it sit on the
LRU until page reclaim encounters it. With the anon-free-via-pagevec
patch, very, very, very few pages actually get their final release in
__page_cache_release() - zero on uniprocessor, I expect.
And change pagevec_release() to take the LRU lock before dropping
the refcount on the pages.
That means there will need to be two flavours of pagevec_release():
one which expects the pages to become freeable (and takes the LRU
lock in anticipation of this). And one which doesn't expect the
pages to become freeable. The latter will leave the occasional
zero-count page on the LRU, as above.
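Something like this, perhaps (a sketch only; the names and the struct
pagevec layout are assumptions, not code from the patch set):

	/* Flavour 1: caller expects final releases.  Hold the LRU lock
	 * across the puts so a page whose count reaches zero can be
	 * pulled off the list and freed before anyone rediscovers it. */
	void pagevec_release_freeing(struct pagevec *pvec)
	{
		int i;

		spin_lock_irq(&_pagemap_lru_lock);
		for (i = 0; i < pvec->nr; i++) {
			struct page *page = pvec->pages[i];

			if (put_page_testzero(page)) {
				if (TestClearPageLRU(page))
					list_del(&page->lru);
				__free_pages_ok(page, 0);
			}
		}
		spin_unlock_irq(&_pagemap_lru_lock);
		pvec->nr = 0;
	}

	/* Flavour 2: final releases not expected.  No lock; the rare
	 * page whose count does reach zero here just stays on the LRU
	 * until shrink_cache() finds it with count == 0. */
	void pagevec_release_nonfreeing(struct pagevec *pvec)
	{
		int i;

		for (i = 0; i < pvec->nr; i++)
			atomic_dec(&pvec->pages[i]->count);
		pvec->nr = 0;
	}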
Sound sane?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-22 16:06 ` Martin J. Bligh
@ 2002-08-22 19:45 ` Steven Cole
2002-08-26 2:15 ` Andrew Morton
1 sibling, 0 replies; 46+ messages in thread
From: Steven Cole @ 2002-08-22 19:45 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Andrew Morton, lkml, linux-mm
On Thu, 2002-08-22 at 10:06, Martin J. Bligh wrote:
> > kjournald: page allocation failure. order:0, mode:0x0
>
> I've seen this before, but am curious how we ever passed
> a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
> that does this?
>
> Thanks,
>
> M.
I ran dbench 1..128 on 2.5.31-mm1 several more times with nothing
unusual happening, and then got this from pdflush with dbench 96.
pdflush: page allocation failure. order:0, mode:0x0
FWIW, this 2.5.31-mm1 kernel is SMP, HIGHMEM4G, no PREEMPT.
Steven
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-22 15:59 ` Steven Cole
@ 2002-08-22 16:06 ` Martin J. Bligh
2002-08-22 19:45 ` Steven Cole
2002-08-26 2:15 ` Andrew Morton
0 siblings, 2 replies; 46+ messages in thread
From: Martin J. Bligh @ 2002-08-22 16:06 UTC (permalink / raw)
To: Steven Cole, Andrew Morton; +Cc: lkml, linux-mm
> kjournald: page allocation failure. order:0, mode:0x0
I've seen this before, but am curious how we ever passed
a gfpmask (aka mode) of 0 to __alloc_pages? Can't see anywhere
that does this?
Thanks,
M.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-22 2:29 MM patches against 2.5.31 Andrew Morton
2002-08-22 11:28 ` Christian Ehrhardt
@ 2002-08-22 15:59 ` Steven Cole
2002-08-22 16:06 ` Martin J. Bligh
1 sibling, 1 reply; 46+ messages in thread
From: Steven Cole @ 2002-08-22 15:59 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml, linux-mm
On Wed, 2002-08-21 at 20:29, Andrew Morton wrote:
> I've uploaded a rollup of pending fixes and feature work
> against 2.5.31 to
>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
>
> The rolled up patch there is suitable for ongoing testing and
> development. The individual patches are in the broken-out/
> directory and should all be documented.
The good news: I ran my dbench 1..128 stress test and for the first
time since 2.5.31-vanilla there were _no_ BUG()s reported at all.
The other news: from dmesg:
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,3), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald: page allocation failure. order:0, mode:0x0
The kjournald failure message came out with dbench 48 running on an ext3
partition. The test continued with only this one instance of this
message.
Steven
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: MM patches against 2.5.31
2002-08-22 2:29 MM patches against 2.5.31 Andrew Morton
@ 2002-08-22 11:28 ` Christian Ehrhardt
2002-08-26 1:52 ` Andrew Morton
2002-08-22 15:59 ` Steven Cole
1 sibling, 1 reply; 46+ messages in thread
From: Christian Ehrhardt @ 2002-08-22 11:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml, linux-mm
On Wed, Aug 21, 2002 at 07:29:04PM -0700, Andrew Morton wrote:
>
> I've uploaded a rollup of pending fixes and feature work
> against 2.5.31 to
>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
>
> The rolled up patch there is suitable for ongoing testing and
> development. The individual patches are in the broken-out/
> directory and should all be documented.
Sorry, but we still have the page release race in multiple places.
Look at the following (page starts with page_count == 1):
Processor 1:	refill_inactive, lines 378-395.  As page count == 1
		we'll continue with line 401.

Processor 2:	__pagevec_release, line 138, calls release_pages.
		release_pages, lines 100-111: put_page_testzero brings
		the page count to 0 and we'll continue at line 114.
		Note that this may happen while another processor
		holds the lru lock, i.e. there is no point in checking
		for page count == 0 with the lru lock held, because
		the lru lock doesn't protect against decrements of the
		page count after the check.

Processor 1:	line 401: page_cache_get resurrects the page; the
		page count is now 1.  Lines 402-448; line 448 calls
		__pagevec_release, which calls release_pages.
		release_pages, lines 100-111: put_page_testzero brings
		the page count back to 0 (!!!), i.e. we continue at
		line 114.  Lines 114-123: the page count == 0 check in
		line 123 succeeds and the page is returned to the
		buddy allocator.

Processor 2:	lines 114-123.  The page count == 0 check in line 123
		succeeds, i.e. the page is returned to the buddy
		allocator a second time.  ===> BOOM
Neither the lru lock nor any of the page count == 0 checks can
prevent this from happening.
regards Christian
--
THAT'S ALL FOLKS!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
* MM patches against 2.5.31
@ 2002-08-22 2:29 Andrew Morton
2002-08-22 11:28 ` Christian Ehrhardt
2002-08-22 15:59 ` Steven Cole
0 siblings, 2 replies; 46+ messages in thread
From: Andrew Morton @ 2002-08-22 2:29 UTC (permalink / raw)
To: lkml, linux-mm
I've uploaded a rollup of pending fixes and feature work
against 2.5.31 to
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/2.5.31-mm1/
The rolled up patch there is suitable for ongoing testing and
development. The individual patches are in the broken-out/
directory and should all be documented.
broken-out/linus.patch
Incremental BK patch from Linus' tree
broken-out/page_reserved.patch
Test PageReserved in pagevec_release()
broken-out/scsi_hack.patch
Fix block-highmem for scsi
broken-out/page_cache_release_lru_fix.patch
Fix a race between __page_cache_release() and shrink_cache().
broken-out/page_cache_release_fix.patch
Fix __page_cache_release() bugs
broken-out/mvm.patch
Fix vmalloc bugs
broken-out/pte-chain-fix.patch
Fix a VM lockup on uniprocessors
broken-out/func-fix.patch
gcc-2.91.66 does not support __func__
broken-out/ext3-htree.patch
Indexed directories for ext3
broken-out/rmap-mapping-BUG.patch
Fix a BUG_ON(page->mapping == NULL) in try_to_unmap()
broken-out/misc.patch
misc fixlets
broken-out/tlb-speedup.patch
Reduce typical global TLB invalidation frequency by 35%
broken-out/buffer-slab-align.patch
Don't align the buffer_head slab on hardware cacheline boundaries
broken-out/zone-rename.patch
Rename zone_struct->zone, zonelist_struct->zonelist. Remove zone_t,
zonelist_t.
broken-out/per-zone-lru.patch
Per-zone page LRUs
broken-out/per-zone-lock.patch
Per-zone LRU list locking
broken-out/l1-max-size.patch
Infrastructure for determining the maximum L1 cache size which the kernel
may have to support.
broken-out/zone-lock-alignment.patch
Pad struct zone to ensure that the lru and buddy locks are in separate
cachelines.
broken-out/put_page_cleanup.patch
Clean up put_page() and page_cache_release().
broken-out/anon-batch-free.patch
Batched freeing and de-LRUing of anonymous pages
broken-out/writeback-sync.patch
Writeback fixes and tuneups
broken-out/ext3-inode-allocation.patch
Fix an ext3 deadlock
broken-out/ext3-o_direct.patch
O_DIRECT support for ext3.
broken-out/jfs-bio.patch
Convert JFS to use direct-to-BIO I/O
broken-out/discontig-paddr_to_pfn.patch
Convert page pointers into pfns for i386 NUMA
broken-out/discontig-setup_arch.patch
Rework setup_arch() for i386 NUMA
broken-out/discontig-mem_init.patch
Restructure mem_init for i386 NUMA
broken-out/discontig-i386-numa.patch
discontigmem support for i386 NUMA
broken-out/cleanup-mem_map-1.patch
Clean up lots of open-coded uses of mem_map[]. For ia32 NUMA
broken-out/zone-pages-reporting.patch
Fix the boot-time reporting of each zone's available pages
broken-out/enospc-recovery-fix.patch
Fix the __block_write_full_page() error path.
broken-out/fix-faults.patch
Back out the initial work for atomic copy_*_user()
broken-out/bkl-consolidate.patch
Consolidate per-arch lock_kernel() implementations.
broken-out/might_sleep.patch
Infrastructure to detect sleep-inside-spinlock bugs
broken-out/spin-lock-check.patch
spinlock/rwlock checking infrastructure
broken-out/atomic-copy_user.patch
Support for atomic copy_*_user()
broken-out/kmap_atomic_reads.patch
Use kmap_atomic() for generic_file_read()
broken-out/kmap_atomic_writes.patch
Use kmap_atomic() for generic_file_write()
broken-out/config-PAGE_OFFSET.patch
Configurable kernel/user memory split
broken-out/throttling-fix.patch
Fix throttling of heavy write()rs.
broken-out/dirty-state-accounting.patch
Make the global dirty memory accounting more accurate
broken-out/rd-cleanup.patch
Cleanup and fix the ramdisk driver (doesn't work right yet)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/
^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread, other threads:[~2002-09-02 19:51 UTC | newest]
Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-08-26 22:09 MM patches against 2.5.31 Ed Tomlinson
2002-08-26 23:58 ` Andrew Morton
2002-08-27 0:13 ` Rik van Riel
2002-08-28 17:06 ` slablru for 2.5.32-mm1 Ed Tomlinson
2002-08-28 21:24 ` Andrew Morton
2002-08-28 22:23 ` Rik van Riel
2002-09-02 5:26 ` Andrew Morton
2002-09-02 15:00 ` Ed Tomlinson
2002-09-02 18:35 ` Andrew Morton
2002-09-02 19:09 ` Ed Tomlinson
2002-09-02 19:51 ` Andrew Morton
2002-09-02 6:50 ` Andrew Morton
-- strict thread matches above, loose matches on Subject: below --
2002-08-22 2:29 MM patches against 2.5.31 Andrew Morton
2002-08-22 11:28 ` Christian Ehrhardt
2002-08-26 1:52 ` Andrew Morton
2002-08-26 9:10 ` Christian Ehrhardt
2002-08-26 14:22 ` Daniel Phillips
2002-08-26 15:29 ` Christian Ehrhardt
2002-08-26 17:56 ` Daniel Phillips
2002-08-26 19:24 ` Andrew Morton
2002-08-26 19:34 ` Daniel Phillips
2002-08-26 19:48 ` Christian Ehrhardt
2002-08-27 9:22 ` Christian Ehrhardt
2002-08-27 19:19 ` Andrew Morton
2002-08-26 20:00 ` Christian Ehrhardt
2002-08-26 20:09 ` Daniel Phillips
2002-08-26 20:58 ` Christian Ehrhardt
2002-08-27 16:48 ` Daniel Phillips
2002-08-28 13:14 ` Christian Ehrhardt
2002-08-28 17:18 ` Daniel Phillips
2002-08-28 17:42 ` Andrew Morton
2002-08-28 20:41 ` Daniel Phillips
2002-08-28 21:03 ` Andrew Morton
2002-08-28 22:04 ` Daniel Phillips
2002-08-28 22:39 ` Andrew Morton
2002-08-28 22:57 ` Daniel Phillips
2002-08-26 21:31 ` Andrew Morton
2002-08-27 3:42 ` Benjamin LaHaise
2002-08-27 4:37 ` Andrew Morton
2002-08-22 15:59 ` Steven Cole
2002-08-22 16:06 ` Martin J. Bligh
2002-08-22 19:45 ` Steven Cole
2002-08-26 2:15 ` Andrew Morton
2002-08-26 2:08 ` Martin J. Bligh
2002-08-26 2:32 ` Andrew Morton
2002-08-26 3:06 ` Steven Cole