When you have lots of tasks, the pagetables start taking up lots of
lowmem.  We have the ability to push the PTE pages into highmem, but
that exacts a penalty from the atomic kmaps which, depending on
workload, can be a 10-15% performance hit.

The following patches implement something which we like to call UKVA. 
It's a Kernel Virtual Area which is private to a process, just like
Userspace.  You can put any process-local data that you want in the
area.  But, for now, I just put PTE pages in there.

It has some really nice attributes, which aren't taken full advantage of
in this patch.  For one, since the PTE pages are laid out virtually in
line, it's really easy to figure out where the PTE that maps a
particular address is sitting.  The PTE that maps 0x00000000 is always
virtually at *FIRST_UKVA_PTE, just as the PTE that maps 0xFFFFFFFF is
always mapped at *LAST_UKVA_PTE.  This means the hardware implicitly
does the pagetable walk that we usually have software constructs like
follow_page() do instead.
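
To make the "laid out virtually in line" point concrete, here is a rough
userspace sketch of the address arithmetic.  It is not code from the
patches.  UKVA_PTE_BASE is a made-up base address and pte_model_t and
ukva_pte_vaddr() are hypothetical names; the 8-byte PTE size is assumed
from the 64GB (PAE) requirement, and 4K pages are assumed.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT     12             /* assumes 4K pages */
#define UKVA_PTE_BASE  0xFF000000ULL  /* hypothetical base, NOT from the patch */

/* Stand-in for a PAE pte: 64 bits wide. */
typedef uint64_t pte_model_t;

static_assert(sizeof(pte_model_t) == 8, "PAE PTEs are 64 bits wide");

/*
 * Virtual address, inside the linear PTE window, of the PTE that maps
 * 'addr'.  One PTE per 4K page, laid out in address order, so this is
 * just an index computation with no pagetable walk in software.
 */
static uint64_t ukva_pte_vaddr(uint32_t addr)
{
    uint64_t page_index = addr >> PAGE_SHIFT;

    return UKVA_PTE_BASE + page_index * sizeof(pte_model_t);
}
```

With these assumed values, ukva_pte_vaddr(0x00000000) is the base itself
and ukva_pte_vaddr(0xFFFFFFFF) is 0x7FFFF8 bytes past it, so the whole
window covers 8MB of virtual space.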

Since only the current process's PTEs are mapped into the area, you
still need to use kmap_atomic() to get to another process's pagetables. 
That is why I started passing mm around everywhere.
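
The reason the mm has to be passed down can be modelled in a few lines.
This is only an illustration, not the patch's code: struct mm_model,
enum pte_access and pte_access_for() are hypothetical names standing in
for the real kernel structures and paths.

```c
/* Hypothetical stand-in for the kernel's struct mm_struct. */
struct mm_model {
    int id;
};

/* The two ways of reaching a PTE page described above. */
enum pte_access {
    PTE_VIA_UKVA,         /* current process: already mapped in the area */
    PTE_VIA_KMAP_ATOMIC   /* any other process: needs a temporary mapping */
};

/*
 * Pick the access path for the pagetables of 'mm' while 'current_mm'
 * is the address space that is loaded.  Without the mm argument the
 * caller has no way to tell the two cases apart.
 */
static enum pte_access pte_access_for(const struct mm_model *mm,
                                      const struct mm_model *current_mm)
{
    return (mm == current_mm) ? PTE_VIA_UKVA : PTE_VIA_KMAP_ATOMIC;
}
```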

If anyone wants to play with it, be my guest.  But, don't go applying it
to anything important.  It certainly won't compile or boot without
highpte and 64GB support.

I've done all of the work on top of 2.5.70-mjb1.  There are 3 patches on
which this is built:
reslabify-pmd-pgd-2.5.70-mjb1-0.patch
sepmd-2.5.70-mjb1-0.patch
banana_split-2.5.70-mjb1-1.patch

Here's a differential profile.  Positive numbers mean more time spent
with UKVA (worse), negative numbers mean less (better).  I'm not sure
why the total is so much bigger.  I think my profiling script screwed
up, and forgot to stop the profiler at the right time.  Everything else
looks OK.

158930 total
154829 default_idle
  1523 pmd_free_ukva
  1190 do_anonymous_page
   896 pmd_alloc_ukva
   754 free_hot_cold_page
   616 .text.lock.namei
   535 buffered_rmqueue
   454 __d_lookup
   ...
  -238 fd_install
  -394 .text.lock.libfs
  -445 filemap_nopage
  -506 pte_alloc_map
  -696 kmap_atomic_to_page
 -3747 kmap_atomic

Notice that there are a lot fewer kmap_atomic() calls, and
kmap_atomic_to_page() is called less, because UKVA is used instead.  The
increases in pmd_free_ukva, pmd_alloc_ukva, and free_hot_cold_page are
all due to the extra 4 pages per process that must be allocated.
The do_anonymous_page increase is probably due to the extra TLB overhead
from disabling lazy tlb mode (which I plan to fix).  pmd_free_ukva() and
pmd_alloc_ukva() probably don't need to be clearing the pages anyway.
-- 
Dave Hansen
haveblue@us.ibm.com