[ The attached patch is Proof of Concept (POC) code only. It only
works on x86_64, it only supports the slab allocator, it only
relocates the lowest level of page tables, it's less efficient than it
should be, and I'm convinced the locking is deficient.  It does work
well enough to play around with though. The patch is a unified diff
against a clean 2.6.23.]

I'd like to propose four somewhat interdependent code changes.

1) Add a separate meta-data allocation to the slab and slub allocators
and allocate full pages through kmem_cache_alloc instead of get_page.
The primary motivation of this is that we could shrink struct page by
using kmem_cache_alloc to allocate whole pages and put the supported
data in the meta_data area instead of struct page. The downside is
that we might end up using more memory because of alignment issues.  I
believe we can keep the code as efficient as the current code by
allocating many pages at once with known alignment and locating the
meta data in the first few pages.  The meta data for a page can then
be located by computing (page_address & mask) + ((page_address >> foo)
& mask) * meta_data_size + offset (the two masks differ), which should
be just as fast as the current calculation.  This is different than
the proof of concept
implementation.  I also believe this would reduce kernel memory
fragmentation.
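
As a rough user-space sketch of the lookup described above (not the
POC implementation): it assumes pages are allocated in naturally
aligned blocks of 1024 pages with the per-page meta data array at the
start of each block.  PAGE_SHIFT matches x86_64; BLOCK_ORDER,
META_DATA_SIZE, META_OFFSET and page_meta_addr() are made-up names and
values for illustration only.

```c
#include <stdint.h>

/* All constants below are illustrative assumptions, not values from the patch. */
#define PAGE_SHIFT	12	/* 4 KiB pages, as on x86_64 */
#define BLOCK_ORDER	10	/* 1024 pages per aligned block (assumed) */
#define BLOCK_SIZE	((uintptr_t)1 << (PAGE_SHIFT + BLOCK_ORDER))
#define BLOCK_MASK	(~(BLOCK_SIZE - 1))			/* selects the block base */
#define INDEX_MASK	(((uintptr_t)1 << BLOCK_ORDER) - 1)	/* page index within the block */
#define META_DATA_SIZE	64	/* bytes of meta data per page (assumed) */
#define META_OFFSET	0	/* meta data array starts at the block base */

/* Return the address of the meta data for the page containing page_address. */
static inline uintptr_t page_meta_addr(uintptr_t page_address)
{
	uintptr_t base  = page_address & BLOCK_MASK;
	uintptr_t index = (page_address >> PAGE_SHIFT) & INDEX_MASK;

	return base + index * META_DATA_SIZE + META_OFFSET;
}
```

With these numbers the meta data for one block occupies 1024 * 64 bytes,
i.e. the first 16 pages of the block, and the lookup is two masks, a
shift, a multiply and an add.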

2) Add support for relocating memory allocated via kmem_cache_alloc.
When a cache is created, optional relocation information can be
provided.  If a relocation function is provided, caches can be
defragmented and overall memory consumption can be reduced.
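
A minimal user-space model of what such relocation information might
look like.  Everything here is hypothetical (struct reloc_cache,
relocate_fn, reloc_cache_move(), repoint_single_ref()); malloc/free
stand in for the slab allocator, and locking is ignored entirely.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical callback: invoked after the object has been copied so the
 * owner can repoint its references from old_obj to new_obj.  Returns 0 on
 * success, nonzero if the object cannot be moved right now.
 */
typedef int (*relocate_fn)(void *old_obj, void *new_obj, void *priv);

struct reloc_cache {
	size_t		obj_size;
	relocate_fn	relocate;	/* NULL: objects in this cache never move */
	void		*priv;		/* passed through to the callback */
};

/* Try to move one object; returns its (possibly unchanged) location. */
static void *reloc_cache_move(struct reloc_cache *c, void *old_obj)
{
	void *new_obj;

	if (!c->relocate)
		return old_obj;
	new_obj = malloc(c->obj_size);
	if (!new_obj)
		return old_obj;
	memcpy(new_obj, old_obj, c->obj_size);
	if (c->relocate(old_obj, new_obj, c->priv) != 0) {
		free(new_obj);		/* owner refused; keep the old copy */
		return old_obj;
	}
	free(old_obj);
	return new_obj;
}

/* Example callback: priv points at the single pointer that refers to the object. */
static int repoint_single_ref(void *old_obj, void *new_obj, void *priv)
{
	void **slot = priv;

	if (*slot != old_obj)
		return -1;
	*slot = new_obj;
	return 0;
}
```

The point of the callback is that only the owner of an object knows
where the references to it live, so the allocator copies and the owner
fixes up pointers.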

3) Create a handle struct for holding references to memory that might
be moved out from under you.  This is one of those things that looks
really good on paper, but in practice isn't very useful.  While I'm
sure there are a few cases in /sys and /proc where handles could be
put to good use, in general the overhead involved does not justify
their use.  I worry that they could become a fad and that people will
start using them when they should not be used.  The reason for
including them is that they are really good for setting up synthetic
tests for relocating memory.
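
For illustration, a handle could be as simple as the following
single-threaded sketch.  struct handle and the handle_*() helpers are
hypothetical names, and a real version would need atomic pin counts or
a lock.

```c
#include <stddef.h>

/* Hypothetical handle: an indirection through which movable memory is reached. */
struct handle {
	void		*obj;	/* current location; may change while unpinned */
	unsigned int	pins;	/* nonzero while someone holds a raw pointer */
};

/* Pin the object and return a raw pointer that stays valid until handle_put(). */
static void *handle_get(struct handle *h)
{
	h->pins++;
	return h->obj;
}

/* Drop a pin taken with handle_get(). */
static void handle_put(struct handle *h)
{
	h->pins--;
}

/* Repoint the handle at new_obj; fails (returns 0) while the object is pinned. */
static int handle_try_move(struct handle *h, void *new_obj)
{
	if (h->pins)
		return 0;
	h->obj = new_obj;
	return 1;
}
```

Every access pays for the extra indirection and the pin/unpin pair,
which is the overhead mentioned above.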

And finally, the real reason for doing all of the above:

4) Modify pte_alloc/free and friends to use kmem_cache_alloc and make
page tables relocatable. I believe this would go a long way towards
keeping kernel memory from fragmenting.  The biggest downside is the
number of tlb flushes involved.  The POC code uses RCU to free the old
copies of the page tables, which should reduce the flushes.  However,
it blindly flushes the tlbs on all of the cpus, when it really only
needs to flush the tlb on any cpu using the mm in question.  I believe
that by only flushing the tlbs on cpus actually using the mm in
question, we can reduce the flushes to an acceptable level.  One
alternative is to create an RCU class for tlb flushes, so that the old
table only gets freed after all the cpus have flushed their tlbs.
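
To make the flush-targeting idea concrete, here is a toy user-space
model, not kernel code.  It assumes each mm carries a bitmask of the
cpus that may hold tlb entries for it; struct mm_model,
tlb_flush_targets() and count_cpus() are invented for this sketch.

```c
#include <stdint.h>

/* Toy model: bit n set means cpu n may hold tlb entries for this mm. */
struct mm_model {
	uint64_t cpu_mask;
};

/* cpus that actually need a flush after one of this mm's page tables moves. */
static uint64_t tlb_flush_targets(const struct mm_model *mm, uint64_t online_cpus)
{
	return mm->cpu_mask & online_cpus;
}

/* Number of set bits, i.e. how many cpus would be interrupted. */
static unsigned int count_cpus(uint64_t mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}
```

On an 8-cpu machine where the mm has only run on cpus 0 and 2, this
selects 2 cpus instead of all 8.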

I believe that the above opens the doors to shrinking struct page and
greatly reducing kernel memory fragmentation with the only real
downside being an increase in code complexity and a possible increase
in memory usage if we are not careful.  I'm willing to code all of
this, but I'd like to get others' opinions on what's appropriate and
what's already being done.

With the exception of tlb flushes and meta data location, I believe
the POC code demonstrates how I intend to solve most of the problems
that will be encountered.  One thing I am worried about is the
performance impact of the changes, and I would like pointers to any
microbenchmarks that might be relevant.

    Ross