I think we should make some cleanups before adding NUMA:

a) Make the per-cpu arrays unconditional, even on UP. The arrays provide LIFO ordering, which should improve cache hit rates. The disadvantage is probably higher internal fragmentation [not measured], but memory is cheap, and cache hits are important. In addition, this makes it possible to perform some cleanups: no more kmem_cache_alloc_one, for example.

b) Use arrays for all caches, even those with large objects. E.g. right now, the 16 kB loopback socket buffers are not handled per-cpu.

Nitin, in your NUMA patch you return each "wrong-node" object immediately, without batching them together. Have you considered batching? It could reduce the number of spinlock operations dramatically.

There is one additional problem without an obvious solution: kmem_cache_reap() is too inefficient. Before 2.5.39, it served two purposes:

1) Return freeable slabs back to gfp.

   Before 2.5.39, every try_to_free_pages scanned through the caches. That scan is terribly inefficient, and akpm noticed lots of idle reschedules on the cache_chain_sem.

   2.5.39 limits the freeable slabs list to one entry, and doesn't scan at all. On the one hand, this locks up one slab in each cache [in the worst case, an order=5 allocation]. For very bursty, slightly fragmented caches, it could lead to more kmem_cache_grow() calls.

   My patch scans every 2 seconds, and returns 80% of the pages.

2) Regularly flush the per-cpu arrays.

   Before 2.5.39, that happened in every try_to_free_pages. My patch does it every 2 seconds, on a random cpu. 2.5.39 never does it, which probably reduces cpucache thrashing [alloc a batch of 120 objects, return one object, release a batch of 119 objects due to the next try_to_free_pages]. The problem is that without flushing, the cpu arrays can lock up lots of pages [in the worst case, several thousand for each cpu].

The attached patch is WIP:

* boots on UP
* partially boots with bochs [cpu simulator], until bochs locks up.
* contains comments where I'd add modifications for NUMA.

What do you think?

For NUMA, is it possible to figure out efficiently into which node a pointer points? That would happen in every kmem_cache_free().

Could someone test that it works on real SMP?

What is the best replacement for kmem_cache_reap()? 2.5.39 obviously contains the simplest approach, but I'm not sure it's the best.

--
    Manfred