I think we should make some cleanups before adding NUMA:

a) Make the per-cpu arrays unconditional, even on UP. The arrays provide LIFO ordering, which should improve cache hit rates. The disadvantage is probably higher internal fragmentation [not measured], but memory is cheap, and cache hits are important. In addition, this makes it possible to perform some cleanups: no more kmem_cache_alloc_one, for example.

b) Use arrays for all caches, even those with large objects. E.g. right now, the 16 kB loopback socket buffers are not handled per-cpu.

Nitin, in your NUMA patch you return each "wrong-node" object immediately, without batching them together. Have you considered batching? It could reduce the number of spinlock operations dramatically.

There is one additional problem without an obvious solution: kmem_cache_reap() is too inefficient. Before 2.5.39, it served two purposes:

1) Return freeable slabs back to gfp.

   Before 2.5.39, every try_to_free_pages scanned through the caches. That scan is terribly inefficient, and akpm noticed lots of idle reschedules on the cache_chain_sem.

   2.5.39 limits the freeable slabs list to one entry, and doesn't scan at all. On the one hand, this locks up one slab in each cache [in the worst case, an order=5 allocation]. For very bursty, slightly fragmented caches, it could lead to more kmem_cache_grow() calls.

   My patch scans every 2 seconds, and returns 80% of the pages.

2) Regularly flush the per-cpu arrays.

   Before 2.5.39, that happened in every try_to_free_pages. My patch does it every 2 seconds, on a random cpu. 2.5.39 never does it, which probably reduces cpucache thrashing [alloc a batch of 120 objects, return one object, release a batch of 119 objects due to the next try_to_free_pages]. The problem is that without flushing, the cpu arrays can lock up lots of pages [in the worst case, several thousand for each cpu].

The attached patch is WIP:

* boots on UP
* partially boots with bochs [cpu simulator], until bochs locks up.
* contains comments where I'd add modifications for NUMA.

What do you think?

For NUMA, is it possible to figure out efficiently into which node a pointer points? That would happen in every kmem_cache_free().

Could someone test that it works on real SMP?

What is the best replacement for kmem_cache_reap()? 2.5.39 obviously contains the simplest approach, but I'm not sure it's the best.

--
    Manfred