Date: Wed, 16 Mar 2005 10:54:09 -0800
From: "Martin J. Bligh"
To: Manfred Spraul, Christoph Lameter
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: Fw: [PATCH] NUMA Slab Allocator
Message-ID: <273220000.1110999247@[10.10.2.4]>
In-Reply-To: <42387C2E.4040106@colorfullife.com>
References: <20050315204110.6664771d.akpm@osdl.org> <42387C2E.4040106@colorfullife.com>

> Do you have profile data from your modification? Which percentage of the
> allocations is node-local, which percentage is from foreign nodes?
> Preferably per-cache. It shouldn't be difficult to add statistics counters
> to your patch.
>
> And: can you estimate which percentage is really accessed node-locally,
> and which percentage is long-lived structures that are accessed from all
> cpus in the system?
>
> I had discussions with guys from IBM and SGI regarding a numa allocator,
> and we decided that we need profile data before we can decide if we need
> one:
> - A node-local allocator reduces the inter-node traffic, because the
>   callers get node-local memory.
> - A node-local allocator increases the inter-node traffic, because objects
>   that are kfree'd on the wrong node must be returned to their home node.

One of the big problems is that much of the slab data really is more global
(i.e. dentry, inodes, etc). Some of it is more localized (typically the
kmalloc-style stuff). I can't really generate any data easily, as most of my
NUMA boxes are either small Opterons / midsized PPC64, which have a fairly
low NUMA factor, or large ia32, which only has kernel mem on node 0 ;-(
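For whoever does have a box where it matters, something along these lines
(a completely untested sketch -- the struct and function names below are
made up, not what's in slab.c) hung off each cache and called from the end
of __cache_alloc() ought to be enough to get the local/foreign split
Manfred is asking for:

#include <linux/mm.h>
#include <linux/smp.h>
#include <linux/topology.h>

/*
 * Crude per-cache counters: did the allocation come back on the
 * allocating cpu's node or not?  Plain increments are racy, but for
 * profiling purposes that's close enough.
 */
struct slab_node_stats {
	unsigned long alloc_local;	/* object sits on this cpu's node */
	unsigned long alloc_foreign;	/* object sits on some other node */
};

static inline void slab_node_stat(struct slab_node_stats *stats, void *objp)
{
	int objnode = page_zone(virt_to_page(objp))->zone_pgdat->node_id;

	if (objnode == cpu_to_node(smp_processor_id()))
		stats->alloc_local++;
	else
		stats->alloc_foreign++;
}

Dump those per-cache through /proc/slabinfo and you have the first half of
the answer; whether the object is then *accessed* node-locally over its
lifetime is much harder to measure from inside slab.c.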
> IIRC the conclusion from our discussion was that there are at least four
> possible implementations:
> - your version
> - Add a second per-cpu array for off-node allocations. __cache_free
>   batches, free_block then returns. Global spinlock or per-node spinlock.
>   A patch with a global spinlock is in
>   http://www.colorfullife.com/~manfred/Linux-kernel/slab/patch-slab-numa-2.5.66
>   per-node spinlocks would require a restructuring of free_block.
> - Add a per-node array for each cpu for wrong-node allocations. Allows very
>   fast batch return: each array contains memory from just one node, useful
>   if per-node spinlocks are used.
> - do nothing. Least overhead within slab.
>
> I'm fairly certain that "do nothing" is the right answer for some caches.
> For example the dentry cache: the object lifetime is seconds to minutes,
> and the objects are stored in a global hashtable. They will be touched from
> all cpus in the system, so guaranteeing that kmem_cache_alloc returns
> node-local memory won't help. But the added overhead within slab.c will hurt.

That'd be my inclination .... but OTOH, we do that for pagecache OK. Dunno,
I'm torn. Depends on whether there's locality in the file access or not, I
guess.

Is there any *harm* in doing it node-local .... perhaps creating a node mem
pressure imbalance (OTOH, there's loads of stuff that does that anyway ;-)).
The other thing that needs serious thought is how we balance reclaim
pressure.

M.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: aart@kvack.org