Date: Tue, 14 Oct 2008 11:43:00 +0200 (CEST)
From: Oliver Weihe
To: linux-mm@kvack.org
Subject: NUMA allocator on Opteron systems does non-local allocation on node0
Message-ID: <2793369.1223977380170.SLOX.WebMail.wwwrun@exchange.deltacomputer.de>
In-Reply-To: <1449471.1223892929572.SLOX.WebMail.wwwrun@exchange.deltacomputer.de>

Hello,

I've sent this to Andi Kleen and posted it on LKML. Andi suggested
sending it to this mailing list.

--- cut here (part 1) ---
> Hi Andi,
>
> I'm not sure if you're the right person for this, but I hope you are!
>
> I've noticed that the memory allocator on NUMA systems (Opterons)
> does non-local allocation for processes running on node0, even if
> local memory is available. (Kernel 2.6.25 and above)
>
> Currently I'm playing around with a quad-socket quad-core Opteron,
> but I've observed this behaviour on other Opteron systems as well.
>
> Hardware specs:
> 1x Supermicro H8QM3-2
> 4x quad-core Opteron
> 16x 2GiB (8GiB memory per node)
>
> OS:
> currently openSUSE 10.3, but I've observed this on other distros as well
> Kernel: 2.6.22.* (openSUSE) / 2.6.25.4 / 2.6.25.5 / 2.6.27 (vanilla config)
>
> Steps to reproduce:
> Start an application that needs a lot of memory and watch the memory
> usage per node (I'm using "watch -n 1 numactl --hardware" to watch
> the memory usage per node).
> A quick & dirty program that allocates a big array and writes data
> into the array is enough! [a sketch follows below, after this quoted
> part]
>
> In my setup I'm allocating an array of ~7GiB in a single-threaded
> application.
> Startup: numactl --cpunodebind=X ./app
> For X=1,2,3 it works as expected: all memory is allocated on the
> local node.
> For X=0 I can see the memory being allocated on node0 only as long
> as ~3GiB are "free" on node0. At that point the kernel starts using
> memory from node1 for the app!
>
> For parallel real-world apps I've seen a performance penalty of 30%
> compared to older kernels!
>
> numactl --cpunodebind=0 --membind=0 ./app "solves" the problem in
> this case, but that's not the point!
>
> --
>
> Regards,
> Oliver Weihe
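For reference, the quick & dirty reproducer mentioned above boils down
to something like the following minimal sketch. It is not the exact
code I'm running; the ~7GiB size, the fill pattern and the trailing
sleep are only illustrative.

/*
 * Build:  gcc -O2 -o app app.c
 * Run:    numactl --cpunodebind=X ./app
 * Watch:  watch -n 1 numactl --hardware   (in a second terminal)
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ARRAY_BYTES (7UL * 1024 * 1024 * 1024)	/* ~7 GiB */

int main(void)
{
	/* malloc() only reserves address space; pages are placed on a
	 * NUMA node when they are first written, so the memset below
	 * is what actually triggers the per-node allocation. */
	char *buf = malloc(ARRAY_BYTES);

	if (!buf) {
		perror("malloc");
		return 1;
	}

	memset(buf, 0xab, ARRAY_BYTES);

	/* Keep the memory around so the per-node usage can be read. */
	puts("array filled, sleeping 60s");
	sleep(60);

	free(buf);
	return 0;
}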
--- cut here (part 2) ---
> Hello,
>
> it seems that my reproducer is not very good. :(
> It "works" much better when you start several processes at once:
>
> for i in `seq 0 3`
> do
>     numactl --cpunodebind=${i} ./app &
> done
> wait
>
> "app" still allocates some memory (7GiB per process) and fills the
> array with data.
>
> I've noticed this behaviour during some HPL (Linpack benchmark
> from/for top500.org) runs. For small data sets there's no difference
> in speed between the kernels, while for big data sets (almost the
> whole memory) 2.6.23 and newer kernels are slower than 2.6.22.
> I'm using OpenMPI with the runtime option "--mca mpi_paffinity_alone 1"
> to pin each process to a specific CPU.
>
> The bad news is: I can crash almost every quad-core Opteron system
> with kernels 2.6.21.x to 2.6.24.x with "parallel memory allocation
> and filling the memory with data" (parallel means: there is one
> process per core doing this). While it takes some time on dual-socket
> machines, it often takes less than 1 minute on quad-socket quad-cores
> until the system freezes.
> Just in case it is some vendor-specific BIOS bug: we're using
> Supermicro mainboards.
>
> > [Another copy of the reply, with linux-kernel added this time]
> >
> > > In my setup I'm allocating an array of ~7GiB in a
> > > single-threaded application.
> > > Startup: numactl --cpunodebind=X ./app
> > > For X=1,2,3 it works as expected: all memory is allocated on the
> > > local node.
> > > For X=0 I can see the memory being allocated on node0 only as
> > > long as ~3GiB are "free" on node0. At that point the kernel
> > > starts using memory from node1 for the app!
> >
> > Hmm, that sounds like it doesn't want to use the 4GB DMA zone.
> >
> > Normally there should be no protection on it, but perhaps
> > something broke.
> >
> > What does cat /proc/sys/vm/lowmem_reserve_ratio say?
>
> 2.6.22.x:
> # cat /proc/sys/vm/lowmem_reserve_ratio
> 256     256
>
> 2.6.23.8 (and above):
> # cat /proc/sys/vm/lowmem_reserve_ratio
> 256     256     32
>
> > > For parallel real-world apps I've seen a performance penalty of
> > > 30% compared to older kernels!
> >
> > Compared to what older kernels? When did it start?
>
> I've tested some kernel versions that I have lying around here...
> working fine: 2.6.22.18-0.2-default (openSUSE) / 2.6.22.9 (kernel.org)
> showing the described behaviour: 2.6.23.8; 2.6.24.4; 2.6.25.4;
> 2.6.26.5; 2.6.27
>
> > -Andi
> >
> > --
> > ak@linux.intel.com
>
> --
>
> Regards,
> Oliver Weihe
--- cut here ---

Regards,
    Oliver Weihe
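P.S.: In case it helps with reproducing this: the per-node free memory
can also be read directly from sysfs instead of via numactl. The
following is only a hypothetical little helper, not something from my
benchmark runs; the node count is hardcoded to 4 for the quad-socket
box described above.

/* Print the MemFree line of every node, roughly what
 * "numactl --hardware" reports as free memory per node. */
#include <stdio.h>
#include <string.h>

#define NR_NODES 4	/* quad-socket box; adjust as needed */

int main(void)
{
	int node;

	for (node = 0; node < NR_NODES; node++) {
		char path[64], line[256];
		FILE *f;

		/* e.g. /sys/devices/system/node/node0/meminfo contains
		 * lines like "Node 0 MemFree:  1234567 kB" */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/meminfo", node);

		f = fopen(path, "r");
		if (!f) {
			perror(path);
			continue;
		}
		while (fgets(line, sizeof(line), f))
			if (strstr(line, "MemFree"))
				fputs(line, stdout);
		fclose(f);
	}
	return 0;
}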