NUMA allocator on Opteron systems does non-local allocation on node0

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* NUMA allocator on Opteron systems does non-local allocation on node0
       [not found] <1449471.1223892929572.SLOX.WebMail.wwwrun@exchange.deltacomputer.de>
@ 2008-10-14  9:43 ` Oliver Weihe
  2008-10-14 11:41   ` Lee Schermerhorn
  0 siblings, 1 reply; 3+ messages in thread
From: Oliver Weihe @ 2008-10-14  9:43 UTC (permalink / raw)
  To: linux-mm

Hello,

I've sent this to Andi Kleen and posted this on lkml. Andi suggested to
sent it to this mailing list.


--- cut here (part 1) ---

> Hi Andi,
> 
> I'm not sure if you're the right person for this but I hope you are!
> 
> I've notived that the memory allocation on NUMA systems (Opterons)
> does
> memory allocation on non-local nodes for processes running node0 even
> if
> local memory is available. (Kernel 2.6.25 and above)
> 
> Currently I'm playing around with a quadsocket quadcore Opteron but
> I've
> observed this behavior on other Opteron systems aswell.
> 
> Hardware specs:
> 1x Supermicro H8QM3-2
> 4x Quadcore Opteron
> 16x 2GiB (8 GiB memory per node)
> 
> OS:
> currently openSUSE 10.3 but I've observed this on other distros aswell
> Kernel: 2.6.22.* (openSUSE) / 2.6.25.4 / 2.6.25.5 / 2.6.27 (vanilla
> config)
> 
> Steps to reproduce:
> Start an application which needs alot of memory and watch the memory
> usage per node (I'm using "watch -n 1 numastat --hardware" to watch
> the
> memory usage per node)
> A quick&dirty code which allocates a big array and writes data into
> the
> array is enough!
> 
> In my setup I'm allocating an array of ~7GiB memory size in a
> singlethreaded application.
> Startup: numactl --cpunodebind=X ./app
> For X=1,2,3 it works as expected, all memory is allocated on the local
> node.
> For X=0 I can see the memory beeing allocated on node0 as long as
> ~3GiB
> are "free" on node0. At this point the kernel starts using memory from
> node1 for the app!
> 
> For parallel realworld apps I've seen a performance penalty of 30%
> compared to older kernels!
> 
> numactl --cpunodebind=0 --membind=0 ./app "solves" the problem in this
> case but thats not the point!
> 
> -- 
> 
> Regards,
> Oliver Weihe

--- cut here (part 2) ---

> Hello,
> 
> it seems that my reproducer is not very good. :(
> It "works" much better when you start several processes at once.
> 
> for i in `seq 0 3`
> do
>   numactl --cpunodebind=${i} ./app &
> done
> wait
> 
> "app" still allocates some memory (7GiB per process) and fills the
> array
> with data.
> 
> 
> I've noticed this behaviour during some HPL (Linpack benchmark
> from/for
> top500.org) runs. For small data sets there's no difference in speed
> between the kernels while for big data sets (allmost the whole memory)
> 2.6.23 and newer kernels are slower than 2.6.22.
> I'm using OpenMPI with the runtime option "--mca mpi_paffinity_alone
> 1"
> to pin each process on a specific CPU.
> 
> The bad news is: I can crash allmost every Quadcore Opteron system
> with
> kernels 2.6.21.x to 2.6.24.x with "parallel memory allocation and
> filling the memory with data" (parallel means: there is one process
> per
> core doing this). While it takes some time on dualsocket machines it
> takes often less than 1 minute on quadsocket quadcores until the
> system
> freezes.
> Yust for the case it is some vendor specific BIOS bug: we're using
> supermicro mainboards.
> 
> > [Another copy of the reply with linux-kernel added this time]
> > 
> > > In my setup I'm allocating an array of ~7GiB memory size in a
> > > singlethreaded application.
> > > Startup: numactl --cpunodebind=X ./app
> > > For X=1,2,3 it works as expected, all memory is allocated on the
> > > local
> > > node.
> > > For X=0 I can see the memory beeing allocated on node0 as long as
> > > ~3GiB
> > > are "free" on node0. At this point the kernel starts using memory
> > > from
> > > node1 for the app!
> > 
> > Hmm, that sounds like it doesn't want to use the 4GB DMA zone.
> > 
> > Normally there should be no protection on it, but perhaps something 
> > broke.
> > 
> > What does cat /proc/sys/owmem_reserve_ratio say?
> 
> 2.6.22.x:
> # cat /proc/sys/vm/lowmem_reserve_ratio
> 256     256
> 
> 2.6.23.8 (and above)
> # cat /proc/sys/vm/lowmem_reserve_ratio
> 256     256     32
> 
> 
> > > For parallel realworld apps I've seen a performance penalty of 30%
> > > compared to older kernels!
> > 
> > Compared to what older kernels? When did it start?
> 
> I've tested some kernel Versions that I've laying around here...
> working fine: 2.6.22.18-0.2-default (openSUSE) / 2.6.22.9 (kernel.org)
> showing the described behaviour: 2.6.23.8; 2.6.24.4; 2.6.25.4;
> 2.6.26.5;
> 2.6.27
> 
> 
> > 
> > -Andi
> > 
> > -- 
> > ak@linux.intel.com
> > 
> 
> 
> -- 
> 
> Regards,
> Oliver Weihe

--- cut here ---


Regards,
 Oliver Weihe



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: NUMA allocator on Opteron systems does non-local allocation on node0
  2008-10-14  9:43 ` NUMA allocator on Opteron systems does non-local allocation on node0 Oliver Weihe
@ 2008-10-14 11:41   ` Lee Schermerhorn
  2008-10-14 12:15     ` Oliver Weihe
  0 siblings, 1 reply; 3+ messages in thread
From: Lee Schermerhorn @ 2008-10-14 11:41 UTC (permalink / raw)
  To: Oliver Weihe; +Cc: linux-mm

On Tue, 2008-10-14 at 11:43 +0200, Oliver Weihe wrote:
> Hello,
> 
> I've sent this to Andi Kleen and posted this on lkml. Andi suggested to
> sent it to this mailing list.
> 
> 
<snip>
> > 
> > > [Another copy of the reply with linux-kernel added this time]
> > > 
> > > > In my setup I'm allocating an array of ~7GiB memory size in a
> > > > singlethreaded application.
> > > > Startup: numactl --cpunodebind=X ./app
> > > > For X=1,2,3 it works as expected, all memory is allocated on the
> > > > local
> > > > node.
> > > > For X=0 I can see the memory beeing allocated on node0 as long as
> > > > ~3GiB
> > > > are "free" on node0. At this point the kernel starts using memory
> > > > from
> > > > node1 for the app!
> > > 
> > > Hmm, that sounds like it doesn't want to use the 4GB DMA zone.
> > > 
> > > Normally there should be no protection on it, but perhaps something 
> > > broke.
> > > 


Check your /proc/sys/vm/numa_zonelist_order.  By default, the kernel
will use "zone order", meaning it will overflow to the same zone-e.g.,
Normal--before consuming DMA memory, if the DMA zone is <= half the
system memory.  See default_zonelist_order() and build_zonelists() in
mm/page_alloc.c

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: NUMA allocator on Opteron systems does non-local allocation on node0
  2008-10-14 11:41   ` Lee Schermerhorn
@ 2008-10-14 12:15     ` Oliver Weihe
  0 siblings, 0 replies; 3+ messages in thread
From: Oliver Weihe @ 2008-10-14 12:15 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm

Hi Lee,

thank you for the hint. I've set /proc/sys/vm/numa_zonelist_order to
"node" (was "default") and now an it works as expected. Performance is
much better than before for some parallel applications.

Regards,
 Oliver Weihe

> On Tue, 2008-10-14 at 11:43 +0200, Oliver Weihe wrote:
> > Hello,
> > 
> > I've sent this to Andi Kleen and posted this on lkml. Andi suggested
> > to
> > sent it to this mailing list.
> > 
> > 
> <snip>
> > > 
> > > > [Another copy of the reply with linux-kernel added this time]
> > > > 
> > > > > In my setup I'm allocating an array of ~7GiB memory size in a
> > > > > singlethreaded application.
> > > > > Startup: numactl --cpunodebind=X ./app
> > > > > For X=1,2,3 it works as expected, all memory is allocated on
> > > > > the
> > > > > local
> > > > > node.
> > > > > For X=0 I can see the memory beeing allocated on node0 as long
> > > > > as
> > > > > ~3GiB
> > > > > are "free" on node0. At this point the kernel starts using
> > > > > memory
> > > > > from
> > > > > node1 for the app!
> > > > 
> > > > Hmm, that sounds like it doesn't want to use the 4GB DMA zone.
> > > > 
> > > > Normally there should be no protection on it, but perhaps
> > > > something
> > > > broke.
> > > > 
> 
> 
> Check your /proc/sys/vm/numa_zonelist_order.  By default, the kernel
> will use "zone order", meaning it will overflow to the same zone-e.g.,
> Normal--before consuming DMA memory, if the DMA zone is <= half the
> system memory.  See default_zonelist_order() and build_zonelists() in
> mm/page_alloc.c
> 
> Lee
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2008-10-14 12:15 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1449471.1223892929572.SLOX.WebMail.wwwrun@exchange.deltacomputer.de>
2008-10-14  9:43 ` NUMA allocator on Opteron systems does non-local allocation on node0 Oliver Weihe
2008-10-14 11:41   ` Lee Schermerhorn
2008-10-14 12:15     ` Oliver Weihe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox