* NUMA allocator on Opteron systems does non-local allocation on node0
From: Oliver Weihe @ 2008-10-14 9:43 UTC
To: linux-mm
Hello,
I've sent this to Andi Kleen and posted it on lkml. Andi suggested to
send it to this mailing list.
--- cut here (part 1) ---
> Hi Andi,
>
> I'm not sure if you're the right person for this but I hope you are!
>
> I've noticed that the memory allocation on NUMA systems (Opterons)
> places memory on non-local nodes for processes running on node0, even
> if local memory is available (kernel 2.6.25 and above).
>
> Currently I'm playing around with a quad-socket quad-core Opteron, but
> I've observed this behavior on other Opteron systems as well.
>
> Hardware specs:
> 1x Supermicro H8QM3-2
> 4x Quadcore Opteron
> 16x 2GiB (8 GiB memory per node)
>
> OS:
> currently openSUSE 10.3, but I've observed this on other distros as well
> Kernel: 2.6.22.* (openSUSE) / 2.6.25.4 / 2.6.25.5 / 2.6.27 (vanilla
> config)
>
> Steps to reproduce:
> Start an application which needs a lot of memory and watch the memory
> usage per node (I'm using "watch -n 1 numastat --hardware" for that).
> A quick & dirty program which allocates a big array and writes data
> into the array is enough!
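>
> Roughly what I mean is something like this (only a sketch; the array
> size, the fill pattern and the sleep at the end are just examples so
> there is time to watch the per-node usage):
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> /* Example size only; I pick something close to the node size (~7 GiB)
>  * so that one node fills up. */
> #define ARRAY_BYTES (7UL << 30)
>
> int main(void)
> {
>         /* allocate a big array ... */
>         char *buf = malloc(ARRAY_BYTES);
>         if (!buf) {
>                 perror("malloc");
>                 return 1;
>         }
>
>         /* ... and write data into it, so that every page is really
>          * faulted in and placed on some node */
>         memset(buf, 0x55, ARRAY_BYTES);
>
>         /* keep the memory around so numastat can be watched */
>         sleep(600);
>
>         free(buf);
>         return 0;
> }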
>
> In my setup I'm allocating an array of ~7 GiB in a single-threaded
> application.
> Startup: numactl --cpunodebind=X ./app
> For X=1,2,3 it works as expected, all memory is allocated on the local
> node.
> For X=0 I can see the memory being allocated on node0 as long as
> ~3 GiB are "free" on node0; at that point the kernel starts using
> memory from node1 for the app!
>
> For parallel real-world apps I've seen a performance penalty of 30%
> compared to older kernels!
>
> numactl --cpunodebind=0 --membind=0 ./app "solves" the problem in this
> case, but that's not the point!
>
> --
>
> Regards,
> Oliver Weihe
--- cut here (part 2) ---
> Hello,
>
> it seems that my reproducer is not very good. :(
> It "works" much better when you start several processes at once.
>
> for i in `seq 0 3`
> do
>     numactl --cpunodebind=${i} ./app &
> done
> wait
>
> "app" still allocates some memory (7GiB per process) and fills the
> array
> with data.
>
>
> I've noticed this behaviour during some HPL (the Linpack benchmark
> from/for top500.org) runs. For small data sets there's no difference in
> speed between the kernels, while for big data sets (almost the whole
> memory) 2.6.23 and newer kernels are slower than 2.6.22.
> I'm using OpenMPI with the runtime option "--mca mpi_paffinity_alone 1"
> to pin each process to a specific CPU.
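>
> The launch then looks roughly like this (the process count and the
> binary name are only examples from my setup):
>
> mpirun --mca mpi_paffinity_alone 1 -np 16 ./xhpl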
>
> The bad news is: I can crash almost every quad-core Opteron system with
> kernels 2.6.21.x to 2.6.24.x with "parallel memory allocation and
> filling the memory with data" (parallel means: there is one process per
> core doing this). While it takes some time on dual-socket machines, it
> often takes less than 1 minute on quad-socket quad-cores until the
> system freezes.
> Just in case it is some vendor-specific BIOS bug: we're using
> Supermicro mainboards.
>
> > [Another copy of the reply with linux-kernel added this time]
> >
> > > In my setup I'm allocating an array of ~7 GiB in a single-threaded
> > > application.
> > > Startup: numactl --cpunodebind=X ./app
> > > For X=1,2,3 it works as expected, all memory is allocated on the
> > > local node.
> > > For X=0 I can see the memory being allocated on node0 as long as
> > > ~3 GiB are "free" on node0; at that point the kernel starts using
> > > memory from node1 for the app!
> >
> > Hmm, that sounds like it doesn't want to use the 4GB DMA zone.
> >
> > Normally there should be no protection on it, but perhaps something
> > broke.
> >
> > What does cat /proc/sys/vm/lowmem_reserve_ratio say?
>
> 2.6.22.x:
> # cat /proc/sys/vm/lowmem_reserve_ratio
> 256 256
>
> 2.6.23.8 (and above)
> # cat /proc/sys/vm/lowmem_reserve_ratio
> 256 256 32
>
>
> > > For parallel real-world apps I've seen a performance penalty of 30%
> > > compared to older kernels!
> >
> > Compared to what older kernels? When did it start?
>
> I've tested some kernel versions that I have lying around here...
> working fine: 2.6.22.18-0.2-default (openSUSE) / 2.6.22.9 (kernel.org)
> showing the described behaviour: 2.6.23.8; 2.6.24.4; 2.6.25.4;
> 2.6.26.5; 2.6.27
>
>
> >
> > -Andi
> >
> > --
> > ak@linux.intel.com
> >
>
>
> --
>
> Regards,
> Oliver Weihe
--- cut here ---
Regards,
Oliver Weihe