* NUMA allocator on Opteron systems does non-local allocation on node0
[not found] <1449471.1223892929572.SLOX.WebMail.wwwrun@exchange.deltacomputer.de>
@ 2008-10-14 9:43 ` Oliver Weihe
2008-10-14 11:41 ` Lee Schermerhorn
0 siblings, 1 reply; 3+ messages in thread
From: Oliver Weihe @ 2008-10-14 9:43 UTC (permalink / raw)
To: linux-mm
Hello,
I've sent this to Andi Kleen and posted it on LKML. Andi suggested
sending it to this mailing list.
--- cut here (part 1) ---
> Hi Andi,
>
> I'm not sure if you're the right person for this but I hope you are!
>
> I've noticed that on NUMA systems (Opterons) the kernel allocates
> memory on non-local nodes for processes running on node0, even if
> local memory is available. (Kernel 2.6.25 and above)
>
> Currently I'm playing around with a quad-socket quad-core Opteron,
> but I've observed this behavior on other Opteron systems as well.
>
> Hardware specs:
> 1x Supermicro H8QM3-2
> 4x Quadcore Opteron
> 16x 2GiB (8 GiB memory per node)
>
> OS:
> currently openSUSE 10.3, but I've observed this on other distros as
> well
> Kernel: 2.6.22.* (openSUSE) / 2.6.25.4 / 2.6.25.5 / 2.6.27 (vanilla
> config)
>
> Steps to reproduce:
> Start an application which needs a lot of memory and watch the
> memory usage per node (I'm using "watch -n 1 numastat --hardware"
> for this).
> A quick-and-dirty program which allocates a big array and writes
> data into the array is enough!
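>
> For reference, a minimal sketch of such a reproducer (hypothetical:
> the actual "app" was never posted, and the 7GiB size and the sleep
> duration are illustrative):
>
> /* repro.c - allocate a big array and write to every page, so the
>  * kernel actually has to place the memory on some node.
>  * Build: gcc -O2 -o app repro.c
>  * Run:   numactl --cpunodebind=0 ./app
>  */
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> int main(void)
> {
>     size_t size = 7UL << 30;      /* ~7 GiB */
>     char *buf = malloc(size);
>
>     if (buf == NULL)
>         return 1;
>     memset(buf, 1, size);         /* fill the whole array */
>     sleep(600);                   /* stay resident; watch numastat */
>     free(buf);
>     return 0;
> }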
>
> In my setup I'm allocating an array of ~7GiB in a single-threaded
> application.
> Startup: numactl --cpunodebind=X ./app
> For X=1,2,3 it works as expected: all memory is allocated on the
> local node.
> For X=0 I can see the memory being allocated on node0 only as long
> as more than ~3GiB are "free" on node0. At that point the kernel
> starts using memory from node1 for the app!
>
> For parallel real-world apps I've seen a performance penalty of 30%
> compared to older kernels!
>
> numactl --cpunodebind=0 --membind=0 ./app "solves" the problem in
> this case, but that's not the point!
>
> --
>
> Regards,
> Oliver Weihe
--- cut here (part 2) ---
> Hello,
>
> it seems that my reproducer is not very good. :(
> It "works" much better when you start several processes at once.
>
> for i in `seq 0 3`
> do
>     numactl --cpunodebind=${i} ./app &
> done
> wait
>
> "app" still allocates some memory (7GiB per process) and fills the
> array
> with data.
>
>
> I've noticed this behaviour during some HPL (the Linpack benchmark
> used for top500.org) runs. For small data sets there's no difference
> in speed between the kernels, while for big data sets (almost the
> whole memory) 2.6.23 and newer kernels are slower than 2.6.22.
> I'm using OpenMPI with the runtime option
> "--mca mpi_paffinity_alone 1" to pin each process to a specific CPU.
>
> The bad news is: I can crash almost every quad-core Opteron system
> running kernels 2.6.21.x to 2.6.24.x with "parallel memory
> allocation and filling the memory with data" (parallel means: there
> is one process per core doing this). While it takes some time on
> dual-socket machines, it often takes less than 1 minute on
> quad-socket quad-cores until the system freezes.
> Just in case it is some vendor-specific BIOS bug: we're using
> Supermicro mainboards.
>
> > [Another copy of the reply with linux-kernel added this time]
> >
> > > In my setup I'm allocating an array of ~7GiB in a
> > > single-threaded application.
> > > Startup: numactl --cpunodebind=X ./app
> > > For X=1,2,3 it works as expected: all memory is allocated on
> > > the local node.
> > > For X=0 I can see the memory being allocated on node0 only as
> > > long as more than ~3GiB are "free" on node0. At that point the
> > > kernel starts using memory from node1 for the app!
> >
> > Hmm, that sounds like it doesn't want to use the 4GB DMA zone.
> >
> > Normally there should be no protection on it, but perhaps something
> > broke.
> >
> > What does cat /proc/sys/vm/lowmem_reserve_ratio say?
>
> 2.6.22.x:
> # cat /proc/sys/vm/lowmem_reserve_ratio
> 256 256
>
> 2.6.23.8 (and above)
> # cat /proc/sys/vm/lowmem_reserve_ratio
> 256 256 32
>
>
> > > For parallel real-world apps I've seen a performance penalty of
> > > 30% compared to older kernels!
> >
> > Compared to what older kernels? When did it start?
>
> I've tested some kernel versions that I have lying around here...
> working fine: 2.6.22.18-0.2-default (openSUSE) / 2.6.22.9
> (kernel.org)
> showing the described behaviour: 2.6.23.8; 2.6.24.4; 2.6.25.4;
> 2.6.26.5; 2.6.27
>
>
> >
> > -Andi
> >
> > --
> > ak@linux.intel.com
> >
>
>
> --
>
> Regards,
> Oliver Weihe
--- cut here ---
Regards,
Oliver Weihe
* Re: NUMA allocator on Opteron systems does non-local allocation on node0
2008-10-14 9:43 ` NUMA allocator on Opteron systems does non-local allocation on node0 Oliver Weihe
@ 2008-10-14 11:41 ` Lee Schermerhorn
2008-10-14 12:15 ` Oliver Weihe
0 siblings, 1 reply; 3+ messages in thread
From: Lee Schermerhorn @ 2008-10-14 11:41 UTC (permalink / raw)
To: Oliver Weihe; +Cc: linux-mm
On Tue, 2008-10-14 at 11:43 +0200, Oliver Weihe wrote:
> Hello,
>
> I've sent this to Andi Kleen and posted it on LKML. Andi suggested
> sending it to this mailing list.
>
>
<snip>
> >
> > > [Another copy of the reply with linux-kernel added this time]
> > >
> > > > In my setup I'm allocating an array of ~7GiB in a
> > > > single-threaded application.
> > > > Startup: numactl --cpunodebind=X ./app
> > > > For X=1,2,3 it works as expected: all memory is allocated on
> > > > the local node.
> > > > For X=0 I can see the memory being allocated on node0 only as
> > > > long as more than ~3GiB are "free" on node0. At that point the
> > > > kernel starts using memory from node1 for the app!
> > >
> > > Hmm, that sounds like it doesn't want to use the 4GB DMA zone.
> > >
> > > Normally there should be no protection on it, but perhaps something
> > > broke.
> > >
Check your /proc/sys/vm/numa_zonelist_order. By default, the kernel
will use "zone order" if the DMA zone is <= half the system memory,
meaning it will overflow to the same zone type on other nodes (e.g.,
Normal) before consuming local DMA memory. See
default_zonelist_order() and build_zonelists() in mm/page_alloc.c.
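For example, a sketch of how to check and change the ordering at
runtime (assuming the sysctl is writable on your kernel; the second
command switches to node ordering, so local DMA memory is used before
remote Normal memory):

# cat /proc/sys/vm/numa_zonelist_order
# echo node > /proc/sys/vm/numa_zonelist_order

The change affects subsequent allocations; vm.numa_zonelist_order can
also be set persistently via /etc/sysctl.conf.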
Lee
* Re: NUMA allocator on Opteron systems does non-local allocation on node0
2008-10-14 11:41 ` Lee Schermerhorn
@ 2008-10-14 12:15 ` Oliver Weihe
0 siblings, 0 replies; 3+ messages in thread
From: Oliver Weihe @ 2008-10-14 12:15 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm
Hi Lee,
thank you for the hint. I've set /proc/sys/vm/numa_zonelist_order to
"node" (it was "default") and now it works as expected. Performance
is much better than before for some parallel applications.
Regards,
Oliver Weihe
> On Tue, 2008-10-14 at 11:43 +0200, Oliver Weihe wrote:
> > Hello,
> >
> > I've sent this to Andi Kleen and posted it on LKML. Andi
> > suggested sending it to this mailing list.
> >
> >
> <snip>
> > >
> > > > [Another copy of the reply with linux-kernel added this time]
> > > >
> > > > > In my setup I'm allocating an array of ~7GiB in a
> > > > > single-threaded application.
> > > > > Startup: numactl --cpunodebind=X ./app
> > > > > For X=1,2,3 it works as expected: all memory is allocated
> > > > > on the local node.
> > > > > For X=0 I can see the memory being allocated on node0 only
> > > > > as long as more than ~3GiB are "free" on node0. At that
> > > > > point the kernel starts using memory from node1 for the
> > > > > app!
> > > >
> > > > Hmm, that sounds like it doesn't want to use the 4GB DMA zone.
> > > >
> > > > Normally there should be no protection on it, but perhaps
> > > > something broke.
> > > >
>
>
> Check your /proc/sys/vm/numa_zonelist_order. By default, the
> kernel will use "zone order" if the DMA zone is <= half the system
> memory, meaning it will overflow to the same zone type on other
> nodes (e.g., Normal) before consuming local DMA memory. See
> default_zonelist_order() and build_zonelists() in mm/page_alloc.c.
>
> Lee
>