Subject: Re: [PATCH v4][RFC] hugetlb: add per-node nr_hugepages sysfs attribute
From: Lee Schermerhorn
To: Nishanth Aravamudan, Christoph Lameter
Cc: William Lee Irwin III, anton@samba.org, akpm@linux-foundation.org, linux-mm@kvack.org
Date: Wed, 13 Jun 2007 16:05:10 -0400
Message-Id: <1181765111.6148.98.camel@localhost>
In-Reply-To: <20070613191908.GR3798@us.ibm.com>
References: <20070612001542.GJ14458@us.ibm.com> <20070612034407.GB11773@holomorphy.com> <20070612050910.GU3798@us.ibm.com> <20070612051512.GC11773@holomorphy.com> <20070612174503.GB3798@us.ibm.com> <20070612191347.GE11781@holomorphy.com> <20070613000446.GL3798@us.ibm.com> <20070613152649.GN3798@us.ibm.com> <20070613152847.GO3798@us.ibm.com> <1181759027.6148.77.camel@localhost> <20070613191908.GR3798@us.ibm.com>

On Wed, 2007-06-13 at 12:19 -0700, Nishanth Aravamudan wrote:
> On 13.06.2007 [14:23:47 -0400], Lee Schermerhorn wrote:
> > On Wed, 2007-06-13 at 08:28 -0700, Nishanth Aravamudan wrote:
> >
> > >
> > > commit 05a7edb8c909c674cdefb0323348825cf3e2d1d0
> > > Author: Nishanth Aravamudan
> > > Date: Thu Jun 7 08:54:48 2007 -0700
> > >
> > > hugetlb: add per-node nr_hugepages sysfs attribute
> > >
> > > Allow specifying the number of hugepages to allocate on a particular
> > > node.  Our current global sysctl will try its best to put hugepages
> > > equally on each node, but that may not always be desired.  This allows
> > > the admin to control the layout of hugepage allocation at a finer level
> > > (while not breaking the existing interface).  Add callbacks in the sysfs
> > > node registration and unregistration functions into hugetlb to add the
> > > nr_hugepages attribute, which is a no-op if !NUMA or !HUGETLB.
> > >
> > > Signed-off-by: Nishanth Aravamudan
> > > Cc: William Lee Irwin III
> > > Cc: Christoph Lameter
> > > Cc: Lee Schermerhorn
> > > Cc: Anton Blanchard
> > > Cc: Andrew Morton
> > >
> > > ---
> > > Do the dummy function definitions need to be (void)0?
> > >

I tested hugepage allocation on my HP rx8620 platform [16 cpu ia64, 32GB
in 4 "real" nodes and one pseudo-node containing only DMA memory].  As
expected, I don't get a balanced distribution across the real nodes.
Here's what I see:

# before allocating huge pages:
root@gwydyr(root):cat /sys/devices/system/node/node*/meminfo | grep HugeP
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 2 HugePages_Total:     0
Node 2 HugePages_Free:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 4 HugePages_Total:     0
Node 4 HugePages_Free:      0

# Now allocate 64 256MB pages.  Only nodes 0-3 have NORMAL memory.
# Node 4 contains ~512MB of DMA memory.  Some has already been
# used, so I doubt that even 1 256MB [aligned] huge page is available.
root@gwydyr(root):echo 64 >/proc/sys/vm/nr_hugepages
root@gwydyr(root):cat /sys/devices/system/node/node*/meminfo | grep HugeP
Node 0 HugePages_Total:    13   <--- ???
Node 0 HugePages_Free:     26   <--- ???
Node 1 HugePages_Total:    12
Node 1 HugePages_Free:     12
Node 2 HugePages_Total:    13
Node 2 HugePages_Free:     13
Node 3 HugePages_Total:    13
Node 3 HugePages_Free:     13
Node 4 HugePages_Total:    13   <--- ???
Node 4 HugePages_Free:      0

# 13 of the pages say they're from Node 4, but I know that node has only
# ~512MB of memory, of which some is already used.  It's unlikely that I
# could allocate even 1 256MB huge page there because of alignment.
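Here's a minimal sketch of what I think is happening at allocation time.
This is purely illustrative -- written from memory against 2.6.22-ish
mm/hugetlb.c, not the actual code or Nish's patch, and the function name
is made up: the round-robin walks every online node, including the
memoryless pseudo-node, and the per-node total gets charged to the node
we *asked* for rather than the node the page actually came from.

/* illustrative sketch only, as it might sit in mm/hugetlb.c -- not the real code */
static int alloc_fresh_huge_page_sketch(void)
{
	static int prev_nid;
	struct page *page;
	int nid;

	/* round-robin over *all* online nodes, memoryless node 4 included */
	nid = next_node(prev_nid, node_online_map);
	if (nid == MAX_NUMNODES)
		nid = first_node(node_online_map);
	prev_nid = nid;

	page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
				HUGETLB_PAGE_ORDER);
	if (!page)
		return 0;

	/*
	 * For nid == 4 the allocation falls back to node 0's zones, but
	 * the total is still charged to the node we requested ...
	 */
	nr_huge_pages_node[nid]++;
	/*
	 * ... while the free count follows the page's real node, e.g.
	 * free_huge_pages_node[page_to_nid(page)]++ -- which would give
	 * exactly Total=13 on node 4 and Free=26 on node 0 above.
	 */
	return 1;
}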
Note that the free pages are accounted on Node 0, where they actually
reside.  Here's some zoneinfo after the allocation above [forgot to snap
it before].

# zoneinfo shell function contains:
# cat /proc/zoneinfo | egrep '^Node|^ pages |^ *present|^ *spanned'

# results after allocating huge pages
root@gwydyr(root):zoneinfo
Node 0, zone   Normal
  pages free     36157
        spanned  486400
        present  484738
Node 1, zone   Normal
  pages free     318034
        spanned  520192
        present  518413
Node 2, zone   Normal
  pages free     301526
        spanned  520192
        present  518414
Node 3, zone   Normal
  pages free     301932
        spanned  520182
        present  518362
Node 4, zone      DMA
  pages free     31706
        spanned  32767
        present  32656
^^^^^^^^^^^^^^^^^^^^^^
Nope!  No huge pages allocated from here!

# now try to free the huge pages.
root@gwydyr(root):echo 0 >/proc/sys/vm/nr_hugepages
root@gwydyr(root):cat /sys/devices/system/node/node*/meminfo | grep HugeP
Node 0 HugePages_Total: 4294967283   <--- ???
Node 0 HugePages_Free:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 2 HugePages_Total:     0
Node 2 HugePages_Free:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 4 HugePages_Total:    13   <--- ??? they weren't really there to begin with!
Node 4 HugePages_Free:      0

# Apparently on remove, the pages were decremented from node 0 instead of
# node 4, where they were accounted at allocation time, leaving a negative
# count on node 0 [4294967283 == 2^32 - 13, i.e., -13 in an unsigned
# counter] and the original 13 still showing on node 4.

------------------

I tried to "tighten up" alloc_pages_node() to check the location of the
first zone in the selected zonelist, as discussed in a previous exchange
[a rough sketch of the check appears at the end of this mail].  When I do
this, I hit a BUG() in slub.c in early_kmem_cache_node_alloc(), which
apparently can't handle new_slab() returning a NULL page, even though it
calls it with GFP_THISNODE.  Slub should be able to handle memoryless
nodes, right?  I'm looking for a workaround now.

Lee
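P.S.  Here's roughly the kind of check I added to alloc_pages_node() --
a sketch from memory against 2.6.22-ish headers, not the exact diff I'm
testing, and I've keyed it off __GFP_THISNODE here, which may differ from
what I actually did: if the first zone of the node's zonelist is not on
that node [i.e., the node is memoryless], fail the allocation instead of
silently falling back to another node.

static inline struct page *
alloc_pages_node_tightened(int nid, gfp_t gfp_mask, unsigned int order)
{
	struct zonelist *zl;

	if (nid < 0)
		nid = numa_node_id();

	zl = NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask);

	/*
	 * A memoryless node's zonelist starts with some other node's
	 * zones; don't let __GFP_THISNODE quietly succeed there.
	 */
	if ((gfp_mask & __GFP_THISNODE) && zone_to_nid(zl->zones[0]) != nid)
		return NULL;

	return __alloc_pages(gfp_mask, order, zl);
}

With a check like this in place, new_slab() returns a NULL page for the
memoryless node during early boot, which is what trips the BUG() in
early_kmem_cache_node_alloc().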