Date: Tue, 12 Jun 2007 10:28:58 -0700
From: Nishanth Aravamudan
To: Lee Schermerhorn
Cc: Christoph Lameter, anton@samba.org, akpm@linux-foundation.org,
 linux-mm@kvack.org
Subject: Re: [PATCH] populated_map: fix !NUMA case, remove comment
Message-ID: <20070612172858.GV3798@us.ibm.com>
In-Reply-To: <1181660782.5592.50.camel@localhost>
References: <20070611234155.GG14458@us.ibm.com>
 <20070612000705.GH14458@us.ibm.com>
 <20070612020257.GF3798@us.ibm.com>
 <20070612023209.GJ3798@us.ibm.com>
 <20070612032055.GQ3798@us.ibm.com>
 <1181660782.5592.50.camel@localhost>

On 12.06.2007 [11:06:22 -0400], Lee Schermerhorn wrote:
> On Mon, 2007-06-11 at 20:20 -0700, Nishanth Aravamudan wrote:
> > On 11.06.2007 [19:54:13 -0700], Christoph Lameter wrote:
> > > On Mon, 11 Jun 2007, Nishanth Aravamudan wrote:
> > > >
> > > > On 11.06.2007 [19:20:58 -0700], Christoph Lameter wrote:
> > > > > On Mon, 11 Jun 2007, Nishanth Aravamudan wrote:
> > > > > >
> > > > > > [PATCH v6][RFC] Fix hugetlb pool allocation with empty nodes
> > > > >
> > > > > There is no point in compiling the interleave logic for !NUMA.
> > > > > There needs to be some sort of !NUMA fallback in hugetlb.
> > > > > It would be better to call an interleave function in
> > > > > mempolicy.c that provides an appropriate shim for !NUMA.
> > > >
> > > > Hrm, if !NUMA, is the nid of the only node guaranteed to be 0?
> > > > If so, I can just
> > >
> > > Yes.
> > >
> > > > Make alloc_fresh_huge_page() and other generic variants call
> > > > into the _node() versions with nid=0, if !NUMA.
> > > >
> > > > Would that be ok?
> > >
> > > I am not sure what you are up to. Just make sure that the changes
> > > are minimal. Look in the source code for other examples on how
> > > !NUMA situations were handled.
> >
> > I swear I'm trying to make the code do the right thing, and
> > understand the NUMA intricacies better. Sorry for the flood of
> > e-mails and such. I asked about specific other cases because they
> > are used in !NUMA situations too and I wasn't sure why
> > node_populated_map should be different.
> >
> > But ok, I will rely on the source to be correct and make my
> > changelog indicate where I got the ideas from.
>
> Nish: when this all settles down, I still need to make sure it works
> on our platforms with the funny DMA-only node. What that comes down
> to is that when alloc_fresh_huge_page() calls:

Ok, thanks for these details. Would you be ok with stabilizing the
generic definition of node_populated_map as is (any present pages,
regardless of location), and then trying to figure out how to get your
platform to work with that?

> 	page = alloc_pages_node(nid,
> 		GFP_HIGHUSER|__GFP_COMP|GFP_THISNODE,
> 		HUGETLB_PAGE_ORDER);
>
> I need to get a page that is on nid. On our platform, GFP_HIGHUSER is
> going to specify the zonelist for ZONE_NORMAL. The first zone on this
> list needs to be on-node for nid. With the changes you've made to the
> definition of populated map, I think this won't be the case. I need
> to test your latest patches and fix that, if it's broken.

Ok. But that means your platform is broken now too, right? As in, it's
not a regression, per se?
I'm much more concerned in the short term about the whole
memoryless-node issue, which I think is more straightforward and
generic to fix.

> I still think using policy zone is the "right way" to go, here. After
> all, only pages in the policy zone are controlled by policy, and
> that's the goal of spreading out the huge pages across nodes--to make
> them available to satisfy memory policy at allocation time. But that
> would need some adjustments for x86_64 systems that have some nodes
> that are all/mostly DMA32 and other nodes that are populated in zones
> > DMA32, if we want to allocate huge pages out of the DMA32 zone.

Well, as of right now, I'm *only* trying to deal with memoryless
nodes. So then this whole notion of policy_zone is relatively moot. It
matters for your platform, I understand, but I think the fix there is
more complex and probably should be stacked on the current set, once
it is stabilized.

> As far as the static variable, and round-robin allocation: the
> current method "works" both for huge pages allocated at boot time and
> for huge pages allocated at run-time via the vm.nr_hugepages sysctl.
> By "works", I mean that it continues to spread the pages evenly
> across the "populated" nodes. If, however, you use the task-local
> counter to interleave fresh huge pages, each write to nr_hugepages
> from a different task ["echo NN > .../nr_hugepages"] will start at
> node zero or the first populated node--assuming you're interleaving
> across populated nodes and not on-line nodes. That's probably OK if
> you always change nr_hugepages by a multiple of the number of
> populated nodes. And, if things get out of balance, we'll have your
> per-node attribute, I hope, to adjust any individual node.

Yes, I will reply about the il_next thing in a sec. Maybe Christoph
has some cleverness. And yes, I think the per-node attribute will fix
most of the interface problems for 'odd' NUMA systems.
Thanks,
Nish

-- 
Nishanth Aravamudan
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majordomo@kvack.org. For more info on Linux MM, see:
http://www.linux-mm.org/ .
Don't email: email@kvack.org