From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e6.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l5DJ2FfS006578 for ; Wed, 13 Jun 2007 15:02:15 -0400 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l5DJ14gp349964 for ; Wed, 13 Jun 2007 15:01:08 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l5DJ14vg008702 for ; Wed, 13 Jun 2007 15:01:04 -0400 Date: Wed, 13 Jun 2007 12:01:01 -0700 From: Nishanth Aravamudan Subject: Re: [PATCH] populated_map: fix !NUMA case, remove comment Message-ID: <20070613190101.GQ3798@us.ibm.com> References: <20070612172858.GV3798@us.ibm.com> <1181674081.5592.91.camel@localhost> <1181677473.5592.149.camel@localhost> <20070612200125.GG3798@us.ibm.com> <1181748606.6148.19.camel@localhost> <20070613175802.GP3798@us.ibm.com> <1181758874.6148.73.camel@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1181758874.6148.73.camel@localhost> Sender: owner-linux-mm@kvack.org Return-Path: To: Lee Schermerhorn Cc: Christoph Lameter , anton@samba.org, akpm@linux-foundation.org, linux-mm@kvack.org List-ID: On 13.06.2007 [14:21:14 -0400], Lee Schermerhorn wrote: > On Wed, 2007-06-13 at 10:58 -0700, Nishanth Aravamudan wrote: > > On 13.06.2007 [11:30:06 -0400], Lee Schermerhorn wrote: > > > On Tue, 2007-06-12 at 13:01 -0700, Nishanth Aravamudan wrote: > > > > On 12.06.2007 [12:58:16 -0700], Christoph Lameter wrote: > > > > > On Tue, 12 Jun 2007, Christoph Lameter wrote: > > > > > > > > > > > Uhhh... Right there is another special case. The recently > > > > > > introduces zonelist swizzle makes the DMA zone come last and if a > > > > > > node had only a DMA zone then it may become swizzled to the end of > > > > > > the zonelist. > > > > > > > > > > Maybe we can ignore that case for now: > > > > > > > > I wish we wouldn't. We need the "DMA zone comes last" for both HP and > > > Fujitsu platforms. That's why Kame and I worked on that patch > > > together. > > > > Right. I interpreted the "for now" as for this first stack of patches. > > We'll need a fix for your platform on top, but it seems to be a minority > > case? Not saying it shouldn't be fixed, by any means, just trying to get > > a handle on it. > > Yep. I'm testing the stack "as is" now. If it doesn't spread the > huge pages evenly because of our funky DMA-only node, I'll post a fix > up patch for consideration. Great, thanks. > By the way, your sysfs attribute patch doesn't compile. I'll post > comments/fixes in response to your message that submitted the patch. Dang, much appreciated. > > > > > > > > I think that the "node has memory" mask is fine for scanning nodes > > > that might have memory in the zone of interest--including in the > > > hugetlb alloc_fresh_huge_page() loop. However, I think that to > > > support all platforms in a generic way, alloc_pages_node() and > > > alloc_page_interleave() [both take a node id arg] should be more > > > strict when the gfp mask includes 'THISNODE and not assume that a > > > populated node always has on-node memory in the zone of interest. > > > > Hrm, perhaps. > > > > > E.g., something like: > > > > > > pgdat_t *pgdat; > > > struct zonelist *zonelist; > > > > > > ... > > > > > > /* > > > * after validating nid, ... > > > * Note that we need to fetch these values anyway for the > > > * [likely?] call to __alloc_pages(). > > > */ > > > pgdat = NODE_DATA(nid); > > > zonelist = pgdat->node_zonelists + gfp_zone(gfp_mask); > > > > > > if ((gfp_mask & __GFP_THISNODE) && > > > zonelist->zones[0]->zone_pgdat != pgdat) > > > return NULL; > > > > > > return __alloc_pages(gfp_mask, order, zonelist); > > > > > > > > > I see you've submitted a new patch set. I grab it [when Nish reposts] > > > and test it as is and modified to look something like the above, if > > > needed. > > > > I think your code above makes sense -- I'd still leave in the earlier > > check, though. > > > > So it probably should be: > > > > pgdat = NODE_DATA(nid); > > zonelist = pgdat->node_zonelists + gfp_zone(gfp_mask); > > > > if (unlikely((gfp_mask & __GFP_THISNODE) && > > (!node_memory(nid) || > > zonelist->zones[0]->zone_pgdat != pgdat))) > > return NULL; > > > > That way, if the node has no memory whatsoever, we don't bother checking > > the pgdat of the relevant zone? > > Well, since most nodes WILL, I think, have memory, that just adds an > extra check in the most frequent case. Then, we'll have to go ahead > and check the pgdat. However, if the first zone in the selected > zonelist IS "on-node" [pgdats match], we know that the node has memory > [altho' the zone may not have available pages]. And since we have to > fetch the pgdat and the zonelist, anyway, as the argument to > __alloc_pages(), I don't think my proposed change adds any additional > memory ref's, while eliminating the ref to the node_memory_map. I'm > assuming here that the compiler will optimize away any stores to the > pgdat/zonelist variables. You're right, sorry for the noise. > So, we can use the node_memory() test at higher levels--like the > alloc_fresh_huge_page() loop, to avoid attempting allocations from > nodes that we know have no memory, but I think the > allocate_pages_node() and allocate_interleave_page() should test the > selected zonelist explicitly. Yep. Thanks, Nish -- Nishanth Aravamudan IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org