Date: Wed, 1 Jul 2009 19:29:54 +0100
From: Mel Gorman
Subject: Re: [RFC 2/3] hugetlb: derive huge pages nodes allowed from task mempolicy
Message-ID: <20090701182953.GJ16355@csn.ul.ie>
References: <20090630154716.1583.25274.sendpatchset@lts-notebook>
	<20090630154818.1583.26154.sendpatchset@lts-notebook>
	<20090701143227.GF16355@csn.ul.ie>
	<1246465164.23497.150.camel@lts-notebook>
In-Reply-To: <1246465164.23497.150.camel@lts-notebook>
To: Lee Schermerhorn
Cc: linux-mm@kvack.org, linux-numa@vger.org, akpm@linux-foundation.org,
	Nishanth Aravamudan, David Rientjes, Adam Litke, Andy Whitcroft,
	eric.whitney@hp.com

On Wed, Jul 01, 2009 at 12:19:24PM -0400, Lee Schermerhorn wrote:
> On Wed, 2009-07-01 at 15:32 +0100, Mel Gorman wrote:
> > On Tue, Jun 30, 2009 at 11:48:18AM -0400, Lee Schermerhorn wrote:
> > > [RFC 2/3] hugetlb: derive huge pages nodes allowed from task mempolicy
> > >
> > > Against: 25jun09 mmotm atop the "hugetlb: balance freeing..." series
> > >
> > > This patch derives a "nodes_allowed" node mask from the numa
> > > mempolicy of the task modifying the number of persistent huge
> > > pages to control the allocation, freeing and adjusting of surplus
> > > huge pages.  This mask is derived as follows:
> > >
> > > * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
> > >   is produced.  This will cause the hugetlb subsystem to use
> > >   node_online_map as the "nodes_allowed".  This preserves the
> > >   behavior before this patch.
> >
> > Sensible.
> >
> > > * For "preferred" mempolicy, including explicit local allocation,
> > >   a nodemask with the single preferred node will be produced.
> > >   "local" policy will NOT track any internode migrations of the
> > >   task adjusting nr_hugepages.
> >
> > This excludes fallback and could do with an in-code comment. I whinge
> > about this more later.
> >
> > > * For "bind" and "interleave" policy, the mempolicy's nodemask
> > >   will be used.
> >
> > Sensible.
> >
> > > * Other than to inform the construction of the nodes_allowed node
> > >   mask, the actual mempolicy mode is ignored.  That is, all modes
> > >   behave like interleave over the resulting nodes_allowed mask
> > >   with no "fallback".
> >
> > Again, seems sensible. Resizing the hugepage pool is not like a normal
> > application's behaviour.
> >
> > > Because we may have allocated or freed a huge page with a
> > > different policy/nodes_allowed previously, we always need to
> > > check that the next_node_to_{alloc|free} exists in the current
> > > nodes_allowed mask.  To avoid duplication of code, this is done
> > > in the hstate_next_node_to_{alloc|free}() functions.
> >
> > Seems fair. While this means that some nodes could be skipped because there
> > was a hugepage pool resize with a restricted policy and then no policy, I
> > see little problem with that as such. I believe you caught all the direct
> > users of next_nid_to_[alloc|free] and used the helpers, which is good.
>
> Uh, I've been meaning to point out that I haven't modified
> try_to_free_low() to use free_pool_huge_page().  The non-stub version
> of try_to_free_low() is only compiled in when CONFIG_HIGHMEM is
> defined--i.e., on 32-bit systems.  So, it will still drain nodes
> completely in node id order.  If this is a problem and we want
> try_to_free_low() to balance freeing, I can enhance
> free_pool_huge_page() to take another flag [perhaps combining it with
> the "acct_surplus" in a single flags arg] to free only !highmem pages
> and pass a node mask of any nodes containing such pages.
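Something like this is what you mean, I take it (hypothetical sketch,
flag names invented, nothing below is in the patch):

	/* possible flags arg, folding the current "acct_surplus" bool in */
	#define FPHP_ACCT_SURPLUS	0x1	/* only free surplus pages */
	#define FPHP_LOWMEM_ONLY	0x2	/* skip highmem huge pages */

	static int free_pool_huge_page(struct hstate *h,
					nodemask_t *nodes_allowed,
					unsigned int flags);

with try_to_free_low() passing FPHP_LOWMEM_ONLY and a mask of the nodes
that still hold lowmem huge pages.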
I was basically assuming that 32-bit NUMA machines weren't interesting
but glancing through, it would be a relatively straight-forward
conversion to obey the memory policies here as well. IIRC though, 32-bit
NUMA machines will have the NORMAL zone on one node anyway so
interleaving is basically a no-op if one wants to concentrate on freeing
lowmem pages.

> > > So,
> > > these functions have been modified to allow them to be called
> > > to obtain the "start_nid".  Then, whereas prior to this patch
> > > we unconditionally called hstate_next_node_to_{alloc|free}(),
> > > whether or not we successfully allocated/freed a huge page on
> > > the node, now we only call these functions on failure to alloc/free.
> > >
> > > Notes:
> > >
> > > 1) This patch introduces a subtle change in behavior:  huge page
> > >    allocation and freeing will be constrained by any mempolicy
> > >    that the task adjusting the huge page pool inherits from its
> > >    parent.  This policy could come from a distant ancestor.  The
> > >    administrator adjusting the huge page pool without explicitly
> > >    specifying a mempolicy via numactl might be surprised by this.
> >
> > I would be trying to encourage administrators to use hugeadm instead of
> > manually tuning the pools. One possible course of action is for hugeadm
> > to check if a policy is currently set and output that as an INFO message.
> > That will show up if they run with hugeadm -v. Alternatively, we could note
> > as a WARN when any policy is set and print an INFO message on details of
> > the policy.
>
> Yes.  I saw mention of direct access to the sysctls being deprecated.
> I'm not sure how that will go over with users, but if we go with changes
> like this, it makes sense to handle them in hugeadm.

In this case, we're only worried about notifying the user if they are
under a policy. If they are accessing the sysctls directly, there is not
much that can be done to warn them.

> > > Additionally, any mempolicy specified by numactl will be
> > > constrained by the cpuset in which numactl is invoked.
> > >
> > > 2) Hugepages allocated at boot time use the node_online_map.
> > >    An additional patch could implement a temporary boot time
> > >    huge pages nodes_allowed command line parameter.
> >
> > I'd want for someone to complain before implementing such a patch. Even
> > on systems where hugepages must be allocated very early on, an init script
> > should be more than sufficient without implementing parsing for mempolicies
> > and hugepages.
>
> I agree in principle.  I expect that if one can't allocate the required
> number of huge pages in an early init script, either one really needs
> more memory, or something is wrong with earlier scripts that is
> fragmenting memory.

Agreed.

> However, I'll point out that by the time one gets
> such a complaint, there is a really long lead time before a change gets
> into customer hands.  This is the downside of waiting for a complaint or
> definitive use case before providing tunable parameters and such :(.

There is that, but as the workaround here is basically "install more
memory", it feels less critical. I just can't see the lack of node
specification being a show-stopper for anyone. Famous last words of
course :/
> > > See the updated documentation [next patch] for more information
> > > about the implications of this patch.
> >
> > Ideally the change log should show a before and after of using numactl +
> > hugeadm to resize pools on a subset of available nodes.
>
> I'll do that.
>
> > > Signed-off-by: Lee Schermerhorn
> > >
> > >  include/linux/mempolicy.h |    3 +
> > >  mm/hugetlb.c              |   99 +++++++++++++++++++++++++++++++---------------
> > >  mm/mempolicy.c            |   54 +++++++++++++++++++++++++
> > >  3 files changed, 124 insertions(+), 32 deletions(-)
> > >
> > > Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c
> > > ===================================================================
> > > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c	2009-06-29 23:01:01.000000000 -0400
> > > +++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c	2009-06-30 11:29:49.000000000 -0400
> > > @@ -621,29 +621,52 @@ static struct page *alloc_fresh_huge_pag
> > >  }
> > >
> > >  /*
> > > + * common helper functions for hstate_next_node_to_{alloc|free}.
> > > + * We may have allocated or freed a huge page based on a different
> > > + * nodes_allowed previously, so h->next_node_to_{alloc|free} might
> > > + * be outside of *nodes_allowed.  Ensure that we use the next
> > > + * allowed node for alloc or free.
> > > + */
> > > +static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
> > > +{
> > > +	nid = next_node(nid, *nodes_allowed);
> > > +	if (nid == MAX_NUMNODES)
> > > +		nid = first_node(*nodes_allowed);	/* handle "wrap" */
> > > +	return nid;
> > > +}
> >
> > The handle wrap comment is unnecessary there. This is such a common pattern,
> > it should be self-evident without placing comments that cause the
> > CodingStyle Police to issue tickets.
>
> Thanks for the warning.  I'll remove it.
>
> > > +
> > > +static int this_node_allowed(int nid, nodemask_t *nodes_allowed)
> > > +{
> > > +	if (!node_isset(nid, *nodes_allowed))
> > > +		nid = next_node_allowed(nid, nodes_allowed);
> > > +	return nid;
> > > +}
> > > +
> >
> > What happens if node hot-remove occurred and there is now no node that we
> > are allowed to allocate from?
>
> If we return a valid node id [see below] of an off-line node, the
> allocation or free will fail and we'll advance to the next node.  If all
> the nodes in the nodes_allowed are off-line, we'll end up with next_nid
> == start_nid and bail out.

Ok, there is a possibility we'll OOM when returning NULL like this but a
sensible way of dealing with that situation doesn't spring to mind. Mind
you, maybe a user shouldn't be too surprised if they off-line a node
that has active processes still bound to it :/. I guess ordinarily the
pages would be migrated but hugetlbfs does not have such capabilities
right now.

> > This thing will end up returning MAX_NUMNODES right? That potentially then
> > gets passed to alloc_pages_exact_node() triggering a VM_BUG_ON() there.
>
> No.  That's "next_node_allowed()"--the function above with the spurious
> "handle wrap" comment--not the bare "next_node()".  So, we will "handle
> wrap" appropriately.

But first_node() itself can return MAX_NUMNODES, right?
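To illustrate the concern (illustration only, not code from the patch):

	/*
	 * If nodes_allowed ends up empty -- say, every allowed node
	 * was hot-removed -- both lookups come back as MAX_NUMNODES:
	 */
	nodemask_t empty = NODE_MASK_NONE;
	int nid;

	nid = next_node(0, empty);	/* MAX_NUMNODES */
	nid = first_node(empty);	/* also MAX_NUMNODES, the "wrap" doesn't help */

so this_node_allowed() can still hand MAX_NUMNODES back to its caller.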
> > > +/*
> > >   * Use a helper variable to find the next node and then
> > >   * copy it back to next_nid_to_alloc afterwards:
> > >   * otherwise there's a window in which a racer might
> > >   * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
> > >   * But we don't need to use a spin_lock here: it really
> > >   * doesn't matter if occasionally a racer chooses the
> > > - * same nid as we do.  Move nid forward in the mask even
> > > - * if we just successfully allocated a hugepage so that
> > > - * the next caller gets hugepages on the next node.
> > > + * same nid as we do.  Move nid forward in the mask whether
> > > + * or not we just successfully allocated a hugepage so that
> > > + * the next allocation addresses the next node.
> > >   */
> > >  static int hstate_next_node_to_alloc(struct hstate *h,
> > >  					nodemask_t *nodes_allowed)
> > >  {
> > > -	int next_nid;
> > > +	int nid, next_nid;
> > >
> > >  	if (!nodes_allowed)
> > >  		nodes_allowed = &node_online_map;
> > >
> > > -	next_nid = next_node(h->next_nid_to_alloc, *nodes_allowed);
> > > -	if (next_nid == MAX_NUMNODES)
> > > -		next_nid = first_node(*nodes_allowed);
> > > +	nid = this_node_allowed(h->next_nid_to_alloc, nodes_allowed);
> > > +
> > > +	next_nid = next_node_allowed(nid, nodes_allowed);
> > >  	h->next_nid_to_alloc = next_nid;
> > > -	return next_nid;
> > > +
> > > +	return nid;
> > >  }
> >
> > Seems reasonable. I see now what you mean about next_nid_to_alloc being more
> > like its name in this patch series than the last.
> >
> > >  static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
> > > @@ -653,15 +676,17 @@ static int alloc_fresh_huge_page(struct
> > >  	int next_nid;
> > >  	int ret = 0;
> > >
> > > -	start_nid = h->next_nid_to_alloc;
> > > +	start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
> > >  	next_nid = start_nid;
> >
> > So, here for example, I think we need to make sure start_nid is not
> > MAX_NUMNODES and bail out if it is.
>
> See response above, re: this_node_allowed().  Do you still think it's a
> problem?

I'm not fully convinced. As first_node() can return MAX_NUMNODES, it
looks like hstate_next_node_to_alloc() can return MAX_NUMNODES and that
can get passed to alloc_pages_exact_node().

> > >  	do {
> > >  		page = alloc_fresh_huge_page_node(h, next_nid);
> > > -		if (page)
> > > +		if (page) {
> > >  			ret = 1;
> > > +			break;
> > > +		}
> >
> > Ok, so this break is necessary on allocation success because
> > hstate_next_node_to_alloc() has already bumped next_nid_to_alloc. Right?
>
> Right.  I don't want to fall thru and skip a node.  So, I break and
> remove the !page from the loop condition.
>
> However, since you mention it:  I noticed, since I sent this, that with
> this arrangement we will skip a node id in the case that we visit all
> nodes w/o successfully allocating a huge page or finding a huge page to
> free.  When we hit the loop termination condition "next_nid ==
> start_nid", we've already advanced hstate_next_node_to_{alloc|free}
> beyond next_nid.  So, on the next call, start_nid will become that value
> and not the same start_nid we used before.
>
> Example:  suppose we try to allocate a huge page [on, say, a 4-node
> system] with no huge pages available anywhere, and the
> next_node_to_alloc == 0 when we first call alloc_fresh_huge_page().
> start_nid will get '0' and we'll examine nodes 0, 1, 2 and 3.  When
> hstate_next_node_to_alloc() returns '0', we'll give up and return NULL
> page, but next_node_to_alloc is now 1.  If we try again, maybe after we
> think some memory has been freed up and we have a chance of succeeding,
> we'll visit 1, 2, 3 and 0 [if there are still no huge pages available].
> And so forth.
> I don't
> think this is a problem because if we actually succeed in allocating or
> freeing a page, we'll break w/o advancing next_node... and pick up there
> on the next call.
>
> I have an idea for fixing this behavior, if anyone thinks it's a
> problem, by "pushing back" the next_nid to the hstate with something
> like:
>
> 	} while (next_nid != start_nid || push_back_alloc_node(next_nid))
>
> where the push_back function always returns 0.  Kind of ugly, but I
> think it would work.  I don't think it's needed, tho'.

I don't think it's a problem that is worth worrying about. If the
administrator is interleaving between multiple nodes, I don't see why
they would care which node gets used first.

> > >  		next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
> > > -	} while (!page && next_nid != start_nid);
> > > +	} while (next_nid != start_nid);
> > >
> > >  	if (ret)
> > >  		count_vm_event(HTLB_BUDDY_PGALLOC);
> > > @@ -672,21 +697,23 @@ static int alloc_fresh_huge_page(struct
> > >  }
> > >
> > >  /*
> > > - * helper for free_pool_huge_page() - find next node
> > > - * from which to free a huge page
> > > + * helper for free_pool_huge_page() - return the next node
> > > + * from which to free a huge page.  Advance the next node id
> > > + * whether or not we find a free huge page to free so that the
> > > + * next attempt to free addresses the next node.
> > >   */
> > >  static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> > >  {
> > > -	int next_nid;
> > > +	int nid, next_nid;
> > >
> > >  	if (!nodes_allowed)
> > >  		nodes_allowed = &node_online_map;
> > >
> > > -	next_nid = next_node(h->next_nid_to_free, *nodes_allowed);
> > > -	if (next_nid == MAX_NUMNODES)
> > > -		next_nid = first_node(*nodes_allowed);
> > > +	nid = this_node_allowed(h->next_nid_to_free, nodes_allowed);
> > > +	next_nid = next_node_allowed(nid, nodes_allowed);
> > >  	h->next_nid_to_free = next_nid;
> > > -	return next_nid;
> > > +
> > > +	return nid;
> > >  }
> > >
> > >  /*
> > > @@ -702,7 +729,7 @@ static int free_pool_huge_page(struct hs
> > >  	int next_nid;
> > >  	int ret = 0;
> > >
> > > -	start_nid = h->next_nid_to_free;
> > > +	start_nid = hstate_next_node_to_free(h, nodes_allowed);
> > >  	next_nid = start_nid;
> >
> > I guess if start_nid is MAX_NUMNODES here as well, we should bail out
> > early. It means that a pool shrink by the requested number of pages may
> > fail if there are not enough hugepages allocated on the allowed nodes but
> > maybe that's not such a big problem.
>
> Again, I don't think that can happen.  ???
>
> > hugeadm will actually notice when this situation occurs and gives a
> > warning about it.
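To expand on my own comment above: if it can happen, the bail-out I had
in mind is just (untested sketch, not a patch):

	start_nid = hstate_next_node_to_free(h, nodes_allowed);
	if (start_nid == MAX_NUMNODES)
		return ret;	/* no allowed nodes, nothing freed */
	next_nid = start_nid;

and the equivalent in alloc_fresh_huge_page().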
> >
> > >  	do {
> > > @@ -726,9 +753,10 @@ static int free_pool_huge_page(struct hs
> > >  			}
> > >  			update_and_free_page(h, page);
> > >  			ret = 1;
> > > +			break;
> > >  		}
> > >  		next_nid = hstate_next_node_to_free(h, nodes_allowed);
> > > -	} while (!ret && next_nid != start_nid);
> > > +	} while (next_nid != start_nid);
> > >
> > >  	return ret;
> > >  }
> > > @@ -1039,10 +1067,9 @@ int __weak alloc_bootmem_huge_page(struc
> > >  	void *addr;
> > >
> > >  	addr = __alloc_bootmem_node_nopanic(
> > > -			NODE_DATA(h->next_nid_to_alloc),
> > > +			NODE_DATA(hstate_next_node_to_alloc(h, NULL)),
> > >  			huge_page_size(h), huge_page_size(h), 0);
> > >
> > > -	hstate_next_node_to_alloc(h, NULL);	/* always advance nid */
> > >  	if (addr) {
> > >  		/*
> > >  		 * Use the beginning of the huge page to store the
> > > @@ -1179,29 +1206,33 @@ static int adjust_pool_surplus(struct hs
> > >  	VM_BUG_ON(delta != -1 && delta != 1);
> > >
> > >  	if (delta < 0)
> > > -		start_nid = h->next_nid_to_alloc;
> > > +		start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
> > >  	else
> > > -		start_nid = h->next_nid_to_free;
> > > +		start_nid = hstate_next_node_to_free(h, nodes_allowed);
> > >  	next_nid = start_nid;
> > >
> > >  	do {
> > >  		int nid = next_nid;
> > >  		if (delta < 0) {
> > > -			next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
> > >  			/*
> > >  			 * To shrink on this node, there must be a surplus page
> > >  			 */
> > > -			if (!h->surplus_huge_pages_node[nid])
> > > +			if (!h->surplus_huge_pages_node[nid]) {
> > > +				next_nid = hstate_next_node_to_alloc(h,
> > > +								nodes_allowed);
> > >  				continue;
> > > +			}
> > >  		}
> > >  		if (delta > 0) {
> > > -			next_nid = hstate_next_node_to_free(h, nodes_allowed);
> > >  			/*
> > >  			 * Surplus cannot exceed the total number of pages
> > >  			 */
> > >  			if (h->surplus_huge_pages_node[nid] >=
> > > -						h->nr_huge_pages_node[nid])
> > > +						h->nr_huge_pages_node[nid]) {
> > > +				next_nid = hstate_next_node_to_free(h,
> > > +								nodes_allowed);
> > >  				continue;
> > > +			}
> > >  		}
> > >
> > >  		h->surplus_huge_pages += delta;
> > > @@ -1217,10 +1248,13 @@ static int adjust_pool_surplus(struct hs
> > >  static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> > >  {
> > >  	unsigned long min_count, ret;
> > > +	nodemask_t *nodes_allowed;
> > >
> > >  	if (h->order >= MAX_ORDER)
> > >  		return h->max_huge_pages;
> > >
> > > +	nodes_allowed = huge_mpol_nodes_allowed();
> > > +
> > >  	/*
> > >  	 * Increase the pool size
> > >  	 * First take pages out of surplus state.  Then make up the
> > > @@ -1234,7 +1268,7 @@ static unsigned long set_max_huge_pages(
> > >  	 */
> > >  	spin_lock(&hugetlb_lock);
> > >  	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
> > > -		if (!adjust_pool_surplus(h, NULL, -1))
> > > +		if (!adjust_pool_surplus(h, nodes_allowed, -1))
> > >  			break;
> > >  	}
> > >
> > > @@ -1245,7 +1279,7 @@ static unsigned long set_max_huge_pages(
> > >  	 * and reducing the surplus.
> > >  	 */
> > >  	spin_unlock(&hugetlb_lock);
> > > -	ret = alloc_fresh_huge_page(h, NULL);
> > > +	ret = alloc_fresh_huge_page(h, nodes_allowed);
> > >  	spin_lock(&hugetlb_lock);
> > >  	if (!ret)
> > >  		goto out;
> > > @@ -1271,16 +1305,17 @@ static unsigned long set_max_huge_pages(
> > >  	min_count = max(count, min_count);
> > >  	try_to_free_low(h, min_count);
> > >  	while (min_count < persistent_huge_pages(h)) {
> > > -		if (!free_pool_huge_page(h, NULL, 0))
> > > +		if (!free_pool_huge_page(h, nodes_allowed, 0))
> > >  			break;
> > >  	}
> > >  	while (count < persistent_huge_pages(h)) {
> > > -		if (!adjust_pool_surplus(h, NULL, 1))
> > > +		if (!adjust_pool_surplus(h, nodes_allowed, 1))
> > >  			break;
> > >  	}
> > > out:
> > >  	ret = persistent_huge_pages(h);
> > >  	spin_unlock(&hugetlb_lock);
> > > +	kfree(nodes_allowed);
> > >  	return ret;
> > >  }
> > >
> > > Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/mempolicy.c
> > > ===================================================================
> > > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/mempolicy.c	2009-06-29 12:18:11.000000000 -0400
> > > +++ linux-2.6.31-rc1-mmotm-090625-1549/mm/mempolicy.c	2009-06-29 23:11:35.000000000 -0400
> > > @@ -1544,6 +1544,60 @@ struct zonelist *huge_zonelist(struct vm
> > >  	}
> > >  	return zl;
> > >  }
> > > +
> > > +/**
> > > + * huge_mpol_nodes_allowed()
> > > + *
> > > + * Return a [pointer to a] nodelist for persistent huge page allocation
> > > + * based on the current task's mempolicy:
> > > + *
> > > + * If the task's mempolicy is "default" [NULL], just return NULL for
> > > + * default behavior.  Otherwise, extract the policy nodemask for 'bind'
> > > + * or 'interleave' policy or construct a nodemask for 'preferred' or
> > > + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
> > > + * It is the caller's responsibility to free this nodemask.
> > > + */
> > > +nodemask_t *huge_mpol_nodes_allowed(void)
> > > +{
> > > +	nodemask_t *nodes_allowed = NULL;
> > > +	struct mempolicy *mempolicy;
> > > +	int nid;
> > > +
> > > +	if (!current || !current->mempolicy)
> > > +		return NULL;
> > > +
> >
> > Is it really possible for current to be NULL here?
>
> Not sure.  I've hit NULL current in mempolicy functions called at init
> time before.  Maybe a [VM_]BUG_ON() would be preferable?

VM_BUG_ON() I'd say would be enough. Even if support is added later to
allocate with a nodemask at boot-time when current == NULL, they'll need
to know to use the default policy instead of current->mempolicy.

> > > +	mpol_get(current->mempolicy);
> > > +	nodes_allowed = kzalloc(sizeof(*nodes_allowed), GFP_KERNEL);
> > > +	if (!nodes_allowed) {
> > > +		printk(KERN_WARNING "Unable to allocate nodes allowed mask "
> > > +			"for huge page allocation\nFalling back to default\n");
> > > +		goto out;
> > > +	}
> >
> > In terms of memory policies, huge_mpol_nodes_allowed() would appear
> > to be unique in allocating a nodemask and expecting the caller to free it
> > without leaking. Would it be possible to move the kzalloc() to the caller
> > of huge_mpol_nodes_allowed()? It'd be less surprising to me.
>
> I thought about doing that.  Then, I'd have to return an indication of
> whether to use the allocated mask, or free it and use NULL.  This seemed
> cleaner, as set_max_huge_pages() is the only caller of
> huge_mpol_nodes_allowed(), which I thought really should go in
> mempolicy.c [like huge_zonelist()].  And I did note the caller's
> responsibility here...

Fair point.
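And to be fair, the resulting caller pattern is simple enough. As a
sketch of what the patch already does in set_max_huge_pages():

	nodemask_t *nodes_allowed = huge_mpol_nodes_allowed();	/* may be NULL */

	/* ... adjust the pool, treating NULL as node_online_map ... */

	kfree(nodes_allowed);	/* kfree(NULL) is a no-op, so no special case */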
> > I take it you are not allocating nodemask_t on the stack of the caller
> > because potentially nodemask_t is very large if MAX_NUMNODES is a big
> > number. Is that accurate?
>
> Well, that and the fact that I need to pass NULL to the alloc/free
> functions when we have default policy.  I'd still need a nodes_allowed
> pointer and you know how you hate having both the nodemask_t and the
> pointer :).

Yeah, it does cause me to whinge :)

> > nodemasks are meant to be cleared with nodes_clear() but you depend on
> > kzalloc() zeroing the bitmap for you. While the end result is the same,
> > is using kzalloc instead of kmalloc+nodes_clear() considered ok?
>
> That did cross my mind.  And, it's not exactly a fast path.  Shall I
> change it to kmalloc+nodes_clear()?  Or maybe add a nodemask_alloc() or
> nodemask_new() function?  We don't have one of those, as nodemasks have
> mostly been passed around by value rather than reference.

I think kmalloc+nodes_clear() might be more future proof in the event
someone does something like use a poison byte pattern to detect when a
nodemask is improperly initialised or something similar. I don't think
it needs a nodemask_alloc() or nodemask_new() helper.

> > > +
> > > +	mempolicy = current->mempolicy;
> > > +	switch(mempolicy->mode) {
> > > +	case MPOL_PREFERRED:
> > > +		if (mempolicy->flags & MPOL_F_LOCAL)
> > > +			nid = numa_node_id();
> > > +		else
> > > +			nid = mempolicy->v.preferred_node;
> > > +		node_set(nid, *nodes_allowed);
> > > +		break;
> > > +
> >
> > I think a comment is needed here saying that MPOL_PREFERRED when
> > resizing the pool acts more like MPOL_BIND to the preferred node with no
> > fallback.
>
> Yeah, I can add such a comment.  I did document this in the hugetlbfs
> doc, but I guess it's good to have it here, as well.

It doesn't hurt.

> > I see your problem though. You can't set the next_nid to allocate from
> > to be the preferred node because the second allocation will interleave away
> > from it though. How messy would it be to check if the MPOL_PREFERRED policy
> > was in use and avoid updating next_nid_to_alloc while the preferred node is
> > being allocated from?
> >
> > It's not a big issue and I'd be ok with your current behaviour to start with.
>
> Yeah, and I don't think we want the hugetlb.c code directly looking at
> mempolicy internals.  I'd have to add another one-user helper function,
> probably in mempolicy.h.  Not a biggy, but I didn't think it was
> necessary.  It doesn't hurt to reiterate that allocations for populating
> the huge page pool do not fallback.

Let's go with your suggested behaviour for now. Double checking the
policy will just over-complicate things in the first iteration.
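For reference, I take it the one-user helper you mention would be no
more than something like this (hypothetical sketch, name invented, not
part of the patch):

	/* in mempolicy.h: does the policy pin pool growth to one node? */
	static inline int mpol_is_preferred(struct mempolicy *pol)
	{
		return pol && pol->mode == MPOL_PREFERRED;
	}

which hugetlb.c could use to avoid advancing next_nid_to_alloc, but as
above, it isn't worth it for the first iteration.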
> > > +	case MPOL_BIND:
> > > +		/* Fall through */
> > > +	case MPOL_INTERLEAVE:
> > > +		*nodes_allowed = mempolicy->v.nodes;
> > > +		break;
> > > +
> > > +	default:
> > > +		BUG();
> > > +	}
> > > +
> > > +out:
> > > +	mpol_put(current->mempolicy);
> > > +	return nodes_allowed;
> > > +}
> > >  #endif
> > >
> > >  /* Allocate a page in interleaved policy.
> > > Index: linux-2.6.31-rc1-mmotm-090625-1549/include/linux/mempolicy.h
> > > ===================================================================
> > > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/include/linux/mempolicy.h	2009-06-29 12:18:11.000000000 -0400
> > > +++ linux-2.6.31-rc1-mmotm-090625-1549/include/linux/mempolicy.h	2009-06-29 23:06:34.000000000 -0400
> > > @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
> > >  extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> > >  				unsigned long addr, gfp_t gfp_flags,
> > >  				struct mempolicy **mpol, nodemask_t **nodemask);
> > > +extern nodemask_t *huge_mpol_nodes_allowed(void);
> > >  extern unsigned slab_node(struct mempolicy *policy);
> > >
> > >  extern enum zone_type policy_zone;
> > > @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
> > >  	return node_zonelist(0, gfp_flags);
> > >  }
> > >
> > > +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; }
> > > +
> > >  static inline int do_migrate_pages(struct mm_struct *mm,
> > >  			const nodemask_t *from_nodes,
> > >  			const nodemask_t *to_nodes, int flags)
> >
> > By and large, this patch would appear to result in reasonable behaviour
> > for administrators that want to limit the hugepage pool to specific
> > nodes. Predictably, I prefer this approach to the nodemask-sysctl
> > approach :/. With a few crinkles ironed out, I reckon I'd be happy with
> > this. Certainly, it appears to work as advertised in that I was able to
> > accurately grow/shrink the pool on specific nodes.
>
> Having written the patch, I'm beginning to warm up to this approach
> myself :).  We need to understand and advertise any apparent change in
> behavior to be sure others agree.
>
> Thanks for the review,
> Lee

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab