From mboxrd@z Thu Jan 1 00:00:00 1970 Subject: Re: [RFC][PATCH 1/2] hugetlb: search harder for memory in alloc_fresh_huge_page() From: Lee Schermerhorn In-Reply-To: <20070807171432.GY15714@us.ibm.com> References: <20070807171432.GY15714@us.ibm.com> Content-Type: text/plain Date: Tue, 07 Aug 2007 16:15:22 -0400 Message-Id: <1186517722.5067.31.camel@localhost> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: Nishanth Aravamudan Cc: clameter@sgi.com, anton@samba.org, wli@holomorphy.com, linux-mm@kvack.org List-ID: On Tue, 2007-08-07 at 10:14 -0700, Nishanth Aravamudan wrote: > hugetlb: search harder for memory in alloc_fresh_huge_page() > > Currently, alloc_fresh_huge_page() returns NULL when it is not able to > allocate a huge page on the current node, as specified by its custom > interleave variable. The callers of this function, though, assume that a > failure in alloc_fresh_huge_page() indicates no hugepages can be > allocated on the system period. This might not be the case, for > instance, if we have an uneven NUMA system, and we happen to try to > allocate a hugepage on a node with less memory and fail, while there is > still plenty of free memory on the other nodes. > > To correct this, make alloc_fresh_huge_page() search through all online > nodes before deciding no hugepages can be allocated. Add a helper > function for actually allocating the hugepage. > > While there are interleave interfaces that could be exported from the > mempolicy layer, that seems like an inappropriate design decision. Work > is needed on a subsystem-level interleaving interface, but I'm still not > quite sure how that should look. Hence the custom interleaving here. > > Signed-off-by: Nishanth Aravamudan > > --- > I split up patch 1/5 into two bits, as they are really two logical > changes. Does this look better, Christoph? > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index d7ca59d..17a377e 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -101,36 +101,59 @@ static void free_huge_page(struct page *page) > spin_unlock(&hugetlb_lock); > } > > -static int alloc_fresh_huge_page(void) > +static struct page *alloc_fresh_huge_page_node(int nid) > { > - static int prev_nid; > struct page *page; > - int nid; > - > - /* > - * Copy static prev_nid to local nid, work on that, then copy it > - * back to prev_nid afterwards: otherwise there's a window in which > - * a racer might pass invalid nid MAX_NUMNODES to alloc_pages_node. > - * But we don't need to use a spin_lock here: it really doesn't > - * matter if occasionally a racer chooses the same nid as we do. > - */ > - nid = next_node(prev_nid, node_online_map); > - if (nid == MAX_NUMNODES) > - nid = first_node(node_online_map); > - prev_nid = nid; > > - page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN, > - HUGETLB_PAGE_ORDER); > + page = alloc_pages_node(nid, > + htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN, > + HUGETLB_PAGE_ORDER); > if (page) { > set_compound_page_dtor(page, free_huge_page); > spin_lock(&hugetlb_lock); > nr_huge_pages++; > - nr_huge_pages_node[page_to_nid(page)]++; > + nr_huge_pages_node[nid]++; Not that I don't trust __GFP_THISNODE, but may I suggest a "VM_BUG_ON(page_to_nid(page) != nid)" -- up above the spin_lock(), of course. Better yet, add the assertion and drop this one line change? This isn't a hot path, I think. > spin_unlock(&hugetlb_lock); > put_page(page); /* free it into the hugepage allocator */ > - return 1; > } > - return 0; > + > + return page; > +} > + > +static int alloc_fresh_huge_page(void) > +{ > + static int nid = -1; > + struct page *page; > + int start_nid; > + int next_nid; > + int ret = 0; > + > + if (nid < 0) > + nid = first_node(node_online_map); > + start_nid = nid; > + > + do { > + page = alloc_fresh_huge_page_node(nid); > + if (page) > + ret = 1; > + /* > + * Use a helper variable to find the next node and then > + * copy it back to nid nid afterwards: otherwise there's > + * a window in which a racer might pass invalid nid > + * MAX_NUMNODES to alloc_pages_node. But we don't need > + * to use a spin_lock here: it really doesn't matter if > + * occasionally a racer chooses the same nid as we do. > + * Move nid forward in the mask even if we just > + * successfully allocated a hugepage so that the next > + * caller gets hugepages on the next node. > + */ > + next_nid = next_node(nid, node_online_map); > + if (next_nid == MAX_NUMNODES) > + next_nid = first_node(node_online_map); > + nid = next_nid; > + } while (!page && nid != start_nid); > + > + return ret; > } > > static struct page *alloc_huge_page(struct vm_area_struct *vma, > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org