* [RFC][PATCH 1/2] hugetlb: search harder for memory in alloc_fresh_huge_page()
@ 2007-08-07 17:14 Nishanth Aravamudan
2007-08-07 17:15 ` [RFC][PATCH 2/2][V10] hugetlb: fix pool allocation with memoryless nodes Nishanth Aravamudan
2007-08-07 20:15 ` [RFC][PATCH 1/2] hugetlb: search harder for memory in alloc_fresh_huge_page() Lee Schermerhorn
0 siblings, 2 replies; 11+ messages in thread
From: Nishanth Aravamudan @ 2007-08-07 17:14 UTC (permalink / raw)
To: clameter; +Cc: anton, lee.schermerhorn, wli, linux-mm
Currently, alloc_fresh_huge_page() returns NULL when it is not able to
allocate a huge page on the current node, as specified by its custom
interleave variable. The callers of this function, though, assume that a
failure in alloc_fresh_huge_page() indicates no hugepages can be
allocated on the system, period. This might not be the case, for
instance, if we have an uneven NUMA system, and we happen to try to
allocate a hugepage on a node with less memory and fail, while there is
still plenty of free memory on the other nodes.
To correct this, make alloc_fresh_huge_page() search through all online
nodes before deciding no hugepages can be allocated. Add a helper
function for actually allocating the hugepage.
While there are interleave interfaces that could be exported from the
mempolicy layer, that seems like an inappropriate design decision. Work
is needed on a subsystem-level interleaving interface, but I'm still not
quite sure how that should look. Hence the custom interleaving here.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
I split up patch 1/5 into two bits, as they are really two logical
changes. Does this look better, Christoph?
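For reference, the mempolicy-layer round-robin alluded to above is
interleave_nodes() in mm/mempolicy.c, which looks roughly like this
(sketched from the current tree; it is static to mempolicy.c and bound
to a task's mempolicy, which is why it cannot simply be reused here):

/* Do dynamic interleaving for a process */
static unsigned interleave_nodes(struct mempolicy *policy)
{
	unsigned nid, next;
	struct task_struct *me = current;

	/* Per-task cursor walks the policy's nodemask, wrapping at the end */
	nid = me->il_next;
	next = next_node(nid, policy->v.nodes);
	if (next >= MAX_NUMNODES)
		next = first_node(policy->v.nodes);
	me->il_next = next;
	return nid;
}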
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d7ca59d..17a377e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -101,36 +101,59 @@ static void free_huge_page(struct page *page)
spin_unlock(&hugetlb_lock);
}
-static int alloc_fresh_huge_page(void)
+static struct page *alloc_fresh_huge_page_node(int nid)
{
- static int prev_nid;
struct page *page;
- int nid;
-
- /*
- * Copy static prev_nid to local nid, work on that, then copy it
- * back to prev_nid afterwards: otherwise there's a window in which
- * a racer might pass invalid nid MAX_NUMNODES to alloc_pages_node.
- * But we don't need to use a spin_lock here: it really doesn't
- * matter if occasionally a racer chooses the same nid as we do.
- */
- nid = next_node(prev_nid, node_online_map);
- if (nid == MAX_NUMNODES)
- nid = first_node(node_online_map);
- prev_nid = nid;
- page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
- HUGETLB_PAGE_ORDER);
+ page = alloc_pages_node(nid,
+ htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
+ HUGETLB_PAGE_ORDER);
if (page) {
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);
nr_huge_pages++;
- nr_huge_pages_node[page_to_nid(page)]++;
+ nr_huge_pages_node[nid]++;
spin_unlock(&hugetlb_lock);
put_page(page); /* free it into the hugepage allocator */
- return 1;
}
- return 0;
+
+ return page;
+}
+
+static int alloc_fresh_huge_page(void)
+{
+ static int nid = -1;
+ struct page *page;
+ int start_nid;
+ int next_nid;
+ int ret = 0;
+
+ if (nid < 0)
+ nid = first_node(node_online_map);
+ start_nid = nid;
+
+ do {
+ page = alloc_fresh_huge_page_node(nid);
+ if (page)
+ ret = 1;
+ /*
+ * Use a helper variable to find the next node and then
+ * copy it back to nid afterwards: otherwise there's
+ * a window in which a racer might pass invalid nid
+ * MAX_NUMNODES to alloc_pages_node. But we don't need
+ * to use a spin_lock here: it really doesn't matter if
+ * occasionally a racer chooses the same nid as we do.
+ * Move nid forward in the mask even if we just
+ * successfully allocated a hugepage so that the next
+ * caller gets hugepages on the next node.
+ */
+ next_nid = next_node(nid, node_online_map);
+ if (next_nid == MAX_NUMNODES)
+ next_nid = first_node(node_online_map);
+ nid = next_nid;
+ } while (!page && nid != start_nid);
+
+ return ret;
}
static struct page *alloc_huge_page(struct vm_area_struct *vma,
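For context, the caller behavior described above -- give up after the
first failure -- is visible in hugetlb_init(), roughly (sketched from
the current tree; details may differ):

static int __init hugetlb_init(void)
{
	unsigned long i;

	if (HPAGE_SHIFT == 0)
		return 0;

	for (i = 0; i < MAX_NUMNODES; ++i)
		INIT_LIST_HEAD(&hugepage_freelists[i]);

	/* One failed allocation ends the whole loop, even if other
	 * nodes still have plenty of free memory. */
	for (i = 0; i < max_huge_pages; ++i) {
		if (!alloc_fresh_huge_page())
			break;
	}
	max_huge_pages = free_huge_pages = nr_huge_pages = i;
	return 0;
}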
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* [RFC][PATCH 2/2][V10] hugetlb: fix pool allocation with memoryless nodes
2007-08-07 17:14 [RFC][PATCH 1/2] hugetlb: search harder for memory in alloc_fresh_huge_page() Nishanth Aravamudan
@ 2007-08-07 17:15 ` Nishanth Aravamudan
2007-08-07 20:15 ` [RFC][PATCH 1/2] hugetlb: search harder for memory in alloc_fresh_huge_page() Lee Schermerhorn
1 sibling, 0 replies; 11+ messages in thread
From: Nishanth Aravamudan @ 2007-08-07 17:15 UTC (permalink / raw)
To: clameter; +Cc: anton, lee.schermerhorn, wli, linux-mm
[V10] hugetlb: fix pool allocation with memoryless nodes
Anton found a problem with the hugetlb pool allocation when some nodes
have no memory (http://marc.info/?l=linux-mm&m=118133042025995&w=2). Lee
worked on versions that tried to fix it, but none were accepted.
Christoph has created a set of patches which allow GFP_THISNODE
allocations to fail if the node has no memory, and which export a
nodemask indicating which nodes have memory. Simply interleave across
this nodemask rather than the online nodemask.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 17a377e..1f872ca 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -129,7 +129,7 @@ static int alloc_fresh_huge_page(void)
int ret = 0;
if (nid < 0)
- nid = first_node(node_online_map);
+ nid = first_node(node_states[N_HIGH_MEMORY]);
start_nid = nid;
do {
@@ -147,9 +147,9 @@ static int alloc_fresh_huge_page(void)
* successfully allocated a hugepage so that the next
* caller gets hugepages on the next node.
*/
- next_nid = next_node(nid, node_online_map);
+ next_nid = next_node(nid, node_states[N_HIGH_MEMORY]);
if (next_nid == MAX_NUMNODES)
- next_nid = first_node(node_online_map);
+ next_nid = first_node(node_states[N_HIGH_MEMORY]);
nid = next_nid;
} while (!page && nid != start_nid);
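The node_states[] array used here comes from Christoph's
memoryless-node series, which also provides an iterator over nodes in a
given state. A quick sketch of the idea, assuming that series is
applied (names taken from it):

	int nid;

	/* Visit only nodes that actually have memory; online but
	 * memoryless nodes are skipped. */
	for_each_node_state(nid, N_HIGH_MEMORY)
		printk(KERN_DEBUG "node %d can back hugepages\n", nid);

The explicit next_node()/first_node() wrap-around above is kept instead
because the allocation loop needs to resume from a remembered position
rather than always starting at the first node.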
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [RFC][PATCH 1/2] hugetlb: search harder for memory in alloc_fresh_huge_page()
2007-08-07 17:14 [RFC][PATCH 1/2] hugetlb: search harder for memory in alloc_fresh_huge_page() Nishanth Aravamudan
2007-08-07 17:15 ` [RFC][PATCH 2/2][V10] hugetlb: fix pool allocation with memoryless nodes Nishanth Aravamudan
@ 2007-08-07 20:15 ` Lee Schermerhorn
2007-08-07 22:12 ` [RFC][PATCH 1/2][UPDATED] " Nishanth Aravamudan
1 sibling, 1 reply; 11+ messages in thread
From: Lee Schermerhorn @ 2007-08-07 20:15 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: clameter, anton, wli, linux-mm
On Tue, 2007-08-07 at 10:14 -0700, Nishanth Aravamudan wrote:
> hugetlb: search harder for memory in alloc_fresh_huge_page()
>
> Currently, alloc_fresh_huge_page() returns NULL when it is not able to
> allocate a huge page on the current node, as specified by its custom
> interleave variable. The callers of this function, though, assume that a
> failure in alloc_fresh_huge_page() indicates no hugepages can be
> allocated on the system, period. This might not be the case, for
> instance, if we have an uneven NUMA system, and we happen to try to
> allocate a hugepage on a node with less memory and fail, while there is
> still plenty of free memory on the other nodes.
>
> To correct this, make alloc_fresh_huge_page() search through all online
> nodes before deciding no hugepages can be allocated. Add a helper
> function for actually allocating the hugepage.
>
> While there are interleave interfaces that could be exported from the
> mempolicy layer, that seems like an inappropriate design decision. Work
> is needed on a subsystem-level interleaving interface, but I'm still not
> quite sure how that should look. Hence the custom interleaving here.
>
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
>
> ---
> I split up patch 1/5 into two bits, as they are really two logical
> changes. Does this look better, Christoph?
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d7ca59d..17a377e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -101,36 +101,59 @@ static void free_huge_page(struct page *page)
> spin_unlock(&hugetlb_lock);
> }
>
> -static int alloc_fresh_huge_page(void)
> +static struct page *alloc_fresh_huge_page_node(int nid)
> {
> - static int prev_nid;
> struct page *page;
> - int nid;
> -
> - /*
> - * Copy static prev_nid to local nid, work on that, then copy it
> - * back to prev_nid afterwards: otherwise there's a window in which
> - * a racer might pass invalid nid MAX_NUMNODES to alloc_pages_node.
> - * But we don't need to use a spin_lock here: it really doesn't
> - * matter if occasionally a racer chooses the same nid as we do.
> - */
> - nid = next_node(prev_nid, node_online_map);
> - if (nid == MAX_NUMNODES)
> - nid = first_node(node_online_map);
> - prev_nid = nid;
>
> - page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> - HUGETLB_PAGE_ORDER);
> + page = alloc_pages_node(nid,
> + htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
> + HUGETLB_PAGE_ORDER);
> if (page) {
> set_compound_page_dtor(page, free_huge_page);
> spin_lock(&hugetlb_lock);
> nr_huge_pages++;
> - nr_huge_pages_node[page_to_nid(page)]++;
> + nr_huge_pages_node[nid]++;
Not that I don't trust __GFP_THISNODE, but may I suggest a
"VM_BUG_ON(page_to_nid(page) != nid)" -- up above the spin_lock(), of
course. Better yet, add the assertion and drop this one-line change?
This isn't a hot path, I think.
> spin_unlock(&hugetlb_lock);
> put_page(page); /* free it into the hugepage allocator */
> - return 1;
> }
> - return 0;
> +
> + return page;
> +}
> +
> +static int alloc_fresh_huge_page(void)
> +{
> + static int nid = -1;
> + struct page *page;
> + int start_nid;
> + int next_nid;
> + int ret = 0;
> +
> + if (nid < 0)
> + nid = first_node(node_online_map);
> + start_nid = nid;
> +
> + do {
> + page = alloc_fresh_huge_page_node(nid);
> + if (page)
> + ret = 1;
> + /*
> + * Use a helper variable to find the next node and then
> + * copy it back to nid afterwards: otherwise there's
> + * a window in which a racer might pass invalid nid
> + * MAX_NUMNODES to alloc_pages_node. But we don't need
> + * to use a spin_lock here: it really doesn't matter if
> + * occasionally a racer chooses the same nid as we do.
> + * Move nid forward in the mask even if we just
> + * successfully allocated a hugepage so that the next
> + * caller gets hugepages on the next node.
> + */
> + next_nid = next_node(nid, node_online_map);
> + if (next_nid == MAX_NUMNODES)
> + next_nid = first_node(node_online_map);
> + nid = next_nid;
> + } while (!page && nid != start_nid);
> +
> + return ret;
> }
>
> static struct page *alloc_huge_page(struct vm_area_struct *vma,
>
* [RFC][PATCH 1/2][UPDATED] hugetlb: search harder for memory in alloc_fresh_huge_page()
2007-08-07 20:15 ` [RFC][PATCH 1/2] hugetlb: search harder for memory in alloc_fresh_huge_page() Lee Schermerhorn
@ 2007-08-07 22:12 ` Nishanth Aravamudan
2007-08-07 22:54 ` Christoph Lameter
2007-08-07 23:34 ` Nishanth Aravamudan
0 siblings, 2 replies; 11+ messages in thread
From: Nishanth Aravamudan @ 2007-08-07 22:12 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: clameter, anton, wli, linux-mm
On 07.08.2007 [16:15:22 -0400], Lee Schermerhorn wrote:
> On Tue, 2007-08-07 at 10:14 -0700, Nishanth Aravamudan wrote:
> > hugetlb: search harder for memory in alloc_fresh_huge_page()
> >
> > Currently, alloc_fresh_huge_page() returns NULL when it is not able to
> > allocate a huge page on the current node, as specified by its custom
> > interleave variable. The callers of this function, though, assume that a
> > failure in alloc_fresh_huge_page() indicates no hugepages can be
> > allocated on the system, period. This might not be the case, for
> > instance, if we have an uneven NUMA system, and we happen to try to
> > allocate a hugepage on a node with less memory and fail, while there is
> > still plenty of free memory on the other nodes.
> >
> > To correct this, make alloc_fresh_huge_page() search through all online
> > nodes before deciding no hugepages can be allocated. Add a helper
> > function for actually allocating the hugepage.
> >
> > While there are interleave interfaces that could be exported from the
> > mempolicy layer, that seems like an inappropriate design decision. Work
> > is needed on a subsystem-level interleaving interface, but I'm still not
> > quite sure how that should look. Hence the custom interleaving here.
> >
> > Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
> >
<snip>
> > - page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > - HUGETLB_PAGE_ORDER);
> > + page = alloc_pages_node(nid,
> > + htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
> > + HUGETLB_PAGE_ORDER);
> > if (page) {
> > set_compound_page_dtor(page, free_huge_page);
> > spin_lock(&hugetlb_lock);
> > nr_huge_pages++;
> > - nr_huge_pages_node[page_to_nid(page)]++;
> > + nr_huge_pages_node[nid]++;
>
> Not that I don't trust __GFP_THISNODE, but may I suggest a
> "VM_BUG_ON(page_to_nid(page) != nid)" -- up above the spin_lock(), of
> course. Better yet, add the assertion and drop this one-line change?
> This isn't a hot path, I think.
Hrm, I think if it's really a concern then the VM_BUG_ON should be in
alloc_pages_node() itself? Or somewhere lower level, I mean, it's a bug
everywhere, not just in hugetlb.c. And, more importantly, if
__GFP_THISNODE doesn't work, it pretty much defeats the purpose of my
sysfs attribute patch. Echo'ing a value for node 0 and getting hugepages
on node 1 would be bad :)
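Something along these lines in the allocator wrapper is what I mean --
a sketch only, with hypothetical check placement, based on the current
alloc_pages_node():

static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
						unsigned int order)
{
	struct page *page;

	if (unlikely(order >= MAX_ORDER))
		return NULL;

	/* Unknown node is current node */
	if (nid < 0)
		nid = numa_node_id();

	page = __alloc_pages(gfp_mask, order,
			NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
	/* Hypothetical: catch any __GFP_THISNODE allocation that strayed
	 * off the requested node, for all callers at once. */
	if (page && (gfp_mask & __GFP_THISNODE))
		VM_BUG_ON(page_to_nid(page) != nid);
	return page;
}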
But here's the patch respun, as requested:
hugetlb: search harder for memory in alloc_fresh_huge_page()
Currently, alloc_fresh_huge_page() returns NULL when it is not able to
allocate a huge page on the current node, as specified by its custom
interleave variable. The callers of this function, though, assume that a
failure in alloc_fresh_huge_page() indicates no hugepages can be
allocated on the system, period. This might not be the case, for
instance, if we have an uneven NUMA system, and we happen to try to
allocate a hugepage on a node with less memory and fail, while there is
still plenty of free memory on the other nodes.
To correct this, make alloc_fresh_huge_page() search through all online
nodes before deciding no hugepages can be allocated. Add a helper
function for actually allocating the hugepage. Also, since we expect
particular semantics for __GFP_THISNODE, which are newly enforced, add a
VM_BUG_ON when allocations occur off the requested node.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d7ca59d..e7b103d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -101,36 +101,60 @@ static void free_huge_page(struct page *page)
spin_unlock(&hugetlb_lock);
}
-static int alloc_fresh_huge_page(void)
+static struct page *alloc_fresh_huge_page_node(int nid)
{
- static int prev_nid;
struct page *page;
- int nid;
-
- /*
- * Copy static prev_nid to local nid, work on that, then copy it
- * back to prev_nid afterwards: otherwise there's a window in which
- * a racer might pass invalid nid MAX_NUMNODES to alloc_pages_node.
- * But we don't need to use a spin_lock here: it really doesn't
- * matter if occasionally a racer chooses the same nid as we do.
- */
- nid = next_node(prev_nid, node_online_map);
- if (nid == MAX_NUMNODES)
- nid = first_node(node_online_map);
- prev_nid = nid;
- page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
- HUGETLB_PAGE_ORDER);
+ page = alloc_pages_node(nid,
+ htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
+ HUGETLB_PAGE_ORDER);
if (page) {
+ VM_BUG_ON(nid != page_to_nid(page));
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);
nr_huge_pages++;
- nr_huge_pages_node[page_to_nid(page)]++;
+ nr_huge_pages_node[page_to_nid(nid)]++;
spin_unlock(&hugetlb_lock);
put_page(page); /* free it into the hugepage allocator */
- return 1;
}
- return 0;
+
+ return page;
+}
+
+static int alloc_fresh_huge_page(void)
+{
+ static int nid = -1;
+ struct page *page;
+ int start_nid;
+ int next_nid;
+ int ret = 0;
+
+ if (nid < 0)
+ nid = first_node(node_online_map);
+ start_nid = nid;
+
+ do {
+ page = alloc_fresh_huge_page_node(nid);
+ if (page)
+ ret = 1;
+ /*
+ * Use a helper variable to find the next node and then
+ * copy it back to nid afterwards: otherwise there's
+ * a window in which a racer might pass invalid nid
+ * MAX_NUMNODES to alloc_pages_node. But we don't need
+ * to use a spin_lock here: it really doesn't matter if
+ * occasionally a racer chooses the same nid as we do.
+ * Move nid forward in the mask even if we just
+ * successfully allocated a hugepage so that the next
+ * caller gets hugepages on the next node.
+ */
+ next_nid = next_node(nid, node_online_map);
+ if (next_nid == MAX_NUMNODES)
+ next_nid = first_node(node_online_map);
+ nid = next_nid;
+ } while (!page && nid != start_nid);
+
+ return ret;
}
static struct page *alloc_huge_page(struct vm_area_struct *vma,
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [RFC][PATCH 1/2][UPDATED] hugetlb: search harder for memory in alloc_fresh_huge_page()
2007-08-07 22:12 ` [RFC][PATCH 1/2][UPDATED] " Nishanth Aravamudan
@ 2007-08-07 22:54 ` Christoph Lameter
2007-08-07 23:02 ` Nishanth Aravamudan
2007-08-08 13:17 ` Lee Schermerhorn
2007-08-07 23:34 ` Nishanth Aravamudan
1 sibling, 2 replies; 11+ messages in thread
From: Christoph Lameter @ 2007-08-07 22:54 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, anton, wli, linux-mm
On Tue, 7 Aug 2007, Nishanth Aravamudan wrote:
> >
> > Not that I don't trust __GFP_THISNODE, but may I suggest a
> > "VM_BUG_ON(page_to_nid(page) != nid)" -- up above the spin_lock(), of
> > course. Better yet, add the assertion and drop this one-line change?
Don't do this change.
* Re: [RFC][PATCH 1/2][UPDATED] hugetlb: search harder for memory in alloc_fresh_huge_page()
2007-08-07 22:54 ` Christoph Lameter
@ 2007-08-07 23:02 ` Nishanth Aravamudan
2007-08-08 0:14 ` Christoph Lameter
2007-08-08 13:17 ` Lee Schermerhorn
1 sibling, 1 reply; 11+ messages in thread
From: Nishanth Aravamudan @ 2007-08-07 23:02 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, anton, wli, linux-mm
On 07.08.2007 [15:54:36 -0700], Christoph Lameter wrote:
> On Tue, 7 Aug 2007, Nishanth Aravamudan wrote:
>
> > >
> > > Not that I don't trust __GFP_THISNODE, but may I suggest a
> > > "VM_BUG_ON(page_to_nid(page) != nid)" -- up above the spin_lock(), of
> > > course. Better yet, add the assertion and drop this one-line change?
>
> Don't do this change.
Which change? Using nid without a VM_BUG_ON (as in the original patch)
or adding a VM_BUG_ON and using page_to_nid()?
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [RFC][PATCH 1/2][UPDATED] hugetlb: search harder for memory in alloc_fresh_huge_page()
2007-08-07 22:12 ` [RFC][PATCH 1/2][UPDATED] " Nishanth Aravamudan
2007-08-07 22:54 ` Christoph Lameter
@ 2007-08-07 23:34 ` Nishanth Aravamudan
1 sibling, 0 replies; 11+ messages in thread
From: Nishanth Aravamudan @ 2007-08-07 23:34 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: clameter, anton, wli, linux-mm
On 07.08.2007 [15:12:40 -0700], Nishanth Aravamudan wrote:
> On 07.08.2007 [16:15:22 -0400], Lee Schermerhorn wrote:
> > On Tue, 2007-08-07 at 10:14 -0700, Nishanth Aravamudan wrote:
> > > hugetlb: search harder for memory in alloc_fresh_huge_page()
> > >
> > > Currently, alloc_fresh_huge_page() returns NULL when it is not able to
> > > allocate a huge page on the current node, as specified by its custom
> > > interleave variable. The callers of this function, though, assume that a
> > > failure in alloc_fresh_huge_page() indicates no hugepages can be
> > > allocated on the system, period. This might not be the case, for
> > > instance, if we have an uneven NUMA system, and we happen to try to
> > > allocate a hugepage on a node with less memory and fail, while there is
> > > still plenty of free memory on the other nodes.
> > >
> > > To correct this, make alloc_fresh_huge_page() search through all online
> > > nodes before deciding no hugepages can be allocated. Add a helper
> > > function for actually allocating the hugepage.
> > >
> > > While there are interleave interfaces that could be exported from the
> > > mempolicy layer, that seems like an inappropriate design decision. Work
> > > is needed on a subsystem-level interleaving interface, but I'm still not
> > > quite sure how that should look. Hence the custom interleaving here.
> > >
> > > Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
> > >
>
> <snip>
>
> > > - page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> > > - HUGETLB_PAGE_ORDER);
> > > + page = alloc_pages_node(nid,
> > > + htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
> > > + HUGETLB_PAGE_ORDER);
> > > if (page) {
> > > set_compound_page_dtor(page, free_huge_page);
> > > spin_lock(&hugetlb_lock);
> > > nr_huge_pages++;
> > > - nr_huge_pages_node[page_to_nid(page)]++;
> > > + nr_huge_pages_node[nid]++;
> >
> > Not that I don't trust __GFP_THISNODE, but may I suggest a
> > "VM_BUG_ON(page_to_nid(page) != nid)" -- up above the spin_lock(), of
> > course. Better yet, add the assertion and drop this one-line change?
> > This isn't a hot path, I think.
>
> Hrm, I think if it's really a concern then the VM_BUG_ON should be in
> alloc_pages_node() itself? Or somewhere lower level, I mean, it's a bug
> everywhere, not just in hugetlb.c. And, more importantly, if
> __GFP_THISNODE doesn't work, it pretty much defeats the purpose of my
> sysfs attribute patch. Echo'ing a value for node 0 and getting hugepages
> on node 1 would be bad :)
>
> But here's the patch respun, as requested:
>
> hugetlb: search harder for memory in alloc_fresh_huge_page()
>
> Currently, alloc_fresh_huge_page() returns NULL when it is not able to
> allocate a huge page on the current node, as specified by its custom
> interleave variable. The callers of this function, though, assume that a
> failure in alloc_fresh_huge_page() indicates no hugepages can be
> allocated on the system, period. This might not be the case, for
> instance, if we have an uneven NUMA system, and we happen to try to
> allocate a hugepage on a node with less memory and fail, while there is
> still plenty of free memory on the other nodes.
>
> To correct this, make alloc_fresh_huge_page() search through all online
> nodes before deciding no hugepages can be allocated. Add a helper
> function for actually allocating the hugepage. Also, since we expect
> particular semantics for __GFP_THISNODE, which are newly enforced, add a
> VM_BUG_ON when allocations occur off the requested node.
>
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d7ca59d..e7b103d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -101,36 +101,60 @@ static void free_huge_page(struct page *page)
> spin_unlock(&hugetlb_lock);
> }
>
> -static int alloc_fresh_huge_page(void)
> +static struct page *alloc_fresh_huge_page_node(int nid)
> {
> - static int prev_nid;
> struct page *page;
> - int nid;
> -
> - /*
> - * Copy static prev_nid to local nid, work on that, then copy it
> - * back to prev_nid afterwards: otherwise there's a window in which
> - * a racer might pass invalid nid MAX_NUMNODES to alloc_pages_node.
> - * But we don't need to use a spin_lock here: it really doesn't
> - * matter if occasionally a racer chooses the same nid as we do.
> - */
> - nid = next_node(prev_nid, node_online_map);
> - if (nid == MAX_NUMNODES)
> - nid = first_node(node_online_map);
> - prev_nid = nid;
>
> - page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> - HUGETLB_PAGE_ORDER);
> + page = alloc_pages_node(nid,
> + htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
> + HUGETLB_PAGE_ORDER);
> if (page) {
> + VM_BUG_ON(nid != page_to_nid(page));
> set_compound_page_dtor(page, free_huge_page);
> spin_lock(&hugetlb_lock);
> nr_huge_pages++;
> - nr_huge_pages_node[page_to_nid(page)]++;
> + nr_huge_pages_node[page_to_nid(nid)]++;
Sigh, should check the patch before sending :)
This bit should not be there... Still waiting to hear from Christoph as
to which way he wants it to look, but here is the correct patch for this
version:
hugetlb: search harder for memory in alloc_fresh_huge_page()
Currently, alloc_fresh_huge_page() returns NULL when it is not able to
allocate a huge page on the current node, as specified by its custom
interleave variable. The callers of this function, though, assume that a
failure in alloc_fresh_huge_page() indicates no hugepages can be
allocated on the system, period. This might not be the case, for
instance, if we have an uneven NUMA system, and we happen to try to
allocate a hugepage on a node with less memory and fail, while there is
still plenty of free memory on the other nodes.
To correct this, make alloc_fresh_huge_page() search through all online
nodes before deciding no hugepages can be allocated. Add a helper
function for actually allocating the hugepage. Also, since we expect
particular semantics for __GFP_THISNODE, which are newly enforced, add a
VM_BUG_ON when allocations occur off the requested node.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d7ca59d..83c8026 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -101,36 +101,60 @@ static void free_huge_page(struct page *page)
spin_unlock(&hugetlb_lock);
}
-static int alloc_fresh_huge_page(void)
+static struct page *alloc_fresh_huge_page_node(int nid)
{
- static int prev_nid;
struct page *page;
- int nid;
-
- /*
- * Copy static prev_nid to local nid, work on that, then copy it
- * back to prev_nid afterwards: otherwise there's a window in which
- * a racer might pass invalid nid MAX_NUMNODES to alloc_pages_node.
- * But we don't need to use a spin_lock here: it really doesn't
- * matter if occasionally a racer chooses the same nid as we do.
- */
- nid = next_node(prev_nid, node_online_map);
- if (nid == MAX_NUMNODES)
- nid = first_node(node_online_map);
- prev_nid = nid;
- page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
- HUGETLB_PAGE_ORDER);
+ page = alloc_pages_node(nid,
+ htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
+ HUGETLB_PAGE_ORDER);
if (page) {
+ VM_BUG_ON(nid != page_to_nid(page));
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);
nr_huge_pages++;
nr_huge_pages_node[page_to_nid(page)]++;
spin_unlock(&hugetlb_lock);
put_page(page); /* free it into the hugepage allocator */
- return 1;
}
- return 0;
+
+ return page;
+}
+
+static int alloc_fresh_huge_page(void)
+{
+ static int nid = -1;
+ struct page *page;
+ int start_nid;
+ int next_nid;
+ int ret = 0;
+
+ if (nid < 0)
+ nid = first_node(node_online_map);
+ start_nid = nid;
+
+ do {
+ page = alloc_fresh_huge_page_node(nid);
+ if (page)
+ ret = 1;
+ /*
+ * Use a helper variable to find the next node and then
+ * copy it back to nid afterwards: otherwise there's
+ * a window in which a racer might pass invalid nid
+ * MAX_NUMNODES to alloc_pages_node. But we don't need
+ * to use a spin_lock here: it really doesn't matter if
+ * occasionally a racer chooses the same nid as we do.
+ * Move nid forward in the mask even if we just
+ * successfully allocated a hugepage so that the next
+ * caller gets hugepages on the next node.
+ */
+ next_nid = next_node(nid, node_online_map);
+ if (next_nid == MAX_NUMNODES)
+ next_nid = first_node(node_online_map);
+ nid = next_nid;
+ } while (!page && nid != start_nid);
+
+ return ret;
}
static struct page *alloc_huge_page(struct vm_area_struct *vma,
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [RFC][PATCH 1/2][UPDATED] hugetlb: search harder for memory in alloc_fresh_huge_page()
2007-08-07 23:02 ` Nishanth Aravamudan
@ 2007-08-08 0:14 ` Christoph Lameter
2007-08-08 1:32 ` Nishanth Aravamudan
0 siblings, 1 reply; 11+ messages in thread
From: Christoph Lameter @ 2007-08-08 0:14 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: Lee Schermerhorn, anton, wli, linux-mm
On Tue, 7 Aug 2007, Nishanth Aravamudan wrote:
> Which change? Using nid without a VM_BUG_ON (as in the original patch)
> or adding a VM_BUG_ON and using page_to_nid()?
Adding VM_BUG_ON. If page_alloc does not work then something basic is
broken.
* Re: [RFC][PATCH 1/2][UPDATED] hugetlb: search harder for memory in alloc_fresh_huge_page()
2007-08-08 0:14 ` Christoph Lameter
@ 2007-08-08 1:32 ` Nishanth Aravamudan
2007-08-08 13:20 ` Lee Schermerhorn
0 siblings, 1 reply; 11+ messages in thread
From: Nishanth Aravamudan @ 2007-08-08 1:32 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, anton, wli, linux-mm
On 07.08.2007 [17:14:31 -0700], Christoph Lameter wrote:
> On Tue, 7 Aug 2007, Nishanth Aravamudan wrote:
>
> > Which change? Using nid without a VM_BUG_ON (as in the original patch)
> > or adding a VM_BUG_ON and using page_to_nid()?
>
> Adding VM_BUG_ON. If page_alloc does not work then something basic is
> broken.
I agree. So perhaps there needs to be a VM_BUG_ON_ONCE() or something
somewhere in the core code for the case of a __GFP_THISNODE allocation
going off node?
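Something like this, say (purely a sketch -- since VM_BUG_ON() panics
outright, a one-shot variant probably wants to be a warning, along the
lines of the existing WARN_ON_ONCE()):

	/* Hypothetical check in the allocator core: complain the first
	 * time a __GFP_THISNODE allocation comes back off-node. */
	if (page && (gfp_mask & __GFP_THISNODE))
		WARN_ON_ONCE(page_to_nid(page) != nid);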
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [RFC][PATCH 1/2][UPDATED] hugetlb: search harder for memory in alloc_fresh_huge_page()
2007-08-07 22:54 ` Christoph Lameter
2007-08-07 23:02 ` Nishanth Aravamudan
@ 2007-08-08 13:17 ` Lee Schermerhorn
1 sibling, 0 replies; 11+ messages in thread
From: Lee Schermerhorn @ 2007-08-08 13:17 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Nishanth Aravamudan, anton, wli, linux-mm
On Tue, 2007-08-07 at 15:54 -0700, Christoph Lameter wrote:
> On Tue, 7 Aug 2007, Nishanth Aravamudan wrote:
>
> > >
> > > Not that I don't trust __GFP_THISNODE, but may I suggest a
> > > "VM_BUG_ON(page_to_nid(page) != nid)" -- up above the spin_lock(), of
> > > course. Better yet, add the assertion and drop this one-line change?
>
> Don't do this change.
[being equally terse] Why?
* Re: [RFC][PATCH 1/2][UPDATED] hugetlb: search harder for memory in alloc_fresh_huge_page()
2007-08-08 1:32 ` Nishanth Aravamudan
@ 2007-08-08 13:20 ` Lee Schermerhorn
0 siblings, 0 replies; 11+ messages in thread
From: Lee Schermerhorn @ 2007-08-08 13:20 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: Christoph Lameter, anton, wli, linux-mm
On Tue, 2007-08-07 at 18:32 -0700, Nishanth Aravamudan wrote:
> On 07.08.2007 [17:14:31 -0700], Christoph Lameter wrote:
> > On Tue, 7 Aug 2007, Nishanth Aravamudan wrote:
> >
> > > Which change? Using nid without a VM_BUG_ON (as in the original patch)
> > > or adding a VM_BUG_ON and using page_to_nid()?
> >
> > Adding VM_BUG_ON. If page_alloc does not work then something basic is
> > broken.
>
> I agree. So perhaps there needs to be a VM_BUG_ON_ONCE() or something
> somewhere in the core code for the case of a __GFP_THISNODE allocation
> going off node?
That would work for me. But, I would like to see us use the existing
page_to_nid() for the accounting. I like to avoid silent corruption of
the accounting. Things will work--for a while anyway--if pages get
returned on the wrong node and the accounting gets messed up. But, it
can result in non-obvious problems some time later. I hate it when that
happens :-).
Lee