linux-mm.kvack.org archive mirror
* [PATCH 0/3] Balance Freeing of Huge Pages across Nodes
@ 2009-06-29 21:52 Lee Schermerhorn
  2009-06-29 21:52 ` [PATCH 1/3] " Lee Schermerhorn
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Lee Schermerhorn @ 2009-06-29 21:52 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

[PATCH] 0/3 Balance Freeing of Huge Pages across Nodes

This series contains V3 of the "Balance Freeing of Huge
Pages across Nodes" patch--containing a minor cleanup from V2--
and two additional, related patches.  I have added David Rientjes'
ACK from V2, hoping that the change to v3 doesn't invalidate that.

Patch 2/3 reworks the free_pool_huge_page() function so that it
may also be used by return_unused_surplus_pages().  This patch
needs careful review [and testing?].  Perhaps Mel Gorman can
give it a go with the hugepages regression tests.

Patch 3/3 updates the vm hugetlbpage documentation to clarify
usage and to describe how the freeing of huge pages is balanced
across nodes.  Most of the update is from my earlier "huge pages
nodes_allowed" patch series, without mention of the nodes_allowed
mask and its associated boot parameter, sysctl and attributes.


Lee


* [PATCH 1/3] Balance Freeing of Huge Pages across Nodes
  2009-06-29 21:52 [PATCH 0/3] Balance Freeing of Huge Pages across Nodes Lee Schermerhorn
@ 2009-06-29 21:52 ` Lee Schermerhorn
  2009-06-30 13:05   ` Mel Gorman
  2009-06-29 21:52 ` [PATCH 2/3] Use free_pool_huge_page() to return unused surplus pages Lee Schermerhorn
  2009-06-29 21:52 ` [PATCH 3/3] Cleanup and update huge pages documentation Lee Schermerhorn
  2 siblings, 1 reply; 7+ messages in thread
From: Lee Schermerhorn @ 2009-06-29 21:52 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

[PATCH] 1/3 Balance Freeing of Huge Pages across Nodes

Against:  25jun09 mmotm

Free huge pages from nodes in round-robin fashion in an
attempt to keep [persistent, a.k.a. static] huge pages balanced
across nodes.

New function free_pool_huge_page() is modeled on and
performs roughly the inverse of alloc_fresh_huge_page().
It replaces dequeue_huge_page(), which is left with no
callers, so this patch removes it.

Helper function hstate_next_node_to_free() uses the new hstate
member next_nid_to_free to distribute "frees" across all
nodes with huge pages.
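
For illustration only, here is a stand-alone user-space sketch of the
round-robin cursor behaviour.  The node mask, the nodemask helpers and
the hstate are all simulated; the real code is in the mm/hugetlb.c
hunk below.

/*
 * Simulated round-robin "next node to free" cursor.  MAX_NUMNODES,
 * the on-line node array and struct hstate_sim are stand-ins for the
 * kernel's node_online_map and struct hstate.
 */
#include <stdio.h>

#define MAX_NUMNODES 8

static const int node_online[MAX_NUMNODES] = { 1, 1, 0, 1, 0, 0, 1, 0 };

struct hstate_sim {
	int next_nid_to_free;
};

/* next on-line node after 'nid', or MAX_NUMNODES if there is none */
static int sim_next_node(int nid)
{
	for (nid++; nid < MAX_NUMNODES; nid++)
		if (node_online[nid])
			return nid;
	return MAX_NUMNODES;
}

/* first on-line node */
static int sim_first_node(void)
{
	return sim_next_node(-1);
}

/* advance the free cursor, wrapping back to the first on-line node */
static int sim_hstate_next_node_to_free(struct hstate_sim *h)
{
	int next_nid = sim_next_node(h->next_nid_to_free);

	if (next_nid == MAX_NUMNODES)
		next_nid = sim_first_node();
	h->next_nid_to_free = next_nid;
	return next_nid;
}

int main(void)
{
	struct hstate_sim h = { .next_nid_to_free = sim_first_node() };
	int i;

	/* successive frees visit the on-line nodes 1, 3, 6, 0, 1, 3, ... */
	for (i = 0; i < 8; i++)
		printf("free a huge page from node %d\n",
		       sim_hstate_next_node_to_free(&h));
	return 0;
}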

V2:

At Mel Gorman's suggestion:  renamed hstate_next_node() to
hstate_next_node_to_alloc() for symmetry.  Also, renamed
hstate member hugetlb_next_nid to next_nid_to_free.
["hugetlb" is implicit in the hstate struct, I think].

New in this version:

Modified adjust_pool_surplus() to use hstate_next_node_to_alloc()
and hstate_next_node_to_free() to advance the node id when adjusting
the surplus huge page count, as this is equivalent to allocating and
freeing persistent huge pages.  [Can't blame Mel for this part.]
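
In the hunk below, the adjustment direction maps onto the two cursors
roughly as follows (condensed sketch, not the complete function):

	if (delta < 0)		/* shrinking surplus ~ allocating persistent */
		start_nid = h->next_nid_to_alloc;
	else			/* growing surplus ~ freeing persistent */
		start_nid = h->next_nid_to_free;
	next_nid = start_nid;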

V3:

Minor cleanup: rename 'nid' to 'next_nid' in free_pool_huge_page() to
better match alloc_fresh_huge_page() conventions.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/hugetlb.h |    3 -
 mm/hugetlb.c            |  132 +++++++++++++++++++++++++++++++-----------------
 2 files changed, 88 insertions(+), 47 deletions(-)

Index: linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h
===================================================================
--- linux-2.6.31-rc1-mmotm-090625-1549.orig/include/linux/hugetlb.h	2009-06-29 10:21:12.000000000 -0400
+++ linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h	2009-06-29 10:27:18.000000000 -0400
@@ -183,7 +183,8 @@ unsigned long hugetlb_get_unmapped_area(
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
-	int hugetlb_next_nid;
+	int next_nid_to_alloc;
+	int next_nid_to_free;
 	unsigned int order;
 	unsigned long mask;
 	unsigned long max_huge_pages;
Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c	2009-06-29 10:21:12.000000000 -0400
+++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c	2009-06-29 15:53:55.000000000 -0400
@@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst
 	h->free_huge_pages_node[nid]++;
 }
 
-static struct page *dequeue_huge_page(struct hstate *h)
-{
-	int nid;
-	struct page *page = NULL;
-
-	for (nid = 0; nid < MAX_NUMNODES; ++nid) {
-		if (!list_empty(&h->hugepage_freelists[nid])) {
-			page = list_entry(h->hugepage_freelists[nid].next,
-					  struct page, lru);
-			list_del(&page->lru);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[nid]--;
-			break;
-		}
-	}
-	return page;
-}
-
 static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
@@ -640,7 +622,7 @@ static struct page *alloc_fresh_huge_pag
 
 /*
  * Use a helper variable to find the next node and then
- * copy it back to hugetlb_next_nid afterwards:
+ * copy it back to next_nid_to_alloc afterwards:
  * otherwise there's a window in which a racer might
  * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
  * But we don't need to use a spin_lock here: it really
@@ -649,13 +631,13 @@ static struct page *alloc_fresh_huge_pag
  * if we just successfully allocated a hugepage so that
  * the next caller gets hugepages on the next node.
  */
-static int hstate_next_node(struct hstate *h)
+static int hstate_next_node_to_alloc(struct hstate *h)
 {
 	int next_nid;
-	next_nid = next_node(h->hugetlb_next_nid, node_online_map);
+	next_nid = next_node(h->next_nid_to_alloc, node_online_map);
 	if (next_nid == MAX_NUMNODES)
 		next_nid = first_node(node_online_map);
-	h->hugetlb_next_nid = next_nid;
+	h->next_nid_to_alloc = next_nid;
 	return next_nid;
 }
 
@@ -666,14 +648,15 @@ static int alloc_fresh_huge_page(struct 
 	int next_nid;
 	int ret = 0;
 
-	start_nid = h->hugetlb_next_nid;
+	start_nid = h->next_nid_to_alloc;
+	next_nid = start_nid;
 
 	do {
-		page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
+		page = alloc_fresh_huge_page_node(h, next_nid);
 		if (page)
 			ret = 1;
-		next_nid = hstate_next_node(h);
-	} while (!page && h->hugetlb_next_nid != start_nid);
+		next_nid = hstate_next_node_to_alloc(h);
+	} while (!page && next_nid != start_nid);
 
 	if (ret)
 		count_vm_event(HTLB_BUDDY_PGALLOC);
@@ -683,6 +666,52 @@ static int alloc_fresh_huge_page(struct 
 	return ret;
 }
 
+/*
+ * helper for free_pool_huge_page() - find next node
+ * from which to free a huge page
+ */
+static int hstate_next_node_to_free(struct hstate *h)
+{
+	int next_nid;
+	next_nid = next_node(h->next_nid_to_free, node_online_map);
+	if (next_nid == MAX_NUMNODES)
+		next_nid = first_node(node_online_map);
+	h->next_nid_to_free = next_nid;
+	return next_nid;
+}
+
+/*
+ * Free huge page from pool from next node to free.
+ * Attempt to keep persistent huge pages more or less
+ * balanced over allowed nodes.
+ * Called with hugetlb_lock locked.
+ */
+static int free_pool_huge_page(struct hstate *h)
+{
+	int start_nid;
+	int next_nid;
+	int ret = 0;
+
+	start_nid = h->next_nid_to_free;
+	next_nid = start_nid;
+
+	do {
+		if (!list_empty(&h->hugepage_freelists[next_nid])) {
+			struct page *page =
+				list_entry(h->hugepage_freelists[next_nid].next,
+					  struct page, lru);
+			list_del(&page->lru);
+			h->free_huge_pages--;
+			h->free_huge_pages_node[next_nid]--;
+			update_and_free_page(h, page);
+			ret = 1;
+		}
+		next_nid = hstate_next_node_to_free(h);
+	} while (!ret && next_nid != start_nid);
+
+	return ret;
+}
+
 static struct page *alloc_buddy_huge_page(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long address)
 {
@@ -1007,7 +1036,7 @@ int __weak alloc_bootmem_huge_page(struc
 		void *addr;
 
 		addr = __alloc_bootmem_node_nopanic(
-				NODE_DATA(h->hugetlb_next_nid),
+				NODE_DATA(h->next_nid_to_alloc),
 				huge_page_size(h), huge_page_size(h), 0);
 
 		if (addr) {
@@ -1019,7 +1048,7 @@ int __weak alloc_bootmem_huge_page(struc
 			m = addr;
 			goto found;
 		}
-		hstate_next_node(h);
+		hstate_next_node_to_alloc(h);
 		nr_nodes--;
 	}
 	return 0;
@@ -1140,31 +1169,43 @@ static inline void try_to_free_low(struc
  */
 static int adjust_pool_surplus(struct hstate *h, int delta)
 {
-	static int prev_nid;
-	int nid = prev_nid;
+	int start_nid, next_nid;
 	int ret = 0;
 
 	VM_BUG_ON(delta != -1 && delta != 1);
-	do {
-		nid = next_node(nid, node_online_map);
-		if (nid == MAX_NUMNODES)
-			nid = first_node(node_online_map);
 
-		/* To shrink on this node, there must be a surplus page */
-		if (delta < 0 && !h->surplus_huge_pages_node[nid])
-			continue;
-		/* Surplus cannot exceed the total number of pages */
-		if (delta > 0 && h->surplus_huge_pages_node[nid] >=
+	if (delta < 0)
+		start_nid = h->next_nid_to_alloc;
+	else
+		start_nid = h->next_nid_to_free;
+	next_nid = start_nid;
+
+	do {
+		int nid = next_nid;
+		if (delta < 0)  {
+			next_nid = hstate_next_node_to_alloc(h);
+			/*
+			 * To shrink on this node, there must be a surplus page
+			 */
+			if (!h->surplus_huge_pages_node[nid])
+				continue;
+		}
+		if (delta > 0) {
+			next_nid = hstate_next_node_to_free(h);
+			/*
+			 * Surplus cannot exceed the total number of pages
+			 */
+			if (h->surplus_huge_pages_node[nid] >=
 						h->nr_huge_pages_node[nid])
-			continue;
+				continue;
+		}
 
 		h->surplus_huge_pages += delta;
 		h->surplus_huge_pages_node[nid] += delta;
 		ret = 1;
 		break;
-	} while (nid != prev_nid);
+	} while (next_nid != start_nid);
 
-	prev_nid = nid;
 	return ret;
 }
 
@@ -1226,10 +1267,8 @@ static unsigned long set_max_huge_pages(
 	min_count = max(count, min_count);
 	try_to_free_low(h, min_count);
 	while (min_count < persistent_huge_pages(h)) {
-		struct page *page = dequeue_huge_page(h);
-		if (!page)
+		if (!free_pool_huge_page(h))
 			break;
-		update_and_free_page(h, page);
 	}
 	while (count < persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, 1))
@@ -1441,7 +1480,8 @@ void __init hugetlb_add_hstate(unsigned 
 	h->free_huge_pages = 0;
 	for (i = 0; i < MAX_NUMNODES; ++i)
 		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
-	h->hugetlb_next_nid = first_node(node_online_map);
+	h->next_nid_to_alloc = first_node(node_online_map);
+	h->next_nid_to_free = first_node(node_online_map);
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/1024);
 


* [PATCH 2/3] Use free_pool_huge_page() to return unused surplus pages
  2009-06-29 21:52 [PATCH 0/3] Balance Freeing of Huge Pages across Nodes Lee Schermerhorn
  2009-06-29 21:52 ` [PATCH 1/3] " Lee Schermerhorn
@ 2009-06-29 21:52 ` Lee Schermerhorn
  2009-06-29 21:52 ` [PATCH 3/3] Cleanup and update huge pages documentation Lee Schermerhorn
  2 siblings, 0 replies; 7+ messages in thread
From: Lee Schermerhorn @ 2009-06-29 21:52 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

PATCH 2/3 - Use free_pool_huge_page() for return_unused_surplus_pages()

Against:  25jun09 mmotm

Use the [modified] free_pool_huge_page() function to return unused
surplus pages.  This helps keep huge pages balanced across nodes,
because freeing of unused surplus pages and freeing of persistent huge
pages [from set_max_huge_pages] now share the same node id "cursor".  It
also eliminates some code duplication.
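
As a quick sketch of the resulting call sites (condensed from the diff
below), both paths now drive the same free cursor and differ only in the
surplus-accounting flag:

	/* set_max_huge_pages(): shrink the persistent pool */
	while (min_count < persistent_huge_pages(h)) {
		if (!free_pool_huge_page(h, 0))
			break;
	}

	/* return_unused_surplus_pages(): skip nodes with no surplus and
	 * adjust the surplus counters as well */
	while (nr_pages--) {
		if (!free_pool_huge_page(h, 1))
			break;
	}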

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/hugetlb.c |   57 +++++++++++++++++++++++++--------------------------------
 1 file changed, 25 insertions(+), 32 deletions(-)

Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c
===================================================================
--- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c	2009-06-29 15:53:55.000000000 -0400
+++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c	2009-06-29 16:52:45.000000000 -0400
@@ -686,7 +686,7 @@ static int hstate_next_node_to_free(stru
  * balanced over allowed nodes.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h)
+static int free_pool_huge_page(struct hstate *h, bool acct_surplus)
 {
 	int start_nid;
 	int next_nid;
@@ -696,6 +696,13 @@ static int free_pool_huge_page(struct hs
 	next_nid = start_nid;
 
 	do {
+		/*
+		 * If we're returning unused surplus pages, skip nodes
+		 * with no surplus.
+		 */
+		if (acct_surplus && !h->surplus_huge_pages_node[next_nid])
+			continue;
+
 		if (!list_empty(&h->hugepage_freelists[next_nid])) {
 			struct page *page =
 				list_entry(h->hugepage_freelists[next_nid].next,
@@ -703,6 +710,10 @@ static int free_pool_huge_page(struct hs
 			list_del(&page->lru);
 			h->free_huge_pages--;
 			h->free_huge_pages_node[next_nid]--;
+			if (acct_surplus) {
+				h->surplus_huge_pages--;
+				h->surplus_huge_pages_node[next_nid]--;
+			}
 			update_and_free_page(h, page);
 			ret = 1;
 		}
@@ -883,22 +894,13 @@ free:
  * When releasing a hugetlb pool reservation, any surplus pages that were
  * allocated to satisfy the reservation must be explicitly freed if they were
  * never used.
+ * Called with hugetlb_lock held.
  */
 static void return_unused_surplus_pages(struct hstate *h,
 					unsigned long unused_resv_pages)
 {
-	static int nid = -1;
-	struct page *page;
 	unsigned long nr_pages;
 
-	/*
-	 * We want to release as many surplus pages as possible, spread
-	 * evenly across all nodes. Iterate across all nodes until we
-	 * can no longer free unreserved surplus pages. This occurs when
-	 * the nodes with surplus pages have no free pages.
-	 */
-	unsigned long remaining_iterations = nr_online_nodes;
-
 	/* Uncommit the reservation */
 	h->resv_huge_pages -= unused_resv_pages;
 
@@ -908,26 +910,17 @@ static void return_unused_surplus_pages(
 
 	nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
 
-	while (remaining_iterations-- && nr_pages) {
-		nid = next_node(nid, node_online_map);
-		if (nid == MAX_NUMNODES)
-			nid = first_node(node_online_map);
-
-		if (!h->surplus_huge_pages_node[nid])
-			continue;
-
-		if (!list_empty(&h->hugepage_freelists[nid])) {
-			page = list_entry(h->hugepage_freelists[nid].next,
-					  struct page, lru);
-			list_del(&page->lru);
-			update_and_free_page(h, page);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[nid]--;
-			h->surplus_huge_pages--;
-			h->surplus_huge_pages_node[nid]--;
-			nr_pages--;
-			remaining_iterations = nr_online_nodes;
-		}
+	/*
+	 * We want to release as many surplus pages as possible, spread
+	 * evenly across all nodes. Iterate across all nodes until we
+	 * can no longer free unreserved surplus pages. This occurs when
+	 * the nodes with surplus pages have no free pages.
+	 * free_pool_huge_page() will balance the frees across the
+	 * on-line nodes for us and will handle the hstate accounting.
+	 */
+	while (nr_pages--) {
+		if (!free_pool_huge_page(h, 1))
+			break;
 	}
 }
 
@@ -1267,7 +1260,7 @@ static unsigned long set_max_huge_pages(
 	min_count = max(count, min_count);
 	try_to_free_low(h, min_count);
 	while (min_count < persistent_huge_pages(h)) {
-		if (!free_pool_huge_page(h))
+		if (!free_pool_huge_page(h, 0))
 			break;
 	}
 	while (count < persistent_huge_pages(h)) {


* [PATCH 3/3] Cleanup and update huge pages documentation
  2009-06-29 21:52 [PATCH 0/3] Balance Freeing of Huge Pages across Nodes Lee Schermerhorn
  2009-06-29 21:52 ` [PATCH 1/3] " Lee Schermerhorn
  2009-06-29 21:52 ` [PATCH 2/3] Use free_pool_huge_page() to return unused surplus pages Lee Schermerhorn
@ 2009-06-29 21:52 ` Lee Schermerhorn
  2 siblings, 0 replies; 7+ messages in thread
From: Lee Schermerhorn @ 2009-06-29 21:52 UTC (permalink / raw)
  To: linux-mm, linux-numa
  Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

PATCH 3/3 cleanup and update huge pages documentation.

Against:  25jun09 mmotm

This patch attempts to clarify huge page administration and usage,
and updates the documentation to mention the balancing of huge pages
across nodes when allocating and freeing.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/hugetlbpage.txt |  133 +++++++++++++++++++++++++--------------
 1 file changed, 87 insertions(+), 46 deletions(-)

Index: linux-2.6.31-rc1-mmotm-090625-1549/Documentation/vm/hugetlbpage.txt
===================================================================
--- linux-2.6.31-rc1-mmotm-090625-1549.orig/Documentation/vm/hugetlbpage.txt	2009-06-29 12:19:02.000000000 -0400
+++ linux-2.6.31-rc1-mmotm-090625-1549/Documentation/vm/hugetlbpage.txt	2009-06-29 17:29:48.000000000 -0400
@@ -18,13 +18,13 @@ First the Linux kernel needs to be built
 automatically when CONFIG_HUGETLBFS is selected) configuration
 options.
 
-The kernel built with hugepage support should show the number of configured
-hugepages in the system by running the "cat /proc/meminfo" command.
+The kernel built with huge page support should show the number of configured
+huge pages in the system by running the "cat /proc/meminfo" command.
 
 /proc/meminfo also provides information about the total number of hugetlb
 pages configured in the kernel.  It also displays information about the
 number of free hugetlb pages at any time.  It also displays information about
-the configured hugepage size - this is needed for generating the proper
+the configured huge page size - this is needed for generating the proper
 alignment and size of the arguments to the above system calls.
 
 The output of "cat /proc/meminfo" will have lines like:
@@ -37,25 +37,27 @@ HugePages_Surp:  yyy
 Hugepagesize:    zzz kB
 
 where:
-HugePages_Total is the size of the pool of hugepages.
-HugePages_Free is the number of hugepages in the pool that are not yet
-allocated.
-HugePages_Rsvd is short for "reserved," and is the number of hugepages
-for which a commitment to allocate from the pool has been made, but no
-allocation has yet been made. It's vaguely analogous to overcommit.
-HugePages_Surp is short for "surplus," and is the number of hugepages in
-the pool above the value in /proc/sys/vm/nr_hugepages. The maximum
-number of surplus hugepages is controlled by
-/proc/sys/vm/nr_overcommit_hugepages.
+HugePages_Total is the size of the pool of huge pages.
+HugePages_Free  is the number of huge pages in the pool that are not yet
+                allocated.
+HugePages_Rsvd  is short for "reserved," and is the number of huge pages for
+                which a commitment to allocate from the pool has been made,
+                but no allocation has yet been made.  Reserved huge pages
+                guarantee that an application will be able to allocate a
+                huge page from the pool of huge pages at fault time.
+HugePages_Surp  is short for "surplus," and is the number of huge pages in
+                the pool above the value in /proc/sys/vm/nr_hugepages. The
+                maximum number of surplus huge pages is controlled by
+                /proc/sys/vm/nr_overcommit_hugepages.
 
 /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
 in the kernel.
 
 /proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
 pages in the kernel.  Super user can dynamically request more (or free some
-pre-configured) hugepages.
+pre-configured) huge pages.
 The allocation (or deallocation) of hugetlb pages is possible only if there are
-enough physically contiguous free pages in system (freeing of hugepages is
+enough physically contiguous free pages in system (freeing of huge pages is
 possible only if there are enough hugetlb pages free that can be transferred
 back to regular memory pool).
 
@@ -67,43 +69,82 @@ use either the mmap system call or share
 the huge pages.  It is required that the system administrator preallocate
 enough memory for huge page purposes.
 
-Use the following command to dynamically allocate/deallocate hugepages:
+The administrator can preallocate huge pages on the kernel boot command line by
+specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
+requested.  This is the most reliable method for preallocating huge pages as
+memory has not yet become fragmented.
+
+Some platforms support multiple huge page sizes.  To preallocate huge pages
+of a specific size, one must precede the huge pages boot command parameters
+with a huge page size selection parameter "hugepagesz=<size>".  <size> must
+be specified in bytes with optional scale suffix [kKmMgG].  The default huge
+page size may be selected with the "default_hugepagesz=<size>" boot parameter.
+
+/proc/sys/vm/nr_hugepages indicates the current number of configured [default
+size] hugetlb pages in the kernel.  Super user can dynamically request more
+(or free some pre-configured) huge pages.
+
+Use the following command to dynamically allocate/deallocate default sized
+huge pages:
 
 	echo 20 > /proc/sys/vm/nr_hugepages
 
-This command will try to configure 20 hugepages in the system.  The success
-or failure of allocation depends on the amount of physically contiguous
-memory that is preset in system at this time.  System administrators may want
-to put this command in one of the local rc init files.  This will enable the
-kernel to request huge pages early in the boot process (when the possibility
-of getting physical contiguous pages is still very high). In either
-case, administrators will want to verify the number of hugepages actually
-allocated by checking the sysctl or meminfo.
-
-/proc/sys/vm/nr_overcommit_hugepages indicates how large the pool of
-hugepages can grow, if more hugepages than /proc/sys/vm/nr_hugepages are
-requested by applications. echo'ing any non-zero value into this file
-indicates that the hugetlb subsystem is allowed to try to obtain
-hugepages from the buddy allocator, if the normal pool is exhausted. As
-these surplus hugepages go out of use, they are freed back to the buddy
+This command will try to configure 20 default sized huge pages in the system.
+On a NUMA platform, the kernel will attempt to distribute the huge page pool
+over all on-line nodes.  These huge pages, allocated when nr_hugepages
+is increased, are called "persistent huge pages".
+
+The success or failure of huge page allocation depends on the amount of
+physically contiguous memory that is present in the system at the time of the
+allocation attempt.  If the kernel is unable to allocate huge pages from
+some nodes in a NUMA system, it will attempt to make up the difference by
+allocating extra pages on other nodes with sufficient available contiguous
+memory, if any.
+
+System administrators may want to put this command in one of the local rc init
+files.  This will enable the kernel to request huge pages early in the boot
+process when the possibility of getting physical contiguous pages is still
+very high.  Administrators can verify the number of huge pages actually
+allocated by checking the sysctl or meminfo.  To check the per node
+distribution of huge pages in a NUMA system, use:
+
+	cat /sys/devices/system/node/node*/meminfo | fgrep Huge
+
+/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
+huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
+requested by applications.  Writing any non-zero value into this file
+indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
+huge pages from the buddy allocator, when the normal pool is exhausted. As
+these surplus huge pages go out of use, they are freed back to the buddy
 allocator.
 
+When increasing the huge page pool size via nr_hugepages, any surplus
+pages will first be promoted to persistent huge pages.  Then, additional
+huge pages will be allocated, if necessary and if possible, to fulfill
+the new huge page pool size.
+
+The administrator may shrink the pool of preallocated huge pages for
+the default huge page size by setting the nr_hugepages sysctl to a
+smaller value.  The kernel will attempt to balance the freeing of huge pages
+across all on-line nodes.  Any free huge pages on the selected nodes will
+be freed back to the buddy allocator.
+
 Caveat: Shrinking the pool via nr_hugepages such that it becomes less
-than the number of hugepages in use will convert the balance to surplus
+than the number of huge pages in use will convert the balance to surplus
 huge pages even if it would exceed the overcommit value.  As long as
 this condition holds, however, no more surplus huge pages will be
 allowed on the system until one of the two sysctls are increased
 sufficiently, or the surplus huge pages go out of use and are freed.
 
-With support for multiple hugepage pools at run-time available, much of
-the hugepage userspace interface has been duplicated in sysfs. The above
-information applies to the default hugepage size (which will be
-controlled by the proc interfaces for backwards compatibility). The root
-hugepage control directory is
+With support for multiple huge page pools at run-time available, much of
+the huge page userspace interface has been duplicated in sysfs. The above
+information applies to the default huge page size which will be
+controlled by the /proc interfaces for backwards compatibility. The root
+huge page control directory in sysfs is:
 
 	/sys/kernel/mm/hugepages
 
-For each hugepage size supported by the running kernel, a subdirectory
+For each huge page size supported by the running kernel, a subdirectory
 will exist, of the form
 
 	hugepages-${size}kB
@@ -116,9 +157,9 @@ Inside each of these directories, the sa
 	resv_hugepages
 	surplus_hugepages
 
-which function as described above for the default hugepage-sized case.
+which function as described above for the default huge page-sized case.
 
-If the user applications are going to request hugepages using mmap system
+If the user applications are going to request huge pages using mmap system
 call, then it is required that system administrator mount a file system of
 type hugetlbfs:
 
@@ -127,7 +168,7 @@ type hugetlbfs:
 	none /mnt/huge
 
 This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
-/mnt/huge.  Any files created on /mnt/huge uses hugepages.  The uid and gid
+/mnt/huge.  Any files created on /mnt/huge uses huge pages.  The uid and gid
 options sets the owner and group of the root of the file system.  By default
 the uid and gid of the current process are taken.  The mode option sets the
 mode of root of file system to value & 0777.  This value is given in octal.
@@ -156,14 +197,14 @@ mount of filesystem will be required for
 *******************************************************************
 
 /*
- * Example of using hugepage memory in a user application using Sys V shared
+ * Example of using huge page memory in a user application using Sys V shared
  * memory system calls.  In this example the app is requesting 256MB of
  * memory that is backed by huge pages.  The application uses the flag
  * SHM_HUGETLB in the shmget system call to inform the kernel that it is
- * requesting hugepages.
+ * requesting huge pages.
  *
  * For the ia64 architecture, the Linux kernel reserves Region number 4 for
- * hugepages.  That means the addresses starting with 0x800000... will need
+ * huge pages.  That means the addresses starting with 0x800000... will need
  * to be specified.  Specifying a fixed address is not required on ppc64,
  * i386 or x86_64.
  *
@@ -252,14 +293,14 @@ int main(void)
 *******************************************************************
 
 /*
- * Example of using hugepage memory in a user application using the mmap
+ * Example of using huge page memory in a user application using the mmap
  * system call.  Before running this application, make sure that the
  * administrator has mounted the hugetlbfs filesystem (on some directory
  * like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this
  * example, the app is requesting memory of size 256MB that is backed by
  * huge pages.
  *
- * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages.
+ * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages.
  * That means the addresses starting with 0x800000... will need to be
  * specified.  Specifying a fixed address is not required on ppc64, i386
  * or x86_64.


* Re: [PATCH 1/3] Balance Freeing of Huge Pages across Nodes
  2009-06-29 21:52 ` [PATCH 1/3] " Lee Schermerhorn
@ 2009-06-30 13:05   ` Mel Gorman
  2009-06-30 13:48     ` Lee Schermerhorn
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2009-06-30 13:05 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Mon, Jun 29, 2009 at 05:52:34PM -0400, Lee Schermerhorn wrote:
> [PATCH] 1/3 Balance Freeing of Huge Pages across Nodes
> 
> Against:  25jun09 mmotm
> 
> Free huges pages from nodes in round robin fashion in an
> attempt to keep [persistent a.k.a static] hugepages balanced
> across nodes
> 
> New function free_pool_huge_page() is modeled on and
> performs roughly the inverse of alloc_fresh_huge_page().
> Replaces dequeue_huge_page() which now has no callers,
> so this patch removes it.
> 
> Helper function hstate_next_node_to_free() uses new hstate
> member next_to_free_nid to distribute "frees" across all
> nodes with huge pages.
> 
> V2:
> 
> At Mel Gorman's suggestion:  renamed hstate_next_node() to
> hstate_next_node_to_alloc() for symmetry.  Also, renamed
> hstate member hugetlb_next_node to next_node_to_free.
> ["hugetlb" is implicit in the hstate struct, I think].
> 
> New in this version:
> 
> Modified adjust_pool_surplus() to use hstate_next_node_to_alloc()
> and hstate_next_node_to_free() to advance node id for adjusting
> surplus huge page count, as this is equivalent to allocating and
> freeing persistent huge pages.  [Can't blame Mel for this part.]
> 
> V3:
> 
> Minor cleanup: rename 'nid' to 'next_nid' in free_pool_huge_page() to
> better match alloc_fresh_huge_page() conventions.
> 
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  include/linux/hugetlb.h |    3 -
>  mm/hugetlb.c            |  132 +++++++++++++++++++++++++++++++-----------------
>  2 files changed, 88 insertions(+), 47 deletions(-)
> 
> Index: linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.31-rc1-mmotm-090625-1549.orig/include/linux/hugetlb.h	2009-06-29 10:21:12.000000000 -0400
> +++ linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h	2009-06-29 10:27:18.000000000 -0400
> @@ -183,7 +183,8 @@ unsigned long hugetlb_get_unmapped_area(
>  #define HSTATE_NAME_LEN 32
>  /* Defines one hugetlb page size */
>  struct hstate {
> -	int hugetlb_next_nid;
> +	int next_nid_to_alloc;
> +	int next_nid_to_free;
>  	unsigned int order;
>  	unsigned long mask;
>  	unsigned long max_huge_pages;
> Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c	2009-06-29 10:21:12.000000000 -0400
> +++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c	2009-06-29 15:53:55.000000000 -0400
> @@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst
>  	h->free_huge_pages_node[nid]++;
>  }
>  
> -static struct page *dequeue_huge_page(struct hstate *h)
> -{
> -	int nid;
> -	struct page *page = NULL;
> -
> -	for (nid = 0; nid < MAX_NUMNODES; ++nid) {
> -		if (!list_empty(&h->hugepage_freelists[nid])) {
> -			page = list_entry(h->hugepage_freelists[nid].next,
> -					  struct page, lru);
> -			list_del(&page->lru);
> -			h->free_huge_pages--;
> -			h->free_huge_pages_node[nid]--;
> -			break;
> -		}
> -	}
> -	return page;
> -}
> -
>  static struct page *dequeue_huge_page_vma(struct hstate *h,
>  				struct vm_area_struct *vma,
>  				unsigned long address, int avoid_reserve)
> @@ -640,7 +622,7 @@ static struct page *alloc_fresh_huge_pag
>  
>  /*
>   * Use a helper variable to find the next node and then
> - * copy it back to hugetlb_next_nid afterwards:
> + * copy it back to next_nid_to_alloc afterwards:
>   * otherwise there's a window in which a racer might
>   * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
>   * But we don't need to use a spin_lock here: it really
> @@ -649,13 +631,13 @@ static struct page *alloc_fresh_huge_pag
>   * if we just successfully allocated a hugepage so that
>   * the next caller gets hugepages on the next node.
>   */
> -static int hstate_next_node(struct hstate *h)
> +static int hstate_next_node_to_alloc(struct hstate *h)
>  {
>  	int next_nid;
> -	next_nid = next_node(h->hugetlb_next_nid, node_online_map);
> +	next_nid = next_node(h->next_nid_to_alloc, node_online_map);
>  	if (next_nid == MAX_NUMNODES)
>  		next_nid = first_node(node_online_map);
> -	h->hugetlb_next_nid = next_nid;
> +	h->next_nid_to_alloc = next_nid;
>  	return next_nid;
>  }
>  

Strictly speaking, next_nid_to_alloc looks more like last_nid_alloced but I
don't think it makes an important difference. Implementing it this way is
shorter and automatically ensures next_nid is an online node. 

If you wanted to be pedantic, I think the following untested code would
make it really next_nid_to_alloc but I don't think it's terribly
important.

static int hstate_next_node_to_alloc(struct hstate *h)
{
	int this_nid = h->next_nid_to_alloc;

	/* Check the node didn't get off-lined since */
	if (unlikely(!node_online(this_nid))) {
		this_nid = next_node(h->next_nid_to_alloc, node_online_map);
		h->next_nid_to_alloc = this_nid;
	}

	h->next_nid_to_alloc = next_node(h->next_nid_to_alloc, node_online_map);
	if (h->next_nid_to_alloc == MAX_NUMNODES)
		h->next_nid_to_alloc = first_node(node_online_map);

	return this_nid;
}

> @@ -666,14 +648,15 @@ static int alloc_fresh_huge_page(struct 
>  	int next_nid;
>  	int ret = 0;
>  
> -	start_nid = h->hugetlb_next_nid;
> +	start_nid = h->next_nid_to_alloc;
> +	next_nid = start_nid;
>  
>  	do {
> -		page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
> +		page = alloc_fresh_huge_page_node(h, next_nid);
>  		if (page)
>  			ret = 1;
> -		next_nid = hstate_next_node(h);
> -	} while (!page && h->hugetlb_next_nid != start_nid);
> +		next_nid = hstate_next_node_to_alloc(h);
> +	} while (!page && next_nid != start_nid);
>  
>  	if (ret)
>  		count_vm_event(HTLB_BUDDY_PGALLOC);
> @@ -683,6 +666,52 @@ static int alloc_fresh_huge_page(struct 
>  	return ret;
>  }
>  
> +/*
> + * helper for free_pool_huge_page() - find next node
> + * from which to free a huge page
> + */
> +static int hstate_next_node_to_free(struct hstate *h)
> +{
> +	int next_nid;
> +	next_nid = next_node(h->next_nid_to_free, node_online_map);
> +	if (next_nid == MAX_NUMNODES)
> +		next_nid = first_node(node_online_map);
> +	h->next_nid_to_free = next_nid;
> +	return next_nid;
> +}
> +
> +/*
> + * Free huge page from pool from next node to free.
> + * Attempt to keep persistent huge pages more or less
> + * balanced over allowed nodes.
> + * Called with hugetlb_lock locked.
> + */
> +static int free_pool_huge_page(struct hstate *h)
> +{
> +	int start_nid;
> +	int next_nid;
> +	int ret = 0;
> +
> +	start_nid = h->next_nid_to_free;
> +	next_nid = start_nid;
> +
> +	do {
> +		if (!list_empty(&h->hugepage_freelists[next_nid])) {
> +			struct page *page =
> +				list_entry(h->hugepage_freelists[next_nid].next,
> +					  struct page, lru);
> +			list_del(&page->lru);
> +			h->free_huge_pages--;
> +			h->free_huge_pages_node[next_nid]--;
> +			update_and_free_page(h, page);
> +			ret = 1;
> +		}
> +		next_nid = hstate_next_node_to_free(h);
> +	} while (!ret && next_nid != start_nid);
> +
> +	return ret;
> +}
> +
>  static struct page *alloc_buddy_huge_page(struct hstate *h,
>  			struct vm_area_struct *vma, unsigned long address)
>  {
> @@ -1007,7 +1036,7 @@ int __weak alloc_bootmem_huge_page(struc
>  		void *addr;
>  
>  		addr = __alloc_bootmem_node_nopanic(
> -				NODE_DATA(h->hugetlb_next_nid),
> +				NODE_DATA(h->next_nid_to_alloc),
>  				huge_page_size(h), huge_page_size(h), 0);
>  
>  		if (addr) {
> @@ -1019,7 +1048,7 @@ int __weak alloc_bootmem_huge_page(struc
>  			m = addr;
>  			goto found;
>  		}
> -		hstate_next_node(h);
> +		hstate_next_node_to_alloc(h);
>  		nr_nodes--;
>  	}
>  	return 0;
> @@ -1140,31 +1169,43 @@ static inline void try_to_free_low(struc
>   */
>  static int adjust_pool_surplus(struct hstate *h, int delta)
>  {
> -	static int prev_nid;
> -	int nid = prev_nid;
> +	int start_nid, next_nid;
>  	int ret = 0;
>  
>  	VM_BUG_ON(delta != -1 && delta != 1);
> -	do {
> -		nid = next_node(nid, node_online_map);
> -		if (nid == MAX_NUMNODES)
> -			nid = first_node(node_online_map);
>  
> -		/* To shrink on this node, there must be a surplus page */
> -		if (delta < 0 && !h->surplus_huge_pages_node[nid])
> -			continue;
> -		/* Surplus cannot exceed the total number of pages */
> -		if (delta > 0 && h->surplus_huge_pages_node[nid] >=
> +	if (delta < 0)
> +		start_nid = h->next_nid_to_alloc;
> +	else
> +		start_nid = h->next_nid_to_free;
> +	next_nid = start_nid;
> +
> +	do {
> +		int nid = next_nid;
> +		if (delta < 0)  {
> +			next_nid = hstate_next_node_to_alloc(h);
> +			/*
> +			 * To shrink on this node, there must be a surplus page
> +			 */
> +			if (!h->surplus_huge_pages_node[nid])
> +				continue;
> +		}
> +		if (delta > 0) {
> +			next_nid = hstate_next_node_to_free(h);
> +			/*
> +			 * Surplus cannot exceed the total number of pages
> +			 */
> +			if (h->surplus_huge_pages_node[nid] >=
>  						h->nr_huge_pages_node[nid])
> -			continue;
> +				continue;
> +		}
>  
>  		h->surplus_huge_pages += delta;
>  		h->surplus_huge_pages_node[nid] += delta;
>  		ret = 1;
>  		break;
> -	} while (nid != prev_nid);
> +	} while (next_nid != start_nid);
>  
> -	prev_nid = nid;
>  	return ret;
>  }
>  
> @@ -1226,10 +1267,8 @@ static unsigned long set_max_huge_pages(
>  	min_count = max(count, min_count);
>  	try_to_free_low(h, min_count);
>  	while (min_count < persistent_huge_pages(h)) {
> -		struct page *page = dequeue_huge_page(h);
> -		if (!page)
> +		if (!free_pool_huge_page(h))
>  			break;
> -		update_and_free_page(h, page);
>  	}
>  	while (count < persistent_huge_pages(h)) {
>  		if (!adjust_pool_surplus(h, 1))
> @@ -1441,7 +1480,8 @@ void __init hugetlb_add_hstate(unsigned 
>  	h->free_huge_pages = 0;
>  	for (i = 0; i < MAX_NUMNODES; ++i)
>  		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
> -	h->hugetlb_next_nid = first_node(node_online_map);
> +	h->next_nid_to_alloc = first_node(node_online_map);
> +	h->next_nid_to_free = first_node(node_online_map);
>  	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
>  					huge_page_size(h)/1024);
>  

Nothing problematic jumps out at me. Even with hstate_next_node_to_alloc()
as it is;

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 1/3] Balance Freeing of Huge Pages across Nodes
  2009-06-30 13:05   ` Mel Gorman
@ 2009-06-30 13:48     ` Lee Schermerhorn
  2009-06-30 13:58       ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Lee Schermerhorn @ 2009-06-30 13:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Tue, 2009-06-30 at 14:05 +0100, Mel Gorman wrote:
> On Mon, Jun 29, 2009 at 05:52:34PM -0400, Lee Schermerhorn wrote:
> > [PATCH] 1/3 Balance Freeing of Huge Pages across Nodes
> > 
> > Against:  25jun09 mmotm
> > 
> > Free huges pages from nodes in round robin fashion in an
> > attempt to keep [persistent a.k.a static] hugepages balanced
> > across nodes
> > 
> > New function free_pool_huge_page() is modeled on and
> > performs roughly the inverse of alloc_fresh_huge_page().
> > Replaces dequeue_huge_page() which now has no callers,
> > so this patch removes it.
> > 
> > Helper function hstate_next_node_to_free() uses new hstate
> > member next_to_free_nid to distribute "frees" across all
> > nodes with huge pages.
> > 
> > V2:
> > 
> > At Mel Gorman's suggestion:  renamed hstate_next_node() to
> > hstate_next_node_to_alloc() for symmetry.  Also, renamed
> > hstate member hugetlb_next_node to next_node_to_free.
> > ["hugetlb" is implicit in the hstate struct, I think].
> > 
> > New in this version:
> > 
> > Modified adjust_pool_surplus() to use hstate_next_node_to_alloc()
> > and hstate_next_node_to_free() to advance node id for adjusting
> > surplus huge page count, as this is equivalent to allocating and
> > freeing persistent huge pages.  [Can't blame Mel for this part.]
> > 
> > V3:
> > 
> > Minor cleanup: rename 'nid' to 'next_nid' in free_pool_huge_page() to
> > better match alloc_fresh_huge_page() conventions.
> > 
> > Acked-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> >  include/linux/hugetlb.h |    3 -
> >  mm/hugetlb.c            |  132 +++++++++++++++++++++++++++++++-----------------
> >  2 files changed, 88 insertions(+), 47 deletions(-)
> > 
> > Index: linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h
> > ===================================================================
> > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/include/linux/hugetlb.h	2009-06-29 10:21:12.000000000 -0400
> > +++ linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h	2009-06-29 10:27:18.000000000 -0400
> > @@ -183,7 +183,8 @@ unsigned long hugetlb_get_unmapped_area(
> >  #define HSTATE_NAME_LEN 32
> >  /* Defines one hugetlb page size */
> >  struct hstate {
> > -	int hugetlb_next_nid;
> > +	int next_nid_to_alloc;
> > +	int next_nid_to_free;
> >  	unsigned int order;
> >  	unsigned long mask;
> >  	unsigned long max_huge_pages;
> > Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c	2009-06-29 10:21:12.000000000 -0400
> > +++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c	2009-06-29 15:53:55.000000000 -0400
> > @@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst
> >  	h->free_huge_pages_node[nid]++;
> >  }
> >  
> > -static struct page *dequeue_huge_page(struct hstate *h)
> > -{
> > -	int nid;
> > -	struct page *page = NULL;
> > -
> > -	for (nid = 0; nid < MAX_NUMNODES; ++nid) {
> > -		if (!list_empty(&h->hugepage_freelists[nid])) {
> > -			page = list_entry(h->hugepage_freelists[nid].next,
> > -					  struct page, lru);
> > -			list_del(&page->lru);
> > -			h->free_huge_pages--;
> > -			h->free_huge_pages_node[nid]--;
> > -			break;
> > -		}
> > -	}
> > -	return page;
> > -}
> > -
> >  static struct page *dequeue_huge_page_vma(struct hstate *h,
> >  				struct vm_area_struct *vma,
> >  				unsigned long address, int avoid_reserve)
> > @@ -640,7 +622,7 @@ static struct page *alloc_fresh_huge_pag
> >  
> >  /*
> >   * Use a helper variable to find the next node and then
> > - * copy it back to hugetlb_next_nid afterwards:
> > + * copy it back to next_nid_to_alloc afterwards:
> >   * otherwise there's a window in which a racer might
> >   * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
> >   * But we don't need to use a spin_lock here: it really
> > @@ -649,13 +631,13 @@ static struct page *alloc_fresh_huge_pag
> >   * if we just successfully allocated a hugepage so that
> >   * the next caller gets hugepages on the next node.
> >   */
> > -static int hstate_next_node(struct hstate *h)
> > +static int hstate_next_node_to_alloc(struct hstate *h)
> >  {
> >  	int next_nid;
> > -	next_nid = next_node(h->hugetlb_next_nid, node_online_map);
> > +	next_nid = next_node(h->next_nid_to_alloc, node_online_map);
> >  	if (next_nid == MAX_NUMNODES)
> >  		next_nid = first_node(node_online_map);
> > -	h->hugetlb_next_nid = next_nid;
> > +	h->next_nid_to_alloc = next_nid;
> >  	return next_nid;
> >  }
> >  
> 
> Strictly speaking, next_nid_to_alloc looks more like last_nid_alloced but I
> don't think it makes an important difference. Implementing it this way is
> shorter and automatically ensures next_nid is an online node. 
> 
> If you wanted to be pedantic, I think the following untested code would
> make it really next_nid_to_alloc but I don't think it's terribly
> important.
> 
> static int hstate_next_node_to_alloc(struct hstate *h)
> {
> 	int this_nid = h->next_nid_to_alloc;
> 
> 	/* Check the node didn't get off-lined since */
> 	if (unlikely(!node_online(next_nid))) {
> 		this_nid = next_node(h->next_nid_to_alloc, node_online_map);
> 		h->next_nid_to_alloc = this_nid;
> 	}
> 
> 	h->next_nid_to_alloc = next_node(h->next_nid_to_alloc, node_online_map);
> 	if (h->next_nid_to_alloc == MAX_NUMNODES)
> 		h->next_nid_to_alloc = first_node(node_online_map);
> 
> 	return this_nid;
> }

Mel:  

I'm about to send out a series that constrains [persistent] huge page
alloc and free using task mempolicy, per your suggestion.  The functions
hstate_next_node_to_{alloc|free}() and how they're used get reworked
quite a bit in that series, and the names become more accurate, I think.
And, I think it does handle the node going offline, along with handling
a change to a new policy nodemask that doesn't include the value saved
in the hstate.  We can revisit this then.

However, the way we currently use these functions, they do update the
'next_nid_*' field in the hstate, and where the return value is tested
[against start_nid], it really is the "next" node.  If the alloc/free
succeeds, then the return value does turn out to be the [last] node we
just alloc'd/freed on.  But, again, we've advanced the next node to
alloc/free in the hstate.  A nit, I think :).
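
For reference, the loop in question, condensed from the patch; the value
tested against start_nid is the already-advanced cursor, i.e. the next
node to try:

	start_nid = h->next_nid_to_alloc;
	next_nid = start_nid;
	do {
		page = alloc_fresh_huge_page_node(h, next_nid);
		if (page)
			ret = 1;
		/* advances h->next_nid_to_alloc and returns the new value */
		next_nid = hstate_next_node_to_alloc(h);
	} while (!page && next_nid != start_nid);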

> 
> > @@ -666,14 +648,15 @@ static int alloc_fresh_huge_page(struct 
> >  	int next_nid;
> >  	int ret = 0;
> >  
> > -	start_nid = h->hugetlb_next_nid;
> > +	start_nid = h->next_nid_to_alloc;
> > +	next_nid = start_nid;
> >  
> >  	do {
> > -		page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
> > +		page = alloc_fresh_huge_page_node(h, next_nid);
> >  		if (page)
> >  			ret = 1;
> > -		next_nid = hstate_next_node(h);
> > -	} while (!page && h->hugetlb_next_nid != start_nid);
> > +		next_nid = hstate_next_node_to_alloc(h);
> > +	} while (!page && next_nid != start_nid);
> >  
> >  	if (ret)
> >  		count_vm_event(HTLB_BUDDY_PGALLOC);
> > @@ -683,6 +666,52 @@ static int alloc_fresh_huge_page(struct 
> >  	return ret;
> >  }
> >  
> > +/*
> > + * helper for free_pool_huge_page() - find next node
> > + * from which to free a huge page
> > + */
> > +static int hstate_next_node_to_free(struct hstate *h)
> > +{
> > +	int next_nid;
> > +	next_nid = next_node(h->next_nid_to_free, node_online_map);
> > +	if (next_nid == MAX_NUMNODES)
> > +		next_nid = first_node(node_online_map);
> > +	h->next_nid_to_free = next_nid;
> > +	return next_nid;
> > +}
> > +
> > +/*
> > + * Free huge page from pool from next node to free.
> > + * Attempt to keep persistent huge pages more or less
> > + * balanced over allowed nodes.
> > + * Called with hugetlb_lock locked.
> > + */
> > +static int free_pool_huge_page(struct hstate *h)
> > +{
> > +	int start_nid;
> > +	int next_nid;
> > +	int ret = 0;
> > +
> > +	start_nid = h->next_nid_to_free;
> > +	next_nid = start_nid;
> > +
> > +	do {
> > +		if (!list_empty(&h->hugepage_freelists[next_nid])) {
> > +			struct page *page =
> > +				list_entry(h->hugepage_freelists[next_nid].next,
> > +					  struct page, lru);
> > +			list_del(&page->lru);
> > +			h->free_huge_pages--;
> > +			h->free_huge_pages_node[next_nid]--;
> > +			update_and_free_page(h, page);
> > +			ret = 1;
> > +		}
> > +		next_nid = hstate_next_node_to_free(h);
> > +	} while (!ret && next_nid != start_nid);
> > +
> > +	return ret;
> > +}
> > +
> >  static struct page *alloc_buddy_huge_page(struct hstate *h,
> >  			struct vm_area_struct *vma, unsigned long address)
> >  {
> > @@ -1007,7 +1036,7 @@ int __weak alloc_bootmem_huge_page(struc
> >  		void *addr;
> >  
> >  		addr = __alloc_bootmem_node_nopanic(
> > -				NODE_DATA(h->hugetlb_next_nid),
> > +				NODE_DATA(h->next_nid_to_alloc),
> >  				huge_page_size(h), huge_page_size(h), 0);
> >  
> >  		if (addr) {
> > @@ -1019,7 +1048,7 @@ int __weak alloc_bootmem_huge_page(struc
> >  			m = addr;
> >  			goto found;
> >  		}
> > -		hstate_next_node(h);
> > +		hstate_next_node_to_alloc(h);
> >  		nr_nodes--;
> >  	}
> >  	return 0;
> > @@ -1140,31 +1169,43 @@ static inline void try_to_free_low(struc
> >   */
> >  static int adjust_pool_surplus(struct hstate *h, int delta)
> >  {
> > -	static int prev_nid;
> > -	int nid = prev_nid;
> > +	int start_nid, next_nid;
> >  	int ret = 0;
> >  
> >  	VM_BUG_ON(delta != -1 && delta != 1);
> > -	do {
> > -		nid = next_node(nid, node_online_map);
> > -		if (nid == MAX_NUMNODES)
> > -			nid = first_node(node_online_map);
> >  
> > -		/* To shrink on this node, there must be a surplus page */
> > -		if (delta < 0 && !h->surplus_huge_pages_node[nid])
> > -			continue;
> > -		/* Surplus cannot exceed the total number of pages */
> > -		if (delta > 0 && h->surplus_huge_pages_node[nid] >=
> > +	if (delta < 0)
> > +		start_nid = h->next_nid_to_alloc;
> > +	else
> > +		start_nid = h->next_nid_to_free;
> > +	next_nid = start_nid;
> > +
> > +	do {
> > +		int nid = next_nid;
> > +		if (delta < 0)  {
> > +			next_nid = hstate_next_node_to_alloc(h);
> > +			/*
> > +			 * To shrink on this node, there must be a surplus page
> > +			 */
> > +			if (!h->surplus_huge_pages_node[nid])
> > +				continue;
> > +		}
> > +		if (delta > 0) {
> > +			next_nid = hstate_next_node_to_free(h);
> > +			/*
> > +			 * Surplus cannot exceed the total number of pages
> > +			 */
> > +			if (h->surplus_huge_pages_node[nid] >=
> >  						h->nr_huge_pages_node[nid])
> > -			continue;
> > +				continue;
> > +		}
> >  
> >  		h->surplus_huge_pages += delta;
> >  		h->surplus_huge_pages_node[nid] += delta;
> >  		ret = 1;
> >  		break;
> > -	} while (nid != prev_nid);
> > +	} while (next_nid != start_nid);
> >  
> > -	prev_nid = nid;
> >  	return ret;
> >  }
> >  
> > @@ -1226,10 +1267,8 @@ static unsigned long set_max_huge_pages(
> >  	min_count = max(count, min_count);
> >  	try_to_free_low(h, min_count);
> >  	while (min_count < persistent_huge_pages(h)) {
> > -		struct page *page = dequeue_huge_page(h);
> > -		if (!page)
> > +		if (!free_pool_huge_page(h))
> >  			break;
> > -		update_and_free_page(h, page);
> >  	}
> >  	while (count < persistent_huge_pages(h)) {
> >  		if (!adjust_pool_surplus(h, 1))
> > @@ -1441,7 +1480,8 @@ void __init hugetlb_add_hstate(unsigned 
> >  	h->free_huge_pages = 0;
> >  	for (i = 0; i < MAX_NUMNODES; ++i)
> >  		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
> > -	h->hugetlb_next_nid = first_node(node_online_map);
> > +	h->next_nid_to_alloc = first_node(node_online_map);
> > +	h->next_nid_to_free = first_node(node_online_map);
> >  	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> >  					huge_page_size(h)/1024);
> >  
> 
> Nothing problematic jumps out at me. Even with hstate_next_node_to_alloc()
> as it is;
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> 

Thanks.  It did seem to test out OK on ia64 [12jun mmotm; 25jun mmotm
has a problem there--TBI] and x86_64.  It could use more testing,
though, especially with various combinations of persistent and surplus
huge pages.  I saw you mention that you have a hugetlb regression test suite.
Is that available "out there, somewhere"?  I just grabbed a libhugetlbfs
source rpm, but haven't cracked it yet.  Maybe it's there?

Lee


* Re: [PATCH 1/3] Balance Freeing of Huge Pages across Nodes
  2009-06-30 13:48     ` Lee Schermerhorn
@ 2009-06-30 13:58       ` Mel Gorman
  0 siblings, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2009-06-30 13:58 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

On Tue, Jun 30, 2009 at 09:48:11AM -0400, Lee Schermerhorn wrote:
> On Tue, 2009-06-30 at 14:05 +0100, Mel Gorman wrote:
> > On Mon, Jun 29, 2009 at 05:52:34PM -0400, Lee Schermerhorn wrote:
> > > [PATCH] 1/3 Balance Freeing of Huge Pages across Nodes
> > > 
> > > Against:  25jun09 mmotm
> > > 
> > > Free huges pages from nodes in round robin fashion in an
> > > attempt to keep [persistent a.k.a static] hugepages balanced
> > > across nodes
> > > 
> > > New function free_pool_huge_page() is modeled on and
> > > performs roughly the inverse of alloc_fresh_huge_page().
> > > Replaces dequeue_huge_page() which now has no callers,
> > > so this patch removes it.
> > > 
> > > Helper function hstate_next_node_to_free() uses new hstate
> > > member next_to_free_nid to distribute "frees" across all
> > > nodes with huge pages.
> > > 
> > > V2:
> > > 
> > > At Mel Gorman's suggestion:  renamed hstate_next_node() to
> > > hstate_next_node_to_alloc() for symmetry.  Also, renamed
> > > hstate member hugetlb_next_node to next_node_to_free.
> > > ["hugetlb" is implicit in the hstate struct, I think].
> > > 
> > > New in this version:
> > > 
> > > Modified adjust_pool_surplus() to use hstate_next_node_to_alloc()
> > > and hstate_next_node_to_free() to advance node id for adjusting
> > > surplus huge page count, as this is equivalent to allocating and
> > > freeing persistent huge pages.  [Can't blame Mel for this part.]
> > > 
> > > V3:
> > > 
> > > Minor cleanup: rename 'nid' to 'next_nid' in free_pool_huge_page() to
> > > better match alloc_fresh_huge_page() conventions.
> > > 
> > > Acked-by: David Rientjes <rientjes@google.com>
> > > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > > 
> > >  include/linux/hugetlb.h |    3 -
> > >  mm/hugetlb.c            |  132 +++++++++++++++++++++++++++++++-----------------
> > >  2 files changed, 88 insertions(+), 47 deletions(-)
> > > 
> > > Index: linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h
> > > ===================================================================
> > > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/include/linux/hugetlb.h	2009-06-29 10:21:12.000000000 -0400
> > > +++ linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h	2009-06-29 10:27:18.000000000 -0400
> > > @@ -183,7 +183,8 @@ unsigned long hugetlb_get_unmapped_area(
> > >  #define HSTATE_NAME_LEN 32
> > >  /* Defines one hugetlb page size */
> > >  struct hstate {
> > > -	int hugetlb_next_nid;
> > > +	int next_nid_to_alloc;
> > > +	int next_nid_to_free;
> > >  	unsigned int order;
> > >  	unsigned long mask;
> > >  	unsigned long max_huge_pages;
> > > Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c
> > > ===================================================================
> > > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c	2009-06-29 10:21:12.000000000 -0400
> > > +++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c	2009-06-29 15:53:55.000000000 -0400
> > > @@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst
> > >  	h->free_huge_pages_node[nid]++;
> > >  }
> > >  
> > > -static struct page *dequeue_huge_page(struct hstate *h)
> > > -{
> > > -	int nid;
> > > -	struct page *page = NULL;
> > > -
> > > -	for (nid = 0; nid < MAX_NUMNODES; ++nid) {
> > > -		if (!list_empty(&h->hugepage_freelists[nid])) {
> > > -			page = list_entry(h->hugepage_freelists[nid].next,
> > > -					  struct page, lru);
> > > -			list_del(&page->lru);
> > > -			h->free_huge_pages--;
> > > -			h->free_huge_pages_node[nid]--;
> > > -			break;
> > > -		}
> > > -	}
> > > -	return page;
> > > -}
> > > -
> > >  static struct page *dequeue_huge_page_vma(struct hstate *h,
> > >  				struct vm_area_struct *vma,
> > >  				unsigned long address, int avoid_reserve)
> > > @@ -640,7 +622,7 @@ static struct page *alloc_fresh_huge_pag
> > >  
> > >  /*
> > >   * Use a helper variable to find the next node and then
> > > - * copy it back to hugetlb_next_nid afterwards:
> > > + * copy it back to next_nid_to_alloc afterwards:
> > >   * otherwise there's a window in which a racer might
> > >   * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
> > >   * But we don't need to use a spin_lock here: it really
> > > @@ -649,13 +631,13 @@ static struct page *alloc_fresh_huge_pag
> > >   * if we just successfully allocated a hugepage so that
> > >   * the next caller gets hugepages on the next node.
> > >   */
> > > -static int hstate_next_node(struct hstate *h)
> > > +static int hstate_next_node_to_alloc(struct hstate *h)
> > >  {
> > >  	int next_nid;
> > > -	next_nid = next_node(h->hugetlb_next_nid, node_online_map);
> > > +	next_nid = next_node(h->next_nid_to_alloc, node_online_map);
> > >  	if (next_nid == MAX_NUMNODES)
> > >  		next_nid = first_node(node_online_map);
> > > -	h->hugetlb_next_nid = next_nid;
> > > +	h->next_nid_to_alloc = next_nid;
> > >  	return next_nid;
> > >  }
> > >  
> > 
> > Strictly speaking, next_nid_to_alloc looks more like last_nid_alloced but I
> > don't think it makes an important difference. Implementing it this way is
> > shorter and automatically ensures next_nid is an online node. 
> > 
> > If you wanted to be pedantic, I think the following untested code would
> > make it really next_nid_to_alloc but I don't think it's terribly
> > important.
> > 
> > static int hstate_next_node_to_alloc(struct hstate *h)
> > {
> > 	int this_nid = h->next_nid_to_alloc;
> > 
> > 	/* Check the node didn't get off-lined since */
> > 	if (unlikely(!node_online(this_nid))) {
> > 		this_nid = next_node(h->next_nid_to_alloc, node_online_map);
> > 		h->next_nid_to_alloc = this_nid;
> > 	}
> > 
> > 	h->next_nid_to_alloc = next_node(h->next_nid_to_alloc, node_online_map);
> > 	if (h->next_nid_to_alloc == MAX_NUMNODES)
> > 		h->next_nid_to_alloc = first_node(node_online_map);
> > 
> > 	return this_nid;
> > }
> 
> Mel:  
> 
> I'm about to send out a series that constrains [persistent] huge page
> alloc and free using task mempolicy, per your suggestion.  The functions
> 'next_node_to_{alloc|free} and how they're used get reworked in that
> series quite a bit, and the name becomes more accurate, I think.  And, I
> think it does handle the node going offline along with handling changing
> to a new policy nodemask that doesn't include the value saved in the
> hstate.  We can revisit this, then.
> 

Sounds good.

> However, the way we currently use these functions, they do update the
> 'next_node_*' field in the hstate, and where the return value is tested
> [against start_nid], it really is the "next" node. 

Good point.

> If the alloc/free
> succeeds, then the return value does turn out to be the [last] node we
> just alloc'd/freed on.  But, again, we've advanced the next node to
> alloc/free in the hstate.  A nit, I think :).
> 

It's enough of a concern to go with your current version.

> > 
> > > @@ -666,14 +648,15 @@ static int alloc_fresh_huge_page(struct 
> > >  	int next_nid;
> > >  	int ret = 0;
> > >  
> > > -	start_nid = h->hugetlb_next_nid;
> > > +	start_nid = h->next_nid_to_alloc;
> > > +	next_nid = start_nid;
> > >  
> > >  	do {
> > > -		page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
> > > +		page = alloc_fresh_huge_page_node(h, next_nid);
> > >  		if (page)
> > >  			ret = 1;
> > > -		next_nid = hstate_next_node(h);
> > > -	} while (!page && h->hugetlb_next_nid != start_nid);
> > > +		next_nid = hstate_next_node_to_alloc(h);
> > > +	} while (!page && next_nid != start_nid);
> > >  
> > >  	if (ret)
> > >  		count_vm_event(HTLB_BUDDY_PGALLOC);
> > > @@ -683,6 +666,52 @@ static int alloc_fresh_huge_page(struct 
> > >  	return ret;
> > >  }
> > >  
> > > +/*
> > > + * helper for free_pool_huge_page() - find next node
> > > + * from which to free a huge page
> > > + */
> > > +static int hstate_next_node_to_free(struct hstate *h)
> > > +{
> > > +	int next_nid;
> > > +	next_nid = next_node(h->next_nid_to_free, node_online_map);
> > > +	if (next_nid == MAX_NUMNODES)
> > > +		next_nid = first_node(node_online_map);
> > > +	h->next_nid_to_free = next_nid;
> > > +	return next_nid;
> > > +}
> > > +
> > > +/*
> > > + * Free huge page from pool from next node to free.
> > > + * Attempt to keep persistent huge pages more or less
> > > + * balanced over allowed nodes.
> > > + * Called with hugetlb_lock locked.
> > > + */
> > > +static int free_pool_huge_page(struct hstate *h)
> > > +{
> > > +	int start_nid;
> > > +	int next_nid;
> > > +	int ret = 0;
> > > +
> > > +	start_nid = h->next_nid_to_free;
> > > +	next_nid = start_nid;
> > > +
> > > +	do {
> > > +		if (!list_empty(&h->hugepage_freelists[next_nid])) {
> > > +			struct page *page =
> > > +				list_entry(h->hugepage_freelists[next_nid].next,
> > > +					  struct page, lru);
> > > +			list_del(&page->lru);
> > > +			h->free_huge_pages--;
> > > +			h->free_huge_pages_node[next_nid]--;
> > > +			update_and_free_page(h, page);
> > > +			ret = 1;
> > > +		}
> > > +		next_nid = hstate_next_node_to_free(h);
> > > +	} while (!ret && next_nid != start_nid);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >  static struct page *alloc_buddy_huge_page(struct hstate *h,
> > >  			struct vm_area_struct *vma, unsigned long address)
> > >  {
> > > @@ -1007,7 +1036,7 @@ int __weak alloc_bootmem_huge_page(struc
> > >  		void *addr;
> > >  
> > >  		addr = __alloc_bootmem_node_nopanic(
> > > -				NODE_DATA(h->hugetlb_next_nid),
> > > +				NODE_DATA(h->next_nid_to_alloc),
> > >  				huge_page_size(h), huge_page_size(h), 0);
> > >  
> > >  		if (addr) {
> > > @@ -1019,7 +1048,7 @@ int __weak alloc_bootmem_huge_page(struc
> > >  			m = addr;
> > >  			goto found;
> > >  		}
> > > -		hstate_next_node(h);
> > > +		hstate_next_node_to_alloc(h);
> > >  		nr_nodes--;
> > >  	}
> > >  	return 0;
> > > @@ -1140,31 +1169,43 @@ static inline void try_to_free_low(struc
> > >   */
> > >  static int adjust_pool_surplus(struct hstate *h, int delta)
> > >  {
> > > -	static int prev_nid;
> > > -	int nid = prev_nid;
> > > +	int start_nid, next_nid;
> > >  	int ret = 0;
> > >  
> > >  	VM_BUG_ON(delta != -1 && delta != 1);
> > > -	do {
> > > -		nid = next_node(nid, node_online_map);
> > > -		if (nid == MAX_NUMNODES)
> > > -			nid = first_node(node_online_map);
> > >  
> > > -		/* To shrink on this node, there must be a surplus page */
> > > -		if (delta < 0 && !h->surplus_huge_pages_node[nid])
> > > -			continue;
> > > -		/* Surplus cannot exceed the total number of pages */
> > > -		if (delta > 0 && h->surplus_huge_pages_node[nid] >=
> > > +	if (delta < 0)
> > > +		start_nid = h->next_nid_to_alloc;
> > > +	else
> > > +		start_nid = h->next_nid_to_free;
> > > +	next_nid = start_nid;
> > > +
> > > +	do {
> > > +		int nid = next_nid;
> > > +		if (delta < 0)  {
> > > +			next_nid = hstate_next_node_to_alloc(h);
> > > +			/*
> > > +			 * To shrink on this node, there must be a surplus page
> > > +			 */
> > > +			if (!h->surplus_huge_pages_node[nid])
> > > +				continue;
> > > +		}
> > > +		if (delta > 0) {
> > > +			next_nid = hstate_next_node_to_free(h);
> > > +			/*
> > > +			 * Surplus cannot exceed the total number of pages
> > > +			 */
> > > +			if (h->surplus_huge_pages_node[nid] >=
> > >  						h->nr_huge_pages_node[nid])
> > > -			continue;
> > > +				continue;
> > > +		}
> > >  
> > >  		h->surplus_huge_pages += delta;
> > >  		h->surplus_huge_pages_node[nid] += delta;
> > >  		ret = 1;
> > >  		break;
> > > -	} while (nid != prev_nid);
> > > +	} while (next_nid != start_nid);
> > >  
> > > -	prev_nid = nid;
> > >  	return ret;
> > >  }
> > >  
> > > @@ -1226,10 +1267,8 @@ static unsigned long set_max_huge_pages(
> > >  	min_count = max(count, min_count);
> > >  	try_to_free_low(h, min_count);
> > >  	while (min_count < persistent_huge_pages(h)) {
> > > -		struct page *page = dequeue_huge_page(h);
> > > -		if (!page)
> > > +		if (!free_pool_huge_page(h))
> > >  			break;
> > > -		update_and_free_page(h, page);
> > >  	}
> > >  	while (count < persistent_huge_pages(h)) {
> > >  		if (!adjust_pool_surplus(h, 1))
> > > @@ -1441,7 +1480,8 @@ void __init hugetlb_add_hstate(unsigned 
> > >  	h->free_huge_pages = 0;
> > >  	for (i = 0; i < MAX_NUMNODES; ++i)
> > >  		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
> > > -	h->hugetlb_next_nid = first_node(node_online_map);
> > > +	h->next_nid_to_alloc = first_node(node_online_map);
> > > +	h->next_nid_to_free = first_node(node_online_map);
> > >  	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> > >  					huge_page_size(h)/1024);
> > >  
> > 
> > Nothing problematic jumps out at me. Even with hstate_next_node_to_alloc()
> > as it is;
> > 
> > Acked-by: Mel Gorman <mel@csn.ul.ie>
> > 
> 
> Thanks.  It did seem to test out OK on ia64 [12jun mmotm; 25jun mmotm
> has a problem there--TBI] and x86_64.  Could use more testing, tho'.
> Especially with various combinations of persistent and surplus huge
> pages. 

No harm in that. I've tested the patches a bit and spotted nothing
problematic specifically to do with your patches. I am able to trigger
the OOM killer with disturbing lines such as

heap-overflow invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0

but I haven't determined if this is something new in mainline or on mmotm yet.

> I saw you mention that you have a hugetlb regression test suite.
> Is that available "out there, somewhere"?  I just grabbed a libhugetlbfs
> source rpm, but haven't cracked it yet.  Maybe it's there?
> 

It probably is, but I'm not certain. You're better off downloading from
http://sourceforge.net/projects/libhugetlbfs and doing something like

make
./obj/hugeadm --pool-pages-min 2M:64
./obj/hugeadm --create-global-mounts
make func
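
If you want to sanity check that the pool came up and that the huge
pages stay roughly balanced across nodes, something like the following
should do (assuming your kernel exports the per-node hugepage counters
in the node meminfo files):

grep -i huge /proc/meminfo
grep -i huge /sys/devices/system/node/node*/meminfo

The second grep is only a suggestion for eyeballing the per-node
HugePages_Free counts before and after shrinking the pool.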

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab



end of thread

Thread overview: 7+ messages
2009-06-29 21:52 [PATCH 0/3] Balance Freeing of Huge Pages across Nodes Lee Schermerhorn
2009-06-29 21:52 ` [PATCH 1/3] " Lee Schermerhorn
2009-06-30 13:05   ` Mel Gorman
2009-06-30 13:48     ` Lee Schermerhorn
2009-06-30 13:58       ` Mel Gorman
2009-06-29 21:52 ` [PATCH 2/3] Use free_pool_huge_page() to return unused surplus pages Lee Schermerhorn
2009-06-29 21:52 ` [PATCH 3/3] Cleanup and update huge pages documentation Lee Schermerhorn
