* [PATCH 0/3] Balance Freeing of Huge Pages across Nodes
From: Lee Schermerhorn @ 2009-06-29 21:52 UTC
To: linux-mm, linux-numa
Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
Adam Litke, Andy Whitcroft, eric.whitney
[PATCH] 0/3 Balance Freeing of Huge Pages across Nodes
This series contains V3 of the "Balance Freeing of Huge Pages
across Nodes" patch--containing a minor cleanup from V2--and two
additional, related patches.  I have added David Rientjes' ACK
from V2, hoping that the change in V3 doesn't invalidate it.
Patch 2/3 reworks the free_pool_huge_page() function so that it
may also be used by return_unused_surplus_pages().  This patch
needs careful review [and testing?].  Perhaps Mel Gorman can
give it a go with the hugepages regression tests.
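
For readers who want the gist before wading into the diffs, here is
a condensed sketch of the balanced free that patches 1/3 and 2/3
introduce.  It is paraphrased from the diffs later in this thread
rather than being the literal patch text: the point at which the
node cursor advances is rearranged slightly for readability, the
acct_surplus flag is the patch 2/3 addition, and hugetlb_lock is
assumed to be held by the caller, as in the real code.

/*
 * Sketch only: free one huge page from the "next node to free",
 * cycling the cursor round robin over the online nodes so that
 * repeated calls spread the frees across nodes.  When acct_surplus
 * is set, only nodes holding surplus pages qualify and the surplus
 * counters are adjusted as well.
 */
static int free_pool_huge_page(struct hstate *h, bool acct_surplus)
{
	int start_nid = h->next_nid_to_free;
	int nid = start_nid;
	int ret = 0;

	do {
		int this_nid = nid;

		/* advance the cursor first so every attempt moves it on */
		nid = hstate_next_node_to_free(h);

		if (acct_surplus && !h->surplus_huge_pages_node[this_nid])
			continue;	/* no surplus to return on this node */

		if (!list_empty(&h->hugepage_freelists[this_nid])) {
			struct page *page =
				list_entry(h->hugepage_freelists[this_nid].next,
					   struct page, lru);
			list_del(&page->lru);
			h->free_huge_pages--;
			h->free_huge_pages_node[this_nid]--;
			if (acct_surplus) {
				h->surplus_huge_pages--;
				h->surplus_huge_pages_node[this_nid]--;
			}
			update_and_free_page(h, page);
			ret = 1;	/* freed one page; stop */
		}
	} while (!ret && nid != start_nid);

	return ret;
}

In the series, set_max_huge_pages() calls this with acct_surplus
clear when shrinking the persistent pool, and
return_unused_surplus_pages() calls it with acct_surplus set, so
both paths share the same round-robin node cursor.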
Patch 3/3 updates the vm hugetlbpage documentation to clarify
usage and to describe how the freeing of huge pages is balanced
across nodes.  Most of the update comes from my earlier "huge pages
nodes_allowed" patch series, minus any mention of the nodes_allowed
mask and the associated boot parameter, sysctl and attributes.
Lee
* [PATCH 1/3] Balance Freeing of Huge Pages across Nodes
From: Lee Schermerhorn @ 2009-06-29 21:52 UTC
To: linux-mm, linux-numa
Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes,
	Adam Litke, Andy Whitcroft, eric.whitney

[PATCH] 1/3 Balance Freeing of Huge Pages across Nodes

Against: 25jun09 mmotm

Free huge pages from nodes in round robin fashion in an
attempt to keep [persistent a.k.a. static] hugepages balanced
across nodes.

New function free_pool_huge_page() is modeled on and
performs roughly the inverse of alloc_fresh_huge_page().
It replaces dequeue_huge_page(), which now has no callers,
so this patch removes it.

Helper function hstate_next_node_to_free() uses the new hstate
member next_nid_to_free to distribute "frees" across all
nodes with huge pages.

V2:

At Mel Gorman's suggestion: renamed hstate_next_node() to
hstate_next_node_to_alloc() for symmetry.  Also, renamed
hstate member hugetlb_next_node to next_node_to_free.
["hugetlb" is implicit in the hstate struct, I think].

New in this version:

Modified adjust_pool_surplus() to use hstate_next_node_to_alloc()
and hstate_next_node_to_free() to advance the node id for adjusting
the surplus huge page count, as this is equivalent to allocating and
freeing persistent huge pages.  [Can't blame Mel for this part.]

V3:

Minor cleanup: rename 'nid' to 'next_nid' in free_pool_huge_page() to
better match alloc_fresh_huge_page() conventions.
Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> include/linux/hugetlb.h | 3 - mm/hugetlb.c | 132 +++++++++++++++++++++++++++++++----------------- 2 files changed, 88 insertions(+), 47 deletions(-) Index: linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h =================================================================== --- linux-2.6.31-rc1-mmotm-090625-1549.orig/include/linux/hugetlb.h 2009-06-29 10:21:12.000000000 -0400 +++ linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h 2009-06-29 10:27:18.000000000 -0400 @@ -183,7 +183,8 @@ unsigned long hugetlb_get_unmapped_area( #define HSTATE_NAME_LEN 32 /* Defines one hugetlb page size */ struct hstate { - int hugetlb_next_nid; + int next_nid_to_alloc; + int next_nid_to_free; unsigned int order; unsigned long mask; unsigned long max_huge_pages; Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c =================================================================== --- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c 2009-06-29 10:21:12.000000000 -0400 +++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c 2009-06-29 15:53:55.000000000 -0400 @@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst h->free_huge_pages_node[nid]++; } -static struct page *dequeue_huge_page(struct hstate *h) -{ - int nid; - struct page *page = NULL; - - for (nid = 0; nid < MAX_NUMNODES; ++nid) { - if (!list_empty(&h->hugepage_freelists[nid])) { - page = list_entry(h->hugepage_freelists[nid].next, - struct page, lru); - list_del(&page->lru); - h->free_huge_pages--; - h->free_huge_pages_node[nid]--; - break; - } - } - return page; -} - static struct page *dequeue_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, unsigned long address, int avoid_reserve) @@ -640,7 +622,7 @@ static struct page *alloc_fresh_huge_pag /* * Use a helper variable to find the next node and then - * copy it back to hugetlb_next_nid afterwards: + * copy it back to next_nid_to_alloc afterwards: * otherwise there's a window in which a racer might * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node. * But we don't need to use a spin_lock here: it really @@ -649,13 +631,13 @@ static struct page *alloc_fresh_huge_pag * if we just successfully allocated a hugepage so that * the next caller gets hugepages on the next node. 
*/ -static int hstate_next_node(struct hstate *h) +static int hstate_next_node_to_alloc(struct hstate *h) { int next_nid; - next_nid = next_node(h->hugetlb_next_nid, node_online_map); + next_nid = next_node(h->next_nid_to_alloc, node_online_map); if (next_nid == MAX_NUMNODES) next_nid = first_node(node_online_map); - h->hugetlb_next_nid = next_nid; + h->next_nid_to_alloc = next_nid; return next_nid; } @@ -666,14 +648,15 @@ static int alloc_fresh_huge_page(struct int next_nid; int ret = 0; - start_nid = h->hugetlb_next_nid; + start_nid = h->next_nid_to_alloc; + next_nid = start_nid; do { - page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid); + page = alloc_fresh_huge_page_node(h, next_nid); if (page) ret = 1; - next_nid = hstate_next_node(h); - } while (!page && h->hugetlb_next_nid != start_nid); + next_nid = hstate_next_node_to_alloc(h); + } while (!page && next_nid != start_nid); if (ret) count_vm_event(HTLB_BUDDY_PGALLOC); @@ -683,6 +666,52 @@ static int alloc_fresh_huge_page(struct return ret; } +/* + * helper for free_pool_huge_page() - find next node + * from which to free a huge page + */ +static int hstate_next_node_to_free(struct hstate *h) +{ + int next_nid; + next_nid = next_node(h->next_nid_to_free, node_online_map); + if (next_nid == MAX_NUMNODES) + next_nid = first_node(node_online_map); + h->next_nid_to_free = next_nid; + return next_nid; +} + +/* + * Free huge page from pool from next node to free. + * Attempt to keep persistent huge pages more or less + * balanced over allowed nodes. + * Called with hugetlb_lock locked. + */ +static int free_pool_huge_page(struct hstate *h) +{ + int start_nid; + int next_nid; + int ret = 0; + + start_nid = h->next_nid_to_free; + next_nid = start_nid; + + do { + if (!list_empty(&h->hugepage_freelists[next_nid])) { + struct page *page = + list_entry(h->hugepage_freelists[next_nid].next, + struct page, lru); + list_del(&page->lru); + h->free_huge_pages--; + h->free_huge_pages_node[next_nid]--; + update_and_free_page(h, page); + ret = 1; + } + next_nid = hstate_next_node_to_free(h); + } while (!ret && next_nid != start_nid); + + return ret; +} + static struct page *alloc_buddy_huge_page(struct hstate *h, struct vm_area_struct *vma, unsigned long address) { @@ -1007,7 +1036,7 @@ int __weak alloc_bootmem_huge_page(struc void *addr; addr = __alloc_bootmem_node_nopanic( - NODE_DATA(h->hugetlb_next_nid), + NODE_DATA(h->next_nid_to_alloc), huge_page_size(h), huge_page_size(h), 0); if (addr) { @@ -1019,7 +1048,7 @@ int __weak alloc_bootmem_huge_page(struc m = addr; goto found; } - hstate_next_node(h); + hstate_next_node_to_alloc(h); nr_nodes--; } return 0; @@ -1140,31 +1169,43 @@ static inline void try_to_free_low(struc */ static int adjust_pool_surplus(struct hstate *h, int delta) { - static int prev_nid; - int nid = prev_nid; + int start_nid, next_nid; int ret = 0; VM_BUG_ON(delta != -1 && delta != 1); - do { - nid = next_node(nid, node_online_map); - if (nid == MAX_NUMNODES) - nid = first_node(node_online_map); - /* To shrink on this node, there must be a surplus page */ - if (delta < 0 && !h->surplus_huge_pages_node[nid]) - continue; - /* Surplus cannot exceed the total number of pages */ - if (delta > 0 && h->surplus_huge_pages_node[nid] >= + if (delta < 0) + start_nid = h->next_nid_to_alloc; + else + start_nid = h->next_nid_to_free; + next_nid = start_nid; + + do { + int nid = next_nid; + if (delta < 0) { + next_nid = hstate_next_node_to_alloc(h); + /* + * To shrink on this node, there must be a surplus page + */ + if 
(!h->surplus_huge_pages_node[nid]) + continue; + } + if (delta > 0) { + next_nid = hstate_next_node_to_free(h); + /* + * Surplus cannot exceed the total number of pages + */ + if (h->surplus_huge_pages_node[nid] >= h->nr_huge_pages_node[nid]) - continue; + continue; + } h->surplus_huge_pages += delta; h->surplus_huge_pages_node[nid] += delta; ret = 1; break; - } while (nid != prev_nid); + } while (next_nid != start_nid); - prev_nid = nid; return ret; } @@ -1226,10 +1267,8 @@ static unsigned long set_max_huge_pages( min_count = max(count, min_count); try_to_free_low(h, min_count); while (min_count < persistent_huge_pages(h)) { - struct page *page = dequeue_huge_page(h); - if (!page) + if (!free_pool_huge_page(h)) break; - update_and_free_page(h, page); } while (count < persistent_huge_pages(h)) { if (!adjust_pool_surplus(h, 1)) @@ -1441,7 +1480,8 @@ void __init hugetlb_add_hstate(unsigned h->free_huge_pages = 0; for (i = 0; i < MAX_NUMNODES; ++i) INIT_LIST_HEAD(&h->hugepage_freelists[i]); - h->hugetlb_next_nid = first_node(node_online_map); + h->next_nid_to_alloc = first_node(node_online_map); + h->next_nid_to_free = first_node(node_online_map); snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", huge_page_size(h)/1024); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/3] Balance Freeing of Huge Pages across Nodes 2009-06-29 21:52 ` [PATCH 1/3] " Lee Schermerhorn @ 2009-06-30 13:05 ` Mel Gorman 2009-06-30 13:48 ` Lee Schermerhorn 0 siblings, 1 reply; 7+ messages in thread From: Mel Gorman @ 2009-06-30 13:05 UTC (permalink / raw) To: Lee Schermerhorn Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney On Mon, Jun 29, 2009 at 05:52:34PM -0400, Lee Schermerhorn wrote: > [PATCH] 1/3 Balance Freeing of Huge Pages across Nodes > > Against: 25jun09 mmotm > > Free huges pages from nodes in round robin fashion in an > attempt to keep [persistent a.k.a static] hugepages balanced > across nodes > > New function free_pool_huge_page() is modeled on and > performs roughly the inverse of alloc_fresh_huge_page(). > Replaces dequeue_huge_page() which now has no callers, > so this patch removes it. > > Helper function hstate_next_node_to_free() uses new hstate > member next_to_free_nid to distribute "frees" across all > nodes with huge pages. > > V2: > > At Mel Gorman's suggestion: renamed hstate_next_node() to > hstate_next_node_to_alloc() for symmetry. Also, renamed > hstate member hugetlb_next_node to next_node_to_free. > ["hugetlb" is implicit in the hstate struct, I think]. > > New in this version: > > Modified adjust_pool_surplus() to use hstate_next_node_to_alloc() > and hstate_next_node_to_free() to advance node id for adjusting > surplus huge page count, as this is equivalent to allocating and > freeing persistent huge pages. [Can't blame Mel for this part.] > > V3: > > Minor cleanup: rename 'nid' to 'next_nid' in free_pool_huge_page() to > better match alloc_fresh_huge_page() conventions. > > Acked-by: David Rientjes <rientjes@google.com> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> > > include/linux/hugetlb.h | 3 - > mm/hugetlb.c | 132 +++++++++++++++++++++++++++++++----------------- > 2 files changed, 88 insertions(+), 47 deletions(-) > > Index: linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h > =================================================================== > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/include/linux/hugetlb.h 2009-06-29 10:21:12.000000000 -0400 > +++ linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h 2009-06-29 10:27:18.000000000 -0400 > @@ -183,7 +183,8 @@ unsigned long hugetlb_get_unmapped_area( > #define HSTATE_NAME_LEN 32 > /* Defines one hugetlb page size */ > struct hstate { > - int hugetlb_next_nid; > + int next_nid_to_alloc; > + int next_nid_to_free; > unsigned int order; > unsigned long mask; > unsigned long max_huge_pages; > Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c > =================================================================== > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c 2009-06-29 10:21:12.000000000 -0400 > +++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c 2009-06-29 15:53:55.000000000 -0400 > @@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst > h->free_huge_pages_node[nid]++; > } > > -static struct page *dequeue_huge_page(struct hstate *h) > -{ > - int nid; > - struct page *page = NULL; > - > - for (nid = 0; nid < MAX_NUMNODES; ++nid) { > - if (!list_empty(&h->hugepage_freelists[nid])) { > - page = list_entry(h->hugepage_freelists[nid].next, > - struct page, lru); > - list_del(&page->lru); > - h->free_huge_pages--; > - h->free_huge_pages_node[nid]--; > - break; > - } > - } > - return page; > -} > - > static struct page *dequeue_huge_page_vma(struct hstate *h, > struct 
vm_area_struct *vma, > unsigned long address, int avoid_reserve) > @@ -640,7 +622,7 @@ static struct page *alloc_fresh_huge_pag > > /* > * Use a helper variable to find the next node and then > - * copy it back to hugetlb_next_nid afterwards: > + * copy it back to next_nid_to_alloc afterwards: > * otherwise there's a window in which a racer might > * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node. > * But we don't need to use a spin_lock here: it really > @@ -649,13 +631,13 @@ static struct page *alloc_fresh_huge_pag > * if we just successfully allocated a hugepage so that > * the next caller gets hugepages on the next node. > */ > -static int hstate_next_node(struct hstate *h) > +static int hstate_next_node_to_alloc(struct hstate *h) > { > int next_nid; > - next_nid = next_node(h->hugetlb_next_nid, node_online_map); > + next_nid = next_node(h->next_nid_to_alloc, node_online_map); > if (next_nid == MAX_NUMNODES) > next_nid = first_node(node_online_map); > - h->hugetlb_next_nid = next_nid; > + h->next_nid_to_alloc = next_nid; > return next_nid; > } > Strictly speaking, next_nid_to_alloc looks more like last_nid_alloced but I don't think it makes an important difference. Implementing it this way is shorter and automatically ensures next_nid is an online node. If you wanted to be pedantic, I think the following untested code would make it really next_nid_to_alloc but I don't think it's terribly important. static int hstate_next_node_to_alloc(struct hstate *h) { int this_nid = h->next_nid_to_alloc; /* Check the node didn't get off-lined since */ if (unlikely(!node_online(next_nid))) { this_nid = next_node(h->next_nid_to_alloc, node_online_map); h->next_nid_to_alloc = this_nid; } h->next_nid_to_alloc = next_node(h->next_nid_to_alloc, node_online_map); if (h->next_nid_to_alloc == MAX_NUMNODES) h->next_nid_to_alloc = first_node(node_online_map); return this_nid; } > @@ -666,14 +648,15 @@ static int alloc_fresh_huge_page(struct > int next_nid; > int ret = 0; > > - start_nid = h->hugetlb_next_nid; > + start_nid = h->next_nid_to_alloc; > + next_nid = start_nid; > > do { > - page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid); > + page = alloc_fresh_huge_page_node(h, next_nid); > if (page) > ret = 1; > - next_nid = hstate_next_node(h); > - } while (!page && h->hugetlb_next_nid != start_nid); > + next_nid = hstate_next_node_to_alloc(h); > + } while (!page && next_nid != start_nid); > > if (ret) > count_vm_event(HTLB_BUDDY_PGALLOC); > @@ -683,6 +666,52 @@ static int alloc_fresh_huge_page(struct > return ret; > } > > +/* > + * helper for free_pool_huge_page() - find next node > + * from which to free a huge page > + */ > +static int hstate_next_node_to_free(struct hstate *h) > +{ > + int next_nid; > + next_nid = next_node(h->next_nid_to_free, node_online_map); > + if (next_nid == MAX_NUMNODES) > + next_nid = first_node(node_online_map); > + h->next_nid_to_free = next_nid; > + return next_nid; > +} > + > +/* > + * Free huge page from pool from next node to free. > + * Attempt to keep persistent huge pages more or less > + * balanced over allowed nodes. > + * Called with hugetlb_lock locked. 
> + */ > +static int free_pool_huge_page(struct hstate *h) > +{ > + int start_nid; > + int next_nid; > + int ret = 0; > + > + start_nid = h->next_nid_to_free; > + next_nid = start_nid; > + > + do { > + if (!list_empty(&h->hugepage_freelists[next_nid])) { > + struct page *page = > + list_entry(h->hugepage_freelists[next_nid].next, > + struct page, lru); > + list_del(&page->lru); > + h->free_huge_pages--; > + h->free_huge_pages_node[next_nid]--; > + update_and_free_page(h, page); > + ret = 1; > + } > + next_nid = hstate_next_node_to_free(h); > + } while (!ret && next_nid != start_nid); > + > + return ret; > +} > + > static struct page *alloc_buddy_huge_page(struct hstate *h, > struct vm_area_struct *vma, unsigned long address) > { > @@ -1007,7 +1036,7 @@ int __weak alloc_bootmem_huge_page(struc > void *addr; > > addr = __alloc_bootmem_node_nopanic( > - NODE_DATA(h->hugetlb_next_nid), > + NODE_DATA(h->next_nid_to_alloc), > huge_page_size(h), huge_page_size(h), 0); > > if (addr) { > @@ -1019,7 +1048,7 @@ int __weak alloc_bootmem_huge_page(struc > m = addr; > goto found; > } > - hstate_next_node(h); > + hstate_next_node_to_alloc(h); > nr_nodes--; > } > return 0; > @@ -1140,31 +1169,43 @@ static inline void try_to_free_low(struc > */ > static int adjust_pool_surplus(struct hstate *h, int delta) > { > - static int prev_nid; > - int nid = prev_nid; > + int start_nid, next_nid; > int ret = 0; > > VM_BUG_ON(delta != -1 && delta != 1); > - do { > - nid = next_node(nid, node_online_map); > - if (nid == MAX_NUMNODES) > - nid = first_node(node_online_map); > > - /* To shrink on this node, there must be a surplus page */ > - if (delta < 0 && !h->surplus_huge_pages_node[nid]) > - continue; > - /* Surplus cannot exceed the total number of pages */ > - if (delta > 0 && h->surplus_huge_pages_node[nid] >= > + if (delta < 0) > + start_nid = h->next_nid_to_alloc; > + else > + start_nid = h->next_nid_to_free; > + next_nid = start_nid; > + > + do { > + int nid = next_nid; > + if (delta < 0) { > + next_nid = hstate_next_node_to_alloc(h); > + /* > + * To shrink on this node, there must be a surplus page > + */ > + if (!h->surplus_huge_pages_node[nid]) > + continue; > + } > + if (delta > 0) { > + next_nid = hstate_next_node_to_free(h); > + /* > + * Surplus cannot exceed the total number of pages > + */ > + if (h->surplus_huge_pages_node[nid] >= > h->nr_huge_pages_node[nid]) > - continue; > + continue; > + } > > h->surplus_huge_pages += delta; > h->surplus_huge_pages_node[nid] += delta; > ret = 1; > break; > - } while (nid != prev_nid); > + } while (next_nid != start_nid); > > - prev_nid = nid; > return ret; > } > > @@ -1226,10 +1267,8 @@ static unsigned long set_max_huge_pages( > min_count = max(count, min_count); > try_to_free_low(h, min_count); > while (min_count < persistent_huge_pages(h)) { > - struct page *page = dequeue_huge_page(h); > - if (!page) > + if (!free_pool_huge_page(h)) > break; > - update_and_free_page(h, page); > } > while (count < persistent_huge_pages(h)) { > if (!adjust_pool_surplus(h, 1)) > @@ -1441,7 +1480,8 @@ void __init hugetlb_add_hstate(unsigned > h->free_huge_pages = 0; > for (i = 0; i < MAX_NUMNODES; ++i) > INIT_LIST_HEAD(&h->hugepage_freelists[i]); > - h->hugetlb_next_nid = first_node(node_online_map); > + h->next_nid_to_alloc = first_node(node_online_map); > + h->next_nid_to_free = first_node(node_online_map); > snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", > huge_page_size(h)/1024); > Nothing problematic jumps out at me. 
Even with hstate_next_node_to_alloc() as it is; Acked-by: Mel Gorman <mel@csn.ul.ie> -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/3] Balance Freeing of Huge Pages across Nodes 2009-06-30 13:05 ` Mel Gorman @ 2009-06-30 13:48 ` Lee Schermerhorn 2009-06-30 13:58 ` Mel Gorman 0 siblings, 1 reply; 7+ messages in thread From: Lee Schermerhorn @ 2009-06-30 13:48 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney On Tue, 2009-06-30 at 14:05 +0100, Mel Gorman wrote: > On Mon, Jun 29, 2009 at 05:52:34PM -0400, Lee Schermerhorn wrote: > > [PATCH] 1/3 Balance Freeing of Huge Pages across Nodes > > > > Against: 25jun09 mmotm > > > > Free huges pages from nodes in round robin fashion in an > > attempt to keep [persistent a.k.a static] hugepages balanced > > across nodes > > > > New function free_pool_huge_page() is modeled on and > > performs roughly the inverse of alloc_fresh_huge_page(). > > Replaces dequeue_huge_page() which now has no callers, > > so this patch removes it. > > > > Helper function hstate_next_node_to_free() uses new hstate > > member next_to_free_nid to distribute "frees" across all > > nodes with huge pages. > > > > V2: > > > > At Mel Gorman's suggestion: renamed hstate_next_node() to > > hstate_next_node_to_alloc() for symmetry. Also, renamed > > hstate member hugetlb_next_node to next_node_to_free. > > ["hugetlb" is implicit in the hstate struct, I think]. > > > > New in this version: > > > > Modified adjust_pool_surplus() to use hstate_next_node_to_alloc() > > and hstate_next_node_to_free() to advance node id for adjusting > > surplus huge page count, as this is equivalent to allocating and > > freeing persistent huge pages. [Can't blame Mel for this part.] > > > > V3: > > > > Minor cleanup: rename 'nid' to 'next_nid' in free_pool_huge_page() to > > better match alloc_fresh_huge_page() conventions. 
> > > > Acked-by: David Rientjes <rientjes@google.com> > > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> > > > > include/linux/hugetlb.h | 3 - > > mm/hugetlb.c | 132 +++++++++++++++++++++++++++++++----------------- > > 2 files changed, 88 insertions(+), 47 deletions(-) > > > > Index: linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h > > =================================================================== > > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/include/linux/hugetlb.h 2009-06-29 10:21:12.000000000 -0400 > > +++ linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h 2009-06-29 10:27:18.000000000 -0400 > > @@ -183,7 +183,8 @@ unsigned long hugetlb_get_unmapped_area( > > #define HSTATE_NAME_LEN 32 > > /* Defines one hugetlb page size */ > > struct hstate { > > - int hugetlb_next_nid; > > + int next_nid_to_alloc; > > + int next_nid_to_free; > > unsigned int order; > > unsigned long mask; > > unsigned long max_huge_pages; > > Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c > > =================================================================== > > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c 2009-06-29 10:21:12.000000000 -0400 > > +++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c 2009-06-29 15:53:55.000000000 -0400 > > @@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst > > h->free_huge_pages_node[nid]++; > > } > > > > -static struct page *dequeue_huge_page(struct hstate *h) > > -{ > > - int nid; > > - struct page *page = NULL; > > - > > - for (nid = 0; nid < MAX_NUMNODES; ++nid) { > > - if (!list_empty(&h->hugepage_freelists[nid])) { > > - page = list_entry(h->hugepage_freelists[nid].next, > > - struct page, lru); > > - list_del(&page->lru); > > - h->free_huge_pages--; > > - h->free_huge_pages_node[nid]--; > > - break; > > - } > > - } > > - return page; > > -} > > - > > static struct page *dequeue_huge_page_vma(struct hstate *h, > > struct vm_area_struct *vma, > > unsigned long address, int avoid_reserve) > > @@ -640,7 +622,7 @@ static struct page *alloc_fresh_huge_pag > > > > /* > > * Use a helper variable to find the next node and then > > - * copy it back to hugetlb_next_nid afterwards: > > + * copy it back to next_nid_to_alloc afterwards: > > * otherwise there's a window in which a racer might > > * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node. > > * But we don't need to use a spin_lock here: it really > > @@ -649,13 +631,13 @@ static struct page *alloc_fresh_huge_pag > > * if we just successfully allocated a hugepage so that > > * the next caller gets hugepages on the next node. > > */ > > -static int hstate_next_node(struct hstate *h) > > +static int hstate_next_node_to_alloc(struct hstate *h) > > { > > int next_nid; > > - next_nid = next_node(h->hugetlb_next_nid, node_online_map); > > + next_nid = next_node(h->next_nid_to_alloc, node_online_map); > > if (next_nid == MAX_NUMNODES) > > next_nid = first_node(node_online_map); > > - h->hugetlb_next_nid = next_nid; > > + h->next_nid_to_alloc = next_nid; > > return next_nid; > > } > > > > Strictly speaking, next_nid_to_alloc looks more like last_nid_alloced but I > don't think it makes an important difference. Implementing it this way is > shorter and automatically ensures next_nid is an online node. > > If you wanted to be pedantic, I think the following untested code would > make it really next_nid_to_alloc but I don't think it's terribly > important. 
> > static int hstate_next_node_to_alloc(struct hstate *h) > { > int this_nid = h->next_nid_to_alloc; > > /* Check the node didn't get off-lined since */ > if (unlikely(!node_online(next_nid))) { > this_nid = next_node(h->next_nid_to_alloc, node_online_map); > h->next_nid_to_alloc = this_nid; > } > > h->next_nid_to_alloc = next_node(h->next_nid_to_alloc, node_online_map); > if (h->next_nid_to_alloc == MAX_NUMNODES) > h->next_nid_to_alloc = first_node(node_online_map); > > return this_nid; > } Mel: I'm about to send out a series that constrains [persistent] huge page alloc and free using task mempolicy, per your suggestion. The functions 'next_node_to_{alloc|free} and how they're used get reworked in that series quite a bit, and the name becomes more accurate, I think. And, I think it does handle the node going offline along with handling changing to a new policy nodemask that doesn't include the value saved in the hstate. We can revisit this, then. However, the way we currently use these functions, they do update the 'next_node_*' field in the hstate, and where the return value is tested [against start_nid], it really is the "next" node. If the alloc/free succeeds, then the return value does turn out to be the [last] node we just alloc'd/freed on. But, again, we've advanced the next node to alloc/free in the hstate. A nit, I think :). > > > @@ -666,14 +648,15 @@ static int alloc_fresh_huge_page(struct > > int next_nid; > > int ret = 0; > > > > - start_nid = h->hugetlb_next_nid; > > + start_nid = h->next_nid_to_alloc; > > + next_nid = start_nid; > > > > do { > > - page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid); > > + page = alloc_fresh_huge_page_node(h, next_nid); > > if (page) > > ret = 1; > > - next_nid = hstate_next_node(h); > > - } while (!page && h->hugetlb_next_nid != start_nid); > > + next_nid = hstate_next_node_to_alloc(h); > > + } while (!page && next_nid != start_nid); > > > > if (ret) > > count_vm_event(HTLB_BUDDY_PGALLOC); > > @@ -683,6 +666,52 @@ static int alloc_fresh_huge_page(struct > > return ret; > > } > > > > +/* > > + * helper for free_pool_huge_page() - find next node > > + * from which to free a huge page > > + */ > > +static int hstate_next_node_to_free(struct hstate *h) > > +{ > > + int next_nid; > > + next_nid = next_node(h->next_nid_to_free, node_online_map); > > + if (next_nid == MAX_NUMNODES) > > + next_nid = first_node(node_online_map); > > + h->next_nid_to_free = next_nid; > > + return next_nid; > > +} > > + > > +/* > > + * Free huge page from pool from next node to free. > > + * Attempt to keep persistent huge pages more or less > > + * balanced over allowed nodes. > > + * Called with hugetlb_lock locked. 
> > + */ > > +static int free_pool_huge_page(struct hstate *h) > > +{ > > + int start_nid; > > + int next_nid; > > + int ret = 0; > > + > > + start_nid = h->next_nid_to_free; > > + next_nid = start_nid; > > + > > + do { > > + if (!list_empty(&h->hugepage_freelists[next_nid])) { > > + struct page *page = > > + list_entry(h->hugepage_freelists[next_nid].next, > > + struct page, lru); > > + list_del(&page->lru); > > + h->free_huge_pages--; > > + h->free_huge_pages_node[next_nid]--; > > + update_and_free_page(h, page); > > + ret = 1; > > + } > > + next_nid = hstate_next_node_to_free(h); > > + } while (!ret && next_nid != start_nid); > > + > > + return ret; > > +} > > + > > static struct page *alloc_buddy_huge_page(struct hstate *h, > > struct vm_area_struct *vma, unsigned long address) > > { > > @@ -1007,7 +1036,7 @@ int __weak alloc_bootmem_huge_page(struc > > void *addr; > > > > addr = __alloc_bootmem_node_nopanic( > > - NODE_DATA(h->hugetlb_next_nid), > > + NODE_DATA(h->next_nid_to_alloc), > > huge_page_size(h), huge_page_size(h), 0); > > > > if (addr) { > > @@ -1019,7 +1048,7 @@ int __weak alloc_bootmem_huge_page(struc > > m = addr; > > goto found; > > } > > - hstate_next_node(h); > > + hstate_next_node_to_alloc(h); > > nr_nodes--; > > } > > return 0; > > @@ -1140,31 +1169,43 @@ static inline void try_to_free_low(struc > > */ > > static int adjust_pool_surplus(struct hstate *h, int delta) > > { > > - static int prev_nid; > > - int nid = prev_nid; > > + int start_nid, next_nid; > > int ret = 0; > > > > VM_BUG_ON(delta != -1 && delta != 1); > > - do { > > - nid = next_node(nid, node_online_map); > > - if (nid == MAX_NUMNODES) > > - nid = first_node(node_online_map); > > > > - /* To shrink on this node, there must be a surplus page */ > > - if (delta < 0 && !h->surplus_huge_pages_node[nid]) > > - continue; > > - /* Surplus cannot exceed the total number of pages */ > > - if (delta > 0 && h->surplus_huge_pages_node[nid] >= > > + if (delta < 0) > > + start_nid = h->next_nid_to_alloc; > > + else > > + start_nid = h->next_nid_to_free; > > + next_nid = start_nid; > > + > > + do { > > + int nid = next_nid; > > + if (delta < 0) { > > + next_nid = hstate_next_node_to_alloc(h); > > + /* > > + * To shrink on this node, there must be a surplus page > > + */ > > + if (!h->surplus_huge_pages_node[nid]) > > + continue; > > + } > > + if (delta > 0) { > > + next_nid = hstate_next_node_to_free(h); > > + /* > > + * Surplus cannot exceed the total number of pages > > + */ > > + if (h->surplus_huge_pages_node[nid] >= > > h->nr_huge_pages_node[nid]) > > - continue; > > + continue; > > + } > > > > h->surplus_huge_pages += delta; > > h->surplus_huge_pages_node[nid] += delta; > > ret = 1; > > break; > > - } while (nid != prev_nid); > > + } while (next_nid != start_nid); > > > > - prev_nid = nid; > > return ret; > > } > > > > @@ -1226,10 +1267,8 @@ static unsigned long set_max_huge_pages( > > min_count = max(count, min_count); > > try_to_free_low(h, min_count); > > while (min_count < persistent_huge_pages(h)) { > > - struct page *page = dequeue_huge_page(h); > > - if (!page) > > + if (!free_pool_huge_page(h)) > > break; > > - update_and_free_page(h, page); > > } > > while (count < persistent_huge_pages(h)) { > > if (!adjust_pool_surplus(h, 1)) > > @@ -1441,7 +1480,8 @@ void __init hugetlb_add_hstate(unsigned > > h->free_huge_pages = 0; > > for (i = 0; i < MAX_NUMNODES; ++i) > > INIT_LIST_HEAD(&h->hugepage_freelists[i]); > > - h->hugetlb_next_nid = first_node(node_online_map); > > + h->next_nid_to_alloc = 
first_node(node_online_map); > > + h->next_nid_to_free = first_node(node_online_map); > > snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", > > huge_page_size(h)/1024); > > > > Nothing problematic jumps out at me. Even with hstate_next_node_to_alloc() > as it is; > > Acked-by: Mel Gorman <mel@csn.ul.ie> > Thanks. It did seem to test out OK on ia64 [12jun mmotm; 25jun mmotm has a problem there--TBI] and x86_64. Could use more testing, tho'. Especially with various combinations of persistent and surplus huge pages. I saw you mention that you have a hugetlb regression test suite. Is that available "out there, somewhere"? I just grabbed a libhugetlbfs source rpm, but haven't cracked it yet. Maybe it's there? Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/3] Balance Freeing of Huge Pages across Nodes 2009-06-30 13:48 ` Lee Schermerhorn @ 2009-06-30 13:58 ` Mel Gorman 0 siblings, 0 replies; 7+ messages in thread From: Mel Gorman @ 2009-06-30 13:58 UTC (permalink / raw) To: Lee Schermerhorn Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney On Tue, Jun 30, 2009 at 09:48:11AM -0400, Lee Schermerhorn wrote: > On Tue, 2009-06-30 at 14:05 +0100, Mel Gorman wrote: > > On Mon, Jun 29, 2009 at 05:52:34PM -0400, Lee Schermerhorn wrote: > > > [PATCH] 1/3 Balance Freeing of Huge Pages across Nodes > > > > > > Against: 25jun09 mmotm > > > > > > Free huges pages from nodes in round robin fashion in an > > > attempt to keep [persistent a.k.a static] hugepages balanced > > > across nodes > > > > > > New function free_pool_huge_page() is modeled on and > > > performs roughly the inverse of alloc_fresh_huge_page(). > > > Replaces dequeue_huge_page() which now has no callers, > > > so this patch removes it. > > > > > > Helper function hstate_next_node_to_free() uses new hstate > > > member next_to_free_nid to distribute "frees" across all > > > nodes with huge pages. > > > > > > V2: > > > > > > At Mel Gorman's suggestion: renamed hstate_next_node() to > > > hstate_next_node_to_alloc() for symmetry. Also, renamed > > > hstate member hugetlb_next_node to next_node_to_free. > > > ["hugetlb" is implicit in the hstate struct, I think]. > > > > > > New in this version: > > > > > > Modified adjust_pool_surplus() to use hstate_next_node_to_alloc() > > > and hstate_next_node_to_free() to advance node id for adjusting > > > surplus huge page count, as this is equivalent to allocating and > > > freeing persistent huge pages. [Can't blame Mel for this part.] > > > > > > V3: > > > > > > Minor cleanup: rename 'nid' to 'next_nid' in free_pool_huge_page() to > > > better match alloc_fresh_huge_page() conventions. 
> > > > > > Acked-by: David Rientjes <rientjes@google.com> > > > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> > > > > > > include/linux/hugetlb.h | 3 - > > > mm/hugetlb.c | 132 +++++++++++++++++++++++++++++++----------------- > > > 2 files changed, 88 insertions(+), 47 deletions(-) > > > > > > Index: linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h > > > =================================================================== > > > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/include/linux/hugetlb.h 2009-06-29 10:21:12.000000000 -0400 > > > +++ linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h 2009-06-29 10:27:18.000000000 -0400 > > > @@ -183,7 +183,8 @@ unsigned long hugetlb_get_unmapped_area( > > > #define HSTATE_NAME_LEN 32 > > > /* Defines one hugetlb page size */ > > > struct hstate { > > > - int hugetlb_next_nid; > > > + int next_nid_to_alloc; > > > + int next_nid_to_free; > > > unsigned int order; > > > unsigned long mask; > > > unsigned long max_huge_pages; > > > Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c > > > =================================================================== > > > --- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c 2009-06-29 10:21:12.000000000 -0400 > > > +++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c 2009-06-29 15:53:55.000000000 -0400 > > > @@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst > > > h->free_huge_pages_node[nid]++; > > > } > > > > > > -static struct page *dequeue_huge_page(struct hstate *h) > > > -{ > > > - int nid; > > > - struct page *page = NULL; > > > - > > > - for (nid = 0; nid < MAX_NUMNODES; ++nid) { > > > - if (!list_empty(&h->hugepage_freelists[nid])) { > > > - page = list_entry(h->hugepage_freelists[nid].next, > > > - struct page, lru); > > > - list_del(&page->lru); > > > - h->free_huge_pages--; > > > - h->free_huge_pages_node[nid]--; > > > - break; > > > - } > > > - } > > > - return page; > > > -} > > > - > > > static struct page *dequeue_huge_page_vma(struct hstate *h, > > > struct vm_area_struct *vma, > > > unsigned long address, int avoid_reserve) > > > @@ -640,7 +622,7 @@ static struct page *alloc_fresh_huge_pag > > > > > > /* > > > * Use a helper variable to find the next node and then > > > - * copy it back to hugetlb_next_nid afterwards: > > > + * copy it back to next_nid_to_alloc afterwards: > > > * otherwise there's a window in which a racer might > > > * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node. > > > * But we don't need to use a spin_lock here: it really > > > @@ -649,13 +631,13 @@ static struct page *alloc_fresh_huge_pag > > > * if we just successfully allocated a hugepage so that > > > * the next caller gets hugepages on the next node. > > > */ > > > -static int hstate_next_node(struct hstate *h) > > > +static int hstate_next_node_to_alloc(struct hstate *h) > > > { > > > int next_nid; > > > - next_nid = next_node(h->hugetlb_next_nid, node_online_map); > > > + next_nid = next_node(h->next_nid_to_alloc, node_online_map); > > > if (next_nid == MAX_NUMNODES) > > > next_nid = first_node(node_online_map); > > > - h->hugetlb_next_nid = next_nid; > > > + h->next_nid_to_alloc = next_nid; > > > return next_nid; > > > } > > > > > > > Strictly speaking, next_nid_to_alloc looks more like last_nid_alloced but I > > don't think it makes an important difference. Implementing it this way is > > shorter and automatically ensures next_nid is an online node. 
> > > > If you wanted to be pedantic, I think the following untested code would > > make it really next_nid_to_alloc but I don't think it's terribly > > important. > > > > static int hstate_next_node_to_alloc(struct hstate *h) > > { > > int this_nid = h->next_nid_to_alloc; > > > > /* Check the node didn't get off-lined since */ > > if (unlikely(!node_online(next_nid))) { > > this_nid = next_node(h->next_nid_to_alloc, node_online_map); > > h->next_nid_to_alloc = this_nid; > > } > > > > h->next_nid_to_alloc = next_node(h->next_nid_to_alloc, node_online_map); > > if (h->next_nid_to_alloc == MAX_NUMNODES) > > h->next_nid_to_alloc = first_node(node_online_map); > > > > return this_nid; > > } > > Mel: > > I'm about to send out a series that constrains [persistent] huge page > alloc and free using task mempolicy, per your suggestion. The functions > 'next_node_to_{alloc|free} and how they're used get reworked in that > series quite a bit, and the name becomes more accurate, I think. And, I > think it does handle the node going offline along with handling changing > to a new policy nodemask that doesn't include the value saved in the > hstate. We can revisit this, then. > Sounds good. > However, the way we currently use these functions, they do update the > 'next_node_*' field in the hstate, and where the return value is tested > [against start_nid], it really is the "next" node. Good point. > If the alloc/free > succeeds, then the return value does turn out to be the [last] node we > just alloc'd/freed on. But, again, we've advanced the next node to > alloc/free in the hstate. A nit, I think :). > It's enough of a concern to go with your current version. > > > > > @@ -666,14 +648,15 @@ static int alloc_fresh_huge_page(struct > > > int next_nid; > > > int ret = 0; > > > > > > - start_nid = h->hugetlb_next_nid; > > > + start_nid = h->next_nid_to_alloc; > > > + next_nid = start_nid; > > > > > > do { > > > - page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid); > > > + page = alloc_fresh_huge_page_node(h, next_nid); > > > if (page) > > > ret = 1; > > > - next_nid = hstate_next_node(h); > > > - } while (!page && h->hugetlb_next_nid != start_nid); > > > + next_nid = hstate_next_node_to_alloc(h); > > > + } while (!page && next_nid != start_nid); > > > > > > if (ret) > > > count_vm_event(HTLB_BUDDY_PGALLOC); > > > @@ -683,6 +666,52 @@ static int alloc_fresh_huge_page(struct > > > return ret; > > > } > > > > > > +/* > > > + * helper for free_pool_huge_page() - find next node > > > + * from which to free a huge page > > > + */ > > > +static int hstate_next_node_to_free(struct hstate *h) > > > +{ > > > + int next_nid; > > > + next_nid = next_node(h->next_nid_to_free, node_online_map); > > > + if (next_nid == MAX_NUMNODES) > > > + next_nid = first_node(node_online_map); > > > + h->next_nid_to_free = next_nid; > > > + return next_nid; > > > +} > > > + > > > +/* > > > + * Free huge page from pool from next node to free. > > > + * Attempt to keep persistent huge pages more or less > > > + * balanced over allowed nodes. > > > + * Called with hugetlb_lock locked. 
> > > + */ > > > +static int free_pool_huge_page(struct hstate *h) > > > +{ > > > + int start_nid; > > > + int next_nid; > > > + int ret = 0; > > > + > > > + start_nid = h->next_nid_to_free; > > > + next_nid = start_nid; > > > + > > > + do { > > > + if (!list_empty(&h->hugepage_freelists[next_nid])) { > > > + struct page *page = > > > + list_entry(h->hugepage_freelists[next_nid].next, > > > + struct page, lru); > > > + list_del(&page->lru); > > > + h->free_huge_pages--; > > > + h->free_huge_pages_node[next_nid]--; > > > + update_and_free_page(h, page); > > > + ret = 1; > > > + } > > > + next_nid = hstate_next_node_to_free(h); > > > + } while (!ret && next_nid != start_nid); > > > + > > > + return ret; > > > +} > > > + > > > static struct page *alloc_buddy_huge_page(struct hstate *h, > > > struct vm_area_struct *vma, unsigned long address) > > > { > > > @@ -1007,7 +1036,7 @@ int __weak alloc_bootmem_huge_page(struc > > > void *addr; > > > > > > addr = __alloc_bootmem_node_nopanic( > > > - NODE_DATA(h->hugetlb_next_nid), > > > + NODE_DATA(h->next_nid_to_alloc), > > > huge_page_size(h), huge_page_size(h), 0); > > > > > > if (addr) { > > > @@ -1019,7 +1048,7 @@ int __weak alloc_bootmem_huge_page(struc > > > m = addr; > > > goto found; > > > } > > > - hstate_next_node(h); > > > + hstate_next_node_to_alloc(h); > > > nr_nodes--; > > > } > > > return 0; > > > @@ -1140,31 +1169,43 @@ static inline void try_to_free_low(struc > > > */ > > > static int adjust_pool_surplus(struct hstate *h, int delta) > > > { > > > - static int prev_nid; > > > - int nid = prev_nid; > > > + int start_nid, next_nid; > > > int ret = 0; > > > > > > VM_BUG_ON(delta != -1 && delta != 1); > > > - do { > > > - nid = next_node(nid, node_online_map); > > > - if (nid == MAX_NUMNODES) > > > - nid = first_node(node_online_map); > > > > > > - /* To shrink on this node, there must be a surplus page */ > > > - if (delta < 0 && !h->surplus_huge_pages_node[nid]) > > > - continue; > > > - /* Surplus cannot exceed the total number of pages */ > > > - if (delta > 0 && h->surplus_huge_pages_node[nid] >= > > > + if (delta < 0) > > > + start_nid = h->next_nid_to_alloc; > > > + else > > > + start_nid = h->next_nid_to_free; > > > + next_nid = start_nid; > > > + > > > + do { > > > + int nid = next_nid; > > > + if (delta < 0) { > > > + next_nid = hstate_next_node_to_alloc(h); > > > + /* > > > + * To shrink on this node, there must be a surplus page > > > + */ > > > + if (!h->surplus_huge_pages_node[nid]) > > > + continue; > > > + } > > > + if (delta > 0) { > > > + next_nid = hstate_next_node_to_free(h); > > > + /* > > > + * Surplus cannot exceed the total number of pages > > > + */ > > > + if (h->surplus_huge_pages_node[nid] >= > > > h->nr_huge_pages_node[nid]) > > > - continue; > > > + continue; > > > + } > > > > > > h->surplus_huge_pages += delta; > > > h->surplus_huge_pages_node[nid] += delta; > > > ret = 1; > > > break; > > > - } while (nid != prev_nid); > > > + } while (next_nid != start_nid); > > > > > > - prev_nid = nid; > > > return ret; > > > } > > > > > > @@ -1226,10 +1267,8 @@ static unsigned long set_max_huge_pages( > > > min_count = max(count, min_count); > > > try_to_free_low(h, min_count); > > > while (min_count < persistent_huge_pages(h)) { > > > - struct page *page = dequeue_huge_page(h); > > > - if (!page) > > > + if (!free_pool_huge_page(h)) > > > break; > > > - update_and_free_page(h, page); > > > } > > > while (count < persistent_huge_pages(h)) { > > > if (!adjust_pool_surplus(h, 1)) > > > @@ -1441,7 +1480,8 @@ void __init 
hugetlb_add_hstate(unsigned > > > h->free_huge_pages = 0; > > > for (i = 0; i < MAX_NUMNODES; ++i) > > > INIT_LIST_HEAD(&h->hugepage_freelists[i]); > > > - h->hugetlb_next_nid = first_node(node_online_map); > > > + h->next_nid_to_alloc = first_node(node_online_map); > > > + h->next_nid_to_free = first_node(node_online_map); > > > snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", > > > huge_page_size(h)/1024); > > > > > > > Nothing problematic jumps out at me. Even with hstate_next_node_to_alloc() > > as it is; > > > > Acked-by: Mel Gorman <mel@csn.ul.ie> > > > > Thanks. It did seem to test out OK on ia64 [12jun mmotm; 25jun mmotm > has a problem there--TBI] and x86_64. Could use more testing, tho'. > Especially with various combinations of persistent and surplus huge > pages. No harm in that. I've tested the patches a bit and spotted nothing problematic to do specifically with your patches. I am able to trigger the OOM killer with disturbing lines such as heap-overflow invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0 but I haven't determined if this is something new in mainline or on mmotm yet. > I saw you mention that you have a hugetlb regression test suite. > Is that available "out there, somewhere"? I just grabbed a libhugetlbfs > source rpm, but haven't cracked it yet. Maybe it's there? > It probably is, but I'm not certain. You're better off downloading from http://sourceforge.net/projects/libhugetlbfs and doing something like make ./obj/hugeadm --pool-pages-min 2M:64 ./obj/hugeadm --create-global-mounts make func -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 2/3] Use free_pool_huge_page() to return unused surplus pages 2009-06-29 21:52 [PATCH 0/3] Balance Freeing of Huge Pages across Nodes Lee Schermerhorn 2009-06-29 21:52 ` [PATCH 1/3] " Lee Schermerhorn @ 2009-06-29 21:52 ` Lee Schermerhorn 2009-06-29 21:52 ` [PATCH 3/3] Cleanup and update huge pages documentation Lee Schermerhorn 2 siblings, 0 replies; 7+ messages in thread From: Lee Schermerhorn @ 2009-06-29 21:52 UTC (permalink / raw) To: linux-mm, linux-numa Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney PATCH 2/3 - Use free_pool_huge_page() for return_unused_surplus_pages() Against: 25jun09 mmotm Use the [modified] free_pool_huge_page() function to return unused surplus pages. This will help keep huge pages balanced across nodes between freeing of unused surplus pages and freeing of persistent huge pages [from set_max_huge_pages] by using the same node id "cursor". It also eliminates some code duplication. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> mm/hugetlb.c | 57 +++++++++++++++++++++++++-------------------------------- 1 file changed, 25 insertions(+), 32 deletions(-) Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c =================================================================== --- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c 2009-06-29 15:53:55.000000000 -0400 +++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c 2009-06-29 16:52:45.000000000 -0400 @@ -686,7 +686,7 @@ static int hstate_next_node_to_free(stru * balanced over allowed nodes. * Called with hugetlb_lock locked. */ -static int free_pool_huge_page(struct hstate *h) +static int free_pool_huge_page(struct hstate *h, bool acct_surplus) { int start_nid; int next_nid; @@ -696,6 +696,13 @@ static int free_pool_huge_page(struct hs next_nid = start_nid; do { + /* + * If we're returning unused surplus pages, skip nodes + * with no surplus. + */ + if (acct_surplus && !h->surplus_huge_pages_node[next_nid]) + continue; + if (!list_empty(&h->hugepage_freelists[next_nid])) { struct page *page = list_entry(h->hugepage_freelists[next_nid].next, @@ -703,6 +710,10 @@ static int free_pool_huge_page(struct hs list_del(&page->lru); h->free_huge_pages--; h->free_huge_pages_node[next_nid]--; + if (acct_surplus) { + h->surplus_huge_pages--; + h->surplus_huge_pages_node[next_nid]--; + } update_and_free_page(h, page); ret = 1; } @@ -883,22 +894,13 @@ free: * When releasing a hugetlb pool reservation, any surplus pages that were * allocated to satisfy the reservation must be explicitly freed if they were * never used. + * Called with hugetlb_lock held. */ static void return_unused_surplus_pages(struct hstate *h, unsigned long unused_resv_pages) { - static int nid = -1; - struct page *page; unsigned long nr_pages; - /* - * We want to release as many surplus pages as possible, spread - * evenly across all nodes. Iterate across all nodes until we - * can no longer free unreserved surplus pages. This occurs when - * the nodes with surplus pages have no free pages. 
- */ - unsigned long remaining_iterations = nr_online_nodes; - /* Uncommit the reservation */ h->resv_huge_pages -= unused_resv_pages; @@ -908,26 +910,17 @@ static void return_unused_surplus_pages( nr_pages = min(unused_resv_pages, h->surplus_huge_pages); - while (remaining_iterations-- && nr_pages) { - nid = next_node(nid, node_online_map); - if (nid == MAX_NUMNODES) - nid = first_node(node_online_map); - - if (!h->surplus_huge_pages_node[nid]) - continue; - - if (!list_empty(&h->hugepage_freelists[nid])) { - page = list_entry(h->hugepage_freelists[nid].next, - struct page, lru); - list_del(&page->lru); - update_and_free_page(h, page); - h->free_huge_pages--; - h->free_huge_pages_node[nid]--; - h->surplus_huge_pages--; - h->surplus_huge_pages_node[nid]--; - nr_pages--; - remaining_iterations = nr_online_nodes; - } + /* + * We want to release as many surplus pages as possible, spread + * evenly across all nodes. Iterate across all nodes until we + * can no longer free unreserved surplus pages. This occurs when + * the nodes with surplus pages have no free pages. + * free_pool_huge_page() will balance the the frees across the + * on-line nodes for us and will handle the hstate accounting. + */ + while (nr_pages--) { + if (!free_pool_huge_page(h, 1)) + break; } } @@ -1267,7 +1260,7 @@ static unsigned long set_max_huge_pages( min_count = max(count, min_count); try_to_free_low(h, min_count); while (min_count < persistent_huge_pages(h)) { - if (!free_pool_huge_page(h)) + if (!free_pool_huge_page(h, 0)) break; } while (count < persistent_huge_pages(h)) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 3/3] Cleanup and update huge pages documentation 2009-06-29 21:52 [PATCH 0/3] Balance Freeing of Huge Pages across Nodes Lee Schermerhorn 2009-06-29 21:52 ` [PATCH 1/3] " Lee Schermerhorn 2009-06-29 21:52 ` [PATCH 2/3] Use free_pool_huge_page() to return unused surplus pages Lee Schermerhorn @ 2009-06-29 21:52 ` Lee Schermerhorn 2 siblings, 0 replies; 7+ messages in thread From: Lee Schermerhorn @ 2009-06-29 21:52 UTC (permalink / raw) To: linux-mm, linux-numa Cc: akpm, Mel Gorman, Nishanth Aravamudan, David Rientjes, Adam Litke, Andy Whitcroft, eric.whitney PATCH 3/3 cleanup and update huge pages documentation. Against: 25jun09 mmotm This patch attempts to clarify huge page administration and usage, and updates the doucmentation to mention the balancing of huge pages across nodes when allocating and freeing. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Documentation/vm/hugetlbpage.txt | 133 +++++++++++++++++++++++++-------------- 1 file changed, 87 insertions(+), 46 deletions(-) Index: linux-2.6.31-rc1-mmotm-090625-1549/Documentation/vm/hugetlbpage.txt =================================================================== --- linux-2.6.31-rc1-mmotm-090625-1549.orig/Documentation/vm/hugetlbpage.txt 2009-06-29 12:19:02.000000000 -0400 +++ linux-2.6.31-rc1-mmotm-090625-1549/Documentation/vm/hugetlbpage.txt 2009-06-29 17:29:48.000000000 -0400 @@ -18,13 +18,13 @@ First the Linux kernel needs to be built automatically when CONFIG_HUGETLBFS is selected) configuration options. -The kernel built with hugepage support should show the number of configured -hugepages in the system by running the "cat /proc/meminfo" command. +The kernel built with huge page support should show the number of configured +huge pages in the system by running the "cat /proc/meminfo" command. /proc/meminfo also provides information about the total number of hugetlb pages configured in the kernel. It also displays information about the number of free hugetlb pages at any time. It also displays information about -the configured hugepage size - this is needed for generating the proper +the configured huge page size - this is needed for generating the proper alignment and size of the arguments to the above system calls. The output of "cat /proc/meminfo" will have lines like: @@ -37,25 +37,27 @@ HugePages_Surp: yyy Hugepagesize: zzz kB where: -HugePages_Total is the size of the pool of hugepages. -HugePages_Free is the number of hugepages in the pool that are not yet -allocated. -HugePages_Rsvd is short for "reserved," and is the number of hugepages -for which a commitment to allocate from the pool has been made, but no -allocation has yet been made. It's vaguely analogous to overcommit. -HugePages_Surp is short for "surplus," and is the number of hugepages in -the pool above the value in /proc/sys/vm/nr_hugepages. The maximum -number of surplus hugepages is controlled by -/proc/sys/vm/nr_overcommit_hugepages. +HugePages_Total is the size of the pool of huge pages. +HugePages_Free is the number of huge pages in the pool that are not yet + allocated. +HugePages_Rsvd is short for "reserved," and is the number of huge pages for + which a commitment to allocate from the pool has been made, + but no allocation has yet been made. Reserved huge pages + guarantee that an application will be able to allocate a + huge page from the pool of huge pages at fault time. +HugePages_Surp is short for "surplus," and is the number of huge pages in + the pool above the value in /proc/sys/vm/nr_hugepages. 
The + maximum number of surplus huge pages is controlled by + /proc/sys/vm/nr_overcommit_hugepages. /proc/filesystems should also show a filesystem of type "hugetlbfs" configured in the kernel. /proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb pages in the kernel. Super user can dynamically request more (or free some -pre-configured) hugepages. +pre-configured) huge pages. The allocation (or deallocation) of hugetlb pages is possible only if there are -enough physically contiguous free pages in system (freeing of hugepages is +enough physically contiguous free pages in system (freeing of huge pages is possible only if there are enough hugetlb pages free that can be transferred back to regular memory pool). @@ -67,43 +69,82 @@ use either the mmap system call or share the huge pages. It is required that the system administrator preallocate enough memory for huge page purposes. -Use the following command to dynamically allocate/deallocate hugepages: +The administrator can preallocate huge pages on the kernel boot command line by +specifying the "hugepages=N" parameter, where 'N' = the number of huge pages +requested. This is the most reliable method for preallocating huge pages as +memory has not yet become fragmented. + +Some platforms support multiple huge page sizes. To preallocate huge pages +of a specific size, one must preceed the huge pages boot command parameters +with a huge page size selection parameter "hugepagesz=<size>". <size> must +be specified in bytes with optional scale suffix [kKmMgG]. The default huge +page size may be selected with the "default_hugepagesz=<size>" boot parameter. + +/proc/sys/vm/nr_hugepages indicates the current number of configured [default +size] hugetlb pages in the kernel. Super user can dynamically request more +(or free some pre-configured) huge pages. + +Use the following command to dynamically allocate/deallocate default sized +huge pages: echo 20 > /proc/sys/vm/nr_hugepages -This command will try to configure 20 hugepages in the system. The success -or failure of allocation depends on the amount of physically contiguous -memory that is preset in system at this time. System administrators may want -to put this command in one of the local rc init files. This will enable the -kernel to request huge pages early in the boot process (when the possibility -of getting physical contiguous pages is still very high). In either -case, administrators will want to verify the number of hugepages actually -allocated by checking the sysctl or meminfo. - -/proc/sys/vm/nr_overcommit_hugepages indicates how large the pool of -hugepages can grow, if more hugepages than /proc/sys/vm/nr_hugepages are -requested by applications. echo'ing any non-zero value into this file -indicates that the hugetlb subsystem is allowed to try to obtain -hugepages from the buddy allocator, if the normal pool is exhausted. As -these surplus hugepages go out of use, they are freed back to the buddy +This command will try to configure 20 default sized huge pages in the system. +On a NUMA platform, the kernel will attempt to distribute the huge page pool +over the all on-line nodes. These huge pages, allocated when nr_hugepages +is increased, are called "persistent huge pages". + +The success or failure of huge page allocation depends on the amount of +physically contiguous memory that is preset in system at the time of the +allocation attempt. 
If the kernel is unable to allocate huge pages from +some nodes in a NUMA system, it will attempt to make up the difference by +allocating extra pages on other nodes with sufficient available contiguous +memory, if any. + +System administrators may want to put this command in one of the local rc init +files. This will enable the kernel to request huge pages early in the boot +process when the possibility of getting physically contiguous pages is still +very high. Administrators can verify the number of huge pages actually +allocated by checking the sysctl or meminfo. To check the per node +distribution of huge pages in a NUMA system, use: + + cat /sys/devices/system/node/node*/meminfo | fgrep Huge + +/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of +huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are +requested by applications. Writing any non-zero value into this file +indicates that the hugetlb subsystem is allowed to try to obtain "surplus" +huge pages from the buddy allocator, when the normal pool is exhausted. As +these surplus huge pages go out of use, they are freed back to the buddy allocator. +When increasing the huge page pool size via nr_hugepages, any surplus +pages will first be promoted to persistent huge pages. Then, additional +huge pages will be allocated, if necessary and if possible, to fulfill +the new huge page pool size. + +The administrator may shrink the pool of preallocated huge pages for +the default huge page size by setting the nr_hugepages sysctl to a +smaller value. The kernel will attempt to balance the freeing of huge pages +across all on-line nodes. Any free huge pages on the selected nodes will +be freed back to the buddy allocator. + Caveat: Shrinking the pool via nr_hugepages such that it becomes less -than the number of hugepages in use will convert the balance to surplus +than the number of huge pages in use will convert the balance to surplus huge pages even if it would exceed the overcommit value. As long as this condition holds, however, no more surplus huge pages will be allowed on the system until one of the two sysctls are increased sufficiently, or the surplus huge pages go out of use and are freed. -With support for multiple hugepage pools at run-time available, much of -the hugepage userspace interface has been duplicated in sysfs. The above -information applies to the default hugepage size (which will be -controlled by the proc interfaces for backwards compatibility). The root -hugepage control directory is +With support for multiple huge page pools at run-time available, much of +the huge page userspace interface has been duplicated in sysfs. The above +information applies to the default huge page size which will be +controlled by the /proc interfaces for backwards compatibility. The root +huge page control directory in sysfs is: /sys/kernel/mm/hugepages -For each hugepage size supported by the running kernel, a subdirectory +For each huge page size supported by the running kernel, a subdirectory will exist, of the form hugepages-${size}kB @@ -116,9 +157,9 @@ Inside each of these directories, the sa resv_hugepages surplus_hugepages -which function as described above for the default hugepage-sized case. 
-If the user applications are going to request hugepages using mmap system +If the user applications are going to request huge pages using mmap system call, then it is required that system administrator mount a file system of type hugetlbfs: @@ -127,7 +168,7 @@ type hugetlbfs: none /mnt/huge This command mounts a (pseudo) filesystem of type hugetlbfs on the directory -/mnt/huge. Any files created on /mnt/huge uses hugepages. The uid and gid +/mnt/huge. Any files created on /mnt/huge uses huge pages. The uid and gid options sets the owner and group of the root of the file system. By default the uid and gid of the current process are taken. The mode option sets the mode of root of file system to value & 0777. This value is given in octal. @@ -156,14 +197,14 @@ mount of filesystem will be required for ******************************************************************* /* - * Example of using hugepage memory in a user application using Sys V shared + * Example of using huge page memory in a user application using Sys V shared * memory system calls. In this example the app is requesting 256MB of * memory that is backed by huge pages. The application uses the flag * SHM_HUGETLB in the shmget system call to inform the kernel that it is - * requesting hugepages. + * requesting huge pages. * * For the ia64 architecture, the Linux kernel reserves Region number 4 for - * hugepages. That means the addresses starting with 0x800000... will need + * huge pages. That means the addresses starting with 0x800000... will need * to be specified. Specifying a fixed address is not required on ppc64, * i386 or x86_64. * @@ -252,14 +293,14 @@ int main(void) ******************************************************************* /* - * Example of using hugepage memory in a user application using the mmap + * Example of using huge page memory in a user application using the mmap * system call. Before running this application, make sure that the * administrator has mounted the hugetlbfs filesystem (on some directory * like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this * example, the app is requesting memory of size 256MB that is backed by * huge pages. * - * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages. + * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages. * That means the addresses starting with 0x800000... will need to be * specified. Specifying a fixed address is not required on ppc64, i386 * or x86_64. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
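The SHM_HUGETLB example referred to in the hunks above is not reproduced in this diff. For readers following along, here is a minimal, independent sketch of that style of usage, not the code from the patch. It assumes the pool already holds enough default sized huge pages (e.g. via nr_hugepages), that the process is permitted to use huge page shared memory (root, or a member of the group configured in /proc/sys/vm/hugetlb_shm_group), and that no fixed mapping address is needed (x86_64, i386, ppc64). The 256MB length, IPC_PRIVATE key and 0600 mode are illustrative choices.

/*
 * Minimal sketch (not from the patch): a 256MB SysV shared memory segment
 * backed by huge pages, requested with the SHM_HUGETLB flag.
 * The segment size must be a multiple of the huge page size.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000	/* kernel ABI value; older libc headers may not define it */
#endif

#define LENGTH (256UL * 1024 * 1024)	/* illustrative size, multiple of the huge page size */

int main(void)
{
	int shmid;
	char *shmaddr;

	/* IPC_PRIVATE key and 0600 mode are illustrative, not from the patch */
	shmid = shmget(IPC_PRIVATE, LENGTH, SHM_HUGETLB | IPC_CREAT | 0600);
	if (shmid < 0) {
		perror("shmget");
		exit(1);
	}

	shmaddr = shmat(shmid, NULL, 0);
	if (shmaddr == (char *)-1) {
		perror("shmat");
		shmctl(shmid, IPC_RMID, NULL);
		exit(1);
	}

	memset(shmaddr, 0, LENGTH);	/* touching the memory faults in the huge pages */

	shmdt(shmaddr);
	shmctl(shmid, IPC_RMID, NULL);	/* mark the segment for removal */
	return 0;
}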
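Similarly, a hedged sketch of the hugetlbfs mmap usage the second example describes. It assumes the administrator has already mounted hugetlbfs (e.g. mount -t hugetlbfs none /mnt/huge) and that enough huge pages are free; the path /mnt/huge/example and the 256MB length are made-up values.

/*
 * Minimal sketch (not from the patch): map a file on a hugetlbfs mount to
 * obtain huge-page-backed memory.
 * The mapping length must be a multiple of the huge page size.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define FILE_NAME "/mnt/huge/example"	/* hypothetical file on a hugetlbfs mount */
#define LENGTH    (256UL * 1024 * 1024)

int main(void)
{
	int fd;
	void *addr;

	fd = open(FILE_NAME, O_CREAT | O_RDWR, 0755);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		close(fd);
		unlink(FILE_NAME);
		exit(1);
	}

	memset(addr, 0, LENGTH);	/* fault in the huge pages */

	munmap(addr, LENGTH);
	close(fd);
	unlink(FILE_NAME);	/* removing the file releases the huge pages */
	return 0;
}

If either sketch fails at allocation time, check HugePages_Free in /proc/meminfo and the per-node meminfo files mentioned in the patch above.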
end of thread, other threads:[~2009-06-30 13:56 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-06-29 21:52 [PATCH 0/3] Balance Freeing of Huge Pages across Nodes Lee Schermerhorn 2009-06-29 21:52 ` [PATCH 1/3] " Lee Schermerhorn 2009-06-30 13:05 ` Mel Gorman 2009-06-30 13:48 ` Lee Schermerhorn 2009-06-30 13:58 ` Mel Gorman 2009-06-29 21:52 ` [PATCH 2/3] Use free_pool_huge_page() to return unused surplus pages Lee Schermerhorn 2009-06-29 21:52 ` [PATCH 3/3] Cleanup and update huge pages documentation Lee Schermerhorn
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox