From: Mel Gorman <mel@csn.ul.ie>
To: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: linux-mm@kvack.org, linux-numa@vger.org,
akpm@linux-foundation.org, Nishanth Aravamudan <nacc@us.ibm.com>,
David Rientjes <rientjes@google.com>, Adam Litke <agl@us.ibm.com>,
Andy Whitcroft <apw@canonical.com>,
eric.whitney@hp.com
Subject: Re: [PATCH 1/3] Balance Freeing of Huge Pages across Nodes
Date: Tue, 30 Jun 2009 14:05:16 +0100
Message-ID: <20090630130515.GD17561@csn.ul.ie>
In-Reply-To: <20090629215234.20038.62303.sendpatchset@lts-notebook>
On Mon, Jun 29, 2009 at 05:52:34PM -0400, Lee Schermerhorn wrote:
> [PATCH 1/3] Balance Freeing of Huge Pages across Nodes
>
> Against: 25jun09 mmotm
>
> Free huge pages from nodes in round-robin fashion in an
> attempt to keep [persistent a.k.a. static] huge pages balanced
> across nodes.
>
> New function free_pool_huge_page() is modeled on and
> performs roughly the inverse of alloc_fresh_huge_page().
> It replaces dequeue_huge_page(), which now has no callers,
> so this patch removes it.
>
> Helper function hstate_next_node_to_free() uses new hstate
> member next_nid_to_free to distribute "frees" across all
> nodes with huge pages.
>
> V2:
>
> At Mel Gorman's suggestion: renamed hstate_next_node() to
> hstate_next_node_to_alloc() for symmetry. Also, renamed
> hstate member hugetlb_next_nid to next_nid_to_alloc.
> ["hugetlb" is implicit in the hstate struct, I think].
>
> New in this version:
>
> Modified adjust_pool_surplus() to use hstate_next_node_to_alloc()
> and hstate_next_node_to_free() to advance node id for adjusting
> surplus huge page count, as this is equivalent to allocating and
> freeing persistent huge pages. [Can't blame Mel for this part.]
>
> V3:
>
> Minor cleanup: rename 'nid' to 'next_nid' in free_pool_huge_page() to
> better match alloc_fresh_huge_page() conventions.
>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> include/linux/hugetlb.h | 3 -
> mm/hugetlb.c | 132 +++++++++++++++++++++++++++++++-----------------
> 2 files changed, 88 insertions(+), 47 deletions(-)
>
> Index: linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.31-rc1-mmotm-090625-1549.orig/include/linux/hugetlb.h 2009-06-29 10:21:12.000000000 -0400
> +++ linux-2.6.31-rc1-mmotm-090625-1549/include/linux/hugetlb.h 2009-06-29 10:27:18.000000000 -0400
> @@ -183,7 +183,8 @@ unsigned long hugetlb_get_unmapped_area(
> #define HSTATE_NAME_LEN 32
> /* Defines one hugetlb page size */
> struct hstate {
> - int hugetlb_next_nid;
> + int next_nid_to_alloc;
> + int next_nid_to_free;
> unsigned int order;
> unsigned long mask;
> unsigned long max_huge_pages;
> Index: linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-rc1-mmotm-090625-1549.orig/mm/hugetlb.c 2009-06-29 10:21:12.000000000 -0400
> +++ linux-2.6.31-rc1-mmotm-090625-1549/mm/hugetlb.c 2009-06-29 15:53:55.000000000 -0400
> @@ -455,24 +455,6 @@ static void enqueue_huge_page(struct hst
> h->free_huge_pages_node[nid]++;
> }
>
> -static struct page *dequeue_huge_page(struct hstate *h)
> -{
> - int nid;
> - struct page *page = NULL;
> -
> - for (nid = 0; nid < MAX_NUMNODES; ++nid) {
> - if (!list_empty(&h->hugepage_freelists[nid])) {
> - page = list_entry(h->hugepage_freelists[nid].next,
> - struct page, lru);
> - list_del(&page->lru);
> - h->free_huge_pages--;
> - h->free_huge_pages_node[nid]--;
> - break;
> - }
> - }
> - return page;
> -}
> -
> static struct page *dequeue_huge_page_vma(struct hstate *h,
> struct vm_area_struct *vma,
> unsigned long address, int avoid_reserve)
> @@ -640,7 +622,7 @@ static struct page *alloc_fresh_huge_pag
>
> /*
> * Use a helper variable to find the next node and then
> - * copy it back to hugetlb_next_nid afterwards:
> + * copy it back to next_nid_to_alloc afterwards:
> * otherwise there's a window in which a racer might
> * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
> * But we don't need to use a spin_lock here: it really
> @@ -649,13 +631,13 @@ static struct page *alloc_fresh_huge_pag
> * if we just successfully allocated a hugepage so that
> * the next caller gets hugepages on the next node.
> */
> -static int hstate_next_node(struct hstate *h)
> +static int hstate_next_node_to_alloc(struct hstate *h)
> {
> int next_nid;
> - next_nid = next_node(h->hugetlb_next_nid, node_online_map);
> + next_nid = next_node(h->next_nid_to_alloc, node_online_map);
> if (next_nid == MAX_NUMNODES)
> next_nid = first_node(node_online_map);
> - h->hugetlb_next_nid = next_nid;
> + h->next_nid_to_alloc = next_nid;
> return next_nid;
> }
>
Strictly speaking, next_nid_to_alloc looks more like last_nid_alloced, but I
don't think it makes an important difference. Implementing it this way is
shorter and automatically ensures next_nid is an online node.

If you wanted to be pedantic, I think the following untested code would
make it really next_nid_to_alloc, but I don't think it's terribly
important.
static int hstate_next_node_to_alloc(struct hstate *h)
{
	int this_nid = h->next_nid_to_alloc;

	/* Check the node didn't get off-lined since */
	if (unlikely(!node_online(this_nid))) {
		this_nid = next_node(this_nid, node_online_map);
		if (this_nid == MAX_NUMNODES)
			this_nid = first_node(node_online_map);
		h->next_nid_to_alloc = this_nid;
	}

	/* Advance the hint so the next caller uses the next online node */
	h->next_nid_to_alloc = next_node(this_nid, node_online_map);
	if (h->next_nid_to_alloc == MAX_NUMNODES)
		h->next_nid_to_alloc = first_node(node_online_map);

	return this_nid;
}
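
To make the difference concrete, here is a toy userspace analogue (purely
illustrative, not kernel code: the "online" bitmask and the next_online()
helper are stand-ins I'm using for node_online_map and
next_node()/first_node()). The patch's helper advances the hint and returns
the new node, so it only hands out the starting node after a full wrap; the
pedantic version returns the current node and then advances:

#include <stdio.h>

#define MAX_NODES 8

static unsigned int online = 0x07;	/* nodes 0, 1 and 2 "online" */

/* First online node at or after nid, wrapping past MAX_NODES */
static int next_online(int nid)
{
	int i;

	for (i = nid; i < MAX_NODES; i++)
		if (online & (1 << i))
			return i;
	for (i = 0; i < nid; i++)
		if (online & (1 << i))
			return i;
	return -1;
}

static int state;	/* stands in for h->next_nid_to_alloc */

/* What the patch does: advance first, return the *new* node */
static int advance_then_return(void)
{
	state = next_online(state + 1);
	return state;
}

/* The pedantic version: return the current node, then advance */
static int return_then_advance(void)
{
	int this = state;

	state = next_online(state + 1);
	return this;
}

int main(void)
{
	int i;

	for (i = 0; i < 5; i++)
		printf("%d ", advance_then_return());	/* 1 2 0 1 2 */
	printf("\n");

	state = 0;
	for (i = 0; i < 5; i++)
		printf("%d ", return_then_advance());	/* 0 1 2 0 1 */
	printf("\n");
	return 0;
}

Run, the first loop prints "1 2 0 1 2" and the second "0 1 2 0 1". In the
kernel it washes out because alloc_fresh_huge_page() reads
h->next_nid_to_alloc directly and allocates on it before calling the helper,
which is why I don't think the distinction is terribly important.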
> @@ -666,14 +648,15 @@ static int alloc_fresh_huge_page(struct
> int next_nid;
> int ret = 0;
>
> - start_nid = h->hugetlb_next_nid;
> + start_nid = h->next_nid_to_alloc;
> + next_nid = start_nid;
>
> do {
> - page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
> + page = alloc_fresh_huge_page_node(h, next_nid);
> if (page)
> ret = 1;
> - next_nid = hstate_next_node(h);
> - } while (!page && h->hugetlb_next_nid != start_nid);
> + next_nid = hstate_next_node_to_alloc(h);
> + } while (!page && next_nid != start_nid);
>
> if (ret)
> count_vm_event(HTLB_BUDDY_PGALLOC);
> @@ -683,6 +666,52 @@ static int alloc_fresh_huge_page(struct
> return ret;
> }
>
> +/*
> + * helper for free_pool_huge_page() - find next node
> + * from which to free a huge page
> + */
> +static int hstate_next_node_to_free(struct hstate *h)
> +{
> + int next_nid;
> + next_nid = next_node(h->next_nid_to_free, node_online_map);
> + if (next_nid == MAX_NUMNODES)
> + next_nid = first_node(node_online_map);
> + h->next_nid_to_free = next_nid;
> + return next_nid;
> +}
> +
> +/*
> + * Free huge page from pool from next node to free.
> + * Attempt to keep persistent huge pages more or less
> + * balanced over allowed nodes.
> + * Called with hugetlb_lock locked.
> + */
> +static int free_pool_huge_page(struct hstate *h)
> +{
> + int start_nid;
> + int next_nid;
> + int ret = 0;
> +
> + start_nid = h->next_nid_to_free;
> + next_nid = start_nid;
> +
> + do {
> + if (!list_empty(&h->hugepage_freelists[next_nid])) {
> + struct page *page =
> + list_entry(h->hugepage_freelists[next_nid].next,
> + struct page, lru);
> + list_del(&page->lru);
> + h->free_huge_pages--;
> + h->free_huge_pages_node[next_nid]--;
> + update_and_free_page(h, page);
> + ret = 1;
> + }
> + next_nid = hstate_next_node_to_free(h);
> + } while (!ret && next_nid != start_nid);
> +
> + return ret;
> +}
> +
> static struct page *alloc_buddy_huge_page(struct hstate *h,
> struct vm_area_struct *vma, unsigned long address)
> {
> @@ -1007,7 +1036,7 @@ int __weak alloc_bootmem_huge_page(struc
> void *addr;
>
> addr = __alloc_bootmem_node_nopanic(
> - NODE_DATA(h->hugetlb_next_nid),
> + NODE_DATA(h->next_nid_to_alloc),
> huge_page_size(h), huge_page_size(h), 0);
>
> if (addr) {
> @@ -1019,7 +1048,7 @@ int __weak alloc_bootmem_huge_page(struc
> m = addr;
> goto found;
> }
> - hstate_next_node(h);
> + hstate_next_node_to_alloc(h);
> nr_nodes--;
> }
> return 0;
> @@ -1140,31 +1169,43 @@ static inline void try_to_free_low(struc
> */
> static int adjust_pool_surplus(struct hstate *h, int delta)
> {
> - static int prev_nid;
> - int nid = prev_nid;
> + int start_nid, next_nid;
> int ret = 0;
>
> VM_BUG_ON(delta != -1 && delta != 1);
> - do {
> - nid = next_node(nid, node_online_map);
> - if (nid == MAX_NUMNODES)
> - nid = first_node(node_online_map);
>
> - /* To shrink on this node, there must be a surplus page */
> - if (delta < 0 && !h->surplus_huge_pages_node[nid])
> - continue;
> - /* Surplus cannot exceed the total number of pages */
> - if (delta > 0 && h->surplus_huge_pages_node[nid] >=
> + if (delta < 0)
> + start_nid = h->next_nid_to_alloc;
> + else
> + start_nid = h->next_nid_to_free;
> + next_nid = start_nid;
> +
> + do {
> + int nid = next_nid;
> + if (delta < 0) {
> + next_nid = hstate_next_node_to_alloc(h);
> + /*
> + * To shrink on this node, there must be a surplus page
> + */
> + if (!h->surplus_huge_pages_node[nid])
> + continue;
> + }
> + if (delta > 0) {
> + next_nid = hstate_next_node_to_free(h);
> + /*
> + * Surplus cannot exceed the total number of pages
> + */
> + if (h->surplus_huge_pages_node[nid] >=
> h->nr_huge_pages_node[nid])
> - continue;
> + continue;
> + }
>
> h->surplus_huge_pages += delta;
> h->surplus_huge_pages_node[nid] += delta;
> ret = 1;
> break;
> - } while (nid != prev_nid);
> + } while (next_nid != start_nid);
>
> - prev_nid = nid;
> return ret;
> }
>
> @@ -1226,10 +1267,8 @@ static unsigned long set_max_huge_pages(
> min_count = max(count, min_count);
> try_to_free_low(h, min_count);
> while (min_count < persistent_huge_pages(h)) {
> - struct page *page = dequeue_huge_page(h);
> - if (!page)
> + if (!free_pool_huge_page(h))
> break;
> - update_and_free_page(h, page);
> }
> while (count < persistent_huge_pages(h)) {
> if (!adjust_pool_surplus(h, 1))
> @@ -1441,7 +1480,8 @@ void __init hugetlb_add_hstate(unsigned
> h->free_huge_pages = 0;
> for (i = 0; i < MAX_NUMNODES; ++i)
> INIT_LIST_HEAD(&h->hugepage_freelists[i]);
> - h->hugetlb_next_nid = first_node(node_online_map);
> + h->next_nid_to_alloc = first_node(node_online_map);
> + h->next_nid_to_free = first_node(node_online_map);
> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> huge_page_size(h)/1024);
>
Nothing problematic jumps out at me, even with hstate_next_node_to_alloc()
as it is.
Acked-by: Mel Gorman <mel@csn.ul.ie>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>