From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e2.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id lAFKBHOb009887 for ; Thu, 15 Nov 2007 15:11:17 -0500 Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.6) with ESMTP id lAFKBHCA119438 for ; Thu, 15 Nov 2007 15:11:17 -0500 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id lAFKBGn6014523 for ; Thu, 15 Nov 2007 15:11:17 -0500 Date: Thu, 15 Nov 2007 12:10:53 -0800 From: Nishanth Aravamudan Subject: [PATCH] hugetlb: retry pool allocation attempts Message-ID: <20071115201053.GA21245@us.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-linux-mm@kvack.org Return-Path: To: wli@holomorphy.com Cc: kenchen@google.com, david@gibson.dropbear.id.au, linux-mm@kvack.org List-ID: Currently, successive attempts to allocate the hugepage pool via the sysctl can result in the following sort of behavior (assume each attempt is trying to grow the pool by 100 hugepages, starting with 100 hugepages in the pool, on x86_64): Attempt 1: 200 hugepages Attempt 2: 300 hugepages ... Attempt 33: 3400 hugepages Attempt 34: 3438 hugepages Attempt 35: 3438 hugepages Attempt 36: 3438 hugepages Attempt 37: 3439 hugepages Attempt 38: 3440 hugepages Attempt 39: 3441 hugepages Attempt 40: 3441 hugepages Attempt 41: 3442 hugepages ... I think, in an ideal world, we would not have a situation where the hugepage pool grows on an attempt after a previous attempt has failed (we should have freed up sufficient memory earlier). We also wouldn't get successive single-page allocations, but would have a single larger-size allocation. There are two reasons this doesn't happen currently: a) hugetlb pool allocation calls do not specify __GFP_REPEAT to ask the VM to retry the allocations (invoking reclaim to help the requests succeed). b) __alloc_pages() does not currently retry allocations for order > PAGE_ALLOC_COSTLY_ORDER. Modify __alloc_pages() to retry GFP_REPEAT COSTLY_ORDER allocations up to COSTLY_ORDER_RETRY_ATTEMPTS times, which I've set to 5, and use GFP_REPEAT in the hugetlb pool allocation. 5 seems to give reasonable results for x86, x86_64 and ppc64, but I'm not sure how to come up with the "best" number here (suggestions are welcome!). With this patch applied, the same box that gave the above results now gives: Attempt 1: 200 hugepages Attempt 2: 300 hugepages ... Attempt 33: 3400 hugepages Attempt 34: 3438 hugepages Attempt 35: 3442 hugepages Attempt 36: 3443 hugepages Attempt 37: 3443 hugepages Attempt 38: 3443 hugepages Attempt 39: 3443 hugepages Attempt 40: 3443 hugepages Attempt 41: 3444 hugepages ... While the patch makes things better (we get more hugepages sooner), but we still get an allocation success (of one hugepage) after getting a few failures in a row. But, even with 10 retry attempts, I got similar results. Determining the perfect number, I expect, would require know the current/future I/O characteristics and current/future system activity -- in lieu of this prescience, this heuristic does seem to improve things and does not require userspace applications to implement their own retry logic (or, more accurately, makes those userspace retries more effective). Signed-off-by: Nishanth Aravamudan diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4c4522a..c4e36ba 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -33,6 +33,12 @@ * will not. */ #define PAGE_ALLOC_COSTLY_ORDER 3 +/* + * COSTLY_ORDER_RETRY_ATTEMPTS is the number of retry attempts for + * allocations above PAGE_ALLOC_COSTLY_ORDER with __GFP_REPEAT + * specified. + */ +#define COSTLY_ORDER_RETRY_ATTEMPTS 5 #define MIGRATE_UNMOVABLE 0 #define MIGRATE_RECLAIMABLE 1 diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 8b809ec..3d2d092 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -171,7 +171,7 @@ static struct page *alloc_fresh_huge_page_node(int nid) struct page *page; page = alloc_pages_node(nid, - htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN, + htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_REPEAT|__GFP_NOWARN, HUGETLB_PAGE_ORDER); if (page) { set_compound_page_dtor(page, free_huge_page); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index da69d83..931fb46 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1470,7 +1470,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, struct page *page; struct reclaim_state reclaim_state; struct task_struct *p = current; - int do_retry; + int do_retry_attempts = 0; int alloc_flags; int did_some_progress; @@ -1622,16 +1622,24 @@ nofail_alloc: * * In this implementation, __GFP_REPEAT means __GFP_NOFAIL for order * <= 3, but that may not be true in other implementations. + * + * For order > 3, __GFP_REPEAT means try to reclaim memory 5 + * times, but that may not be true in other implementations. */ - do_retry = 0; if (!(gfp_mask & __GFP_NORETRY)) { - if ((order <= PAGE_ALLOC_COSTLY_ORDER) || - (gfp_mask & __GFP_REPEAT)) - do_retry = 1; + if (gfp_mask & __GFP_REPEAT) { + if (order <= PAGE_ALLOC_COSTLY_ORDER) { + do_retry_attempts = 1; + } else { + if (do_retry_attempts > COSTLY_ORDER_RETRY_ATTEMPTS) + goto nopage; + do_retry_attempts += 1; + } + } if (gfp_mask & __GFP_NOFAIL) - do_retry = 1; + do_retry_attempts = 1; } - if (do_retry) { + if (do_retry_attempts) { congestion_wait(WRITE, HZ/50); goto rebalance; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org