* [PATCH] hugetlb: retry pool allocation attempts
From: Nishanth Aravamudan @ 2007-11-15 20:10 UTC
To: wli; +Cc: kenchen, david, linux-mm
Currently, successive attempts to allocate the hugepage pool via the
sysctl can result in the following sort of behavior (assume each attempt
is trying to grow the pool by 100 hugepages, starting with 100 hugepages
in the pool, on x86_64):
Attempt 1: 200 hugepages
Attempt 2: 300 hugepages
...
Attempt 33: 3400 hugepages
Attempt 34: 3438 hugepages
Attempt 35: 3438 hugepages
Attempt 36: 3438 hugepages
Attempt 37: 3439 hugepages
Attempt 38: 3440 hugepages
Attempt 39: 3441 hugepages
Attempt 40: 3441 hugepages
Attempt 41: 3442 hugepages
...
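For reference, the numbers above come from a userspace loop that
repeatedly writes an increased value into /proc/sys/vm/nr_hugepages and
reads back the resulting pool size. The actual test was a shell script
that is not reproduced here; a minimal C sketch of that kind of loop:

/*
 * Hypothetical reproduction of the test above: grow the hugepage pool
 * by 100 pages per attempt and report what the kernel actually gave
 * us. Must run as root.
 */
#include <stdio.h>

static long read_nr_hugepages(void)
{
	long nr = -1;
	FILE *f = fopen("/proc/sys/vm/nr_hugepages", "r");

	if (f) {
		if (fscanf(f, "%ld", &nr) != 1)
			nr = -1;
		fclose(f);
	}
	return nr;
}

int main(void)
{
	int attempt;

	for (attempt = 1; attempt <= 50; attempt++) {
		long nr = read_nr_hugepages();
		FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");

		if (nr < 0 || !f)
			return 1;
		fprintf(f, "%ld\n", nr + 100);	/* ask for 100 more */
		fclose(f);
		printf("Attempt %d: %ld hugepages\n", attempt,
		       read_nr_hugepages());
	}
	return 0;
}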
I think, in an ideal world, we would not have a situation where the
hugepage pool grows on an attempt after a previous attempt has failed
(we should have freed up sufficient memory earlier). We also wouldn't
get successive single-page allocations, but would have a single
larger-size allocation. There are two reasons this doesn't happen
currently:
a) hugetlb pool allocation calls do not specify __GFP_REPEAT to ask the
VM to retry the allocations (invoking reclaim to help the requests
succeed).
b) __alloc_pages() does not currently retry allocations for order >
PAGE_ALLOC_COSTLY_ORDER.
Modify __alloc_pages() to retry __GFP_REPEAT allocations above
PAGE_ALLOC_COSTLY_ORDER up to COSTLY_ORDER_RETRY_ATTEMPTS times, which
I've set to 5, and use __GFP_REPEAT in the hugetlb pool allocation. 5
seems to give reasonable
results for x86, x86_64 and ppc64, but I'm not sure how to come up with
the "best" number here (suggestions are welcome!). With this patch
applied, the same box that gave the above results now gives:
Attempt 1: 200 hugepages
Attempt 2: 300 hugepages
...
Attempt 33: 3400 hugepages
Attempt 34: 3438 hugepages
Attempt 35: 3442 hugepages
Attempt 36: 3443 hugepages
Attempt 37: 3443 hugepages
Attempt 38: 3443 hugepages
Attempt 39: 3443 hugepages
Attempt 40: 3443 hugepages
Attempt 41: 3444 hugepages
...
While the patch makes things better (we get more hugepages sooner), we
still get an allocation success (of one hugepage) after getting a few
failures in a row. Even with 10 retry attempts, I got similar
results. Determining the perfect number, I expect, would require knowing
the current/future I/O characteristics and current/future system
activity -- in lieu of this prescience, this heuristic does seem to
improve things and does not require userspace applications to implement
their own retry logic (or, more accurately, makes those userspace
retries more effective).
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4c4522a..c4e36ba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -33,6 +33,12 @@
* will not.
*/
#define PAGE_ALLOC_COSTLY_ORDER 3
+/*
+ * COSTLY_ORDER_RETRY_ATTEMPTS is the number of retry attempts for
+ * allocations above PAGE_ALLOC_COSTLY_ORDER with __GFP_REPEAT
+ * specified.
+ */
+#define COSTLY_ORDER_RETRY_ATTEMPTS 5
#define MIGRATE_UNMOVABLE 0
#define MIGRATE_RECLAIMABLE 1
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8b809ec..3d2d092 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -171,7 +171,7 @@ static struct page *alloc_fresh_huge_page_node(int nid)
struct page *page;
page = alloc_pages_node(nid,
- htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
+ htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_REPEAT|__GFP_NOWARN,
HUGETLB_PAGE_ORDER);
if (page) {
set_compound_page_dtor(page, free_huge_page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index da69d83..931fb46 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1470,7 +1470,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
struct page *page;
struct reclaim_state reclaim_state;
struct task_struct *p = current;
- int do_retry;
+ int do_retry_attempts = 0;
int alloc_flags;
int did_some_progress;
@@ -1622,16 +1622,24 @@ nofail_alloc:
*
* In this implementation, __GFP_REPEAT means __GFP_NOFAIL for order
* <= 3, but that may not be true in other implementations.
+ *
+ * For order > 3, __GFP_REPEAT means try to reclaim memory 5
+ * times, but that may not be true in other implementations.
*/
- do_retry = 0;
if (!(gfp_mask & __GFP_NORETRY)) {
- if ((order <= PAGE_ALLOC_COSTLY_ORDER) ||
- (gfp_mask & __GFP_REPEAT))
- do_retry = 1;
+ if (gfp_mask & __GFP_REPEAT) {
+ if (order <= PAGE_ALLOC_COSTLY_ORDER) {
+ do_retry_attempts = 1;
+ } else {
+ if (do_retry_attempts > COSTLY_ORDER_RETRY_ATTEMPTS)
+ goto nopage;
+ do_retry_attempts += 1;
+ }
+ }
if (gfp_mask & __GFP_NOFAIL)
- do_retry = 1;
+ do_retry_attempts = 1;
}
- if (do_retry) {
+ if (do_retry_attempts) {
congestion_wait(WRITE, HZ/50);
goto rebalance;
}
* [PATCH][UPDATED] hugetlb: retry pool allocation attempts
From: Nishanth Aravamudan @ 2007-11-15 20:18 UTC
To: wli; +Cc: kenchen, david, linux-mm
On 15.11.2007 [12:10:53 -0800], Nishanth Aravamudan wrote:
> Currently, successive attempts to allocate the hugepage pool via the
> sysctl can result in the following sort of behavior (assume each attempt
> is trying to grow the pool by 100 hugepages, starting with 100 hugepages
> in the pool, on x86_64):
Sigh, same patch, fixed up a few checkpatch issues with long lines.
Sorry for the noise.
hugetlb: retry pool allocation attempts
Currently, successive attempts to allocate the hugepage pool via the
sysctl can result in the following sort of behavior (assume each attempt
is trying to grow the pool by 100 hugepages, starting with 100 hugepages
in the pool, on x86_64):
Attempt 1: 200 hugepages
Attempt 2: 300 hugepages
...
Attempt 33: 3400 hugepages
Attempt 34: 3438 hugepages
Attempt 35: 3438 hugepages
Attempt 36: 3438 hugepages
Attempt 37: 3439 hugepages
Attempt 38: 3440 hugepages
Attempt 39: 3441 hugepages
Attempt 40: 3441 hugepages
Attempt 41: 3442 hugepages
...
I think, in an ideal world, we would not have a situation where the
hugepage pool grows on an attempt after a previous attempt has failed
(we should have freed up sufficient memory earlier). We also wouldn't
get successive single-page allocations, but would have a single
larger-size allocation. There are two reasons this doesn't happen
currently:
a) hugetlb pool allocation calls do not specify __GFP_REPEAT to ask the
VM to retry the allocations (invoking reclaim to help the requests
succeed).
b) __alloc_pages() does not currently retry allocations for order >
PAGE_ALLOC_COSTLY_ORDER.
Modify __alloc_pages() to retry __GFP_REPEAT allocations above
PAGE_ALLOC_COSTLY_ORDER up to COSTLY_ORDER_RETRY_ATTEMPTS times, which
I've set to 5, and use __GFP_REPEAT in the hugetlb pool allocation. 5
seems to give reasonable
results for x86, x86_64 and ppc64, but I'm not sure how to come up with
the "best" number here (suggestions are welcome!). With this patch
applied, the same box that gave the above results now gives:
Attempt 1: 200 hugepages
Attempt 2: 300 hugepages
...
Attempt 33: 3400 hugepages
Attempt 34: 3438 hugepages
Attempt 35: 3442 hugepages
Attempt 36: 3443 hugepages
Attempt 37: 3443 hugepages
Attempt 38: 3443 hugepages
Attempt 39: 3443 hugepages
Attempt 40: 3443 hugepages
Attempt 41: 3444 hugepages
...
While the patch makes things better (we get more hugepages sooner), we
still get an allocation success (of one hugepage) after getting a few
failures in a row. Even with 10 retry attempts, I got similar
results. Determining the perfect number, I expect, would require knowing
the current/future I/O characteristics and current/future system
activity -- in lieu of this prescience, this heuristic does seem to
improve things and does not require userspace applications to implement
their own retry logic (or, more accurately, makes those userspace
retries more effective).
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
$ git show HEAD | ./scripts/checkpatch.pl -
Your patch has no obvious style problems and is ready for submission.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4c4522a..c4e36ba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -33,6 +33,12 @@
* will not.
*/
#define PAGE_ALLOC_COSTLY_ORDER 3
+/*
+ * COSTLY_ORDER_RETRY_ATTEMPTS is the number of retry attempts for
+ * allocations above PAGE_ALLOC_COSTLY_ORDER with __GFP_REPEAT
+ * specified.
+ */
+#define COSTLY_ORDER_RETRY_ATTEMPTS 5
#define MIGRATE_UNMOVABLE 0
#define MIGRATE_RECLAIMABLE 1
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8b809ec..0eb2953 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -171,7 +171,8 @@ static struct page *alloc_fresh_huge_page_node(int nid)
struct page *page;
page = alloc_pages_node(nid,
- htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
+ htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
+ __GFP_REPEAT|__GFP_NOWARN,
HUGETLB_PAGE_ORDER);
if (page) {
set_compound_page_dtor(page, free_huge_page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index da69d83..3fcedda 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1470,7 +1470,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
struct page *page;
struct reclaim_state reclaim_state;
struct task_struct *p = current;
- int do_retry;
+ int do_retry_attempts = 0;
int alloc_flags;
int did_some_progress;
@@ -1622,16 +1622,25 @@ nofail_alloc:
*
* In this implementation, __GFP_REPEAT means __GFP_NOFAIL for order
* <= 3, but that may not be true in other implementations.
+ *
+ * For order > 3, __GFP_REPEAT means try to reclaim memory 5
+ * times, but that may not be true in other implementations.
*/
- do_retry = 0;
if (!(gfp_mask & __GFP_NORETRY)) {
- if ((order <= PAGE_ALLOC_COSTLY_ORDER) ||
- (gfp_mask & __GFP_REPEAT))
- do_retry = 1;
+ if (gfp_mask & __GFP_REPEAT) {
+ if (order <= PAGE_ALLOC_COSTLY_ORDER) {
+ do_retry_attempts = 1;
+ } else {
+ if (do_retry_attempts >
+ COSTLY_ORDER_RETRY_ATTEMPTS)
+ goto nopage;
+ do_retry_attempts += 1;
+ }
+ }
if (gfp_mask & __GFP_NOFAIL)
- do_retry = 1;
+ do_retry_attempts = 1;
}
- if (do_retry) {
+ if (do_retry_attempts) {
congestion_wait(WRITE, HZ/50);
goto rebalance;
}
* Re: [PATCH][UPDATED] hugetlb: retry pool allocation attempts
From: Dave Hansen @ 2007-11-15 21:34 UTC
To: Nishanth Aravamudan; +Cc: wli, kenchen, david, linux-mm, Andy Whitcroft
On Thu, 2007-11-15 at 12:18 -0800, Nishanth Aravamudan wrote:
> b) __alloc_pages() does not currently retry allocations for order >
> PAGE_ALLOC_COSTLY_ORDER.
... when __GFP_REPEAT has not been specified, right?
> Modify __alloc_pages() to retry __GFP_REPEAT allocations above
> PAGE_ALLOC_COSTLY_ORDER up to COSTLY_ORDER_RETRY_ATTEMPTS times, which
> I've set to 5, and use __GFP_REPEAT in the hugetlb pool allocation. 5
> seems to give reasonable
> results for x86, x86_64 and ppc64, but I'm not sure how to come up with
> the "best" number here (suggestions are welcome!). With this patch
> applied, the same box that gave the above results now gives:
Coding in an explicit number of retries like this seems a bit hackish to
me. Retrying the allocations N times internally (through looping)
should give roughly the same number of huge pages as retrying them N
times externally (from the /proc file). Does doing another ~50
allocations get you to the same number of huge pages?
What happens if you *only* specify GFP_REPEAT from hugetlbfs?
I think you're asking a bit much of the page allocator (and reclaim)
here. There is a discrete amount of memory pressure applied for each
allocator request. Increasing the number of requests will virtually
always increase the memory pressure and make more pages available.
What is the actual behavior that you want to get here? Do you want that
34th request to always absolutely plateau the number of huge pages?
-- Dave
* Re: [PATCH][UPDATED] hugetlb: retry pool allocation attempts
From: Mel Gorman @ 2007-11-16 0:10 UTC
To: Dave Hansen
Cc: Nishanth Aravamudan, wli, kenchen, david, linux-mm, Andy Whitcroft
On (15/11/07 13:34), Dave Hansen didst pronounce:
> On Thu, 2007-11-15 at 12:18 -0800, Nishanth Aravamudan wrote:
> > b) __alloc_pages() does not currently retry allocations for order >
> > PAGE_ALLOC_COSTLY_ORDER.
>
> ... when __GFP_REPEAT has not been specified, right?
>
Currently if hugetlbfs specified __GFP_REPEAT, it would end up trying to
allocate indefinitely - that does not sound like sane behaviour. Indefinite
retries for small allocations make some sense, but for the larger allocs
it should give up after a while, as __GFP_REPEAT is documented to do.
> > Modify __alloc_pages() to retry __GFP_REPEAT allocations above
> > PAGE_ALLOC_COSTLY_ORDER up to COSTLY_ORDER_RETRY_ATTEMPTS times,
> > which I've set to 5, and use __GFP_REPEAT in the hugetlb pool
> > allocation. 5 seems to give reasonable
> > results for x86, x86_64 and ppc64, but I'm not sure how to come up with
> > the "best" number here (suggestions are welcome!). With this patch
> > applied, the same box that gave the above results now gives:
>
> Coding in an explicit number of retries like this seems a bit hackish to
> me. Retrying the allocations N times internally (through looping)
> should give roughly the same number of huge pages that retrying them N
> times externally (from the /proc file).
The third case is where the pool is being dynamically resized and this
allocation attempt is happening via the mmap() or fault paths. In those cases
it should be making a serious attempt to satisfy the allocation without
peppering retry logic in multiple places when __GFP_REPEAT is meant to do
what is desired.
>Does doing another ~50
> allocations get you to the same number of huge pages?
>
> What happens if you *only* specify GFP_REPEAT from hugetlbfs?
>
Potentially, it will stay forever in a reclaim loop.
> I think you're asking a bit much of the page allocator (and reclaim)
> here.
The ideal is that direct reclaim is only entered once. In practice, it
may not work out that way: even if lumpy reclaim frees the necessary
contiguous pages, there is no guarantee the requesting process will get
them, because reclaim does not grant it ownership of those pages. Fixing
that would be pretty invasive and, while I expect those patches to exist
eventually, they are pretty far away.
Ideally Nish could just say "__GFP_REPEAT" in the flags, but it looks
like he had to alter slightly how __GFP_REPEAT behaves so it is not an
alias for __GFP_NOFAIL.
> There is a discrete amount of memory pressure applied for each
> allocator request. Increasing the number of requests will virtually
> always increase the memory pressure and make more pages available.
>
For a __GFP_REPEAT allocation, it says "try and pressure more because I
really could do with this page" as opposed to failing.
> What is the actual behavior that you want to get here? Do you want that
> 34th request to always absolutely plateau the number of huge pages?
>
I believe the desired behaviour is for larger allocations specifying
__GFP_REPEAT to apply a bit more pressure than might have been applied
otherwise.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH][UPDATED] hugetlb: retry pool allocation attempts
From: Nishanth Aravamudan @ 2007-11-16 0:46 UTC
To: Dave Hansen; +Cc: wli, kenchen, david, linux-mm, Andy Whitcroft
On 15.11.2007 [13:34:35 -0800], Dave Hansen wrote:
> On Thu, 2007-11-15 at 12:18 -0800, Nishanth Aravamudan wrote:
> > b) __alloc_pages() does not currently retry allocations for order >
> > PAGE_ALLOC_COSTLY_ORDER.
>
> ... when __GFP_REPEAT has not been specified, right?
Err, yes, sorry -- will fix the commit message. If __GFP_REPEAT is set,
though, it is translated directly to __GFP_NOFAIL.
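For context, here is the pre-patch logic, paraphrased from the lines the
patch removes; any allocation with __GFP_REPEAT set retries with no
bound at all, regardless of order, which is why it amounts to
__GFP_NOFAIL:

	/* __alloc_pages() before the patch: low orders, __GFP_REPEAT,
	 * and __GFP_NOFAIL all funnel into the same unbounded retry. */
	do_retry = 0;
	if (!(gfp_mask & __GFP_NORETRY)) {
		if ((order <= PAGE_ALLOC_COSTLY_ORDER) ||
		    (gfp_mask & __GFP_REPEAT))
			do_retry = 1;
		if (gfp_mask & __GFP_NOFAIL)
			do_retry = 1;
	}
	if (do_retry) {
		congestion_wait(WRITE, HZ/50);
		goto rebalance;	/* loop back; no retry counter */
	}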
> > Modify __alloc_pages() to retry __GFP_REPEAT allocations above
> > PAGE_ALLOC_COSTLY_ORDER up to COSTLY_ORDER_RETRY_ATTEMPTS times,
> > which I've set to 5, and use __GFP_REPEAT in the hugetlb pool
> > allocation. 5 seems to give
> > reasonable results for x86, x86_64 and ppc64, but I'm not sure how
> > to come up with the "best" number here (suggestions are welcome!).
> > With this patch applied, the same box that gave the above results
> > now gives:
>
> Coding in an explicit number of retries like this seems a bit hackish
> to me. Retrying the allocations N times internally (through looping)
> should give roughly the same number of huge pages that retrying them N
> times externally (from the /proc file). Does doing another ~50
> allocations get you to the same number of huge pages?
Yes, in that first example, the userspace shell script, which is just
repeatedly trying to grow the pool by 100 hugepages, eventually gets to
the same maximum value, just slower.
> What happens if you *only* specify GFP_REPEAT from hugetlbfs?
We get __NOFAIL behavior, as I understand it -- I haven't tried this,
though. And it seems undesirable to even have that semantic in hugetlb
code.
> I think you're asking a bit much of the page allocator (and reclaim)
> here. There is a discrete amount of memory pressure applied for each
> allocator request. Increasing the number of requests will virtually
> always increase the memory pressure and make more pages available.
I'm not sure I follow how this is "a bit much". It seems reasonable to
try a little harder to satisfy a larger-order request, if the caller
asked for as much.
> What is the actual behavior that you want to get here? Do you want
> that 34th request to always absolutely plateau the number of huge
> pages?
I think I said this in the commit message, but yes, in an ideal world,
we'd never have a sequence of requests where a particular number of
hugepages are requested (X) but the kernel allocates fewer, then the
same request is made and more hugepages are allocated (due to the
increased memory pressure). That may mean waiting a long time in the
kernel, though, so I figured this compromise of some effort (not
necessarily best effort, I suppose) for trying to satisfy the request
gets us a little closer, at least.
Thanks,
Nish
* Re: [PATCH][UPDATED] hugetlb: retry pool allocation attempts
From: Mel Gorman @ 2007-11-16 0:19 UTC
To: Nishanth Aravamudan; +Cc: wli, kenchen, david, linux-mm
On (15/11/07 12:18), Nishanth Aravamudan didst pronounce:
> On 15.11.2007 [12:10:53 -0800], Nishanth Aravamudan wrote:
> > Currently, successive attempts to allocate the hugepage pool via the
> > sysctl can result in the following sort of behavior (assume each attempt
> > is trying to grow the pool by 100 hugepages, starting with 100 hugepages
> > in the pool, on x86_64):
>
> Sigh, same patch, fixed up a few checkpatch issues with long lines.
> Sorry for the noise.
>
> hugetlb: retry pool allocation attempts
>
> Currently, successive attempts to allocate the hugepage pool via the
> sysctl can result in the following sort of behavior (assume each attempt
> is trying to grow the pool by 100 hugepages, starting with 100 hugepages
> in the pool, on x86_64):
>
> Attempt 1: 200 hugepages
> Attempt 2: 300 hugepages
> ...
> Attempt 33: 3400 hugepages
> Attempt 34: 3438 hugepages
> Attempt 35: 3438 hugepages
> Attempt 36: 3438 hugepages
> Attempt 37: 3439 hugepages
> Attempt 38: 3440 hugepages
> Attempt 39: 3441 hugepages
> Attempt 40: 3441 hugepages
> Attempt 41: 3442 hugepages
> ...
>
> I think, in an ideal world, we would not have a situation where the
> hugepage pool grows on an attempt after a previous attempt has failed
> (we should have freed up sufficient memory earlier). We also wouldn't
> get successive single-page allocations, but would have a single
> larger-size allocation. There are two reasons this doesn't happen
> currently:
>
> a) hugetlb pool allocation calls do not specify __GFP_REPEAT to ask the
> VM to retry the allocations (invoking reclaim to help the requests
> succeed).
>
> b) __alloc_pages() does not currently retry allocations for order >
> PAGE_ALLOC_COSTLY_ORDER.
>
> Modify __alloc_pages() to retry __GFP_REPEAT allocations above
> PAGE_ALLOC_COSTLY_ORDER up to COSTLY_ORDER_RETRY_ATTEMPTS times, which
> I've set to 5, and use __GFP_REPEAT in the hugetlb pool allocation. 5
> seems to give reasonable
> results for x86, x86_64 and ppc64, but I'm not sure how to come up with
> the "best" number here (suggestions are welcome!). With this patch
> applied, the same box that gave the above results now gives:
>
> Attempt 1: 200 hugepages
> Attempt 2: 300 hugepages
> ...
> Attempt 33: 3400 hugepages
> Attempt 34: 3438 hugepages
> Attempt 35: 3442 hugepages
> Attempt 36: 3443 hugepages
> Attempt 37: 3443 hugepages
> Attempt 38: 3443 hugepages
> Attempt 39: 3443 hugepages
> Attempt 40: 3443 hugepages
> Attempt 41: 3444 hugepages
> ...
>
> While the patch makes things better (we get more hugepages sooner), we
> still get an allocation success (of one hugepage) after getting a few
> failures in a row. Even with 10 retry attempts, I got similar
> results. Determining the perfect number, I expect, would require knowing
> the current/future I/O characteristics and current/future system
> activity -- in lieu of this prescience, this heuristic does seem to
> improve things and does not require userspace applications to implement
> their own retry logic (or, more accurately, makes those userspace
> retries more effective).
>
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
>
> ---
> $ git show HEAD | ./scripts/checkpatch.pl -
> Your patch has no obvious style problems and is ready for submission.
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4c4522a..c4e36ba 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -33,6 +33,12 @@
> * will not.
> */
> #define PAGE_ALLOC_COSTLY_ORDER 3
> +/*
> + * COSTLY_ORDER_RETRY_ATTEMPTS is the number of retry attempts for
> + * allocations above PAGE_ALLOC_COSTLY_ORDER with __GFP_REPEAT
> + * specified.
Perhaps add a note here saying that __GFP_REPEAT for allocations below
PAGE_ALLOC_COSTLY_ORDER behaves like __GFP_NOFAIL?
> + */
> +#define COSTLY_ORDER_RETRY_ATTEMPTS 5
>
> #define MIGRATE_UNMOVABLE 0
> #define MIGRATE_RECLAIMABLE 1
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8b809ec..0eb2953 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -171,7 +171,8 @@ static struct page *alloc_fresh_huge_page_node(int nid)
> struct page *page;
>
> page = alloc_pages_node(nid,
> - htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
> + htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
> + __GFP_REPEAT|__GFP_NOWARN,
> HUGETLB_PAGE_ORDER);
> if (page) {
> set_compound_page_dtor(page, free_huge_page);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index da69d83..3fcedda 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1470,7 +1470,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
> struct page *page;
> struct reclaim_state reclaim_state;
> struct task_struct *p = current;
> - int do_retry;
> + int do_retry_attempts = 0;
> int alloc_flags;
> int did_some_progress;
>
> @@ -1622,16 +1622,25 @@ nofail_alloc:
> *
> * In this implementation, __GFP_REPEAT means __GFP_NOFAIL for order
> * <= 3, but that may not be true in other implementations.
> + *
> + * For order > 3, __GFP_REPEAT means try to reclaim memory 5
> + * times, but that may not be true in other implementations.
magic number alert. s/3/PAGE_ALLOC_COSTLY_ORDER/ and
s/5/COSTLY_ORDER_RETRY_ATTEMPTS/
> */
> - do_retry = 0;
> if (!(gfp_mask & __GFP_NORETRY)) {
> - if ((order <= PAGE_ALLOC_COSTLY_ORDER) ||
> - (gfp_mask & __GFP_REPEAT))
> - do_retry = 1;
> + if (gfp_mask & __GFP_REPEAT) {
> + if (order <= PAGE_ALLOC_COSTLY_ORDER) {
> + do_retry_attempts = 1;
> + } else {
> + if (do_retry_attempts >
> + COSTLY_ORDER_RETRY_ATTEMPTS)
> + goto nopage;
> + do_retry_attempts += 1;
> + }
Seems fair enough logic. The second if looks a little ugly but I don't
see a nicer way of expressing it right now.
> + }
> if (gfp_mask & __GFP_NOFAIL)
> - do_retry = 1;
> + do_retry_attempts = 1;
> }
> - if (do_retry) {
> + if (do_retry_attempts) {
> congestion_wait(WRITE, HZ/50);
> goto rebalance;
> }
>
Overall, seems fine to me.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH][UPDATED] hugetlb: retry pool allocation attempts
From: Nishanth Aravamudan @ 2007-11-16 0:58 UTC
To: Mel Gorman; +Cc: wli, kenchen, david, linux-mm
On 16.11.2007 [00:19:11 +0000], Mel Gorman wrote:
> On (15/11/07 12:18), Nishanth Aravamudan didst pronounce:
> > On 15.11.2007 [12:10:53 -0800], Nishanth Aravamudan wrote:
> > > Currently, successive attempts to allocate the hugepage pool via the
> > > sysctl can result in the following sort of behavior (assume each attempt
> > > is trying to grow the pool by 100 hugepages, starting with 100 hugepages
> > > in the pool, on x86_64):
> >
> > Sigh, same patch, fixed up a few checkpatch issues with long lines.
> > Sorry for the noise.
> >
> > hugetlb: retry pool allocation attempts
> >
> > Currently, successive attempts to allocate the hugepage pool via the
> > sysctl can result in the following sort of behavior (assume each attempt
> > is trying to grow the pool by 100 hugepages, starting with 100 hugepages
> > in the pool, on x86_64):
<snip>
> > Modify __alloc_pages() to retry __GFP_REPEAT allocations above
> > PAGE_ALLOC_COSTLY_ORDER up to COSTLY_ORDER_RETRY_ATTEMPTS times,
> > which I've set to 5, and use __GFP_REPEAT in the hugetlb pool
> > allocation. 5 seems to give reasonable
> > results for x86, x86_64 and ppc64, but I'm not sure how to come up with
> > the "best" number here (suggestions are welcome!). With this patch
> > applied, the same box that gave the above results now gives:
<snip>
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -33,6 +33,12 @@
> > * will not.
> > */
> > #define PAGE_ALLOC_COSTLY_ORDER 3
> > +/*
> > + * COSTLY_ORDER_RETRY_ATTEMPTS is the number of retry attempts for
> > + * allocations above PAGE_ALLOC_COSTLY_ORDER with __GFP_REPEAT
> > + * specified.
>
> Perhaps add a note here saying that __GFP_REPEAT for allocations below
> PAGE_ALLOC_COSTLY_ORDER behaves like __GFP_NOFAIL?
Good idea.
<snip>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
<snip>
> > @@ -1622,16 +1622,25 @@ nofail_alloc:
> > *
> > * In this implementation, __GFP_REPEAT means __GFP_NOFAIL for order
> > * <= 3, but that may not be true in other implementations.
> > + *
> > + * For order > 3, __GFP_REPEAT means try to reclaim memory 5
> > + * times, but that may not be true in other implementations.
>
> magic number alert. s/3/PAGE_ALLOC_COSTLY_ORDER/ and
> s/5/COSTLY_ORDER_RETRY_ATTEMPTS/
I was trying to avoid changing too much outside the scope of the patch's
intent, but this does help clarify things.
> > */
> > - do_retry = 0;
> > if (!(gfp_mask & __GFP_NORETRY)) {
> > - if ((order <= PAGE_ALLOC_COSTLY_ORDER) ||
> > - (gfp_mask & __GFP_REPEAT))
> > - do_retry = 1;
> > + if (gfp_mask & __GFP_REPEAT) {
> > + if (order <= PAGE_ALLOC_COSTLY_ORDER) {
> > + do_retry_attempts = 1;
> > + } else {
> > + if (do_retry_attempts >
> > + COSTLY_ORDER_RETRY_ATTEMPTS)
> > + goto nopage;
> > + do_retry_attempts += 1;
> > + }
>
> Seems fair enough logic. The second if looks a little ugly but I don't
> see a nicer way of expressing it right now.
Yeah, I'm open to suggestions. I also didn't want to always increment
do_retry_attempts, because __NOFAIL might lead to it growing without
bound and wrapping...
Hrm, looking closer, this hunk is wrong anyways... the original code
says this, I think:
if gfp_mask does not specify NORETRY
if order is less than or equal to COSTLY_ORDER
or gfp_mask specifies REPEAT
engage in NOFAIL behavior
My changes lead to:
if gfp_mask does not specify NORETRY
if gfp_mask does specify REPEAT
if order is less than or equal to COSTLY_ORDER
engage in NOFAIL behavior
So before, if an allocation is at or below COSTLY_ORDER, regardless of
__GFP_REPEAT's state in the gfp_mask, we upgrade the behavior to NOFAIL.
That's my reading, at least. Easy enough to keep that behavior with my
code, but the comment sort of implies a different behavior. I'll update
the comment further in my patch to reflect the cases, I think.
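For illustration, a corrected hunk along those lines -- a sketch only,
not the patch as reposted -- could keep the unconditional retry for
orders at or below COSTLY_ORDER, bound the retries for larger
__GFP_REPEAT allocations, and check __GFP_NOFAIL first so the counter
never grows without bound:

	if (!(gfp_mask & __GFP_NORETRY)) {
		if (gfp_mask & __GFP_NOFAIL) {
			/* must not fail: never count, never give up */
			do_retry_attempts = 1;
		} else if (order <= PAGE_ALLOC_COSTLY_ORDER) {
			/* small orders: retry regardless of __GFP_REPEAT */
			do_retry_attempts = 1;
		} else if (gfp_mask & __GFP_REPEAT) {
			/* large orders: give up after a bounded count */
			if (do_retry_attempts >= COSTLY_ORDER_RETRY_ATTEMPTS)
				goto nopage;
			do_retry_attempts++;
		}
	}
	if (do_retry_attempts) {
		congestion_wait(WRITE, HZ/50);
		goto rebalance;
	}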
> > + }
> > if (gfp_mask & __GFP_NOFAIL)
> > - do_retry = 1;
> > + do_retry_attempts = 1;
> > }
> > - if (do_retry) {
> > + if (do_retry_attempts) {
> > congestion_wait(WRITE, HZ/50);
> > goto rebalance;
> > }
> >
>
> Overall, seems fine to me.
Thanks,
Nish