* [RFC][PATCH 1/2] Smarter retry of costly-order allocations
From: Nishanth Aravamudan @ 2008-02-06 23:07 UTC (permalink / raw)
To: melgor; +Cc: apw, clameter, linux-mm
Smarter retry of costly-order allocations
Because of the page order checks in __alloc_pages(), hugepage (and
similarly large-order) allocations will not retry unless explicitly
marked __GFP_REPEAT. However, the current retry logic is nearly an
infinite loop: it only gives up once reclaim makes no progress
whatsoever. For these costly allocations, that seems like overkill and
could potentially never terminate. Modify try_to_free_pages() to report
how many pages it reclaimed, and use that in __alloc_pages() to
eventually fail large allocations once we have supposedly reclaimed at
least as many pages as the allocation itself needs. This relies on
lumpy reclaim (and perhaps grouping of pages by mobility?) functioning
as advertised.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
The next patch makes hugepage allocations use __GFP_REPEAT and
demonstrates the difference.
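
As a rough summary of the bound this introduces, the retry decision can
be sketched as a standalone helper (hypothetical helper name and
structure, simplified; it ignores the -EAGAIN case handled in the diff
below):

static int should_retry_costly(gfp_t gfp_mask, unsigned int order,
                               unsigned long pages_reclaimed)
{
        if (gfp_mask & __GFP_NORETRY)
                return 0;
        if (gfp_mask & __GFP_NOFAIL)
                return 1;
        if (order <= PAGE_ALLOC_COSTLY_ORDER)
                return 1;       /* unchanged: keep retrying */
        if (!(gfp_mask & __GFP_REPEAT))
                return 0;       /* costly and not marked: fail */
        /* costly + __GFP_REPEAT: retry only until the cumulative
         * reclaim progress covers the request itself */
        return pages_reclaimed < (1UL << order);
}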
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 353153e..e6e8030 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -180,7 +180,7 @@ extern int rotate_reclaimable_page(struct page *page);
extern void swap_setup(void);
/* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zone **zones, int order,
+extern int try_to_free_pages(struct zone **zones, int order,
gfp_t gfp_mask);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9153cb8..22b892b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1461,6 +1461,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
int do_retry;
int alloc_flags;
int did_some_progress;
+ unsigned long pages_reclaimed = 0;
might_sleep_if(wait);
@@ -1569,7 +1570,7 @@ nofail_alloc:
if (order != 0)
drain_all_pages();
- if (likely(did_some_progress)) {
+ if (likely(did_some_progress != 0)) {
page = get_page_from_freelist(gfp_mask, order,
zonelist, alloc_flags);
if (page)
@@ -1608,15 +1609,28 @@ nofail_alloc:
* Don't let big-order allocations loop unless the caller explicitly
* requests that. Wait for some write requests to complete then retry.
*
- * In this implementation, either order <= PAGE_ALLOC_COSTLY_ORDER or
- * __GFP_REPEAT mean __GFP_NOFAIL, but that may not be true in other
+ * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
+ * means __GFP_NOFAIL, but that may not be true in other
* implementations.
+ *
+ * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
+ * specified, then we retry until we no longer reclaim any pages
+ * (above), or we've reclaimed an order of pages at least as
+ * large as the allocation's order. In both cases, if the
+ * allocation still fails, we stop retrying.
*/
+ if (did_some_progress != -EAGAIN)
+ pages_reclaimed += did_some_progress;
do_retry = 0;
if (!(gfp_mask & __GFP_NORETRY)) {
- if ((order <= PAGE_ALLOC_COSTLY_ORDER) ||
- (gfp_mask & __GFP_REPEAT))
+ if (order <= PAGE_ALLOC_COSTLY_ORDER) {
do_retry = 1;
+ } else {
+ if (gfp_mask & __GFP_REPEAT &&
+ (did_some_progress == -EAGAIN ||
+ pages_reclaimed < (1 << order)))
+ do_retry = 1;
+ }
if (gfp_mask & __GFP_NOFAIL)
do_retry = 1;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e5a9597..c9d67b4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1205,8 +1205,14 @@ static unsigned long shrink_zones(int priority, struct zone **zones,
* hope that some of these pages can be written. But if the allocating task
* holds filesystem locks which prevent writeout this might not work, and the
* allocation attempt will fail.
+ *
+ * returns: 0, if no pages reclaimed
+ * -EAGAIN, if insufficient pages were reclaimed to satisfy the
+ * order specified, but further reclaim might
+ * succeed
+ * else, the order of pages reclaimed
*/
-unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
+int try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
{
int priority;
int ret = 0;
@@ -1248,7 +1254,7 @@ unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
}
total_scanned += sc.nr_scanned;
if (nr_reclaimed >= sc.swap_cluster_max) {
- ret = 1;
+ ret = nr_reclaimed;
goto out;
}
@@ -1270,8 +1276,12 @@ unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
congestion_wait(WRITE, HZ/10);
}
/* top priority shrink_caches still had more to do? don't OOM, then */
- if (!sc.all_unreclaimable)
- ret = 1;
+ if (!sc.all_unreclaimable) {
+ if (nr_reclaimed >= (1 << order))
+ ret = nr_reclaimed;
+ else
+ ret = -EAGAIN;
+ }
out:
/*
* Now that we've scanned all the zones at this priority level, note
* [RFC][PATCH 2/2] Explicitly retry hugepage allocations
From: Nishanth Aravamudan @ 2008-02-06 23:12 UTC (permalink / raw)
To: melgor; +Cc: apw, clameter, agl, wli, linux-mm
Add __GFP_REPEAT to hugepage allocations, so that userspace no longer
has to put pressure on the VM by repeatedly echoing into
/proc/sys/vm/nr_hugepages to grow the pool. With the previous patch
allowing large-order __GFP_REPEAT attempts to loop for a bit (as
opposed to indefinitely), this increases the likelihood of getting
hugepages when the system experiences (or recently experienced) load.

On a 2-way x86_64, this doubles the number of hugepages (from 10 to 20)
obtained while compiling a kernel at the same time. On a 4-way ppc64,
a similar relative increase is seen (from 3 to 5 hugepages). Finally,
on a 2-way x86, this leads to a more than 5-fold increase in the
hugepages allocatable under load (from 90 to 554).
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
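
For reference, the userspace workaround this is meant to make
unnecessary looks roughly like the following sketch (illustrative
program, not part of the patch): repeatedly writing the desired pool
size so the kernel retries the failed hugepage allocations.

#include <stdio.h>
#include <stdlib.h>

#define NR_HUGEPAGES "/proc/sys/vm/nr_hugepages"

static long pool_size(void)
{
        long n = -1;
        FILE *f = fopen(NR_HUGEPAGES, "r");

        if (f) {
                if (fscanf(f, "%ld", &n) != 1)
                        n = -1;
                fclose(f);
        }
        return n;
}

int main(int argc, char **argv)
{
        long want = argc > 1 ? strtol(argv[1], NULL, 0) : 20;
        int tries;

        /* Poke the pool repeatedly; each write retries the failed
         * allocations, like "echo $want > /proc/sys/vm/nr_hugepages". */
        for (tries = 0; tries < 10 && pool_size() < want; tries++) {
                FILE *f = fopen(NR_HUGEPAGES, "w");

                if (!f) {
                        perror(NR_HUGEPAGES);
                        return 1;
                }
                fprintf(f, "%ld\n", want);
                fclose(f);
        }
        printf("pool now has %ld hugepages (wanted %ld)\n",
               pool_size(), want);
        return 0;
}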
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1a56420..0358a91 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -176,7 +176,8 @@ static struct page *alloc_fresh_huge_page_node(int nid)
struct page *page;
page = alloc_pages_node(nid,
- htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
+ htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
+ __GFP_REPEAT|__GFP_NOWARN,
HUGETLB_PAGE_ORDER);
if (page) {
set_compound_page_dtor(page, free_huge_page);
@@ -262,7 +263,8 @@ static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
}
spin_unlock(&hugetlb_lock);
- page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
+ page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
+ __GFP_REPEAT|__GFP_NOWARN,
HUGETLB_PAGE_ORDER);
spin_lock(&hugetlb_lock);
* Re: [RFC][PATCH 2/2] Explicitly retry hugepage allocations
From: Christoph Lameter @ 2008-02-06 23:30 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: melgor, apw, agl, wli, linux-mm
On Wed, 6 Feb 2008, Nishanth Aravamudan wrote:
> Add __GFP_REPEAT to hugepage allocations, so that userspace no longer
> has to put pressure on the VM by repeatedly echoing into
> /proc/sys/vm/nr_hugepages to grow the pool. With the previous patch
> allowing large-order __GFP_REPEAT attempts to loop for a bit (as
> opposed to indefinitely), this increases the likelihood of getting
> hugepages when the system experiences (or recently experienced) load.
>
> On a 2-way x86_64, this doubles the number of hugepages (from 10 to 20)
> obtained while compiling a kernel at the same time. On a 4-way ppc64,
> a similar relative increase is seen (from 3 to 5 hugepages). Finally,
> on a 2-way x86, this leads to a more than 5-fold increase in the
> hugepages allocatable under load (from 90 to 554).
Hmmm... How about defaulting to __GFP_REPEAT for larger page
allocations? There are other users of larger allocs that would also
benefit from the same measure. I think it would be fine as long as we
are sure to fail at some point.
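
A minimal sketch of what that default could look like in the slow path
(illustrative only, not a patch posted in this thread):

        /* opt costly orders into the bounded retry by default */
        if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NORETRY))
                gfp_mask |= __GFP_REPEAT;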
* Re: [RFC][PATCH 2/2] Explicitly retry hugepage allocations
From: Nishanth Aravamudan @ 2008-02-07 1:04 UTC (permalink / raw)
To: Christoph Lameter; +Cc: melgor, apw, agl, wli, linux-mm
On 06.02.2008 [15:30:53 -0800], Christoph Lameter wrote:
> On Wed, 6 Feb 2008, Nishanth Aravamudan wrote:
>
> > Add __GFP_REPEAT to hugepage allocations, so that userspace no longer
> > has to put pressure on the VM by repeatedly echoing into
> > /proc/sys/vm/nr_hugepages to grow the pool. With the previous patch
> > allowing large-order __GFP_REPEAT attempts to loop for a bit (as
> > opposed to indefinitely), this increases the likelihood of getting
> > hugepages when the system experiences (or recently experienced) load.
> >
> > On a 2-way x86_64, this doubles the number of hugepages (from 10 to 20)
> > obtained while compiling a kernel at the same time. On a 4-way ppc64,
> > a similar relative increase is seen (from 3 to 5 hugepages). Finally,
> > on a 2-way x86, this leads to a more than 5-fold increase in the
> > hugepages allocatable under load (from 90 to 554).
>
> Hmmm... How about defaulting to __GFP_REPEAT for larger page
> allocations? There are other users of larger allocs that would also
> benefit from the same measure. I think it would be fine as long as we
> are sure to fail at some point.
We could do that. That would essentially mean that we don't really ever
need __GFP_REPEAT in the current implementation:

        if (order <= PAGE_ALLOC_COSTLY_ORDER)
                __GFP_REPEAT is implicitly __GFP_NOFAIL
        if (order > PAGE_ALLOC_COSTLY_ORDER)
                __GFP_REPEAT is implicitly applied

So I guess we'd have the following semantic cases in the VM if we did
that:

        if (order <= PAGE_ALLOC_COSTLY_ORDER)
                if (flags & __GFP_NORETRY)
                        don't retry, might succeed
                else
                        __GFP_NOFAIL, must succeed
        else
                if (flags & __GFP_NORETRY)
                        don't retry, might succeed
                if (flags & __GFP_NOFAIL)
                        don't fail, must succeed
                else
                        __GFP_REPEAT, might succeed

We *could* make the low-order __GFP_REPEAT case the same as the
high-order one (if we reclaim a certain order of pages, then we should
be able to satisfy the original allocation), but that change seemed
more invasive and aggressive, so I left it alone.
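
Expressed as a hypothetical helper (illustrative names, not code from
either patch), the case table above would look roughly like:

        enum retry_policy { NO_RETRY, RETRY_BOUNDED, RETRY_FOREVER };

        static enum retry_policy classify_retry(gfp_t flags, unsigned int order)
        {
                if (flags & __GFP_NORETRY)
                        return NO_RETRY;        /* may still succeed on the first try */
                if (order <= PAGE_ALLOC_COSTLY_ORDER)
                        return RETRY_FOREVER;   /* implicitly __GFP_NOFAIL today */
                if (flags & __GFP_NOFAIL)
                        return RETRY_FOREVER;
                /* implicit __GFP_REPEAT: give up once enough pages
                 * have been reclaimed for the request */
                return RETRY_BOUNDED;
        }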
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [RFC][PATCH 2/2] Explicitly retry hugepage allocations
From: Nishanth Aravamudan @ 2008-02-08 17:11 UTC (permalink / raw)
To: Christoph Lameter; +Cc: melgor, apw, agl, wli, linux-mm
On 06.02.2008 [15:30:53 -0800], Christoph Lameter wrote:
> On Wed, 6 Feb 2008, Nishanth Aravamudan wrote:
>
> > Add __GFP_REPEAT to hugepage allocations, so that userspace no longer
> > has to put pressure on the VM by repeatedly echoing into
> > /proc/sys/vm/nr_hugepages to grow the pool. With the previous patch
> > allowing large-order __GFP_REPEAT attempts to loop for a bit (as
> > opposed to indefinitely), this increases the likelihood of getting
> > hugepages when the system experiences (or recently experienced) load.
> >
> > On a 2-way x86_64, this doubles the number of hugepages (from 10 to 20)
> > obtained while compiling a kernel at the same time. On a 4-way ppc64,
> > a similar relative increase is seen (from 3 to 5 hugepages). Finally,
> > on a 2-way x86, this leads to a more than 5-fold increase in the
> > hugepages allocatable under load (from 90 to 554).
>
> Hmmm... How about defaulting to __GFP_REPEAT for larger page
> allocations? There are other users of larger allocs that would also
> benefit from the same measure. I think it would be fine as long as we
> are sure to fail at some point.
In thinking about this more, one of the harder parts for me to get my
head around was the implicit promotion of small-order allocations to
__GFP_REPEAT (and thus to __GFP_NOFAIL). I would prefer keeping the
large-order allocations explicit as to when they want the VM to try
harder to succeed. As far as I understand it, only hugepages will
really leverage this from code in the kernel currently? I also feel
like, even if __GFP_REPEAT becomes a default behavior, it's better to
use it as documentation of intent from the caller -- and perhaps as an
indicator of sites that are over-stressing the VM unnecessarily by
regularly forcing reclaim?

I also am not 100% positive on how I would test the result of such a
change, since there are not that many large-order allocations in the
kernel... Did you have any thoughts on that?
Thanks,
Nish
* Re: [RFC][PATCH 2/2] Explicitly retry hugepage allocations
From: Christoph Lameter @ 2008-02-08 19:19 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: melgor, apw, agl, wli, linux-mm
On Fri, 8 Feb 2008, Nishanth Aravamudan wrote:
> I also am not 100% positive on how I would test the result of such a
> change, since there are not that many large-order allocations in the
> kernel... Did you have any thoughts on that?
Boot the kernel with

slub_min_order=<whatever order you wish>

to get lots of allocations of a higher order.

You can run slub with huge pages by booting with

slub_min_order=9

This causes some benchmarks to run much faster...

In general the use of higher-order pages is discouraged right now due
to the page allocator's flaky behavior when allocating them, but there
are several projects that would benefit from it. Among them are large
buffer support for the I/O layer and larger page support for the VM to
reduce 4k page scanning overhead.
* Re: [RFC][PATCH 2/2] Explicitly retry hugepage allocations
From: Nishanth Aravamudan @ 2008-02-08 23:40 UTC (permalink / raw)
To: Christoph Lameter; +Cc: melgor, apw, agl, wli, linux-mm
On 08.02.2008 [11:19:54 -0800], Christoph Lameter wrote:
> On Fri, 8 Feb 2008, Nishanth Aravamudan wrote:
>
> > I also am not 100% positive on how I would test the result of such a
> > change, since there are not that many large-order allocations in the
> > kernel... Did you have any thoughts on that?
>
> Boot the kernel with
>
> slub_min_order=<whatever order you wish>
>
> to get lots of allocations of a higher order.
>
> You can run slub with huge pages by booting with
>
> slub_min_order=9
>
> This causes some benchmarks to run much faster...
>
> In general the use of higher-order pages is discouraged right now due
> to the page allocator's flaky behavior when allocating them, but there
> are several projects that would benefit from it. Among them are large
> buffer support for the I/O layer and larger page support for the VM to
> reduce 4k page scanning overhead.
That all makes sense. However, for now, if it would be ok with you, I'd
just make higher-order allocations coming from hugetlb.c use the
__GFP_REPEAT logic I'm trying to add. If the method seems good in
general, then we just need to mark other locations (SLUB allocation
paths?) with __GFP_REPEAT. When slub_min_order <= PAGE_ALLOC_COSTLY_ORDER
we shouldn't see any difference, and when it is greater we should hit
the logic I added. Does that seem reasonable to you? I think it's a
separate idea, though, and I'd prefer keeping it in a separate patch,
if that's ok with you.
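
For illustration, marking such a call site could look roughly like this
(hypothetical wrapper, not code from either patch):

        static struct page *alloc_large_order(gfp_t base_flags, int order)
        {
                gfp_t flags = base_flags | __GFP_COMP | __GFP_NOWARN;

                if (order > PAGE_ALLOC_COSTLY_ORDER)
                        flags |= __GFP_REPEAT;  /* bounded retry, may still fail */

                return alloc_pages(flags, order);
        }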
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [RFC][PATCH 2/2] Explicitly retry hugepage allocations
From: Christoph Lameter @ 2008-02-08 23:42 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: melgor, apw, agl, wli, linux-mm
On Fri, 8 Feb 2008, Nishanth Aravamudan wrote:
> just make higher-order allocations coming from hugetlb.c use the
> __GFP_REPEAT logic I'm trying to add. If the method seems good in
> general, then we just need to mark other locations (SLUB allocation
> paths?) with __GFP_REPEAT. When slub_min_order <= PAGE_ALLOC_COSTLY_ORDER
> we shouldn't see any difference, and when it is greater we should hit
> the logic I added. Does that seem reasonable to you? I think it's a
> separate idea, though, and I'd prefer keeping it in a separate patch,
> if that's ok with you.
Fine with me.