linux-mm.kvack.org archive mirror
* [PATCH] mm: stop kswapd's infinite loop at high order allocation
@ 2008-12-30 10:55 KOSAKI Motohiro
  2008-12-30 11:10 ` Nick Piggin
  2008-12-30 18:59 ` Mel Gorman
  0 siblings, 2 replies; 20+ messages in thread
From: KOSAKI Motohiro @ 2008-12-30 10:55 UTC (permalink / raw)
  To: LKML, linux-mm, Andrew Morton, Nick Piggin, wassim dagash; +Cc: kosaki.motohiro


ok, wassim confirmed this patch works well.


==
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation

Wassim Dagash reported the following kswapd infinite loop problem.

  kswapd runs in some infinite loop trying to swap until order 10 of zone
  highmem is OK, While zone higmem (as I understand) has nothing to do
  with contiguous memory (cause there is no 1-1 mapping) which means
  kswapd will continue to try to balance order 10 of zone highmem
  forever (or until someone release a very large chunk of highmem).

He proposed removing the contiguity check on highmem entirely.
However, hugepages in highmem do need contiguous highmem pages.

Adding a stopper for this infinite loop is simple and good.



Reported-by: wassim dagash <wassim.dagash@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1872,6 +1872,17 @@ out:
 
 		try_to_freeze();
 
+		/*
+		 * When highmem is very fragmented,
+		 * alloc_pages(GFP_KERNEL, very-high-order) can cause an
+		 * infinite loop because zone_watermark_ok(highmem) keeps failing.
+		 * However, alloc_pages(GFP_KERNEL..) indicates that highmem
+		 * contiguity isn't necessary.
+		 * Therefore we don't want the contiguity check on the 2nd loop.
+		 */
+		if (nr_reclaimed < SWAP_CLUSTER_MAX)
+			order = sc.order = 0;
+
 		goto loop_again;
 	}
 




* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation
  2008-12-30 10:55 [PATCH] mm: stop kswapd's infinite loop at high order allocation KOSAKI Motohiro
@ 2008-12-30 11:10 ` Nick Piggin
  2008-12-30 18:59 ` Mel Gorman
  1 sibling, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2008-12-30 11:10 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, wassim dagash

On Tue, Dec 30, 2008 at 07:55:47PM +0900, KOSAKI Motohiro wrote:
> 
> ok, wassim confirmed this patch works well.
> 
> 
> ==
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation
> 
> Wassim Dagash reported following kswapd infinite loop problem.
> 
>   kswapd runs in some infinite loop trying to swap until order 10 of zone
>   highmem is OK, While zone higmem (as I understand) has nothing to do
>   with contiguous memory (cause there is no 1-1 mapping) which means
>   kswapd will continue to try to balance order 10 of zone highmem
>   forever (or until someone release a very large chunk of highmem).
> 
> He proposed remove contenious checking on highmem at all.
> However hugepage on highmem need contenious highmem page.
> 
> To add infinite loop stopper is simple and good.
> 
> 
> 
> Reported-by: wassim dagash <wassim.dagash@gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Thanks,

Reviewed-by: Nick Piggin <npiggin@suse.de>

> ---
>  mm/vmscan.c |   11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> Index: b/mm/vmscan.c
> ===================================================================
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1872,6 +1872,17 @@ out:
>  
>  		try_to_freeze();
>  
> +		/*
> +		 * When highmem is very fragmented,
> +		 * alloc_pages(GFP_KERNEL, very-high-order) can cause
> +		 * infinite loop because zone_watermark_ok(highmem) failed.
> +		 * However, alloc_pages(GFP_KERNEL..) indicate highmem memory
> +		 * continuousness isn't necessary.
> +		 * Therefore we don't want contenious check at 2nd loop.
> +		 */
> +		if (nr_reclaimed < SWAP_CLUSTER_MAX)
> +			order = sc.order = 0;
> +
>  		goto loop_again;
>  	}
>  
> 



* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation
  2008-12-30 10:55 [PATCH] mm: stop kswapd's infinite loop at high order allocation KOSAKI Motohiro
  2008-12-30 11:10 ` Nick Piggin
@ 2008-12-30 18:59 ` Mel Gorman
  2008-12-31  1:32   ` Nick Piggin
  2008-12-31  4:54   ` KOSAKI Motohiro
  1 sibling, 2 replies; 20+ messages in thread
From: Mel Gorman @ 2008-12-30 18:59 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Nick Piggin, wassim dagash

On Tue, Dec 30, 2008 at 07:55:47PM +0900, KOSAKI Motohiro wrote:
> 
> ok, wassim confirmed this patch works well.
> 
> 
> ==
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation
> 
> Wassim Dagash reported following kswapd infinite loop problem.
> 
>   kswapd runs in some infinite loop trying to swap until order 10 of zone
>   highmem is OK, While zone higmem (as I understand) has nothing to do
>   with contiguous memory (cause there is no 1-1 mapping) which means
>   kswapd will continue to try to balance order 10 of zone highmem
>   forever (or until someone release a very large chunk of highmem).
> 
> He proposed remove contenious checking on highmem at all.
> However hugepage on highmem need contenious highmem page.
> 

I'm lacking the original problem report, but contiguous order-10 pages are
indeed required for hugepages in highmem and reclaiming for them should not
be totally disabled at any point. While no 1-1 mapping exists for the kernel,
contiguity is still required.

kswapd gets a sc.order when it is known there is a process trying to get
high-order pages so it can reclaim at that order in an attempt to prevent
future direct reclaim at a high-order. Your patch does not appear to depend on
GFP_KERNEL at all so I found the comment misleading. Furthermore, asking it to
loop again at order-0 means it may scan and reclaim more memory unnecessarily
seeing as all_zones_ok was calculated based on a high-order value, not order-0.
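
For context, the order hand-off happens through pgdat->kswapd_max_order: the allocator records the order it wants and kswapd() picks it up on wakeup, then passes it to balance_pgdat() where it becomes sc.order. A heavily condensed sketch of the 2.6.28-era kswapd() main loop (cpumask, reclaim_state and freezer setup elided):

	/* mm/vmscan.c, 2.6.28-era kswapd() main loop (condensed sketch) */
	static int kswapd(void *p)
	{
		pg_data_t *pgdat = (pg_data_t *)p;
		unsigned long order = 0;
		DEFINE_WAIT(wait);

		for ( ; ; ) {
			unsigned long new_order;

			prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
			new_order = pgdat->kswapd_max_order;
			pgdat->kswapd_max_order = 0;
			if (order < new_order) {
				/* Don't sleep if someone wants a larger 'order' allocation */
				order = new_order;
			} else {
				if (!freezing(current))
					schedule();
				order = pgdat->kswapd_max_order;
			}
			finish_wait(&pgdat->kswapd_wait, &wait);

			if (!try_to_freeze())
				balance_pgdat(pgdat, order);	/* 'order' becomes sc.order */
		}
		return 0;
	}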

While constantly looping trying to balance for high-orders is indeed bad,
I'm unconvinced this is the correct change. As we have already gone through
all the priorities and scanned everything at the high order, would it not make
more sense to just give up with something like the following?

       /*
        * If zones are still not balanced, loop again and continue attempting
        * to rebalance the system. For high-order allocations, fragmentation
        * can prevent the zones being rebalanced no matter how hard kswapd
        * works, particularly on systems with little or no swap. For costly
        * orders, just give up and assume interested processes will either
        * direct reclaim or wake up kswapd as necessary.
        */
        if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
                cond_resched();

                try_to_freeze();

                goto loop_again;
        }

I used PAGE_ALLOC_COSTLY_ORDER instead of sc.order == 0 because we are
expected to support allocations up to that order in a fairly reliable fashion.
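
For reference, that threshold is small; the 2.6.28-era definition in include/linux/mmzone.h is:

	/* include/linux/mmzone.h (2.6.28): the order above which allocations
	 * are considered costly to service reliably */
	#define PAGE_ALLOC_COSTLY_ORDER 3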

> To add infinite loop stopper is simple and good.
> 
> 
> 
> Reported-by: wassim dagash <wassim.dagash@gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

For the moment

NAKed-by: Mel Gorman <mel@csn.ul.ie>

How about the following (compile-tested-only) patch?

=============
From: Mel Gorman <mel@csn.ul.ie>
Subject: [PATCH] mm: stop kswapd's infinite loop at high order allocation

  kswapd runs in some infinite loop trying to swap until order 10 of zone
  highmem is OK.... kswapd will continue to try to balance order 10 of zone
  highmem forever (or until someone release a very large chunk of highmem).

For costly high-order allocations, the system may never be balanced due to
fragmentation but kswapd should not infinitely loop as a result. The
following patch lets kswapd stop reclaiming in the event it cannot
balance zones and the order is high-order.

Reported-by: wassim dagash <wassim.dagash@gmail.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 62e7f62..03ed9a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1867,7 +1867,16 @@ out:
 
 		zone->prev_priority = temp_priority[i];
 	}
-	if (!all_zones_ok) {
+
+	/*
+	 * If zones are still not balanced, loop again and continue attempting
+	 * to rebalance the system. For high-order allocations, fragmentation
+	 * can prevent the zones being rebalanced no matter how hard kswapd
+	 * works, particularly on systems with little or no swap. For costly
+	 * orders, just give up and assume interested processes will either
+	 * direct reclaim or wake up kswapd as necessary.
+	 */
+	if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
 		cond_resched();
 
 		try_to_freeze();



* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation
  2008-12-30 18:59 ` Mel Gorman
@ 2008-12-31  1:32   ` Nick Piggin
  2008-12-31 11:06     ` Mel Gorman
  2008-12-31  4:54   ` KOSAKI Motohiro
  1 sibling, 1 reply; 20+ messages in thread
From: Nick Piggin @ 2008-12-31  1:32 UTC (permalink / raw)
  To: Mel Gorman; +Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, wassim dagash

On Tue, Dec 30, 2008 at 06:59:19PM +0000, Mel Gorman wrote:
> On Tue, Dec 30, 2008 at 07:55:47PM +0900, KOSAKI Motohiro wrote:
> > 
> > ok, wassim confirmed this patch works well.
> > 
> > 
> > ==
> > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation
> > 
> > Wassim Dagash reported following kswapd infinite loop problem.
> > 
> >   kswapd runs in some infinite loop trying to swap until order 10 of zone
> >   highmem is OK, While zone higmem (as I understand) has nothing to do
> >   with contiguous memory (cause there is no 1-1 mapping) which means
> >   kswapd will continue to try to balance order 10 of zone highmem
> >   forever (or until someone release a very large chunk of highmem).
> > 
> > He proposed remove contenious checking on highmem at all.
> > However hugepage on highmem need contenious highmem page.
> > 
> 
> I'm lacking the original problem report, but contiguous order-10 pages are
> indeed required for hugepages in highmem and reclaiming for them should not
> be totally disabled at any point. While no 1-1 mapping exists for the kernel,
> contiguity is still required.

This doesn't totally disable them. It disables asynchronous reclaim for them
until the next time kswapd is kicked by a higher-order allocator. The
guy who kicked us off this time should go into direct reclaim.
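
For context, the "kick" is wakeup_kswapd(): it records the requested order in pgdat->kswapd_max_order before waking kswapd. A lightly trimmed 2.6.28-era version (the same function is visible in the patch attached later in this thread):

	/* mm/vmscan.c (2.6.28-era, lightly trimmed) */
	void wakeup_kswapd(struct zone *zone, int order)
	{
		pg_data_t *pgdat;

		if (!populated_zone(zone))
			return;

		pgdat = zone->zone_pgdat;
		if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
			return;
		/* Remember the highest order requested since the last wakeup */
		if (pgdat->kswapd_max_order < order)
			pgdat->kswapd_max_order = order;
		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
			return;
		if (!waitqueue_active(&pgdat->kswapd_wait))
			return;
		wake_up_interruptible(&pgdat->kswapd_wait);
	}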


> kswapd gets a sc.order when it is known there is a process trying to get
> high-order pages so it can reclaim at that order in an attempt to prevent
> future direct reclaim at a high-order. Your patch does not appear to depend on
> GFP_KERNEL at all so I found the comment misleading. Furthermore, asking it to
> loop again at order-0 means it may scan and reclaim more memory unnecessarily
> seeing as all_zones_ok was calculated based on a high-order value, not order-0.

It shouldn't, because it should check all that.


> While constantly looping trying to balance for high-orders is indeed bad,
> I'm unconvinced this is the correct change. As we have already gone through
> a priorities and scanned everything at the high-order, would it not make
> more sense to do just give up with something like the following?
> 
>        /*
>         * If zones are still not balanced, loop again and continue attempting
>         * to rebalance the system. For high-order allocations, fragmentation
>         * can prevent the zones being rebalanced no matter how hard kswapd
>         * works, particularly on systems with little or no swap. For costly
>         * orders, just give up and assume interested processes will either
>         * direct reclaim or wake up kswapd as necessary.
>         */
>         if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
>                 cond_resched();
> 
>                 try_to_freeze();
> 
>                 goto loop_again;
>         }
> 
> I used PAGE_ALLOC_COSTLY_ORDER instead of sc.order == 0 because we are
> expected to support allocations up to that order in a fairly reliable fashion.

I actually think it's better to do it for all orders, because that
constant is more or less arbitrary. It is possible a zone might become
too fragmented to support this, but the allocating process has been OOMed or
had their allocation satisfied from another zone. kswapd would have no way
out of the loop even if the system no longer requires higher order allocations.

IOW, I don't see a big downside, and there is a real upside.

I think the patch is good.

> 
> > To add infinite loop stopper is simple and good.
> > 
> > 
> > 
> > Reported-by: wassim dagash <wassim.dagash@gmail.com>
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> For the moment
> 
> NAKed-by: Mel Gorman <mel@csn.ul.ie>
> 
> How about the following (compile-tested-only) patch?
> 
> =============
> From: Mel Gorman <mel@csn.ul.ie>
> Subject: [PATCH] mm: stop kswapd's infinite loop at high order allocation
> 
>   kswapd runs in some infinite loop trying to swap until order 10 of zone
>   highmem is OK.... kswapd will continue to try to balance order 10 of zone
>   highmem forever (or until someone release a very large chunk of highmem).
> 
> For costly high-order allocations, the system may never be balanced due to
> fragmentation but kswapd should not infinitely loop as a result. The
> following patch lets kswapd stop reclaiming in the event it cannot
> balance zones and the order is high-order.
> 
> Reported-by: wassim dagash <wassim.dagash@gmail.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 62e7f62..03ed9a0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1867,7 +1867,16 @@ out:
>  
>  		zone->prev_priority = temp_priority[i];
>  	}
> -	if (!all_zones_ok) {
> +
> +	/*
> +	 * If zones are still not balanced, loop again and continue attempting
> +	 * to rebalance the system. For high-order allocations, fragmentation
> +	 * can prevent the zones being rebalanced no matter how hard kswapd
> +	 * works, particularly on systems with little or no swap. For costly
> +	 * orders, just give up and assume interested processes will either
> +	 * direct reclaim or wake up kswapd as necessary.
> +	 */
> +	if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
>  		cond_resched();
>  
>  		try_to_freeze();



* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation
  2008-12-30 18:59 ` Mel Gorman
  2008-12-31  1:32   ` Nick Piggin
@ 2008-12-31  4:54   ` KOSAKI Motohiro
  2008-12-31  8:59     ` wassim dagash
  2008-12-31 11:53     ` Mel Gorman
  1 sibling, 2 replies; 20+ messages in thread
From: KOSAKI Motohiro @ 2008-12-31  4:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: LKML, linux-mm, Andrew Morton, Nick Piggin, wassim dagash

Hi

thank you for reviewing.

>> ==
>> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation
>>
>> Wassim Dagash reported following kswapd infinite loop problem.
>>
>>   kswapd runs in some infinite loop trying to swap until order 10 of zone
>>   highmem is OK, While zone higmem (as I understand) has nothing to do
>>   with contiguous memory (cause there is no 1-1 mapping) which means
>>   kswapd will continue to try to balance order 10 of zone highmem
>>   forever (or until someone release a very large chunk of highmem).
>>
>> He proposed remove contenious checking on highmem at all.
>> However hugepage on highmem need contenious highmem page.
>>
>
> I'm lacking the original problem report, but contiguous order-10 pages are
> indeed required for hugepages in highmem and reclaiming for them should not
> be totally disabled at any point. While no 1-1 mapping exists for the kernel,
> contiguity is still required.

Correct, but that's OK.

My patch only changes corner-case behavior: it only disables high-order
reclaim when priority==0. Typical hugepage reclaim doesn't need, and
doesn't reach, priority==0.

And sorry, I agree the "2nd loop" wording in the patch comment is a
bit misleading.


> kswapd gets a sc.order when it is known there is a process trying to get
> high-order pages so it can reclaim at that order in an attempt to prevent
> future direct reclaim at a high-order. Your patch does not appear to depend on
> GFP_KERNEL at all so I found the comment misleading. Furthermore, asking it to
> loop again at order-0 means it may scan and reclaim more memory unnecessarily
> seeing as all_zones_ok was calculated based on a high-order value, not order-0.

Yup, my patch doesn't depend on GFP_KERNEL.

But why does looping again at order-0 mean it may scan more memory
unnecessarily? all_zones_ok is calculated via zone_watermark_ok(), and
zone_watermark_ok() depends on the order argument; my patch sets the order
variable to 0 too.
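
For reference, the order dependency is the per-order loop in zone_watermark_ok(); a lightly trimmed 2.6.28-era version from mm/page_alloc.c:

	/* mm/page_alloc.c (2.6.28-era, lightly trimmed) */
	int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
			      int classzone_idx, int alloc_flags)
	{
		/* free_pages may go negative - that's OK */
		long min = mark;
		long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
		int o;

		if (alloc_flags & ALLOC_HIGH)
			min -= min / 2;
		if (alloc_flags & ALLOC_HARDER)
			min -= min / 4;

		if (free_pages <= min + z->lowmem_reserve[classzone_idx])
			return 0;
		for (o = 0; o < order; o++) {
			/* At the next order, this order's pages become unavailable */
			free_pages -= z->free_area[o].nr_free << o;

			/* Require fewer higher order pages to be free */
			min >>= 1;

			if (free_pages <= min)
				return 0;
		}
		return 1;
	}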


> While constantly looping trying to balance for high-orders is indeed bad,
> I'm unconvinced this is the correct change. As we have already gone through
> a priorities and scanned everything at the high-order, would it not make
> more sense to do just give up with something like the following?
>
>       /*
>        * If zones are still not balanced, loop again and continue attempting
>        * to rebalance the system. For high-order allocations, fragmentation
>        * can prevent the zones being rebalanced no matter how hard kswapd
>        * works, particularly on systems with little or no swap. For costly
>        * orders, just give up and assume interested processes will either
>        * direct reclaim or wake up kswapd as necessary.
>        */
>        if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
>                cond_resched();
>
>                try_to_freeze();
>
>                goto loop_again;
>        }
>
> I used PAGE_ALLOC_COSTLY_ORDER instead of sc.order == 0 because we are
> expected to support allocations up to that order in a fairly reliable fashion.

My comment is below.


> =============
> From: Mel Gorman <mel@csn.ul.ie>
> Subject: [PATCH] mm: stop kswapd's infinite loop at high order allocation
>
>  kswapd runs in some infinite loop trying to swap until order 10 of zone
>  highmem is OK.... kswapd will continue to try to balance order 10 of zone
>  highmem forever (or until someone release a very large chunk of highmem).
>
> For costly high-order allocations, the system may never be balanced due to
> fragmentation but kswapd should not infinitely loop as a result. The
> following patch lets kswapd stop reclaiming in the event it cannot
> balance zones and the order is high-order.
>
> Reported-by: wassim dagash <wassim.dagash@gmail.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 62e7f62..03ed9a0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1867,7 +1867,16 @@ out:
>
>                zone->prev_priority = temp_priority[i];
>        }
> -       if (!all_zones_ok) {
> +
> +       /*
> +        * If zones are still not balanced, loop again and continue attempting
> +        * to rebalance the system. For high-order allocations, fragmentation
> +        * can prevent the zones being rebalanced no matter how hard kswapd
> +        * works, particularly on systems with little or no swap. For costly
> +        * orders, just give up and assume interested processes will either
> +        * direct reclaim or wake up kswapd as necessary.
> +        */
> +       if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
>                cond_resched();
>
>                try_to_freeze();

This patch seems no good.
kswapd comes back to this point every time SWAP_CLUSTER_MAX pages have been
reclaimed, in order to avoid decreasing the priority variable unnecessarily.
So "nr_reclaimed >= SWAP_CLUSTER_MAX" indicates kswapd needs to reclaim more.

kswapd's purpose is "reclaim until pages_high", not to reclaim
SWAP_CLUSTER_MAX pages.

With your patch applied, if kswapd starts reclaiming for hugepages it
exits balance_pgdat() after reclaiming only 32 pages
(SWAP_CLUSTER_MAX).

On the other hand, "nr_reclaimed < SWAP_CLUSTER_MAX" means kswapd couldn't
reclaim enough pages even at priority == 0.
In that case, retrying is worthless.
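
For context, the control flow being discussed is the tail of balance_pgdat(); a heavily condensed sketch of the 2.6.28-era structure (zone scanning details elided):

	/* mm/vmscan.c, balance_pgdat() control flow (2.6.28-era, condensed sketch) */
	loop_again:
		nr_reclaimed = 0;
		/* sc.order was set from the requested order on entry */

		for (priority = DEF_PRIORITY; priority >= 0; priority--) {
			all_zones_ok = 1;

			/* ... check watermarks at sc.order, shrink_zone()/shrink_slab(),
			 *     accumulate nr_reclaimed ... */

			if (all_zones_ok)
				break;		/* kswapd: all done */

			/*
			 * Break out of the priority loop once SWAP_CLUSTER_MAX pages
			 * have been reclaimed, so kswapd doesn't build up large
			 * priorities while freeing in parallel with allocators.
			 */
			if (nr_reclaimed >= SWAP_CLUSTER_MAX)
				break;
		}
	out:
		/* ... restore zone->prev_priority ... */

		if (!all_zones_ok) {	/* the hunk the patches in this thread modify */
			cond_resched();
			try_to_freeze();
			goto loop_again;
		}

		return nr_reclaimed;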

Sorting it out again: the "goto loop_again" is reached in two cases.

1. kswapd reclaimed SWAP_CLUSTER_MAX pages.
    In that case kswapd resets the priority variable to prevent
unnecessary priority decreasing.
    I don't want this behavior to change.
2. kswapd scanned all the way down to priority==0.
    This case is debatable. My patch resets any order to 0, but the
following code is also worth considering. (Sorry for the corrupted tabs;
my current mail environment is very poor.)


code-A:
       if (!all_zones_ok) {
               if ((nr_reclaimed >= SWAP_CLUSTER_MAX) ||
                   (sc.order <= PAGE_ALLOC_COSTLY_ORDER)) {
                       cond_resched();
                       try_to_freeze();
                       goto loop_again;
               }
       }

or

code-B:
       if (!all_zones_ok) {
               cond_resched();
               try_to_freeze();

               if (nr_reclaimed >= SWAP_CLUSTER_MAX)
                       goto loop_again;

               if (sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
                       order = sc.order = 0;
                       goto loop_again;
               }
       }


However, I still prefer my original proposal because ..
  - code-A forgets that an order-1 allocation (for kernel stacks) can also
    cause an infinite loop.
  - code-B isn't simpler than my original proposal.

What do you think?



* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation
  2008-12-31  4:54   ` KOSAKI Motohiro
@ 2008-12-31  8:59     ` wassim dagash
  2008-12-31 12:05       ` Mel Gorman
  2008-12-31 11:53     ` Mel Gorman
  1 sibling, 1 reply; 20+ messages in thread
From: wassim dagash @ 2008-12-31  8:59 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Mel Gorman, LKML, linux-mm, Andrew Morton, Nick Piggin

[-- Attachment #1: Type: text/plain, Size: 7293 bytes --]

Hi,
Thank you all for reviewing.
Why don't we implement a solution where the order is defined per zone?
I implemented such a solution for my kernel (2.6.18) and tested it; it
worked fine for me. Attached is a patch with a solution for 2.6.28
(compile-tested only).
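
The core of the idea, condensed from the attached patch: track the wanted order per zone (in struct zone) instead of per node (in pg_data_t), and have kswapd take the largest pending order across the node's zones. A rough sketch only; node_kswapd_order() is an illustrative name, the attached patch open-codes this loop inside kswapd():

	/* Sketch of the attached patch's idea, not the full diff */
	struct zone {
		/* ... existing fields ... */
		unsigned int kswapd_max_order;	/* highest order requested for this zone */
	};

	/* Hypothetical helper for illustration */
	static unsigned int node_kswapd_order(pg_data_t *pgdat)
	{
		unsigned int max_order = 0;
		int i;

		for (i = pgdat->nr_zones - 1; i >= 0; i--) {
			struct zone *zone = pgdat->node_zones + i;

			if (max_order < zone->kswapd_max_order)
				max_order = zone->kswapd_max_order;
			zone->kswapd_max_order = 0;
		}
		return max_order;
	}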

On Wed, Dec 31, 2008 at 6:54 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi
>
> thank you for reviewing.
>
>>> ==
>>> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>> Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation
>>>
>>> Wassim Dagash reported following kswapd infinite loop problem.
>>>
>>>   kswapd runs in some infinite loop trying to swap until order 10 of zone
>>>   highmem is OK, While zone higmem (as I understand) has nothing to do
>>>   with contiguous memory (cause there is no 1-1 mapping) which means
>>>   kswapd will continue to try to balance order 10 of zone highmem
>>>   forever (or until someone release a very large chunk of highmem).
>>>
>>> He proposed remove contenious checking on highmem at all.
>>> However hugepage on highmem need contenious highmem page.
>>>
>>
>> I'm lacking the original problem report, but contiguous order-10 pages are
>> indeed required for hugepages in highmem and reclaiming for them should not
>> be totally disabled at any point. While no 1-1 mapping exists for the kernel,
>> contiguity is still required.
>
> correct.
> but that's ok.
>
> my patch only change corner case bahavior and only disable high-order
> when priority==0. typical hugepage reclaim don't need and don't reach
> priority==0.
>
> and sorry. I agree with my "2nd loop"  word of the patch comment is a
> bit misleading.
>
>
>> kswapd gets a sc.order when it is known there is a process trying to get
>> high-order pages so it can reclaim at that order in an attempt to prevent
>> future direct reclaim at a high-order. Your patch does not appear to depend on
>> GFP_KERNEL at all so I found the comment misleading. Furthermore, asking it to
>> loop again at order-0 means it may scan and reclaim more memory unnecessarily
>> seeing as all_zones_ok was calculated based on a high-order value, not order-0.
>
> Yup. my patch doesn't depend on GFP_KERNEL.
>
> but, Why order-0 means it may scan more memory unnecessary?
> all_zones_ok() is calculated by zone_watermark_ok() and zone_watermark_ok()
> depend on order argument. and my patch set order variable to 0 too.
>
>
>> While constantly looping trying to balance for high-orders is indeed bad,
>> I'm unconvinced this is the correct change. As we have already gone through
>> a priorities and scanned everything at the high-order, would it not make
>> more sense to do just give up with something like the following?
>>
>>       /*
>>        * If zones are still not balanced, loop again and continue attempting
>>        * to rebalance the system. For high-order allocations, fragmentation
>>        * can prevent the zones being rebalanced no matter how hard kswapd
>>        * works, particularly on systems with little or no swap. For costly
>>        * orders, just give up and assume interested processes will either
>>        * direct reclaim or wake up kswapd as necessary.
>>        */
>>        if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
>>                cond_resched();
>>
>>                try_to_freeze();
>>
>>                goto loop_again;
>>        }
>>
>> I used PAGE_ALLOC_COSTLY_ORDER instead of sc.order == 0 because we are
>> expected to support allocations up to that order in a fairly reliable fashion.
>
> my comment is bellow.
>
>
>> =============
>> From: Mel Gorman <mel@csn.ul.ie>
>> Subject: [PATCH] mm: stop kswapd's infinite loop at high order allocation
>>
>>  kswapd runs in some infinite loop trying to swap until order 10 of zone
>>  highmem is OK.... kswapd will continue to try to balance order 10 of zone
>>  highmem forever (or until someone release a very large chunk of highmem).
>>
>> For costly high-order allocations, the system may never be balanced due to
>> fragmentation but kswapd should not infinitely loop as a result. The
>> following patch lets kswapd stop reclaiming in the event it cannot
>> balance zones and the order is high-order.
>>
>> Reported-by: wassim dagash <wassim.dagash@gmail.com>
>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 62e7f62..03ed9a0 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1867,7 +1867,16 @@ out:
>>
>>                zone->prev_priority = temp_priority[i];
>>        }
>> -       if (!all_zones_ok) {
>> +
>> +       /*
>> +        * If zones are still not balanced, loop again and continue attempting
>> +        * to rebalance the system. For high-order allocations, fragmentation
>> +        * can prevent the zones being rebalanced no matter how hard kswapd
>> +        * works, particularly on systems with little or no swap. For costly
>> +        * orders, just give up and assume interested processes will either
>> +        * direct reclaim or wake up kswapd as necessary.
>> +        */
>> +       if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
>>                cond_resched();
>>
>>                try_to_freeze();
>
> this patch seems no good.
> kswapd come this point every SWAP_CLUSTER_MAX reclaimed because to avoid
> unnecessary priority variable decreasing.
> then "nr_reclaimed >= SWAP_CLUSTER_MAX" indicate kswapd need reclaim more.
>
> kswapd purpose is "reclaim until pages_high", not reclaim
> SWAP_CLUSTER_MAX pages.
>
> if your patch applied and kswapd start to reclaim for hugepage, kswapd
> exit balance_pgdat() function after to reclaim only 32 pages
> (SWAP_CLUSTER_MAX).
>
> In the other hand, "nr_reclaimed < SWAP_CLUSTER_MAX" mean kswapd can't
> reclaim enough
> page although priority == 0.
> in this case, retry is worthless.
>
> sorting out again.
> "goto loop_again" reaching happend by two case.
>
> 1. kswapd reclaimed SWAP_CLUSTER_MAX pages.
>    at that time, kswapd reset priority variable to prevent
> unnecessary priority decreasing.
>    I don't hope this behavior change.
> 2. kswapd scanned until priority==0.
>    this case is debatable. my patch reset any order to 0. but
> following code is also considerable to me. (sorry for tab corrupted,
> current my mail environment is very poor)
>
>
> code-A:
>       if (!all_zones_ok) {
>              if ((nr_reclaimed >= SWAP_CLUSTER_MAX) ||
>                 (sc.order <= PAGE_ALLOC_COSTLY_ORDER)) {
>                          cond_resched();
>                           try_to_freeze();
>                           goto loop_again;
>                }
>       }
>
> or
>
> code-B:
>       if (!all_zones_ok) {
>              cond_resched();
>               try_to_freeze();
>
>              if (nr_reclaimed >= SWAP_CLUSTER_MAX)
>                           goto loop_again;
>
>              if (sc.order <= PAGE_ALLOC_COSTLY_ORDER)) {
>                           order = sc.order = 0;
>                           goto loop_again;
>              }
>       }
>
>
> However, I still like my original proposal because ..
>  - code-A forget to order-1 (for stack) allocation also can cause
> infinite loop.
>  - code-B doesn't simpler than my original proposal.
>
> What do you think it?
>



-- 
too much is never enough!!!!!

[-- Attachment #2: kswapd.patch --]
[-- Type: application/octet-stream, Size: 6454 bytes --]

Reported-by: wassim dagash <wassim.dagash@gmail.com>
Signed-off-by: wassim dagash <wassim.dagash@gmail.com>

diff -Nuar linux-2.6.28.orig/include/linux/mmzone.h linux-2.6.28/include/linux/mmzone.h
--- linux-2.6.28.orig/include/linux/mmzone.h	2008-12-25 12:20:10.000000000 +0200
+++ linux-2.6.28/include/linux/mmzone.h	2008-12-31 10:16:03.000000000 +0200
@@ -409,6 +409,7 @@
 	 * rarely used fields:
 	 */
 	const char		*name;
+	unsigned int 		kswapd_max_order;
 } ____cacheline_internodealigned_in_smp;
 
 typedef enum {
@@ -625,7 +626,6 @@
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
-	int kswapd_max_order;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -642,8 +642,8 @@
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
 			unsigned long *free);
 void build_all_zonelists(void);
-void wakeup_kswapd(struct zone *zone, int order);
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+void wakeup_kswapd(struct zone *zone);
+int zone_watermark_ok(struct zone *z, unsigned long mark,
 		int classzone_idx, int alloc_flags);
 enum memmap_context {
 	MEMMAP_EARLY,
diff -Nuar linux-2.6.28.orig/Makefile linux-2.6.28/Makefile
--- linux-2.6.28.orig/Makefile	2008-12-25 12:19:10.000000000 +0200
+++ linux-2.6.28/Makefile	2008-12-31 10:51:18.000000000 +0200
@@ -2,7 +2,7 @@
 PATCHLEVEL = 6
 SUBLEVEL = 28
 EXTRAVERSION =
-NAME = Erotic Pickled Herring
+NAME = kswapd_sol 
 
 # *DOCUMENTATION*
 # To see a list of typical targets execute "make help"
diff -Nuar linux-2.6.28.orig/mm/page_alloc.c linux-2.6.28/mm/page_alloc.c
--- linux-2.6.28.orig/mm/page_alloc.c	2008-12-25 12:20:14.000000000 +0200
+++ linux-2.6.28/mm/page_alloc.c	2008-12-31 10:26:32.000000000 +0200
@@ -1224,10 +1224,11 @@
  * Return 1 if free pages are above 'mark'. This takes into account the order
  * of the allocation.
  */
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+int zone_watermark_ok(struct zone *z, unsigned long mark,
 		      int classzone_idx, int alloc_flags)
 {
 	/* free_pages my go negative - that's OK */
+	unsigned int order = z->kswapd_max_order;
 	long min = mark;
 	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
 	int o;
@@ -1417,7 +1418,7 @@
 				mark = zone->pages_low;
 			else
 				mark = zone->pages_high;
-			if (!zone_watermark_ok(zone, order, mark,
+			if (!zone_watermark_ok(zone, mark,
 				    classzone_idx, alloc_flags)) {
 				if (!zone_reclaim_mode ||
 				    !zone_reclaim(zone, gfp_mask, order))
@@ -1485,9 +1486,16 @@
 
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+
 	if (page)
 		goto got_pg;
 
+        /*
+        First item in list of zones suitable for gfp_mask is the zone the request intended to,
+        the other items are fallback
+        */
+        z->zone->kswapd_max_order = order;
+
 	/*
 	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
 	 * __GFP_NOWARN set) should not cause reclaim since the subsystem
@@ -1500,7 +1508,7 @@
 		goto nopage;
 
 	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		wakeup_kswapd(zone, order);
+		wakeup_kswapd(zone);
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
@@ -3448,7 +3456,6 @@
 	pgdat_resize_init(pgdat);
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
-	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
 	
 	for (j = 0; j < MAX_NR_ZONES; j++) {
@@ -3488,6 +3495,8 @@
 			nr_kernel_pages += realsize;
 		nr_all_pages += realsize;
 
+		/* Initializing kswapd_max_order to zero */
+		zone->kswapd_max_order = 0;
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
 #ifdef CONFIG_NUMA
diff -Nuar linux-2.6.28.orig/mm/vmscan.c linux-2.6.28/mm/vmscan.c
--- linux-2.6.28.orig/mm/vmscan.c	2008-12-25 12:20:14.000000000 +0200
+++ linux-2.6.28/mm/vmscan.c	2008-12-31 10:29:27.000000000 +0200
@@ -1770,7 +1770,7 @@
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
 							&sc, priority, 0);
 
-			if (!zone_watermark_ok(zone, order, zone->pages_high,
+			if (!zone_watermark_ok(zone, zone->pages_high,
 					       0, 0)) {
 				end_zone = i;
 				break;
@@ -1805,7 +1805,7 @@
 					priority != DEF_PRIORITY)
 				continue;
 
-			if (!zone_watermark_ok(zone, order, zone->pages_high,
+			if (!zone_watermark_ok(zone, zone->pages_high,
 					       end_zone, 0))
 				all_zones_ok = 0;
 			temp_priority[i] = priority;
@@ -1815,7 +1815,7 @@
 			 * We put equal pressure on every zone, unless one
 			 * zone has way too many pages free already.
 			 */
-			if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
+			if (!zone_watermark_ok(zone, 8*zone->pages_high,
 						end_zone, 0))
 				nr_reclaimed += shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
@@ -1924,22 +1924,30 @@
 	order = 0;
 	for ( ; ; ) {
 		unsigned long new_order;
-
+		int i,max_order;
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
-		new_order = pgdat->kswapd_max_order;
-		pgdat->kswapd_max_order = 0;
-		if (order < new_order) {
-			/*
-			 * Don't sleep if someone wants a larger 'order'
-			 * allocation
-			 */
-			order = new_order;
-		} else {
-			if (!freezing(current))
-				schedule();
 
-			order = pgdat->kswapd_max_order;
+		max_order = 0;
+		for (i = pgdat->nr_zones - 1; i >= 0; i--) 
+		{
+			struct zone *zone = pgdat->node_zones + i;
+			new_order = zone->kswapd_max_order;
+			zone->kswapd_max_order = 0;
+			if (max_order < new_order){
+				max_order = new_order;
+			}
+                }
+
+		if(order < max_order)
+		{
+			order = max_order;
 		}
+		else
+		{	
+			if (!freezing(current))
+                              schedule();
+		}
+
 		finish_wait(&pgdat->kswapd_wait, &wait);
 
 		if (!try_to_freeze()) {
@@ -1955,7 +1963,7 @@
 /*
  * A zone is low on free memory, so wake its kswapd task to service it.
  */
-void wakeup_kswapd(struct zone *zone, int order)
+void wakeup_kswapd(struct zone *zone)
 {
 	pg_data_t *pgdat;
 
@@ -1963,10 +1971,8 @@
 		return;
 
 	pgdat = zone->zone_pgdat;
-	if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
+	if (zone_watermark_ok(zone, zone->pages_low, 0, 0))
 		return;
-	if (pgdat->kswapd_max_order < order)
-		pgdat->kswapd_max_order = order;
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))


* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation
  2008-12-31  1:32   ` Nick Piggin
@ 2008-12-31 11:06     ` Mel Gorman
  2008-12-31 11:16       ` Nick Piggin
  0 siblings, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2008-12-31 11:06 UTC (permalink / raw)
  To: Nick Piggin; +Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, wassim dagash

On Wed, Dec 31, 2008 at 02:32:33AM +0100, Nick Piggin wrote:
> On Tue, Dec 30, 2008 at 06:59:19PM +0000, Mel Gorman wrote:
> > On Tue, Dec 30, 2008 at 07:55:47PM +0900, KOSAKI Motohiro wrote:
> > > 
> > > ok, wassim confirmed this patch works well.
> > > 
> > > 
> > > ==
> > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation
> > > 
> > > Wassim Dagash reported following kswapd infinite loop problem.
> > > 
> > >   kswapd runs in some infinite loop trying to swap until order 10 of zone
> > >   highmem is OK, While zone higmem (as I understand) has nothing to do
> > >   with contiguous memory (cause there is no 1-1 mapping) which means
> > >   kswapd will continue to try to balance order 10 of zone highmem
> > >   forever (or until someone release a very large chunk of highmem).
> > > 
> > > He proposed remove contenious checking on highmem at all.
> > > However hugepage on highmem need contenious highmem page.
> > > 
> > 
> > I'm lacking the original problem report, but contiguous order-10 pages are
> > indeed required for hugepages in highmem and reclaiming for them should not
> > be totally disabled at any point. While no 1-1 mapping exists for the kernel,
> > contiguity is still required.
> 
> This doesn't totally disable them. It disables asynchronous reclaim for them
> until the next time kswapd kicks is kicked by a higher order allocator. The
> guy who kicked us off this time should go into direct reclaim.
> 

I get that. The check to bail out is made after we've already done the
scanning. I wanted to be clear that contiguity in highmem is required for
hugepages.

> 
> > kswapd gets a sc.order when it is known there is a process trying to get
> > high-order pages so it can reclaim at that order in an attempt to prevent
> > future direct reclaim at a high-order. Your patch does not appear to depend on
> > GFP_KERNEL at all so I found the comment misleading. Furthermore, asking it to
> > loop again at order-0 means it may scan and reclaim more memory unnecessarily
> > seeing as all_zones_ok was calculated based on a high-order value, not order-0.
> 
> It shouldn't, because it should check all that.
> 

Ok, with KOSAKI's patch we

1. Set order to 0 (and stop kswapd doing what it was asked to do)
2. goto loop_again
3. nr_reclaimed gets set to 0 (meaning we lose that value, but no biggie
   as it doesn't get used by the caller anyway)
4. Reset all priorities
5. Do something like the following

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		...
		all_zones_ok = 1;
		for (i = pgdat->nr_zones - 1; i >= 0; i--) {
			...
			if (inactive_anon_is_low(zone)) {
				shrink_active_list(SWAP_CLUSTER_MAX, zone,
					&sc, priority, 0);
			}

			if (!zone_watermark_ok(zone, order, zone->pages_high,
					0, 0)) {
				end_zone = i;
				break;
			}
		}
	}

  So, by looping around, we could end up shrinking the active list again
  before we recheck the zone watermarks depending on the size of the
  inactive lists.

If the size of the lists is ok, I agree that we'll go through the lists,
do no reclaiming and exit out, albeit returning 0 when we have reclaimed pages.

> 
> > While constantly looping trying to balance for high-orders is indeed bad,
> > I'm unconvinced this is the correct change. As we have already gone through
> > a priorities and scanned everything at the high-order, would it not make
> > more sense to do just give up with something like the following?
> > 
> >        /*
> >         * If zones are still not balanced, loop again and continue attempting
> >         * to rebalance the system. For high-order allocations, fragmentation
> >         * can prevent the zones being rebalanced no matter how hard kswapd
> >         * works, particularly on systems with little or no swap. For costly
> >         * orders, just give up and assume interested processes will either
> >         * direct reclaim or wake up kswapd as necessary.
> >         */
> >         if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
> >                 cond_resched();
> > 
> >                 try_to_freeze();
> > 
> >                 goto loop_again;
> >         }
> > 
> > I used PAGE_ALLOC_COSTLY_ORDER instead of sc.order == 0 because we are
> > expected to support allocations up to that order in a fairly reliable fashion.
> 
> I actually think it's better to do it for all orders, because that
> constant is more or less arbitrary.

i.e.

if (!all_zones_ok && sc.order == 0) {

? or something else

What I did miss was that we have 

                if (nr_reclaimed >= SWAP_CLUSTER_MAX)
                        break;

so with my patch, kswapd is bailing out early without trying to reclaim for
high-orders that hard. That was not what I intended as it means we only ever
really rebalance the full system for order-0 pages and for everything else we
do relatively light scanning. The impact is that high-order users will direct
reclaim rather than depending on kswapd scanning very heavily. Arguably,
this is a good thing.

However, it also means that KOSAKI's patch and mine only differ in that mine
bails early and KOSAKI's rechecks everything at order-0, possibly reclaiming
more. If the comment were not so misleading, I'd have been a lot happier.

> It is possible a zone might become
> too fragmented to support this, but the allocating process has been OOMed or
> had their allocation satisfied from another zone. kswapd would have no way
> out of the loop even if the system no longer requires higher order allocations.
> 
> IOW, I don't see a big downside, and there is a real upside.
> 
> I think the patch is good.
> 

Which one, KOSAKI's or my one?

Here is my one again which bails out for any high-order allocation after
just light scanning.

====


* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation
  2008-12-31 11:06     ` Mel Gorman
@ 2008-12-31 11:16       ` Nick Piggin
  2008-12-31 12:11         ` Mel Gorman
  0 siblings, 1 reply; 20+ messages in thread
From: Nick Piggin @ 2008-12-31 11:16 UTC (permalink / raw)
  To: Mel Gorman; +Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, wassim dagash

On Wed, Dec 31, 2008 at 11:06:19AM +0000, Mel Gorman wrote:
> On Wed, Dec 31, 2008 at 02:32:33AM +0100, Nick Piggin wrote:
> > On Tue, Dec 30, 2008 at 06:59:19PM +0000, Mel Gorman wrote:
> > > On Tue, Dec 30, 2008 at 07:55:47PM +0900, KOSAKI Motohiro wrote:
> > > kswapd gets a sc.order when it is known there is a process trying to get
> > > high-order pages so it can reclaim at that order in an attempt to prevent
> > > future direct reclaim at a high-order. Your patch does not appear to depend on
> > > GFP_KERNEL at all so I found the comment misleading. Furthermore, asking it to
> > > loop again at order-0 means it may scan and reclaim more memory unnecessarily
> > > seeing as all_zones_ok was calculated based on a high-order value, not order-0.
> > 
> > It shouldn't, because it should check all that.
> > 
> 
> Ok, with KOSAKI's patch we
> 
> 1. Set order to 0 (and stop kswapd doing what it was asked to do)
> 2. goto loop_again
> 3. nr_reclaimed gets set to 0 (meaning we lose that value, but no biggie
>    as it doesn't get used by the caller anyway)
> 4. Reset all priorities
> 5. Do something like the following
> 
> 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> 		...
> 		all_zones_ok = 1;
> 		for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> 			...
> 			if (inactive_anon_is_low(zone)) {
> 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
> 					&sc, priority, 0);
> 			}
> 
> 			if (!zone_watermark_ok(zone, order, zone->pages_high,
> 					0, 0)) {
> 				end_zone = i;
> 				break;
> 			}
> 		}
> 	}
> 
>   So, by looping around, we could end up shrinking the active list again
>   before we recheck the zone watermarks depending on the size of the
>   inactive lists.

If this is a problem, it is a problem with that code, because kswapd
can be woken up for any zone at any time anyway.


> > >                 cond_resched();
> > > 
> > >                 try_to_freeze();
> > > 
> > >                 goto loop_again;
> > >         }
> > > 
> > > I used PAGE_ALLOC_COSTLY_ORDER instead of sc.order == 0 because we are
> > > expected to support allocations up to that order in a fairly reliable fashion.
> > 
> > I actually think it's better to do it for all orders, because that
> > constant is more or less arbitrary.
> 
> i.e.
> 
> if (!all_zones_ok && sc.order == 0) {
> 
> ? or something else

Well, I just don't see what's wrong with the original patch.

 
> What I did miss was that we have 
> 
>                 if (nr_reclaimed >= SWAP_CLUSTER_MAX)
>                         break;
> 
> so with my patch, kswapd is bailing out early without trying to reclaim for
> high-orders that hard. That was not what I intended as it means we only ever
> really rebalance the full system for order-0 pages and for everything else we
> do relatively light scanning. The impact is that high-order users will direct
> reclaim rather than depending on kswapd scanning very heavily. Arguably,
> this is a good thing.
> 
> However, it also means that KOSAKI's and my patches only differs in that mine
> bails early and KOSAKI rechecks everything at order-0, possibly reclaiming
> more. If the comment was not so misleading, I'd have been a lot happier.

Rechecking everything is fine by me; order-0 is going to be the most
common and most important. If higher order allocations sometimes have
to enter direct reclaim or kick off kswapd again, it isn't a big deal.


> > IOW, I don't see a big downside, and there is a real upside.
> > 
> > I think the patch is good.
> > 
> 
> Which one, KOSAKI's or my one?
> 
> Here is my one again which bails out for any high-order allocation after
> just light scanning.
> 
> ====
> 
> >From 0e09fe002d8e9956de227b880ef8458842b71ca9 Mon Sep 17 00:00:00 2001
> From: Mel Gorman <mel@csn.ul.ie>
> Date: Tue, 30 Dec 2008 18:53:23 +0000
> Subject: [PATCH] mm: stop kswapd's infinite loop at high order allocation
> 
> Wassim Dagash reported the following (editted) kswapd infinite loop problem.
> 
>   kswapd runs in some infinite loop trying to swap until order 10 of zone
>   highmem is OK.... kswapd will continue to try to balance order 10 of zone
>   highmem forever (or until someone release a very large chunk of highmem).
> 
> For costly high-order allocations, the system may never be balanced due to
> fragmentation but kswapd should not infinitely loop as a result. The
> following patch lets kswapd stop reclaiming in the event it cannot
> balance zones and the order is high-order.

This one bails out if it was a higher-order reclaim even when there is still
an order-0 shortage. I prefer to run the loop again at order==0 to avoid
that condition. A higher kswapd reclaim order shouldn't weaken kswapd's
postcondition for order-0 memory.

> 
> Reported-by: wassim dagash <wassim.dagash@gmail.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> ---
>  mm/vmscan.c |   11 ++++++++++-
>  1 files changed, 10 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 62e7f62..7b0f412 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1867,7 +1867,16 @@ out:
>  
>  		zone->prev_priority = temp_priority[i];
>  	}
> -	if (!all_zones_ok) {
> +
> +	/*
> +	 * If zones are still not balanced, loop again and continue attempting
> +	 * to rebalance the system. For high-order allocations, fragmentation
> +	 * can prevent the zones being rebalanced no matter how hard kswapd
> +	 * works, particularly on systems with little or no swap. For
> +	 * high-orders, just give up and assume interested processes will
> +	 * either direct reclaim or wake up kswapd again as necessary.
> +	 */
> +	if (!all_zones_ok && sc.order == 0) {
>  		cond_resched();
>  
>  		try_to_freeze();



* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation
  2008-12-31  4:54   ` KOSAKI Motohiro
  2008-12-31  8:59     ` wassim dagash
@ 2008-12-31 11:53     ` Mel Gorman
  2008-12-31 13:34       ` KOSAKI Motohiro
  1 sibling, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2008-12-31 11:53 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Nick Piggin, wassim dagash

On Wed, Dec 31, 2008 at 01:54:17PM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> thank you for reviewing.
> 
> >> ==
> >> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >> Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation
> >>
> >> Wassim Dagash reported following kswapd infinite loop problem.
> >>
> >>   kswapd runs in some infinite loop trying to swap until order 10 of zone
> >>   highmem is OK, While zone higmem (as I understand) has nothing to do
> >>   with contiguous memory (cause there is no 1-1 mapping) which means
> >>   kswapd will continue to try to balance order 10 of zone highmem
> >>   forever (or until someone release a very large chunk of highmem).
> >>
> >> He proposed remove contenious checking on highmem at all.
> >> However hugepage on highmem need contenious highmem page.
> >>
> >
> > I'm lacking the original problem report, but contiguous order-10 pages are
> > indeed required for hugepages in highmem and reclaiming for them should not
> > be totally disabled at any point. While no 1-1 mapping exists for the kernel,
> > contiguity is still required.
> 
> correct.
> but that's ok.
> 
> my patch only change corner case bahavior and only disable high-order
> when priority==0. typical hugepage reclaim don't need and don't reach
> priority==0.
> 
> and sorry. I agree with my "2nd loop"  word of the patch comment is a
> bit misleading.
> 

As I mentioned in the last mail, if it wasn't so misleading, I probably
would have said nothing at all :)

> 
> > kswapd gets a sc.order when it is known there is a process trying to get
> > high-order pages so it can reclaim at that order in an attempt to prevent
> > future direct reclaim at a high-order. Your patch does not appear to depend on
> > GFP_KERNEL at all so I found the comment misleading. Furthermore, asking it to
> > loop again at order-0 means it may scan and reclaim more memory unnecessarily
> > seeing as all_zones_ok was calculated based on a high-order value, not order-0.
> 
> Yup. my patch doesn't depend on GFP_KERNEL.
> 
> but, Why order-0 means it may scan more memory unnecessary?

Because we can enter shrink_active_list() depending on the size of the LRU
lists. Maybe it doesn't matter but it's what I was concerned with as well
as the fact we are changing kswapd to do work other than what it was asked for.

> all_zones_ok() is calculated by zone_watermark_ok() and zone_watermark_ok()
> depend on order argument. and my patch set order variable to 0 too.
> 
> 
> > While constantly looping trying to balance for high-orders is indeed bad,
> > I'm unconvinced this is the correct change. As we have already gone through
> > a priorities and scanned everything at the high-order, would it not make
> > more sense to do just give up with something like the following?
> >
> >       /*
> >        * If zones are still not balanced, loop again and continue attempting
> >        * to rebalance the system. For high-order allocations, fragmentation
> >        * can prevent the zones being rebalanced no matter how hard kswapd
> >        * works, particularly on systems with little or no swap. For costly
> >        * orders, just give up and assume interested processes will either
> >        * direct reclaim or wake up kswapd as necessary.
> >        */
> >        if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
> >                cond_resched();
> >
> >                try_to_freeze();
> >
> >                goto loop_again;
> >        }
> >
> > I used PAGE_ALLOC_COSTLY_ORDER instead of sc.order == 0 because we are
> > expected to support allocations up to that order in a fairly reliable fashion.
> 
> my comment is bellow.
> 
> 
> > =============
> > From: Mel Gorman <mel@csn.ul.ie>
> > Subject: [PATCH] mm: stop kswapd's infinite loop at high order allocation
> >
> >  kswapd runs in some infinite loop trying to swap until order 10 of zone
> >  highmem is OK.... kswapd will continue to try to balance order 10 of zone
> >  highmem forever (or until someone release a very large chunk of highmem).
> >
> > For costly high-order allocations, the system may never be balanced due to
> > fragmentation but kswapd should not infinitely loop as a result. The
> > following patch lets kswapd stop reclaiming in the event it cannot
> > balance zones and the order is high-order.
> >
> > Reported-by: wassim dagash <wassim.dagash@gmail.com>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 62e7f62..03ed9a0 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1867,7 +1867,16 @@ out:
> >
> >                zone->prev_priority = temp_priority[i];
> >        }
> > -       if (!all_zones_ok) {
> > +
> > +       /*
> > +        * If zones are still not balanced, loop again and continue attempting
> > +        * to rebalance the system. For high-order allocations, fragmentation
> > +        * can prevent the zones being rebalanced no matter how hard kswapd
> > +        * works, particularly on systems with little or no swap. For costly
> > +        * orders, just give up and assume interested processes will either
> > +        * direct reclaim or wake up kswapd as necessary.
> > +        */
> > +       if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
> >                cond_resched();
> >
> >                try_to_freeze();
> 
> this patch seems no good.
> kswapd come this point every SWAP_CLUSTER_MAX reclaimed because to avoid
> unnecessary priority variable decreasing.
> then "nr_reclaimed >= SWAP_CLUSTER_MAX" indicate kswapd need reclaim more.
> 
> kswapd purpose is "reclaim until pages_high", not reclaim
> SWAP_CLUSTER_MAX pages.
> 
> if your patch applied and kswapd start to reclaim for hugepage, kswapd
> exit balance_pgdat() function after to reclaim only 32 pages
> (SWAP_CLUSTER_MAX).
> 

It probably will have reclaimed more. Lumpy reclaim will have isolated
more pages down in isolate_lru_pages() and reclaimed pages within
high-order blocks of pages even if that is more than SWAP_CLUSTER_MAX pages
(right?). The bailing out does mean that kswapd no longer works as hard for
high-order pages but as I said in the other mail, this is not necessarily
a bad thing as processes will still direct reclaim if they have to.

> In the other hand, "nr_reclaimed < SWAP_CLUSTER_MAX" mean kswapd can't
> reclaim enough
> page although priority == 0.
> in this case, retry is worthless.
> 

Good point. With my patch, we would just give up in the event SWAP_CLUSTER_MAX
pages were not even reclaimed. With your patch, we rescan at order-0 to ensure
the system is actually balanced without waiting to be woken up again. It's
not what kswapd was asked to do, but arguably it's the smart thing to do.
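
To make the contrast concrete, here is a throwaway userspace sketch of the
two end-of-balance_pgdat() policies. The function names, the standalone
structure and the example values are mine, not vmscan.c's; it only mirrors
the decision logic described above.

	#include <stdbool.h>
	#include <stdio.h>

	#define SWAP_CLUSTER_MAX 32

	/* my variant: once a full scan fails, only keep looping for order-0 */
	static bool loop_again_bail(bool all_zones_ok, int order)
	{
		return !all_zones_ok && order == 0;
	}

	/*
	 * your variant: keep looping, but if a full-priority scan reclaimed
	 * less than SWAP_CLUSTER_MAX pages, drop to order-0 so the next pass
	 * only rechecks the order-0 watermarks before going back to sleep.
	 */
	static bool loop_again_retry_order0(bool all_zones_ok, int *order,
					    unsigned long nr_reclaimed)
	{
		if (all_zones_ok)
			return false;
		if (nr_reclaimed < SWAP_CLUSTER_MAX)
			*order = 0;
		return true;
	}

	int main(void)
	{
		int order = 10;	/* hugepage-sized request, zones fragmented */
		bool again;

		printf("bail variant loops again: %d\n",
		       loop_again_bail(false, order));

		again = loop_again_retry_order0(false, &order, 5);
		printf("retry variant loops again: %d, next order: %d\n",
		       again, order);
		return 0;
	}

With zones still unbalanced at order-10 and a failed scan, the first
variant stops and the second goes around again, at order-0.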

> sorting out again.
> "goto loop_again" reaching happend by two case.
> 
> 1. kswapd reclaimed SWAP_CLUSTER_MAX pages.
>     at that time, kswapd reset priority variable to prevent
> unnecessary priority decreasing.
>     I don't hope this behavior change.
> 2. kswapd scanned until priority==0.
>     this case is debatable. my patch reset any order to 0. but
> following code is also considerable to me. (sorry for tab corrupted,
> current my mail environment is very poor)
> 
> 
> code-A:
>        if (!all_zones_ok) {
>               if ((nr_reclaimed >= SWAP_CLUSTER_MAX) ||
>                  (sc.order <= PAGE_ALLOC_COSTLY_ORDER)) {
>                           cond_resched();
>                            try_to_freeze();
>                            goto loop_again;
>                 }
>        }
> 
> or
> 
> code-B:
>        if (!all_zones_ok) {
>               cond_resched();
>                try_to_freeze();
> 
>               if (nr_reclaimed >= SWAP_CLUSTER_MAX)
>                            goto loop_again;
> 
>               if (sc.order <= PAGE_ALLOC_COSTLY_ORDER)) {
>                            order = sc.order = 0;
>                            goto loop_again;
>               }
>        }
> 
> 
> However, I still like my original proposal because ..
>   - code-A forget to order-1 (for stack) allocation also can cause
> infinite loop.

Conceivably it would cause an infinite loop, although we are meant to be
able to grant allocations of that order. The point stands though: it is
not guaranteed, which is why I changed it to sc.order == 0 in the second
revision.
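
For reference, a tiny throwaway userspace program (not kernel code; the
value of PAGE_ALLOC_COSTLY_ORDER, 3, is from memory) showing which orders
each cut-off would keep kswapd retrying for while zones are unbalanced:

	#include <stdio.h>

	#define PAGE_ALLOC_COSTLY_ORDER 3	/* mainline value, from memory */

	int main(void)
	{
		int order;

		for (order = 0; order <= 10; order++)
			printf("order %2d: retry if 'order == 0': %d, "
			       "retry if '<= COSTLY': %d\n",
			       order, order == 0,
			       order <= PAGE_ALLOC_COSTLY_ORDER);
		return 0;
	}

The '<= PAGE_ALLOC_COSTLY_ORDER' cut-off keeps kswapd retrying for orders
1-3 (including the order-1 stack case you mentioned), which is the residual
infinite-loop risk; 'order == 0' does not.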

>   - code-B doesn't simpler than my original proposal.
> 

Indeed.

> What do you think it?
> 

After looking at this for long enough, our patches are functionally similar
except that you loop a second time at order-0 without waiting for kswapd to
be woken up. It may reclaim more, but if people are ok with that, I'll stay
quiet. Fix the comment and I'll be happy (or even delete it; I prefer no
comments to misleading ones :/). Maybe something like

                /*
                 * Fragmentation may mean that the system cannot be
                 * rebalanced for high-order allocations in all zones.
                 * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
                 * it means the zones have been fully scanned and are still
                 * not balanced. For high-order allocations, there is 
                 * little point trying all over again as kswapd may
                 * infinite loop.
                 *
                 * Instead, recheck all watermarks at order-0 as they
                 * are the most important. If watermarks are ok, kswapd will go
                 * back to sleep. High-order users can still direct reclaim
                 * if they wish.
                 */

?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation
  2008-12-31  8:59     ` wassim dagash
@ 2008-12-31 12:05       ` Mel Gorman
  2008-12-31 12:24         ` wassim dagash
  0 siblings, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2008-12-31 12:05 UTC (permalink / raw)
  To: wassim dagash; +Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Nick Piggin

On Wed, Dec 31, 2008 at 10:59:58AM +0200, wassim dagash wrote:
> Hi ,
> Thank you all for reviewing.
> Why don't we implement a solution where the order is defined per zone?

kswapd scans trying to balance the whole system, not just one of the
zones. You don't know which of the zones would be the best to rebalance
and, glancing through your patch, the zone that would be balanced is the
last zone in the zonelist, i.e. kswapd will try to rebalance the
least-preferred zone to be used by the calling process.

> I implemented such a solution for my kernel (2.6.18) and tested it, it
> worked fine for me. Attached a patch with a solution for 2.6.28
> (compile tested only).
> 

One big one:

zone_watermark_ok() now always calculates order based on
zone->kswapd_max_order. This means that a process that wants an order-0 page
may check watermarks at order-10 because some other process overwrote the
zone value. That is unfair.

One less obvious one:

You add a field that is written often beside a field that is read-mostly
in struct zone. This results in poorer cache behaviour. It's easily
fixed though.
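
By way of illustration only (a layout sketch, not a patch; the struct name
is made up, but the watermark fields and the annotation are the usual
ones), the idea is to keep the write-often field off the cache line that
holds the read-mostly data:

	struct zone_layout_sketch {
		/* read-mostly: consulted on every allocation */
		unsigned long	pages_min;
		unsigned long	pages_low;
		unsigned long	pages_high;

		/* write-often: updated on every kswapd wakeup */
		int		kswapd_max_order ____cacheline_aligned_in_smp;
	};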

> On Wed, Dec 31, 2008 at 6:54 AM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> > Hi
> >
> > thank you for reviewing.
> >
> >>> ==
> >>> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >>> Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation
> >>>
> >>> Wassim Dagash reported following kswapd infinite loop problem.
> >>>
> >>>   kswapd runs in some infinite loop trying to swap until order 10 of zone
> >>>   highmem is OK, While zone higmem (as I understand) has nothing to do
> >>>   with contiguous memory (cause there is no 1-1 mapping) which means
> >>>   kswapd will continue to try to balance order 10 of zone highmem
> >>>   forever (or until someone release a very large chunk of highmem).
> >>>
> >>> He proposed remove contenious checking on highmem at all.
> >>> However hugepage on highmem need contenious highmem page.
> >>>
> >>
> >> I'm lacking the original problem report, but contiguous order-10 pages are
> >> indeed required for hugepages in highmem and reclaiming for them should not
> >> be totally disabled at any point. While no 1-1 mapping exists for the kernel,
> >> contiguity is still required.
> >
> > correct.
> > but that's ok.
> >
> > my patch only change corner case bahavior and only disable high-order
> > when priority==0. typical hugepage reclaim don't need and don't reach
> > priority==0.
> >
> > and sorry. I agree with my "2nd loop"  word of the patch comment is a
> > bit misleading.
> >
> >
> >> kswapd gets a sc.order when it is known there is a process trying to get
> >> high-order pages so it can reclaim at that order in an attempt to prevent
> >> future direct reclaim at a high-order. Your patch does not appear to depend on
> >> GFP_KERNEL at all so I found the comment misleading. Furthermore, asking it to
> >> loop again at order-0 means it may scan and reclaim more memory unnecessarily
> >> seeing as all_zones_ok was calculated based on a high-order value, not order-0.
> >
> > Yup. my patch doesn't depend on GFP_KERNEL.
> >
> > but, Why order-0 means it may scan more memory unnecessary?
> > all_zones_ok() is calculated by zone_watermark_ok() and zone_watermark_ok()
> > depend on order argument. and my patch set order variable to 0 too.
> >
> >
> >> While constantly looping trying to balance for high-orders is indeed bad,
> >> I'm unconvinced this is the correct change. As we have already gone through
> >> a priorities and scanned everything at the high-order, would it not make
> >> more sense to do just give up with something like the following?
> >>
> >>       /*
> >>        * If zones are still not balanced, loop again and continue attempting
> >>        * to rebalance the system. For high-order allocations, fragmentation
> >>        * can prevent the zones being rebalanced no matter how hard kswapd
> >>        * works, particularly on systems with little or no swap. For costly
> >>        * orders, just give up and assume interested processes will either
> >>        * direct reclaim or wake up kswapd as necessary.
> >>        */
> >>        if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
> >>                cond_resched();
> >>
> >>                try_to_freeze();
> >>
> >>                goto loop_again;
> >>        }
> >>
> >> I used PAGE_ALLOC_COSTLY_ORDER instead of sc.order == 0 because we are
> >> expected to support allocations up to that order in a fairly reliable fashion.
> >
> > my comment is bellow.
> >
> >
> >> =============
> >> From: Mel Gorman <mel@csn.ul.ie>
> >> Subject: [PATCH] mm: stop kswapd's infinite loop at high order allocation
> >>
> >>  kswapd runs in some infinite loop trying to swap until order 10 of zone
> >>  highmem is OK.... kswapd will continue to try to balance order 10 of zone
> >>  highmem forever (or until someone release a very large chunk of highmem).
> >>
> >> For costly high-order allocations, the system may never be balanced due to
> >> fragmentation but kswapd should not infinitely loop as a result. The
> >> following patch lets kswapd stop reclaiming in the event it cannot
> >> balance zones and the order is high-order.
> >>
> >> Reported-by: wassim dagash <wassim.dagash@gmail.com>
> >> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 62e7f62..03ed9a0 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1867,7 +1867,16 @@ out:
> >>
> >>                zone->prev_priority = temp_priority[i];
> >>        }
> >> -       if (!all_zones_ok) {
> >> +
> >> +       /*
> >> +        * If zones are still not balanced, loop again and continue attempting
> >> +        * to rebalance the system. For high-order allocations, fragmentation
> >> +        * can prevent the zones being rebalanced no matter how hard kswapd
> >> +        * works, particularly on systems with little or no swap. For costly
> >> +        * orders, just give up and assume interested processes will either
> >> +        * direct reclaim or wake up kswapd as necessary.
> >> +        */
> >> +       if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
> >>                cond_resched();
> >>
> >>                try_to_freeze();
> >
> > this patch seems no good.
> > kswapd come this point every SWAP_CLUSTER_MAX reclaimed because to avoid
> > unnecessary priority variable decreasing.
> > then "nr_reclaimed >= SWAP_CLUSTER_MAX" indicate kswapd need reclaim more.
> >
> > kswapd purpose is "reclaim until pages_high", not reclaim
> > SWAP_CLUSTER_MAX pages.
> >
> > if your patch applied and kswapd start to reclaim for hugepage, kswapd
> > exit balance_pgdat() function after to reclaim only 32 pages
> > (SWAP_CLUSTER_MAX).
> >
> > In the other hand, "nr_reclaimed < SWAP_CLUSTER_MAX" mean kswapd can't
> > reclaim enough
> > page although priority == 0.
> > in this case, retry is worthless.
> >
> > sorting out again.
> > "goto loop_again" reaching happend by two case.
> >
> > 1. kswapd reclaimed SWAP_CLUSTER_MAX pages.
> >    at that time, kswapd reset priority variable to prevent
> > unnecessary priority decreasing.
> >    I don't hope this behavior change.
> > 2. kswapd scanned until priority==0.
> >    this case is debatable. my patch reset any order to 0. but
> > following code is also considerable to me. (sorry for tab corrupted,
> > current my mail environment is very poor)
> >
> >
> > code-A:
> >       if (!all_zones_ok) {
> >              if ((nr_reclaimed >= SWAP_CLUSTER_MAX) ||
> >                 (sc.order <= PAGE_ALLOC_COSTLY_ORDER)) {
> >                          cond_resched();
> >                           try_to_freeze();
> >                           goto loop_again;
> >                }
> >       }
> >
> > or
> >
> > code-B:
> >       if (!all_zones_ok) {
> >              cond_resched();
> >               try_to_freeze();
> >
> >              if (nr_reclaimed >= SWAP_CLUSTER_MAX)
> >                           goto loop_again;
> >
> >              if (sc.order <= PAGE_ALLOC_COSTLY_ORDER)) {
> >                           order = sc.order = 0;
> >                           goto loop_again;
> >              }
> >       }
> >
> >
> > However, I still like my original proposal because ..
> >  - code-A forget to order-1 (for stack) allocation also can cause
> > infinite loop.
> >  - code-B doesn't simpler than my original proposal.
> >
> > What do you think it?
> >
> 
> 
> 
> -- 
> too much is never enough!!!!!



-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation
  2008-12-31 11:16       ` Nick Piggin
@ 2008-12-31 12:11         ` Mel Gorman
  0 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2008-12-31 12:11 UTC (permalink / raw)
  To: Nick Piggin; +Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, wassim dagash

On Wed, Dec 31, 2008 at 12:16:47PM +0100, Nick Piggin wrote:
> On Wed, Dec 31, 2008 at 11:06:19AM +0000, Mel Gorman wrote:
> > On Wed, Dec 31, 2008 at 02:32:33AM +0100, Nick Piggin wrote:
> > > On Tue, Dec 30, 2008 at 06:59:19PM +0000, Mel Gorman wrote:
> > > > On Tue, Dec 30, 2008 at 07:55:47PM +0900, KOSAKI Motohiro wrote:
> > > > kswapd gets a sc.order when it is known there is a process trying to get
> > > > high-order pages so it can reclaim at that order in an attempt to prevent
> > > > future direct reclaim at a high-order. Your patch does not appear to depend on
> > > > GFP_KERNEL at all so I found the comment misleading. Furthermore, asking it to
> > > > loop again at order-0 means it may scan and reclaim more memory unnecessarily
> > > > seeing as all_zones_ok was calculated based on a high-order value, not order-0.
> > > 
> > > It shouldn't, because it should check all that.
> > > 
> > 
> > Ok, with KOSAKI's patch we
> > 
> > 1. Set order to 0 (and stop kswapd doing what it was asked to do)
> > 2. goto loop_again
> > 3. nr_reclaimed gets set to 0 (meaning we lose that value, but no biggie
> >    as it doesn't get used by the caller anyway)
> > 4. Reset all priorities
> > 5. Do something like the following
> > 
> > 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > 		...
> > 		all_zones_ok = 1;
> > 		for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> > 			...
> > 			if (inactive_anon_is_low(zone)) {
> > 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
> > 					&sc, priority, 0);
> > 			}
> > 
> > 			if (!zone_watermark_ok(zone, order, zone->pages_high,
> > 					0, 0)) {
> > 				end_zone = i;
> > 				break;
> > 			}
> > 		}
> > 	}
> > 
> >   So, by looping around, we could end up shrinking the active list again
> >   before we recheck the zone watermarks depending on the size of the
> >   inactive lists.
> 
> If this is a problem, it is a problem with that code, because kswapd
> can be woken up for any zone at any time anyway.
> 
> 
> > > >                 cond_resched();
> > > > 
> > > >                 try_to_freeze();
> > > > 
> > > >                 goto loop_again;
> > > >         }
> > > > 
> > > > I used PAGE_ALLOC_COSTLY_ORDER instead of sc.order == 0 because we are
> > > > expected to support allocations up to that order in a fairly reliable fashion.
> > > 
> > > I actually think it's better to do it for all orders, because that
> > > constant is more or less arbitrary.
> > 
> > i.e.
> > 
> > if (!all_zones_ok && sc.order == 0) {
> > 
> > ? or something else
> 
> Well, I just don't see what's wrong with the original patch.
> 

I've more or less convinced myself it's ok, as any anomalies I spotted have
either been described as intentional behaviour or are arguably correct. A
fixed-up (or deleted - misleading comments suck) comment and I'm happy.

>  
> > What I did miss was that we have 
> > 
> >                 if (nr_reclaimed >= SWAP_CLUSTER_MAX)
> >                         break;
> > 
> > so with my patch, kswapd is bailing out early without trying to reclaim for
> > high-orders that hard. That was not what I intended as it means we only ever
> > really rebalance the full system for order-0 pages and for everything else we
> > do relatively light scanning. The impact is that high-order users will direct
> > reclaim rather than depending on kswapd scanning very heavily. Arguably,
> > this is a good thing.
> > 
> > However, it also means that KOSAKI's and my patches only differs in that mine
> > bails early and KOSAKI rechecks everything at order-0, possibly reclaiming
> > more. If the comment was not so misleading, I'd have been a lot happier.
> 
> Rechecking everything is fine by me; order-0 is going to be the most
> common and most important. If higher order allocations sometimes have
> to enter direct reclaim or kick off kswapd again, it isn't a big deal.
> 

Grand so. Initially it looked like accidental rather than intentional
behaviour but after thinking about it some more, it should be ok.

> 
> > > IOW, I don't see a big downside, and there is a real upside.
> > > 
> > > I think the patch is good.
> > > 
> > 
> > Which one, KOSAKI's or my one?
> > 
> > Here is my one again which bails out for any high-order allocation after
> > just light scanning.
> > 
> > ====
> > 
> > From 0e09fe002d8e9956de227b880ef8458842b71ca9 Mon Sep 17 00:00:00 2001
> > From: Mel Gorman <mel@csn.ul.ie>
> > Date: Tue, 30 Dec 2008 18:53:23 +0000
> > Subject: [PATCH] mm: stop kswapd's infinite loop at high order allocation
> > 
> > Wassim Dagash reported the following (editted) kswapd infinite loop problem.
> > 
> >   kswapd runs in some infinite loop trying to swap until order 10 of zone
> >   highmem is OK.... kswapd will continue to try to balance order 10 of zone
> >   highmem forever (or until someone release a very large chunk of highmem).
> > 
> > For costly high-order allocations, the system may never be balanced due to
> > fragmentation but kswapd should not infinitely loop as a result. The
> > following patch lets kswapd stop reclaiming in the event it cannot
> > balance zones and the order is high-order.
> 
> This one bails out if it was a higher order reclaim, but there is still
> an order-0 shortage. I prefer to run the loop again at order==0 to avoid
> that condition. A higher kswapd reclaim order shouldn't weaken kswapd
> postcondition for order-0 memory.
> 
> > 
> > Reported-by: wassim dagash <wassim.dagash@gmail.com>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > ---
> >  mm/vmscan.c |   11 ++++++++++-
> >  1 files changed, 10 insertions(+), 1 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 62e7f62..7b0f412 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1867,7 +1867,16 @@ out:
> >  
> >  		zone->prev_priority = temp_priority[i];
> >  	}
> > -	if (!all_zones_ok) {
> > +
> > +	/*
> > +	 * If zones are still not balanced, loop again and continue attempting
> > +	 * to rebalance the system. For high-order allocations, fragmentation
> > +	 * can prevent the zones being rebalanced no matter how hard kswapd
> > +	 * works, particularly on systems with little or no swap. For
> > +	 * high-orders, just give up and assume interested processes will
> > +	 * either direct reclaim or wake up kswapd again as necessary.
> > +	 */
> > +	if (!all_zones_ok && sc.order == 0) {
> >  		cond_resched();
> >  
> >  		try_to_freeze();
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation
  2008-12-31 12:05       ` Mel Gorman
@ 2008-12-31 12:24         ` wassim dagash
  0 siblings, 0 replies; 20+ messages in thread
From: wassim dagash @ 2008-12-31 12:24 UTC (permalink / raw)
  To: Mel Gorman; +Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Nick Piggin

On Wed, Dec 31, 2008 at 2:05 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Wed, Dec 31, 2008 at 10:59:58AM +0200, wassim dagash wrote:
>> Hi ,
>> Thank you all for reviewing.
>> Why don't we implement a solution where the order is defined per zone?
>
> kswapd is scanning trying to balance the whole system, not just one of
> the zones. You don't know which of the zones would be the best to
> rebalance and glancing through your path, the zone that would be
> balanced is the last zone in the zonelist. i.e. kswapd will try to
> rebalance the least-preferred zone to be used by the calling process.
>

You are right, but what is meant by 'order per zone' is that we would
balance each zone to its own kswapd_max_order, which is the highest order
for which an allocation was made from that zone.
What I tried to rebalance is the zone for which the allocation was
made; the other zones are fallbacks for that allocation (correct me if
I'm wrong). As I understood from the documentation, the first zone on the
list is the zone for which the allocation was made.
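
Roughly what I mean, as a hand-written sketch for discussion (this is not
the attached patch, and the names below are only illustrative):

	/* per-zone record of the highest order a wakeup asked for */
	struct per_zone_sketch {
		int kswapd_max_order;
	};

	static void note_wakeup_order(struct per_zone_sketch *z, int order)
	{
		if (order > z->kswapd_max_order)
			z->kswapd_max_order = order;
	}

	static int balance_order(const struct per_zone_sketch *z)
	{
		/* balance this zone at its own recorded order,
		 * not at one node-wide value */
		return z->kswapd_max_order;
	}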

>> I implemented such a solution for my kernel (2.6.18) and tested it, it
>> worked fine for me. Attached a patch with a solution for 2.6.28
>> (compile tested only).
>>
>
> One big one;
>
> zone_watermark_ok() now always calculates order based on
> zone->kswapd_max_order. This means that a process that wants an order-0 page
> may check watermarks at order-10 because some other process overwrote the
> zone value. That is unfair.
>

This can be fixed by differentiating between the rebalance order and the allocation order.

> One less obvious one;
>
> You add a field that is written often beside a field that is read-mostly
> in struct zone. This results in poorer cache behaviour. It's easily
> fixed though.
>

Sorry, but I'm very new to kernel programming. I read your book (and
other books) in order to figure out why kswapd loops endlessly :)

I just wanted to understand why 'kswapd_max_order' was implemented
'per node' and not 'per zone'? Is there a reason?

Thanks,

>> On Wed, Dec 31, 2008 at 6:54 AM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> > Hi
>> >
>> > thank you for reviewing.
>> >
>> >>> ==
>> >>> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> >>> Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation
>> >>>
>> >>> Wassim Dagash reported following kswapd infinite loop problem.
>> >>>
>> >>>   kswapd runs in some infinite loop trying to swap until order 10 of zone
>> >>>   highmem is OK, While zone higmem (as I understand) has nothing to do
>> >>>   with contiguous memory (cause there is no 1-1 mapping) which means
>> >>>   kswapd will continue to try to balance order 10 of zone highmem
>> >>>   forever (or until someone release a very large chunk of highmem).
>> >>>
>> >>> He proposed remove contenious checking on highmem at all.
>> >>> However hugepage on highmem need contenious highmem page.
>> >>>
>> >>
>> >> I'm lacking the original problem report, but contiguous order-10 pages are
>> >> indeed required for hugepages in highmem and reclaiming for them should not
>> >> be totally disabled at any point. While no 1-1 mapping exists for the kernel,
>> >> contiguity is still required.
>> >
>> > correct.
>> > but that's ok.
>> >
>> > my patch only change corner case bahavior and only disable high-order
>> > when priority==0. typical hugepage reclaim don't need and don't reach
>> > priority==0.
>> >
>> > and sorry. I agree with my "2nd loop"  word of the patch comment is a
>> > bit misleading.
>> >
>> >
>> >> kswapd gets a sc.order when it is known there is a process trying to get
>> >> high-order pages so it can reclaim at that order in an attempt to prevent
>> >> future direct reclaim at a high-order. Your patch does not appear to depend on
>> >> GFP_KERNEL at all so I found the comment misleading. Furthermore, asking it to
>> >> loop again at order-0 means it may scan and reclaim more memory unnecessarily
>> >> seeing as all_zones_ok was calculated based on a high-order value, not order-0.
>> >
>> > Yup. my patch doesn't depend on GFP_KERNEL.
>> >
>> > but, Why order-0 means it may scan more memory unnecessary?
>> > all_zones_ok() is calculated by zone_watermark_ok() and zone_watermark_ok()
>> > depend on order argument. and my patch set order variable to 0 too.
>> >
>> >
>> >> While constantly looping trying to balance for high-orders is indeed bad,
>> >> I'm unconvinced this is the correct change. As we have already gone through
>> >> a priorities and scanned everything at the high-order, would it not make
>> >> more sense to do just give up with something like the following?
>> >>
>> >>       /*
>> >>        * If zones are still not balanced, loop again and continue attempting
>> >>        * to rebalance the system. For high-order allocations, fragmentation
>> >>        * can prevent the zones being rebalanced no matter how hard kswapd
>> >>        * works, particularly on systems with little or no swap. For costly
>> >>        * orders, just give up and assume interested processes will either
>> >>        * direct reclaim or wake up kswapd as necessary.
>> >>        */
>> >>        if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
>> >>                cond_resched();
>> >>
>> >>                try_to_freeze();
>> >>
>> >>                goto loop_again;
>> >>        }
>> >>
>> >> I used PAGE_ALLOC_COSTLY_ORDER instead of sc.order == 0 because we are
>> >> expected to support allocations up to that order in a fairly reliable fashion.
>> >
>> > my comment is bellow.
>> >
>> >
>> >> =============
>> >> From: Mel Gorman <mel@csn.ul.ie>
>> >> Subject: [PATCH] mm: stop kswapd's infinite loop at high order allocation
>> >>
>> >>  kswapd runs in some infinite loop trying to swap until order 10 of zone
>> >>  highmem is OK.... kswapd will continue to try to balance order 10 of zone
>> >>  highmem forever (or until someone release a very large chunk of highmem).
>> >>
>> >> For costly high-order allocations, the system may never be balanced due to
>> >> fragmentation but kswapd should not infinitely loop as a result. The
>> >> following patch lets kswapd stop reclaiming in the event it cannot
>> >> balance zones and the order is high-order.
>> >>
>> >> Reported-by: wassim dagash <wassim.dagash@gmail.com>
>> >> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> >>
>> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> index 62e7f62..03ed9a0 100644
>> >> --- a/mm/vmscan.c
>> >> +++ b/mm/vmscan.c
>> >> @@ -1867,7 +1867,16 @@ out:
>> >>
>> >>                zone->prev_priority = temp_priority[i];
>> >>        }
>> >> -       if (!all_zones_ok) {
>> >> +
>> >> +       /*
>> >> +        * If zones are still not balanced, loop again and continue attempting
>> >> +        * to rebalance the system. For high-order allocations, fragmentation
>> >> +        * can prevent the zones being rebalanced no matter how hard kswapd
>> >> +        * works, particularly on systems with little or no swap. For costly
>> >> +        * orders, just give up and assume interested processes will either
>> >> +        * direct reclaim or wake up kswapd as necessary.
>> >> +        */
>> >> +       if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
>> >>                cond_resched();
>> >>
>> >>                try_to_freeze();
>> >
>> > this patch seems no good.
>> > kswapd come this point every SWAP_CLUSTER_MAX reclaimed because to avoid
>> > unnecessary priority variable decreasing.
>> > then "nr_reclaimed >= SWAP_CLUSTER_MAX" indicate kswapd need reclaim more.
>> >
>> > kswapd purpose is "reclaim until pages_high", not reclaim
>> > SWAP_CLUSTER_MAX pages.
>> >
>> > if your patch applied and kswapd start to reclaim for hugepage, kswapd
>> > exit balance_pgdat() function after to reclaim only 32 pages
>> > (SWAP_CLUSTER_MAX).
>> >
>> > In the other hand, "nr_reclaimed < SWAP_CLUSTER_MAX" mean kswapd can't
>> > reclaim enough
>> > page although priority == 0.
>> > in this case, retry is worthless.
>> >
>> > sorting out again.
>> > "goto loop_again" reaching happend by two case.
>> >
>> > 1. kswapd reclaimed SWAP_CLUSTER_MAX pages.
>> >    at that time, kswapd reset priority variable to prevent
>> > unnecessary priority decreasing.
>> >    I don't hope this behavior change.
>> > 2. kswapd scanned until priority==0.
>> >    this case is debatable. my patch reset any order to 0. but
>> > following code is also considerable to me. (sorry for tab corrupted,
>> > current my mail environment is very poor)
>> >
>> >
>> > code-A:
>> >       if (!all_zones_ok) {
>> >              if ((nr_reclaimed >= SWAP_CLUSTER_MAX) ||
>> >                 (sc.order <= PAGE_ALLOC_COSTLY_ORDER)) {
>> >                          cond_resched();
>> >                           try_to_freeze();
>> >                           goto loop_again;
>> >                }
>> >       }
>> >
>> > or
>> >
>> > code-B:
>> >       if (!all_zones_ok) {
>> >              cond_resched();
>> >               try_to_freeze();
>> >
>> >              if (nr_reclaimed >= SWAP_CLUSTER_MAX)
>> >                           goto loop_again;
>> >
>> >              if (sc.order <= PAGE_ALLOC_COSTLY_ORDER)) {
>> >                           order = sc.order = 0;
>> >                           goto loop_again;
>> >              }
>> >       }
>> >
>> >
>> > However, I still like my original proposal because ..
>> >  - code-A forget to order-1 (for stack) allocation also can cause
>> > infinite loop.
>> >  - code-B doesn't simpler than my original proposal.
>> >
>> > What do you think it?
>> >
>>
>>
>>
>> --
>> too much is never enough!!!!!
>
>
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>



-- 
too much is never enough!!!!!


* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation
  2008-12-31 11:53     ` Mel Gorman
@ 2008-12-31 13:34       ` KOSAKI Motohiro
  2009-01-01 14:52         ` [PATCH] mm: stop kswapd's infinite loop at high order allocation take2 KOSAKI Motohiro
  0 siblings, 1 reply; 20+ messages in thread
From: KOSAKI Motohiro @ 2008-12-31 13:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Nick Piggin,
	wassim dagash

Hi

> > > I'm lacking the original problem report, but contiguous order-10 pages are
> > > indeed required for hugepages in highmem and reclaiming for them should not
> > > be totally disabled at any point. While no 1-1 mapping exists for the kernel,
> > > contiguity is still required.
> > 
> > correct.
> > but that's ok.
> > 
> > my patch only change corner case bahavior and only disable high-order
> > when priority==0. typical hugepage reclaim don't need and don't reach
> > priority==0.
> > 
> > and sorry. I agree with my "2nd loop"  word of the patch comment is a
> > bit misleading.
> > 
> 
> As I mentioned in the last mail, if it wasn't so misleading, I probably
> would have said nothing at all :)

very sorry.



> > > kswapd gets a sc.order when it is known there is a process trying to get
> > > high-order pages so it can reclaim at that order in an attempt to prevent
> > > future direct reclaim at a high-order. Your patch does not appear to depend on
> > > GFP_KERNEL at all so I found the comment misleading. Furthermore, asking it to
> > > loop again at order-0 means it may scan and reclaim more memory unnecessarily
> > > seeing as all_zones_ok was calculated based on a high-order value, not order-0.
> > 
> > Yup. my patch doesn't depend on GFP_KERNEL.
> > 
> > but, Why order-0 means it may scan more memory unnecessary?
> 
> Because we can enter shrink_active_list() depending on the size of the LRU
> lists. Maybe it doesn't matter but it's what I was concerned with as well
> as the fact we are changing kswapd to do work other than what it was asked for.

I don't think it matters.

	if (inactive_anon_is_low(zone)) {
		shrink_active_list(SWAP_CLUSTER_MAX, zone,
				   &sc, priority, 0);
	}

This code isn't reclaim; it adjusts the number of pages on the inactive
list. If the number of inactive anon pages is already sufficient,
inactive_anon_is_low() returns 0, so the code above has no bad side effect.
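
Roughly (from memory, simplified; please check mm/vmscan.c for the exact
form) the check is only a ratio test between the two anon lists:

	static int inactive_anon_is_low(struct zone *zone)
	{
		unsigned long active, inactive;

		active = zone_page_state(zone, NR_ACTIVE_ANON);
		inactive = zone_page_state(zone, NR_INACTIVE_ANON);

		/* ask for deactivation only while the inactive list is small */
		return inactive * zone->inactive_ratio < active;
	}

So when the inactive anon list is already large enough relative to the
active one, it returns 0 and the extra pass does nothing.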



> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 62e7f62..03ed9a0 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1867,7 +1867,16 @@ out:
> > >
> > >                zone->prev_priority = temp_priority[i];
> > >        }
> > > -       if (!all_zones_ok) {
> > > +
> > > +       /*
> > > +        * If zones are still not balanced, loop again and continue attempting
> > > +        * to rebalance the system. For high-order allocations, fragmentation
> > > +        * can prevent the zones being rebalanced no matter how hard kswapd
> > > +        * works, particularly on systems with little or no swap. For costly
> > > +        * orders, just give up and assume interested processes will either
> > > +        * direct reclaim or wake up kswapd as necessary.
> > > +        */
> > > +       if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
> > >                cond_resched();
> > >
> > >                try_to_freeze();
> > 
> > this patch seems no good.
> > kswapd come this point every SWAP_CLUSTER_MAX reclaimed because to avoid
> > unnecessary priority variable decreasing.
> > then "nr_reclaimed >= SWAP_CLUSTER_MAX" indicate kswapd need reclaim more.
> > 
> > kswapd purpose is "reclaim until pages_high", not reclaim
> > SWAP_CLUSTER_MAX pages.
> > 
> > if your patch applied and kswapd start to reclaim for hugepage, kswapd
> > exit balance_pgdat() function after to reclaim only 32 pages
> > (SWAP_CLUSTER_MAX).
> > 
> 
> It probably will have reclaimed more. Lumpy reclaim will have isolated
> more pages in down in  isolate_lru_pages() and reclaimed pages within a
> high-order blocks of pages even if that is more than SWAP_CLUSTER_MAX pages
> (right?). 

Correct.
But please recall that lumpy reclaim tries to get contiguous pages; it
does not guarantee getting contiguous pages.

So even when nr_reclaimed >= SWAP_CLUSTER_MAX, it can happen that no
contiguous memory becomes available.


> The bailing out does mean that kswapd no longer works as hard for
> high-order pages but as I said in the other mail, this is not necessarily
> a bad thing as processes will still direct reclaim if they have to.
> 
> > In the other hand, "nr_reclaimed < SWAP_CLUSTER_MAX" mean kswapd can't
> > reclaim enough
> > page although priority == 0.
> > in this case, retry is worthless.
> > 
> 
> Good point. With my patch, we would just give up in the event SWAP_CLUSTER_MAX
> pages were not even reclaimed. With your patch, we rescan at order-0 to ensure
> the system is actually balanced without waiting to be woken up again. It's
> not what kswapd was asked to do, but arguably it's the smart thing to do.

Agreed.



> AFter looking at this for long enough, our patches are functionally similar
> except you loop a second time at order-0 without waiting for kswapd to be
> woken up. It may reclaim more but if people are ok with that, I'll stay
> quiet. Fix the comment and I'll be happy (or even delete it, I prefer no
> comments to misleading ones :/). Maybe something like
> 
>                 /*
>                  * Fragmentation may mean that the system cannot be
>                  * rebalanced for high-order allocations in all zones.
>                  * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
>                  * it means the zones have been fully scanned and are still
>                  * not balanced. For high-order allocations, there is 
>                  * little point trying all over again as kswapd may
>                  * infinite loop.
>                  *
>                  * Instead, recheck all watermarks at order-0 as they
>                  * are the most important. If watermarks are ok, kswapd will go
>                  * back to sleep. High-order users can still direct reclaim
>                  * if they wish.
>                  */
> 
> ?

Excellent. I like this a lot and will merge it into my patch.
I'll resend a new patch.




* [PATCH] mm: stop kswapd's infinite loop at high order allocation take2
  2008-12-31 13:34       ` KOSAKI Motohiro
@ 2009-01-01 14:52         ` KOSAKI Motohiro
  2009-01-02  9:55           ` MinChan Kim
  2009-01-02 11:14           ` Mel Gorman
  0 siblings, 2 replies; 20+ messages in thread
From: KOSAKI Motohiro @ 2009-01-01 14:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Nick Piggin,
	wassim dagash


> >                 /*
> >                  * Fragmentation may mean that the system cannot be
> >                  * rebalanced for high-order allocations in all zones.
> >                  * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
> >                  * it means the zones have been fully scanned and are still
> >                  * not balanced. For high-order allocations, there is 
> >                  * little point trying all over again as kswapd may
> >                  * infinite loop.
> >                  *
> >                  * Instead, recheck all watermarks at order-0 as they
> >                  * are the most important. If watermarks are ok, kswapd will go
> >                  * back to sleep. High-order users can still direct reclaim
> >                  * if they wish.
> >                  */
> > 
> > ?
> 
> Excellent. I strongly like this and I hope merge it to my patch.
> I'll resend new patch.

Done.



==
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation

Wassim Dagash reported the following kswapd infinite loop problem.

  kswapd runs in some infinite loop trying to swap until order 10 of zone
  highmem is OK.... kswapd will continue to try to balance order 10 of zone
  highmem forever (or until someone release a very large chunk of highmem).

For non order-0 allocations, the system may never be balanced due to
fragmentation but kswapd should not infinitely loop as a result. 

Instead, recheck all watermarks at order-0 as they are the most important. 
If watermarks are ok, kswapd will go back to sleep. 


Reported-by: wassim dagash <wassim.dagash@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c	2008-12-25 08:26:37.000000000 +0900
+++ b/mm/vmscan.c	2009-01-01 01:56:02.000000000 +0900
@@ -1872,6 +1872,23 @@ out:
 
 		try_to_freeze();
 
+		/*
+		 * Fragmentation may mean that the system cannot be
+		 * rebalanced for high-order allocations in all zones.
+		 * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
+		 * it means the zones have been fully scanned and are still
+		 * not balanced. For high-order allocations, there is
+		 * little point trying all over again as kswapd may
+		 * infinite loop.
+		 *
+		 * Instead, recheck all watermarks at order-0 as they
+		 * are the most important. If watermarks are ok, kswapd will go
+		 * back to sleep. High-order users can still direct reclaim
+		 * if they wish.
+		 */
+		if (nr_reclaimed < SWAP_CLUSTER_MAX)
+			order = sc.order = 0;
+
 		goto loop_again;
 	}
 




* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation take2
  2009-01-01 14:52         ` [PATCH] mm: stop kswapd's infinite loop at high order allocation take2 KOSAKI Motohiro
@ 2009-01-02  9:55           ` MinChan Kim
  2009-01-02 10:00             ` KOSAKI Motohiro
  2009-01-02 11:14           ` Mel Gorman
  1 sibling, 1 reply; 20+ messages in thread
From: MinChan Kim @ 2009-01-02  9:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, LKML, linux-mm, Andrew Morton, Nick Piggin, wassim dagash

Hi, kosaki-san.

I read the previous threads just now. It's rather late :(.

I think the sudden big change of order from 10 to 0 is rather awkward.

This problem causes zone_watermark_ok() to fail.
It means this zone's proportion of free pages per order is not good.
Although order-0 pages are very important, shouldn't we consider the
other order allocations too?

So I want to balance the zone's free pages proportionally per order.
How about the following?

if (nr_reclaimed < SWAP_CLUSTER_MAX) {
	if (order != 0) {
		order -= 1;
		sc.order -= 1;
	}
}

It prevents the infinite loop and makes a best effort to keep the zone's
proportion of free pages per order good.

It's just my opinion, within the limits of my knowledge.
If it has a problem, please explain it to me :)


On Thu, Jan 1, 2009 at 11:52 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>
>> >                 /*
>> >                  * Fragmentation may mean that the system cannot be
>> >                  * rebalanced for high-order allocations in all zones.
>> >                  * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
>> >                  * it means the zones have been fully scanned and are still
>> >                  * not balanced. For high-order allocations, there is
>> >                  * little point trying all over again as kswapd may
>> >                  * infinite loop.
>> >                  *
>> >                  * Instead, recheck all watermarks at order-0 as they
>> >                  * are the most important. If watermarks are ok, kswapd will go
>> >                  * back to sleep. High-order users can still direct reclaim
>> >                  * if they wish.
>> >                  */
>> >
>> > ?
>>
>> Excellent. I strongly like this and I hope merge it to my patch.
>> I'll resend new patch.
>
> Done.
>
>
>
> ==
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation
>
> Wassim Dagash reported following kswapd infinite loop problem.
>
>  kswapd runs in some infinite loop trying to swap until order 10 of zone
>  highmem is OK.... kswapd will continue to try to balance order 10 of zone
>  highmem forever (or until someone release a very large chunk of highmem).
>
> For non order-0 allocations, the system may never be balanced due to
> fragmentation but kswapd should not infinitely loop as a result.
>
> Instead, recheck all watermarks at order-0 as they are the most important.
> If watermarks are ok, kswapd will go back to sleep.
>
>
> Reported-by: wassim dagash <wassim.dagash@gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>,
> ---
>  mm/vmscan.c |   17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> Index: b/mm/vmscan.c
> ===================================================================
> --- a/mm/vmscan.c       2008-12-25 08:26:37.000000000 +0900
> +++ b/mm/vmscan.c       2009-01-01 01:56:02.000000000 +0900
> @@ -1872,6 +1872,23 @@ out:
>
>                try_to_freeze();
>
> +               /*
> +                * Fragmentation may mean that the system cannot be
> +                * rebalanced for high-order allocations in all zones.
> +                * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
> +                * it means the zones have been fully scanned and are still
> +                * not balanced. For high-order allocations, there is
> +                * little point trying all over again as kswapd may
> +                * infinite loop.
> +                *
> +                * Instead, recheck all watermarks at order-0 as they
> +                * are the most important. If watermarks are ok, kswapd will go
> +                * back to sleep. High-order users can still direct reclaim
> +                * if they wish.
> +                */
> +               if (nr_reclaimed < SWAP_CLUSTER_MAX)
> +                       order = sc.order = 0;
> +
>                goto loop_again;
>        }
>
>
>
>



-- 
Kind regards,
MinChan Kim


* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation take2
  2009-01-02  9:55           ` MinChan Kim
@ 2009-01-02 10:00             ` KOSAKI Motohiro
  2009-01-02 10:29               ` MinChan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: KOSAKI Motohiro @ 2009-01-02 10:00 UTC (permalink / raw)
  To: MinChan Kim
  Cc: Mel Gorman, LKML, linux-mm, Andrew Morton, Nick Piggin, wassim dagash

> Hi, kosaki-san.
>
> I read the previous threads now. It's rather late :(.
>
> I think it's rather awkward that sudden big change of order from 10 to 0.
>
> This problem causes zone_water_mark's fail.
> It mean now this zone's proportional free page per order size is not good.
> Although order-0 page is very important, Shouldn't we consider other
> order allocations ?
>
> So I want to balance zone's proportional free page.
> How about following ?
>
> if (nr_reclaimed < SWAP_CLUSTER_MAX) {
>   if (order != 0) {
>     order -=1;
>     sc.order -=1;
>   }
> }
>
> It prevents infinite loop and do best effort to make zone's
> proportional free page per order size good.
>
> It's just my opinion within my knowledge.
> If it have a problem, pz, explain me :)

Please read Nick's explanation; it explains it very kindly :)


* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation take2
  2009-01-02 10:00             ` KOSAKI Motohiro
@ 2009-01-02 10:29               ` MinChan Kim
  2009-01-02 10:54                 ` KOSAKI Motohiro
  0 siblings, 1 reply; 20+ messages in thread
From: MinChan Kim @ 2009-01-02 10:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, LKML, linux-mm, Andrew Morton, Nick Piggin, wassim dagash

On Fri, Jan 2, 2009 at 7:00 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> Hi, kosaki-san.
>>
>> I read the previous threads now. It's rather late :(.
>>
>> I think it's rather awkward that sudden big change of order from 10 to 0.
>>
>> This problem causes zone_water_mark's fail.
>> It mean now this zone's proportional free page per order size is not good.
>> Although order-0 page is very important, Shouldn't we consider other
>> order allocations ?
>>
>> So I want to balance zone's proportional free page.
>> How about following ?
>>
>> if (nr_reclaimed < SWAP_CLUSTER_MAX) {
>>   if (order != 0) {
>>     order -=1;
>>     sc.order -=1;
>>   }
>> }
>>
>> It prevents infinite loop and do best effort to make zone's
>> proportional free page per order size good.
>>
>> It's just my opinion within my knowledge.
>> If it have a problem, pz, explain me :)
>
> Please read Nick's expalin. it explain very kindly :)

Hm. I read Nick's explanation.
I understand his point.

Nick said,
"A higher kswapd reclaim order shouldn't weaken kswapd
postcondition for order-0 memory."

My patch doesn't prevent order-0 memory reclaim. In the end, it will still
do that. It also makes a best effort to reclaim at the other order sizes.

In this case, is reclaim at the other order sizes needless?


-- 
Kind regards,
MinChan Kim


* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation take2
  2009-01-02 10:29               ` MinChan Kim
@ 2009-01-02 10:54                 ` KOSAKI Motohiro
  2009-01-02 11:18                   ` MinChan Kim
  0 siblings, 1 reply; 20+ messages in thread
From: KOSAKI Motohiro @ 2009-01-02 10:54 UTC (permalink / raw)
  To: MinChan Kim
  Cc: Mel Gorman, LKML, linux-mm, Andrew Morton, Nick Piggin, wassim dagash

>>> So I want to balance zone's proportional free page.
>>> How about following ?
>>>
>>> if (nr_reclaimed < SWAP_CLUSTER_MAX) {
>>>   if (order != 0) {
>>>     order -=1;
>>>     sc.order -=1;
>>>   }
>>> }
>>>
>>> It prevents infinite loop and do best effort to make zone's
>>> proportional free page per order size good.
>>>
>>> It's just my opinion within my knowledge.
>>> If it have a problem, pz, explain me :)
>>
>> Please read Nick's expalin. it explain very kindly :)
>
> Hm. I read Nick's explain.
> I understand his point.
>
> Nick said,
> "A higher kswapd reclaim order shouldn't weaken kswapd
> postcondition for order-0 memory."
>
> My patch don't prevent order-0 memory reclaim. After all, it will do it.
> It also can do best effort to reclaim other order size.
>
> In this case, others order size reclaim is needless  ?

Yes, needless.

The wakeup_kswapd() function means:
  - please make free memory until pages_high
  - and, I want contiguous pages of the "order" argument.

So contiguous runs shorter than the "order" argument aren't needed
by the caller.

Unfortunately, your patch has a worse side effect.
A high-order shrink_zone() causes lumpy reclaim.
Lumpy reclaim reclaims neighboring pages even though they may be active
pages.

Needlessly reclaiming active pages decreases system performance.
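
To see why neighbours get dragged in: lumpy reclaim sweeps the naturally
aligned 2^order block of page frames around each victim page. A tiny
userspace illustration (the helper is mine, purely to show the arithmetic):

	#include <stdio.h>

	/* compute the naturally aligned 2^order pfn range around a victim */
	static void lumpy_block(unsigned long victim_pfn, int order,
				unsigned long *start, unsigned long *end)
	{
		*start = victim_pfn & ~((1UL << order) - 1);
		*end = *start + (1UL << order);
	}

	int main(void)
	{
		unsigned long start, end;

		lumpy_block(1000003UL, 10, &start, &end);	/* order-10 victim */
		printf("one victim pfn drags in pfns [%lu, %lu) - %lu pages\n",
		       start, end, end - start);
		return 0;
	}

So a single order-10 victim can pull in up to 1024 neighbouring frames,
active ones included.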


* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation take2
  2009-01-01 14:52         ` [PATCH] mm: stop kswapd's infinite loop at high order allocation take2 KOSAKI Motohiro
  2009-01-02  9:55           ` MinChan Kim
@ 2009-01-02 11:14           ` Mel Gorman
  1 sibling, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2009-01-02 11:14 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Nick Piggin, wassim dagash

On Thu, Jan 01, 2009 at 11:52:02PM +0900, KOSAKI Motohiro wrote:
> 
> > >                 /*
> > >                  * Fragmentation may mean that the system cannot be
> > >                  * rebalanced for high-order allocations in all zones.
> > >                  * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
> > >                  * it means the zones have been fully scanned and are still
> > >                  * not balanced. For high-order allocations, there is 
> > >                  * little point trying all over again as kswapd may
> > >                  * infinite loop.
> > >                  *
> > >                  * Instead, recheck all watermarks at order-0 as they
> > >                  * are the most important. If watermarks are ok, kswapd will go
> > >                  * back to sleep. High-order users can still direct reclaim
> > >                  * if they wish.
> > >                  */
> > > 
> > > ?
> > 
> > Excellent. I strongly like this and I hope merge it to my patch.
> > I'll resend new patch.
> 
> Done.
> 

Looks good, thanks.
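
For anyone reading along without mm/vmscan.c open, the hunk sits at the
tail of balance_pgdat(). Heavily abridged and paraphrased (a sketch of the
control flow, not the literal source):

	loop_again:
		nr_reclaimed = 0;
		for (priority = DEF_PRIORITY; priority >= 0; priority--) {
			/* scan the zones, shrink_zone() at sc.order */
			...
			if (all_zones_ok)
				break;
			/* restart early so priority is not lowered needlessly */
			if (nr_reclaimed >= SWAP_CLUSTER_MAX)
				break;
		}
	out:
		/* write back prev_priority for each zone */
		...
		if (!all_zones_ok) {
			cond_resched();
			try_to_freeze();

			/* new: a failed full scan falls back to an order-0 recheck */
			if (nr_reclaimed < SWAP_CLUSTER_MAX)
				order = sc.order = 0;

			goto loop_again;
		}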

> 
> 
> ==
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation
> 
> Wassim Dagash reported following kswapd infinite loop problem.
> 
>   kswapd runs in some infinite loop trying to swap until order 10 of zone
>   highmem is OK.... kswapd will continue to try to balance order 10 of zone
>   highmem forever (or until someone release a very large chunk of highmem).
> 
> For non order-0 allocations, the system may never be balanced due to
> fragmentation but kswapd should not infinitely loop as a result. 
> 
> Instead, recheck all watermarks at order-0 as they are the most important. 
> If watermarks are ok, kswapd will go back to sleep. 
> 
> 
> Reported-by: wassim dagash <wassim.dagash@gmail.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>,
> ---
>  mm/vmscan.c |   17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> Index: b/mm/vmscan.c
> ===================================================================
> --- a/mm/vmscan.c	2008-12-25 08:26:37.000000000 +0900
> +++ b/mm/vmscan.c	2009-01-01 01:56:02.000000000 +0900
> @@ -1872,6 +1872,23 @@ out:
>  
>  		try_to_freeze();
>  
> +		/*
> +		 * Fragmentation may mean that the system cannot be
> +		 * rebalanced for high-order allocations in all zones.
> +		 * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
> +		 * it means the zones have been fully scanned and are still
> +		 * not balanced. For high-order allocations, there is
> +		 * little point trying all over again as kswapd may
> +		 * infinite loop.
> +		 *
> +		 * Instead, recheck all watermarks at order-0 as they
> +		 * are the most important. If watermarks are ok, kswapd will go
> +		 * back to sleep. High-order users can still direct reclaim
> +		 * if they wish.
> +		 */
> +		if (nr_reclaimed < SWAP_CLUSTER_MAX)
> +			order = sc.order = 0;
> +
>  		goto loop_again;
>  	}
>  
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation take2
  2009-01-02 10:54                 ` KOSAKI Motohiro
@ 2009-01-02 11:18                   ` MinChan Kim
  0 siblings, 0 replies; 20+ messages in thread
From: MinChan Kim @ 2009-01-02 11:18 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, LKML, linux-mm, Andrew Morton, Nick Piggin, wassim dagash

On Fri, Jan 2, 2009 at 7:54 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>>>> So I want to balance zone's proportional free page.
>>>> How about following ?
>>>>
>>>> if (nr_reclaimed < SWAP_CLUSTER_MAX) {
>>>>   if (order != 0) {
>>>>     order -=1;
>>>>     sc.order -=1;
>>>>   }
>>>> }
>>>>
>>>> It prevents infinite loop and do best effort to make zone's
>>>> proportional free page per order size good.
>>>>
>>>> It's just my opinion within my knowledge.
>>>> If it have a problem, pz, explain me :)
>>>
>>> Please read Nick's expalin. it explain very kindly :)
>>
>> Hm. I read Nick's explain.
>> I understand his point.
>>
>> Nick said,
>> "A higher kswapd reclaim order shouldn't weaken kswapd
>> postcondition for order-0 memory."
>>
>> My patch don't prevent order-0 memory reclaim. After all, it will do it.
>> It also can do best effort to reclaim other order size.
>>
>> In this case, others order size reclaim is needless  ?
>
> Yes, needless.
>
> wakeup_kswapd() function mean
>  - please make free memory until pages_high
>  - and, I want to "order argument" conteniously pages.
>
> then, shorter conteniously pages than "order argumet" pages aren't needed
> by caller.
>
> Unfortunately, your patch has more bad side effect.
> high order shrink_zone() cause lumpy reclaim.
> lumpy reclaim cause reclaim neighbor pages although it is active page.
>
> needlessly active page reclaiming decrease system performance.
>

I agree. It can reclaim active pages.
Thanks for the kind explanation. :)


-- 
Kind regards,
MinChan Kim


Thread overview: 20+ messages
2008-12-30 10:55 [PATCH] mm: stop kswapd's infinite loop at high order allocation KOSAKI Motohiro
2008-12-30 11:10 ` Nick Piggin
2008-12-30 18:59 ` Mel Gorman
2008-12-31  1:32   ` Nick Piggin
2008-12-31 11:06     ` Mel Gorman
2008-12-31 11:16       ` Nick Piggin
2008-12-31 12:11         ` Mel Gorman
2008-12-31  4:54   ` KOSAKI Motohiro
2008-12-31  8:59     ` wassim dagash
2008-12-31 12:05       ` Mel Gorman
2008-12-31 12:24         ` wassim dagash
2008-12-31 11:53     ` Mel Gorman
2008-12-31 13:34       ` KOSAKI Motohiro
2009-01-01 14:52         ` [PATCH] mm: stop kswapd's infinite loop at high order allocation take2 KOSAKI Motohiro
2009-01-02  9:55           ` MinChan Kim
2009-01-02 10:00             ` KOSAKI Motohiro
2009-01-02 10:29               ` MinChan Kim
2009-01-02 10:54                 ` KOSAKI Motohiro
2009-01-02 11:18                   ` MinChan Kim
2009-01-02 11:14           ` Mel Gorman
