From: "wassim dagash" <wassim.dagash@gmail.com>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>, LKML <linux-kernel@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
Nick Piggin <npiggin@suse.de>
Subject: Re: [PATCH] mm: stop kswapd's infinite loop at high order allocation
Date: Wed, 31 Dec 2008 10:59:58 +0200 [thread overview]
Message-ID: <cb94e63d0812310059s263a0a75x12905a20526dafaf@mail.gmail.com> (raw)
In-Reply-To: <2f11576a0812302054rd26d8bcw6a113b3abefe8965@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 7293 bytes --]
Hi,
Thank you all for reviewing.
Why don't we implement a solution where the order is defined per zone?
I implemented such a solution for my kernel (2.6.18) and tested it; it
worked fine for me. Attached is a patch with a solution for 2.6.28
(compile-tested only).
On Wed, Dec 31, 2008 at 6:54 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi
>
> thank you for reviewing.
>
>>> ==
>>> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>> Subject: [PATCH] mm: kswapd stop infinite loop at high order allocation
>>>
>>> Wassim Dagash reported the following kswapd infinite loop problem:
>>>
>>> kswapd runs in an infinite loop trying to swap until order 10 of zone
>>> highmem is OK, while zone highmem (as I understand it) has nothing to do
>>> with contiguous memory (because there is no 1-1 mapping), which means
>>> kswapd will continue trying to balance order 10 of zone highmem
>>> forever (or until someone releases a very large chunk of highmem).
>>>
>>> He proposed removing the contiguity check on highmem entirely.
>>> However, hugepages in highmem need contiguous highmem pages.
>>>
>>
>> I'm lacking the original problem report, but contiguous order-10 pages are
>> indeed required for hugepages in highmem and reclaiming for them should not
>> be totally disabled at any point. While no 1-1 mapping exists for the kernel,
>> contiguity is still required.
>
> Correct, but that's OK.
>
> My patch only changes corner-case behavior: it only disables high-order
> reclaim when priority == 0. Typical hugepage reclaim doesn't need, and
> doesn't reach, priority == 0.
>
> And sorry, I agree that the "2nd loop" wording in my patch comment is a
> bit misleading.
>
>
>> kswapd gets a sc.order when it is known there is a process trying to get
>> high-order pages so it can reclaim at that order in an attempt to prevent
>> future direct reclaim at a high-order. Your patch does not appear to depend on
>> GFP_KERNEL at all so I found the comment misleading. Furthermore, asking it to
>> loop again at order-0 means it may scan and reclaim more memory unnecessarily
>> seeing as all_zones_ok was calculated based on a high-order value, not order-0.
>
> Yup, my patch doesn't depend on GFP_KERNEL.
>
> But why would order-0 mean it may scan more memory unnecessarily?
> all_zones_ok is calculated from zone_watermark_ok(), and
> zone_watermark_ok() depends on its order argument; my patch sets the
> order variable to 0 as well.
>
>
>> While constantly looping trying to balance for high-orders is indeed bad,
>> I'm unconvinced this is the correct change. As we have already gone through
>> all the priorities and scanned everything at the high order, would it not
>> make more sense to just give up with something like the following?
>>
>> /*
>> * If zones are still not balanced, loop again and continue attempting
>> * to rebalance the system. For high-order allocations, fragmentation
>> * can prevent the zones being rebalanced no matter how hard kswapd
>> * works, particularly on systems with little or no swap. For costly
>> * orders, just give up and assume interested processes will either
>> * direct reclaim or wake up kswapd as necessary.
>> */
>> if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
>> 	cond_resched();
>>
>> 	try_to_freeze();
>>
>> 	goto loop_again;
>> }
>>
>> I used PAGE_ALLOC_COSTLY_ORDER instead of sc.order == 0 because we are
>> expected to support allocations up to that order in a fairly reliable fashion.
>
> My comments are below.
>
>
>> =============
>> From: Mel Gorman <mel@csn.ul.ie>
>> Subject: [PATCH] mm: stop kswapd's infinite loop at high order allocation
>>
>> kswapd runs in an infinite loop trying to swap until order 10 of zone
>> highmem is OK.... kswapd will continue to try to balance order 10 of zone
>> highmem forever (or until someone releases a very large chunk of highmem).
>>
>> For costly high-order allocations, the system may never be balanced due to
>> fragmentation but kswapd should not infinitely loop as a result. The
>> following patch lets kswapd stop reclaiming in the event it cannot
>> balance zones and the order is high-order.
>>
>> Reported-by: wassim dagash <wassim.dagash@gmail.com>
>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 62e7f62..03ed9a0 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1867,7 +1867,16 @@ out:
>>
>> zone->prev_priority = temp_priority[i];
>> }
>> - if (!all_zones_ok) {
>> +
>> + /*
>> + * If zones are still not balanced, loop again and continue attempting
>> + * to rebalance the system. For high-order allocations, fragmentation
>> + * can prevent the zones being rebalanced no matter how hard kswapd
>> + * works, particularly on systems with little or no swap. For costly
>> + * orders, just give up and assume interested processes will either
>> + * direct reclaim or wake up kswapd as necessary.
>> + */
>> + if (!all_zones_ok && sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
>> cond_resched();
>>
>> try_to_freeze();
>
> This patch seems no good.
> kswapd comes back to this point after every SWAP_CLUSTER_MAX pages
> reclaimed, to avoid decreasing the priority variable unnecessarily.
> So "nr_reclaimed >= SWAP_CLUSTER_MAX" indicates kswapd needs to reclaim
> more.
>
> kswapd's purpose is "reclaim until pages_high", not "reclaim
> SWAP_CLUSTER_MAX pages".
>
> With your patch applied, if kswapd starts reclaiming for hugepages, it
> exits balance_pgdat() after reclaiming only 32 pages
> (SWAP_CLUSTER_MAX).
>
> On the other hand, "nr_reclaimed < SWAP_CLUSTER_MAX" means kswapd
> couldn't reclaim enough pages even though priority == 0.
> In this case, retrying is worthless.
>
> Sorting it out again:
> "goto loop_again" is reached in two cases.
>
> 1. kswapd reclaimed SWAP_CLUSTER_MAX pages.
>    In that case, kswapd resets the priority variable to prevent
>    unnecessary priority decreasing.
>    I don't want this behavior to change.
> 2. kswapd scanned all the way down to priority == 0.
>    This case is debatable. My patch resets any order to 0, but the
>    following code is also worth considering. (Sorry for the corrupted
>    tabs; my current mail environment is very poor.)
>
>
> code-A:
> 	if (!all_zones_ok) {
> 		if ((nr_reclaimed >= SWAP_CLUSTER_MAX) ||
> 		    (sc.order <= PAGE_ALLOC_COSTLY_ORDER)) {
> 			cond_resched();
> 			try_to_freeze();
> 			goto loop_again;
> 		}
> 	}
>
> or
>
> code-B:
> 	if (!all_zones_ok) {
> 		cond_resched();
> 		try_to_freeze();
>
> 		if (nr_reclaimed >= SWAP_CLUSTER_MAX)
> 			goto loop_again;
>
> 		if (sc.order <= PAGE_ALLOC_COSTLY_ORDER) {
> 			order = sc.order = 0;
> 			goto loop_again;
> 		}
> 	}
>
>
> However, I still like my original proposal because:
> - code-A forgets that order-1 (stack) allocations can also cause an
>   infinite loop.
> - code-B isn't simpler than my original proposal.
>
> What do you think?
>
--
too much is never enough!!!!!
[-- Attachment #2: kswapd.patch --]
[-- Type: application/octet-stream, Size: 6454 bytes --]
Reported-by: wassim dagash <wassim.dagash@gmail.com>
Signed-off-by: wassim dagash <wassim.dagash@gmail.com>
diff -Nuar linux-2.6.28.orig/include/linux/mmzone.h linux-2.6.28/include/linux/mmzone.h
--- linux-2.6.28.orig/include/linux/mmzone.h 2008-12-25 12:20:10.000000000 +0200
+++ linux-2.6.28/include/linux/mmzone.h 2008-12-31 10:16:03.000000000 +0200
@@ -409,6 +409,7 @@
* rarely used fields:
*/
const char *name;
+ unsigned int kswapd_max_order;
} ____cacheline_internodealigned_in_smp;
typedef enum {
@@ -625,7 +626,6 @@
int node_id;
wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
- int kswapd_max_order;
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -642,8 +642,8 @@
void get_zone_counts(unsigned long *active, unsigned long *inactive,
unsigned long *free);
void build_all_zonelists(void);
-void wakeup_kswapd(struct zone *zone, int order);
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+void wakeup_kswapd(struct zone *zone);
+int zone_watermark_ok(struct zone *z, unsigned long mark,
int classzone_idx, int alloc_flags);
enum memmap_context {
MEMMAP_EARLY,
diff -Nuar linux-2.6.28.orig/Makefile linux-2.6.28/Makefile
--- linux-2.6.28.orig/Makefile 2008-12-25 12:19:10.000000000 +0200
+++ linux-2.6.28/Makefile 2008-12-31 10:51:18.000000000 +0200
@@ -2,7 +2,7 @@
PATCHLEVEL = 6
SUBLEVEL = 28
EXTRAVERSION =
-NAME = Erotic Pickled Herring
+NAME = kswapd_sol
# *DOCUMENTATION*
# To see a list of typical targets execute "make help"
diff -Nuar linux-2.6.28.orig/mm/page_alloc.c linux-2.6.28/mm/page_alloc.c
--- linux-2.6.28.orig/mm/page_alloc.c 2008-12-25 12:20:14.000000000 +0200
+++ linux-2.6.28/mm/page_alloc.c 2008-12-31 10:26:32.000000000 +0200
@@ -1224,10 +1224,11 @@
* Return 1 if free pages are above 'mark'. This takes into account the order
* of the allocation.
*/
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+int zone_watermark_ok(struct zone *z, unsigned long mark,
int classzone_idx, int alloc_flags)
{
/* free_pages my go negative - that's OK */
+ unsigned int order = z->kswapd_max_order;
long min = mark;
long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
int o;
@@ -1417,7 +1418,7 @@
mark = zone->pages_low;
else
mark = zone->pages_high;
- if (!zone_watermark_ok(zone, order, mark,
+ if (!zone_watermark_ok(zone, mark,
classzone_idx, alloc_flags)) {
if (!zone_reclaim_mode ||
!zone_reclaim(zone, gfp_mask, order))
@@ -1485,9 +1486,16 @@
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+
if (page)
goto got_pg;
+ /*
+ * The first zone in the zonelist suitable for gfp_mask is the one the
+ * request was intended for; the other zones are fallbacks.
+ */
+ z->zone->kswapd_max_order = order;
+
/*
* GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
* __GFP_NOWARN set) should not cause reclaim since the subsystem
@@ -1500,7 +1508,7 @@
goto nopage;
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- wakeup_kswapd(zone, order);
+ wakeup_kswapd(zone);
/*
* OK, we're below the kswapd watermark and have kicked background
@@ -3448,7 +3456,6 @@
pgdat_resize_init(pgdat);
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
- pgdat->kswapd_max_order = 0;
pgdat_page_cgroup_init(pgdat);
for (j = 0; j < MAX_NR_ZONES; j++) {
@@ -3488,6 +3495,8 @@
nr_kernel_pages += realsize;
nr_all_pages += realsize;
+ /* Initializing kswapd_max_order to zero */
+ zone->kswapd_max_order = 0;
zone->spanned_pages = size;
zone->present_pages = realsize;
#ifdef CONFIG_NUMA
diff -Nuar linux-2.6.28.orig/mm/vmscan.c linux-2.6.28/mm/vmscan.c
--- linux-2.6.28.orig/mm/vmscan.c 2008-12-25 12:20:14.000000000 +0200
+++ linux-2.6.28/mm/vmscan.c 2008-12-31 10:29:27.000000000 +0200
@@ -1770,7 +1770,7 @@
shrink_active_list(SWAP_CLUSTER_MAX, zone,
&sc, priority, 0);
- if (!zone_watermark_ok(zone, order, zone->pages_high,
+ if (!zone_watermark_ok(zone, zone->pages_high,
0, 0)) {
end_zone = i;
break;
@@ -1805,7 +1805,7 @@
priority != DEF_PRIORITY)
continue;
- if (!zone_watermark_ok(zone, order, zone->pages_high,
+ if (!zone_watermark_ok(zone, zone->pages_high,
end_zone, 0))
all_zones_ok = 0;
temp_priority[i] = priority;
@@ -1815,7 +1815,7 @@
* We put equal pressure on every zone, unless one
* zone has way too many pages free already.
*/
- if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
+ if (!zone_watermark_ok(zone, 8*zone->pages_high,
end_zone, 0))
nr_reclaimed += shrink_zone(priority, zone, &sc);
reclaim_state->reclaimed_slab = 0;
@@ -1924,22 +1924,30 @@
order = 0;
for ( ; ; ) {
unsigned long new_order;
-
+ int i, max_order;
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
- new_order = pgdat->kswapd_max_order;
- pgdat->kswapd_max_order = 0;
- if (order < new_order) {
- /*
- * Don't sleep if someone wants a larger 'order'
- * allocation
- */
- order = new_order;
- } else {
- if (!freezing(current))
- schedule();
- order = pgdat->kswapd_max_order;
+ max_order = 0;
+ for (i = pgdat->nr_zones - 1; i >= 0; i--)
+ {
+ struct zone *zone = pgdat->node_zones + i;
+ new_order = zone->kswapd_max_order;
+ zone->kswapd_max_order = 0;
+ if (max_order < new_order) {
+ max_order = new_order;
+ }
+ }
+
+ if (order < max_order)
+ {
+ order = max_order;
}
+ else
+ {
+ if (!freezing(current))
+ schedule();
+ }
+
finish_wait(&pgdat->kswapd_wait, &wait);
if (!try_to_freeze()) {
@@ -1955,7 +1963,7 @@
/*
* A zone is low on free memory, so wake its kswapd task to service it.
*/
-void wakeup_kswapd(struct zone *zone, int order)
+void wakeup_kswapd(struct zone *zone)
{
pg_data_t *pgdat;
@@ -1963,10 +1971,8 @@
return;
pgdat = zone->zone_pgdat;
- if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
+ if (zone_watermark_ok(zone, zone->pages_low, 0, 0))
return;
- if (pgdat->kswapd_max_order < order)
- pgdat->kswapd_max_order = order;
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
return;
if (!waitqueue_active(&pgdat->kswapd_wait))