* [PATCH][-mm][0/2] page reclaim throttle take4
@ 2008-03-30 8:12 KOSAKI Motohiro
2008-03-30 8:12 ` Balbir Singh
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: KOSAKI Motohiro @ 2008-03-30 8:12 UTC (permalink / raw)
To: Andrew Morton, linux-mm, Balbir Singh, Rik van Riel,
David Rientjes, Nick Piggin, KAMEZAWA Hiroyuki, Peter Zijlstra
Cc: kosaki.motohiro
[-- Attachment #1: Type: text/plain, Size: 2820 bytes --]
changelog
========================================
v3 -> v4:
 o fixed the recursive shrink_zone problem.
 o added a last_checked variable to shrink_zone to
   prevent a corner-case regression.

v2 -> v3:
 o use wake_up() instead of wake_up_all()
 o the maximum number of reclaimers can now be changed via a
   Kconfig option and a sysctl.
 o some cleanups

v1 -> v2:
 o made the throttle per-zone
background
=====================================
The current VM implementation has no limit on the number of parallel
reclaimers. Under a heavy workload this brings two bad things:
 - heavy lock contention
 - unnecessary swap-out

At the end of last year, KAMEZAWA Hiroyuki proposed a page reclaim
throttle patch and explained that it improves reclaim time.
http://marc.info/?l=linux-mm&m=119667465917215&w=2

Unfortunately, it worked only for memcgroup reclaim.
Now I have implemented it again, with support for global reclaim, and
measured it.
benefit
=====================================
<<1. fix a bug of spurious OOM killing>>

If you run the following command, the OOM killer sometimes triggers.
(OOM happened in about 10% of runs)

$ ./hackbench 125 process 1000

This happens because of the following bad scenario:

1. a memory shortage happens.
2. many tasks call shrink_zone at the same time.
3. all pages are isolated from the LRU at the same time.
4. the last task cannot isolate any page from the LRU.
5. that causes a reclaim failure.
6. the reclaim failure triggers the OOM killer.

My patch is a direct solution to that problem.
(A sketch of the throttling idea follows below.)
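To make the idea concrete, below is a minimal userspace sketch of the
throttling scheme: an atomic count of active reclaimers plus a wait
queue that admits at most MAX_RECLAIM_TASKS tasks, waking one waiter
at a time (mirroring the "wake_up() instead of wake_up_all()" change
in v3). The pthread primitives and all names here are my own
illustrative stand-ins, not the kernel code from patch 1/2.

/*
 * Sketch: at most MAX_RECLAIM_TASKS threads run the (simulated)
 * reclaim path at once; the others sleep and are woken one at a time.
 * Build with: gcc -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_RECLAIM_TASKS 3
#define NR_TASKS 10

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t waitq = PTHREAD_COND_INITIALIZER;
static int nr_reclaimers;	/* plays the role of zone->nr_reclaimers */

static void *reclaimer(void *arg)
{
	long id = (long)arg;

	pthread_mutex_lock(&lock);
	while (nr_reclaimers >= MAX_RECLAIM_TASKS)
		pthread_cond_wait(&waitq, &lock);	/* throttled */
	nr_reclaimers++;
	pthread_mutex_unlock(&lock);

	printf("task %ld: reclaiming\n", id);	/* simulated shrink_zone() */
	usleep(10000);

	pthread_mutex_lock(&lock);
	nr_reclaimers--;
	pthread_cond_signal(&waitq);	/* wake exactly one waiter */
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t[NR_TASKS];
	long i;

	for (i = 0; i < NR_TASKS; i++)
		pthread_create(&t[i], NULL, reclaimer, (void *)i);
	for (i = 0; i < NR_TASKS; i++)
		pthread_join(t[i], NULL);
	return 0;
}

With 10 tasks and a limit of 3, at most 3 "reclaiming" lines are in
flight at any moment; the rest queue up, which is what prevents steps
3 and 4 of the bad scenario above.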
<<2. performance improvement>>

I measured hackbench with various parameters.

The result numbers are seconds (i.e. smaller is better).

num_group  2.6.25-rc5-mm1   previous   current
                            proposal   proposal
------------------------------------------------------------
       80           26.22      25.34      25.61
       85           27.31      27.03      27.28
       90           29.23      28.64      28.81
       95           30.73      32.70      30.17
      100           32.02      32.77      32.38
      105           33.97      39.70      31.99
      110           35.37      50.03      33.04
      115           36.96      48.64      36.02
      120           74.05      45.68      37.33
      125           41.07(*)   64.13      38.88
      130           86.92      56.30      51.64
      135          234.62      74.31      57.09
      140          291.95     117.74      83.76
      145          425.35     131.99      92.01
      150          766.92     160.63     128.27
(*) OOM sometimes happened; please do not treat this as a good result.

My patch gets a performance improvement for every parameter.
(see the attached graph images)
[-- Attachment #2: image001.png --]
[-- Type: image/png, Size: 3358 bytes --]
[-- Attachment #3: image002.png --]
[-- Type: image/png, Size: 3088 bytes --]
* Re: [PATCH][-mm][0/2] page reclaim throttle take4
  2008-03-30  8:12 [PATCH][-mm][0/2] page reclaim throttle take4 KOSAKI Motohiro
@ 2008-03-30  8:12 ` Balbir Singh
  2008-03-30  8:23   ` KOSAKI Motohiro
  2008-03-30  8:15 ` [PATCH][-mm][1/2] core of page reclaim throttle KOSAKI Motohiro
  2008-03-30  8:19 ` [PATCH][-mm][2/2] introduce sysctl i/f of max task of throttle KOSAKI Motohiro
  2 siblings, 1 reply; 10+ messages in thread
From: Balbir Singh @ 2008-03-30 8:12 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrew Morton, linux-mm, Rik van Riel, David Rientjes,
    Nick Piggin, KAMEZAWA Hiroyuki, Peter Zijlstra

KOSAKI Motohiro wrote:
> changelog
> ========================================
> v3 -> v4:
>  o fixed the recursive shrink_zone problem.
>  o added a last_checked variable to shrink_zone to
>    prevent a corner-case regression.
>
> v2 -> v3:
>  o use wake_up() instead of wake_up_all()
>  o the maximum number of reclaimers can now be changed via a
>    Kconfig option and a sysctl.
>  o some cleanups
>
> v1 -> v2:
>  o made the throttle per-zone
>
>
> background
> =====================================
> The current VM implementation has no limit on the number of parallel
> reclaimers. Under a heavy workload this brings two bad things:
>  - heavy lock contention
>  - unnecessary swap-out
>
> At the end of last year, KAMEZAWA Hiroyuki proposed a page reclaim
> throttle patch and explained that it improves reclaim time.
> http://marc.info/?l=linux-mm&m=119667465917215&w=2
>
> Unfortunately, it worked only for memcgroup reclaim.
> Now I have implemented it again, with support for global reclaim, and
> measured it.
>
>
> benefit
> =====================================
> <<1. fix a bug of spurious OOM killing>>
>
> If you run the following command, the OOM killer sometimes triggers.
> (OOM happened in about 10% of runs)
>
> $ ./hackbench 125 process 1000
>
> This happens because of the following bad scenario:
>
> 1. a memory shortage happens.
> 2. many tasks call shrink_zone at the same time.
> 3. all pages are isolated from the LRU at the same time.
> 4. the last task cannot isolate any page from the LRU.
> 5. that causes a reclaim failure.
> 6. the reclaim failure triggers the OOM killer.
>
> My patch is a direct solution to that problem.
>
>
> <<2. performance improvement>>
>
> I measured hackbench with various parameters.
>
> The result numbers are seconds (i.e. smaller is better).
>

The results look quite impressive. Have you seen how your patches
integrate with Rik's LRU changes?

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [PATCH][-mm][0/2] page reclaim throttle take4
  2008-03-30  8:12 ` Balbir Singh
@ 2008-03-30  8:23   ` KOSAKI Motohiro
  2008-03-30  9:32     ` KOSAKI Motohiro
  0 siblings, 1 reply; 10+ messages in thread
From: KOSAKI Motohiro @ 2008-03-30 8:23 UTC (permalink / raw)
To: balbir
Cc: kosaki.motohiro, Andrew Morton, linux-mm, Rik van Riel,
    David Rientjes, Nick Piggin, KAMEZAWA Hiroyuki, Peter Zijlstra

Hi

> > <<2. performance improvement>>
> >
> > I measured hackbench with various parameters.
> >
> > The result numbers are seconds (i.e. smaller is better).
>
> The results look quite impressive. Have you seen how your patches
> integrate with Rik's LRU changes?

I am measuring that right now.
I should be able to report in about 2-3 days.
* Re: [PATCH][-mm][0/2] page reclaim throttle take4
  2008-03-30  8:23 ` KOSAKI Motohiro
@ 2008-03-30  9:32   ` KOSAKI Motohiro
  2008-03-31  2:57     ` KOSAKI Motohiro
  0 siblings, 1 reply; 10+ messages in thread
From: KOSAKI Motohiro @ 2008-03-30 9:32 UTC (permalink / raw)
To: balbir
Cc: kosaki.motohiro, Andrew Morton, linux-mm, Rik van Riel,
    David Rientjes, Nick Piggin, KAMEZAWA Hiroyuki, Peter Zijlstra

Hi balbir-san,

> > The results look quite impressive. Have you seen how your patches
> > integrate with Rik's LRU changes?
>
> I am measuring that right now.
> I should be able to report in about 2-3 days.

btw: here is a rough preliminary result.
(the number of measurements is still small)

num_group  2.6.25-rc5-mm1   throttle   throttle + split_lru
--------------------------------------------------------------
      115           36.96      36.02                  36.12
      125           41.07      38.88                  38.29
      150          766.92     128.27                 129.09
* Re: [PATCH][-mm][0/2] page reclaim throttle take4
  2008-03-30  9:32 ` KOSAKI Motohiro
@ 2008-03-31  2:57   ` KOSAKI Motohiro
  0 siblings, 0 replies; 10+ messages in thread
From: KOSAKI Motohiro @ 2008-03-31 2:57 UTC (permalink / raw)
To: balbir, Rik van Riel
Cc: kosaki.motohiro, Andrew Morton, linux-mm, David Rientjes,
    Nick Piggin, KAMEZAWA Hiroyuki, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 527 bytes --]

Hi balbir-san,

> btw: here is a rough preliminary result.
> (the number of measurements is still small)
>
> num_group  2.6.25-rc5-mm1   throttle   throttle + split_lru
> --------------------------------------------------------------
>       115           36.96      36.02                  36.12
>       125           41.07      38.88                  38.29
>       150          766.92     128.27                 129.09

Hmmm, I got a very strange result.
My patch works well, and Rik's patch works well too,
but the kernel with both applied does not work well.

[-- Attachment #2: image001.png --]
[-- Type: image/png, Size: 4968 bytes --]

[-- Attachment #3: image002.png --]
[-- Type: image/png, Size: 3704 bytes --]
* Re: [PATCH][-mm][1/2] core of page reclaim throttle
  2008-03-30  8:12 [PATCH][-mm][0/2] page reclaim throttle take4 KOSAKI Motohiro
  2008-03-30  8:12 ` Balbir Singh
@ 2008-03-30  8:15 ` KOSAKI Motohiro
  2008-03-30 11:00   ` KOSAKI Motohiro
  2008-04-12 19:30   ` Peter Zijlstra
  2008-03-30  8:19 ` [PATCH][-mm][2/2] introduce sysctl i/f of max task of throttle KOSAKI Motohiro
  2 siblings, 2 replies; 10+ messages in thread
From: KOSAKI Motohiro @ 2008-03-30 8:15 UTC (permalink / raw)
To: Andrew Morton, linux-mm, Balbir Singh, Rik van Riel,
    David Rientjes, Nick Piggin, KAMEZAWA Hiroyuki, Peter Zijlstra
Cc: kosaki.motohiro

background
=====================================
The current VM implementation has no limit on the number of parallel
reclaimers. Under a heavy workload this brings two bad things:
 - heavy lock contention
 - unnecessary swap-out

At the end of last year, KAMEZA Hiroyuki proposed a page reclaim
throttle patch and explained that it improves reclaim time.
http://marc.info/?l=linux-mm&m=119667465917215&w=2

Unfortunately, it worked only for memcgroup reclaim.
Now I have implemented it again, with support for global reclaim, and
measured it.

benefit
=====================================
<<1. fix a bug of spurious OOM killing>>

If you run the following command, the OOM killer sometimes triggers.
(OOM happened in about 10% of runs)

$ ./hackbench 125 process 1000

This happens because of the following bad scenario:

1. a memory shortage happens.
2. many tasks call shrink_zone at the same time.
3. all pages are isolated from the LRU at the same time.
4. the last task cannot isolate any page from the LRU.
5. that causes a reclaim failure.
6. the reclaim failure triggers the OOM killer.

My patch is a direct solution to that problem.

<<2. performance improvement>>

I measured hackbench with various parameters.

The result numbers are seconds (i.e. smaller is better).

num_group  2.6.25-rc5-mm1   my-patch
----------------------------------------------
       80           26.22      25.61
       85           27.31      27.28
       90           29.23      28.81
       95           30.73      30.17
      100           32.02      32.38
      105           33.97      31.99
      110           35.37      33.04
      115           36.96      36.02
      120           74.05      37.33
      125           41.07(*)   38.88
      130           86.92      51.64
      135          234.62      57.09
      140          291.95      83.76
      145          425.35      92.01
      150          766.92     128.27

(*) OOM sometimes happened; please do not treat this as a good result.

My patch gets a performance improvement for every parameter.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- include/linux/mmzone.h | 2 + include/linux/sched.h | 1 mm/Kconfig | 10 +++++ mm/page_alloc.c | 4 ++ mm/vmscan.c | 89 ++++++++++++++++++++++++++++++++++++++++++------- 5 files changed, 94 insertions(+), 12 deletions(-) Index: b/include/linux/mmzone.h =================================================================== --- a/include/linux/mmzone.h 2008-03-27 13:35:03.000000000 +0900 +++ b/include/linux/mmzone.h 2008-03-27 15:55:50.000000000 +0900 @@ -335,6 +335,8 @@ struct zone { unsigned long spanned_pages; /* total size, including holes */ unsigned long present_pages; /* amount of memory (excluding holes) */ + atomic_t nr_reclaimers; + wait_queue_head_t reclaim_throttle_waitq; /* * rarely used fields: */ Index: b/mm/page_alloc.c =================================================================== --- a/mm/page_alloc.c 2008-03-27 13:35:03.000000000 +0900 +++ b/mm/page_alloc.c 2008-03-27 13:35:16.000000000 +0900 @@ -3473,6 +3473,10 @@ static void __paginginit free_area_init_ zone->nr_scan_inactive = 0; zap_zone_vm_stats(zone); zone->flags = 0; + + zone->nr_reclaimers = ATOMIC_INIT(0); + init_waitqueue_head(&zone->reclaim_throttle_waitq); + if (!size) continue; Index: b/mm/vmscan.c =================================================================== --- a/mm/vmscan.c 2008-03-27 13:35:03.000000000 +0900 +++ b/mm/vmscan.c 2008-03-27 19:41:50.000000000 +0900 @@ -124,6 +124,7 @@ struct scan_control { int vm_swappiness = 60; long vm_total_pages; /* The total number of pages which the VM controls */ +#define MAX_RECLAIM_TASKS CONFIG_NR_MAX_RECLAIM_TASKS_PER_ZONE static LIST_HEAD(shrinker_list); static DECLARE_RWSEM(shrinker_rwsem); @@ -1190,14 +1191,42 @@ static void shrink_active_list(unsigned /* * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. */ -static unsigned long shrink_zone(int priority, struct zone *zone, - struct scan_control *sc) +static int shrink_zone(int priority, struct zone *zone, + struct scan_control *sc, unsigned long *ret_reclaimed) { unsigned long nr_active; unsigned long nr_inactive; unsigned long nr_to_scan; unsigned long nr_reclaimed = 0; + unsigned long start_time = jiffies; + atomic_long_t last_checked = ATOMIC_LONG_INIT(INITIAL_JIFFIES); + int ret = 0; + int throttle_on = 0; + /* avoid recursing wait_evnet */ + if (current->flags & PF_RECLAIMING) + goto shrink_it; + + throttle_on = 1; + current->flags |= PF_RECLAIMING; + wait_event(zone->reclaim_throttle_waitq, + atomic_add_unless(&zone->nr_reclaimers, 1, + MAX_RECLAIM_TASKS)); + + /* reclaim still necessary? 
*/ + if (scan_global_lru(sc) && + !(current->flags & PF_KSWAPD) && + time_after(jiffies, start_time+HZ) && + time_after(jiffies, (ulong)atomic_long_read(&last_checked)+HZ/10)) { + if (zone_watermark_ok(zone, sc->order, 4*zone->pages_high, + gfp_zone(sc->gfp_mask), 0)) { + ret = -EAGAIN; + goto out; + } + atomic_long_set(&last_checked, jiffies); + } + +shrink_it: if (scan_global_lru(sc)) { /* * Add one to nr_to_scan just to make sure that the kernel @@ -1249,8 +1278,17 @@ static unsigned long shrink_zone(int pri } } +out: + if (throttle_on) { + current->flags &= ~PF_RECLAIMING; + atomic_dec(&zone->nr_reclaimers); + wake_up(&zone->reclaim_throttle_waitq); + } + + *ret_reclaimed += nr_reclaimed; throttle_vm_writeout(sc->gfp_mask); - return nr_reclaimed; + + return ret; } /* @@ -1269,13 +1307,13 @@ static unsigned long shrink_zone(int pri * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. */ -static unsigned long shrink_zones(int priority, struct zonelist *zonelist, - struct scan_control *sc) +static int shrink_zones(int priority, struct zonelist *zonelist, + struct scan_control *sc, unsigned long *ret_reclaimed) { enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); - unsigned long nr_reclaimed = 0; struct zoneref *z; struct zone *zone; + int ret; sc->all_unreclaimable = 1; for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { @@ -1304,10 +1342,14 @@ static unsigned long shrink_zones(int pr priority); } - nr_reclaimed += shrink_zone(priority, zone, sc); + ret = shrink_zone(priority, zone, sc, ret_reclaimed); + if (ret == -EAGAIN) + goto out; } + ret = 0; - return nr_reclaimed; +out: + return ret; } /* @@ -1335,6 +1377,8 @@ static unsigned long do_try_to_free_page struct zoneref *z; struct zone *zone; enum zone_type high_zoneidx = gfp_zone(gfp_mask); + unsigned long last_check_time = jiffies; + int err; if (scan_global_lru(sc)) count_vm_event(ALLOCSTALL); @@ -1357,7 +1401,12 @@ static unsigned long do_try_to_free_page sc->nr_io_pages = 0; if (!priority) disable_swap_token(); - nr_reclaimed += shrink_zones(priority, zonelist, sc); + err = shrink_zones(priority, zonelist, sc, &nr_reclaimed); + if (err == -EAGAIN) { + ret = 1; + goto out; + } + /* * Don't shrink slabs when reclaiming memory from * over limit cgroups @@ -1390,8 +1439,24 @@ static unsigned long do_try_to_free_page /* Take a nap, wait for some writeback to complete */ if (sc->nr_scanned && priority < DEF_PRIORITY - 2 && - sc->nr_io_pages > sc->swap_cluster_max) + sc->nr_io_pages > sc->swap_cluster_max) congestion_wait(WRITE, HZ/10); + + if (scan_global_lru(sc) && + time_after(jiffies, last_check_time+HZ)) { + last_check_time = jiffies; + + /* reclaim still necessary? */ + for_each_zone_zonelist(zone, z, zonelist, + high_zoneidx) { + if (zone_watermark_ok(zone, sc->order, + 4*zone->pages_high, + high_zoneidx, 0)) { + ret = 1; + goto out; + } + } + } } /* top priority shrink_caches still had more to do? 
don't OOM, then */ if (!sc->all_unreclaimable && scan_global_lru(sc)) @@ -1589,7 +1654,7 @@ loop_again: */ if (!zone_watermark_ok(zone, order, 8*zone->pages_high, end_zone, 0)) - nr_reclaimed += shrink_zone(priority, zone, &sc); + shrink_zone(priority, zone, &sc, &nr_reclaimed); reclaim_state->reclaimed_slab = 0; nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages); @@ -2034,7 +2099,7 @@ static int __zone_reclaim(struct zone *z priority = ZONE_RECLAIM_PRIORITY; do { note_zone_scanning_priority(zone, priority); - nr_reclaimed += shrink_zone(priority, zone, &sc); + shrink_zone(priority, zone, &sc, &nr_reclaimed); priority--; } while (priority >= 0 && nr_reclaimed < nr_pages); } Index: b/mm/Kconfig =================================================================== --- a/mm/Kconfig 2008-03-27 13:35:03.000000000 +0900 +++ b/mm/Kconfig 2008-03-27 13:35:16.000000000 +0900 @@ -193,3 +193,13 @@ config NR_QUICK config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config NR_MAX_RECLAIM_TASKS_PER_ZONE + int "maximum number of reclaiming tasks at the same time" + default 3 + help + This value determines the number of threads which can do page reclaim + in a zone simultaneously. If this is too big, performance under heavy memory + pressure will decrease. + If unsure, use default. + Index: b/include/linux/sched.h =================================================================== --- a/include/linux/sched.h 2008-03-27 13:35:03.000000000 +0900 +++ b/include/linux/sched.h 2008-03-27 13:35:16.000000000 +0900 @@ -1475,6 +1475,7 @@ static inline void put_task_struct(struc #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */ #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezeable */ +#define PF_RECLAIMING 0x80000000 /* The task have page reclaim throttling ticket */ /* * Only the _current_ task can read/write to tsk->flags, but other -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH][-mm][1/2] core of page reclaim throttle
  2008-03-30  8:15 ` [PATCH][-mm][1/2] core of page reclaim throttle KOSAKI Motohiro
@ 2008-03-30 11:00   ` KOSAKI Motohiro
  2008-04-12 19:30   ` Peter Zijlstra
  1 sibling, 0 replies; 10+ messages in thread
From: KOSAKI Motohiro @ 2008-03-30 11:00 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrew Morton, linux-mm, Balbir Singh, Rik van Riel,
    David Rientjes, Nick Piggin, KAMEZAWA Hiroyuki, Peter Zijlstra

> At the end of last year, KAMEZA Hiroyuki proposed a page reclaim
> throttle patch and explained that it improves reclaim time.
> http://marc.info/?l=linux-mm&m=119667465917215&w=2

Agghh! I misspelt Kamezawa-san's name. I am resending this patch. Sorry.

-------------------------------------------------------------------

background
=====================================
The current VM implementation has no limit on the number of parallel
reclaimers. Under a heavy workload this brings two bad things:
 - heavy lock contention
 - unnecessary swap-out

At the end of last year, KAMEZAWA Hiroyuki proposed a page reclaim
throttle patch and explained that it improves reclaim time.
http://marc.info/?l=linux-mm&m=119667465917215&w=2

Unfortunately, it worked only for memcgroup reclaim.
Now I have implemented it again, with support for global reclaim, and
measured it.

benefit
=====================================
<<1. fix a bug of spurious OOM killing>>

If you run the following command, the OOM killer sometimes triggers.
(OOM happened in about 10% of runs)

$ ./hackbench 125 process 1000

This happens because of the following bad scenario:

1. a memory shortage happens.
2. many tasks call shrink_zone at the same time.
3. all pages are isolated from the LRU at the same time.
4. the last task cannot isolate any page from the LRU.
5. that causes a reclaim failure.
6. the reclaim failure triggers the OOM killer.

My patch is a direct solution to that problem.

<<2. performance improvement>>

I measured hackbench with various parameters.

The result numbers are seconds (i.e. smaller is better).

num_group  2.6.25-rc5-mm1   my-patch
----------------------------------------------
       80           26.22      25.61
       85           27.31      27.28
       90           29.23      28.81
       95           30.73      30.17
      100           32.02      32.38
      105           33.97      31.99
      110           35.37      33.04
      115           36.96      36.02
      120           74.05      37.33
      125           41.07(*)   38.88
      130           86.92      51.64
      135          234.62      57.09
      140          291.95      83.76
      145          425.35      92.01
      150          766.92     128.27

(*) OOM sometimes happened; please do not treat this as a good result.

My patch gets a performance improvement for every parameter.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- include/linux/mmzone.h | 2 + include/linux/sched.h | 1 mm/Kconfig | 10 +++++ mm/page_alloc.c | 4 ++ mm/vmscan.c | 89 ++++++++++++++++++++++++++++++++++++++++++------- 5 files changed, 94 insertions(+), 12 deletions(-) Index: b/include/linux/mmzone.h =================================================================== --- a/include/linux/mmzone.h 2008-03-27 13:35:03.000000000 +0900 +++ b/include/linux/mmzone.h 2008-03-27 15:55:50.000000000 +0900 @@ -335,6 +335,8 @@ struct zone { unsigned long spanned_pages; /* total size, including holes */ unsigned long present_pages; /* amount of memory (excluding holes) */ + atomic_t nr_reclaimers; + wait_queue_head_t reclaim_throttle_waitq; /* * rarely used fields: */ Index: b/mm/page_alloc.c =================================================================== --- a/mm/page_alloc.c 2008-03-27 13:35:03.000000000 +0900 +++ b/mm/page_alloc.c 2008-03-27 13:35:16.000000000 +0900 @@ -3473,6 +3473,10 @@ static void __paginginit free_area_init_ zone->nr_scan_inactive = 0; zap_zone_vm_stats(zone); zone->flags = 0; + + zone->nr_reclaimers = ATOMIC_INIT(0); + init_waitqueue_head(&zone->reclaim_throttle_waitq); + if (!size) continue; Index: b/mm/vmscan.c =================================================================== --- a/mm/vmscan.c 2008-03-27 13:35:03.000000000 +0900 +++ b/mm/vmscan.c 2008-03-27 19:41:50.000000000 +0900 @@ -124,6 +124,7 @@ struct scan_control { int vm_swappiness = 60; long vm_total_pages; /* The total number of pages which the VM controls */ +#define MAX_RECLAIM_TASKS CONFIG_NR_MAX_RECLAIM_TASKS_PER_ZONE static LIST_HEAD(shrinker_list); static DECLARE_RWSEM(shrinker_rwsem); @@ -1190,14 +1191,42 @@ static void shrink_active_list(unsigned /* * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. */ -static unsigned long shrink_zone(int priority, struct zone *zone, - struct scan_control *sc) +static int shrink_zone(int priority, struct zone *zone, + struct scan_control *sc, unsigned long *ret_reclaimed) { unsigned long nr_active; unsigned long nr_inactive; unsigned long nr_to_scan; unsigned long nr_reclaimed = 0; + unsigned long start_time = jiffies; + atomic_long_t last_checked = ATOMIC_LONG_INIT(INITIAL_JIFFIES); + int ret = 0; + int throttle_on = 0; + /* avoid recursing wait_evnet */ + if (current->flags & PF_RECLAIMING) + goto shrink_it; + + throttle_on = 1; + current->flags |= PF_RECLAIMING; + wait_event(zone->reclaim_throttle_waitq, + atomic_add_unless(&zone->nr_reclaimers, 1, + MAX_RECLAIM_TASKS)); + + /* reclaim still necessary? 
*/ + if (scan_global_lru(sc) && + !(current->flags & PF_KSWAPD) && + time_after(jiffies, start_time+HZ) && + time_after(jiffies, (ulong)atomic_long_read(&last_checked)+HZ/10)) { + if (zone_watermark_ok(zone, sc->order, 4*zone->pages_high, + gfp_zone(sc->gfp_mask), 0)) { + ret = -EAGAIN; + goto out; + } + atomic_long_set(&last_checked, jiffies); + } + +shrink_it: if (scan_global_lru(sc)) { /* * Add one to nr_to_scan just to make sure that the kernel @@ -1249,8 +1278,17 @@ static unsigned long shrink_zone(int pri } } +out: + if (throttle_on) { + current->flags &= ~PF_RECLAIMING; + atomic_dec(&zone->nr_reclaimers); + wake_up(&zone->reclaim_throttle_waitq); + } + + *ret_reclaimed += nr_reclaimed; throttle_vm_writeout(sc->gfp_mask); - return nr_reclaimed; + + return ret; } /* @@ -1269,13 +1307,13 @@ static unsigned long shrink_zone(int pri * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. */ -static unsigned long shrink_zones(int priority, struct zonelist *zonelist, - struct scan_control *sc) +static int shrink_zones(int priority, struct zonelist *zonelist, + struct scan_control *sc, unsigned long *ret_reclaimed) { enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); - unsigned long nr_reclaimed = 0; struct zoneref *z; struct zone *zone; + int ret; sc->all_unreclaimable = 1; for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { @@ -1304,10 +1342,14 @@ static unsigned long shrink_zones(int pr priority); } - nr_reclaimed += shrink_zone(priority, zone, sc); + ret = shrink_zone(priority, zone, sc, ret_reclaimed); + if (ret == -EAGAIN) + goto out; } + ret = 0; - return nr_reclaimed; +out: + return ret; } /* @@ -1335,6 +1377,8 @@ static unsigned long do_try_to_free_page struct zoneref *z; struct zone *zone; enum zone_type high_zoneidx = gfp_zone(gfp_mask); + unsigned long last_check_time = jiffies; + int err; if (scan_global_lru(sc)) count_vm_event(ALLOCSTALL); @@ -1357,7 +1401,12 @@ static unsigned long do_try_to_free_page sc->nr_io_pages = 0; if (!priority) disable_swap_token(); - nr_reclaimed += shrink_zones(priority, zonelist, sc); + err = shrink_zones(priority, zonelist, sc, &nr_reclaimed); + if (err == -EAGAIN) { + ret = 1; + goto out; + } + /* * Don't shrink slabs when reclaiming memory from * over limit cgroups @@ -1390,8 +1439,24 @@ static unsigned long do_try_to_free_page /* Take a nap, wait for some writeback to complete */ if (sc->nr_scanned && priority < DEF_PRIORITY - 2 && - sc->nr_io_pages > sc->swap_cluster_max) + sc->nr_io_pages > sc->swap_cluster_max) congestion_wait(WRITE, HZ/10); + + if (scan_global_lru(sc) && + time_after(jiffies, last_check_time+HZ)) { + last_check_time = jiffies; + + /* reclaim still necessary? */ + for_each_zone_zonelist(zone, z, zonelist, + high_zoneidx) { + if (zone_watermark_ok(zone, sc->order, + 4*zone->pages_high, + high_zoneidx, 0)) { + ret = 1; + goto out; + } + } + } } /* top priority shrink_caches still had more to do? 
don't OOM, then */ if (!sc->all_unreclaimable && scan_global_lru(sc)) @@ -1589,7 +1654,7 @@ loop_again: */ if (!zone_watermark_ok(zone, order, 8*zone->pages_high, end_zone, 0)) - nr_reclaimed += shrink_zone(priority, zone, &sc); + shrink_zone(priority, zone, &sc, &nr_reclaimed); reclaim_state->reclaimed_slab = 0; nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages); @@ -2034,7 +2099,7 @@ static int __zone_reclaim(struct zone *z priority = ZONE_RECLAIM_PRIORITY; do { note_zone_scanning_priority(zone, priority); - nr_reclaimed += shrink_zone(priority, zone, &sc); + shrink_zone(priority, zone, &sc, &nr_reclaimed); priority--; } while (priority >= 0 && nr_reclaimed < nr_pages); } Index: b/mm/Kconfig =================================================================== --- a/mm/Kconfig 2008-03-27 13:35:03.000000000 +0900 +++ b/mm/Kconfig 2008-03-27 13:35:16.000000000 +0900 @@ -193,3 +193,13 @@ config NR_QUICK config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config NR_MAX_RECLAIM_TASKS_PER_ZONE + int "maximum number of reclaiming tasks at the same time" + default 3 + help + This value determines the number of threads which can do page reclaim + in a zone simultaneously. If this is too big, performance under heavy memory + pressure will decrease. + If unsure, use default. + Index: b/include/linux/sched.h =================================================================== --- a/include/linux/sched.h 2008-03-27 13:35:03.000000000 +0900 +++ b/include/linux/sched.h 2008-03-27 13:35:16.000000000 +0900 @@ -1475,6 +1475,7 @@ static inline void put_task_struct(struc #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */ #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezeable */ +#define PF_RECLAIMING 0x80000000 /* The task have page reclaim throttling ticket */ /* * Only the _current_ task can read/write to tsk->flags, but other -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
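One detail worth a note: the patch guards against re-entering the
throttle with a PF_RECLAIMING task flag (the "fixed the recursive
shrink_zone problem" item in the v4 changelog). Below is a toy
userspace illustration of that guard, with a thread-local variable
standing in for current->flags; all names here are illustrative
stand-ins, not the kernel code.

#include <stdio.h>

#define PF_RECLAIMING 0x1

static __thread unsigned int task_flags;	/* stand-in for current->flags */

static void shrink_zone(int depth)
{
	int throttled = 0;

	if (!(task_flags & PF_RECLAIMING)) {
		/* first entry: take a throttle ticket (waiting elided) */
		task_flags |= PF_RECLAIMING;
		throttled = 1;
		printf("depth %d: acquired throttle ticket\n", depth);
	} else {
		/* nested entry: skip the throttle, avoiding self-deadlock */
		printf("depth %d: already throttled, not waiting again\n", depth);
	}

	if (depth < 2)
		shrink_zone(depth + 1);	/* simulate recursive reclaim */

	if (throttled) {
		task_flags &= ~PF_RECLAIMING;
		printf("depth %d: released throttle ticket\n", depth);
	}
}

int main(void)
{
	shrink_zone(0);
	return 0;
}

Without the flag, a nested call would queue on the wait queue while
its outer invocation already holds one of the MAX_RECLAIM_TASKS slots,
and with a small enough limit the task could block on itself.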
* Re: [PATCH][-mm][1/2] core of page reclaim throttle
  2008-03-30  8:15 ` [PATCH][-mm][1/2] core of page reclaim throttle KOSAKI Motohiro
  2008-03-30 11:00   ` KOSAKI Motohiro
@ 2008-04-12 19:30   ` Peter Zijlstra
  2008-04-14  8:20     ` KOSAKI Motohiro
  1 sibling, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2008-04-12 19:30 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrew Morton, linux-mm, Balbir Singh, Rik van Riel,
    David Rientjes, Nick Piggin, KAMEZAWA Hiroyuki

On Sun, 2008-03-30 at 17:15 +0900, KOSAKI Motohiro wrote:

> Index: b/include/linux/mmzone.h
> ===================================================================
> --- a/include/linux/mmzone.h	2008-03-27 13:35:03.000000000 +0900
> +++ b/include/linux/mmzone.h	2008-03-27 15:55:50.000000000 +0900
> @@ -335,6 +335,8 @@ struct zone {
> 	unsigned long	spanned_pages;	/* total size, including holes */
> 	unsigned long	present_pages;	/* amount of memory (excluding holes) */
>
> +	atomic_t	nr_reclaimers;
> +	wait_queue_head_t	reclaim_throttle_waitq;
> 	/*
> 	 * rarely used fields:

I'm thinking this ought to be a plist based wait_queue to avoid
priority inversions - but I don't think we have such a creature.
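To make the priority-inversion point concrete, here is a toy userspace
comparison (not kernel code; the "plist" pick below is just a sorted
selection, not include/linux/plist.h). A FIFO wait queue wakes the
oldest waiter even when a more important task is queued behind it,
while a priority-sorted queue wakes the most important waiter first.

#include <stdio.h>

struct waiter { const char *name; int prio; };	/* lower prio = more important */

static struct waiter fifo_pick(struct waiter *q, int n)
{
	return q[0];	/* FIFO: wake the oldest waiter */
}

static struct waiter plist_pick(struct waiter *q, int n)
{
	int i, best = 0;

	for (i = 1; i < n; i++)	/* plist-like: wake the best priority */
		if (q[i].prio < q[best].prio)
			best = i;
	return q[best];
}

int main(void)
{
	struct waiter q[] = {
		{ "batch-task", 120 },	/* arrived first */
		{ "rt-task",      1 },	/* arrived later, more important */
	};

	printf("FIFO wakes:  %s\n", fifo_pick(q, 2).name);	/* batch-task */
	printf("plist wakes: %s\n", plist_pick(q, 2).name);	/* rt-task */
	return 0;
}

In vmscan terms: a realtime task entering direct reclaim could sit on
the throttle wait queue behind several batch tasks.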
* Re: [PATCH][-mm][1/2] core of page reclaim throttle
  2008-04-12 19:30 ` Peter Zijlstra
@ 2008-04-14  8:20   ` KOSAKI Motohiro
  0 siblings, 0 replies; 10+ messages in thread
From: KOSAKI Motohiro @ 2008-04-14 8:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: kosaki.motohiro, Andrew Morton, linux-mm, Balbir Singh,
    Rik van Riel, David Rientjes, Nick Piggin, KAMEZAWA Hiroyuki

> > Index: b/include/linux/mmzone.h
> > ===================================================================
> > --- a/include/linux/mmzone.h	2008-03-27 13:35:03.000000000 +0900
> > +++ b/include/linux/mmzone.h	2008-03-27 15:55:50.000000000 +0900
> > @@ -335,6 +335,8 @@ struct zone {
> > 	unsigned long	spanned_pages;	/* total size, including holes */
> > 	unsigned long	present_pages;	/* amount of memory (excluding holes) */
> >
> > +	atomic_t	nr_reclaimers;
> > +	wait_queue_head_t	reclaim_throttle_waitq;
> > 	/*
> > 	 * rarely used fields:
>
> I'm thinking this ought to be a plist based wait_queue to avoid
> priority inversions - but I don't think we have such a creature.

Agreed, the PI problem exists. But it is not so important in reclaim,
because reclaim is already highly non-deterministic, and I would
prefer step-by-step development.
I'll drop the PI feature from this version and put it on the future
development list :)

Thanks
* [PATCH][-mm][2/2] introduce sysctl i/f of max task of throttle
  2008-03-30  8:12 [PATCH][-mm][0/2] page reclaim throttle take4 KOSAKI Motohiro
  2008-03-30  8:12 ` Balbir Singh
  2008-03-30  8:15 ` [PATCH][-mm][1/2] core of page reclaim throttle KOSAKI Motohiro
@ 2008-03-30  8:19 ` KOSAKI Motohiro
  2 siblings, 0 replies; 10+ messages in thread
From: KOSAKI Motohiro @ 2008-03-30 8:19 UTC (permalink / raw)
To: Andrew Morton, linux-mm, Balbir Singh, Rik van Riel,
    David Rientjes, Nick Piggin, KAMEZAWA Hiroyuki, Peter Zijlstra
Cc: kosaki.motohiro

Introduce a sysctl parameter for the maximum number of reclaim tasks
allowed past the throttle.

<usage>
# echo 5 > /proc/sys/vm/max_nr_task_per_zone

This sets the maximum number of simultaneous reclaim tasks to 5.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

---
 include/linux/swap.h |    2 ++
 kernel/sysctl.c      |    9 +++++++++
 mm/vmscan.c          |    3 ++-
 3 files changed, 13 insertions(+), 1 deletion(-)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c	2008-03-27 17:47:15.000000000 +0900
+++ b/mm/vmscan.c	2008-03-27 17:47:39.000000000 +0900
@@ -124,9 +124,10 @@ struct scan_control {
 int vm_swappiness = 60;
 long vm_total_pages;	/* The total number of pages which the VM controls */
 
-#define MAX_RECLAIM_TASKS CONFIG_NR_MAX_RECLAIM_TASKS_PER_ZONE
+#define MAX_RECLAIM_TASKS vm_max_nr_task_per_zone
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
+int vm_max_nr_task_per_zone = CONFIG_NR_MAX_RECLAIM_TASKS_PER_ZONE;
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 #define scan_global_lru(sc)	(!(sc)->mem_cgroup)
Index: b/include/linux/swap.h
===================================================================
--- a/include/linux/swap.h	2008-03-27 17:45:45.000000000 +0900
+++ b/include/linux/swap.h	2008-03-27 17:47:39.000000000 +0900
@@ -206,6 +206,8 @@ static inline int zone_reclaim(struct zo
 
 extern int kswapd_run(int nid);
 
+extern int vm_max_nr_task_per_zone;
+
 #ifdef CONFIG_MMU
 /* linux/mm/shmem.c */
 extern int shmem_unuse(swp_entry_t entry, struct page *page);
Index: b/kernel/sysctl.c
===================================================================
--- a/kernel/sysctl.c	2008-03-27 17:45:45.000000000 +0900
+++ b/kernel/sysctl.c	2008-03-27 19:41:12.000000000 +0900
@@ -1141,6 +1141,15 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &one,
 	},
 #endif
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_nr_task_per_zone",
+		.data		= &vm_max_nr_task_per_zone,
+		.maxlen		= sizeof(vm_max_nr_task_per_zone),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
 	/*
 	 * NOTE: do not add new entries to this table unless you have read
 	 * Documentation/sysctl/ctl_unnumbered.txt
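One caveat with this interface: as posted, proc_dointvec accepts any
integer, and writing 0 would make atomic_add_unless(&zone->nr_reclaimers,
1, 0) fail forever, leaving every reclaimer asleep on the throttle wait
queue. A minimal sketch of a bounded table entry, assuming the
"static int one = 1;" helper that kernel/sysctl.c of this era already
defines for the neighbouring vm_table entries:

	{
		.ctl_name	= CTL_UNNUMBERED,
		.procname	= "max_nr_task_per_zone",
		.data		= &vm_max_nr_task_per_zone,
		.maxlen		= sizeof(vm_max_nr_task_per_zone),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec_minmax,
		.strategy	= &sysctl_intvec,
		.extra1		= &one,		/* reject values below 1 */
	},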