* [PATCH] vmscan: skip freeing memory from zones with lots free
@ 2008-11-28 11:08 Rik van Riel
2008-11-28 11:30 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Rik van Riel @ 2008-11-28 11:08 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, KOSAKI Motohiro, akpm
Skip freeing memory from zones that already have lots of free memory.
If one memory zone has memory that is harder to free, we want to avoid
freeing excessive amounts of memory from the other zones, if only because
pageout IO from those other zones can slow down page freeing from the
problem zone.
This is similar to the check already done by kswapd in balance_pgdat().
Signed-off-by: Rik van Riel <riel@redhat.com>
---
Kosaki-san, this should address point (3) from your list.
mm/vmscan.c | 3 +++
1 file changed, 3 insertions(+)
Index: linux-2.6.28-rc5/mm/vmscan.c
===================================================================
--- linux-2.6.28-rc5.orig/mm/vmscan.c 2008-11-28 05:53:56.000000000 -0500
+++ linux-2.6.28-rc5/mm/vmscan.c 2008-11-28 06:05:29.000000000 -0500
@@ -1510,6 +1510,9 @@ static unsigned long shrink_zones(int pr
 			if (zone_is_all_unreclaimable(zone) &&
 						priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
+			if (zone_watermark_ok(zone, sc->order,
+					4*zone->pages_high, high_zoneidx, 0))
+				continue;	/* Lots free already */
 			sc->all_unreclaimable = 0;
 		} else {
 			/*
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-28 11:08 [PATCH] vmscan: skip freeing memory from zones with lots free Rik van Riel
@ 2008-11-28 11:30 ` Peter Zijlstra
  2008-11-28 22:43 ` Johannes Weiner
  2008-11-29  7:19 ` Andrew Morton
  2 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2008-11-28 11:30 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro, akpm

On Fri, 2008-11-28 at 06:08 -0500, Rik van Riel wrote:
> Skip freeing memory from zones that already have lots of free memory.
> If one memory zone has harder to free memory, we want to avoid freeing
> excessive amounts of memory from other zones, if only because pageout
> IO from the other zones can slow down page freeing from the problem zone.
>
> This is similar to the check already done by kswapd in balance_pgdat().
>
> Signed-off-by: Rik van Riel <riel@redhat.com>

Makes sense,

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-28 11:08 [PATCH] vmscan: skip freeing memory from zones with lots free Rik van Riel
  2008-11-28 11:30 ` Peter Zijlstra
@ 2008-11-28 22:43 ` Johannes Weiner
  2008-11-29  7:19 ` Andrew Morton
  2 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2008-11-28 22:43 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro, akpm

On Fri, Nov 28, 2008 at 06:08:03AM -0500, Rik van Riel wrote:
> Skip freeing memory from zones that already have lots of free memory.
> If one memory zone has harder to free memory, we want to avoid freeing
> excessive amounts of memory from other zones, if only because pageout
> IO from the other zones can slow down page freeing from the problem zone.
>
> This is similar to the check already done by kswapd in balance_pgdat().
>
> Signed-off-by: Rik van Riel <riel@redhat.com>

Acked-by: Johannes Weiner <hannes@saeurebad.de>
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-28 11:08 [PATCH] vmscan: skip freeing memory from zones with lots free Rik van Riel
  2008-11-28 11:30 ` Peter Zijlstra
  2008-11-28 22:43 ` Johannes Weiner
@ 2008-11-29  7:19 ` Andrew Morton
  2008-11-29 10:55 ` KOSAKI Motohiro
  2008-11-29 16:47 ` Rik van Riel
  2 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2008-11-29 7:19 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

On Fri, 28 Nov 2008 06:08:03 -0500 Rik van Riel <riel@redhat.com> wrote:

> Skip freeing memory from zones that already have lots of free memory.
> If one memory zone has harder to free memory, we want to avoid freeing
> excessive amounts of memory from other zones, if only because pageout
> IO from the other zones can slow down page freeing from the problem zone.
>
> This is similar to the check already done by kswapd in balance_pgdat().
>
> @@ -1510,6 +1510,9 @@ static unsigned long shrink_zones(int pr
>  			if (zone_is_all_unreclaimable(zone) &&
>  						priority != DEF_PRIORITY)
>  				continue;	/* Let kswapd poll it */
> +			if (zone_watermark_ok(zone, sc->order,
> +					4*zone->pages_high, high_zoneidx, 0))
> +				continue;	/* Lots free already */
>  			sc->all_unreclaimable = 0;
>  		} else {
>  			/*

We already tried this, or something very similar in effect, I think...

commit 26e4931632352e3c95a61edac22d12ebb72038fe
Author: akpm <akpm>
Date:   Sun Sep 8 19:21:55 2002 +0000

    [PATCH] refill the inactive list more quickly

    Fix a problem noticed by Ed Tomlinson: under shifting workloads the
    shrink_zone() logic will refill the inactive load too slowly.

    Bale out of the zone scan when we've reclaimed enough pages.  Fixes a
    rarely-occurring problem wherein refill_inactive_zone() ends up
    shuffling 100,000 pages and generally goes silly.

    This needs to be revisited - we should go on and rebalance the lower
    zones even if we reclaimed enough pages from highmem.

Then it was reverted a year or two later:

commit 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3
Author: akpm <akpm>
Date:   Fri Mar 12 16:23:50 2004 +0000

    [PATCH] vmscan: zone balancing fix

    We currently have a problem with the balancing of reclaim between zones:
    much more reclaim happens against highmem than against lowmem.

    This patch partially fixes this by changing the direct reclaim path so it
    does not bale out of the zone walk after having reclaimed sufficient pages
    from highmem: go on to reclaim from lowmem regardless of how many pages we
    reclaimed from lowmem.

My changelog does not adequately explain the reasons.

But we don't want to rediscover these reasons in early 2010 :(  Some
trawling of the linux-mm and lkml archives around those dates might help
us avoid a mistake here.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29  7:19 ` Andrew Morton
@ 2008-11-29 10:55 ` KOSAKI Motohiro
  2008-12-08 13:00   ` KOSAKI Motohiro
  2008-11-29 16:47 ` Rik van Riel
  1 sibling, 1 reply; 23+ messages in thread
From: KOSAKI Motohiro @ 2008-11-29 10:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: kosaki.motohiro, Rik van Riel, linux-mm, linux-kernel

> We already tried this, or something very similar in effect, I think...
>
> commit 26e4931632352e3c95a61edac22d12ebb72038fe
>     [PATCH] refill the inactive list more quickly
>
> Then it was reverted a year or two later:
>
> commit 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3
>     [PATCH] vmscan: zone balancing fix
>
> My changelog does not adequately explain the reasons.
>
> But we don't want to rediscover these reasons in early 2010 :(  Some
> trawling of the linux-mm and lkml archives around those dates might help
> us avoid a mistake here.

I'll dig through the past discussion archives.
Andrew, please hold off on merging this patch for a while.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 10:55 ` KOSAKI Motohiro
@ 2008-12-08 13:00 ` KOSAKI Motohiro
  2008-12-08 13:03   ` KOSAKI Motohiro
  0 siblings, 1 reply; 23+ messages in thread
From: KOSAKI Motohiro @ 2008-12-08 13:00 UTC (permalink / raw)
  To: Andrew Morton, Rik van Riel
  Cc: kosaki.motohiro, linux-mm, linux-kernel, Peter Zijlstra,
	Johannes Weiner

Hi

Very sorry for the late response.

> > My changelog does not adequately explain the reasons.
> >
> > But we don't want to rediscover these reasons in early 2010 :(  Some
> > trawling of the linux-mm and lkml archives around those dates might
> > help us avoid a mistake here.
>
> I'll dig through the past discussion archives.
> Andrew, please hold off on merging this patch for a while.

I searched the past archives patiently, but unfortunately I could not
find the reason at all.  The reverting fix appeared suddenly in
2.6.3-mm3:

	http://marc.info/?l=linux-kernel&m=107749956707874&w=2

but I cannot find any related discussion in the nearby months, so I
guess akpm found the problem by himself.

Therefore, instead, I'd like to demonstrate the safety of Rik's patch
by measurement.


1. Does this patch break reclaim balancing?

Ran the FFSB benchmark with the following configuration:

---------------------------------------------------
directio=0
time=300

[filesystem0]
	location=/mnt/sdb1/kosaki/ffsb
	num_files=20
	num_dirs=10
	max_filesize=91534338
	min_filesize=65535
[end0]

[threadgroup0]
	num_threads=10
	write_size=2816
	write_blocksize=4096
	read_size=2816
	read_blocksize=4096
	create_weight=100
	write_weight=30
	read_weight=100
[end0]
--------------------------------------------------------

<without patch>
pgscan_kswapd_dma 10624
pgscan_kswapd_normal 20640
	-> normal/dma ratio  20640 / 10624 = 1.9
pgscan_direct_dma 576
pgscan_direct_normal 2528
	-> normal/dma ratio  2528 / 576 = 4.38

<with patch>
pgscan_kswapd_dma 21824
pgscan_kswapd_normal 47424
	-> normal/dma ratio  47424 / 21824 = 2.17
pgscan_direct_dma 1632
pgscan_direct_normal 6912
	-> normal/dma ratio  6912 / 1632 = 4.23

The reason is simple: this patch only makes a difference in the
following two cases.

 1) Another process freed a large amount of memory while this one was
    in direct reclaim.
 2) Another process reclaimed a large amount of memory while this one
    was in direct reclaim.

In other words, its logic doesn't trigger on typical workloads at all.


2. Measured the most beneficial case (i.e., many threads going into
   swap-out concurrently).

$ ./hackbench 140 process 300
(ten measurements)

	2.6.28-rc6	+ this patch
			+ bail-out
	--------------------------
	 62.514		 29.270
	225.698		 30.209
	114.694		 20.881
	179.108		 19.795
	111.080		 19.563
	189.796		 19.226
	114.124		 13.330
	112.999		 10.280
	227.842		  9.669
	 81.869		 10.113

avg	141.972		 18.234
std	 55.937		  7.099
min	 62.514		  9.669
max	227.842		 30.209

	-> about a 10x improvement


3. Measured the worst case (many threads, no swap-out), using the
   following three runs (ten measurements each):

$ ./hackbench 125 process 3000
$ ./hackbench 130 process 3000
$ ./hackbench 135 process 3000

	2.6.28-rc6			+ skip freeing memory
	+ evict streaming first		  (rvr bail out)
	+ kosaki bail out improve

nr_group    125      130      135      125      130      135
------------------------------------------------------------
	 67.302   68.269   77.161   89.450   75.328  173.437
	 72.616   72.712   79.060   69.843   74.145   76.217
	 72.475   75.712   77.735   73.531   76.426   85.527
	 69.229   73.062   78.814   72.472   74.891   75.129
	 71.551   74.392   78.564   69.423   73.517   75.544
	 69.227   74.310   78.837   72.543   75.347   79.237
	 70.759   75.256   76.600   70.477   77.848   90.981
	 69.966   76.001   78.464   71.792   78.722   92.048
	 69.068   75.218   80.321   71.313   74.958   78.113
	 72.057   77.151   79.068   72.306   75.644   79.888

avg	 70.425   74.208   78.462   73.315   75.683   90.612
std	  1.665    2.348    1.007    5.516    1.514   28.218
min	 67.302   68.269   76.600   69.423   73.517   75.129
max	 72.616   77.151   80.321   89.450   78.722  173.437

	-> 1 - 10% slowdown, because zone_watermark_ok() is a somewhat
	   slow function.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-12-08 13:00 ` KOSAKI Motohiro
@ 2008-12-08 13:03 ` KOSAKI Motohiro
  2008-12-08 17:48   ` KOSAKI Motohiro
  2008-12-08 20:25   ` Rik van Riel
  0 siblings, 2 replies; 23+ messages in thread
From: KOSAKI Motohiro @ 2008-12-08 13:03 UTC (permalink / raw)
  To: Andrew Morton, Rik van Riel
  Cc: kosaki.motohiro, linux-mm, linux-kernel, Peter Zijlstra,
	Johannes Weiner, Christoph Lameter, Nick Piggin

> avg	 70.425   74.208   78.462   73.315   75.683   90.612
> std	  1.665    2.348    1.007    5.516    1.514   28.218
> min	 67.302   68.269   76.600   69.423   73.517   75.129
> max	 72.616   77.151   80.321   89.450   78.722  173.437
>
> 	-> 1 - 10% slowdown, because zone_watermark_ok() is a somewhat
> 	   slow function.

Next, I'd like to talk about why I think the cause is
zone_watermark_ok().

I have a zone_watermark_ok() improvement patch.  The following patch
was developed for another issue; however, I observed that it also
resolves the performance regression seen with Rik's patch.

<with following patch>

	2.6.28-rc6			+ skip freeing memory
	+ evict streaming first		  (rvr bail out)
	+ kosaki bail out improve	  + this patch

nr_group    125      130      135      125      130      135
------------------------------------------------------------
	 67.302   68.269   77.161   68.534   75.733   79.416
	 72.616   72.712   79.060   70.868   74.264   76.858
	 72.475   75.712   77.735   73.215   80.278   81.033
	 69.229   73.062   78.814   70.780   72.518   75.764
	 71.551   74.392   78.564   69.631   77.252   77.131
	 69.227   74.310   78.837   72.325   72.723   79.274
	 70.759   75.256   76.600   70.328   74.046   75.783
	 69.966   76.001   78.464   69.014   72.566   77.236
	 69.068   75.218   80.321   68.373   76.447   76.015
	 72.057   77.151   79.068   74.403   72.794   75.872

avg	 70.425   74.208   78.462   70.747   74.862   77.438
std	  1.665    2.348    1.007    1.921    2.428    1.752
min	 67.302   68.269   76.600   68.373   72.518   75.764
max	 72.616   77.151   80.321   74.403   80.278   81.033

	-> OK, the performance regression disappeared.


===========================
Subject: [PATCH] mm: zone_watermark_ok() doesn't require small fragment blocks

Currently, zone_watermark_ok() has somewhat unfair logic.  Example:

Called as zone_watermark_ok(zone, 2, pages_min, 0, 0);
	pages_min = 64
	free pages = 80

case A.

	order	nr_pages
	--------------------
	2	5
	1	10
	0	30

	-> zone_watermark_ok() returns 1

case B.

	order	nr_pages
	--------------------
	3	10
	2	0
	1	0
	0	0

	-> zone_watermark_ok() returns 0

In other words, the current zone_watermark_ok() tends to prefer many
small fragment blocks.  If splitting a large block into small blocks in
the buddy allocator were slow, the above logic would be reasonable.
However, that assumption does not hold at all: the Linux buddy
allocator can split large blocks efficiently.

As for call frequency, zone_watermark_ok() is called from
get_page_from_freelist() every time, and get_page_from_freelist() is
one of the hottest fast paths.  In general, a fast path should

 - run as fast as possible when the system has plenty of memory;
 - when the system doesn't have enough memory, fast processing matters
   less, but OOM must be avoided as far as possible.

Unfortunately, the following loop has the reverse performance tendency:

	for (o = 0; o < order; o++) {
		free_pages -= z->free_area[o].nr_free << o;
		min >>= 1;
		if (free_pages <= min)
			return 0;
	}

If the system doesn't have enough memory, the loop above bails out
early.  But if the system has plenty of memory, the loop runs the full
number of order iterations.

This patch changes the zone_watermark_ok() logic to prefer large
contiguous blocks.

Result:

test machine:
	CPU: ia64 x 8
	MEM: 8GB

benchmark:
	$ tbench 8  (three measurements)

tbench runs for about 600 seconds.
alloc_pages() and zone_watermark_ok() are called about 15,000,000 times.

	2.6.28-rc6			this patch
	throughput	max-latency	throughput	max-latency
	---------------------------------------------------------
	1480.92		20.896		1,490.27	19.606
	1483.94		19.202		1,482.86	21.082
	1478.93		22.215		1,490.57	23.493

avg	1,481.26	20.771		1,487.90	21.394
std	2.06		1.233		3.56		1.602
min	1,478.93	19.202		1,482.86	19.606
max	1,483.94	22.215		1,490.57	23.493

Throughput improves by about 5 MB/sec, which is beyond the measurement
noise.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Nick Piggin <npiggin@suse.de>
CC: Christoph Lameter <cl@linux-foundation.org>
---
 mm/page_alloc.c |   16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1227,7 +1227,7 @@ static inline int should_fail_alloc_page
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		      int classzone_idx, int alloc_flags)
 {
-	/* free_pages my go negative - that's OK */
+	/* free_pages may go negative - that's OK */
 	long min = mark;
 	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
 	int o;
@@ -1239,17 +1239,13 @@ int zone_watermark_ok(struct zone *z, in
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
 		return 0;

-	for (o = 0; o < order; o++) {
-		/* At the next order, this order's pages become unavailable */
-		free_pages -= z->free_area[o].nr_free << o;
-		/* Require fewer higher order pages to be free */
-		min >>= 1;
-
-		if (free_pages <= min)
-			return 0;
+	for (o = order; o < MAX_ORDER; o++) {
+		if (z->free_area[o].nr_free)
+			return 1;
 	}
-	return 1;
+
+	return 0;
 }
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-12-08 13:03 ` KOSAKI Motohiro
@ 2008-12-08 17:48 ` KOSAKI Motohiro
  2008-12-10  5:07   ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: KOSAKI Motohiro @ 2008-12-08 17:48 UTC (permalink / raw)
  To: Andrew Morton, Rik van Riel
  Cc: kosaki.motohiro, linux-mm, linux-kernel, Peter Zijlstra,
	Johannes Weiner, Christoph Lameter, Nick Piggin

> Example:
>
> Called as zone_watermark_ok(zone, 2, pages_min, 0, 0);
> 	pages_min = 64
> 	free pages = 80
>
> case A.
>
> 	order	nr_pages
> 	--------------------
> 	2	5
> 	1	10
> 	0	30
>
> 	-> zone_watermark_ok() returns 1
>
> case B.
>
> 	order	nr_pages
> 	--------------------
> 	3	10
> 	2	0
> 	1	0
> 	0	0
>
> 	-> zone_watermark_ok() returns 0

Doh!
This example is obviously buggy.

I guess Mr. KOSAKI is very silly, or an idiot.
I recommend he get a feathery blanket and some good sleep, instead of
more black, black coffee ;-)

...but the measurement results below are still valid.

> This patch changes the zone_watermark_ok() logic to prefer large
> contiguous blocks.
>
> Result:
>
> test machine:
> 	CPU: ia64 x 8
> 	MEM: 8GB
>
> benchmark:
> 	$ tbench 8  (three measurements)
>
> tbench runs for about 600 seconds.
> alloc_pages() and zone_watermark_ok() are called about 15,000,000 times.
>
> 	2.6.28-rc6			this patch
> 	throughput	max-latency	throughput	max-latency
> 	---------------------------------------------------------
> 	1480.92		20.896		1,490.27	19.606
> 	1483.94		19.202		1,482.86	21.082
> 	1478.93		22.215		1,490.57	23.493
>
> avg	1,481.26	20.771		1,487.90	21.394
> std	2.06		1.233		3.56		1.602
> min	1,478.93	19.202		1,482.86	19.606
> max	1,483.94	22.215		1,490.57	23.493
>
> Throughput improves by about 5 MB/sec, which is beyond the measurement
> noise.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-12-08 17:48 ` KOSAKI Motohiro
@ 2008-12-10  5:07 ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2008-12-10 5:07 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, linux-mm, linux-kernel,
	Peter Zijlstra, Johannes Weiner, Christoph Lameter

On Tue, Dec 09, 2008 at 02:48:40AM +0900, KOSAKI Motohiro wrote:
> Doh!
> This example is obviously buggy.
>
> I guess Mr. KOSAKI is very silly, or an idiot.
> I recommend he get a feathery blanket and some good sleep, instead of
> more black, black coffee ;-)

:) No, actually it is always good to have people reviewing existing
code, so thank you for that.

> ...but the measurement results below are still valid.

And it is an interesting result.  As far as I can see, your patch
changes zone_watermark_ok so that it avoids some watermark checking for
higher order page blocks?

I am surprised it makes a noticeable difference in performance; however,
such a change would be slightly detrimental to atomic and "emergency"
allocations of higher order pages, wouldn't it?

It would be interesting to know where the higher order allocations are
coming from.  Do packets over the loopback device still do higher order
allocations?  If so, I suspect this is a bit artificial.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-12-08 13:03 ` KOSAKI Motohiro
  2008-12-08 17:48 ` KOSAKI Motohiro
@ 2008-12-08 20:25 ` Rik van Riel
  2008-12-10  5:09   ` Nick Piggin
  2008-12-12  5:50   ` KOSAKI Motohiro
  1 sibling, 2 replies; 23+ messages in thread
From: Rik van Riel @ 2008-12-08 20:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, linux-mm, linux-kernel, Peter Zijlstra,
	Johannes Weiner, Christoph Lameter, Nick Piggin

KOSAKI Motohiro wrote:

> +	for (o = order; o < MAX_ORDER; o++) {
> +		if (z->free_area[o].nr_free)
> +			return 1;

Since page breakup and coalescing always manipulates .nr_free,
I wonder if it would make sense to pack the nr_free variables
in their own cache line(s), so we have fewer cache misses when
going through zone_watermark_ok() ?

That would end up looking something like this:

(whitespace mangled because it doesn't make sense to apply just
this thing, anyway)

Index: linux-2.6.28-rc7/include/linux/mmzone.h
===================================================================
--- linux-2.6.28-rc7.orig/include/linux/mmzone.h	2008-12-02 15:04:33.000000000 -0500
+++ linux-2.6.28-rc7/include/linux/mmzone.h	2008-12-08 15:24:25.000000000 -0500
@@ -58,7 +58,6 @@ static inline int get_pageblock_migratet

 struct free_area {
 	struct list_head free_list[MIGRATE_TYPES];
-	unsigned long nr_free;
 };

 struct pglist_data;
@@ -296,6 +295,7 @@ struct zone {
 	seqlock_t span_seqlock;
 #endif
 	struct free_area free_area[MAX_ORDER];
+	unsigned long nr_free[MAX_ORDER];

 #ifndef CONFIG_SPARSEMEM
 	/*

--
All rights reversed.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-12-08 20:25 ` Rik van Riel
@ 2008-12-10  5:09 ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2008-12-10 5:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Andrew Morton, linux-mm, linux-kernel,
	Peter Zijlstra, Johannes Weiner, Christoph Lameter

On Mon, Dec 08, 2008 at 03:25:10PM -0500, Rik van Riel wrote:
> KOSAKI Motohiro wrote:
>
> > +	for (o = order; o < MAX_ORDER; o++) {
> > +		if (z->free_area[o].nr_free)
> > +			return 1;
>
> Since page breakup and coalescing always manipulates .nr_free,
> I wonder if it would make sense to pack the nr_free variables
> in their own cache line(s), so we have fewer cache misses when
> going through zone_watermark_ok() ?

For order-0 allocations, they should not be touched at all.  For higher
order allocations in performance critical paths, we should try to fix
those to use order-0 ;)
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-12-08 20:25 ` Rik van Riel
  2008-12-10  5:09 ` Nick Piggin
@ 2008-12-12  5:50 ` KOSAKI Motohiro
  0 siblings, 0 replies; 23+ messages in thread
From: KOSAKI Motohiro @ 2008-12-12 5:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, Andrew Morton, linux-mm, linux-kernel,
	Peter Zijlstra, Johannes Weiner, Christoph Lameter, Nick Piggin

> Index: linux-2.6.28-rc7/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.28-rc7.orig/include/linux/mmzone.h	2008-12-02 15:04:33.000000000 -0500
> +++ linux-2.6.28-rc7/include/linux/mmzone.h	2008-12-08 15:24:25.000000000 -0500
> @@ -58,7 +58,6 @@ static inline int get_pageblock_migratet
>
>  struct free_area {
>  	struct list_head free_list[MIGRATE_TYPES];
> -	unsigned long nr_free;
>  };
>
>  struct pglist_data;
> @@ -296,6 +295,7 @@ struct zone {
>  	seqlock_t span_seqlock;
>  #endif
>  	struct free_area free_area[MAX_ORDER];
> +	unsigned long nr_free[MAX_ORDER];
>
>  #ifndef CONFIG_SPARSEMEM
>  	/*

Measurement result:

% tbench 8

	2.6.28-rc6 + rvr		free area restructure
	throughput	max latency	throughput	max latency
	------------------------------------------------------------
	1480.920	20.896		742.470		 30.401
	1483.940	19.202		791.648		635.623
	1478.930	22.215		733.433		 92.515

avg	1481.263	20.771		755.850		252.846
std	   2.060	 1.233		 25.580		271.849
min	1478.930	19.202		733.433		 30.401
max	1483.940	22.215		791.648		635.623

I think Nick is right.  I'll drop this idea.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free 2008-11-29 7:19 ` Andrew Morton 2008-11-29 10:55 ` KOSAKI Motohiro @ 2008-11-29 16:47 ` Rik van Riel 2008-11-29 17:45 ` Andrew Morton 1 sibling, 1 reply; 23+ messages in thread From: Rik van Riel @ 2008-11-29 16:47 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro Andrew Morton wrote: >> Index: linux-2.6.28-rc5/mm/vmscan.c >> =================================================================== >> --- linux-2.6.28-rc5.orig/mm/vmscan.c 2008-11-28 05:53:56.000000000 -0500 >> +++ linux-2.6.28-rc5/mm/vmscan.c 2008-11-28 06:05:29.000000000 -0500 >> @@ -1510,6 +1510,9 @@ static unsigned long shrink_zones(int pr >> if (zone_is_all_unreclaimable(zone) && >> priority != DEF_PRIORITY) >> continue; /* Let kswapd poll it */ >> + if (zone_watermark_ok(zone, sc->order, >> + 4*zone->pages_high, high_zoneidx, 0)) >> + continue; /* Lots free already */ >> sc->all_unreclaimable = 0; >> } else { >> /* > > We already tried this, or something very similar in effect, I think... Yes, we have a check just like this in balance_pgdat(). It's been there forever with no ill effect. > commit 26e4931632352e3c95a61edac22d12ebb72038fe > Author: akpm <akpm> > Date: Sun Sep 8 19:21:55 2002 +0000 > > [PATCH] refill the inactive list more quickly > > Fix a problem noticed by Ed Tomlinson: under shifting workloads the > shrink_zone() logic will refill the inactive load too slowly. > > Bale out of the zone scan when we've reclaimed enough pages. Fixes a > rarely-occurring problem wherein refill_inactive_zone() ends up > shuffling 100,000 pages and generally goes silly. This is not a bale out, this is a "skip zones that have way too many free pages already". Kswapd has been doing this for years already. -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 16:47   ` Rik van Riel
@ 2008-11-29 17:45     ` Andrew Morton
  2008-11-29 17:58       ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-29 17:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

On Sat, 29 Nov 2008 11:47:25 -0500 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
>
> >> Index: linux-2.6.28-rc5/mm/vmscan.c
> >> ===================================================================
> >> --- linux-2.6.28-rc5.orig/mm/vmscan.c	2008-11-28 05:53:56.000000000 -0500
> >> +++ linux-2.6.28-rc5/mm/vmscan.c	2008-11-28 06:05:29.000000000 -0500
> >> @@ -1510,6 +1510,9 @@ static unsigned long shrink_zones(int pr
> >>  		if (zone_is_all_unreclaimable(zone) &&
> >>  					priority != DEF_PRIORITY)
> >>  			continue;	/* Let kswapd poll it */
> >> +		if (zone_watermark_ok(zone, sc->order,
> >> +					4*zone->pages_high, high_zoneidx, 0))
> >> +			continue;	/* Lots free already */
> >>  		sc->all_unreclaimable = 0;
> >>  	} else {
> >>  		/*
> >
> > We already tried this, or something very similar in effect, I think...
>
> Yes, we have a check just like this in balance_pgdat().
>
> It's been there forever with no ill effect.

This patch affects direct reclaim as well as kswapd.

> > commit 26e4931632352e3c95a61edac22d12ebb72038fe
> > Author: akpm <akpm>
> > Date:   Sun Sep 8 19:21:55 2002 +0000
> >
> >     [PATCH] refill the inactive list more quickly
> >
> >     Fix a problem noticed by Ed Tomlinson:  under shifting workloads the
> >     shrink_zone() logic will refill the inactive load too slowly.
> >
> >     Bale out of the zone scan when we've reclaimed enough pages.  Fixes a
> >     rarely-occurring problem wherein refill_inactive_zone() ends up
> >     shuffling 100,000 pages and generally goes silly.
>
> This is not a bale out, this is a "skip zones that have way
> too many free pages already".

It is similar in effect.

Will this new patch reintroduce the problem which
26e4931632352e3c95a61edac22d12ebb72038fe fixed?
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 17:45     ` Andrew Morton
@ 2008-11-29 17:58       ` Rik van Riel
  2008-11-29 18:26         ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2008-11-29 17:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

Andrew Morton wrote:
> On Sat, 29 Nov 2008 11:47:25 -0500 Rik van Riel <riel@redhat.com> wrote:
>
>> Andrew Morton wrote:
>>
>>>> Index: linux-2.6.28-rc5/mm/vmscan.c
>>>> ===================================================================
>>>> --- linux-2.6.28-rc5.orig/mm/vmscan.c	2008-11-28 05:53:56.000000000 -0500
>>>> +++ linux-2.6.28-rc5/mm/vmscan.c	2008-11-28 06:05:29.000000000 -0500
>>>> @@ -1510,6 +1510,9 @@ static unsigned long shrink_zones(int pr
>>>>  		if (zone_is_all_unreclaimable(zone) &&
>>>>  					priority != DEF_PRIORITY)
>>>>  			continue;	/* Let kswapd poll it */
>>>> +		if (zone_watermark_ok(zone, sc->order,
>>>> +					4*zone->pages_high, high_zoneidx, 0))
>>>> +			continue;	/* Lots free already */
>>>>  		sc->all_unreclaimable = 0;
>>>>  	} else {
>>>>  		/*
>>> We already tried this, or something very similar in effect, I think...
>> Yes, we have a check just like this in balance_pgdat().
>>
>> It's been there forever with no ill effect.
>
> This patch affects direct reclaim as well as kswapd.

No, kswapd calls shrink_zone directly from balance_pgdat,
it does not go through shrink_zones.

>>> commit 26e4931632352e3c95a61edac22d12ebb72038fe
>>> Author: akpm <akpm>
>>> Date:   Sun Sep 8 19:21:55 2002 +0000
>>>
>>>     [PATCH] refill the inactive list more quickly
>>>
>>>     Fix a problem noticed by Ed Tomlinson:  under shifting workloads the
>>>     shrink_zone() logic will refill the inactive load too slowly.
>>>
>>>     Bale out of the zone scan when we've reclaimed enough pages.  Fixes a
>>>     rarely-occurring problem wherein refill_inactive_zone() ends up
>>>     shuffling 100,000 pages and generally goes silly.
>> This is not a bale out, this is a "skip zones that have way
>> too many free pages already".
>
> It is similar in effect.
>
> Will this new patch reintroduce the problem which
> 26e4931632352e3c95a61edac22d12ebb72038fe fixed?

Googling on 26e4931632352e3c95a61edac22d12ebb72038fe only finds
your emails with that commit id in it - which git tree do I
need to search to get that changeset?

--
All rights reversed.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 17:58       ` Rik van Riel
@ 2008-11-29 18:26         ` Andrew Morton
  2008-11-29 18:41           ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-29 18:26 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

On Sat, 29 Nov 2008 12:58:32 -0500 Rik van Riel <riel@redhat.com> wrote:

> > Will this new patch reintroduce the problem which
> > 26e4931632352e3c95a61edac22d12ebb72038fe fixed?
>
> Googling on 26e4931632352e3c95a61edac22d12ebb72038fe only finds
> your emails with that commit id in it - which git tree do I
> need to search to get that changeset?

It's the historical git tree.  All the pre-2.6.12 history which was
migrated from bitkeeper.

git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/old-2.6-bkcvs.git

Spending a couple of fun hours reading `git-log mm/vmscan.c' is pretty
instructive.  For some reason that command generates rather a lot of
unrelated changelog info which needs to be manually skipped over.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 18:26         ` Andrew Morton
@ 2008-11-29 18:41           ` Rik van Riel
  2008-11-29 18:51             ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2008-11-29 18:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

Andrew Morton wrote:
> On Sat, 29 Nov 2008 12:58:32 -0500 Rik van Riel <riel@redhat.com> wrote:
>
>>> Will this new patch reintroduce the problem which
>>> 26e4931632352e3c95a61edac22d12ebb72038fe fixed?

No, that problem is already taken care of by the fact that
active pages always get deactivated in the current VM,
regardless of whether or not they were referenced.

> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/old-2.6-bkcvs.git
>
> Spending a couple of fun hours reading `git-log mm/vmscan.c' is pretty
> instructive.  For some reason that command generates rather a lot of
> unrelated changelog info which needs to be manually skipped over.

Will do.  Thank you for the pointer.

(and not sure why google wouldn't find it - it finds other
git changesets...)

--
All rights reversed.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 18:41           ` Rik van Riel
@ 2008-11-29 18:51             ` Andrew Morton
  2008-11-29 18:59               ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-29 18:51 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

On Sat, 29 Nov 2008 13:41:34 -0500 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> > On Sat, 29 Nov 2008 12:58:32 -0500 Rik van Riel <riel@redhat.com> wrote:
> >
> >>> Will this new patch reintroduce the problem which
> >>> 26e4931632352e3c95a61edac22d12ebb72038fe fixed?
>
> No, that problem is already taken care of by the fact that
> active pages always get deactivated in the current VM,
> regardless of whether or not they were referenced.

err, sorry, that was the wrong commit.
26e4931632352e3c95a61edac22d12ebb72038fe _introduced_ the problem, as
predicted in the changelog.

265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 later fixed it up.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 18:51             ` Andrew Morton
@ 2008-11-29 18:59               ` Rik van Riel
  2008-11-29 20:29                 ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2008-11-29 18:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

Andrew Morton wrote:
> On Sat, 29 Nov 2008 13:41:34 -0500 Rik van Riel <riel@redhat.com> wrote:
>
>> Andrew Morton wrote:
>>> On Sat, 29 Nov 2008 12:58:32 -0500 Rik van Riel <riel@redhat.com> wrote:
>>>
>>>>> Will this new patch reintroduce the problem which
>>>>> 26e4931632352e3c95a61edac22d12ebb72038fe fixed?
>> No, that problem is already taken care of by the fact that
>> active pages always get deactivated in the current VM,
>> regardless of whether or not they were referenced.
>
> err, sorry, that was the wrong commit.
> 26e4931632352e3c95a61edac22d12ebb72038fe _introduced_ the problem, as
> predicted in the changelog.
>
> 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 later fixed it up.

The patch I sent in this thread does not do any baling out,
it only skips zones where the number of free pages is more
than 4 times zone->pages_high.

Equal pressure is still applied to the other zones.

This should not be a problem since we do not enter direct
reclaim unless the free pages in every zone in our zonelist
are below zone->pages_low.

Zone skipping is only done by tasks that have been in the
direct reclaim code for a long time.

--
All rights reversed.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 18:59               ` Rik van Riel
@ 2008-11-29 20:29                 ` Andrew Morton
  2008-11-29 21:35                   ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-29 20:29 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

On Sat, 29 Nov 2008 13:59:21 -0500 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> > On Sat, 29 Nov 2008 13:41:34 -0500 Rik van Riel <riel@redhat.com> wrote:
> >
> >> Andrew Morton wrote:
> >>> On Sat, 29 Nov 2008 12:58:32 -0500 Rik van Riel <riel@redhat.com> wrote:
> >>>
> >>>>> Will this new patch reintroduce the problem which
> >>>>> 26e4931632352e3c95a61edac22d12ebb72038fe fixed?
> >> No, that problem is already taken care of by the fact that
> >> active pages always get deactivated in the current VM,
> >> regardless of whether or not they were referenced.
> >
> > err, sorry, that was the wrong commit.
> > 26e4931632352e3c95a61edac22d12ebb72038fe _introduced_ the problem, as
> > predicted in the changelog.
> >
> > 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 later fixed it up.
>
> The patch I sent in this thread does not do any baling out,
> it only skips zones where the number of free pages is more
> than 4 times zone->pages_high.

But that will have the same effect as baling out.  Moreso, in fact.

> Equal pressure is still applied to the other zones.
>
> This should not be a problem since we do not enter direct
> reclaim unless the free pages in every zone in our zonelist
> are below zone->pages_low.
>
> Zone skipping is only done by tasks that have been in the
> direct reclaim code for a long time.

From 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3:

    We currently have a problem with the balancing of reclaim
    between zones: much more reclaim happens against highmem than
    against lowmem.

This problem will be reintroduced, will it not?
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 20:29                 ` Andrew Morton
@ 2008-11-29 21:35                   ` Rik van Riel
  2008-11-29 21:57                     ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2008-11-29 21:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

Andrew Morton wrote:
> On Sat, 29 Nov 2008 13:59:21 -0500 Rik van Riel <riel@redhat.com> wrote:
>
>> Andrew Morton wrote:
>>> On Sat, 29 Nov 2008 13:41:34 -0500 Rik van Riel <riel@redhat.com> wrote:
>>>
>>>> Andrew Morton wrote:
>>>>> On Sat, 29 Nov 2008 12:58:32 -0500 Rik van Riel <riel@redhat.com> wrote:
>>>>>
>>>>>>> Will this new patch reintroduce the problem which
>>>>>>> 26e4931632352e3c95a61edac22d12ebb72038fe fixed?
>>>> No, that problem is already taken care of by the fact that
>>>> active pages always get deactivated in the current VM,
>>>> regardless of whether or not they were referenced.
>>> err, sorry, that was the wrong commit.
>>> 26e4931632352e3c95a61edac22d12ebb72038fe _introduced_ the problem, as
>>> predicted in the changelog.
>>>
>>> 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 later fixed it up.
>> The patch I sent in this thread does not do any baling out,
>> it only skips zones where the number of free pages is more
>> than 4 times zone->pages_high.
>
> But that will have the same effect as baling out.  Moreso, in fact.

Kswapd already does the same in balance_pgdat.

Unequal pressure is sometimes desired, because allocation
pressure is not equal between zones.  Having lots of
lowmem allocations should not lead to gigabytes of swapped
out highmem.  A numactl pinned application should not cause
memory on other NUMA nodes to be swapped out.

Equal pressure between the zones makes sense when allocation
pressure is similar.

When allocation pressure is different, we have a choice
between evicting potentially useful data from memory or
applying uneven pressure on zones.

>> Equal pressure is still applied to the other zones.
>>
>> This should not be a problem since we do not enter direct
>> reclaim unless the free pages in every zone in our zonelist
>> are below zone->pages_low.
>>
>> Zone skipping is only done by tasks that have been in the
>> direct reclaim code for a long time.
>
> From 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3:
>
>     We currently have a problem with the balancing of reclaim
>     between zones: much more reclaim happens against highmem than
>     against lowmem.
>
> This problem will be reintroduced, will it not?

We already have that behaviour in balance_pgdat().

We do not do any reclaim on zones higher than the first
zone where the zone_watermark_ok call returns true:

		if (!zone_watermark_ok(zone, order, zone->pages_high,
				       0, 0)) {
			end_zone = i;
			break;
		}

Further down in balance_pgdat(), we skip reclaiming from zones
that have way too much memory free.

			/*
			 * We put equal pressure on every zone, unless one
			 * zone has way too many pages free already.
			 */
			if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
						end_zone, 0))
				shrink_zone(priority, zone, &sc);

All my patch does is add one of these sanity checks to the
direct reclaim path.

--
All rights reversed.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 21:35                   ` Rik van Riel
@ 2008-11-29 21:57                     ` Andrew Morton
  2008-11-29 22:07                       ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-29 21:57 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

On Sat, 29 Nov 2008 16:35:45 -0500 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> > On Sat, 29 Nov 2008 13:59:21 -0500 Rik van Riel <riel@redhat.com> wrote:
> >
> >> The patch I sent in this thread does not do any baling out,
> >> it only skips zones where the number of free pages is more
> >> than 4 times zone->pages_high.
> >
> > But that will have the same effect as baling out.  Moreso, in fact.
>
> Kswapd already does the same in balance_pgdat.
>
> Unequal pressure is sometimes desired, because allocation
> pressure is not equal between zones.  Having lots of
> lowmem allocations should not lead to gigabytes of swapped
> out highmem.  A numactl pinned application should not cause
> memory on other NUMA nodes to be swapped out.
>
> Equal pressure between the zones makes sense when allocation
> pressure is similar.
>
> When allocation pressure is different, we have a choice
> between evicting potentially useful data from memory or
> applying uneven pressure on zones.
>
> > From 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3:
> >
> >     We currently have a problem with the balancing of reclaim
> >     between zones: much more reclaim happens against highmem than
> >     against lowmem.
> >
> > This problem will be reintroduced, will it not?
>
> We already have that behaviour in balance_pgdat().

I expect that was the case back in March 2004.
265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 removed the bale-out only for
the direct reclaim path.

> We do not do any reclaim on zones higher than the first
> zone where the zone_watermark_ok call returns true:
>
> 		if (!zone_watermark_ok(zone, order, zone->pages_high,
> 				       0, 0)) {
> 			end_zone = i;
> 			break;
> 		}
>
> Further down in balance_pgdat(), we skip reclaiming from zones
> that have way too much memory free.
>
> 			/*
> 			 * We put equal pressure on every zone, unless one
> 			 * zone has way too many pages free already.
> 			 */
> 			if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
> 						end_zone, 0))
> 				shrink_zone(priority, zone, &sc);
>
> All my patch does is add one of these sanity checks to the
> direct reclaim path.

It's a change in behaviour, not a "sanity check"!

The bottom line here is that we don't fully understand the problem
which 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 fixed, hence we cannot
say whether this proposed change will reintroduce it.

Why did it matter that "much more reclaim happens against highmem than
against lowmem"?  What were the observable effects of this?
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 21:57                     ` Andrew Morton
@ 2008-11-29 22:07                       ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2008-11-29 22:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

Andrew Morton wrote:

> The bottom line here is that we don't fully understand the problem
> which 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 fixed, hence we cannot
> say whether this proposed change will reintroduce it.
>
> Why did it matter that "much more reclaim happens against highmem than
> against lowmem"?  What were the observable effects of this?

On a 1GB system, with 892MB lowmem and 128MB highmem, it could
lead to the page cache coming mostly from highmem.  This in turn
would mean that lowmem could have hundreds of megabytes of unused
memory, while large files would not get cached in memory.

Baling out early and not putting any memory pressure on a zone
can lead to problems.  It is important that zones with easily
freeable memory get some extra memory freed, so more allocations
go to that zone.

However, we also do not want to go overboard.  Kicking potentially
useful data out of memory or causing unnecessary pageout IO is
harmful too.

Doing some amount of extra reclaim in zones with easily freeable
memory means more memory will get allocated from those zones.
Over time this equalizes pressure between zones.

The patch I sent in limits that extra reclaim (extra allocation
space) in easily freeable zones to 4 * zone->pages_high.  That
gives the zone extra free space for alloc_pages, while limiting
unnecessary pageout IO and eviction of useful data.

I am pretty sure that we do understand the differences between
that 2004 patch and the code we have today.

--
All rights reversed.
end of thread, other threads:[~2008-12-12  5:50 UTC | newest]

Thread overview: 23+ messages
2008-11-28 11:08 [PATCH] vmscan: skip freeing memory from zones with lots free Rik van Riel
2008-11-28 11:30 ` Peter Zijlstra
2008-11-28 22:43 ` Johannes Weiner
2008-11-29  7:19 ` Andrew Morton
2008-11-29 10:55   ` KOSAKI Motohiro
2008-12-08 13:00     ` KOSAKI Motohiro
2008-12-08 13:03       ` KOSAKI Motohiro
2008-12-08 17:48         ` KOSAKI Motohiro
2008-12-10  5:07           ` Nick Piggin
2008-12-08 20:25         ` Rik van Riel
2008-12-10  5:09           ` Nick Piggin
2008-12-12  5:50           ` KOSAKI Motohiro
2008-11-29 16:47   ` Rik van Riel
2008-11-29 17:45     ` Andrew Morton
2008-11-29 17:58       ` Rik van Riel
2008-11-29 18:26         ` Andrew Morton
2008-11-29 18:41           ` Rik van Riel
2008-11-29 18:51             ` Andrew Morton
2008-11-29 18:59               ` Rik van Riel
2008-11-29 20:29                 ` Andrew Morton
2008-11-29 21:35                   ` Rik van Riel
2008-11-29 21:57                     ` Andrew Morton
2008-11-29 22:07                       ` Rik van Riel