* [PATCH] vmscan: skip freeing memory from zones with lots free
@ 2008-11-28 11:08 Rik van Riel
2008-11-28 11:30 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Rik van Riel @ 2008-11-28 11:08 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, KOSAKI Motohiro, akpm
Skip freeing memory from zones that already have lots of free memory.
If one memory zone has memory that is harder to free, we want to avoid
freeing excessive amounts of memory from the other zones, if only because
pageout IO from those other zones can slow down page freeing from the
problem zone.
This is similar to the check already done by kswapd in balance_pgdat().
Signed-off-by: Rik van Riel <riel@redhat.com>
---
Kosaki-san, this should address point (3) from your list.
mm/vmscan.c | 3 +++
1 file changed, 3 insertions(+)
Index: linux-2.6.28-rc5/mm/vmscan.c
===================================================================
--- linux-2.6.28-rc5.orig/mm/vmscan.c 2008-11-28 05:53:56.000000000 -0500
+++ linux-2.6.28-rc5/mm/vmscan.c 2008-11-28 06:05:29.000000000 -0500
@@ -1510,6 +1510,9 @@ static unsigned long shrink_zones(int pr
 			if (zone_is_all_unreclaimable(zone) &&
 						priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
+			if (zone_watermark_ok(zone, sc->order,
+					4*zone->pages_high, high_zoneidx, 0))
+				continue;	/* Lots free already */
 			sc->all_unreclaimable = 0;
 		} else {
 			/*
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-28 11:08 [PATCH] vmscan: skip freeing memory from zones with lots free Rik van Riel
@ 2008-11-28 11:30 ` Peter Zijlstra
  2008-11-28 22:43 ` Johannes Weiner
  2008-11-29  7:19 ` Andrew Morton
  2 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2008-11-28 11:30 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro, akpm

On Fri, 2008-11-28 at 06:08 -0500, Rik van Riel wrote:
> Skip freeing memory from zones that already have lots of free memory.
> If one memory zone has harder to free memory, we want to avoid freeing
> excessive amounts of memory from other zones, if only because pageout
> IO from the other zones can slow down page freeing from the problem zone.
>
> This is similar to the check already done by kswapd in balance_pgdat().
>
> Signed-off-by: Rik van Riel <riel@redhat.com>

Makes sense,

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-28 11:08 [PATCH] vmscan: skip freeing memory from zones with lots free Rik van Riel
  2008-11-28 11:30 ` Peter Zijlstra
@ 2008-11-28 22:43 ` Johannes Weiner
  2008-11-29  7:19 ` Andrew Morton
  2 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2008-11-28 22:43 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro, akpm

On Fri, Nov 28, 2008 at 06:08:03AM -0500, Rik van Riel wrote:
> Skip freeing memory from zones that already have lots of free memory.
> If one memory zone has harder to free memory, we want to avoid freeing
> excessive amounts of memory from other zones, if only because pageout
> IO from the other zones can slow down page freeing from the problem zone.
>
> This is similar to the check already done by kswapd in balance_pgdat().
>
> Signed-off-by: Rik van Riel <riel@redhat.com>

Acked-by: Johannes Weiner <hannes@saeurebad.de>
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-28 11:08 [PATCH] vmscan: skip freeing memory from zones with lots free Rik van Riel
  2008-11-28 11:30 ` Peter Zijlstra
  2008-11-28 22:43 ` Johannes Weiner
@ 2008-11-29  7:19 ` Andrew Morton
  2008-11-29 10:55 ` KOSAKI Motohiro
  2008-11-29 16:47 ` Rik van Riel
  2 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2008-11-29 7:19 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

On Fri, 28 Nov 2008 06:08:03 -0500 Rik van Riel <riel@redhat.com> wrote:

> Skip freeing memory from zones that already have lots of free memory.
> If one memory zone has harder to free memory, we want to avoid freeing
> excessive amounts of memory from other zones, if only because pageout
> IO from the other zones can slow down page freeing from the problem zone.
>
> This is similar to the check already done by kswapd in balance_pgdat().
>
> @@ -1510,6 +1510,9 @@ static unsigned long shrink_zones(int pr
>  			if (zone_is_all_unreclaimable(zone) &&
>  						priority != DEF_PRIORITY)
>  				continue;	/* Let kswapd poll it */
> +			if (zone_watermark_ok(zone, sc->order,
> +					4*zone->pages_high, high_zoneidx, 0))
> +				continue;	/* Lots free already */
>  			sc->all_unreclaimable = 0;
>  		} else {
>  			/*

We already tried this, or something very similar in effect, I think...

commit 26e4931632352e3c95a61edac22d12ebb72038fe
Author: akpm <akpm>
Date:   Sun Sep 8 19:21:55 2002 +0000

    [PATCH] refill the inactive list more quickly

    Fix a problem noticed by Ed Tomlinson: under shifting workloads the
    shrink_zone() logic will refill the inactive load too slowly.

    Bale out of the zone scan when we've reclaimed enough pages.  Fixes a
    rarely-occurring problem wherein refill_inactive_zone() ends up
    shuffling 100,000 pages and generally goes silly.

    This needs to be revisited - we should go on and rebalance the lower
    zones even if we reclaimed enough pages from highmem.

Then it was reverted a year or two later:

commit 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3
Author: akpm <akpm>
Date:   Fri Mar 12 16:23:50 2004 +0000

    [PATCH] vmscan: zone balancing fix

    We currently have a problem with the balancing of reclaim between zones:
    much more reclaim happens against highmem than against lowmem.

    This patch partially fixes this by changing the direct reclaim path so it
    does not bale out of the zone walk after having reclaimed sufficient pages
    from highmem: go on to reclaim from lowmem regardless of how many pages we
    reclaimed from lowmem.

My changelog does not adequately explain the reasons.

But we don't want to rediscover these reasons in early 2010 :(  Some
trawling of the linux-mm and lkml archives around those dates might help
us avoid a mistake here.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29  7:19 ` Andrew Morton
@ 2008-11-29 10:55 ` KOSAKI Motohiro
  2008-12-08 13:00   ` KOSAKI Motohiro
  2008-11-29 16:47 ` Rik van Riel
  1 sibling, 1 reply; 23+ messages in thread
From: KOSAKI Motohiro @ 2008-11-29 10:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: kosaki.motohiro, Rik van Riel, linux-mm, linux-kernel

> We already tried this, or something very similar in effect, I think...
>
> commit 26e4931632352e3c95a61edac22d12ebb72038fe
>     [PATCH] refill the inactive list more quickly
>
> Then it was reverted a year or two later:
>
> commit 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3
>     [PATCH] vmscan: zone balancing fix
>
> My changelog does not adequately explain the reasons.
>
> But we don't want to rediscover these reasons in early 2010 :(  Some
> trawling of the linux-mm and lkml archives around those dates might help
> us avoid a mistake here.

I'll dig through the past discussion archives.
Andrew, please hold off on merging this patch for a while.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 10:55 ` KOSAKI Motohiro
@ 2008-12-08 13:00 ` KOSAKI Motohiro
  2008-12-08 13:03   ` KOSAKI Motohiro
  0 siblings, 1 reply; 23+ messages in thread
From: KOSAKI Motohiro @ 2008-12-08 13:00 UTC (permalink / raw)
  To: Andrew Morton, Rik van Riel
  Cc: kosaki.motohiro, linux-mm, linux-kernel, Peter Zijlstra,
	Johannes Weiner

Hi

Very sorry for the late response.

> > My changelog does not adequately explain the reasons.
> >
> > But we don't want to rediscover these reasons in early 2010 :(  Some
> > trawling of the linux-mm and lkml archives around those dates might
> > help us avoid a mistake here.
>
> I'll dig through the past discussion archives.
> Andrew, please hold off on merging this patch for a while.

I searched the past archives patiently, but unfortunately I could not
find the reason at all.  The reverting fix appeared suddenly in
2.6.3-mm3:

	http://marc.info/?l=linux-kernel&m=107749956707874&w=2

but I cannot find any related discussion in the nearby months, so I
guess akpm found the problem by himself.

Therefore, instead, I'd like to demonstrate the safety of Rik's patch
by measurement.


1. Does this patch break reclaim balancing?

Ran the FFSB benchmark with the following configuration:

---------------------------------------------------
directio=0
time=300

[filesystem0]
	location=/mnt/sdb1/kosaki/ffsb
	num_files=20
	num_dirs=10
	max_filesize=91534338
	min_filesize=65535
[end0]

[threadgroup0]
	num_threads=10
	write_size=2816
	write_blocksize=4096
	read_size=2816
	read_blocksize=4096
	create_weight=100
	write_weight=30
	read_weight=100
[end0]
--------------------------------------------------------

<without patch>
pgscan_kswapd_dma 10624
pgscan_kswapd_normal 20640
	-> normal/dma ratio  20640 / 10624 = 1.9
pgscan_direct_dma 576
pgscan_direct_normal 2528
	-> normal/dma ratio  2528 / 576 = 4.38

<with patch>
pgscan_kswapd_dma 21824
pgscan_kswapd_normal 47424
	-> normal/dma ratio  47424 / 21824 = 2.17
pgscan_direct_dma 1632
pgscan_direct_normal 6912
	-> normal/dma ratio  6912 / 1632 = 4.23

The reason is simple: this patch only makes a difference in the
following two cases.

 1) Another process freed a large amount of memory while this one was
    in direct reclaim.
 2) Another process reclaimed a large amount of memory while this one
    was in direct reclaim.

In other words, its logic doesn't trigger on typical workloads at all.


2. Measured the most beneficial case (i.e., many threads going into
   swap-out concurrently).

$ ./hackbench 140 process 300
(ten measurements)

	2.6.28-rc6	+ this patch
			+ bail-out
	--------------------------
	 62.514		 29.270
	225.698		 30.209
	114.694		 20.881
	179.108		 19.795
	111.080		 19.563
	189.796		 19.226
	114.124		 13.330
	112.999		 10.280
	227.842		  9.669
	 81.869		 10.113

avg	141.972		 18.234
std	 55.937		  7.099
min	 62.514		  9.669
max	227.842		 30.209

	-> about a 10x improvement


3. Measured the worst case (many threads, no swap-out), using the
   following three runs (ten measurements each):

$ ./hackbench 125 process 3000
$ ./hackbench 130 process 3000
$ ./hackbench 135 process 3000

	2.6.28-rc6			+ skip freeing memory
	+ evict streaming first		  (rvr bail out)
	+ kosaki bail out improve

nr_group    125      130      135      125      130      135
------------------------------------------------------------
	 67.302   68.269   77.161   89.450   75.328  173.437
	 72.616   72.712   79.060   69.843   74.145   76.217
	 72.475   75.712   77.735   73.531   76.426   85.527
	 69.229   73.062   78.814   72.472   74.891   75.129
	 71.551   74.392   78.564   69.423   73.517   75.544
	 69.227   74.310   78.837   72.543   75.347   79.237
	 70.759   75.256   76.600   70.477   77.848   90.981
	 69.966   76.001   78.464   71.792   78.722   92.048
	 69.068   75.218   80.321   71.313   74.958   78.113
	 72.057   77.151   79.068   72.306   75.644   79.888

avg	 70.425   74.208   78.462   73.315   75.683   90.612
std	  1.665    2.348    1.007    5.516    1.514   28.218
min	 67.302   68.269   76.600   69.423   73.517   75.129
max	 72.616   77.151   80.321   89.450   78.722  173.437

	-> 1 - 10% slowdown, because zone_watermark_ok() is a somewhat
	   slow function.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-12-08 13:00 ` KOSAKI Motohiro
@ 2008-12-08 13:03 ` KOSAKI Motohiro
  2008-12-08 17:48   ` KOSAKI Motohiro
  2008-12-08 20:25   ` Rik van Riel
  0 siblings, 2 replies; 23+ messages in thread
From: KOSAKI Motohiro @ 2008-12-08 13:03 UTC (permalink / raw)
  To: Andrew Morton, Rik van Riel
  Cc: kosaki.motohiro, linux-mm, linux-kernel, Peter Zijlstra,
	Johannes Weiner, Christoph Lameter, Nick Piggin

> avg	 70.425   74.208   78.462   73.315   75.683   90.612
> std	  1.665    2.348    1.007    5.516    1.514   28.218
> min	 67.302   68.269   76.600   69.423   73.517   75.129
> max	 72.616   77.151   80.321   89.450   78.722  173.437
>
> 	-> 1 - 10% slowdown, because zone_watermark_ok() is a somewhat
> 	   slow function.

Next, I'd like to talk about why I think the cause is
zone_watermark_ok().

I have a zone_watermark_ok() improvement patch.  The following patch
was developed for another issue; however, I observed that it also
resolves the performance regression seen with Rik's patch.

<with following patch>

	2.6.28-rc6			+ skip freeing memory
	+ evict streaming first		  (rvr bail out)
	+ kosaki bail out improve	  + this patch

nr_group    125      130      135      125      130      135
------------------------------------------------------------
	 67.302   68.269   77.161   68.534   75.733   79.416
	 72.616   72.712   79.060   70.868   74.264   76.858
	 72.475   75.712   77.735   73.215   80.278   81.033
	 69.229   73.062   78.814   70.780   72.518   75.764
	 71.551   74.392   78.564   69.631   77.252   77.131
	 69.227   74.310   78.837   72.325   72.723   79.274
	 70.759   75.256   76.600   70.328   74.046   75.783
	 69.966   76.001   78.464   69.014   72.566   77.236
	 69.068   75.218   80.321   68.373   76.447   76.015
	 72.057   77.151   79.068   74.403   72.794   75.872

avg	 70.425   74.208   78.462   70.747   74.862   77.438
std	  1.665    2.348    1.007    1.921    2.428    1.752
min	 67.302   68.269   76.600   68.373   72.518   75.764
max	 72.616   77.151   80.321   74.403   80.278   81.033

	-> OK, the performance regression disappeared.


===========================
Subject: [PATCH] mm: zone_watermark_ok() doesn't require small fragment blocks

Currently, zone_watermark_ok() has somewhat unfair logic.  Example:

Called as zone_watermark_ok(zone, 2, pages_min, 0, 0);
	pages_min = 64
	free pages = 80

case A.

	order	nr_pages
	--------------------
	2	5
	1	10
	0	30

	-> zone_watermark_ok() returns 1

case B.

	order	nr_pages
	--------------------
	3	10
	2	0
	1	0
	0	0

	-> zone_watermark_ok() returns 0

In other words, the current zone_watermark_ok() tends to prefer many
small fragment blocks.  If splitting a large block into small blocks in
the buddy allocator were slow, the above logic would be reasonable.
However, that assumption does not hold at all: the Linux buddy
allocator can split large blocks efficiently.

As for call frequency, zone_watermark_ok() is called from
get_page_from_freelist() every time, and get_page_from_freelist() is
one of the hottest fast paths.  In general, a fast path should

 - run as fast as possible when the system has plenty of memory;
 - when the system doesn't have enough memory, fast processing matters
   less, but OOM must be avoided as far as possible.

Unfortunately, the following loop has the reverse performance tendency:

	for (o = 0; o < order; o++) {
		free_pages -= z->free_area[o].nr_free << o;
		min >>= 1;
		if (free_pages <= min)
			return 0;
	}

If the system doesn't have enough memory, the loop above bails out
early.  But if the system has plenty of memory, the loop runs the full
number of order iterations.

This patch changes the zone_watermark_ok() logic to prefer large
contiguous blocks.

Result:

test machine:
	CPU: ia64 x 8
	MEM: 8GB

benchmark:
	$ tbench 8  (three measurements)

tbench runs for about 600 seconds.
alloc_pages() and zone_watermark_ok() are called about 15,000,000 times.

	2.6.28-rc6			this patch
	throughput	max-latency	throughput	max-latency
	---------------------------------------------------------
	1480.92		20.896		1,490.27	19.606
	1483.94		19.202		1,482.86	21.082
	1478.93		22.215		1,490.57	23.493

avg	1,481.26	20.771		1,487.90	21.394
std	2.06		1.233		3.56		1.602
min	1,478.93	19.202		1,482.86	19.606
max	1,483.94	22.215		1,490.57	23.493

Throughput improves by about 5 MB/sec, which is beyond the measurement
noise.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Nick Piggin <npiggin@suse.de>
CC: Christoph Lameter <cl@linux-foundation.org>
---
 mm/page_alloc.c |   16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1227,7 +1227,7 @@ static inline int should_fail_alloc_page
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		      int classzone_idx, int alloc_flags)
 {
-	/* free_pages my go negative - that's OK */
+	/* free_pages may go negative - that's OK */
 	long min = mark;
 	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
 	int o;
@@ -1239,17 +1239,13 @@ int zone_watermark_ok(struct zone *z, in
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
 		return 0;

-	for (o = 0; o < order; o++) {
-		/* At the next order, this order's pages become unavailable */
-		free_pages -= z->free_area[o].nr_free << o;
-		/* Require fewer higher order pages to be free */
-		min >>= 1;
-
-		if (free_pages <= min)
-			return 0;
+	for (o = order; o < MAX_ORDER; o++) {
+		if (z->free_area[o].nr_free)
+			return 1;
 	}
-	return 1;
+
+	return 0;
 }
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-12-08 13:03 ` KOSAKI Motohiro
@ 2008-12-08 17:48 ` KOSAKI Motohiro
  2008-12-10  5:07   ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: KOSAKI Motohiro @ 2008-12-08 17:48 UTC (permalink / raw)
  To: Andrew Morton, Rik van Riel
  Cc: kosaki.motohiro, linux-mm, linux-kernel, Peter Zijlstra,
	Johannes Weiner, Christoph Lameter, Nick Piggin

> Example:
>
> Called as zone_watermark_ok(zone, 2, pages_min, 0, 0);
> 	pages_min = 64
> 	free pages = 80
>
> case A.
>
> 	order	nr_pages
> 	--------------------
> 	2	5
> 	1	10
> 	0	30
>
> 	-> zone_watermark_ok() returns 1
>
> case B.
>
> 	order	nr_pages
> 	--------------------
> 	3	10
> 	2	0
> 	1	0
> 	0	0
>
> 	-> zone_watermark_ok() returns 0

Doh!
This example is obviously buggy.

I guess Mr. KOSAKI is very silly, or an idiot.
I recommend he get a feathery blanket and some good sleep, instead of
more black, black coffee ;-)

...but the measurement results below are still valid.

> This patch changes the zone_watermark_ok() logic to prefer large
> contiguous blocks.
>
> Result:
>
> test machine:
> 	CPU: ia64 x 8
> 	MEM: 8GB
>
> benchmark:
> 	$ tbench 8  (three measurements)
>
> tbench runs for about 600 seconds.
> alloc_pages() and zone_watermark_ok() are called about 15,000,000 times.
>
> 	2.6.28-rc6			this patch
> 	throughput	max-latency	throughput	max-latency
> 	---------------------------------------------------------
> 	1480.92		20.896		1,490.27	19.606
> 	1483.94		19.202		1,482.86	21.082
> 	1478.93		22.215		1,490.57	23.493
>
> avg	1,481.26	20.771		1,487.90	21.394
> std	2.06		1.233		3.56		1.602
> min	1,478.93	19.202		1,482.86	19.606
> max	1,483.94	22.215		1,490.57	23.493
>
> Throughput improves by about 5 MB/sec, which is beyond the measurement
> noise.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-12-08 17:48 ` KOSAKI Motohiro
@ 2008-12-10  5:07 ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2008-12-10 5:07 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, linux-mm, linux-kernel,
	Peter Zijlstra, Johannes Weiner, Christoph Lameter

On Tue, Dec 09, 2008 at 02:48:40AM +0900, KOSAKI Motohiro wrote:
> Doh!
> This example is obviously buggy.
>
> I guess Mr. KOSAKI is very silly, or an idiot.
> I recommend he get a feathery blanket and some good sleep, instead of
> more black, black coffee ;-)

:) No, actually it is always good to have people reviewing existing
code, so thank you for that.

> ...but the measurement results below are still valid.

And it is an interesting result.  As far as I can see, your patch
changes zone_watermark_ok so that it avoids some watermark checking for
higher order page blocks?

I am surprised it makes a noticeable difference in performance; however,
such a change would be slightly detrimental to atomic and "emergency"
allocations of higher order pages, wouldn't it?

It would be interesting to know where the higher order allocations are
coming from.  Do packets over the loopback device still do higher order
allocations?  If so, I suspect this is a bit artificial.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-12-08 13:03 ` KOSAKI Motohiro
  2008-12-08 17:48 ` KOSAKI Motohiro
@ 2008-12-08 20:25 ` Rik van Riel
  2008-12-10  5:09   ` Nick Piggin
  2008-12-12  5:50   ` KOSAKI Motohiro
  1 sibling, 2 replies; 23+ messages in thread
From: Rik van Riel @ 2008-12-08 20:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, linux-mm, linux-kernel, Peter Zijlstra,
	Johannes Weiner, Christoph Lameter, Nick Piggin

KOSAKI Motohiro wrote:

> +	for (o = order; o < MAX_ORDER; o++) {
> +		if (z->free_area[o].nr_free)
> +			return 1;

Since page breakup and coalescing always manipulates .nr_free,
I wonder if it would make sense to pack the nr_free variables
in their own cache line(s), so we have fewer cache misses when
going through zone_watermark_ok() ?

That would end up looking something like this:

(whitespace mangled because it doesn't make sense to apply just
this thing, anyway)

Index: linux-2.6.28-rc7/include/linux/mmzone.h
===================================================================
--- linux-2.6.28-rc7.orig/include/linux/mmzone.h	2008-12-02 15:04:33.000000000 -0500
+++ linux-2.6.28-rc7/include/linux/mmzone.h	2008-12-08 15:24:25.000000000 -0500
@@ -58,7 +58,6 @@ static inline int get_pageblock_migratet

 struct free_area {
 	struct list_head free_list[MIGRATE_TYPES];
-	unsigned long nr_free;
 };

 struct pglist_data;
@@ -296,6 +295,7 @@ struct zone {
 	seqlock_t span_seqlock;
 #endif
 	struct free_area free_area[MAX_ORDER];
+	unsigned long nr_free[MAX_ORDER];

 #ifndef CONFIG_SPARSEMEM
 	/*

--
All rights reversed.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-12-08 20:25 ` Rik van Riel
@ 2008-12-10  5:09 ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2008-12-10 5:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Andrew Morton, linux-mm, linux-kernel,
	Peter Zijlstra, Johannes Weiner, Christoph Lameter

On Mon, Dec 08, 2008 at 03:25:10PM -0500, Rik van Riel wrote:
> KOSAKI Motohiro wrote:
>
> > +	for (o = order; o < MAX_ORDER; o++) {
> > +		if (z->free_area[o].nr_free)
> > +			return 1;
>
> Since page breakup and coalescing always manipulates .nr_free,
> I wonder if it would make sense to pack the nr_free variables
> in their own cache line(s), so we have fewer cache misses when
> going through zone_watermark_ok() ?

For order-0 allocations, they should not be touched at all.  For higher
order allocations in performance critical paths, we should try to fix
those to use order-0 ;)
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-12-08 20:25 ` Rik van Riel
  2008-12-10  5:09 ` Nick Piggin
@ 2008-12-12  5:50 ` KOSAKI Motohiro
  0 siblings, 0 replies; 23+ messages in thread
From: KOSAKI Motohiro @ 2008-12-12 5:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, Andrew Morton, linux-mm, linux-kernel,
	Peter Zijlstra, Johannes Weiner, Christoph Lameter, Nick Piggin

> Index: linux-2.6.28-rc7/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.28-rc7.orig/include/linux/mmzone.h	2008-12-02 15:04:33.000000000 -0500
> +++ linux-2.6.28-rc7/include/linux/mmzone.h	2008-12-08 15:24:25.000000000 -0500
> @@ -58,7 +58,6 @@ static inline int get_pageblock_migratet
>
>  struct free_area {
>  	struct list_head free_list[MIGRATE_TYPES];
> -	unsigned long nr_free;
>  };
>
>  struct pglist_data;
> @@ -296,6 +295,7 @@ struct zone {
>  	seqlock_t span_seqlock;
>  #endif
>  	struct free_area free_area[MAX_ORDER];
> +	unsigned long nr_free[MAX_ORDER];
>
>  #ifndef CONFIG_SPARSEMEM
>  	/*

Measurement result:

% tbench 8

	2.6.28-rc6 + rvr		free area restructure
	throughput	max latency	throughput	max latency
	------------------------------------------------------------
	1480.920	20.896		742.470		 30.401
	1483.940	19.202		791.648		635.623
	1478.930	22.215		733.433		 92.515

avg	1481.263	20.771		755.850		252.846
std	   2.060	 1.233		 25.580		271.849
min	1478.930	19.202		733.433		 30.401
max	1483.940	22.215		791.648		635.623

I think Nick is right.  I'll drop this idea.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free 2008-11-29 7:19 ` Andrew Morton 2008-11-29 10:55 ` KOSAKI Motohiro @ 2008-11-29 16:47 ` Rik van Riel 2008-11-29 17:45 ` Andrew Morton 1 sibling, 1 reply; 23+ messages in thread From: Rik van Riel @ 2008-11-29 16:47 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro Andrew Morton wrote: >> Index: linux-2.6.28-rc5/mm/vmscan.c >> =================================================================== >> --- linux-2.6.28-rc5.orig/mm/vmscan.c 2008-11-28 05:53:56.000000000 -0500 >> +++ linux-2.6.28-rc5/mm/vmscan.c 2008-11-28 06:05:29.000000000 -0500 >> @@ -1510,6 +1510,9 @@ static unsigned long shrink_zones(int pr >> if (zone_is_all_unreclaimable(zone) && >> priority != DEF_PRIORITY) >> continue; /* Let kswapd poll it */ >> + if (zone_watermark_ok(zone, sc->order, >> + 4*zone->pages_high, high_zoneidx, 0)) >> + continue; /* Lots free already */ >> sc->all_unreclaimable = 0; >> } else { >> /* > > We already tried this, or something very similar in effect, I think... Yes, we have a check just like this in balance_pgdat(). It's been there forever with no ill effect. > commit 26e4931632352e3c95a61edac22d12ebb72038fe > Author: akpm <akpm> > Date: Sun Sep 8 19:21:55 2002 +0000 > > [PATCH] refill the inactive list more quickly > > Fix a problem noticed by Ed Tomlinson: under shifting workloads the > shrink_zone() logic will refill the inactive load too slowly. > > Bale out of the zone scan when we've reclaimed enough pages. Fixes a > rarely-occurring problem wherein refill_inactive_zone() ends up > shuffling 100,000 pages and generally goes silly. This is not a bale out, this is a "skip zones that have way too many free pages already". Kswapd has been doing this for years already. -- All rights reversed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 16:47   ` Rik van Riel
@ 2008-11-29 17:45     ` Andrew Morton
  2008-11-29 17:58       ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-29 17:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

On Sat, 29 Nov 2008 11:47:25 -0500 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
>
> >> Index: linux-2.6.28-rc5/mm/vmscan.c
> >> ===================================================================
> >> --- linux-2.6.28-rc5.orig/mm/vmscan.c	2008-11-28 05:53:56.000000000 -0500
> >> +++ linux-2.6.28-rc5/mm/vmscan.c	2008-11-28 06:05:29.000000000 -0500
> >> @@ -1510,6 +1510,9 @@ static unsigned long shrink_zones(int pr
> >>  		if (zone_is_all_unreclaimable(zone) &&
> >>  					priority != DEF_PRIORITY)
> >>  			continue;	/* Let kswapd poll it */
> >> +		if (zone_watermark_ok(zone, sc->order,
> >> +					4*zone->pages_high, high_zoneidx, 0))
> >> +			continue;	/* Lots free already */
> >>  		sc->all_unreclaimable = 0;
> >>  	} else {
> >>  		/*
> >
> > We already tried this, or something very similar in effect, I think...
>
> Yes, we have a check just like this in balance_pgdat().
>
> It's been there forever with no ill effect.

This patch affects direct reclaim as well as kswapd.

> > commit 26e4931632352e3c95a61edac22d12ebb72038fe
> > Author: akpm <akpm>
> > Date:   Sun Sep 8 19:21:55 2002 +0000
> >
> >     [PATCH] refill the inactive list more quickly
> >
> >     Fix a problem noticed by Ed Tomlinson:  under shifting workloads the
> >     shrink_zone() logic will refill the inactive load too slowly.
> >
> >     Bale out of the zone scan when we've reclaimed enough pages.  Fixes a
> >     rarely-occurring problem wherein refill_inactive_zone() ends up
> >     shuffling 100,000 pages and generally goes silly.
>
> This is not a bale out, this is a "skip zones that have way
> too many free pages already".

It is similar in effect.

Will this new patch reintroduce the problem which
26e4931632352e3c95a61edac22d12ebb72038fe fixed?
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 17:45     ` Andrew Morton
@ 2008-11-29 17:58       ` Rik van Riel
  2008-11-29 18:26         ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2008-11-29 17:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

Andrew Morton wrote:
> On Sat, 29 Nov 2008 11:47:25 -0500 Rik van Riel <riel@redhat.com> wrote:
>
>> Andrew Morton wrote:
>>
>>>> Index: linux-2.6.28-rc5/mm/vmscan.c
>>>> ===================================================================
>>>> --- linux-2.6.28-rc5.orig/mm/vmscan.c	2008-11-28 05:53:56.000000000 -0500
>>>> +++ linux-2.6.28-rc5/mm/vmscan.c	2008-11-28 06:05:29.000000000 -0500
>>>> @@ -1510,6 +1510,9 @@ static unsigned long shrink_zones(int pr
>>>>  		if (zone_is_all_unreclaimable(zone) &&
>>>>  					priority != DEF_PRIORITY)
>>>>  			continue;	/* Let kswapd poll it */
>>>> +		if (zone_watermark_ok(zone, sc->order,
>>>> +					4*zone->pages_high, high_zoneidx, 0))
>>>> +			continue;	/* Lots free already */
>>>>  		sc->all_unreclaimable = 0;
>>>>  	} else {
>>>>  		/*
>>> We already tried this, or something very similar in effect, I think...
>> Yes, we have a check just like this in balance_pgdat().
>>
>> It's been there forever with no ill effect.
>
> This patch affects direct reclaim as well as kswapd.

No, kswapd calls shrink_zone directly from balance_pgdat,
it does not go through shrink_zones.

>>> commit 26e4931632352e3c95a61edac22d12ebb72038fe
>>> Author: akpm <akpm>
>>> Date:   Sun Sep 8 19:21:55 2002 +0000
>>>
>>>     [PATCH] refill the inactive list more quickly
>>>
>>>     Fix a problem noticed by Ed Tomlinson:  under shifting workloads the
>>>     shrink_zone() logic will refill the inactive load too slowly.
>>>
>>>     Bale out of the zone scan when we've reclaimed enough pages.  Fixes a
>>>     rarely-occurring problem wherein refill_inactive_zone() ends up
>>>     shuffling 100,000 pages and generally goes silly.
>> This is not a bale out, this is a "skip zones that have way
>> too many free pages already".
>
> It is similar in effect.
>
> Will this new patch reintroduce the problem which
> 26e4931632352e3c95a61edac22d12ebb72038fe fixed?

Googling on 26e4931632352e3c95a61edac22d12ebb72038fe only finds
your emails with that commit id in it - which git tree do I
need to search to get that changeset?

--
All rights reversed.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 17:58       ` Rik van Riel
@ 2008-11-29 18:26         ` Andrew Morton
  2008-11-29 18:41           ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-29 18:26 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

On Sat, 29 Nov 2008 12:58:32 -0500 Rik van Riel <riel@redhat.com> wrote:

> > Will this new patch reintroduce the problem which
> > 26e4931632352e3c95a61edac22d12ebb72038fe fixed?
>
> Googling on 26e4931632352e3c95a61edac22d12ebb72038fe only finds
> your emails with that commit id in it - which git tree do I
> need to search to get that changeset?

It's the historical git tree.  All the pre-2.6.12 history which was
migrated from bitkeeper.

git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/old-2.6-bkcvs.git

Spending a couple of fun hours reading `git-log mm/vmscan.c' is pretty
instructive.  For some reason that command generates rather a lot of
unrelated changelog info which needs to be manually skipped over.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 18:26         ` Andrew Morton
@ 2008-11-29 18:41           ` Rik van Riel
  2008-11-29 18:51             ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2008-11-29 18:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

Andrew Morton wrote:
> On Sat, 29 Nov 2008 12:58:32 -0500 Rik van Riel <riel@redhat.com> wrote:
>
>>> Will this new patch reintroduce the problem which
>>> 26e4931632352e3c95a61edac22d12ebb72038fe fixed?

No, that problem is already taken care of by the fact that
active pages always get deactivated in the current VM,
regardless of whether or not they were referenced.

> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/old-2.6-bkcvs.git
>
> Spending a couple of fun hours reading `git-log mm/vmscan.c' is pretty
> instructive.  For some reason that command generates rather a lot of
> unrelated changelog info which needs to be manually skipped over.

Will do.  Thank you for the pointer.

(and not sure why google wouldn't find it - it finds other
git changesets...)

--
All rights reversed.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 18:41           ` Rik van Riel
@ 2008-11-29 18:51             ` Andrew Morton
  2008-11-29 18:59               ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-29 18:51 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

On Sat, 29 Nov 2008 13:41:34 -0500 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> > On Sat, 29 Nov 2008 12:58:32 -0500 Rik van Riel <riel@redhat.com> wrote:
> >
> >>> Will this new patch reintroduce the problem which
> >>> 26e4931632352e3c95a61edac22d12ebb72038fe fixed?
>
> No, that problem is already taken care of by the fact that
> active pages always get deactivated in the current VM,
> regardless of whether or not they were referenced.

err, sorry, that was the wrong commit.
26e4931632352e3c95a61edac22d12ebb72038fe _introduced_ the problem, as
predicted in the changelog.

265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 later fixed it up.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 18:51             ` Andrew Morton
@ 2008-11-29 18:59               ` Rik van Riel
  2008-11-29 20:29                 ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2008-11-29 18:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

Andrew Morton wrote:
> On Sat, 29 Nov 2008 13:41:34 -0500 Rik van Riel <riel@redhat.com> wrote:
>
>> Andrew Morton wrote:
>>> On Sat, 29 Nov 2008 12:58:32 -0500 Rik van Riel <riel@redhat.com> wrote:
>>>
>>>>> Will this new patch reintroduce the problem which
>>>>> 26e4931632352e3c95a61edac22d12ebb72038fe fixed?
>> No, that problem is already taken care of by the fact that
>> active pages always get deactivated in the current VM,
>> regardless of whether or not they were referenced.
>
> err, sorry, that was the wrong commit.
> 26e4931632352e3c95a61edac22d12ebb72038fe _introduced_ the problem, as
> predicted in the changelog.
>
> 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 later fixed it up.

The patch I sent in this thread does not do any baling out,
it only skips zones where the number of free pages is more
than 4 times zone->pages_high.

Equal pressure is still applied to the other zones.

This should not be a problem since we do not enter direct
reclaim unless the free pages in every zone in our zonelist
are below zone->pages_low.

Zone skipping is only done by tasks that have been in the
direct reclaim code for a long time.

--
All rights reversed.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 18:59               ` Rik van Riel
@ 2008-11-29 20:29                 ` Andrew Morton
  2008-11-29 21:35                   ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-29 20:29 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

On Sat, 29 Nov 2008 13:59:21 -0500 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> > On Sat, 29 Nov 2008 13:41:34 -0500 Rik van Riel <riel@redhat.com> wrote:
> >
> >> Andrew Morton wrote:
> >>> On Sat, 29 Nov 2008 12:58:32 -0500 Rik van Riel <riel@redhat.com> wrote:
> >>>
> >>>>> Will this new patch reintroduce the problem which
> >>>>> 26e4931632352e3c95a61edac22d12ebb72038fe fixed?
> >> No, that problem is already taken care of by the fact that
> >> active pages always get deactivated in the current VM,
> >> regardless of whether or not they were referenced.
> >
> > err, sorry, that was the wrong commit.
> > 26e4931632352e3c95a61edac22d12ebb72038fe _introduced_ the problem, as
> > predicted in the changelog.
> >
> > 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 later fixed it up.
>
> The patch I sent in this thread does not do any baling out,
> it only skips zones where the number of free pages is more
> than 4 times zone->pages_high.

But that will have the same effect as baling out.  Moreso, in fact.

> Equal pressure is still applied to the other zones.
>
> This should not be a problem since we do not enter direct
> reclaim unless the free pages in every zone in our zonelist
> are below zone->pages_low.
>
> Zone skipping is only done by tasks that have been in the
> direct reclaim code for a long time.

From 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3:

    We currently have a problem with the balancing of reclaim
    between zones: much more reclaim happens against highmem than
    against lowmem.

This problem will be reintroduced, will it not?
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 20:29                 ` Andrew Morton
@ 2008-11-29 21:35                   ` Rik van Riel
  2008-11-29 21:57                     ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2008-11-29 21:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

Andrew Morton wrote:
> On Sat, 29 Nov 2008 13:59:21 -0500 Rik van Riel <riel@redhat.com> wrote:
>
>> Andrew Morton wrote:
>>> On Sat, 29 Nov 2008 13:41:34 -0500 Rik van Riel <riel@redhat.com> wrote:
>>>
>>>> Andrew Morton wrote:
>>>>> On Sat, 29 Nov 2008 12:58:32 -0500 Rik van Riel <riel@redhat.com> wrote:
>>>>>
>>>>>>> Will this new patch reintroduce the problem which
>>>>>>> 26e4931632352e3c95a61edac22d12ebb72038fe fixed?
>>>> No, that problem is already taken care of by the fact that
>>>> active pages always get deactivated in the current VM,
>>>> regardless of whether or not they were referenced.
>>> err, sorry, that was the wrong commit.
>>> 26e4931632352e3c95a61edac22d12ebb72038fe _introduced_ the problem, as
>>> predicted in the changelog.
>>>
>>> 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 later fixed it up.
>> The patch I sent in this thread does not do any baling out,
>> it only skips zones where the number of free pages is more
>> than 4 times zone->pages_high.
>
> But that will have the same effect as baling out.  Moreso, in fact.

Kswapd already does the same in balance_pgdat.

Unequal pressure is sometimes desired, because allocation
pressure is not equal between zones.  Having lots of
lowmem allocations should not lead to gigabytes of swapped
out highmem.  A numactl pinned application should not cause
memory on other NUMA nodes to be swapped out.

Equal pressure between the zones makes sense when allocation
pressure is similar.

When allocation pressure is different, we have a choice
between evicting potentially useful data from memory or
applying uneven pressure on zones.

>> Equal pressure is still applied to the other zones.
>>
>> This should not be a problem since we do not enter direct
>> reclaim unless the free pages in every zone in our zonelist
>> are below zone->pages_low.
>>
>> Zone skipping is only done by tasks that have been in the
>> direct reclaim code for a long time.
>
> From 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3:
>
>     We currently have a problem with the balancing of reclaim
>     between zones: much more reclaim happens against highmem than
>     against lowmem.
>
> This problem will be reintroduced, will it not?

We already have that behaviour in balance_pgdat().

We do not do any reclaim on zones higher than the first
zone where the zone_watermark_ok call returns true:

		if (!zone_watermark_ok(zone, order, zone->pages_high,
				       0, 0)) {
			end_zone = i;
			break;
		}

Further down in balance_pgdat(), we skip reclaiming from zones
that have way too much memory free.

			/*
			 * We put equal pressure on every zone, unless one
			 * zone has way too many pages free already.
			 */
			if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
						end_zone, 0))
				shrink_zone(priority, zone, &sc);

All my patch does is add one of these sanity checks to the
direct reclaim path.

--
All rights reversed.
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 21:35                   ` Rik van Riel
@ 2008-11-29 21:57                     ` Andrew Morton
  2008-11-29 22:07                       ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2008-11-29 21:57 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

On Sat, 29 Nov 2008 16:35:45 -0500 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> > On Sat, 29 Nov 2008 13:59:21 -0500 Rik van Riel <riel@redhat.com> wrote:
> >
> >> The patch I sent in this thread does not do any baling out,
> >> it only skips zones where the number of free pages is more
> >> than 4 times zone->pages_high.
> >
> > But that will have the same effect as baling out.  Moreso, in fact.
>
> Kswapd already does the same in balance_pgdat.
>
> Unequal pressure is sometimes desired, because allocation
> pressure is not equal between zones.  Having lots of
> lowmem allocations should not lead to gigabytes of swapped
> out highmem.  A numactl pinned application should not cause
> memory on other NUMA nodes to be swapped out.
>
> Equal pressure between the zones makes sense when allocation
> pressure is similar.
>
> When allocation pressure is different, we have a choice
> between evicting potentially useful data from memory or
> applying uneven pressure on zones.
>
> > From 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3:
> >
> >     We currently have a problem with the balancing of reclaim
> >     between zones: much more reclaim happens against highmem than
> >     against lowmem.
> >
> > This problem will be reintroduced, will it not?
>
> We already have that behaviour in balance_pgdat().

I expect that was the case back in March 2004.
265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 removed the bale-out only for
the direct reclaim path.

> We do not do any reclaim on zones higher than the first
> zone where the zone_watermark_ok call returns true:
>
> 		if (!zone_watermark_ok(zone, order, zone->pages_high,
> 				       0, 0)) {
> 			end_zone = i;
> 			break;
> 		}
>
> Further down in balance_pgdat(), we skip reclaiming from zones
> that have way too much memory free.
>
> 			/*
> 			 * We put equal pressure on every zone, unless one
> 			 * zone has way too many pages free already.
> 			 */
> 			if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
> 						end_zone, 0))
> 				shrink_zone(priority, zone, &sc);
>
> All my patch does is add one of these sanity checks to the
> direct reclaim path.

It's a change in behaviour, not a "sanity check"!

The bottom line here is that we don't fully understand the problem
which 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 fixed, hence we cannot
say whether this proposed change will reintroduce it.

Why did it matter that "much more reclaim happens against highmem than
against lowmem"?  What were the observable effects of this?
* Re: [PATCH] vmscan: skip freeing memory from zones with lots free
  2008-11-29 21:57                     ` Andrew Morton
@ 2008-11-29 22:07                       ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2008-11-29 22:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro

Andrew Morton wrote:

> The bottom line here is that we don't fully understand the problem
> which 265b2b8cac1774f5f30c88e0ab8d0bcf794ef7b3 fixed, hence we cannot
> say whether this proposed change will reintroduce it.
>
> Why did it matter that "much more reclaim happens against highmem than
> against lowmem"?  What were the observable effects of this?

On a 1GB system, with 892MB lowmem and 128MB highmem, it could
lead to the page cache coming mostly from highmem.  This in turn
would mean that lowmem could have hundreds of megabytes of unused
memory, while large files would not get cached in memory.

Baling out early and not putting any memory pressure on a zone
can lead to problems.  It is important that zones with easily
freeable memory get some extra memory freed, so more allocations
go to that zone.

However, we also do not want to go overboard.  Kicking potentially
useful data out of memory or causing unnecessary pageout IO is
harmful too.

Doing some amount of extra reclaim in zones with easily freeable
memory means more memory will get allocated from those zones.
Over time this equalizes pressure between zones.

The patch I sent in limits that extra reclaim (extra allocation
space) in easily freeable zones to 4 * zone->pages_high.  That
gives the zone extra free space for alloc_pages, while limiting
unnecessary pageout IO and eviction of useful data.

I am pretty sure that we do understand the differences between
that 2004 patch and the code we have today.

--
All rights reversed.
end of thread, other threads:[~2008-12-12  5:50 UTC | newest]

Thread overview: 23+ messages
2008-11-28 11:08 [PATCH] vmscan: skip freeing memory from zones with lots free Rik van Riel
2008-11-28 11:30 ` Peter Zijlstra
2008-11-28 22:43 ` Johannes Weiner
2008-11-29  7:19 ` Andrew Morton
2008-11-29 10:55   ` KOSAKI Motohiro
2008-12-08 13:00     ` KOSAKI Motohiro
2008-12-08 13:03       ` KOSAKI Motohiro
2008-12-08 17:48         ` KOSAKI Motohiro
2008-12-10  5:07           ` Nick Piggin
2008-12-08 20:25         ` Rik van Riel
2008-12-10  5:09           ` Nick Piggin
2008-12-12  5:50           ` KOSAKI Motohiro
2008-11-29 16:47   ` Rik van Riel
2008-11-29 17:45     ` Andrew Morton
2008-11-29 17:58       ` Rik van Riel
2008-11-29 18:26         ` Andrew Morton
2008-11-29 18:41           ` Rik van Riel
2008-11-29 18:51             ` Andrew Morton
2008-11-29 18:59               ` Rik van Riel
2008-11-29 20:29                 ` Andrew Morton
2008-11-29 21:35                   ` Rik van Riel
2008-11-29 21:57                     ` Andrew Morton
2008-11-29 22:07                       ` Rik van Riel