* [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
@ 2008-07-01 1:57 Dan Williams
From: Dan Williams @ 2008-07-01 1:57 UTC (permalink / raw)
To: linux-mm, linux-kernel; +Cc: NeilBrown, babydr, mel, clameter, lee.schermerhorn
[-- Attachment #1: Type: text/plain, Size: 2138 bytes --]
Hello,
Prompted by a report from a user I have bisected a performance loss
apparently introduced by commit 54a6eb5c (mm: use two zonelist that are
filtered by GFP mask). The test is simple sequential writes to a 4 disk
raid5 array. Performance should be about 20% greater than 2.6.25 due to
commit 8b3e6cdc (md: introduce get_priority_stripe() to improve raid456
write performance). The sample data below shows sporadic performance
starting at 54a6eb5c. The '+' indicates where I hand applied 8b3e6cdc.
revision    2.6.25.8-fc8  2.6.25.9+  dac1d27b+  18ea7e71+  54a6eb5c+  2.6.26-rc1  2.6.26-rc8
                     138        168        169        167        177         149         144
                     140        168        172        170        109         138         142
                     142        165        169        164        119         138         129
                     144        168        169        171        120         139         135
                     142        165        174        166        165         122         154
MB/s (avg)           141        167        171        168        138         137         141
% change              0%        18%        21%        19%        -2%         -3%          0%
result              base       good       good       good      [bad]         bad         bad
Notable observations:
1/ This problem does not reproduce when ARCH=x86_64, i.e. 2.6.26-rc8 and
54a6eb5c show consistent performance at 170MB/s.
2/ Single drive performance appears to be unaffected
3/ A quick test shows that raid0 performance is also sporadic:
2147483648 bytes (2.1 GB) copied, 7.72408 s, 278 MB/s
2147483648 bytes (2.1 GB) copied, 7.78478 s, 276 MB/s
2147483648 bytes (2.1 GB) copied, 11.0323 s, 195 MB/s
2147483648 bytes (2.1 GB) copied, 8.41244 s, 255 MB/s
2147483648 bytes (2.1 GB) copied, 30.7649 s, 69.8 MB/s
System/Test configuration:
(2) Intel(R) Xeon(R) CPU 5150
mem=1024M
CONFIG_HIGHMEM4G=y (full config attached)
mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean
for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
Neil suggested CONFIG_NOHIGHMEM=y, I will give that a shot tomorrow.
Other suggestions / experiments?
Thanks,
Dan
[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 22024 bytes --]
* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
From: Mel Gorman @ 2008-07-01 8:09 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, linux-kernel, NeilBrown, babydr, cl, lee.schermerhorn, apw

(Christoph's address corrected and Andy's added to cc)

On (30/06/08 18:57), Dan Williams didst pronounce:
> Hello,
>
> Prompted by a report from a user I have bisected a performance loss
> apparently introduced by commit 54a6eb5c (mm: use two zonelist that are
> filtered by GFP mask). The test is simple sequential writes to a 4 disk
> raid5 array. Performance should be about 20% greater than 2.6.25 due to
> commit 8b3e6cdc (md: introduce get_priority_stripe() to improve raid456
> write performance). The sample data below shows sporadic performance
> starting at 54a6eb5c. The '+' indicates where I hand applied 8b3e6cdc.
>
> revision 2.6.25.8-fc8 2.6.25.9+ dac1d27b+ 18ea7e71+ 54a6eb5c+ 2.6.26-rc1 2.6.26-rc8
> 138 168 169 167 177 149 144
> 140 168 172 170 109 138 142
> 142 165 169 164 119 138 129
> 144 168 169 171 120 139 135
> 142 165 174 166 165 122 154
> MB/s (avg) 141 167 171 168 138 137 141
> % change 0% 18% 21% 19% -2% -3% 0%
> result base good good good [bad] bad bad
>

That is not good at all: this patch is not a straightforward revert, and it
is the second time it has come under suspicion.

> Notable observations:
> 1/ This problem does not reproduce when ARCH=x86_64, i.e. 2.6.26-rc8 and
> 54a6eb5c show consistent performance at 170MB/s.

I'm very curious as to why this doesn't affect x86_64. HIGHMEM is one
possibility if GFP_KERNEL is a major factor and it has to scan over the
unusable zone a lot. However, another remote possibility is that many
function calls are more expensive on x86 than on x86_64 (this is a wild
guess based on the registers available). Speculative patch is below.

If 8b3e6cdc is reverted from 2.6.26-rc8, what do the figures look like?
i.e. is the zonelist filtering looking like a performance regression, or is
it just somehow negating the benefits of the raid patch?

> 2/ Single drive performance appears to be unaffected
> 3/ A quick test shows that raid0 performance is also sporadic:
> 2147483648 bytes (2.1 GB) copied, 7.72408 s, 278 MB/s
> 2147483648 bytes (2.1 GB) copied, 7.78478 s, 276 MB/s
> 2147483648 bytes (2.1 GB) copied, 11.0323 s, 195 MB/s
> 2147483648 bytes (2.1 GB) copied, 8.41244 s, 255 MB/s
> 2147483648 bytes (2.1 GB) copied, 30.7649 s, 69.8 MB/s

Are these synced writes? i.e. is it possible the performance at the end
is dropped because memory becomes full of dirty pages at that point?

> System/Test configuration:
> (2) Intel(R) Xeon(R) CPU 5150
> mem=1024M
> CONFIG_HIGHMEM4G=y (full config attached)
> mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean
> for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
>
> Neil suggested CONFIG_NOHIGHMEM=y, I will give that a shot tomorrow.
> Other suggestions / experiments?

There was a deporkify patch which replaced inline functions with normal
functions. I worried at the time that all the function calls in an iterator
might cause a performance problem, but I couldn't measure it, so I assumed
the reduction in text size was a plus. This is a partial reporkify patch
that should reduce the number of function calls that take place, at the
cost of larger text. Can you try it out applied against 2.6.26-rc8, please?
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/include/linux/mmzone.h linux-2.6.26-rc8-repork/include/linux/mmzone.h
--- linux-2.6.26-rc8-clean/include/linux/mmzone.h	2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-repork/include/linux/mmzone.h	2008-07-01 00:49:17.000000000 -0700
@@ -742,6 +742,15 @@ static inline int zonelist_node_idx(stru
 #endif /* CONFIG_NUMA */
 }
 
+static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+	return node_isset(zonelist_node_idx(zref), *nodes);
+#else
+	return 1;
+#endif /* CONFIG_NUMA */
+}
+
 /**
  * next_zones_zonelist - Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point
  * @z - The cursor used as a starting point for the search
@@ -754,10 +763,26 @@ static inline int zonelist_node_idx(stru
  * search. The zoneref returned is a cursor that is used as the next starting
  * point for future calls to next_zones_zonelist().
  */
-struct zoneref *next_zones_zonelist(struct zoneref *z,
+static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
 					nodemask_t *nodes,
-					struct zone **zone);
+					struct zone **zone)
+{
+	/*
+	 * Find the next suitable zone to use for the allocation.
+	 * Only filter based on nodemask if it's set
+	 */
+	if (likely(nodes == NULL))
+		while (zonelist_zone_idx(z) > highest_zoneidx)
+			z++;
+	else
+		while (zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes)))
+			z++;
+
+	*zone = zonelist_zone(z++);
+	return z;
+}
 
 /**
  * first_zones_zonelist - Returns the first zone at or below highest_zoneidx within the allowed nodemask in a zonelist
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/mmzone.c linux-2.6.26-rc8-repork/mm/mmzone.c
--- linux-2.6.26-rc8-clean/mm/mmzone.c	2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-repork/mm/mmzone.c	2008-07-01 00:48:19.000000000 -0700
@@ -42,33 +42,3 @@ struct zone *next_zone(struct zone *zone
 	return zone;
 }
 
-static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
-{
-#ifdef CONFIG_NUMA
-	return node_isset(zonelist_node_idx(zref), *nodes);
-#else
-	return 1;
-#endif /* CONFIG_NUMA */
-}
-
-/* Returns the next zone at or below highest_zoneidx in a zonelist */
-struct zoneref *next_zones_zonelist(struct zoneref *z,
-					enum zone_type highest_zoneidx,
-					nodemask_t *nodes,
-					struct zone **zone)
-{
-	/*
-	 * Find the next suitable zone to use for the allocation.
-	 * Only filter based on nodemask if it's set
-	 */
-	if (likely(nodes == NULL))
-		while (zonelist_zone_idx(z) > highest_zoneidx)
-			z++;
-	else
-		while (zonelist_zone_idx(z) > highest_zoneidx ||
-				(z->zone && !zref_in_nodemask(z, nodes)))
-			z++;
-
-	*zone = zonelist_zone(z++);
-	return z;
-}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM, see:
http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) 2008-07-01 8:09 ` Mel Gorman @ 2008-07-01 17:58 ` Andy Whitcroft 2008-07-01 19:07 ` Mel Gorman 0 siblings, 1 reply; 18+ messages in thread From: Andy Whitcroft @ 2008-07-01 17:58 UTC (permalink / raw) To: Dan Williams Cc: Mel Gorman, linux-mm, linux-kernel, NeilBrown, babydr, cl, lee.schermerhorn On Tue, Jul 01, 2008 at 09:09:11AM +0100, Mel Gorman wrote: > (Christoph's address corrected and Andys added to cc) > > On (30/06/08 18:57), Dan Williams didst pronounce: > > Hello, > > > > Prompted by a report from a user I have bisected a performance loss > > apparently introduced by commit 54a6eb5c (mm: use two zonelist that are > > filtered by GFP mask). The test is simple sequential writes to a 4 disk > > raid5 array. Performance should be about 20% greater than 2.6.25 due to > > commit 8b3e6cdc (md: introduce get_priority_stripe() to improve raid456 > > write performance). The sample data below shows sporadic performance > > starting at 54a6eb5c. The '+' indicates where I hand applied 8b3e6cdc. > > > > revision 2.6.25.8-fc8 2.6.25.9+ dac1d27b+ 18ea7e71+ 54a6eb5c+ 2.6.26-rc1 2.6.26-rc8 > > 138 168 169 167 177 149 144 > > 140 168 172 170 109 138 142 > > 142 165 169 164 119 138 129 > > 144 168 169 171 120 139 135 > > 142 165 174 166 165 122 154 > > MB/s (avg) 141 167 171 168 138 137 141 > > % change 0% 18% 21% 19% -2% -3% 0% > > result base good good good [bad] bad bad > > > > That is not good at all as this patch is not a straight-forward revert but > the second time it's come under suspicion. > > > Notable observations: > > 1/ This problem does not reproduce when ARCH=x86_64, i.e. 2.6.26-rc8 and > > 54a6eb5c show consistent performance at 170MB/s. > > I'm very curious as to why this doesn't affect x86_64. HIGHMEM is one > possibility if GFP_KERNEL is a major factor and it has to scan over the > unusable zone a lot. 
However, another remote possibility is that many function > calls are more expensive on x86 than on x86_64 (this is a wild guess based > on the registers available). Spectulative patch is below. > > If 8b3e6cdc is reverted from 2.6.26-rc8, what do the figures look like? > i.e. is the zonelist filtering looking like a performance regression or is > it just somehow negating the benefits of the raid patch? > > > 2/ Single drive performance appears to be unaffected > > 3/ A quick test shows that raid0 performance is also sporadic: > > 2147483648 bytes (2.1 GB) copied, 7.72408 s, 278 MB/s > > 2147483648 bytes (2.1 GB) copied, 7.78478 s, 276 MB/s > > 2147483648 bytes (2.1 GB) copied, 11.0323 s, 195 MB/s > > 2147483648 bytes (2.1 GB) copied, 8.41244 s, 255 MB/s > > 2147483648 bytes (2.1 GB) copied, 30.7649 s, 69.8 MB/s > > > > Are these synced writes? i.e. is it possible the performance at the end > is dropped because memory becomes full of dirty pages at that point? > > > System/Test configuration: > > (2) Intel(R) Xeon(R) CPU 5150 > > mem=1024M > > CONFIG_HIGHMEM4G=y (full config attached) > > mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean > > for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done > > > > Neil suggested CONFIG_NOHIGHMEM=y, I will give that a shot tomorrow. > > Other suggestions / experiments? > > Looking at the commit in question (54a6eb5c) there is one slight anomoly in the conversion. When nr_free_zone_pages() was converted to the new iterators it started using the offset parameter to limit the zones traversed; which is not unreasonable as that appears to be the parameters purpose. 
However, if we look at the original implementation of this function
(reproduced below) we can see it actually did nothing with this parameter:

static unsigned int nr_free_zone_pages(int offset)
{
	/* Just pick one node, since fallback list is circular */
	unsigned int sum = 0;

	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
	struct zone **zonep = zonelist->zones;
	struct zone *zone;

	for (zone = *zonep++; zone; zone = *zonep++) {
		unsigned long size = zone->present_pages;
		unsigned long high = zone->pages_high;
		if (size > high)
			sum += size - high;
	}

	return sum;
}

This version of the routine would only ever accumulate the free space as
found on zones in the GFP_KERNEL zonelist, i.e. all zones other than
HIGHMEM, regardless of its parameters.

Looking at its callers, there are only two:

unsigned int nr_free_buffer_pages(void)
{
	return nr_free_zone_pages(gfp_zone(GFP_USER));
}
unsigned int nr_free_pagecache_pages(void)
{
	return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));
}

Before this commit both would return the same value. Following it,
nr_free_pagecache_pages() would now include the number of pages in HIGHMEM
as well. This is used to initialise vm_total_pages, which is used in a
number of the heuristics related to reclaim:

vm_total_pages = nr_free_pagecache_pages();

If the issue was low memory pressure (which would not be an unreasonable
conjecture given the large number of internal I/Os we may generate with
RAID) we may well make different reclaim decisions before and after this
commit. It should also be noted that, as this discrepancy only appears
where there is HIGHMEM, we would expect different behaviour on i386 either
side of this commit but not on x86_64.

All that said, the simple way to confirm/eliminate this would be to test at
the first bad commit with the patch below. This patch 'fixes'
nr_free_pagecache_pages() to return the original value. It is not clear
whether this is the right behaviour, but it is worth testing to see whether
this is the root cause or not.

-apw

=== 8< ===
debug raid slowdown

There is a small difference in the calculations leading to the value of
vm_total_pages once two zonelists is applied on i386 architectures with
HIGHMEM present. Bodge nr_free_pagecache_pages() to return the (arguably
incorrect) value it did before this change.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2f55295..5ceef27 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1745,7 +1745,7 @@ EXPORT_SYMBOL_GPL(nr_free_buffer_pages);
  */
 unsigned int nr_free_pagecache_pages(void)
 {
-	return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));
+	return nr_free_zone_pages(gfp_zone(GFP_USER));
 }
 
 static inline void show_node(struct zone *zone)
* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) 2008-07-01 17:58 ` Andy Whitcroft @ 2008-07-01 19:07 ` Mel Gorman 2008-07-01 20:29 ` Dan Williams 2008-07-01 22:28 ` Dan Williams 0 siblings, 2 replies; 18+ messages in thread From: Mel Gorman @ 2008-07-01 19:07 UTC (permalink / raw) To: Andy Whitcroft Cc: Dan Williams, linux-mm, linux-kernel, NeilBrown, babydr, cl, lee.schermerhorn On (01/07/08 18:58), Andy Whitcroft didst pronounce: > On Tue, Jul 01, 2008 at 09:09:11AM +0100, Mel Gorman wrote: > > (Christoph's address corrected and Andys added to cc) > > > > On (30/06/08 18:57), Dan Williams didst pronounce: > > > Hello, > > > > > > Prompted by a report from a user I have bisected a performance loss > > > apparently introduced by commit 54a6eb5c (mm: use two zonelist that are > > > filtered by GFP mask). The test is simple sequential writes to a 4 disk > > > raid5 array. Performance should be about 20% greater than 2.6.25 due to > > > commit 8b3e6cdc (md: introduce get_priority_stripe() to improve raid456 > > > write performance). The sample data below shows sporadic performance > > > starting at 54a6eb5c. The '+' indicates where I hand applied 8b3e6cdc. > > > > > > revision 2.6.25.8-fc8 2.6.25.9+ dac1d27b+ 18ea7e71+ 54a6eb5c+ 2.6.26-rc1 2.6.26-rc8 > > > 138 168 169 167 177 149 144 > > > 140 168 172 170 109 138 142 > > > 142 165 169 164 119 138 129 > > > 144 168 169 171 120 139 135 > > > 142 165 174 166 165 122 154 > > > MB/s (avg) 141 167 171 168 138 137 141 > > > % change 0% 18% 21% 19% -2% -3% 0% > > > result base good good good [bad] bad bad > > > > > > > That is not good at all as this patch is not a straight-forward revert but > > the second time it's come under suspicion. > > > > > Notable observations: > > > 1/ This problem does not reproduce when ARCH=x86_64, i.e. 2.6.26-rc8 and > > > 54a6eb5c show consistent performance at 170MB/s. > > > > I'm very curious as to why this doesn't affect x86_64. 
HIGHMEM is one > > possibility if GFP_KERNEL is a major factor and it has to scan over the > > unusable zone a lot. However, another remote possibility is that many function > > calls are more expensive on x86 than on x86_64 (this is a wild guess based > > on the registers available). Spectulative patch is below. > > > > If 8b3e6cdc is reverted from 2.6.26-rc8, what do the figures look like? > > i.e. is the zonelist filtering looking like a performance regression or is > > it just somehow negating the benefits of the raid patch? > > > > > 2/ Single drive performance appears to be unaffected > > > 3/ A quick test shows that raid0 performance is also sporadic: > > > 2147483648 bytes (2.1 GB) copied, 7.72408 s, 278 MB/s > > > 2147483648 bytes (2.1 GB) copied, 7.78478 s, 276 MB/s > > > 2147483648 bytes (2.1 GB) copied, 11.0323 s, 195 MB/s > > > 2147483648 bytes (2.1 GB) copied, 8.41244 s, 255 MB/s > > > 2147483648 bytes (2.1 GB) copied, 30.7649 s, 69.8 MB/s > > > > > > > Are these synced writes? i.e. is it possible the performance at the end > > is dropped because memory becomes full of dirty pages at that point? > > > > > System/Test configuration: > > > (2) Intel(R) Xeon(R) CPU 5150 > > > mem=1024M > > > CONFIG_HIGHMEM4G=y (full config attached) > > > mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean > > > for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done > > > > > > Neil suggested CONFIG_NOHIGHMEM=y, I will give that a shot tomorrow. > > > Other suggestions / experiments? > > > > > Looking at the commit in question (54a6eb5c) there is one slight anomoly > in the conversion. When nr_free_zone_pages() was converted to the new > iterators it started using the offset parameter to limit the zones > traversed; which is not unreasonable as that appears to be the > parameters purpose. 
However, if we look at the original implementation > of this function (reproduced below) we can see it actually did nothing > with this parameter: > > static unsigned int nr_free_zone_pages(int offset) > { > /* Just pick one node, since fallback list is circular */ > unsigned int sum = 0; > > struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL); > struct zone **zonep = zonelist->zones; > struct zone *zone; > > for (zone = *zonep++; zone; zone = *zonep++) { > unsigned long size = zone->present_pages; > unsigned long high = zone->pages_high; > if (size > high) > sum += size - high; > } > > return sum; > } > This looks kinda promising and depends heavily on how this patch was tested in isolation. Dan, can you post the patch you use on 2.6.25 because the commit in question should not have applied cleanly please? To be clear, 2.6.25 used the offset parameter correctly to get a zonelist with the right zones in it. However, with two-zonelist, there is only one that gets filtered so using GFP_KERNEL to find a zone is equivilant as it gets filtered based on offset. However, if this patch was tested in isolation, it could result in bogus values of vm_total_pages. Dan, can you confirm in your dmesg logs that the line like the following has similar values please? Built 1 zonelists in Zone order, mobility grouping on. Total pages: 258544 Right now though 2.6.26-rc8 and 2.6.25 report the same values for "Total pages" in the dmesg at least under qemu running with 1GB. If vm_total_pages was wrong, that value would be broken. > This version of the routine would only ever accumulate the free space as > found on zones in the GFP_KERNEL zonelist, all zones other than HIGHMEM, > regardless of its parameters. 
> > Looking at its callers, there are only two: > > unsigned int nr_free_buffer_pages(void) > { > return nr_free_zone_pages(gfp_zone(GFP_USER)); > } > unsigned int nr_free_pagecache_pages(void) > { > return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE)); > } > > Before this commit both would return the same value. Following it, > nr_free_pagecache_pages() would now contain the number of pages in > HIGHMEM as well. This is used to initialise vm_total_pages, which is > used in a number of the heuristics related to reclaim. > > vm_total_pages = nr_free_pagecache_pages(); > > If the issue was low memory pressure (which would not be an unreasonable > conjecture given the large number of internal I/O's we may generate > with RAID) we may well make different reclaim decisions before and after > this commit. It should also be noted that as this discrepancy would only > appear where there is HIGHMEM we might expect different behaviour on i386 > either side of this commit but not so on x86_64. > > All that said, the simple way to confirm/eliminate this would be to > test at the first bad commit, with the patch below. This patch 'fixes' > nr_free_pagecache_pages to return the original value. It is not clear > whether this is the right behaviour, but worth testing to see if this > the root cause or not. > > -apw > > === 8< === > debug raid slowdown > > There is a small difference in the calculations leading to the value of > vm_total_pages once two zone lists is applied on i386 architectures with > HIGHMEM present. Bodge nr_free_pagecache_pages() to return the (arguably > incorrect) value it did before this change. 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 2f55295..5ceef27 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1745,7 +1745,7 @@ EXPORT_SYMBOL_GPL(nr_free_buffer_pages); > */ > unsigned int nr_free_pagecache_pages(void) > { > - return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE)); > + return nr_free_zone_pages(gfp_zone(GFP_USER)); > } > > static inline void show_node(struct zone *zone) > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) 2008-07-01 19:07 ` Mel Gorman @ 2008-07-01 20:29 ` Dan Williams 2008-07-02 5:18 ` Mel Gorman 2008-07-01 22:28 ` Dan Williams 1 sibling, 1 reply; 18+ messages in thread From: Dan Williams @ 2008-07-01 20:29 UTC (permalink / raw) To: Mel Gorman Cc: Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, cl, lee.schermerhorn On Tue, 2008-07-01 at 12:07 -0700, Mel Gorman wrote: > On (01/07/08 18:58), Andy Whitcroft didst pronounce: > > > > Neil suggested CONFIG_NOHIGHMEM=y, I will give that a shot tomorrow. > > > > Other suggestions / experiments? > > > > > > > > Looking at the commit in question (54a6eb5c) there is one slight anomoly > > in the conversion. When nr_free_zone_pages() was converted to the new > > iterators it started using the offset parameter to limit the zones > > traversed; which is not unreasonable as that appears to be the > > parameters purpose. However, if we look at the original implementation > > of this function (reproduced below) we can see it actually did nothing > > with this parameter: > > > > static unsigned int nr_free_zone_pages(int offset) > > { > > /* Just pick one node, since fallback list is circular */ > > unsigned int sum = 0; > > > > struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL); > > struct zone **zonep = zonelist->zones; > > struct zone *zone; > > > > for (zone = *zonep++; zone; zone = *zonep++) { > > unsigned long size = zone->present_pages; > > unsigned long high = zone->pages_high; > > if (size > high) > > sum += size - high; > > } > > > > return sum; > > } > > > > This looks kinda promising and depends heavily on how this patch was > tested in isolation. Dan, can you post the patch you use on 2.6.25 > because the commit in question should not have applied cleanly please? > > To be clear, 2.6.25 used the offset parameter correctly to get a zonelist with > the right zones in it. 
However, with two-zonelist, there is only one that
> gets filtered so using GFP_KERNEL to find a zone is equivilant as it gets
> filtered based on offset. However, if this patch was tested in isolation,
> it could result in bogus values of vm_total_pages. Dan, can you confirm
> in your dmesg logs that the line like the following has similar values
> please?
>
> Built 1 zonelists in Zone order, mobility grouping on. Total pages: 258544

The system is booted with mem=1024M on the kernel command line, and with
or without Andy's patch this reports:

Built 1 zonelists in Zone order, mobility grouping on. Total pages: 227584

Performance is still sporadic with the change. Moreover, this condition
reproduces even with CONFIG_NOHIGHMEM=y.

Let us take commit 8b3e6cdc out of the equation and just look at raid0
performance:

revision    2.6.25.8-fc8  54a6eb5c  54a6eb5c-nohighmem  2.6.26-rc8
                     279       278                 273         277
                     281       278                 275         277
                     281       113                68.7        66.8
                     279      69.2                 277        73.7
                     278      75.6                62.5        80.3
MB/s (avg)           280       163                 191         155
% change              0%      -42%                -32%        -45%
result              base       bad                 bad         bad

These numbers are taken from the results of:

for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done

Where md0 is created by:

mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 0

I will try your debug patch next, Mel, and then try to collect more data
with blktrace.

--
Dan
* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) 2008-07-01 20:29 ` Dan Williams @ 2008-07-02 5:18 ` Mel Gorman 2008-07-03 1:49 ` Dan Williams 0 siblings, 1 reply; 18+ messages in thread From: Mel Gorman @ 2008-07-02 5:18 UTC (permalink / raw) To: Dan Williams Cc: Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, cl, lee.schermerhorn On (01/07/08 13:29), Dan Williams didst pronounce: > > On Tue, 2008-07-01 at 12:07 -0700, Mel Gorman wrote: > > On (01/07/08 18:58), Andy Whitcroft didst pronounce: > > > > > Neil suggested CONFIG_NOHIGHMEM=y, I will give that a shot tomorrow. > > > > > Other suggestions / experiments? > > > > > > > > > > > Looking at the commit in question (54a6eb5c) there is one slight anomoly > > > in the conversion. When nr_free_zone_pages() was converted to the new > > > iterators it started using the offset parameter to limit the zones > > > traversed; which is not unreasonable as that appears to be the > > > parameters purpose. However, if we look at the original implementation > > > of this function (reproduced below) we can see it actually did nothing > > > with this parameter: > > > > > > static unsigned int nr_free_zone_pages(int offset) > > > { > > > /* Just pick one node, since fallback list is circular */ > > > unsigned int sum = 0; > > > > > > struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL); > > > struct zone **zonep = zonelist->zones; > > > struct zone *zone; > > > > > > for (zone = *zonep++; zone; zone = *zonep++) { > > > unsigned long size = zone->present_pages; > > > unsigned long high = zone->pages_high; > > > if (size > high) > > > sum += size - high; > > > } > > > > > > return sum; > > > } > > > > > > > This looks kinda promising and depends heavily on how this patch was > > tested in isolation. Dan, can you post the patch you use on 2.6.25 > > because the commit in question should not have applied cleanly please? 
> > > > To be clear, 2.6.25 used the offset parameter correctly to get a zonelist with > > the right zones in it. However, with two-zonelist, there is only one that > > gets filtered so using GFP_KERNEL to find a zone is equivilant as it gets > > filtered based on offset. However, if this patch was tested in isolation, > > it could result in bogus values of vm_total_pages. Dan, can you confirm > > in your dmesg logs that the line like the following has similar values > > please? > > > > Built 1 zonelists in Zone order, mobility grouping on. Total pages: 258544 > > The system is booted with mem=1024M on the kernel command line and with > or without Andy's patch this reports: > > Built 1 zonelists in Zone order, mobility grouping on. Total pages: 227584 > > Performance is still sporadic with the change. Moreover this condition > is reproducing even with CONFIG_NOHIGHMEM=y. > > Let us take commit 8b3e6cdc out of the equation and just look at raid0 > performance: > > revision 2.6.25.8-fc8 54a6eb5c 54a6eb5c-nohighmem 2.6.26-rc8 > 279 278 273 277 > 281 278 275 277 > 281 113 68.7 66.8 > 279 69.2 277 73.7 > 278 75.6 62.5 80.3 > MB/s (avg) 280 163 191 155 > % change 0% -42% -32% -45% > result base bad bad bad > Ok, based on your other mail, 54a6eb5c here is a bisection point. The good figures are on par with the "good" kernel with some disasterous runs leading to a bad average. The thing is, the bad results are way worse than could be accounted for by two-zonelist alone. In fact, the figures look suspiciously like only 1 disk is in use as they are roughly quartered. Can you think of anything that would explain that? Can you also confirm that using a bisection point before two-zonelist runs steadily and with high performance as expected please? This is to rule out some other RAID patch being a factor. It would be worth running vmstat during the tests so we can see if IO rates are dropping from an overall system perspective. 
If possible, oprofile data from the same time would be helpful to see does it show up where we are getting stuck. > These numbers are taken from the results of: > for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done > > Where md0 is created by: > mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 0 > > I will try your debug patch next Mel, and then try to collect more data > with blktrace. > I tried reproducing this but I don't have the necessary hardware to even come close to reproducing your test case :( . I got some dbench results with oprofile but found no significant differences between 2.6.25 and 2.6.26-rc8. However, I did find the results of dbench varied less between runs with the "repork" patch that made next_zones_zonelist() an inline function. Have you tried that patch yet? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-02  5:18         ` Mel Gorman
@ 2008-07-03  1:49           ` Dan Williams
  2008-07-03  4:27             ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2008-07-03  1:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, cl,
	lee.schermerhorn

On Tue, 2008-07-01 at 22:18 -0700, Mel Gorman wrote:
> > Let us take commit 8b3e6cdc out of the equation and just look at raid0
> > performance:
> >
> > revision    2.6.25.8-fc8  54a6eb5c  54a6eb5c-nohighmem  2.6.26-rc8
> >             279           278       273                 277
> >             281           278       275                 277
> >             281           113       68.7                66.8
> >             279           69.2      277                 73.7
> >             278           75.6      62.5                80.3
> > MB/s (avg)  280           163       191                 155
> > % change    0%            -42%      -32%                -45%
> > result      base          bad       bad                 bad
> >
>
> Ok, based on your other mail, 54a6eb5c here is a bisection point. The good
> figures are on par with the "good" kernel with some disastrous runs leading
> to a bad average. The thing is, the bad results are way worse than could be
> accounted for by two-zonelist alone. In fact, the figures look suspiciously
> like only 1 disk is in use as they are roughly quartered. Can you think of
> anything that would explain that?

raid0, in contrast to raid5, is fairly simple: it just remaps the bio.
Then it is up to the block layer to do some merging before it hits the
block device driver. Not much room for error, especially since single
drive performance seems to be unaffected. The data seems to show that
the vm is not queuing pages to be written back fast enough.

> Can you also confirm that using a bisection
> point before two-zonelist runs steadily and with high performance as expected
> please? This is to rule out some other RAID patch being a factor.
The immediately preceding commit 18ea7e71 always runs with high
performance, and raid0 was virtually untouched during the merge window:

$ git shortlog v2.6.25..v2.6.26-rc8 drivers/md/raid0.c
Neil Brown (1):
      Remove blkdev warning triggered by using md

> It would be worth running vmstat during the tests so we can see if IO
> rates are dropping from an overall system perspective. If possible,
> oprofile data from the same time would be helpful to see whether it shows
> up where we are getting stuck.

I watched /proc/vmstat during a dd run, and in the good case nr_writeback
never drops below 21,000. In the bad case it may briefly break a few
hundred pages.

This picture generated by seekwatcher shows that there are gaps in i/o
submission:

http://userweb.kernel.org/~djbw/md0-bad2.png

Compared to a good run:

http://userweb.kernel.org/~djbw/md0-good.png

What is interesting is that I was only able to get this data with the
following command:

./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb

When I tried to watch all four disks, i.e. 5 blktrace tasks running
instead of 2:

./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb -d /dev/sdc -d /dev/sdd -d /dev/sde

...it always measured good performance, as if having more tasks running
stimulates the vm to keep writeback going at a high rate. Could writeback
be getting stalled on one or more cpus?
Oprofile only seems to show that we are idling in the failure case:

---good---
samples  %        image name        symbol name
18229    20.7171  vmlinux-18ea7e71  __copy_from_user_ll_nocache_nozero
4518      5.1347  vmlinux-18ea7e71  clear_user
2137      2.4287  vmlinux-18ea7e71  get_page_from_freelist
1929      2.1923  vmlinux-18ea7e71  constant_test_bit
1745      1.9832  vmlinux-18ea7e71  clear_bit
1320      1.5002  vmlinux-18ea7e71  kmem_cache_alloc
1290      1.4661  vmlinux-18ea7e71  blk_rq_map_sg
1287      1.4627  vmlinux-18ea7e71  bio_put
1022      1.1615  raid0.ko          raid0_make_request
938       1.0660  vmlinux-18ea7e71  __make_request
938       1.0660  vmlinux-18ea7e71  test_clear_page_writeback
899       1.0217  vmlinux-18ea7e71  __percpu_counter_add

---bad---
samples  %        image name        symbol name
17844    16.3714  vmlinux-54a6eb5c  __copy_from_user_ll_nocache_nozero
4565      4.1883  vmlinux-54a6eb5c  clear_user
2229      2.0450  vmlinux-54a6eb5c  get_page_from_freelist
2202      2.0203  vmlinux-54a6eb5c  mwait_idle
2129      1.9533  vmlinux-54a6eb5c  schedule
1882      1.7267  vmlinux-54a6eb5c  constant_test_bit
1730      1.5872  vmlinux-54a6eb5c  clear_bit
1157      1.0615  vmlinux-54a6eb5c  kmem_cache_alloc

> > These numbers are taken from the results of:
> > for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
> >
> > Where md0 is created by:
> > mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 0
> >
> > I will try your debug patch next Mel, and then try to collect more data
> > with blktrace.
> >
>
> I tried reproducing this but I don't have the necessary hardware to even
> come close to reproducing your test case :( . I got some dbench results
> with oprofile but found no significant differences between 2.6.25 and
> 2.6.26-rc8. However, I did find the results of dbench varied less between
> runs with the "repork" patch that made next_zones_zonelist() an inline
> function. Have you tried that patch yet?

I gave the patch a shot but observed no change. As you predicted we are
looking at something more pronounced than some lost cpu cycles.
--
Dan

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) 2008-07-03 1:49 ` Dan Williams @ 2008-07-03 4:27 ` Mel Gorman 2008-07-03 4:43 ` Linus Torvalds 0 siblings, 1 reply; 18+ messages in thread From: Mel Gorman @ 2008-07-03 4:27 UTC (permalink / raw) To: Dan Williams Cc: Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, cl, lee.schermerhorn, a.beregalov, torvalds, akpm patch is below On (02/07/08 18:49), Dan Williams didst pronounce: > On Tue, 2008-07-01 at 22:18 -0700, Mel Gorman wrote: > > > Let us take commit 8b3e6cdc out of the equation and just look at raid0 > > > performance: > > > > > > revision 2.6.25.8-fc8 54a6eb5c 54a6eb5c-nohighmem 2.6.26-rc8 > > > 279 278 273 277 > > > 281 278 275 277 > > > 281 113 68.7 66.8 > > > 279 69.2 277 73.7 > > > 278 75.6 62.5 80.3 > > > MB/s (avg) 280 163 191 155 > > > % change 0% -42% -32% -45% > > > result base bad bad bad > > > > > > > Ok, based on your other mail, 54a6eb5c here is a bisection point. The good > > figures are on par with the "good" kernel with some disasterous runs leading > > to a bad average. The thing is, the bad results are way worse than could be > > accounted for by two-zonelist alone. In fact, the figures look suspiciously > > like only 1 disk is in use as they are roughly quartered. Can you think of > > anything that would explain that? > > raid0 in contrast to raid5 is fairly simple it just remaps the bio. > Then it is up to the block layer to do some merging before it hits the > block device driver. Not much room for error, especially since single > drive performance seems to be unaffected. The data seems to show that > the vm is not queuing pages to be written back fast enough. > That is exactly the case but I think I know why now. > > Can you also confirm that using a bisection > > point before two-zonelist runs steadily and with high performance as expected > > please? This is to rule out some other RAID patch being a factor. 
> The immediately preceding commit 18ea7e71 always runs with high
> performance, and raid0 was virtually untouched during the merge window:
>
> $ git shortlog v2.6.25..v2.6.26-rc8 drivers/md/raid0.c
> Neil Brown (1):
>       Remove blkdev warning triggered by using md
>
> > It would be worth running vmstat during the tests so we can see if IO
> > rates are dropping from an overall system perspective. If possible,
> > oprofile data from the same time would be helpful to see whether it
> > shows up where we are getting stuck.
>
> I watched /proc/vmstat during a dd run, and in the good case
> nr_writeback never drops below 21,000. In the bad case it may briefly
> break a few hundred pages.
>
> This picture generated by seekwatcher shows that there are gaps in i/o
> submission:
>
> http://userweb.kernel.org/~djbw/md0-bad2.png
>
> Compared to a good run:
>
> http://userweb.kernel.org/~djbw/md0-good.png
>
> What is interesting is that I was only able to get this data with the
> following command:
>
> ./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb
>
> When I tried to watch all four disks, i.e. 5 blktrace tasks running
> instead of 2:
>
> ./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb -d /dev/sdc -d /dev/sdd -d /dev/sde
>
> ...it always measured good performance, as if having more tasks running
> stimulates the vm to keep writeback going at a high rate. Could
> writeback be getting stalled on one or more cpus?
>

Yes, we're stalling because the two-zonelist patch is breaking kswapd on
!NUMA machines. The impact is that kswapd never reclaims and processes
always directly reclaim. The result is you see large stalls.

Alexander, I believe this may also be why lockdep is triggering and
you're seeing periodic stalls in the XFS tests.
The lockdep warning is a red herring, but the fact it's easier to trigger
could be because kswapd is not working and direct reclaim triggers the
warning easily.

Alexander and Dan, could you test the patch below and let me know if it
fixes your respective problems please?

===

Subject: [PATCH] Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr

With the two-zonelist patches on !NUMA machines, there really is only one
zonelist as __GFP_THISNODE is meaningless. However, during initialisation,
the assumption is made that two zonelists exist when initialising
zlcache_ptr. The result is that pgdat->nr_zones is always 0. As kswapd
uses this value to determine what reclaim work is necessary, the result is
that kswapd never reclaims. This causes processes to stall frequently in
low-memory situations as they always direct reclaim. This patch
initialises zlcache_ptr correctly.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 page_alloc.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
--- linux-2.6.26-rc8-clean/mm/page_alloc.c	2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c	2008-07-02 21:14:16.000000000 -0700
@@ -2328,7 +2328,8 @@ static void build_zonelists(pg_data_t *p
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
 	pgdat->node_zonelists[0].zlcache_ptr = NULL;
-	pgdat->node_zonelists[1].zlcache_ptr = NULL;
+	if (NUMA_BUILD)
+		pgdat->node_zonelists[1].zlcache_ptr = NULL;
 }

 #endif /* CONFIG_NUMA */

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-03  4:27             ` Mel Gorman
@ 2008-07-03  4:43               ` Linus Torvalds
  2008-07-03  5:00                 ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2008-07-03  4:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dan Williams, Andy Whitcroft, linux-mm, linux-kernel, NeilBrown,
	babydr, cl, lee.schermerhorn, a.beregalov, akpm

On Thu, 3 Jul 2008, Mel Gorman wrote:

> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
> --- linux-2.6.26-rc8-clean/mm/page_alloc.c	2008-06-24 18:58:20.000000000 -0700
> +++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c	2008-07-02 21:14:16.000000000 -0700
> @@ -2328,7 +2328,8 @@ static void build_zonelists(pg_data_t *p
>  static void build_zonelist_cache(pg_data_t *pgdat)
>  {
>  	pgdat->node_zonelists[0].zlcache_ptr = NULL;
> -	pgdat->node_zonelists[1].zlcache_ptr = NULL;
> +	if (NUMA_BUILD)
> +		pgdat->node_zonelists[1].zlcache_ptr = NULL;
>  }

This makes no sense.

That whole thing is inside a

	#ifdef CONFIG_NUMA
	... numa code ..
	#else
	... this code ..
	#endif

so CONFIG_NUMA will _not_ be set, and NUMA_BUILD is always 0.

So why do that

	if (NUMA_BUILD)
		..

at all, when it is known to be false?

So the patch may be correct, but wouldn't it be better to just remove the
line entirely, instead of moving it into a conditional that cannot be
true?

Also, I'm not quite seeing why those zonelists should be zeroed out at
all. Shouldn't a non-NUMA setup always aim to have node_zonelists[0] ==
node_zonelists[1] == all appropriate zones?

I have to say, the whole mmzone thing is confusing. The code makes my
eyes bleed. I can't really follow it.

		Linus
^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-03  4:43               ` Linus Torvalds
@ 2008-07-03  5:00                 ` Mel Gorman
  2008-07-03  5:54                   ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2008-07-03  5:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Williams, Andy Whitcroft, linux-mm, linux-kernel, NeilBrown,
	babydr, cl, lee.schermerhorn, a.beregalov, akpm

On (02/07/08 21:43), Linus Torvalds didst pronounce:
>
> On Thu, 3 Jul 2008, Mel Gorman wrote:
>
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
> > --- linux-2.6.26-rc8-clean/mm/page_alloc.c	2008-06-24 18:58:20.000000000 -0700
> > +++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c	2008-07-02 21:14:16.000000000 -0700
> > @@ -2328,7 +2328,8 @@ static void build_zonelists(pg_data_t *p
> >  static void build_zonelist_cache(pg_data_t *pgdat)
> >  {
> >  	pgdat->node_zonelists[0].zlcache_ptr = NULL;
> > -	pgdat->node_zonelists[1].zlcache_ptr = NULL;
> > +	if (NUMA_BUILD)
> > +		pgdat->node_zonelists[1].zlcache_ptr = NULL;
> >  }
>
> This makes no sense.
>
> That whole thing is inside a
>
> 	#ifdef CONFIG_NUMA
> 	... numa code ..
> 	#else
> 	... this code ..
> 	#endif
>
> so CONFIG_NUMA will _not_ be set, and NUMA_BUILD is always 0.
>
> So why do that
>
> 	if (NUMA_BUILD)
> 		..
>
> at all, when it is known to be false?
>

Because I'm a muppet and a bit cross-eyed from looking at this until the
problem would reveal itself. I knew this needed fixing but chose the most
stupid way possible to fix it.

> So the patch may be correct, but wouldn't it be better to just remove the
> line entirely, instead of moving it into a conditional that cannot be
> true?
>

Yes, revised patch below. Same fix, more sensible.

> Also, I'm not quite seeing why those zonelists should be zeroed out at
> all.
> Shouldn't a non-NUMA setup always aim to have node_zonelists[0] ==
> node_zonelists[1] == all appropriate zones?
>

node_zonelists[1] doesn't exist at all on non-NUMA so there is only
node_zonelists[0] that is all appropriate zones.

> I have to say, the whole mmzone thing is confusing. The code makes my
> eyes bleed. I can't really follow it.

I've been looking at this too long to comment with anything other than a
blank stare :/

====

Subject: [PATCH] Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr

With the two-zonelist patches on !NUMA machines, there really is only one
zonelist as __GFP_THISNODE is meaningless. However, during initialisation,
the assumption is made that two zonelists exist when initialising
zlcache_ptr. The result is that pgdat->nr_zones is always 0. As kswapd
uses this value to determine what reclaim work is necessary, the result is
that kswapd never reclaims. This causes processes to stall frequently in
low-memory situations as they always direct reclaim. This patch
initialises zlcache_ptr correctly.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 page_alloc.c |    1 -
 1 file changed, 1 deletion(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
--- linux-2.6.26-rc8-clean/mm/page_alloc.c	2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c	2008-07-02 21:49:09.000000000 -0700
@@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *p
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
 	pgdat->node_zonelists[0].zlcache_ptr = NULL;
-	pgdat->node_zonelists[1].zlcache_ptr = NULL;
 }

 #endif /* CONFIG_NUMA */

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) 2008-07-03 5:00 ` Mel Gorman @ 2008-07-03 5:54 ` Dan Williams 2008-07-03 13:37 ` Christoph Lameter 2008-07-03 16:38 ` [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) Mel Gorman 0 siblings, 2 replies; 18+ messages in thread From: Dan Williams @ 2008-07-03 5:54 UTC (permalink / raw) To: Mel Gorman Cc: Linus Torvalds, Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, cl, lee.schermerhorn, a.beregalov, akpm On Wed, 2008-07-02 at 22:00 -0700, Mel Gorman wrote: > Subject: [PATCH] Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr > > With the two-zonelist patches on !NUMA machines, there really is only one > zonelist as __GFP_THISNODE is meaningless. However, during initialisation, the > assumption is made that two zonelists exist when initialising zlcache_ptr. The > result is that pgdat->nr_zones is always 0. As kswapd uses this value to > determine what reclaim work is necessary, the result is that kswapd never > reclaims. This causes processes to stall frequently in low-memory situations > as they always direct reclaim. This patch initialises zlcache_ptr correctly. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > --- > page_alloc.c | 1 - > 1 file changed, 1 deletion(-) > > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c > --- linux-2.6.26-rc8-clean/mm/page_alloc.c 2008-06-24 18:58:20.000000000 -0700 > +++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c 2008-07-02 21:49:09.000000000 -0700 > @@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *p > static void build_zonelist_cache(pg_data_t *pgdat) > { > pgdat->node_zonelists[0].zlcache_ptr = NULL; > - pgdat->node_zonelists[1].zlcache_ptr = NULL; > } > > #endif /* CONFIG_NUMA */ > Bug squished. 
# for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 7.73352 s, 278 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 7.6845 s, 279 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 7.74428 s, 277 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 7.65959 s, 280 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 7.73107 s, 278 MB/s

Tested-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-03  5:54                   ` Dan Williams
@ 2008-07-03 13:37                     ` Christoph Lameter
  2008-07-03 16:36                       ` [PATCH] Do not clobber pgdat->nr_zones during memory initialisation Mel Gorman
  2008-07-03 16:38                       ` [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) Mel Gorman
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2008-07-03 13:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Mel Gorman, Linus Torvalds, Andy Whitcroft, linux-mm,
	linux-kernel, NeilBrown, babydr, lee.schermerhorn, a.beregalov, akpm

What a convoluted description. Simply put: We clobber the nr_zones field
because we write beyond the bounds of the node_zonelists[] array in
struct pglist_data.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* [PATCH] Do not clobber pgdat->nr_zones during memory initialisation
  2008-07-03 13:37                     ` Christoph Lameter
@ 2008-07-03 16:36                       ` Mel Gorman
  2008-07-03 16:44                         ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2008-07-03 16:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dan Williams, Linus Torvalds, Andy Whitcroft, linux-mm,
	linux-kernel, NeilBrown, babydr, lee.schermerhorn, a.beregalov, akpm

On (03/07/08 08:37), Christoph Lameter didst pronounce:
> What a convoluted description. Simply put: We clobber the nr_zones field
> because we write beyond the bounds of the node_zonelists[] array in
> struct pglist_data.
>

Subject: [PATCH] Do not clobber pgdat->nr_zones during memory initialisation

The nr_zones field is getting clobbered due to a write beyond the bounds
of the node_zonelists[] array in struct pglist_data.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    1 -
 1 file changed, 1 deletion(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
--- linux-2.6.26-rc8-clean/mm/page_alloc.c	2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c	2008-07-02 21:49:09.000000000 -0700
@@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *p
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
 	pgdat->node_zonelists[0].zlcache_ptr = NULL;
-	pgdat->node_zonelists[1].zlcache_ptr = NULL;
 }

 #endif /* CONFIG_NUMA */

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] Do not clobber pgdat->nr_zones during memory initialisation 2008-07-03 16:36 ` [PATCH] Do not clobber pgdat->nr_zones during memory initialisation Mel Gorman @ 2008-07-03 16:44 ` Linus Torvalds 2008-07-03 16:46 ` Linus Torvalds 2008-07-03 17:16 ` Mel Gorman 0 siblings, 2 replies; 18+ messages in thread From: Linus Torvalds @ 2008-07-03 16:44 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Lameter, Dan Williams, Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, lee.schermerhorn, a.beregalov, akpm On Thu, 3 Jul 2008, Mel Gorman wrote: > > Subject: [PATCH] Do not clobber pgdat->nr_zones during memory initialisation Heh. I already applied it as ObviouslyCorrect(tm), but did the simplification I already pointed out (and which your second version already had) and rewrote your commit message a bit. So it's now committed as follows.. Linus --- commit 494de90098784b8e2797598cefdd34188884ec2e Author: Mel Gorman <mel@csn.ul.ie> Date: Thu Jul 3 05:27:51 2008 +0100 Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr The non-NUMA case of build_zonelist_cache() would initialize the zlcache_ptr for both node_zonelists[] to NULL. Which is problematic, since non-NUMA only has a single node_zonelists[] entry, and trying to zero the non-existent second one just overwrote the nr_zones field instead. As kswapd uses this value to determine what reclaim work is necessary, the result is that kswapd never reclaims. This causes processes to stall frequently in low-memory situations as they always direct reclaim. This patch initialises zlcache_ptr correctly. 
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Tested-by: Dan Williams <dan.j.williams@intel.com>
    [ Simplified patch a bit ]
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

---
 mm/page_alloc.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2f55295..f32fae3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *pgdat)
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
 	pgdat->node_zonelists[0].zlcache_ptr = NULL;
-	pgdat->node_zonelists[1].zlcache_ptr = NULL;
 }

 #endif /* CONFIG_NUMA */

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH] Do not clobber pgdat->nr_zones during memory initialisation 2008-07-03 16:44 ` Linus Torvalds @ 2008-07-03 16:46 ` Linus Torvalds 2008-07-03 17:16 ` Mel Gorman 1 sibling, 0 replies; 18+ messages in thread From: Linus Torvalds @ 2008-07-03 16:46 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Lameter, Dan Williams, Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, lee.schermerhorn, a.beregalov, akpm On Thu, 3 Jul 2008, Linus Torvalds wrote: > > So it's now committed as follows.. Side note: master.kernel.org is going through maintenance today, so I haven't actually been able to push it out yet, in case anybody wonders why it's not showing up on the public sites. It will all be out there later today. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] Do not clobber pgdat->nr_zones during memory initialisation 2008-07-03 16:44 ` Linus Torvalds 2008-07-03 16:46 ` Linus Torvalds @ 2008-07-03 17:16 ` Mel Gorman 1 sibling, 0 replies; 18+ messages in thread From: Mel Gorman @ 2008-07-03 17:16 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Lameter, Dan Williams, Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, lee.schermerhorn, a.beregalov, akpm On (03/07/08 09:44), Linus Torvalds didst pronounce: > > > On Thu, 3 Jul 2008, Mel Gorman wrote: > > > > Subject: [PATCH] Do not clobber pgdat->nr_zones during memory initialisation > > Heh. I already applied it as ObviouslyCorrect(tm), but did the > simplification I already pointed out (and which your second version > already had) and rewrote your commit message a bit. So it's now committed > as follows.. > Perfect. The log even looks like it was written by a sane person. Thanks for that. > Linus > > --- > commit 494de90098784b8e2797598cefdd34188884ec2e > Author: Mel Gorman <mel@csn.ul.ie> > Date: Thu Jul 3 05:27:51 2008 +0100 > > Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr > > The non-NUMA case of build_zonelist_cache() would initialize the > zlcache_ptr for both node_zonelists[] to NULL. > > Which is problematic, since non-NUMA only has a single node_zonelists[] > entry, and trying to zero the non-existent second one just overwrote the > nr_zones field instead. > > As kswapd uses this value to determine what reclaim work is necessary, > the result is that kswapd never reclaims. This causes processes to > stall frequently in low-memory situations as they always direct reclaim. > This patch initialises zlcache_ptr correctly. 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > Tested-by: Dan Williams <dan.j.williams@intel.com> > [ Simplified patch a bit ] > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > --- > mm/page_alloc.c | 1 - > 1 files changed, 0 insertions(+), 1 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 2f55295..f32fae3 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *pgdat) > static void build_zonelist_cache(pg_data_t *pgdat) > { > pgdat->node_zonelists[0].zlcache_ptr = NULL; > - pgdat->node_zonelists[1].zlcache_ptr = NULL; > } > > #endif /* CONFIG_NUMA */ > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) 2008-07-03 5:54 ` Dan Williams 2008-07-03 13:37 ` Christoph Lameter @ 2008-07-03 16:38 ` Mel Gorman 1 sibling, 0 replies; 18+ messages in thread From: Mel Gorman @ 2008-07-03 16:38 UTC (permalink / raw) To: Dan Williams Cc: Linus Torvalds, Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, cl, lee.schermerhorn, a.beregalov, akpm On (02/07/08 22:54), Dan Williams didst pronounce: > > On Wed, 2008-07-02 at 22:00 -0700, Mel Gorman wrote: > > > Subject: [PATCH] Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr > > > > With the two-zonelist patches on !NUMA machines, there really is only one > > zonelist as __GFP_THISNODE is meaningless. However, during initialisation, the > > assumption is made that two zonelists exist when initialising zlcache_ptr. The > > result is that pgdat->nr_zones is always 0. As kswapd uses this value to > > determine what reclaim work is necessary, the result is that kswapd never > > reclaims. This causes processes to stall frequently in low-memory situations > > as they always direct reclaim. This patch initialises zlcache_ptr correctly. > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > --- > > page_alloc.c | 1 - > > 1 file changed, 1 deletion(-) > > > > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c > > --- linux-2.6.26-rc8-clean/mm/page_alloc.c 2008-06-24 18:58:20.000000000 -0700 > > +++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c 2008-07-02 21:49:09.000000000 -0700 > > @@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *p > > static void build_zonelist_cache(pg_data_t *pgdat) > > { > > pgdat->node_zonelists[0].zlcache_ptr = NULL; > > - pgdat->node_zonelists[1].zlcache_ptr = NULL; > > } > > > > #endif /* CONFIG_NUMA */ > > > > Bug squished. 
> > # for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done > 2048+0 records in > 2048+0 records out > 2147483648 bytes (2.1 GB) copied, 7.73352 s, 278 MB/s > 2048+0 records in > 2048+0 records out > 2147483648 bytes (2.1 GB) copied, 7.6845 s, 279 MB/s > 2048+0 records in > 2048+0 records out > 2147483648 bytes (2.1 GB) copied, 7.74428 s, 277 MB/s > 2048+0 records in > 2048+0 records out > 2147483648 bytes (2.1 GB) copied, 7.65959 s, 280 MB/s > 2048+0 records in > 2048+0 records out > 2147483648 bytes (2.1 GB) copied, 7.73107 s, 278 MB/s > > Tested-by: Dan Williams <dan.j.williams@intel.com> > Great news. Dan, thanks a lot for reporting and persisting with the testing of various bits and pieces to get this pinned down. It is greatly appreciated. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-01 19:07 ` Mel Gorman
  2008-07-01 20:29 ` Dan Williams
@ 2008-07-01 22:28 ` Dan Williams
  1 sibling, 0 replies; 18+ messages in thread
From: Dan Williams @ 2008-07-01 22:28 UTC (permalink / raw)
To: Mel Gorman
Cc: Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, cl,
    lee.schermerhorn

On Tue, 2008-07-01 at 12:07 -0700, Mel Gorman wrote:
> This looks kinda promising and depends heavily on how this patch was
> tested in isolation. Dan, can you post the patch you use on 2.6.25
> because the commit in question should not have applied cleanly please?

?  I have not tested this patch (54a6eb5c) in isolation.  The patch
that was applied to 2.6.25 was the raid5 performance enhancement patch.
It is no longer in the equation since the problem is reproducible with
stock raid0.

--
Dan

^ permalink raw reply	[flat|nested] 18+ messages in thread
end of thread, other threads: [~2008-07-03 17:16 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-01  1:57 [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) Dan Williams
2008-07-01  8:09 ` Mel Gorman
2008-07-01 17:58   ` Andy Whitcroft
2008-07-01 19:07     ` Mel Gorman
2008-07-01 20:29       ` Dan Williams
2008-07-02  5:18         ` Mel Gorman
2008-07-03  1:49           ` Dan Williams
2008-07-03  4:27             ` Mel Gorman
2008-07-03  4:43               ` Linus Torvalds
2008-07-03  5:00                 ` Mel Gorman
2008-07-03  5:54                   ` Dan Williams
2008-07-03 13:37                     ` Christoph Lameter
2008-07-03 16:36                       ` [PATCH] Do not clobber pgdat->nr_zones during memory initialisation Mel Gorman
2008-07-03 16:44                         ` Linus Torvalds
2008-07-03 16:46                           ` Linus Torvalds
2008-07-03 17:16                             ` Mel Gorman
2008-07-03 16:38                     ` [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) Mel Gorman
2008-07-01 22:28       ` Dan Williams