linux-mm.kvack.org archive mirror
* [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
@ 2008-07-01  1:57 Dan Williams
  2008-07-01  8:09 ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2008-07-01  1:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: NeilBrown, babydr, mel, clameter, lee.schermerhorn

[-- Attachment #1: Type: text/plain, Size: 2138 bytes --]

Hello,

Prompted by a report from a user, I have bisected a performance loss
apparently introduced by commit 54a6eb5c (mm: use two zonelist that are
filtered by GFP mask).  The test is simple sequential writes to a 4-disk
raid5 array.  Performance should be about 20% greater than 2.6.25 due to
commit 8b3e6cdc (md: introduce get_priority_stripe() to improve raid456
write performance).  The sample data below shows sporadic performance
starting at 54a6eb5c.  The '+' indicates where I hand-applied 8b3e6cdc.

revision   2.6.25.8-fc8 2.6.25.9+ dac1d27b+ 18ea7e71+ 54a6eb5c+ 2.6.26-rc1 2.6.26-rc8
           138          168       169       167       177       149        144
           140          168       172       170       109       138        142
           142          165       169       164       119       138        129
           144          168       169       171       120       139        135
           142          165       174       166       165       122        154
MB/s (avg) 141          167       171       168       138       137        141
% change   0%           18%       21%       19%       -2%       -3%        0%
result     base         good      good      good      [bad]     bad        bad

Notable observations:
1/ This problem does not reproduce when ARCH=x86_64, i.e. 2.6.26-rc8 and
54a6eb5c show consistent performance at 170MB/s.
2/ Single-drive performance appears to be unaffected.
3/ A quick test shows that raid0 performance is also sporadic:
   2147483648 bytes (2.1 GB) copied, 7.72408 s, 278 MB/s
   2147483648 bytes (2.1 GB) copied, 7.78478 s, 276 MB/s
   2147483648 bytes (2.1 GB) copied, 11.0323 s, 195 MB/s
   2147483648 bytes (2.1 GB) copied, 8.41244 s, 255 MB/s
   2147483648 bytes (2.1 GB) copied, 30.7649 s, 69.8 MB/s

System/Test configuration:
(2) Intel(R) Xeon(R) CPU 5150
mem=1024M
CONFIG_HIGHMEM4G=y (full config attached)
mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean
for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done

Neil suggested CONFIG_NOHIGHMEM=y; I will give that a shot tomorrow.
Other suggestions / experiments?

Thanks,
Dan

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 22024 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-01  1:57 [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) Dan Williams
@ 2008-07-01  8:09 ` Mel Gorman
  2008-07-01 17:58   ` Andy Whitcroft
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2008-07-01  8:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-mm, linux-kernel, NeilBrown, babydr, cl, lee.schermerhorn, apw

(Christoph's address corrected and Andy's added to cc)

On (30/06/08 18:57), Dan Williams didst pronounce:
> Hello,
> 
> Prompted by a report from a user I have bisected a performance loss
> apparently introduced by commit 54a6eb5c (mm: use two zonelist that are
> filtered by GFP mask).  The test is simple sequential writes to a 4 disk
> raid5 array.  Performance should be about 20% greater than 2.6.25 due to
> commit 8b3e6cdc (md: introduce get_priority_stripe() to improve raid456
> write performance).  The sample data below shows sporadic performance
> starting at 54a6eb5c.  The '+' indicates where I hand applied 8b3e6cdc.
> 
> revision   2.6.25.8-fc8 2.6.25.9+ dac1d27b+ 18ea7e71+ 54a6eb5c+ 2.6.26-rc1 2.6.26-rc8
>            138          168       169       167       177       149        144
>            140          168       172       170       109       138        142
>            142          165       169       164       119       138        129
>            144          168       169       171       120       139        135
>            142          165       174       166       165       122        154
> MB/s (avg) 141          167       171       168       138       137        141
> % change   0%           18%       21%       19%       -2%       -3%        0%
> result     base         good      good      good      [bad]     bad        bad
> 

That is not good at all: this patch is not a straightforward revert, and
this is the second time it has come under suspicion.

> Notable observations:
> 1/ This problem does not reproduce when ARCH=x86_64, i.e. 2.6.26-rc8 and
> 54a6eb5c show consistent performance at 170MB/s.

I'm very curious as to why this doesn't affect x86_64. HIGHMEM is one
possibility if GFP_KERNEL is a major factor and it has to scan over the
unusable zone a lot. However, another remote possibility is that many function
calls are more expensive on x86 than on x86_64 (this is a wild guess based
on the registers available). A speculative patch is below.

If 8b3e6cdc is reverted from 2.6.26-rc8, what do the figures look like?
i.e. is the zonelist filtering looking like a performance regression or is
it just somehow negating the benefits of the raid patch?

> 2/ Single drive performance appears to be unaffected
> 3/ A quick test shows that raid0 performance is also sporadic:
>    2147483648 bytes (2.1 GB) copied, 7.72408 s, 278 MB/s
>    2147483648 bytes (2.1 GB) copied, 7.78478 s, 276 MB/s
>    2147483648 bytes (2.1 GB) copied, 11.0323 s, 195 MB/s
>    2147483648 bytes (2.1 GB) copied, 8.41244 s, 255 MB/s
>    2147483648 bytes (2.1 GB) copied, 30.7649 s, 69.8 MB/s
> 

Are these synced writes? i.e. is it possible the performance at the end
is dropped because memory becomes full of dirty pages at that point?

> System/Test configuration:
> (2) Intel(R) Xeon(R) CPU 5150
> mem=1024M
> CONFIG_HIGHMEM4G=y (full config attached)
> mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean
> for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
> 
> Neil suggested CONFIG_NOHIGHMEM=y, I will give that a shot tomorrow.
> Other suggestions / experiments?
> 

There was a deporkify patch which replaced inline functions with normal
functions. I worried at the time that all the function calls in an
iterator might cause a performance problem, but I couldn't measure it,
so I assumed the reduction in text size was a plus. This is a partial
reporkify patch that should reduce the number of function calls that
take place at the cost of larger text. Can you try it out applied
against 2.6.26-rc8, please?

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/include/linux/mmzone.h linux-2.6.26-rc8-repork/include/linux/mmzone.h
--- linux-2.6.26-rc8-clean/include/linux/mmzone.h	2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-repork/include/linux/mmzone.h	2008-07-01 00:49:17.000000000 -0700
@@ -742,6 +742,15 @@ static inline int zonelist_node_idx(stru
 #endif /* CONFIG_NUMA */
 }
 
+static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+	return node_isset(zonelist_node_idx(zref), *nodes);
+#else
+	return 1;
+#endif /* CONFIG_NUMA */
+}
+
 /**
  * next_zones_zonelist - Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point
  * @z - The cursor used as a starting point for the search
@@ -754,10 +763,26 @@ static inline int zonelist_node_idx(stru
  * search. The zoneref returned is a cursor that is used as the next starting
  * point for future calls to next_zones_zonelist().
  */
-struct zoneref *next_zones_zonelist(struct zoneref *z,
+static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
 					nodemask_t *nodes,
-					struct zone **zone);
+					struct zone **zone)
+{
+	/*
+	 * Find the next suitable zone to use for the allocation.
+	 * Only filter based on nodemask if it's set
+	 */
+	if (likely(nodes == NULL))
+		while (zonelist_zone_idx(z) > highest_zoneidx)
+			z++;
+	else
+		while (zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes)))
+			z++;
+
+	*zone = zonelist_zone(z++);
+	return z;
+}
 
 /**
  * first_zones_zonelist - Returns the first zone at or below highest_zoneidx within the allowed nodemask in a zonelist
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/mmzone.c linux-2.6.26-rc8-repork/mm/mmzone.c
--- linux-2.6.26-rc8-clean/mm/mmzone.c	2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-repork/mm/mmzone.c	2008-07-01 00:48:19.000000000 -0700
@@ -42,33 +42,3 @@ struct zone *next_zone(struct zone *zone
 	return zone;
 }
 
-static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
-{
-#ifdef CONFIG_NUMA
-	return node_isset(zonelist_node_idx(zref), *nodes);
-#else
-	return 1;
-#endif /* CONFIG_NUMA */
-}
-
-/* Returns the next zone at or below highest_zoneidx in a zonelist */
-struct zoneref *next_zones_zonelist(struct zoneref *z,
-					enum zone_type highest_zoneidx,
-					nodemask_t *nodes,
-					struct zone **zone)
-{
-	/*
-	 * Find the next suitable zone to use for the allocation.
-	 * Only filter based on nodemask if it's set
-	 */
-	if (likely(nodes == NULL))
-		while (zonelist_zone_idx(z) > highest_zoneidx)
-			z++;
-	else
-		while (zonelist_zone_idx(z) > highest_zoneidx ||
-				(z->zone && !zref_in_nodemask(z, nodes)))
-			z++;
-
-	*zone = zonelist_zone(z++);
-	return z;
-}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-01  8:09 ` Mel Gorman
@ 2008-07-01 17:58   ` Andy Whitcroft
  2008-07-01 19:07     ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Whitcroft @ 2008-07-01 17:58 UTC (permalink / raw)
  To: Dan Williams
  Cc: Mel Gorman, linux-mm, linux-kernel, NeilBrown, babydr, cl,
	lee.schermerhorn

On Tue, Jul 01, 2008 at 09:09:11AM +0100, Mel Gorman wrote:
> (Christoph's address corrected and Andys added to cc)
> 
> On (30/06/08 18:57), Dan Williams didst pronounce:
> > Hello,
> > 
> > Prompted by a report from a user I have bisected a performance loss
> > apparently introduced by commit 54a6eb5c (mm: use two zonelist that are
> > filtered by GFP mask).  The test is simple sequential writes to a 4 disk
> > raid5 array.  Performance should be about 20% greater than 2.6.25 due to
> > commit 8b3e6cdc (md: introduce get_priority_stripe() to improve raid456
> > write performance).  The sample data below shows sporadic performance
> > starting at 54a6eb5c.  The '+' indicates where I hand applied 8b3e6cdc.
> > 
> > revision   2.6.25.8-fc8 2.6.25.9+ dac1d27b+ 18ea7e71+ 54a6eb5c+ 2.6.26-rc1 2.6.26-rc8
> >            138          168       169       167       177       149        144
> >            140          168       172       170       109       138        142
> >            142          165       169       164       119       138        129
> >            144          168       169       171       120       139        135
> >            142          165       174       166       165       122        154
> > MB/s (avg) 141          167       171       168       138       137        141
> > % change   0%           18%       21%       19%       -2%       -3%        0%
> > result     base         good      good      good      [bad]     bad        bad
> > 
> 
> That is not good at all: this patch is not a straightforward revert, and
> this is the second time it has come under suspicion.
> 
> > Notable observations:
> > 1/ This problem does not reproduce when ARCH=x86_64, i.e. 2.6.26-rc8 and
> > 54a6eb5c show consistent performance at 170MB/s.
> 
> I'm very curious as to why this doesn't affect x86_64. HIGHMEM is one
> possibility if GFP_KERNEL is a major factor and it has to scan over the
> unusable zone a lot. However, another remote possibility is that many function
> calls are more expensive on x86 than on x86_64 (this is a wild guess based
> on the registers available). A speculative patch is below.
> 
> If 8b3e6cdc is reverted from 2.6.26-rc8, what do the figures look like?
> i.e. is the zonelist filtering looking like a performance regression or is
> it just somehow negating the benefits of the raid patch?
> 
> > 2/ Single drive performance appears to be unaffected
> > 3/ A quick test shows that raid0 performance is also sporadic:
> >    2147483648 bytes (2.1 GB) copied, 7.72408 s, 278 MB/s
> >    2147483648 bytes (2.1 GB) copied, 7.78478 s, 276 MB/s
> >    2147483648 bytes (2.1 GB) copied, 11.0323 s, 195 MB/s
> >    2147483648 bytes (2.1 GB) copied, 8.41244 s, 255 MB/s
> >    2147483648 bytes (2.1 GB) copied, 30.7649 s, 69.8 MB/s
> > 
> 
> Are these synced writes? i.e. is it possible the performance at the end
> is dropped because memory becomes full of dirty pages at that point?
> 
> > System/Test configuration:
> > (2) Intel(R) Xeon(R) CPU 5150
> > mem=1024M
> > CONFIG_HIGHMEM4G=y (full config attached)
> > mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean
> > for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
> > 
> > Neil suggested CONFIG_NOHIGHMEM=y, I will give that a shot tomorrow.
> > Other suggestions / experiments?
> > 

Looking at the commit in question (54a6eb5c) there is one slight anomaly
in the conversion.  When nr_free_zone_pages() was converted to the new
iterators it started using the offset parameter to limit the zones
traversed, which is not unreasonable as that appears to be the
parameter's purpose.  However, if we look at the original implementation
of this function (reproduced below) we can see it actually did nothing
with this parameter:

static unsigned int nr_free_zone_pages(int offset)
{
	/* Just pick one node, since fallback list is circular */
	unsigned int sum = 0;

	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
	struct zone **zonep = zonelist->zones;
	struct zone *zone;

	for (zone = *zonep++; zone; zone = *zonep++) {
		unsigned long size = zone->present_pages;
		unsigned long high = zone->pages_high;
		if (size > high)
			sum += size - high;
	}

	return sum;
}

This version of the routine would only ever accumulate the free space as
found on zones in the GFP_KERNEL zonelist, all zones other than HIGHMEM,
regardless of its parameters.

Looking at its callers, there are only two:

    unsigned int nr_free_buffer_pages(void)
    {
	return nr_free_zone_pages(gfp_zone(GFP_USER));
    }
    unsigned int nr_free_pagecache_pages(void)
    {
	return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));
    }

Before this commit both would return the same value.  Following it,
nr_free_pagecache_pages() would now contain the number of pages in
HIGHMEM as well.  This is used to initialise vm_total_pages, which is
used in a number of the heuristics related to reclaim.

        vm_total_pages = nr_free_pagecache_pages();

If the issue was low memory pressure (which would not be an unreasonable
conjecture given the large number of internal I/Os we may generate
with RAID), we may well make different reclaim decisions before and after
this commit.  It should also be noted that, as this discrepancy would only
appear where there is HIGHMEM, we might expect different behaviour on i386
either side of this commit but not so on x86_64.

All that said, the simple way to confirm/eliminate this would be to
test at the first bad commit with the patch below.  This patch 'fixes'
nr_free_pagecache_pages to return the original value.  It is not clear
whether this is the right behaviour, but it is worth testing to see if
this is the root cause or not.

-apw

=== 8< ===
debug raid slowdown

There is a small difference in the calculations leading to the value of
vm_total_pages once the two-zonelist change is applied on i386 with
HIGHMEM present.  Bodge nr_free_pagecache_pages() to return the (arguably
incorrect) value it did before that change.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2f55295..5ceef27 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1745,7 +1745,7 @@ EXPORT_SYMBOL_GPL(nr_free_buffer_pages);
  */
 unsigned int nr_free_pagecache_pages(void)
 {
-	return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));
+	return nr_free_zone_pages(gfp_zone(GFP_USER));
 }
 
 static inline void show_node(struct zone *zone)



* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-01 17:58   ` Andy Whitcroft
@ 2008-07-01 19:07     ` Mel Gorman
  2008-07-01 20:29       ` Dan Williams
  2008-07-01 22:28       ` Dan Williams
  0 siblings, 2 replies; 18+ messages in thread
From: Mel Gorman @ 2008-07-01 19:07 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Dan Williams, linux-mm, linux-kernel, NeilBrown, babydr, cl,
	lee.schermerhorn

On (01/07/08 18:58), Andy Whitcroft didst pronounce:
> On Tue, Jul 01, 2008 at 09:09:11AM +0100, Mel Gorman wrote:
> > (Christoph's address corrected and Andys added to cc)
> > 
> > On (30/06/08 18:57), Dan Williams didst pronounce:
> > > Hello,
> > > 
> > > Prompted by a report from a user I have bisected a performance loss
> > > apparently introduced by commit 54a6eb5c (mm: use two zonelist that are
> > > filtered by GFP mask).  The test is simple sequential writes to a 4 disk
> > > raid5 array.  Performance should be about 20% greater than 2.6.25 due to
> > > commit 8b3e6cdc (md: introduce get_priority_stripe() to improve raid456
> > > write performance).  The sample data below shows sporadic performance
> > > starting at 54a6eb5c.  The '+' indicates where I hand applied 8b3e6cdc.
> > > 
> > > revision   2.6.25.8-fc8 2.6.25.9+ dac1d27b+ 18ea7e71+ 54a6eb5c+ 2.6.26-rc1 2.6.26-rc8
> > >            138          168       169       167       177       149        144
> > >            140          168       172       170       109       138        142
> > >            142          165       169       164       119       138        129
> > >            144          168       169       171       120       139        135
> > >            142          165       174       166       165       122        154
> > > MB/s (avg) 141          167       171       168       138       137        141
> > > % change   0%           18%       21%       19%       -2%       -3%        0%
> > > result     base         good      good      good      [bad]     bad        bad
> > > 
> > 
> > That is not good at all: this patch is not a straightforward revert, and
> > this is the second time it has come under suspicion.
> > 
> > > Notable observations:
> > > 1/ This problem does not reproduce when ARCH=x86_64, i.e. 2.6.26-rc8 and
> > > 54a6eb5c show consistent performance at 170MB/s.
> > 
> > I'm very curious as to why this doesn't affect x86_64. HIGHMEM is one
> > possibility if GFP_KERNEL is a major factor and it has to scan over the
> > unusable zone a lot. However, another remote possibility is that many function
> > calls are more expensive on x86 than on x86_64 (this is a wild guess based
> > on the registers available). A speculative patch is below.
> > 
> > If 8b3e6cdc is reverted from 2.6.26-rc8, what do the figures look like?
> > i.e. is the zonelist filtering looking like a performance regression or is
> > it just somehow negating the benefits of the raid patch?
> > 
> > > 2/ Single drive performance appears to be unaffected
> > > 3/ A quick test shows that raid0 performance is also sporadic:
> > >    2147483648 bytes (2.1 GB) copied, 7.72408 s, 278 MB/s
> > >    2147483648 bytes (2.1 GB) copied, 7.78478 s, 276 MB/s
> > >    2147483648 bytes (2.1 GB) copied, 11.0323 s, 195 MB/s
> > >    2147483648 bytes (2.1 GB) copied, 8.41244 s, 255 MB/s
> > >    2147483648 bytes (2.1 GB) copied, 30.7649 s, 69.8 MB/s
> > > 
> > 
> > Are these synced writes? i.e. is it possible the performance at the end
> > is dropped because memory becomes full of dirty pages at that point?
> > 
> > > System/Test configuration:
> > > (2) Intel(R) Xeon(R) CPU 5150
> > > mem=1024M
> > > CONFIG_HIGHMEM4G=y (full config attached)
> > > mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean
> > > for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
> > > 
> > > Neil suggested CONFIG_NOHIGHMEM=y, I will give that a shot tomorrow.
> > > Other suggestions / experiments?
> > > 
> 
> Looking at the commit in question (54a6eb5c) there is one slight anomaly
> in the conversion.  When nr_free_zone_pages() was converted to the new
> iterators it started using the offset parameter to limit the zones
> traversed, which is not unreasonable as that appears to be the
> parameter's purpose.  However, if we look at the original implementation
> of this function (reproduced below) we can see it actually did nothing
> with this parameter:
> 
> static unsigned int nr_free_zone_pages(int offset)
> {
> 	/* Just pick one node, since fallback list is circular */
> 	unsigned int sum = 0;
> 
> 	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
> 	struct zone **zonep = zonelist->zones;
> 	struct zone *zone;
> 
> 	for (zone = *zonep++; zone; zone = *zonep++) {
> 		unsigned long size = zone->present_pages;
> 		unsigned long high = zone->pages_high;
> 		if (size > high)
> 			sum += size - high;
> 	}
> 
> 	return sum;
> }
> 

This looks kinda promising and depends heavily on how this patch was
tested in isolation. Dan, can you post the patch you used on 2.6.25,
please? The commit in question should not have applied cleanly.

To be clear, 2.6.25 used the offset parameter correctly to get a zonelist with
the right zones in it. However, with two-zonelist, there is only one list that
gets filtered, so using GFP_KERNEL to find a zone is equivalent as it gets
filtered based on offset.  Still, if this patch was tested in isolation,
it could result in bogus values of vm_total_pages. Dan, can you confirm
that the line like the following in your dmesg logs has similar values,
please?

Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 258544

Right now, though, 2.6.26-rc8 and 2.6.25 report the same values for "Total
pages" in dmesg, at least under qemu running with 1GB. If
vm_total_pages were wrong, that value would be broken.

> This version of the routine would only ever accumulate the free space as
> found on zones in the GFP_KERNEL zonelist, all zones other than HIGHMEM,
> regardless of its parameters.
> 
> Looking at its callers, there are only two:
> 
>     unsigned int nr_free_buffer_pages(void)
>     {
> 	return nr_free_zone_pages(gfp_zone(GFP_USER));
>     }
>     unsigned int nr_free_pagecache_pages(void)
>     {
> 	return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));
>     }
> 
> Before this commit both would return the same value.  Following it,
> nr_free_pagecache_pages() would now contain the number of pages in
> HIGHMEM as well.  This is used to initialise vm_total_pages, which is
> used in a number of the heuristics related to reclaim.
> 
>         vm_total_pages = nr_free_pagecache_pages();
> 
> If the issue was low memory pressure (which would not be an unreasonable
> conjecture given the large number of internal I/O's we may generate
> with RAID) we may well make different reclaim decisions before and after
> this commit.  It should also be noted that as this discrepancy would only
> appear where there is HIGHMEM we might expect different behaviour on i386
> either side of this commit but not so on x86_64.
> 
> All that said, the simple way to confirm/eliminate this would be to
> test at the first bad commit, with the patch below.  This patch 'fixes'
> nr_free_pagecache_pages to return the original value.  It is not clear
> whether this is the right behaviour, but worth testing to see if this
> is the root cause or not.
> 
> -apw
> 
> === 8< ===
> debug raid slowdown
> 
> There is a small difference in the calculations leading to the value of
> vm_total_pages once two zone lists is applied on i386 architectures with
> HIGHMEM present.  Bodge nr_free_pagecache_pages() to return the (arguably
> incorrect) value it did before this change.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2f55295..5ceef27 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1745,7 +1745,7 @@ EXPORT_SYMBOL_GPL(nr_free_buffer_pages);
>   */
>  unsigned int nr_free_pagecache_pages(void)
>  {
> -	return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));
> +	return nr_free_zone_pages(gfp_zone(GFP_USER));
>  }
>  
>  static inline void show_node(struct zone *zone)
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab



* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-01 19:07     ` Mel Gorman
@ 2008-07-01 20:29       ` Dan Williams
  2008-07-02  5:18         ` Mel Gorman
  2008-07-01 22:28       ` Dan Williams
  1 sibling, 1 reply; 18+ messages in thread
From: Dan Williams @ 2008-07-01 20:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, cl,
	lee.schermerhorn

On Tue, 2008-07-01 at 12:07 -0700, Mel Gorman wrote:
> On (01/07/08 18:58), Andy Whitcroft didst pronounce:
> > > > Neil suggested CONFIG_NOHIGHMEM=y, I will give that a shot tomorrow.
> > > > Other suggestions / experiments?
> > > >
> >
> > Looking at the commit in question (54a6eb5c) there is one slight anomaly
> > in the conversion.  When nr_free_zone_pages() was converted to the new
> > iterators it started using the offset parameter to limit the zones
> > traversed, which is not unreasonable as that appears to be the
> > parameter's purpose.  However, if we look at the original implementation
> > of this function (reproduced below) we can see it actually did nothing
> > with this parameter:
> >
> > static unsigned int nr_free_zone_pages(int offset)
> > {
> >       /* Just pick one node, since fallback list is circular */
> >       unsigned int sum = 0;
> >
> >       struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
> >       struct zone **zonep = zonelist->zones;
> >       struct zone *zone;
> >
> >       for (zone = *zonep++; zone; zone = *zonep++) {
> >               unsigned long size = zone->present_pages;
> >               unsigned long high = zone->pages_high;
> >               if (size > high)
> >                       sum += size - high;
> >       }
> >
> >       return sum;
> > }
> >
> 
> This looks kinda promising and depends heavily on how this patch was
> tested in isolation. Dan, can you post the patch you used on 2.6.25,
> please? The commit in question should not have applied cleanly.
> 
> To be clear, 2.6.25 used the offset parameter correctly to get a zonelist with
> the right zones in it. However, with two-zonelist, there is only one that
> > gets filtered so using GFP_KERNEL to find a zone is equivalent as it gets
> filtered based on offset.  However, if this patch was tested in isolation,
> it could result in bogus values of vm_total_pages. Dan, can you confirm
> in your dmesg logs that the line like the following has similar values
> please?
> 
> Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 258544

The system is booted with mem=1024M on the kernel command line, and with
or without Andy's patch it reports:

	Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 227584

Performance is still sporadic with the change.  Moreover, this condition
reproduces even with CONFIG_NOHIGHMEM=y.

Let us take commit 8b3e6cdc out of the equation and just look at raid0 
performance:

revision   2.6.25.8-fc8 54a6eb5c 54a6eb5c-nohighmem 2.6.26-rc8
           279          278      273                277
           281          278      275                277
           281          113      68.7               66.8
           279          69.2     277                73.7
           278          75.6     62.5               80.3
MB/s (avg) 280          163      191                155
% change   0%           -42%     -32%               -45%
result     base         bad      bad                bad

These numbers are taken from the results of:
for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done

Where md0 is created by:
mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 0

I will try your debug patch next Mel, and then try to collect more data
with blktrace.

--
Dan







* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-01 19:07     ` Mel Gorman
  2008-07-01 20:29       ` Dan Williams
@ 2008-07-01 22:28       ` Dan Williams
  1 sibling, 0 replies; 18+ messages in thread
From: Dan Williams @ 2008-07-01 22:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, cl,
	lee.schermerhorn

On Tue, 2008-07-01 at 12:07 -0700, Mel Gorman wrote:
> This looks kinda promising and depends heavily on how this patch was
> tested in isolation. Dan, can you post the patch you used on 2.6.25,
> please? The commit in question should not have applied cleanly.

?

I have not tested this patch (54a6eb5c) in isolation.  The patch that
was applied to 2.6.25 was the raid5 performance enhancement patch.  It
is no longer in the equation since the problem is reproducible with
stock raid0.

--
Dan



* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-01 20:29       ` Dan Williams
@ 2008-07-02  5:18         ` Mel Gorman
  2008-07-03  1:49           ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2008-07-02  5:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, cl,
	lee.schermerhorn

On (01/07/08 13:29), Dan Williams didst pronounce:
> 
> On Tue, 2008-07-01 at 12:07 -0700, Mel Gorman wrote:
> > On (01/07/08 18:58), Andy Whitcroft didst pronounce:
> > > > > Neil suggested CONFIG_NOHIGHMEM=y, I will give that a shot tomorrow.
> > > > > Other suggestions / experiments?
> > > > >
> > >
> > > Looking at the commit in question (54a6eb5c) there is one slight anomaly
> > > in the conversion.  When nr_free_zone_pages() was converted to the new
> > > iterators it started using the offset parameter to limit the zones
> > > traversed, which is not unreasonable as that appears to be the
> > > parameter's purpose.  However, if we look at the original implementation
> > > of this function (reproduced below) we can see it actually did nothing
> > > with this parameter:
> > >
> > > static unsigned int nr_free_zone_pages(int offset)
> > > {
> > >       /* Just pick one node, since fallback list is circular */
> > >       unsigned int sum = 0;
> > >
> > >       struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
> > >       struct zone **zonep = zonelist->zones;
> > >       struct zone *zone;
> > >
> > >       for (zone = *zonep++; zone; zone = *zonep++) {
> > >               unsigned long size = zone->present_pages;
> > >               unsigned long high = zone->pages_high;
> > >               if (size > high)
> > >                       sum += size - high;
> > >       }
> > >
> > >       return sum;
> > > }
> > >
> > 
> > This looks kinda promising and depends heavily on how this patch was
> > tested in isolation. Dan, can you post the patch you use on 2.6.25
> > because the commit in question should not have applied cleanly please?
> > 
> > To be clear, 2.6.25 used the offset parameter correctly to get a zonelist with
> > the right zones in it. However, with two-zonelist, there is only one zonelist
> > that gets filtered, so using GFP_KERNEL to find a zone is equivalent, as it
> > gets filtered based on offset.  However, if this patch was tested in isolation,
> > it could result in bogus values of vm_total_pages. Dan, can you confirm
> > in your dmesg logs that the line like the following has similar values
> > please?
> > 
> > Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 258544
> 
> The system is booted with mem=1024M on the kernel command line and with
> or without Andy's patch this reports:
> 
> 	Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 227584
> 
> Performance is still sporadic with the change.  Moreover, this condition
> reproduces even with CONFIG_NOHIGHMEM=y.
> 
> Let us take commit 8b3e6cdc out of the equation and just look at raid0 
> performance:
> 
> revision   2.6.25.8-fc8 54a6eb5c 54a6eb5c-nohighmem 2.6.26-rc8
>            279          278      273                277
>            281          278      275                277
>            281          113      68.7               66.8
>            279          69.2     277                73.7
>            278          75.6     62.5               80.3
> MB/s (avg) 280          163      191                155
> % change   0%           -42%     -32%               -45%
> result     base         bad      bad                bad
> 

Ok, based on your other mail, 54a6eb5c here is a bisection point. The good
figures are on par with the "good" kernel, with some disastrous runs leading
to a bad average. The thing is, the bad results are way worse than could be
accounted for by two-zonelist alone. In fact, the figures look suspiciously
like only 1 disk is in use, as they are roughly quartered. Can you think of
anything that would explain that? Can you also confirm that a bisection
point before two-zonelist runs steadily and with high performance, as
expected? This is to rule out some other RAID patch being a factor.

It would be worth running vmstat during the tests so we can see if IO
rates are dropping from an overall system perspective. If possible,
oprofile data from the same time would be helpful to see whether it shows
where we are getting stuck.

> These numbers are taken from the results of:
> for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
> 
> Where md0 is created by:
> mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 0
> 
> I will try your debug patch next Mel, and then try to collect more data
> with blktrace.
> 

I tried reproducing this but I don't have the necessary hardware to even
come close to reproducing your test case :( .  I got some dbench results
with oprofile but found no significant differences between 2.6.25 and
2.6.26-rc8. However, I did find the results of dbench varied less between
runs with the "repork" patch that made next_zones_zonelist() an inline
function. Have you tried that patch yet?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-02  5:18         ` Mel Gorman
@ 2008-07-03  1:49           ` Dan Williams
  2008-07-03  4:27             ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2008-07-03  1:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, cl,
	lee.schermerhorn

On Tue, 2008-07-01 at 22:18 -0700, Mel Gorman wrote:
> > Let us take commit 8b3e6cdc out of the equation and just look at raid0
> > performance:
> >
> > revision   2.6.25.8-fc8 54a6eb5c 54a6eb5c-nohighmem 2.6.26-rc8
> >            279          278      273                277
> >            281          278      275                277
> >            281          113      68.7               66.8
> >            279          69.2     277                73.7
> >            278          75.6     62.5               80.3
> > MB/s (avg) 280          163      191                155
> > % change   0%           -42%     -32%               -45%
> > result     base         bad      bad                bad
> >
> 
> Ok, based on your other mail, 54a6eb5c here is a bisection point. The good
> figures are on par with the "good" kernel, with some disastrous runs leading
> to a bad average. The thing is, the bad results are way worse than could be
> accounted for by two-zonelist alone.  In fact, the figures look suspiciously
> like only 1 disk is in use, as they are roughly quartered. Can you think of
> anything that would explain that?

raid0, in contrast to raid5, is fairly simple: it just remaps the bio.
Then it is up to the block layer to do some merging before it hits the
block device driver.  Not much room for error, especially since single
drive performance seems to be unaffected.  The data seems to show that
the VM is not queuing pages for writeback fast enough.

> Can you also confirm that a bisection point before two-zonelist runs
> steadily and with high performance, as expected? This is to rule out
> some other RAID patch being a factor.

The immediately preceding commit 18ea7e71 always runs with high
performance, and raid0 was virtually untouched during the merge window:

$ git shortlog v2.6.25..v2.6.26-rc8 drivers/md/raid0.c
Neil Brown (1):
      Remove blkdev warning triggered by using md

> It would be worth running vmstat during the tests so we can see if IO
> rates are dropping from an overall system perspective. If possible,
> oprofile data from the same time would be helpful to see whether it shows
> where we are getting stuck.

I watched /proc/vmstat during a dd run, and in the good case
nr_writeback never drops below 21,000.  In the bad case it may briefly
break a few hundred pages.

This picture generated by seekwatcher shows that there are gaps in i/o
submission:

	http://userweb.kernel.org/~djbw/md0-bad2.png

Compared to a good run:

	http://userweb.kernel.org/~djbw/md0-good.png

What is interesting is that I was only able to get this data with the
following command:

./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb

When I tried to watch all four disks, i.e. 5 blktrace tasks running
instead of 2:

./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb -d /dev/sdc -d /dev/sdd -d /dev/sde

...it always measured good performance, as if having more tasks running
stimulates the VM to keep writeback going at a high rate.  Could
writeback be getting stalled on one or more CPUs?

Oprofile only seems to show that we are idling in the failure case:
---good---
samples  %        image name         symbol name
18229    20.7171  vmlinux-18ea7e71   __copy_from_user_ll_nocache_nozero
4518      5.1347  vmlinux-18ea7e71   clear_user
2137      2.4287  vmlinux-18ea7e71   get_page_from_freelist
1929      2.1923  vmlinux-18ea7e71   constant_test_bit
1745      1.9832  vmlinux-18ea7e71   clear_bit
1320      1.5002  vmlinux-18ea7e71   kmem_cache_alloc
1290      1.4661  vmlinux-18ea7e71   blk_rq_map_sg
1287      1.4627  vmlinux-18ea7e71   bio_put
1022      1.1615  raid0.ko           raid0_make_request
938       1.0660  vmlinux-18ea7e71   __make_request
938       1.0660  vmlinux-18ea7e71   test_clear_page_writeback
899       1.0217  vmlinux-18ea7e71   __percpu_counter_add
---bad---
samples  %        image name    symbol name
17844    16.3714  vmlinux-54a6eb5c   __copy_from_user_ll_nocache_nozero
4565      4.1883  vmlinux-54a6eb5c   clear_user
2229      2.0450  vmlinux-54a6eb5c   get_page_from_freelist
2202      2.0203  vmlinux-54a6eb5c   mwait_idle
2129      1.9533  vmlinux-54a6eb5c   schedule
1882      1.7267  vmlinux-54a6eb5c   constant_test_bit
1730      1.5872  vmlinux-54a6eb5c   clear_bit
1157      1.0615  vmlinux-54a6eb5c   kmem_cache_alloc


> > These numbers are taken from the results of:
> > for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
> >
> > Where md0 is created by:
> > mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 0
> >
> > I will try your debug patch next Mel, and then try to collect more data
> > with blktrace.
> >
> 
> I tried reproducing this but I don't have the necessary hardware to even
> come close to reproducing your test case :( .  I got some dbench results
> with oprofile but found no significant differences between 2.6.25 and
> 2.6.26-rc8. However, I did find the results of dbench varied less between
> runs with the "repork" patch that made next_zones_zonelist() an inline
> function. Have you tried that patch yet?

I gave the patch a shot but observed no change.  As you predicted, we are
looking at something more pronounced than a few lost CPU cycles.

--
Dan


* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-03  1:49           ` Dan Williams
@ 2008-07-03  4:27             ` Mel Gorman
  2008-07-03  4:43               ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2008-07-03  4:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Whitcroft, linux-mm, linux-kernel, NeilBrown, babydr, cl,
	lee.schermerhorn, a.beregalov, torvalds, akpm

patch is below

On (02/07/08 18:49), Dan Williams didst pronounce:
> On Tue, 2008-07-01 at 22:18 -0700, Mel Gorman wrote:
> > > Let us take commit 8b3e6cdc out of the equation and just look at raid0
> > > performance:
> > >
> > > revision   2.6.25.8-fc8 54a6eb5c 54a6eb5c-nohighmem 2.6.26-rc8
> > >            279          278      273                277
> > >            281          278      275                277
> > >            281          113      68.7               66.8
> > >            279          69.2     277                73.7
> > >            278          75.6     62.5               80.3
> > > MB/s (avg) 280          163      191                155
> > > % change   0%           -42%     -32%               -45%
> > > result     base         bad      bad                bad
> > >
> > 
> > Ok, based on your other mail, 54a6eb5c here is a bisection point. The good
> > figures are on par with the "good" kernel, with some disastrous runs leading
> > to a bad average. The thing is, the bad results are way worse than could be
> > accounted for by two-zonelist alone.  In fact, the figures look suspiciously
> > like only 1 disk is in use, as they are roughly quartered. Can you think of
> > anything that would explain that?
> 
> raid0, in contrast to raid5, is fairly simple: it just remaps the bio.
> Then it is up to the block layer to do some merging before it hits the
> block device driver.  Not much room for error, especially since single
> drive performance seems to be unaffected.  The data seems to show that
> the VM is not queuing pages for writeback fast enough.
> 

That is exactly the case but I think I know why now.

> > Can you also confirm that a bisection point before two-zonelist runs
> > steadily and with high performance, as expected? This is to rule out
> > some other RAID patch being a factor.
> 
> The immediately preceding commit 18ea7e71 always runs with high
> performance, and raid0 was virtually untouched during the merge window:
> 
> $ git shortlog v2.6.25..v2.6.26-rc8 drivers/md/raid0.c
> Neil Brown (1):
>       Remove blkdev warning triggered by using md
> 
> > It would be worth running vmstat during the tests so we can see if IO
> > rates are dropping from an overall system perspective. If possible,
> > oprofile data from the same time would be helpful to see whether it shows
> > where we are getting stuck.
> 
> I watched /proc/vmstat during a dd run, and in the good case
> nr_writeback never drops below 21,000.  In the bad case it may briefly
> break a few hundred pages.
> 
> This picture generated by seekwatcher shows that there are gaps in i/o
> submission:
> 
> 	http://userweb.kernel.org/~djbw/md0-bad2.png
> 
> Compared to a good run:
> 
> 	http://userweb.kernel.org/~djbw/md0-good.png
> 
> What is interesting is that I was only able to get this data with the
> following command:
> 
> ./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb
> 
> When I tried to watch all four disks, i.e. 5 blktrace tasks running
> instead of 2:
> 
> ./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb -d /dev/sdc -d /dev/sdd -d /dev/sde
> 
> ...it always measured good performance, as if having more tasks running
> stimulates the VM to keep writeback going at a high rate.  Could
> writeback be getting stalled on one or more CPUs?
> 

Yes, we're stalling because the two-zonelist patch breaks kswapd on
!NUMA machines. The impact is that kswapd never reclaims, so processes
always direct reclaim. The result is the large stalls you see.

Alexander, I believe this may also be why lockdep is triggering and you're
seeing periodic stalls in the XFS tests. The lockdep warning is a red herring
but the fact it's easier to trigger could be because kswapd is not working
and direct reclaim triggers the warning easily.

Alexander and Dan, could you test the patch below and let me know if it
fixes your respective problems please?

===

Subject: [PATCH] Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr

With the two-zonelist patches on !NUMA machines, there really is only one
zonelist as __GFP_THISNODE is meaningless. However, during initialisation, the
assumption is made that two zonelists exist when initialising zlcache_ptr. The
result is that pgdat->nr_zones is always 0. As kswapd uses this value to
determine what reclaim work is necessary, the result is that kswapd never
reclaims. This causes processes to stall frequently in low-memory situations
as they always direct reclaim.  This patch initialises zlcache_ptr correctly.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 page_alloc.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
--- linux-2.6.26-rc8-clean/mm/page_alloc.c	2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c	2008-07-02 21:14:16.000000000 -0700
@@ -2328,7 +2328,8 @@ static void build_zonelists(pg_data_t *p
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
 	pgdat->node_zonelists[0].zlcache_ptr = NULL;
-	pgdat->node_zonelists[1].zlcache_ptr = NULL;
+	if (NUMA_BUILD)
+		pgdat->node_zonelists[1].zlcache_ptr = NULL;
 }
 
 #endif	/* CONFIG_NUMA */


* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-03  4:27             ` Mel Gorman
@ 2008-07-03  4:43               ` Linus Torvalds
  2008-07-03  5:00                 ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2008-07-03  4:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dan Williams, Andy Whitcroft, linux-mm, linux-kernel, NeilBrown,
	babydr, cl, lee.schermerhorn, a.beregalov, akpm


On Thu, 3 Jul 2008, Mel Gorman wrote:

> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
> --- linux-2.6.26-rc8-clean/mm/page_alloc.c	2008-06-24 18:58:20.000000000 -0700
> +++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c	2008-07-02 21:14:16.000000000 -0700
> @@ -2328,7 +2328,8 @@ static void build_zonelists(pg_data_t *p
>  static void build_zonelist_cache(pg_data_t *pgdat)
>  {
>  	pgdat->node_zonelists[0].zlcache_ptr = NULL;
> -	pgdat->node_zonelists[1].zlcache_ptr = NULL;
> +	if (NUMA_BUILD)
> +		pgdat->node_zonelists[1].zlcache_ptr = NULL;
>  }

This makes no sense.

That whole thing is inside a

	#ifdef CONFIG_NUMA
	... numa code ..
	#else
	... this code ..
	#endif

so CONFIG_NUMA will _not_ be set, and NUMA_BUILD is always 0.

So why do that

	if (NUMA_BUILD)
		..

at all, when it is known to be false?

So the patch may be correct, but wouldn't it be better to just remove the 
line entirely, instead of moving it into a conditional that cannot be 
true?

Also, I'm not quite seeing why those zonelists should be zeroed out at 
all. Shouldn't a non-NUMA setup always aim to have node_zonelists[0] == 
node_zonelists[1] == all appropriate zones?

I have to say, the whole mmzone thing is confusing. The code makes my eyes 
bleed. I can't really follow it.

		Linus


* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-03  4:43               ` Linus Torvalds
@ 2008-07-03  5:00                 ` Mel Gorman
  2008-07-03  5:54                   ` Dan Williams
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2008-07-03  5:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dan Williams, Andy Whitcroft, linux-mm, linux-kernel, NeilBrown,
	babydr, cl, lee.schermerhorn, a.beregalov, akpm

On (02/07/08 21:43), Linus Torvalds didst pronounce:
> 
> 
> On Thu, 3 Jul 2008, Mel Gorman wrote:
> 
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
> > --- linux-2.6.26-rc8-clean/mm/page_alloc.c	2008-06-24 18:58:20.000000000 -0700
> > +++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c	2008-07-02 21:14:16.000000000 -0700
> > @@ -2328,7 +2328,8 @@ static void build_zonelists(pg_data_t *p
> >  static void build_zonelist_cache(pg_data_t *pgdat)
> >  {
> >  	pgdat->node_zonelists[0].zlcache_ptr = NULL;
> > -	pgdat->node_zonelists[1].zlcache_ptr = NULL;
> > +	if (NUMA_BUILD)
> > +		pgdat->node_zonelists[1].zlcache_ptr = NULL;
> >  }
> 
> This makes no sense.
> 
> That whole thing is inside a
> 
> 	#ifdef CONFIG_NUMA
> 	... numa code ..
> 	#else
> 	... this code ..
> 	#endif
> 
> so CONFIG_NUMA will _not_ be set, and NUMA_BUILD is always 0.
> 
> So why do that
> 
> 	if (NUMA_BUILD)
> 		..
> 
> at all, when it is known to be false?
> 

Because I'm a muppet and a bit cross-eyed from looking at this until the
problem would reveal itself. I knew this needed fixing but chose the
most stupid way possible to fix it.

> So the patch may be correct, but wouldn't it be better to just remove the 
> line entirely, instead of moving it into a conditional that cannot be 
> true?
> 

Yes, revised patch below. Same fix, more sensible.

> Also, I'm not quite seeing why those zonelists should be zeroed out at 
> all. Shouldn't a non-NUMA setup always aim to have node_zonelists[0] == 
> node_zonelists[1] == all appropriate zones?
> 

node_zonelists[1] doesn't exist at all on non-NUMA so there is only
node_zonelists[0] that is all appropriate zones.

> I have to say, the whole mmzoen thing is confusing. The code makes my eyes 
> bleed. I can't really follow it.

I've been looking at this for too long to comment with anything other than
a blank stare :/

====
Subject: [PATCH] Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr

With the two-zonelist patches on !NUMA machines, there really is only one
zonelist as __GFP_THISNODE is meaningless. However, during initialisation, the
assumption is made that two zonelists exist when initialising zlcache_ptr. The
result is that pgdat->nr_zones is always 0. As kswapd uses this value to
determine what reclaim work is necessary, the result is that kswapd never
reclaims. This causes processes to stall frequently in low-memory situations
as they always direct reclaim.  This patch initialises zlcache_ptr correctly.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 page_alloc.c |    1 -
 1 file changed, 1 deletion(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
--- linux-2.6.26-rc8-clean/mm/page_alloc.c	2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c	2008-07-02 21:49:09.000000000 -0700
@@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *p
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
 	pgdat->node_zonelists[0].zlcache_ptr = NULL;
-	pgdat->node_zonelists[1].zlcache_ptr = NULL;
 }
 
 #endif	/* CONFIG_NUMA */


* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-03  5:00                 ` Mel Gorman
@ 2008-07-03  5:54                   ` Dan Williams
  2008-07-03 13:37                     ` Christoph Lameter
  2008-07-03 16:38                     ` [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) Mel Gorman
  0 siblings, 2 replies; 18+ messages in thread
From: Dan Williams @ 2008-07-03  5:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Andy Whitcroft, linux-mm, linux-kernel,
	NeilBrown, babydr, cl, lee.schermerhorn, a.beregalov, akpm

On Wed, 2008-07-02 at 22:00 -0700, Mel Gorman wrote:

> Subject: [PATCH] Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr
> 
> With the two-zonelist patches on !NUMA machines, there really is only one
> zonelist as __GFP_THISNODE is meaningless. However, during initialisation, the
> assumption is made that two zonelists exist when initialising zlcache_ptr. The
> result is that pgdat->nr_zones is always 0. As kswapd uses this value to
> determine what reclaim work is necessary, the result is that kswapd never
> reclaims. This causes processes to stall frequently in low-memory situations
> as they always direct reclaim.  This patch initialises zlcache_ptr correctly.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  page_alloc.c |    1 -
>  1 file changed, 1 deletion(-)
> 
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
> --- linux-2.6.26-rc8-clean/mm/page_alloc.c      2008-06-24 18:58:20.000000000 -0700
> +++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c 2008-07-02 21:49:09.000000000 -0700
> @@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *p
>  static void build_zonelist_cache(pg_data_t *pgdat)
>  {
>         pgdat->node_zonelists[0].zlcache_ptr = NULL;
> -       pgdat->node_zonelists[1].zlcache_ptr = NULL;
>  }
> 
>  #endif /* CONFIG_NUMA */
> 

Bug squished.

# for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 7.73352 s, 278 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 7.6845 s, 279 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 7.74428 s, 277 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 7.65959 s, 280 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 7.73107 s, 278 MB/s

Tested-by: Dan Williams <dan.j.williams@intel.com>




* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-03  5:54                   ` Dan Williams
@ 2008-07-03 13:37                     ` Christoph Lameter
  2008-07-03 16:36                       ` [PATCH] Do not clobber pgdat->nr_zones during memory initialisation Mel Gorman
  2008-07-03 16:38                     ` [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) Mel Gorman
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2008-07-03 13:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Mel Gorman, Linus Torvalds, Andy Whitcroft, linux-mm,
	linux-kernel, NeilBrown, babydr, lee.schermerhorn, a.beregalov,
	akpm

What a convoluted description. Simply put: We clobber the nr_zones field
because we write beyond the bounds of the node_zonelists[] array in
struct pglist_data.


* [PATCH] Do not clobber pgdat->nr_zones during memory initialisation
  2008-07-03 13:37                     ` Christoph Lameter
@ 2008-07-03 16:36                       ` Mel Gorman
  2008-07-03 16:44                         ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2008-07-03 16:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dan Williams, Linus Torvalds, Andy Whitcroft, linux-mm,
	linux-kernel, NeilBrown, babydr, lee.schermerhorn, a.beregalov,
	akpm

On (03/07/08 08:37), Christoph Lameter didst pronounce:
> What a convoluted description. Simply put: We clobber the nr_zones field
> because we write beyond the bounds of the node_zonelists[] array in
> struct pglist_data.
> 

Subject: [PATCH] Do not clobber pgdat->nr_zones during memory initialisation

The nr_zones field is getting clobbered due to a write beyond the bounds of
the node_zonelists[] array in struct pglist_data.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    1 -
 1 file changed, 1 deletion(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
--- linux-2.6.26-rc8-clean/mm/page_alloc.c	2008-06-24 18:58:20.000000000 -0700
+++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c	2008-07-02 21:49:09.000000000 -0700
@@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *p
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
 	pgdat->node_zonelists[0].zlcache_ptr = NULL;
-	pgdat->node_zonelists[1].zlcache_ptr = NULL;
 }
 
 #endif	/* CONFIG_NUMA */


* Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected)
  2008-07-03  5:54                   ` Dan Williams
  2008-07-03 13:37                     ` Christoph Lameter
@ 2008-07-03 16:38                     ` Mel Gorman
  1 sibling, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2008-07-03 16:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linus Torvalds, Andy Whitcroft, linux-mm, linux-kernel,
	NeilBrown, babydr, cl, lee.schermerhorn, a.beregalov, akpm

On (02/07/08 22:54), Dan Williams didst pronounce:
> 
> On Wed, 2008-07-02 at 22:00 -0700, Mel Gorman wrote:
> 
> > Subject: [PATCH] Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr
> > 
> > With the two-zonelist patches on !NUMA machines, there really is only one
> > zonelist as __GFP_THISNODE is meaningless. However, during initialisation, the
> > assumption is made that two zonelists exist when initialising zlcache_ptr. The
> > result is that pgdat->nr_zones is always 0. As kswapd uses this value to
> > determine what reclaim work is necessary, the result is that kswapd never
> > reclaims. This causes processes to stall frequently in low-memory situations
> > as they always direct reclaim.  This patch initialises zlcache_ptr correctly.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  page_alloc.c |    1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.26-rc8-clean/mm/page_alloc.c linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c
> > --- linux-2.6.26-rc8-clean/mm/page_alloc.c      2008-06-24 18:58:20.000000000 -0700
> > +++ linux-2.6.26-rc8-fix-kswapd-on-numa/mm/page_alloc.c 2008-07-02 21:49:09.000000000 -0700
> > @@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *p
> >  static void build_zonelist_cache(pg_data_t *pgdat)
> >  {
> >         pgdat->node_zonelists[0].zlcache_ptr = NULL;
> > -       pgdat->node_zonelists[1].zlcache_ptr = NULL;
> >  }
> > 
> >  #endif /* CONFIG_NUMA */
> > 
> 
> Bug squished.
> 
> # for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 7.73352 s, 278 MB/s
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 7.6845 s, 279 MB/s
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 7.74428 s, 277 MB/s
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 7.65959 s, 280 MB/s
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 7.73107 s, 278 MB/s
> 
> Tested-by: Dan Williams <dan.j.williams@intel.com>
> 

Great news. Dan, thanks a lot for reporting and persisting with the testing
of various bits and pieces to get this pinned down. It is greatly appreciated.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH] Do not clobber pgdat->nr_zones during memory initialisation
  2008-07-03 16:36                       ` [PATCH] Do not clobber pgdat->nr_zones during memory initialisation Mel Gorman
@ 2008-07-03 16:44                         ` Linus Torvalds
  2008-07-03 16:46                           ` Linus Torvalds
  2008-07-03 17:16                           ` Mel Gorman
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2008-07-03 16:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Dan Williams, Andy Whitcroft, linux-mm,
	linux-kernel, NeilBrown, babydr, lee.schermerhorn, a.beregalov,
	akpm


On Thu, 3 Jul 2008, Mel Gorman wrote:
> 
> Subject: [PATCH] Do not clobber pgdat->nr_zones during memory initialisation

Heh. I already applied it as ObviouslyCorrect(tm), but did the 
simplification I already pointed out (and which your second version 
already had) and rewrote your commit message a bit. So it's now committed 
as follows..

		Linus

---
commit 494de90098784b8e2797598cefdd34188884ec2e
Author: Mel Gorman <mel@csn.ul.ie>
Date:   Thu Jul 3 05:27:51 2008 +0100

    Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr
    
    The non-NUMA case of build_zonelist_cache() would initialize the
    zlcache_ptr for both node_zonelists[] to NULL.
    
    Which is problematic, since non-NUMA only has a single node_zonelists[]
    entry, and trying to zero the non-existent second one just overwrote the
    nr_zones field instead.
    
    As kswapd uses this value to determine what reclaim work is necessary,
    the result is that kswapd never reclaims.  This causes processes to
    stall frequently in low-memory situations as they always direct reclaim.
    This patch initialises zlcache_ptr correctly.
    
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Tested-by: Dan Williams <dan.j.williams@intel.com>
    [ Simplified patch a bit ]
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/page_alloc.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2f55295..f32fae3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *pgdat)
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
 	pgdat->node_zonelists[0].zlcache_ptr = NULL;
-	pgdat->node_zonelists[1].zlcache_ptr = NULL;
 }
 
 #endif	/* CONFIG_NUMA */



* Re: [PATCH] Do not clobber pgdat->nr_zones during memory initialisation
  2008-07-03 16:44                         ` Linus Torvalds
@ 2008-07-03 16:46                           ` Linus Torvalds
  2008-07-03 17:16                           ` Mel Gorman
  1 sibling, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2008-07-03 16:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Dan Williams, Andy Whitcroft, linux-mm,
	linux-kernel, NeilBrown, babydr, lee.schermerhorn, a.beregalov,
	akpm


On Thu, 3 Jul 2008, Linus Torvalds wrote:
>
> So it's now committed as follows..

Side note: master.kernel.org is going through maintenance today, so I 
haven't actually been able to push it out yet, in case anybody wonders why 
it's not showing up on the public sites.

It will all be out there later today.

			Linus



* Re: [PATCH] Do not clobber pgdat->nr_zones during memory initialisation
  2008-07-03 16:44                         ` Linus Torvalds
  2008-07-03 16:46                           ` Linus Torvalds
@ 2008-07-03 17:16                           ` Mel Gorman
  1 sibling, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2008-07-03 17:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Dan Williams, Andy Whitcroft, linux-mm,
	linux-kernel, NeilBrown, babydr, lee.schermerhorn, a.beregalov,
	akpm

On (03/07/08 09:44), Linus Torvalds didst pronounce:
> 
> 
> On Thu, 3 Jul 2008, Mel Gorman wrote:
> > 
> > Subject: [PATCH] Do not clobber pgdat->nr_zones during memory initialisation
> 
> Heh. I already applied it as ObviouslyCorrect(tm), but did the 
> simplification I already pointed out (and which your second version 
> already had) and rewrote your commit message a bit. So it's now committed 
> as follows..
> 

Perfect. The log even looks like it was written by a sane person. Thanks
for that.

> 		Linus
> 
> ---
> commit 494de90098784b8e2797598cefdd34188884ec2e
> Author: Mel Gorman <mel@csn.ul.ie>
> Date:   Thu Jul 3 05:27:51 2008 +0100
> 
>     Do not overwrite nr_zones on !NUMA when initialising zlcache_ptr
>     
>     The non-NUMA case of build_zonelist_cache() would initialize the
>     zlcache_ptr for both node_zonelists[] to NULL.
>     
>     Which is problematic, since non-NUMA only has a single node_zonelists[]
>     entry, and trying to zero the non-existent second one just overwrote the
>     nr_zones field instead.
>     
>     As kswapd uses this value to determine what reclaim work is necessary,
>     the result is that kswapd never reclaims.  This causes processes to
>     stall frequently in low-memory situations as they always direct reclaim.
>     This patch initialises zlcache_ptr correctly.
>     
>     Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>     Tested-by: Dan Williams <dan.j.williams@intel.com>
>     [ Simplified patch a bit ]
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  mm/page_alloc.c |    1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2f55295..f32fae3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2328,7 +2328,6 @@ static void build_zonelists(pg_data_t *pgdat)
>  static void build_zonelist_cache(pg_data_t *pgdat)
>  {
>  	pgdat->node_zonelists[0].zlcache_ptr = NULL;
> -	pgdat->node_zonelists[1].zlcache_ptr = NULL;
>  }
>  
>  #endif	/* CONFIG_NUMA */
> 

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab



end of thread, other threads:[~2008-07-03 17:16 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-01  1:57 [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) Dan Williams
2008-07-01  8:09 ` Mel Gorman
2008-07-01 17:58   ` Andy Whitcroft
2008-07-01 19:07     ` Mel Gorman
2008-07-01 20:29       ` Dan Williams
2008-07-02  5:18         ` Mel Gorman
2008-07-03  1:49           ` Dan Williams
2008-07-03  4:27             ` Mel Gorman
2008-07-03  4:43               ` Linus Torvalds
2008-07-03  5:00                 ` Mel Gorman
2008-07-03  5:54                   ` Dan Williams
2008-07-03 13:37                     ` Christoph Lameter
2008-07-03 16:36                       ` [PATCH] Do not clobber pgdat->nr_zones during memory initialisation Mel Gorman
2008-07-03 16:44                         ` Linus Torvalds
2008-07-03 16:46                           ` Linus Torvalds
2008-07-03 17:16                           ` Mel Gorman
2008-07-03 16:38                     ` [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86 (bisected) Mel Gorman
2008-07-01 22:28       ` Dan Williams
