[PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3
@ 2009-11-12 19:30 Mel Gorman
  2009-11-12 19:30 ` [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed Mel Gorman
                   ` (7 more replies)
  0 siblings, 8 replies; 65+ messages in thread
From: Mel Gorman @ 2009-11-12 19:30 UTC (permalink / raw)
  To: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker
  Cc: linux-kernel, linux-mm@kvack.org",
	KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List,
	Mel Gorman

Sorry for the long delay in posting another version. Testing is extremely
time-consuming and I wasn't getting to work on this as much as I'd have liked.

Changelog since V2
  o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
    testing, it made latencies even worse as kswapd slept more on high-order
    congestion causing order-0 direct reclaims.
  o Added changes to how congestion_wait() works
  o Added a number of new patches altering the behaviour of reclaim

Since 2.6.31-rc1, there have been an increasing number of GFP_ATOMIC
failures. A significant number of these have been high-order GFP_ATOMIC
failures and while they are generally brushed away, there has been a large
increase in them recently and there are a number of possible areas the
problem could be in - core vm, page writeback and a specific driver. The
bugs affected by this that I am aware of are;

[Bug #14141] order 2 page allocation failures in iwlagn
[Bug #14141] order 2 page allocation failures (generic)
[Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
[No BZ ID]   Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
[No BZ ID]   page allocation failure message kernel 2.6.31.4 (tty-related)

The following are a series of patches that bring the behaviour of reclaim
and the page allocator more in line with 2.6.30.

Patches 1-3 should be tested first. The testing I've done shows that the
page allocator and behaviour of congestion_wait() is more in line with
2.6.30 than the vanilla kernels.

It'd be nice to have 2 more tests, applying each patch on top noting any
behaviour change. i.e. ideally there would be results for

 o patches 1+2+3
 o patches 1+2+3+4
 o patches 1+2+3+4+5

Of course, any tests results are welcome. The rest of the mail is the
results of my own tests.

I've tested against 2.6.31 and 2.6.32-rc6. I've somewhat replicated the
problem in Bug #14141 and believe the other bugs are variations of the same
style of problem. The basic reproduction case was;

1. X86-64 AMD Phenom and X86 P4 booted with mem=512MB. Expectation is
	any machine will do as long as it's 512MB for the size of workload
	involved.

2. A crypted work partition and swap partition was created. On my
   own setup, I gave no passphrase so it'd be easier to activate without
   interaction but there are multiple options. I should have taken better
   notes but the setup goes something like this;

	cryptsetup create -y crypt-partition /dev/sda5
	pvcreate /dev/mapper/crypt-partition
	vgcreate crypt-volume /dev/mapper/crypt-partition
	lvcreate -L 5G -n crypt-logical crypt-volume
	lvcreate -L 2G -n crypt-swap crypt-volume
	mkfs -t ext3 /dev/crypt-volume/crypt-logical
	mkswap /dev/crypt-volume/crypt-swap

3. With the partition mounted on /scratch, I
	cd /scratch
	mkdir music
	git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6

4. On a normal partition, I expand a tarball containing test scripts available at
	http://www.csn.ul.ie/~mel/postings/latency-20091112/latency-tests-with-results.tar.gz

	There are two helper programs that run as part of the test - a fake
	music player and a fake gitk.

	The fake music player uses rsync with bandwidth limits to start
	downloading a music folder from another machine. It's bandwidth
	limited to simulate playing music over NFS. I believe it generates
	similar if not exact traffic to a music player. It occured to be
	afterwards that if one patched ogg123 to print a line when 1/10th
	of a seconds worth of music was played, it could be used as an
	indirect measure of desktop interactivity and help pin down pesky
	"audio skips" bug reports.

	The fake gitk is based on observing roughly what gitk does using
	strace. It loads all the logs into a large buffer and then builds a
	very basic hash map of parent to child commits.  The data is stored
	because it was insufficient just to read the logs. It had to be
	kept in an in-memory buffer to generate swap.  It then discards the
	data and does it over again in a loop for a small number of times
	so the test is finite. When it processes a large number of commits,
	it outputs a line to stdout so that stalls can be observed. Ideal
	behaviour is that commits are read at a constant rate and latencies
	look flat.

	Output from the two programs is piped through another script -
	latency-output. It records how far into the test it was when the
	line was outputted and what the latency was since the last line
	appeared. The latency should always be very smooth. Because pipes
	buffer IO, they are all run by expect_unbuffered which is available
	from expect-dev on Debian at least.

	All the tests are driven via run-test.sh. While the tests run,
	it records the kern.log to track page allocation failures, records
	nr_writeback at regular intervals and tracks Page IO and Swap IO.

5. For running an actual test, a kernel is built, booted, the
	crypted partition activated, lvm restarted,
	/dev/crypt-volume/crypt-logical mounted on /scratch, all
	swap partitions turned off and then the swap partition on
	/dev/crypt-volume/crypt-swap activated. I then run run-test.sh from
	the tarball

6. Run the test script

To evaluate the patches, I considered three basic metrics.

o The length of time it takes fake-gitk to complete on average
o How often and how long fake-gitk stalled for
o How long was spent in congestion_wait

All generated data is in the tarball.

On X86, the results I got were

2.6.30-0000000-force-highorder           Elapsed:10:59.095  Failures:0

2.6.31-0000000-force-highorder           Elapsed:11:53.505  Failures:0
2.6.31-revert-8aa7e847                   Elapsed:14:01.595  Failures:0
2.6.31-0000012-pgalloc-2.6.30            Elapsed:13:32.237  Failures:0
2.6.31-0000123-congestion-both           Elapsed:12:44.170  Failures:0
2.6.31-0001234-kswapd-quick-recheck      Elapsed:10:35.327  Failures:0
2.6.31-0012345-adjust-priority           Elapsed:11:02.995  Failures:0

2.6.32-rc6-0000000-force-highorder       Elapsed:18:18.562  Failures:0
2.6.32-rc6-revert-8aa7e847               Elapsed:10:29.278  Failures:0
2.6.32-rc6-0000012-pgalloc-2.6.30        Elapsed:13:32.393  Failures:0
2.6.32-rc6-0000123-congestion-both       Elapsed:14:55.265  Failures:0
2.6.32-rc6-0001234-kswapd-quick-recheck  Elapsed:13:35.628  Failures:0
2.6.32-rc6-0012345-adjust-priority       Elapsed:12:41.278  Failures:0

The 0000000-force-highorder is a vanilla kernel patched so that network
receive always results in an order-2 allocation. This machine wasn't
suffering page allocation failures even under this circumstance. However,
note how slow 2.6.32-rc6 is and how much the revert helps. With the patches
applied, there is comparable performance.

Latencies were generally reduced with the patches applied. 2.6.32-rc6 was
particularly crazy with long stalls measured over the duration of the test
but has comparable latencies with 2.6.30 with the patches applied.

congestion_wait behaviour is more in line with 2.6.30 after the
patches with similar amounts of time being spent.  In general,
2.6.32-rc6-0012345-adjust-priority waits for longer than 2.6.30 or the
reverted kernels did. It also waits in more instances such as inside
shrink_inactive_list() where it didn't before. Forcing behaviour like 2.6.30
resulted in good figures but I couldn't justify the patches with anything
more solid than "in tests, it behaves well even though it doesn't make a
lot of sense"

On X86-64, the results I got were

2.6.30-0000000-force-highorder           Elapsed:09:48.545  Failures:0

2.6.31-0000000-force-highorder           Elapsed:09:13.020  Failures:0
2.6.31-revert-8aa7e847                   Elapsed:09:02.120  Failures:0
2.6.31-0000012-pgalloc-2.6.30            Elapsed:08:52.742  Failures:0
2.6.31-0000123-congestion-both           Elapsed:08:59.375  Failures:0
2.6.31-0001234-kswapd-quick-recheck      Elapsed:09:19.208  Failures:0
2.6.31-0012345-adjust-priority           Elapsed:09:39.225  Failures:0

2.6.32-rc6-0000000-force-highorder       Elapsed:19:38.585  Failures:5
2.6.32-rc6-revert-8aa7e847               Elapsed:17:21.257  Failures:0
2.6.32-rc6-0000012-pgalloc-2.6.30        Elapsed:18:56.682  Failures:1
2.6.32-rc6-0000123-congestion-both       Elapsed:16:08.340  Failures:0
2.6.32-rc6-0001234-kswapd-quick-recheck  Elapsed:18:11.200  Failures:7
2.6.32-rc6-0012345-adjust-priority       Elapsed:21:33.158  Failures:0

Failures were down and my impression was that it was much harder to cause
failures. Performance on mainline is still not as good as 2.6.30. On
this particular machine, I was able to force performance to be in line
but not with any patch I could justify in the general case.

Latencies were slightly reduced by applying the patches against 2.6.31.
against 2.6.32-rc6, applying the patches significantly reduced the latencies
but they are still significant. I'll continue to investigate what can be
done to improve this further.

Again, congestion_wait() is more in line with 2.6.30 when the patches
are applied. Similarly to X86, almost identical behaviour can be forced
by waiting on BLK_ASYNC_BOTH for each caller to congestion_wait() in the
reclaim and allocator paths.

Bottom line, the patches made triggering allocation failures much harder
and in a number of instances and latencies are reduced when the system
is under load. I will keep looking around this area - particularly the
performance under load on 2.6.32-rc6 but with 2.6.32 almost out the door,
I am releasing what I have now.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
  2009-11-12 19:30 [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3 Mel Gorman
@ 2009-11-12 19:30 ` Mel Gorman
  2009-11-13  5:23   ` KOSAKI Motohiro
  2009-11-12 19:30 ` [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER Mel Gorman
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-11-12 19:30 UTC (permalink / raw)
  To: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker
  Cc: linux-kernel, linux-mm@kvack.org",
	KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List,
	Mel Gorman

If a direct reclaim makes no forward progress, it considers whether it
should go OOM or not. Whether OOM is triggered or not, it may retry the
application afterwards. In times past, this would always wake kswapd as well
but currently, kswapd is not woken up after direct reclaim fails. For order-0
allocations, this makes little difference but if there is a heavy mix of
higher-order allocations that direct reclaim is failing for, it might mean
that kswapd is not rewoken for higher orders as much as it did previously.

This patch wakes up kswapd when an allocation is being retried after a direct
reclaim failure. It would be expected that kswapd is already awake, but
this has the effect of telling kswapd to reclaim at the higher order as well.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cdcedf6..250d055 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1817,9 +1817,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
 		goto nopage;
 
+restart:
 	wake_all_kswapd(order, zonelist, high_zoneidx);
 
-restart:
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
-- 
1.6.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
  2009-11-12 19:30 ` [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed Mel Gorman
@ 2009-11-13  5:23   ` KOSAKI Motohiro
  2009-11-13 13:55     ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-11-13  5:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm@kvack.org",
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

> If a direct reclaim makes no forward progress, it considers whether it
> should go OOM or not. Whether OOM is triggered or not, it may retry the
> application afterwards. In times past, this would always wake kswapd as well
> but currently, kswapd is not woken up after direct reclaim fails. For order-0
> allocations, this makes little difference but if there is a heavy mix of
> higher-order allocations that direct reclaim is failing for, it might mean
> that kswapd is not rewoken for higher orders as much as it did previously.
> 
> This patch wakes up kswapd when an allocation is being retried after a direct
> reclaim failure. It would be expected that kswapd is already awake, but
> this has the effect of telling kswapd to reclaim at the higher order as well.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Umm,
My mail box have the carbon copy of akpm sent this patch to linus.
(bellow subject and data)

Does this have any update?

-----------------------------------------------------------------
Subject: [patch 07/35] page allocator: always wake kswapd when restarting an allocation attempt after direct reclaim failed
To: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org,
 mel@csn.ul.ie,
 cl@linux-foundation.org,
 kosaki.motohiro@jp.fujitsu.com,
 penberg@cs.helsinki.fi,
 stable@kernel.org
From: akpm@linux-foundation.org
Date: Wed, 11 Nov 2009 14:26:14 -0800



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
  2009-11-13  5:23   ` KOSAKI Motohiro
@ 2009-11-13 13:55     ` Mel Gorman
  0 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2009-11-13 13:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Fri, Nov 13, 2009 at 02:23:21PM +0900, KOSAKI Motohiro wrote:
> > If a direct reclaim makes no forward progress, it considers whether it
> > should go OOM or not. Whether OOM is triggered or not, it may retry the
> > application afterwards. In times past, this would always wake kswapd as well
> > but currently, kswapd is not woken up after direct reclaim fails. For order-0
> > allocations, this makes little difference but if there is a heavy mix of
> > higher-order allocations that direct reclaim is failing for, it might mean
> > that kswapd is not rewoken for higher orders as much as it did previously.
> > 
> > This patch wakes up kswapd when an allocation is being retried after a direct
> > reclaim failure. It would be expected that kswapd is already awake, but
> > this has the effect of telling kswapd to reclaim at the higher order as well.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> > Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
> > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> Umm,
> My mail box have the carbon copy of akpm sent this patch to linus.
> (bellow subject and data)
> 
> Does this have any update?
> 

No, it doesn't. I included it in the series for the benefit of testers.

> -----------------------------------------------------------------
> Subject: [patch 07/35] page allocator: always wake kswapd when restarting an allocation attempt after direct reclaim failed
> To: torvalds@linux-foundation.org
> Cc: akpm@linux-foundation.org,
>  mel@csn.ul.ie,
>  cl@linux-foundation.org,
>  kosaki.motohiro@jp.fujitsu.com,
>  penberg@cs.helsinki.fi,
>  stable@kernel.org
> From: akpm@linux-foundation.org
> Date: Wed, 11 Nov 2009 14:26:14 -0800
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
  2009-11-12 19:30 [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3 Mel Gorman
  2009-11-12 19:30 ` [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed Mel Gorman
@ 2009-11-12 19:30 ` Mel Gorman
  2009-11-13  5:24   ` KOSAKI Motohiro
  2009-11-12 19:30 ` [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim Mel Gorman
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-11-12 19:30 UTC (permalink / raw)
  To: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker
  Cc: linux-kernel, linux-mm@kvack.org",
	KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List,
	Mel Gorman

Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
slightly by allowing rt_tasks that are handling an interrupt to set
ALLOC_HARDER. This patch brings the watermark logic more in line with
2.6.30.

[rientjes@google.com: Spotted the problem]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 250d055..2bc2ac6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 		 */
 		alloc_flags &= ~ALLOC_CPUSET;
-	} else if (unlikely(rt_task(p)))
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-- 
1.6.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
  2009-11-12 19:30 ` [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER Mel Gorman
@ 2009-11-13  5:24   ` KOSAKI Motohiro
  2009-11-13 13:56     ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-11-13  5:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm@kvack.org",
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

> Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
> slightly by allowing rt_tasks that are handling an interrupt to set
> ALLOC_HARDER. This patch brings the watermark logic more in line with
> 2.6.30.

ditto.
afaik, this patch have been sent to linus.


> 
> [rientjes@google.com: Spotted the problem]
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/page_alloc.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 250d055..2bc2ac6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
>  		 */
>  		alloc_flags &= ~ALLOC_CPUSET;
> -	} else if (unlikely(rt_task(p)))
> +	} else if (unlikely(rt_task(p)) && !in_interrupt())
>  		alloc_flags |= ALLOC_HARDER;
>  
>  	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> -- 
> 1.6.5
> 



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
  2009-11-13  5:24   ` KOSAKI Motohiro
@ 2009-11-13 13:56     ` Mel Gorman
  0 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2009-11-13 13:56 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Fri, Nov 13, 2009 at 02:24:13PM +0900, KOSAKI Motohiro wrote:
> > Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
> > slightly by allowing rt_tasks that are handling an interrupt to set
> > ALLOC_HARDER. This patch brings the watermark logic more in line with
> > 2.6.30.
> 
> ditto.
> afaik, this patch have been sent to linus.
> 

And ditto, it's there for testers, particularly when patching against
2.6.31.

> 
> > 
> > [rientjes@google.com: Spotted the problem]
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > ---
> >  mm/page_alloc.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 250d055..2bc2ac6 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> >  		 */
> >  		alloc_flags &= ~ALLOC_CPUSET;
> > -	} else if (unlikely(rt_task(p)))
> > +	} else if (unlikely(rt_task(p)) && !in_interrupt())
> >  		alloc_flags |= ALLOC_HARDER;
> >  
> >  	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> > -- 
> > 1.6.5
> > 
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
  2009-11-12 19:30 [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3 Mel Gorman
  2009-11-12 19:30 ` [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed Mel Gorman
  2009-11-12 19:30 ` [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER Mel Gorman
@ 2009-11-12 19:30 ` Mel Gorman
  2009-11-13 11:20   ` KOSAKI Motohiro
  2009-11-12 19:30 ` [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep Mel Gorman
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-11-12 19:30 UTC (permalink / raw)
  To: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker
  Cc: linux-kernel, linux-mm@kvack.org",
	KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List,
	Mel Gorman

Testing by Frans Pop indicated that in the 2.6.30..2.6.31 window at least
that the commits 373c0a7e 8aa7e847 dramatically increased the number of
GFP_ATOMIC failures that were occuring within a wireless driver. Reverting
this patch seemed to help a lot even though it was pointed out that the
congestion changes were very far away from high-order atomic allocations.

The key to why the revert makes such a big difference is down to timing and
how long direct reclaimers wait versus kswapd. With the patch reverted,
the congestion_wait() is on the SYNC queue instead of the ASYNC. As a
significant part of the workload involved reads, it makes sense that the
SYNC list is what was truely congested and with the revert processes were
waiting on congestion as expected. Hence, direct reclaimers stalled
properly and kswapd was able to do its job with fewer stalls.

This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
for direct reclaimers. Instead of making the congestion_wait() on the SYNC
queue which would only fix a particular type of workload, this patch adds a
third type of congestion_wait - BLK_RW_BOTH which first waits on the ASYNC
and then the SYNC queue if the timeout has not been reached.  In tests, this
counter-intuitively results in kswapd stalling less and freeing up pages
resulting in fewer allocation failures and fewer direct-reclaim-orientated
stalls.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |   25 ++++++++++++++++++++++---
 mm/page_alloc.c             |    4 ++--
 mm/vmscan.c                 |    2 +-
 4 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index b449e73..b35344c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -276,6 +276,7 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
 enum {
 	BLK_RW_ASYNC	= 0,
 	BLK_RW_SYNC	= 1,
+	BLK_RW_BOTH	= 2,
 };
 
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 1065b71..ea9ffc3 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -736,22 +736,41 @@ EXPORT_SYMBOL(set_bdi_congested);
 
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
- * @sync: SYNC or ASYNC IO
+ * @sync: SYNC, ASYNC or BOTH IO
  * @timeout: timeout in jiffies
  *
  * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
  * write congestion.  If no backing_devs are congested then just wait for the
  * next write to be completed.
  */
-long congestion_wait(int sync, long timeout)
+long congestion_wait(int sync_request, long timeout)
 {
 	long ret;
 	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
+	int sync;
+	wait_queue_head_t *wqh;
+
+	/* If requested to sync both, wait on ASYNC first, then SYNC */
+	if (sync_request == BLK_RW_BOTH)
+		sync = BLK_RW_ASYNC;
+	else
+		sync = sync_request;
+	
+again:
+	wqh = &congestion_wqh[sync];
 
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
+
+	if (sync_request == BLK_RW_BOTH) {
+		sync_request = 0;
+		sync = BLK_RW_SYNC;
+		timeout = ret;
+		if (timeout)
+			goto again;
+	}
+
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2bc2ac6..f6ed41c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1727,7 +1727,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 			preferred_zone, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
-			congestion_wait(BLK_RW_ASYNC, HZ/50);
+			congestion_wait(BLK_RW_BOTH, HZ/50);
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
 	return page;
@@ -1898,7 +1898,7 @@ rebalance:
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+		congestion_wait(BLK_RW_BOTH, HZ/50);
 		goto rebalance;
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 777af57..190bae1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1793,7 +1793,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 		/* Take a nap, wait for some writeback to complete */
 		if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			congestion_wait(BLK_RW_BOTH, HZ/10);
 	}
 	/* top priority shrink_zones still had more to do? don't OOM, then */
 	if (!sc->all_unreclaimable && scanning_global_lru(sc))
-- 
1.6.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
  2009-11-12 19:30 ` [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim Mel Gorman
@ 2009-11-13 11:20   ` KOSAKI Motohiro
  2009-11-13 11:55     ` Jens Axboe
  0 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-11-13 11:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm@kvack.org",
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List,
	Jens Axboe

(cc to Jens)

> Testing by Frans Pop indicated that in the 2.6.30..2.6.31 window at least
> that the commits 373c0a7e 8aa7e847 dramatically increased the number of
> GFP_ATOMIC failures that were occuring within a wireless driver. Reverting
> this patch seemed to help a lot even though it was pointed out that the
> congestion changes were very far away from high-order atomic allocations.
> 
> The key to why the revert makes such a big difference is down to timing and
> how long direct reclaimers wait versus kswapd. With the patch reverted,
> the congestion_wait() is on the SYNC queue instead of the ASYNC. As a
> significant part of the workload involved reads, it makes sense that the
> SYNC list is what was truely congested and with the revert processes were
> waiting on congestion as expected. Hence, direct reclaimers stalled
> properly and kswapd was able to do its job with fewer stalls.
> 
> This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
> for direct reclaimers. Instead of making the congestion_wait() on the SYNC
> queue which would only fix a particular type of workload, this patch adds a
> third type of congestion_wait - BLK_RW_BOTH which first waits on the ASYNC
> and then the SYNC queue if the timeout has not been reached.  In tests, this
> counter-intuitively results in kswapd stalling less and freeing up pages
> resulting in fewer allocation failures and fewer direct-reclaim-orientated
> stalls.

Honestly, I don't like this patch. page allocator is not related to
sync block queue. vmscan doesn't make read operation.
This patch makes nearly same effect of s/congestion_wait/io_schedule_timeout/.

Please don't make mysterious heuristic code.


Sidenode: I doubt this regression was caused from page allocator.
Probably we need to confirm caller change....



> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/linux/backing-dev.h |    1 +
>  mm/backing-dev.c            |   25 ++++++++++++++++++++++---
>  mm/page_alloc.c             |    4 ++--
>  mm/vmscan.c                 |    2 +-
>  4 files changed, 26 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index b449e73..b35344c 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -276,6 +276,7 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
>  enum {
>  	BLK_RW_ASYNC	= 0,
>  	BLK_RW_SYNC	= 1,
> +	BLK_RW_BOTH	= 2,
>  };
>  
>  void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 1065b71..ea9ffc3 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -736,22 +736,41 @@ EXPORT_SYMBOL(set_bdi_congested);
>  
>  /**
>   * congestion_wait - wait for a backing_dev to become uncongested
> - * @sync: SYNC or ASYNC IO
> + * @sync: SYNC, ASYNC or BOTH IO
>   * @timeout: timeout in jiffies
>   *
>   * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
>   * write congestion.  If no backing_devs are congested then just wait for the
>   * next write to be completed.
>   */
> -long congestion_wait(int sync, long timeout)
> +long congestion_wait(int sync_request, long timeout)
>  {
>  	long ret;
>  	DEFINE_WAIT(wait);
> -	wait_queue_head_t *wqh = &congestion_wqh[sync];
> +	int sync;
> +	wait_queue_head_t *wqh;
> +
> +	/* If requested to sync both, wait on ASYNC first, then SYNC */
> +	if (sync_request == BLK_RW_BOTH)
> +		sync = BLK_RW_ASYNC;
> +	else
> +		sync = sync_request;
> +	
> +again:
> +	wqh = &congestion_wqh[sync];
>  
>  	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
>  	ret = io_schedule_timeout(timeout);
>  	finish_wait(wqh, &wait);
> +
> +	if (sync_request == BLK_RW_BOTH) {
> +		sync_request = 0;
> +		sync = BLK_RW_SYNC;
> +		timeout = ret;
> +		if (timeout)
> +			goto again;
> +	}
> +
>  	return ret;
>  }
>  EXPORT_SYMBOL(congestion_wait);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2bc2ac6..f6ed41c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1727,7 +1727,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
>  			preferred_zone, migratetype);
>  
>  		if (!page && gfp_mask & __GFP_NOFAIL)
> -			congestion_wait(BLK_RW_ASYNC, HZ/50);
> +			congestion_wait(BLK_RW_BOTH, HZ/50);
>  	} while (!page && (gfp_mask & __GFP_NOFAIL));
>  
>  	return page;
> @@ -1898,7 +1898,7 @@ rebalance:
>  	pages_reclaimed += did_some_progress;
>  	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
>  		/* Wait for some write requests to complete then retry */
> -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> +		congestion_wait(BLK_RW_BOTH, HZ/50);
>  		goto rebalance;
>  	}
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 777af57..190bae1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1793,7 +1793,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  
>  		/* Take a nap, wait for some writeback to complete */
>  		if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
> -			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +			congestion_wait(BLK_RW_BOTH, HZ/10);
>  	}
>  	/* top priority shrink_zones still had more to do? don't OOM, then */
>  	if (!sc->all_unreclaimable && scanning_global_lru(sc))
> -- 
> 1.6.5
> 



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
  2009-11-13 11:20   ` KOSAKI Motohiro
@ 2009-11-13 11:55     ` Jens Axboe
  2009-11-13 12:28       ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Jens Axboe @ 2009-11-13 11:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Fri, Nov 13 2009, KOSAKI Motohiro wrote:
> (cc to Jens)
> 
> > Testing by Frans Pop indicated that in the 2.6.30..2.6.31 window at least
> > that the commits 373c0a7e 8aa7e847 dramatically increased the number of
> > GFP_ATOMIC failures that were occuring within a wireless driver. Reverting
> > this patch seemed to help a lot even though it was pointed out that the
> > congestion changes were very far away from high-order atomic allocations.
> > 
> > The key to why the revert makes such a big difference is down to timing and
> > how long direct reclaimers wait versus kswapd. With the patch reverted,
> > the congestion_wait() is on the SYNC queue instead of the ASYNC. As a
> > significant part of the workload involved reads, it makes sense that the
> > SYNC list is what was truely congested and with the revert processes were
> > waiting on congestion as expected. Hence, direct reclaimers stalled
> > properly and kswapd was able to do its job with fewer stalls.
> > 
> > This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
> > for direct reclaimers. Instead of making the congestion_wait() on the SYNC
> > queue which would only fix a particular type of workload, this patch adds a
> > third type of congestion_wait - BLK_RW_BOTH which first waits on the ASYNC
> > and then the SYNC queue if the timeout has not been reached.  In tests, this
> > counter-intuitively results in kswapd stalling less and freeing up pages
> > resulting in fewer allocation failures and fewer direct-reclaim-orientated
> > stalls.
> 
> Honestly, I don't like this patch. page allocator is not related to
> sync block queue. vmscan doesn't make read operation.
> This patch makes nearly same effect of s/congestion_wait/io_schedule_timeout/.
> 
> Please don't make mysterious heuristic code.
> 
> 
> Sidenode: I doubt this regression was caused from page allocator.
> Probably we need to confirm caller change....

See the email from Chris from yesterday, he nicely explains why this
change made a difference with dm-crypt. dm-crypt needs fixing, not a
hack like this added.

The vm needs to drop congestion hints and usage, not increase it. The
above changelog is mostly hand-wavy nonsense, imho.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
  2009-11-13 11:55     ` Jens Axboe
@ 2009-11-13 12:28       ` Mel Gorman
  2009-11-13 13:32         ` Jens Axboe
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-11-13 12:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: KOSAKI Motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Fri, Nov 13, 2009 at 12:55:58PM +0100, Jens Axboe wrote:
> On Fri, Nov 13 2009, KOSAKI Motohiro wrote:
> > (cc to Jens)
> > 
> > > Testing by Frans Pop indicated that in the 2.6.30..2.6.31 window at least
> > > that the commits 373c0a7e 8aa7e847 dramatically increased the number of
> > > GFP_ATOMIC failures that were occuring within a wireless driver. Reverting
> > > this patch seemed to help a lot even though it was pointed out that the
> > > congestion changes were very far away from high-order atomic allocations.
> > > 
> > > The key to why the revert makes such a big difference is down to timing and
> > > how long direct reclaimers wait versus kswapd. With the patch reverted,
> > > the congestion_wait() is on the SYNC queue instead of the ASYNC. As a
> > > significant part of the workload involved reads, it makes sense that the
> > > SYNC list is what was truely congested and with the revert processes were
> > > waiting on congestion as expected. Hence, direct reclaimers stalled
> > > properly and kswapd was able to do its job with fewer stalls.
> > > 
> > > This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
> > > for direct reclaimers. Instead of making the congestion_wait() on the SYNC
> > > queue which would only fix a particular type of workload, this patch adds a
> > > third type of congestion_wait - BLK_RW_BOTH which first waits on the ASYNC
> > > and then the SYNC queue if the timeout has not been reached.  In tests, this
> > > counter-intuitively results in kswapd stalling less and freeing up pages
> > > resulting in fewer allocation failures and fewer direct-reclaim-orientated
> > > stalls.
> > 
> > Honestly, I don't like this patch. page allocator is not related to
> > sync block queue. vmscan doesn't make read operation.
> > This patch makes nearly same effect of s/congestion_wait/io_schedule_timeout/.
> > 
> > Please don't make mysterious heuristic code.
> > 
> > 
> > Sidenode: I doubt this regression was caused from page allocator.

Probably not. As noted, the major change is really in how long callers
are waiting on congestion_wait. The tarball includes graphs from an
instrumented kernel that shows how long callers are waiting due to
congestion_wait(). This has changed significantly.

I'll queue up tests over the weekend that test without dm-crypt being involved.

> > Probably we need to confirm caller change....
> 
> See the email from Chris from yesterday, he nicely explains why this
> change made a difference with dm-crypt.

Indeed.

But bear in mind that it also possible that direct reclaimers are also
congesting the queue due to swap-in.

> dm-crypt needs fixing, not a hack like this added.
> 

As noted by Chris in the same mail, dm-crypt has not changed. What has
changed is how long callers wait in congestion_wait.


> The vm needs to drop congestion hints and usage, not increase it. The
> above changelog is mostly hand-wavy nonsense, imho.
> 

Suggest an alternative that brings congestion_wait() more in line with
2.6.30 behaviour then.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
  2009-11-13 12:28       ` Mel Gorman
@ 2009-11-13 13:32         ` Jens Axboe
  2009-11-13 13:41           ` Pekka Enberg
                             ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Jens Axboe @ 2009-11-13 13:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Fri, Nov 13 2009, Mel Gorman wrote:
> On Fri, Nov 13, 2009 at 12:55:58PM +0100, Jens Axboe wrote:
> > On Fri, Nov 13 2009, KOSAKI Motohiro wrote:
> > > (cc to Jens)
> > > 
> > > > Testing by Frans Pop indicated that in the 2.6.30..2.6.31 window at least
> > > > that the commits 373c0a7e 8aa7e847 dramatically increased the number of
> > > > GFP_ATOMIC failures that were occuring within a wireless driver. Reverting
> > > > this patch seemed to help a lot even though it was pointed out that the
> > > > congestion changes were very far away from high-order atomic allocations.
> > > > 
> > > > The key to why the revert makes such a big difference is down to timing and
> > > > how long direct reclaimers wait versus kswapd. With the patch reverted,
> > > > the congestion_wait() is on the SYNC queue instead of the ASYNC. As a
> > > > significant part of the workload involved reads, it makes sense that the
> > > > SYNC list is what was truely congested and with the revert processes were
> > > > waiting on congestion as expected. Hence, direct reclaimers stalled
> > > > properly and kswapd was able to do its job with fewer stalls.
> > > > 
> > > > This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
> > > > for direct reclaimers. Instead of making the congestion_wait() on the SYNC
> > > > queue which would only fix a particular type of workload, this patch adds a
> > > > third type of congestion_wait - BLK_RW_BOTH which first waits on the ASYNC
> > > > and then the SYNC queue if the timeout has not been reached.  In tests, this
> > > > counter-intuitively results in kswapd stalling less and freeing up pages
> > > > resulting in fewer allocation failures and fewer direct-reclaim-orientated
> > > > stalls.
> > > 
> > > Honestly, I don't like this patch. page allocator is not related to
> > > sync block queue. vmscan doesn't make read operation.
> > > This patch makes nearly same effect of s/congestion_wait/io_schedule_timeout/.
> > > 
> > > Please don't make mysterious heuristic code.
> > > 
> > > 
> > > Sidenode: I doubt this regression was caused from page allocator.
> 
> Probably not. As noted, the major change is really in how long callers
> are waiting on congestion_wait. The tarball includes graphs from an
> instrumented kernel that shows how long callers are waiting due to
> congestion_wait(). This has changed significantly.
> 
> I'll queue up tests over the weekend that test without dm-crypt being involved.
> 
> > > Probably we need to confirm caller change....
> > 
> > See the email from Chris from yesterday, he nicely explains why this
> > change made a difference with dm-crypt.
> 
> Indeed.
> 
> But bear in mind that it also possible that direct reclaimers are also
> congesting the queue due to swap-in.

Are you speculating, or has this been observed? While I don't contest
that that could happen, it's also not a new thing. And it should be an
unlikely event.

> > dm-crypt needs fixing, not a hack like this added.
> > 
> 
> As noted by Chris in the same mail, dm-crypt has not changed. What has
> changed is how long callers wait in congestion_wait.

Right dm-crypt didn't change, it WAS ALREADY BUGGY.

> > The vm needs to drop congestion hints and usage, not increase it. The
> > above changelog is mostly hand-wavy nonsense, imho.
> > 
> 
> Suggest an alternative that brings congestion_wait() more in line with
> 2.6.30 behaviour then.

I don't have a good explanation as to why the delays have changed,
unfortunately. Are we sure that they have between .30 and .31? The
dm-crypt case is overly complex and lots of changes could have broken
that house of cards.

-- 
Jens Axboe

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
  2009-11-13 13:32         ` Jens Axboe
@ 2009-11-13 13:41           ` Pekka Enberg
  2009-11-13 15:22             ` Chris Mason
  2009-11-13 14:16           ` Mel Gorman
  2009-11-20 14:56           ` Mel Gorman
  2 siblings, 1 reply; 65+ messages in thread
From: Pekka Enberg @ 2009-11-13 13:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mel Gorman, KOSAKI Motohiro, Andrew Morton, Frans Pop,
	Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker,
	linux-kernel, linux-mm, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List,
	Linus Torvalds

Hi Jens,

On Fri, Nov 13, 2009 at 3:32 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
>> Suggest an alternative that brings congestion_wait() more in line with
>> 2.6.30 behaviour then.
>
> I don't have a good explanation as to why the delays have changed,
> unfortunately. Are we sure that they have between .30 and .31? The
> dm-crypt case is overly complex and lots of changes could have broken
> that house of cards.

Hand-waving or not, we have end user reports stating that reverting
commit 8aa7e847d834ed937a9ad37a0f2ad5b8584c1ab0 ("Fix
congestion_wait() sync/async vs read/write confusion") fixes their
(rather serious) OOM regression. The commit in question _does_
introduce a functional change and if this was your average regression,
people would be kicking and screaming to get it reverted.

So is there a reason we shouldn't send a partial revert of the commit
(switching to BLK_RW_SYNC) to Linus until the "real" issue gets
resolved? Yes, I realize it's ugly voodoo magic but dammit, it used to
work!

                        Pekka

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
  2009-11-13 13:41           ` Pekka Enberg
@ 2009-11-13 15:22             ` Chris Mason
  0 siblings, 0 replies; 65+ messages in thread
From: Chris Mason @ 2009-11-13 15:22 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Jens Axboe, Mel Gorman, KOSAKI Motohiro, Andrew Morton,
	Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker, linux-kernel, linux-mm, Rik van Riel,
	Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki,
	Kernel Testers List, Linus Torvalds

On Fri, Nov 13, 2009 at 03:41:46PM +0200, Pekka Enberg wrote:
> Hi Jens,
> 
> On Fri, Nov 13, 2009 at 3:32 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> >> Suggest an alternative that brings congestion_wait() more in line with
> >> 2.6.30 behaviour then.
> >
> > I don't have a good explanation as to why the delays have changed,
> > unfortunately. Are we sure that they have between .30 and .31? The
> > dm-crypt case is overly complex and lots of changes could have broken
> > that house of cards.
> 
> Hand-waving or not, we have end user reports stating that reverting
> commit 8aa7e847d834ed937a9ad37a0f2ad5b8584c1ab0 ("Fix
> congestion_wait() sync/async vs read/write confusion") fixes their
> (rather serious) OOM regression. The commit in question _does_
> introduce a functional change and if this was your average regression,
> people would be kicking and screaming to get it reverted.
> 
> So is there a reason we shouldn't send a partial revert of the commit
> (switching to BLK_RW_SYNC) to Linus until the "real" issue gets
> resolved? Yes, I realize it's ugly voodoo magic but dammit, it used to
> work!

If the workload didn't involve dm-crypt everything would make more sense.
With dm-crypt, things look like this:

VM:
Someone submits a write (sync or not) to clean a page
out of pages, I'll wait for the disk

dm-crypt:
Async threads wakeup and start encrypting submitted writes

Async threads finish and hand off to other async threads to submit new writes
or
dm encryption thread submits the write directly (without my patch)

The problem with all of this is that waiting for congestion doesn't
actually have very much to do with waiting for dm-crypt IO.  We might
be checking congestion long before dm-cryptd actually gets around to
sending the IO to the disk (so we'll just wait for the full timeout),
or the congestion check could be waiting for entirely unrelated IO.

Because dm-crypt uses the same queue for encrypting and decrypting reads
and writes, if writepage needs to read a block (say a metadata mapping
block), waiting for that block to become unencrypted will have to wait
for the entire queue of both encrypted and decrypted blocks to finish.

[ Unless I completely misread dm-crypt, which is always possible ]

So, cleaning pages can take an arbitrary amount of time, based on how
much CPU time is available, how many other pages are in the worker
threads and how fast the disk is once we finally hand the pages over to
it.

Tuning the VM timeouts based on this seems like a bad idea to me.  If
the VM is waiting for pages to leave writeback, we should just add a
sequence counter and wait queue that gets bumped as pages leave
writeback (or something that wouldn't do horrible things to cachelines).

Otherwise we're going to end up waiting way too long for the sync
congestion on configs that don't intermix sync and async in such complex
ways.

-chris

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
  2009-11-13 13:32         ` Jens Axboe
  2009-11-13 13:41           ` Pekka Enberg
@ 2009-11-13 14:16           ` Mel Gorman
  2009-11-20 14:56           ` Mel Gorman
  2 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2009-11-13 14:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: KOSAKI Motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Fri, Nov 13, 2009 at 02:32:12PM +0100, Jens Axboe wrote:
> On Fri, Nov 13 2009, Mel Gorman wrote:
> > On Fri, Nov 13, 2009 at 12:55:58PM +0100, Jens Axboe wrote:
> > > On Fri, Nov 13 2009, KOSAKI Motohiro wrote:
> > > > (cc to Jens)
> > > > 
> > > > > Testing by Frans Pop indicated that in the 2.6.30..2.6.31 window at least
> > > > > that the commits 373c0a7e 8aa7e847 dramatically increased the number of
> > > > > GFP_ATOMIC failures that were occuring within a wireless driver. Reverting
> > > > > this patch seemed to help a lot even though it was pointed out that the
> > > > > congestion changes were very far away from high-order atomic allocations.
> > > > > 
> > > > > The key to why the revert makes such a big difference is down to timing and
> > > > > how long direct reclaimers wait versus kswapd. With the patch reverted,
> > > > > the congestion_wait() is on the SYNC queue instead of the ASYNC. As a
> > > > > significant part of the workload involved reads, it makes sense that the
> > > > > SYNC list is what was truely congested and with the revert processes were
> > > > > waiting on congestion as expected. Hence, direct reclaimers stalled
> > > > > properly and kswapd was able to do its job with fewer stalls.
> > > > > 
> > > > > This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
> > > > > for direct reclaimers. Instead of making the congestion_wait() on the SYNC
> > > > > queue which would only fix a particular type of workload, this patch adds a
> > > > > third type of congestion_wait - BLK_RW_BOTH which first waits on the ASYNC
> > > > > and then the SYNC queue if the timeout has not been reached.  In tests, this
> > > > > counter-intuitively results in kswapd stalling less and freeing up pages
> > > > > resulting in fewer allocation failures and fewer direct-reclaim-orientated
> > > > > stalls.
> > > > 
> > > > Honestly, I don't like this patch. page allocator is not related to
> > > > sync block queue. vmscan doesn't make read operation.
> > > > This patch makes nearly same effect of s/congestion_wait/io_schedule_timeout/.
> > > > 
> > > > Please don't make mysterious heuristic code.
> > > > 
> > > > 
> > > > Sidenode: I doubt this regression was caused from page allocator.
> > 
> > Probably not. As noted, the major change is really in how long callers
> > are waiting on congestion_wait. The tarball includes graphs from an
> > instrumented kernel that shows how long callers are waiting due to
> > congestion_wait(). This has changed significantly.
> > 
> > I'll queue up tests over the weekend that test without dm-crypt being involved.
> > 
> > > > Probably we need to confirm caller change....
> > > 
> > > See the email from Chris from yesterday, he nicely explains why this
> > > change made a difference with dm-crypt.
> > 
> > Indeed.
> > 
> > But bear in mind that it also possible that direct reclaimers are also
> > congesting the queue due to swap-in.
> 
> Are you speculating, or has this been observed?

Speculating. I'll need to adjust the test and instrumentation to know
for sure.

> While I don't contest
> that that could happen, it's also not a new thing. And it should be an
> unlikely event.
> 
> > > dm-crypt needs fixing, not a hack like this added.
> > > 
> > 
> > As noted by Chris in the same mail, dm-crypt has not changed. What has
> > changed is how long callers wait in congestion_wait.
> 
> Right dm-crypt didn't change, it WAS ALREADY BUGGY.
> 

Fair point.

> > > The vm needs to drop congestion hints and usage, not increase it. The
> > > above changelog is mostly hand-wavy nonsense, imho.
> > > 
> > 
> > Suggest an alternative that brings congestion_wait() more in line with
> > 2.6.30 behaviour then.
> 
> I don't have a good explanation as to why the delays have changed,
> unfortunately. Are we sure that they have between .30 and .31?

Fairly sure. The original reporter first reported that the problem was
in this range and reverting one of the congestion_wait() patches
appeared to help. What the revert was really doing was making
congestion_wait() on SYNC instead of ASYNC in a number of cases.

> The
> dm-crypt case is overly complex and lots of changes could have broken
> that house of cards.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
  2009-11-13 13:32         ` Jens Axboe
  2009-11-13 13:41           ` Pekka Enberg
  2009-11-13 14:16           ` Mel Gorman
@ 2009-11-20 14:56           ` Mel Gorman
  2 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2009-11-20 14:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: KOSAKI Motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Fri, Nov 13, 2009 at 02:32:12PM +0100, Jens Axboe wrote:
> On Fri, Nov 13 2009, Mel Gorman wrote:
> > On Fri, Nov 13, 2009 at 12:55:58PM +0100, Jens Axboe wrote:
> > > On Fri, Nov 13 2009, KOSAKI Motohiro wrote:
> > > > (cc to Jens)
> > > > 
> > > > > Testing by Frans Pop indicated that in the 2.6.30..2.6.31 window at least
> > > > > that the commits 373c0a7e 8aa7e847 dramatically increased the number of
> > > > > GFP_ATOMIC failures that were occuring within a wireless driver. Reverting
> > > > > this patch seemed to help a lot even though it was pointed out that the
> > > > > congestion changes were very far away from high-order atomic allocations.
> > > > > 
> > > > > The key to why the revert makes such a big difference is down to timing and
> > > > > how long direct reclaimers wait versus kswapd. With the patch reverted,
> > > > > the congestion_wait() is on the SYNC queue instead of the ASYNC. As a
> > > > > significant part of the workload involved reads, it makes sense that the
> > > > > SYNC list is what was truely congested and with the revert processes were
> > > > > waiting on congestion as expected. Hence, direct reclaimers stalled
> > > > > properly and kswapd was able to do its job with fewer stalls.
> > > > > 
> > > > > This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
> > > > > for direct reclaimers. Instead of making the congestion_wait() on the SYNC
> > > > > queue which would only fix a particular type of workload, this patch adds a
> > > > > third type of congestion_wait - BLK_RW_BOTH which first waits on the ASYNC
> > > > > and then the SYNC queue if the timeout has not been reached.  In tests, this
> > > > > counter-intuitively results in kswapd stalling less and freeing up pages
> > > > > resulting in fewer allocation failures and fewer direct-reclaim-orientated
> > > > > stalls.
> > > > 
> > > > Honestly, I don't like this patch. page allocator is not related to
> > > > sync block queue. vmscan doesn't make read operation.
> > > > This patch makes nearly same effect of s/congestion_wait/io_schedule_timeout/.
> > > > 
> > > > Please don't make mysterious heuristic code.
> > > > 
> > > > 
> > > > Sidenode: I doubt this regression was caused from page allocator.
> > 
> > Probably not. As noted, the major change is really in how long callers
> > are waiting on congestion_wait. The tarball includes graphs from an
> > instrumented kernel that shows how long callers are waiting due to
> > congestion_wait(). This has changed significantly.
> > 
> > I'll queue up tests over the weekend that test without dm-crypt being involved.
> > 
> > > > Probably we need to confirm caller change....
> > > 
> > > See the email from Chris from yesterday, he nicely explains why this
> > > change made a difference with dm-crypt.
> > 
> > Indeed.
> > 
> > But bear in mind that it also possible that direct reclaimers are also
> > congesting the queue due to swap-in.
> 
> Are you speculating, or has this been observed? While I don't contest
> that that could happen, it's also not a new thing. And it should be an
> unlikely event.
> 
> > > dm-crypt needs fixing, not a hack like this added.
> > > 
> > 
> > As noted by Chris in the same mail, dm-crypt has not changed. What has
> > changed is how long callers wait in congestion_wait.
> 
> Right dm-crypt didn't change, it WAS ALREADY BUGGY.
> 

On a different note, I tried without dm-crypt and found by far the
greatest different to page allocator success or failure was the value of
the low_latency tunable.

low_latency == 0
2.6.32-rc6-0000000-force-highorder            Elapsed:17:50.935(stddev:002.535)   Failures:0
2.6.32-rc6-0000006-dm-crypt-unplug            Elapsed:18:44.610(stddev:002.236)   Failures:1
2.6.32-rc6-0000012-pgalloc-2.6.30             Elapsed:17:18.330(stddev:002.258)   Failures:2
2.6.32-rc6-0000123-congestion-both            Elapsed:16:17.370(stddev:002.167)   Failures:1
2.6.32-rc6-0001234-kswapd-quick-recheck       Elapsed:15:54.880(stddev:002.234)   Failures:0
2.6.32-rc6-0012345-kswapd-stay-awake-when-min Elapsed:19:26.417(stddev:002.237)   Failures:1
2.6.32-rc6-0123456-dm-crypt-unplug            Elapsed:20:19.135(stddev:001.516)   Failures:0

low_latency == 1
2.6.32-rc6-0000000-force-highorder            Elapsed:20:04.755(stddev:078.551)   Failures:22
2.6.32-rc6-0000006-dm-crypt-unplug            Elapsed:25:05.608(stddev:053.224)   Failures:12
2.6.32-rc6-0000012-pgalloc-2.6.30             Elapsed:19:01.530(stddev:002.146)   Failures:14
2.6.32-rc6-0000123-congestion-both            Elapsed:18:01.938(stddev:002.171)   Failures:2
2.6.32-rc6-0001234-kswapd-quick-recheck       Elapsed:19:52.833(stddev:064.168)   Failures:7
2.6.32-rc6-0012345-kswapd-stay-awake-when-min Elapsed:26:25.767(stddev:050.827)   Failures:1
2.6.32-rc6-0123456-dm-crypt-unplug            Elapsed:22:56.850(stddev:053.914)   Failures:1

Setting low_latency both regresses performance of the test and causes a
boatload of allocation failures. Note also the deviations. With
low_latency == 0, gitk performs predictably +/- around 2 seconds. With
low_latency == 1, there are huge varianes +/- about a minute.

Sampling writeback, I found that the number of pages in writeback with
low_latency == 1 was higher for longer.

Any theories as to why enabling low_latency has such a negative impact?
Should it be disabled by default?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
  2009-11-12 19:30 [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3 Mel Gorman
                   ` (2 preceding siblings ...)
  2009-11-12 19:30 ` [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim Mel Gorman
@ 2009-11-12 19:30 ` Mel Gorman
  2009-11-13 10:43   ` KOSAKI Motohiro
  2009-11-12 19:30 ` [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble Mel Gorman
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-11-12 19:30 UTC (permalink / raw)
  To: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker
  Cc: linux-kernel, linux-mm@kvack.org",
	KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List,
	Mel Gorman

After kswapd balances all zones in a pgdat, it goes to sleep. In the event
of no IO congestion, kswapd can go to sleep very shortly after the high
watermark was reached. If there are a constant stream of allocations from
parallel processes, it can mean that kswapd went to sleep too quickly and
the high watermark is not being maintained for sufficient length time.

This patch makes kswapd go to sleep as a two-stage process. It first
tries to sleep for HZ/10. If it is woken up by another process or the
high watermark is no longer met, it's considered a premature sleep and
kswapd continues work. Otherwise it goes fully to sleep.

This adds more counters to distinguish between fast and slow breaches of
watermarks. A "fast" premature sleep is one where the low watermark was
hit in a very short time after kswapd going to sleep. A "slow" premature
sleep indicates that the high watermark was breached after a very short
interval.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/vmstat.h |    1 +
 mm/vmscan.c            |   44 ++++++++++++++++++++++++++++++++++++++++++--
 mm/vmstat.c            |    2 ++
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 2d0f222..9716003 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -40,6 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
+		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 190bae1..ffa1766 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1904,6 +1904,24 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 }
 #endif
 
+/* is kswapd sleeping prematurely? */
+static int sleeping_prematurely(int order, long remaining)
+{
+	struct zone *zone;
+
+	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
+	if (remaining)
+		return 1;
+
+	/* If after HZ/10, a zone is below the high mark, it's premature */
+	for_each_populated_zone(zone)
+		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
+								0, 0))
+			return 1;
+
+	return 0;
+}
+
 /*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at high_wmark_pages(zone).
@@ -2184,8 +2202,30 @@ static int kswapd(void *p)
 			 */
 			order = new_order;
 		} else {
-			if (!freezing(current))
-				schedule();
+			if (!freezing(current)) {
+				long remaining = 0;
+
+				/* Try to sleep for a short interval */
+				if (!sleeping_prematurely(order, remaining)) {
+					remaining = schedule_timeout(HZ/10);
+					finish_wait(&pgdat->kswapd_wait, &wait);
+					prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+				}
+
+				/*
+				 * After a short sleep, check if it was a
+				 * premature sleep. If not, then go fully
+				 * to sleep until explicitly woken up
+				 */
+				if (!sleeping_prematurely(order, remaining))
+					schedule();
+				else {
+					if (remaining)
+						count_vm_event(KSWAPD_PREMATURE_FAST);
+					else
+						count_vm_event(KSWAPD_PREMATURE_SLOW);
+				}
+			}
 
 			order = pgdat->kswapd_max_order;
 		}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c81321f..90b11e4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -683,6 +683,8 @@ static const char * const vmstat_text[] = {
 	"slabs_scanned",
 	"kswapd_steal",
 	"kswapd_inodesteal",
+	"kswapd_slept_prematurely_fast",
+	"kswapd_slept_prematurely_slow",
 	"pageoutrun",
 	"allocstall",
 
-- 
1.6.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
  2009-11-12 19:30 ` [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep Mel Gorman
@ 2009-11-13 10:43   ` KOSAKI Motohiro
  2009-11-13 14:13     ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-11-13 10:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm@kvack.org",
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

> After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> of no IO congestion, kswapd can go to sleep very shortly after the high
> watermark was reached. If there are a constant stream of allocations from
> parallel processes, it can mean that kswapd went to sleep too quickly and
> the high watermark is not being maintained for sufficient length time.
> 
> This patch makes kswapd go to sleep as a two-stage process. It first
> tries to sleep for HZ/10. If it is woken up by another process or the
> high watermark is no longer met, it's considered a premature sleep and
> kswapd continues work. Otherwise it goes fully to sleep.
> 
> This adds more counters to distinguish between fast and slow breaches of
> watermarks. A "fast" premature sleep is one where the low watermark was
> hit in a very short time after kswapd going to sleep. A "slow" premature
> sleep indicates that the high watermark was breached after a very short
> interval.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Why do you submit this patch to mainline? this is debugging patch
no more and no less.


> ---
>  include/linux/vmstat.h |    1 +
>  mm/vmscan.c            |   44 ++++++++++++++++++++++++++++++++++++++++++--
>  mm/vmstat.c            |    2 ++
>  3 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 2d0f222..9716003 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -40,6 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		PGSCAN_ZONE_RECLAIM_FAILED,
>  #endif
>  		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> +		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
>  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,

Please don't use the word of "premature" and "fast". it is too hard to understand the meanings.
Plus, please use per-zone stastics (like NUMA_HIT).

>
>  #ifdef CONFIG_HUGETLB_PAGE
>  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 190bae1..ffa1766 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1904,6 +1904,24 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  }
>  #endif
>  
> +/* is kswapd sleeping prematurely? */
> +static int sleeping_prematurely(int order, long remaining)
> +{
> +	struct zone *zone;
> +
> +	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> +	if (remaining)
> +		return 1;
> +
> +	/* If after HZ/10, a zone is below the high mark, it's premature */
> +	for_each_populated_zone(zone)
> +		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> +								0, 0))
> +			return 1;

for_each_populated_zone() iterate all populated zone. but kswapd shuld't see another node.

> +
> +	return 0;
> +}
> +
>  /*
>   * For kswapd, balance_pgdat() will work across all this node's zones until
>   * they are all at high_wmark_pages(zone).
> @@ -2184,8 +2202,30 @@ static int kswapd(void *p)
>  			 */
>  			order = new_order;
>  		} else {
> -			if (!freezing(current))
> -				schedule();
> +			if (!freezing(current)) {
> +				long remaining = 0;
> +
> +				/* Try to sleep for a short interval */
> +				if (!sleeping_prematurely(order, remaining)) {
> +					remaining = schedule_timeout(HZ/10);
> +					finish_wait(&pgdat->kswapd_wait, &wait);
> +					prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> +				}
> +
> +				/*
> +				 * After a short sleep, check if it was a
> +				 * premature sleep. If not, then go fully
> +				 * to sleep until explicitly woken up
> +				 */
> +				if (!sleeping_prematurely(order, remaining))
> +					schedule();
> +				else {
> +					if (remaining)
> +						count_vm_event(KSWAPD_PREMATURE_FAST);
> +					else
> +						count_vm_event(KSWAPD_PREMATURE_SLOW);
> +				}
> +			}
>  
>  			order = pgdat->kswapd_max_order;
>  		}
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index c81321f..90b11e4 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -683,6 +683,8 @@ static const char * const vmstat_text[] = {
>  	"slabs_scanned",
>  	"kswapd_steal",
>  	"kswapd_inodesteal",
> +	"kswapd_slept_prematurely_fast",
> +	"kswapd_slept_prematurely_slow",
>  	"pageoutrun",
>  	"allocstall",
>  
> -- 
> 1.6.5
> 



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
  2009-11-13 10:43   ` KOSAKI Motohiro
@ 2009-11-13 14:13     ` Mel Gorman
  2009-11-13 18:00       ` KOSAKI Motohiro
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-11-13 14:13 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > of no IO congestion, kswapd can go to sleep very shortly after the high
> > watermark was reached. If there are a constant stream of allocations from
> > parallel processes, it can mean that kswapd went to sleep too quickly and
> > the high watermark is not being maintained for sufficient length time.
> > 
> > This patch makes kswapd go to sleep as a two-stage process. It first
> > tries to sleep for HZ/10. If it is woken up by another process or the
> > high watermark is no longer met, it's considered a premature sleep and
> > kswapd continues work. Otherwise it goes fully to sleep.
> > 
> > This adds more counters to distinguish between fast and slow breaches of
> > watermarks. A "fast" premature sleep is one where the low watermark was
> > hit in a very short time after kswapd going to sleep. A "slow" premature
> > sleep indicates that the high watermark was breached after a very short
> > interval.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> Why do you submit this patch to mainline? this is debugging patch
> no more and no less.
> 

Do you mean the stats part? The stats are included until such time as the page
allocator failure reports stop or are significantly reduced. In the event a
report is received, the value of the counters help determine if kswapd was
struggling or not. They should be removed once this mess is ironed out.

If there is a preference, I can split out the stats part and send it to
people with page allocator failure reports for retesting.

> 
> > ---
> >  include/linux/vmstat.h |    1 +
> >  mm/vmscan.c            |   44 ++++++++++++++++++++++++++++++++++++++++++--
> >  mm/vmstat.c            |    2 ++
> >  3 files changed, 45 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index 2d0f222..9716003 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -40,6 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  		PGSCAN_ZONE_RECLAIM_FAILED,
> >  #endif
> >  		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> > +		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
> >  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> 
> Please don't use the word of "premature" and "fast". it is too hard to understand the meanings.

How about KSWAPD_LOW_WMARK_HIT_QUICKLY and
KSWAPD_HIGH_WMARK_HIT_QUICKLY?


> Plus, please use per-zone stastics (like NUMA_HIT).
> 

Is that not overkill? The intention is just to track in general terms if
kswapd is struggling or not. I hadn't been considering the problem on a
per-zone basis to date.

> >
> >  #ifdef CONFIG_HUGETLB_PAGE
> >  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 190bae1..ffa1766 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1904,6 +1904,24 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> >  }
> >  #endif
> >  
> > +/* is kswapd sleeping prematurely? */
> > +static int sleeping_prematurely(int order, long remaining)
> > +{
> > +	struct zone *zone;
> > +
> > +	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> > +	if (remaining)
> > +		return 1;
> > +
> > +	/* If after HZ/10, a zone is below the high mark, it's premature */
> > +	for_each_populated_zone(zone)
> > +		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> > +								0, 0))
> > +			return 1;
> 
> for_each_populated_zone() iterate all populated zone. but kswapd shuld't see another node.
> 

Good spot.

How about the following?

==== CUT HERE ====
vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1

This patch is a fix and a claritifacation to the patch "vmscan: Have
kswapd sleep for a short interval and double check it should be asleep".
The fix is for kswapd to only check zones in the node it is responsible
for. The clarification is to rename two counters to better explain what is
being counted.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 include/linux/vmstat.h |    2 +-
 mm/vmscan.c            |   20 +++++++++++++-------
 mm/vmstat.c            |    4 ++--
 3 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 7d66695..0591a48 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -40,7 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
-		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
+		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_NO_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 70967e1..5557555 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1905,19 +1905,25 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 #endif
 
 /* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(int order, long remaining)
+static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 {
-	struct zone *zone;
+	int i;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return 1;
 
 	/* If after HZ/10, a zone is below the high mark, it's premature */
-	for_each_populated_zone(zone)
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
 		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
 								0, 0))
 			return 1;
+	}
 
 	return 0;
 }
@@ -2221,7 +2227,7 @@ static int kswapd(void *p)
 				long remaining = 0;
 
 				/* Try to sleep for a short interval */
-				if (!sleeping_prematurely(order, remaining)) {
+				if (!sleeping_prematurely(pgdat, order, remaining)) {
 					remaining = schedule_timeout(HZ/10);
 					finish_wait(&pgdat->kswapd_wait, &wait);
 					prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -2232,13 +2238,13 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(order, remaining))
+				if (!sleeping_prematurely(pgdat, order, remaining))
 					schedule();
 				else {
 					if (remaining)
-						count_vm_event(KSWAPD_PREMATURE_FAST);
+						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 					else
-						count_vm_event(KSWAPD_PREMATURE_SLOW);
+						count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 				}
 			}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bc09547..6cc8dc6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -683,8 +683,8 @@ static const char * const vmstat_text[] = {
 	"slabs_scanned",
 	"kswapd_steal",
 	"kswapd_inodesteal",
-	"kswapd_slept_prematurely_fast",
-	"kswapd_slept_prematurely_slow",
+	"kswapd_low_wmark_hit_quickly",
+	"kswapd_high_wmark_hit_quickly",
 	"kswapd_no_congestion_wait",
 	"pageoutrun",
 	"allocstall",

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
  2009-11-13 14:13     ` Mel Gorman
@ 2009-11-13 18:00       ` KOSAKI Motohiro
  2009-11-13 18:17         ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-11-13 18:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

> On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > > watermark was reached. If there are a constant stream of allocations from
> > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > > the high watermark is not being maintained for sufficient length time.
> > > 
> > > This patch makes kswapd go to sleep as a two-stage process. It first
> > > tries to sleep for HZ/10. If it is woken up by another process or the
> > > high watermark is no longer met, it's considered a premature sleep and
> > > kswapd continues work. Otherwise it goes fully to sleep.
> > > 
> > > This adds more counters to distinguish between fast and slow breaches of
> > > watermarks. A "fast" premature sleep is one where the low watermark was
> > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > > sleep indicates that the high watermark was breached after a very short
> > > interval.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > Why do you submit this patch to mainline? this is debugging patch
> > no more and no less.
> > 
> 
> Do you mean the stats part? The stats are included until such time as the page
> allocator failure reports stop or are significantly reduced. In the event a
> report is received, the value of the counters help determine if kswapd was
> struggling or not. They should be removed once this mess is ironed out.
> 
> If there is a preference, I can split out the stats part and send it to
> people with page allocator failure reports for retesting.

I'm sorry my last mail didn't have enough explanation.
This stats help to solve this issue. I agreed. but after solving this issue,
I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
Or, Can LKML folk make any advise to admin?

if kernel doesn't have any bug, kswapd wakeup rate is not so worth information imho.
following your additional code itself looks good to me. but...


> ==== CUT HERE ====
> vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
> 
> This patch is a fix and a claritifacation to the patch "vmscan: Have
> kswapd sleep for a short interval and double check it should be asleep".
> The fix is for kswapd to only check zones in the node it is responsible
> for. The clarification is to rename two counters to better explain what is
> being counted.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> --- 
>  include/linux/vmstat.h |    2 +-
>  mm/vmscan.c            |   20 +++++++++++++-------
>  mm/vmstat.c            |    4 ++--
>  3 files changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 7d66695..0591a48 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -40,7 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		PGSCAN_ZONE_RECLAIM_FAILED,
>  #endif
>  		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> -		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
> +		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
>  		KSWAPD_NO_CONGESTION_WAIT,
>  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
>  #ifdef CONFIG_HUGETLB_PAGE
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 70967e1..5557555 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1905,19 +1905,25 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  #endif
>  
>  /* is kswapd sleeping prematurely? */
> -static int sleeping_prematurely(int order, long remaining)
> +static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>  {
> -	struct zone *zone;
> +	int i;
>  
>  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>  	if (remaining)
>  		return 1;
>  
>  	/* If after HZ/10, a zone is below the high mark, it's premature */
> -	for_each_populated_zone(zone)
> +	for (i = 0; i < pgdat->nr_zones; i++) {
> +		struct zone *zone = pgdat->node_zones + i;
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
>  		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
>  								0, 0))
>  			return 1;
> +	}
>  
>  	return 0;
>  }
> @@ -2221,7 +2227,7 @@ static int kswapd(void *p)
>  				long remaining = 0;
>  
>  				/* Try to sleep for a short interval */
> -				if (!sleeping_prematurely(order, remaining)) {
> +				if (!sleeping_prematurely(pgdat, order, remaining)) {
>  					remaining = schedule_timeout(HZ/10);
>  					finish_wait(&pgdat->kswapd_wait, &wait);
>  					prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> @@ -2232,13 +2238,13 @@ static int kswapd(void *p)
>  				 * premature sleep. If not, then go fully
>  				 * to sleep until explicitly woken up
>  				 */
> -				if (!sleeping_prematurely(order, remaining))
> +				if (!sleeping_prematurely(pgdat, order, remaining))
>  					schedule();
>  				else {
>  					if (remaining)
> -						count_vm_event(KSWAPD_PREMATURE_FAST);
> +						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
>  					else
> -						count_vm_event(KSWAPD_PREMATURE_SLOW);
> +						count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>  				}
>  			}
>  
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index bc09547..6cc8dc6 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -683,8 +683,8 @@ static const char * const vmstat_text[] = {
>  	"slabs_scanned",
>  	"kswapd_steal",
>  	"kswapd_inodesteal",
> -	"kswapd_slept_prematurely_fast",
> -	"kswapd_slept_prematurely_slow",
> +	"kswapd_low_wmark_hit_quickly",
> +	"kswapd_high_wmark_hit_quickly",
>  	"kswapd_no_congestion_wait",
>  	"pageoutrun",
>  	"allocstall",



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
  2009-11-13 18:00       ` KOSAKI Motohiro
@ 2009-11-13 18:17         ` Mel Gorman
  2009-11-14  9:34           ` KOSAKI Motohiro
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-11-13 18:17 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > > > watermark was reached. If there are a constant stream of allocations from
> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > > > the high watermark is not being maintained for sufficient length time.
> > > > 
> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> > > > high watermark is no longer met, it's considered a premature sleep and
> > > > kswapd continues work. Otherwise it goes fully to sleep.
> > > > 
> > > > This adds more counters to distinguish between fast and slow breaches of
> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > > > sleep indicates that the high watermark was breached after a very short
> > > > interval.
> > > > 
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > 
> > > Why do you submit this patch to mainline? this is debugging patch
> > > no more and no less.
> > > 
> > 
> > Do you mean the stats part? The stats are included until such time as the page
> > allocator failure reports stop or are significantly reduced. In the event a
> > report is received, the value of the counters help determine if kswapd was
> > struggling or not. They should be removed once this mess is ironed out.
> > 
> > If there is a preference, I can split out the stats part and send it to
> > people with page allocator failure reports for retesting.
> 
> I'm sorry my last mail didn't have enough explanation.
> This stats help to solve this issue. I agreed. but after solving this issue,
> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?

One possible workaround would be to raise min_free_kbytes while a fix is
being worked on.

> Or, Can LKML folk make any advise to admin?
> 

Work with them to fix the bug :/

> if kernel doesn't have any bug, kswapd wakeup rate is not so worth information imho.
> following your additional code itself looks good to me. but...
> 
> 
> > ==== CUT HERE ====
> > vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
> > 
> > This patch is a fix and a claritifacation to the patch "vmscan: Have
> > kswapd sleep for a short interval and double check it should be asleep".
> > The fix is for kswapd to only check zones in the node it is responsible
> > for. The clarification is to rename two counters to better explain what is
> > being counted.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > --- 
> >  include/linux/vmstat.h |    2 +-
> >  mm/vmscan.c            |   20 +++++++++++++-------
> >  mm/vmstat.c            |    4 ++--
> >  3 files changed, 16 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index 7d66695..0591a48 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -40,7 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  		PGSCAN_ZONE_RECLAIM_FAILED,
> >  #endif
> >  		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> > -		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
> > +		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> >  		KSWAPD_NO_CONGESTION_WAIT,
> >  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> >  #ifdef CONFIG_HUGETLB_PAGE
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 70967e1..5557555 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1905,19 +1905,25 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> >  #endif
> >  
> >  /* is kswapd sleeping prematurely? */
> > -static int sleeping_prematurely(int order, long remaining)
> > +static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> >  {
> > -	struct zone *zone;
> > +	int i;
> >  
> >  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> >  	if (remaining)
> >  		return 1;
> >  
> >  	/* If after HZ/10, a zone is below the high mark, it's premature */
> > -	for_each_populated_zone(zone)
> > +	for (i = 0; i < pgdat->nr_zones; i++) {
> > +		struct zone *zone = pgdat->node_zones + i;
> > +
> > +		if (!populated_zone(zone))
> > +			continue;
> > +
> >  		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> >  								0, 0))
> >  			return 1;
> > +	}
> >  
> >  	return 0;
> >  }
> > @@ -2221,7 +2227,7 @@ static int kswapd(void *p)
> >  				long remaining = 0;
> >  
> >  				/* Try to sleep for a short interval */
> > -				if (!sleeping_prematurely(order, remaining)) {
> > +				if (!sleeping_prematurely(pgdat, order, remaining)) {
> >  					remaining = schedule_timeout(HZ/10);
> >  					finish_wait(&pgdat->kswapd_wait, &wait);
> >  					prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> > @@ -2232,13 +2238,13 @@ static int kswapd(void *p)
> >  				 * premature sleep. If not, then go fully
> >  				 * to sleep until explicitly woken up
> >  				 */
> > -				if (!sleeping_prematurely(order, remaining))
> > +				if (!sleeping_prematurely(pgdat, order, remaining))
> >  					schedule();
> >  				else {
> >  					if (remaining)
> > -						count_vm_event(KSWAPD_PREMATURE_FAST);
> > +						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> >  					else
> > -						count_vm_event(KSWAPD_PREMATURE_SLOW);
> > +						count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> >  				}
> >  			}
> >  
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index bc09547..6cc8dc6 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -683,8 +683,8 @@ static const char * const vmstat_text[] = {
> >  	"slabs_scanned",
> >  	"kswapd_steal",
> >  	"kswapd_inodesteal",
> > -	"kswapd_slept_prematurely_fast",
> > -	"kswapd_slept_prematurely_slow",
> > +	"kswapd_low_wmark_hit_quickly",
> > +	"kswapd_high_wmark_hit_quickly",
> >  	"kswapd_no_congestion_wait",
> >  	"pageoutrun",
> >  	"allocstall",
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
  2009-11-13 18:17         ` Mel Gorman
@ 2009-11-14  9:34           ` KOSAKI Motohiro
  2009-11-14 15:46             ` Mel Gorman
  2009-11-17 10:34             ` [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1 Mel Gorman
  0 siblings, 2 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-11-14  9:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

2009/11/14 Mel Gorman <mel@csn.ul.ie>:
> On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
>> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
>> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
>> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
>> > > > watermark was reached. If there are a constant stream of allocations from
>> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
>> > > > the high watermark is not being maintained for sufficient length time.
>> > > >
>> > > > This patch makes kswapd go to sleep as a two-stage process. It first
>> > > > tries to sleep for HZ/10. If it is woken up by another process or the
>> > > > high watermark is no longer met, it's considered a premature sleep and
>> > > > kswapd continues work. Otherwise it goes fully to sleep.
>> > > >
>> > > > This adds more counters to distinguish between fast and slow breaches of
>> > > > watermarks. A "fast" premature sleep is one where the low watermark was
>> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
>> > > > sleep indicates that the high watermark was breached after a very short
>> > > > interval.
>> > > >
>> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> > >
>> > > Why do you submit this patch to mainline? this is debugging patch
>> > > no more and no less.
>> > >
>> >
>> > Do you mean the stats part? The stats are included until such time as the page
>> > allocator failure reports stop or are significantly reduced. In the event a
>> > report is received, the value of the counters help determine if kswapd was
>> > struggling or not. They should be removed once this mess is ironed out.
>> >
>> > If there is a preference, I can split out the stats part and send it to
>> > people with page allocator failure reports for retesting.
>>
>> I'm sorry my last mail didn't have enough explanation.
>> This stats help to solve this issue. I agreed. but after solving this issue,
>> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
>> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
>
> One possible workaround would be to raise min_free_kbytes while a fix is
> being worked on.

Please correct me, if I said wrong thing.

if I was admin, I don't watch this stats because kswapd frequently
wakeup doesn't mean any trouble. instead I watch number of allocation
failure.

[see include/linux/vmstat.h]

umm...
Why don't we have nr-allocation-failure vmstat? I don't remember it.
GFP_NOWARN failure doesn't make syslog error logging. but frequently
GFP_NOWARN failure imply the system is under memory pressure or under
heavy fragmentation. It is good opportunity to change min_free_kbytes.



>> Or, Can LKML folk make any advise to admin?
>>
>
> Work with them to fix the bug :/
>
>> if kernel doesn't have any bug, kswapd wakeup rate is not so worth information imho.
>> following your additional code itself looks good to me. but...
>>
>>
>> > ==== CUT HERE ====
>> > vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
>> >
>> > This patch is a fix and a claritifacation to the patch "vmscan: Have
>> > kswapd sleep for a short interval and double check it should be asleep".
>> > The fix is for kswapd to only check zones in the node it is responsible
>> > for. The clarification is to rename two counters to better explain what is
>> > being counted.
>> >
>> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> > ---
>> >  include/linux/vmstat.h |    2 +-
>> >  mm/vmscan.c            |   20 +++++++++++++-------
>> >  mm/vmstat.c            |    4 ++--
>> >  3 files changed, 16 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> > index 7d66695..0591a48 100644
>> > --- a/include/linux/vmstat.h
>> > +++ b/include/linux/vmstat.h
>> > @@ -40,7 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>> >             PGSCAN_ZONE_RECLAIM_FAILED,
>> >  #endif
>> >             PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
>> > -           KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
>> > +           KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
>> >             KSWAPD_NO_CONGESTION_WAIT,
>> >             PAGEOUTRUN, ALLOCSTALL, PGROTATED,
>> >  #ifdef CONFIG_HUGETLB_PAGE
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 70967e1..5557555 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1905,19 +1905,25 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>> >  #endif
>> >
>> >  /* is kswapd sleeping prematurely? */
>> > -static int sleeping_prematurely(int order, long remaining)
>> > +static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>> >  {
>> > -   struct zone *zone;
>> > +   int i;
>> >
>> >     /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>> >     if (remaining)
>> >             return 1;
>> >
>> >     /* If after HZ/10, a zone is below the high mark, it's premature */
>> > -   for_each_populated_zone(zone)
>> > +   for (i = 0; i < pgdat->nr_zones; i++) {
>> > +           struct zone *zone = pgdat->node_zones + i;
>> > +
>> > +           if (!populated_zone(zone))
>> > +                   continue;
>> > +
>> >             if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
>> >                                                             0, 0))
>> >                     return 1;
>> > +   }
>> >
>> >     return 0;
>> >  }
>> > @@ -2221,7 +2227,7 @@ static int kswapd(void *p)
>> >                             long remaining = 0;
>> >
>> >                             /* Try to sleep for a short interval */
>> > -                           if (!sleeping_prematurely(order, remaining)) {
>> > +                           if (!sleeping_prematurely(pgdat, order, remaining)) {
>> >                                     remaining = schedule_timeout(HZ/10);
>> >                                     finish_wait(&pgdat->kswapd_wait, &wait);
>> >                                     prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>> > @@ -2232,13 +2238,13 @@ static int kswapd(void *p)
>> >                              * premature sleep. If not, then go fully
>> >                              * to sleep until explicitly woken up
>> >                              */
>> > -                           if (!sleeping_prematurely(order, remaining))
>> > +                           if (!sleeping_prematurely(pgdat, order, remaining))
>> >                                     schedule();
>> >                             else {
>> >                                     if (remaining)
>> > -                                           count_vm_event(KSWAPD_PREMATURE_FAST);
>> > +                                           count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
>> >                                     else
>> > -                                           count_vm_event(KSWAPD_PREMATURE_SLOW);
>> > +                                           count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>> >                             }
>> >                     }
>> >
>> > diff --git a/mm/vmstat.c b/mm/vmstat.c
>> > index bc09547..6cc8dc6 100644
>> > --- a/mm/vmstat.c
>> > +++ b/mm/vmstat.c
>> > @@ -683,8 +683,8 @@ static const char * const vmstat_text[] = {
>> >     "slabs_scanned",
>> >     "kswapd_steal",
>> >     "kswapd_inodesteal",
>> > -   "kswapd_slept_prematurely_fast",
>> > -   "kswapd_slept_prematurely_slow",
>> > +   "kswapd_low_wmark_hit_quickly",
>> > +   "kswapd_high_wmark_hit_quickly",
>> >     "kswapd_no_congestion_wait",
>> >     "pageoutrun",
>> >     "allocstall",
>>
>>
>>
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
  2009-11-14  9:34           ` KOSAKI Motohiro
@ 2009-11-14 15:46             ` Mel Gorman
  2009-11-17 11:03               ` KOSAKI Motohiro
  2009-11-17 10:34             ` [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1 Mel Gorman
  1 sibling, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-11-14 15:46 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Sat, Nov 14, 2009 at 06:34:23PM +0900, KOSAKI Motohiro wrote:
> 2009/11/14 Mel Gorman <mel@csn.ul.ie>:
> > On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> >> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> >> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> >> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> >> > > > watermark was reached. If there are a constant stream of allocations from
> >> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> >> > > > the high watermark is not being maintained for sufficient length time.
> >> > > >
> >> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> >> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> >> > > > high watermark is no longer met, it's considered a premature sleep and
> >> > > > kswapd continues work. Otherwise it goes fully to sleep.
> >> > > >
> >> > > > This adds more counters to distinguish between fast and slow breaches of
> >> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> >> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> >> > > > sleep indicates that the high watermark was breached after a very short
> >> > > > interval.
> >> > > >
> >> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >> > >
> >> > > Why do you submit this patch to mainline? this is debugging patch
> >> > > no more and no less.
> >> > >
> >> >
> >> > Do you mean the stats part? The stats are included until such time as the page
> >> > allocator failure reports stop or are significantly reduced. In the event a
> >> > report is received, the value of the counters help determine if kswapd was
> >> > struggling or not. They should be removed once this mess is ironed out.
> >> >
> >> > If there is a preference, I can split out the stats part and send it to
> >> > people with page allocator failure reports for retesting.
> >>
> >> I'm sorry my last mail didn't have enough explanation.
> >> This stats help to solve this issue. I agreed. but after solving this issue,
> >> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> >> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
> >
> > One possible workaround would be to raise min_free_kbytes while a fix is
> > being worked on.
> 
> Please correct me, if I said wrong thing.
> 

You didn't.

> if I was admin, I don't watch this stats because kswapd frequently
> wakeup doesn't mean any trouble. instead I watch number of allocation
> failure.
> 

The stats are not tracking when kswapd wakes up. It helps track how
quickly the high or low watermarks are going under once kswapd tries to
go back to sleep.

> [see include/linux/vmstat.h]
> 
> umm...
> Why don't we have nr-allocation-failure vmstat?

There isn't a stat for allocation failures, but there are tracepoints that
can be used to find out that information if necessary. For the cases I'm
looking at the moment though, GFP_NOWARN is not a factor so just looking
through the kernel log has been sufficient.

> I don't remember it.
> GFP_NOWARN failure doesn't make syslog error logging. but frequently
> GFP_NOWARN failure imply the system is under memory pressure or under
> heavy fragmentation. It is good opportunity to change min_free_kbytes.
> 
> 
> 
> >> Or, Can LKML folk make any advise to admin?
> >>
> >
> > Work with them to fix the bug :/
> >
> >> if kernel doesn't have any bug, kswapd wakeup rate is not so worth information imho.
> >> following your additional code itself looks good to me. but...
> >>
> >>
> >> > ==== CUT HERE ====
> >> > vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
> >> >
> >> > This patch is a fix and a claritifacation to the patch "vmscan: Have
> >> > kswapd sleep for a short interval and double check it should be asleep".
> >> > The fix is for kswapd to only check zones in the node it is responsible
> >> > for. The clarification is to rename two counters to better explain what is
> >> > being counted.
> >> >
> >> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >> > ---
> >> >  include/linux/vmstat.h |    2 +-
> >> >  mm/vmscan.c            |   20 +++++++++++++-------
> >> >  mm/vmstat.c            |    4 ++--
> >> >  3 files changed, 16 insertions(+), 10 deletions(-)
> >> >
> >> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> >> > index 7d66695..0591a48 100644
> >> > --- a/include/linux/vmstat.h
> >> > +++ b/include/linux/vmstat.h
> >> > @@ -40,7 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >> >             PGSCAN_ZONE_RECLAIM_FAILED,
> >> >  #endif
> >> >             PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> >> > -           KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
> >> > +           KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> >> >             KSWAPD_NO_CONGESTION_WAIT,
> >> >             PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> >> >  #ifdef CONFIG_HUGETLB_PAGE
> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > index 70967e1..5557555 100644
> >> > --- a/mm/vmscan.c
> >> > +++ b/mm/vmscan.c
> >> > @@ -1905,19 +1905,25 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> >> >  #endif
> >> >
> >> >  /* is kswapd sleeping prematurely? */
> >> > -static int sleeping_prematurely(int order, long remaining)
> >> > +static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> >> >  {
> >> > -   struct zone *zone;
> >> > +   int i;
> >> >
> >> >     /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> >> >     if (remaining)
> >> >             return 1;
> >> >
> >> >     /* If after HZ/10, a zone is below the high mark, it's premature */
> >> > -   for_each_populated_zone(zone)
> >> > +   for (i = 0; i < pgdat->nr_zones; i++) {
> >> > +           struct zone *zone = pgdat->node_zones + i;
> >> > +
> >> > +           if (!populated_zone(zone))
> >> > +                   continue;
> >> > +
> >> >             if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> >> >                                                             0, 0))
> >> >                     return 1;
> >> > +   }
> >> >
> >> >     return 0;
> >> >  }
> >> > @@ -2221,7 +2227,7 @@ static int kswapd(void *p)
> >> >                             long remaining = 0;
> >> >
> >> >                             /* Try to sleep for a short interval */
> >> > -                           if (!sleeping_prematurely(order, remaining)) {
> >> > +                           if (!sleeping_prematurely(pgdat, order, remaining)) {
> >> >                                     remaining = schedule_timeout(HZ/10);
> >> >                                     finish_wait(&pgdat->kswapd_wait, &wait);
> >> >                                     prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> >> > @@ -2232,13 +2238,13 @@ static int kswapd(void *p)
> >> >                              * premature sleep. If not, then go fully
> >> >                              * to sleep until explicitly woken up
> >> >                              */
> >> > -                           if (!sleeping_prematurely(order, remaining))
> >> > +                           if (!sleeping_prematurely(pgdat, order, remaining))
> >> >                                     schedule();
> >> >                             else {
> >> >                                     if (remaining)
> >> > -                                           count_vm_event(KSWAPD_PREMATURE_FAST);
> >> > +                                           count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> >> >                                     else
> >> > -                                           count_vm_event(KSWAPD_PREMATURE_SLOW);
> >> > +                                           count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> >> >                             }
> >> >                     }
> >> >
> >> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> > index bc09547..6cc8dc6 100644
> >> > --- a/mm/vmstat.c
> >> > +++ b/mm/vmstat.c
> >> > @@ -683,8 +683,8 @@ static const char * const vmstat_text[] = {
> >> >     "slabs_scanned",
> >> >     "kswapd_steal",
> >> >     "kswapd_inodesteal",
> >> > -   "kswapd_slept_prematurely_fast",
> >> > -   "kswapd_slept_prematurely_slow",
> >> > +   "kswapd_low_wmark_hit_quickly",
> >> > +   "kswapd_high_wmark_hit_quickly",
> >> >     "kswapd_no_congestion_wait",
> >> >     "pageoutrun",
> >> >     "allocstall",
> >>
> >>
> >>
> >

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
  2009-11-14 15:46             ` Mel Gorman
@ 2009-11-17 11:03               ` KOSAKI Motohiro
  2009-11-17 11:44                 ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-11-17 11:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

I'm sorry for the long delay.

> On Sat, Nov 14, 2009 at 06:34:23PM +0900, KOSAKI Motohiro wrote:
> > 2009/11/14 Mel Gorman <mel@csn.ul.ie>:
> > > On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> > >> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > >> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > >> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > >> > > > watermark was reached. If there are a constant stream of allocations from
> > >> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > >> > > > the high watermark is not being maintained for sufficient length time.
> > >> > > >
> > >> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> > >> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> > >> > > > high watermark is no longer met, it's considered a premature sleep and
> > >> > > > kswapd continues work. Otherwise it goes fully to sleep.
> > >> > > >
> > >> > > > This adds more counters to distinguish between fast and slow breaches of
> > >> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> > >> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > >> > > > sleep indicates that the high watermark was breached after a very short
> > >> > > > interval.
> > >> > > >
> > >> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > >> > >
> > >> > > Why do you submit this patch to mainline? this is debugging patch
> > >> > > no more and no less.
> > >> > >
> > >> >
> > >> > Do you mean the stats part? The stats are included until such time as the page
> > >> > allocator failure reports stop or are significantly reduced. In the event a
> > >> > report is received, the value of the counters help determine if kswapd was
> > >> > struggling or not. They should be removed once this mess is ironed out.
> > >> >
> > >> > If there is a preference, I can split out the stats part and send it to
> > >> > people with page allocator failure reports for retesting.
> > >>
> > >> I'm sorry my last mail didn't have enough explanation.
> > >> This stats help to solve this issue. I agreed. but after solving this issue,
> > >> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> > >> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
> > >
> > > One possible workaround would be to raise min_free_kbytes while a fix is
> > > being worked on.
> > 
> > Please correct me, if I said wrong thing.
> 
> You didn't.
> 
> > if I was admin, I don't watch this stats because kswapd frequently
> > wakeup doesn't mean any trouble. instead I watch number of allocation
> > failure.
> 
> The stats are not tracking when kswapd wakes up. It helps track how
> quickly the high or low watermarks are going under once kswapd tries to
> go back to sleep.

Umm, honestly I'm still puzlled. probably we need go back one step at once.
 kswapd wake up when memory amount less than low watermark and sleep
when memory amount much than high watermask. We need to know 
GFP_ATOMIC failure sign.

My point is, kswapd wakeup only happen after kswapd sleeping. but if the system is
under heavy pressure and memory amount go up and down between low watermark
and high watermark, this stats don't increase at all. IOW, this stats is strong related to
high watermark.

Probaby, min watermark or low watermark are more useful for us.

# of called wake_all_kswapd() is related to low watermark. and It's conteniously
increase although the system have strong memroy pressure. I'm ok.
KSWAPD_NO_CONGESTION_WAIT is related to min watermark. I'm ok too..
# of page allocation failure is related to  min watermark too. I'm ok too.

IOW, I only dislike this stat stop increase strong memory pressure (above explanation).
Can you please tell me why you think kswapd slept time is so important?



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
  2009-11-17 11:03               ` KOSAKI Motohiro
@ 2009-11-17 11:44                 ` Mel Gorman
  2009-11-17 12:18                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-11-17 11:44 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Tue, Nov 17, 2009 at 08:03:21PM +0900, KOSAKI Motohiro wrote:
> I'm sorry for the long delay.
> 
> > On Sat, Nov 14, 2009 at 06:34:23PM +0900, KOSAKI Motohiro wrote:
> > > 2009/11/14 Mel Gorman <mel@csn.ul.ie>:
> > > > On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> > > >> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > > >> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > > >> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > > >> > > > watermark was reached. If there are a constant stream of allocations from
> > > >> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > > >> > > > the high watermark is not being maintained for sufficient length time.
> > > >> > > >
> > > >> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> > > >> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> > > >> > > > high watermark is no longer met, it's considered a premature sleep and
> > > >> > > > kswapd continues work. Otherwise it goes fully to sleep.
> > > >> > > >
> > > >> > > > This adds more counters to distinguish between fast and slow breaches of
> > > >> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> > > >> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > > >> > > > sleep indicates that the high watermark was breached after a very short
> > > >> > > > interval.
> > > >> > > >
> > > >> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > >> > >
> > > >> > > Why do you submit this patch to mainline? this is debugging patch
> > > >> > > no more and no less.
> > > >> > >
> > > >> >
> > > >> > Do you mean the stats part? The stats are included until such time as the page
> > > >> > allocator failure reports stop or are significantly reduced. In the event a
> > > >> > report is received, the value of the counters help determine if kswapd was
> > > >> > struggling or not. They should be removed once this mess is ironed out.
> > > >> >
> > > >> > If there is a preference, I can split out the stats part and send it to
> > > >> > people with page allocator failure reports for retesting.
> > > >>
> > > >> I'm sorry my last mail didn't have enough explanation.
> > > >> This stats help to solve this issue. I agreed. but after solving this issue,
> > > >> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> > > >> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
> > > >
> > > > One possible workaround would be to raise min_free_kbytes while a fix is
> > > > being worked on.
> > > 
> > > Please correct me, if I said wrong thing.
> > 
> > You didn't.
> > 
> > > if I was admin, I don't watch this stats because kswapd frequently
> > > wakeup doesn't mean any trouble. instead I watch number of allocation
> > > failure.
> > 
> > The stats are not tracking when kswapd wakes up. It helps track how
> > quickly the high or low watermarks are going under once kswapd tries to
> > go back to sleep.
> 
> Umm, honestly I'm still puzlled. probably we need go back one step at once.
> kswapd wake up when memory amount less than low watermark and sleep
> when memory amount much than high watermask. We need to know 
> GFP_ATOMIC failure sign.
> 
> My point is, kswapd wakeup only happen after kswapd sleeping. but if the system is
> under heavy pressure and memory amount go up and down between low watermark
> and high watermark, this stats don't increase at all. IOW, this stats is strong related to
> high watermark.
> 

Yes, this is true but as long as kswapd is awake and doing its job, it
will continue taking direction on what order it should be reclaiming from
processes that failed the low_watermark test.  The GFP_ATOMIC allocations
will be allowed to go under this low watermark but will have informed kswapd
what order it should be reclaiming at so it stays working.

A stat that increases between the low and high watermark would indicate
that memory pressure is there or that the reclaim algorithm is not
working as expected but that's checking for a different problem.

What I was looking at was  kswapd going to sleep and the low or min watermarks
being hit very quickly after that so that kswapd pre-emptively kicks in
before allocations start failing again.

> Probaby, min watermark or low watermark are more useful for us.
> 

Why? kswapd is awake between those points.

> # of called wake_all_kswapd() is related to low watermark. and It's conteniously
> increase although the system have strong memroy pressure. I'm ok.
> KSWAPD_NO_CONGESTION_WAIT is related to min watermark. I'm ok too..
> # of page allocation failure is related to  min watermark too. I'm ok too.
> 
> IOW, I only dislike this stat stop increase strong memory pressure (above explanation).
> Can you please tell me why you think kswapd slept time is so important?
> 

I don't think the amount of time it has slept is important. I think it's
important to know if the system is getting back into watermark trouble very
shortly after kswapd reached the high watermark.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
  2009-11-17 11:44                 ` Mel Gorman
@ 2009-11-17 12:18                   ` KOSAKI Motohiro
  2009-11-17 12:25                     ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-11-17 12:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

> On Tue, Nov 17, 2009 at 08:03:21PM +0900, KOSAKI Motohiro wrote:
> > I'm sorry for the long delay.
> > 
> > > On Sat, Nov 14, 2009 at 06:34:23PM +0900, KOSAKI Motohiro wrote:
> > > > 2009/11/14 Mel Gorman <mel@csn.ul.ie>:
> > > > > On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> > > > >> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > > > >> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > > > >> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > > > >> > > > watermark was reached. If there are a constant stream of allocations from
> > > > >> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > > > >> > > > the high watermark is not being maintained for sufficient length time.
> > > > >> > > >
> > > > >> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> > > > >> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> > > > >> > > > high watermark is no longer met, it's considered a premature sleep and
> > > > >> > > > kswapd continues work. Otherwise it goes fully to sleep.
> > > > >> > > >
> > > > >> > > > This adds more counters to distinguish between fast and slow breaches of
> > > > >> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> > > > >> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > > > >> > > > sleep indicates that the high watermark was breached after a very short
> > > > >> > > > interval.
> > > > >> > > >
> > > > >> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > >> > >
> > > > >> > > Why do you submit this patch to mainline? this is debugging patch
> > > > >> > > no more and no less.
> > > > >> > >
> > > > >> >
> > > > >> > Do you mean the stats part? The stats are included until such time as the page
> > > > >> > allocator failure reports stop or are significantly reduced. In the event a
> > > > >> > report is received, the value of the counters help determine if kswapd was
> > > > >> > struggling or not. They should be removed once this mess is ironed out.
> > > > >> >
> > > > >> > If there is a preference, I can split out the stats part and send it to
> > > > >> > people with page allocator failure reports for retesting.
> > > > >>
> > > > >> I'm sorry my last mail didn't have enough explanation.
> > > > >> This stats help to solve this issue. I agreed. but after solving this issue,
> > > > >> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> > > > >> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
> > > > >
> > > > > One possible workaround would be to raise min_free_kbytes while a fix is
> > > > > being worked on.
> > > > 
> > > > Please correct me, if I said wrong thing.
> > > 
> > > You didn't.
> > > 
> > > > if I was admin, I don't watch this stats because kswapd frequently
> > > > wakeup doesn't mean any trouble. instead I watch number of allocation
> > > > failure.
> > > 
> > > The stats are not tracking when kswapd wakes up. It helps track how
> > > quickly the high or low watermarks are going under once kswapd tries to
> > > go back to sleep.
> > 
> > Umm, honestly I'm still puzlled. probably we need go back one step at once.
> > kswapd wake up when memory amount less than low watermark and sleep
> > when memory amount much than high watermask. We need to know 
> > GFP_ATOMIC failure sign.
> > 
> > My point is, kswapd wakeup only happen after kswapd sleeping. but if the system is
> > under heavy pressure and memory amount go up and down between low watermark
> > and high watermark, this stats don't increase at all. IOW, this stats is strong related to
> > high watermark.
> > 
> 
> Yes, this is true but as long as kswapd is awake and doing its job, it
> will continue taking direction on what order it should be reclaiming from
> processes that failed the low_watermark test.  The GFP_ATOMIC allocations
> will be allowed to go under this low watermark but will have informed kswapd
> what order it should be reclaiming at so it stays working.
> 
> A stat that increases between the low and high watermark would indicate
> that memory pressure is there or that the reclaim algorithm is not
> working as expected but that's checking for a different problem.
> 
> What I was looking at was  kswapd going to sleep and the low or min watermarks
> being hit very quickly after that so that kswapd pre-emptively kicks in
> before allocations start failing again.
> 
> > Probaby, min watermark or low watermark are more useful for us.
> 
> Why? kswapd is awake between those points.

What's difference below (1) and (2)?

1. kswapd() run 100ms and sleep 10ms and run 100ms.
2. kswapd() run 200ms

(1) represent amount memory go beyond high-watermark very shortly and go below
low-watermark right after . (2) represent amount memory neared high-watermark closely, but don't touched,
and probably go blow low-watermark right after.
It's almost same memory pressure. but (1) increase KSWAPD_HIGH_WMARK_HIT_QUICKLY and
(2) don't increase any stat.

Thus, We can't think KSWAPD_HIGH_WMARK_HIT_QUICKLY indicate memory pressure.


> > # of called wake_all_kswapd() is related to low watermark. and It's conteniously
> > increase although the system have strong memroy pressure. I'm ok.
> > KSWAPD_NO_CONGESTION_WAIT is related to min watermark. I'm ok too..
> > # of page allocation failure is related to  min watermark too. I'm ok too.
> > 
> > IOW, I only dislike this stat stop increase strong memory pressure (above explanation).
> > Can you please tell me why you think kswapd slept time is so important?
> 
> I don't think the amount of time it has slept is important. I think it's
> important to know if the system is getting back into watermark trouble very
> shortly after kswapd reached the high watermark.

Probably, My last mail doesn't take kindly explanation.
My point is, beyond high-watermark or not doesn't indicate any meaningful 
phenomenon.

Then, I'd prefer low or min-watermark related stats.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
  2009-11-17 12:18                   ` KOSAKI Motohiro
@ 2009-11-17 12:25                     ` Mel Gorman
  2009-11-18  5:20                       ` KOSAKI Motohiro
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-11-17 12:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Tue, Nov 17, 2009 at 09:18:11PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Nov 17, 2009 at 08:03:21PM +0900, KOSAKI Motohiro wrote:
> > > I'm sorry for the long delay.
> > > 
> > > > On Sat, Nov 14, 2009 at 06:34:23PM +0900, KOSAKI Motohiro wrote:
> > > > > 2009/11/14 Mel Gorman <mel@csn.ul.ie>:
> > > > > > On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> > > > > >> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > > > > >> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > > > > >> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > > > > >> > > > watermark was reached. If there are a constant stream of allocations from
> > > > > >> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > > > > >> > > > the high watermark is not being maintained for sufficient length time.
> > > > > >> > > >
> > > > > >> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> > > > > >> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> > > > > >> > > > high watermark is no longer met, it's considered a premature sleep and
> > > > > >> > > > kswapd continues work. Otherwise it goes fully to sleep.
> > > > > >> > > >
> > > > > >> > > > This adds more counters to distinguish between fast and slow breaches of
> > > > > >> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> > > > > >> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > > > > >> > > > sleep indicates that the high watermark was breached after a very short
> > > > > >> > > > interval.
> > > > > >> > > >
> > > > > >> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > > >> > >
> > > > > >> > > Why do you submit this patch to mainline? this is debugging patch
> > > > > >> > > no more and no less.
> > > > > >> > >
> > > > > >> >
> > > > > >> > Do you mean the stats part? The stats are included until such time as the page
> > > > > >> > allocator failure reports stop or are significantly reduced. In the event a
> > > > > >> > report is received, the value of the counters help determine if kswapd was
> > > > > >> > struggling or not. They should be removed once this mess is ironed out.
> > > > > >> >
> > > > > >> > If there is a preference, I can split out the stats part and send it to
> > > > > >> > people with page allocator failure reports for retesting.
> > > > > >>
> > > > > >> I'm sorry my last mail didn't have enough explanation.
> > > > > >> This stats help to solve this issue. I agreed. but after solving this issue,
> > > > > >> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> > > > > >> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
> > > > > >
> > > > > > One possible workaround would be to raise min_free_kbytes while a fix is
> > > > > > being worked on.
> > > > > 
> > > > > Please correct me, if I said wrong thing.
> > > > 
> > > > You didn't.
> > > > 
> > > > > if I was admin, I don't watch this stats because kswapd frequently
> > > > > wakeup doesn't mean any trouble. instead I watch number of allocation
> > > > > failure.
> > > > 
> > > > The stats are not tracking when kswapd wakes up. It helps track how
> > > > quickly the high or low watermarks are going under once kswapd tries to
> > > > go back to sleep.
> > > 
> > > Umm, honestly I'm still puzlled. probably we need go back one step at once.
> > > kswapd wake up when memory amount less than low watermark and sleep
> > > when memory amount much than high watermask. We need to know 
> > > GFP_ATOMIC failure sign.
> > > 
> > > My point is, kswapd wakeup only happen after kswapd sleeping. but if the system is
> > > under heavy pressure and memory amount go up and down between low watermark
> > > and high watermark, this stats don't increase at all. IOW, this stats is strong related to
> > > high watermark.
> > > 
> > 
> > Yes, this is true but as long as kswapd is awake and doing its job, it
> > will continue taking direction on what order it should be reclaiming from
> > processes that failed the low_watermark test.  The GFP_ATOMIC allocations
> > will be allowed to go under this low watermark but will have informed kswapd
> > what order it should be reclaiming at so it stays working.
> > 
> > A stat that increases between the low and high watermark would indicate
> > that memory pressure is there or that the reclaim algorithm is not
> > working as expected but that's checking for a different problem.
> > 
> > What I was looking at was  kswapd going to sleep and the low or min watermarks
> > being hit very quickly after that so that kswapd pre-emptively kicks in
> > before allocations start failing again.
> > 
> > > Probaby, min watermark or low watermark are more useful for us.
> > 
> > Why? kswapd is awake between those points.
> 
> What's difference below (1) and (2)?
> 
> 1. kswapd() run 100ms and sleep 10ms and run 100ms.
> 2. kswapd() run 200ms
> 

Because prior to the patch, once kswapd went to sleep, it would
not wake again until the low watermark was reached. There appeared
to be a timing issue where congestion_wait() would block kswapd
just long enough before checking the high watermark to mean that
it stayed awake. This wasn't happening hence the approach of
briefly-sleep-after-high-watermark-is-reached-and-double-check-watermarks-are-ok.

> (1) represent amount memory go beyond high-watermark very shortly and go below
> low-watermark right after . (2) represent amount memory neared high-watermark closely, but don't touched,
> and probably go blow low-watermark right after.
> It's almost same memory pressure. but (1) increase KSWAPD_HIGH_WMARK_HIT_QUICKLY and
> (2) don't increase any stat.
> 
> Thus, We can't think KSWAPD_HIGH_WMARK_HIT_QUICKLY indicate memory pressure.
> 

It indicates mild memory pressure because the high watermark was only
reached for a short period of time.

> > > # of called wake_all_kswapd() is related to low watermark. and It's conteniously
> > > increase although the system have strong memroy pressure. I'm ok.
> > > KSWAPD_NO_CONGESTION_WAIT is related to min watermark. I'm ok too..
> > > # of page allocation failure is related to  min watermark too. I'm ok too.
> > > 
> > > IOW, I only dislike this stat stop increase strong memory pressure (above explanation).
> > > Can you please tell me why you think kswapd slept time is so important?
> > 
> > I don't think the amount of time it has slept is important. I think it's
> > important to know if the system is getting back into watermark trouble very
> > shortly after kswapd reached the high watermark.
> 
> Probably, My last mail doesn't take kindly explanation.
> My point is, beyond high-watermark or not doesn't indicate any meaningful 
> phenomenon.
> 
> Then, I'd prefer low or min-watermark related stats.
> 

What would that have to do with the kswapd-briefly-sleep logic? i.e. the
stats you are suggesting, while meaningful, are for looking at a
different type of problem. If you want those stats, then I should revert
the stats part of this patch altogether. Does that then mean you also
want the patch that makes kswapd double check watermarks to be dropped
or just have no associated stats to see what it's doing in the event of
allocation failure?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
  2009-11-17 12:25                     ` Mel Gorman
@ 2009-11-18  5:20                       ` KOSAKI Motohiro
  0 siblings, 0 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-11-18  5:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

> On Tue, Nov 17, 2009 at 09:18:11PM +0900, KOSAKI Motohiro wrote:
> > > On Tue, Nov 17, 2009 at 08:03:21PM +0900, KOSAKI Motohiro wrote:
> > > > I'm sorry for the long delay.
> > > > 
> > > > > On Sat, Nov 14, 2009 at 06:34:23PM +0900, KOSAKI Motohiro wrote:
> > > > > > 2009/11/14 Mel Gorman <mel@csn.ul.ie>:
> > > > > > > On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> > > > > > >> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > > > > > >> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > > > > > >> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > > > > > >> > > > watermark was reached. If there are a constant stream of allocations from
> > > > > > >> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > > > > > >> > > > the high watermark is not being maintained for sufficient length time.
> > > > > > >> > > >
> > > > > > >> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> > > > > > >> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> > > > > > >> > > > high watermark is no longer met, it's considered a premature sleep and
> > > > > > >> > > > kswapd continues work. Otherwise it goes fully to sleep.
> > > > > > >> > > >
> > > > > > >> > > > This adds more counters to distinguish between fast and slow breaches of
> > > > > > >> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> > > > > > >> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > > > > > >> > > > sleep indicates that the high watermark was breached after a very short
> > > > > > >> > > > interval.
> > > > > > >> > > >
> > > > > > >> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > > > >> > >
> > > > > > >> > > Why do you submit this patch to mainline? this is debugging patch
> > > > > > >> > > no more and no less.
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> > Do you mean the stats part? The stats are included until such time as the page
> > > > > > >> > allocator failure reports stop or are significantly reduced. In the event a
> > > > > > >> > report is received, the value of the counters help determine if kswapd was
> > > > > > >> > struggling or not. They should be removed once this mess is ironed out.
> > > > > > >> >
> > > > > > >> > If there is a preference, I can split out the stats part and send it to
> > > > > > >> > people with page allocator failure reports for retesting.
> > > > > > >>
> > > > > > >> I'm sorry my last mail didn't have enough explanation.
> > > > > > >> This stats help to solve this issue. I agreed. but after solving this issue,
> > > > > > >> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> > > > > > >> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
> > > > > > >
> > > > > > > One possible workaround would be to raise min_free_kbytes while a fix is
> > > > > > > being worked on.
> > > > > > 
> > > > > > Please correct me, if I said wrong thing.
> > > > > 
> > > > > You didn't.
> > > > > 
> > > > > > if I was admin, I don't watch this stats because kswapd frequently
> > > > > > wakeup doesn't mean any trouble. instead I watch number of allocation
> > > > > > failure.
> > > > > 
> > > > > The stats are not tracking when kswapd wakes up. It helps track how
> > > > > quickly the high or low watermarks are going under once kswapd tries to
> > > > > go back to sleep.
> > > > 
> > > > Umm, honestly I'm still puzlled. probably we need go back one step at once.
> > > > kswapd wake up when memory amount less than low watermark and sleep
> > > > when memory amount much than high watermask. We need to know 
> > > > GFP_ATOMIC failure sign.
> > > > 
> > > > My point is, kswapd wakeup only happen after kswapd sleeping. but if the system is
> > > > under heavy pressure and memory amount go up and down between low watermark
> > > > and high watermark, this stats don't increase at all. IOW, this stats is strong related to
> > > > high watermark.
> > > > 
> > > 
> > > Yes, this is true but as long as kswapd is awake and doing its job, it
> > > will continue taking direction on what order it should be reclaiming from
> > > processes that failed the low_watermark test.  The GFP_ATOMIC allocations
> > > will be allowed to go under this low watermark but will have informed kswapd
> > > what order it should be reclaiming at so it stays working.
> > > 
> > > A stat that increases between the low and high watermark would indicate
> > > that memory pressure is there or that the reclaim algorithm is not
> > > working as expected but that's checking for a different problem.
> > > 
> > > What I was looking at was  kswapd going to sleep and the low or min watermarks
> > > being hit very quickly after that so that kswapd pre-emptively kicks in
> > > before allocations start failing again.
> > > 
> > > > Probaby, min watermark or low watermark are more useful for us.
> > > 




> > > Why? kswapd is awake between those points.
> > 
> > What's difference below (1) and (2)?
> > 
> > 1. kswapd() run 100ms and sleep 10ms and run 100ms.
> > 2. kswapd() run 200ms
> 
> Because prior to the patch, once kswapd went to sleep, it would
> not wake again until the low watermark was reached. There appeared
> to be a timing issue where congestion_wait() would block kswapd
> just long enough before checking the high watermark to mean that
> it stayed awake. This wasn't happening hence the approach of
> briefly-sleep-after-high-watermark-is-reached-and-double-check-watermarks-are-ok.

You are right.

> > (1) represent amount memory go beyond high-watermark very shortly and go below
> > low-watermark right after . (2) represent amount memory neared high-watermark closely, but don't touched,
> > and probably go blow low-watermark right after.
> > It's almost same memory pressure. but (1) increase KSWAPD_HIGH_WMARK_HIT_QUICKLY and
> > (2) don't increase any stat.
> > 
> > Thus, We can't think KSWAPD_HIGH_WMARK_HIT_QUICKLY indicate memory pressure.
> > 
> 
> It indicates mild memory pressure because the high watermark was only
> reached for a short period of time.
> 
> > > > # of called wake_all_kswapd() is related to low watermark. and It's conteniously
> > > > increase although the system have strong memroy pressure. I'm ok.
> > > > KSWAPD_NO_CONGESTION_WAIT is related to min watermark. I'm ok too..
> > > > # of page allocation failure is related to  min watermark too. I'm ok too.
> > > > 
> > > > IOW, I only dislike this stat stop increase strong memory pressure (above explanation).
> > > > Can you please tell me why you think kswapd slept time is so important?
> > > 
> > > I don't think the amount of time it has slept is important. I think it's
> > > important to know if the system is getting back into watermark trouble very
> > > shortly after kswapd reached the high watermark.
> > 
> > Probably, My last mail doesn't take kindly explanation.
> > My point is, beyond high-watermark or not doesn't indicate any meaningful 
> > phenomenon.
> > 
> > Then, I'd prefer low or min-watermark related stats.
> 
> What would that have to do with the kswapd-briefly-sleep logic? i.e. the
> stats you are suggesting, while meaningful, are for looking at a
> different type of problem. If you want those stats, then I should revert
> the stats part of this patch altogether. Does that then mean you also
> want the patch that makes kswapd double check watermarks to be dropped
> or just have no associated stats to see what it's doing in the event of
> allocation failure?

Yes, you talked right thing.

After awhile thinking (and one night good sleeping), I conclude I should not enforce
my personal preference any more. I'm sorry. 

Honestly, I haven't understand this stats usage. but the stats is not core
piece in this patch concept.  we sholdn't get stuck small issue.

then, I'll review your latest patch soon.

Thanks.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
  2009-11-14  9:34           ` KOSAKI Motohiro
  2009-11-14 15:46             ` Mel Gorman
@ 2009-11-17 10:34             ` Mel Gorman
  2009-11-18  5:27               ` KOSAKI Motohiro
  1 sibling, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-11-17 10:34 UTC (permalink / raw)
  To: KOSAKI Motohiro, Andrew Morton
  Cc: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker, linux-kernel, linux-mm, Pekka Enberg,
	Rik van Riel, Christoph Lameter, Stephan von Krawczynski,
	Rafael J. Wysocki, Kernel Testers List

When checking if kswapd is sleeping prematurely, all populated zones are
checked instead of the zones the instance of kswapd is responsible for.
The counters for kswapd going to sleep prematurely are also named poorly.
This patch makes kswapd only check its own zones and renames the relevant
counters.

This is a fix to the patch
vmscan-have-kswapd-sleep-for-a-short-interval-and-double-check-it-should-be-asleep.patch
and is based on top of mmotm-2009-11-13-19-59. It would be preferable if
Kosaki Motohiro signed off on it as he had comments on the patch but the
discussion petered out without a solid conclusion.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 include/linux/vmstat.h |    2 +-
 mm/vmscan.c            |   20 +++++++++++++-------
 mm/vmstat.c            |    4 ++--
 3 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 70c8093..117f0dd 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -40,7 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
-		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
+		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 81ef29b..da6cf42 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1907,19 +1907,25 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 #endif
 
 /* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(int order, long remaining)
+static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 {
-	struct zone *zone;
+	int i;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return 1;
 
 	/* If after HZ/10, a zone is below the high mark, it's premature */
-	for_each_populated_zone(zone)
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
 		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
 								0, 0))
 			return 1;
+	}
 
 	return 0;
 }
@@ -2227,7 +2233,7 @@ static int kswapd(void *p)
 				long remaining = 0;
 
 				/* Try to sleep for a short interval */
-				if (!sleeping_prematurely(order, remaining)) {
+				if (!sleeping_prematurely(pgdat, order, remaining)) {
 					remaining = schedule_timeout(HZ/10);
 					finish_wait(&pgdat->kswapd_wait, &wait);
 					prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -2238,13 +2244,13 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(order, remaining))
+				if (!sleeping_prematurely(pgdat, order, remaining))
 					schedule();
 				else {
 					if (remaining)
-						count_vm_event(KSWAPD_PREMATURE_FAST);
+						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 					else
-						count_vm_event(KSWAPD_PREMATURE_SLOW);
+						count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 				}
 			}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 889254f..6051fba 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -683,8 +683,8 @@ static const char * const vmstat_text[] = {
 	"slabs_scanned",
 	"kswapd_steal",
 	"kswapd_inodesteal",
-	"kswapd_slept_prematurely_fast",
-	"kswapd_slept_prematurely_slow",
+	"kswapd_low_wmark_hit_quickly",
+	"kswapd_high_wmark_hit_quickly",
 	"kswapd_skip_congestion_wait",
 	"pageoutrun",
 	"allocstall",

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
  2009-11-17 10:34             ` [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1 Mel Gorman
@ 2009-11-18  5:27               ` KOSAKI Motohiro
  0 siblings, 0 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-11-18  5:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

> When checking if kswapd is sleeping prematurely, all populated zones are
> checked instead of the zones the instance of kswapd is responsible for.
> The counters for kswapd going to sleep prematurely are also named poorly.
> This patch makes kswapd only check its own zones and renames the relevant
> counters.
> 
> This is a fix to the patch
> vmscan-have-kswapd-sleep-for-a-short-interval-and-double-check-it-should-be-asleep.patch
> and is based on top of mmotm-2009-11-13-19-59. It would be preferable if
> Kosaki Motohiro signed off on it as he had comments on the patch but the
> discussion petered out without a solid conclusion.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Looks good to me.
I apologize to bother you by nit for long time.

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble
  2009-11-12 19:30 [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3 Mel Gorman
                   ` (3 preceding siblings ...)
  2009-11-12 19:30 ` [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep Mel Gorman
@ 2009-11-12 19:30 ` Mel Gorman
  2009-11-13  9:54   ` KOSAKI Motohiro
  2009-11-13 12:41   ` Minchan Kim
  2009-11-13  9:04 ` [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3 Frans Pop
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 65+ messages in thread
From: Mel Gorman @ 2009-11-12 19:30 UTC (permalink / raw)
  To: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker
  Cc: linux-kernel, linux-mm@kvack.org",
	KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List,
	Mel Gorman

If reclaim fails to make sufficient progress, the priority is raised.
Once the priority is higher, kswapd starts waiting on congestion.
However, on systems with large numbers of high-order atomics due to
crappy network cards, it's important that kswapd keep working in
parallel to save their sorry ass.

This patch takes into account the order kswapd is reclaiming at before
waiting on congestion. The higher the order, the longer it is before
kswapd considers itself to be in trouble. The impact is that kswapd
works harder in parallel rather than depending on direct reclaimers or
atomic allocations to fail.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   14 ++++++++++++--
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffa1766..5e200f1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1946,7 +1946,7 @@ static int sleeping_prematurely(int order, long remaining)
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 {
 	int all_zones_ok;
-	int priority;
+	int priority, congestion_priority;
 	int i;
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -1967,6 +1967,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 	 */
 	int temp_priority[MAX_NR_ZONES];
 
+	/*
+	 * When priority reaches congestion_priority, kswapd will sleep
+	 * for a short time while congestion clears. The higher the
+	 * order being reclaimed, the less likely kswapd will go to
+	 * sleep as high-order allocations are harder to reclaim and
+	 * stall direct reclaimers longer
+	 */
+	congestion_priority = DEF_PRIORITY - 2;
+	congestion_priority -= min(congestion_priority, sc.order);
+
 loop_again:
 	total_scanned = 0;
 	sc.nr_reclaimed = 0;
@@ -2092,7 +2102,7 @@ loop_again:
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
 		 * another pass across the zones.
 		 */
-		if (total_scanned && priority < DEF_PRIORITY - 2)
+		if (total_scanned && priority < congestion_priority)
 			congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
-- 
1.6.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble
  2009-11-12 19:30 ` [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble Mel Gorman
@ 2009-11-13  9:54   ` KOSAKI Motohiro
  2009-11-13 13:54     ` Mel Gorman
  2009-11-13 14:38     ` [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble Minchan Kim
  2009-11-13 12:41   ` Minchan Kim
  1 sibling, 2 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-11-13  9:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm@kvack.org",
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

> If reclaim fails to make sufficient progress, the priority is raised.
> Once the priority is higher, kswapd starts waiting on congestion.
> However, on systems with large numbers of high-order atomics due to
> crappy network cards, it's important that kswapd keep working in
> parallel to save their sorry ass.
> 
> This patch takes into account the order kswapd is reclaiming at before
> waiting on congestion. The higher the order, the longer it is before
> kswapd considers itself to be in trouble. The impact is that kswapd
> works harder in parallel rather than depending on direct reclaimers or
> atomic allocations to fail.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/vmscan.c |   14 ++++++++++++--
>  1 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ffa1766..5e200f1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1946,7 +1946,7 @@ static int sleeping_prematurely(int order, long remaining)
>  static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
>  {
>  	int all_zones_ok;
> -	int priority;
> +	int priority, congestion_priority;
>  	int i;
>  	unsigned long total_scanned;
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
> @@ -1967,6 +1967,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
>  	 */
>  	int temp_priority[MAX_NR_ZONES];
>  
> +	/*
> +	 * When priority reaches congestion_priority, kswapd will sleep
> +	 * for a short time while congestion clears. The higher the
> +	 * order being reclaimed, the less likely kswapd will go to
> +	 * sleep as high-order allocations are harder to reclaim and
> +	 * stall direct reclaimers longer
> +	 */
> +	congestion_priority = DEF_PRIORITY - 2;
> +	congestion_priority -= min(congestion_priority, sc.order);

This calculation mean

	sc.order	congestion_priority	scan-pages
	---------------------------------------------------------
	0		10			1/1024 * zone-mem
	1		9			1/512  * zone-mem
	2		8			1/256  * zone-mem
	3		7			1/128  * zone-mem
	4		6			1/64   * zone-mem
	5		5			1/32   * zone-mem
	6		4			1/16   * zone-mem
	7		3			1/8    * zone-mem
	8		2			1/4    * zone-mem
	9		1			1/2    * zone-mem
	10		0			1      * zone-mem
	11+		0			1      * zone-mem

I feel this is too agressive. The intention of this congestion_wait()
is to prevent kswapd use 100% cpu time. but the above promotion seems
break it.

example,
ia64 have 256MB hugepage (i.e. order=14). it mean kswapd never sleep.

example2,
order-3 (i.e. PAGE_ALLOC_COSTLY_ORDER) makes one of most inefficent
reclaim, because it doesn't use lumpy recliam.
I've seen 128GB size zone, it mean 1/128 = 1GB. oh well, kswapd definitely
waste cpu time 100%.


> +
>  loop_again:
>  	total_scanned = 0;
>  	sc.nr_reclaimed = 0;
> @@ -2092,7 +2102,7 @@ loop_again:
>  		 * OK, kswapd is getting into trouble.  Take a nap, then take
>  		 * another pass across the zones.
>  		 */
> -		if (total_scanned && priority < DEF_PRIORITY - 2)
> +		if (total_scanned && priority < congestion_priority)
>  			congestion_wait(BLK_RW_ASYNC, HZ/10);

Instead, How about this?



---
 mm/vmscan.c |   13 ++++++++++++-
 1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64e4388..937e90d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1938,6 +1938,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 	 * free_pages == high_wmark_pages(zone).
 	 */
 	int temp_priority[MAX_NR_ZONES];
+	int has_under_min_watermark_zone = 0;
 
 loop_again:
 	total_scanned = 0;
@@ -2057,6 +2058,15 @@ loop_again:
 			if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
+
+			/*
+			 * We are still under min water mark. it mean we have
+			 * GFP_ATOMIC allocation failure risk. Hurry up!
+			 */
+			if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
+					      end_zone, 0))
+				has_under_min_watermark_zone = 1;
+
 		}
 		if (all_zones_ok)
 			break;		/* kswapd: all done */
@@ -2064,7 +2074,8 @@ loop_again:
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
 		 * another pass across the zones.
 		 */
-		if (total_scanned && priority < DEF_PRIORITY - 2)
+		if (total_scanned && (priority < DEF_PRIORITY - 2) &&
+		    !has_under_min_watermark_zone)
 			congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
-- 
1.6.2.5



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble
  2009-11-13  9:54   ` KOSAKI Motohiro
@ 2009-11-13 13:54     ` Mel Gorman
  2009-11-13 14:48       ` Minchan Kim
  2009-11-13 18:00       ` KOSAKI Motohiro
  2009-11-13 14:38     ` [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble Minchan Kim
  1 sibling, 2 replies; 65+ messages in thread
From: Mel Gorman @ 2009-11-13 13:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Fri, Nov 13, 2009 at 06:54:29PM +0900, KOSAKI Motohiro wrote:
> > If reclaim fails to make sufficient progress, the priority is raised.
> > Once the priority is higher, kswapd starts waiting on congestion.
> > However, on systems with large numbers of high-order atomics due to
> > crappy network cards, it's important that kswapd keep working in
> > parallel to save their sorry ass.
> > 
> > This patch takes into account the order kswapd is reclaiming at before
> > waiting on congestion. The higher the order, the longer it is before
> > kswapd considers itself to be in trouble. The impact is that kswapd
> > works harder in parallel rather than depending on direct reclaimers or
> > atomic allocations to fail.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/vmscan.c |   14 ++++++++++++--
> >  1 files changed, 12 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ffa1766..5e200f1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1946,7 +1946,7 @@ static int sleeping_prematurely(int order, long remaining)
> >  static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> >  {
> >  	int all_zones_ok;
> > -	int priority;
> > +	int priority, congestion_priority;
> >  	int i;
> >  	unsigned long total_scanned;
> >  	struct reclaim_state *reclaim_state = current->reclaim_state;
> > @@ -1967,6 +1967,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> >  	 */
> >  	int temp_priority[MAX_NR_ZONES];
> >  
> > +	/*
> > +	 * When priority reaches congestion_priority, kswapd will sleep
> > +	 * for a short time while congestion clears. The higher the
> > +	 * order being reclaimed, the less likely kswapd will go to
> > +	 * sleep as high-order allocations are harder to reclaim and
> > +	 * stall direct reclaimers longer
> > +	 */
> > +	congestion_priority = DEF_PRIORITY - 2;
> > +	congestion_priority -= min(congestion_priority, sc.order);
> 
> This calculation mean
> 
> 	sc.order	congestion_priority	scan-pages
> 	---------------------------------------------------------
> 	0		10			1/1024 * zone-mem
> 	1		9			1/512  * zone-mem
> 	2		8			1/256  * zone-mem
> 	3		7			1/128  * zone-mem
> 	4		6			1/64   * zone-mem
> 	5		5			1/32   * zone-mem
> 	6		4			1/16   * zone-mem
> 	7		3			1/8    * zone-mem
> 	8		2			1/4    * zone-mem
> 	9		1			1/2    * zone-mem
> 	10		0			1      * zone-mem
> 	11+		0			1      * zone-mem
> 
> I feel this is too agressive. The intention of this congestion_wait()
> is to prevent kswapd use 100% cpu time.

Ok, I thought the intention might be to avoid dumping too many pages on
the queue but it was already waiting on congestion elsewhere.

> but the above promotion seems
> break it.
> 
> example,
> ia64 have 256MB hugepage (i.e. order=14). it mean kswapd never sleep.
> 
> example2,
> order-3 (i.e. PAGE_ALLOC_COSTLY_ORDER) makes one of most inefficent
> reclaim, because it doesn't use lumpy recliam.
> I've seen 128GB size zone, it mean 1/128 = 1GB. oh well, kswapd definitely
> waste cpu time 100%.
> 
> 
> > +
> >  loop_again:
> >  	total_scanned = 0;
> >  	sc.nr_reclaimed = 0;
> > @@ -2092,7 +2102,7 @@ loop_again:
> >  		 * OK, kswapd is getting into trouble.  Take a nap, then take
> >  		 * another pass across the zones.
> >  		 */
> > -		if (total_scanned && priority < DEF_PRIORITY - 2)
> > +		if (total_scanned && priority < congestion_priority)
> >  			congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
> Instead, How about this?
> 

This makes a lot of sense. Tests look good and I added stats to make sure
the logic was triggering. On X86, kswapd avoided a congestion_wait 11723
times and X86-64 avoided it 5084 times. I think we should hold onto the
stats temporarily until all these bugs are ironed out.

Would you like to sign off the following?

If you are ok to sign off, this patch should replace my patch 5 in
the series.

==== CUT HERE ====

vmscan: Stop kswapd waiting on congestion when the min watermark is not being met

If reclaim fails to make sufficient progress, the priority is raised.
Once the priority is higher, kswapd starts waiting on congestion.  However,
if the zone is below the min watermark then kswapd needs to continue working
without delay as there is a danger of an increased rate of GFP_ATOMIC
allocation failure.

This patch changes the conditions under which kswapd waits on
congestion by only going to sleep if the min watermarks are being met.

[mel@csn.ul.ie: Add stats to track how relevant the logic is]
Needs-signed-off-by-original-author

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 9716003..7d66695 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -41,6 +41,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #endif
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
 		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
+		KSWAPD_NO_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffa1766..70967e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1966,6 +1966,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 	 * free_pages == high_wmark_pages(zone).
 	 */
 	int temp_priority[MAX_NR_ZONES];
+	int has_under_min_watermark_zone = 0;
 
 loop_again:
 	total_scanned = 0;
@@ -2085,6 +2086,15 @@ loop_again:
 			if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
+
+			/*
+			 * We are still under min water mark. it mean we have
+			 * GFP_ATOMIC allocation failure risk. Hurry up!
+			 */
+			if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
+					      end_zone, 0))
+				has_under_min_watermark_zone = 1;
+
 		}
 		if (all_zones_ok)
 			break;		/* kswapd: all done */
@@ -2092,8 +2102,13 @@ loop_again:
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
 		 * another pass across the zones.
 		 */
-		if (total_scanned && priority < DEF_PRIORITY - 2)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (total_scanned && (priority < DEF_PRIORITY - 2)) {
+
+			if (!has_under_min_watermark_zone)
+				count_vm_event(KSWAPD_NO_CONGESTION_WAIT);
+			else
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+		}
 
 		/*
 		 * We do this so kswapd doesn't build up large priorities for
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 90b11e4..bc09547 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -685,6 +685,7 @@ static const char * const vmstat_text[] = {
 	"kswapd_inodesteal",
 	"kswapd_slept_prematurely_fast",
 	"kswapd_slept_prematurely_slow",
+	"kswapd_no_congestion_wait",
 	"pageoutrun",
 	"allocstall",
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble
  2009-11-13 13:54     ` Mel Gorman
@ 2009-11-13 14:48       ` Minchan Kim
  2009-11-13 18:00       ` KOSAKI Motohiro
  1 sibling, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2009-11-13 14:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Fri, Nov 13, 2009 at 10:54 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Fri, Nov 13, 2009 at 06:54:29PM +0900, KOSAKI Motohiro wrote:
>> > If reclaim fails to make sufficient progress, the priority is raised.
>> > Once the priority is higher, kswapd starts waiting on congestion.
>> > However, on systems with large numbers of high-order atomics due to
>> > crappy network cards, it's important that kswapd keep working in
>> > parallel to save their sorry ass.
>> >
>> > This patch takes into account the order kswapd is reclaiming at before
>> > waiting on congestion. The higher the order, the longer it is before
>> > kswapd considers itself to be in trouble. The impact is that kswapd
>> > works harder in parallel rather than depending on direct reclaimers or
>> > atomic allocations to fail.
>> >
>> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> > ---
>> >  mm/vmscan.c |   14 ++++++++++++--
>> >  1 files changed, 12 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index ffa1766..5e200f1 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1946,7 +1946,7 @@ static int sleeping_prematurely(int order, long remaining)
>> >  static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
>> >  {
>> >     int all_zones_ok;
>> > -   int priority;
>> > +   int priority, congestion_priority;
>> >     int i;
>> >     unsigned long total_scanned;
>> >     struct reclaim_state *reclaim_state = current->reclaim_state;
>> > @@ -1967,6 +1967,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
>> >      */
>> >     int temp_priority[MAX_NR_ZONES];
>> >
>> > +   /*
>> > +    * When priority reaches congestion_priority, kswapd will sleep
>> > +    * for a short time while congestion clears. The higher the
>> > +    * order being reclaimed, the less likely kswapd will go to
>> > +    * sleep as high-order allocations are harder to reclaim and
>> > +    * stall direct reclaimers longer
>> > +    */
>> > +   congestion_priority = DEF_PRIORITY - 2;
>> > +   congestion_priority -= min(congestion_priority, sc.order);
>>
>> This calculation mean
>>
>>       sc.order        congestion_priority     scan-pages
>>       ---------------------------------------------------------
>>       0               10                      1/1024 * zone-mem
>>       1               9                       1/512  * zone-mem
>>       2               8                       1/256  * zone-mem
>>       3               7                       1/128  * zone-mem
>>       4               6                       1/64   * zone-mem
>>       5               5                       1/32   * zone-mem
>>       6               4                       1/16   * zone-mem
>>       7               3                       1/8    * zone-mem
>>       8               2                       1/4    * zone-mem
>>       9               1                       1/2    * zone-mem
>>       10              0                       1      * zone-mem
>>       11+             0                       1      * zone-mem
>>
>> I feel this is too agressive. The intention of this congestion_wait()
>> is to prevent kswapd use 100% cpu time.

As I said in reply of kosaki's patch, I can't understand point.

> Ok, I thought the intention might be to avoid dumping too many pages on
> the queue but it was already waiting on congestion elsewhere.
>
>> but the above promotion seems
>> break it.
>>
>> example,
>> ia64 have 256MB hugepage (i.e. order=14). it mean kswapd never sleep.
>> example2,

But, This is a true problem missed in my review.
Thanks, Kosaki.

>> order-3 (i.e. PAGE_ALLOC_COSTLY_ORDER) makes one of most inefficent
>> reclaim, because it doesn't use lumpy recliam.
>> I've seen 128GB size zone, it mean 1/128 = 1GB. oh well, kswapd definitely
>> waste cpu time 100%.
>>
>>
>> > +
>> >  loop_again:
>> >     total_scanned = 0;
>> >     sc.nr_reclaimed = 0;
>> > @@ -2092,7 +2102,7 @@ loop_again:
>> >              * OK, kswapd is getting into trouble.  Take a nap, then take
>> >              * another pass across the zones.
>> >              */
>> > -           if (total_scanned && priority < DEF_PRIORITY - 2)
>> > +           if (total_scanned && priority < congestion_priority)
>> >                     congestion_wait(BLK_RW_ASYNC, HZ/10);
>>
>> Instead, How about this?
>>
>
> This makes a lot of sense. Tests look good and I added stats to make sure
> the logic was triggering. On X86, kswapd avoided a congestion_wait 11723
> times and X86-64 avoided it 5084 times. I think we should hold onto the
> stats temporarily until all these bugs are ironed out.
>
> Would you like to sign off the following?
>
> If you are ok to sign off, this patch should replace my patch 5 in
> the series.

I agree Kosaki's patch is more strightforward.

You can add my review sign, too.
Thanks for good patch, Kosaki. :)

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble
  2009-11-13 13:54     ` Mel Gorman
  2009-11-13 14:48       ` Minchan Kim
@ 2009-11-13 18:00       ` KOSAKI Motohiro
  2009-11-13 18:15         ` [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met Mel Gorman
  1 sibling, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-11-13 18:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List


> This makes a lot of sense. Tests look good and I added stats to make sure
> the logic was triggering. On X86, kswapd avoided a congestion_wait 11723
> times and X86-64 avoided it 5084 times. I think we should hold onto the
> stats temporarily until all these bugs are ironed out.
> 
> Would you like to sign off the following?
> 
> If you are ok to sign off, this patch should replace my patch 5 in
> the series.

I'm sorry, I found my bug.
Please see below.

> 
> ==== CUT HERE ====
> 
> vmscan: Stop kswapd waiting on congestion when the min watermark is not being met
> 
> If reclaim fails to make sufficient progress, the priority is raised.
> Once the priority is higher, kswapd starts waiting on congestion.  However,
> if the zone is below the min watermark then kswapd needs to continue working
> without delay as there is a danger of an increased rate of GFP_ATOMIC
> allocation failure.
> 
> This patch changes the conditions under which kswapd waits on
> congestion by only going to sleep if the min watermarks are being met.
> 
> [mel@csn.ul.ie: Add stats to track how relevant the logic is]
> Needs-signed-off-by-original-author
> 
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 9716003..7d66695 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -41,6 +41,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #endif
>  		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
>  		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
> +		KSWAPD_NO_CONGESTION_WAIT,
>  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
>  #ifdef CONFIG_HUGETLB_PAGE
>  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ffa1766..70967e1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1966,6 +1966,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
>  	 * free_pages == high_wmark_pages(zone).
>  	 */
>  	int temp_priority[MAX_NR_ZONES];
> +	int has_under_min_watermark_zone = 0;

This is wrong declaration place.  It must change to

        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                int end_zone = 0;       /* Inclusive.  0 = ZONE_DMA */
                unsigned long lru_pages = 0;
+                int has_under_min_watermark_zone = 0;


because, has_under_min_watermark_zone should be initialized every priority.


>  loop_again:
>  	total_scanned = 0;
> @@ -2085,6 +2086,15 @@ loop_again:
>  			if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
>  			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
>  				sc.may_writepage = 1;
> +
> +			/*
> +			 * We are still under min water mark. it mean we have
> +			 * GFP_ATOMIC allocation failure risk. Hurry up!
> +			 */
> +			if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
> +					      end_zone, 0))
> +				has_under_min_watermark_zone = 1;
> +
>  		}
>  		if (all_zones_ok)
>  			break;		/* kswapd: all done */
> @@ -2092,8 +2102,13 @@ loop_again:
>  		 * OK, kswapd is getting into trouble.  Take a nap, then take
>  		 * another pass across the zones.
>  		 */
> -		if (total_scanned && priority < DEF_PRIORITY - 2)
> -			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		if (total_scanned && (priority < DEF_PRIORITY - 2)) {
> +

This blank line is unnecesary.

> +			if (!has_under_min_watermark_zone)

Probably "if (has_under_min_watermark_zone)" is correct.


> +				count_vm_event(KSWAPD_NO_CONGESTION_WAIT);
> +			else
> +				congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		}

Otherthing looks pretty good to me. please feel free to add my s-o-b or reviewed-by.

Thanks.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met
  2009-11-13 18:00       ` KOSAKI Motohiro
@ 2009-11-13 18:15         ` Mel Gorman
  2009-11-13 18:26           ` Frans Pop
                             ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Mel Gorman @ 2009-11-13 18:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

If reclaim fails to make sufficient progress, the priority is raised.
Once the priority is higher, kswapd starts waiting on congestion.  However,
if the zone is below the min watermark then kswapd needs to continue working
without delay as there is a danger of an increased rate of GFP_ATOMIC
allocation failure.

This patch changes the conditions under which kswapd waits on
congestion by only going to sleep if the min watermarks are being met.

This patch replaces
vmscan-take-order-into-consideration-when-deciding-if-kswapd-is-in-trouble.patch .

[mel@csn.ul.ie: Add stats to track how relevant the logic is]
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 include/linux/vmstat.h |    1 +
 mm/vmscan.c            |   18 ++++++++++++++++--
 mm/vmstat.c            |    1 +
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 9716003..7d66695 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -41,6 +41,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #endif
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
 		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
+		KSWAPD_NO_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffa1766..70a2322 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1979,6 +1979,7 @@ loop_again:
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
+		int has_under_min_watermark_zone = 0;
 
 		/* The swap token gets in the way of swapout... */
 		if (!priority)
@@ -2085,6 +2086,15 @@ loop_again:
 			if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
+
+			/*
+			 * We are still under min water mark. it mean we have
+			 * GFP_ATOMIC allocation failure risk. Hurry up!
+			 */
+			if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
+					      end_zone, 0))
+				has_under_min_watermark_zone = 1;
+
 		}
 		if (all_zones_ok)
 			break;		/* kswapd: all done */
@@ -2092,8 +2102,12 @@ loop_again:
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
 		 * another pass across the zones.
 		 */
-		if (total_scanned && priority < DEF_PRIORITY - 2)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (total_scanned && (priority < DEF_PRIORITY - 2)) {
+			if (!has_under_min_watermark_zone)
+				count_vm_event(KSWAPD_NO_CONGESTION_WAIT);
+			else
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+		}
 
 		/*
 		 * We do this so kswapd doesn't build up large priorities for
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 90b11e4..bc09547 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -685,6 +685,7 @@ static const char * const vmstat_text[] = {
 	"kswapd_inodesteal",
 	"kswapd_slept_prematurely_fast",
 	"kswapd_slept_prematurely_slow",
+	"kswapd_no_congestion_wait",
 	"pageoutrun",
 	"allocstall",
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met
  2009-11-13 18:15         ` [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met Mel Gorman
@ 2009-11-13 18:26           ` Frans Pop
  2009-11-13 18:33           ` KOSAKI Motohiro
  2009-11-13 18:36           ` [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met Rik van Riel
  2 siblings, 0 replies; 65+ messages in thread
From: Frans Pop @ 2009-11-13 18:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, KOSAKI Motohiro, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Friday 13 November 2009, Mel Gorman wrote:
> If reclaim fails to make sufficient progress, the priority is raised.
> Once the priority is higher, kswapd starts waiting on congestion.
>  However, if the zone is below the min watermark then kswapd needs to
> continue working without delay as there is a danger of an increased rate
> of GFP_ATOMIC allocation failure.
>
> This patch changes the conditions under which kswapd waits on
> congestion by only going to sleep if the min watermarks are being met.
>
> This patch replaces
> vmscan-take-order-into-consideration-when-deciding-if-kswapd-is-in-troub
>le.patch .
>
> [mel@csn.ul.ie: Add stats to track how relevant the logic is]
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

For this to work with git-am, the From: line has to be _above_ the patch 
description (must be the first line of the mail even). AFAIK at least.

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Cheers,
FJP

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met
  2009-11-13 18:15         ` [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met Mel Gorman
  2009-11-13 18:26           ` Frans Pop
@ 2009-11-13 18:33           ` KOSAKI Motohiro
  2009-11-13 20:03             ` [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2 Mel Gorman
  2009-11-13 18:36           ` [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met Rik van Riel
  2 siblings, 1 reply; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-11-13 18:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

> @@ -2092,8 +2102,12 @@ loop_again:
>                 * OK, kswapd is getting into trouble.  Take a nap, then take
>                 * another pass across the zones.
>                 */
> -               if (total_scanned && priority < DEF_PRIORITY - 2)
> -                       congestion_wait(BLK_RW_ASYNC, HZ/10);
> +               if (total_scanned && (priority < DEF_PRIORITY - 2)) {
> +                       if (!has_under_min_watermark_zone)

if I am correct, we must to remove "!".


> +                               count_vm_event(KSWAPD_NO_CONGESTION_WAIT);
> +                       else
> +                               congestion_wait(BLK_RW_ASYNC, HZ/10);
> +               }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2
  2009-11-13 18:33           ` KOSAKI Motohiro
@ 2009-11-13 20:03             ` Mel Gorman
  2009-11-26 14:45               ` Tobias Oetiker
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-11-13 20:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

The last version has a stupid bug in it. Sorry.

Changelog since V1
  o Fix incorrect negation
  o Rename kswapd_no_congestion_wait to kswapd_skip_congestion_wait as
    suggested by Rik

If reclaim fails to make sufficient progress, the priority is raised.
Once the priority is higher, kswapd starts waiting on congestion.  However,
if the zone is below the min watermark then kswapd needs to continue working
without delay as there is a danger of an increased rate of GFP_ATOMIC
allocation failure.

This patch changes the conditions under which kswapd waits on
congestion by only going to sleep if the min watermarks are being met.

[mel@csn.ul.ie: Add stats to track how relevant the logic is]
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
--- 
 include/linux/vmstat.h |    1 +
 mm/vmscan.c            |   18 ++++++++++++++++--
 mm/vmstat.c            |    1 +
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 9716003..7d66695 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -41,6 +41,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #endif
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
 		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
+		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffa1766..70a2322 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1979,6 +1979,7 @@ loop_again:
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
+		int has_under_min_watermark_zone = 0;
 
 		/* The swap token gets in the way of swapout... */
 		if (!priority)
@ -2085,6 +2086,15 @@ loop_again:
 			if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
+
+			/*
+			 * We are still under min water mark. it mean we have
+			 * GFP_ATOMIC allocation failure risk. Hurry up!
+			 */
+			if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
+					      end_zone, 0))
+				has_under_min_watermark_zone = 1;
+
 		}
 		if (all_zones_ok)
 			break;		/* kswapd: all done */
@@ -2092,8 +2102,12 @@ loop_again:
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
 		 * another pass across the zones.
 		 */
-		if (total_scanned && priority < DEF_PRIORITY - 2)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (total_scanned && (priority < DEF_PRIORITY - 2)) {
+			if (has_under_min_watermark_zone)
+				count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
+			else
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+		}
 
 		/*
 		 * We do this so kswapd doesn't build up large priorities for
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 90b11e4..bc09547 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -685,6 +685,7 @@ static const char * const vmstat_text[] = {
 	"kswapd_inodesteal",
 	"kswapd_slept_prematurely_fast",
 	"kswapd_slept_prematurely_slow",
+	"kswapd_skip_congestion_wait",
 	"pageoutrun",
 	"allocstall",
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2
  2009-11-13 20:03             ` [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2 Mel Gorman
@ 2009-11-26 14:45               ` Tobias Oetiker
  2009-11-29  7:42                 ` still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2) Tobi Oetiker
  0 siblings, 1 reply; 65+ messages in thread
From: Tobias Oetiker @ 2009-11-26 14:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, KOSAKI Motohiro, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

Hi Mel,

Nov 13 Mel Gorman wrote:

> The last version has a stupid bug in it. Sorry.
>
> Changelog since V1
>   o Fix incorrect negation
>   o Rename kswapd_no_congestion_wait to kswapd_skip_congestion_wait as
>     suggested by Rik
>
> If reclaim fails to make sufficient progress, the priority is raised.
> Once the priority is higher, kswapd starts waiting on congestion.  However,
> if the zone is below the min watermark then kswapd needs to continue working
> without delay as there is a danger of an increased rate of GFP_ATOMIC
> allocation failure.
>
> This patch changes the conditions under which kswapd waits on
> congestion by only going to sleep if the min watermarks are being met.

I finally got around to test this together with the whole series on
2.6.31.6. after running it for a day I have not yet seen a single
order:5 allocation problem ... (while I had several an hour before)

for the record, my kernel is now running with the following
patches:

patch1:Date: Thu, 12 Nov 2009 19:30:31 +0000
patch1:Subject: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed

patch2:Date: Thu, 12 Nov 2009 19:30:32 +0000
patch2:Subject: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER

patch3:Date: Thu, 12 Nov 2009 19:30:33 +0000
patch3:Subject: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim

patch4:Date: Thu, 12 Nov 2009 19:30:34 +0000
patch4:Subject: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

patch5:Date: Fri, 13 Nov 2009 20:03:57 +0000
patch5:Subject: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2

patch6:Date: Tue, 17 Nov 2009 10:34:21 +0000
patch6:Subject: [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1

cheers
tobi

-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2)
  2009-11-26 14:45               ` Tobias Oetiker
@ 2009-11-29  7:42                 ` Tobi Oetiker
  2009-12-02 11:32                   ` Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Tobi Oetiker @ 2009-11-29  7:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, KOSAKI Motohiro, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

Hi Mel,

Thursday Tobias Oetiker wrote:
> Hi Mel,
>
> Nov 13 Mel Gorman wrote:
>
> > The last version has a stupid bug in it. Sorry.
> >
> > Changelog since V1
> >   o Fix incorrect negation
> >   o Rename kswapd_no_congestion_wait to kswapd_skip_congestion_wait as
> >     suggested by Rik
> >
> > If reclaim fails to make sufficient progress, the priority is raised.
> > Once the priority is higher, kswapd starts waiting on congestion.  However,
> > if the zone is below the min watermark then kswapd needs to continue working
> > without delay as there is a danger of an increased rate of GFP_ATOMIC
> > allocation failure.
> >
> > This patch changes the conditions under which kswapd waits on
> > congestion by only going to sleep if the min watermarks are being met.
>
> I finally got around to test this together with the whole series on
> 2.6.31.6. after running it for a day I have not yet seen a single
> order:5 allocation problem ... (while I had several an hour before)

> for the record, my kernel is now running with the following
> patches:
>
> patch1:Date: Thu, 12 Nov 2009 19:30:31 +0000
> patch1:Subject: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
>
> patch2:Date: Thu, 12 Nov 2009 19:30:32 +0000
> patch2:Subject: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
>
> patch3:Date: Thu, 12 Nov 2009 19:30:33 +0000
> patch3:Subject: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
>
> patch4:Date: Thu, 12 Nov 2009 19:30:34 +0000
> patch4:Subject: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
>
> patch5:Date: Fri, 13 Nov 2009 20:03:57 +0000
> patch5:Subject: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2
>
> patch6:Date: Tue, 17 Nov 2009 10:34:21 +0000
> patch6:Subject: [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
>
I have now been running the new kernel for a few days and I am
sorry to report that about a day after booting the allocation
failures started showing again. More order:4 instead of order:5 ...

Nov 29 07:16:17 johan kernel: [261565.598627] nfsd: page allocation failure. order:4, mode:0x4020 [kern.warning]
Nov 29 07:16:17 johan kernel: [261565.598638] Pid: 6956, comm: nfsd Tainted: G      D    2.6.31.6-oep #1 [kern.warning]
Nov 29 07:16:17 johan kernel: [261565.598641] Call Trace: [kern.warning]
Nov 29 07:16:17 johan kernel: [261565.598646]  <IRQ>  [<ffffffff810cb730>] __alloc_pages_nodemask+0x570/0x690 [kern.warning]
Nov 29 07:16:17 johan kernel: [261565.598672]  [<ffffffff8142aba5>] ? tcp_tso_segment+0x265/0x2e0 [kern.warning]
Nov 29 07:16:17 johan kernel: [261565.598680]  [<ffffffff810faf08>] kmalloc_large_node+0x68/0xc0 [kern.warning]
Nov 29 07:16:17 johan kernel: [261565.598692]  [<ffffffff810febfa>] __kmalloc_node_track_caller+0x11a/0x180 [kern.warning]

the fact that there were no problems right after booting might give
some indication as to where the problem is. This is on a machine
with 24 GB of memory, as you can see from the 'top' headers, it has
plenty of room to spare and should never run into allocation problems
anyway. The only thing that would have changed after running it for
about a day is that more memory is used for caching ...

# cat /proc/meminfo
MemTotal:       24746268 kB
MemFree:          128012 kB
Buffers:         6255988 kB
Cached:         10600780 kB
SwapCached:        23668 kB
Active:         13078944 kB
Inactive:        9229172 kB
Active(anon):    5455164 kB
Inactive(anon):   508616 kB
Active(file):    7623780 kB
Inactive(file):  8720556 kB
Unevictable:      291768 kB
Mlocked:          291768 kB
SwapTotal:      83886056 kB
SwapFree:       83846520 kB
Dirty:              2028 kB
Writeback:             0 kB
AnonPages:       5723748 kB
Mapped:           152240 kB
Slab:            1339176 kB
SReclaimable:    1270156 kB
SUnreclaim:        69020 kB
PageTables:        31208 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    101208440 kB
Committed_AS:    8146696 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      115084 kB
VmallocChunk:   34359518011 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       10176 kB
DirectMap2M:    25155584 kB

hth
tobi

-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2)
  2009-11-29  7:42                 ` still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2) Tobi Oetiker
@ 2009-12-02 11:32                   ` Mel Gorman
  2009-12-02 21:30                     ` Tobias Oetiker
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-12-02 11:32 UTC (permalink / raw)
  To: Tobi Oetiker
  Cc: Andrew Morton, KOSAKI Motohiro, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Sun, Nov 29, 2009 at 08:42:09AM +0100, Tobi Oetiker wrote:
> Hi Mel,
> 
> Thursday Tobias Oetiker wrote:
> > Hi Mel,
> >
> > Nov 13 Mel Gorman wrote:
> >
> > > The last version has a stupid bug in it. Sorry.
> > >
> > > Changelog since V1
> > >   o Fix incorrect negation
> > >   o Rename kswapd_no_congestion_wait to kswapd_skip_congestion_wait as
> > >     suggested by Rik
> > >
> > > If reclaim fails to make sufficient progress, the priority is raised.
> > > Once the priority is higher, kswapd starts waiting on congestion.  However,
> > > if the zone is below the min watermark then kswapd needs to continue working
> > > without delay as there is a danger of an increased rate of GFP_ATOMIC
> > > allocation failure.
> > >
> > > This patch changes the conditions under which kswapd waits on
> > > congestion by only going to sleep if the min watermarks are being met.
> >
> > I finally got around to test this together with the whole series on
> > 2.6.31.6. after running it for a day I have not yet seen a single
> > order:5 allocation problem ... (while I had several an hour before)
> 
> > for the record, my kernel is now running with the following
> > patches:
> >
> > patch1:Date: Thu, 12 Nov 2009 19:30:31 +0000
> > patch1:Subject: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
> >
> > patch2:Date: Thu, 12 Nov 2009 19:30:32 +0000
> > patch2:Subject: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
> >
> > patch3:Date: Thu, 12 Nov 2009 19:30:33 +0000
> > patch3:Subject: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
> >
> > patch4:Date: Thu, 12 Nov 2009 19:30:34 +0000
> > patch4:Subject: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
> >
> > patch5:Date: Fri, 13 Nov 2009 20:03:57 +0000
> > patch5:Subject: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2
> >
> > patch6:Date: Tue, 17 Nov 2009 10:34:21 +0000
> > patch6:Subject: [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
> >
> I have now been running the new kernel for a few days and I am
> sorry to report that about a day after booting the allocation
> failures started showing again. More order:4 instead of order:5 ...
> 

Why has the order changed?

Also, what allocator were you using in 2.6.30 and 2.6.31.6, SLAB or
SLUB? Did you happen to change them when upgrading the kernel?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2)
  2009-12-02 11:32                   ` Mel Gorman
@ 2009-12-02 21:30                     ` Tobias Oetiker
  2009-12-03 20:26                       ` Corrado Zoccolo
  0 siblings, 1 reply; 65+ messages in thread
From: Tobias Oetiker @ 2009-12-02 21:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, KOSAKI Motohiro, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, linux-kernel, linux-mm,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

Hi Mel,

Today Mel Gorman wrote:

> On Sun, Nov 29, 2009 at 08:42:09AM +0100, Tobi Oetiker wrote:
> > Hi Mel,
> >
> > Thursday Tobias Oetiker wrote:
> > > Hi Mel,
> > >
> > > Nov 13 Mel Gorman wrote:
> > >
> > > > The last version has a stupid bug in it. Sorry.
> > > >
> > > > Changelog since V1
> > > >   o Fix incorrect negation
> > > >   o Rename kswapd_no_congestion_wait to kswapd_skip_congestion_wait as
> > > >     suggested by Rik
> > > >
> > > > If reclaim fails to make sufficient progress, the priority is raised.
> > > > Once the priority is higher, kswapd starts waiting on congestion.  However,
> > > > if the zone is below the min watermark then kswapd needs to continue working
> > > > without delay as there is a danger of an increased rate of GFP_ATOMIC
> > > > allocation failure.
> > > >
> > > > This patch changes the conditions under which kswapd waits on
> > > > congestion by only going to sleep if the min watermarks are being met.
> > >
> > > I finally got around to test this together with the whole series on
> > > 2.6.31.6. after running it for a day I have not yet seen a single
> > > order:5 allocation problem ... (while I had several an hour before)
> >
> > > for the record, my kernel is now running with the following
> > > patches:
> > >
> > > patch1:Date: Thu, 12 Nov 2009 19:30:31 +0000
> > > patch1:Subject: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
> > >
> > > patch2:Date: Thu, 12 Nov 2009 19:30:32 +0000
> > > patch2:Subject: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
> > >
> > > patch3:Date: Thu, 12 Nov 2009 19:30:33 +0000
> > > patch3:Subject: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
> > >
> > > patch4:Date: Thu, 12 Nov 2009 19:30:34 +0000
> > > patch4:Subject: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
> > >
> > > patch5:Date: Fri, 13 Nov 2009 20:03:57 +0000
> > > patch5:Subject: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2
> > >
> > > patch6:Date: Tue, 17 Nov 2009 10:34:21 +0000
> > > patch6:Subject: [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
> > >
> > I have now been running the new kernel for a few days and I am
> > sorry to report that about a day after booting the allocation
> > failures started showing again. More order:4 instead of order:5 ...
> >
>
> Why has the order changed?

? no idea ... the order has changed after applying the patches
cited above.

> Also, what allocator were you using in 2.6.30 and 2.6.31.6, SLAB or
> SLUB? Did you happen to change them when upgrading the kernel?

I have been and still am using SLUB  ...

cheers
tobi


-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2)
  2009-12-02 21:30                     ` Tobias Oetiker
@ 2009-12-03 20:26                       ` Corrado Zoccolo
  2009-12-14  5:59                         ` Tobias Oetiker
  0 siblings, 1 reply; 65+ messages in thread
From: Corrado Zoccolo @ 2009-12-03 20:26 UTC (permalink / raw)
  To: Tobias Oetiker
  Cc: Mel Gorman, Andrew Morton, KOSAKI Motohiro, Frans Pop,
	Jiri Kosina, Sven Geggus, Karol Lewandowski, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

Hi Tobias,
does the patch in http://lkml.org/lkml/2009/11/30/301 help with your
high order allocation problems?
It seems that you have lot of memory, but high order pages do not show up.
The patch should make them more likely to appear.
On my machine (that has much less ram than yours), with the patch, I
always have order-10 pages available.

Corrado

On Wed, Dec 2, 2009 at 10:30 PM, Tobias Oetiker <tobi@oetiker.ch> wrote:
> Hi Mel,
>
> Today Mel Gorman wrote:
>
>> On Sun, Nov 29, 2009 at 08:42:09AM +0100, Tobi Oetiker wrote:
>> > Hi Mel,
>> >
>> > Thursday Tobias Oetiker wrote:
>> > > Hi Mel,
>> > >
>> > > Nov 13 Mel Gorman wrote:
>> > >
>> > > > The last version has a stupid bug in it. Sorry.
>> > > >
>> > > > Changelog since V1
>> > > >   o Fix incorrect negation
>> > > >   o Rename kswapd_no_congestion_wait to kswapd_skip_congestion_wait as
>> > > >     suggested by Rik
>> > > >
>> > > > If reclaim fails to make sufficient progress, the priority is raised.
>> > > > Once the priority is higher, kswapd starts waiting on congestion.  However,
>> > > > if the zone is below the min watermark then kswapd needs to continue working
>> > > > without delay as there is a danger of an increased rate of GFP_ATOMIC
>> > > > allocation failure.
>> > > >
>> > > > This patch changes the conditions under which kswapd waits on
>> > > > congestion by only going to sleep if the min watermarks are being met.
>> > >
>> > > I finally got around to test this together with the whole series on
>> > > 2.6.31.6. after running it for a day I have not yet seen a single
>> > > order:5 allocation problem ... (while I had several an hour before)
>> >
>> > > for the record, my kernel is now running with the following
>> > > patches:
>> > >
>> > > patch1:Date: Thu, 12 Nov 2009 19:30:31 +0000
>> > > patch1:Subject: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
>> > >
>> > > patch2:Date: Thu, 12 Nov 2009 19:30:32 +0000
>> > > patch2:Subject: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
>> > >
>> > > patch3:Date: Thu, 12 Nov 2009 19:30:33 +0000
>> > > patch3:Subject: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
>> > >
>> > > patch4:Date: Thu, 12 Nov 2009 19:30:34 +0000
>> > > patch4:Subject: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
>> > >
>> > > patch5:Date: Fri, 13 Nov 2009 20:03:57 +0000
>> > > patch5:Subject: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2
>> > >
>> > > patch6:Date: Tue, 17 Nov 2009 10:34:21 +0000
>> > > patch6:Subject: [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
>> > >
>> > I have now been running the new kernel for a few days and I am
>> > sorry to report that about a day after booting the allocation
>> > failures started showing again. More order:4 instead of order:5 ...
>> >
>>
>> Why has the order changed?
>
> ? no idea ... the order has changed after applying the patches
> cited above.
>
>> Also, what allocator were you using in 2.6.30 and 2.6.31.6, SLAB or
>> SLUB? Did you happen to change them when upgrading the kernel?
>
> I have been and still am using SLUB  ...
>
> cheers
> tobi
>
>
> --
> Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
> http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2)
  2009-12-03 20:26                       ` Corrado Zoccolo
@ 2009-12-14  5:59                         ` Tobias Oetiker
  2009-12-14  8:49                           ` Corrado Zoccolo
  0 siblings, 1 reply; 65+ messages in thread
From: Tobias Oetiker @ 2009-12-14  5:59 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Mel Gorman, Andrew Morton, KOSAKI Motohiro, Frans Pop,
	Jiri Kosina, Sven Geggus, Karol Lewandowski, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

Hi Corrado,

Dec 3 Corrado Zoccolo wrote:

> Hi Tobias,
> does the patch in http://lkml.org/lkml/2009/11/30/301 help with your
> high order allocation problems?
> It seems that you have lot of memory, but high order pages do not show up.
> The patch should make them more likely to appear.
> On my machine (that has much less ram than yours), with the patch, I
> always have order-10 pages available.

I have tried it and ... it does not work, the  page allocation
failure still shows. BUT while testing it on two machines I found that it
only shows on on machine. The workload on the two machines is
similar (they both run virtualbox) and also the available memory.

Could it be caused by a hardware driver ?

cheers
tobi

-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2)
  2009-12-14  5:59                         ` Tobias Oetiker
@ 2009-12-14  8:49                           ` Corrado Zoccolo
  0 siblings, 0 replies; 65+ messages in thread
From: Corrado Zoccolo @ 2009-12-14  8:49 UTC (permalink / raw)
  To: Tobias Oetiker
  Cc: Mel Gorman, Andrew Morton, KOSAKI Motohiro, Frans Pop,
	Jiri Kosina, Sven Geggus, Karol Lewandowski, linux-kernel,
	linux-mm, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

Hi Tobi,
On Mon, Dec 14, 2009 at 6:59 AM, Tobias Oetiker <tobi@oetiker.ch> wrote:
> Hi Corrado,
>
> Dec 3 Corrado Zoccolo wrote:
>
>> Hi Tobias,
>> does the patch in http://lkml.org/lkml/2009/11/30/301 help with your
>> high order allocation problems?
>> It seems that you have lot of memory, but high order pages do not show up.
>> The patch should make them more likely to appear.
>> On my machine (that has much less ram than yours), with the patch, I
>> always have order-10 pages available.
>
> I have tried it and ... it does not work, the  page allocation
> failure still shows. BUT while testing it on two machines I found that it
> only shows on on machine. The workload on the two machines is
> similar (they both run virtualbox) and also the available memory.

Where those both failing before the patch?
Did the order of failure change?

> Could it be caused by a hardware driver ?
It should be something that is taking more time to release pages, but
I don't know what can it be. What happens if you drop the caches when
you are getting failures? Does the failure rate drops as if you had
just rebooted?
Can you log at regular intervals the content of /proc/buddyinfo, and
try  correlating when the number of pages of the requested order are
becoming scarce with some other event?

Thanks,
Corrado
>
> cheers
> tobi
>
> --
> Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
> http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met
  2009-11-13 18:15         ` [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met Mel Gorman
  2009-11-13 18:26           ` Frans Pop
  2009-11-13 18:33           ` KOSAKI Motohiro
@ 2009-11-13 18:36           ` Rik van Riel
  2 siblings, 0 replies; 65+ messages in thread
From: Rik van Riel @ 2009-11-13 18:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, KOSAKI Motohiro, Frans Pop, Jiri Kosina,
	Sven Geggus, Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm, Pekka Enberg, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On 11/13/2009 01:15 PM, Mel Gorman wrote:
> If reclaim fails to make sufficient progress, the priority is raised.
> Once the priority is higher, kswapd starts waiting on congestion.  However,
> if the zone is below the min watermark then kswapd needs to continue working
> without delay as there is a danger of an increased rate of GFP_ATOMIC
> allocation failure.
>
> This patch changes the conditions under which kswapd waits on
> congestion by only going to sleep if the min watermarks are being met.
>
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 90b11e4..bc09547 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -685,6 +685,7 @@ static const char * const vmstat_text[] = {
>   	"kswapd_inodesteal",
>   	"kswapd_slept_prematurely_fast",
>   	"kswapd_slept_prematurely_slow",
> +	"kswapd_no_congestion_wait",
>   	
>    
Perhaps better named "kswapd_skip_congestion_wait" ?

Other than that, the patch looks good to me.

Reviewed-by: Rik van Riel <riel@redhat.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble
  2009-11-13  9:54   ` KOSAKI Motohiro
  2009-11-13 13:54     ` Mel Gorman
@ 2009-11-13 14:38     ` Minchan Kim
  1 sibling, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2009-11-13 14:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm@kvack.org",
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

Hi, Koskai.

I missed this patch.
I noticed this after Mel reply your patch.

On Fri, Nov 13, 2009 at 6:54 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> If reclaim fails to make sufficient progress, the priority is raised.
>> Once the priority is higher, kswapd starts waiting on congestion.
>> However, on systems with large numbers of high-order atomics due to
>> crappy network cards, it's important that kswapd keep working in
>> parallel to save their sorry ass.
>>
>> This patch takes into account the order kswapd is reclaiming at before
>> waiting on congestion. The higher the order, the longer it is before
>> kswapd considers itself to be in trouble. The impact is that kswapd
>> works harder in parallel rather than depending on direct reclaimers or
>> atomic allocations to fail.
>>
>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> ---
>>  mm/vmscan.c |   14 ++++++++++++--
>>  1 files changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index ffa1766..5e200f1 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1946,7 +1946,7 @@ static int sleeping_prematurely(int order, long remaining)
>>  static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
>>  {
>>       int all_zones_ok;
>> -     int priority;
>> +     int priority, congestion_priority;
>>       int i;
>>       unsigned long total_scanned;
>>       struct reclaim_state *reclaim_state = current->reclaim_state;
>> @@ -1967,6 +1967,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
>>        */
>>       int temp_priority[MAX_NR_ZONES];
>>
>> +     /*
>> +      * When priority reaches congestion_priority, kswapd will sleep
>> +      * for a short time while congestion clears. The higher the
>> +      * order being reclaimed, the less likely kswapd will go to
>> +      * sleep as high-order allocations are harder to reclaim and
>> +      * stall direct reclaimers longer
>> +      */
>> +     congestion_priority = DEF_PRIORITY - 2;
>> +     congestion_priority -= min(congestion_priority, sc.order);
>
> This calculation mean
>
>        sc.order        congestion_priority     scan-pages
>        ---------------------------------------------------------
>        0               10                      1/1024 * zone-mem
>        1               9                       1/512  * zone-mem
>        2               8                       1/256  * zone-mem
>        3               7                       1/128  * zone-mem
>        4               6                       1/64   * zone-mem
>        5               5                       1/32   * zone-mem
>        6               4                       1/16   * zone-mem
>        7               3                       1/8    * zone-mem
>        8               2                       1/4    * zone-mem
>        9               1                       1/2    * zone-mem
>        10              0                       1      * zone-mem
>        11+             0                       1      * zone-mem
>
> I feel this is too agressive. The intention of this congestion_wait()
> is to prevent kswapd use 100% cpu time. but the above promotion seems
> break it.

I can't understand your point.
Mel didn't change the number of scan pages.
It denpends on priority.
He just added another one to prevent frequent contestion_wait.
Still, shrink_zone is called with priority, not congestion_priority.

> example,
> ia64 have 256MB hugepage (i.e. order=14). it mean kswapd never sleep.

Indeed. Good catch.

> example2,
> order-3 (i.e. PAGE_ALLOC_COSTLY_ORDER) makes one of most inefficent
> reclaim, because it doesn't use lumpy recliam.
> I've seen 128GB size zone, it mean 1/128 = 1GB. oh well, kswapd definitely
> waste cpu time 100%.

Above I said, It depends on priority, not congestion_priority.

>
>> +
>>  loop_again:
>>       total_scanned = 0;
>>       sc.nr_reclaimed = 0;
>> @@ -2092,7 +2102,7 @@ loop_again:
>>                * OK, kswapd is getting into trouble.  Take a nap, then take
>>                * another pass across the zones.
>>                */
>> -             if (total_scanned && priority < DEF_PRIORITY - 2)
>> +             if (total_scanned && priority < congestion_priority)
>>                       congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> Instead, How about this?
>
>
>
> ---
>  mm/vmscan.c |   13 ++++++++++++-
>  1 files changed, 12 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 64e4388..937e90d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1938,6 +1938,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
>         * free_pages == high_wmark_pages(zone).
>         */
>        int temp_priority[MAX_NR_ZONES];
> +       int has_under_min_watermark_zone = 0;

Let's make the shorter.
How about "under_min_watermark"?

>
>  loop_again:
>        total_scanned = 0;
> @@ -2057,6 +2058,15 @@ loop_again:
>                        if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
>                            total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
>                                sc.may_writepage = 1;
> +
> +                       /*
> +                        * We are still under min water mark. it mean we have
> +                        * GFP_ATOMIC allocation failure risk. Hurry up!
> +                        */
> +                       if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
> +                                             end_zone, 0))
> +                               has_under_min_watermark_zone = 1;
> +
>                }
>                if (all_zones_ok)
>                        break;          /* kswapd: all done */
> @@ -2064,7 +2074,8 @@ loop_again:
>                 * OK, kswapd is getting into trouble.  Take a nap, then take
>                 * another pass across the zones.
>                 */
> -               if (total_scanned && priority < DEF_PRIORITY - 2)
> +               if (total_scanned && (priority < DEF_PRIORITY - 2) &&
> +                   !has_under_min_watermark_zone)
>                        congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>                /*
> --
> 1.6.2.5

Anyway, Looks good to me.
It's more straightforward than Mel's one, I think.

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble
  2009-11-12 19:30 ` [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble Mel Gorman
  2009-11-13  9:54   ` KOSAKI Motohiro
@ 2009-11-13 12:41   ` Minchan Kim
  1 sibling, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2009-11-13 12:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel,
	linux-mm@kvack.org",
	KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Fri, Nov 13, 2009 at 4:30 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> If reclaim fails to make sufficient progress, the priority is raised.
> Once the priority is higher, kswapd starts waiting on congestion.
> However, on systems with large numbers of high-order atomics due to
> crappy network cards, it's important that kswapd keep working in
> parallel to save their sorry ass.
>
> This patch takes into account the order kswapd is reclaiming at before
> waiting on congestion. The higher the order, the longer it is before
> kswapd considers itself to be in trouble. The impact is that kswapd
> works harder in parallel rather than depending on direct reclaimers or
> atomic allocations to fail.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

It's make sense to me.
It can help high order atomic allocation which is a big problem of allocator. :)

Thanks Mel.

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3
  2009-11-12 19:30 [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3 Mel Gorman
                   ` (4 preceding siblings ...)
  2009-11-12 19:30 ` [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble Mel Gorman
@ 2009-11-13  9:04 ` Frans Pop
  2009-11-16 17:57   ` Mel Gorman
  2009-11-13 12:47 ` Tobias Oetiker
  2009-11-15 12:07 ` Karol Lewandowski
  7 siblings, 1 reply; 65+ messages in thread
From: Frans Pop @ 2009-11-13  9:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker, linux-kernel, linux-mm, KOSAKI Motohiro,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List,
	Chris Mason

On Thursday 12 November 2009, Mel Gorman wrote:
> Changelog since V2
>   o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
>     testing, it made latencies even worse as kswapd slept more on
>     high-order congestion causing order-0 direct reclaims.
>   o Added changes to how congestion_wait() works
>   o Added a number of new patches altering the behaviour of reclaim

I have tested this series on top of .32-rc7. First impression is that it 
does seem to improve my test case, but does not yet completely solve it.

My last gitk instance now loads more smoothly for most of the time it takes 
to complete, but I still see a choke point where things freeze for a while 
and where I get SKB allocation errors from my wireless.
However, that choke point does seem to happen later and to be shorter than 
without the patches.

I'll try to do additional tests (with .31). If you'd like me to run this 
set with your instrumentation patch for congestion_wait, then please let 
me know.

Chris Mason's analysis regarding dm-crypt workqueues in reply to your other 
mail looks very interesting.

Cheers,
FJP

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3
  2009-11-13  9:04 ` [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3 Frans Pop
@ 2009-11-16 17:57   ` Mel Gorman
  0 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2009-11-16 17:57 UTC (permalink / raw)
  To: Frans Pop
  Cc: Andrew Morton, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker, linux-kernel, linux-mm, KOSAKI Motohiro,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List,
	Chris Mason

On Fri, Nov 13, 2009 at 10:04:21AM +0100, Frans Pop wrote:
> On Thursday 12 November 2009, Mel Gorman wrote:
> > Changelog since V2
> >   o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
> >     testing, it made latencies even worse as kswapd slept more on
> >     high-order congestion causing order-0 direct reclaims.
> >   o Added changes to how congestion_wait() works
> >   o Added a number of new patches altering the behaviour of reclaim
> 
> I have tested this series on top of .32-rc7. First impression is that it 
> does seem to improve my test case, but does not yet completely solve it.
> 
> My last gitk instance now loads more smoothly for most of the time it takes 
> to complete, but I still see a choke point where things freeze for a while 
> and where I get SKB allocation errors from my wireless.
> However, that choke point does seem to happen later and to be shorter than 
> without the patches.
> 

I haven't fully figured out why this makes a difference yet, but with
.32-rc7 and these patches, could you retry the test except beforehand do

cd /sys
for SYS in `find -name low_latency`; do
        echo 0 > $SYS
done

I believe the low_latency logic might be interfering with the number of
clean pages available for kswapd to reclaim.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3
  2009-11-12 19:30 [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3 Mel Gorman
                   ` (5 preceding siblings ...)
  2009-11-13  9:04 ` [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3 Frans Pop
@ 2009-11-13 12:47 ` Tobias Oetiker
  2009-11-13 13:37   ` Mel Gorman
  2009-11-15 12:07 ` Karol Lewandowski
  7 siblings, 1 reply; 65+ messages in thread
From: Tobias Oetiker @ 2009-11-13 12:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, linux-kernel, linux-mm@kvack.org\",
	KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

Hi Mel,

Yesterday Mel Gorman wrote:

> Sorry for the long delay in posting another version. Testing is extremely
> time-consuming and I wasn't getting to work on this as much as I'd have liked.
>
> Changelog since V2
>   o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
>     testing, it made latencies even worse as kswapd slept more on high-order
>     congestion causing order-0 direct reclaims.
>   o Added changes to how congestion_wait() works
>   o Added a number of new patches altering the behaviour of reclaim

so is there anything promissing for the order 5 allocation problems
in this set?

cheers
tobi


-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3
  2009-11-13 12:47 ` Tobias Oetiker
@ 2009-11-13 13:37   ` Mel Gorman
  0 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2009-11-13 13:37 UTC (permalink / raw)
  To: Tobias Oetiker
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, linux-kernel, linux-mm, KOSAKI Motohiro,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Fri, Nov 13, 2009 at 01:47:35PM +0100, Tobias Oetiker wrote:
> Hi Mel,
> 
> Yesterday Mel Gorman wrote:
> 
> > Sorry for the long delay in posting another version. Testing is extremely
> > time-consuming and I wasn't getting to work on this as much as I'd have liked.
> >
> > Changelog since V2
> >   o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
> >     testing, it made latencies even worse as kswapd slept more on high-order
> >     congestion causing order-0 direct reclaims.
> >   o Added changes to how congestion_wait() works
> >   o Added a number of new patches altering the behaviour of reclaim
> 
> so is there anything promissing for the order 5 allocation problems
> in this set?
> 

Yes. While the change in timing of direct reclaimers might be less
important when dm-crypt is not involved, kswapd is more pro-active about
maintaining the watermarks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3
  2009-11-12 19:30 [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3 Mel Gorman
                   ` (6 preceding siblings ...)
  2009-11-13 12:47 ` Tobias Oetiker
@ 2009-11-15 12:07 ` Karol Lewandowski
  2009-11-16  9:52   ` Mel Gorman
  7 siblings, 1 reply; 65+ messages in thread
From: Karol Lewandowski @ 2009-11-15 12:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, linux-kernel, linux-mm,
	KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Thu, Nov 12, 2009 at 07:30:30PM +0000, Mel Gorman wrote:

> [Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100

> Patches 1-3 should be tested first. The testing I've done shows that the
> page allocator and behaviour of congestion_wait() is more in line with
> 2.6.30 than the vanilla kernels.
> 
> It'd be nice to have 2 more tests, applying each patch on top noting any
> behaviour change. i.e. ideally there would be results for
> 
>  o patches 1+2+3
>  o patches 1+2+3+4
>  o patches 1+2+3+4+5
> 
> Of course, any tests results are welcome. The rest of the mail is the
> results of my own tests.

I've tried testing 3+4+5 against 2.6.32-rc7 (1+2 seem to be in
mainline) and got failure.  I've noticed something strange (I think).
I was unable to trigger failures when system was under heavy memory
pressure (i.e. my testing - gitk, firefoxes, etc.).  When I killed
almost all memory hogs, put system into sleep and resumed -- it
failed.  free(1) showed:

             total       used       free     shared    buffers     cached
Mem:        255240     194052      61188          0       4040      49364
-/+ buffers/cache:     140648     114592
Swap:       514040      72712     441328


Is that ok?  Wild guess -- maybe kswapd doesn't take fragmentation (or
other factors) into account as hard as it used to in 2.6.30?

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3
  2009-11-15 12:07 ` Karol Lewandowski
@ 2009-11-16  9:52   ` Mel Gorman
  2009-11-16 12:08     ` Karol Lewandowski
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-11-16  9:52 UTC (permalink / raw)
  To: Karol Lewandowski
  Cc: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Tobias Oetiker, linux-kernel, linux-mm, KOSAKI Motohiro,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Sun, Nov 15, 2009 at 01:07:21PM +0100, Karol Lewandowski wrote:
> On Thu, Nov 12, 2009 at 07:30:30PM +0000, Mel Gorman wrote:
> 
> > [Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
> 
> > Patches 1-3 should be tested first. The testing I've done shows that the
> > page allocator and behaviour of congestion_wait() is more in line with
> > 2.6.30 than the vanilla kernels.
> > 
> > It'd be nice to have 2 more tests, applying each patch on top noting any
> > behaviour change. i.e. ideally there would be results for
> > 
> >  o patches 1+2+3
> >  o patches 1+2+3+4
> >  o patches 1+2+3+4+5
> > 
> > Of course, any tests results are welcome. The rest of the mail is the
> > results of my own tests.
> 
> I've tried testing 3+4+5 against 2.6.32-rc7 (1+2 seem to be in
> mainline) and got failure.  I've noticed something strange (I think).
> I was unable to trigger failures when system was under heavy memory
> pressure (i.e. my testing - gitk, firefoxes, etc.).  When I killed
> almost all memory hogs, put system into sleep and resumed -- it
> failed.  free(1) showed:
> 
>              total       used       free     shared    buffers     cached
> Mem:        255240     194052      61188          0       4040      49364
> -/+ buffers/cache:     140648     114592
> Swap:       514040      72712     441328
> 
> 
> Is that ok?  Wild guess -- maybe kswapd doesn't take fragmentation (or
> other factors) into account as hard as it used to in 2.6.30?
> 

That's a lot of memory free. I take it the order-5 GFP_ATOMIC allocation
failed. What was the dmesg for it please?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3
  2009-11-16  9:52   ` Mel Gorman
@ 2009-11-16 12:08     ` Karol Lewandowski
  2009-11-16 14:32       ` Karol Lewandowski
  0 siblings, 1 reply; 65+ messages in thread
From: Karol Lewandowski @ 2009-11-16 12:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Karol Lewandowski, Andrew Morton, Frans Pop, Jiri Kosina,
	Sven Geggus, Tobias Oetiker, linux-kernel, linux-mm,
	KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Mon, Nov 16, 2009 at 09:52:58AM +0000, Mel Gorman wrote:
> On Sun, Nov 15, 2009 at 01:07:21PM +0100, Karol Lewandowski wrote:
> >              total       used       free     shared    buffers     cached
> > Mem:        255240     194052      61188          0       4040      49364
> > -/+ buffers/cache:     140648     114592
> > Swap:       514040      72712     441328
> > 
> > 
> > Is that ok?  Wild guess -- maybe kswapd doesn't take fragmentation (or
> > other factors) into account as hard as it used to in 2.6.30?

> That's a lot of memory free. I take it the order-5 GFP_ATOMIC allocation
> failed. What was the dmesg for it please?

Sure, it's attached below.

Hmm, "Normal free"/"DMA free" are much lower than 61MB as shown above.
free(1) output was collected right after resume successfully finished.

Thanks.


e100: Intel(R) PRO/100 Network Driver, 3.5.24-k2-NAPI
e100: Copyright(c) 1999-2006 Intel Corporation
e100 0000:00:03.0: PCI INT A -> Link[LNKC] -> GSI 9 (level, low) -> IRQ 9
e100 0000:00:03.0: PME# disabled
e100: eth0: e100_probe: addr 0xe8120000, irq 9, MAC addr 00:10:a4:89:e8:84
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 7339, comm: ifconfig Not tainted 2.6.32-rc7-dirty #3
Call Trace:
 [<c0168704>] ? __alloc_pages_nodemask+0x45d/0x4eb
 [<c01050b3>] ? dma_generic_alloc_coherent+0x4a/0xa7
 [<c0105069>] ? dma_generic_alloc_coherent+0x0/0xa7
 [<d0935ba0>] ? e100_alloc_cbs+0xc2/0x16f [e100]
 [<d0936d23>] ? e100_up+0x1b/0xf5 [e100]
 [<d0936e14>] ? e100_open+0x17/0x41 [e100]
 [<c03245cc>] ? dev_open+0x8f/0xc5
 [<c0323d54>] ? dev_change_flags+0xa2/0x155
 [<c03576b0>] ? devinet_ioctl+0x22a/0x51e
 [<c0316ebc>] ? sock_ioctl+0x1b6/0x1da
 [<c0316d06>] ? sock_ioctl+0x0/0x1da
 [<c0193f9b>] ? vfs_ioctl+0x1c/0x5f
 [<c0194565>] ? do_vfs_ioctl+0x4bf/0x4fb
 [<c0373362>] ? do_page_fault+0x33b/0x351
 [<c01945cd>] ? sys_ioctl+0x2c/0x42
 [<c0102834>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Normal per-cpu:
CPU    0: hi:   90, btch:  15 usd:  89
active_anon:15632 inactive_anon:17534 isolated_anon:0
 active_file:12373 inactive_file:12884 isolated_file:0
 unevictable:0 dirty:15 writeback:62 unstable:0
 free:2072 slab_reclaimable:849 slab_unreclaimable:1080
 mapped:7343 shmem:233 pagetables:505 bounce:0
DMA free:1440kB min:120kB low:148kB high:180kB active_anon:224kB inactive_anon:1516kB active_file:3440kB inactive_file:5140kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:404kB shmem:44kB slab_reclaimable:16kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:6848kB min:1912kB low:2388kB high:2868kB active_anon:62304kB inactive_anon:68620kB active_file:46052kB inactive_file:46396kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:243776kB mlocked:0kB dirty:60kB writeback:248kB mapped:28968kB shmem:888kB slab_reclaimable:3380kB slab_unreclaimable:4304kB kernel_stack:464kB pagetables:2020kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 4*4kB 8*8kB 33*16kB 22*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1440kB
Normal: 486*4kB 271*8kB 139*16kB 12*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6848kB
29040 total pagecache pages
3548 pages in swap cache
Swap cache stats: add 72385, delete 68837, find 70772/74308
Free swap  = 483200kB
Total swap = 514040kB
65520 pages RAM
1726 pages reserved
20970 pages shared
50976 pages non-shared
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 7339, comm: ifconfig Not tainted 2.6.32-rc7-dirty #3
Call Trace:
 [<c0168704>] ? __alloc_pages_nodemask+0x45d/0x4eb
 [<c01050b3>] ? dma_generic_alloc_coherent+0x4a/0xa7
 [<c0105069>] ? dma_generic_alloc_coherent+0x0/0xa7
 [<d0935ba0>] ? e100_alloc_cbs+0xc2/0x16f [e100]
 [<d0936d23>] ? e100_up+0x1b/0xf5 [e100]
 [<d0936e14>] ? e100_open+0x17/0x41 [e100]
 [<c03245cc>] ? dev_open+0x8f/0xc5
 [<c0323d54>] ? dev_change_flags+0xa2/0x155
 [<c03576b0>] ? devinet_ioctl+0x22a/0x51e
 [<c0316ebc>] ? sock_ioctl+0x1b6/0x1da
 [<c0316d06>] ? sock_ioctl+0x0/0x1da
 [<c0193f9b>] ? vfs_ioctl+0x1c/0x5f
 [<c0194565>] ? do_vfs_ioctl+0x4bf/0x4fb
 [<c0189370>] ? vfs_write+0xf4/0x105
 [<c01945cd>] ? sys_ioctl+0x2c/0x42
 [<c0102834>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Normal per-cpu:
CPU    0: hi:   90, btch:  15 usd:  82
active_anon:15563 inactive_anon:17593 isolated_anon:0
 active_file:12294 inactive_file:12787 isolated_file:0
 unevictable:0 dirty:18 writeback:155 unstable:0
 free:2267 slab_reclaimable:847 slab_unreclaimable:1080
 mapped:7271 shmem:229 pagetables:505 bounce:0
DMA free:1440kB min:120kB low:148kB high:180kB active_anon:224kB inactive_anon:1516kB active_file:3440kB inactive_file:5140kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:404kB shmem:44kB slab_reclaimable:16kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:7628kB min:1912kB low:2388kB high:2868kB active_anon:62028kB inactive_anon:68856kB active_file:45736kB inactive_file:46008kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:243776kB mlocked:0kB dirty:72kB writeback:620kB mapped:28680kB shmem:872kB slab_reclaimable:3372kB slab_unreclaimable:4304kB kernel_stack:464kB pagetables:2020kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 4*4kB 8*8kB 33*16kB 22*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1440kB
Normal: 559*4kB 290*8kB 152*16kB 16*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7628kB
29016 total pagecache pages
3702 pages in swap cache
Swap cache stats: add 72550, delete 68848, find 70774/74310
Free swap  = 482540kB
Total swap = 514040kB
65520 pages RAM
1726 pages reserved
20826 pages shared
50895 pages non-shared
e100 0000:00:03.0: firmware: requesting e100/d101s_ucode.bin
e100: eth0 NIC Link is Up 100 Mbps Full Duplex
crashreporter[9491]: segfault at b67701a0 ip b67701a0 sp bfa0fc2c error 4 in libnss_files-2.7.so[b67b6000+a000]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3
  2009-11-16 12:08     ` Karol Lewandowski
@ 2009-11-16 14:32       ` Karol Lewandowski
  0 siblings, 0 replies; 65+ messages in thread
From: Karol Lewandowski @ 2009-11-16 14:32 UTC (permalink / raw)
  To: Karol Lewandowski
  Cc: Mel Gorman, Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Tobias Oetiker, linux-kernel, linux-mm, KOSAKI Motohiro,
	Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List

On Mon, Nov 16, 2009 at 01:08:45PM +0100, Karol Lewandowski wrote:
> On Mon, Nov 16, 2009 at 09:52:58AM +0000, Mel Gorman wrote:
> > On Sun, Nov 15, 2009 at 01:07:21PM +0100, Karol Lewandowski wrote:
> > >              total       used       free     shared    buffers     cached
> > > Mem:        255240     194052      61188          0       4040      49364
> > > -/+ buffers/cache:     140648     114592
> > > Swap:       514040      72712     441328
> > > 
> > > 
> > > Is that ok?  Wild guess -- maybe kswapd doesn't take fragmentation (or
> > > other factors) into account as hard as it used to in 2.6.30?
> 
> > That's a lot of memory free. I take it the order-5 GFP_ATOMIC allocation
> > failed. What was the dmesg for it please?
> 
> Sure, it's attached below.
> 
> Hmm, "Normal free"/"DMA free" are much lower than 61MB as shown above.
> free(1) output was collected right after resume successfully finished.

Replying to myself, but well, it happened again:

  $ free
             total       used       free     shared    buffers     cached
Mem:        255240     189400      65840          0       3968      58916
-/+ buffers/cache:     126516     128724
Swap:       514040      98460     415580


  $ dmesg

e100: Intel(R) PRO/100 Network Driver, 3.5.24-k2-NAPI
e100: Copyright(c) 1999-2006 Intel Corporation
e100 0000:00:03.0: PCI INT A -> Link[LNKC] -> GSI 9 (level, low) -> IRQ 9
e100 0000:00:03.0: PME# disabled
e100: eth0: e100_probe: addr 0xe8120000, irq 9, MAC addr 00:10:a4:89:e8:84
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 10402, comm: ifconfig Not tainted 2.6.32-rc7-dirty #3
Call Trace:
 [<c0168704>] ? __alloc_pages_nodemask+0x45d/0x4eb
 [<c01050b3>] ? dma_generic_alloc_coherent+0x4a/0xa7
 [<c0105069>] ? dma_generic_alloc_coherent+0x0/0xa7
 [<d0935ba0>] ? e100_alloc_cbs+0xc2/0x16f [e100]
 [<d0936d23>] ? e100_up+0x1b/0xf5 [e100]
 [<d0936e14>] ? e100_open+0x17/0x41 [e100]
 [<c03245cc>] ? dev_open+0x8f/0xc5
 [<c0323d54>] ? dev_change_flags+0xa2/0x155
 [<c03576b0>] ? devinet_ioctl+0x22a/0x51e
 [<c0316ebc>] ? sock_ioctl+0x1b6/0x1da
 [<c0316d06>] ? sock_ioctl+0x0/0x1da
 [<c0193f9b>] ? vfs_ioctl+0x1c/0x5f
 [<c0194565>] ? do_vfs_ioctl+0x4bf/0x4fb
 [<c0373362>] ? do_page_fault+0x33b/0x351
 [<c01945cd>] ? sys_ioctl+0x2c/0x42
 [<c0102834>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Normal per-cpu:
CPU    0: hi:   90, btch:  15 usd:  81
active_anon:10734 inactive_anon:19229 isolated_anon:0
 active_file:14360 inactive_file:14515 isolated_file:0
 unevictable:0 dirty:7 writeback:53 unstable:0
 free:1101 slab_reclaimable:1334 slab_unreclaimable:1143
 mapped:2686 shmem:44 pagetables:522 bounce:0
DMA free:1496kB min:120kB low:148kB high:180kB active_anon:724kB inactive_anon:1520kB active_file:3752kB inactive_file:3964kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:324kB shmem:0kB slab_reclaimable:216kB slab_unreclaimable:36kB kernel_stack:4kB pagetables:16kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:2908kB min:1912kB low:2388kB high:2868kB active_anon:42212kB inactive_anon:75396kB active_file:53688kB inactive_file:54096kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:243776kB mlocked:0kB dirty:28kB writeback:212kB mapped:10420kB shmem:176kB slab_reclaimable:5120kB slab_unreclaimable:4536kB kernel_stack:444kB pagetables:2072kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 26*4kB 48*8kB 53*16kB 1*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1496kB
Normal: 275*4kB 58*8kB 50*16kB 13*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2908kB
35091 total pagecache pages
6168 pages in swap cache
Swap cache stats: add 367741, delete 361573, find 144185/185303
Free swap  = 438964kB
Total swap = 514040kB
65520 pages RAM
1726 pages reserved
13452 pages shared
53168 pages non-shared
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 10402, comm: ifconfig Not tainted 2.6.32-rc7-dirty #3
Call Trace:
 [<c0168704>] ? __alloc_pages_nodemask+0x45d/0x4eb
 [<c01050b3>] ? dma_generic_alloc_coherent+0x4a/0xa7
 [<c0105069>] ? dma_generic_alloc_coherent+0x0/0xa7
 [<d0935ba0>] ? e100_alloc_cbs+0xc2/0x16f [e100]
 [<d0936d23>] ? e100_up+0x1b/0xf5 [e100]
 [<d0936e14>] ? e100_open+0x17/0x41 [e100]
 [<c03245cc>] ? dev_open+0x8f/0xc5
 [<c0323d54>] ? dev_change_flags+0xa2/0x155
 [<c03576b0>] ? devinet_ioctl+0x22a/0x51e
 [<c0316ebc>] ? sock_ioctl+0x1b6/0x1da
 [<c0316d06>] ? sock_ioctl+0x0/0x1da
 [<c0193f9b>] ? vfs_ioctl+0x1c/0x5f
 [<c0194565>] ? do_vfs_ioctl+0x4bf/0x4fb
 [<c0189370>] ? vfs_write+0xf4/0x105
 [<c01945cd>] ? sys_ioctl+0x2c/0x42
 [<c0102834>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Normal per-cpu:
CPU    0: hi:   90, btch:  15 usd:  93
active_anon:10715 inactive_anon:19236 isolated_anon:0
 active_file:14312 inactive_file:14473 isolated_file:0
 unevictable:0 dirty:7 writeback:52 unstable:0
 free:1191 slab_reclaimable:1334 slab_unreclaimable:1144
 mapped:2686 shmem:44 pagetables:522 bounce:0
DMA free:1496kB min:120kB low:148kB high:180kB active_anon:724kB inactive_anon:1520kB active_file:3752kB inactive_file:3964kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:324kB shmem:0kB slab_reclaimable:216kB slab_unreclaimable:36kB kernel_stack:4kB pagetables:16kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:3268kB min:1912kB low:2388kB high:2868kB active_anon:42136kB inactive_anon:75424kB active_file:53496kB inactive_file:53928kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:243776kB mlocked:0kB dirty:28kB writeback:208kB mapped:10420kB shmem:176kB slab_reclaimable:5120kB slab_unreclaimable:4540kB kernel_stack:444kB pagetables:2072kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 26*4kB 48*8kB 53*16kB 1*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1496kB
Normal: 291*4kB 73*8kB 63*16kB 12*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3268kB
35039 total pagecache pages
6209 pages in swap cache
Swap cache stats: add 367794, delete 361585, find 144185/185304
Free swap  = 438756kB
Total swap = 514040kB
65520 pages RAM
1726 pages reserved
13445 pages shared
53086 pages non-shared
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 10440, comm: ifconfig Not tainted 2.6.32-rc7-dirty #3
Call Trace:
 [<c0168704>] ? __alloc_pages_nodemask+0x45d/0x4eb
 [<c01050b3>] ? dma_generic_alloc_coherent+0x4a/0xa7
 [<c0105069>] ? dma_generic_alloc_coherent+0x0/0xa7
 [<d0935ba0>] ? e100_alloc_cbs+0xc2/0x16f [e100]
 [<d0936d23>] ? e100_up+0x1b/0xf5 [e100]
 [<d0936e14>] ? e100_open+0x17/0x41 [e100]
 [<c03245cc>] ? dev_open+0x8f/0xc5
 [<c0323d54>] ? dev_change_flags+0xa2/0x155
 [<c03576b0>] ? devinet_ioctl+0x22a/0x51e
 [<c0316ebc>] ? sock_ioctl+0x1b6/0x1da
 [<c0316d06>] ? sock_ioctl+0x0/0x1da
 [<c0193f9b>] ? vfs_ioctl+0x1c/0x5f
 [<c0194565>] ? do_vfs_ioctl+0x4bf/0x4fb
 [<c0373362>] ? do_page_fault+0x33b/0x351
 [<c01945cd>] ? sys_ioctl+0x2c/0x42
 [<c0102834>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Normal per-cpu:
CPU    0: hi:   90, btch:  15 usd:  27
active_anon:9587 inactive_anon:19792 isolated_anon:0
 active_file:12298 inactive_file:12256 isolated_file:0
 unevictable:0 dirty:13 writeback:442 unstable:0
 free:6098 slab_reclaimable:1287 slab_unreclaimable:1142
 mapped:2436 shmem:37 pagetables:519 bounce:0
DMA free:1948kB min:120kB low:148kB high:180kB active_anon:724kB inactive_anon:1520kB active_file:3752kB inactive_file:3524kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:324kB shmem:0kB slab_reclaimable:208kB slab_unreclaimable:32kB kernel_stack:4kB pagetables:16kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:22444kB min:1912kB low:2388kB high:2868kB active_anon:37624kB inactive_anon:77648kB active_file:45440kB inactive_file:45500kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:243776kB mlocked:0kB dirty:52kB writeback:1768kB mapped:9420kB shmem:148kB slab_reclaimable:4940kB slab_unreclaimable:4536kB kernel_stack:444kB pagetables:2060kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 115*4kB 54*8kB 56*16kB 1*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1948kB
Normal: 2005*4kB 829*8kB 379*16kB 50*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22444kB
32732 total pagecache pages
8128 pages in swap cache
Swap cache stats: add 370349, delete 362221, find 144309/185439
Free swap  = 428812kB
Total swap = 514040kB
65520 pages RAM
1726 pages reserved
10771 pages shared
50488 pages non-shared
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 10440, comm: ifconfig Not tainted 2.6.32-rc7-dirty #3
Call Trace:
 [<c0168704>] ? __alloc_pages_nodemask+0x45d/0x4eb
 [<c01050b3>] ? dma_generic_alloc_coherent+0x4a/0xa7
 [<c0105069>] ? dma_generic_alloc_coherent+0x0/0xa7
 [<d0935ba0>] ? e100_alloc_cbs+0xc2/0x16f [e100]
 [<d0936d23>] ? e100_up+0x1b/0xf5 [e100]
 [<d0936e14>] ? e100_open+0x17/0x41 [e100]
 [<c03245cc>] ? dev_open+0x8f/0xc5
 [<c0323d54>] ? dev_change_flags+0xa2/0x155
 [<c03576b0>] ? devinet_ioctl+0x22a/0x51e
 [<c0316ebc>] ? sock_ioctl+0x1b6/0x1da
 [<c0316d06>] ? sock_ioctl+0x0/0x1da
 [<c0193f9b>] ? vfs_ioctl+0x1c/0x5f
 [<c0194565>] ? do_vfs_ioctl+0x4bf/0x4fb
 [<c0101683>] ? __switch_to+0xf/0x14c
 [<c011d24b>] ? finish_task_switch+0x2e/0x7d
 [<c0370423>] ? schedule+0x4cc/0x4f3
 [<c01945cd>] ? sys_ioctl+0x2c/0x42
 [<c0102834>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Normal per-cpu:
CPU    0: hi:   90, btch:  15 usd:  30
active_anon:9587 inactive_anon:19792 isolated_anon:0
 active_file:12298 inactive_file:12270 isolated_file:0
 unevictable:0 dirty:13 writeback:434 unstable:0
 free:6083 slab_reclaimable:1287 slab_unreclaimable:1143
 mapped:2449 shmem:37 pagetables:519 bounce:0
DMA free:1948kB min:120kB low:148kB high:180kB active_anon:724kB inactive_anon:1520kB active_file:3752kB inactive_file:3524kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:324kB shmem:0kB slab_reclaimable:208kB slab_unreclaimable:32kB kernel_stack:4kB pagetables:16kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:22384kB min:1912kB low:2388kB high:2868kB active_anon:37624kB inactive_anon:77648kB active_file:45440kB inactive_file:45556kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:243776kB mlocked:0kB dirty:52kB writeback:1724kB mapped:9472kB shmem:148kB slab_reclaimable:4940kB slab_unreclaimable:4540kB kernel_stack:444kB pagetables:2060kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 115*4kB 54*8kB 56*16kB 1*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1948kB
Normal: 1990*4kB 829*8kB 379*16kB 50*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22384kB
32740 total pagecache pages
8128 pages in swap cache
Swap cache stats: add 370349, delete 362221, find 144309/185439
Free swap  = 428812kB
Total swap = 514040kB
65520 pages RAM
1726 pages reserved
10844 pages shared
50429 pages non-shared

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2
@ 2009-10-22 14:22 Mel Gorman
  2009-10-22 14:22 ` [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER Mel Gorman
  0 siblings, 1 reply; 65+ messages in thread
From: Mel Gorman @ 2009-10-22 14:22 UTC (permalink / raw)
  To: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker
  Cc: Rafael J. Wysocki, David Miller, Reinette Chatre, Kalle Valo,
	David Rientjes, KOSAKI Motohiro, Mohamed Abbas, Jens Axboe,
	John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org",
	Mel Gorman

Sorry for the large cc list. Variations of this bug have cropped up in a
number of different places and so there are a fair few people that should
be vaguely aware of what's going on.

Since 2.6.31-rc1, there have been an increasing number of GFP_ATOMIC
failures. A significant number of these have been high-order GFP_ATOMIC
failures and while they are generally brushed away, there has been a large
increase in them recently and there are a number of possible areas the
problem could be in - core vm, page writeback and a specific driver. The
bugs affected by this that I am aware of are;

[Bug #14141] order 2 page allocation failures in iwlagn
	Commit 4752c93c30441f98f7ed723001b1a5e3e5619829 introduced GFP_ATOMIC
	allocations within the wireless driver. This has caused large numbers
	of failure reports to occur as reported by Frans Pop. Fixing this
	requires changes to the driver if it wants to use GFP_ATOMIC which
	is in the hands of Mohamed Abbas and Reinette Chatre. However,
	it is very likely that it has being compounded by core mm changes
	that this series is aimed at.

[Bug #14141] order 2 page allocation failures (generic)
	This problem is being tracked under bug #14141 but chances are it's
	unrelated to the wireless change. Tobi Oetiker has reported that a
	virtualised machine using a bridged interface is reporting a small
	number of order-5 GFP_ATOMIC failures. He has reported that the
	errors can be suppressed with kswapd patches in this series. However,
	I would like to confirm they are necessary.

[Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
	Karol Lewandows reported that e100 fails to allocate order-5
	GFP_ATOMIC when loading firmware during resume. This has started
	happening relatively recent.

[No BZ ID] Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
	This apparently is easily reproducible, particular in comparison to
	the other reports. The point of greatest interest is that this is
	order-0 GFP_ATOMIC failures. Sven, I'm hoping that you in particular
	will be able to follow the tests below as you are the most likely
	person to have an easily reproducible situation.

[No BZ ID] page allocation failure message kernel 2.6.31.4 (tty-related)
	reported at: http://lkml.org/lkml/2009/10/20/139. Looks the same
	as the order-2 failures.

There are 5 patches in this series. For people affected by this bug,
I'm afraid there is a lot of legwork involved to help pin down which of
these patches are relevant. These patches are all against 2.6.32-rc5 and
have been tested on X86 and X86-64 by running the sysbench benchmark to
completion. I'll post against 2.6.31.4 where necessary.

Test 1: Verify your problem occurs on 2.6.32-rc5 if you can

Test 2: Apply the following two patches and test again

  1/5 page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
  2/5 page allocator: Do not allow interrupts to use ALLOC_HARDER


	These patches correct problems introduced by me during the 2.6.31-rc1
	merge window. The patches were not meant to introduce any functional
	changes but two were missed.

	If your problem goes away with just these two patches applied,
	please tell me.

Test 3: If you are getting allocation failures, try with the following patch

  3/5 vmscan: Force kswapd to take notice faster when high-order watermarks are being hit

	This is a functional change that causes kswapd to notice sooner
	when high-order watermarks have been hit. There have been a number
	of changes in page reclaim since 2.6.30 that might have delayed
	when kswapd kicks in for higher orders

	If your problem goes away with these three patches applied, please
	tell me

Test 4: If you are still getting failures, apply the following
  4/5 page allocator: Pre-emptively wake kswapd when high-order watermarks are hit

	This patch is very heavy handed and pre-emptively kicks kswapd when
	watermarks are hit. It should only be necessary if there has been
	significant changes in the timing and density of page allocations
	from an unknown source. Tobias, this patch is largely aimed at you.
	You reported that with patches 3+4 applied that your problems went
	away. I need to know if patch 3 on its own is enough or if both
	are required

	If your problem goes away with these four patches applied, please
	tell me

Test 5: If things are still screwed, apply the following
  5/5 Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion

	Frans Pop reports that the bulk of his problems go away when this
	patch is reverted on 2.6.31. There has been some confusion on why
	exactly this patch was wrong but apparently the conversion was not
	complete and further work was required. It's unknown if all the
	necessary work exists in 2.6.31-rc5 or not. If there are still
	allocation failures and applying this patch fixes the problem,
	there are still snags that need to be ironed out.

Test 6: If only testing 2.6.31.4, test with patches 1, 2 and 5 as posted for that kernel
	Even if patches 3, 4 or both are necessary against mainline, I'm
	hoping they are unnecessary against -stable.

Thanks to all that reported problems and are testing this. The major bulk of
the work was done by Frans Pop so a big thanks to him in particular. I/we owe
him beers.

 arch/x86/lib/usercopy_32.c  |    2 +-
 drivers/block/pktcdvd.c     |   10 ++++------
 drivers/md/dm-crypt.c       |    2 +-
 fs/fat/file.c               |    2 +-
 fs/fuse/dev.c               |    8 ++++----
 fs/nfs/write.c              |    8 +++-----
 fs/reiserfs/journal.c       |    2 +-
 fs/xfs/linux-2.6/kmem.c     |    4 ++--
 fs/xfs/linux-2.6/xfs_buf.c  |    2 +-
 include/linux/backing-dev.h |   11 +++--------
 include/linux/blkdev.h      |   13 +++++++++----
 mm/backing-dev.c            |    7 ++++---
 mm/memcontrol.c             |    2 +-
 mm/page-writeback.c         |    2 +-
 mm/page_alloc.c             |   41 ++++++++++++++++++++++++++---------------
 mm/vmscan.c                 |   17 +++++++++++++----
 16 files changed, 75 insertions(+), 58 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
  2009-10-22 14:22 [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2 Mel Gorman
@ 2009-10-22 14:22 ` Mel Gorman
  2009-10-22 16:33   ` Stephan von Krawczynski
  2009-10-26  1:15   ` KOSAKI Motohiro
  0 siblings, 2 replies; 65+ messages in thread
From: Mel Gorman @ 2009-10-22 14:22 UTC (permalink / raw)
  To: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker
  Cc: Rafael J. Wysocki, David Miller, Reinette Chatre, Kalle Valo,
	David Rientjes, KOSAKI Motohiro, Mohamed Abbas, Jens Axboe,
	John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org",
	Mel Gorman

Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
slightly by allowing rt_tasks that are handling an interrupt to set
ALLOC_HARDER. This patch brings the watermark logic more in line with
2.6.30.

[rientjes@google.com: Spotted the problem]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dfa4362..7f2aa3e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 		 */
 		alloc_flags &= ~ALLOC_CPUSET;
-	} else if (unlikely(rt_task(p)))
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-- 
1.6.3.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
  2009-10-22 14:22 ` [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER Mel Gorman
@ 2009-10-22 16:33   ` Stephan von Krawczynski
  2009-10-22 16:37     ` Mel Gorman
  2009-10-26  1:15   ` KOSAKI Motohiro
  1 sibling, 1 reply; 65+ messages in thread
From: Stephan von Krawczynski @ 2009-10-22 16:33 UTC (permalink / raw)
  Cc: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker, Rafael J. Wysocki, David Miller, Reinette Chatre,
	Kalle Valo, David Rientjes, KOSAKI Motohiro, Mohamed Abbas,
	Jens Axboe, John W. Linville, Pekka Enberg,
	Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
	Kernel Testers List, netdev, linux-kernel,
	linux-mm@kvack.org",
	Mel Gorman

On Thu, 22 Oct 2009 15:22:33 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
> slightly by allowing rt_tasks that are handling an interrupt to set
> ALLOC_HARDER. This patch brings the watermark logic more in line with
> 2.6.30.
> 
> [rientjes@google.com: Spotted the problem]
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
> ---
>  mm/page_alloc.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dfa4362..7f2aa3e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
>  		 */
>  		alloc_flags &= ~ALLOC_CPUSET;
> -	} else if (unlikely(rt_task(p)))
> +	} else if (unlikely(rt_task(p)) && !in_interrupt())
>  		alloc_flags |= ALLOC_HARDER;
>  
>  	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> -- 
> 1.6.3.3
> 

Is it correct that this one applies offset -54 lines in 2.6.31.4 ? 

-- 
Regards,
Stephan

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
  2009-10-22 16:33   ` Stephan von Krawczynski
@ 2009-10-22 16:37     ` Mel Gorman
  2009-10-23  9:57       ` Stephan von Krawczynski
                         ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Mel Gorman @ 2009-10-22 16:37 UTC (permalink / raw)
  To: Stephan von Krawczynski
  Cc: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker, Rafael J. Wysocki, David Miller, Reinette Chatre,
	Kalle Valo, David Rientjes, KOSAKI Motohiro, Mohamed Abbas,
	Jens Axboe, John W. Linville, Pekka Enberg,
	Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
	Kernel Testers List, netdev, linux-kernel, linux-mm

On Thu, Oct 22, 2009 at 06:33:03PM +0200, Stephan von Krawczynski wrote:
> On Thu, 22 Oct 2009 15:22:33 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
> > slightly by allowing rt_tasks that are handling an interrupt to set
> > ALLOC_HARDER. This patch brings the watermark logic more in line with
> > 2.6.30.
> > 
> > [rientjes@google.com: Spotted the problem]
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
> > ---
> >  mm/page_alloc.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index dfa4362..7f2aa3e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> >  		 */
> >  		alloc_flags &= ~ALLOC_CPUSET;
> > -	} else if (unlikely(rt_task(p)))
> > +	} else if (unlikely(rt_task(p)) && !in_interrupt())
> >  		alloc_flags |= ALLOC_HARDER;
> >  
> >  	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> > -- 
> > 1.6.3.3
> > 
> 
> Is it correct that this one applies offset -54 lines in 2.6.31.4 ? 
> 

In this case, it's ok. It's just a harmless heads-up that the kernel
looks slightly different than expected. I posted a 2.6.31.4 version of
the two patches that cause real problems.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
  2009-10-22 16:37     ` Mel Gorman
@ 2009-10-23  9:57       ` Stephan von Krawczynski
  2009-10-24  2:03       ` Christoph Lameter
  2009-10-25 12:57       ` Stephan von Krawczynski
  2 siblings, 0 replies; 65+ messages in thread
From: Stephan von Krawczynski @ 2009-10-23  9:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker, Rafael J. Wysocki, David Miller, Reinette Chatre,
	Kalle Valo, David Rientjes, KOSAKI Motohiro, Mohamed Abbas,
	Jens Axboe, John W. Linville, Pekka Enberg,
	Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
	Kernel Testers List, netdev, linux-kernel, linux-mm

On Thu, 22 Oct 2009 17:37:52 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> In this case, it's ok. It's just a harmless heads-up that the kernel
> looks slightly different than expected. I posted a 2.6.31.4 version of
> the two patches that cause real problems.

After around 12 hours of runtime with patch 1/5 and 2/5 on 2.6.31.4 I can see
no page allocation failure messages so far. I'll keep you informed.

-- 
Regards,
Stephan

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
  2009-10-22 16:37     ` Mel Gorman
  2009-10-23  9:57       ` Stephan von Krawczynski
@ 2009-10-24  2:03       ` Christoph Lameter
  2009-10-27 15:19         ` Mel Gorman
  2009-10-25 12:57       ` Stephan von Krawczynski
  2 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2009-10-24  2:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Stephan von Krawczynski, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
	David Miller, Reinette Chatre, Kalle Valo, David Rientjes,
	KOSAKI Motohiro, Mohamed Abbas, Jens Axboe, John W. Linville,
	Pekka Enberg, Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
	Kernel Testers List, netdev, linux-kernel, linux-mm

There are now rt dependencies in the page allocator that screw things up?

And an rt flag causes the page allocator to try harder meaning it adds
latency.

?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
  2009-10-24  2:03       ` Christoph Lameter
@ 2009-10-27 15:19         ` Mel Gorman
  0 siblings, 0 replies; 65+ messages in thread
From: Mel Gorman @ 2009-10-27 15:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Stephan von Krawczynski, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
	David Miller, Reinette Chatre, Kalle Valo, David Rientjes,
	KOSAKI Motohiro, Mohamed Abbas, Jens Axboe, John W. Linville,
	Pekka Enberg, Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
	Kernel Testers List, netdev, linux-kernel, linux-mm

On Fri, Oct 23, 2009 at 10:03:05PM -0400, Christoph Lameter wrote:
> There are now rt dependencies in the page allocator that screw things up?
> 

The rt logic was present before.

> And an rt flag causes the page allocator to try harder meaning it adds
> latency.
> 

The harder term in the flag is a bit misleading. The effect of ALLOC_HARDER
is that the watermark is lower for the task. i.e. a rt task is less likely
to enter direct reclaim than normal tasks so it should have less latency.

> ?
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
  2009-10-22 16:37     ` Mel Gorman
  2009-10-23  9:57       ` Stephan von Krawczynski
  2009-10-24  2:03       ` Christoph Lameter
@ 2009-10-25 12:57       ` Stephan von Krawczynski
  2 siblings, 0 replies; 65+ messages in thread
From: Stephan von Krawczynski @ 2009-10-25 12:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker, Rafael J. Wysocki, David Miller, Reinette Chatre,
	Kalle Valo, David Rientjes, KOSAKI Motohiro, Mohamed Abbas,
	Jens Axboe, John W. Linville, Pekka Enberg,
	Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
	Kernel Testers List, netdev, linux-kernel, linux-mm

On Thu, 22 Oct 2009 17:37:52 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> In this case, it's ok. It's just a harmless heads-up that the kernel
> looks slightly different than expected. I posted a 2.6.31.4 version of
> the two patches that cause real problems.

After around 12 hours of runtime with patch 1/5 and 2/5 on 2.6.31.4 I can see
no page allocation failure messages so far. I'll keep you informed.

[Update]
After 2 days I still see no failure messages. 
[/update]

-- 
Regards,
Stephan

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
  2009-10-22 14:22 ` [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER Mel Gorman
  2009-10-22 16:33   ` Stephan von Krawczynski
@ 2009-10-26  1:15   ` KOSAKI Motohiro
  1 sibling, 0 replies; 65+ messages in thread
From: KOSAKI Motohiro @ 2009-10-26  1:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
	David Miller, Reinette Chatre, Kalle Valo, David Rientjes,
	Mohamed Abbas, Jens Axboe, John W. Linville, Pekka Enberg,
	Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
	Stephan von Krawczynski, Kernel Testers List, netdev,
	linux-kernel, linux-mm@kvack.org"

> Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
> slightly by allowing rt_tasks that are handling an interrupt to set
> ALLOC_HARDER. This patch brings the watermark logic more in line with
> 2.6.30.
> 
> [rientjes@google.com: Spotted the problem]
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
> ---
>  mm/page_alloc.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dfa4362..7f2aa3e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
>  		 */
>  		alloc_flags &= ~ALLOC_CPUSET;
> -	} else if (unlikely(rt_task(p)))
> +	} else if (unlikely(rt_task(p)) && !in_interrupt())
>  		alloc_flags |= ALLOC_HARDER;
>  
>  	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> -- 
> 1.6.3.3

good catch.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2009-12-14  8:49 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-12 19:30 [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3 Mel Gorman
2009-11-12 19:30 ` [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed Mel Gorman
2009-11-13  5:23   ` KOSAKI Motohiro
2009-11-13 13:55     ` Mel Gorman
2009-11-12 19:30 ` [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER Mel Gorman
2009-11-13  5:24   ` KOSAKI Motohiro
2009-11-13 13:56     ` Mel Gorman
2009-11-12 19:30 ` [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim Mel Gorman
2009-11-13 11:20   ` KOSAKI Motohiro
2009-11-13 11:55     ` Jens Axboe
2009-11-13 12:28       ` Mel Gorman
2009-11-13 13:32         ` Jens Axboe
2009-11-13 13:41           ` Pekka Enberg
2009-11-13 15:22             ` Chris Mason
2009-11-13 14:16           ` Mel Gorman
2009-11-20 14:56           ` Mel Gorman
2009-11-12 19:30 ` [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep Mel Gorman
2009-11-13 10:43   ` KOSAKI Motohiro
2009-11-13 14:13     ` Mel Gorman
2009-11-13 18:00       ` KOSAKI Motohiro
2009-11-13 18:17         ` Mel Gorman
2009-11-14  9:34           ` KOSAKI Motohiro
2009-11-14 15:46             ` Mel Gorman
2009-11-17 11:03               ` KOSAKI Motohiro
2009-11-17 11:44                 ` Mel Gorman
2009-11-17 12:18                   ` KOSAKI Motohiro
2009-11-17 12:25                     ` Mel Gorman
2009-11-18  5:20                       ` KOSAKI Motohiro
2009-11-17 10:34             ` [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1 Mel Gorman
2009-11-18  5:27               ` KOSAKI Motohiro
2009-11-12 19:30 ` [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble Mel Gorman
2009-11-13  9:54   ` KOSAKI Motohiro
2009-11-13 13:54     ` Mel Gorman
2009-11-13 14:48       ` Minchan Kim
2009-11-13 18:00       ` KOSAKI Motohiro
2009-11-13 18:15         ` [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met Mel Gorman
2009-11-13 18:26           ` Frans Pop
2009-11-13 18:33           ` KOSAKI Motohiro
2009-11-13 20:03             ` [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2 Mel Gorman
2009-11-26 14:45               ` Tobias Oetiker
2009-11-29  7:42                 ` still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2) Tobi Oetiker
2009-12-02 11:32                   ` Mel Gorman
2009-12-02 21:30                     ` Tobias Oetiker
2009-12-03 20:26                       ` Corrado Zoccolo
2009-12-14  5:59                         ` Tobias Oetiker
2009-12-14  8:49                           ` Corrado Zoccolo
2009-11-13 18:36           ` [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met Rik van Riel
2009-11-13 14:38     ` [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble Minchan Kim
2009-11-13 12:41   ` Minchan Kim
2009-11-13  9:04 ` [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3 Frans Pop
2009-11-16 17:57   ` Mel Gorman
2009-11-13 12:47 ` Tobias Oetiker
2009-11-13 13:37   ` Mel Gorman
2009-11-15 12:07 ` Karol Lewandowski
2009-11-16  9:52   ` Mel Gorman
2009-11-16 12:08     ` Karol Lewandowski
2009-11-16 14:32       ` Karol Lewandowski
  -- strict thread matches above, loose matches on Subject: below --
2009-10-22 14:22 [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2 Mel Gorman
2009-10-22 14:22 ` [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER Mel Gorman
2009-10-22 16:33   ` Stephan von Krawczynski
2009-10-22 16:37     ` Mel Gorman
2009-10-23  9:57       ` Stephan von Krawczynski
2009-10-24  2:03       ` Christoph Lameter
2009-10-27 15:19         ` Mel Gorman
2009-10-25 12:57       ` Stephan von Krawczynski
2009-10-26  1:15   ` KOSAKI Motohiro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox