linux-mm.kvack.org archive mirror
* [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3
@ 2009-11-12 19:30 Mel Gorman
From: Mel Gorman @ 2009-11-12 19:30 UTC
  To: Andrew Morton, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker
  Cc: linux-kernel, linux-mm@kvack.org,
	KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter,
	Stephan von Krawczynski, Rafael J. Wysocki, Kernel Testers List,
	Mel Gorman

Sorry for the long delay in posting another version. Testing is extremely
time-consuming and I wasn't getting to work on this as much as I'd have liked.

Changelog since V2
  o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
    testing, it made latencies even worse as kswapd slept more on high-order
    congestion, causing order-0 direct reclaims.
  o Added changes to how congestion_wait() works
  o Added a number of new patches altering the behaviour of reclaim

Since 2.6.31-rc1, there has been an increasing number of GFP_ATOMIC
failures. A significant number of these have been high-order GFP_ATOMIC
failures and, while they are generally brushed away, there has been a large
increase in them recently. There are a number of possible areas the
problem could be in: core vm, page writeback and a specific driver. The
bugs affected by this that I am aware of are:

[Bug #14141] order 2 page allocation failures in iwlagn
[Bug #14141] order 2 page allocation failures (generic)
[Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
[No BZ ID]   Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
[No BZ ID]   page allocation failure message kernel 2.6.31.4 (tty-related)
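Why can a high-order GFP_ATOMIC allocation fail while plenty of memory is still free? The following is a rough Python model of the zone watermark check as it looks in mm/page_alloc.c of kernels from this period. It is simplified: the zone's free lists are passed in as a list, the ALLOC_HIGH/ALLOC_HARDER flags become booleans, and it is an illustration rather than the kernel's code. The point it shows is that free pages sitting in blocks smaller than the requested order stop counting towards a high-order request.

```python
def watermark_ok(nr_free, order, mark, alloc_high=False, alloc_harder=False,
                 lowmem_reserve=0):
    """Rough model of the zone watermark check in 2.6.3x mm/page_alloc.c.

    nr_free[o] is the number of free blocks of order o (2**o pages each);
    missing orders count as zero. Returns True if an allocation of
    2**order pages may proceed without dropping below the watermark.
    """
    free_pages = sum(n << o for o, n in enumerate(nr_free)) - (1 << order) + 1
    min_pages = mark
    if alloc_high:    # derived from __GFP_HIGH, which GFP_ATOMIC sets
        min_pages -= min_pages // 2
    if alloc_harder:  # ALLOC_HARDER lowers the bar by a further quarter
        min_pages -= min_pages // 4
    if free_pages <= min_pages + lowmem_reserve:
        return False
    for o in range(order):
        # Free blocks of order o are too small for this request, so stop
        # counting them...
        free_pages -= (nr_free[o] if o < len(nr_free) else 0) << o
        # ...but require proportionally fewer free pages at each higher order.
        min_pages >>= 1
        if free_pages <= min_pages:
            return False
    return True
```

For example, with 400 free order-0 pages and 50 free order-1 blocks (500 pages in total) and a watermark of 128 pages, `watermark_ok([400, 50], order=0, mark=128)` passes while `watermark_ok([400, 50], order=2, mark=128)` fails, because none of that free memory is in blocks large enough for an order-2 request.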

The following are a series of patches that bring the behaviour of reclaim
and the page allocator more in line with 2.6.30.

Patches 1-3 should be tested first. The testing I've done shows that the
page allocator and the behaviour of congestion_wait() are more in line with
2.6.30 than in the vanilla kernels.

It'd be nice to have two more tests, applying each remaining patch on top in
turn and noting any behaviour change, i.e. ideally there would be results for

 o patches 1+2+3
 o patches 1+2+3+4
 o patches 1+2+3+4+5

Of course, any test results are welcome. The rest of the mail describes the
results of my own tests.

I've tested against 2.6.31 and 2.6.32-rc6. I've partially replicated the
problem in Bug #14141 and believe the other bugs are variations of the same
style of problem. The basic reproduction case was:

1. X86-64 AMD Phenom and X86 P4 booted with mem=512MB. The expectation is
	that any machine will do as long as it is limited to 512MB, given the
	size of the workload involved.

2. An encrypted work partition and swap partition were created. On my
   own setup, I gave no passphrase so it'd be easier to activate without
   interaction, but there are multiple options. I should have taken better
   notes, but the setup goes something like this:

	cryptsetup create -y crypt-partition /dev/sda5
	pvcreate /dev/mapper/crypt-partition
	vgcreate crypt-volume /dev/mapper/crypt-partition
	lvcreate -L 5G -n crypt-logical crypt-volume
	lvcreate -L 2G -n crypt-swap crypt-volume
	mkfs -t ext3 /dev/crypt-volume/crypt-logical
	mkswap /dev/crypt-volume/crypt-swap

3. With the partition mounted on /scratch, I run:
	cd /scratch
	mkdir music
	git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6

4. On a normal partition, I expand a tarball containing test scripts available at
	http://www.csn.ul.ie/~mel/postings/latency-20091112/latency-tests-with-results.tar.gz

	There are two helper programs that run as part of the test - a fake
	music player and a fake gitk.

	The fake music player uses rsync with bandwidth limits to start
	downloading a music folder from another machine. It is bandwidth-limited
	to simulate playing music over NFS. I believe it generates
	similar, if not identical, traffic to a music player. It occurred to me
	afterwards that if one patched ogg123 to print a line each time 1/10th
	of a second's worth of music was played, it could be used as an
	indirect measure of desktop interactivity and help pin down pesky
	"audio skips" bug reports.

	The fake gitk is based on observing roughly what gitk does using
	strace. It loads all the logs into a large buffer and then builds a
	very basic hash map of parent to child commits. The data is kept
	because merely reading the logs was not enough; it had to be held
	in an in-memory buffer to generate swap. It then discards the
	data and repeats the whole process a small number of times
	so the test is finite. After each large batch of commits is processed,
	it outputs a line to stdout so that stalls can be observed. Ideal
	behaviour is that commits are read at a constant rate and latencies
	look flat.

	Output from the two programs is piped through another script,
	latency-output. It records how far into the test each line was
	output and the latency since the previous line appeared. The
	latency should always be very smooth. Because output to a pipe is
	buffered, the programs are all run via expect_unbuffered, which is
	available from expect-dev on Debian at least.

	All the tests are driven by run-test.sh. While the tests run,
	it records kern.log to track page allocation failures, records
	nr_writeback at regular intervals and tracks page IO and swap IO.

5. To run an actual test, a kernel is built and booted, the
	encrypted partition is activated, LVM is restarted,
	/dev/crypt-volume/crypt-logical is mounted on /scratch, all
	swap partitions are turned off and then the swap partition on
	/dev/crypt-volume/crypt-swap is activated. I then run run-test.sh
	from the tarball.

6. Run the test script
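The fake gitk and latency-output stages described in step 4 can be sketched as a small Python pipeline. This is not the code in the tarball: the input format is assumed to be `git log --pretty=format:'%H %P'` output (a commit hash followed by its parent hashes on each line), the progress interval and the latency-output line format are made up for illustration, and the clock is injectable so the pipeline can be checked without real delays.

```python
import sys
import time


def fake_gitk(log_lines, children, report_every=1000):
    """Buffer the whole log, fill `children` with a parent -> [child] map,
    and yield a progress line every `report_every` commits."""
    buffered = list(log_lines)  # hold everything in memory so it can be swapped
    for count, line in enumerate(buffered, start=1):
        fields = line.split()
        if fields:
            commit, parents = fields[0], fields[1:]
            for parent in parents:
                children.setdefault(parent, []).append(commit)
        if count % report_every == 0:
            yield f"processed {count} commits"


def latency_output(lines, clock=time.monotonic, out=sys.stdout):
    """Prefix each line with seconds since start and seconds since the
    previous line; a stall shows up as one large second column value."""
    start = last = clock()
    records = []
    for line in lines:
        now = clock()
        records.append((now - start, now - last))
        out.write(f"{now - start:.3f} {now - last:.3f} {line.rstrip()}\n")
        out.flush()
        last = now
    return records
```

Because `fake_gitk` is a generator, `latency_output(fake_gitk(sys.stdin, {}))` timestamps each progress line at the moment it is produced, which is the property the unbuffered pipe is meant to preserve in the real setup.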

To evaluate the patches, I considered three basic metrics.

o The length of time it takes fake-gitk to complete on average
o How often and how long fake-gitk stalled for
o How long was spent in congestion_wait
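The first two metrics can be derived from per-run elapsed times and the per-line latencies that latency-output records. A sketch, assuming a "stall" is any gap longer than a chosen threshold (1 second here, an arbitrary value not taken from these tests), might look like this. Time spent in congestion_wait() needs kernel-side information and is not covered.

```python
from statistics import mean


def summarise(elapsed_runs, latencies, threshold=1.0):
    """Average completion time plus stall statistics, all in seconds.

    elapsed_runs: total time of each fake-gitk run.
    latencies: gaps between successive progress lines.
    """
    stalls = [lat for lat in latencies if lat > threshold]
    return {
        "mean_elapsed": mean(elapsed_runs),
        "stalls": len(stalls),
        "max_stall": max(stalls, default=0.0),
        "total_stall": sum(stalls),
    }
```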

All generated data is in the tarball.
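run-test.sh is described in step 4 as recording nr_writeback at regular intervals and tracking page and swap IO. One way to produce such data, assuming the counters of interest are nr_writeback (a current count) and the cumulative pgpgin/pgpgout/pswpin/pswpout counters in /proc/vmstat, is to diff successive snapshots. This is a sketch, not the script from the tarball.

```python
GAUGES = ("nr_writeback",)                             # current value matters
COUNTERS = ("pgpgin", "pgpgout", "pswpin", "pswpout")  # cumulative since boot


def parse_vmstat(text):
    """Turn /proc/vmstat-style "name value" lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def read_vmstat(path="/proc/vmstat"):
    with open(path) as f:
        return parse_vmstat(f.read())


def interval_report(before, after):
    """Gauges as sampled at the end of the interval, counters as deltas."""
    report = {name: after.get(name, 0) for name in GAUGES}
    for name in COUNTERS:
        report[name] = after.get(name, 0) - before.get(name, 0)
    return report
```

Calling `read_vmstat()` once per sampling interval and printing `interval_report(previous, current)` gives a time series of writeback and IO activity to line up against the latency output.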

On X86, the results I got were

2.6.30-0000000-force-highorder           Elapsed:10:59.095  Failures:0

2.6.31-0000000-force-highorder           Elapsed:11:53.505  Failures:0
2.6.31-revert-8aa7e847                   Elapsed:14:01.595  Failures:0
2.6.31-0000012-pgalloc-2.6.30            Elapsed:13:32.237  Failures:0
2.6.31-0000123-congestion-both           Elapsed:12:44.170  Failures:0
2.6.31-0001234-kswapd-quick-recheck      Elapsed:10:35.327  Failures:0
2.6.31-0012345-adjust-priority           Elapsed:11:02.995  Failures:0

2.6.32-rc6-0000000-force-highorder       Elapsed:18:18.562  Failures:0
2.6.32-rc6-revert-8aa7e847               Elapsed:10:29.278  Failures:0
2.6.32-rc6-0000012-pgalloc-2.6.30        Elapsed:13:32.393  Failures:0
2.6.32-rc6-0000123-congestion-both       Elapsed:14:55.265  Failures:0
2.6.32-rc6-0001234-kswapd-quick-recheck  Elapsed:13:35.628  Failures:0
2.6.32-rc6-0012345-adjust-priority       Elapsed:12:41.278  Failures:0

The 0000000-force-highorder kernels are vanilla kernels patched so that network
receive always results in an order-2 allocation. This machine wasn't
suffering page allocation failures even under these circumstances. However,
note how slow 2.6.32-rc6 is and how much the revert helps. With the patches
applied, performance is comparable.
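The elapsed times in the tables above are in minutes:seconds form, which makes ratios awkward to read off. A small parser for result lines in exactly the format shown (the regular expression is written for these lines and nothing more general) turns them into seconds:

```python
import re

RESULT_RE = re.compile(
    r"^(?P<kernel>\S+)\s+Elapsed:(?P<min>\d+):(?P<sec>\d+(?:\.\d+)?)"
    r"\s+Failures:(?P<fail>\d+)$"
)


def parse_result(line):
    """Return (kernel, elapsed_seconds, failures) for one result line."""
    m = RESULT_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a result line: {line!r}")
    seconds = int(m["min"]) * 60 + float(m["sec"])
    return m["kernel"], seconds, int(m["fail"])


vanilla = parse_result("2.6.32-rc6-0000000-force-highorder       Elapsed:18:18.562  Failures:0")
reverted = parse_result("2.6.32-rc6-revert-8aa7e847               Elapsed:10:29.278  Failures:0")
print(f"{vanilla[0]} takes {vanilla[1] / reverted[1]:.2f}x as long as {reverted[0]}")
```

For the two 2.6.32-rc6 lines used here, that is 1098.562s against 629.278s, so the vanilla kernel takes about 1.75 times as long as the reverted one on X86.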

Latencies were generally reduced with the patches applied. 2.6.32-rc6 was
particularly bad, with long stalls measured over the duration of the test,
but its latencies are comparable to 2.6.30 with the patches applied.

congestion_wait() behaviour is more in line with 2.6.30 after the
patches, with similar amounts of time being spent. In general,
2.6.32-rc6-0012345-adjust-priority waits longer than 2.6.30 or the
reverted kernels did. It also waits in more places, such as inside
shrink_inactive_list() where it didn't before. Forcing behaviour like 2.6.30
resulted in good figures, but I couldn't justify the patches with anything
more solid than "in tests, it behaves well even though it doesn't make a
lot of sense".

On X86-64, the results I got were

2.6.30-0000000-force-highorder           Elapsed:09:48.545  Failures:0

2.6.31-0000000-force-highorder           Elapsed:09:13.020  Failures:0
2.6.31-revert-8aa7e847                   Elapsed:09:02.120  Failures:0
2.6.31-0000012-pgalloc-2.6.30            Elapsed:08:52.742  Failures:0
2.6.31-0000123-congestion-both           Elapsed:08:59.375  Failures:0
2.6.31-0001234-kswapd-quick-recheck      Elapsed:09:19.208  Failures:0
2.6.31-0012345-adjust-priority           Elapsed:09:39.225  Failures:0

2.6.32-rc6-0000000-force-highorder       Elapsed:19:38.585  Failures:5
2.6.32-rc6-revert-8aa7e847               Elapsed:17:21.257  Failures:0
2.6.32-rc6-0000012-pgalloc-2.6.30        Elapsed:18:56.682  Failures:1
2.6.32-rc6-0000123-congestion-both       Elapsed:16:08.340  Failures:0
2.6.32-rc6-0001234-kswapd-quick-recheck  Elapsed:18:11.200  Failures:7
2.6.32-rc6-0012345-adjust-priority       Elapsed:21:33.158  Failures:0

Failures were down and my impression was that it was much harder to cause
failures. Performance on mainline is still not as good as 2.6.30. On
this particular machine, I was able to force performance to be in line
but not with any patch I could justify in the general case.

Latencies were slightly reduced by applying the patches against 2.6.31.
Against 2.6.32-rc6, applying the patches significantly reduced the latencies,
but they remain high. I'll continue to investigate what can be
done to improve this further.

Again, congestion_wait() is more in line with 2.6.30 when the patches
are applied. As on X86, almost identical behaviour can be forced
by waiting on BLK_ASYNC_BOTH in each caller of congestion_wait() in the
reclaim and allocator paths.

Bottom line: the patches made allocation failures much harder to trigger
and, in a number of instances, reduced latencies when the system
is under load. I will keep looking at this area, particularly the
performance under load on 2.6.32-rc6, but with 2.6.32 almost out the door,
I am releasing what I have now.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2
@ 2009-10-22 14:22 Mel Gorman
From: Mel Gorman @ 2009-10-22 14:22 UTC
  To: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker
  Cc: Rafael J. Wysocki, David Miller, Reinette Chatre, Kalle Valo,
	David Rientjes, KOSAKI Motohiro, Mohamed Abbas, Jens Axboe,
	John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org,
	Mel Gorman

Sorry for the large cc list. Variations of this bug have cropped up in a
number of different places and so there are a fair few people that should
be vaguely aware of what's going on.

Since 2.6.31-rc1, there has been an increasing number of GFP_ATOMIC
failures. A significant number of these have been high-order GFP_ATOMIC
failures and, while they are generally brushed away, there has been a large
increase in them recently. There are a number of possible areas the
problem could be in: core vm, page writeback and a specific driver. The
bugs affected by this that I am aware of are:

[Bug #14141] order 2 page allocation failures in iwlagn
	Commit 4752c93c30441f98f7ed723001b1a5e3e5619829 introduced GFP_ATOMIC
	allocations within the wireless driver. Frans Pop has reported a
	large number of failures as a result. Fixing this requires changes
	to the driver if it is to keep using GFP_ATOMIC, which is in the
	hands of Mohamed Abbas and Reinette Chatre. However, it is very
	likely that the problem has been compounded by the core mm changes
	this series is aimed at.

[Bug #14141] order 2 page allocation failures (generic)
	This problem is being tracked under bug #14141 but chances are it's
	unrelated to the wireless change. Tobi Oetiker has reported that a
	virtualised machine using a bridged interface is reporting a small
	number of order-5 GFP_ATOMIC failures. He has reported that the
	errors can be suppressed with kswapd patches in this series. However,
	I would like to confirm they are necessary.

[Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
	Karol Lewandowski reported that e100 fails to allocate order-5
	GFP_ATOMIC memory when loading firmware during resume. This has
	started happening relatively recently.

[No BZ ID] Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
	This is apparently easily reproducible, particularly in comparison to
	the other reports. The point of greatest interest is that these are
	order-0 GFP_ATOMIC failures. Sven, I'm hoping that you in particular
	will be able to follow the tests below, as you are the most likely
	person to have an easily reproducible situation.

[No BZ ID] page allocation failure message kernel 2.6.31.4 (tty-related)
	Reported at: http://lkml.org/lkml/2009/10/20/139. This looks the same
	as the order-2 failures.
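The "order:5, mode:0x8020" part of the e100 report can be decoded by hand. The sketch below uses the __GFP_* bit values as defined in include/linux/gfp.h around 2.6.31; these values change between releases, so check the header of the tree being debugged. A 4KB page size (x86) is assumed for the order-to-bytes conversion.

```python
# __GFP_* bit values as in include/linux/gfp.h around 2.6.31 (assumption:
# verify against the exact kernel tree, the values move between releases).
GFP_BITS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x04: "__GFP_DMA32",
    0x08: "__GFP_MOVABLE",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
    0x100: "__GFP_COLD",
    0x200: "__GFP_NOWARN",
    0x400: "__GFP_REPEAT",
    0x800: "__GFP_NOFAIL",
    0x1000: "__GFP_NORETRY",
    0x4000: "__GFP_COMP",
    0x8000: "__GFP_ZERO",
}


def decode_gfp(mask):
    """Split a 'mode:0x...' value from a failure report into flag names;
    bits not in the table are reported as a leftover hex value."""
    names = [name for bit, name in sorted(GFP_BITS.items()) if mask & bit]
    unknown = mask & ~sum(GFP_BITS)  # keys are distinct bits, so sum == OR
    if unknown:
        names.append(hex(unknown))
    return names


def order_to_bytes(order, page_size=4096):
    """Size of one order-N block: 2**order physically contiguous pages."""
    return page_size << order


print(decode_gfp(0x8020), order_to_bytes(5))
```

With these values, mode:0x8020 is __GFP_HIGH (which is all GFP_ATOMIC expands to in these kernels) plus __GFP_ZERO, and order:5 is a request for 128KB of physically contiguous memory without being able to sleep, compared with 16KB for the order-2 failures.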

There are 5 patches in this series. For people affected by this bug,
I'm afraid there is a lot of legwork involved to help pin down which of
these patches are relevant. These patches are all against 2.6.32-rc5 and
have been tested on X86 and X86-64 by running the sysbench benchmark to
completion. I'll post against 2.6.31.4 where necessary.

Test 1: Verify your problem occurs on 2.6.32-rc5 if you can

Test 2: Apply the following two patches and test again

  1/5 page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
  2/5 page allocator: Do not allow interrupts to use ALLOC_HARDER


	These patches correct problems introduced by me during the 2.6.31-rc1
	merge window. The patches merged then were not meant to introduce any
	functional changes, but two such changes were missed.

	If your problem goes away with just these two patches applied,
	please tell me.

Test 3: If you are getting allocation failures, try with the following patch

  3/5 vmscan: Force kswapd to take notice faster when high-order watermarks are being hit

	This is a functional change that causes kswapd to notice sooner
	when high-order watermarks have been hit. There have been a number
	of changes in page reclaim since 2.6.30 that might have delayed
	when kswapd kicks in for higher orders.

	If your problem goes away with these three patches applied, please
	tell me.

Test 4: If you are still getting failures, apply the following
  4/5 page allocator: Pre-emptively wake kswapd when high-order watermarks are hit

	This patch is very heavy-handed and pre-emptively kicks kswapd when
	watermarks are hit. It should only be necessary if there have been
	significant changes in the timing and density of page allocations
	from an unknown source. Tobias, this patch is largely aimed at you.
	You reported that with patches 3+4 applied your problems went
	away. I need to know if patch 3 on its own is enough or if both
	are required.

	If your problem goes away with these four patches applied, please
	tell me.

Test 5: If things are still screwed, apply the following
  5/5 Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion

	Frans Pop reports that the bulk of his problems go away when this
	patch is reverted on 2.6.31. There has been some confusion about why
	exactly this patch was wrong, but apparently the conversion was not
	complete and further work was required. It's unknown whether all the
	necessary work exists in 2.6.31-rc5 or not. If there are still
	allocation failures and applying this patch fixes the problem,
	there are still snags that need to be ironed out.

Test 6: If only testing 2.6.31.4, test with patches 1, 2 and 5 as posted for that kernel
	Even if patches 3, 4 or both are necessary against mainline, I'm
	hoping they are unnecessary against -stable.
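The test sequence above is a cumulative search: apply one more patch at each stage and stop at the first stage where the failures go away. As a sketch of that decision procedure only, `run_test` below is a hypothetical callback that would build and boot a kernel with the given patches, run the workload and return the number of page allocation failures; none of that is shown. Test 6 (2.6.31.4 with patches 1, 2 and 5) is a separate branch and is not modelled.

```python
STAGES = [
    ("Test 1", ()),  # base kernel: confirm the problem reproduces at all
    ("Test 2", ("1/5", "2/5")),
    ("Test 3", ("1/5", "2/5", "3/5")),
    ("Test 4", ("1/5", "2/5", "3/5", "4/5")),
    ("Test 5", ("1/5", "2/5", "3/5", "4/5", "5/5")),
]


def first_clean_stage(run_test, stages=STAGES):
    """Return the first (name, patches) stage for which the hypothetical
    run_test(patches) reports zero allocation failures, or None if every
    stage still fails. Later stages are not run once one comes up clean."""
    for name, patches in stages:
        if run_test(patches) == 0:
            return name, patches
    return None
```

A result of "Test 1" means the problem did not reproduce on the base kernel, so the later stages tell you nothing.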

Thanks to all who reported problems and are testing this. The bulk of
the work was done by Frans Pop, so a big thanks to him in particular. I/we owe
him beers.

 arch/x86/lib/usercopy_32.c  |    2 +-
 drivers/block/pktcdvd.c     |   10 ++++------
 drivers/md/dm-crypt.c       |    2 +-
 fs/fat/file.c               |    2 +-
 fs/fuse/dev.c               |    8 ++++----
 fs/nfs/write.c              |    8 +++-----
 fs/reiserfs/journal.c       |    2 +-
 fs/xfs/linux-2.6/kmem.c     |    4 ++--
 fs/xfs/linux-2.6/xfs_buf.c  |    2 +-
 include/linux/backing-dev.h |   11 +++--------
 include/linux/blkdev.h      |   13 +++++++++----
 mm/backing-dev.c            |    7 ++++---
 mm/memcontrol.c             |    2 +-
 mm/page-writeback.c         |    2 +-
 mm/page_alloc.c             |   41 ++++++++++++++++++++++++++---------------
 mm/vmscan.c                 |   17 +++++++++++++----
 16 files changed, 75 insertions(+), 58 deletions(-)
