* [PATCH 0/1] Synchronous lumpy reclaim
@ 2007-07-20 19:41 Mel Gorman
  2007-07-20 19:41 ` [PATCH 1/1] Wait for page writeback when directly reclaiming contiguous areas Mel Gorman
  0 siblings, 1 reply; 4+ messages in thread
From: Mel Gorman @ 2007-07-20 19:41 UTC (permalink / raw)
  To: apw; +Cc: Mel Gorman, linux-mm

With ZONE_MOVABLE and lumpy reclaim merged for 2.6.23, I started testing
different scenarios based on just the zones. Previous testing scenarios
had assumed the presence of grouping pages by mobility and various other
-mm patches, which might not be indicative of mainline behaviour. Previous
tests also tried to allocate the pages very aggressively, which is unlikely
to be typical behaviour.

I've included a script below for testing the behaviour of ZONE_MOVABLE in a
more reasonable fashion, closer to how it is likely to be used by an
application or by the growing of the huge page pool. The test fills physical
memory with some data and then uses SystemTap to allocate memory from just
ZONE_MOVABLE. The expected result was that the zone could be fully allocated
as huge pages if the system was at rest. Without SystemTap, similar results
can be obtained using the nr_hugepages proc file, but you then have to guess
from the values in buddyinfo whether the pages came from ZONE_MOVABLE or not,
so it is not as clear.
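
For reference, the cruder nr_hugepages approach amounts to something like the
following rough user-space sketch (not part of the test script below and
untested here; the pool has to be grown as root):

#include <stdio.h>
#include <stdlib.h>

/* Read the current size of the static hugepage pool */
static long read_pool_size(void)
{
	FILE *f = fopen("/proc/sys/vm/nr_hugepages", "r");
	long val = -1;

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

int main(int argc, char **argv)
{
	long extra = argc > 1 ? atol(argv[1]) : 128;
	long before = read_pool_size();
	FILE *f;

	if (before < 0)
		return 1;

	/* Ask the kernel to grow the pool; it grows it as far as it can */
	f = fopen("/proc/sys/vm/nr_hugepages", "w");
	if (!f)
		return 1;
	fprintf(f, "%ld\n", before + extra);
	fclose(f);

	printf("Requested %ld extra hugepages, pool went from %ld to %ld\n",
			extra, before, read_pool_size());
	return 0;
}

You still end up diffing /proc/buddyinfo by hand afterwards to guess which
zone the pages were taken from, which is why the SystemTap version is used.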

The results were not as expected. Even with lumpy reclaim, the system had
difficulty allocating huge pages from ZONE_MOVABLE unless the pages were
all the same age or the system had been inactive for quite some time. It
also often required exiting applications, or tricks like dd'ing large files
and deleting them, or using drop_caches. Trying to use the zone while X
apps were running was particularly difficult.

The problem is that reclaim moves on too easily to the next block of pages.
It queues up a number of pages, but they have not been reclaimed by the
time the allocation attempt is made, so the allocation appears to fail.
The patch following this mail teaches processes directly reclaiming
contiguous pages to wait for pages within an area to free before retrying
the allocation. I believe this patch, or something like it, will be needed
in 2.6.23 because this looks like buggy behaviour in lumpy reclaim even
though, strictly speaking, it is not wrong.

The test scenario is based on my desktop machine and looks like:

o i386 with 2GB of RAM
o ZONE_MOVABLE = 512MB of RAM (i.e. 4 active zones)
o Machine freshly booted
o X started - 9 terminals, two instances of konqueror
o Light load in the background due to some badly behaving daemons
o Test script runs

These are the results of the test script.

2.6.22-git13
============
Total huge pages:             122
Successfully allocated:       12 hugepages
Failed allocation at attempt: 12
Failed allocation at attempt: 12
Failed allocation at attempt: 12
Failed allocation at attempt: 12

2.6.22-git13-syncwriteback
==========================
Total huge pages:             122
Successfully allocated:       50 hugepages
Failed allocation at attempt: 14
Failed allocation at attempt: 14
Failed allocation at attempt: 18
Failed allocation at attempt: 50

I ran the SystemTap script a second time a few minutes after the test
completed so that all IO would have completed and the system would be
relatively idle again. The results were:

2.6.22-git13 after some idle time
=================================
Total huge pages:             122
Successfully allocated:       45 hugepages
Failed allocation at attempt: 45
Failed allocation at attempt: 45
Failed allocation at attempt: 45
Failed allocation at attempt: 45

2.6.22-git13-syncwriteback after some idle time
===============================================
Total huge pages:             122
Successfully allocated:       122 hugepages
Failed allocation at attempt: 78
Failed allocation at attempt: 80
Failed allocation at attempt: 112

The patch has been tested on i386, x86_64 and ppc64. This is the test
script I used to verify the problem, for anyone wishing to reproduce the
results.

===> CUT HERE <===
#!/bin/bash
# This script is a simple regression test for the usage of ZONE_MOVABLE. It
# requires SystemTap to be installed to act as a trigger. The test is fairly
# simple but must be run as root. The actions of the test are:
# 
# 1. dd a file the size of physical memory from /dev/zero while updatedb runs
# 2. When step 1 completes, run the embedded systemtap script to allocate
#    huge pages from ZONE_MOVABLE
#
# The systemtap script is described in more detail below but, basically, it
# tries to allocate hugepages from ZONE_MOVABLE and gives up easily.
#
# If ZONE_MOVABLE and all associated code is working perfectly, this test will
# always successfully allocate all the hugepages from that ZONE with all the
# failures at the end. The expected behaviour for 2.6.23 is that the zone can
# be fully allocated when the system is at rest but will have difficulty
# under load.
#
# Copyright (C) IBM Corporation, 2007
# Author: Mel Gorman <mel@csn.ul.ie>

LARGEFILE=$HOME/zonemovable_test_largefile
STAP=/usr/bin/stap

die() {
	echo "FATAL: $@"
	exit 1
}

echo Checking SystemTap exists
if [ ! -x $STAP ]; then
	die SystemTap must be available at $STAP
fi

echo Checking largefile can be created
echo -n > $LARGEFILE || die Failed to create $LARGEFILE

echo Checking parameters
MEMKB=`grep MemTotal: /proc/meminfo | awk '{print $2}'`
if [ "$MEMKB" = "" ]; then
	die Failed to determine total amount of memory
fi
MEMMB=$(($MEMKB/1024))

echo Running dd of ${MEMMB}MB to $LARGEFILE while running updatedb
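# dd runs in the background dirtying pagecache with a file the size of
# physical memory while updatedb runs in the foreground to add dentry/inode
# cache and IO pressure at the same time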
dd if=/dev/zero of=$LARGEFILE ibs=1048576 count=$MEMMB & updatedb

echo Running allocation test
TIME=`which time`
$TIME $STAP -g - <<EOFSTAP
# alloctrigger.stp
#
# This script is a test for the allocation of hugepages from ZONE_MOVABLE. It
# works by estimating how many hugepages are contained in ZONE_MOVABLE
# and then creating a zonelist consisting of just ZONE_MOVABLE from each
# active node.
#
# It allows up to ABORT_AFTER_FAILCOUNT failures before stopping the test to
# prevent hammering allocation attempts. The results of the test are the
# number of hugepages that exist, the number that were successfully allocated
# and when each of the failures occurred.
#
# WARNING: As this attempts allocations even after fails, the system may
#	   decide that it is OOM and start killing things. Arguably, the
#	   system should not consider itself OOM for high-order allocation
#	   failures.

function alloc_hugepages:long () %{

#define ABORT_AFTER_FAILCOUNT 4

	struct zonelist zonelist;
	struct page *page;
	struct page **pages;
	int freecount, nid;
	int nr_nodes = 0;
	int count = 0;
	int failcount = 0;
	int fails[ABORT_AFTER_FAILCOUNT];
	int max_hugepages = 0;

	/* Create a zonelist containing only ZONE_MOVABLE */
	_stp_printf("Building zonelist\n");
	for_each_online_node(nid) {
		struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
		if (populated_zone(zone)) {
			int hugepages;
			_stp_printf("  o Added ZONE_MOVABLE on node %d\n", nid);
			zonelist.zones[nr_nodes] = zone;
			nr_nodes++;

			/* Sum how many huge pages are possible to allocate */
			hugepages = (zone->present_pages - zone->pages_low);
			hugepages = (hugepages >> HUGETLB_PAGE_ORDER) - 1;
			hugepages--;
			if (hugepages > 0)
				max_hugepages += hugepages;
		}
	}
	zonelist.zones[nr_nodes] = NULL;

	/* Make sure ZONE_MOVABLE exists */
	if (nr_nodes == 0) {
		_stp_printf("No suitable ZONE_MOVABLE was found\n");
		return;
	}

	/* Allocate array for allocated pages pointers */
	pages = vmalloc(max_hugepages * sizeof(struct page *));
	if (!pages) {
		_stp_printf("Failed to allocate %d page pointers\n",
								max_hugepages);
		return;
	}

	/* Allocate pages in the zonelist until we start failing */
	_stp_printf("Attempting to allocate %d pages\n", max_hugepages);
	do {
		page = __alloc_pages(GFP_HIGHUSER_MOVABLE|__GFP_NOWARN,
					HUGETLB_PAGE_ORDER, &zonelist);
		if (page) {
			pages[count++] = page;
		} else {
			fails[failcount] = count;
			failcount++;
			_stp_printf(" o Failure %d after %d allocs\n",
							failcount, count);
			if (failcount >= ABORT_AFTER_FAILCOUNT ||
							count >= max_hugepages)
				break;

			/* Wait a little after failing */
			congestion_wait(WRITE, HZ/2);
		}
	} while (count < max_hugepages);

	/* Free pages */
	_stp_printf("Freeing %d huge pages\n", count);
	for (freecount = 0; freecount < count; freecount++)
		__free_pages(pages[freecount], HUGETLB_PAGE_ORDER);
	vfree(pages);

	_stp_printf("\nResults\n=======\n");
	_stp_printf("Total huge pages:             %d\n", max_hugepages);
	_stp_printf("Successfully allocated:       %d hugepages\n", count);
	for (count = 0; count < failcount; count++)
		_stp_printf("Failed allocation at attempt: %d\n", fails[count]);
	_stp_printf("\n");
%}

probe begin {
	print("\n\n")
	alloc_hugepages()
	print("\n\n")
	exit()
}
EOFSTAP

echo Cleaning up $LARGEFILE
rm $LARGEFILE
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* [PATCH 1/1] Wait for page writeback when directly reclaiming contiguous areas
  2007-07-20 19:41 [PATCH 0/1] Synchronous lumpy reclaim Mel Gorman
@ 2007-07-20 19:41 ` Mel Gorman
  2007-07-23 16:51   ` Andy Whitcroft
  0 siblings, 1 reply; 4+ messages in thread
From: Mel Gorman @ 2007-07-20 19:41 UTC (permalink / raw)
  To: apw; +Cc: Mel Gorman, linux-mm

Lumpy reclaim works by selecting a lead page from the LRU list and then
selecting pages for reclaim from the order-aligned area of pages. In the
situation where all pages in that region are inactive and not referenced by
any process over time, it works well.
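
As a rough illustration, the lumpy side of isolate_lru_pages() does something
like the following (simplified sketch, not the exact code as merged):

	/*
	 * 'page' is the lead page just taken off the LRU; 'order', 'mode'
	 * and 'dst' are the parameters isolate_lru_pages() was called with.
	 */
	unsigned long lead_pfn = page_to_pfn(page);
	unsigned long pfn = lead_pfn & ~((1UL << order) - 1);
	unsigned long end_pfn = pfn + (1UL << order);
	struct page *cursor_page;

	for (; pfn < end_pfn; pfn++) {
		if (pfn == lead_pfn || !pfn_valid(pfn))
			continue;
		cursor_page = pfn_to_page(pfn);

		/* Stay within the same zone as the lead page */
		if (page_zone_id(cursor_page) != page_zone_id(page))
			continue;

		/* Try to isolate the neighbour as well so the whole
		 * order-aligned area gets reclaimed together */
		if (__isolate_lru_page(cursor_page, mode) == 0)
			list_move(&cursor_page->lru, dst);
	}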

In the situation where there is even light load on the system, the pages may
not free quickly. Out of an area of 1024 pages, maybe only 950 of them are
freed when the allocation attempt occurs because lumpy reclaim returned early.
This patch alters the behaviour of direct reclaim for large contiguous blocks.

The first attempt to call shrink_page_list() is asynchronous but if it
fails, the pages are submitted a second time and the calling process waits
for the IO to complete. It'll retry up to 5 times for the pages to be
fully freed. This may stall allocators waiting for contiguous memory but
that should be expected behaviour for high-order users. It is preferable
behaviour to potentially queueing unnecessary areas for IO. Note that kswapd
will not stall in this fashion.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 vmscan.c |   53 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 47 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d419e10..6531f49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -287,7 +287,8 @@ typedef enum {
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
  */
-static pageout_t pageout(struct page *page, struct address_space *mapping)
+static pageout_t pageout(struct page *page, struct address_space *mapping,
+						int sync_writeback)
 {
 	/*
 	 * If the page is dirty, only perform writeback if that write
@@ -346,6 +347,15 @@ static pageout_t pageout(struct page *page, struct address_space *mapping)
 			ClearPageReclaim(page);
 			return PAGE_ACTIVATE;
 		}
+
+		/*
+		 * Wait on writeback if requested to. This happens when
+		 * directly reclaiming a large contiguous area and the
+		 * first attempt to free a range of pages fails
+		 */
+		if (PageWriteback(page) && sync_writeback != WB_SYNC_NONE)
+			wait_on_page_writeback(page);
+
 		if (!PageWriteback(page)) {
 			/* synchronous write or broken a_ops? */
 			ClearPageReclaim(page);
@@ -423,7 +433,8 @@ cannot_free:
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-					struct scan_control *sc)
+					struct scan_control *sc,
+					int sync_writeback)
 {
 	LIST_HEAD(ret_pages);
 	struct pagevec freed_pvec;
@@ -458,8 +469,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (page_mapped(page) || PageSwapCache(page))
 			sc->nr_scanned++;
 
-		if (PageWriteback(page))
-			goto keep_locked;
+		if (PageWriteback(page)) {
+			if (sync_writeback)
+				wait_on_page_writeback(page);
+			else
+				goto keep_locked;
+		}
 
 		referenced = page_referenced(page, 1);
 		/* In active use or really unfreeable?  Activate it. */
@@ -505,7 +520,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 
 			/* Page is dirty, try to write it out here */
-			switch(pageout(page, mapping)) {
+			switch(pageout(page, mapping, sync_writeback)) {
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:
@@ -770,6 +785,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 		unsigned long nr_scan;
 		unsigned long nr_freed;
 		unsigned long nr_active;
+		int retries = 0;
 
 		nr_taken = isolate_lru_pages(sc->swap_cluster_max,
 			     &zone->inactive_list,
@@ -784,8 +800,33 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 		zone->pages_scanned += nr_scan;
 		spin_unlock_irq(&zone->lru_lock);
 
+		/* Retry shrink list up to 5 times for costly allocations */
+		if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+			retries = 5;
+
 		nr_scanned += nr_scan;
-		nr_freed = shrink_page_list(&page_list, sc);
+		nr_freed = shrink_page_list(&page_list, sc, WB_SYNC_NONE);
+
+		/*
+		 * If we are direct reclaiming for contiguous pages and we do
+		 * not reclaim everything in the list, try again and wait
+		 * for IO to complete. This will stall high-order allocations
+		 * but that should be acceptable to the caller
+		 */
+		while (nr_freed < nr_taken && !current_is_kswapd() && retries) {
+			retries--;
+			congestion_wait(WRITE, HZ/10);
+
+			/* Reclear active flags */
+			nr_active = clear_active_flags(&page_list);
+			if (nr_active)
+				mod_zone_page_state(zone, NR_ACTIVE,
+								-nr_active);
+
+			nr_freed += shrink_page_list(&page_list, sc,
+								WB_SYNC_ALL);
+		}
+
 		nr_reclaimed += nr_freed;
 		local_irq_disable();
 		if (current_is_kswapd()) {


* Re: [PATCH 1/1] Wait for page writeback when directly reclaiming contiguous areas
  2007-07-20 19:41 ` [PATCH 1/1] Wait for page writeback when directly reclaiming contiguous areas Mel Gorman
@ 2007-07-23 16:51   ` Andy Whitcroft
  2007-07-24 10:23     ` Mel Gorman
  0 siblings, 1 reply; 4+ messages in thread
From: Andy Whitcroft @ 2007-07-23 16:51 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm

Mel Gorman wrote:
> Lumpy reclaim works by selecting a lead page from the LRU list and then
> selecting pages for reclaim from the order-aligned area of pages. In the
> situation where all pages in that region are inactive and not referenced by
> any process over time, it works well.
> 
> In the situation where there is even light load on the system, the pages may
> not free quickly. Out of an area of 1024 pages, maybe only 950 of them are
> freed when the allocation attempt occurs because lumpy reclaim returned early.
> This patch alters the behaviour of direct reclaim for large contiguous blocks.

Yes, lumpy is prone to starting reclaim on an area and moving on to the
next.  Generally, where there are a lot of areas, the areas are smaller
and the number of requests larger, and this is sufficient.  However, for
higher orders it will tend to suffer from the effect you indicate.  As
you say, when the system is unloaded we will get good success rates even
at very high orders, but higher orders on a loaded machine are problematic.

It seems logical that if we could know when all reclaim for a targeted
area has completed, we would have a higher chance of subsequently
allocating successfully.  Looking at your patch, you are synchronising
with the completion of all pending writeback on pages in the targeted
area, which pretty much gives us that.

I am surprised to see a need for a retry loop here; I would have
expected an async start and a sync complete pass, with the expectation
that this would be sufficient.  Otherwise the patch is surprisingly
simple.

I will try and reproduce with your test script and also do some general
testing to see how this might affect the direct allocation latencies,
which I see as key.  It may well improve those for larger allocations.

> The first attempt to call shrink_page_list() is asynchronous but if it
> fails, the pages are submitted a second time and the calling process waits
> for the IO to complete. It'll retry up to 5 times for the pages to be
> fully freed. This may stall allocators waiting for contiguous memory but
> that should be expected behaviour for high-order users. It is preferable
> behaviour to potentially queueing unnecessary areas for IO. Note that kswapd
> will not stall in this fashion.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
[...]

-apw


* Re: [PATCH 1/1] Wait for page writeback when directly reclaiming contiguous areas
  2007-07-23 16:51   ` Andy Whitcroft
@ 2007-07-24 10:23     ` Mel Gorman
  0 siblings, 0 replies; 4+ messages in thread
From: Mel Gorman @ 2007-07-24 10:23 UTC (permalink / raw)
  To: Andy Whitcroft; +Cc: linux-mm

On Mon, 23 Jul 2007, Andy Whitcroft wrote:

> Mel Gorman wrote:
>> Lumpy reclaim works by selecting a lead page from the LRU list and then
>> selecting pages for reclaim from the order-aligned area of pages. In the
>> situation where all pages in that region are inactive and not referenced by
>> any process over time, it works well.
>>
>> In the situation where there is even light load on the system, the pages may
>> not free quickly. Out of an area of 1024 pages, maybe only 950 of them are
>> freed when the allocation attempt occurs because lumpy reclaim returned early.
>> This patch alters the behaviour of direct reclaim for large contiguous blocks.
>
> Yes, lumpy is prone to starting reclaim on an area and moving on to the
> next.  Generally, where there are a lot of areas, the areas are smaller
> and the number of requests larger, and this is sufficient.  However, for
> higher orders it will tend to suffer from the effect you indicate.  As
> you say, when the system is unloaded we will get good success rates even
> at very high orders, but higher orders on a loaded machine are problematic.
>

All sounds about right. When I was testing on my desktop though, even an 
"unloaded" machine had enough background activity to cause problems. I 
imagine this will generally be the case.

> It seems logical that if we could know when all reclaim for a targeted
> area has completed, we would have a higher chance of subsequently
> allocating successfully.  Looking at your patch, you are synchronising
> with the completion of all pending writeback on pages in the targeted
> area, which pretty much gives us that.
>

That was the intention. Critically, it queues up everything
asynchronously first and then waits for it to complete instead of
queueing and waiting on one page at a time. In pageout(), I was somewhat
surprised I could not have

struct writeback_control wbc = {
	.sync_mode = sync_writeback,
	.nonblocking = 0,
	...
};

and have sync_writeback equal to WB_SYNC_NONE or WB_SYNC_ALL depending on
whether the caller of pageout() wanted to sync or not. This didn't work
out, though, and led to the retry logic you bring up later.

> I am surprised to see a need for a retry loop here; I would have
> expected an async start and a sync complete pass, with the expectation
> that this would be sufficient.  Otherwise the patch is surprisingly
> simple.
>

That retry loop should go away because, as you say, a single retry
should be sufficient. It was left over from an earlier version of the patch
and I should have gotten rid of it. I'll look at testing with a WARN_ON if
the "synchronous" pass returns with pages still on the list, to see if it
happens.
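
Roughly, the next revision would drop the loop in favour of something like
this (untested sketch on top of the current patch, with the WARN_ON purely as
a debugging aid, not for merging):

	nr_freed = shrink_page_list(&page_list, sc, WB_SYNC_NONE);

	/* One synchronous retry only, for costly orders in direct reclaim */
	if (sc->order > PAGE_ALLOC_COSTLY_ORDER && !current_is_kswapd() &&
						nr_freed < nr_taken) {
		congestion_wait(WRITE, HZ/10);

		/* Reclear active flags before the synchronous pass */
		nr_active = clear_active_flags(&page_list);
		if (nr_active)
			mod_zone_page_state(zone, NR_ACTIVE, -nr_active);

		nr_freed += shrink_page_list(&page_list, sc, WB_SYNC_ALL);

		/* Debug only: may still fire legitimately, e.g. for pages
		 * that could not be unmapped, but it will show how often
		 * the synchronous pass leaves pages behind */
		WARN_ON(nr_freed < nr_taken);
	}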

> I will try and reproduce with your test script and also do some general
> testing to see how this might affect the direct allocation latencies,
> which I see as key.  It may well improve those for larger allocations.
>

Cool. Thanks.

>> The first attempt to call shrink_page_list() is asynchronous but if it
>> fails, the pages are submitted a second time and the calling process waits
>> for the IO to complete. It'll retry up to 5 times for the pages to be
>> fully freed. This may stall allocators waiting for contiguous memory but
>> that should be expected behaviour for high-order users. It is preferable
>> behaviour to potentially queueing unnecessary areas for IO. Note that kswapd
>> will not stall in this fashion.
>>
>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> [...]
>
> -apw
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
