* [patch] swap-speedup-2.4.3-A1, massive swapping speedup
@ 2001-04-23 9:20 Ingo Molnar
2001-04-23 15:33 ` Rik van Riel
0 siblings, 1 reply; 11+ messages in thread
From: Ingo Molnar @ 2001-04-23 9:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Linux Kernel List, linux-mm, Marcelo Tosatti,
Rik van Riel, Szabolcs Szakacsits
[-- Attachment #1: Type: TEXT/PLAIN, Size: 2949 bytes --]
background: sometime during the 2.4 cycle swapping performance and the
efficiency of the swapcache dropped significantly. I believe i finally
managed to find the problem that caused poor swapping performance and
annoying 'sudden process stoppage' symptoms.
(to those people who are experiencing swapping problems in 2.4: please try
the attached swap-speedup-2.4.3-A1 patch and let me know whether it works
& helps. The patch is against 2.4.4-pre6 or 2.4.3-ac12 and works just fine
on UP and SMP systems here.)
the problem is in lookup_swap_cache(). Per design it's a read-only,
lightweight function that just looks up the swapcache and reestablishes
the mapping if there is an entry still present. (ie. in most cases this
means that there is a fresh swapout still pending). In reality
lookup_swap_cache() did not work as intended: pages were locked in most of
the cases due to swap-out WRITEs, which caused the find_lock_page() to act
as a synchronization point - it blocked until the writeout finished. (!!!)
This is highly inefficient and undesirable - a lookup cache should not and
must not serialize with writeouts.
the reason why lookup_swap_cache() locks the page is due to a valid race,
but the solution is excessive: it tries to keep the lookup atomic against
destroyers of the page, page_launder() and reclaim_page(). But it does not
really need the page lock for this - what it needs is atomicity against
swapcache updates. The same atomicity can be achieved by taking the LRU
and pagecache locks, the PageSwapCache() check is now embedded in a new
function: __find_get_swapcache_page().
the patch dramatically improves swapping performance in the tests i've
tried: swap-thrashing tests that used to effectively lock the system up
are chugging along just fine now, and the system is still more than
usable. The performance bug basically killed all good effects of the
swapcache. Swap-in latency of swapped-out processes has decreased
significantly, and overall swapping throughput has increased and
stabilized.
I'd really like to ask all MM developers to take some time to lean back
and verify current code to find similar performance bugs, instead of
trying to hack up new functionality to hide symptoms of bad design or bad
implementation. (for example there are some plans to add "avoid thrashing
via process suspension" heuristics, which just work around real problems
like this one. With such code in place i'd probably never have found this
problem.) I believe we have most of the VM functionality in place to have
a world-class VM (most of which is new), what we now need is reliable and
verified behavior, not more complexity.
[ i'd also like to ask for new methods to create 'swap thrashing'; right
now i cannot make my system unusable via excessive swapping. (i'm
currently using the sieve.c memory thrasher from Cesar Eduardo Barros;
this code used to produce the most extreme thrashing load - it works just
fine now.) ]
Ingo
[-- Attachment #2: Type: TEXT/PLAIN, Size: 2954 bytes --]
--- linux/mm/filemap.c.orig Mon Apr 23 13:35:06 2001
+++ linux/mm/filemap.c Mon Apr 23 13:37:09 2001
@@ -710,6 +710,34 @@
}
/*
+ * Find a swapcache page (and get a reference) or return NULL.
+ * The SwapCache check is protected by the pagecache lock.
+ */
+struct page * __find_get_swapcache_page(struct address_space *mapping,
+ unsigned long offset, struct page **hash)
+{
+ struct page *page;
+
+ /*
+ * We need the LRU lock to protect against page_launder().
+ */
+ spin_lock(&pagecache_lock);
+ spin_lock(&pagemap_lru_lock);
+
+ page = __find_page_nolock(mapping, offset, *hash);
+ if (page) {
+ if (PageSwapCache(page))
+ page_cache_get(page);
+ else
+ page = NULL;
+ }
+ spin_unlock(&pagemap_lru_lock);
+ spin_unlock(&pagecache_lock);
+
+ return page;
+}
+
+/*
* Same as the above, but lock the page too, verifying that
* it's still valid once we own it.
*/
--- linux/mm/swap_state.c.orig Mon Apr 23 13:35:06 2001
+++ linux/mm/swap_state.c Mon Apr 23 13:37:01 2001
@@ -163,37 +163,18 @@
/*
* Right now the pagecache is 32-bit only. But it's a 32 bit index. =)
*/
-repeat:
- found = find_lock_page(&swapper_space, entry.val);
+ found = find_get_swapcache_page(&swapper_space, entry.val);
if (!found)
return 0;
- /*
- * Though the "found" page was in the swap cache an instant
- * earlier, it might have been removed by refill_inactive etc.
- * Re search ... Since find_lock_page grabs a reference on
- * the page, it can not be reused for anything else, namely
- * it can not be associated with another swaphandle, so it
- * is enough to check whether the page is still in the scache.
- */
- if (!PageSwapCache(found)) {
- UnlockPage(found);
- page_cache_release(found);
- goto repeat;
- }
+ if (!PageSwapCache(found))
+ BUG();
if (found->mapping != &swapper_space)
- goto out_bad;
+ BUG();
#ifdef SWAP_CACHE_INFO
swap_cache_find_success++;
#endif
- UnlockPage(found);
return found;
}
-
-out_bad:
- printk (KERN_ERR "VM: Found a non-swapper swap page!\n");
- UnlockPage(found);
- page_cache_release(found);
- return 0;
}
/*
--- linux/include/linux/pagemap.h.orig Mon Apr 23 13:35:05 2001
+++ linux/include/linux/pagemap.h Mon Apr 23 13:37:01 2001
@@ -77,7 +77,12 @@
unsigned long index, struct page **hash);
extern void lock_page(struct page *page);
#define find_lock_page(mapping, index) \
- __find_lock_page(mapping, index, page_hash(mapping, index))
+ __find_lock_page(mapping, index, page_hash(mapping, index))
+
+extern struct page * __find_get_swapcache_page (struct address_space * mapping,
+ unsigned long index, struct page **hash);
+#define find_get_swapcache_page(mapping, index) \
+ __find_get_swapcache_page(mapping, index, page_hash(mapping, index))
extern void __add_page_to_hash_queue(struct page * page, struct page **p);
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [patch] swap-speedup-2.4.3-A1, massive swapping speedup 2001-04-23 9:20 [patch] swap-speedup-2.4.3-A1, massive swapping speedup Ingo Molnar @ 2001-04-23 15:33 ` Rik van Riel 2001-04-23 16:05 ` [patch] swap-speedup-2.4.3-A2 Ingo Molnar 2001-04-23 16:53 ` [patch] swap-speedup-2.4.3-A1, massive swapping speedup Jonathan Morton 0 siblings, 2 replies; 11+ messages in thread From: Rik van Riel @ 2001-04-23 15:33 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Alan Cox, Linux Kernel List, linux-mm, Marcelo Tosatti, Szabolcs Szakacsits On Mon, 23 Apr 2001, Ingo Molnar wrote: > the reason why lookup_swap_cache() locks the page is due to a valid race, > but the solution excessive: it tries to keep the lookup atomic against > destroyers of the page, page_launder() and reclaim_page(). But it does not > really need the page lock for this - what it needs is atomicity against > swapcache updates. The same atomicity can be achieved by taking the LRU > and pagecache locks, the PageSwapCache() check is now embedded in a new > function: __find_get_swapcache_page(). There seems to be one more reason, take a look at the function read_swap_cache_async() in swap_state.c, around line 240: /* * Add it to the swap cache and read its contents. */ lock_page(new_page); add_to_swap_cache(new_page, entry); rw_swap_page(READ, new_page, wait); return new_page; Here we add an "empty" page to the swap cache and use the page lock to protect people from reading this non-up-to-date page. Your patch looks like it could occasionally let a process map in such a page before the IO is done. We really need to fix read_swap_cache_async() and your __find_get_swapcache_page() to make sure that only pages which are up to date can be given to processes. 
> the patch dramatically improves swapping performance in the tests i've
> tried:
> Swap-in latency of swapped-out processes has decreased significantly,
> and overall swapping throughput has increased and stabilized.

Cool, this really sounds like a patch we need, once the correctness
issue is resolved.

> I'd really like to ask all MM developers to take some time to lean back
> and verify current code to find similar performance bugs, instead of
> trying to hack up new functionality to hide symptoms of bad design or bad
> implementation. (for example there are some plans to add "avoid thrashing
> via process suspension" heuristics, which just work around real problems
> like this one. With such code in place i'd probably never have found this
> problem.) I believe we have most of the VM functionality in place to have
> a world-class VM (most of which is new), what we now need is reliable and
> verified behavior, not more complexity.

Agreed, we need to look for all the gotchas in the current code.

OTOH, no matter how fast we make the current functionality, there will
always be some point at which the system is so overloaded that no paging
system can help save it from thrashing. At this point we need load
control ... IMHO it is unacceptable to have Linux fall over under heavy
overload when we could stop this from happening by simple load control
code.

Btw, to test this ... try running 10 copies of gnuchess playing against
itself on a machine with 64 MB of RAM. You'll go under thrashing pretty
quickly. This same situation is possible in real-life scenarios, like a
website getting slashdotted or a mail server getting bombed. We really
want some form of load control to save a machine from thrashing itself
to death in such a situation.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/		http://www.conectiva.com/
http://distro.conectiva.com.br/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see:
http://www.linux.eu.org/Linux-MM/
* [patch] swap-speedup-2.4.3-A2
2001-04-23 15:33 ` Rik van Riel
@ 2001-04-23 16:05 ` Ingo Molnar
2001-04-23 17:17 ` Linus Torvalds
2001-04-24 5:44 ` [patch] swap-speedup-2.4.3-B3 Ingo Molnar
2001-04-23 16:53 ` [patch] swap-speedup-2.4.3-A1, massive swapping speedup Jonathan Morton
1 sibling, 2 replies; 11+ messages in thread
From: Ingo Molnar @ 2001-04-23 16:05 UTC (permalink / raw)
To: Rik van Riel
Cc: Linus Torvalds, Alan Cox, Linux Kernel List, linux-mm,
Marcelo Tosatti, Szabolcs Szakacsits

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2120 bytes --]

On Mon, 23 Apr 2001, Rik van Riel wrote:

> There seems to be one more reason, take a look at the function
> read_swap_cache_async() in swap_state.c, around line 240:

you are right - i thought about that issue too and assumed it works like
the pagecache (which first reads the page without hashing it, then tries
to add the result to the pagecache and throws away the page if anyone
else finished it already), but that was incorrect.

i think the swapcache's behavior is actually the more correct one, and
the pagecache should first hash pending reads, then start the IO. If
someone else tries to read the page and the page is not uptodate and
locked, then we should sleep on the page's waitqueue. End-of-page-IO or
PageError should then unlock the page and wake up all waiters. [which
happens already]

several processes trying to read the same page is a legitimate scenario,
and should not result in multiple reads done on the same disk area.
Besides the (albeit small) performance win, IMO this is also a quality
of implementation issue, even if the window is small. (the window might
not be that small in some cases like NFS?)

i've fixed the patch so that it checks Page_Uptodate() within
__find_get_swapcache_page(). The only case i'm not convinced about is
the case of IO errors, but otherwise this should work.

I've also fixed an SMP-only bug in the new __find_get_swapcache_page()
function: it must not run __find_page_nolock() with the LRU lock held.

i've attached swap-speedup-2.4.3-A2 which has these fixes included.

> OTOH, no matter how fast we make the current functionality, there will
> always be some point at which the system is so overloaded that no
> paging system can help save it from thrashing.

agreed - but this should really be the last resort, and i'm still
worried about potentially hiding real performance bugs.

> Btw, to test this ... try running 10 copies of gnuchess playing
> against itself on a machine with 64 MB of RAM. You'll go under
> thrashing pretty quickly.

thanks, i'll try that. (do you have any easy oneliner command line to
start up gnuchess against itself?)

	Ingo

[-- Attachment #2: Type: TEXT/PLAIN, Size: 3316 bytes --]

--- linux/mm/filemap.c.orig	Mon Apr 23 13:35:06 2001
+++ linux/mm/filemap.c	Mon Apr 23 20:55:23 2001
@@ -710,6 +710,43 @@
 }

 /*
+ * Find a swapcache page (and get a reference) or return NULL.
+ * The SwapCache check is protected by the pagecache lock.
+ */
+struct page * __find_get_swapcache_page(struct address_space *mapping,
+			unsigned long offset, struct page **hash)
+{
+	struct page *page;
+
+	/*
+	 * We need the LRU lock to protect against page_launder().
+	 */
+repeat:
+	spin_lock(&pagecache_lock);
+	page = __find_page_nolock(mapping, offset, *hash);
+	if (page) {
+		spin_lock(&pagemap_lru_lock);
+		if (PageSwapCache(page)) {
+			page_cache_get(page);
+			if (!Page_Uptodate(page) && PageLocked(page)) {
+				spin_unlock(&pagemap_lru_lock);
+				spin_unlock(&pagecache_lock);
+				___wait_on_page(page);
+				page_cache_release(page);
+				goto repeat;
+			}
+			if (!Page_Uptodate(page))
+				printk("hm: page not uptodate in __find_get_swapcache_page()? page flags: %08lx\n", page->flags);
+		} else
+			page = NULL;
+		spin_unlock(&pagemap_lru_lock);
+	}
+	spin_unlock(&pagecache_lock);
+
+	return page;
+}
+
+/*
  * Same as the above, but lock the page too, verifying that
  * it's still valid once we own it.
  */
--- linux/mm/swap_state.c.orig	Mon Apr 23 13:35:06 2001
+++ linux/mm/swap_state.c	Mon Apr 23 13:37:01 2001
@@ -163,37 +163,18 @@
 	/*
	 * Right now the pagecache is 32-bit only. But it's a 32 bit index. =)
	 */
-repeat:
-	found = find_lock_page(&swapper_space, entry.val);
+	found = find_get_swapcache_page(&swapper_space, entry.val);
 	if (!found)
 		return 0;
-	/*
-	 * Though the "found" page was in the swap cache an instant
-	 * earlier, it might have been removed by refill_inactive etc.
-	 * Re search ... Since find_lock_page grabs a reference on
-	 * the page, it can not be reused for anything else, namely
-	 * it can not be associated with another swaphandle, so it
-	 * is enough to check whether the page is still in the scache.
-	 */
-	if (!PageSwapCache(found)) {
-		UnlockPage(found);
-		page_cache_release(found);
-		goto repeat;
-	}
+	if (!PageSwapCache(found))
+		BUG();
 	if (found->mapping != &swapper_space)
-		goto out_bad;
+		BUG();
 #ifdef SWAP_CACHE_INFO
 	swap_cache_find_success++;
 #endif
-	UnlockPage(found);
 	return found;
 }
-
-out_bad:
-	printk (KERN_ERR "VM: Found a non-swapper swap page!\n");
-	UnlockPage(found);
-	page_cache_release(found);
-	return 0;
 }

 /*
--- linux/include/linux/pagemap.h.orig	Mon Apr 23 13:35:05 2001
+++ linux/include/linux/pagemap.h	Mon Apr 23 13:37:01 2001
@@ -77,7 +77,12 @@
		unsigned long index, struct page **hash);
 extern void lock_page(struct page *page);
 #define find_lock_page(mapping, index) \
-	__find_lock_page(mapping, index, page_hash(mapping, index))
+	__find_lock_page(mapping, index, page_hash(mapping, index))
+
+extern struct page * __find_get_swapcache_page (struct address_space * mapping,
+			unsigned long index, struct page **hash);
+#define find_get_swapcache_page(mapping, index) \
+	__find_get_swapcache_page(mapping, index, page_hash(mapping, index))

 extern void __add_page_to_hash_queue(struct page * page, struct page **p);
* Re: [patch] swap-speedup-2.4.3-A2
2001-04-23 16:05 ` [patch] swap-speedup-2.4.3-A2 Ingo Molnar
@ 2001-04-23 17:17 ` Linus Torvalds
2001-04-23 16:54 ` Ingo Molnar
1 sibling, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2001-04-23 17:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rik van Riel, Alan Cox, Linux Kernel List, linux-mm,
Marcelo Tosatti, Szabolcs Szakacsits

On Mon, 23 Apr 2001, Ingo Molnar wrote:
>
> you are right - i thought about that issue too and assumed it works like
> the pagecache (which first reads the page without hashing it, then tries
> to add the result to the pagecache and throws away the page if anyone else
> finished it already), but that was incorrect.

The above is NOT how the page cache works. Or if some part of the page
cache works that way, then it is a BUG. You must NEVER allow multiple
outstanding reads from the same location - that implies that you're
doing something wrong, and the system is doing too much IO.

The way _all_ parts of the page cache should work is:

 Create new page:
 - look up page. If found, return it
 - allocate new page.
 - look up page again, in case somebody else added it while we
   allocated it.
 - add the page atomically with the lookup if the lookup failed,
   otherwise just free the page without doing anything.
 - return the looked-up / allocated page.

 Return up-to-date page:
 - call the above to get a page cache page.
 - if uptodate, return
 - lock_page()
 - if now uptodate (ie somebody else filled it and held the lock),
   unlock and return.
 - start the IO
 - wait on IO by waiting on the page (modulo other work that you could
   do in the background).
 - if the page is still not up-to-date after we tried to read it, we
   got an IO error. Return error.

The above is how it is always meant to work. The above works for both
new allocations and for old. It works even if an earlier read had failed
(due to wrong permissions for example - think about NFS page caches
where some people may be unable to actually fill a page, so that you
need to re-try on failure).

The above is how the regular read/write paths work (modulo bugs). And
it's also how the swap cache should work.

		Linus
* Re: [patch] swap-speedup-2.4.3-A2
2001-04-23 17:17 ` Linus Torvalds
@ 2001-04-23 16:54 ` Ingo Molnar
0 siblings, 0 replies; 11+ messages in thread
From: Ingo Molnar @ 2001-04-23 16:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rik van Riel, Alan Cox, Linux Kernel List, linux-mm,
Marcelo Tosatti, Szabolcs Szakacsits

On Mon, 23 Apr 2001, Linus Torvalds wrote:

> The above is NOT how the page cache works. Or if some part of the page
> cache works that way, then it is a BUG. You must NEVER allow multiple
> outstanding reads from the same location - that implies that you're
> doing something wrong, and the system is doing too much IO.

you are right, the pagecache does it correctly, i've just re-checked
all places.

	Ingo
* [patch] swap-speedup-2.4.3-B3
2001-04-23 16:05 ` [patch] swap-speedup-2.4.3-A2 Ingo Molnar
2001-04-23 17:17 ` Linus Torvalds
@ 2001-04-24 5:44 ` Ingo Molnar
2001-04-24 16:38 ` Linus Torvalds
1 sibling, 1 reply; 11+ messages in thread
From: Ingo Molnar @ 2001-04-24 5:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Linux Kernel List, linux-mm, Marcelo Tosatti,
Rik van Riel, Szabolcs Szakacsits

the latest swap-speedup patch can be found at:

	http://people.redhat.com/mingo/swap-speedup/swap-speedup-2.4.3-B3

(the patch is against 2.4.4-pre6 or 2.4.3-ac13.)

-B3 includes Marcelo's patch for another area that blocks unnecessarily
on locked swapcache pages: async swapcache readahead. Marcelo did some
tests which show that this fix brought some nice improvements too.

"make -j32 bzImage" using 128MB mem, 128MB swap, 4 CPUs:

stock 2.4.3-ac13
----------------
real	4m0.678s
user	4m2.870s
sys	0m38.920s

swap-speedup-A2
---------------
real	3m24.190s
user	4m1.070s
sys	0m31.950s

swap-speedup-B3 (A2 + Marcelo's swapin-readahead non-blocking patch)
---------------
real	3m7.410s
user	4m0.940s
sys	0m28.680s

ie. for this kernel compile test:

	swap-speedup-A2 is an 18% speedup relative to stock 2.4.3-ac13
	swap-speedup-B3 is a 28% speedup relative to stock 2.4.3-ac13

and the amount of CPU time spent in the kernel has been reduced
significantly as well. I believe all the correctness and SMP-locking
issues have been taken care of in -B3 as well.

	Ingo
* Re: [patch] swap-speedup-2.4.3-B3
2001-04-24 5:44 ` [patch] swap-speedup-2.4.3-B3 Ingo Molnar
@ 2001-04-24 16:38 ` Linus Torvalds
2001-04-25 2:28 ` Marcelo Tosatti
0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2001-04-24 16:38 UTC (permalink / raw)
To: Ingo Molnar
Cc: Alan Cox, Linux Kernel List, linux-mm, Marcelo Tosatti,
Rik van Riel, Szabolcs Szakacsits

On Tue, 24 Apr 2001, Ingo Molnar wrote:
>
> the latest swap-speedup patch can be found at:

Please don't add more of those horrible "wait" arguments. Make two
different versions of a function instead. It's going to clean up and
simplify the code, and there really isn't any reason to do what you're
doing.

You should split up the logic differently: if you want to wait for the
page, then DO so:

	page = lookup_swap_cache(..);
	if (page) {
		wait_for_swap_cache_valid(page);
		.. use page ..
	}

Note how much more readable and UNDERSTANDABLE the above is, compared to

	page = lookup_swap_cache(..., 1);
	if (page) {
		...

and note also how splitting up the waiting will

 - simplify the swap cache lookup function, making it faster for
   people who do _NOT_ want to wait.
 - make it easier to statically check the correctness of programs by
   just eye-balling them ("Hey, he's calling 'wait' with the spinlock
   held").
 - make it easier to move the wait around, allowing for more
   concurrency.

Basically, I don't want to mix synchronous and asynchronous interfaces.
Everything should be asynchronous by default, and waiting should be
explicit.

		Linus
* Re: [patch] swap-speedup-2.4.3-B3
2001-04-24 16:38 ` Linus Torvalds
@ 2001-04-25 2:28 ` Marcelo Tosatti
0 siblings, 0 replies; 11+ messages in thread
From: Marcelo Tosatti @ 2001-04-25 2:28 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Alan Cox, Linux Kernel List, linux-mm,
Rik van Riel, Szabolcs Szakacsits

On Tue, 24 Apr 2001, Linus Torvalds wrote:

> Basically, I don't want to mix synchronous and asynchronous
> interfaces. Everything should be asynchronous by default, and waiting
> should be explicit.

The following patch changes all swap IO functions to be asynchronous by
default and changes the callers to wait when needed (except
rw_swap_page_base, which will block writers if there are too many
in-flight swap IOs).

Ingo's find_get_swapcache_page() does not wait on locked pages anymore,
which is now up to the callers.

time make -j32 test with a 4 CPU machine, 128M RAM and 128M swap:

pre5
----
228.04user 28.14system 5:16.52elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (525113major+678617minor)pagefaults 0swaps

pre5 + attached patch
---------------------
227.18user 25.49system 3:40.53elapsed 114%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (495387major+644924minor)pagefaults 0swaps

Comments?

diff -Nur linux.orig/include/linux/pagemap.h linux/include/linux/pagemap.h
--- linux.orig/include/linux/pagemap.h	Wed Apr 25 00:51:36 2001
+++ linux/include/linux/pagemap.h	Wed Apr 25 00:53:04 2001
@@ -77,7 +77,12 @@
		unsigned long index, struct page **hash);
 extern void lock_page(struct page *page);
 #define find_lock_page(mapping, index) \
-	__find_lock_page(mapping, index, page_hash(mapping, index))
+	__find_lock_page(mapping, index, page_hash(mapping, index))
+
+extern struct page * __find_get_swapcache_page (struct address_space * mapping,
+			unsigned long index, struct page **hash);
+#define find_get_swapcache_page(mapping, index) \
+	__find_get_swapcache_page(mapping, index, page_hash(mapping, index))

 extern void __add_page_to_hash_queue(struct page * page, struct page **p);
diff -Nur linux.orig/include/linux/swap.h linux/include/linux/swap.h
--- linux.orig/include/linux/swap.h	Wed Apr 25 00:51:36 2001
+++ linux/include/linux/swap.h	Wed Apr 25 00:53:04 2001
@@ -111,8 +111,8 @@
 extern int try_to_free_pages(unsigned int gfp_mask);

 /* linux/mm/page_io.c */
-extern void rw_swap_page(int, struct page *, int);
-extern void rw_swap_page_nolock(int, swp_entry_t, char *, int);
+extern void rw_swap_page(int, struct page *);
+extern void rw_swap_page_nolock(int, swp_entry_t, char *);

 /* linux/mm/page_alloc.c */
@@ -121,8 +121,7 @@
 extern void add_to_swap_cache(struct page *, swp_entry_t);
 extern int swap_check_entry(unsigned long);
 extern struct page * lookup_swap_cache(swp_entry_t);
-extern struct page * read_swap_cache_async(swp_entry_t, int);
-#define read_swap_cache(entry) read_swap_cache_async(entry, 1);
+extern struct page * read_swap_cache_async(swp_entry_t);

 /* linux/mm/oom_kill.c */
 extern int out_of_memory(void);
diff -Nur linux.orig/mm/filemap.c linux/mm/filemap.c
--- linux.orig/mm/filemap.c	Wed Apr 25 00:51:35 2001
+++ linux/mm/filemap.c	Wed Apr 25 00:53:04 2001
@@ -678,6 +678,34 @@
 }

 /*
+ * Find a swapcache page (and get a reference) or return NULL.
+ * The SwapCache check is protected by the pagecache lock.
+ */
+struct page * __find_get_swapcache_page(struct address_space *mapping,
+			unsigned long offset, struct page **hash)
+{
+	struct page *page;
+
+	/*
+	 * We need the LRU lock to protect against page_launder().
+	 */
+
+	spin_lock(&pagecache_lock);
+	page = __find_page_nolock(mapping, offset, *hash);
+	if (page) {
+		spin_lock(&pagemap_lru_lock);
+		if (PageSwapCache(page))
+			page_cache_get(page);
+		else
+			page = NULL;
+		spin_unlock(&pagemap_lru_lock);
+	}
+	spin_unlock(&pagecache_lock);
+
+	return page;
+}
+
+/*
  * Same as the above, but lock the page too, verifying that
  * it's still valid once we own it.
  */
diff -Nur linux.orig/mm/memory.c linux/mm/memory.c
--- linux.orig/mm/memory.c	Wed Apr 25 00:51:35 2001
+++ linux/mm/memory.c	Wed Apr 25 00:53:04 2001
@@ -1040,7 +1040,7 @@
			break;
		}
		/* Ok, do the async read-ahead now */
-		new_page = read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry), offset), 0);
+		new_page = read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry), offset));
		if (new_page != NULL)
			page_cache_release(new_page);
		swap_free(SWP_ENTRY(SWP_TYPE(entry), offset));
@@ -1063,13 +1063,13 @@
	if (!page) {
		lock_kernel();
		swapin_readahead(entry);
-		page = read_swap_cache(entry);
+		page = read_swap_cache_async(entry);
		unlock_kernel();
		if (!page) {
			spin_lock(&mm->page_table_lock);
			return -1;
		}
-
+		wait_on_page(page);
		flush_page_to_ram(page);
		flush_icache_page(vma, page);
	}
diff -Nur linux.orig/mm/page_io.c linux/mm/page_io.c
--- linux.orig/mm/page_io.c	Wed Apr 25 00:51:35 2001
+++ linux/mm/page_io.c	Wed Apr 25 00:53:04 2001
@@ -33,7 +33,7 @@
  * that shared pages stay shared while being swapped.
 */

-static int rw_swap_page_base(int rw, swp_entry_t entry, struct page *page, int wait)
+static int rw_swap_page_base(int rw, swp_entry_t entry, struct page *page)
 {
	unsigned long offset;
	int zones[PAGE_SIZE/512];
@@ -41,6 +41,7 @@
	kdev_t dev = 0;
	int block_size;
	struct inode *swapf = 0;
+	int wait = 0;

	/* Don't allow too many pending pages in flight.. */
	if ((rw == WRITE) && atomic_read(&nr_async_pages) >
@@ -104,7 +105,7 @@
 *  - it's marked as being swap-cache
 *  - it's associated with the swap inode
 */
-void rw_swap_page(int rw, struct page *page, int wait)
+void rw_swap_page(int rw, struct page *page)
 {
	swp_entry_t entry;

@@ -116,7 +117,7 @@
		PAGE_BUG(page);
	if (page->mapping != &swapper_space)
		PAGE_BUG(page);
-	if (!rw_swap_page_base(rw, entry, page, wait))
+	if (!rw_swap_page_base(rw, entry, page))
		UnlockPage(page);
 }

@@ -125,7 +126,7 @@
 * Therefore we can't use it.  Later when we can remove the need for the
 * lock map and we can reduce the number of functions exported.
 */
-void rw_swap_page_nolock(int rw, swp_entry_t entry, char *buf, int wait)
+void rw_swap_page_nolock(int rw, swp_entry_t entry, char *buf)
 {
	struct page *page = virt_to_page(buf);

@@ -137,7 +138,8 @@
		PAGE_BUG(page);
	/* needs sync_page to wait I/O completation */
	page->mapping = &swapper_space;
-	if (!rw_swap_page_base(rw, entry, page, wait))
+	if (!rw_swap_page_base(rw, entry, page))
		UnlockPage(page);
+	wait_on_page(page);
	page->mapping = NULL;
 }
diff -Nur linux.orig/mm/shmem.c linux/mm/shmem.c
--- linux.orig/mm/shmem.c	Wed Apr 25 00:51:35 2001
+++ linux/mm/shmem.c	Wed Apr 25 00:53:04 2001
@@ -124,6 +124,7 @@
		*ptr = (swp_entry_t){0};
		freed++;
		if ((page = lookup_swap_cache(entry)) != NULL) {
+			wait_on_page(page);
			delete_from_swap_cache(page);
			page_cache_release(page);
		}
@@ -329,10 +330,11 @@
	spin_unlock (&info->lock);
	lock_kernel();
	swapin_readahead(*entry);
-	page = read_swap_cache(*entry);
+	page = read_swap_cache_async(*entry);
	unlock_kernel();
	if (!page)
		return ERR_PTR(-ENOMEM);
+	wait_on_page(page);
	if (!Page_Uptodate(page)) {
		page_cache_release(page);
		return ERR_PTR(-EIO);
diff -Nur linux.orig/mm/swap_state.c linux/mm/swap_state.c
--- linux.orig/mm/swap_state.c	Wed Apr 25 00:51:35 2001
+++ linux/mm/swap_state.c	Wed Apr 25 00:53:04 2001
@@ -34,7 +34,7 @@
		return 0;

 in_use:
-	rw_swap_page(WRITE, page, 0);
+	rw_swap_page(WRITE, page);
	return 0;
 }

@@ -163,37 +163,18 @@
	/*
	 * Right now the pagecache is 32-bit only. But it's a 32 bit index. =)
	 */
-repeat:
-	found = find_lock_page(&swapper_space, entry.val);
+	found = find_get_swapcache_page(&swapper_space, entry.val);
	if (!found)
		return 0;
-	/*
-	 * Though the "found" page was in the swap cache an instant
-	 * earlier, it might have been removed by refill_inactive etc.
-	 * Re search ... Since find_lock_page grabs a reference on
-	 * the page, it can not be reused for anything else, namely
-	 * it can not be associated with another swaphandle, so it
-	 * is enough to check whether the page is still in the scache.
-	 */
-	if (!PageSwapCache(found)) {
-		UnlockPage(found);
-		page_cache_release(found);
-		goto repeat;
-	}
+	if (!PageSwapCache(found))
+		BUG();
	if (found->mapping != &swapper_space)
-		goto out_bad;
+		BUG();
 #ifdef SWAP_CACHE_INFO
	swap_cache_find_success++;
 #endif
-	UnlockPage(found);
	return found;
 }
-
-out_bad:
-	printk (KERN_ERR "VM: Found a non-swapper swap page!\n");
-	UnlockPage(found);
-	page_cache_release(found);
-	return 0;
 }

 /*
@@ -205,7 +186,7 @@
 * the swap entry is no longer in use.
 */

-struct page * read_swap_cache_async(swp_entry_t entry, int wait)
+struct page * read_swap_cache_async(swp_entry_t entry)
 {
	struct page *found_page = 0, *new_page;
	unsigned long new_page_addr;
@@ -238,7 +219,7 @@
	 */
	lock_page(new_page);
	add_to_swap_cache(new_page, entry);
-	rw_swap_page(READ, new_page, wait);
+	rw_swap_page(READ, new_page);
	return new_page;

 out_free_page:
diff -Nur linux.orig/mm/swapfile.c linux/mm/swapfile.c
--- linux.orig/mm/swapfile.c	Wed Apr 25 00:51:35 2001
+++ linux/mm/swapfile.c	Wed Apr 25 00:53:24 2001
@@ -369,13 +369,15 @@
	/* Get a page for the entry, using the existing
	   swap cache page if there is one.  Otherwise,
	   get a clean page and read the swap into it. */
-	page = read_swap_cache(entry);
+	page = read_swap_cache_async(entry);
	if (!page) {
		swap_free(entry);
		return -ENOMEM;
	}
+	lock_page(page);
	if (PageSwapCache(page))
-		delete_from_swap_cache(page);
+		delete_from_swap_cache_nolock(page);
+	UnlockPage(page);
	read_lock(&tasklist_lock);
	for_each_task(p)
		unuse_process(p->mm, entry, page);
@@ -650,7 +652,7 @@
	}

	lock_page(virt_to_page(swap_header));
-	rw_swap_page_nolock(READ, SWP_ENTRY(type,0), (char *) swap_header, 1);
+	rw_swap_page_nolock(READ, SWP_ENTRY(type,0), (char *) swap_header);

	if (!memcmp("SWAP-SPACE",swap_header->magic.magic,10))
		swap_header_version = 1;
* Re: [patch] swap-speedup-2.4.3-A1, massive swapping speedup
  2001-04-23 15:33 ` Rik van Riel
  2001-04-23 16:05   ` [patch] swap-speedup-2.4.3-A2 Ingo Molnar
@ 2001-04-23 16:53   ` Jonathan Morton
  2001-04-23 17:10     ` Linus Torvalds
  1 sibling, 1 reply; 11+ messages in thread
From: Jonathan Morton @ 2001-04-23 16:53 UTC (permalink / raw)
To: Rik van Riel, Ingo Molnar
Cc: Linus Torvalds, Alan Cox, Linux Kernel List, linux-mm,
	Marcelo Tosatti, Szabolcs Szakacsits

>There seems to be one more reason, take a look at the function
>read_swap_cache_async() in swap_state.c, around line 240:
>
>        /*
>         * Add it to the swap cache and read its contents.
>         */
>        lock_page(new_page);
>        add_to_swap_cache(new_page, entry);
>        rw_swap_page(READ, new_page, wait);
>        return new_page;
>
>Here we add an "empty" page to the swap cache and use the
>page lock to protect people from reading this non-up-to-date
>page.

How about reversing the order of the calls - ie. add the page to the
cache only when it's been filled?  That would fix the race.

--------------------------------------------------------------
from:     Jonathan "Chromatix" Morton
mail:     chromi@cyberspace.org  (not for attachments)
big-mail: chromatix@penguinpowered.com
uni-mail: j.d.morton@lancaster.ac.uk

The key to knowledge is not to rely on people to teach you it.

Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/

-----BEGIN GEEK CODE BLOCK-----
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS
PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r++ y+(*)
-----END GEEK CODE BLOCK-----
* Re: [patch] swap-speedup-2.4.3-A1, massive swapping speedup
  2001-04-23 16:53 ` [patch] swap-speedup-2.4.3-A1, massive swapping speedup Jonathan Morton
@ 2001-04-23 17:10   ` Linus Torvalds
  2001-04-23 22:13     ` Marcelo Tosatti
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2001-04-23 17:10 UTC (permalink / raw)
To: Jonathan Morton
Cc: Rik van Riel, Ingo Molnar, Alan Cox, Linux Kernel List, linux-mm,
	Marcelo Tosatti, Szabolcs Szakacsits

On Mon, 23 Apr 2001, Jonathan Morton wrote:
> >There seems to be one more reason, take a look at the function
> >read_swap_cache_async() in swap_state.c, around line 240:
> >
> >        /*
> >         * Add it to the swap cache and read its contents.
> >         */
> >        lock_page(new_page);
> >        add_to_swap_cache(new_page, entry);
> >        rw_swap_page(READ, new_page, wait);
> >        return new_page;
> >
> >Here we add an "empty" page to the swap cache and use the
> >page lock to protect people from reading this non-up-to-date
> >page.
>
> How about reversing the order of the calls - ie. add the page to the
> cache only when it's been filled?  That would fix the race.

No. The page cache is used as the IO synchronization point, both for
swapping and for regular IO. You have to add the page in _first_, because
otherwise you may end up doing multiple IO's from different pages.

The proper fix is to do the equivalent of this on all the lookup paths
that want a valid page:

	if (!PageUptodate(page)) {
		lock_page(page);
		if (PageUptodate(page)) {
			unlock_page(page);
			return 0;
		}
		rw_swap_page(page, 0);
		wait_on_page(page);
		if (!PageUptodate(page))
			return -EIO;
	}
	return 0;

This is the whole point of the "uptodate" flag, and for all I know we may
already do all of this (it's certainly the normal setup).

Note how we do NOT block on write-backs in the above: the page will be
up-to-date from the very beginning (it had _better_ be, it's hard to write
back a swap page that isn't up-to-date ;).

The above is how all the file paths work.

		Linus
* Re: [patch] swap-speedup-2.4.3-A1, massive swapping speedup
  2001-04-23 17:10 ` Linus Torvalds
@ 2001-04-23 22:13   ` Marcelo Tosatti
  0 siblings, 0 replies; 11+ messages in thread
From: Marcelo Tosatti @ 2001-04-23 22:13 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jonathan Morton, Rik van Riel, Ingo Molnar, Alan Cox,
	Linux Kernel List, linux-mm, Szabolcs Szakacsits

On Mon, 23 Apr 2001, Linus Torvalds wrote:

> On Mon, 23 Apr 2001, Jonathan Morton wrote:
> > >There seems to be one more reason, take a look at the function
> > >read_swap_cache_async() in swap_state.c, around line 240:
> > >
> > >        /*
> > >         * Add it to the swap cache and read its contents.
> > >         */
> > >        lock_page(new_page);
> > >        add_to_swap_cache(new_page, entry);
> > >        rw_swap_page(READ, new_page, wait);
> > >        return new_page;
> > >
> > >Here we add an "empty" page to the swap cache and use the
> > >page lock to protect people from reading this non-up-to-date
> > >page.
> >
> > How about reversing the order of the calls - ie. add the page to the
> > cache only when it's been filled?  That would fix the race.
>
> No. The page cache is used as the IO synchronization point, both for
> swapping and for regular IO. You have to add the page in _first_, because
> otherwise you may end up doing multiple IO's from different pages.
>
> The proper fix is to do the equivalent of this on all the lookup paths
> that want a valid page:
>
> 	if (!PageUptodate(page)) {
> 		lock_page(page);
> 		if (PageUptodate(page)) {
> 			unlock_page(page);
> 			return 0;
> 		}
> 		rw_swap_page(page, 0);
> 		wait_on_page(page);
> 		if (!PageUptodate(page))
> 			return -EIO;
> 	}
> 	return 0;
>
> This is the whole point of the "uptodate" flag, and for all I know we may
> already do all of this (it's certainly the normal setup).
>
> Note how we do NOT block on write-backs in the above: the page will be
> up-to-date from the very beginning (it had _better_ be, it's hard to write
> back a swap page that isn't up-to-date ;).
>
> The above is how all the file paths work.

Please don't forget about swapin readahead.

If the page is not up to date and we are doing swapin readahead on it, we
want to fail if the page is already locked (which means it's already under
IO).
end of thread, other threads:[~2001-04-25  2:28 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-04-23  9:20 [patch] swap-speedup-2.4.3-A1, massive swapping speedup Ingo Molnar
2001-04-23 15:33 ` Rik van Riel
2001-04-23 16:05   ` [patch] swap-speedup-2.4.3-A2 Ingo Molnar
2001-04-23 17:17     ` Linus Torvalds
2001-04-23 16:54       ` Ingo Molnar
2001-04-24  5:44         ` [patch] swap-speedup-2.4.3-B3 Ingo Molnar
2001-04-24 16:38           ` Linus Torvalds
2001-04-25  2:28             ` Marcelo Tosatti
2001-04-23 16:53 ` [patch] swap-speedup-2.4.3-A1, massive swapping speedup Jonathan Morton
2001-04-23 17:10   ` Linus Torvalds
2001-04-23 22:13     ` Marcelo Tosatti