* shrink_mmap SMP race fix
From: Andrea Arcangeli @ 2000-03-30 15:16 UTC
To: Linus Torvalds; +Cc: linux-kernel, linux-mm, MOLNAR Ingo
This patch fixes a race in shrink_mmap. In shrink_mmap, between the
UnlockPage and the spin_lock(&pagemap_lru_lock) we weren't holding any lock,
so truncate_inode_pages or invalidate_inode_pages could race with us before
we had time to reinsert the page into the local temporary lru list.
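To make the window explicit, this is roughly the interleaving I mean (a
sketch, not literal code; remember the page was already taken off the lru
list at the top of the shrink_mmap loop):

	CPU0 (shrink_mmap)                   CPU1 (truncate_inode_pages)
	------------------                   ---------------------------
	UnlockPage(page);
	                                     lock_page(page);
	                                     lru_cache_del(page);  /* list_del on a page
	                                                              that is on no list */
	spin_lock(&pagemap_lru_lock);
	list_add(page_lru, dispose);         /* reinserts a page the other CPU
	                                        believes it just removed */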
The fix is to acquire the pagemap_lru_lock while the page is still locked.
We don't get into lock-inversion (deadlock) trouble there, since we acquired
the per-page lock in the reverse order in the first place using the trylock
method. Once we hold the pagemap_lru_lock we can immediately release the
per-page lock, because the only thing left to do is reinsert the page into an
lru list and we don't need the per-page lock for that anymore. So somebody
can lock down the page and start using it while we are inserting it into the
lru from shrink_mmap, without races (fun :).
I'll keep thinking about it; right now it looks ok and it runs stably
here under SMP swap load.
--- 2.3.99-pre3aa1-alpha/mm/filemap.c.~1~ Mon Mar 27 22:44:50 2000
+++ 2.3.99-pre3aa1-alpha/mm/filemap.c Thu Mar 30 16:07:20 2000
@@ -329,20 +329,17 @@
cache_unlock_continue:
spin_unlock(&pagecache_lock);
unlock_continue:
+ spin_lock(&pagemap_lru_lock);
UnlockPage(page);
put_page(page);
-dispose_relock_continue:
- /* even if the dispose list is local, a truncate_inode_page()
- may remove a page from its queue so always
- synchronize with the lru lock while accesing the
- page->lru field */
- spin_lock(&pagemap_lru_lock);
list_add(page_lru, dispose);
continue;
unlock_noput_continue:
+ spin_lock(&pagemap_lru_lock);
UnlockPage(page);
- goto dispose_relock_continue;
+ list_add(page_lru, dispose);
+ continue;
dispose_continue:
list_add(page_lru, dispose);
This additional debugging patch makes sure we don't repeat mistakes like the
one we had in invalidate_inode_pages (lru_cache_del() must _always_ be run
with the per-page lock held, to avoid removing from the lru list a page that
is in the middle of shrink_mmap processing).
--- 2.3.99-pre3aa1-alpha/include/linux/swap.h.~1~ Wed Mar 29 18:16:18 2000
+++ 2.3.99-pre3aa1-alpha/include/linux/swap.h Thu Mar 30 16:41:45 2000
@@ -173,6 +173,8 @@
#define lru_cache_del(page) \
do { \
+ if (!PageLocked(page)) \
+ BUG(); \
spin_lock(&pagemap_lru_lock); \
list_del(&(page)->lru); \
nr_lru_pages--; \
This third patch removes a path that makes no sense to me. If you have an
explanation for it, it's very welcome. The page aging happens much earlier,
not just before this place. I don't see the connection between the priority
and a fixed level of lru cache. If anything, the higher the priority, the
harder we should shrink the cache (the opposite of what the check achieves).
Usually priority is zero anyway and the check below has no effect.

Also, since it's relative to the LRU cache, if anything it should be done
_before_ we start looking into the page and before clearing reference bits
all over the place (before the aging!), and it should break the loop instead
of wasting CPU: there is no way nr_lru_pages can increase within shrink_mmap,
so once the check starts failing it will keep failing, wasting CPU for no
good reason.
I also dislike the pgcache_under_min() thing, but at least that one makes
sense, is tunable via sysctl, and there's a good reason for not breaking the
loop there.
--- 2.3.99-pre3aa1-alpha/mm/filemap.c.~1~ Thu Mar 30 16:07:20 2000
+++ 2.3.99-pre3aa1-alpha/mm/filemap.c Thu Mar 30 16:10:38 2000
@@ -294,12 +294,6 @@
goto cache_unlock_continue;
/*
- * We did the page aging part.
- */
- if (nr_lru_pages < freepages.min * priority)
- goto cache_unlock_continue;
-
- /*
* Is it a page swap page? If so, we want to
* drop it if it is no longer used, even if it
* were to be marked referenced..
I have completely autotuning algorithms (they were in the 2.2.x-andrea
patches somewhere on ftp.suse.com; many benchmarks were also posted on l-k at
the time). They don't add anything fixed like the above, and I strongly
believe responsiveness under swap will be amazing as soon as I port them to
the new kernels. The only problem is that with such algorithms there will be
new flavours of SMP races in all the map/unmap paths, and depending on the
implementation struct page may waste a further long word (there were no such
SMP issues in 2.2.x, for obvious reasons...), so it's probably more 2.5.x
material now. Comments are welcome of course.
Andrea
* Re: shrink_mmap SMP race fix
From: Andrea Arcangeli @ 2000-03-30 15:43 UTC
To: Linus Torvalds; +Cc: linux-kernel, linux-mm, MOLNAR Ingo
On Thu, 30 Mar 2000, Andrea Arcangeli wrote:
>[..] If anything, the higher the priority, the
>harder we should shrink the cache (the opposite of what the check
>achieves). Usually priority is zero anyway and the check below has no
>effect. [..]
Thinko: I noticed I was wrong about this, apologies (priority starts at 6,
and 0 is the most severe one).

Anyway, I still dislike that path for all the other reasons. It should at
_least_ be done outside the loop, before calculating `count`.
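Something along these lines at the top of shrink_mmap(), before `count` is
computed (just a sketch, not a tested patch):

	/* bail out once instead of re-checking this for every page in the scan */
	if (nr_lru_pages < freepages.min * priority)
		return 0;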
Andrea
* Re: shrink_mmap SMP race fix
From: Andrea Arcangeli @ 2000-03-30 16:50 UTC
To: Linus Torvalds; +Cc: linux-kernel, linux-mm, MOLNAR Ingo
I also did two incremental locking changes (for performance). The old code
was doing:
if (TryLockPage(page))
goto dispose_continue;
/* Release the pagemap_lru lock even if the page is not yet
queued in any lru queue since we have just locked down
the page so nobody else may SMP race with us running
a lru_cache_del() (lru_cache_del() always run with the
page locked down ;). */
spin_unlock(&pagemap_lru_lock);
/* avoid unscalable SMP locking */
if (!page->buffers && page_count(page) > 1)
goto unlock_noput_continue;
We were doing the heuristic check that avoids unscalable locking in the tight
loop too late (after having already played with the pagemap_lru_lock and the
per-page lock). I moved the check inside the pagemap_lru_lock critical
section, so that now no other lock gets touched in the tight loop (except for
the reference bit).

Then the following section of code was doing:
/* Take the pagecache_lock spinlock held to avoid
other tasks to notice the page while we are looking at its
page count. If it's a pagecache-page we'll free it
in one atomic transaction after checking its page count. */
spin_lock(&pagecache_lock);
/* avoid freeing the page while it's locked */
get_page(page);
/* Is it a buffer page? */
if (page->buffers) {
spin_unlock(&pagecache_lock);
if (!try_to_free_buffers(page))
goto unlock_continue;
/* page was locked, inode can't go away under us */
if (!page->mapping) {
atomic_dec(&buffermem_pages);
goto made_buffer_progress;
}
spin_lock(&pagecache_lock);
}
and as far as I can tell we can instead move the spin_lock(&pagecache_lock)
to _after_ the page->buffers path, and increase the per-page counter
(get_page(page)) without any lock acquired, removing an unnecessary
lock/unlock in the buffer freeing path. We are allowed to check page->buffers
with only the per-page lock held (nobody can drop page->buffers from under us
while we hold the lock on the page).
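In other words the per-page sequence in the loop becomes roughly the
following (a summary sketch of the code after both changes, not the literal
source):

	/* still under pagemap_lru_lock: cheap heuristic first, then trylock */
	if (!page->buffers && page_count(page) > 1)
		goto dispose_continue;
	if (TryLockPage(page))
		goto dispose_continue;
	spin_unlock(&pagemap_lru_lock);

	get_page(page);			/* no pagecache_lock needed for this */

	if (page->buffers) {		/* the page lock pins page->buffers */
		if (!try_to_free_buffers(page))
			goto unlock_continue;
		if (!page->mapping) {
			atomic_dec(&buffermem_pages);
			goto made_buffer_progress;
		}
	}

	spin_lock(&pagecache_lock);	/* only now, for the pagecache checks */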
This is the incremental cleanup:
--- 2.3.99-pre3aa1-alpha/mm/filemap.c.~1~ Thu Mar 30 16:10:38 2000
+++ 2.3.99-pre3aa1-alpha/mm/filemap.c Thu Mar 30 18:24:29 2000
@@ -250,6 +250,11 @@
count--;
dispose = &young;
+
+ /* avoid unscalable SMP locking */
+ if (!page->buffers && page_count(page) > 1)
+ goto dispose_continue;
+
if (TryLockPage(page))
goto dispose_continue;
@@ -260,22 +265,11 @@
page locked down ;). */
spin_unlock(&pagemap_lru_lock);
- /* avoid unscalable SMP locking */
- if (!page->buffers && page_count(page) > 1)
- goto unlock_noput_continue;
-
- /* Take the pagecache_lock spinlock held to avoid
- other tasks to notice the page while we are looking at its
- page count. If it's a pagecache-page we'll free it
- in one atomic transaction after checking its page count. */
- spin_lock(&pagecache_lock);
-
/* avoid freeing the page while it's locked */
get_page(page);
/* Is it a buffer page? */
if (page->buffers) {
- spin_unlock(&pagecache_lock);
if (!try_to_free_buffers(page))
goto unlock_continue;
/* page was locked, inode can't go away under us */
@@ -283,9 +277,14 @@
atomic_dec(&buffermem_pages);
goto made_buffer_progress;
}
- spin_lock(&pagecache_lock);
}
+ /* Take the pagecache_lock spinlock held to avoid
+ other tasks to notice the page while we are looking at its
+ page count. If it's a pagecache-page we'll free it
+ in one atomic transaction after checking its page count. */
+ spin_lock(&pagecache_lock);
+
/*
* We can't free pages unless there's just one user
* (count == 2 because we added one ourselves above).
@@ -326,12 +325,6 @@
spin_lock(&pagemap_lru_lock);
UnlockPage(page);
put_page(page);
- list_add(page_lru, dispose);
- continue;
-
-unlock_noput_continue:
- spin_lock(&pagemap_lru_lock);
- UnlockPage(page);
list_add(page_lru, dispose);
continue;
It's running without problems here so far.
Andrea
* Re: shrink_mmap SMP race fix
From: Rik van Riel @ 2000-03-30 17:14 UTC
To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel, linux-mm, MOLNAR Ingo
On Thu, 30 Mar 2000, Andrea Arcangeli wrote:
> This third patch removes a path that makes no sense to me. If
> you have an explanation for it, it's very welcome. The page aging
> happens much earlier, not just before this place.
Sorry, but if page aging happens elsewhere, why do we go through
the trouble of maintaining an LRU list in the first place?
The answer is that the one-bit "page aging" (NRU replacement) of
pages in the page tables simply isn't enough. I agree that the
current 'magic number' approach isn't ideal and I welcome you to
come up with something better.
> I don't see the connection between the priority and a fixed
> level of lru cache. If anything, the higher the priority, the
> harder we should shrink the cache (the opposite of what the
> check achieves). Usually priority is zero anyway and the check
> below has no effect.
The idea of this approach is that we need the LRU cache to do some
aging on pages we're about to free. We absolutely need this because
otherwise the system will be thrashing much earlier than needed.
Good page replacement simply is a must.
Since priority starts at 6 and only goes down in absolute
emergencies, this approach lets us keep a minimum number of
pages on which we can do better page aging when the system
isn't under too much stress.
The only big problem is that we seem to keep pages in the
LRU cache that are also mapped in processes. This makes us
do a lot of unneeded work on pages (though I hope I've
overlooked something and the LRU cache only contains unmapped
pages).
> I have completely autotuning algorithms (they were in
> the 2.2.x-andrea patches somewhere on ftp.suse.com; many
> benchmarks were also posted on l-k at the time). They don't add
> anything fixed like the above, and I strongly believe
> responsiveness under swap will be amazing as soon as I port
> them to the new kernels.
That would be great!
I also have some ideas about how to make the LRU cache better,
but the changes needed to get that right are IMHO a bit too big
to do now that we've started on 2.4 pre...
(my idea is very much like the 'free' list on BSD systems, where
we reclaim pages in FIFO order from the list, but give applications
the chance to reclaim them while they're on the list. The size of
the list is varied depending on the ratio between freed pages and
reclaimed pages)
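A very rough sketch of the adaptive sizing I have in mind (hypothetical
names, nothing that exists in the tree today): pages would go to the tail of
a FIFO "free" list instead of being freed immediately, a later fault could
still reclaim them from there, and the target size of the list would follow
the freed/reclaimed ratio:

	static unsigned long free_list_target(unsigned long cur_target,
					      unsigned long freed,
					      unsigned long reclaimed)
	{
		if (reclaimed > freed)
			/* pages were still wanted: keep them around longer */
			return cur_target + (cur_target >> 2);
		/* almost everything really got freed: the list can shrink */
		return cur_target - (cur_target >> 3);
	}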
kind regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
* Re: shrink_mmap SMP race fix
From: Rik van Riel @ 2000-03-30 17:17 UTC
To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel, linux-mm, MOLNAR Ingo
On Thu, 30 Mar 2000, Andrea Arcangeli wrote:
> On Thu, 30 Mar 2000, Andrea Arcangeli wrote:
>
> >[..] If anything, the higher the priority, the
> >harder we should shrink the cache (the opposite of what the check
> >achieves). Usually priority is zero anyway and the check below has no
> >effect. [..]
>
> Thinko: I noticed I was wrong about this, apologies (priority starts at 6,
> and 0 is the most severe one).
>
> Anyway, I still dislike that path for all the other reasons. It should
> at _least_ be done outside the loop, before calculating `count`.
True. The only reason this check is inside shrink_mmap() is that we
may want to do the page aging on the LRU pages anyway because of
page replacement reasons. It will be interesting to see if moving
it out of the loop has any adverse effects or not...
cheers,
Rik
* Re: shrink_mmap SMP race fix
From: Andrea Arcangeli @ 2000-03-30 18:41 UTC
To: riel; +Cc: Linus Torvalds, linux-kernel, linux-mm, MOLNAR Ingo
On Thu, 30 Mar 2000, Rik van Riel wrote:
>On Thu, 30 Mar 2000, Andrea Arcangeli wrote:
>> On Thu, 30 Mar 2000, Andrea Arcangeli wrote:
>>
>> >[..] If anything, the higher the priority, the
>> >harder we should shrink the cache (the opposite of what the check
>> >achieves). Usually priority is zero anyway and the check below has no
>> >effect. [..]
>>
>> Thinko: I noticed I was wrong about this, apologies (priority starts at 6,
>> and 0 is the most severe one).
>>
>> Anyway, I still dislike that path for all the other reasons. It should
>> at _least_ be done outside the loop, before calculating `count`.
>
>True. The only reason this check is inside shrink_mmap() is that we
>may want to do the page aging on the LRU pages anyway because of
>page replacement reasons. It will be interesting to see if moving
I don't think there is any valid reason to clear all the reference bits
while enforcing a minimal limit on the lru size.

The check is only a kind of barrier that tries to preserve some lru cache,
based on a hardwired value (taken from the freepages min level, which btw has
nothing to do with a minimal lru size), while the system is under swap. It
tries to forbid shrink_mmap from eating too many cache pages. This may have
the effect that the `ls` binary doesn't get thrown out of the cache while you
hit swap, or something like that, depending on how big `ls` is on your
system.
That's very similar to the pgcache_under_min() hack we already have, but that
one is a page-cache-only thing, and since it's not a global lru thing it
makes sense not to break the loop in that case (the check doesn't know
anything about the buffer cache). The main difference is that
pgcache_under_min doesn't care about the severity of the shrink_mmap pass
(it's not a function of `priority').

So I believe we already have one hack that tries to keep the `ls` binary from
being shrunk out of the cache, and we don't need a second one.
IMHO it's also better to replace pgcache_under_min with a global lru sysctl
where we can set a lower limit on the size of the lru cache as a percentage
of memory. And then we do:

	if (nr_lru_pages < lru_min)
		return 0;

before entering the loop.
In practice the above sysctl will act much like the current
pgcache_under_min, but it avoids wasting CPU power and it also won't penalize
the buffer cache for no good reason.
Andrea
* Re: shrink_mmap SMP race fix
From: Andrea Arcangeli @ 2000-03-30 18:57 UTC
To: riel; +Cc: Linus Torvalds, linux-kernel, linux-mm, MOLNAR Ingo
On Thu, 30 Mar 2000, Rik van Riel wrote:
>Sorry, but if page aging happens elsewhere, why do we go through
>the trouble of maintaining an LRU list in the first place?
We can use the LRU for the page aging at any time; I did that at first. But
to do that we have to roll the page-LRU at each page/swap/buffer cache hit,
and that's slow and not worth it. Setting a bit is much faster, rolling the
list becomes zero cost in shrink_mmap, and the current aging works fine as
far as I can tell.
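For reference, the two ways of recording a cache hit that I'm contrasting
look roughly like this (a sketch; `lru_cache` stands for whatever lru list
head the page lives on):

	/* what we do now: cheap, lock-free marking */
	set_bit(PG_referenced, &page->flags);

	/* the "perfect LRU" alternative: rotate on every hit,
	   which means taking the global lru lock each time */
	spin_lock(&pagemap_lru_lock);
	list_del(&page->lru);
	list_add(&page->lru, &lru_cache);
	spin_unlock(&pagemap_lru_lock);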
>The answer is that the one-bit "page aging" (NRU replacement) of
>pages in the page tables simply isn't enough. I agree that the
Actually it's better than NRU for the aging anyway, since for example newly
allocated pages are always added to the bottom of the LRU. Also, with the LRU
we avoid wasting time on non-cache pages, and if almost all the cache is
freeable shrink_mmap works in O(1) regardless of how much memory is allocated
in userspace or in non-cache things (that wasn't true in 2.2.x). Also, thanks
to the page-LRU, 2.3.x doesn't swap randomly the way 2.2.x does.
>The idea of this approach is that we need the LRU cache to do some
>aging on pages we're about to free. We absolutely need this because
>otherwise the system will be thrashing much earlier than needed.
>Good page replacement simply is a must.
I really don't think aging is the problem. If you want I can send you the
patch to replace the test_and_set_bit(PG_referenced) with a perfect and
costly roll of the lru list. That's an almost trivial patch. But I'm 99% sure
you'll get the same swap behaviour.
The _real_ problem is that we have to split the LRU into a page/buffer-cache
LRU and a swap-cache LRU, and then always try to shrink the swap-cache LRU
first. That is what we have to do for great swap behaviour IMHO. But there's
a problem: to do that we have to keep the mapped pages out of the LRU (at
least out of the swap-cache LRU), otherwise we'd have to pass over all the
unfreeable mapped swap-cache pages before we could shrink the page/buffer
cache, and that would have too high a complexity cost to accept (it would hit
us also once the memory pressure is over).

Shrinking the unused swap cache first is the way to go.
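Just to make the idea concrete, the shrink order I'm thinking of would look
roughly like this (all names hypothetical, a design sketch rather than code
that exists):

	/* free the polluting, unused swap cache first and touch the
	   page/buffer-cache lru only when the swap-cache lru has nothing
	   left to give */
	static int shrink_caches(int priority, int gfp_mask)
	{
		int freed;

		freed = shrink_lru(&swapcache_lru, priority, gfp_mask);
		if (!freed)
			freed = shrink_lru(&pagecache_lru, priority, gfp_mask);
		return freed;
	}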
>That would be great!
Do you think we should do that for 2.4.x? How is the current swap behaviour
with low memory? It doesn't feel bad to me while pushing 100 Mbyte to swap
with 2.3.99-pre4-pre1 plus the latest posted patches (but I have to admit I
don't hit swap while closing the linux-kernel folder anymore... ;).
Andrea
* Re: shrink_mmap SMP race fix
From: Rik van Riel @ 2000-03-30 19:34 UTC
To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel, linux-mm, MOLNAR Ingo
On Thu, 30 Mar 2000, Andrea Arcangeli wrote:
> On Thu, 30 Mar 2000, Rik van Riel wrote:
> >The idea of this approach is that we need the LRU cache to do some
> >aging on pages we're about to free. We absolutely need this because
> >otherwise the system will be thrashing much earlier than needed.
> >Good page replacement simply is a must.
>
> I really don't think aging is the problem. If you want I can
> send you the patch to replace the test_and_set_bit(PG_referenced)
> with a perfect and costly roll of the lru list. That's an almost
> trivial patch. But I'm 99% sure you'll get the same swap
> behaviour.
I'm sorry I didn't explain it clearly. Of course it doesn't matter
whether we do perfect LRU sorting or second-chance behaviour.
What matters is that pages should spend _enough time_ in the
LRU list to have a chance of being reclaimed by the original
application. If we maintain too small a list, pages don't get
enough of a chance to be reclaimed by their application and the
"extra aging" benefit is minimal.
> The _real_ problem is that we have to split the LRU in
> page/buffer-cache LRU and swap-cache LRU.
> Shrinking the unused swap cache first is the way to go.
NOOOOOO!! The only reason the current VM behaves decently
at all is that all pages are treated equally. We _need_ to reclaim
all pages in the same way and in the same queue, because only then
do we achieve automatic balancing between the different memory uses
in an efficient way.
> >That would be great!
>
> Do you think we should do that for 2.4.x? How is the current
> swap behaviour with low mem? It doesn't feel bad to me while
> pushing 100mbyte on swap in 2.3.99-pre4-pre1 + the latest posted
> patches
It's certainly not bad. In fact, current VM behaviour is good enough
that I'd rather not touch it before 2.5... In the past we had some
very bad problems with VM behaviour just before a stable release,
now things seem to work quite well.
I propose we leave the VM subsystem alone for 2.4.
regards,
Rik
* Re: shrink_mmap SMP race fix
From: Andrea Arcangeli @ 2000-03-30 20:41 UTC
To: riel; +Cc: Linus Torvalds, linux-kernel, linux-mm, MOLNAR Ingo
On Thu, 30 Mar 2000, Rik van Riel wrote:
>What matters is that the pages should spend _enough time_ in the
>LRU list for them to have a chance to be reclaimed by the original
>application. If we maintain a too small list, pages don't get
>enough of a chance to be reclaimed by their application and the
>"extra aging" benefit is minimal.
To avoid shrinking the real cache you have to get rid of the swap cache.

swap_out is just an engine that generates page cache. If you want to keep the
good cache, you have to shrink from the swap cache. The swap cache generated
by the swap_out engine is by definition old: it belonged to a
not-recently-accessed pte. So it's a much less interesting cache, and
preserving the very interesting page cache is better than preserving the
polluting swap cache.
>> Shrinking the unused swap cache first is the way to go.
>
>NOOOOOO!! The only reason that the current VM behaves decently
>at all is that all pages are treated equally. We _need_ to reclaim
I am not just talking theory here; I tried that and it worked much better ;).
See the explanation above on why it gives smooth swap behaviour.

Anyway, I'd prefer to continue this discussion once I've written the code, so
I can try it out again and you can play with it too.
>all pages in the same way and in the same queue because only then
>we achieve automatic balancing between the different memory uses
>in an efficient way.
If the swap cache page is mapped/used (like after a shared swapin),
shrink_mmap is not going to eat it. Only the cache pollution generated by
swap_out gets cleaned up. And it gets cleaned up in a dynamic, lazy way as
usual (if somebody faults the swap cache in before shrink_mmap eats it, only
a minor fault happens, as usual).
>It's certainly not bad. In fact, current VM behaviour is good enough
>that I'd rather not touch it before 2.5... In the past we had some
>very bad problems with VM behaviour just before a stable release,
>now things seem to work quite well.
Agreed.
Right now I'm not fully convinced about zone_balance_memory() calling
try_to_free_pages for the whole zonelist, while try_to_free_pages is supposed
to be aware of classzones and to shrink all the inclusive zones automatically
(same for kswapd, since it uses the same helper function to do the page
freeing). Currently I can't notice the effect of such code:
DMA: 399*8kB 293*16kB 247*32kB 226*64kB 81*128kB 122*256kB 129*512kB 60*1024kB
Normal: = 0kB)
HighMem: = 0kB)
but it looks like the source of the "too much swap" thing reported on l-k.
Andrea
* Re: shrink_mmap SMP race fix
From: Rik van Riel @ 2000-03-30 21:19 UTC
To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel, linux-mm, MOLNAR Ingo
On Thu, 30 Mar 2000, Andrea Arcangeli wrote:
> On Thu, 30 Mar 2000, Rik van Riel wrote:
>
> >What matters is that the pages should spend _enough time_ in the
> >LRU list for them to have a chance to be reclaimed by the original
> >application. If we maintain a too small list, pages don't get
> >enough of a chance to be reclaimed by their application and the
> >"extra aging" benefit is minimal.
>
> To avoid shrinking the real cache you have to get rid of the swap cache.
>
> swap_out is just an engine that generates page cache. If you want
> to keep the good cache, you have to shrink from the swap cache.
> The swap cache generated by the swap_out engine is by definition
> old: it belonged to a not-recently-accessed pte. So
> it's a much less interesting cache, and preserving the very
> interesting page cache is better than preserving the polluting
> swap cache.
The reason to keep the LRU cache (of _unmapped_ pages) fairly
big is to cheaply increase the page aging from one-bit aging
to two-stage aging. That way we can do more precise page aging.
That this works has been demonstrated by the second chance
replacement patchlet that went into 2.2.15pre...
> >> Shrinking the unused swap cache first is the way to go.
> >
> >NOOOOOO!! The only reason that the current VM behaves decently
> >at all is that all pages are treated equally. We _need_ to reclaim
>
> I am not just talking theory here; I tried that and it worked much better ;).
> See the explanation above on why it gives smooth swap behaviour.
>
> Anyway, I'd prefer to continue this discussion once I've
> written the code, so I can try it out again and you can play
> with it too.
That might be the better idea since there seems to be some
misunderstanding when talking about it in English :)
> >all pages in the same way and in the same queue because only then
> >we achieve automatic balancing between the different memory uses
> >in an efficient way.
>
> If the swap cache page is mapped/used (like after a shared swapin),
> shrink_mmap is not going to eat it. Only the cache pollution
> generated by swap_out gets cleaned up. And it gets cleaned
> up in a dynamic, lazy way as usual (if somebody faults the swap
> cache in before shrink_mmap eats it, only a minor fault happens,
> as usual).
Point is, we should give the process a good amount of time
to fault their page back in. Otherwise page aging is just
the single NRU bit; in 2.2 we've already seen that that is
just not enough.
regards,
Rik
* Re: shrink_mmap SMP race fix
From: Andrea Arcangeli @ 2000-03-30 21:55 UTC
To: riel; +Cc: Linus Torvalds, linux-kernel, linux-mm, MOLNAR Ingo
On Thu, 30 Mar 2000, Rik van Riel wrote:
>That this works has been demonstrated by the second chance
>replacement patchlet that went into 2.2.15pre...
The trick is only not to clear the referenced bit for mapped pages from
within shrink_mmap (not to change swap_out). That's good since it gives
mapped pages a longer life, and they are the more critical ones (executables
sit on mapped pages). I agree with changing shrink_mmap not to clear the
reference bit for mapped pages in 2.3.x too; that is going to make the
swapout stage smoother I think, agreed.
>Point is, we should give the process a good amount of time
>to fault their page back in. Otherwise page aging is just
The process just had its time to fault the page back in before it got
unmapped from the pte. The swap cache that gets generated is less interesting
information than the page cache (there was no accessed bit or pte for the
page cache).
Also consider that if there's a swap loop, swap_out keeps generating swap
cache. If we treat swap cache and page cache equally, we'll end up exhausting
the page cache completely and keeping only the pollution in the cache. If you
instead shrink from the right place, you may end up preserving the
interesting cache. Consider that there are lots of swap-hog cases where you
never fault the swap cache in again (you just keep touching different
memory), and those are usually the harmful ones where we should react
smoothly.
>the single NRU bit; in 2.2 we've already seen that that is
>just not enough.
IMHO it's enough. We also have the implicit information of the insertion
point in the LRU (that's better than 2.2.x). Also, if you really want to do
better you don't need more than one bit: you want zero bits and to keep the
LRU in perfect order, rolling entries at each cache hit ;).
The aging of mapped pages is quite a special case right now (and the change
you did there to make sure unmapped pages have the reference bit set is a
very good idea for 2.3.x too! :). The reason we need such a trick is that
those pages are always going to have the reference bit clear when they get
unmapped by swap_out, simply because shrink_mmap has to fail (potentially
clearing the age bits) in order to trigger swap_out in the first place. With
the new design instead we'll be able to know when we have to swap_out without
needing to enter shrink_mmap, and the reference bit won't have to be cleared
in order to trigger swap_out. And in general it's a good idea to give a
longer lifetime to the .text sections, so the mouse more probably keeps
moving the arrow ;).
This untested patch below, incremental with all the previous patches, should
do the trick:
--- 2.3.99-pre3aa1-alpha/mm/filemap.c.~1~ Thu Mar 30 18:24:29 2000
+++ 2.3.99-pre3aa1-alpha/mm/filemap.c Thu Mar 30 23:50:26 2000
@@ -233,6 +233,16 @@
page = list_entry(page_lru, struct page, lru);
list_del(page_lru);
+ /*
+ * HACK: don't clear the reference bit on mapped page cache
+ * to give them longer life.
+ */
+ if (page->mapping && page_count(page) > 1) {
+ dispose = &young;
+ count--;
+ goto dispose_continue;
+ }
+
dispose = &zone->lru_cache;
if (test_and_clear_bit(PG_referenced, &page->flags))
/* Roll the page at the top of the lru list,
Andrea