Re: kswapd @ 60-80% CPU during heavy HD i/o.

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
       [not found] <390E1534.B33FF871@norran.net>
@ 2000-05-01 23:23 ` Rik van Riel
  2000-05-01 23:33   ` David S. Miller
  2000-05-02 16:17   ` Roger Larsson
  0 siblings, 2 replies; 31+ messages in thread
From: Rik van Riel @ 2000-05-01 23:23 UTC (permalink / raw)
  To: Roger Larsson; +Cc: linux-kernel, linux-mm

On Tue, 2 May 2000, Roger Larsson wrote:

> I think there are some problems in the current (pre7-1) shrink_mmap.
> 
> 1) "Random" resorting for zone with free_pages > pages_high
>   while loop searches from the end of the list.
>   old pages on non memory pressure zones are disposed as 'young'.
>   Young pages are put in front, like recently touched ones.
>   This results in a random resort for these pages.

Not doing this would result in having to scan the same "wrong zone"
pages over and over again, possibly never reaching the pages we do
want to free.

> 2) The implemented algorithm results in a lot of list operations -
>    each scanned page is deleted from the list.

*nod*

Maybe it's better to scan the list and leave it unchanged, doing
second chance replacement on it like we do in 2.2 ... or even 2
or 3 bit aging?

That way we only have to scan and do none of the expensive list
operations. Sorting doesn't make much sense anyway since we put
most pages on the list in an essentially random order...

> 3) The list is supposed to be small - it is not...

Who says the list is supposed to be small?

> 4) Count is only decreased for suitable pages, but is related
>    to total pages.

Not doing this resulted in being unable to free the "right" pages,
even if they are there on the list (just beyond where we stopped
scanning) and killing a process with out of memory errors.

> 5) Returns on first fully successful page. Rescan from beginning
>    at next call to get another one... (not that bad since pages
>    are moved to the end)

Well, it *is* bad since we'll end up scanning all the pages in
&old; (and trying to free them again, which probably fails just
like it did last time). The more I think about it, the more I think
we want to go to a second chance algorithm where we don't change
the list (except to remove pages from the list).

We can simply "move" the list_head when we're done scanning and
continue from where we left off last time. That way we'll be much
less cpu intensive and scan all pages fairly.

Using not one but 2 or 3 bits for aging the pages can result in 
something closer to lru and cheaper than the scheme we have now.

What do you (and others) think about this idea?

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-01 23:23 ` kswapd @ 60-80% CPU during heavy HD i/o Rik van Riel
@ 2000-05-01 23:33   ` David S. Miller
  2000-05-02  0:07     ` Rik van Riel
  2000-05-02 16:17   ` Roger Larsson
  1 sibling, 1 reply; 31+ messages in thread
From: David S. Miller @ 2000-05-01 23:33 UTC (permalink / raw)
  To: riel; +Cc: roger.larsson, linux-kernel, linux-mm

   We can simply "move" the list_head when we're done scanning and
   continue from where we left off last time. That way we'll be much
   less cpu intensive and scan all pages fairly.

   Using not one but 2 or 3 bits for aging the pages can result in
   something closer to lru and cheaper than the scheme we have now.

   What do you (and others) think about this idea?

Why not have two lists, an active and an inactive list.  As reference
bits clear, pages move to the inactive list.  If a reference bit stays
clear up until when the page moves up to the head of the inactive
list, we then try to free it.  For the active list, you do the "move
the list head" technique.

So you have two passes, one populates the inactive list, the next
inspects the inactive list for pages to free up.  The toplevel
shrink_mmap scheme can look something like:

	free_unreferenced_pages_in_inactive_list();
	repopulate_inactive_list();

And you define some heuristics to decide how populated you wish to
try to keep the inactive list.

Next, during periods of inactivity you have kswapd or some other
daemon periodically (once every 5 seconds, something like this)
perform an inactive list population run.

The inactive lru population can be done cheaply, using the above
ideas, roughly like:

	LIST_HEAD(inactive_queue);
	struct list_head * active_scan_point = &lru_active;

	for_each_active_lru_page() {
		if (!test_and_clear_referenced(page)) {
			list_del(entry);
			list_add(entry, &inactive_queue);
		} else
			active_scan_point = entry;
	}

	list_splice(&inactive_queue, &lru_inactive);
	list_head_move(&lru_active, active_scan_point);

This way you only do list manipulations for actual work done
(ie. moving inactive page candidates to the inactive list).

I may try to toss together and example implementation, but feel
free to beat me to it :-)

Later,
David S. Miller
davem@redhat.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-01 23:33   ` David S. Miller
@ 2000-05-02  0:07     ` Rik van Riel
  2000-05-02  0:23       ` David S. Miller
  2000-05-02  7:56       ` michael
  0 siblings, 2 replies; 31+ messages in thread
From: Rik van Riel @ 2000-05-02  0:07 UTC (permalink / raw)
  To: David S. Miller; +Cc: roger.larsson, linux-kernel, linux-mm

On Mon, 1 May 2000, David S. Miller wrote:
>    Date: 	Mon, 1 May 2000 20:23:43 -0300 (BRST)
>    From: Rik van Riel <riel@conectiva.com.br>
> 
>    We can simply "move" the list_head when we're done scanning and
>    continue from where we left off last time. That way we'll be much
>    less cpu intensive and scan all pages fairly.
> 
>    Using not one but 2 or 3 bits for aging the pages can result in
>    something closer to lru and cheaper than the scheme we have now.
> 
>    What do you (and others) think about this idea?
> 
> Why not have two lists, an active and an inactive list.  As reference
> bits clear, pages move to the inactive list.  If a reference bit stays
> clear up until when the page moves up to the head of the inactive
> list, we then try to free it.  For the active list, you do the "move
> the list head" technique.

Sounds like a winning idea. Well, we also want to keep mapped pages
on the active list...

> And you define some heuristics to decide how populated you wish to
> try to keep the inactive list.

We can aim/tune the inactive list size for 25% reclaims by the
original applications and 75% page stealing for "use" by the
free list. If we have far too much reclaims we can shrink the
list, if we have to little reclaims we can grow the inactive
list (and scan the active list more agressively).

This should also "catch" IO intensive applications, by moving
a lot of stuff to the inactive list quickly.

> Next, during periods of inactivity you have kswapd or some other
> daemon periodically (once every 5 seconds, something like this)
> perform an inactive list population run.

*nod*

We should scan the inactive list and move all reactivated pages
back to the active list and then repopulate the inactive list.
Alternatively, all "reactivation" actions (mapping a page back
into the application, __find_page_nolock(), etc...) should put
pages back onto the active queue.

(and repopulate the active queue whenever we go below the low
watermark, which is a fraction of the dynamically tuned high
watermark)

> The inactive lru population can be done cheaply, using the above
> ideas, roughly like:
> 
> 	LIST_HEAD(inactive_queue);
> 	struct list_head * active_scan_point = &lru_active;
> 
> 	for_each_active_lru_page() {
> 		if (!test_and_clear_referenced(page)) {

I'd like to add the "if (!page->buffers && atomic_read(&page->count) > 1)"
test to this, since there is no way to free those pages and they may
well have "hidden" referenced bits in their page table entries...

> 			list_del(entry);
> 			list_add(entry, &inactive_queue);
> 		} else
> 			active_scan_point = entry;
> 	}
> 
> 	list_splice(&inactive_queue, &lru_inactive);
> 	list_head_move(&lru_active, active_scan_point);
> 
> This way you only do list manipulations for actual work done
> (ie. moving inactive page candidates to the inactive list).
>
> I may try to toss together and example implementation, but feel
> free to beat me to it :-)

If you have the time to spare, feel free to go ahead, but since
I'm working on this stuff full-time now I see no real reason you
should waste^Wspend your time on this ... there must be something
to do in the network layer or another more critical/subtle place
of the kernel.

cheers,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02  0:07     ` Rik van Riel
@ 2000-05-02  0:23       ` David S. Miller
  2000-05-02  1:03         ` Rik van Riel
  2000-05-03 17:11         ` [PATCHlet] " Rik van Riel
  2000-05-02  7:56       ` michael
  1 sibling, 2 replies; 31+ messages in thread
From: David S. Miller @ 2000-05-02  0:23 UTC (permalink / raw)
  To: riel; +Cc: roger.larsson, linux-kernel, linux-mm

BTW, what loop are you trying to "continue;" out of here?

+			    do {
 				if (tsk->need_resched)
 					schedule();
 				if ((!zone->size) || (!zone->zone_wake_kswapd))
 					continue;
 				do_try_to_free_pages(GFP_KSWAPD, zone);
+			   } while (zone->free_pages < zone->pages_low &&
+					   --count);

:-)  Just add a "next_zone:" label at the end of that code and
change the continue; to a goto next_zone;

Later,
David S. Miller
davem@redhat.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02  0:23       ` David S. Miller
@ 2000-05-02  1:03         ` Rik van Riel
  2000-05-02  1:13           ` David S. Miller
  2000-05-03 17:11         ` [PATCHlet] " Rik van Riel
  1 sibling, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2000-05-02  1:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: roger.larsson, linux-kernel, linux-mm

On Mon, 1 May 2000, David S. Miller wrote:

> BTW, what loop are you trying to "continue;" out of here?
> 
> +			    do {
>  				if (tsk->need_resched)
>  					schedule();
>  				if ((!zone->size) || (!zone->zone_wake_kswapd))
>  					continue;
>  				do_try_to_free_pages(GFP_KSWAPD, zone);
> +			   } while (zone->free_pages < zone->pages_low &&
> +					   --count);
> 
> :-)  Just add a "next_zone:" label at the end of that code and
> change the continue; to a goto next_zone;

I want kswapd to continue with freeing pages from this zone if
there aren't enough free pages in this zone. This is needed
because kswapd used to stop freeing pages even if we were below
pages_min...

(leading to out of memory situations when it wasn't needed, or
to dozens or even hundreds of extra context switches / extra
swapin latency when we call balance_zones from alloc_pages, etc)

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02  1:03         ` Rik van Riel
@ 2000-05-02  1:13           ` David S. Miller
  2000-05-02  1:31             ` Rik van Riel
  2000-05-02  1:51             ` Andrea Arcangeli
  0 siblings, 2 replies; 31+ messages in thread
From: David S. Miller @ 2000-05-02  1:13 UTC (permalink / raw)
  To: riel; +Cc: roger.larsson, linux-kernel, linux-mm

   On Mon, 1 May 2000, David S. Miller wrote:

   > BTW, what loop are you trying to "continue;" out of here?
   > 
   > +			    do {
   >  				if (tsk->need_resched)
   >  					schedule();
   >  				if ((!zone->size) || (!zone->zone_wake_kswapd))
   >  					continue;
   >  				do_try_to_free_pages(GFP_KSWAPD, zone);
   > +			   } while (zone->free_pages < zone->pages_low &&
   > +					   --count);
   > 
   > :-)  Just add a "next_zone:" label at the end of that code and
   > change the continue; to a goto next_zone;

   I want kswapd to continue with freeing pages from this zone if
   there aren't enough free pages in this zone. This is needed
   because kswapd used to stop freeing pages even if we were below
   pages_min...

Rik, zone_wake_kswapd implies this information, via what
__free_pages_ok does to that flag.

I see it like this:

	if "!zone->size || !zone->zone_wake_kswapd"

		then zone->free_pages >= zone->pages_high by
		implication

Therefore when the continue happens, the loop will currently just
execute:

	if (!zone->size || !zone->zone_wake_kswapd)
		continue;
    ...
	} while (zone->free_pages < zone->pages_low &&

and the while condition is false and therefore will branch out of the
loop.  __free_pages_ok _always_ clears the zone_wake_kswapd flag when
zone->free_pages goes beyond zone->pages_high.

Later,
David S. Miller
davem@redhat.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02  1:13           ` David S. Miller
@ 2000-05-02  1:31             ` Rik van Riel
  2000-05-02  1:51             ` Andrea Arcangeli
  1 sibling, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2000-05-02  1:31 UTC (permalink / raw)
  To: David S. Miller; +Cc: roger.larsson, linux-kernel, linux-mm

On Mon, 1 May 2000, David S. Miller wrote:
> From: Rik van Riel <riel@conectiva.com.br>
>    On Mon, 1 May 2000, David S. Miller wrote:
> 
>    > BTW, what loop are you trying to "continue;" out of here?
>    > 
>    > +			    do {
>    >  				if (tsk->need_resched)
>    >  					schedule();
>    >  				if ((!zone->size) || (!zone->zone_wake_kswapd))
>    >  					continue;
>    >  				do_try_to_free_pages(GFP_KSWAPD, zone);
>    > +			   } while (zone->free_pages < zone->pages_low &&
>    > +					   --count);
>    > 
>    > :-)  Just add a "next_zone:" label at the end of that code and
>    > change the continue; to a goto next_zone;
> 
>    I want kswapd to continue with freeing pages from this zone if
>    there aren't enough free pages in this zone. This is needed
>    because kswapd used to stop freeing pages even if we were below
>    pages_min...
> 
> Rik, zone_wake_kswapd implies this information, via what
> __free_pages_ok does to that flag.

Indeed, I should have moved the test for zone->zone_wake_kswapd to 
before the loop. But using zone->zone_wake_kswapd for the test isn't
really enough since that is only turned off if zone->free_pages 
reaches zone->pages_high, but we probably don't want to do agressive
swapout when we're already above zone->pages_low ...

(just background swapping that happens incidentally when we're
swapping stuff for other zones)

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02  1:13           ` David S. Miller
  2000-05-02  1:31             ` Rik van Riel
@ 2000-05-02  1:51             ` Andrea Arcangeli
  1 sibling, 0 replies; 31+ messages in thread
From: Andrea Arcangeli @ 2000-05-02  1:51 UTC (permalink / raw)
  To: David S. Miller; +Cc: riel, roger.larsson, linux-kernel, linux-mm

Actually I think you missed the pgdat_list is a queue and it's not null
terminated. I fixed this in my classzone patch of last week in this chunk:

@@ -507,9 +529,8 @@
 	unsigned long i, j;
 	unsigned long map_size;
 	unsigned long totalpages, offset, realtotalpages;
-	unsigned int cumulative = 0;

-	pgdat->node_next = pgdat_list;
+	pgdat->node_next = NULL;

however that's not enough without the thing I'm doing in the
kswapd_can_sleep() again in the classzone patch.

Note that my latest classzone patch had a few minor bugs.

Last days and today I worked on getting mapped pages out of the lru and
splitting the lru in two pieces since swap cache is less priority and it
have to be shrink first. Doing that things is giving smooth swap
behaviour. I'm incremental with the classzone patch.

My current tree works rock solid but I forgot a little design detail ;).
If a mapped page have anonymous buffers on it it have to _stay_ on the lru
otherwise the bh headers will become unfreeable and so I can basically
leak memory. Once this little bit will be fixed (and it's not a trivial
bit if you think at it) I'll post the patch where the above and other
things are fixed.

It should be fully orthogonal (at least conceptually) with your anon.c
stuff since all new code lives in the lru_cache domain.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCHlet] Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02  0:23       ` David S. Miller
  2000-05-02  1:03         ` Rik van Riel
@ 2000-05-03 17:11         ` Rik van Riel
  1 sibling, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2000-05-03 17:11 UTC (permalink / raw)
  To: David S. Miller; +Cc: roger.larsson, linux-kernel, linux-mm, Linus Torvalds

On Mon, 1 May 2000, David S. Miller wrote:
> BTW, what loop are you trying to "continue;" out of here?
> 
> +			    do {
>  				if (tsk->need_resched)
>  					schedule();
>  				if ((!zone->size) || (!zone->zone_wake_kswapd))
>  					continue;
>  				do_try_to_free_pages(GFP_KSWAPD, zone);
> +			   } while (zone->free_pages < zone->pages_low &&
> +					   --count);

Ughhhhh. And the worst part is that it took me a few _days_ to
figure out ;)

Anyway, the fix for this small buglet is attached. I'll continue
working on the active/inactive lists (per pgdat!), if I haven't
sent in the active/inactive list thing for the next prepatch, it
would be nice to have this small fix applied.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/


--- vmscan.c.orig	Wed May  3 10:51:36 2000
+++ vmscan.c	Wed May  3 13:00:00 2000
@@ -528,15 +528,15 @@
 		pgdat = pgdat_list;
 		while (pgdat) {
 			for (i = 0; i < MAX_NR_ZONES; i++) {
-			    int count = SWAP_CLUSTER_MAX;
-			    zone = pgdat->node_zones + i;
-			    do {
-				if (tsk->need_resched)
-					schedule();
+				int count = SWAP_CLUSTER_MAX;
+				zone = pgdat->node_zones + i;
 				if ((!zone->size) || (!zone->zone_wake_kswapd))
 					continue;
-				do_try_to_free_pages(GFP_KSWAPD, zone);
-			   } while (zone->free_pages < zone->pages_low &&
+				do {
+					if (tsk->need_resched)
+						schedule();
+					do_try_to_free_pages(GFP_KSWAPD, zone);
+		 		} while (zone->free_pages < zone->pages_low &&
 					   --count);
 			}
 			pgdat = pgdat->node_next;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02  0:07     ` Rik van Riel
  2000-05-02  0:23       ` David S. Miller
@ 2000-05-02  7:56       ` michael
  1 sibling, 0 replies; 31+ messages in thread
From: michael @ 2000-05-02  7:56 UTC (permalink / raw)
  To: riel; +Cc: David S. Miller, roger.larsson, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:
> On Mon, 1 May 2000, David S. Miller wrote:
> > Why not have two lists, an active and an inactive list.  As reference
> > bits clear, pages move to the inactive list.  If a reference bit stays
> > clear up until when the page moves up to the head of the inactive
> > list, we then try to free it.  For the active list, you do the "move
> > the list head" technique.
[...]
> We should scan the inactive list and move all reactivated pages
> back to the active list and then repopulate the inactive list.
> Alternatively, all "reactivation" actions (mapping a page back
> into the application, __find_page_nolock(), etc...) should put
> pages back onto the active queue.
[..]

> > The inactive lru population can be done cheaply, using the above
> > ideas, roughly like:
> > 
> > 	LIST_HEAD(inactive_queue);
> > 	struct list_head * active_scan_point = &lru_active;
> > 
> > 	for_each_active_lru_page() {
> > 		if (!test_and_clear_referenced(page)) {

Tiny comment: You'd probably be better off waking up
more frequently, and just processing just a bit of the active
page queue.

I.e. Every 1/10th of a second, walk 2% of the active queue.
This would give you something closer to LRU, and smooth the
load, yes?

Or, I guess that could be ; every 1/10th of a second,
walk as much of the active queue as is needed to refill the
inactive list, starting from where you left of last time.
So if nothing is consuming out of the inactive queue, we
effectively stop walking. (This is basically pure clock).

Michael.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-01 23:23 ` kswapd @ 60-80% CPU during heavy HD i/o Rik van Riel
  2000-05-01 23:33   ` David S. Miller
@ 2000-05-02 16:17   ` Roger Larsson
  2000-05-02 15:43     ` Rik van Riel
  1 sibling, 1 reply; 31+ messages in thread
From: Roger Larsson @ 2000-05-02 16:17 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, linux-mm

Hi,

I have been playing with the idea to have a lru for each zone.
It should be trivial to do since page contains a pointer to zone.

With this change you will shrink_mmap only check among relevant pages.
(the caller will need to call shrink_mmap for other zone if call failed)

With this change you probably do not need to move pages to young. And
can get around without modifying the list.

I think keeping active/inactive (= generational) lists are also an
interesting proposal. But since it is orthogonal both methods can be
used!

/RogerL


Rik van Riel wrote:
> 
> On Tue, 2 May 2000, Roger Larsson wrote:
> 
> > I think there are some problems in the current (pre7-1) shrink_mmap.
> >
> > 1) "Random" resorting for zone with free_pages > pages_high
> >   while loop searches from the end of the list.
> >   old pages on non memory pressure zones are disposed as 'young'.
> >   Young pages are put in front, like recently touched ones.
> >   This results in a random resort for these pages.
> 
> Not doing this would result in having to scan the same "wrong zone"
> pages over and over again, possibly never reaching the pages we do
> want to free.
> 
> > 2) The implemented algorithm results in a lot of list operations -
> >    each scanned page is deleted from the list.
> 
> *nod*
> 
> Maybe it's better to scan the list and leave it unchanged, doing
> second chance replacement on it like we do in 2.2 ... or even 2
> or 3 bit aging?
> 
> That way we only have to scan and do none of the expensive list
> operations. Sorting doesn't make much sense anyway since we put
> most pages on the list in an essentially random order...
> 
> > 3) The list is supposed to be small - it is not...
> 
> Who says the list is supposed to be small?
> 
> > 4) Count is only decreased for suitable pages, but is related
> >    to total pages.
> 
> Not doing this resulted in being unable to free the "right" pages,
> even if they are there on the list (just beyond where we stopped
> scanning) and killing a process with out of memory errors.
> 
> > 5) Returns on first fully successful page. Rescan from beginning
> >    at next call to get another one... (not that bad since pages
> >    are moved to the end)
> 
> Well, it *is* bad since we'll end up scanning all the pages in
> &old; (and trying to free them again, which probably fails just
> like it did last time). The more I think about it, the more I think
> we want to go to a second chance algorithm where we don't change
> the list (except to remove pages from the list).
> 
> We can simply "move" the list_head when we're done scanning and
> continue from where we left off last time. That way we'll be much
> less cpu intensive and scan all pages fairly.
> 
> Using not one but 2 or 3 bits for aging the pages can result in
> something closer to lru and cheaper than the scheme we have now.
> 
> What do you (and others) think about this idea?
> 
> regards,
> 
> Rik
> --
> The Internet is not a network of computers. It is a network
> of people. That is its real strength.
> 
> Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
> http://www.conectiva.com/               http://www.surriel.com/
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.rutgers.edu
> Please read the FAQ at http://www.tux.org/lkml/

--
Home page:
  http://www.norran.net/nra02596/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 16:17   ` Roger Larsson
@ 2000-05-02 15:43     ` Rik van Riel
  2000-05-02 16:20       ` Andrea Arcangeli
  2000-05-02 18:03       ` kswapd @ 60-80% CPU during heavy HD i/o Roger Larsson
  0 siblings, 2 replies; 31+ messages in thread
From: Rik van Riel @ 2000-05-02 15:43 UTC (permalink / raw)
  To: Roger Larsson; +Cc: linux-kernel, linux-mm

On Tue, 2 May 2000, Roger Larsson wrote:

> I have been playing with the idea to have a lru for each zone.
> It should be trivial to do since page contains a pointer to zone.
> 
> With this change you will shrink_mmap only check among relevant pages.
> (the caller will need to call shrink_mmap for other zone if call failed)

That's a very bad idea.

In this case you can end up constantly cycling through the pages of
one zone while the pages in another zone remain idle.

Local page replacement is worse than global page replacement and
has always been...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 15:43     ` Rik van Riel
@ 2000-05-02 16:20       ` Andrea Arcangeli
  2000-05-02 17:06         ` Rik van Riel
  2000-05-02 18:03       ` kswapd @ 60-80% CPU during heavy HD i/o Roger Larsson
  1 sibling, 1 reply; 31+ messages in thread
From: Andrea Arcangeli @ 2000-05-02 16:20 UTC (permalink / raw)
  To: riel; +Cc: Roger Larsson, linux-kernel, linux-mm

On Tue, 2 May 2000, Rik van Riel wrote:

>That's a very bad idea.

However the lru_cache have definitely to be per-node and not global as now
in 2.3.99-pre6 and pre7-1 or you won't be able to do the smart things I
was mentining some day ago in linux-mm with NUMA.

My current tree looks like this:

#define LRU_SWAP_CACHE		0
#define LRU_NORMAL_CACHE	1
#define NR_LRU_CACHE		2
typedef struct lru_cache_s {
	struct list_head heads[NR_LRU_CACHE];
	unsigned long nr_cache_pages; /* pages in the lrus */
	unsigned long nr_map_pages; /* pages temporarly out of the lru */
	/* keep lock in a separate cacheline to avoid ping pong in SMP */
	spinlock_t lock ____cacheline_aligned_in_smp;
} lru_cache_t;

struct bootmem_data;
typedef struct pglist_data {
	int nr_zones;
	zone_t node_zones[MAX_NR_ZONES];
	gfpmask_zone_t node_gfpmask_zone[NR_GFPINDEX];
	lru_cache_t lru_cache;
	struct page *node_mem_map;
	unsigned long *valid_addr_bitmap;
	struct bootmem_data *bdata;
	unsigned long node_start_paddr;
	unsigned long node_start_mapnr;
	unsigned long node_size;
	int node_id;
	struct pglist_data *node_next;
	spinlock_t freelist_lock ____cacheline_aligned_in_smp;
} pg_data_t;

Stay tuned...

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 16:20       ` Andrea Arcangeli
@ 2000-05-02 17:06         ` Rik van Riel
  2000-05-02 21:14           ` Stephen C. Tweedie
  0 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2000-05-02 17:06 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Roger Larsson, linux-kernel, linux-mm

On Tue, 2 May 2000, Andrea Arcangeli wrote:
> On Tue, 2 May 2000, Rik van Riel wrote:
> 
> >That's a very bad idea.
> 
> However the lru_cache have definitely to be per-node and not
> global as now in 2.3.99-pre6 and pre7-1 or you won't be able to
> do the smart things I was mentining some day ago in linux-mm
> with NUMA.

How do you want to take care of global page balancing with
this "optimisation"?

If you cannot find a good answer to that, you'd better not
spend too much time implementing any of this...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 17:06         ` Rik van Riel
@ 2000-05-02 21:14           ` Stephen C. Tweedie
  2000-05-02 21:42             ` Rik van Riel
  0 siblings, 1 reply; 31+ messages in thread
From: Stephen C. Tweedie @ 2000-05-02 21:14 UTC (permalink / raw)
  To: riel; +Cc: Andrea Arcangeli, Roger Larsson, linux-kernel, linux-mm

Hi,

On Tue, May 02, 2000 at 02:06:20PM -0300, Rik van Riel wrote:
> > do the smart things I was mentining some day ago in linux-mm
> > with NUMA.
> 
> How do you want to take care of global page balancing with
> this "optimisation"?

You don't.  With NUMA, the memory is inherently unbalanced, and you
don't want the allocator to smooth over the different nodes.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 21:14           ` Stephen C. Tweedie
@ 2000-05-02 21:42             ` Rik van Riel
  2000-05-02 22:34               ` Stephen C. Tweedie
  2000-05-04 12:37               ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
  0 siblings, 2 replies; 31+ messages in thread
From: Rik van Riel @ 2000-05-02 21:42 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrea Arcangeli, Roger Larsson, linux-kernel, linux-mm

On Tue, 2 May 2000, Stephen C. Tweedie wrote:
> On Tue, May 02, 2000 at 02:06:20PM -0300, Rik van Riel wrote:
> > > do the smart things I was mentining some day ago in linux-mm
> > > with NUMA.
> > 
> > How do you want to take care of global page balancing with
> > this "optimisation"?
> 
> You don't.  With NUMA, the memory is inherently unbalanced, and you
> don't want the allocator to smooth over the different nodes.

Ermmm, a few days ago (yesterday?) you told me on irc that we
needed to balance between zones ... maybe we need some way to
measure "memory load" on a zone and only allocate from a different
NUMA zone if:

	local_load        remote_load
	----------   >=   -----------
	1.0               load penalty for local->remote

(or something more or less like this ... only use one of the
nodes one hop away if the remote load is <90% of the local
load, 70% for two hops, 30% for > 2 hops ...)

We could use the scavenge list in combination with more or
less balanced page reclamation to determine memory load on
the different nodes...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 21:42             ` Rik van Riel
@ 2000-05-02 22:34               ` Stephen C. Tweedie
  2000-05-04 12:37               ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
  1 sibling, 0 replies; 31+ messages in thread
From: Stephen C. Tweedie @ 2000-05-02 22:34 UTC (permalink / raw)
  To: riel
  Cc: Stephen C. Tweedie, Andrea Arcangeli, Roger Larsson,
	linux-kernel, linux-mm

Hi,

On Tue, May 02, 2000 at 06:42:31PM -0300, Rik van Riel wrote:
> On Tue, 2 May 2000, Stephen C. Tweedie wrote:
> Ermmm, a few days ago (yesterday?) you told me on irc that we
> needed to balance between zones ... 

On a single NUMA node, definitely.  We need balance between all of
the zones which may be in use in a specific allocation class.

NUMA is a very different issue.  _Some_ memory pressure between nodes is
necessary: if one node is completely out of memory then we may have to 
start allocating memory on other nodes to tasks tied to the node under
pressure.  But in the normal case, you really do want NUMA memory classes
to be as independent of each other as possible.

--Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH][RFC] Alternate shrink_mmap
  2000-05-02 21:42             ` Rik van Riel
  2000-05-02 22:34               ` Stephen C. Tweedie
@ 2000-05-04 12:37               ` Roger Larsson
  2000-05-04 14:34                 ` Rik van Riel
  2000-05-04 15:25                 ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
  1 sibling, 2 replies; 31+ messages in thread
From: Roger Larsson @ 2000-05-04 12:37 UTC (permalink / raw)
  To: riel; +Cc: linux-mm

[-- Attachment #1: Type: text/plain, Size: 296 bytes --]

Hi all,

Here is an alternative shrink_mmap.
It tries to touch the list as little as possible
(only young pages are moved)

And tries to be quick.

Comments please.

It compiles but I have not dared to run it yet...
(My biggest patch yet)

/RogerL

--
Home page:
  http://www.norran.net/nra02596/

[-- Attachment #2: patch-2.3-shrink_mmap --]
[-- Type: text/plain, Size: 7789 bytes --]

diff -Naur linux-2.3-pre7+/mm/filemap.c linux-2.3/mm/filemap.c
--- linux-2.3-pre7+/mm/filemap.c	Mon May  1 21:41:10 2000
+++ linux-2.3/mm/filemap.c	Thu May  4 13:30:36 2000
@@ -237,143 +237,149 @@
 {
 	int ret = 0, count;
 	LIST_HEAD(young);
-	LIST_HEAD(old);
 	LIST_HEAD(forget);
 	struct list_head * page_lru, * dispose;
 	struct page * page = NULL;
 	struct zone_struct * p_zone;
-	
+
+	/* This could be removed.
+	 * NULL translates to: fulfill all zone requests. */
 	if (!zone)
 		BUG();
 
 	count = nr_lru_pages >> priority;
-	if (!count)
-		return ret;
 
 	spin_lock(&pagemap_lru_lock);
 again:
-	/* we need pagemap_lru_lock for list_del() ... subtle code below */
-	while (count > 0 && (page_lru = lru_cache.prev) != &lru_cache) {
-		page = list_entry(page_lru, struct page, lru);
-		list_del(page_lru);
-		p_zone = page->zone;
-
-		/* This LRU list only contains a few pages from the system,
-		 * so we must fail and let swap_out() refill the list if
-		 * there aren't enough freeable pages on the list */
-
-		/* The page is in use, or was used very recently, put it in
-		 * &young to make sure that we won't try to free it the next
-		 * time */
-		dispose = &young;
-		if (test_and_clear_bit(PG_referenced, &page->flags))
-			goto dispose_continue;
-
-		if (p_zone->free_pages > p_zone->pages_high)
-			goto dispose_continue;
-
-		if (!page->buffers && page_count(page) > 1)
-			goto dispose_continue;
-
-		count--;
-		/* Page not used -> free it or put it on the old list
-		 * so it gets freed first the next time */
-		dispose = &old;
-		if (TryLockPage(page))
-			goto dispose_continue;
-
-		/* Release the pagemap_lru lock even if the page is not yet
-		   queued in any lru queue since we have just locked down
-		   the page so nobody else may SMP race with us running
-		   a lru_cache_del() (lru_cache_del() always run with the
-		   page locked down ;). */
-		spin_unlock(&pagemap_lru_lock);
-
-		/* avoid freeing the page while it's locked */
-		get_page(page);
-
-		/* Is it a buffer page? */
-		if (page->buffers) {
-			if (!try_to_free_buffers(page))
-				goto unlock_continue;
-			/* page was locked, inode can't go away under us */
-			if (!page->mapping) {
-				atomic_dec(&buffermem_pages);
-				goto made_buffer_progress;
-			}
-		}
-
-		/* Take the pagecache_lock spinlock held to avoid
-		   other tasks to notice the page while we are looking at its
-		   page count. If it's a pagecache-page we'll free it
-		   in one atomic transaction after checking its page count. */
-		spin_lock(&pagecache_lock);
-
-		/*
-		 * We can't free pages unless there's just one user
-		 * (count == 2 because we added one ourselves above).
-		 */
-		if (page_count(page) != 2)
-			goto cache_unlock_continue;
-
-		/*
-		 * Is it a page swap page? If so, we want to
-		 * drop it if it is no longer used, even if it
-		 * were to be marked referenced..
-		 */
-		if (PageSwapCache(page)) {
-			spin_unlock(&pagecache_lock);
-			__delete_from_swap_cache(page);
-			goto made_inode_progress;
-		}	
-
-		/* is it a page-cache page? */
-		if (page->mapping) {
-			if (!PageDirty(page) && !pgcache_under_min()) {
-				remove_page_from_inode_queue(page);
-				remove_page_from_hash_queue(page);
-				page->mapping = NULL;
-				spin_unlock(&pagecache_lock);
-				goto made_inode_progress;
-			}
-			goto cache_unlock_continue;
-		}
-
-		dispose = &forget;
-		printk(KERN_ERR "shrink_mmap: unknown LRU page!\n");
-
+	for (page_lru = lru_cache.prev;
+	     count-- && page_lru != &lru_cache;
+	     page_lru = page_lru->prev) {
+	  page = list_entry(page_lru, struct page, lru);
+	  p_zone = page->zone;
+
+
+	  /* Check if zone has pressure, most pages would continue here.
+	   * Also pages from zones that initally was under pressure */
+	  if (!p_zone->zone_wake_kswapd)
+	    continue;
+
+	  /* Can't do anything about this... */
+	  if (!page->buffers && page_count(page) > 1)
+	    continue;
+
+	  /* Page not used -> free it 
+	   * If it could not be locked it is somehow in use
+	   * try another time */
+	  if (TryLockPage(page))
+	    continue;
+
+	  /* Ok, a possible page.
+	  * Note: can't unlock lru if we do we will have
+	  * to restart this loop */
+
+	  /* The page is in use, or was used very recently, put it in
+	   * &young to make it ulikely that we will try to free it the next
+	   * time */
+	  dispose = &young;
+	  if (test_and_clear_bit(PG_referenced, &page->flags))
+	    goto dispose_continue;
+		
+	  
+	  /* avoid freeing the page while it's locked [RL???] */
+	  get_page(page);
+
+	  /* If it can not be freed here it is unlikely to
+	   * at next attempt. */
+	  dispose = NULL; 
+
+	  /* Is it a buffer page? */
+	  if (page->buffers) {
+	    if (!try_to_free_buffers(page))
+	      goto unlock_continue;
+	    /* page was locked, inode can't go away under us */
+	    if (!page->mapping) {
+	      atomic_dec(&buffermem_pages);
+	      goto made_buffer_progress;
+	    }
+	  }
+
+
+	  /* Take the pagecache_lock spinlock held to avoid
+	     other tasks to notice the page while we are looking at its
+	     page count. If it's a pagecache-page we'll free it
+	     in one atomic transaction after checking its page count. */
+	  spin_lock(&pagecache_lock);
+
+	  /*
+	   * We can't free pages unless there's just one user
+	   * (count == 2 because we added one ourselves above).
+	   */
+	  if (page_count(page) != 2)
+	    goto cache_unlock_continue;
+
+	  /*
+	   * Is it a page swap page? If so, we want to
+	   * drop it if it is no longer used, even if it
+	   * were to be marked referenced..
+	   */
+	  if (PageSwapCache(page)) {
+	    spin_unlock(&pagecache_lock);
+	    __delete_from_swap_cache(page);
+	    goto made_inode_progress;
+	  }	
+
+	  /* is it a page-cache page? */
+	  if (page->mapping) {
+	    if (!PageDirty(page) && !pgcache_under_min()) {
+	      remove_page_from_inode_queue(page);
+	      remove_page_from_hash_queue(page);
+	      page->mapping = NULL;
+	      spin_unlock(&pagecache_lock);
+	      goto made_inode_progress;
+	    }
+	    goto cache_unlock_continue;
+	  }
+
+	  dispose = &forget;
+	  printk(KERN_ERR "shrink_mmap: unknown LRU page!\n");
+	  
 cache_unlock_continue:
-		spin_unlock(&pagecache_lock);
+	  spin_unlock(&pagecache_lock);
 unlock_continue:
-		spin_lock(&pagemap_lru_lock);
-		UnlockPage(page);
-		put_page(page);
-		list_add(page_lru, dispose);
-		continue;
+	  /* never released... spin_lock(&pagemap_lru_lock); */
+	  UnlockPage(page);
+	  put_page(page);
+	  if (dispose == NULL) /* only forget should end up here - predicted taken */
+	    continue;
 
-		/* we're holding pagemap_lru_lock, so we can just loop again */
 dispose_continue:
-		list_add(page_lru, dispose);
-	}
-	goto out;
+	  list_del(page_lru);
+	  list_add(page_lru, dispose);
+	  continue;
 
 made_inode_progress:
-	page_cache_release(page);
+	  page_cache_release(page);
 made_buffer_progress:
-	UnlockPage(page);
-	put_page(page);
-	ret = 1;
-	spin_lock(&pagemap_lru_lock);
-	/* nr_lru_pages needs the spinlock */
-	nr_lru_pages--;
-
-	/* wrong zone?  not looped too often?    roll again... */
-	if (page->zone != zone && count)
-		goto again;
+	  UnlockPage(page);
+	  put_page(page);
+	  ret++;
+	  /* never unlocked... spin_lock(&pagemap_lru_lock); */
+	  /* nr_lru_pages needs the spinlock */
+	  list_del(page_lru);
+	  nr_lru_pages--;
+
+	  /* Might (and should) have been done by free calls
+	   * p_zone->zone_wake_kswapd = 0;
+	   */
+
+	  /* If no more pages are needed to release on specifically
+	     requested zone concider it done!
+	     Note: zone might be NULL to make all requests fulfilled */
+	  if (p_zone == zone && !p_zone->zone_wake_kswapd)
+	    break;
+	}
 
-out:
 	list_splice(&young, &lru_cache);
-	list_splice(&old, lru_cache.prev);
 
 	spin_unlock(&pagemap_lru_lock);
 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH][RFC] Alternate shrink_mmap
  2000-05-04 12:37               ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
@ 2000-05-04 14:34                 ` Rik van Riel
  2000-05-04 22:38                   ` [PATCH][RFC] Another shrink_mmap Roger Larsson
  2000-05-04 15:25                 ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
  1 sibling, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2000-05-04 14:34 UTC (permalink / raw)
  To: Roger Larsson; +Cc: linux-mm

On Thu, 4 May 2000, Roger Larsson wrote:

> Here is an alternative shrink_mmap.
> It tries to touch the list as little as possible
> (only young pages are moved)

I will use something like this in the active/inactive queue
thing. The major differences will be that:
- we won't be reclaiming memory in the first queue
  (only from the inactive queue)
- we'll try to keep a minimum number of active and
  inactive pages in every zone
- we will probably have a (per pg_dat) self-tuning
  target for the number of inactive pages

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH][RFC] Another shrink_mmap
  2000-05-04 14:34                 ` Rik van Riel
@ 2000-05-04 22:38                   ` Roger Larsson
  0 siblings, 0 replies; 31+ messages in thread
From: Roger Larsson @ 2000-05-04 22:38 UTC (permalink / raw)
  To: riel; +Cc: linux-mm

Hi all,

Here is an another shrink_mmap.

This time lock handling should be better.

It tries to touch the list as little as possible
Only young pages are moved, probable pages are
replaced by a cursor - makes it possible to release
the pagemap_lru_lock.

And it tries to be quick.

Comments please.

It compiles but I have not dared to run it yet...

(My biggest patch yet, included straight text
 since I do not dare to run it I shouldn't tempt
 you...)

/RogerL

--
Home page:
  http://www.norran.net/nra02596/

---

static zone_t null_zone;

int shrink_mmap(int priority, int gfp_mask, zone_t *zone)
{
	int ret = 0, count;
	LIST_HEAD(young);
	LIST_HEAD(forget);
	struct list_head * page_lru, * cursor, * dispose;
	struct page * page = NULL;
	struct zone_struct * p_zone;
	struct page cursor_page; /* unique by thread, too much on stack? */

	/* This could be removed.
	 * NULL translates to: fulfill all zone requests. */
	if (!zone)
		BUG();

	count = nr_lru_pages >> priority;

	cursor = &cursor_page.lru;
	cursor_page.zone = &null_zone;

	spin_lock(&pagemap_lru_lock);
	/* cursor always part of the list, but not a real page... 
	 * make a special page that points to a special zone
	 *     with zone_wake_kswapd always 0
	 * - some more toughts required... */
	list_add_tail(cursor, &lru_cache);

	for (page_lru = lru_cache.prev;
	     count-- && page_lru != &lru_cache;
	     page_lru = page_lru->prev) {

	  /* Avoid processing our own cursor... 
	   * Note: check not needed with page cursor.
	   * if (page_lru == cursor)
	   *   continue;
	   */

	  page = list_entry(page_lru, struct page, lru);
	  p_zone = page->zone;


	  /* Check if zone has pressure, most pages would continue here.
	   * Also pages from zones that initally was under pressure */
	  if (!p_zone->zone_wake_kswapd)
	    continue;

	  /* Can't do anything about this... */
	  if (!page->buffers && page_count(page) > 1)
	    continue;

	  /* Page not used -> free it 
	   * If it could not be locked it is somehow in use
	   * try another time */
	  if (TryLockPage(page))
	    continue;

	  /* Ok, a possible page.
	  * Note: can't unlock lru if we do we will have
	  * to restart this loop */

	  /* The page is in use, or was used very recently, put it in
	   * &young to make it ulikely that we will try to free it the next
	   * time */
	  dispose = &young;
	  if (test_and_clear_bit(PG_referenced, &page->flags))
	    goto dispose_continue;
		
	  
	  /* cursor takes page_lru's place in lru_list
	   * if disposed later it ends up at the same place!
	   * Note: compilers should be able to optimize this a bit... */
	  list_del(cursor);
	  list_add_tail(cursor, page_lru);
	  list_del(page_lru);
	  spin_unlock(&pagemap_lru_lock);

	  /* Spinlock is released, anything might happen to the list!
	   * But the cursor will remain on spot.
	   * - it will not be deleted from outside,
	   *   no one knows about it.
	   * - it will not be deleted by another shrink_mmap,
           *   zone_wake_kswapd == 0
	   */

	  /* If page is redisposed after attempt, place it at the same spot */
	  dispose = cursor;

	  /* avoid freeing the page while it's locked */
	  get_page(page);

	  /* Is it a buffer page? */
	  if (page->buffers) {
	    if (!try_to_free_buffers(page))
	      goto unlock_continue;
	    /* page was locked, inode can't go away under us */
	    if (!page->mapping) {
	      atomic_dec(&buffermem_pages);
	      goto made_buffer_progress;
	    }
	  }

	  /* Take the pagecache_lock spinlock held to avoid
	     other tasks to notice the page while we are looking at its
	     page count. If it's a pagecache-page we'll free it
	     in one atomic transaction after checking its page count. */
	  spin_lock(&pagecache_lock);

	  /*
	   * We can't free pages unless there's just one user
	   * (count == 2 because we added one ourselves above).
	   */
	  if (page_count(page) != 2)
	    goto cache_unlock_continue;

	  /*
	   * Is it a page swap page? If so, we want to
	   * drop it if it is no longer used, even if it
	   * were to be marked referenced..
	   */
	  if (PageSwapCache(page)) {
	    spin_unlock(&pagecache_lock);
	    __delete_from_swap_cache(page);
	    goto made_inode_progress;
	  }	

	  /* is it a page-cache page? */
	  if (page->mapping) {
	    if (!PageDirty(page) && !pgcache_under_min()) {
	      remove_page_from_inode_queue(page);
	      remove_page_from_hash_queue(page);
	      page->mapping = NULL;
	      spin_unlock(&pagecache_lock);
	      goto made_inode_progress;
	    }
	    goto cache_unlock_continue;
	  }

	  dispose = &forget;
	  printk(KERN_ERR "shrink_mmap: unknown LRU page!\n");
	  
cache_unlock_continue:
	  spin_unlock(&pagecache_lock);
unlock_continue:
	  spin_lock(&pagemap_lru_lock);
	  UnlockPage(page);
	  put_page(page);

dispose_continue:
	  list_add(page_lru, dispose);
	  /* final disposition to other list than lru? */
	  /* then return list index to old lru-list position */
	  if (dispose != cursor)
	    page_lru = cursor;
	  continue;

made_inode_progress:
	  page_cache_release(page);
made_buffer_progress:
	  UnlockPage(page);
	  put_page(page);
	  ret++;
	  spin_lock(&pagemap_lru_lock);
	  /* nr_lru_pages needs the spinlock */
	  nr_lru_pages--;

	  /* Might (and should) have been done by free calls
	   * p_zone->zone_wake_kswapd = 0;
	   */

	  /* If no more pages are needed to release on specifically
	     requested zone concider it done!
	     Note: zone might be NULL to make all requests fulfilled */
	  if (p_zone == zone && !p_zone->zone_wake_kswapd)
	    break;

	  /* Back to cursor position to ensure correct next step */
	  page_lru = cursor;
	}

	list_splice(&young, &lru_cache);
	list_del(cursor);

	spin_unlock(&pagemap_lru_lock);

	return ret;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH][RFC] Alternate shrink_mmap
  2000-05-04 12:37               ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
  2000-05-04 14:34                 ` Rik van Riel
@ 2000-05-04 15:25                 ` Roger Larsson
  2000-05-04 18:30                   ` Rik van Riel
  1 sibling, 1 reply; 31+ messages in thread
From: Roger Larsson @ 2000-05-04 15:25 UTC (permalink / raw)
  To: riel, linux-mm

Hi all,

I have noticed (not by running - lucky me) that I break this
assumption....
/*
 * NOTE: to avoid deadlocking you must never acquire the pagecache_lock
with
 *       the pagemap_lru_lock held.
 */

/RogerL

Roger Larsson wrote:
> 
> Hi all,
> 
> Here is an alternative shrink_mmap.
> It tries to touch the list as little as possible
> (only young pages are moved)
> 
> And tries to be quick.
> 
> Comments please.
> 
> It compiles but I have not dared to run it yet...
> (My biggest patch yet)
> 
> /RogerL
> 
> --
> Home page:
>   http://www.norran.net/nra02596/
> 
>   ------------------------------------------------------------------------
> diff -Naur linux-2.3-pre7+/mm/filemap.c linux-2.3/mm/filemap.c
> --- linux-2.3-pre7+/mm/filemap.c        Mon May  1 21:41:10 2000
> +++ linux-2.3/mm/filemap.c      Thu May  4 13:30:36 2000
> @@ -237,143 +237,149 @@
>  {
>         int ret = 0, count;
>         LIST_HEAD(young);
> -       LIST_HEAD(old);
>         LIST_HEAD(forget);
>         struct list_head * page_lru, * dispose;
>         struct page * page = NULL;
>         struct zone_struct * p_zone;
> -
> +
> +       /* This could be removed.
> +        * NULL translates to: fulfill all zone requests. */
>         if (!zone)
>                 BUG();
> 
>         count = nr_lru_pages >> priority;
> -       if (!count)
> -               return ret;
> 
>         spin_lock(&pagemap_lru_lock);
>  again:
> -       /* we need pagemap_lru_lock for list_del() ... subtle code below */
> -       while (count > 0 && (page_lru = lru_cache.prev) != &lru_cache) {
> -               page = list_entry(page_lru, struct page, lru);
> -               list_del(page_lru);
> -               p_zone = page->zone;
> -
> -               /* This LRU list only contains a few pages from the system,
> -                * so we must fail and let swap_out() refill the list if
> -                * there aren't enough freeable pages on the list */
> -
> -               /* The page is in use, or was used very recently, put it in
> -                * &young to make sure that we won't try to free it the next
> -                * time */
> -               dispose = &young;
> -               if (test_and_clear_bit(PG_referenced, &page->flags))
> -                       goto dispose_continue;
> -
> -               if (p_zone->free_pages > p_zone->pages_high)
> -                       goto dispose_continue;
> -
> -               if (!page->buffers && page_count(page) > 1)
> -                       goto dispose_continue;
> -
> -               count--;
> -               /* Page not used -> free it or put it on the old list
> -                * so it gets freed first the next time */
> -               dispose = &old;
> -               if (TryLockPage(page))
> -                       goto dispose_continue;
> -
> -               /* Release the pagemap_lru lock even if the page is not yet
> -                  queued in any lru queue since we have just locked down
> -                  the page so nobody else may SMP race with us running
> -                  a lru_cache_del() (lru_cache_del() always run with the
> -                  page locked down ;). */
> -               spin_unlock(&pagemap_lru_lock);
> -
> -               /* avoid freeing the page while it's locked */
> -               get_page(page);
> -
> -               /* Is it a buffer page? */
> -               if (page->buffers) {
> -                       if (!try_to_free_buffers(page))
> -                               goto unlock_continue;
> -                       /* page was locked, inode can't go away under us */
> -                       if (!page->mapping) {
> -                               atomic_dec(&buffermem_pages);
> -                               goto made_buffer_progress;
> -                       }
> -               }
> -
> -               /* Take the pagecache_lock spinlock held to avoid
> -                  other tasks to notice the page while we are looking at its
> -                  page count. If it's a pagecache-page we'll free it
> -                  in one atomic transaction after checking its page count. */
> -               spin_lock(&pagecache_lock);
> -
> -               /*
> -                * We can't free pages unless there's just one user
> -                * (count == 2 because we added one ourselves above).
> -                */
> -               if (page_count(page) != 2)
> -                       goto cache_unlock_continue;
> -
> -               /*
> -                * Is it a page swap page? If so, we want to
> -                * drop it if it is no longer used, even if it
> -                * were to be marked referenced..
> -                */
> -               if (PageSwapCache(page)) {
> -                       spin_unlock(&pagecache_lock);
> -                       __delete_from_swap_cache(page);
> -                       goto made_inode_progress;
> -               }
> -
> -               /* is it a page-cache page? */
> -               if (page->mapping) {
> -                       if (!PageDirty(page) && !pgcache_under_min()) {
> -                               remove_page_from_inode_queue(page);
> -                               remove_page_from_hash_queue(page);
> -                               page->mapping = NULL;
> -                               spin_unlock(&pagecache_lock);
> -                               goto made_inode_progress;
> -                       }
> -                       goto cache_unlock_continue;
> -               }
> -
> -               dispose = &forget;
> -               printk(KERN_ERR "shrink_mmap: unknown LRU page!\n");
> -
> +       for (page_lru = lru_cache.prev;
> +            count-- && page_lru != &lru_cache;
> +            page_lru = page_lru->prev) {
> +         page = list_entry(page_lru, struct page, lru);
> +         p_zone = page->zone;
> +
> +
> +         /* Check if zone has pressure, most pages would continue here.
> +          * Also pages from zones that initally was under pressure */
> +         if (!p_zone->zone_wake_kswapd)
> +           continue;
> +
> +         /* Can't do anything about this... */
> +         if (!page->buffers && page_count(page) > 1)
> +           continue;
> +
> +         /* Page not used -> free it
> +          * If it could not be locked it is somehow in use
> +          * try another time */
> +         if (TryLockPage(page))
> +           continue;
> +
> +         /* Ok, a possible page.
> +         * Note: can't unlock lru if we do we will have
> +         * to restart this loop */
> +
> +         /* The page is in use, or was used very recently, put it in
> +          * &young to make it ulikely that we will try to free it the next
> +          * time */
> +         dispose = &young;
> +         if (test_and_clear_bit(PG_referenced, &page->flags))
> +           goto dispose_continue;
> +
> +
> +         /* avoid freeing the page while it's locked [RL???] */
> +         get_page(page);
> +
> +         /* If it can not be freed here it is unlikely to
> +          * at next attempt. */
> +         dispose = NULL;
> +
> +         /* Is it a buffer page? */
> +         if (page->buffers) {
> +           if (!try_to_free_buffers(page))
> +             goto unlock_continue;
> +           /* page was locked, inode can't go away under us */
> +           if (!page->mapping) {
> +             atomic_dec(&buffermem_pages);
> +             goto made_buffer_progress;
> +           }
> +         }
> +
> +
> +         /* Take the pagecache_lock spinlock held to avoid
> +            other tasks to notice the page while we are looking at its
> +            page count. If it's a pagecache-page we'll free it
> +            in one atomic transaction after checking its page count. */
> +         spin_lock(&pagecache_lock);
> +
> +         /*
> +          * We can't free pages unless there's just one user
> +          * (count == 2 because we added one ourselves above).
> +          */
> +         if (page_count(page) != 2)
> +           goto cache_unlock_continue;
> +
> +         /*
> +          * Is it a page swap page? If so, we want to
> +          * drop it if it is no longer used, even if it
> +          * were to be marked referenced..
> +          */
> +         if (PageSwapCache(page)) {
> +           spin_unlock(&pagecache_lock);
> +           __delete_from_swap_cache(page);
> +           goto made_inode_progress;
> +         }
> +
> +         /* is it a page-cache page? */
> +         if (page->mapping) {
> +           if (!PageDirty(page) && !pgcache_under_min()) {
> +             remove_page_from_inode_queue(page);
> +             remove_page_from_hash_queue(page);
> +             page->mapping = NULL;
> +             spin_unlock(&pagecache_lock);
> +             goto made_inode_progress;
> +           }
> +           goto cache_unlock_continue;
> +         }
> +
> +         dispose = &forget;
> +         printk(KERN_ERR "shrink_mmap: unknown LRU page!\n");
> +
>  cache_unlock_continue:
> -               spin_unlock(&pagecache_lock);
> +         spin_unlock(&pagecache_lock);
>  unlock_continue:
> -               spin_lock(&pagemap_lru_lock);
> -               UnlockPage(page);
> -               put_page(page);
> -               list_add(page_lru, dispose);
> -               continue;
> +         /* never released... spin_lock(&pagemap_lru_lock); */
> +         UnlockPage(page);
> +         put_page(page);
> +         if (dispose == NULL) /* only forget should end up here - predicted taken */
> +           continue;
> 
> -               /* we're holding pagemap_lru_lock, so we can just loop again */
>  dispose_continue:
> -               list_add(page_lru, dispose);
> -       }
> -       goto out;
> +         list_del(page_lru);
> +         list_add(page_lru, dispose);
> +         continue;
> 
>  made_inode_progress:
> -       page_cache_release(page);
> +         page_cache_release(page);
>  made_buffer_progress:
> -       UnlockPage(page);
> -       put_page(page);
> -       ret = 1;
> -       spin_lock(&pagemap_lru_lock);
> -       /* nr_lru_pages needs the spinlock */
> -       nr_lru_pages--;
> -
> -       /* wrong zone?  not looped too often?    roll again... */
> -       if (page->zone != zone && count)
> -               goto again;
> +         UnlockPage(page);
> +         put_page(page);
> +         ret++;
> +         /* never unlocked... spin_lock(&pagemap_lru_lock); */
> +         /* nr_lru_pages needs the spinlock */
> +         list_del(page_lru);
> +         nr_lru_pages--;
> +
> +         /* Might (and should) have been done by free calls
> +          * p_zone->zone_wake_kswapd = 0;
> +          */
> +
> +         /* If no more pages are needed to release on specifically
> +            requested zone concider it done!
> +            Note: zone might be NULL to make all requests fulfilled */
> +         if (p_zone == zone && !p_zone->zone_wake_kswapd)
> +           break;
> +       }
> 
> -out:
>         list_splice(&young, &lru_cache);
> -       list_splice(&old, lru_cache.prev);
> 
>         spin_unlock(&pagemap_lru_lock);
> 

--
Home page:
  http://www.norran.net/nra02596/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH][RFC] Alternate shrink_mmap
  2000-05-04 15:25                 ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
@ 2000-05-04 18:30                   ` Rik van Riel
  2000-05-04 20:44                     ` Roger Larsson
  0 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2000-05-04 18:30 UTC (permalink / raw)
  To: Roger Larsson; +Cc: linux-mm

On Thu, 4 May 2000, Roger Larsson wrote:

> I have noticed (not by running - lucky me) that I break this
> assumption....
> /*
>  * NOTE: to avoid deadlocking you must never acquire the pagecache_lock
> with
>  *       the pagemap_lru_lock held.
>  */

Also, you seem to start scanning at the beginning of the
list every time, instead of moving the list head around
so you scan all pages in the list evenly...

Anyway, I'll use something like your code, but have two
lists (an active and an inactive list, like the BSD's
seem to have).

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH][RFC] Alternate shrink_mmap
  2000-05-04 18:30                   ` Rik van Riel
@ 2000-05-04 20:44                     ` Roger Larsson
  2000-05-04 18:59                       ` Rik van Riel
  0 siblings, 1 reply; 31+ messages in thread
From: Roger Larsson @ 2000-05-04 20:44 UTC (permalink / raw)
  To: riel; +Cc: linux-mm

Yes, I start scanning in the beginning every time - but I do not think
that is so bad here, why?

a) It releases more than one page of the required zone before returning.
b) It should be rather fast to scan.

I have been trying to handle the lockup(!), my best idea is to put in
an artificial page that serves as a cursor...

/RogerL


Rik van Riel wrote:
> 
> On Thu, 4 May 2000, Roger Larsson wrote:
> 
> > I have noticed (not by running - lucky me) that I break this
> > assumption....
> > /*
> >  * NOTE: to avoid deadlocking you must never acquire the pagecache_lock
> > with
> >  *       the pagemap_lru_lock held.
> >  */
> 
> Also, you seem to start scanning at the beginning of the
> list every time, instead of moving the list head around
> so you scan all pages in the list evenly...
> 
> Anyway, I'll use something like your code, but have two
> lists (an active and an inactive list, like the BSD's
> seem to have).
> 
> regards,
> 
> Rik
> --
> The Internet is not a network of computers. It is a network
> of people. That is its real strength.
> 
> Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
> http://www.conectiva.com/               http://www.surriel.com/

--
Home page:
  http://www.norran.net/nra02596/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH][RFC] Alternate shrink_mmap
  2000-05-04 20:44                     ` Roger Larsson
@ 2000-05-04 18:59                       ` Rik van Riel
  2000-05-04 22:29                         ` Roger Larsson
  0 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2000-05-04 18:59 UTC (permalink / raw)
  To: Roger Larsson; +Cc: linux-mm

On Thu, 4 May 2000, Roger Larsson wrote:

> Yes, I start scanning in the beginning every time - but I do not
> think that is so bad here, why?

Because you'll end up scanning the same few pages over and
over again, even if those pages are used all the time and
the pages you want to free are somewhere else in the list.

> a) It releases more than one page of the required zone before returning.
> b) It should be rather fast to scan.
> 
> I have been trying to handle the lockup(!), my best idea is to
> put in an artificial page that serves as a cursor...

You have to "move the list head".

If you do that, you are free to "start at the beginning"
(which has changed) each time...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH][RFC] Alternate shrink_mmap
  2000-05-04 18:59                       ` Rik van Riel
@ 2000-05-04 22:29                         ` Roger Larsson
  0 siblings, 0 replies; 31+ messages in thread
From: Roger Larsson @ 2000-05-04 22:29 UTC (permalink / raw)
  To: riel; +Cc: linux-mm

Rik van Riel wrote:
> 
> On Thu, 4 May 2000, Roger Larsson wrote:
> 
> > Yes, I start scanning in the beginning every time - but I do not
> > think that is so bad here, why?
> 
> Because you'll end up scanning the same few pages over and
> over again, even if those pages are used all the time and
> the pages you want to free are somewhere else in the list.

Not really since priority is decreased too...
Next time the double amount of pages is scanned, and the oldest are
always scanned.

> 
> > a) It releases more than one page of the required zone before returning.
> > b) It should be rather fast to scan.
> >
> > I have been trying to handle the lockup(!), my best idea is to
> > put in an artificial page that serves as a cursor...
> 
> You have to "move the list head".

Hmm,

If the list head is moved your oldest pages will end up at top,
not that good.
I do not want to resort the list for any reason other than
page use!
Currently I try to compile another version of my patch.

I think it has been mentioned before when finding young pages and
moving them up you probably need to scan the whole list.

An interesting and remaining issue:
* What happens if you read a lot of new pages from disk.
  Read only once, but too many to fit in memory...
- Should pages used many times be rewarded?

--
Home page:
  http://www.norran.net/nra02596/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 15:43     ` Rik van Riel
  2000-05-02 16:20       ` Andrea Arcangeli
@ 2000-05-02 18:03       ` Roger Larsson
  2000-05-02 17:37         ` Rik van Riel
  1 sibling, 1 reply; 31+ messages in thread
From: Roger Larsson @ 2000-05-02 18:03 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, linux-mm

Rik van Riel wrote:
> 
> On Tue, 2 May 2000, Roger Larsson wrote:
> 
> > I have been playing with the idea to have a lru for each zone.
> > It should be trivial to do since page contains a pointer to zone.
> >
> > With this change you will shrink_mmap only check among relevant pages.
> > (the caller will need to call shrink_mmap for other zone if call failed)
> 
> That's a very bad idea.

Has it been tested?
I think the problem with searching for a DMA page among lots and lots
of normal and high pages might be worse...

> 
> In this case you can end up constantly cycling through the pages of
> one zone while the pages in another zone remain idle.

Yes you might. But concidering the possible no of pages in each zone,
it might not be that a bad idea.

You usually needs normal pages and there are more normal pages.
You rarely needs DMA pages but there are less.
=> recycle rate might be about the same...

Anyway I think it is up to the caller of shrink_mmap to be intelligent
about which zone it requests.

> 
> Local page replacement is worse than global page replacement and
> has always been...
> 
> regards,
> 
> Rik
> --
> The Internet is not a network of computers. It is a network
> of people. That is its real strength.
> 
> Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
> http://www.conectiva.com/               http://www.surriel.com/
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.rutgers.edu
> Please read the FAQ at http://www.tux.org/lkml/

--
Home page:
  http://www.norran.net/nra02596/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 18:03       ` kswapd @ 60-80% CPU during heavy HD i/o Roger Larsson
@ 2000-05-02 17:37         ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2000-05-02 17:37 UTC (permalink / raw)
  To: Roger Larsson; +Cc: linux-kernel, linux-mm

On Tue, 2 May 2000, Roger Larsson wrote:
> Rik van Riel wrote:
> > On Tue, 2 May 2000, Roger Larsson wrote:
> > 
> > > I have been playing with the idea to have a lru for each zone.
> > > It should be trivial to do since page contains a pointer to zone.
> > >
> > > With this change you will shrink_mmap only check among relevant pages.
> > > (the caller will need to call shrink_mmap for other zone if call failed)
> > 
> > That's a very bad idea.
> 
> Has it been tested?

Yes, and it was quite bad. We ended up only freeing pages from
zones where there was memory pressure, leaving idle pages in the
other zone(s).

> I think the problem with searching for a DMA page among lots and
> lots of normal and high pages might be worse...

It'll cost you some CPU time, but you don't need to do this very
often (and freeing pages on a global basis, up to zone->pages_high
free pages per zone will let __alloc_pages() take care of balancing
the load between zones).

> > In this case you can end up constantly cycling through the pages of
> > one zone while the pages in another zone remain idle.
> 
> Yes you might. But concidering the possible no of pages in each
> zone, it might not be that a bad idea.

So we count the number of inactive pages in every zone, keeping them
at a certain minimum. Problem solved.

> You usually needs normal pages and there are more normal pages.
> You rarely needs DMA pages but there are less.
> => recycle rate might be about the same...

Then again, it might not. Think about a 1GB machine, which has
a 900MB "normal" zone and a ~64MB highmem zone.

> Anyway I think it is up to the caller of shrink_mmap to be
> intelligent about which zone it requests.

That's bull. The only place where we have information about which
page is the best one to free is in the "lru" queue. Splitting the
queue into local queues per zone removes that information.

> > Local page replacement is worse than global page replacement and
> > has always been...

(let me repeat this just in case)

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
@ 2000-05-02 17:26 frankeh
  2000-05-02 17:45 ` Rik van Riel
  2000-05-02 17:53 ` Andrea Arcangeli
  0 siblings, 2 replies; 31+ messages in thread
From: frankeh @ 2000-05-02 17:26 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: riel, Roger Larsson, linux-kernel, linux-mm

It makes sense to me to make the number of pools configurable and not tie
them directly to the number of nodes in a NUMA system.
In particular allow memory pools (i.e. instance of pg_dat_t) to be smaller
than a node size.

The smart things that I see has to happen is to allow a set of processes to
be attached to a set of memory pools and the OS basically enforcing
allocation in those constraints. I brought this up before and I think
Andrea proposed something similar. Allocation should take place in those
pools along the allocation levels based on GFP_MASK, so first allocate on
HIGH along all specified pools and if unsuccessful, then fallback on a
previous level.
With each pool we should associate a kswapd.

Making the size of the pools configurable allows to control the velocity at
which we can swap out. Standard Queuing theory: if we can't get the desired
througput, then increase the number of servers, here kswapd.

Comments...

-- Hubertus

Andrea Arcangeli <andrea@suse.de>@kvack.org on 05/02/2000 12:20:41 PM

Sent by:  owner-linux-mm@kvack.org

To:   riel@nl.linux.org
cc:   Roger Larsson <roger.larsson@norran.net>,
      linux-kernel@vger.rutgers.edu, linux-mm@kvack.org
Subject:  Re: kswapd @ 60-80% CPU during heavy HD i/o.

On Tue, 2 May 2000, Rik van Riel wrote:

>That's a very bad idea.

However the lru_cache have definitely to be per-node and not global as now
in 2.3.99-pre6 and pre7-1 or you won't be able to do the smart things I
was mentining some day ago in linux-mm with NUMA.

My current tree looks like this:

#define LRU_SWAP_CACHE        0
#define LRU_NORMAL_CACHE 1
#define NR_LRU_CACHE          2
typedef struct lru_cache_s {
     struct list_head heads[NR_LRU_CACHE];
     unsigned long nr_cache_pages; /* pages in the lrus */
     unsigned long nr_map_pages; /* pages temporarly out of the lru */
     /* keep lock in a separate cacheline to avoid ping pong in SMP */
     spinlock_t lock ____cacheline_aligned_in_smp;
} lru_cache_t;

struct bootmem_data;
typedef struct pglist_data {
     int nr_zones;
     zone_t node_zones[MAX_NR_ZONES];
     gfpmask_zone_t node_gfpmask_zone[NR_GFPINDEX];
     lru_cache_t lru_cache;
     struct page *node_mem_map;
     unsigned long *valid_addr_bitmap;
     struct bootmem_data *bdata;
     unsigned long node_start_paddr;
     unsigned long node_start_mapnr;
     unsigned long node_size;
     int node_id;
     struct pglist_data *node_next;
     spinlock_t freelist_lock ____cacheline_aligned_in_smp;
} pg_data_t;

Stay tuned...

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 17:26 frankeh
@ 2000-05-02 17:45 ` Rik van Riel
  2000-05-02 17:53 ` Andrea Arcangeli
  1 sibling, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2000-05-02 17:45 UTC (permalink / raw)
  To: frankeh; +Cc: Andrea Arcangeli, Roger Larsson, linux-kernel, linux-mm

On Tue, 2 May 2000 frankeh@us.ibm.com wrote:

> It makes sense to me to make the number of pools configurable
> and not tie them directly to the number of nodes in a NUMA
> system. In particular allow memory pools (i.e. instance of
> pg_dat_t) to be smaller than a node size.

*nod*

We should have different memory zones per node on
Intel-handi^Wequipped NUMA machines.

> The smart things that I see has to happen is to allow a set of processes to
> be attached to a set of memory pools and the OS basically enforcing
> allocation in those constraints. I brought this up before and I think
> Andrea proposed something similar. Allocation should take place in those
> pools along the allocation levels based on GFP_MASK, so first allocate on
> HIGH along all specified pools and if unsuccessful, then fallback on a
> previous level.

That idea is broken if you don't do balancing of VM load between
zones.

> With each pool we should associate a kswapd.

How will local page replacement help you if the node next door
has practically unloaded virtual memory? You need to do global
page replacement of some sort...

> Making the size of the pools configurable allows to control the
> velocity at which we can swap out. Standard Queuing theory: if
> we can't get the desired througput, then increase the number of
> servers, here kswapd.

What we _could_ do is have one (or maybe even a few) kswapds
doing global replacement with io-less and more fine-grained
swap_out() and shrink_mmap() functions, and per-node kswapds
taking care of the IO and maybe even a per-node inactive list
(though that would probably be *bad* for page replacement).

Then again, if your machine can't get the desired throughput,
how would adding kswapds help??? Have you taken a look at
mm/page_alloc.c::alloc_pages()? If kswapd can't keep up, the
biggest memory consumers will help a hand and prevent the
rest of the system from thrashing too much.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 17:26 frankeh
  2000-05-02 17:45 ` Rik van Riel
@ 2000-05-02 17:53 ` Andrea Arcangeli
  1 sibling, 0 replies; 31+ messages in thread
From: Andrea Arcangeli @ 2000-05-02 17:53 UTC (permalink / raw)
  To: frankeh; +Cc: riel, Roger Larsson, linux-kernel, linux-mm

On Tue, 2 May 2000 frankeh@us.ibm.com wrote:

>The smart things that I see has to happen is to allow a set of processes to
>be attached to a set of memory pools and the OS basically enforcing
>allocation in those constraints. I brought this up before and I think
>Andrea proposed something similar. Allocation should take place in those

Yes, that's why I think we need to be able to know the state of the cache
in a single pg_data_t. If 99% of the pg_data_t is _freeable_ cache it
worth to shrink a bit from the cache of _such_ pg_data_t instead of
risking shrinking and then allocating the memory from a foregin pg_data_t
because we respect a global LRU). This can't hurt at all the common non
NUMA case since in the common case of 99% of IA32 boxes out there we have
_one_ only pg_data_t thus the lru keeps to be effectively system-global
for them.

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: kswapd @ 60-80% CPU during heavy HD i/o.
@ 2000-05-02 18:46 frankeh
  0 siblings, 0 replies; 31+ messages in thread
From: frankeh @ 2000-05-02 18:46 UTC (permalink / raw)
  To: riel; +Cc: linux-mm

Hi, Rik...

Rik van Riel <riel@conectiva.com.br> on 05/02/2000 02:15:18 PM

   Please respond to riel@nl.linux.org

   To:  Hubertus Franke/Watson/IBM@IBMUS
   cc:  Andrea Arcangeli <andrea@suse.de>, Roger Larsson
        <roger.larsson@norran.net>, linux-kernel@vger.rutgers.edu,
        linux-mm@kvack.org
   Subject:     Re: kswapd @ 60-80% CPU during heavy HD i/o.

   On Tue, 2 May 2000 frankeh@us.ibm.com wrote:

   > It makes sense to me to make the number of pools configurable
   > and not tie them directly to the number of nodes in a NUMA
   > system. In particular allow memory pools (i.e. instance of
   > pg_dat_t) to be smaller than a node size.

   *nod*

   We should have different memory zones per node on
   Intel-handi^Wequipped NUMA machines.

Wouldn't that be orthogonal....
Anyway, I believe x86 NUMA machines will exist in the future, so I am not
ready to trash them right now, whether I like their architecture or not.

   > The smart things that I see has to happen is to allow a set of
   processes to
   > be attached to a set of memory pools and the OS basically enforcing
   > allocation in those constraints. I brought this up before and I think
   > Andrea proposed something similar. Allocation should take place in
   those
   > pools along the allocation levels based on GFP_MASK, so first allocate
   on
   > HIGH along all specified pools and if unsuccessful, then fallback on a
   > previous level.

   That idea is broken if you don't do balancing of VM load between
   zones.

   > With each pool we should associate a kswapd.

   How will local page replacement help you if the node next door
   has practically unloaded virtual memory? You need to do global
   page replacement of some sort...

You wouldn't balance a zone until you have checked on the same level (e.g.
HIGHMEM) on all the specified nodes. Then and only then you fall back. So
we aren't doing any local page replacement unless I can not satisfy a page
request within the given resource set.
That means something along the following pseudo code

   forall zonelevels
        forall nodes in resource set
             zone = pgdat[node].zones[zonelevel];
             if (zone->free_pages > threshold)
                  alloc_page;
                  return;
             set kswapd_required flag   (kick)

   balance zones;   // couldn't allocate a page in the desired resource set
so start balancing.

Now balancing zones kicks the kswaps or helps out... global balancing can
take place by servicing the pgdat_t with the highest number of kicks...
I think it is ok to have pools with unused memory lying around if a
particular resource set does not include those pools. How else are you
planning to control locality and affinity within memory other than using
resource sets.
We take the same approach in the kernel, for instance we have a minimum
file cache size, because we know that we can increase throughput by doing
so.

   > Making the size of the pools configurable allows to control the
   > velocity at which we can swap out. Standard Queuing theory: if
   > we can't get the desired througput, then increase the number of
   > servers, here kswapd.

   What we _could_ do is have one (or maybe even a few) kswapds
   doing global replacement with io-less and more fine-grained
   swap_out() and shrink_mmap() functions, and per-node kswapds
   taking care of the IO and maybe even a per-node inactive list
   (though that would probably be *bad* for page replacement).

That is workable .......

   Then again, if your machine can't get the desired throughput,
   how would adding kswapds help??? Have you taken a look at
   mm/page_alloc.c::alloc_pages()? If kswapd can't keep up, the
   biggest memory consumers will help a hand and prevent the
   rest of the system from thrashing too much.

Correct...

However, having finer grain pools also allows you to deal with potential
lock contention, which is one of the biggest impedements to scale up.
characteristics of NUMA machines are large memory and large number of CPUs.
This implies that there will be increased lock contention, for instance on
the lock that protects the
memory pool. Also increased lock contention can arise by increased lock
hold time, which I assume is somewhat related to the size of the memory. So
decreasing lock contention time by limiting the number of pages that are
managed per pool could remove an arising bottleneck.

   regards,

   Rik
   --
   The Internet is not a network of computers. It is a network
   of people. That is its real strength.

   Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
   http://www.conectiva.com/        http://www.surriel.com/

regards...

Hubertus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2000-05-04 22:38 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <390E1534.B33FF871@norran.net>
2000-05-01 23:23 ` kswapd @ 60-80% CPU during heavy HD i/o Rik van Riel
2000-05-01 23:33   ` David S. Miller
2000-05-02  0:07     ` Rik van Riel
2000-05-02  0:23       ` David S. Miller
2000-05-02  1:03         ` Rik van Riel
2000-05-02  1:13           ` David S. Miller
2000-05-02  1:31             ` Rik van Riel
2000-05-02  1:51             ` Andrea Arcangeli
2000-05-03 17:11         ` [PATCHlet] " Rik van Riel
2000-05-02  7:56       ` michael
2000-05-02 16:17   ` Roger Larsson
2000-05-02 15:43     ` Rik van Riel
2000-05-02 16:20       ` Andrea Arcangeli
2000-05-02 17:06         ` Rik van Riel
2000-05-02 21:14           ` Stephen C. Tweedie
2000-05-02 21:42             ` Rik van Riel
2000-05-02 22:34               ` Stephen C. Tweedie
2000-05-04 12:37               ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
2000-05-04 14:34                 ` Rik van Riel
2000-05-04 22:38                   ` [PATCH][RFC] Another shrink_mmap Roger Larsson
2000-05-04 15:25                 ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
2000-05-04 18:30                   ` Rik van Riel
2000-05-04 20:44                     ` Roger Larsson
2000-05-04 18:59                       ` Rik van Riel
2000-05-04 22:29                         ` Roger Larsson
2000-05-02 18:03       ` kswapd @ 60-80% CPU during heavy HD i/o Roger Larsson
2000-05-02 17:37         ` Rik van Riel
2000-05-02 17:26 frankeh
2000-05-02 17:45 ` Rik van Riel
2000-05-02 17:53 ` Andrea Arcangeli
2000-05-02 18:46 frankeh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox