* Re: kswapd @ 60-80% CPU during heavy HD i/o.
       [not found] <390E1534.B33FF871@norran.net>
@ 2000-05-01 23:23 ` Rik van Riel
  2000-05-01 23:33   ` David S. Miller
  2000-05-02 16:17   ` Roger Larsson
  0 siblings, 2 replies; 27+ messages in thread
From: Rik van Riel @ 2000-05-01 23:23 UTC (permalink / raw)
  To: Roger Larsson; +Cc: linux-kernel, linux-mm

On Tue, 2 May 2000, Roger Larsson wrote:

> I think there are some problems in the current (pre7-1) shrink_mmap.
>
> 1) "Random" resorting for zone with free_pages > pages_high
>    while loop searches from the end of the list.
>    old pages on non memory pressure zones are disposed as 'young'.
>    Young pages are put in front, like recently touched ones.
>    This results in a random resort for these pages.

Not doing this would result in having to scan the same "wrong zone"
pages over and over again, possibly never reaching the pages we do
want to free.

> 2) The implemented algorithm results in a lot of list operations -
>    each scanned page is deleted from the list.

*nod*

Maybe it's better to scan the list and leave it unchanged, doing
second chance replacement on it like we do in 2.2 ... or even 2
or 3 bit aging?

That way we only have to scan and do none of the expensive list
operations. Sorting doesn't make much sense anyway since we put
most pages on the list in an essentially random order...

> 3) The list is supposed to be small - it is not...

Who says the list is supposed to be small?

> 4) Count is only decreased for suitable pages, but is related
>    to total pages.

Not doing this resulted in being unable to free the "right" pages,
even if they are there on the list (just beyond where we stopped
scanning) and killing a process with out of memory errors.

> 5) Returns on first fully successful page. Rescan from beginning
>    at next call to get another one... (not that bad since pages
>    are moved to the end)

Well, it *is* bad since we'll end up scanning all the pages in
&old; (and trying to free them again, which probably fails just
like it did last time). The more I think about it, the more I think
we want to go to a second chance algorithm where we don't change
the list (except to remove pages from the list).

We can simply "move" the list_head when we're done scanning and
continue from where we left off last time. That way we'll be much
less cpu intensive and scan all pages fairly.

Using not one but 2 or 3 bits for aging the pages can result in
something closer to lru and cheaper than the scheme we have now.

What do you (and others) think about this idea?

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-01 23:23 ` kswapd @ 60-80% CPU during heavy HD i/o Rik van Riel
@ 2000-05-01 23:33   ` David S. Miller
  2000-05-02  0:07     ` Rik van Riel
  2000-05-02 16:17   ` Roger Larsson
  1 sibling, 1 reply; 27+ messages in thread
From: David S. Miller @ 2000-05-01 23:33 UTC (permalink / raw)
  To: riel; +Cc: roger.larsson, linux-kernel, linux-mm

   We can simply "move" the list_head when we're done scanning and
   continue from where we left off last time. That way we'll be much
   less cpu intensive and scan all pages fairly.

   Using not one but 2 or 3 bits for aging the pages can result in
   something closer to lru and cheaper than the scheme we have now.

   What do you (and others) think about this idea?

Why not have two lists, an active and an inactive list.  As reference
bits clear, pages move to the inactive list.  If a reference bit stays
clear up until when the page moves up to the head of the inactive
list, we then try to free it.  For the active list, you do the "move
the list head" technique.

So you have two passes, one populates the inactive list, the next
inspects the inactive list for pages to free up.  The toplevel
shrink_mmap scheme can look something like:

	free_unreferenced_pages_in_inactive_list();
	repopulate_inactive_list();

And you define some heuristics to decide how populated you wish to
try to keep the inactive list.

Next, during periods of inactivity you have kswapd or some other
daemon periodically (once every 5 seconds, something like this)
perform an inactive list population run.

The inactive lru population can be done cheaply, using the above
ideas, roughly like:

	LIST_HEAD(inactive_queue);
	struct list_head * active_scan_point = &lru_active;

	for_each_active_lru_page() {
		if (!test_and_clear_referenced(page)) {
			list_del(entry);
			list_add(entry, &inactive_queue);
		} else
			active_scan_point = entry;
	}

	list_splice(&inactive_queue, &lru_inactive);
	list_head_move(&lru_active, active_scan_point);

This way you only do list manipulations for actual work done
(ie. moving inactive page candidates to the inactive list).

I may try to toss together an example implementation, but feel
free to beat me to it :-)

Later,
David S. Miller
davem@redhat.com
* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-01 23:33 ` David S. Miller
@ 2000-05-02  0:07   ` Rik van Riel
  2000-05-02  0:23     ` David S. Miller
  2000-05-02  7:56     ` michael
  0 siblings, 2 replies; 27+ messages in thread
From: Rik van Riel @ 2000-05-02 0:07 UTC (permalink / raw)
  To: David S. Miller; +Cc: roger.larsson, linux-kernel, linux-mm

On Mon, 1 May 2000, David S. Miller wrote:

>    Date: Mon, 1 May 2000 20:23:43 -0300 (BRST)
>    From: Rik van Riel <riel@conectiva.com.br>
>
>    We can simply "move" the list_head when we're done scanning and
>    continue from where we left off last time. That way we'll be much
>    less cpu intensive and scan all pages fairly.
>
>    Using not one but 2 or 3 bits for aging the pages can result in
>    something closer to lru and cheaper than the scheme we have now.
>
>    What do you (and others) think about this idea?
>
> Why not have two lists, an active and an inactive list.  As reference
> bits clear, pages move to the inactive list.  If a reference bit stays
> clear up until when the page moves up to the head of the inactive
> list, we then try to free it.  For the active list, you do the "move
> the list head" technique.

Sounds like a winning idea. Well, we also want to keep mapped
pages on the active list...

> And you define some heuristics to decide how populated you wish to
> try to keep the inactive list.

We can aim/tune the inactive list size for 25% reclaims by the
original applications and 75% page stealing for "use" by the free
list. If we have far too many reclaims we can shrink the list; if
we have too few reclaims we can grow the inactive list (and scan
the active list more aggressively).

This should also "catch" IO intensive applications, by moving a
lot of stuff to the inactive list quickly.

> Next, during periods of inactivity you have kswapd or some other
> daemon periodically (once every 5 seconds, something like this)
> perform an inactive list population run.

*nod*

We should scan the inactive list and move all reactivated pages
back to the active list and then repopulate the inactive list.
Alternatively, all "reactivation" actions (mapping a page back
into the application, __find_page_nolock(), etc...) should put
pages back onto the active queue.

(and repopulate the active queue whenever we go below the low
watermark, which is a fraction of the dynamically tuned high
watermark)

> The inactive lru population can be done cheaply, using the above
> ideas, roughly like:
>
> 	LIST_HEAD(inactive_queue);
> 	struct list_head * active_scan_point = &lru_active;
>
> 	for_each_active_lru_page() {
> 		if (!test_and_clear_referenced(page)) {

I'd like to add the "if (!page->buffers && atomic_read(&page->count) > 1)"
test to this, since there is no way to free those pages and they
may well have "hidden" referenced bits in their page table entries...

> 			list_del(entry);
> 			list_add(entry, &inactive_queue);
> 		} else
> 			active_scan_point = entry;
> 	}
>
> 	list_splice(&inactive_queue, &lru_inactive);
> 	list_head_move(&lru_active, active_scan_point);
>
> This way you only do list manipulations for actual work done
> (ie. moving inactive page candidates to the inactive list).
>
> I may try to toss together an example implementation, but feel
> free to beat me to it :-)

If you have the time to spare, feel free to go ahead, but since
I'm working on this stuff full-time now I see no real reason you
should waste^Wspend your time on this ... there must be something
to do in the network layer or another more critical/subtle place
of the kernel.

cheers,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/
* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02  0:07 ` Rik van Riel
@ 2000-05-02  0:23   ` David S. Miller
  2000-05-02  1:03     ` Rik van Riel
  2000-05-03 17:11     ` [PATCHlet] " Rik van Riel
  1 sibling, 2 replies; 27+ messages in thread
From: David S. Miller @ 2000-05-02 0:23 UTC (permalink / raw)
  To: riel; +Cc: roger.larsson, linux-kernel, linux-mm

BTW, what loop are you trying to "continue;" out of here?

+	do {
		if (tsk->need_resched)
			schedule();
		if ((!zone->size) || (!zone->zone_wake_kswapd))
			continue;
		do_try_to_free_pages(GFP_KSWAPD, zone);
+	} while (zone->free_pages < zone->pages_low &&
+		 --count);

:-) Just add a "next_zone:" label at the end of that code and
change the continue; to a goto next_zone;

Later,
David S. Miller
davem@redhat.com
* Re: kswapd @ 60-80% CPU during heavy HD i/o. 2000-05-02 0:23 ` David S. Miller @ 2000-05-02 1:03 ` Rik van Riel 2000-05-02 1:13 ` David S. Miller 2000-05-03 17:11 ` [PATCHlet] " Rik van Riel 1 sibling, 1 reply; 27+ messages in thread From: Rik van Riel @ 2000-05-02 1:03 UTC (permalink / raw) To: David S. Miller; +Cc: roger.larsson, linux-kernel, linux-mm On Mon, 1 May 2000, David S. Miller wrote: > BTW, what loop are you trying to "continue;" out of here? > > + do { > if (tsk->need_resched) > schedule(); > if ((!zone->size) || (!zone->zone_wake_kswapd)) > continue; > do_try_to_free_pages(GFP_KSWAPD, zone); > + } while (zone->free_pages < zone->pages_low && > + --count); > > :-) Just add a "next_zone:" label at the end of that code and > change the continue; to a goto next_zone; I want kswapd to continue with freeing pages from this zone if there aren't enough free pages in this zone. This is needed because kswapd used to stop freeing pages even if we were below pages_min... (leading to out of memory situations when it wasn't needed, or to dozens or even hundreds of extra context switches / extra swapin latency when we call balance_zones from alloc_pages, etc) Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02  1:03 ` Rik van Riel
@ 2000-05-02  1:13   ` David S. Miller
  2000-05-02  1:31     ` Rik van Riel
  2000-05-02  1:51     ` Andrea Arcangeli
  0 siblings, 2 replies; 27+ messages in thread
From: David S. Miller @ 2000-05-02 1:13 UTC (permalink / raw)
  To: riel; +Cc: roger.larsson, linux-kernel, linux-mm

   On Mon, 1 May 2000, David S. Miller wrote:

   > BTW, what loop are you trying to "continue;" out of here?
   >
   > +	do {
   > 		if (tsk->need_resched)
   > 			schedule();
   > 		if ((!zone->size) || (!zone->zone_wake_kswapd))
   > 			continue;
   > 		do_try_to_free_pages(GFP_KSWAPD, zone);
   > +	} while (zone->free_pages < zone->pages_low &&
   > +		 --count);
   >
   > :-) Just add a "next_zone:" label at the end of that code and
   > change the continue; to a goto next_zone;

   I want kswapd to continue with freeing pages from this zone if
   there aren't enough free pages in this zone. This is needed
   because kswapd used to stop freeing pages even if we were below
   pages_min...

Rik, zone_wake_kswapd implies this information, via what
__free_pages_ok does to that flag.

I see it like this:

	if "!zone->size || !zone->zone_wake_kswapd"
	then
		zone->free_pages >= zone->pages_high by implication

Therefore when the continue happens, the loop will currently just
execute:

	if (!zone->size || !zone->zone_wake_kswapd)
		continue;
	...
	} while (zone->free_pages < zone->pages_low &&

and the while condition is false and therefore will branch out of
the loop.

__free_pages_ok _always_ clears the zone_wake_kswapd flag when
zone->free_pages goes beyond zone->pages_high.

Later,
David S. Miller
davem@redhat.com
* Re: kswapd @ 60-80% CPU during heavy HD i/o. 2000-05-02 1:13 ` David S. Miller @ 2000-05-02 1:31 ` Rik van Riel 2000-05-02 1:51 ` Andrea Arcangeli 1 sibling, 0 replies; 27+ messages in thread From: Rik van Riel @ 2000-05-02 1:31 UTC (permalink / raw) To: David S. Miller; +Cc: roger.larsson, linux-kernel, linux-mm On Mon, 1 May 2000, David S. Miller wrote: > From: Rik van Riel <riel@conectiva.com.br> > On Mon, 1 May 2000, David S. Miller wrote: > > > BTW, what loop are you trying to "continue;" out of here? > > > > + do { > > if (tsk->need_resched) > > schedule(); > > if ((!zone->size) || (!zone->zone_wake_kswapd)) > > continue; > > do_try_to_free_pages(GFP_KSWAPD, zone); > > + } while (zone->free_pages < zone->pages_low && > > + --count); > > > > :-) Just add a "next_zone:" label at the end of that code and > > change the continue; to a goto next_zone; > > I want kswapd to continue with freeing pages from this zone if > there aren't enough free pages in this zone. This is needed > because kswapd used to stop freeing pages even if we were below > pages_min... > > Rik, zone_wake_kswapd implies this information, via what > __free_pages_ok does to that flag. Indeed, I should have moved the test for zone->zone_wake_kswapd to before the loop. But using zone->zone_wake_kswapd for the test isn't really enough since that is only turned off if zone->free_pages reaches zone->pages_high, but we probably don't want to do agressive swapout when we're already above zone->pages_low ... (just background swapping that happens incidentally when we're swapping stuff for other zones) regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02  1:13 ` David S. Miller
  2000-05-02  1:31   ` Rik van Riel
@ 2000-05-02  1:51   ` Andrea Arcangeli
  1 sibling, 0 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2000-05-02 1:51 UTC (permalink / raw)
  To: David S. Miller; +Cc: riel, roger.larsson, linux-kernel, linux-mm

Actually I think you missed that pgdat_list is a queue and is not
NULL-terminated. I fixed this in my classzone patch of last week,
in this chunk:

	@@ -507,9 +529,8 @@
	 	unsigned long i, j;
	 	unsigned long map_size;
	 	unsigned long totalpages, offset, realtotalpages;
	-	unsigned int cumulative = 0;
	-	pgdat->node_next = pgdat_list;
	+	pgdat->node_next = NULL;

however that's not enough without the thing I'm doing in
kswapd_can_sleep(), again in the classzone patch.

Note that my latest classzone patch had a few minor bugs. These
last days and today I worked on getting mapped pages out of the
lru and splitting the lru in two pieces, since the swap cache is
lower priority and has to be shrunk first. Doing these things is
giving smooth swap behaviour. I'm incremental with the classzone
patch.

My current tree works rock solid, but I forgot a little design
detail ;). If a mapped page has anonymous buffers on it, it has
to _stay_ on the lru, otherwise the bh headers will become
unfreeable and so I can basically leak memory. Once this little
bit is fixed (and it's not a trivial bit if you think about it)
I'll post the patch where the above and other things are fixed.

It should be fully orthogonal (at least conceptually) with your
anon.c stuff since all the new code lives in the lru_cache domain.

Andrea
* [PATCHlet] Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02  0:23 ` David S. Miller
  2000-05-02  1:03   ` Rik van Riel
@ 2000-05-03 17:11   ` Rik van Riel
  1 sibling, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2000-05-03 17:11 UTC (permalink / raw)
  To: David S. Miller; +Cc: roger.larsson, linux-kernel, linux-mm, Linus Torvalds

On Mon, 1 May 2000, David S. Miller wrote:

> BTW, what loop are you trying to "continue;" out of here?
>
> +	do {
> 		if (tsk->need_resched)
> 			schedule();
> 		if ((!zone->size) || (!zone->zone_wake_kswapd))
> 			continue;
> 		do_try_to_free_pages(GFP_KSWAPD, zone);
> +	} while (zone->free_pages < zone->pages_low &&
> +		 --count);

Ughhhhh. And the worst part is that it took me a few _days_ to
figure out ;)

Anyway, the fix for this small buglet is attached. I'll continue
working on the active/inactive lists (per pgdat!); if I haven't
sent in the active/inactive list thing for the next prepatch, it
would be nice to have this small fix applied.

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--- vmscan.c.orig	Wed May  3 10:51:36 2000
+++ vmscan.c	Wed May  3 13:00:00 2000
@@ -528,15 +528,15 @@
 	pgdat = pgdat_list;
 	while (pgdat) {
 		for (i = 0; i < MAX_NR_ZONES; i++) {
-			int count = SWAP_CLUSTER_MAX;
-			zone = pgdat->node_zones + i;
-			do {
-				if (tsk->need_resched)
-					schedule();
+			int count = SWAP_CLUSTER_MAX;
+			zone = pgdat->node_zones + i;
 			if ((!zone->size) || (!zone->zone_wake_kswapd))
 				continue;
-			do_try_to_free_pages(GFP_KSWAPD, zone);
-		} while (zone->free_pages < zone->pages_low &&
+			do {
+				if (tsk->need_resched)
+					schedule();
+				do_try_to_free_pages(GFP_KSWAPD, zone);
+			} while (zone->free_pages < zone->pages_low &&
 				--count);
 		}
 		pgdat = pgdat->node_next;
* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02  0:07 ` Rik van Riel
  2000-05-02  0:23   ` David S. Miller
@ 2000-05-02  7:56   ` michael
  1 sibling, 0 replies; 27+ messages in thread
From: michael @ 2000-05-02 7:56 UTC (permalink / raw)
  To: riel; +Cc: David S. Miller, roger.larsson, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:
> On Mon, 1 May 2000, David S. Miller wrote:
> > Why not have two lists, an active and an inactive list.  As reference
> > bits clear, pages move to the inactive list.  If a reference bit stays
> > clear up until when the page moves up to the head of the inactive
> > list, we then try to free it.  For the active list, you do the "move
> > the list head" technique.
[...]
> We should scan the inactive list and move all reactivated pages
> back to the active list and then repopulate the inactive list.
> Alternatively, all "reactivation" actions (mapping a page back
> into the application, __find_page_nolock(), etc...) should put
> pages back onto the active queue.
[..]
> > The inactive lru population can be done cheaply, using the above
> > ideas, roughly like:
> >
> > 	LIST_HEAD(inactive_queue);
> > 	struct list_head * active_scan_point = &lru_active;
> >
> > 	for_each_active_lru_page() {
> > 		if (!test_and_clear_referenced(page)) {

Tiny comment: You'd probably be better off waking up more
frequently, and just processing a bit of the active page queue.
I.e. every 1/10th of a second, walk 2% of the active queue. This
would give you something closer to LRU, and smooth the load, yes?

Or, I guess that could be: every 1/10th of a second, walk as much
of the active queue as is needed to refill the inactive list,
starting from where you left off last time. So if nothing is
consuming out of the inactive queue, we effectively stop walking.
(This is basically pure clock.)

Michael.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-01 23:23 ` kswapd @ 60-80% CPU during heavy HD i/o Rik van Riel
  2000-05-01 23:33   ` David S. Miller
@ 2000-05-02 16:17   ` Roger Larsson
  2000-05-02 15:43     ` Rik van Riel
  1 sibling, 1 reply; 27+ messages in thread
From: Roger Larsson @ 2000-05-02 16:17 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, linux-mm

Hi,

I have been playing with the idea to have a lru for each zone.
It should be trivial to do since page contains a pointer to zone.

With this change shrink_mmap will only check among relevant pages
(the caller will need to call shrink_mmap for another zone if the
call failed).

With this change you probably do not need to move pages to young,
and can get by without modifying the list.

I think keeping active/inactive (= generational) lists is also an
interesting proposal. But since it is orthogonal, both methods can
be used!

/RogerL

Rik van Riel wrote:
>
> On Tue, 2 May 2000, Roger Larsson wrote:
>
> > I think there are some problems in the current (pre7-1) shrink_mmap.
> >
> > 1) "Random" resorting for zone with free_pages > pages_high
> >    while loop searches from the end of the list.
> >    old pages on non memory pressure zones are disposed as 'young'.
> >    Young pages are put in front, like recently touched ones.
> >    This results in a random resort for these pages.
>
> Not doing this would result in having to scan the same "wrong zone"
> pages over and over again, possibly never reaching the pages we do
> want to free.
>
> > 2) The implemented algorithm results in a lot of list operations -
> >    each scanned page is deleted from the list.
>
> *nod*
>
> Maybe it's better to scan the list and leave it unchanged, doing
> second chance replacement on it like we do in 2.2 ... or even 2
> or 3 bit aging?
>
> That way we only have to scan and do none of the expensive list
> operations. Sorting doesn't make much sense anyway since we put
> most pages on the list in an essentially random order...
> > > 3) The list is supposed to be small - it is not... > > Who says the list is supposed to be small? > > > 4) Count is only decreased for suitable pages, but is related > > to total pages. > > Not doing this resulted in being unable to free the "right" pages, > even if they are there on the list (just beyond where we stopped > scanning) and killing a process with out of memory errors. > > > 5) Returns on first fully successful page. Rescan from beginning > > at next call to get another one... (not that bad since pages > > are moved to the end) > > Well, it *is* bad since we'll end up scanning all the pages in > &old; (and trying to free them again, which probably fails just > like it did last time). The more I think about it, the more I think > we want to go to a second chance algorithm where we don't change > the list (except to remove pages from the list). > > We can simply "move" the list_head when we're done scanning and > continue from where we left off last time. That way we'll be much > less cpu intensive and scan all pages fairly. > > Using not one but 2 or 3 bits for aging the pages can result in > something closer to lru and cheaper than the scheme we have now. > > What do you (and others) think about this idea? > > regards, > > Rik > -- > The Internet is not a network of computers. It is a network > of people. That is its real strength. > > Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies > http://www.conectiva.com/ http://www.surriel.com/ > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.rutgers.edu > Please read the FAQ at http://www.tux.org/lkml/ -- Home page: http://www.norran.net/nra02596/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: kswapd @ 60-80% CPU during heavy HD i/o. 2000-05-02 16:17 ` Roger Larsson @ 2000-05-02 15:43 ` Rik van Riel 2000-05-02 16:20 ` Andrea Arcangeli 2000-05-02 18:03 ` kswapd @ 60-80% CPU during heavy HD i/o Roger Larsson 0 siblings, 2 replies; 27+ messages in thread From: Rik van Riel @ 2000-05-02 15:43 UTC (permalink / raw) To: Roger Larsson; +Cc: linux-kernel, linux-mm On Tue, 2 May 2000, Roger Larsson wrote: > I have been playing with the idea to have a lru for each zone. > It should be trivial to do since page contains a pointer to zone. > > With this change you will shrink_mmap only check among relevant pages. > (the caller will need to call shrink_mmap for other zone if call failed) That's a very bad idea. In this case you can end up constantly cycling through the pages of one zone while the pages in another zone remain idle. Local page replacement is worse than global page replacement and has always been... regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 15:43 ` Rik van Riel
@ 2000-05-02 16:20   ` Andrea Arcangeli
  2000-05-02 17:06     ` Rik van Riel
  1 sibling, 1 reply; 27+ messages in thread
From: Andrea Arcangeli @ 2000-05-02 16:20 UTC (permalink / raw)
  To: riel; +Cc: Roger Larsson, linux-kernel, linux-mm

On Tue, 2 May 2000, Rik van Riel wrote:

> That's a very bad idea.

However the lru_cache definitely has to be per-node and not global
as now in 2.3.99-pre6 and pre7-1, or you won't be able to do the
smart things I was mentioning some days ago in linux-mm with NUMA.

My current tree looks like this:

	#define LRU_SWAP_CACHE		0
	#define LRU_NORMAL_CACHE	1
	#define NR_LRU_CACHE		2

	typedef struct lru_cache_s {
		struct list_head heads[NR_LRU_CACHE];
		unsigned long nr_cache_pages; /* pages in the lrus */
		unsigned long nr_map_pages; /* pages temporarily out of the lru */
		/* keep lock in a separate cacheline to avoid ping pong in SMP */
		spinlock_t lock ____cacheline_aligned_in_smp;
	} lru_cache_t;

	struct bootmem_data;
	typedef struct pglist_data {
		int nr_zones;
		zone_t node_zones[MAX_NR_ZONES];
		gfpmask_zone_t node_gfpmask_zone[NR_GFPINDEX];
		lru_cache_t lru_cache;
		struct page *node_mem_map;
		unsigned long *valid_addr_bitmap;
		struct bootmem_data *bdata;
		unsigned long node_start_paddr;
		unsigned long node_start_mapnr;
		unsigned long node_size;
		int node_id;
		struct pglist_data *node_next;
		spinlock_t freelist_lock ____cacheline_aligned_in_smp;
	} pg_data_t;

Stay tuned...

Andrea
* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 16:20 ` Andrea Arcangeli
@ 2000-05-02 17:06   ` Rik van Riel
  2000-05-02 21:14     ` Stephen C. Tweedie
  0 siblings, 1 reply; 27+ messages in thread
From: Rik van Riel @ 2000-05-02 17:06 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Roger Larsson, linux-kernel, linux-mm

On Tue, 2 May 2000, Andrea Arcangeli wrote:
> On Tue, 2 May 2000, Rik van Riel wrote:
>
> > That's a very bad idea.
>
> However the lru_cache definitely has to be per-node and not
> global as now in 2.3.99-pre6 and pre7-1, or you won't be able to
> do the smart things I was mentioning some days ago in linux-mm
> with NUMA.

How do you want to take care of global page balancing with this
"optimisation"?

If you cannot find a good answer to that, you'd better not spend
too much time implementing any of this...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/
* Re: kswapd @ 60-80% CPU during heavy HD i/o. 2000-05-02 17:06 ` Rik van Riel @ 2000-05-02 21:14 ` Stephen C. Tweedie 2000-05-02 21:42 ` Rik van Riel 0 siblings, 1 reply; 27+ messages in thread From: Stephen C. Tweedie @ 2000-05-02 21:14 UTC (permalink / raw) To: riel; +Cc: Andrea Arcangeli, Roger Larsson, linux-kernel, linux-mm Hi, On Tue, May 02, 2000 at 02:06:20PM -0300, Rik van Riel wrote: > > do the smart things I was mentining some day ago in linux-mm > > with NUMA. > > How do you want to take care of global page balancing with > this "optimisation"? You don't. With NUMA, the memory is inherently unbalanced, and you don't want the allocator to smooth over the different nodes. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 21:14 ` Stephen C. Tweedie
@ 2000-05-02 21:42   ` Rik van Riel
  2000-05-02 22:34     ` Stephen C. Tweedie
  2000-05-04 12:37     ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
  1 sibling, 2 replies; 27+ messages in thread
From: Rik van Riel @ 2000-05-02 21:42 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrea Arcangeli, Roger Larsson, linux-kernel, linux-mm

On Tue, 2 May 2000, Stephen C. Tweedie wrote:
> On Tue, May 02, 2000 at 02:06:20PM -0300, Rik van Riel wrote:
> > > do the smart things I was mentioning some days ago in linux-mm
> > > with NUMA.
> >
> > How do you want to take care of global page balancing with
> > this "optimisation"?
>
> You don't. With NUMA, the memory is inherently unbalanced, and you
> don't want the allocator to smooth over the different nodes.

Ermmm, a few days ago (yesterday?) you told me on irc that we
needed to balance between zones ... maybe we need some way to
measure "memory load" on a zone and only allocate from a
different NUMA zone if:

	local_load          remote_load
	---------- >= ----------------------------
	    1.0       load penalty for local->remote

(or something more or less like this ... only use one of the
nodes one hop away if the remote load is <90% of the local load,
70% for two hops, 30% for > 2 hops ...)

We could use the scavenge list in combination with more or less
balanced page reclamation to determine memory load on the
different nodes...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/
* Re: kswapd @ 60-80% CPU during heavy HD i/o. 2000-05-02 21:42 ` Rik van Riel @ 2000-05-02 22:34 ` Stephen C. Tweedie 2000-05-04 12:37 ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson 1 sibling, 0 replies; 27+ messages in thread From: Stephen C. Tweedie @ 2000-05-02 22:34 UTC (permalink / raw) To: riel Cc: Stephen C. Tweedie, Andrea Arcangeli, Roger Larsson, linux-kernel, linux-mm Hi, On Tue, May 02, 2000 at 06:42:31PM -0300, Rik van Riel wrote: > On Tue, 2 May 2000, Stephen C. Tweedie wrote: > Ermmm, a few days ago (yesterday?) you told me on irc that we > needed to balance between zones ... On a single NUMA node, definitely. We need balance between all of the zones which may be in use in a specific allocation class. NUMA is a very different issue. _Some_ memory pressure between nodes is necessary: if one node is completely out of memory then we may have to start allocating memory on other nodes to tasks tied to the node under pressure. But in the normal case, you really do want NUMA memory classes to be as independent of each other as possible. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH][RFC] Alternate shrink_mmap 2000-05-02 21:42 ` Rik van Riel 2000-05-02 22:34 ` Stephen C. Tweedie @ 2000-05-04 12:37 ` Roger Larsson 2000-05-04 14:34 ` Rik van Riel 2000-05-04 15:25 ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson 1 sibling, 2 replies; 27+ messages in thread From: Roger Larsson @ 2000-05-04 12:37 UTC (permalink / raw) To: riel; +Cc: linux-mm [-- Attachment #1: Type: text/plain, Size: 296 bytes --] Hi all, Here is an alternative shrink_mmap. It tries to touch the list as little as possible (only young pages are moved) And tries to be quick. Comments please. It compiles but I have not dared to run it yet... (My biggest patch yet) /RogerL -- Home page: http://www.norran.net/nra02596/ [-- Attachment #2: patch-2.3-shrink_mmap --] [-- Type: text/plain, Size: 7789 bytes --] diff -Naur linux-2.3-pre7+/mm/filemap.c linux-2.3/mm/filemap.c --- linux-2.3-pre7+/mm/filemap.c Mon May 1 21:41:10 2000 +++ linux-2.3/mm/filemap.c Thu May 4 13:30:36 2000 @@ -237,143 +237,149 @@ { int ret = 0, count; LIST_HEAD(young); - LIST_HEAD(old); LIST_HEAD(forget); struct list_head * page_lru, * dispose; struct page * page = NULL; struct zone_struct * p_zone; - + + /* This could be removed. + * NULL translates to: fulfill all zone requests. */ if (!zone) BUG(); count = nr_lru_pages >> priority; - if (!count) - return ret; spin_lock(&pagemap_lru_lock); again: - /* we need pagemap_lru_lock for list_del() ... 
subtle code below */ - while (count > 0 && (page_lru = lru_cache.prev) != &lru_cache) { - page = list_entry(page_lru, struct page, lru); - list_del(page_lru); - p_zone = page->zone; - - /* This LRU list only contains a few pages from the system, - * so we must fail and let swap_out() refill the list if - * there aren't enough freeable pages on the list */ - - /* The page is in use, or was used very recently, put it in - * &young to make sure that we won't try to free it the next - * time */ - dispose = &young; - if (test_and_clear_bit(PG_referenced, &page->flags)) - goto dispose_continue; - - if (p_zone->free_pages > p_zone->pages_high) - goto dispose_continue; - - if (!page->buffers && page_count(page) > 1) - goto dispose_continue; - - count--; - /* Page not used -> free it or put it on the old list - * so it gets freed first the next time */ - dispose = &old; - if (TryLockPage(page)) - goto dispose_continue; - - /* Release the pagemap_lru lock even if the page is not yet - queued in any lru queue since we have just locked down - the page so nobody else may SMP race with us running - a lru_cache_del() (lru_cache_del() always run with the - page locked down ;). */ - spin_unlock(&pagemap_lru_lock); - - /* avoid freeing the page while it's locked */ - get_page(page); - - /* Is it a buffer page? */ - if (page->buffers) { - if (!try_to_free_buffers(page)) - goto unlock_continue; - /* page was locked, inode can't go away under us */ - if (!page->mapping) { - atomic_dec(&buffermem_pages); - goto made_buffer_progress; - } - } - - /* Take the pagecache_lock spinlock held to avoid - other tasks to notice the page while we are looking at its - page count. If it's a pagecache-page we'll free it - in one atomic transaction after checking its page count. */ - spin_lock(&pagecache_lock); - - /* - * We can't free pages unless there's just one user - * (count == 2 because we added one ourselves above). 
- */ - if (page_count(page) != 2) - goto cache_unlock_continue; - - /* - * Is it a page swap page? If so, we want to - * drop it if it is no longer used, even if it - * were to be marked referenced.. - */ - if (PageSwapCache(page)) { - spin_unlock(&pagecache_lock); - __delete_from_swap_cache(page); - goto made_inode_progress; - } - - /* is it a page-cache page? */ - if (page->mapping) { - if (!PageDirty(page) && !pgcache_under_min()) { - remove_page_from_inode_queue(page); - remove_page_from_hash_queue(page); - page->mapping = NULL; - spin_unlock(&pagecache_lock); - goto made_inode_progress; - } - goto cache_unlock_continue; - } - - dispose = &forget; - printk(KERN_ERR "shrink_mmap: unknown LRU page!\n"); - + for (page_lru = lru_cache.prev; + count-- && page_lru != &lru_cache; + page_lru = page_lru->prev) { + page = list_entry(page_lru, struct page, lru); + p_zone = page->zone; + + + /* Check if zone has pressure, most pages would continue here. + * Also pages from zones that initially were under pressure */ + if (!p_zone->zone_wake_kswapd) + continue; + + /* Can't do anything about this... */ + if (!page->buffers && page_count(page) > 1) + continue; + + /* Page not used -> free it + * If it could not be locked it is somehow in use + * try another time */ + if (TryLockPage(page)) + continue; + + /* Ok, a possible page. + * Note: can't unlock lru if we do we will have + * to restart this loop */ + + /* The page is in use, or was used very recently, put it in + * &young to make it unlikely that we will try to free it the next + * time */ + dispose = &young; + if (test_and_clear_bit(PG_referenced, &page->flags)) + goto dispose_continue; + + + /* avoid freeing the page while it's locked [RL???] */ + get_page(page); + + /* If it can not be freed here it is unlikely to + * be freed at next attempt. */ + dispose = NULL; + + /* Is it a buffer page? 
*/ + if (page->buffers) { + if (!try_to_free_buffers(page)) + goto unlock_continue; + /* page was locked, inode can't go away under us */ + if (!page->mapping) { + atomic_dec(&buffermem_pages); + goto made_buffer_progress; + } + } + + + /* Take the pagecache_lock spinlock held to avoid + other tasks to notice the page while we are looking at its + page count. If it's a pagecache-page we'll free it + in one atomic transaction after checking its page count. */ + spin_lock(&pagecache_lock); + + /* + * We can't free pages unless there's just one user + * (count == 2 because we added one ourselves above). + */ + if (page_count(page) != 2) + goto cache_unlock_continue; + + /* + * Is it a page swap page? If so, we want to + * drop it if it is no longer used, even if it + * were to be marked referenced.. + */ + if (PageSwapCache(page)) { + spin_unlock(&pagecache_lock); + __delete_from_swap_cache(page); + goto made_inode_progress; + } + + /* is it a page-cache page? */ + if (page->mapping) { + if (!PageDirty(page) && !pgcache_under_min()) { + remove_page_from_inode_queue(page); + remove_page_from_hash_queue(page); + page->mapping = NULL; + spin_unlock(&pagecache_lock); + goto made_inode_progress; + } + goto cache_unlock_continue; + } + + dispose = &forget; + printk(KERN_ERR "shrink_mmap: unknown LRU page!\n"); + cache_unlock_continue: - spin_unlock(&pagecache_lock); + spin_unlock(&pagecache_lock); unlock_continue: - spin_lock(&pagemap_lru_lock); - UnlockPage(page); - put_page(page); - list_add(page_lru, dispose); - continue; + /* never released... 
spin_lock(&pagemap_lru_lock); */ + UnlockPage(page); + put_page(page); + if (dispose == NULL) /* only forget should end up here - predicted taken */ + continue; - /* we're holding pagemap_lru_lock, so we can just loop again */ dispose_continue: - list_add(page_lru, dispose); - } - goto out; + list_del(page_lru); + list_add(page_lru, dispose); + continue; made_inode_progress: - page_cache_release(page); + page_cache_release(page); made_buffer_progress: - UnlockPage(page); - put_page(page); - ret = 1; - spin_lock(&pagemap_lru_lock); - /* nr_lru_pages needs the spinlock */ - nr_lru_pages--; - - /* wrong zone? not looped too often? roll again... */ - if (page->zone != zone && count) - goto again; + UnlockPage(page); + put_page(page); + ret++; + /* never unlocked... spin_lock(&pagemap_lru_lock); */ + /* nr_lru_pages needs the spinlock */ + list_del(page_lru); + nr_lru_pages--; + + /* Might (and should) have been done by free calls + * p_zone->zone_wake_kswapd = 0; + */ + + /* If no more pages are needed to release on the specifically + requested zone consider it done! + Note: zone might be NULL to make all requests fulfilled */ + if (p_zone == zone && !p_zone->zone_wake_kswapd) + break; + } -out: list_splice(&young, &lru_cache); - list_splice(&old, lru_cache.prev); spin_unlock(&pagemap_lru_lock); ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] Alternate shrink_mmap 2000-05-04 12:37 ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson @ 2000-05-04 14:34 ` Rik van Riel 2000-05-04 22:38 ` [PATCH][RFC] Another shrink_mmap Roger Larsson 2000-05-04 15:25 ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson 1 sibling, 1 reply; 27+ messages in thread From: Rik van Riel @ 2000-05-04 14:34 UTC (permalink / raw) To: Roger Larsson; +Cc: linux-mm On Thu, 4 May 2000, Roger Larsson wrote: > Here is an alternative shrink_mmap. > It tries to touch the list as little as possible > (only young pages are moved) I will use something like this in the active/inactive queue thing. The major differences will be that: - we won't be reclaiming memory in the first queue (only from the inactive queue) - we'll try to keep a minimum number of active and inactive pages in every zone - we will probably have a (per pg_dat) self-tuning target for the number of inactive pages regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
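The active/inactive queue design Rik outlines above can be modelled in a few lines. This is a toy model with invented names, not the code that later went into the kernel: reclaim only ever touches the inactive set, and the inactive set is refilled from old, unreferenced active pages, while referenced pages get a second chance and merely lose their bit.

```c
#include <assert.h>

/* Toy model of active/inactive page queues (all names invented here).
 * state: 0 = active, 1 = inactive, 2 = freed.
 * Array order stands in for LRU order, index 0 being the oldest. */
struct toy_page {
	int referenced;
	int state;
};

static int count_state(const struct toy_page *p, int n, int st)
{
	int i, c = 0;
	for (i = 0; i < n; i++)
		if (p[i].state == st)
			c++;
	return c;
}

/* Move old, unreferenced active pages to the inactive queue until it
 * holds 'target' pages; referenced pages stay active but lose their
 * referenced bit (second-chance). */
static void refill_inactive(struct toy_page *p, int n, int target)
{
	int i;
	for (i = 0; i < n && count_state(p, n, 1) < target; i++) {
		if (p[i].state != 0)
			continue;
		if (p[i].referenced)
			p[i].referenced = 0;	/* second chance */
		else
			p[i].state = 1;
	}
}

/* Reclaim up to 'want' pages, looking only at the inactive queue;
 * a page referenced while inactive is promoted back to active. */
static int reclaim(struct toy_page *p, int n, int want)
{
	int i, got = 0;
	for (i = 0; i < n && got < want; i++) {
		if (p[i].state != 1)
			continue;
		if (p[i].referenced) {
			p[i].referenced = 0;
			p[i].state = 0;	/* promote back to active */
			continue;
		}
		p[i].state = 2;	/* free it */
		got++;
	}
	return got;
}
```

The per-zone minimum of inactive pages Rik mentions corresponds to calling refill_inactive with a per-zone target before reclaiming.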
* [PATCH][RFC] Another shrink_mmap 2000-05-04 14:34 ` Rik van Riel @ 2000-05-04 22:38 ` Roger Larsson 0 siblings, 0 replies; 27+ messages in thread From: Roger Larsson @ 2000-05-04 22:38 UTC (permalink / raw) To: riel; +Cc: linux-mm Hi all, Here is another shrink_mmap. This time lock handling should be better. It tries to touch the list as little as possible. Only young pages are moved, probable pages are replaced by a cursor - makes it possible to release the pagemap_lru_lock. And it tries to be quick. Comments please. It compiles but I have not dared to run it yet... (My biggest patch yet, included straight text since I do not dare to run it, I shouldn't tempt you...) /RogerL -- Home page: http://www.norran.net/nra02596/ --- static zone_t null_zone; int shrink_mmap(int priority, int gfp_mask, zone_t *zone) { int ret = 0, count; LIST_HEAD(young); LIST_HEAD(forget); struct list_head * page_lru, * cursor, * dispose; struct page * page = NULL; struct zone_struct * p_zone; struct page cursor_page; /* unique by thread, too much on stack? */ /* This could be removed. * NULL translates to: fulfill all zone requests. */ if (!zone) BUG(); count = nr_lru_pages >> priority; cursor = &cursor_page.lru; cursor_page.zone = &null_zone; spin_lock(&pagemap_lru_lock); /* cursor always part of the list, but not a real page... * make a special page that points to a special zone * with zone_wake_kswapd always 0 * - some more thoughts required... */ list_add_tail(cursor, &lru_cache); for (page_lru = lru_cache.prev; count-- && page_lru != &lru_cache; page_lru = page_lru->prev) { /* Avoid processing our own cursor... * Note: check not needed with page cursor. * if (page_lru == cursor) * continue; */ page = list_entry(page_lru, struct page, lru); p_zone = page->zone; /* Check if zone has pressure, most pages would continue here. * Also pages from zones that initially were under pressure */ if (!p_zone->zone_wake_kswapd) continue; /* Can't do anything about this... 
*/ if (!page->buffers && page_count(page) > 1) continue; /* Page not used -> free it * If it could not be locked it is somehow in use * try another time */ if (TryLockPage(page)) continue; /* Ok, a possible page. * Note: can't unlock lru if we do we will have * to restart this loop */ /* The page is in use, or was used very recently, put it in * &young to make it unlikely that we will try to free it the next * time */ dispose = &young; if (test_and_clear_bit(PG_referenced, &page->flags)) goto dispose_continue; /* cursor takes page_lru's place in lru_list * if disposed later it ends up at the same place! * Note: compilers should be able to optimize this a bit... */ list_del(cursor); list_add_tail(cursor, page_lru); list_del(page_lru); spin_unlock(&pagemap_lru_lock); /* Spinlock is released, anything might happen to the list! * But the cursor will remain on spot. * - it will not be deleted from outside, * no one knows about it. * - it will not be deleted by another shrink_mmap, * zone_wake_kswapd == 0 */ /* If page is redisposed after attempt, place it at the same spot */ dispose = cursor; /* avoid freeing the page while it's locked */ get_page(page); /* Is it a buffer page? */ if (page->buffers) { if (!try_to_free_buffers(page)) goto unlock_continue; /* page was locked, inode can't go away under us */ if (!page->mapping) { atomic_dec(&buffermem_pages); goto made_buffer_progress; } } /* Take the pagecache_lock spinlock held to avoid other tasks to notice the page while we are looking at its page count. If it's a pagecache-page we'll free it in one atomic transaction after checking its page count. */ spin_lock(&pagecache_lock); /* * We can't free pages unless there's just one user * (count == 2 because we added one ourselves above). */ if (page_count(page) != 2) goto cache_unlock_continue; /* * Is it a page swap page? If so, we want to * drop it if it is no longer used, even if it * were to be marked referenced.. 
*/ if (PageSwapCache(page)) { spin_unlock(&pagecache_lock); __delete_from_swap_cache(page); goto made_inode_progress; } /* is it a page-cache page? */ if (page->mapping) { if (!PageDirty(page) && !pgcache_under_min()) { remove_page_from_inode_queue(page); remove_page_from_hash_queue(page); page->mapping = NULL; spin_unlock(&pagecache_lock); goto made_inode_progress; } goto cache_unlock_continue; } dispose = &forget; printk(KERN_ERR "shrink_mmap: unknown LRU page!\n"); cache_unlock_continue: spin_unlock(&pagecache_lock); unlock_continue: spin_lock(&pagemap_lru_lock); UnlockPage(page); put_page(page); dispose_continue: list_add(page_lru, dispose); /* final disposition to other list than lru? */ /* then return list index to old lru-list position */ if (dispose != cursor) page_lru = cursor; continue; made_inode_progress: page_cache_release(page); made_buffer_progress: UnlockPage(page); put_page(page); ret++; spin_lock(&pagemap_lru_lock); /* nr_lru_pages needs the spinlock */ nr_lru_pages--; /* Might (and should) have been done by free calls * p_zone->zone_wake_kswapd = 0; */ /* If no more pages are needed to release on the specifically requested zone consider it done! Note: zone might be NULL to make all requests fulfilled */ if (p_zone == zone && !p_zone->zone_wake_kswapd) break; /* Back to cursor position to ensure correct next step */ page_lru = cursor; } list_splice(&young, &lru_cache); list_del(cursor); spin_unlock(&pagemap_lru_lock); return ret; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
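The cursor idea in the patch above (parking a dummy element in the list so that the lock could be dropped while the current element is processed, and the scan resumed from the same spot) can be shown with a self-contained toy list. This is not the kernel's list_head API; every name below is invented for the sketch.

```c
#include <assert.h>

/* Toy circular doubly linked list with a sentinel head, mimicking the
 * cursor trick from the patch: when a node is taken off the list for
 * processing, the cursor is linked in where it was, so the traversal
 * can continue from that exact position afterwards. */
struct node {
	struct node *prev, *next;
	int val;
};

static void list_init(struct node *h)
{
	h->prev = h->next = h;
}

static void list_add_before(struct node *n, struct node *pos)
{
	n->prev = pos->prev;
	n->next = pos;
	pos->prev->next = n;
	pos->prev = n;
}

static void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Delete every node with an even value, traversing tail->head like
 * shrink_mmap does, keeping a cursor where each victim used to be. */
static int drop_even(struct node *head, struct node *cursor)
{
	int dropped = 0;
	struct node *p = head->prev;	/* start at the tail */

	list_add_before(cursor, head);	/* park the cursor at the tail */
	while (p != head) {
		if (p == cursor) {	/* defensive: skip our own cursor */
			p = p->prev;
			continue;
		}
		if (p->val % 2 == 0) {
			/* cursor takes the victim's place in the list */
			list_del(cursor);
			list_add_before(cursor, p);
			list_del(p);
			/* ...a list lock could be dropped here while the
			 * victim is freed; the cursor keeps our place... */
			dropped++;
			p = cursor->prev;	/* resume from the cursor */
		} else {
			p = p->prev;
		}
	}
	list_del(cursor);
	return dropped;
}
```

Because the cursor is a node that no other scanner recognizes as a real page (in the patch: a dummy page whose zone never has zone_wake_kswapd set), nobody else will unlink it while the lock is dropped.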
* Re: [PATCH][RFC] Alternate shrink_mmap 2000-05-04 12:37 ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson 2000-05-04 14:34 ` Rik van Riel @ 2000-05-04 15:25 ` Roger Larsson 2000-05-04 18:30 ` Rik van Riel 1 sibling, 1 reply; 27+ messages in thread From: Roger Larsson @ 2000-05-04 15:25 UTC (permalink / raw) To: riel, linux-mm Hi all, I have noticed (not by running - lucky me) that I break this assumption.... /* * NOTE: to avoid deadlocking you must never acquire the pagecache_lock with * the pagemap_lru_lock held. */ /RogerL Roger Larsson wrote: > > Hi all, > > Here is an alternative shrink_mmap. > It tries to touch the list as little as possible > (only young pages are moved) > > And tries to be quick. > > Comments please. > > It compiles but I have not dared to run it yet... > (My biggest patch yet) > > /RogerL > > -- > Home page: > http://www.norran.net/nra02596/ > > [patch snipped - identical to the patch posted upthread] -- Home page: http://www.norran.net/nra02596/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] Alternate shrink_mmap 2000-05-04 15:25 ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson @ 2000-05-04 18:30 ` Rik van Riel 2000-05-04 20:44 ` Roger Larsson 0 siblings, 1 reply; 27+ messages in thread From: Rik van Riel @ 2000-05-04 18:30 UTC (permalink / raw) To: Roger Larsson; +Cc: linux-mm On Thu, 4 May 2000, Roger Larsson wrote: > I have noticed (not by running - lucky me) that I break this > assumption.... > /* > * NOTE: to avoid deadlocking you must never acquire the pagecache_lock > with > * the pagemap_lru_lock held. > */ Also, you seem to start scanning at the beginning of the list every time, instead of moving the list head around so you scan all pages in the list evenly... Anyway, I'll use something like your code, but have two lists (an active and an inactive list, like the BSD's seem to have). regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] Alternate shrink_mmap 2000-05-04 18:30 ` Rik van Riel @ 2000-05-04 20:44 ` Roger Larsson 2000-05-04 18:59 ` Rik van Riel 0 siblings, 1 reply; 27+ messages in thread From: Roger Larsson @ 2000-05-04 20:44 UTC (permalink / raw) To: riel; +Cc: linux-mm Yes, I start scanning at the beginning every time - but I do not think that is so bad here. Why? a) It releases more than one page of the required zone before returning. b) It should be rather fast to scan. I have been trying to handle the lockup(!), my best idea is to put in an artificial page that serves as a cursor... /RogerL Rik van Riel wrote: > > On Thu, 4 May 2000, Roger Larsson wrote: > > > I have noticed (not by running - lucky me) that I break this > > assumption.... > > /* > > * NOTE: to avoid deadlocking you must never acquire the pagecache_lock > > with > > * the pagemap_lru_lock held. > > */ > > Also, you seem to start scanning at the beginning of the > list every time, instead of moving the list head around > so you scan all pages in the list evenly... > > Anyway, I'll use something like your code, but have two > lists (an active and an inactive list, like the BSD's > seem to have). > > regards, > > Rik > -- > The Internet is not a network of computers. It is a network > of people. That is its real strength. > > Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies > http://www.conectiva.com/ http://www.surriel.com/ -- Home page: http://www.norran.net/nra02596/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH][RFC] Alternate shrink_mmap 2000-05-04 20:44 ` Roger Larsson @ 2000-05-04 18:59 ` Rik van Riel 2000-05-04 22:29 ` Roger Larsson 0 siblings, 1 reply; 27+ messages in thread From: Rik van Riel @ 2000-05-04 18:59 UTC (permalink / raw) To: Roger Larsson; +Cc: linux-mm On Thu, 4 May 2000, Roger Larsson wrote: > Yes, I start scanning in the beginning every time - but I do not > think that is so bad here, why? Because you'll end up scanning the same few pages over and over again, even if those pages are used all the time and the pages you want to free are somewhere else in the list. > a) It releases more than one page of the required zone before returning. > b) It should be rather fast to scan. > > I have been trying to handle the lockup(!), my best idea is to > put in an artificial page that serves as a cursor... You have to "move the list head". If you do that, you are free to "start at the beginning" (which has changed) each time... regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
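Rik's "move the list head" suggestion is essentially a clock hand: remember where the last scan stopped and continue from there, so repeated calls cover the whole set evenly instead of rescanning the same head pages. A minimal array-based sketch (invented names, not kernel code):

```c
#include <assert.h>

/* A clock hand over n entries: each call scans 'count' entries
 * starting where the previous call stopped, so no entry is scanned
 * twice before every other entry has been scanned once. */
static int hand;	/* persists between calls, like a moved list head */

static int scan_some(int *scanned, int n, int count)
{
	int done;

	for (done = 0; done < count; done++) {
		scanned[hand]++;		/* examine the entry under the hand */
		hand = (hand + 1) % n;	/* advance; wraps around fairly */
	}
	return done;
}
```

With a circular list instead of an array, "advancing the hand" is just relinking the list head one position forward, which is what moving the list head amounts to.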
* Re: [PATCH][RFC] Alternate shrink_mmap 2000-05-04 18:59 ` Rik van Riel @ 2000-05-04 22:29 ` Roger Larsson 0 siblings, 0 replies; 27+ messages in thread From: Roger Larsson @ 2000-05-04 22:29 UTC (permalink / raw) To: riel; +Cc: linux-mm Rik van Riel wrote: > > On Thu, 4 May 2000, Roger Larsson wrote: > > > Yes, I start scanning in the beginning every time - but I do not > > think that is so bad here, why? > > Because you'll end up scanning the same few pages over and > over again, even if those pages are used all the time and > the pages you want to free are somewhere else in the list. Not really, since priority is decreased too... Next time twice as many pages are scanned, and the oldest are always scanned. > > > a) It releases more than one page of the required zone before returning. > > b) It should be rather fast to scan. > > > > I have been trying to handle the lockup(!), my best idea is to > > put in an artificial page that serves as a cursor... > > You have to "move the list head". Hmm, if the list head is moved, your oldest pages will end up at the top, and that is not good. I do not want to resort the list for any reason other than page use! Currently I try to compile another version of my patch. I think it has been mentioned before: when finding young pages and moving them up, you probably need to scan the whole list. An interesting remaining issue: * What happens if you read a lot of new pages from disk. Read only once, but too many to fit in memory... - Should pages used many times be rewarded? -- Home page: http://www.norran.net/nra02596/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
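Roger's closing question (should pages used many times be rewarded?) is what the 2-3 bit page aging Rik mentions earlier in the thread addresses. A sketch with invented constants: each reference adds to a small age counter, each reclaim scan decays it, and only pages at age 0 are candidates for freeing, so a streamed-in, used-once page ages out after a couple of scans while a repeatedly referenced page keeps a positive age.

```c
#include <assert.h>

/* Sketch of small-counter page aging (constants invented here):
 * a reference bumps the age, each reclaim scan halves it, and only
 * pages at age 0 are candidates for freeing. */
#define AGE_MAX 7	/* fits in three bits */

static int age_touch(int age)
{
	return age + 3 > AGE_MAX ? AGE_MAX : age + 3;	/* reward a reference */
}

static int age_scan(int age)
{
	return age >> 1;	/* exponential decay once per scan */
}
```

A use-once page goes 0 -> 3 -> 1 -> 0 over two scans and becomes reclaimable; a page touched before every scan settles at a small positive age and is never reclaimed, which is exactly the reward for repeated use that the mail asks about.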
* Re: kswapd @ 60-80% CPU during heavy HD i/o. 2000-05-02 15:43 ` Rik van Riel 2000-05-02 16:20 ` Andrea Arcangeli @ 2000-05-02 18:03 ` Roger Larsson 2000-05-02 17:37 ` Rik van Riel 1 sibling, 1 reply; 27+ messages in thread From: Roger Larsson @ 2000-05-02 18:03 UTC (permalink / raw) To: riel; +Cc: linux-kernel, linux-mm Rik van Riel wrote: > > On Tue, 2 May 2000, Roger Larsson wrote: > > > I have been playing with the idea to have a lru for each zone. > > It should be trivial to do since page contains a pointer to zone. > > > > With this change you will shrink_mmap only check among relevant pages. > > (the caller will need to call shrink_mmap for other zone if call failed) > > That's a very bad idea. Has it been tested? I think the problem with searching for a DMA page among lots and lots of normal and high pages might be worse... > > In this case you can end up constantly cycling through the pages of > one zone while the pages in another zone remain idle. Yes you might. But considering the possible number of pages in each zone, it might not be that bad an idea. You usually need normal pages, and there are more normal pages. You rarely need DMA pages, but there are fewer. => recycle rate might be about the same... Anyway I think it is up to the caller of shrink_mmap to be intelligent about which zone it requests. > > Local page replacement is worse than global page replacement and > has always been... > > regards, > > Rik > -- > The Internet is not a network of computers. It is a network > of people. That is its real strength. > > Wanna talk about the kernel? 
irc.openprojects.net / #kernelnewbies > http://www.conectiva.com/ http://www.surriel.com/ > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.rutgers.edu > Please read the FAQ at http://www.tux.org/lkml/ -- Home page: http://www.norran.net/nra02596/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: kswapd @ 60-80% CPU during heavy HD i/o.
  2000-05-02 18:03       ` kswapd @ 60-80% CPU during heavy HD i/o Roger Larsson
@ 2000-05-02 17:37         ` Rik van Riel
  0 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2000-05-02 17:37 UTC (permalink / raw)
  To: Roger Larsson; +Cc: linux-kernel, linux-mm

On Tue, 2 May 2000, Roger Larsson wrote:
> Rik van Riel wrote:
> > On Tue, 2 May 2000, Roger Larsson wrote:
> >
> > > I have been playing with the idea of having an LRU for each zone.
> > > It should be trivial to do since the page contains a pointer to its zone.
> > >
> > > With this change, shrink_mmap will only check among relevant pages.
> > > (the caller will need to call shrink_mmap for another zone if the call failed)
> >
> > That's a very bad idea.
>
> Has it been tested?

Yes, and it was quite bad. We ended up only freeing pages from
zones where there was memory pressure, leaving idle pages in the
other zone(s).

> I think the problem of searching for a DMA page among lots and
> lots of normal and high pages might be worse...

It'll cost you some CPU time, but you don't need to do this very
often (and freeing pages on a global basis, up to zone->pages_high
free pages per zone, will let __alloc_pages() take care of balancing
the load between zones).

> > In this case you can end up constantly cycling through the pages of
> > one zone while the pages in another zone remain idle.
>
> Yes, you might. But considering the possible number of pages in
> each zone, it might not be that bad an idea.

So we count the number of inactive pages in every zone, keeping
them at a certain minimum. Problem solved.

> You usually need normal pages, and there are more normal pages;
> you rarely need DMA pages, but there are fewer of them.
> => the recycle rate might be about the same...

Then again, it might not. Think about a 1GB machine, which has a
900MB "normal" zone and a ~64MB highmem zone.

> Anyway, I think it is up to the caller of shrink_mmap to be
> intelligent about which zone it requests.

That's bull. The only place where we have information about which
page is the best one to free is the "lru" queue. Splitting the
queue into local queues per zone removes that information.

> > Local page replacement is worse than global page replacement and
> > has always been...

(let me repeat this just in case)

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/	http://www.surriel.com/
end of thread, other threads:[~2000-05-04 22:38 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <390E1534.B33FF871@norran.net>
2000-05-01 23:23 ` kswapd @ 60-80% CPU during heavy HD i/o Rik van Riel
2000-05-01 23:33 ` David S. Miller
2000-05-02 0:07 ` Rik van Riel
2000-05-02 0:23 ` David S. Miller
2000-05-02 1:03 ` Rik van Riel
2000-05-02 1:13 ` David S. Miller
2000-05-02 1:31 ` Rik van Riel
2000-05-02 1:51 ` Andrea Arcangeli
2000-05-03 17:11 ` [PATCHlet] " Rik van Riel
2000-05-02 7:56 ` michael
2000-05-02 16:17 ` Roger Larsson
2000-05-02 15:43 ` Rik van Riel
2000-05-02 16:20 ` Andrea Arcangeli
2000-05-02 17:06 ` Rik van Riel
2000-05-02 21:14 ` Stephen C. Tweedie
2000-05-02 21:42 ` Rik van Riel
2000-05-02 22:34 ` Stephen C. Tweedie
2000-05-04 12:37 ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
2000-05-04 14:34 ` Rik van Riel
2000-05-04 22:38 ` [PATCH][RFC] Another shrink_mmap Roger Larsson
2000-05-04 15:25 ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
2000-05-04 18:30 ` Rik van Riel
2000-05-04 20:44 ` Roger Larsson
2000-05-04 18:59 ` Rik van Riel
2000-05-04 22:29 ` Roger Larsson
2000-05-02 18:03 ` kswapd @ 60-80% CPU during heavy HD i/o Roger Larsson
2000-05-02 17:37 ` Rik van Riel