Date: Wed, 26 Apr 2000 07:33:55 -0700
Message-Id: <200004261433.HAA13894@pizda.ninka.net>
From: "David S. Miller"
In-reply-to: (message from Rik van Riel on Wed, 26 Apr 2000 10:46:03 -0300 (BRST))
Subject: Re: [PATCH] 2.3.99-pre6-3+ VM rebalancing
Sender: owner-linux-mm@kvack.org
To: riel@nl.linux.org
Cc: sct@redhat.com, sim@stormix.com, jgarzik@mandrakesoft.com,
    andrea@suse.de, linux-mm@kvack.org, bcrl@redhat.com,
    linux-kernel@vger.rutgers.edu

   > I am of the opinion that vmscan.c:swap_out() is one of our
   > biggest problems, because it kills us in the case where a few
   > processes have a pagecache page mapped, haven't accessed it in a
   > long time, and swap_out doesn't unmap those pages in time for
   > the LRU shrink_mmap code to fully toss it.

   Please take a look at the patch I sent to the list a few minutes
   ago. The "anti-hog" code, using swap_out() as a primary mechanism
   for achieving its goal, seems to bring some amazing results ...
   for one, memory hogs no longer have a big performance impact on
   small processes.

It's a nice change considering we are so close to 2.4.x, but long term
I still contend that swap_out is a hack that should die soon.

   I believe that it will be pretty much impossible to achieve
   "fair" balancing with any VM code which weighs all pages the
   same.

   And before you start crying that all pages should be weighed the
   same to protect the performance of that important memory hogging
   server process, the fact that it'll be the only process waiting
   for disk and that its pages are aged better often make the memory
   hog run faster as well! ;)

Let's start at square one.  I have never suggested that we weigh all
pages the same.  Global aging of all pages, on the other hand, is
something completely different.  It doesn't treat all pages the same;
it treats them all differently based upon how people are using the page.
Inactive pages are inactive pages are inactive pages, regardless of
who has used them or what they are being used specifically for.

Let me give a rough outline of what kind of paging algorithm I am
suggesting:

check_page_references(page)
{
	if (TestAndClearReferenced(page) ||
	    (page->mapping->i_mmap &&
	     test_and_clear_pgtable_references(page)))
		return 1;
	return 0;
}

populate_inactive_list(long goal_pages)
{
	for each active lru page {
		if (! check_page_references(page)) {
			add_to_inactive_lru(page);
			if (--goal_pages <= 0)
				break;
		} else {
			add page back to head of active lru
		}
	}
}

free_inactive_pages(long goal_pages)
{
	for each inactive lru page {
		if (check_page_references(page)) {
			add page back to active lru
		} else if (page_dirty_somewhere(page)) {
			add page to head of dirty lru
		} else {
			if (page->buffers) {
				... deal with them just like current code ...
			}
			mapping = page->mapping;
			if (! mapping->a_ops->try_to_free_page(page)) {
				add page to head of inactive lru
			} else {
				if (--goal_pages <= 0)
					break;
			}
		}
	}
}

swap_out_dirty_pages(long goal_pages)
{
	for each dirty lru page {
		try to swap it out, you get the idea
	}
}

Some salient points about what is missing in this suggested
infrastructure:

1) There is no metric mentioned for handling pages that reactivate
   themselves often (i.e. get referenced while they are on the
   inactive list).  One is certainly needed.

   A simple scheme would be a counter in the page struct which we
   increment (up to some MAX value) when the page gets moved back to
   the active list from the inactive list.  Then the inactive list
   population decrements this counter when it finds the page
   unreferenced, and only if the counter comes down to zero does it
   actually move the page to the inactive list.

   Some more heuristics could be added to this simple scheme, such as
   adding the number of references seen at a reactivation event to
   this counter.
2) There are no methods mentioned to control when we actually do the
   page table walks, if and when to delete the user mappings for an
   inactive page to get the counts down to just the mapping owning it,
   etc.  These sorts of heuristics would be needed for a well-tuned
   implementation.

Next, let's assume we have the above and the general try_to_free_pages
toplevel code does something like:

try_to_free_pages()
{
	goal = number_of_freepages_we_would_like_to_have - nr_free_pages;
	free_inactive_pages(goal);
	populate_inactive_list(sysctl_inactive_list_goal /* or whatever */);
	if (nr_free_pages >= number_of_freepages_we_would_like_to_have)
		return;

	goal = number_of_freepages_we_would_like_to_have - nr_free_pages;
	swap_out_dirty_pages(goal);
}

[ AMAZING, astute readers will notice that all of this looks
  suspiciously familiar to sys/vm/vm_paging.c in the FreeBSD
  sources, and this is not by accident.  Sometimes I wonder if
  I am the only person who went and checked out what they were doing
  when the accusations went flying around that our paging sucks. ]

You get the idea.  Next, we have kswapd wake up periodically to
just do populate_inactive_list() runs to keep the inactive lru list
ready to go at the onset of future paging.  Of course, kswapd also
forces try_to_free_pages runs when memory starts to run low, just like
it does now.

Now what will a scheme like the above (remember, swap and anonymous
pages are in these LRU queues too) do in the memory hog case you
mentioned?

The big problem I have with the memory hog hacks is that they need to
classify _processes_ to work effectively in some set of cases, when
what we are really concerned about is classification of _pages_.  The
system just does this naturally by setting dirty/referenced state on
the page->flags and the ptes which map those pages.

See?
The global LRU scheme dynamically figures out what page usage is
like; it doesn't need to classify processes in a certain way, because
the per-page reference and dirty state will drive the page liberation
to just do the right thing.

The anon layer I posted earlier today also allows us to provide the
strict swap reservation people cry for from time to time: since we
track all anonymous pages, we can do a "nr_swap_pages--" check and
fail if it would hit zero.  The only hard part about this would be
adding a way to specify the boot time swap device before the first
process is executed.  Alternatively, we could just ignore this issue
and only worry about swap space reservation when the swap is actually
enabled during the init scripts.

Later,
David S. Miller
davem@redhat.com
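
The reactivation counter from point 1 above can be sketched in a few
lines of C.  This is only an illustration of the described heuristic,
not kernel code: struct page_sketch, REACTIVATE_MAX and both function
names are hypothetical, and a plain int stands in for the referenced
bit that check_page_references() would test and clear.

```c
#include <assert.h>

/* Hypothetical cap on the counter ("some MAX value" in the text). */
#define REACTIVATE_MAX 4

/* Hypothetical stand-in for the relevant parts of the page struct. */
struct page_sketch {
	int referenced;     /* stand-in for the referenced state */
	int reactivations;  /* counter described in point 1 */
	int on_inactive;    /* 0 = active list, 1 = inactive list */
};

/* An inactive page was found referenced: move it back to the
 * active list and remember that it reactivated itself. */
static void reactivate_page(struct page_sketch *p)
{
	p->on_inactive = 0;
	if (p->reactivations < REACTIVATE_MAX)
		p->reactivations++;
	assert(p->reactivations <= REACTIVATE_MAX);
}

/* One inactive-list population step for a single active page.
 * Returns 1 if the page was moved to the inactive list. */
static int age_active_page(struct page_sketch *p)
{
	if (p->referenced) {
		/* Recently used: clear the state, keep it active. */
		p->referenced = 0;
		return 0;
	}
	/* Unreferenced: pay down the counter first. */
	if (p->reactivations > 0)
		p->reactivations--;
	if (p->reactivations > 0)
		return 0;
	p->on_inactive = 1;
	return 1;
}
```

With this sketch, a page that has bounced back from the inactive list
twice survives one unreferenced scan and is only deactivated on the
second, while a page that has never reactivated is deactivated on the
first unreferenced scan.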