From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from cs.utexas.edu (root@cs.utexas.edu [128.83.139.9]) by kvack.org (8.8.7/8.8.7) with ESMTP id VAA16222 for ; Sat, 9 Jan 1999 21:01:36 -0500 Message-Id: <199901100201.UAA14684@feta.cs.utexas.edu> From: "Paul R. Wilson" Date: Sat, 9 Jan 1999 20:01:20 -0600 Subject: eviction from shrink_mmap() (was re: 2.2.0-pre5) Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, riel@humbolt.geo.uu.nl Cc: wilson@cs.utexas.edu List-ID: [Rik:] >while browsing the pre5 patch I saw quite a bit of >VM changes that made a lot of sense, but there was >one statement that really worried me (in vmscan.c) > >+ * NOTE NOTE NOTE! This should just set a >+ * dirty bit in page_map, and just drop the >+ * pte. All the hard work would be done by >+ * shrink_mmap(). >+ * >+ * That would get rid of a lot of problems. >+ */ > >Of course we should never do this since it would mean >we'd loose the benefit of clustered swapout (and >consequently clustered swapin). I think Linus is right about this. I pretty sure it's the right thing to do, if people want to revamp the VM system a bit to make it more elegant. >The only way this could ever be implemented is by >using a linked list of things-to-swap-out that: >- is swapped out in the correct order and resorted > if needs be (to preserve or actually improve the > locality of reference in the swap area) >- can never be longer than X entries, to avoid ending > up in all kinds of nasty situations You are right that it will require something like this, but it shouldn't be hard. Here's the way I'd do it if I was feeling radical (and competent as a kernel hacker). (I'll suggesting something simpler after sketching the righteous thing.) radical version: * mark each page with some kind of unambiguous identifier, maybe the process ID (or some proxy for it, like a dummy inode) and the page offset within the VM address space. * run the kswapd/try_to_free_pages() clock significantly faster, and use the swap cache as an aging area for both clean and dirty pages. * keep a FIFO queue of the stuff in the swap cache, so you don't need the secondary clock in shrink_mmap. (Maybe keep a timestamp on each page frame recording which kswapd clock cycle evicted it to the fifo. This could just be a few bits, and a crude timestamp could help in coordinating the main cache with other caches & prefetching.) * use the FIFO queue as a reservoir of clean or cleanable pages, and clean pages opportunistically when the disk is idle for more than a few jiffies, to build up slack pages (clean, trivially freeable ones in the swap cache) ahead of crunch times * order the pages to be cleaned by whatever heuristic you like, probably including: 1. PID, to avoid mixing pages that were touched by different processes which may only have been accidentally related. 2. approximate time of last touch, to group pages that were touched together at about the same time 3. nearness to each other in the virtual address ordering. I think #2 would work pretty well if the ages were known to about 2 bits of precision, but may not with a simple NUR (1-bit) clock---too many pages all look equally old. That's why I think it's good to run the clock faster and use a FIFO back end, or timestamps, or both. The sorting of pages by these principles can be very, very fast. Sorting by pid can be done by hashing, then sorting by crude time can also be done by hashing. (Or if you don't bother with the timestamps, just grabbing the stuff at the end of the FIFO would do the same thing.) This leaves only sorting by address to a more general sort. That sorting can be a bucket sort, sorting pages by the middle bits of the page numbers first (fast constant time), and then separating out the things whose high bits are different within each bucket. For disk prefetching, you don't need to cluster things that are touched *very* closely together in time. Given the timescales that virtual memory works at, almost any ordering will be as good as any other, so long as you group together things that are touched within a few million instruction cycles of each other---or a few hundred page faults of each other, whichever is worse. A VM prefetch is a very good one if the page you fetch is one the next few hundred pages you would otherwise have faulted on. Details of clustering are very non-critical as long as you have a passable predictor of what's touched "at about the same time" on the timescales of virtual memory replacement. I think the sorting could be done in about 25 to 50 instructions per page, if you really wanted to bum instructions---we do tense things like that for fast compression algorithms, and I can explain how. (It's not at all complicated---you just have to know a couple of tricks.) But that would be overkill. A couple of hundred instructions would be plenty fast, and easy to do cleanly and simply. (Once all this is done, you could dispense with the kswapd clock entirely, and use the shrink_mmap() clock to do all the work. The clock over process address spaces is doing something very good---clustering related pages---but the architecture would be cleaner and more efficient if you just maintained page/process identities and let the swapper use and explicit heuristic for clustering. Dispensing with the kswapd clock would require some kind of inverse pte map, but I really don't think the space costs should be significant---I'd be more worried about the implementation hassles.) This kind of architecture has important applications to per-process page allocation: This kind of arrangement would be very good for intelligent rss tradeoff stuff. To decide which processes should get more or fewer pages, you have to know which ones are touching a lot of pages that have been evicted recently, and which ones are touching a lot of pages that would have been evicted if they had fewer page frames. To do that, you have to know roughly how old old each page is, to a few bits of precision, and what process it belongs to. Then you can keep a couple of counters per process to tell you whether giving it more page frames would cut its fault rate significantly, or whether taking page frames away would increase its fault rate signficantly. This lets you estimate the costs and benefits of taking pages away from a given process, or giving it more, and put the memory where it will do the most good. (We have a paper out on a similar technique that beats LRU at it's own game, even for a single process. You notice when a process is touching more recently-evicted pages than soon-to-be evicted pages, which is where LRU screws up; typically, that's for a loop over more data than will fit in RAM, so that you keep faulting on them shortly after evicting them. In that case, you switch from LRU to something more like MRU---you evict part of the loop early so that other stuff can stay in "late." Having a few bits of age information lets you detect the bad cases for LRU, and avoid using LRU when it would hose you. Mostly our algorithm defaults to LRU, but occasionally notices the worst pitfall of LRU, and neatly avoids it. This is a very hard game to play, because we're taking pages away from one process and giving them back to the *same* process. Arbitrating between multiple processes with significantly different locality characteristics should be a lot easier, and get bigger improvements more often.) Now for the simpler thing (maybe): Without bagging the current two-clock scheme, the first clock could record the order in which it reaches pages, so that the second (shrink_mmap()) clock could use it as a hint about eviction order. I think you could use the buffer list field of the page struct for this. It's always zero for a normal swap page, so you could could add a "PG_has_buffers" flag to the flags field, and free up the buffer list field for ordering hints. This could be just a sequence number, giving the order in which the first clock swept through the page frames, i.e., the kswapd order. In that case, you'd want to run the shrink_map clock fast enough to keep a list of evictable pages, which you could then sort on (approximate) kswapd order. Alternatively, kswapd could actually use the freed-up buffer list field of swappable pages to link swappable pages into an ordered list, which could be used by shrink_mmap() as an ordering directive or ordering hint. The general version could use this ordering for shrink_mmap() to clock over, rather than using the mem_map array order, but that might require a backpointer field, too. (I think it'd be well worth it, because it'd allow the shrink_mmap() clock hand to "follow" the kswapd clock hand at a meaningful distance, giving a definite FIFO effect, and a much more precise notion of page age---much better than any normal clock algorithm can provide.) If you don't want to go that way, you could use just the forward link set by kswapd as a hint to shrink_mmap(), which would be used ONLY when it was about to evict a page, and only to eagerly look for other page frames holding evictable pages in kswapd() order. That is, it would ignore the link field most of the time, but when it decides to evict a page to disk, it would scan ahead by those links to see if it finds other pages that it would normally evict whenever it happened to get to them in its usual scan order. In general, I think that it's majoring in the minors to worry about adding a field or two to the page struct. If adding a link field would improve the architecture by having a sane relationship between kswapd() and shrink_mmap(), so be it. Even adding 4 fields would only increase kernel memory usage by 0.4 percent, which has a very small effect on miss rate. If making the arrangement more sane and elegant enables *any* of several potentially useful improvements to the memory management policy---a better approximation of LRU, smarter clustering, smart rss arbitration between processes, compressed paging, etc.---it's hard to imagine that it wouldn't be worth it. Even if it's not worth it, because none of that stuff gets done and pans out, it just can't cost very much. I think it's worth the risk to make it cleaner. -- This is a majordomo managed list. To unsubscribe, send a message with the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org