From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Mon, 12 Aug 2002 13:58:17 -0400 Subject: Re: Broad questions about the current design Content-Type: text/plain; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v482) From: Scott Kaplan Content-Transfer-Encoding: 7bit In-Reply-To: Message-Id: <147C8BD2-AE1D-11D6-8D07-000393829FA4@cs.amherst.edu> Sender: owner-linux-mm@kvack.org Return-Path: To: linux-mm@kvack.org List-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Monday, August 12, 2002, at 05:13 AM, Daniel Phillips wrote: > On Friday 09 August 2002 17:12, Scott Kaplan wrote: >> 1) What happened to page ages? I found them in 2.4.0, but they're >> gone by 2.4.19, and remain gone in 2.5.30. > > The jury is still out as to whether aging or LRU is the better page > replacement policy, and to date, no formal comparisons have been done. Okay, so *that* explains when and how it happened. Thank you. As to ``which is better'': At the least, it's a very hard question to answer, not only because testing VM policies is difficult, but also because this policy is managing pages that are being put to very different uses. From a VM perspective, LRU tends to be the better idea, and frequency information, while always tempting, is generally a bad idea with one notable exception -- In those cases where LRU performs poorly, frequency allows the replacement policy to deviate from LRU. As it happens, just about *anything* that isn't LRU performs better in those cases, so there's nothing laudable about using frequency information in such situations, as other non-LRU approaches work, too. Since page aging is only partly frequency based, it may be that its benefits are exactly these cases where just about anything that deviates from LRU will help. For filesystem caching, the picture is less clear. Some studies have shown frequency to be a genuinely good idea, as file access patterns exhibit strong regularities for which LRU performs poorly. While I think that even those studies are oversimplifying the problem, frequency could be a decent approach. Since the Linux MM system manages both VM pages and filesystem cache pages together, its hard to say how those pools compete, and which policy is a better choice. I certainly think that something LRU-like is going to be more stable and predictable, failing in cases that people understand pretty well. > This code implements the LRU on the active list: > > http://lxr.linux.no/source/mm/vmscan.c?v=2.5.28#L349: > > 349 if (page->pte.chain && page_referenced(page)) { > 350 list_del(&page->lru); > 351 list_add(&page->lru, &active_list); > 352 pte_chain_unlock(page); > 353 continue; > 354 } > > Yes, it was supposed to be LRU but as you point out, it's merely a clock. > It would be an LRU if the list deletion and reinsertion occured directly > in > try_to_swap_out, but there the page referenced bit is merely set. I asked > Andrea why he did not do this and he wasn't sure, but he thought that > maybe > the way he did it was more efficient. I'm a bit confused by these comments, so maybe you can help me out a bit. While I agree that it would be possible to move pages to the front of the active list in try_to_swap_out() rather than setting their reference bits, I don't think that change would make this an LRU policy. There are only two ways to achieve a true LRU ordering: 1) Trap into the kernel on every reference, moving the referenced page to the front immediately. (Obviously, the overhead here would be absurd.) 2) Use hardware that timestamps each page frame with the time of each reference, allowing you to discover the order of last reference. No chip does this, of course, since it's not worth the hardware or the cycles to examine the timestamps. In other words, by the time try_to_swap_out() runs, it is possible to discover which pages have been used lately and move them to the front, but the order of last reference among those pages is already lost. It's not a true LRU ordering. Critically, true LRU orderings aren't worth much. That the active list is managed via CLOCK is totally appropriate and desirable. The entire purpose of CLOCK is that it is an approximation of LRU -- and, in fact, a very good one -- that doesn't incur the overhead needed for true LRU. I can't think of any reason to try to make the active list more LRU-like, as there's no real benefit to be gained. > For any page that is explicitly touched, e.g., by file IO, we use > activate_page, which moves the page to the head of the active list > regardless of which list the page is currently on. This is a classic LRU. Okay, that sounds fine, although for the reasons I just mentioned, it's not clear that it's helping much to move the page to the front if it's already on the active list, which itself is a CLOCK that yields good LRU-like behavior. > The inactive list is a fifo queue. So you have a (sort-of) LRU feeding > pages from its cold end into the FIFO, and if the page stays on the FIFO > long enough to reach the code end it gets evicted, or at least it starts > on the process. Wait, this doesn't make sense to me. I assume that there is some code that examines pages on the inactive list and, if they've been referenced, moves them to the front of the active list. That would make the inactive list another CLOCK-like queue, not a FIFO queue. (It would be FIFO only if pages were pulled only from the cold end of the FIFO queue and either (a) evicted if they've not been used or (b) moved to the front of the active list if they have been.) Making this list an LRU queue makes this whole structure a classic segmented queue (SEGQ) arrangement. The first queue is a FIFO or CLOCK -- a kind of queue where references to pages are not detected immediately, but only when something from that queue needs replacement for an incoming page. Pages evicted from the first queue are inserted into the second one. The purpose of the structure is that pages in the second queue are referenced far less often, and so incurring the overhead of detecting their references when they occur -- that is, protecting the pages so that the reference causes a trap into the kernel -- is a low cost way to order the pages near eviction. While keeping the inactive queue as a CLOCK-like structure is fine, it could be a true LRU queue, and there's little advantage to making it a FIFO queue. (Again, though, it doesn't sound to me as though it *is* a FIFO queue -- or am I misunderstanding your comments?) > [...] and needing to find page referenced bits by virtual scanning. The > latter means that the referenced information at the cold end of the LRU > and FIFO is unreliable. The cold end of the CLOCK actually tends to be quite reliable. I agree that the cold end of a FIFO queue is not, so why not do away with the scanning and simply mark pages in the inactive queue as not present (although they are, of course)? If referenced, they are immediately moved to the front. It's not likely to happen often, so the overhead is modest (and, for many workloads, far less than with the continual scanning). Leave the first queue as a CLOCK, make the second queue a true LRU. > I haven't tried this, because I think it's more important to get the > reverse mapping work nailed down so that the page referenced information > is reliable. Fair enough -- we can't solve all problems at once. But, I do have a proposal, now that I've done all of this nit-picking: For my own purposes, I want a simpler, non-scanning structure. I want the CLOCK/LRU SEGQ structure that I described. So I'll just go ahead and do that, as it will be the basis of some other experiments that I'm trying to do. Once (if?) I've managed that, we can try some workloads to see what the overhead of scanning is vs. the overhead of minor (non-I/O) page faults for the inactive list references. My prediction for the outcome is as follows: For workloads that are loop-like and require space near to the capacity of memory (that is, workloads that will hit the inactive list pages often), my approach will incur more overhead. I think on other workloads, eliminating the scanning will be worthwhile, not only in the reduction of overhead, but in the elimination of one more factor to tune. (Mind you, scanning will still be somewhat needed to batch together page-write operations for cleaning purposes, but that's yet another topic.) Anyone think this is interesting? Or am I just doing this for myself? (I' m happy to do it for myself, but if others want to know, I'll try to share the results.) Also, does anyone think I'm nuts, and misunderstanding some of the issues? (Always a possibility.) Scott -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (Darwin) Comment: For info see http://www.gnupg.org iD8DBQE9V/c88eFdWQtoOmgRAsPUAJ9P8Qkag/wXeBibK01CjvgnjtnnwgCgkwYw Zu1FbDBWZPYYV/tg13hifiA= =sANG -----END PGP SIGNATURE----- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/