From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark_H_Johnson@Raytheon.com
Subject: Re: [RFC] 2.3/4 VM queues idea
Message-ID:
Date: Wed, 24 May 2000 14:37:44 -0500
MIME-Version: 1.0
Content-type: text/plain; charset=us-ascii
Sender: owner-linux-mm@kvack.org
Return-Path:
To: riel@conectiva.com.br
Cc: acme@conectiva.com.br, dillon@apollo.backplane.com, linux-mm@kvack.org, sct@redhat.com
List-ID:

I'll try to combine comments on the VM queues & Matt Dillon's material in
one response. I've added some analysis at the end - yes, the reference is
OLD, but the math is still valid.

The bottom line of my comments - make sure the goals are right, put some
measures into place so we can determine "success", & build the solution on
sound principles. Also, thanks to all who are working to make the VM
system in Linux better.

Re: Goals

 - Robust design - absolutely essential. VM is one of those key
   capabilities that must be done right or not at all.
 - page aging and "buffer for allocations" - not sure I care which methods
   we use as long as they work well. To me, the goal of VM is to have the
   "right" pages in memory when the CPU executes an instruction. Page
   aging and free lists are two methods; lookahead and clustered reads &
   writes are others.
 - treat pages equally - I think I disagree with both you and Matt on this
   one. We have different usage patterns for different kinds of data
   (e.g., execution of code tends to be localized but not sequential vs.
   sequential reads of data in a file) & we should have a means of
   distinguishing between them. This does not mean that one algorithm
   can't do a good job for both the VM & the buffer cache; just recognize
   that we should have ways to treat them differently. See my comments on
   "stress cases" below for my rationale.
 - simple changes - I agree, get this issue settled & move on.

To these goals I might add:

 - predictable performance [essential if I deploy any large (or real
   time) system on Linux]
 - no more automatic process kills - give the system [and/or operator]
   the ability to recover without terminating jobs
 - adjustable by the user or system administrator for different workloads
 - durable - works well with varied CPU, memory, and disk performance

A related item - measure that the "goals have been met". How do we know
that method A works better than method B without some hooks to measure
the performance?

A final comment on goals - I honestly don't mind systems that swap - IF
they do a better job of running the active jobs as a result. The FUD that
Matt refers to on FreeBSD swapping may actually mean that FreeBSD runs
better than Linux. I can't tell, and it probably varies by application
area anyway.

Re: Design ideas [with ties into Matt's material]

 - not sure what the 3 [4?] queues are doing for us; I can't tie the
   queue concept to how we meet any of the goals above.
 - allocations / second as a parameter to adjust free pages. Sounds OK;
   may want a scalable factor here with min & max limits on free pages
   [see the first sketch below]. For example, in a real time system I
   want no paging - that doesn't mean that I want no free pages. A burst
   of activity could occur where I need to do an allocation & don't want
   to wait for a disk write before taking action [though that usually
   means MY system design is broken...].
 - not sure of the distinction between "atomic" and "non atomic"
   allocations. Please relate this to meeting the goals above, or add a
   goal.
 - not clear how the inactive queue scan & page cleaner methods meet the
   goals either.
 - [Matt] real statistics - I think this is part of the data needed to
   determine whether A is better than B.
 - [Matt] several places talk about LRU or not-strict LRU - LRU is a good
   algorithm but has overhead involved. You may find that a simple FIFO
   or clock algorithm gets >90% of the benefit of LRU without the
   overhead [can be a net gain, something to consider; see the second
   sketch below].
 - [Matt] relate scan rate to perceived memory load. Looks OK if the
   overhead makes it worth the investment.
 - [Matt] moving balance points & changing algorithms [swap, not page].
   Ditto.
 - [Matt] adjustments to initial weight. Ditto - relates directly to the
   "right page" goal.

See the analysis below to get an idea of what I mean by "overhead is
worth the investment". It looks like we can spend a lot of CPU cycles to
save one disk read or write and still run the system efficiently.
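As a rough sketch of the "allocations / second with min & max limits"
idea above - all of the names here (free_target, FREE_MIN, FREE_MAX,
BURST_SECS) are made up for illustration, not an existing kernel
interface:

    /*
     * Hypothetical sketch: scale the free page target with the recent
     * allocation rate, clamped between a floor and a ceiling.
     */
    #define FREE_MIN    64      /* never let the free pool shrink below this */
    #define FREE_MAX    4096    /* never reserve more memory than this       */
    #define BURST_SECS  2       /* how many seconds of burst to cover        */

    static unsigned long free_target(unsigned long allocs_per_sec)
    {
            unsigned long want = allocs_per_sec * BURST_SECS;

            if (want < FREE_MIN)            /* clamp to the configured limits */
                    want = FREE_MIN;
            if (want > FREE_MAX)
                    want = FREE_MAX;
            return want;
    }

The point is only that the target floats with recent demand but can never
starve the system of free pages or eat all of memory for the reserve.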
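And to illustrate the clock algorithm mentioned in the LRU comment -
again purely schematic; the structure, NR_SLOTS, and the referenced bit
are stand-ins rather than the real Linux data structures:

    #define NR_SLOTS 1024

    struct page_slot {
            int referenced;         /* set when the page is touched */
    };

    static struct page_slot pages[NR_SLOTS];
    static unsigned int hand;       /* the clock hand */

    static struct page_slot *clock_pick_victim(void)
    {
            for (;;) {
                    struct page_slot *p = &pages[hand];

                    hand = (hand + 1) % NR_SLOTS;
                    if (p->referenced) {
                            p->referenced = 0;      /* give it a second chance */
                            continue;
                    }
                    return p;       /* not touched since the last sweep */
            }
    }

Heavily used pages survive because their referenced bits keep getting
set; idle pages get picked on the next pass - most of the benefit of LRU
without maintaining an ordered list.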
The remainder of Matt's message was somewhat harder to read - some
mixture of stress cases & methods used to implement VM. I've reorganized
what he said to characterize it somewhat more clearly [at least in my
mind]...

Stress cases [characterized as sequential or not, read only or
read/write, shared or not, locked or not]. I tried to note the
application for each case as well as the general technique that could be
applied to memory management.

 - sequential read only access [file I/O or mmap'd file, use once &
   discard]
 - sequential read/write access [file updates or mmap'd file, use once &
   push to file on disk]
 - non-sequential read or execute access, not shared [typical executable,
   locality of reference; when memory is tight, can discard & refresh
   from disk]
 - non-sequential read/write access, not shared [stack or heap; locality
   of reference, must push to file or swap when memory is tight]
 - read only shared access [typical shared library, higher "apparent
   cost" if not in memory, can discard & refresh from disk]
 - read/write shared access [typical memory mapped file or shared memory
   region, higher apparent cost if not in memory, has cache impact on
   multiprocessors, must push to file or swap when memory is tight]
 - locked pages for I/O or real time applications [fixed in memory due to
   system constraints, VM system must (?) leave alone, cannot (?) move in
   memory, not "likely" needed on disk (?)]. I had a crazy idea as I was
   writing this one - the application is to capture a stream of data from
   the network onto disk. The user mmap's a file & locks it into memory
   to ensure the bandwidth can be met, fills the region with data from
   the network, then unlocks & syncs the file when the transfer is done.
   How would FreeBSD/Linux handle that case? [A sketch of the userspace
   side follows below.]

I expect there are other combinations possible - I just tried to
characterize the kinds of memory usage patterns that are typical. It may
also be good to look at some real application areas - interactive X
applications, application compile/link, database updates, web servers,
real time, streaming data, and so on; measure how they stress the system
& build the VM system to handle the varied loads - automatically if you
can, but with help from the user/system administrator if you need it.
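On the "crazy idea" above, here is a rough sketch of what the userspace
side might look like. The function name (capture_stream) and the trimmed
error handling are mine; the point is only the mmap / mlock / fill /
munlock / msync sequence - how the VM behaves around the locked region
while the capture runs is exactly the open question.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Capture len bytes from netfd into a locked, file-backed mapping. */
    static int capture_stream(const char *path, size_t len, int netfd)
    {
            int fd = open(path, O_RDWR | O_CREAT, 0644);
            if (fd < 0)
                    return -1;
            if (ftruncate(fd, len) < 0) {           /* size the backing file */
                    close(fd);
                    return -1;
            }

            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
            if (buf == MAP_FAILED) {
                    close(fd);
                    return -1;
            }

            mlock(buf, len);            /* pin pages so the capture keeps up */

            size_t off = 0;             /* fill the region from the network */
            while (off < len) {
                    ssize_t n = read(netfd, buf + off, len - off);
                    if (n <= 0)
                            break;
                    off += (size_t)n;
            }

            munlock(buf, len);          /* let the VM reclaim again */
            msync(buf, len, MS_SYNC);   /* push the captured data to disk */
            munmap(buf, len);
            close(fd);
            return 0;
    }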
Methods [recap of Matt's list]

 - write clustering - a good technique used on many systems
 - read clustering - ditto, especially the note about priority reduction
   on pages already read
 - sequential access detection - ditto
 - sequential write detection - ditto; would be especially helpful with
   some of our shared memory test programs that currently stress Linux to
   process kills. Think of sequential access to memory mapped files in
   the same way as sequential file I/O - they should have similar
   performance characteristics if we do this right [relatively small
   resident set sizes, high throughput].

Some analysis of the situation

I took a few minutes to go through some of my old reference books. Here
are some concepts that might help, from "Operating System Principles" by
Per Brinch Hansen (1973), pp. 182-191. Your disk can supply at most one
page every T seconds, a memory access takes t seconds, and your program
demands pages at a rate p(s), where s is the fraction of the process that
is resident in memory. The processor tends to demand one page every
t/p(s) seconds [one page fault per 1/p(s) memory accesses]. There are
three situations:

 - disk is idle, p(s) < t/T. Generally OK; desired if you need to keep
   latency down.
 - CPU is idle, p(s) > t/T. Generally bad, leads to thrashing. May
   indicate you need more memory or faster disks for your application.
 - balanced system, p(s) = t/T. Full utilization, may lead to latency
   increases.

Let's feed in some current performance numbers & see where they lead. A
disk access (T) is still a few milliseconds and a memory access (t) is
well under a microsecond (let's use T=5E-3 and t=5E-8 to keep the math
simple). The ratio t/T is thus 1/100000 - that means one disk page access
per 100000 memory accesses. Check the math: at a 20 MHz memory access
rate (2E7 accesses per second), the balanced demand is 200 disk pages per
second (perhaps high?). Note that many of our stress tests have much
higher ratios than this. I suggest a few real measurements to get ratios
for real systems rather than my made up answers.

Even if I'm off by 10x or 100x, this kind of ratio means that many of the
methods Matt describes make sense. It helps justify why spending time to
do read/write clusters pays off - it gets you more work done for the cost
of each disk access (T). It helps justify why detecting sequential access
& limiting resident set sizes in those cases is OK [discard pages that
will never be used again]. It shows how some of the measures Matt has
described (e.g., page weights & perceived memory load) help determine how
we should proceed.

The book goes on to talk about replacement algorithms. FIFO or
approximations to LRU can cause up to 2-3x more pages to be transferred
than the "ideal" algorithm (one that knows the best choice in advance).
This amounts to a 12-24% reduction in working set size but improves
utilization by only a few percent. There is also a discussion of transfer
algorithms, basically a combination of elevator scheduling and/or
clustering to reduce latency & related overhead. I think Matt has covered
better techniques than were in this old book.

I did have to chuckle at the biggest gain - fix the application to
improve locality - a problem then & still one today. Small tools work
better than big ones. Perhaps a reason to combat code bloat in the tools
we use.

Closing

First, let's make sure we agree on what we really need from the VM system
(goals). Second, put a few measurements in place so we can determine
we've succeeded. Third, implement solutions that will get us there.

Don't forget to give thanks for all the hard work that has gone into what
we have today & will have tomorrow.

--Mark H Johnson

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/