From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Wed, 24 May 2000 13:02:57 -0300 (BRST) From: Rik van Riel Subject: discussion with Matt Dillon Message-ID: MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="655889-1842000895-959184177=:24993" Sender: owner-linux-mm@kvack.org Return-Path: To: linux-mm@kvack.org Cc: Matthew Dillon List-ID: --655889-1842000895-959184177=:24993 Content-Type: TEXT/PLAIN; charset=US-ASCII Hi, over the last days I had a very interesting discussion about virtual memory with Matt Dillon. It would be cool if people here could read these emails and give us comments, reactions, questions, ... about them. I believe it would be a good thing for Linux to have a whole bunch of people who understand virtual memory so we can keep an eye on each other and make sure something good gets built. Also, I'd *really* appreciate some comments on the 2.3/4 VM plan I sent to the list earlier today. I want to start implementing it RSN (as in, after lunch) so I need to know if I missed any critical design issues... regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ --655889-1842000895-959184177=:24993 Content-Type: MULTIPART/Digest; BOUNDARY="655889-1110733960-959184177=:24993" Content-ID: Content-Description: Digest of 7 messages This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. Send mail to mime@docserver.cac.washington.edu for more info. --655889-1110733960-959184177=:24993 Content-Type: MESSAGE/RFC822; CHARSET=US-ASCII Content-ID: Content-Description: Linux VM/IO balancing (fwd to linux-mm?) (fwd) Return-Path: Received: from perninha.conectiva.com.br (perninha.conectiva.com.br [200.250.58.156]) by duckman.distro.conectiva (8.9.3/8.8.7) with ESMTP id DAA18483 for ; Tue, 23 May 2000 03:32:33 -0300 Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by perninha.conectiva.com.br (8.9.3/8.9.3) with ESMTP id DAA00990 for ; Tue, 23 May 2000 03:32:56 -0300 Received: (from dillon@localhost) by apollo.backplane.com (8.9.3/8.9.1) id XAA64159; Mon, 22 May 2000 23:32:20 -0700 (PDT) (envelope-from dillon) Date: Mon, 22 May 2000 23:32:20 -0700 (PDT) From: Matthew Dillon Message-Id: <200005230632.XAA64159@apollo.backplane.com> To: Rik van Riel Subject: Linux VM/IO balancing (fwd to linux-mm?) I sent this to Alan who suggested that I send this to you! I've fleshed it out a bit more from the version I sent to Alan. I've been following the linux VM/memory subsystem issues closely and I have a few suggestions on how to clean it up. Unfortunately, being FreeBSD centric these days I do not want to create controversy. But at this point I think even linux developers know that the algorithms being used in the linux kernel are struggling with the issue. The time may be ripe for some outside input (though I will note that I was heavy into linux in earlier years, as Alan's mail archives attest to!). What I do below is essentially describe the FreeBSD VM/memory subsystem from an algorithmic point of view, minus some of the fluff. It is very straightforward in concept and the benefits should be obvious to anyone who has worked on the VM system. I think it would be a fine blueprint to use in linux. I would like you to consider posting this to linux-kernel. 
I leave it up to you -- I think if I were to post it independently it would simply result in controversy and seem more like a FreeBSD vs Linux thing rather than a 'how to fix the VM system' thing. But if you post it and preface it appropriately, the result might be better. So I am sending it to you. It may help kickstart the creation of a formal algorithmic blueprint rather than hacking up patches that solve one problem only to create other problems. Please feel free to edit it in any way if you think you can reduce potential us-vs-them issues even more. I think there is a chance here to get the whole of the linux developer community on the same page with regard to the memory-load issue.

I didn't intend to turn this into a paper, but I've spent so much time describing it that I think I am going to submit it to Daemon News in a cleaned-up form :-).

Thanks!

-Matt

---------

Even though I don't do much coding for Linux these days, I've always avidly tracked the kernel mailing list. I've been tracking the memory balancing thread for a while now and, having dealt with similar issues in FreeBSD, I believe I can offer a suggestion.

First of all, I am not trying to turn this into a comparison or anything. Don't think of the FreeBSD memory subsystem as being 'FreeBSD' so much as being the work of a handful of very good theorists and programmers (John Dyson being the main element there), and years of testing under varying loads. The algorithms FreeBSD uses are NOT 15 years old, they are more like 3 years old, and much of the work I myself have done is less than one year old. Also, keep in mind that the standard FUD about FreeBSD 'swapping more' than Linux is just that --- FUD. FreeBSD only swaps when it needs to, or when it thinks it may benefit by freeing up an idle dirty page for more active reuse. I make an attempt to describe in exact detail how and why FreeBSD pages things out, and why it works so well, somewhere down below :-).

My suggestions are as follows:

First, stop treating the buffer cache as an entity separate from the VM system. The two are inexorably bound together, especially considering the massive use of mmap() (both file-backed and anonymous mmaps) in modern programming. Depending on what you are running, a properly balanced system might have anywhere from 80% of its memory assigned as file cache to 80% of its memory assigned to hold anonymous memory for processes. It is NOT possible to impose limitations and still get a reasonably scalable, balanced system. DO NOT TREAT THESE AS TWO DIFFERENT CACHES!

Second, start keeping real statistics on memory use on a per-physical-page basis. That means tracking how often VM pages are referenced (statistically) as well as how often filesystem pages are referenced by discrete I/O calls (deterministically). Keep track of a real per-physical-page statistical 'weight'. (What this means for linux is that you really need to test the pte's associated with physical pages by iterating through the physical pages in your outer loop, NOT by trying to iterate through every page table of every process!)

FreeBSD keeps a center-weighted statistic for every page of memory (buffer cache or VM cache, it makes no difference). This has turned out to be a nearly perfect balancing algorithm and I strongly recommend that linux adopt a similar model.
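To make the per-page weight suggestion concrete, here is a minimal standalone sketch of one way such a statistic could be kept, with both update paths (the deterministic bump from read()/write() and the statistical aging from the referenced bit). It is illustrative only -- the struct layout, constants and function names are invented for this example and are not the actual FreeBSD or Linux code.

    #include <stdbool.h>

    #define WEIGHT_MIN 0
    #define WEIGHT_MAX 10

    /* One entry per physical page frame (hypothetical layout). */
    struct phys_page {
        int  weight;       /* center-weighted usage statistic, 0..10    */
        bool dirty;        /* page differs from its backing store       */
        bool referenced;   /* referenced bit, gathered from the pte(s)
                              that map this physical page               */
    };

    /* A freshly allocated page starts in the middle of the range. */
    void page_init(struct phys_page *pg)
    {
        pg->weight = (WEIGHT_MIN + WEIGHT_MAX) / 2;
        pg->dirty = false;
        pg->referenced = false;
    }

    /* Deterministic update: a read()/write() just touched this page. */
    void page_touch_by_io(struct phys_page *pg)
    {
        if (pg->weight < WEIGHT_MAX)
            pg->weight++;
    }

    /* Statistical update: the scanner samples and clears the referenced
     * bit.  How often this runs depends on the perceived memory load. */
    void page_age(struct phys_page *pg)
    {
        if (pg->referenced) {
            pg->referenced = false;        /* rearm for the next sample */
            if (pg->weight < WEIGHT_MAX)
                pg->weight++;
        } else if (pg->weight > WEIGHT_MIN) {
            pg->weight--;                  /* idle: decay toward zero   */
        }
    }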
But what makes it work is that FreeBSD is willing to eat a couple of cpu cycles to keep accurate statistics of page use by the VM system, in order to avoid the bad things that happen when one would otherwise choose the wrong page to reuse or clean.

What I describe below is the essential core of the algorithm FreeBSD uses. It's not an exact representation, but it gets to the heart of FreeBSD's memory-balancing success.

The algorithm is a *modified* LRU. Let's say you decide on a weighting between 0 and 10. When a page is first allocated (either to the buffer cache or for anonymous memory) its statistical weight is set to the middle (5). If the page is used often, the statistical weight slowly rises to its maximum (10). If the page remains idle (or was just used once), the statistical weight slowly drops to its minimum (0).

The statistical weight is updated in real time by I/O system calls, and updated statistically (by checking and clearing the page-referenced bit in the pte's) for mapped memory. When you mmap() a file and issue syscalls on the descriptor, the weight may be updated by BOTH methods. The rate at which the statistical page-reference updating operates depends on the perceived memory load. A lightly loaded system (unstressed memory) doesn't bother to scan the page-referenced bit all that often, while a heavy memory load scans the page-referenced bit quite often to keep the statistical weight intact.

When memory is allocated and no free pages are available, a clean page is discarded from the cache (pretty much all 'clean' pages are considered to be cache), lowest weight first. This in itself does NOT contribute to the memory load calculation. That is, if you are scanning a 10GB file you are not creating any memory stress on the system.

The LRU nature of the order of the pages in the queue is not strict. The real parameter is the statistic; the ordering of the pages in the queue uses a heuristic -- the pages 'migrate' over time so they are reasonably well ordered within the queue, but no great effort is made to order them exactly. The VM system will scan a portion of the queue to locate a reasonable page to reuse (for example, it will look for a page with a weighting less than 2).

The pagedaemon's scan rate is based on the perceived memory load and ONLY the perceived memory load. It is perfectly acceptable to have most of the system memory in 'active' use if allocations are not occurring often, perfectly acceptable to have most of the system memory backing file pages if processes aren't doing a lot of pageins, perfectly acceptable for the system memory to be mostly dedicated to process anonymous memory if processes have big ACTIVE footprints, and perfectly acceptable for most of the pages to be dirty if they are all in active use and the memory subsystem is not otherwise being stressed. The reason FreeBSD's memory subsystem works so well is precisely because it does not impose any artificial limitations on the balance point.

Memory load is calculated in two ways: first, by whether the memory system finds itself reusing active pages (in my example, any page with a statistical weight greater than 5); second, based on the dirty:clean page ratio. A high ratio does not itself cause paging to occur, but a high ratio combined with the system reusing active pages does. The dirty/clean ratio is treated as an INDEPENDENT problem.
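Before getting to the dirty side, here is a compressed sketch of the clean-page reuse and the load signals just described. It continues the snippet above (same invented struct phys_page and WEIGHT_MAX) and is only an illustration of the idea, not the real FreeBSD code; the scan limit and the "weight < 2" threshold simply mirror the example numbers in the text.

    #include <stdbool.h>
    #include <stddef.h>

    /* Loosely ordered queue of pages, lowest weight roughly first. */
    struct page_queue {
        struct phys_page **pages;
        int                count;
    };

    /* Scan a bounded portion of the queue for a clean, low-weight page
     * to reuse.  Reusing an idle page is NOT memory stress; being
     * forced to take a page that is still above the middle weight is. */
    struct phys_page *reclaim_clean_page(struct page_queue *q,
                                         int scan_limit,
                                         bool *reused_active_page)
    {
        struct phys_page *best = NULL;

        for (int i = 0; i < q->count && i < scan_limit; i++) {
            struct phys_page *pg = q->pages[i];
            if (pg->dirty)
                continue;             /* dirty pages are a separate problem */
            if (best == NULL || pg->weight < best->weight)
                best = pg;
            if (best->weight < 2)     /* good enough, stop scanning */
                break;
        }
        if (best != NULL)
            *reused_active_page = (best->weight > WEIGHT_MAX / 2);
        return best;
    }

    /* The dirty:clean ratio alone does not trigger pageouts; a high
     * ratio combined with reuse of active pages does.  The comparison
     * used here is just a placeholder threshold for the sketch. */
    bool should_launder(bool reused_active_page, int ndirty, int nclean)
    {
        return reused_active_page && (nclean == 0 || ndirty > nclean);
    }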
The same statistic is kept for dirty pages as it is for clean pages, but dirty pages are placed on their own independent LRU-ish queue and do not take part in the 'normal' memory allocation algorithm. A separate algorithm (also part of the pageout daemon) controls the cleaning of dirty pages.

When the memory load increases, an attempt is made to balance the dirty/clean ratio by 'cleaning' dirty pages, which of course means paging them out. FreeBSD makes NO distinction between writing a dirty file-backed page and allocating swap for a dirty anonymous memory page. The same per-page memory-use statistic is also used to determine which dirty pages to clean first. In effect, it is precisely this attempt to balance the dirty/clean ratio which increases the number of clean pages available to reuse. The idea here is to increase the number of clean pages to the point where the system is no longer being forced to reuse 'active' pages. Once this is achieved there is no longer any need to clean the remaining dirty pages.

Under extreme memory loads the balance point moves on its own to a point where FreeBSD tries to keep as many pages in a clean state as possible. When the memory load gets to this point the system is considered to be thrashing and we start taking anti-thrashing measures, such as swapping out whole processes and holding them idle for 20-second spans. It rarely gets to this point, but even when it does the system is still kept reasonably balanced.

It should be noted that the center-weighting algorithm works in virtually all situations, including working WONDERFULLY when you have I/O-centric programs (i.e. a program that reads or writes gigabytes of data). By making slight adjustments to the initial weight (or even no adjustments at all) the VM system will tend to reuse used-once memory (think of scanning a file) before it tries to reuse more actively used memory.

Now, of course, there are other kernel processes messing with memory. The filesystem update daemon, for example. But these processes are not designed to handle heavy memory loads, and we do it that way on purpose. At most the update daemon will speed up a little under intense filesystem loads, but that is as far as it goes. Only one process is designed to handle heavy memory loads and that is the pageout daemon.

---

Stress Cases

* Stressing dirty pages in the system via I/O calls (read/write)

    The algorithm tends to cause sequential I/O calls to give pages a middling weight, and since the pages are not reused they tend to be recycled within their domain (so you don't blow the rest of the cache).

* Stressing dirty pages in the system via mmap (shared R+W)

    The system tends to run low on clean pages, detected by the fact that new allocations are reusing clean pages which have high weights. When this occurs the pageout daemon attempts to 'clean' dirty pages (page them out) in order to increase the number of clean pages available. Having a larger number of clean pages available tends to give them more time to age, thus reducing the average weight the allocator sees. This is a negative feedback loop which results in balance.

* I/O (read/shared-mmap) stress

    The algorithm tends to weight the clean pages according to use. The weightings for filesystem cache pages read via read() are adjusted at the time of the read(), while VM pages are adjusted statistically (the VM page scan rate depends on the level of stress). Since in modern systems mmap() is used heavily, no special consideration is given to one access method versus the other.
* VM (anonymous memory) stress

    Anonymous swap-backed memory is treated no differently from file-backed (filesystem buffers / mmap) memory. Clean anonymous pages (most likely with swap already allocated if they are clean) can be reused just the same as pages belonging to the filesystem buffer cache. Swap is assigned to dirty anonymous pages on the fly, only when the pageout daemon decides to actually clean the page. Once swap is assigned, the now-clean page can be reused.

    If a swap-backed page is brought back into memory, it is brought back in clean (swap is left assigned). Swap is only freed if the page is re-dirtied by the process.

    Thus most anonymous-memory pages in a heavily loaded system tend to remain clean, allowing them to be reused more easily and extending the life of the system further along the curve before it reaches a thrashing state.

* Write Clustering

    Whenever the system decides to clean a dirty page it will, on the fly, attempt to locate dirty nearby pages. FreeBSD is actually quite sophisticated in this regard in that it actually goes and does the calculation to ensure that only pages physically contiguous on the disk are clustered for the write. The cluster is then written and marked clean all in one go (the cluster size limit is 64-128K).

* Sequential Detection Heuristic for read clustering (read())

    A heuristic detects sequential read behavior and implements two optimizations: (1) it implements read-aheads (as long as they are reasonably contiguous on the physical media; we explicitly do not try to issue read-aheads if it would cause an extra disk seek), and (2) it implements priority-depression read-behind (reduce by 1 the statistical weight of pages that have already been read). Reuse of the pages can still cause the statistical weighting to increase to the maximum, but this optimization has a tendency to greatly reduce the stress that large sequential reads have on the rest of the memory subsystem.

* Sequential Detection Heuristic for read clustering (VM fault)

    A heuristic detects sequential VM fault operation, either forwards or backwards, and adjusts the cluster window around the fault taken, either shifting it forwards or backwards, or making the window smaller (e.g. if random fault operation is detected). Fault-ahead I/O is initiated based on the algorithm and anything found cached is pre-faulted into the page table. (The window size in FreeBSD is approximately 64KBytes for this particular algorithm.) The window is further restricted to ensure that only media-contiguous blocks are clustered.

* Sequential Detection Heuristic for write clustering (write())

    In the case of write() I/O (the write system call), in order to avoid saturating the memory system with dirty pages, if the sequential detection heuristic determines that writes are occurring sequentially, FreeBSD implements write-behind. That is, it issues the I/O on the dirty buffers preceding the write point immediately (and asynchronously), in order to get the pages into a clean state and thus reusable, thereby avoiding stressing the memory system. In this case there is also a limit placed on the number of dirty filesystem buffers allowed to accumulate (since I/O is slower than the write() calls creating the dirty buffers).

    What you wind up with in this case is maximum disk throughput for the sequential write without thousands of unnecessary dirty pages -- asynchronous up to a reasonable point, then blocking to give the I/O a chance to catch up a little in order to avoid starving the clean page cache.
* Sequential Detection Heuristic for write clustering (mmap) Currently not implemented under FreeBSD. This used to be a big problem because you could completely saturate the VM system with dirty pages before the system even realized it. To fix this we threw in a memory-stress check in vm_fault to block when dirtying pages in the face of having too many dirty pages already, giving I/O a chance to catch up a little. This actually improved performance because it left a greater number of clean pages available and so the page selection algorithm in the allocator worked better (tended to select idle pages rather then active pages). -Matt Matthew Dillon --655889-1110733960-959184177=:24993 Content-Type: MESSAGE/RFC822; CHARSET=US-ASCII Content-ID: Content-Description: Re: Linux VM/IO balancing (fwd to linux-mm?) (fwd) Date: Tue, 23 May 2000 09:34:23 -0300 (BRST) From: Rik van Riel X-Sender: riel@duckman.distro.conectiva To: Matthew Dillon Subject: Re: Linux VM/IO balancing (fwd to linux-mm?) In-Reply-To: <200005230632.XAA64159@apollo.backplane.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Mon, 22 May 2000, Matthew Dillon wrote: > What I do below is essentially describe the FreeBSD VM/memory subsystem > from an algorithmic point of view, minus some of the fluff. It is very > straightforward in concept and the benefits should be obvious to anyone > who has worked on the VM system. I think it would be a fine blueprint > to use in linux. I'm already looking at implementing this (and have been for a few weeks). I'm reading vm/vm_pageout.c in detail now, mainly to scan for pitfalls which I may have overlooked in my own design. Davem has implemented some nice active/inactive queue code too, I will probably cut'n'paste some of his code. I'll also try to avoid some of the problems FreeBSD has (using the active/inactive queue as primary scanning point and _then_ skipping to the virtual object for IO clustering, as well as results obtained with my anti-hog code suggest that we want to do scanning of virtual objects only, with an active list, cache list and laundry list as only lists). > My suggestions are as follows: > > First, stop treating the buffer cache as an entity separate from the > VM system. The two are inexorably bound together, especially considering > the massive use of mmap() (both file-backed and anonymous mmaps) in > modern programming. Depending on what you are running a properly > balanced system might have anywhere from 80% of its memory assigned > as file cache to 80% of its memory assigned to hold anonymous memory > for processes. it is NOT possible to impose limitations and still > get a reasonably scaleable balanced system. DO NOT TREAT THESE > AS TWO DIFFERENT CACHES! We stopped doing this around kernel version 2.3.8 ;) For all intents and purposes the buffer cache and the page cache are one. There are still a few special cases where we haven't figured out how to put the data into the page cache cleanly, but those are real exceptions. > Second, start keeping real statistics on memory use across on a > physical-page-basis. That means tracking how often VM pages are > referenced (statistically) as well as how often filesystem pages are > referenced by discrete I/O calls (deterministically). Keep track of > a real per-physical-page statistical 'weight'. 
(What this means for > linux is that you really need to test the pte's associated with physical > pages by iterating through the physical pagse in your outer loop, NOT > by trying to iterate through every page table of every process!). Hmmm, why would it matter in what order we scan stuff? > FreeBSD keeps a center-weighted statistic for every page of memory > (buffer cache or VM cache, it makes no difference). This has turned > out to be a nearly perfect balancing algorithm and I strongly recommend > that linux adopt a similar model. But what makes it work is that > FreeBSD is willing to eat a couple of cpu cycles to keep accurate > statistics of page use by the VM system in order to avoid the bad > things that happen when one would otherwise choose the wrong page to > reuse or clean. I fully agree on this point. Linus and Andrea have always been a strong opponent of any possible "overhead", leading to a VM subsystem that is optimised for CPU use, not for managing memory ;( > The algorithm is a *modified* LRU. Lets say you decide on a weighting > betweeen 0 and 10. When a page is first allocated (either to the > buffer cache or for anonymous memory) its statistical weight is > set to the middle (5). If the page is used often the statistical > weight slowly rises to its maximum (10). If the page remains idle > (or was just used once) the statistical weight slowly drops to its > minimum (0). *nod* We may want to age physical pages in a way like this. > The statistical weight is updated in real time by I/O system calls, > and updated statistically (by checking and clearing the page-referenced > bit in pte's) for mapped memory. When you mmap() a file and issue > syscalls on the descriptor, the weight may be updated by BOTH methods. > The rate at which the statistical page-reference updating operates depends > on the perceived memory load. A lightly loaded system (unstressed > memory) doesn't bother to scan the page-referenced bit all that often, > while a heavy memory load scans the page-referenced bit quite often > to keep the statistical weight intact. Nice. > When memory is allocated and no free pages are available, a clean page > is discarded from the cache (all 'clean' pages are considered to be > cache pretty much), lowest weight first. This in itself does NOT > contribute to the memory load calculation. That is, if you are scanning > a 10GB file you are not creating any memory stress on the system. *nod* > The LRU nature of the order of the pages in the queue is not strict. > The real parameter is the statistic, the ordering of the pages in the > queue uses a heuristic -- the pages 'migrate' over time so they are > reasonably well ordered within the queue, but no great effort is made > to order them exactly. The VM system will scan a portion of the queue > to locate a reasonable page to reuse (for example, it will look for > a page with a weighting less then 2). *nod* > The pagedaemon's scan rate is based on the perceived memory load > and ONLY the perceived memory load. It is perfectly acceptable to > have most of the system memory in 'active' use if allocations are not > occuring often, perfectly acceptable to have most of the system memory > backing file pages if processes aren't doing a lot of pageins, perfectly > acceptable for the system memory to be mostly dedicated to process > anonymous memory if processes have big ACTIVE footprints, perfectly > acceptable for most of the pages to be dirty if they are all in active > use and the memory subsystem is not otherwise being stressed. 
> > The reason FreeBSD's memory subsystem works so well is precisely because > it does not impose any artificial limitations on the balance point. Yup. This is a good thing which will be implemented in Linux too. > Memory load is calculated in two ways: First, if the memory system finds > itself reusing active pages (in my example, any page with a statistical > weight greater then 5), second based on the dirty:clean page ratio. A > high ratio does not itself cause paging to occur, but a high ratio > combined with the system reusing active pages does. Hmmmm, this isn't what I'm seeing in the code. Could you point me at the code where this is happening? > The dirty/clean ratio is treated as an INDEPENDANT problem. The > same statistic is kept for dirty pages as it is for clean pages, but > dirty pages are placed on their own independant LRUish queue and do > not take part in the 'normal' memory allocation algorithm. A > separate algorithm (also part of the pageout daemon) controls the > cleaning of dirty pages. This is planned for Linux as well. > When the memory load increases, an attempt is made to balance the > dirty/clean ratio by 'cleaning' dirty pages, which of course means > paging them out. FreeBSD makes NO distinction between writing a dirty > file-backed page and allocating swap for a dirty anonymous memory page. > The same per-page memory-use statistic is also used to determine which > dirty pages to clean first. In effect, it is precisely this attempt > to balance the dirty/clean ratio which increases the number of clean > pages available to reuse. The idea here is to increase the number of > clean pages to the point where the system is no longer being forced > to reuse 'active' pages. Once this is achieved there is no longer any > need clean the remaining dirty pages. I've read the code and it seems to do this: - put inactive pages in the cached queue - launder up to maxlaunder pages per scan (if needed) > Under extreme memory loads the balance point moves on its own to a > point where FreeBSD tries to keep as many pages in a clean state as > possible. When the memory load gets to this point the system is > considered to be thrashing and we start taking anti-thrashing measures, > such as swapping out whole processes and holding them idle for 20-second > spans. It rarely gets to this point, but even when it does the system > is still kept reasonably balanced. Uhmmmm. This does not seem to be exactly what the code does... (then again, I could be wrong) > It should be noted that the center-weighting algorithm works in virtually > all situations, including workign WONDERFULLY when you have I/O > centric programs (i.e. a program that reads or writes gigabytes of > data). By making slight adjustments to the initial weight (or even no > adjustments at all) the VM system will tend to reuse used-once memory > (think of scanning a file) before it tries to reuse more actively used > memory. *nod* Always trying to keep 1/3rd of the mapped pages in an "inactive" age is probably a good idea for Linux. The biggest problem with our code seems to be that we're *not* trying to have a balanced aging of pages ... > Now, of course, there are other kernel processes messing with memory. > The filesystem update daemon, for example. But these processes are > not designed to handle heavy memory loads and we do it that way on > purpose. At most the update daemon will speed up a little under intense > filesystem loads, but that is as far as it goes. 
Only one process is > designed to handle heavy memory loads and that is the pageout daemon. *nod* > --- > Stress Cases > > * Stressing dirty pages in the system via I/O calls (read/write) > > The algorithm tends to cause sequential I/O calls to give pages > a middling weight, and since the pages are not reused they tend > to be recycled within their domain (so you don't blow the rest > of the cache). This is a good argument for starting at a middle weight. Thanks for bringing this case to my attention, we really want something like this. > * Stressing dirty pages in the system via mmap (shared R+W) > > The system tends to run low on clean pages, detected by the > fact that new allocations are reusing clean pages which have high > weights. When this occurs the pageout daemon attempts to 'clean' > dirty pages (page them out) in order to increase the number of > clean pages available. Having a larger number of clean pages > available tends to give them more time to age, thus reducing the > average weight the allocator sees. This is a negative feedback > loop which results in balance. Hummm, where in the code can I find info on this? It certainly sounds interesting... > * I/O (read/shared-mmap) stress > > The algorithm tends to weight the clean pages according to use. > The weightings for filesystem cache pages read via read() are > adjusted at the time of the read() while VM pages are adjusted > statistically (The VM page scan rate depends on the level of > stress). Since in modern systems mmap() is used heavily, no > special consideration is given to one access method verses the > other. *nod* > * VM (anonymous memory) stress > > Anonymous swap-backed memory is treated no differently from > file-backed (filesystem buffers / mmap) memory. Clean anonymous > pages (most likely with swap already allocated if they are clean) > can be reused just the same as pages belonging to the filesystem > buffer cache. Swap is assigned to dirty anonymous pages on the > fly, only when the pageout daemon decides to actually clean the > page. Once swap is assigned the clean page can be reused. > > If a swap-backed page is brought back into memory, it is brought > back in clean (swap is left assigned). Swap is only freed if > the page is re-dirtied by the process. > > Thus most anonymous-memory pages in a heavily loaded system tend > to remain clean, allowing them to be reused more easily and extending > the life of the system further along the curve before it reaches a > thrashing state. A few points: - it may be easier to assign swap entries based on virtual scanning of processes, otherwise you'll end up with the strange jumping between queue scanning and virtual scanning that FreeBSD is doing now - vnodes are why FreeBSD is able to delay writing anonymous pages till later on - not every page weighs equally ... you want some fairness between memory hogs and small processes - Linux already does the don't-rewrite-clean-swap thing > * Write Clustering. > > Whenever the system decides to clean a dirty page it will, on the > fly, attempt to locate dirty nearby pages. FreeBSD is actually > quite sophisticated in this regard in that it actually goes and does > the calculation to ensure that only pages physically contiguous > on the disk are clustered for the write. The cluster is then written > and marked clean all in one go (cluster size limit is 64-128K). Neat... > * Sequential Detection Heuristic for read clustering (read()) > > A heuristic detects sequential read behavior and implements two > optimizations. 
(1) it implements read-aheads (as long as they > are reasonably contiguous on the physical media, we explicitly do > not try to issue read-aheads if it would cause an extra disk seek), > (2) it implements priority depression read-behind (reduce by 1 the > statistical weight of pages that have already been read). Reuse of > the pages can still cause the statistical weighting to increase to > the maximum, but this optimization has a tendancy to greatly reduce > the stress that large sequential reads have on the rest of the > memory subsystem. Ohhhh, nice ;) > * Sequential Detection Heuristic for read clustering (VM fault) > > A heuristic detects sequential VM fault operation, either forwards > or backwards and adjusts the cluster window around the fault taken, > either shifting it forwards or backwards, or making the window > smaller (e.g. if random fault operation is detecting). fault-ahead > I/O is initiated based on the algorithm and anything found cached > is pre-faulted into the page table. (The window size in FreeBSD is > approximately 64KBytes for this particular algorithm). The window > is further restriction to ensure that only media-contiguous blocks > are clustered. Again ... neat. > * Sequential Detection Heuristic for write clustering (write()) > > In the case of write() I/O (write system call), in order to > avoid saturating the memory system with dirty pages, if the > sequential detection heuristic determines that writes are > occuring sequentially, FreeBSD implements write-behind. That > is it issues the I/O on the dirty buffers preceding the write > point immediately (and asynchronously), in order to get the > pages into a clean state and thus reuseable, thus avoiding > stressing the memory system. In this case there is also a > limit emplaced on the number of dirty filesystem buffers > allowed to accumulate (since I/O is slower then the write() > calls creating the dirty buffers). > > What you wind up in this case is maximum disk throughput for the > sequential write without thousands of unnecessary dirty pages, > which is asynchronous up to a reasonable point and then starts > blocking to give the I/O the chance to catch up a little in > order to avoid starving the clean page cache. Cool. > * Sequential Detection Heuristic for write clustering (mmap) > > Currently not implemented under FreeBSD. This used to be a big > problem because you could completely saturate the VM system with > dirty pages before the system even realized it. To fix this we > threw in a memory-stress check in vm_fault to block when dirtying > pages in the face of having too many dirty pages already, giving > I/O a chance to catch up a little. > > This actually improved performance because it left a greater number > of clean pages available and so the page selection algorithm in the > allocator worked better (tended to select idle pages rather then > active pages). This is an extremely good idea. When we have too many (or just, a lot of) dirty pages, we should delay faults in write-enabled VMAs, while allowing faults in read-only VMAs to proceed as fast as possible ... regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ --655889-1110733960-959184177=:24993 Content-Type: MESSAGE/RFC822; CHARSET=US-ASCII Content-ID: Content-Description: Re: Linux VM/IO balancing (fwd to linux-mm?) 
(fwd) Return-Path: Received: from perninha.conectiva.com.br (perninha.conectiva.com.br [200.250.58.156]) by duckman.distro.conectiva (8.9.3/8.8.7) with ESMTP id OAA20874 for ; Tue, 23 May 2000 14:07:04 -0300 Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by perninha.conectiva.com.br (8.9.3/8.9.3) with ESMTP id OAA31919 for ; Tue, 23 May 2000 14:07:19 -0300 Received: (from dillon@localhost) by apollo.backplane.com (8.9.3/8.9.1) id KAA68220; Tue, 23 May 2000 10:06:38 -0700 (PDT) (envelope-from dillon) Date: Tue, 23 May 2000 10:06:38 -0700 (PDT) From: Matthew Dillon Message-Id: <200005231706.KAA68220@apollo.backplane.com> To: Rik van Riel Subject: Re: Linux VM/IO balancing (fwd to linux-mm?) References: :I'm already looking at implementing this (and have been for a few :weeks). I'm reading vm/vm_pageout.c in detail now, mainly to scan :for pitfalls which I may have overlooked in my own design. Great! :Davem has implemented some nice active/inactive queue code too, I :will probably cut'n'paste some of his code. I'll also try to avoid :some of the problems FreeBSD has (using the active/inactive queue :as primary scanning point and _then_ skipping to the virtual object :for IO clustering, as well as results obtained with my anti-hog code :suggest that we want to do scanning of virtual objects only, with an :active list, cache list and laundry list as only lists). I would not characterize this as a problem in FreeBSD, it's actually one of its more powerful features. What FreeBSD is doing is using the center-weighted algorithm to locate a page that it believes should be reused or paged out. Once the page is selected FreeBSD tries to make the best of the disk seek that it's about to do by clustering other nearby pages along with it. You can't find the best candidate by scanning via the VM objects. The problem with trying to locate the page to reuse from the VM object alone is that it doesn't work. You might end up scanning all the VM objects to locate a good page to reuse. If you just scan a few VM objects, there is a very good chance that you will *not* choose the best page to reuse/flush. For example, there was some code in the linux kernel a while ago that tried to scan the 'biggest' VM object under the assumption that good candidates for page reuse would likely be found there)... if you happen to choose a VM object which is being very actively used by the system, then try to reuse/flush a page from it, you wind up stalling whatever process was using it even if other VM objects have better candidates. Nor do you want to scan ALL VM objects to locate the best page -- that would lead to O(N^2) cpu overhead in locating the page. FreeBSD locates the page candidate using the page queues because it can do so in an O(N) or better manner (usually it's more around O(1)). :> Second, start keeping real statistics on memory use across on a :> physical-page-basis. That means tracking how often VM pages are :> referenced (statistically) as well as how often filesystem pages are :> referenced by discrete I/O calls (deterministically). Keep track of :> a real per-physical-page statistical 'weight'. (What this means for :> linux is that you really need to test the pte's associated with physical :> pages by iterating through the physical pagse in your outer loop, NOT :> by trying to iterate through every page table of every process!). : :Hmmm, why would it matter in what order we scan stuff? 
The problem here is that any given page of physical memory may exist in the virtual address space of many processes. If you scan by iterating through process page tables you wind up giving an artificial advantage (or disadvantage) to shared pages. This is not a good measure of whether the page can be reused. Also, iterating through the process page tables means going through an unknown number of pte's and blowing your L1 cache in the process -- it doesn't scale.

Iterating through physical memory and then locating the associated pte's treats each page on a more equal footing, regardless of how much or how little it is being shared, when updating its statistics. A massively shared page that happens to be very active already gets the benefit of being accessed by multiple processes (setting the ref'd bit); you don't need to give it any special handling.

:> FreeBSD keeps a center-weighted statistic for every page of memory
:> (buffer cache or VM cache, it makes no difference). This has turned
:> out to be a nearly perfect balancing algorithm and I strongly recommend
:> that linux adopt a similar model. But what makes it work is that
:> FreeBSD is willing to eat a couple of cpu cycles to keep accurate
:> statistics of page use by the VM system in order to avoid the bad
:> things that happen when one would otherwise choose the wrong page to
:> reuse or clean.
:
:I fully agree on this point. Linus and Andrea have always been
:strong opponents of any possible "overhead", leading to a VM
:subsystem that is optimised for CPU use, not for managing memory ;(

I think you could sell it to them if you make it clear that the statistics do not eat cpu unless the system is stressed (FreeBSD doesn't even bother doing the page scan unless it thinks the system is stressed), and that when the system is stressed it makes a whole lot more sense to eat cpu choosing the right page, since choosing the wrong page can lead to hundreds of thousands of wasted cpu cycles doing unnecessary paging and even more real time in process stalls.

:>...... (all sorts of good stuff removed)
:> Memory load is calculated in two ways: First, if the memory system finds
:> itself reusing active pages (in my example, any page with a statistical
:> weight greater than 5), second based on the dirty:clean page ratio. A
:> high ratio does not itself cause paging to occur, but a high ratio
:> combined with the system reusing active pages does.
:
:Hmmmm, this isn't what I'm seeing in the code. Could you point me
:at the code where this is happening?

Yes. It's confusing because FreeBSD implements the stress calculation with additional page queues rather than attempting to keep a single queue sorted by weight, which can be expensive.

When the system allocates a page (vm_page_alloc() in vm/vm_page.c) it uses free pages first (there usually aren't very many truly free pages), then cache pages (basically clean 'idle' pages), and if this fails it wakes up the page daemon, blocks, and retries. The page daemon is responsible for moving pages into the cache for use by the allocator. (A rough sketch of this allocation fallback appears below.)

The page daemon (vm/vm_pageout.c, vm_pageout() and vm_pageout_scan()) is responsible for adjusting act_count (see vm_pageout_scan()) and adjusting the position of pages in their queues. The page daemon starts by scanning pages in the inactive queue. Pages in this queue have no mmu mappings and have been determined to be relatively inactive. Within the queue the pages are loosely ordered by weight (trying to order them exactly would waste too much cpu).
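The allocation fallback referred to above, in outline (a sketch only, with made-up helper names; it is not the real vm_page_alloc() code, and the stubs exist just so the fragment stands alone):

    #include <stddef.h>

    struct phys_page;   /* as in the earlier sketches */

    /* Stub queue accessors so the sketch is self-contained; a real
     * kernel would pop from its actual free and cache queues. */
    static struct phys_page *take_from_free_queue(void)  { return NULL; }
    static struct phys_page *take_from_cache_queue(void) { return NULL; } /* clean, idle pages */
    static void wake_pagedaemon_and_wait(void) { /* sleep until woken */ }

    /* Allocation order: truly free pages first, then clean 'cache'
     * pages; only if both are empty do we wake the page daemon, block,
     * and retry.  The page daemon is what refills the cache queue by
     * aging and laundering pages. */
    struct phys_page *alloc_page_blocking(void)
    {
        for (;;) {
            struct phys_page *pg = take_from_free_queue();
            if (pg == NULL)
                pg = take_from_cache_queue();
            if (pg != NULL)
                return pg;
            wake_pagedaemon_and_wait();
        }
    }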
See the 'rescan0' label in vm/vm_pageout.c. While scanning this queue the pageout daemon updates the activity weighting for the page by checking the PG_REFERENED bit. The inactive queue also tends to contain dirty pages to launder. The page daemon does not flush out dirty pages immediately, if it can't find enough clean pages in the inactive queue (line 927 in vm/vm_pageout.c, see 'If we still have a page shortage'), it goes up and rescans the inactive queue to start the pageout ops on some of its pages before going on to the active queue. If there is still a page shortage the page daemon then scans the active queue. Line 943 'Scan the active queue for things we can deactivate'. This queue is also loosely sorted by the page weighting and as with the inactive queue scan the loop also checks the page-referenced bit to update the weighting on the fly while scanning the active queue. Under normal conditions, pages are not deactivated until the active count reaches 0. Line 1013 of vm/vm_pageout.c. This is where the stress test is implied. If the page daemon scans the active queue and does not find any pages whos weighting has reached 0, then it starts to hit around line 1077 (the vm_paging_target() test) and becomes much more aggressive with the syncer and swapper. At line 1097 if we determine that we have run out of swap and also do not have sufficient clean pages, we start killing processes. I'll get back to the vm_paging_target() function in a second. The vm_daemon() is responsible for the next stage of aggressiveness in a heavily loaded system. The daemon is responsible for testing processes against their resident-memory rlimit and forcefully deactivating pages. The daemon also notices processes marked as being swapped out and is more agressive on those processes (the act of marking a process as being swapped out does not actually pageout any of its pages, but if the system swapper is activated in a heavy-use situation it will use the mark to then agressively swap out that process's pages). Ok, back to vm_paging_target(). This function (sys/vmmeter.h) is responsible for balancing the clean/dirty ratio by forcing pages to be moved from the active queue to the inactive queue if insufficient pages were found and no pages in the active queue were found with a 0 weighting. This is the stress test. If the page daemon was not able to find a sufficient number of pages in the active queue with a 0 weighting, it assumes that the memory system is under stress and speeds up the syncer and moer-agressively enforces RSS resource limits (vm/vm_pageout.c line 1081). I will readily admit that FreeBSD suffers from accretions here... there has been a lot of hacking of the code. The main reason FreeBSD is not using a completely pure center-weighted LRU algorithm is to avoid thrashing. It attempts to enforce leaving a page in the active queue for a certain minimum period of time (based on the weighting having to drop to 0 in order for the page to leave the queue), and FreeBSD doesn't mind stalling processes a bit if this winds up occuring because if it takes recently activated pages out of the active queue too quickly, the system's thrash point occurs a lot sooner. :... : :> When the memory load increases, an attempt is made to balance the :> dirty/clean ratio by 'cleaning' dirty pages, which of course means :> paging them out. FreeBSD makes NO distinction between writing a dirty :> file-backed page and allocating swap for a dirty anonymous memory page. 
:> The same per-page memory-use statistic is also used to determine which :> dirty pages to clean first. In effect, it is precisely this attempt :> to balance the dirty/clean ratio which increases the number of clean :> pages available to reuse. The idea here is to increase the number of :> clean pages to the point where the system is no longer being forced :> to reuse 'active' pages. Once this is achieved there is no longer any :> need clean the remaining dirty pages. : :I've read the code and it seems to do this: :- put inactive pages in the cached queue :- launder up to maxlaunder pages per scan (if needed) Correct. Basically the way to think of the algorithm is to ignore the fact that an active queue exists at all. The active queue holds pages in an unstressed memory environment. In a stressed memory environment the active queue imposes a minimum in-core-resident time for the page to avoid thrashing. The system implements the page weighting independantly for both the active and inactive queues. Under medium stress conditions pages are moved from the active to the inactive queue based on their weighting reaching 0. Once in the inactive queue (under medium stress conditions), pages tend to be LRU ordered (all their weightings tend to be 0 at this point). But under heavier loading conditions pages may be moved from the active to the inactive queue forcefully, before their weighting reaches 0. In this case the inactive queue serves the same loosely-sorted weighting function that the active queue was supposed to serve until it too runs out of pages. The page laundering code is stuffed in the middle there somewhere using a simple heuristic to try to start laundering pages during the inactive queue scan. It is not strictly based on page weighting but it winds up being that way due to the loosely weighted ordering of the pages in the queue. :> Under extreme memory loads the balance point moves on its own to a :> point where FreeBSD tries to keep as many pages in a clean state as :> possible. When the memory load gets to this point the system is :> considered to be thrashing and we start taking anti-thrashing measures, :> such as swapping out whole processes and holding them idle for 20-second :> spans. It rarely gets to this point, but even when it does the system :> is still kept reasonably balanced. : :Uhmmmm. This does not seem to be exactly what the code does... :(then again, I could be wrong) Your right. It isn't *exactly* what the code does, but it's the algorithmic effect minus the confusion of implementing the algorithm over multiple page queues (active/inactive/cache), which was done solely to avoid page thrashing. :Always trying to keep 1/3rd of the mapped pages in an "inactive" :age is probably a good idea for Linux. The biggest problem with :our code seems to be that we're *not* trying to have a balanced :aging of pages ... That's my take too. Also note that FreeBSD doesn't bother deactivating pages unless there is actually some stress on the memory system... if you look at a lightly loaded FreeBSD box you will see a huge number of 'active' pages -- most of those are probably idle. The moving of pages from queue to queue only occurs when the memory allocator starts to run out. This alone does not indicate stress... it's only when the page daemon is unable to find sufficient pages in the active queue with a 0 weighting where the stress starts to come into play. 
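Pulling the last few paragraphs together, the page daemon's pass can be caricatured roughly as follows. This is a deliberately compressed sketch that reuses the invented struct phys_page, struct page_queue and page_age() from the earlier snippets; the helpers are declared but not implemented, the act_count handling is simplified, and none of the names are the actual FreeBSD identifiers.

    /* Hypothetical helpers, declared but not implemented here. */
    void move_to_cache_queue(struct phys_page *pg);   /* clean and reusable   */
    void start_pageout(struct phys_page *pg);         /* launder a dirty page */
    void deactivate(struct phys_page *pg);            /* active -> inactive   */

    /* One pass of a pagedaemon-style scan.  Returns how short we still
     * are; a persistent shortage with no zero-weight pages left is the
     * point at which the daemon gets more aggressive (faster syncer,
     * RSS limit enforcement, and eventually whole-process swapouts). */
    int pagedaemon_scan(struct page_queue *inactive,
                        struct page_queue *active,
                        int shortage, int maxlaunder)
    {
        /* Pass 1: the inactive queue.  Re-sample the referenced bit so
         * a page that woke up earns its weight back instead of being
         * reclaimed; free clean zero-weight pages; launder a bounded
         * number of dirty ones (swap is assigned lazily at this point
         * for anonymous pages). */
        for (int i = 0; i < inactive->count && shortage > 0; i++) {
            struct phys_page *pg = inactive->pages[i];
            page_age(pg);
            if (pg->weight > 0)
                continue;
            if (!pg->dirty) {
                move_to_cache_queue(pg);
                shortage--;
            } else if (maxlaunder > 0) {
                start_pageout(pg);
                maxlaunder--;
            }
        }

        /* Pass 2: only if still short, scan the active queue and
         * deactivate pages whose weight has decayed to zero; they will
         * be picked up by the next inactive-queue pass.  Finding no
         * such pages is the stress signal described above. */
        for (int i = 0; i < active->count && shortage > 0; i++) {
            struct phys_page *pg = active->pages[i];
            page_age(pg);
            if (pg->weight == 0)
                deactivate(pg);
        }
        return shortage;
    }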
:> * Stressing dirty pages in the system via I/O calls (read/write) :> :> The algorithm tends to cause sequential I/O calls to give pages :> a middling weight, and since the pages are not reused they tend :> to be recycled within their domain (so you don't blow the rest :> of the cache). : :This is a good argument for starting at a middle weight. Thanks :for bringing this case to my attention, we really want something :like this. This is something I added to FreeBSD 4.x after tests showed that FreeBSD 3.x was 'blowing' the page cache unnecessarily when people would do things like tar up directory trees. It made a big difference in the performance of the rest of the system without effecting the I/O bound process. There is a gotcha, though-- you have to be careful to allow some balancing to occur so you do not recycle too-few pages within the domain. e.g. if the domain is only 10 pages, then the tar or dd or whatever is throwing away potentially cache able file pages much to soon. I have a hack in 4.x now that slowly balances this case. See vm_page_dontneed() in vm/vm_page.c. : :> * Stressing dirty pages in the system via mmap (shared R+W) :> :> The system tends to run low on clean pages, detected by the :> fact that new allocations are reusing clean pages which have high :> weights. When this occurs the pageout daemon attempts to 'clean' :> dirty pages (page them out) in order to increase the number of :> clean pages available. Having a larger number of clean pages :> available tends to give them more time to age, thus reducing the :> average weight the allocator sees. This is a negative feedback :> loop which results in balance. : :Hummm, where in the code can I find info on this? :It certainly sounds interesting... This is a side effect of the maxlaunder stuff in the inactive page scan done by the page daemon. However, under heavier load situations we also speedup the filesystem syncer, and we also, through other means, speed up the buffer daemon (which is responsible for staging out I/O related to dirty filesystem buffers). :> page. Once swap is assigned the clean page can be reused. :> :> If a swap-backed page is brought back into memory, it is brought :> back in clean (swap is left assigned). Swap is only freed if :> the page is re-dirtied by the process. :> :> Thus most anonymous-memory pages in a heavily loaded system tend :> to remain clean, allowing them to be reused more easily and extending :> the life of the system further along the curve before it reaches a :> thrashing state. : :A few points: :- it may be easier to assign swap entries based on virtual : scanning of processes, otherwise you'll end up with the : strange jumping between queue scanning and virtual scanning : that FreeBSD is doing now No, because a physical page may be associated with many processes and thus have multiple virtual copies. Scanning by virtual results in improperly weighting these pages and winds up being an O(N^2) scanning algorithm rather then an O(N) or O(1) algorithm. FreeBSD scans physical pages in order to guarentee deterministic algorithmic overhead and to avoid unnecessary redundant scanning. The two-phase I/O clustering operation is done on purpose... it may seem a bit strange to you but it actually works! The first phase is to locate the page candidate (the center weighted LRU algorithm), the second phase is to locate clusterable pages near that candidate. The second phase is explicitly restricted to candidates which wind up being contiguous on the physical media... 
if we are going to do an I/O we might as well make the best of it. The clustering goes to the core of the reason why FreeBSD remains reasonably efficient even when under very heavy loads, but it only works if FreeBSD is able to choose the right page candidate in the first place. :- vnodes are why FreeBSD is able to delay writing anonymous : pages till later on :- not every page weighs equally ... you want some fairness : between memory hogs and small processes This is handled by the 'memoryuse' resource limit. You cannot otherwise determine which is the more important process. You definitely do not want to make the assumption that the processes with larger RSS should be paged out moer... that will lead to serious performance problems with things like, oh, your X session. So FreeBSD does not make the distinction until it sees serious memory stress on the system. FreeBSD tries to operate strictly based on the page-use weighting. :- Linux already does the don't-rewrite-clean-swap thing : :... (more good stuff removed) :> :> A heuristic detects sequential read behavior and implements two :> optimizations. (1) it implements read-aheads (as long as they :> are reasonably contiguous on the physical media, we explicitly do :... : :Ohhhh, nice ;) : :> * Sequential Detection Heuristic for read clustering (VM fault) :> :> A heuristic detects sequential VM fault operation, either forwards :... : :Again ... neat. Yes. These optimizations have the side effect of reducing memory stress as well, which makes the LRU center weighting algorithm work better. :> * Sequential Detection Heuristic for write clustering (write()) :> :> In the case of write() I/O (write system call), in order to :> avoid saturating the memory system with dirty pages, if the :... : :Cool. Same effect on the LRU cw algorithm. :> * Sequential Detection Heuristic for write clustering (mmap) :> :> Currently not implemented under FreeBSD. This used to be a big :> problem because you could completely saturate the VM system with :> dirty pages before the system even realized it. To fix this we :> threw in a memory-stress check in vm_fault to block when dirtying :> pages in the face of having too many dirty pages already, giving :> I/O a chance to catch up a little. :> :> This actually improved performance because it left a greater number :> of clean pages available and so the page selection algorithm in the :> allocator worked better (tended to select idle pages rather then :> active pages). : :This is an extremely good idea. When we have too many (or just, :a lot of) dirty pages, we should delay faults in write-enabled :VMAs, while allowing faults in read-only VMAs to proceed as :fast as possible ... : :regards, : :Rik Happy to help! If you have any other questions don't hesitate to give me an email ring, I really enjoy talking tech with people who know what they are doing :-). -Matt Matthew Dillon --655889-1110733960-959184177=:24993 Content-Type: MESSAGE/RFC822; CHARSET=US-ASCII Content-ID: Content-Description: Re: Linux VM/IO balancing (fwd to linux-mm?) (fwd) Date: Tue, 23 May 2000 15:11:04 -0300 (BRST) From: Rik van Riel X-Sender: riel@duckman.distro.conectiva To: Matthew Dillon Subject: Re: Linux VM/IO balancing (fwd to linux-mm?) In-Reply-To: <200005231706.KAA68220@apollo.backplane.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Tue, 23 May 2000, Matthew Dillon wrote: > :I'm already looking at implementing this (and have been for a few > :weeks). 
I'm reading vm/vm_pageout.c in detail now, mainly to scan > :for pitfalls which I may have overlooked in my own design. > > Great! I'm currently looking at which part of the code I can implement before 2.4 is released by Linus. I've read your email and would like to think I'm understanding most of the rationale behind the FreeBSD VM subsystem... > :- not every page weighs equally ... you want some fairness > : between memory hogs and small processes > > This is handled by the 'memoryuse' resource limit. You cannot otherwise > determine which is the more important process. You definitely do not > want to make the assumption that the processes with larger RSS should > be paged out moer... that will lead to serious performance problems > with things like, oh, your X session. This is not exactly what the antihog code does. It will scan bigger processes more often, leading (hopefully) to: 1) better statistics about which pages are really used in big processes 2) better interactive performance because the lightly used memory of the user's shell is not paged out 3) more memory pressure on big processes, I've seen this speed up the system because a) small processes had less page faults b) the memory hog(s) had less disk wait time I haven't seen this code be any problem to X or other interactive processes ... if the memory is really used it won't be paged out, it's just that a big process will have to use its pages more heavily than a smaller process. To express this mathematically: A = process a, rss size(A) B = process b, rss size(B) [N times size(A)] P(X) = memory pressure on process X P(B) = P(A) * (size(B)/size(A)) * sqrt(size(B)/size(A)) and the memory pressure on every page from process B will be: P(page in B) = P(page in A) * sqrt(size(B)/size(A)) I know this doesn't give the best system-wide aging of pages, but it should provide fairer distribution of memory between processes. There are no artificial restrictions on processes with a big RSS, the only thing is that we pressure their pages harder (and will inactivate their pages sooner). This makes sure that a big process (that puts a lot of load on the VM subsystem) does not "transfer" its memory load to smaller processes. Lightly used pages in smaller processes will still be paged out in favor of heavily used pages in the big process, but small processes have a better chance of running well. This is doing something comperable to what the swap code is doing (penalise big processes more than small ones), but does it in the paging code. It allows a few big processes to grow until the point where they are thrashing without impacting the rest of the system too much. Then again, this VMS-influenced idea could be all wrong. I'd *love* to hear your view on this ... regards, Rik -- The Internet is not a network of computers. It is a network of people. That is its real strength. Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/ --655889-1110733960-959184177=:24993 Content-Type: MESSAGE/RFC822; CHARSET=US-ASCII Content-ID: Content-Description: Re: Linux VM/IO balancing (fwd to linux-mm?) 
  (fwd)

Date: Tue, 23 May 2000 12:39:07 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200005231939.MAA69005@apollo.backplane.com>
To: Rik van Riel
Subject: Re: Linux VM/IO balancing (fwd to linux-mm?)

:This is not exactly what the antihog code does. It will scan bigger processes more often, leading (hopefully) to:
:
:1) better statistics about which pages are really used in big processes
:2) better interactive performance, because the lightly used memory of the user's shell is not paged out
:3) more memory pressure on big processes; I've seen this speed up the system because
:   a) small processes had fewer page faults
:   b) the memory hog(s) had less disk wait time
:
:I haven't seen this code be any problem to X or other interactive processes ... if the memory is really used it won't be paged out, it's just that a big process will have to use its pages more heavily than a smaller process. To express this mathematically:
:
:A = process a, rss size(A)
:B = process b, rss size(B) [N times size(A)]
:P(X) = memory pressure on process X
:
:    P(B) = P(A) * (size(B)/size(A)) * sqrt(size(B)/size(A))
:
:and the memory pressure on every page from process B will be:
:
:    P(page in B) = P(page in A) * sqrt(size(B)/size(A))
:
:I know this doesn't give the best system-wide aging of pages, but it should provide a fairer distribution of memory between processes. There are no artificial restrictions on processes with a big RSS; the only thing is that we pressure their pages harder (and will inactivate their pages sooner).
:
:This makes sure that a big process (one that puts a lot of load on the VM subsystem) does not "transfer" its memory load to smaller processes. Lightly used pages in smaller processes will still be paged out in favor of heavily used pages in the big process, but small processes have a better chance of running well.
:
:This is doing something comparable to what the swap code is doing (penalise big processes more than small ones), but does it in the paging code. It allows a few big processes to grow until the point where they are thrashing, without impacting the rest of the system too much.
:
:Then again, this VMS-influenced idea could be all wrong. I'd *love* to hear your view on this ...
:
:regards,
:
:Rik

Well, I have a pretty strong opinion on trying to rationalize penalizing big processes simply because they are big. It's a bad idea for several reasons, not the least of which being that by making such a rationalization you are assuming a particular system topology -- you are assuming, for example, that the system may contain a few large, less-important processes and a reasonable number of small processes. But if the system contains hundreds of small processes, or if some of the large processes turn out to be important, the rationalization fails.

Also, if the large process in question happens to really need the pages (is accessing them all the time), trying to page those pages out gratuitously does nothing but create a massive paging load on the system.
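[Aside: a minimal sketch of the skew implied by the formula quoted above, with r = size(B)/size(A): whole-process pressure scales as r*sqrt(r) and per-page pressure as sqrt(r). The names are invented for illustration; this is not code from either kernel.]

    #include <math.h>
    #include <stdio.h>

    /* r = rss(B) / rss(A); the whole process gets r*sqrt(r) times the base
     * pressure, each of its pages sqrt(r) times.  Illustrative only. */
    double process_pressure(double base, double r)  { return base * r * sqrt(r); }
    double per_page_pressure(double base, double r) { return base * sqrt(r); }

    int main(void)
    {
        /* a process four times the size: 2x pressure per page, 8x overall */
        printf("per page x%.0f, whole process x%.0f\n",
               per_page_pressure(1.0, 4.0), process_pressure(1.0, 4.0));
        return 0;
    }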
Unless you have a mechanism (such as FreeBSD has) to impose a 20-second forced sleep under extreme memory loads, any focus on large processes will simply result in thrashing (read: screw up the system).

Another reason for not trying to weight things in favor of small processes is that the LRU algorithm *already* weights things in favor of small processes, simply by virtue of the fact that small processes tend to access/touch any given page in their RSS set much more often than large processes access/touch their pages. The same goes for shared pages --- which is why basing the algorithms on a physical page scan works well. Basing the algorithms on a virtual page scan, with lots of physical pages being redundantly scanned, skews the statistics badly.

FreeBSD has two mechanisms to deal with large processes, both used only under duress. The first occurs when FreeBSD hits the memory stress point - it starts enforcing the 'memoryuse' resource limit on processes. The second occurs when FreeBSD hits the 'holy shit we're starting to thrash' point - it starts forcefully swapping out processes and holding them idle for 20-second periods (p.s. I have never personally seen a production FreeBSD box get to the second stress point!).

Keep in mind that memory stress is best defined as "the system reused or paged out a page that some process then tries to use soon after". When this occurs, processes stall, and the system is forced to issue I/O (pageins) to unstall them. Worse, if the system just paged the page out and then has to turn around and page it back in, the processes stall even longer AND you wind up with a double whammy on the I/O load -- this is my definition of thrashing. Attempting to reuse pages from large processes which are truly accessing those pages will thrash the system much sooner than simply treating all pages and processes the same and focusing on the page-activity weighting.

					-Matt
					Matthew Dillon

--655889-1110733960-959184177=:24993
Content-Type: MESSAGE/RFC822; CHARSET=US-ASCII
Content-Description: Re: Linux VM/IO balancing (fwd to linux-mm?) (fwd)

Date: Tue, 23 May 2000 20:18:25 -0300 (BRST)
From: Rik van Riel
X-Sender: riel@duckman.distro.conectiva
To: Matthew Dillon
Subject: Re: Linux VM/IO balancing (fwd to linux-mm?)
In-Reply-To: <200005231939.MAA69005@apollo.backplane.com>

On Tue, 23 May 2000, Matthew Dillon wrote:

> :A = process a, rss size(A)
> :B = process b, rss size(B) [N times size(A)]
> :P(X) = memory pressure on process X
> :
> :    P(B) = P(A) * (size(B)/size(A)) * sqrt(size(B)/size(A))
> :
> :and the memory pressure on every page from process B will be:
> :
> :    P(page in B) = P(page in A) * sqrt(size(B)/size(A))
>
> Well, I have a pretty strong opinion on trying to rationalize penalizing big processes simply because they are big. It's a bad idea for several reasons, not the least of which being that by making such a rationalization you are assuming a particular system topology -- you are assuming, for example, that the system may contain a few large, less-important processes and a reasonable number of small processes. But if the system contains hundreds of small processes, or if some of the large processes turn out to be important, the rationalization fails.

The main assumption I was making here was that memory usage would be dominated by a small number of bigger programs and not by a larger number of small processes.
I see your point in that I may be wrong here in some situations. Also, memory pressure per page increases only two-fold if the bigger process is four times bigger than the smaller process... (likewise, 4 times for a 16-times-bigger process, etc.)

> Also, if the large process in question happens to really need the pages (is accessing them all the time), trying to page those pages out gratuitously does nothing but create a massive paging load on the system.

But we don't. All we do is scan those pages more often. If the big process is really using them all the time we will not page them out. But we *will* try the pages of the big process more often than the pages of smaller processes...

> Unless you have a mechanism (such as FreeBSD has) to impose a 20-second forced sleep under extreme memory loads, any focus on large processes will simply result in thrashing (read: screw up the system).

I have a proposal for this that could be implemented relatively easily. It is a variant on the swapping scheme that ITS had a long, long time ago and is "somewhat" different from what any of the Unices has. Since I didn't get a lot of response from the Linux community (some "sweet, I hope it works" replies on IRC), I'd be really interested in what someone like you would have to say (or other people in the know ... do you know any?).

	http://mail.nl.linux.org/linux-mm/2000-05/msg00273.html

> Another reason for not trying to weight things in favor of small processes is that the LRU algorithm *already* weights things in favor of small processes, simply by virtue of the fact that small processes tend to access/touch any given page in their RSS set much more often than large processes access/touch their pages. The same goes for shared pages --- which is why basing the algorithms on a physical page scan works well. Basing the algorithms on a virtual page scan, with lots of physical pages being redundantly scanned, skews the statistics badly.

I'm curious about this ... would a small interactive task like an xterm or a shell *really* touch its pages as much as a Netscape or Mathematica?

I have no doubts about the second point though ... that needs to be fixed. (But probably not before Linux 2.5; the changes in the code would be too invasive.)

> FreeBSD has two mechanisms to deal with large processes, both used only under duress. The first occurs when FreeBSD hits the memory stress point - it starts enforcing the 'memoryuse' resource limit on

I've grepped the source for this but haven't been able to find the code which does this. I'm very interested in what it does and how it is supposed to work (and how it would compare to my "push memory hogs harder" idea).

> processes. The second occurs when FreeBSD hits the 'holy shit we're starting to thrash' point - it starts forcefully swapping out processes and holding them idle for 20-second periods (p.s. I have never personally seen a production FreeBSD box get to the second stress point!).

hehe

> Keep in mind that memory stress is best defined as "the system reused or paged out a page that some process then tries to use soon after". When this occurs, processes stall, and the system is forced to issue I/O (pageins) to unstall them. Worse, if the system just paged the page out and then has to turn around and page it back in, the processes stall even longer AND you wind up with a double whammy on the I/O load -- this is my definition of thrashing.
> Attempting to reuse pages from large processes which are truly accessing those pages will thrash the system much sooner than simply treating all pages and processes the same and focusing on the page-activity weighting.

Ahh, but if a process is truly using those pages, we won't page them out. The trick is that the anti-hog code does *not* impose any limits ... it simply makes sure that a big process *really* uses its memory before it can stay at a large RSS size.

(But I agree that this is slightly worse from a global, system-throughput point of view ... and I must admit that I have no idea how much worse it would be.)

regards,

Rik
--
The Internet is not a network of computers. It is a network of people. That is its real strength.

Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/		http://www.surriel.com/

--655889-1110733960-959184177=:24993
Content-Type: MESSAGE/RFC822; CHARSET=US-ASCII
Content-Description: Re: Linux VM/IO balancing (fwd to linux-mm?) (fwd)

Date: Tue, 23 May 2000 17:31:50 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200005240031.RAA70157@apollo.backplane.com>
To: Rik van Riel
Subject: Re: Linux VM/IO balancing (fwd to linux-mm?)

:> Also, if the large process in question happens to really need the pages (is accessing them all the time), trying to page those pages out gratuitously does nothing but create a massive paging load on the system.
:
:But we don't. All we do is scan those pages more often. If the big process is really using them all the time we will not page them out. But we *will* try the pages of the big process more often than the pages of smaller processes...
:...

I don't think this really accomplishes your goal, because the effect of choosing the wrong page to reuse/flush is the same whether that page comes from a large process or a small process: a disk I/O, and (if the wrong page is chosen) the process faulting it back in immediately (a second disk I/O). This affects the entire system, not just that one process.

When you skew the algorithm towards the larger process you are in effect penalizing the large process, which I think is what you are trying to accomplish. BUT, you are also penalizing the entire system by virtue of causing the extra disk I/O. That's two marks against the idea... penalizing a large process without knowing whether it deserves to be penalized, and causing extra disk I/O by choosing the 'wrong' page (that is, choosing a page that is likely to result in more disk activity by being more active than the skewed statistic believes it is).

In a heavily loaded environment, using the normal center-weighted LRU algorithm, a larger process is going to be penalized anyway by virtue of the fact that it has more pages to be potentially reused. I don't think you want to penalize it even more.
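[Aside: the center-weighted accounting referred to here can be sketched as a bounded per-page activity counter: bumped when the page was referenced since the last scan, decayed when it was not, with eviction candidates drawn from pages that have decayed to zero. FreeBSD keeps a similar per-page act_count; the constants and names below are illustrative rather than the actual kernel code.]

    #define ACT_MAX      64   /* cap, so one busy burst cannot pin a page forever */
    #define ACT_ADVANCE   3   /* credit for being referenced since the last scan */
    #define ACT_DECLINE   1   /* decay for sitting idle through a scan */

    struct page_stat {
        int act_count;        /* center-weighted activity score of one physical page */
    };

    /* One scanner pass over a physical page; returns 1 when the page has gone
     * idle long enough to be a candidate for the inactive queue. */
    int page_scan_step(struct page_stat *p, int referenced)
    {
        if (referenced) {
            p->act_count += ACT_ADVANCE;
            if (p->act_count > ACT_MAX)
                p->act_count = ACT_MAX;
            return 0;
        }
        if (p->act_count > ACT_DECLINE)
            p->act_count -= ACT_DECLINE;
        else
            p->act_count = 0;
        return p->act_count == 0;
    }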
Quick mathematical proof: if you have a small process with 10 pages in core and a large process with 100 pages in core, and all the pages have the same weighting, and the allocator tries to allocate a new page, 90% of the allocations are going to be taken from the large process and only 10% will be taken from the small process. Thus the large process is already being penalized by the normal algorithm... you don't have to penalize it any more. The algorithm is self-correcting for large processes.

:I have a proposal for this that could be implemented relatively easily.
:...
:
:	http://mail.nl.linux.org/linux-mm/2000-05/msg00273.html

Very interesting idea! In reference to your comment 'swapping doesn't actually swap anything...' near the end... this is *precisely* the method that FreeBSD uses when the idle-swap system feature is turned on. FreeBSD also depresses the activity weight of the pages to make them more likely to be reused (FreeBSD does this simply by moving the page out of the active queue and onto the inactive queue).

I'm not sure you can use p->state as one of the bases for your memory load calculation, because there are many classes of I/O-bound applications which nevertheless have low memory footprints. You have to be careful not to generate a positive feedback loop. That is, where you make some calculation X which has consequence Y, which has the side effect of increasing the calculation X2, which has consequence Y2, and so forth, until the calculation of X becomes absurd and is no longer a proper measure of the state of the system.

:I'm curious about this ... would a small interactive task like an xterm or a shell *really* touch its pages as much as a Netscape or Mathematica?
:
:I have no doubts about the second point though ... that needs to be fixed. (But probably not before Linux 2.5; the changes in the code would be too invasive.)

Yes, absolutely. It's simply due to tighter code... with fewer active pages, smaller processes tend to touch more of them whenever they do anything (like hitting a key in an xterm). Larger processes tend to zone their memory. They will have a core set of pages they touch as much as any other process, but the more complex operations of larger processes will also touch additional pages depending on what they are doing. So with something like Netscape, the 'broader core' set of pages it touches will be different depending on what site you are visiting. As you visit more sites, the RSS footprint starts to bloat, but many of those pages will tend to be more 'idle' than the core pages in the smaller process.

:> FreeBSD has two mechanisms to deal with large processes, both used only under duress. The first occurs when FreeBSD hits the memory stress point - it starts enforcing the 'memoryuse' resource limit on
:
:I've grepped the source for this but haven't been able to find the code which does this. I'm very interested in what it does and how it is supposed to work (and how it would compare to my "push memory hogs harder" idea).

vm/vm_pageout.c, in the vm_daemon() process, around line 1392. The vm_daemon is only woken up under certain heavy-load situations. The actual limit check occurs on line 1417, and the forced page deactivation occurs on line 1431 (the call to vm_pageout_map_deactivate_pages).

The enforced sleep is handled by the swapin code: vm/vm_glue.c, line 364 (in scheduler()). If a process is forcefully swapped out, it is not allowed to wake up until the system has recovered sufficiently for it to be able to run reasonably well.
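[Aside: a rough sketch of the two escalation stages described above: deactivating pages of processes that exceed their 'memoryuse' limit, and refusing to swap a forcibly swapped-out process back in until memory has recovered or it has served out its penalty. All field and function names are invented for illustration; the real logic lives in FreeBSD's vm/vm_pageout.c (vm_daemon) and vm/vm_glue.c (scheduler), as cited above.]

    struct proc_mem {
        unsigned long rss;              /* resident pages */
        unsigned long memoryuse_limit;  /* per-process RSS limit, in pages */
        unsigned long swapped_out_at;   /* tick at which it was forcibly swapped out */
    };

    /* Stage 1 (memory stress): push a process back under its 'memoryuse'
     * limit by deactivating its excess resident pages. */
    void enforce_memoryuse(struct proc_mem *p,
                           void (*deactivate)(struct proc_mem *, unsigned long npages))
    {
        if (p->rss > p->memoryuse_limit)
            deactivate(p, p->rss - p->memoryuse_limit);
    }

    /* Stage 2 (thrashing): a forcibly swapped-out process stays asleep until
     * free memory recovers, or until it has slept long enough that it gets
     * let back in even though the system is still struggling. */
    int may_swap_back_in(const struct proc_mem *p, unsigned long now,
                         unsigned long free_pages, unsigned long free_target,
                         unsigned long max_penalty_ticks)
    {
        if (free_pages >= free_target)
            return 1;
        return now - p->swapped_out_at >= max_penalty_ticks;
    }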
The longer the process stays in its forced sleep, the more likely it is to be given the CPU back, so after a certain point it *will* wake up again, even if the system is still thrashing. It's not strictly 20 seconds any more (it was in the pre-freeware BSDs).

(I'm assuming you are using the 4.x source tree as your reference here.)

:> Keep in mind that memory stress is best defined as "the system reused or paged out a page that some process then tries to use soon after". When this occurs, processes stall, and the system is forced to issue I/O (pageins) to unstall them. Worse, if the system just paged the page out
:...
:
:Ahh, but if a process is truly using those pages, we won't page them out. The trick is that the anti-hog code does *not* impose any limits ... it simply makes sure that a big process *really* uses its memory before it can stay at a large RSS size.

Everything starts to break down when you hit extreme loads, even simple things like trying to give priority to one type of process over another. If you weight in favor of smaller processes and happen to be running on a system with larger processes, your algorithm could (for example) wind up scanning their pages *too* often, depressing their priority so quickly that the system may believe it can page out their pages when, in fact, those processes are still using them.

Another example: when a system starts to thrash, *ALL* processes doing any sort of I/O (paging or normal I/O) begin to stall. If, under heavy load, your algorithm 'speeds up' the page scan on an unequal basis (only for certain processes), then the combination of the sped-up scan AND the lower page access rate that is occurring due to the thrashing itself may lead your algorithm to believe it can steal pages from those processes, when in reality stealing those pages will only result in even worse thrashing.

In fact, one of the reasons why FreeBSD went to a multi-queue approach was to avoid these sorts of positive feedback loops. It has a separate 'active' queue as a means of guaranteeing that when a page is faulted in, it *stays* in for a certain minimum period of time before being available for paging out again, and if the enforced wait causes the system to run low on memory and stall more, then too bad -- the alternative (insane thrashing) is worse.

:(But I agree that this is slightly worse from a global, system-throughput point of view ... and I must admit that I have no idea how much worse it would be.)
:
:regards,
:
:Rik

					-Matt
					Matthew Dillon

--655889-1110733960-959184177=:24993--
--655889-1842000895-959184177=:24993--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/