From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4277259C.6000207@engr.sgi.com>
Date: Tue, 03 May 2005 02:17:48 -0500
From: Ray Bryant
MIME-Version: 1.0
Subject: Re: [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim
References: <20050427150848.GR8018@localhost> <20050427233335.492d0b6f.akpm@osdl.org>
In-Reply-To: <20050427233335.492d0b6f.akpm@osdl.org>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Andrew Morton
Cc: Martin Hicks, linux-mm@kvack.org, raybry@sgi.com, ak@suse.de
List-ID:

Andrew Morton wrote:
> Martin Hicks wrote:
>
>> The patches introduce two different ways to free up page cache from a
>> node: manually through a syscall and automatically through flag
>> modifiers to a mempolicy.
>
> Backing up and thinking about this a bit more....
>
>> Currently if a job is started and there is page cache lying around on a
>> particular node then allocations will spill onto remote nodes and page
>> cache won't be reclaimed until the whole system is short on memory.
>> This can result in a significant performance hit for HPC applications
>> that planned on that memory being allocated locally.
>
> Why do it this way at all?
>
> Is it not possible to change the page allocator's zone fallback mechanism
> so that once the local node's zones' pages are all allocated, we don't
> simply advance onto the next node?  Instead, could we not perform a bit of
> reclaim on this node's zones first?  Only advance onto the next nodes if
> things aren't working out?
>

Effectively, that is what we are trying to do with this set of patches.
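In rough pseudo-C -- the helper names below are made up for illustration
and are not taken from the patchset -- the allocation-time behaviour Andrew
describes, made opt-in per task, would look something like this:

    /*
     * Sketch only: try the local node, optionally shed clean page cache
     * there, and fall back to remote nodes only as a last resort.
     */
    struct page *alloc_page_local_first(gfp_t gfp)
    {
            struct page *page;

            /* 1. Normal attempt on the local node's zones. */
            page = try_alloc_on_node(numa_node_id(), gfp);
            if (page)
                    return page;

            /* 2. Only if the task's policy has opted in: reclaim clean,
             *    unmapped page-cache pages on the local node and retry. */
            if (policy_wants_local_pagecache_reclaim(current)) {
                    reclaim_clean_pagecache(numa_node_id());
                    page = try_alloc_on_node(numa_node_id(), gfp);
                    if (page)
                            return page;
            }

            /* 3. Otherwise fall back to the remote nodes in the zonelist. */
            return try_alloc_on_fallback_nodes(gfp);
    }

Whether step 2 is a good idea at all is entirely workload dependent, which
is why it has to be opt-in; the rest of this note explains why.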
Let me describe the problem we are trying to solve a little more clearly,
and explain how we arrived at this particular set of patches.

Before we start on that, however, it is important to understand that this
is a crucial performance optimization for certain kinds of workloads
(admittedly, only on NUMA hardware).  That is why we have made it a
controllable policy, enabled only for those workloads where it makes
sense.  When the policy is not enabled, the code is neutral with respect
to the VM algorithms.  We do not expect this code to be enabled for a
traditional workload where LRU aging is important.  So, while it is true
that the proposed patch modifies LRU ordering, that should not be a
fundamental argument against the patchset: for workloads where keeping
the LRU ordering correct matters, the page cache reclaim code would
simply not be enabled.

Secondly, I have run benchmarks of OpenMP applications with and without
this type of page cache reclaiming optimization.  Without the kind of
operations proposed here (more on the scenarios below), there can be a
30-40% reduction in performance, because storage the application believes
is local to a thread is actually allocated on a remote node.  So the
optimizations proposed here do matter, and they can be quite significant.

So what is the problem we are trying to solve?
----------------------------------------------

We are trying to fix the "stickiness" of page-cache pages on NUMA systems
for workloads where local allocation of storage is crucial.  (Note well:
this is not all workloads.)

In many cases, caching disk data in memory is very important to
performance, so the correct tradeoff in most cases is to allocate remotely
when a local page is not available, rather than to look for local pages
that could be freed instead.

However, the typical scenario we run up against is the following.  We
start a long-running parallel application.  As part of the application's
workflow, a large amount of data is staged (copied) from a distributed
file system to higher-speed local storage.  The data being copied can be
10's to 100's of GB.  This data is brought into the page cache, and the
pages become clean through the normal I/O process.

Remember, the curse of a large NUMA machine is that there is lots of
memory, but practically none of it is local.  (For example, on a 512-CPU
Altix with the same amount of memory on each node, only 1/256th of the
memory is local.)  So the effect of copying this data is that a
non-trivial number (but not all) of the nodes on the machine become
completely filled with page cache.

Now, when the parallel application starts, it pins processes to nodes and
tries to allocate storage local to those processes.  This is required for
good performance -- the applications are optimized to place heavily
referenced data in storage that the application expects to be local to
the thread.  Since some of the nodes are full of page cache, the processes
running on those nodes don't get local storage and hence run more slowly.
We then run up against the second rule of parallel processing: a parallel
application only runs as quickly as its slowest thread.  So the
performance of the entire parallel job suffers because a few of the
threads didn't get the local storage they expected.

What we have done in our current production kernels to work around this
problem is to implement "manual" page cache reclaim.  This is the
toss_page_cache_nodes patch that we previously submitted.  The
disadvantage of that patch is that it is a "big hammer": it causes all
clean page-cache pages on the selected nodes to be released.  The idea of
the current patch is to reclaim only as much clean page cache as the
application requires, by reacting to allocation requests and freeing
storage in proportion to those requests.

Why must this be an optionally controlled component of the VM?
--------------------------------------------------------------

Because this is fundamentally a workload-dependent optimization.  Many
workloads want the normal VM algorithms to apply: caching data is
important, and until the entire system is under memory pressure, it makes
sense to keep that data around.  New page allocation requests that can be
satisfied remotely should be satisfied on a remote node, since the system
has no way of knowing how important local storage is to the application.
(Equivalently, the O/S has no way of knowing how long and how intensely
the newly allocated page will be used, so it cannot make an intelligent
tradeoff about where to allocate the page.)

Effectively, the interface we are proposing is a way of telling the
system that, for this application, getting local storage is more
important than caching data.  It needs to be optional because that
tradeoff does not apply to all applications.  But for our parallel
application, which may run for 10's to 100's of hours, local storage is
crucial and the O/S should work quite hard to allocate it.  The time
spent doing so now will be more than made up for by the increased
efficiency of the application over its long run.  So we need a way to
tell the O/S that this is the case for this particular application.
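For concreteness, here is roughly what the application side looks like
today with the existing CPU-affinity and mempolicy interfaces; the
patchset would add flag modifiers on top of a policy like this.  This is
only an illustrative sketch (the node and CPU numbers are made up), not
code from the patches:

    /* build: gcc bindlocal.c -lnuma -o bindlocal
     * (numaif.h and the set_mempolicy() wrapper come from numactl/libnuma) */
    #define _GNU_SOURCE
    #include <numaif.h>     /* set_mempolicy(), MPOL_PREFERRED */
    #include <sched.h>      /* sched_setaffinity(), CPU_SET() */
    #include <stdlib.h>
    #include <string.h>

    static void bind_self_to_node(int cpu, int node)
    {
            cpu_set_t cpus;
            unsigned long nodemask = 1UL << node;

            /* Pin this process to one CPU on the node ... */
            CPU_ZERO(&cpus);
            CPU_SET(cpu, &cpus);
            sched_setaffinity(0, sizeof(cpus), &cpus);

            /* ... and tell the kernel to prefer that node for its memory.
             * MPOL_PREFERRED falls back to other nodes when the preferred
             * node has no free pages -- and a node full of clean page
             * cache looks "full" to current kernels, which is exactly the
             * spill described above.  The proposed flag modifiers would
             * attach "reclaim clean page cache first" behaviour to a
             * policy like this one. */
            set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8);
    }

    int main(void)
    {
            bind_self_to_node(0, 0);

            /* Fault the working set in under the policy set above. */
            size_t sz = 64UL << 20;     /* 64 MB of "local" data */
            char *buf = malloc(sz);
            memset(buf, 1, sz);
            return 0;
    }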
Why can't the VM system just figure this out?
---------------------------------------------

One common response to changes in the VM system for optimizations of this
type is that we should instead devote our efforts to improving the VM
algorithms, and that we are taking an "easy way out" by putting a hack
into the VM system.  Fundamentally, though, the VM system cannot predict
the future behavior of the application, so it cannot make this tradeoff
correctly on its own.  Since the application programmer (in this
environment) typically knows a lot about the behavior of the application,
it simply makes sense to give the developer a way of telling the operating
system what is about to happen rather than having the O/S guess.  Without
such an interface, the O/S's tradeoff will normally be to allocate
remotely if local storage is not available.

In the past, it has been suggested that the way to react to poor
local/remote placement is to watch the application (using storage
reference counters in the memory interconnect, for example), find pages
that appear to be incorrectly placed, and move them.  (This is the
so-called "Automatic Page Migration" problem.)  Our experience at SGI with
such algorithms is that they don't work very well.  Part of the reason is
that the penalty for making a mistake is very high -- moving a page takes
a long time, and if you move it to the wrong node you can be very sorry.
The other part is that sampling-based methods give you only partially
correct information about page placement, which leads to occasional
mistakes.  The combination of these two factors results in poor placement
decisions and a correspondingly poorly performing system.

A further problem is that sampling is historical rather than predictive.
Just when the O/S has gathered enough samples to make a migration
decision, the computation can start a new phase, invalidating that
decision without the operating system's knowledge.  So it does the wrong
thing.

Why isn't it good enough to use the synchronous page cache reclaim path?
-------------------------------------------------------------------------

There are basically two reasons: (1) we have found it to be too slow
(running the entire synchronous reclaim path on even a moderately large
Altix system can take 10's of minutes), and (2) it is indiscriminate, in
that it can also free mapped pages, which we want to keep around.

Effectively, what we are looking for is a way to tell the VM system that
allocating local storage is more important to this application than
caching clean file system pages.  (Setting vm_swappiness=0 doesn't do this
correctly, because it is global to the system rather than per application,
and in certain cases we have found that vm_swappiness=0 can cause the VM
system to live-lock if the system is then overcommitted due to mapped
pages.)

Why isn't POSIX_FADV_DONTNEED good enough here?
----------------------------------------------

We've tried that too.  If the application is sufficiently aware of what
files it has opened, it could schedule those page cache pages to be
released.
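For files the job itself knows about, that looks roughly like the sketch
below (the file name is just a placeholder):

    /* build: gcc drop_cache.c -o drop_cache */
    #include <fcntl.h>
    #include <unistd.h>

    static int drop_cached_pages(const char *path)
    {
            int fd = open(path, O_RDONLY);
            int ret;

            if (fd < 0)
                    return -1;

            /* Flush any dirty pages first; DONTNEED only discards
             * clean page-cache pages. */
            fsync(fd);

            /* offset 0, length 0 == advise over the whole file */
            ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

            close(fd);
            return ret;
    }

    int main(void)
    {
            /* hypothetical file staged by this job */
            return drop_cached_pages("/scratch/jobdata.dat");
    }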
Unfortunately, this doesn't handle the case of the last application that
ran and wrote out a bunch of data before it terminated, nor does it deal
very well with shell scripts that stage data onto and off of the compute
node as part of the job's workflow.

So how did we end up with this particular set of patches?
--------------------------------------------------------

This set of patches is based, in part, on experience with our 2.4.21-based
kernels.  Those kernels had an "automatic page cache reclaim" facility,
and our benchmarks have shown that this is almost as good as using the
"manual page cache reclaim" approach we previously proposed.  Our
experience with those kernels was that the synchronous reclaim path was
too slow, so we special-cased the search with code that paralleled the
existing code but released only clean page-cache pages.

For 2.6.x, we didn't want code that duplicated much of the VM path in a
separate routine; instead, we wanted to modify the existing VM routines
slightly so they would release only clean page-cache pages and leave
mapped storage alone.  Hence the extensions proposed to the "scan control"
structure.

Originally, we wanted to start with the "manual page cache release" code
we previously proposed, but that got shot down, so here we are with the
"automatic page cache release" approach.

I hope this all helps, rather than hinders, the discussion of Martin's
patchset.  Discussion, complaints, and flames all happily accepted by
yours truly,

-- 
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com              raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: aart@kvack.org