From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Tue, 3 May 2005 01:08:46 -0700 From: Andrew Morton Subject: Re: [PATCH/RFC 0/4] VM: Manual and Automatic page cache reclaim Message-Id: <20050503010846.508bbe62.akpm@osdl.org> In-Reply-To: <4277259C.6000207@engr.sgi.com> References: <20050427150848.GR8018@localhost> <20050427233335.492d0b6f.akpm@osdl.org> <4277259C.6000207@engr.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: Ray Bryant Cc: mort@sgi.com, linux-mm@kvack.org, raybry@sgi.com, ak@suse.de List-ID: Ray Bryant wrote: > > ... > One of the common responses to changes in the VM system for optimizations > of this type is that we instead should devote our efforts to improving > the VM system algorithms and that we are taking an "easy way out" by > putting a hack into the VM system. There's that plus the question which forever lurks around funky SGI patches: How many machines in the world want this feature? Because if the answer is "twelve" then gee it becomes hard to justify merging things into the mainline kernel. Particularly when they add complexity to page reclaim. > Fundamentally, the VM system cannot > predict the future behavior of the application in order to correctly > make this tradeoff. Yup. But we could add a knob to each zone which says, during page allocation "be more reluctant to advance onto the next node - do some direct reclaim instead" And the good thing about that is that it is an easier merge because it's a simpler patch and because it's useful to more machines. People can tune it and get better (or worse) performance from existing apps on NUMA. Yes, if it's a "simple" patch then it _might_ do a bit of swapout or something. But the VM does prefer to reclaim clean pagecache first (as well as slab, which is a bonus for this approach). Worth trying, at least? > > Why isn't POSIX_FADV_DONTNEED good enough here? > ---------------------------------------------- I was going to ask that. > We've tried that too. If the application is sufficiently aware of what > files it has opened, it could schedule those page cache pages to be > released. Unfortunately, this doesn't handle the case of the last > application that ran and wrote out a bunch of data before it terminated, > nor does it deal very well with shell scripts that stage data onto and > off of the compute node as part the job's workflow. Ah. But to do this we need to be able to answer the question "what files are in pagecache, and how much pagecache do they have". (And "on what nodes", but let's ignore that coz it's hard ;)) And something like this would be an easier merge because it's useful to more than twelve machines. It could be done in userspace, really. Hack into glibc's open() and creat() to log file opening activity, something silly like that. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: aart@kvack.org