Subject: Re: [PATCH] Document Linux Memory Policy
From: Lee Schermerhorn
Date: Fri, 01 Jun 2007 15:38:33 -0400
Message-Id: <1180726713.5278.80.camel@localhost>
To: Christoph Lameter
Cc: Andi Kleen, Gleb Natapov, linux-mm, Andrew Morton

On Fri, 2007-06-01 at 11:43 -0700, Christoph Lameter wrote:
> On Fri, 1 Jun 2007, Lee Schermerhorn wrote:
>
> > Like Gleb, I find the different behaviors for different memory regions
> > to be unnatural.  Not because of the fraction of applications or
> > deployments that might use them, but because [speaking for customers] I
> > expect and want to be able to control placement of any object mapped
> > into an application's address space, subject to permissions and
> > privileges.
>
> Same here and I wish we had a clean memory region based implementation.
> But that is just what your patches do *not* provide.  Instead they are
> file based.  They should be memory region based.
>
> Would you please come up with such a solution?

Christoph:  I don't understand what you mean by "memory region based".
Linux does not have bona fide "memory objects" that sit between a task's
address space and the backing store--be it swap or regular files--like
some systems I've worked with.  Rather, anonymous regions are described
by the vm_area_struct, and pages backing those regions must be referenced
by one or more ptes, by a swap cache entry, or both.  For a disk-backed
file mapped into a task's address space, the vma points directly to the
inode+address_space structures via the file structure.  Shmem regions
attach to a task's address space much like regular files--via a pseudo-fs
inode+address_space.

I don't know the rationale, but I suspect that Linux dispenses with the
extra memory object layer to conserve memory on smaller systems.  And
that's a good thing, IMO.

So, for a shared memory mapped file, the inode+address_space--i.e., the
in-memory incarnation of the file--is as close to a "memory region" as we
have.  It contains the mapping between [file/address] offset and memory
page.  It's the only object representing the file and its in-memory pages
that gets shared between multiple task address spaces.  That seems, to
me, to be the natural place to hang the shared policy.  Indeed, this is
where we attach shared policy to shmem/tmpfs/hugetlbfs pseudo-files.

Even if we had a layer between the vmas and the files/inodes, I don't see
what that would buy us.  We'd still want to maintain coherency between
files accessed via file descriptor system calls and files mapped via
mmap(SHARED).  That's one of the purposes of a shared page cache.  [I've
seen Unix variants where these weren't coherent.  Now THAT's unnatural
;-)!]

So, yes, any policy applied to a memory mapped file affects the location
of pages accessed via file descriptor access.  That's a good thing for
the applications that use shared mapped files.  The load/store access by
the application that maps the file, and goes to the trouble of specifying
memory policy, takes precedence.  Load/store is the "fast path"; file
descriptor access system calls are the slow path.
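To make that concrete, here's a minimal userspace sketch of "hanging a
policy on the shared object": mbind(2) applied to a tmpfs-backed
MAP_SHARED mapping, which is the case where shared policy already works
today.  It assumes libnuma's <numaif.h> for the mbind() prototype (link
with -lnuma) and a machine with at least nodes 0 and 1; the /dev/shm path
and the sizes are only for illustration.

    /*
     * Sketch only: interleave a shared, tmpfs-backed mapping across
     * nodes 0 and 1.  Per the discussion above, the shared policy is
     * stored with the shmem object (its inode+address_space), so any
     * other task mapping or reading the same object sees pages placed
     * by this policy rather than by its own task policy.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <numaif.h>             /* mbind(), MPOL_* -- from libnuma */

    int main(void)
    {
            size_t len = 16UL << 20;        /* 16MB -- illustrative size */
            unsigned long nodemask = 0x3;   /* nodes 0 and 1 -- assumes both exist */

            /* /dev/shm is tmpfs on most distros; the path is made up */
            int fd = open("/dev/shm/mempol-example", O_CREAT | O_RDWR, 0600);
            if (fd < 0 || ftruncate(fd, len) < 0) {
                    perror("open/ftruncate");
                    return 1;
            }

            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /* Install an interleave policy on this range of the shared object */
            if (mbind(p, len, MPOL_INTERLEAVE, &nodemask,
                      sizeof(nodemask) * 8, 0) != 0)
                    perror("mbind");

            memset(p, 0, len);      /* fault pages in; placement follows the policy */

            munmap(p, len);
            close(fd);
            return 0;
    }

Contrast that with set_mempolicy(2), which changes only the calling
task's default policy--roughly the "page cache follows task policy"
behavior being debated below.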
You're usually gung-ho about locality on a NUMA platform--avoiding
off-node accesses and page allocations, respecting the fast path, ...
Why the resistance here?

>
> > Then why does Christoph keep insisting that "page cache pages" must
> > always follow task policy, when shmem, tmpfs and anonymous pages don't
> > have to?
>
> No I just said that the page cache handling is consistently following
> task policy.

Well, not for anon, shmem, tmpfs, ... page cache pages.  All of those are
page cache based, according to Andi, and they certainly aren't
constrained to "consistently follow task policy".

Of course, I'm just being facetious [and, no doubt, annoying] to make a
point.  We're using the same words, sometimes referring to the same
concepts, but in slightly different contexts, and "talking past each
other".  I'm trying real hard to believe that this is what's happening in
this entire exchange.  That's the most benign reason I can come up
with...

Lee