Subject: Re: [PATCH] Document Linux Memory Policy
From: Lee Schermerhorn
Date: Thu, 31 May 2007 11:26:44 -0400
To: Gleb Natapov
Cc: Andi Kleen, Christoph Lameter, linux-mm, Andrew Morton

On Thu, 2007-05-31 at 14:30 +0300, Gleb Natapov wrote:
> On Thu, May 31, 2007 at 02:04:12PM +0300, Gleb Natapov wrote:
> > On Thu, May 31, 2007 at 12:43:19PM +0200, Andi Kleen wrote:
> > >
> > > > > The faulted page will use the memory policy of the task that faulted it
> > > > > in. If that process has numa_set_localalloc() set then the page will be
> > > > > located as closely as possible to the allocating thread.
> > > >
> > > > Thanks. But I have to say this feels very unnatural.
> > >
> > > What do you think is unnatural exactly? "First one wins" seems like a
> > > quite natural policy to me.
> >
> > No, it is not (not always). I want to create shared memory for
> > interprocess communication. Process A will write into the memory and
> > process B will periodically poll it to see if there is a message there.
> > On a NUMA system I want the physical memory for this VMA to be allocated
> > from a node close to process B, since B will use it much more frequently.
> > But I don't want to pre-fault all the pages in process B to achieve this,
> > because the region can be huge and because it doesn't guarantee much if
> > swapping is involved. So numa_set_localalloc() looks like it achieves
> > exactly this. Without this function I agree that "first one wins" is a
> > very sensible assumption, but when each process has stated its preferences
> > explicitly by calling the function, it is no longer sensible to me as a
> > user of the API. When you start to think about how memory policy may be
>
> OK, now, rereading the man page, I see that numa_tonode_memory() can
> achieve this without pre-faulting. A should know which CPU B is running
> on, but this is a minor problem.

Gleb:

numa_tonode_memory() won't do what you want if the file is mapped shared.
The numa_*_memory() interfaces use mbind(), which installs a VMA policy in
the address space of the caller. When a page is faulted in for an mmap'd
file, the page will be allocated using the faulting task's task policy, if
any, else the system default.

Now, if it were a private mapping and you write/store to the pages, the
kernel will COW and give you a page that obeys the policy you installed in
your address space. But shared mappings don't COW, so you retain the
original page that followed the faulting task's policy.

There are currently two ways that I know of to explicitly place a page in
a shared file mapping:

1) Let the task policy default to local, or explicitly specify local
   allocation [numa_set_localalloc()], and then ensure that you prefault
   the page while executing on a cpu local to the node where you want
   the page.

2) Set the task policy to bind to the desired node and prefault the page.

Of course, if the page gets reclaimed and later faulted back in, the
location will depend on the task policy, and possibly the location, of the
task that causes the refault.
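To make (1) and (2) concrete, here is a rough, untested user-space sketch
using libnuma and the set_mempolicy() wrapper from <numaif.h>.  The file
name "shared_file", the node number, the segment size and the two helper
names are made up for illustration; call whichever helper matches the
approach you want.

/*
 * Untested sketch of (1) and (2) above; error handling is minimal.
 */
#include <numa.h>       /* numa_available(), numa_set_localalloc(), numa_run_on_node() */
#include <numaif.h>     /* set_mempolicy(), MPOL_BIND */
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

static void prefault(char *p, size_t len, long pagesize)
{
        size_t off;

        /* read-fault every page so it is allocated under the current policy */
        for (off = 0; off < len; off += pagesize)
                (void)*(volatile char *)(p + off);
}

/* (1) local allocation + run on the target node while prefaulting */
static void place_local(char *p, size_t len, long pagesize, int node)
{
        numa_set_localalloc();
        if (numa_run_on_node(node) < 0)
                perror("numa_run_on_node");
        prefault(p, len, pagesize);
}

/* (2) bind the task policy to the target node; where we run doesn't matter */
static void place_bind(char *p, size_t len, long pagesize, int node)
{
        unsigned long nodemask = 1UL << node;

        if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8) < 0)
                perror("set_mempolicy");
        prefault(p, len, pagesize);
        /* optionally restore the default task policy afterwards: */
        set_mempolicy(MPOL_DEFAULT, NULL, 0);
}

int main(void)
{
        long pagesize = sysconf(_SC_PAGESIZE);
        size_t len = 64UL * 1024 * 1024;        /* made-up region size */
        int node = 1;                           /* made-up target node */
        int fd;
        char *p;

        if (numa_available() < 0) {
                fprintf(stderr, "no NUMA support\n");
                exit(1);
        }
        fd = open("shared_file", O_RDWR);       /* hypothetical shared file */
        if (fd < 0) {
                perror("open");
                exit(1);
        }
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }

        place_local(p, len, pagesize, node);    /* or place_bind(...) */

        /* ... hand the mapping over to the consumer ... */
        return 0;
}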
I've been proposing patches to generalize the shared policy support
enjoyed by shmem segments for use with shared mmap'd files. I was
beginning to think that I'm the only one with applications [well, with
customers with applications] that need this behavior. Sounds like your
requirements are very similar: huge file [don't want to prefault, nor
wait for it to all be read into shmem before starting processing], only
accesses via mmap, ...

> > implemented in the kernel, and understand that memory policy is a
> > property of an address space (is it?) and not of a file, then you
> > start to understand the current behaviour, but these are
> > implementation details.
> >
> > > > So to have the desirable effect I have to create shared memory
> > > > with shmget?
> > >
> > > shmget behaves the same.
> > >
> > Then I misinterpreted the "Shared Policy" section of Lee's document.
> > It seems that he states that for a memory region created with shmget
> > the policy is a property of the shared object and not of a process'
> > address space.
> >
> The man page states:
>
> Memory policy set for memory areas is shared by all threads of the
> process. Memory policy is also shared by other processes mapping the
> same memory using shmat(2) or mmap(2) from shmfs/hugetlbfs. It is not
> shared for disk backed file mappings right now although that may change
> in the future.
>
> So what does this mean? If I set a local policy for a memory region in
> process A, should it be obeyed by memory accesses from process B?

shmem does, indeed, work this way. Policies installed on ranges of the
shared segment via mbind() are stored with the shared object.

I think the future is now: time to share policy for disk-backed file
mappings.

Lee
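P.S. For completeness, a rough, untested sketch of the shmem behavior
described above: a policy installed with mbind() on a range of the
attached segment is stored with the shared object, so another task that
attaches the same segment and faults the pages should get them on the
chosen node regardless of its own task policy. The key, segment size and
node number below are made up.

/*
 * Untested sketch: install a shared policy on a shmem segment.
 */
#include <numaif.h>     /* mbind(), MPOL_BIND */
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        key_t key = 0x1234;                     /* made-up key */
        size_t len = 16UL * 1024 * 1024;        /* made-up segment size */
        int node = 0;                           /* made-up target node */
        unsigned long nodemask = 1UL << node;
        int shmid;
        void *addr;

        shmid = shmget(key, len, IPC_CREAT | 0600);
        if (shmid < 0) {
                perror("shmget");
                exit(1);
        }
        addr = shmat(shmid, NULL, 0);
        if (addr == (void *)-1) {
                perror("shmat");
                exit(1);
        }

        /*
         * Install a BIND-to-node policy on the whole segment.  For shmem
         * this policy lives with the segment, not with this task's
         * address space, so it applies no matter which process faults
         * the pages in later.
         */
        if (mbind(addr, len, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, 0) < 0)
                perror("mbind");

        /*
         * Another process can now shmget()/shmat() the same key and
         * simply touch the pages; they should land on 'node'.
         */
        return 0;
}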