From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 2 Mar 1998 23:03:05 GMT
Message-Id: <199803022303.XAA03640@dax.dcs.ed.ac.uk>
From: "Stephen C. Tweedie"
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Subject: Re: reverse pte lookups and anonymous private mappings; avl trees?
In-Reply-To:
References:
Sender: owner-linux-mm@kvack.org
To: "Benjamin C.R. LaHaise"
Cc: linux-mm@kvack.org, torvalds@transmeta.com
List-ID:

Hi Ben,

On Mon, 2 Mar 1998 14:04:01 -0500 (U), "Benjamin C.R. LaHaise" said:

> Okay, I've been ripping my hair out this weekend trying to get my
> reverse pte lookups working using the inode/vma walking scheme, and
> I'm about to go mad because of it.

I'm not in the slightest bit surprised.  I've been giving it a lot of
thought, and it just gets uglier and uglier. :)

The problem --- of course --- is that the inheritance relationship of
private pages is a tree.  If we have a process which forks off several
children, then in any given privately mapped vma, some pages can be
shared between all children.  Other children may have their own
private copies of other pages.  If _those_ children fork, we get a
subtree within which certain pages are shared only amongst that
subtree; other pages in the leaf processes are not shared anywhere at
all; and still other pages are shared by the whole tree.  Ouch.

Basically, what it comes down to is that after forking, the vma is not
really a good unit by which to account for individual pages.

> Here are the alternatives I've come up with.  Please give me some
> comments as to which one people think is the best from a design
> standpoint (hence the cc to Linus) and performance-wise, too.  Note
> that I'm ignoring the problem of overlapping inode/offset pairs
> given the new swap cache code.  (this is probably riddled with
> typos/thinkos)
>
> a) add a struct vm_area_struct pointer and vm_offset to struct
> page.  To each vm_area_struct, add a pair of pointers so that there's
> a list of vm_area_structs which are private, but can be sharing pages.

What about privately mapped images, such as shared libraries?  Will
those vmas end up with one inode for the backing file and a separate
inode for the private pages?  Such private vmas still have anonymous
pages in them.

> Hence we end up with something like:
>
> page--> vma -> vm_next_private -> ... (circular list)
>          |              |
>        vm_mm          vm_mm
>          |             ...
>         ...    (use page->vm_offset - vma->vm_offset + vma->vm_start)
>          |
>         pte
>
> This, I think, is the cleanest approach.

What happens to these vmas and inodes on fork()?  On mremap()?

> It makes life easy for finding shared anonymous pages,

Shared anonymous pages are a very special case, and they are much,
much, much easier than the general case.  With shared anonymous vmas,
_all_ pages in the entire vma are guaranteed to be shared; there is no
mixing of shared and private pages as there is with a private map
(anonymous or not).  My original plan with the swap cache was to
introduce a basis inode for vmas only for this special case; all other
cases are dealt with adequately by the swap cache scheme (not counting
the performance issues of being able to swap out arbitrary pages).
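Coming back to the data structures for (a) for a moment, just to make
sure we are talking about the same thing: here is a rough, completely
untested sketch of the extra fields and the reverse walk I think you
are describing.  The field names (the vma pointer and vm_offset in
struct page, vm_next_private in the vma) are taken from your mail;
the helper and everything else is purely illustrative and ignores
locking entirely.

struct mm_struct;

struct vm_area_struct {
	struct mm_struct	*vm_mm;
	unsigned long		 vm_start, vm_end;
	unsigned long		 vm_offset;
	struct vm_area_struct	*vm_next_private; /* circular list of private
						     vmas which may share pages */
	/* ... */
};

struct page {
	/* ... existing fields ... */
	struct vm_area_struct	*vma;		/* one vma currently mapping the page */
	unsigned long		 vm_offset;	/* offset of the page within that mapping */
};

/* Visit every vma which may hold a pte for this page (sketch only). */
static void for_each_private_mapping(struct page *page)
{
	struct vm_area_struct *vma = page->vma;

	if (!vma)
		return;
	do {
		unsigned long address = page->vm_offset
					- vma->vm_offset + vma->vm_start;

		if (address >= vma->vm_start && address < vma->vm_end) {
			/* Descend vma->vm_mm's page tables at 'address'
			 * (pgd/pmd/pte) to reach the pte itself. */
		}
		vma = vma->vm_next_private;
	} while (vma != page->vma);
}

Whatever we decide fork() and mremap() do, they have to keep that
circular list and the page->vma pointers consistent, which is really
what my questions above are getting at.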
> b) per-vma inodes.  This scheme is a headache.  It would involve
> linking the inodes in order to maintain a chain of the anonymous
> pages.
>
> At fork() time, each private anonymous vma would need to have two new
> inodes allocated (one for each task), which would point to the old
> inode.  Ptes are found using the inode, offset pair already in struct
> page, plus walking the inode tree.  Has the advantage that we can use
> the inode, offset pair already in struct page, no need to grow struct
> vm_area_struct; disadvantage: hairy, conflicts with the new swap
> cache code - requires the old swap_entry to return to struct page.

This is getting much closer to the FreeBSD scheme, which has similar
stacks of inodes for each anonymous vma.  It is a bit cumbersome, but
ought to work pretty well.

> c) per-mm inodes, shared on fork.  Again, this one is a bit
> painful, although less so than b.  This one requires that we allow
> multiple pages in the page cache to exist with the same offset (ugly).

Very. :)

> Each anonymous page gets the mm_struct's inode attached to it, with
> the virtual address within the process' memory space being the
> offset.  The ptes are found in the same manner as for normal shared
> pages (inode->i_mmap->vmas).  Aliased page-cache entries are created
> on COW and mremap().
>
> My thoughts are that the only real options are (a) and (c).  (a)
> seems to be the cleanest conceptually, while (c) has very little
> overhead, but, again, conflicts with the new swap cache...

Yes.  The conclusion I came to a while ago is that (c) is the only way
to go if we do this.

This leaves us with one main question --- how do we select pages for
swapout?  We either do it via a scan of the page tables, as currently
happens in vmscan, or with a scan of physical pages as shrink_mmap()
does.  The changes described here are only useful if we take the
second choice.  My personal feeling at the moment is that the second
option is going to give us lower overhead swapping and a much fairer
balance between private data and the page cache.  However, there are
other arguments; in particular, it is much easier to do RSS quotas and
guarantees if we are swapping on a per page-table basis.  If we do
make this change, then some of the new swap cache code becomes
redundant.

The next question is, how do we cache swap pages in this new scenario?
We still need a swap cache mechanism, both to support proper readahead
for swap and to allow us to defer the freeing of swapped out pages
until the last possible moment.  Now, if we know we can atomically
remove a page from all ptes at once, then it is probably sufficient to
unhook the struct page from the (inode, offset) hash and rehash it as
(swapper-inode, entry).  It gets harder if we want per-process control
of RSS, since in that case we want to use the same physical page for
both vma and swap, and we will really need a quick lookup from both
indexes.

I'm not at all averse to growing the struct page if we absolutely have
to, however.  If we can effectively keep several hundred more pages in
memory as a result of not having to throw data away just to cope with
the worst-case interrupt memory requirements, then we will still see a
great overall performance improvement.

Cheers,
 Stephen.
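P.S.  Just to make the attraction of (c) concrete: once anonymous
pages are hashed under a per-mm inode with the virtual address as the
offset, the reverse lookup becomes the same i_mmap walk we already do
for shared file pages.  A completely untested sketch, with the
structures trimmed down to just the fields the walk needs (the real
ones obviously carry far more, and locking is ignored):

struct mm_struct;

struct vm_area_struct {
	struct mm_struct	*vm_mm;
	unsigned long		 vm_start, vm_end;
	unsigned long		 vm_offset;	/* inode offset of vm_start */
	struct vm_area_struct	*vm_next_share;	/* all vmas mapping this inode */
};

struct inode {
	struct vm_area_struct	*i_mmap;	/* head of the sharing list */
	/* ... */
};

struct page {
	struct inode		*inode;		/* per-mm anon inode for anon pages */
	unsigned long		 offset;	/* == virtual address for anon pages */
	/* ... */
};

/* Visit every (mm, address) which may map this page: the same walk,
 * whether the page is anonymous or file backed.  Sketch only. */
static void for_each_shared_mapping(struct page *page)
{
	struct vm_area_struct *vma;

	for (vma = page->inode->i_mmap; vma; vma = vma->vm_next_share) {
		unsigned long address = page->offset
					- vma->vm_offset + vma->vm_start;

		if (address < vma->vm_start || address >= vma->vm_end)
			continue;		/* vma doesn't cover this page */

		/* Descend vma->vm_mm's page tables at 'address' to
		 * reach the pte, clear or update it, etc. */
	}
}

COW and mremap() would then rehash the new copy under its own
(inode, offset) pair, which is where the multiple-pages-per-offset
ugliness you mention comes in.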