From mboxrd@z Thu Jan 1 00:00:00 1970 Subject: Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU" From: Lee Schermerhorn In-Reply-To: <20070829043803.GD25335@wotan.suse.de> References: <20070823041137.GH18788@wotan.suse.de> <1187988218.5869.64.camel@localhost> <20070827013525.GA23894@wotan.suse.de> <1188225247.5952.41.camel@localhost> <20070828000648.GB14109@wotan.suse.de> <1188312766.5079.77.camel@localhost> <20070829043803.GD25335@wotan.suse.de> Content-Type: text/plain Date: Thu, 30 Aug 2007 12:34:45 -0400 Message-Id: <1188491685.5794.96.camel@localhost> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: Nick Piggin Cc: linux-mm , Rik van Riel , Christoph Hellwig List-ID: On Wed, 2007-08-29 at 06:38 +0200, Nick Piggin wrote: > On Tue, Aug 28, 2007 at 10:52:46AM -0400, Lee Schermerhorn wrote: > > On Tue, 2007-08-28 at 02:06 +0200, Nick Piggin wrote: > > > > > > I don't have a problem with having a more unified approach, although if > > > we did that, then I'd prefer just to do it more simply and don't special > > > case mlocked pages _at all_. Ie. just slowly try to reclaim them and > > > eventually when everybody unlocks them, you will notice sooner or later. > > > > I didn't think I was special casing mlocked pages. I wanted to treat > > all !page_reclaimable() pages the same--i.e., put them on the noreclaim > > list. > > But you are keeping track of the mlock count? Why not simply call > try_to_unmap and see if they are still mlocked? We may be talking past each other here. So, let me try this: We're trying to hide nonreclaimable, including mlock'ed, pages from vmscan to the extent possible--to make reclaim as efficient as possible. Sometimes, to avoid races [as in your comment in __mlock_pages_range() regarding anonymous pages], we may end up putting mlock'ed pages on the normal lru list. That's OK. We can cull them in shrink_*_list(). Now, if we have an mlock lock count in a dedicated field or a page flag indicating mlock'ed state [perhaps with a count in an overloaded field], we can easily cull the mlock'ed pages w/o access to any vma so that it never gets to shrink_page_list() where try_to_unmap() would be called. IMO, try_to_unmap() is/can be a fairly heavy hammer, walking the entire rmap, as it does. And, we only get to try_to_unmap() after already walking the entire rmap in page_referenced() [hmmm, maybe cull mlock'ed pages in page_referenced()--before even checking page table for ref?]. So, I'd like to cull them early by just looking at the page. If a page occasionally makes it through, like only the first time for anon pages?, we only take the hit once. Now you may be thinking that, in general, reverse maps are not all that large. But, I've seen live locks on the i_mmap_lock with heavy Oracle loads [I think I already mentioned this]. On large servers, we can see hundreds or thousands of tasks mapping the data base executables, libraries and shared memory areas--just the types of regions one might want to mlock. Further, the shared memory areas can get quite large--10s, 100s, even 1000s of GB. That's a lot of pages to be running through page_referenced/try_to_unmap too often. > > > > > But once you do the code for mlock refcounting, that's most of the hard > > > part done so you may as well remove them completely from the LRU, no? > > > Then they become more or less transparent to the rest of the VM as well. > > > > Well, no. Depending on the reason for !reclaimable, the page would go > > on the noreclaim list or just be dropped--special handling. More > > importantly [for me], we still have to handle them specially in > > migration, dumping them back onto the LRU so that we can arbitrate > > access. If I'm ever successful in getting automatic/lazy page migration > > +replication accepted, I don't want that overhead in > > auto-migration/replication. > > Oh OK. I don't know if there should be a whole lot of overhead involved > with that, though. I can't remember exactly what the problems were here > with my mlock patch, but I think it could have been made more optimal. The basic issue was that one can't migrate pages [nor unmap them for lazy migration/replication] if check_range() can't find them on and successfully isolate them from the lru. In a respin of the patch, you dumped the pages back on to the LRU so that they could be migrated. Then, later, they'll need to be lazily culled back off the lru. Could be a lot of pages for some regions. With the noreclaim lru list, this isn't necessary. It works just like the [in]active lists from migration's perspective. I guess the overhead depends on the size of the regions being migrated. It occurs to me that we probably need a way to exempt some regions--like huge shared memory areas--from auto-migration/replication. > > > > > Could be possible. Tricky though. Probably take less code to use > > > ->lru ;) > > > > Oh, certainly less code to use any separate field. But the lru list > > field is the only link we have in the page struct, and a lot of VM > > depends on being able to pass around lists of pages. I'd hate to lose > > that for mlocked pages, or to have to dump the lock count and > > reestablish it in those cases, like migration, where we need to put the > > page on a list. > > Hmm, yes. Migration could possibly use a single linked list. > But I'm only saying it _could_ be possible to do mlocked accounting > efficiently with one of the LRU pointers -- I agree, that we don't want to keep the pages on an lru list or want to use some other list type for migration and such, the accounting in one of the lru pointers is no[t much] more overhead, timewise, than a dedicated field. The dedicated file increases space overhead, tho'. > I would prefer the idea > of just using a single bit for example, if that is sufficient. It > should cut down on code. I've been thinking about how to eliminate the mlock count entirely and just use a single page flag and "lazy culling"--i.e., try to unmap. But, one scenario I want to avoid is where tasks come and go, attaching to a shared memory area/executable with an mlock'ed vma. When they detach, without a count, we'd just drop the mlock flag, moving the pages back to the normal lru lists, and let vmscan cull them if some vma still have them mlock'ed. Again, I'd like to avoid the flood of pages between normal lru and noreclaim lists in my model. Perhaps the "flood" can be eliminated for shared memory areas--likely to be the largest source of mlock'ed pages--by not unlocking pages in shmem areas that have the VM_LOCKED flag set in the shmem_inode_info flags field [SHM_LOCKED regions]. I don't see any current interaction of that flag with the vm_flags when attaching to a SHM_LOCKED region. Such interaction is not required to prevent swap out--that's handled in shmem_writepage. But, to keep those pages off the LRU, we probably need to consult the shmem_inode_info flags in the modified mlock code. Maybe pull the flag into the vm_flags on attach? This way, try_to_unmap() will see it w/o having to consult vm_file->... I'm looking into this. Later, Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org