From mboxrd@z Thu Jan 1 00:00:00 1970 Subject: Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU" From: Lee Schermerhorn In-Reply-To: <20070827013525.GA23894@wotan.suse.de> References: <20070823041137.GH18788@wotan.suse.de> <1187988218.5869.64.camel@localhost> <20070827013525.GA23894@wotan.suse.de> Content-Type: text/plain Date: Mon, 27 Aug 2007 10:34:07 -0400 Message-Id: <1188225247.5952.41.camel@localhost> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: Nick Piggin Cc: linux-mm , Rik van Riel List-ID: On Mon, 2007-08-27 at 03:35 +0200, Nick Piggin wrote: > On Fri, Aug 24, 2007 at 04:43:38PM -0400, Lee Schermerhorn wrote: > > Nick: > > > > For your weekend reading pleasure [:-)] > > > > I have reworked your "move mlocked pages off LRU" atop my "noreclaim > > infrastructure" that keeps non-reclaimable pages [mlocked, swap-backed > > but no swap space, excessively long anon_vma list] on a separate > > noreclaim LRU list--more or less ignored by vmscan. To do this, I had > > to adda newmlock_count memberto > > thepage struct. This brings the size of the page struct to a > > nice, round 64 bytes. The mlock_count member and [most of] the > > noreclaim-mlocked-pages work now depends on CONFIG_NORECLAIM_MLOCK which > > depends on CONFIG_NORECLAIM. Currently, the entire noreclaim > > infrastructure is only supported on 64bit archs because I'm using a > > higher order bit [~30] for the PG_noreclaim flag. > > Can you keep the old system of removing mlocked pages completely, and > keeping the mlock count in one of the lru pointers? That should avoid > the need to have a new mlock_count, I think, because none of the other > noreclaim types should need a refcount? Well, keeping the mlock count in the lru pointer more or less defeats the purpose of this exercise for me--that is, a unified mechanism for tracking "non-reclaimable" pages. I wanted to maintain the ability to use the zone lru_lock and isolate_lru_page() to arbitrate access to pages for migration, etc. w/o having to temporarily put the pages back on the lru during migration. And, by using another LRU list for non-reclaimable pages, the non-reclaimable nature of locked, un-swappable, ... pages becomes transparent to much of the rest of VM. vmscan and try_to_unmap*() still have to handle lazy culling of non-reclaimable pages. If/when you do get a chance to look at the patches, you'll see that I separated the culling of non-reclaimable pages in the fault path into a separate patch. We could eliminate this overhead in the fault path in favor of lazy culling in vmscan. Vmscan would only have to deal with these pages once to move them to the noreclaim list. > > I do approve of bringing struct page to a nice round 64 bytes ;), but I > think I would rather we used up those 8 bytes by making count and > mapcount 8 bytes each. I knew the new page struct member would be controversial, at best, but it allows me to prototype and test this approach. I'd like to find somewhere else to put the mlock count, but the page struct it pretty tight as it is. It occurred to me that while anon and other swap-backed pages are mlocked, I might be able to use the private field as the mlock count. I don't understand the interaction of vm with file systems to know if we could do the same for file-backed pages. Maybe a separate PG_mlock flag would allow one to move the page's private contents to an external structure along with the mlock count? Or maybe just with PG_noreclaim, externalize the private info? Another approach that I've seen used elsewhere, IFF we can find a smaller bit field for the mlock count: maintain a mlock count in a bit field that is too small to contain max possible lock count. [Probably don't need all 64-bits, in any case.] Clip the count at maximum that the field can contain [like SWAP_MAP_MAX] and fail mlock attempts if the count won't accommodate the additional lock. I haven't investigated this enough to determine what additional complications it would involve. It would probably complicate inheriting locks across fork(), if we ever want to do that [I do!]. Any thoughts on restricting this to 64-bit archs? > > > > Using the noreclaim infrastructure does seem to simplify the "keep > > mlocked pages off the LRU" code tho'. All of the isolate_lru_page(), > > move_to_lru(), ... functions have been taught about the noreclaim list, > > so many places don't need changes. That being said, I really not sure > > I've covered all of the bases here... > > > > Now, mlocked pages come back off the noreclaim list nicely when the last > > mlock reference goes away--assuming I have the counting correct. > > However, pages marked non-reclaimable for other reasons--no swap > > available, excessive anon_vma ref count--can languish there > > indefinitely. At some point, perhaps vmscan could be taught to do a > > slow background scan of the noreclaim list [making it more like > > "slo-reclaim"--but we already have that :-)] when swap is added and we > > have unswappable pages on the list. Currently, I don't keep track of > > the various reasons for the no-reclaim pages, but that could be added. > > > > Rik Van Riel mentions, on his VM wiki page that a background scan might > > be useful to age pages actively [clock hand, anyone?], so I might be > > able to piggyback on that, or even prototype it at some point. In the > > meantime, I'm going to add a scan of the noreclaim list manually > > triggered by a temporary sysctl. > > Yeah, I think the basic slow simple clock would be a reasonable starting > point. You may end up wanting to introduce some feedback from near-OOM > condition and/or free swap accounting to speed up the scanning rate. Yep. It's all those little details that have prevented me from diving into this yet. Still cogitating on that, as a background task. > > I haven't had much look at the patches yet, but I'm glad to see the old > mlocked patch come to something ;) Given the issues we've encountered in the field with a large number [millions] of non-reclaimable pages on the LRU lists, the idea of hiding nonreclaimable pages from vmscan is appealing. I'm hoping we can find some acceptable way of doing this in the long run. Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org