* Re: vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru

From: Nick Piggin @ 2007-08-23  4:11 UTC
To: Andrew Morton, Martin Bligh, Rik van Riel, Linux Memory Management List

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc3/2.6.23-rc3-mm1/broken-out/vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch

About this patch... I hope it doesn't get merged without good reason...

Our current reclaim scheme may not always make great choices, but one thing I really like about it is that it can generally always reclaim file-backed pages in O(1) time WRT the size of RAM. Once you start giving things multiple trips around the lists, you can reach a situation where you need to scan all, or a huge number of, pages before reclaiming any. If reclaim goes untouched for long periods, it is very likely that most memory ends up on the active list with the referenced bit set.

One thing you could potentially do is have mark_page_accessed() always put active pages back at the head of the LRU, but that is probably going to take way too much locking...

I'm not completely happy with our somewhat random page reclaim policy either, but I console myself in this case by thinking of PG_referenced as giving the page a slightly better chance before it leaves the inactive list.

FWIW, this is one of the big reasons not to go with the scheme where you rip out mark_page_accessed() completely and do all aging based purely on referenced/second-chance bits. That is conceptually a lot simpler and more consistent, and it behaves really well for use-once pages too, but in practice it can cause big pauses when you initially start reclaim (and I expect this patch could be subject to the same, even if fewer cases would trigger such behaviour).
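To make the concern concrete: the behaviour the patch title describes amounts to something like the following in the shrink_active_list() scan (my paraphrase of the idea, not the actual hunk):

    /* Sketch only: a referenced, active, unmapped page gets another trip
     * around the active list instead of moving to the inactive list.
     * With no reclaim pressure for a long time, nearly every page can
     * end up in this state, so the first scan afterwards may have to
     * rotate almost all of RAM before it frees anything. */
    if (page_referenced(page, 0) && !page_mapped(page)) {
        list_add(&page->lru, &l_active);    /* second trip */
        continue;
    }
    list_add(&page->lru, &l_inactive);      /* reclaim candidate */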
* Re: vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru

From: Andrew Morton @ 2007-08-23  7:15 UTC
To: Nick Piggin
Cc: Martin Bligh, Rik van Riel, Linux Memory Management List

On Thu, 23 Aug 2007 06:11:37 +0200 Nick Piggin <npiggin@suse.de> wrote:

> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc3/2.6.23-rc3-mm1/broken-out/vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
>
> About this patch... I hope it doesn't get merged without good reason...

I have no intention at all of merging it until it's proven to be a net benefit. This is engineering. We shouldn't merge VM changes based on handwaving.

It does fix a bug (ie: a difference between design intent and implementation) but I have no idea whether it improves or worsens anything.

> [handwaving]

;)
* Re: vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru

From: Nick Piggin @ 2007-08-23  9:07 UTC
To: Andrew Morton
Cc: Martin Bligh, Rik van Riel, Linux Memory Management List

On Thu, Aug 23, 2007 at 12:15:17AM -0700, Andrew Morton wrote:
> On Thu, 23 Aug 2007 06:11:37 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
> > http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc3/2.6.23-rc3-mm1/broken-out/vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
> >
> > About this patch... I hope it doesn't get merged without good reason...
>
> I have no intention at all of merging it until it's proven to be a net benefit. This is engineering. We shouldn't merge VM changes based on handwaving.
>
> It does fix a bug (ie: a difference between design intent and implementation) but I have no idea whether it improves or worsens anything.
>
> > [handwaving]
>
> ;)

Well, what I say is handwaving too, but it is a situation that wouldn't be completely unusual to hit. Anyway, I know I don't need to make an airtight argument as to why _not_ to merge a patch, so this is just a heads-up to be on the lookout for one potential issue I have seen with a similar change.
* Re: vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru

From: Andrea Arcangeli @ 2007-08-23 11:48 UTC
To: Nick Piggin
Cc: Andrew Morton, Martin Bligh, Rik van Riel, Linux Memory Management List

On Thu, Aug 23, 2007 at 11:07:22AM +0200, Nick Piggin wrote:
> On Thu, Aug 23, 2007 at 12:15:17AM -0700, Andrew Morton wrote:
> > On Thu, 23 Aug 2007 06:11:37 +0200 Nick Piggin <npiggin@suse.de> wrote:
> >
> > > About this patch... I hope it doesn't get merged without good reason...
> >
> > I have no intention at all of merging it until it's proven to be a net benefit. This is engineering. We shouldn't merge VM changes based on handwaving.
> >
> > It does fix a bug (ie: a difference between design intent and implementation) but I have no idea whether it improves or worsens anything.
>
> Well, what I say is handwaving too, but it is a situation that wouldn't be completely unusual to hit. Anyway, I know I don't need to make an airtight argument as to why _not_ to merge a patch, so this is just a heads-up to be on the lookout for one potential issue I have seen with a similar change.

I like the patch; I consider it a fix, but perhaps I'm biased ;)
* RFC: Noreclaim with "Keep Mlocked Pages off the LRU"

From: Lee Schermerhorn @ 2007-08-24 20:43 UTC
To: Nick Piggin
Cc: linux-mm, Rik van Riel

Nick:

For your weekend reading pleasure [:-)]

I have reworked your "move mlocked pages off LRU" atop my "noreclaim infrastructure" that keeps non-reclaimable pages [mlocked, swap-backed but no swap space, excessively long anon_vma list] on a separate noreclaim LRU list--more or less ignored by vmscan. To do this, I had to <mumble>add<mumble>a new<mumble>mlock_count member<mumble>to the<mumble>page struct. This brings the size of the page struct to a nice, round 64 bytes. The mlock_count member and [most of] the noreclaim-mlocked-pages work now depend on CONFIG_NORECLAIM_MLOCK, which depends on CONFIG_NORECLAIM. Currently, the entire noreclaim infrastructure is only supported on 64-bit archs because I'm using a higher order bit [~30] for the PG_noreclaim flag.

Using the noreclaim infrastructure does seem to simplify the "keep mlocked pages off the LRU" code tho'. All of the isolate_lru_page(), move_to_lru(), ... functions have been taught about the noreclaim list, so many places don't need changes. That being said, I'm really not sure I've covered all of the bases here...

Now, mlocked pages come back off the noreclaim list nicely when the last mlock reference goes away--assuming I have the counting correct. However, pages marked non-reclaimable for other reasons--no swap available, excessive anon_vma ref count--can languish there indefinitely. At some point, perhaps vmscan could be taught to do a slow background scan of the noreclaim list [making it more like "slo-reclaim"--but we already have that :-)] when swap is added and we have unswappable pages on the list. Currently, I don't keep track of the various reasons for the no-reclaim pages, but that could be added.

Rik van Riel mentions on his VM wiki page that a background scan might be useful to age pages actively [clock hand, anyone?], so I might be able to piggyback on that, or even prototype it at some point. In the meantime, I'm going to add a scan of the noreclaim list manually triggered by a temporary sysctl.

Anyway, if anyone is interested, the patches are in a gzip'd tarball in:

http://free.linux.hp.com/~lts/Patches/Noreclaim/

Cursory functional testing with memtoy shows that it basically works. I've started a moderately stressful workload for the weekend. We'll see how it goes.

Cheers,
Lee
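P.S. In case it helps review: conceptually, the culling test boils down to something like this (a simplified sketch, not the exact code in the tarball--the flag and helper names there may differ):

    /*
     * Sketch: may this page live on the normal [in]active lists, or
     * does it belong on the noreclaim list?
     */
    static inline int page_reclaimable(struct page *page)
    {
        if (PageMlocked(page))      /* held by some VM_LOCKED vma */
            return 0;
        if (PageAnon(page) && !total_swap_pages)
            return 0;               /* swap-backed, but no swap space */
        /* the excessively-long-anon_vma-list test would go here */
        return 1;
    }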
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"

From: Nick Piggin @ 2007-08-27  1:35 UTC
To: Lee Schermerhorn
Cc: linux-mm, Rik van Riel

On Fri, Aug 24, 2007 at 04:43:38PM -0400, Lee Schermerhorn wrote:
> Nick:
>
> For your weekend reading pleasure [:-)]
>
> I have reworked your "move mlocked pages off LRU" atop my "noreclaim infrastructure" that keeps non-reclaimable pages [mlocked, swap-backed but no swap space, excessively long anon_vma list] on a separate noreclaim LRU list--more or less ignored by vmscan. To do this, I had to <mumble>add<mumble>a new<mumble>mlock_count member<mumble>to the<mumble>page struct. This brings the size of the page struct to a nice, round 64 bytes. The mlock_count member and [most of] the noreclaim-mlocked-pages work now depend on CONFIG_NORECLAIM_MLOCK, which depends on CONFIG_NORECLAIM. Currently, the entire noreclaim infrastructure is only supported on 64-bit archs because I'm using a higher order bit [~30] for the PG_noreclaim flag.

Can you keep the old system of removing mlocked pages completely, and keeping the mlock count in one of the lru pointers? That should avoid the need for a new mlock_count, I think, because none of the other noreclaim types should need a refcount?

I do approve of bringing struct page to a nice round 64 bytes ;), but I think I would rather we used up those 8 bytes by making count and mapcount 8 bytes each.

> Using the noreclaim infrastructure does seem to simplify the "keep mlocked pages off the LRU" code tho'. All of the isolate_lru_page(), move_to_lru(), ... functions have been taught about the noreclaim list, so many places don't need changes. That being said, I'm really not sure I've covered all of the bases here...
>
> Now, mlocked pages come back off the noreclaim list nicely when the last mlock reference goes away--assuming I have the counting correct. However, pages marked non-reclaimable for other reasons--no swap available, excessive anon_vma ref count--can languish there indefinitely. At some point, perhaps vmscan could be taught to do a slow background scan of the noreclaim list [making it more like "slo-reclaim"--but we already have that :-)] when swap is added and we have unswappable pages on the list. Currently, I don't keep track of the various reasons for the no-reclaim pages, but that could be added.
>
> Rik van Riel mentions on his VM wiki page that a background scan might be useful to age pages actively [clock hand, anyone?], so I might be able to piggyback on that, or even prototype it at some point. In the meantime, I'm going to add a scan of the noreclaim list manually triggered by a temporary sysctl.

Yeah, I think the basic slow simple clock would be a reasonable starting point. You may end up wanting to introduce some feedback from near-OOM conditions and/or free swap accounting to speed up the scanning rate.

I haven't had much of a look at the patches yet, but I'm glad to see the old mlocked patch come to something ;)

Thanks,
Nick
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"

From: Lee Schermerhorn @ 2007-08-27 14:34 UTC
To: Nick Piggin
Cc: linux-mm, Rik van Riel

On Mon, 2007-08-27 at 03:35 +0200, Nick Piggin wrote:
> On Fri, Aug 24, 2007 at 04:43:38PM -0400, Lee Schermerhorn wrote:
> > Nick:
> >
> > For your weekend reading pleasure [:-)]
> >
> > I have reworked your "move mlocked pages off LRU" atop my "noreclaim infrastructure" that keeps non-reclaimable pages [mlocked, swap-backed but no swap space, excessively long anon_vma list] on a separate noreclaim LRU list--more or less ignored by vmscan. To do this, I had to <mumble>add<mumble>a new<mumble>mlock_count member<mumble>to the<mumble>page struct. This brings the size of the page struct to a nice, round 64 bytes. The mlock_count member and [most of] the noreclaim-mlocked-pages work now depend on CONFIG_NORECLAIM_MLOCK, which depends on CONFIG_NORECLAIM. Currently, the entire noreclaim infrastructure is only supported on 64-bit archs because I'm using a higher order bit [~30] for the PG_noreclaim flag.
>
> Can you keep the old system of removing mlocked pages completely, and keeping the mlock count in one of the lru pointers? That should avoid the need for a new mlock_count, I think, because none of the other noreclaim types should need a refcount?

Well, keeping the mlock count in the lru pointer more or less defeats the purpose of this exercise for me--that is, a unified mechanism for tracking "non-reclaimable" pages. I wanted to maintain the ability to use the zone lru_lock and isolate_lru_page() to arbitrate access to pages for migration, etc., w/o having to temporarily put the pages back on the lru during migration.

And, by using another LRU list for non-reclaimable pages, the non-reclaimable nature of locked, un-swappable, ... pages becomes transparent to much of the rest of the VM. vmscan and try_to_unmap*() still have to handle lazy culling of non-reclaimable pages. If/when you do get a chance to look at the patches, you'll see that I separated the culling of non-reclaimable pages in the fault path into a separate patch. We could eliminate this overhead in the fault path in favor of lazy culling in vmscan. Vmscan would only have to deal with these pages once, to move them to the noreclaim list.

> I do approve of bringing struct page to a nice round 64 bytes ;), but I think I would rather we used up those 8 bytes by making count and mapcount 8 bytes each.

I knew the new page struct member would be controversial, at best, but it allows me to prototype and test this approach. I'd like to find somewhere else to put the mlock count, but the page struct is pretty tight as it is. It occurred to me that while anon and other swap-backed pages are mlocked, I might be able to use the private field as the mlock count. I don't understand the interaction of the VM with file systems well enough to know whether we could do the same for file-backed pages. Maybe a separate PG_mlock flag would allow one to move the page's private contents to an external structure along with the mlock count? Or maybe just, with PG_noreclaim, externalize the private info?
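For illustration, reusing ->private might look something like this--pure speculation, untested, and it assumes nothing else needs the private field while the page sits on the noreclaim list:

    /*
     * Speculative sketch: reuse page->private as the mlock refcount
     * while the page is parked on the noreclaim list.  Caller holds
     * zone->lru_lock; helper names are made up.
     */
    static void mlock_page_get(struct page *page)
    {
        if (!TestSetPageNoreclaim(page)) {
            /* first locker: move page to the noreclaim list here */
            set_page_private(page, 0);
        }
        set_page_private(page, page_private(page) + 1);
    }

    static int mlock_page_put(struct page *page)   /* true when last */
    {
        set_page_private(page, page_private(page) - 1);
        return page_private(page) == 0;
    }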
Another approach that I've seen used elsewhere, IFF we can find a smaller bit field for the mlock count: maintain the mlock count in a bit field that is too small to contain the maximum possible lock count. [We probably don't need all 64 bits, in any case.] Clip the count at the maximum the field can contain [like SWAP_MAP_MAX] and fail mlock attempts if the count won't accommodate the additional lock. I haven't investigated this enough to determine what additional complications it would involve. It would probably complicate inheriting locks across fork(), if we ever want to do that [I do!].

Any thoughts on restricting this to 64-bit archs?

> > Using the noreclaim infrastructure does seem to simplify the "keep mlocked pages off the LRU" code tho'. All of the isolate_lru_page(), move_to_lru(), ... functions have been taught about the noreclaim list, so many places don't need changes. That being said, I'm really not sure I've covered all of the bases here...
> >
> > Now, mlocked pages come back off the noreclaim list nicely when the last mlock reference goes away--assuming I have the counting correct. However, pages marked non-reclaimable for other reasons--no swap available, excessive anon_vma ref count--can languish there indefinitely. At some point, perhaps vmscan could be taught to do a slow background scan of the noreclaim list [making it more like "slo-reclaim"--but we already have that :-)] when swap is added and we have unswappable pages on the list. Currently, I don't keep track of the various reasons for the no-reclaim pages, but that could be added.
> >
> > Rik van Riel mentions on his VM wiki page that a background scan might be useful to age pages actively [clock hand, anyone?], so I might be able to piggyback on that, or even prototype it at some point. In the meantime, I'm going to add a scan of the noreclaim list manually triggered by a temporary sysctl.
>
> Yeah, I think the basic slow simple clock would be a reasonable starting point. You may end up wanting to introduce some feedback from near-OOM conditions and/or free swap accounting to speed up the scanning rate.

Yep. It's all those little details that have prevented me from diving into this yet. Still cogitating on that, as a background task.

> I haven't had much of a look at the patches yet, but I'm glad to see the old mlocked patch come to something ;)

Given the issues we've encountered in the field with a large number [millions] of non-reclaimable pages on the LRU lists, the idea of hiding non-reclaimable pages from vmscan is appealing. I'm hoping we can find some acceptable way of doing this in the long run.

Lee
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"

From: Christoph Hellwig @ 2007-08-27 15:44 UTC
To: Lee Schermerhorn
Cc: Nick Piggin, linux-mm, Rik van Riel

On Mon, Aug 27, 2007 at 10:34:07AM -0400, Lee Schermerhorn wrote:
> Well, keeping the mlock count in the lru pointer more or less defeats the purpose of this exercise for me--that is, a unified mechanism for tracking "non-reclaimable" pages. I wanted to maintain the ability to use the zone lru_lock and isolate_lru_page() to arbitrate access to pages for migration, etc., w/o having to temporarily put the pages back on the lru during migration.

A few years ago I tried to implement an mlocked counter in the page as well, and my approach was to create a union to reuse the space occupied by the lru list pointers for this. I never really got it stable enough, because people tripped over the lru list randomly far too often.
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"

From: Nick Piggin @ 2007-08-27 23:51 UTC
To: Christoph Hellwig
Cc: Lee Schermerhorn, linux-mm, Rik van Riel

On Mon, Aug 27, 2007 at 04:44:26PM +0100, Christoph Hellwig wrote:
> On Mon, Aug 27, 2007 at 10:34:07AM -0400, Lee Schermerhorn wrote:
> > Well, keeping the mlock count in the lru pointer more or less defeats the purpose of this exercise for me--that is, a unified mechanism for tracking "non-reclaimable" pages. I wanted to maintain the ability to use the zone lru_lock and isolate_lru_page() to arbitrate access to pages for migration, etc., w/o having to temporarily put the pages back on the lru during migration.
>
> A few years ago I tried to implement an mlocked counter in the page as well, and my approach was to create a union to reuse the space occupied by the lru list pointers for this. I never really got it stable enough, because people tripped over the lru list randomly far too often.

My original mlock patches that Lee is talking about did use your method. I _believe_ it is basically bug-free and worked nicely.

These days we're a bit more consistent and have fewer races with LRU handling, which is perhaps what made it doable.
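Roughly, the trick is to overlay the count on the then-unused list pointers, e.g. (illustrative sketch, not the exact patch):

    struct page {
        /* ... */
        union {
            struct list_head lru;      /* page is on an LRU list */
            struct {                   /* mlocked: off the LRU entirely */
                unsigned long mlock_count;
            } mlock;
        };
        /* ... */
    };

Since a page that is on no list doesn't need its lru links, the count costs nothing. The tricky part is making sure nobody touches ->lru while the page is off the LRU, which is exactly where the old instability came from.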
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"

From: Christoph Hellwig @ 2007-08-28 12:29 UTC
To: Nick Piggin
Cc: Christoph Hellwig, Lee Schermerhorn, linux-mm, Rik van Riel

On Tue, Aug 28, 2007 at 01:51:53AM +0200, Nick Piggin wrote:
> > A few years ago I tried to implement an mlocked counter in the page as well, and my approach was to create a union to reuse the space occupied by the lru list pointers for this. I never really got it stable enough, because people tripped over the lru list randomly far too often.
>
> My original mlock patches that Lee is talking about did use your method. I _believe_ it is basically bug-free and worked nicely.
>
> These days we're a bit more consistent and have fewer races with LRU handling, which is perhaps what made it doable.

If this works, that'd be wonderful. It also means xfs could switch back to using the block device mapping for its buffer cache.
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"

From: Nick Piggin @ 2007-08-28  0:06 UTC
To: Lee Schermerhorn
Cc: linux-mm, Rik van Riel

On Mon, Aug 27, 2007 at 10:34:07AM -0400, Lee Schermerhorn wrote:
> On Mon, 2007-08-27 at 03:35 +0200, Nick Piggin wrote:
> > On Fri, Aug 24, 2007 at 04:43:38PM -0400, Lee Schermerhorn wrote:
> > > Nick:
> > >
> > > For your weekend reading pleasure [:-)]
> > >
> > > I have reworked your "move mlocked pages off LRU" atop my "noreclaim infrastructure" that keeps non-reclaimable pages [mlocked, swap-backed but no swap space, excessively long anon_vma list] on a separate noreclaim LRU list--more or less ignored by vmscan. To do this, I had to <mumble>add<mumble>a new<mumble>mlock_count member<mumble>to the<mumble>page struct. This brings the size of the page struct to a nice, round 64 bytes. The mlock_count member and [most of] the noreclaim-mlocked-pages work now depend on CONFIG_NORECLAIM_MLOCK, which depends on CONFIG_NORECLAIM. Currently, the entire noreclaim infrastructure is only supported on 64-bit archs because I'm using a higher order bit [~30] for the PG_noreclaim flag.
> >
> > Can you keep the old system of removing mlocked pages completely, and keeping the mlock count in one of the lru pointers? That should avoid the need for a new mlock_count, I think, because none of the other noreclaim types should need a refcount?
>
> Well, keeping the mlock count in the lru pointer more or less defeats the purpose of this exercise for me--that is, a unified mechanism for tracking "non-reclaimable" pages. I wanted to maintain the ability to use the zone lru_lock and isolate_lru_page() to arbitrate access to pages for migration, etc., w/o having to temporarily put the pages back on the lru during migration.
>
> And, by using another LRU list for non-reclaimable pages, the non-reclaimable nature of locked, un-swappable, ... pages becomes transparent to much of the rest of the VM. vmscan and try_to_unmap*() still have to handle lazy culling of non-reclaimable pages. If/when you do get a chance to look at the patches, you'll see that I separated the culling of non-reclaimable pages in the fault path into a separate patch. We could eliminate this overhead in the fault path in favor of lazy culling in vmscan. Vmscan would only have to deal with these pages once, to move them to the noreclaim list.

I don't have a problem with having a more unified approach, although if we did that, then I'd prefer just to do it more simply and not special-case mlocked pages _at all_. Ie. just slowly try to reclaim them, and eventually, when everybody unlocks them, you will notice sooner or later.

But once you do the code for mlock refcounting, that's most of the hard part done, so you may as well remove them completely from the LRU, no? Then they become more or less transparent to the rest of the VM as well.

> > I do approve of bringing struct page to a nice round 64 bytes ;), but I think I would rather we used up those 8 bytes by making count and mapcount 8 bytes each.
>
> I knew the new page struct member would be controversial, at best, but it allows me to prototype and test this approach. I'd like to find somewhere else to put the mlock count, but the page struct is pretty tight as it is. It occurred to me that while anon and other swap-backed pages are mlocked, I might be able to use the private field as the mlock count. I don't understand the interaction of the VM with file systems well enough to know whether we could do the same for file-backed pages. Maybe a separate PG_mlock flag would allow one to move the page's private contents to an external structure along with the mlock count? Or maybe just, with PG_noreclaim, externalize the private info?

Could be possible. Tricky, though. Probably take less code to use ->lru ;)

> Another approach that I've seen used elsewhere, IFF we can find a smaller bit field for the mlock count: maintain the mlock count in a bit field that is too small to contain the maximum possible lock count. [We probably don't need all 64 bits, in any case.] Clip the count at the maximum the field can contain [like SWAP_MAP_MAX] and fail mlock attempts if the count won't accommodate the additional lock. I haven't investigated this enough to determine what additional complications it would involve. It would probably complicate inheriting locks across fork(), if we ever want to do that [I do!].

Well, instead of failing further mlocks, you could just have MLOCK_MAX signal that counting is disabled, and require a full rmap scan in order to reclaim it.

> Any thoughts on restricting this to 64-bit archs?

I don't know. I'd have thought efficient mlock handling might be useful for realtime systems, probably many of which would be 32-bit.

Are you seeing mlock pinning heaps of memory in the field?

> > I haven't had much of a look at the patches yet, but I'm glad to see the old mlocked patch come to something ;)
>
> Given the issues we've encountered in the field with a large number [millions] of non-reclaimable pages on the LRU lists, the idea of hiding non-reclaimable pages from vmscan is appealing. I'm hoping we can find some acceptable way of doing this in the long run.

Oh yeah, I think that's a good idea, especially for less transient conditions like mlock and out-of-swap.
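In concrete terms, the clamped count would look something like this (sketch only; the field width and names are arbitrary):

    #define MLOCK_COUNT_BITS    15
    #define MLOCK_MAX           ((1 << MLOCK_COUNT_BITS) - 1)

    /*
     * Saturating count: once it reaches MLOCK_MAX it sticks there, and
     * the page can only be reclaimed after a full rmap walk verifies
     * that no VM_LOCKED vma still maps it.  page->mlock_count is the
     * hypothetical bit field being discussed.
     */
    static inline void mlock_count_inc(struct page *page)
    {
        if (page->mlock_count < MLOCK_MAX)
            page->mlock_count++;
    }

    static inline void mlock_count_dec(struct page *page)
    {
        if (page->mlock_count > 0 && page->mlock_count < MLOCK_MAX)
            page->mlock_count--;
    }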
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"

From: Lee Schermerhorn @ 2007-08-28 14:52 UTC
To: Nick Piggin
Cc: linux-mm, Rik van Riel

On Tue, 2007-08-28 at 02:06 +0200, Nick Piggin wrote:
> On Mon, Aug 27, 2007 at 10:34:07AM -0400, Lee Schermerhorn wrote:
> > On Mon, 2007-08-27 at 03:35 +0200, Nick Piggin wrote:
> > > On Fri, Aug 24, 2007 at 04:43:38PM -0400, Lee Schermerhorn wrote:
> > > > I have reworked your "move mlocked pages off LRU" atop my "noreclaim infrastructure" that keeps non-reclaimable pages [mlocked, swap-backed but no swap space, excessively long anon_vma list] on a separate noreclaim LRU list--more or less ignored by vmscan. To do this, I had to <mumble>add<mumble>a new<mumble>mlock_count member<mumble>to the<mumble>page struct. This brings the size of the page struct to a nice, round 64 bytes. The mlock_count member and [most of] the noreclaim-mlocked-pages work now depend on CONFIG_NORECLAIM_MLOCK, which depends on CONFIG_NORECLAIM. Currently, the entire noreclaim infrastructure is only supported on 64-bit archs because I'm using a higher order bit [~30] for the PG_noreclaim flag.
> > >
> > > Can you keep the old system of removing mlocked pages completely, and keeping the mlock count in one of the lru pointers? That should avoid the need for a new mlock_count, I think, because none of the other noreclaim types should need a refcount?
> >
> > Well, keeping the mlock count in the lru pointer more or less defeats the purpose of this exercise for me--that is, a unified mechanism for tracking "non-reclaimable" pages. I wanted to maintain the ability to use the zone lru_lock and isolate_lru_page() to arbitrate access to pages for migration, etc., w/o having to temporarily put the pages back on the lru during migration.
> >
> > And, by using another LRU list for non-reclaimable pages, the non-reclaimable nature of locked, un-swappable, ... pages becomes transparent to much of the rest of the VM. vmscan and try_to_unmap*() still have to handle lazy culling of non-reclaimable pages. If/when you do get a chance to look at the patches, you'll see that I separated the culling of non-reclaimable pages in the fault path into a separate patch. We could eliminate this overhead in the fault path in favor of lazy culling in vmscan. Vmscan would only have to deal with these pages once, to move them to the noreclaim list.
>
> I don't have a problem with having a more unified approach, although if we did that, then I'd prefer just to do it more simply and not special-case mlocked pages _at all_. Ie. just slowly try to reclaim them, and eventually, when everybody unlocks them, you will notice sooner or later.

I didn't think I was special-casing mlocked pages. I wanted to treat all !page_reclaimable() pages the same--i.e., put them on the noreclaim list.

> But once you do the code for mlock refcounting, that's most of the hard part done, so you may as well remove them completely from the LRU, no? Then they become more or less transparent to the rest of the VM as well.

Well, no. Depending on the reason for !reclaimable, the page would go on the noreclaim list or just be dropped--special handling. More importantly [for me], we still have to handle them specially in migration, dumping them back onto the LRU so that we can arbitrate access. If I'm ever successful in getting automatic/lazy page migration+replication accepted, I don't want that overhead in auto-migration/replication.

> > > I do approve of bringing struct page to a nice round 64 bytes ;), but I think I would rather we used up those 8 bytes by making count and mapcount 8 bytes each.
> >
> > I knew the new page struct member would be controversial, at best, but it allows me to prototype and test this approach. I'd like to find somewhere else to put the mlock count, but the page struct is pretty tight as it is. It occurred to me that while anon and other swap-backed pages are mlocked, I might be able to use the private field as the mlock count. I don't understand the interaction of the VM with file systems well enough to know whether we could do the same for file-backed pages. Maybe a separate PG_mlock flag would allow one to move the page's private contents to an external structure along with the mlock count? Or maybe just, with PG_noreclaim, externalize the private info?
>
> Could be possible. Tricky, though. Probably take less code to use ->lru ;)

Oh, certainly less code to use any separate field. But the lru list field is the only link we have in the page struct, and a lot of the VM depends on being able to pass around lists of pages. I'd hate to lose that for mlocked pages, or to have to dump the lock count and reestablish it in those cases, like migration, where we need to put the page on a list.

> > Another approach that I've seen used elsewhere, IFF we can find a smaller bit field for the mlock count: maintain the mlock count in a bit field that is too small to contain the maximum possible lock count. [We probably don't need all 64 bits, in any case.] Clip the count at the maximum the field can contain [like SWAP_MAP_MAX] and fail mlock attempts if the count won't accommodate the additional lock. I haven't investigated this enough to determine what additional complications it would involve. It would probably complicate inheriting locks across fork(), if we ever want to do that [I do!].
>
> Well, instead of failing further mlocks, you could just have MLOCK_MAX signal that counting is disabled, and require a full rmap scan in order to reclaim it.

Yeah. But, rather than totally disabling the counting, I'd suggest going ahead and decrementing the count [! < 0, of course] on unmap/unlock. If it's infrequent, we could just let try_to_unmap*() cull pages in VM_LOCKED vmas when it's already doing the full rmap scan, and have shrink_page_list() put a page on the noreclaim list when try_to_unmap() returns SWAP_LOCK. This does mean that we won't be able to cull the mlocked pages early in shrink_[in]active_list() via !page_reclaimable(). So, we still have to do the complete rmap scan for page_referenced() [why do we do this? don't we trust mapcount?] and then again for try_to_unmap(). We'd probably also want to cull new pages in the fault path, where the vma is available. This would reduce the number of mlocked pages encountered on the LRU lists by vmscan.

If we're willing to live with this [increased rmap scans on mlocked pages], we might be able to dispense with the mlock count altogether. Just a single flag [somewhere--doesn't need to be in the page flags member] to indicate mlocked for page_reclaimable(). munmap()/munlock() could reset the bit and put the page back on the [in]active list. If some other vma has it locked, we'll catch it on the next attempt to unmap.

> > Any thoughts on restricting this to 64-bit archs?
>
> I don't know. I'd have thought efficient mlock handling might be useful for realtime systems, probably many of which would be 32-bit.

I agree. I just wonder if those systems have a sufficient number of pages that they're suffering from the long lru lists with a large fraction of unreclaimable pages... If we do want to support keeping non-reclaimable pages off the [in]active lists for these systems, we'll need to find a place for the flag[s].

> Are you seeing mlock pinning heaps of memory in the field?

It is a common usage to mlock() large shared memory areas, as well as entire tasks [MCL_CURRENT|MCL_FUTURE]. I think it would be even more frequent if one could inherit MCL_FUTURE across fork and exec. Then one could write/enhance a prefix command, like numactl and taskset, to enable locking of unmodified applications. I prototyped this once, but never updated it to do the mlock accounting [e.g., down in copy_page_range() during fork()] for your patch.

What we see more of is folks just figuring that they've got sufficient memory [100s of GB] for their apps and shared memory areas, so they don't add enough swap to back all of the anon and shmem regions. Then, when they get under memory pressure--e.g., the old "backup ate my pagecache" scenario--the system more or less live-locks in vmscan, shuffling non-reclaimable [unswappable] pages. A large number of mlocked pages on the LRU produces the same symptom, as do excessively long anon_vma lists and huge i_mmap trees--the latter seen with some large Oracle workloads.

> > > I haven't had much of a look at the patches yet, but I'm glad to see the old mlocked patch come to something ;)
> >
> > Given the issues we've encountered in the field with a large number [millions] of non-reclaimable pages on the LRU lists, the idea of hiding non-reclaimable pages from vmscan is appealing. I'm hoping we can find some acceptable way of doing this in the long run.
>
> Oh yeah, I think that's a good idea, especially for less transient conditions like mlock and out-of-swap.

This is all still a work in progress. I'll keep it up to date, run occasional benchmarks to measure effects and track the other page reclaim activity on the lists, and see where it goes...

Later,
Lee
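P.S. For the lazy-culling variant above, shrink_page_list() would do something like this (sketch only; SWAP_LOCK is the new try_to_unmap() return code proposed above, and putback_noreclaim_page() is a made-up helper):

    if (page_mapped(page) && mapping) {
        switch (try_to_unmap(page, 0)) {
        case SWAP_LOCK:
            /* some VM_LOCKED vma still maps the page: park it on
             * the noreclaim list instead of rotating it forever */
            putback_noreclaim_page(page);
            continue;
        case SWAP_FAIL:
            goto activate_locked;
        case SWAP_AGAIN:
            goto keep_locked;
        case SWAP_SUCCESS:
            ; /* try to free the page below */
        }
    }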
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"

From: Christoph Lameter @ 2007-08-28 21:54 UTC
To: Lee Schermerhorn
Cc: Nick Piggin, linux-mm, Rik van Riel

On Tue, 28 Aug 2007, Lee Schermerhorn wrote:

> I didn't think I was special-casing mlocked pages. I wanted to treat all !page_reclaimable() pages the same--i.e., put them on the noreclaim list.

I think that is the right approach. Do not forget that ramfs and other ram-based filesystems create unmapped unreclaimable pages.

> Well, no. Depending on the reason for !reclaimable, the page would go on the noreclaim list or just be dropped--special handling. More importantly [for me], we still have to handle them specially in migration, dumping them back onto the LRU so that we can arbitrate access. If I'm ever successful in getting automatic/lazy page migration+replication accepted, I don't want that overhead in auto-migration/replication.

Right. I posted a patch a week ago that generalized LRU handling and would allow the adding of additional lists as needed by such an approach.

> If we're willing to live with this [increased rmap scans on mlocked pages], we might be able to dispense with the mlock count altogether. Just a single flag [somewhere--doesn't need to be in the page flags member] to indicate mlocked for page_reclaimable(). munmap()/munlock() could reset the bit and put the page back on the [in]active list. If some other vma has it locked, we'll catch it on the next attempt to unmap.

You need a page flag to indicate the fact that the page is on the unreclaimable list.
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"

From: Lee Schermerhorn @ 2007-08-29 14:40 UTC
To: Christoph Lameter
Cc: Nick Piggin, linux-mm, Rik van Riel

On Tue, 2007-08-28 at 14:54 -0700, Christoph Lameter wrote:
> On Tue, 28 Aug 2007, Lee Schermerhorn wrote:
>
> > I didn't think I was special-casing mlocked pages. I wanted to treat all !page_reclaimable() pages the same--i.e., put them on the noreclaim list.
>
> I think that is the right approach. Do not forget that ramfs and other ram-based filesystems create unmapped unreclaimable pages.

They don't go on the LRU lists now, do they? The primary function of the noreclaim infrastructure is to hide non-reclaimable pages that would otherwise go on the [in]active lists from vmscan. So, if pages used by the ram-based file systems don't go onto the LRU, we probably don't need to put them on the noreclaim list, which is conceptually another LRU list.

That being said, the lumpy reclaim patch tries to reclaim pages that are contiguous to other pages being reclaimed when trying to free higher-order pages. I'll have to check to see if it tries to reclaim pages that might be used by ram/tmp/... fs.

> > Well, no. Depending on the reason for !reclaimable, the page would go on the noreclaim list or just be dropped--special handling. More importantly [for me], we still have to handle them specially in migration, dumping them back onto the LRU so that we can arbitrate access. If I'm ever successful in getting automatic/lazy page migration+replication accepted, I don't want that overhead in auto-migration/replication.
>
> Right. I posted a patch a week ago that generalized LRU handling and would allow the adding of additional lists as needed by such an approach.

Which one was that?

> > If we're willing to live with this [increased rmap scans on mlocked pages], we might be able to dispense with the mlock count altogether. Just a single flag [somewhere--doesn't need to be in the page flags member] to indicate mlocked for page_reclaimable(). munmap()/munlock() could reset the bit and put the page back on the [in]active list. If some other vma has it locked, we'll catch it on the next attempt to unmap.
>
> You need a page flag to indicate the fact that the page is on the unreclaimable list.

Yes, I have that now--PG_noreclaim. In my prototype, I'm using a high-order bit unavailable to 32-bit archs, because all of the others are used right now. This is one of my unresolved issues. PageNoreclaim() is like, but mutually exclusive to, PageActive()--it tells us which LRU list the page is on.

Thanks,
Lee
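P.S. Concretely, the mutual exclusion means putback can pick the list with a simple cascade, something like (simplified sketch; add_page_to_noreclaim_list() stands in for whatever the patches actually call it, the other two are the existing mm_inline.h helpers):

    /* PG_noreclaim and PG_active together say which list the page is on */
    if (PageNoreclaim(page))
        add_page_to_noreclaim_list(zone, page);
    else if (PageActive(page))
        add_page_to_active_list(zone, page);
    else
        add_page_to_inactive_list(zone, page);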
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"

From: Christoph Lameter @ 2007-08-29 17:39 UTC
To: Lee Schermerhorn
Cc: Nick Piggin, linux-mm, Rik van Riel

On Wed, 29 Aug 2007, Lee Schermerhorn wrote:

> > I think that is the right approach. Do not forget that ramfs and other ram-based filesystems create unmapped unreclaimable pages.
>
> They don't go on the LRU lists now, do they? The primary function of the noreclaim infrastructure is to hide non-reclaimable pages that would otherwise go on the [in]active lists from vmscan. So, if pages used by the ram-based file systems don't go onto the LRU, we probably don't need to put them on the noreclaim list, which is conceptually another LRU list.

They do go into the LRU. When attempts are made to write them out, they are put back onto the active lists via a strange return code, AOP_WRITEPAGE_ACTIVATE. So they circle round and round and round...

> > Right. I posted a patch a week ago that generalized LRU handling and would allow the adding of additional lists as needed by such an approach.
>
> Which one was that?

This one:

[RECLAIM] Use an indexed array for active/inactive variables

Currently we are defining explicit variables for the inactive and active list. An indexed array can be more generic and avoid repeating similar code in several places in the reclaim code.

We are saving a few bytes in terms of code size:

Before:
   text    data     bss     dec     hex filename
4097753  573120 4092484 8763357  85b7dd vmlinux

After:
   text    data     bss     dec     hex filename
4097729  573120 4092484 8763333  85b7c5 vmlinux

Having an easy way to add new lru lists may ease future work on the reclaim code.

---
 include/linux/mm_inline.h |   34 +++++++----
 include/linux/mmzone.h    |   13 +++-
 mm/page_alloc.c           |    9 +--
 mm/swap.c                 |    2
 mm/vmscan.c               |  132 ++++++++++++++++++++++------------------
 mm/vmstat.c               |    3 -
 6 files changed, 104 insertions(+), 89 deletions(-)

Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2007-08-20 20:43:35.000000000 -0700
+++ linux-2.6/include/linux/mmzone.h	2007-08-20 21:39:48.000000000 -0700
@@ -82,6 +82,13 @@ enum zone_stat_item {
 #endif
 	NR_VM_ZONE_STAT_ITEMS };

+enum lru_list {
+	LRU_INACTIVE,
+	LRU_ACTIVE,
+	NR_LRU_LISTS };
+
+#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
+
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
@@ -227,10 +234,8 @@ struct zone {

 	/* Fields commonly accessed by the page reclaim scanner */
 	spinlock_t		lru_lock;
-	struct list_head	active_list;
-	struct list_head	inactive_list;
-	unsigned long		nr_scan_active;
-	unsigned long		nr_scan_inactive;
+	struct list_head	list[NR_LRU_LISTS];
+	unsigned long		nr_scan[NR_LRU_LISTS];
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */

Index: linux-2.6/include/linux/mm_inline.h
===================================================================
--- linux-2.6.orig/include/linux/mm_inline.h	2007-08-20 20:43:35.000000000 -0700
+++ linux-2.6/include/linux/mm_inline.h	2007-08-20 21:39:48.000000000 -0700
@@ -1,40 +1,50 @@
 static inline void
-add_page_to_active_list(struct zone *zone, struct page *page)
+add_page_to_list(struct zone *zone, struct page *page, enum lru_list l)
 {
-	list_add(&page->lru, &zone->active_list);
-	__inc_zone_state(zone, NR_ACTIVE);
+	list_add(&page->lru, &zone->list[l]);
+	__inc_zone_state(zone, NR_INACTIVE + l);
+}
+
+static inline void
+add_page_to_active_list(struct zone *zone, struct page *page) {
+	add_page_to_list(zone, page, LRU_ACTIVE);
 }

 static inline void
 add_page_to_inactive_list(struct zone *zone, struct page *page)
 {
-	list_add(&page->lru, &zone->inactive_list);
-	__inc_zone_state(zone, NR_INACTIVE);
+	add_page_to_list(zone, page, LRU_INACTIVE);
 }

 static inline void
-del_page_from_active_list(struct zone *zone, struct page *page)
+del_page_from_list(struct zone *zone, struct page *page, enum lru_list l)
 {
 	list_del(&page->lru);
-	__dec_zone_state(zone, NR_ACTIVE);
+	__dec_zone_state(zone, NR_INACTIVE + l);
+}
+
+static inline void
+del_page_from_active_list(struct zone *zone, struct page *page)
+{
+	del_page_from_list(zone, page, LRU_ACTIVE);
 }

 static inline void
 del_page_from_inactive_list(struct zone *zone, struct page *page)
 {
-	list_del(&page->lru);
-	__dec_zone_state(zone, NR_INACTIVE);
+	del_page_from_list(zone, page, LRU_INACTIVE);
 }

 static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
+	enum lru_list l = LRU_INACTIVE;
+
 	list_del(&page->lru);
 	if (PageActive(page)) {
 		__ClearPageActive(page);
-		__dec_zone_state(zone, NR_ACTIVE);
-	} else {
-		__dec_zone_state(zone, NR_INACTIVE);
+		l = LRU_ACTIVE;
 	}
+	__dec_zone_state(zone, NR_INACTIVE + l);
 }

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2007-08-20 20:43:34.000000000 -0700
+++ linux-2.6/mm/page_alloc.c	2007-08-20 21:39:48.000000000 -0700
@@ -2908,6 +2908,7 @@ static void __meminit free_area_init_cor
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, memmap_pages;
+		enum lru_list l;

 		size = zone_spanned_pages_in_node(nid, j, zones_size);
 		realsize = size - zone_absent_pages_in_node(nid, j,
@@ -2957,10 +2958,10 @@ static void __meminit free_area_init_cor
 		zone->prev_priority = DEF_PRIORITY;

 		zone_pcp_init(zone);
-		INIT_LIST_HEAD(&zone->active_list);
-		INIT_LIST_HEAD(&zone->inactive_list);
-		zone->nr_scan_active = 0;
-		zone->nr_scan_inactive = 0;
+		for_each_lru(l) {
+			INIT_LIST_HEAD(&zone->list[l]);
+			zone->nr_scan[l] = 0;
+		}
 		zap_zone_vm_stats(zone);
 		atomic_set(&zone->reclaim_in_progress, 0);
 		if (!size)

Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c	2007-08-20 20:43:34.000000000 -0700
+++ linux-2.6/mm/swap.c	2007-08-20 21:39:48.000000000 -0700
@@ -125,7 +125,7 @@ int rotate_reclaimable_page(struct page
 	zone = page_zone(page);
 	spin_lock_irqsave(&zone->lru_lock, flags);
 	if (PageLRU(page) && !PageActive(page)) {
-		list_move_tail(&page->lru, &zone->inactive_list);
+		list_move_tail(&page->lru, &zone->list[LRU_INACTIVE]);
 		__count_vm_event(PGROTATED);
 	}
 	if (!test_clear_page_writeback(page))

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-08-20 20:43:35.000000000 -0700
+++ linux-2.6/mm/vmscan.c	2007-08-20 21:40:12.000000000 -0700
@@ -772,7 +772,7 @@ static unsigned long shrink_inactive_lis
 		unsigned long nr_active;

 		nr_taken = isolate_lru_pages(sc->swap_cluster_max,
-			     &zone->inactive_list,
+			     &zone->list[LRU_INACTIVE],
 			     &page_list, &nr_scan, sc->order,
 			     (sc->order > PAGE_ALLOC_COSTLY_ORDER)?
 					     ISOLATE_BOTH : ISOLATE_INACTIVE);
@@ -807,10 +807,7 @@ static unsigned long shrink_inactive_lis
 			VM_BUG_ON(PageLRU(page));
 			SetPageLRU(page);
 			list_del(&page->lru);
-			if (PageActive(page))
-				add_page_to_active_list(zone, page);
-			else
-				add_page_to_inactive_list(zone, page);
+			add_page_to_list(zone, page, PageActive(page));
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
 				__pagevec_release(&pvec);
@@ -869,11 +866,14 @@ static void shrink_active_list(unsigned
 	int pgdeactivate = 0;
 	unsigned long pgscanned;
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
-	LIST_HEAD(l_inactive);	/* Pages to go onto the inactive_list */
-	LIST_HEAD(l_active);	/* Pages to go onto the active_list */
+	struct list_head list[NR_LRU_LISTS];
 	struct page *page;
 	struct pagevec pvec;
 	int reclaim_mapped = 0;
+	enum lru_list l;
+
+	for_each_lru(l)
+		INIT_LIST_HEAD(&list[l]);

 	if (sc->may_swap) {
 		long mapped_ratio;
@@ -924,7 +924,7 @@ force_reclaim_mapped:
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
+	pgmoved = isolate_lru_pages(nr_pages, &zone->list[LRU_ACTIVE],
 			    &l_hold, &pgscanned, sc->order, ISOLATE_ACTIVE);
 	zone->pages_scanned += pgscanned;
 	__mod_zone_page_state(zone, NR_ACTIVE, -pgmoved);
@@ -938,25 +938,25 @@ force_reclaim_mapped:
 			if (!reclaim_mapped ||
 			    (total_swap_pages == 0 && PageAnon(page)) ||
 			    page_referenced(page, 0)) {
-				list_add(&page->lru, &l_active);
+				list_add(&page->lru, &list[LRU_ACTIVE]);
 				continue;
 			}
 		}
-		list_add(&page->lru, &l_inactive);
+		list_add(&page->lru, &list[LRU_INACTIVE]);
 	}

 	pagevec_init(&pvec, 1);
 	pgmoved = 0;
 	spin_lock_irq(&zone->lru_lock);
-	while (!list_empty(&l_inactive)) {
-		page = lru_to_page(&l_inactive);
-		prefetchw_prev_lru_page(page, &l_inactive, flags);
+	while (!list_empty(&list[LRU_INACTIVE])) {
+		page = lru_to_page(&list[LRU_INACTIVE]);
+		prefetchw_prev_lru_page(page, &list[LRU_INACTIVE], flags);
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
 		ClearPageActive(page);
-		list_move(&page->lru, &zone->inactive_list);
+		list_move(&page->lru, &zone->list[LRU_INACTIVE]);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
@@ -978,13 +978,13 @@ force_reclaim_mapped:
 	}

 	pgmoved = 0;
-	while (!list_empty(&l_active)) {
-		page = lru_to_page(&l_active);
-		prefetchw_prev_lru_page(page, &l_active, flags);
+	while (!list_empty(&list[LRU_ACTIVE])) {
+		page = lru_to_page(&list[LRU_ACTIVE]);
+		prefetchw_prev_lru_page(page, &list[LRU_ACTIVE], flags);
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
-		list_move(&page->lru, &zone->active_list);
+		list_move(&page->lru, &zone->list[LRU_ACTIVE]);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
@@ -1003,16 +1003,26 @@ force_reclaim_mapped:
 	pagevec_release(&pvec);
 }

+static unsigned long shrink_list(enum lru_list l, unsigned long nr_to_scan,
+	struct zone *zone, struct scan_control *sc, int priority)
+{
+	if (l == LRU_ACTIVE) {
+		shrink_active_list(nr_to_scan, zone, sc, priority);
+		return 0;
+	}
+	return shrink_inactive_list(nr_to_scan, zone, sc);
+}
+
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
 static unsigned long shrink_zone(int priority, struct zone *zone,
 				struct scan_control *sc)
 {
-	unsigned long nr_active;
-	unsigned long nr_inactive;
+	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
 	unsigned long nr_reclaimed = 0;
+	enum lru_list l;

 	atomic_inc(&zone->reclaim_in_progress);
@@ -1020,36 +1030,26 @@ static unsigned long shrink_zone(int pri
 	 * Add one to `nr_to_scan' just to make sure that the kernel will
 	 * slowly sift through the active list.
 	 */
-	zone->nr_scan_active +=
-		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
-	nr_active = zone->nr_scan_active;
-	if (nr_active >= sc->swap_cluster_max)
-		zone->nr_scan_active = 0;
-	else
-		nr_active = 0;
-
-	zone->nr_scan_inactive +=
-		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
-	nr_inactive = zone->nr_scan_inactive;
-	if (nr_inactive >= sc->swap_cluster_max)
-		zone->nr_scan_inactive = 0;
-	else
-		nr_inactive = 0;
-
-	while (nr_active || nr_inactive) {
-		if (nr_active) {
-			nr_to_scan = min(nr_active,
-				(unsigned long)sc->swap_cluster_max);
-			nr_active -= nr_to_scan;
-			shrink_active_list(nr_to_scan, zone, sc, priority);
-		}
+	for_each_lru(l) {
+		zone->nr_scan[l] += (zone_page_state(zone, NR_INACTIVE + l)
+							>> priority) + 1;
+		nr[l] = zone->nr_scan[l];
+		if (nr[l] >= sc->swap_cluster_max)
+			zone->nr_scan[l] = 0;
+		else
+			nr[l] = 0;
+	}

-		if (nr_inactive) {
-			nr_to_scan = min(nr_inactive,
-				(unsigned long)sc->swap_cluster_max);
-			nr_inactive -= nr_to_scan;
-			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
-								sc);
+	while (nr[LRU_ACTIVE] || nr[LRU_INACTIVE]) {
+		for_each_lru(l) {
+			if (nr[l]) {
+				nr_to_scan = min(nr[l],
+					(unsigned long)sc->swap_cluster_max);
+				nr[l] -= nr_to_scan;
+
+				nr_reclaimed += shrink_list(l, nr_to_scan,
+							zone, sc, priority);
+			}
 		}
 	}
@@ -1489,6 +1489,7 @@ static unsigned long shrink_all_zones(un
 {
 	struct zone *zone;
 	unsigned long nr_to_scan, ret = 0;
+	enum lru_list l;

 	for_each_zone(zone) {
@@ -1498,28 +1499,25 @@ static unsigned long shrink_all_zones(un
 		if (zone->all_unreclaimable && prio != DEF_PRIORITY)
 			continue;

-		/* For pass = 0 we don't shrink the active list */
-		if (pass > 0) {
-			zone->nr_scan_active +=
-				(zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
-			if (zone->nr_scan_active >= nr_pages || pass > 3) {
-				zone->nr_scan_active = 0;
+		for_each_lru(l) {
+			/* For pass = 0 we don't shrink the active list */
+			if (pass == 0 && l == LRU_ACTIVE)
+				continue;
+
+			zone->nr_scan[l] +=
+				(zone_page_state(zone, NR_INACTIVE + l)
+								>> prio) + 1;
+			if (zone->nr_scan[l] >= nr_pages || pass > 3) {
+				zone->nr_scan[l] = 0;
 				nr_to_scan = min(nr_pages,
-					zone_page_state(zone, NR_ACTIVE));
-				shrink_active_list(nr_to_scan, zone, sc, prio);
+					zone_page_state(zone,
+							NR_INACTIVE + l));
+				ret += shrink_list(l, nr_to_scan, zone,
+								sc, prio);
+				if (ret >= nr_pages)
+					return ret;
 			}
 		}
-
-		zone->nr_scan_inactive +=
-			(zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
-		if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
-			zone->nr_scan_inactive = 0;
-			nr_to_scan = min(nr_pages,
-				zone_page_state(zone, NR_INACTIVE));
-			ret += shrink_inactive_list(nr_to_scan, zone, sc);
-			if (ret >= nr_pages)
-				return ret;
-		}
 	}

 	return ret;

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2007-08-20 20:43:35.000000000 -0700
+++ linux-2.6/mm/vmstat.c	2007-08-20 21:39:48.000000000 -0700
@@ -563,7 +563,8 @@ static int zoneinfo_show(struct seq_file
 		   zone->pages_low,
 		   zone->pages_high,
 		   zone->pages_scanned,
-		   zone->nr_scan_active, zone->nr_scan_inactive,
+		   zone->nr_scan[LRU_ACTIVE],
+		   zone->nr_scan[LRU_INACTIVE],
 		   zone->spanned_pages,
 		   zone->present_pages);
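With this in place, adding a noreclaim list would, in principle, just be a matter of extending the enum--e.g. (illustrative only, not part of the patch above):

	enum lru_list {
		LRU_INACTIVE,
		LRU_ACTIVE,
		LRU_NORECLAIM,	/* hypothetical: pages vmscan ignores */
		NR_LRU_LISTS };

plus a matching NR_* zone stat item right after NR_ACTIVE (the patch indexes the stats as NR_INACTIVE + l), and a check so for_each_lru() in the scan loops passes over LRU_NORECLAIM except for whatever slow background scan is wanted.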
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-29 17:39 ` Christoph Lameter
@ 2007-08-30  0:09   ` Rik van Riel
  2007-08-30 14:49     ` Lee Schermerhorn
  0 siblings, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2007-08-30 0:09 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, Nick Piggin, linux-mm

Christoph Lameter wrote:
> On Wed, 29 Aug 2007, Lee Schermerhorn wrote:
>
>>> I think that is the right approach. Do not forget that ramfs and other
>>> ram-based filesystems create unmapped unreclaimable pages.
>>
>> They don't go on the LRU lists now, do they? The primary function of
>> the noreclaim infrastructure is to hide non-reclaimable pages that would
>> otherwise go on the [in]active lists from vmscan. So, if pages used by
>> the ram-based filesystems don't go onto the LRU, we probably don't need
>> to put them on the noreclaim list, which is conceptually another LRU
>> list.
>
> They do go onto the LRU. When attempts are made to write them out, they
> are put back onto the active list via a strange return code,
> AOP_WRITEPAGE_ACTIVATE. So they circle round and round and round...
>
>>> Right. I posted a patch a week ago that generalized LRU handling and
>>> would allow the adding of additional lists as needed by such an
>>> approach.
>>
>> Which one was that?
>
> This one:
>
> [RECLAIM] Use an indexed array for active/inactive variables
>
> Currently we are defining explicit variables for the inactive and active
> list. An indexed array can be more generic and avoid repeating similar
> code in several places in the reclaim code.

I like it. This will make the code that has separate lists
for anonymous (and other swap-backed) pages a lot nicer.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
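For readers who have not run into it, AOP_WRITEPAGE_ACTIVATE (include/linux/fs.h) is the return code Christoph means. A hedged sketch of the pattern follows; the function and filesystem are hypothetical, but the redirty-and-return shape follows what shmem_writepage() does in 2.6.23 when no swap is available.

/*
 * Hypothetical ->writepage for a filesystem whose pages live only in RAM.
 * vmscan's pageout() maps this return code to PAGE_ACTIVATE, and
 * shrink_page_list() then moves the page back to the active list --
 * the "round and round" behaviour described above.
 */
static int rambacked_writepage(struct page *page,
				struct writeback_control *wbc)
{
	BUG_ON(!wbc->for_reclaim);	/* only reclaim should get here */
	set_page_dirty(page);		/* the data cannot be dropped;
					 * keep the page dirty */
	return AOP_WRITEPAGE_ACTIVATE;	/* return with the page still
					 * locked; shrink_page_list()
					 * unlocks it */
}

Which is exactly why such pages keep cycling: every trip through pageout() ends with the page reactivated rather than freed.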
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-30  0:09 ` Rik van Riel
@ 2007-08-30 14:49   ` Lee Schermerhorn
  0 siblings, 0 replies; 19+ messages in thread
From: Lee Schermerhorn @ 2007-08-30 14:49 UTC (permalink / raw)
To: Rik van Riel; +Cc: Christoph Lameter, Nick Piggin, linux-mm

On Wed, 2007-08-29 at 20:09 -0400, Rik van Riel wrote:
> Christoph Lameter wrote:
> > On Wed, 29 Aug 2007, Lee Schermerhorn wrote:
> >
> >>> I think that is the right approach. Do not forget that ramfs and other
> >>> ram-based filesystems create unmapped unreclaimable pages.
> >>
> >> They don't go on the LRU lists now, do they? The primary function of
> >> the noreclaim infrastructure is to hide non-reclaimable pages that would
> >> otherwise go on the [in]active lists from vmscan. So, if pages used by
> >> the ram-based filesystems don't go onto the LRU, we probably don't need
> >> to put them on the noreclaim list, which is conceptually another LRU
> >> list.
> >
> > They do go onto the LRU. When attempts are made to write them out, they
> > are put back onto the active list via a strange return code,
> > AOP_WRITEPAGE_ACTIVATE. So they circle round and round and round...
> >
> >>> Right. I posted a patch a week ago that generalized LRU handling and
> >>> would allow the adding of additional lists as needed by such an
> >>> approach.
> >>
> >> Which one was that?
> >
> > This one:
> >
> > [RECLAIM] Use an indexed array for active/inactive variables
> >
> > Currently we are defining explicit variables for the inactive and active
> > list. An indexed array can be more generic and avoid repeating similar
> > code in several places in the reclaim code.
>
> I like it. This will make the code that has separate lists
> for anonymous (and other swap-backed) pages a lot nicer.

Ditto. I'll incorporate it into the noreclaim set and into the copy of
Rik's split lru patch that I'm maintaining. Should make it easier to
merge the two sets.

Lee
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-28 14:52 ` Lee Schermerhorn
  2007-08-28 21:54 ` Christoph Lameter
@ 2007-08-29  4:38   ` Nick Piggin
  2007-08-30 16:34     ` Lee Schermerhorn
  1 sibling, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2007-08-29 4:38 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, Rik van Riel

On Tue, Aug 28, 2007 at 10:52:46AM -0400, Lee Schermerhorn wrote:
> On Tue, 2007-08-28 at 02:06 +0200, Nick Piggin wrote:
> >
> > I don't have a problem with having a more unified approach, although if
> > we did that, then I'd prefer just to do it more simply and not special
> > case mlocked pages _at all_. Ie. just slowly try to reclaim them and
> > eventually, when everybody unlocks them, you will notice sooner or later.
>
> I didn't think I was special casing mlocked pages. I wanted to treat
> all !page_reclaimable() pages the same--i.e., put them on the noreclaim
> list.

But you are keeping track of the mlock count? Why not simply call
try_to_unmap and see if they are still mlocked?

> > But once you do the code for mlock refcounting, that's most of the hard
> > part done, so you may as well remove them completely from the LRU, no?
> > Then they become more or less transparent to the rest of the VM as well.
>
> Well, no. Depending on the reason for !reclaimable, the page would go
> on the noreclaim list or just be dropped--special handling. More
> importantly [for me], we still have to handle them specially in
> migration, dumping them back onto the LRU so that we can arbitrate
> access. If I'm ever successful in getting automatic/lazy page
> migration+replication accepted, I don't want that overhead in
> auto-migration/replication.

Oh OK. I don't know if there should be a whole lot of overhead involved
with that, though. I can't remember exactly what the problems were here
with my mlock patch, but I think it could have been made more optimal.

> > Could be possible. Tricky though. Probably take less code to use
> > ->lru ;)
>
> Oh, certainly less code to use any separate field. But the lru list
> field is the only link we have in the page struct, and a lot of the VM
> depends on being able to pass around lists of pages. I'd hate to lose
> that for mlocked pages, or to have to dump the lock count and
> reestablish it in those cases, like migration, where we need to put the
> page on a list.

Hmm, yes. Migration could possibly use a singly linked list. But I'm
only saying it _could_ be possible to do mlock accounting efficiently
with one of the LRU pointers -- I would prefer the idea of just using
a single bit, for example, if that is sufficient. It should cut down
on code.

> > I don't know. I'd have thought efficient mlock handling might be useful
> > for realtime systems, probably many of which would be 32-bit.
>
> I agree. I just wonder if those systems have a sufficient number of
> pages that they're suffering from the long lru lists with a large
> fraction of unreclaimable pages... If we do want to support keeping
> nonreclaimable pages off the [in]active lists for these systems, we'll
> need to find a place for the flag[s].

That's true, they will have a lot fewer pages (and probably won't be
using highmem).

> > Are you seeing mlock pinning heaps of memory in the field?
>
> It is common usage to mlock() large shared memory areas, as well as
> entire tasks [MCL_CURRENT|MCL_FUTURE]. I think it would be even
> more frequent if one could inherit MCL_FUTURE across fork and exec.
> Then one could write/enhance a prefix command, like numactl and taskset,
> to enable locking of unmodified applications. I prototyped this once,
> but never updated it to do the mlock accounting [e.g., down in
> copy_page_range() during fork()] for your patch.
>
> What we see more of is folks just figuring that they've got sufficient
> memory [100s of GB] for their apps and shared memory areas, so they
> don't add enough swap to back all of the anon and shmem regions. Then,
> when they get under memory pressure--e.g., the old "backup ate my
> pagecache" scenario--the system more or less live-locks in vmscan,
> shuffling non-reclaimable [unswappable] pages. A large number of
> mlocked pages on the LRU produces the same symptom; as do excessively
> long anon_vma lists and huge i_mmap trees--the latter seen with some
> large Oracle workloads.

OK, thanks for the background.

Thanks,
Nick
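To ground the cost argument: the cull Lee proposes happens before any rmap walk. Here is a sketch in shrink_page_list() terms, with the caveat that PageMlocked() and the noreclaim list belong to the proposed patch set, not to 2.6.23 mainline -- every name below is from the proposal or invented.

/*
 * Sketch of early culling.  PageMlocked() and the noreclaim list are
 * assumptions from the noreclaim proposal; the point is that one flag
 * test on the page itself replaces two full reverse-map walks
 * (page_referenced() and then try_to_unmap()) for pages that could
 * never be reclaimed anyway.
 */
static int cull_nonreclaimable_page(struct page *page,
				    struct list_head *noreclaim_list)
{
	if (!PageMlocked(page))
		return 0;	/* reclaimable as far as we know; fall
				 * through to the normal rmap walks */
	list_move(&page->lru, noreclaim_list);
	return 1;		/* culled: vmscan doesn't see the page
				 * again until the last munlock */
}

With hundreds or thousands of tasks mapping a shared segment, each skipped page_referenced()/try_to_unmap() pair avoids a walk over the whole anon_vma list or i_mmap tree -- which is where the Oracle live-locks mentioned above come from.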
* Re: RFC: Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-29  4:38 ` Nick Piggin
@ 2007-08-30 16:34   ` Lee Schermerhorn
  0 siblings, 0 replies; 19+ messages in thread
From: Lee Schermerhorn @ 2007-08-30 16:34 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-mm, Rik van Riel, Christoph Hellwig

On Wed, 2007-08-29 at 06:38 +0200, Nick Piggin wrote:
> On Tue, Aug 28, 2007 at 10:52:46AM -0400, Lee Schermerhorn wrote:
> > On Tue, 2007-08-28 at 02:06 +0200, Nick Piggin wrote:
> > >
> > > I don't have a problem with having a more unified approach, although if
> > > we did that, then I'd prefer just to do it more simply and not special
> > > case mlocked pages _at all_. Ie. just slowly try to reclaim them and
> > > eventually, when everybody unlocks them, you will notice sooner or
> > > later.
> >
> > I didn't think I was special casing mlocked pages. I wanted to treat
> > all !page_reclaimable() pages the same--i.e., put them on the noreclaim
> > list.
>
> But you are keeping track of the mlock count? Why not simply call
> try_to_unmap and see if they are still mlocked?

We may be talking past each other here. So, let me try this:

We're trying to hide nonreclaimable pages, including mlock'ed ones, from
vmscan to the extent possible--to make reclaim as efficient as possible.
Sometimes, to avoid races [as in your comment in __mlock_pages_range()
regarding anonymous pages], we may end up putting mlock'ed pages on the
normal lru list. That's OK. We can cull them in shrink_*_list().

Now, if we have an mlock count in a dedicated field, or a page flag
indicating mlock'ed state [perhaps with a count in an overloaded field],
we can easily cull the mlock'ed pages w/o access to any vma, so that a
page never gets to shrink_page_list(), where try_to_unmap() would be
called. IMO, try_to_unmap() is/can be a fairly heavy hammer, walking
the entire rmap as it does. And we only get to try_to_unmap() after
already walking the entire rmap in page_referenced() [hmmm, maybe cull
mlock'ed pages in page_referenced()--before even checking the page
tables for references?]. So, I'd like to cull them early by just
looking at the page. If a page occasionally makes it through--like only
the first time, for anon pages?--we only take the hit once.

Now, you may be thinking that, in general, reverse maps are not all that
large. But I've seen live locks on the i_mmap_lock with heavy Oracle
loads [I think I already mentioned this]. On large servers, we can see
hundreds or thousands of tasks mapping the database executables,
libraries and shared memory areas--just the types of regions one might
want to mlock. Further, the shared memory areas can get quite
large--10s, 100s, even 1000s of GB. That's a lot of pages to be running
through page_referenced()/try_to_unmap() too often.

> > > But once you do the code for mlock refcounting, that's most of the hard
> > > part done, so you may as well remove them completely from the LRU, no?
> > > Then they become more or less transparent to the rest of the VM as
> > > well.
> >
> > Well, no. Depending on the reason for !reclaimable, the page would go
> > on the noreclaim list or just be dropped--special handling. More
> > importantly [for me], we still have to handle them specially in
> > migration, dumping them back onto the LRU so that we can arbitrate
> > access. If I'm ever successful in getting automatic/lazy page
> > migration+replication accepted, I don't want that overhead in
> > auto-migration/replication.
>
> Oh OK. I don't know if there should be a whole lot of overhead involved
> with that, though. I can't remember exactly what the problems were here
> with my mlock patch, but I think it could have been made more optimal.

The basic issue was that one can't migrate pages [nor unmap them for
lazy migration/replication] if check_range() can't find them on, and
successfully isolate them from, the lru. In a respin of the patch, you
dumped the pages back onto the LRU so that they could be migrated.
Then, later, they'd need to be lazily culled back off the lru. Could be
a lot of pages for some regions. With the noreclaim lru list, this
isn't necessary. It works just like the [in]active lists from
migration's perspective. I guess the overhead depends on the size of
the regions being migrated.

It occurs to me that we probably need a way to exempt some
regions--like huge shared memory areas--from
auto-migration/replication.

> > > Could be possible. Tricky though. Probably take less code to use
> > > ->lru ;)
> >
> > Oh, certainly less code to use any separate field. But the lru list
> > field is the only link we have in the page struct, and a lot of the VM
> > depends on being able to pass around lists of pages. I'd hate to lose
> > that for mlocked pages, or to have to dump the lock count and
> > reestablish it in those cases, like migration, where we need to put the
> > page on a list.
>
> Hmm, yes. Migration could possibly use a singly linked list. But I'm
> only saying it _could_ be possible to do mlock accounting efficiently
> with one of the LRU pointers --

I agree that, if we don't want to keep the pages on an lru list, or
want to use some other list type for migration and such, doing the
accounting in one of the lru pointers is no[t much] more overhead,
timewise, than a dedicated field. A dedicated field increases space
overhead, tho'.

> I would prefer the idea of just using a single bit, for example, if
> that is sufficient. It should cut down on code.

I've been thinking about how to eliminate the mlock count entirely and
just use a single page flag and "lazy culling"--i.e., try to unmap.
But one scenario I want to avoid is where tasks come and go, attaching
to a shared memory area/executable with an mlock'ed vma. When they
detach, without a count, we'd just drop the mlock flag, moving the
pages back to the normal lru lists, and let vmscan cull them again if
some vma still has them mlock'ed. Again, I'd like to avoid that flood
of pages between the normal lru and noreclaim lists in my model.

Perhaps the "flood" can be eliminated for shared memory areas--likely
to be the largest source of mlock'ed pages--by not unlocking pages in
shmem areas that have the VM_LOCKED flag set in the shmem_inode_info
flags field [SHM_LOCKED regions]. I don't see any current interaction
of that flag with the vm_flags when attaching to a SHM_LOCKED region.
Such interaction is not required to prevent swap out--that's handled in
shmem_writepage(). But, to keep those pages off the LRU, we probably
need to consult the shmem_inode_info flags in the modified mlock code.
Maybe pull the flag into the vm_flags on attach? This way,
try_to_unmap() will see it w/o having to consult vm_file->...

I'm looking into this.

Later,
Lee
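The "pull the flag into vm_flags on attach" idea would look roughly like this. The hook point and helper name are hypothetical; SHMEM_I(), the shmem_inode_info flags field, and the fact that shmem_lock() records SHM_LOCKED state as VM_LOCKED in info->flags are real in 2.6.23.

#include <linux/mm.h>
#include <linux/shmem_fs.h>

/*
 * Hypothetical attach-time hook: propagate a segment's SHM_LOCKED
 * state (stored as VM_LOCKED in shmem_inode_info.flags by
 * shmem_lock()) into the new mapping's vm_flags, so try_to_unmap()
 * can see the locked state from the vma alone instead of chasing
 * vma->vm_file->... on every page.
 */
static void shm_propagate_locked(struct vm_area_struct *vma)
{
	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
	struct shmem_inode_info *info = SHMEM_I(inode);

	if (info->flags & VM_LOCKED)
		vma->vm_flags |= VM_LOCKED;
}

Whether the propagated bit should also be cleared on SHM_UNLOCK, and what that means for already-attached mappings, is exactly the kind of interaction Lee says he is still looking into.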
end of thread, other threads:[~2007-08-30 16:34 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-08-23  4:11 vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Nick Piggin
2007-08-23  7:15 ` vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Andrew Morton
2007-08-23  9:07   ` vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Nick Piggin
2007-08-23 11:48     ` vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Andrea Arcangeli
2007-08-24 20:43 ` RFC: Noreclaim with "Keep Mlocked Pages off the LRU" Lee Schermerhorn
2007-08-27  1:35   ` Nick Piggin
2007-08-27 14:34     ` Lee Schermerhorn
2007-08-27 15:44       ` Christoph Hellwig
2007-08-27 23:51         ` Nick Piggin
2007-08-28 12:29           ` Christoph Hellwig
2007-08-28  0:06       ` Nick Piggin
2007-08-28 14:52         ` Lee Schermerhorn
2007-08-28 21:54           ` Christoph Lameter
2007-08-29 14:40             ` Lee Schermerhorn
2007-08-29 17:39               ` Christoph Lameter
2007-08-30  0:09                 ` Rik van Riel
2007-08-30 14:49                   ` Lee Schermerhorn
2007-08-29  4:38           ` Nick Piggin
2007-08-30 16:34             ` Lee Schermerhorn