Re: vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru
@ 2007-08-23  4:11 Nick Piggin
  2007-08-23  7:15 ` vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Andrew Morton
  2007-08-24 20:43 ` RFC: Noreclaim with "Keep Mlocked Pages off the LRU" Lee Schermerhorn
  0 siblings, 2 replies; 19+ messages in thread
From: Nick Piggin @ 2007-08-23  4:11 UTC (permalink / raw)
  To: Andrew Morton, Martin Bligh, Rik van Riel, Linux Memory Management List

About this patch... I hope it doesn't get merged without good reason...

Our current reclaim scheme may not always make great choices, but one
thing I really like about it is that it can generally always reclaim file
backed pages at O(1) WRT the size of RAM. Once you start giving things
multiple trips around lists, you can reach the situation where you need
to scan all or a huge number of pages before reclaiming any. If you have
long periods of not touching reclaim, it could be very likely that most
memory is on the active list with referenced set.

One thing you could potentially do is have mark_page_accessed always
put active pages back to the head of the LRU, but that is probably going
to take way too much locking... I'm not completely happy with our random
page reclaim policy either, but I console myself in this case by thinking
of PG_referenced as giving the page a slightly better chance before
leaving the inactive list.

FWIW, this is one of the big reasons not to go with the scheme where
you rip out mark_page_accessed completely and do all aging simply based
on referenced or second chance bits. It's conceptually a lot simpler
and more consistent, and it behaves really well for use-once type pages
too, but in practice it can cause big pauses when you initially start
reclaim (and I expect this patch could be subject to the same, even if
there were fewer cases that triggered such behaviour).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru
  2007-08-23  4:11 vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Nick Piggin
@ 2007-08-23  7:15 ` Andrew Morton
  2007-08-23  9:07   ` vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Nick Piggin
  2007-08-24 20:43 ` RFC: Noreclaim with "Keep Mlocked Pages off the LRU" Lee Schermerhorn
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2007-08-23  7:15 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Martin Bligh, Rik van Riel, Linux Memory Management List

On Thu, 23 Aug 2007 06:11:37 +0200 Nick Piggin <npiggin@suse.de> wrote:

> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc3/2.6.23-rc3-mm1/broken-out/vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
> 
> About this patch... I hope it doesn't get merged without good reason...

I have no intention at all of merging it until it's proven to be a net
benefit.  This is engineering.  We shouldn't merge VM changes based on
handwaving.

It does fix a bug (ie: a difference between design intent and
implementation) but I have no idea whether it improves or worsens anything.

> [handwaving]

;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru
  2007-08-23  7:15 ` vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Andrew Morton
@ 2007-08-23  9:07   ` Nick Piggin
  2007-08-23 11:48     ` vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-aroun d-the-lru Andrea Arcangeli
  0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2007-08-23  9:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Martin Bligh, Rik van Riel, Linux Memory Management List

On Thu, Aug 23, 2007 at 12:15:17AM -0700, Andrew Morton wrote:
> On Thu, 23 Aug 2007 06:11:37 +0200 Nick Piggin <npiggin@suse.de> wrote:
> 
> > http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc3/2.6.23-rc3-mm1/broken-out/vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
> > 
> > About this patch... I hope it doesn't get merged without good reason...
> 
> I have no intention at all of merging it until it's proven to be a net
> benefit.  This is engineering.  We shouldn't merge VM changes based on
> handwaving.
> 
> It does fix a bug (ie: a difference between design intent and
> implementation) but I have no idea whether it improves or worsens anything.
> 
> > [handwaving]
> 
> ;)

Well what I say is handwaving too, but it is a situation that wouldn't be
completely unusual to hit. Anyway, I know I don't need to make an airtight
argument as to why _not_ to merge a patch, so this is just a heads-up to
be on the lookout for one potential issue I have seen with a similar change.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-aroun d-the-lru
  2007-08-23  9:07   ` vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Nick Piggin
@ 2007-08-23 11:48     ` Andrea Arcangeli
  0 siblings, 0 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2007-08-23 11:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Martin Bligh, Rik van Riel, Linux Memory Management List

On Thu, Aug 23, 2007 at 11:07:22AM +0200, Nick Piggin wrote:
> On Thu, Aug 23, 2007 at 12:15:17AM -0700, Andrew Morton wrote:
> > On Thu, 23 Aug 2007 06:11:37 +0200 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc3/2.6.23-rc3-mm1/broken-out/vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
> > > 
> > > About this patch... I hope it doesn't get merged without good reason...
> > 
> > I have no intention at all of merging it until it's proven to be a net
> > benefit.  This is engineering.  We shouldn't merge VM changes based on
> > handwaving.
> > 
> > It does fix a bug (ie: a difference between design intent and
> > implementation) but I have no idea whether it improves or worsens anything.
> > 
> > > [handwaving]
> > 
> > ;)
> 
> Well what I say is handwaving too, but it is a situation that wouldn't be
> completely unusual to hit. Anyway, I know I don't need to make an airtight
> argument as to why _not_ to merge a patch, so this is just a heads-up to
> be on the lookout for one potential issue I have seen with a similar change.

I like the patch, I consider it a fix but perhaps I'm biased ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-23  4:11 vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Nick Piggin
  2007-08-23  7:15 ` vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Andrew Morton
@ 2007-08-24 20:43 ` Lee Schermerhorn
  2007-08-27  1:35   ` Nick Piggin
  1 sibling, 1 reply; 19+ messages in thread
From: Lee Schermerhorn @ 2007-08-24 20:43 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, Rik van Riel

For your weekend reading pleasure [:-)]

I have reworked your "move mlocked pages off LRU" atop my "noreclaim
infrastructure" that keeps non-reclaimable pages [mlocked, swap-backed
but no swap space, excessively long anon_vma list] on a separate
noreclaim LRU list--more or less ignored by vmscan.  To do this, I had
to <mumble>add<mumble>a new<mumble>mlock_count member<mumble>to
the<mumble>page struct.  This brings the size of the page struct to a
nice, round 64 bytes.  The mlock_count member and [most of] the
noreclaim-mlocked-pages work now depends on CONFIG_NORECLAIM_MLOCK which
depends on CONFIG_NORECLAIM.  Currently,  the entire noreclaim
infrastructure is only supported on 64bit archs because I'm using a
higher order bit [~30] for the PG_noreclaim flag.

Using the noreclaim infrastructure does seem to simplify the "keep
mlocked pages off the LRU" code tho'.  All of the isolate_lru_page(),
move_to_lru(), ... functions have been taught about the noreclaim list,
so many places don't need changes.  That being said, I really not sure
I've covered all of the bases here...

Now, mlocked pages come back off the noreclaim list nicely when the last
mlock reference goes away--assuming I have the counting correct.
However, pages marked non-reclaimable for other reasons--no swap
available, excessive anon_vma ref count--can languish there
indefinitely.   At some point, perhaps vmscan could be taught to do a
slow background scan of the noreclaim list [making it more like
"slo-reclaim"--but we already have that :-)] when swap is added and we
have unswappable pages on the list.  Currently, I don't keep track of
the various reasons for the no-reclaim pages, but that could be added.  

Rik Van Riel mentions, on his VM wiki page that a background scan might
be useful to age pages actively [clock hand, anyone?], so I might be
able to piggyback on that, or even prototype it at some point.   In the
meantime, I'm going to add a scan of the noreclaim list manually
triggered by a temporary sysctl.

Anyway, if anyone is interested, the patches are in a gzip'd tarball in:

http://free.linux.hp.com/~lts/Patches/Noreclaim/

Cursory functional testing with memtoy shows that it basically works.
I've started a moderately stressful workload for the weekend.  We'll see
how it goes.

Cheers,
Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-24 20:43 ` RFC: Noreclaim with "Keep Mlocked Pages off the LRU" Lee Schermerhorn
@ 2007-08-27  1:35   ` Nick Piggin
  2007-08-27 14:34     ` Lee Schermerhorn
  0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2007-08-27  1:35 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm, Rik van Riel

On Fri, Aug 24, 2007 at 04:43:38PM -0400, Lee Schermerhorn wrote:
> Nick:
> 
> For your weekend reading pleasure [:-)]
> 
> I have reworked your "move mlocked pages off LRU" atop my "noreclaim
> infrastructure" that keeps non-reclaimable pages [mlocked, swap-backed
> but no swap space, excessively long anon_vma list] on a separate
> noreclaim LRU list--more or less ignored by vmscan.  To do this, I had
> to <mumble>add<mumble>a new<mumble>mlock_count member<mumble>to
> the<mumble>page struct.  This brings the size of the page struct to a
> nice, round 64 bytes.  The mlock_count member and [most of] the
> noreclaim-mlocked-pages work now depends on CONFIG_NORECLAIM_MLOCK which
> depends on CONFIG_NORECLAIM.  Currently,  the entire noreclaim
> infrastructure is only supported on 64bit archs because I'm using a
> higher order bit [~30] for the PG_noreclaim flag.

Can you keep the old system of removing mlocked pages completely, and
keeping the mlock count in one of the lru pointers? That should avoid
the need to have a new mlock_count, I think, because none of the other
noreclaim types should need a refcount?

I do approve of bringing struct page to a nice round 64 bytes ;), but I
think I would rather we used up those 8 bytes by making count and
mapcount 8 bytes each.


> Using the noreclaim infrastructure does seem to simplify the "keep
> mlocked pages off the LRU" code tho'.  All of the isolate_lru_page(),
> move_to_lru(), ... functions have been taught about the noreclaim list,
> so many places don't need changes.  That being said, I really not sure
> I've covered all of the bases here...
> 
> Now, mlocked pages come back off the noreclaim list nicely when the last
> mlock reference goes away--assuming I have the counting correct.
> However, pages marked non-reclaimable for other reasons--no swap
> available, excessive anon_vma ref count--can languish there
> indefinitely.   At some point, perhaps vmscan could be taught to do a
> slow background scan of the noreclaim list [making it more like
> "slo-reclaim"--but we already have that :-)] when swap is added and we
> have unswappable pages on the list.  Currently, I don't keep track of
> the various reasons for the no-reclaim pages, but that could be added.  
> 
> Rik Van Riel mentions, on his VM wiki page that a background scan might
> be useful to age pages actively [clock hand, anyone?], so I might be
> able to piggyback on that, or even prototype it at some point.   In the
> meantime, I'm going to add a scan of the noreclaim list manually
> triggered by a temporary sysctl.

Yeah, I think the basic slow simple clock would be a reasonable starting
point. You may end up wanting to introduce some feedback from near-OOM
condition and/or free swap accounting to speed up the scanning rate.

I haven't had much look at the patches yet, but I'm glad to see the old
mlocked patch come to something ;)

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-27  1:35   ` Nick Piggin
@ 2007-08-27 14:34     ` Lee Schermerhorn
  2007-08-27 15:44       ` Christoph Hellwig
  2007-08-28  0:06       ` Nick Piggin
  0 siblings, 2 replies; 19+ messages in thread
From: Lee Schermerhorn @ 2007-08-27 14:34 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, Rik van Riel

On Mon, 2007-08-27 at 03:35 +0200, Nick Piggin wrote:
> On Fri, Aug 24, 2007 at 04:43:38PM -0400, Lee Schermerhorn wrote:
> > Nick:
> > 
> > For your weekend reading pleasure [:-)]
> > 
> > I have reworked your "move mlocked pages off LRU" atop my "noreclaim
> > infrastructure" that keeps non-reclaimable pages [mlocked, swap-backed
> > but no swap space, excessively long anon_vma list] on a separate
> > noreclaim LRU list--more or less ignored by vmscan.  To do this, I had
> > to <mumble>add<mumble>a new<mumble>mlock_count member<mumble>to
> > the<mumble>page struct.  This brings the size of the page struct to a
> > nice, round 64 bytes.  The mlock_count member and [most of] the
> > noreclaim-mlocked-pages work now depends on CONFIG_NORECLAIM_MLOCK which
> > depends on CONFIG_NORECLAIM.  Currently,  the entire noreclaim
> > infrastructure is only supported on 64bit archs because I'm using a
> > higher order bit [~30] for the PG_noreclaim flag.
> 
> Can you keep the old system of removing mlocked pages completely, and
> keeping the mlock count in one of the lru pointers? That should avoid
> the need to have a new mlock_count, I think, because none of the other
> noreclaim types should need a refcount?

Well, keeping the mlock count in the lru pointer more or less defeats
the purpose of this exercise for me--that is, a unified mechanism for
tracking "non-reclaimable" pages.  I wanted to maintain the ability to
use the zone lru_lock and isolate_lru_page() to arbitrate access to
pages for migration, etc. w/o having to temporarily put the pages back
on the lru during migration.   

And, by using another LRU list for non-reclaimable pages, the
non-reclaimable nature of locked, un-swappable, ... pages becomes
transparent to much of the rest of VM.  vmscan and try_to_unmap*() still
have to handle lazy culling of non-reclaimable pages.  If/when you do
get a chance to look at the patches, you'll see that I separated the
culling of non-reclaimable pages in the fault path into a separate
patch.  We could eliminate this overhead in the fault path in favor of
lazy culling in vmscan.  Vmscan would only have to deal with these pages
once to move them to the noreclaim list.

> 
> I do approve of bringing struct page to a nice round 64 bytes ;), but I
> think I would rather we used up those 8 bytes by making count and
> mapcount 8 bytes each.

I knew the new page struct member would be controversial, at best, but
it allows me to prototype and test this approach.  I'd like to find
somewhere else to put the mlock count, but the page struct it pretty
tight as it is.  It occurred to me that while anon and other swap-backed
pages are mlocked, I might be able to use the private field as the mlock
count.  I don't understand the interaction of vm with file systems to
know if we could do the same for file-backed pages.  Maybe a separate
PG_mlock flag would allow one to move the page's private contents to an
external structure along with the mlock count?  Or maybe just with
PG_noreclaim, externalize the private info?

Another approach that I've seen used elsewhere, IFF we can find a
smaller bit field for the mlock count:  maintain a mlock count in a bit
field that is too small to contain max possible lock count.  [Probably
don't need all 64-bits, in any case.]  Clip the count at maximum that
the field can contain [like SWAP_MAP_MAX] and fail mlock attempts if the
count won't accommodate the additional lock.  I haven't investigated
this enough to determine what additional complications it would involve.
It would probably complicate inheriting locks across fork(), if we ever
want to do that [I do!].

Any thoughts on restricting this to 64-bit archs?

> 
> 
> > Using the noreclaim infrastructure does seem to simplify the "keep
> > mlocked pages off the LRU" code tho'.  All of the isolate_lru_page(),
> > move_to_lru(), ... functions have been taught about the noreclaim list,
> > so many places don't need changes.  That being said, I really not sure
> > I've covered all of the bases here...
> > 
> > Now, mlocked pages come back off the noreclaim list nicely when the last
> > mlock reference goes away--assuming I have the counting correct.
> > However, pages marked non-reclaimable for other reasons--no swap
> > available, excessive anon_vma ref count--can languish there
> > indefinitely.   At some point, perhaps vmscan could be taught to do a
> > slow background scan of the noreclaim list [making it more like
> > "slo-reclaim"--but we already have that :-)] when swap is added and we
> > have unswappable pages on the list.  Currently, I don't keep track of
> > the various reasons for the no-reclaim pages, but that could be added.  
> > 
> > Rik Van Riel mentions, on his VM wiki page that a background scan might
> > be useful to age pages actively [clock hand, anyone?], so I might be
> > able to piggyback on that, or even prototype it at some point.   In the
> > meantime, I'm going to add a scan of the noreclaim list manually
> > triggered by a temporary sysctl.
> 
> Yeah, I think the basic slow simple clock would be a reasonable starting
> point. You may end up wanting to introduce some feedback from near-OOM
> condition and/or free swap accounting to speed up the scanning rate.

Yep.   It's all those little details that have prevented me from diving
into this yet.  Still cogitating on that, as a background task.

> 
> I haven't had much look at the patches yet, but I'm glad to see the old
> mlocked patch come to something ;)

Given the issues we've encountered in the field with a large number
[millions] of non-reclaimable pages on the LRU lists, the idea of hiding
nonreclaimable pages from vmscan is appealing.  I'm hoping we can find
some acceptable way of doing this in the long run.

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-27 14:34     ` Lee Schermerhorn
@ 2007-08-27 15:44       ` Christoph Hellwig
  2007-08-27 23:51         ` Nick Piggin
  2007-08-28  0:06       ` Nick Piggin
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2007-08-27 15:44 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Nick Piggin, linux-mm, Rik van Riel

On Mon, Aug 27, 2007 at 10:34:07AM -0400, Lee Schermerhorn wrote:
> Well, keeping the mlock count in the lru pointer more or less defeats
> the purpose of this exercise for me--that is, a unified mechanism for
> tracking "non-reclaimable" pages.  I wanted to maintain the ability to
> use the zone lru_lock and isolate_lru_page() to arbitrate access to
> pages for migration, etc. w/o having to temporarily put the pages back
> on the lru during migration.   

A few years ago I tried to implement a mlocked counter in the page
aswell, and my approach was to create a union to reuse the space occupied
by the lru list pointers for this.  I never really got it stable enough
because people tripped over the lru list randomly far too often.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-27 15:44       ` Christoph Hellwig
@ 2007-08-27 23:51         ` Nick Piggin
  2007-08-28 12:29           ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2007-08-27 23:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Lee Schermerhorn, linux-mm, Rik van Riel

On Mon, Aug 27, 2007 at 04:44:26PM +0100, Christoph Hellwig wrote:
> On Mon, Aug 27, 2007 at 10:34:07AM -0400, Lee Schermerhorn wrote:
> > Well, keeping the mlock count in the lru pointer more or less defeats
> > the purpose of this exercise for me--that is, a unified mechanism for
> > tracking "non-reclaimable" pages.  I wanted to maintain the ability to
> > use the zone lru_lock and isolate_lru_page() to arbitrate access to
> > pages for migration, etc. w/o having to temporarily put the pages back
> > on the lru during migration.   
> 
> A few years ago I tried to implement a mlocked counter in the page
> aswell, and my approach was to create a union to reuse the space occupied
> by the lru list pointers for this.  I never really got it stable enough
> because people tripped over the lru list randomly far too often.

My original mlock patches that Lee is talking about did use your
method. I _believe_ it is basically bug free and worked nicely.

These days we're a bit more consistent and have fewer races with
LRU handling, which is perhaps what made it doable.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-27 23:51         ` Nick Piggin
@ 2007-08-28 12:29           ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2007-08-28 12:29 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, Lee Schermerhorn, linux-mm, Rik van Riel

On Tue, Aug 28, 2007 at 01:51:53AM +0200, Nick Piggin wrote:
> > A few years ago I tried to implement a mlocked counter in the page
> > aswell, and my approach was to create a union to reuse the space occupied
> > by the lru list pointers for this.  I never really got it stable enough
> > because people tripped over the lru list randomly far too often.
> 
> My original mlock patches that Lee is talking about did use your
> method. I _believe_ it is basically bug free and worked nicely.
> 
> These days we're a bit more consistent and have fewer races with
> LRU handling, which is perhaps what made it doable.

If this works that'd be wonderful.  It also means xfs could switch back
to using the block device mapping for it's buffer cache.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-27 14:34     ` Lee Schermerhorn
  2007-08-27 15:44       ` Christoph Hellwig
@ 2007-08-28  0:06       ` Nick Piggin
  2007-08-28 14:52         ` Lee Schermerhorn
  1 sibling, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2007-08-28  0:06 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm, Rik van Riel

On Mon, Aug 27, 2007 at 10:34:07AM -0400, Lee Schermerhorn wrote:
> On Mon, 2007-08-27 at 03:35 +0200, Nick Piggin wrote:
> > On Fri, Aug 24, 2007 at 04:43:38PM -0400, Lee Schermerhorn wrote:
> > > Nick:
> > > 
> > > For your weekend reading pleasure [:-)]
> > > 
> > > I have reworked your "move mlocked pages off LRU" atop my "noreclaim
> > > infrastructure" that keeps non-reclaimable pages [mlocked, swap-backed
> > > but no swap space, excessively long anon_vma list] on a separate
> > > noreclaim LRU list--more or less ignored by vmscan.  To do this, I had
> > > to <mumble>add<mumble>a new<mumble>mlock_count member<mumble>to
> > > the<mumble>page struct.  This brings the size of the page struct to a
> > > nice, round 64 bytes.  The mlock_count member and [most of] the
> > > noreclaim-mlocked-pages work now depends on CONFIG_NORECLAIM_MLOCK which
> > > depends on CONFIG_NORECLAIM.  Currently,  the entire noreclaim
> > > infrastructure is only supported on 64bit archs because I'm using a
> > > higher order bit [~30] for the PG_noreclaim flag.
> > 
> > Can you keep the old system of removing mlocked pages completely, and
> > keeping the mlock count in one of the lru pointers? That should avoid
> > the need to have a new mlock_count, I think, because none of the other
> > noreclaim types should need a refcount?
> 
> Well, keeping the mlock count in the lru pointer more or less defeats
> the purpose of this exercise for me--that is, a unified mechanism for
> tracking "non-reclaimable" pages.  I wanted to maintain the ability to
> use the zone lru_lock and isolate_lru_page() to arbitrate access to
> pages for migration, etc. w/o having to temporarily put the pages back
> on the lru during migration.   
> 
> And, by using another LRU list for non-reclaimable pages, the
> non-reclaimable nature of locked, un-swappable, ... pages becomes
> transparent to much of the rest of VM.  vmscan and try_to_unmap*() still
> have to handle lazy culling of non-reclaimable pages.  If/when you do
> get a chance to look at the patches, you'll see that I separated the
> culling of non-reclaimable pages in the fault path into a separate
> patch.  We could eliminate this overhead in the fault path in favor of
> lazy culling in vmscan.  Vmscan would only have to deal with these pages
> once to move them to the noreclaim list.

I don't have a problem with having a more unified approach, although if
we did that, then I'd prefer just to do it more simply and don't special
case mlocked pages _at all_. Ie. just slowly try to reclaim them and
eventually when everybody unlocks them, you will notice sooner or later.

But once you do the code for mlock refcounting, that's most of the hard
part done so you may as well remove them completely from the LRU, no?
Then they become more or less transparent to the rest of the VM as well.


> > I do approve of bringing struct page to a nice round 64 bytes ;), but I
> > think I would rather we used up those 8 bytes by making count and
> > mapcount 8 bytes each.
> 
> I knew the new page struct member would be controversial, at best, but
> it allows me to prototype and test this approach.  I'd like to find
> somewhere else to put the mlock count, but the page struct it pretty
> tight as it is.  It occurred to me that while anon and other swap-backed
> pages are mlocked, I might be able to use the private field as the mlock
> count.  I don't understand the interaction of vm with file systems to
> know if we could do the same for file-backed pages.  Maybe a separate
> PG_mlock flag would allow one to move the page's private contents to an
> external structure along with the mlock count?  Or maybe just with
> PG_noreclaim, externalize the private info?
 
Could be possible. Tricky though. Probably take less code to use
->lru ;)


> Another approach that I've seen used elsewhere, IFF we can find a
> smaller bit field for the mlock count:  maintain a mlock count in a bit
> field that is too small to contain max possible lock count.  [Probably
> don't need all 64-bits, in any case.]  Clip the count at maximum that
> the field can contain [like SWAP_MAP_MAX] and fail mlock attempts if the
> count won't accommodate the additional lock.  I haven't investigated
> this enough to determine what additional complications it would involve.
> It would probably complicate inheriting locks across fork(), if we ever
> want to do that [I do!].

Well instead of failing further mlocks, you could just have MLOCK_MAX
signal that counting is disabled, and require a full rmap scan in order
to reclaim it. 


> Any thoughts on restricting this to 64-bit archs?

I don't know. I'd have thought efficient mlock handling might be useful
for realtime systems, probably many of which would be 32-bit.

Are you seeing mlock pinning heaps of memory in the field?

 
> > I haven't had much look at the patches yet, but I'm glad to see the old
> > mlocked patch come to something ;)
> 
> Given the issues we've encountered in the field with a large number
> [millions] of non-reclaimable pages on the LRU lists, the idea of hiding
> nonreclaimable pages from vmscan is appealing.  I'm hoping we can find
> some acceptable way of doing this in the long run.

Oh yeah I think that's a good idea, especially for less transient
conditions like mlock and out-of-swap.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-28  0:06       ` Nick Piggin
@ 2007-08-28 14:52         ` Lee Schermerhorn
  2007-08-28 21:54           ` Christoph Lameter
  2007-08-29  4:38           ` Nick Piggin
  0 siblings, 2 replies; 19+ messages in thread
From: Lee Schermerhorn @ 2007-08-28 14:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, Rik van Riel

On Tue, 2007-08-28 at 02:06 +0200, Nick Piggin wrote:
> On Mon, Aug 27, 2007 at 10:34:07AM -0400, Lee Schermerhorn wrote:
> > On Mon, 2007-08-27 at 03:35 +0200, Nick Piggin wrote:
> > > On Fri, Aug 24, 2007 at 04:43:38PM -0400, Lee Schermerhorn wrote:
> > > > Nick:
> > > > 
> > > > For your weekend reading pleasure [:-)]
> > > > 
> > > > I have reworked your "move mlocked pages off LRU" atop my "noreclaim
> > > > infrastructure" that keeps non-reclaimable pages [mlocked, swap-backed
> > > > but no swap space, excessively long anon_vma list] on a separate
> > > > noreclaim LRU list--more or less ignored by vmscan.  To do this, I had
> > > > to <mumble>add<mumble>a new<mumble>mlock_count member<mumble>to
> > > > the<mumble>page struct.  This brings the size of the page struct to a
> > > > nice, round 64 bytes.  The mlock_count member and [most of] the
> > > > noreclaim-mlocked-pages work now depends on CONFIG_NORECLAIM_MLOCK which
> > > > depends on CONFIG_NORECLAIM.  Currently,  the entire noreclaim
> > > > infrastructure is only supported on 64bit archs because I'm using a
> > > > higher order bit [~30] for the PG_noreclaim flag.
> > > 
> > > Can you keep the old system of removing mlocked pages completely, and
> > > keeping the mlock count in one of the lru pointers? That should avoid
> > > the need to have a new mlock_count, I think, because none of the other
> > > noreclaim types should need a refcount?
> > 
> > Well, keeping the mlock count in the lru pointer more or less defeats
> > the purpose of this exercise for me--that is, a unified mechanism for
> > tracking "non-reclaimable" pages.  I wanted to maintain the ability to
> > use the zone lru_lock and isolate_lru_page() to arbitrate access to
> > pages for migration, etc. w/o having to temporarily put the pages back
> > on the lru during migration.   
> > 
> > And, by using another LRU list for non-reclaimable pages, the
> > non-reclaimable nature of locked, un-swappable, ... pages becomes
> > transparent to much of the rest of VM.  vmscan and try_to_unmap*() still
> > have to handle lazy culling of non-reclaimable pages.  If/when you do
> > get a chance to look at the patches, you'll see that I separated the
> > culling of non-reclaimable pages in the fault path into a separate
> > patch.  We could eliminate this overhead in the fault path in favor of
> > lazy culling in vmscan.  Vmscan would only have to deal with these pages
> > once to move them to the noreclaim list.
> 
> I don't have a problem with having a more unified approach, although if
> we did that, then I'd prefer just to do it more simply and don't special
> case mlocked pages _at all_. Ie. just slowly try to reclaim them and
> eventually when everybody unlocks them, you will notice sooner or later.

I didn't think I was special casing mlocked pages.  I wanted to treat
all !page_reclaimable() pages the same--i.e., put them on the noreclaim
list.

> 
> But once you do the code for mlock refcounting, that's most of the hard
> part done so you may as well remove them completely from the LRU, no?
> Then they become more or less transparent to the rest of the VM as well.

Well, no.  Depending on the reason for !reclaimable, the page would go
on the noreclaim list or just be dropped--special handling.  More
importantly [for me], we still have to handle them specially in
migration, dumping them back onto the LRU so that we can arbitrate
access.  If I'm ever successful in getting automatic/lazy page migration
+replication accepted, I don't want that overhead in
auto-migration/replication.

> 
> 
> > > I do approve of bringing struct page to a nice round 64 bytes ;), but I
> > > think I would rather we used up those 8 bytes by making count and
> > > mapcount 8 bytes each.
> > 
> > I knew the new page struct member would be controversial, at best, but
> > it allows me to prototype and test this approach.  I'd like to find
> > somewhere else to put the mlock count, but the page struct it pretty
> > tight as it is.  It occurred to me that while anon and other swap-backed
> > pages are mlocked, I might be able to use the private field as the mlock
> > count.  I don't understand the interaction of vm with file systems to
> > know if we could do the same for file-backed pages.  Maybe a separate
> > PG_mlock flag would allow one to move the page's private contents to an
> > external structure along with the mlock count?  Or maybe just with
> > PG_noreclaim, externalize the private info?
>  
> Could be possible. Tricky though. Probably take less code to use
> ->lru ;)

Oh, certainly less code to use any separate field.  But the lru list
field is the only link we have in the page struct, and a lot of VM
depends on being able to pass around lists of pages.  I'd hate to lose
that for mlocked pages, or to have to dump the lock count and
reestablish it in those cases, like migration, where we need to put the
page on a list.

> 
> 
> > Another approach that I've seen used elsewhere, IFF we can find a
> > smaller bit field for the mlock count:  maintain a mlock count in a bit
> > field that is too small to contain max possible lock count.  [Probably
> > don't need all 64-bits, in any case.]  Clip the count at maximum that
> > the field can contain [like SWAP_MAP_MAX] and fail mlock attempts if the
> > count won't accommodate the additional lock.  I haven't investigated
> > this enough to determine what additional complications it would involve.
> > It would probably complicate inheriting locks across fork(), if we ever
> > want to do that [I do!].
> 
> Well instead of failing further mlocks, you could just have MLOCK_MAX
> signal that counting is disabled, and require a full rmap scan in order
> to reclaim it. 

Yeah.  But, rather than totally disabling the counting, I'd suggest to
go ahead and decrement the count [! < 0, of course] on unmap/unlock.
If it's infrequent, we could just let try_to_unmap*() cull pages in
VM_LOCKED vmas when it's already doing the full rmap scan, and have
shrink_page_list() put it on noreclaim list when try_to_unmap() returns
SWAP_LOCK.

This does mean that we won't be able to cull the mlocked pages early in
shrink_[in]active_list() via !page_reclaimable().  So, we still have to
do the complete rmap scan for page_referenced() [why do we do this?
don't trust mapcount?]  and then again for try_to_unmap().    We'd
probably also want to cull new pages in the fault path, where the vma is
available.  This would reduce the number of mlocked pages encountered on
the LRU lists by vmscan.

If we're willing to live with this [increased rmap scans on mlocked
pages], we might be able to dispense with the mlock count altogether.
Just a single flag [somewhere--doesn't need to be in page flags member]
to indicate mlocked for page_reclaimable().  munmap()/munlock() could
reset the bit and put the page back on the [in]active list.  If some
other vma has it locked, we'll catch it on next attempt to unmap.

> 
> 
> > Any thoughts on restricting this to 64-bit archs?
> 
> I don't know. I'd have thought efficient mlock handling might be useful
> for realtime systems, probably many of which would be 32-bit.

I agree.  I just wonder if those systems have a sufficient number of
pages that they're suffering from the long lru lists with a large
fraction of unreclaimable pages...  If we do want to support keeping
nonreclaimable pages off the [in]active lists for these systems, we'll
need to find a place for the flag[s].

> 
> Are you seeing mlock pinning heaps of memory in the field?

It is a common usage to mlock() large shared memory areas, as well as
entire tasks [MLOCK_CURRENT|MLOCK_FUTURE].  I think it would be even
more frequent if one could inherit MLOCK_FUTURE across fork and exec.
Then one could write/enhance a prefix command, like numactl and taskset,
to enable locking of unmodified applications.  I prototyped this once,
but never updated it to do the mlock accounting [e.g., down in
copy_page_range() during fork()] for your patch.

What we see more of is folks just figuring that they've got sufficient
memory [100s of GB] for their apps and shared memory areas, so they
don't add enough swap to back all of the anon and shmem regions.  Then,
when they get under memory pressure--e.g., the old "backup ate my
pagecache" scenario--the system more or less live-locks in vmscan
shuffling non-reclaimable [unswappable] pages.  A large number of
mlocked pages on the LRU produces the same symptom; as do excessively
long anon_vma lists and huge i_mmap trees--the latter seen with some
large Oracle workloads.

> 
>  
> > > I haven't had much look at the patches yet, but I'm glad to see the old
> > > mlocked patch come to something ;)
> > 
> > Given the issues we've encountered in the field with a large number
> > [millions] of non-reclaimable pages on the LRU lists, the idea of hiding
> > nonreclaimable pages from vmscan is appealing.  I'm hoping we can find
> > some acceptable way of doing this in the long run.
> 
> Oh yeah I think that's a good idea, especially for less transient
> conditions like mlock and out-of-swap.

This is all still a work in progress.  I'll keep it up to date, run
occasional benchmarks to measure effects and track the other page
reclaim activity on the lists and see where it goes...

Later,
Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-28 14:52         ` Lee Schermerhorn
@ 2007-08-28 21:54           ` Christoph Lameter
  2007-08-29 14:40             ` Lee Schermerhorn
  2007-08-29  4:38           ` Nick Piggin
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2007-08-28 21:54 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Nick Piggin, linux-mm, Rik van Riel

On Tue, 28 Aug 2007, Lee Schermerhorn wrote:

> I didn't think I was special casing mlocked pages.  I wanted to treat
> all !page_reclaimable() pages the same--i.e., put them on the noreclaim
> list.

I think that is the right approach. Do not forget that ramfs and other 
ram based filesystems create unmapped unreclaimable pages.

> Well, no.  Depending on the reason for !reclaimable, the page would go
> on the noreclaim list or just be dropped--special handling.  More
> importantly [for me], we still have to handle them specially in
> migration, dumping them back onto the LRU so that we can arbitrate
> access.  If I'm ever successful in getting automatic/lazy page migration
> +replication accepted, I don't want that overhead in
> auto-migration/replication.

Right. I posted a patch a week ago that generalized LRU handling and would 
allow the adding of additional lists as needed by such an approach.


> If we're willing to live with this [increased rmap scans on mlocked
> pages], we might be able to dispense with the mlock count altogether.
> Just a single flag [somewhere--doesn't need to be in page flags member]
> to indicate mlocked for page_reclaimable().  munmap()/munlock() could
> reset the bit and put the page back on the [in]active list.  If some
> other vma has it locked, we'll catch it on next attempt to unmap.

You need a page flag to indicate the fact that the page is on the 
unreclaimable list.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-28 21:54           ` Christoph Lameter
@ 2007-08-29 14:40             ` Lee Schermerhorn
  2007-08-29 17:39               ` Christoph Lameter
  0 siblings, 1 reply; 19+ messages in thread
From: Lee Schermerhorn @ 2007-08-29 14:40 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nick Piggin, linux-mm, Rik van Riel

On Tue, 2007-08-28 at 14:54 -0700, Christoph Lameter wrote:
> On Tue, 28 Aug 2007, Lee Schermerhorn wrote:
> 
> > I didn't think I was special casing mlocked pages.  I wanted to treat
> > all !page_reclaimable() pages the same--i.e., put them on the noreclaim
> > list.
> 
> I think that is the right approach. Do not forget that ramfs and other 
> ram based filesystems create unmapped unreclaimable pages.

They don't go on the LRU lists now, do they?  The primary function of
the noreclaim infrastructure is to hide non-reclaimable pages that would
otherwise go on the [in]active lists from vmscan.  So, if pages used by
the ram base file systems don't go onto the LRU, we probably don't need
to put them on the noreclaim list which is conceptually another LRU
list.

That being said, the lumpy reclaim patch tries to reclaim pages that are
contiguous to other pages being reclaimed when trying to free higher
order pages.  I'll have to check to see if it tries to reclaim pages
that might be used by ram/tmp/... fs.

> 
> > Well, no.  Depending on the reason for !reclaimable, the page would go
> > on the noreclaim list or just be dropped--special handling.  More
> > importantly [for me], we still have to handle them specially in
> > migration, dumping them back onto the LRU so that we can arbitrate
> > access.  If I'm ever successful in getting automatic/lazy page migration
> > +replication accepted, I don't want that overhead in
> > auto-migration/replication.
> 
> Right. I posted a patch a week ago that generalized LRU handling and would 
> allow the adding of additional lists as needed by such an approach.

Which one was that? 

> 
> 
> > If we're willing to live with this [increased rmap scans on mlocked
> > pages], we might be able to dispense with the mlock count altogether.
> > Just a single flag [somewhere--doesn't need to be in page flags member]
> > to indicate mlocked for page_reclaimable().  munmap()/munlock() could
> > reset the bit and put the page back on the [in]active list.  If some
> > other vma has it locked, we'll catch it on next attempt to unmap.
> 
> You need a page flag to indicate the fact that the page is on the 
> unreclaimable list.

Yes, I have that now--PG_noreclaim.  In my prototype, I'm using a high
order bit unavailable to 32-bit archs, because all of the others are
used right now.  This is one of my unresolved issues.  PageNoreclaim()
is like, but mutually exclusive to, PageActive()--it tells us which LRU
list the page is on.

Thanks,
Lee



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-29 14:40             ` Lee Schermerhorn
@ 2007-08-29 17:39               ` Christoph Lameter
  2007-08-30  0:09                 ` Rik van Riel
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2007-08-29 17:39 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Nick Piggin, linux-mm, Rik van Riel

On Wed, 29 Aug 2007, Lee Schermerhorn wrote:

> > I think that is the right approach. Do not forget that ramfs and other 
> > ram based filesystems create unmapped unreclaimable pages.
> 
> They don't go on the LRU lists now, do they?  The primary function of
> the noreclaim infrastructure is to hide non-reclaimable pages that would
> otherwise go on the [in]active lists from vmscan.  So, if pages used by
> the ram base file systems don't go onto the LRU, we probably don't need
> to put them on the noreclaim list which is conceptually another LRU
> list.

They do go into the LRU. When attempts are made to write them out they are 
put back onto the active lists via a strange return code 
AOP_WRITEPAGE_ACTIVATE. So they circle round and round and round...

> > Right. I posted a patch a week ago that generalized LRU handling and would 
> > allow the adding of additional lists as needed by such an approach.
> 
> Which one was that? 

This one

[RECLAIM] Use an indexed array for active/inactive variables

Currently we are defining explicit variables for the inactive and active
list. An indexed array can be more generic and avoid repeating similar
code in several places in the reclaim code.

We are saving a few bytes in terms of code size:

Before:

   text    data     bss     dec     hex filename
4097753  573120 4092484 8763357  85b7dd vmlinux

After:

   text    data     bss     dec     hex filename
4097729  573120 4092484 8763333  85b7c5 vmlinux

Having an easy way to add new lru lists may ease future work on the reclaim
code.

---
 include/linux/mm_inline.h |   34 +++++++----
 include/linux/mmzone.h    |   13 +++-
 mm/page_alloc.c           |    9 +--
 mm/swap.c                 |    2 
 mm/vmscan.c               |  132 ++++++++++++++++++++++------------------------
 mm/vmstat.c               |    3 -
 6 files changed, 104 insertions(+), 89 deletions(-)

Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2007-08-20 20:43:35.000000000 -0700
+++ linux-2.6/include/linux/mmzone.h	2007-08-20 21:39:48.000000000 -0700
@@ -82,6 +82,13 @@ enum zone_stat_item {
 #endif
 	NR_VM_ZONE_STAT_ITEMS };
 
+enum lru_list {
+	LRU_INACTIVE,
+	LRU_ACTIVE,
+	NR_LRU_LISTS };
+
+#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
+
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
@@ -227,10 +234,8 @@ struct zone {
 
 	/* Fields commonly accessed by the page reclaim scanner */
 	spinlock_t		lru_lock;	
-	struct list_head	active_list;
-	struct list_head	inactive_list;
-	unsigned long		nr_scan_active;
-	unsigned long		nr_scan_inactive;
+	struct list_head	list[NR_LRU_LISTS];
+	unsigned long		nr_scan[NR_LRU_LISTS];
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */
 
Index: linux-2.6/include/linux/mm_inline.h
===================================================================
--- linux-2.6.orig/include/linux/mm_inline.h	2007-08-20 20:43:35.000000000 -0700
+++ linux-2.6/include/linux/mm_inline.h	2007-08-20 21:39:48.000000000 -0700
@@ -1,40 +1,50 @@
 static inline void
-add_page_to_active_list(struct zone *zone, struct page *page)
+add_page_to_list(struct zone *zone, struct page *page, enum lru_list l)
 {
-	list_add(&page->lru, &zone->active_list);
-	__inc_zone_state(zone, NR_ACTIVE);
+	list_add(&page->lru, &zone->list[l]);
+	__inc_zone_state(zone, NR_INACTIVE + l);
+}
+
+static inline void
+add_page_to_active_list(struct zone *zone, struct page *page) {
+	add_page_to_list(zone, page, LRU_ACTIVE);
 }
 
 static inline void
 add_page_to_inactive_list(struct zone *zone, struct page *page)
 {
-	list_add(&page->lru, &zone->inactive_list);
-	__inc_zone_state(zone, NR_INACTIVE);
+	add_page_to_list(zone, page, LRU_INACTIVE);
 }
 
 static inline void
-del_page_from_active_list(struct zone *zone, struct page *page)
+del_page_from_list(struct zone *zone, struct page *page, enum lru_list l)
 {
 	list_del(&page->lru);
-	__dec_zone_state(zone, NR_ACTIVE);
+	__dec_zone_state(zone, NR_INACTIVE + l);
+}
+
+static inline void
+del_page_from_active_list(struct zone *zone, struct page *page)
+{
+	del_page_from_list(zone, page, LRU_ACTIVE);
 }
 
 static inline void
 del_page_from_inactive_list(struct zone *zone, struct page *page)
 {
-	list_del(&page->lru);
-	__dec_zone_state(zone, NR_INACTIVE);
+	del_page_from_list(zone, page, LRU_INACTIVE);
 }
 
 static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
+	enum lru_list l = LRU_INACTIVE;
+
 	list_del(&page->lru);
 	if (PageActive(page)) {
 		__ClearPageActive(page);
-		__dec_zone_state(zone, NR_ACTIVE);
-	} else {
-		__dec_zone_state(zone, NR_INACTIVE);
+		l = LRU_ACTIVE;
 	}
+	__dec_zone_state(zone, NR_INACTIVE + l);
 }
 
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2007-08-20 20:43:34.000000000 -0700
+++ linux-2.6/mm/page_alloc.c	2007-08-20 21:39:48.000000000 -0700
@@ -2908,6 +2908,7 @@ static void __meminit free_area_init_cor
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, memmap_pages;
+		enum lru_list l;
 
 		size = zone_spanned_pages_in_node(nid, j, zones_size);
 		realsize = size - zone_absent_pages_in_node(nid, j,
@@ -2957,10 +2958,10 @@ static void __meminit free_area_init_cor
 		zone->prev_priority = DEF_PRIORITY;
 
 		zone_pcp_init(zone);
-		INIT_LIST_HEAD(&zone->active_list);
-		INIT_LIST_HEAD(&zone->inactive_list);
-		zone->nr_scan_active = 0;
-		zone->nr_scan_inactive = 0;
+		for_each_lru(l) {
+			INIT_LIST_HEAD(&zone->list[l]);
+			zone->nr_scan[l] = 0;
+		}
 		zap_zone_vm_stats(zone);
 		atomic_set(&zone->reclaim_in_progress, 0);
 		if (!size)
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c	2007-08-20 20:43:34.000000000 -0700
+++ linux-2.6/mm/swap.c	2007-08-20 21:39:48.000000000 -0700
@@ -125,7 +125,7 @@ int rotate_reclaimable_page(struct page 
 	zone = page_zone(page);
 	spin_lock_irqsave(&zone->lru_lock, flags);
 	if (PageLRU(page) && !PageActive(page)) {
-		list_move_tail(&page->lru, &zone->inactive_list);
+		list_move_tail(&page->lru, &zone->list[LRU_INACTIVE]);
 		__count_vm_event(PGROTATED);
 	}
 	if (!test_clear_page_writeback(page))
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-08-20 20:43:35.000000000 -0700
+++ linux-2.6/mm/vmscan.c	2007-08-20 21:40:12.000000000 -0700
@@ -772,7 +772,7 @@ static unsigned long shrink_inactive_lis
 		unsigned long nr_active;
 
 		nr_taken = isolate_lru_pages(sc->swap_cluster_max,
-			     &zone->inactive_list,
+			     &zone->list[LRU_INACTIVE],
 			     &page_list, &nr_scan, sc->order,
 			     (sc->order > PAGE_ALLOC_COSTLY_ORDER)?
 					     ISOLATE_BOTH : ISOLATE_INACTIVE);
@@ -807,10 +807,7 @@ static unsigned long shrink_inactive_lis
 			VM_BUG_ON(PageLRU(page));
 			SetPageLRU(page);
 			list_del(&page->lru);
-			if (PageActive(page))
-				add_page_to_active_list(zone, page);
-			else
-				add_page_to_inactive_list(zone, page);
+			add_page_to_list(zone, page, PageActive(page));
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
 				__pagevec_release(&pvec);
@@ -869,11 +866,14 @@ static void shrink_active_list(unsigned 
 	int pgdeactivate = 0;
 	unsigned long pgscanned;
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
-	LIST_HEAD(l_inactive);	/* Pages to go onto the inactive_list */
-	LIST_HEAD(l_active);	/* Pages to go onto the active_list */
+	struct list_head list[NR_LRU_LISTS];
 	struct page *page;
 	struct pagevec pvec;
 	int reclaim_mapped = 0;
+	enum lru_list l;
+
+	for_each_lru(l)
+		INIT_LIST_HEAD(&list[l]);
 
 	if (sc->may_swap) {
 		long mapped_ratio;
@@ -924,7 +924,7 @@ force_reclaim_mapped:
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
+	pgmoved = isolate_lru_pages(nr_pages, &zone->list[LRU_ACTIVE],
 			    &l_hold, &pgscanned, sc->order, ISOLATE_ACTIVE);
 	zone->pages_scanned += pgscanned;
 	__mod_zone_page_state(zone, NR_ACTIVE, -pgmoved);
@@ -938,25 +938,25 @@ force_reclaim_mapped:
 			if (!reclaim_mapped ||
 			    (total_swap_pages == 0 && PageAnon(page)) ||
 			    page_referenced(page, 0)) {
-				list_add(&page->lru, &l_active);
+				list_add(&page->lru, &list[LRU_ACTIVE]);
 				continue;
 			}
 		}
-		list_add(&page->lru, &l_inactive);
+		list_add(&page->lru, &list[LRU_INACTIVE]);
 	}
 
 	pagevec_init(&pvec, 1);
 	pgmoved = 0;
 	spin_lock_irq(&zone->lru_lock);
-	while (!list_empty(&l_inactive)) {
-		page = lru_to_page(&l_inactive);
-		prefetchw_prev_lru_page(page, &l_inactive, flags);
+	while (!list_empty(&list[LRU_INACTIVE])) {
+		page = lru_to_page(&list[LRU_INACTIVE]);
+		prefetchw_prev_lru_page(page, &list[LRU_INACTIVE], flags);
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
 		ClearPageActive(page);
 
-		list_move(&page->lru, &zone->inactive_list);
+		list_move(&page->lru, &zone->list[LRU_INACTIVE]);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
@@ -978,13 +978,13 @@ force_reclaim_mapped:
 	}
 
 	pgmoved = 0;
-	while (!list_empty(&l_active)) {
-		page = lru_to_page(&l_active);
-		prefetchw_prev_lru_page(page, &l_active, flags);
+	while (!list_empty(&list[LRU_ACTIVE])) {
+		page = lru_to_page(&list[LRU_ACTIVE]);
+		prefetchw_prev_lru_page(page, &list[LRU_ACTIVE], flags);
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
-		list_move(&page->lru, &zone->active_list);
+		list_move(&page->lru, &zone->list[LRU_ACTIVE]);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
@@ -1003,16 +1003,26 @@ force_reclaim_mapped:
 	pagevec_release(&pvec);
 }
 
+static unsigned long shrink_list(enum lru_list l, unsigned long nr_to_scan,
+	struct zone *zone, struct scan_control *sc, int priority)
+{
+	if (l == LRU_ACTIVE) {
+		shrink_active_list(nr_to_scan, zone, sc, priority);
+		return 0;
+	}
+	return shrink_inactive_list(nr_to_scan, zone, sc);
+}
+
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
 static unsigned long shrink_zone(int priority, struct zone *zone,
 				struct scan_control *sc)
 {
-	unsigned long nr_active;
-	unsigned long nr_inactive;
+	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
 	unsigned long nr_reclaimed = 0;
+	enum lru_list l;
 
 	atomic_inc(&zone->reclaim_in_progress);
 
@@ -1020,36 +1030,26 @@ static unsigned long shrink_zone(int pri
 	 * Add one to `nr_to_scan' just to make sure that the kernel will
 	 * slowly sift through the active list.
 	 */
-	zone->nr_scan_active +=
-		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
-	nr_active = zone->nr_scan_active;
-	if (nr_active >= sc->swap_cluster_max)
-		zone->nr_scan_active = 0;
-	else
-		nr_active = 0;
-
-	zone->nr_scan_inactive +=
-		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
-	nr_inactive = zone->nr_scan_inactive;
-	if (nr_inactive >= sc->swap_cluster_max)
-		zone->nr_scan_inactive = 0;
-	else
-		nr_inactive = 0;
-
-	while (nr_active || nr_inactive) {
-		if (nr_active) {
-			nr_to_scan = min(nr_active,
-					(unsigned long)sc->swap_cluster_max);
-			nr_active -= nr_to_scan;
-			shrink_active_list(nr_to_scan, zone, sc, priority);
-		}
+	for_each_lru(l) {
+		zone->nr_scan[l] += (zone_page_state(zone, NR_INACTIVE + l)
+							>> priority) + 1;
+		nr[l] = zone->nr_scan[l];
+		if (nr[l] >= sc->swap_cluster_max)
+			zone->nr_scan[l] = 0;
+		else
+			nr[l] = 0;
+	}
 
-		if (nr_inactive) {
-			nr_to_scan = min(nr_inactive,
+	while (nr[LRU_ACTIVE] || nr[LRU_INACTIVE]) {
+		for_each_lru(l) {
+			if (nr[l]) {
+				nr_to_scan = min(nr[l],
 					(unsigned long)sc->swap_cluster_max);
-			nr_inactive -= nr_to_scan;
-			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
-								sc);
+				nr[l] -= nr_to_scan;
+
+				nr_reclaimed += shrink_list(l, nr_to_scan,
+							zone, sc, priority);
+			}
 		}
 	}
 
@@ -1489,6 +1489,7 @@ static unsigned long shrink_all_zones(un
 {
 	struct zone *zone;
 	unsigned long nr_to_scan, ret = 0;
+	enum lru_list l;
 
 	for_each_zone(zone) {
 
@@ -1498,28 +1499,25 @@ static unsigned long shrink_all_zones(un
 		if (zone->all_unreclaimable && prio != DEF_PRIORITY)
 			continue;
 
-		/* For pass = 0 we don't shrink the active list */
-		if (pass > 0) {
-			zone->nr_scan_active +=
-				(zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
-			if (zone->nr_scan_active >= nr_pages || pass > 3) {
-				zone->nr_scan_active = 0;
+		for_each_lru(l) {
+			/* For pass = 0 we don't shrink the active list */
+			if (pass == 0 && l == LRU_ACTIVE)
+				continue;
+
+			zone->nr_scan[l] +=
+				(zone_page_state(zone, NR_INACTIVE + l)
+								>> prio) + 1;
+			if (zone->nr_scan[l] >= nr_pages || pass > 3) {
+				zone->nr_scan[l] = 0;
 				nr_to_scan = min(nr_pages,
-					zone_page_state(zone, NR_ACTIVE));
-				shrink_active_list(nr_to_scan, zone, sc, prio);
+					zone_page_state(zone,
+							NR_INACTIVE + l));
+				ret += shrink_list(l, nr_to_scan, zone,
+								sc, prio);
+				if (ret >= nr_pages)
+					return ret;
 			}
 		}
-
-		zone->nr_scan_inactive +=
-			(zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
-		if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
-			zone->nr_scan_inactive = 0;
-			nr_to_scan = min(nr_pages,
-				zone_page_state(zone, NR_INACTIVE));
-			ret += shrink_inactive_list(nr_to_scan, zone, sc);
-			if (ret >= nr_pages)
-				return ret;
-		}
 	}
 
 	return ret;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2007-08-20 20:43:35.000000000 -0700
+++ linux-2.6/mm/vmstat.c	2007-08-20 21:39:48.000000000 -0700
@@ -563,7 +563,8 @@ static int zoneinfo_show(struct seq_file
 			   zone->pages_low,
 			   zone->pages_high,
 			   zone->pages_scanned,
-			   zone->nr_scan_active, zone->nr_scan_inactive,
+			   zone->nr_scan[LRU_ACTIVE],
+			   zone->nr_scan[LRU_INACTIVE],
 			   zone->spanned_pages,
 			   zone->present_pages);
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-29 17:39               ` Christoph Lameter
@ 2007-08-30  0:09                 ` Rik van Riel
  2007-08-30 14:49                   ` Lee Schermerhorn
  0 siblings, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2007-08-30  0:09 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Lee Schermerhorn, Nick Piggin, linux-mm

Christoph Lameter wrote:
> On Wed, 29 Aug 2007, Lee Schermerhorn wrote:
> 
>>> I think that is the right approach. Do not forget that ramfs and other 
>>> ram based filesystems create unmapped unreclaimable pages.
>> They don't go on the LRU lists now, do they?  The primary function of
>> the noreclaim infrastructure is to hide non-reclaimable pages that would
>> otherwise go on the [in]active lists from vmscan.  So, if pages used by
>> the ram base file systems don't go onto the LRU, we probably don't need
>> to put them on the noreclaim list which is conceptually another LRU
>> list.
> 
> They do go into the LRU. When attempts are made to write them out they are 
> put back onto the active lists via a strange return code 
> AOP_WRITEPAGE_ACTIVATE. So they circle round and round and round...
> 
>>> Right. I posted a patch a week ago that generalized LRU handling and would 
>>> allow the adding of additional lists as needed by such an approach.
>> Which one was that? 
> 
> This one
> 
> [RECLAIM] Use an indexed array for active/inactive variables
> 
> Currently we are defining explicit variables for the inactive and active
> list. An indexed array can be more generic and avoid repeating similar
> code in several places in the reclaim code.

I like it.  This will make the code that has separate lists
for anonymous (and other swap backed) pages a lot nicer.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-30  0:09                 ` Rik van Riel
@ 2007-08-30 14:49                   ` Lee Schermerhorn
  0 siblings, 0 replies; 19+ messages in thread
From: Lee Schermerhorn @ 2007-08-30 14:49 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Christoph Lameter, Nick Piggin, linux-mm

On Wed, 2007-08-29 at 20:09 -0400, Rik van Riel wrote:
> Christoph Lameter wrote:
> > On Wed, 29 Aug 2007, Lee Schermerhorn wrote:
> > 
> >>> I think that is the right approach. Do not forget that ramfs and other 
> >>> ram based filesystems create unmapped unreclaimable pages.
> >> They don't go on the LRU lists now, do they?  The primary function of
> >> the noreclaim infrastructure is to hide non-reclaimable pages that would
> >> otherwise go on the [in]active lists from vmscan.  So, if pages used by
> >> the ram base file systems don't go onto the LRU, we probably don't need
> >> to put them on the noreclaim list which is conceptually another LRU
> >> list.
> > 
> > They do go into the LRU. When attempts are made to write them out they are 
> > put back onto the active lists via a strange return code 
> > AOP_WRITEPAGE_ACTIVATE. So they circle round and round and round...
> > 
> >>> Right. I posted a patch a week ago that generalized LRU handling and would 
> >>> allow the adding of additional lists as needed by such an approach.
> >> Which one was that? 
> > 
> > This one
> > 
> > [RECLAIM] Use an indexed array for active/inactive variables
> > 
> > Currently we are defining explicit variables for the inactive and active
> > list. An indexed array can be more generic and avoid repeating similar
> > code in several places in the reclaim code.
> 
> I like it.  This will make the code that has separate lists
> for anonymous (and other swap backed) pages a lot nicer.

Ditto.

I'll incorporate it into the noreclaim set and into the copy of Rik's
split lru patch that I'm maintaining.  Should make it easier to merge
the two sets.

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-28 14:52         ` Lee Schermerhorn
  2007-08-28 21:54           ` Christoph Lameter
@ 2007-08-29  4:38           ` Nick Piggin
  2007-08-30 16:34             ` Lee Schermerhorn
  1 sibling, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2007-08-29  4:38 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm, Rik van Riel

On Tue, Aug 28, 2007 at 10:52:46AM -0400, Lee Schermerhorn wrote:
> On Tue, 2007-08-28 at 02:06 +0200, Nick Piggin wrote:
> > 
> > I don't have a problem with having a more unified approach, although if
> > we did that, then I'd prefer just to do it more simply and don't special
> > case mlocked pages _at all_. Ie. just slowly try to reclaim them and
> > eventually when everybody unlocks them, you will notice sooner or later.
> 
> I didn't think I was special casing mlocked pages.  I wanted to treat
> all !page_reclaimable() pages the same--i.e., put them on the noreclaim
> list.

But you are keeping track of the mlock count? Why not simply call
try_to_unmap and see if they are still mlocked?


> > But once you do the code for mlock refcounting, that's most of the hard
> > part done so you may as well remove them completely from the LRU, no?
> > Then they become more or less transparent to the rest of the VM as well.
> 
> Well, no.  Depending on the reason for !reclaimable, the page would go
> on the noreclaim list or just be dropped--special handling.  More
> importantly [for me], we still have to handle them specially in
> migration, dumping them back onto the LRU so that we can arbitrate
> access.  If I'm ever successful in getting automatic/lazy page migration
> +replication accepted, I don't want that overhead in
> auto-migration/replication.

Oh OK. I don't know if there should be a whole lot of overhead involved
with that, though. I can't remember exactly what the problems were here
with my mlock patch, but I think it could have been made more optimal.


> > Could be possible. Tricky though. Probably take less code to use
> > ->lru ;)
> 
> Oh, certainly less code to use any separate field.  But the lru list
> field is the only link we have in the page struct, and a lot of VM
> depends on being able to pass around lists of pages.  I'd hate to lose
> that for mlocked pages, or to have to dump the lock count and
> reestablish it in those cases, like migration, where we need to put the
> page on a list.

Hmm, yes. Migration could possibly use a single linked list.
But I'm only saying it _could_ be possible to do mlocked accounting
efficiently with one of the LRU pointers -- I would prefer the idea
of just using a single bit for example, if that is sufficient. It
should cut down on code.


> > I don't know. I'd have thought efficient mlock handling might be useful
> > for realtime systems, probably many of which would be 32-bit.
> 
> I agree.  I just wonder if those systems have a sufficient number of
> pages that they're suffering from the long lru lists with a large
> fraction of unreclaimable pages...  If we do want to support keeping
> nonreclaimable pages off the [in]active lists for these systems, we'll
> need to find a place for the flag[s].

That's true, they will have a lot less pages (and probably won't
be using highmem).


> > Are you seeing mlock pinning heaps of memory in the field?
> 
> It is a common usage to mlock() large shared memory areas, as well as
> entire tasks [MLOCK_CURRENT|MLOCK_FUTURE].  I think it would be even
> more frequent if one could inherit MLOCK_FUTURE across fork and exec.
> Then one could write/enhance a prefix command, like numactl and taskset,
> to enable locking of unmodified applications.  I prototyped this once,
> but never updated it to do the mlock accounting [e.g., down in
> copy_page_range() during fork()] for your patch.
> 
> What we see more of is folks just figuring that they've got sufficient
> memory [100s of GB] for their apps and shared memory areas, so they
> don't add enough swap to back all of the anon and shmem regions.  Then,
> when they get under memory pressure--e.g., the old "backup ate my
> pagecache" scenario--the system more or less live-locks in vmscan
> shuffling non-reclaimable [unswappable] pages.  A large number of
> mlocked pages on the LRU produces the same symptom; as do excessively
> long anon_vma lists and huge i_mmap trees--the latter seen with some
> large Oracle workloads.

OK, thanks for the background.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RFC:  Noreclaim with "Keep Mlocked Pages off the LRU"
  2007-08-29  4:38           ` Nick Piggin
@ 2007-08-30 16:34             ` Lee Schermerhorn
  0 siblings, 0 replies; 19+ messages in thread
From: Lee Schermerhorn @ 2007-08-30 16:34 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, Rik van Riel, Christoph Hellwig

On Wed, 2007-08-29 at 06:38 +0200, Nick Piggin wrote:
> On Tue, Aug 28, 2007 at 10:52:46AM -0400, Lee Schermerhorn wrote:
> > On Tue, 2007-08-28 at 02:06 +0200, Nick Piggin wrote:
> > > 
> > > I don't have a problem with having a more unified approach, although if
> > > we did that, then I'd prefer just to do it more simply and don't special
> > > case mlocked pages _at all_. Ie. just slowly try to reclaim them and
> > > eventually when everybody unlocks them, you will notice sooner or later.
> > 
> > I didn't think I was special casing mlocked pages.  I wanted to treat
> > all !page_reclaimable() pages the same--i.e., put them on the noreclaim
> > list.
> 
> But you are keeping track of the mlock count? Why not simply call
> try_to_unmap and see if they are still mlocked?

We may be talking past each other here.  So, let me try this:

We're trying to hide nonreclaimable, including mlock'ed, pages from
vmscan to the extent possible--to make reclaim as efficient as possible.
Sometimes, to avoid races [as in your comment in __mlock_pages_range()
regarding anonymous pages], we may end up putting mlock'ed pages on the
normal lru list.  That's OK.  We can cull them in shrink_*_list().  Now,
if we have an mlock lock count in a dedicated field or a page flag
indicating mlock'ed state [perhaps with a count in an overloaded field],
we can easily cull the mlock'ed pages w/o access to any vma so that it
never gets to shrink_page_list() where try_to_unmap() would be called.

IMO, try_to_unmap() is/can be a fairly heavy hammer, walking the entire
rmap, as it does.  And, we only get to try_to_unmap() after already
walking the entire rmap in page_referenced() [hmmm, maybe cull mlock'ed
pages in page_referenced()--before even checking page table for ref?].
So, I'd like to cull them early by just looking at the page.  If a page
occasionally makes it through, like only the first time for anon pages?,
we only take the hit once.

Now you may be thinking that, in general, reverse maps are not all that
large.  But, I've seen live locks on the i_mmap_lock with heavy Oracle
loads [I think I already mentioned this].  On large servers, we can see
hundreds or thousands of tasks mapping the data base executables,
libraries and shared memory areas--just the types of regions one might
want to mlock.  Further, the shared memory areas can get quite
large--10s, 100s, even 1000s of GB.  That's a lot of pages to be running
through page_referenced/try_to_unmap too often.  

> 
> 
> > > But once you do the code for mlock refcounting, that's most of the hard
> > > part done so you may as well remove them completely from the LRU, no?
> > > Then they become more or less transparent to the rest of the VM as well.
> > 
> > Well, no.  Depending on the reason for !reclaimable, the page would go
> > on the noreclaim list or just be dropped--special handling.  More
> > importantly [for me], we still have to handle them specially in
> > migration, dumping them back onto the LRU so that we can arbitrate
> > access.  If I'm ever successful in getting automatic/lazy page migration
> > +replication accepted, I don't want that overhead in
> > auto-migration/replication.
> 
> Oh OK. I don't know if there should be a whole lot of overhead involved
> with that, though. I can't remember exactly what the problems were here
> with my mlock patch, but I think it could have been made more optimal.

The basic issue was that one can't migrate pages [nor unmap them for
lazy migration/replication] if check_range() can't find them on and
successfully isolate them from the lru.  In a respin of the patch, you
dumped the pages back on to the LRU so that they could be migrated.
Then, later, they'll need to be lazily culled back off the lru.  Could
be a lot of pages for some regions.   With the noreclaim lru list, this
isn't necessary.  It works just like the [in]active lists from
migration's perspective.

I guess the overhead depends on the size of the regions being migrated.
It occurs to me that we probably need a way to exempt some regions--like
huge shared memory areas--from auto-migration/replication.

> 
> 
> > > Could be possible. Tricky though. Probably take less code to use
> > > ->lru ;)

> > 
> > Oh, certainly less code to use any separate field.  But the lru list
> > field is the only link we have in the page struct, and a lot of VM
> > depends on being able to pass around lists of pages.  I'd hate to lose
> > that for mlocked pages, or to have to dump the lock count and
> > reestablish it in those cases, like migration, where we need to put the
> > page on a list.
> 
> Hmm, yes. Migration could possibly use a single linked list.
> But I'm only saying it _could_ be possible to do mlocked accounting
> efficiently with one of the LRU pointers -- 

I agree, that we don't want to keep the pages on an lru list or want to
use some other list type for migration and such, the accounting in one
of the lru pointers is no[t much] more overhead, timewise, than a
dedicated field.  The dedicated file increases space overhead, tho'.

> I would prefer the idea
> of just using a single bit for example, if that is sufficient. It
> should cut down on code.

I've been thinking about how to eliminate the mlock count entirely and
just use a single page flag and "lazy culling"--i.e., try to unmap.
But, one scenario I want to avoid is where tasks come and go, attaching
to a shared memory area/executable with an mlock'ed vma.  When they
detach, without a count, we'd just drop the mlock flag, moving the pages
back to the normal lru lists, and let vmscan cull them if some vma still
have them mlock'ed.  Again, I'd like to avoid the flood of pages between
normal lru and noreclaim lists in my model.

Perhaps the "flood" can be eliminated for shared memory areas--likely to
be the largest source of mlock'ed pages--by not unlocking pages in shmem
areas that have the VM_LOCKED flag set in the shmem_inode_info flags
field [SHM_LOCKED regions].  I don't see any current interaction of that
flag with the vm_flags when attaching to a SHM_LOCKED region.  Such
interaction is not required to prevent swap out--that's handled in
shmem_writepage.  But, to keep those pages off the LRU, we probably need
to consult the shmem_inode_info flags in the modified mlock code.  Maybe
pull the flag into the vm_flags on attach?  This way, try_to_unmap()
will see it w/o having to consult vm_file->...  I'm looking into this.

Later,
Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2007-08-30 16:34 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-08-23  4:11 vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Nick Piggin
2007-08-23  7:15 ` vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Andrew Morton
2007-08-23  9:07   ` vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru Nick Piggin
2007-08-23 11:48     ` vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-aroun d-the-lru Andrea Arcangeli
2007-08-24 20:43 ` RFC: Noreclaim with "Keep Mlocked Pages off the LRU" Lee Schermerhorn
2007-08-27  1:35   ` Nick Piggin
2007-08-27 14:34     ` Lee Schermerhorn
2007-08-27 15:44       ` Christoph Hellwig
2007-08-27 23:51         ` Nick Piggin
2007-08-28 12:29           ` Christoph Hellwig
2007-08-28  0:06       ` Nick Piggin
2007-08-28 14:52         ` Lee Schermerhorn
2007-08-28 21:54           ` Christoph Lameter
2007-08-29 14:40             ` Lee Schermerhorn
2007-08-29 17:39               ` Christoph Lameter
2007-08-30  0:09                 ` Rik van Riel
2007-08-30 14:49                   ` Lee Schermerhorn
2007-08-29  4:38           ` Nick Piggin
2007-08-30 16:34             ` Lee Schermerhorn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox