linux-mm.kvack.org archive mirror
* [PATCH/RFC] Migrate-on-fault prototype 0/5 V0.1 - Overview
@ 2006-03-09 18:28 Lee Schermerhorn
  2006-03-09 19:12 ` Christoph Lameter
  0 siblings, 1 reply; 6+ messages in thread
From: Lee Schermerhorn @ 2006-03-09 18:28 UTC (permalink / raw)
  To: linux-mm; +Cc: Christoph Lameter

For your entertainment:

Migrate-on-fault prototype 0/5 V0.1 - Overview

This series of patches, against 2.6.16-rc5-git11, implements page
migration in the fault path.  Based on discussions with Christoph
Lameter, this seems like the next logical step in page migration.

The basic idea is that when a fault handler [do_swap_page,
filemap_nopage, ...] finds a cached page with zero mappings that is
otherwise "stable"--i.e., no writebacks--this is a good opportunity
to check whether the page resides on the node indicated by the policy
in the current context.

We only want to check if there are zero mappings because 1) we can
easily migrate the page--we don't have to go through the effort of
removing all mappings--and 2) default policy--a common case--can give
different answers for different tasks running on different nodes.
Checking the policy only when there are zero mappings effectively
implements a "first touch" placement policy.

Note that this mechanism can be used to migrate page cache pages that 
were read in earlier, are no longer referenced, but are about to be
used by a new task on a node other than the one where the page
resides.  The same mechanism can be used to pull anon pages along with
a task when the load balancer decides to move the task to another
node.  However, that will require a bit more mechanism, and is the
subject of another patch series.

The current [2.6.16-rc5+] direct migration patches support most of the
mechanism that is required to implement this "migration on fault".  
Some of the necessary operations are combined in functions with other
code that isn't required [must not be executed] in the fault path,
so these have been separated out in a couple of cases.

Then we need to add the function[s] to test the current page in the
fault path for zero mappings, no writebacks, and misplacement; and the
function[s] to actually migrate the page contents to a newly
allocated page using the [modified] migratepage address space
operations of the direct migration mechanism.

The Patches:

The patches are broken out in the order I implemented them. Each
should build and boot on its own.  [at least they did at one time!]

migrate-on-fault-01-separate-unmap-replace.patch

Separates the mm/vmscan.c:migrate_page_remove_references()
function into its 2 distinct operations:  removing references
[try_to_unmap()], and replacing the old page in the radix 
tree of the page's "mapping".  Only the second part is 
needed in the fault path, as the page is already completely
unmapped.

A wrapper function that calls both operations is provided,
and the 2 places that call migrate_page_remove_references()
have been modified to call that wrapper.

migrate-on-fault-02-mpol_misplaced.patch

This patch implements the function mpol_misplaced() in
mm/mempolicy.c to check whether a page resides on the
node indicated by the vma and address arguments.  If
so, it returns 0 [!misplaced].  If not, it returns an
indication of whether or not the policy was interleaved
[for proper accounting of the later allocation] and passes the
node indicated by the policy through a pointer argument.

Because this will be called in the fault path, I don't 
want to go through the effort of actually allocating a
page--e.g., via alloc_page_vma()--only to find that the
current page is on the correct node.  However, I wanted
to come to the same answer that alloc_page_vma() would.
So, mpol_misplaced() mimics the node computation logic
of alloc_page_vma().

migrate-on-fault-03-migrate_misplaced_page.patch

This patch contains the main migrate on fault functions:

check_migrate_misplaced_page() is implemented as a static
inline function in mempolicy.h when MIGRATION is configured.
If the page has zero mappings, is stable and misplaced,
check_*() will call migrate_misplaced_page() in vmscan.c
to do the dirty work.  If for any reason the page can't
or shouldn't be migrated, these functions will return the
old page in the state it was found.

Note that when a page is NOT found in the cache, and the fault
handler has to allocate one and read it in, it will have zero
mappings, so check_migrate_misplaced_page() WILL call
mpol_misplaced() to see if it needs migration.  Of course, it
should have been allocated on the correct node, so no migration
should be necessary.  However, it's possible that the node 
indicated by the policy has no free pages so the newly 
allocated page may be on a different node.  In this case, I
guess check_migrate_misplaced_page() will attempt to migrate
it.  In either case, the "unnecessary" calls to mpol_misplaced()
and to migrate_misplaced_page(), if the original allocation
"overflowed", occur after an IO, so this is the slow path
anyway.  

When MIGRATION is NOT configured, check_migrate_misplaced_page()
becomes a macro that evaluates to its argument page.

More details with the patch.

migrate-on-fault-04.1-misplaced-anon-pages.patch

This is a simple one-liner [OK, 2, counting an empty line]
to call check_migrate_misplaced_page() from do_swap_page()
in memory.c.  

Patches to hook other fault paths [filemap_nopage(), etc.] 
are TBD, based on feedback to this series.  [Oh, I'll 
probably do them anyway, to measure the effects.]

migrate-on-fault-05-mbind-lazy-migrate.patch

This patch adds an MPOL_MF_LAZY [maybe should be '_DEFERRED'?]
flag to modify the behavior of MPOL_MF_MOVE[_ALL].  When
the 'LAZY flag is specified, mbind() simply unmaps eligible
pages in the specified range, moving anon pages to the
swap cache, if not already there.  Then, when the task
touches the pages, or queries their location via 
get_mempolicy(..., MPOL_F_NODE|MPOL_F_ADDR), it will take a
fault, find the page in the cache, and migrate it, if the
policy so indicates.  Actually, this will only happen for
anon pages, until additional fault paths are hooked up.

This patch allows me to test the migrate on fault mechanism
by forcing pages to be unmapped.


Testing:

I have tested migrate-on-fault of anon pages using the MPOL_MF_LAZY 
extension to mbind() discussed in patch 5 above on 2.6.16-rc5-git11.
I have an ad hoc [odd hack?] test program, called memtoy, available at:

http://free.linux.hp.com/~lts/Tools/memtoy-latest.tar.gz

The Xpm-tests subdirectory in the tarball contains memtoy test
scripts for "manual page migration"--i.e., the migrate_pages()
syscall, "direct migration" using mbind(MPOL_MF_MOVE) and
migrate-on-fault using mbind(MPOL_MF_MOVE+MPOL_MF_LAZY).

---
Why are these patches NOT against the -mm tree?

I've been using some trace instrumentation that relies on relayfs.
I haven't been motivated to port it to the sysfs relay channels yet.
Soon come...

If you're interested in seeing an annotated trace log of direct
migration and migrate-on-fault [lazy] in action, you can find one at:

http://free.linux.hp.com/~lts/Tools/mtrace-anon-8p-direct+lazy.log

This file contains the log for 2 memtoy runs, each migrating an 8 page
anon segment from one node to another.  


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* Re: [PATCH/RFC] Migrate-on-fault prototype 0/5 V0.1 - Overview
  2006-03-09 18:28 [PATCH/RFC] Migrate-on-fault prototype 0/5 V0.1 - Overview Lee Schermerhorn
@ 2006-03-09 19:12 ` Christoph Lameter
  2006-03-09 19:30   ` Lee Schermerhorn
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Lameter @ 2006-03-09 19:12 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm

On Thu, 9 Mar 2006, Lee Schermerhorn wrote:

> The basic idea is that when a fault handler [do_swap_page,
> filemap_nopage,
> ...] finds a cached page with zero mappings that is otherwise "stable"--
> i.e., no writebacks--this is a good opportunity to check whether the 
> page resides on the node indicated by the policy in the current context.

Note that this is only one of the types of use of memory policy. Policy is 
typically used for placement and may be changed repeatedly for the same 
memory area in order to get certain patterns of allocation. This approach 
assumes that pages must follow policy. This is not the case for 
applications that keep changing allocation policies. But we have a similar
use with MPOL_MF_MOVE and MPOL_MF_MOVE_ALL. However, these need to be 
enabled explicitly. We may not want this mechanism to be on by default 
because it may destroy the arrangement of pages that an HPC application 
has tried to obtain.

> Note that when a page is NOT found in the cache, and the fault
> handler has to allocate one and read it in, it will have zero
> mappings, so check_migrate_misplaced_page() WILL call
> mpol_misplaced() to see if it needs migration.  Of course, it
> should have been allocated on the correct node, so no migration
> should be necessary.  However, it's possible that the node 
> indicated by the policy has no free pages so the newly 
> allocated page may be on a different node.  In this case, I
> guess check_migrate_misplaced_page() will attempt to migrate
> it.  In either case, the "unnecessary" calls to mpol_misplaced()
> and to migrate_misplaced_page(), if the original allocation
> "overflowed", occur after an IO, so this is the slow path
> anyway.  

There is a general issue with memory policies.  Vma policies are 
currently not implemented for file backed pages.  So if a page is read
in, it should be read into a node that follows the vma policy.

What you are doing here is reading a page and then checking whether 
it is on the correct node?  I think you would need to fix the policy
issue with file backed pages first.  Then the page will be placed on
the correct node after the read and you do not need to check the page
afterwards.

I'd be glad to have a look at the patches when you get the issues with 
the mailer fixed.


* Re: [PATCH/RFC] Migrate-on-fault prototype 0/5 V0.1 - Overview
  2006-03-09 19:12 ` Christoph Lameter
@ 2006-03-09 19:30   ` Lee Schermerhorn
  2006-03-09 19:42     ` Christoph Lameter
  2006-03-10 14:15     ` Lee Schermerhorn
  0 siblings, 2 replies; 6+ messages in thread
From: Lee Schermerhorn @ 2006-03-09 19:30 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm

On Thu, 2006-03-09 at 11:12 -0800, Christoph Lameter wrote:
> On Thu, 9 Mar 2006, Lee Schermerhorn wrote:
> 
> > The basic idea is that when a fault handler [do_swap_page,
> > filemap_nopage,
> > ...] finds a cached page with zero mappings that is otherwise "stable"--
> > i.e., no writebacks--this is a good opportunity to check whether the 
> > page resides on the node indicated by the policy in the current context.
> 
> Note that this is only one of the types of use of memory policy. Policy is 
> typically used for placement and may be changed repeatedly for the same 
> memory area in order to get certain patterns of allocation. This approach 
> assumes that pages must follow policy. This is not the case for 
> applications that keep changing allocation policies. But we have a similar
> use with MPOL_MF_MOVE and MPOL_MF_MOVE_ALL. However, these need to be 
> enabled explicitly. We may not want this mechanism to be on by default 
> because it may destroy the arrangement of pages that an HPC application 
> has tried to obtain.

Yes, I am assuming that pages must [should, best effort, anyway] follow
policy.  When they don't, I assume it's because of current limitations
in the mechanism.  But, that's just me...  

I'm wondering if applications keep changing the policy as you describe
to "finesse" the system--e.g., because they don't have fine enough
control over the policies.  Perhaps I read it wrong, but it appears to
me that we can't set the policy for subranges of a vm area.  So maybe
applications have to set the policy for the [entire] vma, touch a few
pages to get them placed, change the policy for the [entire] vma, touch
a few more pages, ...   Of course, storing policies on subranges of vmas
takes more mechanism than we currently have, and increases the cost of
node computation on each allocation.  That's probably why we don't have
it currently.

Anyway, with the patches I sent, pages would only migrate on fault if
they had no mappings at the time of fault.  If an application had
explicitly placed them by touching them, they could only have zero map
count if something happened to pull them out of the task's pte.  I would
think that if they cared, they'd mlock them so that wouldn't happen?

> 
> > Note that when a page is NOT found in the cache, and the fault
> > handler has to allocate one and read it in, it will have zero
> > mappings, so check_migrate_misplaced_page() WILL call
> > mpol_misplaced() to see if it needs migration.  Of course, it
> > should have been allocated on the correct node, so no migration
> > should be necessary.  However, it's possible that the node 
> > indicated by the policy has no free pages so the newly 
> > allocated page may be on a different node.  In this case, I
> > guess check_migrate_misplaced_page() will attempt to migrate
> > it.  In either case, the "unnecessary" calls to mpol_misplaced()
> > and to migrate_misplaced_page(), if the original allocation
> > "overflowed", occur after an IO, so this is the slow path
> > anyway.  
> 
> There is a general issue with memory policies. vma vma policies are 
> currently not implemented for file backed pages. So if a page is read in 
> then it should be read into a node that follows vma policy.

I agree.  That should happen.  Might not be the first node specified.
Might have overflowed to another node/zone in the list [preferred or
bind with multiple nodes].

> 
> What you are  doing here is reading a page then checking if 
> it is on the correct node? I think you would need to fix the policy issue 
> with file backed pages first. Then the page will be placed on the correct 
> node after the read and you do not need to check the page afterwards.

Yes, that could happen.  That's what I was trying to explain.  I don't
LIKE that, but I haven't thought about how to distinguish a page that
just got read in and is likely on the right node [an acceptable one,
anyway] from one that has zero mappings because it hasn't been
referenced in a while.  Any ideas?

> 
> I'd be glad to have a a look at the pages when you get the issues with 
> the mailer fixed.

I just sent another one to myself, and got it just fine.  I copied you
in addition to the list.  Was that copy borked, too?  If so, I'll try
sending you copies with good ol' mail(1).

Lee



* Re: [PATCH/RFC] Migrate-on-fault prototype 0/5 V0.1 - Overview
  2006-03-09 19:30   ` Lee Schermerhorn
@ 2006-03-09 19:42     ` Christoph Lameter
  2006-03-09 20:14       ` Lee Schermerhorn
  2006-03-10 14:15     ` Lee Schermerhorn
  1 sibling, 1 reply; 6+ messages in thread
From: Christoph Lameter @ 2006-03-09 19:42 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm

On Thu, 9 Mar 2006, Lee Schermerhorn wrote:

> I'm wondering if applications keep changing the policy as you describe
> to "finesse" the system--e.g., because they don't have fine enough
> control over the policies.  Perhaps I read it wrong, but it appears to
> me that we can't set the policy for subranges of a vm area.  So maybe

We can set the policies for subranges. See mempolicy.c

> applications have to set the policy for the [entire] vma, touch a few
> pages to get them placed, change the policy for the [entire] vma, touch
> a few more pages, ...   Of course, storing policies on subranges of vmas
> takes more mechanism that we current have, and increases the cost of
> node computation on each allocation.  Probably why we don't have it
> currently.

We have it currently for anonymous pages.  It's just not implemented
yet for file backed pages.

> Anyway, with the patches I sent, pages would only migrate on fault if
> they had no mappings at the time of fault.  If an application had
> explicitly placed them by touching them, they could only have zero map
> count if something happened to pull them out of the task's pte.  I would
> think that if they cared, they'd mlock them so that wouldn't happen?

Currently page migration may remove ptes for file mapped pages, relying
on the fault handler to restore them.  Hopefully we will restore all
ptes in the future, but as long as the current situation persists you
may potentially move pages belonging (well, in some loose fashion,
since there is no pte) to another process.

> Yes, that could happen.  That's what I was trying to explain.  I don't
> LIKE that, but I haven't thought about how to distinguish a page that
> just go read in and is likely on the right node [an acceptable one,
> anyway] and one that has zero mappings because it hasn't been referenced
> in a while.  Any ideas?

Implement the vma policies for file mapped pages and you can just rely on 
that mechanism to correctly place your pages without any need for 
checking. Plus we will have fixed a major open issue for memory policies. 

> I just sent another one to myself, and got it just fine.  I copied you
> in addition to the list.  Was that copy borked, too?  If so, I'll try
> sending you copies with good ol' mail(1).

Have not seen it yet.


* Re: [PATCH/RFC] Migrate-on-fault prototype 0/5 V0.1 - Overview
  2006-03-09 19:42     ` Christoph Lameter
@ 2006-03-09 20:14       ` Lee Schermerhorn
  0 siblings, 0 replies; 6+ messages in thread
From: Lee Schermerhorn @ 2006-03-09 20:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm

On Thu, 2006-03-09 at 11:42 -0800, Christoph Lameter wrote:
> On Thu, 9 Mar 2006, Lee Schermerhorn wrote:
> 
> > I'm wondering if applications keep changing the policy as you describe
> > to "finesse" the system--e.g., because they don't have fine enough
> > control over the policies.  Perhaps I read it wrong, but it appears to
> > me that we can't set the policy for subranges of a vm area.  So maybe
> 
> We can set the policies for subranges. See mempolicy.c

Yow!  I see.  We split the vma.  I did look at this a while back to see
if I needed to worry about different policies on subranges of VMAs.
Came away realizing that I did not have to worry because each vma only
has a single policy.  Forgot about the splitting...   Hmmm, isn't a
vm_area_struct a rather heavy-weight policy container?  Oh well, at
least we can't exceed sysctl_max_map_count of them [64K by default] per
task/mm ;-).  And probably only applications on really big systems will
do this to any extent.  

Lee



* Re: [PATCH/RFC] Migrate-on-fault prototype 0/5 V0.1 - Overview
  2006-03-09 19:30   ` Lee Schermerhorn
  2006-03-09 19:42     ` Christoph Lameter
@ 2006-03-10 14:15     ` Lee Schermerhorn
  1 sibling, 0 replies; 6+ messages in thread
From: Lee Schermerhorn @ 2006-03-10 14:15 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm

On Thu, 2006-03-09 at 14:30 -0500, Lee Schermerhorn wrote:

> If you're interested in seeing an annotated trace log of direct
> migration
> and migrate-on-fault [lazy] in action, you can find one at:
> 
> http://free.linux.hp.com/~lts/Tools/mtrace-anon-8p-direct+lazy.log
> 
> This file contains the log for 2 memtoy runs, each migrating an 8 page
> anon segment from one node to another. 

Duh!  not my day..

correct link:
http://free.linux.hp.com/~lts/Tools/mmtrace-anon-8p-direct+lazy.log



