[PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault

* [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview
@ 2006-04-07 20:18 Lee Schermerhorn
  2006-04-07 20:22 ` [PATCH 2.6.17-rc1-mm1 1/6] Migrate-on-fault - separate unmap from radix tree replace Lee Schermerhorn
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-07 20:18 UTC (permalink / raw)
  To: linux-mm

This is a reposting of the migrate-on-fault series, against
the 2.6.17-rc1-mm1 tree.  I would love to get some feedback on 
these patches--especially regarding criteria for getting them
into the mm tree for wider testing.

I will send the remainder of the series as responses to this 
message.  Auto-migrate series V0.2 to follow.

Lee
----------------------------------------------------------------------

Migrate-on-fault prototype 0/6 V0.2 - Overview

V0.2 -	refreshed against 2.6.17-rc1-mm1 with Christoph's migration
	code reorg.
	Some rework to 'mpol_misplaced'.  See comments therein.

TODO:
	+ make a Kconfig sub-option of MIGRATION?
	+ add a sysctl to enable/disable migrate on fault?
		separate controls for anon, page cache?

This series of patches, against 2.6.17-rc1-mm1, implements page migration
in the fault path.  Based on discussions with Christoph Lameter, this 
seems like the next logical step in page migration.

The basic idea is that when a fault handler [do_swap_page, filemap_nopage,
...] finds a cached page with zero mappings that is otherwise "stable"--
i.e., no writebacks--this is a good opportunity to check whether the 
page resides on the node indicated by the policy in the current context.

We only want to check if there are zero mappings because 1) we can easily
migrate the page--don't have to go through the effort of removing all
mappings and 2) default policy--a common case--can give different answers
from different tasks running on different nodes.  Checking the policy
when there are zero mappings effectively implements a "first touch"
placement policy.
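
As a rough illustration of the test described above, here is a minimal
userspace sketch in C.  The struct toy_page, its fields and the function
name should_check_migrate() are hypothetical stand-ins chosen for this
example; they are not the kernel's struct page or the functions added by
this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few bits of page state the check needs. */
struct toy_page {
	int mapcount;		/* number of pte mappings */
	bool under_writeback;	/* page is not "stable" while this is set */
	int nid;		/* node the page currently resides on */
};

/*
 * Return true if the fault path should consider migrating the page:
 * nobody maps it, it is stable, and it is not on the node that the
 * policy in the current context asks for.
 */
static bool should_check_migrate(const struct toy_page *page, int policy_nid)
{
	assert(page != NULL);

	if (page->mapcount != 0)
		return false;	/* would have to remove mappings first */
	if (page->under_writeback)
		return false;	/* not stable */
	return page->nid != policy_nid;
}
```

Because the policy is only consulted when the map count is zero, the
first task to touch an unmapped page decides where it lands.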

Note that this mechanism can be used to migrate page cache pages that 
were read in earlier, are no longer referenced, but are about to be
used by a new task on a node other than the one where the page resides.  The
same mechanism can be used to pull anon pages along with a task when
the load balancer decides to move it to another node.  However, that
will require a bit more mechanism, and is the subject of another
patch series.

The current [2.6.17-rc*] direct migration facility supports most of the
mechanism that is required to implement this "migration on fault".  
Some of the necessary operations are combined in functions with other
code that isn't required [must not be executed] in the fault path,
so these have been separated out in a couple of cases.

Then we need to add the function[s] to test the current page in the
fault path for zero mapping, no writebacks, misplacement; and the
function[s] to actually migrate the page contents to a newly
allocated page using the [modified] migratepage address space
operations of the direct migration mechanism.

The Patches:

The patches are broken out in the order I implemented them. Each
should build and boot on its own.  [at least they did at one time!]

migrate-on-fault-01-separate-unmap-replace.patch

	Separates the mm/migrate.c:migrate_page_remove_references()
	function into its 2 distinct operations:  removing references
	[try_to_unmap()], and replacing the old page in the radix 
	tree of the page's "mapping".  Only the second part is 
	needed in the fault path, as the page is already completely
	unmapped.

	A wrapper function that calls both operations is provided,
	and the 2 places that call migrate_page_remove_references()
	have been modified to call that wrapper.

migrate-on-fault-02-mpol_misplaced.patch

	This patch implements the function mpol_misplaced() in
	mm/mempolicy.c to check whether a page resides on the
	node indicated by the vma and address arguments.  If
	so, it returns 0 [!misplaced].  If not, it returns an
	indication of whether the policy was interleaved or not
	[for properly accounting later allocation] and passes the
	node indicated by the policy through a pointer argument.

	Because this will be called in the fault path, I don't 
	want to go through the effort of actually allocating a
	page--e.g., via alloc_page_vma()--only to find that the
	current page is on the correct node.  However, I wanted
	to come to the same answer that alloc_page_vma() would.
	So, mpol_misplaced() mimics the node computation logic
	of alloc_page_vma().

migrate-on-fault-03-migrate_misplaced_page.patch

	This patch contains the main migrate on fault functions:

	check_migrate_misplaced_page() is implemented as a static
	inline function in mempolicy.h when MIGRATION is configured.
	If the page has zero mappings, is stable and misplaced,
	check_*() will call migrate_misplaced_page() in mm/migrate.c
	to do the dirty work.  If for any reason the page can't
	or shouldn't be migrated, these functions will return the
	old page in the state it was found.

	Note that when a page is NOT found in the cache, and the fault
	handler has to allocate one and read it in, it will have zero
	mappings, so check_migrate_misplaced_page() WILL call
	mpol_misplaced() to see if it needs migration.  Of course, it
	should have been allocated on the correct node, so no migration
	should be necessary.  However, it's possible that the node 
	indicated by the policy has no free pages so the newly 
	allocated page may be on a different node.  In this case, I
	guess check_migrate_misplaced_page() will attempt to migrate
	it.  In either case, the "unnecessary" calls to mpol_misplaced()
	and to migrate_misplaced_page(), if the original allocation
	"overflowed", occur after an IO, so this is the slow path
	anyway.  

	When MIGRATION is NOT configured, check_migrate_misplaced_page()
	becomes a macro that evaluates to its argument page.

	More details with the patch.

migrate-on-fault-04.1-misplaced-anon-pages.patch

	This is a simple one-liner [OK, 2, counting an empty line]
	to call check_migrate_misplaced_page() from do_swap_page()
	in memory.c.  

	Patches to hook other fault paths [filemap_nopage(), etc.] 
	are still TBD.

migrate-on-fault-05-mbind-lazy-migrate.patch

	This patch adds an MPOL_MF_LAZY [maybe should be '_DEFERRED?]
	flag to modify the behavior of MPOL_MF_MOVE[_ALL].  When
	the 'LAZY flag is specified, mbind() simply unmaps eligible
	pages in the specified range, moving anon pages to the
	swap cache, if not already there.  Then, when the task
	touches the pages, or queries their location via 
	get_mempolicy(..., MPOL_F_NODE|MPOL_F_ADDR), it will take a
	fault, find the page in the cache and migrate it, if the
	policy so indicates.  Actually, this will only happen for
	anon pages, until additional fault paths are hooked up.

	This patch allows me to test the migrate on fault mechanism
	by forcing pages to be unmapped.

migrate-on-fault-06-mbind-noop-policy.patch

	This patch adds a "NO-OP" policy to mbind() so that the
	"'MOVE+'LAZY" unmap-only function can be performed on a
	range of task memory without changing the policy.
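
The description of patch 3 above says the check is a static inline when
MIGRATION is configured and a macro that evaluates to its argument
otherwise.  A minimal sketch of that pattern, using the hypothetical names
TOY_MIGRATION and toy_check_migrate() rather than the real config symbol
and function, might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page type, for illustration only. */
struct toy_page {
	int nid;
};

#ifdef TOY_MIGRATION
/*
 * Placeholder for the configured case: a real version would test the
 * page and possibly migrate it.  This stub only returns the old page,
 * which is what the callers see whenever migration is not performed.
 */
static inline struct toy_page *toy_check_migrate(struct toy_page *page)
{
	return page;
}
#else
/* Not configured: the "call" is just its argument, so it costs nothing. */
#define toy_check_migrate(page) (page)
#endif
```

Either way the caller writes page = toy_check_migrate(page), so the fault
handlers need no #ifdef of their own.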

Testing:

I have tested migrate-on-fault of anon pages using the MPOL_MF_LAZY 
extension to mbind() discussed in patch 5 above on 2.6.17-rc1-mm1.
I have an ad hoc [odd hack?] test program, called memtoy, available at:

	http://free.linux.hp.com/~lts/Tools/memtoy-latest.tar.gz

The Xpm-tests subdirectory in the tarball contains memtoy test
scripts for "manual page migration"--i.e., the migrate_pages()
syscall, "direct migration" using mbind(MPOL_MF_MOVE) and
migrate-on-fault using mbind(MPOL_MF_MOVE+MPOL_MF_LAZY).

I have also tested with the "automigration" series layered on top
of this one.  In that environment, whenever the scheduler migrates
a task to a new node, the task unmaps pages with default policy and
migrates them, if necessary, on first touch after unmap.  Running
kernel builds in this environment provides a fairly good stress test
of the migrate-on-fault mechanism.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [PATCH 2.6.17-rc1-mm1 1/6] Migrate-on-fault - separate unmap from radix tree replace
  2006-04-07 20:18 [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview Lee Schermerhorn
@ 2006-04-07 20:22 ` Lee Schermerhorn
  2006-04-11 18:08   ` Christoph Lameter
  2006-04-07 20:23 ` [PATCH 2.6.17-rc1-mm1 2/6] Migrate-on-fault - check for misplaced page Lee Schermerhorn
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-07 20:22 UTC (permalink / raw)
  To: linux-mm

Migrate-on-fault prototype 1/6 V0.2 - separate unmap from radix tree replace

V0.2 - rework against 2.6.17-rc1, with Christoph's migration code
       reorg.  No change for 2.6.17-rc1-mm1

The migrate_page_remove_references() function performs two distinct
operations:  actually attempting to remove pte references from the
page via try_to_unmap() and replacing the page with a new page in
the page's mapping's radix tree.  This patch separates these 
operations into two functions so that they can be called separately.

Then, migrate_page_remove_references() is replaced with a function
named migrate_page_unmap_and_replace() to indicate the two operations,
and existing calls in mm/migrate.c:migrate_page() and
mm/migrate.c:buffer_migrate_page() are updated.

Note:  this results in each of the functions having to load the
mapping when called for direct migration.  Perhaps passing mapping as
an argument would be preferable?

Subsequent patches in the series will make use of the separate
operations. 

Eventually, we can remove migrate_page_unmap_and_replace().

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.17-rc1/include/linux/migrate.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/migrate.h	2006-04-03 08:51:08.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/migrate.h	2006-04-03 12:09:57.000000000 -0400
@@ -9,7 +9,9 @@ extern int isolate_lru_page(struct page 
 extern int putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct page *, struct page *);
 extern void migrate_page_copy(struct page *, struct page *);
-extern int migrate_page_remove_references(struct page *, struct page *, int);
+extern int migrate_page_try_to_unmap(struct page *, int);
+extern int migrate_page_replace_in_mapping(struct page *, struct page *, int);
+extern int migrate_page_unmap_and_replace(struct page *, struct page *, int);
 extern int migrate_pages(struct list_head *l, struct list_head *t,
 		struct list_head *moved, struct list_head *failed);
 extern int migrate_pages_to(struct list_head *pagelist,
Index: linux-2.6.17-rc1/mm/migrate.c
===================================================================
--- linux-2.6.17-rc1.orig/mm/migrate.c	2006-04-03 08:51:08.000000000 -0400
+++ linux-2.6.17-rc1/mm/migrate.c	2006-04-03 12:09:57.000000000 -0400
@@ -179,14 +179,12 @@ retry:
 EXPORT_SYMBOL(swap_page);
 
 /*
- * Remove references for a page and establish the new page with the correct
- * basic settings to be able to stop accesses to the page.
+ * Try to remove pte references from page in preparation to migrate to
+ * a new page.
  */
-int migrate_page_remove_references(struct page *newpage,
-				struct page *page, int nr_refs)
+int migrate_page_try_to_unmap(struct page *page, int nr_refs)
 {
 	struct address_space *mapping = page_mapping(page);
-	struct page **radix_pointer;
 
 	/*
 	 * Avoid doing any of the following work if the page count
@@ -225,6 +223,19 @@ int migrate_page_remove_references(struc
 	if (page_mapcount(page))
 		return -EAGAIN;
 
+	return 0;
+}
+EXPORT_SYMBOL(migrate_page_try_to_unmap);
+
+/*
+ * replace page in its mapping's radix tree with newpage
+ */
+int migrate_page_replace_in_mapping(struct page *newpage,
+		struct page *page, int nr_refs)
+{
+	struct address_space *mapping = page_mapping(page);
+        struct page **radix_pointer;
+
 	write_lock_irq(&mapping->tree_lock);
 
 	radix_pointer = (struct page **)radix_tree_lookup_slot(
@@ -254,12 +265,29 @@ int migrate_page_remove_references(struc
 	}
 
 	*radix_pointer = newpage;
-	__put_page(page);
+	__put_page(page);		/* drop cache ref */
 	write_unlock_irq(&mapping->tree_lock);
 
 	return 0;
 }
-EXPORT_SYMBOL(migrate_page_remove_references);
+EXPORT_SYMBOL(migrate_page_replace_in_mapping);
+
+/*
+ * Remove references for a page and establish the new page with the correct
+ * basic settings to be able to stop accesses to the page.
+ */
+int migrate_page_unmap_and_replace(struct page *newpage,
+				struct page *page, int nr_refs)
+{
+	/*
+	 * Give up if we were unable to remove all mappings.
+	 */
+	if (migrate_page_try_to_unmap(page, nr_refs))
+		return 1;
+
+	return migrate_page_replace_in_mapping(newpage, page, nr_refs);
+}
+EXPORT_SYMBOL(migrate_page_unmap_and_replace);
 
 /*
  * Copy the page to its new location
@@ -310,10 +338,11 @@ EXPORT_SYMBOL(migrate_page_copy);
 int migrate_page(struct page *newpage, struct page *page)
 {
 	int rc;
+	int nr_refs = 2;	/* cache + current */
 
 	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
 
-	rc = migrate_page_remove_references(newpage, page, 2);
+	rc = migrate_page_unmap_and_replace(newpage, page, nr_refs);
 
 	if (rc)
 		return rc;
@@ -530,6 +559,7 @@ int buffer_migrate_page(struct page *new
 {
 	struct address_space *mapping = page->mapping;
 	struct buffer_head *bh, *head;
+	int nr_refs = 3;	/* cache + bufs + current */
 	int rc;
 
 	if (!mapping)
@@ -540,7 +570,7 @@ int buffer_migrate_page(struct page *new
 
 	head = page_buffers(page);
 
-	rc = migrate_page_remove_references(newpage, page, 3);
+ 	rc = migrate_page_unmap_and_replace(newpage, page, nr_refs);
 
 	if (rc)
 		return rc;
@@ -556,7 +586,7 @@ int buffer_migrate_page(struct page *new
 	ClearPagePrivate(page);
 	set_page_private(newpage, page_private(page));
 	set_page_private(page, 0);
-	put_page(page);
+	put_page(page);		/* transfer buf ref to newpage */
 	get_page(newpage);
 
 	bh = head;


* Re: [PATCH 2.6.17-rc1-mm1 1/6] Migrate-on-fault - separate unmap from radix tree replace
  2006-04-07 20:22 ` [PATCH 2.6.17-rc1-mm1 1/6] Migrate-on-fault - separate unmap from radix tree replace Lee Schermerhorn
@ 2006-04-11 18:08   ` Christoph Lameter
  2006-04-11 18:47     ` Lee Schermerhorn
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2006-04-11 18:08 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm

On Fri, 7 Apr 2006, Lee Schermerhorn wrote:

> +		struct page *page, int nr_refs)
> +{
> +	struct address_space *mapping = page_mapping(page);
> +        struct page **radix_pointer;
> +

Whitespace damage. Some other places as well.

>  /*
>   * Copy the page to its new location
> @@ -310,10 +338,11 @@ EXPORT_SYMBOL(migrate_page_copy);
>  int migrate_page(struct page *newpage, struct page *page)
>  {
>  	int rc;
> +	int nr_refs = 2;	/* cache + current */

Why the nr_refs variables if you do not modify them before passing them 
to the migration functions?

* Re: [PATCH 2.6.17-rc1-mm1 1/6] Migrate-on-fault - separate unmap from radix tree replace
  2006-04-11 18:08   ` Christoph Lameter
@ 2006-04-11 18:47     ` Lee Schermerhorn
  0 siblings, 0 replies; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-11 18:47 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm

On Tue, 2006-04-11 at 11:08 -0700, Christoph Lameter wrote:
> On Fri, 7 Apr 2006, Lee Schermerhorn wrote:
> 
> > +		struct page *page, int nr_refs)
> > +{
> > +	struct address_space *mapping = page_mapping(page);
> > +        struct page **radix_pointer;
> > +
> 
> Whitespace damage. Some other places as well.

OK.  Not sure how that [and the others] snuck in there....

> 
> >  /*
> >   * Copy the page to its new location
> > @@ -310,10 +338,11 @@ EXPORT_SYMBOL(migrate_page_copy);
> >  int migrate_page(struct page *newpage, struct page *page)
> >  {
> >  	int rc;
> > +	int nr_refs = 2;	/* cache + current */
> 
> Why the nr_refs variables if you do not modify them before passing them 
> to the migration functions?

Couple of reasons:   I prefer symbolic names to magic numbers like '2'.
This value will be passed to a function as arg "nr_refs", so that seemed
like a good name for it here.  It's also a place to hang a comment for
tracking the reference counts.  This was, for me, one of the trickiest
areas in getting migrate on fault to work--keeping track of the page ref
counts.  I wanted to be clear on what ref's we expect where, and what
we're doing to them.  Finally, I'll be adding in the fault path
reference in a subsequent patch in the series.

I thought it made the code easier to read, and I hope the compiler is
smart enough to "do the right thing".

Lee

* Re: [PATCH 2.6.17-rc1-mm1 2/6] Migrate-on-fault - check for misplaced page
  2006-04-07 20:18 [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview Lee Schermerhorn
  2006-04-07 20:22 ` [PATCH 2.6.17-rc1-mm1 1/6] Migrate-on-fault - separate unmap from radix tree replace Lee Schermerhorn
@ 2006-04-07 20:23 ` Lee Schermerhorn
  2006-04-11 18:21   ` Christoph Lameter
  2006-04-07 20:23 ` [PATCH 2.6.17-rc1-mm1 3/6] Migrate-on-fault - migrate " Lee Schermerhorn
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-07 20:23 UTC (permalink / raw)
  To: linux-mm

Migrate-on-fault prototype 2/6 V0.2 - check for misplaced page

V0.2 -	reworked against 2.6.17-rc1-mm1 with Christoph's migration
	code reorg
	Also:	get vma policy after updating task's cpuset memory
		state.  Use mems_allowed in policy to vet nodes,
		but I'm not sure this check is necessary.

This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped.  This involves
looking up the node where the page belongs.  So, the function
returns that node so that it may be used to allocate the page
without consulting the policy again.  Because interleaved and
non-interleaved allocations are accounted differently, the function
also returns whether or not the new node came from an interleaved
policy, if the page is misplaced.

A subsequent patch will call this function from the fault path for
stable pages with zero page_mapcount().  Because of this, I don't
want to go ahead and allocate the page, e.g., via alloc_page_vma()
only to have to free it if it has the correct policy.  So, I just
mimic the alloc_page_vma() node computation logic.
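
To make the return convention concrete, here is a small userspace sketch
in C.  It is a simplification under stated assumptions: only the default
and preferred policies are modeled, and the names toy_misplaced(),
toy_policy, TOY_NOT_MISPLACED and TOY_MISPLACED are hypothetical, not the
ones used in the patch below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced set of policies for illustration. */
enum toy_policy { TOY_DEFAULT, TOY_PREFERRED };

#define TOY_NOT_MISPLACED 0
#define TOY_MISPLACED     1

/*
 * Compute the node the policy would pick, without allocating anything,
 * and compare it to the node the page is on now.  On a mismatch the
 * destination node is passed back through newnid.
 */
static int toy_misplaced(int page_nid, enum toy_policy pol,
			 int preferred_nid, int local_nid, int *newnid)
{
	int polnid;

	assert(newnid != NULL);

	switch (pol) {
	case TOY_PREFERRED:
		/* a negative preferred node means "use the local node" */
		polnid = preferred_nid >= 0 ? preferred_nid : local_nid;
		break;
	case TOY_DEFAULT:
	default:
		polnid = local_nid;
		break;
	}

	if (page_nid == polnid)
		return TOY_NOT_MISPLACED;	/* reuse the current page */

	*newnid = polnid;
	return TOY_MISPLACED;
}
```

The caller can then allocate directly on *newnid without consulting the
policy a second time.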

Note that for "process interleaving" the destination node depends
on the order of access to pages.  I.e., there is no fixed layout
for process interleaved pages, as there is for pages interleaved
via vma policy.  So, as long as the page resides on a node that
exists in the process's interleave set, no migration is indicated.
Having said that, we may never need to call this function without
a vma, so maybe we can lose that "feature".
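
The "fixed layout" of vma interleaving can be sketched as follows.  This
is a userspace approximation, assuming a node mask that fits in one
unsigned long; toy_vma_page_offset() and toy_offset_il_node() are
hypothetical names that only imitate the idea of computing the node from
the page offset.

```c
#include <assert.h>

/* Page offset of addr within the mapped object, as in the vma case. */
static unsigned long toy_vma_page_offset(unsigned long vm_pgoff,
					 unsigned long vm_start,
					 unsigned long addr,
					 unsigned int page_shift)
{
	assert(addr >= vm_start);
	return vm_pgoff + ((addr - vm_start) >> page_shift);
}

/*
 * Pick the (off % nnodes)-th set bit of the node mask.  The result
 * depends only on the offset and the mask, so evaluating it again later
 * gives the same node.
 */
static int toy_offset_il_node(unsigned long nodemask, unsigned long off)
{
	unsigned int nnodes = 0;
	unsigned int target, nid;
	unsigned long m;

	/* count the nodes in the mask */
	for (m = nodemask; m; m &= m - 1)
		nnodes++;
	assert(nnodes > 0);

	target = off % nnodes;
	for (nid = 0; ; nid++) {
		if (nodemask & (1UL << nid)) {
			if (target == 0)
				return (int)nid;
			target--;
		}
	}
}
```

Process interleaving has no such function of the offset: the next node
depends on how many pages the task has already allocated, which is why
the patch only checks membership in the interleave set for that case.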

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.17-rc1-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.17-rc1-mm1.orig/mm/mempolicy.c	2006-04-06 16:45:13.000000000 -0400
+++ linux-2.6.17-rc1-mm1/mm/mempolicy.c	2006-04-06 16:47:14.000000000 -0400
@@ -1874,3 +1874,102 @@ out:
 	return 0;
 }
 
+/**
+ * mpol_misplaced - check whether current page node id is valid in policy
+ *
+ * @page   - page to be checked
+ * @vma    - vm area where page mapped
+ * @addr   - virtual address where page mapped
+ * @newnid - [ptr to] node id to which page should be migrated
+ *
+ * lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ * if same, return 0 -- reuse current page
+ * if different,
+ *     return destination nid via newnid
+ *     return MPOL_MIGRATE_NONINTERLEAVED for non-interleaved policy
+ *     return MPOL_MIGRATE_INTERLEAVED for interleaved policy.
+ * policy determination mimics alloc_page_vma()
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+			 unsigned long addr, int *newnid)
+{
+	struct mempolicy *pol;
+	struct zonelist *zl;
+	nodemask_t *mems;
+	int curnid = page_to_nid(page);
+	int polnid = -1, interleave = 0;
+	int i;
+
+//TODO:  can we call this here, in the fault path [with mmap_sem held?]
+//       do we want to?  applications and systems that could benefit from
+//       migrate-on-fault probably want cpusets as well.
+	cpuset_update_task_memory_state();
+	pol = get_vma_policy(current, vma, addr);
+
+	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
+		interleave = 1;	/* for accounting */
+		if (vma) {
+			unsigned long off;
+			BUG_ON(addr >= vma->vm_end);
+			BUG_ON(addr < vma->vm_start);
+			off = vma->vm_pgoff;
+			off += (addr - vma->vm_start) >> PAGE_SHIFT;
+			polnid = offset_il_node(pol, vma, off);
+		} else {
+//TODO:  can this ever happen?
+			/*
+			 * for process interleaving, just ensure that
+			 * curnid is in policy nodes -- to avoid thrashing
+			 */
+			if (node_isset(curnid, pol->v.nodes))
+				return 0;
+			polnid = interleave_nodes(pol);
+		}
+	} else
+		switch (pol->policy) {
+		case MPOL_PREFERRED:
+			polnid = pol->v.preferred_node;
+			if (polnid < 0)
+				polnid = numa_node_id();
+			break;
+		case MPOL_BIND:
+			/*
+			 * allows binding to multiple nodes.
+			 * use current page if in zonelist,
+			 * else select first allowed node
+			 */
+			mems = &pol->cpuset_mems_allowed;
+			zl = pol->v.zonelist;
+			for (i = 0; zl->zones[i]; i++) {
+				int nid = zl->zones[i]->zone_pgdat->node_id;
+
+				if (nid == curnid)
+					return 0;
+
+				if (polnid < 0 &&
+//TODO:  is this check necessary?
+					node_isset(nid, *mems))
+					polnid = nid;
+			}
+			if (polnid >= 0)
+				break;
+			/*FALL THROUGH*/
+		case MPOL_INTERLEAVE: /* should not happen */
+		case MPOL_DEFAULT:
+			polnid = numa_node_id();
+			break;
+		default:
+			polnid = 0;
+			BUG();
+		}
+
+	if (curnid == polnid)
+		return 0;
+
+	*newnid = polnid;
+	if (interleave)
+		return MPOL_MIGRATE_INTERLEAVED;
+
+	return MPOL_MIGRATE_NONINTERLEAVED;
+}
Index: linux-2.6.17-rc1-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/mempolicy.h	2006-04-06 16:45:13.000000000 -0400
+++ linux-2.6.17-rc1-mm1/include/linux/mempolicy.h	2006-04-06 16:46:17.000000000 -0400
@@ -173,6 +173,17 @@ static inline void check_highest_zone(in
 int do_migrate_pages(struct mm_struct *mm,
 	const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
 
+/*
+ * mm/vmscan.c doesn't include mempolicy.  Keep knowledge of these
+ * macros' values internal to mempolicy.[ch]
+ */
+#define MPOL_MIGRATE_NONINTERLEAVED 1
+#define MPOL_MIGRATE_INTERLEAVED 2
+#define misplaced_is_interleaved(pol) ((pol) == MPOL_MIGRATE_INTERLEAVED)
+
+int mpol_misplaced(struct page *, struct vm_area_struct *,
+		unsigned long, int *);
+
 extern void *cpuset_being_rebound;	/* Trigger mpol_copy vma rebind */
 
 #else


* Re: [PATCH 2.6.17-rc1-mm1 2/6] Migrate-on-fault - check for misplaced page
  2006-04-07 20:23 ` [PATCH 2.6.17-rc1-mm1 2/6] Migrate-on-fault - check for misplaced page Lee Schermerhorn
@ 2006-04-11 18:21   ` Christoph Lameter
  2006-04-11 19:28     ` Lee Schermerhorn
  2006-04-12 16:43     ` Paul Jackson
  0 siblings, 2 replies; 25+ messages in thread
From: Christoph Lameter @ 2006-04-11 18:21 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm, ak

On Fri, 7 Apr 2006, Lee Schermerhorn wrote:

> This patch provides a new function to test whether a page resides
> on a node that is appropriate for the mempolicy for the vma and
> address where the page is supposed to be mapped.  This involves
> looking up the node where the page belongs.  So, the function
> returns that node so that it may be used to allocate the page
> without consulting the policy again.  Because interleaved and
> non-interleaved allocations are accounted differently, the function
> also returns whether or not the new node came from an interleaved
> policy, if the page is misplaced.

The misplaced page function should not consider the vma policy if the page 
is mapped because the VM does not handle vma policies for file 
mapped pages yet. This version may be checking for a policy that would
not be applied to the page for regular allocations.

As I said before: It would be best if memory policy support for file 
mapped vmas would be implemented before opportunistic and lazy migration 
went in. Otherwise we will need a lot of exceptions to even implement
the opportunistic migration in a clean way.

> Note that for "process interleaving" the destination node depends
> on the order of access to pages.  I.e., there is no fixed layout
> for process interleaved pages, as there is for pages interleaved
> via vma policy.  So, as long as the page resides on a node that
> exists in the process's interleave set, no migration is indicated.
> Having said that, we may never need to call this function without
> a vma, so maybe we can lose that "feature".

This would radically change if the file backed pages were 
properly allocated according to vma policy. Then almost all pages would 
have a proper node for interleave and the node could be calculated based 
on the address. Opportunistic migration can destroy carefully laid out 
interleaving of pages. 

Note also that opportunistic migration like this may move a pagecache page 
out of place that is repeatedly in use by processes that have
completely different allocation policies. It may just happen that the 
processes currently do not map that page.

> +//TODO:  can we call this here, in the fault path [with mmap_sem held?]
> +//       do we want to?  applications and systems that could benefit from
> +//       migrate-on-fault probably want cpusets as well.
> +	cpuset_update_task_memory_state();
> +	pol = get_vma_policy(current, vma, addr);

You need to use the task policy instead of the vma policy if the page is 
file backed because vma policies do not apply in that case.

> +			/*
> +			 * allows binding to multiple nodes.
> +			 * use current page if in zonelist,
> +			 * else select first allowed node
> +			 */
> +			mems = &pol->cpuset_mems_allowed;
> +			zl = pol->v.zonelist;
> +			for (i = 0; zl->zones[i]; i++) {
> +				int nid = zl->zones[i]->zone_pgdat->node_id;
> +
> +				if (nid == curnid)
> +					return 0;
> +
> +				if (polnid < 0 &&
> +//TODO:  is this check necessary?
> +					node_isset(nid, *mems))
> +					polnid = nid;
> +			}
> +			if (polnid >= 0)
> +				break;

Hmm.... Checking for the current node in memory policy? How does this 
interact with cpuset constraints?

* Re: [PATCH 2.6.17-rc1-mm1 2/6] Migrate-on-fault - check for misplaced page
  2006-04-11 18:21   ` Christoph Lameter
@ 2006-04-11 19:28     ` Lee Schermerhorn
  2006-04-11 19:33       ` Christoph Lameter
  2006-04-12 16:43     ` Paul Jackson
  1 sibling, 1 reply; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-11 19:28 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, ak

On Tue, 2006-04-11 at 11:21 -0700, Christoph Lameter wrote:
> On Fri, 7 Apr 2006, Lee Schermerhorn wrote:
> 
> > This patch provides a new function to test whether a page resides
> > on a node that is appropriate for the mempolicy for the vma and
> > address where the page is supposed to be mapped.  This involves
> > looking up the node where the page belongs.  So, the function
> > returns that node so that it may be used to allocate the page
> > without consulting the policy again.  Because interleaved and
> > non-interleaved allocations are accounted differently, the function
> > also returns whether or not the new node came from an interleaved
> > policy, if the page is misplaced.
> 
> The misplaced page function should not consider the vma policy if the page 
> is mapped because the VM does not handle vma policies for file 
> mapped pages yet. This version may be checking for a policy that would
> not be applied to the page for regular allocations.

When you say "mapped" here, you mean a mmap()ed file?  As opposed to
"mapped by a pte" such that page_mapcount(page) != 0, right?  Because if
the mapcount() isn't zero, we won't even look for misplaced pages.  And,
with the V0.2 series, I'm only checking for misplaced pages with
mapcount == 0 in the anon page fault path.  If necessary, I can skip
pages in VMAs that have non-NULL vm_file.  Do we get these in the anon
fault path?

> 
> As I said before: It would be best if memory policy support for file 
> mapped vmas would be implemented before opportunistic and lazy migration 
> went in. Otherwise we will need a lot of exceptions to even implement
> the opportunistic migration in a clean way.

OK.  I won't hook up migrate-on-fault to the file mapped fault path
until this is done.  I'm still not clear on what you have in mind for
policies on file mapped vmas.  Do you want to attach the policies to the
file/inode itself [like for shared memory segments], so that they apply
to all mappers?  

> 
> > Note that for "process interleaving" the destination node depends
> > on the order of access to pages.  I.e., there is no fixed layout
> > for process interleaved pages, as there is for pages interleaved
> > via vma policy.  So, as long as the page resides on a node that
> > exists in the process's interleave set, no migration is indicated.
> > Having said that, we may never need to call this function without
> > a vma, so maybe we can lose that "feature".
> 
> This would radically change if the file backed pages were 
> properly allocated according to vma policy. Then almost all pages would 
> have a proper node for interleave and the node could be calculated based 
> on the address. Opportunistic migration can destroy carefully laid out 
> interleaving of pages. 

I agree, I think...  However, if the policies are attached directly to
the file itself [I mean the in-memory incarnation in the form of
file/inode structs--not the on disk info], then I don't see why
"migrate-on-fault", opportunistic or otherwise, would do anything
different from normal allocation.  I mean, my intention is that
migrate-on-fault move pages [with zero map count] that don't reside where
initial allocation under the current policy would place them.  Thus, I want to
avoid policies, or interpretations of policies, that give different
answers each time you evaluate them.

> 
> Note also that opportunistic migration like this may move a pagecache page 
> out of place that is repeatedly in use by processes that have
> completely different allocation policies. It may just happen that the 
> processes currently do not map that page.

Do you mean with my current implementation, if I hooked up that fault
path?  Or do you mean when/if file backed pages are "properly allocated
according to vma [???] policy"?  Are you suggesting that proper
behavior is for each mapping process to have a different policy on the
file [in the vma] and whoever brings it into memory gets to choose where
it lands?  In that case, then yes, migrate-on-fault could move the page
if it finds it in the cache with mapcount==0 and misplaced according to
the policy of the faulting task's vma mapping the file.   If, however,
the policies are attached to the underlying file/inode struct, then any
task faulting a page for that file will see the same policy.  If it uses
the file offset to compute interleaving, then it should get the same
answer from any task.  This is how I've seen it implemented in other
systems and so had the "least astonishment" for me.  Others may see it
differently.
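
The file-offset scheme described above can be sketched in a few lines of
user-space C.  This is only an illustration under stated assumptions, not
code from the patch series: the function names, the array-based node set,
and the simple modulo mapping are all hypothetical.  It shows why a policy
attached to the file gives the same answer no matter which task faults.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical user-space model; not kernel code.
 * Pick the interleave node for a file page from its page offset alone,
 * so every task mapping the file computes the same node.
 * nodes[] lists the node ids in the interleave set; nr_nodes must be > 0.
 */
static int interleave_node_for_offset(const int *nodes, size_t nr_nodes,
				      unsigned long pgoff)
{
	return nodes[pgoff % nr_nodes];
}

/* A page is misplaced when it is not on the node its offset selects. */
static int page_is_misplaced(const int *nodes, size_t nr_nodes,
			     unsigned long pgoff, int cur_nid)
{
	return interleave_node_for_offset(nodes, nr_nodes, pgoff) != cur_nid;
}
```

Because the result depends only on the offset and the node set, evaluating
the policy twice for the same page cannot give two different answers.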

> 
> > +//TODO:  can we call this here, in the fault path [with mmap_sem held?]
> > +//       do we want to?  applications and systems that could benefit from
> > +//       migrate-on-fault probably want cpusets as well.
> > +	cpuset_update_task_memory_state();
> > +	pol = get_vma_policy(current, vma, addr);
> 
> You need to use the task policy instead of the vma policy if the page is 
> file backed because vma policies do not apply in that case.

OK, but again, I haven't hooked up migrate-on-fault for file backed
pages yet.  Here, you're saying that if I DID hook it up before fixing
how file backed pages are handled, then to be consistent with current
behavior, I should use task policy for file backed pages?

How about shmem backed pages?

> 
> > +			/*
> > +			 * allows binding to multiple nodes.
> > +			 * use current page if in zonelist,
> > +			 * else select first allowed node
> > +			 */
> > +			mems = &pol->cpuset_mems_allowed;
> > +			zl = pol->v.zonelist;
> > +			for (i = 0; zl->zones[i]; i++) {
> > +				int nid = zl->zones[i]->zone_pgdat->node_id;
> > +
> > +				if (nid == curnid)
> > +					return 0;
> > +
> > +				if (polnid < 0 &&
> > +//TODO:  is this check necessary?
> > +					node_isset(nid, *mems))
> > +					polnid = nid;
> > +			}
> > +			if (polnid >= 0)
> > +				break;
> 
> Hmm.... Checking for the current node in memory policy? How does this 
> interact with cpuset constraints?

That's why I asked if it's necessary.  If I call
cpuset_update_task_memory_state() above, I think that it rebinds the
task's policies so that the zone lists have only valid mems.  Having
found a node in the zonelist, do I need to check it again?  I think I
was TRYING to honor the cpuset constraints.

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 2/6] Migrate-on-fault - check for misplaced page
  2006-04-11 19:28     ` Lee Schermerhorn
@ 2006-04-11 19:33       ` Christoph Lameter
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Lameter @ 2006-04-11 19:33 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm, ak

On Tue, 11 Apr 2006, Lee Schermerhorn wrote:

> > The misplaced page function should not consider the vma policy if the page 
> > is mapped because the VM does not handle vma policies for file 
> > mapped pages yet. This version may be checking for a policy that would
> > not be applied to the page for regular allocations.
> 
> When you say "mapped" here, you mean a mmap()ed file?  As opposed to
> "mapped by a pte" such that page_mapcount(page) != 0, right?  Because if
> the mapcount() isn't zero, we won't even look for misplaced pages.  And,
> with the V0.2 series, I'm only checking for misplaced pages with
> mapcount == 0 in the anon page fault path.  If necessary, I can skip
> pages in VMAs that have non-NULL vm_file.  Do we get these in the anon
> fault path?

You would need to skip evaluating the vma policy for file backed pages
for the misplaced page check.

> > You need to use the task policy instead of the vma policy if the page is 
> > file backed because vma policies do not apply in that case.
> 
> OK, but again, I haven't hooked up migrate-on-fault for file backed
> pages yet.  Here, you're saying that if I DID hook it up before fixing
> how file back pages are handled, then to be consistent with current
> behavior, I should use task policy for file back pages?

If this applies only to anonymous pages then it's okay.

> How about shmem backed pages?

Those have a valid policy even when they are unmapped.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 2/6] Migrate-on-fault - check for misplaced page
  2006-04-11 18:21   ` Christoph Lameter
  2006-04-11 19:28     ` Lee Schermerhorn
@ 2006-04-12 16:43     ` Paul Jackson
  2006-04-12 18:49       ` Lee Schermerhorn
  1 sibling, 1 reply; 25+ messages in thread
From: Paul Jackson @ 2006-04-12 16:43 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Lee.Schermerhorn, linux-mm, ak

Christoph, responding to Lee:
> > +			/*
> > +			 * allows binding to multiple nodes.
> > +			 * use current page if in zonelist,
> > +			 * else select first allowed node
> > +			 */
> > +			mems = &pol->cpuset_mems_allowed;
> > +			...
> 
> Hmm.... Checking for the current node in memory policy? How does this 
> interact with cpuset constraints?

The per-mempolicy 'cpuset_mems_allowed' does not specify the nodes to
which the task is bound, but rather the nodes to which the mempolicy is
relative.  No code except the mempolicy rebinding code should be using
the mempolicy->cpuset_mems_allowed field.

The proper way to check if a zone is allowed by cpusets appears
in several places in the files mm/page_alloc.c, mm/vmscan.c, and
mm/hugetlb.c.

$ grep cpuset_zone_allowed mm/*.c
mm/hugetlb.c:           if (cpuset_zone_allowed(*z, GFP_HIGHUSER) &&
mm/oom_kill.c:          if (cpuset_zone_allowed(*z, gfp_mask))
mm/page_alloc.c:         * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
mm/page_alloc.c:                                !cpuset_zone_allowed(*z, gfp_mask))
mm/page_alloc.c:         * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
mm/vmscan.c:            if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
mm/vmscan.c:            if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
mm/vmscan.c:            if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
mm/vmscan.c:    if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 2/6] Migrate-on-fault - check for misplaced page
  2006-04-12 16:43     ` Paul Jackson
@ 2006-04-12 18:49       ` Lee Schermerhorn
  2006-04-12 20:55         ` Paul Jackson
  0 siblings, 1 reply; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-12 18:49 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Christoph Lameter, linux-mm, ak

On Wed, 2006-04-12 at 09:43 -0700, Paul Jackson wrote:
> Christoph, responding to Lee:
> > > +			/*
> > > +			 * allows binding to multiple nodes.
> > > +			 * use current page if in zonelist,
> > > +			 * else select first allowed node
> > > +			 */
> > > +			mems = &pol->cpuset_mems_allowed;
> > > +			...
> > 
> > Hmm.... Checking for the current node in memory policy? How does this 
> > interact with cpuset constraints?
> 
> The per-mempolicy 'cpuset_mems_allowed' does not specify the nodes to
> which the task is bound, but rather the nodes to which the mempolicy is
> relative.  No code except the mempolicy rebinding code should be using
> the mempolicy->cpuset_mems_allowed field.
> 
> The proper way to check if a zone is allowed by cpusets appears
> in several places in the files mm/page_alloc.c, mm/vmscan.c, and
> mm/hugetlb.c.

Thanks, Paul.  But, I wonder, do I even need to do this check at all?
I just found the node in the policy's nodelist after having done a
cpuset_update_task_memory_state().  Looks like updating the task
memory state refreshes the policy zonelist, so it should only have nodes
valid in the cpuset.  Is this correct?

If so, I can just drop that check...

Lee



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 2/6] Migrate-on-fault - check for misplaced page
  2006-04-12 18:49       ` Lee Schermerhorn
@ 2006-04-12 20:55         ` Paul Jackson
  0 siblings, 0 replies; 25+ messages in thread
From: Paul Jackson @ 2006-04-12 20:55 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: clameter, linux-mm, ak

> Thanks, Paul.  But, I wonder, do I even need to do this check at all?

Quite possibly you don't need that check.  I'm pretending to be on
vacation this week and avoiding thinking too hard ;).

Hmmm ... looking around for a bit ... Notice the other code that picks
off the mempolicy.zonelist when it needs to place a page under
MPOL_BIND:

/* Return a zonelist representing a mempolicy */
static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
{
        int nd;

        switch (policy->policy) {
        case MPOL_PREFERRED:
                ...
                break;
        case MPOL_BIND:
                /* Lower zones don't get a policy applied */
                /* Careful: current->mems_allowed might have moved */
                if (gfp_zone(gfp) >= policy_zone)
                        if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
                                return policy->v.zonelist;

My recollection is that it goes like this.  If someone sets a mempolicy
MPOL_BIND on some nodes, and then someone moves that task to a cpuset
that doesn't include any of the BIND nodes, then that MPOL_BIND
mempolicy is basically ignored, until such time as if/when the task
fixes it to refer to some nodes currently allowed by its cpuset.

So my 'cpuset_zone_allowed()' suggestion was wrong.

Looks like you need a 'cpuset_zonelist_valid_mems_allowed()' check, and
if that fails, behave as if they had a default mempolicy, ignoring the
MPOL_BIND setting.
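
A rough user-space sketch of that fallback, under stated assumptions: node
sets are modeled as plain bitmasks, the function and type names are
hypothetical, and "default mempolicy" is modeled simply as "the page
belongs on the local node".  It is not the kernel implementation, only an
illustration of the decision described above.

```c
#include <assert.h>

/*
 * Hypothetical user-space model; not kernel code.
 * Node sets are bitmasks: bit n set means node n is in the set.
 */
typedef unsigned long nodemask_model_t;

/*
 * Models the idea behind cpuset_zonelist_valid_mems_allowed():
 * the bind set is usable if it shares at least one node with the cpuset.
 */
static int bind_set_valid(nodemask_model_t bind_nodes,
			  nodemask_model_t cpuset_allowed)
{
	return (bind_nodes & cpuset_allowed) != 0;
}

/*
 * Return 1 if a page on cur_nid should be treated as misplaced.
 * With a valid bind set, the page is fine on any node in that set.
 * Otherwise the bind policy is ignored and default placement is
 * assumed (modeled here as: the page is fine only on local_nid).
 */
static int bind_misplaced(nodemask_model_t bind_nodes,
			  nodemask_model_t cpuset_allowed,
			  int cur_nid, int local_nid)
{
	if (bind_set_valid(bind_nodes, cpuset_allowed))
		return !(bind_nodes & (1UL << cur_nid));
	return cur_nid != local_nid;
}
```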

Note that I still haven't given any thought to the larger issues that
others have considered for this patch ... back to vacation.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 3/6] Migrate-on-fault - migrate misplaced page
  2006-04-07 20:18 [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview Lee Schermerhorn
  2006-04-07 20:22 ` [PATCH 2.6.17-rc1-mm1 1/6] Migrate-on-fault - separate unmap from radix tree replace Lee Schermerhorn
  2006-04-07 20:23 ` [PATCH 2.6.17-rc1-mm1 2/6] Migrate-on-fault - check for misplaced page Lee Schermerhorn
@ 2006-04-07 20:23 ` Lee Schermerhorn
  2006-04-11 18:32   ` Christoph Lameter
  2006-04-07 20:24 ` [PATCH 2.6.17-rc1-mm1 4/6] Migrate-on-fault - handle misplaced anon pages Lee Schermerhorn
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-07 20:23 UTC (permalink / raw)
  To: linux-mm

Migrate-on-fault prototype 3/6 V0.2 - migrate misplaced page

V0.2 - reworked against 2.6.17-rc1-mm1 with Christoph's migration
       code reorg.

This patch adds a new function migrate_misplaced_page() to mm/migrate.c
[where most of the other page migration functions live] to migrate a
misplaced page to a specified destination node.  This function will be
called from the fault path.  Because we already know the destination
node for the migration, we allocate pages directly rather than rerunning
the policy node computation in alloc_page_vma().

migrate_misplaced_page() will need to put a single page [the old or
new page] back to the lru, so this patch also splits out a
"putback_lru_page()" function from move_lru_page().  This avoids having
to insert the page on a dummy list just to have move_lru_page() delete
it from the list.

The patch also updates the address space migratepage operations to
skip the attempt to unmap the page, if the operation is being called
in the fault path to migrate a misplaced page.  To accomplish this, I
added an additional boolean [int] argument "faulting" to the migratepage
op functions.   This argument also adjusts the # of expected page
references because we have an extra count when called in the fault
path.
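
As a minimal illustration of that reference accounting, here is a
user-space sketch.  The helper name expected_page_refs() is hypothetical;
only the arithmetic mirrors the nr_refs initializers in the diff below
[2 + !!faulting in migrate_page(), 3 + !!faulting in
buffer_migrate_page()].

```c
#include <assert.h>

/*
 * Hypothetical user-space model; not kernel code.
 * Number of page references the migratepage op expects to see:
 *   page cache + current caller [+ buffers] [+ fault path]
 */
static int expected_page_refs(int has_buffers, int faulting)
{
	int nr_refs = 2;		/* cache + current */

	if (has_buffers)
		nr_refs++;		/* buffer heads hold one more */
	nr_refs += !!faulting;		/* any non-zero value counts once */
	return nr_refs;
}
```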

The migratepage operations now use the migrate_page_try_to_unmap()
and migrate_page_replace_in_mapping() functions separated out in a
previous patch.

I believe that we can now delete migrate_page_remove_references().
But, I haven't, yet.

Finally, the patch adds the static inline function 
check_migrate_misplaced_page() to mempolicy.h to check whether a
page has no mappings [no pte references] and is "misplaced"--i.e.
on a node different from what the policy for (vma, address) dictates.
In this case, the page will be migrated to the "correct" node, if
possible.  If migration fails for any reason, we just use the
original page.
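
The decision flow just described can be modeled in user-space C as
follows.  This is a sketch under stated assumptions, not the patch's code:
struct page_model and check_migrate_model() are hypothetical names, the
policy lookup is reduced to a single target node id, and a NULL newpage
stands in for any allocation or migration failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the few page fields that matter. */
struct page_model {
	int mapcount;	/* number of pte mappings */
	int writeback;	/* non-zero while under writeback */
	int nid;	/* node the page currently resides on */
};

/*
 * Return the page the fault handler should continue with.
 * policy_nid is the node the policy selects for (vma, address);
 * newpage is a page already allocated on that node, or NULL on failure.
 */
static struct page_model *check_migrate_model(struct page_model *page,
					      int policy_nid,
					      struct page_model *newpage)
{
	/* Only unmapped, stable pages are candidates. */
	if (page->mapcount || page->writeback)
		return page;

	/* Already on the node the policy wants: nothing to do. */
	if (page->nid == policy_nid)
		return page;

	/* Migration failed for any reason: just use the original page. */
	if (!newpage)
		return page;

	return newpage;
}
```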

Note that when NUMA or MIGRATION is not configured, the
check_migrate_misplaced_page() function becomes a macro that
evaluates to its page argument.

Subsequent patches will hook the fault handlers [anon, file, shmem]
to check_migrate_misplaced_page().

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.17-rc1-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/mempolicy.h	2006-04-05 10:14:39.000000000 -0400
+++ linux-2.6.17-rc1-mm1/include/linux/mempolicy.h	2006-04-05 10:14:41.000000000 -0400
@@ -34,6 +34,7 @@
 #include <linux/rbtree.h>
 #include <linux/spinlock.h>
 #include <linux/nodemask.h>
+#include <linux/migrate.h>
 
 struct vm_area_struct;
 
@@ -184,6 +185,31 @@ int do_migrate_pages(struct mm_struct *m
 int mpol_misplaced(struct page *, struct vm_area_struct *,
 		unsigned long, int *);
 
+#if defined(CONFIG_MIGRATION) && defined(_LINUX_MM_H)
+/*
+ * called in fault path, where _LINUX_MM_H will be defined.
+ * page is uptodate and locked.
+ */
+static inline struct page *check_migrate_misplaced_page(struct page *page,
+		struct vm_area_struct *vma, unsigned long address)
+{
+	int polnid, misplaced;
+
+	if (page_mapcount(page) || PageWriteback(page))
+		return page;
+
+	misplaced = mpol_misplaced(page, vma, address, &polnid);
+	if (!misplaced)
+		return page;
+
+	return migrate_misplaced_page(page, polnid,
+			misplaced_is_interleaved(misplaced));
+
+}
+#else
+#define check_migrate_misplaced_page(page, vma, address) (page)
+#endif
+
 extern void *cpuset_being_rebound;	/* Trigger mpol_copy vma rebind */
 
 #else
@@ -279,6 +305,8 @@ static inline int do_migrate_pages(struc
 	return 0;
 }
 
+#define check_migrate_misplaced_page(page, vma, address) (page)
+
 static inline void check_highest_zone(int k)
 {
 }
Index: linux-2.6.17-rc1-mm1/include/linux/fs.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/fs.h	2006-04-05 10:14:36.000000000 -0400
+++ linux-2.6.17-rc1-mm1/include/linux/fs.h	2006-04-05 10:14:41.000000000 -0400
@@ -373,7 +373,7 @@ struct address_space_operations {
 	struct page* (*get_xip_page)(struct address_space *, sector_t,
 			int);
 	/* migrate the contents of a page to the specified target */
-	int (*migratepage) (struct page *, struct page *);
+	int (*migratepage) (struct page *, struct page *, int);
 };
 
 struct backing_dev_info;
@@ -1760,7 +1760,7 @@ extern void simple_release_fs(struct vfs
 extern ssize_t simple_read_from_buffer(void __user *, size_t, loff_t *, const void *, size_t);
 
 #ifdef CONFIG_MIGRATION
-extern int buffer_migrate_page(struct page *, struct page *);
+extern int buffer_migrate_page(struct page *, struct page *, int);
 #else
 #define buffer_migrate_page NULL
 #endif
Index: linux-2.6.17-rc1-mm1/include/linux/gfp.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/gfp.h	2006-03-20 00:53:29.000000000 -0500
+++ linux-2.6.17-rc1-mm1/include/linux/gfp.h	2006-04-05 10:14:41.000000000 -0400
@@ -131,10 +131,13 @@ alloc_pages(gfp_t gfp_mask, unsigned int
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr);
+extern struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
+					unsigned nid);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
 #define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define alloc_page_interleave(gfp_mask, order, nid) alloc_pages(gfp_mask, 0)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
Index: linux-2.6.17-rc1-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.17-rc1-mm1.orig/mm/mempolicy.c	2006-04-05 10:14:39.000000000 -0400
+++ linux-2.6.17-rc1-mm1/mm/mempolicy.c	2006-04-05 10:14:41.000000000 -0400
@@ -1179,7 +1179,7 @@ struct zonelist *huge_zonelist(struct vm
 
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
-static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
+struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 					unsigned nid)
 {
 	struct zonelist *zl;
Index: linux-2.6.17-rc1-mm1/mm/migrate.c
===================================================================
--- linux-2.6.17-rc1-mm1.orig/mm/migrate.c	2006-04-05 10:14:38.000000000 -0400
+++ linux-2.6.17-rc1-mm1/mm/migrate.c	2006-04-05 10:14:41.000000000 -0400
@@ -59,7 +59,8 @@ int isolate_lru_page(struct page *page, 
 				del_page_from_active_list(zone, page);
 			else
 				del_page_from_inactive_list(zone, page);
-			list_add_tail(&page->lru, pagelist);
+			if (pagelist)
+				list_add_tail(&page->lru, pagelist);
 		}
 		spin_unlock_irq(&zone->lru_lock);
 	}
@@ -88,9 +89,14 @@ int migrate_prep(void)
 	return 0;
 }
 
-static inline void move_to_lru(struct page *page)
+/*
+ * Put a single page back to appropriate lru list via cache.
+ * Removes page reference added by isolate_lru_page, but
+ * the lru_cache_add*() will add a temporary ref while the
+ * pages resides in the cache [pagevec].
+ */
+static inline void putback_lru_page(struct page *page)
 {
-	list_del(&page->lru);
 	if (PageActive(page)) {
 		/*
 		 * lru_cache_add_active checks that
@@ -104,6 +110,12 @@ static inline void move_to_lru(struct pa
 	put_page(page);
 }
 
+static inline void move_to_lru(struct page *page)
+{
+	list_del(&page->lru);
+	putback_lru_page(page);
+}
+
 /*
  * Add isolated pages on the list back to the LRU.
  *
@@ -125,7 +137,7 @@ int putback_lru_pages(struct list_head *
 /*
  * Non migratable page
  */
-int fail_migrate_page(struct page *newpage, struct page *page)
+int fail_migrate_page(struct page *newpage, struct page *page, int faulting)
 {
 	return -EIO;
 }
@@ -335,29 +347,35 @@ EXPORT_SYMBOL(migrate_page_copy);
  *
  * Pages are locked upon entry and exit.
  */
-int migrate_page(struct page *newpage, struct page *page)
+int migrate_page(struct page *newpage, struct page *page, int faulting)
 {
-	int rc;
-	int nr_refs = 2;	/* cache + current */
+	int rc = 0;
+	/*
+	 * nr_refs:  cache + current [+ fault path]
+	 */
+	int nr_refs = 2 + !!faulting;
 
 	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
 
-	rc = migrate_page_unmap_and_replace(newpage, page, nr_refs);
-
+	if (!faulting)
+		rc = migrate_page_try_to_unmap(page, nr_refs);
+	if (!rc)
+		rc = migrate_page_replace_in_mapping(newpage, page, nr_refs);
 	if (rc)
 		return rc;
 
 	migrate_page_copy(newpage, page);
 
 	/*
-	 * Remove auxiliary swap entries and replace
-	 * them with real ptes.
+	 * If we are not already in the fault path, remove auxiliary swap
+	 * entries and replace them with real ptes.
 	 *
 	 * Note that a real pte entry will allow processes that are not
 	 * waiting on the page lock to use the new page via the page tables
 	 * before the new page is unlocked.
 	 */
-	remove_from_swap(newpage);
+	if (!faulting)
+		remove_from_swap(newpage);
 	return 0;
 }
 EXPORT_SYMBOL(migrate_page);
@@ -468,7 +486,7 @@ redo:
 			 * own migration function. This is the most common
 			 * path for page migration.
 			 */
-			rc = mapping->a_ops->migratepage(newpage, page);
+			rc = mapping->a_ops->migratepage(newpage, page, 0);
 			goto unlock_both;
                 }
 
@@ -498,7 +516,7 @@ redo:
 		 */
 		if (!page_has_buffers(page) ||
 		    try_to_release_page(page, GFP_KERNEL)) {
-			rc = migrate_page(newpage, page);
+			rc = migrate_page(newpage, page, 0);
 			goto unlock_both;
 		}
 
@@ -555,23 +573,28 @@ next:
  * if the underlying filesystem guarantees that no other references to "page"
  * exist.
  */
-int buffer_migrate_page(struct page *newpage, struct page *page)
+int buffer_migrate_page(struct page *newpage, struct page *page, int faulting)
 {
 	struct address_space *mapping = page->mapping;
 	struct buffer_head *bh, *head;
-	int nr_refs = 3;	/* cache + bufs + current */
-	int rc;
+	int rc = 0;
+	/*
+	 * nr_refs:  cache + bufs + current [+ fault path]
+	 */
+	int nr_refs = 3 + !!faulting;
 
 	if (!mapping)
 		return -EAGAIN;
 
 	if (!page_has_buffers(page))
-		return migrate_page(newpage, page);
+		return migrate_page(newpage, page, faulting);
 
 	head = page_buffers(page);
 
- 	rc = migrate_page_unmap_and_replace(newpage, page, nr_refs);
-
+	if (!faulting)
+		rc = migrate_page_try_to_unmap(page, nr_refs);
+	if (!rc)
+		rc = migrate_page_replace_in_mapping(newpage, page, nr_refs);
 	if (rc)
 		return rc;
 
@@ -683,3 +706,71 @@ out:
 		nr_pages++;
 	return nr_pages;
 }
+
+/*
+ * attempt to migrate a misplaced page to the specified destination
+ * node.  Page is already unmapped and locked by caller. Anon pages
+ * are in the swap cache.
+ *
+ * page refs on entry/exit:  cache + fault path [+ bufs]
+ */
+struct page *migrate_misplaced_page(struct page *page,
+				 int dest, int interleaved)
+{
+	struct page *newpage;
+	struct address_space *mapping = page_mapping(page);
+	unsigned int gfp;
+
+//TODO:  explicit assertions during debug/testing
+	BUG_ON(!PageLocked(page));
+	BUG_ON(page_mapcount(page));
+	if (PageAnon(page))
+		BUG_ON(!PageSwapCache(page));
+	BUG_ON(!mapping);
+
+	if (isolate_lru_page(page, NULL)) /* incrs page count on success */
+		goto out_nolru;	/* we lost */
+
+//TODO:  or just use GFP_HIGHUSER ?
+	gfp = (unsigned int)mapping_gfp_mask(mapping);
+
+	if (interleaved)
+		newpage = alloc_page_interleave(gfp, 0, dest);
+	else
+		newpage = alloc_pages_node(dest, gfp, 0);
+
+	if (!newpage)
+		goto out;	/* give up */
+	lock_page(newpage);
+
+	if (mapping->a_ops->migratepage) {
+		/*
+		 * migrating in fault path.
+		 * migrate a_op transfers cache [+ buf] refs
+		 */
+		int rc = mapping->a_ops->migratepage(newpage, page, 1);
+		if (rc) {
+			unlock_page(newpage);
+			__free_page(newpage);
+		} else {
+			get_page(newpage);	/* add isolate_lru_page ref */
+			put_page(page);		/* drop       "          "  */
+
+			unlock_page(page);
+			put_page(page);		/* drop fault path ref & free */
+
+			page = newpage;
+		}
+		goto out;
+	} else {
+//TODO:  for now, give up if no address space migrate op.
+//       later, handle w/ default mechanism, like migrate_pages?
+	}
+
+out:
+	putback_lru_page(page);		/* drops a page ref */
+
+out_nolru:
+	return page;
+
+}
Index: linux-2.6.17-rc1-mm1/include/linux/migrate.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/migrate.h	2006-04-05 10:14:38.000000000 -0400
+++ linux-2.6.17-rc1-mm1/include/linux/migrate.h	2006-04-05 10:14:41.000000000 -0400
@@ -7,7 +7,7 @@
 #ifdef CONFIG_MIGRATION
 extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
 extern int putback_lru_pages(struct list_head *l);
-extern int migrate_page(struct page *, struct page *);
+extern int migrate_page(struct page *, struct page *, int);
 extern void migrate_page_copy(struct page *, struct page *);
 extern int migrate_page_try_to_unmap(struct page *, int);
 extern int migrate_page_replace_in_mapping(struct page *, struct page *, int);
@@ -16,7 +16,8 @@ extern int migrate_pages(struct list_hea
 		struct list_head *moved, struct list_head *failed);
 extern int migrate_pages_to(struct list_head *pagelist,
 			struct vm_area_struct *vma, int dest);
-extern int fail_migrate_page(struct page *, struct page *);
+struct page *migrate_misplaced_page(struct page *, int, int);
+extern int fail_migrate_page(struct page *, struct page *, int);
 
 extern int migrate_prep(void);
 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 3/6] Migrate-on-fault - migrate misplaced page
  2006-04-07 20:23 ` [PATCH 2.6.17-rc1-mm1 3/6] Migrate-on-fault - migrate " Lee Schermerhorn
@ 2006-04-11 18:32   ` Christoph Lameter
  2006-04-11 19:51     ` Lee Schermerhorn
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2006-04-11 18:32 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm

On Fri, 7 Apr 2006, Lee Schermerhorn wrote:

> @@ -184,6 +185,31 @@ int do_migrate_pages(struct mm_struct *m
>  int mpol_misplaced(struct page *, struct vm_area_struct *,
>  		unsigned long, int *);
>  
> +#if defined(CONFIG_MIGRATION) && defined(_LINUX_MM_H)

Remove the defined(_LINUX_MM_H). This is pretty obscure.

> Index: linux-2.6.17-rc1-mm1/mm/migrate.c
> ===================================================================
> --- linux-2.6.17-rc1-mm1.orig/mm/migrate.c	2006-04-05 10:14:38.000000000 -0400
> +++ linux-2.6.17-rc1-mm1/mm/migrate.c	2006-04-05 10:14:41.000000000 -0400
> @@ -59,7 +59,8 @@ int isolate_lru_page(struct page *page, 
>  				del_page_from_active_list(zone, page);
>  			else
>  				del_page_from_inactive_list(zone, page);
> -			list_add_tail(&page->lru, pagelist);
> +			if (pagelist)
> +				list_add_tail(&page->lru, pagelist);
>  		}
>  		spin_unlock_irq(&zone->lru_lock);
>  	}

isolate lru page can be called without a pagelist now?


> -int fail_migrate_page(struct page *newpage, struct page *page)
> +int fail_migrate_page(struct page *newpage, struct page *page, int faulting)

I do not think the faulting parameter is needed. mapcount == 0 if 
we are faulting on an unmapped page. try_to_unmap() will do nothing or 
you can check for mapcount.

>  	 *
>  	 * Note that a real pte entry will allow processes that are not
>  	 * waiting on the page lock to use the new page via the page tables
>  	 * before the new page is unlocked.
>  	 */
> -	remove_from_swap(newpage);
> +	if (!faulting)
> +		remove_from_swap(newpage);
>  	return 0;

If we are faulting then there is nothing to remove. remove_from_swap would 
do nothing.

> +out:
> +	putback_lru_page(page);		/* drops a page ref */

We already have a ref from the fault path and do not need another one 
in isolate_lru_page(), right?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 3/6] Migrate-on-fault - migrate misplaced page
  2006-04-11 18:32   ` Christoph Lameter
@ 2006-04-11 19:51     ` Lee Schermerhorn
  0 siblings, 0 replies; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-11 19:51 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm

On Tue, 2006-04-11 at 11:32 -0700, Christoph Lameter wrote:
> On Fri, 7 Apr 2006, Lee Schermerhorn wrote:
> 
> > @@ -184,6 +185,31 @@ int do_migrate_pages(struct mm_struct *m
> >  int mpol_misplaced(struct page *, struct vm_area_struct *,
> >  		unsigned long, int *);
> >  
> > +#if defined(CONFIG_MIGRATION) && defined(_LINUX_MM_H)

I'll go back and look at it.  I may have to come up with another way to
avoid header dependency hell.  I did it this way because, as I recall,
in one place where mempolicy.h is included, I encountered errors because
some of the functions used by check_migrate_misplaced_page() were
not available because <linux/mm.h> was not included there.  That seems
like a pretty heavy-weight header to be dragging into a source file to
satisfy a dependency in a static-inline function that the source file
doesn't even care about.  

Maybe check_migrate_misplaced_page() belongs in some other header.
mempolicy.h seemed like the right place.  And I wanted to put it in a
header so that I could turn it into a no-op when migrate-on-fault is not
enabled.  That seems to be the preferred method, when possible, rather
than #ifdefs in the .c's.  

> 
> Remove the defined(_LINUX_MM_H). This is pretty obscure.
> 
> > Index: linux-2.6.17-rc1-mm1/mm/migrate.c
> > ===================================================================
> > --- linux-2.6.17-rc1-mm1.orig/mm/migrate.c	2006-04-05 10:14:38.000000000 -0400
> > +++ linux-2.6.17-rc1-mm1/mm/migrate.c	2006-04-05 10:14:41.000000000 -0400
> > @@ -59,7 +59,8 @@ int isolate_lru_page(struct page *page, 
> >  				del_page_from_active_list(zone, page);
> >  			else
> >  				del_page_from_inactive_list(zone, page);
> > -			list_add_tail(&page->lru, pagelist);
> > +			if (pagelist)
> > +				list_add_tail(&page->lru, pagelist);
> >  		}
> >  		spin_unlock_irq(&zone->lru_lock);
> >  	}
> 
> isolate lru page can be called without a pagelist now?

I'll take a look.  I thought I still had to do something here to get the
interface that I needed.

> 
> 
> > -int fail_migrate_page(struct page *newpage, struct page *page)
> > +int fail_migrate_page(struct page *newpage, struct page *page, int faulting)
> 
> I do not think the faulting parameter is needed. mapcount == 0 if 
> we are faulting on an unmapped page. try_to_unmap() will do nothing or 
> you can check for mapcount.

I also need to allow another reference count for the fault path.  I much
prefer having the explicit indication and think it less likely to cause
breakage down the line than counting on zero map count here.  

> 
> >  	 *
> >  	 * Note that a real pte entry will allow processes that are not
> >  	 * waiting on the page lock to use the new page via the page tables
> >  	 * before the new page is unlocked.
> >  	 */
> > -	remove_from_swap(newpage);
> > +	if (!faulting)
> > +		remove_from_swap(newpage);
> >  	return 0;
> 
> If we are faulting then there is nothing to remove. remove_from_swap would 
> do nothing.

Not true.  The page is in the swap cache [or migration cache, if we ever
get one].  And, the faulting task may not have the only pte reference to
that page.  I don't want remove_from_swap() walking the reverse map and
replacing any other ptes in the fault path of another task--as we've
discussed before.

> 
> > +out:
> > +	putback_lru_page(page);		/* drops a page ref */
> 
> We already have a ref from the fault patch and do not need another one 
> in isolate_lru page right?
> 

No, we don't need another one.  I only did the isolate_lru_page() so
that the page being migrated in the fault path is in the same state as
pages being migrated directly--i.e., we hold them isolated from the lru.
Then, they can only be found via the cache.  For anon pages, this means
via faulting tasks' ptes.  For file backed and shmem pages [when/if we
hook them up], they could also be found by faulting on the appropriate
file page.  However, in those cases, we'll already have the page locked,
so subsequent faulters will be held off until migration is complete.
Then they'll need to check and do the right thing [as discussed in a
different thread].

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 4/6] Migrate-on-fault - handle misplaced anon pages
  2006-04-07 20:18 [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview Lee Schermerhorn
                   ` (2 preceding siblings ...)
  2006-04-07 20:23 ` [PATCH 2.6.17-rc1-mm1 3/6] Migrate-on-fault - migrate " Lee Schermerhorn
@ 2006-04-07 20:24 ` Lee Schermerhorn
  2006-04-07 20:26 ` [PATCH 2.6.17-rc1-mm1 5/6] Migrate-on-fault - add MPOL_MF_LAZY Lee Schermerhorn
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-07 20:24 UTC (permalink / raw)
  To: linux-mm

Migrate-on-fault prototype 4/6 V0.2 - handle misplaced anon pages

V0.2 -- refreshed against 2.6.16-mm2 [no changes for 2.6.17-rc1-mm1]

This patch simply hooks the anon page fault handler [do_swap_page()]
to check for and migrate misplaced pages.

File and shmem fault paths will be addressed in separate patches.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.16-mm2/mm/memory.c
===================================================================
--- linux-2.6.16-mm2.orig/mm/memory.c	2006-03-28 12:00:46.000000000 -0500
+++ linux-2.6.16-mm2/mm/memory.c	2006-03-28 12:01:07.000000000 -0500
@@ -48,6 +48,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/mempolicy.h>	/* check_migrate_misplaced_page() */
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -1924,6 +1925,8 @@ again:
 
 	/* The page isn't present yet, go ahead with the fault. */
 
+	page = check_migrate_misplaced_page(page, vma, address);
+
 	inc_mm_counter(mm, anon_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && can_share_swap_page(page)) {



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 5/6] Migrate-on-fault - add MPOL_MF_LAZY
  2006-04-07 20:18 [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview Lee Schermerhorn
                   ` (3 preceding siblings ...)
  2006-04-07 20:24 ` [PATCH 2.6.17-rc1-mm1 4/6] Migrate-on-fault - handle misplaced anon pages Lee Schermerhorn
@ 2006-04-07 20:26 ` Lee Schermerhorn
  2006-04-07 20:27 ` [PATCH 2.6.17-rc1-mm1 6/6] Migrate-on-fault - add MPOL_NOOP Lee Schermerhorn
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-07 20:26 UTC (permalink / raw)
  To: linux-mm

Migrate-on-fault prototype 5/6 V0.2 - add MPOL_MF_LAZY

V0.2 - reworked against 2.6.17-rc1 with Christoph's migration code
       reorg.  Moved migrate_pages_unmap_only() to mm/migrate.c

This patch adds another mbind() flag to request "lazy migration".
The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are simply unmapped from the calling task's page table ['_MOVE]
or from all referencing page tables [_MOVE_ALL].  Anon pages will first
be added to the swap [or migration?] cache, if necessary.  The pages
will be migrated in the fault path on "first touch", if the policy
dictates at that time.

"Lazy Migration" will allow testing of migrate-on-fault.  If useful to
applications, it could become a permanent part of the mbind() interface. 
Yes, it does duplicate some of the code in migrate_pages().  However,
lazy migration doesn't need to do all that migrate_pages() does, nor
does it need to try as hard.  Trying to weave both functions into
migrate_pages() could probably be done, but that could  result in fairly
ugly code. 
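
As a rough illustration of the flag handling described above, here is a
small userspace sketch.  The flag values and MPOL_MF_VALID are copied from
the mempolicy.h hunk below; mbind_flags_valid() models the first check in
do_mbind(), while mbind_wants_lazy() and lazy_flags_example() are
hypothetical helpers that exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined by this patch's mempolicy.h hunk. */
#define MPOL_MF_STRICT   (1 << 0)
#define MPOL_MF_MOVE     (1 << 1)
#define MPOL_MF_MOVE_ALL (1 << 2)
#define MPOL_MF_LAZY     (1 << 3)
#define MPOL_MF_VALID \
	(MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL | MPOL_MF_LAZY)

/* Models the flag check at the top of do_mbind(): unknown bits are rejected. */
static bool mbind_flags_valid(unsigned long flags)
{
	return (flags & ~(unsigned long)MPOL_MF_VALID) == 0;
}

/* Hypothetical helper: LAZY only modifies a _MOVE or _MOVE_ALL request. */
static bool mbind_wants_lazy(unsigned long flags)
{
	return (flags & MPOL_MF_LAZY) &&
	       (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL));
}

void lazy_flags_example(void)
{
	assert(mbind_flags_valid(MPOL_MF_MOVE | MPOL_MF_LAZY));
	assert(!mbind_flags_valid(1UL << 4));    /* internal flags start here */
	assert(!mbind_wants_lazy(MPOL_MF_LAZY)); /* LAZY alone selects no pages */
}
```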

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.17-rc1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/mempolicy.h	2006-04-03 12:10:45.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/mempolicy.h	2006-04-03 12:12:30.000000000 -0400
@@ -22,9 +22,14 @@
 
 /* Flags for mbind */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3)	/* Internal flags start here */
+#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform
+				   to policy */
+#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
+#define MPOL_MF_LAZY	(1<<3)	/* Modifies '_MOVE:  lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
+
+#define MPOL_MF_VALID \
+	(MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL | MPOL_MF_LAZY)
 
 #ifdef __KERNEL__
 
@@ -180,7 +185,7 @@ int do_migrate_pages(struct mm_struct *m
  */
 #define MPOL_MIGRATE_NONINTERLEAVED 1
 #define MPOL_MIGRATE_INTERLEAVED 2
-#define misplaced_is_interleaved(pol) (MPOL_MIGRATE_INTERLEAVED - 1)
+#define misplaced_is_interleaved(pol) (pol == MPOL_MIGRATE_INTERLEAVED)
 
 int mpol_misplaced(struct page *, struct vm_area_struct *,
 		unsigned long, int *);
Index: linux-2.6.17-rc1/mm/mempolicy.c
===================================================================
--- linux-2.6.17-rc1.orig/mm/mempolicy.c	2006-04-03 12:10:45.000000000 -0400
+++ linux-2.6.17-rc1/mm/mempolicy.c	2006-04-03 12:12:30.000000000 -0400
@@ -718,9 +718,7 @@ long do_mbind(unsigned long start, unsig
 	int err;
 	LIST_HEAD(pagelist);
 
-	if ((flags & ~(unsigned long)(MPOL_MF_STRICT |
-				      MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
-	    || mode > MPOL_MAX)
+	if ((flags & ~(unsigned long)MPOL_MF_VALID) || mode > MPOL_MAX)
 		return -EINVAL;
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
 		return -EPERM;
@@ -766,8 +764,13 @@ long do_mbind(unsigned long start, unsig
 
 		err = mbind_range(vma, start, end, new);
 
-		if (!list_empty(&pagelist))
-			nr_failed = migrate_pages_to(&pagelist, vma, -1);
+		if (!list_empty(&pagelist)) {
+			if (!(flags & MPOL_MF_LAZY))
+				nr_failed = migrate_pages_to(&pagelist,
+								 vma, -1);
+			else
+				nr_failed = migrate_pages_unmap_only(&pagelist);
+		}
 
 		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;
Index: linux-2.6.17-rc1/include/linux/migrate.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/migrate.h	2006-04-03 12:10:45.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/migrate.h	2006-04-03 12:12:30.000000000 -0400
@@ -17,6 +17,7 @@ extern int migrate_pages(struct list_hea
 extern int migrate_pages_to(struct list_head *pagelist,
 			struct vm_area_struct *vma, int dest);
 struct page *migrate_misplaced_page(struct page *, int, int);
+extern int migrate_pages_unmap_only(struct list_head *);
 extern int fail_migrate_page(struct page *, struct page *, int);
 
 extern int migrate_prep(void);
Index: linux-2.6.17-rc1/mm/migrate.c
===================================================================
--- linux-2.6.17-rc1.orig/mm/migrate.c	2006-04-03 12:10:45.000000000 -0400
+++ linux-2.6.17-rc1/mm/migrate.c	2006-04-03 12:12:30.000000000 -0400
@@ -567,6 +567,66 @@ next:
 
 	return nr_failed + retry;
 }
+/*
+ * Lazy migration:  just unmap pages, moving anon pages to swap cache, if
+ * necessary.  Migration will occur, if policy dictates, when a task faults
+ * an unmapped page back into its page table--i.e., on "first touch" after
+ * unmapping.
+ *
+ * Successfully unmapped pages will be put back on the LRU.  Failed pages
+ * will be left on the argument pagelist for the caller to handle, like
+ * migrate_pages[_to]().
+ */
+int migrate_pages_unmap_only(struct list_head *pagelist)
+{
+	struct page *page;
+	struct page *page2;
+	int nr_failed = 0, nr_unmapped = 0;
+
+	list_for_each_entry_safe(page, page2, pagelist, lru) {
+		int nr_refs;
+
+		/*
+		 * Give up easily.  We are being lazy.
+		 */
+		if (page_count(page) == 1 || TestSetPageLocked(page))
+			continue;
+
+		if (PageWriteback(page))
+			goto unlock_page;
+
+		if (PageAnon(page) && !PageSwapCache(page)) {
+			if (!add_to_swap(page, GFP_KERNEL)) {
+				goto unlock_page;
+			}
+		}
+
+		if (page_has_buffers(page))
+			nr_refs = 3;	/* cache, bufs and current */
+		else
+			nr_refs = 2;	/* cache and current */
+
+		if (migrate_page_try_to_unmap(page, nr_refs)) {
+			++nr_failed;
+			goto unlock_page;
+		}
+
+		++nr_unmapped;
+		move_to_lru(page);
+
+	unlock_page:
+		unlock_page(page);
+
+	}
+
+	/*
+	 * so fault path can find them on lru
+	 */
+	if (nr_unmapped)
+		lru_add_drain_all();
+
+	return nr_failed;
+}
 
 /*
  * Migration function for pages with buffers. This function can only be used



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 6/6] Migrate-on-fault - add MPOL_NOOP
  2006-04-07 20:18 [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview Lee Schermerhorn
                   ` (4 preceding siblings ...)
  2006-04-07 20:26 ` [PATCH 2.6.17-rc1-mm1 5/6] Migrate-on-fault - add MPOL_MF_LAZY Lee Schermerhorn
@ 2006-04-07 20:27 ` Lee Schermerhorn
  2006-04-09  7:01 ` [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview Andi Kleen
  2006-04-11 18:46 ` Christoph Lameter
  7 siblings, 0 replies; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-07 20:27 UTC (permalink / raw)
  To: linux-mm

Migrate-on-fault prototype 6/6 V0.2 - add MPOL_NOOP

V0.2 -	this patch is new in the V0.2 series.  No change between
	2.6.16-mm1 and 2.6.17-rc1-mm1

This patch augments the MPOL_MF_LAZY feature by adding a "NOOP"
policy to mbind().  When the NOOP policy is used with the 'MOVE
and 'LAZY flags, mbind() [check_range()] will walk the specified
range and unmap eligible pages so that they will be migrated on
next touch.
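
The decision logic that the do_mbind() hunk below ends up with can be
modeled in a few lines of userspace C.  This is only a sketch for reading
the diff: the enum, mbind_pagelist_action(), mbind_installs_policy() and
noop_example() are invented names, and only the MPOL_* values and the two
conditions are taken from the patches.

```c
#include <assert.h>

#define MPOL_BIND     2         /* from the existing mempolicy.h */
#define MPOL_NOOP     4         /* value added by this patch */
#define MPOL_MF_MOVE  (1 << 1)
#define MPOL_MF_LAZY  (1 << 3)  /* value added by patch 5/6 */

enum mbind_action {
	ACT_MIGRATE_NOW,	/* stands for migrate_pages_to() */
	ACT_UNMAP_ONLY,		/* stands for migrate_pages_unmap_only() */
};

/* Which routine handles a non-empty pagelist, following the last hunk. */
static enum mbind_action mbind_pagelist_action(int mode, unsigned long flags)
{
	if (mode != MPOL_NOOP && !(flags & MPOL_MF_LAZY))
		return ACT_MIGRATE_NOW;
	return ACT_UNMAP_ONLY;
}

/* mbind_range() installs a new policy; MPOL_NOOP skips that step. */
static int mbind_installs_policy(int mode)
{
	return mode != MPOL_NOOP;
}

void noop_example(void)
{
	assert(mbind_pagelist_action(MPOL_BIND, MPOL_MF_MOVE) == ACT_MIGRATE_NOW);
	assert(mbind_pagelist_action(MPOL_NOOP, MPOL_MF_MOVE | MPOL_MF_LAZY)
	       == ACT_UNMAP_ONLY);
	assert(!mbind_installs_policy(MPOL_NOOP));
}
```

Reading the condition this way also shows that, as the hunk is written,
MPOL_NOOP takes the unmap-only path even when MPOL_MF_LAZY is not set.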

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.16-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.16-mm1.orig/include/linux/mempolicy.h	2006-03-23 16:49:16.000000000 -0500
+++ linux-2.6.16-mm1/include/linux/mempolicy.h	2006-03-23 16:49:22.000000000 -0500
@@ -13,8 +13,9 @@
 #define MPOL_PREFERRED	1
 #define MPOL_BIND	2
 #define MPOL_INTERLEAVE	3
+#define MPOL_NOOP	4	/* retain existing policy for range */
 
-#define MPOL_MAX MPOL_INTERLEAVE
+#define MPOL_MAX MPOL_NOOP
 
 /* Flags for get_mem_policy */
 #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
Index: linux-2.6.16-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.16-mm1.orig/mm/mempolicy.c	2006-03-23 16:49:16.000000000 -0500
+++ linux-2.6.16-mm1/mm/mempolicy.c	2006-03-23 16:49:22.000000000 -0500
@@ -117,6 +117,7 @@ static int mpol_check_policy(int mode, n
 
 	switch (mode) {
 	case MPOL_DEFAULT:
+	case MPOL_NOOP:
 		if (!empty)
 			return -EINVAL;
 		break;
@@ -163,7 +164,7 @@ static struct mempolicy *mpol_new(int mo
 	struct mempolicy *policy;
 
 	PDprintk("setting mode %d nodes[0] %lx\n", mode, nodes_addr(*nodes)[0]);
-	if (mode == MPOL_DEFAULT)
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
 		return NULL;
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
 	if (!policy)
@@ -726,7 +727,7 @@ long do_mbind(unsigned long start, unsig
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT)
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -762,10 +763,13 @@ long do_mbind(unsigned long start, unsig
 	if (!IS_ERR(vma)) {
 		int nr_failed = 0;
 
-		err = mbind_range(vma, start, end, new);
+		if (mode == MPOL_NOOP)
+			err = 0;
+		else
+			err = mbind_range(vma, start, end, new);
 
 		if (!list_empty(&pagelist)) {
-			if (!(flags & MPOL_MF_LAZY))
+			if (mode != MPOL_NOOP && !(flags & MPOL_MF_LAZY))
 				nr_failed = migrate_pages_to(&pagelist,
 								 vma, -1);
 			else



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview
  2006-04-07 20:18 [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview Lee Schermerhorn
                   ` (5 preceding siblings ...)
  2006-04-07 20:27 ` [PATCH 2.6.17-rc1-mm1 6/6] Migrate-on-fault - add MPOL_NOOP Lee Schermerhorn
@ 2006-04-09  7:01 ` Andi Kleen
  2006-04-11 18:46 ` Christoph Lameter
  7 siblings, 0 replies; 25+ messages in thread
From: Andi Kleen @ 2006-04-09  7:01 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm

On Friday 07 April 2006 22:18, Lee Schermerhorn wrote:
> This is a reposting of the migrate-on-fault series, against
> the 2.6.17-rc1-mm1 tree.  I would love to get some feedback on 
> these patches--especially regarding criteria for getting them
> into the mm tree for wider testing.

The biggest criterion would be some numbers showing that it actually
helps for something and doesn't break performance in other workloads.

For me it seems rather risky.

-Andi


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview
  2006-04-07 20:18 [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview Lee Schermerhorn
                   ` (6 preceding siblings ...)
  2006-04-09  7:01 ` [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview Andi Kleen
@ 2006-04-11 18:46 ` Christoph Lameter
  2006-04-11 18:52   ` Andi Kleen
  2006-04-11 20:40   ` Lee Schermerhorn
  7 siblings, 2 replies; 25+ messages in thread
From: Christoph Lameter @ 2006-04-11 18:46 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: linux-mm, ak

On Fri, 7 Apr 2006, Lee Schermerhorn wrote:

> Note that this mechanism can be used to migrate page cache pages that 
> were read in earlier, are no longer referenced, but are about to be
> used by a new task on another node from where the page resides.  The
> same mechanism can be used to pull anon pages along with a task when
> the load balancer decides to move it to another node.  However, that
> will require a bit more mechanism, and is the subject of another
> patch series.

The fundamental assumption in these patchsets is that memory policies are 
permanently used to control allocation. However, allocation policies may 
be temporarily set to various allocation methods in order to allocate 
certain memory structures in special ways. The policy may be reset later 
and not reflect the allocation wanted for a certain structure when the 
opportunistic or lazy migration takes place.

Maybe we can use the memory policies in the way you suggest (my 
MPOL_MF_MOVE_* flags certainly do the same but they are set by the coder 
of the user space application who is aware of what is going on !). 

But there are significant components missing to make this work the right 
way. In particular file backed pages are not allocated according to vma 
policy. Only anonymous pages are. So this would only work correctly for 
anonymous pages that are explicitly shifted onto swap. 

I think there will be mostly correct behavior for file backed pages. Most 
processes do not use policies at all and so this will move the file 
backed page to the node where the process is executing. If the process 
frequently refers to the page then the effort that was expended is 
justified. However, if the page is not frequently referenced then the 
effort required to migrate the page was not justified.

For some processes this has the potential to actually decrease 
performance; for other processes that are using memory policies to 
control the allocation of structures it may allocate the page in a way 
that the application tried to avoid because it may be using the wrong 
memory policy.

Then there is the known deficiency that memory policies do not work with 
file backed pages. I surely wish that this would be addressed first.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview
  2006-04-11 18:46 ` Christoph Lameter
@ 2006-04-11 18:52   ` Andi Kleen
  2006-04-11 19:03     ` Jack Steiner
  2006-04-11 20:40     ` Lee Schermerhorn
  2006-04-11 20:40   ` Lee Schermerhorn
  1 sibling, 2 replies; 25+ messages in thread
From: Andi Kleen @ 2006-04-11 18:52 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Lee Schermerhorn, linux-mm, ak

On Tuesday 11 April 2006 20:46, Christoph Lameter wrote:
> However, if the page is not frequently references then the 
> effort required to migrate the page was not justified.

I have my doubts the whole thing is really worthwhile. It probably 
would at least need some statistics to only do this for frequent
accesses, but I don't know where to put this data.

At least it would be a serious research project to figure out 
a good way to do automatic migration. From what I was told by
people who tried this (e.g. in Irix) it is really hard and
didn't turn out to be a win for them.

The better way is to just provide the infrastructure
and let batch managers or the programs themselves take care of migration.

That was the whole idea behind NUMA API - some problems 
are too hard to figure out automatically by the kernel, so 
allow the user or application to give it a hand.

And frankly the defaults we have currently are not that bad,
perhaps with some small tweaks (e.g. I still like the idea
of interleaving file cache by default) 

-Andi


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview
  2006-04-11 18:52   ` Andi Kleen
@ 2006-04-11 19:03     ` Jack Steiner
  2006-04-11 20:40       ` Lee Schermerhorn
  2006-04-11 20:40     ` Lee Schermerhorn
  1 sibling, 1 reply; 25+ messages in thread
From: Jack Steiner @ 2006-04-11 19:03 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Lameter, Lee Schermerhorn, linux-mm, ak

On Tue, Apr 11, 2006 at 08:52:49PM +0200, Andi Kleen wrote:
> On Tuesday 11 April 2006 20:46, Christoph Lameter wrote:
> > However, if the page is not frequently references then the 
> > effort required to migrate the page was not justified.
> 
> I have my doubts the whole thing is really worthwhile. It probably 
> would at least need some statistics to only do this for frequent
> accesses, but I don't know where to put this data.

Agree. And a way to disable the migration-on-fault.

> 
> At least it would be a serious research project to figure out 
> a good way to do automatic migration. From what I was told by
> people who tried this (e.g. in Irix) it is really hard and
> didn't turn out to be a win for them.

IRIX had hardware support for counting offnode vs. onnode references
to a page and sending interrupts when migration appeared to be beneficial.

We intended to use this info to migrate pages.  Unfortunately, we were 
never able to demonstrate a performance benefit of migrating pages. 
The overhead always exceeded the benefit except in a very small number
of carefully selected benchmarks.


> 
> The better way is to just provide the infrastructure
> and let batch managers or program itselves take care of migration.
> 
> That was the whole idea behind NUMA API - some problems 
> are too hard to figure out automatically by the kernel, so 
> allow the user or application to give it a hand.
> 
> And frankly the defaults we have currently are not that bad,
> perhaps with some small tweaks (e.g. i'm still liking the idea
> of interleaving file cache by default) 
> 
> -Andi
> 

-- 
Jack


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview
  2006-04-11 19:03     ` Jack Steiner
@ 2006-04-11 20:40       ` Lee Schermerhorn
  2006-04-11 22:12         ` Jack Steiner
  0 siblings, 1 reply; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-11 20:40 UTC (permalink / raw)
  To: Jack Steiner; +Cc: Andi Kleen, Christoph Lameter, linux-mm, ak

On Tue, 2006-04-11 at 14:03 -0500, Jack Steiner wrote:
> On Tue, Apr 11, 2006 at 08:52:49PM +0200, Andi Kleen wrote:
> > On Tuesday 11 April 2006 20:46, Christoph Lameter wrote:
> > > However, if the page is not frequently references then the 
> > > effort required to migrate the page was not justified.
> > 
> > I have my doubts the whole thing is really worthwhile. It probably 
> > would at least need some statistics to only do this for frequent
> > accesses, but I don't know where to put this data.
> 
> Agree. And a way to disable the migration-on-fault.

I know.  I don't have such a control in the current series.  I've
thought of adding one, but I think this might be better as a per task
control.  And, to set those, I kind of like Paul Jackson's cpuset
methodology--like "memory_spread_page".  A "migrate_on_fault" cpuset
attribute would turn this on for tasks in the cpuset.  Default should
probably be off.

Might even want separate controls for migrating anon, file backed, shmem
pages on fault.  Depends on how the policy for file backed pages gets
sorted out.
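
To make the idea concrete, a per-cpuset control with separate bits per page
type could be modeled roughly as below.  Everything here is hypothetical:
no such attribute exists in the current series, and the names MOF_*,
struct cpuset_model, migrate_on_fault_enabled() and cpuset_example() are
made up for this userspace sketch.

```c
#include <assert.h>
#include <stdbool.h>

enum page_kind { PAGE_ANON, PAGE_FILE, PAGE_SHMEM };

/* Hypothetical per-cpuset attribute bits; none of these exist in the series. */
#define MOF_ANON   (1u << 0)
#define MOF_FILE   (1u << 1)
#define MOF_SHMEM  (1u << 2)

struct cpuset_model {
	unsigned int migrate_on_fault;	/* 0 by default: feature off */
};

/* Would a fault on this kind of page be allowed to migrate it? */
static bool migrate_on_fault_enabled(const struct cpuset_model *cs,
				     enum page_kind kind)
{
	switch (kind) {
	case PAGE_ANON:
		return (cs->migrate_on_fault & MOF_ANON) != 0;
	case PAGE_FILE:
		return (cs->migrate_on_fault & MOF_FILE) != 0;
	case PAGE_SHMEM:
		return (cs->migrate_on_fault & MOF_SHMEM) != 0;
	}
	return false;
}

void cpuset_example(void)
{
	struct cpuset_model off = { 0 };
	struct cpuset_model anon_only = { MOF_ANON };

	assert(!migrate_on_fault_enabled(&off, PAGE_ANON));
	assert(migrate_on_fault_enabled(&anon_only, PAGE_ANON));
	assert(!migrate_on_fault_enabled(&anon_only, PAGE_FILE));
}
```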

> 
> > 
> > At least it would be a serious research project to figure out 
> > a good way to do automatic migration. From what I was told by
> > people who tried this (e.g. in Irix) it is really hard and
> > didn't turn out to be a win for them.
> 
> IRIX had hardware support for counting offnode vs. onnode references
> to a page & sending interrupts when migration appeared to be beneficial
> 
> We intended to use this info to migrate pages.  Unfortunately, we were 
> never able to demonstrate a performance benefit of migrating pages. 
> The overhead always exceeded the cost except in a very small number
> of carefully selected benchmarks.

This was the work that I heard about.  I don't think I'm trying to do
that.  The migrate-on-fault series just migrates a cached page that is
eligible [mapcount==0] and misplaced.  Seems like a good time to
evaluate the policy.  If enabled, of course.
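
Stated as code, the eligibility test amounts to something like the sketch
below.  It is a userspace model under stated assumptions, not the series'
implementation: struct page_model and should_migrate_on_fault() are
invented names, and policy_nid stands for whatever node the policy lookup
(mpol_misplaced() in the series) would select in the faulting context.

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-in for the few page properties the check looks at. */
struct page_model {
	int mapcount;		/* number of ptes currently mapping the page */
	bool under_writeback;	/* page is not "stable" while this is set */
	int nid;		/* node the page resides on */
};

/* Migrate only an unmapped, stable page that sits on the wrong node. */
static bool should_migrate_on_fault(const struct page_model *pg, int policy_nid)
{
	if (pg->mapcount != 0)
		return false;	/* still mapped elsewhere: leave it alone */
	if (pg->under_writeback)
		return false;	/* not stable */
	return pg->nid != policy_nid;
}

void eligibility_example(void)
{
	struct page_model cached = { 0, false, 1 };
	struct page_model mapped = { 2, false, 1 };

	assert(should_migrate_on_fault(&cached, 0));	/* misplaced, unmapped */
	assert(!should_migrate_on_fault(&cached, 1));	/* already on the right node */
	assert(!should_migrate_on_fault(&mapped, 0));	/* other tasks map it */
}
```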

I do think that one could find some interesting research in measuring
the cost of migrating pages vs the benefits of having them local.  One
might want to track per node RSS [as Eric Focht and Martin Bligh, maybe
others, have previously attempted] and prefer those with smaller memory
footprints to move offnode during load balancing.  One might choose to
move larger tasks less frequently based on the cost of migrating and/or
remote accesses.

We plan on doing a lot of this measurement and testing.  But, I needed
the basic infrastructure [migrate on fault, auto-migrate] in place to do
the testing.  I've already seen benefit in how the system settles back
into a "good" [if not optimum] state after transient perturbations with
the multithread streams benchmark results that I posted with V0.1 of the
auto-migration series.  No fancy page use statistics.  Unmapping pages
controlled by default policy when the task migrates to a new node caused
the tasks to pull pages they were using close to themselves.  For a
multi-threaded OMP job, this tended to do the right thing [to achieve
maximum throughput] without any explicit placement.  Just start 'em up,
give 'em a good swift kick, and let them fall back into place.  

Real soon now, I'll take some time out from tracking the bleeding edge
and run some more benchmarks on our NUMA platforms, with and without
hardware interleaving, with and without these patches, ...    I'll, uh,
keep you posted ;-).

Lee


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview
  2006-04-11 20:40       ` Lee Schermerhorn
@ 2006-04-11 22:12         ` Jack Steiner
  0 siblings, 0 replies; 25+ messages in thread
From: Jack Steiner @ 2006-04-11 22:12 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Andi Kleen, Christoph Lameter, linux-mm, ak

On Tue, Apr 11, 2006 at 04:40:45PM -0400, Lee Schermerhorn wrote:
> On Tue, 2006-04-11 at 14:03 -0500, Jack Steiner wrote:
> > On Tue, Apr 11, 2006 at 08:52:49PM +0200, Andi Kleen wrote:
> > > On Tuesday 11 April 2006 20:46, Christoph Lameter wrote:
> > > > However, if the page is not frequently references then the 
> > > > effort required to migrate the page was not justified.
> > > 
> > > I have my doubts the whole thing is really worthwhile. It probably 
> > > would at least need some statistics to only do this for frequent
> > > accesses, but I don't know where to put this data.
> > 
> > Agree. And a way to disable the migration-on-fault.
> 
> I know.  I don't have such a control in the current series.  I've
> thought of adding one, but I think this might be better as a per task
> control.  And, to set those, I kind of like Paul Jackson's cpuset
> methodology--like "memory_spread_page".  A "migrate_on_fault" cpuset
> attribute would turn this on for tasks in the cpuset.  Default should
> probably be off.
> 
> Might even want separate controls for migrating anon, file backed, shmem
> pages on fault.  Depends on how the policy for file backed pages gets
> sorted out.

Agree. Adding the controls thru cpuset options seems like a good way 
to go. 


> 
> > 
> > > 
> > > At least it would be a serious research project to figure out 
> > > a good way to do automatic migration. From what I was told by
> > > people who tried this (e.g. in Irix) it is really hard and
> > > didn't turn out to be a win for them.
> > 
> > IRIX had hardware support for counting offnode vs. onnode references
> > to a page & sending interrupts when migration appeared to be beneficial
> > 
> > We intended to use this info to migrate pages.  Unfortunately, we were 
> > never able to demonstrate a performance benefit of migrating pages. 
> > The overhead always exceeded the cost except in a very small number
> > of carefully selected benchmarks.
> 
> This was the work that I heard about.  I don't think I'm trying to do
> that.  The migrate-on-fault series just migrates a cached page that is
> eligible [mapcount==0] and misplaced.  Seems like a good time to
> evaluate the policy.  If enabled, of course.

I realize that what you are doing is somewhat different - particularly in
the way that you decide to migrate a page. However, you still have
some of the same problems that we had on IRIX. If the
page is remote, it is not worth the cost to migrate the page unless
the app will take many cache misses to the page. At one extreme,
if the app is short lived or references only a portion of the page,
migrating the page may have no benefit. Even if the app is long
lived and references most of the page, many apps have a small cache
footprint & successfully keep the page in the cache. Again, there may
be no benefit of migration. 

OTOH, if the app is long lived OR has a big cache footprint, migration can
be a definite win. 
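
The tradeoff above can be written as a simple break-even test.  This is a
back-of-the-envelope sketch, not something measured on IRIX or Linux:
migration_pays_off() and break_even_example() are invented names, all
inputs are estimates the caller would have to supply, and the numbers in
the example are made up purely to exercise the function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Migration is worthwhile only if the latency saved over the expected
 * number of cache misses to the page exceeds the one-time migration cost.
 * All times are in nanoseconds; no overflow protection in this sketch.
 */
static bool migration_pays_off(uint64_t expected_misses,
			       uint64_t remote_ns, uint64_t local_ns,
			       uint64_t migrate_cost_ns)
{
	if (remote_ns <= local_ns)
		return false;	/* nothing to gain from moving the page */
	return expected_misses * (remote_ns - local_ns) > migrate_cost_ns;
}

void break_even_example(void)
{
	/* Made-up numbers: 100 ns saved per miss, 50000 ns to migrate. */
	assert(!migration_pays_off(10, 200, 100, 50000));   /* few misses */
	assert(migration_pays_off(1000, 200, 100, 50000));  /* many misses */
}
```

A short-lived app, or one that keeps the page in cache, corresponds to a
small expected_misses and falls on the "not worth it" side of the test.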


> 
> I do think that one could find some interesting research in measuring
> the cost of migrating pages vs the benefits of having them local. 

Yes! 

> One
> might want to track per node RSS [as Eric Focht and Martin Bligh, maybe
> others, have previously attempted] and prefer those with smaller memory
> footprints to move offnode during load balancing.  One might chose to
> move larger tasks less frequently based on the cost of migrating and/or
> remote accesses.
> 
> We plan on doing a lot of this measurement and testing.  But, I needed
> the basic infrastructure [migrate on fault, auto-migrate] in place to do
> the testing.  I've already seen benefit in how the system settles back
> into a "good" [if not optimum] state after transient perturbations with
> the multithread streams benchmark results that I posted with V0.1 of the
> auto-migration series.  No fancy page use statistics.  Unmapping pages
> controlled by default policy when the task migrates to a new node caused
> the tasks to pull pages they were using close to themselves.  For a
> multi-threaded OMP job, this tended to do the right thing [to achieve
> maximum throughput] without any explicit placement.  Just start'em up,
> give 'em a good swift kick, and let them fall back into place.  
> 
> Real soon now, I'll take some time out from tracking the bleeding edge
> and run some more benchmarks on our NUMA platforms, with and without
> hardware interleaving, with and without these patches, ...    I'll, uh,
> keep you posted ;-).
> 
> Lee

-- 
Jack


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview
  2006-04-11 18:52   ` Andi Kleen
  2006-04-11 19:03     ` Jack Steiner
@ 2006-04-11 20:40     ` Lee Schermerhorn
  1 sibling, 0 replies; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-11 20:40 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Lameter, linux-mm, ak

On Tue, 2006-04-11 at 20:52 +0200, Andi Kleen wrote:
> On Tuesday 11 April 2006 20:46, Christoph Lameter wrote:
> > However, if the page is not frequently referenced then the 
> > effort required to migrate the page was not justified.
> 
> I have my doubts the whole thing is really worthwhile. It probably 
> would at least need some statistics to only do this for frequent
> accesses, but I don't know where to put this data.
> 
> At least it would be a serious research project to figure out 
> a good way to do automatic migration. From what I was told by
> people who tried this (e.g. in Irix) it is really hard and
> didn't turn out to be a win for them.

My understanding is that it IS really hard to optimize this or try to
use statistics to do this.  Especially if your goal is to eke out the
last ounce of performance, as HPC apps are wont to do.  I'm not
interested in this.  

> 
> The better way is to just provide the infrastructure
> and let batch managers or the programs themselves take care of migration.

I agree, for that last ounce of performance.  But, I still think we can
do better [have on other systems] than just letting the scheduler move
tasks around a numa system with no attention to their memory locality.
Not everybody wants to be locking tasks down to prevent this, either.

> 
> That was the whole idea behind NUMA API - some problems 
> are too hard to figure out automatically by the kernel, so 
> allow the user or application to give it a hand.

I really don't want the kernel to have to figure too much out.  You KNOW
that when you move a task to a different node that either you're moving
away from some memory footprint or, if you're lucky, back close to some
earlier footprint.  How lucky you need to be to achieve the latter
depends on how many nodes you have and how badly the task's memory
footprint is spread around the nodes due to involuntary migration.
Migrate on fault provides the first piece of infrastructure to address
this.  

The first time a task touches a page that is not in memory, that task's
policies get to choose where the page goes.  Presumably, we go through
some amount of effort to get the page somewhere close to the task or
where it wants it.  We've got a lot of vm infrastructure in support of
this endeavor.  What migrate on fault does is allow that same decision
to be made when a task finds a cached page for which no other tasks
currently have translations [ptes].  Seems like a good time to
reevaluate this.  Now, arranging for a significant number of the task's
pages to be in that state is the subject of another patch series.
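
To make "that same decision" concrete, here is a minimal sketch of
re-evaluating a policy at fault time.  It is not code from the patches:
the `sketch_*` names and the two-mode policy are assumptions for
illustration only, far simpler than the kernel's real mempolicy
handling.  It shows why default policy can give different answers
depending on the node the faulting task is running on.

```c
#include <stdbool.h>

/* Hypothetical, simplified policy modes; NOT the kernel's struct mempolicy. */
enum sketch_mode { SKETCH_DEFAULT, SKETCH_PREFERRED };

struct sketch_policy {
	enum sketch_mode mode;
	int preferred_node;	/* consulted only for SKETCH_PREFERRED */
};

/* Node the policy selects for a task currently running on local_node. */
int sketch_policy_node(const struct sketch_policy *pol, int local_node)
{
	if (pol->mode == SKETCH_PREFERRED)
		return pol->preferred_node;
	/* Default policy: place the page local to the faulting task. */
	return local_node;
}

/* True if a page residing on page_node is not where the policy wants it. */
bool sketch_page_misplaced(const struct sketch_policy *pol,
			   int local_node, int page_node)
{
	return sketch_policy_node(pol, local_node) != page_node;
}
```

With default policy, a page on node 0 is misplaced for a task faulting
on node 2 but not for a task faulting on node 0.  That is why the check
is only made when no other task has the page mapped.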

> 
> And frankly the defaults we have currently are not that bad,
> perhaps with some small tweaks (e.g. I'm still liking the idea
> of interleaving file cache by default)

No, the defaults aren't bad for initial allocation.  But, they don't
prevent scheduling/load balancing from undoing all the good work done up
front.

Lee

* Re: [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview
  2006-04-11 18:46 ` Christoph Lameter
  2006-04-11 18:52   ` Andi Kleen
@ 2006-04-11 20:40   ` Lee Schermerhorn
  1 sibling, 0 replies; 25+ messages in thread
From: Lee Schermerhorn @ 2006-04-11 20:40 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, ak

On Tue, 2006-04-11 at 11:46 -0700, Christoph Lameter wrote:
> On Fri, 7 Apr 2006, Lee Schermerhorn wrote:
> 
> > Note that this mechanism can be used to migrate page cache pages that 
> > were read in earlier, are no longer referenced, but are about to be
> > used by a new task on another node from where the page resides.  The
> > same mechanism can be used to pull anon pages along with a task when
> > the load balancer decides to move it to another node.  However, that
> > will require a bit more mechanism, and is the subject of another
> > patch series.
> 
> The fundamental assumption in these patchsets is that memory policies are 
> permanently used to control allocation. However, allocation policies may 
> be temporarily set to various allocation methods in order to allocate 
> certain memory structures in special ways. The policy may be reset later 
> and not reflect the allocation wanted for a certain structure when the 
> opportunistic or lazy migration takes place.

Yes, that is the fundamental assumption.  That pages follow their
policies to the extent that the system is capable of enforcing this.  I
have always assumed that applications only played the games with
changing the policies the way you describe because of the limitations of
the current implementation.  If the system always did what you said vis
a vis the policy, then why change it to something that's not what you
want?

> 
> Maybe we can use the memory policies in the way you suggest (my 
> MPOL_MF_MOVE_* flags certainly do the same but they are set by the coder 
> of the user space application who is aware of what is going on !). 
> 
> But there are significant components missing to make this work the right 
> way. In particular file backed pages are not allocated according to vma 
> policy. Only anonymous pages are. So this would only work correctly for 
> anonymous pages that are explicitly shifted onto swap. 

Right.  You mentioned this in the prior mail and in off-list exchanges
we've had.  I agree.  IMO, this is another area where work could be
done.  I'd be willing to tackle that as part of this effort if I can
understand what it is that would be acceptable.

> 
> I think there will be mostly correct behavior for file backed pages. Most 
> processes do not use policies at all and so this will move the file 
> backed page to the node where the process is executing. If the process 
> frequently refers to the page then the effort that was expended is 
> justified. However, if the page is not frequently referenced then the 
> effort required to migrate the page was not justified.

Well, the migration wouldn't have occurred unless the task just happened
to touch the page at a point where 1) it's in the cache, 2) no tasks
have any pte's referencing the page [mapcount == 0] and 3) its location
does not follow applicable policy--WHATEVER that is.  This is similar to
what would happen for the first task to touch a page after it has been
evicted from the cache for some reason, right?
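
The three conditions above can be sketched as a single predicate.  This
is a standalone model under stated assumptions, not the check from the
patches: `struct sketch_page` and `sketch_should_migrate_on_fault()` are
hypothetical names, the node chosen by policy is passed in as a plain
integer, and the "no writeback in progress" stability check is left out
for brevity.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few page attributes the check needs. */
struct sketch_page {
	bool in_cache;	/* 1) page was found in the page/swap cache */
	int mapcount;	/* 2) number of ptes currently mapping the page */
	int node;	/* node where the page currently resides */
};

/*
 * Return true only when all three conditions hold: the page is cached,
 * no task has it mapped, and it is not on the node that the applicable
 * policy (whatever that is) selects.
 */
bool sketch_should_migrate_on_fault(const struct sketch_page *page,
				    int policy_node)
{
	if (!page->in_cache)
		return false;	/* not cached: normal allocation path applies */
	if (page->mapcount != 0)
		return false;	/* still mapped somewhere: leave it alone */
	return page->node != policy_node;	/* 3) misplaced per policy */
}
```

A page that is cached and unmapped but already on the policy's node
returns false, so a fault on a well-placed page costs only the check.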

> 
> For some processes this has the potential to actually decrease the 
> performance, for other processes that are using memory policies to 
> control the allocation of structures it may allocate the page in a way 
> that the application tried to avoid because it may be using the wrong 
> memory policy.

Probably true.  Just as migrating tasks away from their memory due to
scheduling load imbalances can decrease the performance of the affected
task and, possibly, tasks on the node where its left-behind memory
resides.  We need to ensure that users who have gone to a great deal of
trouble to lay out their application don't get burned.  However, I'd also
like to provide some benefit for applications that haven't been
carefully hand tuned/bound to the configuration.

> 
> Then there is the known deficiency that memory policies do not work with 
> file backed pages. I surely wish that this would be addressed first.

Without more information, I suspect that my approach to that may not be
what you had in mind.  I discussed some ideas in response to other
messages in this series.

Lee

end of thread, other threads:[~2006-04-12 20:55 UTC | newest]

Thread overview: 25+ messages
2006-04-07 20:18 [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview Lee Schermerhorn
2006-04-07 20:22 ` [PATCH 2.6.17-rc1-mm1 1/6] Migrate-on-fault - separate unmap from radix tree replace Lee Schermerhorn
2006-04-11 18:08   ` Christoph Lameter
2006-04-11 18:47     ` Lee Schermerhorn
2006-04-07 20:23 ` [PATCH 2.6.17-rc1-mm1 2/6] Migrate-on-fault - check for misplaced page Lee Schermerhorn
2006-04-11 18:21   ` Christoph Lameter
2006-04-11 19:28     ` Lee Schermerhorn
2006-04-11 19:33       ` Christoph Lameter
2006-04-12 16:43     ` Paul Jackson
2006-04-12 18:49       ` Lee Schermerhorn
2006-04-12 20:55         ` Paul Jackson
2006-04-07 20:23 ` [PATCH 2.6.17-rc1-mm1 3/6] Migrate-on-fault - migrate " Lee Schermerhorn
2006-04-11 18:32   ` Christoph Lameter
2006-04-11 19:51     ` Lee Schermerhorn
2006-04-07 20:24 ` [PATCH 2.6.17-rc1-mm1 4/6] Migrate-on-fault - handle misplaced anon pages Lee Schermerhorn
2006-04-07 20:26 ` [PATCH 2.6.17-rc1-mm1 5/6] Migrate-on-fault - add MPOL_MF_LAZY Lee Schermerhorn
2006-04-07 20:27 ` [PATCH 2.6.17-rc1-mm1 6/6] Migrate-on-fault - add MPOL_NOOP Lee Schermerhorn
2006-04-09  7:01 ` [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview Andi Kleen
2006-04-11 18:46 ` Christoph Lameter
2006-04-11 18:52   ` Andi Kleen
2006-04-11 19:03     ` Jack Steiner
2006-04-11 20:40       ` Lee Schermerhorn
2006-04-11 22:12         ` Jack Steiner
2006-04-11 20:40     ` Lee Schermerhorn
2006-04-11 20:40   ` Lee Schermerhorn
