linux-mm.kvack.org archive mirror
* [RFC] page migration: patches for later than 2.6.18
@ 2006-05-18 18:21 Christoph Lameter
  2006-05-18 18:21 ` [RFC 1/5] page migration: simplify migrate_pages() Christoph Lameter
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Christoph Lameter @ 2006-05-18 18:21 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, bls, jes, Lee Schermerhorn, Christoph Lameter, KAMEZAWA Hiroyuki

This is a selection of patches on top of 2.6.17-rc4-mm1 that may
address additional requirements such as

- Automatic page migration from user space.
- Support for migrating memory that has no struct pages.
- Moving pages to the correct locations in memory areas with
  MPOL_INTERLEAVE policy.

Plus it does a significant cleanup of the code. All of these
patches will require additional feedback before they can go in.
If any of this code is merged, it will probably be after 2.6.18.

A test program for page based migration may be found with the patches
on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .


* [RFC 1/5] page migration: simplify migrate_pages()
  2006-05-18 18:21 [RFC] page migration: patches for later than 2.6.18 Christoph Lameter
@ 2006-05-18 18:21 ` Christoph Lameter
  2006-05-18 18:21 ` [RFC 2/5] page migration: handle freeing of pages in migrate_pages() Christoph Lameter
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2006-05-18 18:21 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, bls, jes, Lee Schermerhorn, Christoph Lameter, KAMEZAWA Hiroyuki

page migration: Simplify migrate_pages()

Currently migrate_pages() is a mess with lots of gotos.
Extract two functions from migrate_pages() and get rid of the gotos.

Plus we can just unconditionally set the locked bit on the new page
since we are the only one holding a reference. The lock stops others
from accessing the page once we establish references to the new page.

Remove the list_del from move_to_lru in order to have finer control
over list processing.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-rc4-mm1/mm/migrate.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/mm/migrate.c	2006-05-15 15:40:13.214835974 -0700
+++ linux-2.6.17-rc4-mm1/mm/migrate.c	2006-05-18 09:41:59.814842493 -0700
@@ -84,7 +84,6 @@ int migrate_prep(void)
 
 static inline void move_to_lru(struct page *page)
 {
-	list_del(&page->lru);
 	if (PageActive(page)) {
 		/*
 		 * lru_cache_add_active checks that
@@ -110,6 +109,7 @@ int putback_lru_pages(struct list_head *
 	int count = 0;
 
 	list_for_each_entry_safe(page, page2, l, lru) {
+		list_del(&page->lru);
 		move_to_lru(page);
 		count++;
 	}
@@ -534,11 +534,108 @@ static int fallback_migrate_page(struct 
 }
 
 /*
+ * Move a page to a newly allocated page
+ * The page is locked and all ptes have been successfully removed.
+ *
+ * The new page will have replaced the old page if this function
+ * is successful.
+ */
+static int move_to_new_page(struct page *newpage, struct page *page)
+{
+	struct address_space *mapping;
+	int rc;
+
+	/*
+	 * Block others from accessing the page when we get around to
+	 * establishing additional references. We are the only one
+	 * holding a reference to the new page at this point.
+	 */
+	SetPageLocked(newpage);
+
+	/* Prepare mapping for the new page.*/
+	newpage->index = page->index;
+	newpage->mapping = page->mapping;
+
+	mapping = page_mapping(page);
+	if (!mapping)
+		rc = migrate_page(mapping, newpage, page);
+
+	else if (mapping->a_ops->migratepage)
+		/*
+		 * Most pages have a mapping and most filesystems
+		 * should provide a migration function. Anonymous
+		 * pages are part of swap space which also has its
+		 * own migration function. This is the most common
+		 * path for page migration.
+		 */
+		rc = mapping->a_ops->migratepage(mapping,
+						newpage, page);
+	else
+		rc = fallback_migrate_page(mapping, newpage, page);
+
+	if (!rc)
+		remove_migration_ptes(page, newpage);
+	else
+		newpage->mapping = NULL;
+
+	unlock_page(newpage);
+
+	return rc;
+}
+
+/*
+ * Obtain the lock on page, remove all ptes and migrate the page
+ * to the newly allocated page in newpage.
+ */
+static int unmap_and_move(struct page *newpage, struct page *page, int force)
+{
+	int rc = 0;
+
+	if (page_count(page) == 1)
+		/* page was freed from under us. So we are done. */
+		goto ret;
+
+	rc = -EAGAIN;
+	if (TestSetPageLocked(page)) {
+		if (!force)
+			goto ret;
+		lock_page(page);
+	}
+
+	if (PageWriteback(page)) {
+		if (!force)
+			goto unlock;
+		wait_on_page_writeback(page);
+	}
+
+	/*
+	 * Establish migration ptes or remove ptes
+	 */
+	if (try_to_unmap(page, 1) != SWAP_FAIL) {
+		if (!page_mapped(page))
+			rc = move_to_new_page(newpage, page);
+	} else
+		/* A vma has VM_LOCKED set -> permanent failure */
+		rc = -EPERM;
+
+	if (rc)
+		remove_migration_ptes(page, page);
+unlock:
+	unlock_page(page);
+ret:
+	if (rc != -EAGAIN) {
+		list_del(&newpage->lru);
+		move_to_lru(newpage);
+	}
+	return rc;
+}
+
+/*
  * migrate_pages
  *
  * Two lists are passed to this function. The first list
  * contains the pages isolated from the LRU to be migrated.
- * The second list contains new pages that the pages isolated
+ * The second list contains new pages that the isolated pages
  * can be moved to.
  *
  * The function returns after 10 attempts or if no pages
@@ -550,7 +647,7 @@ static int fallback_migrate_page(struct 
 int migrate_pages(struct list_head *from, struct list_head *to,
 		  struct list_head *moved, struct list_head *failed)
 {
-	int retry;
+	int retry = 1;
 	int nr_failed = 0;
 	int pass = 0;
 	struct page *page;
@@ -561,118 +658,33 @@ int migrate_pages(struct list_head *from
 	if (!swapwrite)
 		current->flags |= PF_SWAPWRITE;
 
-redo:
-	retry = 0;
-
-	list_for_each_entry_safe(page, page2, from, lru) {
-		struct page *newpage = NULL;
-		struct address_space *mapping;
-
-		cond_resched();
-
-		rc = 0;
-		if (page_count(page) == 1)
-			/* page was freed from under us. So we are done. */
-			goto next;
-
-		if (to && list_empty(to))
-			break;
-
-		/*
-		 * Skip locked pages during the first two passes to give the
-		 * functions holding the lock time to release the page. Later we
-		 * use lock_page() to have a higher chance of acquiring the
-		 * lock.
-		 */
-		rc = -EAGAIN;
-		if (pass > 2)
-			lock_page(page);
-		else
-			if (TestSetPageLocked(page))
-				goto next;
-
-		/*
-		 * Only wait on writeback if we have already done a pass where
-		 * we we may have triggered writeouts for lots of pages.
-		 */
-		if (pass > 0)
-			wait_on_page_writeback(page);
-		else
-			if (PageWriteback(page))
-				goto unlock_page;
+	for(pass = 0; pass < 10 && retry; pass++) {
+		retry = 0;
 
-		/*
-		 * Establish migration ptes or remove ptes
-		 */
-		rc = -EPERM;
-		if (try_to_unmap(page, 1) == SWAP_FAIL)
-			/* A vma has VM_LOCKED set -> permanent failure */
-			goto unlock_page;
-
-		rc = -EAGAIN;
-		if (page_mapped(page))
-			goto unlock_page;
-
-		newpage = lru_to_page(to);
-		lock_page(newpage);
-		/* Prepare mapping for the new page.*/
-		newpage->index = page->index;
-		newpage->mapping = page->mapping;
+		list_for_each_entry_safe(page, page2, from, lru) {
 
-		/*
-		 * Pages are properly locked and writeback is complete.
-		 * Try to migrate the page.
-		 */
-		mapping = page_mapping(page);
-		if (!mapping)
-			rc = migrate_page(mapping, newpage, page);
+			if (list_empty(to))
+				break;
 
-		else if (mapping->a_ops->migratepage)
-			/*
-			 * Most pages have a mapping and most filesystems
-			 * should provide a migration function. Anonymous
-			 * pages are part of swap space which also has its
-			 * own migration function. This is the most common
-			 * path for page migration.
-			 */
-			rc = mapping->a_ops->migratepage(mapping,
-							newpage, page);
-		else
-			rc = fallback_migrate_page(mapping, newpage, page);
-
-		if (!rc)
-			remove_migration_ptes(page, newpage);
-
-		unlock_page(newpage);
+			cond_resched();
 
-unlock_page:
-		if (rc)
-			remove_migration_ptes(page, page);
+			rc = unmap_and_move(lru_to_page(to), page, pass > 2);
 
-		unlock_page(page);
-
-next:
-		if (rc) {
-			if (newpage)
-				newpage->mapping = NULL;
-
-			if (rc == -EAGAIN)
+			switch(rc) {
+			case -EAGAIN:
 				retry++;
-			else {
+				break;
+			case 0:
+				list_move(&page->lru, moved);
+				break;
+			default:
 				/* Permanent failure */
 				list_move(&page->lru, failed);
 				nr_failed++;
+				break;
 			}
-		} else {
-			if (newpage) {
-				/* Successful migration. Return page to LRU */
-				move_to_lru(newpage);
-			}
-			list_move(&page->lru, moved);
 		}
 	}
-	if (retry && pass++ < 10)
-		goto redo;
 
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;



* [RFC 2/5] page migration: handle freeing of pages in migrate_pages()
  2006-05-18 18:21 [RFC] page migration: patches for later than 2.6.18 Christoph Lameter
  2006-05-18 18:21 ` [RFC 1/5] page migration: simplify migrate_pages() Christoph Lameter
@ 2006-05-18 18:21 ` Christoph Lameter
  2006-05-18 18:21 ` [RFC 3/5] page migration: use allocator function for migrate_pages() Christoph Lameter
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2006-05-18 18:21 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, bls, jes, Lee Schermerhorn, Christoph Lameter, KAMEZAWA Hiroyuki

Dispose of pages in migrate_pages()

Do not leave pages on the lists passed to migrate_pages(). It seems
that we will not need any postprocessing of pages. This simplifies
the handling of pages by the callers of migrate_pages().

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-rc4-mm1/mm/migrate.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/mm/migrate.c	2006-05-18 09:41:59.814842493 -0700
+++ linux-2.6.17-rc4-mm1/mm/migrate.c	2006-05-18 09:44:09.438655508 -0700
@@ -624,6 +624,15 @@ unlock:
 	unlock_page(page);
 ret:
 	if (rc != -EAGAIN) {
+ 		/*
+ 		 * A page that has been migrated has all references
+ 		 * removed and will be freed. A page that has not been
+ 		 * migrated will have kept its references and be
+ 		 * restored.
+ 		 */
+ 		list_del(&page->lru);
+ 		move_to_lru(page);
+
 		list_del(&newpage->lru);
 		move_to_lru(newpage);
 	}
@@ -640,12 +649,12 @@ ret:
  *
  * The function returns after 10 attempts or if no pages
  * are movable anymore because to has become empty
- * or no retryable pages exist anymore.
+ * or no retryable pages exist anymore. All pages will be
+ * returned to the LRU or freed.
  *
- * Return: Number of pages not migrated when "to" ran empty.
+ * Return: Number of pages not migrated.
  */
-int migrate_pages(struct list_head *from, struct list_head *to,
-		  struct list_head *moved, struct list_head *failed)
+int migrate_pages(struct list_head *from, struct list_head *to)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -675,11 +684,9 @@ int migrate_pages(struct list_head *from
 				retry++;
 				break;
 			case 0:
-				list_move(&page->lru, moved);
 				break;
 			default:
 				/* Permanent failure */
-				list_move(&page->lru, failed);
 				nr_failed++;
 				break;
 			}
@@ -689,6 +696,7 @@ int migrate_pages(struct list_head *from
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
 
+	putback_lru_pages(from);
 	return nr_failed + retry;
 }
 
@@ -702,11 +710,10 @@ int migrate_pages_to(struct list_head *p
 			struct vm_area_struct *vma, int dest)
 {
 	LIST_HEAD(newlist);
-	LIST_HEAD(moved);
-	LIST_HEAD(failed);
 	int err = 0;
 	unsigned long offset = 0;
 	int nr_pages;
+	int nr_failed = 0;
 	struct page *page;
 	struct list_head *p;
 
@@ -740,26 +747,17 @@ redo:
 		if (nr_pages > MIGRATE_CHUNK_SIZE)
 			break;
 	}
-	err = migrate_pages(pagelist, &newlist, &moved, &failed);
+	err = migrate_pages(pagelist, &newlist);
 
-	putback_lru_pages(&moved);	/* Call release pages instead ?? */
-
-	if (err >= 0 && list_empty(&newlist) && !list_empty(pagelist))
-		goto redo;
-out:
-	/* Return leftover allocated pages */
-	while (!list_empty(&newlist)) {
-		page = list_entry(newlist.next, struct page, lru);
-		list_del(&page->lru);
-		__free_page(page);
+	if (err >= 0) {
+		nr_failed += err;
+		if (list_empty(&newlist) && !list_empty(pagelist))
+			goto redo;
 	}
-	list_splice(&failed, pagelist);
-	if (err < 0)
-		return err;
+out:
 
 	/* Calculate number of leftover pages */
-	nr_pages = 0;
 	list_for_each(p, pagelist)
-		nr_pages++;
-	return nr_pages;
+		nr_failed++;
+	return nr_failed;
 }
Index: linux-2.6.17-rc4-mm1/include/linux/migrate.h
===================================================================
--- linux-2.6.17-rc4-mm1.orig/include/linux/migrate.h	2006-05-15 15:40:12.349655322 -0700
+++ linux-2.6.17-rc4-mm1/include/linux/migrate.h	2006-05-18 09:42:45.070830541 -0700
@@ -8,8 +8,7 @@ extern int isolate_lru_page(struct page 
 extern int putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
-extern int migrate_pages(struct list_head *l, struct list_head *t,
-		struct list_head *moved, struct list_head *failed);
+extern int migrate_pages(struct list_head *l, struct list_head *t);
 extern int migrate_pages_to(struct list_head *pagelist,
 			struct vm_area_struct *vma, int dest);
 extern int fail_migrate_page(struct address_space *,
@@ -22,8 +21,8 @@ extern int migrate_prep(void);
 static inline int isolate_lru_page(struct page *p, struct list_head *list)
 					{ return -ENOSYS; }
 static inline int putback_lru_pages(struct list_head *l) { return 0; }
-static inline int migrate_pages(struct list_head *l, struct list_head *t,
-	struct list_head *moved, struct list_head *failed) { return -ENOSYS; }
+static inline int migrate_pages(struct list_head *l, struct list_head *t)
+					{ return -ENOSYS; }
 
 static inline int migrate_pages_to(struct list_head *pagelist,
 			struct vm_area_struct *vma, int dest) { return 0; }
Index: linux-2.6.17-rc4-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/mm/mempolicy.c	2006-05-15 15:40:13.211906469 -0700
+++ linux-2.6.17-rc4-mm1/mm/mempolicy.c	2006-05-18 09:42:45.071807043 -0700
@@ -603,11 +603,8 @@ int migrate_to_node(struct mm_struct *mm
 	check_range(mm, mm->mmap->vm_start, TASK_SIZE, &nmask,
 			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
 
-	if (!list_empty(&pagelist)) {
+	if (!list_empty(&pagelist))
 		err = migrate_pages_to(&pagelist, NULL, dest);
-		if (!list_empty(&pagelist))
-			putback_lru_pages(&pagelist);
-	}
 	return err;
 }
 
@@ -773,9 +770,6 @@ long do_mbind(unsigned long start, unsig
 			err = -EIO;
 	}
 
-	if (!list_empty(&pagelist))
-		putback_lru_pages(&pagelist);
-
 	up_write(&mm->mmap_sem);
 	mpol_free(new);
 	return err;



* [RFC 3/5] page migration: use allocator function for migrate_pages()
  2006-05-18 18:21 [RFC] page migration: patches for later than 2.6.18 Christoph Lameter
  2006-05-18 18:21 ` [RFC 1/5] page migration: simplify migrate_pages() Christoph Lameter
  2006-05-18 18:21 ` [RFC 2/5] page migration: handle freeing of pages in migrate_pages() Christoph Lameter
@ 2006-05-18 18:21 ` Christoph Lameter
  2006-05-18 18:21 ` [RFC 4/5] page migration: Support moving of individual pages Christoph Lameter
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2006-05-18 18:21 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, bls, jes, Lee Schermerhorn, Christoph Lameter, KAMEZAWA Hiroyuki

Pass a function that allocates the target page to migrate_pages()

Instead of passing a list of new pages, pass a function that allocates
a new page. This allows the correct placement of MPOL_INTERLEAVE pages
during page migration. It also further simplifies the callers of
migrate_pages(). migrate_pages() becomes similar to migrate_pages_to(),
so drop migrate_pages_to(). The batching of new page allocations
becomes unnecessary.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-rc4-mm1/mm/migrate.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/mm/migrate.c	2006-05-18 09:44:09.438655508 -0700
+++ linux-2.6.17-rc4-mm1/mm/migrate.c	2006-05-18 09:56:43.766958108 -0700
@@ -28,9 +28,6 @@
 
 #include "internal.h"
 
-/* The maximum number of pages to take off the LRU for migration */
-#define MIGRATE_CHUNK_SIZE 256
-
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 /*
@@ -587,18 +584,23 @@ static int move_to_new_page(struct page 
  * Obtain the lock on page, remove all ptes and migrate the page
  * to the newly allocated page in newpage.
  */
-static int unmap_and_move(struct page *newpage, struct page *page, int force)
+static int unmap_and_move(new_page_t get_new_page, unsigned long private,
+			struct page *page, int force)
 {
 	int rc = 0;
+	struct page *newpage = get_new_page(page, private);
+
+	if (!newpage)
+		return -ENOMEM;
 
 	if (page_count(page) == 1)
 		/* page was freed from under us. So we are done. */
-		goto ret;
+		goto move_newpage;
 
 	rc = -EAGAIN;
 	if (TestSetPageLocked(page)) {
 		if (!force)
-			goto ret;
+			goto move_newpage;
 		lock_page(page);
 	}
 
@@ -622,7 +624,7 @@ static int unmap_and_move(struct page *n
 		remove_migration_ptes(page, page);
 unlock:
 	unlock_page(page);
-ret:
+
 	if (rc != -EAGAIN) {
  		/*
  		 * A page that has been migrated has all references
@@ -632,29 +634,33 @@ ret:
  		 */
  		list_del(&page->lru);
  		move_to_lru(page);
-
-		list_del(&newpage->lru);
-		move_to_lru(newpage);
 	}
+
+move_newpage:
+	/*
+	 * Move the new page to the LRU. If migration was not successful
+	 * then this will free the page.
+	 */
+	move_to_lru(newpage);
 	return rc;
 }
 
 /*
  * migrate_pages
  *
- * Two lists are passed to this function. The first list
- * contains the pages isolated from the LRU to be migrated.
- * The second list contains new pages that the isolated pages
- * can be moved to.
+ * The function takes one list of pages to migrate and a function
+ * that determines from the page to be migrated and the private data
+ * the target of the move and allocates the page.
  *
  * The function returns after 10 attempts or if no pages
  * are movable anymore because to has become empty
  * or no retryable pages exist anymore. All pages will be
  * returned to the LRU or freed.
  *
- * Return: Number of pages not migrated.
+ * Return: Number of pages not migrated or error code.
  */
-int migrate_pages(struct list_head *from, struct list_head *to)
+int migrate_pages(struct list_head *from,
+		new_page_t get_new_page, unsigned long private)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -671,15 +677,14 @@ int migrate_pages(struct list_head *from
 		retry = 0;
 
 		list_for_each_entry_safe(page, page2, from, lru) {
-
-			if (list_empty(to))
-				break;
-
 			cond_resched();
 
-			rc = unmap_and_move(lru_to_page(to), page, pass > 2);
+			rc = unmap_and_move(get_new_page, private,
+						page, pass > 2);
 
 			switch(rc) {
+			case -ENOMEM:
+				goto out;
 			case -EAGAIN:
 				retry++;
 				break;
@@ -692,72 +697,16 @@ int migrate_pages(struct list_head *from
 			}
 		}
 	}
-
+	rc = 0;
+out:
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
 
 	putback_lru_pages(from);
-	return nr_failed + retry;
-}
-
-/*
- * Migrate the list 'pagelist' of pages to a certain destination.
- *
- * Specify destination with either non-NULL vma or dest_node >= 0
- * Return the number of pages not migrated or error code
- */
-int migrate_pages_to(struct list_head *pagelist,
-			struct vm_area_struct *vma, int dest)
-{
-	LIST_HEAD(newlist);
-	int err = 0;
-	unsigned long offset = 0;
-	int nr_pages;
-	int nr_failed = 0;
-	struct page *page;
-	struct list_head *p;
-
-redo:
-	nr_pages = 0;
-	list_for_each(p, pagelist) {
-		if (vma) {
-			/*
-			 * The address passed to alloc_page_vma is used to
-			 * generate the proper interleave behavior. We fake
-			 * the address here by an increasing offset in order
-			 * to get the proper distribution of pages.
-			 *
-			 * No decision has been made as to which page
-			 * a certain old page is moved to so we cannot
-			 * specify the correct address.
-			 */
-			page = alloc_page_vma(GFP_HIGHUSER, vma,
-					offset + vma->vm_start);
-			offset += PAGE_SIZE;
-		}
-		else
-			page = alloc_pages_node(dest, GFP_HIGHUSER, 0);
-
-		if (!page) {
-			err = -ENOMEM;
-			goto out;
-		}
-		list_add_tail(&page->lru, &newlist);
-		nr_pages++;
-		if (nr_pages > MIGRATE_CHUNK_SIZE)
-			break;
-	}
-	err = migrate_pages(pagelist, &newlist);
 
-	if (err >= 0) {
-		nr_failed += err;
-		if (list_empty(&newlist) && !list_empty(pagelist))
-			goto redo;
-	}
-out:
+	if (rc)
+		return rc;
 
-	/* Calculate number of leftover pages */
-	list_for_each(p, pagelist)
-		nr_failed++;
-	return nr_failed;
+	return nr_failed + retry;
 }
+
Index: linux-2.6.17-rc4-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/mm/mempolicy.c	2006-05-18 09:42:45.071807043 -0700
+++ linux-2.6.17-rc4-mm1/mm/mempolicy.c	2006-05-18 09:48:12.491970088 -0700
@@ -87,6 +87,7 @@
 #include <linux/seq_file.h>
 #include <linux/proc_fs.h>
 #include <linux/migrate.h>
+#include <linux/rmap.h>
 
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
@@ -587,6 +588,11 @@ static void migrate_page_add(struct page
 		isolate_lru_page(page, pagelist);
 }
 
+static struct page *new_node_page(struct page *page, unsigned long node)
+{
+	return alloc_pages_node(node, GFP_HIGHUSER, 0);
+}
+
 /*
  * Migrate pages from one node to a target node.
  * Returns error or the number of pages not migrated.
@@ -604,7 +610,8 @@ int migrate_to_node(struct mm_struct *mm
 			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
 
 	if (!list_empty(&pagelist))
-		err = migrate_pages_to(&pagelist, NULL, dest);
+		err = migrate_pages(&pagelist, new_node_page, dest);
+
 	return err;
 }
 
@@ -691,6 +698,12 @@ int do_migrate_pages(struct mm_struct *m
 
 }
 
+static struct page *new_vma_page(struct page *page, unsigned long private)
+{
+	struct vm_area_struct *vma = (struct vm_area_struct *)private;
+
+	return alloc_page_vma(GFP_HIGHUSER, vma, page_address_in_vma(page, vma));
+}
 #else
 
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -703,6 +716,11 @@ int do_migrate_pages(struct mm_struct *m
 {
 	return -ENOSYS;
 }
+
+static struct page *new_vma_page(struct page *page, unsigned long private)
+{
+	return NULL;
+}
 #endif
 
 long do_mbind(unsigned long start, unsigned long len,
@@ -764,7 +782,8 @@ long do_mbind(unsigned long start, unsig
 		err = mbind_range(vma, start, end, new);
 
 		if (!list_empty(&pagelist))
-			nr_failed = migrate_pages_to(&pagelist, vma, -1);
+			nr_failed = migrate_pages(&pagelist, new_vma_page,
+						(unsigned long)vma);
 
 		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;
Index: linux-2.6.17-rc4-mm1/include/linux/migrate.h
===================================================================
--- linux-2.6.17-rc4-mm1.orig/include/linux/migrate.h	2006-05-18 09:42:45.070830541 -0700
+++ linux-2.6.17-rc4-mm1/include/linux/migrate.h	2006-05-18 09:48:12.493923092 -0700
@@ -3,14 +3,15 @@
 
 #include <linux/mm.h>
 
+typedef struct page *new_page_t(struct page *, unsigned long private);
+
 #ifdef CONFIG_MIGRATION
 extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
 extern int putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
-extern int migrate_pages(struct list_head *l, struct list_head *t);
-extern int migrate_pages_to(struct list_head *pagelist,
-			struct vm_area_struct *vma, int dest);
+extern int migrate_pages(struct list_head *l, new_page_t x, unsigned long);
+
 extern int fail_migrate_page(struct address_space *,
 			struct page *, struct page *);
 
@@ -21,8 +22,8 @@ extern int migrate_prep(void);
 static inline int isolate_lru_page(struct page *p, struct list_head *list)
 					{ return -ENOSYS; }
 static inline int putback_lru_pages(struct list_head *l) { return 0; }
-static inline int migrate_pages(struct list_head *l, struct list_head *t)
-					{ return -ENOSYS; }
+static inline int migrate_pages(struct list_head *l, new_page_t x,
+		unsigned long private) { return -ENOSYS; }
 
 static inline int migrate_pages_to(struct list_head *pagelist,
 			struct vm_area_struct *vma, int dest) { return 0; }



* [RFC 4/5] page migration: Support moving of individual pages
  2006-05-18 18:21 [RFC] page migration: patches for later than 2.6.18 Christoph Lameter
                   ` (2 preceding siblings ...)
  2006-05-18 18:21 ` [RFC 3/5] page migration: use allocator function for migrate_pages() Christoph Lameter
@ 2006-05-18 18:21 ` Christoph Lameter
  2006-05-19 19:27   ` Andrew Morton
  2006-05-18 18:21 ` [RFC 5/5] page migration: Detailed status for " Christoph Lameter
  2006-05-18 18:21 ` [RFC 6/6] page migration: Support a vma migration function Christoph Lameter
  5 siblings, 1 reply; 12+ messages in thread
From: Christoph Lameter @ 2006-05-18 18:21 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, bls, jes, Lee Schermerhorn, Christoph Lameter, KAMEZAWA Hiroyuki

Add support for sys_move_pages()

move_pages() is used to move individual pages of a process. The function can
be used to determine the location of pages and to move them onto the desired
node. move_pages() returns status information for each page.

int move_pages(pid, number_of_pages_to_move,
		addresses_of_pages[],
		nodes[] or NULL,
		status[],
		flags);

The addresses_of_pages argument is an array of unsigned longs pointing
to the pages to be moved.

The nodes array contains the node numbers that the pages should be moved
to. If a NULL is passed then no pages are moved but the status array is
updated.

The status array contains a status indicating the result of the migration
operation or the current state of the page if nodes == NULL.

Possible page states:

0..MAX_NUMNODES		The page is now on the indicated node.

-ENOENT		Page is not present or target node is not present

-EPERM		Page is mapped by multiple processes and can only
		be moved if MPOL_MF_MOVE_ALL is specified. Or the
		target node is not allowed by the current cpuset.
		Or the page has been mlocked by a process/driver and
		cannot be moved.

-EBUSY		Page is busy and cannot be moved. Try again later.

-EFAULT		Cannot read node information from node array.

-ENOMEM		Unable to allocate memory on target node.

-EIO		Unable to write back page. Page must be written
		back since the page is dirty and the filesystem does not
		provide a migration function.

-EINVAL		Filesystem does not provide a migration function but also
		has no ability to write back pages.

The flags parameter indicates what types of pages to move:

MPOL_MF_MOVE	Move pages that are only mapped by the process.
MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
		Requires sufficient capabilities.

Possible return codes from move_pages()

-EINVAL		flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
		to migrate pages in a kernel thread.

-EPERM		MPOL_MF_MOVE_ALL specified without sufficient privileges,
		or an attempt to move a process belonging to another user.

-ESRCH		Process does not exist.

-ENOMEM		Not enough memory to allocate control array.

-EFAULT		Parameters could not be accessed.

Test program for this may be found with the patches
on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm1

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-rc4-mm1/mm/migrate.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/mm/migrate.c	2006-05-18 09:56:43.766958108 -0700
+++ linux-2.6.17-rc4-mm1/mm/migrate.c	2006-05-18 10:02:04.586936931 -0700
@@ -25,6 +25,7 @@
 #include <linux/cpu.h>
 #include <linux/cpuset.h>
 #include <linux/writeback.h>
+#include <linux/mempolicy.h>
 
 #include "internal.h"
 
@@ -710,3 +711,176 @@ out:
 	return nr_failed + retry;
 }
 
+#ifdef CONFIG_NUMA
+/*
+ * Move a list of individual pages
+ */
+struct page_to_node {
+	struct page *page;
+	int node;
+	int status;
+};
+
+static struct page *new_page_node(struct page *p, unsigned long private)
+{
+	struct page_to_node *pm = (struct page_to_node *)private;
+
+	while (pm->page && pm->page != p)
+		pm++;
+
+	if (!pm->page)
+		return NULL;
+
+	return alloc_pages_node(pm->node, GFP_HIGHUSER, 0);
+}
+
+/*
+ * Move a list of pages in the address space of the currently executing
+ * process.
+ */
+asmlinkage long sys_move_pages(int pid, unsigned long nr_pages,
+			const unsigned long __user *pages,
+			const int __user *nodes,
+			int __user *status, int flags)
+{
+	int err = 0;
+	int i;
+	struct task_struct *task;
+	nodemask_t task_nodes;
+	struct mm_struct *mm;
+	struct page_to_node *pm = NULL;
+	LIST_HEAD(pagelist);
+
+	/* Check flags */
+	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
+		return -EINVAL;
+
+	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
+		return -EPERM;
+
+	/* Find the mm_struct */
+	read_lock(&tasklist_lock);
+	task = pid ? find_task_by_pid(pid) : current;
+	if (!task) {
+		read_unlock(&tasklist_lock);
+		return -ESRCH;
+	}
+	mm = get_task_mm(task);
+	read_unlock(&tasklist_lock);
+
+	if (!mm)
+		return -EINVAL;
+
+	/*
+	 * Check if this process has the right to modify the specified
+	 * process. The right exists if the process has administrative
+	 * capabilities, superuser privileges or the same
+	 * userid as the target process.
+	 */
+	if ((current->euid != task->suid) && (current->euid != task->uid) &&
+	    (current->uid != task->suid) && (current->uid != task->uid) &&
+	    !capable(CAP_SYS_NICE)) {
+		err = -EPERM;
+		goto out2;
+	}
+
+	task_nodes = cpuset_mems_allowed(task);
+	pm = kmalloc((nr_pages + 1) * sizeof(struct page_to_node), GFP_KERNEL);
+	if (!pm) {
+		err = -ENOMEM;
+		goto out2;
+	}
+
+	down_read(&mm->mmap_sem);
+
+	for(i = 0 ; i < nr_pages; i++) {
+		unsigned long addr;
+		int node;
+		struct vm_area_struct *vma;
+		struct page *page;
+
+		pm[i].page = ZERO_PAGE(0);
+
+		err = -EFAULT;
+		if (get_user(addr, pages + i))
+			goto putback;
+
+		vma = find_vma(mm, addr);
+		if (!vma)
+			goto set_status;
+
+		page = follow_page(vma, addr, FOLL_GET);
+		err = -ENOENT;
+		if (!page)
+			goto set_status;
+
+		pm[i].page = page;
+		if (!nodes) {
+			err = page_to_nid(page);
+			put_page(page);
+			goto set_status;
+		}
+
+		err = -EPERM;
+		if (page_mapcount(page) > 1 &&
+				!(flags & MPOL_MF_MOVE_ALL)) {
+			put_page(page);
+			goto set_status;
+		}
+
+
+		err = isolate_lru_page(page, &pagelist);
+		__put_page(page);
+		if (err)
+			goto remove;
+
+		err = -EFAULT;
+		if (get_user(node, nodes + i))
+			goto remove;
+
+		err = -ENOENT;
+		if (!node_online(node))
+			goto remove;
+
+		err = -EPERM;
+		if (!node_isset(node, task_nodes))
+			goto remove;
+
+		pm[i].node = node;
+		err = 0;
+		if (node != page_to_nid(page))
+			goto set_status;
+
+		err = node;
+remove:
+		list_del(&page->lru);
+		move_to_lru(page);
+set_status:
+		pm[i].status = err;
+	}
+	err = 0;
+	if (!nodes || list_empty(&pagelist))
+		goto out;
+
+	pm[nr_pages].page = NULL;
+
+	err = migrate_pages(&pagelist, new_page_node, (unsigned long)pm);
+	goto out;
+
+putback:
+	putback_lru_pages(&pagelist);
+
+out:
+	up_read(&mm->mmap_sem);
+	if (err >= 0)
+		/* Return status information */
+		for(i = 0; i < nr_pages; i++)
+			put_user(pm[i].status, status +i);
+
+	kfree(pm);
+out2:
+	mmput(mm);
+	return err;
+}
+#endif
+
Index: linux-2.6.17-rc4-mm1/kernel/sys_ni.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/kernel/sys_ni.c	2006-05-11 16:31:53.000000000 -0700
+++ linux-2.6.17-rc4-mm1/kernel/sys_ni.c	2006-05-18 09:59:39.621304007 -0700
@@ -87,6 +87,7 @@ cond_syscall(sys_inotify_init);
 cond_syscall(sys_inotify_add_watch);
 cond_syscall(sys_inotify_rm_watch);
 cond_syscall(sys_migrate_pages);
+cond_syscall(sys_move_pages);
 cond_syscall(sys_chown16);
 cond_syscall(sys_fchown16);
 cond_syscall(sys_getegid16);
Index: linux-2.6.17-rc4-mm1/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.17-rc4-mm1.orig/include/asm-ia64/unistd.h	2006-05-15 15:40:11.023565789 -0700
+++ linux-2.6.17-rc4-mm1/include/asm-ia64/unistd.h	2006-05-18 09:59:39.623257011 -0700
@@ -265,7 +265,7 @@
 #define __NR_keyctl			1273
 #define __NR_ioprio_set			1274
 #define __NR_ioprio_get			1275
-/* 1276 is available for reuse (was briefly sys_set_zone_reclaim) */
+#define __NR_move_pages			1276
 #define __NR_inotify_init		1277
 #define __NR_inotify_add_watch		1278
 #define __NR_inotify_rm_watch		1279
Index: linux-2.6.17-rc4-mm1/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.17-rc4-mm1.orig/arch/ia64/kernel/entry.S	2006-05-15 15:40:06.642978421 -0700
+++ linux-2.6.17-rc4-mm1/arch/ia64/kernel/entry.S	2006-05-18 09:59:39.625210015 -0700
@@ -1584,7 +1584,7 @@ sys_call_table:
 	data8 sys_keyctl
 	data8 sys_ioprio_set
 	data8 sys_ioprio_get			// 1275
-	data8 sys_ni_syscall
+	data8 sys_move_pages
 	data8 sys_inotify_init
 	data8 sys_inotify_add_watch
 	data8 sys_inotify_rm_watch

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC 5/5] page migration: Detailed status for moving of individual pages
  2006-05-18 18:21 [RFC] page migration: patches for later than 2.6.18 Christoph Lameter
                   ` (3 preceding siblings ...)
  2006-05-18 18:21 ` [RFC 4/5] page migration: Support moving of individual pages Christoph Lameter
@ 2006-05-18 18:21 ` Christoph Lameter
  2006-05-18 18:21 ` [RFC 6/6] page migration: Support a vma migration function Christoph Lameter
  5 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2006-05-18 18:21 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, bls, jes, Lee Schermerhorn, Christoph Lameter, KAMEZAWA Hiroyuki

Detailed results for sys_move_pages()

Pass a pointer to an integer to get_new_page() that may be used
to indicate where the completion status of a migration operation should
be placed. This allows sys_move_pages() to report back exactly what
happened to each page.

I wish there were a better way to do this; it looks a bit hacky.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-rc4-mm1/mm/migrate.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/mm/migrate.c	2006-05-18 10:02:04.586936931 -0700
+++ linux-2.6.17-rc4-mm1/mm/migrate.c	2006-05-18 10:06:17.159186880 -0700
@@ -589,7 +589,8 @@ static int unmap_and_move(new_page_t get
 			struct page *page, int force)
 {
 	int rc = 0;
-	struct page *newpage = get_new_page(page, private);
+	int *result = NULL;
+	struct page *newpage = get_new_page(page, private, &result);
 
 	if (!newpage)
 		return -ENOMEM;
@@ -643,6 +644,12 @@ move_newpage:
 	 * then this will free the page.
 	 */
 	move_to_lru(newpage);
+	if (result) {
+		if (rc)
+			*result = rc;
+		else
+			*result = page_to_nid(newpage);
+	}
 	return rc;
 }
 
@@ -721,7 +728,8 @@ struct page_to_node {
 	int status;
 };
 
-static struct page *new_page_node(struct page *p, unsigned long private)
+static struct page *new_page_node(struct page *p, unsigned long private,
+		int **result)
 {
 	struct page_to_node *pm = (struct page_to_node *)private;
 
@@ -731,6 +739,8 @@ static struct page *new_page_node(struct
 	if (!pm->page)
 		return NULL;
 
+	*result = &pm->status;
+
 	return alloc_pages_node(pm->node, GFP_HIGHUSER, 0);
 }
 
@@ -847,7 +857,7 @@ asmlinkage long sys_move_pages(int pid, 
 			goto remove;
 
 		pm[i].node = node;
-		err = 0;
+		err = -EAGAIN;
 		if (node != page_to_nid(page))
 			goto set_status;
 
Index: linux-2.6.17-rc4-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/mm/mempolicy.c	2006-05-18 09:48:12.491970088 -0700
+++ linux-2.6.17-rc4-mm1/mm/mempolicy.c	2006-05-18 10:05:00.079975821 -0700
@@ -588,7 +588,7 @@ static void migrate_page_add(struct page
 		isolate_lru_page(page, pagelist);
 }
 
-static struct page *new_node_page(struct page *page, unsigned long node)
+static struct page *new_node_page(struct page *page, unsigned long node, int **x)
 {
 	return alloc_pages_node(node, GFP_HIGHUSER, 0);
 }
@@ -698,7 +698,7 @@ int do_migrate_pages(struct mm_struct *m
 
 }
 
-static struct page *new_vma_page(struct page *page, unsigned long private)
+static struct page *new_vma_page(struct page *page, unsigned long private, int **x)
 {
 	struct vm_area_struct *vma = (struct vm_area_struct *)private;
 
Index: linux-2.6.17-rc4-mm1/include/linux/migrate.h
===================================================================
--- linux-2.6.17-rc4-mm1.orig/include/linux/migrate.h	2006-05-18 09:48:12.493923092 -0700
+++ linux-2.6.17-rc4-mm1/include/linux/migrate.h	2006-05-18 10:05:00.080952323 -0700
@@ -3,7 +3,7 @@
 
 #include <linux/mm.h>
 
-typedef struct page *new_page_t(struct page *, unsigned long private);
+typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
 #ifdef CONFIG_MIGRATION
 extern int isolate_lru_page(struct page *p, struct list_head *pagelist);


* [RFC 6/6] page migration: Support a vma migration function
  2006-05-18 18:21 [RFC] page migration: patches for later than 2.6.18 Christoph Lameter
                   ` (4 preceding siblings ...)
  2006-05-18 18:21 ` [RFC 5/5] page migration: Detailed status for " Christoph Lameter
@ 2006-05-18 18:21 ` Christoph Lameter
  5 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2006-05-18 18:21 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, bls, jes, Lee Schermerhorn, Christoph Lameter, KAMEZAWA Hiroyuki

Hooks for calling vma specific migration functions

With this patch a vma may define a vma->vm_ops->migrate function.
That function may perform page migration on its own (some vmas may
not contain page structs and therefore cannot be handled by regular
page migration; pages in a vma may require special preparatory
treatment before migration is possible; etc.). Only mmap_sem is
held when the migration function is called. The migrate() function
gets passed two nodemasks describing the source and the target
of the migration. The flags parameter contains either

MPOL_MF_MOVE	which means that only pages used exclusively by
		the specified mm should be moved

or

MPOL_MF_MOVE_ALL which means that pages shared with other processes
		should also be moved.

The migration function returns 0 on success or an error condition.
An error condition will prevent regular page migration from occurring.

On its own this patch cannot be included since there are no users
for this functionality. But it seems that the uncached allocator
will need this functionality at some point.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.17-rc4-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/mm/mempolicy.c	2006-05-18 10:28:46.290423356 -0700
+++ linux-2.6.17-rc4-mm1/mm/mempolicy.c	2006-05-18 10:28:51.629936158 -0700
@@ -631,6 +631,10 @@ int do_migrate_pages(struct mm_struct *m
 
   	down_read(&mm->mmap_sem);
 
+	err = migrate_vmas(mm, from_nodes, to_nodes, flags);
+	if (err)
+		goto out;
+
 /*
  * Find a 'source' bit set in 'tmp' whose corresponding 'dest'
  * bit in 'to' is not also set in 'tmp'.  Clear the found 'source'
@@ -690,7 +694,7 @@ int do_migrate_pages(struct mm_struct *m
 		if (err < 0)
 			break;
 	}
-
+out:
 	up_read(&mm->mmap_sem);
 	if (err < 0)
 		return err;
Index: linux-2.6.17-rc4-mm1/mm/migrate.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/mm/migrate.c	2006-05-18 10:28:46.289446854 -0700
+++ linux-2.6.17-rc4-mm1/mm/migrate.c	2006-05-18 10:36:56.930910584 -0700
@@ -894,3 +894,23 @@ out2:
 }
 #endif
 
+/*
+ * Call migration functions in the vma_ops that may prepare
+ * memory in a vm for migration. migration functions may perform
+ * the migration for vmas that do not have an underlying page struct.
+ */
+int migrate_vmas(struct mm_struct *mm, const nodemask_t *from,
+	const nodemask_t *to, unsigned long flags)
+{
+	struct vm_area_struct *vma;
+	int err = 0;
+
+	for (vma = mm->mmap; vma && !err; vma = vma->vm_next) {
+		if (vma->vm_ops && vma->vm_ops->migrate) {
+			err = vma->vm_ops->migrate(vma, from, to, flags);
+			if (err)
+				break;
+		}
+	}
+	return err;
+}
Index: linux-2.6.17-rc4-mm1/include/linux/mm.h
===================================================================
--- linux-2.6.17-rc4-mm1.orig/include/linux/mm.h	2006-05-15 15:40:12.355514333 -0700
+++ linux-2.6.17-rc4-mm1/include/linux/mm.h	2006-05-18 10:38:35.269541654 -0700
@@ -209,6 +209,8 @@ struct vm_operations_struct {
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
 					unsigned long addr);
+	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
+		const nodemask_t *to, unsigned long flags);
 #endif
 };
 
Index: linux-2.6.17-rc4-mm1/include/linux/migrate.h
===================================================================
--- linux-2.6.17-rc4-mm1.orig/include/linux/migrate.h	2006-05-18 10:28:46.291399858 -0700
+++ linux-2.6.17-rc4-mm1/include/linux/migrate.h	2006-05-18 10:37:43.795193223 -0700
@@ -16,7 +16,9 @@ extern int fail_migrate_page(struct addr
 			struct page *, struct page *);
 
 extern int migrate_prep(void);
-
+extern int migrate_vmas(struct mm_struct *mm,
+		const nodemask_t *from, const nodemask_t *to,
+		unsigned long flags);
 #else
 
 static inline int isolate_lru_page(struct page *p, struct list_head *list)
@@ -30,6 +32,13 @@ static inline int migrate_pages_to(struc
 
 static inline int migrate_prep(void) { return -ENOSYS; }
 
+static inline int migrate_vmas(struct mm_struct *mm,
+		const nodemask_t *from, const nodemask_t *to,
+		unsigned long flags)
+{
+	return -ENOSYS;
+}
+
 /* Possible settings for the migrate_page() method in address_operations */
 #define migrate_page NULL
 #define fail_migrate_page NULL


* Re: [RFC 4/5] page migration: Support moving of individual pages
  2006-05-18 18:21 ` [RFC 4/5] page migration: Support moving of individual pages Christoph Lameter
@ 2006-05-19 19:27   ` Andrew Morton
  2006-05-19 23:23     ` Christoph Lameter
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2006-05-19 19:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, bls, jes, lee.schermerhorn, kamezawa.hiroyu, Michael Kerrisk

Christoph Lameter <clameter@sgi.com> wrote:
>
> Add support for sys_move_pages()

This should be reviewed by the selinux guys (bcc'ed) to see if security
hooks are needed.

> move_pages() is used to move individual pages of a process. The function can
> be used to determine the location of pages and to move them onto the desired
> node. move_pages() returns status information for each page.
> 
> int move_pages(pid, number_of_pages_to_move,
> 		addresses_of_pages[],
> 		nodes[] or NULL,
> 		status[],
> 		flags);
> 
> The addresses of pages is an array of unsigned longs pointing to the
> pages to be moved.
> 
> The nodes array contains the node numbers that the pages should be moved
> to. If a NULL is passed then no pages are moved but the status array is
> updated.
> 
> The status array contains a status indicating the result of the migration
> operation or the current state of the page if nodes == NULL.
> 
> Possible page states:
> 
> 0..MAX_NUMNODES		The page is now on the indicated node.
> 
> -ENOENT		Page is not present or target node is not present

So the caller has no way of distinguishing one case from the other? 
Perhaps it would be better to permit that.


> -EPERM		Page is mapped by multiple processes and can only
> 		be moved if MPOL_MF_MOVE_ALL is specified. Or the
> 		target node is not allowed by the current cpuset.
> 		Or the page has been mlocked by a process/driver and
> 		cannot be moved.
> 
> -EBUSY		Page is busy and cannot be moved. Try again later.
> 
> -EFAULT		Cannot read node information from node array.
> 
> -ENOMEM		Unable to allocate memory on target node.
> 
> -EIO		Unable to write back page. Page must be written
> 		back since the page is dirty and the filesystem does not
> 		provide a migration function.
> 
> -EINVAL		Filesystem does not provide a migration function but also
> 		has no ability to write back pages.

OK, the mapping from sys_move_pages() semantics onto errnos is reasonably
close.

But it still feels a bit kludgy to me.  Perhaps it would be nicer to define
a specific set of return codes for this application.

> 
> Test program for this may be found with the patches
> on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm1

The syscall is ia64-only at present.  And that's OK, but if anyone has an
interest in page migration on other architectures (damn well hope so) then
let's hope they wire the syscall up and get onto it..

> +/*
> + * Move a list of pages in the address space of the currently executing
> + * process.
> + */
> +asmlinkage long sys_move_pages(int pid, unsigned long nr_pages,
> +			const unsigned long __user *pages,
> +			const int __user *nodes,
> +			int __user *status, int flags)
> +{

I expect this is going to be a bitch to write compat emulation for.  If we
want to support this syscall for 32-bit userspace.

If there's any possibility of that then perhaps we should revisit these
types, see if we can design this syscall so that it doesn't need a compat
wrapper.

The `status' array should be char*, surely?

> +	int err = 0;
> +	int i;
> +	struct task_struct *task;
> +	nodemask_t task_nodes;
> +	struct mm_struct *mm;
> +	struct page_to_node *pm = NULL;
> +	LIST_HEAD(pagelist);
> +
> +	/* Check flags */
> +	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
> +		return -EINVAL;
> +
> +	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
> +		return -EPERM;
> +
> +	/* Find the mm_struct */
> +	read_lock(&tasklist_lock);
> +	task = pid ? find_task_by_pid(pid) : current;
> +	if (!task) {
> +		read_unlock(&tasklist_lock);
> +		return -ESRCH;
> +	}
> +	mm = get_task_mm(task);
> +	read_unlock(&tasklist_lock);
> +
> +	if (!mm)
> +		return -EINVAL;
> +
> +	/*
> +	 * Check if this process has the right to modify the specified
> +	 * process. The right exists if the process has administrative
> +	 * capabilities, superuser privileges or the same
> +	 * userid as the target process.
> +	 */
> +	if ((current->euid != task->suid) && (current->euid != task->uid) &&
> +	    (current->uid != task->suid) && (current->uid != task->uid) &&
> +	    !capable(CAP_SYS_NICE)) {
> +		err = -EPERM;
> +		goto out2;
> +	}

We have code which looks very much like this in maybe five or more places. 
Someone should fix it ;)

> +	task_nodes = cpuset_mems_allowed(task);
> +	pm = kmalloc(GFP_KERNEL, (nr_pages + 1) * sizeof(struct page_to_node));

A horrid bug.  If userspace passes in a sufficiently large nr_pages, the
multiplication will overflow and we'll allocate far too little memory and
we'll proceed to scrog kernel memory.

(OK, that's what would happen if you'd got the kmalloc args the correct way
around.  As it stands, heaven knows what it'll do ;))

> +	if (!pm) {
> +		err = -ENOMEM;
> +		goto out2;
> +	}
> +
> +	down_read(&mm->mmap_sem);
> +
> +	for(i = 0 ; i < nr_pages; i++) {

I really should write a fix-common-whitespace-mistakes script.

> +		unsigned long addr;
> +		int node;
> +		struct vm_area_struct *vma;
> +		struct page *page;
> +
> +		pm[i].page = ZERO_PAGE(0);
> +
> +		err = -EFAULT;
> +		if (get_user(addr, pages + i))
> +			goto putback;

No, we cannot run get_user() inside down_read(mmap_sem).  Because that ends
up taking mmap_sem recursively and an intervening down_write() from another
process will deadlock the kernel.


* Re: [RFC 4/5] page migration: Support moving of individual pages
  2006-05-19 19:27   ` Andrew Morton
@ 2006-05-19 23:23     ` Christoph Lameter
  2006-05-19 23:45       ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Lameter @ 2006-05-19 23:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, bls, jes, lee.schermerhorn, kamezawa.hiroyu, Michael Kerrisk

On Fri, 19 May 2006, Andrew Morton wrote:

> > Possible page states:
> > 
> > 0..MAX_NUMNODES		The page is now on the indicated node.
> > 
> > -ENOENT		Page is not present or target node is not present
> 
> So the caller has no way of distinguishing one case from the other? 
> Perhaps it would be better to permit that.

But then we would not follow the meaning of the -Exx codes?

> But it still feels a bit kludgy to me.  Perhaps it would be nicer to define
> a specific set of return codes for this application.

The -Exx codes are in use throughout the migration code for error 
conditions. We could do another pass through all of this and define 
specific error codes for page migration alone?

> > Test program for this may be found with the patches
> > on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm1
> 
> The syscall is ia64-only at present.  And that's OK, but if anyone has an
> interest in page migration on other architectures (damn well hope so) then
> let's hope they wire the syscall up and get onto it..

Well I expected a longer discussion on how to do this, why are we doing 
it this way etc etc before the patch got in and before I would have to 
polish it up for prime time. Hopefully this whole thing does not become 
too volatile. You are keeping this separate from the other material that 
is intended for 2.6.18 right?

> > +			const int __user *nodes,
> > +			int __user *status, int flags)
> > +{
> 
> I expect this is going to be a bitch to write compat emulation for.  If we
> want to support this syscall for 32-bit userspace.

Page migration on a 32 bit platform? Do we really need that?

> If there's any possibility of that then perhaps we should revisit these
> types, see if we can design this syscall so that it doesn't need a compat
> wrapper.
> 
> The `status' array should be char*, surely?

Could be. But then it's an integer status and not a character so I thought 
that an int would be cleaner.

> > +	/*
> > +	 * Check if this process has the right to modify the specified
> > +	 * process. The right exists if the process has administrative
> > +	 * capabilities, superuser privileges or the same
> > +	 * userid as the target process.
> > +	 */
> > +	if ((current->euid != task->suid) && (current->euid != task->uid) &&
> > +	    (current->uid != task->suid) && (current->uid != task->uid) &&
> > +	    !capable(CAP_SYS_NICE)) {
> > +		err = -EPERM;
> > +		goto out2;
> > +	}
> 
> We have code which looks very much like this in maybe five or more places. 
> Someone should fix it ;)

hmmm. yes this seems to be duplicated quite a bit.

> > +	task_nodes = cpuset_mems_allowed(task);
> > +	pm = kmalloc(GFP_KERNEL, (nr_pages + 1) * sizeof(struct page_to_node));
> 
> A horrid bug.  If userspace passes in a sufficiently large nr_pages, the
> multiplication will overflow and we'll allocate far too little memory and
> we'll proceed to scrog kernel memory.

nr_pages is a 32 bit entity. On a 64 bit platform it will be difficult to 
overflow the result. So we only have an issue if we support move_pages() 
on 32 bit.

> (OK, that's what would happen if you'd got the kmalloc args the correct way
> around.  As it stands, heaven knows what it'll do ;))

It survived the test (ROTFL). But why did we add this gfp_t type if it 
does not cause the compiler to spit out a warning? We only get a warning 
with sparse checking?

> > +		err = -EFAULT;
> > +		if (get_user(addr, pages + i))
> > +			goto putback;
> 
> No, we cannot run get_user() inside down_read(mmap_sem).  Because that ends
> up taking mmap_sem recursively and an intervening down_write() from another
> process will deadlock the kernel.

Ok. Will fix the numerous bugs next week unless there are more concerns on 
a basic conceptual level.


* Re: [RFC 4/5] page migration: Support moving of individual pages
  2006-05-19 23:23     ` Christoph Lameter
@ 2006-05-19 23:45       ` Andrew Morton
  2006-05-20  0:46         ` Christoph Lameter
  2006-05-22  8:02         ` Jes Sorensen
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2006-05-19 23:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, bls, jes, lee.schermerhorn, kamezawa.hiroyu, mtk-manpages

Christoph Lameter <clameter@sgi.com> wrote:
>
> On Fri, 19 May 2006, Andrew Morton wrote:
> 
> > > Possible page states:
> > > 
> > > 0..MAX_NUMNODES		The page is now on the indicated node.
> > > 
> > > -ENOENT		Page is not present or target node is not present
> > 
> > So the caller has no way of distinguishing one case from the other? 
> > Perhaps it would be better to permit that.
> 
> But then we would not follow the meaning of the -Exx codes?

If we're returning this fine-grained info back to userspace (good) then we
should go all the way.  If that's hard to do with the current
map-it-onto-existing-errnos approach then we've hit the limits of that
approach.

> > But it still feels a bit kludgy to me.  Perhaps it would be nicer to define
> > a specific set of return codes for this application.
> 
> The -Exx codes are in use throughout the migration code for error 
> conditions. We could do another pass through all of this and define 
> specific error codes for page migration alone?

They're syscall return codes, not page-migration-per-page-result codes.

I'd have thought that would produce a cleaner result, really.  I don't know
how much impact that would have from a back-compatibility POV though.

> > > Test program for this may be found with the patches
> > > on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm1
> > 
> > The syscall is ia64-only at present.  And that's OK, but if anyone has an
> > interest in page migration on other architectures (damn well hope so) then
> > let's hope they wire the syscall up and get onto it..
> 
> Well I expected a longer discussion on how to do this, why are we doing 
> it this way etc etc before the patch got in and before I would have to 
> polish it up for prime time. Hopefully this whole thing does not become 
> too volatile.

The patches looked fairly straightforward to me.  Maybe I missed something ;)

> You are keeping this separate from the other material that 
> is intended for 2.6.18 right?

yup.

> > > +			const int __user *nodes,
> > > +			int __user *status, int flags)
> > > +{
> > 
> > I expect this is going to be a bitch to write compat emulation for.  If we
> > want to support this syscall for 32-bit userspace.
> 
> Page migration on a 32 bit platform? Do we really need that?

sys_migrate_pages is presently wired up in the x86 syscall table.  And it's
available in x86_64's 32-bit mode.

> > If there's any possibility of that then perhaps we should revisit these
> > types, see if we can design this syscall so that it doesn't need a compat
> > wrapper.
> > 
> > The `status' array should be char*, surely?
> 
> Could be. But then it's an integer status and not a character so I thought 
> that an int would be cleaner.

As it's just a status result it's hard to see that we'd ever need more
bits.  Might as well get the speed and space savings of using a char?

> > > +	/*
> > > +	 * Check if this process has the right to modify the specified
> > > +	 * process. The right exists if the process has administrative
> > > +	 * capabilities, superuser privileges or the same
> > > +	 * userid as the target process.
> > > +	 */
> > > +	if ((current->euid != task->suid) && (current->euid != task->uid) &&
> > > +	    (current->uid != task->suid) && (current->uid != task->uid) &&
> > > +	    !capable(CAP_SYS_NICE)) {
> > > +		err = -EPERM;
> > > +		goto out2;
> > > +	}
> > 
> > We have code which looks very much like this in maybe five or more places. 
> > Someone should fix it ;)
> 
> hmmm. yes this seems to be duplicated quite a bit.
> 
> > > +	task_nodes = cpuset_mems_allowed(task);
> > > +	pm = kmalloc(GFP_KERNEL, (nr_pages + 1) * sizeof(struct page_to_node));
> > 
> > A horrid bug.  If userspace passes in a sufficiently large nr_pages, the
> > multiplication will overflow and we'll allocate far too little memory and
> > we'll proceed to scrog kernel memory.
> 
> nr_pages is a 32 bit entity. On a 64 bit platform it will be difficult to 
> overflow the result. So we only have an issue if we support move_pages() 
> on 32 bit.

nr_pages is declared as unsigned long.

> > (OK, that's what would happen if you'd got the kmalloc args the correct way
> > around.  As it stands, heaven knows what it'll do ;))
> 
> It survived the test (ROTFL). But why did we add this gfp_t type if it 
> does not cause the compiler to spit out a warning? We only get a warning 
> with sparse checking?
> 
> > > +		err = -EFAULT;
> > > +		if (get_user(addr, pages + i))
> > > +			goto putback;
> > 
> > No, we cannot run get_user() inside down_read(mmap_sem).  Because that ends
> > up taking mmap_sem recursively and an intervening down_write() from another
> > process will deadlock the kernel.
> 
> Ok. Will fix the numerous bugs next week unless there are more concerns on 
> a basic conceptual level.

Who else is interested in these features apart from the high-end ia64
people?


* Re: [RFC 4/5] page migration: Support moving of individual pages
  2006-05-19 23:45       ` Andrew Morton
@ 2006-05-20  0:46         ` Christoph Lameter
  2006-05-22  8:02         ` Jes Sorensen
  1 sibling, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2006-05-20  0:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, bls, jes, lee.schermerhorn, kamezawa.hiroyu, mtk-manpages

On Fri, 19 May 2006, Andrew Morton wrote:

> If we're returning this fine-grained info back to userspace (good) then we
> should go all the way.  If that's hard to do with the current
> map-it-onto-existing-errnos approach then we've hit the limits of that
> approach.

I think the level of detail of -Exx is sufficient. I will have to precheck
the arguments passed before taking mmap_sem in the next release. With that,
some of the clashes can be removed and I could, e.g., return -ENOENT
only if any invalid node was specified, so that the -ENOENT page state
really means that no page is there.

> > The -Exx codes are in use throughout the migration code for error 
> > conditions. We could do another pass through all of this and define 
> > specific error codes for page migration alone?
> 
> They're syscall return codes, not page-migration-per-page-result codes.
> 
> I'd have thought that would produce a cleaner result, really.  I don't know
> how much impact that would have from a back-compatibility POV though.

I have used these throughout the page migration code for error conditions 
on pages, since we thought this would be a good way to avoid defining error
conditions for multiple functions. Better to try to keep it.

> > Well I expected a longer discussion on how to do this, why are we doing 
> > it this way etc etc before the patch got in and before I would have to 
> > polish it up for prime time. Hopefully this whole thing does not become 
> > too volatile.
> 
> The patches looked fairly straightforward to me.  Maybe I missed something ;)

Great! Will clean it up and do some more testing on it.

Brian: Could you give me some feedback on this one as well? Could you do
some testing with your framework for page migration?

> > Page migration on a 32 bit platform? Do we really need that?
> 
> sys_migrate_pages is presently wired up in the x86 syscall table.  And it's
> available in x86_64's 32-bit mode.

Ok. I will look at that.

> > Could be. But then its an integer status and not a character so I thought 
> > that an int would be cleaner.
> 
> As it's just a status result it's hard to see that we'd ever need more
> bits.  Might as well get the speed and space savings of using a char?

This is just a temporary value and (oh.... yes) we are going up to 4k
nodes right now and are still shooting for more. So the node number
won't fit into a char, so let's keep it an int.

> > Ok. Will fix the numerous bugs next week unless there are more concerns on 
> > a basic conceptual level.
> 
> Who else is interested in these features apart from the high-end ia64
> people?

The usual, I guess: PowerPC and x86_64 (Opteron) high-end machines, plus the 
i386 IBM NUMA machines. Is sparc64 now NUMA capable?


* Re: [RFC 4/5] page migration: Support moving of individual pages
  2006-05-19 23:45       ` Andrew Morton
  2006-05-20  0:46         ` Christoph Lameter
@ 2006-05-22  8:02         ` Jes Sorensen
  1 sibling, 0 replies; 12+ messages in thread
From: Jes Sorensen @ 2006-05-22  8:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-mm, bls, lee.schermerhorn,
	kamezawa.hiroyu, mtk-manpages

Andrew Morton wrote:
> Christoph Lameter <clameter@sgi.com> wrote:
>> On Fri, 19 May 2006, Andrew Morton wrote:
>>> I expect this is going to be a bitch to write compat emulation for.  If we
>>> want to support this syscall for 32-bit userspace.
>> Page migration on a 32 bit platform? Do we really need that?
> 
> sys_migrate_pages is presently wired up in the x86 syscall table.  And it's
> available in x86_64's 32-bit mode.

And probably other architectures where the 32 bit userland is the
primary one used (Sparc64, PARISC and possibly others).


Cheers,
Jes


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2006-05-22  8:02 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-05-18 18:21 [RFC] page migration: patches for later than 2.6.18 Christoph Lameter
2006-05-18 18:21 ` [RFC 1/5] page migration: simplify migrate_pages() Christoph Lameter
2006-05-18 18:21 ` [RFC 2/5] page migration: handle freeing of pages in migrate_pages() Christoph Lameter
2006-05-18 18:21 ` [RFC 3/5] page migration: use allocator function for migrate_pages() Christoph Lameter
2006-05-18 18:21 ` [RFC 4/5] page migration: Support moving of individual pages Christoph Lameter
2006-05-19 19:27   ` Andrew Morton
2006-05-19 23:23     ` Christoph Lameter
2006-05-19 23:45       ` Andrew Morton
2006-05-20  0:46         ` Christoph Lameter
2006-05-22  8:02         ` Jes Sorensen
2006-05-18 18:21 ` [RFC 5/5] page migration: Detailed status for " Christoph Lameter
2006-05-18 18:21 ` [RFC 6/6] page migration: Support a vma migration function Christoph Lameter
