linux-mm.kvack.org archive mirror
* [PATCH 0/10] Hugepage migration (v4)
@ 2010-09-03  4:37 Naoya Horiguchi
  2010-09-03  4:37 ` [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault() Naoya Horiguchi
                   ` (9 more replies)
  0 siblings, 10 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-03  4:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Hi,

This is the 4th version of the "hugepage migration" patch set.

Major changes: (see individual patches for more details)
- Folded alloc_buddy_huge_page_node() into alloc_buddy_huge_page().
- Fixed a race condition between the dequeue function and the allocation
  function. This is based on a draft patch from Wu Fengguang. Thank you.
- Enabled the missing recovery path for uncorrected errors on free hugepages.
- Changed the refcount semantics of hugepages isolated from the free list.


Future work:

- Migration can fail for various reasons depending on various factors,
  so it would be useful if soft offlining could be retried when it notices
  that migration has failed. This is a more general problem because it also
  applies to soft offlining of normal-sized pages, so we leave it as
  future work.

- The corrupted hugepage counter implemented in the previous version was
  dropped because it is not directly related to the migration topic and has
  no serious impact on kernel behavior. We also leave this as future work.


Summary:

 [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault()
 [PATCH 02/10] hugetlb: add allocate function for hugepage migration
 [PATCH 03/10] hugetlb: redefine hugepage copy functions
 [PATCH 04/10] hugetlb: hugepage migration core
 [PATCH 05/10] HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page()
 [PATCH 06/10] hugetlb: move refcounting in hugepage allocation inside hugetlb_lock
 [PATCH 07/10] HWPOISON, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED
 [PATCH 08/10] HWPOISON, hugetlb: soft offlining for hugepage
 [PATCH 09/10] HWPOISON, hugetlb: fix unpoison for hugepage
 [PATCH 10/10] page-types.c: fix name of unpoison interface

 Documentation/vm/page-types.c |    2 +-
 fs/hugetlbfs/inode.c          |   15 +++
 include/linux/hugetlb.h       |   11 ++-
 include/linux/migrate.h       |   12 ++
 mm/hugetlb.c                  |  225 ++++++++++++++++++++++++++++------------
 mm/memory-failure.c           |   93 +++++++++++++----
 mm/migrate.c                  |  192 +++++++++++++++++++++++++++++++----
 mm/vmscan.c                   |    9 ++-
 8 files changed, 446 insertions(+), 113 deletions(-)


Thanks,
Naoya

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault()
  2010-09-03  4:37 [PATCH 0/10] Hugepage migration (v4) Naoya Horiguchi
@ 2010-09-03  4:37 ` Naoya Horiguchi
  2010-09-03  4:37 ` [PATCH 02/10] hugetlb: add allocate function for hugepage migration Naoya Horiguchi
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-03  4:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Since the PageHWPoison() check is meant to prevent a hwpoisoned page that
remains in the pagecache from being mapped into a process, it should be done
in the "found in pagecache" branch, not in the common path.
Otherwise metadata corruption occurs if a memory failure happens between
alloc_huge_page() and lock_page(), because the page fault fails while
metadata changes (such as refcount and mapcount) are left behind.

This patch moves the check into the "found in pagecache" branch and fixes the problem.

ChangeLog since v2:
- remove retry check in "new allocation" path.
- make description more detailed
- change patch name from "HWPOISON, hugetlb: move PG_HWPoison bit check"

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/hugetlb.c |   21 +++++++++------------
 1 files changed, 9 insertions(+), 12 deletions(-)

diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index cc5be78..6871b41 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -2518,22 +2518,19 @@ retry:
 			hugepage_add_new_anon_rmap(page, vma, address);
 		}
 	} else {
+		/*
+		 * If memory error occurs between mmap() and fault, some process
+		 * don't have hwpoisoned swap entry for errored virtual address.
+		 * So we need to block hugepage fault by PG_hwpoison bit check.
+		 */
+		if (unlikely(PageHWPoison(page))) {
+			ret = VM_FAULT_HWPOISON;
+			goto backout_unlocked;
+		}
 		page_dup_rmap(page);
 	}
 
 	/*
-	 * Since memory error handler replaces pte into hwpoison swap entry
-	 * at the time of error handling, a process which reserved but not have
-	 * the mapping to the error hugepage does not have hwpoison swap entry.
-	 * So we need to block accesses from such a process by checking
-	 * PG_hwpoison bit here.
-	 */
-	if (unlikely(PageHWPoison(page))) {
-		ret = VM_FAULT_HWPOISON;
-		goto backout_unlocked;
-	}
-
-	/*
 	 * If we are going to COW a private mapping later, we examine the
 	 * pending reservations for this page now. This will ensure that
 	 * any allocations necessary to record that reservation occur outside
-- 
1.7.2.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 02/10] hugetlb: add allocate function for hugepage migration
  2010-09-03  4:37 [PATCH 0/10] Hugepage migration (v4) Naoya Horiguchi
  2010-09-03  4:37 ` [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault() Naoya Horiguchi
@ 2010-09-03  4:37 ` Naoya Horiguchi
  2010-09-03  4:37 ` [PATCH 03/10] hugetlb: redefine hugepage copy functions Naoya Horiguchi
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-03  4:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

We can't use the existing hugepage allocation functions to allocate a
hugepage for page migration, because page migration can happen asynchronously
with respect to the running processes, and page migration users must call
the allocation function with physical addresses (not virtual addresses)
as arguments.
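
For illustration, here is a minimal sketch (not part of this patch) of the
intended use: a new_page_t-style callback, like the one added for soft
offlining later in this series, that allocates the destination hugepage on
the same node as the source page. The function name is made up for the
example.

/*
 * Sketch only: allocate the destination hugepage for migration on the
 * node of the page being migrated, using alloc_huge_page_node().
 */
static struct page *new_hugepage(struct page *p, unsigned long private,
				 int **result)
{
	int nid = page_to_nid(p);

	return alloc_huge_page_node(page_hstate(compound_head(p)), nid);
}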

ChangeLog since v3:
- unify alloc_buddy_huge_page() and alloc_buddy_huge_page_node()

ChangeLog since v2:
- remove unnecessary get/put_mems_allowed() (thanks to David Rientjes)

ChangeLog since v1:
- add comment on top of alloc_huge_page_no_vma()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 include/linux/hugetlb.h |    3 ++
 mm/hugetlb.c            |   78 ++++++++++++++++++++++++++++++++---------------
 2 files changed, 56 insertions(+), 25 deletions(-)

diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
index f479700..0b73c53 100644
--- v2.6.36-rc2/include/linux/hugetlb.h
+++ v2.6.36-rc2/include/linux/hugetlb.h
@@ -228,6 +228,8 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
+struct page *alloc_huge_page_node(struct hstate *h, int nid);
+
 /* arch callback */
 int __init alloc_bootmem_huge_page(struct hstate *h);
 
@@ -303,6 +305,7 @@ static inline struct hstate *page_hstate(struct page *page)
 
 #else
 struct hstate {};
+#define alloc_huge_page_node(h, nid) NULL
 #define alloc_bootmem_huge_page(h) NULL
 #define hstate_file(f) NULL
 #define hstate_vma(v) NULL
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 6871b41..d12431b 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -466,11 +466,22 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
 	h->free_huge_pages_node[nid]++;
 }
 
+static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
+{
+	struct page *page;
+	if (list_empty(&h->hugepage_freelists[nid]))
+		return NULL;
+	page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
+	list_del(&page->lru);
+	h->free_huge_pages--;
+	h->free_huge_pages_node[nid]--;
+	return page;
+}
+
 static struct page *dequeue_huge_page_vma(struct hstate *h,
 				struct vm_area_struct *vma,
 				unsigned long address, int avoid_reserve)
 {
-	int nid;
 	struct page *page = NULL;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
@@ -496,19 +507,13 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
-		nid = zone_to_nid(zone);
-		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
-		    !list_empty(&h->hugepage_freelists[nid])) {
-			page = list_entry(h->hugepage_freelists[nid].next,
-					  struct page, lru);
-			list_del(&page->lru);
-			h->free_huge_pages--;
-			h->free_huge_pages_node[nid]--;
-
-			if (!avoid_reserve)
-				decrement_hugepage_resv_vma(h, vma);
-
-			break;
+		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
+			page = dequeue_huge_page_node(h, zone_to_nid(zone));
+			if (page) {
+				if (!avoid_reserve)
+					decrement_hugepage_resv_vma(h, vma);
+				break;
+			}
 		}
 	}
 err:
@@ -770,11 +775,10 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 	return ret;
 }
 
-static struct page *alloc_buddy_huge_page(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long address)
+static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 {
 	struct page *page;
-	unsigned int nid;
+	unsigned int r_nid;
 
 	if (h->order >= MAX_ORDER)
 		return NULL;
@@ -812,9 +816,14 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	}
 	spin_unlock(&hugetlb_lock);
 
-	page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
-					__GFP_REPEAT|__GFP_NOWARN,
-					huge_page_order(h));
+	if (nid == NUMA_NO_NODE)
+		page = alloc_pages(htlb_alloc_mask|__GFP_COMP|
+				   __GFP_REPEAT|__GFP_NOWARN,
+				   huge_page_order(h));
+	else
+		page = alloc_pages_exact_node(nid,
+			htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
+			__GFP_REPEAT|__GFP_NOWARN, huge_page_order(h));
 
 	if (page && arch_prepare_hugepage(page)) {
 		__free_pages(page, huge_page_order(h));
@@ -829,13 +838,13 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 		 */
 		put_page_testzero(page);
 		VM_BUG_ON(page_count(page));
-		nid = page_to_nid(page);
+		r_nid = page_to_nid(page);
 		set_compound_page_dtor(page, free_huge_page);
 		/*
 		 * We incremented the global counters already
 		 */
-		h->nr_huge_pages_node[nid]++;
-		h->surplus_huge_pages_node[nid]++;
+		h->nr_huge_pages_node[r_nid]++;
+		h->surplus_huge_pages_node[r_nid]++;
 		__count_vm_event(HTLB_BUDDY_PGALLOC);
 	} else {
 		h->nr_huge_pages--;
@@ -848,6 +857,25 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 }
 
 /*
+ * This allocation function is useful in the context where vma is irrelevant.
+ * E.g. soft-offlining uses this function because it only cares physical
+ * address of error page.
+ */
+struct page *alloc_huge_page_node(struct hstate *h, int nid)
+{
+	struct page *page;
+
+	spin_lock(&hugetlb_lock);
+	page = dequeue_huge_page_node(h, nid);
+	spin_unlock(&hugetlb_lock);
+
+	if (!page)
+		page = alloc_buddy_huge_page(h, nid);
+
+	return page;
+}
+
+/*
  * Increase the hugetlb pool such that it can accomodate a reservation
  * of size 'delta'.
  */
@@ -871,7 +899,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
 retry:
 	spin_unlock(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		page = alloc_buddy_huge_page(h, NULL, 0);
+		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
 			/*
 			 * We were not able to allocate enough pages to
@@ -1052,7 +1080,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	spin_unlock(&hugetlb_lock);
 
 	if (!page) {
-		page = alloc_buddy_huge_page(h, vma, addr);
+		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
 		if (!page) {
 			hugetlb_put_quota(inode->i_mapping, chg);
 			return ERR_PTR(-VM_FAULT_SIGBUS);
-- 
1.7.2.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 03/10] hugetlb: redefine hugepage copy functions
  2010-09-03  4:37 [PATCH 0/10] Hugepage migration (v4) Naoya Horiguchi
  2010-09-03  4:37 ` [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault() Naoya Horiguchi
  2010-09-03  4:37 ` [PATCH 02/10] hugetlb: add allocate function for hugepage migration Naoya Horiguchi
@ 2010-09-03  4:37 ` Naoya Horiguchi
  2010-09-03  4:37 ` [PATCH 04/10] hugetlb: hugepage migration core Naoya Horiguchi
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-03  4:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This patch modifies the hugepage copy functions to take only the destination
and source hugepages as arguments, for later use.
The old ones are renamed from copy_{gigantic,huge}_page() to
copy_user_{gigantic,huge}_page().
This naming convention is consistent with that between copy_highpage()
and copy_user_highpage().
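
As a small sketch (not from this patch; the same shape appears in
migrate_page_copy() in the next patch), the two-argument form lets a
migration copy helper handle hugepages without a vma or a faulting address.
The helper name below is hypothetical.

/* Sketch: a copy helper that needs no vma, unlike copy_user_huge_page(). */
static void copy_page_for_migration(struct page *newpage, struct page *page)
{
	if (PageHuge(page))
		copy_huge_page(newpage, page);
	else
		copy_highpage(newpage, page);
}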

ChangeLog since v2:
- change copy_huge_page() from macro to inline dummy function
  to avoid compile warning when !CONFIG_HUGETLB_PAGE.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/hugetlb.h |    4 ++++
 mm/hugetlb.c            |   43 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 43 insertions(+), 4 deletions(-)

diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
index 0b73c53..9e51f77 100644
--- v2.6.36-rc2/include/linux/hugetlb.h
+++ v2.6.36-rc2/include/linux/hugetlb.h
@@ -44,6 +44,7 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						int acctflags);
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
 void __isolate_hwpoisoned_huge_page(struct page *page);
+void copy_huge_page(struct page *dst, struct page *src);
 
 extern unsigned long hugepages_treat_as_movable;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
@@ -102,6 +103,9 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
 #define huge_pte_offset(mm, address)	0
 #define __isolate_hwpoisoned_huge_page(page)	0
+static inline void copy_huge_page(struct page *dst, struct page *src)
+{
+}
 
 #define hugetlb_change_protection(vma, address, end, newprot)
 
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index d12431b..c3f2ec6 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -423,7 +423,7 @@ static void clear_huge_page(struct page *page,
 	}
 }
 
-static void copy_gigantic_page(struct page *dst, struct page *src,
+static void copy_user_gigantic_page(struct page *dst, struct page *src,
 			   unsigned long addr, struct vm_area_struct *vma)
 {
 	int i;
@@ -440,14 +440,15 @@ static void copy_gigantic_page(struct page *dst, struct page *src,
 		src = mem_map_next(src, src_base, i);
 	}
 }
-static void copy_huge_page(struct page *dst, struct page *src,
+
+static void copy_user_huge_page(struct page *dst, struct page *src,
 			   unsigned long addr, struct vm_area_struct *vma)
 {
 	int i;
 	struct hstate *h = hstate_vma(vma);
 
 	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
-		copy_gigantic_page(dst, src, addr, vma);
+		copy_user_gigantic_page(dst, src, addr, vma);
 		return;
 	}
 
@@ -458,6 +459,40 @@ static void copy_huge_page(struct page *dst, struct page *src,
 	}
 }
 
+static void copy_gigantic_page(struct page *dst, struct page *src)
+{
+	int i;
+	struct hstate *h = page_hstate(src);
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page(h); ) {
+		cond_resched();
+		copy_highpage(dst, src);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+
+void copy_huge_page(struct page *dst, struct page *src)
+{
+	int i;
+	struct hstate *h = page_hstate(src);
+
+	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
+		copy_gigantic_page(dst, src);
+		return;
+	}
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page(h); i++) {
+		cond_resched();
+		copy_highpage(dst + i, src + i);
+	}
+}
+
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
@@ -2414,7 +2449,7 @@ retry_avoidcopy:
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
 
-	copy_huge_page(new_page, old_page, address, vma);
+	copy_user_huge_page(new_page, old_page, address, vma);
 	__SetPageUptodate(new_page);
 
 	/*
-- 
1.7.2.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-03  4:37 [PATCH 0/10] Hugepage migration (v4) Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2010-09-03  4:37 ` [PATCH 03/10] hugetlb: redefine hugepage copy functions Naoya Horiguchi
@ 2010-09-03  4:37 ` Naoya Horiguchi
  2010-09-03 14:10   ` Christoph Lameter
  2010-09-03  4:37 ` [PATCH 05/10] HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page() Naoya Horiguchi
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-03  4:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This patch extends the page migration code to support hugepage migration.
One of the potential users of this feature is soft offlining, which
is triggered by corrected memory errors (added by the next patch).
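
For orientation, a rough caller-side sketch of the sequence this core
enables, modelled on the soft-offline path added later in this series
(simplified; error paths, the private argument and the offlining flag are
glossed over, and the function name is made up):

/*
 * Sketch only: migrate one in-use hugepage.  get_new_page is a new_page_t
 * callback returning a destination hugepage (see patch 2/10).
 */
static int migrate_one_hugepage(struct page *hpage, new_page_t get_new_page)
{
	LIST_HEAD(pagelist);
	int ret;

	/* For hugepages this only takes a reference on the head page. */
	ret = isolate_lru_page(hpage);
	if (ret)
		return ret;

	list_add(&hpage->lru, &pagelist);
	/* Pages that could not be migrated are put back by the core. */
	return migrate_pages(&pagelist, get_new_page, 0, 0);
}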

Todo:
- there are other users of page migration, such as memory policy,
  memory hotplug and memory compaction.
  They do not support hugepages for now.

ChangeLog since v2:
- refactor isolate/putback_lru_page() to handle hugepage
- add comment about race on unmap_and_move_huge_page()

ChangeLog since v1:
- divide migration code path for hugepage
- define routine checking migration swap entry for hugetlb
- replace "goto" with "if/else" in remove_migration_pte()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 fs/hugetlbfs/inode.c    |   15 ++++
 include/linux/migrate.h |   12 +++
 mm/hugetlb.c            |   18 ++++-
 mm/migrate.c            |  192 ++++++++++++++++++++++++++++++++++++++++++-----
 mm/vmscan.c             |    9 ++-
 5 files changed, 225 insertions(+), 21 deletions(-)

diff --git v2.6.36-rc2/fs/hugetlbfs/inode.c v2.6.36-rc2/fs/hugetlbfs/inode.c
index 6e5bd42..1f7ca50 100644
--- v2.6.36-rc2/fs/hugetlbfs/inode.c
+++ v2.6.36-rc2/fs/hugetlbfs/inode.c
@@ -31,6 +31,7 @@
 #include <linux/statfs.h>
 #include <linux/security.h>
 #include <linux/magic.h>
+#include <linux/migrate.h>
 
 #include <asm/uaccess.h>
 
@@ -573,6 +574,19 @@ static int hugetlbfs_set_page_dirty(struct page *page)
 	return 0;
 }
 
+static int hugetlbfs_migrate_page(struct address_space *mapping,
+				struct page *newpage, struct page *page)
+{
+	int rc;
+
+	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
+	if (rc)
+		return rc;
+	migrate_page_copy(newpage, page);
+
+	return 0;
+}
+
 static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
@@ -659,6 +673,7 @@ static const struct address_space_operations hugetlbfs_aops = {
 	.write_begin	= hugetlbfs_write_begin,
 	.write_end	= hugetlbfs_write_end,
 	.set_page_dirty	= hugetlbfs_set_page_dirty,
+	.migratepage    = hugetlbfs_migrate_page,
 };
 
 
diff --git v2.6.36-rc2/include/linux/migrate.h v2.6.36-rc2/include/linux/migrate.h
index 7238231..f4c15ff 100644
--- v2.6.36-rc2/include/linux/migrate.h
+++ v2.6.36-rc2/include/linux/migrate.h
@@ -23,6 +23,9 @@ extern int migrate_prep_local(void);
 extern int migrate_vmas(struct mm_struct *mm,
 		const nodemask_t *from, const nodemask_t *to,
 		unsigned long flags);
+extern void migrate_page_copy(struct page *newpage, struct page *page);
+extern int migrate_huge_page_move_mapping(struct address_space *mapping,
+				  struct page *newpage, struct page *page);
 #else
 #define PAGE_MIGRATION 0
 
@@ -40,6 +43,15 @@ static inline int migrate_vmas(struct mm_struct *mm,
 	return -ENOSYS;
 }
 
+static inline void migrate_page_copy(struct page *newpage,
+				     struct page *page) {}
+
+static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
+				  struct page *newpage, struct page *page)
+{
+	return -ENOSYS;
+}
+
 /* Possible settings for the migrate_page() method in address_operations */
 #define migrate_page NULL
 #define fail_migrate_page NULL
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index c3f2ec6..d3e9d29 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -2216,6 +2216,19 @@ nomem:
 	return -ENOMEM;
 }
 
+static int is_hugetlb_entry_migration(pte_t pte)
+{
+	swp_entry_t swp;
+
+	if (huge_pte_none(pte) || pte_present(pte))
+		return 0;
+	swp = pte_to_swp_entry(pte);
+	if (non_swap_entry(swp) && is_migration_entry(swp)) {
+		return 1;
+	} else
+		return 0;
+}
+
 static int is_hugetlb_entry_hwpoisoned(pte_t pte)
 {
 	swp_entry_t swp;
@@ -2650,7 +2663,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	ptep = huge_pte_offset(mm, address);
 	if (ptep) {
 		entry = huge_ptep_get(ptep);
-		if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+		if (unlikely(is_hugetlb_entry_migration(entry))) {
+			migration_entry_wait(mm, (pmd_t *)ptep, address);
+			return 0;
+		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON;
 	}
 
diff --git v2.6.36-rc2/mm/migrate.c v2.6.36-rc2/mm/migrate.c
index 38e7cad..aae3b3d 100644
--- v2.6.36-rc2/mm/migrate.c
+++ v2.6.36-rc2/mm/migrate.c
@@ -32,6 +32,7 @@
 #include <linux/security.h>
 #include <linux/memcontrol.h>
 #include <linux/syscalls.h>
+#include <linux/hugetlb.h>
 #include <linux/gfp.h>
 
 #include "internal.h"
@@ -95,26 +96,34 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	pte_t *ptep, pte;
  	spinlock_t *ptl;
 
- 	pgd = pgd_offset(mm, addr);
-	if (!pgd_present(*pgd))
-		goto out;
+	if (unlikely(PageHuge(new))) {
+		ptep = huge_pte_offset(mm, addr);
+		if (!ptep)
+			goto out;
+		ptl = &mm->page_table_lock;
+	} else {
+		pgd = pgd_offset(mm, addr);
+		if (!pgd_present(*pgd))
+			goto out;
 
-	pud = pud_offset(pgd, addr);
-	if (!pud_present(*pud))
-		goto out;
+		pud = pud_offset(pgd, addr);
+		if (!pud_present(*pud))
+			goto out;
 
-	pmd = pmd_offset(pud, addr);
-	if (!pmd_present(*pmd))
-		goto out;
+		pmd = pmd_offset(pud, addr);
+		if (!pmd_present(*pmd))
+			goto out;
 
-	ptep = pte_offset_map(pmd, addr);
+		ptep = pte_offset_map(pmd, addr);
 
-	if (!is_swap_pte(*ptep)) {
-		pte_unmap(ptep);
-		goto out;
- 	}
+		if (!is_swap_pte(*ptep)) {
+			pte_unmap(ptep);
+			goto out;
+		}
+
+		ptl = pte_lockptr(mm, pmd);
+	}
 
- 	ptl = pte_lockptr(mm, pmd);
  	spin_lock(ptl);
 	pte = *ptep;
 	if (!is_swap_pte(pte))
@@ -130,10 +139,17 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
 	if (is_write_migration_entry(entry))
 		pte = pte_mkwrite(pte);
+	if (PageHuge(new))
+		pte = pte_mkhuge(pte);
 	flush_cache_page(vma, addr, pte_pfn(pte));
 	set_pte_at(mm, addr, ptep, pte);
 
-	if (PageAnon(new))
+	if (PageHuge(new)) {
+		if (PageAnon(new))
+			hugepage_add_anon_rmap(new, vma, addr);
+		else
+			page_dup_rmap(new);
+	} else if (PageAnon(new))
 		page_add_anon_rmap(new, vma, addr);
 	else
 		page_add_file_rmap(new);
@@ -276,11 +292,59 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 }
 
 /*
+ * The expected number of remaining references is the same as that
+ * of migrate_page_move_mapping().
+ */
+int migrate_huge_page_move_mapping(struct address_space *mapping,
+				   struct page *newpage, struct page *page)
+{
+	int expected_count;
+	void **pslot;
+
+	if (!mapping) {
+		if (page_count(page) != 1)
+			return -EAGAIN;
+		return 0;
+	}
+
+	spin_lock_irq(&mapping->tree_lock);
+
+	pslot = radix_tree_lookup_slot(&mapping->page_tree,
+					page_index(page));
+
+	expected_count = 2 + page_has_private(page);
+	if (page_count(page) != expected_count ||
+	    (struct page *)radix_tree_deref_slot(pslot) != page) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
+	if (!page_freeze_refs(page, expected_count)) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
+	get_page(newpage);
+
+	radix_tree_replace_slot(pslot, newpage);
+
+	page_unfreeze_refs(page, expected_count);
+
+	__put_page(page);
+
+	spin_unlock_irq(&mapping->tree_lock);
+	return 0;
+}
+
+/*
  * Copy the page to its new location
  */
-static void migrate_page_copy(struct page *newpage, struct page *page)
+void migrate_page_copy(struct page *newpage, struct page *page)
 {
-	copy_highpage(newpage, page);
+	if (PageHuge(page))
+		copy_huge_page(newpage, page);
+	else
+		copy_highpage(newpage, page);
 
 	if (PageError(page))
 		SetPageError(newpage);
@@ -724,6 +788,92 @@ move_newpage:
 }
 
 /*
+ * Counterpart of unmap_and_move_page() for hugepage migration.
+ *
+ * This function doesn't wait the completion of hugepage I/O
+ * because there is no race between I/O and migration for hugepage.
+ * Note that currently hugepage I/O occurs only in direct I/O
+ * where no lock is held and PG_writeback is irrelevant,
+ * and writeback status of all subpages are counted in the reference
+ * count of the head page (i.e. if all subpages of a 2MB hugepage are
+ * under direct I/O, the reference of the head page is 512 and a bit more.)
+ * This means that when we try to migrate hugepage whose subpages are
+ * doing direct I/O, some references remain after try_to_unmap() and
+ * hugepage migration fails without data corruption.
+ *
+ * There is also no race when direct I/O is issued on the page under migration,
+ * because then pte is replaced with migration swap entry and direct I/O code
+ * will wait in the page fault for migration to complete.
+ */
+static int unmap_and_move_huge_page(new_page_t get_new_page,
+				unsigned long private, struct page *hpage,
+				int force, int offlining)
+{
+	int rc = 0;
+	int *result = NULL;
+	struct page *new_hpage = get_new_page(hpage, private, &result);
+	int rcu_locked = 0;
+	struct anon_vma *anon_vma = NULL;
+
+	if (!new_hpage)
+		return -ENOMEM;
+
+	rc = -EAGAIN;
+
+	if (!trylock_page(hpage)) {
+		if (!force)
+			goto out;
+		lock_page(hpage);
+	}
+
+	if (PageAnon(hpage)) {
+		rcu_read_lock();
+		rcu_locked = 1;
+
+		if (page_mapped(hpage)) {
+			anon_vma = page_anon_vma(hpage);
+			atomic_inc(&anon_vma->external_refcount);
+		}
+	}
+
+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+
+	if (!page_mapped(hpage))
+		rc = move_to_new_page(new_hpage, hpage, 1);
+
+	if (rc)
+		remove_migration_ptes(hpage, hpage);
+
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
+					    &anon_vma->lock)) {
+		int empty = list_empty(&anon_vma->head);
+		spin_unlock(&anon_vma->lock);
+		if (empty)
+			anon_vma_free(anon_vma);
+	}
+
+	if (rcu_locked)
+		rcu_read_unlock();
+out:
+	unlock_page(hpage);
+
+	if (rc != -EAGAIN) {
+		list_del(&hpage->lru);
+		putback_lru_page(hpage);
+	}
+
+	putback_lru_page(new_hpage);
+
+	if (result) {
+		if (rc)
+			*result = rc;
+		else
+			*result = page_to_nid(new_hpage);
+	}
+	return rc;
+}
+
+/*
  * migrate_pages
  *
  * The function takes one list of pages to migrate and a function
@@ -757,7 +907,11 @@ int migrate_pages(struct list_head *from,
 		list_for_each_entry_safe(page, page2, from, lru) {
 			cond_resched();
 
-			rc = unmap_and_move(get_new_page, private,
+			if (PageHuge(page)) {
+				rc = unmap_and_move_huge_page(get_new_page,
+					private, page, pass > 2, offlining);
+			} else
+				rc = unmap_and_move(get_new_page, private,
 						page, pass > 2, offlining);
 
 			switch(rc) {
diff --git v2.6.36-rc2/mm/vmscan.c v2.6.36-rc2/mm/vmscan.c
index c391c32..69ce2a3 100644
--- v2.6.36-rc2/mm/vmscan.c
+++ v2.6.36-rc2/mm/vmscan.c
@@ -40,6 +40,7 @@
 #include <linux/memcontrol.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -508,6 +509,10 @@ void putback_lru_page(struct page *page)
 
 	VM_BUG_ON(PageLRU(page));
 
+	if (PageHuge(page)) {
+		put_page(page);
+		return;
+	}
 redo:
 	ClearPageUnevictable(page);
 
@@ -1112,7 +1117,9 @@ int isolate_lru_page(struct page *page)
 {
 	int ret = -EBUSY;
 
-	if (PageLRU(page)) {
+	if (PageHuge(page) && get_page_unless_zero(compound_head(page)))
+		ret = 0;
+	else if (PageLRU(page)) {
 		struct zone *zone = page_zone(page);
 
 		spin_lock_irq(&zone->lru_lock);
-- 
1.7.2.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 05/10] HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page()
  2010-09-03  4:37 [PATCH 0/10] Hugepage migration (v4) Naoya Horiguchi
                   ` (3 preceding siblings ...)
  2010-09-03  4:37 ` [PATCH 04/10] hugetlb: hugepage migration core Naoya Horiguchi
@ 2010-09-03  4:37 ` Naoya Horiguchi
  2010-09-03  4:37 ` [PATCH 06/10] hugetlb: move refcounting in hugepage allocation inside hugetlb_lock Naoya Horiguchi
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-03  4:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This check is necessary to avoid a race between dequeue and allocation,
which can cause a free hugepage to be dequeued twice and make the kernel unstable.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/hugetlb.h |    4 ++--
 mm/hugetlb.c            |   29 +++++++++++++++++++++++++----
 mm/memory-failure.c     |    6 ++++--
 3 files changed, 31 insertions(+), 8 deletions(-)

diff --git v2.6.36-rc2/include/linux/hugetlb.h v2.6.36-rc2/include/linux/hugetlb.h
index 9e51f77..796f30e 100644
--- v2.6.36-rc2/include/linux/hugetlb.h
+++ v2.6.36-rc2/include/linux/hugetlb.h
@@ -43,7 +43,7 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
 						int acctflags);
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
-void __isolate_hwpoisoned_huge_page(struct page *page);
+int dequeue_hwpoisoned_huge_page(struct page *page);
 void copy_huge_page(struct page *dst, struct page *src);
 
 extern unsigned long hugepages_treat_as_movable;
@@ -102,7 +102,7 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
 #define huge_pte_offset(mm, address)	0
-#define __isolate_hwpoisoned_huge_page(page)	0
+#define dequeue_hwpoisoned_huge_page(page)	0
 static inline void copy_huge_page(struct page *dst, struct page *src)
 {
 }
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index d3e9d29..7cf9225 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -2952,18 +2952,39 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
 	hugetlb_acct_memory(h, -(chg - freed));
 }
 
+/* Should be called in hugetlb_lock */
+static int is_hugepage_on_freelist(struct page *hpage)
+{
+	struct page *page;
+	struct page *tmp;
+	struct hstate *h = page_hstate(hpage);
+	int nid = page_to_nid(hpage);
+
+	list_for_each_entry_safe(page, tmp, &h->hugepage_freelists[nid], lru)
+		if (page == hpage)
+			return 1;
+	return 0;
+}
+
+#ifdef CONFIG_MEMORY_FAILURE
 /*
  * This function is called from memory failure code.
  * Assume the caller holds page lock of the head page.
  */
-void __isolate_hwpoisoned_huge_page(struct page *hpage)
+int dequeue_hwpoisoned_huge_page(struct page *hpage)
 {
 	struct hstate *h = page_hstate(hpage);
 	int nid = page_to_nid(hpage);
+	int ret = -EBUSY;
 
 	spin_lock(&hugetlb_lock);
-	list_del(&hpage->lru);
-	h->free_huge_pages--;
-	h->free_huge_pages_node[nid]--;
+	if (is_hugepage_on_freelist(hpage)) {
+		list_del(&hpage->lru);
+		h->free_huge_pages--;
+		h->free_huge_pages_node[nid]--;
+		ret = 0;
+	}
 	spin_unlock(&hugetlb_lock);
+	return ret;
 }
+#endif
diff --git v2.6.36-rc2/mm/memory-failure.c v2.6.36-rc2/mm/memory-failure.c
index 9c26eec..c67f801 100644
--- v2.6.36-rc2/mm/memory-failure.c
+++ v2.6.36-rc2/mm/memory-failure.c
@@ -698,6 +698,7 @@ static int me_swapcache_clean(struct page *p, unsigned long pfn)
  */
 static int me_huge_page(struct page *p, unsigned long pfn)
 {
+	int res = 0;
 	struct page *hpage = compound_head(p);
 	/*
 	 * We can safely recover from error on free or reserved (i.e.
@@ -710,8 +711,9 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 	 * so there is no race between isolation and mapping/unmapping.
 	 */
 	if (!(page_mapping(hpage) || PageAnon(hpage))) {
-		__isolate_hwpoisoned_huge_page(hpage);
-		return RECOVERED;
+		res = dequeue_hwpoisoned_huge_page(hpage);
+		if (!res)
+			return RECOVERED;
 	}
 	return DELAYED;
 }
-- 
1.7.2.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 06/10] hugetlb: move refcounting in hugepage allocation inside hugetlb_lock
  2010-09-03  4:37 [PATCH 0/10] Hugepage migration (v4) Naoya Horiguchi
                   ` (4 preceding siblings ...)
  2010-09-03  4:37 ` [PATCH 05/10] HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page() Naoya Horiguchi
@ 2010-09-03  4:37 ` Naoya Horiguchi
  2010-09-03  4:37 ` [PATCH 07/10] HWPOISON, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED Naoya Horiguchi
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-03  4:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Currently alloc_huge_page() raises the page refcount outside hugetlb_lock,
but this causes a race when dequeue_hwpoisoned_huge_page() runs concurrently
with alloc_huge_page().
To avoid it, this patch moves set_page_refcounted() inside hugetlb_lock.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/hugetlb.c |   35 +++++++++++++----------------------
 1 files changed, 13 insertions(+), 22 deletions(-)

diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 7cf9225..8ecae16 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -508,6 +508,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
 		return NULL;
 	page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
 	list_del(&page->lru);
+	set_page_refcounted(page);
 	h->free_huge_pages--;
 	h->free_huge_pages_node[nid]--;
 	return page;
@@ -867,12 +868,6 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
 
 	spin_lock(&hugetlb_lock);
 	if (page) {
-		/*
-		 * This page is now managed by the hugetlb allocator and has
-		 * no users -- drop the buddy allocator's reference.
-		 */
-		put_page_testzero(page);
-		VM_BUG_ON(page_count(page));
 		r_nid = page_to_nid(page);
 		set_compound_page_dtor(page, free_huge_page);
 		/*
@@ -935,16 +930,13 @@ retry:
 	spin_unlock(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
 		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
-		if (!page) {
+		if (!page)
 			/*
 			 * We were not able to allocate enough pages to
 			 * satisfy the entire reservation so we free what
 			 * we've allocated so far.
 			 */
-			spin_lock(&hugetlb_lock);
-			needed = 0;
 			goto free;
-		}
 
 		list_add(&page->lru, &surplus_list);
 	}
@@ -971,31 +963,31 @@ retry:
 	needed += allocated;
 	h->resv_huge_pages += delta;
 	ret = 0;
-free:
+
+	spin_unlock(&hugetlb_lock);
 	/* Free the needed pages to the hugetlb pool */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
 		if ((--needed) < 0)
 			break;
 		list_del(&page->lru);
+		/*
+		 * This page is now managed by the hugetlb allocator and has
+		 * no users -- drop the buddy allocator's reference.
+		 */
+		put_page_testzero(page);
+		VM_BUG_ON(page_count(page));
 		enqueue_huge_page(h, page);
 	}
 
 	/* Free unnecessary surplus pages to the buddy allocator */
+free:
 	if (!list_empty(&surplus_list)) {
-		spin_unlock(&hugetlb_lock);
 		list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
 			list_del(&page->lru);
-			/*
-			 * The page has a reference count of zero already, so
-			 * call free_huge_page directly instead of using
-			 * put_page.  This must be done with hugetlb_lock
-			 * unlocked which is safe because free_huge_page takes
-			 * hugetlb_lock before deciding how to free the page.
-			 */
-			free_huge_page(page);
+			put_page(page);
 		}
-		spin_lock(&hugetlb_lock);
 	}
+	spin_lock(&hugetlb_lock);
 
 	return ret;
 }
@@ -1122,7 +1114,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 		}
 	}
 
-	set_page_refcounted(page);
 	set_page_private(page, (unsigned long) mapping);
 
 	vma_commit_reservation(h, vma, addr);
-- 
1.7.2.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 07/10] HWPOISON, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED
  2010-09-03  4:37 [PATCH 0/10] Hugepage migration (v4) Naoya Horiguchi
                   ` (5 preceding siblings ...)
  2010-09-03  4:37 ` [PATCH 06/10] hugetlb: move refcounting in hugepage allocation inside hugetlb_lock Naoya Horiguchi
@ 2010-09-03  4:37 ` Naoya Horiguchi
  2010-09-03  4:37 ` [PATCH 08/10] HWPOISON, hugetlb: soft offlining for hugepage Naoya Horiguchi
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-03  4:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Currently error recovery for free hugepages works only in the
MF_COUNT_INCREASED case. This patch enables the !MF_COUNT_INCREASED case.

Free hugepages can be handled directly by alloc_huge_page() and
dequeue_hwpoisoned_huge_page(), and both of them are protected
by hugetlb_lock, so there is no race between them.

Note that this patch defines the refcount of a HWPoisoned hugepage
dequeued from the freelist to be 1, rather than the present 0, so that
we can avoid a race between unpoison and memory failure on a free
hugepage. This is reasonable because, unlike free buddy pages, a free
hugepage is governed by hugetlbfs even after error handling finishes.
It also makes the unpoison code added in a later patch cleaner.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 mm/hugetlb.c        |    1 +
 mm/memory-failure.c |   33 ++++++++++++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 1 deletions(-)

diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 8ecae16..ad7d175 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -2971,6 +2971,7 @@ int dequeue_hwpoisoned_huge_page(struct page *hpage)
 	spin_lock(&hugetlb_lock);
 	if (is_hugepage_on_freelist(hpage)) {
 		list_del(&hpage->lru);
+		set_page_refcounted(hpage);
 		h->free_huge_pages--;
 		h->free_huge_pages_node[nid]--;
 		ret = 0;
diff --git v2.6.36-rc2/mm/memory-failure.c v2.6.36-rc2/mm/memory-failure.c
index c67f801..dfeb8b8 100644
--- v2.6.36-rc2/mm/memory-failure.c
+++ v2.6.36-rc2/mm/memory-failure.c
@@ -983,7 +983,10 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 	 * We need/can do nothing about count=0 pages.
 	 * 1) it's a free page, and therefore in safe hand:
 	 *    prep_new_page() will be the gate keeper.
-	 * 2) it's part of a non-compound high order page.
+	 * 2) it's a free hugepage, which is also safe:
+	 *    an affected hugepage will be dequeued from hugepage freelist,
+	 *    so there's no concern about reusing it ever after.
+	 * 3) it's part of a non-compound high order page.
 	 *    Implies some kernel user: cannot stop them from
 	 *    R/W the page; let's pray that the page has been
 	 *    used and will be freed some time later.
@@ -995,6 +998,24 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 		if (is_free_buddy_page(p)) {
 			action_result(pfn, "free buddy", DELAYED);
 			return 0;
+		} else if (PageHuge(hpage)) {
+			/*
+			 * Check "just unpoisoned", "filter hit", and
+			 * "race with other subpage."
+			 */
+			lock_page_nosync(hpage);
+			if (!PageHWPoison(hpage)
+			    || (hwpoison_filter(p) && TestClearPageHWPoison(p))
+			    || (p != hpage && TestSetPageHWPoison(hpage))) {
+				atomic_long_sub(nr_pages, &mce_bad_pages);
+				return 0;
+			}
+			set_page_hwpoison_huge_page(hpage);
+			res = dequeue_hwpoisoned_huge_page(hpage);
+			action_result(pfn, "free huge",
+				      res ? IGNORED : DELAYED);
+			unlock_page(hpage);
+			return res;
 		} else {
 			action_result(pfn, "high order kernel", IGNORED);
 			return -EBUSY;
@@ -1156,6 +1177,16 @@ int unpoison_memory(unsigned long pfn)
 	nr_pages = 1 << compound_order(page);
 
 	if (!get_page_unless_zero(page)) {
+		/*
+		 * Since HWPoisoned hugepage should have non-zero refcount,
+		 * race between memory failure and unpoison seems to happen.
+		 * In such case unpoison fails and memory failure runs
+		 * to the end.
+		 */
+		if (PageHuge(page)) {
+			pr_debug("MCE: Memory failure is now running on free hugepage %#lx\n", pfn);
+			return 0;
+		}
 		if (TestClearPageHWPoison(p))
 			atomic_long_sub(nr_pages, &mce_bad_pages);
 		pr_debug("MCE: Software-unpoisoned free page %#lx\n", pfn);
-- 
1.7.2.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 08/10] HWPOISON, hugetlb: soft offlining for hugepage
  2010-09-03  4:37 [PATCH 0/10] Hugepage migration (v4) Naoya Horiguchi
                   ` (6 preceding siblings ...)
  2010-09-03  4:37 ` [PATCH 07/10] HWPOISON, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED Naoya Horiguchi
@ 2010-09-03  4:37 ` Naoya Horiguchi
  2010-09-03  4:37 ` [PATCH 09/10] HWPOISON, hugetlb: fix unpoison " Naoya Horiguchi
  2010-09-03  4:37 ` [PATCH 10/10] page-types.c: fix name of unpoison interface Naoya Horiguchi
  9 siblings, 0 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-03  4:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This patch extends the soft offlining framework to support hugepages.
When corrected memory errors occur repeatedly on a hugepage,
we can choose to stop using it by migrating the data onto another hugepage
and disabling the original (possibly half-broken) one.
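
As a usage illustration (not part of the patch), soft offlining of a
hugepage can be exercised from userspace with madvise(MADV_SOFT_OFFLINE)
on a hugetlb mapping. The sketch below assumes CONFIG_MEMORY_FAILURE,
CAP_SYS_ADMIN, hugepages reserved via vm.nr_hugepages, a 2MB hugepage
size, and x86 constants as fallbacks for old userspace headers.

/* Sketch: trigger soft offline on one anonymous hugepage. */
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB		0x40000	/* assumed x86 value */
#endif
#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE	101	/* assumed asm-generic value */
#endif

#define HPAGE_SIZE	(2UL << 20)	/* assume 2MB hugepages */

int main(void)
{
	char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p[0] = 1;	/* fault the hugepage in */

	/* Ask the kernel to migrate the data and retire this hugepage. */
	if (madvise(p, HPAGE_SIZE, MADV_SOFT_OFFLINE)) {
		perror("madvise(MADV_SOFT_OFFLINE)");
		return 1;
	}
	return 0;
}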

ChangeLog since v3:
- remove comment about "ToDo: hugepage soft-offline"

ChangeLog since v2:
- move refcount handling into isolate_lru_page()

ChangeLog since v1:
- add double check in isolating hwpoisoned hugepage
- define free/non-free checker for hugepage
- postpone calling put_page() for hugepage in soft_offline_page()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 mm/memory-failure.c |   50 +++++++++++++++++++++++++++++++++-----------------
 1 files changed, 33 insertions(+), 17 deletions(-)

diff --git v2.6.36-rc2/mm/memory-failure.c v2.6.36-rc2/mm/memory-failure.c
index dfeb8b8..80e71cd 100644
--- v2.6.36-rc2/mm/memory-failure.c
+++ v2.6.36-rc2/mm/memory-failure.c
@@ -693,8 +693,6 @@ static int me_swapcache_clean(struct page *p, unsigned long pfn)
  * Issues:
  * - Error on hugepage is contained in hugepage unit (not in raw page unit.)
  *   To narrow down kill region to one page, we need to break up pmd.
- * - To support soft-offlining for hugepage, we need to support hugepage
- *   migration.
  */
 static int me_huge_page(struct page *p, unsigned long pfn)
 {
@@ -1220,7 +1218,11 @@ EXPORT_SYMBOL(unpoison_memory);
 static struct page *new_page(struct page *p, unsigned long private, int **x)
 {
 	int nid = page_to_nid(p);
-	return alloc_pages_exact_node(nid, GFP_HIGHUSER_MOVABLE, 0);
+	if (PageHuge(p))
+		return alloc_huge_page_node(page_hstate(compound_head(p)),
+						   nid);
+	else
+		return alloc_pages_exact_node(nid, GFP_HIGHUSER_MOVABLE, 0);
 }
 
 /*
@@ -1248,8 +1250,15 @@ static int get_any_page(struct page *p, unsigned long pfn, int flags)
 	 * was free.
 	 */
 	set_migratetype_isolate(p);
+	/*
+	 * When the target page is a free hugepage, just remove it
+	 * from free hugepage list.
+	 */
 	if (!get_page_unless_zero(compound_head(p))) {
-		if (is_free_buddy_page(p)) {
+		if (PageHuge(p)) {
+			pr_debug("get_any_page: %#lx free huge page\n", pfn);
+			ret = dequeue_hwpoisoned_huge_page(compound_head(p));
+		} else if (is_free_buddy_page(p)) {
 			pr_debug("get_any_page: %#lx free buddy page\n", pfn);
 			/* Set hwpoison bit while page is still isolated */
 			SetPageHWPoison(p);
@@ -1294,6 +1303,7 @@ int soft_offline_page(struct page *page, int flags)
 {
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
+	struct page *hpage = compound_head(page);
 
 	ret = get_any_page(page, pfn, flags);
 	if (ret < 0)
@@ -1304,7 +1314,7 @@ int soft_offline_page(struct page *page, int flags)
 	/*
 	 * Page cache page we can handle?
 	 */
-	if (!PageLRU(page)) {
+	if (!PageLRU(page) && !PageHuge(page)) {
 		/*
 		 * Try to free it.
 		 */
@@ -1320,21 +1330,21 @@ int soft_offline_page(struct page *page, int flags)
 		if (ret == 0)
 			goto done;
 	}
-	if (!PageLRU(page)) {
+	if (!PageLRU(page) && !PageHuge(page)) {
 		pr_debug("soft_offline: %#lx: unknown non LRU page type %lx\n",
 				pfn, page->flags);
 		return -EIO;
 	}
 
-	lock_page(page);
-	wait_on_page_writeback(page);
+	lock_page(hpage);
+	wait_on_page_writeback(hpage);
 
 	/*
 	 * Synchronized using the page lock with memory_failure()
 	 */
-	if (PageHWPoison(page)) {
-		unlock_page(page);
-		put_page(page);
+	if (PageHWPoison(hpage)) {
+		unlock_page(hpage);
+		put_page(hpage);
 		pr_debug("soft offline: %#lx page already poisoned\n", pfn);
 		return -EBUSY;
 	}
@@ -1344,7 +1354,7 @@ int soft_offline_page(struct page *page, int flags)
 	 * non dirty unmapped page cache pages.
 	 */
 	ret = invalidate_inode_page(page);
-	unlock_page(page);
+	unlock_page(hpage);
 
 	/*
 	 * Drop count because page migration doesn't like raised
@@ -1353,7 +1363,7 @@ int soft_offline_page(struct page *page, int flags)
 	 * RED-PEN would be better to keep it isolated here, but we
 	 * would need to fix isolation locking first.
 	 */
-	put_page(page);
+	put_page(hpage);
 	if (ret == 1) {
 		ret = 0;
 		pr_debug("soft_offline: %#lx: invalidated\n", pfn);
@@ -1369,7 +1379,7 @@ int soft_offline_page(struct page *page, int flags)
 	if (!ret) {
 		LIST_HEAD(pagelist);
 
-		list_add(&page->lru, &pagelist);
+		list_add(&hpage->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0);
 		if (ret) {
 			pr_debug("soft offline: %#lx: migration failed %d, type %lx\n",
@@ -1385,9 +1395,15 @@ int soft_offline_page(struct page *page, int flags)
 		return ret;
 
 done:
-	atomic_long_add(1, &mce_bad_pages);
-	SetPageHWPoison(page);
-	/* keep elevated page count for bad page */
+	if (!PageHWPoison(hpage))
+		atomic_long_add(1 << compound_order(hpage), &mce_bad_pages);
+	if (PageHuge(hpage)) {
+		set_page_hwpoison_huge_page(hpage);
+		dequeue_hwpoisoned_huge_page(hpage);
+	} else {
+		SetPageHWPoison(page);
+		/* keep elevated page count for bad page */
+	}
 	return ret;
 }
 
-- 
1.7.2.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 09/10] HWPOISON, hugetlb: fix unpoison for hugepage
  2010-09-03  4:37 [PATCH 0/10] Hugepage migration (v4) Naoya Horiguchi
                   ` (7 preceding siblings ...)
  2010-09-03  4:37 ` [PATCH 08/10] HWPOISON, hugetlb: soft offlining for hugepage Naoya Horiguchi
@ 2010-09-03  4:37 ` Naoya Horiguchi
  2010-09-03  4:37 ` [PATCH 10/10] page-types.c: fix name of unpoison interface Naoya Horiguchi
  9 siblings, 0 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-03  4:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Currently unpoisoning hugepages doesn't work correctly because
clearing PG_HWPoison is done outside the if (TestClearPageHWPoison) block.
This patch fixes it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 mm/memory-failure.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git v2.6.36-rc2/mm/memory-failure.c v2.6.36-rc2/mm/memory-failure.c
index 80e71cd..b7fb8cc 100644
--- v2.6.36-rc2/mm/memory-failure.c
+++ v2.6.36-rc2/mm/memory-failure.c
@@ -1202,9 +1202,9 @@ int unpoison_memory(unsigned long pfn)
 		pr_debug("MCE: Software-unpoisoned page %#lx\n", pfn);
 		atomic_long_sub(nr_pages, &mce_bad_pages);
 		freeit = 1;
+		if (PageHuge(page))
+			clear_page_hwpoison_huge_page(page);
 	}
-	if (PageHuge(p))
-		clear_page_hwpoison_huge_page(page);
 	unlock_page(page);
 
 	put_page(page);
-- 
1.7.2.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 10/10] page-types.c: fix name of unpoison interface
  2010-09-03  4:37 [PATCH 0/10] Hugepage migration (v4) Naoya Horiguchi
                   ` (8 preceding siblings ...)
  2010-09-03  4:37 ` [PATCH 09/10] HWPOISON, hugetlb: fix unpoison " Naoya Horiguchi
@ 2010-09-03  4:37 ` Naoya Horiguchi
  9 siblings, 0 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-03  4:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

debugfs:hwpoison/renew-pfn is the old name of the unpoison interface.
This patch updates page-types.c to use the current name, hwpoison/unpoison-pfn.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/vm/page-types.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git v2.6.36-rc2/Documentation/vm/page-types.c v2.6.36-rc2/Documentation/vm/page-types.c
index ccd951f..cc96ee2 100644
--- v2.6.36-rc2/Documentation/vm/page-types.c
+++ v2.6.36-rc2/Documentation/vm/page-types.c
@@ -478,7 +478,7 @@ static void prepare_hwpoison_fd(void)
 	}
 
 	if (opt_unpoison && !hwpoison_forget_fd) {
-		sprintf(buf, "%s/renew-pfn", hwpoison_debug_fs);
+		sprintf(buf, "%s/unpoison-pfn", hwpoison_debug_fs);
 		hwpoison_forget_fd = checked_open(buf, O_WRONLY);
 	}
 }
-- 
1.7.2.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-03  4:37 ` [PATCH 04/10] hugetlb: hugepage migration core Naoya Horiguchi
@ 2010-09-03 14:10   ` Christoph Lameter
  2010-09-07  1:25     ` Naoya Horiguchi
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2010-09-03 14:10 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Fri, 3 Sep 2010, Naoya Horiguchi wrote:

> diff --git v2.6.36-rc2/mm/vmscan.c v2.6.36-rc2/mm/vmscan.c
> index c391c32..69ce2a3 100644
> --- v2.6.36-rc2/mm/vmscan.c
> +++ v2.6.36-rc2/mm/vmscan.c
> @@ -40,6 +40,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/delayacct.h>
>  #include <linux/sysctl.h>
> +#include <linux/hugetlb.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -508,6 +509,10 @@ void putback_lru_page(struct page *page)
>
>  	VM_BUG_ON(PageLRU(page));
>
> +	if (PageHuge(page)) {
> +		put_page(page);
> +		return;
> +	}
>  redo:
>  	ClearPageUnevictable(page);
>
> @@ -1112,7 +1117,9 @@ int isolate_lru_page(struct page *page)
>  {
>  	int ret = -EBUSY;
>
> -	if (PageLRU(page)) {
> +	if (PageHuge(page) && get_page_unless_zero(compound_head(page)))
> +		ret = 0;
> +	else if (PageLRU(page)) {
>  		struct zone *zone = page_zone(page);
>
>  		spin_lock_irq(&zone->lru_lock);
>

Huge pages are not on the LRU, right? So why use the lru functions for
them and then not use the lru? It's a bit strange. The caller must be aware
of the different handling of huge pages since there is no toying around
with the lru. So just have the caller do a put or get page instead.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-03 14:10   ` Christoph Lameter
@ 2010-09-07  1:25     ` Naoya Horiguchi
  0 siblings, 0 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-07  1:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Andrew Morton, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Fri, Sep 03, 2010 at 09:10:03AM -0500, Christoph Lameter wrote:
> On Fri, 3 Sep 2010, Naoya Horiguchi wrote:
> 
> > diff --git v2.6.36-rc2/mm/vmscan.c v2.6.36-rc2/mm/vmscan.c
> > index c391c32..69ce2a3 100644
> > --- v2.6.36-rc2/mm/vmscan.c
> > +++ v2.6.36-rc2/mm/vmscan.c
> > @@ -40,6 +40,7 @@
> >  #include <linux/memcontrol.h>
> >  #include <linux/delayacct.h>
> >  #include <linux/sysctl.h>
> > +#include <linux/hugetlb.h>
> >
> >  #include <asm/tlbflush.h>
> >  #include <asm/div64.h>
> > @@ -508,6 +509,10 @@ void putback_lru_page(struct page *page)
> >
> >  	VM_BUG_ON(PageLRU(page));
> >
> > +	if (PageHuge(page)) {
> > +		put_page(page);
> > +		return;
> > +	}
> >  redo:
> >  	ClearPageUnevictable(page);
> >
> > @@ -1112,7 +1117,9 @@ int isolate_lru_page(struct page *page)
> >  {
> >  	int ret = -EBUSY;
> >
> > -	if (PageLRU(page)) {
> > +	if (PageHuge(page) && get_page_unless_zero(compound_head(page)))
> > +		ret = 0;
> > +	else if (PageLRU(page)) {
> >  		struct zone *zone = page_zone(page);
> >
> >  		spin_lock_irq(&zone->lru_lock);
> >
> 
> Huge pages are not on the LRU right?

Yes.

> So why use the lru functions for
> them and then not use the lru? Its a bit stranger. The caller must aware
> of the different handling of huge pages since there is no toying around
> with the lru. So just have the caller do a put or get page instead.

So, you mean we should create migrate_huge_pages() as a copy of
migrate_pages() and make two changes in the new function:
  1. unmap_and_move() -> unmap_and_move_huge_page()
  2. putback_lru_pages() -> put_page()
Is that right?
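
In other words, roughly like this (untested sketch, retry logic omitted):

int migrate_huge_pages(struct list_head *from, new_page_t get_new_page,
                        unsigned long private, int offlining)
{
        struct page *page;
        struct page *page2;
        int nr_failed = 0;
        int rc;

        list_for_each_entry_safe(page, page2, from, lru) {
                cond_resched();
                /* 1. hugepage variant of unmap_and_move() */
                rc = unmap_and_move_huge_page(get_new_page, private,
                                              page, 1, offlining);
                if (rc)
                        nr_failed++;
        }

        /* 2. put_page() on the remaining pages instead of
         *    putback_lru_pages() */
        list_for_each_entry_safe(page, page2, from, lru)
                put_page(page);

        return nr_failed;
}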

Thanks,
Naoya Horiguchi


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-23 16:52   ` Christoph Lameter
@ 2010-09-24  5:58     ` Naoya Horiguchi
  0 siblings, 0 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-24  5:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Andrew Morton, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

Thank you for your review.

On Thu, Sep 23, 2010 at 11:52:26AM -0500, Christoph Lameter wrote:
> On Wed, 8 Sep 2010, Naoya Horiguchi wrote:
> 
> 
> > +static int hugetlbfs_migrate_page(struct address_space *mapping,
> > +				struct page *newpage, struct page *page)
> > +{
> > +	int rc;
> > +
> > +	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> > +	if (rc)
> > +		return rc;
> > +	migrate_page_copy(newpage, page);
> > +
> > +	return 0;
> > +}
> > +
> >  static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> >  {
> >  	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
> > @@ -659,6 +673,7 @@ static const struct address_space_operations hugetlbfs_aops = {
> >  	.write_begin	= hugetlbfs_write_begin,
> >  	.write_end	= hugetlbfs_write_end,
> >  	.set_page_dirty	= hugetlbfs_set_page_dirty,
> > +	.migratepage    = hugetlbfs_migrate_page,
> >  };
> 
> Very straightforward conversion of innermost piece to huge pages. Good.
> 
> If migrate_page_move_mapping would do huge pages like it seems
> migrate_page_copy() does (a bit surprising) then we could save ourselves
> the function?

Yes, that sounds nice.
I'll do it in the next version.
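
That is, if the generic path handled hugepages, hugetlbfs_migrate_page()
could go away entirely and the aops entry would just use the common
callback, roughly:

static const struct address_space_operations hugetlbfs_aops = {
        .write_begin    = hugetlbfs_write_begin,
        .write_end      = hugetlbfs_write_end,
        .set_page_dirty = hugetlbfs_set_page_dirty,
        .migratepage    = migrate_page, /* generic helper instead of our own */
};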

> > index 351f8d1..55f3e2d 100644
> > --- v2.6.36-rc2/mm/hugetlb.c
> > +++ v2.6.36-rc2/mm/hugetlb.c
> > @@ -2217,6 +2217,19 @@ nomem:
> >  	return -ENOMEM;
> >  }
> >
> > +static int is_hugetlb_entry_migration(pte_t pte)
> > +{
> > +	swp_entry_t swp;
> > +
> > +	if (huge_pte_none(pte) || pte_present(pte))
> > +		return 0;
> > +	swp = pte_to_swp_entry(pte);
> > +	if (non_swap_entry(swp) && is_migration_entry(swp)) {
> > +		return 1;
> > +	} else
> > +		return 0;
> > +}
> 
> Ok that implies that to some extent swap must be supported for this case
> in the core vm?

Yes.
Currently hugepages do not support swapping in/out,
but they do support migration entries and hwpoison entries
(both of which have the swap entry format.)
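
For reference, both kinds of entry are installed with the same swp_entry_t
helpers, roughly as try_to_unmap_one() does:

/* migration entry: remembers the target page and its writability */
entry = make_migration_entry(page, pte_write(pteval));
set_pte_at(mm, address, pte, swp_entry_to_pte(entry));

/* hwpoison entry: only records that the pfn is poisoned */
set_pte_at(mm, address, pte,
           swp_entry_to_pte(make_hwpoison_entry(page)));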

> > @@ -2651,7 +2664,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	ptep = huge_pte_offset(mm, address);
> >  	if (ptep) {
> >  		entry = huge_ptep_get(ptep);
> > -		if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> > +		if (unlikely(is_hugetlb_entry_migration(entry))) {
> > +			migration_entry_wait(mm, (pmd_t *)ptep, address);
> > +			return 0;
> > +		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> >  			return VM_FAULT_HWPOISON;
> 
> Now we have here in hugetlb_fault() a copy of the migration wait logic
> from do_swap_page(). Hope the rest of the VM cannot inadvertently
> encounter such a migration entry elsewhere?

No, because is_hugetlb_entry_migration() and is_hugetlb_entry_hwpoisoned()
can return true only when (!huge_pte_none(pte) && !pte_present(pte)) is true,
which means the entry is a swap entry.
Currently a hugepage swap entry can only be created in the page migration
context and in the memory failure context, with no exceptions.

If hugepage swapping becomes available, we can insert an additional if()
to branch to something like hugetlb_swap_page().
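
e.g. (hypothetical sketch; is_hugetlb_entry_swap() and hugetlb_swap_page()
do not exist today):

entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
        migration_entry_wait(mm, (pmd_t *)ptep, address);
        return 0;
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
        return VM_FAULT_HWPOISON;
else if (unlikely(is_hugetlb_entry_swap(entry)))          /* hypothetical */
        return hugetlb_swap_page(mm, vma, address, ptep); /* hypothetical */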


> 
> > diff --git v2.6.36-rc2/mm/migrate.c v2.6.36-rc2/mm/migrate.c
> 
> > @@ -130,10 +139,17 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> >  	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
> >  	if (is_write_migration_entry(entry))
> >  		pte = pte_mkwrite(pte);
> > +	if (PageHuge(new))
> > +		pte = pte_mkhuge(pte);
> >  	flush_cache_page(vma, addr, pte_pfn(pte));
> >  	set_pte_at(mm, addr, ptep, pte);
> >
> > -	if (PageAnon(new))
> > +	if (PageHuge(new)) {
> > +		if (PageAnon(new))
> > +			hugepage_add_anon_rmap(new, vma, addr);
> > +		else
> > +			page_dup_rmap(new);
> 
> dup_rmap? What are non-anon huge pages used for? That is not a file-backed
> huge page, right?

Yes, it's for file-backed huge pages.

This page_dup_rmap() is a hugepage variant of page_add_file_rmap().
Currently we don't account hugepages in zone_page_state nor in the memory
cgroup, so only an atomic increment of the mapcount is needed, and
page_dup_rmap() does exactly that.
I think it's OK because try_to_unmap_one() calls page_remove_rmap() both for
file-mapped pages and for anonymous pages ("rmap" is a common term for both.)
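
For reference, the difference between the two is only the statistics
(roughly, as in the current tree):

/* hugepage (and fork) case: only the mapcount is bumped */
static inline void page_dup_rmap(struct page *page)
{
        atomic_inc(&page->_mapcount);
}

/* normal file-backed case: mapcount plus zone/memcg accounting */
void page_add_file_rmap(struct page *page)
{
        if (atomic_inc_and_test(&page->_mapcount)) {
                __inc_zone_page_state(page, NR_FILE_MAPPED);
                mem_cgroup_update_file_mapped(page, 1);
        }
}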


> > @@ -276,11 +292,59 @@ static int migrate_page_move_mapping(struct address_space *mapping,
> >  }
> >
> >  /*
> > + * The expected number of remaining references is the same as that
> > + * of migrate_page_move_mapping().
> > + */
> > +int migrate_huge_page_move_mapping(struct address_space *mapping,
> > +				   struct page *newpage, struct page *page)
> > +{
> > +	int expected_count;
> > +	void **pslot;
> > +
> > +	if (!mapping) {
> > +		if (page_count(page) != 1)
> > +			return -EAGAIN;
> > +		return 0;
> > +	}
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +
> > +	pslot = radix_tree_lookup_slot(&mapping->page_tree,
> > +					page_index(page));
> > +
> > +	expected_count = 2 + page_has_private(page);
> > +	if (page_count(page) != expected_count ||
> > +	    (struct page *)radix_tree_deref_slot(pslot) != page) {
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		return -EAGAIN;
> > +	}
> > +
> > +	if (!page_freeze_refs(page, expected_count)) {
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		return -EAGAIN;
> > +	}
> > +
> > +	get_page(newpage);
> > +
> > +	radix_tree_replace_slot(pslot, newpage);
> > +
> > +	page_unfreeze_refs(page, expected_count);
> > +
> > +	__put_page(page);
> > +
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	return 0;
> > +}
> 
> That's a pretty accurate copy of move_mapping(). Why are the counter
> updates missing at the end? This also suggests that the two functions
> could be merged into one.

Because hugepages are not counted in the zone page statistics elsewhere.
But it surely looks better to merge these two functions by adding an
if () around the counter updates.
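
i.e. the tail of the merged function would look roughly like:

        radix_tree_replace_slot(pslot, newpage);
        page_unfreeze_refs(page, expected_count);
        __put_page(page);

        if (!PageHuge(page)) {
                /* zone statistics are only maintained for normal pages */
                __dec_zone_page_state(page, NR_FILE_PAGES);
                __inc_zone_page_state(newpage, NR_FILE_PAGES);
        }

        spin_unlock_irq(&mapping->tree_lock);
        return 0;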


> > @@ -724,6 +788,92 @@ move_newpage:
> >  }
> >
> >  /*
> > + * Counterpart of unmap_and_move_page() for hugepage migration.
> > + *
> > + * This function doesn't wait the completion of hugepage I/O
> > + * because there is no race between I/O and migration for hugepage.
> > + * Note that currently hugepage I/O occurs only in direct I/O
> > + * where no lock is held and PG_writeback is irrelevant,
> > + * and writeback status of all subpages are counted in the reference
> > + * count of the head page (i.e. if all subpages of a 2MB hugepage are
> > + * under direct I/O, the reference of the head page is 512 and a bit more.)
> > + * This means that when we try to migrate hugepage whose subpages are
> > + * doing direct I/O, some references remain after try_to_unmap() and
> > + * hugepage migration fails without data corruption.
> > + *
> > + * There is also no race when direct I/O is issued on the page under migration,
> > + * because then pte is replaced with migration swap entry and direct I/O code
> > + * will wait in the page fault for migration to complete.
> > + */
> > +static int unmap_and_move_huge_page(new_page_t get_new_page,
> > +				unsigned long private, struct page *hpage,
> > +				int force, int offlining)
> > +{
> > +	int rc = 0;
> > +	int *result = NULL;
> > +	struct page *new_hpage = get_new_page(hpage, private, &result);
> > +	int rcu_locked = 0;
> > +	struct anon_vma *anon_vma = NULL;
> > +
> > +	if (!new_hpage)
> > +		return -ENOMEM;
> > +
> > +	rc = -EAGAIN;
> > +
> > +	if (!trylock_page(hpage)) {
> > +		if (!force)
> > +			goto out;
> > +		lock_page(hpage);
> > +	}
> > +
> > +	if (PageAnon(hpage)) {
> > +		rcu_read_lock();
> > +		rcu_locked = 1;
> > +
> > +		if (page_mapped(hpage)) {
> > +			anon_vma = page_anon_vma(hpage);
> > +			atomic_inc(&anon_vma->external_refcount);
> > +		}
> > +	}
> > +
> > +	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> > +
> > +	if (!page_mapped(hpage))
> > +		rc = move_to_new_page(new_hpage, hpage, 1);
> > +
> > +	if (rc)
> > +		remove_migration_ptes(hpage, hpage);
> > +
> > +	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
> > +					    &anon_vma->lock)) {
> > +		int empty = list_empty(&anon_vma->head);
> > +		spin_unlock(&anon_vma->lock);
> > +		if (empty)
> > +			anon_vma_free(anon_vma);
> 
> Hmmm.. The anon_vma dropping looks different? Why can't we use
> drop_anon_vma like in unmap_and_move? Also we do not take the root lock
> like drop_anon_vma does.

Oh, I failed to follow the latest code when rebasing.
The anon_vma code has been changing frequently lately.
I'll fix it.
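
The fix should just reuse the common helper, i.e. something like:

        /* drop the anon_vma reference through the shared helper,
         * which also takes the root lock as you point out */
        if (anon_vma)
                drop_anon_vma(anon_vma);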


> > @@ -788,6 +938,52 @@ out:
> >  	return nr_failed + retry;
> >  }
> >
> > +int migrate_huge_pages(struct list_head *from,
> > +		new_page_t get_new_page, unsigned long private, int offlining)
> > +{
> > +	int retry = 1;
> > +	int nr_failed = 0;
> > +	int pass = 0;
> > +	struct page *page;
> > +	struct page *page2;
> > +	int rc;
> > +
> > +	for (pass = 0; pass < 10 && retry; pass++) {
> > +		retry = 0;
> > +
> > +		list_for_each_entry_safe(page, page2, from, lru) {
> > +			cond_resched();
> > +
> > +			rc = unmap_and_move_huge_page(get_new_page,
> > +					private, page, pass > 2, offlining);
> > +
> > +			switch(rc) {
> > +			case -ENOMEM:
> > +				goto out;
> > +			case -EAGAIN:
> > +				retry++;
> > +				break;
> > +			case 0:
> > +				break;
> > +			default:
> > +				/* Permanent failure */
> > +				nr_failed++;
> > +				break;
> > +			}
> > +		}
> > +	}
> > +	rc = 0;
> > +out:
> > +
> > +	list_for_each_entry_safe(page, page2, from, lru)
> > +		put_page(page);
> > +
> > +	if (rc)
> > +		return rc;
> > +
> > +	return nr_failed + retry;
> > +}
> 
> Copy of migrate_pages(). putback_lru_pages() omitted as also proposed for
> upstream in a recent discussion.

Sure.

Thanks,
Naoya Horiguchi


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-08  1:19 ` [PATCH 04/10] hugetlb: hugepage migration core Naoya Horiguchi
  2010-09-20 11:10   ` Mel Gorman
@ 2010-09-23 16:52   ` Christoph Lameter
  2010-09-24  5:58     ` Naoya Horiguchi
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2010-09-23 16:52 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, 8 Sep 2010, Naoya Horiguchi wrote:


> +static int hugetlbfs_migrate_page(struct address_space *mapping,
> +				struct page *newpage, struct page *page)
> +{
> +	int rc;
> +
> +	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> +	if (rc)
> +		return rc;
> +	migrate_page_copy(newpage, page);
> +
> +	return 0;
> +}
> +
>  static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>  {
>  	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
> @@ -659,6 +673,7 @@ static const struct address_space_operations hugetlbfs_aops = {
>  	.write_begin	= hugetlbfs_write_begin,
>  	.write_end	= hugetlbfs_write_end,
>  	.set_page_dirty	= hugetlbfs_set_page_dirty,
> +	.migratepage    = hugetlbfs_migrate_page,
>  };

Very straightforward conversion of innermost piece to huge pages. Good.

If migrate_page_move_mapping would do huge pages like it seems
migrate_page_copy() does (a bit surprising) then we could save ourselves
the function?

> index 351f8d1..55f3e2d 100644
> --- v2.6.36-rc2/mm/hugetlb.c
> +++ v2.6.36-rc2/mm/hugetlb.c
> @@ -2217,6 +2217,19 @@ nomem:
>  	return -ENOMEM;
>  }
>
> +static int is_hugetlb_entry_migration(pte_t pte)
> +{
> +	swp_entry_t swp;
> +
> +	if (huge_pte_none(pte) || pte_present(pte))
> +		return 0;
> +	swp = pte_to_swp_entry(pte);
> +	if (non_swap_entry(swp) && is_migration_entry(swp)) {
> +		return 1;
> +	} else
> +		return 0;
> +}

Ok that implies that to some extent swap must be supported for this case
in the core vm?

> @@ -2651,7 +2664,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	ptep = huge_pte_offset(mm, address);
>  	if (ptep) {
>  		entry = huge_ptep_get(ptep);
> -		if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> +		if (unlikely(is_hugetlb_entry_migration(entry))) {
> +			migration_entry_wait(mm, (pmd_t *)ptep, address);
> +			return 0;
> +		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
>  			return VM_FAULT_HWPOISON;

Now we have here in hugetlb_fault() a copy of the migration wait logic
from do_swap_page(). Hope the rest of the VM cannot inadvertently
encounter such a migration entry elsewhere?


> diff --git v2.6.36-rc2/mm/migrate.c v2.6.36-rc2/mm/migrate.c

> @@ -130,10 +139,17 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
>  	if (is_write_migration_entry(entry))
>  		pte = pte_mkwrite(pte);
> +	if (PageHuge(new))
> +		pte = pte_mkhuge(pte);
>  	flush_cache_page(vma, addr, pte_pfn(pte));
>  	set_pte_at(mm, addr, ptep, pte);
>
> -	if (PageAnon(new))
> +	if (PageHuge(new)) {
> +		if (PageAnon(new))
> +			hugepage_add_anon_rmap(new, vma, addr);
> +		else
> +			page_dup_rmap(new);

dup_rmap? What are non-anon huge pages used for? That is not a file-backed
huge page, right?


> @@ -276,11 +292,59 @@ static int migrate_page_move_mapping(struct address_space *mapping,
>  }
>
>  /*
> + * The expected number of remaining references is the same as that
> + * of migrate_page_move_mapping().
> + */
> +int migrate_huge_page_move_mapping(struct address_space *mapping,
> +				   struct page *newpage, struct page *page)
> +{
> +	int expected_count;
> +	void **pslot;
> +
> +	if (!mapping) {
> +		if (page_count(page) != 1)
> +			return -EAGAIN;
> +		return 0;
> +	}
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +
> +	pslot = radix_tree_lookup_slot(&mapping->page_tree,
> +					page_index(page));
> +
> +	expected_count = 2 + page_has_private(page);
> +	if (page_count(page) != expected_count ||
> +	    (struct page *)radix_tree_deref_slot(pslot) != page) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return -EAGAIN;
> +	}
> +
> +	if (!page_freeze_refs(page, expected_count)) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return -EAGAIN;
> +	}
> +
> +	get_page(newpage);
> +
> +	radix_tree_replace_slot(pslot, newpage);
> +
> +	page_unfreeze_refs(page, expected_count);
> +
> +	__put_page(page);
> +
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return 0;
> +}

That's a pretty accurate copy of move_mapping(). Why are the counter
updates missing at the end? This also suggests that the two functions
could be merged into one.


> @@ -724,6 +788,92 @@ move_newpage:
>  }
>
>  /*
> + * Counterpart of unmap_and_move_page() for hugepage migration.
> + *
> + * This function doesn't wait the completion of hugepage I/O
> + * because there is no race between I/O and migration for hugepage.
> + * Note that currently hugepage I/O occurs only in direct I/O
> + * where no lock is held and PG_writeback is irrelevant,
> + * and writeback status of all subpages are counted in the reference
> + * count of the head page (i.e. if all subpages of a 2MB hugepage are
> + * under direct I/O, the reference of the head page is 512 and a bit more.)
> + * This means that when we try to migrate hugepage whose subpages are
> + * doing direct I/O, some references remain after try_to_unmap() and
> + * hugepage migration fails without data corruption.
> + *
> + * There is also no race when direct I/O is issued on the page under migration,
> + * because then pte is replaced with migration swap entry and direct I/O code
> + * will wait in the page fault for migration to complete.
> + */
> +static int unmap_and_move_huge_page(new_page_t get_new_page,
> +				unsigned long private, struct page *hpage,
> +				int force, int offlining)
> +{
> +	int rc = 0;
> +	int *result = NULL;
> +	struct page *new_hpage = get_new_page(hpage, private, &result);
> +	int rcu_locked = 0;
> +	struct anon_vma *anon_vma = NULL;
> +
> +	if (!new_hpage)
> +		return -ENOMEM;
> +
> +	rc = -EAGAIN;
> +
> +	if (!trylock_page(hpage)) {
> +		if (!force)
> +			goto out;
> +		lock_page(hpage);
> +	}
> +
> +	if (PageAnon(hpage)) {
> +		rcu_read_lock();
> +		rcu_locked = 1;
> +
> +		if (page_mapped(hpage)) {
> +			anon_vma = page_anon_vma(hpage);
> +			atomic_inc(&anon_vma->external_refcount);
> +		}
> +	}
> +
> +	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +
> +	if (!page_mapped(hpage))
> +		rc = move_to_new_page(new_hpage, hpage, 1);
> +
> +	if (rc)
> +		remove_migration_ptes(hpage, hpage);
> +
> +	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
> +					    &anon_vma->lock)) {
> +		int empty = list_empty(&anon_vma->head);
> +		spin_unlock(&anon_vma->lock);
> +		if (empty)
> +			anon_vma_free(anon_vma);

Hmmm.. The anon_vma dropping looks different? Why can't we use
drop_anon_vma like in unmap_and_move? Also we do not take the root lock
like drop_anon_vma does.

> @@ -788,6 +938,52 @@ out:
>  	return nr_failed + retry;
>  }
>
> +int migrate_huge_pages(struct list_head *from,
> +		new_page_t get_new_page, unsigned long private, int offlining)
> +{
> +	int retry = 1;
> +	int nr_failed = 0;
> +	int pass = 0;
> +	struct page *page;
> +	struct page *page2;
> +	int rc;
> +
> +	for (pass = 0; pass < 10 && retry; pass++) {
> +		retry = 0;
> +
> +		list_for_each_entry_safe(page, page2, from, lru) {
> +			cond_resched();
> +
> +			rc = unmap_and_move_huge_page(get_new_page,
> +					private, page, pass > 2, offlining);
> +
> +			switch(rc) {
> +			case -ENOMEM:
> +				goto out;
> +			case -EAGAIN:
> +				retry++;
> +				break;
> +			case 0:
> +				break;
> +			default:
> +				/* Permanent failure */
> +				nr_failed++;
> +				break;
> +			}
> +		}
> +	}
> +	rc = 0;
> +out:
> +
> +	list_for_each_entry_safe(page, page2, from, lru)
> +		put_page(page);
> +
> +	if (rc)
> +		return rc;
> +
> +	return nr_failed + retry;
> +}

Copy of migrate_pages(). putback_lru_pages() omitted as also proposed for
upstream in a recent discussion.




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-22  4:59     ` Naoya Horiguchi
@ 2010-09-22  8:40       ` Mel Gorman
  0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2010-09-22  8:40 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Sep 22, 2010 at 01:59:07PM +0900, Naoya Horiguchi wrote:
> On Mon, Sep 20, 2010 at 12:10:55PM +0100, Mel Gorman wrote:
> > On Wed, Sep 08, 2010 at 10:19:35AM +0900, Naoya Horiguchi wrote:
> ...
> > > @@ -95,26 +96,34 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> > >  	pte_t *ptep, pte;
> > >   	spinlock_t *ptl;
> > >  
> > > - 	pgd = pgd_offset(mm, addr);
> > > -	if (!pgd_present(*pgd))
> > > -		goto out;
> > > +	if (unlikely(PageHuge(new))) {
> > > +		ptep = huge_pte_offset(mm, addr);
> > > +		if (!ptep)
> > > +			goto out;
> > > +		ptl = &mm->page_table_lock;
> > > +	} else {
> > > +		pgd = pgd_offset(mm, addr);
> > > +		if (!pgd_present(*pgd))
> > > +			goto out;
> > >  
> > > -	pud = pud_offset(pgd, addr);
> > > -	if (!pud_present(*pud))
> > > -		goto out;
> > > +		pud = pud_offset(pgd, addr);
> > > +		if (!pud_present(*pud))
> > > +			goto out;
> > >  
> > 
> > Why are the changes to the rest of the walkers necessary? Instead, why
> > did you not identify which PTL lock you needed and then goto the point
> > where spin_lock(ptl) is called? Similar to what page_check_address()
> > does for example.
> 
> This is because Andi-san commented that we should avoid using goto statements.
> But honestly I'm not sure which way is clearer.
> 

Ok, personally I would prefer to have matched page_check_address() so that
we use similar code patterns when the intention is the same, but as it
seems functionally fine, I'm not going to fight over it either.
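
For the record, the shape I had in mind was roughly this (sketch only):

        if (unlikely(PageHuge(new))) {
                ptep = huge_pte_offset(mm, addr);
                if (!ptep)
                        goto out;
                ptl = &mm->page_table_lock;
                goto found;     /* share the lock/restore path below */
        }

        /* ... existing pgd/pud/pmd walk unchanged ... */

        ptl = pte_lockptr(mm, pmd);
found:
        spin_lock(ptl);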

Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-20 11:10   ` Mel Gorman
@ 2010-09-22  4:59     ` Naoya Horiguchi
  2010-09-22  8:40       ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-22  4:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Mon, Sep 20, 2010 at 12:10:55PM +0100, Mel Gorman wrote:
> On Wed, Sep 08, 2010 at 10:19:35AM +0900, Naoya Horiguchi wrote:
...
> > @@ -95,26 +96,34 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> >  	pte_t *ptep, pte;
> >   	spinlock_t *ptl;
> >  
> > - 	pgd = pgd_offset(mm, addr);
> > -	if (!pgd_present(*pgd))
> > -		goto out;
> > +	if (unlikely(PageHuge(new))) {
> > +		ptep = huge_pte_offset(mm, addr);
> > +		if (!ptep)
> > +			goto out;
> > +		ptl = &mm->page_table_lock;
> > +	} else {
> > +		pgd = pgd_offset(mm, addr);
> > +		if (!pgd_present(*pgd))
> > +			goto out;
> >  
> > -	pud = pud_offset(pgd, addr);
> > -	if (!pud_present(*pud))
> > -		goto out;
> > +		pud = pud_offset(pgd, addr);
> > +		if (!pud_present(*pud))
> > +			goto out;
> >  
> 
> Why are the changes to the rest of the walkers necessary? Instead, why
> did you not identify which PTL lock you needed and then goto the point
> where spin_lock(ptl) is called? Similar to what page_check_address()
> does for example.

This is because Andi-san commented that we should avoid using goto statements.
But honestly I'm not sure which way is clearer.

Thanks,
Naoya Horiguchi


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-08  1:19 ` [PATCH 04/10] hugetlb: hugepage migration core Naoya Horiguchi
@ 2010-09-20 11:10   ` Mel Gorman
  2010-09-22  4:59     ` Naoya Horiguchi
  2010-09-23 16:52   ` Christoph Lameter
  1 sibling, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2010-09-20 11:10 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Andrew Morton, Christoph Lameter, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

On Wed, Sep 08, 2010 at 10:19:35AM +0900, Naoya Horiguchi wrote:
> This patch extends page migration code to support hugepage migration.
> One of the potential users of this feature is soft offlining which
> is triggered by corrected memory errors (added by the next patch.)
> 
> Todo:
> - there are other users of page migration such as memory policy,
>   memory hotplug and memory compaction.
>   They are not ready for hugepage support yet.
> 
> ChangeLog since v4:
> - define migrate_huge_pages()
> - remove changes on isolation/putback_lru_page()
> 
> ChangeLog since v2:
> - refactor isolate/putback_lru_page() to handle hugepage
> - add comment about race on unmap_and_move_huge_page()
> 
> ChangeLog since v1:
> - divide migration code path for hugepage
> - define routine checking migration swap entry for hugetlb
> - replace "goto" with "if/else" in remove_migration_pte()
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
> ---
>  fs/hugetlbfs/inode.c    |   15 +++
>  include/linux/migrate.h |   16 +++
>  mm/hugetlb.c            |   18 ++++-
>  mm/migrate.c            |  232 +++++++++++++++++++++++++++++++++++++++++++----
>  4 files changed, 262 insertions(+), 19 deletions(-)
> 
> diff --git v2.6.36-rc2/fs/hugetlbfs/inode.c v2.6.36-rc2/fs/hugetlbfs/inode.c
> index 6e5bd42..1f7ca50 100644
> --- v2.6.36-rc2/fs/hugetlbfs/inode.c
> +++ v2.6.36-rc2/fs/hugetlbfs/inode.c
> @@ -31,6 +31,7 @@
>  #include <linux/statfs.h>
>  #include <linux/security.h>
>  #include <linux/magic.h>
> +#include <linux/migrate.h>
>  
>  #include <asm/uaccess.h>
>  
> @@ -573,6 +574,19 @@ static int hugetlbfs_set_page_dirty(struct page *page)
>  	return 0;
>  }
>  
> +static int hugetlbfs_migrate_page(struct address_space *mapping,
> +				struct page *newpage, struct page *page)
> +{
> +	int rc;
> +
> +	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> +	if (rc)
> +		return rc;
> +	migrate_page_copy(newpage, page);
> +
> +	return 0;
> +}
> +
>  static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>  {
>  	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
> @@ -659,6 +673,7 @@ static const struct address_space_operations hugetlbfs_aops = {
>  	.write_begin	= hugetlbfs_write_begin,
>  	.write_end	= hugetlbfs_write_end,
>  	.set_page_dirty	= hugetlbfs_set_page_dirty,
> +	.migratepage    = hugetlbfs_migrate_page,
>  };
>  
>  
> diff --git v2.6.36-rc2/include/linux/migrate.h v2.6.36-rc2/include/linux/migrate.h
> index 7238231..3c1941e 100644
> --- v2.6.36-rc2/include/linux/migrate.h
> +++ v2.6.36-rc2/include/linux/migrate.h
> @@ -14,6 +14,8 @@ extern int migrate_page(struct address_space *,
>  			struct page *, struct page *);
>  extern int migrate_pages(struct list_head *l, new_page_t x,
>  			unsigned long private, int offlining);
> +extern int migrate_huge_pages(struct list_head *l, new_page_t x,
> +			unsigned long private, int offlining);
>  
>  extern int fail_migrate_page(struct address_space *,
>  			struct page *, struct page *);
> @@ -23,12 +25,17 @@ extern int migrate_prep_local(void);
>  extern int migrate_vmas(struct mm_struct *mm,
>  		const nodemask_t *from, const nodemask_t *to,
>  		unsigned long flags);
> +extern void migrate_page_copy(struct page *newpage, struct page *page);
> +extern int migrate_huge_page_move_mapping(struct address_space *mapping,
> +				  struct page *newpage, struct page *page);
>  #else
>  #define PAGE_MIGRATION 0
>  
>  static inline void putback_lru_pages(struct list_head *l) {}
>  static inline int migrate_pages(struct list_head *l, new_page_t x,
>  		unsigned long private, int offlining) { return -ENOSYS; }
> +static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
> +		unsigned long private, int offlining) { return -ENOSYS; }
>  
>  static inline int migrate_prep(void) { return -ENOSYS; }
>  static inline int migrate_prep_local(void) { return -ENOSYS; }
> @@ -40,6 +47,15 @@ static inline int migrate_vmas(struct mm_struct *mm,
>  	return -ENOSYS;
>  }
>  
> +static inline void migrate_page_copy(struct page *newpage,
> +				     struct page *page) {}
> +
> +extern int migrate_huge_page_move_mapping(struct address_space *mapping,
> +				  struct page *newpage, struct page *page)
> +{
> +	return -ENOSYS;
> +}
> +
>  /* Possible settings for the migrate_page() method in address_operations */
>  #define migrate_page NULL
>  #define fail_migrate_page NULL
> diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
> index 351f8d1..55f3e2d 100644
> --- v2.6.36-rc2/mm/hugetlb.c
> +++ v2.6.36-rc2/mm/hugetlb.c
> @@ -2217,6 +2217,19 @@ nomem:
>  	return -ENOMEM;
>  }
>  
> +static int is_hugetlb_entry_migration(pte_t pte)
> +{
> +	swp_entry_t swp;
> +
> +	if (huge_pte_none(pte) || pte_present(pte))
> +		return 0;
> +	swp = pte_to_swp_entry(pte);
> +	if (non_swap_entry(swp) && is_migration_entry(swp)) {
> +		return 1;
> +	} else
> +		return 0;
> +}
> +
>  static int is_hugetlb_entry_hwpoisoned(pte_t pte)
>  {
>  	swp_entry_t swp;
> @@ -2651,7 +2664,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	ptep = huge_pte_offset(mm, address);
>  	if (ptep) {
>  		entry = huge_ptep_get(ptep);
> -		if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> +		if (unlikely(is_hugetlb_entry_migration(entry))) {
> +			migration_entry_wait(mm, (pmd_t *)ptep, address);
> +			return 0;
> +		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
>  			return VM_FAULT_HWPOISON;
>  	}
>  
> diff --git v2.6.36-rc2/mm/migrate.c v2.6.36-rc2/mm/migrate.c
> index 38e7cad..55dbc45 100644
> --- v2.6.36-rc2/mm/migrate.c
> +++ v2.6.36-rc2/mm/migrate.c
> @@ -32,6 +32,7 @@
>  #include <linux/security.h>
>  #include <linux/memcontrol.h>
>  #include <linux/syscalls.h>
> +#include <linux/hugetlb.h>
>  #include <linux/gfp.h>
>  
>  #include "internal.h"
> @@ -95,26 +96,34 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  	pte_t *ptep, pte;
>   	spinlock_t *ptl;
>  
> - 	pgd = pgd_offset(mm, addr);
> -	if (!pgd_present(*pgd))
> -		goto out;
> +	if (unlikely(PageHuge(new))) {
> +		ptep = huge_pte_offset(mm, addr);
> +		if (!ptep)
> +			goto out;
> +		ptl = &mm->page_table_lock;
> +	} else {
> +		pgd = pgd_offset(mm, addr);
> +		if (!pgd_present(*pgd))
> +			goto out;
>  
> -	pud = pud_offset(pgd, addr);
> -	if (!pud_present(*pud))
> -		goto out;
> +		pud = pud_offset(pgd, addr);
> +		if (!pud_present(*pud))
> +			goto out;
>  

Why are the changes to the rest of the walkers necessary? Instead, why
did you not identify which PTL lock you needed and then goto the point
where spin_lock(ptl) is called? Similar to what page_check_address()
does for example.

Otherwise, I did not spot anything that was obviously wrong.

> -	pmd = pmd_offset(pud, addr);
> -	if (!pmd_present(*pmd))
> -		goto out;
> +		pmd = pmd_offset(pud, addr);
> +		if (!pmd_present(*pmd))
> +			goto out;
>  
> -	ptep = pte_offset_map(pmd, addr);
> +		ptep = pte_offset_map(pmd, addr);
>  
> -	if (!is_swap_pte(*ptep)) {
> -		pte_unmap(ptep);
> -		goto out;
> - 	}
> +		if (!is_swap_pte(*ptep)) {
> +			pte_unmap(ptep);
> +			goto out;
> +		}
> +
> +		ptl = pte_lockptr(mm, pmd);
> +	}
>  
> - 	ptl = pte_lockptr(mm, pmd);
>   	spin_lock(ptl);
>  	pte = *ptep;
>  	if (!is_swap_pte(pte))
> @@ -130,10 +139,17 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
>  	if (is_write_migration_entry(entry))
>  		pte = pte_mkwrite(pte);
> +	if (PageHuge(new))
> +		pte = pte_mkhuge(pte);
>  	flush_cache_page(vma, addr, pte_pfn(pte));
>  	set_pte_at(mm, addr, ptep, pte);
>  
> -	if (PageAnon(new))
> +	if (PageHuge(new)) {
> +		if (PageAnon(new))
> +			hugepage_add_anon_rmap(new, vma, addr);
> +		else
> +			page_dup_rmap(new);
> +	} else if (PageAnon(new))
>  		page_add_anon_rmap(new, vma, addr);
>  	else
>  		page_add_file_rmap(new);
> @@ -276,11 +292,59 @@ static int migrate_page_move_mapping(struct address_space *mapping,
>  }
>  
>  /*
> + * The expected number of remaining references is the same as that
> + * of migrate_page_move_mapping().
> + */
> +int migrate_huge_page_move_mapping(struct address_space *mapping,
> +				   struct page *newpage, struct page *page)
> +{
> +	int expected_count;
> +	void **pslot;
> +
> +	if (!mapping) {
> +		if (page_count(page) != 1)
> +			return -EAGAIN;
> +		return 0;
> +	}
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +
> +	pslot = radix_tree_lookup_slot(&mapping->page_tree,
> +					page_index(page));
> +
> +	expected_count = 2 + page_has_private(page);
> +	if (page_count(page) != expected_count ||
> +	    (struct page *)radix_tree_deref_slot(pslot) != page) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return -EAGAIN;
> +	}
> +
> +	if (!page_freeze_refs(page, expected_count)) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return -EAGAIN;
> +	}
> +
> +	get_page(newpage);
> +
> +	radix_tree_replace_slot(pslot, newpage);
> +
> +	page_unfreeze_refs(page, expected_count);
> +
> +	__put_page(page);
> +
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return 0;
> +}
> +
> +/*
>   * Copy the page to its new location
>   */
> -static void migrate_page_copy(struct page *newpage, struct page *page)
> +void migrate_page_copy(struct page *newpage, struct page *page)
>  {
> -	copy_highpage(newpage, page);
> +	if (PageHuge(page))
> +		copy_huge_page(newpage, page);
> +	else
> +		copy_highpage(newpage, page);
>  
>  	if (PageError(page))
>  		SetPageError(newpage);
> @@ -724,6 +788,92 @@ move_newpage:
>  }
>  
>  /*
> + * Counterpart of unmap_and_move_page() for hugepage migration.
> + *
> + * This function doesn't wait the completion of hugepage I/O
> + * because there is no race between I/O and migration for hugepage.
> + * Note that currently hugepage I/O occurs only in direct I/O
> + * where no lock is held and PG_writeback is irrelevant,
> + * and writeback status of all subpages are counted in the reference
> + * count of the head page (i.e. if all subpages of a 2MB hugepage are
> + * under direct I/O, the reference of the head page is 512 and a bit more.)
> + * This means that when we try to migrate hugepage whose subpages are
> + * doing direct I/O, some references remain after try_to_unmap() and
> + * hugepage migration fails without data corruption.
> + *
> + * There is also no race when direct I/O is issued on the page under migration,
> + * because then pte is replaced with migration swap entry and direct I/O code
> + * will wait in the page fault for migration to complete.
> + */
> +static int unmap_and_move_huge_page(new_page_t get_new_page,
> +				unsigned long private, struct page *hpage,
> +				int force, int offlining)
> +{
> +	int rc = 0;
> +	int *result = NULL;
> +	struct page *new_hpage = get_new_page(hpage, private, &result);
> +	int rcu_locked = 0;
> +	struct anon_vma *anon_vma = NULL;
> +
> +	if (!new_hpage)
> +		return -ENOMEM;
> +
> +	rc = -EAGAIN;
> +
> +	if (!trylock_page(hpage)) {
> +		if (!force)
> +			goto out;
> +		lock_page(hpage);
> +	}
> +
> +	if (PageAnon(hpage)) {
> +		rcu_read_lock();
> +		rcu_locked = 1;
> +
> +		if (page_mapped(hpage)) {
> +			anon_vma = page_anon_vma(hpage);
> +			atomic_inc(&anon_vma->external_refcount);
> +		}
> +	}
> +
> +	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +
> +	if (!page_mapped(hpage))
> +		rc = move_to_new_page(new_hpage, hpage, 1);
> +
> +	if (rc)
> +		remove_migration_ptes(hpage, hpage);
> +
> +	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
> +					    &anon_vma->lock)) {
> +		int empty = list_empty(&anon_vma->head);
> +		spin_unlock(&anon_vma->lock);
> +		if (empty)
> +			anon_vma_free(anon_vma);
> +	}
> +
> +	if (rcu_locked)
> +		rcu_read_unlock();
> +out:
> +	unlock_page(hpage);
> +
> +	if (rc != -EAGAIN) {
> +		list_del(&hpage->lru);
> +		put_page(hpage);
> +	}
> +
> +	put_page(new_hpage);
> +
> +	if (result) {
> +		if (rc)
> +			*result = rc;
> +		else
> +			*result = page_to_nid(new_hpage);
> +	}
> +	return rc;
> +}
> +
> +/*
>   * migrate_pages
>   *
>   * The function takes one list of pages to migrate and a function
> @@ -788,6 +938,52 @@ out:
>  	return nr_failed + retry;
>  }
>  
> +int migrate_huge_pages(struct list_head *from,
> +		new_page_t get_new_page, unsigned long private, int offlining)
> +{
> +	int retry = 1;
> +	int nr_failed = 0;
> +	int pass = 0;
> +	struct page *page;
> +	struct page *page2;
> +	int rc;
> +
> +	for (pass = 0; pass < 10 && retry; pass++) {
> +		retry = 0;
> +
> +		list_for_each_entry_safe(page, page2, from, lru) {
> +			cond_resched();
> +
> +			rc = unmap_and_move_huge_page(get_new_page,
> +					private, page, pass > 2, offlining);
> +
> +			switch(rc) {
> +			case -ENOMEM:
> +				goto out;
> +			case -EAGAIN:
> +				retry++;
> +				break;
> +			case 0:
> +				break;
> +			default:
> +				/* Permanent failure */
> +				nr_failed++;
> +				break;
> +			}
> +		}
> +	}
> +	rc = 0;
> +out:
> +
> +	list_for_each_entry_safe(page, page2, from, lru)
> +		put_page(page);
> +
> +	if (rc)
> +		return rc;
> +
> +	return nr_failed + retry;
> +}
> +
>  #ifdef CONFIG_NUMA
>  /*
>   * Move a list of individual pages
> -- 
> 1.7.2.2
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 04/10] hugetlb: hugepage migration core
  2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
@ 2010-09-08  1:19 ` Naoya Horiguchi
  2010-09-20 11:10   ` Mel Gorman
  2010-09-23 16:52   ` Christoph Lameter
  0 siblings, 2 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2010-09-08  1:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Christoph Lameter, Mel Gorman, Wu Fengguang,
	Jun'ichi Nomura, linux-mm, LKML

This patch extends page migration code to support hugepage migration.
One of the potential users of this feature is soft offlining which
is triggered by corrected memory errors (added by the next patch.)

Todo:
- there are other users of page migration such as memory policy,
  memory hotplug and memory compaction.
  They are not ready for hugepage support yet.

ChangeLog since v4:
- define migrate_huge_pages()
- remove changes on isolation/putback_lru_page()

ChangeLog since v2:
- refactor isolate/putback_lru_page() to handle hugepage
- add comment about race on unmap_and_move_huge_page()

ChangeLog since v1:
- divide migration code path for hugepage
- define routine checking migration swap entry for hugetlb
- replace "goto" with "if/else" in remove_migration_pte()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 fs/hugetlbfs/inode.c    |   15 +++
 include/linux/migrate.h |   16 +++
 mm/hugetlb.c            |   18 ++++-
 mm/migrate.c            |  232 +++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 262 insertions(+), 19 deletions(-)

diff --git v2.6.36-rc2/fs/hugetlbfs/inode.c v2.6.36-rc2/fs/hugetlbfs/inode.c
index 6e5bd42..1f7ca50 100644
--- v2.6.36-rc2/fs/hugetlbfs/inode.c
+++ v2.6.36-rc2/fs/hugetlbfs/inode.c
@@ -31,6 +31,7 @@
 #include <linux/statfs.h>
 #include <linux/security.h>
 #include <linux/magic.h>
+#include <linux/migrate.h>
 
 #include <asm/uaccess.h>
 
@@ -573,6 +574,19 @@ static int hugetlbfs_set_page_dirty(struct page *page)
 	return 0;
 }
 
+static int hugetlbfs_migrate_page(struct address_space *mapping,
+				struct page *newpage, struct page *page)
+{
+	int rc;
+
+	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
+	if (rc)
+		return rc;
+	migrate_page_copy(newpage, page);
+
+	return 0;
+}
+
 static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
@@ -659,6 +673,7 @@ static const struct address_space_operations hugetlbfs_aops = {
 	.write_begin	= hugetlbfs_write_begin,
 	.write_end	= hugetlbfs_write_end,
 	.set_page_dirty	= hugetlbfs_set_page_dirty,
+	.migratepage    = hugetlbfs_migrate_page,
 };
 
 
diff --git v2.6.36-rc2/include/linux/migrate.h v2.6.36-rc2/include/linux/migrate.h
index 7238231..3c1941e 100644
--- v2.6.36-rc2/include/linux/migrate.h
+++ v2.6.36-rc2/include/linux/migrate.h
@@ -14,6 +14,8 @@ extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, int offlining);
+extern int migrate_huge_pages(struct list_head *l, new_page_t x,
+			unsigned long private, int offlining);
 
 extern int fail_migrate_page(struct address_space *,
 			struct page *, struct page *);
@@ -23,12 +25,17 @@ extern int migrate_prep_local(void);
 extern int migrate_vmas(struct mm_struct *mm,
 		const nodemask_t *from, const nodemask_t *to,
 		unsigned long flags);
+extern void migrate_page_copy(struct page *newpage, struct page *page);
+extern int migrate_huge_page_move_mapping(struct address_space *mapping,
+				  struct page *newpage, struct page *page);
 #else
 #define PAGE_MIGRATION 0
 
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, int offlining) { return -ENOSYS; }
+static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
+		unsigned long private, int offlining) { return -ENOSYS; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
@@ -40,6 +47,15 @@ static inline int migrate_vmas(struct mm_struct *mm,
 	return -ENOSYS;
 }
 
+static inline void migrate_page_copy(struct page *newpage,
+				     struct page *page) {}
+
+extern int migrate_huge_page_move_mapping(struct address_space *mapping,
+				  struct page *newpage, struct page *page)
+{
+	return -ENOSYS;
+}
+
 /* Possible settings for the migrate_page() method in address_operations */
 #define migrate_page NULL
 #define fail_migrate_page NULL
diff --git v2.6.36-rc2/mm/hugetlb.c v2.6.36-rc2/mm/hugetlb.c
index 351f8d1..55f3e2d 100644
--- v2.6.36-rc2/mm/hugetlb.c
+++ v2.6.36-rc2/mm/hugetlb.c
@@ -2217,6 +2217,19 @@ nomem:
 	return -ENOMEM;
 }
 
+static int is_hugetlb_entry_migration(pte_t pte)
+{
+	swp_entry_t swp;
+
+	if (huge_pte_none(pte) || pte_present(pte))
+		return 0;
+	swp = pte_to_swp_entry(pte);
+	if (non_swap_entry(swp) && is_migration_entry(swp)) {
+		return 1;
+	} else
+		return 0;
+}
+
 static int is_hugetlb_entry_hwpoisoned(pte_t pte)
 {
 	swp_entry_t swp;
@@ -2651,7 +2664,10 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	ptep = huge_pte_offset(mm, address);
 	if (ptep) {
 		entry = huge_ptep_get(ptep);
-		if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+		if (unlikely(is_hugetlb_entry_migration(entry))) {
+			migration_entry_wait(mm, (pmd_t *)ptep, address);
+			return 0;
+		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON;
 	}
 
diff --git v2.6.36-rc2/mm/migrate.c v2.6.36-rc2/mm/migrate.c
index 38e7cad..55dbc45 100644
--- v2.6.36-rc2/mm/migrate.c
+++ v2.6.36-rc2/mm/migrate.c
@@ -32,6 +32,7 @@
 #include <linux/security.h>
 #include <linux/memcontrol.h>
 #include <linux/syscalls.h>
+#include <linux/hugetlb.h>
 #include <linux/gfp.h>
 
 #include "internal.h"
@@ -95,26 +96,34 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	pte_t *ptep, pte;
  	spinlock_t *ptl;
 
- 	pgd = pgd_offset(mm, addr);
-	if (!pgd_present(*pgd))
-		goto out;
+	if (unlikely(PageHuge(new))) {
+		ptep = huge_pte_offset(mm, addr);
+		if (!ptep)
+			goto out;
+		ptl = &mm->page_table_lock;
+	} else {
+		pgd = pgd_offset(mm, addr);
+		if (!pgd_present(*pgd))
+			goto out;
 
-	pud = pud_offset(pgd, addr);
-	if (!pud_present(*pud))
-		goto out;
+		pud = pud_offset(pgd, addr);
+		if (!pud_present(*pud))
+			goto out;
 
-	pmd = pmd_offset(pud, addr);
-	if (!pmd_present(*pmd))
-		goto out;
+		pmd = pmd_offset(pud, addr);
+		if (!pmd_present(*pmd))
+			goto out;
 
-	ptep = pte_offset_map(pmd, addr);
+		ptep = pte_offset_map(pmd, addr);
 
-	if (!is_swap_pte(*ptep)) {
-		pte_unmap(ptep);
-		goto out;
- 	}
+		if (!is_swap_pte(*ptep)) {
+			pte_unmap(ptep);
+			goto out;
+		}
+
+		ptl = pte_lockptr(mm, pmd);
+	}
 
- 	ptl = pte_lockptr(mm, pmd);
  	spin_lock(ptl);
 	pte = *ptep;
 	if (!is_swap_pte(pte))
@@ -130,10 +139,17 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
 	if (is_write_migration_entry(entry))
 		pte = pte_mkwrite(pte);
+	if (PageHuge(new))
+		pte = pte_mkhuge(pte);
 	flush_cache_page(vma, addr, pte_pfn(pte));
 	set_pte_at(mm, addr, ptep, pte);
 
-	if (PageAnon(new))
+	if (PageHuge(new)) {
+		if (PageAnon(new))
+			hugepage_add_anon_rmap(new, vma, addr);
+		else
+			page_dup_rmap(new);
+	} else if (PageAnon(new))
 		page_add_anon_rmap(new, vma, addr);
 	else
 		page_add_file_rmap(new);
@@ -276,11 +292,59 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 }
 
 /*
+ * The expected number of remaining references is the same as that
+ * of migrate_page_move_mapping().
+ */
+int migrate_huge_page_move_mapping(struct address_space *mapping,
+				   struct page *newpage, struct page *page)
+{
+	int expected_count;
+	void **pslot;
+
+	if (!mapping) {
+		if (page_count(page) != 1)
+			return -EAGAIN;
+		return 0;
+	}
+
+	spin_lock_irq(&mapping->tree_lock);
+
+	pslot = radix_tree_lookup_slot(&mapping->page_tree,
+					page_index(page));
+
+	expected_count = 2 + page_has_private(page);
+	if (page_count(page) != expected_count ||
+	    (struct page *)radix_tree_deref_slot(pslot) != page) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
+	if (!page_freeze_refs(page, expected_count)) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
+	get_page(newpage);
+
+	radix_tree_replace_slot(pslot, newpage);
+
+	page_unfreeze_refs(page, expected_count);
+
+	__put_page(page);
+
+	spin_unlock_irq(&mapping->tree_lock);
+	return 0;
+}
+
+/*
  * Copy the page to its new location
  */
-static void migrate_page_copy(struct page *newpage, struct page *page)
+void migrate_page_copy(struct page *newpage, struct page *page)
 {
-	copy_highpage(newpage, page);
+	if (PageHuge(page))
+		copy_huge_page(newpage, page);
+	else
+		copy_highpage(newpage, page);
 
 	if (PageError(page))
 		SetPageError(newpage);
@@ -724,6 +788,92 @@ move_newpage:
 }
 
 /*
+ * Counterpart of unmap_and_move_page() for hugepage migration.
+ *
+ * This function doesn't wait the completion of hugepage I/O
+ * because there is no race between I/O and migration for hugepage.
+ * Note that currently hugepage I/O occurs only in direct I/O
+ * where no lock is held and PG_writeback is irrelevant,
+ * and writeback status of all subpages are counted in the reference
+ * count of the head page (i.e. if all subpages of a 2MB hugepage are
+ * under direct I/O, the reference of the head page is 512 and a bit more.)
+ * This means that when we try to migrate hugepage whose subpages are
+ * doing direct I/O, some references remain after try_to_unmap() and
+ * hugepage migration fails without data corruption.
+ *
+ * There is also no race when direct I/O is issued on the page under migration,
+ * because then pte is replaced with migration swap entry and direct I/O code
+ * will wait in the page fault for migration to complete.
+ */
+static int unmap_and_move_huge_page(new_page_t get_new_page,
+				unsigned long private, struct page *hpage,
+				int force, int offlining)
+{
+	int rc = 0;
+	int *result = NULL;
+	struct page *new_hpage = get_new_page(hpage, private, &result);
+	int rcu_locked = 0;
+	struct anon_vma *anon_vma = NULL;
+
+	if (!new_hpage)
+		return -ENOMEM;
+
+	rc = -EAGAIN;
+
+	if (!trylock_page(hpage)) {
+		if (!force)
+			goto out;
+		lock_page(hpage);
+	}
+
+	if (PageAnon(hpage)) {
+		rcu_read_lock();
+		rcu_locked = 1;
+
+		if (page_mapped(hpage)) {
+			anon_vma = page_anon_vma(hpage);
+			atomic_inc(&anon_vma->external_refcount);
+		}
+	}
+
+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+
+	if (!page_mapped(hpage))
+		rc = move_to_new_page(new_hpage, hpage, 1);
+
+	if (rc)
+		remove_migration_ptes(hpage, hpage);
+
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount,
+					    &anon_vma->lock)) {
+		int empty = list_empty(&anon_vma->head);
+		spin_unlock(&anon_vma->lock);
+		if (empty)
+			anon_vma_free(anon_vma);
+	}
+
+	if (rcu_locked)
+		rcu_read_unlock();
+out:
+	unlock_page(hpage);
+
+	if (rc != -EAGAIN) {
+		list_del(&hpage->lru);
+		put_page(hpage);
+	}
+
+	put_page(new_hpage);
+
+	if (result) {
+		if (rc)
+			*result = rc;
+		else
+			*result = page_to_nid(new_hpage);
+	}
+	return rc;
+}
+
+/*
  * migrate_pages
  *
  * The function takes one list of pages to migrate and a function
@@ -788,6 +938,52 @@ out:
 	return nr_failed + retry;
 }
 
+int migrate_huge_pages(struct list_head *from,
+		new_page_t get_new_page, unsigned long private, int offlining)
+{
+	int retry = 1;
+	int nr_failed = 0;
+	int pass = 0;
+	struct page *page;
+	struct page *page2;
+	int rc;
+
+	for (pass = 0; pass < 10 && retry; pass++) {
+		retry = 0;
+
+		list_for_each_entry_safe(page, page2, from, lru) {
+			cond_resched();
+
+			rc = unmap_and_move_huge_page(get_new_page,
+					private, page, pass > 2, offlining);
+
+			switch(rc) {
+			case -ENOMEM:
+				goto out;
+			case -EAGAIN:
+				retry++;
+				break;
+			case 0:
+				break;
+			default:
+				/* Permanent failure */
+				nr_failed++;
+				break;
+			}
+		}
+	}
+	rc = 0;
+out:
+
+	list_for_each_entry_safe(page, page2, from, lru)
+		put_page(page);
+
+	if (rc)
+		return rc;
+
+	return nr_failed + retry;
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Move a list of individual pages
-- 
1.7.2.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2010-09-24  6:12 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-03  4:37 [PATCH 0/10] Hugepage migration (v4) Naoya Horiguchi
2010-09-03  4:37 ` [PATCH 01/10] hugetlb: fix metadata corruption in hugetlb_fault() Naoya Horiguchi
2010-09-03  4:37 ` [PATCH 02/10] hugetlb: add allocate function for hugepage migration Naoya Horiguchi
2010-09-03  4:37 ` [PATCH 03/10] hugetlb: redefine hugepage copy functions Naoya Horiguchi
2010-09-03  4:37 ` [PATCH 04/10] hugetlb: hugepage migration core Naoya Horiguchi
2010-09-03 14:10   ` Christoph Lameter
2010-09-07  1:25     ` Naoya Horiguchi
2010-09-03  4:37 ` [PATCH 05/10] HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page() Naoya Horiguchi
2010-09-03  4:37 ` [PATCH 06/10] hugetlb: move refcounting in hugepage allocation inside hugetlb_lock Naoya Horiguchi
2010-09-03  4:37 ` [PATCH 07/10] HWPOSION, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED Naoya Horiguchi
2010-09-03  4:37 ` [PATCH 08/10] HWPOISON, hugetlb: soft offlining for hugepage Naoya Horiguchi
2010-09-03  4:37 ` [PATCH 09/10] HWPOISON, hugetlb: fix unpoison " Naoya Horiguchi
2010-09-03  4:37 ` [PATCH 10/10] page-types.c: fix name of unpoison interface Naoya Horiguchi
2010-09-08  1:19 [PATCH 0/10] Hugepage migration (v5) Naoya Horiguchi
2010-09-08  1:19 ` [PATCH 04/10] hugetlb: hugepage migration core Naoya Horiguchi
2010-09-20 11:10   ` Mel Gorman
2010-09-22  4:59     ` Naoya Horiguchi
2010-09-22  8:40       ` Mel Gorman
2010-09-23 16:52   ` Christoph Lameter
2010-09-24  5:58     ` Naoya Horiguchi
