* [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86
@ 2008-04-10 17:02 npiggin
2008-04-10 17:02 ` [patch 01/17] hugetlb: modular state npiggin
` (17 more replies)
0 siblings, 18 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
Hi,
I'm taking care of Andi's hugetlb patchset now. It has taken me a while to
do anything visible with it because I have had other things to do and also
needed some time to get up to speed on it.
Anyway, from my review of the patchset, I didn't find a great deal wrong
with it technically. Taking hstate out of the hugetlbfs inode and vma is
really the main change I made.
However, on the less technical side, I think a few things could be improved,
e.g. the configuration and reporting interfaces, as well as the
"administrative" type of code. I made a start on these improvements in the
last patch of the series, and will fold them properly into the rest of the
patchset where possible.
The other thing I did was shuffle the patches around a bit: there were one
or two (pretty trivial) points where the series wasn't bisectable, and I
also merged a couple of patches.
I will try to get this patchset merged into -mm soon if feedback is
positive. I would also welcome patches for other architectures, as well as
any other patches or suggestions for improvements.
Patches are against head.
Thanks,
Nick
* [patch 01/17] hugetlb: modular state
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-21 20:51 ` Jon Tollefson
2008-04-10 17:02 ` [patch 02/17] hugetlb: multiple hstates npiggin
` (16 subsequent siblings)
17 siblings, 1 reply; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: hugetlb-modular-state.patch --]
[-- Type: text/plain, Size: 40315 bytes --]
A large but rather mechanical patch that converts most of the hugetlb.c
globals into members of a new struct hstate and passes that state around
explicitly.
Right now there is only a single global hstate structure, but most of the
infrastructure needed to extend it to multiple states is in place.
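To give a feel for the shape of the conversion before wading into the diff,
the pattern throughout is roughly the following (an illustrative sketch
only, using the same names as the patch):

    /* before: file-scope globals, implicitly one huge page size */
    static unsigned long free_huge_pages;
    static unsigned int free_huge_pages_node[MAX_NUMNODES];

    static void enqueue_huge_page(struct page *page)
    {
            int nid = page_to_nid(page);

            list_add(&page->lru, &hugepage_freelists[nid]);
            free_huge_pages++;
            free_huge_pages_node[nid]++;
    }

    /* after: the same state lives in a struct hstate passed by callers */
    static void enqueue_huge_page(struct hstate *h, struct page *page)
    {
            int nid = page_to_nid(page);

            list_add(&page->lru, &h->hugepage_freelists[nid]);
            h->free_huge_pages++;
            h->free_huge_pages_node[nid]++;
    }

Compile-time constants such as HPAGE_SIZE and HUGETLB_PAGE_ORDER are
likewise replaced by accessors like huge_page_size(h) and huge_page_order(h).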
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
arch/ia64/mm/hugetlbpage.c | 2
arch/powerpc/mm/hugetlbpage.c | 2
arch/sh/mm/hugetlbpage.c | 2
arch/sparc64/mm/hugetlbpage.c | 2
arch/x86/mm/hugetlbpage.c | 2
fs/hugetlbfs/inode.c | 45 +++---
include/linux/hugetlb.h | 69 +++++++++
ipc/shm.c | 3
mm/hugetlb.c | 298 ++++++++++++++++++++++--------------------
mm/memory.c | 2
mm/mempolicy.c | 10 -
mm/mmap.c | 3
12 files changed, 270 insertions(+), 170 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -22,30 +22,24 @@
#include "internal.h"
const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-static unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
-static unsigned long surplus_huge_pages;
-static unsigned long nr_overcommit_huge_pages;
unsigned long max_huge_pages;
unsigned long sysctl_overcommit_huge_pages;
-static struct list_head hugepage_freelists[MAX_NUMNODES];
-static unsigned int nr_huge_pages_node[MAX_NUMNODES];
-static unsigned int free_huge_pages_node[MAX_NUMNODES];
-static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-static int hugetlb_next_nid;
+
+struct hstate global_hstate;
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
*/
static DEFINE_SPINLOCK(hugetlb_lock);
-static void clear_huge_page(struct page *page, unsigned long addr)
+static void clear_huge_page(struct page *page, unsigned long addr, unsigned sz)
{
int i;
might_sleep();
- for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); i++) {
+ for (i = 0; i < sz/PAGE_SIZE; i++) {
cond_resched();
clear_user_highpage(page + i, addr + i * PAGE_SIZE);
}
@@ -55,34 +49,35 @@ static void copy_huge_page(struct page *
unsigned long addr, struct vm_area_struct *vma)
{
int i;
+ struct hstate *h = hstate_vma(vma);
might_sleep();
- for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
+ for (i = 0; i < 1 << huge_page_order(h); i++) {
cond_resched();
copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
}
}
-static void enqueue_huge_page(struct page *page)
+static void enqueue_huge_page(struct hstate *h, struct page *page)
{
int nid = page_to_nid(page);
- list_add(&page->lru, &hugepage_freelists[nid]);
- free_huge_pages++;
- free_huge_pages_node[nid]++;
+ list_add(&page->lru, &h->hugepage_freelists[nid]);
+ h->free_huge_pages++;
+ h->free_huge_pages_node[nid]++;
}
-static struct page *dequeue_huge_page(void)
+static struct page *dequeue_huge_page(struct hstate *h)
{
int nid;
struct page *page = NULL;
for (nid = 0; nid < MAX_NUMNODES; ++nid) {
- if (!list_empty(&hugepage_freelists[nid])) {
- page = list_entry(hugepage_freelists[nid].next,
+ if (!list_empty(&h->hugepage_freelists[nid])) {
+ page = list_entry(h->hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
- free_huge_pages--;
- free_huge_pages_node[nid]--;
+ h->free_huge_pages--;
+ h->free_huge_pages_node[nid]--;
break;
}
}
@@ -98,18 +93,19 @@ static struct page *dequeue_huge_page_vm
struct zonelist *zonelist = huge_zonelist(vma, address,
htlb_alloc_mask, &mpol);
struct zone **z;
+ struct hstate *h = hstate_vma(vma);
for (z = zonelist->zones; *z; z++) {
nid = zone_to_nid(*z);
if (cpuset_zone_allowed_softwall(*z, htlb_alloc_mask) &&
- !list_empty(&hugepage_freelists[nid])) {
- page = list_entry(hugepage_freelists[nid].next,
+ !list_empty(&h->hugepage_freelists[nid])) {
+ page = list_entry(h->hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
- free_huge_pages--;
- free_huge_pages_node[nid]--;
+ h->free_huge_pages--;
+ h->free_huge_pages_node[nid]--;
if (vma && vma->vm_flags & VM_MAYSHARE)
- resv_huge_pages--;
+ h->resv_huge_pages--;
break;
}
}
@@ -117,23 +113,24 @@ static struct page *dequeue_huge_page_vm
return page;
}
-static void update_and_free_page(struct page *page)
+static void update_and_free_page(struct hstate *h, struct page *page)
{
int i;
- nr_huge_pages--;
- nr_huge_pages_node[page_to_nid(page)]--;
- for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
+ h->nr_huge_pages--;
+ h->nr_huge_pages_node[page_to_nid(page)]--;
+ for (i = 0; i < (1 << huge_page_order(h)); i++) {
page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
1 << PG_private | 1<< PG_writeback);
}
set_compound_page_dtor(page, NULL);
set_page_refcounted(page);
- __free_pages(page, HUGETLB_PAGE_ORDER);
+ __free_pages(page, huge_page_order(h));
}
static void free_huge_page(struct page *page)
{
+ struct hstate *h = &global_hstate;
int nid = page_to_nid(page);
struct address_space *mapping;
@@ -143,12 +140,12 @@ static void free_huge_page(struct page *
INIT_LIST_HEAD(&page->lru);
spin_lock(&hugetlb_lock);
- if (surplus_huge_pages_node[nid]) {
- update_and_free_page(page);
- surplus_huge_pages--;
- surplus_huge_pages_node[nid]--;
+ if (h->surplus_huge_pages_node[nid]) {
+ update_and_free_page(h, page);
+ h->surplus_huge_pages--;
+ h->surplus_huge_pages_node[nid]--;
} else {
- enqueue_huge_page(page);
+ enqueue_huge_page(h, page);
}
spin_unlock(&hugetlb_lock);
if (mapping)
@@ -160,7 +157,7 @@ static void free_huge_page(struct page *
* balanced by operating on them in a round-robin fashion.
* Returns 1 if an adjustment was made.
*/
-static int adjust_pool_surplus(int delta)
+static int adjust_pool_surplus(struct hstate *h, int delta)
{
static int prev_nid;
int nid = prev_nid;
@@ -173,15 +170,15 @@ static int adjust_pool_surplus(int delta
nid = first_node(node_online_map);
/* To shrink on this node, there must be a surplus page */
- if (delta < 0 && !surplus_huge_pages_node[nid])
+ if (delta < 0 && !h->surplus_huge_pages_node[nid])
continue;
/* Surplus cannot exceed the total number of pages */
- if (delta > 0 && surplus_huge_pages_node[nid] >=
- nr_huge_pages_node[nid])
+ if (delta > 0 && h->surplus_huge_pages_node[nid] >=
+ h->nr_huge_pages_node[nid])
continue;
- surplus_huge_pages += delta;
- surplus_huge_pages_node[nid] += delta;
+ h->surplus_huge_pages += delta;
+ h->surplus_huge_pages_node[nid] += delta;
ret = 1;
break;
} while (nid != prev_nid);
@@ -190,18 +187,18 @@ static int adjust_pool_surplus(int delta
return ret;
}
-static struct page *alloc_fresh_huge_page_node(int nid)
+static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
struct page *page;
page = alloc_pages_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
- HUGETLB_PAGE_ORDER);
+ huge_page_order(h));
if (page) {
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);
- nr_huge_pages++;
- nr_huge_pages_node[nid]++;
+ h->nr_huge_pages++;
+ h->nr_huge_pages_node[nid]++;
spin_unlock(&hugetlb_lock);
put_page(page); /* free it into the hugepage allocator */
}
@@ -209,17 +206,17 @@ static struct page *alloc_fresh_huge_pag
return page;
}
-static int alloc_fresh_huge_page(void)
+static int alloc_fresh_huge_page(struct hstate *h)
{
struct page *page;
int start_nid;
int next_nid;
int ret = 0;
- start_nid = hugetlb_next_nid;
+ start_nid = h->hugetlb_next_nid;
do {
- page = alloc_fresh_huge_page_node(hugetlb_next_nid);
+ page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
if (page)
ret = 1;
/*
@@ -233,17 +230,18 @@ static int alloc_fresh_huge_page(void)
* if we just successfully allocated a hugepage so that
* the next caller gets hugepages on the next node.
*/
- next_nid = next_node(hugetlb_next_nid, node_online_map);
+ next_nid = next_node(h->hugetlb_next_nid, node_online_map);
if (next_nid == MAX_NUMNODES)
next_nid = first_node(node_online_map);
- hugetlb_next_nid = next_nid;
- } while (!page && hugetlb_next_nid != start_nid);
+ h->hugetlb_next_nid = next_nid;
+ } while (!page && h->hugetlb_next_nid != start_nid);
return ret;
}
-static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
- unsigned long address)
+static struct page *alloc_buddy_huge_page(struct hstate *h,
+ struct vm_area_struct *vma,
+ unsigned long address)
{
struct page *page;
unsigned int nid;
@@ -272,17 +270,17 @@ static struct page *alloc_buddy_huge_pag
* per-node value is checked there.
*/
spin_lock(&hugetlb_lock);
- if (surplus_huge_pages >= nr_overcommit_huge_pages) {
+ if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
spin_unlock(&hugetlb_lock);
return NULL;
} else {
- nr_huge_pages++;
- surplus_huge_pages++;
+ h->nr_huge_pages++;
+ h->surplus_huge_pages++;
}
spin_unlock(&hugetlb_lock);
page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
- HUGETLB_PAGE_ORDER);
+ huge_page_order(h));
spin_lock(&hugetlb_lock);
if (page) {
@@ -297,11 +295,11 @@ static struct page *alloc_buddy_huge_pag
/*
* We incremented the global counters already
*/
- nr_huge_pages_node[nid]++;
- surplus_huge_pages_node[nid]++;
+ h->nr_huge_pages_node[nid]++;
+ h->surplus_huge_pages_node[nid]++;
} else {
- nr_huge_pages--;
- surplus_huge_pages--;
+ h->nr_huge_pages--;
+ h->surplus_huge_pages--;
}
spin_unlock(&hugetlb_lock);
@@ -312,16 +310,16 @@ static struct page *alloc_buddy_huge_pag
* Increase the hugetlb pool such that it can accomodate a reservation
* of size 'delta'.
*/
-static int gather_surplus_pages(int delta)
+static int gather_surplus_pages(struct hstate *h, int delta)
{
struct list_head surplus_list;
struct page *page, *tmp;
int ret, i;
int needed, allocated;
- needed = (resv_huge_pages + delta) - free_huge_pages;
+ needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
if (needed <= 0) {
- resv_huge_pages += delta;
+ h->resv_huge_pages += delta;
return 0;
}
@@ -332,7 +330,7 @@ static int gather_surplus_pages(int delt
retry:
spin_unlock(&hugetlb_lock);
for (i = 0; i < needed; i++) {
- page = alloc_buddy_huge_page(NULL, 0);
+ page = alloc_buddy_huge_page(h, NULL, 0);
if (!page) {
/*
* We were not able to allocate enough pages to
@@ -353,7 +351,8 @@ retry:
* because either resv_huge_pages or free_huge_pages may have changed.
*/
spin_lock(&hugetlb_lock);
- needed = (resv_huge_pages + delta) - (free_huge_pages + allocated);
+ needed = (h->resv_huge_pages + delta) -
+ (h->free_huge_pages + allocated);
if (needed > 0)
goto retry;
@@ -366,13 +365,13 @@ retry:
* before they are reserved.
*/
needed += allocated;
- resv_huge_pages += delta;
+ h->resv_huge_pages += delta;
ret = 0;
free:
list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
list_del(&page->lru);
if ((--needed) >= 0)
- enqueue_huge_page(page);
+ enqueue_huge_page(h, page);
else {
/*
* The page has a reference count of zero already, so
@@ -395,7 +394,8 @@ free:
* allocated to satisfy the reservation must be explicitly freed if they were
* never used.
*/
-static void return_unused_surplus_pages(unsigned long unused_resv_pages)
+static void return_unused_surplus_pages(struct hstate *h,
+ unsigned long unused_resv_pages)
{
static int nid = -1;
struct page *page;
@@ -410,27 +410,27 @@ static void return_unused_surplus_pages(
unsigned long remaining_iterations = num_online_nodes();
/* Uncommit the reservation */
- resv_huge_pages -= unused_resv_pages;
+ h->resv_huge_pages -= unused_resv_pages;
- nr_pages = min(unused_resv_pages, surplus_huge_pages);
+ nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
while (remaining_iterations-- && nr_pages) {
nid = next_node(nid, node_online_map);
if (nid == MAX_NUMNODES)
nid = first_node(node_online_map);
- if (!surplus_huge_pages_node[nid])
+ if (!h->surplus_huge_pages_node[nid])
continue;
- if (!list_empty(&hugepage_freelists[nid])) {
- page = list_entry(hugepage_freelists[nid].next,
+ if (!list_empty(&h->hugepage_freelists[nid])) {
+ page = list_entry(h->hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
- update_and_free_page(page);
- free_huge_pages--;
- free_huge_pages_node[nid]--;
- surplus_huge_pages--;
- surplus_huge_pages_node[nid]--;
+ update_and_free_page(h, page);
+ h->free_huge_pages--;
+ h->free_huge_pages_node[nid]--;
+ h->surplus_huge_pages--;
+ h->surplus_huge_pages_node[nid]--;
nr_pages--;
remaining_iterations = num_online_nodes();
}
@@ -453,16 +453,17 @@ static struct page *alloc_huge_page_priv
unsigned long addr)
{
struct page *page = NULL;
+ struct hstate *h = hstate_vma(vma);
if (hugetlb_get_quota(vma->vm_file->f_mapping, 1))
return ERR_PTR(-VM_FAULT_SIGBUS);
spin_lock(&hugetlb_lock);
- if (free_huge_pages > resv_huge_pages)
+ if (h->free_huge_pages > h->resv_huge_pages)
page = dequeue_huge_page_vma(vma, addr);
spin_unlock(&hugetlb_lock);
if (!page) {
- page = alloc_buddy_huge_page(vma, addr);
+ page = alloc_buddy_huge_page(h, vma, addr);
if (!page) {
hugetlb_put_quota(vma->vm_file->f_mapping, 1);
return ERR_PTR(-VM_FAULT_OOM);
@@ -492,21 +493,27 @@ static struct page *alloc_huge_page(stru
static int __init hugetlb_init(void)
{
unsigned long i;
+ struct hstate *h = &global_hstate;
if (HPAGE_SHIFT == 0)
return 0;
+ if (!h->order) {
+ h->order = HPAGE_SHIFT - PAGE_SHIFT;
+ h->mask = HPAGE_MASK;
+ }
+
for (i = 0; i < MAX_NUMNODES; ++i)
- INIT_LIST_HEAD(&hugepage_freelists[i]);
+ INIT_LIST_HEAD(&h->hugepage_freelists[i]);
- hugetlb_next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = first_node(node_online_map);
for (i = 0; i < max_huge_pages; ++i) {
- if (!alloc_fresh_huge_page())
+ if (!alloc_fresh_huge_page(h))
break;
}
- max_huge_pages = free_huge_pages = nr_huge_pages = i;
- printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
+ max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+ printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);
return 0;
}
module_init(hugetlb_init);
@@ -534,19 +541,21 @@ static unsigned int cpuset_mems_nr(unsig
#ifdef CONFIG_HIGHMEM
static void try_to_free_low(unsigned long count)
{
+ struct hstate *h = &global_hstate;
int i;
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
- list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
+ struct list_head *freel = &h->hugepage_freelists[i];
+ list_for_each_entry_safe(page, next, freel, lru) {
if (count >= nr_huge_pages)
return;
if (PageHighMem(page))
continue;
list_del(&page->lru);
update_and_free_page(page);
- free_huge_pages--;
- free_huge_pages_node[page_to_nid(page)]--;
+ h->free_huge_pages--;
+ h->free_huge_pages_node[page_to_nid(page)]--;
}
}
}
@@ -556,10 +565,11 @@ static inline void try_to_free_low(unsig
}
#endif
-#define persistent_huge_pages (nr_huge_pages - surplus_huge_pages)
+#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
static unsigned long set_max_huge_pages(unsigned long count)
{
unsigned long min_count, ret;
+ struct hstate *h = &global_hstate;
/*
* Increase the pool size
@@ -573,12 +583,12 @@ static unsigned long set_max_huge_pages(
* within all the constraints specified by the sysctls.
*/
spin_lock(&hugetlb_lock);
- while (surplus_huge_pages && count > persistent_huge_pages) {
- if (!adjust_pool_surplus(-1))
+ while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
+ if (!adjust_pool_surplus(h, -1))
break;
}
- while (count > persistent_huge_pages) {
+ while (count > persistent_huge_pages(h)) {
int ret;
/*
* If this allocation races such that we no longer need the
@@ -586,7 +596,7 @@ static unsigned long set_max_huge_pages(
* and reducing the surplus.
*/
spin_unlock(&hugetlb_lock);
- ret = alloc_fresh_huge_page();
+ ret = alloc_fresh_huge_page(h);
spin_lock(&hugetlb_lock);
if (!ret)
goto out;
@@ -608,21 +618,21 @@ static unsigned long set_max_huge_pages(
* and won't grow the pool anywhere else. Not until one of the
* sysctls are changed, or the surplus pages go out of use.
*/
- min_count = resv_huge_pages + nr_huge_pages - free_huge_pages;
+ min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
min_count = max(count, min_count);
try_to_free_low(min_count);
- while (min_count < persistent_huge_pages) {
- struct page *page = dequeue_huge_page();
+ while (min_count < persistent_huge_pages(h)) {
+ struct page *page = dequeue_huge_page(h);
if (!page)
break;
- update_and_free_page(page);
+ update_and_free_page(h, page);
}
- while (count < persistent_huge_pages) {
- if (!adjust_pool_surplus(1))
+ while (count < persistent_huge_pages(h)) {
+ if (!adjust_pool_surplus(h, 1))
break;
}
out:
- ret = persistent_huge_pages;
+ ret = persistent_huge_pages(h);
spin_unlock(&hugetlb_lock);
return ret;
}
@@ -652,9 +662,10 @@ int hugetlb_overcommit_handler(struct ct
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
+ struct hstate *h = &global_hstate;
proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
spin_lock(&hugetlb_lock);
- nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
+ h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
spin_unlock(&hugetlb_lock);
return 0;
}
@@ -663,34 +674,37 @@ int hugetlb_overcommit_handler(struct ct
int hugetlb_report_meminfo(char *buf)
{
+ struct hstate *h = &global_hstate;
return sprintf(buf,
"HugePages_Total: %5lu\n"
"HugePages_Free: %5lu\n"
"HugePages_Rsvd: %5lu\n"
"HugePages_Surp: %5lu\n"
"Hugepagesize: %5lu kB\n",
- nr_huge_pages,
- free_huge_pages,
- resv_huge_pages,
- surplus_huge_pages,
- HPAGE_SIZE/1024);
+ h->nr_huge_pages,
+ h->free_huge_pages,
+ h->resv_huge_pages,
+ h->surplus_huge_pages,
+ 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
}
int hugetlb_report_node_meminfo(int nid, char *buf)
{
+ struct hstate *h = &global_hstate;
return sprintf(buf,
"Node %d HugePages_Total: %5u\n"
"Node %d HugePages_Free: %5u\n"
"Node %d HugePages_Surp: %5u\n",
- nid, nr_huge_pages_node[nid],
- nid, free_huge_pages_node[nid],
- nid, surplus_huge_pages_node[nid]);
+ nid, h->nr_huge_pages_node[nid],
+ nid, h->free_huge_pages_node[nid],
+ nid, h->surplus_huge_pages_node[nid]);
}
/* Return the number pages of memory we physically have, in PAGE_SIZE units. */
unsigned long hugetlb_total_pages(void)
{
- return nr_huge_pages * (HPAGE_SIZE / PAGE_SIZE);
+ struct hstate *h = &global_hstate;
+ return h->nr_huge_pages * (1 << huge_page_order(h));
}
/*
@@ -745,14 +759,16 @@ int copy_hugetlb_page_range(struct mm_st
struct page *ptepage;
unsigned long addr;
int cow;
+ struct hstate *h = hstate_vma(vma);
+ unsigned sz = huge_page_size(h);
cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
src_pte = huge_pte_offset(src, addr);
if (!src_pte)
continue;
- dst_pte = huge_pte_alloc(dst, addr);
+ dst_pte = huge_pte_alloc(dst, addr, sz);
if (!dst_pte)
goto nomem;
@@ -788,6 +804,9 @@ void __unmap_hugepage_range(struct vm_ar
pte_t pte;
struct page *page;
struct page *tmp;
+ struct hstate *h = hstate_vma(vma);
+ unsigned sz = huge_page_size(h);
+
/*
* A page gathering list, protected by per file i_mmap_lock. The
* lock is used to avoid list corruption from multiple unmapping
@@ -796,11 +815,11 @@ void __unmap_hugepage_range(struct vm_ar
LIST_HEAD(page_list);
WARN_ON(!is_vm_hugetlb_page(vma));
- BUG_ON(start & ~HPAGE_MASK);
- BUG_ON(end & ~HPAGE_MASK);
+ BUG_ON(start & ~huge_page_mask(h));
+ BUG_ON(end & ~huge_page_mask(h));
spin_lock(&mm->page_table_lock);
- for (address = start; address < end; address += HPAGE_SIZE) {
+ for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
@@ -848,6 +867,7 @@ static int hugetlb_cow(struct mm_struct
{
struct page *old_page, *new_page;
int avoidcopy;
+ struct hstate *h = hstate_vma(vma);
old_page = pte_page(pte);
@@ -872,7 +892,7 @@ static int hugetlb_cow(struct mm_struct
__SetPageUptodate(new_page);
spin_lock(&mm->page_table_lock);
- ptep = huge_pte_offset(mm, address & HPAGE_MASK);
+ ptep = huge_pte_offset(mm, address & huge_page_mask(h));
if (likely(pte_same(*ptep, pte))) {
/* Break COW */
set_huge_pte_at(mm, address, ptep,
@@ -894,10 +914,11 @@ static int hugetlb_no_page(struct mm_str
struct page *page;
struct address_space *mapping;
pte_t new_pte;
+ struct hstate *h = hstate_vma(vma);
mapping = vma->vm_file->f_mapping;
- idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+ idx = ((address - vma->vm_start) >> huge_page_shift(h))
+ + (vma->vm_pgoff >> huge_page_order(h));
/*
* Use page lock to guard against racing truncation
@@ -906,7 +927,7 @@ static int hugetlb_no_page(struct mm_str
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto out;
page = alloc_huge_page(vma, address);
@@ -914,7 +935,7 @@ retry:
ret = -PTR_ERR(page);
goto out;
}
- clear_huge_page(page, address);
+ clear_huge_page(page, address, huge_page_size(h));
__SetPageUptodate(page);
if (vma->vm_flags & VM_SHARED) {
@@ -930,14 +951,14 @@ retry:
}
spin_lock(&inode->i_lock);
- inode->i_blocks += BLOCKS_PER_HUGEPAGE;
+ inode->i_blocks += (huge_page_size(h)) / 512;
spin_unlock(&inode->i_lock);
} else
lock_page(page);
}
spin_lock(&mm->page_table_lock);
- size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto backout;
@@ -973,8 +994,9 @@ int hugetlb_fault(struct mm_struct *mm,
pte_t entry;
int ret;
static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+ struct hstate *h = hstate_vma(vma);
- ptep = huge_pte_alloc(mm, address);
+ ptep = huge_pte_alloc(mm, address, huge_page_size(h));
if (!ptep)
return VM_FAULT_OOM;
@@ -1012,6 +1034,7 @@ int follow_hugetlb_page(struct mm_struct
unsigned long pfn_offset;
unsigned long vaddr = *position;
int remainder = *length;
+ struct hstate *h = hstate_vma(vma);
spin_lock(&mm->page_table_lock);
while (vaddr < vma->vm_end && remainder) {
@@ -1023,7 +1046,7 @@ int follow_hugetlb_page(struct mm_struct
* each hugepage. We have to make * sure we get the
* first, for the page indexing below to work.
*/
- pte = huge_pte_offset(mm, vaddr & HPAGE_MASK);
+ pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
if (!pte || pte_none(*pte) || (write && !pte_write(*pte))) {
int ret;
@@ -1040,7 +1063,7 @@ int follow_hugetlb_page(struct mm_struct
break;
}
- pfn_offset = (vaddr & ~HPAGE_MASK) >> PAGE_SHIFT;
+ pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
page = pte_page(*pte);
same_page:
if (pages) {
@@ -1056,7 +1079,7 @@ same_page:
--remainder;
++i;
if (vaddr < vma->vm_end && remainder &&
- pfn_offset < HPAGE_SIZE/PAGE_SIZE) {
+ pfn_offset < (1 << huge_page_order(h))) {
/*
* We use pfn_offset to avoid touching the pageframes
* of this compound page.
@@ -1078,13 +1101,14 @@ void hugetlb_change_protection(struct vm
unsigned long start = address;
pte_t *ptep;
pte_t pte;
+ struct hstate *h = hstate_vma(vma);
BUG_ON(address >= end);
flush_cache_range(vma, address, end);
spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
spin_lock(&mm->page_table_lock);
- for (; address < end; address += HPAGE_SIZE) {
+ for (; address < end; address += huge_page_size(h)) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
@@ -1223,7 +1247,7 @@ static long region_truncate(struct list_
return chg;
}
-static int hugetlb_acct_memory(long delta)
+static int hugetlb_acct_memory(struct hstate *h, long delta)
{
int ret = -ENOMEM;
@@ -1246,18 +1270,18 @@ static int hugetlb_acct_memory(long delt
* semantics that cpuset has.
*/
if (delta > 0) {
- if (gather_surplus_pages(delta) < 0)
+ if (gather_surplus_pages(h, delta) < 0)
goto out;
- if (delta > cpuset_mems_nr(free_huge_pages_node)) {
- return_unused_surplus_pages(delta);
+ if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
+ return_unused_surplus_pages(h, delta);
goto out;
}
}
ret = 0;
if (delta < 0)
- return_unused_surplus_pages((unsigned long) -delta);
+ return_unused_surplus_pages(h, (unsigned long) -delta);
out:
spin_unlock(&hugetlb_lock);
@@ -1267,6 +1291,7 @@ out:
int hugetlb_reserve_pages(struct inode *inode, long from, long to)
{
long ret, chg;
+ struct hstate *h = &global_hstate;
chg = region_chg(&inode->i_mapping->private_list, from, to);
if (chg < 0)
@@ -1274,7 +1299,7 @@ int hugetlb_reserve_pages(struct inode *
if (hugetlb_get_quota(inode->i_mapping, chg))
return -ENOSPC;
- ret = hugetlb_acct_memory(chg);
+ ret = hugetlb_acct_memory(h, chg);
if (ret < 0) {
hugetlb_put_quota(inode->i_mapping, chg);
return ret;
@@ -1285,12 +1310,13 @@ int hugetlb_reserve_pages(struct inode *
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
{
+ struct hstate *h = &global_hstate;
long chg = region_truncate(&inode->i_mapping->private_list, offset);
spin_lock(&inode->i_lock);
- inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed;
+ inode->i_blocks -= ((huge_page_size(h))/512) * freed;
spin_unlock(&inode->i_lock);
hugetlb_put_quota(inode->i_mapping, (chg - freed));
- hugetlb_acct_memory(-(chg - freed));
+ hugetlb_acct_memory(h, -(chg - freed));
}
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
@@ -128,7 +128,7 @@ pte_t *huge_pte_offset(struct mm_struct
return NULL;
}
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
{
pgd_t *pg;
pud_t *pu;
Index: linux-2.6/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/hugetlbpage.c
+++ linux-2.6/arch/sparc64/mm/hugetlbpage.c
@@ -195,7 +195,7 @@ hugetlb_get_unmapped_area(struct file *f
pgoff, flags);
}
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
{
pgd_t *pgd;
pud_t *pud;
Index: linux-2.6/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/hugetlbpage.c
+++ linux-2.6/arch/sh/mm/hugetlbpage.c
@@ -22,7 +22,7 @@
#include <asm/tlbflush.h>
#include <asm/cacheflush.h>
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
{
pgd_t *pgd;
pud_t *pud;
Index: linux-2.6/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/hugetlbpage.c
+++ linux-2.6/arch/ia64/mm/hugetlbpage.c
@@ -24,7 +24,7 @@
unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;
pte_t *
-huge_pte_alloc (struct mm_struct *mm, unsigned long addr)
+huge_pte_alloc (struct mm_struct *mm, unsigned long addr, int sz)
{
unsigned long taddr = htlbpage_to_page(addr);
pgd_t *pgd;
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -124,7 +124,7 @@ int huge_pmd_unshare(struct mm_struct *m
return 1;
}
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
{
pgd_t *pgd;
pud_t *pud;
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -40,7 +40,7 @@ extern int sysctl_hugetlb_shm_group;
/* arch callbacks */
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz);
pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
@@ -95,7 +95,6 @@ pte_t huge_ptep_get_and_clear(struct mm_
#else
void hugetlb_prefault_arch_hook(struct mm_struct *mm);
#endif
-
#else /* !CONFIG_HUGETLB_PAGE */
static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
@@ -169,8 +168,6 @@ struct file *hugetlb_file_setup(const ch
int hugetlb_get_quota(struct address_space *mapping, long delta);
void hugetlb_put_quota(struct address_space *mapping, long delta);
-#define BLOCKS_PER_HUGEPAGE (HPAGE_SIZE / 512)
-
static inline int is_file_hugepages(struct file *file)
{
if (file->f_op == &hugetlbfs_file_operations)
@@ -199,4 +196,68 @@ unsigned long hugetlb_get_unmapped_area(
unsigned long flags);
#endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
+#ifdef CONFIG_HUGETLB_PAGE
+
+/* Defines one hugetlb page size */
+struct hstate {
+ int hugetlb_next_nid;
+ unsigned int order;
+ unsigned long mask;
+ unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
+ unsigned long surplus_huge_pages;
+ unsigned long nr_overcommit_huge_pages;
+ struct list_head hugepage_freelists[MAX_NUMNODES];
+ unsigned int nr_huge_pages_node[MAX_NUMNODES];
+ unsigned int free_huge_pages_node[MAX_NUMNODES];
+ unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+};
+
+extern struct hstate global_hstate;
+
+static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
+{
+ return &global_hstate;
+}
+
+static inline struct hstate *hstate_file(struct file *f)
+{
+ return &global_hstate;
+}
+
+static inline struct hstate *hstate_inode(struct inode *i)
+{
+ return &global_hstate;
+}
+
+static inline unsigned huge_page_size(struct hstate *h)
+{
+ return PAGE_SIZE << h->order;
+}
+
+static inline unsigned long huge_page_mask(struct hstate *h)
+{
+ return h->mask;
+}
+
+static inline unsigned long huge_page_order(struct hstate *h)
+{
+ return h->order;
+}
+
+static inline unsigned huge_page_shift(struct hstate *h)
+{
+ return h->order + PAGE_SHIFT;
+}
+
+#else
+struct hstate {};
+#define hstate_file(f) NULL
+#define hstate_vma(v) NULL
+#define hstate_inode(i) NULL
+#define huge_page_size(h) PAGE_SIZE
+#define huge_page_mask(h) PAGE_MASK
+#define huge_page_order(h) 0
+#define huge_page_shift(h) PAGE_SHIFT
+#endif
+
#endif /* _LINUX_HUGETLB_H */
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -80,6 +80,7 @@ static int hugetlbfs_file_mmap(struct fi
struct inode *inode = file->f_path.dentry->d_inode;
loff_t len, vma_len;
int ret;
+ struct hstate *h = hstate_file(file);
/*
* vma address alignment (but not the pgoff alignment) has
@@ -92,7 +93,7 @@ static int hugetlbfs_file_mmap(struct fi
vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
vma->vm_ops = &hugetlb_vm_ops;
- if (vma->vm_pgoff & ~(HPAGE_MASK >> PAGE_SHIFT))
+ if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
return -EINVAL;
vma_len = (loff_t)(vma->vm_end - vma->vm_start);
@@ -104,8 +105,8 @@ static int hugetlbfs_file_mmap(struct fi
len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
if (vma->vm_flags & VM_MAYSHARE &&
- hugetlb_reserve_pages(inode, vma->vm_pgoff >> (HPAGE_SHIFT-PAGE_SHIFT),
- len >> HPAGE_SHIFT))
+ hugetlb_reserve_pages(inode, vma->vm_pgoff >> huge_page_order(h),
+ len >> huge_page_shift(h)))
goto out;
ret = 0;
@@ -130,8 +131,9 @@ hugetlb_get_unmapped_area(struct file *f
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long start_addr;
+ struct hstate *h = hstate_file(file);
- if (len & ~HPAGE_MASK)
+ if (len & ~huge_page_mask(h))
return -EINVAL;
if (len > TASK_SIZE)
return -ENOMEM;
@@ -143,7 +145,7 @@ hugetlb_get_unmapped_area(struct file *f
}
if (addr) {
- addr = ALIGN(addr, HPAGE_SIZE);
+ addr = ALIGN(addr, huge_page_size(h));
vma = find_vma(mm, addr);
if (TASK_SIZE - len >= addr &&
(!vma || addr + len <= vma->vm_start))
@@ -156,7 +158,7 @@ hugetlb_get_unmapped_area(struct file *f
start_addr = TASK_UNMAPPED_BASE;
full_search:
- addr = ALIGN(start_addr, HPAGE_SIZE);
+ addr = ALIGN(start_addr, huge_page_size(h));
for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
/* At this point: (!vma || addr < vma->vm_end). */
@@ -174,7 +176,7 @@ full_search:
if (!vma || addr + len <= vma->vm_start)
return addr;
- addr = ALIGN(vma->vm_end, HPAGE_SIZE);
+ addr = ALIGN(vma->vm_end, huge_page_size(h));
}
}
#endif
@@ -225,10 +227,11 @@ hugetlbfs_read_actor(struct page *page,
static ssize_t hugetlbfs_read(struct file *filp, char __user *buf,
size_t len, loff_t *ppos)
{
+ struct hstate *h = hstate_file(filp);
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
- unsigned long index = *ppos >> HPAGE_SHIFT;
- unsigned long offset = *ppos & ~HPAGE_MASK;
+ unsigned long index = *ppos >> huge_page_shift(h);
+ unsigned long offset = *ppos & ~huge_page_mask(h);
unsigned long end_index;
loff_t isize;
ssize_t retval = 0;
@@ -243,17 +246,17 @@ static ssize_t hugetlbfs_read(struct fil
if (!isize)
goto out;
- end_index = (isize - 1) >> HPAGE_SHIFT;
+ end_index = (isize - 1) >> huge_page_shift(h);
for (;;) {
struct page *page;
int nr, ret;
/* nr is the maximum number of bytes to copy from this page */
- nr = HPAGE_SIZE;
+ nr = huge_page_size(h);
if (index >= end_index) {
if (index > end_index)
goto out;
- nr = ((isize - 1) & ~HPAGE_MASK) + 1;
+ nr = ((isize - 1) & ~huge_page_mask(h)) + 1;
if (nr <= offset) {
goto out;
}
@@ -287,8 +290,8 @@ static ssize_t hugetlbfs_read(struct fil
offset += ret;
retval += ret;
len -= ret;
- index += offset >> HPAGE_SHIFT;
- offset &= ~HPAGE_MASK;
+ index += offset >> huge_page_shift(h);
+ offset &= ~huge_page_mask(h);
if (page)
page_cache_release(page);
@@ -298,7 +301,7 @@ static ssize_t hugetlbfs_read(struct fil
break;
}
out:
- *ppos = ((loff_t)index << HPAGE_SHIFT) + offset;
+ *ppos = ((loff_t)index << huge_page_shift(h)) + offset;
mutex_unlock(&inode->i_mutex);
return retval;
}
@@ -339,8 +342,9 @@ static void truncate_huge_page(struct pa
static void truncate_hugepages(struct inode *inode, loff_t lstart)
{
+ struct hstate *h = hstate_inode(inode);
struct address_space *mapping = &inode->i_data;
- const pgoff_t start = lstart >> HPAGE_SHIFT;
+ const pgoff_t start = lstart >> huge_page_shift(h);
struct pagevec pvec;
pgoff_t next;
int i, freed = 0;
@@ -449,8 +453,9 @@ static int hugetlb_vmtruncate(struct ino
{
pgoff_t pgoff;
struct address_space *mapping = inode->i_mapping;
+ struct hstate *h = hstate_inode(inode);
- BUG_ON(offset & ~HPAGE_MASK);
+ BUG_ON(offset & ~huge_page_mask(h));
pgoff = offset >> PAGE_SHIFT;
i_size_write(inode, offset);
@@ -465,6 +470,7 @@ static int hugetlb_vmtruncate(struct ino
static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
{
struct inode *inode = dentry->d_inode;
+ struct hstate *h = hstate_inode(inode);
int error;
unsigned int ia_valid = attr->ia_valid;
@@ -476,7 +482,7 @@ static int hugetlbfs_setattr(struct dent
if (ia_valid & ATTR_SIZE) {
error = -EINVAL;
- if (!(attr->ia_size & ~HPAGE_MASK))
+ if (!(attr->ia_size & ~huge_page_mask(h)))
error = hugetlb_vmtruncate(inode, attr->ia_size);
if (error)
goto out;
@@ -610,9 +616,10 @@ static int hugetlbfs_set_page_dirty(stru
static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
+ struct hstate *h = hstate_inode(dentry->d_inode);
buf->f_type = HUGETLBFS_MAGIC;
- buf->f_bsize = HPAGE_SIZE;
+ buf->f_bsize = huge_page_size(h);
if (sbinfo) {
spin_lock(&sbinfo->stat_lock);
/* If no limits set, just report 0 for max/free/used
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c
+++ linux-2.6/ipc/shm.c
@@ -613,7 +613,8 @@ static void shm_get_stat(struct ipc_name
if (is_file_hugepages(shp->shm_file)) {
struct address_space *mapping = inode->i_mapping;
- *rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
+ struct hstate *h = hstate_file(shp->shm_file);
+ *rss += (1 << huge_page_order(h)) * mapping->nrpages;
} else {
struct shmem_inode_info *info = SHMEM_I(inode);
spin_lock(&info->lock);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -848,7 +848,7 @@ unsigned long unmap_vmas(struct mmu_gath
if (unlikely(is_vm_hugetlb_page(vma))) {
unmap_hugepage_range(vma, start, end);
zap_work -= (end - start) /
- (HPAGE_SIZE / PAGE_SIZE);
+ (1 << huge_page_order(hstate_vma(vma)));
start = end;
} else
start = unmap_page_range(*tlbp, vma,
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c
+++ linux-2.6/mm/mempolicy.c
@@ -1295,7 +1295,8 @@ struct zonelist *huge_zonelist(struct vm
if (pol->policy == MPOL_INTERLEAVE) {
unsigned nid;
- nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
+ nid = interleave_nid(pol, vma, addr,
+ huge_page_shift(hstate_vma(vma)));
if (unlikely(pol != &default_policy &&
pol != current->mempolicy))
__mpol_free(pol); /* finished with pol */
@@ -1944,9 +1945,12 @@ static void check_huge_range(struct vm_a
{
unsigned long addr;
struct page *page;
+ struct hstate *h = hstate_vma(vma);
+ unsigned sz = huge_page_size(h);
- for (addr = start; addr < end; addr += HPAGE_SIZE) {
- pte_t *ptep = huge_pte_offset(vma->vm_mm, addr & HPAGE_MASK);
+ for (addr = start; addr < end; addr += sz) {
+ pte_t *ptep = huge_pte_offset(vma->vm_mm,
+ addr & huge_page_mask(h));
pte_t pte;
if (!ptep)
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1793,7 +1793,8 @@ int split_vma(struct mm_struct * mm, str
struct mempolicy *pol;
struct vm_area_struct *new;
- if (is_vm_hugetlb_page(vma) && (addr & ~HPAGE_MASK))
+ if (is_vm_hugetlb_page(vma) && (addr &
+ ~(huge_page_mask(hstate_vma(vma)))))
return -EINVAL;
if (mm->map_count >= sysctl_max_map_count)
* [patch 02/17] hugetlb: multiple hstates
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
2008-04-10 17:02 ` [patch 01/17] hugetlb: modular state npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:02 ` [patch 03/17] hugetlb: multi hstate proc files npiggin
` (15 subsequent siblings)
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: hugetlb-multiple-hstates.patch --]
[-- Type: text/plain, Size: 6013 bytes --]
Add basic support for more than one hstate in hugetlbfs
- Convert hstates to an array
- Add a first default entry covering the standard huge page size
- Add functions for architectures to register new hstates
- Add basic iterators over hstates
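As a rough illustration of how the new interface is meant to be used (a
sketch only; the actual architecture hookup comes later in the series, and
the 1GB size is just an example), an architecture parsing a hugepagesz=
boot option registers the extra size, and generic code can then walk all
registered sizes:

    /* register a 1GB hstate: order is 30 - PAGE_SHIFT */
    huge_add_hstate(30 - PAGE_SHIFT);

    /* iterate over every registered huge page size */
    struct hstate *h;
    for_each_hstate (h)
            printk(KERN_INFO "hugetlb: %u kB page size registered\n",
                    huge_page_size(h) / 1024);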
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
include/linux/hugetlb.h | 11 ++++++-
mm/hugetlb.c | 71 ++++++++++++++++++++++++++++++++++++------------
2 files changed, 64 insertions(+), 18 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -27,7 +27,15 @@ unsigned long sysctl_overcommit_huge_pag
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-struct hstate global_hstate;
+static int max_hstate = 1;
+
+struct hstate hstates[HUGE_MAX_HSTATE];
+
+/* for command line parsing */
+struct hstate *parsed_hstate __initdata = &global_hstate;
+
+#define for_each_hstate(h) \
+ for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -128,9 +136,19 @@ static void update_and_free_page(struct
__free_pages(page, huge_page_order(h));
}
+struct hstate *size_to_hstate(unsigned long size)
+{
+ struct hstate *h;
+ for_each_hstate (h) {
+ if (huge_page_size(h) == size)
+ return h;
+ }
+ return NULL;
+}
+
static void free_huge_page(struct page *page)
{
- struct hstate *h = &global_hstate;
+ struct hstate *h = size_to_hstate(PAGE_SIZE << compound_order(page));
int nid = page_to_nid(page);
struct address_space *mapping;
@@ -490,15 +508,11 @@ static struct page *alloc_huge_page(stru
return page;
}
-static int __init hugetlb_init(void)
+static int __init hugetlb_init_hstate(struct hstate *h)
{
unsigned long i;
- struct hstate *h = &global_hstate;
- if (HPAGE_SHIFT == 0)
- return 0;
-
- if (!h->order) {
+ if (h == &global_hstate && !h->order) {
h->order = HPAGE_SHIFT - PAGE_SHIFT;
h->mask = HPAGE_MASK;
}
@@ -513,11 +527,35 @@ static int __init hugetlb_init(void)
break;
}
max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
- printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);
+
+ printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
+ h->free_huge_pages,
+ 1 << (h->order + PAGE_SHIFT - 20));
return 0;
}
+
+static int __init hugetlb_init(void)
+{
+ if (HPAGE_SHIFT == 0)
+ return 0;
+ return hugetlb_init_hstate(&global_hstate);
+}
module_init(hugetlb_init);
+/* Should be called on processing a hugepagesz=... option */
+void __init huge_add_hstate(unsigned order)
+{
+ struct hstate *h;
+ BUG_ON(size_to_hstate(PAGE_SIZE << order));
+ BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
+ BUG_ON(order <= HPAGE_SHIFT - PAGE_SHIFT);
+ h = &hstates[max_hstate++];
+ h->order = order;
+ h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
+ hugetlb_init_hstate(h);
+ parsed_hstate = h;
+}
+
static int __init hugetlb_setup(char *s)
{
if (sscanf(s, "%lu", &max_huge_pages) <= 0)
@@ -539,28 +577,27 @@ static unsigned int cpuset_mems_nr(unsig
#ifdef CONFIG_SYSCTL
#ifdef CONFIG_HIGHMEM
-static void try_to_free_low(unsigned long count)
+static void try_to_free_low(struct hstate *h, unsigned long count)
{
- struct hstate *h = &global_hstate;
int i;
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
struct list_head *freel = &h->hugepage_freelists[i];
list_for_each_entry_safe(page, next, freel, lru) {
- if (count >= nr_huge_pages)
+ if (count >= h->nr_huge_pages)
return;
if (PageHighMem(page))
continue;
list_del(&page->lru);
- update_and_free_page(page);
+ update_and_free_page(h, page);
h->free_huge_pages--;
h->free_huge_pages_node[page_to_nid(page)]--;
}
}
}
#else
-static inline void try_to_free_low(unsigned long count)
+static inline void try_to_free_low(struct hstate *h, unsigned long count)
{
}
#endif
@@ -620,7 +657,7 @@ static unsigned long set_max_huge_pages(
*/
min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
min_count = max(count, min_count);
- try_to_free_low(min_count);
+ try_to_free_low(h, min_count);
while (min_count < persistent_huge_pages(h)) {
struct page *page = dequeue_huge_page(h);
if (!page)
@@ -1291,7 +1328,7 @@ out:
int hugetlb_reserve_pages(struct inode *inode, long from, long to)
{
long ret, chg;
- struct hstate *h = &global_hstate;
+ struct hstate *h = hstate_inode(inode);
chg = region_chg(&inode->i_mapping->private_list, from, to);
if (chg < 0)
@@ -1310,7 +1347,7 @@ int hugetlb_reserve_pages(struct inode *
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
{
- struct hstate *h = &global_hstate;
+ struct hstate *h = hstate_inode(inode);
long chg = region_truncate(&inode->i_mapping->private_list, offset);
spin_lock(&inode->i_lock);
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -212,7 +212,16 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
};
-extern struct hstate global_hstate;
+void __init huge_add_hstate(unsigned order);
+struct hstate *size_to_hstate(unsigned long size);
+
+#ifndef HUGE_MAX_HSTATE
+#define HUGE_MAX_HSTATE 1
+#endif
+
+extern struct hstate hstates[HUGE_MAX_HSTATE];
+
+#define global_hstate (hstates[0])
static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
{
* [patch 03/17] hugetlb: multi hstate proc files
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
2008-04-10 17:02 ` [patch 01/17] hugetlb: modular state npiggin
2008-04-10 17:02 ` [patch 02/17] hugetlb: multiple hstates npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:02 ` [patch 04/17] hugetlbfs: per mount hstates npiggin
` (14 subsequent siblings)
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: hugetlb-proc-hstates.patch --]
[-- Type: text/plain, Size: 3521 bytes --]
Convert /proc output code over to report multiple hstates
I chose to just report the numbers in a row, in the hope of minimizing
breakage of existing software. The "compat" page size is always the first
number.
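To make the resulting layout concrete, with the standard 2MB hstate plus an
additional 1GB hstate registered, the hugetlb section of /proc/meminfo
would look roughly like this (illustrative numbers only):

    HugePages_Total:    20     4
    HugePages_Free:     20     4
    HugePages_Rsvd:      0     0
    HugePages_Surp:      0     0
    Hugepagesize:     2048 1048576 kB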
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
mm/hugetlb.c | 64 ++++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 42 insertions(+), 22 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -709,39 +709,59 @@ int hugetlb_overcommit_handler(struct ct
#endif /* CONFIG_SYSCTL */
+static int dump_field(char *buf, unsigned field)
+{
+ int n = 0;
+ struct hstate *h;
+ for_each_hstate (h)
+ n += sprintf(buf + n, " %5lu", *(unsigned long *)((char *)h + field));
+ buf[n++] = '\n';
+ return n;
+}
+
int hugetlb_report_meminfo(char *buf)
{
- struct hstate *h = &global_hstate;
- return sprintf(buf,
- "HugePages_Total: %5lu\n"
- "HugePages_Free: %5lu\n"
- "HugePages_Rsvd: %5lu\n"
- "HugePages_Surp: %5lu\n"
- "Hugepagesize: %5lu kB\n",
- h->nr_huge_pages,
- h->free_huge_pages,
- h->resv_huge_pages,
- h->surplus_huge_pages,
- 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
+ struct hstate *h;
+ int n = 0;
+ n += sprintf(buf + 0, "HugePages_Total:");
+ n += dump_field(buf + n, offsetof(struct hstate, nr_huge_pages));
+ n += sprintf(buf + n, "HugePages_Free: ");
+ n += dump_field(buf + n, offsetof(struct hstate, free_huge_pages));
+ n += sprintf(buf + n, "HugePages_Rsvd: ");
+ n += dump_field(buf + n, offsetof(struct hstate, resv_huge_pages));
+ n += sprintf(buf + n, "HugePages_Surp: ");
+ n += dump_field(buf + n, offsetof(struct hstate, surplus_huge_pages));
+ n += sprintf(buf + n, "Hugepagesize: ");
+ for_each_hstate (h)
+ n += sprintf(buf + n, " %5u", huge_page_size(h) / 1024);
+ n += sprintf(buf + n, " kB\n");
+ return n;
}
int hugetlb_report_node_meminfo(int nid, char *buf)
{
- struct hstate *h = &global_hstate;
- return sprintf(buf,
- "Node %d HugePages_Total: %5u\n"
- "Node %d HugePages_Free: %5u\n"
- "Node %d HugePages_Surp: %5u\n",
- nid, h->nr_huge_pages_node[nid],
- nid, h->free_huge_pages_node[nid],
- nid, h->surplus_huge_pages_node[nid]);
+ int n = 0;
+ n += sprintf(buf, "Node %d HugePages_Total: ", nid);
+ n += dump_field(buf + n, offsetof(struct hstate,
+ nr_huge_pages_node[nid]));
+ n += sprintf(buf + n, "Node %d HugePages_Free: ", nid);
+ n += dump_field(buf + n, offsetof(struct hstate,
+ free_huge_pages_node[nid]));
+ n += sprintf(buf + n, "Node %d HugePages_Surp: ", nid);
+ n += dump_field(buf + n, offsetof(struct hstate,
+ surplus_huge_pages_node[nid]));
+ return n;
}
/* Return the number pages of memory we physically have, in PAGE_SIZE units. */
unsigned long hugetlb_total_pages(void)
{
- struct hstate *h = &global_hstate;
- return h->nr_huge_pages * (1 << huge_page_order(h));
+ long x = 0;
+ struct hstate *h;
+ for_each_hstate (h) {
+ x += h->nr_huge_pages * (1 << huge_page_order(h));
+ }
+ return x;
}
/*
* [patch 04/17] hugetlbfs: per mount hstates
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (2 preceding siblings ...)
2008-04-10 17:02 ` [patch 03/17] hugetlb: multi hstate proc files npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:02 ` [patch 05/17] hugetlb: multi hstate sysctls npiggin
` (13 subsequent siblings)
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: hugetlbfs-per-mount-hstate.patch --]
[-- Type: text/plain, Size: 8145 bytes --]
Add support to have individual hstates for each hugetlbfs mount
- Add a new pagesize= option to the hugetlbfs mount that allows setting
the page size
- Set up pointers to a suitable hstate for the selected page size in the
super block, the inode and the vma.
- Change the hstate accessors to use this information
- Add code to the hstate init function to set parsed_hstate for command
line processing
- Handle duplicate hstate registrations to make the command line user-proof
[np: take hstate out of hugetlbfs inode and vma->vm_private_data]
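For reference, a mount can then select any registered page size, along the
lines of (assuming a 1GB hstate has been registered via the boot command
line; the mount point is arbitrary):

    mount -t hugetlbfs -o pagesize=1G none /mnt/huge-1g

after which hstate_file()/hstate_vma()/hstate_inode() resolve to the hstate
chosen for that mount rather than unconditionally to global_hstate.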
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
fs/hugetlbfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++++----------
include/linux/hugetlb.h | 14 +++++++++-----
mm/hugetlb.c | 16 +++-------------
mm/memory.c | 18 ++++++++++++++++--
4 files changed, 66 insertions(+), 30 deletions(-)
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -136,6 +136,7 @@ struct hugetlbfs_config {
umode_t mode;
long nr_blocks;
long nr_inodes;
+ struct hstate *hstate;
};
struct hugetlbfs_sb_info {
@@ -144,6 +145,7 @@ struct hugetlbfs_sb_info {
long max_inodes; /* inodes allowed */
long free_inodes; /* inodes free */
spinlock_t stat_lock;
+ struct hstate *hstate;
};
@@ -223,19 +225,21 @@ extern struct hstate hstates[HUGE_MAX_HS
#define global_hstate (hstates[0])
-static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
+static inline struct hstate *hstate_inode(struct inode *i)
{
- return &global_hstate;
+ struct hugetlbfs_sb_info *hsb;
+ hsb = HUGETLBFS_SB(i->i_sb);
+ return hsb->hstate;
}
static inline struct hstate *hstate_file(struct file *f)
{
- return &global_hstate;
+ return hstate_inode(f->f_dentry->d_inode);
}
-static inline struct hstate *hstate_inode(struct inode *i)
+static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
{
- return &global_hstate;
+ return hstate_file(vma->vm_file);
}
static inline unsigned huge_page_size(struct hstate *h)
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -53,6 +53,7 @@ int sysctl_hugetlb_shm_group;
enum {
Opt_size, Opt_nr_inodes,
Opt_mode, Opt_uid, Opt_gid,
+ Opt_pagesize,
Opt_err,
};
@@ -62,6 +63,7 @@ static match_table_t tokens = {
{Opt_mode, "mode=%o"},
{Opt_uid, "uid=%u"},
{Opt_gid, "gid=%u"},
+ {Opt_pagesize, "pagesize=%s"},
{Opt_err, NULL},
};
@@ -750,6 +752,8 @@ hugetlbfs_parse_options(char *options, s
char *p, *rest;
substring_t args[MAX_OPT_ARGS];
int option;
+ unsigned long long size = 0;
+ enum { NO_SIZE, SIZE_STD, SIZE_PERCENT } setsize = NO_SIZE;
if (!options)
return 0;
@@ -780,17 +784,13 @@ hugetlbfs_parse_options(char *options, s
break;
case Opt_size: {
- unsigned long long size;
/* memparse() will accept a K/M/G without a digit */
if (!isdigit(*args[0].from))
goto bad_val;
size = memparse(args[0].from, &rest);
- if (*rest == '%') {
- size <<= HPAGE_SHIFT;
- size *= max_huge_pages;
- do_div(size, 100);
- }
- pconfig->nr_blocks = (size >> HPAGE_SHIFT);
+ setsize = SIZE_STD;
+ if (*rest == '%')
+ setsize = SIZE_PERCENT;
break;
}
@@ -801,6 +801,19 @@ hugetlbfs_parse_options(char *options, s
pconfig->nr_inodes = memparse(args[0].from, &rest);
break;
+ case Opt_pagesize: {
+ unsigned long ps;
+ ps = memparse(args[0].from, &rest);
+ pconfig->hstate = size_to_hstate(ps);
+ if (!pconfig->hstate) {
+ printk(KERN_ERR
+ "hugetlbfs: Unsupported page size %lu MB\n",
+ ps >> 20);
+ return -EINVAL;
+ }
+ break;
+ }
+
default:
printk(KERN_ERR "hugetlbfs: Bad mount option: \"%s\"\n",
p);
@@ -808,6 +821,18 @@ hugetlbfs_parse_options(char *options, s
break;
}
}
+
+ /* Do size after hstate is set up */
+ if (setsize > NO_SIZE) {
+ struct hstate *h = pconfig->hstate;
+ if (setsize == SIZE_PERCENT) {
+ size <<= huge_page_shift(h);
+ size *= max_huge_pages[h - hstates];
+ do_div(size, 100);
+ }
+ pconfig->nr_blocks = (size >> huge_page_shift(h));
+ }
+
return 0;
bad_val:
@@ -832,6 +857,7 @@ hugetlbfs_fill_super(struct super_block
config.uid = current->fsuid;
config.gid = current->fsgid;
config.mode = 0755;
+ config.hstate = &global_hstate;
ret = hugetlbfs_parse_options(data, &config);
if (ret)
return ret;
@@ -840,14 +866,15 @@ hugetlbfs_fill_super(struct super_block
if (!sbinfo)
return -ENOMEM;
sb->s_fs_info = sbinfo;
+ sbinfo->hstate = config.hstate;
spin_lock_init(&sbinfo->stat_lock);
sbinfo->max_blocks = config.nr_blocks;
sbinfo->free_blocks = config.nr_blocks;
sbinfo->max_inodes = config.nr_inodes;
sbinfo->free_inodes = config.nr_inodes;
sb->s_maxbytes = MAX_LFS_FILESIZE;
- sb->s_blocksize = HPAGE_SIZE;
- sb->s_blocksize_bits = HPAGE_SHIFT;
+ sb->s_blocksize = huge_page_size(config.hstate);
+ sb->s_blocksize_bits = huge_page_shift(config.hstate);
sb->s_magic = HUGETLBFS_MAGIC;
sb->s_op = &hugetlbfs_ops;
sb->s_time_gran = 1;
@@ -949,7 +976,8 @@ struct file *hugetlb_file_setup(const ch
goto out_dentry;
error = -ENOMEM;
- if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT))
+ if (hugetlb_reserve_pages(inode, 0,
+ size >> huge_page_shift(hstate_inode(inode))))
goto out_inode;
d_instantiate(dentry, inode);
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -904,19 +904,9 @@ void __unmap_hugepage_range(struct vm_ar
void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end)
{
- /*
- * It is undesirable to test vma->vm_file as it should be non-null
- * for valid hugetlb area. However, vm_file will be NULL in the error
- * cleanup path of do_mmap_pgoff. When hugetlbfs ->mmap method fails,
- * do_mmap_pgoff() nullifies vma->vm_file before calling this function
- * to clean up. Since no pte has actually been setup, it is safe to
- * do nothing in this case.
- */
- if (vma->vm_file) {
- spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
- __unmap_hugepage_range(vma, start, end);
- spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
- }
+ spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+ __unmap_hugepage_range(vma, start, end);
+ spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
}
static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -846,9 +846,23 @@ unsigned long unmap_vmas(struct mmu_gath
}
if (unlikely(is_vm_hugetlb_page(vma))) {
- unmap_hugepage_range(vma, start, end);
- zap_work -= (end - start) /
+ /*
+ * It is undesirable to test vma->vm_file as it
+ * should be non-null for valid hugetlb area.
+ * However, vm_file will be NULL in the error
+ * cleanup path of do_mmap_pgoff. When
+ * hugetlbfs ->mmap method fails,
+ * do_mmap_pgoff() nullifies vma->vm_file
+ * before calling this function to clean up.
+ * Since no pte has actually been setup, it is
+ * safe to do nothing in this case.
+ */
+ if (vma->vm_file) {
+ unmap_hugepage_range(vma, start, end);
+ zap_work -= (end - start) /
(1 << huge_page_order(hstate_vma(vma)));
+ }
+
start = end;
} else
start = unmap_page_range(*tlbp, vma,
--
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 05/17] hugetlb: multi hstate sysctls
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (3 preceding siblings ...)
2008-04-10 17:02 ` [patch 04/17] hugetlbfs: per mount hstates npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:02 ` [patch 06/17] hugetlb: abstract numa round robin selection npiggin
` (12 subsequent siblings)
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: hugetlbfs-sysctl-hstates.patch --]
[-- Type: text/plain, Size: 5186 bytes --]
Expand the hugetlbfs sysctls to handle arrays for all hstates.
- I didn't bother with hugetlb_shm_group and treat_as_movable; these
remain single globals.
- Also improve error propagation for the sysctl handlers a bit
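(Illustration, not part of the patch: since the handler now runs
proc_doulongvec_minmax over the whole max_huge_pages[] array, writing
several space-separated values to /proc/sys/vm/nr_hugepages fills one
array entry per registered hstate, in registration order, and each value
is then applied to its hstate.)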
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
include/linux/hugetlb.h | 5 +++--
kernel/sysctl.c | 2 +-
mm/hugetlb.c | 43 +++++++++++++++++++++++++++++++------------
3 files changed, 35 insertions(+), 15 deletions(-)
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -32,8 +32,6 @@ int hugetlb_fault(struct mm_struct *mm,
int hugetlb_reserve_pages(struct inode *inode, long from, long to);
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
-extern unsigned long max_huge_pages;
-extern unsigned long sysctl_overcommit_huge_pages;
extern unsigned long hugepages_treat_as_movable;
extern const unsigned long hugetlb_zero, hugetlb_infinity;
extern int sysctl_hugetlb_shm_group;
@@ -262,6 +260,9 @@ static inline unsigned huge_page_shift(s
return h->order + PAGE_SHIFT;
}
+extern unsigned long max_huge_pages[HUGE_MAX_HSTATE];
+extern unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
+
#else
struct hstate {};
#define hstate_file(f) NULL
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -935,7 +935,7 @@ static struct ctl_table vm_table[] = {
{
.procname = "nr_hugepages",
.data = &max_huge_pages,
- .maxlen = sizeof(unsigned long),
+ .maxlen = sizeof(max_huge_pages),
.mode = 0644,
.proc_handler = &hugetlb_sysctl_handler,
.extra1 = (void *)&hugetlb_zero,
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -22,8 +22,8 @@
#include "internal.h"
const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-unsigned long max_huge_pages;
-unsigned long sysctl_overcommit_huge_pages;
+unsigned long max_huge_pages[HUGE_MAX_HSTATE];
+unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
@@ -522,11 +522,11 @@ static int __init hugetlb_init_hstate(st
h->hugetlb_next_nid = first_node(node_online_map);
- for (i = 0; i < max_huge_pages; ++i) {
+ for (i = 0; i < max_huge_pages[h - hstates]; ++i) {
if (!alloc_fresh_huge_page(h))
break;
}
- max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+ max_huge_pages[h - hstates] = h->free_huge_pages = h->nr_huge_pages = i;
printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
h->free_huge_pages,
@@ -558,8 +558,9 @@ void __init huge_add_hstate(unsigned ord
static int __init hugetlb_setup(char *s)
{
- if (sscanf(s, "%lu", &max_huge_pages) <= 0)
- max_huge_pages = 0;
+ unsigned long *mhp = &max_huge_pages[parsed_hstate - hstates];
+ if (sscanf(s, "%lu", mhp) <= 0)
+ *mhp = 0;
return 1;
}
__setup("hugepages=", hugetlb_setup);
@@ -603,10 +604,12 @@ static inline void try_to_free_low(struc
#endif
#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static unsigned long set_max_huge_pages(unsigned long count)
+static unsigned long
+set_max_huge_pages(struct hstate *h, unsigned long count, int *err)
{
unsigned long min_count, ret;
- struct hstate *h = &global_hstate;
+
+ *err = 0;
/*
* Increase the pool size
@@ -678,8 +681,20 @@ int hugetlb_sysctl_handler(struct ctl_ta
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
- proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
- max_huge_pages = set_max_huge_pages(max_huge_pages);
+ int err = 0;
+ struct hstate *h;
+ int i;
+ err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
+ if (err)
+ return err;
+ i = 0;
+ for_each_hstate (h) {
+ max_huge_pages[i] = set_max_huge_pages(h, max_huge_pages[i],
+ &err);
+ if (err)
+ return err;
+ i++;
+ }
return 0;
}
@@ -699,10 +714,14 @@ int hugetlb_overcommit_handler(struct ct
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
- struct hstate *h = &global_hstate;
+ struct hstate *h;
+ int i = 0;
proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
spin_lock(&hugetlb_lock);
- h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
+ for_each_hstate (h) {
+ h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages[i];
+ i++;
+ }
spin_unlock(&hugetlb_lock);
return 0;
}
--
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 06/17] hugetlb: abstract numa round robin selection
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (4 preceding siblings ...)
2008-04-10 17:02 ` [patch 05/17] hugetlb: multi hstate sysctls npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:02 ` [patch 07/17] mm: introduce non panic alloc_bootmem npiggin
` (11 subsequent siblings)
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: hugetlb-abstract-numa-rr.patch --]
[-- Type: text/plain, Size: 2698 bytes --]
Need this as a separate function for a future patch.
No behaviour change.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
mm/hugetlb.c | 37 ++++++++++++++++++++++---------------
1 file changed, 22 insertions(+), 15 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -224,6 +224,27 @@ static struct page *alloc_fresh_huge_pag
return page;
}
+/*
+ * Use a helper variable to find the next node and then
+ * copy it back to hugetlb_next_nid afterwards:
+ * otherwise there's a window in which a racer might
+ * pass invalid nid MAX_NUMNODES to alloc_pages_node.
+ * But we don't need to use a spin_lock here: it really
+ * doesn't matter if occasionally a racer chooses the
+ * same nid as we do. Move nid forward in the mask even
+ * if we just successfully allocated a hugepage so that
+ * the next caller gets hugepages on the next node.
+ */
+static int hstate_next_node(struct hstate *h)
+{
+ int next_nid;
+ next_nid = next_node(h->hugetlb_next_nid, node_online_map);
+ if (next_nid == MAX_NUMNODES)
+ next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = next_nid;
+ return next_nid;
+}
+
static int alloc_fresh_huge_page(struct hstate *h)
{
struct page *page;
@@ -237,21 +258,7 @@ static int alloc_fresh_huge_page(struct
page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
if (page)
ret = 1;
- /*
- * Use a helper variable to find the next node and then
- * copy it back to hugetlb_next_nid afterwards:
- * otherwise there's a window in which a racer might
- * pass invalid nid MAX_NUMNODES to alloc_pages_node.
- * But we don't need to use a spin_lock here: it really
- * doesn't matter if occasionally a racer chooses the
- * same nid as we do. Move nid forward in the mask even
- * if we just successfully allocated a hugepage so that
- * the next caller gets hugepages on the next node.
- */
- next_nid = next_node(h->hugetlb_next_nid, node_online_map);
- if (next_nid == MAX_NUMNODES)
- next_nid = first_node(node_online_map);
- h->hugetlb_next_nid = next_nid;
+ next_nid = hstate_next_node(h);
} while (!page && h->hugetlb_next_nid != start_nid);
return ret;
--
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 07/17] mm: introduce non panic alloc_bootmem
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (5 preceding siblings ...)
2008-04-10 17:02 ` [patch 06/17] hugetlb: abstract numa round robin selection npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:02 ` [patch 08/17] mm: export prep_compound_page to mm npiggin
` (10 subsequent siblings)
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: __alloc_bootmem_node_nopanic.patch --]
[-- Type: text/plain, Size: 1970 bytes --]
Straightforward variant of the existing __alloc_bootmem_node; the only
difference is that it doesn't panic on failure.
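A minimal usage sketch (hypothetical caller, not part of this patch; nid,
size, align and goal stand for whatever the caller already has):
	void *p = __alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align, goal);
	if (!p)
		return -ENOMEM;	/* handle the failure instead of panicking */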
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
include/linux/bootmem.h | 4 ++++
mm/bootmem.c | 12 ++++++++++++
2 files changed, 16 insertions(+)
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -484,6 +484,18 @@ void * __init __alloc_bootmem_node(pg_da
return __alloc_bootmem(size, align, goal);
}
+void * __init __alloc_bootmem_node_nopanic(pg_data_t *pgdat, unsigned long size,
+ unsigned long align, unsigned long goal)
+{
+ void *ptr;
+
+ ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal, 0);
+ if (ptr)
+ return ptr;
+
+ return __alloc_bootmem_nopanic(size, align, goal);
+}
+
#ifndef ARCH_LOW_ADDRESS_LIMIT
#define ARCH_LOW_ADDRESS_LIMIT 0xffffffffUL
#endif
Index: linux-2.6/include/linux/bootmem.h
===================================================================
--- linux-2.6.orig/include/linux/bootmem.h
+++ linux-2.6/include/linux/bootmem.h
@@ -90,6 +90,10 @@ extern void *__alloc_bootmem_node(pg_dat
unsigned long size,
unsigned long align,
unsigned long goal);
+extern void *__alloc_bootmem_node_nopanic(pg_data_t *pgdat,
+ unsigned long size,
+ unsigned long align,
+ unsigned long goal);
extern unsigned long init_bootmem_node(pg_data_t *pgdat,
unsigned long freepfn,
unsigned long startpfn,
--
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 08/17] mm: export prep_compound_page to mm
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (6 preceding siblings ...)
2008-04-10 17:02 ` [patch 07/17] mm: introduce non panic alloc_bootmem npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:02 ` [patch 09/17] hugetlb: factor out huge_new_page npiggin
` (9 subsequent siblings)
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: mm-export-prep_compound_page.patch --]
[-- Type: text/plain, Size: 1570 bytes --]
hugetlb will need to get compound pages from bootmem to handle the case
where they are larger than MAX_ORDER. Export the constructor function
needed for this.
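As a rough sketch of the intended use (mirroring a later patch in this
series; 'addr' and 'order' are placeholders here):
	struct page *page = virt_to_page(addr);	/* region from bootmem */
	__ClearPageReserved(page);
	prep_compound_page(page, order);	/* turn it into one compound page */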
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
mm/internal.h | 2 ++
mm/page_alloc.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
Index: linux-2.6/mm/internal.h
===================================================================
--- linux-2.6.orig/mm/internal.h
+++ linux-2.6/mm/internal.h
@@ -13,6 +13,8 @@
#include <linux/mm.h>
+extern void prep_compound_page(struct page *page, unsigned long order);
+
static inline void set_page_count(struct page *page, int v)
{
atomic_set(&page->_count, v);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -272,7 +272,7 @@ static void free_compound_page(struct pa
__free_pages_ok(page, compound_order(page));
}
-static void prep_compound_page(struct page *page, unsigned long order)
+void prep_compound_page(struct page *page, unsigned long order)
{
int i;
int nr_pages = 1 << order;
--
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 09/17] hugetlb: factor out huge_new_page
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (7 preceding siblings ...)
2008-04-10 17:02 ` [patch 08/17] mm: export prep_compound_page to mm npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:02 ` [patch 10/17] mm: fix bootmem alignment npiggin
` (8 subsequent siblings)
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: hugetlb-factor-page-prep.patch --]
[-- Type: text/plain, Size: 2022 bytes --]
Needed to avoid code duplication in follow-up patches.
This happens to fix a minor bug. When alloc_bootmem_node falls back and
returns memory on a different node than the one passed in, the old code
would have put the page into the free lists of the wrong node. Now it
ends up in the freelist of the correct node.
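(The fix works because huge_new_page() derives the node from the page
itself, via pfn_to_nid(page_to_pfn(page)), so the per-node accounting
follows wherever the memory actually came from rather than the node that
was asked for.)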
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
mm/hugetlb.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -205,6 +205,17 @@ static int adjust_pool_surplus(struct hs
return ret;
}
+static void huge_new_page(struct hstate *h, struct page *page)
+{
+ unsigned nid = pfn_to_nid(page_to_pfn(page));
+ set_compound_page_dtor(page, free_huge_page);
+ spin_lock(&hugetlb_lock);
+ h->nr_huge_pages++;
+ h->nr_huge_pages_node[nid]++;
+ spin_unlock(&hugetlb_lock);
+ put_page(page); /* free it into the hugepage allocator */
+}
+
static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
struct page *page;
@@ -212,14 +223,8 @@ static struct page *alloc_fresh_huge_pag
page = alloc_pages_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
huge_page_order(h));
- if (page) {
- set_compound_page_dtor(page, free_huge_page);
- spin_lock(&hugetlb_lock);
- h->nr_huge_pages++;
- h->nr_huge_pages_node[nid]++;
- spin_unlock(&hugetlb_lock);
- put_page(page); /* free it into the hugepage allocator */
- }
+ if (page)
+ huge_new_page(h, page);
return page;
}
--
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 10/17] mm: fix bootmem alignment
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (8 preceding siblings ...)
2008-04-10 17:02 ` [patch 09/17] hugetlb: factor out huge_new_page npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:33 ` Yinghai Lu
2008-04-10 17:02 ` [patch 11/17] hugetlbfs: support larger than MAX_ORDER npiggin
` (7 subsequent siblings)
17 siblings, 1 reply; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, Yinghai Lu, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: bootmem-fix-alignment.patch --]
[-- Type: text/plain, Size: 3220 bytes --]
Without this fix bootmem can return unaligned addresses when the start of a
node is not aligned to the align value. Needed for reliably allocating
gigabyte pages.
I removed the offset variable because all tests should align themselves
correctly now. A slight drawback might be that the bootmem allocator will
spend some more time skipping bits in the bitmap initially, but that
shouldn't be a big issue.
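(Worked example with invented numbers: with a node starting at 100MB and
a 1GB alignment request, the old scan aligned the bitmap index relative to
the node start, so an "aligned" index could map to an address 100MB past a
gigabyte boundary; aligning pfn + index, as done here, makes the absolute
address the thing that ends up aligned.)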
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
mm/bootmem.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -206,8 +206,9 @@ void * __init
__alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
unsigned long align, unsigned long goal, unsigned long limit)
{
- unsigned long offset, remaining_size, areasize, preferred;
- unsigned long i, start = 0, incr, eidx, end_pfn;
+ unsigned long remaining_size, areasize, preferred;
+ unsigned long i, start, incr, eidx, end_pfn;
+ unsigned long pfn;
void *ret;
if (!size) {
@@ -229,10 +230,6 @@ __alloc_bootmem_core(struct bootmem_data
end_pfn = limit;
eidx = end_pfn - PFN_DOWN(bdata->node_boot_start);
- offset = 0;
- if (align && (bdata->node_boot_start & (align - 1UL)) != 0)
- offset = align - (bdata->node_boot_start & (align - 1UL));
- offset = PFN_DOWN(offset);
/*
* We try to allocate bootmem pages above 'goal'
@@ -247,15 +244,18 @@ __alloc_bootmem_core(struct bootmem_data
} else
preferred = 0;
- preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
+ start = bdata->node_boot_start;
+ preferred = PFN_DOWN(ALIGN(preferred + start, align) - start);
areasize = (size + PAGE_SIZE-1) / PAGE_SIZE;
incr = align >> PAGE_SHIFT ? : 1;
+ pfn = PFN_DOWN(start);
+ start = 0;
restart_scan:
for (i = preferred; i < eidx; i += incr) {
unsigned long j;
i = find_next_zero_bit(bdata->node_bootmem_map, eidx, i);
- i = ALIGN(i, incr);
+ i = ALIGN(pfn + i, incr) - pfn;
if (i >= eidx)
break;
if (test_bit(i, bdata->node_bootmem_map))
@@ -269,11 +269,11 @@ restart_scan:
start = i;
goto found;
fail_block:
- i = ALIGN(j, incr);
+ i = ALIGN(j + pfn, incr) - pfn;
}
- if (preferred > offset) {
- preferred = offset;
+ if (preferred > 0) {
+ preferred = 0;
goto restart_scan;
}
return NULL;
@@ -289,7 +289,7 @@ found:
*/
if (align < PAGE_SIZE &&
bdata->last_offset && bdata->last_pos+1 == start) {
- offset = ALIGN(bdata->last_offset, align);
+ unsigned long offset = ALIGN(bdata->last_offset, align);
BUG_ON(offset > PAGE_SIZE);
remaining_size = PAGE_SIZE - offset;
if (size < remaining_size) {
--
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 11/17] hugetlbfs: support larger than MAX_ORDER
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (9 preceding siblings ...)
2008-04-10 17:02 ` [patch 10/17] mm: fix bootmem alignment npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-11 8:13 ` Andi Kleen
2008-04-10 17:02 ` [patch 12/17] hugetlb: support boot allocate different sizes npiggin
` (6 subsequent siblings)
17 siblings, 1 reply; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: hugetlb-unlimited-order.patch --]
[-- Type: text/plain, Size: 4944 bytes --]
This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
not practical to enlarge MAX_ORDER to 1GB.
Instead the 1GB pages are only allocated at boot with the bootmem
allocator, using the hugepages=... option.
These 1GB bootmem pages are never freed. In theory it would be possible
to implement that with some complications, but since it would be a one-way
street (> MAX_ORDER pages cannot be allocated later) I decided not to for
now.
The > MAX_ORDER code is not ifdef'ed per architecture. It is not very big
and the ifdef ugliness did not seem worth it.
Known problems: /proc/meminfo and "free" do not display the memory
allocated for GB pages in "Total". This is a little confusing for the
user.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
mm/hugetlb.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 62 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/bootmem.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -158,7 +159,7 @@ static void free_huge_page(struct page *
INIT_LIST_HEAD(&page->lru);
spin_lock(&hugetlb_lock);
- if (h->surplus_huge_pages_node[nid]) {
+ if (h->surplus_huge_pages_node[nid] && h->order <= MAX_ORDER) {
update_and_free_page(h, page);
h->surplus_huge_pages--;
h->surplus_huge_pages_node[nid]--;
@@ -220,6 +221,9 @@ static struct page *alloc_fresh_huge_pag
{
struct page *page;
+ if (h->order > MAX_ORDER)
+ return NULL;
+
page = alloc_pages_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
huge_page_order(h));
@@ -276,6 +280,9 @@ static struct page *alloc_buddy_huge_pag
struct page *page;
unsigned int nid;
+ if (h->order > MAX_ORDER)
+ return NULL;
+
/*
* Assume we will successfully allocate the surplus page to
* prevent racing processes from causing the surplus to exceed
@@ -442,6 +449,10 @@ static void return_unused_surplus_pages(
/* Uncommit the reservation */
h->resv_huge_pages -= unused_resv_pages;
+ /* Cannot return gigantic pages currently */
+ if (h->order > MAX_ORDER)
+ return;
+
nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
while (remaining_iterations-- && nr_pages) {
@@ -520,6 +531,44 @@ static struct page *alloc_huge_page(stru
return page;
}
+static __initdata LIST_HEAD(huge_boot_pages);
+
+struct huge_bm_page {
+ struct list_head list;
+ struct hstate *hstate;
+};
+
+static int __init alloc_bm_huge_page(struct hstate *h)
+{
+ struct huge_bm_page *m;
+ m = __alloc_bootmem_node_nopanic(NODE_DATA(h->hugetlb_next_nid),
+ huge_page_size(h), huge_page_size(h),
+ 0);
+ if (!m)
+ return 0;
+ BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
+ /* Put them into a private list first because mem_map is not up yet */
+ list_add(&m->list, &huge_boot_pages);
+ m->hstate = h;
+ hstate_next_node(h);
+ return 1;
+}
+
+/* Put bootmem huge pages into the standard lists after mem_map is up */
+static int __init huge_init_bm(void)
+{
+ struct huge_bm_page *m;
+ list_for_each_entry (m, &huge_boot_pages, list) {
+ struct page *page = virt_to_page(m);
+ struct hstate *h = m->hstate;
+ __ClearPageReserved(page);
+ prep_compound_page(page, h->order);
+ huge_new_page(h, page);
+ }
+ return 0;
+}
+__initcall(huge_init_bm);
+
static int __init hugetlb_init_hstate(struct hstate *h)
{
unsigned long i;
@@ -535,7 +584,10 @@ static int __init hugetlb_init_hstate(st
h->hugetlb_next_nid = first_node(node_online_map);
for (i = 0; i < max_huge_pages[h - hstates]; ++i) {
- if (!alloc_fresh_huge_page(h))
+ if (h->order > MAX_ORDER) {
+ if (!alloc_bm_huge_page(h))
+ break;
+ } else if (!alloc_fresh_huge_page(h))
break;
}
max_huge_pages[h - hstates] = h->free_huge_pages = h->nr_huge_pages = i;
@@ -594,6 +646,9 @@ static void try_to_free_low(struct hstat
{
int i;
+ if (h->order > MAX_ORDER)
+ return;
+
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
struct list_head *freel = &h->hugepage_freelists[i];
@@ -623,6 +678,11 @@ set_max_huge_pages(struct hstate *h, uns
*err = 0;
+ if (h->order > MAX_ORDER) {
+ *err = -EINVAL;
+ return max_huge_pages[h - hstates];
+ }
+
/*
* Increase the pool size
* First take pages out of surplus state. Then make up the
--
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 12/17] hugetlb: support boot allocate different sizes
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (10 preceding siblings ...)
2008-04-10 17:02 ` [patch 11/17] hugetlbfs: support larger than MAX_ORDER npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:02 ` [patch 13/17] hugetlb: printk cleanup npiggin
` (5 subsequent siblings)
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: hugetlb-different-page-sizes.patch --]
[-- Type: text/plain, Size: 2780 bytes --]
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
include/linux/hugetlb.h | 1 +
mm/hugetlb.c | 23 ++++++++++++++++++-----
2 files changed, 19 insertions(+), 5 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -578,19 +578,23 @@ static int __init hugetlb_init_hstate(st
h->mask = HPAGE_MASK;
}
- for (i = 0; i < MAX_NUMNODES; ++i)
- INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+ /* Don't reinitialize lists if they have been already init'ed */
+ if (!h->hugepage_freelists[0].next) {
+ for (i = 0; i < MAX_NUMNODES; ++i)
+ INIT_LIST_HEAD(&h->hugepage_freelists[i]);
- h->hugetlb_next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = first_node(node_online_map);
+ }
- for (i = 0; i < max_huge_pages[h - hstates]; ++i) {
+ while (h->parsed_hugepages < max_huge_pages[h - hstates]) {
if (h->order > MAX_ORDER) {
if (!alloc_bm_huge_page(h))
break;
} else if (!alloc_fresh_huge_page(h))
break;
+ h->parsed_hugepages++;
}
- max_huge_pages[h - hstates] = h->free_huge_pages = h->nr_huge_pages = i;
+ max_huge_pages[h - hstates] = h->parsed_hugepages;
printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
h->free_huge_pages,
@@ -625,6 +629,15 @@ static int __init hugetlb_setup(char *s)
unsigned long *mhp = &max_huge_pages[parsed_hstate - hstates];
if (sscanf(s, "%lu", mhp) <= 0)
*mhp = 0;
+ /*
+ * Global state is always initialized later in hugetlb_init.
+ * But we need to allocate > MAX_ORDER hstates here early to still
+ * use the bootmem allocator.
+ * If you add additional hstates <= MAX_ORDER you'll need
+ * to fix that.
+ */
+ if (parsed_hstate != &global_hstate)
+ hugetlb_init_hstate(parsed_hstate);
return 1;
}
__setup("hugepages=", hugetlb_setup);
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -210,6 +210,7 @@ struct hstate {
unsigned int nr_huge_pages_node[MAX_NUMNODES];
unsigned int free_huge_pages_node[MAX_NUMNODES];
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+ unsigned long parsed_hugepages;
};
void __init huge_add_hstate(unsigned order);
--
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 13/17] hugetlb: printk cleanup
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (11 preceding siblings ...)
2008-04-10 17:02 ` [patch 12/17] hugetlb: support boot allocate different sizes npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:02 ` [patch 14/17] hugetlb: introduce huge_pud npiggin
` (4 subsequent siblings)
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: hugetlb-printk-cleanup.patch --]
[-- Type: text/plain, Size: 2837 bytes --]
- Reword the allocation printk to clarify its meaning with multiple options
- Add support for using GB prefixes for the page size
- Add extra printk to delayed > MAX_ORDER allocation code
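(For example, with the memfmt() helper a 1GB hstate is reported as "1 GB"
in the boot message instead of "1024 MB".)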
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
mm/hugetlb.c | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -531,6 +531,15 @@ static struct page *alloc_huge_page(stru
return page;
}
+static __init char *memfmt(char *buf, unsigned long n)
+{
+ if (n >= (1UL << 30))
+ sprintf(buf, "%lu GB", n >> 30);
+ else
+ sprintf(buf, "%lu MB", n >> 20);
+ return buf;
+}
+
static __initdata LIST_HEAD(huge_boot_pages);
struct huge_bm_page {
@@ -557,14 +566,28 @@ static int __init alloc_bm_huge_page(str
/* Put bootmem huge pages into the standard lists after mem_map is up */
static int __init huge_init_bm(void)
{
+ unsigned long pages = 0;
struct huge_bm_page *m;
+ struct hstate *h = NULL;
+ char buf[32];
+
list_for_each_entry (m, &huge_boot_pages, list) {
struct page *page = virt_to_page(m);
- struct hstate *h = m->hstate;
+ h = m->hstate;
__ClearPageReserved(page);
prep_compound_page(page, h->order);
huge_new_page(h, page);
+ pages++;
}
+
+ /*
+ * This only prints for a single hstate. This works for x86-64,
+ * but if you do multiple > MAX_ORDER hstates you'll need to fix it.
+ */
+ if (pages > 0)
+ printk(KERN_INFO "HugeTLB pre-allocated %ld %s pages\n",
+ h->free_huge_pages,
+ memfmt(buf, huge_page_size(h)));
return 0;
}
__initcall(huge_init_bm);
@@ -572,6 +595,8 @@ __initcall(huge_init_bm);
static int __init hugetlb_init_hstate(struct hstate *h)
{
unsigned long i;
+ char buf[32];
+ unsigned long pages = 0;
if (h == &global_hstate && !h->order) {
h->order = HPAGE_SHIFT - PAGE_SHIFT;
@@ -593,12 +618,14 @@ static int __init hugetlb_init_hstate(st
} else if (!alloc_fresh_huge_page(h))
break;
h->parsed_hugepages++;
+ pages++;
}
max_huge_pages[h - hstates] = h->parsed_hugepages;
- printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
+ if (pages > 0)
+ printk(KERN_INFO "HugeTLB pre-allocated %ld %s pages\n",
h->free_huge_pages,
- 1 << (h->order + PAGE_SHIFT - 20));
+ memfmt(buf, huge_page_size(h)));
return 0;
}
--
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 14/17] hugetlb: introduce huge_pud
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (12 preceding siblings ...)
2008-04-10 17:02 ` [patch 13/17] hugetlb: printk cleanup npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:02 ` [patch 15/17] x86: support GB hugepages on 64-bit npiggin
` (3 subsequent siblings)
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: hugetlbfs-huge_pud.patch --]
[-- Type: text/plain, Size: 6397 bytes --]
Straightforward extensions for huge pages mapped at the PUD level instead
of the PMD level.
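On x86-64 a PUD entry covers 1GB (512 PMD entries of 2MB each), so a
gigantic page occupies a single PUD slot just as a 2MB page occupies a
single PMD slot; pud_huge() and follow_huge_pud() let generic code
recognise and follow such mappings.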
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
arch/ia64/mm/hugetlbpage.c | 6 ++++++
arch/powerpc/mm/hugetlbpage.c | 5 +++++
arch/sh/mm/hugetlbpage.c | 5 +++++
arch/sparc64/mm/hugetlbpage.c | 5 +++++
arch/x86/mm/hugetlbpage.c | 25 ++++++++++++++++++++++++-
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 9 +++++++++
mm/memory.c | 10 +++++++++-
8 files changed, 68 insertions(+), 2 deletions(-)
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -45,7 +45,10 @@ struct page *follow_huge_addr(struct mm_
int write);
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write);
+struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write);
int pmd_huge(pmd_t pmd);
+int pud_huge(pud_t pmd);
void hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);
@@ -112,8 +115,10 @@ static inline unsigned long hugetlb_tota
#define hugetlb_report_meminfo(buf) 0
#define hugetlb_report_node_meminfo(n, buf) 0
#define follow_huge_pmd(mm, addr, pmd, write) NULL
+#define follow_huge_pud(mm, addr, pud, write) NULL
#define prepare_hugepage_range(addr,len) (-EINVAL)
#define pmd_huge(x) 0
+#define pud_huge(x) 0
#define is_hugepage_only_range(mm, addr, len) 0
#define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
#define hugetlb_fault(mm, vma, addr, write) ({ BUG(); 0; })
Index: linux-2.6/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/hugetlbpage.c
+++ linux-2.6/arch/ia64/mm/hugetlbpage.c
@@ -106,6 +106,12 @@ int pmd_huge(pmd_t pmd)
{
return 0;
}
+
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
{
Index: linux-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/hugetlbpage.c
+++ linux-2.6/arch/powerpc/mm/hugetlbpage.c
@@ -368,6 +368,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
Index: linux-2.6/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/hugetlbpage.c
+++ linux-2.6/arch/sh/mm/hugetlbpage.c
@@ -78,6 +78,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
{
Index: linux-2.6/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/hugetlbpage.c
+++ linux-2.6/arch/sparc64/mm/hugetlbpage.c
@@ -294,6 +294,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
{
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -188,6 +188,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
@@ -208,6 +213,11 @@ int pmd_huge(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PSE);
}
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
@@ -216,9 +226,22 @@ follow_huge_pmd(struct mm_struct *mm, un
page = pte_page(*(pte_t *)pmd);
if (page)
- page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
+ page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
return page;
}
+
+struct page *
+follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write)
+{
+ struct page *page;
+
+ page = pte_page(*(pte_t *)pud);
+ if (page)
+ page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
+ return page;
+}
+
#endif
/* x86_64 also uses this file */
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -1204,6 +1204,15 @@ int hugetlb_fault(struct mm_struct *mm,
return ret;
}
+/* Can be overriden by architectures */
+__attribute__((weak)) struct page *
+follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write)
+{
+ BUG();
+ return NULL;
+}
+
int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
unsigned long *position, int *length, int i,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -945,7 +945,13 @@ struct page *follow_page(struct vm_area_
pud = pud_offset(pgd, address);
if (pud_none(*pud) || unlikely(pud_bad(*pud)))
goto no_page_table;
-
+
+ if (pud_huge(*pud)) {
+ BUG_ON(flags & FOLL_GET);
+ page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
+ goto out;
+ }
+
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
goto no_page_table;
@@ -1436,6 +1442,8 @@ static int apply_to_pmd_range(struct mm_
unsigned long next;
int err;
+ BUG_ON(pud_huge(*pud));
+
pmd = pmd_alloc(mm, pud, addr);
if (!pmd)
return -ENOMEM;
--
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 15/17] x86: support GB hugepages on 64-bit
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (13 preceding siblings ...)
2008-04-10 17:02 ` [patch 14/17] hugetlb: introduce huge_pud npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:02 ` [patch 16/17] x86: add hugepagesz option " npiggin
` (2 subsequent siblings)
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: x86-support-GB-hugetlb-pages.patch --]
[-- Type: text/plain, Size: 1506 bytes --]
---
arch/x86/mm/hugetlbpage.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -133,9 +133,14 @@ pte_t *huge_pte_alloc(struct mm_struct *
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (pud) {
- if (pud_none(*pud))
- huge_pmd_share(mm, addr, pud);
- pte = (pte_t *) pmd_alloc(mm, pud, addr);
+ if (sz == PUD_SIZE) {
+ pte = (pte_t *)pud;
+ } else {
+ BUG_ON(sz != PMD_SIZE);
+ if (pud_none(*pud))
+ huge_pmd_share(mm, addr, pud);
+ pte = (pte_t *) pmd_alloc(mm, pud, addr);
+ }
}
BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));
@@ -151,8 +156,11 @@ pte_t *huge_pte_offset(struct mm_struct
pgd = pgd_offset(mm, addr);
if (pgd_present(*pgd)) {
pud = pud_offset(pgd, addr);
- if (pud_present(*pud))
+ if (pud_present(*pud)) {
+ if (pud_large(*pud))
+ return (pte_t *)pud;
pmd = pmd_offset(pud, addr);
+ }
}
return (pte_t *) pmd;
}
@@ -215,7 +223,7 @@ int pmd_huge(pmd_t pmd)
int pud_huge(pud_t pud)
{
- return 0;
+ return !!(pud_val(pud) & _PAGE_PSE);
}
struct page *
--
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 16/17] x86: add hugepagesz option on 64-bit
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (14 preceding siblings ...)
2008-04-10 17:02 ` [patch 15/17] x86: support GB hugepages on 64-bit npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 17:02 ` [patch 17/17] hugetlb: misc fixes npiggin
2008-04-10 23:59 ` [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 Nish Aravamudan
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: x86-64-implement-hugepagesz.patch --]
[-- Type: text/plain, Size: 3180 bytes --]
Add a hugepagesz=... option to x86-64, similar to IA64, PPC etc.
This finally allows selecting GB pages for hugetlbfs on x86 now that all
the infrastructure is in place.
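A usage sketch (the numbers are made up; pagesize= is the per-mount
option added earlier in this series): booting with
	hugepagesz=1G hugepages=2 hugepagesz=2M hugepages=512
reserves pools of both sizes, and a mount like
	mount -t hugetlbfs none /mnt/huge -o pagesize=1G
then backs files on that mount with the 1GB pool.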
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
Documentation/kernel-parameters.txt | 11 +++++++++--
arch/x86/mm/hugetlbpage.c | 17 +++++++++++++++++
include/asm-x86/page.h | 2 ++
3 files changed, 28 insertions(+), 2 deletions(-)
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -421,3 +421,20 @@ hugetlb_get_unmapped_area(struct file *f
#endif /*HAVE_ARCH_HUGETLB_UNMAPPED_AREA*/
+#ifdef CONFIG_X86_64
+static __init int setup_hugepagesz(char *opt)
+{
+ unsigned long ps = memparse(opt, &opt);
+ if (ps == PMD_SIZE) {
+ huge_add_hstate(PMD_SHIFT - PAGE_SHIFT);
+ } else if (ps == PUD_SIZE && cpu_has_gbpages) {
+ huge_add_hstate(PUD_SHIFT - PAGE_SHIFT);
+ } else {
+ printk(KERN_ERR "hugepagesz: Unsupported page size %lu M\n",
+ ps >> 20);
+ return 0;
+ }
+ return 1;
+}
+__setup("hugepagesz=", setup_hugepagesz);
+#endif
Index: linux-2.6/include/asm-x86/page.h
===================================================================
--- linux-2.6.orig/include/asm-x86/page.h
+++ linux-2.6/include/asm-x86/page.h
@@ -21,6 +21,8 @@
#define HPAGE_MASK (~(HPAGE_SIZE - 1))
#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
+#define HUGE_MAX_HSTATE 2
+
/* to align the pointer to the (next) page boundary */
#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt
+++ linux-2.6/Documentation/kernel-parameters.txt
@@ -722,8 +722,15 @@ and is between 256 and 4096 characters.
hisax= [HW,ISDN]
See Documentation/isdn/README.HiSax.
- hugepages= [HW,X86-32,IA-64] Maximal number of HugeTLB pages.
- hugepagesz= [HW,IA-64,PPC] The size of the HugeTLB pages.
+ hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
+ hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
+ On x86 this option can be specified multiple times
+ interleaved with hugepages= to reserve huge pages
+ of different sizes. Valid pages sizes on x86-64
+ are 2M (when the CPU supports "pse") and 1G (when the
+ CPU supports the "pdpe1gb" cpuinfo flag)
+ Note that 1GB pages can only be allocated at boot time
+ using hugepages= and not freed afterwards.
i8042.direct [HW] Put keyboard port into non-translated mode
i8042.dumbkbd [HW] Pretend that controller can only read data from
--
^ permalink raw reply [flat|nested] 28+ messages in thread
* [patch 17/17] hugetlb: misc fixes
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (15 preceding siblings ...)
2008-04-10 17:02 ` [patch 16/17] x86: add hugepagesz option " npiggin
@ 2008-04-10 17:02 ` npiggin
2008-04-10 23:59 ` [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 Nish Aravamudan
17 siblings, 0 replies; 28+ messages in thread
From: npiggin @ 2008-04-10 17:02 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-mm, pj, andi, kniht
[-- Attachment #1: hugetlb-fixes.patch --]
[-- Type: text/plain, Size: 12665 bytes --]
These are various fixes I noticed while reviewing and testing the
hugetlbfs patchset. Nothing fundamental, but I feel they tidy things up
a bit. Where possible I will merge each of these changes into the
appropriate patch, or otherwise split them up.
- remove global_hstate, make the default hstate handling slightly more regular
- fix some hangs and bugs when multiple hugepage command line options are given
- have alloc_bm_huge_page fall back to other nodes rather than giving up
after the first failure
- sanitise the printk hugepage reporting
- remove one of the initcalls and instead just call it from the main initcall.
- make it slightly more robust at handling bad command line input (eg duplicate
parameters).
- align hugepage mmaps in x86 code
- sysctl shouldn't always return -EINVAL if the > MAX_ORDER value is unchanged.
This fix involved putting a max_huge_pages value in the hstate, as well as
retaining the sysctl table. I think this makes most of the code look nicer
though.
- I've only been testing on a limited system (one 1GB page available), but
the tlp test Andi previously reported failing appears to work OK. Not sure
if that is due to these changes, because I only started testing while
writing this patch.
Signed-off-by: Nick Piggin <npiggin@suse.de>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: pj@sgi.com
Cc: andi@firstfloor.org
Cc: kniht@linux.vnet.ibm.com
---
arch/x86/mm/hugetlbpage.c | 15 ++--
fs/hugetlbfs/inode.c | 4 -
include/linux/hugetlb.h | 7 +-
mm/hugetlb.c | 145 ++++++++++++++++++++++++++++------------------
4 files changed, 105 insertions(+), 66 deletions(-)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c
+++ linux-2.6/mm/hugetlb.c
@@ -28,12 +28,13 @@ unsigned long sysctl_overcommit_huge_pag
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-static int max_hstate = 1;
+static int max_hstate = 0;
+static unsigned long default_hstate_resv = 0;
struct hstate hstates[HUGE_MAX_HSTATE];
/* for command line parsing */
-struct hstate *parsed_hstate __initdata = &global_hstate;
+struct hstate *parsed_hstate __initdata = NULL;
#define for_each_hstate(h) \
for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
@@ -550,58 +551,48 @@ struct huge_bm_page {
static int __init alloc_bm_huge_page(struct hstate *h)
{
struct huge_bm_page *m;
- m = __alloc_bootmem_node_nopanic(NODE_DATA(h->hugetlb_next_nid),
+ int nr_nodes = nodes_weight(node_online_map);
+
+ while (nr_nodes) {
+ m = __alloc_bootmem_node_nopanic(NODE_DATA(h->hugetlb_next_nid),
huge_page_size(h), huge_page_size(h),
0);
- if (!m)
- return 0;
+ if (m)
+ goto found;
+ hstate_next_node(h);
+ nr_nodes--;
+ }
+ return 0;
+
+found:
BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
/* Put them into a private list first because mem_map is not up yet */
list_add(&m->list, &huge_boot_pages);
m->hstate = h;
- hstate_next_node(h);
return 1;
}
/* Put bootmem huge pages into the standard lists after mem_map is up */
-static int __init huge_init_bm(void)
+static void gather_bootmem_prealloc(void)
{
unsigned long pages = 0;
struct huge_bm_page *m;
struct hstate *h = NULL;
- char buf[32];
list_for_each_entry (m, &huge_boot_pages, list) {
struct page *page = virt_to_page(m);
h = m->hstate;
__ClearPageReserved(page);
+ WARN_ON(page_count(page) != 1);
prep_compound_page(page, h->order);
huge_new_page(h, page);
pages++;
}
-
- /*
- * This only prints for a single hstate. This works for x86-64,
- * but if you do multiple > MAX_ORDER hstates you'll need to fix it.
- */
- if (pages > 0)
- printk(KERN_INFO "HugeTLB pre-allocated %ld %s pages\n",
- h->free_huge_pages,
- memfmt(buf, huge_page_size(h)));
- return 0;
}
-__initcall(huge_init_bm);
-static int __init hugetlb_init_hstate(struct hstate *h)
+static void __init hugetlb_init_hstate(struct hstate *h)
{
unsigned long i;
- char buf[32];
- unsigned long pages = 0;
-
- if (h == &global_hstate && !h->order) {
- h->order = HPAGE_SHIFT - PAGE_SHIFT;
- h->mask = HPAGE_MASK;
- }
/* Don't reinitialize lists if they have been already init'ed */
if (!h->hugepage_freelists[0].next) {
@@ -611,29 +602,57 @@ static int __init hugetlb_init_hstate(st
h->hugetlb_next_nid = first_node(node_online_map);
}
- while (h->parsed_hugepages < max_huge_pages[h - hstates]) {
+ while (h->parsed_hugepages < h->max_huge_pages) {
if (h->order > MAX_ORDER) {
if (!alloc_bm_huge_page(h))
break;
} else if (!alloc_fresh_huge_page(h))
break;
h->parsed_hugepages++;
- pages++;
}
- max_huge_pages[h - hstates] = h->parsed_hugepages;
+ h->max_huge_pages = h->parsed_hugepages;
+}
- if (pages > 0)
- printk(KERN_INFO "HugeTLB pre-allocated %ld %s pages\n",
- h->free_huge_pages,
- memfmt(buf, huge_page_size(h)));
- return 0;
+static void __init hugetlb_init_hstates(void)
+{
+ struct hstate *h;
+
+ for_each_hstate (h) {
+ /* oversize hugepages were init'ed in early boot */
+ if (h->order <= MAX_ORDER)
+ hugetlb_init_hstate(h);
+ max_huge_pages[h - hstates] = h->max_huge_pages;
+ }
+}
+
+static void __init report_hugepages(void)
+{
+ struct hstate *h;
+
+ for_each_hstate (h) {
+ char buf[32];
+ printk(KERN_INFO "HugeTLB registered size %s, pre-allocated %ld pages\n",
+ memfmt(buf, huge_page_size(h)),
+ h->free_huge_pages);
+ }
}
static int __init hugetlb_init(void)
{
- if (HPAGE_SHIFT == 0)
- return 0;
- return hugetlb_init_hstate(&global_hstate);
+ BUILD_BUG_ON(HPAGE_SHIFT == 0);
+
+ if (!size_to_hstate(HPAGE_SIZE)) {
+ huge_add_hstate(HUGETLB_PAGE_ORDER);
+ parsed_hstate->max_huge_pages = default_hstate_resv;
+ }
+
+ hugetlb_init_hstates();
+
+ gather_bootmem_prealloc();
+
+ report_hugepages();
+
+ return 0;
}
module_init(hugetlb_init);
@@ -641,9 +660,14 @@ module_init(hugetlb_init);
void __init huge_add_hstate(unsigned order)
{
struct hstate *h;
- BUG_ON(size_to_hstate(PAGE_SIZE << order));
+
+ if (size_to_hstate(PAGE_SIZE << order)) {
+ printk("hugepagesz= specified twice, ignoring\n");
+ return;
+ }
+
BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
- BUG_ON(order <= HPAGE_SHIFT - PAGE_SHIFT);
+ BUG_ON(order < HPAGE_SHIFT - PAGE_SHIFT);
h = &hstates[max_hstate++];
h->order = order;
h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
@@ -653,17 +677,22 @@ void __init huge_add_hstate(unsigned ord
static int __init hugetlb_setup(char *s)
{
- unsigned long *mhp = &max_huge_pages[parsed_hstate - hstates];
+ unsigned long *mhp;
+
+ if (!max_hstate)
+ mhp = &default_hstate_resv;
+ else
+ mhp = &parsed_hstate->max_huge_pages;
+
if (sscanf(s, "%lu", mhp) <= 0)
*mhp = 0;
+
/*
* Global state is always initialized later in hugetlb_init.
* But we need to allocate > MAX_ORDER hstates here early to still
* use the bootmem allocator.
- * If you add additional hstates <= MAX_ORDER you'll need
- * to fix that.
*/
- if (parsed_hstate != &global_hstate)
+ if (max_hstate > 0 && parsed_hstate->order > MAX_ORDER)
hugetlb_init_hstate(parsed_hstate);
return 1;
}
@@ -719,8 +748,9 @@ set_max_huge_pages(struct hstate *h, uns
*err = 0;
if (h->order > MAX_ORDER) {
- *err = -EINVAL;
- return max_huge_pages[h - hstates];
+ if (count != h->max_huge_pages)
+ *err = -EINVAL;
+ return h->max_huge_pages;
}
/*
@@ -795,19 +825,24 @@ int hugetlb_sysctl_handler(struct ctl_ta
{
int err = 0;
struct hstate *h;
- int i;
+
err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
if (err)
return err;
- i = 0;
- for_each_hstate (h) {
- max_huge_pages[i] = set_max_huge_pages(h, max_huge_pages[i],
- &err);
- if (err)
- return err;
- i++;
+
+ if (write) {
+ for_each_hstate (h) {
+ int tmp;
+
+ h->max_huge_pages = set_max_huge_pages(h,
+ max_huge_pages[h - hstates], &tmp);
+ max_huge_pages[h - hstates] = h->max_huge_pages;
+ if (tmp && !err)
+ err = tmp;
+ }
}
- return 0;
+
+ return err;
}
int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -827,7 +827,7 @@ hugetlbfs_parse_options(char *options, s
struct hstate *h = pconfig->hstate;
if (setsize == SIZE_PERCENT) {
size <<= huge_page_shift(h);
- size *= max_huge_pages[h - hstates];
+ size *= h->max_huge_pages;
do_div(size, 100);
}
pconfig->nr_blocks = (size >> huge_page_shift(h));
@@ -857,7 +857,7 @@ hugetlbfs_fill_super(struct super_block
config.uid = current->fsuid;
config.gid = current->fsgid;
config.mode = 0755;
- config.hstate = &global_hstate;
+ config.hstate = size_to_hstate(HPAGE_SIZE);
ret = hugetlbfs_parse_options(data, &config);
if (ret)
return ret;
Index: linux-2.6/include/linux/hugetlb.h
===================================================================
--- linux-2.6.orig/include/linux/hugetlb.h
+++ linux-2.6/include/linux/hugetlb.h
@@ -208,7 +208,10 @@ struct hstate {
int hugetlb_next_nid;
unsigned int order;
unsigned long mask;
- unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
+ unsigned long max_huge_pages;
+ unsigned long nr_huge_pages;
+ unsigned long free_huge_pages;
+ unsigned long resv_huge_pages;
unsigned long surplus_huge_pages;
unsigned long nr_overcommit_huge_pages;
struct list_head hugepage_freelists[MAX_NUMNODES];
@@ -227,8 +230,6 @@ struct hstate *size_to_hstate(unsigned l
extern struct hstate hstates[HUGE_MAX_HSTATE];
-#define global_hstate (hstates[0])
-
static inline struct hstate *hstate_inode(struct inode *i)
{
struct hugetlbfs_sb_info *hsb;
Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6/arch/x86/mm/hugetlbpage.c
@@ -259,6 +259,7 @@ static unsigned long hugetlb_get_unmappe
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
+ struct hstate *h = hstate_file(file);
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long start_addr;
@@ -271,7 +272,7 @@ static unsigned long hugetlb_get_unmappe
}
full_search:
- addr = ALIGN(start_addr, HPAGE_SIZE);
+ addr = ALIGN(start_addr, huge_page_size(h));
for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
/* At this point: (!vma || addr < vma->vm_end). */
@@ -293,7 +294,7 @@ full_search:
}
if (addr + mm->cached_hole_size < vma->vm_start)
mm->cached_hole_size = vma->vm_start - addr;
- addr = ALIGN(vma->vm_end, HPAGE_SIZE);
+ addr = ALIGN(vma->vm_end, huge_page_size(h));
}
}
@@ -301,6 +302,7 @@ static unsigned long hugetlb_get_unmappe
unsigned long addr0, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
+ struct hstate *h = hstate_file(file);
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev_vma;
unsigned long base = mm->mmap_base, addr = addr0;
@@ -321,7 +323,7 @@ try_again:
goto fail;
/* either no address requested or cant fit in requested address hole */
- addr = (mm->free_area_cache - len) & HPAGE_MASK;
+ addr = (mm->free_area_cache - len) & huge_page_mask(h);
do {
/*
* Lookup failure means no vma is above this address,
@@ -352,7 +354,7 @@ try_again:
largest_hole = vma->vm_start - addr;
/* try just below the current vma->vm_start */
- addr = (vma->vm_start - len) & HPAGE_MASK;
+ addr = (vma->vm_start - len) & huge_page_mask(h);
} while (len <= vma->vm_start);
fail:
@@ -390,10 +392,11 @@ unsigned long
hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags)
{
+ struct hstate *h = hstate_file(file);
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
- if (len & ~HPAGE_MASK)
+ if (len & ~huge_page_mask(h))
return -EINVAL;
if (len > TASK_SIZE)
return -ENOMEM;
@@ -405,7 +408,7 @@ hugetlb_get_unmapped_area(struct file *f
}
if (addr) {
- addr = ALIGN(addr, HPAGE_SIZE);
+ addr = ALIGN(addr, huge_page_size(h));
vma = find_vma(mm, addr);
if (TASK_SIZE - len >= addr &&
(!vma || addr + len <= vma->vm_start))
--
* Re: [patch 10/17] mm: fix bootmem alignment
2008-04-10 17:02 ` [patch 10/17] mm: fix bootmem alignment npiggin
@ 2008-04-10 17:33 ` Yinghai Lu
2008-04-10 17:39 ` Nick Piggin
2008-04-11 11:58 ` Nick Piggin
0 siblings, 2 replies; 28+ messages in thread
From: Yinghai Lu @ 2008-04-10 17:33 UTC (permalink / raw)
To: npiggin, Andrew Morton, Andi Kleen; +Cc: linux-kernel, linux-mm, pj, kniht
On Thu, Apr 10, 2008 at 10:02 AM, <npiggin@suse.de> wrote:
> Without this fix bootmem can return unaligned addresses when the start of a
> node is not aligned to the align value. Needed for reliably allocating
> gigabyte pages.
>
> I removed the offset variable because all tests should align themself correctly
> now. Slight drawback might be that the bootmem allocator will spend
> some more time skipping bits in the bitmap initially, but that shouldn't
> be a big issue.
>
this patch from Andi was obsoleted by the one in -mm
The patch titled
mm: offset align in alloc_bootmem
has been added to the -mm tree. Its filename is
mm-offset-align-in-alloc_bootmem.patch
------------------------------------------------------
Subject: mm: offset align in alloc_bootmem
From: Yinghai Lu <yhlu.kernel.send@gmail.com>
Need offset alignment when node_boot_start's alignment is less than align
required
Use local node_boot_start to match align, so we don't add an extra operation
in the search loop.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* Re: [patch 10/17] mm: fix bootmem alignment
2008-04-10 17:33 ` Yinghai Lu
@ 2008-04-10 17:39 ` Nick Piggin
2008-04-11 11:58 ` Nick Piggin
1 sibling, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2008-04-10 17:39 UTC (permalink / raw)
To: Yinghai Lu; +Cc: Andrew Morton, Andi Kleen, linux-kernel, linux-mm, pj, kniht
On Thu, Apr 10, 2008 at 10:33:50AM -0700, Yinghai Lu wrote:
> On Thu, Apr 10, 2008 at 10:02 AM, <npiggin@suse.de> wrote:
> > Without this fix bootmem can return unaligned addresses when the start of a
> > node is not aligned to the align value. Needed for reliably allocating
> > gigabyte pages.
> >
> > I removed the offset variable because all tests should align themself correctly
> > now. Slight drawback might be that the bootmem allocator will spend
> > some more time skipping bits in the bitmap initially, but that shouldn't
> > be a big issue.
> >
>
>
> this patch from Andi was obsoleted by the one in -mm
Ah, great, thanks for letting me know.
> The patch titled
> mm: offset align in alloc_bootmem
> has been added to the -mm tree. Its filename is
> mm-offset-align-in-alloc_bootmem.patch
>
> ------------------------------------------------------
> Subject: mm: offset align in alloc_bootmem
> From: Yinghai Lu <yhlu.kernel.send@gmail.com>
>
> Need offset alignment when node_boot_start's alignment is less than align
> required
>
> Use local node_boot_start to match align. so don't add extra opteration in
> search loop.
>
> Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
> Cc: Andi Kleen <ak@suse.de>
> Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Christoph Lameter <clameter@sgi.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* Re: [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
` (16 preceding siblings ...)
2008-04-10 17:02 ` [patch 17/17] hugetlb: misc fixes npiggin
@ 2008-04-10 23:59 ` Nish Aravamudan
2008-04-11 8:28 ` Nick Piggin
17 siblings, 1 reply; 28+ messages in thread
From: Nish Aravamudan @ 2008-04-10 23:59 UTC (permalink / raw)
To: npiggin
Cc: akpm, Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht, Adam Litke
Hi Nick,
On 4/10/08, npiggin@suse.de <npiggin@suse.de> wrote:
> Hi,
>
> I'm taking care of Andi's hugetlb patchset now. I've taken a while to appear
> to do anything with it because I have had other things to do and also needed
> some time to get up to speed on it.
>
> Anyway, from my reviewing of the patchset, I didn't find a great deal
> wrong with it in the technical aspects. Taking hstate out of the hugetlbfs
> inode and vma is really the main thing I did.
Have you tested with the libhugetlbfs test suite? We're gearing up for
libhugetlbfs 1.3, so most of the tests are up to date and expected to run
cleanly, even with giant hugetlb page support (Jon has been working
diligently to test with his 16G page support for power). I'm planning
on pushing the last bits out today for Adam to pick up before we start
stabilizing for 1.3, so if you grab tomorrow's development snapshot
from libhugetlbfs.ozlabs.org, things should run OK. Probably only with
1G hugepages, though; we haven't yet taught libhugetlbfs about multiple
hugepage size availability at run-time, but that shouldn't be hard.
> However on the less technical side, I think a few things could be improved,
> eg. to do with the configuring and reporting, as well as the "administrative"
> type of code. I tried to make improvements to things in the last patch of
> the series. I will end up folding this properly into the rest of the patchset
> where possible.
I've got a few ideas here. Are we sure that
/proc/sys/vm/nr_{,overcommit}_hugepages is the pool allocation
interface we want going forward? I'm fairly sure it isn't. I think
we're best off moving to a sysfs-based allocator scheme, while keeping
/proc/sys/vm/nr_{,overcommit}_hugepages around for the default
hugepage size (which may be the only one for many folks for now).
I'm thinking something like:
/sys/devices/system/[DIRNAME]/nr_hugepages ->
nr_hugepages_{default_hugepagesize}
/sys/devices/system/[DIRNAME]/nr_hugepages_default_hugepagesize
/sys/devices/system/[DIRNAME]/nr_hugepages_other_hugepagesize1
/sys/devices/system/[DIRNAME]/nr_hugepages_other_hugepagesize2
/sys/devices/system/[DIRNAME]/nr_overcommit_hugepages ->
nr_overcommit_hugepages_{default_hugepagesize}
/sys/devices/system/[DIRNAME]/nr_overcommit_hugepages_default_hugepagesize
/sys/devices/system/[DIRNAME]/nr_overcommit_hugepages_other_hugepagesize1
/sys/devices/system/[DIRNAME]/nr_overcommit_hugepages_other_hugepagesize2
That is, nr_hugepages in the directory (should it be called vm?
memory? hugepages specifically? I'm looking for ideas!) will just be a
symlink to the underlying default hugepagesize allocator. The files
themselves would probably be named along the lines of:
nr_hugepages_2M
nr_hugepages_1G
nr_hugepages_64K
etc?
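Just as a sketch of the naming idea (everything here is invented for
illustration, it's not from any patch): the size-suffixed file names
could be generated from the hstate's page size, e.g.

#include <stdio.h>

/* build "nr_hugepages_2M" / "nr_hugepages_1G" style names from a size in KB */
static void hugepage_attr_name(char *buf, size_t len, unsigned long size_kb)
{
	if (size_kb >= 1024 * 1024)
		snprintf(buf, len, "nr_hugepages_%luG", size_kb >> 20);
	else if (size_kb >= 1024)
		snprintf(buf, len, "nr_hugepages_%luM", size_kb >> 10);
	else
		snprintf(buf, len, "nr_hugepages_%luK", size_kb);
}

int main(void)
{
	char name[32];

	hugepage_attr_name(name, sizeof(name), 2048);		/* -> nr_hugepages_2M */
	printf("%s\n", name);
	hugepage_attr_name(name, sizeof(name), 1048576);	/* -> nr_hugepages_1G */
	printf("%s\n", name);
	return 0;
}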
We'd want to have a similar layout on a per-node basis, I think (see
my patchsets to add a per-node interface).
> The other thing I did was try to shuffle the patches around a bit. There
> were one or two (pretty trivial) points where it wasn't bisectable, and also
> merge a couple of patches.
>
> I will try to get this patchset merged in -mm soon if feedback is positive.
> I would also like to take patches for other architectures or any other
> patches or suggestions for improvements.
There are definitely going to be conflicts between my per-node stack
and your set, but if you agree the interface should be cleaned up for
multiple hugepage size support, then I'd like to get my sysfs bits
into -mm and work on putting the global allocator into sysfs properly
for you to base off. I think there's enough room for discussion that
-mm may be a bit premature, but that's just my opinion.
Thanks for keeping the patchset up to date; I hope to do a more careful
review of the individual patches next week.
Thanks,
Nish
* Re: [patch 11/17] hugetlbfs: support larger than MAX_ORDER
2008-04-10 17:02 ` [patch 11/17] hugetlbfs: support larger than MAX_ORDER npiggin
@ 2008-04-11 8:13 ` Andi Kleen
2008-04-11 8:59 ` Nick Piggin
0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2008-04-11 8:13 UTC (permalink / raw)
To: npiggin; +Cc: akpm, Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
> spin_lock(&hugetlb_lock);
> - if (h->surplus_huge_pages_node[nid]) {
> + if (h->surplus_huge_pages_node[nid] && h->order <= MAX_ORDER) {
As Andrew Hastings pointed out earlier, this all needs to be h->order < MAX_ORDER
[I got pretty much all of these checks off by one]. It won't affect anything
on x86-64, but it might cause problems on archs which have exactly
MAX_ORDER-sized huge pages.
> update_and_free_page(h, page);
> h->surplus_huge_pages--;
> h->surplus_huge_pages_node[nid]--;
> @@ -220,6 +221,9 @@ static struct page *alloc_fresh_huge_pag
> {
> struct page *page;
>
> + if (h->order > MAX_ORDER)
>= etc.
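To spell the boundary out (purely illustrative; the helper name below is
made up and not in the patchset): the buddy allocator only covers orders
0..MAX_ORDER-1, so a huge page of exactly MAX_ORDER already has to come
from bootmem rather than alloc_pages().

#include <linux/mmzone.h>
#include <linux/hugetlb.h>

/* true if this hstate's pages can come from the buddy allocator */
static inline int hstate_fits_buddy(struct hstate *h)
{
	return h->order < MAX_ORDER;	/* '<', not '<=' */
}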
-Andi
* Re: [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-10 23:59 ` [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 Nish Aravamudan
@ 2008-04-11 8:28 ` Nick Piggin
2008-04-11 19:57 ` Nish Aravamudan
0 siblings, 1 reply; 28+ messages in thread
From: Nick Piggin @ 2008-04-11 8:28 UTC (permalink / raw)
To: Nish Aravamudan
Cc: akpm, Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht, Adam Litke
On Thu, Apr 10, 2008 at 04:59:15PM -0700, Nish Aravamudan wrote:
> Hi Nick,
>
> On 4/10/08, npiggin@suse.de <npiggin@suse.de> wrote:
> > Hi,
> >
> > I'm taking care of Andi's hugetlb patchset now. I've taken a while to appear
> > to do anything with it because I have had other things to do and also needed
> > some time to get up to speed on it.
> >
> > Anyway, from my reviewing of the patchset, I didn't find a great deal
> > wrong with it in the technical aspects. Taking hstate out of the hugetlbfs
> > inode and vma is really the main thing I did.
>
> Have you tested with the libhugetlbfs test suite? We're gearing up for
> libhugetlbfs 1.3, so most of the test are uptodate and expected to run
> cleanly, even with giant hugetlb page support (Jon has been working
> diligently to test with his 16G page support for power). I'm planning
> on pushing the last bits out today for Adam to pick up before we start
> stabilizing for 1.3, so I'm hoping if you grab tomorrow's development
> snapshot from libhugetlbfs.ozlabs.org, things should run ok. Probably
> only with just 1G hugepages, though, we haven't yet taught
> libhugetlbfs about multiple hugepage size availability at run-time,
> but that shouldn't be hard.
Yeah, it should be easy to disable the 2MB default and just make it
look exactly the same but with 1G pages.
Thanks a lot for your suggestion, I'll pull the snapshot over the
weekend and try to make it pass on x86 and work with Jon to ensure it
is working with powerpc...
> > However on the less technical side, I think a few things could be improved,
> > eg. to do with the configuring and reporting, as well as the "administrative"
> > type of code. I tried to make improvements to things in the last patch of
> > the series. I will end up folding this properly into the rest of the patchset
> > where possible.
>
> I've got a few ideas here. Are we sure that
> /proc/sys/vm/nr_{,overcommit}_hugepages is the pool allocation
> interface we want going forward? I'm fairly sure we don't. I think
> we're best off moving to a sysfs-based allocator scheme, while keeping
> /proc/sys/vm/nr_{,overcommit}_hugepages around for the default
> hugepage size (which may be the only for many folks for now).
>
> I'm thinking something like:
>
> /sys/devices/system/[DIRNAME]/nr_hugepages ->
> nr_hugepages_{default_hugepagesize}
> /sys/devices/system/[DIRNAME]/nr_hugepages_default_hugepagesize
> /sys/devices/system/[DIRNAME]/nr_hugepages_other_hugepagesize1
> /sys/devices/system/[DIRNAME]/nr_hugepages_other_hugepagesize2
> /sys/devices/system/[DIRNAME]/nr_overcommit_hugepages ->
> nr_overcommit_hugepages_{default_hugepagesize}
> /sys/devices/system/[DIRNAME]/nr_overcommit_hugepages_default_hugepagesize
> /sys/devices/system/[DIRNAME]/nr_overcommit_hugepages_other_hugepagesize1
> /sys/devices/system/[DIRNAME]/nr_overcommit_hugepages_other_hugepagesize2
>
> That is, nr_hugepages in the directory (should it be called vm?
> memory? hugepages specifically? I'm looking for ideas!) will just be a
> symlink to the underlying default hugepagesize allocator. The files
> themselves would probably be named along the lines of:
>
> nr_hugepages_2M
> nr_hugepages_1G
> nr_hugepages_64K
>
> etc?
Yes, I don't like the proc interface, nor the way it has been extended
(although that's not Andi's fault; it is just a limitation of the old
API).
I think actually we should have individual directories for each hstate
size, and we can put all other stuff (reservations and per-node stuff
etc) under those directories. Leave the proc stuff just for the default
page size.
I think it should go in /sys/kernel/, because /sys/devices is more the
hardware side of the system (so it makes sense for reporting e.g. the
actual supported TLB sizes, but configuring your page reserves makes
more sense under /sys/kernel/). But we'll ask the sysfs folk for
guidance there.
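Just to make the shape concrete, here is a rough sketch of one per-size
file wired up with plain kobject/sysfs attributes (every name below is
invented for illustration, none of it is from the patchset, and the real
thing would create a directory per hstate in a loop rather than
hard-coding a 2M file):

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

static struct kobject *hugepages_kobj;

static ssize_t nr_hugepages_2M_show(struct kobject *kobj,
		struct kobj_attribute *attr, char *buf)
{
	/* would report the 2M hstate's max_huge_pages */
	return sprintf(buf, "%lu\n", 0UL);
}

static ssize_t nr_hugepages_2M_store(struct kobject *kobj,
		struct kobj_attribute *attr, const char *buf, size_t count)
{
	/* would parse buf and feed it to set_max_huge_pages() */
	return count;
}

static struct kobj_attribute nr_hugepages_2M_attr =
	__ATTR(nr_hugepages_2M, 0644, nr_hugepages_2M_show,
	       nr_hugepages_2M_store);

static int __init hugepages_sysfs_init(void)
{
	/* would show up as /sys/kernel/hugepages/nr_hugepages_2M */
	hugepages_kobj = kobject_create_and_add("hugepages", kernel_kobj);
	if (!hugepages_kobj)
		return -ENOMEM;
	return sysfs_create_file(hugepages_kobj, &nr_hugepages_2M_attr.attr);
}
module_init(hugepages_sysfs_init);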
> We'd want to have a similar layout on a per-node basis, I think (see
> my patchsets to add a per-node interface).
>
> > The other thing I did was try to shuffle the patches around a bit. There
> > were one or two (pretty trivial) points where it wasn't bisectable, and also
> > merge a couple of patches.
> >
> > I will try to get this patchset merged in -mm soon if feedback is positive.
> > I would also like to take patches for other architectures or any other
> > patches or suggestions for improvements.
>
> There are definitely going to be conflicts between my per-node stack
> and your set, but if you agree the interface should be cleaned up for
> multiple hugepage size support, then I'd like to get my sysfs bits
> into -mm and work on putting the global allocator into sysfs properly
> for you to base off. I think there's enough room for discussion that
> -mm may be a bit premature, but that's just my opinion.
>
> Thanks for keeping the patchset uptodate, I hope to do a more careful
> review next week of the individual patches.
Sure, I haven't seen your work but it shouldn't be terribly hard to merge
either way. It should be easy if we work together ;)
Thanks,
Nick
* Re: [patch 11/17] hugetlbfs: support larger than MAX_ORDER
2008-04-11 8:13 ` Andi Kleen
@ 2008-04-11 8:59 ` Nick Piggin
0 siblings, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2008-04-11 8:59 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, Andi Kleen, linux-kernel, linux-mm, pj, kniht
On Fri, Apr 11, 2008 at 10:13:17AM +0200, Andi Kleen wrote:
> > spin_lock(&hugetlb_lock);
> > - if (h->surplus_huge_pages_node[nid]) {
> > + if (h->surplus_huge_pages_node[nid] && h->order <= MAX_ORDER) {
>
> As Andrew Hastings pointed out earlier this all needs to be h->order < MAX_ORDER
> [got pretty much all the checks wrong off by one]. It won't affect anything
> on x86-64 but might cause problems on archs which have exactly MAX_ORDER
> sized huge pages.
Ah, hmm, I might have missed a couple of emails worth of feedback when
you last posted. Thanks for pointing this out, I'll read over them again.
* Re: [patch 10/17] mm: fix bootmem alignment
2008-04-10 17:33 ` Yinghai Lu
2008-04-10 17:39 ` Nick Piggin
@ 2008-04-11 11:58 ` Nick Piggin
1 sibling, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2008-04-11 11:58 UTC (permalink / raw)
To: Yinghai Lu; +Cc: Andrew Morton, Andi Kleen, linux-kernel, linux-mm, pj, kniht
On Thu, Apr 10, 2008 at 10:33:50AM -0700, Yinghai Lu wrote:
> On Thu, Apr 10, 2008 at 10:02 AM, <npiggin@suse.de> wrote:
> > Without this fix bootmem can return unaligned addresses when the start of a
> > node is not aligned to the align value. Needed for reliably allocating
> > gigabyte pages.
> >
> > I removed the offset variable because all tests should align themself correctly
> > now. Slight drawback might be that the bootmem allocator will spend
> > some more time skipping bits in the bitmap initially, but that shouldn't
> > be a big issue.
> >
>
>
> this patch from Andi was obsoleted by the one in -mm
>
>
> The patch titled
> mm: offset align in alloc_bootmem
> has been added to the -mm tree. Its filename is
> mm-offset-align-in-alloc_bootmem.patch
>
> ------------------------------------------------------
> Subject: mm: offset align in alloc_bootmem
> From: Yinghai Lu <yhlu.kernel.send@gmail.com>
>
> Need offset alignment when node_boot_start's alignment is less than align
> required
>
> Use local node_boot_start to match align. so don't add extra opteration in
> search loop.
Ah, with this patch I'm actually able to allocate 2 1GB pages (on my 4GB
box), so it must be doing something right ;) Will be helpful for my
testing, thanks.
* Re: [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86
2008-04-11 8:28 ` Nick Piggin
@ 2008-04-11 19:57 ` Nish Aravamudan
0 siblings, 0 replies; 28+ messages in thread
From: Nish Aravamudan @ 2008-04-11 19:57 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, linux-kernel, linux-mm, pj, andi, kniht, Adam Litke, Greg KH
[Trimming Andi's SUSE address, as it gave me permanent failures on my
last message]
On 4/11/08, Nick Piggin <npiggin@suse.de> wrote:
> On Thu, Apr 10, 2008 at 04:59:15PM -0700, Nish Aravamudan wrote:
> > Hi Nick,
> >
> > On 4/10/08, npiggin@suse.de <npiggin@suse.de> wrote:
> > > Hi,
> > >
> > > I'm taking care of Andi's hugetlb patchset now. I've taken a while to appear
> > > to do anything with it because I have had other things to do and also needed
> > > some time to get up to speed on it.
> > >
> > > Anyway, from my reviewing of the patchset, I didn't find a great deal
> > > wrong with it in the technical aspects. Taking hstate out of the hugetlbfs
> > > inode and vma is really the main thing I did.
> >
> > Have you tested with the libhugetlbfs test suite? We're gearing up for
> > libhugetlbfs 1.3, so most of the test are uptodate and expected to run
> > cleanly, even with giant hugetlb page support (Jon has been working
> > diligently to test with his 16G page support for power). I'm planning
> > on pushing the last bits out today for Adam to pick up before we start
> > stabilizing for 1.3, so I'm hoping if you grab tomorrow's development
> > snapshot from libhugetlbfs.ozlabs.org, things should run ok. Probably
> > only with just 1G hugepages, though, we haven't yet taught
> > libhugetlbfs about multiple hugepage size availability at run-time,
> > but that shouldn't be hard.
>
>
> Yeah, it should be easy to disable the 2MB default and just make it
> look exactly the same but with 1G pages.
Exactly.
> Thanks a lot for your suggestion, I'll pull the snapshot over the
> weekend and try to make it pass on x86 and work with Jon to ensure it
> is working with powerpc...
Just FYI, we tagged 1.3-pre1 today and it's out now:
http://libhugetlbfs.ozlabs.org/releases/libhugetlbfs-1.3-pre1.tar.gz.
The kernel tests should work fine on x86 as is, even with 1G pages. I
expect some of the linker script testcases to fail, though, as they
will require alignment changes, I think (Adam is actually reworking
the segment remapping code for libhugetlbfs 2.0, which will be released
shortly after 1.3 under our current plans).
> > > However on the less technical side, I think a few things could be improved,
> > > eg. to do with the configuring and reporting, as well as the "administrative"
> > > type of code. I tried to make improvements to things in the last patch of
> > > the series. I will end up folding this properly into the rest of the patchset
> > > where possible.
> >
> > I've got a few ideas here. Are we sure that
> > /proc/sys/vm/nr_{,overcommit}_hugepages is the pool allocation
> > interface we want going forward? I'm fairly sure we don't. I think
> > we're best off moving to a sysfs-based allocator scheme, while keeping
> > /proc/sys/vm/nr_{,overcommit}_hugepages around for the default
> > hugepage size (which may be the only for many folks for now).
> >
> > I'm thinking something like:
> >
> > /sys/devices/system/[DIRNAME]/nr_hugepages ->
> > nr_hugepages_{default_hugepagesize}
> > /sys/devices/system/[DIRNAME]/nr_hugepages_default_hugepagesize
> > /sys/devices/system/[DIRNAME]/nr_hugepages_other_hugepagesize1
> > /sys/devices/system/[DIRNAME]/nr_hugepages_other_hugepagesize2
> > /sys/devices/system/[DIRNAME]/nr_overcommit_hugepages ->
> > nr_overcommit_hugepages_{default_hugepagesize}
> > /sys/devices/system/[DIRNAME]/nr_overcommit_hugepages_default_hugepagesize
> > /sys/devices/system/[DIRNAME]/nr_overcommit_hugepages_other_hugepagesize1
> > /sys/devices/system/[DIRNAME]/nr_overcommit_hugepages_other_hugepagesize2
> >
> > That is, nr_hugepages in the directory (should it be called vm?
> > memory? hugepages specifically? I'm looking for ideas!) will just be a
> > symlink to the underlying default hugepagesize allocator. The files
> > themselves would probably be named along the lines of:
> >
> > nr_hugepages_2M
> > nr_hugepages_1G
> > nr_hugepages_64K
> >
> > etc?
>
>
> Yes I don't like the proc interface, nor the way it has been extended
> (although that's not Andi's fault it is just a limitation of the old
> API).
Agreed, I wasn't trying to blame you or Andi for the choice. Just
suggesting we nip the extension in the bud :)
> I think actually we should have individual directories for each hstate
> size, and we can put all other stuff (reservations and per-node stuff
> etc) under those directories. Leave the proc stuff just for the default
> page size.
>
> I think it should go in /sys/kernel/, because I think /sys/devices is
> more of the hardware side of the system (so it makes sense for
> reporting eg the actual supported TLB sizes, but for configuring your
> page reserves, I think it makes more sense under /sys/kernel/). But
> we'll ask the sysfs folk for guidance there.
That's a good point. I've added Greg explicitly to the Cc, to see if
he has any input. Greg, for something like an allocator interface for
hugepages, where would you expect to see that put in the sysfs
hierarchy? /sys/devices/system or /sys/kernel ?
The reason I was suggesting /sys/devices/system is that we already
have the NUMA topology laid out there (and that is where I currently have
the per-node nr_hugepages). If we put per-node allocations in
/sys/kernel, we would have to duplicate some of that information (or
have really long filenames), and I'm not sure which is better.
Also, for reference, can we not use "reservations" for the pool
allocators? Reserved huge pages have a special meaning (they are used to
satisfy MAP_SHARED mmap()s -- see
http://linux-mm.org/DynamicHugetlbPool). I'm not sure of a better
terminology, beyond perhaps "hugetlb pool interfaces" or something. I
know what you mean, but it got me confused for a second or two :)
> > We'd want to have a similar layout on a per-node basis, I think (see
> > my patchsets to add a per-node interface).
> >
> > > The other thing I did was try to shuffle the patches around a bit. There
> > > were one or two (pretty trivial) points where it wasn't bisectable, and also
> > > merge a couple of patches.
> > >
> > > I will try to get this patchset merged in -mm soon if feedback is positive.
> > > I would also like to take patches for other architectures or any other
> > > patches or suggestions for improvements.
> >
> > There are definitely going to be conflicts between my per-node stack
> > and your set, but if you agree the interface should be cleaned up for
> > multiple hugepage size support, then I'd like to get my sysfs bits
> > into -mm and work on putting the global allocator into sysfs properly
> > for you to base off. I think there's enough room for discussion that
> > -mm may be a bit premature, but that's just my opinion.
> >
> > Thanks for keeping the patchset uptodate, I hope to do a more careful
> > review next week of the individual patches.
>
>
> Sure, I haven't seen your work but it shouldn't be terribly hard to merge
> either way. It should be easy if we work together ;)
I'll make sure to Cc you on the patches that will conflict. If we
decide that /sys/kernel is the right place for the per-node interface
to live, too, then I will need to respin them anyway.
As a side note, I don't think I saw any patches for Documentation in
the last posted set :) Could you update that? It might help with
understanding the changes a bit, although most are pretty
straightforward. It would also be great to update
http://linux-mm.org/PageTableStructure for the 1G case (and eventually
the power 16G case, Jon).
Thanks,
Nish
* Re: [patch 01/17] hugetlb: modular state
2008-04-10 17:02 ` [patch 01/17] hugetlb: modular state npiggin
@ 2008-04-21 20:51 ` Jon Tollefson
2008-04-22 6:45 ` Nick Piggin
0 siblings, 1 reply; 28+ messages in thread
From: Jon Tollefson @ 2008-04-21 20:51 UTC (permalink / raw)
To: npiggin; +Cc: akpm, Andi Kleen, linux-kernel, linux-mm, pj, andi, kniht
On Fri, 2008-04-11 at 03:02 +1000, npiggin@suse.de wrote:
<snip>
> Index: linux-2.6/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.orig/include/linux/hugetlb.h
> +++ linux-2.6/include/linux/hugetlb.h
> @@ -40,7 +40,7 @@ extern int sysctl_hugetlb_shm_group;
>
> /* arch callbacks */
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz);
<snip>
The sz here needs to be a long to handle sizes such as 16G on powerpc.
There are other places in hugetlb.c where the size also needs to be a
long, but this one affects the arch code too since it is public.
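(A trivial userspace illustration of why, nothing to do with the patch
itself: on an LP64 box a 16G size simply does not fit in a 32-bit int
and silently truncates.)

#include <stdio.h>

int main(void)
{
	unsigned long size = 1UL << 34;	/* 16G */
	int sz = size;			/* truncated: low 32 bits are zero */

	printf("unsigned long: %lu  int: %d\n", size, sz);
	return 0;
}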
Jon Tollefson
* Re: [patch 01/17] hugetlb: modular state
2008-04-21 20:51 ` Jon Tollefson
@ 2008-04-22 6:45 ` Nick Piggin
0 siblings, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2008-04-22 6:45 UTC (permalink / raw)
To: Jon Tollefson; +Cc: akpm, Andi Kleen, linux-kernel, linux-mm, pj, andi
On Mon, Apr 21, 2008 at 03:51:24PM -0500, Jon Tollefson wrote:
>
> On Fri, 2008-04-11 at 03:02 +1000, npiggin@suse.de wrote:
>
> <snip>
>
> > Index: linux-2.6/include/linux/hugetlb.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/hugetlb.h
> > +++ linux-2.6/include/linux/hugetlb.h
> > @@ -40,7 +40,7 @@ extern int sysctl_hugetlb_shm_group;
> >
> > /* arch callbacks */
> >
> > -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
> > +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz);
>
> <snip>
>
> The sz here needs to be a long to handle sizes such as 16G on powerpc.
>
> There are other places in hugetlb.c where the size also needs to be a
> long, but this one affects the arch code too since it is public.
Thanks, I've fixed that and found (hopefully) the rest of the ones
in the hugetlb.c code.
Thanks,
Nick
end of thread
Thread overview: 28+ messages
2008-04-10 17:02 [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 npiggin
2008-04-10 17:02 ` [patch 01/17] hugetlb: modular state npiggin
2008-04-21 20:51 ` Jon Tollefson
2008-04-22 6:45 ` Nick Piggin
2008-04-10 17:02 ` [patch 02/17] hugetlb: multiple hstates npiggin
2008-04-10 17:02 ` [patch 03/17] hugetlb: multi hstate proc files npiggin
2008-04-10 17:02 ` [patch 04/17] hugetlbfs: per mount hstates npiggin
2008-04-10 17:02 ` [patch 05/17] hugetlb: multi hstate sysctls npiggin
2008-04-10 17:02 ` [patch 06/17] hugetlb: abstract numa round robin selection npiggin
2008-04-10 17:02 ` [patch 07/17] mm: introduce non panic alloc_bootmem npiggin
2008-04-10 17:02 ` [patch 08/17] mm: export prep_compound_page to mm npiggin
2008-04-10 17:02 ` [patch 09/17] hugetlb: factor out huge_new_page npiggin
2008-04-10 17:02 ` [patch 10/17] mm: fix bootmem alignment npiggin
2008-04-10 17:33 ` Yinghai Lu
2008-04-10 17:39 ` Nick Piggin
2008-04-11 11:58 ` Nick Piggin
2008-04-10 17:02 ` [patch 11/17] hugetlbfs: support larger than MAX_ORDER npiggin
2008-04-11 8:13 ` Andi Kleen
2008-04-11 8:59 ` Nick Piggin
2008-04-10 17:02 ` [patch 12/17] hugetlb: support boot allocate different sizes npiggin
2008-04-10 17:02 ` [patch 13/17] hugetlb: printk cleanup npiggin
2008-04-10 17:02 ` [patch 14/17] hugetlb: introduce huge_pud npiggin
2008-04-10 17:02 ` [patch 15/17] x86: support GB hugepages on 64-bit npiggin
2008-04-10 17:02 ` [patch 16/17] x86: add hugepagesz option " npiggin
2008-04-10 17:02 ` [patch 17/17] hugetlb: misc fixes npiggin
2008-04-10 23:59 ` [patch 00/17] multi size, and giant hugetlb page support, 1GB hugetlb for x86 Nish Aravamudan
2008-04-11 8:28 ` Nick Piggin
2008-04-11 19:57 ` Nish Aravamudan