linux-mm.kvack.org archive mirror
* [PATCH 0/5] [RFC] Dynamic hugetlb pool resizing
@ 2007-07-13 15:16 Adam Litke
  2007-07-13 15:16 ` [PATCH 1/5] [hugetlb] Introduce BASE_PAGES_PER_HPAGE constant Adam Litke
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Adam Litke @ 2007-07-13 15:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Andy Whitcroft, William Lee Irwin III,
	Christoph Lameter, Ken Chen, Adam Litke


In most real-world scenarios, configuring the size of the hugetlb pool
correctly is a difficult task.  If too few pages are allocated to the pool,
then some applications will not be able to use huge pages or, in some cases,
programs that overcommit huge pages could receive SIGBUS.  Isolating too much
memory in the hugetlb pool means it is not available for other uses, especially
to programs that are not yet using huge pages.

The obvious answer is to let the hugetlb pool grow and shrink in response to
the runtime demand for huge pages.  The work Mel Gorman has been doing to
establish a memory zone for movable memory allocations makes dynamically
resizing the hugetlb pool reliable.  This patch series is an RFC to show how we
might ease the burden of hugetlb pool configuration.  Comments?

How It Works
============

The goal is: upon depletion of the hugetlb pool, rather than reporting an error
immediately, first try to allocate the needed huge pages directly from the
buddy allocator.  We must be careful to avoid unbounded growth of the hugetlb
pool, so we begin by accounting for huge pages as locked memory (since that is
what they actually are).  We will only allow a process to grow the hugetlb pool
if those allocations will not cause it to exceed its locked_vm ulimit.
Additionally, a sysctl parameter could be introduced to govern whether pool
resizing is permitted.
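Concretely, the locked_vm check boils down to the following (a condensed
sketch of the within_locked_vm_limits() helper added in patch 4/5; mmap_sem
is assumed held so locked_vm is stable):

  /* Would locking hpage_delta more huge pages stay under RLIMIT_MEMLOCK? */
  static int within_locked_vm_limits(long hpage_delta)
  {
          unsigned long locked_pages = current->mm->locked_vm +
                                       hpage_delta * BASE_PAGES_PER_HPAGE;
          unsigned long limit =
                  current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;

          return locked_pages <= limit;
  }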

The real work begins when we decide there is a shortage of huge pages.  What
happens next depends on whether the pages are for a private or shared mapping.
Private mappings are straightforward.  At fault time, if alloc_huge_page()
fails, we allocate a page from buddy and increment the appropriate
surplus_huge_pages counter.  Because of strict reservation, shared mappings are
a bit more tricky since we must guarantee the pages at mmap time.  For this
case we determine the number of pages we are short and allocate them all at
once.  They are then all added to the pool but marked as reserved
(resv_huge_pages) and surplus (surplus_huge_pages).
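Condensed to their essence (not standalone code, just the shape of the changes
made in patches 4/5 and 5/5), the two paths look like this:

  /* Private mapping, fault time (patch 4/5): fall back to the buddy
   * allocator and count the page as surplus. */
  if (!(vma->vm_flags & VM_MAYSHARE))
          page = alloc_buddy_huge_page(vma, addr);

  /* Shared mapping, mmap() time (patch 5/5): allocate the whole shortage
   * in one batch before taking the reservation. */
  if ((delta + resv_huge_pages) > free_huge_pages &&
                  gather_surplus_pages(delta))
          return -ENOMEM;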

We want the hugetlb pool to gravitate back to its original size, so
free_huge_page() must know how to free pages back to buddy when there are
surplus pages.  This is done with per-node surplus page counters so that
the number of huge pages doesn't become imbalanced across NUMA nodes.

Issues
======

In rare cases, I have seen the size of the hugetlb pool increase or decrease by
a few pages.  I am continuing to debug this, but it is a relatively minor
problem since it doesn't adversely affect the stability of the system.

Recently, a cpuset check was added to the shared memory reservation code to
roughly detect cases where there are not enough pages within a cpuset to
satisfy an allocation.  I am not quite sure how to integrate this logic into
the dynamic pool resizing patches but I am sure someone more familiar with
cpusets will have some good ideas.


* [PATCH 1/5] [hugetlb] Introduce BASE_PAGES_PER_HPAGE constant
  2007-07-13 15:16 [PATCH 0/5] [RFC] Dynamic hugetlb pool resizing Adam Litke
@ 2007-07-13 15:16 ` Adam Litke
  2007-07-23 19:43   ` Christoph Lameter
  2007-07-13 15:16 ` [PATCH 2/5] [hugetlb] Account for hugepages as locked_vm Adam Litke
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Adam Litke @ 2007-07-13 15:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Andy Whitcroft, William Lee Irwin III,
	Christoph Lameter, Ken Chen, Adam Litke

In many places throughout the kernel, the expression (HPAGE_SIZE/PAGE_SIZE) is
used to convert quantities in huge page units to a number of base pages.
Reduce redundancy and make the code more readable by introducing a constant
BASE_PAGES_PER_HPAGE whose name more clearly conveys the intended conversion.

Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 arch/powerpc/mm/hugetlbpage.c |    2 +-
 arch/sparc64/mm/fault.c       |    2 +-
 include/linux/hugetlb.h       |    2 ++
 ipc/shm.c                     |    2 +-
 mm/hugetlb.c                  |   10 +++++-----
 mm/memory.c                   |    2 +-
 6 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 92a1b16..5e3414a 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -387,7 +387,7 @@ static unsigned int hash_huge_page_do_lazy_icache(unsigned long rflags,
 	/* page is dirty */
 	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
 		if (trap == 0x400) {
-			for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++)
+			for (i = 0; i < BASE_PAGES_PER_HPAGE; i++)
 				__flush_dcache_icache(page_address(page+i));
 			set_bit(PG_arch_1, &page->flags);
 		} else {
diff --git a/arch/sparc64/mm/fault.c b/arch/sparc64/mm/fault.c
index b582024..4076003 100644
--- a/arch/sparc64/mm/fault.c
+++ b/arch/sparc64/mm/fault.c
@@ -434,7 +434,7 @@ good_area:
 
 	mm_rss = get_mm_rss(mm);
 #ifdef CONFIG_HUGETLB_PAGE
-	mm_rss -= (mm->context.huge_pte_count * (HPAGE_SIZE / PAGE_SIZE));
+	mm_rss -= (mm->context.huge_pte_count * BASE_PAGES_PER_HPAGE);
 #endif
 	if (unlikely(mm_rss >
 		     mm->context.tsb_block[MM_TSB_BASE].tsb_rss_limit))
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b4570b6..77021a3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -128,6 +128,8 @@ static inline unsigned long hugetlb_total_pages(void)
 
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+#define BASE_PAGES_PER_HPAGE (HPAGE_SIZE >> PAGE_SHIFT)
+
 #ifdef CONFIG_HUGETLBFS
 struct hugetlbfs_config {
 	uid_t   uid;
diff --git a/ipc/shm.c b/ipc/shm.c
index 4fefbad..fde409a 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -559,7 +559,7 @@ static void shm_get_stat(struct ipc_namespace *ns, unsigned long *rss,
 
 		if (is_file_hugepages(shp->shm_file)) {
 			struct address_space *mapping = inode->i_mapping;
-			*rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
+			*rss += BASE_PAGES_PER_HPAGE * mapping->nrpages;
 		} else {
 			struct shmem_inode_info *info = SHMEM_I(inode);
 			spin_lock(&info->lock);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eb7180d..61a52b0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -37,7 +37,7 @@ static void clear_huge_page(struct page *page, unsigned long addr)
 	int i;
 
 	might_sleep();
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); i++) {
+	for (i = 0; i < BASE_PAGES_PER_HPAGE; i++) {
 		cond_resched();
 		clear_user_highpage(page + i, addr);
 	}
@@ -49,7 +49,7 @@ static void copy_huge_page(struct page *dst, struct page *src,
 	int i;
 
 	might_sleep();
-	for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
+	for (i = 0; i < BASE_PAGES_PER_HPAGE; i++) {
 		cond_resched();
 		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
 	}
@@ -191,7 +191,7 @@ static void update_and_free_page(struct page *page)
 	int i;
 	nr_huge_pages--;
 	nr_huge_pages_node[page_to_nid(page)]--;
-	for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
+	for (i = 0; i < BASE_PAGES_PER_HPAGE; i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
 				1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
 				1 << PG_private | 1<< PG_writeback);
@@ -283,7 +283,7 @@ int hugetlb_report_node_meminfo(int nid, char *buf)
 /* Return the number pages of memory we physically have, in PAGE_SIZE units. */
 unsigned long hugetlb_total_pages(void)
 {
-	return nr_huge_pages * (HPAGE_SIZE / PAGE_SIZE);
+	return nr_huge_pages * BASE_PAGES_PER_HPAGE;
 }
 
 /*
@@ -642,7 +642,7 @@ same_page:
 		--remainder;
 		++i;
 		if (vaddr < vma->vm_end && remainder &&
-				pfn_offset < HPAGE_SIZE/PAGE_SIZE) {
+				pfn_offset < BASE_PAGES_PER_HPAGE) {
 			/*
 			 * We use pfn_offset to avoid touching the pageframes
 			 * of this compound page.
diff --git a/mm/memory.c b/mm/memory.c
index cb94488..bb8f7e8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -842,7 +842,7 @@ unsigned long unmap_vmas(struct mmu_gather **tlbp,
 			if (unlikely(is_vm_hugetlb_page(vma))) {
 				unmap_hugepage_range(vma, start, end);
 				zap_work -= (end - start) /
-						(HPAGE_SIZE / PAGE_SIZE);
+						BASE_PAGES_PER_HPAGE;
 				start = end;
 			} else
 				start = unmap_page_range(*tlbp, vma,


* [PATCH 2/5] [hugetlb] Account for hugepages as locked_vm
  2007-07-13 15:16 [PATCH 0/5] [RFC] Dynamic hugetlb pool resizing Adam Litke
  2007-07-13 15:16 ` [PATCH 1/5] [hugetlb] Introduce BASE_PAGES_PER_HPAGE constant Adam Litke
@ 2007-07-13 15:16 ` Adam Litke
  2007-07-13 15:16 ` [PATCH 3/5] [hugetlb] Move update_and_free_page so it can be used by alloc functions Adam Litke
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Adam Litke @ 2007-07-13 15:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Andy Whitcroft, William Lee Irwin III,
	Christoph Lameter, Ken Chen, Adam Litke

Hugepages allocated for a process are pinned and may not be reclaimed. This
patch accounts for hugepages under locked_vm.

TODO:
	Explore replacing this patch with a hugetlb pool high watermark
instead.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 mm/hugetlb.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 61a52b0..d1ca501 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -402,6 +402,7 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			continue;
 
 		page = pte_page(pte);
+		mm->locked_vm -= BASE_PAGES_PER_HPAGE;
 		if (pte_dirty(pte))
 			set_page_dirty(page);
 		list_add(&page->lru, &page_list);
@@ -535,6 +536,14 @@ retry:
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
 
+	/*
+ 	 * Account for huge pages as locked. Note that lock limits are not
+ 	 * enforced here because it is not expected that limits are enforced
+ 	 * at fault time. It also would not be right to enforce the limits
+ 	 * at mmap() time because the pages are not pinned at that point
+ 	 */
+	mm->locked_vm += BASE_PAGES_PER_HPAGE;
+
 	if (write_access && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte);


* [PATCH 3/5] [hugetlb] Move update_and_free_page so it can be used by alloc functions
  2007-07-13 15:16 [PATCH 0/5] [RFC] Dynamic hugetlb pool resizing Adam Litke
  2007-07-13 15:16 ` [PATCH 1/5] [hugetlb] Introduce BASE_PAGES_PER_HPAGE constant Adam Litke
  2007-07-13 15:16 ` [PATCH 2/5] [hugetlb] Account for hugepages as locked_vm Adam Litke
@ 2007-07-13 15:16 ` Adam Litke
  2007-07-13 15:17 ` [PATCH 4/5] [hugetlb] Try to grow pool on alloc_huge_page failure Adam Litke
  2007-07-13 15:17 ` [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings Adam Litke
  4 siblings, 0 replies; 29+ messages in thread
From: Adam Litke @ 2007-07-13 15:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Andy Whitcroft, William Lee Irwin III,
	Christoph Lameter, Ken Chen, Adam Litke

Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 mm/hugetlb.c |   30 +++++++++++++++---------------
 1 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d1ca501..a754c20 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -88,6 +88,21 @@ static struct page *dequeue_huge_page(struct vm_area_struct *vma,
 	return page;
 }
 
+static void update_and_free_page(struct page *page)
+{
+	int i;
+	nr_huge_pages--;
+	nr_huge_pages_node[page_to_nid(page)]--;
+	for (i = 0; i < BASE_PAGES_PER_HPAGE; i++) {
+		page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
+				1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
+				1 << PG_private | 1<< PG_writeback);
+	}
+	page[1].lru.next = NULL;
+	set_page_refcounted(page);
+	__free_pages(page, HUGETLB_PAGE_ORDER);
+}
+
 static void free_huge_page(struct page *page)
 {
 	BUG_ON(page_count(page));
@@ -186,21 +201,6 @@ static unsigned int cpuset_mems_nr(unsigned int *array)
 }
 
 #ifdef CONFIG_SYSCTL
-static void update_and_free_page(struct page *page)
-{
-	int i;
-	nr_huge_pages--;
-	nr_huge_pages_node[page_to_nid(page)]--;
-	for (i = 0; i < BASE_PAGES_PER_HPAGE; i++) {
-		page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
-				1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
-				1 << PG_private | 1<< PG_writeback);
-	}
-	page[1].lru.next = NULL;
-	set_page_refcounted(page);
-	__free_pages(page, HUGETLB_PAGE_ORDER);
-}
-
 #ifdef CONFIG_HIGHMEM
 static void try_to_free_low(unsigned long count)
 {


* [PATCH 4/5] [hugetlb] Try to grow pool on alloc_huge_page failure
  2007-07-13 15:16 [PATCH 0/5] [RFC] Dynamic hugetlb pool resizing Adam Litke
                   ` (2 preceding siblings ...)
  2007-07-13 15:16 ` [PATCH 3/5] [hugetlb] Move update_and_free_page so it can be used by alloc functions Adam Litke
@ 2007-07-13 15:17 ` Adam Litke
  2007-07-13 15:17 ` [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings Adam Litke
  4 siblings, 0 replies; 29+ messages in thread
From: Adam Litke @ 2007-07-13 15:17 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Andy Whitcroft, William Lee Irwin III,
	Christoph Lameter, Ken Chen, Adam Litke

Because we overcommit hugepages for MAP_PRIVATE mappings, it is possible that
the hugetlb pool will be exhausted (or fully reserved) when a hugepage is
needed to satisfy a page fault.  Before killing the process in this situation,
try to allocate a hugepage directly from the buddy allocator.  Only do this if
the process would remain within its locked_vm memory limits.

Hugepages allocated directly from the buddy allocator (surplus pages)
should be freed back to the buddy allocator to prevent unbounded growth of
the hugetlb pool.  Introduce a per-node surplus pages counter which is then
used by free_huge_page to determine how the page should be freed.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 mm/hugetlb.c |   82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a754c20..f03db67 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -27,6 +27,7 @@ unsigned long max_huge_pages;
 static struct list_head hugepage_freelists[MAX_NUMNODES];
 static unsigned int nr_huge_pages_node[MAX_NUMNODES];
 static unsigned int free_huge_pages_node[MAX_NUMNODES];
+static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 /*
  * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
  */
@@ -105,16 +106,22 @@ static void update_and_free_page(struct page *page)
 
 static void free_huge_page(struct page *page)
 {
-	BUG_ON(page_count(page));
+	int nid = page_to_nid(page);
 
+	BUG_ON(page_count(page));
 	INIT_LIST_HEAD(&page->lru);
 
 	spin_lock(&hugetlb_lock);
-	enqueue_huge_page(page);
+	if (surplus_huge_pages_node[nid]) {
+		update_and_free_page(page);
+		surplus_huge_pages_node[nid]--;
+	} else {
+		enqueue_huge_page(page);
+	}
 	spin_unlock(&hugetlb_lock);
 }
 
-static int alloc_fresh_huge_page(void)
+static struct page *__alloc_fresh_huge_page(void)
 {
 	static int nid = 0;
 	struct page *page;
@@ -129,16 +136,72 @@ static int alloc_fresh_huge_page(void)
 		nr_huge_pages++;
 		nr_huge_pages_node[page_to_nid(page)]++;
 		spin_unlock(&hugetlb_lock);
+	}
+	return page;
+}
+
+static int alloc_fresh_huge_page(void)
+{
+	struct page *page;
+
+	page = __alloc_fresh_huge_page();
+	if (page) {
 		put_page(page); /* free it into the hugepage allocator */
 		return 1;
 	}
 	return 0;
 }
 
+/*
+ * Returns 1 if a process remains within lock limits after locking
+ * hpage_delta huge pages. It is expected that mmap_sem is held
+ * when calling this function, otherwise the locked_vm counter may
+ * change unexpectedly
+ */
+static int within_locked_vm_limits(long hpage_delta)
+{
+	unsigned long locked_pages, locked_pages_limit;
+
+	/* Check locked page limits */
+	locked_pages = current->mm->locked_vm;
+	locked_pages += hpage_delta * BASE_PAGES_PER_HPAGE;
+	locked_pages_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
+	locked_pages_limit >>= PAGE_SHIFT;
+
+	/* Return 0 if we would exceed locked_vm limits */
+	if (locked_pages > locked_pages_limit)
+		return 0;
+
+	/* Nice, we're within limits */
+	return 1;
+}
+
+static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
+						unsigned long address)
+{
+	struct page *page = NULL;
+
+	/* Check we remain within limits if 1 huge page is allocated */
+	if (!within_locked_vm_limits(1))
+		return NULL;
+
+	page = __alloc_fresh_huge_page();
+	if (page) {
+		INIT_LIST_HEAD(&page->lru);
+
+		/* We now have a surplus huge page, keep track of it */
+		spin_lock(&hugetlb_lock);
+		surplus_huge_pages_node[page_to_nid(page)]++;
+		spin_unlock(&hugetlb_lock);
+	}
+
+	return page;
+}
+
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr)
 {
-	struct page *page;
+	struct page *page = NULL;
 
 	spin_lock(&hugetlb_lock);
 	if (vma->vm_flags & VM_MAYSHARE)
@@ -158,7 +221,16 @@ fail:
 	if (vma->vm_flags & VM_MAYSHARE)
 		resv_huge_pages++;
 	spin_unlock(&hugetlb_lock);
-	return NULL;
+
+	/*
+	 * Private mappings do not use reserved huge pages so the allocation
+	 * may have failed due to an undersized hugetlb pool.  Try to grab a
+	 * surplus huge page from the buddy allocator.
+	 */
+	if (!(vma->vm_flags & VM_MAYSHARE))
+		page = alloc_buddy_huge_page(vma, addr);
+
+	return page;
 }
 
 static int __init hugetlb_init(void)


* [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-13 15:16 [PATCH 0/5] [RFC] Dynamic hugetlb pool resizing Adam Litke
                   ` (3 preceding siblings ...)
  2007-07-13 15:17 ` [PATCH 4/5] [hugetlb] Try to grow pool on alloc_huge_page failure Adam Litke
@ 2007-07-13 15:17 ` Adam Litke
  2007-07-13 20:05   ` Paul Jackson
  4 siblings, 1 reply; 29+ messages in thread
From: Adam Litke @ 2007-07-13 15:17 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Andy Whitcroft, William Lee Irwin III,
	Christoph Lameter, Ken Chen, Adam Litke

Allow the hugetlb pool to grow dynamically for shared mappings as well.
Due to strict reservations, this is a bit more complex than the private
case.  We must grow the pool at mmap time so we can create a reservation.
The algorithm works as follows:

1) Determine and allocate the full hugetlb page shortage
2) If allocations fail, goto step 5
3) Take the hugetlb_lock and make sure we still have the right number.  If
   not, go back to step 1.
4) Add surplus pages to the hugetlb pool and mark them reserved
5) Free the rest of the surplus pages

Signed-off-by: Adam Litke <agl@us.ibm.com>
---

 mm/hugetlb.c |   82 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f03db67..82cd935 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -198,6 +198,70 @@ static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
 	return page;
 }
 
+/*
+ * Increase the hugetlb pool such that it can accomodate a reservation
+ * of size 'delta'.
+ */
+static int gather_surplus_pages(int delta)
+{
+	struct list_head surplus_list;
+	struct page *page, *tmp;
+	int ret, i, needed, allocated;
+
+	/* Try and allocate all of the pages first */
+	needed = delta - free_huge_pages + resv_huge_pages;
+	allocated = 0;
+	INIT_LIST_HEAD(&surplus_list);
+
+	ret = -ENOMEM;
+retry:
+	spin_unlock(&hugetlb_lock);
+	for (i = 0; i < needed; i++) {
+		page = alloc_buddy_huge_page(NULL, 0);
+		if (!page) {
+			spin_lock(&hugetlb_lock);
+			needed = 0;
+			goto free;
+		}
+
+		list_add(&page->lru, &surplus_list);
+	}
+	allocated += needed;
+
+	/*
+	 * After retaking hugetlb_lock, we may find that some of the
+	 * free_huge_pages we were planning on using are no longer free.
+	 * In this case we need to allocate some additional pages.
+	 */
+	spin_lock(&hugetlb_lock);
+	needed = delta - free_huge_pages + resv_huge_pages - allocated;
+	if (needed > 0)
+		goto retry;
+
+	/*
+	 * Dispense the pages on the surplus list by adding them to the pool
+	 * or by freeing them back to the allocator.
+	 * We will have extra pages to free in one of two cases:
+	 * 1) We were not able to allocate enough pages to satisfy the entire
+	 *    reservation so we free all allocated pages.
+	 * 2) While we were allocating some surplus pages with the hugetlb_lock
+	 *    unlocked, some pool pages were freed.  Use those instead and
+	 *    free the surplus pages we allocated.
+	 */
+	needed += allocated;
+	ret = 0;
+free:
+	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
+		list_del(&page->lru);
+		if ((--needed) >= 0)
+			enqueue_huge_page(page);
+		else
+			update_and_free_page(page);
+	}
+
+	return ret;
+}
+
 static struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr)
 {
@@ -893,13 +957,16 @@ static long region_truncate(struct list_head *head, long end)
 
 static int hugetlb_acct_memory(long delta)
 {
-	int ret = -ENOMEM;
+	int ret = 0;
 
 	spin_lock(&hugetlb_lock);
-	if ((delta + resv_huge_pages) <= free_huge_pages) {
+
+	if (((delta + resv_huge_pages) > free_huge_pages) &&
+			gather_surplus_pages(delta))
+		ret = -ENOMEM;
+	else
 		resv_huge_pages += delta;
-		ret = 0;
-	}
+
 	spin_unlock(&hugetlb_lock);
 	return ret;
 }
@@ -928,8 +995,15 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to)
 	 * a best attempt and hopefully to minimize the impact of changing
 	 * semantics that cpuset has.
 	 */
+	/*
+	 * I haven't figured out how to incorporate this cpuset bodge into
+	 * the dynamic hugetlb pool yet.  Hopefully someone more familiar with
+	 * cpusets can weigh in on their desired semantics.  Maybe we can just
+	 * drop this check?
+	 *
 	if (chg > cpuset_mems_nr(free_huge_pages_node))
 		return -ENOMEM;
+	 */
 
 	ret = hugetlb_acct_memory(chg);
 	if (ret < 0)


* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-13 15:17 ` [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings Adam Litke
@ 2007-07-13 20:05   ` Paul Jackson
  2007-07-13 21:05     ` Adam Litke
  2007-07-13 21:09     ` Ken Chen
  0 siblings, 2 replies; 29+ messages in thread
From: Paul Jackson @ 2007-07-13 20:05 UTC (permalink / raw)
  To: Adam Litke; +Cc: linux-mm, mel, apw, wli, clameter, kenchen

Adam wrote:
> +	/*
> +	 * I haven't figured out how to incorporate this cpuset bodge into
> +	 * the dynamic hugetlb pool yet.  Hopefully someone more familiar with
> +	 * cpusets can weigh in on their desired semantics.  Maybe we can just
> +	 * drop this check?
> +	 *
>  	if (chg > cpuset_mems_nr(free_huge_pages_node))
>  		return -ENOMEM;
> +	 */

I can't figure out the value of this check either -- Ken Chen added it, perhaps
he can comment.

But the cpuset behaviour of this hugetlb stuff looks suspicious to me:
 1) The code in alloc_fresh_huge_page() seems to round robin over
    the entire system, spreading the hugetlb pages uniformly on all nodes.
    If a task in one small cpuset starts aggressively allocating hugetlb
    pages, do you think this will work, Adam -- looks to me like we will end
    up calling alloc_fresh_huge_page() many times, most of which will fail to
    alloc_pages_node() anything because the 'static nid' clock hand will be
    pointing at a node outside of the current tasks cpuset (not in that tasks
    mems_allowed).  Inefficient, but I guess ok.
 2) I don't see what keeps us from picking hugetlb pages off -any- node in the
    system, perhaps way outside the current cpuset.  We shouldn't be looking for
    enough available (free_huge_pages - resv_huge_pages) pages in the whole
    system.  Rather we should be looking for and reserving enough such pages
    that are in the current tasks cpuset (set in its mems_allowed, to be precise)
    Folks aren't going to want their hugetlb pages coming from outside their
    tasks cpuset.
 3) If there is some code I missed (good chance) that enforces the rule that
    a task can only get a hugetlb page from a node in its cpuset, then this
    uniform global allocation of hugetlb pages, as noted in (1) above, can't
    be right.  Either it will force all nodes, including many nodes outside
    of the current tasks cpuset, to bulk up on free hugetlb pages, just to
    get enough of them on nodes allowed by the current tasks cpuset, or else
    it will fail to get enough on nodes local to the current tasks cpuset.
    I don't understand the logic well enough to know which, but either way
    sucks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-13 20:05   ` Paul Jackson
@ 2007-07-13 21:05     ` Adam Litke
  2007-07-13 21:24       ` Ken Chen
                         ` (3 more replies)
  2007-07-13 21:09     ` Ken Chen
  1 sibling, 4 replies; 29+ messages in thread
From: Adam Litke @ 2007-07-13 21:05 UTC (permalink / raw)
  To: Paul Jackson; +Cc: linux-mm, mel, apw, wli, clameter, kenchen

On Fri, 2007-07-13 at 13:05 -0700, Paul Jackson wrote:
> Adam wrote:
> > +	/*
> > +	 * I haven't figured out how to incorporate this cpuset bodge into
> > +	 * the dynamic hugetlb pool yet.  Hopefully someone more familiar with
> > +	 * cpusets can weigh in on their desired semantics.  Maybe we can just
> > +	 * drop this check?
> > +	 *
> >  	if (chg > cpuset_mems_nr(free_huge_pages_node))
> >  		return -ENOMEM;
> > +	 */
> 
> I can't figure out the value of this check either -- Ken Chen added it, perhaps
> he can comment.

To be honest, I just don't think a global hugetlb pool and cpusets are
compatible, period.  I wonder if moving to the mempool interface and
having dynamic adjustable per-cpuset hugetlb mempools (ick) could make
things work saner.  It's on my list to see if mempools could be used to
replace the custom hugetlb pool code.  Otherwise, Mel's zone_movable
stuff could possibly remove the need for hugetlb pools as we know them.

> But the cpuset behaviour of this hugetlb stuff looks suspicious to me:
>  1) The code in alloc_fresh_huge_page() seems to round robin over
>     the entire system, spreading the hugetlb pages uniformly on all nodes.
>     If a task in one small cpuset starts aggressively allocating hugetlb
>     pages, do you think this will work, Adam -- looks to me like we will end
>     up calling alloc_fresh_huge_page() many times, most of which will fail to
>     alloc_pages_node() anything because the 'static nid' clock hand will be
>     pointing at a node outside of the current tasks cpuset (not in that tasks
>     mems_allowed).  Inefficient, but I guess ok.

Very good point.  I guess we call alloc_fresh_huge_page in two scenarios
now... 1) By echoing a number into /proc/sys/vm/nr_hugepages, and 2) by
trying to dynamically increase the pool size for a particular process.
Case 1 is not in the context of any process (per se) and so
node_online_map makes sense.  For case 2 we could teach the
__alloc_fresh_huge_page() to take a nodemask.  That could get nasty
though since we'd have to move away from a static variable to get proper
interleaving.
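For illustration only (nothing like this is in the posted series), such a
variant might look roughly like the sketch below; the explicit next-nid
parameter is what would replace the static variable:

  static struct page *__alloc_fresh_huge_page(nodemask_t *nodes, int *next_nid)
  {
          struct page *page;

          /* __GFP_COMP gives a compound page; other gfp details as in the
           * existing pool allocator */
          page = alloc_pages_node(*next_nid, GFP_HIGHUSER|__GFP_COMP,
                                  HUGETLB_PAGE_ORDER);

          /* Advance the interleave hand, but only within the caller's mask */
          *next_nid = next_node(*next_nid, *nodes);
          if (*next_nid == MAX_NUMNODES)
                  *next_nid = first_node(*nodes);

          if (page) {
                  spin_lock(&hugetlb_lock);
                  nr_huge_pages++;
                  nr_huge_pages_node[page_to_nid(page)]++;
                  spin_unlock(&hugetlb_lock);
          }
          return page;
  }

The sysctl path would keep passing node_online_map with a file-scope hand,
while on-demand growth could pass cpuset_mems_allowed(current) and a
per-caller hand.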

>  2) I don't see what keeps us from picking hugetlb pages off -any- node in the
>     system, perhaps way outside the current cpuset.  We shouldn't be looking for
>     enough available (free_huge_pages - resv_huge_pages) pages in the whole
>     system.  Rather we should be looking for and reserving enough such pages
>     that are in the current tasks cpuset (set in its mems_allowed, to be precise)
>     Folks aren't going to want their hugetlb pages coming from outside their
>     tasks cpuset.

Hmm, I see what you mean, but cpusets are already broken because we use
the global resv_huge_pages counter.  I realize that's what the
cpuset_mems_nr() thing was meant to address but it's not correct.

Perhaps if we make sure __alloc_fresh_huge_page() can be restricted to a
nodemask then we can avoid stealing pages from other cpusets.  But we'd
still be stuck with the existing problem for shared mappings: cpusets +
our strict_reservation algorithm cannot provide guarantees (like we can
without cpusets).

>  3) If there is some code I missed (good chance) that enforces the rule that
>     a task can only get a hugetlb page from a node in its cpuset, then this
>     uniform global allocation of hugetlb pages, as noted in (1) above, can't
>     be right.  Either it will force all nodes, including many nodes outside
>     of the current tasks cpuset, to bulk up on free hugetlb pages, just to
>     get enough of them on nodes allowed by the current tasks cpuset, or else
>     it will fail to get enough on nodes local to the current tasks cpuset.
>     I don't understand the logic well enough to know which, but either way
>     sucks.

I'll cook up a __alloc_fresh_huge_page(nodemask) patch and see if that
makes things better.  Thanks for your review and comments.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center


* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-13 20:05   ` Paul Jackson
  2007-07-13 21:05     ` Adam Litke
@ 2007-07-13 21:09     ` Ken Chen
  1 sibling, 0 replies; 29+ messages in thread
From: Ken Chen @ 2007-07-13 21:09 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Adam Litke, linux-mm, mel, apw, wli, clameter

On 7/13/07, Paul Jackson <pj@sgi.com> wrote:
> But the cpuset behaviour of this hugetlb stuff looks suspicious to me:
>  1) The code in alloc_fresh_huge_page() seems to round robin over
>     the entire system, spreading the hugetlb pages uniformly on all nodes.
>     If a task in one small cpuset starts aggressively allocating hugetlb
>     pages, do you think this will work,

alloc_fresh_huge_page() is used to fill up the hugetlb page pool.  It
is called through the sysctl path.  The path that dishes pages out of the
pool and allocates them to a task is alloc_huge_page(), which should obey
both mempolicy and cpuset constraints.


>  2) I don't see what keeps us from picking hugetlb pages off -any- node in the
>     system, perhaps way outside the current cpuset.

I think it is checked in dequeue_huge_page():

                if (cpuset_zone_allowed_softwall(*z, GFP_HIGHUSER) &&
                    !list_empty(&hugepage_freelists[nid]))
                        break;


* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-13 21:05     ` Adam Litke
@ 2007-07-13 21:24       ` Ken Chen
  2007-07-13 21:29       ` Christoph Lameter
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Ken Chen @ 2007-07-13 21:24 UTC (permalink / raw)
  To: Adam Litke; +Cc: Paul Jackson, linux-mm, mel, apw, wli, clameter

On 7/13/07, Adam Litke <agl@us.ibm.com> wrote:
> To be honest, I just don't think a global hugetlb pool and cpusets are
> compatible, period.

Agreed.  It's a mess.


> > But the cpuset behaviour of this hugetlb stuff looks suspicious to me:
> >  1) The code in alloc_fresh_huge_page() seems to round robin over
> >     the entire system, spreading the hugetlb pages uniformly on all nodes.
> >     If a task in one small cpuset starts aggressively allocating hugetlb
> >     pages, do you think this will work, Adam -- looks to me like we will end
> >     up calling alloc_fresh_huge_page() many times, most of which will fail to
> >     alloc_pages_node() anything because the 'static nid' clock hand will be
> >     pointing at a node outside of the current tasks cpuset (not in that tasks
> >     mems_allowed).  Inefficient, but I guess ok.
>
> Very good point.  I guess we call alloc_fresh_huge_page in two scenarios
> now... 1) By echoing a number into /proc/sys/vm/nr_hugepages, and 2) by
> trying to dynamically increase the pool size for a particular process.
> Case 1 is not in the context of any process (per se) and so
> node_online_map makes sense.  For case 2 we could teach the
> __alloc_fresh_huge_page() to take a nodemask.  That could get nasty
> though since we'd have to move away from a static variable to get proper
> interleaving.

alloc_fresh_huge_page
    alloc_pages_node
        get_page_from_freelist {
            ...
            if ((alloc_flags & ALLOC_CPUSET) &&
                        !cpuset_zone_allowed_softwall(zone, gfp_mask))
                                goto try_next_zone;
            ...

It looks to me like the cpuset rule is buried deep down in the buddy
allocator.  So the cpuset mems_allowed rule is enforced both at pool
reservation time (in get_page_from_freelist) and at hugetlb page fault
time (in dequeue_huge_page()).


* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-13 21:05     ` Adam Litke
  2007-07-13 21:24       ` Ken Chen
@ 2007-07-13 21:29       ` Christoph Lameter
  2007-07-13 21:38         ` Ken Chen
  2007-07-13 21:38       ` Paul Jackson
  2007-07-13 23:15       ` Nish Aravamudan
  3 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2007-07-13 21:29 UTC (permalink / raw)
  To: Adam Litke; +Cc: Paul Jackson, linux-mm, mel, apw, wli, kenchen

On Fri, 13 Jul 2007, Adam Litke wrote:

> To be honest, I just don't think a global hugetlb pool and cpusets are
> compatible, period.  I wonder if moving to the mempool interface and

Sorry no. We always had per node pools. There is no need to have per 
cpuset pools.

> Hmm, I see what you mean, but cpusets are already broken because we use
> the global resv_huge_pages counter.  I realize that's what the
> cpuset_mems_nr() thing was meant to address but it's not correct.

Well the global reserve counter causes a big reduction in performance 
since it requires the serialization of the hugetlb faults. Could we please 
get this straightened out? This serialization somehow snuck in when I was 
not looking and it screws up multiple things.



* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-13 21:05     ` Adam Litke
  2007-07-13 21:24       ` Ken Chen
  2007-07-13 21:29       ` Christoph Lameter
@ 2007-07-13 21:38       ` Paul Jackson
  2007-07-17 23:42         ` Nish Aravamudan
  2007-07-13 23:15       ` Nish Aravamudan
  3 siblings, 1 reply; 29+ messages in thread
From: Paul Jackson @ 2007-07-13 21:38 UTC (permalink / raw)
  To: Adam Litke; +Cc: linux-mm, mel, apw, wli, clameter, kenchen

Adam wrote:
> To be honest, I just don't think a global hugetlb pool and cpusets are
> compatible, period.

It's not an easy fit, that's for sure ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-13 21:29       ` Christoph Lameter
@ 2007-07-13 21:38         ` Ken Chen
  2007-07-13 21:47           ` Christoph Lameter
  2007-07-13 22:21           ` Paul Jackson
  0 siblings, 2 replies; 29+ messages in thread
From: Ken Chen @ 2007-07-13 21:38 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Adam Litke, Paul Jackson, linux-mm, mel, apw, wli

On 7/13/07, Christoph Lameter <clameter@sgi.com> wrote:
> On Fri, 13 Jul 2007, Adam Litke wrote:
>
> > To be honest, I just don't think a global hugetlb pool and cpusets are
> > compatible, period.  I wonder if moving to the mempool interface and
>
> Sorry no. We always had per node pools. There is no need to have per
> cpuset pools.

Yeah, per node pool is fine.  But we need per cpuset reservation to
preserve current hugetlb semantics on shared mapping.


> > Hmm, I see what you mean, but cpusets are already broken because we use
> > the global resv_huge_pages counter.  I realize that's what the
> > cpuset_mems_nr() thing was meant to address but it's not correct.
>
> Well the global reserve counter causes a big reduction in performance
> since it requires the serialization of the hugetlb faults. Could we please
> get this straightened out? This serialization somehow snuck in when I was
> not looking and it screws up multiple things.

Sadly, global serialization has some nice property.  It is now used in
three paths that I'm aware of:
(1) shared mapping reservation count
(2) linked list protection in unmap_hugepage_range
(3) shared page table on hugetlb mapping.

i suppose (2) and (3) can be moved into per-inode lock?


* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-13 21:38         ` Ken Chen
@ 2007-07-13 21:47           ` Christoph Lameter
  2007-07-13 22:21           ` Paul Jackson
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2007-07-13 21:47 UTC (permalink / raw)
  To: Ken Chen; +Cc: Adam Litke, Paul Jackson, linux-mm, mel, apw, wli

On Fri, 13 Jul 2007, Ken Chen wrote:

> > since it requires the serialization of the hugetlb faults. Could we please
> > get this straigthened out? This serialization somehow snuck in when I was
> > not looking and it screws up multiple things.
> 
> Sadly, global serialization has some nice property.  It is now used in
> three paths that I'm aware of:
> (1) shared mapping reservation count
> (2) linked list protection in unmap_hugepage_range
> (3) shared page table on hugetlb mapping.
> 
> i suppose (2) and (3) can be moved into per-inode lock?

Could we just leave the reservation system off and just enable it when 
something like DB2 runs that needs it?

We should be using standard locking conventions for regular 
pages as much as possible.



* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-13 21:38         ` Ken Chen
  2007-07-13 21:47           ` Christoph Lameter
@ 2007-07-13 22:21           ` Paul Jackson
  1 sibling, 0 replies; 29+ messages in thread
From: Paul Jackson @ 2007-07-13 22:21 UTC (permalink / raw)
  To: Ken Chen; +Cc: clameter, agl, linux-mm, mel, apw, wli

Ken wrote:
> But we need per cpuset reservation to
> preserve current hugetlb semantics on shared mapping.

Would it make sense to reserve on each node N/M hugetlb pages, where N
is how many hugetlb pages we needed in total for that jobs request,
and M is how many nodes are in the current cpuset:
	nodes_weight(task->mems_allowed)
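
In rough, purely illustrative code (the reservation helper at the end is made
up, and N stands for the job's total huge page request):

  nodemask_t allowed = cpuset_mems_allowed(current);
  int nr_nodes = nodes_weight(allowed);
  int per_node = (N + nr_nodes - 1) / nr_nodes;       /* round up */
  int nid;

  for_each_node_mask(nid, allowed)
          reserve_hugepages_on_node(nid, per_node);   /* hypothetical helper */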

The general case of nested cpusets is probably too weird to worry much
about, but if we have say three long running jobs in non-overlapping
cpusets, which have differing hugetlb needs (perhaps one of them use
no hugetlb pages, one uses a few, and one uses alot), then can we get
that working, so that each job has the number of hugetlb pages it needs,
spread reasonably uniformly across the nodes it is using.

This could even involve an explicit request, when the job started up,
from userland to the kernel, clearing out any existing hugetlb pages,
so that left over non-uniformities in the spread of hugetlb pages, or
excess allocation of them by prior jobs, don't intrude on the new job.

It would be good to get to the point where the start of a long running
job, on a set of nodes that it pretty much owns exclusively, is like the
system boot point has been until now: the job could wipe the slate clean
and set up a new set of hugetlb pages, in whatever balance it requires
(uniformly spread, or differing numbers on particular nodes in that
cpuset), assuming the job is willing to be sufficiently well behaved in
its requests.

Then if ill behaved, convoluted or overlapping uses are tried, it's ok
if we kind of stumble along, not looking too pretty in what hugetlb
pages go where, just so long as we don't crash and don't oom fail when
there is mucho free and contiguous memory left.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-13 21:05     ` Adam Litke
                         ` (2 preceding siblings ...)
  2007-07-13 21:38       ` Paul Jackson
@ 2007-07-13 23:15       ` Nish Aravamudan
  3 siblings, 0 replies; 29+ messages in thread
From: Nish Aravamudan @ 2007-07-13 23:15 UTC (permalink / raw)
  To: Adam Litke; +Cc: Paul Jackson, linux-mm, mel, apw, wli, clameter, kenchen

On 7/13/07, Adam Litke <agl@us.ibm.com> wrote:
> On Fri, 2007-07-13 at 13:05 -0700, Paul Jackson wrote:
> > Adam wrote:
> > > +   /*
> > > +    * I haven't figured out how to incorporate this cpuset bodge into
> > > +    * the dynamic hugetlb pool yet.  Hopefully someone more familiar with
> > > +    * cpusets can weigh in on their desired semantics.  Maybe we can just
> > > +    * drop this check?
> > > +    *
> > >     if (chg > cpuset_mems_nr(free_huge_pages_node))
> > >             return -ENOMEM;
> > > +    */
> >
> > I can't figure out the value of this check either -- Ken Chen added it, perhaps
> > he can comment.
>
> To be honest, I just don't think a global hugetlb pool and cpusets are
> compatible, period.  I wonder if moving to the mempool interface and
> having dynamic adjustable per-cpuset hugetlb mempools (ick) could make
> things work saner.  It's on my list to see if mempools could be used to
> replace the custom hugetlb pool code.  Otherwise, Mel's zone_movable
> stuff could possibly remove the need for hugetlb pools as we know them.
>
> > But the cpuset behaviour of this hugetlb stuff looks suspicious to me:
> >  1) The code in alloc_fresh_huge_page() seems to round robin over
> >     the entire system, spreading the hugetlb pages uniformly on all nodes.
> >     If a task in one small cpuset starts aggressively allocating hugetlb
> >     pages, do you think this will work, Adam -- looks to me like we will end
> >     up calling alloc_fresh_huge_page() many times, most of which will fail to
> >     alloc_pages_node() anything because the 'static nid' clock hand will be
> >     pointing at a node outside of the current tasks cpuset (not in that tasks
> >     mems_allowed).  Inefficient, but I guess ok.
>
> Very good point.  I guess we call alloc_fresh_huge_page in two scenarios
> now... 1) By echoing a number into /proc/sys/vm/nr_hugepages, and 2) by
> trying to dynamically increase the pool size for a particular process.
> Case 1 is not in the context of any process (per se) and so
> node_online_map makes sense.  For case 2 we could teach the
> __alloc_fresh_huge_page() to take a nodemask.  That could get nasty
> though since we'd have to move away from a static variable to get proper
> interleaving.

<snip>

<snip>

> Perhaps if we make sure __alloc_fresh_huge_page() can be restricted to a
> nodemask then we can avoid stealing pages from other cpusets.  But we'd
> still be stuck with the existing problem for shared mappings: cpusets +
> our strict_reservation algorithm cannot provide guarantees (like we can
> without cpusets).

<snip>

> I'll cook up a __alloc_fresh_huge_page(nodemask) patch and see if that
> makes things better.  Thanks for your review and comments.

Already done, to some extent. Please see my set of three patches
(which I'll be posting again shortly), which stack on Christoph's
memoryless nodes patches. The first, which fixes hugepage interleaving
on memoryless node systems, adds a mempolicy to
alloc_fresh_huge_page(). The second numafies most of the hugetlb.c API
to make things a little clearer. It might make sense to rebase some of
these patches on those changes. The third adds a per-node sysfs
interface for hugepage allocation. I think given those three, we might
be able to make cpusets and hugepages coexist more easily.

I'll post soon, just waiting for some test results to return.

Thanks,
Nish


* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-13 21:38       ` Paul Jackson
@ 2007-07-17 23:42         ` Nish Aravamudan
  2007-07-18 14:44           ` Lee Schermerhorn
  0 siblings, 1 reply; 29+ messages in thread
From: Nish Aravamudan @ 2007-07-17 23:42 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Adam Litke, linux-mm, mel, apw, wli, clameter, kenchen

On 7/13/07, Paul Jackson <pj@sgi.com> wrote:
> Adam wrote:
> > To be honest, I just don't think a global hugetlb pool and cpusets are
> > compatible, period.
>
> It's not an easy fit, that's for sure ;).

In the context of my patches to make the hugetlb pool's interleave
work with memoryless nodes, I may have pseudo-solution for growing the
pool while respecting cpusets.

Essentially, given that GFP_THISNODE allocations stay on the node
requested (which is the case after Christoph's set of memoryless node
patches go in), we invoke:

  pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_MEMORY])

in the two callers of alloc_fresh_huge_page(pol) in hugetlb.c.
alloc_fresh_huge_page() in turn invokes interleave_nodes(pol) so that
we request hugepages in an interleaved fashion over all nodes with
memory.

Now, what I'm wondering is why interleave_nodes() is not cpuset aware?
Or is it expected that the caller do the right thing with the policy
beforehand? If so, I think I could just make those two callers do

  pol = mpol_new(MPOL_INTERLEAVE, cpuset_mems_allowed(current))

?

Or am I way off here?

Thanks,
Nish


* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-17 23:42         ` Nish Aravamudan
@ 2007-07-18 14:44           ` Lee Schermerhorn
  2007-07-18 15:17             ` Nish Aravamudan
  0 siblings, 1 reply; 29+ messages in thread
From: Lee Schermerhorn @ 2007-07-18 14:44 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Paul Jackson, Adam Litke, linux-mm, mel, apw, wli, clameter,
	kenchen, Paul Mundt

On Tue, 2007-07-17 at 16:42 -0700, Nish Aravamudan wrote:
> On 7/13/07, Paul Jackson <pj@sgi.com> wrote:
> > Adam wrote:
> > > To be honest, I just don't think a global hugetlb pool and cpusets are
> > > compatible, period.
> >
> > It's not an easy fit, that's for sure ;).
> 
> In the context of my patches to make the hugetlb pool's interleave
> work with memoryless nodes, I may have pseudo-solution for growing the
> pool while respecting cpusets.
> 
> Essentially, given that GFP_THISNODE allocations stay on the node
> requested (which is the case after Christoph's set of memoryless node
> patches go in), we invoke:
> 
>   pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_MEMORY])
> 
> in the two callers of alloc_fresh_huge_page(pol) in hugetlb.c.
> alloc_fresh_huge_page() in turn invokes interleave_nodes(pol) so that
> we request hugepages in an interleaved fashion over all nodes with
> memory.
> 
> Now, what I'm wondering is why interleave_nodes() is not cpuset aware?
> Or is it expected that the caller do the right thing with the policy
> beforehand? If so, I think I could just make those two callers do
> 
>   pol = mpol_new(MPOL_INTERLEAVE, cpuset_mems_allowed(current))
> 
> ?
> 
> Or am I way off here?


Nish:

I have always considered the huge page pool, as populated by
alloc_fresh_huge_page() in response to changes in nr_hugepages, to be a
system global resource.  I think the system "does the right
thing"--well, almost--with Christoph's memoryless patches and your
hugetlb patches.  Certainly, the huge pages allocated at boot time,
based on the command line parameter, are system-wide.  cpusets have not
been set up at that time.  

It requires privilege to write to the nr_hugepages sysctl, so allowing
it to spread pages across all available nodes [with memory], regardless
of cpusets, makes sense to me.  Altho' I don't expect many folks are
currently changing nr_hugepages from within a constrained cpuset, I
wouldn't want to see us change existing behavior, in this respect.  Your
per node attributes will provide the mechanism to allocate different
numbers of hugepages for, e.g., nodes in cpusets that have applications
that need them.

Re: the "well, almost":  nr_hugepages is still "broken" for me on some
of my platforms where the interleaved, dma-only pseudo-node contains
sufficient memory to satisfy a hugepage request.  I'll end up with a few
hugepages consuming most of the dma memory.  Consuming the dma isn't the
issue--there should be enough remaining for any dma needs.  I just want
more control over what gets placed on the interleaved pseudo-node by
default.  I think that Paul Mundt [added to cc list] has similar
concerns about default policies on the sh platforms.  I have some ideas,
but I'm waiting for the memoryless nodes and your patches to stabilize
in the mm tree.


* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-18 14:44           ` Lee Schermerhorn
@ 2007-07-18 15:17             ` Nish Aravamudan
  2007-07-18 16:02               ` Lee Schermerhorn
  0 siblings, 1 reply; 29+ messages in thread
From: Nish Aravamudan @ 2007-07-18 15:17 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Paul Jackson, Adam Litke, linux-mm, mel, apw, wli, clameter,
	kenchen, Paul Mundt

On 7/18/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Tue, 2007-07-17 at 16:42 -0700, Nish Aravamudan wrote:
> > On 7/13/07, Paul Jackson <pj@sgi.com> wrote:
> > > Adam wrote:
> > > > To be honest, I just don't think a global hugetlb pool and cpusets are
> > > > compatible, period.
> > >
> > > It's not an easy fit, that's for sure ;).
> >
> > In the context of my patches to make the hugetlb pool's interleave
> > work with memoryless nodes, I may have pseudo-solution for growing the
> > pool while respecting cpusets.
> >
> > Essentially, given that GFP_THISNODE allocations stay on the node
> > requested (which is the case after Christoph's set of memoryless node
> > patches go in), we invoke:
> >
> >   pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_MEMORY])
> >
> > in the two callers of alloc_fresh_huge_page(pol) in hugetlb.c.
> > alloc_fresh_huge_page() in turn invokes interleave_nodes(pol) so that
> > we request hugepages in an interleaved fashion over all nodes with
> > memory.
> >
> > Now, what I'm wondering is why interleave_nodes() is not cpuset aware?
> > Or is it expected that the caller do the right thing with the policy
> > beforehand? If so, I think I could just make those two callers do
> >
> >   pol = mpol_new(MPOL_INTERLEAVE, cpuset_mems_allowed(current))
> >
> > ?
> >
> > Or am I way off here?
>
>
> Nish:
>
> I have always considered the huge page pool, as populated by
> alloc_fresh_huge_page() in response to changes in nr_hugepages, to be a
> system global resource.  I think the system "does the right
> thing"--well, almost--with Christoph's memoryless patches and your
> hugetlb patches.  Certainly, the huge pages allocated at boot time,
> based on the command line parameter, are system-wide.  cpusets have not
> been set up at that time.

I fully agree that hugepages are a global resource.

> It requires privilege to write to the nr_hugepages sysctl, so allowing
> it to spread pages across all available nodes [with memory], regardless
> of cpusets, makes sense to me.  Altho' I don't expect many folks are
> currently changing nr_hugepages from within a constrained cpuset, I
> wouldn't want to see us change existing behavior, in this respect.  Your
> per node attributes will provide the mechanism to allocate different
> numbers of hugepages for, e.g., nodes in cpusets that have applications
> that need them.

The issue is that with Adam's patches, the hugepage pool will grow on
demand, presuming the process owner's mlock limit is sufficiently
high. If said process were running within a constrained cpuset, it
seems slightly out-of-whack to allow it to grow the pool on other nodes
to satisfy the demand.
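
A minimal sketch of the locked_vm-style gate being discussed, for
reference (the helper name and placement are assumptions, not code
taken from Adam's patches):

  /*
   * Sketch only: allow pool growth for this mm only if the extra
   * base pages still fit under the task's locked-memory rlimit.
   */
  static int hugetlb_growth_allowed(struct mm_struct *mm)
  {
          unsigned long locked, limit;

          locked = mm->locked_vm + (HPAGE_SIZE >> PAGE_SHIFT);
          limit  = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
                          >> PAGE_SHIFT;

          return locked <= limit || capable(CAP_IPC_LOCK);
  }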

> Re: the "well, almost":  nr_hugepages is still "broken" for me on some
> of my platforms where the interleaved, dma-only pseudo-node contains
> sufficient memory to satisfy a hugepage request.  I'll end up with a few
> hugepages consuming most of the dma memory.  Consuming the dma isn't the
> issue--there should be enough remaining for any dma needs.  I just want
> more control over what gets placed on the interleaved pseudo-node by
> default.  I think that Paul Mundt [added to cc list] has similar
> concerns about default policies on the sh platforms.  I have some ideas,
> but I'm waiting for the memoryless nodes and your patches to stabilize
> in the mm tree.

And well, we're already 'broken' as far as I can tell with cpusets and
the hugepage pool. I'm just trying to decide if it's fixable as is, or
if we need extra cleverness. A simple hack would be to have the
interleave call use a callback that picks the appropriate mask depending
on whether CPUSETS is on or off (I don't want to use
cpuset_mems_allowed() unconditionally, because it returns
node_possible_map if !CPUSETS).

Thanks for the feedback. If folks are ok with the way things are, then
so be it. I was just hoping Paul might have some thoughts on how best
to avoid violating cpuset constraints with Adam's patches in the
context of my patches.

Thanks,
Nish

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-18 15:17             ` Nish Aravamudan
@ 2007-07-18 16:02               ` Lee Schermerhorn
  2007-07-18 21:16                 ` Nish Aravamudan
  2007-07-19  1:52                 ` Paul Mundt
  0 siblings, 2 replies; 29+ messages in thread
From: Lee Schermerhorn @ 2007-07-18 16:02 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Paul Jackson, Adam Litke, linux-mm, mel, apw, wli, clameter,
	kenchen, Paul Mundt

On Wed, 2007-07-18 at 08:17 -0700, Nish Aravamudan wrote:
> On 7/18/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > On Tue, 2007-07-17 at 16:42 -0700, Nish Aravamudan wrote:
> > > On 7/13/07, Paul Jackson <pj@sgi.com> wrote:
> > > > Adam wrote:
> > > > > To be honest, I just don't think a global hugetlb pool and cpusets are
> > > > > compatible, period.
> > > >
> > > > It's not an easy fit, that's for sure ;).
> > >
> > > In the context of my patches to make the hugetlb pool's interleave
> > > work with memoryless nodes, I may have pseudo-solution for growing the
> > > pool while respecting cpusets.
> > >
> > > Essentially, given that GFP_THISNODE allocations stay on the node
> > > requested (which is the case after Christoph's set of memoryless node
> > > patches go in), we invoke:
> > >
> > >   pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_MEMORY])
> > >
> > > in the two callers of alloc_fresh_huge_page(pol) in hugetlb.c.
> > > alloc_fresh_huge_page() in turn invokes interleave_nodes(pol) so that
> > > we request hugepages in an interleaved fashion over all nodes with
> > > memory.
> > >
> > > Now, what I'm wondering is why interleave_nodes() is not cpuset aware?
> > > Or is it expected that the caller do the right thing with the policy
> > > beforehand? If so, I think I could just make those two callers do
> > >
> > >   pol = mpol_new(MPOL_INTERLEAVE, cpuset_mems_allowed(current))
> > >
> > > ?
> > >
> > > Or am I way off here?
> >
> >
> > Nish:
> >
> > I have always considered the huge page pool, as populated by
> > alloc_fresh_huge_page() in response to changes in nr_hugepages, to be a
> > system global resource.  I think the system "does the right
> > thing"--well, almost--with Christoph's memoryless patches and your
> > hugetlb patches.  Certaintly, the huge pages allocated at boot time,
> > based on the command line parameter, are system-wide.  cpusets have not
> > been set up at that time.
> 
> I fully agree that hugepages are a global resource.
> 
> > It requires privilege to write to the nr_hugepages sysctl, so allowing
> > it to spread pages across all available nodes [with memory], regardless
> > of cpusets, makes sense to me.  Altho' I don't expect many folks are
> > currently changing nr_hugepages from within a constrained cpuset, I
> > wouldn't want to see us change existing behavior, in this respect.  Your
> > per node attributes will provide the mechanism to allocate different
> > numbers of hugepages for, e.g., nodes in cpusets that have applications
> > that need them.
> 
> The issue is that with Adam's patches, the hugepage pool will grow on
> demand, presuming the process owner's mlock limit is sufficiently
> high. If said process were running within a constrained cpuset, it
> seems slightly out-of-whack to allow it grow the pool on other nodes
> to satisfy the demand.

Ah, I see.  In that case, it might make sense to grow just for the
cpuset.  A couple of things come to mind tho':

1) we might want a per cpuset control to enable/disable hugetlb pool
growth on demand, or to limit the max size of the pool--especially if
the memories are not exclusively owned by the cpuset.  Otherwise,
non-privileged processes could grow the hugetlb pool in memories shared
with other cpusets [maybe the root cpuset?], thereby reducing the amount
of normal, managed pages available to the other cpusets.  Probably want
such a control in the absence of cpusets as well, if on-demand hugetlb
pool growth is implemented.  

2) per cpuset, on-demand hugetlb pool growth shouldn't affect the
behavior of the nr_hugepages sysctl--IMO, anyway.

3) managed "superpages" keeps sounding better and better ;-)

> 
> > Re: the "well, almost":  nr_hugepages is still "broken" for me on some
> > of my platforms where the interleaved, dma-only pseudo-node contains
> > sufficient memory to satisfy a hugepage request.  I'll end up with a few
> > hugepages consuming most of the dma memory.  Consuming the dma isn't the
> > issue--there should be enough remaining for any dma needs.  I just want
> > more control over what gets placed on the interleaved pseudo-node by
> > default.  I think that Paul Mundt [added to cc list] has similar
> > concerns about default policies on the sh platforms.  I have some ideas,
> > but I'm waiting for the memoryless nodes and your patches to stabilize
> > in the mm tree.
> 
> And well, we're already 'broken' as far as I can tell with cpusets and
> the hugepage pool. I'm just trying to decide if it's fixable as is, or
> if we need extra cleverness. A simple hack would be to just modify the
> interleave call with a callback that uses the appropriate mask if
> CPUSETS is on or off (I don't want to always use cpuset_mems_allowed()
> unconditionally, becuase it returns node_possible_map if !CPUSETS.

Maybe you want/need a cpuset_hugemems_allowed() that does "the right
thing" with and without cpusets?

> 
> Thanks for the feedback. If folks are ok with the way things are, then
> so be it. I was just hoping Paul might have some thoughts on how best
> to avoid violating cpuset constraints with Adam's patches in the
> context of my patches.

I'm not trying to discourage you, here.  I agree that cpusets, as useful
as I find them, do make things, uh, "interesting"--especially with
shared resources.

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-18 16:02               ` Lee Schermerhorn
@ 2007-07-18 21:16                 ` Nish Aravamudan
  2007-07-18 21:40                   ` Lee Schermerhorn
  2007-07-19  1:52                 ` Paul Mundt
  1 sibling, 1 reply; 29+ messages in thread
From: Nish Aravamudan @ 2007-07-18 21:16 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Paul Jackson, Adam Litke, linux-mm, mel, apw, wli, clameter,
	kenchen, Paul Mundt

On 7/18/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Wed, 2007-07-18 at 08:17 -0700, Nish Aravamudan wrote:
> > On 7/18/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > > On Tue, 2007-07-17 at 16:42 -0700, Nish Aravamudan wrote:
> > > > On 7/13/07, Paul Jackson <pj@sgi.com> wrote:
> > > > > Adam wrote:
> > > > > > To be honest, I just don't think a global hugetlb pool and cpusets are
> > > > > > compatible, period.
> > > > >
> > > > > It's not an easy fit, that's for sure ;).
> > > >
> > > > In the context of my patches to make the hugetlb pool's interleave
> > > > work with memoryless nodes, I may have pseudo-solution for growing the
> > > > pool while respecting cpusets.
> > > >
> > > > Essentially, given that GFP_THISNODE allocations stay on the node
> > > > requested (which is the case after Christoph's set of memoryless node
> > > > patches go in), we invoke:
> > > >
> > > >   pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_MEMORY])
> > > >
> > > > in the two callers of alloc_fresh_huge_page(pol) in hugetlb.c.
> > > > alloc_fresh_huge_page() in turn invokes interleave_nodes(pol) so that
> > > > we request hugepages in an interleaved fashion over all nodes with
> > > > memory.
> > > >
> > > > Now, what I'm wondering is why interleave_nodes() is not cpuset aware?
> > > > Or is it expected that the caller do the right thing with the policy
> > > > beforehand? If so, I think I could just make those two callers do
> > > >
> > > >   pol = mpol_new(MPOL_INTERLEAVE, cpuset_mems_allowed(current))
> > > >
> > > > ?
> > > >
> > > > Or am I way off here?
> > >
> > >
> > > Nish:
> > >
> > > I have always considered the huge page pool, as populated by
> > > alloc_fresh_huge_page() in response to changes in nr_hugepages, to be a
> > > system global resource.  I think the system "does the right
> > > thing"--well, almost--with Christoph's memoryless patches and your
> > > hugetlb patches.  Certaintly, the huge pages allocated at boot time,
> > > based on the command line parameter, are system-wide.  cpusets have not
> > > been set up at that time.
> >
> > I fully agree that hugepages are a global resource.
> >
> > > It requires privilege to write to the nr_hugepages sysctl, so allowing
> > > it to spread pages across all available nodes [with memory], regardless
> > > of cpusets, makes sense to me.  Altho' I don't expect many folks are
> > > currently changing nr_hugepages from within a constrained cpuset, I
> > > wouldn't want to see us change existing behavior, in this respect.  Your
> > > per node attributes will provide the mechanism to allocate different
> > > numbers of hugepages for, e.g., nodes in cpusets that have applications
> > > that need them.
> >
> > The issue is that with Adam's patches, the hugepage pool will grow on
> > demand, presuming the process owner's mlock limit is sufficiently
> > high. If said process were running within a constrained cpuset, it
> > seems slightly out-of-whack to allow it grow the pool on other nodes
> > to satisfy the demand.
>
> Ah, I see.  In that case, it might make sense to grow just for the
> cpuset.  A couple of things come to mind tho':
>
> 1) we might want a per cpuset control to enable/disable hugetlb pool
> growth on demand, or to limit the max size of the pool--especially if
> the memories are not exclusively owned by the cpuset.  Otherwise,
> non-privileged processes could grow the hugetlb pool in memories shared
> with other cpusets [maybe the root cpuset?], thereby reducing the amount
> of normal, managed pages available to the other cpusets.  Probably want
> such a control in the absense of cpusets as well, if on-demand hugetlb
> pool growth is implemented.

Well, the current restriction is on a per-process basis for locked
memory. But it might make sense to add a separate rlimit for hugepages
and then just allow cpusets to restrict that rlimit for processes
contained therein?

Similar would probably hold for the non-cpuset case?

But that seems like special casing for hugetlb pages where small pages
don't have the same restriction. If two cpusets share the same node,
can't one exhaust the node and thus starve the other cpuset? At that
point you need more than cpusets (arguably) and want resource
management at some level.
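
For concreteness, the separate-rlimit idea above might take a shape
like this (purely hypothetical: neither RLIMIT_HUGEPAGES nor the
per-mm counter exists):

  /* hypothetical per-process hugepage budget check */
  unsigned long limit;

  limit = current->signal->rlim[RLIMIT_HUGEPAGES].rlim_cur;
  if (mm->huge_pages_reserved + 1 > limit)
          return -ENOMEM;         /* over the hugepage rlimit */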

> 2) per cpuset, on-demand hugetlb pool growth shouldn't affect the
> behavior of the nr_hugepages sysctl--IMO, anyway.

Right, it doesn't as of right now. But we have an existing issue
(independent of hugetlb pool growth, just made more apparent that way)
for cpusets and run-time growth of the pool (which is more likely to
succeed (and perhaps happen) with Mel's patches, recently added to
Linus' tree). So I'm just trying to decide if it will be sufficient to
just obey the cpuset's allocation restrictions, if they have any.

> 3) managed "superpages" keeps sounding better and better ;-)

Preaching to the choir ... Still, have customers to support with the
current solution and want to do right by them, of course.

> > > Re: the "well, almost":  nr_hugepages is still "broken" for me on some
> > > of my platforms where the interleaved, dma-only pseudo-node contains
> > > sufficient memory to satisfy a hugepage request.  I'll end up with a few
> > > hugepages consuming most of the dma memory.  Consuming the dma isn't the
> > > issue--there should be enough remaining for any dma needs.  I just want
> > > more control over what gets placed on the interleaved pseudo-node by
> > > default.  I think that Paul Mundt [added to cc list] has similar
> > > concerns about default policies on the sh platforms.  I have some ideas,
> > > but I'm waiting for the memoryless nodes and your patches to stabilize
> > > in the mm tree.
> >
> > And well, we're already 'broken' as far as I can tell with cpusets and
> > the hugepage pool. I'm just trying to decide if it's fixable as is, or
> > if we need extra cleverness. A simple hack would be to just modify the
> > interleave call with a callback that uses the appropriate mask if
> > CPUSETS is on or off (I don't want to always use cpuset_mems_allowed()
> > unconditionally, becuase it returns node_possible_map if !CPUSETS.
>
> Maybe you want/need a cpuset_hugemems_allowed() that does "the right
> thing" with and without cpusets?

Yeah, that would be the callback I'd use (most likely). Would make
sense to just put it with the other cpuset code, though, good idea.

> > Thanks for the feedback. If folks are ok with the way things are, then
> > so be it. I was just hoping Paul might have some thoughts on how best
> > to avoid violating cpuset constraints with Adam's patches in the
> > context of my patches.
>
> I'm not trying to discourage you, here.  I agree that cpusets, as useful
> as I find them, do make things, uh, "interesting"--especially with
> shared resources.

Definitely. Just wanted to get some input. Will hopefully get around
to making my callback change suggested above to restrict pool growth
and will test on some NUMA boxen today or tomorrow.

Thanks,
Nish

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-18 21:16                 ` Nish Aravamudan
@ 2007-07-18 21:40                   ` Lee Schermerhorn
  0 siblings, 0 replies; 29+ messages in thread
From: Lee Schermerhorn @ 2007-07-18 21:40 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Paul Jackson, Adam Litke, linux-mm, mel, apw, wli, clameter,
	kenchen, Paul Mundt

On Wed, 2007-07-18 at 14:16 -0700, Nish Aravamudan wrote:
> On 7/18/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > On Wed, 2007-07-18 at 08:17 -0700, Nish Aravamudan wrote:
> > > On 7/18/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > > > On Tue, 2007-07-17 at 16:42 -0700, Nish Aravamudan wrote:
> > > > > On 7/13/07, Paul Jackson <pj@sgi.com> wrote:
> > > > > > Adam wrote:
> > > > > > > To be honest, I just don't think a global hugetlb pool and cpusets are
> > > > > > > compatible, period.
> > > > > >
> > > > > > It's not an easy fit, that's for sure ;).
> > > > >
> > > > > In the context of my patches to make the hugetlb pool's interleave
> > > > > work with memoryless nodes, I may have pseudo-solution for growing the
> > > > > pool while respecting cpusets.
> > > > >
> > > > > Essentially, given that GFP_THISNODE allocations stay on the node
> > > > > requested (which is the case after Christoph's set of memoryless node
> > > > > patches go in), we invoke:
> > > > >
> > > > >   pol = mpol_new(MPOL_INTERLEAVE, &node_states[N_MEMORY])
> > > > >
> > > > > in the two callers of alloc_fresh_huge_page(pol) in hugetlb.c.
> > > > > alloc_fresh_huge_page() in turn invokes interleave_nodes(pol) so that
> > > > > we request hugepages in an interleaved fashion over all nodes with
> > > > > memory.
> > > > >
> > > > > Now, what I'm wondering is why interleave_nodes() is not cpuset aware?
> > > > > Or is it expected that the caller do the right thing with the policy
> > > > > beforehand? If so, I think I could just make those two callers do
> > > > >
> > > > >   pol = mpol_new(MPOL_INTERLEAVE, cpuset_mems_allowed(current))
> > > > >
> > > > > ?
> > > > >
> > > > > Or am I way off here?
> > > >
> > > >
> > > > Nish:
> > > >
> > > > I have always considered the huge page pool, as populated by
> > > > alloc_fresh_huge_page() in response to changes in nr_hugepages, to be a
> > > > system global resource.  I think the system "does the right
> > > > thing"--well, almost--with Christoph's memoryless patches and your
> > > > hugetlb patches.  Certaintly, the huge pages allocated at boot time,
> > > > based on the command line parameter, are system-wide.  cpusets have not
> > > > been set up at that time.
> > >
> > > I fully agree that hugepages are a global resource.
> > >
> > > > It requires privilege to write to the nr_hugepages sysctl, so allowing
> > > > it to spread pages across all available nodes [with memory], regardless
> > > > of cpusets, makes sense to me.  Altho' I don't expect many folks are
> > > > currently changing nr_hugepages from within a constrained cpuset, I
> > > > wouldn't want to see us change existing behavior, in this respect.  Your
> > > > per node attributes will provide the mechanism to allocate different
> > > > numbers of hugepages for, e.g., nodes in cpusets that have applications
> > > > that need them.
> > >
> > > The issue is that with Adam's patches, the hugepage pool will grow on
> > > demand, presuming the process owner's mlock limit is sufficiently
> > > high. If said process were running within a constrained cpuset, it
> > > seems slightly out-of-whack to allow it grow the pool on other nodes
> > > to satisfy the demand.
> >
> > Ah, I see.  In that case, it might make sense to grow just for the
> > cpuset.  A couple of things come to mind tho':
> >
> > 1) we might want a per cpuset control to enable/disable hugetlb pool
> > growth on demand, or to limit the max size of the pool--especially if
> > the memories are not exclusively owned by the cpuset.  Otherwise,
> > non-privileged processes could grow the hugetlb pool in memories shared
> > with other cpusets [maybe the root cpuset?], thereby reducing the amount
> > of normal, managed pages available to the other cpusets.  Probably want
> > such a control in the absense of cpusets as well, if on-demand hugetlb
> > pool growth is implemented.
> 
> Well, the current restriction is on a per-process basis for locked
> memory. But it might make sense to add a separate rlimit for hugepages
> and then just allow cpusets to restrict that rlimit for processes
> contained therein?
> 
> Similar would probably hold for the non-cpuset case?
> 
> But that seems like special casing for hugetlb pages where small pages
> don't have the same restriction. If two cpusets share the same node,
> can't one exhaust the node and thus starve the other cpuset? At that
> point you need more than cpusets (arguably) and want resource
> management at some level.
> 

The difference I see is that "small pages" are "managed"--i.e., can be
reclaimed if not locked.  And you've already pointed out that we have a
resource limit on locking regular/small pages.  Huge pages are not
managed [unless Adam plans on tackling that as well!], so they are
effectively locked.  I guess that by limiting the number of pages any
process could attach with another resource limit, we would limit the
growth of the huge page pool.  However, multiple processes in a cpuset
could attach different huge pages, thus growing the pool at the expense
of other cpusets.  No different from locked pages, huh?

Maybe just a system wide limit on the maximum size of the huge page
pool--i.e., on how large it can grow dynamically--is sufficient.
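
Such a cap could be checked wherever a surplus page would be
allocated, along these lines (nr_overcommit_huge_pages is an assumed
sysctl name, not something in the posted patches):

  spin_lock(&hugetlb_lock);
  if (surplus_huge_pages >= nr_overcommit_huge_pages) {
          /* dynamic pool already at its system-wide limit */
          spin_unlock(&hugetlb_lock);
          return NULL;
  }
  surplus_huge_pages++;
  spin_unlock(&hugetlb_lock);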

<snip remainder of discussion>

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-18 16:02               ` Lee Schermerhorn
  2007-07-18 21:16                 ` Nish Aravamudan
@ 2007-07-19  1:52                 ` Paul Mundt
  2007-07-20 20:35                   ` Nish Aravamudan
  1 sibling, 1 reply; 29+ messages in thread
From: Paul Mundt @ 2007-07-19  1:52 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Nish Aravamudan, Paul Jackson, Adam Litke, linux-mm, mel, apw,
	wli, clameter, kenchen

On Wed, Jul 18, 2007 at 12:02:03PM -0400, Lee Schermerhorn wrote:
> On Wed, 2007-07-18 at 08:17 -0700, Nish Aravamudan wrote:
> > On 7/18/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > > I have always considered the huge page pool, as populated by
> > > alloc_fresh_huge_page() in response to changes in nr_hugepages, to be a
> > > system global resource.  I think the system "does the right
> > > thing"--well, almost--with Christoph's memoryless patches and your
> > > hugetlb patches.  Certaintly, the huge pages allocated at boot time,
> > > based on the command line parameter, are system-wide.  cpusets have not
> > > been set up at that time.
> > 
> > I fully agree that hugepages are a global resource.
> > 
> > > It requires privilege to write to the nr_hugepages sysctl, so allowing
> > > it to spread pages across all available nodes [with memory], regardless
> > > of cpusets, makes sense to me.  Altho' I don't expect many folks are
> > > currently changing nr_hugepages from within a constrained cpuset, I
> > > wouldn't want to see us change existing behavior, in this respect.  Your
> > > per node attributes will provide the mechanism to allocate different
> > > numbers of hugepages for, e.g., nodes in cpusets that have applications
> > > that need them.
> > 
> > The issue is that with Adam's patches, the hugepage pool will grow on
> > demand, presuming the process owner's mlock limit is sufficiently
> > high. If said process were running within a constrained cpuset, it
> > seems slightly out-of-whack to allow it grow the pool on other nodes
> > to satisfy the demand.
> 
> Ah, I see.  In that case, it might make sense to grow just for the
> cpuset.  A couple of things come to mind tho':
> 
> 1) we might want a per cpuset control to enable/disable hugetlb pool
> growth on demand, or to limit the max size of the pool--especially if
> the memories are not exclusively owned by the cpuset.  Otherwise,
> non-privileged processes could grow the hugetlb pool in memories shared
> with other cpusets [maybe the root cpuset?], thereby reducing the amount
> of normal, managed pages available to the other cpusets.  Probably want
> such a control in the absense of cpusets as well, if on-demand hugetlb
> pool growth is implemented.  
> 
I don't see that the two are mutually exclusive. Hugetlb pools have to be
node-local anyways due to the varying distances, so perhaps the global
resource thing is the wrong way to approach it. There are already hooks
for spreading slab and page cache pages in cpusets, perhaps it makes
sense to add a hugepage spread variant to balance across the constrained
set?

nr_hugepages is likely something that should still be global, so that
the sum of the hugepages in the per-node pools doesn't exceed this value.

It would be quite nice to have some way to have nodes opt-in to the sort
of behaviour they're willing to tolerate. Some nodes are never going to
tolerate spreading of any sort, hugepages, and so forth. Perhaps it makes
more sense to have some flags in the pgdat where we can more strongly
type the sort of behaviour the node is willing to put up with (or capable
of supporting); at least that way, the nodes that explicitly can't
cope are factored out before we even get to cpuset constraints (plus this
gives us a hook for setting up the interleave nodes in both the system
init and default policies). Thoughts?
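
One possible shape for that, as a rough sketch (it assumes a flags
word in pg_data_t and an opt-out bit, neither of which exists today):

  #define PGDAT_NO_HUGEPAGE_SPREAD        (1UL << 0)

  static inline int node_accepts_hugepages(int nid)
  {
          return !(NODE_DATA(nid)->flags & PGDAT_NO_HUGEPAGE_SPREAD);
  }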

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-19  1:52                 ` Paul Mundt
@ 2007-07-20 20:35                   ` Nish Aravamudan
  2007-07-20 20:53                     ` Lee Schermerhorn
  2007-07-21 16:57                     ` Paul Mundt
  0 siblings, 2 replies; 29+ messages in thread
From: Nish Aravamudan @ 2007-07-20 20:35 UTC (permalink / raw)
  To: Paul Mundt, Lee Schermerhorn, Nish Aravamudan, Paul Jackson,
	Adam Litke, linux-mm, mel, apw, wli, clameter, kenchen

On 7/18/07, Paul Mundt <lethal@linux-sh.org> wrote:
> On Wed, Jul 18, 2007 at 12:02:03PM -0400, Lee Schermerhorn wrote:
> > On Wed, 2007-07-18 at 08:17 -0700, Nish Aravamudan wrote:
> > > On 7/18/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > > > I have always considered the huge page pool, as populated by
> > > > alloc_fresh_huge_page() in response to changes in nr_hugepages, to be a
> > > > system global resource.  I think the system "does the right
> > > > thing"--well, almost--with Christoph's memoryless patches and your
> > > > hugetlb patches.  Certaintly, the huge pages allocated at boot time,
> > > > based on the command line parameter, are system-wide.  cpusets have not
> > > > been set up at that time.
> > >
> > > I fully agree that hugepages are a global resource.
> > >
> > > > It requires privilege to write to the nr_hugepages sysctl, so allowing
> > > > it to spread pages across all available nodes [with memory], regardless
> > > > of cpusets, makes sense to me.  Altho' I don't expect many folks are
> > > > currently changing nr_hugepages from within a constrained cpuset, I
> > > > wouldn't want to see us change existing behavior, in this respect.  Your
> > > > per node attributes will provide the mechanism to allocate different
> > > > numbers of hugepages for, e.g., nodes in cpusets that have applications
> > > > that need them.
> > >
> > > The issue is that with Adam's patches, the hugepage pool will grow on
> > > demand, presuming the process owner's mlock limit is sufficiently
> > > high. If said process were running within a constrained cpuset, it
> > > seems slightly out-of-whack to allow it grow the pool on other nodes
> > > to satisfy the demand.
> >
> > Ah, I see.  In that case, it might make sense to grow just for the
> > cpuset.  A couple of things come to mind tho':
> >
> > 1) we might want a per cpuset control to enable/disable hugetlb pool
> > growth on demand, or to limit the max size of the pool--especially if
> > the memories are not exclusively owned by the cpuset.  Otherwise,
> > non-privileged processes could grow the hugetlb pool in memories shared
> > with other cpusets [maybe the root cpuset?], thereby reducing the amount
> > of normal, managed pages available to the other cpusets.  Probably want
> > such a control in the absense of cpusets as well, if on-demand hugetlb
> > pool growth is implemented.
> >
> I don't see that the two are mutually exclusive. Hugetlb pools have to be
> node-local anyways due to the varying distances, so perhaps the global
> resource thing is the wrong way to approach it. There are already hooks
> for spreading slab and page cache pages in cpusets, perhaps it makes
> sense to add a hugepage spread variant to balance across the constrained
> set?

I'm not sure I understand why you say "hugetlb pools"? There is no
plural in the kernel, there is only the global pool. Now, on NUMA
machines, yes, the pool is spread across nodes, but, well, that's just
because of where the memory is. We already spread out the allocation
of hugepages across all NUMA nodes (or will, once my patches go in).
And I think with my earlier suggestion (of just changing the
interleave mask used for those allocations to be cpuset-aware), that
we'd spread across the cpuset too, if there is one. Is that what you
mean by "spread variant"?

> nr_hugepages is likely something that should still be global, so the
> sum of the hugepages in the per-node pools don't exceed this value.

Yes, that is the case now. Well, mostly because they are always
incremented together or decremented together.

> It would be quite nice to have some way to have nodes opt-in to the sort
> of behaviour they're willing to tolerate. Some nodes are never going to
> tolerate spreading of any sort, hugepages, and so forth. Perhaps it makes
> more sense to have some flags in the pgdat where we can more strongly
> type the sort of behaviour the node is willing to put up with (or capable
> of supporting), at least in this case the nodes that explicitly can't
> cope are factored out before we even get to cpuset constraints (plus this
> gives us a hook for setting up the interleave nodes in both the system
> init and default policies). Thoughts?

I guess I don't understand which nodes you're talking about now? How
do you spread across any particular single node (how I read "Some
nodes are never going to tolerate spreading of any sort")? Or do you
mean that some cpusets aren't going to want to spread (interleave?).

Oh, are you trying to say that some nodes should be dropped from
interleave masks (explicitly excluded from all possible interleave
masks)? What kind of nodes would these be? We're doing something
similar to deal with memoryless nodes, perhaps it could be
generalized?

Thanks,
Nish

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-20 20:35                   ` Nish Aravamudan
@ 2007-07-20 20:53                     ` Lee Schermerhorn
  2007-07-20 21:12                       ` Nish Aravamudan
  2007-07-21 16:57                     ` Paul Mundt
  1 sibling, 1 reply; 29+ messages in thread
From: Lee Schermerhorn @ 2007-07-20 20:53 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Paul Mundt, Paul Jackson, Adam Litke, linux-mm, mel, apw, wli,
	clameter, kenchen

On Fri, 2007-07-20 at 13:35 -0700, Nish Aravamudan wrote:
> On 7/18/07, Paul Mundt <lethal@linux-sh.org> wrote:
<snip>
> > It would be quite nice to have some way to have nodes opt-in to the sort
> > of behaviour they're willing to tolerate. Some nodes are never going to
> > tolerate spreading of any sort, hugepages, and so forth. Perhaps it makes
> > more sense to have some flags in the pgdat where we can more strongly
> > type the sort of behaviour the node is willing to put up with (or capable
> > of supporting), at least in this case the nodes that explicitly can't
> > cope are factored out before we even get to cpuset constraints (plus this
> > gives us a hook for setting up the interleave nodes in both the system
> > init and default policies). Thoughts?
> 
> I guess I don't understand which nodes you're talking about now? How
> do you spread across any particular single node (how I read "Some
> nodes are never going to tolerate spreading of any sort")? Or do you
> mean that some cpusets aren't going to want to spread (interleave?).
> 
> Oh, are you trying to say that some nodes should be dropped from
> interleave masks (explicitly excluded from all possible interleave
> masks)? What kind of nodes would these be? We're doing something
> similar to deal with memoryless nodes, perhaps it could be
> generalized?

If that's what Paul means [and I think it is, based on a conversation
at OLS], I have a similar requirement.  I'd like to be able to specify,
on the command line, at least [run time reconfig not a hard requirement]
nodes to be excluded from interleave masks, including the hugetlb
allocation mask [if this is different from the regular interleaving
nodemask].  

And, I agree, I think we can add another node_states[] entry or two to
hold these nodes.  I'll try to work up a patch next week if no one beats
me to it.

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-20 20:53                     ` Lee Schermerhorn
@ 2007-07-20 21:12                       ` Nish Aravamudan
  0 siblings, 0 replies; 29+ messages in thread
From: Nish Aravamudan @ 2007-07-20 21:12 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Paul Mundt, Paul Jackson, Adam Litke, linux-mm, mel, apw, wli,
	clameter, kenchen

On 7/20/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Fri, 2007-07-20 at 13:35 -0700, Nish Aravamudan wrote:
> > On 7/18/07, Paul Mundt <lethal@linux-sh.org> wrote:
> <snip>
> > > It would be quite nice to have some way to have nodes opt-in to the sort
> > > of behaviour they're willing to tolerate. Some nodes are never going to
> > > tolerate spreading of any sort, hugepages, and so forth. Perhaps it makes
> > > more sense to have some flags in the pgdat where we can more strongly
> > > type the sort of behaviour the node is willing to put up with (or capable
> > > of supporting), at least in this case the nodes that explicitly can't
> > > cope are factored out before we even get to cpuset constraints (plus this
> > > gives us a hook for setting up the interleave nodes in both the system
> > > init and default policies). Thoughts?
> >
> > I guess I don't understand which nodes you're talking about now? How
> > do you spread across any particular single node (how I read "Some
> > nodes are never going to tolerate spreading of any sort")? Or do you
> > mean that some cpusets aren't going to want to spread (interleave?).
> >
> > Oh, are you trying to say that some nodes should be dropped from
> > interleave masks (explicitly excluded from all possible interleave
> > masks)? What kind of nodes would these be? We're doing something
> > similar to deal with memoryless nodes, perhaps it could be
> > generalized?
>
> If that's what Paul means [and I think it is, based on a converstation
> at OLS], I have a similar requirement.  I'd like to be able to specify,
> on the command line, at least [run time reconfig not a hard requirement]
> nodes to be excluded from interleave masks, including the hugetlb
> allocation mask [if this is different from the regular interleaving
> nodemask].

Right, this would avoid using that DMA node for your systems.

> And, I agree, I think we can add another node_states[] entry or two to
> hold these nodes.  I'll try to work up a patch next week if noone beats
> me to it.

Sounds good. I think the commandline interface might be a bit hairy --
but I'll leave that to you :)

So then, I'd say, by default the interleave masks should be and'd with
this node_states[N_INTERLEAVE], where, if not otherwise specified,
node_states[N_INTERLEAVE] == node_states[N_MEMORY]?
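
In code, that and'ing could look roughly like this in the hugetlb
allocation path, combined with the cpuset mask discussed earlier in
the thread (N_INTERLEAVE is the proposed state, not an existing one):

  nodemask_t mask;

  nodes_and(mask, node_states[N_INTERLEAVE],
            cpuset_mems_allowed(current));
  pol = mpol_new(MPOL_INTERLEAVE, &mask);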

Thanks,
Nish

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings
  2007-07-20 20:35                   ` Nish Aravamudan
  2007-07-20 20:53                     ` Lee Schermerhorn
@ 2007-07-21 16:57                     ` Paul Mundt
  1 sibling, 0 replies; 29+ messages in thread
From: Paul Mundt @ 2007-07-21 16:57 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Lee Schermerhorn, Paul Jackson, Adam Litke, linux-mm, mel, apw,
	wli, clameter, kenchen

On Fri, Jul 20, 2007 at 01:35:52PM -0700, Nish Aravamudan wrote:
> On 7/18/07, Paul Mundt <lethal@linux-sh.org> wrote:
> >On Wed, Jul 18, 2007 at 12:02:03PM -0400, Lee Schermerhorn wrote:
> >> On Wed, 2007-07-18 at 08:17 -0700, Nish Aravamudan wrote:
> >> > On 7/18/07, Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> >> > > I have always considered the huge page pool, as populated by
> >> > > alloc_fresh_huge_page() in response to changes in nr_hugepages, to 
> >be a
> >> > > system global resource.  I think the system "does the right
> >> > > thing"--well, almost--with Christoph's memoryless patches and your
> >> > > hugetlb patches.  Certaintly, the huge pages allocated at boot time,
> >> > > based on the command line parameter, are system-wide.  cpusets have 
> >not
> >> > > been set up at that time.
> >> >
> >> > I fully agree that hugepages are a global resource.
> >> >
> >> > > It requires privilege to write to the nr_hugepages sysctl, so 
> >allowing
> >> > > it to spread pages across all available nodes [with memory], 
> >regardless
> >> > > of cpusets, makes sense to me.  Altho' I don't expect many folks are
> >> > > currently changing nr_hugepages from within a constrained cpuset, I
> >> > > wouldn't want to see us change existing behavior, in this respect.  
> >Your
> >> > > per node attributes will provide the mechanism to allocate different
> >> > > numbers of hugepages for, e.g., nodes in cpusets that have 
> >applications
> >> > > that need them.
> >> >
> >> > The issue is that with Adam's patches, the hugepage pool will grow on
> >> > demand, presuming the process owner's mlock limit is sufficiently
> >> > high. If said process were running within a constrained cpuset, it
> >> > seems slightly out-of-whack to allow it grow the pool on other nodes
> >> > to satisfy the demand.
> >>
> >> Ah, I see.  In that case, it might make sense to grow just for the
> >> cpuset.  A couple of things come to mind tho':
> >>
> >> 1) we might want a per cpuset control to enable/disable hugetlb pool
> >> growth on demand, or to limit the max size of the pool--especially if
> >> the memories are not exclusively owned by the cpuset.  Otherwise,
> >> non-privileged processes could grow the hugetlb pool in memories shared
> >> with other cpusets [maybe the root cpuset?], thereby reducing the amount
> >> of normal, managed pages available to the other cpusets.  Probably want
> >> such a control in the absense of cpusets as well, if on-demand hugetlb
> >> pool growth is implemented.
> >>
> >I don't see that the two are mutually exclusive. Hugetlb pools have to be
> >node-local anyways due to the varying distances, so perhaps the global
> >resource thing is the wrong way to approach it. There are already hooks
> >for spreading slab and page cache pages in cpusets, perhaps it makes
> >sense to add a hugepage spread variant to balance across the constrained
> >set?
> 
> I'm not sure I understand why you say "hugetlb pools"? There is no
> plural in the kernel, there is only the global pool. Now, on NUMA
> machines, yes, the pool is spread across nodes, but, well, that's just
> because of where the memory is. We already spread out the allocation
> of hugepages across all NUMA nodes (or will, once my patches go in).
> And I think with my earlier suggestion (of just changing the
> interleave mask used for those allocations to be cpuset-aware), that
> we'd spread across the cpuset too, if there is one. Is that what you
> mean by "spread variant"?
> 
Yes, that's what I was referring to. The main thing is that there may
simply be nodes where we don't want to spread the huge pages (mostly due
to size constraints). For instance, nodes that don't make it into
the interleave map are a reasonable candidate for also never spreading
huge pages to.

> >It would be quite nice to have some way to have nodes opt-in to the sort
> >of behaviour they're willing to tolerate. Some nodes are never going to
> >tolerate spreading of any sort, hugepages, and so forth. Perhaps it makes
> >more sense to have some flags in the pgdat where we can more strongly
> >type the sort of behaviour the node is willing to put up with (or capable
> >of supporting), at least in this case the nodes that explicitly can't
> >cope are factored out before we even get to cpuset constraints (plus this
> >gives us a hook for setting up the interleave nodes in both the system
> >init and default policies). Thoughts?
> 
> I guess I don't understand which nodes you're talking about now? How
> do you spread across any particular single node (how I read "Some
> nodes are never going to tolerate spreading of any sort")? Or do you
> mean that some cpusets aren't going to want to spread (interleave?).
> 
> Oh, are you trying to say that some nodes should be dropped from
> interleave masks (explicitly excluded from all possible interleave
> masks)? What kind of nodes would these be? We're doing something
> similar to deal with memoryless nodes, perhaps it could be
> generalized?
> 
Correct. You can see some of the changes in mm/mempolicy.c:numa_policy_init()
for keeping nodes out of the system init policy. While we want to be able
to let the kernel manage the node and let applications do node-local
allocation, these nodes will never want slab pages or anything like that
due to the size constraints.

Christoph had posted some earlier slub patches for excluding certain
nodes from slub entirely; this may also be something you want to pick up
and work on for memoryless nodes. I've been opting for SLOB + NUMA on my
platforms, but if something like this is tidied up generically then slub
is certainly something to support as an alternative.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/5] [hugetlb] Introduce BASE_PAGES_PER_HPAGE constant
  2007-07-13 15:16 ` [PATCH 1/5] [hugetlb] Introduce BASE_PAGES_PER_HPAGE constant Adam Litke
@ 2007-07-23 19:43   ` Christoph Lameter
  2007-07-23 19:52     ` Adam Litke
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2007-07-23 19:43 UTC (permalink / raw)
  To: Adam Litke
  Cc: linux-mm, Mel Gorman, Andy Whitcroft, William Lee Irwin III, Ken Chen

On Fri, 13 Jul 2007 08:16:31 -0700
Adam Litke <agl@us.ibm.com> wrote:


> In many places throughout the kernel, the expression
> (HPAGE_SIZE/PAGE_SIZE) is used to convert quantities in huge page
> units to a number of base pages. Reduce redundancy and make the code
> more readable by introducing a constant BASE_PAGES_PER_HPAGE whose
> name more clearly conveys the intended conversion.

It may be better to put in a generic way of determining the number of
pages in a compound page.

Usually

1 << compound_order(page) will do the trick.

See also 

http://marc.info/?l=linux-kernel&m=118236495611300&w=2
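
For example, a caller needing the base-page count of a huge page
would then do something like:

  /* number of base pages backing this compound (huge) page */
  unsigned long nr_base_pages = 1UL << compound_order(page);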

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/5] [hugetlb] Introduce BASE_PAGES_PER_HPAGE constant
  2007-07-23 19:43   ` Christoph Lameter
@ 2007-07-23 19:52     ` Adam Litke
  0 siblings, 0 replies; 29+ messages in thread
From: Adam Litke @ 2007-07-23 19:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Mel Gorman, Andy Whitcroft, William Lee Irwin III, Ken Chen

On Mon, 2007-07-23 at 12:43 -0700, Christoph Lameter wrote:
> On Fri, 13 Jul 2007 08:16:31 -0700
> Adam Litke <agl@us.ibm.com> wrote:
> 
> 
> > In many places throughout the kernel, the expression
> > (HPAGE_SIZE/PAGE_SIZE) is used to convert quantities in huge page
> > units to a number of base pages. Reduce redundancy and make the code
> > more readable by introducing a constant BASE_PAGES_PER_HPAGE whose
> > name more clearly conveys the intended conversion.
> 
> It may be better to put in a generic way of determining the pages of a
> compound page.
> 
> Usually
> 
> 1 << compound_order(page) will do the trick.

Yes, that is much nicer, thanks!

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2007-07-23 19:52 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-13 15:16 [PATCH 0/5] [RFC] Dynamic hugetlb pool resizing Adam Litke
2007-07-13 15:16 ` [PATCH 1/5] [hugetlb] Introduce BASE_PAGES_PER_HPAGE constant Adam Litke
2007-07-23 19:43   ` Christoph Lameter
2007-07-23 19:52     ` Adam Litke
2007-07-13 15:16 ` [PATCH 2/5] [hugetlb] Account for hugepages as locked_vm Adam Litke
2007-07-13 15:16 ` [PATCH 3/5] [hugetlb] Move update_and_free_page so it can be used by alloc functions Adam Litke
2007-07-13 15:17 ` [PATCH 4/5] [hugetlb] Try to grow pool on alloc_huge_page failure Adam Litke
2007-07-13 15:17 ` [PATCH 5/5] [hugetlb] Try to grow pool for MAP_SHARED mappings Adam Litke
2007-07-13 20:05   ` Paul Jackson
2007-07-13 21:05     ` Adam Litke
2007-07-13 21:24       ` Ken Chen
2007-07-13 21:29       ` Christoph Lameter
2007-07-13 21:38         ` Ken Chen
2007-07-13 21:47           ` Christoph Lameter
2007-07-13 22:21           ` Paul Jackson
2007-07-13 21:38       ` Paul Jackson
2007-07-17 23:42         ` Nish Aravamudan
2007-07-18 14:44           ` Lee Schermerhorn
2007-07-18 15:17             ` Nish Aravamudan
2007-07-18 16:02               ` Lee Schermerhorn
2007-07-18 21:16                 ` Nish Aravamudan
2007-07-18 21:40                   ` Lee Schermerhorn
2007-07-19  1:52                 ` Paul Mundt
2007-07-20 20:35                   ` Nish Aravamudan
2007-07-20 20:53                     ` Lee Schermerhorn
2007-07-20 21:12                       ` Nish Aravamudan
2007-07-21 16:57                     ` Paul Mundt
2007-07-13 23:15       ` Nish Aravamudan
2007-07-13 21:09     ` Ken Chen
