* [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings
From: Mel Gorman @ 2008-04-21 18:36 UTC
To: wli, agl; +Cc: linux-mm, linux-kernel
MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time. This is
so that all future faults will be guaranteed to succeed. Applications are not
expected to use mlock() as this can result in poor NUMA placement.
MAP_PRIVATE mappings do not reserve pages. This can result in an application
being SIGKILLed later if a large page is not available at fault time. This
makes huge page usage very ill-advised in some cases, as the unexpected
application failure is intolerable. Forcing potential poor placement with
mlock() is not a great solution either.
This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings similar
to what happens for MAP_SHARED mappings. Once mmap() succeeds, the application
developer knows that future faults will also succeed. However, there is no
guarantee that children of the process will be able to write-fault the same
mapping. The assumption is being made that the majority of applications that
fork() either use MAP_SHARED as an IPC mechanism or are calling exec().
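As an illustration, a minimal test program along the following lines (a
sketch only; it assumes a hugetlbfs mount at /mnt/huge and a 2MB huge
page size, neither of which is part of the patch) shows the behaviour
change:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)
#define NR_HPAGES 16

int main(void)
{
        int fd = open("/mnt/huge/testfile", O_CREAT | O_RDWR, 0600);
        char *p;
        long i;

        if (fd < 0) {
                perror("open");
                return 1;
        }
        p = mmap(NULL, NR_HPAGES * HPAGE_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
                /* With this patch, an overcommitted mapping fails here... */
                perror("mmap");
                return 1;
        }
        /*
         * ... instead of the process being SIGKILLed somewhere in this
         * loop when the pool runs dry.
         */
        for (i = 0; i < NR_HPAGES; i++)
                p[i * HPAGE_SIZE] = 1;
        printf("faulted %d huge pages\n", NR_HPAGES);
        return 0;
}

Without the patch, the mmap() succeeds regardless of the pool size and the
touch loop can be killed; with it, an overcommitted mmap() fails cleanly
with ENOMEM.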
Opinions?
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
fs/hugetlbfs/inode.c | 8 -
include/linux/hugetlb.h | 3
mm/hugetlb.c | 212 ++++++++++++++++++++++++++++++------------------
3 files changed, 142 insertions(+), 81 deletions(-)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.25-rc9-clean/fs/hugetlbfs/inode.c linux-2.6.25-rc9-POC-MAP_PRIVATE-reserve/fs/hugetlbfs/inode.c
--- linux-2.6.25-rc9-clean/fs/hugetlbfs/inode.c 2008-04-11 21:32:29.000000000 +0100
+++ linux-2.6.25-rc9-POC-MAP_PRIVATE-reserve/fs/hugetlbfs/inode.c 2008-04-21 17:05:25.000000000 +0100
@@ -103,9 +103,9 @@ static int hugetlbfs_file_mmap(struct fi
ret = -ENOMEM;
len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
- if (vma->vm_flags & VM_MAYSHARE &&
- hugetlb_reserve_pages(inode, vma->vm_pgoff >> (HPAGE_SHIFT-PAGE_SHIFT),
- len >> HPAGE_SHIFT))
+ if (hugetlb_reserve_pages(inode,
+ vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT),
+ len >> HPAGE_SHIFT, vma))
goto out;
ret = 0;
@@ -942,7 +942,7 @@ struct file *hugetlb_file_setup(const ch
goto out_dentry;
error = -ENOMEM;
- if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT))
+ if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT, NULL))
goto out_inode;
d_instantiate(dentry, inode);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.25-rc9-clean/include/linux/hugetlb.h linux-2.6.25-rc9-POC-MAP_PRIVATE-reserve/include/linux/hugetlb.h
--- linux-2.6.25-rc9-clean/include/linux/hugetlb.h 2008-04-11 21:32:29.000000000 +0100
+++ linux-2.6.25-rc9-POC-MAP_PRIVATE-reserve/include/linux/hugetlb.h 2008-04-17 16:45:32.000000000 +0100
@@ -29,7 +29,8 @@ int hugetlb_report_node_meminfo(int, cha
unsigned long hugetlb_total_pages(void);
int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access);
-int hugetlb_reserve_pages(struct inode *inode, long from, long to);
+int hugetlb_reserve_pages(struct inode *inode, long from, long to,
+ struct vm_area_struct *vma);
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
extern unsigned long max_huge_pages;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.25-rc9-clean/mm/hugetlb.c linux-2.6.25-rc9-POC-MAP_PRIVATE-reserve/mm/hugetlb.c
--- linux-2.6.25-rc9-clean/mm/hugetlb.c 2008-04-11 21:32:29.000000000 +0100
+++ linux-2.6.25-rc9-POC-MAP_PRIVATE-reserve/mm/hugetlb.c 2008-04-21 18:29:19.000000000 +0100
@@ -40,6 +40,34 @@ static int hugetlb_next_nid;
*/
static DEFINE_SPINLOCK(hugetlb_lock);
+/* Helpers to track the number of pages reserved for a MAP_PRIVATE vma */
+static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma)
+{
+ if (!(vma->vm_flags & VM_MAYSHARE))
+ return (unsigned long)vma->vm_private_data;
+ return 0;
+}
+
+static void adjust_vma_resv_huge_pages(struct vm_area_struct *vma,
+ int delta)
+{
+ BUG_ON((unsigned long)vma->vm_private_data > 100);
+ WARN_ON_ONCE(vma->vm_flags & VM_MAYSHARE);
+ if (!(vma->vm_flags & VM_MAYSHARE)) {
+ unsigned long reserve;
+ reserve = (unsigned long)vma->vm_private_data + delta;
+ vma->vm_private_data = (void *)reserve;
+ }
+}
+static void set_vma_resv_huge_pages(struct vm_area_struct *vma,
+ unsigned long reserve)
+{
+ BUG_ON((unsigned long)vma->vm_private_data > 100);
+ WARN_ON_ONCE(vma->vm_flags & VM_MAYSHARE);
+ if (!(vma->vm_flags & VM_MAYSHARE))
+ vma->vm_private_data = (void *)reserve;
+}
+
static void clear_huge_page(struct page *page, unsigned long addr)
{
int i;
@@ -99,6 +127,16 @@ static struct page *dequeue_huge_page_vm
htlb_alloc_mask, &mpol);
struct zone **z;
+ /*
+ * A child process with MAP_PRIVATE mappings created by its parent
+ * has no page reserves. This check ensures that reservations are
+ * not "stolen". The child may still get SIGKILLed
+ */
+ if (!(vma->vm_flags & VM_MAYSHARE) &&
+ !vma_resv_huge_pages(vma) &&
+ free_huge_pages - resv_huge_pages == 0)
+ return NULL;
+
for (z = zonelist->zones; *z; z++) {
nid = zone_to_nid(*z);
if (cpuset_zone_allowed_softwall(*z, htlb_alloc_mask) &&
@@ -108,8 +146,23 @@ static struct page *dequeue_huge_page_vm
list_del(&page->lru);
free_huge_pages--;
free_huge_pages_node[nid]--;
- if (vma && vma->vm_flags & VM_MAYSHARE)
+
+ /* Update reserves as applicable */
+ if (vma->vm_flags & VM_MAYSHARE) {
+ /* Shared mappings always have a reserve */
resv_huge_pages--;
+ } else {
+ /*
+ * Only the process that called mmap() has
+ * reserves for private mappings. Be careful
+ * not to overflow counters from children
+ * faulting
+ */
+ if (vma_resv_huge_pages(vma)) {
+ resv_huge_pages--;
+ adjust_vma_resv_huge_pages(vma, -1);
+ }
+ }
break;
}
}
@@ -437,50 +490,23 @@ static void return_unused_surplus_pages(
}
}
-
-static struct page *alloc_huge_page_shared(struct vm_area_struct *vma,
- unsigned long addr)
+static struct page *alloc_huge_page(struct vm_area_struct *vma,
+ unsigned long addr)
{
struct page *page;
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ /* Try dequeueing a page from the pool */
spin_lock(&hugetlb_lock);
page = dequeue_huge_page_vma(vma, addr);
spin_unlock(&hugetlb_lock);
- return page ? page : ERR_PTR(-VM_FAULT_OOM);
-}
-
-static struct page *alloc_huge_page_private(struct vm_area_struct *vma,
- unsigned long addr)
-{
- struct page *page = NULL;
-
- if (hugetlb_get_quota(vma->vm_file->f_mapping, 1))
- return ERR_PTR(-VM_FAULT_SIGBUS);
- spin_lock(&hugetlb_lock);
- if (free_huge_pages > resv_huge_pages)
- page = dequeue_huge_page_vma(vma, addr);
- spin_unlock(&hugetlb_lock);
+ /* Attempt dynamic resizing */
if (!page) {
page = alloc_buddy_huge_page(vma, addr);
- if (!page) {
- hugetlb_put_quota(vma->vm_file->f_mapping, 1);
- return ERR_PTR(-VM_FAULT_OOM);
- }
+ if (!page)
+ page = ERR_PTR(-VM_FAULT_OOM);
}
- return page;
-}
-
-static struct page *alloc_huge_page(struct vm_area_struct *vma,
- unsigned long addr)
-{
- struct page *page;
- struct address_space *mapping = vma->vm_file->f_mapping;
-
- if (vma->vm_flags & VM_MAYSHARE)
- page = alloc_huge_page_shared(vma, addr);
- else
- page = alloc_huge_page_private(vma, addr);
if (!IS_ERR(page)) {
set_page_refcounted(page);
@@ -705,8 +731,64 @@ static int hugetlb_vm_op_fault(struct vm
return 0;
}
+static int hugetlb_acct_memory(long delta)
+{
+ int ret = -ENOMEM;
+
+ spin_lock(&hugetlb_lock);
+ /*
+ * When cpuset is configured, it breaks the strict hugetlb page
+ * reservation as the accounting is done on a global variable. Such
+ * reservation is completely rubbish in the presence of cpuset because
+ * the reservation is not checked against page availability for the
+ * current cpuset. Application can still potentially OOM'ed by kernel
+ * with lack of free htlb page in cpuset that the task is in.
+ * Attempt to enforce strict accounting with cpuset is almost
+ * impossible (or too ugly) because cpuset is too fluid that
+ * task or memory node can be dynamically moved between cpusets.
+ *
+ * The change of semantics for shared hugetlb mapping with cpuset is
+ * undesirable. However, in order to preserve some of the semantics,
+ * we fall back to check against current free page availability as
+ * a best attempt and hopefully to minimize the impact of changing
+ * semantics that cpuset has.
+ */
+ if (delta > 0) {
+ if (gather_surplus_pages(delta) < 0)
+ goto out;
+
+ if (delta > cpuset_mems_nr(free_huge_pages_node)) {
+ return_unused_surplus_pages(delta);
+ goto out;
+ }
+ }
+
+ ret = 0;
+ if (delta < 0)
+ return_unused_surplus_pages((unsigned long) -delta);
+
+out:
+ spin_unlock(&hugetlb_lock);
+ return ret;
+}
+
+static void hugetlb_vm_open(struct vm_area_struct *vma)
+{
+ if (!(vma->vm_flags & VM_MAYSHARE))
+ set_vma_resv_huge_pages(vma, 0);
+}
+
+static void hugetlb_vm_close(struct vm_area_struct *vma)
+{
+ unsigned long reserve = vma_resv_huge_pages(vma);
+ if (reserve)
+ hugetlb_acct_memory(-reserve);
+}
+
struct vm_operations_struct hugetlb_vm_ops = {
.fault = hugetlb_vm_op_fault,
+ .close = hugetlb_vm_close,
+ .open = hugetlb_vm_open,
};
static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
@@ -1223,52 +1305,30 @@ static long region_truncate(struct list_
return chg;
}
-static int hugetlb_acct_memory(long delta)
+int hugetlb_reserve_pages(struct inode *inode,
+ long from, long to,
+ struct vm_area_struct *vma)
{
- int ret = -ENOMEM;
+ long ret, chg;
- spin_lock(&hugetlb_lock);
/*
- * When cpuset is configured, it breaks the strict hugetlb page
- * reservation as the accounting is done on a global variable. Such
- * reservation is completely rubbish in the presence of cpuset because
- * the reservation is not checked against page availability for the
- * current cpuset. Application can still potentially OOM'ed by kernel
- * with lack of free htlb page in cpuset that the task is in.
- * Attempt to enforce strict accounting with cpuset is almost
- * impossible (or too ugly) because cpuset is too fluid that
- * task or memory node can be dynamically moved between cpusets.
- *
- * The change of semantics for shared hugetlb mapping with cpuset is
- * undesirable. However, in order to preserve some of the semantics,
- * we fall back to check against current free page availability as
- * a best attempt and hopefully to minimize the impact of changing
- * semantics that cpuset has.
+ * Shared mappings and read-only mappings should base their reservation
+ * on the number of pages that are already allocated on behalf of the
+ * file. Private mappings that are writable need to reserve the full
+ * area. Note that a read-only private mapping that subsequently calls
+ * mprotect() to make it read-write may not work reliably
*/
- if (delta > 0) {
- if (gather_surplus_pages(delta) < 0)
- goto out;
-
- if (delta > cpuset_mems_nr(free_huge_pages_node)) {
- return_unused_surplus_pages(delta);
- goto out;
- }
+ if (vma->vm_flags & VM_SHARED)
+ chg = region_chg(&inode->i_mapping->private_list, from, to);
+ else {
+ if (vma->vm_flags & VM_MAYWRITE)
+ chg = to - from;
+ else
+ chg = region_chg(&inode->i_mapping->private_list,
+ from, to);
+ set_vma_resv_huge_pages(vma, chg);
}
-
- ret = 0;
- if (delta < 0)
- return_unused_surplus_pages((unsigned long) -delta);
-
-out:
- spin_unlock(&hugetlb_lock);
- return ret;
-}
-
-int hugetlb_reserve_pages(struct inode *inode, long from, long to)
-{
- long ret, chg;
-
- chg = region_chg(&inode->i_mapping->private_list, from, to);
+
if (chg < 0)
return chg;
* Re: [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings
From: Adam Litke @ 2008-04-21 19:05 UTC
To: Mel Gorman; +Cc: wli, linux-mm, linux-kernel
On Mon, 2008-04-21 at 19:36 +0100, Mel Gorman wrote:
> MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time. This is
> so that all future faults will be guaranteed to succeed. Applications are not
> expected to use mlock() as this can result in poor NUMA placement.
>
> MAP_PRIVATE mappings do not reserve pages. This can result in an application
> being SIGKILLed later if a large page is not available at fault time. This
> makes huge page usage very ill-advised in some cases, as the unexpected
> application failure is intolerable. Forcing potential poor placement with
> mlock() is not a great solution either.
>
> This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings similar
> to what happens for MAP_SHARED mappings. Once mmap() succeeds, the application
> developer knows that future faults will also succeed. However, there is no
> guarantee that children of the process will be able to write-fault the same
> mapping. The assumption is being made that the majority of applications that
> fork() either use MAP_SHARED as an IPC mechanism or are calling exec().
>
> Opinions?
Very sound idea in my opinion and definitely a step in the right
direction toward even more reliable huge pages. With this patch (my
comments included), the only remaining cause of unexpected SIGKILLs
would be copy-on-write.
> @@ -40,6 +40,34 @@ static int hugetlb_next_nid;
> */
> static DEFINE_SPINLOCK(hugetlb_lock);
>
> +/* Helpers to track the number of pages reserved for a MAP_PRIVATE vma */
> +static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma)
> +{
> + if (!(vma->vm_flags & VM_MAYSHARE))
> + return (unsigned long)vma->vm_private_data;
> + return 0;
> +}
> +
> +static void adjust_vma_resv_huge_pages(struct vm_area_struct *vma,
> + int delta)
> +{
> + BUG_ON((unsigned long)vma->vm_private_data > 100);
> + WARN_ON_ONCE(vma->vm_flags & VM_MAYSHARE);
> + if (!(vma->vm_flags & VM_MAYSHARE)) {
> + unsigned long reserve;
> + reserve = (unsigned long)vma->vm_private_data + delta;
> + vma->vm_private_data = (void *)reserve;
> + }
> +}
> +static void set_vma_resv_huge_pages(struct vm_area_struct *vma,
> + unsigned long reserve)
> +{
> + BUG_ON((unsigned long)vma->vm_private_data > 100);
I assume you're just playing it safe for this RFC, but surely a 100 page
max reservation is not sufficient (especially since we have a whole
unsigned long to work with). Also, I am not sure a BUG_ON would be an
appropriate response to exceeding the maximum.
<snip>
> @@ -1223,52 +1305,30 @@ static long region_truncate(struct list_
> return chg;
> }
>
> -static int hugetlb_acct_memory(long delta)
> +int hugetlb_reserve_pages(struct inode *inode,
> + long from, long to,
> + struct vm_area_struct *vma)
> {
> - int ret = -ENOMEM;
> + long ret, chg;
>
> - spin_lock(&hugetlb_lock);
> /*
> - * When cpuset is configured, it breaks the strict hugetlb page
> - * reservation as the accounting is done on a global variable. Such
> - * reservation is completely rubbish in the presence of cpuset because
> - * the reservation is not checked against page availability for the
> - * current cpuset. Application can still potentially OOM'ed by kernel
> - * with lack of free htlb page in cpuset that the task is in.
> - * Attempt to enforce strict accounting with cpuset is almost
> - * impossible (or too ugly) because cpuset is too fluid that
> - * task or memory node can be dynamically moved between cpusets.
> - *
> - * The change of semantics for shared hugetlb mapping with cpuset is
> - * undesirable. However, in order to preserve some of the semantics,
> - * we fall back to check against current free page availability as
> - * a best attempt and hopefully to minimize the impact of changing
> - * semantics that cpuset has.
> + * Shared mappings and read-only mappings should base their reservation
> + * on the number of pages that are already allocated on behalf of the
> + * file. Private mappings that are writable need to reserve the full
> + * area. Note that a read-only private mapping that subsequently calls
> + * mprotect() to make it read-write may not work reliably
> */
> - if (delta > 0) {
> - if (gather_surplus_pages(delta) < 0)
> - goto out;
> -
> - if (delta > cpuset_mems_nr(free_huge_pages_node)) {
> - return_unused_surplus_pages(delta);
> - goto out;
> - }
> + if (vma->vm_flags & VM_SHARED)
> + chg = region_chg(&inode->i_mapping->private_list, from, to);
> + else {
> + if (vma->vm_flags & VM_MAYWRITE)
> + chg = to - from;
> + else
> + chg = region_chg(&inode->i_mapping->private_list,
> + from, to);
> + set_vma_resv_huge_pages(vma, chg);
To promote reliability, might it be advisable to just reserve the pages
regardless of VM_MAYWRITE? Otherwise we might want to consider
reserving the pages in hugetlb_change_protection().
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings
From: Mel Gorman @ 2008-04-21 19:19 UTC
To: Adam Litke; +Cc: wli, linux-mm, linux-kernel
On (21/04/08 14:05), Adam Litke didst pronounce:
> On Mon, 2008-04-21 at 19:36 +0100, Mel Gorman wrote:
> > MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time. This is
> > so that all future faults will be guaranteed to succeed. Applications are not
> > expected to use mlock() as this can result in poor NUMA placement.
> >
> > MAP_PRIVATE mappings do not reserve pages. This can result in an application
> > being SIGKILLed later if a large page is not available at fault time. This
> > makes huge page usage very ill-advised in some cases, as the unexpected
> > application failure is intolerable. Forcing potential poor placement with
> > mlock() is not a great solution either.
> >
> > This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings similar
> > to what happens for MAP_SHARED mappings. Once mmap() succeeds, the application
> > developer knows that future faults will also succeed. However, there is no
> > guarantee that children of the process will be able to write-fault the same
> > mapping. The assumption is being made that the majority of applications that
> > fork() either use MAP_SHARED as an IPC mechanism or are calling exec().
> >
> > Opinions?
>
> Very sound idea in my opinion and definitely a step in the right
> direction toward even more reliable huge pages. With this patch (my
> comments included), the only remaining cause of unexpected SIGKILLs
> would be copy-on-write.
>
Correct. I have a test case which trivially triggers that parent/child
problem, but the parent is able to complete properly, which was the
important thing. If the pool is empty and dynamic resizing is disabled,
the test case exits at mmap() time, which I figured was better behaviour.
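For reference, the test is roughly along these lines (a sketch, not the
actual test case; the path is made up):

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

int main(void)
{
        int fd = open("/mnt/huge/cowtest", O_CREAT | O_RDWR, 0600);
        char *p;

        if (fd < 0)
                return 1;
        p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
                return 1;       /* reservation failed at mmap() time */
        *p = 1;                 /* parent consumes its reserved page */
        if (fork() == 0) {
                *p = 2;         /* child write-fault: no reserve, may die */
                _exit(0);
        }
        wait(NULL);
        return 0;               /* the parent always completes */
}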
> > @@ -40,6 +40,34 @@ static int hugetlb_next_nid;
> > */
> > static DEFINE_SPINLOCK(hugetlb_lock);
> >
> > +/* Helpers to track the number of pages reserved for a MAP_PRIVATE vma */
> > +static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma)
> > +{
> > + if (!(vma->vm_flags & VM_MAYSHARE))
> > + return (unsigned long)vma->vm_private_data;
> > + return 0;
> > +}
> > +
> > +static void adjust_vma_resv_huge_pages(struct vm_area_struct *vma,
> > + int delta)
> > +{
> > + BUG_ON((unsigned long)vma->vm_private_data > 100);
> > + WARN_ON_ONCE(vma->vm_flags & VM_MAYSHARE);
> > + if (!(vma->vm_flags & VM_MAYSHARE)) {
> > + unsigned long reserve;
> > + reserve = (unsigned long)vma->vm_private_data + delta;
> > + vma->vm_private_data = (void *)reserve;
> > + }
> > +}
> > +static void set_vma_resv_huge_pages(struct vm_area_struct *vma,
> > + unsigned long reserve)
> > +{
> > + BUG_ON((unsigned long)vma->vm_private_data > 100);
>
> I assume you're just playing it safe for this RFC, but surely a 100 page
> max reservation is not sufficient (especially since we have a whole
> unsigned long to work with). Also, I am not sure a BUG_ON would be an
> appropriate response to exceeding the maximum.
>
That BUG_ON is a mistake. I was tracking down a reservation-related bug and
I introduced this temporarily and then forgot about it. Sorry about that.
> <snip>
>
> > @@ -1223,52 +1305,30 @@ static long region_truncate(struct list_
> > return chg;
> > }
> >
> > -static int hugetlb_acct_memory(long delta)
> > +int hugetlb_reserve_pages(struct inode *inode,
> > + long from, long to,
> > + struct vm_area_struct *vma)
> > {
> > - int ret = -ENOMEM;
> > + long ret, chg;
> >
> > - spin_lock(&hugetlb_lock);
> > /*
> > - * When cpuset is configured, it breaks the strict hugetlb page
> > - * reservation as the accounting is done on a global variable. Such
> > - * reservation is completely rubbish in the presence of cpuset because
> > - * the reservation is not checked against page availability for the
> > - * current cpuset. Application can still potentially OOM'ed by kernel
> > - * with lack of free htlb page in cpuset that the task is in.
> > - * Attempt to enforce strict accounting with cpuset is almost
> > - * impossible (or too ugly) because cpuset is too fluid that
> > - * task or memory node can be dynamically moved between cpusets.
> > - *
> > - * The change of semantics for shared hugetlb mapping with cpuset is
> > - * undesirable. However, in order to preserve some of the semantics,
> > - * we fall back to check against current free page availability as
> > - * a best attempt and hopefully to minimize the impact of changing
> > - * semantics that cpuset has.
> > + * Shared mappings and read-only mappings should base their reservation
> > + * on the number of pages that are already allocated on behalf of the
> > + * file. Private mappings that are writable need to reserve the full
> > + * area. Note that a read-only private mapping that subsequently calls
> > + * mprotect() to make it read-write may not work reliably
> > */
> > - if (delta > 0) {
> > - if (gather_surplus_pages(delta) < 0)
> > - goto out;
> > -
> > - if (delta > cpuset_mems_nr(free_huge_pages_node)) {
> > - return_unused_surplus_pages(delta);
> > - goto out;
> > - }
> > + if (vma->vm_flags & VM_SHARED)
> > + chg = region_chg(&inode->i_mapping->private_list, from, to);
> > + else {
> > + if (vma->vm_flags & VM_MAYWRITE)
> > + chg = to - from;
> > + else
> > + chg = region_chg(&inode->i_mapping->private_list,
> > + from, to);
> > + set_vma_resv_huge_pages(vma, chg);
>
> To promote reliability, might it be advisable to just reserve the pages
> regardless of VM_MAYWRITE? Otherwise we might want to consider
> reserving the pages in hugetlb_change_protection().
>
It will mean excessive reservations for read-only mappings, but I imagine
that is a fairly rare case anyway so I'll do that. The patch as-is is
vulnerable to mprotect() making a read-only mapping read-write and the
application later dying. Reserving everything up-front might be
excessive, but it's also a bit more robust.
Thanks Adam.
--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
* Re: [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings
From: Andi Kleen @ 2008-04-23 13:55 UTC
To: Mel Gorman; +Cc: wli, agl, linux-mm, linux-kernel
Mel Gorman <mel@csn.ul.ie> writes:
> MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time. This is
> so that all future faults will be guaranteed to succeed. Applications are not
> expected to use mlock() as this can result in poor NUMA placement.
>
> MAP_PRIVATE mappings do not reserve pages. This can result in an application
> being SIGKILLed later if a large page is not available at fault time. This
> makes huge page usage very ill-advised in some cases, as the unexpected
> application failure is intolerable. Forcing potential poor placement with
> mlock() is not a great solution either.
>
> This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings similar
> to what happens for MAP_SHARED mappings.
This will break all applications that mmap more hugetlb pages than they
actually use. How do you know these don't exist?
> Opinions?
Seems like a risky interface change to me.
-Andi
* Re: [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings
From: Mel Gorman @ 2008-04-23 15:14 UTC
To: Andi Kleen; +Cc: wli, agl, linux-mm, linux-kernel
On (23/04/08 15:55), Andi Kleen didst pronounce:
> Mel Gorman <mel@csn.ul.ie> writes:
>
> > MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time. This is
> > so that all future faults will be guaranteed to succeed. Applications are not
> > expected to use mlock() as this can result in poor NUMA placement.
> >
> > MAP_PRIVATE mappings do not reserve pages. This can result in an application
> > being SIGKILLed later if a large page is not available at fault time. This
> > makes huge page usage very ill-advised in some cases, as the unexpected
> > application failure is intolerable. Forcing potential poor placement with
> > mlock() is not a great solution either.
> >
> > This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings similar
> > to what happens for MAP_SHARED mappings.
>
> This will break all applications that mmap more hugetlb pages than they
> actually use. How do you know these don't exist?
>
If the large pages exist to satisfy the mapping, the application will not
even notice this change. They will only break if they are creating larger
mappings than large pages exist for (or can be allocated for in the event
they have enabled dynamic resizing with nr_overcommit_hugepages). If they
are doing that, they are running a big risk as they may get arbitrarily
killed later. Sometimes their app will run, other times it dies. If more
than one application is running on the system that is behaving like this,
they are really playing with fire.
With this change, a mmap() failure is a clear indication that the mapping
would have been unsafe to use and they should try mmap()ing with small pages
instead.
> > Opinions?
>
> Seems like a risky interface change to me.
>
Using MAP_PRIVATE at all is a fairly major risk as it is. I am failing to
see why the risk of a random SIGKILL is a desirable "feature".
--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
* Re: [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings
From: Andi Kleen @ 2008-04-23 15:43 UTC
To: Mel Gorman; +Cc: Andi Kleen, wli, agl, linux-mm, linux-kernel
> If the large pages exist to satisfy the mapping, the application will not
> even notice this change. They will only break if they are creating larger
> mappings than large pages exist for (or can be allocated for in the event
> they have enabled dynamic resizing with nr_overcommit_hugepages). If they
> are doing that, they are running a big risk as they may get arbitrarily
> killed later.
The point is it is pretty common (especially when you have enough
address space) to just create a huge mapping and only use the beginning.
This avoids costly resize operations later and is a quite useful
strategy on 64bit (but even on 32bit). Now the upper size will
likely be incredibly huge (far beyond available physical memory), but it's
obviously impossible to really use all of it.
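Something like this, say (an illustrative sketch only; the mount point
and sizes are made up):

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPAGE_SIZE      (2UL * 1024 * 1024)
#define BIG_RESERVE     (65536UL * HPAGE_SIZE) /* 128GB of address space */

static void *alloc_big_array(size_t used)
{
        int fd = open("/mnt/huge/array", O_CREAT | O_RDWR, 0600);
        char *p;
        size_t off;

        if (fd < 0)
                return NULL;
        /* Map far more than will ever be backed by physical huge pages... */
        p = mmap(NULL, BIG_RESERVE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE, fd, 0);
        close(fd);
        if (p == MAP_FAILED)
                return NULL;
        /* ...but only ever fault in the beginning. */
        for (off = 0; off < used; off += HPAGE_SIZE)
                p[off] = 0;
        return p;
}

With strict reservation, that mmap() fails outright unless the pool can
back the whole 128GB.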
It's also common in languages that don't support dynamic allocation well
(like older Fortran dialects). Granted, these won't use hugetlbfs directly,
but I couldn't rule out that someone wrote a special Fortran run-time
library which transparently allocates large arrays from hugetlbfs.
In fact I would be surprised if a number of such beasts don't exist -- it is
really an obvious, simple tuning option for old HPC Fortran applications.
> Sometimes their app will run, other times it dies. If more
> than one application is running on the system that is behaving like this,
> they are really playing with fire.
With your change such an application will not run at all. Doesn't
seem like an improvement to me.
> With this change, a mmap() failure is a clear indication that the mapping
> would have been unsafe to use and they should try mmap()ing with small pages
> instead.
I don't have a problem with having an optional strict overcommit checking
mode (similar to what standard VM has), but it should be configurable
and off by default.
-Andi
* Re: [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings
From: Mel Gorman @ 2008-04-23 16:01 UTC
To: Andi Kleen; +Cc: wli, agl, linux-mm, linux-kernel
On (23/04/08 17:43), Andi Kleen didst pronounce:
> > If the large pages exist to satisfy the mapping, the application will not
> > even notice this change. They will only break if they are creating larger
> > mappings than large pages exist for (or can be allocated for in the event
> > they have enabled dynamic resizing with nr_overcommit_hugepages). If they
> > are doing that, they are running a big risk as they may get arbitrarily
> > killed later.
>
> The point is it is pretty common (especially when you have enough
> address space) to just create a huge mapping and only use the beginning.
Clearly these applications are assuming that overcommit is always allowed
because otherwise they would be failing. Also, the behaviour with reservations
is currently inconsistent as MAP_SHARED always tries to reserve the pages.
Maybe large shared mappings are rarer than large private mappings, but
this inconsistency is still shoddy.
> This avoids costly resize operations later and is a quite useful
> strategy on 64bit (but even on 32bit). Now the upper size will
> likely be incredibly huge (far beyond available physical memory), but it's
> obviously impossible to really use all of it.
>
> It's also common in languages that don't support dynamic allocation well
> (like older Fortran dialects). Granted, these won't use hugetlbfs directly,
> but I couldn't rule out that someone wrote a special Fortran run-time
> library which transparently allocates large arrays from hugetlbfs.
>
> In fact I would be surprised if a number of such beasts don't exist -- it is
> really an obvious, simple tuning option for old HPC Fortran applications.
>
I don't know if such a run-time library exists, but libhugetlbfs is
occasionally used in situations like this to back the data segment with
huge pages. It first copies the data MAP_SHARED to fake the reservation
before remapping private and kinda hopes for the best.
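Roughly, the trick looks like this (a paraphrased sketch; the real
library code differs and the helper name is made up):

#include <string.h>
#include <sys/mman.h>

static void *remap_segment_private(int fd, void *addr, size_t len)
{
        void *tmp;

        /* Stage 1: a shared mapping reserves the pages at mmap() time, and
         * copying the data in instantiates them in the hugetlbfs file. */
        tmp = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (tmp == MAP_FAILED)
                return NULL;
        memcpy(tmp, addr, len);
        munmap(tmp, len);

        /* Stage 2: remap MAP_PRIVATE over the original location and hope
         * later copy-on-write faults can still be satisfied. */
        return mmap(addr, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_FIXED, fd, 0);
}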
> > Sometimes their app will run, other times it dies. If more
> > than one application is running on the system that is behaving like this,
> > they are really playing with fire.
>
> With your change such an application will not run at all. Doesn't
> seem like an improvement to me.
>
I disagree. mmap() failing is something an application should be able to
take care of, and at least it is possible, as opposed to SIGKILL where it's
"game over". Even if they are using run-time helpers like libhugetlbfs,
it should be possible to detect when a large-page-backed mapping is flaky
and use small pages instead. As it is, it's very difficult to detect in
advance if future faults will succeed, particularly when more than one
application is running.
If the application is unable to handle mmap() failing, then yes, it'll
exit. But at least it'll exit consistently and not get SIGKILLed
randomly, which to me is a far worse situation.
> > With this change, a mmap() failure is a clear indication that the mapping
> > would have been unsafe to use and they should try mmap()ing with small pages
> > instead.
>
> I don't have a problem with having an optional strict overcommit checking
> mode (similar to what standard VM has), but it should be configurable
> and off by default.
>
There already is partial strict overcommit checking for MAP_SHARED. I
think it should be consistent and done for MAP_PRIVATE.
For disabling, I could look into adding MAP_NORESERVE support and
something like the current overcommit_memory tunable for hugetlbfs.
However, with yet another proc tunable, I would push for enabling strict
checking by default for safe behaviour, rather than disabling it by
default and risking random SIGKILLs.
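An application that explicitly wants today's overcommit behaviour could
then ask for it (a sketch, assuming such support existed):

#include <sys/mman.h>

/* Opt out of strict reservation. Note hugetlbfs does not honour
 * MAP_NORESERVE today; this is the proposed behaviour. */
static void *map_huge_noreserve(int fd, size_t len)
{
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_NORESERVE, fd, 0);
}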
--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
* Re: [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings
From: Andy Whitcroft @ 2008-04-24 8:47 UTC
To: Mel Gorman; +Cc: Andi Kleen, wli, agl, linux-mm, linux-kernel
On Wed, Apr 23, 2008 at 05:01:13PM +0100, Mel Gorman wrote:
> On (23/04/08 17:43), Andi Kleen didst pronounce:
> > > If the large pages exist to satisfy the mapping, the application will not
> > > even notice this change. They will only break if they are creating larger
> > > mappings than large pages exist for (or can be allocated for in the event
> > > they have enabled dynamic resizing with nr_overcommit_hugepages). If they
> > > are doing that, they are running a big risk as they may get arbitrarily
> > > killed later.
> >
> > The point is it is pretty common (especially when you have enough
> > address space) to just create a huge mapping and only use the beginning.
>
> Clearly these applications are assuming that overcommit is always allowed
> because otherwise they would be failing. Also, the behaviour with reservations
> is currently inconsistent as MAP_SHARED always tries to reserve the pages.
> Maybe large shared mappings are rarer than large private mappings, but
> this inconsistency is still shoddy.
As I remember history, in the beginning we had pre-fault semantics for all
mappings. This meant that for the longest time we had the equivalent of
"reserve at mmap" semantics, and indeed probably had reserve at fork() too.
When dynamic faulting went in (to get locality of allocation to reference
for performance on NUMA systems) that changed, but mostly because it
was seen as hard to implement, or perhaps hard to decide what the sane
semantics were.
During the evolution since, it's been clear that the biggest problem is
unreliable semantics. A semantic difference between private and shared
mappings is annoying enough, but the random SIGKILL at a random time in
the future has always been a festering thorn. How many customers would
accept a support response of "yeah, it'll do that on day 5, one time in
20 or so *beam*"? Not many, I would bet.
In an ideal world we would have the same semantics for small and large
pages, but that would be to ignore the obvious difference in quantity and
availability of these mappings. Of the strict and sloppy semantic options
we have had, the former are certainly easier to rely on, if potentially
harder to work with.
> > This avoids costly resize operations later and is a quite useful
> > strategy on 64bit (but even on 32bit). Now the upper size will
> > likely be incredibly huge (far beyond available physical memory), but it's
> > obviously impossible to really use all of it.
> >
> > It's also common in languages that don't support dynamic allocation well
> > (like older Fortran dialects). Granted, these won't use hugetlbfs directly,
> > but I couldn't rule out that someone wrote a special Fortran run-time
> > library which transparently allocates large arrays from hugetlbfs.
> >
> > In fact I would be surprised if a number of such beasts don't exist -- it is
> > really an obvious, simple tuning option for old HPC Fortran applications.
> >
>
> I don't know if such a run-time library exists, but libhugetlbfs is
> occasionally used in situations like this to back the data segment with
> huge pages. It first copies the data MAP_SHARED to fake the reservation
> before remapping private and kinda hopes for the best.
Indeed we do see a number of such Fortran apps. The nice thing about them
is that they actually have only static allocations, so in reality we know
how big they are up front and can use that information when sizing the
mappings to back them. So those are actually the _easiest_ to handle.
> > > Sometimes their app will run, other times it dies. If more
> > > than one application is running on the system that is behaving like this,
> > > they are really playing with fire.
> >
> > With your change such an application will not run at all. Doesn't
> > seem like an improvement to me.
> >
>
> I disagree. mmap() failing is something an application should be able to
> take care of, and at least it is possible, as opposed to SIGKILL where it's
> "game over". Even if they are using run-time helpers like libhugetlbfs,
> it should be possible to detect when a large-page-backed mapping is flaky
> and use small pages instead. As it is, it's very difficult to detect in
> advance if future faults will succeed, particularly when more than one
> application is running.
>
> If the application is unable to handle mmap() failing, then yes, it'll
> exit. But at least it'll exit consistently and not get SIGKILLed
> randomly, which to me is a far worse situation.
>
> > > With this change, a mmap() failure is a clear indication that the mapping
> > > would have been unsafe to use and they should try mmap()ing with small pages
> > > instead.
> >
> > I don't have a problem with having an optional strict overcommit checking
> > mode (similar to what standard VM has), but it should be configurable
> > and off by default.
> >
>
> There already is partial strict overcommit checking for MAP_SHARED. I
> think it should be consistent and done for MAP_PRIVATE.
>
> For disabling, I could look into adding MAP_NORESERVE support and
> something like the current overcommit_memory tunable for hugetlbfs.
> However, with yet another proc tunable, I would push for enabling strict
> checking by default for safe behaviour, rather than disabling it by
> default and risking random SIGKILLs.
-apw
* Re: [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings
From: dean gaudet @ 2008-05-04 5:29 UTC
To: Andi Kleen; +Cc: Mel Gorman, wli, agl, linux-mm, linux-kernel
On Wed, 23 Apr 2008, Andi Kleen wrote:
> Mel Gorman <mel@csn.ul.ie> writes:
>
> > MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time. This is
> > so that all future faults will be guaranteed to succeed. Applications are not
> > expected to use mlock() as this can result in poor NUMA placement.
> >
> > MAP_PRIVATE mappings do not reserve pages. This can result in an application
> > being SIGKILLed later if a large page is not available at fault time. This
> > makes huge page usage very ill-advised in some cases, as the unexpected
> > application failure is intolerable. Forcing potential poor placement with
> > mlock() is not a great solution either.
> >
> > This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings similar
> > to what happens for MAP_SHARED mappings.
>
> This will break all applications that mmap more hugetlb pages than they
> actually use. How do you know these don't exist?
such applications couldn't have existed before the change which added
HugePages_Rsvd... which I admit was sometime between 2.6.11 and 2.6.18, but
from my point of view the inability to actually allocate huge pages without
trapping SIGSEGV etc. was a terrible bug introduced when HugePages_Rsvd was
introduced.
-dean
* Re: [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings
From: Andy Whitcroft @ 2008-04-25 14:28 UTC
To: Mel Gorman; +Cc: wli, agl, linux-mm, linux-kernel
On Mon, Apr 21, 2008 at 07:36:22PM +0100, Mel Gorman wrote:
> MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time. This is
> so that all future faults will be guaranteed to succeed. Applications are not
> expected to use mlock() as this can result in poor NUMA placement.
>
> MAP_PRIVATE mappings do not reserve pages. This can result in an application
> being SIGKILLed later if a large page is not available at fault time. This
> makes huge page usage very ill-advised in some cases, as the unexpected
> application failure is intolerable. Forcing potential poor placement with
> mlock() is not a great solution either.
>
> This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings similar
> to what happens for MAP_SHARED mappings. Once mmap() succeeds, the application
> developer knows that future faults will also succeed. However, there is no
> guarantee that children of the process will be able to write-fault the same
> mapping. The assumption is being made that the majority of applications that
> fork() either use MAP_SHARED as an IPC mechanism or are calling exec().
>
> Opinions?
[This is one of those patches which is best read applied, diff has not
been friendly to the reviewer.]
Overall I think we should be sanitising these semantics. So I would
like to see this stack progressed.
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> fs/hugetlbfs/inode.c | 8 -
> include/linux/hugetlb.h | 3
> mm/hugetlb.c | 212 ++++++++++++++++++++++++++++++------------------
> 3 files changed, 142 insertions(+), 81 deletions(-)
>
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.25-rc9-clean/fs/hugetlbfs/inode.c linux-2.6.25-rc9-POC-MAP_PRIVATE-reserve/fs/hugetlbfs/inode.c
> --- linux-2.6.25-rc9-clean/fs/hugetlbfs/inode.c 2008-04-11 21:32:29.000000000 +0100
> +++ linux-2.6.25-rc9-POC-MAP_PRIVATE-reserve/fs/hugetlbfs/inode.c 2008-04-21 17:05:25.000000000 +0100
> @@ -103,9 +103,9 @@ static int hugetlbfs_file_mmap(struct fi
> ret = -ENOMEM;
> len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>
> - if (vma->vm_flags & VM_MAYSHARE &&
> - hugetlb_reserve_pages(inode, vma->vm_pgoff >> (HPAGE_SHIFT-PAGE_SHIFT),
> - len >> HPAGE_SHIFT))
> + if (hugetlb_reserve_pages(inode,
> + vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT),
> + len >> HPAGE_SHIFT, vma))
> goto out;
>
> ret = 0;
> @@ -942,7 +942,7 @@ struct file *hugetlb_file_setup(const ch
> goto out_dentry;
>
> error = -ENOMEM;
> - if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT))
> + if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT, NULL))
> goto out_inode;
>
> d_instantiate(dentry, inode);
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.25-rc9-clean/include/linux/hugetlb.h linux-2.6.25-rc9-POC-MAP_PRIVATE-reserve/include/linux/hugetlb.h
> --- linux-2.6.25-rc9-clean/include/linux/hugetlb.h 2008-04-11 21:32:29.000000000 +0100
> +++ linux-2.6.25-rc9-POC-MAP_PRIVATE-reserve/include/linux/hugetlb.h 2008-04-17 16:45:32.000000000 +0100
> @@ -29,7 +29,8 @@ int hugetlb_report_node_meminfo(int, cha
> unsigned long hugetlb_total_pages(void);
> int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, int write_access);
> -int hugetlb_reserve_pages(struct inode *inode, long from, long to);
> +int hugetlb_reserve_pages(struct inode *inode, long from, long to,
> + struct vm_area_struct *vma);
> void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
>
> extern unsigned long max_huge_pages;
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.25-rc9-clean/mm/hugetlb.c linux-2.6.25-rc9-POC-MAP_PRIVATE-reserve/mm/hugetlb.c
> --- linux-2.6.25-rc9-clean/mm/hugetlb.c 2008-04-11 21:32:29.000000000 +0100
> +++ linux-2.6.25-rc9-POC-MAP_PRIVATE-reserve/mm/hugetlb.c 2008-04-21 18:29:19.000000000 +0100
> @@ -40,6 +40,34 @@ static int hugetlb_next_nid;
> */
> static DEFINE_SPINLOCK(hugetlb_lock);
>
> +/* Helpers to track the number of pages reserved for a MAP_PRIVATE vma */
> +static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma)
> +{
> + if (!(vma->vm_flags & VM_MAYSHARE))
> + return (unsigned long)vma->vm_private_data;
> + return 0;
> +}
> +
> +static void adjust_vma_resv_huge_pages(struct vm_area_struct *vma,
> + int delta)
> +{
> + BUG_ON((unsigned long)vma->vm_private_data > 100);
> + WARN_ON_ONCE(vma->vm_flags & VM_MAYSHARE);
> + if (!(vma->vm_flags & VM_MAYSHARE)) {
> + unsigned long reserve;
> + reserve = (unsigned long)vma->vm_private_data + delta;
> + vma->vm_private_data = (void *)reserve;
> + }
> +}
> +static void set_vma_resv_huge_pages(struct vm_area_struct *vma,
> + unsigned long reserve)
> +{
> + BUG_ON((unsigned long)vma->vm_private_data > 100);
> + WARN_ON_ONCE(vma->vm_flags & VM_MAYSHARE);
> + if (!(vma->vm_flags & VM_MAYSHARE))
> + vma->vm_private_data = (void *)reserve;
> +}
> +
> static void clear_huge_page(struct page *page, unsigned long addr)
> {
> int i;
> @@ -99,6 +127,16 @@ static struct page *dequeue_huge_page_vm
> htlb_alloc_mask, &mpol);
> struct zone **z;
>
> + /*
> + * A child process with MAP_PRIVATE mappings created by its parent
> + * has no page reserves. This check ensures that reservations are
> + * not "stolen". The child may still get SIGKILLed
> + */
> + if (!(vma->vm_flags & VM_MAYSHARE) &&
> + !vma_resv_huge_pages(vma) &&
> + free_huge_pages - resv_huge_pages == 0)
> + return NULL;
> +
> for (z = zonelist->zones; *z; z++) {
> nid = zone_to_nid(*z);
> if (cpuset_zone_allowed_softwall(*z, htlb_alloc_mask) &&
> @@ -108,8 +146,23 @@ static struct page *dequeue_huge_page_vm
> list_del(&page->lru);
> free_huge_pages--;
> free_huge_pages_node[nid]--;
> - if (vma && vma->vm_flags & VM_MAYSHARE)
> +
> + /* Update reserves as applicable */
> + if (vma->vm_flags & VM_MAYSHARE) {
> + /* Shared mappings always have a reserve */
> resv_huge_pages--;
> + } else {
> + /*
> + * Only the process that called mmap() has
> + * reserves for private mappings. Be careful
> + * not to overflow counters from children
> + * faulting
> + */
> + if (vma_resv_huge_pages(vma)) {
> + resv_huge_pages--;
> + adjust_vma_resv_huge_pages(vma, -1);
> + }
> + }
> break;
> }
> }
> @@ -437,50 +490,23 @@ static void return_unused_surplus_pages(
> }
> }
>
> -
> -static struct page *alloc_huge_page_shared(struct vm_area_struct *vma,
> - unsigned long addr)
> +static struct page *alloc_huge_page(struct vm_area_struct *vma,
> + unsigned long addr)
> {
> struct page *page;
> + struct address_space *mapping = vma->vm_file->f_mapping;
>
> + /* Try dequeueing a page from the pool */
> spin_lock(&hugetlb_lock);
> page = dequeue_huge_page_vma(vma, addr);
> spin_unlock(&hugetlb_lock);
> - return page ? page : ERR_PTR(-VM_FAULT_OOM);
> -}
> -
> -static struct page *alloc_huge_page_private(struct vm_area_struct *vma,
> - unsigned long addr)
> -{
> - struct page *page = NULL;
> -
> - if (hugetlb_get_quota(vma->vm_file->f_mapping, 1))
> - return ERR_PTR(-VM_FAULT_SIGBUS);
>
> - spin_lock(&hugetlb_lock);
> - if (free_huge_pages > resv_huge_pages)
> - page = dequeue_huge_page_vma(vma, addr);
> - spin_unlock(&hugetlb_lock);
> + /* Attempt dynamic resizing */
> if (!page) {
> page = alloc_buddy_huge_page(vma, addr);
> - if (!page) {
> - hugetlb_put_quota(vma->vm_file->f_mapping, 1);
> - return ERR_PTR(-VM_FAULT_OOM);
> - }
> + if (!page)
> + page = ERR_PTR(-VM_FAULT_OOM);
> }
> - return page;
> -}
> -
> -static struct page *alloc_huge_page(struct vm_area_struct *vma,
> - unsigned long addr)
> -{
> - struct page *page;
> - struct address_space *mapping = vma->vm_file->f_mapping;
> -
> - if (vma->vm_flags & VM_MAYSHARE)
> - page = alloc_huge_page_shared(vma, addr);
> - else
> - page = alloc_huge_page_private(vma, addr);
>
> if (!IS_ERR(page)) {
> set_page_refcounted(page);
> @@ -705,8 +731,64 @@ static int hugetlb_vm_op_fault(struct vm
> return 0;
> }
>
> +static int hugetlb_acct_memory(long delta)
> +{
> + int ret = -ENOMEM;
> +
> + spin_lock(&hugetlb_lock);
> + /*
> + * When cpuset is configured, it breaks the strict hugetlb page
> + * reservation as the accounting is done on a global variable. Such
> + * reservation is completely rubbish in the presence of cpuset because
> + * the reservation is not checked against page availability for the
> + * current cpuset. Application can still potentially OOM'ed by kernel
> + * with lack of free htlb page in cpuset that the task is in.
> + * Attempt to enforce strict accounting with cpuset is almost
> + * impossible (or too ugly) because cpuset is too fluid that
> + * task or memory node can be dynamically moved between cpusets.
> + *
> + * The change of semantics for shared hugetlb mapping with cpuset is
> + * undesirable. However, in order to preserve some of the semantics,
> + * we fall back to check against current free page availability as
> + * a best attempt and hopefully to minimize the impact of changing
> + * semantics that cpuset has.
> + */
> + if (delta > 0) {
> + if (gather_surplus_pages(delta) < 0)
> + goto out;
> +
> + if (delta > cpuset_mems_nr(free_huge_pages_node)) {
> + return_unused_surplus_pages(delta);
> + goto out;
> + }
> + }
> +
> + ret = 0;
> + if (delta < 0)
> + return_unused_surplus_pages((unsigned long) -delta);
> +
> +out:
> + spin_unlock(&hugetlb_lock);
> + return ret;
> +}
> +
> +static void hugetlb_vm_open(struct vm_area_struct *vma)
> +{
> + if (!(vma->vm_flags & VM_MAYSHARE))
> + set_vma_resv_huge_pages(vma, 0);
> +}
Ok, you zap out the reservation when the VMA is opened. How does that
tie in with the VMA modifications which occur when we mprotect a page in
the middle of a map?
From my reading of vma_adjust and vma_split, I am not convinced you
would maintain the reservation correctly. I suspect that the original
VMA will retain the whole reservation which it will then not be able to
use. The new VMAs would not have any reservation and might then fail on
fault despite the total reservation being sufficient.
> +
> +static void hugetlb_vm_close(struct vm_area_struct *vma)
> +{
> + unsigned long reserve = vma_resv_huge_pages(vma);
> + if (reserve)
> + hugetlb_acct_memory(-reserve);
> +}
> +
> struct vm_operations_struct hugetlb_vm_ops = {
> .fault = hugetlb_vm_op_fault,
> + .close = hugetlb_vm_close,
> + .open = hugetlb_vm_open,
> };
>
> static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> @@ -1223,52 +1305,30 @@ static long region_truncate(struct list_
> return chg;
> }
>
> -static int hugetlb_acct_memory(long delta)
> +int hugetlb_reserve_pages(struct inode *inode,
> + long from, long to,
> + struct vm_area_struct *vma)
> {
> - int ret = -ENOMEM;
> + long ret, chg;
>
> - spin_lock(&hugetlb_lock);
> /*
> - * When cpuset is configured, it breaks the strict hugetlb page
> - * reservation as the accounting is done on a global variable. Such
> - * reservation is completely rubbish in the presence of cpuset because
> - * the reservation is not checked against page availability for the
> - * current cpuset. Application can still potentially OOM'ed by kernel
> - * with lack of free htlb page in cpuset that the task is in.
> - * Attempt to enforce strict accounting with cpuset is almost
> - * impossible (or too ugly) because cpuset is too fluid that
> - * task or memory node can be dynamically moved between cpusets.
> - *
> - * The change of semantics for shared hugetlb mapping with cpuset is
> - * undesirable. However, in order to preserve some of the semantics,
> - * we fall back to check against current free page availability as
> - * a best attempt and hopefully to minimize the impact of changing
> - * semantics that cpuset has.
> + * Shared mappings and read-only mappings should base their reservation
> + * on the number of pages that are already allocated on behalf of the
> + * file. Private mappings that are writable need to reserve the full
> + * area. Note that a read-only private mapping that subsequently calls
> + * mprotect() to make it read-write may not work reliably
> */
> - if (delta > 0) {
> - if (gather_surplus_pages(delta) < 0)
> - goto out;
> -
> - if (delta > cpuset_mems_nr(free_huge_pages_node)) {
> - return_unused_surplus_pages(delta);
> - goto out;
> - }
> + if (vma->vm_flags & VM_SHARED)
> + chg = region_chg(&inode->i_mapping->private_list, from, to);
> + else {
> + if (vma->vm_flags & VM_MAYWRITE)
> + chg = to - from;
> + else
> + chg = region_chg(&inode->i_mapping->private_list,
> + from, to);
In the read-only case you only create a reservation for the first mmap
of a particular offset in the file. I do not think this will work as
intended. Consider a process which forks, where each process then mmaps
the same offset: the first will get a reservation for its mmap, the
second will not. This seems to violate the "mapper is guaranteed to get
sufficient pages" guarantee for the second mapper. As the pages are
missing and read-only, we know that we actually could share the pages,
so in some sense this might make sense _if_ we could find and share the
pages at fault time. Currently we do not have the information required
to find these pages, so we would have to allocate pages for each mmap.
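In code terms, the scenario is something like this (a hypothetical
illustration, not a test from this thread):

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

int main(void)
{
        int fd = open("/mnt/huge/rofile", O_CREAT | O_RDWR, 0600);
        pid_t pid;
        volatile char *p;

        if (fd < 0)
                return 1;
        pid = fork();
        /*
         * Parent and child each mmap the same offset read-only and
         * private. Whichever mmap() runs first is charged via
         * region_chg(); the second finds the region already charged,
         * gets no reserve, and its fault may find no huge page free.
         */
        p = mmap(NULL, HPAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED)
                (void)p[0];             /* read fault */
        if (pid == 0)
                _exit(0);
        wait(NULL);
        return 0;
}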
As things stand I think that we should be using 'chg = to - from' for
all private mappings, as each mapping is effectively independent.
> + set_vma_resv_huge_pages(vma, chg);
What's not clear from the diff is that this change leaves us with two
cases where we apply region_chg() and one where we do not, but we then
always apply region_add(). Now, when writing that region code I intended
region_chg/region_add as a prepare/commit pair, with the former
performing any memory allocation we might require. It is not safe to
call region_add without first calling region_chg. Yes, the names are
not helpful. That region_add probably should be:
if (vma->vm_flags & VM_SHARED || !(vma->vm_flags & VM_MAYWRITE))
region_add(&inode->i_mapping->private_list, from, to);
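To spell out that convention, a minimal sketch of the intended pairing
(again an illustration of the calling pattern, not the patch as posted):

/* prepare: may allocate bookkeeping for the region and may fail */
chg = region_chg(&inode->i_mapping->private_list, from, to);
if (chg < 0)
	return chg;
/* ... validate and account the charge ... */
/* commit: safe only after a matching region_chg(); cannot fail */
region_add(&inode->i_mapping->private_list, from, to);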
> }
> -
> - ret = 0;
> - if (delta < 0)
> - return_unused_surplus_pages((unsigned long) -delta);
> -
> -out:
> - spin_unlock(&hugetlb_lock);
> - return ret;
> -}
> -
> -int hugetlb_reserve_pages(struct inode *inode, long from, long to)
> -{
> - long ret, chg;
> -
> - chg = region_chg(&inode->i_mapping->private_list, from, to);
> +
> if (chg < 0)
> return chg;
-apw
* Re: [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings
2008-04-25 14:28 ` Andy Whitcroft
@ 2008-04-25 15:19 ` Mel Gorman
0 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2008-04-25 15:19 UTC (permalink / raw)
To: Andy Whitcroft; +Cc: wli, agl, linux-mm, linux-kernel
On (25/04/08 15:28), Andy Whitcroft didst pronounce:
> > <SNIP>
> >
> > Opinions?
>
> [This is one of those patches which is best read applied; the diff has
> not been friendly to the reviewer.]
>
Yeah, I noticed that all right. I'll see whether it can be fixed up in a
future revision.
> Overall I think we should be sanitising these semantics. So I would
> like to see this stack progressed.
>
Right.
> > <SNIP>
> > +
> > +static void hugetlb_vm_open(struct vm_area_struct *vma)
> > +{
> > + if (!(vma->vm_flags & VM_MAYSHARE))
> > + set_vma_resv_huge_pages(vma, 0);
> > +}
>
> Ok, you zap out the reservation when the VMA is opened. How does that
> tie in with the VMA modifications which occur when we mprotect a page in
> the middle of a map?
>
Umm.... Badly.
> From my reading of vma_adjust and vma_split, I am not convinced you
> would maintain the reservation correctly. I suspect that the original
> VMA will retain the whole reservation which it will then not be able to
> use.
You're right. The only ray of light here is that the reservation should
not leak: on exit, the VMA will be closed and the reserve given back on
the assumption the pages were reserved but never faulted. Not a great
consolation; this is still wrong.
> The new VMAs would not have any reservation and might then fail on
> fault despite the total reservation being sufficient.
>
Correct. It would get nailed by this check.
/*
* A child process with MAP_PRIVATE mappings created by its parent
* has no page reserves. This check ensures that reservations are
* not "stolen". The child may still get SIGKILLed
*/
if (!(vma->vm_flags & VM_MAYSHARE) &&
!vma_resv_huge_pages(vma) &&
free_huge_pages - resv_huge_pages == 0)
return NULL;
I will try altering dup_mmap() to reset the reserves during fork() instead
of vm_ops->open(). The vm_ops->close() should still be ok.
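Something along these lines, perhaps; this is an untested sketch, and
the exact hook point in dup_mmap() and the vma variable name are
assumptions:

/* in dup_mmap(), once the child's copy of the vma (tmp) exists:
 * private hugetlb reservations belong to the parent alone, so
 * clear them in the child rather than in vm_ops->open() */
if (is_vm_hugetlb_page(tmp) && !(tmp->vm_flags & VM_MAYSHARE))
	set_vma_resv_huge_pages(tmp, 0);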
> > +
> > +static void hugetlb_vm_close(struct vm_area_struct *vma)
> > +{
> > + unsigned long reserve = vma_resv_huge_pages(vma);
> > + if (reserve)
> > + hugetlb_acct_memory(-reserve);
> > +}
> > +
> > struct vm_operations_struct hugetlb_vm_ops = {
> > .fault = hugetlb_vm_op_fault,
> > + .close = hugetlb_vm_close,
> > + .open = hugetlb_vm_open,
> > };
> >
> > static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> > @@ -1223,52 +1305,30 @@ static long region_truncate(struct list_
> > return chg;
> > }
> >
> > -static int hugetlb_acct_memory(long delta)
> > +int hugetlb_reserve_pages(struct inode *inode,
> > + long from, long to,
> > + struct vm_area_struct *vma)
> > {
> > - int ret = -ENOMEM;
> > + long ret, chg;
> >
> > - spin_lock(&hugetlb_lock);
> > /*
> > - * When cpuset is configured, it breaks the strict hugetlb page
> > - * reservation as the accounting is done on a global variable. Such
> > - * reservation is completely rubbish in the presence of cpuset because
> > - * the reservation is not checked against page availability for the
> > - * current cpuset. Application can still potentially OOM'ed by kernel
> > - * with lack of free htlb page in cpuset that the task is in.
> > - * Attempt to enforce strict accounting with cpuset is almost
> > - * impossible (or too ugly) because cpuset is too fluid that
> > - * task or memory node can be dynamically moved between cpusets.
> > - *
> > - * The change of semantics for shared hugetlb mapping with cpuset is
> > - * undesirable. However, in order to preserve some of the semantics,
> > - * we fall back to check against current free page availability as
> > - * a best attempt and hopefully to minimize the impact of changing
> > - * semantics that cpuset has.
> > + * Shared mappings and read-only mappings should base their reservation
> > + * on the number of pages that are already allocated on behalf of the
> > + * file. Private mappings that are writable need to reserve the full
> > + * area. Note that a read-only private mapping that subsequently calls
> > + * mprotect() to make it read-write may not work reliably.
> > */
> > - if (delta > 0) {
> > - if (gather_surplus_pages(delta) < 0)
> > - goto out;
> > -
> > - if (delta > cpuset_mems_nr(free_huge_pages_node)) {
> > - return_unused_surplus_pages(delta);
> > - goto out;
> > - }
> > + if (vma->vm_flags & VM_SHARED)
> > + chg = region_chg(&inode->i_mapping->private_list, from, to);
> > + else {
> > + if (vma->vm_flags & VM_MAYWRITE)
> > + chg = to - from;
> > + else
> > + chg = region_chg(&inode->i_mapping->private_list,
> > + from, to);
>
> In the read-only case you only create a reservation for the first mmap
> of a particular offset in the file. I do not think this will work as
> intended.
You're right.
> Consider a process which forks, and each process then mmaps the same
> offset: the first will get a reservation for its mmap, the second will
> not. This seems to violate the "mapper is guaranteed to get sufficient
> pages" guarantee for the second mapper. As the pages are missing and
> read-only, we know we could in fact share the pages, so this might make
> sense _if_ we could find and share them at fault time. Currently we do
> not have the information required to find these pages, so we would have
> to allocate pages for each mmap.
>
> As things stand I think we should be using 'chg = to - from' for all
> private mappings, as each mapping is effectively independent.
>
> > + set_vma_resv_huge_pages(vma, chg);
>
Agreed. It's also the case that mprotect() to PROT_WRITE is not handled by
this patch at all. chg = to - from appears to be the way to go for a
multitude of reasons.
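For illustration, this is the userspace sequence the patch's own comment
warns about (hypothetical example; fd is assumed to be a hugetlbfs file
of size len):

/* read-only private mmap: the posted patch reserves via the
 * region_chg() path rather than charging the full range */
char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

/* later upgrade: write faults now need private copies for which
 * nothing was reserved, so the process can still be SIGKILLed */
mprotect(p, len, PROT_READ | PROT_WRITE);
p[0] = 1;	/* write fault with no reserved huge page */

Charging to - from at mmap() time for all private mappings covers this
sequence as well.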
> What's not clear from the diff is that this change leaves us with two
> cases where we apply region_chg() and one where we do not, but we then
> always apply region_add(). When writing that region code I intended
> region_chg/region_add as a prepare/commit pair, with the former
> performing any memory allocation we might require. It is not safe to
> call region_add() without first calling region_chg(). Yes, the names
> are not helpful. That region_add() probably should be:
>
> if (vma->vm_flags & VM_SHARED || !(vma->vm_flags & VM_MAYWRITE))
> region_add(&inode->i_mapping->private_list, from, to);
>
Big oops on my part. You're right again.
>
> > }
> > -
> > - ret = 0;
> > - if (delta < 0)
> > - return_unused_surplus_pages((unsigned long) -delta);
> > -
> > -out:
> > - spin_unlock(&hugetlb_lock);
> > - return ret;
> > -}
> > -
> > -int hugetlb_reserve_pages(struct inode *inode, long from, long to)
> > -{
> > - long ret, chg;
> > -
> > - chg = region_chg(&inode->i_mapping->private_list, from, to);
> > +
> > if (chg < 0)
> > return chg;
>
Thanks Andy.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Thread overview: 11+ messages
2008-04-21 18:36 [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings Mel Gorman
2008-04-21 19:05 ` Adam Litke
2008-04-21 19:19 ` Mel Gorman
2008-04-23 13:55 ` Andi Kleen
2008-04-23 15:14 ` Mel Gorman
2008-04-23 15:43 ` Andi Kleen
2008-04-23 16:01 ` Mel Gorman
2008-04-24 8:47 ` Andy Whitcroft
2008-05-04 5:29 ` dean gaudet
2008-04-25 14:28 ` Andy Whitcroft
2008-04-25 15:19 ` Mel Gorman