linux-mm.kvack.org archive mirror
* [PATCH 1/3] hugetlb: fix size=4G parsing
@ 2006-10-25  2:31 Hugh Dickins
  2006-10-25  2:35 ` [PATCH 2/3] hugetlb: fix prio_tree unit Hugh Dickins
  2006-10-25  2:38 ` [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd Hugh Dickins
  0 siblings, 2 replies; 25+ messages in thread
From: Hugh Dickins @ 2006-10-25  2:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ken Chen, Bill Irwin, Adam Litke, David Gibson, linux-mm

On 32-bit machines, mount -t hugetlbfs -o size=4G gave a 0GB filesystem,
size=5G gave a 1GB filesystem, etc.: there's no point in masking size with
HPAGE_MASK just before shifting its lower bits away, and since HPAGE_MASK
is a UL, that removed all the higher bits of the unsigned long long size.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
___

 fs/hugetlbfs/inode.c |    1 -
 1 file changed, 1 deletion(-)

--- 2.6.19-rc3/fs/hugetlbfs/inode.c	2006-10-24 04:34:28.000000000 +0100
+++ linux/fs/hugetlbfs/inode.c	2006-10-24 17:43:08.000000000 +0100
@@ -624,7 +624,6 @@ hugetlbfs_parse_options(char *options, s
 				do_div(size, 100);
 				rest++;
 			}
-			size &= HPAGE_MASK;
 			pconfig->nr_blocks = (size >> HPAGE_SHIFT);
 			value = rest;
 		} else if (!strcmp(opt,"nr_inodes")) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 2/3] hugetlb: fix prio_tree unit
  2006-10-25  2:31 [PATCH 1/3] hugetlb: fix size=4G parsing Hugh Dickins
@ 2006-10-25  2:35 ` Hugh Dickins
  2006-10-25  7:08   ` David Gibson
  2006-10-25  2:38 ` [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd Hugh Dickins
  1 sibling, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2006-10-25  2:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ken Chen, Bill Irwin, Adam Litke, David Gibson, linux-mm

hugetlb_vmtruncate_list was misconverted to prio_tree: its prio_tree is
in units of PAGE_SIZE (PAGE_CACHE_SIZE) like any other, not HPAGE_SIZE
(whereas its radix_tree is kept in units of HPAGE_SIZE, otherwise slots
would be absurdly sparse).

At first I thought the error benign, just calling __unmap_hugepage_range
on more vmas than necessary; but on 32-bit machines, when the prio_tree
is searched correctly, it happens to ensure the v_offset calculation won't
overflow.  As it stood, when truncating at or beyond 4GB, it was liable
to discard pages COWed from lower offsets; or even to clear pmd entries
of preceding vmas, triggering exit_mmap's BUG_ON(nr_ptes).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
___

 fs/hugetlbfs/inode.c |   24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

--- 2.6.19-rc3/fs/hugetlbfs/inode.c	2006-10-24 04:34:28.000000000 +0100
+++ linux/fs/hugetlbfs/inode.c	2006-10-24 17:43:08.000000000 +0100
@@ -271,26 +271,24 @@ static void hugetlbfs_drop_inode(struct 
 		hugetlbfs_forget_inode(inode);
 }
 
-/*
- * h_pgoff is in HPAGE_SIZE units.
- * vma->vm_pgoff is in PAGE_SIZE units.
- */
 static inline void
-hugetlb_vmtruncate_list(struct prio_tree_root *root, unsigned long h_pgoff)
+hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
 
-	vma_prio_tree_foreach(vma, &iter, root, h_pgoff, ULONG_MAX) {
-		unsigned long h_vm_pgoff;
+	vma_prio_tree_foreach(vma, &iter, root, pgoff, ULONG_MAX) {
 		unsigned long v_offset;
 
-		h_vm_pgoff = vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT);
-		v_offset = (h_pgoff - h_vm_pgoff) << HPAGE_SHIFT;
 		/*
-		 * Is this VMA fully outside the truncation point?
+		 * Can the expression below overflow on 32-bit arches?
+		 * No, because the prio_tree returns us only those vmas
+		 * which overlap the truncated area starting at pgoff,
+		 * and no vma on a 32-bit arch can span beyond the 4GB.
 		 */
-		if (h_vm_pgoff >= h_pgoff)
+		if (vma->vm_pgoff < pgoff)
+			v_offset = (pgoff - vma->vm_pgoff) << PAGE_SHIFT;
+		else
 			v_offset = 0;
 
 		__unmap_hugepage_range(vma,
@@ -303,14 +301,14 @@ hugetlb_vmtruncate_list(struct prio_tree
  */
 static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 {
-	unsigned long pgoff;
+	pgoff_t pgoff;
 	struct address_space *mapping = inode->i_mapping;
 
 	if (offset > inode->i_size)
 		return -EINVAL;
 
 	BUG_ON(offset & ~HPAGE_MASK);
-	pgoff = offset >> HPAGE_SHIFT;
+	pgoff = offset >> PAGE_SHIFT;
 
 	inode->i_size = offset;
 	spin_lock(&mapping->i_mmap_lock);


* [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-25  2:31 [PATCH 1/3] hugetlb: fix size=4G parsing Hugh Dickins
  2006-10-25  2:35 ` [PATCH 2/3] hugetlb: fix prio_tree unit Hugh Dickins
@ 2006-10-25  2:38 ` Hugh Dickins
  2006-10-25  5:23   ` Mika Penttilä
  2006-10-25  6:26   ` David Gibson
  1 sibling, 2 replies; 25+ messages in thread
From: Hugh Dickins @ 2006-10-25  2:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ken Chen, Bill Irwin, Adam Litke, David Gibson, linux-mm

If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative".  Reinstate
my preliminary i_size check before attempting to allocate the page (though
this only fixes the most obvious case: more work will be needed here).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
___

This is not a complete solution (what if hugetlb_no_page is actually
racing with truncate_hugepages?), and there are several other accounting
anomalies in here (private versus shared pages, hugetlbfs quota handling);
but those all need more thought.  It'll probably make sense to use i_mutex
instead of hugetlb_instantiation_mutex, so locking out truncation and mmap.

 mm/hugetlb.c |    3 +++
 1 file changed, 3 insertions(+)

--- 2.6.19-rc3/mm/hugetlb.c	2006-10-24 04:34:37.000000000 +0100
+++ linux/mm/hugetlb.c	2006-10-24 16:23:17.000000000 +0100
@@ -478,6 +478,9 @@ int hugetlb_no_page(struct mm_struct *mm
 retry:
 	page = find_lock_page(mapping, idx);
 	if (!page) {
+		size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+		if (idx >= size)
+			goto out;
 		if (hugetlb_get_quota(mapping))
 			goto out;
 		page = alloc_huge_page(vma, address);


* Re: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-25  2:38 ` [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd Hugh Dickins
@ 2006-10-25  5:23   ` Mika Penttilä
  2006-10-25  5:52     ` David Gibson
  2006-10-25  6:26   ` David Gibson
  1 sibling, 1 reply; 25+ messages in thread
From: Mika Penttilä @ 2006-10-25  5:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Ken Chen, Bill Irwin, Adam Litke, David Gibson, linux-mm

Hugh Dickins wrote:
> If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
> area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative".  Reinstate
> my preliminary i_size check before attempting to allocate the page (though
> this only fixes the most obvious case: more work will be needed here).
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ___
>
> This is not a complete solution (what if hugetlb_no_page is actually
> racing with truncate_hugepages?), and there are several other accounting
> anomalies in here (private versus shared pages, hugetlbfs quota handling);
> but those all need more thought.  It'll probably make sense to use i_mutex
> instead of hugetlb_instantiation_mutex, so locking out truncation and mmap.
>
>  mm/hugetlb.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> --- 2.6.19-rc3/mm/hugetlb.c	2006-10-24 04:34:37.000000000 +0100
> +++ linux/mm/hugetlb.c	2006-10-24 16:23:17.000000000 +0100
> @@ -478,6 +478,9 @@ int hugetlb_no_page(struct mm_struct *mm
>  retry:
>  	page = find_lock_page(mapping, idx);
>  	if (!page) {
> +		size = i_size_read(mapping->host) >> HPAGE_SHIFT;
> +		if (idx >= size)
> +			goto out;
>  		if (hugetlb_get_quota(mapping))
>  			goto out;
>  		page = alloc_huge_page(vma, address);
>
> --
>   
Shouldn't it be something like the following?

size = (i_size_read(mapping->host) + HPAGE_SIZE - 1) >> HPAGE_SHIFT;



If so, this was wrong in the original code also.

--Mika


* Re: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-25  5:23   ` Mika Penttilä
@ 2006-10-25  5:52     ` David Gibson
  2006-10-25  7:27       ` Hugh Dickins
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2006-10-25  5:52 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Hugh Dickins, Andrew Morton, Ken Chen, Bill Irwin, Adam Litke, linux-mm

On Wed, Oct 25, 2006 at 08:23:13AM +0300, Mika Penttilä wrote:
> Hugh Dickins wrote:
> >If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
> >area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative".  Reinstate
> >my preliminary i_size check before attempting to allocate the page (though
> >this only fixes the most obvious case: more work will be needed here).
> >
> >Signed-off-by: Hugh Dickins <hugh@veritas.com>
> >___
> >
> >This is not a complete solution (what if hugetlb_no_page is actually
> >racing with truncate_hugepages?), and there are several other accounting
> >anomalies in here (private versus shared pages, hugetlbfs quota handling);
> >but those all need more thought.  It'll probably make sense to use i_mutex
> >instead of hugetlb_instantiation_mutex, so locking out truncation and 
> >mmap.
> >
> > mm/hugetlb.c |    3 +++
> > 1 file changed, 3 insertions(+)
> >
> >--- 2.6.19-rc3/mm/hugetlb.c	2006-10-24 04:34:37.000000000 +0100
> >+++ linux/mm/hugetlb.c	2006-10-24 16:23:17.000000000 +0100
> >@@ -478,6 +478,9 @@ int hugetlb_no_page(struct mm_struct *mm
> > retry:
> > 	page = find_lock_page(mapping, idx);
> > 	if (!page) {
> >+		size = i_size_read(mapping->host) >> HPAGE_SHIFT;
> >+		if (idx >= size)
> >+			goto out;
> > 		if (hugetlb_get_quota(mapping))
> > 			goto out;
> > 		page = alloc_huge_page(vma, address);
> >
> Shouldn't it be something like the following?
> 
> size = (i_size_read(mapping->host) + HPAGE_SIZE - 1) >> HPAGE_SHIFT;
> 
> 
> 
> If so, this was wrong in the original code also.

In theory, yes, but AFAIK there is no way to get an i_size on a
hugetlbfs file which is not a multiple of HPAGE_SIZE.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-25  2:38 ` [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd Hugh Dickins
  2006-10-25  5:23   ` Mika Penttilä
@ 2006-10-25  6:26   ` David Gibson
  2006-10-25  6:29     ` David Gibson
                       ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: David Gibson @ 2006-10-25  6:26 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Ken Chen, Bill Irwin, Adam Litke, linux-mm

On Wed, Oct 25, 2006 at 03:38:24AM +0100, Hugh Dickins wrote:
> If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
> area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative".  Reinstate
> my preliminary i_size check before attempting to allocate the page (though
> this only fixes the most obvious case: more work will be needed here).
> 
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ___
> 
> This is not a complete solution (what if hugetlb_no_page is actually
> racing with truncate_hugepages?), and there are several other accounting
> anomalies in here (private versus shared pages, hugetlbfs quota handling);
> but those all need more thought.  It'll probably make sense to use i_mutex
> instead of hugetlb_instantiation_mutex, so locking out truncation
> and mmap.

Ah, yes.  I also encountered this one a few days ago - I found it in
the context of deserializing the hugepage fault path, which makes the
problem worse, and forgot to consider if there was also a problem in
the original case.

In fact, there's a second problem with the current location of the
i_size check.  As well as wrapping the reserved count, if there's a
fault on a truncated area and the hugepage pool is also empty, we can
get an OOM SIGKILL instead of the correct SIGBUS.

I don't think things are quite as bad as you fear, though:  I believe the
page lock protects us against racing concurrent truncations (this is
one reason we have find_lock_page() here, rather than the
find_get_page() which appears in the analogous normal page path).

I suggest the slightly revised patch below, which doesn't duplicate
the i_size test, and cleans up the backout path (removing a
no-longer-useful goto label) in the process.

hugepage: Correct i_size test to fix some corner cases

If you truncated an mmap'ed hugetlbfs file, then faulted on the
truncated area, /proc/meminfo's HugePages_Rsvd wrapped hugely
"negative".  In addition, faulting on the truncated area when the
hugepage pool was also exhausted could result in a VM OOM SIGKILL
instead of the correct SIGBUS.

Correct these by moving the i_size check to before the allocation of a
new page.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

 mm/hugetlb.c |   36 +++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c	2006-10-25 16:04:12.000000000 +1000
+++ working-2.6/mm/hugetlb.c	2006-10-25 16:18:17.000000000 +1000
@@ -477,6 +477,22 @@ int hugetlb_no_page(struct mm_struct *mm
 	 */
 retry:
 	page = find_lock_page(mapping, idx);
+	/* In the case the page exists, we want to lock it before we
+	 * check against i_size to guard against racing truncations.
+	 * In the case it doesn't exist, we have to check against
+	 * i_size before attempting to allocate a page, or we could
+	 * get the wrong error if we're also out of hugepages in the
+	 * pool (OOM instead of SIGBUS).  So the i_size test has to go
+	 * in this slightly odd location. */
+	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+	if (idx >= size) {
+		hugetlb_put_quota(mapping);
+		if (page) {
+			unlock_page(page);
+			put_page(page);
+		}
+		return VM_FAULT_SIGBUS;
+	}
 	if (!page) {
 		if (hugetlb_get_quota(mapping))
 			goto out;
@@ -504,13 +520,14 @@ retry:
 	}
 
 	spin_lock(&mm->page_table_lock);
-	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
-	if (idx >= size)
-		goto backout;
-
 	ret = VM_FAULT_MINOR;
-	if (!pte_none(*ptep))
-		goto backout;
+	if (!pte_none(*ptep)) {
+		spin_unlock(&mm->page_table_lock);
+		hugetlb_put_quota(mapping);
+		unlock_page(page);
+		put_page(page);
+		goto out;
+	}
 
 	add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
@@ -526,13 +543,6 @@ retry:
 	unlock_page(page);
 out:
 	return ret;
-
-backout:
-	spin_unlock(&mm->page_table_lock);
-	hugetlb_put_quota(mapping);
-	unlock_page(page);
-	put_page(page);
-	goto out;
 }
 
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-25  6:26   ` David Gibson
@ 2006-10-25  6:29     ` David Gibson
  2006-10-25  8:39     ` Hugh Dickins
  2006-10-25 21:31     ` Adam Litke
  2 siblings, 0 replies; 25+ messages in thread
From: David Gibson @ 2006-10-25  6:29 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Ken Chen, Bill Irwin, Adam Litke, linux-mm

On Wed, Oct 25, 2006 at 04:26:10PM +1000, David Gibson wrote:
> On Wed, Oct 25, 2006 at 03:38:24AM +0100, Hugh Dickins wrote:
> > If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
> > area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative".  Reinstate
> > my preliminary i_size check before attempting to allocate the page (though
> > this only fixes the most obvious case: more work will be needed here).
> > 
> > Signed-off-by: Hugh Dickins <hugh@veritas.com>
> > ___
> > 
> > This is not a complete solution (what if hugetlb_no_page is actually
> > racing with truncate_hugepages?), and there are several other accounting
> > anomalies in here (private versus shared pages, hugetlbfs quota handling);
> > but those all need more thought.  It'll probably make sense to use i_mutex
> > instead of hugetlb_instantiation_mutex, so locking out truncation
> > and mmap.
> 
> Ah, yes.  I also encountered this one a few days ago - I found it in
> the context of deserializing the hugepage fault path, which makes the
> problem worse, and forgot to consider if there was also a problem in
> the original case.
> 
> In fact, there's a second problem with the current location of the
> i_size check.  As well as wrapping the reserved count, if there's a
> fault on a truncated area and the hugepage pool is also empty, we can
> get an OOM SIGKILL instead of the correct SIGBUS.
> 
> I don't think things are quite as bad as you fear, though:  I believe the
> page lock protects us against racing concurrent truncations (this is
> one reason we have find_lock_page() here, rather than the
> find_get_page() which appears in the analogous normal page path).
> 
> I suggest the slightly revised patch below, which doesn't duplicate
> the i_size test, and cleans up the backout path (removing a
> no-longer-useful goto label) in the process.

Bother.  Forgot to add in the above that I've also implemented a
couple of extra cases for the libhugetlbfs testsuite which will catch
this bug.  Adam, if you could merge the patch with these test cases
from:
	http://ozlabs.org/~dgibson/home/tmp/reserve-wraparound
to the libhugetlbfs tree, that would be great.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 2/3] hugetlb: fix prio_tree unit
  2006-10-25  2:35 ` [PATCH 2/3] hugetlb: fix prio_tree unit Hugh Dickins
@ 2006-10-25  7:08   ` David Gibson
  2006-10-25  7:41     ` Hugh Dickins
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2006-10-25  7:08 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Ken Chen, Bill Irwin, Adam Litke, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1316 bytes --]

On Wed, Oct 25, 2006 at 03:35:41AM +0100, Hugh Dickins wrote:
> hugetlb_vmtruncate_list was misconverted to prio_tree: its prio_tree is
> in units of PAGE_SIZE (PAGE_CACHE_SIZE) like any other, not HPAGE_SIZE
> (whereas its radix_tree is kept in units of HPAGE_SIZE, otherwise slots
> would be absurdly sparse).
> 
> At first I thought the error benign, just calling __unmap_hugepage_range
> on more vmas than necessary; but on 32-bit machines, when the prio_tree
> is searched correctly, it happens to ensure the v_offset calculation won't
> overflow.  As it stood, when truncating at or beyond 4GB, it was liable
> to discard pages COWed from lower offsets; or even to clear pmd entries
> of preceding vmas, triggering exit_mmap's BUG_ON(nr_ptes).

Hugh, I'd like to add a testcase to the libhugetlbfs testsuite which
will trigger this bug, but from the description above I'm not sure
exactly how to tickle it.  Can you give some more details of what
sequence of calls will cause the BUG_ON() to be called?

I've attached the skeleton test I have now, but I'm not sure if it's
even close to what's really required for this.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: truncate_above_4GB.c --]
[-- Type: text/x-csrc, Size: 2773 bytes --]

/*
 * libhugetlbfs - Easy use of Linux hugepages
 * Copyright (C) 2005-2006 David Gibson & Adam Litke, IBM Corporation.
 *
 * This library is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public License
 * as published by the Free Software Foundation; either version 2.1 of
 * the License, or (at your option) any later version.
 *
 * This library is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public
 * License along with this library; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
 */
#define _LARGEFILE64_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <signal.h>
#include <sys/mman.h>

#include <hugetlbfs.h>

#include "hugetests.h"

/*
 * Test rationale:
 *
 * At one stage, a misconversion of hugetlb_vmtruncate_list to a
 * prio_tree meant that on 32-bit machines, truncates at or above 4GB
 * could truncate lower pages, resulting in BUG_ON()s.
 */
#define RANDOM_CONSTANT	0x1234ABCD
#define FOURGIG ((off64_t)0x100000000ULL)

static void sigbus_handler_fail(int signum, siginfo_t *si, void *uc)
{
	FAIL("Unexpected SIGBUS");
}

static void sigbus_handler_pass(int signum, siginfo_t *si, void *uc)
{
	PASS();
}

int main(int argc, char *argv[])
{
	int hpage_size;
	int fd;
	void *p, *q;
	volatile unsigned int *pi, *qi;
	int err;
	struct sigaction sa_fail = {
		.sa_sigaction = sigbus_handler_fail,
		.sa_flags = SA_SIGINFO,
	};
	struct sigaction sa_pass = {
		.sa_sigaction = sigbus_handler_pass,
		.sa_flags = SA_SIGINFO,
	};

	test_init(argc, argv);

	hpage_size = gethugepagesize();
	if (hpage_size < 0)
		CONFIG("No hugepage kernel support");

	fd = hugetlbfs_unlinked_fd();
	if (fd < 0)
		FAIL("hugetlbfs_unlinked_fd()");

	p = mmap64(NULL, hpage_size, PROT_READ|PROT_WRITE, MAP_PRIVATE,
		 fd, 0);
	if (p == MAP_FAILED)
		FAIL("mmap() offset 0");
	pi = p;
	/* Touch the low page */
	*pi = 0;

	q = mmap64(NULL, hpage_size, PROT_READ|PROT_WRITE, MAP_PRIVATE,
		 fd, FOURGIG);
	if (q == MAP_FAILED)
		FAIL("mmap() offset 4GB");
	qi = q;
	/* Touch the high page */
	*qi = 0;

	err = ftruncate64(fd, FOURGIG);
	if (err)
		FAIL("ftruncate(): %s", strerror(errno));

	err = sigaction(SIGBUS, &sa_fail, NULL);
	if (err)
		FAIL("sigaction() fail");

	*pi;

	err = sigaction(SIGBUS, &sa_pass, NULL);
	if (err)
		FAIL("sigaction() pass");

	*qi;

	/* Should have SIGBUSed above */
	FAIL("Didn't SIGBUS on truncated page.");
}


* Re: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-25  5:52     ` David Gibson
@ 2006-10-25  7:27       ` Hugh Dickins
  0 siblings, 0 replies; 25+ messages in thread
From: Hugh Dickins @ 2006-10-25  7:27 UTC (permalink / raw)
  To: David Gibson
  Cc: Mika Penttilä,
	Andrew Morton, Ken Chen, Bill Irwin, Adam Litke, linux-mm

[-- Attachment #1: Type: TEXT/PLAIN, Size: 442 bytes --]

On Wed, 25 Oct 2006, David Gibson wrote:
> On Wed, Oct 25, 2006 at 08:23:13AM +0300, Mika Penttilä wrote:
> > Shouldn't it be something like following ?
> > 
> > size = (i_size_read(mapping->host) + HPAGE_SIZE - 1) >> HPAGE_SHIFT;
> > 
> > If so this was wrong in the original code also.
> 
> In theory, yes, but AFAIK there is no way to get an i_size on a
> hugetlbfs file which is not a multiple of HPAGE_SIZE.

Exactly.

Hugh


* Re: [PATCH 2/3] hugetlb: fix prio_tree unit
  2006-10-25  7:08   ` David Gibson
@ 2006-10-25  7:41     ` Hugh Dickins
  2006-10-25 23:49       ` Chen, Kenneth W
  0 siblings, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2006-10-25  7:41 UTC (permalink / raw)
  To: David Gibson; +Cc: Andrew Morton, Ken Chen, Bill Irwin, Adam Litke, linux-mm

On Wed, 25 Oct 2006, David Gibson wrote:
> 
> Hugh, I'd like to add a testcase to the libhugetlbfs testsuite which
> will trigger this bug, but from the description above I'm not sure
> exactly how to tickle it.  Can you give some more details of what
> sequence of calls will cause the BUG_ON() to be called?
> 
> I've attached the skeleton test I have now, but I'm not sure if it's
> even close to what's really required for this.

I'll take a look, or reconstruct my own sequence, later on today and
send it just to you.  The BUG_ON was not at all what I was expecting,
and I spent quite a while working out how it came about (v_offset
wrapped, so vm_start + v_offset less than vm_start, so the huge unmap
applied to a non-huge vma before it).  Though I'm dubious whether it's
really worthwhile devising such a test now.

Hugh


* Re: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-25  6:26   ` David Gibson
  2006-10-25  6:29     ` David Gibson
@ 2006-10-25  8:39     ` Hugh Dickins
  2006-10-25 10:09       ` David Gibson
  2006-10-25 21:31     ` Adam Litke
  2 siblings, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2006-10-25  8:39 UTC (permalink / raw)
  To: David Gibson; +Cc: Andrew Morton, Ken Chen, Bill Irwin, Adam Litke, linux-mm

On Wed, 25 Oct 2006, David Gibson wrote:
> On Wed, Oct 25, 2006 at 03:38:24AM +0100, Hugh Dickins wrote:
> > If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
> > area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative".  Reinstate
> > my preliminary i_size check before attempting to allocate the page (though
> > this only fixes the most obvious case: more work will be needed here).
> > 
> > Signed-off-by: Hugh Dickins <hugh@veritas.com>
> > ___
> > 
> > This is not a complete solution (what if hugetlb_no_page is actually
> > racing with truncate_hugepages?), and there are several other accounting
> > anomalies in here (private versus shared pages, hugetlbfs quota handling);
> > but those all need more thought.  It'll probably make sense to use i_mutex
> > instead of hugetlb_instantiation_mutex, so locking out truncation
> > and mmap.
> 
> Ah, yes.  I also encountered this one a few days ago - I found it in
> the context of deserializing the hugepage fault path, which makes the
> problem worse, and forgot to consider if there was also a problem in
> the original case.
> 
> In fact, there's a second problem with the current location of the
> i_size check.  As well as wrapping the reserved count, if there's a
> fault on a truncated area and the hugepage pool is also empty, we can
> get an OOM SIGKILL instead of the correct SIGBUS.

That's exactly why I put in the preliminary i_size check originally,
which you guys then decided was unnecessary.  But it wasn't worth
arguing over which particular error manifested in that case.

> 
> > I don't think things are quite as bad as you fear, though:  I believe the
> > page lock protects us against racing concurrent truncations (this is
> > one reason we have find_lock_page() here, rather than the
> > find_get_page() which appears in the analogous normal page path).

The page lock protects once you've got it.  But when getting a new
page, there's a sequence of operations before getting the page lock,
where a racing truncate may occur, and then we may add a page to cache
beyond the i_size.  And not remove it until the file is retruncated
or deleted or mount unmounted: not the worst of leaks, but not what
we want either.

> 
> I suggest the slightly revised patch below, which doesn't duplicate
> the i_size test, and cleans up the backout path (removing a
> no-longer-useful goto label) in the process.

I don't see the advantage of that version: I'd rather stick with my
clearly-not-worse two-liner for 2.6.19, and work on fixing it all up
properly later.

I was only trying to fix the prio_tree issue (which had led me to
mislead Ken on his prio_tree use - thankfully he ignored me and stuck
with what worked): but each time I tested something I found something
else wrong; right now I must switch away, so I'm posting the obvious bits.

To expand a little on the other problems here: the hugetlb_get_quota
and hugetlb_put_quota calls are suspect but hard to get right, and
the resv_huge_pages-- in alloc_huge_page is a problem too, I think.
Cached versus private pages seems confused (private pages should
not be counted out of the superblock quota).

It would probably straighten out if hugetlb_no_page just dealt
with page cache pages (no VM_SHARED versus not path), and left the
private pages to hugetlb_cow; but that would be wasteful.

And almost(?) all the backtracking could be taken out if i_mutex
were held; hugetlbfs_file_mmap is already taking i_mutex within
mmap_sem (contrary to usual mm lock ordering, but probably okay
since hugetlbfs has no read/write, though lockdep may need teaching).
Though serializing these faults at all is regrettable.

Things may not be as bad as I fear, but I've certainly not yet
emerged into the light on it all.

Hugh


* Re: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-25  8:39     ` Hugh Dickins
@ 2006-10-25 10:09       ` David Gibson
  2006-10-26  3:59         ` Chen, Kenneth W
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2006-10-25 10:09 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Ken Chen, Bill Irwin, Adam Litke, linux-mm

On Wed, Oct 25, 2006 at 09:39:10AM +0100, Hugh Dickins wrote:
> On Wed, 25 Oct 2006, David Gibson wrote:
> > On Wed, Oct 25, 2006 at 03:38:24AM +0100, Hugh Dickins wrote:
> > > If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
> > > area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative".  Reinstate
> > > my preliminary i_size check before attempting to allocate the page (though
> > > this only fixes the most obvious case: more work will be needed here).
> > > 
> > > Signed-off-by: Hugh Dickins <hugh@veritas.com>
> > > ___
> > > 
> > > This is not a complete solution (what if hugetlb_no_page is actually
> > > racing with truncate_hugepages?), and there are several other accounting
> > > anomalies in here (private versus shared pages, hugetlbfs quota handling);
> > > but those all need more thought.  It'll probably make sense to use i_mutex
> > > instead of hugetlb_instantiation_mutex, so locking out truncation
> > > and mmap.
> > 
> > Ah, yes.  I also encountered this one a few days ago - I found it in
> > the context of deserializing the hugepage fault path, which makes the
> > problem worse, and forgot to consider if there was also a problem in
> > the original case.
> > 
> > In fact, there's a second problem with the current location of the
> > i_size check.  As well as wrapping the reserved count, if there's a
> > fault on a truncated area and the hugepage pool is also empty, we can
> > get an OOM SIGKILL instead of the correct SIGBUS.
> 
> That's exactly why I put in the preliminary i_size check originally,
> which you guys then decided was unnecessary.

We did?  That was silly of us..

> But it wasn't worth
> arguing over which particular error manifested in that case.

> > I don't think things are quite as bad as you fear, though:  I believe the
> > page lock protects us against racing concurrent truncations (this is
> > one reason we have find_lock_page() here, rather than the
> > find_get_page() which appears in the analogous normal page path).
> 
> The page lock protects once you've got it.  But when getting a new
> page, there's a sequence of operations before getting the page lock,
> where a racing truncate may occur, and then we may add a page to cache
> beyond the i_size.  And not remove it until the file is retruncated
> or deleted or mount unmounted: not the worst of leaks, but not what
> we want either.

Ah.  Yes.  Indeed.

Possibly we need to duplicate the truncate_count logic from the normal
page path.

> > I suggest the slightly revised patch below, which doesn't duplicate
> > the i_size test, and cleans up the backout path (removing a
> > no-longer-useful goto label) in the process.
> 
> I don't see the advantage of that version: I'd rather stick with my
> clearly-not-worse two-liner for 2.6.19, and work on fixing it all up
> properly later.

Yeah, I guess so.

> I was only trying to fix the prio_tree issue (which had led me to
> mislead Ken on his prio_tree use - thankfully he ignored me and stuck
> with what worked): but each time I tested something I found something
> else wrong; right now I must switch away, so I'm posting the obvious bits.
> 
> To expand a little on the other problems here: the hugetlb_get_quota
> and hugetlb_put_quota calls are suspect but hard to get right, and
> the resv_huge_pages-- in alloc_huge_page is a problem too, I think.
> Cached versus private pages seems confused (private pages should
> not be counted out of the superblock quota).
> 
> It would probably straighten out if hugetlb_no_page just dealt
> with page cache pages (no VM_SHARED versus not path), and left the
> private pages to hugetlb_cow; but that would be wasteful.

In some cases exceedingly wasteful.  I don't think we can do this.

> And almost(?) all the backtracking could be taken out if i_mutex
> were held; hugetlbfs_file_mmap is already taking i_mutex within
> mmap_sem (contrary to usual mm lock ordering, but probably okay
> since hugetlbfs has no read/write, though lockdep may need teaching).
> Though serializing these faults at all is regrettable.

Um, yes.  Especially when I was in the middle of attempting to
de-serialize it.  Christoph Lameter has userspace stuff to do hugepage
initialization (clearing mostly), in parallel, which obviously won't
work with the serialization.  I have a tentative patch to address it,
which replaces the hugetlb_instantiation_mutex with a table of
mutexes, hashed on address_space and offset (or struct mm and address
for MAP_PRIVATE).  Originally I tried to simply remove the mutex, and
just retry faults when we got an OOM but a race was detected.  After
several variants each on 2 or 3 basic approaches, each of which turned
out to be less race-free than I originally thought, I gave up and went
with the hashed mutexes.  Either way though, there will still be
i_size issues to sort out.

I'm also not sure that i_mutex will work in any case.  I'm pretty sure
I looked for a suitable existing lock, before I created
hugetlb_instantiation_mutex.  I'm not at all sure, but I have a nasty
suspicion there's some not-immediately-obvious lock-ordering
constraint that means we can't take i_mutex across the fault.

> Things may not be as bad as I fear, but I've certainly not yet
> emerged into the light on it all.
> 
> Hugh
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-25  6:26   ` David Gibson
  2006-10-25  6:29     ` David Gibson
  2006-10-25  8:39     ` Hugh Dickins
@ 2006-10-25 21:31     ` Adam Litke
  2 siblings, 0 replies; 25+ messages in thread
From: Adam Litke @ 2006-10-25 21:31 UTC (permalink / raw)
  To: David Gibson
  Cc: npiggin, Hugh Dickins, Andrew Morton, Ken Chen, Bill Irwin, linux-mm

On Wed, 2006-10-25 at 16:26 +1000, David Gibson wrote:
> Correct these by moving the i_size check to before the allocation of a
> new page.
> 
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

I have some early patches which do this very same thing albeit for a
different reason.  I was looking to create a proper ->nopage() operation
for hugetlbfs and quarantine fs things (inodes included) to
fs/hugetlbfs/inode.c.  The added advantage is hugetlb_[get/put]_quota()
calls could also be isolated to fs/hugetlbfs/inode.c.  Unfortunately
this was all held up by Nick Piggin's awesome ->fault() work.  Speaking
of which, could we just use VM_FAULT_RETRY semantics to handle the
i_size check that the patch in the parent email removes?   [ Obviously not
for 2.6.19 :-) ]

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center


* RE: [PATCH 2/3] hugetlb: fix prio_tree unit
  2006-10-25  7:41     ` Hugh Dickins
@ 2006-10-25 23:49       ` Chen, Kenneth W
  2006-10-26  3:47         ` David Gibson
  0 siblings, 1 reply; 25+ messages in thread
From: Chen, Kenneth W @ 2006-10-25 23:49 UTC (permalink / raw)
  To: 'Hugh Dickins', David Gibson
  Cc: Andrew Morton, Bill Irwin, Adam Litke, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1354 bytes --]

Hugh Dickins wrote on Wednesday, October 25, 2006 12:41 AM
> On Wed, 25 Oct 2006, David Gibson wrote:
> > 
> > Hugh, I'd like to add a testcase to the libhugetlbfs testsuite which
> > will trigger this bug, but from the description above I'm not sure
> > exactly how to tickle it.  Can you give some more details of what
> > sequence of calls will cause the BUG_ON() to be called.
> > 
> > I've attached the skeleton test I have now, but I'm not sure if it's
> > even close to what's really required for this.
> 
> I'll take a look, or reconstruct my own sequence, later on today and
> send it just to you.  The BUG_ON was not at all what I was expecting,
> and I spent quite a while working out how it came about (v_offset
> wrapped, so vm_start + v_offset less than vm_start, so the huge unmap
> applied to a non-huge vma before it).  Though I'm dubious whether it's
> really worthwhile devising such a test now.

It's fairly easy to reproduce.  I've got test cases that easily trigger
a kernel oops, and even a sequence that screws up the hugepage_rsvd count.
All I have to do is place vm_start high enough; combined with a large
enough v_offset, the addition "vma->vm_start + v_offset" will overflow.  It
doesn't even need to be over 4GB.

Hugh, if you haven't got time to reconstruct the bug sequence, don't
bother. I'll give my test cases to David.

- Ken


[-- Attachment #2: case2.c --]
[-- Type: application/octet-stream, Size: 1296 bytes --]

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>

#define FILE_NAME "/mnt/htlb/junk"
#define HPAGE_SIZE (4*1024*1024)
#define LENGTH (16UL*1024*1024)
#define TRUNCATE (0x60000000)
#define PROTECTION (PROT_READ | PROT_WRITE)
#define ADDR (void *)(0xa0000000UL)
#define FLAGS (MAP_PRIVATE)

int main(void)
{
        char *addr;
        int fd;
	char s;
	unsigned long i;

        fd = open(FILE_NAME, O_CREAT | O_RDWR, 0755);
        if (fd < 0) {
                perror("Open failed");
                exit(1);
        }

        addr = mmap(0, LENGTH+TRUNCATE, PROTECTION, FLAGS, fd, 0);
        if (addr == MAP_FAILED) {
                perror("first mmap");
                unlink(FILE_NAME);
                exit(1);
        }
	munmap(addr, LENGTH+TRUNCATE);

        addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, fd, 0);
        if (addr == MAP_FAILED) {
                perror("second mmap");
                unlink(FILE_NAME);
                exit(1);
        }

	for (i = 0; i < LENGTH; i+= HPAGE_SIZE)
		addr[i] = 1;
	printf("addr = %p\n", (void *)addr);

	ftruncate(fd, TRUNCATE);

	s = addr[0];

        close(fd);
        unlink(FILE_NAME);

        return s;
}

[-- Attachment #3: case1.c --]
[-- Type: application/octet-stream, Size: 1024 bytes --]

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>

#define FILE_NAME "/mnt/htlb/junk"
#define HPAGE_SIZE (4*1024*1024)
#define LENGTH (16UL*1024*1024)
#define TRUNCATE (8UL*1024*1024)
#define PROTECTION (PROT_READ | PROT_WRITE)
#define ADDR (void *)(0xa0000000UL)
#define OFFSET (0x50000000L)
#define FLAGS (MAP_SHARED)

int main(void)
{
        char *addr;
        int fd;
	char s;
	unsigned long i;

        fd = open(FILE_NAME, O_CREAT | O_RDWR, 0755);
        if (fd < 0) {
                perror("Open failed");
                exit(1);
        }

        addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, fd, OFFSET);
        if (addr == MAP_FAILED) {
                perror("mmap");
                unlink(FILE_NAME);
                exit(1);
        }

	for (i = 0; i < LENGTH; i+= HPAGE_SIZE)
		addr[i] = 1;

	ftruncate(fd, TRUNCATE);
        close(fd);
        unlink(FILE_NAME);

        return s;
}


* Re: [PATCH 2/3] hugetlb: fix prio_tree unit
  2006-10-25 23:49       ` Chen, Kenneth W
@ 2006-10-26  3:47         ` David Gibson
  2006-10-26  6:15           ` Chen, Kenneth W
                             ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: David Gibson @ 2006-10-26  3:47 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Hugh Dickins', Andrew Morton, Bill Irwin, Adam Litke, linux-mm

On Wed, Oct 25, 2006 at 04:49:29PM -0700, Chen, Kenneth W wrote:
> Hugh Dickins wrote on Wednesday, October 25, 2006 12:41 AM
> > On Wed, 25 Oct 2006, David Gibson wrote:
> > > 
> > > Hugh, I'd like to add a testcase to the libhugetlbfs testsuite which
> > > will trigger this bug, but from the description above I'm not sure
> > > exactly how to tickle it.  Can you give some more details of what
> > > sequence of calls will cause the BUG_ON() to be called.
> > > 
> > > I've attached the skeleton test I have now, but I'm not sure if it's
> > > even close to what's really required for this.
> > 
> > I'll take a look, or reconstruct my own sequence, later on today and
> > send it just to you.  The BUG_ON was not at all what I was expecting,
> > and I spent quite a while working out how it came about (v_offset
> > wrapped, so vm_start + v_offset less than vm_start, so the huge unmap
> > applied to a non-huge vma before it).  Though I'm dubious whether it's
> > really worthwhile devising such a test now.
> 
> It's fairly easy to reproduce.  I've got test cases that easily trigger
> a kernel oops, and even a sequence that screws up the hugepage_rsvd count.
> All I have to do is place vm_start high enough; combined with a large
> enough v_offset, the addition "vma->vm_start + v_offset" will overflow.  It
> doesn't even need to be over 4GB.
> 
> Hugh, if you haven't got time to reconstruct the bug sequence, don't
> bother. I'll give my test cases to David.

Actually, Hugh already sent me a testcase by direct mail.  But more
testcases are always good.

I don't really understand your case1.c, though.  It didn't cause any
oops on my laptop (i386) and I couldn't see what else was expected to
behave differently between the "pass" and "fail" cases.  It returns an
uninitialized variable as exit code, btw..

So, I've integrated both Hugh's testcase and case2.c into the
libhugetlbfs testsuite.  A patch to libhugetlbfs is below.  It would
be good if we could get Signed-off-by lines for it from Hugh and
Kenneth, to keep the lawyer types happy, then Adam should be able to
merge it into libhugetlbfs.  It applies on top of my earlier patch
adding testcases for the i_size related reserve count wraparound.

Btw, Adam, while checking this, I discovered there are some other
failures in the testsuite on i386.  We should fix those...

libhugetlbfs: Test cases for prio_tree brokenness

A misconversion of hugetlb_vmtruncate_list to a prio_tree has meant
that on 32-bit machines, certain combinations of mapping and
truncations can truncate incorrect pages, or overwrite pmds from other
VMAs, triggering BUG_ON()s or other weirdness.

Hugh Dickins has submitted a kernel patch to fix the bug.  This patch
adds a couple of different test cases to libhugetlbfs to exercise it.
The test case logic is from Hugh Dickins and Kenneth Chen; I've
adapted it slightly to fold into the testsuite.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

Index: libhugetlbfs/tests/Makefile
===================================================================
--- libhugetlbfs.orig/tests/Makefile	2006-10-26 13:26:29.000000000 +1000
+++ libhugetlbfs/tests/Makefile	2006-10-26 13:26:31.000000000 +1000
@@ -4,7 +4,8 @@ LIB_TESTS = gethugepagesize test_root fi
 	readback truncate shared private empty_mounts meminfo_nohuge \
 	ptrace-write-hugepage icache-hygeine slbpacaflush \
 	chunk-overcommit mprotect alloc-instantiate-race mlock \
-	truncate_reserve_wraparound truncate_sigbus_versus_oom
+	truncate_reserve_wraparound truncate_sigbus_versus_oom \
+	map_high_truncate_2 truncate_above_4GB
 LIB_TESTS_64 = straddle_4GB huge_at_4GB_normal_below \
 	huge_below_4GB_normal_above
 NOLIB_TESTS = malloc malloc_manysmall dummy
Index: libhugetlbfs/tests/run_tests.sh
===================================================================
--- libhugetlbfs.orig/tests/run_tests.sh	2006-10-26 13:26:29.000000000 +1000
+++ libhugetlbfs/tests/run_tests.sh	2006-10-26 13:26:31.000000000 +1000
@@ -118,6 +118,8 @@ functional_tests () {
     run_test_bits 64 straddle_4GB
     run_test_bits 64 huge_at_4GB_normal_below
     run_test_bits 64 huge_below_4GB_normal_above
+    run_test map_high_truncate_2
+    run_test truncate_above_4GB
 
 # Tests requiring an active mount and hugepage COW
     run_test private
Index: libhugetlbfs/tests/truncate_above_4GB.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ libhugetlbfs/tests/truncate_above_4GB.c	2006-10-26 13:26:31.000000000 +1000
@@ -0,0 +1,136 @@
+/*
+ * libhugetlbfs - Easy use of Linux hugepages
+ * Copyright (C) 2005-2006 David Gibson & Adam Litke, IBM Corporation.
+ * Copyright (C) 2006 Hugh Dickins <hugh@veritas.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public License
+ * as published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+#define _LARGEFILE64_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <signal.h>
+#include <sys/mman.h>
+
+#include <hugetlbfs.h>
+
+#include "hugetests.h"
+
+/*
+ * Test rationale:
+ *
+ * At one stage, a misconversion of hugetlb_vmtruncate_list to a
+ * prio_tree meant that on 32-bit machines, truncates at or above 4GB
+ * could truncate lower pages, resulting in BUG_ON()s.
+ */
+#define FOURGIG ((off64_t)0x100000000ULL)
+
+static void sigbus_handler_fail(int signum, siginfo_t *si, void *uc)
+{
+	FAIL("Unexpected SIGBUS");
+}
+
+static void sigbus_handler_pass(int signum, siginfo_t *si, void *uc)
+{
+	PASS();
+}
+
+int main(int argc, char *argv[])
+{
+	int page_size;
+	int hpage_size;
+	long long buggy_offset;
+	int fd;
+	void *p, *q;
+	volatile unsigned int *pi, *qi;
+	int err;
+	struct sigaction sa_fail = {
+		.sa_sigaction = sigbus_handler_fail,
+		.sa_flags = SA_SIGINFO,
+	};
+	struct sigaction sa_pass = {
+		.sa_sigaction = sigbus_handler_pass,
+		.sa_flags = SA_SIGINFO,
+	};
+
+	test_init(argc, argv);
+
+	page_size = getpagesize();
+	hpage_size = gethugepagesize();
+	if (hpage_size < 0)
+		CONFIG("No hugepage kernel support");
+
+	fd = hugetlbfs_unlinked_fd();
+	if (fd < 0)
+		FAIL("hugetlbfs_unlinked_fd()");
+
+	/* First get arena of three hpages size, at file offset 4GB */
+	q = mmap64(NULL, 3*hpage_size, PROT_READ|PROT_WRITE,
+		 MAP_PRIVATE, fd, FOURGIG);
+	if (q == MAP_FAILED)
+		FAIL("mmap() offset 4GB");
+	qi = q;
+	/* Touch the high page */
+	*qi = 0;
+
+	/* This part of the test makes the problem more obvious, but
+	 * is not essential.  It can't be done on powerpc, where
+	 * segment restrictions prohibit us from performing such a
+	 * mapping, so skip it there */
+#if !defined(__powerpc__) && !defined(__powerpc64__)
+	/* Replace middle hpage by tinypage mapping to trigger
+	 * nr_ptes BUG */
+	p = mmap64(q + hpage_size, hpage_size, PROT_READ|PROT_WRITE,
+		   MAP_FIXED|MAP_PRIVATE|MAP_ANON, -1, 0);
+	if (p != q + hpage_size)
+		FAIL("mmap() before low hpage");
+	pi = p;
+	/* Touch one page to allocate its page table */
+	*pi = 0;
+#endif
+
+	/* Replace top hpage by hpage mapping at confusing file offset */
+	buggy_offset = FOURGIG / (hpage_size / page_size);
+	p = mmap64(q + 2*hpage_size, hpage_size, PROT_READ|PROT_WRITE,
+		 MAP_FIXED|MAP_PRIVATE, fd, buggy_offset);
+	if (p != q + 2*hpage_size)
+		FAIL("mmap() buggy offset 0x%llx", buggy_offset);
+	pi = p;
+	/* Touch the low page with something non-zero */
+	*pi = 1;
+
+	err = ftruncate64(fd, FOURGIG);
+	if (err)
+		FAIL("ftruncate(): %s", strerror(errno));
+
+	err = sigaction(SIGBUS, &sa_fail, NULL);
+	if (err)
+		FAIL("sigaction() fail");
+
+	if (*pi != 1)
+		FAIL("Data 1 has changed to %u", *pi);
+
+	err = sigaction(SIGBUS, &sa_pass, NULL);
+	if (err)
+		FAIL("sigaction() pass");
+
+	*qi;
+
+	/* Should have SIGBUSed above */
+	FAIL("Didn't SIGBUS on truncated page.");
+}
Index: libhugetlbfs/hugeutils.c
===================================================================
--- libhugetlbfs.orig/hugeutils.c	2006-10-26 13:26:29.000000000 +1000
+++ libhugetlbfs/hugeutils.c	2006-10-26 13:26:31.000000000 +1000
@@ -235,7 +235,7 @@ int hugetlbfs_unlinked_fd(void)
 	strncat(name, "/libhugetlbfs.tmp.XXXXXX", sizeof(name)-1);
 	/* FIXME: deal with overflows */
 
-	fd = mkstemp(name);
+	fd = mkstemp64(name);
 
 	if (fd < 0) {
 		ERROR("mkstemp() failed: %s\n", strerror(errno));
Index: libhugetlbfs/tests/map_high_truncate_2.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ libhugetlbfs/tests/map_high_truncate_2.c	2006-10-26 13:32:21.000000000 +1000
@@ -0,0 +1,93 @@
+/*
+ * libhugetlbfs - Easy use of Linux hugepages
+ * Copyright (C) 2005-2006 David Gibson & Adam Litke, IBM Corporation.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public License
+ * as published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+#define _LARGEFILE64_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <signal.h>
+#include <sys/mman.h>
+
+#include <hugetlbfs.h>
+
+#include "hugetests.h"
+
+/*
+ * Test rationale:
+ *
+ * At one stage, a misconversion of hugetlb_vmtruncate_list to a
+ * prio_tree meant that on 32-bit machines, certain combinations of
+ * mapping and truncations could truncate incorrect pages, or
+ * overwrite pmds from other VMAs, triggering BUG_ON()s or other
+ * weirdness.
+ *
+ * Test adapted to the libhugetlbfs framework from an example by
+ * Kenneth Chen <kenneth.w.chen@intel.com>
+ */
+#define MAP_LENGTH	(4 * hpage_size)
+#define TRUNCATE_POINT	0x60000000UL
+#define HIGH_ADDR	0xa0000000UL
+
+int main(int argc, char *argv[])
+{
+	int hpage_size;
+	int fd;
+	char *p, *q;
+	unsigned long i;
+	int err;
+
+	test_init(argc, argv);
+
+	hpage_size = gethugepagesize();
+	if (hpage_size < 0)
+		CONFIG("No hugepage kernel support");
+
+	fd = hugetlbfs_unlinked_fd();
+	if (fd < 0)
+		FAIL("hugetlbfs_unlinked_fd()");
+
+	/* First mapping */
+	p = mmap(0, MAP_LENGTH + TRUNCATE_POINT, PROT_READ | PROT_WRITE,
+		 MAP_PRIVATE, fd, 0);
+	if (p == MAP_FAILED)
+		FAIL("mmap() 1");
+
+	munmap(p, MAP_LENGTH + TRUNCATE_POINT);
+
+	q = mmap((void *)HIGH_ADDR, MAP_LENGTH, PROT_READ | PROT_WRITE,
+		 MAP_PRIVATE, fd, 0);
+	if (q == MAP_FAILED)
+		FAIL("mmap() 2");
+
+	verbose_printf("High map at %p\n", q);
+
+	for (i = 0; i < MAP_LENGTH; i += hpage_size)
+		q[i] = 1;
+
+	err = ftruncate(fd, TRUNCATE_POINT);
+	if (err != 0)
+		FAIL("ftruncate()");
+
+	if (q[0] != 1)
+		FAIL("data mismatch");
+
+	PASS();
+}


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* RE: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-25 10:09       ` David Gibson
@ 2006-10-26  3:59         ` Chen, Kenneth W
  2006-10-26  4:13           ` 'David Gibson'
  2006-10-26 19:08           ` Christoph Lameter
  0 siblings, 2 replies; 25+ messages in thread
From: Chen, Kenneth W @ 2006-10-26  3:59 UTC (permalink / raw)
  To: 'David Gibson', Hugh Dickins
  Cc: Andrew Morton, Bill Irwin, Adam Litke, linux-mm

David Gibson wrote on Wednesday, October 25, 2006 3:09 AM
> > And almost(?) all the backtracking could be taken out if i_mutex
> > were held; hugetlbfs_file_mmap is already taking i_mutex within
> > mmap_sem (contrary to usual mm lock ordering, but probably okay
> > since hugetlbfs has no read/write, though lockdep may need teaching).
> > Though serializing these faults at all is regrettable.
> 
> Um, yes.  Especially when I was in the middle of attempting to
> de-serialize it.  Christoph Lameter has userspace stuff to do hugepage
> initialization (clearing mostly), in parallal, which obviously won't
> work with the serialization.  I have a tentative patch to address it,

I used to argue ardently about how important it is to allow parallel hugetlb
faults for scalability, but somehow lost ground in the midst of a flurry
of development.  Glad to see it is coming back.


> which replaces the hugetlb_instantiation_mutex with a table of
> mutexes, hashed on address_space and offset (or struct mm and address
> for MAP_PRIVATE).  Originally I tried to simply remove the mutex, and
> just retry faults when we got an OOM but a race was detected.  After
> several variants each on 2 or 3 basic approaches, each of which turned
> out to be less race-free than I originally thought, I gave up and went
> with the hashed mutexes.  Either way though, there will still be
> i_size issues to sort out.

We are trying to do too much in the fault path.  One wild idea would be
to zero out pages in free_huge_page(), e.g. all "free" pages sitting
in the pool are ready to be handed out.  In the free_huge_page() path, we
have a lot more freedom and can scale better because the page has no owner
yet, nor any tricky races to worry about.  It will make the fault path faster.

- Ken


* Re: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-26  3:59         ` Chen, Kenneth W
@ 2006-10-26  4:13           ` 'David Gibson'
  2006-10-26 19:08           ` Christoph Lameter
  1 sibling, 0 replies; 25+ messages in thread
From: 'David Gibson' @ 2006-10-26  4:13 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: Hugh Dickins, Andrew Morton, Bill Irwin, Adam Litke, linux-mm

On Wed, Oct 25, 2006 at 08:59:18PM -0700, Chen, Kenneth W wrote:
> David Gibson wrote on Wednesday, October 25, 2006 3:09 AM
> > > And almost(?) all the backtracking could be taken out if i_mutex
> > > were held; hugetlbfs_file_mmap is already taking i_mutex within
> > > mmap_sem (contrary to usual mm lock ordering, but probably okay
> > > since hugetlbfs has no read/write, though lockdep may need teaching).
> > > Though serializing these faults at all is regrettable.
> > 
> > Um, yes.  Especially when I was in the middle of attempting to
> > de-serialize it.  Christoph Lameter has userspace stuff to do hugepage
> > initialization (clearing mostly), in parallel, which obviously won't
> > work with the serialization.  I have a tentative patch to address it,
> 
> I used to argue ardently about how important it is to allow parallel hugetlb
> faults for scalability, but somehow lost ground in the midst of a flurry
> of development.  Glad to see it is coming back.

Ok.  Fwiw, my experimental patch for this, which I wrote in response
to Christoph Lameter's concerns is below.  It will need revision in
the face of whatever fix Hugh comes up with for the i_size problems.

> > which replaces the hugetlb_instantiation_mutex with a table of
> > mutexes, hashed on address_space and offset (or struct mm and address
> > for MAP_PRIVATE).  Originally I tried to simply remove the mutex, and
> > just retry faults when we got an OOM but a race was detected.  After
> > several variants each on 2 or 3 basic approaches, each of which turned
> > out to be less race-free than I originally thought, I gave up and went
> > with the hashed mutexes.  Either way though, there will still be
> > i_size issues to sort out.
> 
> We are trying to do too much in the fault path.  One wild idea would be
> to zero out pages in free_huge_page(), e.g. all "free" pages sitting
> in the pool are ready to be handed out.  In the free_huge_page() path, we
> have a lot more freedom and can scale better because the page has no owner
> yet, nor any tricky races to worry about.  It will make the fault path faster.

Hrm.. that's an interesting idea.  One of the first things I attempted
with this deserialization thing was just to shrink the serialized
area, moving the hugepage clear out of the lock.  This turned out to
be impossible (or at least, much harder than I initially thought) due
to lock-ordering and various other constraints.  Doing the clear in
free_huge_page() addresses that quite neatly.

On the other hand, it won't help at all for COW faults - we still have
to copy a huge page.  And in addition, we will already have spent the
time to clear a page which we're just going to overwrite with a copy.

hugepage: Allow parallelization of the hugepage fault path

At present, the page fault path for hugepages is serialized by a
single mutex.  This is used to avoid spurious out-of-memory conditions
when the hugepage pool is fully utilized (two processes or threads can
race to instantiate the same mapping with the last hugepage from the
pool, the race loser returning VM_FAULT_OOM).  This problem is
specific to hugepages, because it is normal to want to use every
single hugepage in the system - with normal pages we simply assume
there will always be a few spare pages which can be used temporarily
until the race is resolved.

Unfortunately this serialization also means that clearing of hugepages
cannot be parallelized across multiple CPUs, which can lead to very
long process startup times when using large numbers of hugepages.

This patch improves the situation by replacing the single mutex with a
table of mutexes, selected based on a hash of the address_space and
file offset being faulted (or mm and virtual address for MAP_PRIVATE
mappings).

Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c	2006-10-23 13:50:31.000000000 +1000
+++ working-2.6/mm/hugetlb.c	2006-10-23 13:58:54.000000000 +1000
@@ -32,6 +32,13 @@ static unsigned int free_huge_pages_node
  */
 static DEFINE_SPINLOCK(hugetlb_lock);
 
+/*
+ * Serializes faults on the same logical page.  This is used to
+ * prevent spurious OOMs when the hugepage pool is fully utilized.
+ */
+static int num_fault_mutexes;
+static struct mutex *htlb_fault_mutex_table;
+
 static void clear_huge_page(struct page *page, unsigned long addr)
 {
 	int i;
@@ -160,6 +167,13 @@ static int __init hugetlb_init(void)
 	}
 	max_huge_pages = free_huge_pages = nr_huge_pages = i;
 	printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
+
+	num_fault_mutexes = 2 * num_possible_cpus() - 1;
+	htlb_fault_mutex_table =
+		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
+	for (i = 0; i < num_fault_mutexes; i++)
+		mutex_init(&htlb_fault_mutex_table[i]);
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -458,19 +472,14 @@ static int hugetlb_cow(struct mm_struct 
 }
 
 int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep, int write_access)
+		    struct address_space *mapping, unsigned long idx,
+		    unsigned long address, pte_t *ptep, int write_access)
 {
 	int ret = VM_FAULT_SIGBUS;
-	unsigned long idx;
 	unsigned long size;
 	struct page *page;
-	struct address_space *mapping;
 	pte_t new_pte;
 
-	mapping = vma->vm_file->f_mapping;
-	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
-		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-
 	/*
 	 * Use page lock to guard against racing truncation
 	 * before we get page_table_lock.
@@ -535,28 +544,50 @@ backout:
 	goto out;
 }
 
+static int fault_mutex_hash(struct mm_struct *mm, struct vm_area_struct *vma,
+			    struct address_space *mapping,
+			    unsigned long pagenum, unsigned long address)
+{
+	void *p = mapping;
+
+	if (!(vma->vm_flags & VM_SHARED)) {
+		p = mm;
+		pagenum = address >> HPAGE_SHIFT;
+	}
+	}
+
+	return ((unsigned long)p + pagenum) % num_fault_mutexes;
+}
+
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, int write_access)
 {
 	pte_t *ptep;
 	pte_t entry;
 	int ret;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+	struct address_space *mapping;
+	unsigned long idx;
+	int hash;
 
 	ptep = huge_pte_alloc(mm, address);
 	if (!ptep)
 		return VM_FAULT_OOM;
 
+	mapping = vma->vm_file->f_mapping;
+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
+	hash = fault_mutex_hash(mm, vma, mapping, idx, address);
+	mutex_lock(&htlb_fault_mutex_table[hash]);
 	entry = *ptep;
 	if (pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
-		mutex_unlock(&hugetlb_instantiation_mutex);
+		ret = hugetlb_no_page(mm, vma, mapping, idx,
+				      address, ptep, write_access);
+		mutex_unlock(&htlb_fault_mutex_table[hash]);
 		return ret;
 	}
 
@@ -568,7 +599,7 @@ int hugetlb_fault(struct mm_struct *mm, 
 		if (write_access && !pte_write(entry))
 			ret = hugetlb_cow(mm, vma, address, ptep, entry);
 	spin_unlock(&mm->page_table_lock);
-	mutex_unlock(&hugetlb_instantiation_mutex);
+	mutex_unlock(&htlb_fault_mutex_table[hash]);
 
 	return ret;
 }


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 2/3] hugetlb: fix prio_tree unit
  2006-10-26  3:47         ` David Gibson
@ 2006-10-26  6:15           ` Chen, Kenneth W
  2006-10-26  7:55           ` Hugh Dickins
  2006-10-26  8:13           ` Hugh Dickins
  2 siblings, 0 replies; 25+ messages in thread
From: Chen, Kenneth W @ 2006-10-26  6:15 UTC (permalink / raw)
  To: 'David Gibson'
  Cc: 'Hugh Dickins', Andrew Morton, Bill Irwin, Adam Litke, linux-mm

David Gibson wrote on Wednesday, October 25, 2006 8:48 PM
> On Wed, Oct 25, 2006 at 04:49:29PM -0700, Chen, Kenneth W wrote:
> > It's fairly easy to reproduce.  I have test cases that easily trigger a
> > kernel oops, and even a sequence that screws up the hugepage_rsvd count.
> > All I have to do is place vm_start high enough; combined with a large
> > enough v_offset, the addition "vma->vm_start + v_offset" overflows.  It
> > doesn't even need to go over 4GB.
> > 
> > Hugh, if you haven't got time to reconstruct the bug sequence, don't
> > bother. I'll give my test cases to David.
> 
> I don't really understand your case1.c, though.  It didn't cause any
> oops on my laptop (i386) and I couldn't see what else was expected to
> behave differently between the "pass" and "fail" cases.  It returns an
> uninitialized variable as exit code, btw..

Yeah, case1 looked bogus by itself.  I must have had some other combination
which I'm now unfortunately unable to recall.  This was the one that messed
up the hugepages_rsvd count.  I won't pursue this any further given that
there are two other working test cases.


> So, I've integrated both Hugh's testcase and case2.c into the
> libhugetlbfs testsuite.  A patch to libhugetlbfs is below.  It would
> be good if we could get Signed-off-by lines for it from Hugh and
> Kenneth, to keep the lawyer types happy, then Adam should be able to
> merge it into libhugetlbfs.

map_high_truncate_2.c looks fine.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>


* Re: [PATCH 2/3] hugetlb: fix prio_tree unit
  2006-10-26  3:47         ` David Gibson
  2006-10-26  6:15           ` Chen, Kenneth W
@ 2006-10-26  7:55           ` Hugh Dickins
  2006-10-26  8:13           ` Hugh Dickins
  2 siblings, 0 replies; 25+ messages in thread
From: Hugh Dickins @ 2006-10-26  7:55 UTC (permalink / raw)
  To: David Gibson
  Cc: Chen, Kenneth W, Andrew Morton, Bill Irwin, Adam Litke, linux-mm

On Thu, 26 Oct 2006, David Gibson wrote:
> On Wed, Oct 25, 2006 at 04:49:29PM -0700, Chen, Kenneth W wrote:
> > 
> > Hugh, if you haven't got time to reconstruct the bug sequence, don't
> > bother. I'll give my test cases to David.

Thanks a lot, Ken.

> 
> Actually, Hugh already sent me a testcase by direct mail.  But more
> testcases are already good.
> 
> I don't really understand your case1.c, though.  It didn't cause any
> oops on my laptop (i386) and I couldn't see what else was expected to
> behave differently between the "pass" and "fail" cases.  It returns an
> uninitialized variable as exit code, btw..

I found it all very confusing, and I'm relieved to hear that Ken
did too!  Much easier to fix than to work out testcases and make
sense of the resulting misbehaviour.  I wasn't even aware that it
could happen without going beyond 4GB.

> 
> So, I've integrated both Hugh's testcase and case2.c into the
> libhugetlbfs testsuite.  A patch to libhugetlbfs is below.  It would
> be good if we could get Signed-off-by lines for it from Hugh and
> Kenneth, to keep the lawyer types happy, then Adam should be able to
> merge it into libhugetlbfs.  It applies on top of my earlier patch
> adding testcases for the i_size related reserve count wraparound.

Yes, you're welcome to my
Signed-off-by: Hugh Dickins <hugh@veritas.com>

Hugh


* Re: [PATCH 2/3] hugetlb: fix prio_tree unit
  2006-10-26  3:47         ` David Gibson
  2006-10-26  6:15           ` Chen, Kenneth W
  2006-10-26  7:55           ` Hugh Dickins
@ 2006-10-26  8:13           ` Hugh Dickins
  2006-10-26 10:42             ` David Gibson
  2 siblings, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2006-10-26  8:13 UTC (permalink / raw)
  To: David Gibson
  Cc: Chen, Kenneth W, Andrew Morton, Bill Irwin, Adam Litke, linux-mm

On Thu, 26 Oct 2006, David Gibson wrote:
> +
> +	/* This part of the test makes the problem more obvious, but
> +	 * is not essential.  It can't be done on powerpc, where
> +	 * segment restrictions prohibit us from performing such a
> +	 * mapping, so skip it there */
> +#if !defined(__powerpc__) && !defined(__powerpc64__)
> +	/* Replace middle hpage by tinypage mapping to trigger
> +	 * nr_ptes BUG */

I should add, I expect you'll need to extend that #if'ing to exclude
at least ia64 too, won't you?   No architecture that segregates its
hugepage virtual address space will manage the interposed tinypage.

Hugh


* Re: [PATCH 2/3] hugetlb: fix prio_tree unit
  2006-10-26  8:13           ` Hugh Dickins
@ 2006-10-26 10:42             ` David Gibson
  0 siblings, 0 replies; 25+ messages in thread
From: David Gibson @ 2006-10-26 10:42 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Chen, Kenneth W, Andrew Morton, Bill Irwin, Adam Litke, linux-mm

On Thu, Oct 26, 2006 at 09:13:28AM +0100, Hugh Dickins wrote:
> On Thu, 26 Oct 2006, David Gibson wrote:
> > +
> > +	/* This part of the test makes the problem more obvious, but
> > +	 * is not essential.  It can't be done on powerpc, where
> > +	 * segment restrictions prohibit us from performing such a
> > +	 * mapping, so skip it there */
> > +#if !defined(__powerpc__) && !defined(__powerpc64__)
> > +	/* Replace middle hpage by tinypage mapping to trigger
> > +	 * nr_ptes BUG */
> 
> I should add, I expect you'll need to extend that #if'ing to exclude
> at least ia64 too, won't you?   No architecture that segregates its
> hugepage virtual address space will manage the interposed tinypage.

Well, libhugetlbfs doesn't support ia64 at all, at present.  Mostly
just because none of us have convenient access to ia64 boxes to
develop or test on.

Porting will require a bunch of ugly switches to disable a third or
more of the functionality, though: the segment remapping stuff will
never work on ia64 in its present form either, again because of the
segregated address space.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* RE: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-26  3:59         ` Chen, Kenneth W
  2006-10-26  4:13           ` 'David Gibson'
@ 2006-10-26 19:08           ` Christoph Lameter
  2006-10-26 19:19             ` Chen, Kenneth W
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2006-10-26 19:08 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'David Gibson',
	Hugh Dickins, Andrew Morton, Bill Irwin, Adam Litke, linux-mm

On Wed, 25 Oct 2006, Chen, Kenneth W wrote:

> I used to argue dearly on how important it is to allow parallel hugetlb
> faults for scalability, but somehow lost my ground in the midst of flurry
> development.  Glad to see it is coming back.

I wish someone would have cced me before allowing performance atrocities such as this.

Performance before the "fixes" (March, 2.6.16):

hsz=256M thr=100 pgs=10 min= 1370ms max=1746ms avg= 1554ms wall=1995ms cpu=155473ms
hsz=256M thr=100 pgs=10 min= 1936ms max=3706ms avg= 2610ms wall=4085ms cpu=261076ms
hsz=256M thr=100 pgs=10 min= 1375ms max=1988ms avg= 1600ms wall=2241ms cpu=160084ms

Performance now:

hsz=256M thr=10 pgs=3 min= 2965ms max=4091ms avg= 3471ms wall=4232ms cpu=34715ms
hsz=256M thr=100 pgs=3 min=16268ms max=43856ms avg=35927ms wall=44561ms cpu=3592702ms
hsz=256M thr=250 pgs=3 min=38348ms max=91242ms avg=74077ms wall=97071ms cpu=18519284ms

Note the runs now use only 3 pages instead of 10 and are still a factor
of 10 down!  Meaning we are actually much worse off than that.

With David's latest parallelization attempt:

hsz=256M thr=100 pgs=10 min= 1373ms max=9604ms avg= 6311ms wall=10787ms cpu=631164ms
hsz=256M thr=100 pgs=10 min= 1442ms max=9115ms avg= 6386ms wall=10078ms cpu=638645ms
hsz=256M thr=100 pgs=10 min= 1451ms max=10788ms avg= 7430ms wall=11357ms cpu=743070ms
hsz=256M thr=100 pgs=10 min= 1439ms max=11876ms avg= 8396ms wall=13091ms cpu=839642ms

Still down by a factor of 3 to 4.


* RE: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-26 19:08           ` Christoph Lameter
@ 2006-10-26 19:19             ` Chen, Kenneth W
  2006-10-26 20:59               ` Christoph Lameter
  2006-10-26 22:19               ` 'David Gibson'
  0 siblings, 2 replies; 25+ messages in thread
From: Chen, Kenneth W @ 2006-10-26 19:19 UTC (permalink / raw)
  To: 'Christoph Lameter'
  Cc: 'David Gibson',
	Hugh Dickins, Andrew Morton, Bill Irwin, Adam Litke, linux-mm

Christoph Lameter wrote on Thursday, October 26, 2006 12:09 PM
> On Wed, 25 Oct 2006, Chen, Kenneth W wrote:
> > I used to argue dearly on how important it is to allow parallel hugetlb
> > faults for scalability, but somehow lost my ground in the midst of flurry
> > development.  Glad to see it is coming back.
> 
> I wish someone would have cced me before allowing performance atrocities
> such as this.
> 
> Performance before the "fixes" (March, 2.6.16):
> 
> hsz=256M thr=100 pgs=10 min= 1370ms max=1746ms avg= 1554ms wall=1995ms cpu=155473ms
> hsz=256M thr=100 pgs=10 min= 1936ms max=3706ms avg= 2610ms wall=4085ms cpu=261076ms
> hsz=256M thr=100 pgs=10 min= 1375ms max=1988ms avg= 1600ms wall=2241ms cpu=160084ms
> 
> Performance now:
> 
> hsz=256M thr=10 pgs=3 min= 2965ms max=4091ms avg= 3471ms wall=4232ms cpu=34715ms
> hsz=256M thr=100 pgs=3 min=16268ms max=43856ms avg=35927ms wall=44561ms cpu=3592702ms
> hsz=256M thr=250 pgs=3 min=38348ms max=91242ms avg=74077ms wall=97071ms cpu=18519284ms
> 
> Note the runs now use only 3 pages instead of 10 and are still a factor
> of 10 down!  Meaning we are actually much worse off than that.
> 
> With David's latest parallelization attempt:
> 
> hsz=256M thr=100 pgs=10 min= 1373ms max=9604ms avg= 6311ms wall=10787ms cpu=631164ms
> hsz=256M thr=100 pgs=10 min= 1442ms max=9115ms avg= 6386ms wall=10078ms cpu=638645ms
> hsz=256M thr=100 pgs=10 min= 1451ms max=10788ms avg= 7430ms wall=11357ms cpu=743070ms
> hsz=256M thr=100 pgs=10 min= 1439ms max=11876ms avg= 8396ms wall=13091ms cpu=839642ms
> 
> Still down by a factor of 3 to 4.


One performance fix I have in mind is to only use the mutex when the system
is down to 1 free hugetlb page.  That is the real reason the mutex got
introduced.  I'm implementing it right now and hope it will restore most if
not all of the performance we lost.

Christoph, the shared page table for hugetlb also needs your advice here, in
the path of allocating a page table page.  It takes a per-inode spin lock in
order to find a shareable page table page.  Do you think it will cause
problems?  I hope not.

- Ken


* RE: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-26 19:19             ` Chen, Kenneth W
@ 2006-10-26 20:59               ` Christoph Lameter
  2006-10-26 22:19               ` 'David Gibson'
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Lameter @ 2006-10-26 20:59 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'David Gibson',
	Hugh Dickins, Andrew Morton, Bill Irwin, Adam Litke, linux-mm

On Thu, 26 Oct 2006, Chen, Kenneth W wrote:

> One performance fix I have in mind is to only use the mutex when the system
> is down to 1 free hugetlb page.  That is the real reason the mutex got
> introduced.  I'm implementing it right now and hope it will restore most
> if not all of the performance we lost.

Right.  It's a heavily special case that brings down performance for
everyone else.  Maybe set up a boundary via /proc/sys/vm/nr_hugepages_xxx
below which this is checked?

> Christoph, the shared page table for hugetlb also needs your advice here, in
> the path of allocating a page table page.  It takes a per-inode spin lock in
> order to find a shareable page table page.  Do you think it will cause
> problems?  I hope not.

Well it depends on how huge pages are used. If you use one file per 
process then there is no issue.

Even if we have a common huge file spanning multiple nodes: If you just 
take the inode lock to find the page then I think its fine. We have the 
same issues with the page lock for regular pages. But please avoid locks 
while zeroing a huge page.


* Re: [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd
  2006-10-26 19:19             ` Chen, Kenneth W
  2006-10-26 20:59               ` Christoph Lameter
@ 2006-10-26 22:19               ` 'David Gibson'
  1 sibling, 0 replies; 25+ messages in thread
From: 'David Gibson' @ 2006-10-26 22:19 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Christoph Lameter',
	Hugh Dickins, Andrew Morton, Bill Irwin, Adam Litke, linux-mm

On Thu, Oct 26, 2006 at 12:19:53PM -0700, Chen, Kenneth W wrote:
> Christoph Lameter wrote on Thursday, October 26, 2006 12:09 PM
> > On Wed, 25 Oct 2006, Chen, Kenneth W wrote:
> > > I used to argue dearly on how important it is to allow parallel hugetlb
> > > faults for scalability, but somehow lost my ground in the midst of flurry
> > > development.  Glad to see it is coming back.
> > 
> > I wish someone would have cced me before allowing performance atrocities
> > such as this.
> > 
> > Performance before the "fixes" (March, 2.6.16):
> > 
> > hsz=256M thr=100 pgs=10 min= 1370ms max=1746ms avg= 1554ms wall=1995ms cpu=155473ms
> > hsz=256M thr=100 pgs=10 min= 1936ms max=3706ms avg= 2610ms wall=4085ms cpu=261076ms
> > hsz=256M thr=100 pgs=10 min= 1375ms max=1988ms avg= 1600ms wall=2241ms cpu=160084ms
> > 
> > Performance now:
> > 
> > hsz=256M thr=10 pgs=3 min= 2965ms max=4091ms avg= 3471ms wall=4232ms cpu=34715ms
> > hsz=256M thr=100 pgs=3 min=16268ms max=43856ms avg=35927ms wall=44561ms cpu=3592702ms
> > hsz=256M thr=250 pgs=3 min=38348ms max=91242ms avg=74077ms wall=97071ms cpu=18519284ms
> > 
> > Note the runs now use only 3 pages instead of 10 and are still a factor
> > of 10 down!  Meaning we are actually much worse off than that.
> > 
> > With David's latest parallelization attempt:
> > 
> > hsz=256M thr=100 pgs=10 min= 1373ms max=9604ms avg= 6311ms wall=10787ms cpu=631164ms
> > hsz=256M thr=100 pgs=10 min= 1442ms max=9115ms avg= 6386ms wall=10078ms cpu=638645ms
> > hsz=256M thr=100 pgs=10 min= 1451ms max=10788ms avg= 7430ms wall=11357ms cpu=743070ms
> > hsz=256M thr=100 pgs=10 min= 1439ms max=11876ms avg= 8396ms wall=13091ms cpu=839642ms
> > 
> > Still down by a factor of 3 to 4.
> 
> 
> One performance fix I have in mind is to only use the mutex when the
> system is down to 1 free hugetlb page.  That is the real reason the
> mutex got introduced.  I'm implementing it right now and hope it will
> restore most if not all of the performance we lost.

Not sufficient for a system with >2 CPUs.  And with preempt and
pathological conditions I'm not sure even "use lock if # freepages < #
cpus" is adequate.

> Christoph, the shared page table for hugetlb also needs your advice
> here, in the path of allocating a page table page.  It takes a per-inode
> spin lock in order to find a shareable page table page.  Do you think
> it will cause problems?  I hope not.
> 
> - Ken
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


end of thread, other threads:[~2006-10-26 22:19 UTC | newest]

Thread overview: 25+ messages
2006-10-25  2:31 [PATCH 1/3] hugetlb: fix size=4G parsing Hugh Dickins
2006-10-25  2:35 ` [PATCH 2/3] hugetlb: fix prio_tree unit Hugh Dickins
2006-10-25  7:08   ` David Gibson
2006-10-25  7:41     ` Hugh Dickins
2006-10-25 23:49       ` Chen, Kenneth W
2006-10-26  3:47         ` David Gibson
2006-10-26  6:15           ` Chen, Kenneth W
2006-10-26  7:55           ` Hugh Dickins
2006-10-26  8:13           ` Hugh Dickins
2006-10-26 10:42             ` David Gibson
2006-10-25  2:38 ` [PATCH 3/3] hugetlb: fix absurd HugePages_Rsvd Hugh Dickins
2006-10-25  5:23   ` Mika Penttilä
2006-10-25  5:52     ` David Gibson
2006-10-25  7:27       ` Hugh Dickins
2006-10-25  6:26   ` David Gibson
2006-10-25  6:29     ` David Gibson
2006-10-25  8:39     ` Hugh Dickins
2006-10-25 10:09       ` David Gibson
2006-10-26  3:59         ` Chen, Kenneth W
2006-10-26  4:13           ` 'David Gibson'
2006-10-26 19:08           ` Christoph Lameter
2006-10-26 19:19             ` Chen, Kenneth W
2006-10-26 20:59               ` Christoph Lameter
2006-10-26 22:19               ` 'David Gibson'
2006-10-25 21:31     ` Adam Litke
