linux-mm.kvack.org archive mirror
* [BUG] hugetlb: sleeping function called from invalid context
@ 2008-08-06 14:43 Gerald Schaefer
  2008-08-06 19:03 ` [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks Andy Whitcroft
  2008-08-07 20:28 ` Andy Whitcroft
  0 siblings, 2 replies; 10+ messages in thread
From: Gerald Schaefer @ 2008-08-06 14:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, schwidefsky, heiko.carstens, Mel Gorman

Hi,

While running the libhugetlbfs test suite, I hit the following bug:

BUG: sleeping function called from invalid context at include/linux/pagemap.h:294
in_atomic():1, irqs_disabled():0
CPU: 0 Not tainted 2.6.27-rc1 #3
Process private (pid: 4531, task: 000000003f68e400, ksp: 000000002a7e3be8)
0700000033a00700 000000002a7e3bf0 0000000000000002 0000000000000000 
       000000002a7e3c90 000000002a7e3c08 000000002a7e3c08 0000000000016472 
       0000000000000000 000000002a7e3be8 0000000000000000 0000000000000000 
       000000002a7e3bf0 000000000000000c 000000002a7e3bf0 000000002a7e3c60 
       0000000000337798 0000000000016472 000000002a7e3bf0 000000002a7e3c40 
Call Trace:
([<00000000000163f4>] show_trace+0x130/0x140)
 [<00000000000164cc>] show_stack+0xc8/0xfc
 [<0000000000016c62>] dump_stack+0xb2/0xc0
 [<000000000003d64a>] __might_sleep+0x136/0x154
 [<000000000008badc>] find_lock_page+0x50/0xb8
 [<00000000000b9b08>] hugetlb_fault+0x4c4/0x684
 [<00000000000a3e3c>] handle_mm_fault+0x8ec/0xb54
 [<00000000003338aa>] do_protection_exception+0x32a/0x3b4
 [<00000000000256b2>] sysc_return+0x0/0x8
 [<0000000000400fba>] 0x400fba

While holding mm->page_table_lock, hugetlb_fault() calls hugetlbfs_pagecache_page(),
which calls find_lock_page(), which may sleep.
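
For reference, here is a minimal sketch of the offending path,
reconstructed from the pre-patch hugetlb_fault() as it appears in the
follow-up patch below (not a verbatim copy of mm/hugetlb.c):

	spin_lock(&mm->page_table_lock);	/* enter atomic context */
	if (likely(pte_same(entry, huge_ptep_get(ptep))))
		if (write_access && !pte_write(entry)) {
			struct page *page;
			/*
			 * hugetlbfs_pagecache_page() ends up in
			 * find_lock_page(), whose lock_page() may sleep:
			 * invalid while a spinlock is held.
			 */
			page = hugetlbfs_pagecache_page(h, vma, address);
			ret = hugetlb_cow(mm, vma, address, ptep, entry,
								page);
			if (page) {
				unlock_page(page);
				put_page(page);
			}
		}
	spin_unlock(&mm->page_table_lock);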

Thanks,
Gerald



* [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks
  2008-08-06 14:43 [BUG] hugetlb: sleeping function called from invalid context Gerald Schaefer
@ 2008-08-06 19:03 ` Andy Whitcroft
  2008-08-07 11:35   ` Gerald Schaefer
  2008-08-07 20:28 ` Andy Whitcroft
  1 sibling, 1 reply; 10+ messages in thread
From: Andy Whitcroft @ 2008-08-06 19:03 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: linux-kernel, linux-mm, schwidefsky, heiko.carstens, Mel Gorman,
	Andy Whitcroft

[Gerald, could you see if this works for you?  It seems to for us on
an x86 build.  If it does, we can push it up to Andrew.]

In the normal case, hugetlbfs reserves hugepages at map time so that the
pages exist for future faults.  A struct file_region is used to track
when reservations have been consumed and where.  These file_regions
are allocated as necessary with kmalloc() which can sleep with the
mm->page_table_lock held.  This is wrong and triggers a may-sleep warning
when PREEMPT is enabled.

Updates to the underlying file_region are done in two phases.  The first
phase prepares the region for the change, allocating any necessary memory,
without actually making the change.  The second phase actually commits
the change.  This patch makes use of this by checking the reservations
before the page_table_lock is taken; triggering any necessary allocations.
This may then be safely repeated within the locks without any allocations
being required.
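
As an illustrative sketch of that two-phase pattern (the helper names
follow the 2.6.27-era reservation code in mm/hugetlb.c, but the call
sites are simplified; in the real code the commit side runs via
alloc_huge_page()):

	/* Phase one, may sleep: kmalloc() a file_region if one is needed. */
	if (vma_needs_reservation(h, vma, address) < 0)
		return VM_FAULT_OOM;

	spin_lock(&mm->page_table_lock);
	/*
	 * Phase two, safe in atomic context: the region entry already
	 * exists, so committing the reservation allocates nothing.
	 */
	vma_commit_reservation(h, vma, address);
	spin_unlock(&mm->page_table_lock);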

Credit to Mel Gorman for diagnosing this failure and initial versions of
the patch.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
---
 mm/hugetlb.c |   44 +++++++++++++++++++++++++++++++++++---------
 1 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 28a2980..c4413ae 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1937,6 +1937,15 @@ retry:
 			lock_page(page);
 	}
 
+	/*
+	 * If we are going to COW a private mapping later, we examine the
+	 * pending reservations for this page now. This will ensure that
+	 * any allocations necessary to record that reservation occur outside
+	 * the spinlock.
+	 */
+	if (write_access && !(vma->vm_flags & VM_SHARED))
+		vma_needs_reservation(h, vma, address);
+
 	spin_lock(&mm->page_table_lock);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
@@ -1973,6 +1982,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *ptep;
 	pte_t entry;
 	int ret;
+	struct page *pagecache_page = NULL;
 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
 
@@ -1995,19 +2005,35 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ret = 0;
 
+	/*
+	 * If we are going to COW the mapping later, we examine the pending
+	 * reservations for this page now. This will ensure that any
+	 * allocations necessary to record that reservation occur outside the
+	 * spinlock. For private mappings, we also lookup the pagecache
+	 * page now as it is used to determine if a reservation has been
+	 * consumed.
+	 */
+	if (write_access && !pte_write(entry)) {
+		vma_needs_reservation(h, vma, address);
+
+		if (!(vma->vm_flags & VM_SHARED))
+			pagecache_page = hugetlbfs_pagecache_page(h,
+								vma, address);
+	}
+
 	spin_lock(&mm->page_table_lock);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
-		if (write_access && !pte_write(entry)) {
-			struct page *page;
-			page = hugetlbfs_pagecache_page(h, vma, address);
-			ret = hugetlb_cow(mm, vma, address, ptep, entry, page);
-			if (page) {
-				unlock_page(page);
-				put_page(page);
-			}
-		}
+		if (write_access && !pte_write(entry))
+			ret = hugetlb_cow(mm, vma, address, ptep, entry,
+							pagecache_page);
 	spin_unlock(&mm->page_table_lock);
+
+	if (pagecache_page) {
+		unlock_page(pagecache_page);
+		put_page(pagecache_page);
+	}
+
 	mutex_unlock(&hugetlb_instantiation_mutex);
 
 	return ret;
-- 
1.6.0.rc1.258.g80295


* Re: [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks
  2008-08-06 19:03 ` [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks Andy Whitcroft
@ 2008-08-07 11:35   ` Gerald Schaefer
  0 siblings, 0 replies; 10+ messages in thread
From: Gerald Schaefer @ 2008-08-07 11:35 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: linux-kernel, linux-mm, schwidefsky, heiko.carstens, Mel Gorman

On Wed, 2008-08-06 at 20:03 +0100, Andy Whitcroft wrote:
> [Gerald, could you see if this works for you?  It seems to for us on
> an x86 build.  If it does, we can push it up to Andrew.]

Yes, it works fine with your patch.

Thanks,
Gerald



* [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks
  2008-08-06 14:43 [BUG] hugetlb: sleeping function called from invalid context Gerald Schaefer
  2008-08-06 19:03 ` [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks Andy Whitcroft
@ 2008-08-07 20:28 ` Andy Whitcroft
  2008-08-07 21:38   ` Andrew Morton
  1 sibling, 1 reply; 10+ messages in thread
From: Andy Whitcroft @ 2008-08-07 20:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gerald Schaefer, linux-kernel, linux-mm, schwidefsky,
	heiko.carstens, Mel Gorman, Andy Whitcroft

[Andrew, this fixes a problem in the private reservations stack, shown up
by some testing done by Gerald on s390 with PREEMPT.  It fixes an attempt
at allocation while holding locks.  This should be merged up to mainline
as a bug fix to those patches.]

In the normal case, hugetlbfs reserves hugepages at map time so that the
pages exist for future faults.  A struct file_region is used to track
when reservations have been consumed and where.  These file_regions
are allocated as necessary with kmalloc() which can sleep with the
mm->page_table_lock held.  This is wrong and triggers a may-sleep warning
when PREEMPT is enabled.

Updates to the underlying file_region are done in two phases.  The first
phase prepares the region for the change, allocating any necessary memory,
without actually making the change.  The second phase actually commits
the change.  This patch makes use of this by checking the reservations
before the page_table_lock is taken; triggering any necessary allocations.
This may then be safely repeated within the locks without any allocations
being required.

Credit to Mel Gorman for diagnosing this failure and initial versions of
the patch.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
---
 mm/hugetlb.c |   44 +++++++++++++++++++++++++++++++++++---------
 1 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 28a2980..c4413ae 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1937,6 +1937,15 @@ retry:
 			lock_page(page);
 	}
 
+	/*
+	 * If we are going to COW a private mapping later, we examine the
+	 * pending reservations for this page now. This will ensure that
+	 * any allocations necessary to record that reservation occur outside
+	 * the spinlock.
+	 */
+	if (write_access && !(vma->vm_flags & VM_SHARED))
+		vma_needs_reservation(h, vma, address);
+
 	spin_lock(&mm->page_table_lock);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
@@ -1973,6 +1982,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *ptep;
 	pte_t entry;
 	int ret;
+	struct page *pagecache_page = NULL;
 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
 
@@ -1995,19 +2005,35 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ret = 0;
 
+	/*
+	 * If we are going to COW the mapping later, we examine the pending
+	 * reservations for this page now. This will ensure that any
+	 * allocations necessary to record that reservation occur outside the
+	 * spinlock. For private mappings, we also lookup the pagecache
+	 * page now as it is used to determine if a reservation has been
+	 * consumed.
+	 */
+	if (write_access && !pte_write(entry)) {
+		vma_needs_reservation(h, vma, address);
+
+		if (!(vma->vm_flags & VM_SHARED))
+			pagecache_page = hugetlbfs_pagecache_page(h,
+								vma, address);
+	}
+
 	spin_lock(&mm->page_table_lock);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
-		if (write_access && !pte_write(entry)) {
-			struct page *page;
-			page = hugetlbfs_pagecache_page(h, vma, address);
-			ret = hugetlb_cow(mm, vma, address, ptep, entry, page);
-			if (page) {
-				unlock_page(page);
-				put_page(page);
-			}
-		}
+		if (write_access && !pte_write(entry))
+			ret = hugetlb_cow(mm, vma, address, ptep, entry,
+							pagecache_page);
 	spin_unlock(&mm->page_table_lock);
+
+	if (pagecache_page) {
+		unlock_page(pagecache_page);
+		put_page(pagecache_page);
+	}
+
 	mutex_unlock(&hugetlb_instantiation_mutex);
 
 	return ret;
-- 
1.6.0.rc1.258.g80295


* Re: [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks
  2008-08-07 20:28 ` Andy Whitcroft
@ 2008-08-07 21:38   ` Andrew Morton
  2008-08-08  8:33     ` Mel Gorman
                       ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Andrew Morton @ 2008-08-07 21:38 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: gerald.schaefer, linux-kernel, linux-mm, schwidefsky,
	heiko.carstens, mel

On Thu,  7 Aug 2008 21:28:23 +0100
Andy Whitcroft <apw@shadowen.org> wrote:

> [Andrew, this fixes a problem in the private reservations stack, shown up
> by some testing done by Gerald on s390 with PREEMPT.  It fixes an attempt
> at allocation while holding locks.  This should be merged up to mainline
> as a bug fix to those patches.]
> 
> In the normal case, hugetlbfs reserves hugepages at map time so that the
> pages exist for future faults.  A struct file_region is used to track
> when reservations have been consumed and where.  These file_regions
> are allocated as necessary with kmalloc() which can sleep with the
> mm->page_table_lock held.  This is wrong and triggers a may-sleep warning
> when PREEMPT is enabled.
> 
> Updates to the underlying file_region are done in two phases.  The first
> phase prepares the region for the change, allocating any necessary memory,
> without actually making the change.  The second phase actually commits
> the change.  This patch makes use of this by checking the reservations
> before the page_table_lock is taken; triggering any necessary allocations.
> This may then be safely repeated within the locks without any allocations
> being required.
> 
> Credit to Mel Gorman for diagnosing this failure and initial versions of
> the patch.
> 

After applying the patch:

: int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
: 			unsigned long address, int write_access)
: {
: 	pte_t *ptep;
: 	pte_t entry;
: 	int ret;
: 	struct page *pagecache_page = NULL;
: 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
: 	struct hstate *h = hstate_vma(vma);
: 
: 	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
: 	if (!ptep)
: 		return VM_FAULT_OOM;
: 
: 	/*
: 	 * Serialize hugepage allocation and instantiation, so that we don't
: 	 * get spurious allocation failures if two CPUs race to instantiate
: 	 * the same page in the page cache.
: 	 */
: 	mutex_lock(&hugetlb_instantiation_mutex);
: 	entry = huge_ptep_get(ptep);
: 	if (huge_pte_none(entry)) {
: 		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
: 		mutex_unlock(&hugetlb_instantiation_mutex);
: 		return ret;
: 	}
: 
: 	ret = 0;
: 
: 	/*
: 	 * If we are going to COW the mapping later, we examine the pending
: 	 * reservations for this page now. This will ensure that any
: 	 * allocations necessary to record that reservation occur outside the
: 	 * spinlock. For private mappings, we also lookup the pagecache
: 	 * page now as it is used to determine if a reservation has been
: 	 * consumed.
: 	 */
: 	if (write_access && !pte_write(entry)) {
: 		vma_needs_reservation(h, vma, address);
: 
: 		if (!(vma->vm_flags & VM_SHARED))
: 			pagecache_page = hugetlbfs_pagecache_page(h,
: 								vma, address);
: 	}

There's a seeming race window here, where a new page can get
instantiated.  But down-read(mmap_sem) plus hugetlb_instantiation_mutex
prevents that, yes?


: 	spin_lock(&mm->page_table_lock);
: 	/* Check for a racing update before calling hugetlb_cow */
: 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
: 		if (write_access && !pte_write(entry))
: 			ret = hugetlb_cow(mm, vma, address, ptep, entry,
: 							pagecache_page);
: 	spin_unlock(&mm->page_table_lock);
: 
: 	if (pagecache_page) {
: 		unlock_page(pagecache_page);
: 		put_page(pagecache_page);
: 	}
: 
: 	mutex_unlock(&hugetlb_instantiation_mutex);
: 
: 	return ret;
: }
: 
: 


* Re: [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks
  2008-08-07 21:38   ` Andrew Morton
@ 2008-08-08  8:33     ` Mel Gorman
  2008-08-08 10:16     ` Andy Whitcroft
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Mel Gorman @ 2008-08-08  8:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andy Whitcroft, gerald.schaefer, linux-kernel, linux-mm,
	schwidefsky, heiko.carstens

On (07/08/08 14:38), Andrew Morton didst pronounce:
> On Thu,  7 Aug 2008 21:28:23 +0100
> Andy Whitcroft <apw@shadowen.org> wrote:
> 
> > [Andrew, this fixes a problem in the private reservations stack, shown up
> > by some testing done by Gerald on s390 with PREEMPT.  It fixes an attempt
> > at allocation while holding locks.  This should be merged up to mainline
> > as a bug fix to those patches.]
> > 
> > In the normal case, hugetlbfs reserves hugepages at map time so that the
> > pages exist for future faults.  A struct file_region is used to track
> > when reservations have been consumed and where.  These file_regions
> > are allocated as necessary with kmalloc() which can sleep with the
> > mm->page_table_lock held.  This is wrong and triggers a may-sleep warning
> > when PREEMPT is enabled.
> > 
> > Updates to the underlying file_region are done in two phases.  The first
> > phase prepares the region for the change, allocating any necessary memory,
> > without actually making the change.  The second phase actually commits
> > the change.  This patch makes use of this by checking the reservations
> > before the page_table_lock is taken; triggering any necessary allocations.
> > This may then be safely repeated within the locks without any allocations
> > being required.
> > 
> > Credit to Mel Gorman for diagnosing this failure and initial versions of
> > the patch.
> > 
> 
> After applying the patch:
> 
> : int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> : 			unsigned long address, int write_access)
> : {
> : 	pte_t *ptep;
> : 	pte_t entry;
> : 	int ret;
> : 	struct page *pagecache_page = NULL;
> : 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> : 	struct hstate *h = hstate_vma(vma);
> : 
> : 	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> : 	if (!ptep)
> : 		return VM_FAULT_OOM;
> : 
> : 	/*
> : 	 * Serialize hugepage allocation and instantiation, so that we don't
> : 	 * get spurious allocation failures if two CPUs race to instantiate
> : 	 * the same page in the page cache.
> : 	 */
> : 	mutex_lock(&hugetlb_instantiation_mutex);
> : 	entry = huge_ptep_get(ptep);
> : 	if (huge_pte_none(entry)) {
> : 		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
> : 		mutex_unlock(&hugetlb_instantiation_mutex);
> : 		return ret;
> : 	}
> : 
> : 	ret = 0;
> : 
> : 	/*
> : 	 * If we are going to COW the mapping later, we examine the pending
> : 	 * reservations for this page now. This will ensure that any
> : 	 * allocations necessary to record that reservation occur outside the
> : 	 * spinlock. For private mappings, we also lookup the pagecache
> : 	 * page now as it is used to determine if a reservation has been
> : 	 * consumed.
> : 	 */
> : 	if (write_access && !pte_write(entry)) {
> : 		vma_needs_reservation(h, vma, address);
> : 
> : 		if (!(vma->vm_flags & VM_SHARED))
> : 			pagecache_page = hugetlbfs_pagecache_page(h,
> : 								vma, address);
> : 	}
> 
> There's a seeming race window here, where a new page can get
> instantiated.  But down-read(mmap_sem) plus hugetlb_instantiation_mutex
> prevents that, yes?
> 

Yes, but to double-check:

vma_needs_reservation() is called here and the region check needs to be
	protected. It requires that either down_write(mmap_sem) or
	hugetlb_instantiation_mutex + down_read(mmap_sem) is held, but
	that is the case here

add_to_page_cache for hugetlbfs happens within hugetlb_no_page(). It
	only needs a reference to the page to prevent it going away but also
	happens to be protected by the mutex and mmap_sem

For truncation, lock_page(page) prevents the page randomly disappearing
	until we finish with it. If the file is truncated before the
	fault, the caller gets a SIGBUS but the reservation counters
	don't get messed up

It's safe.
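
As an illustrative-only sketch of those two legal contexts (not
compilable as-is, since hugetlb_instantiation_mutex is static inside
hugetlb_fault()):

	/* Context 1: mmap_sem held for write. */
	down_write(&mm->mmap_sem);
	vma_needs_reservation(h, vma, address);
	up_write(&mm->mmap_sem);

	/*
	 * Context 2: mmap_sem held for read plus the instantiation
	 * mutex, which is what hugetlb_fault() holds at the call site
	 * quoted above.
	 */
	down_read(&mm->mmap_sem);
	mutex_lock(&hugetlb_instantiation_mutex);
	vma_needs_reservation(h, vma, address);
	mutex_unlock(&hugetlb_instantiation_mutex);
	up_read(&mm->mmap_sem);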

> 
> : 	spin_lock(&mm->page_table_lock);
> : 	/* Check for a racing update before calling hugetlb_cow */
> : 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
> : 		if (write_access && !pte_write(entry))
> : 			ret = hugetlb_cow(mm, vma, address, ptep, entry,
> : 							pagecache_page);
> : 	spin_unlock(&mm->page_table_lock);
> : 
> : 	if (pagecache_page) {
> : 		unlock_page(pagecache_page);
> : 		put_page(pagecache_page);
> : 	}
> : 
> : 	mutex_unlock(&hugetlb_instantiation_mutex);
> : 
> : 	return ret;
> : }
> : 
> : 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks
  2008-08-07 21:38   ` Andrew Morton
  2008-08-08  8:33     ` Mel Gorman
@ 2008-08-08 10:16     ` Andy Whitcroft
  2008-08-08 11:10     ` [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks v2 Andy Whitcroft
  2008-08-11 17:58     ` Andy Whitcroft
  3 siblings, 0 replies; 10+ messages in thread
From: Andy Whitcroft @ 2008-08-08 10:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: gerald.schaefer, linux-kernel, linux-mm, schwidefsky,
	heiko.carstens, mel

On Thu, Aug 07, 2008 at 02:38:24PM -0700, Andrew Morton wrote:
> On Thu,  7 Aug 2008 21:28:23 +0100
> Andy Whitcroft <apw@shadowen.org> wrote:
> 
> > [Andrew, this fixes a problem in the private reservations stack, shown up
> > by some testing done by Gerald on s390 with PREEMPT.  It fixes an attempt
> > at allocation while holding locks.  This should be merged up to mainline
> > as a bug fix to those patches.]
> > 
> > In the normal case, hugetlbfs reserves hugepages at map time so that the
> > pages exist for future faults.  A struct file_region is used to track
> > when reservations have been consumed and where.  These file_regions
> > are allocated as necessary with kmalloc() which can sleep with the
> > mm->page_table_lock held.  This is wrong and triggers a may-sleep warning
> > when PREEMPT is enabled.
> > 
> > Updates to the underlying file_region are done in two phases.  The first
> > phase prepares the region for the change, allocating any necessary memory,
> > without actually making the change.  The second phase actually commits
> > the change.  This patch makes use of this by checking the reservations
> > before the page_table_lock is taken; triggering any necessary allocations.
> > This may then be safely repeated within the locks without any allocations
> > being required.
> > 
> > Credit to Mel Gorman for diagnosing this failure and initial versions of
> > the patch.
> > 
> 
> After applying the patch:
> 
> : int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> : 			unsigned long address, int write_access)
> : {
> : 	pte_t *ptep;
> : 	pte_t entry;
> : 	int ret;
> : 	struct page *pagecache_page = NULL;
> : 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> : 	struct hstate *h = hstate_vma(vma);
> : 
> : 	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> : 	if (!ptep)
> : 		return VM_FAULT_OOM;
> : 
> : 	/*
> : 	 * Serialize hugepage allocation and instantiation, so that we don't
> : 	 * get spurious allocation failures if two CPUs race to instantiate
> : 	 * the same page in the page cache.
> : 	 */
> : 	mutex_lock(&hugetlb_instantiation_mutex);
> : 	entry = huge_ptep_get(ptep);
> : 	if (huge_pte_none(entry)) {
> : 		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
> : 		mutex_unlock(&hugetlb_instantiation_mutex);
> : 		return ret;
> : 	}
> : 
> : 	ret = 0;
> : 
> : 	/*
> : 	 * If we are going to COW the mapping later, we examine the pending
> : 	 * reservations for this page now. This will ensure that any
> : 	 * allocations necessary to record that reservation occur outside the
> : 	 * spinlock. For private mappings, we also lookup the pagecache
> : 	 * page now as it is used to determine if a reservation has been
> : 	 * consumed.
> : 	 */
> : 	if (write_access && !pte_write(entry)) {
> : 		vma_needs_reservation(h, vma, address);
> : 
> : 		if (!(vma->vm_flags & VM_SHARED))
> : 			pagecache_page = hugetlbfs_pagecache_page(h,
> : 								vma, address);
> : 	}
> 
> There's a seeming race window here, where a new page can get
> instantiated.  But down-read(mmap_sem) plus hugetlb_instantiation_mutex
> prevents that, yes?

Although that is true, I would prefer to not think of the
instantiation_mutex as protection for this, its primary concern is
serialisation.  I believe that the combination of down_read(mmap_sem),
the page lock, and perversely the page_table_lock protect this.

At this point we know that the PTE was not pte_none else we would
have branched to no_page.  No mapping operations can be occurring as
we have down_read(mmap_sem).  Any truncates racing with us first clear
the PTEs and then the pagecache references.  Should we pick up a stale
pagecache reference, we will detect it when we recheck the PTE under
the page_table_lock; this will also detect any racing instantiations.

Obviously we have the instantiation_mutex, and the locking rules for
the regions need it.  But I believe we are safe against this race even
without the instantiation_mutex.

> : 	spin_lock(&mm->page_table_lock);
> : 	/* Check for a racing update before calling hugetlb_cow */
> : 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
> : 		if (write_access && !pte_write(entry))
> : 			ret = hugetlb_cow(mm, vma, address, ptep, entry,
> : 							pagecache_page);
> : 	spin_unlock(&mm->page_table_lock);
> : 
> : 	if (pagecache_page) {
> : 		unlock_page(pagecache_page);
> : 		put_page(pagecache_page);
> : 	}
> : 
> : 	mutex_unlock(&hugetlb_instantiation_mutex);
> : 
> : 	return ret;
> : }
> : 
> : 

-apw


* [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks v2
  2008-08-07 21:38   ` Andrew Morton
  2008-08-08  8:33     ` Mel Gorman
  2008-08-08 10:16     ` Andy Whitcroft
@ 2008-08-08 11:10     ` Andy Whitcroft
  2008-08-08 12:57       ` Gerald Schaefer
  2008-08-11 17:58     ` Andy Whitcroft
  3 siblings, 1 reply; 10+ messages in thread
From: Andy Whitcroft @ 2008-08-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gerald Schaefer, linux-kernel, linux-mm, schwidefsky,
	heiko.carstens, Mel Gorman, Andy Whitcroft

[Bah, while reviewing the locking based on your previous email, I spotted
that we need to check the return from the vma_needs_reservation call for
allocation errors.  Here is an updated patch to correct this.  This passes
testing here.  Gerald, could you test this one too?]

In the normal case, hugetlbfs reserves hugepages at map time so that the
pages exist for future faults.  A struct file_region is used to track
when reservations have been consumed and where.  These file_regions
are allocated as necessary with kmalloc() which can sleep with the
mm->page_table_lock held.  This is wrong and triggers a may-sleep warning
when PREEMPT is enabled.

Updates to the underlying file_region are done in two phases.  The first
phase prepares the region for the change, allocating any necessary memory,
without actually making the change.  The second phase actually commits
the change.  This patch makes use of this by checking the reservations
before the page_table_lock is taken; triggering any necessary allocations.
This may then be safely repeated within the locks without any allocations
being required.

Credit to Mel Gorman for diagnosing this failure and initial versions of
the patch.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
---
 mm/hugetlb.c |   55 ++++++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 28a2980..393ea8b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1937,6 +1937,18 @@ retry:
 			lock_page(page);
 	}
 
+	/*
+	 * If we are going to COW a private mapping later, we examine the
+	 * pending reservations for this page now. This will ensure that
+	 * any allocations necessary to record that reservation occur outside
+	 * the spinlock.
+	 */
+	if (write_access && !(vma->vm_flags & VM_SHARED))
+		if (vma_needs_reservation(h, vma, address) < 0) {
+			ret = VM_FAULT_OOM;
+			goto backout_unlocked;
+		}
+
 	spin_lock(&mm->page_table_lock);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
@@ -1962,6 +1974,7 @@ out:
 
 backout:
 	spin_unlock(&mm->page_table_lock);
+backout_unlocked:
 	unlock_page(page);
 	put_page(page);
 	goto out;
@@ -1973,6 +1986,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *ptep;
 	pte_t entry;
 	int ret;
+	struct page *pagecache_page = NULL;
 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
 
@@ -1989,25 +2003,44 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
 		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
-		mutex_unlock(&hugetlb_instantiation_mutex);
-		return ret;
+		goto out_unlock;
 	}
 
 	ret = 0;
 
+	/*
+	 * If we are going to COW the mapping later, we examine the pending
+	 * reservations for this page now. This will ensure that any
+	 * allocations necessary to record that reservation occur outside the
+	 * spinlock. For private mappings, we also lookup the pagecache
+	 * page now as it is used to determine if a reservation has been
+	 * consumed.
+	 */
+	if (write_access && !pte_write(entry)) {
+		if (vma_needs_reservation(h, vma, address) < 0) {
+			ret = VM_FAULT_OOM;
+			goto out_unlock;
+		}
+
+		if (!(vma->vm_flags & VM_SHARED))
+			pagecache_page = hugetlbfs_pagecache_page(h,
+								vma, address);
+	}
+
 	spin_lock(&mm->page_table_lock);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
-		if (write_access && !pte_write(entry)) {
-			struct page *page;
-			page = hugetlbfs_pagecache_page(h, vma, address);
-			ret = hugetlb_cow(mm, vma, address, ptep, entry, page);
-			if (page) {
-				unlock_page(page);
-				put_page(page);
-			}
-		}
+		if (write_access && !pte_write(entry))
+			ret = hugetlb_cow(mm, vma, address, ptep, entry,
+							pagecache_page);
 	spin_unlock(&mm->page_table_lock);
+
+	if (pagecache_page) {
+		unlock_page(pagecache_page);
+		put_page(pagecache_page);
+	}
+
+out_unlock:
 	mutex_unlock(&hugetlb_instantiation_mutex);
 
 	return ret;
-- 
1.6.0.rc1.258.g80295


* Re: [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks v2
  2008-08-08 11:10     ` [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks v2 Andy Whitcroft
@ 2008-08-08 12:57       ` Gerald Schaefer
  0 siblings, 0 replies; 10+ messages in thread
From: Gerald Schaefer @ 2008-08-08 12:57 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Andrew Morton, linux-kernel, linux-mm, schwidefsky,
	heiko.carstens, Mel Gorman

On Fri, 2008-08-08 at 12:10 +0100, Andy Whitcroft wrote:
> [Bah, while reviewing the locking based on your previous email, I spotted
> that we need to check the return from the vma_needs_reservation call for
> allocation errors.  Here is an updated patch to correct this.  This passes
> testing here.  Gerald, could you test this one too?]

Ok, it works here too.

Thanks,
Gerald



* [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks v2
  2008-08-07 21:38   ` Andrew Morton
                       ` (2 preceding siblings ...)
  2008-08-08 11:10     ` [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks v2 Andy Whitcroft
@ 2008-08-11 17:58     ` Andy Whitcroft
  3 siblings, 0 replies; 10+ messages in thread
From: Andy Whitcroft @ 2008-08-11 17:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gerald Schaefer, linux-kernel, linux-mm, schwidefsky,
	heiko.carstens, Mel Gorman, Andy Whitcroft

[Andrew, this should replace the previous version, which did not check
the return from the region prepare step for errors.  This has been
tested both by us and by Gerald, and it looks good.

Bah, while reviewing the locking based on your previous email, I spotted
that we need to check the return from the vma_needs_reservation call for
allocation errors.  Here is an updated patch to correct this.  This
passes testing here.]

In the normal case, hugetlbfs reserves hugepages at map time so that the
pages exist for future faults.  A struct file_region is used to track
when reservations have been consumed and where.  These file_regions
are allocated as necessary with kmalloc() which can sleep with the
mm->page_table_lock held.  This is wrong and triggers a may-sleep warning
when PREEMPT is enabled.

Updates to the underlying file_region are done in two phases.  The first
phase prepares the region for the change, allocating any necessary memory,
without actually making the change.  The second phase actually commits
the change.  This patch makes use of this by checking the reservations
before the page_table_lock is taken; triggering any necessary allocations.
This may then be safely repeated within the locks without any allocations
being required.

Credit to Mel Gorman for diagnosing this failure and initial versions of
the patch.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
---
 mm/hugetlb.c |   55 ++++++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 28a2980..393ea8b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1937,6 +1937,18 @@ retry:
 			lock_page(page);
 	}
 
+	/*
+	 * If we are going to COW a private mapping later, we examine the
+	 * pending reservations for this page now. This will ensure that
+	 * any allocations necessary to record that reservation occur outside
+	 * the spinlock.
+	 */
+	if (write_access && !(vma->vm_flags & VM_SHARED))
+		if (vma_needs_reservation(h, vma, address) < 0) {
+			ret = VM_FAULT_OOM;
+			goto backout_unlocked;
+		}
+
 	spin_lock(&mm->page_table_lock);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
 	if (idx >= size)
@@ -1962,6 +1974,7 @@ out:
 
 backout:
 	spin_unlock(&mm->page_table_lock);
+backout_unlocked:
 	unlock_page(page);
 	put_page(page);
 	goto out;
@@ -1973,6 +1986,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *ptep;
 	pte_t entry;
 	int ret;
+	struct page *pagecache_page = NULL;
 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
 
@@ -1989,25 +2003,44 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
 		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
-		mutex_unlock(&hugetlb_instantiation_mutex);
-		return ret;
+		goto out_unlock;
 	}
 
 	ret = 0;
 
+	/*
+	 * If we are going to COW the mapping later, we examine the pending
+	 * reservations for this page now. This will ensure that any
+	 * allocations necessary to record that reservation occur outside the
+	 * spinlock. For private mappings, we also lookup the pagecache
+	 * page now as it is used to determine if a reservation has been
+	 * consumed.
+	 */
+	if (write_access && !pte_write(entry)) {
+		if (vma_needs_reservation(h, vma, address) < 0) {
+			ret = VM_FAULT_OOM;
+			goto out_unlock;
+		}
+
+		if (!(vma->vm_flags & VM_SHARED))
+			pagecache_page = hugetlbfs_pagecache_page(h,
+								vma, address);
+	}
+
 	spin_lock(&mm->page_table_lock);
 	/* Check for a racing update before calling hugetlb_cow */
 	if (likely(pte_same(entry, huge_ptep_get(ptep))))
-		if (write_access && !pte_write(entry)) {
-			struct page *page;
-			page = hugetlbfs_pagecache_page(h, vma, address);
-			ret = hugetlb_cow(mm, vma, address, ptep, entry, page);
-			if (page) {
-				unlock_page(page);
-				put_page(page);
-			}
-		}
+		if (write_access && !pte_write(entry))
+			ret = hugetlb_cow(mm, vma, address, ptep, entry,
+							pagecache_page);
 	spin_unlock(&mm->page_table_lock);
+
+	if (pagecache_page) {
+		unlock_page(pagecache_page);
+		put_page(pagecache_page);
+	}
+
+out_unlock:
 	mutex_unlock(&hugetlb_instantiation_mutex);
 
 	return ret;
-- 
1.6.0.rc1.258.g80295


Thread overview: 10 messages
2008-08-06 14:43 [BUG] hugetlb: sleeping function called from invalid context Gerald Schaefer
2008-08-06 19:03 ` [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks Andy Whitcroft
2008-08-07 11:35   ` Gerald Schaefer
2008-08-07 20:28 ` Andy Whitcroft
2008-08-07 21:38   ` Andrew Morton
2008-08-08  8:33     ` Mel Gorman
2008-08-08 10:16     ` Andy Whitcroft
2008-08-08 11:10     ` [PATCH 1/1] allocate structures for reservation tracking in hugetlbfs outside of spinlocks v2 Andy Whitcroft
2008-08-08 12:57       ` Gerald Schaefer
2008-08-11 17:58     ` Andy Whitcroft
