[RFC] reduce hugetlb_instantiation

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC] reduce hugetlb_instantiation_mutex usage
@ 2006-10-26 22:17 Chen, Kenneth W
  2006-10-26 22:44 ` Andrew Morton
  2006-10-26 23:47 ` 'David Gibson'
  0 siblings, 2 replies; 19+ messages in thread
From: Chen, Kenneth W @ 2006-10-26 22:17 UTC (permalink / raw)
  To: 'Christoph Lameter', 'David Gibson',
	Hugh Dickins, bill.irwin, Andrew Morton, Adam Litke
  Cc: linux-mm

First rev of patch to allow hugetlb page fault to scale.

hugetlb_instantiation_mutex was introduced to prevent spurious allocation
failure in a corner case: two threads race to instantiate same page with
only one free page left in the global pool.  However, this global
serialization hurts fault performance badly as noted by Christoph Lameter.
This patch attempt to cut back the use of mutex only when free page resource
is limited, thus allow fault to scale in most common cases.


Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>


--- ./mm/hugetlb.c.orig	2006-10-26 10:26:43.000000000 -0700
+++ ./mm/hugetlb.c	2006-10-26 13:18:03.000000000 -0700
@@ -542,6 +542,8 @@ int hugetlb_fault(struct mm_struct *mm, 
 	pte_t entry;
 	int ret;
 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+	static atomic_t token = ATOMIC_INIT(0);
+	int use_mutex = 0;
 
 	ptep = huge_pte_alloc(mm, address);
 	if (!ptep)
@@ -552,12 +554,15 @@ int hugetlb_fault(struct mm_struct *mm, 
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
+	if (atomic_inc_return(&token) >= free_huge_pages) {
+		mutex_lock(&hugetlb_instantiation_mutex);
+		use_mutex = 1;
+	}
+
 	entry = *ptep;
 	if (pte_none(entry)) {
 		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
-		mutex_unlock(&hugetlb_instantiation_mutex);
-		return ret;
+		goto out;
 	}
 
 	ret = VM_FAULT_MINOR;
@@ -568,7 +573,11 @@ int hugetlb_fault(struct mm_struct *mm, 
 		if (write_access && !pte_write(entry))
 			ret = hugetlb_cow(mm, vma, address, ptep, entry);
 	spin_unlock(&mm->page_table_lock);
-	mutex_unlock(&hugetlb_instantiation_mutex);
+
+out:
+	atomic_dec(&token);
+	if (use_mutex)
+		mutex_unlock(&hugetlb_instantiation_mutex);
 
 	return ret;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-26 22:17 [RFC] reduce hugetlb_instantiation_mutex usage Chen, Kenneth W
@ 2006-10-26 22:44 ` Andrew Morton
  2006-10-26 23:31   ` 'David Gibson'
  2006-10-26 23:47 ` 'David Gibson'
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2006-10-26 22:44 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Christoph Lameter', 'David Gibson',
	Hugh Dickins, bill.irwin, Adam Litke, linux-mm

On Thu, 26 Oct 2006 15:17:20 -0700
"Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:

> First rev of patch to allow hugetlb page fault to scale.
> 
> hugetlb_instantiation_mutex was introduced to prevent spurious allocation
> failure in a corner case: two threads race to instantiate same page with
> only one free page left in the global pool.  However, this global
> serialization hurts fault performance badly as noted by Christoph Lameter.
> This patch attempt to cut back the use of mutex only when free page resource
> is limited, thus allow fault to scale in most common cases.
>

ug.

How about we kill that instantiation_mutex thing altogether and fix the original bug
in a better fashion?  Like...

In hugetlb_no_page():

retry:
	page = find_lock_page(...)
	if (!page) {
		write_lock_irq(&mapping->tree_lock);
		if (radix_tree_lookup(...)) {
			write_unlock_irq(tree_lock);
			goto retry;
		}
		page = alloc_huge_page(...);
		if (!page)
			bail;
		radix_tree_insert(...);
		SetPageLocked(page);
		write_unlock_irq(tree_lock);
		clear_huge_page(...);
	}

	<stick it in page tables>

	unlock_page(page);


The key points:

- Use tree_lock to prevent the race

- allocate the hugepage inside tree_lock so we never get into this
  two-threads-tried-to-allocate-the-final-page problem.

- The hugepage is zeroed without locks held, under lock_page()

- lock_page() is used to make the other thread(s) sleep while the winner
  thread is zeroing out the page.


It means that rather a lot of add_to_page_cache() will need to be copied
into hugetlb_no_page().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-26 22:44 ` Andrew Morton
@ 2006-10-26 23:31   ` 'David Gibson'
  2006-10-27  0:04     ` Andrew Morton
  2006-10-27  1:47     ` 'David Gibson'
  0 siblings, 2 replies; 19+ messages in thread
From: 'David Gibson' @ 2006-10-26 23:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chen, Kenneth W, 'Christoph Lameter',
	Hugh Dickins, bill.irwin, Adam Litke, linux-mm

On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote:
> On Thu, 26 Oct 2006 15:17:20 -0700
> "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
> 
> > First rev of patch to allow hugetlb page fault to scale.
> > 
> > hugetlb_instantiation_mutex was introduced to prevent spurious allocation
> > failure in a corner case: two threads race to instantiate same page with
> > only one free page left in the global pool.  However, this global
> > serialization hurts fault performance badly as noted by Christoph Lameter.
> > This patch attempt to cut back the use of mutex only when free page resource
> > is limited, thus allow fault to scale in most common cases.
> >
> 
> ug.
> 
> How about we kill that instantiation_mutex thing altogether and fix
> the original bug in a better fashion?  Like...
> 
> In hugetlb_no_page():
> 
> retry:
> 	page = find_lock_page(...)
> 	if (!page) {
> 		write_lock_irq(&mapping->tree_lock);
> 		if (radix_tree_lookup(...)) {
> 			write_unlock_irq(tree_lock);
> 			goto retry;
> 		}
> 		page = alloc_huge_page(...);
> 		if (!page)
> 			bail;
> 		radix_tree_insert(...);
> 		SetPageLocked(page);
> 		write_unlock_irq(tree_lock);
> 		clear_huge_page(...);
> 	}
> 
> 	<stick it in page tables>
> 
> 	unlock_page(page);
> 
> The key points:
> 
> - Use tree_lock to prevent the race
> 
> - allocate the hugepage inside tree_lock so we never get into this
>   two-threads-tried-to-allocate-the-final-page problem.
> 
> - The hugepage is zeroed without locks held, under lock_page()
> 
> - lock_page() is used to make the other thread(s) sleep while the winner
>   thread is zeroing out the page.
> 
> It means that rather a lot of add_to_page_cache() will need to be copied
> into hugetlb_no_page().

This handles the case of processes racing on a shared mapping, but not
the case of threads racing on a private mapping.  In the latter case
the race ends at the set_pte() rather than the add_to_page_cache()
(well, strictly with the whole page_table_lock atomic lump).  And we
can't move the clear after the set_pte() :(.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-26 23:31   ` 'David Gibson'
@ 2006-10-27  0:04     ` Andrew Morton
  2006-10-27  3:11       ` 'David Gibson'
  2006-10-27  1:47     ` 'David Gibson'
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2006-10-27  0:04 UTC (permalink / raw)
  To: 'David Gibson'
  Cc: Chen, Kenneth W, 'Christoph Lameter',
	Hugh Dickins, bill.irwin, Adam Litke, linux-mm

On Fri, 27 Oct 2006 09:31:37 +1000
"'David Gibson'" <david@gibson.dropbear.id.au> wrote:

> On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote:
> > On Thu, 26 Oct 2006 15:17:20 -0700
> > "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
> > 
> > > First rev of patch to allow hugetlb page fault to scale.
> > > 
> > > hugetlb_instantiation_mutex was introduced to prevent spurious allocation
> > > failure in a corner case: two threads race to instantiate same page with
> > > only one free page left in the global pool.  However, this global
> > > serialization hurts fault performance badly as noted by Christoph Lameter.
> > > This patch attempt to cut back the use of mutex only when free page resource
> > > is limited, thus allow fault to scale in most common cases.
> > >
> > 
> > ug.
> > 
> > How about we kill that instantiation_mutex thing altogether and fix
> > the original bug in a better fashion?  Like...
> > 
> > In hugetlb_no_page():
> > 
> > retry:
> > 	page = find_lock_page(...)
> > 	if (!page) {
> > 		write_lock_irq(&mapping->tree_lock);
> > 		if (radix_tree_lookup(...)) {
> > 			write_unlock_irq(tree_lock);
> > 			goto retry;
> > 		}
> > 		page = alloc_huge_page(...);
> > 		if (!page)
> > 			bail;
> > 		radix_tree_insert(...);
> > 		SetPageLocked(page);
> > 		write_unlock_irq(tree_lock);
> > 		clear_huge_page(...);
> > 	}
> > 
> > 	<stick it in page tables>
> > 
> > 	unlock_page(page);
> > 
> > The key points:
> > 
> > - Use tree_lock to prevent the race
> > 
> > - allocate the hugepage inside tree_lock so we never get into this
> >   two-threads-tried-to-allocate-the-final-page problem.
> > 
> > - The hugepage is zeroed without locks held, under lock_page()
> > 
> > - lock_page() is used to make the other thread(s) sleep while the winner
> >   thread is zeroing out the page.
> > 
> > It means that rather a lot of add_to_page_cache() will need to be copied
> > into hugetlb_no_page().
> 
> This handles the case of processes racing on a shared mapping, but not
> the case of threads racing on a private mapping.  In the latter case
> the race ends at the set_pte() rather than the add_to_page_cache()
> (well, strictly with the whole page_table_lock atomic lump).  And we
> can't move the clear after the set_pte() :(.
> 

I expect we can do a similar thing, using page_table_lock to prevent the
race.

The key is to be able to make racing threads still block on the page lock. 
Perhaps install a temp pte which is !pte_present() and also !pte_none(). 
So the racing thread can use that pte to locate and wait upon the
presently-locked page while it is being COWed by another CPU.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-27  0:04     ` Andrew Morton
@ 2006-10-27  3:11       ` 'David Gibson'
  2006-10-27  3:35         ` Andrew Morton
  0 siblings, 1 reply; 19+ messages in thread
From: 'David Gibson' @ 2006-10-27  3:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chen, Kenneth W, 'Christoph Lameter',
	Hugh Dickins, bill.irwin, Adam Litke, linux-mm

On Thu, Oct 26, 2006 at 05:04:15PM -0700, Andrew Morton wrote:
> On Fri, 27 Oct 2006 09:31:37 +1000
> "'David Gibson'" <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote:
> > > On Thu, 26 Oct 2006 15:17:20 -0700
> > > "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
> > > 
> > > > First rev of patch to allow hugetlb page fault to scale.
> > > > 
> > > > hugetlb_instantiation_mutex was introduced to prevent spurious allocation
> > > > failure in a corner case: two threads race to instantiate same page with
> > > > only one free page left in the global pool.  However, this global
> > > > serialization hurts fault performance badly as noted by Christoph Lameter.
> > > > This patch attempt to cut back the use of mutex only when free page resource
> > > > is limited, thus allow fault to scale in most common cases.
> > > >
> > > 
> > > ug.
> > > 
> > > How about we kill that instantiation_mutex thing altogether and fix
> > > the original bug in a better fashion?  Like...
> > > 
> > > In hugetlb_no_page():
> > > 
> > > retry:
> > > 	page = find_lock_page(...)
> > > 	if (!page) {
> > > 		write_lock_irq(&mapping->tree_lock);
> > > 		if (radix_tree_lookup(...)) {
> > > 			write_unlock_irq(tree_lock);
> > > 			goto retry;
> > > 		}
> > > 		page = alloc_huge_page(...);
> > > 		if (!page)
> > > 			bail;
> > > 		radix_tree_insert(...);
> > > 		SetPageLocked(page);
> > > 		write_unlock_irq(tree_lock);
> > > 		clear_huge_page(...);
> > > 	}
> > > 
> > > 	<stick it in page tables>
> > > 
> > > 	unlock_page(page);
> > > 
> > > The key points:
> > > 
> > > - Use tree_lock to prevent the race
> > > 
> > > - allocate the hugepage inside tree_lock so we never get into this
> > >   two-threads-tried-to-allocate-the-final-page problem.
> > > 
> > > - The hugepage is zeroed without locks held, under lock_page()
> > > 
> > > - lock_page() is used to make the other thread(s) sleep while the winner
> > >   thread is zeroing out the page.
> > > 
> > > It means that rather a lot of add_to_page_cache() will need to be copied
> > > into hugetlb_no_page().
> > 
> > This handles the case of processes racing on a shared mapping, but not
> > the case of threads racing on a private mapping.  In the latter case
> > the race ends at the set_pte() rather than the add_to_page_cache()
> > (well, strictly with the whole page_table_lock atomic lump).  And we
> > can't move the clear after the set_pte() :(.
> > 
> 
> I expect we can do a similar thing, using page_table_lock to prevent the
> race.
> 
> The key is to be able to make racing threads still block on the page lock. 
> Perhaps install a temp pte which is !pte_present() and also !pte_none(). 
> So the racing thread can use that pte to locate and wait upon the
> presently-locked page while it is being COWed by another CPU.

Um.. yes, that might work.  Though I'd need to think hard about a more
specific scheme.  I've been through a lot of approaches lately that
looked ok at first glance, but weren't :-/

And obviously we'd need to make sure such "tentative" PTEs are
constructible won't confuse other code on each relevant architecture.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-27  3:11       ` 'David Gibson'
@ 2006-10-27  3:35         ` Andrew Morton
  2006-10-27  4:06           ` 'David Gibson'
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2006-10-27  3:35 UTC (permalink / raw)
  To: 'David Gibson'
  Cc: Chen, Kenneth W, 'Christoph Lameter',
	Hugh Dickins, bill.irwin, Adam Litke, linux-mm

On Fri, 27 Oct 2006 13:11:56 +1000
"'David Gibson'" <david@gibson.dropbear.id.au> wrote:

> On Thu, Oct 26, 2006 at 05:04:15PM -0700, Andrew Morton wrote:
> > On Fri, 27 Oct 2006 09:31:37 +1000
> > "'David Gibson'" <david@gibson.dropbear.id.au> wrote:
> > 
> > > On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote:
> > > > On Thu, 26 Oct 2006 15:17:20 -0700
> > > > "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
> > > > 
> > > > > First rev of patch to allow hugetlb page fault to scale.
> > > > > 
> > > > > hugetlb_instantiation_mutex was introduced to prevent spurious allocation
> > > > > failure in a corner case: two threads race to instantiate same page with
> > > > > only one free page left in the global pool.  However, this global
> > > > > serialization hurts fault performance badly as noted by Christoph Lameter.
> > > > > This patch attempt to cut back the use of mutex only when free page resource
> > > > > is limited, thus allow fault to scale in most common cases.
> > > > >
> > > > 
> > > > ug.
> > > > 
> > > > How about we kill that instantiation_mutex thing altogether and fix
> > > > the original bug in a better fashion?  Like...
> > > > 
> > > > In hugetlb_no_page():
> > > > 
> > > > retry:
> > > > 	page = find_lock_page(...)
> > > > 	if (!page) {
> > > > 		write_lock_irq(&mapping->tree_lock);
> > > > 		if (radix_tree_lookup(...)) {
> > > > 			write_unlock_irq(tree_lock);
> > > > 			goto retry;
> > > > 		}
> > > > 		page = alloc_huge_page(...);
> > > > 		if (!page)
> > > > 			bail;
> > > > 		radix_tree_insert(...);
> > > > 		SetPageLocked(page);
> > > > 		write_unlock_irq(tree_lock);
> > > > 		clear_huge_page(...);
> > > > 	}
> > > > 
> > > > 	<stick it in page tables>
> > > > 
> > > > 	unlock_page(page);
> > > > 
> > > > The key points:
> > > > 
> > > > - Use tree_lock to prevent the race
> > > > 
> > > > - allocate the hugepage inside tree_lock so we never get into this
> > > >   two-threads-tried-to-allocate-the-final-page problem.
> > > > 
> > > > - The hugepage is zeroed without locks held, under lock_page()
> > > > 
> > > > - lock_page() is used to make the other thread(s) sleep while the winner
> > > >   thread is zeroing out the page.
> > > > 
> > > > It means that rather a lot of add_to_page_cache() will need to be copied
> > > > into hugetlb_no_page().
> > > 
> > > This handles the case of processes racing on a shared mapping, but not
> > > the case of threads racing on a private mapping.  In the latter case
> > > the race ends at the set_pte() rather than the add_to_page_cache()
> > > (well, strictly with the whole page_table_lock atomic lump).  And we
> > > can't move the clear after the set_pte() :(.
> > > 
> > 
> > I expect we can do a similar thing, using page_table_lock to prevent the
> > race.
> > 
> > The key is to be able to make racing threads still block on the page lock. 
> > Perhaps install a temp pte which is !pte_present() and also !pte_none(). 
> > So the racing thread can use that pte to locate and wait upon the
> > presently-locked page while it is being COWed by another CPU.
> 
> Um.. yes, that might work.  Though I'd need to think hard about a more
> specific scheme.  I've been through a lot of approaches lately that
> looked ok at first glance, but weren't :-/
> 
> And obviously we'd need to make sure such "tentative" PTEs are
> constructible won't confuse other code on each relevant architecture.
> 

There's various cross-arch infrastructure for this which is used for
encoding swap offsets within pte's which could perhaps be ab^W^Wreused.

Alternatively, we could put the page into pagecache whether or not the
mapping is MAP_SHARED.  Then pull it out again prior to unlocking it if
it's MAP_PRIVATE.  So we're using pagecache just as a way for the
concurrent faulter to locate the page.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-27  3:35         ` Andrew Morton
@ 2006-10-27  4:06           ` 'David Gibson'
  2006-10-31  2:54             ` Chen, Kenneth W
  0 siblings, 1 reply; 19+ messages in thread
From: 'David Gibson' @ 2006-10-27  4:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chen, Kenneth W, 'Christoph Lameter',
	Hugh Dickins, bill.irwin, Adam Litke, linux-mm

On Thu, Oct 26, 2006 at 08:35:22PM -0700, Andrew Morton wrote:
> On Fri, 27 Oct 2006 13:11:56 +1000
> "'David Gibson'" <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Oct 26, 2006 at 05:04:15PM -0700, Andrew Morton wrote:
> > > On Fri, 27 Oct 2006 09:31:37 +1000
> > > "'David Gibson'" <david@gibson.dropbear.id.au> wrote:
> > > 
> > > > On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote:
> > > > > On Thu, 26 Oct 2006 15:17:20 -0700
> > > > > "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
> > > > > 
> > > > > > First rev of patch to allow hugetlb page fault to scale.
> > > > > > 
> > > > > > hugetlb_instantiation_mutex was introduced to prevent spurious allocation
> > > > > > failure in a corner case: two threads race to instantiate same page with
> > > > > > only one free page left in the global pool.  However, this global
> > > > > > serialization hurts fault performance badly as noted by Christoph Lameter.
> > > > > > This patch attempt to cut back the use of mutex only when free page resource
> > > > > > is limited, thus allow fault to scale in most common cases.
> > > > > >
> > > > > 
> > > > > ug.
> > > > > 
> > > > > How about we kill that instantiation_mutex thing altogether and fix
> > > > > the original bug in a better fashion?  Like...
> > > > > 
> > > > > In hugetlb_no_page():
> > > > > 
> > > > > retry:
> > > > > 	page = find_lock_page(...)
> > > > > 	if (!page) {
> > > > > 		write_lock_irq(&mapping->tree_lock);
> > > > > 		if (radix_tree_lookup(...)) {
> > > > > 			write_unlock_irq(tree_lock);
> > > > > 			goto retry;
> > > > > 		}
> > > > > 		page = alloc_huge_page(...);
> > > > > 		if (!page)
> > > > > 			bail;
> > > > > 		radix_tree_insert(...);
> > > > > 		SetPageLocked(page);
> > > > > 		write_unlock_irq(tree_lock);
> > > > > 		clear_huge_page(...);
> > > > > 	}
> > > > > 
> > > > > 	<stick it in page tables>
> > > > > 
> > > > > 	unlock_page(page);
> > > > > 
> > > > > The key points:
> > > > > 
> > > > > - Use tree_lock to prevent the race
> > > > > 
> > > > > - allocate the hugepage inside tree_lock so we never get into this
> > > > >   two-threads-tried-to-allocate-the-final-page problem.
> > > > > 
> > > > > - The hugepage is zeroed without locks held, under lock_page()
> > > > > 
> > > > > - lock_page() is used to make the other thread(s) sleep while the winner
> > > > >   thread is zeroing out the page.
> > > > > 
> > > > > It means that rather a lot of add_to_page_cache() will need to be copied
> > > > > into hugetlb_no_page().
> > > > 
> > > > This handles the case of processes racing on a shared mapping, but not
> > > > the case of threads racing on a private mapping.  In the latter case
> > > > the race ends at the set_pte() rather than the add_to_page_cache()
> > > > (well, strictly with the whole page_table_lock atomic lump).  And we
> > > > can't move the clear after the set_pte() :(.
> > > > 
> > > 
> > > I expect we can do a similar thing, using page_table_lock to prevent the
> > > race.
> > > 
> > > The key is to be able to make racing threads still block on the page lock. 
> > > Perhaps install a temp pte which is !pte_present() and also !pte_none(). 
> > > So the racing thread can use that pte to locate and wait upon the
> > > presently-locked page while it is being COWed by another CPU.
> > 
> > Um.. yes, that might work.  Though I'd need to think hard about a more
> > specific scheme.  I've been through a lot of approaches lately that
> > looked ok at first glance, but weren't :-/
> > 
> > And obviously we'd need to make sure such "tentative" PTEs are
> > constructible won't confuse other code on each relevant architecture.
> 
> There's various cross-arch infrastructure for this which is used for
> encoding swap offsets within pte's which could perhaps be
> ab^W^Wreused.

Yes, but the encoding and assumptions about ptes aren't always exactly
the same for hugeptes as normal ptes.

> Alternatively, we could put the page into pagecache whether or not the
> mapping is MAP_SHARED.  Then pull it out again prior to unlocking it if
> it's MAP_PRIVATE.  So we're using pagecache just as a way for the
> concurrent faulter to locate the page.

Hrm.. interesting if we can make it work.  I'd be worried about cases
with concurrent PRIVATE and SHARED pages on the same file offset.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-27  4:06           ` 'David Gibson'
@ 2006-10-31  2:54             ` Chen, Kenneth W
  2006-10-31  3:17               ` 'David Gibson'
  0 siblings, 1 reply; 19+ messages in thread
From: Chen, Kenneth W @ 2006-10-31  2:54 UTC (permalink / raw)
  To: 'David Gibson', Andrew Morton
  Cc: 'Christoph Lameter',
	Hugh Dickins, bill.irwin, Adam Litke, linux-mm

David Gibson wrote on Thursday, October 26, 2006 9:06 PM
> > Alternatively, we could put the page into pagecache whether or not the
> > mapping is MAP_SHARED.  Then pull it out again prior to unlocking it if
> > it's MAP_PRIVATE.  So we're using pagecache just as a way for the
> > concurrent faulter to locate the page.
> 
> Hrm.. interesting if we can make it work.  I'd be worried about cases
> with concurrent PRIVATE and SHARED pages on the same file offset.

I got side tracked on to the radix-tree stuff.  The comments in
hugetlb_no_page() make me wonder whether we have a race issue on
private mapping:

        /*
         * Use page lock to guard against racing truncation
         * before we get page_table_lock.
         */

Private mapping won't use radix tree during instantiation.  What protects
racy truncate against fault in that scenario?  Don't we have a bug here?

- Ken

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-31  2:54             ` Chen, Kenneth W
@ 2006-10-31  3:17               ` 'David Gibson'
  2006-10-31  5:15                 ` Chen, Kenneth W
  0 siblings, 1 reply; 19+ messages in thread
From: 'David Gibson' @ 2006-10-31  3:17 UTC (permalink / raw)
  To: Chen, Kenneth W, g
  Cc: Andrew Morton, 'Christoph Lameter',
	Hugh Dickins, bill.irwin, Adam Litke, linux-mm

On Mon, Oct 30, 2006 at 06:54:46PM -0800, Chen, Kenneth W wrote:
> David Gibson wrote on Thursday, October 26, 2006 9:06 PM
> > > Alternatively, we could put the page into pagecache whether or not the
> > > mapping is MAP_SHARED.  Then pull it out again prior to unlocking it if
> > > it's MAP_PRIVATE.  So we're using pagecache just as a way for the
> > > concurrent faulter to locate the page.
> > 
> > Hrm.. interesting if we can make it work.  I'd be worried about cases
> > with concurrent PRIVATE and SHARED pages on the same file offset.
> 
> I got side tracked on to the radix-tree stuff.  The comments in
> hugetlb_no_page() make me wonder whether we have a race issue on
> private mapping:
> 
>         /*
>          * Use page lock to guard against racing truncation
>          * before we get page_table_lock.
>          */
> 
> Private mapping won't use radix tree during instantiation.  What protects
> racy truncate against fault in that scenario?  Don't we have a bug here?

Not at present, because the hugetlb_instantiation_mutex protects both
fault paths.  But with Andrew's patch as it stands, yes.  As I said in
a previous email.  The libhugetlbfs testsuite now has a testcase for
the MAP_PRIVATE as well as the MAP_SHARED version of the race.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-31  3:17               ` 'David Gibson'
@ 2006-10-31  5:15                 ` Chen, Kenneth W
  2006-10-31 11:05                   ` 'David Gibson'
  0 siblings, 1 reply; 19+ messages in thread
From: Chen, Kenneth W @ 2006-10-31  5:15 UTC (permalink / raw)
  To: 'David Gibson', g
  Cc: Andrew Morton, 'Christoph Lameter',
	Hugh Dickins, bill.irwin, Adam Litke, linux-mm

David Gibson wrote on Monday, October 30, 2006 7:17 PM
> > I got side tracked on to the radix-tree stuff.  The comments in
> > hugetlb_no_page() make me wonder whether we have a race issue on
> > private mapping:
> > 
> >         /*
> >          * Use page lock to guard against racing truncation
> >          * before we get page_table_lock.
> >          */
> > 
> > Private mapping won't use radix tree during instantiation.  What protects
> > racy truncate against fault in that scenario?  Don't we have a bug here?
> 
> Not at present, because the hugetlb_instantiation_mutex protects both
> fault paths.  But with Andrew's patch as it stands, yes.  As I said in
> a previous email.  The libhugetlbfs testsuite now has a testcase for
> the MAP_PRIVATE as well as the MAP_SHARED version of the race.


That's not what I'm saying.  I should've said I'm off topic and not talking
about parallel fault for private mapping.

Instead, I'm asking how private mapping protect race between file truncation
and fault? For shared mapping, it is clear to me that we are using lock_page
to protect file truncate with fault.  But I don't see that protection with
private mapping in current upstream kernel.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-31  5:15                 ` Chen, Kenneth W
@ 2006-10-31 11:05                   ` 'David Gibson'
  2006-10-31 12:48                     ` Hugh Dickins
  0 siblings, 1 reply; 19+ messages in thread
From: 'David Gibson' @ 2006-10-31 11:05 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: g, Andrew Morton, 'Christoph Lameter',
	Hugh Dickins, bill.irwin, Adam Litke, linux-mm

On Mon, Oct 30, 2006 at 09:15:20PM -0800, Chen, Kenneth W wrote:
> David Gibson wrote on Monday, October 30, 2006 7:17 PM
> > > I got side tracked on to the radix-tree stuff.  The comments in
> > > hugetlb_no_page() make me wonder whether we have a race issue on
> > > private mapping:
> > > 
> > >         /*
> > >          * Use page lock to guard against racing truncation
> > >          * before we get page_table_lock.
> > >          */
> > > 
> > > Private mapping won't use radix tree during instantiation.  What protects
> > > racy truncate against fault in that scenario?  Don't we have a bug here?
> > 
> > Not at present, because the hugetlb_instantiation_mutex protects both
> > fault paths.  But with Andrew's patch as it stands, yes.  As I said in
> > a previous email.  The libhugetlbfs testsuite now has a testcase for
> > the MAP_PRIVATE as well as the MAP_SHARED version of the race.
> 
> 
> That's not what I'm saying.  I should've said I'm off topic and not talking
> about parallel fault for private mapping.
> 
> Instead, I'm asking how private mapping protect race between file truncation
> and fault? For shared mapping, it is clear to me that we are using lock_page
> to protect file truncate with fault.  But I don't see that protection with
> private mapping in current upstream kernel.

Oh, ok.  I can't see how it matters in the PRIVATE case, given that
truncate() won't, and shouldn't, truncate privately mapped pages.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-31 11:05                   ` 'David Gibson'
@ 2006-10-31 12:48                     ` Hugh Dickins
  2006-11-01  6:18                       ` Nick Piggin
  2006-11-02  2:29                       ` 'David Gibson'
  0 siblings, 2 replies; 19+ messages in thread
From: Hugh Dickins @ 2006-10-31 12:48 UTC (permalink / raw)
  To: 'David Gibson'
  Cc: Chen, Kenneth W, g, Andrew Morton, 'Christoph Lameter',
	bill.irwin, Adam Litke, linux-mm

On Tue, 31 Oct 2006, 'David Gibson' wrote:
> On Mon, Oct 30, 2006 at 09:15:20PM -0800, Chen, Kenneth W wrote:
> > 
> > Instead, I'm asking how private mapping protect race between file truncation
> > and fault? For shared mapping, it is clear to me that we are using lock_page
> > to protect file truncate with fault.  But I don't see that protection with
> > private mapping in current upstream kernel.
> 
> Oh, ok.  I can't see how it matters in the PRIVATE case, given that
> truncate() won't, and shouldn't, truncate privately mapped pages.

Bzzt, it does and should (unless we decide to make hugetlbfs pages diverge
from the standard for ordinary pages in this respect - could do, but that
would require thought of its own).  If you've been thinking otherwise,
that may explain why some of the accounting goes wrong.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-31 12:48                     ` Hugh Dickins
@ 2006-11-01  6:18                       ` Nick Piggin
  2006-11-01 10:17                         ` Chen, Kenneth W
  2006-11-02  2:29                       ` 'David Gibson'
  1 sibling, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2006-11-01  6:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: 'David Gibson',
	Chen, Kenneth W, g, Andrew Morton, 'Christoph Lameter',
	bill.irwin, Adam Litke, linux-mm

Hugh Dickins wrote:
> On Tue, 31 Oct 2006, 'David Gibson' wrote:
> 
>>On Mon, Oct 30, 2006 at 09:15:20PM -0800, Chen, Kenneth W wrote:
>>
>>>Instead, I'm asking how private mapping protect race between file truncation
>>>and fault? For shared mapping, it is clear to me that we are using lock_page
>>>to protect file truncate with fault.  But I don't see that protection with
>>>private mapping in current upstream kernel.
>>
>>Oh, ok.  I can't see how it matters in the PRIVATE case, given that
>>truncate() won't, and shouldn't, truncate privately mapped pages.
> 
> 
> Bzzt, it does and should (unless we decide to make hugetlbfs pages diverge
> from the standard for ordinary pages in this respect - could do, but that
> would require thought of its own).  If you've been thinking otherwise,
> that may explain why some of the accounting goes wrong.

So what does the normal page fault path do? Just invalidates the private
page out of the page tables. A subsequent fault goes through the normal
shared page path, which detects the truncation as it would with any
shared fault. Right?

hugetlb seems to pretty well follow the same pattern as memory.c in this
regard. I don't see the race?

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-11-01  6:18                       ` Nick Piggin
@ 2006-11-01 10:17                         ` Chen, Kenneth W
  2006-11-02  3:06                           ` Nick Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Chen, Kenneth W @ 2006-11-01 10:17 UTC (permalink / raw)
  To: 'Nick Piggin', Hugh Dickins
  Cc: 'David Gibson',
	g, Andrew Morton, 'Christoph Lameter',
	bill.irwin, Adam Litke, linux-mm

Nick Piggin wrote on Tuesday, October 31, 2006 10:19 PM
> So what does the normal page fault path do? Just invalidates the private
> page out of the page tables. A subsequent fault goes through the normal
> shared page path, which detects the truncation as it would with any
> shared fault. Right?
> 
> hugetlb seems to pretty well follow the same pattern as memory.c in this
> regard. I don't see the race?

I was originally worried about a case that one thread fault on a private
mapping and get hold of a fresh page via alloc_huge_page(). While it executes
clear_huge_page(), 2nd thread come by did a ftruncate. After first thread
finish zeroing, I thought it will happily install a pte. But no, the inode
size check will prevent that from happening.

I was mislead by the comments in hugetlb_no_page() that page lock is used to
guard against racing truncation.  Now I'm drifting back into what "racing
truncation" the comment is referring to. What race does it trying to protect
with page lock?

- Ken

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-11-01 10:17                         ` Chen, Kenneth W
@ 2006-11-02  3:06                           ` Nick Piggin
  0 siblings, 0 replies; 19+ messages in thread
From: Nick Piggin @ 2006-11-02  3:06 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: Hugh Dickins, 'David Gibson',
	g, Andrew Morton, 'Christoph Lameter',
	bill.irwin, Adam Litke, linux-mm

Chen, Kenneth W wrote:

>Nick Piggin wrote on Tuesday, October 31, 2006 10:19 PM
>
>>So what does the normal page fault path do? Just invalidates the private
>>page out of the page tables. A subsequent fault goes through the normal
>>shared page path, which detects the truncation as it would with any
>>shared fault. Right?
>>
>>hugetlb seems to pretty well follow the same pattern as memory.c in this
>>regard. I don't see the race?
>>
>
>I was originally worried about a case that one thread fault on a private
>mapping and get hold of a fresh page via alloc_huge_page(). While it executes
>clear_huge_page(), 2nd thread come by did a ftruncate. After first thread
>finish zeroing, I thought it will happily install a pte. But no, the inode
>size check will prevent that from happening.
>

Yes it should do.

>I was mislead by the comments in hugetlb_no_page() that page lock is used to
>guard against racing truncation.  Now I'm drifting back into what "racing
>truncation" the comment is referring to. What race does it trying to protect
>with page lock?
>

Probably attempting to fix a similar race as the do_no_page vs 
invalidate race
that I've been trying to fix for normal pages (and is currently handled for
truncate with truncate_count). However AFAIKS it is not page lock which 
protects
from truncate, but the page_table_lock (which the ptl scalability work might
have broken):

Here is the race for regular pagecache:

check i_size          
find_get_page         
                       i_size_write
                       unmap_mapping_range
set_pte               
                       truncate_inode_pages

So we have now mapped a truncated page. I fix this by using find_lock_page
in the filemap_nopage path, but I found that it isn't enough alone.
truncate_inode_pages must also check whether the page is mapped (while
holding the page lock) and be prepared to unmap (like
invalidate_inode_pages does).

Now, when looking at hugepages, it seems to be designed to give you the
unmap_mapping_range vs fault exclusion with ptl (note that it rechecks
i_size inside ptl). However now I think unmap_mapping_range runs without
ptl, you again have a race.

I'll look into it a bit more and see if I can fix it up in my patchset.

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-31 12:48                     ` Hugh Dickins
  2006-11-01  6:18                       ` Nick Piggin
@ 2006-11-02  2:29                       ` 'David Gibson'
  1 sibling, 0 replies; 19+ messages in thread
From: 'David Gibson' @ 2006-11-02  2:29 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Chen, Kenneth W, g, Andrew Morton, 'Christoph Lameter',
	bill.irwin, Adam Litke, linux-mm

On Tue, Oct 31, 2006 at 12:48:13PM +0000, Hugh Dickins wrote:
> On Tue, 31 Oct 2006, 'David Gibson' wrote:
> > On Mon, Oct 30, 2006 at 09:15:20PM -0800, Chen, Kenneth W wrote:
> > > 
> > > Instead, I'm asking how private mapping protect race between file truncation
> > > and fault? For shared mapping, it is clear to me that we are using lock_page
> > > to protect file truncate with fault.  But I don't see that protection with
> > > private mapping in current upstream kernel.
> > 
> > Oh, ok.  I can't see how it matters in the PRIVATE case, given that
> > truncate() won't, and shouldn't, truncate privately mapped pages.
> 
> Bzzt, it does and should (unless we decide to make hugetlbfs pages diverge
> from the standard for ordinary pages in this respect - could do, but that
> would require thought of its own).  

Oh, so it does.  That's not the semantic I would have guessed for
MAP_PRIVATE.

> If you've been thinking otherwise,
> that may explain why some of the accounting goes wrong.

Unlikely, given that there basically isn't any accounting for
MAP_PRIVATE pages.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-26 23:31   ` 'David Gibson'
  2006-10-27  0:04     ` Andrew Morton
@ 2006-10-27  1:47     ` 'David Gibson'
  2006-10-30 20:55       ` Adam Litke
  1 sibling, 1 reply; 19+ messages in thread
From: 'David Gibson' @ 2006-10-27  1:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chen, Kenneth W, 'Christoph Lameter',
	Hugh Dickins, bill.irwin, Adam Litke, linux-mm

On Fri, Oct 27, 2006 at 09:31:37AM +1000, 'David Gibson' wrote:
> On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote:
[snip]
> > The key points:
> > 
> > - Use tree_lock to prevent the race
> > 
> > - allocate the hugepage inside tree_lock so we never get into this
> >   two-threads-tried-to-allocate-the-final-page problem.
> > 
> > - The hugepage is zeroed without locks held, under lock_page()
> > 
> > - lock_page() is used to make the other thread(s) sleep while the winner
> >   thread is zeroing out the page.
> > 
> > It means that rather a lot of add_to_page_cache() will need to be copied
> > into hugetlb_no_page().
> 
> This handles the case of processes racing on a shared mapping, but not
> the case of threads racing on a private mapping.  In the latter case
> the race ends at the set_pte() rather than the add_to_page_cache()
> (well, strictly with the whole page_table_lock atomic lump).  And we
> can't move the clear after the set_pte() :(.

At various times many people have proposed "solutions" which address
the SHARED case, or the PRIVATE case, but not both.  As Andrew points
out in a later mail his approach may be fixable for the PRIVATE case,
but nonetheless it's important to check both cases.

So, here's another patch for libhugetlbfs, extending its testcase for
this race to check the MAP_PRIVATE case as well as the MAP_SHARED
case.  Adam, please apply.  Everyone else, please test proposed
approaches against the testsuite.  It's not exhaustive, but it's easy
to run and a good start :).

libhugetlbfs: Testcase for MAP_PRIVATE OOM-liable race condition

The spurious OOM condition which can be caused by race conditions in
the hugetlb fault handler can be triggered with both SHARED mappings
(separate processes racing on the same address_space) and with PRIVATE
mappings (different threads racing on the same vma).

At present the alloc-instantiate-race testcase only tests the SHARED
mapping case.  Since at various times kernel fixes have been proposed
which address only one or the other of the cases, extend the testcase
to check the MAP_PRIVATE in addition to the MAP_SHARED case.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

Index: libhugetlbfs/tests/alloc-instantiate-race.c
===================================================================
--- libhugetlbfs.orig/tests/alloc-instantiate-race.c	2006-09-04 17:08:33.000000000 +1000
+++ libhugetlbfs/tests/alloc-instantiate-race.c	2006-10-27 11:31:54.000000000 +1000
@@ -42,13 +42,18 @@
 #include <sched.h>
 #include <signal.h>
 #include <sys/wait.h>
+#include <pthread.h>
+#include <linux/unistd.h>
 
 #include <hugetlbfs.h>
 
 #include "hugetests.h"
 
+_syscall0(pid_t, gettid);
+
 static int hpage_size;
 static pid_t child1, child2;
+static pthread_t thread1, thread2;
 
 void cleanup(void)
 {
@@ -58,9 +63,8 @@ void cleanup(void)
 		kill(child2, SIGKILL);
 }
 
-
-static void one_racer(void *p, int cpu,
-	       volatile int *mytrigger, volatile int *othertrigger)
+static int one_racer(void *p, int cpu,
+		     volatile int *mytrigger, volatile int *othertrigger)
 {
 	volatile int *pi = p;
 	cpu_set_t cpuset;
@@ -70,7 +74,7 @@ static void one_racer(void *p, int cpu,
 	CPU_ZERO(&cpuset);
 	CPU_SET(cpu, &cpuset);
 
-	err = sched_setaffinity(getpid(), CPU_SETSIZE/8, &cpuset);
+	err = sched_setaffinity(gettid(), CPU_SETSIZE/8, &cpuset);
 	if (err != 0)
 		CONFIG("sched_setaffinity(cpu%d): %s", cpu, strerror(errno));
 
@@ -83,16 +87,39 @@ static void one_racer(void *p, int cpu,
 	/* Instantiate! */
 	*pi = 1;
 
-	exit(0);
+	return 0;
+}
+
+static void proc_racer(void *p, int cpu,
+		       volatile int *mytrigger, volatile int *othertrigger)
+{
+	exit(one_racer(p, cpu, mytrigger, othertrigger));
 }
 
-static void run_race(void *syncarea)
+struct racer_info {
+	void *p; /* instantiation address */
+	int cpu;
+	int race_type;
+	volatile int *mytrigger;
+	volatile int *othertrigger;
+	int status;
+};
+
+static void *thread_racer(void *info)
+{
+	struct racer_info *ri = info;
+	int rc;
+
+	rc = one_racer(ri->p, ri->cpu, ri->mytrigger, ri->othertrigger);
+	return ri;
+}
+static void run_race(void *syncarea, int race_type)
 {
 	volatile int *trigger1, *trigger2;
 	int fd;
 	void *p;
 	int status1, status2;
-	pid_t ret;
+	int ret;
 
 	memset(syncarea, 0, sizeof(*trigger1) + sizeof(*trigger2));
 	trigger1 = syncarea;
@@ -104,47 +131,90 @@ static void run_race(void *syncarea)
 		FAIL("hugetlbfs_unlinked_fd()");
 
 	verbose_printf("Mapping final page.. ");
-	p = mmap(NULL, hpage_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+	p = mmap(NULL, hpage_size, PROT_READ|PROT_WRITE, race_type, fd, 0);
 	if (p == MAP_FAILED)
 		FAIL("mmap(): %s", strerror(errno));
 	verbose_printf("%p\n", p);
 
-	child1 = fork();
-	if (child1 < 0)
-		FAIL("fork(): %s", strerror(errno));
-	if (child1 == 0)
-		one_racer(p, 0, trigger1, trigger2);
-
-	child2 = fork();
-	if (child2 < 0)
-		FAIL("fork(): %s", strerror(errno));
-	if (child2 == 0)
-		one_racer(p, 1, trigger2, trigger1);
-
-	/* wait() calls */
-	ret = waitpid(child1, &status1, 0);
-	if (ret < 0)
-		FAIL("waitpid() child 1: %s", strerror(errno));
-	verbose_printf("Child 1 status: %x\n", status1);
-
-
-	ret = waitpid(child2, &status2, 0);
-	if (ret < 0)
-		FAIL("waitpid() child 2: %s", strerror(errno));
-	verbose_printf("Child 2 status: %x\n", status2);
-
-	if (WIFSIGNALED(status1))
-		FAIL("Child 1 killed by signal %s",
-		     strsignal(WTERMSIG(status1)));
-	if (WIFSIGNALED(status2))
+	if (race_type == MAP_SHARED) {
+		child1 = fork();
+		if (child1 < 0)
+			FAIL("fork(): %s", strerror(errno));
+		if (child1 == 0)
+			proc_racer(p, 0, trigger1, trigger2);
+
+		child2 = fork();
+		if (child2 < 0)
+			FAIL("fork(): %s", strerror(errno));
+		if (child2 == 0)
+			proc_racer(p, 1, trigger2, trigger1);
+
+		/* wait() calls */
+		ret = waitpid(child1, &status1, 0);
+		if (ret < 0)
+			FAIL("waitpid() child 1: %s", strerror(errno));
+		verbose_printf("Child 1 status: %x\n", status1);
+
+
+		ret = waitpid(child2, &status2, 0);
+		if (ret < 0)
+			FAIL("waitpid() child 2: %s", strerror(errno));
+		verbose_printf("Child 2 status: %x\n", status2);
+
+		if (WIFSIGNALED(status1))
+			FAIL("Child 1 killed by signal %s",
+			     strsignal(WTERMSIG(status1)));
+		if (WIFSIGNALED(status2))
 		FAIL("Child 2 killed by signal %s",
 		     strsignal(WTERMSIG(status2)));
 
-	if (WEXITSTATUS(status1) != 0)
-		FAIL("Child 1 terminated with code %d", WEXITSTATUS(status1));
+		status1 = WEXITSTATUS(status1);
+		status2 = WEXITSTATUS(status2);
+	} else {
+		struct racer_info ri1 = {
+			.p = p,
+			.cpu = 0,
+			.mytrigger = trigger1,
+			.othertrigger = trigger2,
+		};
+		struct racer_info ri2 = {
+			.p = p,
+			.cpu = 1,
+			.mytrigger = trigger2,
+			.othertrigger = trigger1,
+		};
+		void *tret1, *tret2;
+
+		ret = pthread_create(&thread1, NULL, thread_racer, &ri1);
+		if (ret != 0)
+			FAIL("pthread_create() 1: %s\n", strerror(errno));
+
+		ret = pthread_create(&thread2, NULL, thread_racer, &ri2);
+		if (ret != 0)
+			FAIL("pthread_create() 2: %s\n", strerror(errno));
+
+		ret = pthread_join(thread1, &tret1);
+		if (ret != 0)
+			FAIL("pthread_join() 1: %s\n", strerror(errno));
+		if (tret1 != &ri1)
+			FAIL("Thread 1 returned %p not %p, killed?\n",
+			     tret1, &ri1);
+		ret = pthread_join(thread2, &tret2);
+		if (ret != 0)
+			FAIL("pthread_join() 2: %s\n", strerror(errno));
+		if (tret2 != &ri2)
+			FAIL("Thread 2 returned %p not %p, killed?\n",
+			     tret2, &ri2);
+
+		status1 = ri1.status;
+		status2 = ri2.status;
+	}
 
-	if (WEXITSTATUS(status2) != 0)
-		FAIL("Child 2 terminated with code %d", WEXITSTATUS(status2));
+	if (status1 != 0)
+		FAIL("Racer 1 terminated with code %d", status1);
+
+	if (status2 != 0)
+		FAIL("Racer 2 terminated with code %d", status2);
 }
 
 int main(int argc, char *argv[])
@@ -153,14 +223,25 @@ int main(int argc, char *argv[])
 	int fd;
 	void *p, *q;
 	unsigned long i;
+	int race_type;
 
 	test_init(argc, argv);
 
-	if (argc != 2)
-		CONFIG("Usage: alloc-instantiate-race <# total available hugepages>");
+	if (argc != 3)
+		CONFIG("Usage: alloc-instantiate-race"
+		       "<# total available hugepages> <private|shard>");
 
 	totpages = atoi(argv[1]);
 
+	if (strcmp(argv[2], "shared") == 0) {
+		race_type = MAP_SHARED;
+	} else if (strcmp(argv[2], "private") == 0) {
+		race_type = MAP_PRIVATE;
+	} else {
+		CONFIG("Usage: alloc-instantiate-race"
+		       "<# total available hugepages> <private|shard>");
+	}
+
 	hpage_size = gethugepagesize();
 	if (hpage_size < 0)
 		CONFIG("No hugepage kernel support");
@@ -189,7 +270,7 @@ int main(int argc, char *argv[])
 		memset(p + (i * hpage_size), 0, sizeof(int));
 	verbose_printf("done\n");
 
-	run_race(q);
+	run_race(q, race_type);
 
 	PASS();
 }
Index: libhugetlbfs/tests/run_tests.sh
===================================================================
--- libhugetlbfs.orig/tests/run_tests.sh	2006-10-27 10:08:23.000000000 +1000
+++ libhugetlbfs/tests/run_tests.sh	2006-10-27 10:45:06.000000000 +1000
@@ -147,7 +147,8 @@ functional_tests () {
 # killall -HUP hugetlbd
 # to make the sharing daemon give up the files
     run_test chunk-overcommit `free_hpages`
-    run_test alloc-instantiate-race `free_hpages`
+    run_test alloc-instantiate-race `free_hpages` shared
+    run_test alloc-instantiate-race `free_hpages` private
     run_test truncate_reserve_wraparound
     run_test truncate_sigbus_versus_oom `free_hpages`
 }


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-27  1:47     ` 'David Gibson'
@ 2006-10-30 20:55       ` Adam Litke
  0 siblings, 0 replies; 19+ messages in thread
From: Adam Litke @ 2006-10-30 20:55 UTC (permalink / raw)
  To: 'David Gibson'
  Cc: Andrew Morton, Chen, Kenneth W, 'Christoph Lameter',
	Hugh Dickins, bill.irwin, linux-mm

On Fri, 2006-10-27 at 11:47 +1000, 'David Gibson' wrote:
> libhugetlbfs: Testcase for MAP_PRIVATE OOM-liable race condition
> 
> The spurious OOM condition which can be caused by race conditions in
> the hugetlb fault handler can be triggered with both SHARED mappings
> (separate processes racing on the same address_space) and with PRIVATE
> mappings (different threads racing on the same vma).
> 
> At present the alloc-instantiate-race testcase only tests the SHARED
> mapping case.  Since at various times kernel fixes have been proposed
> which address only one or the other of the cases, extend the testcase
> to check the MAP_PRIVATE in addition to the MAP_SHARED case.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Adam Litke <agl@us.ibm.com>  
Applied.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] reduce hugetlb_instantiation_mutex usage
  2006-10-26 22:17 [RFC] reduce hugetlb_instantiation_mutex usage Chen, Kenneth W
  2006-10-26 22:44 ` Andrew Morton
@ 2006-10-26 23:47 ` 'David Gibson'
  1 sibling, 0 replies; 19+ messages in thread
From: 'David Gibson' @ 2006-10-26 23:47 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Christoph Lameter',
	Hugh Dickins, bill.irwin, Andrew Morton, Adam Litke, linux-mm

On Thu, Oct 26, 2006 at 03:17:20PM -0700, Chen, Kenneth W wrote:
> First rev of patch to allow hugetlb page fault to scale.
> 
> hugetlb_instantiation_mutex was introduced to prevent spurious allocation
> failure in a corner case: two threads race to instantiate same page with
> only one free page left in the global pool.  However, this global
> serialization hurts fault performance badly as noted by Christoph Lameter.
> This patch attempt to cut back the use of mutex only when free page resource
> is limited, thus allow fault to scale in most common cases.

>From my experience of spending most of the last two weeks going "We
can just do <this>...hack, hack.., no, that has a race too" this is
much harder to get right than you'd think.

For example with your patch, suppose CPU0 and CPU1 are both attempting
to instantiate the same page in a shared mapping, CPU2 is attempting
to instantiate a page in an unrelated mapping.

CPU0		CPU1		CPU2		token	free_hpages
						0	2
atomic_inc					1	2
(use_mutex=0)
		atomic_inc			2	2
		(use_mutex=1)
				atomic_inc	3	2
				(use_mutex=1)
				mutex_lock
				<complete fault>
				mutex_unlock
				atomic_dec	2	1
		mutex_lock			2	1
alloc_huge_page					2	0
		alloc_huge_page
		-> OOM
add_to_page_cache

So we still have the spurious OOM.  There may be other race
scenarios, that's just the first I came up with.

Oh, also your patch accesses free_huge_pages bare, whereas its usually
protected by hugetlb_lock.  As a read-only access that's *probably*
ok, but any lock-free access of variables which are generally supposed
to be lock protected makes me nervious.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2006-11-02  3:06 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-10-26 22:17 [RFC] reduce hugetlb_instantiation_mutex usage Chen, Kenneth W
2006-10-26 22:44 ` Andrew Morton
2006-10-26 23:31   ` 'David Gibson'
2006-10-27  0:04     ` Andrew Morton
2006-10-27  3:11       ` 'David Gibson'
2006-10-27  3:35         ` Andrew Morton
2006-10-27  4:06           ` 'David Gibson'
2006-10-31  2:54             ` Chen, Kenneth W
2006-10-31  3:17               ` 'David Gibson'
2006-10-31  5:15                 ` Chen, Kenneth W
2006-10-31 11:05                   ` 'David Gibson'
2006-10-31 12:48                     ` Hugh Dickins
2006-11-01  6:18                       ` Nick Piggin
2006-11-01 10:17                         ` Chen, Kenneth W
2006-11-02  3:06                           ` Nick Piggin
2006-11-02  2:29                       ` 'David Gibson'
2006-10-27  1:47     ` 'David Gibson'
2006-10-30 20:55       ` Adam Litke
2006-10-26 23:47 ` 'David Gibson'

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox