* [RFC] reduce hugetlb_instantiation_mutex usage
@ 2006-10-26 22:17 Chen, Kenneth W
2006-10-26 22:44 ` Andrew Morton
2006-10-26 23:47 ` 'David Gibson'
0 siblings, 2 replies; 19+ messages in thread
From: Chen, Kenneth W @ 2006-10-26 22:17 UTC (permalink / raw)
To: 'Christoph Lameter', 'David Gibson',
Hugh Dickins, bill.irwin, Andrew Morton, Adam Litke
Cc: linux-mm
First rev of patch to allow hugetlb page fault to scale.
hugetlb_instantiation_mutex was introduced to prevent spurious allocation
failure in a corner case: two threads race to instantiate same page with
only one free page left in the global pool. However, this global
serialization hurts fault performance badly as noted by Christoph Lameter.
This patch attempt to cut back the use of mutex only when free page resource
is limited, thus allow fault to scale in most common cases.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
--- ./mm/hugetlb.c.orig 2006-10-26 10:26:43.000000000 -0700
+++ ./mm/hugetlb.c 2006-10-26 13:18:03.000000000 -0700
@@ -542,6 +542,8 @@ int hugetlb_fault(struct mm_struct *mm,
pte_t entry;
int ret;
static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+ static atomic_t token = ATOMIC_INIT(0);
+ int use_mutex = 0;
ptep = huge_pte_alloc(mm, address);
if (!ptep)
@@ -552,12 +554,15 @@ int hugetlb_fault(struct mm_struct *mm,
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
- mutex_lock(&hugetlb_instantiation_mutex);
+ if (atomic_inc_return(&token) >= free_huge_pages) {
+ mutex_lock(&hugetlb_instantiation_mutex);
+ use_mutex = 1;
+ }
+
entry = *ptep;
if (pte_none(entry)) {
ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
- mutex_unlock(&hugetlb_instantiation_mutex);
- return ret;
+ goto out;
}
ret = VM_FAULT_MINOR;
@@ -568,7 +573,11 @@ int hugetlb_fault(struct mm_struct *mm,
if (write_access && !pte_write(entry))
ret = hugetlb_cow(mm, vma, address, ptep, entry);
spin_unlock(&mm->page_table_lock);
- mutex_unlock(&hugetlb_instantiation_mutex);
+
+out:
+ atomic_dec(&token);
+ if (use_mutex)
+ mutex_unlock(&hugetlb_instantiation_mutex);
return ret;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-26 22:17 [RFC] reduce hugetlb_instantiation_mutex usage Chen, Kenneth W @ 2006-10-26 22:44 ` Andrew Morton 2006-10-26 23:31 ` 'David Gibson' 2006-10-26 23:47 ` 'David Gibson' 1 sibling, 1 reply; 19+ messages in thread From: Andrew Morton @ 2006-10-26 22:44 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Christoph Lameter', 'David Gibson', Hugh Dickins, bill.irwin, Adam Litke, linux-mm On Thu, 26 Oct 2006 15:17:20 -0700 "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote: > First rev of patch to allow hugetlb page fault to scale. > > hugetlb_instantiation_mutex was introduced to prevent spurious allocation > failure in a corner case: two threads race to instantiate same page with > only one free page left in the global pool. However, this global > serialization hurts fault performance badly as noted by Christoph Lameter. > This patch attempt to cut back the use of mutex only when free page resource > is limited, thus allow fault to scale in most common cases. > ug. How about we kill that instantiation_mutex thing altogether and fix the original bug in a better fashion? Like... In hugetlb_no_page(): retry: page = find_lock_page(...) if (!page) { write_lock_irq(&mapping->tree_lock); if (radix_tree_lookup(...)) { write_unlock_irq(tree_lock); goto retry; } page = alloc_huge_page(...); if (!page) bail; radix_tree_insert(...); SetPageLocked(page); write_unlock_irq(tree_lock); clear_huge_page(...); } <stick it in page tables> unlock_page(page); The key points: - Use tree_lock to prevent the race - allocate the hugepage inside tree_lock so we never get into this two-threads-tried-to-allocate-the-final-page problem. - The hugepage is zeroed without locks held, under lock_page() - lock_page() is used to make the other thread(s) sleep while the winner thread is zeroing out the page. It means that rather a lot of add_to_page_cache() will need to be copied into hugetlb_no_page(). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-26 22:44 ` Andrew Morton @ 2006-10-26 23:31 ` 'David Gibson' 2006-10-27 0:04 ` Andrew Morton 2006-10-27 1:47 ` 'David Gibson' 0 siblings, 2 replies; 19+ messages in thread From: 'David Gibson' @ 2006-10-26 23:31 UTC (permalink / raw) To: Andrew Morton Cc: Chen, Kenneth W, 'Christoph Lameter', Hugh Dickins, bill.irwin, Adam Litke, linux-mm On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote: > On Thu, 26 Oct 2006 15:17:20 -0700 > "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote: > > > First rev of patch to allow hugetlb page fault to scale. > > > > hugetlb_instantiation_mutex was introduced to prevent spurious allocation > > failure in a corner case: two threads race to instantiate same page with > > only one free page left in the global pool. However, this global > > serialization hurts fault performance badly as noted by Christoph Lameter. > > This patch attempt to cut back the use of mutex only when free page resource > > is limited, thus allow fault to scale in most common cases. > > > > ug. > > How about we kill that instantiation_mutex thing altogether and fix > the original bug in a better fashion? Like... > > In hugetlb_no_page(): > > retry: > page = find_lock_page(...) > if (!page) { > write_lock_irq(&mapping->tree_lock); > if (radix_tree_lookup(...)) { > write_unlock_irq(tree_lock); > goto retry; > } > page = alloc_huge_page(...); > if (!page) > bail; > radix_tree_insert(...); > SetPageLocked(page); > write_unlock_irq(tree_lock); > clear_huge_page(...); > } > > <stick it in page tables> > > unlock_page(page); > > The key points: > > - Use tree_lock to prevent the race > > - allocate the hugepage inside tree_lock so we never get into this > two-threads-tried-to-allocate-the-final-page problem. > > - The hugepage is zeroed without locks held, under lock_page() > > - lock_page() is used to make the other thread(s) sleep while the winner > thread is zeroing out the page. > > It means that rather a lot of add_to_page_cache() will need to be copied > into hugetlb_no_page(). This handles the case of processes racing on a shared mapping, but not the case of threads racing on a private mapping. In the latter case the race ends at the set_pte() rather than the add_to_page_cache() (well, strictly with the whole page_table_lock atomic lump). And we can't move the clear after the set_pte() :(. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-26 23:31 ` 'David Gibson' @ 2006-10-27 0:04 ` Andrew Morton 2006-10-27 3:11 ` 'David Gibson' 2006-10-27 1:47 ` 'David Gibson' 1 sibling, 1 reply; 19+ messages in thread From: Andrew Morton @ 2006-10-27 0:04 UTC (permalink / raw) To: 'David Gibson' Cc: Chen, Kenneth W, 'Christoph Lameter', Hugh Dickins, bill.irwin, Adam Litke, linux-mm On Fri, 27 Oct 2006 09:31:37 +1000 "'David Gibson'" <david@gibson.dropbear.id.au> wrote: > On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote: > > On Thu, 26 Oct 2006 15:17:20 -0700 > > "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote: > > > > > First rev of patch to allow hugetlb page fault to scale. > > > > > > hugetlb_instantiation_mutex was introduced to prevent spurious allocation > > > failure in a corner case: two threads race to instantiate same page with > > > only one free page left in the global pool. However, this global > > > serialization hurts fault performance badly as noted by Christoph Lameter. > > > This patch attempt to cut back the use of mutex only when free page resource > > > is limited, thus allow fault to scale in most common cases. > > > > > > > ug. > > > > How about we kill that instantiation_mutex thing altogether and fix > > the original bug in a better fashion? Like... > > > > In hugetlb_no_page(): > > > > retry: > > page = find_lock_page(...) > > if (!page) { > > write_lock_irq(&mapping->tree_lock); > > if (radix_tree_lookup(...)) { > > write_unlock_irq(tree_lock); > > goto retry; > > } > > page = alloc_huge_page(...); > > if (!page) > > bail; > > radix_tree_insert(...); > > SetPageLocked(page); > > write_unlock_irq(tree_lock); > > clear_huge_page(...); > > } > > > > <stick it in page tables> > > > > unlock_page(page); > > > > The key points: > > > > - Use tree_lock to prevent the race > > > > - allocate the hugepage inside tree_lock so we never get into this > > two-threads-tried-to-allocate-the-final-page problem. > > > > - The hugepage is zeroed without locks held, under lock_page() > > > > - lock_page() is used to make the other thread(s) sleep while the winner > > thread is zeroing out the page. > > > > It means that rather a lot of add_to_page_cache() will need to be copied > > into hugetlb_no_page(). > > This handles the case of processes racing on a shared mapping, but not > the case of threads racing on a private mapping. In the latter case > the race ends at the set_pte() rather than the add_to_page_cache() > (well, strictly with the whole page_table_lock atomic lump). And we > can't move the clear after the set_pte() :(. > I expect we can do a similar thing, using page_table_lock to prevent the race. The key is to be able to make racing threads still block on the page lock. Perhaps install a temp pte which is !pte_present() and also !pte_none(). So the racing thread can use that pte to locate and wait upon the presently-locked page while it is being COWed by another CPU. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-27 0:04 ` Andrew Morton @ 2006-10-27 3:11 ` 'David Gibson' 2006-10-27 3:35 ` Andrew Morton 0 siblings, 1 reply; 19+ messages in thread From: 'David Gibson' @ 2006-10-27 3:11 UTC (permalink / raw) To: Andrew Morton Cc: Chen, Kenneth W, 'Christoph Lameter', Hugh Dickins, bill.irwin, Adam Litke, linux-mm On Thu, Oct 26, 2006 at 05:04:15PM -0700, Andrew Morton wrote: > On Fri, 27 Oct 2006 09:31:37 +1000 > "'David Gibson'" <david@gibson.dropbear.id.au> wrote: > > > On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote: > > > On Thu, 26 Oct 2006 15:17:20 -0700 > > > "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote: > > > > > > > First rev of patch to allow hugetlb page fault to scale. > > > > > > > > hugetlb_instantiation_mutex was introduced to prevent spurious allocation > > > > failure in a corner case: two threads race to instantiate same page with > > > > only one free page left in the global pool. However, this global > > > > serialization hurts fault performance badly as noted by Christoph Lameter. > > > > This patch attempt to cut back the use of mutex only when free page resource > > > > is limited, thus allow fault to scale in most common cases. > > > > > > > > > > ug. > > > > > > How about we kill that instantiation_mutex thing altogether and fix > > > the original bug in a better fashion? Like... > > > > > > In hugetlb_no_page(): > > > > > > retry: > > > page = find_lock_page(...) > > > if (!page) { > > > write_lock_irq(&mapping->tree_lock); > > > if (radix_tree_lookup(...)) { > > > write_unlock_irq(tree_lock); > > > goto retry; > > > } > > > page = alloc_huge_page(...); > > > if (!page) > > > bail; > > > radix_tree_insert(...); > > > SetPageLocked(page); > > > write_unlock_irq(tree_lock); > > > clear_huge_page(...); > > > } > > > > > > <stick it in page tables> > > > > > > unlock_page(page); > > > > > > The key points: > > > > > > - Use tree_lock to prevent the race > > > > > > - allocate the hugepage inside tree_lock so we never get into this > > > two-threads-tried-to-allocate-the-final-page problem. > > > > > > - The hugepage is zeroed without locks held, under lock_page() > > > > > > - lock_page() is used to make the other thread(s) sleep while the winner > > > thread is zeroing out the page. > > > > > > It means that rather a lot of add_to_page_cache() will need to be copied > > > into hugetlb_no_page(). > > > > This handles the case of processes racing on a shared mapping, but not > > the case of threads racing on a private mapping. In the latter case > > the race ends at the set_pte() rather than the add_to_page_cache() > > (well, strictly with the whole page_table_lock atomic lump). And we > > can't move the clear after the set_pte() :(. > > > > I expect we can do a similar thing, using page_table_lock to prevent the > race. > > The key is to be able to make racing threads still block on the page lock. > Perhaps install a temp pte which is !pte_present() and also !pte_none(). > So the racing thread can use that pte to locate and wait upon the > presently-locked page while it is being COWed by another CPU. Um.. yes, that might work. Though I'd need to think hard about a more specific scheme. I've been through a lot of approaches lately that looked ok at first glance, but weren't :-/ And obviously we'd need to make sure such "tentative" PTEs are constructible won't confuse other code on each relevant architecture. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-27 3:11 ` 'David Gibson' @ 2006-10-27 3:35 ` Andrew Morton 2006-10-27 4:06 ` 'David Gibson' 0 siblings, 1 reply; 19+ messages in thread From: Andrew Morton @ 2006-10-27 3:35 UTC (permalink / raw) To: 'David Gibson' Cc: Chen, Kenneth W, 'Christoph Lameter', Hugh Dickins, bill.irwin, Adam Litke, linux-mm On Fri, 27 Oct 2006 13:11:56 +1000 "'David Gibson'" <david@gibson.dropbear.id.au> wrote: > On Thu, Oct 26, 2006 at 05:04:15PM -0700, Andrew Morton wrote: > > On Fri, 27 Oct 2006 09:31:37 +1000 > > "'David Gibson'" <david@gibson.dropbear.id.au> wrote: > > > > > On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote: > > > > On Thu, 26 Oct 2006 15:17:20 -0700 > > > > "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote: > > > > > > > > > First rev of patch to allow hugetlb page fault to scale. > > > > > > > > > > hugetlb_instantiation_mutex was introduced to prevent spurious allocation > > > > > failure in a corner case: two threads race to instantiate same page with > > > > > only one free page left in the global pool. However, this global > > > > > serialization hurts fault performance badly as noted by Christoph Lameter. > > > > > This patch attempt to cut back the use of mutex only when free page resource > > > > > is limited, thus allow fault to scale in most common cases. > > > > > > > > > > > > > ug. > > > > > > > > How about we kill that instantiation_mutex thing altogether and fix > > > > the original bug in a better fashion? Like... > > > > > > > > In hugetlb_no_page(): > > > > > > > > retry: > > > > page = find_lock_page(...) > > > > if (!page) { > > > > write_lock_irq(&mapping->tree_lock); > > > > if (radix_tree_lookup(...)) { > > > > write_unlock_irq(tree_lock); > > > > goto retry; > > > > } > > > > page = alloc_huge_page(...); > > > > if (!page) > > > > bail; > > > > radix_tree_insert(...); > > > > SetPageLocked(page); > > > > write_unlock_irq(tree_lock); > > > > clear_huge_page(...); > > > > } > > > > > > > > <stick it in page tables> > > > > > > > > unlock_page(page); > > > > > > > > The key points: > > > > > > > > - Use tree_lock to prevent the race > > > > > > > > - allocate the hugepage inside tree_lock so we never get into this > > > > two-threads-tried-to-allocate-the-final-page problem. > > > > > > > > - The hugepage is zeroed without locks held, under lock_page() > > > > > > > > - lock_page() is used to make the other thread(s) sleep while the winner > > > > thread is zeroing out the page. > > > > > > > > It means that rather a lot of add_to_page_cache() will need to be copied > > > > into hugetlb_no_page(). > > > > > > This handles the case of processes racing on a shared mapping, but not > > > the case of threads racing on a private mapping. In the latter case > > > the race ends at the set_pte() rather than the add_to_page_cache() > > > (well, strictly with the whole page_table_lock atomic lump). And we > > > can't move the clear after the set_pte() :(. > > > > > > > I expect we can do a similar thing, using page_table_lock to prevent the > > race. > > > > The key is to be able to make racing threads still block on the page lock. > > Perhaps install a temp pte which is !pte_present() and also !pte_none(). > > So the racing thread can use that pte to locate and wait upon the > > presently-locked page while it is being COWed by another CPU. > > Um.. yes, that might work. Though I'd need to think hard about a more > specific scheme. I've been through a lot of approaches lately that > looked ok at first glance, but weren't :-/ > > And obviously we'd need to make sure such "tentative" PTEs are > constructible won't confuse other code on each relevant architecture. > There's various cross-arch infrastructure for this which is used for encoding swap offsets within pte's which could perhaps be ab^W^Wreused. Alternatively, we could put the page into pagecache whether or not the mapping is MAP_SHARED. Then pull it out again prior to unlocking it if it's MAP_PRIVATE. So we're using pagecache just as a way for the concurrent faulter to locate the page. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-27 3:35 ` Andrew Morton @ 2006-10-27 4:06 ` 'David Gibson' 2006-10-31 2:54 ` Chen, Kenneth W 0 siblings, 1 reply; 19+ messages in thread From: 'David Gibson' @ 2006-10-27 4:06 UTC (permalink / raw) To: Andrew Morton Cc: Chen, Kenneth W, 'Christoph Lameter', Hugh Dickins, bill.irwin, Adam Litke, linux-mm On Thu, Oct 26, 2006 at 08:35:22PM -0700, Andrew Morton wrote: > On Fri, 27 Oct 2006 13:11:56 +1000 > "'David Gibson'" <david@gibson.dropbear.id.au> wrote: > > > On Thu, Oct 26, 2006 at 05:04:15PM -0700, Andrew Morton wrote: > > > On Fri, 27 Oct 2006 09:31:37 +1000 > > > "'David Gibson'" <david@gibson.dropbear.id.au> wrote: > > > > > > > On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote: > > > > > On Thu, 26 Oct 2006 15:17:20 -0700 > > > > > "Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote: > > > > > > > > > > > First rev of patch to allow hugetlb page fault to scale. > > > > > > > > > > > > hugetlb_instantiation_mutex was introduced to prevent spurious allocation > > > > > > failure in a corner case: two threads race to instantiate same page with > > > > > > only one free page left in the global pool. However, this global > > > > > > serialization hurts fault performance badly as noted by Christoph Lameter. > > > > > > This patch attempt to cut back the use of mutex only when free page resource > > > > > > is limited, thus allow fault to scale in most common cases. > > > > > > > > > > > > > > > > ug. > > > > > > > > > > How about we kill that instantiation_mutex thing altogether and fix > > > > > the original bug in a better fashion? Like... > > > > > > > > > > In hugetlb_no_page(): > > > > > > > > > > retry: > > > > > page = find_lock_page(...) > > > > > if (!page) { > > > > > write_lock_irq(&mapping->tree_lock); > > > > > if (radix_tree_lookup(...)) { > > > > > write_unlock_irq(tree_lock); > > > > > goto retry; > > > > > } > > > > > page = alloc_huge_page(...); > > > > > if (!page) > > > > > bail; > > > > > radix_tree_insert(...); > > > > > SetPageLocked(page); > > > > > write_unlock_irq(tree_lock); > > > > > clear_huge_page(...); > > > > > } > > > > > > > > > > <stick it in page tables> > > > > > > > > > > unlock_page(page); > > > > > > > > > > The key points: > > > > > > > > > > - Use tree_lock to prevent the race > > > > > > > > > > - allocate the hugepage inside tree_lock so we never get into this > > > > > two-threads-tried-to-allocate-the-final-page problem. > > > > > > > > > > - The hugepage is zeroed without locks held, under lock_page() > > > > > > > > > > - lock_page() is used to make the other thread(s) sleep while the winner > > > > > thread is zeroing out the page. > > > > > > > > > > It means that rather a lot of add_to_page_cache() will need to be copied > > > > > into hugetlb_no_page(). > > > > > > > > This handles the case of processes racing on a shared mapping, but not > > > > the case of threads racing on a private mapping. In the latter case > > > > the race ends at the set_pte() rather than the add_to_page_cache() > > > > (well, strictly with the whole page_table_lock atomic lump). And we > > > > can't move the clear after the set_pte() :(. > > > > > > > > > > I expect we can do a similar thing, using page_table_lock to prevent the > > > race. > > > > > > The key is to be able to make racing threads still block on the page lock. > > > Perhaps install a temp pte which is !pte_present() and also !pte_none(). > > > So the racing thread can use that pte to locate and wait upon the > > > presently-locked page while it is being COWed by another CPU. > > > > Um.. yes, that might work. Though I'd need to think hard about a more > > specific scheme. I've been through a lot of approaches lately that > > looked ok at first glance, but weren't :-/ > > > > And obviously we'd need to make sure such "tentative" PTEs are > > constructible won't confuse other code on each relevant architecture. > > There's various cross-arch infrastructure for this which is used for > encoding swap offsets within pte's which could perhaps be > ab^W^Wreused. Yes, but the encoding and assumptions about ptes aren't always exactly the same for hugeptes as normal ptes. > Alternatively, we could put the page into pagecache whether or not the > mapping is MAP_SHARED. Then pull it out again prior to unlocking it if > it's MAP_PRIVATE. So we're using pagecache just as a way for the > concurrent faulter to locate the page. Hrm.. interesting if we can make it work. I'd be worried about cases with concurrent PRIVATE and SHARED pages on the same file offset. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-27 4:06 ` 'David Gibson' @ 2006-10-31 2:54 ` Chen, Kenneth W 2006-10-31 3:17 ` 'David Gibson' 0 siblings, 1 reply; 19+ messages in thread From: Chen, Kenneth W @ 2006-10-31 2:54 UTC (permalink / raw) To: 'David Gibson', Andrew Morton Cc: 'Christoph Lameter', Hugh Dickins, bill.irwin, Adam Litke, linux-mm David Gibson wrote on Thursday, October 26, 2006 9:06 PM > > Alternatively, we could put the page into pagecache whether or not the > > mapping is MAP_SHARED. Then pull it out again prior to unlocking it if > > it's MAP_PRIVATE. So we're using pagecache just as a way for the > > concurrent faulter to locate the page. > > Hrm.. interesting if we can make it work. I'd be worried about cases > with concurrent PRIVATE and SHARED pages on the same file offset. I got side tracked on to the radix-tree stuff. The comments in hugetlb_no_page() make me wonder whether we have a race issue on private mapping: /* * Use page lock to guard against racing truncation * before we get page_table_lock. */ Private mapping won't use radix tree during instantiation. What protects racy truncate against fault in that scenario? Don't we have a bug here? - Ken -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-31 2:54 ` Chen, Kenneth W @ 2006-10-31 3:17 ` 'David Gibson' 2006-10-31 5:15 ` Chen, Kenneth W 0 siblings, 1 reply; 19+ messages in thread From: 'David Gibson' @ 2006-10-31 3:17 UTC (permalink / raw) To: Chen, Kenneth W, g Cc: Andrew Morton, 'Christoph Lameter', Hugh Dickins, bill.irwin, Adam Litke, linux-mm On Mon, Oct 30, 2006 at 06:54:46PM -0800, Chen, Kenneth W wrote: > David Gibson wrote on Thursday, October 26, 2006 9:06 PM > > > Alternatively, we could put the page into pagecache whether or not the > > > mapping is MAP_SHARED. Then pull it out again prior to unlocking it if > > > it's MAP_PRIVATE. So we're using pagecache just as a way for the > > > concurrent faulter to locate the page. > > > > Hrm.. interesting if we can make it work. I'd be worried about cases > > with concurrent PRIVATE and SHARED pages on the same file offset. > > I got side tracked on to the radix-tree stuff. The comments in > hugetlb_no_page() make me wonder whether we have a race issue on > private mapping: > > /* > * Use page lock to guard against racing truncation > * before we get page_table_lock. > */ > > Private mapping won't use radix tree during instantiation. What protects > racy truncate against fault in that scenario? Don't we have a bug here? Not at present, because the hugetlb_instantiation_mutex protects both fault paths. But with Andrew's patch as it stands, yes. As I said in a previous email. The libhugetlbfs testsuite now has a testcase for the MAP_PRIVATE as well as the MAP_SHARED version of the race. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-31 3:17 ` 'David Gibson' @ 2006-10-31 5:15 ` Chen, Kenneth W 2006-10-31 11:05 ` 'David Gibson' 0 siblings, 1 reply; 19+ messages in thread From: Chen, Kenneth W @ 2006-10-31 5:15 UTC (permalink / raw) To: 'David Gibson', g Cc: Andrew Morton, 'Christoph Lameter', Hugh Dickins, bill.irwin, Adam Litke, linux-mm David Gibson wrote on Monday, October 30, 2006 7:17 PM > > I got side tracked on to the radix-tree stuff. The comments in > > hugetlb_no_page() make me wonder whether we have a race issue on > > private mapping: > > > > /* > > * Use page lock to guard against racing truncation > > * before we get page_table_lock. > > */ > > > > Private mapping won't use radix tree during instantiation. What protects > > racy truncate against fault in that scenario? Don't we have a bug here? > > Not at present, because the hugetlb_instantiation_mutex protects both > fault paths. But with Andrew's patch as it stands, yes. As I said in > a previous email. The libhugetlbfs testsuite now has a testcase for > the MAP_PRIVATE as well as the MAP_SHARED version of the race. That's not what I'm saying. I should've said I'm off topic and not talking about parallel fault for private mapping. Instead, I'm asking how private mapping protect race between file truncation and fault? For shared mapping, it is clear to me that we are using lock_page to protect file truncate with fault. But I don't see that protection with private mapping in current upstream kernel. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-31 5:15 ` Chen, Kenneth W @ 2006-10-31 11:05 ` 'David Gibson' 2006-10-31 12:48 ` Hugh Dickins 0 siblings, 1 reply; 19+ messages in thread From: 'David Gibson' @ 2006-10-31 11:05 UTC (permalink / raw) To: Chen, Kenneth W Cc: g, Andrew Morton, 'Christoph Lameter', Hugh Dickins, bill.irwin, Adam Litke, linux-mm On Mon, Oct 30, 2006 at 09:15:20PM -0800, Chen, Kenneth W wrote: > David Gibson wrote on Monday, October 30, 2006 7:17 PM > > > I got side tracked on to the radix-tree stuff. The comments in > > > hugetlb_no_page() make me wonder whether we have a race issue on > > > private mapping: > > > > > > /* > > > * Use page lock to guard against racing truncation > > > * before we get page_table_lock. > > > */ > > > > > > Private mapping won't use radix tree during instantiation. What protects > > > racy truncate against fault in that scenario? Don't we have a bug here? > > > > Not at present, because the hugetlb_instantiation_mutex protects both > > fault paths. But with Andrew's patch as it stands, yes. As I said in > > a previous email. The libhugetlbfs testsuite now has a testcase for > > the MAP_PRIVATE as well as the MAP_SHARED version of the race. > > > That's not what I'm saying. I should've said I'm off topic and not talking > about parallel fault for private mapping. > > Instead, I'm asking how private mapping protect race between file truncation > and fault? For shared mapping, it is clear to me that we are using lock_page > to protect file truncate with fault. But I don't see that protection with > private mapping in current upstream kernel. Oh, ok. I can't see how it matters in the PRIVATE case, given that truncate() won't, and shouldn't, truncate privately mapped pages. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-31 11:05 ` 'David Gibson' @ 2006-10-31 12:48 ` Hugh Dickins 2006-11-01 6:18 ` Nick Piggin 2006-11-02 2:29 ` 'David Gibson' 0 siblings, 2 replies; 19+ messages in thread From: Hugh Dickins @ 2006-10-31 12:48 UTC (permalink / raw) To: 'David Gibson' Cc: Chen, Kenneth W, g, Andrew Morton, 'Christoph Lameter', bill.irwin, Adam Litke, linux-mm On Tue, 31 Oct 2006, 'David Gibson' wrote: > On Mon, Oct 30, 2006 at 09:15:20PM -0800, Chen, Kenneth W wrote: > > > > Instead, I'm asking how private mapping protect race between file truncation > > and fault? For shared mapping, it is clear to me that we are using lock_page > > to protect file truncate with fault. But I don't see that protection with > > private mapping in current upstream kernel. > > Oh, ok. I can't see how it matters in the PRIVATE case, given that > truncate() won't, and shouldn't, truncate privately mapped pages. Bzzt, it does and should (unless we decide to make hugetlbfs pages diverge from the standard for ordinary pages in this respect - could do, but that would require thought of its own). If you've been thinking otherwise, that may explain why some of the accounting goes wrong. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-31 12:48 ` Hugh Dickins @ 2006-11-01 6:18 ` Nick Piggin 2006-11-01 10:17 ` Chen, Kenneth W 2006-11-02 2:29 ` 'David Gibson' 1 sibling, 1 reply; 19+ messages in thread From: Nick Piggin @ 2006-11-01 6:18 UTC (permalink / raw) To: Hugh Dickins Cc: 'David Gibson', Chen, Kenneth W, g, Andrew Morton, 'Christoph Lameter', bill.irwin, Adam Litke, linux-mm Hugh Dickins wrote: > On Tue, 31 Oct 2006, 'David Gibson' wrote: > >>On Mon, Oct 30, 2006 at 09:15:20PM -0800, Chen, Kenneth W wrote: >> >>>Instead, I'm asking how private mapping protect race between file truncation >>>and fault? For shared mapping, it is clear to me that we are using lock_page >>>to protect file truncate with fault. But I don't see that protection with >>>private mapping in current upstream kernel. >> >>Oh, ok. I can't see how it matters in the PRIVATE case, given that >>truncate() won't, and shouldn't, truncate privately mapped pages. > > > Bzzt, it does and should (unless we decide to make hugetlbfs pages diverge > from the standard for ordinary pages in this respect - could do, but that > would require thought of its own). If you've been thinking otherwise, > that may explain why some of the accounting goes wrong. So what does the normal page fault path do? Just invalidates the private page out of the page tables. A subsequent fault goes through the normal shared page path, which detects the truncation as it would with any shared fault. Right? hugetlb seems to pretty well follow the same pattern as memory.c in this regard. I don't see the race? -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [RFC] reduce hugetlb_instantiation_mutex usage 2006-11-01 6:18 ` Nick Piggin @ 2006-11-01 10:17 ` Chen, Kenneth W 2006-11-02 3:06 ` Nick Piggin 0 siblings, 1 reply; 19+ messages in thread From: Chen, Kenneth W @ 2006-11-01 10:17 UTC (permalink / raw) To: 'Nick Piggin', Hugh Dickins Cc: 'David Gibson', g, Andrew Morton, 'Christoph Lameter', bill.irwin, Adam Litke, linux-mm Nick Piggin wrote on Tuesday, October 31, 2006 10:19 PM > So what does the normal page fault path do? Just invalidates the private > page out of the page tables. A subsequent fault goes through the normal > shared page path, which detects the truncation as it would with any > shared fault. Right? > > hugetlb seems to pretty well follow the same pattern as memory.c in this > regard. I don't see the race? I was originally worried about a case that one thread fault on a private mapping and get hold of a fresh page via alloc_huge_page(). While it executes clear_huge_page(), 2nd thread come by did a ftruncate. After first thread finish zeroing, I thought it will happily install a pte. But no, the inode size check will prevent that from happening. I was mislead by the comments in hugetlb_no_page() that page lock is used to guard against racing truncation. Now I'm drifting back into what "racing truncation" the comment is referring to. What race does it trying to protect with page lock? - Ken -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-11-01 10:17 ` Chen, Kenneth W @ 2006-11-02 3:06 ` Nick Piggin 0 siblings, 0 replies; 19+ messages in thread From: Nick Piggin @ 2006-11-02 3:06 UTC (permalink / raw) To: Chen, Kenneth W Cc: Hugh Dickins, 'David Gibson', g, Andrew Morton, 'Christoph Lameter', bill.irwin, Adam Litke, linux-mm Chen, Kenneth W wrote: >Nick Piggin wrote on Tuesday, October 31, 2006 10:19 PM > >>So what does the normal page fault path do? Just invalidates the private >>page out of the page tables. A subsequent fault goes through the normal >>shared page path, which detects the truncation as it would with any >>shared fault. Right? >> >>hugetlb seems to pretty well follow the same pattern as memory.c in this >>regard. I don't see the race? >> > >I was originally worried about a case that one thread fault on a private >mapping and get hold of a fresh page via alloc_huge_page(). While it executes >clear_huge_page(), 2nd thread come by did a ftruncate. After first thread >finish zeroing, I thought it will happily install a pte. But no, the inode >size check will prevent that from happening. > Yes it should do. >I was mislead by the comments in hugetlb_no_page() that page lock is used to >guard against racing truncation. Now I'm drifting back into what "racing >truncation" the comment is referring to. What race does it trying to protect >with page lock? > Probably attempting to fix a similar race as the do_no_page vs invalidate race that I've been trying to fix for normal pages (and is currently handled for truncate with truncate_count). However AFAIKS it is not page lock which protects from truncate, but the page_table_lock (which the ptl scalability work might have broken): Here is the race for regular pagecache: check i_size find_get_page i_size_write unmap_mapping_range set_pte truncate_inode_pages So we have now mapped a truncated page. I fix this by using find_lock_page in the filemap_nopage path, but I found that it isn't enough alone. truncate_inode_pages must also check whether the page is mapped (while holding the page lock) and be prepared to unmap (like invalidate_inode_pages does). Now, when looking at hugepages, it seems to be designed to give you the unmap_mapping_range vs fault exclusion with ptl (note that it rechecks i_size inside ptl). However now I think unmap_mapping_range runs without ptl, you again have a race. I'll look into it a bit more and see if I can fix it up in my patchset. -- Send instant messages to your online friends http://au.messenger.yahoo.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-31 12:48 ` Hugh Dickins 2006-11-01 6:18 ` Nick Piggin @ 2006-11-02 2:29 ` 'David Gibson' 1 sibling, 0 replies; 19+ messages in thread From: 'David Gibson' @ 2006-11-02 2:29 UTC (permalink / raw) To: Hugh Dickins Cc: Chen, Kenneth W, g, Andrew Morton, 'Christoph Lameter', bill.irwin, Adam Litke, linux-mm On Tue, Oct 31, 2006 at 12:48:13PM +0000, Hugh Dickins wrote: > On Tue, 31 Oct 2006, 'David Gibson' wrote: > > On Mon, Oct 30, 2006 at 09:15:20PM -0800, Chen, Kenneth W wrote: > > > > > > Instead, I'm asking how private mapping protect race between file truncation > > > and fault? For shared mapping, it is clear to me that we are using lock_page > > > to protect file truncate with fault. But I don't see that protection with > > > private mapping in current upstream kernel. > > > > Oh, ok. I can't see how it matters in the PRIVATE case, given that > > truncate() won't, and shouldn't, truncate privately mapped pages. > > Bzzt, it does and should (unless we decide to make hugetlbfs pages diverge > from the standard for ordinary pages in this respect - could do, but that > would require thought of its own). Oh, so it does. That's not the semantic I would have guessed for MAP_PRIVATE. > If you've been thinking otherwise, > that may explain why some of the accounting goes wrong. Unlikely, given that there basically isn't any accounting for MAP_PRIVATE pages. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-26 23:31 ` 'David Gibson' 2006-10-27 0:04 ` Andrew Morton @ 2006-10-27 1:47 ` 'David Gibson' 2006-10-30 20:55 ` Adam Litke 1 sibling, 1 reply; 19+ messages in thread From: 'David Gibson' @ 2006-10-27 1:47 UTC (permalink / raw) To: Andrew Morton Cc: Chen, Kenneth W, 'Christoph Lameter', Hugh Dickins, bill.irwin, Adam Litke, linux-mm On Fri, Oct 27, 2006 at 09:31:37AM +1000, 'David Gibson' wrote: > On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote: [snip] > > The key points: > > > > - Use tree_lock to prevent the race > > > > - allocate the hugepage inside tree_lock so we never get into this > > two-threads-tried-to-allocate-the-final-page problem. > > > > - The hugepage is zeroed without locks held, under lock_page() > > > > - lock_page() is used to make the other thread(s) sleep while the winner > > thread is zeroing out the page. > > > > It means that rather a lot of add_to_page_cache() will need to be copied > > into hugetlb_no_page(). > > This handles the case of processes racing on a shared mapping, but not > the case of threads racing on a private mapping. In the latter case > the race ends at the set_pte() rather than the add_to_page_cache() > (well, strictly with the whole page_table_lock atomic lump). And we > can't move the clear after the set_pte() :(. At various times many people have proposed "solutions" which address the SHARED case, or the PRIVATE case, but not both. As Andrew points out in a later mail his approach may be fixable for the PRIVATE case, but nonetheless it's important to check both cases. So, here's another patch for libhugetlbfs, extending its testcase for this race to check the MAP_PRIVATE case as well as the MAP_SHARED case. Adam, please apply. Everyone else, please test proposed approaches against the testsuite. It's not exhaustive, but it's easy to run and a good start :). libhugetlbfs: Testcase for MAP_PRIVATE OOM-liable race condition The spurious OOM condition which can be caused by race conditions in the hugetlb fault handler can be triggered with both SHARED mappings (separate processes racing on the same address_space) and with PRIVATE mappings (different threads racing on the same vma). At present the alloc-instantiate-race testcase only tests the SHARED mapping case. Since at various times kernel fixes have been proposed which address only one or the other of the cases, extend the testcase to check the MAP_PRIVATE in addition to the MAP_SHARED case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Index: libhugetlbfs/tests/alloc-instantiate-race.c =================================================================== --- libhugetlbfs.orig/tests/alloc-instantiate-race.c 2006-09-04 17:08:33.000000000 +1000 +++ libhugetlbfs/tests/alloc-instantiate-race.c 2006-10-27 11:31:54.000000000 +1000 @@ -42,13 +42,18 @@ #include <sched.h> #include <signal.h> #include <sys/wait.h> +#include <pthread.h> +#include <linux/unistd.h> #include <hugetlbfs.h> #include "hugetests.h" +_syscall0(pid_t, gettid); + static int hpage_size; static pid_t child1, child2; +static pthread_t thread1, thread2; void cleanup(void) { @@ -58,9 +63,8 @@ void cleanup(void) kill(child2, SIGKILL); } - -static void one_racer(void *p, int cpu, - volatile int *mytrigger, volatile int *othertrigger) +static int one_racer(void *p, int cpu, + volatile int *mytrigger, volatile int *othertrigger) { volatile int *pi = p; cpu_set_t cpuset; @@ -70,7 +74,7 @@ static void one_racer(void *p, int cpu, CPU_ZERO(&cpuset); CPU_SET(cpu, &cpuset); - err = sched_setaffinity(getpid(), CPU_SETSIZE/8, &cpuset); + err = sched_setaffinity(gettid(), CPU_SETSIZE/8, &cpuset); if (err != 0) CONFIG("sched_setaffinity(cpu%d): %s", cpu, strerror(errno)); @@ -83,16 +87,39 @@ static void one_racer(void *p, int cpu, /* Instantiate! */ *pi = 1; - exit(0); + return 0; +} + +static void proc_racer(void *p, int cpu, + volatile int *mytrigger, volatile int *othertrigger) +{ + exit(one_racer(p, cpu, mytrigger, othertrigger)); } -static void run_race(void *syncarea) +struct racer_info { + void *p; /* instantiation address */ + int cpu; + int race_type; + volatile int *mytrigger; + volatile int *othertrigger; + int status; +}; + +static void *thread_racer(void *info) +{ + struct racer_info *ri = info; + int rc; + + rc = one_racer(ri->p, ri->cpu, ri->mytrigger, ri->othertrigger); + return ri; +} +static void run_race(void *syncarea, int race_type) { volatile int *trigger1, *trigger2; int fd; void *p; int status1, status2; - pid_t ret; + int ret; memset(syncarea, 0, sizeof(*trigger1) + sizeof(*trigger2)); trigger1 = syncarea; @@ -104,47 +131,90 @@ static void run_race(void *syncarea) FAIL("hugetlbfs_unlinked_fd()"); verbose_printf("Mapping final page.. "); - p = mmap(NULL, hpage_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); + p = mmap(NULL, hpage_size, PROT_READ|PROT_WRITE, race_type, fd, 0); if (p == MAP_FAILED) FAIL("mmap(): %s", strerror(errno)); verbose_printf("%p\n", p); - child1 = fork(); - if (child1 < 0) - FAIL("fork(): %s", strerror(errno)); - if (child1 == 0) - one_racer(p, 0, trigger1, trigger2); - - child2 = fork(); - if (child2 < 0) - FAIL("fork(): %s", strerror(errno)); - if (child2 == 0) - one_racer(p, 1, trigger2, trigger1); - - /* wait() calls */ - ret = waitpid(child1, &status1, 0); - if (ret < 0) - FAIL("waitpid() child 1: %s", strerror(errno)); - verbose_printf("Child 1 status: %x\n", status1); - - - ret = waitpid(child2, &status2, 0); - if (ret < 0) - FAIL("waitpid() child 2: %s", strerror(errno)); - verbose_printf("Child 2 status: %x\n", status2); - - if (WIFSIGNALED(status1)) - FAIL("Child 1 killed by signal %s", - strsignal(WTERMSIG(status1))); - if (WIFSIGNALED(status2)) + if (race_type == MAP_SHARED) { + child1 = fork(); + if (child1 < 0) + FAIL("fork(): %s", strerror(errno)); + if (child1 == 0) + proc_racer(p, 0, trigger1, trigger2); + + child2 = fork(); + if (child2 < 0) + FAIL("fork(): %s", strerror(errno)); + if (child2 == 0) + proc_racer(p, 1, trigger2, trigger1); + + /* wait() calls */ + ret = waitpid(child1, &status1, 0); + if (ret < 0) + FAIL("waitpid() child 1: %s", strerror(errno)); + verbose_printf("Child 1 status: %x\n", status1); + + + ret = waitpid(child2, &status2, 0); + if (ret < 0) + FAIL("waitpid() child 2: %s", strerror(errno)); + verbose_printf("Child 2 status: %x\n", status2); + + if (WIFSIGNALED(status1)) + FAIL("Child 1 killed by signal %s", + strsignal(WTERMSIG(status1))); + if (WIFSIGNALED(status2)) FAIL("Child 2 killed by signal %s", strsignal(WTERMSIG(status2))); - if (WEXITSTATUS(status1) != 0) - FAIL("Child 1 terminated with code %d", WEXITSTATUS(status1)); + status1 = WEXITSTATUS(status1); + status2 = WEXITSTATUS(status2); + } else { + struct racer_info ri1 = { + .p = p, + .cpu = 0, + .mytrigger = trigger1, + .othertrigger = trigger2, + }; + struct racer_info ri2 = { + .p = p, + .cpu = 1, + .mytrigger = trigger2, + .othertrigger = trigger1, + }; + void *tret1, *tret2; + + ret = pthread_create(&thread1, NULL, thread_racer, &ri1); + if (ret != 0) + FAIL("pthread_create() 1: %s\n", strerror(errno)); + + ret = pthread_create(&thread2, NULL, thread_racer, &ri2); + if (ret != 0) + FAIL("pthread_create() 2: %s\n", strerror(errno)); + + ret = pthread_join(thread1, &tret1); + if (ret != 0) + FAIL("pthread_join() 1: %s\n", strerror(errno)); + if (tret1 != &ri1) + FAIL("Thread 1 returned %p not %p, killed?\n", + tret1, &ri1); + ret = pthread_join(thread2, &tret2); + if (ret != 0) + FAIL("pthread_join() 2: %s\n", strerror(errno)); + if (tret2 != &ri2) + FAIL("Thread 2 returned %p not %p, killed?\n", + tret2, &ri2); + + status1 = ri1.status; + status2 = ri2.status; + } - if (WEXITSTATUS(status2) != 0) - FAIL("Child 2 terminated with code %d", WEXITSTATUS(status2)); + if (status1 != 0) + FAIL("Racer 1 terminated with code %d", status1); + + if (status2 != 0) + FAIL("Racer 2 terminated with code %d", status2); } int main(int argc, char *argv[]) @@ -153,14 +223,25 @@ int main(int argc, char *argv[]) int fd; void *p, *q; unsigned long i; + int race_type; test_init(argc, argv); - if (argc != 2) - CONFIG("Usage: alloc-instantiate-race <# total available hugepages>"); + if (argc != 3) + CONFIG("Usage: alloc-instantiate-race" + "<# total available hugepages> <private|shard>"); totpages = atoi(argv[1]); + if (strcmp(argv[2], "shared") == 0) { + race_type = MAP_SHARED; + } else if (strcmp(argv[2], "private") == 0) { + race_type = MAP_PRIVATE; + } else { + CONFIG("Usage: alloc-instantiate-race" + "<# total available hugepages> <private|shard>"); + } + hpage_size = gethugepagesize(); if (hpage_size < 0) CONFIG("No hugepage kernel support"); @@ -189,7 +270,7 @@ int main(int argc, char *argv[]) memset(p + (i * hpage_size), 0, sizeof(int)); verbose_printf("done\n"); - run_race(q); + run_race(q, race_type); PASS(); } Index: libhugetlbfs/tests/run_tests.sh =================================================================== --- libhugetlbfs.orig/tests/run_tests.sh 2006-10-27 10:08:23.000000000 +1000 +++ libhugetlbfs/tests/run_tests.sh 2006-10-27 10:45:06.000000000 +1000 @@ -147,7 +147,8 @@ functional_tests () { # killall -HUP hugetlbd # to make the sharing daemon give up the files run_test chunk-overcommit `free_hpages` - run_test alloc-instantiate-race `free_hpages` + run_test alloc-instantiate-race `free_hpages` shared + run_test alloc-instantiate-race `free_hpages` private run_test truncate_reserve_wraparound run_test truncate_sigbus_versus_oom `free_hpages` } -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-27 1:47 ` 'David Gibson' @ 2006-10-30 20:55 ` Adam Litke 0 siblings, 0 replies; 19+ messages in thread From: Adam Litke @ 2006-10-30 20:55 UTC (permalink / raw) To: 'David Gibson' Cc: Andrew Morton, Chen, Kenneth W, 'Christoph Lameter', Hugh Dickins, bill.irwin, linux-mm On Fri, 2006-10-27 at 11:47 +1000, 'David Gibson' wrote: > libhugetlbfs: Testcase for MAP_PRIVATE OOM-liable race condition > > The spurious OOM condition which can be caused by race conditions in > the hugetlb fault handler can be triggered with both SHARED mappings > (separate processes racing on the same address_space) and with PRIVATE > mappings (different threads racing on the same vma). > > At present the alloc-instantiate-race testcase only tests the SHARED > mapping case. Since at various times kernel fixes have been proposed > which address only one or the other of the cases, extend the testcase > to check the MAP_PRIVATE in addition to the MAP_SHARED case. > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Adam Litke <agl@us.ibm.com> Applied. -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] reduce hugetlb_instantiation_mutex usage 2006-10-26 22:17 [RFC] reduce hugetlb_instantiation_mutex usage Chen, Kenneth W 2006-10-26 22:44 ` Andrew Morton @ 2006-10-26 23:47 ` 'David Gibson' 1 sibling, 0 replies; 19+ messages in thread From: 'David Gibson' @ 2006-10-26 23:47 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Christoph Lameter', Hugh Dickins, bill.irwin, Andrew Morton, Adam Litke, linux-mm On Thu, Oct 26, 2006 at 03:17:20PM -0700, Chen, Kenneth W wrote: > First rev of patch to allow hugetlb page fault to scale. > > hugetlb_instantiation_mutex was introduced to prevent spurious allocation > failure in a corner case: two threads race to instantiate same page with > only one free page left in the global pool. However, this global > serialization hurts fault performance badly as noted by Christoph Lameter. > This patch attempt to cut back the use of mutex only when free page resource > is limited, thus allow fault to scale in most common cases. >From my experience of spending most of the last two weeks going "We can just do <this>...hack, hack.., no, that has a race too" this is much harder to get right than you'd think. For example with your patch, suppose CPU0 and CPU1 are both attempting to instantiate the same page in a shared mapping, CPU2 is attempting to instantiate a page in an unrelated mapping. CPU0 CPU1 CPU2 token free_hpages 0 2 atomic_inc 1 2 (use_mutex=0) atomic_inc 2 2 (use_mutex=1) atomic_inc 3 2 (use_mutex=1) mutex_lock <complete fault> mutex_unlock atomic_dec 2 1 mutex_lock 2 1 alloc_huge_page 2 0 alloc_huge_page -> OOM add_to_page_cache So we still have the spurious OOM. There may be other race scenarios, that's just the first I came up with. Oh, also your patch accesses free_huge_pages bare, whereas its usually protected by hugetlb_lock. As a read-only access that's *probably* ok, but any lock-free access of variables which are generally supposed to be lock protected makes me nervious. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2006-11-02 3:06 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2006-10-26 22:17 [RFC] reduce hugetlb_instantiation_mutex usage Chen, Kenneth W 2006-10-26 22:44 ` Andrew Morton 2006-10-26 23:31 ` 'David Gibson' 2006-10-27 0:04 ` Andrew Morton 2006-10-27 3:11 ` 'David Gibson' 2006-10-27 3:35 ` Andrew Morton 2006-10-27 4:06 ` 'David Gibson' 2006-10-31 2:54 ` Chen, Kenneth W 2006-10-31 3:17 ` 'David Gibson' 2006-10-31 5:15 ` Chen, Kenneth W 2006-10-31 11:05 ` 'David Gibson' 2006-10-31 12:48 ` Hugh Dickins 2006-11-01 6:18 ` Nick Piggin 2006-11-01 10:17 ` Chen, Kenneth W 2006-11-02 3:06 ` Nick Piggin 2006-11-02 2:29 ` 'David Gibson' 2006-10-27 1:47 ` 'David Gibson' 2006-10-30 20:55 ` Adam Litke 2006-10-26 23:47 ` 'David Gibson'
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox