Subject: Re: [PATCH] prevent deadlock in __unmap_hugepage_range() when alloc_huge_page() fails.
From: Adam Litke
In-Reply-To: <20091117160922.GB29804@csn.ul.ie>
References: <1257872456.3227.2.camel@dhcp-100-19-198.bos.redhat.com>
	 <20091116121253.d86920a0.akpm@linux-foundation.org>
	 <20091117160922.GB29804@csn.ul.ie>
Date: Tue, 17 Nov 2009 10:43:48 -0600
Message-ID: <1258476228.2820.13.camel@aglitke>
To: Mel Gorman
Cc: Andrew Morton, Larry Woodman, "linux-kernel@vger.kernel.org",
	"linux-mm@kvack.org", Lee Schermerhorn, Hugh Dickins, David Gibson

This looks good to me.  The venerable hugetlb_instantiation_mutex already
prevents allocation/instantiation races in hugetlb_cow(), so the extra
spinlock dropping should have no adverse effects with respect to that.

Acked-by: Adam Litke
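For anyone reading the thread cold, the deadlock under discussion is simply
a recursive acquisition of the non-recursive mm->page_table_lock on the COW
error path.  A minimal userspace analogue of that shape (pthread-based and
purely illustrative -- this is not the hugetlb code, and the function names
below are made up) looks like this:

/* analogue.c - build with: gcc -pthread -o analogue analogue.c */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t page_table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for unmap_hugepage_range(): written to take the lock itself. */
static void unmap_range(void)
{
        /* A plain pthread_mutex_lock() here would hang forever because the
         * caller already holds the lock; trylock is used so the example
         * terminates and merely reports the recursion. */
        if (pthread_mutex_trylock(&page_table_lock) != 0) {
                printf("would deadlock: lock already held on this call path\n");
                return;
        }
        /* ... unmap work would go here ... */
        pthread_mutex_unlock(&page_table_lock);
}

/* Stands in for hugetlb_fault()/hugetlb_cow(): holds the lock across an
 * allocation that can fail and then falls into the unmap path. */
static void fault_handler(int alloc_failed)
{
        pthread_mutex_lock(&page_table_lock);
        if (alloc_failed)
                unmap_range();          /* tries to take the same lock again */
        pthread_mutex_unlock(&page_table_lock);
}

int main(void)
{
        fault_handler(1);
        return 0;
}

In the kernel, that same recursion is what PROVE_LOCKING flags in the trace
quoted below.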
On Tue, 2009-11-17 at 16:09 +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2009 at 12:12:53PM -0800, Andrew Morton wrote:
> > On Tue, 10 Nov 2009 12:00:56 -0500
> > Larry Woodman wrote:
> > >
> > > hugetlb_fault() takes the mm->page_table_lock spinlock then calls
> > > hugetlb_cow().  If the alloc_huge_page() in hugetlb_cow() fails due to
> > > an insufficient huge page pool it calls unmap_ref_private() with the
> > > mm->page_table_lock held.  unmap_ref_private() then calls
> > > unmap_hugepage_range() which tries to acquire the mm->page_table_lock.
> > >
> > > [] print_circular_bug_tail+0x80/0x9f
> > > [] ? check_noncircular+0xb0/0xe8
> > > [] __lock_acquire+0x956/0xc0e
> > > [] lock_acquire+0xee/0x12e
> > > [] ? unmap_hugepage_range+0x3e/0x84
> > > [] ? unmap_hugepage_range+0x3e/0x84
> > > [] _spin_lock+0x40/0x89
> > > [] ? unmap_hugepage_range+0x3e/0x84
> > > [] ? alloc_huge_page+0x218/0x318
> > > [] unmap_hugepage_range+0x3e/0x84
> > > [] hugetlb_cow+0x1e2/0x3f4
> > > [] ? hugetlb_fault+0x453/0x4f6
> > > [] hugetlb_fault+0x480/0x4f6
> > > [] follow_hugetlb_page+0x116/0x2d9
> > > [] ? _spin_unlock_irq+0x3a/0x5c
> > > [] __get_user_pages+0x2a3/0x427
> > > [] get_user_pages+0x3e/0x54
> > > [] get_user_pages_fast+0x170/0x1b5
> > > [] dio_get_page+0x64/0x14a
> > > [] __blockdev_direct_IO+0x4b7/0xb31
> > > [] blkdev_direct_IO+0x58/0x6e
> > > [] ? blkdev_get_blocks+0x0/0xb8
> > > [] generic_file_aio_read+0xdd/0x528
> > > [] ? avc_has_perm+0x66/0x8c
> > > [] do_sync_read+0xf5/0x146
> > > [] ? autoremove_wake_function+0x0/0x5a
> > > [] ? security_file_permission+0x24/0x3a
> > > [] vfs_read+0xb5/0x126
> > > [] ? fget_light+0x5e/0xf8
> > > [] sys_read+0x54/0x8c
> > > [] system_call_fastpath+0x16/0x1b
> >
> > Confused.
> >
> > That code is very old, alloc_huge_page() failures are surely common and
> > afaict the bug will lead to an instantly dead box.  So why haven't
> > people hit this bug before now?
> >
> 
> Failures like that can happen when the workload is calling fork() with
> MAP_PRIVATE and the child is writing with a small hugepage pool.  I reran
> the tests used at the time the code was written and they don't trigger
> warnings or problems in the normal case, other than the note that the
> child got killed.  It required PROVE_LOCKING to be set, which is not set
> on the default configs I normally use when I don't suspect locking
> problems.
> 
> I can confirm that with PROVE_LOCKING set the warnings do trigger, but
> the machine doesn't lock up and I see the following in dmesg.  Bit of a
> surprise really, you'd think a double taking of a spinlock would result
> in damage.
> 
> [ 111.031795] PID 2876 killed due to inadequate hugepage pool
> 
> Applying the patch does fix that problem but there is a very similar
> problem remaining in the same path.  More on this later.
> 
> > > This can be fixed by dropping the mm->page_table_lock around the call
> > > to unmap_ref_private() if alloc_huge_page() fails; it's dropped right
> > > below in the normal path anyway:
> > >
> >
> > Why is that safe?  What is the page_table_lock protecting in here?
> >
> 
> The lock is there from 2005 and it was to protect against concurrent
> PTE updates.  However, it is probably unnecessary protection as this
> whole path should be protected by the hugetlb_instantiation_mutex.  There
> was locking that was left behind that may be unnecessary after that
> mutex was introduced, but it is left in place for the day someone decides
> to tackle that mutex.
> 
> I don't think this patch solves everything with the page_table_lock in
> that path, because it's also possible from the COW path to enter the
> buddy allocator with the spinlock still held with setups like
> 
> 1. create a mapping of 3 huge pages
> 2. allow the system to dynamically allocate 6 huge pages
> 3. fork() and fault in the child
> 
> With preempt and DEBUG_SPINLOCK_SLEEP set, a different warning can still
> trigger in that case.
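A setup along those lines could be approximated from userspace roughly as
below.  This is only a sketch of the scenario Mel describes, not a tested
reproducer: the 2MB huge page size, the MAP_HUGETLB flag, and the pool
sizing done beforehand (e.g. via /proc/sys/vm/nr_hugepages and
/proc/sys/vm/nr_overcommit_hugepages) are my assumptions, not part of the
original report.

/* repro-sketch.c - MAP_PRIVATE hugepage mapping written by the parent,
 * then written again by the child so hugetlb_cow() must allocate fresh
 * huge pages for the child.  With a small pool that allocation can fail
 * and the child is killed ("killed due to inadequate hugepage pool"). */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000     /* x86 value; missing from older headers */
#endif

#define HPAGE_SIZE      (2UL * 1024 * 1024)     /* assumes 2MB huge pages */
#define NR_PAGES        3

int main(void)
{
        size_t len = NR_PAGES * HPAGE_SIZE;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Parent faults the pages in first so the child's writes go
         * through the copy-on-write path in hugetlb_cow(). */
        memset(p, 1, len);

        if (fork() == 0) {
                /* Child writes: each write needs a freshly allocated
                 * huge page for the private copy. */
                memset(p, 2, len);
                _exit(0);
        }

        wait(NULL);
        return 0;
}

With only three static huge pages and little or no surplus available, the
child's copy-on-write writes should fail to get pages and the child gets
killed with the "inadequate hugepage pool" message; with PROVE_LOCKING or
DEBUG_SPINLOCK_SLEEP enabled, the warnings Mel mentions would fire on the
same path.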
> Here is an alternative patch below which extends Larry's patch to drop
> the spinlock earlier and retake it as required.  Because it involves page
> table updates, I've added Hugh to the cc because he knows all the rules
> and gotchas backwards.
> 
> ==== CUT HERE ====
> hugetlb: prevent deadlock in __unmap_hugepage_range() when alloc_huge_page() fails.
> 
> hugetlb_fault() takes the mm->page_table_lock spinlock then calls
> hugetlb_cow().  If the alloc_huge_page() in hugetlb_cow() fails due to an
> insufficient huge page pool it calls unmap_ref_private() with the
> mm->page_table_lock held.  unmap_ref_private() then calls
> unmap_hugepage_range() which tries to acquire the mm->page_table_lock.
> 
> [] print_circular_bug_tail+0x80/0x9f
> [] ? check_noncircular+0xb0/0xe8
> [] __lock_acquire+0x956/0xc0e
> [] lock_acquire+0xee/0x12e
> [] ? unmap_hugepage_range+0x3e/0x84
> [] ? unmap_hugepage_range+0x3e/0x84
> [] _spin_lock+0x40/0x89
> [] ? unmap_hugepage_range+0x3e/0x84
> [] ? alloc_huge_page+0x218/0x318
> [] unmap_hugepage_range+0x3e/0x84
> [] hugetlb_cow+0x1e2/0x3f4
> [] ? hugetlb_fault+0x453/0x4f6
> [] hugetlb_fault+0x480/0x4f6
> [] follow_hugetlb_page+0x116/0x2d9
> [] ? _spin_unlock_irq+0x3a/0x5c
> [] __get_user_pages+0x2a3/0x427
> [] get_user_pages+0x3e/0x54
> [] get_user_pages_fast+0x170/0x1b5
> [] dio_get_page+0x64/0x14a
> [] __blockdev_direct_IO+0x4b7/0xb31
> [] blkdev_direct_IO+0x58/0x6e
> [] ? blkdev_get_blocks+0x0/0xb8
> [] generic_file_aio_read+0xdd/0x528
> [] ? avc_has_perm+0x66/0x8c
> [] do_sync_read+0xf5/0x146
> [] ? autoremove_wake_function+0x0/0x5a
> [] ? security_file_permission+0x24/0x3a
> [] vfs_read+0xb5/0x126
> [] ? fget_light+0x5e/0xf8
> [] sys_read+0x54/0x8c
> [] system_call_fastpath+0x16/0x1b
> 
> This can be fixed by dropping the mm->page_table_lock around the call to
> unmap_ref_private() if alloc_huge_page() fails; it's dropped right below
> in the normal path anyway.  However, earlier in that function, it's also
> possible to call into the page allocator with the same spinlock held.
> 
> What this patch does is drop the spinlock before the page allocator is
> potentially entered.  The check for page allocation failure can be made
> without the page_table_lock, as can the copy of the huge page.  Even if
> the PTE changed while the spinlock was dropped, the consequence is only
> that a huge page is copied unnecessarily.  This resolves both the double
> taking of the lock and sleeping with the spinlock held.
> 
> [mel@csn.ul.ie: Cover also the case where process can sleep with spinlock]
> Signed-off-by: Larry Woodman
> Signed-off-by: Mel Gorman
> Cc: Adam Litke
> Cc: Andy Whitcroft
> Cc: Lee Schermerhorn
> Cc: Hugh Dickins
> Cc: David Gibson
> ---
>  mm/hugetlb.c |   13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 450493d..2ef66a2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2293,6 +2293,9 @@ retry_avoidcopy:
>  		outside_reserve = 1;
> 
>  	page_cache_get(old_page);
> +
> +	/* Drop page_table_lock as buddy allocator may be called */
> +	spin_unlock(&mm->page_table_lock);
>  	new_page = alloc_huge_page(vma, address, outside_reserve);
> 
>  	if (IS_ERR(new_page)) {
> @@ -2310,19 +2313,25 @@ retry_avoidcopy:
>  			if (unmap_ref_private(mm, vma, old_page, address)) {
>  				BUG_ON(page_count(old_page) != 1);
>  				BUG_ON(huge_pte_none(pte));
> +				spin_lock(&mm->page_table_lock);
>  				goto retry_avoidcopy;
>  			}
>  			WARN_ON_ONCE(1);
>  		}
> 
> +		/* Caller expects lock to be held */
> +		spin_lock(&mm->page_table_lock);
>  		return -PTR_ERR(new_page);
>  	}
> 
> -	spin_unlock(&mm->page_table_lock);
>  	copy_huge_page(new_page, old_page, address, vma);
>  	__SetPageUptodate(new_page);
> -	spin_lock(&mm->page_table_lock);
> 
> +	/*
> +	 * Retake the page_table_lock to check for racing updates
> +	 * before the page tables are altered
> +	 */
> +	spin_lock(&mm->page_table_lock);
>  	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
>  	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
>  		/* Break COW */

-- 
Thanks,
Adam