Subject: Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
To: "Liam R. Howlett", Muchun Song, Thorvald Natvig, Linux-MM
References: <42788ABD-99AE-4AEF-B543-C0FABAFA0464@linux.dev>
 <4780b0e3-42e1-9099-d010-5a1793b6cbd3@huawei.com>
 <531195fb-b642-2bc1-3a07-4944ee5d8664@huawei.com>
 <20240129161735.6gmjsswx62o4pbja@revolver>
 <76f33f3b-f61f-efe7-f63f-1b2e0efaf71d@huawei.com>
 <20240130040814.hd3edkda5rbsxru7@revolver>
From: Miaohe Lin
Date: Wed, 31 Jan 2024 14:51:04 +0800
In-Reply-To: <20240130040814.hd3edkda5rbsxru7@revolver>

On 2024/1/30 12:08, Liam R. Howlett wrote:
> * Miaohe Lin [240129 21:14]:
>> On 2024/1/30 0:17, Liam R. Howlett wrote:
>>> * Miaohe Lin [240129 07:56]:
>>>> On 2024/1/27 18:13, Miaohe Lin wrote:
>>>>> On 2024/1/26 15:50, Muchun Song wrote:
>>>>>>
>>>>>>
>>>>>>> On Jan 26, 2024, at 04:28, Thorvald Natvig wrote:
>>>>>>>
>>>>>>> We've found what appears to be a lock issue that results in a blocked
>>>>>>> process somewhere in hugetlbfs for shared maps; seemingly from an
>>>>>>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
>>>>>>>
>>>>>>> Based on some added pr_warn, we believe the following is happening:
>>>>>>> When hugetlb_vmdelete_list is entered from the child process,
>>>>>>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
>>>>>>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
>>>>>>> are true.
>>>>>>>
>>>>>>> While hugetlb_vmdelete_list is executing, the parent process does
>>>>>>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
>>>>>>> lock for the same vma.
>>>>>>>
>>>>>>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
>>>>>>> the function, vma->vm_private_data is now populated, and hence
>>>>>>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
>>>>>>> not hold.
>>>>>>
>>>>>> Thanks for your report. ->vm_private_data was introduced since the
>>>>>> series [1]. So I suspect it was caused by this. But I haven't reviewed
>>>>>> that at that time (actually, it is a little complex in pmd sharing
>>>>>> case). I saw Miaohe had reviewed many of those.
>>>>>>
>>>>>> CC Miaohe, maybe he has some ideas on this.
>>>>>>
>>>>>> [1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>>
>>>>>>> dmesg:
>>>>>>> WARNING: bad unlock balance detected!
>>>>>>> 6.8.0-rc1+ #24 Not tainted
>>>>>>> -------------------------------------
>>>>>>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
>>>>>>> [] hugetlb_vma_unlock_write+0x48/0x60
>>>>>>> but there are no more locks to release!
>>>>>
>>>>> Thanks for your report. It seems there's a race:
>>>>>
>>>>> CPU 1                                                                 CPU 2
>>>>> fork                                                                  hugetlbfs_fallocate
>>>>>  dup_mmap                                                              hugetlbfs_punch_hole
>>>>>   i_mmap_lock_write(mapping);
>>>>>   vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>>>>>   i_mmap_unlock_write(mapping);
>>>>>   hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!       i_mmap_lock_write(mapping);
>>>>>                                                                         hugetlb_vmdelete_list
>>>>>                                                                          vma_interval_tree_foreach
>>>>>                                                                           hugetlb_vma_trylock_write -- Vma_lock is cleared.
>>>>>   tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
>>>>>                                                                           hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
>>>>>                                                                         i_mmap_unlock_write(mapping);
>>>>>
>>>>> hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside i_mmap_rwsem lock. So there will be another bugs behind it.
>>>>> But I'm not really sure. I will take a more closed look at next week.
>>>>
>>>>
>>>> This can be fixed by deferring vma_interval_tree_insert_after() until vma is fully initialized.
>>>> But I'm not sure whether there're side effects with this patch.
>>>>
>>>> linux-UJMmTI:/home/linmiaohe/mm # git diff
>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>> index 47ff3b35352e..2ef2711452e0 100644
>>>> --- a/kernel/fork.c
>>>> +++ b/kernel/fork.c
>>>> @@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>                  } else if (anon_vma_fork(tmp, mpnt))
>>>>                          goto fail_nomem_anon_vma_fork;
>>>>                  vm_flags_clear(tmp, VM_LOCKED_MASK);
>>>> -                file = tmp->vm_file;
>>>> -                if (file) {
>>>> -                        struct address_space *mapping = file->f_mapping;
>>>> -
>>>> -                        get_file(file);
>>>> -                        i_mmap_lock_write(mapping);
>>>> -                        if (vma_is_shared_maywrite(tmp))
>>>> -                                mapping_allow_writable(mapping);
>>>> -                        flush_dcache_mmap_lock(mapping);
>>>> -                        /* insert tmp into the share list, just after mpnt */
>>>> -                        vma_interval_tree_insert_after(tmp, mpnt,
>>>> -                                        &mapping->i_mmap);
>>>> -                        flush_dcache_mmap_unlock(mapping);
>>>> -                        i_mmap_unlock_write(mapping);
>>>> -                }
>>>>
>>>>                  /*
>>>>                   * Copy/update hugetlb private vma information.
>>>> @@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>                  if (tmp->vm_ops && tmp->vm_ops->open)
>>>>                          tmp->vm_ops->open(tmp);
>>>>
>>>> +                file = tmp->vm_file;
>>>> +                if (file) {
>>>> +                        struct address_space *mapping = file->f_mapping;
>>>> +
>>>> +                        get_file(file);
>>>> +                        i_mmap_lock_write(mapping);
>>>> +                        if (vma_is_shared_maywrite(tmp))
>>>> +                                mapping_allow_writable(mapping);
>>>> +                        flush_dcache_mmap_lock(mapping);
>>>> +                        /* insert tmp into the share list, just after mpnt. */
>>>> +                        vma_interval_tree_insert_after(tmp, mpnt,
>>>> +                                        &mapping->i_mmap);
>>>> +                        flush_dcache_mmap_unlock(mapping);
>>>> +                        i_mmap_unlock_write(mapping);
>>>> +                }
>>>> +
>>>>                  if (retval) {
>>>>                          mpnt = vma_next(&vmi);
>>>>                          goto loop_out;
>>>>
>>>>
>>>
>>> How is this possible?  I thought, as specified in mm/rmap.c, that the
>>> hugetlbfs path would be holding the mmap lock (which is also held in the
>>> fork path)?
>>
>> The fork path holds the mmap lock from parent A and other childs(except first child B) while hugetlbfs path
>> holds the mmap lock from first child B. So the mmap lock won't help here because it comes from different mm.
>> Or am I miss something?
>
> You are correct.  It is also in mm/rmap.c:
>  * hugetlbfs PageHuge() take locks in this order:
>  *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
>  *     vma_lock (hugetlb specific lock for pmd_sharing)
>  *       mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)
>  *         page->flags PG_locked (lock_page)
>
> Does it make sense for hugetlb_dup_vma_private() to assert
> mapping->i_mmap_rwsem is locked?  When is that necessary?

I'm afraid not. AFAICS, vma_lock (vma->vm_private_data) is only modified when the vma is
created or destroyed, and vma_lock is not supposed to be used at that time.

>
> I also think it might be safer to move the hugetlb_dup_vma_private()
> call up instead of the insert into the interval tree down?
> See the following comment from mmap.c:
>
>                         /*
>                          * Put into interval tree now, so instantiated pages
>                          * are visible to arm/parisc __flush_dcache_page
>                          * throughout; but we cannot insert into address
>                          * space until vma start or end is updated.
>                          */
>
> So there may be arch dependent reasons for this order.

Yes, it should be safer to move the hugetlb_dup_vma_private() call up. But we would also need to
move the tmp->vm_ops->open(tmp) call up, or the race still exists:

CPU 1                                                                 CPU 2
fork                                                                  hugetlbfs_fallocate
 dup_mmap                                                              hugetlbfs_punch_hole
  hugetlb_dup_vma_private -- Clear vma_lock. <-- it is moved up.
  i_mmap_lock_write(mapping);
  vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
  i_mmap_unlock_write(mapping);
                                                                        i_mmap_lock_write(mapping);
                                                                        hugetlb_vmdelete_list
                                                                         vma_interval_tree_foreach
                                                                          hugetlb_vma_trylock_write -- Vma_lock is already cleared.
  tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
                                                                          hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
                                                                        i_mmap_unlock_write(mapping);

My patch is not meant to be a complete solution; it was only used to confirm the race and fix it
quickly. It would be great if you or someone else could provide a better and safer solution.

Thanks.

>
> Thanks,
> Liam
>
> .
>
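
For illustration only, here is a minimal userspace sketch of the imbalance discussed in this thread. It is not kernel code. Every name in it (model_lock, model_vma, model_trylock_write, model_unlock_write, replay_interleaving) is a hypothetical stand-in invented for the example. It replays the interleaving from the diagram in a single thread, assuming that lock and unlock are both conditional on whether a lock object is attached to the vma at the time of the call, as the report describes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the per-vma lock; not a kernel type. */
struct model_lock {
	int held;        /* 1 while write-locked */
	int bad_unlocks; /* unlock attempts made while the lock was not held */
};

/* Hypothetical stand-in for a vma; private_lock plays the role of
 * vma->vm_private_data in the report. */
struct model_vma {
	struct model_lock *private_lock;
};

/* Take the lock only if one is attached, mirroring the conditional
 * locking described in the report. Returns 1 on success. */
static int model_trylock_write(struct model_vma *vma)
{
	assert(vma != NULL);
	if (!vma->private_lock)
		return 1; /* nothing attached, so nothing is locked */
	if (vma->private_lock->held)
		return 0;
	vma->private_lock->held = 1;
	return 1;
}

/* Release the lock if one is attached now, regardless of whether the
 * matching trylock actually took it. */
static void model_unlock_write(struct model_vma *vma)
{
	assert(vma != NULL);
	if (!vma->private_lock)
		return;
	if (!vma->private_lock->held) {
		vma->private_lock->bad_unlocks++; /* the "bad unlock balance" case */
		return;
	}
	vma->private_lock->held = 0;
}

/* Replay the interleaving. If open_before_visible is non-zero, the lock is
 * attached before the other path can see the vma; otherwise it is attached
 * between trylock and unlock. Returns the number of unbalanced unlocks. */
int replay_interleaving(int open_before_visible)
{
	struct model_lock lock = { 0, 0 };
	struct model_vma child = { NULL }; /* lock cleared in the child vma */

	if (open_before_visible)
		child.private_lock = &lock; /* "open" ran before insertion */

	model_trylock_write(&child);        /* "CPU 2" finds the child vma */

	if (!open_before_visible)
		child.private_lock = &lock; /* "CPU 1" runs "open" in the middle */

	model_unlock_write(&child);         /* "CPU 2" leaves the loop */
	return lock.bad_unlocks;
}
```

In this model, replay_interleaving(0) reports one unbalanced unlock and replay_interleaving(1) reports none. That matches the reasoning above that the vma should not become visible through the interval tree until both the clearing and the allocation of its lock are done. The model says nothing about the arch-dependent ordering concerns raised for the real code.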