Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Jane Chu <jane.chu@oracle.com>
To: Miaohe Lin <linmiaohe@huawei.com>,
	"Liam R. Howlett" <Liam.Howlett@Oracle.com>,
	Muchun Song <muchun.song@linux.dev>,
	Thorvald Natvig <thorvald@google.com>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE
Date: Fri, 2 Feb 2024 13:02:00 -0800	[thread overview]
Message-ID: <70f13c9f-4364-4154-9b5c-69d6c5e9d65a@oracle.com> (raw)
In-Reply-To: <d4ee888e-9652-fed8-e72f-9bcfaac45c11@huawei.com>

[-- Attachment #1: Type: text/plain, Size: 9095 bytes --]

On 1/30/2024 10:51 PM, Miaohe Lin wrote:

> On 2024/1/30 12:08, Liam R. Howlett wrote:
>> * Miaohe Lin<linmiaohe@huawei.com>  [240129 21:14]:
>>> On 2024/1/30 0:17, Liam R. Howlett wrote:
>>>> * Miaohe Lin<linmiaohe@huawei.com>  [240129 07:56]:
>>>>> On 2024/1/27 18:13, Miaohe Lin wrote:
>>>>>> On 2024/1/26 15:50, Muchun Song wrote:
>>>>>>>
>>>>>>>> On Jan 26, 2024, at 04:28, Thorvald Natvig<thorvald@google.com>  wrote:
>>>>>>>>
>>>>>>>> We've found what appears to be a lock issue that results in a blocked
>>>>>>>> process somewhere in hugetlbfs for shared maps; seemingly from an
>>>>>>>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
>>>>>>>>
>>>>>>>> Based on some added pr_warn, we believe the following is happening:
>>>>>>>> When hugetlb_vmdelete_list is entered from the child process,
>>>>>>>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
>>>>>>>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
>>>>>>>> are true.
>>>>>>>>
>>>>>>>> While hugetlb_vmdelete_list is executing, the parent process does
>>>>>>>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
>>>>>>>> lock for the same vma.
>>>>>>>>
>>>>>>>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
>>>>>>>> the function, vma->vm_private_data is now populated, and hence
>>>>>>>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
>>>>>>>> not hold.
>>>>>>> Thanks for your report. ->vm_private_data was introduced since the
>>>>>>> series [1]. So I suspect it was caused by this. But I haven't reviewed
>>>>>>> that at that time (actually, it is a little complex in pmd sharing
>>>>>>> case). I saw Miaohe had reviewed many of those.
>>>>>>>
>>>>>>> CC Miaohe, maybe he has some ideas on this.
>>>>>>>
>>>>>>> [1]https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@oracle.com/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>> dmesg:
>>>>>>>> WARNING: bad unlock balance detected!
>>>>>>>> 6.8.0-rc1+ #24 Not tainted
>>>>>>>> -------------------------------------
>>>>>>>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
>>>>>>>> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
>>>>>>>> but there are no more locks to release!
>>>>>> Thanks for your report. It seems there's a race:
>>>>>>
>>>>>>   CPU 1											CPU 2
>>>>>>   fork											hugetlbfs_fallocate
>>>>>>    dup_mmap										 hugetlbfs_punch_hole
>>>>>>     i_mmap_lock_write(mapping);								
>>>>>>     vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>>>>>>     i_mmap_unlock_write(mapping);
>>>>>>     hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!			 i_mmap_lock_write(mapping);
>>>>>>     											 hugetlb_vmdelete_list
>>>>>> 											  vma_interval_tree_foreach
>>>>>> 											   hugetlb_vma_trylock_write -- Vma_lock is cleared.
>>>>>>     tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
>>>>>> 											   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
>>>>>> 											 i_mmap_unlock_write(mapping);
>>>>>>
>>>>>> hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside i_mmap_rwsem lock. So there will be another bugs behind it.
>>>>>> But I'm not really sure. I will take a more closed look at next week.
>>>>>
>>>>> This can be fixed by deferring vma_interval_tree_insert_after() until vma is fully initialized.
>>>>> But I'm not sure whether there're side effects with this patch.
>>>>>
>>>>> linux-UJMmTI:/home/linmiaohe/mm # git diff
>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>> index 47ff3b35352e..2ef2711452e0 100644
>>>>> --- a/kernel/fork.c
>>>>> +++ b/kernel/fork.c
>>>>> @@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>>                  } else if (anon_vma_fork(tmp, mpnt))
>>>>>                          goto fail_nomem_anon_vma_fork;
>>>>>                  vm_flags_clear(tmp, VM_LOCKED_MASK);
>>>>> -               file = tmp->vm_file;
>>>>> -               if (file) {
>>>>> -                       struct address_space *mapping = file->f_mapping;
>>>>> -
>>>>> -                       get_file(file);
>>>>> -                       i_mmap_lock_write(mapping);
>>>>> -                       if (vma_is_shared_maywrite(tmp))
>>>>> -                               mapping_allow_writable(mapping);
>>>>> -                       flush_dcache_mmap_lock(mapping);
>>>>> -                       /* insert tmp into the share list, just after mpnt */
>>>>> -                       vma_interval_tree_insert_after(tmp, mpnt,
>>>>> -                                       &mapping->i_mmap);
>>>>> -                       flush_dcache_mmap_unlock(mapping);
>>>>> -                       i_mmap_unlock_write(mapping);
>>>>> -               }
>>>>>
>>>>>                  /*
>>>>>                   * Copy/update hugetlb private vma information.
>>>>> @@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>>                  if (tmp->vm_ops && tmp->vm_ops->open)
>>>>>                          tmp->vm_ops->open(tmp);
>>>>>
>>>>> +               file = tmp->vm_file;
>>>>> +               if (file) {
>>>>> +                       struct address_space *mapping = file->f_mapping;
>>>>> +
>>>>> +                       get_file(file);
>>>>> +                       i_mmap_lock_write(mapping);
>>>>> +                       if (vma_is_shared_maywrite(tmp))
>>>>> +                               mapping_allow_writable(mapping);
>>>>> +                       flush_dcache_mmap_lock(mapping);
>>>>> +                       /* insert tmp into the share list, just after mpnt. */
>>>>> +                       vma_interval_tree_insert_after(tmp, mpnt,
>>>>> +                                       &mapping->i_mmap);
>>>>> +                       flush_dcache_mmap_unlock(mapping);
>>>>> +                       i_mmap_unlock_write(mapping);
>>>>> +               }
>>>>> +
>>>>>                  if (retval) {
>>>>>                          mpnt = vma_next(&vmi);
>>>>>                          goto loop_out;
>>>>>
>>>>>
>>>> How is this possible?  I thought, as specified in mm/rmap.c, that the
>>>> hugetlbfs path would be holding the mmap lock (which is also held in the
>>>> fork path)?
>>> The fork path holds the mmap lock from parent A and other childs(except first child B) while hugetlbfs path
>>> holds the mmap lock from first child B. So the mmap lock won't help here because it comes from different mm.
>>> Or am I miss something?
>> You are correct.  It is also in mm/rmap.c:
>>   * hugetlbfs PageHuge() take locks in this order:
>>   *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
>>   *     vma_lock (hugetlb specific lock for pmd_sharing)
>>   *       mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)
>>   *         page->flags PG_locked (lock_page)
>>
>> Does it make sense for hugetlb_dup_vma_private()  to assert
>> mapping->i_mmap_rwsem is locked?  When is that necessary?
> I'm afraid not. AFAICS, vma_lock(vma->vm_private_data) is only modified at the time of
> vma creating or destroy. Vma_lock is not supposed to be used at that time.
>
>> I also think it might be safer to move the hugetlb_dup_vma_private()
>> call up instead of the insert into the interval tree down?
>> See the following comment from mmap.c:
>>
>>                          /*
>>                           * Put into interval tree now, so instantiated pages
>>                           * are visible to arm/parisc __flush_dcache_page
>>                           * throughout; but we cannot insert into address
>>                           * space until vma start or end is updated.
>>                           */
>>
>> So there may be arch dependent reasons for this order.
> Yes, it should be safer to move hugetlb_dup_vma_private() call up. But we also need to move tmp->vm_ops->open(tmp) call up.
> Or the race still exists:
>
>   CPU 1											CPU 2
>   fork											hugetlbfs_fallocate
>    dup_mmap										 hugetlbfs_punch_hole
>     hugetlb_dup_vma_private -- Clear vma_lock.	<-- it is moved up.
>     i_mmap_lock_write(mapping);								
>     vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>     i_mmap_unlock_write(mapping);
>     		 									 i_mmap_lock_write(mapping);
>     											 hugetlb_vmdelete_list
> 											  vma_interval_tree_foreach
> 											   hugetlb_vma_trylock_write -- Vma_lock is already cleared.
>     tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
> 											   hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
> 											 i_mmap_unlock_write(mapping);
>
>
> My patch should not be a complete solution. It's used to prove and fix the race quickly. It's very great if you or
> someone else can provide a better and safer solution.

But,  your patch has already moved the vma_interval_tree_insert_after() 
block after the

tmp->vm_ops->open(tmp) call, right?  Hence, there should be no more race with truncation?

thanks,
-jane

>
> Thanks.
>
>> Thanks,
>> Liam
>>
>> .
>>
>

[-- Attachment #2: Type: text/html, Size: 11188 bytes --]

next prev parent reply	other threads:[~2024-02-02 21:02 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-25 20:28 Thorvald Natvig
2024-01-26  7:50 ` Muchun Song
2024-01-27 10:13   ` Miaohe Lin
2024-01-29 12:56     ` Miaohe Lin
2024-01-29 16:17       ` Liam R. Howlett
2024-01-30  2:14         ` Miaohe Lin
2024-01-30  4:08           ` Liam R. Howlett
2024-01-31  6:51             ` Miaohe Lin
2024-02-02 21:02               ` Jane Chu [this message]
2024-02-04  1:54                 ` Miaohe Lin
2024-03-29 15:54                   ` Thorvald Natvig
2024-04-02 11:24                     ` Miaohe Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=70f13c9f-4364-4154-9b5c-69d6c5e9d65a@oracle.com \
    --to=jane.chu@oracle.com \
    --cc=Liam.Howlett@Oracle.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-mm@kvack.org \
    --cc=muchun.song@linux.dev \
    --cc=thorvald@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox