From: Miaohe Lin <linmiaohe@huawei.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
Muchun Song <songmuchun@bytedance.com>,
David Hildenbrand <david@redhat.com>,
Sven Schnelle <svens@linux.ibm.com>,
Michal Hocko <mhocko@suse.com>, Peter Xu <peterx@redhat.com>,
Naoya Horiguchi <naoya.horiguchi@linux.dev>,
"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
Andrea Arcangeli <aarcange@redhat.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Davidlohr Bueso <dave@stgolabs.net>,
Prakash Sangappa <prakash.sangappa@oracle.com>,
James Houghton <jthoughton@google.com>,
Mina Almasry <almasrymina@google.com>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Axel Rasmussen <axelrasmussen@google.com>,
Ray Fucillo <Ray.Fucillo@intersystems.com>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization
Date: Sat, 8 Oct 2022 10:29:46 +0800
Message-ID: <162f73d7-5bcb-e0f9-bdaf-7a5d4df10fba@huawei.com>
In-Reply-To: <YzeDLCFMN6XZNfoF@monkey>
On 2022/10/1 8:00, Mike Kravetz wrote:
> On 09/29/22 14:08, Miaohe Lin wrote:
>> On 2022/9/15 6:18, Mike Kravetz wrote:
>>> @@ -434,6 +434,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>> struct folio *folio, pgoff_t index)
>>> {
>>> struct rb_root_cached *root = &mapping->i_mmap;
>>> + struct hugetlb_vma_lock *vma_lock;
>>> struct page *page = &folio->page;
>>> struct vm_area_struct *vma;
>>> unsigned long v_start;
>>> @@ -444,7 +445,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>> end = (index + 1) * pages_per_huge_page(h);
>>>
>>> i_mmap_lock_write(mapping);
>>> -
>>> +retry:
>>> + vma_lock = NULL;
>>> vma_interval_tree_foreach(vma, root, start, end - 1) {
>>> v_start = vma_offset_start(vma, start);
>>> v_end = vma_offset_end(vma, end);
>>> @@ -452,11 +454,63 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>> if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
>>> continue;
>>>
>>> + if (!hugetlb_vma_trylock_write(vma)) {
>>> + vma_lock = vma->vm_private_data;
>>> + /*
>>> + * If we can not get vma lock, we need to drop
>>> + * immap_sema and take locks in order. First,
>>> + * take a ref on the vma_lock structure so that
>>> + * we can be guaranteed it will not go away when
>>> + * dropping immap_sema.
>>> + */
>>> + kref_get(&vma_lock->refs);
>>> + break;
>>> + }
>>> +
>>> unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>>> NULL, ZAP_FLAG_DROP_MARKER);
>>> + hugetlb_vma_unlock_write(vma);
>>> }
>>>
>>> i_mmap_unlock_write(mapping);
>>> +
>>> + if (vma_lock) {
>>> + /*
>>> + * Wait on vma_lock. We know it is still valid as we have
>>> + * a reference. We must 'open code' vma locking as we do
>>> + * not know if vma_lock is still attached to vma.
>>> + */
>>> + down_write(&vma_lock->rw_sema);
>>> + i_mmap_lock_write(mapping);
>>> +
>>> + vma = vma_lock->vma;
>>> + if (!vma) {
>>
>> Thanks Mike. This method looks much simpler. But IIUC, this code can race with exit_mmap:
>>
>> CPU 1 CPU 2
>> hugetlb_unmap_file_folio exit_mmap
>> kref_get(&vma_lock->refs);
>> down_write(&vma_lock->rw_sema);
>> free_pgtables // i_mmap_lock_write is held inside it.
>> i_mmap_lock_write(mapping);
>> vma = vma_lock->vma;
>> remove_vma
>> hugetlb_vm_op_close
>> hugetlb_vma_lock_free
>> vma_lock->vma = NULL;
>> vm_area_free(vma);
>> vma is used-after-free??
>>
>> The root cause is that free_pgtables is protected by i_mmap_lock_write while remove_vma is not.
>> Or am I missing something again? ;)
>
> Thank you Miaohe! Sorry for the delay in responding.
>
> Yes, I agree this is a possible race. My first thought is that we may be
> able to address this by simply taking the vma_lock when we clear the
> vma_lock->vma field. Something like this,
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4cb44a4629b8..bf0c220ebc32 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6918,7 +6918,9 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
> * certainly will no longer be attached to vma so clear
> * pointer.
> */
> + down_write(&vma_lock->rw_sema);
> vma_lock->vma = NULL;
> + up_write(&vma_lock->rw_sema);
> kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
> vma->vm_private_data = NULL;
> }
AFAICT, this should work. And since we won't hold the vma_lock->rw_sema when deleting the vma,
there should not be a possible deadlock.
>
> I still need to do a bit more work to verify.
>
> Andrew, if you are concerned I do not think this is a show stopper. The
> race should be extremely rare, and a fix should be coming quickly.
Agree, this should be really rare.
>
> <snip>
>>> + mapping = vma->vm_file->f_mapping;
>>> + idx = vma_hugecache_offset(h, vma, haddr);
>>> hash = hugetlb_fault_mutex_hash(mapping, idx);
>>> mutex_lock(&hugetlb_fault_mutex_table[hash]);
>>>
>>> + /*
>>> + * Acquire vma lock before calling huge_pte_alloc and hold
>>> + * until finished with ptep. This prevents huge_pmd_unshare from
>>> + * being called elsewhere and making the ptep no longer valid.
>>> + *
>>> + * ptep could have already be assigned via huge_pte_offset. That
>>> + * is OK, as huge_pte_alloc will return the same value unless
>>> + * something has changed.
>>> + */
>>> + hugetlb_vma_lock_read(vma);
>>
>> [1] says vma_lock for each vma mapping the file provides the same type of synchronization
>> around i_size as provided by the fault mutex. But what if vma->vm_private_data is NULL,
>> i.e. hugetlb_vma_lock_alloc fails to alloc vma_lock? There won't be such synchronization
>> in this case.
>>
>> [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t
>>
>
> Right.
>
> Of course, this (checking i_size) only applies to shared/file mappings.
> The only time hugetlb_vma_lock_alloc should fail in such cases is when
> we can not allocate the small vma_lock structure. Since the vma_lock
> is primarily for huge pmd sharing synchronization, my thought was that
> allocation errors would just prevent sharing. But, as you point out
> it could also impact these checks.
>
> It would be easy to check for the lock allocation failure at mmap time
> and fail the mmap. It would be a little more tricky at fork time.
>
> This also is something that is highly unlikely to occur. And, if we
> can't allocate a vma_lock I suspect we will not be up and running long
> enough for this to be an issue. :) Let me think about the best way to handle.
Agree. This should be really rare too. Thanks for your work.
Thanks,
Miaohe Lin
>
>>
>> Other parts of the patch look good to me. Thanks for your work.
>>
>
> Thanks again,
>