From: Gavin Guo <gavinguo@igalia.com>
To: Oscar Salvador <osalvador@suse.de>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
muchun.song@linux.dev, akpm@linux-foundation.org,
mike.kravetz@oracle.com, kernel-dev@igalia.com,
stable@vger.kernel.org, Hugh Dickins <hughd@google.com>,
Florent Revest <revest@google.com>, Gavin Shan <gshan@redhat.com>
Subject: Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
Date: Wed, 21 May 2025 19:12:21 +0800
Message-ID: <3b167142-7c1a-4d6b-b41e-8c067ab737df@igalia.com>
In-Reply-To: <aCzdnAmuOylilU1p@localhost.localdomain>

On 5/21/25 03:53, Oscar Salvador wrote:
> On Tue, May 13, 2025 at 05:34:48PM +0800, Gavin Guo wrote:
>> The patch fixes a deadlock which can be triggered by an internal
>> syzkaller [1] reproducer and captured by bpftrace script [2] and its log
>> [3] in this scenario:
>>
>> Process 1                               Process 2
>> ---                                     ---
>> hugetlb_fault
>>   mutex_lock(B) // take B
>>   filemap_lock_hugetlb_folio
>>     filemap_lock_folio
>>       __filemap_get_folio
>>         folio_lock(A) // take A
>>   hugetlb_wp
>>     mutex_unlock(B) // release B
>>     ...                                 hugetlb_fault
>>     ...                                   mutex_lock(B) // take B
>>                                           filemap_lock_hugetlb_folio
>>                                             filemap_lock_folio
>>                                               __filemap_get_folio
>>                                                 folio_lock(A) // blocked
>>     unmap_ref_private
>>     ...
>>     mutex_lock(B) // retake and blocked
>>
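For readers following the interleaving above, here is a minimal
single-threaded C sketch that replays it step by step. It is a toy model,
not kernel code: toy_lock, toy_acquire(), toy_release() and
replay_deadlock() are hypothetical names, A stands for the folio lock and B
for the hugetlb_fault_mutex_table entry, and "blocking" is modelled as a
failed acquire rather than a real sleep.

```c
#include <stdbool.h>

/* Hypothetical identifiers for this sketch only. */
enum { NOBODY = 0, P1 = 1, P2 = 2 };

struct toy_lock {
	int owner;	/* NOBODY, P1 or P2 */
};

/* Take the lock if it is free; a false return means the caller would block. */
static bool toy_acquire(struct toy_lock *l, int who)
{
	if (l->owner != NOBODY)
		return false;
	l->owner = who;
	return true;
}

/* Release the lock, but only if the caller is the current owner. */
static void toy_release(struct toy_lock *l, int who)
{
	if (l->owner == who)
		l->owner = NOBODY;
}

/*
 * Replay the quoted interleaving. Returns true if it ends with each
 * process waiting on a lock held by the other one.
 * unlock_folio_before_unmap models dropping A before B is released.
 */
static bool replay_deadlock(bool unlock_folio_before_unmap)
{
	struct toy_lock A = { NOBODY };	/* folio lock */
	struct toy_lock B = { NOBODY };	/* fault mutex */
	bool p1_blocked, p2_blocked;

	/* Process 1: hugetlb_fault() takes B, then A, then enters hugetlb_wp(). */
	toy_acquire(&B, P1);
	toy_acquire(&A, P1);
	if (unlock_folio_before_unmap)
		toy_release(&A, P1);
	toy_release(&B, P1);	/* B is dropped before unmap_ref_private() */

	/* Process 2: hugetlb_fault() takes B, then tries to take A. */
	toy_acquire(&B, P2);
	p2_blocked = !toy_acquire(&A, P2);

	/* Process 1: tries to retake B after unmap_ref_private(). */
	p1_blocked = !toy_acquire(&B, P1);

	return p1_blocked && p2_blocked && A.owner == P1 && B.owner == P2;
}
```

Under these assumptions replay_deadlock(false) returns true, which is the
cycle from the scenario, while replay_deadlock(true) returns false because
Process 2 can take A and make progress.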
> ...
>> Signed-off-by: Gavin Guo <gavinguo@igalia.com>
>
> I think this is more convoluted than it needs to be?
>
> hugetlb_wp() is called from hugetlb_no_page() and hugetlb_fault().
> hugetlb_no_page() locks and unlocks the lock itself, which leaves us
> with hugetlb_fault().
>
> hugetlb_fault() always passes the folio locked to hugetlb_wp(), and the
> latter only unlocks it when we have a cow from owner happening and we
> cannot satisfy the allocation.
> So, shouldn't checking whether the folio is still locked after
> returning be enough?
> What speaks against:
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bd8971388236..23b57c5689a4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6228,6 +6228,12 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>  			u32 hash;
>
>  			folio_put(old_folio);
> +			/*
> +			 * The pagecache_folio needs to be unlocked to avoid
> +			 * deadlock when the child unmaps the folio.
> +			 */
> +			if (pagecache_folio)
> +				folio_unlock(pagecache_folio);
>  			/*
>  			 * Drop hugetlb_fault_mutex and vma_lock before
>  			 * unmapping. unmapping needs to hold vma_lock
> @@ -6825,7 +6831,12 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	spin_unlock(vmf.ptl);
>
>  	if (pagecache_folio) {
> -		folio_unlock(pagecache_folio);
> +		/*
> +		 * hugetlb_wp() might have already unlocked pagecache_folio, so
> +		 * skip it if that is the case.
> +		 */
> +		if (folio_test_locked(pagecache_folio))
> +			folio_unlock(pagecache_folio);
>  		folio_put(pagecache_folio);
>  	}
>  out_mutex:

Really appreciate your review. Good catch! This is elegant. The patch is
under testing and I'll send out the v2 patch soon.

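To make the control flow of the suggestion above concrete, here is a
minimal userspace C sketch. It is not kernel code: toy_folio, toy_wp() and
toy_fault() are hypothetical stand-ins for struct folio, hugetlb_wp() and
the tail of hugetlb_fault(), and the cow_alloc_failed flag is an assumption
that stands for the "cow from owner, allocation failed" path.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct folio: only the lock bit and refcount matter. */
struct toy_folio {
	bool locked;
	int refcount;
	int unlock_calls;	/* how many times unlock ran */
	int bad_unlocks;	/* unlocks of an already unlocked folio */
};

static void toy_folio_unlock(struct toy_folio *f)
{
	if (!f->locked)
		f->bad_unlocks++;	/* a double unlock would show up here */
	f->locked = false;
	f->unlock_calls++;
}

static bool toy_folio_test_locked(const struct toy_folio *f)
{
	return f->locked;
}

static void toy_folio_put(struct toy_folio *f)
{
	f->refcount--;
}

/* Models hugetlb_wp(): drop the folio lock only on the failure path. */
static void toy_wp(struct toy_folio *pagecache_folio, bool cow_alloc_failed)
{
	if (cow_alloc_failed && pagecache_folio)
		toy_folio_unlock(pagecache_folio);
}

/* Models the tail of hugetlb_fault(): unlock only if still locked. */
static void toy_fault(struct toy_folio *pagecache_folio, bool cow_alloc_failed)
{
	toy_wp(pagecache_folio, cow_alloc_failed);

	if (pagecache_folio) {
		if (toy_folio_test_locked(pagecache_folio))
			toy_folio_unlock(pagecache_folio);
		toy_folio_put(pagecache_folio);
	}
}
```

In this model the folio is unlocked exactly once on both paths. One caveat
the sketch cannot show: a lock bit records that the folio is locked, not
who locked it, so in a concurrent setting the check may observe a lock
taken by another task after the callee dropped it.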
>
>> ---
>> mm/hugetlb.c | 33 ++++++++++++++++++++++++++++-----
>> 1 file changed, 28 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index e3e6ac991b9c..ad54a74aa563 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -6115,7 +6115,8 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>>   * Keep the pte_same checks anyway to make transition from the mutex easier.
>>   */
>>  static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>> -			     struct vm_fault *vmf)
>> +			     struct vm_fault *vmf,
>> +			     bool *pagecache_folio_unlocked)
>>  {
>>  	struct vm_area_struct *vma = vmf->vma;
>>  	struct mm_struct *mm = vma->vm_mm;
>> @@ -6212,6 +6213,22 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>>  			u32 hash;
>>
>>  			folio_put(old_folio);
>> +			/*
>> +			 * The pagecache_folio needs to be unlocked to avoid
>> +			 * deadlock and we won't re-lock it in hugetlb_wp(). The
>> +			 * pagecache_folio could be truncated after being
>> +			 * unlocked. So its state should not be relied
>> +			 * subsequently.
>> +			 *
>> +			 * Setting *pagecache_folio_unlocked to true allows the
>> +			 * caller to handle any necessary logic related to the
>> +			 * folio's unlocked state.
>> +			 */
>> +			if (pagecache_folio) {
>> +				folio_unlock(pagecache_folio);
>> +				if (pagecache_folio_unlocked)
>> +					*pagecache_folio_unlocked = true;
>> +			}
>>  			/*
>>  			 * Drop hugetlb_fault_mutex and vma_lock before
>>  			 * unmapping. unmapping needs to hold vma_lock
>> @@ -6566,7 +6583,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>>  	hugetlb_count_add(pages_per_huge_page(h), mm);
>>  	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
>>  		/* Optimization, do the COW without a second fault */
>> -		ret = hugetlb_wp(folio, vmf);
>> +		ret = hugetlb_wp(folio, vmf, NULL);
>>  	}
>>
>>  	spin_unlock(vmf->ptl);
>> @@ -6638,6 +6655,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>  	struct hstate *h = hstate_vma(vma);
>>  	struct address_space *mapping;
>>  	int need_wait_lock = 0;
>> +	bool pagecache_folio_unlocked = false;
>>  	struct vm_fault vmf = {
>>  		.vma = vma,
>>  		.address = address & huge_page_mask(h),
>> @@ -6792,7 +6810,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>
>>  	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>>  		if (!huge_pte_write(vmf.orig_pte)) {
>> -			ret = hugetlb_wp(pagecache_folio, &vmf);
>> +			ret = hugetlb_wp(pagecache_folio, &vmf,
>> +					 &pagecache_folio_unlocked);
>>  			goto out_put_page;
>>  		} else if (likely(flags & FAULT_FLAG_WRITE)) {
>>  			vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
>> @@ -6809,10 +6828,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>  out_ptl:
>>  	spin_unlock(vmf.ptl);
>>
>> -	if (pagecache_folio) {
>> +	/*
>> +	 * If the pagecache_folio is unlocked in hugetlb_wp(), we skip
>> +	 * folio_unlock() here.
>> +	 */
>> +	if (pagecache_folio && !pagecache_folio_unlocked)
>>  		folio_unlock(pagecache_folio);
>> +	if (pagecache_folio)
>>  		folio_put(pagecache_folio);
>> -	}
>>  out_mutex:
>>  	hugetlb_vma_unlock_read(vma);
>>
>>
>> base-commit: d76bb1ebb5587f66b0f8b8099bfbb44722bc08b3
>> --
>> 2.43.0
>>
>>
>