Message-ID: <1ac43b6c-5674-4523-8200-10d598a743f6@redhat.com>
Date: Tue, 27 May 2025 20:59:22 +1000
Subject: Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
From: Gavin Shan
To: Gavin Guo, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, muchun.song@linux.dev, osalvador@suse.de, akpm@linux-foundation.org, mike.kravetz@oracle.com, kernel-dev@igalia.com, stable@vger.kernel.org, Hugh Dickins, Florent Revest
In-Reply-To: <1df840eb-6d04-4023-8c48-ffdda45931d0@igalia.com>
References: <20250513093448.592150-1-gavinguo@igalia.com> <5f5d8f85-c082-4cb9-93ef-d207309ba807@redhat.com> <1df840eb-6d04-4023-8c48-ffdda45931d0@igalia.com>

Hi Gavin,

On 5/27/25 7:59 PM, Gavin Guo wrote:
> On 5/26/25 12:41, Gavin Shan wrote:
>> On 5/13/25 7:34 PM, Gavin Guo wrote:
>>> The patch fixes a deadlock which can be triggered by an internal
>>> syzkaller [1] reproducer and captured by bpftrace script [2] and its log
>>> [3] in this scenario:
>>>
>>> Process 1                              Process 2
>>> ---                                    ---
>>> hugetlb_fault
>>>   mutex_lock(B) // take B
>>>   filemap_lock_hugetlb_folio
>>>     filemap_lock_folio
>>>       __filemap_get_folio
>>>         folio_lock(A) // take A
>>>   hugetlb_wp
>>>     mutex_unlock(B) // release B
>>>     ...                                hugetlb_fault
>>>     ...                                  mutex_lock(B) // take B
>>>                                          filemap_lock_hugetlb_folio
>>>                                            filemap_lock_folio
>>>                                              __filemap_get_folio
>>>                                                folio_lock(A) // blocked
>>>     unmap_ref_private
>>>     ...
>>>     mutex_lock(B) // retake and blocked
>>>
>>> This is an ABBA deadlock involving two locks:
>>> - Lock A: pagecache_folio lock
>>> - Lock B: hugetlb_fault_mutex_table lock
>>>
>>> The deadlock occurs between two processes as follows:
>>> 1. The first process (let’s call it Process 1) is handling a
>>> copy-on-write (COW) operation on a hugepage via hugetlb_wp. Due to
>>> insufficient reserved hugetlb pages, Process 1, owner of the reserved
>>> hugetlb page, attempts to unmap a hugepage owned by another process
>>> (non-owner) to satisfy the reservation.
>>> Before unmapping, Process 1
>>> acquires lock B (hugetlb_fault_mutex_table lock) and then lock A
>>> (pagecache_folio lock). To proceed with the unmap, it releases Lock B
>>> but retains Lock A. After the unmap, Process 1 tries to reacquire Lock
>>> B. However, at this point, Lock B has already been acquired by another
>>> process.
>>>
>>> 2. The second process (Process 2) enters the hugetlb_fault handler
>>> during the unmap operation. It successfully acquires Lock B
>>> (hugetlb_fault_mutex_table lock) that was just released by Process 1,
>>> but then attempts to acquire Lock A (pagecache_folio lock), which is
>>> still held by Process 1.
>>>
>>> As a result, Process 1 (holding Lock A) is blocked waiting for Lock B
>>> (held by Process 2), while Process 2 (holding Lock B) is blocked waiting
>>> for Lock A (held by Process 1), constructing an ABBA deadlock scenario.
>>>
>>> The solution here is to unlock the pagecache_folio and provide the
>>> pagecache_folio_unlocked variable to the caller to have visibility
>>> over the pagecache_folio status for subsequent handling.
>>>
>>> The error message:
>>> INFO: task repro_20250402_:13229 blocked for more than 64 seconds.
>>>       Not tainted 6.15.0-rc3+ #24
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> task:repro_20250402_ state:D stack:25856 pid:13229 tgid:13228 ppid:3513   task_flags:0x400040 flags:0x00004006
>>> Call Trace:
>>>
>>>  __schedule+0x1755/0x4f50
>>>  schedule+0x158/0x330
>>>  schedule_preempt_disabled+0x15/0x30
>>>  __mutex_lock+0x75f/0xeb0
>>>  hugetlb_wp+0xf88/0x3440
>>>  hugetlb_fault+0x14c8/0x2c30
>>>  trace_clock_x86_tsc+0x20/0x20
>>>  do_user_addr_fault+0x61d/0x1490
>>>  exc_page_fault+0x64/0x100
>>>  asm_exc_page_fault+0x26/0x30
>>> RIP: 0010:__put_user_4+0xd/0x20
>>>  copy_process+0x1f4a/0x3d60
>>>  kernel_clone+0x210/0x8f0
>>>  __x64_sys_clone+0x18d/0x1f0
>>>  do_syscall_64+0x6a/0x120
>>>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>> RIP: 0033:0x41b26d
>>>
>>> INFO: task repro_20250402_:13229 is blocked on a mutex likely owned by task repro_20250402_:13250.
>>> task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513   task_flags:0x400040 flags:0x00000006
>>> Call Trace:
>>>
>>>  __schedule+0x1755/0x4f50
>>>  schedule+0x158/0x330
>>>  io_schedule+0x92/0x110
>>>  folio_wait_bit_common+0x69a/0xba0
>>>  __filemap_get_folio+0x154/0xb70
>>>  hugetlb_fault+0xa50/0x2c30
>>>  trace_clock_x86_tsc+0x20/0x20
>>>  do_user_addr_fault+0xace/0x1490
>>>  exc_page_fault+0x64/0x100
>>>  asm_exc_page_fault+0x26/0x30
>>> RIP: 0033:0x402619
>>>
>>> INFO: task repro_20250402_:13250 blocked for more than 65 seconds.
>>>       Not tainted 6.15.0-rc3+ #24
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513   task_flags:0x400040 flags:0x00000006
>>> Call Trace:
>>>
>>>  __schedule+0x1755/0x4f50
>>>  schedule+0x158/0x330
>>>  io_schedule+0x92/0x110
>>>  folio_wait_bit_common+0x69a/0xba0
>>>  __filemap_get_folio+0x154/0xb70
>>>  hugetlb_fault+0xa50/0x2c30
>>>  trace_clock_x86_tsc+0x20/0x20
>>>  do_user_addr_fault+0xace/0x1490
>>>  exc_page_fault+0x64/0x100
>>>  asm_exc_page_fault+0x26/0x30
>>> RIP: 0033:0x402619
>>>
>>>
>>> Showing all locks held in the system:
>>> 1 lock held by khungtaskd/35:
>>>  #0: ffffffff879a7440 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x30/0x180
>>> 2 locks held by repro_20250402_/13229:
>>>  #0: ffff888017d801e0 (&mm->mmap_lock){++++}-{4:4}, at: lock_mm_and_find_vma+0x37/0x300
>>>  #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_wp+0xf88/0x3440
>>> 3 locks held by repro_20250402_/13250:
>>>  #0: ffff8880177f3d08 (vm_lock){++++}-{0:0}, at: do_user_addr_fault+0x41b/0x1490
>>>  #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_fault+0x3b8/0x2c30
>>>  #2: ffff8880129500e8 (&resv_map->rw_sema){++++}-{4:4}, at: hugetlb_fault+0x494/0x2c30
>>>
>>> Link: https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=drive_link [1]
>>> Link: https://github.com/bboymimi/bpftracer/blob/master/scripts/hugetlb_lock_debug.bt [2]
>>> Link: https://drive.google.com/file/d/1bWq2-8o-BJAuhoHWX7zAhI6ggfhVzQUI/view?usp=sharing [3]
>>> Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
>>> Cc:
>>> Cc: Hugh Dickins
>>> Cc: Florent Revest
>>> Cc: Gavin Shan
>>> Signed-off-by: Gavin Guo
>>> ---
>>>  mm/hugetlb.c | 33 ++++++++++++++++++++++++++++-----
>>>  1 file changed, 28 insertions(+), 5 deletions(-)
>>>
>>
>> I guess the changelog can become more concise after the kernel log is dropped. The
>> summarized stack trace is sufficient to indicate how the deadlock scenario happens.
>> Besides, there's no need to mention bpftrace and its output. So the changelog would be
>> simplified to something like below. Please polish it a bit if you would like to take it.
>> The solution looks good except for some nitpicks below.
>>
>> ---
>>
>> There is an ABBA deadlock scenario between hugetlb_fault() and hugetlb_wp() on the
>> pagecache folio's lock and the hugetlb global mutex, which is reproducible with
>> syzkaller [1]. As the stack traces below reveal, process-1 tries to take the hugetlb
>> global mutex (A3), but with the pagecache folio's lock held. Process-2 took the hugetlb
>> global mutex but tries to take the pagecache folio's lock.
>>
>> [1] https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=drive_link
>>
>> Process-1                                       Process-2
>> =========                                       =========
>> hugetlb_fault
>>   mutex_lock                  (A1)
>>   filemap_lock_hugetlb_folio  (B1)
>>   hugetlb_wp
>>     alloc_hugetlb_folio       #error
>>       mutex_unlock            (A2)
>>                                                 hugetlb_fault
>>                                                   mutex_lock                  (A4)
>>                                                   filemap_lock_hugetlb_folio  (B4)
>>       unmap_ref_private
>>       mutex_lock              (A3)
>>
>> Fix it by releasing the pagecache folio's lock at (A2) of process-1 so that the
>> pagecache folio's lock is available to process-2 at (B4), to avoid the deadlock.
>> In process-1, a new variable is added to track whether the pagecache folio's lock has
>> been released by its child function hugetlb_wp(), to avoid double releases of the lock
>> in hugetlb_fault(). Similar changes are applied to hugetlb_no_page().
>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index e3e6ac991b9c..ad54a74aa563 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -6115,7 +6115,8 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>>>    * Keep the pte_same checks anyway to make transition from the mutex easier.
>>>    */
>>>   static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>>> -               struct vm_fault *vmf)
>>> +               struct vm_fault *vmf,
>>> +               bool *pagecache_folio_unlocked)
>>
>> Nitpick: the variable may be renamed to 'pagecache_folio_locked' if you're happy
>> with that.
>>
>>>   {
>>>       struct vm_area_struct *vma = vmf->vma;
>>>       struct mm_struct *mm = vma->vm_mm;
>>> @@ -6212,6 +6213,22 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>>>               u32 hash;
>>>               folio_put(old_folio);
>>> +            /*
>>> +             * The pagecache_folio needs to be unlocked to avoid
>>                                      ^^^^^^^^
>>
>>                                      has to be (?)
>>
>>> +             * deadlock and we won't re-lock it in hugetlb_wp(). The
>>> +             * pagecache_folio could be truncated after being
>>> +             * unlocked. So its state should not be relied
>>                                                     ^^^^^^
>> reliable (?)
>>> +             * subsequently.
>>> +             *
>>> +             * Setting *pagecache_folio_unlocked to true allows the
>>> +             * caller to handle any necessary logic related to the
>>> +             * folio's unlocked state.
>>> +             */
>>> +            if (pagecache_folio) {
>>> +                folio_unlock(pagecache_folio);
>>> +                if (pagecache_folio_unlocked)
>>> +                    *pagecache_folio_unlocked = true;
>>> +            }
>>
>> The second section of the comments looks a bit redundant since the code changes
>> are self-explanatory enough :-)
>>
>>>               /*
>>>                * Drop hugetlb_fault_mutex and vma_lock before
>>>                * unmapping.  unmapping needs to hold vma_lock
>>> @@ -6566,7 +6583,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>>>       hugetlb_count_add(pages_per_huge_page(h), mm);
>>>       if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
>>>           /* Optimization, do the COW without a second fault */
>>> -        ret = hugetlb_wp(folio, vmf);
>>> +        ret = hugetlb_wp(folio, vmf, NULL);
>>
>> It's not certain whether we have another deadlock between hugetlb_no_page() and
>> hugetlb_wp(), similar to the existing one between hugetlb_fault() and hugetlb_wp().
>> So I think it's reasonable to pass '&pagecache_folio_locked' to hugetlb_wp() here and
>> skip the unlock when pagecache_folio_locked == false in hugetlb_no_page(). At least
>> it's not harmful.
>
> Thank you very much for taking the time to review my patch! I appreciate
> your feedback. :)
>
> After carefully reviewing the hugetlb_no_page function, I've made an
> observation regarding the pagecache_folio handling. Specifically, when
> pagecache_folio is assigned to vmf->pte, the vmf->ptl lock is held.
> This lock remains active when vmf->pte is later accessed in hugetlb_wp. The
> ptl lock ensures that the pte value is the same as the pagecache_folio
> assigned in hugetlb_no_page. As a result, the following comparison in
> hugetlb_wp will always evaluate to false because old_folio and
> pagecache_folio reference the same object:
>
>         if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
>                         old_folio != pagecache_folio)
>                 cow_from_owner = true;
>
> Based on this analysis, passing pagecache_folio_locked in
> hugetlb_no_page is unnecessary. Let me know if I missed anything. Other
> comments look good to me. Thanks!
>

Thanks for diving deeper into the code. Agreed, we're safe to bypass this specific case.
As you explained, @old_folio sorted out from the PTE should be equal to @pagecache_folio,
and the consistency is guaranteed by the PTL lock. So we don't have a locking contention
between hugetlb_no_page() and hugetlb_wp().

>>
>>>       }
>>>       spin_unlock(vmf->ptl);
>>> @@ -6638,6 +6655,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>>       struct hstate *h = hstate_vma(vma);
>>>       struct address_space *mapping;
>>>       int need_wait_lock = 0;
>>> +    bool pagecache_folio_unlocked = false;
>>>       struct vm_fault vmf = {
>>>           .vma = vma,
>>>           .address = address & huge_page_mask(h),
>>> @@ -6792,7 +6810,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>>       if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>>>           if (!huge_pte_write(vmf.orig_pte)) {
>>> -            ret = hugetlb_wp(pagecache_folio, &vmf);
>>> +            ret = hugetlb_wp(pagecache_folio, &vmf,
>>> +                    &pagecache_folio_unlocked);
>>>               goto out_put_page;
>>>           } else if (likely(flags & FAULT_FLAG_WRITE)) {
>>>               vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
>>> @@ -6809,10 +6828,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>>   out_ptl:
>>>       spin_unlock(vmf.ptl);
>>> -    if (pagecache_folio) {
>>> +    /*
>>> +     * If the pagecache_folio is unlocked in hugetlb_wp(), we skip
>>> +     * folio_unlock() here.
>>> +     */
>>> +    if (pagecache_folio && !pagecache_folio_unlocked)
>>>           folio_unlock(pagecache_folio);
>>> +    if (pagecache_folio)
>>>           folio_put(pagecache_folio);
>>> -    }
>>
>> The comments seem redundant since the code changes are self-explanatory.
>> Besides, there's no need to check 'pagecache_folio' twice.
>>
>>      if (pagecache_folio) {
>>          if (pagecache_folio_locked)
>>              folio_unlock(pagecache_folio);
>>
>>          folio_put(pagecache_folio);
>>      }
>>
>>>   out_mutex:
>>>       hugetlb_vma_unlock_read(vma);
>>>
>>> base-commit: d76bb1ebb5587f66b0f8b8099bfbb44722bc08b3

Thanks,
Gavin
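
P.S. For anyone following the locking discussion, below is a minimal user-space sketch of
the ordering in question. It is illustrative only and not the mm/hugetlb.c code: the file
name, the process1/process2 helpers and the lock_a/lock_b mutexes are made up, with the two
pthread mutexes merely standing in for the pagecache folio lock (A) and the
hugetlb_fault_mutex_table entry (B). It shows why dropping A as well as B before re-taking
B breaks the ABBA cycle.

	/* abba_sketch.c - illustrative only, not mm/hugetlb.c code.
	 * lock_a stands in for the pagecache folio lock (A), lock_b for the
	 * hugetlb_fault_mutex_table entry (B).  Build with: cc -pthread abba_sketch.c
	 */
	#include <pthread.h>
	#include <stdio.h>

	static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;  /* "folio lock"  */
	static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;  /* "fault mutex" */

	/* Mirrors process-1: hold B, take A, drop B around the unmap, then re-take B.
	 * The fix corresponds to dropping A as well before re-taking B, so nothing is
	 * held while waiting and the ABBA cycle cannot form. */
	static void *process1(void *arg)
	{
		(void)arg;
		pthread_mutex_lock(&lock_b);    /* hugetlb_fault: take B           */
		pthread_mutex_lock(&lock_a);    /* filemap_lock_hugetlb_folio: A   */
		pthread_mutex_unlock(&lock_b);  /* hugetlb_wp: drop B before unmap */
		pthread_mutex_unlock(&lock_a);  /* the fix: drop A here too        */
		pthread_mutex_lock(&lock_b);    /* re-take B; safe, nothing held   */
		pthread_mutex_unlock(&lock_b);
		return NULL;
	}

	/* Mirrors process-2: the second fault takes B, then A. */
	static void *process2(void *arg)
	{
		(void)arg;
		pthread_mutex_lock(&lock_b);
		pthread_mutex_lock(&lock_a);
		pthread_mutex_unlock(&lock_a);
		pthread_mutex_unlock(&lock_b);
		return NULL;
	}

	int main(void)
	{
		pthread_t t1, t2;

		pthread_create(&t1, NULL, process1, NULL);
		pthread_create(&t2, NULL, process2, NULL);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);
		printf("completed without deadlock\n");
		return 0;
	}

If process1 instead kept lock_a held across the second pthread_mutex_lock(&lock_b), the two
threads could end up in exactly the ABBA state described above: process1 waiting for B while
holding A, and process2 holding B while waiting for A.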