From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <5f5d8f85-c082-4cb9-93ef-d207309ba807@redhat.com>
Date: Mon, 26 May 2025 14:41:31 +1000
Subject: Re: [PATCH] mm/hugetlb: fix a deadlock with pagecache_folio and hugetlb_fault_mutex_table
From: Gavin Shan
To: Gavin Guo, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, muchun.song@linux.dev, osalvador@suse.de, akpm@linux-foundation.org, mike.kravetz@oracle.com, kernel-dev@igalia.com, stable@vger.kernel.org, Hugh Dickins, Florent Revest
References: <20250513093448.592150-1-gavinguo@igalia.com>
In-Reply-To: <20250513093448.592150-1-gavinguo@igalia.com>
User-Agent: Mozilla Thunderbird
Content-Type: text/plain; charset=UTF-8; format=flowed
Hi Gavin,

On 5/13/25 7:34 PM, Gavin Guo wrote:
> The patch fixes a deadlock which can be triggered by an internal
> syzkaller [1] reproducer and captured by bpftrace script [2] and its log
> [3] in this scenario:
>
> Process 1                                Process 2
> ---                                      ---
> hugetlb_fault
>   mutex_lock(B)            // take B
>   filemap_lock_hugetlb_folio
>     filemap_lock_folio
>       __filemap_get_folio
>         folio_lock(A)      // take A
>   hugetlb_wp
>     mutex_unlock(B)        // release B
>     ...                                  hugetlb_fault
>     ...                                    mutex_lock(B)      // take B
>     ...                                    filemap_lock_hugetlb_folio
>     ...                                      filemap_lock_folio
>     ...                                        __filemap_get_folio
>     ...                                          folio_lock(A) // blocked
>     unmap_ref_private
>     ...
>     mutex_lock(B)          // retake and blocked
>
> This is an ABBA deadlock involving two locks:
> - Lock A: pagecache_folio lock
> - Lock B: hugetlb_fault_mutex_table lock
>
> The deadlock occurs between two processes as follows:
> 1. The first process (let's call it Process 1) is handling a
> copy-on-write (COW) operation on a hugepage via hugetlb_wp.
> Due to insufficient reserved hugetlb pages, Process 1, owner of the
> reserved hugetlb page, attempts to unmap a hugepage owned by another
> process (non-owner) to satisfy the reservation. Before unmapping,
> Process 1 acquires Lock B (hugetlb_fault_mutex_table lock) and then
> Lock A (pagecache_folio lock). To proceed with the unmap, it releases
> Lock B but retains Lock A. After the unmap, Process 1 tries to
> reacquire Lock B. However, at this point, Lock B has already been
> acquired by another process.
>
> 2. The second process (Process 2) enters the hugetlb_fault handler
> during the unmap operation. It successfully acquires Lock B
> (hugetlb_fault_mutex_table lock) that was just released by Process 1,
> but then attempts to acquire Lock A (pagecache_folio lock), which is
> still held by Process 1.
>
> As a result, Process 1 (holding Lock A) is blocked waiting for Lock B
> (held by Process 2), while Process 2 (holding Lock B) is blocked waiting
> for Lock A (held by Process 1), constructing an ABBA deadlock scenario.
>
> The solution here is to unlock the pagecache_folio and provide the
> pagecache_folio_unlocked variable to the caller to have visibility
> over the pagecache_folio status for subsequent handling.
>
> The error message:
> INFO: task repro_20250402_:13229 blocked for more than 64 seconds.
>       Not tainted 6.15.0-rc3+ #24
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:repro_20250402_ state:D stack:25856 pid:13229 tgid:13228 ppid:3513 task_flags:0x400040 flags:0x00004006
> Call Trace:
>
>  __schedule+0x1755/0x4f50
>  schedule+0x158/0x330
>  schedule_preempt_disabled+0x15/0x30
>  __mutex_lock+0x75f/0xeb0
>  hugetlb_wp+0xf88/0x3440
>  hugetlb_fault+0x14c8/0x2c30
>  trace_clock_x86_tsc+0x20/0x20
>  do_user_addr_fault+0x61d/0x1490
>  exc_page_fault+0x64/0x100
>  asm_exc_page_fault+0x26/0x30
> RIP: 0010:__put_user_4+0xd/0x20
>  copy_process+0x1f4a/0x3d60
>  kernel_clone+0x210/0x8f0
>  __x64_sys_clone+0x18d/0x1f0
>  do_syscall_64+0x6a/0x120
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x41b26d
>
> INFO: task repro_20250402_:13229 is blocked on a mutex likely owned by task repro_20250402_:13250.
> task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513 task_flags:0x400040 flags:0x00000006
> Call Trace:
>
>  __schedule+0x1755/0x4f50
>  schedule+0x158/0x330
>  io_schedule+0x92/0x110
>  folio_wait_bit_common+0x69a/0xba0
>  __filemap_get_folio+0x154/0xb70
>  hugetlb_fault+0xa50/0x2c30
>  trace_clock_x86_tsc+0x20/0x20
>  do_user_addr_fault+0xace/0x1490
>  exc_page_fault+0x64/0x100
>  asm_exc_page_fault+0x26/0x30
> RIP: 0033:0x402619
>
> INFO: task repro_20250402_:13250 blocked for more than 65 seconds.
>       Not tainted 6.15.0-rc3+ #24
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513 task_flags:0x400040 flags:0x00000006
> Call Trace:
>
>  __schedule+0x1755/0x4f50
>  schedule+0x158/0x330
>  io_schedule+0x92/0x110
>  folio_wait_bit_common+0x69a/0xba0
>  __filemap_get_folio+0x154/0xb70
>  hugetlb_fault+0xa50/0x2c30
>  trace_clock_x86_tsc+0x20/0x20
>  do_user_addr_fault+0xace/0x1490
>  exc_page_fault+0x64/0x100
>  asm_exc_page_fault+0x26/0x30
> RIP: 0033:0x402619
>
>
> Showing all locks held in the system:
> 1 lock held by khungtaskd/35:
>  #0: ffffffff879a7440 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x30/0x180
> 2 locks held by repro_20250402_/13229:
>  #0: ffff888017d801e0 (&mm->mmap_lock){++++}-{4:4}, at: lock_mm_and_find_vma+0x37/0x300
>  #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_wp+0xf88/0x3440
> 3 locks held by repro_20250402_/13250:
>  #0: ffff8880177f3d08 (vm_lock){++++}-{0:0}, at: do_user_addr_fault+0x41b/0x1490
>  #1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_fault+0x3b8/0x2c30
>  #2: ffff8880129500e8 (&resv_map->rw_sema){++++}-{4:4}, at: hugetlb_fault+0x494/0x2c30
>
> Link: https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=drive_link [1]
> Link: https://github.com/bboymimi/bpftracer/blob/master/scripts/hugetlb_lock_debug.bt [2]
> Link: https://drive.google.com/file/d/1bWq2-8o-BJAuhoHWX7zAhI6ggfhVzQUI/view?usp=sharing [3]
> Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
> Cc:
> Cc: Hugh Dickins
> Cc: Florent Revest
> Cc: Gavin Shan
> Signed-off-by: Gavin Guo
> ---
>  mm/hugetlb.c | 33 ++++++++++++++++++++++++++++-----
>  1 file changed, 28 insertions(+), 5 deletions(-)
>

I guess the changelog can become more concise once the kernel log is
dropped. The summarized stack trace is sufficient to show how the deadlock
scenario happens. Besides, there is no need to mention bpftrace and its
output.
So the changelog could be simplified to something like below. Please polish
it a bit if you would like to take it. The solution looks good except for
some nitpicks, as below.

---

There is an ABBA deadlock scenario between hugetlb_fault() and hugetlb_wp()
on the pagecache folio's lock and the hugetlb global mutex, which is
reproducible with syzkaller [1]. As the stack traces below reveal, process-1
tries to take the hugetlb global mutex (A3), but with the pagecache folio's
lock held. Process-2 took the hugetlb global mutex but tries to take the
pagecache folio's lock.

[1] https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=drive_link

Process-1                              Process-2
=========                              =========
hugetlb_fault
  mutex_lock (A1)
  filemap_lock_hugetlb_folio (B1)
  hugetlb_wp
    alloc_hugetlb_folio  #error
    mutex_unlock (A2)
                                       hugetlb_fault
                                         mutex_lock (A4)
                                         filemap_lock_hugetlb_folio (B4)
    unmap_ref_private
    mutex_lock (A3)

Fix it by releasing the pagecache folio's lock at (A2) of process-1 so that
the pagecache folio's lock is available to process-2 at (B4), to avoid the
deadlock. In process-1, a new variable is added to track if the pagecache
folio's lock has been released by its child function hugetlb_wp(), to avoid
double releases on the lock in hugetlb_fault(). Similar changes are applied
to hugetlb_no_page().

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e3e6ac991b9c..ad54a74aa563 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6115,7 +6115,8 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>   * Keep the pte_same checks anyway to make transition from the mutex easier.
>   */
>  static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> -		struct vm_fault *vmf)
> +		struct vm_fault *vmf,
> +		bool *pagecache_folio_unlocked)

Nitpick: the variable may be renamed to 'pagecache_folio_locked' if you're
happy with that.
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct mm_struct *mm = vma->vm_mm;
> @@ -6212,6 +6213,22 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>  			u32 hash;
>
>  			folio_put(old_folio);
> +			/*
> +			 * The pagecache_folio needs to be unlocked to avoid

				                 ^^^^^^^^ has to be (?)

> +			 * deadlock and we won't re-lock it in hugetlb_wp(). The
> +			 * pagecache_folio could be truncated after being
> +			 * unlocked. So its state should not be relied

				                               ^^^^^^ reliable (?)

> +			 * subsequently.
> +			 *
> +			 * Setting *pagecache_folio_unlocked to true allows the
> +			 * caller to handle any necessary logic related to the
> +			 * folio's unlocked state.
> +			 */
> +			if (pagecache_folio) {
> +				folio_unlock(pagecache_folio);
> +				if (pagecache_folio_unlocked)
> +					*pagecache_folio_unlocked = true;
> +			}

The second section of the comment looks a bit redundant since the code
changes are self-explanatory enough :-)

>  			/*
>  			 * Drop hugetlb_fault_mutex and vma_lock before
>  			 * unmapping. unmapping needs to hold vma_lock
> @@ -6566,7 +6583,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>  		hugetlb_count_add(pages_per_huge_page(h), mm);
>  		if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
>  			/* Optimization, do the COW without a second fault */
> -			ret = hugetlb_wp(folio, vmf);
> +			ret = hugetlb_wp(folio, vmf, NULL);

It's not certain if we have another deadlock between hugetlb_no_page() and
hugetlb_wp(), similar to the existing one between hugetlb_fault() and
hugetlb_wp(). So I think it's reasonable to pass '&pagecache_folio_locked'
to hugetlb_wp() here too, and skip the unlock when pagecache_folio_locked ==
false in hugetlb_no_page(). It's not harmful at least.
>  		}
>
>  		spin_unlock(vmf->ptl);
> @@ -6638,6 +6655,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	struct hstate *h = hstate_vma(vma);
>  	struct address_space *mapping;
>  	int need_wait_lock = 0;
> +	bool pagecache_folio_unlocked = false;
>  	struct vm_fault vmf = {
>  		.vma = vma,
>  		.address = address & huge_page_mask(h),
> @@ -6792,7 +6810,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>
>  	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>  		if (!huge_pte_write(vmf.orig_pte)) {
> -			ret = hugetlb_wp(pagecache_folio, &vmf);
> +			ret = hugetlb_wp(pagecache_folio, &vmf,
> +					 &pagecache_folio_unlocked);
>  			goto out_put_page;
>  		} else if (likely(flags & FAULT_FLAG_WRITE)) {
>  			vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
> @@ -6809,10 +6828,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  out_ptl:
>  	spin_unlock(vmf.ptl);
>
> -	if (pagecache_folio) {
> +	/*
> +	 * If the pagecache_folio is unlocked in hugetlb_wp(), we skip
> +	 * folio_unlock() here.
> +	 */
> +	if (pagecache_folio && !pagecache_folio_unlocked)
>  		folio_unlock(pagecache_folio);
> +	if (pagecache_folio)
>  		folio_put(pagecache_folio);
> -	}

The comment seems redundant since the code change is self-explanatory.
Besides, there is no need to check 'pagecache_folio' twice:

	if (pagecache_folio) {
		if (pagecache_folio_locked)
			folio_unlock(pagecache_folio);

		folio_put(pagecache_folio);
	}

>  out_mutex:
>  	hugetlb_vma_unlock_read(vma);
>
>
> base-commit: d76bb1ebb5587f66b0f8b8099bfbb44722bc08b3

Thanks,
Gavin