Message-ID: <3cb53060-9769-43f4-996d-355189df107d@bytedance.com>
Date: Wed, 4 Jun 2025 14:02:12 +0800
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
From: Qi Zheng
To: Lorenzo Stoakes
Cc: Jann Horn, Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song,
 "Liam R. Howlett", David Hildenbrand, Vlastimil Babka,
 Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng
References: <20250530104439.64841-1-21cnbao@gmail.com>
 <0fb74598-1fee-428e-987b-c52276bfb975@bytedance.com>

Hi Lorenzo,

On 6/3/25 5:54 PM, Lorenzo Stoakes wrote:
> On Tue, Jun 03, 2025 at 03:24:28PM +0800, Qi Zheng wrote:
>> Hi Jann,
>>
>> On 5/30/25 10:06 PM, Jann Horn wrote:
>>> On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@gmail.com> wrote:
>>>> Certain madvise operations, especially MADV_DONTNEED, occur far more
>>>> frequently than other madvise options, particularly in native and Java
>>>> heaps for dynamic memory management.
>>>>
>>>> Currently, the mmap_lock is always held during these operations, even when
>>>> unnecessary. This causes lock contention and can lead to severe priority
>>>> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
>>>> hold the lock and block higher-priority threads.
>>>>
>>>> This patch enables the use of per-VMA locks when the advised range lies
>>>> entirely within a single VMA, avoiding the need for full VMA traversal. In
>>>> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>>>>
>>>> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
>>>> benefits from this per-VMA lock optimization. After extended runtime,
>>>> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
>>>> only 1,231 fell back to mmap_lock.
>>>>
>>>> To simplify handling, the implementation falls back to the standard
>>>> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
>>>> userfaultfd_remove().
>>>
>>> One important quirk of this is that it can, from what I can see, cause
>>> freeing of page tables (through pt_reclaim) without holding the mmap
>>> lock at all:
>>>
>>> do_madvise [behavior=MADV_DONTNEED]
>>>   madvise_lock
>>>     lock_vma_under_rcu
>>>   madvise_do_behavior
>>>     madvise_single_locked_vma
>>>       madvise_vma_behavior
>>>         madvise_dontneed_free
>>>           madvise_dontneed_single_vma
>>>             zap_page_range_single_batched [.reclaim_pt = true]
>>>               unmap_single_vma
>>>                 unmap_page_range
>>>                   zap_p4d_range
>>>                     zap_pud_range
>>>                       zap_pmd_range
>>>                         zap_pte_range
>>>                           try_get_and_clear_pmd
>>>                           free_pte
>>>
>>> This clashes with the assumption in walk_page_range_novma() that
>>> holding the mmap lock in write mode is sufficient to prevent
>>> concurrent page table freeing, so it can probably lead to page table
>>> UAF through the ptdump interface (see ptdump_walk_pgd()).
>>
>> Maybe not? The PTE page is freed via RCU in zap_pte_range(), so in the
>> following case:
>>
>> cpu 0                                   cpu 1
>>
>> ptdump_walk_pgd
>> --> walk_pte_range
>>     --> pte_offset_map (hold RCU read lock)
>>                                         zap_pte_range
>>                                         --> free_pte (via RCU)
>>     walk_pte_range_inner
>>     --> ptdump_pte_entry (the PTE page is not freed at this time)
>>
>> IIUC, there is no UAF issue here?
>>
>> If I missed anything please let me know.
>>
>> Thanks,
>> Qi
>>
>>
>
> I forgot about that interesting placement of RCU lock acquisition :) I will
> obviously let Jann come back to you on this, but I wonder if I need to
> update the doc to reflect this actually.

I saw that there is already a relevant description in process_addrs.rst:

```
So accessing PTE-level page tables requires at least holding an RCU read lock;
but that only suffices for readers that can tolerate racing with concurrent
page table updates such that an empty PTE is observed (in a page table that
has actually already been detached and marked for RCU freeing) while another
new page table has been installed in the same location and filled with
entries.
Writers normally need to take the PTE lock and revalidate that the
PMD entry still refers to the same PTE-level page table.
If the writer does not care whether it is the same PTE-level page table, it
can take the PMD lock and revalidate that the contents of pmd entry still meet
the requirements. In particular, this also happens in
:c:func:`!retract_page_tables` when handling :c:macro:`!MADV_COLLAPSE`.
```

Thanks!
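
P.S. For illustration only, the "take the lock, then revalidate" rule for
writers in the excerpt above can be modelled in userspace C. This is a toy
sketch, not kernel code: struct toy_table, struct toy_slot, toy_snapshot()
and toy_write_revalidated() are hypothetical names, an atomic_flag spinlock
stands in for the PTE lock, and the sketch assumes that a detached table
stays allocated while a writer still holds a pointer to it (roughly the
guarantee that RCU freeing gives in the kernel).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ENTRIES 4

/* Hypothetical stand-in for a PTE-level page table. */
struct toy_table {
	atomic_flag lock;		/* plays the role of the PTE lock */
	int entries[TOY_ENTRIES];
};

/* Hypothetical stand-in for a PMD entry pointing at a table. */
struct toy_slot {
	_Atomic(struct toy_table *) table;
};

/* Lockless lookup: returns whichever table the slot names right now. */
static struct toy_table *toy_snapshot(struct toy_slot *slot)
{
	assert(slot != NULL);
	return atomic_load(&slot->table);
}

/*
 * Writer side: lock the table found earlier, then re-read the slot.
 * The write only happens if the slot still names the same table;
 * otherwise the table was detached in the meantime and we back off.
 * Assumption: "t" is still valid memory even if it has been detached.
 */
static bool toy_write_revalidated(struct toy_slot *slot, struct toy_table *t,
				  size_t idx, int val)
{
	bool ok = false;

	if (t == NULL || idx >= TOY_ENTRIES)
		return false;

	while (atomic_flag_test_and_set(&t->lock))
		;	/* spin until the toy lock is ours */

	if (atomic_load(&slot->table) == t) {
		t->entries[idx] = val;
		ok = true;
	}

	atomic_flag_clear(&t->lock);
	return ok;
}
```

A caller would take a snapshot, do whatever lockless work it needs, and then
call toy_write_revalidated(); a false return means the slot was repointed
between the snapshot and the lock, so the caller has to retry or give up.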