From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <7cb990bf-57d4-4fc9-b44c-f30175c0fb7a@bytedance.com>
Date: Thu, 5 Jun 2025 11:23:18 +0800
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: Lorenzo Stoakes
Cc: Jann Horn, Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song,
 "Liam R. Howlett", David Hildenbrand, Vlastimil Babka,
 Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng
References: <20250530104439.64841-1-21cnbao@gmail.com>
 <0fb74598-1fee-428e-987b-c52276bfb975@bytedance.com>
 <3cb53060-9769-43f4-996d-355189df107d@bytedance.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
On 6/5/25 1:50 AM, Lorenzo Stoakes wrote:
> On Wed, Jun 04, 2025 at 02:02:12PM +0800, Qi Zheng wrote:
>> Hi Lorenzo,
>>
>> On 6/3/25 5:54 PM, Lorenzo Stoakes wrote:
>>> On Tue, Jun 03, 2025 at 03:24:28PM +0800, Qi Zheng wrote:
>>>> Hi Jann,
>>>>
>>>> On 5/30/25 10:06 PM, Jann Horn wrote:
>>>>> On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>>> Certain madvise operations, especially MADV_DONTNEED, occur far more
>>>>>> frequently than other madvise options, particularly in native and Java
>>>>>> heaps for dynamic memory management.
>>>>>>
>>>>>> Currently, the mmap_lock is always held during these operations, even when
>>>>>> unnecessary. This causes lock contention and can lead to severe priority
>>>>>> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
>>>>>> hold the lock and block higher-priority threads.
>>>>>>
>>>>>> This patch enables the use of per-VMA locks when the advised range lies
>>>>>> entirely within a single VMA, avoiding the need for full VMA traversal. In
>>>>>> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>>>>>>
>>>>>> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
>>>>>> benefits from this per-VMA lock optimization. After extended runtime,
>>>>>> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
>>>>>> only 1,231 fell back to mmap_lock.
>>>>>>
>>>>>> To simplify handling, the implementation falls back to the standard
>>>>>> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
>>>>>> userfaultfd_remove().
>>>>>
>>>>> One important quirk of this is that it can, from what I can see, cause
>>>>> freeing of page tables (through pt_reclaim) without holding the mmap
>>>>> lock at all:
>>>>>
>>>>> do_madvise [behavior=MADV_DONTNEED]
>>>>>   madvise_lock
>>>>>   lock_vma_under_rcu
>>>>>   madvise_do_behavior
>>>>>     madvise_single_locked_vma
>>>>>       madvise_vma_behavior
>>>>>         madvise_dontneed_free
>>>>>           madvise_dontneed_single_vma
>>>>>             zap_page_range_single_batched [.reclaim_pt = true]
>>>>>               unmap_single_vma
>>>>>                 unmap_page_range
>>>>>                   zap_p4d_range
>>>>>                     zap_pud_range
>>>>>                       zap_pmd_range
>>>>>                         zap_pte_range
>>>>>                           try_get_and_clear_pmd
>>>>>                           free_pte
>>>>>
>>>>> This clashes with the assumption in walk_page_range_novma() that
>>>>> holding the mmap lock in write mode is sufficient to prevent
>>>>> concurrent page table freeing, so it can probably lead to page table
>>>>> UAF through the ptdump interface (see ptdump_walk_pgd()).
>>>>
>>>> Maybe not? The PTE page is freed via RCU in zap_pte_range(), so in the
>>>> following case:
>>>>
>>>> cpu 0                                    cpu 1
>>>>
>>>> ptdump_walk_pgd
>>>> --> walk_pte_range
>>>>     --> pte_offset_map (hold RCU read lock)
>>>>                                          zap_pte_range
>>>>                                          --> free_pte (via RCU)
>>>>     walk_pte_range_inner
>>>>     --> ptdump_pte_entry (the PTE page is not freed at this time)
>>>>
>>>> IIUC, there is no UAF issue here?
>>>>
>>>> If I missed anything please let me know.
>
> Seems to me that we don't need the VMA locks then unless I'm missing
> something? :) Jann?
>
> Would this RCU-lock-acquired-by-pte_offset_map also save us from the
> munmap() downgraded read lock scenario also? Or is the problem there
> intermediate page table teardown I guess?

Right. Currently, page table pages other than PTE pages are not protected
by RCU, so the mmap write lock is still needed in the munmap path to wait
for all readers of the page table pages to exit the critical section.
In other words, once all page table pages are protected by RCU, we can
completely remove page table pages from the protection of the mmap locks.
Here are some of my previous thoughts:

```
Another plan
============

Currently, page table modifications are protected by page table locks
(page_table_lock or split pmd/pte lock), but the life cycle of page table
pages is protected by mmap_lock (and vma lock). For more details, please
refer to the newly added Documentation/mm/process_addrs.rst file.

Currently we try to free the PTE pages through RCU when CONFIG_PT_RECLAIM
is turned on. In this case, we no longer need to hold mmap_lock for
read/write operations on the PTE pages. So maybe we can remove the page
table from the protection of the mmap lock (which is too coarse-grained),
like this:

1. Free all levels of page table pages via RCU, not just PTE pages, but
   also pmd, pud, etc.
2. Similar to pte_offset_map/pte_unmap, add [pmd|pud]_offset_map and
   [pmd|pud]_unmap, make them all contain rcu_read_lock/rcu_read_unlock,
   and make them accept failure.

In this way, we would no longer need the mmap lock. Readers, such as page
table walkers, would already be inside an RCU critical section; writers
would only need to hold the page table lock.

But there is a difficulty here: an RCU critical section is not allowed to
sleep, yet it is possible to sleep in the callback function of .pmd_entry,
such as mmu_notifier_invalidate_range_start(). Use SRCU instead? Or use
RCU + refcount? Not sure. But I think it's an interesting thing to try.
```

Thanks!
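A page table walker under step 2 of that plan might look roughly like
the sketch below. To be clear, this is purely hypothetical:
pmd_offset_map()/pmd_unmap() with these semantics do not exist in the
kernel today, and the failure/retry behavior is only gestured at; the
shape is modeled on the existing pte_offset_map()/pte_unmap() pair.

```
	/* Hypothetical: pmd_offset_map() enters an RCU read-side
	 * section and may fail if the PMD table was freed under us.
	 */
	pmd = pmd_offset_map(pud, addr);
	if (!pmd)
		goto rewalk;	/* restart the walk from a higher level */

	pte = pte_offset_map(pmd, addr);	/* nested read-side section */
	if (pte) {
		/* Read the entry; writers need only the PTE lock. */
		...
		pte_unmap(pte);
	}
	pmd_unmap(pmd);		/* leaves the RCU read-side section */
```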