From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <0fb74598-1fee-428e-987b-c52276bfb975@bytedance.com>
Date: Tue, 3 Jun 2025 15:24:28 +0800
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
To: Jann Horn, Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Barry Song, "Liam R. Howlett",
 Lorenzo Stoakes, David Hildenbrand, Vlastimil Babka,
 Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng
References: <20250530104439.64841-1-21cnbao@gmail.com>
From: Qi Zheng

Hi Jann,

On 5/30/25 10:06 PM, Jann Horn wrote:
> On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@gmail.com> wrote:
>> Certain madvise operations, especially MADV_DONTNEED, occur far more
>> frequently than other madvise options, particularly in native and Java
>> heaps for dynamic memory management.
>>
>> Currently, the mmap_lock is always held during these operations, even when
>> unnecessary. This causes lock contention and can lead to severe priority
>> inversion, where low-priority threads, such as Android's HeapTaskDaemon,
>> hold the lock and block higher-priority threads.
>>
>> This patch enables the use of per-VMA locks when the advised range lies
>> entirely within a single VMA, avoiding the need for full VMA traversal. In
>> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>>
>> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
>> benefits from this per-VMA lock optimization. After extended runtime,
>> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
>> only 1,231 fell back to mmap_lock.
>>
>> To simplify handling, the implementation falls back to the standard
>> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
>> userfaultfd_remove().
>
> One important quirk of this is that it can, from what I can see, cause
> freeing of page tables (through pt_reclaim) without holding the mmap
> lock at all:
>
> do_madvise [behavior=MADV_DONTNEED]
>   madvise_lock
>     lock_vma_under_rcu
>   madvise_do_behavior
>     madvise_single_locked_vma
>       madvise_vma_behavior
>         madvise_dontneed_free
>           madvise_dontneed_single_vma
>             zap_page_range_single_batched [.reclaim_pt = true]
>               unmap_single_vma
>                 unmap_page_range
>                   zap_p4d_range
>                     zap_pud_range
>                       zap_pmd_range
>                         zap_pte_range
>                           try_get_and_clear_pmd
>                           free_pte
>
> This clashes with the assumption in walk_page_range_novma() that
> holding the mmap lock in write mode is sufficient to prevent
> concurrent page table freeing, so it can probably lead to page table
> UAF through the ptdump interface (see ptdump_walk_pgd()).

Maybe not? The PTE page is freed via RCU in zap_pte_range(), so in the
following case:

cpu 0                                   cpu 1

ptdump_walk_pgd
--> walk_pte_range
    --> pte_offset_map (hold RCU read lock)
                                        zap_pte_range
                                        --> free_pte (via RCU)
        walk_pte_range_inner
        --> ptdump_pte_entry (the PTE page is not freed at this time)

IIUC, there is no UAF issue here?

If I missed anything please let me know.

Thanks,
Qi
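
For reference, the common case described in the quoted patch text (an
MADV_DONTNEED range that lies entirely within one VMA) can be sketched
from userspace roughly as below. This is only an illustration under
stated assumptions: the function name dontneed_single_vma_demo() is
hypothetical, the mapping is assumed to be a single anonymous private
VMA because it comes from one mmap() call, and whether the kernel
actually takes the per-VMA lock or falls back to mmap_lock is not
observable from this code.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): maps four anonymous pages,
 * dirties them, then drops the middle two with MADV_DONTNEED.
 * Returns 0 if the advised pages read back as zero and the pages
 * outside the advised range keep their contents, -1 otherwise.
 */
static int dontneed_single_vma_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    size_t len = 4 * (size_t)page;
    unsigned char *heap = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (heap == MAP_FAILED)
        return -1;

    /* Touch every page so there is something to zap. */
    memset(heap, 0xAA, len);

    /* The advised range [page, 3 * page) sits inside the one mapping. */
    if (madvise(heap + page, 2 * (size_t)page, MADV_DONTNEED) != 0) {
        munmap(heap, len);
        return -1;
    }

    /* Anonymous private pages are zero-filled on the next access. */
    int ok = heap[page] == 0 &&
             heap[3 * page - 1] == 0 &&
             heap[0] == 0xAA &&
             heap[len - 1] == 0xAA;

    munmap(heap, len);
    return ok ? 0 : -1;
}
```

Calling dontneed_single_vma_demo() from a small main() and checking for
a return value of 0 exercises the single-VMA path from the userspace
side; a range spanning two separate mappings would be the case that the
patch description says falls back to mmap_lock.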