From: Barry Song <21cnbao@gmail.com>
Date: Sat, 31 May 2025 06:00:22 +0800
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
To: Jann Horn
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, "Liam R. Howlett", Lorenzo Stoakes, David Hildenbrand, Vlastimil Babka, Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng
References: <20250530104439.64841-1-21cnbao@gmail.com>
Content-Type: text/plain; charset="UTF-8"
On Fri, May 30, 2025 at 10:07 PM Jann Horn wrote:
>
> On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@gmail.com> wrote:
> > Certain madvise operations, especially MADV_DONTNEED, occur far more
> > frequently than other madvise options, particularly in native and Java
> > heaps for dynamic memory management.
> >
> > Currently, the mmap_lock is always held during these operations, even when
> > unnecessary. This causes lock contention and can lead to severe priority
> > inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> > hold the lock and block higher-priority threads.
> >
> > This patch enables the use of per-VMA locks when the advised range lies
> > entirely within a single VMA, avoiding the need for full VMA traversal. In
> > practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
> >
> > Tangquan's testing shows that over 99.5% of memory reclaimed by Android
> > benefits from this per-VMA lock optimization. After extended runtime,
> > 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> > only 1,231 fell back to mmap_lock.
> >
> > To simplify handling, the implementation falls back to the standard
> > mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity
> > of userfaultfd_remove().
>
> One important quirk of this is that it can, from what I can see, cause
> freeing of page tables (through pt_reclaim) without holding the mmap
> lock at all:
>
> do_madvise [behavior=MADV_DONTNEED]
>   madvise_lock
>     lock_vma_under_rcu
>   madvise_do_behavior
>     madvise_single_locked_vma
>       madvise_vma_behavior
>         madvise_dontneed_free
>           madvise_dontneed_single_vma
>             zap_page_range_single_batched [.reclaim_pt = true]
>               unmap_single_vma
>                 unmap_page_range
>                   zap_p4d_range
>                     zap_pud_range
>                       zap_pmd_range
>                         zap_pte_range
>                           try_get_and_clear_pmd
>                           free_pte
>
> This clashes with the assumption in walk_page_range_novma() that
> holding the mmap lock in write mode is sufficient to prevent
> concurrent page table freeing, so it can probably lead to page table
> UAF through the ptdump interface (see ptdump_walk_pgd()).
>
> I think before this patch can land, you'll have to introduce some new
> helper like:
>
> void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
> {
>         mmap_write_lock(mm);
>         for_each_vma(vmi, vma)
>                 vma_start_write(vma);
> }
>
> and use that in walk_page_range_novma() for user virtual address space
> walks, and update the comment in there.
>
> > Cc: "Liam R. Howlett"
> > Cc: Lorenzo Stoakes
> > Cc: David Hildenbrand
> > Cc: Vlastimil Babka
> > Cc: Jann Horn
> > Cc: Suren Baghdasaryan
> > Cc: Lokesh Gidra
> > Cc: Tangquan Zheng
> > Signed-off-by: Barry Song
> [...]
> > +static void madvise_unlock(struct mm_struct *mm,
> > +                          struct madvise_behavior *madv_behavior)
> > +{
> > +       if (madv_behavior->vma)
> > +               vma_end_read(madv_behavior->vma);
>
> Please set madv_behavior->vma to NULL here, so that if madvise_lock()
> was called on madv_behavior again and decided to take the mmap lock
> that time, the next madvise_unlock() wouldn't take the wrong branch
> here.
>
> > +       else
> > +               __madvise_unlock(mm, madv_behavior->behavior);
> > +}
> > +
> >  static bool madvise_batch_tlb_flush(int behavior)
> >  {
> >         switch (behavior) {
> > @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
> >                 unsigned long start, size_t len_in,
> >                 struct madvise_behavior *madv_behavior)
> >  {
> > +       struct vm_area_struct *vma = madv_behavior->vma;
> >         int behavior = madv_behavior->behavior;
> > +
> >         struct blk_plug plug;
> >         unsigned long end;
> >         int error;
> >
> >         if (is_memory_failure(behavior))
> >                 return madvise_inject_error(behavior, start, start + len_in);
> > -       start = untagged_addr_remote(mm, start);
> > +       start = untagged_addr(start);
>
> Why is this okay? I see that X86's untagged_addr_remote() asserts that
> the mmap lock is held, which is no longer the case here with your
> patch, but untagged_addr() seems wrong here, since we can be operating
> on another process. I think especially on X86 with 5-level paging and
> LAM, there can probably be cases where address bits are used for part
> of the virtual address in one task while they need to be masked off in
> another task?
>
> I wonder if you'll have to refactor X86 and Risc-V first to make this
> work... ideally by making sure that their address tagging state
> updates are atomic and untagged_area_remote() works locklessly.

If possible, can we try to avoid this at least for this stage? We all
agree that a per-VMA lock for DONTNEED is long overdue. The main goal
of the patch is to drop the mmap_lock for high-frequency madvise
operations like MADV_DONTNEED and potentially MADV_FREE. For these two
cases, it's highly unlikely that one process would be managing the
memory of another. In v2, we're modifying common code, which is why we
ended up here.
We could consider doing:

if (current->mm == mm)
        untagged_addr(start);
else
        untagged_addr_remote(mm, start);

As for remote madvise operations like MADV_COLD, until we resolve the
issue with untagged_addr_remote—which still requires mmap_lock—we
can defer consideration of remote madvise cases.

> (Or you could try to use something like the
> mmap_write_lock_with_all_vmas() I proposed above for synchronizing
> against untagged_addr(), first write-lock the MM and then write-lock
> all VMAs in it...)
>
> >         end = start + PAGE_ALIGN(len_in);
> >
> >         blk_start_plug(&plug);
> >         if (is_madvise_populate(behavior))
> >                 error = madvise_populate(mm, start, end, behavior);
> > +       else if (vma)
> > +               error = madvise_single_locked_vma(vma, start, end,
> > +                               madv_behavior, madvise_vma_behavior);
> >         else
> >                 error = madvise_walk_vmas(mm, start, end, madv_behavior,
> >                                 madvise_vma_behavior);
> > @@ -1847,7 +1908,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >
> >         total_len = iov_iter_count(iter);
> >
> > -       ret = madvise_lock(mm, behavior);
> > +       ret = __madvise_lock(mm, behavior);
> >         if (ret)
> >                 return ret;
> >         madvise_init_tlb(&madv_behavior, mm);
>
> (I think Lorenzo was saying that he would like madvise_lock() to also
> be used in vector_madvise()? But I'll let him comment on that.)

Thanks
Barry