From: Jann Horn <jannh@google.com>
Date: Fri, 30 May 2025 16:06:30 +0200
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Barry Song, "Liam R. Howlett",
 Lorenzo Stoakes, David Hildenbrand, Vlastimil Babka,
 Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng
In-Reply-To: <20250530104439.64841-1-21cnbao@gmail.com>
Howlett" , Lorenzo Stoakes , David Hildenbrand , Vlastimil Babka , Suren Baghdasaryan , Lokesh Gidra , Tangquan Zheng Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: E65211C0008 X-Stat-Signature: 8megowdpfqgwyoigs1eqb7x3u5jrufpy X-Rspam-User: X-Rspamd-Server: rspam04 X-HE-Tag: 1748614027-396118 X-HE-Meta: U2FsdGVkX18I9QRN9qBoLpcPTkzRALzkumsu63dYektFn3Fq6fUXYajxCIE5I0XuNp6hjetqtyqUMt/zkp77EsI5pRGjT7Cnb/L8yPwbAu1V0ggpaYqu9TpPuVMkFp7CwbqBe6n7QZEOmjMaHkwz1OIf/FFUgdvGnyEXGBmylE61Tberk9NOzeKljaytzsV7qmWRwTj2I1uomFM2yJtrb3eVj79yBL4WM1tiU7Fr89TIq1GGm+m8j/KKs2/FkH0771YdlEIP3r0/wO/GZDUbFiK8um4exxQPQ8uSokukPrNgWRDv7sAvgF5eIsWoX1A8jTK9Mx/Infq7HSFtwILkZRTLxza4IQjQycC2EQVWvjW5wIjhQrCV/T6bof4LB2KNnQRFFCuB9MQXT075lX6VOeeGiJkB93y/23gsY7vmEOrttmng3Xl7+k9tptsUu3frrnluNfPK1EsRe4BpqJ4+wLg1DhQ+jPxn/MoraWItusqdp5w9hjHS3BFYHYXP97m3qSn1I3AgdxnuDGhtGiSqfVFPORPNQUWpgRE9EtHGiaLGnBlOvq/IW2IIM92LV+gXRd/LtEs8QxHsnFdvNh5ZR8sZJf6Q0VjY/8KpHr4PNWp8SH25FQBnJ8gK7hsSFAGo0tD2tG6e8p9qWq5S8e16jQdi7CVb1dZdOfGv42o36VqRgX9OvlQoBznvVMEgxXYCjlMRsPgzKHBxhGFmBCbm+a5pBXY7SbXVswjp7rypY3U9G7F0geC6rJ7Z1fnQcWOzjkgA7wEpmTGiAfJHBEjKEL+tkgeiKabQZ4oov/SsrpGsLJoJcTe9CVaAIdSvs/BMdgP0Ea+hsmeQ3oqpGGz6AB3CIJiM8wHjC2bCOjqTKDklIRTbVkxIk5cmVppyunxNXBsDRls4X+sfZtPrwjbWztkG2k00t46MAZAVA15YrWjgdL6z6e7wSqOVfxHfQZa9Y4q1KTexRRI= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, May 30, 2025 at 12:44=E2=80=AFPM Barry Song <21cnbao@gmail.com> wro= te: > Certain madvise operations, especially MADV_DONTNEED, occur far more > frequently than other madvise options, particularly in native and Java > heaps for dynamic memory management. > > Currently, the mmap_lock is always held during these operations, even whe= n > unnecessary. This causes lock contention and can lead to severe priority > inversion, where low-priority threads=E2=80=94such as Android's HeapTaskD= aemon=E2=80=94 > hold the lock and block higher-priority threads. > > This patch enables the use of per-VMA locks when the advised range lies > entirely within a single VMA, avoiding the need for full VMA traversal. I= n > practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs= . > > Tangquan=E2=80=99s testing shows that over 99.5% of memory reclaimed by A= ndroid > benefits from this per-VMA lock optimization. After extended runtime, > 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while > only 1,231 fell back to mmap_lock. > > To simplify handling, the implementation falls back to the standard > mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity o= f > userfaultfd_remove(). One important quirk of this is that it can, from what I can see, cause freeing of page tables (through pt_reclaim) without holding the mmap lock at all: do_madvise [behavior=3DMADV_DONTNEED] madvise_lock lock_vma_under_rcu madvise_do_behavior madvise_single_locked_vma madvise_vma_behavior madvise_dontneed_free madvise_dontneed_single_vma zap_page_range_single_batched [.reclaim_pt =3D true] unmap_single_vma unmap_page_range zap_p4d_range zap_pud_range zap_pmd_range zap_pte_range try_get_and_clear_pmd free_pte This clashes with the assumption in walk_page_range_novma() that holding the mmap lock in write mode is sufficient to prevent concurrent page table freeing, so it can probably lead to page table UAF through the ptdump interface (see ptdump_walk_pgd()). 
I think before this patch can land, you'll have to introduce some new
helper like:

void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
{
	struct vm_area_struct *vma;
	VMA_ITERATOR(vmi, mm, 0);

	mmap_write_lock(mm);
	/* Also take every per-VMA lock, so that per-VMA-lock readers
	 * are excluded as well as mmap_lock readers. */
	for_each_vma(vmi, vma)
		vma_start_write(vma);
}

and use that in walk_page_range_novma() for user virtual address
space walks, and update the comment in there.

> Cc: "Liam R. Howlett"
> Cc: Lorenzo Stoakes
> Cc: David Hildenbrand
> Cc: Vlastimil Babka
> Cc: Jann Horn
> Cc: Suren Baghdasaryan
> Cc: Lokesh Gidra
> Cc: Tangquan Zheng
> Signed-off-by: Barry Song
[...]
> +static void madvise_unlock(struct mm_struct *mm,
> +			   struct madvise_behavior *madv_behavior)
> +{
> +	if (madv_behavior->vma)
> +		vma_end_read(madv_behavior->vma);

Please set madv_behavior->vma to NULL here, so that if madvise_lock()
is called on madv_behavior again and decides to take the mmap lock
that time, the next madvise_unlock() won't take the wrong branch here.

> +	else
> +		__madvise_unlock(mm, madv_behavior->behavior);
> +}
> +
>  static bool madvise_batch_tlb_flush(int behavior)
>  {
>  	switch (behavior) {
> @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
>  		unsigned long start, size_t len_in,
>  		struct madvise_behavior *madv_behavior)
>  {
> +	struct vm_area_struct *vma = madv_behavior->vma;
>  	int behavior = madv_behavior->behavior;
> +
>  	struct blk_plug plug;
>  	unsigned long end;
>  	int error;
>
>  	if (is_memory_failure(behavior))
>  		return madvise_inject_error(behavior, start, start + len_in);
> -	start = untagged_addr_remote(mm, start);
> +	start = untagged_addr(start);

Why is this okay? I see that X86's untagged_addr_remote() asserts
that the mmap lock is held, which is no longer the case here with
your patch, but untagged_addr() seems wrong here, since we can be
operating on another process. I think especially on X86 with 5-level
paging and LAM, there can probably be cases where address bits are
used for part of the virtual address in one task while they need to
be masked off in another task?

I wonder if you'll have to refactor X86 and RISC-V first to make this
work... ideally by making sure that their address tagging state
updates are atomic and untagged_addr_remote() works locklessly.

(Or you could try to use something like the
mmap_write_lock_with_all_vmas() I proposed above for synchronizing
against untagged_addr(): first write-lock the MM and then write-lock
all VMAs in it...)

>  	end = start + PAGE_ALIGN(len_in);
>
>  	blk_start_plug(&plug);
>  	if (is_madvise_populate(behavior))
>  		error = madvise_populate(mm, start, end, behavior);
> +	else if (vma)
> +		error = madvise_single_locked_vma(vma, start, end,
> +				madv_behavior, madvise_vma_behavior);
>  	else
>  		error = madvise_walk_vmas(mm, start, end, madv_behavior,
>  				madvise_vma_behavior);
> @@ -1847,7 +1908,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
>  	total_len = iov_iter_count(iter);
>
> -	ret = madvise_lock(mm, behavior);
> +	ret = __madvise_lock(mm, behavior);
>  	if (ret)
>  		return ret;
>  	madvise_init_tlb(&madv_behavior, mm);

(I think Lorenzo was saying that he would like madvise_lock() to also
be used in vector_madvise()? But I'll let him comment on that.)
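To spell out the madvise_unlock() change requested above, the
fixed-up function could look something like this (a sketch assembled
only from the quoted hunk; __madvise_unlock() and the madv_behavior
layout come from Barry's patch):

static void madvise_unlock(struct mm_struct *mm,
			   struct madvise_behavior *madv_behavior)
{
	if (madv_behavior->vma) {
		vma_end_read(madv_behavior->vma);
		/* Clear the cached VMA so that a later madvise_lock()
		 * that falls back to mmap_lock is not paired with a
		 * per-VMA unlock here. */
		madv_behavior->vma = NULL;
	} else {
		__madvise_unlock(mm, madv_behavior->behavior);
	}
}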