From: Barry Song <21cnbao@gmail.com>
Date: Sat, 31 May 2025 04:17:51 +0800
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
To: Lorenzo Stoakes
Cc: Jann Horn, akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Barry Song, "Liam R. Howlett",
	David Hildenbrand, Vlastimil Babka, Suren Baghdasaryan,
	Lokesh Gidra, Tangquan Zheng
In-Reply-To: <002aa917-d952-491d-800c-88a0476ac02f@lucifer.local>
References: <20250530104439.64841-1-21cnbao@gmail.com> <002aa917-d952-491d-800c-88a0476ac02f@lucifer.local>
Howlett" , David Hildenbrand , Vlastimil Babka , Suren Baghdasaryan , Lokesh Gidra , Tangquan Zheng Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: 2D1B918000B X-Stat-Signature: daq9cm51bas4izkxdwyb41w9sap4xmhi X-Rspam-User: X-HE-Tag: 1748636285-546164 X-HE-Meta: U2FsdGVkX1/oxCFJDrrYsGuWr8jupp34YMIgoxx1Zh3rxY3rQyqQW9zppHbi/TQ6HlgH1+K/2Gm9rXNhgqFGxttCUGFegD+gnJ7SVzJ7chQf5j2FeedlQI11NozoJ+nvKcmYQMU0ss9XPZbKHg3IR/LQudqILlR+UZ1Zf9BvjECqNVL0w77SqXVSKZ/7uudpNUjVb5VfrTxCUPyLTfSWiSkKyxByh42+o8x8xiYVXUbv28o46r79h461dU5vR+BY4RhYI6pkbmObAOgPD3q/kE8sp3QphksrAoqKkk6cPVTwjPk1wD4NRqu16hUsyOVv7c4p0r+Rg9sFtW0vO+5LEWJ66sVsn2DkGEEQdP5p/jqPQvXyM2jY8s8IOBzWaSSfss4xRV/ZiexfnIrBhESREHi7vF4UMf9zEOTGFEQE2oCVGlBH6mD2tBjBf2PLraAYjwjYAJ9MjDNUhTVcadap1Tnn6CWeWzsyehwMf7M+G3EvTvmYafSYcoJb8d6aOKzC/RQTPqjZ8dgElIFKNlUwB/UL9D4IfSmGsuLTKj/xx8mGjMpgO0Q2eQ3RK+t0Ju1m5PQ/8US5Ubn/rNJ1Y1w13A3U/TAfd/idOfckTt69c1V51JW5NSgypCGy/9PLD4mMWOgfzdaooNofTJjHV7aXV2cDr5n/avBJKrvkfw4cT05M8jias562+hV4zJ6RSHYldiEUT/J+YZadV9fRqQXx9rcQkro2vxuj21zKE0BECER/Q5CIBL2Ur9IsZs32sKCRpzcKv/zqGlEvFx32VDFSPHz4htozOrBcvQdYa/neYL/aSQwIf/GEA9nYjlAhAAQHUoBHMtehecN9FyLP7fAtp6SPtXPFol8a7hxlHBslHTnMLRVAaGKpmJWI5hMog4tgCVSOJxgMcKwUHSoUzd5rnJNkNf6W4UbFOrzTD2vwpcw3hCppVyIuUtB33YvWWdgEIwu+Fvz80V71oBxdxhd THbr2o+Z OnICRMWbV3O2CmAukcAiw/6UldjVTiILKfhYSbJknagfXbONkdXG81ckXt4Iy8OlTwSXIf/P7/Z506ixWQndrv2Yk7BAuhj4HSxP7IGBNQJSCwcg/yFRI8T1y/M6SfCzMR0VzTKLpds5CCiru6U8xaNlqrf4+nqYGlQQmNYR9B/gSoos+7OGJyKw7F895o+PKlslZyeGN+YfhOl8v5HLXnMOQEL+U6uH6rxG0fHlq/2m4g9sELyXqITSNKJof2Z6FeYzMH4TC7qefcfMJk4vTy7ZlTaQATp3XaIRQFsHOUiIYRkuhWUweWw3ZDxZIt5uN2qUy8H2YwsrFyWMMg7yP7HPqbLFiBU0h1PXzNyO2JMIbdu1U9u58aRaUmQSORNq3ROfTzHQa46EhwHFMEALjyq5KQd61ozFR8jg4m5P9Vdol29tU3/Hb/7RHngT3xvREop7asFRx4s3byCmDucZUB0+4ZEEs5RKcnTkeCKGktIyzCLIhE/mrV8Zj73XmWn4Azawp X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, May 30, 2025 at 10:34=E2=80=AFPM Lorenzo Stoakes wrote: > > Barry - I was going to come back to this later, but Jann's sort of bumped > this in my inbox. > > This implementation isn't quite what I was after, would you give me a > little bit before a respin so I can have a think about this and make > sensible suggestions? Sure. > > Thanks! > > On Fri, May 30, 2025 at 04:06:30PM +0200, Jann Horn wrote: > > On Fri, May 30, 2025 at 12:44=E2=80=AFPM Barry Song <21cnbao@gmail.com>= wrote: > > > Certain madvise operations, especially MADV_DONTNEED, occur far more > > > frequently than other madvise options, particularly in native and Jav= a > > > heaps for dynamic memory management. > > > > > > Currently, the mmap_lock is always held during these operations, even= when > > > unnecessary. This causes lock contention and can lead to severe prior= ity > > > inversion, where low-priority threads=E2=80=94such as Android's HeapT= askDaemon=E2=80=94 > > > hold the lock and block higher-priority threads. > > > > > > This patch enables the use of per-VMA locks when the advised range li= es > > > entirely within a single VMA, avoiding the need for full VMA traversa= l. In > > > practice, userspace heaps rarely issue MADV_DONTNEED across multiple = VMAs. > > > > > > Tangquan=E2=80=99s testing shows that over 99.5% of memory reclaimed = by Android > > > benefits from this per-VMA lock optimization. 
> > > 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> > > only 1,231 fell back to mmap_lock.
> > >
> > > To simplify handling, the implementation falls back to the standard
> > > mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> > > userfaultfd_remove().
> >
> > One important quirk of this is that it can, from what I can see, cause
> > freeing of page tables (through pt_reclaim) without holding the mmap
> > lock at all:
> >
> > do_madvise [behavior=MADV_DONTNEED]
> >   madvise_lock
> >     lock_vma_under_rcu
> >   madvise_do_behavior
> >     madvise_single_locked_vma
> >       madvise_vma_behavior
> >         madvise_dontneed_free
> >           madvise_dontneed_single_vma
> >             zap_page_range_single_batched [.reclaim_pt = true]
> >               unmap_single_vma
> >                 unmap_page_range
> >                   zap_p4d_range
> >                     zap_pud_range
> >                       zap_pmd_range
> >                         zap_pte_range
> >                           try_get_and_clear_pmd
> >                           free_pte
> >
> > This clashes with the assumption in walk_page_range_novma() that
> > holding the mmap lock in write mode is sufficient to prevent
> > concurrent page table freeing, so it can probably lead to page table
> > UAF through the ptdump interface (see ptdump_walk_pgd()).
>
> Hmmmmmm is this because of the series that allows page table freeing on
> zap... I think Zi's?
>
> We need to update the documentation on this then... which currently states
> the VMA need only be stable.
>
> I guess this is still the case except for the novma walker you mention.
>
> Relatedly, it's worth looking at Dev's series which introduces a concerning
> new 'no lock at all' mode to the page table walker explicitly for novma. I
> cc'd you :) See [0].
>
> [0]: https://lore.kernel.org/linux-mm/6a60c052-9935-489e-a38e-1b03a1a79155@lucifer.local/
>
> >
> > I think before this patch can land, you'll have to introduce some new
> > helper like:
> >
> > void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
> > {
> >         mmap_write_lock(mm);
> >         for_each_vma(vmi, vma)
> >                 vma_start_write(vma);
> > }
> >
> > and use that in walk_page_range_novma() for user virtual address space
> > walks, and update the comment in there.
>
> What dude? No, what? Marking literally all VMAs write locked? :/
>
> I think this could have unexpected impact, no? We're basically disabling VMA
> locking when we're in novma, that seems... really silly?
>
> >
> > > Cc: "Liam R. Howlett"
> > > Cc: Lorenzo Stoakes
> > > Cc: David Hildenbrand
> > > Cc: Vlastimil Babka
> > > Cc: Jann Horn
> > > Cc: Suren Baghdasaryan
> > > Cc: Lokesh Gidra
> > > Cc: Tangquan Zheng
> > > Signed-off-by: Barry Song
> > [...]
> > > +static void madvise_unlock(struct mm_struct *mm,
> > > +                          struct madvise_behavior *madv_behavior)
> > > +{
> > > +       if (madv_behavior->vma)
> > > +               vma_end_read(madv_behavior->vma);
> >
> > Please set madv_behavior->vma to NULL here, so that if madvise_lock()
> > was called on madv_behavior again and decided to take the mmap lock
> > that time, the next madvise_unlock() wouldn't take the wrong branch
> > here.

I actually put some words in for vector_madvise():

"
 * ideally, for vector_madvise(), we are able to make the decision of
 * lock types for each iteration; for this moment, we still use the
 * global lock.
"

I held off on that one because I'd rather get feedback before going too
far - so vector_madvise() was only touched by switching it to
__madvise_lock() and __madvise_unlock(). For that case, we might need to
take madvise_lock() again after releasing it; otherwise, this is not the
case.
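
Agreed on resetting ->vma, though. For concreteness, an untested sketch
of madvise_unlock() with Jann's suggestion folded in - this just reuses
the naming from this RFC (madvise_behavior, __madvise_unlock), so treat
it as illustrative only:

static void madvise_unlock(struct mm_struct *mm,
			   struct madvise_behavior *madv_behavior)
{
	if (madv_behavior->vma) {
		vma_end_read(madv_behavior->vma);
		/*
		 * Reset so that a later madvise_lock() that falls back
		 * to mmap_lock cannot steer the next unlock down the
		 * per-VMA branch by mistake.
		 */
		madv_behavior->vma = NULL;
	} else {
		__madvise_unlock(mm, madv_behavior->behavior);
	}
}

That should keep a reused madv_behavior safe across a lock/unlock cycle
that switches lock types.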
BTW, I found that vector_madvise() doesn't check the return value of
madvise_lock(); that seems to be a bug as well? (A possible fix is
sketched further down.)

static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
		int behavior)
{
	...
			/* Drop and reacquire lock to unwind race. */
			madvise_finish_tlb(&madv_behavior);
			madvise_unlock(mm, behavior);
			madvise_lock(mm, behavior);	/* missing the ret check */
			madvise_init_tlb(&madv_behavior, mm);
	...
}

>
> Yeah I'm not a fan of having the vma referenced here, this isn't quite what
> I meant.
>
> >
> > > +       else
> > > +               __madvise_unlock(mm, madv_behavior->behavior);
> > > +}
> > > +
> > >  static bool madvise_batch_tlb_flush(int behavior)
> > >  {
> > >         switch (behavior) {
> > > @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > >                 unsigned long start, size_t len_in,
> > >                 struct madvise_behavior *madv_behavior)
> > >  {
> > > +       struct vm_area_struct *vma = madv_behavior->vma;
> > >         int behavior = madv_behavior->behavior;
> > > +
> > >         struct blk_plug plug;
> > >         unsigned long end;
> > >         int error;
> > >
> > >         if (is_memory_failure(behavior))
> > >                 return madvise_inject_error(behavior, start, start + len_in);
> > > -       start = untagged_addr_remote(mm, start);
> > > +       start = untagged_addr(start);
> >
> > Why is this okay? I see that X86's untagged_addr_remote() asserts that
> > the mmap lock is held, which is no longer the case here with your
> > patch, but untagged_addr() seems wrong here, since we can be operating
> > on another process. I think especially on X86 with 5-level paging and
> > LAM, there can probably be cases where address bits are used for part
> > of the virtual address in one task while they need to be masked off in
> > another task?
> >
> > I wonder if you'll have to refactor X86 and Risc-V first to make this
> > work... ideally by making sure that their address tagging state
> > updates are atomic and untagged_addr_remote() works locklessly.
>
> Yeah I don't know why we're doing this at all? This seems new unless I
> missed it?

We might call madvise_do_behavior() under the per-VMA lock, but
untagged_addr_remote() always asserts the mmap_lock, which would also be
asserted by find_vma() in madvise_walk_vmas(). So at least for
architectures other than risc-v and x86, there is no difference:

include/linux/uaccess.h

#ifndef untagged_addr_remote
#define untagged_addr_remote(mm, addr)	({	\
	mmap_assert_locked(mm);			\
	untagged_addr(addr);			\
})
#endif

I didn't realize madv_dontneed could be done on a remote process, could
it?

> >
> > (Or you could try to use something like the
> > mmap_write_lock_with_all_vmas() I proposed above for synchronizing
> > against untagged_addr(), first write-lock the MM and then write-lock
> > all VMAs in it...)
>
> This would completely eliminate the point of this patch, no? The whole point
> is not taking these locks... And I'm very much not in favour of
> write-locking literally every single VMA under any circumstances.
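
Coming back to the missing return-value check in vector_madvise() noted
above, an untested sketch of a possible fix, keeping the loose naming of
my snippet - how exactly to unwind on failure (e.g. whether to report
partial progress) is a separate question:

			/* Drop and reacquire lock to unwind race. */
			madvise_finish_tlb(&madv_behavior);
			madvise_unlock(mm, behavior);
			ret = madvise_lock(mm, behavior);
			if (ret)
				break;	/* cannot continue without the lock */
			madvise_init_tlb(&madv_behavior, mm);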
>
> > >         end = start + PAGE_ALIGN(len_in);
> > >
> > >         blk_start_plug(&plug);
> > >         if (is_madvise_populate(behavior))
> > >                 error = madvise_populate(mm, start, end, behavior);
> > > +       else if (vma)
> > > +               error = madvise_single_locked_vma(vma, start, end,
> > > +                               madv_behavior, madvise_vma_behavior);
> > >         else
> > >                 error = madvise_walk_vmas(mm, start, end, madv_behavior,
> > >                                 madvise_vma_behavior);
> > > @@ -1847,7 +1908,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> > >
> > >         total_len = iov_iter_count(iter);
> > >
> > > -       ret = madvise_lock(mm, behavior);
> > > +       ret = __madvise_lock(mm, behavior);
> > >         if (ret)
> > >                 return ret;
> > >         madvise_init_tlb(&madv_behavior, mm);
> >
> > (I think Lorenzo was saying that he would like madvise_lock() to also
> > be used in vector_madvise()? But I'll let him comment on that.)
>
> Yeah I'm not a fan of this implementation, it's not really what I had in
> mind, as per top of mail.
>
> Will come back with suggestions later.

thanks!

> Thanks!

Barry