From: Jann Horn <jannh@google.com>
Date: Tue, 3 Jun 2025 18:52:55 +0200
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
To: Barry Song <21cnbao@gmail.com>
Cc: Lorenzo Stoakes, akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Barry Song, "Liam R. Howlett",
 David Hildenbrand, Vlastimil Babka, Suren Baghdasaryan,
 Lokesh Gidra, Tangquan Zheng
References: <20250530104439.64841-1-21cnbao@gmail.com>
 <002aa917-d952-491d-800c-88a0476ac02f@lucifer.local>

On Tue, Jun 3, 2025 at 9:06 AM Barry Song
<21cnbao@gmail.com> wrote:
> On Sat, May 31, 2025 at 8:41 AM Jann Horn wrote:
> > On Fri, May 30, 2025 at 4:34 PM Lorenzo Stoakes
> > wrote:
> > > On Fri, May 30, 2025 at 04:06:30PM +0200, Jann Horn wrote:
> > > > On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > One important quirk of this is that it can, from what I can see, cause
> > > > freeing of page tables (through pt_reclaim) without holding the mmap
> > > > lock at all:
> > > >
> > > > do_madvise [behavior=MADV_DONTNEED]
> > > >   madvise_lock
> > > >   lock_vma_under_rcu
> > > >   madvise_do_behavior
> > > >     madvise_single_locked_vma
> > > >       madvise_vma_behavior
> > > >         madvise_dontneed_free
> > > >           madvise_dontneed_single_vma
> > > >             zap_page_range_single_batched [.reclaim_pt = true]
> > > >               unmap_single_vma
> > > >                 unmap_page_range
> > > >                   zap_p4d_range
> > > >                     zap_pud_range
> > > >                       zap_pmd_range
> > > >                         zap_pte_range
> > > >                           try_get_and_clear_pmd
> > > >                           free_pte
> > > >
> > > > This clashes with the assumption in walk_page_range_novma() that
> > > > holding the mmap lock in write mode is sufficient to prevent
> > > > concurrent page table freeing, so it can probably lead to page table
> > > > UAF through the ptdump interface (see ptdump_walk_pgd()).
> > >
> > > Hmmmmmm is this because of the series that allows page table freeing on
> > > zap... I think Zi's?
> >
> > Yeah, that was Qi Zheng's
> > https://lore.kernel.org/all/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com/
> > .
> >
> > > We need to update the documentation on this then... which currently states
> > > the VMA need only be stable.
> > >
> > > I guess this is still the case except for the novma walker you mention.
> > >
> > > Relatedly, it's worth looking at Dev's series which introduces a concerning
> > > new 'no lock at all' mode to the page table walker explicitly for novma. I
> > > cc'd you :) See [0].
> > >
> > > [0]: https://lore.kernel.org/linux-mm/6a60c052-9935-489e-a38e-1b03a1a79155@lucifer.local/
> >
> > Yeah, I saw that you CC'ed me; at a first glance that seems relatively
> > innocuous to me as long as it's only done for kernel mappings where
> > all the rules are different.
> >
> > > >
> > > > I think before this patch can land, you'll have to introduce some new
> > > > helper like:
> > > >
> > > > void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
> > > > {
> > > >         mmap_write_lock(mm);
> > > >         for_each_vma(vmi, vma)
> > > >                 vma_start_write(vma);
> > > > }
> > > >
> > > > and use that in walk_page_range_novma() for user virtual address space
> > > > walks, and update the comment in there.
> > >
> > > What dude? No, what? Marking literally all VMAs write locked? :/
> > >
> > > I think this could have unexpected impact no? We're basically disabling VMA
> > > locking when we're in novma, that seems... really silly?
> >
> > I mean, walk_page_range_novma() being used on user virtual address
> > space is pretty much a debug-only thing, I don't think it matters if
> > it has to spend time poking flags in a few thousand VMAs. I guess the
> > alternative would be to say "ptdump just doesn't show entries between
> > VMAs, which shouldn't exist in the first place", and change ptdump to
> > do a normal walk that skips over userspace areas not covered by a VMA.
> > Maybe that's cleaner.
> >
> > But FWIW, we already do worse than what I proposed here when
> > installing MMU notifiers, with mm_take_all_locks().
> >
> > > > > +       else
> > > > > +               __madvise_unlock(mm, madv_behavior->behavior);
> > > > > +}
> > > > > +
> > > > >  static bool madvise_batch_tlb_flush(int behavior)
> > > > >  {
> > > > >         switch (behavior) {
> > > > > @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > > > >                 unsigned long start, size_t len_in,
> > > > >                 struct madvise_behavior *madv_behavior)
> > > > >  {
> > > > > +       struct vm_area_struct *vma = madv_behavior->vma;
> > > > >         int behavior = madv_behavior->behavior;
> > > > > +
> > > > >         struct blk_plug plug;
> > > > >         unsigned long end;
> > > > >         int error;
> > > > >
> > > > >         if (is_memory_failure(behavior))
> > > > >                 return madvise_inject_error(behavior, start, start + len_in);
> > > > > -       start = untagged_addr_remote(mm, start);
> > > > > +       start = untagged_addr(start);
> > > >
> > > > Why is this okay? I see that X86's untagged_addr_remote() asserts that
> > > > the mmap lock is held, which is no longer the case here with your
> > > > patch, but untagged_addr() seems wrong here, since we can be operating
> > > > on another process. I think especially on X86 with 5-level paging and
> > > > LAM, there can probably be cases where address bits are used for part
> > > > of the virtual address in one task while they need to be masked off in
> > > > another task?
> > > >
> > > > I wonder if you'll have to refactor X86 and RISC-V first to make this
> > > > work... ideally by making sure that their address tagging state
> > > > updates are atomic and untagged_addr_remote() works locklessly.
> > >
> > > Yeah I don't know why we're doing this at all? This seems new unless I
> > > missed it?
> >
> > Because untagged_addr_remote() has a mmap_assert_locked(mm) on x86 and
> > reads data that is updated under the mmap lock, I think? So without
> > this change you should get a lockdep splat on x86.
> >
> > > > (Or you could try to use something like the
> > > > mmap_write_lock_with_all_vmas() I proposed above for synchronizing
> > > > against untagged_addr(), first write-lock the MM and then write-lock
> > > > all VMAs in it...)
> > >
> > > This would completely eliminate the point of this patch no? The whole point
> > > is not taking these locks... And I'm very much not in favour of
> > > write-locking literally every single VMA, under any circumstances.
> >
> > I'm talking about doing this heavyweight locking in places like
> > arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) that can, if I understand
> > correctly, essentially reconfigure the size of the virtual address
> > space of a running process from 56-bit to 47-bit at the hardware level
> > and cause address bits that were previously part of the virtual
> > address to be ignored. READ_ONCE()/WRITE_ONCE() might do the job too,
> > but then we'll have to keep in mind that two subsequent invocations of
> > untagged_addr() can translate a userspace-specified virtual address
> > into two different virtual addresses at the page table level.
>
> I'm confused about how arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) can
> reconfigure a running process from using 56-bit addresses to 47-bit.
> I read the code and see the x86 kernel only supports LAM U57, and not
> LAM U48 at all:
>
> static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
> {
> ...
>
>         if (!nr_bits || nr_bits > LAM_U57_BITS) {
>                 mmap_write_unlock(mm);
>                 return -EINVAL;
>         }
>
>         mm_enable_lam(mm);
>         mmap_write_unlock(mm);
>
>         return 0;
> }

Oh, you're right, currently only LAM U57 is supported by Linux, I was
making bad assumptions. Commit 2f8794bd087e, which introduced that
code, also mentions that "For now only LAM_U57 is supported, with 6
tag bits".

> I still don't fully understand why x86 differs from ARM64,
> where the same bit mask is always applied unconditionally.
>
> On ARM64, we can even enable or disable PROT_MTE on a per-VMA basis
> using mmap or mprotect. However, the same bitmask operation is
> always executed regardless of whether memory tags are present for a
> given VMA.

Hmm, true, that does look like a weird difference.

> I mean, on arm64, if a process or a VMA doesn't have tag access
> enabled, and we pass an address with high bits to madvise,
> untagged_addr() will still strip the tag. But wouldn't that address
> be invalid for a process or VMA that doesn't have TBI enabled?

Yeah, I guess in that regard it might be fine to always strip the
bits... Maybe the situation on arm64 is simpler, in that these bits
are always either tag bits or unused on arm64, while my understanding
is that on X86, the CPU supports changing the meaning of address bits
between "part of virtual address" and "ignored tag bits". So I think
stripping the bits might work fine for LAM U57, but not for LAM U48,
and maybe the code is trying to be future-proof in case someone wants
to add support for LAM U48?

It is arguably also a bit more robust to reject garbage addresses
instead of ignoring bits that would cause the CPU to treat the address
as noncanonical, but I guess doing that just in MM code is not a big
problem. (Unless someone relies on seccomp filters that block
manipulation of specific virtual address ranges via munmap() and such,
but I think almost nobody does that. I think I've only seen that once
in some rarely-used component of the Tor project.)

But I am out of my depth here and might be severely misunderstanding
what's going on.