From: Barry Song <21cnbao@gmail.com>
Date: Tue, 3 Jun 2025 19:06:16 +1200
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
To: Jann Horn
Cc: Lorenzo Stoakes, akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Barry Song, "Liam R. Howlett",
 David Hildenbrand, Vlastimil Babka, Suren Baghdasaryan,
 Lokesh Gidra, Tangquan Zheng
References: <20250530104439.64841-1-21cnbao@gmail.com>
 <002aa917-d952-491d-800c-88a0476ac02f@lucifer.local>
Howlett" , David Hildenbrand , Vlastimil Babka , Suren Baghdasaryan , Lokesh Gidra , Tangquan Zheng Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 05E2B20010 X-Stat-Signature: keszuqx7eyg7bgixduto58dydu7cme38 X-Rspam-User: X-HE-Tag: 1748934388-410533 X-HE-Meta: U2FsdGVkX19U1uIjiBF8lDY+IFA4atWldyXcEQeHRVVx1e8Wmt79lxwGUXwZlQrhNW/7qHML4ZxzkSjGJ9JphKGntE6NDzOCVWpBuNFCebTemcOmiKltFXPk+YsFh+dz3Nl5tlcOTRo9uDqm2lc2ZS3Jx9ApVDAvjKlG+68Iphzovyn4Xv103+8jAnx/hM2HL5Yvp7cyEu4GRORaE6TmHCcCfr7AufrdYTcSM1G+Wcjf4NyQ1/krfV6s7gxXRr5E2R1wibUpM+zv4iLJvcSDI4OsGQYWe7wqp08hMc1j01EuMpAc24eO6KXs/nYp5+lEZpJpiCk5Q4XnWdOuUKawsdCKkTzWtaHeoEVHWOE6UXFOJVWj5drIQ468i23UH61q30PGe87rA4PzPND4AKezB/iCLQiV4sluaR3BZ74vKaPiq0C5A/rEeEpOxOH6HbNShvD9rEmUjMto0awflKA7uXYgYP5HYbO4POf04j+CRhMLkmTkRg07GNkImpks/bCp7OFvl+uU/fnDfkw4aOOVrBOnB2G02Hb4mweaqYaK7fj6869oGXnKW8tcugcDnTRlhBmxQHrkLKJGQcSEtyfmgdkHMEEF825KtV7LK6SrJZ+oOVH5w8pOlb8vTg63w3uRZHF6oHYg2tZjTc5vo05Uz0/sskooV5oUqmZv4ONRSgo1aaOZsIVJIp6z+zv4G464rtPYc+O/VknOx8GceO5Pbi00ys1ViAC3wxD23n01TKJl4EvFGttcPQPxKEFXQKdWchqe0jRPkGRA5HTpsvLNlp70rr5EDVeyRNuDMKgy8gNeCZKshD0HMW4QDwyeKhLOUg7l4/r1+9nT89ZtaCSrAo/gh8X+QhH5tTp4qSn5p84Nw1OhIQLYAk/7Z/SeIiByjQbrE713KJopr3Bl5Ki4ZWlDvSSvnGE5Uk3oG8mG1V1BUxYrPi6W/Xr+NJ5Nm51SLgJTb8EqI+oJqH8+FBu MK0kqnKb dVecwdbwcC28PKvFW+d8B2UYpdPb139X5Av4lIFiqQ/N571ETtQxW2MZqD4M7iyvOTW4IOl7LG9PTQvTdJyLKntGvCy/Ucu1N9SY9tRFJCXW62zeTFQ+JdU123rcv/AyxFWqIsWZLMjw2lq+KUtSlX47PKd0FqEJy5inwgkFmCG8Av94mf2MtTIz/l43xvQtvzgLMgq6516IzHvIJtXA2RkIyUNJHY0VoEFkpB3Mjlz4EmccobISAD+hq9PaUEx8li6zu5CQ4TeWyzSWtz1l597QeOk5WCYJorNKIplmRiyBzBCkcNjL8m7mZTH/0QqS/zhVK6XYway817Bds1B/UaCNctf8aOXeZR190nQjb3MLyyc3tOmhFR9KveM3pD7kn2U2Zp/ME/OSUG/nhl3naprNqkXDCjcSb8gV6J75PDhtm0BUTO34KK4dPFcmnWv0bbvzzFpnsCWJ6TLQ3HzmAg7ZXoxNtrK66edQFVchjIdm879w= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Sat, May 31, 2025 at 8:41=E2=80=AFAM Jann Horn wrote: > > On Fri, May 30, 2025 at 4:34=E2=80=AFPM Lorenzo Stoakes > wrote: > > Barry - I was going to come back to this later, but Jann's sort of bump= ed > > this in my inbox. > > > > This implementation isn't quite what I was after, would you give me a > > little bit before a respin so I can have a think about this and make > > sensible suggestions? > > > > Thanks! > > > > On Fri, May 30, 2025 at 04:06:30PM +0200, Jann Horn wrote: > > > On Fri, May 30, 2025 at 12:44=E2=80=AFPM Barry Song <21cnbao@gmail.co= m> wrote: > > > One important quirk of this is that it can, from what I can see, caus= e > > > freeing of page tables (through pt_reclaim) without holding the mmap > > > lock at all: > > > > > > do_madvise [behavior=3DMADV_DONTNEED] > > > madvise_lock > > > lock_vma_under_rcu > > > madvise_do_behavior > > > madvise_single_locked_vma > > > madvise_vma_behavior > > > madvise_dontneed_free > > > madvise_dontneed_single_vma > > > zap_page_range_single_batched [.reclaim_pt =3D true] > > > unmap_single_vma > > > unmap_page_range > > > zap_p4d_range > > > zap_pud_range > > > zap_pmd_range > > > zap_pte_range > > > try_get_and_clear_pmd > > > free_pte > > > > > > This clashes with the assumption in walk_page_range_novma() that > > > holding the mmap lock in write mode is sufficient to prevent > > > concurrent page table freeing, so it can probably lead to page table > > > UAF through the ptdump interface (see ptdump_walk_pgd()). 
> >
> > Hmmmmmm is this because of the series that allows page table freeing on
> > zap... I think Zi's?
>
> Yeah, that was Qi Zheng's
> https://lore.kernel.org/all/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com/
> .
> > We need to update the documentation on this then... which currently states
> > the VMA need only be stable.
> >
> > I guess this is still the case except for the novma walker you mention.
> >
> > Relatedly, it's worth looking at Dev's series which introduces a concerning
> > new 'no lock at all' mode to the page table walker explicitly for novma. I
> > cc'd you :) See [0].
> >
> > [0]: https://lore.kernel.org/linux-mm/6a60c052-9935-489e-a38e-1b03a1a79155@lucifer.local/
>
> Yeah, I saw that you CC'ed me; at a first glance that seems relatively
> innocuous to me as long as it's only done for kernel mappings where
> all the rules are different.
> > >
> > > I think before this patch can land, you'll have to introduce some new
> > > helper like:
> > >
> > > void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
> > > {
> > >         VMA_ITERATOR(vmi, mm, 0);
> > >         struct vm_area_struct *vma;
> > >
> > >         mmap_write_lock(mm);
> > >         for_each_vma(vmi, vma)
> > >                 vma_start_write(vma);
> > > }
> > >
> > > and use that in walk_page_range_novma() for user virtual address space
> > > walks, and update the comment in there.
> >
> > What dude? No, what? Marking literally all VMAs write locked? :/
> >
> > I think this could have unexpected impact no? We're basically disabling VMA
> > locking when we're in novma, that seems... really silly?
>
> I mean, walk_page_range_novma() being used on user virtual address
> space is pretty much a debug-only thing, I don't think it matters if
> it has to spend time poking flags in a few thousand VMAs. I guess the
> alternative would be to say "ptdump just doesn't show entries between
> VMAs, which shouldn't exist in the first place", and change ptdump to
> do a normal walk that skips over userspace areas not covered by a VMA.
> Maybe that's cleaner.
>
> But FWIW, we already do worse than what I proposed here when
> installing MMU notifiers, with mm_take_all_locks().
>
> > > > +       else
> > > > +               __madvise_unlock(mm, madv_behavior->behavior);
> > > > +}
> > > > +
> > > >  static bool madvise_batch_tlb_flush(int behavior)
> > > >  {
> > > >         switch (behavior) {
> > > > @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > > >                 unsigned long start, size_t len_in,
> > > >                 struct madvise_behavior *madv_behavior)
> > > >  {
> > > > +       struct vm_area_struct *vma = madv_behavior->vma;
> > > >         int behavior = madv_behavior->behavior;
> > > > +
> > > >         struct blk_plug plug;
> > > >         unsigned long end;
> > > >         int error;
> > > >
> > > >         if (is_memory_failure(behavior))
> > > >                 return madvise_inject_error(behavior, start, start + len_in);
> > > > -       start = untagged_addr_remote(mm, start);
> > > > +       start = untagged_addr(start);
> > >
> > > Why is this okay? I see that X86's untagged_addr_remote() asserts that
> > > the mmap lock is held, which is no longer the case here with your
> > > patch, but untagged_addr() seems wrong here, since we can be operating
> > > on another process. I think especially on X86 with 5-level paging and
> > > LAM, there can probably be cases where address bits are used for part
> > > of the virtual address in one task while they need to be masked off in
> > > another task?
>
> > > I wonder if you'll have to refactor X86 and RISC-V first to make this
> > > work... ideally by making sure that their address tagging state
> > > updates are atomic and untagged_addr_remote() works locklessly.
> >
> > Yeah I don't know why we're doing this at all? This seems new unless I
> > missed it?
>
> Because untagged_addr_remote() has a mmap_assert_locked(mm) on x86 and
> reads data that is updated under the mmap lock, I think? So without
> this change you should get a lockdep splat on x86.
> > > (Or you could try to use something like the
> > > mmap_write_lock_with_all_vmas() I proposed above for synchronizing
> > > against untagged_addr(), first write-lock the MM and then write-lock
> > > all VMAs in it...)
> > This would completely eliminate the point of this patch no? The whole point
> > is not taking these locks... And I'm very much not in favour of
> > write-locking literally every single VMA, under any circumstances.
>
> I'm talking about doing this heavyweight locking in places like
> arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) that can, if I understand
> correctly, essentially reconfigure the size of the virtual address
> space of a running process from 56-bit to 47-bit at the hardware level
> and cause address bits that were previously part of the virtual
> address to be ignored. READ_ONCE()/WRITE_ONCE() might do the job too,
> but then we'll have to keep in mind that two subsequent invocations of
> untagged_addr() can translate a userspace-specified virtual address
> into two different virtual addresses at the page table level.

I'm confused about how arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) can
reconfigure a running process from using 56-bit addresses to 47-bit.
I read the code and see that the x86 kernel only supports LAM U57,
not LAM U48 at all:

static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
{
	...
	if (!nr_bits || nr_bits > LAM_U57_BITS) {
		mmap_write_unlock(mm);
		return -EINVAL;
	}

	mm_enable_lam(mm);
	mmap_write_unlock(mm);
	return 0;
}

I still don't fully understand why x86 differs from ARM64, where the
same bit mask is always applied unconditionally. On ARM64, we can even
enable or disable PROT_MTE on a per-VMA basis using mmap() or
mprotect(); however, the same bitmask operation is always executed
regardless of whether memory tags are present for a given VMA.

I mean, on arm64, if a process or a VMA doesn't have tag access
enabled and we pass an address with high bits to madvise(),
untagged_addr() will still strip the tag. But wouldn't that address be
invalid for a process or VMA that doesn't have TBI enabled?

Thanks
Barry
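
P.S. To make my arm64 vs. x86 question concrete, here is a small
userspace sketch of the two untagging schemes as I understand them.
The struct and helper names (mm_ctx, arm64_untagged_addr,
x86_untagged_addr_remote) are illustrative stand-ins, not the kernel's
actual definitions; the arm64 behaviour is modelled on the
sign_extend64(addr, 55) in its untagged_addr(), and the x86 LAM U57
mask on the ~GENMASK(62, 57) installed when LAM is enabled, assuming
I'm reading those right:

#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for the per-mm untag state x86 keeps in mm->context. */
struct mm_ctx {
	uint64_t untag_mask;
};

/*
 * arm64-style untagging: stateless. Sign-extend from bit 55 (shift the
 * top byte out, arithmetic-shift it back in), so the top byte is
 * stripped the same way for every process and VMA, whether or not
 * TBI/MTE is actually in use there.
 */
static uint64_t arm64_untagged_addr(uint64_t addr)
{
	return (uint64_t)(((int64_t)(addr << 8)) >> 8);
}

/*
 * x86-LAM-style untagging: the mask is per-mm state that changes when
 * arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) succeeds -- which is why the
 * real untagged_addr_remote() wants the mmap lock held while reading it.
 */
static uint64_t x86_untagged_addr_remote(const struct mm_ctx *ctx, uint64_t addr)
{
	return addr & ctx->untag_mask;
}

int main(void)
{
	/* A user address carrying a tag in bits 62:57 (LAM U57 layout). */
	uint64_t tagged = (0x2aULL << 57) | 0x7fdeadb000ULL;

	struct mm_ctx lam_off = { .untag_mask = ~0ULL };            /* default */
	struct mm_ctx lam_on  = { .untag_mask = ~(0x3fULL << 57) }; /* LAM U57 */

	printf("arm64 (always): %#llx\n",
	       (unsigned long long)arm64_untagged_addr(tagged));
	printf("x86, LAM off:   %#llx\n",
	       (unsigned long long)x86_untagged_addr_remote(&lam_off, tagged));
	printf("x86, LAM on:    %#llx\n",
	       (unsigned long long)x86_untagged_addr_remote(&lam_on, tagged));
	return 0;
}

The point being: arm64 strips the tag no matter what, while on x86
whether the tag is stripped depends on per-mm state that the mmap lock
(or something equivalent) has to stabilize -- which, if I follow, is
the difference Jann is pointing at.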