From: Barry Song <21cnbao@gmail.com>
Date: Thu, 5 Jun 2025 22:27:27 +1200
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
To: Jann Horn
Cc: Lorenzo Stoakes, akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Barry Song, "Liam R. Howlett",
 David Hildenbrand, Vlastimil Babka, Suren Baghdasaryan,
 Lokesh Gidra, Tangquan Zheng
References: <20250530104439.64841-1-21cnbao@gmail.com> <002aa917-d952-491d-800c-88a0476ac02f@lucifer.local>
Howlett" , David Hildenbrand , Vlastimil Babka , Suren Baghdasaryan , Lokesh Gidra , Tangquan Zheng Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 5D9401C0005 X-Stat-Signature: 9e6bqq4thdogpk3fhgscmsi693iz3fmm X-Rspam-User: X-Rspamd-Server: rspam04 X-HE-Tag: 1749119259-127983 X-HE-Meta: U2FsdGVkX1+RHRzKgXX1586+7EgAYsG6XwpvLPEqDSb1UFiIwlMp9vadZhBBhuMBcETZ3cH0WoWbIFlWQWSTPBFg8EG0WlW4bjfe9q12ajQwM9IVjG/p2KzJB8iDb7Bv2mrz9aUIHRQEfUki91dR1EUgLBfw2/HPJd+W5hGBViSY8VVn9QOrQA6ZPIv5Fb3VBymwJyjF03vd8rpVX8HWLHR8PcLWPW2fHzW08+Nsu/HKOsWn+3PsKnbZKWcbsrpVJjeQI2vS+rED7TSW1q6rxERM6BSN+fj5pepwaBLTNat0oEPPpZ1EfGPaUKYZfS9pCh9dbsf1jxUvezze+178HXJHiZQ/FVl3YVu13mO1xzGjSgx4xBrENqMg7EQR5zQEYoi7KArRBCUg4dSKfdEVzXQVB5qa8iXGzjj0sATq908AffqWdlv9VzNbkp/rU2rbHu5Wmz5S9GCM3rCRItMjzufeGEZzUINPwVKPmi7n2QpR7e+xNN7iobznY3W4rmjQyj/jI4S2NqGTpmQmMJSFWCSarnQvJlP2YIesiistK9CA9XP1gKTwWe/m9HO+a8DEmKg97wv3ZGttXZ6f9rLocmpmO3ekT/0BjWPKJmdumSvLCM5kJPedfYd7TrOBIV96j+oB5M03V2fGvc1pf0/kZ37GeFWOSUf5loXhKEmYnRdRNFlY5eznk0BWtJzkllqvUfuze26bmk1Cwmhifp8tq9uNlx2HkZBqMiS0fksFuLcGJYz4HpK0Xk6O4f9Gf4qz7VZks83pIqxYiYnmjF+io3h9OvUo4oDvZhyqMf+9lIe0IpuNSIBdkbL/45Hv5BzbrvlztCnUOCCrzacAcS/Zlox6h6IlGRfk4Yk8fQsTxnQmvzK4y5TOqnUY0KPdLpL3Ysje0XPJyOYESiWT0MdFkDyNm7hxLWlqlSTk2VEgvOYCDNSF6Z6hS7P2M8obxUGe+tbM9QkzsQxuvBgm0zE EVrduzgt IKTLvVKy0tcydFJYxjOBAe5/yag880EKuAVxpRNe8oafFtqspH6Z4SzFqsR4bYzVwogHq59a9ekc07TGe8wQxJuYC6sL10GXWSQ34eBPXnMyl1fEtVdkr/lERr+bhlbR+rKFBcCag9DbQPqQjjvIksQVaBRoHvci5zflLW+8SeAgMzFGWcAppKCP1XVn/D6aIDlOJFYSPbAuYhEEk1eiHOfyxno4fLtyswEmnOCB+pDVizRnaGl2Kwe4Cq2jZQk3AZq2rlQF2PsOsrM5csCS0zAO19ENXHMACadHjECC9PFB+PT7WhjL48Ziuz6IZsGJ/k1mPGHvE866tp23Gr0rw2Y7Lll0y7zryb71YPs87M1ltrZVIMZwbhMewjoKa8aZAWrv+wmtnkGV0mLqOIjKvO87b0JUlSgZ8IC1x37dd4OinvW+4UwqlI+8dMwAe75hBvWlXQpFJYSfQ2Kys0twspuDCzpBjmvJZtOvO8anFjQ9nJ19SGKoaZyyprkYuNxZvvdwqAtdQddzFLk0= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Jun 4, 2025 at 4:53=E2=80=AFAM Jann Horn wrote: > > On Tue, Jun 3, 2025 at 9:06=E2=80=AFAM Barry Song <21cnbao@gmail.com> wro= te: > > On Sat, May 31, 2025 at 8:41=E2=80=AFAM Jann Horn wr= ote: > > > On Fri, May 30, 2025 at 4:34=E2=80=AFPM Lorenzo Stoakes > > > wrote: > > > > On Fri, May 30, 2025 at 04:06:30PM +0200, Jann Horn wrote: > > > > > On Fri, May 30, 2025 at 12:44=E2=80=AFPM Barry Song <21cnbao@gmai= l.com> wrote: > > > > > One important quirk of this is that it can, from what I can see, = cause > > > > > freeing of page tables (through pt_reclaim) without holding the m= map > > > > > lock at all: > > > > > > > > > > do_madvise [behavior=3DMADV_DONTNEED] > > > > > madvise_lock > > > > > lock_vma_under_rcu > > > > > madvise_do_behavior > > > > > madvise_single_locked_vma > > > > > madvise_vma_behavior > > > > > madvise_dontneed_free > > > > > madvise_dontneed_single_vma > > > > > zap_page_range_single_batched [.reclaim_pt =3D true] > > > > > unmap_single_vma > > > > > unmap_page_range > > > > > zap_p4d_range > > > > > zap_pud_range > > > > > zap_pmd_range > > > > > zap_pte_range > > > > > try_get_and_clear_pmd > > > > > free_pte > > > > > > > > > > This clashes with the assumption in walk_page_range_novma() that > > > > > holding the mmap lock in write mode is sufficient to prevent > > > > > concurrent page table freeing, so it can probably lead to page ta= ble > > > > > UAF through the ptdump interface (see ptdump_walk_pgd()). 
> > > >
> > > > Hmmmmmm is this because of the series that allows page table freeing on
> > > > zap... I think Zi's?
> > >
> > > Yeah, that was Qi Zheng's
> > > https://lore.kernel.org/all/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com/
> > > .
> > >
> > > > We need to update the documentation on this then... which currently states
> > > > the VMA need only be stable.
> > > >
> > > > I guess this is still the case except for the novma walker you mention.
> > > >
> > > > Relatedly, it's worth looking at Dev's series which introduces a concerning
> > > > new 'no lock at all' mode to the page table walker explicitly for novma. I
> > > > cc'd you :) See [0].
> > > >
> > > > [0]: https://lore.kernel.org/linux-mm/6a60c052-9935-489e-a38e-1b03a1a79155@lucifer.local/
> > >
> > > Yeah, I saw that you CC'ed me; at a first glance that seems relatively
> > > innocuous to me as long as it's only done for kernel mappings where
> > > all the rules are different.
> > >
> > > > >
> > > > > I think before this patch can land, you'll have to introduce some new
> > > > > helper like:
> > > > >
> > > > > void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
> > > > > {
> > > > >         struct vm_area_struct *vma;
> > > > >         VMA_ITERATOR(vmi, mm, 0);
> > > > >
> > > > >         mmap_write_lock(mm);
> > > > >         for_each_vma(vmi, vma)
> > > > >                 vma_start_write(vma);
> > > > > }
> > > > >
> > > > > and use that in walk_page_range_novma() for user virtual address space
> > > > > walks, and update the comment in there.
> > > >
> > > > What dude? No, what? Marking literally all VMAs write locked? :/
> > > >
> > > > I think this could have unexpected impact no? We're basically disabling VMA
> > > > locking when we're in novma, that seems... really silly?
> > >
> > > I mean, walk_page_range_novma() being used on user virtual address
> > > space is pretty much a debug-only thing, I don't think it matters if
> > > it has to spend time poking flags in a few thousand VMAs. I guess the
> > > alternative would be to say "ptdump just doesn't show entries between
> > > VMAs, which shouldn't exist in the first place", and change ptdump to
> > > do a normal walk that skips over userspace areas not covered by a VMA.
> > > Maybe that's cleaner.
> > >
> > > But FWIW, we already do worse than what I proposed here when
> > > installing MMU notifiers, with mm_take_all_locks().
> > >
> > > > > > +       else
> > > > > > +               __madvise_unlock(mm, madv_behavior->behavior);
> > > > > > +}
> > > > > > +
> > > > > >  static bool madvise_batch_tlb_flush(int behavior)
> > > > > >  {
> > > > > >         switch (behavior) {
> > > > > > @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > > > > >                 unsigned long start, size_t len_in,
> > > > > >                 struct madvise_behavior *madv_behavior)
> > > > > >  {
> > > > > > +       struct vm_area_struct *vma = madv_behavior->vma;
> > > > > >         int behavior = madv_behavior->behavior;
> > > > > > +
> > > > > >         struct blk_plug plug;
> > > > > >         unsigned long end;
> > > > > >         int error;
> > > > > >
> > > > > >         if (is_memory_failure(behavior))
> > > > > >                 return madvise_inject_error(behavior, start, start + len_in);
> > > > > > -       start = untagged_addr_remote(mm, start);
> > > > > > +       start = untagged_addr(start);
> > > > >
> > > > > Why is this okay? I see that X86's untagged_addr_remote() asserts that
> > > > > the mmap lock is held, which is no longer the case here with your
> > > > > patch, but untagged_addr() seems wrong here, since we can be operating
> > > > > on another process. I think especially on X86 with 5-level paging and
> > > > > LAM, there can probably be cases where address bits are used for part
> > > > > of the virtual address in one task while they need to be masked off in
> > > > > another task?
> > > > >
> > > > > I wonder if you'll have to refactor X86 and RISC-V first to make this
> > > > > work... ideally by making sure that their address tagging state
> > > > > updates are atomic and untagged_addr_remote() works locklessly.
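For reference, a simplified sketch of the x86 helpers (paraphrased from
arch/x86/include/asm/uaccess_64.h) shows why the remote variant insists
on the mmap lock: the untag mask is per-mm state, written when LAM is
enabled, so a remote mm's mask must be read under that lock, while
plain untagged_addr() uses the current task's mask:

#define __untagged_addr(untag_mask, addr)	({			\
	u64 __addr = (__force u64)(addr);				\
	s64 sign = (s64)__addr >> 63;	/* keep kernel addresses intact */ \
	__addr &= (untag_mask) | sign;					\
	(__force __typeof__(addr))__addr;				\
})

/* Uses the *current* task's mask - wrong for another process's mm. */
#define untagged_addr(addr) __untagged_addr(current_untag_mask(), addr)

#define untagged_addr_remote(mm, addr)	({				\
	mmap_assert_locked(mm);	/* mask is updated under mmap lock */	\
	__untagged_addr((mm)->context.untag_mask, addr);		\
})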
> > > >
> > > > Yeah I don't know why we're doing this at all? This seems new unless I
> > > > missed it?
> > >
> > > Because untagged_addr_remote() has a mmap_assert_locked(mm) on x86 and
> > > reads data that is updated under the mmap lock, I think? So without
> > > this change you should get a lockdep splat on x86.
> > >
> > > > > (Or you could try to use something like the
> > > > > mmap_write_lock_with_all_vmas() I proposed above for synchronizing
> > > > > against untagged_addr(), first write-lock the MM and then write-lock
> > > > > all VMAs in it...)
> > > >
> > > > This would completely eliminate the point of this patch no? The whole point
> > > > is not taking these locks... And I'm very much not in favour of
> > > > write-locking literally every single VMA, under any circumstances.
> > >
> > > I'm talking about doing this heavyweight locking in places like
> > > arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) that can, if I understand
> > > correctly, essentially reconfigure the size of the virtual address
> > > space of a running process from 56-bit to 47-bit at the hardware level
> > > and cause address bits that were previously part of the virtual
> > > address to be ignored. READ_ONCE()/WRITE_ONCE() might do the job too,
> > > but then we'll have to keep in mind that two subsequent invocations of
> > > untagged_addr() can translate a userspace-specified virtual address
> > > into two different virtual addresses at the page table level.
> >
> > I'm confused about how arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) can
> > reconfigure a running process from using 56-bit addresses to 47-bit.
> > I read the code and see the x86 kernel only supports LAM U57, and not
> > LAM U48 at all:
> >
> > static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
> > {
> >         ...
> >
> >         if (!nr_bits || nr_bits > LAM_U57_BITS) {
> >                 mmap_write_unlock(mm);
> >                 return -EINVAL;
> >         }
> >
> >         mm_enable_lam(mm);
> >         mmap_write_unlock(mm);
> >
> >         return 0;
> > }
>
> Oh, you're right, currently only LAM U57 is supported by Linux, I was
> making bad assumptions.
>
> commit 2f8794bd087e, which introduced that code, also mentions that
> "For now only LAM_U57 is supported, with 6 tag bits".
>
> > I still don't fully understand why x86 differs from ARM64,
> > where the same bit mask is always applied unconditionally.
> >
> > On ARM64, we can even enable or disable PROT_MTE on a per-VMA basis
> > using mmap or mprotect. However, the same bitmask operation is
> > always executed regardless of whether memory tags are present for a
> > given VMA.
>
> Hmm, true, that does look like a weird difference.
>
> > I mean, on arm64, if a process or a VMA doesn't have tag access
> > enabled, and we pass an address with high bits to madvise,
> > untagged_addr() will still strip the tag. But wouldn't that address
> > be invalid for a process or VMA that doesn't have TBI enabled?
>
> Yeah, I guess in that regard it might be fine to always strip the bits...
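For comparison, a simplified sketch of arm64's untagged_addr(): the
mask is applied unconditionally via sign-extension from bit 55, with no
per-mm or per-VMA state consulted, which is why it is safe to use
locklessly on a remote mm (kernel addresses, which have bit 55 set,
keep their high bits):

#include <linux/bitops.h>	/* sign_extend64() */

static inline unsigned long arm64_untag_addr(unsigned long addr)
{
	/* Strip bits 63:56 of user addresses, unconditionally. */
	return (unsigned long)sign_extend64(addr, 55);
}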
>
> Maybe the situation on arm64 is simpler, in that these bits are always
> either tag bits or unused on arm64, while my understanding is that on
> X86, the CPU supports changing the meaning of address bits between
> "part of virtual address" and "ignored tag bits". So I think stripping
> the bits might work fine for LAM U57, but not for LAM U48, and maybe
> the code is trying to be future-proof in case someone wants to add
> support for LAM U48?

Perhaps supporting two different tag lengths is a bug rather than a
feature?

> It is arguably also a bit more robust to reject garbage addresses
> instead of ignoring bits that would cause the CPU to treat the address
> as noncanonical, but I guess doing that just in MM code is not a big
> problem. (Unless someone relies on seccomp filters that block
> manipulation of specific virtual address ranges via munmap() and such,
> but I think almost nobody does that. I think I've only seen that once
> in some rarely-used component of the Tor project.)
>

Unless we perform a strict check in the kernel to ensure the allocated
tag matches the logical tag, random high-bit values are still
acceptable. For example, x86 only checks whether the mask is -1UL or
covers bits above bit 57; there's no mechanism to exclude arbitrary
high-bit values.

Otherwise, there's no point in checking whether tagging is enabled for
the VMA or the mm. Thus, we end up applying a mask unconditionally
anyway.

> But I am out of my depth here and might be severely misunderstanding
> what's going on.

Same here :-) Especially since I have no idea what's going on with
RISC-V and its two different tagging lengths.

Thanks
Barry