From: Barry Song <21cnbao@gmail.com>
To: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
akpm@linux-foundation.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, Barry Song <v-songbaohua@oppo.com>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
David Hildenbrand <david@redhat.com>,
Vlastimil Babka <vbabka@suse.cz>,
Suren Baghdasaryan <surenb@google.com>,
Lokesh Gidra <lokeshgidra@google.com>,
Tangquan Zheng <zhengtangquan@oppo.com>
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
Date: Thu, 5 Jun 2025 22:27:27 +1200
Message-ID: <CAGsJ_4wjSXDc_BsYQaELAxbpCUoCqbBJsabJkvJh3rZ+kJoULA@mail.gmail.com>
In-Reply-To: <CAG48ez0L_dRjArd_NMVSeQ3eJ5pGLJgNppsHxpBuBQbQHPU57w@mail.gmail.com>

On Wed, Jun 4, 2025 at 4:53 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Jun 3, 2025 at 9:06 AM Barry Song <21cnbao@gmail.com> wrote:
> > On Sat, May 31, 2025 at 8:41 AM Jann Horn <jannh@google.com> wrote:
> > > On Fri, May 30, 2025 at 4:34 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > On Fri, May 30, 2025 at 04:06:30PM +0200, Jann Horn wrote:
> > > > > On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > One important quirk of this is that it can, from what I can see, cause
> > > > > freeing of page tables (through pt_reclaim) without holding the mmap
> > > > > lock at all:
> > > > >
> > > > > do_madvise [behavior=MADV_DONTNEED]
> > > > >   madvise_lock
> > > > >   lock_vma_under_rcu
> > > > >   madvise_do_behavior
> > > > >     madvise_single_locked_vma
> > > > >       madvise_vma_behavior
> > > > >         madvise_dontneed_free
> > > > >           madvise_dontneed_single_vma
> > > > >             zap_page_range_single_batched [.reclaim_pt = true]
> > > > >               unmap_single_vma
> > > > >                 unmap_page_range
> > > > >                   zap_p4d_range
> > > > >                     zap_pud_range
> > > > >                       zap_pmd_range
> > > > >                         zap_pte_range
> > > > >                           try_get_and_clear_pmd
> > > > >                             free_pte
> > > > >
> > > > > This clashes with the assumption in walk_page_range_novma() that
> > > > > holding the mmap lock in write mode is sufficient to prevent
> > > > > concurrent page table freeing, so it can probably lead to page table
> > > > > UAF through the ptdump interface (see ptdump_walk_pgd()).
> > > >
> > > > Hmmmmmm is this because of the series that allows page table freeing on
> > > > zap... I think Zi's?
> > >
> > > Yeah, that was Qi Zheng's
> > > https://lore.kernel.org/all/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com/
> > > .
> > >
> > > > We need to update the documentation on this then... which currently states
> > > > the VMA need only be stable.
> > > >
> > > > I guess this is still the case except for the novma walker you mention.
> > > >
> > > > Relatedly, it's worth looking at Dev's series which introduces a concerning
> > > > new 'no lock at all' mode to the page table walker explicitly for novma. I
> > > > cc'd you :) See [0].
> > > >
> > > > [0]: https://lore.kernel.org/linux-mm/6a60c052-9935-489e-a38e-1b03a1a79155@lucifer.local/
> > >
> > > Yeah, I saw that you CC'ed me; at a first glance that seems relatively
> > > innocuous to me as long as it's only done for kernel mappings where
> > > all the rules are different.
> > >
> > > >
> > > > >
> > > > > I think before this patch can land, you'll have to introduce some new
> > > > > helper like:
> > > > >
> > > > > void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
> > > > > {
> > > > >         struct vm_area_struct *vma;
> > > > >         VMA_ITERATOR(vmi, mm, 0);
> > > > >
> > > > >         mmap_write_lock(mm);
> > > > >         for_each_vma(vmi, vma)
> > > > >                 vma_start_write(vma);
> > > > > }
> > > > >
> > > > > and use that in walk_page_range_novma() for user virtual address space
> > > > > walks, and update the comment in there.
> > > >
> > > > What dude? No, what? Marking literally all VMAs write locked? :/
> > > >
> > > > I think this could have unexpected impact no? We're basically disabling VMA
> > > > locking when we're in novma, that seems... really silly?
> > >
> > > I mean, walk_page_range_novma() being used on user virtual address
> > > space is pretty much a debug-only thing, I don't think it matters if
> > > it has to spend time poking flags in a few thousand VMAs. I guess the
> > > alternative would be to say "ptdump just doesn't show entries between
> > > VMAs, which shouldn't exist in the first place", and change ptdump to
> > > do a normal walk that skips over userspace areas not covered by a VMA.
> > > Maybe that's cleaner.
> > >
> > > But FWIW, we already do worse than what I proposed here when
> > > installing MMU notifiers, with mm_take_all_locks().
> > >
> > > > > > +	else
> > > > > > +		__madvise_unlock(mm, madv_behavior->behavior);
> > > > > > +}
> > > > > > +
> > > > > >  static bool madvise_batch_tlb_flush(int behavior)
> > > > > >  {
> > > > > >  	switch (behavior) {
> > > > > > @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > > > > >  		unsigned long start, size_t len_in,
> > > > > >  		struct madvise_behavior *madv_behavior)
> > > > > >  {
> > > > > > +	struct vm_area_struct *vma = madv_behavior->vma;
> > > > > >  	int behavior = madv_behavior->behavior;
> > > > > > +
> > > > > >  	struct blk_plug plug;
> > > > > >  	unsigned long end;
> > > > > >  	int error;
> > > > > >
> > > > > >  	if (is_memory_failure(behavior))
> > > > > >  		return madvise_inject_error(behavior, start, start + len_in);
> > > > > > -	start = untagged_addr_remote(mm, start);
> > > > > > +	start = untagged_addr(start);
> > > > >
> > > > > Why is this okay? I see that X86's untagged_addr_remote() asserts that
> > > > > the mmap lock is held, which is no longer the case here with your
> > > > > patch, but untagged_addr() seems wrong here, since we can be operating
> > > > > on another process. I think especially on X86 with 5-level paging and
> > > > > LAM, there can probably be cases where address bits are used for part
> > > > > of the virtual address in one task while they need to be masked off in
> > > > > another task?
> > > > >
> > > > > I wonder if you'll have to refactor X86 and Risc-V first to make this
> > > > > work... ideally by making sure that their address tagging state
> > > > > updates are atomic and untagged_addr_remote() works locklessly.
> > > >
> > > > Yeah I don't know why we're doing this at all? This seems new unless I
> > > > missed it?
> > >
> > > Because untagged_addr_remote() has a mmap_assert_locked(mm) on x86 and
> > > reads data that is updated under the mmap lock, I think? So without
> > > this change you should get a lockdep splat on x86.
> > >
> > > > > (Or you could try to use something like the
> > > > > mmap_write_lock_with_all_vmas() I proposed above for synchronizing
> > > > > against untagged_addr(), first write-lock the MM and then write-lock
> > > > > all VMAs in it...)
> > > >
> > > > This would completely eliminate the point of this patch no? The whole point
> > > > is not taking these locks... And I'm very much not in favour of
> > > > write-locking literally every single VMA, under any circumstances.
> > >
> > > I'm talking about doing this heavyweight locking in places like
> > > arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) that can, if I understand
> > > correctly, essentially reconfigure the size of the virtual address
> > > space of a running process from 56-bit to 47-bit at the hardware level
> > > and cause address bits that were previously part of the virtual
> > > address to be ignored. READ_ONCE()/WRITE_ONCE() might do the job too,
> > > but then we'll have to keep in mind that two subsequent invocations of
> > > untagged_addr() can translate a userspace-specified virtual address
> > > into two different virtual addresses at the page table level.
> >
> > I’m confused about how arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) can
> > reconfigure a running process from using 56-bit addresses to 47-bit.
> > I read the code and see the x86 kernel only supports LAM U57, and not
> > LAM U48 at all:
> >
> > static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
> > {
> >         ...
> >
> >         if (!nr_bits || nr_bits > LAM_U57_BITS) {
> >                 mmap_write_unlock(mm);
> >                 return -EINVAL;
> >         }
> >
> >         mm_enable_lam(mm);
> >         mmap_write_unlock(mm);
> >
> >         return 0;
> > }
>
> Oh, you're right, currently only LAM U57 is supported by Linux, I was
> making bad assumptions.
>
> commit 2f8794bd087e, which introduced that code, also mentions that
> "For now only LAM_U57 is supported, with 6 tag bits".
>
> > I still don't fully understand why x86 differs from ARM64,
> > where the same bit mask is always applied unconditionally.
> >
> > On ARM64, we can even enable or disable PROT_MTE on a per-VMA basis
> > using mmap or mprotect. However, the same bitmask operation is
> > always executed regardless of whether memory tags are present for a
> > given VMA.
>
> Hmm, true, that does look like a weird difference.
>
> > I mean, on arm64, if a process or a VMA doesn't have tag access
> > enabled, and we pass an address with high bits to madvise,
> > untagged_addr() will still strip the tag. But wouldn't that address
> > be invalid for a process or VMA that doesn't have TBI enabled?
>
> Yeah, I guess in that regard it might be fine to always strip the bits...
>
> Maybe the situation on arm64 is simpler, in that these bits are always
> either tag bits or unused on arm64, while my understanding is that on
> X86, the CPU supports changing the meaning of address bits between
> "part of virtual address" and "ignored tag bits". So I think stripping
> the bits might work fine for LAM U57, but not for LAM U48, and maybe
> the code is trying to be future-proof in case someone wants to add
> support for LAM U48?

Perhaps supporting two different tag lengths is a bug rather than a feature?
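
For reference, my (possibly wrong) reading of the two LAM layouts from
the Intel docs, with U48 being the hypothetical future case:

	LAM_U57: bits 62:57 ignored as tag, bits 56:0 form the address
	LAM_U48: bits 62:48 ignored as tag, bits 47:0 form the address

so bits 56:48, which are address bits under LAM_U57 with 5-level
paging, become ignored tag bits under LAM_U48, and one unconditional
mask cannot serve both.
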
>
> It is arguably also a bit more robust to reject garbage addresses
> instead of ignoring bits that would cause the CPU to treat the address
> as noncanonical, but I guess doing that just in MM code is not a big
> problem. (Unless someone relies on seccomp filters that block
> manipulation of specific virtual address ranges via munmap() and such,
> but I think almost nobody does that. I think I've only seen that once
> in some rarely-used component of the Tor project.)
>

Unless the kernel performs a strict check that the tag bits in a
passed-in address match a tag that was actually assigned, random
high-bit values are accepted anyway. For example, when enabling LAM,
x86 only rejects nr_bits of zero or more than LAM_U57_BITS; there is
no mechanism afterwards to exclude arbitrary high-bit values in
individual addresses. Otherwise there would be no point in checking
whether tagging is enabled for the VMA or the mm, so we end up
applying a mask unconditionally anyway.
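
To make the difference concrete, here is roughly what the two
implementations boil down to as far as I can tell from the current
source; a simplified sketch, not the literal kernel code:

	/*
	 * arm64: for user pointers the top byte is never part of the
	 * virtual address, so the mask is static and needs no locking.
	 */
	#define untagged_addr(addr)	sign_extend64((u64)(addr), 55)

	/*
	 * x86 LAM: the set of ignored bits is per-mm state set up by
	 * prctl_enable_tagged_addr(), so the remote variant has to read
	 * mm->context.untag_mask; that read is what the
	 * mmap_assert_locked() in untagged_addr_remote() protects.
	 */
	#define untagged_addr_remote(mm, addr) \
		((addr) & (mm)->context.untag_mask)
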
> But I am out of my depth here and might be severely misunderstanding
> what's going on.

Same here :-) Especially since I have no idea what's going on with
RISC-V and its two different tagging lengths.
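
(If I read the riscv pointer-masking work right, userspace picks the
number of masked bits per task via PR_SET_TAGGED_ADDR_CTRL; treat the
sketch below as my assumption of the interface rather than verified
usage:

	/* ask for tagged addresses with a 7-bit tag (PMLEN = 7) */
	prctl(PR_SET_TAGGED_ADDR_CTRL,
	      PR_TAGGED_ADDR_ENABLE | (7UL << PR_PMLEN_SHIFT), 0, 0, 0);

so the untag mask is per-mm state there too, with the same remote
locking question as on x86.)
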
Thanks
Barry