From: Barry Song <21cnbao@gmail.com>
To: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	akpm@linux-foundation.org,  linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,  Barry Song <v-songbaohua@oppo.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	 David Hildenbrand <david@redhat.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	 Suren Baghdasaryan <surenb@google.com>,
	Lokesh Gidra <lokeshgidra@google.com>,
	 Tangquan Zheng <zhengtangquan@oppo.com>
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED
Date: Thu, 5 Jun 2025 22:27:27 +1200
Message-ID: <CAGsJ_4wjSXDc_BsYQaELAxbpCUoCqbBJsabJkvJh3rZ+kJoULA@mail.gmail.com>
In-Reply-To: <CAG48ez0L_dRjArd_NMVSeQ3eJ5pGLJgNppsHxpBuBQbQHPU57w@mail.gmail.com>

On Wed, Jun 4, 2025 at 4:53 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Jun 3, 2025 at 9:06 AM Barry Song <21cnbao@gmail.com> wrote:
> > On Sat, May 31, 2025 at 8:41 AM Jann Horn <jannh@google.com> wrote:
> > > On Fri, May 30, 2025 at 4:34 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > On Fri, May 30, 2025 at 04:06:30PM +0200, Jann Horn wrote:
> > > > > On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > One important quirk of this is that it can, from what I can see, cause
> > > > > freeing of page tables (through pt_reclaim) without holding the mmap
> > > > > lock at all:
> > > > >
> > > > > do_madvise [behavior=MADV_DONTNEED]
> > > > >   madvise_lock
> > > > >     lock_vma_under_rcu
> > > > >   madvise_do_behavior
> > > > >     madvise_single_locked_vma
> > > > >       madvise_vma_behavior
> > > > >         madvise_dontneed_free
> > > > >           madvise_dontneed_single_vma
> > > > >             zap_page_range_single_batched [.reclaim_pt = true]
> > > > >               unmap_single_vma
> > > > >                 unmap_page_range
> > > > >                   zap_p4d_range
> > > > >                     zap_pud_range
> > > > >                       zap_pmd_range
> > > > >                         zap_pte_range
> > > > >                           try_get_and_clear_pmd
> > > > >                           free_pte
> > > > >
> > > > > This clashes with the assumption in walk_page_range_novma() that
> > > > > holding the mmap lock in write mode is sufficient to prevent
> > > > > concurrent page table freeing, so it can probably lead to page table
> > > > > UAF through the ptdump interface (see ptdump_walk_pgd()).
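
(For context: the assumption Jann mentions lives in ptdump_walk_pgd().
The following is condensed from my reading of mm/ptdump.c, so treat it
as an illustrative sketch rather than the exact upstream code:)

void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
{
	const struct ptdump_range *range = st->range;

	mmap_write_lock(mm);	/* assumed to keep every page table alive */
	while (range->start != range->end) {
		/* walks pgd..pte directly, taking no per-VMA locks */
		walk_page_range_novma(mm, range->start, range->end,
				      &ptdump_ops, pgd, st);
		range++;
	}
	mmap_write_unlock(mm);
}

With pt_reclaim freeing PTE tables under only the per-VMA lock, that
mmap_write_lock() no longer excludes the free.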
> > > >
> > > > Hmmmmmm is this because of the series that allows page table freeing on
> > > > zap... I think Zi's?
> > >
> > > Yeah, that was Qi Zheng's
> > > https://lore.kernel.org/all/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com/
> > > .
> > >
> > > > We need to update the documentation on this then... which currently states
> > > > the VMA need only be stable.
> > > >
> > > > I guess this is still the case except for the novma walker you mention.
> > > >
> > > > Relatedly, it's worth looking at Dev's series which introduces a concerning
> > > > new 'no lock at all' mode to the page table walker explicitly for novma. I
> > > > cc'd you :) See [0].
> > > >
> > > > [0]: https://lore.kernel.org/linux-mm/6a60c052-9935-489e-a38e-1b03a1a79155@lucifer.local/
> > >
> > > Yeah, I saw that you CC'ed me; at a first glance that seems relatively
> > > innocuous to me as long as it's only done for kernel mappings where
> > > all the rules are different.
> > >
> > > >
> > > > >
> > > > > I think before this patch can land, you'll have to introduce some new
> > > > > helper like:
> > > > >
> > > > > void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
> > > > > {
> > > > >   struct vm_area_struct *vma;
> > > > >   VMA_ITERATOR(vmi, mm, 0);
> > > > >
> > > > >   mmap_write_lock(mm);
> > > > >   for_each_vma(vmi, vma)
> > > > >     vma_start_write(vma);
> > > > > }
> > > > >
> > > > > and use that in walk_page_range_novma() for user virtual address space
> > > > > walks, and update the comment in there.
> > > >
> > > > What dude? No, what? Marking literally all VMAs write locked? :/
> > > >
> > > > I think this could have unexpected impact no? We're basically disabling VMA
> > > > locking when we're in novma, that seems... really silly?
> > >
> > > I mean, walk_page_range_novma() being used on user virtual address
> > > space is pretty much a debug-only thing, I don't think it matters if
> > > it has to spend time poking flags in a few thousand VMAs. I guess the
> > > alternative would be to say "ptdump just doesn't show entries between
> > > VMAs, which shouldn't exist in the first place", and change ptdump to
> > > do a normal walk that skips over userspace areas not covered by a VMA.
> > > Maybe that's cleaner.
> > >
> > > But FWIW, we already do worse than what I proposed here when
> > > installing MMU notifiers, with mm_take_all_locks().
> > >
> > > > > > +       else
> > > > > > +               __madvise_unlock(mm, madv_behavior->behavior);
> > > > > > +}
> > > > > > +
> > > > > >  static bool madvise_batch_tlb_flush(int behavior)
> > > > > >  {
> > > > > >         switch (behavior) {
> > > > > > @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
> > > > > >                 unsigned long start, size_t len_in,
> > > > > >                 struct madvise_behavior *madv_behavior)
> > > > > >  {
> > > > > > +       struct vm_area_struct *vma = madv_behavior->vma;
> > > > > >         int behavior = madv_behavior->behavior;
> > > > > > +
> > > > > >         struct blk_plug plug;
> > > > > >         unsigned long end;
> > > > > >         int error;
> > > > > >
> > > > > >         if (is_memory_failure(behavior))
> > > > > >                 return madvise_inject_error(behavior, start, start + len_in);
> > > > > > -       start = untagged_addr_remote(mm, start);
> > > > > > +       start = untagged_addr(start);
> > > > >
> > > > > Why is this okay? I see that X86's untagged_addr_remote() asserts that
> > > > > the mmap lock is held, which is no longer the case here with your
> > > > > patch, but untagged_addr() seems wrong here, since we can be operating
> > > > > on another process. I think especially on X86 with 5-level paging and
> > > > > LAM, there can probably be cases where address bits are used for part
> > > > > of the virtual address in one task while they need to be masked off in
> > > > > another task?
> > > > >
> > > > > I wonder if you'll have to refactor X86 and Risc-V first to make this
> > > > > work... ideally by making sure that their address tagging state
> > > > > updates are atomic and untagged_addr_remote() works locklessly.
> > > >
> > > > Yeah I don't know why we're doing this at all? This seems new unless I
> > > > missed it?
> > >
> > > Because untagged_addr_remote() has a mmap_assert_locked(mm) on x86 and
> > > reads data that is updated under the mmap lock, I think? So without
> > > this change you should get a lockdep splat on x86.
> > >
> > > > > (Or you could try to use something like the
> > > > > mmap_write_lock_with_all_vmas() I proposed above for synchronizing
> > > > > against untagged_addr(), first write-lock the MM and then write-lock
> > > > > all VMAs in it...)
> > > >
> > > > This would completely eliminate the point of this patch, no? The whole point
> > > > is not taking these locks... And I'm very much not in favour of
> > > > write-locking literally every single VMA, under any circumstances.
> > >
> > > I'm talking about doing this heavyweight locking in places like
> > > arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) that can, if I understand
> > > correctly, essentially reconfigure the size of the virtual address
> > > space of a running process from 56-bit to 47-bit at the hardware level
> > > and cause address bits that were previously part of the virtual
> > > address to be ignored. READ_ONCE()/WRITE_ONCE() might do the job too,
> > > but then we'll have to keep in mind that two subsequent invocations of
> > > untagged_addr() can translate a userspace-specified virtual address
> > > into two different virtual addresses at the page table level.
> >
> > I’m confused about how arch_prctl(ARCH_ENABLE_TAGGED_ADDR, ...) can
> > reconfigure a running process from using 56-bit addresses to 47-bit.
> > I read the code and see the x86 kernel only supports LAM U57, and not
> > LAM U48 at all:
> >
> > static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
> > {
> >         ...
> >
> >         if (!nr_bits || nr_bits > LAM_U57_BITS) {
> >                 mmap_write_unlock(mm);
> >                 return -EINVAL;
> >         }
> >
> >         mm_enable_lam(mm);
> >         mmap_write_unlock(mm);
> >
> >         return 0;
> > }
>
> Oh, you're right, currently only LAM U57 is supported by Linux, I was
> making bad assumptions.
>
> commit 2f8794bd087e, which introduced that code, also mentions that
> "For now only LAM_U57 is supported, with 6 tag bits".
>
> > I still don't fully understand why x86 differs from ARM64,
> > where the same bit mask is always applied unconditionally.
> >
> > On ARM64, we can even enable or disable PROT_MTE on a per-VMA basis
> > using mmap or mprotect. However, the same bitmask operation is
> > always executed regardless of whether memory tags are present for a
> > given VMA.
>
> Hmm, true, that does look like a weird difference.
>
> > I mean, on arm64, if a process or a VMA doesn't have tag access
> > enabled, and we pass an address with high bits to madvise,
> > untagged_addr() will still strip the tag. But wouldn't that address
> > be invalid for a process or VMA that doesn't have TBI enabled?
>
> Yeah, I guess in that regard it might be fine to always strip the bits...
>
> Maybe the situation on arm64 is simpler, in that these bits are always
> either tag bits or unused on arm64, while my understanding is that on
> X86, the CPU supports changing the meaning of address bits between
> "part of virtual address" and "ignored tag bits". So I think stripping
> the bits might work fine for LAM U57, but not for LAM U48, and maybe
> the code is trying to be future-proof in case someone wants to add
> support for LAM U48?

Perhaps supporting two different tag lengths is a bug rather than a feature?
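
To make the difference concrete, here is a tiny userspace sketch of the
two untagging schemes as I understand them (the function names and the
sample address are mine; the LAM U57 mask and the arm64 sign-extension
from bit 55 are from my reading of the respective arch code, so treat
this as illustrative, not what the kernel literally does):

#include <stdint.h>
#include <stdio.h>

/* x86 LAM U57: bits 62:57 are ignored tag bits; bit 63 is preserved. */
static uint64_t untag_x86_lam_u57(uint64_t addr)
{
	return addr & ~(0x3fULL << 57);		/* clear bits 62:57 */
}

/* arm64 TBI/MTE: untagging sign-extends from bit 55, replacing the
 * whole top byte regardless of per-VMA tagging state. */
static uint64_t untag_arm64(uint64_t addr)
{
	return (uint64_t)((int64_t)(addr << 8) >> 8);
}

int main(void)
{
	uint64_t tagged = 0x1e00001234560000ULL;	/* arbitrary high bits */

	printf("x86 LAM U57: 0x%016llx\n",
	       (unsigned long long)untag_x86_lam_u57(tagged));
	printf("arm64:       0x%016llx\n",
	       (unsigned long long)untag_arm64(tagged));
	return 0;
}

Either way the kernel just masks bits; neither scheme validates that
the tag means anything for the mapping being operated on.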

>
> It is arguably also a bit more robust to reject garbage addresses
> instead of ignoring bits that would cause the CPU to treat the address
> as noncanonical, but I guess doing that just in MM code is not a big
> problem. (Unless someone relies on seccomp filters that block
> manipulation of specific virtual address ranges via munmap() and such,
> but I think almost nobody does that. I think I've only seen that once
> in some rarely-used component of the Tor project.)
>

Unless the kernel performs a strict check that the tag encoded in an
address matches the tag that was actually allocated, random high-bit
values are still accepted. For example, on x86 the untag mask is either
-1UL (tagging disabled) or the LAM U57 mask; there is no mechanism to
reject arbitrary values in the high bits of an individual address.

And without such a check, there is little point in testing whether
tagging is enabled for the VMA or the mm before untagging, so we end up
applying the mask unconditionally anyway.
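
(If we did want remote untagging to work locklessly, I imagine it would
look something like the following; a hypothetical sketch assuming
mm->context.untag_mask stays a single word that can be read atomically,
and untagged_addr_lockless() is a made-up name:)

static inline unsigned long untagged_addr_lockless(struct mm_struct *mm,
						   unsigned long addr)
{
	/*
	 * Hypothetical: one READ_ONCE gives a consistent snapshot of the
	 * mask without the mmap lock, but, as Jann notes, two calls may
	 * observe different masks if LAM is enabled concurrently, so
	 * callers would have to untag once and reuse the result.
	 */
	return addr & READ_ONCE(mm->context.untag_mask);
}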

> But I am out of my depth here and might be severely misunderstanding
> what's going on.

Same here :-) Especially since I have no idea what's going on with RISC-V
and its two different tagging lengths.

Thanks
Barry

