From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Jann Horn <jannh@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>,
syzbot <syzbot+f5d897f5194d92aa1769@syzkaller.appspotmail.com>,
Liam.Howlett@oracle.com, akpm@linux-foundation.org,
david@kernel.org, harry.yoo@oracle.com,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
riel@surriel.com, syzkaller-bugs@googlegroups.com,
vbabka@suse.cz
Subject: Re: [syzbot] [mm?] KCSAN: data-race in __anon_vma_prepare / __vmf_anon_prepare
Date: Wed, 14 Jan 2026 21:16:09 +0000
Message-ID: <23daf873-1e80-4e19-87d4-b9d54ff081a9@lucifer.local>
In-Reply-To: <CAG48ez28kdkX1imCj3rDYNTi9PKQBx9Cqf+N2Rg92o9BEdo9QA@mail.gmail.com>

On Wed, Jan 14, 2026 at 07:23:37PM +0100, Jann Horn wrote:
> On Wed, Jan 14, 2026 at 7:02 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Wed, Jan 14, 2026 at 06:48:37PM +0100, Jann Horn wrote:
> > > On Wed, Jan 14, 2026 at 6:29 PM Jann Horn <jannh@google.com> wrote:
> > > > On Wed, Jan 14, 2026 at 6:06 PM Dmitry Vyukov <dvyukov@google.com> wrote:
> > > > > On Wed, 14 Jan 2026 at 18:00, Jann Horn <jannh@google.com> wrote:
> > > > > > On Wed, Jan 14, 2026 at 5:43 PM Dmitry Vyukov <dvyukov@google.com> wrote:
> > > > > > > On Wed, 14 Jan 2026 at 17:32, syzbot
> > > > > > > <syzbot+f5d897f5194d92aa1769@syzkaller.appspotmail.com> wrote:
> > > > > > > > ==================================================================
> > > > > > > > BUG: KCSAN: data-race in __anon_vma_prepare / __vmf_anon_prepare
> > > > > > > >
> > > > > > > > write to 0xffff88811c751e80 of 8 bytes by task 13471 on cpu 1:
> > > > > > > > __anon_vma_prepare+0x172/0x2f0 mm/rmap.c:212
> > > > > > > > __vmf_anon_prepare+0x91/0x100 mm/memory.c:3673
> > > > > > > > hugetlb_no_page+0x1c4/0x10d0 mm/hugetlb.c:5782
> > > > > > > > hugetlb_fault+0x4cf/0xce0 mm/hugetlb.c:-1
> > > > > > > > handle_mm_fault+0x1894/0x2c60 mm/memory.c:6578
> > > > > > [...]
> > > > > > > > read to 0xffff88811c751e80 of 8 bytes by task 13473 on cpu 0:
> > > > > > > > __vmf_anon_prepare+0x26/0x100 mm/memory.c:3667
> > > > > > > > hugetlb_no_page+0x1c4/0x10d0 mm/hugetlb.c:5782
> > > > > > > > hugetlb_fault+0x4cf/0xce0 mm/hugetlb.c:-1
> > > > > > > > handle_mm_fault+0x1894/0x2c60 mm/memory.c:6578
> > > > > > [...]
> > > > > > > >
> > > > > > > > value changed: 0x0000000000000000 -> 0xffff888104ecca28
> > > > > > > >
> > > > > > > > Reported by Kernel Concurrency Sanitizer on:
> > > > > > > > CPU: 0 UID: 0 PID: 13473 Comm: syz.2.3219 Tainted: G W syzkaller #0 PREEMPT(voluntary)
> > > > > > > > Tainted: [W]=WARN
> > > > > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
> > > > > > > > ==================================================================
> > > > > > >
> > > > > > > Hi Harry,
> > > > > > >
> > > > > > > I see you've been debugging:
> > > > > > > KASAN: slab-use-after-free Read in folio_remove_rmap_ptes
> > > > > > > https://lore.kernel.org/all/694e3dc6.050a0220.35954c.0066.GAE@google.com/T/
> > > > > > >
> > > > > > > Can that bug be caused by this data race?
> > > > > > > Below is an explanation by Gemini LLM as to why this race is harmful.
> > > > > > > Obviously take it with a grain of salt, but with my limited mm
> > > > > > > knowledge it does not look immediately wrong (re rmap invariant).
> > > > > > >
> > > > > > > However, now digging into the details, I see that this patch of
> > > > > > > Lorenzo's is also marked as fixing "KASAN: slab-use-after-free Read in
> > > > > > > folio_remove_rmap_ptes":
> > > > > > >
> > > > > > > mm/vma: fix anon_vma UAF on mremap() faulted, unfaulted merge
> > > > > > > https://lore.kernel.org/all/b7930ad2b1503a657e29fe928eb33061d7eadf5b.1767638272.git.lorenzo.stoakes@oracle.com/T/
> > > > > > >
> > > > > > > So perhaps the race is still benign (or points to another issue?)
> > > > > > >
> > > > > > > Here is what LLM said about the race:
> > > > > > > -----
> > > > > > >
> > > > > > > The bug report is actionable and points to a harmful data race in the Linux
> > > > > > > kernel's memory management subsystem, specifically in the handling of
> > > > > > > anonymous `hugetlb` mappings.
> > > > > >
> > > > > > This data race is not specific to hugetlb at all, and it isn't caused
> > > > > > by any recent changes. It's a longstanding thing in core MM, but it's
> > > > > > pretty benign as far as I know.
> > > > > >
> > > > > > Fundamentally, the field vma->anon_vma can be read while only holding
> > > > > > the mmap lock in read mode; and it can concurrently be changed from
> > > > > > NULL to non-NULL.
> >
> > Well isn't that what the page_table_lock is for...?
>
> The page_table_lock prevents writer-writer data races, but not
> reader-writer data races. (It is only held by writers, not by
> readers.)
Right, true. So the important question is whether this actually matters :)

As far as faulting is concerned, we only care insofar as we'd become a
writer and then discover we were raced anyway.
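
To spell the pattern out for my own benefit (a userspace analogue I've
sketched purely to illustrate - not the kernel code, and all the names below
are made up), the shape of it is an unlocked fast-path read racing against a
publish done under a lock:

/*
 * Userspace sketch of the check/lock/re-check publication pattern;
 * prepare() stands in for __vmf_anon_prepare()/__anon_vma_prepare(),
 * the mutex stands in for mm->page_table_lock.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_anon_vma { int refcount; };

static struct fake_anon_vma *shared;	/* stands in for vma->anon_vma */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static int prepare(void)
{
	struct fake_anon_vma *new;

	/* Fast path: plain unlocked read - this is the access KCSAN flags. */
	if (shared)
		return 0;

	new = malloc(sizeof(*new));
	if (!new)
		return -1;
	new->refcount = 1;

	/* Slow path: become a writer, then re-check under the lock in case
	 * we were raced. */
	pthread_mutex_lock(&lock);
	if (!shared) {
		shared = new;	/* plain write, races with the read above */
		new = NULL;
	}
	pthread_mutex_unlock(&lock);

	free(new);		/* lost the race: discard our copy */
	return 0;
}

static void *fault(void *arg)
{
	prepare();
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, fault, NULL);
	pthread_create(&b, NULL, fault, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("published %p\n", (void *)shared);
	return 0;
}

The lock only serialises the two would-be writers against each other; the
plain fast-path read is what KCSAN is reporting.
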
>
> > > > > >
> > > > > > One scenario to cause such a data race is to create a new anonymous
> > > > > > VMA, then trigger two concurrent page faults inside this VMA. Assume a
> > > > > > configuration with VMA locking disabled for simplicity, so that both
> > > > > > faults happen under the mmap lock in read mode. This will lead to two
> > > > > > concurrent calls to __vmf_anon_prepare()
> > > > > > (https://elixir.bootlin.com/linux/v6.18.5/source/mm/memory.c#L3623),
> > > > > > both threads only holding the mmap_lock in read mode.
> > > > > > __vmf_anon_prepare() is essentially this (from
> > > > > > https://elixir.bootlin.com/linux/v6.18.5/source/mm/memory.c#L3623,
> > > > > > with VMA locking code removed):
> > > > > >
> > > > > > vm_fault_t __vmf_anon_prepare(struct vm_fault *vmf)
> > > > > > {
> > > > > >         struct vm_area_struct *vma = vmf->vma;
> > > > > >         vm_fault_t ret = 0;
> > > > > >
> > > > > >         if (likely(vma->anon_vma))
> > > > > >                 return 0;
> > > > > >         [...]
> > > > > >         if (__anon_vma_prepare(vma))
> > > > > >                 ret = VM_FAULT_OOM;
> > > > > >         [...]
> > > > > >         return ret;
> > > > > > }
> > > > > >
> > > > > > int __anon_vma_prepare(struct vm_area_struct *vma)
> > > > > > {
> > > > > >         struct mm_struct *mm = vma->vm_mm;
> > > > > >         struct anon_vma *anon_vma, *allocated;
> > > > > >         struct anon_vma_chain *avc;
> > > > > >
> > > > > >         [...]
> > > > > >
> > > > > >         [... allocate stuff ...]
> > > > > >
> > > > > >         anon_vma_lock_write(anon_vma);
> > > > > >         /* page_table_lock to protect against threads */
> > > > > >         spin_lock(&mm->page_table_lock);
> > > > > >         if (likely(!vma->anon_vma)) {
> > > > > >                 vma->anon_vma = anon_vma;
> > > > > >                 [...]
> > > > > >         }
> > > > > >         spin_unlock(&mm->page_table_lock);
> > > > > >         anon_vma_unlock_write(anon_vma);
> > > > > >
> > > > > >         [... cleanup ...]
> > > > > >
> > > > > >         return 0;
> > > > > >
> > > > > >         [... error handling ...]
> > > > > > }
> > > > > >
> > > > > > So if one thread reaches the "vma->anon_vma = anon_vma" assignment
> > > > > > while the other thread is running the "if (likely(vma->anon_vma))"
> > > > > > check, you get a (AFAIK benign) data race.
> > > > >
> > > > > Thanks for checking, Jann.
> > > > >
> > > > > To double check"
> > > > >
> > > > > "vma->anon_vma = anon_vma" is done w/o store-release, so the lockless
> > > > > readers can't read anon_vma contents, is it correct? So none of them
> > > > > really reading anon_vma, right?
> > > >
> > > > I think you are right that this should be using store-release;
> > > > searching around, I also mentioned this in
> > > > <https://lore.kernel.org/all/CAG48ez0qsAM-dkOUDetmNBSK4typ5t_FvMvtGiB7wQsP-G1jVg@mail.gmail.com/>:
> > > >
> > > > | > +Note that there are some exceptions to this - the `anon_vma` field is permitted
> > > > | > +to be written to under mmap read lock and is instead serialised by the `struct
> > > > | > +mm_struct` field `page_table_lock`. In addition the `vm_mm` and all
> > > > |
> > > > | Hm, we really ought to add some smp_store_release() and READ_ONCE(),
> > > > | or something along those lines, around our ->anon_vma accesses...
> > > > | especially the "vma->anon_vma = anon_vma" assignment in
> > > > | __anon_vma_prepare() looks to me like, on architectures like arm64
> > > > | with write-write reordering, we could theoretically end up making a
> > > > | new anon_vma pointer visible to a concurrent page fault before the
> > > > | anon_vma has been initialized? Though I have no idea if that is
> > > > | practically possible, stuff would have to be reordered quite a bit for
> > > > | that to happen...

OK, I'm confused as to how we can actually end up with an uninitialised
anon_vma here? The write gets ordered before the initialisation, somehow?
	anon_vma = find_mergeable_anon_vma(vma);
	allocated = NULL;
	if (!anon_vma) {
		anon_vma = anon_vma_alloc();

WHICH IS (maybe inlined):

  ******************************
	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
	if (anon_vma) {
 |----------------------->  ??
 |		atomic_set(&anon_vma->refcount, 1);
 |		anon_vma->num_children = 0;
 |		anon_vma->num_active_vmas = 0;
 |		anon_vma->parent = anon_vma;
 |		/*
 |		 * Initialise the anon_vma root to point to itself. If called
 |		 * from fork, the root will be reset to the parents anon_vma.
 |		 */
 |		anon_vma->root = anon_vma;
 |	}
 |	return anon_vma;
 |
 |  *****************************
 |
 |		anon_vma->num_children++; /* self-parent link for new root */
 |		allocated = anon_vma;
 |	}
 |
 |	anon_vma_lock_write(anon_vma);
 |	/* page_table_lock to protect against threads */
 |	spin_lock(&mm->page_table_lock);
 |	if (likely(!vma->anon_vma)) {
 |--------------->	vma->anon_vma = anon_vma;
Am I totally misunderstanding? How likely is this?

Given the anon_vma_lock_write() and spin_lock(), are we not avoiding this
anyway?
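
To make sure I'm asking the right question - my (quite possibly wrong)
understanding is that the ACQUIRE in spin_lock() only prevents later accesses
being hoisted before it; it doesn't stop the earlier initialisation stores
from completing after the publish store, which I presume is what the
smp_store_release()/READ_ONCE() suggestion above is about. In C11 terms,
something like this (again a made-up userspace sketch to show the shape of
the annotation, not a reproducer):

/*
 * Publish-with-release vs. plain publish; names invented. Run
 * single-threaded here purely to show the annotations.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct obj { int initialised; };

static _Atomic(struct obj *) shared;	/* stands in for vma->anon_vma */

static void publisher(void)
{
	struct obj *o = malloc(sizeof(*o));

	if (!o)
		return;
	o->initialised = 1;	/* the init stores done in anon_vma_alloc() */

	/*
	 * A plain/relaxed store here would be the status quo: nothing
	 * orders the init store above before it on a weakly ordered CPU.
	 *
	 *	atomic_store_explicit(&shared, o, memory_order_relaxed);
	 *
	 * Release publish - the kernel-side equivalent would be
	 * smp_store_release(&vma->anon_vma, anon_vma):
	 */
	atomic_store_explicit(&shared, o, memory_order_release);
}

static void observer(void)
{
	/* Kernel-side equivalent: READ_ONCE()/smp_load_acquire(). */
	struct obj *o = atomic_load_explicit(&shared, memory_order_acquire);

	if (o)
		printf("initialised=%d\n", o->initialised);
}

int main(void)
{
	publisher();
	observer();
	free(atomic_load_explicit(&shared, memory_order_relaxed));
	return 0;
}

But maybe the existing locking already gives us this in some way I'm not
seeing - hence the question.
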
> >
> > As far as the page fault is concerned it only really cares about whether it
> > exists or not, not whether it's initialised.
>
> Hmm, yeah, I'm not sure if anything in the page fault path actually
> directly accesses the anon_vma. The page fault path does eventually
> re-publish the anon_vma pointer with `WRITE_ONCE(folio->mapping,
> (struct address_space *) anon_vma)` in __folio_set_anon() though,
> which could then potentially allow a third thread to walk through
> folio->mapping and observe the uninitialized anon_vma...
But how would it be uninitialised at that point? See above.
>
> Looking at the situation on latest stable (v6.18.5), two racing faults
> on _adjacent_ anonymous VMAs could also end up with one thread writing
> ->anon_vma while the other thread executes reusable_anon_vma(),
	if (anon_vma_compatible(a, b)) {
		struct anon_vma *anon_vma = READ_ONCE(old->anon_vma);

		if (anon_vma && list_is_singular(&old->anon_vma_chain))
			return anon_vma;
	}
Hmm... again I don't see how we're finding a mergeable anon_vma in the
adjacent VMA which is somehow uninitialised?
> loading the pointer to that anon_vma and accessing its
> ->anon_vma_chain.
The VMA's anon_vma_chain you mean? anon_vma doesn't have that field.
Is it again based on the assumption that on some architectures we might see
a write of an allocated-but-not-initialised anon_vma?
But I also don't see how this is harmful, as anything that meaningfully
touches anon_vma state has to take the rmap lock anyway.

I may be missing something here?
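
(By "rmap lock" I mean the root anon_vma rwsem - from memory the helpers in
include/linux/rmap.h are roughly the below, so anything serious done to an
anon_vma goes through its root's rwsem:)

/* Quoted from memory, include/linux/rmap.h: */
static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
{
	down_write(&anon_vma->root->rwsem);
}

static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
{
	down_read(&anon_vma->root->rwsem);
}
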
>
> > The operations that check/modify fields within the anon_vma are protected by the
> > anon rmap lock (my recent series takes advantage of this to avoid holding that
> > lock during AVC allocation for instance).
> >
> > This lock also protects the interval tree.
I have recently reworked a lot of the anon_vma code and am focusing on it as
an area to address, so I'm interested in exploring this further on that
basis.

Thanks, Lorenzo