From: Jann Horn <jannh@google.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Pedro Falcato <pfalcato@suse.de>,
syzbot <syzbot+20ed41006cf9d842c2b5@syzkaller.appspotmail.com>,
Liam.Howlett@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
syzkaller-bugs@googlegroups.com, vbabka@suse.cz
Subject: Re: [syzbot] [mm?] BUG: unable to handle kernel paging request in vma_merge_existing_range
Date: Fri, 21 Mar 2025 00:04:42 +0100 [thread overview]
Message-ID: <CAG48ez0JUBPrb5Mh13Z__OeQ+w+uuhfEmj2FRReCpCT-2O-uyw@mail.gmail.com> (raw)
In-Reply-To: <332b3149-0e84-4bf8-9542-89d68b0a9680@lucifer.local>
On Thu, Mar 20, 2025 at 9:53 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Thu, Mar 20, 2025 at 09:11:33PM +0100, Jann Horn wrote:
> > On Thu, Mar 20, 2025 at 9:02 PM Pedro Falcato <pfalcato@suse.de> wrote:
> > > On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote:
[...]
> > > Ahh, fun bug. This *seems* to be the bug:
> > >
> > > First, in vma_modify:
> > >
> > > merged = vma_merge_existing_range(vmg);
> > > if (merged)
> > > return merged;
> > > if (vmg_nomem(vmg))
> > > return ERR_PTR(-ENOMEM);
> > >
> > > then, all the way up to userfaultfd_release_all (the return value propagates
> > > vma_modify -> vma_modify_flags_uffd -> userfaultfd_clear_vma):
[...]
> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index 71ca012c616c..b2167b7dc27d 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -1517,8 +1517,16 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg)
> > > merged = vma_merge_existing_range(vmg);
> > > if (merged)
> > > return merged;
> > > - if (vmg_nomem(vmg))
> > > + if (vmg_nomem(vmg)) {
> > > + /* If we can avoid failing the whole modification
> > > + * due to a merge OOM and validly keep going
> > > + * (we're modifying the whole VMA), return vma intact.
> > > + * It won't get merged, but such is life - we're avoiding
> > > + * OOM conditions in other parts of mm/ this way */
> > > + if (start <= vma->vm_start && end >= vma->vm_end)
> > > + return vma;
>
> I do not like this solution at all, sorry.
>
> I mean I get what you're doing and it's smart to try to find a means out of
> this in general :) but let me explain my reasoning:
>
> For one this is uffd's fault, and the fix clearly needs to be there.
I mean... this worked fine back in, for example, 5.4 -
userfaultfd_release() would loop over the VMAs, change some stuff in
some of them, and merge them where possible, and there was no way
anything could fail. It's the VMA subsystem that changed its API...
> But also, we _can't be sure_ vma is valid any more. The second it goes off
> to vma_merge_existing_range() it might be removed, which is why it's
> critical to only use 'merged'.
>
> Now you might be able to prove that _right now_ it'd be ok, if you do this
> check, because vma_complete() does the delete and only if either
> vma_iter_prealloc() or dup_anon_vma() fails would we return -ENOMEM and
> these happen _before_ vma_complete(), but that's an _implementation detail_
> and now we've made an assumption that this is the case here.
>
> An implicit effectively precondition on something unexpressed like that is
> asking for trouble, really don't want to go that way.
>
>
> > > return ERR_PTR(-ENOMEM);
> > > + }
>
> >
> > Along the lines of your idea, perhaps we could add a parameter "bool
> > never_fail" to vma_modify() that is passed through to
> > vma_merge_existing_range(), and guarantee that it never fails when
> > that parameter is set? Then we could also check that never_fail is
> > only used in cases where no split is necessary. That somewhat avoids
> > having this kind of check that only ever runs in error conditions...
>
> Yeah hmmm, again this is _not where the problem is_. And we're doing it for
> _one case only_, who _must_ succeed. Right?
It seems to me like it is... theoretically very reasonable of
userfaultfd to expect to have a "reliably change the flags of an
entire VMA" operation, and if the normal VMA code doesn't provide that
because of maple tree internals in the merging case, then it would be
reasonable for the VMA code to provide an alternative that does
provide this?
> Buuuut then again, we could add a _feature_ (it'd be something in VMG not a
> bool, hey what are helper structs for right? :P)
>
> We coould add a special mode that says we __GFP_NOFAIL, we do that in
> vms_abort_munmap_vmas() and man that was under similar circumstances (hey
> remember that fun Liam? :)
>
> But at the same time, it feels icky, and we probably don't want to
> proliferate this pattern too much.
>
> So I'd kind of rather prefer a _general_ no-fail unmap that the uffd
> release code could invoke.
>
> Perhaps we could genericise the vms_abort_munmap_vmas():
>
> mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);
>
> And make that available or some form of it, to do the 'simplest' thing in
> this scenario.
The userfaultfd release code doesn't want an "unmap" operation. It
just wants to remove the __VM_UFFD_FLAGS flags and set the
vma->vm_userfaultfd_ctx pointer to NULL.
The VMA code then sometimes sees an opportunity to merge with adjacent
VMAs; and this merging is what's failing.
So if we're willing to tolerate having adjacent VMAs that are
mergeable but aren't merged after an allocation failure, then instead
of using __GFP_NOFAIL for the merge, we could also just ignore merge
failures, at least when some "never fail" flag is set?
[...]
> Another possible solution is to add a flag that _explicitly asserts and
> documents_ that you require that no VMA be removed before attempting to
> allocate.
>
> Or we could make that an _explicit_ assumption?
I don't think I understand this part. What VMA removal and allocation
are you talking about?
> And then the uffd code itself could cache vma and take Pedro's solution on
> that basis?
>
> void userfaultfd_release_all(struct mm_struct *mm,
> struct userfaultfd_ctx *ctx)
> {
> ...
>
> for_each_vma(vmi, vma) {
> struct vm_area_struct *old = vma;
>
> ...
>
> vma = userfaultfd_clear_vma(&vmi, prev, vma,
> vma->vm_start, vma->vm_end);
> if (IS_ERR(vma)) {
> BUG_ON(vma != -ENOMEM); /* Sorry Linus! */
>
> /*
> * OK we assert above that vma must remain intact if we fail to allocate,
> * We are in an extreme memory pressure state, we simply cannot clear this VMA. Move on.
> */
> prev = old;
> }
>
> ...
> }
> }
>
> I mean it's going to be dirty whichever way round we do this.
>
> Thoughts guys?
I guess my main thought on this is: I would prefer it if we keep any
code that runs only in near-impossible cases as simple as possible,
because issues in those codepaths will take longer to find.
next prev parent reply other threads:[~2025-03-20 23:05 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-20 19:09 syzbot
2025-03-20 20:02 ` Pedro Falcato
2025-03-20 20:02 ` syzbot
2025-03-20 20:11 ` Jann Horn
2025-03-20 20:53 ` Lorenzo Stoakes
2025-03-20 23:04 ` Jann Horn [this message]
2025-03-21 9:10 ` Lorenzo Stoakes
2025-03-23 16:49 ` syzbot
2025-03-24 0:31 ` syzbot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAG48ez0JUBPrb5Mh13Z__OeQ+w+uuhfEmj2FRReCpCT-2O-uyw@mail.gmail.com \
--to=jannh@google.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=pfalcato@suse.de \
--cc=syzbot+20ed41006cf9d842c2b5@syzkaller.appspotmail.com \
--cc=syzkaller-bugs@googlegroups.com \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox