linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
To: Jann Horn <jannh@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	kernel list <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: maple tree change made it possible for VMA iteration to see same VMA twice due to late vma_merge() failure
Date: Wed, 16 Aug 2023 12:17:58 -0400	[thread overview]
Message-ID: <20230816161758.avedpxvqpwngzmut@revolver> (raw)
In-Reply-To: <CAG48ez12VN1JAOtTNMY+Y2YnsU45yL5giS-Qn=ejtiHpgJAbdQ@mail.gmail.com>

* Jann Horn <jannh@google.com> [230815 15:37]:
> commit 18b098af2890 ("vma_merge: set vma iterator to correct
> position.") added a vma_prev(vmi) call to vma_merge() at a point where
> it's still possible to bail out. My understanding is that this moves
> the VMA iterator back by one VMA.
> 
> If you patch some extra logging into the kernel and inject a fake
> out-of-memory error at the vma_iter_prealloc() call in vma_split() (a
> real out-of-memory error there is very unlikely to happen in practice,
> I think - my understanding is that the kernel will basically kill
> every process on the system except for init before it starts failing
> GFP_KERNEL allocations that fit within a single slab, unless the
> allocation uses GFP_ACCOUNT or stuff like that, which the maple tree
> doesn't):
> 
> ```
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 7cecd49e078b..a7be4d6a5db6 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1454,9 +1454,16 @@ static int userfaultfd_register(struct
> userfaultfd_ctx *ctx,
>                 prev = vma;
> 
>         ret = 0;
> +       if (strcmp(current->comm, "FAILME") == 0)
> +               pr_warn("%s: begin vma iteration\n", __func__);
>         for_each_vma_range(vmi, vma, end) {
>                 cond_resched();
> 
> +               if (strcmp(current->comm, "FAILME") == 0) {
> +                       pr_warn("%s: prev=%px, vma=%px (%016lx-%016lx)\n",
> +                               __func__, prev, vma, vma->vm_start,
> vma->vm_end);
> +               }
> +
>                 BUG_ON(!vma_can_userfault(vma, vm_flags));
>                 BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
>                        vma->vm_userfaultfd_ctx.ctx != ctx);
> @@ -1481,6 +1488,8 @@ static int userfaultfd_register(struct
> userfaultfd_ctx *ctx,
>                                  vma_policy(vma),
>                                  ((struct vm_userfaultfd_ctx){ ctx }),
>                                  anon_vma_name(vma));
> +               if (strcmp(current->comm, "FAILME") == 0)
> +                       pr_warn("%s: vma_merge returned %px\n", __func__, prev);
>                 if (prev) {
>                         /* vma_merge() invalidated the mas */
>                         vma = prev;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3937479d0e07..fd435c40c43d 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -990,7 +990,7 @@ struct vm_area_struct *vma_merge(struct
> vma_iterator *vmi, struct mm_struct *mm,
>         if (err)
>                 return NULL;
> 
> -       if (vma_iter_prealloc(vmi))
> +       if (strcmp(current->comm, "FAILME")==0 || vma_iter_prealloc(vmi))
>                 return NULL;
> 
>         init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
> ```
> 
> and then you run this reproducer:
> ```
> #define _GNU_SOURCE
> #include <err.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/syscall.h>
> #include <sys/prctl.h>
> #include <sys/mman.h>
> #include <sys/ioctl.h>
> #include <linux/userfaultfd.h>
> 
> #ifndef UFFD_USER_MODE_ONLY
> #define UFFD_USER_MODE_ONLY 1
> #endif
> 
> #define SYSCHK(x) ({          \
>   typeof(x) __res = (x);      \
>   if (__res == (typeof(x))-1) \
>     err(1, "SYSCHK(" #x ")"); \
>   __res;                      \
> })
> 
> int main(void) {
>   int uffd = SYSCHK(syscall(__NR_userfaultfd, UFFD_USER_MODE_ONLY));
> 
>   struct uffdio_api api = { .api = UFFD_API, .features = 0 };
>   SYSCHK(ioctl(uffd, UFFDIO_API, &api));
> 
>   /* create vma1 */
>   SYSCHK(mmap((void*)0x100000UL, 0x1000, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0));
> 
>   /* set uffd on vma1 */
>   struct uffdio_register reg1 = {
>     .range = { .start = 0x100000, .len = 0x1000 },
>     .mode = UFFDIO_REGISTER_MODE_MISSING
>   };
>   SYSCHK(ioctl(uffd, UFFDIO_REGISTER, &reg1));
> 
>   /* create vma2 */
>   SYSCHK(mmap((void*)0x101000UL, 0x1000, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0));
> 
>   /* tries to merge vma2 into vma1, with injected allocation failure
> causing merge failure */
>   SYSCHK(prctl(PR_SET_NAME, "FAILME"));
>   struct uffdio_register reg2 = {
>     .range = { .start = 0x101000, .len = 0x1000 },
>     .mode = UFFDIO_REGISTER_MODE_MISSING
>   };
>   SYSCHK(ioctl(uffd, UFFDIO_REGISTER, &reg2));
>   SYSCHK(prctl(PR_SET_NAME, "normal"));
> }
> ```
> 
> then you'll get this fun log output, showing that the same VMA
> (ffff88810c0b5e00) was visited by two iterations of the VMA iteration
> loop, and on the second iteration, prev==vma:
> 
> [  326.765586] userfaultfd_register: begin vma iteration
> [  326.766985] userfaultfd_register: prev=ffff88810c0b5ef0,
> vma=ffff88810c0b5e00 (0000000000101000-0000000000102000)
> [  326.768786] userfaultfd_register: vma_merge returned 0000000000000000
> [  326.769898] userfaultfd_register: prev=ffff88810c0b5e00,
> vma=ffff88810c0b5e00 (0000000000101000-0000000000102000)
> 
> I don't know if this can lead to anything bad but it seems pretty
> clearly unintended?

Yes, unintended.

So we are running out of memory, but since vma_merge() doesn't
differentiate between failure and 'nothing to merge', we end up in a
situation that we will revisit the same VMA.

I've been thinking about a way to work this into the interface and I
don't see a clean way because we (could) do different things before the
call depending on the situation.

I think we need to undo any vma iterator changes in the failure
scenarios if there is a chance of the iterator continuing to be used,
which is probably not limited to just this case.

I will audit these areas and CC you on the result.

Thanks,
Liam



  parent reply	other threads:[~2023-08-16 16:18 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-15 19:36 Jann Horn
2023-08-15 19:44 ` Jann Horn
2023-08-16 16:17 ` Liam R. Howlett [this message]
2023-08-16 17:12   ` Jann Horn
2023-08-16 19:18     ` Liam R. Howlett
2023-09-22 16:19       ` Liam R. Howlett
2023-09-22 17:52         ` Liam R. Howlett
2023-09-22 18:02           ` Jann Horn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230816161758.avedpxvqpwngzmut@revolver \
    --to=liam.howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=jannh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox