Re: maple tree change made it possible for VMA iteration to see same VMA twice due to late vma_merge() failure

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
To: Jann Horn <jannh@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	kernel list <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>, Vlastimil Babka <vbabka@suse.cz>
Subject: Re: maple tree change made it possible for VMA iteration to see same VMA twice due to late vma_merge() failure
Date: Fri, 22 Sep 2023 12:19:19 -0400	[thread overview]
Message-ID: <20230922161919.6ct5c7tj35r4ex7m@revolver> (raw)
In-Reply-To: <20230816191851.wo2xhthmfq7uzoc3@revolver>

* Liam R. Howlett <Liam.Howlett@Oracle.com> [230816 15:18]:
> * Jann Horn <jannh@google.com> [230816 13:13]:
> > On Wed, Aug 16, 2023 at 6:18 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > * Jann Horn <jannh@google.com> [230815 15:37]:
> > > > commit 18b098af2890 ("vma_merge: set vma iterator to correct
> > > > position.") added a vma_prev(vmi) call to vma_merge() at a point where
> > > > it's still possible to bail out. My understanding is that this moves
> > > > the VMA iterator back by one VMA.
> > > >
> > > > If you patch some extra logging into the kernel and inject a fake
> > > > out-of-memory error at the vma_iter_prealloc() call in vma_split() (a
> > > > real out-of-memory error there is very unlikely to happen in practice,
> > > > I think - my understanding is that the kernel will basically kill
> > > > every process on the system except for init before it starts failing
> > > > GFP_KERNEL allocations that fit within a single slab, unless the
> > > > allocation uses GFP_ACCOUNT or stuff like that, which the maple tree
> > > > doesn't):
> > [...]
> > > > then you'll get this fun log output, showing that the same VMA
> > > > (ffff88810c0b5e00) was visited by two iterations of the VMA iteration
> > > > loop, and on the second iteration, prev==vma:
> > > >
> > > > [  326.765586] userfaultfd_register: begin vma iteration
> > > > [  326.766985] userfaultfd_register: prev=ffff88810c0b5ef0,
> > > > vma=ffff88810c0b5e00 (0000000000101000-0000000000102000)
> > > > [  326.768786] userfaultfd_register: vma_merge returned 0000000000000000
> > > > [  326.769898] userfaultfd_register: prev=ffff88810c0b5e00,
> > > > vma=ffff88810c0b5e00 (0000000000101000-0000000000102000)
> > > >
> > > > I don't know if this can lead to anything bad but it seems pretty
> > > > clearly unintended?
> > >
> > > Yes, unintended.
> > >
> > > So we are running out of memory, but since vma_merge() doesn't
> > > differentiate between failure and 'nothing to merge', we end up in a
> > > situation that we will revisit the same VMA.
> > >
> > > I've been thinking about a way to work this into the interface and I
> > > don't see a clean way because we (could) do different things before the
> > > call depending on the situation.
> > >
> > > I think we need to undo any vma iterator changes in the failure
> > > scenarios if there is a chance of the iterator continuing to be used,
> > > which is probably not limited to just this case.
> > 
> > I don't fully understand the maple tree interface - in the specific
> > case of vma_merge(), could you move the vma_prev() call down below the
> > point of no return, after vma_iter_prealloc()? Or does
> > vma_iter_prealloc() require that the iterator is already in the insert
> > position?
> 
> Yes, but maybe it shouldn't.  I detect a write going beyond the end of a
> node and take corrective action, but not to the front of a node.
> 
> If I change the internal code to figure out the preallocations without
> being pointed at the insert location, I still cannot take corrective
> action on failure since I don't know where I should have been within the
> tree structure, that is, I have lost the original range.
> 
> I'm still looking at this, but I'm wondering if I should change my
> interface for preallocations so I can handle this internally.  That
> would be a bigger change.
> 
> > 
> > > I will audit these areas and CC you on the result.

Looking at this, I think it's best to make a label and undo the
vma_prev() with a vma_next() - at least for now.

I'm also reading this for the error path on dup_anon_vma() failure, and
it appears to also have an issue which I'd like to point out here before
I send the fix for the first issue.

-----------
                vma_start_write(next);                                                                                  
                remove = next;                          /* case 1 */                                                    
                vma_end = next->vm_end;                                                                                 
                err = dup_anon_vma(prev, next);                                                                         
                if (curr) {                             /* case 6 */                                                    
                        vma_start_write(curr);                                                                          
                        remove = curr;                                                                                  
                        remove2 = next;                                                                                 
                        if (!next->anon_vma)                                                                            
                                err = dup_anon_vma(prev, curr);  
-----------

Since dup_anon_vma() can fail, I think here in case 6 we could overwrite
the failure.

That is, we will fail to clone the anon vma and mask the failure if we
are running case 6 with an anon in next.  Once the first dup_anon_vma()
returns error, the next call to clone curr vma may return 0 if there is
no anon vma (this, I think _must_ be the case). Then we are in a
situation where we will be removing next and expanding prev over curr
and next, but have not dup'ed the anon vma from next.


Thanks,
Liam

next prev parent reply	other threads:[~2023-09-22 16:19 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-15 19:36 Jann Horn
2023-08-15 19:44 ` Jann Horn
2023-08-16 16:17 ` Liam R. Howlett
2023-08-16 17:12   ` Jann Horn
2023-08-16 19:18     ` Liam R. Howlett
2023-09-22 16:19       ` Liam R. Howlett [this message]
2023-09-22 17:52         ` Liam R. Howlett
2023-09-22 18:02           ` Jann Horn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230922161919.6ct5c7tj35r4ex7m@revolver \
    --to=liam.howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=jannh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lstoakes@gmail.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox