linux-mm.kvack.org archive mirror
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
To: Jann Horn <jannh@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	kernel list <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>, Vlastimil Babka <vbabka@suse.cz>
Subject: Re: maple tree change made it possible for VMA iteration to see same VMA twice due to late vma_merge() failure
Date: Fri, 22 Sep 2023 12:19:19 -0400
Message-ID: <20230922161919.6ct5c7tj35r4ex7m@revolver>
In-Reply-To: <20230816191851.wo2xhthmfq7uzoc3@revolver>

* Liam R. Howlett <Liam.Howlett@Oracle.com> [230816 15:18]:
> * Jann Horn <jannh@google.com> [230816 13:13]:
> > On Wed, Aug 16, 2023 at 6:18 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > * Jann Horn <jannh@google.com> [230815 15:37]:
> > > > commit 18b098af2890 ("vma_merge: set vma iterator to correct
> > > > position.") added a vma_prev(vmi) call to vma_merge() at a point where
> > > > it's still possible to bail out. My understanding is that this moves
> > > > the VMA iterator back by one VMA.
> > > >
> > > > If you patch some extra logging into the kernel and inject a fake
> > > > out-of-memory error at the vma_iter_prealloc() call in vma_split() (a
> > > > real out-of-memory error there is very unlikely to happen in practice,
> > > > I think - my understanding is that the kernel will basically kill
> > > > every process on the system except for init before it starts failing
> > > > GFP_KERNEL allocations that fit within a single slab, unless the
> > > > allocation uses GFP_ACCOUNT or stuff like that, which the maple tree
> > > > doesn't):
> > [...]
> > > > then you'll get this fun log output, showing that the same VMA
> > > > (ffff88810c0b5e00) was visited by two iterations of the VMA iteration
> > > > loop, and on the second iteration, prev==vma:
> > > >
> > > > [  326.765586] userfaultfd_register: begin vma iteration
> > > > [  326.766985] userfaultfd_register: prev=ffff88810c0b5ef0,
> > > > vma=ffff88810c0b5e00 (0000000000101000-0000000000102000)
> > > > [  326.768786] userfaultfd_register: vma_merge returned 0000000000000000
> > > > [  326.769898] userfaultfd_register: prev=ffff88810c0b5e00,
> > > > vma=ffff88810c0b5e00 (0000000000101000-0000000000102000)
> > > >
> > > > I don't know if this can lead to anything bad but it seems pretty
> > > > clearly unintended?
> > >
> > > Yes, unintended.
> > >
> > > So we are running out of memory, but since vma_merge() doesn't
> > > differentiate between failure and 'nothing to merge', we end up in a
> > > situation that we will revisit the same VMA.
> > >
> > > I've been thinking about a way to work this into the interface and I
> > > don't see a clean way because we (could) do different things before the
> > > call depending on the situation.
> > >
> > > I think we need to undo any vma iterator changes in the failure
> > > scenarios if there is a chance of the iterator continuing to be used,
> > > which is probably not limited to just this case.
> > 
> > I don't fully understand the maple tree interface - in the specific
> > case of vma_merge(), could you move the vma_prev() call down below the
> > point of no return, after vma_iter_prealloc()? Or does
> > vma_iter_prealloc() require that the iterator is already in the insert
> > position?
> 
> Yes, but maybe it shouldn't.  I detect a write going beyond the end of a
> node and take corrective action, but not to the front of a node.
> 
> If I change the internal code to figure out the preallocations without
> being pointed at the insert location, I still cannot take corrective
> action on failure since I don't know where I should have been within the
> tree structure, that is, I have lost the original range.
> 
> I'm still looking at this, but I'm wondering if I should change my
> interface for preallocations so I can handle this internally.  That
> would be a bigger change.
> 
> > 
> > > I will audit these areas and CC you on the result.

Looking at this, I think the best approach, at least for now, is to add
an error label that undoes the vma_prev() with a vma_next().
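
A toy sketch of that idea, not the kernel code: the iterator is modelled
as a plain array index, and toy_iter, toy_prev(), toy_next() and
toy_merge() are hypothetical names standing in for the roles of the vma
iterator, vma_prev(), vma_next() and vma_merge().  The prealloc_ok and
restore flags exist only so both behaviours can be compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a VMA iterator: just an index into an array. */
struct toy_iter {
	size_t pos;
};

/* Hypothetical helper playing the role of vma_prev(). */
static void toy_prev(struct toy_iter *it)
{
	assert(it->pos > 0);
	it->pos--;
}

/* Hypothetical helper playing the role of vma_next(). */
static void toy_next(struct toy_iter *it)
{
	it->pos++;
}

/*
 * Toy merge: step the iterator back to the insert position, then
 * "preallocate".  Returns true on success.  On failure it returns
 * false and, if restore is set, puts the iterator back where the
 * caller left it.
 */
static bool toy_merge(struct toy_iter *it, bool prealloc_ok, bool restore)
{
	toy_prev(it);

	if (!prealloc_ok)
		goto prealloc_fail;

	/* Point of no return: a real merge would modify the tree here. */
	return true;

prealloc_fail:
	if (restore)
		toy_next(it);	/* undo the earlier step back */
	return false;
}
```

With restore unset, a failed toy_merge() leaves the index one slot
earlier than the caller expects, so a loop that advances after the call
lands on the same element again.  With restore set, the index is
unchanged on failure.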

I have also been reading the error path for dup_anon_vma() failure.  It
appears to have an issue of its own, which I'd like to point out here
before I send the fix for the first one.

-----------
                vma_start_write(next);
                remove = next;                          /* case 1 */
                vma_end = next->vm_end;
                err = dup_anon_vma(prev, next);
                if (curr) {                             /* case 6 */
                        vma_start_write(curr);
                        remove = curr;
                        remove2 = next;
                        if (!next->anon_vma)
                                err = dup_anon_vma(prev, curr);
-----------

Since dup_anon_vma() can fail, I think that in case 6 the second call
could overwrite an error returned by the first.

That is, if we are running case 6 with an anon_vma in next, we can fail
to clone the anon_vma and then mask that failure.  Once the first
dup_anon_vma() returns an error, the second call, which clones from
curr, may return 0 if curr has no anon_vma (which, I think, _must_ be
the case).  We then end up removing next and expanding prev over curr
and next without having dup'ed the anon_vma from next.


Thanks,
Liam


