Re: [RFC] Improving userfaultfd scalability for live migration

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: David Matlack <dmatlack@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>,
	James Houghton <jthoughton@google.com>,
	 Andrea Arcangeli <aarcange@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	 Axel Rasmussen <axelrasmussen@google.com>,
	Linux MM <linux-mm@kvack.org>, kvm <kvm@vger.kernel.org>,
	 chao.p.peng@linux.intel.com
Subject: Re: [RFC] Improving userfaultfd scalability for live migration
Date: Mon, 5 Dec 2022 09:31:48 -0800	[thread overview]
Message-ID: <CALzav=d=N7teRvjQZ1p0fs6i9hjmH7eVppJLMh_Go4TteQqqwg@mail.gmail.com> (raw)
In-Reply-To: <Y44NylxprhPn6AoN@x1n>

On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Sat, Dec 03, 2022 at 01:03:38AM +0000, Sean Christopherson wrote:
> > On Thu, Dec 01, 2022, James Houghton wrote:
> > > #1, however, is quite doable. The main codepath for post-copy, the
> > > path that is taken when a vCPU attempts to access unmapped memory, is
> > > (for x86, but similar for other architectures): handle_ept_violation
> > > -> hva_to_pfn -> GUP -> handle_userfault. I'll call this the "EPT
> > > violation path" or "mem fault path." Other post-copy paths include at
> > > least: (i) KVM attempts to access guest memory via.
> > > copy_{to,from}_user -> #pf -> handle_mm_fault -> handle_userfault, and
> > > (ii) other callers of gfn_to_pfn* or hva_to_pfn* outside of the EPT
> > > violation path (e.g., instruction emulation).
> > >
> > > We want the EPT violation path to be fast, as it is taken the vast
> > > majority of the time.
> >
> > ...
> >
> > > == Getting the faulting GPA to userspace ==
> > > KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> > > and it provides the main functionality we need. We can extend it
> > > easily to support our use case here, and I think we have at least two
> > > options:
> > > - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> > > KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> > > otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> > > -EFAULT).
> > > - We're already introducing a new CAP, so just tie the above behavior
> > > to whether or not one of the CAPs (below) is being used.
> >
> > We might even be able to get away with a third option: unconditionally return
> > KVM_EXIT_MEMORY_FAULT instead of -EFAULT when the error occurs when accessing
> > guest memory.
> >
> > > == Problems ==
> > > The major problem here is that this only solves the scalability
> > > problem for the KVM demand paging case. Other userfaultfd users, if
> > > they have scalability problems, will need to find another approach.
> >
> > It may not fully solve KVM's problem either.  E.g. if the VM is running nested
> > VMs, many (most?) of the user faults could be triggered by FNAME(walk_addr_generic)
> > via __get_user() when walking L1's EPT tables.

We could always modify FNAME(walk_addr_generic) to return out to user
space in the same way if that is indeed another bottleneck.

> >
> > Disclaimer: I know _very_ little about UFFD.
> >
> > Rather than add yet another flag to gup(), what about flag to say the task doesn't
> > want to wait for UFFD faults?  If desired/necessary, KVM could even toggle the flag
> > in KVM_RUN so that faults that occur outside of KVM ultimately don't send an actual
> > SIGBUGS.

There are some copy_to/from_user() calls in KVM that cannot easily
exit out to KVM_RUN (for example, in the guts of the emulator IIRC).
But we could use your approach just to wrap the specific call sites
that can return from KVM_RUN.

> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 07c81ab3fd4d..7f66b56dd6e7 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >          * shmem_vm_ops->fault method is invoked even during
> >          * coredumping without mmap_lock and it ends up here.
> >          */
> > -       if (current->flags & (PF_EXITING|PF_DUMPCORE))
> > +       if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
> >                 goto out;
>
> I'll have a closer read on the nested part, but note that this path already
> has the mmap lock then it invalidates the goal if we want to avoid taking
> it from the first place, or maybe we don't care?
>
> If we want to avoid taking the mmap lock at all (hence the fast-gup
> approach), I'd also suggest we don't make it related to uffd at all but
> instead an interface to say "let's check whether the page tables are there
> (walk pgtable by fast-gup only), if not return to userspace".
>
> Because IIUC fast-gup has nothing to do with uffd, so it can also be a more
> generic interface.  It's just that if the userspace knows what it's doing
> (postcopy-ing), it knows then the faults can potentially be resolved by
> userfaultfd at this stage.

Are there any cases where fast-gup can fail while uffd is enabled but
it's not due to uffd? e.g. if a page is swapped out? I don't know what
userspace would do in those situations to make forward progress.



>
> >
> >         /*
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index ffb6eb55cd13..4c6c53ac6531 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1729,7 +1729,7 @@ extern struct pid *cad_pid;
> >  #define PF_MEMALLOC            0x00000800      /* Allocating memory */
> >  #define PF_NPROC_EXCEEDED      0x00001000      /* set_user() noticed that RLIMIT_NPROC was exceeded */
> >  #define PF_USED_MATH           0x00002000      /* If unset the fpu must be initialized before use */
> > -#define PF__HOLE__00004000     0x00004000
> > +#define PF_NO_UFFD_WAIT                0x00004000
> >  #define PF_NOFREEZE            0x00008000      /* This thread should not be frozen */
> >  #define PF__HOLE__00010000     0x00010000
> >  #define PF_KSWAPD              0x00020000      /* I am kswapd */
> >
>
> --
> Peter Xu
>

next prev parent reply	other threads:[~2022-12-05 17:32 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-01 19:37 James Houghton
2022-12-03  1:03 ` Sean Christopherson
2022-12-05 15:27   ` Peter Xu
2022-12-05 17:31     ` David Matlack [this message]
2022-12-05 18:03       ` David Matlack
2022-12-05 18:23         ` Sean Christopherson
2022-12-05 18:20       ` Sean Christopherson
2022-12-05 21:19         ` James Houghton
2022-12-06  1:06           ` Sean Christopherson
2022-12-06 17:35             ` James Houghton
2022-12-06 18:00               ` Sean Christopherson
2022-12-06 20:41                 ` James Houghton
2022-12-08  1:56                   ` David Matlack
2022-12-08 17:50                     ` James Houghton
2023-01-04  0:57                       ` Sean Christopherson
2023-01-04  1:05                         ` James Houghton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CALzav=d=N7teRvjQZ1p0fs6i9hjmH7eVppJLMh_Go4TteQqqwg@mail.gmail.com' \
    --to=dmatlack@google.com \
    --cc=aarcange@redhat.com \
    --cc=axelrasmussen@google.com \
    --cc=chao.p.peng@linux.intel.com \
    --cc=jthoughton@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=seanjc@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox