linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Sean Christopherson <seanjc@google.com>
Cc: James Houghton <jthoughton@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Linux MM <linux-mm@kvack.org>, kvm <kvm@vger.kernel.org>,
	chao.p.peng@linux.intel.com
Subject: Re: [RFC] Improving userfaultfd scalability for live migration
Date: Mon, 5 Dec 2022 10:27:06 -0500	[thread overview]
Message-ID: <Y44NylxprhPn6AoN@x1n> (raw)
In-Reply-To: <Y4qgampvx4lrHDXt@google.com>

On Sat, Dec 03, 2022 at 01:03:38AM +0000, Sean Christopherson wrote:
> On Thu, Dec 01, 2022, James Houghton wrote:
> > #1, however, is quite doable. The main codepath for post-copy, the
> > path that is taken when a vCPU attempts to access unmapped memory, is
> > (for x86, but similar for other architectures): handle_ept_violation
> > -> hva_to_pfn -> GUP -> handle_userfault. I'll call this the "EPT
> > violation path" or "mem fault path." Other post-copy paths include at
> > least: (i) KVM attempts to access guest memory via.
> > copy_{to,from}_user -> #pf -> handle_mm_fault -> handle_userfault, and
> > (ii) other callers of gfn_to_pfn* or hva_to_pfn* outside of the EPT
> > violation path (e.g., instruction emulation).
> > 
> > We want the EPT violation path to be fast, as it is taken the vast
> > majority of the time.
> 
> ...
> 
> > == Getting the faulting GPA to userspace ==
> > KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> > and it provides the main functionality we need. We can extend it
> > easily to support our use case here, and I think we have at least two
> > options:
> > - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> > KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> > otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> > -EFAULT).
> > - We're already introducing a new CAP, so just tie the above behavior
> > to whether or not one of the CAPs (below) is being used.
> 
> We might even be able to get away with a third option: unconditionally return
> KVM_EXIT_MEMORY_FAULT instead of -EFAULT when the error occurs when accessing
> guest memory.
> 
> > == Problems ==
> > The major problem here is that this only solves the scalability
> > problem for the KVM demand paging case. Other userfaultfd users, if
> > they have scalability problems, will need to find another approach.
> 
> It may not fully solve KVM's problem either.  E.g. if the VM is running nested
> VMs, many (most?) of the user faults could be triggered by FNAME(walk_addr_generic)
> via __get_user() when walking L1's EPT tables.
> 
> Disclaimer: I know _very_ little about UFFD.
> 
> Rather than add yet another flag to gup(), what about flag to say the task doesn't
> want to wait for UFFD faults?  If desired/necessary, KVM could even toggle the flag
> in KVM_RUN so that faults that occur outside of KVM ultimately don't send an actual
> SIGBUGS.
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 07c81ab3fd4d..7f66b56dd6e7 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>          * shmem_vm_ops->fault method is invoked even during
>          * coredumping without mmap_lock and it ends up here.
>          */
> -       if (current->flags & (PF_EXITING|PF_DUMPCORE))
> +       if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
>                 goto out;

I'll have a closer read on the nested part, but note that this path already
has the mmap lock then it invalidates the goal if we want to avoid taking
it from the first place, or maybe we don't care?

If we want to avoid taking the mmap lock at all (hence the fast-gup
approach), I'd also suggest we don't make it related to uffd at all but
instead an interface to say "let's check whether the page tables are there
(walk pgtable by fast-gup only), if not return to userspace".

Because IIUC fast-gup has nothing to do with uffd, so it can also be a more
generic interface.  It's just that if the userspace knows what it's doing
(postcopy-ing), it knows then the faults can potentially be resolved by
userfaultfd at this stage.

>  
>         /*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ffb6eb55cd13..4c6c53ac6531 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1729,7 +1729,7 @@ extern struct pid *cad_pid;
>  #define PF_MEMALLOC            0x00000800      /* Allocating memory */
>  #define PF_NPROC_EXCEEDED      0x00001000      /* set_user() noticed that RLIMIT_NPROC was exceeded */
>  #define PF_USED_MATH           0x00002000      /* If unset the fpu must be initialized before use */
> -#define PF__HOLE__00004000     0x00004000
> +#define PF_NO_UFFD_WAIT                0x00004000
>  #define PF_NOFREEZE            0x00008000      /* This thread should not be frozen */
>  #define PF__HOLE__00010000     0x00010000
>  #define PF_KSWAPD              0x00020000      /* I am kswapd */
> 

-- 
Peter Xu



  reply	other threads:[~2022-12-05 15:27 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-01 19:37 James Houghton
2022-12-03  1:03 ` Sean Christopherson
2022-12-05 15:27   ` Peter Xu [this message]
2022-12-05 17:31     ` David Matlack
2022-12-05 18:03       ` David Matlack
2022-12-05 18:23         ` Sean Christopherson
2022-12-05 18:20       ` Sean Christopherson
2022-12-05 21:19         ` James Houghton
2022-12-06  1:06           ` Sean Christopherson
2022-12-06 17:35             ` James Houghton
2022-12-06 18:00               ` Sean Christopherson
2022-12-06 20:41                 ` James Houghton
2022-12-08  1:56                   ` David Matlack
2022-12-08 17:50                     ` James Houghton
2023-01-04  0:57                       ` Sean Christopherson
2023-01-04  1:05                         ` James Houghton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y44NylxprhPn6AoN@x1n \
    --to=peterx@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=axelrasmussen@google.com \
    --cc=chao.p.peng@linux.intel.com \
    --cc=jthoughton@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=pbonzini@redhat.com \
    --cc=seanjc@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox