From: David Matlack <dmatlack@google.com>
Date: Mon, 5 Dec 2022 09:31:48 -0800
Subject: Re: [RFC] Improving userfaultfd scalability for live migration
To: Peter Xu
Cc: Sean Christopherson, James Houghton, Andrea Arcangeli, Paolo Bonzini,
 Axel Rasmussen, Linux MM, kvm, chao.p.peng@linux.intel.com

On Mon, Dec 5, 2022 at 7:30 AM Peter Xu wrote:
>
> On Sat, Dec 03, 2022 at 01:03:38AM +0000, Sean Christopherson
> wrote:
> > On Thu, Dec 01, 2022, James Houghton wrote:
> > > #1, however, is quite doable. The main codepath for post-copy, the
> > > path that is taken when a vCPU attempts to access unmapped memory, is
> > > (for x86, but similar for other architectures): handle_ept_violation
> > > -> hva_to_pfn -> GUP -> handle_userfault. I'll call this the "EPT
> > > violation path" or "mem fault path." Other post-copy paths include at
> > > least: (i) KVM attempts to access guest memory via
> > > copy_{to,from}_user -> #pf -> handle_mm_fault -> handle_userfault, and
> > > (ii) other callers of gfn_to_pfn* or hva_to_pfn* outside of the EPT
> > > violation path (e.g., instruction emulation).
> > >
> > > We want the EPT violation path to be fast, as it is taken the vast
> > > majority of the time.
> >
> > ...
> >
> > > == Getting the faulting GPA to userspace ==
> > > KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> > > and it provides the main functionality we need. We can extend it
> > > easily to support our use case here, and I think we have at least two
> > > options:
> > > - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> > >   KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> > >   otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> > >   -EFAULT).
> > > - We're already introducing a new CAP, so just tie the above behavior
> > >   to whether or not one of the CAPs (below) is being used.
> >
> > We might even be able to get away with a third option: unconditionally
> > return KVM_EXIT_MEMORY_FAULT instead of -EFAULT when an error occurs
> > while accessing guest memory.
> >
> > > == Problems ==
> > > The major problem here is that this only solves the scalability
> > > problem for the KVM demand paging case. Other userfaultfd users, if
> > > they have scalability problems, will need to find another approach.
> >
> > It may not fully solve KVM's problem either. E.g.
> > if the VM is running nested VMs, many (most?) of the user faults
> > could be triggered by FNAME(walk_addr_generic) via __get_user() when
> > walking L1's EPT tables.

We could always modify FNAME(walk_addr_generic) to return out to user
space in the same way if that is indeed another bottleneck.

> > Disclaimer: I know _very_ little about UFFD.
> >
> > Rather than add yet another flag to gup(), what about a flag to say the
> > task doesn't want to wait for UFFD faults? If desired/necessary, KVM
> > could even toggle the flag in KVM_RUN so that faults that occur outside
> > of KVM ultimately don't send an actual SIGBUS.

There are some copy_to/from_user() calls in KVM that cannot easily exit
out to KVM_RUN (for example, in the guts of the emulator IIRC). But we
could use your approach just to wrap the specific call sites that can
return from KVM_RUN.

> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 07c81ab3fd4d..7f66b56dd6e7 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >          * shmem_vm_ops->fault method is invoked even during
> >          * coredumping without mmap_lock and it ends up here.
> >          */
> > -       if (current->flags & (PF_EXITING|PF_DUMPCORE))
> > +       if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
> >                 goto out;
>
> I'll have a closer read on the nested part, but note that this path
> already holds the mmap lock, so it defeats the goal if we want to avoid
> taking it in the first place. Or maybe we don't care?
>
> If we want to avoid taking the mmap lock at all (hence the fast-gup
> approach), I'd also suggest we don't make it related to uffd at all but
> instead an interface to say "let's check whether the page tables are
> there (walk pgtable by fast-gup only), if not return to userspace".
>
> Because IIUC fast-gup has nothing to do with uffd, so it can also be a
> more generic interface.
> It's just that if the userspace knows what it's doing (postcopy-ing),
> it knows the faults can potentially be resolved by userfaultfd at this
> stage.

Are there any cases where fast-gup can fail while uffd is enabled but
the failure is not due to uffd, e.g. because a page is swapped out? I
don't know what userspace would do in those situations to make forward
progress.

> >
> >         /*
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index ffb6eb55cd13..4c6c53ac6531 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1729,7 +1729,7 @@ extern struct pid *cad_pid;
> >  #define PF_MEMALLOC             0x00000800  /* Allocating memory */
> >  #define PF_NPROC_EXCEEDED       0x00001000  /* set_user() noticed that RLIMIT_NPROC was exceeded */
> >  #define PF_USED_MATH            0x00002000  /* If unset the fpu must be initialized before use */
> > -#define PF__HOLE__00004000      0x00004000
> > +#define PF_NO_UFFD_WAIT         0x00004000
> >  #define PF_NOFREEZE             0x00008000  /* This thread should not be frozen */
> >  #define PF__HOLE__00010000      0x00010000
> >  #define PF_KSWAPD               0x00020000  /* I am kswapd */
>
> --
> Peter Xu