Date: Sat, 3 Dec 2022 01:03:38 +0000
From: Sean Christopherson
To: James Houghton
Cc: Andrea Arcangeli, Peter Xu, Paolo Bonzini, Axel Rasmussen, Linux MM, kvm, chao.p.peng@linux.intel.com
Subject: Re: [RFC] Improving userfaultfd scalability for live migration

On Thu, Dec 01, 2022, James Houghton wrote:
> #1, however, is quite doable. The main codepath for post-copy, the
> path that is taken when a vCPU attempts to access unmapped memory, is
> (for x86, but similar for other architectures): handle_ept_violation
> -> hva_to_pfn -> GUP -> handle_userfault. I'll call this the "EPT
> violation path" or "mem fault path." Other post-copy paths include at
> least: (i) KVM attempts to access guest memory via
> copy_{to,from}_user -> #pf -> handle_mm_fault -> handle_userfault, and
> (ii) other callers of gfn_to_pfn* or hva_to_pfn* outside of the EPT
> violation path (e.g., instruction emulation).
>
> We want the EPT violation path to be fast, as it is taken the vast
> majority of the time.

...

> == Getting the faulting GPA to userspace ==
> KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> and it provides the main functionality we need. We can extend it
> easily to support our use case here, and I think we have at least two
> options:
> - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> -EFAULT).
> - We're already introducing a new CAP, so just tie the above behavior
> to whether or not one of the CAPs (below) is being used.

We might even be able to get away with a third option: unconditionally return
KVM_EXIT_MEMORY_FAULT instead of -EFAULT if the error occurs while accessing
guest memory.

> == Problems ==
> The major problem here is that this only solves the scalability
> problem for the KVM demand paging case.
> Other userfaultfd users, if
> they have scalability problems, will need to find another approach.

It may not fully solve KVM's problem either.  E.g. if the VM is running nested
VMs, many (most?) of the user faults could be triggered by
FNAME(walk_addr_generic) via __get_user() when walking L1's EPT tables.

Disclaimer: I know _very_ little about UFFD.

Rather than add yet another flag to gup(), what about a flag to say the task
doesn't want to wait for UFFD faults?  If desired/necessary, KVM could even
toggle the flag in KVM_RUN so that faults that occur outside of KVM ultimately
don't send an actual SIGBUS.

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 07c81ab3fd4d..7f66b56dd6e7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	 * shmem_vm_ops->fault method is invoked even during
 	 * coredumping without mmap_lock and it ends up here.
 	 */
-	if (current->flags & (PF_EXITING|PF_DUMPCORE))
+	if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
 		goto out;
 
 	/*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffb6eb55cd13..4c6c53ac6531 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1729,7 +1729,7 @@ extern struct pid *cad_pid;
 #define PF_MEMALLOC		0x00000800	/* Allocating memory */
 #define PF_NPROC_EXCEEDED	0x00001000	/* set_user() noticed that RLIMIT_NPROC was exceeded */
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
-#define PF__HOLE__00004000	0x00004000
+#define PF_NO_UFFD_WAIT		0x00004000
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
 #define PF__HOLE__00010000	0x00010000
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
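
For illustration only, the effect of the proposed check can be modelled in plain
userspace C.  This is a rough sketch, not kernel code: struct task_sketch,
enum fault_result, handle_userfault_sketch() and kvm_run_sketch() are
hypothetical names made up here, and the PF_EXITING/PF_DUMPCORE values are
assumed rather than taken from the diff (only PF_NO_UFFD_WAIT's value appears
there).  It assumes the early "goto out" path results in a SIGBUS-style return
instead of blocking, and that KVM would set the flag only for the duration of
KVM_RUN.

```c
#include <assert.h>
#include <stddef.h>

/* PF_NO_UFFD_WAIT is the bit proposed in the diff; the other two values
 * are assumptions made for this sketch. */
#define PF_EXITING       0x00000004u
#define PF_DUMPCORE      0x00000200u
#define PF_NO_UFFD_WAIT  0x00004000u

/* Hypothetical stand-in for the task's flags word (current->flags). */
struct task_sketch {
	unsigned int flags;
};

/* Hypothetical outcomes: block waiting for userspace, or bail out early. */
enum fault_result {
	FAULT_WOULD_WAIT,
	FAULT_SIGBUS,
};

/* Models only the early bail-out test changed by the diff. */
static enum fault_result handle_userfault_sketch(const struct task_sketch *task)
{
	assert(task != NULL);

	if (task->flags & (PF_EXITING | PF_DUMPCORE | PF_NO_UFFD_WAIT))
		return FAULT_SIGBUS;	/* do not wait for the UFFD handler */

	return FAULT_WOULD_WAIT;	/* normal path: queue the fault and sleep */
}

/* Models "toggle the flag in KVM_RUN": the flag is set only while the
 * vCPU runs, so faults taken outside of KVM still wait as usual. */
static enum fault_result kvm_run_sketch(struct task_sketch *task)
{
	enum fault_result ret;

	task->flags |= PF_NO_UFFD_WAIT;
	ret = handle_userfault_sketch(task);	/* stands in for a guest memory fault */
	task->flags &= ~PF_NO_UFFD_WAIT;

	return ret;
}
```

In this model a fault taken inside kvm_run_sketch() returns immediately, which
is where KVM could turn the failure into an exit to userspace, while the same
task faulting outside of it still takes the waiting path.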