From: James Houghton
Date: Thu, 8 Dec 2022 12:50:22 -0500
Subject: Re: [RFC] Improving userfaultfd scalability for live migration
To: David Matlack
Cc: Sean Christopherson, Peter Xu, Andrea Arcangeli, Paolo Bonzini, Axel Rasmussen, Linux MM, kvm, chao.p.peng@linux.intel.com, Oliver Upton

On Wed, Dec 7, 2022 at 8:57 PM David Matlack wrote:
>
> On Tue, Dec 6, 2022 at 12:41 PM James Houghton wrote:
> > On Tue, Dec 6, 2022 at 1:01 PM Sean Christopherson wrote:
> > > Can you elaborate on what makes it better? Or maybe generate a list of pros and
> > > cons? I can think of (dis)advantages for both approaches, but I haven't identified
> > > anything that would be a blocking issue for either approach. Doesn't mean there
> > > isn't one or more blocking issues, just that I haven't thought of any :-)
> >
> > Let's see.... so using no-slow-GUP over no UFFD waiting:
> > - No need to take mmap_lock in mem fault path.
> > - Change the relevant __gfn_to_pfn_memslot callers
> > (kvm_faultin_pfn/user_mem_abort/others?) to set `atomic = true` if the
> > new CAP is used.
> > - No need for a new PF_NO_UFFD_WAIT (would be toggled somewhere
> > in/near kvm_faultin_pfn/user_mem_abort).
> > - Userspace has to indirectly figure out the state of the page tables
> > to know what action to take (which introduces some weirdness, like if
> > anyone MADV_DONTNEEDs some guest memory, we need to know).
>
> I'm no expert but I believe a guest access to MADV_DONTNEED'd GFN
> would just cause a new page to be allocated by the kernel. So I think
> userspace can still blindly do MADV_POPULATE_WRITE in this case. Were
> there any other scenarios you had in mind?

If we're using uffd minor faults, MADV_POPULATE_WRITE after a
MADV_DONTNEED would drop into handle_userfault(). For uffd minor
faults, if the PTE is none (i.e., completely blank, no swap
information or anything), then we drop into handle_userfault().

I partially take back what I said. We have to be careful about someone
messing with our page tables no matter which API we choose. Here is a
better description of the weirdness that we have to put up with for
each choice, under the assumption that, normally, we want to
UFFDIO_CONTINUE a page exactly once:

- For the no-slow-GUP choice, if someone MADV_DONTNEEDed memory and we
didn't know about it, we would get stuck in MADV_POPULATE_WRITE,
waiting in handle_userfault() for a UFFDIO_CONTINUE that we believe we
have already issued. By using UFFD_FEATURE_THREAD_ID, we can tell if we
got a userfault for a thread that is in the middle of a
MADV_POPULATE_WRITE, and we can try to unblock the thread by doing an
extra UFFDIO_CONTINUE.

- For the PF_NO_UFFD_WAIT choice, if someone MADV_DONTNEEDed memory,
we would just keep trying to start the vCPU without doing anything (we
assume some other thread has UFFDIO_CONTINUEd for us). This is
basically the same as if we were stuck in MADV_POPULATE_WRITE, and we
can try to unblock the thread in a fashion similar to how we would in
the other case.

So really these approaches have similar requirements for what
userspace needs to track. Given that, I think I prefer the no-slow-GUP
approach.
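
To make the userspace bookkeeping above a bit more concrete, here is a
rough sketch of the no-slow-GUP case. It is a toy model in Python, not
VMM code: Tracker, Action and on_userfault are hypothetical names, no
real ioctl is issued, and the thread ID is assumed to be the one
reported with the fault when UFFD_FEATURE_THREAD_ID is enabled.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    CONTINUE = "UFFDIO_CONTINUE"               # first fault on this page
    EXTRA_CONTINUE = "extra UFFDIO_CONTINUE"   # unblock a stuck populate
    UNHANDLED = "unhandled"                    # case not covered by this sketch


@dataclass
class Tracker:
    """Hypothetical userspace state: which pages we already continued and
    which threads are currently inside MADV_POPULATE_WRITE."""
    continued: set = field(default_factory=set)
    populating: set = field(default_factory=set)

    def begin_populate(self, tid):
        self.populating.add(tid)

    def end_populate(self, tid):
        self.populating.discard(tid)

    def on_userfault(self, page, tid):
        # Normal case: continue each page exactly once.
        if page not in self.continued:
            self.continued.add(page)
            return Action.CONTINUE
        # We already continued this page, yet a populating thread faulted
        # on it again: someone (e.g. MADV_DONTNEED) cleared the PTE behind
        # our back, so issue one more continue to unblock the thread.
        if tid in self.populating:
            return Action.EXTRA_CONTINUE
        # Anything else is outside what this sketch models.
        return Action.UNHANDLED
```

The PF_NO_UFFD_WAIT case would need roughly the same two pieces of
state, which is the sense in which the tracking requirements are
similar.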

> >
> > - While userfaultfd is registered (so like during post-copy), any
> > hva_to_pfn() calls that were resolvable with slow GUP before (without
> > dropping into handle_userfault()) will now need to be resolved by
> > userspace manually with a call to MADV_POPULATE_WRITE. This extra trip
> > to userspace could slow things down.
>
> Is there any way to enable fast-gup to identify when a PTE is not
> present due to userfaultfd specifically without taking the mmap_lock
> (e.g. using an unused bit in the PTE)? Then we could avoid extra trips
> to userspace for MADV_POPULATE_WRITE.

To know if you would have dropped into handle_userfault(), you have to
at least check the VMA flags, so at the moment, no. :(
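
As a toy illustration of why the PTE alone is not enough (a simplified
model in Python, not the kernel's logic): Vma and would_userfault are
hypothetical names, and the VM_UFFD_MINOR bit value below is a
placeholder rather than the kernel's definition.

```python
from dataclasses import dataclass

VM_UFFD_MINOR = 1 << 0  # placeholder bit, NOT the kernel's actual value


@dataclass
class Vma:
    flags: int


def would_userfault(pte_is_none: bool, vma: Vma) -> bool:
    """Simplified: a none PTE only leads to handle_userfault() when the
    VMA is registered for uffd minor faults."""
    return pte_is_none and bool(vma.flags & VM_UFFD_MINOR)


# Same PTE state, different answers depending on the VMA flags.
registered = Vma(flags=VM_UFFD_MINOR)
unregistered = Vma(flags=0)
print(would_userfault(True, registered), would_userfault(True, unregistered))
```

Both calls see the same none PTE, so anything that only looks at the
PTE cannot tell the two cases apart.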