From: James Houghton
Date: Tue, 6 Dec 2022 12:35:18 -0500
Subject: Re: [RFC] Improving userfaultfd scalability for live migration
To: Sean Christopherson
Cc: David Matlack, Peter Xu, Andrea Arcangeli, Paolo Bonzini,
    Axel Rasmussen, Linux MM, kvm, chao.p.peng@linux.intel.com

On Mon, Dec 5, 2022 at 8:06 PM Sean Christopherson wrote:
>
> On Mon, Dec 05, 2022, James Houghton wrote:
> > On
Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson wrote:
> > >
> > > On Mon, Dec 05, 2022, David Matlack wrote:
> > > > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu wrote:
> > > > > ...
> > > > > I'll have a closer read on the nested part, but note that this path already
> > > > > has the mmap lock then it invalidates the goal if we want to avoid taking
> > > > > it from the first place, or maybe we don't care?
> >
> > Not taking the mmap lock would be helpful, but we still have to take
> > it in UFFDIO_CONTINUE, so it's ok if we have to still take it here.
>
> IIUC, Peter is suggesting that the kernel not even get to the point where UFFD
> is involved. The "fault" would get propagated to userspace by KVM, userspace
> fixes the fault (gets the page from the source, does MADV_POPULATE_WRITE), and
> resumes the vCPU.

If we haven't UFFDIO_CONTINUE'd some address range yet,
MADV_POPULATE_WRITE for that range will drop into handle_userfault and
go to sleep. Not good!

So, going with the no-slow-GUP approach, resolving faults is done like this:
- If we haven't UFFDIO_CONTINUE'd yet, do that now and restart
KVM_RUN. The PTEs will be none/blank right now. This is the common
case.
- If we have UFFDIO_CONTINUE'd already, if we were to do it again, we
would get EEXIST. (In this case, we probably have some type of swap
entry in the page tables.) We have to change the page tables to make
fast GUP succeed now *without* using UFFDIO_CONTINUE now.
MADV_POPULATE_WRITE seems to be the right tool for the job. This case
happens if the kernel has swapped the memory out, is migrating it, has
poisoned it, etc. If MADV_POPULATE_WRITE fails, we probably need to
crash or inject a memory error.

So with this approach, we never need to take the mmap_lock for reading
in hva_to_pfn, but we still need to take it in UFFDIO_CONTINUE.
Without removing the mmap_lock from *both*, we don't gain much. So if
we disregard this tiny mmap_lock benefit, the other approach (the
PF_NO_UFFD_WAIT approach) seems better.
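
For reference, here is a rough sketch of the userspace side of the
no-slow-GUP flow described above. It is not code from any real VMM.
The names (fault_action, fault_ops, resolve_fault, the op_* stubs) are
made up for illustration, and the two callbacks are assumed to wrap
UFFDIO_CONTINUE and MADV_POPULATE_WRITE and to return 0 or a negative
errno. It also assumes userspace simply attempts UFFDIO_CONTINUE and
uses EEXIST to detect the "already continued" case, and that any other
failure is treated as a memory error.

```c
#include <errno.h>
#include <stddef.h>

/* What the VMM does after handling a fault exit (hypothetical names). */
enum fault_action {
    FAULT_RESUME_VCPU,   /* page tables are fixed; restart KVM_RUN */
    FAULT_MEMORY_ERROR   /* crash, or inject a memory error into the guest */
};

/* Hypothetical wrappers; each returns 0 on success or a negative errno. */
struct fault_ops {
    int (*uffd_continue)(void *addr, size_t len);   /* wraps UFFDIO_CONTINUE */
    int (*populate_write)(void *addr, size_t len);  /* wraps MADV_POPULATE_WRITE */
};

enum fault_action resolve_fault(const struct fault_ops *ops,
                                void *addr, size_t len)
{
    int ret = ops->uffd_continue(addr, len);

    /* Common case: the range had not been continued yet. */
    if (ret == 0)
        return FAULT_RESUME_VCPU;

    /*
     * Already continued (swapped out, migrating, ...): fix up the page
     * tables without UFFDIO_CONTINUE so that fast GUP can succeed.
     */
    if (ret == -EEXIST && ops->populate_write(addr, len) == 0)
        return FAULT_RESUME_VCPU;

    /* Assumption: any other failure is treated as a memory error. */
    return FAULT_MEMORY_ERROR;
}

/* Stand-in callbacks for trying out the three cases without a real uffd. */
int op_ok(void *addr, size_t len)     { (void)addr; (void)len; return 0; }
int op_eexist(void *addr, size_t len) { (void)addr; (void)len; return -EEXIST; }
int op_fail(void *addr, size_t len)   { (void)addr; (void)len; return -EFAULT; }
```

With the stubs, { op_ok, op_fail } models the common case, { op_eexist,
op_ok } models a swapped-out page that MADV_POPULATE_WRITE brings back,
and { op_eexist, op_fail } models the case where we have to give up.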

When KVM_RUN exits:
- If we haven't UFFDIO_CONTINUE'd yet, do that now and restart KVM_RUN.
- If we have, then something bad has happened. Slow GUP already ran
and failed, so we need to treat this in the same way we treat a
MADV_POPULATE_WRITE failure above: userspace might just want to crash
(or inject a memory error or something).

- James