From: James Houghton
Date: Tue, 6 Dec 2022 15:41:46 -0500
Subject: Re: [RFC] Improving userfaultfd scalability for live migration
To: Sean Christopherson
Cc: David Matlack, Peter Xu, Andrea Arcangeli, Paolo Bonzini,
    Axel Rasmussen, Linux MM, kvm, chao.p.peng@linux.intel.com,
    Oliver Upton

On Tue, Dec 6, 2022 at 1:01 PM Sean Christopherson wrote:
>
> On Tue, Dec 06, 2022, James Houghton wrote:
> > On Mon, Dec 5, 2022 at 8:06 PM Sean Christopherson wrote:
> > >
> > > On Mon, Dec 05, 2022, James Houghton wrote:
> > > > On Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson wrote:
> > > > >
> > > > > On Mon, Dec 05, 2022, David Matlack wrote:
> > > > > > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu wrote:
> > > > > > > ...
> > > > > > > I'll have a closer read on the nested part, but note that this path already
> > > > > > > has the mmap lock then it invalidates the goal if we want to avoid taking
> > > > > > > it from the first place, or maybe we don't care?
> > > >
> > > > Not taking the mmap lock would be helpful, but we still have to take
> > > > it in UFFDIO_CONTINUE, so it's ok if we have to still take it here.
> > >
> > > IIUC, Peter is suggesting that the kernel not even get to the point where UFFD
> > > is involved. The "fault" would get propagated to userspace by KVM, userspace
> > > fixes the fault (gets the page from the source, does MADV_POPULATE_WRITE), and
> > > resumes the vCPU.
> >
> > If we haven't UFFDIO_CONTINUE'd some address range yet,
> > MADV_POPULATE_WRITE for that range will drop into handle_userfault and
> > go to sleep. Not good!
>
> Ah, right, userspace would still need to register UFFD for the region to handle
> non-KVM (or incompatible KVM) accesses and could loop back on itself.
>
> > So, going with the no-slow-GUP approach, resolving faults is done like this:
> > - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart
> > KVM_RUN. The PTEs will be none/blank right now. This is the common
> > case.
> > - If we have UFFDIO_CONTINUE'd already, if we were to do it again, we
> > would get EEXIST. (In this case, we probably have some type of swap
> > entry in the page tables.) We have to change the page tables to make
> > fast GUP succeed now *without* using UFFDIO_CONTINUE now.
> > MADV_POPULATE_WRITE seems to be the right tool for the job.
> > This case happens if the kernel has swapped the memory out, is
> > migrating it, has poisoned it, etc. If MADV_POPULATE_WRITE fails, we
> > probably need to crash or inject a memory error.
> >
> > So with this approach, we never need to take the mmap_lock for reading
> > in hva_to_pfn, but we still need to take it in UFFDIO_CONTINUE.
> > Without removing the mmap_lock from *both*, we don't gain much.
> >
> > So if we disregard this tiny mmap_lock benefit, the other approach
> > (the PF_NO_UFFD_WAIT approach) seems better.
>
> Can you elaborate on what makes it better? Or maybe generate a list of pros and
> cons? I can think of (dis)advantages for both approaches, but I haven't identified
> anything that would be a blocking issue for either approach. Doesn't mean there
> isn't one or more blocking issues, just that I haven't thought of any :-)

Let's see.... so using no-slow-GUP over no UFFD waiting:
- No need to take mmap_lock in mem fault path.
- Change the relevant __gfn_to_pfn_memslot callers
  (kvm_faultin_pfn/user_mem_abort/others?) to set `atomic = true` if the
  new CAP is used.
- No need for a new PF_NO_UFFD_WAIT (would be toggled somewhere
  in/near kvm_faultin_pfn/user_mem_abort).
- Userspace has to indirectly figure out the state of the page tables
  to know what action to take (which introduces some weirdness, like if
  anyone MADV_DONTNEEDs some guest memory, we need to know).
- While userfaultfd is registered (so like during post-copy), any
  hva_to_pfn() calls that were resolvable with slow GUP before (without
  dropping into handle_userfault()) will now need to be resolved by
  userspace manually with a call to MADV_POPULATE_WRITE. This extra trip
  to userspace could slow things down.

Both of these seem pretty simple to implement in the kernel; the most
complicated part is just returning KVM_EXIT_MEMORY_FAULT in more
places / for other architectures (I care about x86 and arm64).

Right now both approaches seem fine to me.
Not having to take the mmap_lock in the fault path, while being such a
minor difference now, could be a huge benefit if we can later get around
to making UFFDIO_CONTINUE not need the mmap lock. Disregarding that, not
requiring userspace to guess the state of the page tables seems
helpful (less bug-prone, I guess).

>
> > When KVM_RUN exits:
> > - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart KVM_RUN.
> > - If we have, then something bad has happened. Slow GUP already ran
> > and failed, so we need to treat this in the same way we treat a
> > MADV_POPULATE_WRITE failure above: userspace might just want to crash
> > (or inject a memory error or something).
> >
> > - James
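
For illustration, the userspace decision logic for the two approaches
discussed in this thread could be sketched roughly as below. This is only
a sketch under assumptions: the enum and function names are hypothetical,
the inputs stand in for the results of the real UFFDIO_CONTINUE ioctl and
the MADV_POPULATE_WRITE madvise call (neither is issued here), and
treating every error other than EEXIST as fatal is a simplification.

```c
#include <errno.h>

/* Hypothetical names, for illustration only. */
enum fault_action {
	FAULT_RESUME_VCPU,  /* fault resolved, restart KVM_RUN */
	FAULT_TRY_POPULATE, /* already CONTINUE'd, try MADV_POPULATE_WRITE */
	FAULT_FATAL,        /* crash or inject a memory error */
};

/*
 * no-slow-GUP approach, step 1: interpret the result of UFFDIO_CONTINUE.
 * ret and err are the ioctl() return value and errno (assumed inputs).
 */
enum fault_action no_slow_gup_after_continue(int ret, int err)
{
	if (ret == 0)
		return FAULT_RESUME_VCPU;  /* common case: PTEs were none */
	if (err == EEXIST)
		return FAULT_TRY_POPULATE; /* e.g. swapped out or migrating */
	return FAULT_FATAL;                /* assumption: other errors are fatal */
}

/*
 * no-slow-GUP approach, step 2: interpret the result of
 * MADV_POPULATE_WRITE (ret is the madvise() return value).
 */
enum fault_action no_slow_gup_after_populate(int ret)
{
	return ret == 0 ? FAULT_RESUME_VCPU : FAULT_FATAL;
}

/*
 * PF_NO_UFFD_WAIT approach: slow GUP already ran in the kernel, so a
 * UFFDIO_CONTINUE failure (including EEXIST) means something bad happened.
 */
enum fault_action no_uffd_wait_after_continue(int ret, int err)
{
	(void)err; /* unused: every failure is handled the same way here */
	return ret == 0 ? FAULT_RESUME_VCPU : FAULT_FATAL;
}
```

The only difference between the two flows is the EEXIST case: with
no-slow-GUP, userspace has to take the extra MADV_POPULATE_WRITE step
itself, while with PF_NO_UFFD_WAIT the kernel has already tried slow GUP.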