From: James Houghton
Date: Mon, 5 Dec 2022 16:19:56 -0500
Subject: Re: [RFC] Improving userfaultfd scalability for live migration
To: Sean Christopherson
Cc: David Matlack, Peter Xu, Andrea Arcangeli, Paolo Bonzini,
 Axel Rasmussen, Linux MM, kvm, chao.p.peng@linux.intel.com

On Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson wrote:
>
> On Mon, Dec 05, 2022, David Matlack wrote:
> > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu wrote:
> > > > > == Getting the faulting GPA to userspace ==
> > > > > KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> > > > > and it provides the main functionality we need. We can extend it
> > > > > easily to support our use case here, and I think we have at least two
> > > > > options:
> > > > > - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> > > > > KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> > > > > otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> > > > > -EFAULT).
> > > > > - We're already introducing a new CAP, so just tie the above behavior
> > > > > to whether or not one of the CAPs (below) is being used.
> > > > We might even be able to get away with a third option: unconditionally return
> > > > KVM_EXIT_MEMORY_FAULT instead of -EFAULT when the error occurs when accessing
> > > > guest memory.

Wouldn't we need a new CAP for this?

> > > > > == Problems ==
> > > > > The major problem here is that this only solves the scalability
> > > > > problem for the KVM demand paging case. Other userfaultfd users, if
> > > > > they have scalability problems, will need to find another approach.
> > > > It may not fully solve KVM's problem either. E.g. if the VM is running nested
> > > > VMs, many (most?) of the user faults could be triggered by FNAME(walk_addr_generic)
> > > > via __get_user() when walking L1's EPT tables.
> > We could always modify FNAME(walk_addr_generic) to return out to user
> > space in the same way if that is indeed another bottleneck.
>
> Yes, but given that there's a decent chance that solving this problem will add
> new ABI, I want to make sure that we are confident that we won't end up with gaps
> in the ABI. I.e. I don't want to punt the nested case to the future.
> > > > Disclaimer: I know _very_ little about UFFD.
> > > >
> > > > Rather than add yet another flag to gup(), what about a flag to say the task doesn't
> > > > want to wait for UFFD faults? If desired/necessary, KVM could even toggle the flag
> > > > in KVM_RUN so that faults that occur outside of KVM ultimately don't send an actual
> > > > SIGBUS.

I really like this idea! Having KVM_RUN toggle it in
handle_ept_violation/etc. seems like it would work the best. If we
toggled it in userspace before KVM_RUN, we would still open ourselves up
to KVM_RUN exiting without post-copy information (e.g., if GUP failed
during instruction emulation), IIUC.

> > There are some copy_to/from_user() calls in KVM that cannot easily
> > exit out to KVM_RUN (for example, in the guts of the emulator IIRC).
> > But we could use your approach just to wrap the specific call sites
> > that can return from KVM_RUN.
>
> Yeah, it would definitely need to be opt-in.
> > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > index 07c81ab3fd4d..7f66b56dd6e7 100644
> > > > --- a/fs/userfaultfd.c
> > > > +++ b/fs/userfaultfd.c
> > > > @@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > > >   * shmem_vm_ops->fault method is invoked even during
> > > >   * coredumping without mmap_lock and it ends up here.
> > > >   */
> > > > -	if (current->flags & (PF_EXITING|PF_DUMPCORE))
> > > > +	if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
> > > >  		goto out;
> > > I'll have a closer read on the nested part, but note that this path already
> > > has the mmap lock, which invalidates the goal if we want to avoid taking
> > > it in the first place, or maybe we don't care?

Not taking the mmap lock would be helpful, but we still have to take it
in UFFDIO_CONTINUE, so it's okay if we still have to take it here. The
main goal is to avoid the locks in the userfaultfd wait_queues.
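To sketch what I mean (hand-wavy pseudocode, not real KVM code; PF_NO_UFFD_WAIT
and the exit plumbing are hypothetical names):

```
/*
 * vCPU run loop with guest-memory accesses bracketed by the proposed
 * no-wait flag. handle_userfault() would see PF_NO_UFFD_WAIT and fail
 * the fault instead of sleeping on the uffd wait queue, so the
 * -EFAULT surfaces back up to KVM_RUN.
 */
vcpu_run(vcpu):
    current->flags |= PF_NO_UFFD_WAIT
    r = run_vcpu_and_touch_guest_memory(vcpu)   /* GUP, uaccess, etc. */
    current->flags &= ~PF_NO_UFFD_WAIT
    if r == -EFAULT:
        vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT
        vcpu->run->memory_fault.gpa = faulting_gpa
        return 0    /* exit to userspace to resolve the fault */
```

Userspace would then resolve the fault for the reported GPA's hva (e.g.
UFFDIO_CONTINUE) and call KVM_RUN again, never touching the uffd wait queues.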
If we could completely avoid taking the mmap lock for reading in the
common post-copy case, we would avoid potential latency spikes if
someone (e.g. khugepaged) came around and grabbed the mmap lock for
writing. It seems pretty difficult to make UFFDIO_CONTINUE *not* take
the mmap lock for reading, but I suppose it could be done with
something like the per-VMA lock work [2]. If we could avoid taking the
lock in UFFDIO_CONTINUE, then it seems plausible that we could avoid
taking it in slow GUP too. So really, whether or not we take the mmap
lock (for reading) in the memory fault path isn't a huge deal by
itself.

> > >
> > > If we want to avoid taking the mmap lock at all (hence the fast-gup
> > > approach), I'd also suggest we don't make it related to uffd at all but
> > > instead an interface to say "let's check whether the page tables are there
> > > (walk pgtable by fast-gup only), if not return to userspace".
>
> Ooh, good point. If KVM provided a way for userspace to toggle a "fast-only" flag,
> then hva_to_pfn() could bail if hva_to_pfn_fast() failed, and I think KVM could
> just do pagefault_disable/enable() around compatible KVM uaccesses?
>
> > > Because IIUC fast-gup has nothing to do with uffd, so it can also be a more
> > > generic interface. It's just that if the userspace knows what it's doing
> > > (postcopy-ing), it knows that the faults can potentially be resolved by
> > > userfaultfd at this stage.
> >
> > Are there any cases where fast-gup can fail while uffd is enabled but
> > it's not due to uffd? e.g. if a page is swapped out?
>
> Undoubtedly. COW, NUMA balancing, KSM?, etc. Nit, I don't think "due to uffd"
> is the right terminology; I think the right phrasing is something like "but can't
> be resolved by userspace", or maybe "but weren't induced by userspace". UFFD
> itself never causes faults.
> > I don't know what userspace would do in those situations to make forward progress.
> Access the page from userspace? E.g. a "LOCK AND -1" would resolve read and write
> faults without modifying guest memory.
>
> That won't work for guests backed by "restricted mem", a.k.a. UPM guests, but
> restricted mem really should be able to prevent those types of faults in the first
> place. SEV guests are the one case I can think of where that approach won't work,
> since writes will corrupt the guest. SEV guests can likely be special cased though.

As I mentioned in the original email, I think MADV_POPULATE_WRITE would
work here (Peter suggested this to me last week; thanks, Peter!). It
would basically call slow GUP for us. So instead of hva_to_pfn_fast
(fails) -> hva_to_pfn_slow -> slow GUP, we do hva_to_pfn_fast (fails)
-> exit to userspace -> MADV_POPULATE_WRITE (-> slow GUP) -> KVM_RUN ->
hva_to_pfn_fast (succeeds).

[2]: https://lwn.net/Articles/906852/

- James