Date: Tue, 6 Dec 2022 01:06:09 +0000
From: Sean Christopherson
To: James Houghton
Cc: David Matlack, Peter Xu, Andrea Arcangeli, Paolo Bonzini, Axel Rasmussen,
    Linux MM, kvm, chao.p.peng@linux.intel.com
Subject: Re: [RFC] Improving userfaultfd scalability for live migration

On Mon, Dec 05, 2022, James Houghton wrote:
> On Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson wrote:
> >
> > On Mon, Dec 05, 2022, David Matlack wrote:
> > > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu wrote:
> > > > > > == Getting the faulting GPA to userspace ==
> > > > > > KVM_EXIT_MEMORY_FAULT was introduced recently [1] (not yet merged),
> > > > > > and it provides the main functionality we need. We can extend it
> > > > > > easily to support our use case here, and I think we have at least two
> > > > > > options:
> > > > > > - Introduce something like KVM_CAP_MEM_FAULT_REPORTING, which causes
> > > > > > KVM_RUN to exit with exit reason KVM_EXIT_MEMORY_FAULT when it would
> > > > > > otherwise just return -EFAULT (i.e., when kvm_handle_bad_page returns
> > > > > > -EFAULT).
> > > > > > - We're already introducing a new CAP, so just tie the above behavior
> > > > > > to whether or not one of the CAPs (below) is being used.
> > > > >
> > > > > We might even be able to get away with a third option: unconditionally return
> > > > > KVM_EXIT_MEMORY_FAULT instead of -EFAULT when the error occurs when accessing
> > > > > guest memory.
>
> Wouldn't we need a new CAP for this?

Maybe?  I did say "might" :-)

-EFAULT is sooo useless for userspace in these cases that there's a chance we
can get away with an unconditional change.  Probably not worth the risk of
breaking userspace though as KVM will likely end up with a helper to fill in
the exit info.
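Very roughly, something like the below is what I have in mind.  Untested
sketch only; the memory_fault layout and field names (gpa/size/flags) are
placeholders for whatever the not-yet-merged KVM_EXIT_MEMORY_FAULT series
ends up with:

/*
 * Hypothetical helper to fill in the exit info before bailing to userspace,
 * instead of punting a bare -EFAULT all the way out of KVM_RUN.
 */
static int kvm_memory_fault_exit(struct kvm_vcpu *vcpu, gpa_t gpa,
                                 gpa_t size, u64 flags)
{
        vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
        vcpu->run->memory_fault.gpa = gpa;
        vcpu->run->memory_fault.size = size;
        vcpu->run->memory_fault.flags = flags;

        /* '0' tells the outer run loop to exit to userspace. */
        return 0;
}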
> > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > > index 07c81ab3fd4d..7f66b56dd6e7 100644
> > > > > --- a/fs/userfaultfd.c
> > > > > +++ b/fs/userfaultfd.c
> > > > > @@ -394,7 +394,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > > > >          * shmem_vm_ops->fault method is invoked even during
> > > > >          * coredumping without mmap_lock and it ends up here.
> > > > >          */
> > > > > -       if (current->flags & (PF_EXITING|PF_DUMPCORE))
> > > > > +       if (current->flags & (PF_EXITING|PF_DUMPCORE|PF_NO_UFFD_WAIT))
> > > > >                 goto out;
> > > >
> > > > I'll have a closer read on the nested part, but note that this path already
> > > > has the mmap lock then it invalidates the goal if we want to avoid taking
> > > > it from the first place, or maybe we don't care?
>
> Not taking the mmap lock would be helpful, but we still have to take
> it in UFFDIO_CONTINUE, so it's ok if we have to still take it here.

IIUC, Peter is suggesting that the kernel not even get to the point where UFFD
is involved.  The "fault" would get propagated to userspace by KVM, userspace
fixes the fault (gets the page from the source, does MADV_POPULATE_WRITE), and
resumes the vCPU.

> The main goal is to avoid the locks in the userfaultfd wait_queues. If
> we could completely avoid taking the mmap lock for reading in the
> common post-copy case, we would avoid potential latency spikes if
> someone (e.g. khugepaged) came around and grabbed the mmap lock for
> writing.
>
> It seems pretty difficult to make UFFDIO_CONTINUE *not* take the mmap
> lock for reading, but I suppose it could be done with something like
> the per-VMA lock work [2]. If we could avoid taking the lock in
> UFFDIO_CONTINUE, then it seems plausible that we could avoid taking it
> in slow GUP too. So really whether or not we are taking the mmap lock
> (for reading) in the mem fault path isn't a huge deal by itself.

...

> > > I don't know what userspace would do in those situations to make forward progress.
> >
> > Access the page from userspace?  E.g. a "LOCK AND -1" would resolve read and write
> > faults without modifying guest memory.
> >
> > That won't work for guests backed by "restricted mem", a.k.a. UPM guests, but
> > restricted mem really should be able to prevent those types of faults in the first
> > place.  SEV guests are the one case I can think of where that approach won't work,
> > since writes will corrupt the guest.  SEV guests can likely be special cased though.
>
> As I mentioned in the original email, I think MADV_POPULATE_WRITE
> would work here (Peter suggested this to me last week, thanks Peter!).
> It would basically call slow GUP for us.
> So instead of hva_to_pfn_fast (fails) -> hva_to_pfn_slow -> slow GUP,
> we do hva_to_pfn_fast (fails) -> exit to userspace ->
> MADV_POPULATE_WRITE (-> slow GUP) -> KVM_RUN -> hva_to_pfn_fast
> (succeeds).

Ah, nice.  Missed that (obviously).
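For completeness, the userspace half of that loop would be something like the
rough sketch below.  It assumes the proposed (unmerged) memory_fault exit info,
and gpa_to_hva()/fetch_page_from_source() are purely hypothetical stand-ins for
whatever the VMM already has for translating GPAs and pulling pages from the
migration source.  MADV_POPULATE_WRITE needs Linux 5.14+.

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Hypothetical VMM helpers, not real APIs. */
void *gpa_to_hva(__u64 gpa);
int fetch_page_from_source(__u64 gpa, __u64 size);

int run_vcpu_postcopy(int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
                        return -1;

                if (run->exit_reason != KVM_EXIT_MEMORY_FAULT)
                        return 0;       /* hand off to the normal exit handling */

                __u64 gpa = run->memory_fault.gpa;
                __u64 size = run->memory_fault.size;

                /* Copy the page contents over from the migration source. */
                if (fetch_page_from_source(gpa, size))
                        return -1;

                /* Fault the page into the guest mapping via slow GUP, no uffd. */
                if (madvise(gpa_to_hva(gpa), size, MADV_POPULATE_WRITE))
                        return -1;
        }
}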