Date: Tue, 6 Dec 2022 18:00:53 +0000
From: Sean Christopherson
To: James Houghton
Cc: David Matlack, Peter Xu, Andrea Arcangeli, Paolo Bonzini, Axel Rasmussen, Linux MM, kvm, chao.p.peng@linux.intel.com
Subject: Re: [RFC] Improving userfaultfd scalability for live migration

On Tue, Dec 06, 2022, James Houghton wrote:
> On Mon, Dec 5, 2022 at 8:06 PM Sean Christopherson wrote:
> >
> > On Mon, Dec 05, 2022, James Houghton wrote:
> > > On Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson wrote:
> > > >
> > > > On Mon, Dec 05, 2022, David Matlack wrote:
> > > > > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu wrote:
> > > > > > ...
> > > > > > I'll have a closer read on the nested part, but note that this path already
> > > > > > has the mmap lock then it invalidates the goal if we want to avoid taking
> > > > > > it from the first place, or maybe we don't care?
> > >
> > > Not taking the mmap lock would be helpful, but we still have to take
> > > it in UFFDIO_CONTINUE, so it's ok if we have to still take it here.
> >
> > IIUC, Peter is suggesting that the kernel not even get to the point where UFFD
> > is involved. The "fault" would get propagated to userspace by KVM, userspace
> > fixes the fault (gets the page from the source, does MADV_POPULATE_WRITE), and
> > resumes the vCPU.
>
> If we haven't UFFDIO_CONTINUE'd some address range yet,
> MADV_POPULATE_WRITE for that range will drop into handle_userfault and
> go to sleep. Not good!

Ah, right, userspace would still need to register UFFD for the region to handle
non-KVM (or incompatible KVM) accesses and could loop back on itself.

> So, going with the no-slow-GUP approach, resolving faults is done like this:
> - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart
>   KVM_RUN. The PTEs will be none/blank right now. This is the common
>   case.
> - If we have UFFDIO_CONTINUE'd already, if we were to do it again, we
>   would get EEXIST. (In this case, we probably have some type of swap
>   entry in the page tables.) We have to change the page tables to make
>   fast GUP succeed now *without* using UFFDIO_CONTINUE now.
>   MADV_POPULATE_WRITE seems to be the right tool for the job. This case
>   happens if the kernel has swapped the memory out, is migrating it, has
>   poisoned it, etc. If MADV_POPULATE_WRITE fails, we probably need to
>   crash or inject a memory error.
>
> So with this approach, we never need to take the mmap_lock for reading
> in hva_to_pfn, but we still need to take it in UFFDIO_CONTINUE.
> Without removing the mmap_lock from *both*, we don't gain much.
>
> So if we disregard this tiny mmap_lock benefit, the other approach
> (the PF_NO_UFFD_WAIT approach) seems better.

Can you elaborate on what makes it better? Or maybe generate a list of pros and
cons? I can think of (dis)advantages for both approaches, but I haven't
identified anything that would be a blocking issue for either approach. Doesn't
mean there isn't one or more blocking issues, just that I haven't thought of
any :-)

> When KVM_RUN exits:
> - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart KVM_RUN.
> - If we have, then something bad has happened. Slow GUP already ran
>   and failed, so we need to treat this in the same way we treat a
>   MADV_POPULATE_WRITE failure above: userspace might just want to crash
>   (or inject a memory error or something).
>
> - James
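
For comparison, the two userspace flows quoted above could be sketched roughly as follows. This is only an illustration of the decision logic, not code from any VMM or from the RFC. The struct fault_ops callbacks are hypothetical wrappers around ioctl(UFFDIO_CONTINUE) and madvise(MADV_POPULATE_WRITE) that return 0 or an errno value. The enum, the function names, and the stubs are made up for the example. Treating errors other than EEXIST as fatal is an assumption; the thread does not cover them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* What the VMM does after trying to resolve a fault that KVM_RUN reported. */
enum fault_action {
    FAULT_RESUME_VCPU, /* fault resolved, restart KVM_RUN */
    FAULT_FATAL        /* crash, or inject a memory error into the guest */
};

/*
 * Hypothetical callbacks standing in for ioctl(UFFDIO_CONTINUE) and
 * madvise(MADV_POPULATE_WRITE). Each returns 0 on success or an errno value.
 */
struct fault_ops {
    int (*uffdio_continue)(void *ctx, unsigned long hva);
    int (*populate_write)(void *ctx, unsigned long hva);
};

/* No-slow-GUP approach: KVM only tries fast GUP, userspace does the rest. */
enum fault_action resolve_no_slow_gup(const struct fault_ops *ops, void *ctx,
                                      unsigned long hva)
{
    int err;

    assert(ops != NULL && ops->uffdio_continue != NULL &&
           ops->populate_write != NULL);

    err = ops->uffdio_continue(ctx, hva);
    if (err == 0)
        return FAULT_RESUME_VCPU; /* common case: PTEs were none */
    if (err != EEXIST)
        return FAULT_FATAL;       /* assumption: other errors are fatal */

    /* Already continued, e.g. swapped out or migrating: fault it back in. */
    err = ops->populate_write(ctx, hva);
    return err == 0 ? FAULT_RESUME_VCPU : FAULT_FATAL;
}

/* PF_NO_UFFD_WAIT approach: slow GUP already ran in the kernel and failed. */
enum fault_action resolve_no_uffd_wait(const struct fault_ops *ops, void *ctx,
                                       unsigned long hva)
{
    assert(ops != NULL && ops->uffdio_continue != NULL);

    /* Any failure, including EEXIST, means something bad has happened. */
    return ops->uffdio_continue(ctx, hva) == 0 ? FAULT_RESUME_VCPU
                                               : FAULT_FATAL;
}

/* Stubs so the sketch can be exercised without a real userfaultfd. */
struct stub_state {
    int continue_err;   /* value the fake UFFDIO_CONTINUE returns */
    int populate_err;   /* value the fake MADV_POPULATE_WRITE returns */
    int populate_calls; /* how often the fake madvise was invoked */
};

int stub_continue(void *ctx, unsigned long hva)
{
    (void)hva;
    return ((struct stub_state *)ctx)->continue_err;
}

int stub_populate(void *ctx, unsigned long hva)
{
    struct stub_state *s = ctx;

    (void)hva;
    s->populate_calls++;
    return s->populate_err;
}

const struct fault_ops stub_ops = { stub_continue, stub_populate };
```

Written this way, the only difference visible to userspace is the EEXIST case: the no-slow-GUP flow needs the extra MADV_POPULATE_WRITE step, while the PF_NO_UFFD_WAIT flow treats EEXIST as fatal because the kernel has already tried slow GUP.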