From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
MIME-Version: 1.0
References: <CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com>
 <Y4qgampvx4lrHDXt@google.com> <Y44NylxprhPn6AoN@x1n> <CALzav=d=N7teRvjQZ1p0fs6i9hjmH7eVppJLMh_Go4TteQqqwg@mail.gmail.com>
 <Y442dPwu2L6g8zAo@google.com> <CADrL8HV_8=ssHSumpQX5bVm2h2J01swdB=+at8=xLr+KtW79MQ@mail.gmail.com>
 <Y46VgQRU+do50iuv@google.com> <CADrL8HVM1poR5EYCsghhMMoN2U+FYT6yZr_5hZ8pLZTXpLnu8Q@mail.gmail.com>
 <Y4+DVdq1Pj3k4Nyz@google.com>
In-Reply-To: <Y4+DVdq1Pj3k4Nyz@google.com>
From: James Houghton <jthoughton@google.com>
Date: Tue, 6 Dec 2022 15:41:46 -0500
Message-ID: <CADrL8HVftX-B+oHLbjnJCret01yjUpOjQfmHdDa7mYkMenOa+A@mail.gmail.com>
Subject: Re: [RFC] Improving userfaultfd scalability for live migration
To: Sean Christopherson <seanjc@google.com>
Cc: David Matlack <dmatlack@google.com>, Peter Xu <peterx@redhat.com>, 
	Andrea Arcangeli <aarcange@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, 
	Axel Rasmussen <axelrasmussen@google.com>, Linux MM <linux-mm@kvack.org>, kvm <kvm@vger.kernel.org>, 
	chao.p.peng@linux.intel.com, Oliver Upton <oupton@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Tue, Dec 6, 2022 at 1:01 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Dec 06, 2022, James Houghton wrote:
> > On Mon, Dec 5, 2022 at 8:06 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Mon, Dec 05, 2022, James Houghton wrote:
> > > > On Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > >
> > > > > On Mon, Dec 05, 2022, David Matlack wrote:
> > > > > > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > > > ...
> > > > > > > I'll have a closer read on the nested part, but note that this path already
> > > > > > > has the mmap lock then it invalidates the goal if we want to avoid taking
> > > > > > > it from the first place, or maybe we don't care?
> > > >
> > > > Not taking the mmap lock would be helpful, but we still have to take
> > > > it in UFFDIO_CONTINUE, so it's ok if we have to still take it here.
> > >
> > > IIUC, Peter is suggesting that the kernel not even get to the point where UFFD
> > > is involved.  The "fault" would get propagated to userspace by KVM, userspace
> > > fixes the fault (gets the page from the source, does MADV_POPULATE_WRITE), and
> > > resumes the vCPU.
> >
> > If we haven't UFFDIO_CONTINUE'd some address range yet,
> > MADV_POPULATE_WRITE for that range will drop into handle_userfault and
> > go to sleep. Not good!
>
> Ah, right, userspace would still need to register UFFD for the region to handle
> non-KVM (or incompatible KVM) accesses and could loop back on itself.
>
> > So, going with the no-slow-GUP approach, resolving faults is done like this:
> > - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart
> > KVM_RUN. The PTEs will be none/blank right now. This is the common
> > case.
> > - If we have UFFDIO_CONTINUE'd already, if we were to do it again, we
> > would get EEXIST. (In this case, we probably have some type of swap
> > entry in the page tables.) We have to change the page tables to make
> > fast GUP succeed now *without* using UFFDIO_CONTINUE now.
> > MADV_POPULATE_WRITE seems to be the right tool for the job. This case
> > happens if the kernel has swapped the memory out, is migrating it, has
> > poisoned it, etc. If MADV_POPULATE_WRITE fails, we probably need to
> > crash or inject a memory error.
> >
> > So with this approach, we never need to take the mmap_lock for reading
> > in hva_to_pfn, but we still need to take it in UFFDIO_CONTINUE.
> > Without removing the mmap_lock from *both*, we don't gain much.
> >
> > So if we disregard this tiny mmap_lock benefit, the other approach
> > (the PF_NO_UFFD_WAIT approach) seems better.
>
> Can you elaborate on what makes it better?  Or maybe generate a list of pros and
> cons?  I can think of (dis)advantages for both approaches, but I haven't identified
> anything that would be a blocking issue for either approach.  Doesn't mean there
> isn't one or more blocking issues, just that I haven't thought of any :-)

Let's see... so here's what using no-slow-GUP instead of no UFFD
waiting (PF_NO_UFFD_WAIT) means:
- No need to take mmap_lock in mem fault path.
- Need to change the relevant __gfn_to_pfn_memslot() callers
(kvm_faultin_pfn()/user_mem_abort()/others?) to pass `atomic = true`
when the new CAP is enabled.
- No need for a new PF_NO_UFFD_WAIT (would be toggled somewhere
in/near kvm_faultin_pfn/user_mem_abort).
- Userspace has to indirectly figure out the state of the page tables
to know what action to take (which introduces some weirdness, like if
anyone MADV_DONTNEEDs some guest memory, we need to know).
- While userfaultfd is registered (so like during post-copy), any
hva_to_pfn() calls that were resolvable with slow GUP before (without
dropping into handle_userfault()) will now need to be resolved by
userspace manually with madvise(MADV_POPULATE_WRITE). This extra trip
to userspace could slow things down.

Both approaches seem pretty simple to implement in the kernel; the
most complicated part is just returning KVM_EXIT_MEMORY_FAULT in more
places / for other architectures (I care about x86 and arm64).

Right now both approaches seem fine to me. With no-slow-GUP, not
having to take the mmap_lock in the fault path is a minor difference
now, but it could be a huge benefit if we later get around to making
UFFDIO_CONTINUE not need the mmap_lock. Disregarding that, the
PF_NO_UFFD_WAIT approach doesn't require userspace to guess the state
of the page tables, which seems helpful (less bug-prone, I guess).
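
To make the comparison a bit more concrete, here is a rough sketch in
C of how userspace might pick its next step when KVM_RUN exits with a
memory fault under each approach. Every name in it (the enums,
next_action(), the two bool inputs) is made up for illustration;
nothing here is an existing KVM or userfaultfd interface. It also
assumes userspace keeps its own record of which ranges it has already
UFFDIO_CONTINUE'd.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; not an existing KVM/userfaultfd API. */
enum approach { NO_SLOW_GUP, NO_UFFD_WAIT };

enum action {
	DO_UFFDIO_CONTINUE,	/* fetch page, UFFDIO_CONTINUE, rerun KVM_RUN */
	DO_POPULATE_WRITE,	/* madvise(MADV_POPULATE_WRITE), rerun KVM_RUN */
	DO_FAIL,		/* crash, or inject a memory error */
};

/*
 * Pick the next step after KVM_RUN exits with a memory fault.
 *
 * already_continued: userspace's own record (an assumption of this
 *	sketch) that it has UFFDIO_CONTINUE'd the faulting range.
 * populate_failed: a previous MADV_POPULATE_WRITE attempt on this
 *	range failed.
 */
enum action next_action(enum approach a, bool already_continued,
			bool populate_failed)
{
	assert(a == NO_SLOW_GUP || a == NO_UFFD_WAIT);

	/* Common case for both approaches: the PTEs are still none. */
	if (!already_continued)
		return DO_UFFDIO_CONTINUE;

	/* The kernel already ran slow GUP and it failed. */
	if (a == NO_UFFD_WAIT)
		return DO_FAIL;

	/*
	 * no-slow-GUP: the page was probably swapped out, is being
	 * migrated, etc. Userspace has to fix up the page tables.
	 */
	return populate_failed ? DO_FAIL : DO_POPULATE_WRITE;
}
```

The extra branch and the extra input in the no-slow-GUP case are the
"userspace has to guess the state of the page tables" cost: with
PF_NO_UFFD_WAIT, a second exit for an already-continued range is
simply an error.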

>
> > When KVM_RUN exits:
> > - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart KVM_RUN.
> > - If we have, then something bad has happened. Slow GUP already ran
> > and failed, so we need to treat this in the same way we treat a
> > MADV_POPULATE_WRITE failure above: userspace might just want to crash
> > (or inject a memory error or something).
> >
> > - James