Date: Fri, 8 Mar 2024 15:22:50 -0800
Subject: Re: Unmapping KVM Guest Memory from Host Kernel
From: Sean Christopherson
To: James Gowans
Cc: akpm@linux-foundation.org, Patrick Roy, chao.p.peng@linux.intel.com,
    Derek Manwaring, rppt@kernel.org, pbonzini@redhat.com, David Woodhouse,
    Nikita Kalyazin, lstoakes@gmail.com, Liam.Howlett@oracle.com,
    linux-mm@kvack.org, qemu-devel@nongnu.org,
    kirill.shutemov@linux.intel.com, vbabka@suse.cz, mst@redhat.com,
    somlo@cmu.edu, Alexander Graf, kvm@vger.kernel.org,
    linux-coco@lists.linux.dev

On Fri, Mar 08, 2024, James Gowans wrote:
> However, memfd_secret doesn’t work out the box for KVM guest memory; the
> main reason seems to be that the GUP path is intentionally disabled for
> memfd_secret, so if we use a memfd_secret backed VMA for a memslot then
> KVM is not able to fault the memory in. If it’s been pre-faulted in by
> userspace then it seems to work.

Huh, that _shouldn't_ work. The folio_is_secretmem() check in gup_pte_range()
is supposed to prevent the "fast gup" path from getting secretmem pages.

Is this on an upstream kernel? If so, and if you have bandwidth, can you figure
out why that isn't working? At the very least, I suspect the memfd_secret
maintainers would be very interested to know that it's possible to fast gup
secretmem.

> There are a few other issues around when KVM accesses the guest memory.
> For example the KVM PV clock code goes directly to the PFN via the
> pfncache, and that also breaks if the PFN is not in the direct map, so
> we’d need to change that sort of thing, perhaps going via userspace
> addresses.
>
> If we remove the memfd_secret check from the GUP path, and disable KVM’s
> pvclock from userspace via KVM_CPUID_FEATURES, we are able to boot a
> simple Linux initrd using a Firecracker VMM modified to use
> memfd_secret.
>
> We are also aware of ongoing work on guest_memfd. The current
> implementation unmaps guest memory from VMM address space, but leaves it
> in the kernel’s direct map. We’re not looking at unmapping from VMM
> userspace yet; we still need guest RAM there for PV drivers like virtio
> to continue to work. So KVM’s gmem doesn’t seem like the right solution?

We (and by "we", I really mean the pKVM folks) are also working on allowing
userspace to mmap() guest_memfd[*]. pKVM aside, the long term vision I have for
guest_memfd is to be able to use it for non-CoCo VMs, precisely for the security
and robustness benefits it can bring.

What I am hoping to do with guest_memfd is get userspace to only map memory it
needs, e.g. for emulated/synthetic devices, on-demand. I.e. to get to a state
where guest memory is mapped only when it needs to be. More below.

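As an aside, for anyone wanting to poke at the memfd_secret behavior discussed
at the top of this mail, here is a rough userspace sketch (Python via ctypes,
purely for illustration) of creating a secretmem region and mapping it. Treat
it as a sketch under assumptions: the syscall number 447 is hard-coded (the
value on x86-64 and arm64) since there is no libc wrapper to rely on, and
create_secretmem() is a made-up helper name. The resulting mapping is the kind
of VMA a VMM could hand to KVM as memslot backing; the KVM side is not shown.

```python
import ctypes
import mmap
import os
import sys

SYS_MEMFD_SECRET = 447  # assumption: syscall number on x86-64 and arm64


def create_secretmem(size):
    """Map `size` bytes of memfd_secret memory; return None if unavailable."""
    if not sys.platform.startswith("linux"):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    fd = libc.syscall(ctypes.c_long(SYS_MEMFD_SECRET), ctypes.c_uint(0))
    if fd < 0:
        # e.g. ENOSYS when secretmem is not built in or not enabled
        return None
    try:
        os.ftruncate(fd, size)  # secretmem starts empty and is sized once
        # secretmem only supports shared mappings
        return mmap.mmap(fd, size, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    except OSError:
        # e.g. the mapping would exceed RLIMIT_MEMLOCK
        return None
    finally:
        os.close(fd)  # mmap keeps its own duplicate of the descriptor


region = create_secretmem(mmap.PAGESIZE)
print("secretmem mapping:",
      "unavailable" if region is None else f"{len(region)} bytes")
```

The helper returns None rather than raising, so the same script can be run on
kernels with and without secretmem to see which case you are in.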

> With this in mind, what’s the best way to solve getting guest RAM out of
> the direct map? Is memfd_secret integration with KVM the way to go, or
> should we build a solution on top of guest_memfd, for example via some
> flag that causes it to leave memory in the host userspace’s page tables,
> but removes it from the direct map?

100% enhance guest_memfd. If you're willing to wait long enough, pKVM might even
do all the work for you. :-)

The killer feature of guest_memfd is that it allows the guest mappings to be a
superset of the host userspace mappings. Most obviously, it allows mapping
memory into the guest without first mapping the memory into the userspace page
tables. More subtly, it also makes it easier (in theory) to do things like map
the memory with 1GiB hugepages for the guest, but selectively map at 4KiB
granularity in the host. Or map memory as RWX in the guest, but RO in the host
(I don't have a concrete use case for this, just pointing out it'll be trivial
to do once guest_memfd supports mmap()).

Every attempt to allow mapping VMA-based memory into a guest without it being
accessible by host userspace failed; it's literally why we ended up
implementing guest_memfd. We could teach KVM to do the same with memfd_secret,
but we'd just end up re-implementing guest_memfd.

memfd_secret obviously gets you a PoC much faster, but in the long term I'm
quite sure you'll be fighting memfd_secret all the way. E.g. it's not dumpable,
it deliberately allocates at 4KiB granularity (though I suspect the bug you
found means that it can be inadvertently mapped with 2MiB hugepages), it has no
line of sight to taking userspace out of the equation, etc.

With guest_memfd on the other hand, everyone contributing to and maintaining it
has goals that are *very* closely aligned with what you want to do.

[*] https://lore.kernel.org/all/20240222161047.402609-1-tabba@google.com
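
To put rough numbers on the hugepage point above, the sketch below just counts
leaf page-table entries for a region at a given mapping granularity. The 2MiB
"device window" is a made-up example size, and leaf_entries() is a hypothetical
helper for the arithmetic, not anything from KVM.

```python
KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30


def leaf_entries(region_bytes, page_bytes):
    """Leaf page-table entries needed to map region_bytes with page_bytes pages."""
    if region_bytes % page_bytes:
        raise ValueError("region must be a multiple of the page size")
    return region_bytes // page_bytes


guest = leaf_entries(1 * GIB, 1 * GIB)        # guest: one 1GiB hugepage
host_all = leaf_entries(1 * GIB, 4 * KIB)     # host mapping everything at 4KiB
host_window = leaf_entries(2 * MIB, 4 * KIB)  # host mapping only a 2MiB window

print(guest, host_all, host_window)
```

I.e. a single guest entry, versus 512 host PTEs if userspace maps only the
window it needs, or 262144 if it had to map the whole GiB at 4KiB.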