Date: Mon, 7 Jul 2025 17:05:31 -0700
In-Reply-To: <0cdc7890-aade-4fa5-ad72-24cde6c7bce9@redhat.com>
Mime-Version: 1.0
References: <434ab5a3-fedb-4c9e-8034-8f616b7e5e52@amd.com>
 <923b1c02-407a-4689-a047-dd94e885b103@redhat.com>
 <0cdc7890-aade-4fa5-ad72-24cde6c7bce9@redhat.com>
Subject: Re: [PATCH v12 10/18] KVM: x86/mmu: Handle guest page faults for guest_memfd with shared memory
From: Sean Christopherson
To: David Hildenbrand
Cc: Ackerley Tng, Shivank Garg, Fuad Tabba, kvm@vger.kernel.org,
 linux-arm-msm@vger.kernel.org, linux-mm@kvack.org, kvmarm@lists.linux.dev,
 pbonzini@redhat.com, chenhuacai@kernel.org, mpe@ellerman.id.au,
 anup@brainfault.org, paul.walmsley@sifive.com, palmer@dabbelt.com,
 aou@eecs.berkeley.edu, viro@zeniv.linux.org.uk, brauner@kernel.org,
 willy@infradead.org, akpm@linux-foundation.org, xiaoyao.li@intel.com,
 yilun.xu@intel.com, chao.p.peng@linux.intel.com, jarkko@kernel.org,
 amoorthy@google.com, dmatlack@google.com, isaku.yamahata@intel.com,
 mic@digikod.net, vbabka@suse.cz, vannapurve@google.com,
 mail@maciej.szmigiero.name, michael.roth@amd.com, wei.w.wang@intel.com,
 liam.merwick@oracle.com, isaku.yamahata@gmail.com,
 kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com, steven.price@arm.com,
 quic_eberman@quicinc.com, quic_mnalajal@quicinc.com, quic_tsoni@quicinc.com,
 quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com,
 quic_pderrin@quicinc.com, quic_pheragu@quicinc.com, catalin.marinas@arm.com,
 james.morse@arm.com, yuzenghui@huawei.com, oliver.upton@linux.dev,
 maz@kernel.org, will@kernel.org, qperret@google.com, keirf@google.com,
 roypat@amazon.co.uk, shuah@kernel.org, hch@infradead.org, jgg@nvidia.com,
 rientjes@google.com, jhubbard@nvidia.com, fvdl@google.com, hughd@google.com,
 jthoughton@google.com, peterx@redhat.com, pankaj.gupta@amd.com,
 ira.weiny@intel.com
Content-Type: text/plain; charset="us-ascii"
On Tue, Jul 01, 2025, David Hildenbrand wrote:
> > > > I support this approach.
> > > 
> > > Agreed. Let's get this in with the changes requested by Sean applied.
> > > 
> > > How to use GUEST_MEMFD_FLAG_MMAP in combination with a CoCo VM with
> > > legacy mem attributes (-> all memory in guest_memfd private) could be
> > > added later on top, once really required.
> > > 
> > > As discussed, CoCo VMs that want to support GUEST_MEMFD_FLAG_MMAP will
> > > have to disable legacy mem attributes using a new capability in stage-2.
> > > 
> > 
> > I rewatched the guest_memfd meeting on 2025-06-12. We do want to
> > support the use case where userspace wants to have mmap (e.g. to set
> > mempolicy) but does not want to allow faulting into the host.
> > 
> > On 2025-06-12, the conclusion was that the problem will be solved once
> > guest_memfd supports shareability, and that's because userspace can set
> > shareability to GUEST, so the memory can't be faulted into the host.
> > 
> > On 2025-06-26, Sean said we want to let userspace have an extra layer of
> > protection so that memory cannot be faulted in to the host, ever. IOW,
> > we want to let userspace say that even if there is a stray
> > private-to-shared conversion, *don't* allow faulting memory into the
> > host.

Eh, my comments were more along the lines of "it would be nice if we could
have such protections", not a "we must support this".  And I suspect that
making the behavior all-or-nothing for a given guest_memfd wouldn't be very
useful, i.e. that userspace would probably want to be able to prevent
accessing a specific chunk of the gmem instance.

Actually, we can probably get that via mseal(), maybe even for free today?
E.g. mmap() w/ PROT_NONE, mbind(), and then mseal().  So yeah, I think we do
nothing for now.
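For concreteness, the userspace-only flow that's pointing at might look
roughly like the sketch below.  It's purely illustrative: it assumes the
guest_memfd was created with mmap support so the PROT_NONE mapping is even
possible, the function name and NUMA node choice are made up, and whether an
mbind() policy attached this way is actually honored for guest_memfd
allocations today is exactly the "maybe even for free" question.

/*
 * Illustrative only: map the guest_memfd PROT_NONE so a NUMA policy can be
 * attached to the range without ever letting the host touch the memory,
 * then seal the mapping so its protections can't be relaxed later.
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>	/* MPOL_BIND */

#ifndef __NR_mseal
#define __NR_mseal 462	/* mseal() is recent (v6.10+); define if headers predate it */
#endif

static void *pin_gmem_policy(int gmem_fd, size_t size, int node)
{
	unsigned long nodemask = 1UL << node;
	void *addr;

	/* Requires the guest_memfd to have been created with mmap support. */
	addr = mmap(NULL, size, PROT_NONE, MAP_SHARED, gmem_fd, 0);
	if (addr == MAP_FAILED)
		return NULL;

	/* Bind the range's backing memory to the chosen NUMA node. */
	if (syscall(__NR_mbind, addr, size, MPOL_BIND, &nodemask,
		    8 * sizeof(nodemask), 0))
		goto err;

	/* Seal: no future mprotect()/munmap()/mremap() can undo PROT_NONE. */
	if (syscall(__NR_mseal, addr, size, 0))
		goto err;

	return addr;
err:
	munmap(addr, size);
	return NULL;
}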
> > The difference is the "extra layer of protection", which should remain
> > in effect even if there are (stray/unexpected) private-to-shared
> > conversions to guest_memfd or to KVM. Here's a direct link to the point
> > in the video where Sean brought this up [1]. I'm really hoping I didn't
> > misinterpret this!
> > 
> > Let me look ahead a little, since this involves use cases already
> > brought up though I'm not sure how real they are. I just want to make
> > sure that in a few patch series' time, we don't end up needing userspace
> > to use a complex bunch of CAPs and FLAGs.
> > 
> > In this series (mmap support, V12, patch 10/18) [2], to allow
> > KVM_X86_DEFAULT_VMs to use guest_memfd, I added a `fault_from_gmem()`
> > helper, which is defined as follows (before the renaming Sean requested):
> > 
> > +static inline bool fault_from_gmem(struct kvm_page_fault *fault)
> > +{
> > +	return fault->is_private || kvm_gmem_memslot_supports_shared(fault->slot);
> > +}
> > 
> > The above is changeable, of course :). The intention is that if the
> > fault is private, fault from guest_memfd. If GUEST_MEMFD_FLAG_MMAP is
> > set (KVM_MEMSLOT_GMEM_ONLY will be set on the memslot), fault from
> > guest_memfd.
> > 
> > If we defer handling GUEST_MEMFD_FLAG_MMAP in combination with a CoCo VM
> > with legacy mem attributes to the future, this helper will probably
> > become
> > 
> > -static inline bool fault_from_gmem(struct kvm_page_fault *fault)
> > +static inline bool fault_from_gmem(struct kvm *kvm, struct kvm_page_fault *fault)
> > +{
> > -	return fault->is_private || kvm_gmem_memslot_supports_shared(fault->slot);
> > +	return fault->is_private || (kvm_gmem_memslot_supports_shared(fault->slot) &&
> > +				     !kvm_arch_disable_legacy_private_tracking(kvm));
> > +}
> > 
> > And on memslot binding we check
> > 
> > if kvm_arch_disable_legacy_private_tracking(kvm)

I would invert the KVM-internal arch hook, and only have KVM x86's capability
refer to the private memory attribute as legacy (because it simply doesn't
exist for anything else).

> > and not GUEST_MEMFD_FLAG_MMAP
> > return -EINVAL;
> > 
> > 1. Is that what yall meant?

I was thinking:

	if (kvm_arch_has_private_memory_attribute(kvm) == kvm_gmem_mmap(...))
		return -EINVAL;

I.e. in addition to requiring mmap() when KVM doesn't track private/shared via
memory attributes, also disallow mmap() when private/shared is tracked via
memory attributes.

> My understanding:
> 
> CoCo VMs will initially (stage-1) only support !GUEST_MEMFD_FLAG_MMAP.
> 
> With stage-2, CoCo VMs will support GUEST_MEMFD_FLAG_MMAP only with
> kvm_arch_disable_legacy_private_tracking().

Yep, and everything except x86 will unconditionally return true for
kvm_arch_disable_legacy_private_tracking() (or false if it's inverted as
above).

> Non-CoCo VMs will only support GUEST_MEMFD_FLAG_MMAP. (no concept of
> private)
> 
> > 2. Does this kind of not satisfy the "extra layer of protection"
> > requirement (if it is a requirement)?

It's not a requirement.

> > A legacy CoCo VM using guest_memfd only for private memory (shared
> > memory from say, shmem) and needing to set mempolicy would
> > * Set GUEST_MEMFD_FLAG_MMAP

I think we should keep it simple as above, and not support mmap() (and
therefore mbind()) with legacy CoCo VMs.  Given the double allocation flaws
with the legacy approach, supporting mbind() seems like putting a bandaid on
a doomed idea.
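Spelled out, the "keep it simple" rule is just the equality check from above
applied when a guest_memfd is bound to (or created for) a VM.  A minimal
sketch, assuming hypothetical helpers named after the wording in this thread
(neither exists in the tree today):

/*
 * Hypothetical sketch: mmap() support and legacy private/shared tracking via
 * KVM memory attributes are made mutually exclusive, in both directions.
 */
static int kvm_gmem_check_mmap_vs_attributes(struct kvm *kvm, struct file *file)
{
	bool legacy_attrs = kvm_arch_has_private_memory_attribute(kvm);
	bool mmap_allowed = kvm_gmem_supports_mmap(file);

	/*
	 * Legacy CoCo VMs must not bind a mappable guest_memfd; every other
	 * VM type must use one.
	 */
	if (legacy_attrs == mmap_allowed)
		return -EINVAL;

	return 0;
}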
> > * Leave KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING defaulted to false
> > but still be able to send conversion ioctls directly to guest_memfd,
> > and then be able to fault guest_memfd memory into the host.
> In that configuration, I would expect that all memory in guest_memfd is
> private and remains private.
> 
> guest_memfd without memory attributes cannot support in-place conversion.
> 
> How to achieve that might be interesting: the capability will affect
> guest_memfd behavior?
> 
> > 3. Now for a use case I've heard of (feel free to tell me this will
> > never be supported or "we'll deal with it if it comes"): On a
> > non-CoCo VM, we want to use guest_memfd but not use mmap (and the
> > initial VM image will be written using write() syscall or something
> > else).
> > 
> > * Set GUEST_MEMFD_FLAG_MMAP to false
> > * Leave KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING defaulted to false
> > (it's a non-CoCo VM, weird to do anything to do with private
> > tracking)
> > 
> > And now we're stuck because fault_from_gmem() will return false all
> > the time and we can't use memory from guest_memfd.

Nah, don't support this scenario.  Or rather, use mseal() as above.  If
someone comes along with a concrete, strong use case for backing non-CoCo VMs
and using mseal() to wall off guest memory doesn't suffice, then they can have
the honor of justifying why KVM needs to take on more complexity. :-)

> I think I discussed that with Sean: we would have GUEST_MEMFD_FLAG_WRITE
> that will imply everything that GUEST_MEMFD_FLAG_MMAP would imply, except
> the actual mmap() support.

Ya, for the write() access or whatever.  But there are bigger problems beyond
populating the memory, e.g. a non-CoCo VM won't support private memory, so
without many more changes to redirect KVM to gmem when faulting in guest
memory, KVM won't be able to map any memory into the guest.
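For reference, the VMM-side flow the GUEST_MEMFD_FLAG_WRITE idea is aiming at
would look roughly like the sketch below.  Everything about the flag itself is
hypothetical (it does not exist); only KVM_CREATE_GUEST_MEMFD is real today,
and as noted above the pwrite() would currently fail because guest_memfd does
not support write().

/*
 * Hypothetical sketch: populate the initial guest image via pwrite() on the
 * guest_memfd, without ever mmap()ing it, for a non-CoCo VMM.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

static int create_and_populate_gmem(int vm_fd, const void *image,
				    size_t image_size, uint64_t gmem_size)
{
	struct kvm_create_guest_memfd args = {
		.size  = gmem_size,
		/* .flags would carry the hypothetical GUEST_MEMFD_FLAG_WRITE */
		.flags = 0,
	};
	int gmem_fd;

	gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
	if (gmem_fd < 0)
		return -1;

	/* Not supported today; assumes the hypothetical flag would allow it. */
	if (pwrite(gmem_fd, image, image_size, 0) != (ssize_t)image_size) {
		close(gmem_fd);
		return -1;
	}

	return gmem_fd;
}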