Date: Tue, 01 Jul 2025 07:15:37 -0700
Subject: Re: [PATCH v12 10/18] KVM: x86/mmu: Handle guest page faults for guest_memfd with shared memory
From: Ackerley Tng
To: David Hildenbrand, Shivank Garg, Fuad Tabba
Cc: Sean Christopherson, kvm@vger.kernel.org, linux-arm-msm@vger.kernel.org, linux-mm@kvack.org, kvmarm@lists.linux.dev, pbonzini@redhat.com, chenhuacai@kernel.org, mpe@ellerman.id.au, anup@brainfault.org, paul.walmsley@sifive.com, palmer@dabbelt.com, aou@eecs.berkeley.edu, viro@zeniv.linux.org.uk, brauner@kernel.org, willy@infradead.org, akpm@linux-foundation.org, xiaoyao.li@intel.com, yilun.xu@intel.com, chao.p.peng@linux.intel.com, jarkko@kernel.org, amoorthy@google.com, dmatlack@google.com, isaku.yamahata@intel.com, mic@digikod.net, vbabka@suse.cz, vannapurve@google.com, mail@maciej.szmigiero.name, michael.roth@amd.com, wei.w.wang@intel.com, liam.merwick@oracle.com, isaku.yamahata@gmail.com, kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com, steven.price@arm.com, quic_eberman@quicinc.com, quic_mnalajal@quicinc.com, quic_tsoni@quicinc.com, quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com, quic_pderrin@quicinc.com, quic_pheragu@quicinc.com, catalin.marinas@arm.com, james.morse@arm.com, yuzenghui@huawei.com, oliver.upton@linux.dev, maz@kernel.org, will@kernel.org, qperret@google.com, keirf@google.com, roypat@amazon.co.uk, shuah@kernel.org, hch@infradead.org, jgg@nvidia.com, rientjes@google.com, jhubbard@nvidia.com, fvdl@google.com, hughd@google.com, jthoughton@google.com, peterx@redhat.com, pankaj.gupta@amd.com, ira.weiny@intel.com
In-Reply-To: <923b1c02-407a-4689-a047-dd94e885b103@redhat.com>
References: <20250611133330.1514028-1-tabba@google.com> <20250611133330.1514028-11-tabba@google.com> <434ab5a3-fedb-4c9e-8034-8f616b7e5e52@amd.com> <923b1c02-407a-4689-a047-dd94e885b103@redhat.com>

David Hildenbrand writes:

> On 30.06.25 21:26, Shivank Garg wrote:
>> On 6/30/2025 8:38 PM, Fuad Tabba wrote:
>>> Hi Ackerley,
>>>
>>> On Mon, 30 Jun 2025 at 15:44, Ackerley Tng wrote:
>>>>
>>>> Fuad Tabba writes:
>>>>
>>>>> Hi Ackerley,
>>>>>
>>>>> On Fri, 27 Jun 2025 at 16:01, Ackerley Tng wrote:
>>>>>>
>>>>>> Ackerley Tng writes:
>>>>>>
>>>>>>> [...]
>>>>>>
>>>>>>>>> +/*
>>>>>>>>> + * Returns true if the given gfn's private/shared status (in the CoCo sense) is
>>>>>>>>> + * private.
>>>>>>>>> + *
>>>>>>>>> + * A return value of false indicates that the gfn is explicitly or implicitly
>>>>>>>>> + * shared (i.e., non-CoCo VMs).
>>>>>>>>> + */
>>>>>>>>>  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>>>>>>>>>  {
>>>>>>>>> -	return IS_ENABLED(CONFIG_KVM_GMEM) &&
>>>>>>>>> -	       kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
>>>>>>>>> +	struct kvm_memory_slot *slot;
>>>>>>>>> +
>>>>>>>>> +	if (!IS_ENABLED(CONFIG_KVM_GMEM))
>>>>>>>>> +		return false;
>>>>>>>>> +
>>>>>>>>> +	slot = gfn_to_memslot(kvm, gfn);
>>>>>>>>> +	if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot)) {
>>>>>>>>> +		/*
>>>>>>>>> +		 * Without in-place conversion support, if a guest_memfd memslot
>>>>>>>>> +		 * supports shared memory, then all the slot's memory is
>>>>>>>>> +		 * considered not private, i.e., implicitly shared.
>>>>>>>>> +		 */
>>>>>>>>> +		return false;
>>>>>>>>
>>>>>>>> Why!?!?  Just make sure KVM_MEMORY_ATTRIBUTE_PRIVATE is mutually exclusive with
>>>>>>>> mappable guest_memfd.  You need to do that no matter what.
>>>>>>>
>>>>>>> Thanks, I agree that setting KVM_MEMORY_ATTRIBUTE_PRIVATE should be
>>>>>>> disallowed for gfn ranges whose slot is guest_memfd-only; I missed
>>>>>>> that. Where do people think we should check the mutual exclusivity?
>>>>>>>
>>>>>>> In kvm_supported_mem_attributes(), I'm thinking that we should still
>>>>>>> allow the use of KVM_MEMORY_ATTRIBUTE_PRIVATE for other
>>>>>>> non-guest_memfd-only gfn ranges. Or do people think we should just
>>>>>>> disallow KVM_MEMORY_ATTRIBUTE_PRIVATE for the entire VM as long as
>>>>>>> one memslot is a guest_memfd-only memslot?
>>>>>>>
>>>>>>> If we check mutual exclusivity when handling
>>>>>>> kvm_vm_set_memory_attributes(), then as long as any part of the
>>>>>>> range where KVM_MEMORY_ATTRIBUTE_PRIVATE is requested to be set
>>>>>>> intersects a range whose slot is guest_memfd-only, the ioctl will
>>>>>>> return EINVAL.
>>>>>>>
>>>>>>
>>>>>> At yesterday's (2025-06-26) guest_memfd upstream call discussion,
>>>>>>
>>>>>> * Fuad brought up a possible use case where, within the *same* VM, we
>>>>>>   want to allow both memslots that support mmap and memslots that do
>>>>>>   not support mmap in guest_memfd.
>>>>>> * Shivank suggested a concrete use case for this: the user wants a
>>>>>>   guest_memfd memslot that supports mmap just so userspace addresses
>>>>>>   can be used as references for specifying memory policy.
>>>>>> * Sean then added that allowing both types of guest_memfd memslots
>>>>>>   (supporting and not supporting mmap) gives the user a second layer
>>>>>>   of protection and ensures that, for some memslots, the user expects
>>>>>>   never to be able to mmap from the memslot.
>>>>>>
>>>>>> I agree it will be useful to allow both guest_memfd memslots that
>>>>>> support mmap and guest_memfd memslots that do not, in a single VM.
>>>>>>
>>>>>> I think I found an issue with flags, which is that
>>>>>> GUEST_MEMFD_FLAG_MMAP should not imply that the guest_memfd will
>>>>>> provide memory for all guest faults within the memslot's gfn range
>>>>>> (KVM_MEMSLOT_GMEM_ONLY).
>>>>>>
>>>>>> For the use case Shivank raised, if the user wants a guest_memfd
>>>>>> memslot that supports mmap just so userspace addresses can be used
>>>>>> as references for specifying memory policy, for legacy CoCo VMs
>>>>>> where shared memory should still come from other sources,
>>>>>> GUEST_MEMFD_FLAG_MMAP will be set, but KVM can't fault shared memory
>>>>>> from guest_memfd. Hence, GUEST_MEMFD_FLAG_MMAP should not imply
>>>>>> KVM_MEMSLOT_GMEM_ONLY.
>>>>>>
>>>>>> Thinking forward, if we want guest_memfd to provide (no-mmap)
>>>>>> protection even for non-CoCo VMs (such that perhaps the initial VM
>>>>>> image is populated and then VM memory should never be mmap-ed at
>>>>>> all), we will want guest_memfd to be the source of memory even if
>>>>>> GUEST_MEMFD_FLAG_MMAP is not set.
>>>>>>
>>>>>> I propose that we should have a single VM-level flag to solve this
>>>>>> (in line with Sean's guideline that we should just move towards what
>>>>>> we want and not support non-existent use cases): something like
>>>>>> KVM_CAP_PREFER_GMEM_MEMORY.
>>>>>>
>>>>>> If KVM_CAP_PREFER_GMEM_MEMORY is set,
>>>>>>
>>>>>> * memory for any gfn range in a guest_memfd memslot will be
>>>>>>   requested from guest_memfd
>>>>>> * any privacy status queries will also be directed to guest_memfd
>>>>>> * KVM_MEMORY_ATTRIBUTE_PRIVATE will not be a valid attribute
>>>>>>
>>>>>> KVM_CAP_PREFER_GMEM_MEMORY will be orthogonal, with no validation on
>>>>>> GUEST_MEMFD_FLAG_MMAP, which should just purely guard mmap support
>>>>>> in guest_memfd.
>>>>>>
>>>>>> Here's a table that I set up [1]. I believe the proposed
>>>>>> KVM_CAP_PREFER_GMEM_MEMORY (column 7) lines up with the requirements
>>>>>> (columns 1 to 4) correctly.
>>>>>>
>>>>>> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3710/guest_memfd%20use%20cases%20vs%20guest_memfd%20flags%20and%20privacy%20tracking.pdf
>>>>>
>>>>> I'm not sure this naming helps. What does "prefer" imply here? If the
>>>>> caller from user space does not prefer, does it mean that they
>>>>> mind/oppose?
>>>>>
>>>>
>>>> Sorry, bad naming.
>>>>
>>>> I used "prefer" because some memslots may not have guest_memfd at
>>>> all. To clarify, a "guest_memfd memslot" is a memslot that has some
>>>> valid guest_memfd fd and offset. The memslot may also have a valid
>>>> userspace_addr configured, either mmap-ed from the same guest_memfd
>>>> fd or from some other backing memory (for legacy CoCo VMs), or NULL
>>>> for userspace_addr.
>>>>
>>>> I meant to have the CAP enable KVM_MEMSLOT_GMEM_ONLY of this patch
>>>> series for all memslots that have some valid guest_memfd fd and
>>>> offset, except that with a VM-level CAP, KVM_MEMSLOT_GMEM_ONLY should
>>>> be moved to the VM level.
>>>
>>> Regardless of the name, I feel that this functionality at best does
>>> not belong in this series, and potentially adds more confusion.
>>>
>>> Userspace should be specific about what it wants, and it knows what
>>> kinds of memslots there are in the VM: userspace creates them. In that
>>> case, userspace can either create a legacy memslot, with no need for
>>> any of the new flags, or it can create a guest_memfd memslot, and then
>>> use any new flags to qualify that. Having a flag/capability that means
>>> something for guest_memfd memslots, but effectively keeps the same
>>> behavior for legacy ones, seems to add more confusion.
>>>
>>>>> Regarding the use case Shivank mentioned, mmaping for policy: while
>>>>> the use case is a valid one, the raison d'être of mmap is to map into
>>>>> user space (i.e., fault it in). I would argue that if you opt into
>>>>> mmap, you are doing it to be able to access it.
>>>>
>>>> The above is in conflict with what was discussed on 2025-06-26, IIUC.
>>>>
>>>> Shivank brought up the case of enabling mmap *only* to be able to set
>>>> mempolicy using the VMAs, and Sean (IIUC) later agreed that we should
>>>> allow userspace to enable mmap but still disable faults, so that
>>>> userspace is given additional protection, such that even if a
>>>> (compromised) userspace does a private-to-shared conversion,
>>>> userspace is still not allowed to fault in the page.
>>>
>>> I don't think there's a conflict :) What I think is that this is
>>> outside the scope of this series, for a few reasons:
>>>
>>> - This is prior to the mempolicy work (and is the base for it)
>>> - If we need to, we can add a flag later to restrict mmap faulting
>>> - Once we get in-place conversion, the mempolicy work could use the
>>>   ability to disallow mapping for private memory
>>>
>>> By actually implementing something now, we would be restricting the
>>> mempolicy work, rather than helping it, since we would effectively be
>>> deciding now how that work should proceed. By keeping this the way it
>>> is now, the mempolicy work can explore various alternatives.
>>>
>>> I think we discussed this in the guest_memfd sync of 2025-06-12, and I
>>> think this was roughly our conclusion.
>>>
>>>> Hence, if we want to support mmaping just for policy and continue to
>>>> restrict faulting, then GUEST_MEMFD_FLAG_MMAP should not imply
>>>> KVM_MEMSLOT_GMEM_ONLY.
>>>>
>>>>> To me, that seems like something that merits its own flag, rather
>>>>> than mmap. Also, I recall that we said that later on, with in-place
>>>>> conversion, that won't even be necessary.
>>>>
>>>> On x86, as of now I believe we're going with an ioctl that does *not*
>>>> check what the guest prefers and will go ahead to perform the
>>>> private-to-shared conversion, which will go ahead to update
>>>> shareability.
>>>
>>> Here I think you're making my case that we're dragging more complexity
>>> from future work/series into this series, since now we're going into
>>> the IOCTLs for the conversion series :)
>>>
>>>>> In other words, this would also be trying to solve a problem that we
>>>>> haven't yet encountered and that we have a solution for anyway.
>>>>>
>>>>
>>>> So we don't have a solution for the use case where userspace wants to
>>>> mmap but never fault, for userspace's protection from stray
>>>> private-to-shared conversions, unless we decouple
>>>> GUEST_MEMFD_FLAG_MMAP and KVM_MEMSLOT_GMEM_ONLY.
>>>>
>>>>> I think that, unless anyone disagrees, the way to go is to go ahead
>>>>> with the names we discussed in the last meeting. They seem to be the
>>>>> ones that make the most sense for the upcoming use cases.
>>>>>
>>>>
>>>> We could also discuss whether we really want to support the use case
>>>> where userspace wants to mmap but never fault, for userspace's
>>>> protection from stray private-to-shared conversions.
>>>
>>> I would really rather defer that work to when it's needed. It seems
>>> that we should aim to land this series as soon as possible, since it's
>>> the one blocking much of the future work. As far as I can tell,
>>> nothing here precludes introducing the mechanism of supporting the
>>> case where userspace wants to mmap but never fault, once it's needed.
>>> This was, I believe, what we agreed on in the sync on 2025-06-26.
>>
>> I support this approach.
>
> Agreed. Let's get this in with the changes requested by Sean applied.
>
> How to use GUEST_MEMFD_FLAG_MMAP in combination with a CoCo VM with
> legacy mem attributes (-> all memory in guest_memfd private) could be
> added later on top, once really required.
>
> As discussed, CoCo VMs that want to support GUEST_MEMFD_FLAG_MMAP will
> have to disable legacy mem attributes using a new capability in
> stage-2.
>

I rewatched the guest_memfd meeting of 2025-06-12. We do want to
support the use case where userspace wants to have mmap (e.g. to set
mempolicy) but does not want to allow faulting into the host.

On 2025-06-12, the conclusion was that the problem will be solved once
guest_memfd supports shareability: userspace can set shareability to
GUEST, so the memory can't be faulted into the host.

On 2025-06-26, Sean said we want to let userspace have an extra layer
of protection so that memory cannot be faulted into the host, ever.
IOW, we want to let userspace say that even if there is a stray
private-to-shared conversion, *don't* allow faulting memory into the
host.

The difference is the "extra layer of protection", which should remain
in effect even if (stray/unexpected) private-to-shared conversion
requests are sent to guest_memfd or to KVM. Here's a direct link to the
point in the video where Sean brought this up [1]. I'm really hoping I
didn't misinterpret this!

Let me look ahead a little, since this involves use cases already
brought up, though I'm not sure how real they are. I just want to make
sure that in a few patch series' time, we don't end up needing
userspace to juggle a complex bunch of CAPs and FLAGs.

In this series (mmap support, v12, patch 10/18) [2], to allow
KVM_X86_DEFAULT_VMs to use guest_memfd, I added a fault_from_gmem()
helper, which is defined as follows (before the renaming Sean
requested):

+static inline bool fault_from_gmem(struct kvm_page_fault *fault)
+{
+	return fault->is_private ||
+	       kvm_gmem_memslot_supports_shared(fault->slot);
+}

The above is changeable, of course :). The intention is that if the
fault is private, fault from guest_memfd, and if GUEST_MEMFD_FLAG_MMAP
is set (KVM_MEMSLOT_GMEM_ONLY will be set on the memslot), also fault
from guest_memfd.

If we defer handling GUEST_MEMFD_FLAG_MMAP in combination with a CoCo
VM with legacy mem attributes to the future, this helper will probably
become

-static inline bool fault_from_gmem(struct kvm_page_fault *fault)
+static inline bool fault_from_gmem(struct kvm *kvm, struct kvm_page_fault *fault)
 {
-	return fault->is_private ||
-	       kvm_gmem_memslot_supports_shared(fault->slot);
+	return fault->is_private ||
+	       (kvm_gmem_memslot_supports_shared(fault->slot) &&
+		!kvm_arch_disable_legacy_private_tracking(kvm));
 }

and on memslot binding we'd check something like

	if (kvm_arch_disable_legacy_private_tracking(kvm) &&
	    !(flags & GUEST_MEMFD_FLAG_MMAP))
		return -EINVAL;

1. Is that what y'all meant?

2. Doesn't this fail to satisfy the "extra layer of protection"
   requirement (if it is a requirement)? A legacy CoCo VM using
   guest_memfd only for private memory (shared memory from, say, shmem)
   and needing to set mempolicy would

   * Set GUEST_MEMFD_FLAG_MMAP
   * Leave KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING defaulted to false

   but still be able to send conversion ioctls directly to guest_memfd,
   and then be able to fault guest_memfd memory into the host.
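To make (2) concrete, here is a rough sketch of the only gating I'd
expect in guest_memfd's shared fault path, assuming the shareability
tracking discussed above (kvm_gmem_fault() and kvm_gmem_is_shared() are
hypothetical names here; IIUC kvm_gmem_get_folio() already exists in
virt/kvm/guest_memfd.c):

/*
 * Sketch only, not the actual implementation. The sole gate is
 * guest_memfd's own per-offset shareability, so a (stray)
 * private-to-shared conversion is enough to make this fault succeed:
 * nothing here consults a second, VM-level "never allow host faults"
 * setting.
 */
static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vmf->vma->vm_file);
	struct folio *folio;

	/* Hypothetical helper: per-offset shareability check. */
	if (!kvm_gmem_is_shared(inode, vmf->pgoff))
		return VM_FAULT_SIGBUS;

	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
	if (IS_ERR(folio))
		return VM_FAULT_SIGBUS;

	vmf->page = folio_file_page(folio, vmf->pgoff);
	return 0;
}

If the "extra layer" is a requirement, I'd expect an additional check
in this path (or a way to create the guest_memfd without mmap-ability)
that conversion ioctls alone cannot flip.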
3. Now for a use case I've heard of (feel free to tell me this will
   never be supported, or "we'll deal with it if it comes"): on a
   non-CoCo VM, we want to use guest_memfd but not use mmap (and the
   initial VM image will be written using the write() syscall or
   something else).

   * Set GUEST_MEMFD_FLAG_MMAP to false
   * Leave KVM_CAP_DISABLE_LEGACY_PRIVATE_TRACKING defaulted to false
     (it's a non-CoCo VM; it would be weird to do anything related to
     private tracking)

   And now we're stuck, because fault_from_gmem() will return false all
   the time, and we can't use memory from guest_memfd.

[1] https://youtu.be/7b5hgKHoZoY?t=1162s
[2] https://lore.kernel.org/all/20250611133330.1514028-11-tabba@google.com/

> --
> Cheers,
>
> David / dhildenb