References: <20250611133330.1514028-1-tabba@google.com> <20250611133330.1514028-11-tabba@google.com>
From: Fuad Tabba <tabba@google.com>
Date: Mon, 30 Jun 2025 16:08:15 +0100
Subject: Re: [PATCH v12 10/18] KVM: x86/mmu: Handle guest page faults for guest_memfd with shared memory
To: Ackerley Tng
Cc: Sean Christopherson, kvm@vger.kernel.org, linux-arm-msm@vger.kernel.org, linux-mm@kvack.org, kvmarm@lists.linux.dev, pbonzini@redhat.com, chenhuacai@kernel.org, mpe@ellerman.id.au, anup@brainfault.org, paul.walmsley@sifive.com, palmer@dabbelt.com, aou@eecs.berkeley.edu, viro@zeniv.linux.org.uk, brauner@kernel.org, willy@infradead.org, akpm@linux-foundation.org, xiaoyao.li@intel.com, yilun.xu@intel.com, chao.p.peng@linux.intel.com, jarkko@kernel.org, amoorthy@google.com, dmatlack@google.com, isaku.yamahata@intel.com, mic@digikod.net, vbabka@suse.cz, vannapurve@google.com, mail@maciej.szmigiero.name, david@redhat.com, michael.roth@amd.com, wei.w.wang@intel.com, liam.merwick@oracle.com, isaku.yamahata@gmail.com, kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com, steven.price@arm.com, quic_eberman@quicinc.com, quic_mnalajal@quicinc.com, quic_tsoni@quicinc.com, quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com, quic_pderrin@quicinc.com, quic_pheragu@quicinc.com, catalin.marinas@arm.com, james.morse@arm.com, yuzenghui@huawei.com, oliver.upton@linux.dev, maz@kernel.org, will@kernel.org, qperret@google.com, keirf@google.com, roypat@amazon.co.uk, shuah@kernel.org, hch@infradead.org, jgg@nvidia.com, rientjes@google.com, jhubbard@nvidia.com, fvdl@google.com, hughd@google.com, jthoughton@google.com, peterx@redhat.com, pankaj.gupta@amd.com, ira.weiny@intel.com
Hi Ackerley,

On Mon, 30 Jun 2025 at 15:44, Ackerley Tng wrote:
>
> Fuad Tabba writes:
>
> > Hi Ackerley,
> >
> > On Fri, 27 Jun 2025 at 16:01, Ackerley Tng wrote:
> >>
> >> Ackerley Tng writes:
> >>
> >> > [...]
> >>
> >> >>> +/*
> >> >>> + * Returns true if the given gfn's private/shared status (in the CoCo sense) is
> >> >>> + * private.
> >> >>> + *
> >> >>> + * A return value of false indicates that the gfn is explicitly or implicitly
> >> >>> + * shared (i.e., non-CoCo VMs).
> >> >>> + */
> >> >>>  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> >> >>>  {
> >> >>> -       return IS_ENABLED(CONFIG_KVM_GMEM) &&
> >> >>> -              kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> >> >>> +       struct kvm_memory_slot *slot;
> >> >>> +
> >> >>> +       if (!IS_ENABLED(CONFIG_KVM_GMEM))
> >> >>> +               return false;
> >> >>> +
> >> >>> +       slot = gfn_to_memslot(kvm, gfn);
> >> >>> +       if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot)) {
> >> >>> +               /*
> >> >>> +                * Without in-place conversion support, if a guest_memfd memslot
> >> >>> +                * supports shared memory, then all the slot's memory is
> >> >>> +                * considered not private, i.e., implicitly shared.
> >> >>> +                */
> >> >>> +               return false;
> >> >>
> >> >> Why!?!?  Just make sure KVM_MEMORY_ATTRIBUTE_PRIVATE is mutually exclusive with
> >> >> mappable guest_memfd.  You need to do that no matter what.
> >> >
> >> > Thanks, I agree that setting KVM_MEMORY_ATTRIBUTE_PRIVATE should be
> >> > disallowed for gfn ranges whose slot is guest_memfd-only. Missed that
> >> > out. Where do people think we should check the mutual exclusivity?
> >> >
> >> > In kvm_supported_mem_attributes() I'm thinking that we should still allow
> >> > the use of KVM_MEMORY_ATTRIBUTE_PRIVATE for other non-guest_memfd-only
> >> > gfn ranges. Or do people think we should just disallow
> >> > KVM_MEMORY_ATTRIBUTE_PRIVATE for the entire VM as long as one memslot is
> >> > a guest_memfd-only memslot?
> >> >
> >> > If we check mutual exclusivity when handling
> >> > kvm_vm_set_memory_attributes(), as long as part of the range where
> >> > KVM_MEMORY_ATTRIBUTE_PRIVATE is requested to be set intersects a range
> >> > whose slot is guest_memfd-only, the ioctl will return EINVAL.
> >> >
> >> At yesterday's (2025-06-26) guest_memfd upstream call discussion,
> >>
> >> * Fuad brought up a possible use case where, within the *same* VM, we
> >>   want to allow both memslots that support and memslots that do not
> >>   support mmap in guest_memfd.
> >> * Shivank suggested a concrete use case for this: the user wants a
> >>   guest_memfd memslot that supports mmap just so userspace addresses can
> >>   be used as references for specifying memory policy.
> >> * Sean then added that allowing both types of guest_memfd memslots
> >>   (supporting and not supporting mmap) will give the user a second
> >>   layer of protection and ensure that, for some memslots, the user
> >>   expects never to be able to mmap from the memslot.
> >>
> >> I agree it will be useful to allow both guest_memfd memslots that
> >> support and do not support mmap in a single VM.
> >>
> >> I think I found an issue with flags, which is that GUEST_MEMFD_FLAG_MMAP
> >> should not imply that the guest_memfd will provide memory for all guest
> >> faults within the memslot's gfn range (KVM_MEMSLOT_GMEM_ONLY).
> >>
> >> For the use case Shivank raised, if the user wants a guest_memfd memslot
> >> that supports mmap just so userspace addresses can be used as references
> >> for specifying memory policy for legacy CoCo VMs where shared memory
> >> should still come from other sources, GUEST_MEMFD_FLAG_MMAP will be set,
> >> but KVM can't fault shared memory from guest_memfd. Hence,
> >> GUEST_MEMFD_FLAG_MMAP should not imply KVM_MEMSLOT_GMEM_ONLY.
> >>
> >> Thinking forward, if we want guest_memfd to provide (no-mmap) protection
> >> even for non-CoCo VMs (such that perhaps the initial VM image is populated
> >> and then VM memory should never be mmap-ed at all), we will want
> >> guest_memfd to be the source of memory even if GUEST_MEMFD_FLAG_MMAP is
> >> not set.
> >>
> >> I propose that we should have a single VM-level flag to solve this (in
> >> line with Sean's guideline that we should just move towards what we want
> >> and not support non-existent use cases): something like
> >> KVM_CAP_PREFER_GMEM.
> >>
> >> If KVM_CAP_PREFER_GMEM_MEMORY is set,
> >>
> >> * memory for any gfn range in a guest_memfd memslot will be requested
> >>   from guest_memfd
> >> * any privacy status queries will also be directed to guest_memfd
> >> * KVM_MEMORY_ATTRIBUTE_PRIVATE will not be a valid attribute
> >>
> >> KVM_CAP_PREFER_GMEM_MEMORY will be orthogonal, with no validation on
> >> GUEST_MEMFD_FLAG_MMAP, which should just purely guard mmap support in
> >> guest_memfd.
> >>
> >> Here's a table that I set up [1]. I believe the proposed
> >> KVM_CAP_PREFER_GMEM_MEMORY (column 7) lines up with the requirements
> >> (columns 1 to 4) correctly.
> >>
> >> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3710/guest_memfd%20use%20cases%20vs%20guest_memfd%20flags%20and%20privacy%20tracking.pdf
> >
> > I'm not sure this naming helps. What does "prefer" imply here? If the
> > caller from user space does not prefer, does it mean that they
> > mind/oppose?
> >
>
> Sorry, bad naming.
>
> I used "prefer" because some memslots may not have guest_memfd at
> all. To clarify, a "guest_memfd memslot" is a memslot that has some
> valid guest_memfd fd and offset. The memslot may also have a valid
> userspace_addr configured, either mmap-ed from the same guest_memfd fd
> or from some other backing memory (for legacy CoCo VMs), or NULL for
> userspace_addr.
>
> I meant to have the CAP enable KVM_MEMSLOT_GMEM_ONLY of this patch
> series for all memslots that have some valid guest_memfd fd and offset,
> except that if we have a VM-level CAP, KVM_MEMSLOT_GMEM_ONLY should be
> moved to the VM level.

Regardless of the name, I feel that this functionality at best does not
belong in this series, and potentially adds more confusion.
Userspace should be specific about what it wants, and it knows what
kinds of memslots there are in the VM: userspace creates them. It can
either create a legacy memslot, with no need for any of the new flags,
or it can create a guest_memfd memslot and then use any new flags to
qualify that. Having a flag/capability that means something for
guest_memfd memslots but effectively keeps the same behavior for legacy
ones seems to add more confusion.

> > Regarding the use case Shivank mentioned, mmap-ing for policy, while
> > the use case is a valid one, the raison d'être of mmap is to map into
> > user space (i.e., fault it in). I would argue that if you opt into
> > mmap, you are doing it to be able to access it.
>
> The above is in conflict with what was discussed on 2025-06-26, IIUC.
>
> Shivank brought up the case of enabling mmap *only* to be able to set
> mempolicy using the VMAs, and Sean (IIUC) later agreed we should allow
> userspace to only enable mmap but still disable faults, so that userspace
> is given additional protection, such that even if a (compromised)
> userspace does a private-to-shared conversion, userspace is still not
> allowed to fault in the page.

I don't think there's a conflict :) What I think is that this is outside
the scope of this series, for a few reasons:

- This is prior to the mempolicy work (and is the base for it)
- If we need to, we can add a flag later to restrict mmap faulting
- Once we get in-place conversion, the mempolicy work could use the
  ability to disallow mapping for private memory

By actually implementing something now, we would be restricting the
mempolicy work rather than helping it, since we would effectively be
deciding now how that work should proceed. By keeping this the way it
is now, the mempolicy work can explore various alternatives.

I think we discussed this in the guest_memfd sync of 2025-06-12, and I
think this was roughly our conclusion.
> Hence, if we want to support mmap-ing just for policy and continue to
> restrict faulting, then GUEST_MEMFD_FLAG_MMAP should not imply
> KVM_MEMSLOT_GMEM_ONLY.
>
> > To me, that seems like
> > something that merits its own flag, rather than mmap. Also, I recall
> > that we said that later on, with in-place conversion, that won't even
> > be necessary.
>
> On x86, as of now, I believe we're going with an ioctl that does *not*
> check what the guest prefers and will go ahead to perform the
> private-to-shared conversion, which will go ahead to update
> shareability.

Here I think you're making my case that we're dragging more complexity
from future work/series into this series, since now we're going into
the ioctls for the conversion series :)

> > In other words, this would also be trying to solve a
> > problem that we haven't yet encountered and that we have a solution
> > for anyway.
>
> So we don't have a solution for the use case where userspace wants to
> mmap but never fault, for userspace's protection from stray
> private-to-shared conversions, unless we decouple GUEST_MEMFD_FLAG_MMAP
> and KVM_MEMSLOT_GMEM_ONLY.
>
> > I think that, unless anyone disagrees, the way forward is to go ahead
> > with the names we discussed in the last meeting. They seem to be the
> > ones that make the most sense for the upcoming use cases.
>
> We could also discuss whether we really want to support the use case
> where userspace wants to mmap but never fault, for userspace's
> protection from stray private-to-shared conversions.

I would really rather defer that work until it's needed. It seems that
we should aim to land this series as soon as possible, since it's the
one blocking much of the future work. As far as I can tell, nothing
here precludes introducing a mechanism to support the case where
userspace wants to mmap but never fault, once it's needed. This was, I
believe, what we agreed on in the sync on 2025-06-26.

Cheers,
/fuad