From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9B9C9E92738 for ; Thu, 5 Oct 2023 14:43:22 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 009AC6B02AC; Thu, 5 Oct 2023 10:43:22 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id EFB8E6B02AD; Thu, 5 Oct 2023 10:43:21 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D9B936B02AE; Thu, 5 Oct 2023 10:43:21 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id C70176B02AC for ; Thu, 5 Oct 2023 10:43:21 -0400 (EDT) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 7D79B1CA11B for ; Thu, 5 Oct 2023 14:43:21 +0000 (UTC) X-FDA: 81311675802.26.4CBB8BE Received: from mail-yb1-f177.google.com (mail-yb1-f177.google.com [209.85.219.177]) by imf02.hostedemail.com (Postfix) with ESMTP id CDAC68000A for ; Thu, 5 Oct 2023 14:43:19 +0000 (UTC) Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=ozXMQgvz; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf02.hostedemail.com: domain of tabba@google.com designates 209.85.219.177 as permitted sender) smtp.mailfrom=tabba@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1696516999; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding:in-reply-to: references:dkim-signature; bh=EE8iPYu9LpRlrA3P5svRFO6CzOzjvPUAMSGvj485Oxk=; b=V8jKs8PlbSi+3B2Al0OPDkfyZfRhNsThasmkJ7clViVndfkfMoFoBHtoE2dloBudCsS67I SITMKmsEoIj/25HC1TkRORSdJrM3J4cI5zMVCKqSKpiqnUueU1yA7BcQDYO1OABOotqSAO CmmmvQ+XklVT9d27NCNssNASL7oSfhM= ARC-Authentication-Results: i=1; imf02.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=ozXMQgvz; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf02.hostedemail.com: domain of tabba@google.com designates 209.85.219.177 as permitted sender) smtp.mailfrom=tabba@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1696516999; a=rsa-sha256; cv=none; b=0eIxEiXDZF+XKwu1P69odJol4ulzDUV3kyK9d/Vfky5d7mX4HANJX0JscwLII+U9a5IaYO VbRzy6AZKnNYehu7+l9zIDqbQIGaVxw+fM1YxbSIITBD2aREjtiyiCItTxlIxTdS8Ln5A0 QozjsGkVbhPdwwvkdfsH+ik9hEeM/qg= Received: by mail-yb1-f177.google.com with SMTP id 3f1490d57ef6-d8673a90f56so1147867276.0 for ; Thu, 05 Oct 2023 07:43:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1696516999; x=1697121799; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=EE8iPYu9LpRlrA3P5svRFO6CzOzjvPUAMSGvj485Oxk=; b=ozXMQgvzHNmnltuBJ1+vc2tiD/TJGB1KapxdoG9SnbhBY/w+ErznS4S8g9PaubMsHI kzBrXwjH122DVI2p8suG30/tssNQdnxRU2nYm0kut8+iBti4uUPbMRrYpqCEDWlORtq2 NYp8P8ZVhduQjwLjPaJoERf9kqUFTOAzvKS4WxhHOXSVl+FS0cq+6ypKom5ncSFD87Ip 4aN4fEHwPeU2Io+bxm0cIZ0K1wsWzdfeKMt9itCpw+p07Hd3YZgRNUXwpDp7MhHLOhnT CBV4qCBqAx3jCO9ip1Xk437ZZPa/1wYXW13G4gC34VQ7RtqW6/00FM+loDayrT7GtMji D+GQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1696516999; x=1697121799; h=content-transfer-encoding:cc:to:subject:message-id:date:from :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=EE8iPYu9LpRlrA3P5svRFO6CzOzjvPUAMSGvj485Oxk=; b=q+ZNGQaRMOhj2GSogjo2d/pAXKuYkuJu8fwI3cALVoutEbw7Eo2DO0kZ0e0Xodmi7r NXSoYyeD6ioE1eL12f3G3im7ScFPLorn4/n0vaE/gelyxJn22rT4imEsm2r/1SadwUmd rNCOSwOzVHzwruEr9komXPiHgI+w0aX66ZnWlcuRgJJ59Jn9pu3UKozuXtgjnvt4Vt20 3FiBjLKIH7SMGoF2Tgn7rQu6Sc5HHByzol6SCbOIt6pVgmr13Hz6hUfaFDSMt46mMBow +6X6oD5/4gpdOldrFmmv77gaTf2IPr7xkQpZfL8avuLDwMS1l/KjLJMvg3b+ug4u6lR4 MWXw== X-Gm-Message-State: AOJu0YyyXOuzGA4SxJfjhGeLRZXyjoz7Y+snjSkzPwh0ESYbo3FJf83y KA75vjpvSMjPAZqeeYMQDKXespHtJvgv04DkIrOKKg== X-Google-Smtp-Source: AGHT+IGOqTvtxfqWgNnMMFc4737tOPznDx2Ec+u1Qc6vg6Ytf9IAjo7xOI8HjQPL083wFsBnlWD0qahO/3mYB56VT9A= X-Received: by 2002:a25:23ce:0:b0:d91:5a1b:eea with SMTP id j197-20020a2523ce000000b00d915a1b0eeamr4876224ybj.50.1696516998560; Thu, 05 Oct 2023 07:43:18 -0700 (PDT) MIME-Version: 1.0 From: Fuad Tabba Date: Thu, 5 Oct 2023 15:42:42 +0100 Message-ID: Subject: Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes To: Sean Christopherson Cc: Paolo Bonzini , Marc Zyngier , Oliver Upton , Huacai Chen , Michael Ellerman , Anup Patel , Paul Walmsley , Palmer Dabbelt , Albert Ou , "Matthew Wilcox (Oracle)" , Andrew Morton , Paul Moore , James Morris , "Serge E. Hallyn" , KVM , "moderated list:ARM64 PORT (AARCH64 ARCHITECTURE)" , KVMARM , LinuxMIPS , linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-security-module@vger.kernel.org, open list , Chao Peng , Jarkko Sakkinen , Anish Moorthy , Yu Zhang , Isaku Yamahata , Xu Yilun , Vlastimil Babka , Vishal Annapurve , Ackerley Tng , Maciej Szmigiero , David Hildenbrand , Quentin Perret , Michael Roth , Wang , Liam Merwick , Isaku Yamahata , "Kirill A . Shutemov" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: CDAC68000A X-Stat-Signature: fwskfzzfhnm7sizsmgfkuzipyypuncd5 X-HE-Tag: 1696516999-407789 X-HE-Meta: U2FsdGVkX19Mhui8oZOd0sMocTppX4s+c8m5Voy+XjHjkkeIobmLRheqvLQB5ee7qg1du5cRbxUSTLSS9SBT0PNiASplU7hx0LJtsp+tBc9i+gLST6lCyzrvx5AmJRMDsHZxOJpgnKna8rinUAsC8gZotris4c3/JTIegc6DWjBNVkUt5lfEYR3RLiwH95f73Pb5bw41y2vvSqBpjb1s0VyJq4I2e1EBY0sZV7uuysxCEo7z4PSnmaBbGOfz8nOmdw1zrml2yCi/t6qYPlYLQcwytYDNBLHOEdVNZWvuj+ursBAcQImGisVq0kzsH7WQ37zOm18+8LFo3B6epxow62PabU+9mb6lO+cOfDGqKEC++Xrfi88iOV7uEXs4Q8CDhL07fsE/dbRpBFnSeRh1tWQA9gGkqclaZ+IdsBklGZ+Z1stRRS5CD3W7Zzxh4D4oJerbgtDM1VfpXTkEU3fIjcwOlvfKFCSyKLmMLgSGGo2V1Jll9AWw1iGoTr8N5v1qNTckvIuyTTu8+7Z00X7dCk8CZhMDJiNEPdyq1RBTjrPSuPK5drZgG0imaNokNb+O84rs7rAlAqnlDQUwHMUPsfOyovhGfo40HcNpgnretzTcosJq8tc6Mxqr1R8bUpRfXeU/nZLycEkQlGvRdxGpxAM4FtAL8kAogG9T0EUwjSF80XimIjl6uu9pcDqM3uL+8TwK0TvU/o5nt25TVHSiHhlKC3umSy5hxmffaB85GEVkIQK0GuS51lLi4ZmBCyPEcezkj9sD9cUIcmDMfwsKjN1UR90f+3o4LQTRD9TxCkuNxqxwCuQAg+r4fOvNDN5KRBdCHe0kUOXDh4ov/lyn2VH06K1vWyIKsnEjNhXE/fRpZwfqzetUTvfdQESytoBeuBZjCEtID3fbBFxzPXBJlE+/6LBQhnSW+DwbO2z+v5GuV81x3EyxKLeI9S4mW+OnseO9XXP2uiiiN23jCMX Kkbh8SsA 7kncpmKAipSgspclhDj2yaIlK9y3Zv43KoDFNRSHLc9OaxOtupzHZLnKTdvNHgtyvOlyyfN2EF1kSTArVSrWF1yylU9UFMPntRhpybsmTN7r/cakzH7YPqbmWHkaXBq1Ly5a/fnR0X2/PHMlcr7meoNS6s2B29v3DYhjOWO9wgQHbRPdvPkxCQ3XPyqgENBynleMRhAZjF6ihkaqitbmKnPe8EALZJh9fDfUrASlKgz8esa6dNFYXTO6eB2OTuczCrUJ+qiUN8CfN9Sa+3PVkGN0Vop5vDxJfycYjihVdCDzI7Kk9KsCKMv7uzNXZUQGKOm6oPpESeJ5I1UvMYtOWUPzWvaBdEPKucCZp X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hi Sean, On Tue, Oct 3, 2023 at 9:51=E2=80=AFPM Sean Christopherson wrote: > > On Tue, Oct 03, 2023, Fuad Tabba wrote: > > On Tue, Oct 3, 2023 at 4:59=E2=80=AFPM Sean Christopherson wrote: > > > On Tue, Oct 03, 2023, Fuad Tabba wrote: > > > > > +#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3) > > > > > + > > > > > > > > In pKVM, we don't want to allow setting (or clearing) of PRIVATE/SH= ARED > > > > attributes from userspace. > > > > > > Why not? The whole thing falls apart if userspace doesn't *know* the= state of a > > > page, and the only way for userspace to know the state of a page at a= given moment > > > in time is if userspace controls the attributes. E.g. even if KVM we= re to provide > > > a way for userspace to query attributes, the attributes exposed to us= rspace would > > > become stale the instant KVM drops slots_lock (or whatever lock prote= cts the attributes) > > > since userspace couldn't prevent future changes. > > > > I think I might not quite understand the purpose of the > > KVM_SET_MEMORY_ATTRIBUTES ABI. In pKVM, all of a protected guest's memo= ry is > > private by default, until the guest shares it with the host (via a > > hypercall), or another guest (future work). When the guest shares it, > > userspace is notified via KVM_EXIT_HYPERCALL. In many use cases, usersp= ace > > doesn't need to keep track directly of all of this, but can reactively = un/map > > the memory being un/shared. > > Yes, and then userspace needs to tell KVM, via KVM_SET_MEMORY_ATTRIBUTES,= that > userspace has agreed to change the state of the page. Userspace may not = need/want > to explicitly track the state of pages, but userspace still needs to tell= KVM what > userspace wants. > > KVM is primarily an accelerator, e.g. KVM's role is to make things go fas= t (relative > to doing things in userspace) and provide access to resources/instruction= s that > require elevated privileges. As a general rule, we try to avoid defining= the vCPU > model, security policies, etc. in KVM, because hardcoding policy into KVM= (and the > kernel as a whole) eventually limits the utility of KVM. > > As it pertains to PRIVATE vs. SHARED, KVM's role is to define and enforce= the basic > rules, but KVM shouldn't do things like define when it is (il)legal to co= nvert > memory to/from SHARED, what pages can be converted, what happens if the g= uest and > userspace disagree, etc. Thanks for clarifying that. My initial understanding of the purpose of KVM_SET_MEMORY_ATTRIBUTES wasn't clear. Now I see how having the userspace view in KVM would avoid the need to hardcore many policies, and I can see how this could come in handy in the future when we start going into multi-sharing, for example. > > > Why does pKVM need to prevent userspace from stating *its* view of at= tributes? > > > > > > If the goal is to reduce memory overhead, that can be solved by using= an internal, > > > non-ABI attributes flag to track pKVM's view of SHARED vs. PRIVATE. = If the guest > > > attempts to access memory where pKVM and userspace don't agree on the= state, > > > generate an exit to userspace. Or kill the guest. Or do something e= lse entirely. > > > > For the pKVM hypervisor the guest's view of the attributes doesn't > > matter. The hypervisor at the end of the day is the ultimate arbiter > > for what is shared and with how. For pKVM (at least in my port of > > guestmem), we use the memory attributes from guestmem essentially to > > control which memory can be mapped by the host. > > The guest's view absolutely matters. The guest's view may not be express= ed at > access time, e.g. as you note below, pKVM and other software-protected VM= s don't > have a dedicated shared vs. private bit like TDX and SNP. But the view i= s still > there, e.g. in the pKVM model, the guest expresses its desire for shared = vs. > private via hypercall, and IIRC, the guest's view is tracked by the hyper= visor > in the stage-2 PTEs. pKVM itself may track the guest's view on things, b= ut the > view is still the guest's. This was poorly worded on my part. You're right that in practice the pKVM hypervisor is the one tracking the guest's view, based on the hypercalls from the guest. > E.g. if the guest thinks a page is private, but in reality KVM and host u= serspace > have it as shared, then the guest may unintentionally leak data to the un= trusted > world. > > IIUC, you have implemented guest_memfd support in pKVM by changing the at= tributes > when the guest makes the hypercall. This can work, but only so long as t= he guest > and userspace are well-behaved, and it will likely paint pKVM into a corn= er in > the long run. > > E.g. if the guest makes a hypercall to convert memory to PRIVATE, but the= re is > no memslot or the memslot doesn't support private memory, then unless the= re is > policy baked into KVM, or an ABI for the guest<=3D>host hypercall interfa= ce that > allows unwinding the program counter, you're stuck. Returning an error f= or the > hypercall straight from KVM is undesirable as that would put policy into = KVM that > doesn't need to be there, e.g. that would prevent userspace from manipula= ting > memslots in response to (un)share requests from the guest. It's a simila= r story > if KVM marks the page as PRIVATE, as that would prevent userspace from re= turning > an error for the hypercall, i.e. would prevent usersepace from denying th= e request > to convert to PRIVATE. Ack. > > One difference between pKVM and TDX (as I understand it), is that TDX > > uses the msb of the guest's IPA to indicate whether memory is shared > > or private, and that can generate a mismatch on guest memory access > > between what it thinks the state is, and what the sharing state in > > reality is. pKVM doesn't have that. Memory is private by default, and > > can be shared in-place, both in the guest's IPA space as well as the > > underlying physical page. > > TDX's shared bit and SNP's encryption bit are just a means of hardware en= forcement. > pKVM does have a hardware bit because hardware doesn't provide any enforc= ement. > But as above, pKVM does have an equivalent *somewhere*. > > > > > The other thing, which we need for pKVM anyway, is to make > > > > kvm_vm_set_mem_attributes() global, so that it can be called from o= utside of > > > > kvm_main.c (already have a local patch for this that declares it in > > > > kvm_host.h), > > > > > > That's no problem, but I am definitely opposed to KVM modifying attri= butes that > > > are owned by userspace. > > > > > > > and not gate this function by KVM_GENERIC_MEMORY_ATTRIBUTES. > > > > > > As above, I am opposed to pKVM having a completely different ABI for = managing > > > PRIVATE vs. SHARED. I have no objection to pKVM using unclaimed flag= s in the > > > attributes to store extra metadata, but if KVM_SET_MEMORY_ATTRIBUTES = doesn't work > > > for pKVM, then we've failed miserably and should revist the uAPI. > > > > Like I said, pKVM doesn't need a userspace ABI for managing PRIVATE/SHA= RED, > > just a way of tracking in the host kernel of what is shared (as opposed= to > > the hypervisor, which already has the knowledge). The solution could si= mply > > be that pKVM does not enable KVM_GENERIC_MEMORY_ATTRIBUTES, has its own > > tracking of the status of the guest pages, and only selects KVM_PRIVATE= _MEM. > > At the risk of overstepping my bounds, I think that effectively giving th= e guest > full control over what is shared vs. private is a mistake. It more or le= ss locks > pKVM into a single model, and even within that model, dealing with errors= and/or > misbehaving guests becomes unnecessarily problematic. > > Using KVM_SET_MEMORY_ATTRIBUTES may not provide value *today*, e.g. the u= serspace > side of pKVM could simply "reflect" all conversion hypercalls, and termin= ate the > VM on errors. But the cost is very minimal, e.g. a single extra ioctl() = per > converion, and the upside is that pKVM won't be stuck if a use case comes= along > that wants to go beyond "all conversion requests either immediately succe= ed or > terminate the guest". Now that I understand the purpose of KVM_SET_MEMORY_ATTRIBUTES, I agree. However, pKVM needs to track at the host kernel (i.e., EL1) whether guest memory is shared or private. One approach would be to add another flag to the attributes that tracks the host kernel view. The way KVM_SET_MEMORY_ATTRIBUTES is implemented now, userspace can zero it, so in that case, that operation would need to be masked to avoid that. Another approach would be to have a pKVM-specific xarray (or similar) to do the tracking, but since there is a structure that's already doing something similar (i.e.,the attributes array), it seems like it would be unnecessary overhead. Do you have any ideas or preferences? Cheers, /fuad