From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 07A79E75459 for ; Tue, 3 Oct 2023 18:33:50 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 635288D0082; Tue, 3 Oct 2023 14:33:50 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 5E50D8D0003; Tue, 3 Oct 2023 14:33:50 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 485DE8D0082; Tue, 3 Oct 2023 14:33:50 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 38CF48D0003 for ; Tue, 3 Oct 2023 14:33:50 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id F2A088041E for ; Tue, 3 Oct 2023 18:33:49 +0000 (UTC) X-FDA: 81304998978.05.4E50785 Received: from mail-qv1-f47.google.com (mail-qv1-f47.google.com [209.85.219.47]) by imf10.hostedemail.com (Postfix) with ESMTP id 465EAC001A for ; Tue, 3 Oct 2023 18:33:48 +0000 (UTC) Authentication-Results: imf10.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=dsfznbBc; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf10.hostedemail.com: domain of tabba@google.com designates 209.85.219.47 as permitted sender) smtp.mailfrom=tabba@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1696358028; a=rsa-sha256; cv=none; b=oBFDp+/fQM4hz1ETwB2eHa9uuTS7L+90dWwakjsJHEkWWmYbyUm9VK5YNNkbRuWumVP4G4 NEnGOx/0J/NvvD4/ZVllGG0DfQrrblmYS/CYi0TYW0t+5yBFbfU2rgInoqAj6xpoisgkUo ul7TR1HJK0Btv5rIklDfw3vyIPAmnvs= ARC-Authentication-Results: i=1; imf10.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=dsfznbBc; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf10.hostedemail.com: domain of tabba@google.com designates 209.85.219.47 as permitted sender) smtp.mailfrom=tabba@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1696358028; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=R9RZg3twPdN7SJarZ/3IsJk/93o/jgnrY3ylxDmEVWY=; b=RNustMMh6CFvNTJUhYUrWgdLyfHt6mMvCYu7f1rx7kPIEQto8GjJ5Sa7d1186IIGB6MGZP MyXj6VnZ0LGgIytXnXNm+6OVdsc4sIW6goD/qJgmTrXBWq/q/Klrsr7i4cKJtJ0HB9Wfqf ev6KBMn/s1yKm7dTZKPrq2Njoajiu54= Received: by mail-qv1-f47.google.com with SMTP id 6a1803df08f44-65b0e623189so7123986d6.1 for ; Tue, 03 Oct 2023 11:33:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1696358027; x=1696962827; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=R9RZg3twPdN7SJarZ/3IsJk/93o/jgnrY3ylxDmEVWY=; b=dsfznbBc102Al+0qDlDHKRbezxSplIQEUS/XQrs5/vuTj5EBP+/prxugXNvlZ1Iytw cCBUASVtC8a91CYUSdkC6DHa1JOUs2yRJw5tM+zSWqmTYujBN46h0o8m3inmpaFE6n7/ xlpFafuGPim6pxGuVSMMrnShbUH5Zg+HUMxCI1AbyfDgyJzaR/TTqDzi8POjx4AkwNin 5UjwE7RZoxVCMH8EwhyLPh8JA7EYZkGEn2MFA5V6lJpWpmN0Xj+BRf7nVf3ZnlCU/Wy5 EDoiMQoUnml3LKVbqQ60JUBuPEE1WD/xHcUAuR3Y2h8lJL/9JFRCyeXCDuiKfvHnHj/+ 9V6Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1696358027; x=1696962827; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=R9RZg3twPdN7SJarZ/3IsJk/93o/jgnrY3ylxDmEVWY=; b=lLFhpL9Kl4niv8FiakcBOEUi40bwbkdxlmtEh/1WZ+9P0PcGtWJcGz8ZnSti1Tchfx fAoqogjLX+wRYa9JoO5HaaTEWo44cXRpmZR3xscud2Ds3jvvr5Pd8KJptVEqlQXzVXHI gkferOyHKaX2e6U2QsxBH3CSStIqGr1FPuVqov3eCphAX1iiVLuw2qsu89G7pgRqHyRQ iQUg6bJt++lAia9ZxeNyVhv9hmNRRmVS875jdYYfe1kic9Q9ACbzgOOcFuvLYRteWbXK Hct9VNFqisUBGaaRtXi2T+CRJCxZ2wiKFfJfczK7Iih76lLevZP9ak+WLYf5xC7K3HRS obfg== X-Gm-Message-State: AOJu0YyKt81JXQXUCG5cPCAPiYGNUkBwW6Eybiubsy4UxsvZ36CrRHny woCaarPKgAlY0GlGoM5MscJyqa4gl4ruuT6rjiWb6g== X-Google-Smtp-Source: AGHT+IFD9MiWpIrPl+X1KG+mF8SaIxZsGnEuN1WGAxrVYz5JoBkTLtzwKfCvY0TUsCLlDtWXvgqIVLrhKrhnAnrWPa0= X-Received: by 2002:a0c:c409:0:b0:64f:3699:90cd with SMTP id r9-20020a0cc409000000b0064f369990cdmr170157qvi.42.1696358027177; Tue, 03 Oct 2023 11:33:47 -0700 (PDT) MIME-Version: 1.0 References: <20230914015531.1419405-1-seanjc@google.com> <20230914015531.1419405-12-seanjc@google.com> In-Reply-To: From: Fuad Tabba Date: Tue, 3 Oct 2023 19:33:09 +0100 Message-ID: Subject: Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes To: Sean Christopherson Cc: Paolo Bonzini , Marc Zyngier , Oliver Upton , Huacai Chen , Michael Ellerman , Anup Patel , Paul Walmsley , Palmer Dabbelt , Albert Ou , "Matthew Wilcox (Oracle)" , Andrew Morton , Paul Moore , James Morris , "Serge E. Hallyn" , kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org, Chao Peng , Jarkko Sakkinen , Anish Moorthy , Yu Zhang , Isaku Yamahata , Xu Yilun , Vlastimil Babka , Vishal Annapurve , Ackerley Tng , Maciej Szmigiero , David Hildenbrand , Quentin Perret , Michael Roth , Wang , Liam Merwick , Isaku Yamahata , "Kirill A . Shutemov" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 465EAC001A X-Stat-Signature: acb69c6ks9tjrcs4ycintmxkwcqkc1pm X-HE-Tag: 1696358028-836553 X-HE-Meta: U2FsdGVkX19zo6edP4vbfw102owmSX1AjtPdDlOsIj2gg+s5jsB196eGeQq15Gvw2QLoXBjzcgcuYIEAhHpchcd2eFnp2aaGRQ/Cdh0BxhP3HlFxyVjVMYrbeLdMIDupkR3Kh4k3xejxojnxBGeojUx537Y0klZXnOsL73OKyWc54BALh5wm6hKuvXsqWBDYpJ2s45c1ZI+lMukG3dYh/n7gYNt8S+qtnRKGnkGTeZjZVK9q1+LkKDSQRY/h9RXq5VG/ILDo4m5NwOnjatNPdGNFsqFSN/fImZuOUQE9N3GNeTkegMIepXifxMnPvs3kkmpRSVKfJSASGioQzradR+QcJpRFpl1mTAAt0bIFv2qpVozTL8CqcWmRMaCt/l9e3IQouOyVZ81dXusReLT98hiiBf5gGqCkxn8+D558dTXdlIhHxXZOvqDfgHD4T+pj+B1XpTXGB/zBBidHnCJ/HwiW6c56+EPM5xpLg3BH+8ua77ZVC3Q4j2EnaEE23X/bsBdKMaNvhRreGFNPKeV3wPimfWwyupaa7b54BweMZLX/vnUaQnlUoG+eg65afraz38nB2BNkYZrBh5Ujmw64BNmA1+OYJUdod62mIJJ3BKw/SN3MLDy6L6Yp9Mv4VOlzmFyq0mTk4R1WVTAY4+4zJShYJJkTCqIkXsjmUIHNhMf2+EWTVSzOMEVYa33GhpXKXHfnbuIdBCt3+M4jmCKII27eM70k8zGGuXuiB08kOPE2SnDJ3cC1GZhE7cuYwPrr48Ja/cBh5HpgzAZx937gTef3tM6M8IPPrxHhDjdOk2OnRiPFTeb6CM92f3R0S0BYrCI6B/Hz4jI3Xu7raiPdmGu6OyUV0+XwfeTEtdFc8DfaqxXOZbGRDAnqNvwTaRPqg58ia8UXMeDdxiqbhb68KhDS6PuOD4vb2QvteC6eQ6c/wmhH0o63vD+ILvp2cU4ac2wgwIe616ysaE12pIr YHLiS8IR GZIY2sJiJh9r/dI2NZk89i74A3zpLT5SUBqZiyv+yn+Jiv5mRL05QxsjIA1oz4OddwZ7SH+FFTslqdz+N8igLBzkhQELjhV66nxZXiirpG2YPqxwDdubmjmQKGc1Q5BkzpJ8Qy19jrEFHt91gQ24zbWdSPN1JpoOooFCN1ZXJiJGrzeQkLVWqrnQLrsMtn3KJnFBio/DfVYmKGo/kqu7qpvWtR7rUOdpfmBFGQA5Vj1ZNbGdh2YZPK/aQrtXP/8nau1qOmlnRg3+Z3M1isawB3v1hKM7zXqEv0iUMsj1QqzdxOq3hxQpvxydaOR4PO5JkIi6o X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hi Sean, On Tue, Oct 3, 2023 at 4:59=E2=80=AFPM Sean Christopherson wrote: > > On Tue, Oct 03, 2023, Fuad Tabba wrote: > > Hi, > > > > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > > > index d2d913acf0df..f8642ff2eb9d 100644 > > > --- a/include/uapi/linux/kvm.h > > > +++ b/include/uapi/linux/kvm.h > > > @@ -1227,6 +1227,7 @@ struct kvm_ppc_resize_hpt { > > > #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228 > > > #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229 > > > #define KVM_CAP_USER_MEMORY2 230 > > > +#define KVM_CAP_MEMORY_ATTRIBUTES 231 > > > > > > #ifdef KVM_CAP_IRQ_ROUTING > > > > > > @@ -2293,4 +2294,17 @@ struct kvm_s390_zpci_op { > > > /* flags for kvm_s390_zpci_op->u.reg_aen.flags */ > > > #define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0) > > > > > > +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */ > > > +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES _IOR(KVMIO, 0xd2, __= u64) > > > +#define KVM_SET_MEMORY_ATTRIBUTES _IOW(KVMIO, 0xd3, st= ruct kvm_memory_attributes) > > > + > > > +struct kvm_memory_attributes { > > > + __u64 address; > > > + __u64 size; > > > + __u64 attributes; > > > + __u64 flags; > > > +}; > > > + > > > +#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3) > > > + > > > > In pKVM, we don't want to allow setting (or clearing) of PRIVATE/SHARED > > attributes from userspace. > > Why not? The whole thing falls apart if userspace doesn't *know* the sta= te of a > page, and the only way for userspace to know the state of a page at a giv= en moment > in time is if userspace controls the attributes. E.g. even if KVM were t= o provide > a way for userspace to query attributes, the attributes exposed to usrspa= ce would > become stale the instant KVM drops slots_lock (or whatever lock protects = the attributes) > since userspace couldn't prevent future changes. I think I might not quite understand the purpose of the KVM_SET_MEMORY_ATTRIBUTES ABI. In pKVM, all of a protected guest's memory is private by default, until the guest shares it with the host (via a hypercall), or another guest (future work). When the guest shares it, userspace is notified via KVM_EXIT_HYPERCALL. In many use cases, userspace doesn't need to keep track directly of all of this, but can reactively un/map the memory being un/shared. > Why does pKVM need to prevent userspace from stating *its* view of attrib= utes? > > If the goal is to reduce memory overhead, that can be solved by using an = internal, > non-ABI attributes flag to track pKVM's view of SHARED vs. PRIVATE. If t= he guest > attempts to access memory where pKVM and userspace don't agree on the sta= te, > generate an exit to userspace. Or kill the guest. Or do something else = entirely. For the pKVM hypervisor the guest's view of the attributes doesn't matter. The hypervisor at the end of the day is the ultimate arbiter for what is shared and with how. For pKVM (at least in my port of guestmem), we use the memory attributes from guestmem essentially to control which memory can be mapped by the host. One difference between pKVM and TDX (as I understand it), is that TDX uses the msb of the guest's IPA to indicate whether memory is shared or private, and that can generate a mismatch on guest memory access between what it thinks the state is, and what the sharing state in reality is. pKVM doesn't have that. Memory is private by default, and can be shared in-place, both in the guest's IPA space as well as the underlying physical page. > > However, we'd like to use the attributes xarray to track the sharing st= ate of > > guest pages at the host kernel. > > > > Moreover, we'd rather the default guest page state be PRIVATE, and > > only specify which pages are shared. All pKVM guest pages start off as > > private, and the majority will remain so. > > I would rather optimize kvm_vm_set_mem_attributes() to generate range-bas= ed > xarray entries, at which point it shouldn't matter all that much whether = PRIVATE > or SHARED is the default "empty" state. We opted not to do that for the = initial > merge purely to keep the code as simple as possible (which is obviously s= till not > exactly simple). > > With range-based xarray entries, the cost of tagging huge chunks of memor= y as > PRIVATE should be a non-issue. And if that's not enough for whatever rea= son, I > would rather define the polarity of PRIVATE on a per-VM basis, but only f= or internal > storage. Sounds good. > > I'm not sure if this is the best way to do this: One idea would be to m= ove > > the definition of KVM_MEMORY_ATTRIBUTE_PRIVATE to > > arch/*/include/asm/kvm_host.h, which is where kvm_arch_supported_attrib= utes() > > lives as well. This would allow different architectures to specify thei= r own > > attributes (i.e., instead we'd have a KVM_MEMORY_ATTRIBUTE_SHARED for p= KVM). > > This wouldn't help in terms of preventing userspace from clearing attri= butes > > (i.e., setting a 0 attribute) though. > > > > The other thing, which we need for pKVM anyway, is to make > > kvm_vm_set_mem_attributes() global, so that it can be called from outsi= de of > > kvm_main.c (already have a local patch for this that declares it in > > kvm_host.h), > > That's no problem, but I am definitely opposed to KVM modifying attribute= s that > are owned by userspace. > > > and not gate this function by KVM_GENERIC_MEMORY_ATTRIBUTES. > > As above, I am opposed to pKVM having a completely different ABI for mana= ging > PRIVATE vs. SHARED. I have no objection to pKVM using unclaimed flags in= the > attributes to store extra metadata, but if KVM_SET_MEMORY_ATTRIBUTES does= n't work > for pKVM, then we've failed miserably and should revist the uAPI. Like I said, pKVM doesn't need a userspace ABI for managing PRIVATE/SHARED, just a way of tracking in the host kernel of what is shared (as opposed to the hypervisor, which already has the knowledge). The solution could simply be that pKVM does not enable KVM_GENERIC_MEMORY_ATTRIBUTES, has its own tracking of the status of the guest pages, and only selects KVM_PRIVATE_MEM. Thanks! /fuad