From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 48B14C4332F for ; Thu, 2 Nov 2023 13:53:04 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id CA18480022; Thu, 2 Nov 2023 09:53:03 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C2A0D8D000F; Thu, 2 Nov 2023 09:53:03 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A7CC180022; Thu, 2 Nov 2023 09:53:03 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 927D58D000F for ; Thu, 2 Nov 2023 09:53:03 -0400 (EDT) Received: from smtpin03.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 6D3931401E6 for ; Thu, 2 Nov 2023 13:53:03 +0000 (UTC) X-FDA: 81413155446.03.CA45487 Received: from mail-qv1-f51.google.com (mail-qv1-f51.google.com [209.85.219.51]) by imf19.hostedemail.com (Postfix) with ESMTP id 9E4181A0009 for ; Thu, 2 Nov 2023 13:53:01 +0000 (UTC) Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=Z8sfQOjJ; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf19.hostedemail.com: domain of tabba@google.com designates 209.85.219.51 as permitted sender) smtp.mailfrom=tabba@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1698933181; a=rsa-sha256; cv=none; b=s7WM/3sj5nLqQ7payozcMx7CJM4phJA5G0XXlLn0d8wY5hrLxBTyNi05puJ4IjQyTnodtC ptk6gBl5ge4F+nmvYogDL+55PMIfGTDMuZl1CD6Y5NKwMC7/RL/bVdeop6WkKDaJhkq8K7 m8Tgi8bc3o1ibatfHoujpnWB+tWjZak= ARC-Authentication-Results: i=1; imf19.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=Z8sfQOjJ; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf19.hostedemail.com: domain of tabba@google.com designates 209.85.219.51 as permitted sender) smtp.mailfrom=tabba@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1698933181; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=8MiH+OD80Lz6N2WkoQWVzzSH5wdFrVwABmyWNvMjndg=; b=2ZHayDXdLCRrg/zpRWs/8XMWgA2WRFitVYWkf35W6RtaZ3EBzRdmpviuH+Na0S+6MmqC/7 8sm3c+c1vuu0SbuJSdlPQg2QINEMwvZpC+a0zMwuUAi+fPfWDtRgInuy0QwHgZ5Lm2A3A0 iNcU0AwpKXWxb9pmFAFsbFv2s/lhYIM= Received: by mail-qv1-f51.google.com with SMTP id 6a1803df08f44-66d0169cf43so5726276d6.3 for ; Thu, 02 Nov 2023 06:53:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1698933181; x=1699537981; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=8MiH+OD80Lz6N2WkoQWVzzSH5wdFrVwABmyWNvMjndg=; b=Z8sfQOjJF9B+sp6O6JD1sSs/gi6rs1lGY3NbCs3G6pbIc9kZVUV473PuvhQtHnJil0 scOxpcclJ0N3DubWzoL094CDrx2cjrjb+cGy24xL41givXKuFgnWZtOgmJWqYfe5VNPl 51/k0Yn8lMfbOqz3mOCP/bYJL2x2qfTxvxo4Lkwz5Y6ZxT1UqSjKtQBM5HGpgLtMxAqw Eq4xH5efTyNV0szP9G7eqdoZLpkhwcuG0A4FfyJ9TTguxAbhVw6hi0U4ZzR6MwDQZI0m /zrwqjrBeW67Sc3aJ6Qxyydg9loOZs8dKxtDK9xsZw/Bb/WY8V6MAVPoCj+e87CbUnKf UwNQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1698933181; x=1699537981; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=8MiH+OD80Lz6N2WkoQWVzzSH5wdFrVwABmyWNvMjndg=; b=AMtP4caI5VOPnVsBl8Q8DWIte6sPcjdSjwPSYBgdl145GQ6zvRQ326CbWhqlh0DZWw 83Dj4d7igZkSFArfo9tLUghf5uA1LRZeANXk/ugweMNePCP6bY2cAm70FMFwfz47CGo9 fo4Gzka/R4obMAmeVgnATvq5xGzSztrYYrOab/8Gu6EZePkcDpxVT5/b2ulPZIwfJFib C8mMp+qJwgwreTN4CuWnv50w+ytDfJckjx7MhQZ9+c2H4DGYguG3q487Ac7m7B7/hApc dwrHqQQeYliEpba0Yzzt/KCgWBYiMHbsCxvhwi2ZpQGF/vu2aRLajrDowm50UJgpiRPS szNQ== X-Gm-Message-State: AOJu0YwylKyqa1ejDZq091yfrvJ6T7weiPOs9oMF3fNfP0ll46alHRm0 r7OpydcjZpE93+SvRIXk93/qN+3wilv8jbLHn5tFJA== X-Google-Smtp-Source: AGHT+IHABEQi7q2gF4GYbu+6KWyVOhp4LTc7vcuWRR/ds6omy7P2Nr93O+4ItCwG4g8ZUZhgbGUZVOPe8u3aj0mVGQs= X-Received: by 2002:ad4:5761:0:b0:672:4e8c:9aa5 with SMTP id r1-20020ad45761000000b006724e8c9aa5mr14682447qvx.47.1698933180583; Thu, 02 Nov 2023 06:53:00 -0700 (PDT) MIME-Version: 1.0 References: <20231027182217.3615211-1-seanjc@google.com> <20231027182217.3615211-17-seanjc@google.com> In-Reply-To: From: Fuad Tabba Date: Thu, 2 Nov 2023 13:52:23 +0000 Message-ID: Subject: Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory To: Sean Christopherson Cc: Paolo Bonzini , Marc Zyngier , Oliver Upton , Huacai Chen , Michael Ellerman , Anup Patel , Paul Walmsley , Palmer Dabbelt , Albert Ou , Alexander Viro , Christian Brauner , "Matthew Wilcox (Oracle)" , Andrew Morton , kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Xiaoyao Li , Xu Yilun , Chao Peng , Jarkko Sakkinen , Anish Moorthy , David Matlack , Yu Zhang , Isaku Yamahata , =?UTF-8?B?TWlja2HDq2wgU2FsYcO8bg==?= , Vlastimil Babka , Vishal Annapurve , Ackerley Tng , Maciej Szmigiero , David Hildenbrand , Quentin Perret , Michael Roth , Wang , Liam Merwick , Isaku Yamahata , "Kirill A . Shutemov" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 9E4181A0009 X-Stat-Signature: 8upzjqaqg7thdwnmqnxsjp1j76tkdn8q X-HE-Tag: 1698933181-547691 X-HE-Meta: U2FsdGVkX1+7tX58x4+2JTy8a09yPC8IGpokVSzlELkcT9BR0rSC+aU6mQXnramj/Z30srLx4hP/r60goTvPh91s37Gi5Z8AjICRPgJKR4t6EdQqxCUEeMenEuWMUpdnZ1Np/HGH5ku7wS7EfauCj9zn7ettwq1uFyFhrLF8zgdHjUwKPC1DWnq290wslYLn7/wuHJ3c/YCdjwC8LBK3F3SXOz++fundjSDziYZuPgNg4ZO6tfoxqv6sIcNdCgZYc83Yy6ERg6TnpFD4gv+emuwkt5MKYps6YUY/UuYYPYjgbfC1CIh/X1xRqZvjpkz1rPqj1GVO+KmVRQCwkTHo2UW2jwetc5Tt0M3GDAflvHbIPebiL1P72OoxZfdk1/Pn62zDb1lhRB1ZoT238lyTMpYqvXo28UoU5dJogK0phv3lqRFL1/SxI2yydvc4QegFzXYhrMZaab/3NmsFd+zufbD/FnzTu6qxdqVQ7fHDZhUm8T2i24oh1uNjCEzAfJdzqwd8Fd30quAxs9xOH6E/KYxxy2YTf5D/tQawIpxmkkgXhTbraBd2Q8VLcfCUz1Q0wc+0u0hbqR8dujkdA5eOY6d2u4mh2pzGN2Rt2kuCGOg2riBffJUzBqX05UMW/du/p2UNdma2g6zGHH2AF2QakM/bW4urQwgh6gfIJwjnX9otn1geBa6KbW/+oTJzQ6s+XkW71OAZV+fMU6HAAjdcwmlwFZiZblC9Bfak3ateyif9RMqEcbkDfeKbH1XVnycrczNprg0Sj72SM33grv4NhQLdh1f8/DzsuEQrSNSQxF/73oXF4JGVVVnS6IN0Z4cA5U6syUon60cMN2t095MDgWvXksmpjWTJ4VHDQUDV7uduYkQ+f5QSD4BqzeF4wGAfJVd2sV9+Vl5ThyPmiPOZHHC7dC7GsLmxrOTaZFrdhtxndf1VQssz/CHlHlCX+0IsyYabrmkqFwpmo530M4W xURo/IGl hFKXZSaWN9A54i/0VfOzUfILQ5jLDE9okkFbiwe8RDwibWMAC4WmX50SNMGK9EOewL2Ldoh6Q5pdmaEjwtjL2QvIvwdJOLQ9+0x2EeIEWlJFN9scA93BpeuZItj6zZwoLdCb6/7Leu2rNjlwj5ooGtIguQs+WNlgim7/t26VXHXlcRzp/DBE8pScwUxRacZpLrRwPGIXvP/4Uv4P4owdC/7l5BuZfOWRRH2e9e9MWWK+sk2kjENeoXTlrxTraqFny4OXZjTgoKY75nQzrA3jXMCw4EAQlXUa/pjuDTXaM+bLCLqlEa0DEOhkbcVCHKgZ9dMg2vuRIbhqQJr+jpP0NJ52JC1edPFVWyZofpsKJj3jfZ5TOouhyVrSUmB/bs79gUdEE5yvO1nwrhwtzuoQnJ3jwWy0oAjO/Te4w7lkHZ0RLv/h4Faje0qLKw84dth3TIRt8MjYwwep4QPk= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Nov 1, 2023 at 9:55=E2=80=AFPM Sean Christopherson wrote: > > On Wed, Nov 01, 2023, Fuad Tabba wrote: > > > > > @@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct= kvm_memory_slot *memslot) > > > > > /* This does not remove the slot from struct kvm_memslots data s= tructures */ > > > > > static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_= slot *slot) > > > > > { > > > > > + if (slot->flags & KVM_MEM_PRIVATE) > > > > > + kvm_gmem_unbind(slot); > > > > > + > > > > > > > > Should this be called after kvm_arch_free_memslot()? Arch-specific = ode > > > > might need some of the data before the unbinding, something I thoug= ht > > > > might be necessary at one point for the pKVM port when deleting a > > > > memslot, but realized later that kvm_invalidate_memslot() -> > > > > kvm_arch_guest_memory_reclaimed() was the more logical place for it= . > > > > Also, since that seems to be the pattern for arch-specific handlers= in > > > > KVM. > > > > > > Maybe? But only if we can about symmetry between the allocation and = free paths > > > I really don't think kvm_arch_free_memslot() should be doing anything= beyond a > > > "pure" free. E.g. kvm_arch_free_memslot() is also called after movin= g a memslot, > > > which hopefully we never actually have to allow for guest_memfd, but = any code in > > > kvm_arch_free_memslot() would bring about "what if" questions regardi= ng memslot > > > movement. I.e. the API is intended to be a "free arch metadata assoc= iated with > > > the memslot". > > > > > > Out of curiosity, what does pKVM need to do at kvm_arch_guest_memory_= reclaimed()? > > > > It's about the host reclaiming ownership of guest memory when tearing > > down a protected guest. In pKVM, we currently teardown the guest and > > reclaim its memory when kvm_arch_destroy_vm() is called. The problem > > with guestmem is that kvm_gmem_unbind() could get called before that > > happens, after which the host might try to access the unbound guest > > memory. Since the host hasn't reclaimed ownership of the guest memory > > from hyp, hilarity ensues (it crashes). > > > > Initially, I hooked reclaim guest memory to kvm_free_memslot(), but > > then I needed to move the unbind later in the function. I realized > > later that kvm_arch_guest_memory_reclaimed() gets called earlier (at > > the right time), and is more aptly named. > > Aha! I suspected that might be the case. > > TDX and SNP also need to solve the same problem of "reclaiming" memory be= fore it > can be safely accessed by the host. The plan is to add an arch hook (or = two?) > into guest_memfd that is invoked when memory is freed from guest_memfd. > > Hooking kvm_arch_guest_memory_reclaimed() isn't completely correct as del= eting a > memslot doesn't *guarantee* that guest memory is actually reclaimed (whic= h reminds > me, we need to figure out a better name for that thing before introducing > kvm_arch_gmem_invalidate()). I see. I'd assumed that that was what you're using. I agree that it's not completely correct, so for the moment, I assume that if that happens we have a misbehaving host, teardown the guest and reclaim its memory. > The effective false positives aren't fatal for the current usage because = the hook > is used only for x86 SEV guests to flush caches. An unnecessary flush ca= n cause > performance issues, but it doesn't affect correctness. For TDX and SNP, a= nd IIUC > pKVM, false positives are fatal because KVM could assign memory back to t= he host > that is still owned by guest_memfd. Yup. > E.g. a misbehaving userspace could prematurely delete a memslot. And the= more > fun example is intrahost migration, where the plan is to allow pointing m= ultiple > guest_memfd files at a single guest_memfd inode: > https://lore.kernel.org/all/cover.1691446946.git.ackerleytng@google.com > > There was a lot of discussion for this, but it's scattered all over the p= lace. > The TL;DR is is that the inode will represent physical memory, and a file= will > represent a given "struct kvm" instance's view of that memory. And so th= e memory > isn't reclaimed until the inode is truncated/punched. > > I _think_ this reflects the most recent plan from the guest_memfd side: > https://lore.kernel.org/all/1233d749211c08d51f9ca5d427938d47f008af1f.1689= 893403.git.isaku.yamahata@intel.com Thanks for pointing that out. I think this might be the way to go. I'll have a closer look at this and see how to get it to work with pKVM. Cheers, /fuad