Date: Thu, 17 Oct 2024 23:16:49 +0000
In-Reply-To: <6bca3ad4-3eca-4a75-a775-5f8b0467d7a3@amazon.co.uk> (message from Patrick Roy on Thu, 10 Oct 2024 17:21:11 +0100)
Subject: Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
From: Ackerley Tng
To: Patrick Roy
Cc: seanjc@google.com, quic_eberman@quicinc.com, tabba@google.com,
 jgg@nvidia.com, peterx@redhat.com, david@redhat.com, rientjes@google.com,
 fvdl@google.com, jthoughton@google.com, pbonzini@redhat.com,
 zhiquan1.li@intel.com, fan.du@intel.com, jun.miao@intel.com,
 isaku.yamahata@intel.com, muchun.song@linux.dev, mike.kravetz@oracle.com,
 erdemaktas@google.com, vannapurve@google.com, qperret@google.com,
 jhubbard@nvidia.com, willy@infradead.org, shuah@kernel.org,
 brauner@kernel.org, bfoster@redhat.com, kent.overstreet@linux.dev,
 pvorel@suse.cz, rppt@kernel.org, richard.weiyang@gmail.com,
 anup@brainfault.org, haibo1.xu@intel.com, ajones@ventanamicro.com,
 vkuznets@redhat.com, maciej.wieczor-retman@intel.com, pgonda@google.com,
 oliver.upton@linux.dev, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 kvm@vger.kernel.org, linux-kselftest@vger.kernel.org,
 linux-fsdevel@kvack.org, jgowans@amazon.com, kalyazin@amazon.co.uk,
 derekmn@amazon.com

Patrick Roy writes:

> On Tue, 2024-10-08 at 20:56 +0100, Sean Christopherson wrote:
>> On Tue, Oct 08, 2024, Ackerley Tng wrote:
>>> Patrick Roy writes:
>>>> For the "non-CoCo with direct map entries removed" VMs that we at AWS
>>>> are going for, we'd like a VM type with host-controlled in-place
>>>> conversions which doesn't zero on transitions,
>>
>> Hmm, your use case shouldn't need conversions _for KVM_, as there's no need for
>> KVM to care if userspace or the guest _wants_ a page to be shared vs. private.
>> Userspace is fully trusted to manage things; KVM simply reacts to the current
>> state of things.
>>
>> And more importantly, whether or not the direct map is zapped needs to be a
>> property of the guest_memfd inode, i.e. can't be associated with a struct kvm.
>> I forget who got volunteered to do the work,
>
> I think me?
> At least we talked about it briefly
>
>> but we're going to need similar
>> functionality for tracking the state of individual pages in a huge folio, as
>> folio_mark_uptodate() is too coarse-grained. I.e. at some point, I expect that
>> guest_memfd will make it easy-ish to determine whether or not the direct map has
>> been obliterated.
>>
>> The shared vs. private attributes tracking in KVM is still needed (I think), as
>> it communicates what userspace _wants_, whereas the guest_memfd machinery will
>> track what the state _is_.
>
> If I'm understanding this patch series correctly, the approach taken
> here is to force the KVM memory attributes and the internal guest_memfd
> state to be in-sync, because the VMA from mmap()ing guest_memfd is
> reflected back into the userspace_addr of the memslot.

In this patch series, we're also using guest_memfd state (faultability
xarray) to prevent any future faults before checking that there are no
mappings. Further explanation at [1].

Reason (a) at [1] is what Sean describes above to be what userspace
_wants_ vs what the state _is_.

> So, to me, in
> this world, "direct map zapped iff
> kvm_has_mem_attributes(KVM_MEMORY_ATTRIBUTES_PRIVATE)", with memory
> attribute changes forcing the corresponding gmem state change. That's
> why I was talking about conversions above.

I think if we do continue to have state in guest_memfd, then direct map
removal should be based on guest_memfd's state, rather than
KVM_MEMORY_ATTRIBUTE_PRIVATE in mem_attr_array.
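
To illustrate the ordering I mean, here is a rough userspace C model. It
is not code from this series. Every name in it (struct gmem_model,
gmem_model_fault(), gmem_model_convert_to_private()) is hypothetical,
and plain arrays stand in for the faultability xarray and
mem_attr_array. It also assumes that direct map removal simply follows
the "not faultable" state, which is only one possible policy.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 8

/* Hypothetical per-page guest_memfd state: what the state _is_. */
enum gmem_page_state { GMEM_FAULTABLE, GMEM_NOT_FAULTABLE };

struct gmem_model {
	bool attr_private[NR_PAGES];          /* what userspace _wants_ */
	enum gmem_page_state state[NR_PAGES]; /* stand-in for the faultability xarray */
	unsigned int map_count[NR_PAGES];     /* existing userspace mappings */
	bool direct_map_present[NR_PAGES];
};

void gmem_model_init(struct gmem_model *m)
{
	for (size_t i = 0; i < NR_PAGES; i++) {
		m->attr_private[i] = false;
		m->state[i] = GMEM_FAULTABLE;
		m->map_count[i] = 0;
		m->direct_map_present[i] = true;
	}
}

/* A userspace fault only succeeds while the page is faultable. */
int gmem_model_fault(struct gmem_model *m, size_t idx)
{
	if (m->state[idx] != GMEM_FAULTABLE)
		return -1;
	m->map_count[idx]++;
	return 0;
}

/* Returns 0 on success, -1 if the page is still mapped. */
int gmem_model_convert_to_private(struct gmem_model *m, size_t idx)
{
	enum gmem_page_state old = m->state[idx];

	/* Step 1: block any future faults. */
	m->state[idx] = GMEM_NOT_FAULTABLE;

	/* Step 2: only then check that there are no mappings. */
	if (m->map_count[idx] != 0) {
		m->state[idx] = old; /* roll back, conversion refused */
		return -1;
	}

	m->attr_private[idx] = true;
	/* Assumption: direct map removal keys off guest_memfd state. */
	m->direct_map_present[idx] = false;
	return 0;
}
```

In this model a page that is still mapped refuses conversion and stays
faultable, while an unmapped page becomes private, loses its direct map
entry, and rejects later faults.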
> I've played around with this locally, and since KVM seems to generally
> use copy_from_user and friends to access the userspace_addr VMA, (aka
> private mem that's reflected back into memslots here), with this things
> like MMIO emulation can be oblivious to gmem's existence, since
> copy_from_user and co don't require GUP or presence of direct map
> entries (well, "oblivious" in the sense that things like kvm_read_guest
> currently ignore memory attributes and unconditionally access
> userspace_addr, which I suppose is not really wanted for VMs where
> userspace_addr and guest_memfd aren't short-circuited like this). The
> exception is kvm_clock, where the pv_time page would need to be
> explicitly converted to shared to restore the direct map entry, although
> I think we could just let userspace deal with making sure this page is
> shared (and then, if gmem supports GUP on shared memory, even the
> gfn_to_pfn_caches could work without gmem knowledge. Without GUP, we'd
> still need a tiny hack in the uhva->pfn translation somewhere to handle
> gmem vmas, but iirc you did mention that having kvm-clock be special
> might be fine).
>
> I guess it does come down to what you note below, answering the question
> of "how does KVM internally access guest_memfd for non-CoCo VMs". Is
> there any way we can make uaccesses like above work? I've finally gotten
> around to re-running some performance benchmarks of my on-demand
> reinsertion patches with all the needed TLB flushes added, and my fio
> benchmark on a virtio-blk device suffers a ~50% throughput regression,
> which does not necessarily spark joy. And I think James H. mentioned at
> LPC that making the userfault stuff work with my patches would be quite
> hard. All this in addition to you also not necessarily sounding too keen
> on it either :D
>
>>>> so if KVM_X86_SW_PROTECTED_VM ends up zeroing, we'd need to add another new
>>>> VM type for that.
>>
>> Maybe we should sneak in a s/KVM_X86_SW_PROTECTED_VM/KVM_X86_SW_HARDENED_VM rename?
>> The original thought behind "software protected VM" was to do a slow build of
>> something akin to pKVM, but realistically I don't think that idea is going anywhere.
>
> Ah, admittedly I've thought of KVM_X86_SW_PROTECTED_VM as a bit of a
> playground where various configurations other VM types enforce can be
> mixed and matched (e.g. zero on conversions yes/no, direct map removal
> yes/no) so more of a KVM_X86_GMEM_VM, but am happy to update my
> understanding :)
>

Given the different axes of possible configurations for guest_memfd
(zero on conversion, direct map removal), I think it's better to let
userspace choose than to enumerate the combinations in VM types.

Independently of whether to use a flag or VM type to configure
guest_memfd, the "zeroed" state has to be stored somewhere.

For folios to be zeroed at least once, presence in the filemap could
indicate "zeroed".

Presence in the filemap may be awkward to use as an indication of
"zeroed" for the conversion case, since a folio stays in the filemap
across a conversion.

What else can we use to store "zeroed"? Suggestions:

1. Since "prepared" already took the dirty bit on the folio, "zeroed" can
   use the checked bit on the folio. [2] indicates that it is for
   filesystems, which sounds like guest_memfd :)
2. folio->private (which we may already need to use)
3. Another xarray

>>

[1] https://lore.kernel.org/all/diqz1q0qtqnd.fsf@ackerleytng-ctop.c.googlers.com/T/#ma6f828d7a50c4de8a2f829a16c9bb458b53d8f3f
[2] https://elixir.bootlin.com/linux/v6.11.4/source/include/linux/page-flags.h#L147
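
A rough userspace C sketch of the flag-based configuration and the
separate "zeroed" bit discussed above. This is only a model under stated
assumptions, not a proposed uAPI: the flag names
GMEM_FLAG_ZERO_ON_CONVERSION and GMEM_FLAG_NO_DIRECT_MAP, struct
gmem_file, and the helper functions are all made up for illustration.
The zeroed[] array stands in for whichever storage is chosen (a folio
flag, folio->private, or another xarray).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ  16
#define NR_PAGES 4

/* Hypothetical per-guest_memfd flags, chosen by userspace at creation. */
#define GMEM_FLAG_ZERO_ON_CONVERSION (1u << 0)
#define GMEM_FLAG_NO_DIRECT_MAP      (1u << 1)

struct gmem_file {
	unsigned int flags;
	unsigned char data[NR_PAGES][PAGE_SZ];
	bool zeroed[NR_PAGES]; /* stand-in for the stored "zeroed" state */
};

void gmem_file_init(struct gmem_file *f, unsigned int flags)
{
	f->flags = flags;
	/* Simulate stale contents left behind by a previous user. */
	memset(f->data, 0xAA, sizeof(f->data));
	memset(f->zeroed, 0, sizeof(f->zeroed));
}

/* Zero lazily: a page is cleared only if it is not marked "zeroed". */
unsigned char *gmem_get_page(struct gmem_file *f, size_t idx)
{
	if (!f->zeroed[idx]) {
		memset(f->data[idx], 0, PAGE_SZ);
		f->zeroed[idx] = true;
	}
	return f->data[idx];
}

/*
 * On conversion the page stays allocated (it would still be in the
 * filemap), so only the separate "zeroed" bit can request re-zeroing.
 */
void gmem_convert(struct gmem_file *f, size_t idx)
{
	if (f->flags & GMEM_FLAG_ZERO_ON_CONVERSION)
		f->zeroed[idx] = false;
}
```

With GMEM_FLAG_ZERO_ON_CONVERSION set, data written before
gmem_convert() reads back as zero on the next gmem_get_page(). Without
the flag, the data survives the conversion, which is the behaviour
Patrick asked for.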