Date: Wed, 10 Jul 2024 10:31:35 +0300
From: Mike Rapoport <rppt@kernel.org>
To: Patrick Roy <roypat@amazon.co.uk>
Cc: seanjc@google.com, pbonzini@redhat.com, akpm@linux-foundation.org,
	dwmw@amazon.co.uk, david@redhat.com, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, willy@infradead.org, graf@amazon.com,
	derekmn@amazon.com, kalyazin@amazon.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	dmatlack@google.com, tabba@google.com, chao.p.peng@linux.intel.com,
	xmarcalx@amazon.co.uk
Subject: Re: [RFC PATCH 5/8] kvm: gmem: add option to remove guest private memory from direct map
References: <20240709132041.3625501-1-roypat@amazon.co.uk>
	<20240709132041.3625501-6-roypat@amazon.co.uk>
In-Reply-To: <20240709132041.3625501-6-roypat@amazon.co.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue, Jul 09, 2024 at 02:20:33PM +0100, Patrick Roy wrote:
> While guest_memfd is not available to be mapped by userspace, it is
> still accessible through the kernel's direct map. This means that in
> scenarios where guest-private memory is not hardware protected, it can
> be speculatively read and its contents potentially leaked through
> hardware side-channels. Removing guest-private memory from the direct
> map thus mitigates a large class of speculative execution issues
> [1, Table 1].
> 
> This patch adds a flag to `KVM_CREATE_GUEST_MEMFD` which, if set,
> removes the struct pages backing guest-private memory from the direct
> map. Should `CONFIG_HAVE_KVM_GMEM_{INVALIDATE, PREPARE}` be set, pages
> are removed after preparation and before invalidation, so that the
> prepare/invalidate routines do not have to worry about potentially
> absent direct map entries.
> 
> Direct map removal does not reuse the `KVM_GMEM_PREPARE` machinery,
> since `prepare` can be called multiple times, and it is the
> responsibility of the preparation routine not to "prepare" the same
> folio twice [2]. Thus, we instead explicitly check whether
> `filemap_grab_folio` allocated a new folio, and remove the returned
> folio from the direct map only if this was the case.
> 
> The patch uses release_folio instead of free_folio to reinsert pages
> back into the direct map, as by the time free_folio is called,
> folio->mapping can already be NULL. This means that a call to
> folio_inode inside free_folio might dereference a NULL pointer,
> leaving no way to access the inode which stores the flags that allow
> determining whether the page was removed from the direct map in the
> first place.
> 
> Lastly, the patch uses set_direct_map_{invalid,default}_noflush
> instead of `set_memory_[n]p` to avoid expensive flushes of TLBs and
> the L*-cache hierarchy.
> This is especially important once KVM restores direct map entries
> on-demand in later patches, where simple FIO benchmarks of a
> virtio-blk device have shown that TLB flushes on an Intel(R) Xeon(R)
> Platinum 8375C CPU @ 2.90GHz resulted in an 80% degradation in
> throughput compared to a non-flushing solution.
> 
> Not flushing the TLB means that until TLB entries for temporarily
> restored direct map entries get naturally evicted, they can be used
> during speculative execution, and effectively "unhide" the memory for
> longer than intended. We consider this acceptable, as the only pages
> that are temporarily reinserted into the direct map like this will
> either hold PV data structures (kvm-clock, asyncpf, etc), or pages
> containing privileged instructions inside the guest kernel image (in
> the MMIO emulation case).
> 
> [1]: https://download.vusec.net/papers/quarantine_raid23.pdf
> 
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> ---
>  include/uapi/linux/kvm.h |  2 ++
>  virt/kvm/guest_memfd.c   | 52 ++++++++++++++++++++++++++++++++++------
>  2 files changed, 47 insertions(+), 7 deletions(-)
> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e065d9fe7ab2..409116aa23c9 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1563,4 +1563,6 @@ struct kvm_create_guest_memfd {
>  	__u64 reserved[6];
>  };
>  
> +#define KVM_GMEM_NO_DIRECT_MAP		(1ULL << 0)
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 9148b9679bb1..dc9b0c2d0b0e 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -4,6 +4,7 @@
>  #include <linux/kvm_host.h>
>  #include <linux/pagemap.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/set_memory.h>
>  
>  #include "kvm_mm.h"
>  
> @@ -49,9 +50,16 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
>  	return 0;
>  }
>  
> +static bool kvm_gmem_not_present(struct inode *inode)
> +{
> +	return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) != 0;
> +}
> +
>  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
>  {
>  	struct folio *folio;
> +	bool zap_direct_map = false;
> +	int r;
>  
>  	/* TODO: Support huge pages. */
>  	folio = filemap_grab_folio(inode->i_mapping, index);
> @@ -74,16 +82,30 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
>  		for (i = 0; i < nr_pages; i++)
>  			clear_highpage(folio_page(folio, i));
>  
> +		// We need to clear the folio before calling kvm_gmem_prepare_folio,
> +		// but can only remove it from the direct map _after_ preparation is done.

No C++ comments please

> +		zap_direct_map = kvm_gmem_not_present(inode);
> +
>  		folio_mark_uptodate(folio);
>  	}
>  
>  	if (prepare) {
> -		int r = kvm_gmem_prepare_folio(inode, index, folio);
> -		if (r < 0) {
> -			folio_unlock(folio);
> -			folio_put(folio);
> -			return ERR_PTR(r);
> -		}
> +		r = kvm_gmem_prepare_folio(inode, index, folio);
> +		if (r < 0)
> +			goto out_err;
> +	}
> +
> +	if (zap_direct_map) {
> +		r = set_direct_map_invalid_noflush(&folio->page);

It's not future proof to presume that folio is a single page here. You
should loop over folio pages and add a TLB flush after the loop.

> +		if (r < 0)
> +			goto out_err;
> +
> +		// We use the private flag to track whether the folio has been removed
> +		// from the direct map. This is because inside of ->free_folio,
> +		// we do not have access to the address_space anymore, meaning we
> +		// cannot check folio_inode(folio)->i_private to determine whether
> +		// KVM_GMEM_NO_DIRECT_MAP was set.
> +		folio_set_private(folio);
> +	}
>  
>  	/*
> @@ -91,6 +113,10 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
>  	 * unevictable and there is no storage to write back to.
>  	 */
>  	return folio;
> +out_err:
> +	folio_unlock(folio);
> +	folio_put(folio);
> +	return ERR_PTR(r);
>  }
>  
>  static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> @@ -354,10 +380,22 @@ static void kvm_gmem_free_folio(struct folio *folio)
>  }
>  #endif
>  
> +static void kvm_gmem_invalidate_folio(struct folio *folio, size_t start, size_t end)
> +{
> +	if (start == 0 && end == PAGE_SIZE) {
> +		// We only get here if PG_private is set, which only happens if kvm_gmem_not_present
> +		// returned true in kvm_gmem_get_folio. Thus no need to do that check again.
> +		BUG_ON(set_direct_map_default_noflush(&folio->page));

Ditto.

> +
> +		folio_clear_private(folio);
> +	}
> +}
> +
>  static const struct address_space_operations kvm_gmem_aops = {
>  	.dirty_folio = noop_dirty_folio,
>  	.migrate_folio = kvm_gmem_migrate_folio,
>  	.error_remove_folio = kvm_gmem_error_folio,
> +	.invalidate_folio = kvm_gmem_invalidate_folio,
>  #ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
>  	.free_folio = kvm_gmem_free_folio,
>  #endif
> @@ -443,7 +481,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>  {
>  	loff_t size = args->size;
>  	u64 flags = args->flags;
> -	u64 valid_flags = 0;
> +	u64 valid_flags = KVM_GMEM_NO_DIRECT_MAP;
>  
>  	if (flags & ~valid_flags)
>  		return -EINVAL;
> -- 
> 2.45.2
> 

-- 
Sincerely yours,
Mike.
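P.S. The loop shape I have in mind for the multi-page folio case looks
roughly like this. This is a userspace sketch, not kernel code:
`zap_folio_range()`, the `present[]` array, and the `stub_*` helpers are
all hypothetical stand-ins for the real `set_direct_map_*` primitives,
and the single TLB flush is only marked by a comment:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel primitives: track "present in the direct map"
 * per pseudo-pfn in a flat array so the unwind logic can be exercised. */
#define NR_PAGES 8
static bool present[NR_PAGES];

static int stub_set_direct_map_invalid_noflush(size_t pfn)
{
	if (pfn >= NR_PAGES)
		return -1;	/* simulate a failure partway through */
	present[pfn] = false;
	return 0;
}

static void stub_set_direct_map_default_noflush(size_t pfn)
{
	present[pfn] = true;
}

/*
 * Zap every page of an nr-page folio; if one page fails, restore the
 * pages already zapped and return the error. In real kernel code a
 * single flush over the whole folio would follow the loop, instead of
 * one flush per page.
 */
static int zap_folio_range(size_t first_pfn, size_t nr)
{
	size_t i;
	int r = 0;

	for (i = 0; i < nr; i++) {
		r = stub_set_direct_map_invalid_noflush(first_pfn + i);
		if (r)
			break;
	}

	if (r) {
		while (i--)
			stub_set_direct_map_default_noflush(first_pfn + i);
		return r;
	}

	/* single TLB flush for the whole range would go here */
	return 0;
}
```

The point being that the error path leaves the direct map exactly as it
found it, and the flush cost is paid once per folio rather than once per
page.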
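P.P.S. For readers following along, the i_private trick the patch relies
on (creation flags stored in the pointer-sized field, tested with a
cast) can be shown in isolation. Again a userspace sketch: `fake_inode`
and `gmem_not_present()` are hypothetical stand-ins mirroring the
patch's kvm_gmem_not_present(), not the actual kernel structures:

```c
#include <assert.h>
#include <stdbool.h>

#define KVM_GMEM_NO_DIRECT_MAP (1ULL << 0)

/* Model only the field the patch uses: guest_memfd stores the
 * KVM_CREATE_GUEST_MEMFD flags directly in inode->i_private. */
struct fake_inode {
	void *i_private;
};

/* Mirrors kvm_gmem_not_present(): test the flag bit via a cast. */
static bool gmem_not_present(const struct fake_inode *inode)
{
	return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) != 0;
}
```

This also illustrates why ->free_folio is awkward here: once the inode
is gone there is nothing to cast, which is exactly why the patch falls
back to PG_private to remember the direct-map state per folio.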