From: "Roy, Patrick"
To: "david@redhat.com", "seanjc@google.com"
CC: "Roy, Patrick", "tabba@google.com", "ackerleytng@google.com",
 "pbonzini@redhat.com", "kvm@vger.kernel.org",
 "linux-arm-kernel@lists.infradead.org", "kvmarm@lists.linux.dev",
 "linux-kernel@vger.kernel.org", "linux-mm@kvack.org", "rppt@kernel.org",
 "will@kernel.org", "vbabka@suse.cz", "Cali, Marco", "Kalyazin, Nikita",
 "Thomson, Jack", "Manwaring, Derek"
Subject: [PATCH v5 00/12] Direct Map Removal Support for guest_memfd
Date: Thu, 28 Aug 2025 09:39:14 +0000
Message-ID: <20250828093902.2719-1-roypat@amazon.co.uk>

[ based on kvm/next ]

Unmapping virtual machine guest memory from the host kernel's direct map is a
successful mitigation against Spectre-style transient execution issues: If the
kernel page tables do not contain entries pointing to guest memory, then any
attempted speculative read through the direct map will necessarily be blocked
by the MMU before any observable microarchitectural side-effects happen. This
means that Spectre-gadgets and similar cannot be used to target virtual machine
memory.
Roughly 60% of speculative execution issues fall into this category [1,
Table 1].

This patch series extends guest_memfd with the ability to remove its memory
from the host kernel's direct map, to be able to attain the above protection
for KVM guests backed by guest_memfd.

=== Design ===

We build on top of guest_memfd's recent support for "non-confidential VMs", in
which all of guest_memfd is mappable to userspace (i.e. considered "shared").
For such VMs, all guest page faults are routed through guest_memfd's special
page fault handler, which, because it consumes fd+offset directly, can map
direct map removed memory into the guest. KVM's internal accesses to guest
memory are handled by providing each memslot with a userspace mapping of that
memslot's guest_memfd via userspace_addr. Since KVM's internal accesses are
almost exclusively handled via copy_from_user() and friends, this allows KVM to
access direct map removed guest memory for features such as MMIO instruction
emulation on x86 or pvtime support on ARM64.

=== Implementation ===

The KVM_CREATE_GUEST_MEMFD ioctl gains a new flag
GUEST_MEMFD_FLAG_NO_DIRECT_MAP. If this flag is passed, then guest_memfd
removes direct map entries for its folios after preparation.
Upon freeing of the memory, direct map entries are restored prior to gmem's
arch-specific invalidation callback.

Support for the flag can be discovered via the KVM_CAP_GMEM_NO_DIRECT_MAP
capability, which is only available if direct map modifications at 4k
granularity are architecturally possible / when KVM can successfully map direct
map removed memory into the guest.

=== Testing ===

KVM selftests are extended to cover the above-described non-CoCo workflows,
where guest_memfd with direct map entries removed is used to back all of guest
memory, and to exercise some simple MMIO paths.

Additionally, a Firecracker branch with support for these VMs can be found on
GitHub [2].

=== Changes since v4 ===

- Rebase on top of kvm/next
- Stop using PG_private to track direct map removal state
- Fix build of KVM-as-a-module by using new EXPORT_SYMBOL_FOR_MODULES

=== FAQ ===

--- why not reuse memfd_secret() / a bespoke guest memory solution? ---

Having guest memory be direct map removed means guest page faults cannot be
resolved by GUP-ing userspace mappings of guest memory, as GUP is disabled for
direct map removed memory (as currently GUP has no way to understand that a
specific GUP request will not subsequently dereference page_address()).
guest_memfd already has a special path inside KVM that instead consumes
fd+offset, so it makes sense to reuse this. Additionally, it means that
direct-map-removed VMs can benefit from active development on guest_memfd, such
as huge pages support.

--- why do KVM internal accesses through userspace page tables? ---

For traditional VMs, all KVM internal accesses are done through the
userspace_addr stored in a memslot, meaning no changes to most KVM code are
needed just to allow access to guest_memfd backed / direct map removed guest
memory of non-confidential VMs. Previous iterations of this series tried to
avoid userspace mappings, instead attempting to dynamically restore direct map
entries for internal accesses [RFCv2], but this turned out to have a
significant performance impact, as well as additional complexity due to needing
to refcount direct map reinsertion operations and making them play nicely with
gmem truncations.

--- what doesn't work with direct map removed VMs? ---

The only thing I'm aware of is kvm-clock, since it tries to GUP guest memory
via gfn_to_pfn_cache. Realistically, this is only a problem on AMD, as on
Intel, guests can use TSC as a clocksource (Intel allows discovery of TSC
frequency via CPUID, while AMD doesn't).
AMD guests fall back onto some calibration
routine, which fails most of the time though.

[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[2]: https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
[RFCv1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFCv2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
[RFCv3]: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk/
[v4]: https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/


Elliot Berman (1):
  filemap: Pass address_space mapping to ->free_folio()

Patrick Roy (11):
  arch: export set_direct_map_valid_noflush to KVM module
  mm: introduce AS_NO_DIRECT_MAP
  KVM: guest_memfd: Add flag to remove from direct map
  KVM: Documentation: describe GUEST_MEMFD_FLAG_NO_DIRECT_MAP
  KVM: selftests: load elf via bounce buffer
  KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd
    != -1
  KVM: selftests: Add guest_memfd based vm_mem_backing_src_types
  KVM: selftests: stuff vm_mem_backing_src_type into vm_shape
  KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in mem conversion
    tests
  KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in
    guest_memfd_test.c
  KVM: selftests: Test guest execution from direct map removed gmem

 Documentation/filesystems/locking.rst         |  2 +-
 Documentation/virt/kvm/api.rst                |  5 ++
 arch/arm64/include/asm/kvm_host.h             | 12 ++++
 arch/arm64/mm/pageattr.c                      |  1 +
 arch/loongarch/mm/pageattr.c                  |  1 +
 arch/riscv/mm/pageattr.c                      |  1 +
 arch/s390/mm/pageattr.c                       |  1 +
 arch/x86/mm/pat/set_memory.c                  |  1 +
 fs/nfs/dir.c                                  | 11 ++--
 fs/orangefs/inode.c                           |  3 +-
 include/linux/fs.h                            |  2 +-
 include/linux/kvm_host.h                      |  7 +++
 include/linux/pagemap.h                       | 16 +++++
 include/linux/secretmem.h                     | 18 ------
 include/uapi/linux/kvm.h                      |  2 +
 lib/buildid.c                                 |  4 +-
 mm/filemap.c                                  |  9 +--
 mm/gup.c                                      | 14 +----
 mm/mlock.c                                    |  2 +-
 mm/secretmem.c                                |  9 +--
 mm/vmscan.c                                   |  4 +-
 .../testing/selftests/kvm/guest_memfd_test.c  |  2 +
 .../testing/selftests/kvm/include/kvm_util.h  | 37 ++++++++---
 .../testing/selftests/kvm/include/test_util.h |  8 +++
 tools/testing/selftests/kvm/lib/elf.c         |  8 +--
 tools/testing/selftests/kvm/lib/io.c          | 23 +++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 61 +++++++++++--------
 tools/testing/selftests/kvm/lib/test_util.c   |  8 +++
 tools/testing/selftests/kvm/lib/x86/sev.c     |  1 +
 .../selftests/kvm/pre_fault_memory_test.c     |  1 +
 .../selftests/kvm/set_memory_region_test.c    | 50 +++++++++++++--
 .../kvm/x86/private_mem_conversions_test.c    |  7 ++-
 virt/kvm/guest_memfd.c                        | 32 ++++++++--
 virt/kvm/kvm_main.c                           |  5 ++
 34 files changed, 264 insertions(+), 104 deletions(-)


base-commit: a6ad54137af92535cfe32e19e5f3bc1bb7dbd383
-- 
2.50.1