From: James Houghton
Date: Fri, 9 May 2025 20:15:28 +0000
Subject: Re: [PATCH v8 10/13] KVM: arm64: Handle guest_memfd()-backed guest page faults
To: tabba@google.com
Cc: ackerleytng@google.com, akpm@linux-foundation.org, amoorthy@google.com,
    anup@brainfault.org, aou@eecs.berkeley.edu, brauner@kernel.org,
    catalin.marinas@arm.com, chao.p.peng@linux.intel.com, chenhuacai@kernel.org,
    david@redhat.com, dmatlack@google.com, fvdl@google.com, hch@infradead.org,
    hughd@google.com, isaku.yamahata@gmail.com, isaku.yamahata@intel.com,
    james.morse@arm.com, jarkko@kernel.org, jgg@nvidia.com, jhubbard@nvidia.com,
    jthoughton@google.com, keirf@google.com, kirill.shutemov@linux.intel.com,
    kvm@vger.kernel.org, liam.merwick@oracle.com, linux-arm-msm@vger.kernel.org,
    linux-mm@kvack.org, mail@maciej.szmigiero.name, maz@kernel.org,
    mic@digikod.net, michael.roth@amd.com, mpe@ellerman.id.au,
    oliver.upton@linux.dev, palmer@dabbelt.com, pankaj.gupta@amd.com,
    paul.walmsley@sifive.com, pbonzini@redhat.com, peterx@redhat.com,
    qperret@google.com, quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
    quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
    quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com, rientjes@google.com,
    roypat@amazon.co.uk, seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
    suzuki.poulose@arm.com, vannapurve@google.com, vbabka@suse.cz,
    viro@zeniv.linux.org.uk, wei.w.wang@intel.com, will@kernel.org,
    willy@infradead.org, xiaoyao.li@intel.com, yilun.xu@intel.com,
    yuzenghui@huawei.com
Message-ID: <20250509201529.3160064-1-jthoughton@google.com>
In-Reply-To: <20250430165655.605595-11-tabba@google.com>
References: <20250430165655.605595-11-tabba@google.com>

On Wed, Apr 30, 2025 at 9:57 AM Fuad Tabba wrote:
>
> Add arm64 support for handling guest page faults on guest_memfd
> backed memslots.
>
> For now, the fault granule is restricted to PAGE_SIZE.
>
> Signed-off-by: Fuad Tabba
> ---
>  arch/arm64/kvm/mmu.c     | 65 +++++++++++++++++++++++++++-------------
>  include/linux/kvm_host.h |  5 ++++
>  virt/kvm/kvm_main.c      |  5 ----
>  3 files changed, 50 insertions(+), 25 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 148a97c129de..d1044c7f78bb 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1466,6 +1466,30 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
>         return vma->vm_flags & VM_MTE_ALLOWED;
>  }
>
> +static kvm_pfn_t faultin_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> +                            gfn_t gfn, bool write_fault, bool *writable,
> +                            struct page **page, bool is_gmem)
> +{
> +       kvm_pfn_t pfn;
> +       int ret;
> +
> +       if (!is_gmem)
> +               return __kvm_faultin_pfn(slot, gfn, write_fault ? FOLL_WRITE : 0, writable, page);
> +
> +       *writable = false;
> +
> +       ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, page, NULL);
> +       if (!ret) {
> +               *writable = !memslot_is_readonly(slot);
> +               return pfn;
> +       }
> +
> +       if (ret == -EHWPOISON)
> +               return KVM_PFN_ERR_HWPOISON;
> +
> +       return KVM_PFN_ERR_NOSLOT_MASK;
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                            struct kvm_s2_trans *nested,
>                            struct kvm_memory_slot *memslot, unsigned long hva,
> @@ -1473,19 +1497,20 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  {
>         int ret = 0;
>         bool write_fault, writable;
> -       bool exec_fault, mte_allowed;
> +       bool exec_fault, mte_allowed = false;
>         bool device = false, vfio_allow_any_uc = false;
>         unsigned long mmu_seq;
>         phys_addr_t ipa = fault_ipa;
>         struct kvm *kvm = vcpu->kvm;
> -       struct vm_area_struct *vma;
> +       struct vm_area_struct *vma = NULL;
>         short vma_shift;
>         void *memcache;
> -       gfn_t gfn;
> +       gfn_t gfn = ipa >> PAGE_SHIFT;
>         kvm_pfn_t pfn;
>         bool logging_active = memslot_is_logging(memslot);
> -       bool force_pte = logging_active || is_protected_kvm_enabled();
> -       long vma_pagesize, fault_granule;
> +       bool is_gmem = kvm_slot_has_gmem(memslot) && kvm_mem_from_gmem(kvm, gfn);
> +       bool force_pte = logging_active || is_gmem || is_protected_kvm_enabled();
> +       long vma_pagesize, fault_granule = PAGE_SIZE;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>         struct kvm_pgtable *pgt;
>         struct page *page;
> @@ -1522,16 +1547,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                         return ret;
>         }
>
> +       mmap_read_lock(current->mm);

We don't have to take the mmap_lock for gmem faults, right?

I think we should reorganize user_mem_abort() a bit (and I think
vma_pagesize and maybe vma_shift should be renamed) given the changes
we're making here. Below is a diff that I think might be a little
cleaner. Let me know what you think.

> +
>         /*
>          * Let's check if we will get back a huge page backed by hugetlbfs, or
>          * get block mapping for device MMIO region.
>          */
> -       mmap_read_lock(current->mm);
> -       vma = vma_lookup(current->mm, hva);
> -       if (unlikely(!vma)) {
> -               kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> -               mmap_read_unlock(current->mm);
> -               return -EFAULT;
> +       if (!is_gmem) {
> +               vma = vma_lookup(current->mm, hva);
> +               if (unlikely(!vma)) {
> +                       kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> +                       mmap_read_unlock(current->mm);
> +                       return -EFAULT;
> +               }
> +
> +               vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> +               mte_allowed = kvm_vma_mte_allowed(vma);
>         }
>
>         if (force_pte)
> @@ -1602,18 +1633,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                 ipa &= ~(vma_pagesize - 1);
>         }
>
> -       gfn = ipa >> PAGE_SHIFT;
> -       mte_allowed = kvm_vma_mte_allowed(vma);
> -
> -       vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> -
>         /* Don't use the VMA after the unlock -- it may have vanished */
>         vma = NULL;
>
>         /*
>          * Read mmu_invalidate_seq so that KVM can detect if the results of
> -        * vma_lookup() or __kvm_faultin_pfn() become stale prior to
> -        * acquiring kvm->mmu_lock.
> +        * vma_lookup() or faultin_pfn() become stale prior to acquiring
> +        * kvm->mmu_lock.
>          *
>          * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
>          * with the smp_wmb() in kvm_mmu_invalidate_end().
> @@ -1621,8 +1647,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>         mmu_seq = vcpu->kvm->mmu_invalidate_seq;
>         mmap_read_unlock(current->mm);
>
> -       pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
> -                               &writable, &page);
> +       pfn = faultin_pfn(kvm, memslot, gfn, write_fault, &writable, &page, is_gmem);
>         if (pfn == KVM_PFN_ERR_HWPOISON) {

I think we need to take care to handle HWPOISON properly. I know that it is
(or will most likely be) the case that GUP(hva) --> pfn, but with gmem, it
*might* not be the case. So the following line isn't right.

I think we need to handle HWPOISON for gmem using memory fault exits instead
of sending a SIGBUS to userspace. This would be consistent with how KVM/x86
today handles getting a HWPOISON page back from kvm_gmem_get_pfn(). I'm not
entirely sure how KVM/x86 is meant to handle HWPOISON on shared gmem pages yet;
I need to keep reading your series.

The reorganization diff below leaves this unfixed.
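
Just to sketch what I mean (untested, and assuming kvm_prepare_memory_fault_exit()
from <linux/kvm_host.h> is usable as-is on arm64, that returning -EFAULT is the
right way to surface the exit from user_mem_abort(), and leaving the shared/private
flagging as an open question), something along these lines:

        pfn = faultin_pfn(kvm, memslot, gfn, write_fault, &writable, &page, is_gmem);
        if (pfn == KVM_PFN_ERR_HWPOISON) {
                /*
                 * Hypothetical gmem path: report the poisoned page to
                 * userspace via a memory fault exit instead of SIGBUS,
                 * since the hva/VMA may not be meaningful here.
                 */
                if (is_gmem) {
                        kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE,
                                                      write_fault, exec_fault,
                                                      /* is_private? TBD */ false);
                        return -EFAULT;
                }
                kvm_send_hwpoison_signal(hva, vma_shift);
                return 0;
        }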
>                 kvm_send_hwpoison_signal(hva, vma_shift);
>                 return 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f3af6bff3232..1b2e4e9a7802 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1882,6 +1882,11 @@ static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
>         return gfn_to_memslot(kvm, gfn)->id;
>  }
>
> +static inline bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> +{
> +       return slot->flags & KVM_MEM_READONLY;
> +}
> +
>  static inline gfn_t
>  hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
>  {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index c75d8e188eb7..d9bca5ba19dc 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2640,11 +2640,6 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
>         return size;
>  }
>
> -static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> -{
> -       return slot->flags & KVM_MEM_READONLY;
> -}
> -
>  static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
>                                         gfn_t *nr_pages, bool write)
>  {
> --
> 2.49.0.901.g37484f566f-goog

Thanks, Fuad! Here's the reorganization/rename diff:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d1044c7f78bba..c9eb72fe9013b 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1502,7 +1502,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
         unsigned long mmu_seq;
         phys_addr_t ipa = fault_ipa;
         struct kvm *kvm = vcpu->kvm;
-        struct vm_area_struct *vma = NULL;
         short vma_shift;
         void *memcache;
         gfn_t gfn = ipa >> PAGE_SHIFT;
@@ -1510,7 +1509,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
         bool logging_active = memslot_is_logging(memslot);
         bool is_gmem = kvm_slot_has_gmem(memslot) && kvm_mem_from_gmem(kvm, gfn);
         bool force_pte = logging_active || is_gmem || is_protected_kvm_enabled();
-        long vma_pagesize, fault_granule = PAGE_SIZE;
+        long target_size = PAGE_SIZE, fault_granule = PAGE_SIZE;
         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
         struct kvm_pgtable *pgt;
         struct page *page;
@@ -1547,13 +1546,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
                 return ret;
         }
 
-        mmap_read_lock(current->mm);
-
         /*
          * Let's check if we will get back a huge page backed by hugetlbfs, or
          * get block mapping for device MMIO region.
          */
         if (!is_gmem) {
+                struct vm_area_struct *vma = NULL;
+
+                mmap_read_lock(current->mm);
+
                 vma = vma_lookup(current->mm, hva);
                 if (unlikely(!vma)) {
                         kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
@@ -1563,38 +1564,45 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
                 vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
                 mte_allowed = kvm_vma_mte_allowed(vma);
-        }
-
-        if (force_pte)
-                vma_shift = PAGE_SHIFT;
-        else
-                vma_shift = get_vma_page_shift(vma, hva);
+                vma_shift = force_pte ? PAGE_SHIFT : get_vma_page_shift(vma, hva);
 
-        switch (vma_shift) {
+                switch (vma_shift) {
 #ifndef __PAGETABLE_PMD_FOLDED
-        case PUD_SHIFT:
-                if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
-                        break;
-                fallthrough;
+                case PUD_SHIFT:
+                        if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
+                                break;
+                        fallthrough;
 #endif
-        case CONT_PMD_SHIFT:
-                vma_shift = PMD_SHIFT;
-                fallthrough;
-        case PMD_SHIFT:
-                if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
+                case CONT_PMD_SHIFT:
+                        vma_shift = PMD_SHIFT;
+                        fallthrough;
+                case PMD_SHIFT:
+                        if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
+                                break;
+                        fallthrough;
+                case CONT_PTE_SHIFT:
+                        vma_shift = PAGE_SHIFT;
+                        force_pte = true;
+                        fallthrough;
+                case PAGE_SHIFT:
                         break;
-                fallthrough;
-        case CONT_PTE_SHIFT:
-                vma_shift = PAGE_SHIFT;
-                force_pte = true;
-                fallthrough;
-        case PAGE_SHIFT:
-                break;
-        default:
-                WARN_ONCE(1, "Unknown vma_shift %d", vma_shift);
-        }
+                default:
+                        WARN_ONCE(1, "Unknown vma_shift %d", vma_shift);
+                }
 
-        vma_pagesize = 1UL << vma_shift;
+                /*
+                 * Read mmu_invalidate_seq so that KVM can detect if the results of
+                 * vma_lookup() or faultin_pfn() become stale prior to acquiring
+                 * kvm->mmu_lock.
+                 *
+                 * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
+                 * with the smp_wmb() in kvm_mmu_invalidate_end().
+                 */
+                mmu_seq = vcpu->kvm->mmu_invalidate_seq;
+                mmap_read_unlock(current->mm);
+
+                target_size = 1UL << vma_shift;
+        }
 
         if (nested) {
                 unsigned long max_map_size;
@@ -1620,7 +1628,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
                         max_map_size = PAGE_SIZE;
 
                 force_pte = (max_map_size == PAGE_SIZE);
-                vma_pagesize = min(vma_pagesize, (long)max_map_size);
+                target_size = min(target_size, (long)max_map_size);
         }
 
         /*
@@ -1628,27 +1636,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
          * ensure we find the right PFN and lay down the mapping in the right
          * place.
          */
-        if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE) {
-                fault_ipa &= ~(vma_pagesize - 1);
-                ipa &= ~(vma_pagesize - 1);
+        if (target_size == PMD_SIZE || target_size == PUD_SIZE) {
+                fault_ipa &= ~(target_size - 1);
+                ipa &= ~(target_size - 1);
         }
 
-        /* Don't use the VMA after the unlock -- it may have vanished */
-        vma = NULL;
-
-        /*
-         * Read mmu_invalidate_seq so that KVM can detect if the results of
-         * vma_lookup() or faultin_pfn() become stale prior to acquiring
-         * kvm->mmu_lock.
-         *
-         * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
-         * with the smp_wmb() in kvm_mmu_invalidate_end().
-         */
-        mmu_seq = vcpu->kvm->mmu_invalidate_seq;
-        mmap_read_unlock(current->mm);
-
         pfn = faultin_pfn(kvm, memslot, gfn, write_fault, &writable, &page, is_gmem);
         if (pfn == KVM_PFN_ERR_HWPOISON) {
+                // TODO: Handle gmem properly. vma_shift
+                // intentionally left uninitialized.
                 kvm_send_hwpoison_signal(hva, vma_shift);
                 return 0;
         }
@@ -1658,9 +1654,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
         if (kvm_is_device_pfn(pfn)) {
                 /*
                  * If the page was identified as device early by looking at
-                 * the VMA flags, vma_pagesize is already representing the
+                 * the VMA flags, target_size is already representing the
                  * largest quantity we can map.  If instead it was mapped
-                 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
+                 * via __kvm_faultin_pfn(), target_size is set to PAGE_SIZE
                  * and must not be upgraded.
                  *
                  * In both cases, we don't let transparent_hugepage_adjust()
@@ -1699,7 +1695,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
         kvm_fault_lock(kvm);
         pgt = vcpu->arch.hw_mmu->pgt;
-        if (mmu_invalidate_retry(kvm, mmu_seq)) {
+        if (!is_gmem && mmu_invalidate_retry(kvm, mmu_seq)) {
                 ret = -EAGAIN;
                 goto out_unlock;
         }
@@ -1708,16 +1704,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
          * If we are not forced to use page mapping, check if we are
          * backed by a THP and thus use block mapping if possible.
          */
-        if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
+        if (target_size == PAGE_SIZE && !(force_pte || device)) {
                 if (fault_is_perm && fault_granule > PAGE_SIZE)
-                        vma_pagesize = fault_granule;
-                else
-                        vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
+                        target_size = fault_granule;
+                else if (!is_gmem)
+                        target_size = transparent_hugepage_adjust(kvm, memslot,
                                                                    hva, &pfn, &fault_ipa);
 
-                if (vma_pagesize < 0) {
-                        ret = vma_pagesize;
+                if (target_size < 0) {
+                        ret = target_size;
                         goto out_unlock;
                 }
         }
@@ -1725,7 +1721,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
         if (!fault_is_perm && !device && kvm_has_mte(kvm)) {
                 /* Check the VMM hasn't introduced a new disallowed VMA */
                 if (mte_allowed) {
-                        sanitise_mte_tags(kvm, pfn, vma_pagesize);
+                        sanitise_mte_tags(kvm, pfn, target_size);
                 } else {
                         ret = -EFAULT;
                         goto out_unlock;
@@ -1750,10 +1746,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
         /*
          * Under the premise of getting a FSC_PERM fault, we just need to relax
-         * permissions only if vma_pagesize equals fault_granule. Otherwise,
+         * permissions only if target_size equals fault_granule. Otherwise,
          * kvm_pgtable_stage2_map() should be called to change block size.
          */
-        if (fault_is_perm && vma_pagesize == fault_granule) {
+        if (fault_is_perm && target_size == fault_granule) {
                 /*
                  * Drop the SW bits in favour of those stored in the
                  * PTE, which will be preserved.
@@ -1761,7 +1757,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
                 prot &= ~KVM_NV_GUEST_MAP_SZ;
                 ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault_ipa, prot, flags);
         } else {
-                ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, vma_pagesize,
+                ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, target_size,
                                                          __pfn_to_phys(pfn), prot, memcache, flags);
         }