Date: Wed, 14 May 2025 21:26:52 +0000
In-Reply-To: <20250513163438.3942405-14-tabba@google.com>
References: <20250513163438.3942405-14-tabba@google.com>
Message-ID: <20250514212653.1011484-1-jthoughton@google.com>
Subject: Re: [PATCH v9 13/17] KVM: arm64: Handle guest_memfd()-backed guest page faults
From: James Houghton <jthoughton@google.com>
To: tabba@google.com
Cc: ackerleytng@google.com, akpm@linux-foundation.org, amoorthy@google.com, anup@brainfault.org, aou@eecs.berkeley.edu, brauner@kernel.org, catalin.marinas@arm.com, chao.p.peng@linux.intel.com, chenhuacai@kernel.org, david@redhat.com, dmatlack@google.com, fvdl@google.com, hch@infradead.org, hughd@google.com, ira.weiny@intel.com, isaku.yamahata@gmail.com, isaku.yamahata@intel.com, james.morse@arm.com, jarkko@kernel.org, jgg@nvidia.com, jhubbard@nvidia.com, jthoughton@google.com, keirf@google.com, kirill.shutemov@linux.intel.com, kvm@vger.kernel.org, liam.merwick@oracle.com, linux-arm-msm@vger.kernel.org, linux-mm@kvack.org, mail@maciej.szmigiero.name, maz@kernel.org, mic@digikod.net, michael.roth@amd.com, mpe@ellerman.id.au, oliver.upton@linux.dev, palmer@dabbelt.com, pankaj.gupta@amd.com, paul.walmsley@sifive.com, pbonzini@redhat.com, peterx@redhat.com, qperret@google.com, quic_cvanscha@quicinc.com, quic_eberman@quicinc.com, quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com, quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com, rientjes@google.com, roypat@amazon.co.uk, seanjc@google.com, shuah@kernel.org, steven.price@arm.com, suzuki.poulose@arm.com, vannapurve@google.com, vbabka@suse.cz, viro@zeniv.linux.org.uk, wei.w.wang@intel.com, will@kernel.org, willy@infradead.org, xiaoyao.li@intel.com, yilun.xu@intel.com, yuzenghui@huawei.com

On Tue, May 13, 2025 at 9:35 AM Fuad Tabba <tabba@google.com> wrote:
>
> Add arm64 support for handling guest page faults on guest_memfd
> backed memslots.
>
> For now, the fault granule is restricted to PAGE_SIZE.
>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  arch/arm64/kvm/mmu.c     | 94 +++++++++++++++++++++++++---------------
>  include/linux/kvm_host.h |  5 +++
>  virt/kvm/kvm_main.c      |  5 ---
>  3 files changed, 64 insertions(+), 40 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index d756c2b5913f..9a48ef08491d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1466,6 +1466,30 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
>         return vma->vm_flags & VM_MTE_ALLOWED;
>  }
>
> +static kvm_pfn_t faultin_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> +                            gfn_t gfn, bool write_fault, bool *writable,
> +                            struct page **page, bool is_gmem)
> +{
> +       kvm_pfn_t pfn;
> +       int ret;
> +
> +       if (!is_gmem)
> +               return __kvm_faultin_pfn(slot, gfn, write_fault ? FOLL_WRITE : 0, writable, page);
> +
> +       *writable = false;
> +
> +       ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, page, NULL);
> +       if (!ret) {
> +               *writable = !memslot_is_readonly(slot);
> +               return pfn;
> +       }
> +
> +       if (ret == -EHWPOISON)
> +               return KVM_PFN_ERR_HWPOISON;
> +
> +       return KVM_PFN_ERR_NOSLOT_MASK;

I don't think the above handling for the `ret != 0` case is correct. I think
we should just be returning `ret` out to userspace. The diff I have below is
closer to what I think we must do.

> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                           struct kvm_s2_trans *nested,
>                           struct kvm_memory_slot *memslot, unsigned long hva,
> @@ -1473,19 +1497,20 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  {
>         int ret = 0;
>         bool write_fault, writable;
> -       bool exec_fault, mte_allowed;
> +       bool exec_fault, mte_allowed = false;
>         bool device = false, vfio_allow_any_uc = false;
>         unsigned long mmu_seq;
>         phys_addr_t ipa = fault_ipa;
>         struct kvm *kvm = vcpu->kvm;
> -       struct vm_area_struct *vma;
> -       short page_shift;
> +       struct vm_area_struct *vma = NULL;
> +       short page_shift = PAGE_SHIFT;
>         void *memcache;
> -       gfn_t gfn;
> +       gfn_t gfn = ipa >> PAGE_SHIFT;
>         kvm_pfn_t pfn;
>         bool logging_active = memslot_is_logging(memslot);
> -       bool force_pte = logging_active || is_protected_kvm_enabled();
> -       long page_size, fault_granule;
> +       bool is_gmem = kvm_slot_has_gmem(memslot);
> +       bool force_pte = logging_active || is_gmem || is_protected_kvm_enabled();
> +       long page_size, fault_granule = PAGE_SIZE;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>         struct kvm_pgtable *pgt;
>         struct page *page;
> @@ -1529,17 +1554,20 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>          * Let's check if we will get back a huge page backed by hugetlbfs, or
>          * get block mapping for device MMIO region.
>          */
> -       mmap_read_lock(current->mm);
> -       vma = vma_lookup(current->mm, hva);
> -       if (unlikely(!vma)) {
> -               kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> -               mmap_read_unlock(current->mm);
> -               return -EFAULT;
> +       if (!is_gmem) {
> +               mmap_read_lock(current->mm);
> +               vma = vma_lookup(current->mm, hva);
> +               if (unlikely(!vma)) {
> +                       kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> +                       mmap_read_unlock(current->mm);
> +                       return -EFAULT;
> +               }
> +
> +               vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> +               mte_allowed = kvm_vma_mte_allowed(vma);
>         }
>
> -       if (force_pte)
> -               page_shift = PAGE_SHIFT;
> -       else
> +       if (!force_pte)
>                 page_shift = get_vma_page_shift(vma, hva);
>
>         switch (page_shift) {
> @@ -1605,27 +1633,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                 ipa &= ~(page_size - 1);
>         }
>
> -       gfn = ipa >> PAGE_SHIFT;
> -       mte_allowed = kvm_vma_mte_allowed(vma);
> -
> -       vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> -
> -       /* Don't use the VMA after the unlock -- it may have vanished */
> -       vma = NULL;
> +       if (!is_gmem) {
> +               /* Don't use the VMA after the unlock -- it may have vanished */
> +               vma = NULL;

I think we can just move the vma declaration inside the earlier `if (!is_gmem)`
block above. It should be really hard to accidentally attempt to use `vma` or
`hva` in the is_gmem case. For `vma` we can easily make that impossible; `hva`
is harder. See below for what I think this should look like.

>
> -       /*
> -        * Read mmu_invalidate_seq so that KVM can detect if the results of
> -        * vma_lookup() or __kvm_faultin_pfn() become stale prior to
> -        * acquiring kvm->mmu_lock.
> -        *
> -        * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
> -        * with the smp_wmb() in kvm_mmu_invalidate_end().
> -        */
> -       mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> -       mmap_read_unlock(current->mm);
> +               /*
> +                * Read mmu_invalidate_seq so that KVM can detect if the results
> +                * of vma_lookup() or faultin_pfn() become stale prior to
> +                * acquiring kvm->mmu_lock.
> +                *
> +                * Rely on mmap_read_unlock() for an implicit smp_rmb(), which
> +                * pairs with the smp_wmb() in kvm_mmu_invalidate_end().
> +                */
> +               mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> +               mmap_read_unlock(current->mm);
> +       }
>
> -       pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
> -                               &writable, &page);
> +       pfn = faultin_pfn(kvm, memslot, gfn, write_fault, &writable, &page, is_gmem);
>         if (pfn == KVM_PFN_ERR_HWPOISON) {
>                 kvm_send_hwpoison_signal(hva, page_shift);

`hva` is used here even for the is_gmem case, and that should be slightly
concerning. And indeed it is: this is not the appropriate way to handle
hwpoison for gmem (and it is different from the behavior you have for x86).
x86 handles this by returning a KVM_EXIT_MEMORY_FAULT exit to userspace; we
should do the same. I've put what I think is more appropriate in the diff
below.

And just to be clear, IMO, we *cannot* do what you have written now, especially
given that we are getting rid of the userspace_addr sanity check (but that
check was best-effort anyway).

>                 return 0;
> @@ -1677,7 +1701,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>
>         kvm_fault_lock(kvm);
>         pgt = vcpu->arch.hw_mmu->pgt;
> -       if (mmu_invalidate_retry(kvm, mmu_seq)) {
> +       if (!is_gmem && mmu_invalidate_retry(kvm, mmu_seq)) {
>                 ret = -EAGAIN;
>                 goto out_unlock;
>         }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f9bb025327c3..b317392453a5 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1884,6 +1884,11 @@ static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
>         return gfn_to_memslot(kvm, gfn)->id;
>  }
>
> +static inline bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> +{
> +       return slot->flags & KVM_MEM_READONLY;
> +}

I think if you're going to move this helper to include/linux/kvm_host.h, you
might want to do so in its own patch and change all of the existing places
where we check KVM_MEM_READONLY directly.
*shrug*

> +
>  static inline gfn_t
>  hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
>  {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6289ea1685dd..6261d8638cd2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2640,11 +2640,6 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
>         return size;
>  }
>
> -static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> -{
> -       return slot->flags & KVM_MEM_READONLY;
> -}
> -
>  static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
>                                        gfn_t *nr_pages, bool write)
>  {
> --
> 2.49.0.1045.g170613ef41-goog
>

Alright, here's the diff I have in mind:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 9a48ef08491db..74eae19792373 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1466,28 +1466,30 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
         return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
-static kvm_pfn_t faultin_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
-                             gfn_t gfn, bool write_fault, bool *writable,
-                             struct page **page, bool is_gmem)
+static kvm_pfn_t faultin_pfn(struct kvm *kvm, struct kvm_vcpu *vcpu,
+                             struct kvm_memory_slot *slot, gfn_t gfn,
+                             bool exec_fault, bool write_fault, bool *writable,
+                             struct page **page, bool is_gmem, kvm_pfn_t *pfn)
 {
-        kvm_pfn_t pfn;
         int ret;
 
-        if (!is_gmem)
-                return __kvm_faultin_pfn(slot, gfn, write_fault ? FOLL_WRITE : 0, writable, page);
+        if (!is_gmem) {
+                *pfn = __kvm_faultin_pfn(slot, gfn, write_fault ? FOLL_WRITE : 0, writable, page);
+                return 0;
+        }
 
         *writable = false;
 
-        ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, page, NULL);
-        if (!ret) {
-                *writable = !memslot_is_readonly(slot);
-                return pfn;
+        ret = kvm_gmem_get_pfn(kvm, slot, gfn, pfn, page, NULL);
+        if (ret) {
+                kvm_prepare_memory_fault_exit(vcpu, gfn << PAGE_SHIFT,
+                                              PAGE_SIZE, write_fault,
+                                              exec_fault, false);
+                return ret;
         }
 
-        if (ret == -EHWPOISON)
-                return KVM_PFN_ERR_HWPOISON;
-
-        return KVM_PFN_ERR_NOSLOT_MASK;
+        *writable = !memslot_is_readonly(slot);
+        return 0;
 }
 
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
@@ -1502,7 +1504,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
         unsigned long mmu_seq;
         phys_addr_t ipa = fault_ipa;
         struct kvm *kvm = vcpu->kvm;
-        struct vm_area_struct *vma = NULL;
         short page_shift = PAGE_SHIFT;
         void *memcache;
         gfn_t gfn = ipa >> PAGE_SHIFT;
@@ -1555,6 +1556,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
          * get block mapping for device MMIO region.
          */
         if (!is_gmem) {
+                struct vm_area_struct *vma = NULL;
+
                 mmap_read_lock(current->mm);
                 vma = vma_lookup(current->mm, hva);
                 if (unlikely(!vma)) {
@@ -1565,33 +1568,44 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
                 vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
                 mte_allowed = kvm_vma_mte_allowed(vma);
-        }
 
-        if (!force_pte)
-                page_shift = get_vma_page_shift(vma, hva);
+                if (!force_pte)
+                        page_shift = get_vma_page_shift(vma, hva);
+
+                /*
+                 * Read mmu_invalidate_seq so that KVM can detect if the results
+                 * of vma_lookup() or faultin_pfn() become stale prior to
+                 * acquiring kvm->mmu_lock.
+                 *
+                 * Rely on mmap_read_unlock() for an implicit smp_rmb(), which
+                 * pairs with the smp_wmb() in kvm_mmu_invalidate_end().
+                 */
+                mmu_seq = vcpu->kvm->mmu_invalidate_seq;
+                mmap_read_unlock(current->mm);
 
-        switch (page_shift) {
+                switch (page_shift) {
 #ifndef __PAGETABLE_PMD_FOLDED
-        case PUD_SHIFT:
-                if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
-                        break;
-                fallthrough;
+                case PUD_SHIFT:
+                        if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
+                                break;
+                        fallthrough;
 #endif
-        case CONT_PMD_SHIFT:
-                page_shift = PMD_SHIFT;
-                fallthrough;
-        case PMD_SHIFT:
-                if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
+                case CONT_PMD_SHIFT:
+                        page_shift = PMD_SHIFT;
+                        fallthrough;
+                case PMD_SHIFT:
+                        if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
+                                break;
+                        fallthrough;
+                case CONT_PTE_SHIFT:
+                        page_shift = PAGE_SHIFT;
+                        force_pte = true;
+                        fallthrough;
+                case PAGE_SHIFT:
                         break;
-                fallthrough;
-        case CONT_PTE_SHIFT:
-                page_shift = PAGE_SHIFT;
-                force_pte = true;
-                fallthrough;
-        case PAGE_SHIFT:
-                break;
-        default:
-                WARN_ONCE(1, "Unknown page_shift %d", page_shift);
+                default:
+                        WARN_ONCE(1, "Unknown page_shift %d", page_shift);
+                }
         }
 
         page_size = 1UL << page_shift;
@@ -1633,24 +1647,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
                 ipa &= ~(page_size - 1);
         }
 
-        if (!is_gmem) {
-                /* Don't use the VMA after the unlock -- it may have vanished */
-                vma = NULL;
-
+        ret = faultin_pfn(kvm, vcpu, memslot, gfn, exec_fault, write_fault,
+                          &writable, &page, is_gmem, &pfn);
+        if (ret)
+                return ret;
+        if (pfn == KVM_PFN_ERR_HWPOISON) {
                 /*
-                 * Read mmu_invalidate_seq so that KVM can detect if the results
-                 * of vma_lookup() or faultin_pfn() become stale prior to
-                 * acquiring kvm->mmu_lock.
-                 *
-                 * Rely on mmap_read_unlock() for an implicit smp_rmb(), which
-                 * pairs with the smp_wmb() in kvm_mmu_invalidate_end().
+                 * For gmem, hwpoison should be communicated via a memory fault
+                 * exit, not via a SIGBUS.
                 */
-                mmu_seq = vcpu->kvm->mmu_invalidate_seq;
-                mmap_read_unlock(current->mm);
-        }
-
-        pfn = faultin_pfn(kvm, memslot, gfn, write_fault, &writable, &page, is_gmem);
-        if (pfn == KVM_PFN_ERR_HWPOISON) {
+                WARN_ON_ONCE(is_gmem);
                 kvm_send_hwpoison_signal(hva, page_shift);
                 return 0;
         }
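
For what it's worth, the userspace side of that exit is the standard
KVM_EXIT_MEMORY_FAULT handling. A rough sketch of what a VMM would see with
the diff above (untested; `run_vcpu_once` and the recovery policy are made up
for illustration, only the kvm_run.memory_fault UAPI itself is real):

#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

/*
 * Hypothetical VMM fragment: with the change above, a poisoned (or
 * otherwise unmappable) gmem page makes KVM_RUN fail with the vcpu's
 * kvm_run filled in as KVM_EXIT_MEMORY_FAULT, instead of KVM sending
 * a SIGBUS based on a meaningless hva.
 */
static int run_vcpu_once(int vcpu_fd, struct kvm_run *run)
{
	int ret = ioctl(vcpu_fd, KVM_RUN, 0);

	if (ret < 0 && run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		/* gpa/size describe the faulting guest range. */
		fprintf(stderr, "memory fault: gpa=0x%llx size=0x%llx flags=0x%llx\n",
			(unsigned long long)run->memory_fault.gpa,
			(unsigned long long)run->memory_fault.size,
			(unsigned long long)run->memory_fault.flags);
		/* Policy is up to the VMM: replace the page, punch a hole
		 * in the guest_memfd, or treat the fault as fatal. */
		return -1;
	}
	return ret;
}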