From: Fuad Tabba
Date: Thu, 15 May 2025 11:27:02 +0200
Subject: Re: [PATCH v9 13/17] KVM: arm64: Handle guest_memfd()-backed guest page faults
To: James Houghton
Cc: ackerleytng@google.com, akpm@linux-foundation.org, amoorthy@google.com,
    anup@brainfault.org, aou@eecs.berkeley.edu, brauner@kernel.org,
    catalin.marinas@arm.com, chao.p.peng@linux.intel.com, chenhuacai@kernel.org,
    david@redhat.com, dmatlack@google.com, fvdl@google.com, hch@infradead.org,
    hughd@google.com, ira.weiny@intel.com, isaku.yamahata@gmail.com,
    isaku.yamahata@intel.com, james.morse@arm.com, jarkko@kernel.org,
    jgg@nvidia.com, jhubbard@nvidia.com, keirf@google.com,
    kirill.shutemov@linux.intel.com, kvm@vger.kernel.org, liam.merwick@oracle.com,
    linux-arm-msm@vger.kernel.org, linux-mm@kvack.org, mail@maciej.szmigiero.name,
    maz@kernel.org, mic@digikod.net, michael.roth@amd.com, mpe@ellerman.id.au,
    oliver.upton@linux.dev, palmer@dabbelt.com, pankaj.gupta@amd.com,
    paul.walmsley@sifive.com, pbonzini@redhat.com, peterx@redhat.com,
    qperret@google.com, quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
    quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
    quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com, rientjes@google.com,
    roypat@amazon.co.uk, seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
    suzuki.poulose@arm.com, vannapurve@google.com, vbabka@suse.cz,
    viro@zeniv.linux.org.uk, wei.w.wang@intel.com, will@kernel.org,
    willy@infradead.org, xiaoyao.li@intel.com, yilun.xu@intel.com,
    yuzenghui@huawei.com
In-Reply-To: <20250514212653.1011484-1-jthoughton@google.com>
References: <20250513163438.3942405-14-tabba@google.com> <20250514212653.1011484-1-jthoughton@google.com>
Hi James,

On Wed, 14 May 2025 at 23:26, James Houghton wrote:
>
> On Tue, May 13, 2025 at 9:35 AM Fuad Tabba wrote:
> >
> > Add arm64 support for handling guest page faults on guest_memfd
> > backed memslots.
> >
> > For now, the fault granule is restricted to PAGE_SIZE.
> >
> > Signed-off-by: Fuad Tabba
> > ---
> >  arch/arm64/kvm/mmu.c     | 94 +++++++++++++++++++++++++---------------
> >  include/linux/kvm_host.h |  5 +++
> >  virt/kvm/kvm_main.c      |  5 ---
> >  3 files changed, 64 insertions(+), 40 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index d756c2b5913f..9a48ef08491d 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1466,6 +1466,30 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
> >         return vma->vm_flags & VM_MTE_ALLOWED;
> >  }
> >
> > +static kvm_pfn_t faultin_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +                            gfn_t gfn, bool write_fault, bool *writable,
> > +                            struct page **page, bool is_gmem)
> > +{
> > +       kvm_pfn_t pfn;
> > +       int ret;
> > +
> > +       if (!is_gmem)
> > +               return __kvm_faultin_pfn(slot, gfn, write_fault ? FOLL_WRITE : 0, writable, page);
> > +
> > +       *writable = false;
> > +
> > +       ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, page, NULL);
> > +       if (!ret) {
> > +               *writable = !memslot_is_readonly(slot);
> > +               return pfn;
> > +       }
> > +
> > +       if (ret == -EHWPOISON)
> > +               return KVM_PFN_ERR_HWPOISON;
> > +
> > +       return KVM_PFN_ERR_NOSLOT_MASK;
>
> I don't think the above handling for the `ret != 0` case is correct. I think
> we should just be returning `ret` out to userspace.

Ack.

> The diff I have below is closer to what I think we must do.
>
> > +}
> > +
> >  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                           struct kvm_s2_trans *nested,
> >                           struct kvm_memory_slot *memslot, unsigned long hva,
> > @@ -1473,19 +1497,20 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  {
> >         int ret = 0;
> >         bool write_fault, writable;
> > -       bool exec_fault, mte_allowed;
> > +       bool exec_fault, mte_allowed = false;
> >         bool device = false, vfio_allow_any_uc = false;
> >         unsigned long mmu_seq;
> >         phys_addr_t ipa = fault_ipa;
> >         struct kvm *kvm = vcpu->kvm;
> > -       struct vm_area_struct *vma;
> > -       short page_shift;
> > +       struct vm_area_struct *vma = NULL;
> > +       short page_shift = PAGE_SHIFT;
> >         void *memcache;
> > -       gfn_t gfn;
> > +       gfn_t gfn = ipa >> PAGE_SHIFT;
> >         kvm_pfn_t pfn;
> >         bool logging_active = memslot_is_logging(memslot);
> > -       bool force_pte = logging_active || is_protected_kvm_enabled();
> > -       long page_size, fault_granule;
> > +       bool is_gmem = kvm_slot_has_gmem(memslot);
> > +       bool force_pte = logging_active || is_gmem || is_protected_kvm_enabled();
> > +       long page_size, fault_granule = PAGE_SIZE;
> >         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> >         struct kvm_pgtable *pgt;
> >         struct page *page;
> > @@ -1529,17 +1554,20 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >          * Let's check if we will get back a huge page backed by hugetlbfs, or
> >          * get block mapping for device MMIO region.
> >          */
> > -       mmap_read_lock(current->mm);
> > -       vma = vma_lookup(current->mm, hva);
> > -       if (unlikely(!vma)) {
> > -               kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> > -               mmap_read_unlock(current->mm);
> > -               return -EFAULT;
> > +       if (!is_gmem) {
> > +               mmap_read_lock(current->mm);
> > +               vma = vma_lookup(current->mm, hva);
> > +               if (unlikely(!vma)) {
> > +                       kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> > +                       mmap_read_unlock(current->mm);
> > +                       return -EFAULT;
> > +               }
> > +
> > +               vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> > +               mte_allowed = kvm_vma_mte_allowed(vma);
> >         }
> >
> > -       if (force_pte)
> > -               page_shift = PAGE_SHIFT;
> > -       else
> > +       if (!force_pte)
> >                 page_shift = get_vma_page_shift(vma, hva);
> >
> >         switch (page_shift) {
> > @@ -1605,27 +1633,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                 ipa &= ~(page_size - 1);
> >         }
> >
> > -       gfn = ipa >> PAGE_SHIFT;
> > -       mte_allowed = kvm_vma_mte_allowed(vma);
> > -
> > -       vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> > -
> > -       /* Don't use the VMA after the unlock -- it may have vanished */
> > -       vma = NULL;
> > +       if (!is_gmem) {
> > +               /* Don't use the VMA after the unlock -- it may have vanished */
> > +               vma = NULL;
>
> I think we can just move the vma declaration inside the earlier `if (is_gmem)`
> bit above. It should be really hard to accidentally attempt to use `vma` or
> `hva` in the is_gmem case. `vma` we can easily make it impossible; `hva` is
> harder.

To be honest, I think we need to refactor user_mem_abort(). It's already a
bit messy, and with the guest_memfd code and, hopefully soon, the pkvm code,
it's going to get messier. Some of the things to keep in mind are, like you
suggest, ensuring that vma and hva aren't in scope where they're not needed.
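As an aside, and not part of either diff in this thread: a minimal sketch of
the kind of scoping refactor meant above, assuming a hypothetical helper name
and parameter list (the real thing would have to carry whatever
user_mem_abort() ends up needing):

/*
 * Hypothetical helper, for illustration only: everything that needs
 * current->mm and the VMA is confined here, so neither vma nor hva can
 * leak into the gmem path of user_mem_abort().
 */
static int user_mem_abort_vma_info(struct kvm_vcpu *vcpu, unsigned long hva,
                                   bool force_pte, short *page_shift,
                                   bool *mte_allowed, bool *vfio_allow_any_uc,
                                   unsigned long *mmu_seq)
{
        struct vm_area_struct *vma;

        mmap_read_lock(current->mm);
        vma = vma_lookup(current->mm, hva);
        if (unlikely(!vma)) {
                mmap_read_unlock(current->mm);
                return -EFAULT;
        }

        *vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
        *mte_allowed = kvm_vma_mte_allowed(vma);
        if (!force_pte)
                *page_shift = get_vma_page_shift(vma, hva);

        /* Pairs with the smp_wmb() in kvm_mmu_invalidate_end(). */
        *mmu_seq = vcpu->kvm->mmu_invalidate_seq;
        mmap_read_unlock(current->mm);
        return 0;
}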
> > > > > - /* > > - * Read mmu_invalidate_seq so that KVM can detect if the result= s of > > - * vma_lookup() or __kvm_faultin_pfn() become stale prior to > > - * acquiring kvm->mmu_lock. > > - * > > - * Rely on mmap_read_unlock() for an implicit smp_rmb(), which = pairs > > - * with the smp_wmb() in kvm_mmu_invalidate_end(). > > - */ > > - mmu_seq =3D vcpu->kvm->mmu_invalidate_seq; > > - mmap_read_unlock(current->mm); > > + /* > > + * Read mmu_invalidate_seq so that KVM can detect if th= e results > > + * of vma_lookup() or faultin_pfn() become stale prior = to > > + * acquiring kvm->mmu_lock. > > + * > > + * Rely on mmap_read_unlock() for an implicit smp_rmb()= , which > > + * pairs with the smp_wmb() in kvm_mmu_invalidate_end()= . > > + */ > > + mmu_seq =3D vcpu->kvm->mmu_invalidate_seq; > > + mmap_read_unlock(current->mm); > > + } > > > > - pfn =3D __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRIT= E : 0, > > - &writable, &page); > > + pfn =3D faultin_pfn(kvm, memslot, gfn, write_fault, &writable, = &page, is_gmem); > > if (pfn =3D=3D KVM_PFN_ERR_HWPOISON) { > > kvm_send_hwpoison_signal(hva, page_shift); > > `hva` is used here even for the is_gmem case, and that should be slightly > concerning. And indeed it is, this is not the appropriate way to handle > hwpoison for gmem (and it is different than the behavior you have for x86= ). x86 > handles this by returning a KVM_MEMORY_FAULT_EXIT to userspace; we should= do > the same. You're right. My initial thought was that by having a best-effort check that that would be enough, and not change the arm64 behavior all that much. Exiting to userspace is cleaner. > I've put what I think is more appropriate in the diff below. > > And just to be clear, IMO, we *cannot* do what you have written now, espe= cially > given that we are getting rid of the userspace_addr sanity check (but tha= t > check was best-effort anyway). > > > return 0; > > @@ -1677,7 +1701,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, = phys_addr_t fault_ipa, > > > > kvm_fault_lock(kvm); > > pgt =3D vcpu->arch.hw_mmu->pgt; > > - if (mmu_invalidate_retry(kvm, mmu_seq)) { > > + if (!is_gmem && mmu_invalidate_retry(kvm, mmu_seq)) { > > ret =3D -EAGAIN; > > goto out_unlock; > > } > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index f9bb025327c3..b317392453a5 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -1884,6 +1884,11 @@ static inline int memslot_id(struct kvm *kvm, gf= n_t gfn) > > return gfn_to_memslot(kvm, gfn)->id; > > } > > > > +static inline bool memslot_is_readonly(const struct kvm_memory_slot *s= lot) > > +{ > > + return slot->flags & KVM_MEM_READONLY; > > +} > > I think if you're going to move this helper to include/linux/kvm_host.h, = you > might want to do so in its own patch and change all of the existing place= s > where we check KVM_MEM_READONLY directly. 
*shrug* It's a tough job, but someone's gotta do it :)

>
> > +
> >  static inline gfn_t
> >  hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
> >  {
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 6289ea1685dd..6261d8638cd2 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2640,11 +2640,6 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
> >         return size;
> >  }
> >
> > -static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> > -{
> > -       return slot->flags & KVM_MEM_READONLY;
> > -}
> > -
> >  static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
> >                                        gfn_t *nr_pages, bool write)
> >  {
> > --
> > 2.49.0.1045.g170613ef41-goog
> >
>
> Alright, here's the diff I have in mind:

Thank you James.

Cheers,
/fuad

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 9a48ef08491db..74eae19792373 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1466,28 +1466,30 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
>         return vma->vm_flags & VM_MTE_ALLOWED;
>  }
>
> -static kvm_pfn_t faultin_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> -                            gfn_t gfn, bool write_fault, bool *writable,
> -                            struct page **page, bool is_gmem)
> +static kvm_pfn_t faultin_pfn(struct kvm *kvm, struct kvm_vcpu *vcpu,
> +                            struct kvm_memory_slot *slot, gfn_t gfn,
> +                            bool exec_fault, bool write_fault, bool *writable,
> +                            struct page **page, bool is_gmem, kvm_pfn_t *pfn)
>  {
> -       kvm_pfn_t pfn;
>         int ret;
>
> -       if (!is_gmem)
> -               return __kvm_faultin_pfn(slot, gfn, write_fault ? FOLL_WRITE : 0, writable, page);
> +       if (!is_gmem) {
> +               *pfn = __kvm_faultin_pfn(slot, gfn, write_fault ? FOLL_WRITE : 0, writable, page);
> +               return 0;
> +       }
>
>         *writable = false;
>
> -       ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, page, NULL);
> -       if (!ret) {
> -               *writable = !memslot_is_readonly(slot);
> -               return pfn;
> +       ret = kvm_gmem_get_pfn(kvm, slot, gfn, pfn, page, NULL);
> +       if (ret) {
> +               kvm_prepare_memory_fault_exit(vcpu, gfn << PAGE_SHIFT,
> +                                             PAGE_SIZE, write_fault,
> +                                             exec_fault, false);
> +               return ret;
>         }
>
> -       if (ret == -EHWPOISON)
> -               return KVM_PFN_ERR_HWPOISON;
> -
> -       return KVM_PFN_ERR_NOSLOT_MASK;
> +       *writable = !memslot_is_readonly(slot);
> +       return 0;
>  }
>
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> @@ -1502,7 +1504,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>         unsigned long mmu_seq;
>         phys_addr_t ipa = fault_ipa;
>         struct kvm *kvm = vcpu->kvm;
> -       struct vm_area_struct *vma = NULL;
>         short page_shift = PAGE_SHIFT;
>         void *memcache;
>         gfn_t gfn = ipa >> PAGE_SHIFT;
> @@ -1555,6 +1556,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>          * get block mapping for device MMIO region.
>          */
>         if (!is_gmem) {
> +               struct vm_area_struct *vma = NULL;
> +
>                 mmap_read_lock(current->mm);
>                 vma = vma_lookup(current->mm, hva);
>                 if (unlikely(!vma)) {
> @@ -1565,33 +1568,44 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>
>                 vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
>                 mte_allowed = kvm_vma_mte_allowed(vma);
> -       }
>
> -       if (!force_pte)
> -               page_shift = get_vma_page_shift(vma, hva);
> +               if (!force_pte)
> +                       page_shift = get_vma_page_shift(vma, hva);
> +
> +               /*
> +                * Read mmu_invalidate_seq so that KVM can detect if the results
> +                * of vma_lookup() or faultin_pfn() become stale prior to
> +                * acquiring kvm->mmu_lock.
> +                *
> +                * Rely on mmap_read_unlock() for an implicit smp_rmb(), which
> +                * pairs with the smp_wmb() in kvm_mmu_invalidate_end().
> +                */
> +               mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> +               mmap_read_unlock(current->mm);
>
> -       switch (page_shift) {
> +               switch (page_shift) {
>  #ifndef __PAGETABLE_PMD_FOLDED
> -       case PUD_SHIFT:
> -               if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
> -                       break;
> -               fallthrough;
> +               case PUD_SHIFT:
> +                       if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
> +                               break;
> +                       fallthrough;
>  #endif
> -       case CONT_PMD_SHIFT:
> -               page_shift = PMD_SHIFT;
> -               fallthrough;
> -       case PMD_SHIFT:
> -               if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
> +               case CONT_PMD_SHIFT:
> +                       page_shift = PMD_SHIFT;
> +                       fallthrough;
> +               case PMD_SHIFT:
> +                       if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
> +                               break;
> +                       fallthrough;
> +               case CONT_PTE_SHIFT:
> +                       page_shift = PAGE_SHIFT;
> +                       force_pte = true;
> +                       fallthrough;
> +               case PAGE_SHIFT:
>                         break;
> -               fallthrough;
> -       case CONT_PTE_SHIFT:
> -               page_shift = PAGE_SHIFT;
> -               force_pte = true;
> -               fallthrough;
> -       case PAGE_SHIFT:
> -               break;
> -       default:
> -               WARN_ONCE(1, "Unknown page_shift %d", page_shift);
> +               default:
> +                       WARN_ONCE(1, "Unknown page_shift %d", page_shift);
> +               }
>         }
>
>         page_size = 1UL << page_shift;
> @@ -1633,24 +1647,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                 ipa &= ~(page_size - 1);
>         }
>
> -       if (!is_gmem) {
> -               /* Don't use the VMA after the unlock -- it may have vanished */
> -               vma = NULL;
> -
> +       ret = faultin_pfn(kvm, vcpu, memslot, gfn, exec_fault, write_fault,
> +                         &writable, &page, is_gmem, &pfn);
> +       if (ret)
> +               return ret;
> +       if (pfn == KVM_PFN_ERR_HWPOISON) {
>                 /*
> -                * Read mmu_invalidate_seq so that KVM can detect if the results
> -                * of vma_lookup() or faultin_pfn() become stale prior to
> -                * acquiring kvm->mmu_lock.
> -                *
> -                * Rely on mmap_read_unlock() for an implicit smp_rmb(), which
> -                * pairs with the smp_wmb() in kvm_mmu_invalidate_end().
> +                * For gmem, hwpoison should be communicated via a memory fault
> +                * exit, not via a SIGBUS.
>                  */
> -               mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> -               mmap_read_unlock(current->mm);
> -       }
> -
> -       pfn = faultin_pfn(kvm, memslot, gfn, write_fault, &writable, &page, is_gmem);
> -       if (pfn == KVM_PFN_ERR_HWPOISON) {
> +               WARN_ON_ONCE(is_gmem);
>                 kvm_send_hwpoison_signal(hva, page_shift);
>                 return 0;
>         }
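For reference on the memory-fault-exit discussion above: a rough sketch of the
userspace side, i.e. what a VMM might do when KVM_RUN returns with
KVM_EXIT_MEMORY_FAULT (referred to above as KVM_MEMORY_FAULT_EXIT). The
handler name and the policy comments are assumptions for illustration; the
exit reason and the kvm_run::memory_fault fields come from the KVM UAPI.

#include <linux/kvm.h>
#include <stdio.h>

/*
 * Illustrative only: react to a memory-fault exit after KVM_RUN returns.
 * A real VMM would plug this into its vcpu run loop.
 */
static int handle_memory_fault_exit(struct kvm_run *run)
{
        if (run->exit_reason != KVM_EXIT_MEMORY_FAULT)
                return 0;       /* not ours to handle */

        /* KVM reports the guest-physical range it could not fault in. */
        fprintf(stderr, "memory fault: gpa=0x%llx size=0x%llx private=%d\n",
                (unsigned long long)run->memory_fault.gpa,
                (unsigned long long)run->memory_fault.size,
                !!(run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE));

        /*
         * A real VMM would typically convert the range's attributes (e.g.
         * via KVM_SET_MEMORY_ATTRIBUTES) and retry, or treat the fault as
         * unrecoverable (e.g. hwpoison); that policy is out of scope here.
         */
        return -1;
}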