From mboxrd@z Thu Jan  1 00:00:00 1970
From: Fuad Tabba <tabba@google.com>
Date: Thu, 23 Jan 2025 11:00:28 +0000
Subject: Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
To: Ackerley Tng
Cc: vbabka@suse.cz, kvm@vger.kernel.org, linux-arm-msm@vger.kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com, chenhuacai@kernel.org,
	mpe@ellerman.id.au, anup@brainfault.org, paul.walmsley@sifive.com,
	palmer@dabbelt.com, aou@eecs.berkeley.edu, seanjc@google.com,
	viro@zeniv.linux.org.uk, brauner@kernel.org, willy@infradead.org,
	akpm@linux-foundation.org, xiaoyao.li@intel.com, yilun.xu@intel.com,
	chao.p.peng@linux.intel.com, jarkko@kernel.org, amoorthy@google.com,
	dmatlack@google.com, yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com,
	mic@digikod.net, vannapurve@google.com, mail@maciej.szmigiero.name,
	david@redhat.com, michael.roth@amd.com, wei.w.wang@intel.com,
	liam.merwick@oracle.com, isaku.yamahata@gmail.com,
	kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com,
	steven.price@arm.com, quic_eberman@quicinc.com, quic_mnalajal@quicinc.com,
	quic_tsoni@quicinc.com, quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com,
	quic_pderrin@quicinc.com, quic_pheragu@quicinc.com, catalin.marinas@arm.com,
	james.morse@arm.com, yuzenghui@huawei.com, oliver.upton@linux.dev,
	maz@kernel.org, will@kernel.org, qperret@google.com, keirf@google.com,
	roypat@amazon.co.uk, shuah@kernel.org, hch@infradead.org, jgg@nvidia.com,
	rientjes@google.com, jhubbard@nvidia.com, fvdl@google.com, hughd@google.com,
	jthoughton@google.com
Content-ID: text/plain; charset="UTF-8"
On Wed, 22 Jan 2025 at 22:24, Ackerley Tng wrote:
>
> Fuad Tabba writes:
>
> >> >
> >> >
> >> > +/*
> >> > + * Registers a callback to __folio_put(), so that gmem knows that the host does
> >> > + * not have any references to the folio. It does that by setting the folio type
> >> > + * to guestmem.
> >> > + *
> >> > + * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
> >> > + * has references, and the callback has been registered.
> >>
> >> Note this comment.
> >>
> >> > + *
> >> > + * Must be called with the following locks held:
> >> > + * - filemap (inode->i_mapping) invalidate_lock
> >> > + * - folio lock
> >> > + */
> >> > +static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
> >> > +{
> >> > +	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >> > +	void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> >> > +	int refcount;
> >> > +
> >> > +	rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
> >> > +	WARN_ON_ONCE(!folio_test_locked(folio));
> >> > +
> >> > +	if (folio_mapped(folio) || folio_test_guestmem(folio))
> >> > +		return -EAGAIN;
> >>
> >> But here we return -EAGAIN and no callback was registered?
> >
> > This is intentional. If the folio is still mapped (i.e., its mapcount
> > is elevated), then we cannot register the callback yet, so the
> > host/vmm needs to unmap first, then try again. That said, I see the
> > problem with the comment above, and I will clarify this.
> >
> >> > +
> >> > +	/* Register a callback first. */
> >> > +	__folio_set_guestmem(folio);
> >> > +
> >> > +	/*
> >> > +	 * Check for references after setting the type to guestmem, to guard
> >> > +	 * against potential races with the refcount being decremented later.
> >> > +	 *
> >> > +	 * At least one reference is expected because the folio is locked.
> >> > +	 */
> >> > +
> >> > +	refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
> >> > +	if (refcount == 1) {
> >> > +		int r;
> >> > +
> >> > +		/* refcount isn't elevated, it's now faultable by the guest. */
> >>
> >> Again this seems racy, somebody could have just speculatively increased it.
> >> Maybe we need to freeze here as well?
> >
> > A speculative increase here is ok I think (famous last words). The
> > callback was registered before the check, therefore, such an increase
> > would trigger the callback.
> >
> > Thanks,
> > /fuad
> >
>
> I checked the callback (kvm_gmem_handle_folio_put()) and agree with you
> that the mappability reset to KVM_GMEM_GUEST_MAPPABLE is handled
> correctly (since kvm_gmem_handle_folio_put() doesn't assume anything
> about the mappability state at callback-time).
>
> However, what if the new speculative reference writes to the page and
> the guest goes on to fault/use the page?

I don't think that's a problem. At this point the page is in a
transient state, but still shared from the guest's point of view.
Moreover, no one can fault-in the page at the host at this point (we
check in kvm_gmem_fault()).
Let's have a look at the code:

+static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
+{
+	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+	void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
+	int refcount;

At this point the guest still perceives the page as shared, and the
state of the page is KVM_GMEM_NONE_MAPPABLE (a transient state). This
means that kvm_gmem_fault() doesn't fault-in the page at the host
anymore.

+	rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
+	WARN_ON_ONCE(!folio_test_locked(folio));
+
+	if (folio_mapped(folio) || folio_test_guestmem(folio))
+		return -EAGAIN;
+
+	/* Register a callback first. */
+	__folio_set_guestmem(folio);

This (in addition to the NONE_MAPPABLE state) also ensures that
kvm_gmem_fault() doesn't fault-in the page at the host anymore.

+	/*
+	 * Check for references after setting the type to guestmem, to guard
+	 * against potential races with the refcount being decremented later.
+	 *
+	 * At least one reference is expected because the folio is locked.
+	 */
+
+	refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
+	if (refcount == 1) {
+		int r;

At this point we know that guest_memfd has the only real reference.
Speculative references AFAIK do not access the page itself.

+
+		/* refcount isn't elevated, it's now faultable by the guest. */
+		r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));

Now it's safe, so let the guest know that it can map the page.

+		if (!r)
+			__kvm_gmem_restore_pending_folio(folio);
+
+		return r;
+	}
+
+	return -EAGAIN;
+}

Does this make sense, or did I miss something?

Thanks!
/fuad

> >> > +		r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
> >> > +		if (!r)
> >> > +			__kvm_gmem_restore_pending_folio(folio);
> >> > +
> >> > +		return r;
> >> > +	}
> >> > +
> >> > +	return -EAGAIN;
> >> > +}
> >> > +
> >> > +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
> >> > +{
> >> > +	unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
> >> > +	struct inode *inode = file_inode(slot->gmem.file);
> >> > +	struct folio *folio;
> >> > +	int r;
> >> > +
> >> > +	filemap_invalidate_lock(inode->i_mapping);
> >> > +
> >> > +	folio = filemap_lock_folio(inode->i_mapping, pgoff);
> >> > +	if (WARN_ON_ONCE(IS_ERR(folio))) {
> >> > +		r = PTR_ERR(folio);
> >> > +		goto out;
> >> > +	}
> >> > +
> >> > +	r = __gmem_register_callback(folio, inode, pgoff);
> >> > +
> >> > +	folio_unlock(folio);
> >> > +	folio_put(folio);
> >> > +out:
> >> > +	filemap_invalidate_unlock(inode->i_mapping);
> >> > +
> >> > +	return r;
> >> > +}
> >> > +
> >> > +/*
> >> > + * Callback function for __folio_put(), i.e., called when all references by the
> >> > + * host to the folio have been dropped. This allows gmem to transition the state
> >> > + * of the folio to mappable by the guest, and allows the hypervisor to continue
> >> > + * transitioning its state to private, since the host cannot attempt to access
> >> > + * it anymore.
> >> > + */
> >> > +void kvm_gmem_handle_folio_put(struct folio *folio)
> >> > +{
> >> > +	struct xarray *mappable_offsets;
> >> > +	struct inode *inode;
> >> > +	pgoff_t index;
> >> > +	void *xval;
> >> > +
> >> > +	inode = folio->mapping->host;
> >> > +	index = folio->index;
> >> > +	mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >> > +	xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> >> > +
> >> > +	filemap_invalidate_lock(inode->i_mapping);
> >> > +	__kvm_gmem_restore_pending_folio(folio);
> >> > +	WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
> >> > +	filemap_invalidate_unlock(inode->i_mapping);
> >> > +}
> >> > +
> >> >  static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
> >> >  {
> >> >  	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >> >