Date: Thu, 06 Feb 2025 03:28:00 +0000
From: Ackerley Tng
To: Fuad Tabba
Cc: vbabka@suse.cz, kvm@vger.kernel.org, linux-arm-msm@vger.kernel.org,
	linux-mm@kvack.org, pbonzini@redhat.com, chenhuacai@kernel.org,
	mpe@ellerman.id.au, anup@brainfault.org, paul.walmsley@sifive.com,
	palmer@dabbelt.com, aou@eecs.berkeley.edu, seanjc@google.com,
	viro@zeniv.linux.org.uk, brauner@kernel.org, willy@infradead.org,
	akpm@linux-foundation.org, xiaoyao.li@intel.com, yilun.xu@intel.com,
	chao.p.peng@linux.intel.com, jarkko@kernel.org, amoorthy@google.com,
	dmatlack@google.com, yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com,
	mic@digikod.net, vannapurve@google.com, mail@maciej.szmigiero.name,
	david@redhat.com, michael.roth@amd.com, wei.w.wang@intel.com,
	liam.merwick@oracle.com, isaku.yamahata@gmail.com,
	kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com,
	steven.price@arm.com, quic_eberman@quicinc.com, quic_mnalajal@quicinc.com,
	quic_tsoni@quicinc.com, quic_svaddagi@quicinc.com,
	quic_cvanscha@quicinc.com, quic_pderrin@quicinc.com,
	quic_pheragu@quicinc.com, catalin.marinas@arm.com, james.morse@arm.com,
	yuzenghui@huawei.com, oliver.upton@linux.dev, maz@kernel.org,
	will@kernel.org, qperret@google.com, keirf@google.com,
	roypat@amazon.co.uk, shuah@kernel.org, hch@infradead.org, jgg@nvidia.com,
	rientjes@google.com, jhubbard@nvidia.com, fvdl@google.com,
	hughd@google.com, jthoughton@google.com
Subject: Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
In-Reply-To: (message from Fuad Tabba on Thu, 23 Jan 2025 11:00:28 +0000)
Fuad Tabba writes:

> On Wed, 22 Jan 2025 at 22:24, Ackerley Tng wrote:
>>
>> Fuad Tabba writes:
>>
>> >> >
>> >> >
>> >> > +/*
>> >> > + * Registers a callback to __folio_put(), so that gmem knows that the host does
>> >> > + * not have any references to the folio. It does that by setting the folio type
>> >> > + * to guestmem.
>> >> > + *
>> >> > + * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
>> >> > + * has references, and the callback has been registered.
>> >>
>> >> Note this comment.
>> >>
>> >> > + *
>> >> > + * Must be called with the following locks held:
>> >> > + * - filemap (inode->i_mapping) invalidate_lock
>> >> > + * - folio lock
>> >> > + */
>> >> > +static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
>> >> > +{
>> >> > +	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>> >> > +	void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
>> >> > +	int refcount;
>> >> > +
>> >> > +	rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
>> >> > +	WARN_ON_ONCE(!folio_test_locked(folio));
>> >> > +
>> >> > +	if (folio_mapped(folio) || folio_test_guestmem(folio))
>> >> > +		return -EAGAIN;
>> >>
>> >> But here we return -EAGAIN and no callback was registered?
>> >
>> > This is intentional. If the folio is still mapped (i.e., its mapcount
>> > is elevated), then we cannot register the callback yet, so the
>> > host/vmm needs to unmap first, then try again. That said, I see the
>> > problem with the comment above, and I will clarify this.
>> >
>> >> > +
>> >> > +	/* Register a callback first. */
>> >> > +	__folio_set_guestmem(folio);
>> >> > +
>> >> > +	/*
>> >> > +	 * Check for references after setting the type to guestmem, to guard
>> >> > +	 * against potential races with the refcount being decremented later.
>> >> > +	 *
>> >> > +	 * At least one reference is expected because the folio is locked.
>> >> > +	 */
>> >> > +
>> >> > +	refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
>> >> > +	if (refcount == 1) {
>> >> > +		int r;
>> >> > +
>> >> > +		/* refcount isn't elevated, it's now faultable by the guest. */
>> >>
>> >> Again this seems racy, somebody could have just speculatively increased it.
>> >> Maybe we need to freeze here as well?
>> >
>> > A speculative increase here is ok I think (famous last words). The
>> > callback was registered before the check, therefore, such an increase
>> > would trigger the callback.
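The ordering Fuad relies on here — set the guestmem type *before* dropping
and checking the refcount, so that any reference dropped after the check
still fires the callback — can be modeled in plain userspace C. This is a
hypothetical sketch, not kernel code: `folio_model`,
`register_callback_model`, and `folio_put_model` are invented names, and
C11 atomics stand in for the real folio refcount machinery.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a folio: a refcount plus the "guestmem" type flag. */
struct folio_model {
	atomic_int refcount;
	atomic_bool is_guestmem;
	bool callback_ran;
};

/* Mirrors __folio_put(): dropping the last reference runs the callback,
 * but only if the type flag was already set to guestmem. */
static void folio_put_model(struct folio_model *f)
{
	if (atomic_fetch_sub(&f->refcount, 1) == 1 &&
	    atomic_load(&f->is_guestmem))
		f->callback_ran = true;	/* i.e. kvm_gmem_handle_folio_put() */
}

/* Mirrors the tail of __gmem_register_callback(): register (set the type)
 * first, then drop the caller's reference and look at what remains. */
static int register_callback_model(struct folio_model *f)
{
	/* Register the callback before checking the refcount ... */
	atomic_store(&f->is_guestmem, true);

	/* ... so a racing put after this check still sees the flag. */
	if (atomic_fetch_sub(&f->refcount, 1) == 1)
		return 0;	/* no other references: mappable right away */

	return -EAGAIN;		/* callback will fire on the final put */
}
```

With one extra host reference outstanding, registration returns -EAGAIN
and the later `folio_put_model()` completes the transition, which is the
race-free property being argued for above.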
>> >
>> > Thanks,
>> > /fuad
>> >
>> >
>>
>> I checked the callback (kvm_gmem_handle_folio_put()) and agree with you
>> that the mappability reset to KVM_GMEM_GUEST_MAPPABLE is handled
>> correctly (since kvm_gmem_handle_folio_put() doesn't assume anything
>> about the mappability state at callback-time).
>>
>> However, what if the new speculative reference writes to the page and
>> guest goes on to fault/use the page?
>
> I don't think that's a problem. At this point the page is in a
> transient state, but still shared from the guest's point of view.
> Moreover, no one can fault-in the page at the host at this point (we
> check in kvm_gmem_fault()).
>
> Let's have a look at the code:
>
> +static int __gmem_register_callback(struct folio *folio, struct inode
> *inode, pgoff_t idx)
> +{
> +	struct xarray *mappable_offsets =
> &kvm_gmem_private(inode)->mappable_offsets;
> +	void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> +	int refcount;
>
> At this point the guest still perceives the page as shared, the state
> of the page is KVM_GMEM_NONE_MAPPABLE (transient state). This means
> that kvm_gmem_fault() doesn't fault-in the page at the host anymore.
>
> +	rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
> +	WARN_ON_ONCE(!folio_test_locked(folio));
> +
> +	if (folio_mapped(folio) || folio_test_guestmem(folio))
> +		return -EAGAIN;
> +
> +	/* Register a callback first. */
> +	__folio_set_guestmem(folio);
>
> This (in addition to the state of the NONE_MAPPABLE), also ensures
> that kvm_gmem_fault() doesn't fault-in the page at the host anymore.
>
> +	/*
> +	 * Check for references after setting the type to guestmem, to guard
> +	 * against potential races with the refcount being decremented later.
> +	 *
> +	 * At least one reference is expected because the folio is locked.
> +	 */
> +
> +	refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
> +	if (refcount == 1) {
> +		int r;
>
> At this point we know that guest_memfd has the only real reference.
> Speculative references AFAIK do not access the page itself.
>
> +		/* refcount isn't elevated, it's now faultable by the guest. */
> +		r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets,
> idx, xval_guest, GFP_KERNEL)));
>
> Now it's safe so let the guest know that it can map the page.
>
> +		if (!r)
> +			__kvm_gmem_restore_pending_folio(folio);
> +
> +		return r;
> +	}
> +
> +	return -EAGAIN;
> +}
>
> Does this make sense, or did I miss something?

Thanks for explaining! I don't know enough to confirm/deny this but I
agree that if speculative references don't access the page itself, this
works.

What if over here, we just drop the refcount, and let setting
mappability to GUEST happen in the folio_put() callback?

>
> Thanks!
> /fuad
>
>>
>> >> > +		r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
>> >> > +		if (!r)
>> >> > +			__kvm_gmem_restore_pending_folio(folio);
>> >> > +
>> >> > +		return r;
>> >> > +	}
>> >> > +
>> >> > +	return -EAGAIN;
>> >> > +}
>> >> > +
>> >> > +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
>> >> > +{
>> >> > +	unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
>> >> > +	struct inode *inode = file_inode(slot->gmem.file);
>> >> > +	struct folio *folio;
>> >> > +	int r;
>> >> > +
>> >> > +	filemap_invalidate_lock(inode->i_mapping);
>> >> > +
>> >> > +	folio = filemap_lock_folio(inode->i_mapping, pgoff);
>> >> > +	if (WARN_ON_ONCE(IS_ERR(folio))) {
>> >> > +		r = PTR_ERR(folio);
>> >> > +		goto out;
>> >> > +	}
>> >> > +
>> >> > +	r = __gmem_register_callback(folio, inode, pgoff);
>> >> > +
>> >> > +	folio_unlock(folio);
>> >> > +	folio_put(folio);
>> >> > +out:
>> >> > +	filemap_invalidate_unlock(inode->i_mapping);
>> >> > +
>> >> > +	return r;
>> >> > +}
>> >> > +
>> >> > +/*
>> >> > + * Callback function for __folio_put(), i.e., called when all references by the
>> >> > + * host to the folio have been dropped. This allows gmem to transition the state
>> >> > + * of the folio to mappable by the guest, and allows the hypervisor to continue
>> >> > + * transitioning its state to private, since the host cannot attempt to access
>> >> > + * it anymore.
>> >> > + */
>> >> > +void kvm_gmem_handle_folio_put(struct folio *folio)
>> >> > +{
>> >> > +	struct xarray *mappable_offsets;
>> >> > +	struct inode *inode;
>> >> > +	pgoff_t index;
>> >> > +	void *xval;
>> >> > +
>> >> > +	inode = folio->mapping->host;
>> >> > +	index = folio->index;
>> >> > +	mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>> >> > +	xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
>> >> > +
>> >> > +	filemap_invalidate_lock(inode->i_mapping);
>> >> > +	__kvm_gmem_restore_pending_folio(folio);
>> >> > +	WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
>> >> > +	filemap_invalidate_unlock(inode->i_mapping);
>> >> > +}
>> >> > +
>> >> >  static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
>> >> >  {
>> >> >  	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>> >>
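The life cycle being debated in this thread — host enters a transient
non-mappable state, and the offset only becomes guest-mappable once the
last host reference is dropped — can be sketched as a tiny userspace state
machine. All names here (`gmem_offset`, `convert_to_guest`,
`host_folio_put`, the `MAP_*` enum) are invented for illustration; in the
patch itself the state lives in the mappable_offsets xarray, not a struct
field.

```c
#include <errno.h>

/* Hypothetical mappability states for one guest_memfd offset.
 * MAP_NONE plays the role of KVM_GMEM_NONE_MAPPABLE: a transient state
 * in which neither host faults nor guest faults are allowed. */
enum mappability { MAP_ALL, MAP_NONE, MAP_GUEST };

struct gmem_offset {
	enum mappability state;
	int host_refs;		/* outstanding host references to the folio */
};

/* Models kvm_slot_gmem_register_callback(): enter the transient state,
 * then either finish the transition immediately (no other host refs) or
 * return -EAGAIN and leave the flip to the final-put callback. */
static int convert_to_guest(struct gmem_offset *o)
{
	o->state = MAP_NONE;	/* host faults now refused */
	if (--o->host_refs == 0) {
		o->state = MAP_GUEST;	/* faultable by the guest */
		return 0;
	}
	return -EAGAIN;
}

/* Models kvm_gmem_handle_folio_put(): the last host reference is gone,
 * so a pending conversion completes and the guest may map the page. */
static void host_folio_put(struct gmem_offset *o)
{
	if (--o->host_refs == 0 && o->state == MAP_NONE)
		o->state = MAP_GUEST;
}
```

This also shows why Ackerley's suggestion (always defer the flip to the
put callback) is behavior-preserving in this model: `host_folio_put()`
performs the same `MAP_NONE` to `MAP_GUEST` transition either way.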