From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 11 Jun 2025 15:10:43 -0700
In-Reply-To: <20250529054227.hh2f4jmyqf6igd3i@amd.com> (message from Michael
 Roth on Thu, 29 May 2025 00:42:27 -0500)
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use
 shareability to guard faulting
From: Ackerley Tng
To: Michael Roth
Cc: kvm@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 x86@kernel.org, linux-fsdevel@vger.kernel.org, aik@amd.com,
 ajones@ventanamicro.com, akpm@linux-foundation.org, amoorthy@google.com,
 anthony.yznaga@oracle.com, anup@brainfault.org, aou@eecs.berkeley.edu,
 bfoster@redhat.com, binbin.wu@linux.intel.com, brauner@kernel.org,
 catalin.marinas@arm.com, chao.p.peng@intel.com, chenhuacai@kernel.org,
 dave.hansen@intel.com, david@redhat.com, dmatlack@google.com,
 dwmw@amazon.co.uk, erdemaktas@google.com, fan.du@intel.com, fvdl@google.com,
 graf@amazon.com, haibo1.xu@intel.com, hch@infradead.org, hughd@google.com,
 ira.weiny@intel.com, isaku.yamahata@intel.com, jack@suse.cz,
 james.morse@arm.com, jarkko@kernel.org, jgg@ziepe.ca, jgowans@amazon.com,
 jhubbard@nvidia.com, jroedel@suse.de, jthoughton@google.com,
 jun.miao@intel.com, kai.huang@intel.com, keirf@google.com,
 kent.overstreet@linux.dev, kirill.shutemov@intel.com,
 liam.merwick@oracle.com, maciej.wieczor-retman@intel.com,
 mail@maciej.szmigiero.name, maz@kernel.org, mic@digikod.net,
 mpe@ellerman.id.au, muchun.song@linux.dev, nikunj@amd.com, nsaenz@amazon.es,
 oliver.upton@linux.dev, palmer@dabbelt.com, pankaj.gupta@amd.com,
 paul.walmsley@sifive.com, pbonzini@redhat.com, pdurrant@amazon.co.uk,
 peterx@redhat.com, pgonda@google.com, pvorel@suse.cz, qperret@google.com,
 quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
 quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
 quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com,
 richard.weiyang@gmail.com, rick.p.edgecombe@intel.com, rientjes@google.com,
 roypat@amazon.co.uk, rppt@kernel.org, seanjc@google.com, shuah@kernel.org,
 steven.price@arm.com, steven.sistare@oracle.com, suzuki.poulose@arm.com,
 tabba@google.com, thomas.lendacky@amd.com, usama.arif@bytedance.com,
 vannapurve@google.com, vbabka@suse.cz, viro@zeniv.linux.org.uk,
 vkuznets@redhat.com, wei.w.wang@intel.com, will@kernel.org,
 willy@infradead.org, xiaoyao.li@intel.com, yan.y.zhao@intel.com,
 yilun.xu@intel.com, yuzenghui@huawei.com, zhiquan1.li@intel.com
Content-Type: text/plain; charset="UTF-8"

Michael Roth writes:

> On Wed, May 14, 2025 at
> 04:41:41PM -0700, Ackerley Tng wrote:

Missed out responses on the second two comments!

[...]

>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> +
>> +static int kvm_gmem_shareability_setup(struct kvm_gmem_inode_private *private,
>> +				       loff_t size, u64 flags)
>> +{
>> +	enum shareability m;
>> +	pgoff_t last;
>> +
>> +	last = (size >> PAGE_SHIFT) - 1;
>> +	m = flags & GUEST_MEMFD_FLAG_INIT_PRIVATE ? SHAREABILITY_GUEST :
>> +						    SHAREABILITY_ALL;
>> +	return mtree_store_range(&private->shareability, 0, last, xa_mk_value(m),
>> +				 GFP_KERNEL);
>
> One really nice thing about using a maple tree is that it should get rid
> of a fairly significant startup delay for SNP/TDX when the entire xarray
> gets initialized with private attribute entries via
> KVM_SET_MEMORY_ATTRIBUTES (which is the current QEMU default behavior).
>
> I'd originally advocated for sticking with the xarray implementation
> Fuad was using until we'd determined we really need it for HugeTLB
> support, but I'm sort of thinking it's already justified just based on
> the above.

We discussed this at the guest_memfd upstream call, and I believe the
current position is to go with maple trees. Thanks for bringing this up!

> Maybe it would make sense for KVM memory attributes too?

I think so, but I haven't had the chance to work on that.

>> +}
>> +

[...]

>> @@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>  	if (!file)
>>  		return -EFAULT;
>>
>> +	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>> +

In this RFC, the filemap_invalidate_lock() was basically used to
serialize everything that could modify shareability.

> I like the idea of using a write-lock/read-lock to protect write/read
> access to shareability state (though maybe not necessarily re-using
> filemap's invalidate lock), it's simple and still allows concurrent
> faulting in of gmem pages.
> One issue on the SNP side (which also came up in one of the gmem calls)
> is if we introduce support for tracking preparedness as discussed (e.g.
> via a new SHAREABILITY_GUEST_PREPARED state): the
> SHAREABILITY_GUEST->SHAREABILITY_GUEST_PREPARED transition would occur
> at fault-time, and so would need to take the write-lock and no longer
> allow for concurrent fault-handling.
>
> I was originally planning on introducing a new rw_semaphore with similar
> semantics to the rw_lock that Fuad previously had in his restricted mmap
> series[1] (and similar semantics to the filemap invalidate lock here).
> The main difference, to handle setting SHAREABILITY_GUEST_PREPARED
> within fault paths, was that in the case of a folio being present for an
> index, the folio lock would also need to be held in order to update the
> shareability state. Because of that, fault paths (which will basically
> always either have or allocate a folio) can rely on the folio lock to
> guard shareability state in a more granular way and so can avoid a
> global write lock.
>
> They would still need to hold the read lock to access the tree, however.
> Or more specifically, any paths that could allocate a folio need to take
> a read lock so there isn't a TOCTOU situation where shareability is
> being updated for an index for which a folio hasn't been allocated, but
> then just afterward the folio gets faulted in/allocated while the
> shareability state is already being updated with the understanding that
> there was no folio around that needed locking.
>
> I had a branch with in-place conversion support for SNP[2] that added
> this lock reworking on top of Fuad's series along with preparation
> tracking, but I'm now planning to rebase that on top of the patches from
> this series that Sean mentioned[3] earlier:
>
>   KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
>   KVM: Query guest_memfd for private/shared status
>   KVM: guest_memfd: Skip LRU for guest_memfd folios
>   KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
>   KVM: guest_memfd: Introduce and use shareability to guard faulting
>   KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
>
> but figured I'd mention it here in case there are other things to
> consider on the locking front.

We discussed this a little at the last guest_memfd call; I'll summarize
the question I raised during the call here in text. :)

Today in guest_memfd, the "prepared" and "zeroed" concepts are tracked
with the folio's uptodate flag. Preparation is only used by SNP today;
TDX does the somewhat equivalent "preparation" at the time of mapping
into the guest page table.

Can we do SNP's preparation at some other point in time, and not let the
"prepared" state be handled by guest_memfd at all? This might simplify
locking too: preparedness would be locked whenever SNP needs it,
independently of shareability tracking.

Also, this might simplify the routines that use kvm_gmem_populate(), and
perhaps remove the need for kvm_gmem_populate() altogether. The current
callers are basically using kvm_gmem_populate() to allocate pages, so
why not call kvm_gmem_get_folio() to do the allocation?

Another tangential point: it's hard to use the uptodate flag for
tracking preparedness, since with huge pages the uptodate flag can only
indicate whether the entire folio is prepared, while a user of the
memory might have only part of the folio prepared.
>
> Definitely agree with Sean though that it would be nice to start
> identifying a common base of patches for the in-place conversion
> enablement for SNP, TDX, and pKVM so the APIs/interfaces for hugepages
> can be handled separately.
>
> -Mike
>
> [1] https://lore.kernel.org/kvm/20250328153133.3504118-1-tabba@google.com/
> [2] https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2/
> [3] https://lore.kernel.org/kvm/aC86OsU2HSFZkJP6@google.com/
>
>> 	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
>> 	if (IS_ERR(folio)) {
>> 		r = PTR_ERR(folio);
>> @@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>  		*page = folio_file_page(folio, index);
>>  	else
>>  		folio_put(folio);
>> -
>>  out:
>> +	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>>  	fput(file);
>>  	return r;
>>  }
>> --
>> 2.49.0.1045.g170613ef41-goog