From: Fuad Tabba <tabba@google.com>
Date: Thu, 20 Feb 2025 09:26:20 +0000
Subject: Re: [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition
To: Ackerley Tng
Cc: kvm@vger.kernel.org, linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
	pbonzini@redhat.com, chenhuacai@kernel.org, mpe@ellerman.id.au,
	anup@brainfault.org, paul.walmsley@sifive.com, palmer@dabbelt.com,
	aou@eecs.berkeley.edu, seanjc@google.com, viro@zeniv.linux.org.uk,
	brauner@kernel.org, willy@infradead.org, akpm@linux-foundation.org,
	xiaoyao.li@intel.com, yilun.xu@intel.com, chao.p.peng@linux.intel.com,
	jarkko@kernel.org, amoorthy@google.com, dmatlack@google.com,
	yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com, mic@digikod.net,
	vbabka@suse.cz, vannapurve@google.com, mail@maciej.szmigiero.name,
	david@redhat.com, michael.roth@amd.com, wei.w.wang@intel.com,
	liam.merwick@oracle.com, isaku.yamahata@gmail.com,
	kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com,
	steven.price@arm.com, quic_eberman@quicinc.com, quic_mnalajal@quicinc.com,
	quic_tsoni@quicinc.com, quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com,
	quic_pderrin@quicinc.com, quic_pheragu@quicinc.com, catalin.marinas@arm.com,
	james.morse@arm.com, yuzenghui@huawei.com, oliver.upton@linux.dev,
	maz@kernel.org, will@kernel.org, qperret@google.com, keirf@google.com,
	roypat@amazon.co.uk, shuah@kernel.org, hch@infradead.org, jgg@nvidia.com,
	rientjes@google.com, jhubbard@nvidia.com, fvdl@google.com, hughd@google.com,
	jthoughton@google.com
References: <20250117163001.2326672-6-tabba@google.com>
Content-Type: text/plain; charset="UTF-8"

Hi Ackerley,

On Wed, 19 Feb 2025 at 23:33, Ackerley Tng wrote:
>
> Fuad Tabba writes:
>
> This question should not block merging of this series since performance
> can be improved in a separate series:
>
> >
> >
> > +
> > +/*
> > + * Marks the range [start, end) as mappable by both the host and the guest.
> > + * Usually called when guest shares memory with the host.
> > + */
> > +static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> > +{
> > +	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > +	void *xval = xa_mk_value(KVM_GMEM_ALL_MAPPABLE);
> > +	pgoff_t i;
> > +	int r = 0;
> > +
> > +	filemap_invalidate_lock(inode->i_mapping);
> > +	for (i = start; i < end; i++) {
>
> Were any alternative data structures considered, or does anyone have
> suggestions for alternatives? Doing xa_store() in a loop here will take
> a long time for large ranges.
>
> I looked into the following:
>
> Option 1: (preferred) Maple trees
>
> Maple tree has a nice API, though it would be better if it can combine
> ranges that have the same value.
>
> I will have to dig into performance, but I'm assuming that even large
> ranges are stored in a few nodes so this would be faster than iterating
> over indices in an xarray.
>
> void explore_maple_tree(void)
> {
> 	DEFINE_MTREE(mt);
>
> 	mt_init_flags(&mt, MT_FLAGS_LOCK_EXTERN | MT_FLAGS_USE_RCU);
>
> 	mtree_store_range(&mt, 0, 16, xa_mk_value(0x20), GFP_KERNEL);
> 	mtree_store_range(&mt, 8, 24, xa_mk_value(0x32), GFP_KERNEL);
> 	mtree_store_range(&mt, 5, 10, xa_mk_value(0x32), GFP_KERNEL);
>
> 	{
> 		void *entry;
> 		MA_STATE(mas, &mt, 0, 0);
>
> 		mas_for_each(&mas, entry, ULONG_MAX) {
> 			pr_err("[%ld, %ld]: 0x%lx\n", mas.index, mas.last, xa_to_value(entry));
> 		}
> 	}
>
> 	mtree_destroy(&mt);
> }
>
> stdout:
>
> [0, 4]: 0x20
> [5, 10]: 0x32
> [11, 24]: 0x32
>
> Option 2: Multi-index xarray
>
> The API is more complex than maple tree's, and IIUC multi-index xarrays
> are not generalizable to any range, so the range can't be 8 1G pages + 1
> 4K page for example. The size of the range has to be a power of 2 that
> is greater than 4K.
>
> Using multi-index xarrays would mean computing order to store
> multi-index entries. This can be computed from the size of the range to
> be added, but is an additional source of errors.
>
> Option 3: Interval tree, which is built on top of red-black trees
>
> The API is set up at a lower level. A macro is used to define interval
> trees, the user has to deal with nodes in the tree directly and
> separately define functions to override sub-ranges in larger ranges.

I didn't consider any other data structures, mainly out of laziness :)

What I mean by that is that the xarray is what guest_memfd already uses
for tracking other gfn-related items, even though many have talked about
replacing it with something else in the future. I agree with you that
it's not the ideal data structure, but, as you said, it isn't part of
the interface, and it would be easy to replace later.

As you mention, one of the challenges is figuring out the performance
impact in practice. Once things have settled down and the interface is
more or less stable, some benchmarking would be useful to guide us here.
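For what it's worth, here is a rough, untested sketch of what
gmem_set_mappable() might collapse into if we did swap the xarray for a
maple tree along the lines of your Option 1. The mappable_ranges member
is a made-up name for a maple tree that would replace mappable_offsets
in kvm_gmem_private():

static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
{
	/* Hypothetical: a maple tree replacing the mappable_offsets xarray. */
	struct maple_tree *mt = &kvm_gmem_private(inode)->mappable_ranges;
	void *xval = xa_mk_value(KVM_GMEM_ALL_MAPPABLE);
	int r;

	filemap_invalidate_lock(inode->i_mapping);
	/* Maple tree ranges are inclusive, so [start, end) becomes [start, end - 1]. */
	r = mtree_store_range(mt, start, end - 1, xval, GFP_KERNEL);
	filemap_invalidate_unlock(inode->i_mapping);

	return r;
}

The whole per-index loop becomes a single range store, which is where
the performance win you describe would come from.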
Thanks!
/fuad

> > +		r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
> > +		if (r)
> > +			break;
> > +	}
> > +	filemap_invalidate_unlock(inode->i_mapping);
> > +
> > +	return r;
> > +}
> > +
> > +/*
> > + * Marks the range [start, end) as not mappable by the host. If the host doesn't
> > + * have any references to a particular folio, then that folio is marked as
> > + * mappable by the guest.
> > + *
> > + * However, if the host still has references to the folio, then the folio is
> > + * marked and not mappable by anyone. Marking it is not mappable allows it to
> > + * drain all references from the host, and to ensure that the hypervisor does
> > + * not transition the folio to private, since the host still might access it.
> > + *
> > + * Usually called when guest unshares memory with the host.
> > + */
> > +static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> > +{
> > +	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > +	void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> > +	void *xval_none = xa_mk_value(KVM_GMEM_NONE_MAPPABLE);
> > +	pgoff_t i;
> > +	int r = 0;
> > +
> > +	filemap_invalidate_lock(inode->i_mapping);
> > +	for (i = start; i < end; i++) {
> > +		struct folio *folio;
> > +		int refcount = 0;
> > +
> > +		folio = filemap_lock_folio(inode->i_mapping, i);
> > +		if (!IS_ERR(folio)) {
> > +			refcount = folio_ref_count(folio);
> > +		} else {
> > +			r = PTR_ERR(folio);
> > +			if (WARN_ON_ONCE(r != -ENOENT))
> > +				break;
> > +
> > +			folio = NULL;
> > +		}
> > +
> > +		/* +1 references are expected because of filemap_lock_folio(). */
> > +		if (folio && refcount > folio_nr_pages(folio) + 1) {
> > +			/*
> > +			 * Outstanding references, the folio cannot be faulted
> > +			 * in by anyone until they're dropped.
> > +			 */
> > +			r = xa_err(xa_store(mappable_offsets, i, xval_none, GFP_KERNEL));
> > +		} else {
> > +			/*
> > +			 * No outstanding references. Transition the folio to
> > +			 * guest mappable immediately.
> > +			 */
> > +			r = xa_err(xa_store(mappable_offsets, i, xval_guest, GFP_KERNEL));
> > +		}
> > +
> > +		if (folio) {
> > +			folio_unlock(folio);
> > +			folio_put(folio);
> > +		}
> > +
> > +		if (WARN_ON_ONCE(r))
> > +			break;
> > +	}
> > +	filemap_invalidate_unlock(inode->i_mapping);
> > +
> > +	return r;
> > +}
> > +
> >
> >
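P.S. For completeness, here is my (equally untested) understanding of
what the multi-index xarray store from your Option 2 would look like for
a single order-aligned chunk. I may be misremembering the exact xas API,
and the helper name below is made up:

/* Store one multi-index entry covering 2^order slots starting at 'start'. */
static int gmem_set_mappable_chunk(struct xarray *mappable_offsets,
				   pgoff_t start, unsigned int order)
{
	XA_STATE_ORDER(xas, mappable_offsets, start, order);
	void *xval = xa_mk_value(KVM_GMEM_ALL_MAPPABLE);

	do {
		xas_lock(&xas);
		xas_store(&xas, xval);
		xas_unlock(&xas);
	} while (xas_nomem(&xas, GFP_KERNEL));

	return xas_error(&xas);
}

The caller would still have to split an arbitrary [start, end) range
into order-aligned power-of-two chunks, which is the extra source of
errors you mention.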