Date: Mon, 18 Mar 2024 17:10:47 -0700
References: <99a94a42-2781-4d48-8b8c-004e95db6bb5@redhat.com>
 <7470390a-5a97-475d-aaad-0f6dfb3d26ea@redhat.com>
Subject: Re: folio_mmapped
From: Sean Christopherson
To: Vishal Annapurve
Cc: David Hildenbrand, Quentin Perret, Matthew Wilcox, Fuad Tabba,
 kvm@vger.kernel.org, kvmarm@lists.linux.dev, pbonzini@redhat.com,
 chenhuacai@kernel.org, mpe@ellerman.id.au, anup@brainfault.org,
 paul.walmsley@sifive.com, palmer@dabbelt.com, aou@eecs.berkeley.edu,
 viro@zeniv.linux.org.uk, brauner@kernel.org, akpm@linux-foundation.org,
 xiaoyao.li@intel.com, yilun.xu@intel.com, chao.p.peng@linux.intel.com,
 jarkko@kernel.org, amoorthy@google.com, dmatlack@google.com,
 yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com, mic@digikod.net,
 vbabka@suse.cz, ackerleytng@google.com, mail@maciej.szmigiero.name,
 michael.roth@amd.com, wei.w.wang@intel.com, liam.merwick@oracle.com,
 isaku.yamahata@gmail.com, kirill.shutemov@linux.intel.com,
 suzuki.poulose@arm.com, steven.price@arm.com, quic_mnalajal@quicinc.com,
 quic_tsoni@quicinc.com, quic_svaddagi@quicinc.com,
 quic_cvanscha@quicinc.com, quic_pderrin@quicinc.com,
 quic_pheragu@quicinc.com, catalin.marinas@arm.com, james.morse@arm.com,
 yuzenghui@huawei.com, oliver.upton@linux.dev, maz@kernel.org,
 will@kernel.org, keirf@google.com, linux-mm@kvack.org

On Mon, Mar 18, 2024, Vishal Annapurve wrote:
> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand wrote:
> > Second, we should find better ways to let an IOMMU map these pages,
> > *not* using GUP. There were already discussions on providing a similar
> > fd+offset-style interface instead. GUP really sounds like the wrong
> > approach here. Maybe we should look into passing not only guest_memfd,
> > but also "ordinary" memfds.

+1. I am not completely opposed to letting SNP and TDX effectively convert
pages between private and shared, but I also completely agree that letting
anything gup() guest_memfd memory is likely to end in tears.

> I need to dig into past discussions around this, but agree that
> passing guest memfds to VFIO drivers in addition to HVAs seems worth
> exploring. This may be required anyways for devices supporting TDX
> connect [1].
>
> If we are talking about the same file catering to both private and
> shared memory, there has to be some way to keep track of references on
> the shared memory from both host userspace and IOMMU.
>
> >
> > Third, I don't think we should be using huge pages where huge pages
> > don't make any sense. Using a 1 GiB page so the VM will convert some
> > pieces to map it using PTEs will destroy the whole purpose of using 1
> > GiB pages. It doesn't make any sense.

I don't disagree, but the fundamental problem is that we have no guarantees as
to what the guest will or will not do. We can certainly make very educated
guesses, and probably be right 99.99% of the time, but being wrong 0.01% of the
time probably means a lot of broken VMs, and a lot of unhappy customers.

> I had started a discussion for this [2] using an RFC series.

David is talking about the host side of things, AFAICT you're talking about the
guest side...

> challenge here remain:
> 1) Unifying all the conversions under one layer
> 2) Ensuring shared memory allocations are huge page aligned at boot
> time and runtime.
>
> Using any kind of unified shared memory allocator (today this part is
> played by SWIOTLB) will need to support huge page aligned dynamic
> increments, which can be only guaranteed by carving out enough memory
> at boot time for CMA area and using CMA area for allocation at
> runtime.
> - Since it's hard to come up with a maximum amount of shared memory
> needed by VM, especially with GPUs/TPUs around, it's difficult to come
> up with CMA area size at boot time.

...which is very relevant as carving out memory in the guest is nigh impossible,
but carving out memory in the host for systems whose sole purpose is to run VMs
is very doable.

> I think it's arguable that even if a VM converts 10 % of its memory to
> shared using 4k granularity, we still have fewer page table walks on
> the rest of the memory when using 1G/2M pages, which is a significant
> portion.

Performance is a secondary concern. If this were _just_ about guest performance,
I would unequivocally side with David: the guest gets to keep the pieces if it
fragments a 1GiB page.

The main problem we're trying to solve is that we want to provision a host such
that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously
run CoCo VMs, with 100% fungibility. I.e. a host could run 100% non-CoCo VMs,
100% CoCo VMs, or more likely, some sliding mix of the two. Ideally, CoCo VMs
would also get the benefits of 1GiB mappings, but that's not the driving
motivation for this discussion.

As HugeTLB exists today, supporting that use case isn't really feasible because
there's no sane way to convert/free just a sliver of a 1GiB page (and
reconstitute the 1GiB page when the sliver is converted/freed back).

Peeking ahead at my next comment, I don't think that solving this in the guest
is a realistic option, i.e. IMO, we need to figure out a way to handle this in
the host, without relying on the guest to cooperate. Luckily, we haven't added
hugepage support of any kind to guest_memfd, i.e. we have a fairly blank slate
to work with.

The other big advantage that we should lean into is that we can make assumptions
about guest_memfd usage that would never fly for general purpose backing stores,
e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not
desirable, for (almost?) all of the CoCo use cases.

I don't have any concrete ideas at this time, but my gut feeling is that this
won't be _that_ crazy hard to solve if we commit hard to guest_memfd _not_ being
general purpose, and if we account for conversion scenarios when designing
hugepage support for guest_memfd.

> > For example, one could create a GPA layout where some regions are backed
> > by gigantic pages that cannot be converted/can only be converted as a
> > whole, and some are backed by 4k pages that can be converted back and
> > forth. We'd use multiple guest_memfds for that. I recall that physically
> > restricting such conversions/locations (e.g., for bounce buffers) in
> > Linux was already discussed somewhere, but I don't recall the details.
> >
> > It's all not trivial and not easy to get "clean".
>
> Yeah, agree with this point, it's difficult to get a clean solution
> here, but the host side solution might be easier to deploy (not
> necessarily easier to implement) and possibly cleaner than attempts to
> regulate the guest side.

I think we missed the opportunity to regulate the guest side by several years.
To be able to rely on such a scheme, e.g. to deploy at scale and not DoS customer
VMs, KVM would need to be able to _enforce_ the scheme. And while I am more than
willing to put my foot down on things where the guest is being blatantly
ridiculous, wanting to convert an arbitrary 4KiB chunk of memory between private
and shared isn't ridiculous (likely inefficient, but not ridiculous). I.e. I'm
not willing to have KVM refuse conversions that are legal according to the SNP
and TDX specs (and presumably the CCA spec, too).

That's why I think we're years too late; this sort of restriction needs to go in
the "hardware" spec, and that ship has sailed.