From: Vishal Annapurve <vannapurve@google.com>
Date: Sat, 12 Jul 2025 10:53:17 -0700
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
To: Michael Roth
Cc: Ackerley Tng, kvm@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, x86@kernel.org, linux-fsdevel@vger.kernel.org,
 aik@amd.com, ajones@ventanamicro.com, akpm@linux-foundation.org,
 amoorthy@google.com, anthony.yznaga@oracle.com, anup@brainfault.org,
 aou@eecs.berkeley.edu, bfoster@redhat.com, binbin.wu@linux.intel.com,
 brauner@kernel.org, catalin.marinas@arm.com, chao.p.peng@intel.com,
 chenhuacai@kernel.org, dave.hansen@intel.com, david@redhat.com,
 dmatlack@google.com, dwmw@amazon.co.uk, erdemaktas@google.com,
 fan.du@intel.com, fvdl@google.com, graf@amazon.com, haibo1.xu@intel.com,
 hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
 isaku.yamahata@intel.com, jack@suse.cz, james.morse@arm.com,
 jarkko@kernel.org, jgg@ziepe.ca, jgowans@amazon.com, jhubbard@nvidia.com,
 jroedel@suse.de, jthoughton@google.com, jun.miao@intel.com,
 kai.huang@intel.com, keirf@google.com, kent.overstreet@linux.dev,
 kirill.shutemov@intel.com, liam.merwick@oracle.com,
 maciej.wieczor-retman@intel.com, mail@maciej.szmigiero.name, maz@kernel.org,
 mic@digikod.net, mpe@ellerman.id.au, muchun.song@linux.dev, nikunj@amd.com,
 nsaenz@amazon.es, oliver.upton@linux.dev, palmer@dabbelt.com,
 pankaj.gupta@amd.com, paul.walmsley@sifive.com, pbonzini@redhat.com,
 pdurrant@amazon.co.uk, peterx@redhat.com, pgonda@google.com, pvorel@suse.cz,
 qperret@google.com, quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
 quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
 quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com,
 richard.weiyang@gmail.com, rick.p.edgecombe@intel.com, rientjes@google.com,
 roypat@amazon.co.uk, rppt@kernel.org, seanjc@google.com, shuah@kernel.org,
 steven.price@arm.com, steven.sistare@oracle.com, suzuki.poulose@arm.com,
 tabba@google.com, thomas.lendacky@amd.com, usama.arif@bytedance.com,
 vbabka@suse.cz, viro@zeniv.linux.org.uk, vkuznets@redhat.com,
 wei.w.wang@intel.com, will@kernel.org, willy@infradead.org,
 xiaoyao.li@intel.com, yan.y.zhao@intel.com, yilun.xu@intel.com,
 yuzenghui@huawei.com, zhiquan1.li@intel.com
In-Reply-To: <20250712001055.3in2lnjz6zljydq2@amd.com>
References: <20250529054227.hh2f4jmyqf6igd3i@amd.com>
 <20250702232517.k2nqwggxfpfp3yym@amd.com>
 <20250703041210.uc4ygp4clqw2h6yd@amd.com>
 <20250703203944.lhpyzu7elgqmplkl@amd.com>
 <20250712001055.3in2lnjz6zljydq2@amd.com>

On Fri, Jul 11, 2025 at 5:11 PM Michael Roth wrote:
> >
> > Wishful thinking on my part: It would be great to figure out a way to
> > promote these pagetable entries without relying on the guest, if
> > possible with ABI updates, as I think the host should have some
> > control over EPT/NPT granularities even for Confidential VMs. Along
>
> I'm not sure how much it would buy us. For example, for a 2MB hugetlb
> SNP guest boot with 16GB of memory I see 622 2MB hugepages getting
> split, but only about 30 or so of those get merged back to 2MB folios
> during guest run-time. These are presumably the set of 2MB regions we
> could promote back up, but it's not much given that we wouldn't expect
> that value to grow proportionally for larger guests: it's really
> separate things like the number of vCPUs (for shared GHCB pages),
> number of virtio buffers, etc.
> that end up determining the upper bound on how many pages might get
> split due to 4K private->shared conversion, and these wouldn't vary
> all that much from guest to guest outside maybe vCPU count.
>
> For 1GB hugetlb I see about 6 1GB pages get split, and only 2 get
> merged during run-time and would be candidates for promotion.
>

Thanks for the great analysis here. I think we will need to repeat such
analysis for other scenarios such as usage with accelerators.

> This could be greatly improved from the guest side by using
> higher-order allocations to create pools of shared memory that could
> then be used to reduce the number of splits caused by doing
> private->shared conversions on random ranges of malloc'd memory, and
> this could be done even without special promotion support on the host
> for pretty much the entirety of guest memory. The idea there would be
> to just make optimized guests avoid the splits completely, rather than
> relying on the limited subset that hardware can optimize without guest
> cooperation.

Yes, it would be great to improve the situation from the guest side.
E.g. I tried a rough draft of this in [1]; the conclusion there was
that we need to set aside "enough" guest memory as CMA so that all DMA
goes through 2M-aligned buffers. It's hard to figure out how much is
"enough", but we could start somewhere.

That being said, the host still has to manage memory this way by
splitting/merging at runtime, because I don't think it's possible to
enforce that all conversions happen at 2M (let alone 1G) granularity.
So it's very likely that even if guests do a significant chunk of their
conversions at hugepage granularity, the host still needs to split
pages all the way down to 4K for all shared regions, unless we bake
another restriction into the conversion ABI: guests may only convert
back to private the same ranges that were previously converted to
shared.

[1] https://lore.kernel.org/lkml/20240112055251.36101-1-vannapurve@google.com/
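To make that a bit more concrete, below is a rough guest-side sketch
(illustrative only; the helper name, flow and error handling are made
up and this is not code from any existing series) of carving out one
2M-aligned chunk and converting it to shared in a single call, so that
DMA buffers can later be suballocated from already-shared memory
instead of each triggering a 4K private->shared conversion:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <asm/set_memory.h>

/*
 * Illustrative only: grow a guest-side shared pool by one 2M-aligned,
 * 2M-sized chunk that is converted to shared as a whole, so the host
 * never has to split the backing hugepage down to 4K.
 */
static void *shared_pool_grow_2m(void)
{
	unsigned int order = PMD_SHIFT - PAGE_SHIFT;	/* 9 on x86, i.e. 2M */
	struct page *page;

	/* An order-9 buddy allocation is naturally 2M-aligned. */
	page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
	if (!page)
		return NULL;

	/*
	 * One conversion covering the whole 2M range; the host can keep
	 * (or later restore) a hugepage mapping for it.
	 */
	if (set_memory_decrypted((unsigned long)page_address(page),
				 1 << order)) {
		/* Conversion state is unclear on failure; leak, don't reuse. */
		return NULL;
	}

	return page_address(page);
}

Guest drivers would then need to be taught to carve their shared
buffers (GHCB pages, virtio buffers, etc.) out of such a pool, which is
really the harder part of the exercise.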

> > the similar lines, it would be great to have "page struct"-less memory
> > working for Confidential VMs, which should greatly reduce the toil
> > with merge/split operations and will render the conversions mostly to
> > be pagetable manipulations.
>
> FWIW, I did some profiling of split/merge vs. overall conversion time
> (by that I mean all cycles spent within kvm_gmem_convert_execute_work()),
> and while split/merge does take quite a few more cycles than your
> average conversion operation (~100x more), the total cycles spent
> splitting/merging ended up being about 7% of the total cycles spent
> handling conversions (1043938460 cycles in this case).
>
> For 1GB, a split/merge takes >1000x more than a normal conversion
> operation (46475980 cycles vs 320 in this sample), but it's probably
> still not too bad vs the overall conversion path, and as mentioned
> above it only happens about 6 times for a 16GB SNP guest, so I don't
> think split/merge overhead is a huge deal for current guests,
> especially if we work toward optimizing guest-side usage of shared
> memory in the future. (There is potential for this to crater
> performance for a very poorly-optimized guest, but I think the guest
> should bear some burden for that sort of thing: e.g. flipping the same
> page back and forth between shared/private vs. caching it for
> continued usage as a shared page in the guest driver path isn't
> something we should put too much effort into optimizing.)

As per discussions in the past, guest_memfd private pages are managed
solely by guest_memfd. We don't need, and effectively don't want, the
kernel to manage guest private memory. So in theory we can get rid of
page structs for private pages, allocating page structs only for shared
memory at conversion time and freeing them on conversion back to
private. And once we have base core-mm allocators that hand out raw
pfns to start with, we don't even need shared memory ranges to be
backed by page structs.

A few hurdles we need to cross:
1) Invent a new filemap equivalent that maps guest_memfd offsets to
   pfns (see the rough sketch at the end of this mail).
2) Modify TDX EPT management to work with pfns and not page structs.
3) Modify generic KVM NPT/EPT management logic to work with pfns and
   not rely on page structs.
4) Modify memory error/hwpoison handling to route all memory errors on
   such pfns to guest_memfd.

I believe there are obvious benefits (reduced complexity, reduced
memory footprint, etc.) if we go this route, and we are very likely to
go this route for future use cases even if we decide to live with the
conversion costs today.
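For 1), the shape I have in mind is roughly the following (purely
illustrative; none of these names exist today): a per-guest_memfd
xarray keyed by file offset that stores raw pfns, so neither the fault
path nor conversion ever needs a struct page:

#include <linux/gfp.h>
#include <linux/types.h>
#include <linux/xarray.h>

/* Hypothetical names; nothing below exists in the kernel today. */
#define GMEM_NO_PFN	(~0UL)

struct gmem_pfn_map {
	struct xarray pfns;	/* file index -> raw pfn, no struct page */
};

static void gmem_pfn_map_init(struct gmem_pfn_map *map)
{
	xa_init(&map->pfns);
}

/* Record the pfn backing a given guest_memfd page index. */
static int gmem_pfn_map_store(struct gmem_pfn_map *map, pgoff_t index,
			      unsigned long pfn)
{
	return xa_err(xa_store(&map->pfns, index, xa_mk_value(pfn),
			       GFP_KERNEL));
}

/* Look up the pfn for a page index, e.g. from the KVM fault path. */
static unsigned long gmem_pfn_map_lookup(struct gmem_pfn_map *map,
					 pgoff_t index)
{
	void *entry = xa_load(&map->pfns, index);

	return entry ? xa_to_value(entry) : GMEM_NO_PFN;
}

Whether something like this lives inside guest_memfd itself or comes
with new core-mm allocators that hand out raw pfns is an open question,
and routing hwpoison (4 above) would additionally need a way to get
from a bad pfn back to its owning guest_memfd.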