From: Vishal Annapurve <vannapurve@google.com>
Date: Fri, 16 May 2025 06:11:56 -0700
Subject: Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
To: "Edgecombe, Rick P"
In-Reply-To: <7d3b391f3a31396bd9abe641259392fd94b5e72f.camel@intel.com>
References: <24e8ae7483d0fada8d5042f9cd5598573ca8f1c5.camel@intel.com> <7d3b391f3a31396bd9abe641259392fd94b5e72f.camel@intel.com>
"nikunj@amd.com" , "Graf, Alexander" , "viro@zeniv.linux.org.uk" , "pbonzini@redhat.com" , "yuzenghui@huawei.com" , "jroedel@suse.de" , "suzuki.poulose@arm.com" , "jgowans@amazon.com" , "Xu, Yilun" , "liam.merwick@oracle.com" , "michael.roth@amd.com" , "quic_tsoni@quicinc.com" , "richard.weiyang@gmail.com" , "Weiny, Ira" , "aou@eecs.berkeley.edu" , "Li, Xiaoyao" , "qperret@google.com" , "kent.overstreet@linux.dev" , "dmatlack@google.com" , "james.morse@arm.com" , "brauner@kernel.org" , "ackerleytng@google.com" , "linux-fsdevel@vger.kernel.org" , "pgonda@google.com" , "quic_pderrin@quicinc.com" , "roypat@amazon.co.uk" , "linux-mm@kvack.org" , "will@kernel.org" , "hch@infradead.org" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: wg167mos3qxzgsmjqron3rmgf3cyngpa X-Rspam-User: X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: C44691C0012 X-HE-Tag: 1747401130-38600 X-HE-Meta: U2FsdGVkX1835oiEemnjM/Q/3GpLrx9xFcsCDLqPe/5FwKDuGyEsSSZoI/ozVvcZZF/bSYYK/tqMsb1mNQaTM2yZSFC3UguY69fp9MnjjvgNhGKNsXkWKAKrqSyS2X0DHyMO3MhAcieiooykldc77Qnv9E/DQM2F2HTrrXEspWUqiL9VL2UR82IwiyiNI+mEZbyl11VGOJpytg1Goyjo/wOi0D0kY24vR1r6rG2UGx9DqvHutZW+08FShVetJa3j/nyNolO926teckdJrjnEq+mC15CjPQ4hacaBh2c/m6qY0EH5fJfFMD+qSgHIz7oXPfZvK5rGAYY8vzwKyYkIe3al3CaV1RLqdcHScE/DPykcKxmJ5zd8YmRIKlES/V47z/S0QmZV3dOwY5Eq1raGHZZFbK7LIo31uf7RotB+ssqRHqb9JO5CFq4HcRhwU3hj1Jq3FO/CXyFhw66ALZB+9onqSLfsLFJSoN54pUyg44rIRXG/q0guf2mEObFRMC4MpocgjPmWri06r9UQtcygpC82hnjVBzvEiBeFj/96r5wfdz6EtHFgTriR5flFJmuN7N1cb7s725w4JeAovdC4LyFT8/cqet7PLUI3DyT7IBXSUQ4suOUtidn83xhQbP6vJvXULtekfSCjOIeXATFMOv/GIVQZ+sc+jpT5cOQugbwoKX/fFqDWgVq1qx+SWRYIzDUsAfQb3pnpx41Fla8gQyLaF56TX82zJ0iDzVeVUMmH9VWi/dHFTdZPLTskGxNIqvYZjXuX/gFc3NfTpH03TETxT5iayjYK8D4DnTJmvidHCLnaeATpKToJU/983iwWBMduBHD1GcjNnp4ewqZ8MwWsH4GrUOsJ576cVAAuKfUaC/nUvQMp71bK42WpkhxWcUv47Sb6cGa6VpI3CN6akvrvRk+F69ahamuCNlrnjjCWHOsJW0ZQ+wL5gLzIeWJRaDA9Ns2F9qOZ/3sr6b0 t5bnyoBL 6rRbbqybpS1ua7vV62HzCaW4/nPe2p/s369M7ij4uaHcAv4ky9keniOYJhFBz/tkEN0zPPjBYZ8VvsIDxd5W4vYTQQIUe6s0yOsVtzYqYbhHRyfb5pPetQd01O3czdX9W6+afN41OgqPdzkaST0R+SKV68yNDjfV9GmvVf4iPOKyjcG9nPm9jfvnxTKgtSJbg3MUYrSxMtY/EfhceVRDqRKHLwX2A5+6Pdp9pc4HF4RVXdr+PBJqCQj0Sx2ce0u/OnrAXGsjzFOXehnvgv0XQih8KLUhXjpl5NUYg0ThUFj3bU4RDmaRPdfpQFnPhJcqWmhXjCsoVF14w7Vc5Kq77ISPq+sV0IwG9C0jOzccbJA8GpoP4+nDDWr4ueF68347cYSoC X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, May 15, 2025 at 7:12=E2=80=AFPM Edgecombe, Rick P wrote: > > On Thu, 2025-05-15 at 17:57 -0700, Sean Christopherson wrote: > > > > > Thinking from the TDX perspective, we might have bigger fish to f= ry than > > > > > 1.6% memory savings (for example dynamic PAMT), and the rest of t= he > > > > > benefits don't have numbers. How much are we getting for all the > > > > > complexity, over say buddy allocated 2MB pages? > > > > TDX may have bigger fish to fry, but some of us have bigger fish to fry= than > > TDX :-) > > Fair enough. But TDX is on the "roadmap". So it helps to say what the tar= get of > this series is. > > > > > > > This series should work for any page sizes backed by hugetlb memory= . > > > > Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are > > > > essential for certain workloads and will emerge as guest_memfd user= s. > > > > Features like KHO/memory persistence in addition also depend on > > > > hugepage support in guest_memfd. 
> > > >
> > > > This series takes strides towards making guest_memfd compatible with
> > > > usecases where 1G pages are essential and non-confidential VMs are
> > > > already exercising them.
> > > >
> > > > I think the main complexity here lies in supporting in-place
> > > > conversion, which applies to any huge page size, even for buddy
> > > > allocated 2MB pages or THP.
> > > >
> > > > This complexity arises because page structs work at a fixed
> > > > granularity; the future roadmap towards not having page structs for guest
> > > > memory (at least private memory to begin with) should help towards
> > > > greatly reducing this complexity.
> > > >
> > > > That being said, DPAMT and huge page EPT mappings for TDX VMs remain
> > > > essential and complement this series well for better memory footprint
> > > > and overall performance of TDX VMs.
> > >
> > > Hmm, this didn't really answer my questions about the concrete benefits.
> > >
> > > I think it would help to include this kind of justification for the 1GB
> > > guestmemfd pages. "essential for certain workloads and will emerge" is a bit
> > > hard to review against...
> > >
> > > I think one of the challenges with coco is that it's almost like a sprint to
> > > reimplement virtualization. But enough things are changing at once that not
> > > all of the normal assumptions hold, so it can't copy all the same solutions.
> > > The recent example was that for TDX huge pages we found that normal
> > > promotion paths weren't actually yielding any benefit for surprising
> > > TDX-specific reasons.
> > >
> > > On the TDX side we are also, at least currently, unmapping private pages
> > > while they are mapped shared, so any 1GB pages would get split to 2MB if
> > > there are any shared pages in them. I wonder how many 1GB pages there would
> > > be after all the shared pages are converted. At smaller TD sizes, it could
> > > be not much.
> >
> > You're conflating two different things: guest_memfd allocating and managing
> > 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> > granularity. Allocating memory in 1GiB chunks is useful even if KVM can only
> > map memory into the guest using 4KiB pages.
>
> I'm aware of the 1.6% vmemmap benefits from the LPC talk. Is there more? The
> list quoted there was more about guest performance. Or maybe the clever page
> table walkers that find contiguous small mappings could benefit guest
> performance too? It's the kind of thing I'd like to see at least broadly called
> out.

The crux of this series really is hugetlb backing support for guest_memfd,
and handling CoCo VMs irrespective of the page size: as I suggested
earlier, 2M page sizes will need to handle similar complexity of in-place
conversion.

Google internally uses 1G hugetlb pages to achieve high-bandwidth IO, a
lower memory footprint using HVO, and a lower MMU/IOMMU page table memory
footprint, among other improvements. These percentages carry a substantial
impact when working at the scale of large fleets of hosts, each carrying
significant memory capacity.

guest_memfd hugepage support + hugepage EPT mapping support for TDX VMs
significantly help:
1) ~70% decrease in TDX VM boot-up time
2) ~65% decrease in TDX VM shutdown time
3) ~90% decrease in TDX VM PAMT memory overhead
4) Improvement in TDX SEPT memory overhead

And we believe this combination should also help achieve better
performance with TDX Connect in the future.
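To put rough numbers on the footprint points above (my own back-of-the-envelope,
not measurements from this series; it assumes a 64-byte struct page and 8-byte
page table entries on x86-64):

  vmemmap:   64 B per 4 KiB page = 64 / 4096 ~= 1.56% of guest memory,
             i.e. ~16 MiB of struct pages per 1 GiB page, almost all of
             which HVO can free for a hugetlb-backed range.

  page tables: mapping 1 GiB with 4 KiB PTEs takes 262144 * 8 B = 2 MiB of
             last-level tables (~0.2% of the mapped size), versus a single
             8-byte PUD entry when a 1 GiB mapping can be used.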
Hugetlb huge pages are preferred as they are statically carved out at boot
and so provide much better guarantees of availability. Once the pages are
carved out, any VMs scheduled on such a host will need to work with the
same hugetlb memory sizes. This series attempts to use hugetlb pages with
in-place conversion, avoiding the double allocation problem that otherwise
results in significant memory overheads for CoCo VMs.

>
> I'm thinking that Google must have a ridiculous amount of learnings about VM
> memory management. And this is probably designed around those learnings. But
> reviewers can't really evaluate it if they don't know the reasons and tradeoffs
> taken. If it's going upstream, I think it should have at least the high level
> reasoning explained.
>
> I don't mean to harp on the point so hard, but I didn't expect it to be
> controversial either.
>
> > >
> > > So for TDX in isolation, it seems like jumping out too far ahead to
> > > effectively consider the value. But presumably you guys are testing this on
> > > SEV or something? Have you measured any performance improvement? For what
> > > kind of applications? Or is the idea basically to make guestmemfd work
> > > like however Google does guest memory?
> >
> > The longer term goal of guest_memfd is to make it suitable for backing all
> > VMs, hence Vishal's "Non-CoCo VMs" comment.
>
> Oh, I actually wasn't aware of this. Or maybe I remember now. I thought he was
> talking about pKVM.
>
> > Yes, some of this is useful for TDX, but we (and others) want to use
> > guest_memfd for far more than just CoCo VMs.
> >
> > And for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.
> I've heard this a lot. It must be true, but I've never seen the actual numbers.
> For a long time people believed 1GB huge pages on the direct map were critical,
> but then benchmarking on a contemporary CPU couldn't find much difference
> between 2MB and 1GB. I'd expect TDP huge pages to be different than that because
> the combined walks are huge, iTLB, etc, but I'd love to see a real number.
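For concreteness, here is a minimal sketch (not code from the series) of the
in-place conversion flow mentioned above: a single hugetlb-backed guest_memfd
serves a range for both private and shared use, and a conversion flips KVM's
memory attributes on that range instead of allocating a second copy from a
different backing store. KVM_CREATE_GUEST_MEMFD, KVM_SET_MEMORY_ATTRIBUTES
and KVM_MEMORY_ATTRIBUTE_PRIVATE are existing uAPI in <linux/kvm.h> (6.8+);
the hugetlb flag name below is a made-up placeholder for whatever this series
ends up defining.

#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define GMEM_HUGETLB_FLAG_PLACEHOLDER	(1ULL << 0)	/* hypothetical */

/* One guest_memfd backs the range for both private and shared use. */
static int create_hugetlb_gmem(int vm_fd, uint64_t size)
{
	struct kvm_create_guest_memfd gmem = {
		.size	= size,				/* multiple of 1 GiB */
		.flags	= GMEM_HUGETLB_FLAG_PLACEHOLDER,
	};

	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
}

/*
 * Conversion flips the attribute on the same physical range rather than
 * allocating from a separate shared pool, which is what avoids the
 * double-allocation overhead for CoCo VMs.
 */
static int convert_range(int vm_fd, uint64_t gpa, uint64_t size, bool private)
{
	struct kvm_memory_attributes attr = {
		.address	= gpa,
		.size		= size,
		.attributes	= private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
	};

	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr);
}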