From: Vishal Annapurve <vannapurve@google.com>
Date: Sat, 12 Jul 2025 10:53:17 -0700
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
To: Michael Roth
Cc: Ackerley Tng, kvm@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, x86@kernel.org, linux-fsdevel@vger.kernel.org,
 aik@amd.com, ajones@ventanamicro.com, akpm@linux-foundation.org,
 amoorthy@google.com, anthony.yznaga@oracle.com, anup@brainfault.org,
 aou@eecs.berkeley.edu, bfoster@redhat.com, binbin.wu@linux.intel.com,
 brauner@kernel.org, catalin.marinas@arm.com, chao.p.peng@intel.com,
 chenhuacai@kernel.org, dave.hansen@intel.com, david@redhat.com,
 dmatlack@google.com, dwmw@amazon.co.uk, erdemaktas@google.com,
 fan.du@intel.com, fvdl@google.com, graf@amazon.com, haibo1.xu@intel.com,
 hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
 isaku.yamahata@intel.com, jack@suse.cz, james.morse@arm.com,
 jarkko@kernel.org, jgg@ziepe.ca, jgowans@amazon.com, jhubbard@nvidia.com,
 jroedel@suse.de, jthoughton@google.com, jun.miao@intel.com,
 kai.huang@intel.com, keirf@google.com, kent.overstreet@linux.dev,
 kirill.shutemov@intel.com, liam.merwick@oracle.com,
 maciej.wieczor-retman@intel.com, mail@maciej.szmigiero.name, maz@kernel.org,
 mic@digikod.net, mpe@ellerman.id.au, muchun.song@linux.dev, nikunj@amd.com,
 nsaenz@amazon.es, oliver.upton@linux.dev, palmer@dabbelt.com,
 pankaj.gupta@amd.com, paul.walmsley@sifive.com, pbonzini@redhat.com,
 pdurrant@amazon.co.uk, peterx@redhat.com, pgonda@google.com, pvorel@suse.cz,
 qperret@google.com, quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
 quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
 quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com,
 richard.weiyang@gmail.com, rick.p.edgecombe@intel.com, rientjes@google.com,
 roypat@amazon.co.uk, rppt@kernel.org, seanjc@google.com, shuah@kernel.org,
 steven.price@arm.com, steven.sistare@oracle.com, suzuki.poulose@arm.com,
 tabba@google.com, thomas.lendacky@amd.com, usama.arif@bytedance.com,
 vbabka@suse.cz, viro@zeniv.linux.org.uk, vkuznets@redhat.com,
 wei.w.wang@intel.com, will@kernel.org, willy@infradead.org,
 xiaoyao.li@intel.com, yan.y.zhao@intel.com, yilun.xu@intel.com,
 yuzenghui@huawei.com, zhiquan1.li@intel.com
In-Reply-To: <20250712001055.3in2lnjz6zljydq2@amd.com>
References: <20250529054227.hh2f4jmyqf6igd3i@amd.com>
 <20250702232517.k2nqwggxfpfp3yym@amd.com>
 <20250703041210.uc4ygp4clqw2h6yd@amd.com>
 <20250703203944.lhpyzu7elgqmplkl@amd.com>
 <20250712001055.3in2lnjz6zljydq2@amd.com>

On Fri, Jul 11, 2025 at 5:11 PM Michael Roth wrote:
> >
> > Wishful thinking on my part: It would be great to figure out a way to
> > promote these pagetable entries without relying on the guest, if
> > possible with ABI updates, as I think the host should have some
> > control over EPT/NPT granularities even for Confidential VMs. Along
>
> I'm not sure how much it would buy us. For example, for a 2MB hugetlb
> SNP guest boot with 16GB of memory I see 622 2MB hugepages getting
> split, but only about 30 or so of those get merged back to 2MB folios
> during guest run-time. These are presumably the set of 2MB regions we
> could promote back up, but it's not much given that we wouldn't expect
> that value to grow proportionally for larger guests: it's really
> separate things like the number of vCPUs (for shared GHCB pages),
> number of virtio buffers, etc.
> that end up determining the upper bound on how many pages might get
> split due to 4K private->shared conversion, and these wouldn't vary
> all that much from guest to guest outside maybe vCPU count.
>
> For 1GB hugetlb I see about 6 1GB pages get split, and only 2 get
> merged during run-time and would be candidates for promotion.
>

Thanks for the great analysis here. I think we will need to repeat such
analysis for other scenarios such as usage with accelerators.

> This could be greatly improved from the guest side by using
> higher-order allocations to create pools of shared memory that could
> then be used to reduce the number of splits caused by doing
> private->shared conversions on random ranges of malloc'd memory, and
> this could be done even without special promotion support on the host
> for pretty much the entirety of guest memory. The idea there would be
> to just make optimized guests avoid the splits completely, rather than
> relying on the limited subset that hardware can optimize without guest
> cooperation.

Yes, it would be great to improve the situation from the guest side.
E.g. I tried a rough draft of this in [1]; the conclusion there was
that we need to set aside "enough" guest memory as CMA so that all DMA
goes through 2M-aligned buffers. It's hard to figure out how much is
"enough", but we could start somewhere.

That being said, the host still has to manage memory this way by
splitting/merging at runtime, because I don't think it's possible to
enforce that all conversions happen at 2M (let alone 1G) granularity.
So it's very likely that even if guests do a significant chunk of their
conversions at hugepage granularity, the host still needs to split
pages all the way down to 4K for all shared regions, unless we bake
another restriction into the conversion ABI: guests may only convert
back to private the same ranges that were previously converted to
shared.

[1] https://lore.kernel.org/lkml/20240112055251.36101-1-vannapurve@google.com/
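To make that a bit more concrete, below is a rough guest-side sketch
(illustrative only; the helper name, flow and error handling are made
up and this is not code from any existing series) of carving out one
2M-aligned chunk and converting it to shared in a single call, so that
DMA buffers can later be suballocated from already-shared memory
instead of each triggering a 4K private->shared conversion:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <asm/set_memory.h>

/*
 * Illustrative only: grow a guest-side shared pool by one 2M-aligned,
 * 2M-sized chunk that is converted to shared as a whole, so the host
 * never has to split the backing hugepage down to 4K.
 */
static void *shared_pool_grow_2m(void)
{
	unsigned int order = PMD_SHIFT - PAGE_SHIFT;	/* 9 on x86, i.e. 2M */
	struct page *page;

	/* An order-9 buddy allocation is naturally 2M-aligned. */
	page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
	if (!page)
		return NULL;

	/*
	 * One conversion covering the whole 2M range; the host can keep
	 * (or later restore) a hugepage mapping for it.
	 */
	if (set_memory_decrypted((unsigned long)page_address(page),
				 1 << order)) {
		/* Conversion state is unclear on failure; leak, don't reuse. */
		return NULL;
	}

	return page_address(page);
}

Guest drivers would then need to be taught to carve their shared
buffers (GHCB pages, virtio buffers, etc.) out of such a pool, which is
really the harder part of the exercise.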

> > the similar lines, it would be great to have "page struct"-less memory
> > working for Confidential VMs, which should greatly reduce the toil
> > with merge/split operations and will render the conversions mostly to
> > be pagetable manipulations.
>
> FWIW, I did some profiling of split/merge vs. overall conversion time
> (by that I mean all cycles spent within kvm_gmem_convert_execute_work()),
> and while split/merge does take quite a few more cycles than your
> average conversion operation (~100x more), the total cycles spent
> splitting/merging ended up being about 7% of the total cycles spent
> handling conversions (1043938460 cycles in this case).
>
> For 1GB, a split/merge takes >1000x more than a normal conversion
> operation (46475980 cycles vs 320 in this sample), but it's probably
> still not too bad vs the overall conversion path, and as mentioned
> above it only happens about 6 times for a 16GB SNP guest, so I don't
> think split/merge overhead is a huge deal for current guests,
> especially if we work toward optimizing guest-side usage of shared
> memory in the future. (There is potential for this to crater
> performance for a very poorly-optimized guest, but I think the guest
> should bear some burden for that sort of thing: e.g. flipping the same
> page back and forth between shared/private vs. caching it for
> continued usage as a shared page in the guest driver path isn't
> something we should put too much effort into optimizing.)

As per discussions in the past, guest_memfd private pages are managed
solely by guest_memfd. We don't need, and effectively don't want, the
kernel to manage guest private memory. So in theory we can get rid of
page structs for private pages, allocating page structs only for shared
memory at conversion time and freeing them on conversion back to
private. And once we have base core-mm allocators that hand out raw
pfns to start with, we don't even need shared memory ranges to be
backed by page structs.

A few hurdles we need to cross:
1) Invent a new filemap equivalent that maps guest_memfd offsets to
   pfns (see the rough sketch at the end of this mail).
2) Modify TDX EPT management to work with pfns and not page structs.
3) Modify generic KVM NPT/EPT management logic to work with pfns and
   not rely on page structs.
4) Modify memory error/hwpoison handling to route all memory errors on
   such pfns to guest_memfd.

I believe there are obvious benefits (reduced complexity, reduced
memory footprint, etc.) if we go this route, and we are very likely to
go this route for future use cases even if we decide to live with the
conversion costs today.
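For 1), the shape I have in mind is roughly the following (purely
illustrative; none of these names exist today): a per-guest_memfd
xarray keyed by file offset that stores raw pfns, so neither the fault
path nor conversion ever needs a struct page:

#include <linux/gfp.h>
#include <linux/types.h>
#include <linux/xarray.h>

/* Hypothetical names; nothing below exists in the kernel today. */
#define GMEM_NO_PFN	(~0UL)

struct gmem_pfn_map {
	struct xarray pfns;	/* file index -> raw pfn, no struct page */
};

static void gmem_pfn_map_init(struct gmem_pfn_map *map)
{
	xa_init(&map->pfns);
}

/* Record the pfn backing a given guest_memfd page index. */
static int gmem_pfn_map_store(struct gmem_pfn_map *map, pgoff_t index,
			      unsigned long pfn)
{
	return xa_err(xa_store(&map->pfns, index, xa_mk_value(pfn),
			       GFP_KERNEL));
}

/* Look up the pfn for a page index, e.g. from the KVM fault path. */
static unsigned long gmem_pfn_map_lookup(struct gmem_pfn_map *map,
					 pgoff_t index)
{
	void *entry = xa_load(&map->pfns, index);

	return entry ? xa_to_value(entry) : GMEM_NO_PFN;
}

Whether something like this lives inside guest_memfd itself or comes
with new core-mm allocators that hand out raw pfns is an open question,
and routing hwpoison (4 above) would additionally need a way to get
from a bad pfn back to its owning guest_memfd.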