Date: Fri, 25 Apr 2025 15:45:20 -0700
References: <38723c5d5e9b530e52f28b9f9f4a6d862ed69bcd.1726009989.git.ackerleytng@google.com>
Subject: Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
From: Ackerley Tng
To: Yan Zhao
Cc: Vishal Annapurve, Chenyi Qiang, tabba@google.com,
 quic_eberman@quicinc.com, roypat@amazon.co.uk, jgg@nvidia.com,
 peterx@redhat.com, david@redhat.com, rientjes@google.com, fvdl@google.com,
 jthoughton@google.com, seanjc@google.com, pbonzini@redhat.com,
 zhiquan1.li@intel.com, fan.du@intel.com, jun.miao@intel.com,
 isaku.yamahata@intel.com, muchun.song@linux.dev, erdemaktas@google.com,
 qperret@google.com, jhubbard@nvidia.com, willy@infradead.org,
 shuah@kernel.org, brauner@kernel.org, bfoster@redhat.com,
 kent.overstreet@linux.dev, pvorel@suse.cz, rppt@kernel.org,
 richard.weiyang@gmail.com, anup@brainfault.org, haibo1.xu@intel.com,
 ajones@ventanamicro.com, vkuznets@redhat.com,
 maciej.wieczor-retman@intel.com, pgonda@google.com, oliver.upton@linux.dev,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org,
 linux-kselftest@vger.kernel.org

Yan Zhao writes:

> On Thu, Apr 24, 2025 at 11:15:11AM -0700, Ackerley Tng wrote:
>> Vishal Annapurve writes:
>>
>> > On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao wrote:
>> >>
>> >> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
>> >> >
>> >> >
>> >> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
>> >> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
>> >> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
>> >> > >>> Yan Zhao writes:
>> >> > >>>
>> >> > >>>> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
>> >> > >>>>> +/*
>> >> > >>>>> + * Allocates and then caches a folio in the filemap. Returns a folio with
>> >> > >>>>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
>> >> > >>>>> + */
>> >> > >>>>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
>> >> > >>>>> +                                                            pgoff_t index)
>> >> > >>>>> +{
>> >> > >>>>> +        struct kvm_gmem_hugetlb *hgmem;
>> >> > >>>>> +        pgoff_t aligned_index;
>> >> > >>>>> +        struct folio *folio;
>> >> > >>>>> +        int nr_pages;
>> >> > >>>>> +        int ret;
>> >> > >>>>> +
>> >> > >>>>> +        hgmem = kvm_gmem_hgmem(inode);
>> >> > >>>>> +        folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
>> >> > >>>>> +        if (IS_ERR(folio))
>> >> > >>>>> +                return folio;
>> >> > >>>>> +
>> >> > >>>>> +        nr_pages = 1UL << huge_page_order(hgmem->h);
>> >> > >>>>> +        aligned_index = round_down(index, nr_pages);
>> >> > >>>> Maybe a gap here.
>> >> > >>>>
>> >> > >>>> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
>> >> > >>>> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
>> >> > >>>> corresponding GFN is not 2M/1G aligned.
>> >> > >>>
>> >> > >>> Thanks for looking into this.
>> >> > >>>
>> >> > >>> In 1G page support for guest_memfd, the offset and size are always
>> >> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
>> >> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
>> >> > >>> slot->npages may not be hugepage aligned.
>> >> > >>>
>> >> > >>>>
>> >> > >>>> However, TDX requires that private huge pages be 2M aligned in GFN.
>> >> > >>>>
>> >> > >>>
>> >> > >>> IIUC other factors also contribute to determining the mapping level in
>> >> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
>> >> > >>> in kvm_x86_ops.
>> >> > >>>
>> >> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
>> >> > >>> will track that and not allow faulting into guest page tables at higher
>> >> > >>> granularity.
>> >> > >>
>> >> > >> lpage_info only checks the alignments of slot->base_gfn and
>> >> > >> slot->base_gfn + npages. e.g.,
>> >> > >>
>> >> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
>> >> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
>> >> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
>> >> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
>> >> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
>> >> >
>> >> > Should it be?
>> >> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
>> >> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
>> >> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
>> >> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
>> >> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
>> >> Right. Good catch. Thanks!
>> >>
>> >> Let me update the example as below:
>> >> slot->base_gfn is 2 (for GPA 8KB), npages 2048 (for an 8MB range)
>> >>
>> >> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
>> >> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
>> >> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
>> >> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
>> >> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
>> >>
>> >> lpage_info indicates that a 2MB mapping is allowed to cover GPA 4MB and GPA
>> >> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
>> >> 2MB folios, whose physical addresses may not be contiguous.
>> >>
>> >> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
>> >> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
>> >> However, guest_memfd just allocates the same 2MB folio for both faults.
>> >>
>> >>
>> >> > >>
>> >> > >>
>> >> > >> ----------------------------------------------------
>> >> > >> |       |  |        |  |        |  |        |  |
>> >> > >> 8K      2M 2M+8K    4M 4M+8K    6M 6M+8K    8M 8M+8K
>> >> > >>
>> >> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
>> >> > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
>> >> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
>> >> > > Sorry, sent too fast this morning. The example is not right. The correct
>> >> > > one is:
>> >> > >
>> >> > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
>> >> > > KVM will create a 2M mapping for them.
>> >> > >
>> >> > > However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
>> >> > > same 2M folio and physical addresses may not be contiguous.
>> >
>> > Then during binding, the guest_memfd offset misalignment with hugepage
>> > should be the same as the gfn misalignment, i.e.
>> >
>> > (offset & ~huge_page_mask(h)) == ((slot->base_gfn << PAGE_SHIFT) &
>> > ~huge_page_mask(h));
>> >
>> > For non guest_memfd backed scenarios, KVM allows slot gfn ranges that
>> > are not hugepage aligned, so guest_memfd should also be able to
>> > support non-hugepage aligned memslots.
>> >
>>
>> I drew up a picture [1] which hopefully clarifies this.
>>
>> Thanks for pointing this out, I understand better now and we will add an
>> extra constraint during memslot binding of guest_memfd to check that gfn
>> offsets within a hugepage match the guest_memfd offsets within a hugepage.
> I'm a bit confused.
>
> As "index = gfn - slot->base_gfn + slot->gmem.pgoff", do you mean you are going
> to force "slot->base_gfn == slot->gmem.pgoff" ?
>
> For some memory region, e.g., "pc.ram", it's divided into 2 parts:
> - one with offset 0, size 0x80000000(2G),
>   positioned at GPA 0, which is below GPA 4G;
> - one with offset 0x80000000(2G), size 0x80000000(2G),
>   positioned at GPA 0x100000000(4G), which is above GPA 4G.
>
> For the second part, its slot->base_gfn is 0x100000000, while slot->gmem.pgoff
> is 0x80000000.
>

Nope, I don't mean to enforce that they are equal; we just need the
offsets within the page to be equal.

I edited Vishal's code snippet, perhaps it would help explain better:

page_size is the size of the hugepage, so in our example,

  page_size = SZ_2M;
  page_mask = ~(page_size - 1);
  offset_within_page = (slot->gmem.pgoff << PAGE_SHIFT) & ~page_mask;
  gfn_within_page = (slot->base_gfn << PAGE_SHIFT) & ~page_mask;

We will enforce that

  offset_within_page == gfn_within_page;

>> Adding checks at binding time will allow hugepage-unaligned offsets (to
>> be at parity with non-guest_memfd backing memory) but still fix this
>> issue.
>>
>> lpage_info will make sure that ranges near the bounds will be
>> fragmented, but the hugepages in the middle will still be mappable as
>> hugepages.
>>
>> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg
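
For anyone following the arithmetic, the bind-time check and the index
computation discussed in this thread can be sketched in plain userspace C.
This is only an illustrative sketch under stated assumptions, not the kernel
implementation: struct demo_slot, the demo_* helpers, DEMO_PAGE_SHIFT (4KB
base pages assumed) and DEMO_SZ_2M are hypothetical names made up for this
example, and gmem_pgoff is assumed to be in units of pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12              /* assumption: 4KB base pages */
#define DEMO_SZ_2M      (2ULL << 20)    /* 2MB hugepage size in bytes */

/* Hypothetical stand-in for the two memslot fields discussed in the thread. */
struct demo_slot {
	uint64_t base_gfn;      /* first guest frame number covered by the slot */
	uint64_t gmem_pgoff;    /* guest_memfd offset of the slot, in pages */
};

/*
 * Proposed constraint: the byte offset within a hugepage must be the same
 * for the slot's base GPA and for its guest_memfd offset. The two values
 * themselves do not have to be equal.
 */
static inline bool demo_binding_aligned(const struct demo_slot *slot,
					uint64_t page_size)
{
	uint64_t low_bits = page_size - 1;      /* page_size is a power of two */
	uint64_t offset_within_page =
		(slot->gmem_pgoff << DEMO_PAGE_SHIFT) & low_bits;
	uint64_t gpa_within_page =
		(slot->base_gfn << DEMO_PAGE_SHIFT) & low_bits;

	return offset_within_page == gpa_within_page;
}

/*
 * index = gfn - base_gfn + pgoff, then rounded down to the hugepage
 * boundary (nr_pages small pages per hugepage) to find the backing folio.
 */
static inline uint64_t demo_folio_index(const struct demo_slot *slot,
					uint64_t gfn, uint64_t nr_pages)
{
	uint64_t index = gfn - slot->base_gfn + slot->gmem_pgoff;

	return index - (index % nr_pages);
}
```

With these helpers, the upper "pc.ram" slot from the example (GPA 4G is
gfn 0x100000, offset 2G is pgoff 0x80000) passes demo_binding_aligned() for
2M pages even though base_gfn and pgoff differ, while a slot with base_gfn 2
and pgoff 0 is rejected. For that rejected slot, demo_folio_index() with
nr_pages 512 maps gfn 1024 (GPA 4MB) to aligned index 512 but gfn 1028
(GPA 4MB+16KB) to aligned index 1024, which is the two-folio mismatch
described earlier in the thread.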