Date: Thu, 24 Apr 2025 11:15:11 -0700
Subject: Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
From: Ackerley Tng
To: Vishal Annapurve, Yan Zhao
Cc: Chenyi Qiang, tabba@google.com, quic_eberman@quicinc.com, roypat@amazon.co.uk,
 jgg@nvidia.com, peterx@redhat.com, david@redhat.com, rientjes@google.com,
 fvdl@google.com, jthoughton@google.com, seanjc@google.com, pbonzini@redhat.com,
 zhiquan1.li@intel.com, fan.du@intel.com, jun.miao@intel.com,
 isaku.yamahata@intel.com, muchun.song@linux.dev, erdemaktas@google.com,
 qperret@google.com, jhubbard@nvidia.com, willy@infradead.org, shuah@kernel.org,
 brauner@kernel.org, bfoster@redhat.com, kent.overstreet@linux.dev, pvorel@suse.cz,
 rppt@kernel.org, richard.weiyang@gmail.com, anup@brainfault.org,
 haibo1.xu@intel.com, ajones@ventanamicro.com, vkuznets@redhat.com,
 maciej.wieczor-retman@intel.com, pgonda@google.com, oliver.upton@linux.dev,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org,
 linux-kselftest@vger.kernel.org
References: <38723c5d5e9b530e52f28b9f9f4a6d862ed69bcd.1726009989.git.ackerleytng@google.com>

Vishal Annapurve writes:

> On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao wrote:
>>
>> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
>> >
>> >
>> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
>> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
>> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
>> > >>> Yan Zhao writes:
>> > >>>
>> > >>>> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
>> > >>>>> +/*
>> > >>>>> + * Allocates and then caches a folio in the filemap. Returns a folio with
>> > >>>>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
>> > >>>>> + */
>> > >>>>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
>> > >>>>> +                                                             pgoff_t index)
>> > >>>>> +{
>> > >>>>> +        struct kvm_gmem_hugetlb *hgmem;
>> > >>>>> +        pgoff_t aligned_index;
>> > >>>>> +        struct folio *folio;
>> > >>>>> +        int nr_pages;
>> > >>>>> +        int ret;
>> > >>>>> +
>> > >>>>> +        hgmem = kvm_gmem_hgmem(inode);
>> > >>>>> +        folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
>> > >>>>> +        if (IS_ERR(folio))
>> > >>>>> +                return folio;
>> > >>>>> +
>> > >>>>> +        nr_pages = 1UL << huge_page_order(hgmem->h);
>> > >>>>> +        aligned_index = round_down(index, nr_pages);
>> > >>>> Maybe a gap here.
>> > >>>>
>> > >>>> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
>> > >>>> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
>> > >>>> corresponding GFN is not 2M/1G aligned.
>> > >>>
>> > >>> Thanks for looking into this.
>> > >>>
>> > >>> In 1G page support for guest_memfd, the offset and size are always
>> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
>> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
>> > >>> slot->npages may not be hugepage aligned.
>> > >>>
>> > >>>>
>> > >>>> However, TDX requires that private huge pages be 2M aligned in GFN.
>> > >>>>
>> > >>>
>> > >>> IIUC other factors also contribute to determining the mapping level in
>> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
>> > >>> in kvm_x86_ops.
>> > >>>
>> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
>> > >>> will track that and not allow faulting into guest page tables at higher
>> > >>> granularity.
>> > >>
>> > >> lpage_info only checks the alignments of slot->base_gfn and
>> > >> slot->base_gfn + npages. e.g.,
>> > >>
>> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
>> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
>> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
>> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
>> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
>> >
>> > Should it be?
>> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
>> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
>> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
>> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
>> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
>> Right. Good catch. Thanks!
>>
>> Let me update the example as below:
>> slot->base_gfn is 2 (for GPA 8KB), npages is 2048 (for an 8MB range)
>>
>> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
>> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
>> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
>> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
>> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
>>
>> lpage_info indicates that a 2MB mapping is allowed to cover GPA 4MB and GPA
>> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
>> 2MB folios, whose physical addresses may not be contiguous.
>>
>> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
>> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
>> However, guest_memfd just allocates the same 2MB folio for both faults.
>>
>>
>> >
>> > >>
>> > >> ---------------------------------------------------------
>> > >> |    |      |     |      |     |      |     |      |
>> > >> 8K   2M   2M+8K   4M   4M+8K   6M   6M+8K   8M   8M+8K
>> > >>
>> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
>> > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
>> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
>> > > Sorry, sent too fast this morning. The example is not right. The correct
>> > > one is:
>> > >
>> > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
>> > > KVM will create a 2M mapping for them.
>> > >
>> > > However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
>> > > same 2M folio and physical addresses may not be contiguous.
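
To make the index arithmetic concrete, here is a small standalone userspace
program (not kernel code) that plugs in the numbers from Yan's example above.
The gfn-to-index conversion (index = gfn - slot->base_gfn + slot->gmem.pgoff)
and the rounding down to the hugepage order mirror what the patch does, as I
understand it:

/*
 * Standalone illustration (userspace, not kernel code) of the index
 * arithmetic above: slot->base_gfn = 2 (GPA 8KB), slot->gmem.pgoff = 0,
 * 2MB hugepages.
 */
#include <stdio.h>

#define PAGE_SHIFT      12
#define PAGES_PER_2M    512UL   /* 2MB / 4KB */

/* index = gfn - slot->base_gfn + slot->gmem.pgoff */
static unsigned long gpa_to_index(unsigned long gpa, unsigned long base_gfn,
                                  unsigned long pgoff)
{
        return (gpa >> PAGE_SHIFT) - base_gfn + pgoff;
}

int main(void)
{
        unsigned long base_gfn = 2;     /* slot starts at GPA 8KB */
        unsigned long gpas[] = { 4UL << 20, (4UL << 20) + (16UL << 10) };

        for (int i = 0; i < 2; i++) {
                unsigned long index = gpa_to_index(gpas[i], base_gfn, 0);
                unsigned long aligned_index = index - (index % PAGES_PER_2M);

                printf("GPA %#lx -> index %lu -> aligned_index %lu\n",
                       gpas[i], index, aligned_index);
        }
        return 0;
}

It prints aligned_index 512 for GPA 4MB and aligned_index 1024 for GPA
4MB+16KB: both GPAs sit inside one lpage_info-allowed 2MB mapping ([4MB, 6MB))
but land in two different hugetlb folios, which is exactly the problem
described above.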
>
> Then during binding, guest memfd offset misalignment with hugepage
> should be same as gfn misalignment, i.e.
>
> (offset & ~huge_page_mask(h)) == ((slot->base_gfn << PAGE_SHIFT) &
> ~huge_page_mask(h));
>
> For non guest_memfd backed scenarios, KVM allows slot gfn ranges that
> are not hugepage aligned, so guest_memfd should also be able to
> support non-hugepage aligned memslots.
>

I drew up a picture [1] which hopefully clarifies this.

Thanks for pointing this out, I understand better now. We will add an
extra constraint during memslot binding of guest_memfd to check that a
gfn's offset within a hugepage matches the guest_memfd offset within a
hugepage (a rough sketch of such a check is at the end of this mail).

Adding checks at binding time will allow hugepage-unaligned offsets (to
be at parity with non-guest_memfd backing memory) but still fix this
issue. lpage_info will make sure that ranges near the bounds will be
fragmented, but the hugepages in the middle will still be mappable as
hugepages.

[1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg

>> > >
>> > >
>> > >> However, for TDX, GFN 6M and GFN 6M+4K should not belong to the same folio.
>> > >> It's also weird for a 2M mapping in KVM to stride across 2 huge folios.
>> > >>
>> > >>> Hence I think it is okay to leave it to KVM to fault pages into the
>> > >>> guest correctly. guest_memfd will just maintain the invariant that
>> > >>> offset and size are hugepage aligned, but not require that
>> > >>> slot->base_gfn and slot->npages are hugepage aligned. This behavior will
>> > >>> be consistent with other backing memory for guests like regular shmem or
>> > >>> HugeTLB.
>> > >>>
>> > >>>>> +        ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
>> > >>>>> +                                                 aligned_index,
>> > >>>>> +                                                 htlb_alloc_mask(hgmem->h));
>> > >>>>> +        WARN_ON(ret);
>> > >>>>> +
>> > >>>>>          spin_lock(&inode->i_lock);
>> > >>>>>          inode->i_blocks += blocks_per_huge_page(hgmem->h);
>> > >>>>>          spin_unlock(&inode->i_lock);
>> > >>>>>
>> > >>>>> -        return page_folio(requested_page);
>> > >>>>> +        return folio;
>> > >>>>> +}
>> > >
>>
>
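
P.S.: for concreteness, a rough sketch of the binding-time check mentioned
above, directly restating Vishal's expression. The function name and the
exact call site in the binding path are placeholders, not the actual series
code:

/*
 * Sketch only: reject a guest_memfd binding unless the file offset and the
 * slot's base GPA have the same misalignment within a hugepage, so that a
 * hugepage-aligned index in the file always corresponds to a
 * hugepage-aligned GFN.
 */
static bool kvm_gmem_offset_gfn_misalignment_matches(struct hstate *h,
                                                     loff_t offset,
                                                     gfn_t base_gfn)
{
        /* ~huge_page_mask(h) keeps only the offset within one hugepage. */
        return (offset & ~huge_page_mask(h)) ==
               ((base_gfn << PAGE_SHIFT) & ~huge_page_mask(h));
}

The binding path would refuse the memslot if this returns false; everything
else (fragmenting the unaligned head and tail) is left to lpage_info as
described above.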