From: Vishal Annapurve
Date: Thu, 24 Apr 2025 07:10:28 -0700
Subject: Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
To: Yan Zhao
Cc: Chenyi Qiang, Ackerley Tng, tabba@google.com,
 quic_eberman@quicinc.com, roypat@amazon.co.uk, jgg@nvidia.com,
 peterx@redhat.com, david@redhat.com, rientjes@google.com,
 fvdl@google.com, jthoughton@google.com, seanjc@google.com,
 pbonzini@redhat.com, zhiquan1.li@intel.com, fan.du@intel.com,
 jun.miao@intel.com, isaku.yamahata@intel.com, muchun.song@linux.dev,
 erdemaktas@google.com, qperret@google.com, jhubbard@nvidia.com,
 willy@infradead.org, shuah@kernel.org, brauner@kernel.org,
 bfoster@redhat.com, kent.overstreet@linux.dev, pvorel@suse.cz,
 rppt@kernel.org, richard.weiyang@gmail.com, anup@brainfault.org,
 haibo1.xu@intel.com, ajones@ventanamicro.com, vkuznets@redhat.com,
 maciej.wieczor-retman@intel.com, pgonda@google.com,
 oliver.upton@linux.dev, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, kvm@vger.kernel.org,
 linux-kselftest@vger.kernel.org
References: <38723c5d5e9b530e52f28b9f9f4a6d862ed69bcd.1726009989.git.ackerleytng@google.com>

On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao wrote:
>
> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
> >
> >
> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> > >>> Yan Zhao writes:
> > >>>
> > >>>> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
> > >>>>> +/*
> > >>>>> + * Allocates and then caches a folio in the filemap. Returns a folio with
> > >>>>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> > >>>>> + */
> > >>>>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
> > >>>>> +							    pgoff_t index)
> > >>>>> +{
> > >>>>> +	struct kvm_gmem_hugetlb *hgmem;
> > >>>>> +	pgoff_t aligned_index;
> > >>>>> +	struct folio *folio;
> > >>>>> +	int nr_pages;
> > >>>>> +	int ret;
> > >>>>> +
> > >>>>> +	hgmem = kvm_gmem_hgmem(inode);
> > >>>>> +	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> > >>>>> +	if (IS_ERR(folio))
> > >>>>> +		return folio;
> > >>>>> +
> > >>>>> +	nr_pages = 1UL << huge_page_order(hgmem->h);
> > >>>>> +	aligned_index = round_down(index, nr_pages);
> > >>>> Maybe a gap here.
> > >>>>
> > >>>> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
> > >>>> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> > >>>> corresponding GFN is not 2M/1G aligned.
> > >>>
> > >>> Thanks for looking into this.
> > >>>
> > >>> In 1G page support for guest_memfd, the offset and size are always
> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
> > >>> slot->npages may not be hugepage aligned.
> > >>>
> > >>>>
> > >>>> However, TDX requires that private huge pages be 2M aligned in GFN.
> > >>>>
> > >>>
> > >>> IIUC other factors also contribute to determining the mapping level in
> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
> > >>> in kvm_x86_ops.
> > >>>
> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
> > >>> will track that and not allow faulting into guest page tables at higher
> > >>> granularity.
> > >>
> > >> lpage_info only checks the alignments of slot->base_gfn and
> > >> slot->base_gfn + npages. e.g.,
> > >>
> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
> >
> > Should it be?
> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
> Right. Good catch. Thanks!
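
As a cross-check of the corrected layout above, here is a minimal
userspace sketch of how the 2M-level entries fall out of base_gfn and
npages. It only models the head/tail alignment checks discussed in this
thread and is not kernel code: model_lpage_info() and PAGES_PER_2M are
names made up for illustration, and it assumes 4KB base pages.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4KB base pages, so one 2MB region covers 512 GFNs. */
#define PAGES_PER_2M 512ULL

/*
 * Hypothetical userspace model (not kernel code): fill disallow[] with one
 * flag per 2MB-aligned GFN region touched by the slot and return the number
 * of regions, or 0 if npages is 0 or the caller's array is too small.
 */
static size_t model_lpage_info(uint64_t base_gfn, uint64_t npages,
			       int *disallow, size_t max_entries)
{
	uint64_t first, last;
	size_t n, i;

	if (npages == 0)
		return 0;

	first = base_gfn / PAGES_PER_2M;
	last = (base_gfn + npages - 1) / PAGES_PER_2M;
	n = (size_t)(last - first + 1);
	if (n > max_entries)
		return 0;

	for (i = 0; i < n; i++)
		disallow[i] = 0;

	/* Only the head and the tail of the slot are checked for alignment. */
	if (base_gfn % PAGES_PER_2M)
		disallow[0] = 1;
	if ((base_gfn + npages) % PAGES_PER_2M)
		disallow[n - 1] = 1;

	return n;
}
```

Taking the slot to start at GPA 8KB and span 8MB (base_gfn = 2,
npages = 2048 under the 4KB-page assumption), this yields five entries
with only the first and the last marked as disallowed, which matches
the list above.
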
>
> Let me update the example as below:
> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
>
> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
>
> lpage_info indicates that a 2MB mapping is allowed to cover GPA 4MB and GPA
> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
> 2MB folios, whose physical addresses may not be contiguous.
>
> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
> However, guest_memfd just allocates the same 2MB folio for both faults.
>
>
> >
> > >>
> > >> -----------------------------------------------------------
> > >> |          |  |           |  |           |  |          |  |
> > >> 8K         2M 2M+8K       4M 4M+8K       6M 6M+8K      8M 8M+8K
> > >>
> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
> > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
> > > Sorry, sent too fast this morning. The example is not right. The correct
> > > one is:
> > >
> > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
> > > KVM will create a 2M mapping for them.
> > >
> > > However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
> > > same 2M folio and physical addresses may not be contiguous.

Then, during binding, the guest_memfd offset's misalignment with respect
to the hugepage size should be the same as the GFN's misalignment, i.e.
(offset & ~huge_page_mask(h)) == ((slot->base_gfn << PAGE_SHIFT) &
~huge_page_mask(h));

For non guest_memfd backed scenarios, KVM allows slot gfn ranges that
are not hugepage aligned, so guest_memfd should also be able to
support non-hugepage aligned memslots.

> > >
> > >
> > >> However, for TDX, GFN 6M and GFN 6M+4K should not belong to the same folio.
> > >> It's also weird for a 2M mapping in KVM to stride across 2 huge folios.
> > >>
> > >>> Hence I think it is okay to leave it to KVM to fault pages into the
> > >>> guest correctly. For guest_memfd will just maintain the invariant that
> > >>> offset and size are hugepage aligned, but not require that
> > >>> slot->base_gfn and slot->npages are hugepage aligned. This behavior will
> > >>> be consistent with other backing memory for guests like regular shmem or
> > >>> HugeTLB.
> > >>>
> > >>>>> +	ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
> > >>>>> +						 aligned_index,
> > >>>>> +						 htlb_alloc_mask(hgmem->h));
> > >>>>> +	WARN_ON(ret);
> > >>>>> +
> > >>>>>  	spin_lock(&inode->i_lock);
> > >>>>>  	inode->i_blocks += blocks_per_huge_page(hgmem->h);
> > >>>>>  	spin_unlock(&inode->i_lock);
> > >>>>>
> > >>>>> -	return page_folio(requested_page);
> > >>>>> +	return folio;
> > >>>>> +}
> > >
> >
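
To make the binding-time condition suggested earlier in this reply
concrete, here is a rough userspace sketch of the check. It is only an
illustration under stated assumptions, not a proposed patch:
gmem_binding_congruent() and MODEL_PAGE_SHIFT are hypothetical names,
offset is taken to be a byte offset into the guest_memfd, pages are
assumed to be 4KB, and hpage_size is assumed to be a power of two so
that (hpage_size - 1) plays the role of ~huge_page_mask(h).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4KB base pages. */
#define MODEL_PAGE_SHIFT 12

/*
 * Hypothetical helper (not kernel code): true when the guest_memfd byte
 * offset and the slot's starting GPA have the same misalignment within a
 * hugepage of hpage_size bytes. hpage_size must be a power of two.
 */
static bool gmem_binding_congruent(uint64_t offset, uint64_t base_gfn,
				   uint64_t hpage_size)
{
	/* Low bits that select a position inside one hugepage. */
	uint64_t in_hpage_mask = hpage_size - 1;
	uint64_t gpa = base_gfn << MODEL_PAGE_SHIFT;

	return (offset & in_hpage_mask) == (gpa & in_hpage_mask);
}
```

For the slot in the example (base_gfn = 2, i.e. GPA 8KB) with 2MB
hugepages, a binding at offset 0 fails this check, while a binding at
offset 8KB passes, so a 2MB-aligned GPA range would then map onto a
single 2MB-aligned range of file offsets.
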