Date: Wed, 30 Apr 2025 13:09:33 -0700
In-Reply-To: (message from Yan Zhao on Mon, 28 Apr 2025 09:05:32 +0800)
Subject: Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
From: Ackerley Tng
To: Yan Zhao
Cc: vannapurve@google.com, chenyi.qiang@intel.com, tabba@google.com, quic_eberman@quicinc.com, roypat@amazon.co.uk, jgg@nvidia.com, peterx@redhat.com, david@redhat.com, rientjes@google.com, fvdl@google.com, jthoughton@google.com, seanjc@google.com, pbonzini@redhat.com, zhiquan1.li@intel.com, fan.du@intel.com, jun.miao@intel.com, isaku.yamahata@intel.com, muchun.song@linux.dev, erdemaktas@google.com, qperret@google.com, jhubbard@nvidia.com, willy@infradead.org, shuah@kernel.org, brauner@kernel.org, bfoster@redhat.com, kent.overstreet@linux.dev, pvorel@suse.cz, rppt@kernel.org, richard.weiyang@gmail.com, anup@brainfault.org, haibo1.xu@intel.com, ajones@ventanamicro.com, vkuznets@redhat.com, maciej.wieczor-retman@intel.com, pgonda@google.com, oliver.upton@linux.dev, linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org
Yan Zhao writes:

> On Fri, Apr 25, 2025 at 03:45:20PM -0700, Ackerley Tng wrote:
>> Yan Zhao writes:
>>
>> > On Thu, Apr 24, 2025 at 11:15:11AM -0700, Ackerley Tng wrote:
>> >> Vishal Annapurve writes:
>> >>
>> >> > On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao wrote:
>> >> >>
>> >> >> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
>> >> >> >
>> >> >> >
>> >> >> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
>> >> >> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
>> >> >> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
>> >> >> > >>> Yan Zhao writes:
>> >> >> > >>>
>> >> >> > >>>> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
>> >> >> > >>>>> +/*
>> >> >> > >>>>> + * Allocates and then caches a folio in the filemap. Returns a folio with
>> >> >> > >>>>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
>> >> >> > >>>>> + */
>> >> >> > >>>>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
>> >> >> > >>>>> +							     pgoff_t index)
>> >> >> > >>>>> +{
>> >> >> > >>>>> +	struct kvm_gmem_hugetlb *hgmem;
>> >> >> > >>>>> +	pgoff_t aligned_index;
>> >> >> > >>>>> +	struct folio *folio;
>> >> >> > >>>>> +	int nr_pages;
>> >> >> > >>>>> +	int ret;
>> >> >> > >>>>> +
>> >> >> > >>>>> +	hgmem = kvm_gmem_hgmem(inode);
>> >> >> > >>>>> +	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
>> >> >> > >>>>> +	if (IS_ERR(folio))
>> >> >> > >>>>> +		return folio;
>> >> >> > >>>>> +
>> >> >> > >>>>> +	nr_pages = 1UL << huge_page_order(hgmem->h);
>> >> >> > >>>>> +	aligned_index = round_down(index, nr_pages);
>> >> >> > >>>> Maybe a gap here.
>> >> >> > >>>>
>> >> >> > >>>> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
>> >> >> > >>>> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
>> >> >> > >>>> corresponding GFN is not 2M/1G aligned.
>> >> >> > >>>
>> >> >> > >>> Thanks for looking into this.
>> >> >> > >>>
>> >> >> > >>> In 1G page support for guest_memfd, the offset and size are always
>> >> >> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
>> >> >> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
>> >> >> > >>> slot->npages may not be hugepage aligned.
>> >> >> > >>>
>> >> >> > >>>> However, TDX requires that private huge pages be 2M aligned in GFN.
>> >> >> > >>>
>> >> >> > >>> IIUC other factors also contribute to determining the mapping level in
>> >> >> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
>> >> >> > >>> in kvm_x86_ops.
>> >> >> > >>>
>> >> >> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
>> >> >> > >>> will track that and not allow faulting into guest page tables at higher
>> >> >> > >>> granularity.
>> >> >> > >>
>> >> >> > >> lpage_info only checks the alignments of slot->base_gfn and
>> >> >> > >> slot->base_gfn + npages. e.g.,
>> >> >> > >>
>> >> >> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
>> >> >> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
>> >> >> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
>> >> >> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
>> >> >> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
>> >> >> >
>> >> >> > Should it be?
>> >> >> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
>> >> >> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
>> >> >> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
>> >> >> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
>> >> >> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
>> >> >> Right. Good catch. Thanks!
>> >> >>
>> >> >> Let me update the example as below:
>> >> >> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for an 8MB range)
>> >> >>
>> >> >> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
>> >> >> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
>> >> >> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
>> >> >> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
>> >> >> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
>> >> >>
>> >> >> lpage_info indicates that a 2MB mapping is allowed to cover GPA 4MB and GPA
>> >> >> 4MB+16KB.
>> >> >> However, their aligned_index values lead guest_memfd to allocate two
>> >> >> 2MB folios, whose physical addresses may not be contiguous.
>> >> >>
>> >> >> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
>> >> >> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
>> >> >> However, guest_memfd just allocates the same 2MB folio for both faults.
>> >> >>
>> >> >> >
>> >> >> > >>
>> >> >> > >> ---------------------------------------------------------
>> >> >> > >> |     |      |     |      |     |      |     |      |
>> >> >> > >> 8K    2M   2M+8K   4M   4M+8K   6M   6M+8K   8M   8M+8K
>> >> >> > >>
>> >> >> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
>> >> >> > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
>> >> >> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
>> >> >> > > Sorry, sent too fast this morning. The example is not right. The correct
>> >> >> > > one is:
>> >> >> > >
>> >> >> > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
>> >> >> > > KVM will create a 2M mapping for them.
>> >> >> > >
>> >> >> > > However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
>> >> >> > > same 2M folio and physical addresses may not be contiguous.
>> >> >
>> >> > Then during binding, guest memfd offset misalignment with hugepage
>> >> > should be same as gfn misalignment. i.e.
>> >> >
>> >> > (offset & ~huge_page_mask(h)) == ((slot->base_gfn << PAGE_SHIFT) &
>> >> > ~huge_page_mask(h));
>> >> >
>> >> > For non guest_memfd backed scenarios, KVM allows slot gfn ranges that
>> >> > are not hugepage aligned, so guest_memfd should also be able to
>> >> > support non-hugepage aligned memslots.
>> >> >
>> >>
>> >> I drew up a picture [1] which hopefully clarifies this.
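(As an aside, the mismatch in the updated example above can be reproduced with a standalone sketch in plain Python. This is illustrative only, not kernel code; the helper names are made up, but the arithmetic mirrors the quoted `index = gfn - slot->base_gfn + slot->gmem.pgoff` and `round_down(index, nr_pages)` steps.)

```python
# Illustrative sketch of the index arithmetic discussed above (not kernel code).
PAGE_SHIFT = 12    # 4K base pages
HPAGE_PAGES = 512  # pages per 2M huge page

def gmem_index(gfn, base_gfn, pgoff):
    # index = gfn - slot->base_gfn + slot->gmem.pgoff (all in 4K pages)
    return gfn - base_gfn + pgoff

def aligned_index(index):
    # round_down(index, nr_pages), as in the quoted allocation path
    return (index // HPAGE_PAGES) * HPAGE_PAGES

# Thread's example: slot->base_gfn = 2 (GPA 8KB), slot->gmem.pgoff = 0.
base_gfn, pgoff = 2, 0
gfn_4m = (4 << 20) >> PAGE_SHIFT                     # GPA 4MB     -> gfn 1024
gfn_4m_16k = ((4 << 20) + (16 << 10)) >> PAGE_SHIFT  # GPA 4MB+16KB -> gfn 1028

folio_a = aligned_index(gmem_index(gfn_4m, base_gfn, pgoff))      # 512
folio_b = aligned_index(gmem_index(gfn_4m_16k, base_gfn, pgoff))  # 1024

# lpage_info would allow one 2M mapping covering both GPAs, yet guest_memfd
# resolves them to two different 2M folios.
assert folio_a != folio_b
```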
>> >>
>> >> Thanks for pointing this out, I understand better now and we will add an
>> >> extra constraint during memslot binding of guest_memfd to check that gfn
>> >> offsets within a hugepage must match guest_memfd offsets within a hugepage.
>> > I'm a bit confused.
>> >
>> > As "index = gfn - slot->base_gfn + slot->gmem.pgoff", do you mean you are going
>> > to force "slot->base_gfn == slot->gmem.pgoff" ?
>> >
>> > For some memory region, e.g., "pc.ram", it's divided into 2 parts:
>> > - one with offset 0, size 0x80000000(2G),
>> >   positioned at GPA 0, which is below GPA 4G;
>> > - one with offset 0x80000000(2G), size 0x80000000(2G),
>> >   positioned at GPA 0x100000000(4G), which is above GPA 4G.
>> >
>> > For the second part, its slot->base_gfn is 0x100000000, while slot->gmem.pgoff
>> > is 0x80000000.
>> >
>>
>> Nope, I don't mean to enforce that they are equal, we just need the
>> offsets within the page to be equal.
>>
>> I edited Vishal's code snippet, perhaps it would help explain better:
>>
>> page_size is the size of the hugepage, so in our example,
>>
>> page_size = SZ_2M;
>> page_mask = ~(page_size - 1);
> page_mask = page_size - 1 ?
>

Yes, thank you!

>> offset_within_page = slot->gmem.pgoff & page_mask;
>> gfn_within_page = (slot->base_gfn << PAGE_SHIFT) & page_mask;
>>
>> We will enforce that
>>
>> offset_within_page == gfn_within_page;
> For "pc.ram", if it has 2.5G below 4G, it would be configured as follows
> - slot 1: slot->gmem.pgoff=0, base GPA 0, size=2.5G
> - slot 2: slot->gmem.pgoff=2.5G, base GPA 4G, size=1.5G
>
> When binding these two slots to the same guest_memfd created with flag
> KVM_GUEST_MEMFD_HUGE_1GB:
> - binding the 1st slot will succeed;
> - binding the 2nd slot will fail.
>
> What options does userspace have in this scenario?
> It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff
> isn't ideal either.
>
> What about something similar as below?
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d2feacd14786..87c33704a748 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
>         }
>
>         *pfn = folio_file_pfn(folio, index);
> -       if (max_order)
> -               *max_order = folio_order(folio);
> +       if (max_order) {
> +               int order;
> +
> +               order = folio_order(folio);
> +
> +               while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))
> +                       order--;
> +
> +               *max_order = order;
> +       }
>
>         *is_prepared = folio_test_uptodate(folio);
>         return folio;

Vishal was wondering how this is working before guest_memfd was
introduced, for other backing memory like HugeTLB. I then poked around
and found this [1]. I will be adding a similar check for any slot where
kvm_slot_can_be_private(slot).

Yan, that should work, right?

[1] https://github.com/torvalds/linux/blob/b6ea1680d0ac0e45157a819c41b46565f4616186/arch/x86/kvm/x86.c#L12996

>> >> Adding checks at binding time will allow hugepage-unaligned offsets (to
>> >> be at parity with non-guest_memfd backing memory) but still fix this
>> >> issue.
>> >>
>> >> lpage_info will make sure that ranges near the bounds will be
>> >> fragmented, but the hugepages in the middle will still be mappable as
>> >> hugepages.
>> >>
>> >> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg
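For concreteness, the binding-time constraint discussed earlier in the thread (offset-within-hugepage must equal GPA-within-hugepage) can be sketched in plain Python. This is illustrative only, not the actual patch; I am assuming byte-based offsets and a hypothetical helper name:

```python
# Illustrative sketch (not the actual patch) of the binding-time check:
# the misalignment of the guest_memfd offset within a hugepage must equal
# the misalignment of the slot's base GPA. All quantities are in bytes.
SZ_2M = 2 << 20
SZ_1G = 1 << 30

def misalignment_matches(base_gpa, gmem_offset, page_size):
    page_mask = page_size - 1  # low-bits mask, per the correction above
    offset_within_page = gmem_offset & page_mask
    gpa_within_page = base_gpa & page_mask
    return offset_within_page == gpa_within_page

# Yan's "pc.ram" split with 2.5G below 4G:
# slot 1: offset 0 at GPA 0; slot 2: offset 2.5G at GPA 4G.
slot2_ok_1g = misalignment_matches(4 << 30, (5 << 30) // 2, SZ_1G)  # False
slot2_ok_2m = misalignment_matches(4 << 30, (5 << 30) // 2, SZ_2M)  # True
```

With 1G hugepages, slot 2's offset sits 0.5G into a 1G page while its base GPA is 1G-aligned, so the check fails; the same slot would pass at 2M granularity, which is why the fallback of reducing max_order at fault time is being discussed instead.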