Date: Tue, 06 May 2025 12:22:47 -0700
Subject: Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
From: Ackerley Tng
To: Yan Zhao
Cc: vannapurve@google.com, chenyi.qiang@intel.com, tabba@google.com, quic_eberman@quicinc.com, roypat@amazon.co.uk, jgg@nvidia.com, peterx@redhat.com, david@redhat.com, rientjes@google.com, fvdl@google.com, jthoughton@google.com, seanjc@google.com, pbonzini@redhat.com, zhiquan1.li@intel.com, fan.du@intel.com, jun.miao@intel.com, isaku.yamahata@intel.com, muchun.song@linux.dev, erdemaktas@google.com, qperret@google.com, jhubbard@nvidia.com, willy@infradead.org, shuah@kernel.org, brauner@kernel.org, kent.overstreet@linux.dev, pvorel@suse.cz, rppt@kernel.org, richard.weiyang@gmail.com, anup@brainfault.org, haibo1.xu@intel.com, ajones@ventanamicro.com, vkuznets@redhat.com, maciej.wieczor-retman@intel.com, pgonda@google.com, oliver.upton@linux.dev, linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org
Yan Zhao writes:

>> >
>> > What options does userspace have in this scenario?
>> > It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff
>> > isn't ideal either.
>> >
>> > What about something similar as below?
>> >
>> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> > index d2feacd14786..87c33704a748 100644
>> > --- a/virt/kvm/guest_memfd.c
>> > +++ b/virt/kvm/guest_memfd.c
>> > @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
>> >  	}
>> >
>> >  	*pfn = folio_file_pfn(folio, index);
>> > -	if (max_order)
>> > -		*max_order = folio_order(folio);
>> > +	if (max_order) {
>> > +		int order;
>> > +
>> > +		order = folio_order(folio);
>> > +
>> > +		while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))
>> > +			order--;
>> > +
>> > +		*max_order = order;
>> > +	}
>> >
>> >  	*is_prepared = folio_test_uptodate(folio);
>> >  	return folio;
>> >
>>
>> Vishal was wondering how this was working before guest_memfd was
>> introduced, for other backing memory like HugeTLB.
>>
>> I then poked around and found this [1]. I will be adding a similar check
>> for any slot where kvm_slot_can_be_private(slot).
>>
>> Yan, that should work, right?
>
> No, I don't think the checking of ugfn [1] should work.
>
> 1. Even for slots bound to in-place-conversion guest_memfd (i.e. shared
> memory is allocated from guest_memfd), slot->userspace_addr does not
> necessarily have the same offset as slot->gmem.pgoff. Even if we audit the
> offset in kvm_gmem_bind(), userspace could invoke munmap() and mmap()
> afterwards, causing slot->userspace_addr to point to a different offset.
>
> 2. For slots bound to guest_memfd that do not support in-place-conversion,
> shared memory is allocated from a different backend. Therefore, checking
> "slot->base_gfn ^ slot->gmem.pgoff" is required for private memory. The
> check is currently absent because guest_memfd supports 4K only.

Let me clarify, I meant these changes:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b64ab3..d0dccf1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12938,6 +12938,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages)
 	return 0;
 }

+static inline bool kvm_is_level_aligned(u64 value, int level)
+{
+	return IS_ALIGNED(value, KVM_PAGES_PER_HPAGE(level));
+}
+
 static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 				      struct kvm_memory_slot *slot)
 {
@@ -12971,16 +12976,20 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,

 		slot->arch.lpage_info[i - 1] = linfo;

-		if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
+		if (!kvm_is_level_aligned(slot->base_gfn, level))
 			linfo[0].disallow_lpage = 1;
-		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
+		if (!kvm_is_level_aligned(slot->base_gfn + npages, level))
 			linfo[lpages - 1].disallow_lpage = 1;

 		ugfn = slot->userspace_addr >> PAGE_SHIFT;
 		/*
-		 * If the gfn and userspace address are not aligned wrt each
-		 * other, disable large page support for this slot.
+		 * If the gfn and userspace address are not aligned or if gfn
+		 * and guest_memfd offset are not aligned wrt each other,
+		 * disable large page support for this slot.
 		 */
-		if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
+		if (!kvm_is_level_aligned(slot->base_gfn ^ ugfn, level) ||
+		    (kvm_slot_can_be_private(slot) &&
+		     !kvm_is_level_aligned(slot->base_gfn ^ slot->gmem.pgoff,
+					   level))) {
 			unsigned long j;

 			for (j = 0; j < lpages; ++j)

This does not rely on the ugfn check, but adds a similar check for
gmem.pgoff. I think this should take care of case (1.), for guest_memfds
going to be used for both shared and private memory.
Userspace can't update slot->userspace_addr, since guest_memfd memslots
cannot be updated and can only be deleted. If userspace re-uses
slot->userspace_addr for some other memory address without deleting and
re-adding a memslot:

+ KVM's access to memory should still be fine: after the recent discussion
  at the guest_memfd upstream call, KVM's guest faults will always go via
  fd+offset, so KVM's access won't be disrupted there. Whatever checking
  was done at memslot binding time will still be valid.

+ Host accesses and other accesses (e.g. instruction emulation, which uses
  slot->userspace_addr) to guest memory will be broken, but I think
  there's nothing protecting against that. The same breakage would happen
  for a non-guest_memfd memslot.

p.s. I will be adding the validation as you suggested [1], though that
shouldn't make a difference here, since the above check directly validates
against gmem.pgoff.

Regarding 2., this check also validates against gmem.pgoff, so it should
handle that case as well.

[1] https://lore.kernel.org/all/aBnMp26iWWhUrsVf@yzhao56-desk.sh.intel.com/

I prefer checking at binding time because it aligns with the ugfn check
that is already there, and avoids having to check at every fault.

>> [1] https://github.com/torvalds/linux/blob/b6ea1680d0ac0e45157a819c41b46565f4616186/arch/x86/kvm/x86.c#L12996
>>
>> >> >> Adding checks at binding time will allow hugepage-unaligned offsets (to
>> >> >> be at parity with non-guest_memfd backing memory) but still fix this
>> >> >> issue.
>> >> >>
>> >> >> lpage_info will make sure that ranges near the bounds will be
>> >> >> fragmented, but the hugepages in the middle will still be mappable as
>> >> >> hugepages.
>> >> >>
>> >> >> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg