From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 06 May 2025 13:46:58 -0700
In-Reply-To: <39ea3946-6683-462e-af5d-fe7d28ab7d00@redhat.com>
References: <386c1169-8292-43d1-846b-c50cbdc1bc65@redhat.com>
 <7e32aabe-c170-4cfc-99aa-f257d2a69364@redhat.com>
 <39ea3946-6683-462e-af5d-fe7d28ab7d00@redhat.com>
Subject: Re: [PATCH v8 06/13] KVM: x86: Generalize private fault lookups to
 guest_memfd fault lookups
From: Ackerley Tng
To: David Hildenbrand, Sean Christopherson, Vishal Annapurve
Cc: Fuad Tabba, kvm@vger.kernel.org, linux-arm-msm@vger.kernel.org,
 linux-mm@kvack.org, pbonzini@redhat.com, chenhuacai@kernel.org,
 mpe@ellerman.id.au, anup@brainfault.org, paul.walmsley@sifive.com,
 palmer@dabbelt.com, aou@eecs.berkeley.edu, viro@zeniv.linux.org.uk,
 brauner@kernel.org, willy@infradead.org, akpm@linux-foundation.org,
 xiaoyao.li@intel.com, yilun.xu@intel.com, chao.p.peng@linux.intel.com,
 jarkko@kernel.org, amoorthy@google.com, dmatlack@google.com,
 isaku.yamahata@intel.com, mic@digikod.net, vbabka@suse.cz,
 mail@maciej.szmigiero.name, michael.roth@amd.com, wei.w.wang@intel.com,
 liam.merwick@oracle.com, isaku.yamahata@gmail.com,
 kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com,
 steven.price@arm.com, quic_eberman@quicinc.com, quic_mnalajal@quicinc.com,
 quic_tsoni@quicinc.com, quic_svaddagi@quicinc.com,
 quic_cvanscha@quicinc.com, quic_pderrin@quicinc.com,
 quic_pheragu@quicinc.com, catalin.marinas@arm.com, james.morse@arm.com,
 yuzenghui@huawei.com, oliver.upton@linux.dev, maz@kernel.org,
 will@kernel.org, qperret@google.com, keirf@google.com,
 roypat@amazon.co.uk, shuah@kernel.org, hch@infradead.org, jgg@nvidia.com,
 rientjes@google.com, jhubbard@nvidia.com, fvdl@google.com,
 hughd@google.com, jthoughton@google.com, peterx@redhat.com,
 pankaj.gupta@amd.com
Content-Type: text/plain; charset="utf-8"

David
Hildenbrand writes:

> On 06.05.25 15:58, Sean Christopherson wrote:
>> On Mon, May 05, 2025, Vishal Annapurve wrote:
>>> On Mon, May 5, 2025 at 10:17 PM Vishal Annapurve wrote:
>>>>
>>>> On Mon, May 5, 2025 at 3:57 PM Sean Christopherson wrote:
>>>>>> ...
>>>>>> And not worry about lpage_info for the time being, until we
>>>>>> actually do support larger pages.
>>>>>
>>>>> I don't want to completely punt on this, because if it gets messy,
>>>>> then I want to know now and have a solution in hand, not find out N
>>>>> months from now.
>>>>>
>>>>> That said, I don't expect it to be difficult. What we could punt on
>>>>> is performance of the lookups, which is the real reason KVM
>>>>> maintains the rather expensive disallow_lpage array.
>>>>>
>>>>> And that said, memslots can only bind to one guest_memfd instance,
>>>>> so I don't immediately see any reason why the guest_memfd ioctl()
>>>>> couldn't process the slots that are bound to it. I.e. why not
>>>>> update KVM_LPAGE_MIXED_FLAG from the guest_memfd ioctl() instead of
>>>>> from KVM_SET_MEMORY_ATTRIBUTES?
>>>>
>>>> I am missing the point here of updating KVM_LPAGE_MIXED_FLAG for the
>>>> scenarios where in-place memory conversion will be supported with
>>>> guest_memfd, as guest_memfd support for hugepages comes with the
>>>> design that hugepages can't have mixed attributes, i.e. max_order
>>>> returned by get_pfn will always have the same attributes for the
>>>> folio range.
>>
>> Oh, if this will naturally be handled by guest_memfd, then do that. I
>> was purely reacting to David's suggestion to "not worry about
>> lpage_info for the time being, until we actually do support larger
>> pages".
>>
>>>> Is your suggestion around using guest_memfd ioctl() to also toggle
>>>> memory attributes for the scenarios where the guest_memfd instance
>>>> doesn't have the in-place memory conversion feature enabled?
>>>
>>> Reading more into your response, I guess your suggestion is about
>>> covering different use cases present today and new use cases which
>>> may land in the future that rely on kvm_lpage_info for faster lookup.
>>> If so, then it should be easy to modify the guest_memfd ioctl to
>>> update kvm_lpage_info as you suggested.
>>
>> Nah, I just missed/forgot that using a single guest_memfd for private
>> and shared would naturally need to split the folio and thus this would
>> Just Work.

Sean, David,

I'm circling back to make sure I'm following the discussion correctly
before Fuad sends out the next revision of this series.

> Yeah, I ignored that fact as well. So essentially, this patch should be
> mostly good for now.

From here [1], these changes will make it into v9:

+ kvm_max_private_mapping_level renamed to kvm_max_gmem_mapping_level
+ kvm_mmu_faultin_pfn_private renamed to kvm_mmu_faultin_pfn_gmem

> Only kvm_mmu_hugepage_adjust() must be taught to not rely on
> fault->is_private.

I think fault->is_private should contribute to determining the max
mapping level. By the time kvm_mmu_hugepage_adjust() is called:

* For CoCo VMs using guest_memfd only for private memory,
  fault->is_private would have been checked to align with
  kvm->mem_attr_array.
* For CoCo VMs using guest_memfd for both private and shared memory,
  fault->is_private would have been checked to align with guest_memfd's
  shareability.
* For non-CoCo VMs using guest_memfd, fault->is_private would be false.

Hence fault->is_private can be relied on when calling
kvm_mmu_hugepage_adjust().

If fault->is_private, there will be no host userspace mapping to check,
hence in __kvm_mmu_max_mapping_level() we should skip querying host page
tables.

If !fault->is_private, for shared memory ranges, if the VM uses
guest_memfd only for shared memory, we should query host page tables.
If !fault->is_private, for shared memory ranges, if the VM uses
guest_memfd for both shared and private memory, we should not query host
page tables.

If !fault->is_private, for non-CoCo VMs, we should not query host page
tables.

I propose to rename the parameter is_private to skip_host_page_tables,
so

-	if (is_private)
+	if (skip_host_page_tables)
 		return max_level;

and pass

	skip_host_page_tables = fault->is_private ||
				kvm_gmem_memslot_supports_shared(fault->slot);

where kvm_gmem_memslot_supports_shared() checks the inode in the memslot
for GUEST_MEMFD_FLAG_SUPPORT_SHARED.

For recover_huge_pages_range(), the other user of
__kvm_mmu_max_mapping_level(), there is currently no prior call to
kvm_gmem_get_pfn() to get max_order or max_level, so I propose to call
__kvm_mmu_max_mapping_level() with

	if (kvm_gmem_memslot_supports_shared(slot)) {
		max_level = kvm_gmem_max_mapping_level(slot, gfn);
		skip_host_page_tables = true;
	} else {
		max_level = PG_LEVEL_NUM;
		skip_host_page_tables = kvm_slot_has_gmem(slot) &&
					kvm_mem_is_private(kvm, gfn);
	}

Without 1G support, kvm_gmem_max_mapping_level(slot, gfn) would always
return 4K. With 1G support, kvm_gmem_max_mapping_level(slot, gfn) would
return the level for the page's order at the offset corresponding to the
gfn.

> Once we support large folios in guest_memfd, only the "alignment"
> consideration might have to be taken into account.

I'll be handling this alignment as part of the 1G page support series
(it won't be part of Fuad's first-stage series) [2].

> Anything else?
>
> --
> Cheers,
>
> David / dhildenb

[1] https://lore.kernel.org/all/20250430165655.605595-7-tabba@google.com/
[2] https://lore.kernel.org/all/diqz1pt1sfw8.fsf@ackerleytng-ctop.c.googlers.com/