From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 18 Apr 2024 12:53:26 -0700
In-Reply-To: <20240418141932.GA1855@willie-the-truck>
Mime-Version: 1.0
References: <20240405115815.3226315-1-pbonzini@redhat.com>
 <20240405115815.3226315-2-pbonzini@redhat.com>
 <20240412104408.GA27645@willie-the-truck>
 <86jzl2sovz.wl-maz@kernel.org>
 <86h6g5si0m.wl-maz@kernel.org>
 <20240418141932.GA1855@willie-the-truck>
Subject: Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
From: Sean Christopherson
To: Will Deacon
Cc: Marc Zyngier, Paolo Bonzini, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
 Oliver Upton, Tianrui Zhao, Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin,
 Anup Patel, Atish Patra, Andrew Morton, David Hildenbrand,
 linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, loongarch@lists.linux.dev,
 linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org,
 linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org
Content-Type: text/plain; charset="us-ascii"
On Thu, Apr 18, 2024, Will Deacon wrote:
> On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> > On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson wrote:
> > > >
> > > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon wrote:
> > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > > would be really great to change the .clear_flush_young() callback so
> > > > > > that the architecture could handle the TLB invalidation.  At the moment,
> > > > > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > > lighter-weight and targeted TLBI in the architecture page-table code
> > > > > > when we actually update the ptes for small ranges.
> > > > >
> > > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > > costly consequences).
> > > > >
> > > > > In general, it feels like the TLB invalidation should stay with the
> > > > > code that deals with the page tables, as it has a pretty good idea of
> > > > > what needs to be invalidated and how -- specially on architectures
> > > > > that have a HW-broadcast facility like arm64.
> > > >
> > > > Would this be roughly on par with an in-line flush on arm64?  The simpler, more
> > > > straightforward solution would be to let architectures override flush_on_ret,
> > > > but I would prefer something like the below as x86 can also utilize a range-based
> > > > flush when running as a nested hypervisor.
> >
> > ...
> >
> > > I think this works for us on HW that has range invalidation, which
> > > would already be a positive move.
> > >
> > > For the lesser HW that isn't range capable, it also gives the
> > > opportunity to perform the iteration ourselves or go for the nuclear
> > > option if the range is larger than some arbitrary constant (though
> > > this is additional work).
> > >
> > > But this still considers the whole range as being affected by
> > > range->handler(). It'd be interesting to try and see whether more
> > > precise tracking is (or isn't) generally beneficial.
> >
> > I assume the idea would be to let arch code do single-page invalidations of
> > stage-2 entries for each gfn?
>
> Right, as it's the only code which knows which ptes actually ended up
> being aged.
>
> > Unless I'm having a brain fart, x86 can't make use of that functionality.  Intel
> > doesn't provide any way to do targeted invalidation of stage-2 mappings.  AMD
> > provides an instruction to do broadcast invalidations, but it takes a virtual
> > address, i.e. a stage-1 address.
> > I can't tell if it's a host virtual address or
> > a guest virtual address, but it's a moot point because KVM doesn't have the guest
> > virtual address, and if it's a host virtual address, there would need to be valid
> > mappings in the host page tables for it to work, which KVM can't guarantee.
>
> Ah, so it sounds like it would need to be an arch opt-in then.

Even if x86 (or some other arch code) could use the precise tracking, I think it
would make sense to have the behavior be arch specific.  Adding infrastructure
to get information from arch code, only to turn around and give it back to arch
code would be odd.

Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE,
the best/easiest solution would be to let arm64 opt out of the common TLB flush
when a SPTE is made young.

With the range-based flushing bundled in, this?

---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 40 +++++++++++++++++++++++++---------------
 2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index afbc99264ffa..8fe5f5e16919 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header;
 extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
 
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
+int kvm_arch_flush_tlb_if_young(void);
+
 static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 {
 	if (unlikely(kvm->mmu_invalidate_in_progress))
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 38b498669ef9..5ebef8ef239c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -595,6 +595,11 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
+int __weak kvm_arch_flush_tlb_if_young(void)
+{
+	return true;
+}
+
 /* Iterate over each memslot intersecting [start, last] (inclusive) range */
 #define kvm_for_each_memslot_in_hva_range(node, slots, start, last)	     \
 	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
@@ -611,6 +616,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 	struct kvm_gfn_range gfn_range;
 	struct kvm_memory_slot *slot;
 	struct kvm_memslots *slots;
+	bool need_flush = false;
 	int i, idx;
 
 	if (WARN_ON_ONCE(range->end <= range->start))
@@ -663,10 +669,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 					break;
 			}
 			r.ret |= range->handler(kvm, &gfn_range);
+
+			/*
+			 * Use a precise gfn-based TLB flush when possible, as
+			 * most mmu_notifier events affect a small-ish range.
+			 * Fall back to a full TLB flush if the gfn-based flush
+			 * fails, and don't bother trying the gfn-based flush
+			 * if a full flush is already pending.
+			 */
+			if (range->flush_on_ret && !need_flush && r.ret &&
+			    kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
+							     gfn_range.end - gfn_range.start + 1))
+				need_flush = true;
 		}
 	}
 
-	if (range->flush_on_ret && r.ret)
+	if (need_flush)
 		kvm_flush_remote_tlbs(kvm);
 
 	if (r.found_memslot)
@@ -680,7 +698,8 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 						unsigned long start,
 						unsigned long end,
-						gfn_handler_t handler)
+						gfn_handler_t handler,
+						bool flush_on_ret)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	const struct kvm_mmu_notifier_range range = {
@@ -688,7 +707,7 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.end		= end,
 		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
-		.flush_on_ret	= true,
+		.flush_on_ret	= flush_on_ret,
 		.may_block	= false,
 	};
 
@@ -700,17 +719,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
 						      unsigned long start,
 						      unsigned long end,
 						      gfn_handler_t handler)
 {
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_mmu_notifier_range range = {
-		.start		= start,
-		.end		= end,
-		.handler	= handler,
-		.on_lock	= (void *)kvm_null_fn,
-		.flush_on_ret	= false,
-		.may_block	= false,
-	};
-
-	return __kvm_handle_hva_range(kvm, &range).ret;
+	return kvm_handle_hva_range(mn, start, end, handler, false);
 }
 
 void kvm_mmu_invalidate_begin(struct kvm *kvm)
@@ -876,7 +885,8 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 {
 	trace_kvm_age_hva(start, end);
 
-	return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
+	return kvm_handle_hva_range(mn, start, end, kvm_age_gfn,
+				    kvm_arch_flush_tlb_if_young());
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,

base-commit: eae53272c8ad4e7ed2bbb11bd0456eb5b0484f0c
--