Subject: Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size
From: Alexandru Elisei
To: Sean Christopherson
Cc: Marc Zyngier, linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
 kvmarm@lists.cs.columbia.edu, linux-mm@kvack.org, Matthew Wilcox,
 Paolo Bonzini, Will Deacon, Quentin Perret, James Morse,
 Suzuki K Poulose, kernel-team@android.com
Message-ID: <568c571a-17f5-24a5-4aec-8b508f21eddd@arm.com>
Date: Wed, 21 Jul 2021 17:37:16 +0100
References: <20210717095541.1486210-1-maz@kernel.org> <20210717095541.1486210-2-maz@kernel.org>
Hi Sean,

Thank you for writing this, it explains exactly what I wanted to know.

On 7/20/21 9:33 PM, Sean Christopherson wrote:
> On Tue, Jul 20, 2021, Alexandru Elisei wrote:
>> Hi Marc,
>>
>> I just can't figure out why having the mmap lock is not needed to walk the
>> userspace page tables. Any hints? Or am I not seeing where it's taken?
>
> Disclaimer: I'm not super familiar with arm64's page tables, but the relevant
> KVM functionality is common across x86 and arm64.
>
> KVM arm64 (and x86) unconditionally registers a mmu_notifier for the mm_struct
> associated with the VM, and disallows calling ioctls from a different process,
> i.e. walking the page tables during KVM_RUN is guaranteed to use the mm for
> which KVM registered the mmu_notifier.  As part of registration, the
> mmu_notifier does mmgrab() and doesn't do mmdrop() until it's unregistered.
> That ensures the mm_struct itself is live.
>
> For page table liveness, KVM implements mmu_notifier_ops.release, which is
> invoked at the beginning of exit_mmap(), before the page tables are freed.  In
> its implementation, KVM takes mmu_lock and zaps all its shadow page tables,
> a.k.a. the stage2 tables in KVM arm64.  The flow in question,
> get_user_mapping_size(), also runs under mmu_lock, and so effectively blocks
> exit_mmap() and thus is guaranteed to run with live userspace tables.
>
> Lastly, KVM also implements mmu_notifier_ops.invalidate_range_{start,end}.
> KVM's invalidate_range implementations also take mmu_lock, and also update a
> sequence counter and a flag stating that there's an invalidation in progress.
> When installing a stage2 entry, KVM snapshots the sequence counter before
> taking mmu_lock, and then checks it again after acquiring mmu_lock.  If the
> counter mismatches, or an invalidation is in progress, then KVM bails and
> resumes the guest without fixing the fault.
>
> E.g. if the host zaps userspace page tables and KVM "wins" the race, the
> subsequent kvm_mmu_notifier_invalidate_range_start() will zap the recently
> installed stage2 entries.  And if the host zap "wins" the race, KVM will
> resume the guest, which in normal operation will hit the exception again and
> go back through the entire process of installing stage2 entries.
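
Just to check that I'm following: the ordering you describe is, I believe, the
snapshot-and-recheck pattern below. This is my own simplified sketch, not the
actual user_mem_abort() code; stage2_fault_sketch() is a made-up name, but
kvm->mmu_notifier_seq, mmu_notifier_retry(), gfn_to_pfn() and
kvm_release_pfn_clean() are the generic KVM helpers:

	static int stage2_fault_sketch(struct kvm *kvm, gfn_t gfn)
	{
		unsigned long mmu_seq;
		kvm_pfn_t pfn;

		/* Snapshot the invalidation counter before faulting the page in. */
		mmu_seq = kvm->mmu_notifier_seq;
		/* Pairs with the barrier in the invalidate_range_end() path. */
		smp_rmb();

		/* Can sleep, so must run without mmu_lock held. */
		pfn = gfn_to_pfn(kvm, gfn);

		spin_lock(&kvm->mmu_lock);
		/*
		 * An invalidation started (the in-progress flag is set) or
		 * completed (the counter changed) since the snapshot: drop
		 * the pfn and bail, the vCPU will simply fault again.
		 */
		if (mmu_notifier_retry(kvm, mmu_seq)) {
			spin_unlock(&kvm->mmu_lock);
			kvm_release_pfn_clean(pfn);
			return 0;
		}

		/* Safe to install the stage 2 entry here. */

		spin_unlock(&kvm->mmu_lock);
		return 0;
	}
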
> Looking at the arm64 code, one thing I'm not clear on is whether arm64
> correctly handles the case where exit_mmap() wins the race.  The
> invalidate_range hooks will still be called, so userspace page tables aren't a
> problem, but kvm_arch_flush_shadow_all() -> kvm_free_stage2_pgd() nullifies
> mmu->pgt without any additional notifications that I see.  x86 deals with this
> by ensuring its top-level TDP entry (stage2 equivalent) is valid while the
> page fault handler is running.
>
> void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> {
> 	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
> 	struct kvm_pgtable *pgt = NULL;
>
> 	spin_lock(&kvm->mmu_lock);
> 	pgt = mmu->pgt;
> 	if (pgt) {
> 		mmu->pgd_phys = 0;
> 		mmu->pgt = NULL;
> 		free_percpu(mmu->last_vcpu_ran);
> 	}
> 	spin_unlock(&kvm->mmu_lock);
>
> 	...
> }
>
> AFAICT, nothing in user_mem_abort() would prevent consuming that null mmu->pgt
> if exit_mmap() collided with user_mem_abort().
>
> static int user_mem_abort(...)
> {
>
> 	...
>
> 	spin_lock(&kvm->mmu_lock);
> 	pgt = vcpu->arch.hw_mmu->pgt;         <-- hw_mmu->pgt may be NULL (hw_mmu points at vcpu->kvm->arch.mmu)
> 	if (mmu_notifier_retry(kvm, mmu_seq)) <-- mmu_seq not guaranteed to change
> 		goto out_unlock;
>
> 	...
>
> 	if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
> 		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
> 	} else {
> 		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
> 					     __pfn_to_phys(pfn), prot,
> 					     memcache);
> 	}
> }
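
If the two can indeed collide like that, would checking the pointer under
mmu_lock be enough to close the race? Something like the (untested) sketch
below in user_mem_abort(), on the theory that a NULL pgt can only mean the VM
is being torn down, so there is nothing useful the fault handler can do:

	spin_lock(&kvm->mmu_lock);
	pgt = vcpu->arch.hw_mmu->pgt;
	/*
	 * exit_mmap() -> kvm_arch_flush_shadow_all() ->
	 * kvm_free_stage2_pgd() won the race and freed the stage 2
	 * tables; bail instead of dereferencing a NULL pgt.
	 */
	if (!pgt)
		goto out_unlock;
	if (mmu_notifier_retry(kvm, mmu_seq))
		goto out_unlock;

Thanks,
Alex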