From: Maxim Levitsky
To: "kvm@vger.kernel.org"
Cc: Paolo Bonzini, Vladimir Davydov, linux-mm@kvack.org, Sean Christopherson, Emanuele Giuseppe Esposito
Subject: The root cause of failure of access_tracking_perf_test in a nested guest
Date: Fri, 23 Sep 2022 13:16:04 +0300
Message-ID: <50dfe81bf95db91e6148b421740490c35c33233e.camel@redhat.com>

Hi!

Emanuele Giuseppe Esposito and I have been trying to understand why
access_tracking_perf_test fails when run in a nested guest on Intel, and I was
finally able to find the root cause.

The access_tracking_perf_test does the following:

- It opens /sys/kernel/mm/page_idle/bitmap, a special file that only root can
  read and write, which allows a process to set/clear the accessed bit in its
  page tables. The interface of this file is inverted: it is a bitmap of 'idle'
  bits, where idle bit set == accessed bit clear.

- It then runs a KVM guest, and checks that when the guest accesses its memory
  (through EPT/NPT), the accessed bits are still updated normally as seen from
  /sys/kernel/mm/page_idle/bitmap.

  In particular, it first clears the accessed bit using
  /sys/kernel/mm/page_idle/bitmap, then runs a guest which reads/writes all of
  its memory, and then checks that the accessed bit is set again by reading
  /sys/kernel/mm/page_idle/bitmap.

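To make the inverted semantics of the bitmap concrete, here is a minimal sketch
of how a page frame number maps to a position in the file. It follows the layout
described in the kernel's idle page tracking documentation (an array of native
byte order 64-bit words, one bit per PFN). The helper names below are made up
for illustration and are not taken from the selftest.

```python
# Sketch only: helper names are hypothetical, not from access_tracking_perf_test.
WORD_BYTES = 8   # the bitmap is an array of 8-byte integers
WORD_BITS = 64   # each word covers 64 consecutive PFNs


def bitmap_location(pfn):
    """Return (file offset of the 64-bit word, bit index within it) for a PFN."""
    return (pfn // WORD_BITS) * WORD_BYTES, pfn % WORD_BITS


def idle_mask(bit):
    """Word with only this page's idle bit set; setting it clears 'accessed'."""
    return 1 << bit


def was_accessed(word, bit):
    """Inverted semantics: a clear idle bit means the page has been accessed."""
    return ((word >> bit) & 1) == 0
```

With this layout the test's sequence is: set the idle bit for every guest page,
run the guest, then read the words back and expect was_accessed() to hold for
each page the guest touched.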
Now, since KVM uses its own paging (aka secondary MMU), mmu notifiers are used,
in particular:

- kvm_mmu_notifier_clear_flush_young
- kvm_mmu_notifier_clear_young
- kvm_mmu_notifier_test_young

The first two clear the accessed bit in NPT/EPT, and the third only checks its
value.

The difference between the first two notifiers is that the first one also
flushes the EPT/NPT translations from the TLB, and the second one doesn't, and
apparently /sys/kernel/mm/page_idle/bitmap uses the second one.

This means that on bare metal, the TLB might still hold a translation with the
accessed bit set, and thus the CPU might not set the bit again in the PTE when
memory is accessed through that translation.

There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy, so
this seems to be done on purpose. I would like to hear your opinion on why it
was done this way, and whether the original reasons for not doing the TLB flush
are still valid.

Now, why does access_tracking_perf_test fail in a nested guest?

It is because KVM shadow paging, which is used to shadow the nested EPT, has a
"TLB" which is not bounded in size, because it is stored in the unsync sptes in
memory.

Because of this, when the guest clears the accessed bit in its nested EPT
entries, KVM doesn't notice/intercept it, and the corresponding EPT sptes remain
the same. A later guest access to that memory is therefore not intercepted
either, and so it doesn't set the accessed bit again in the guest EPT tables.

(If a TLB flush were to happen, we would 'sync' the unsync sptes by zapping
them, because we don't keep sptes for gptes with no accessed bit.)

Any comments are welcome!

If you think that the lack of the EPT flush is still the right thing to do, I
vote again for having at least some form of a blacklist of selftests which are
expected to fail when run under KVM (fix_hypercall_test is the other test I
already know of that fails in a KVM guest, also without a practical way to fix
it).

Best regards,
	Maxim Levitsky

PS: the test doesn't fail on AMD because we sync the nested NPT on each nested
VM entry, which means that L0 syncs all the page tables.

Also, the test sometimes passes on Intel when an unrelated TLB flush syncs the
nested EPT.

Not using the new tdp_mmu also 'helps' by letting the test pass much more often,
but it still fails once in a while, likely because of timing and/or the
different implementation.
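
As an illustration of the nested failure and of the AMD behaviour from the PS,
the mechanism can be approximated with a toy model. This is only a sketch under
strong simplifying assumptions: it is not KVM code, the class and its
sync_on_entry flag are hypothetical, and a real shadow MMU tracks far more
state. It models only the rule that a shadow spte exists just for gptes whose
accessed bit is set, and that a guest access is intercepted only when no spte is
present.

```python
class ToyNestedEpt:
    """Toy model of L1's EPT entries (gptes) shadowed by L0 sptes.

    The set of present sptes plays the role of the unbounded software "TLB".
    """

    def __init__(self, sync_on_entry=False):
        self.gpte_accessed = {}     # gfn -> accessed bit in the guest's EPT
        self.spte_present = set()   # gfns that currently have a shadow spte
        self.sync_on_entry = sync_on_entry  # hypothetical: models the AMD case

    def guest_access(self, gfn):
        """L2 touches a page."""
        if gfn not in self.spte_present:
            # Miss: L0 intercepts, sets the gpte accessed bit, installs an spte.
            self.gpte_accessed[gfn] = True
            self.spte_present.add(gfn)
        # Hit: nothing is intercepted, so the gpte is left untouched.

    def clear_young(self, gfn):
        """L1 clears the accessed bit in its EPT entry without any flush."""
        self.gpte_accessed[gfn] = False

    def sync(self):
        """TLB flush: zap sptes whose gpte no longer has the accessed bit."""
        self.spte_present = {
            gfn for gfn in self.spte_present if self.gpte_accessed.get(gfn)
        }

    def vm_entry(self):
        """Nested VM entry; syncs only if the model is configured to do so."""
        if self.sync_on_entry:
            self.sync()


def accessed_after_test(model, gfn=0):
    """Populate, clear the accessed bit, access again, report the bit."""
    model.vm_entry()
    model.guest_access(gfn)   # first touch installs the spte
    model.clear_young(gfn)    # what the page_idle write ends up doing in L1
    model.vm_entry()
    model.guest_access(gfn)   # second touch: intercepted only after a sync
    return model.gpte_accessed[gfn]
```

In this model accessed_after_test(ToyNestedEpt()) leaves the accessed bit clear,
which matches the failure on Intel, while ToyNestedEpt(sync_on_entry=True) sets
it again, which matches the sync on each nested VM entry described for AMD.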