Date: Fri, 23 Sep 2022 13:57:01 +0200
From: Emanuele Giuseppe Esposito
To: Maxim Levitsky, "kvm@vger.kernel.org"
Cc: Paolo Bonzini, Vladimir Davydov, linux-mm@kvack.org, Sean Christopherson
Subject: Re: The root cause of failure of access_tracking_perf_test in a nested guest
In-Reply-To: <50dfe81bf95db91e6148b421740490c35c33233e.camel@redhat.com>
References: <50dfe81bf95db91e6148b421740490c35c33233e.camel@redhat.com>

On 23/09/2022 at 12:16, Maxim Levitsky wrote:
> Hi!
>
> Emanuele Giuseppe Esposito and I have been trying to understand why
> access_tracking_perf_test fails when run in a nested guest on Intel, and
> I was finally able to find the root cause.
>
> So access_tracking_perf_test tests the following:
>
> - It opens /sys/kernel/mm/page_idle/bitmap, which is a special root-only
>   readable/writable file that allows a process to set/clear the accessed
>   bit in its page tables. The interface of this file is inverted: it is
>   a bitmap of 'idle' bits.
>   Idle bit set === accessed bit is clear.
>
> - It then runs a KVM guest, and checks that when the guest accesses its
>   memory (through EPT/NPT), the accessed bits are still updated normally
>   as seen from /sys/kernel/mm/page_idle/bitmap.
>
>   In particular, it first clears the accessed bit using
>   /sys/kernel/mm/page_idle/bitmap, then runs a guest which reads/writes
>   all its memory, and then checks that the accessed bit is set again by
>   reading /sys/kernel/mm/page_idle/bitmap.
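
For reference, the layout of the bitmap can be sketched in C. As far as I know from the kernel's idle page tracking documentation, the file is an array of 8-byte words in native byte order, and the page at PFN i maps to bit i % 64 of word i / 64. The helper names below are hypothetical and only for illustration; the sketch works on an in-memory copy of the bitmap and does not open the sysfs file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each 64-bit word of the bitmap covers 64 consecutive PFNs. */
#define PAGE_IDLE_BITS_PER_WORD 64u

/* Byte offset (e.g. for pread/pwrite) of the word holding pfn's idle bit. */
static uint64_t page_idle_word_offset(uint64_t pfn)
{
	return (pfn / PAGE_IDLE_BITS_PER_WORD) * sizeof(uint64_t);
}

/* Mask selecting pfn's idle bit inside that word. */
static uint64_t page_idle_bit_mask(uint64_t pfn)
{
	return UINT64_C(1) << (pfn % PAGE_IDLE_BITS_PER_WORD);
}

/* The interface is inverted: idle bit clear means the page was accessed. */
static int page_was_accessed(uint64_t word, uint64_t pfn)
{
	return (word & page_idle_bit_mask(pfn)) == 0;
}

/* Mark pfn idle in an in-memory copy of the bitmap (nwords 64-bit words). */
static void bitmap_mark_idle(uint64_t *words, size_t nwords, uint64_t pfn)
{
	assert(pfn / PAGE_IDLE_BITS_PER_WORD < nwords);
	words[pfn / PAGE_IDLE_BITS_PER_WORD] |= page_idle_bit_mask(pfn);
}
```

With this layout, PFN 130 lives in the third word (byte offset 16) at bit 2. Note that the bitmap is indexed by PFN, not by virtual address, so a user of the file needs some way to translate addresses first.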
>
> Now, since KVM uses its own paging (aka secondary MMU), mmu notifiers
> are used, and in particular:
> - kvm_mmu_notifier_clear_flush_young
> - kvm_mmu_notifier_clear_young
> - kvm_mmu_notifier_test_young
>
> The first two clear the accessed bit in NPT/EPT, and the third only
> checks its value.
>
> The difference between the first two notifiers is that the first one
> flushes EPT/NPT and the second one doesn't, and apparently
> /sys/kernel/mm/page_idle/bitmap uses the second one.
>
> This means that on bare metal, the TLB might still have the accessed bit
> set, and thus the CPU might not set it again in the PTE when a memory
> access is done through that TLB entry.
>
> There is a comment in kvm_mmu_notifier_clear_young about this
> inaccuracy, so this seems to be done on purpose.
>
> I would like to hear your opinion on why it was done this way, and
> whether the original reasons for not doing the TLB flush are still
> valid.
>
> Now, why does access_tracking_perf_test fail in a nested guest?
> It is because KVM shadow paging, which is used to shadow the nested EPT,
> has a "TLB" that is not bounded in size, because it is stored in the
> unsync sptes in memory.
>
> Because of this, when the guest clears the accessed bit in its nested
> EPT entries, KVM doesn't notice/intercept it and the corresponding EPT
> sptes remain the same. Thus a later guest access to the memory is not
> intercepted, and so it doesn't set the accessed bit again in the guest
> EPT tables.
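
The effect described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: a single page, a single cached translation, and made-up names (`toy_mmu`, `toy_access` and so on) that do not correspond to anything in KVM. The cached translation stands in both for a hardware TLB entry on bare metal and for an unsync spte in the nested case.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one page-table entry plus one cached translation for it. */
struct toy_mmu {
	bool pte_accessed;	/* accessed bit in the in-memory PTE */
	bool tlb_valid;		/* a cached translation exists */
};

/* A memory access only walks the page table (and sets the bit) on a miss. */
static void toy_access(struct toy_mmu *m)
{
	if (m->tlb_valid)
		return;		/* hit: the PTE is not touched */
	m->pte_accessed = true;
	m->tlb_valid = true;
}

/* Models the clear_young flavour: clear the bit, keep cached translations. */
static bool toy_clear_young(struct toy_mmu *m)
{
	bool young = m->pte_accessed;

	m->pte_accessed = false;
	return young;
}

/* Models the clear_flush_young flavour: clear the bit and invalidate. */
static bool toy_clear_flush_young(struct toy_mmu *m)
{
	bool young = toy_clear_young(m);

	m->tlb_valid = false;
	return young;
}

/* Mirrors the test: touch, clear the bit, touch again, report the bit. */
static bool toy_second_access_seen(bool flush)
{
	struct toy_mmu m = { false, false };

	toy_access(&m);
	assert(m.pte_accessed);	/* the first access always walks the table */

	if (flush)
		toy_clear_flush_young(&m);
	else
		toy_clear_young(&m);

	toy_access(&m);
	return m.pte_accessed;
}
```

In this model `toy_second_access_seen(false)` reports the page as not accessed even though it was touched, which is the failure mode of the test, while `toy_second_access_seen(true)` reports it as accessed. On real hardware the cached entry is eventually evicted, so the inaccuracy is bounded; the point made above is that for unsync sptes nothing bounds it.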
>
> (If a TLB flush were to happen, we would 'sync' the unsync sptes by
> zapping them, because we don't keep sptes for gptes with no accessed
> bit.)

As suggested by Paolo, I also tried changing the page_idle.c
implementation so that it calls kvm_mmu_notifier_clear_flush_young
instead of its non-flush counterpart:

diff --git a/mm/page_idle.c b/mm/page_idle.c
index edead6a8a5f9..ffc1b0182534 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -62,10 +62,10 @@ static bool page_idle_clear_pte_refs_one(struct page *page,
 			 * For PTE-mapped THP, one sub page is referenced,
 			 * the whole THP is referenced.
 			 */
-			if (ptep_clear_young_notify(vma, addr, pvmw.pte))
+			if (ptep_clear_flush_young_notify(vma, addr, pvmw.pte))
 				referenced = true;
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-			if (pmdp_clear_young_notify(vma, addr, pvmw.pmd))
+			if (pmdp_clear_flush_young_notify(vma, addr, pvmw.pmd))
 				referenced = true;
 		} else {
 			/* unexpected pmd-mapped page? */

As expected, with the above patch the test does not fail anymore,
proving Maxim's point.

As I understand it, an alternative was to get rid of the test? Or at
least to move it out of KVM?

Thank you,
Emanuele

>
> Any comments are welcome!
>
> If you think that the lack of the EPT flush is still the right thing to
> do, I vote again to have at least some form of a blacklist of selftests
> which are expected to fail when run under KVM (fix_hypercall_test is the
> other test I already know fails in a KVM guest, also without a
> practical way to fix it).
>
>
> Best regards,
> Maxim Levitsky
>
>
> PS: the test doesn't fail on AMD because we sync the nested NPT on each
> nested VM entry, which means that L0 syncs all the page tables.
>
> Also, the test sometimes passes on Intel when an unrelated TLB flush
> syncs the nested EPT.
>
> Not using the new tdp_mmu also 'helps' by letting the test pass much
> more often, but it still fails once in a while, likely because of timing
> and/or a different implementation.