From: Jim Mattson
Date: Fri, 23 Sep 2022 12:25:00 -0700
Subject: Re: The root cause of failure of access_tracking_perf_test in a nested guest
To: Maxim Levitsky
Cc: "kvm@vger.kernel.org", Paolo Bonzini, Vladimir Davydov, linux-mm@kvack.org, Sean Christopherson, Emanuele Giuseppe Esposito
In-Reply-To: <50dfe81bf95db91e6148b421740490c35c33233e.camel@redhat.com>
References: <50dfe81bf95db91e6148b421740490c35c33233e.camel@redhat.com>

On Fri, Sep 23, 2022 at 3:16 AM Maxim Levitsky wrote:
>
> Hi!
>
> Emanuele Giuseppe Esposito and I were working on trying to understand why the access_tracking_perf_test
> fails when run in a nested guest on Intel, and I was finally able to find the root cause.
>
> So the access_tracking_perf_test tests the following:
>
> - It opens /sys/kernel/mm/page_idle/bitmap, which is a special file, readable/writable by root,
> which allows a process to set/clear the accessed bit in its page tables.
> The interface of this file is inverted: it is a bitmap of 'idle' bits.
> Idle bit set === accessed bit is clear.
>
> - It then runs a KVM guest, and checks that when the guest accesses its memory
> (through EPT/NPT), the accessed bits are still updated normally as seen from /sys/kernel/mm/page_idle/bitmap.
>
> In particular, it first clears the accessed bit using /sys/kernel/mm/page_idle/bitmap,
> then runs a guest which reads/writes all its memory, and then
> checks that the accessed bit is set again by reading /sys/kernel/mm/page_idle/bitmap.
>
>
>
> Now, since KVM uses its own paging (aka secondary MMU), mmu notifiers are used, in particular:
> - kvm_mmu_notifier_clear_flush_young
> - kvm_mmu_notifier_clear_young
> - kvm_mmu_notifier_test_young
>
> The first two clear the accessed bit in NPT/EPT, and the third only checks its value.
>
> The difference between the first two notifiers is that the first one flushes EPT/NPT,
> and the second one doesn't, and apparently /sys/kernel/mm/page_idle/bitmap uses the second one.
>
> This means that on bare metal, the TLB might still have the accessed bit set, and thus
> the CPU might not set it again in the PTE when a memory access is done through it.
>
> There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy, so this seems to be
> done on purpose.
>
> I would like to hear your opinion on why it was done this way, and whether the original reasons for
> not doing the TLB flush are still valid.
>
> Now, why does the access_tracking_perf_test fail in a nested guest?
> It is because KVM shadow paging, which is used to shadow the nested EPT, has a "TLB" which
> is not bounded in size, because it is stored in the unsync sptes in memory.
>
> Because of this, when the guest clears the accessed bit in its nested EPT entries, KVM doesn't
> notice/intercept it and the corresponding EPT sptes remain the same; thus a later guest access to
> the memory is not intercepted and so doesn't set
> the accessed bit again in the guest EPT tables.

Does the guest execute an INVEPT after clearing the accessed bit?

From volume 3 of the SDM, section 28.3.5 Accessed and Dirty Flags for EPT:

> A processor may cache information from the EPT paging-structure entries in TLBs and paging-structure caches (see Section 28.4). This fact implies that, if software changes an accessed flag or a dirty flag from 1 to 0, the processor might not set the corresponding bit in memory on a subsequent access using an affected guest-physical address.

> (If a TLB flush were to happen, we would 'sync' the unsync sptes by zapping them, because we don't
> keep sptes for gptes with no accessed bit.)
>
>
> Any comments are welcome!
>
> If you think that the lack of the EPT flush is still the right thing to do,
> I vote again to have at least some form of a blacklist of selftests which
> are expected to fail when run under KVM (fix_hypercall_test is the other test
> that I already know fails in a KVM guest, also without a practical way to fix it).
>
>
> Best regards,
> Maxim Levitsky
>
>
> PS: the test doesn't fail on AMD because we sync the nested NPT on each nested VM entry, which
> means that L0 syncs all the page tables.
>
> Also, the test sometimes passes on Intel when an unrelated TLB flush syncs the nested EPT.
>
> Not using the new tdp_mmu also 'helps' by letting the test pass much more often, but it still
> fails once in a while, likely because of timing and/or a different implementation.
>
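
To make the caching behavior in the SDM excerpt concrete, here is a toy model in Python. It is only a sketch and not KVM or hardware code: the ToyMMU class, its methods, and the run() helper are all made up for illustration. It also assumes that a translation is always cached on first access, whereas the SDM only says that a processor "may" cache it.

```python
class ToyMMU:
    """Hypothetical toy model: accessed flags in memory plus a TLB."""

    def __init__(self):
        self.accessed = {}  # page -> accessed flag as stored "in memory"
        self.tlb = set()    # pages whose translation is currently cached

    def access(self, page):
        # A TLB hit skips the page walk, so the in-memory flag is untouched.
        if page in self.tlb:
            return
        # A TLB miss walks the tables, sets the flag and caches the translation.
        self.accessed[page] = True
        self.tlb.add(page)

    def clear_accessed(self, page, flush):
        # Software changes the accessed flag from 1 to 0 ...
        self.accessed[page] = False
        # ... and optionally invalidates the cached translation.
        if flush:
            self.tlb.discard(page)


def run(flush):
    """Access a page, clear its accessed flag, access it again."""
    mmu = ToyMMU()
    mmu.access(0)
    mmu.clear_accessed(0, flush=flush)
    mmu.access(0)
    return mmu.accessed[0]


print("no flush:", run(flush=False))
print("flush:", run(flush=True))
```

In this model the second access sets the flag again only when the clear was followed by an invalidation, which is why the question of whether the guest executes an INVEPT matters.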