From: Emanuele Giuseppe Esposito <eesposit@redhat.com>
To: Maxim Levitsky <mlevitsk@redhat.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
linux-mm@kvack.org, Sean Christopherson <seanjc@google.com>
Subject: Re: The root cause of failure of access_tracking_perf_test in a nested guest
Date: Fri, 23 Sep 2022 13:57:01 +0200
Message-ID: <da71648e-b501-dba2-a7f8-93d16ce6d833@redhat.com>
In-Reply-To: <50dfe81bf95db91e6148b421740490c35c33233e.camel@redhat.com>
On 23/09/2022 at 12:16, Maxim Levitsky wrote:
> Hi!
>
> Emanuele Giuseppe Esposito and I were working on trying to understand why the access_tracking_perf_test
> fails when run in a nested guest on Intel, and I was finally able to find the root cause.
>
> So the access_tracking_perf_test tests the following:
>
> - It opens /sys/kernel/mm/page_idle/bitmap, which is a special root-only readable/writable
> file that allows a process to set/clear the accessed bit in its page tables.
> The interface of this file is inverted: it is a bitmap of 'idle' bits.
> Idle bit set === accessed bit is clear.
>
> - It then runs a KVM guest, and checks that when the guest accesses its memory
> (through EPT/NPT), the accessed bits are still updated normally as seen from /sys/kernel/mm/page_idle/bitmap.
>
> In particular, it first clears the accessed bit using /sys/kernel/mm/page_idle/bitmap,
> then runs a guest which reads/writes all its memory, and then
> checks that the accessed bit is set again by reading /sys/kernel/mm/page_idle/bitmap.
>
>
>
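
For reference, a minimal sketch of how a PFN maps to a bit in that file, assuming the layout described in the kernel's idle page tracking documentation (an array of 8-byte words, one bit per PFN). The helper names are mine and hypothetical, not taken from the selftest:

```c
#include <stdint.h>

/* Assumption: each 64-bit word of the bitmap covers 64 consecutive PFNs. */
#define PFNS_PER_WORD 64

/* Hypothetical helper: byte offset of the 8-byte word holding pfn's idle bit. */
static uint64_t idle_word_offset(uint64_t pfn)
{
	return (pfn / PFNS_PER_WORD) * sizeof(uint64_t);
}

/* Hypothetical helper: mask selecting pfn's idle bit within its word. */
static uint64_t idle_bit_mask(uint64_t pfn)
{
	return 1ULL << (pfn % PFNS_PER_WORD);
}

/*
 * Hypothetical helper: returns 1 if the idle bit for pfn is still set in
 * 'word' (page not accessed since it was marked idle), 0 otherwise.
 */
static int pfn_is_idle(uint64_t word, uint64_t pfn)
{
	return (word & idle_bit_mask(pfn)) != 0;
}
```

A checker would pread() 8 bytes at idle_word_offset(pfn) and pass the word to pfn_is_idle(); a page counts as "accessed" when the result is 0.
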
> Now since KVM uses its own paging (aka secondary MMU), mmu notifiers are used, and in particular
> - kvm_mmu_notifier_clear_flush_young
> - kvm_mmu_notifier_clear_young
> - kvm_mmu_notifier_test_young
>
> The first two clear the accessed bit in NPT/EPT, and the third only checks its value.
>
> The difference between the first two notifiers is that the first one flushes EPT/NPT,
> and the second one doesn't, and apparently the /sys/kernel/mm/page_idle/bitmap uses the second one.
>
> This means that on bare metal, the TLB might still have the accessed bit set, and thus
> the CPU might not set it again in the PTE when a memory access is done through it.
>
> There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy, so this seems to be
> done on purpose.
>
> I would like to hear your opinion on why it was done this way, and if the original reasons for
> not doing the tlb flush are still valid.
>
> Now, why does the access_tracking_perf_test fail in a nested guest?
> It is because KVM shadow paging, which is used to shadow the nested EPT, has a "TLB" which
> is not bounded by size, because it is stored in the unsync sptes in memory.
>
> Because of this, when the guest clears the accessed bit in its nested EPT entries, KVM doesn't
> notice/intercept it and the corresponding EPT sptes remain the same. Thus a later guest access to
> the memory is not intercepted, and because of this the accessed bit is not set again
> in the guest EPT tables.
>
> (If TLB flush were to happen, we would 'sync' the unsync sptes, by zapping them because we don't
> keep sptes for gptes with no accessed bit)
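
To make the mechanism quoted above concrete, here is a toy model of it. This is only a sketch under my own simplifying assumptions (one guest EPT entry, one shadow entry, no real KVM code); all names are hypothetical:

```c
#include <stdbool.h>

/* Toy model only: a single guest EPT entry and its shadow entry in L0. */
struct toy_mmu {
	bool gpte_accessed;	/* accessed bit as seen in the guest's EPT */
	bool spte_present;	/* unsync shadow entry, behaves like a TLB entry */
};

/* The nested guest touches the page. */
static void toy_guest_access(struct toy_mmu *m)
{
	if (!m->spte_present) {
		/* Miss: L0 intercepts, sets the accessed bit, installs the spte. */
		m->gpte_accessed = true;
		m->spte_present = true;
	}
	/* Hit: the access uses the stale spte and the gpte is left untouched. */
}

/* Models the non-flushing variant: clear the bit, keep the shadow entry. */
static bool toy_clear_young(struct toy_mmu *m)
{
	bool was_young = m->gpte_accessed;

	m->gpte_accessed = false;
	return was_young;
}

/* Models the flushing variant: the flush also zaps the shadow entry. */
static bool toy_clear_flush_young(struct toy_mmu *m)
{
	bool was_young = toy_clear_young(m);

	m->spte_present = false;
	return was_young;
}
```

In this model, an access after toy_clear_young() leaves gpte_accessed false (the failure the test sees), while an access after toy_clear_flush_young() sets it again.
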
As suggested by Paolo, I also tried changing the page_idle.c implementation so that it ends up calling kvm_mmu_notifier_clear_flush_young instead of its non-flushing counterpart:

diff --git a/mm/page_idle.c b/mm/page_idle.c
index edead6a8a5f9..ffc1b0182534 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -62,10 +62,10 @@ static bool page_idle_clear_pte_refs_one(struct page *page,
 			 * For PTE-mapped THP, one sub page is referenced,
 			 * the whole THP is referenced.
 			 */
-			if (ptep_clear_young_notify(vma, addr, pvmw.pte))
+			if (ptep_clear_flush_young_notify(vma, addr, pvmw.pte))
 				referenced = true;
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-			if (pmdp_clear_young_notify(vma, addr, pvmw.pmd))
+			if (pmdp_clear_flush_young_notify(vma, addr, pvmw.pmd))
 				referenced = true;
 		} else {
 			/* unexpected pmd-mapped page? */

As expected, with the above patch the test no longer fails, which confirms Maxim's analysis.
As I understand it, an alternative would be to get rid of the test, or at least move it out of kvm?
Thank you,
Emanuele
>
>
> Any comments are welcome!
>
> If you think that the lack of the EPT flush is still the right thing to do,
> I vote again to have at least some form of a blacklist of selftests which
> are expected to fail, when run under KVM (fix_hypercall_test is the other test
> I already know that fails in a KVM guest, also without a practical way to fix it).
>
>
> Best regards,
> Maxim Levitsky
>
>
> PS: the test doesn't fail on AMD because we sync the nested NPT on each nested VM entry, which
> means that L0 syncs all the page tables.
>
> Also the test sometimes passes on Intel when an unrelated TLB flush syncs the nested EPT.
>
> Not using the new tdp_mmu also 'helps' by letting the test pass much more often but it still
> fails once in a while, likely because of timing and/or different implementation.
>
>
>