The root cause of failure of access_tracking_perf_test in a nested guest

From: Maxim Levitsky <mlevitsk@redhat.com>
To: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	linux-mm@kvack.org, Sean Christopherson <seanjc@google.com>,
	Emanuele Giuseppe Esposito <eesposit@redhat.com>
Subject: The root cause of failure of access_tracking_perf_test in a nested guest
Date: Fri, 23 Sep 2022 13:16:04 +0300
Message-ID: <50dfe81bf95db91e6148b421740490c35c33233e.camel@redhat.com>

Hi!

Emanuele Giuseppe Esposito and I have been trying to understand why access_tracking_perf_test
fails when run in a nested guest on Intel, and I was finally able to find the root cause.

The access_tracking_perf_test tests the following:

- It opens /sys/kernel/mm/page_idle/bitmap, a special file, readable and writable only by root,
which allows a process to set/clear the accessed bit in its page tables.
The interface of this file is inverted: it is a bitmap of 'idle' bits.
Idle bit set === accessed bit is clear.

- It then runs a KVM guest, and checks that when the guest accesses its memory
(through EPT/NPT), the accessed bits are still updated normally as seen from /sys/kernel/mm/page_idle/bitmap.

In particular, it first clears the accessed bit using /sys/kernel/mm/page_idle/bitmap,
then runs a guest which reads/writes all its memory, and then
checks that the accessed bit is set again by reading /sys/kernel/mm/page_idle/bitmap.
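
As an illustration of the bitmap layout the test relies on, here is a minimal sketch in C.
It assumes the layout described in the kernel's idle page tracking documentation: an array
of 8-byte words in native byte order, where the page at PFN i maps to bit i % 64 of word
i / 64. The helper names are made up for this example and are not taken from the selftest.
The caller has to obtain the PFN separately (for instance from /proc/self/pagemap) and must
open the bitmap file as root.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Each 64-bit word of the bitmap covers 64 consecutive page frames. */
#define PFNS_PER_WORD 64

/* Byte offset, within the bitmap file, of the word holding this PFN's bit. */
static off_t idle_word_offset(uint64_t pfn)
{
	return (off_t)((pfn / PFNS_PER_WORD) * sizeof(uint64_t));
}

/* Mask selecting this PFN's bit inside its word. */
static uint64_t idle_bit_mask(uint64_t pfn)
{
	return 1ULL << (pfn % PFNS_PER_WORD);
}

/*
 * Mark one page idle, i.e. ask the kernel to clear its accessed bit.
 * Bits written as 0 are ignored by the kernel, so the other 63 pages
 * covered by the same word are left alone. Returns 0 on success.
 */
static int mark_page_idle(int bitmap_fd, uint64_t pfn)
{
	uint64_t word = idle_bit_mask(pfn);
	ssize_t n = pwrite(bitmap_fd, &word, sizeof(word), idle_word_offset(pfn));

	return n == (ssize_t)sizeof(word) ? 0 : -1;
}

/*
 * Returns 1 if the page is still idle (not accessed since it was marked),
 * 0 if it has been accessed, and -1 on I/O error.
 */
static int page_is_idle(int bitmap_fd, uint64_t pfn)
{
	uint64_t word;
	ssize_t n = pread(bitmap_fd, &word, sizeof(word), idle_word_offset(pfn));

	if (n != (ssize_t)sizeof(word))
		return -1;
	return (word & idle_bit_mask(pfn)) ? 1 : 0;
}
```

With these helpers the flow is: mark_page_idle() for every guest page, run the guest, then
expect page_is_idle() to return 0 for every page the guest touched.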

Since KVM uses its own paging (a secondary MMU), MMU notifiers are used, in particular:
- kvm_mmu_notifier_clear_flush_young
- kvm_mmu_notifier_clear_young
- kvm_mmu_notifier_test_young

The first two clear the accessed bit in the NPT/EPT, and the third only checks its value.

The difference between the first two notifiers is that the first one flushes the TLB for the
EPT/NPT mappings and the second one doesn't, and apparently /sys/kernel/mm/page_idle/bitmap uses
the second one.

This means that on bare metal, the TLB might still hold a translation with the accessed bit set,
so the CPU might not set the bit again in the PTE when memory is accessed through that translation.

There is a comment in kvm_mmu_notifier_clear_young about this inaccuracy, so this seems to be
done on purpose.

I would like to hear your opinion on why it was done this way, and whether the original reasons
for not doing the TLB flush are still valid.

Now, why does access_tracking_perf_test fail in a nested guest?
It is because KVM shadow paging, which is used to shadow the nested EPT, has a "TLB" which
is not bounded in size, because it is stored in the unsync sptes in memory.

Because of this, when the guest clears the accessed bit in its nested EPT entries, KVM doesn't
notice/intercept it, and the corresponding EPT sptes remain the same. A later guest access to
that memory is therefore not intercepted, and so the accessed bit in the guest EPT tables
is not set again.

(If a TLB flush were to happen, we would 'sync' the unsync sptes by zapping them, because we don't
keep sptes for gptes with no accessed bit.)
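
To make the mechanism concrete, here is a toy model in C. It is only a sketch and not KVM
code: the struct, the function names and the fixed page count are all invented for
illustration. A shadow entry acts as a cached translation, only a miss is intercepted, and
only the intercept path sets the accessed bit in the guest's entry.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4

/* Toy model: one guest-owned entry and one shadow entry per page. */
struct toy_mmu {
	bool gpte_accessed[NPAGES]; /* accessed bit in the guest's nested EPT entry */
	bool spte_present[NPAGES];  /* shadow entry; plays the role of the "TLB" */
};

/*
 * Guest memory access. A hit in the shadow entries is not intercepted,
 * so the accessed bit in the guest entry is left untouched. A miss is
 * intercepted: the accessed bit is set and the shadow entry is created.
 */
static void toy_access(struct toy_mmu *m, size_t page)
{
	if (m->spte_present[page])
		return;
	m->gpte_accessed[page] = true;
	m->spte_present[page] = true;
}

/* clear_young-style operation: clear the accessed bit without any flush. */
static void toy_clear_young(struct toy_mmu *m, size_t page)
{
	m->gpte_accessed[page] = false;
}

/* Flush/sync: zap shadow entries whose guest entry has no accessed bit. */
static void toy_sync(struct toy_mmu *m)
{
	for (size_t i = 0; i < NPAGES; i++)
		if (!m->gpte_accessed[i])
			m->spte_present[i] = false;
}
```

In this model, toy_access() followed by toy_clear_young() and another toy_access() leaves
the accessed bit clear, which is the failure the test observes. Calling toy_sync() between
the clear and the second access makes the bit come back.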

Any comments are welcome!

If you think that the lack of the EPT flush is still the right thing to do,
I vote again for having at least some form of a blacklist of selftests which
are expected to fail when run under KVM (fix_hypercall_test is the other test
that I already know fails in a KVM guest, also without a practical way to fix it).

Best regards,
	Maxim Levitsky

PS: the test doesn't fail on AMD because we sync the nested NPT on each nested VM entry, which
means that L0 syncs all the page tables.

Also the test sometimes passes on Intel when an unrelated TLB flush syncs the nested EPT.

Not using the new tdp_mmu also 'helps' by letting the test pass much more often, but it still
fails once in a while, likely because of timing and/or a different implementation.
