Re: access_tracking_perf_test kvm selftest doesn't work when Multi-Gen LRU is in use
From: Sean Christopherson @ 2024-05-21 23:29 UTC
To: Maxim Levitsky; +Cc: kvm, Paolo Bonzini, Henry Huang, linux-mm
On Wed, May 15, 2024, Maxim Levitsky wrote:
> Small note on why we started seeing this failure on RHEL 9 and only on some machines:
>
> - RHEL9 has MGLRU enabled, RHEL8 doesn't.
For a stopgap in KVM selftests, or possibly even a long-term solution in case the
decision is that page_idle will simply have different behavior for MGLRU, couldn't
we tweak the test to not assert if MGLRU is enabled?
E.g. refactor get_module_param_integer() and/or get_module_param() to add
get_sysfs_value_integer() or so, and then do this?
diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..1e759df36098 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -123,6 +123,11 @@ static void mark_page_idle(int page_idle_fd, uint64_t pfn)
 		    "Set page_idle bits for PFN 0x%" PRIx64, pfn);
 }
 
+static bool is_lru_gen_enabled(void)
+{
+	return !!get_sysfs_value_integer("/sys/kernel/mm/lru_gen/enabled");
+}
+
 static void mark_vcpu_memory_idle(struct kvm_vm *vm,
 				  struct memstress_vcpu_args *vcpu_args)
 {
@@ -185,7 +190,8 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
 	 */
 	if (still_idle >= pages / 10) {
 #ifdef __x86_64__
-		TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR),
+		TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR) ||
+			    is_lru_gen_enabled(),
 			    "vCPU%d: Too many pages still idle (%lu out of %lu)",
 			    vcpu_idx, still_idle, pages);
 #endif
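
Completely untested, but the new helper could be something as dumb as the sketch
below (IIRC lru_gen/enabled reports a hex bitmask, e.g. 0x0007, so parse with
base 0; treating a missing or unreadable knob as "disabled" keeps the current
behavior on kernels without MGLRU):

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

static uint64_t get_sysfs_value_integer(const char *path)
{
	char buf[64] = {};
	ssize_t r;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return 0;	/* No knob, e.g. !CONFIG_LRU_GEN => treat as disabled. */

	r = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (r <= 0)
		return 0;

	/* Base 0 handles both the "0x...." bitmask format and plain decimal. */
	return strtoull(buf, NULL, 0);
}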
> - the machine needs to have more than one NUMA node, because NUMA balancing
> (enabled by default) apparently tries to write-protect the primary PTEs
> of (all?) processes every few seconds, and that causes KVM to flush the secondary
> PTEs (at least with the new TDP MMU):
>
> access_tracking-3448 [091] ....1.. 1380.244666: handle_changed_spte <-tdp_mmu_set_spte
> access_tracking-3448 [091] ....1.. 1380.244667: <stack trace>
> => cdc_driver_init
> => handle_changed_spte
> => tdp_mmu_set_spte
> => tdp_mmu_zap_leafs
> => kvm_tdp_mmu_unmap_gfn_range
> => kvm_unmap_gfn_range
> => kvm_mmu_notifier_invalidate_range_start
> => __mmu_notifier_invalidate_range_start
> => change_p4d_range
> => change_protection
> => change_prot_numa
> => task_numa_work
> => task_work_run
> => exit_to_user_mode_prepare
> => syscall_exit_to_user_mode
> => do_syscall_64
> => entry_SYSCALL_64_after_hwframe
>
> It's a separate question whether NUMA balancing should do this, or whether
> NUMA balancing should be enabled by default,
FWIW, IMO, enabling NUMA balancing on a system whose primary purpose is to run VMs
is a bad idea. NUMA balancing operates under the assumption that a !PRESENT #PF is
relatively cheap. When secondary MMUs are involved, that is simply not the case,
e.g. to honor the mmu_notifier event, KVM zaps _and_ does a remote TLB flush. Even
if we reworked KVM and/or the mmu_notifiers so that KVM didn't need to do such a
heavy operation, the cost of a page fault VM-Exit is significantly higher than the
cost of a host #PF.
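
And if it turns out the test also needs to tolerate NUMA balancing, the same
(hypothetical) helper sketched above could key off the sysctl knob, e.g.
(untested, and IIRC newer kernels can also report '2' for the memory tiering
mode, so check for any non-zero value):

static bool is_numa_balancing_enabled(void)
{
	/* 0 == disabled; 1 (or 2 for memory tiering) == some flavor of balancing. */
	return !!get_sysfs_value_integer("/proc/sys/kernel/numa_balancing");
}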
> because there are other reasons that can force KVM to invalidate the
> secondary mappings and trigger this issue.
Ya.