From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C8E22C47077 for ; Thu, 11 Jan 2024 12:43:21 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 52AAF6B00AA; Thu, 11 Jan 2024 07:43:21 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 4D9846B00AB; Thu, 11 Jan 2024 07:43:21 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3C7EA6B00AC; Thu, 11 Jan 2024 07:43:21 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 295866B00AA for ; Thu, 11 Jan 2024 07:43:21 -0500 (EST) Received: from smtpin06.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id E60C680C48 for ; Thu, 11 Jan 2024 12:43:20 +0000 (UTC) X-FDA: 81666995760.06.44D060A Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [94.136.29.106]) by imf21.hostedemail.com (Postfix) with ESMTP id E526C1C0013 for ; Thu, 11 Jan 2024 12:43:18 +0000 (UTC) Authentication-Results: imf21.hostedemail.com; dkim=none; spf=pass (imf21.hostedemail.com: domain of f.weber@proxmox.com designates 94.136.29.106 as permitted sender) smtp.mailfrom=f.weber@proxmox.com; dmarc=none ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1704976999; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=KrvYUbE+moRmxlkCx6BUxxaQLtS0S8b36GHn9Uy8h10=; b=GDnZpGIcn+U+yBf6TC6ZuNA8G1Wad/1TpXH2ii7fqO3zvJ1Wfsb0rzIsRv/I0TT/+q5mt0 KZ8NVNGgRi5ZGGxuTqdd3ldD3QtoRX/3DY+yqASTlYKPAw9K0U42yTQqtdJr1IctPiYEDK 9Pnvyz8eP1jNzAnl+gzkDGY9Eit5u80= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1704976999; a=rsa-sha256; cv=none; b=abaBxQs3A+SvHI/QJmWQfPWij1WhP4zP1d4lLICefl4rctIN4JEQyYQGis2y/1f4OvZ6BD a4B+4ezyojWhU/hRe4sjTfDRs1o9vOmjmTRZcCxrrG8u6eoYf+qcXJpLz4vSBNLXxUG7UW va2pXq7CdGdMYkehjk4XUvU6o2uH24A= ARC-Authentication-Results: i=1; imf21.hostedemail.com; dkim=none; spf=pass (imf21.hostedemail.com: domain of f.weber@proxmox.com designates 94.136.29.106 as permitted sender) smtp.mailfrom=f.weber@proxmox.com; dmarc=none Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1]) by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 4A9F544E2E; Thu, 11 Jan 2024 13:43:16 +0100 (CET) Message-ID: <2c73db3f-2465-49e1-9d57-ef8d978849b6@proxmox.com> Date: Thu, 11 Jan 2024 13:43:15 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Temporary KVM guest hangs connected to KSM and NUMA balancer Content-Language: en-US From: Friedrich Weber To: kvm@vger.kernel.org Cc: Sean Christopherson , Paolo Bonzini , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org References: <832697b9-3652-422d-a019-8c0574a188ac@proxmox.com> In-Reply-To: <832697b9-3652-422d-a019-8c0574a188ac@proxmox.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Stat-Signature: wbbxp4zxuz34e6wb7qroatx37h6smun5 X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: E526C1C0013 X-Rspam-User: X-HE-Tag: 1704976998-909407 X-HE-Meta: U2FsdGVkX1+miAp5DxqqZPhn1oKd8OTem3SumOhZX4+nxT+FPxaHHr4Gmy3JUg2pHE98mkEfV/fAU7sOReHzW4OQ7tQXcvzlBC4qXpk/WWqDdG6za7ncQfsWPMOsI458So2t40hVF/RzfSzluDui3Uo53HvYexUJqQn8WBqNfKEWyTxNYR6W//cBIP5Ho+YKV0qsW8Ds12o0QXFzwg4V7NMpu4q/QBn0621RcnTZ0BV4meq9GfVAZ3+GhkoiWG5nEKUxPe5sl3dIC4po2lFX76WEfzrgp17Ybw5qDMz6e1jTR1WGz1HG/2i8fyT+6YXFKDjD34ZhcsF6EVZH9WJR5fOLePAjQcHpMCa/aKZ9bHYVOGDcB9cbZwReUuEim0wWzGfCnyqnLsmSxEDShgHq8swp9MK5bRJxYOkxgFc94+CGqgZ/Y6YnnxuTR57ceEi27uqP1rVoROXw3p7ee7LKzhjTPPWmAyqlu2COjkhlGaunMI+2oszmLZuNmtxFmkevoatrce4AwhPCmERGl4RqtbdJGlmV11yAGJcrSpwxlisASm2pp30pCsFCXssSiABLKFdKF12JoOS2xX6f4S9inrpZq40nt1R3Qp17XMdR+olmQFZEKCoy3R5ubrj6khoacgqPCMt4k0Iy8H6lqQPHrWNbcdPOJcqZcPXDBxu808xfFNygYYvrBeyu70L95WvKaRqeoOvkmHcyvuJ6Q+0/zkEK9Xvea3s3pBrOYlaxn+0dKzC/5yGdLK2bg7EtdaAdB41qtrMvSZWDIU3Z0/9XcP4QygczpvKex52mByiSqvNeIqAdu9z15MphRIe5vQwPu3iS6vBaNxHXd6i1SxMNR3EocC8mla2c2eTL/MJYkAe2ceAGAoLtBQCmd9p4vQXxCUc+1VnNt99hi7380r+a6kIUUQhOLPycxjFB1BWZc3FzjSNJ5QZJ59SBcXncZS8vJec2Hh/0zXuOCwc1daF Vq/o/4iP G6iVLvgwaEiXr45rpjNJfr6uXf7Hpihaa11AD+ECAIjPLe5gnxm92H9vX2CO6Wtz6E6WyKcF3pwohPHTfrt7q0LIzYAho9F+tzjtQBg7FiKx5clNPdYwZ8vtT2w== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000142, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Hi, On 04/01/2024 14:42, Friedrich Weber wrote: > f47e5bbb ("KVM: x86/mmu: Zap only TDP MMU leafs in zap range and > mmu_notifier unmap") As this commit mentions tdp_mmu_zap_leafs, I ran the reproducer again on 610a9b8f (6.7-rc8) while tracing >500ms calls to that function (printing a stacktrace) and task_numa_work via a bpftrace script [1]. Again, there are several invocations of task_numa_work that take >500ms (the VM appears to be frozen during that time). For instance, one invocation that takes ~20 seconds: [1704971602] task_numa_work (tid=52035) took 19995 ms For this particular thread and in the 20 seconds before that, there were 8 invocations of tdp_mmu_zap_leafs with >500ms: [1704971584] tdp_mmu_zap_leafs (tid=52035) took 2291 ms [1704971586] tdp_mmu_zap_leafs (tid=52035) took 2343 ms [1704971589] tdp_mmu_zap_leafs (tid=52035) took 2316 ms [1704971590] tdp_mmu_zap_leafs (tid=52035) took 1663 ms [1704971591] tdp_mmu_zap_leafs (tid=52035) took 682 ms [1704971594] tdp_mmu_zap_leafs (tid=52035) took 2706 ms [1704971597] tdp_mmu_zap_leafs (tid=52035) took 3132 ms [1704971602] tdp_mmu_zap_leafs (tid=52035) took 4846 ms They roughly sum up to 20s. The stacktrace is the same for all: bpf_prog_5ca52691cb9e9fbd_tdp_mmu_zap_lea+345 bpf_prog_5ca52691cb9e9fbd_tdp_mmu_zap_lea+345 bpf_trampoline_380104735946+208 tdp_mmu_zap_leafs+5 kvm_unmap_gfn_range+347 kvm_mmu_notifier_invalidate_range_start+394 __mmu_notifier_invalidate_range_start+156 change_protection+3908 change_prot_numa+105 task_numa_work+1029 bpf_trampoline_6442457341+117 task_numa_work+9 xfer_to_guest_mode_handle_work+261 kvm_arch_vcpu_ioctl_run+1553 kvm_vcpu_ioctl+667 __x64_sys_ioctl+164 do_syscall_64+96 entry_SYSCALL_64_after_hwframe+110 AFAICT this pattern repeats several times. I uploaded the last 150kb of the bpftrace output to [2]. I can provide the full output if needed. To me, not knowing much about the KVM/KSM/NUMA balancer interplay, this looks like task_numa_work triggers several invocations of tdp_mmu_zap_leafs, each of which takes an unusually long time. If anyone has a hunch why this might happen, or an idea where to look next, it would be much appreciated. Best, Friedrich [1] kfunc:tdp_mmu_zap_leafs { @start_zap[tid] = nsecs; } kretfunc:tdp_mmu_zap_leafs /@start_zap[tid]/ { $diff = nsecs - @start_zap[tid]; if ($diff > 500000000) { // 500ms time("[%s] "); printf("tdp_mmu_zap_leafs (tid=%d) took %d ms\n", tid, $diff / 1000000); print(kstack()); } delete(@start_zap[tid]); } kfunc:task_numa_work { @start_numa[tid] = nsecs; } kretfunc:task_numa_work /@start_numa[tid]/ { $diff = nsecs - @start_numa[tid]; if ($diff > 500000000) { // 500ms time("[%s] "); printf("task_numa_work (tid=%d) took %d ms\n", tid, $diff / 1000000); } delete(@start_numa[tid]); } [2] https://paste.debian.net/1303767/