From: Tytus Rogalewski
Date: Wed, 4 Feb 2026 23:24:05 +0100
Subject: Re: walk_pgd_range BUG: unable to handle page fault
To: "David Hildenbrand (arm)"
Cc: linux-mm@kvack.org, muchun.song@linux.dev, osalvador@suse.de
In-Reply-To: <5948f3a6-8f30-4c45-9b86-2af9a6b37405@kernel.org>

Hi,

"Hugepages" is probably a QEMU term.

Yeah, 4k is the default, and booting is hard with that much memory, especially if you boot and stop VMs a few times.
But this issue might be strictly related to the VFIO passthrough mix.
I have not actually tested 2 MB pages, because why use them if I have 1 GB ones?
Do you think 2 MB could be more stable than 1 GB, or is the logic the same for both?

Well, I started using 1 GB pages recently, because I had to work through the whole IOMMU/CPU labyrinth of binding each GPU to the right memory and the right CPU affinity in KVM, and Proxmox VE does not have such logic.

If you tell me what to collect, I can collect it.
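One cheap thing to capture at the next occurrence is the hugetlb accounting from /proc/meminfo, taken before and during the memory blow-up. A minimal sketch, assuming Python 3.9+; the helper names `parse_meminfo` and `hugetlb_summary` are hypothetical, and the `HugePages_*` counters describe only the default hugepage size (other sizes are under /sys/kernel/mm/hugepages/):

```python
"""Hypothetical helpers: summarize hugetlb accounting from /proc/meminfo text."""


def parse_meminfo(text: str) -> dict[str, int]:
    """Map each /proc/meminfo key to its integer value.

    Values with a 'kB' unit stay in kB; HugePages_* counters are page counts.
    """
    fields: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields


def hugetlb_summary(text: str) -> dict[str, float]:
    """Report the default-size hugetlb pool and the host's headroom in GiB."""
    m = parse_meminfo(text)
    page_kib = m["Hugepagesize"]  # default hugepage size, in kB
    kib_per_gib = 1024 * 1024
    return {
        "pool_gib": m["HugePages_Total"] * page_kib / kib_per_gib,
        "pool_free_gib": m["HugePages_Free"] * page_kib / kib_per_gib,
        # Pages sitting in the hugetlb pool (used or free) are not counted in
        # MemAvailable, so this is roughly what remains for everything else.
        "host_available_gib": m["MemAvailable"] / kib_per_gib,
    }
```

On the host this would be called as `hugetlb_summary(open("/proc/meminfo").read())`; logging it periodically would show whether the non-hugetlb remainder is what gets exhausted.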

I actually have another symptom. Maybe it's related, maybe not.
Still, I have had this second symptom from the beginning, and I did not have such crashes with 4k pages.
I am using the distributed network storage MooseFS, mounted via FUSE, with qcow2 VM images on it.
I sometimes get freezes in VMs, but that might be related to FUSE, as I mount one FUSE share and start as many as 8 VMs from that one mount.
From time to time, some VMs stop responding or freeze.
I will soon rewrite the setup to use NBD instead, which should fix it if FUSE was the cause.
Still, I am not actually sure whether these are separate issues or related, and which triggers which.
If a FUSE process is blocked by VM A, is it possible that VM B hits this page-walk bug, or should that be unrelated even if the disk slows down?

--

tel. 790 202 300

Tytus Rogalewski

Dolina Krzemowa 6A

83-010 Jagatowo

NIP: 9570976234

On Wed, 4 Feb 2026 at 22:52, David Hildenbrand (arm) <david@kernel.org> wrote:
On 1/28/26 15:14, Tytus Rogalewski wrote:
> Hello guys,
>

Hi!

> Recently I reported a slab memory leak, and it was fixed.
>
> I am having yet another issue and wondering where to report it.
> Would you be able to tell me if this is the right place, or should I send
> it to someone else?
> The issue also looks like a memory leak.
>
> It happens on multiple servers (less often on 6.18.6, more often on 6.19-rc4+).
> All servers are doing KVM with VFIO GPU PCIe passthrough, and it happens
> when I am using HUGEPAGE 1GB + QEMU.

Okay, so we'll longterm-pin all guest memory into the iommu.

> Basically, I am allocating 970GB to hugepages, leaving 37GB to KVM.
> In normal operation I have about 20GB free, but when this issue
> occurs, all RAM is taken, and even when I added 100GB of swap, it was
> also consumed.

When you say hugepage you mean 1 GiB hugetlb, correct?

> It can work for days or a week without issue, and
>
> I did not see that issue when I had hugepages disabled (on normal 2KB
> page allocation in KVM).

I assume you meant 4k pages. What about 2 MiB hugetlb?

> And I am using hugepages because it is impossible to boot a VM with >200GB RAM.

Oh, really? That's odd.

> When that issue happens, ps hangs and only top shows
> something, but the machine needs to be rebooted due to many zombie processes.
>
> *Hardware:*
> Motherboard: ASRockRack GENOA2D24G-2L
> CPU: 2x AMD EPYC 9654 96-Core Processor
> System ram: 1024 GB
> GPUs: 8x RTX5090 vfio passthrough
>
> root@pve14:~# uname -a
> Linux pve14 6.18.6-pbk #1 SMP PREEMPT_DYNAMIC Mon Jan 19 20:59:46 UTC 2026 x86_64 GNU/Linux
>
> [171053.341288] BUG: unable to handle page fault for address: ff469ae640000000
> [171053.341310] #PF: supervisor read access in kernel mode
> [171053.341319] #PF: error_code(0x0000) - not-present page
> [171053.341328] PGD 4602067 P4D 0
> [171053.341337] Oops: Oops: 0000 [#1] SMP NOPTI
> [171053.341348] CPU: 16 UID: 0 PID: 3250869 Comm: qm Not tainted 6.18.6-pbk #1 PREEMPT(voluntary)
> [171053.341362] Hardware name: TURIN2D24G-2L+/500W/TURIN2D24G-2L+/500W, BIOS 10.20 05/05/2025
> [171053.341373] RIP: 0010:walk_pgd_range+0x6ff/0xbb0
> [171053.341386] Code: 08 49 39 dd 0f 84 8c 01 00 00 49 89 de 49 8d 9e 00 00 20 00 48 8b 75 b8 48 81 e3 00 00 e0 ff 48 8d 43 ff 48 39 f0 49 0f 43 dd <49> f7 04 24 9f ff ff ff 0f 84 e2 fd ff ff 48 8b 45 c0 41 c7 47 20
> [171053.341406] RSP: 0018:ff59d95d70e6b748 EFLAGS: 00010287
> [171053.341416] RAX: 00007a22401fffff RBX: 00007a2240200000 RCX: 0000000000000000
> [171053.341425] RDX: 0000000000000000 RSI: 00007a227fffffff RDI: 800008dfc00002b7
> [171053.341435] RBP: ff59d95d70e6b828 R08: 0000000000000080 R09: 0000000000000000
> [171053.341444] R10: ffffffff8de588c0 R11: 0000000000000000 R12: ff469ae640000000
> [171053.341454] R13: 00007a2280000000 R14: 00007a2240000000 R15: ff59d95d70e6b8a8
> [171053.341464] FS:  00007d4e8ec94b80(0000) GS:ff4692876ae7e000(0000) knlGS:0000000000000000
> [171053.341476] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [171053.341485] CR2: ff469ae640000000 CR3: 0000008241eed006 CR4: 0000000000f71ef0
> [171053.341495] PKRU: 55555554
> [171053.341501] Call Trace:
> [171053.341508]  <TASK>
> [171053.341518]  __walk_page_range+0x8e/0x220
> [171053.341529]  ? sysvec_apic_timer_interrupt+0x57/0xc0
> [171053.341541]  walk_page_vma+0x92/0xe0
> [171053.341551]  smap_gather_stats.part.0+0x8c/0xd0
> [171053.341563]  show_smaps_rollup+0x258/0x420

Hm, so someone is reading /proc/$PID/smaps_rollup and we stumble
somewhere into something unexpected while doing a page table walk.

[171053.341288] BUG: unable to handle page fault for address: ff469ae640000000
[171053.341310] #PF: supervisor read access in kernel mode
[171053.341319] #PF: error_code(0x0000) - not-present page
[171053.341328] PGD 4602067 P4D 0

There is not a lot of information there :(
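The register dump in the full oops quoted earlier does allow a few arithmetic checks. This is a sketch under assumptions the oops does not state: that RDI holds the page-table entry being examined and that R14/R13 bound the range being walked. The helper and constant names are hypothetical; the bit names follow the x86-64 paging-entry format, with bit 9 being the software bit Linux uses as `_PAGE_SPECIAL`.

```python
"""Arithmetic checks on register values copied from the oops.

Assumptions (not stated by the oops itself): RDI holds a page-table entry
that the walker was looking at, and R14/R13 bound the range being walked.
"""

# Values copied verbatim from the register dump.
RDI = 0x800008DFC00002B7
R12 = 0xFF469AE640000000
R13 = 0x00007A2280000000
R14 = 0x00007A2240000000
RBX = 0x00007A2240200000
CR2 = 0xFF469AE640000000

SZ_2M = 1 << 21
SZ_1G = 1 << 30

# x86-64 entry flag bits; bit 9 is a software bit Linux uses as _PAGE_SPECIAL.
FLAG_BITS = {
    0: "PRESENT", 1: "RW", 2: "USER", 3: "PWT", 4: "PCD",
    5: "ACCESSED", 6: "DIRTY", 7: "PSE", 8: "GLOBAL", 9: "SPECIAL",
    63: "NX",
}
PHYS_MASK = ((1 << 52) - 1) & ~((1 << 12) - 1)  # address bits 12..51


def decode_entry(entry: int) -> tuple[int, list[str]]:
    """Split an x86-64 page-table entry into (physical address, set flag names)."""
    flags = [name for bit, name in sorted(FLAG_BITS.items()) if (entry >> bit) & 1]
    return entry & PHYS_MASK, flags


phys, flags = decode_entry(RDI)
print(f"RDI as an entry: phys={phys:#x} ({phys // SZ_1G} GiB), flags={flags}")
print("walked range is 1 GiB and 1 GiB-aligned:", R13 - R14 == SZ_1G and R14 % SZ_1G == 0)
print("RBX is one 2 MiB step past R14:", RBX - R14 == SZ_2M)
print("faulting address equals R12:", CR2 == R12)
```

If those assumptions hold, the entry would be a present huge (PSE) leaf for a 1 GiB-aligned physical address at 9087 GiB, far above the 1024 GB of installed RAM, while the faulting address (CR2) equals R12. That is a lead worth checking, not a diagnosis.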

Did you have other splats/symptoms or was it always that?

--
Cheers,

David