Just had another crash, and it is still happening on 6.18.8. Here is the Claude output this time, as I am lazy :) Let me know what steps I should take next.

Claude:

Summary (English)

Crash Type: Kernel Oops - page fault on an invalid (not-present) kernel address
Timeline: February 4, 2026 at 21:42:16
Environment:
- Host: pve12 (Proxmox VE)
- Hardware: AMD EPYC Turin (TURIN2D24G-2L+/500W)
- Kernel: 6.18.8-pbk (custom build)
- CPU affected: Core 125 (of 128+ cores)

Root Cause: The kernel encountered a page fault while the qm process (Proxmox QEMU VM manager, PID 783442) was reading /proc/<pid>/smaps_rollup. The fault occurred in the walk_pgd_range() function at address ff164aee00000000, which is an invalid/not-present page.

Call Stack: walk_pgd_range() → __walk_page_range() → walk_page_vma() → smap_gather_stats() → show_smaps_rollup() → seq_read() → vfs_read()

Analysis: This is a race condition or memory corruption bug in the kernel's page table walking code. The qm tool was attempting to gather memory statistics from a VM process when the kernel dereferenced a stale or corrupted page table entry (PTE). The address ff164aee00000000 in R12 suggests a corrupted pointer that should have pointed to a valid page table structure.

Potential Causes:
1. Race condition between VM memory operations and /proc/<pid>/smaps_rollup reads
2. Bug in kernel 6.18.8 page table handling with KVM/VFIO passthrough
3. Custom kernel patches (-pbk) introducing instability (this is a clean Ubuntu config)
4. Possible memory corruption from VFIO PCI passthrough (mlx5, vfio-pci modules loaded)

Recommendations:
1. Check upstream kernel 6.18.x changelogs for related fixes
2. Test with a vanilla kernel, without the -pbk patches (it happens on other kernels too)
3. Monitor for recurrence and check whether specific VMs trigger the issue (no, multiple ones do)
4. Consider enabling kernel debugging (CONFIG_DEBUG_VM) to catch similar issues earlier
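My own aside, before Claude's log dump below: if a reproducer helps, I can hammer smaps_rollup for every running guest while VMs are being started and stopped, since that seems to be when qm reads it. A minimal sketch, assuming the Proxmox pidfiles live under /run/qemu-server/ as they do on my hosts:

  # read smaps_rollup of every running guest in a tight loop
  while true; do
      for pf in /run/qemu-server/*.pid; do
          pid=$(cat "$pf" 2>/dev/null) || continue
          cat "/proc/$pid/smaps_rollup" > /dev/null 2>&1
      done
  done

And for recommendation 4, checking whether the current build already has it is just:

  grep CONFIG_DEBUG_VM= /boot/config-$(uname -r)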
Claude ran:

  ssh root@10.10.42.12 "journalctl -b -1 -k --no-pager | grep -A 100 'BUG: unable to handle page fault'"

Crash logs:

Feb 04 21:42:16 pve12 kernel: BUG: unable to handle page fault for address: ff164aee00000000
Feb 04 21:42:16 pve12 kernel: #PF: supervisor read access in kernel mode
Feb 04 21:42:16 pve12 kernel: #PF: error_code(0x0000) - not-present page
Feb 04 21:42:16 pve12 kernel: PGD 4602067 P4D 0
Feb 04 21:42:16 pve12 kernel: Oops: Oops: 0000 [#1] SMP NOPTI
Feb 04 21:42:16 pve12 kernel: CPU: 125 UID: 0 PID: 783442 Comm: qm Not tainted 6.18.8-pbk #1 PREEMPT(voluntary)
Feb 04 21:42:16 pve12 kernel: Hardware name: TURIN2D24G-2L+/500W/TURIN2D24G-2L+/500W, BIOS 10.12 11/14/2024
Feb 04 21:42:16 pve12 kernel: RIP: 0010:walk_pgd_range+0x6ff/0xbb0
Feb 04 21:42:16 pve12 kernel: Code: 08 49 39 dd 0f 84 8c 01 00 00 49 89 de 49 8d 9e 00 00 20 00 48 8b 75 b8 48 81 e3 00 00 e0 ff 48 8d 43 ff 48 39 f0 49 0f 43 dd <49> f7 04 24 9f ff ff ff 0f 84 e2 fd ff ff 48 8b 45 c0 41 c7 47 20
Feb 04 21:42:16 pve12 kernel: RSP: 0018:ff643a4c403439a8 EFLAGS: 00010287
Feb 04 21:42:16 pve12 kernel: RAX: 000071afc01fffff RBX: 000071afc0200000 RCX: 0000000000000000
Feb 04 21:42:16 pve12 kernel: RDX: 0000000000000000 RSI: 000071afffffffff RDI: 80002fd0000002b7
Feb 04 21:42:16 pve12 kernel: RBP: ff643a4c40343a88 R08: 0000000000000080 R09: 0000000000000000
Feb 04 21:42:16 pve12 kernel: R10: ffffffffb84588c0 R11: 0000000000000000 R12: ff164aee00000000
Feb 04 21:42:16 pve12 kernel: R13: 000071b000000000 R14: 000071afc0000000 R15: ff643a4c40343b08
Feb 04 21:42:16 pve12 kernel: FS: 0000767454f93b80(0000) GS:ff161c21bdefe000(0000) knlGS:0000000000000000
Feb 04 21:42:16 pve12 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 04 21:42:16 pve12 kernel: CR2: ff164aee00000000 CR3: 00000080dff18005 CR4: 0000000000f71ef0
Feb 04 21:42:16 pve12 kernel: PKRU: 55555554
Feb 04 21:42:16 pve12 kernel: Call Trace:
Feb 04 21:42:16 pve12 kernel: <TASK>
Feb 04 21:42:16 pve12 kernel: __walk_page_range+0x8e/0x220
Feb 04 21:42:16 pve12 kernel: walk_page_vma+0x92/0xe0
Feb 04 21:42:16 pve12 kernel: smap_gather_stats.part.0+0x8c/0xd0
Feb 04 21:42:16 pve12 kernel: show_smaps_rollup+0x258/0x420
Feb 04 21:42:16 pve12 kernel: seq_read_iter+0x137/0x4c0
Feb 04 21:42:16 pve12 kernel: seq_read+0xf5/0x140
Feb 04 21:42:16 pve12 kernel: vfs_read+0xbb/0x350
Feb 04 21:42:16 pve12 kernel: ksys_read+0x69/0xf0
Feb 04 21:42:16 pve12 kernel: __x64_sys_read+0x19/0x30
Feb 04 21:42:16 pve12 kernel: x64_sys_call+0x2180/0x25a0
Feb 04 21:42:16 pve12 kernel: do_syscall_64+0x80/0xce0
Feb 04 21:42:16 pve12 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Feb 04 21:42:16 pve12 kernel: </TASK>
Feb 04 21:42:16 pve12 kernel: Modules linked in: sctp ip6_udp_tunnel udp_tunnel nf_tables bridge stp llc sunrpc softdog bonding binfmt_misc nfnetlink_log nls_iso8859_1 intel_rapl_msr amd_atl intel_rapl_common amd64_edac edac_mce_amd snd_pcm snd_timer kvm_amd snd ipmi_ssif soundcore kvm polyval_clmulni dax_hmem ghash_clmulni_intel cxl_acpi aesni_intel cxl_port rapl cxl_core pcspkr einj ast acpi_ipmi spd5118 ipmi_si k10temp ipmi_devintf ccp ipmi_msghandler input_leds joydev mac_hid sch_fq_codel msr vhost_net vhost vhost_iotlb tap vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore nfnetlink dmi_sysfs autofs4 btrfs blake2b_generic xor raid6_pq mlx5_ib ib_uverbs macsec ib_core cdc_ether mlx5_core usbnet usbkbd mii mlxfw nvme nvme_core psample igb nvme_keyring i2c_algo_bit nvme_auth dca tls hkdf i2c_piix4 ahci libahci i2c_smbus hid_generic usbmouse usbhid hid
Feb 04 21:42:16 pve12 kernel: CR2: ff164aee00000000
Feb 04 21:42:16 pve12 kernel: ---[ end trace 0000000000000000 ]---
Feb 04 21:42:16 pve12 kernel: RIP: 0010:walk_pgd_range+0x6ff/0xbb0
Feb 04 21:42:16 pve12 kernel: note: qm[783442] exited with irqs disabled

--
tel. 790 202 300

Tytus Rogalewski

Dolina Krzemowa 6A
83-010 Jagatowo
NIP: 9570976234

On Wed, Feb 4, 2026 at 23:24, Tytus Rogalewski wrote:

> Hi,
>
> "hugepages" is probably QEMU's term for it.
>
> Yeah, 4k is the default, and booting is hard with that much memory,
> especially if you boot and stop a few times.
> But this issue might be strictly related to the VFIO passthrough mix.
> I did not actually test 2MB pages, because why use them if I have 1GB?
> Do you think 2MB could be more stable than 1GB, or does the same logic
> apply to both?
>
> Well, I started using the 1GB ones recently, as I had to get through this
> whole IOMMU/CPU labyrinth of binding the proper GPU to the proper memory
> and proper CPU affinity in KVM. And Proxmox VE does not have such logic
> built in.
>
> If you tell me what to collect, I can collect it.
>
> I actually have another symptom. Hmm, maybe it's related, maybe not.
> Still, I have had this second symptom from the beginning, and I did not
> have such crashes on 4k.
> I am using the distributed network storage MooseFS, mounting it via FUSE
> and then using qcow2 VM images on it.
> I sometimes get freezes in VMs, but that might be down to FUSE, as I
> mount one FUSE share and start as many as 8 VMs from that one mount.
> And from time to time some VMs stop responding or freeze.
> I will soon rewrite it to use NBD instead, and that should fix it if it
> was caused by FUSE.
> Still, I am not sure whether these are separate issues or related, and
> which triggers which.
> If VM A blocks a FUSE process, is it possible that VM B then throws this
> walk-page bug, or should they be unrelated even if the disk slows down?
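Next step on my side: symbolize the faulting RIP against the matching vmlinux. A sketch, assuming the -pbk build tree is still around and was built with debug info (the path below is hypothetical; faddr2line and decode_stacktrace.sh ship in the kernel source tree):

  cd /usr/src/linux-6.18.8-pbk   # hypothetical location of my build tree
  ./scripts/faddr2line vmlinux walk_pgd_range+0x6ff/0xbb0

  # or decode the whole oops in one go:
  journalctl -b -1 -k --no-pager | ./scripts/decode_stacktrace.sh vmlinux

That should at least tell us which source line in walk_pgd_range() dereferenced the bad pointer in R12.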
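For reference, the NBD variant I plan to test looks roughly like this (paths, ports and VM IDs are examples, not my actual config):

  # serve the qcow2 from a separate qemu-nbd process, so a stalled FUSE
  # read only blocks that process, not the QEMU main loop
  qemu-nbd --format=qcow2 --persistent --shared=8 --port=10809 \
      /mnt/moosefs/images/vm-100.qcow2
  # the VM then attaches nbd://127.0.0.1:10809 with format=raw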
> On Wed, Feb 4, 2026 at 22:52, David Hildenbrand (arm) wrote:
>
>> On 1/28/26 15:14, Tytus Rogalewski wrote:
>> > Hello guys,
>>
>> Hi!
>>
>> > Recently I reported a slab memory leak, and it was fixed.
>> >
>> > I am having yet another issue and am wondering where to report it.
>> > Would you be able to tell me if this is the right place, or should I
>> > send it to someone else?
>> > The issue also looks like a memory leak.
>> >
>> > It happens on multiple servers (less on 6.18.6, more on 6.19-rc4+).
>> > All servers are doing KVM with VFIO GPU PCIe passthrough, and it
>> > happens when I am using HUGEPAGE 1GB + QEMU.
>>
>> Okay, so we'll longterm-pin all guest memory into the iommu.
>>
>> > Basically I am allocating 970GB into hugepages, leaving 37GB to KVM.
>> > In normal operation I have about 20GB free, but when this issue
>> > occurs, all RAM is taken, and even when I added 100GB of swap, it was
>> > also consumed.
>>
>> When you say hugepage you mean 1 GiB hugetlb, correct?
>>
>> > It can work for days or a week without issue.
>> >
>> > I did not see that issue when I had hugepages disabled (on normal 2KB
>> > pages allocation in KVM).
>>
>> I assume you meant 4k pages. What about 2 MiB hugetlb?
>>
>> > And I am using hugepages because it is impossible to boot a VM with
>> > >200GB RAM otherwise.
>>
>> Oh, really? That's odd.
>>
>> > When that issue happens, ps hangs and only top shows anything, but
>> > the machine needs to be rebooted due to many zombie processes.
>> >
>> > Hardware:
>> > Motherboard: ASRockRack GENOA2D24G-2L
>> > CPU: 2x AMD EPYC 9654 96-Core Processor
>> > System RAM: 1024 GB
>> > GPUs: 8x RTX5090 vfio passthrough
>> >
>> > root@pve14:~# uname -a
>> > Linux pve14 6.18.6-pbk #1 SMP PREEMPT_DYNAMIC Mon Jan 19 20:59:46 UTC 2026 x86_64 GNU/Linux
>> >
>> > [171053.341288] BUG: unable to handle page fault for address: ff469ae640000000
>> > [171053.341310] #PF: supervisor read access in kernel mode
>> > [171053.341319] #PF: error_code(0x0000) - not-present page
>> > [171053.341328] PGD 4602067 P4D 0
>> > [171053.341337] Oops: Oops: 0000 [#1] SMP NOPTI
>> > [171053.341348] CPU: 16 UID: 0 PID: 3250869 Comm: qm Not tainted 6.18.6-pbk #1 PREEMPT(voluntary)
>> > [171053.341362] Hardware name: TURIN2D24G-2L+/500W/TURIN2D24G-2L+/500W, BIOS 10.20 05/05/2025
>> > [171053.341373] RIP: 0010:walk_pgd_range+0x6ff/0xbb0
>> > [171053.341386] Code: 08 49 39 dd 0f 84 8c 01 00 00 49 89 de 49 8d 9e 00 00 20 00 48 8b 75 b8 48 81 e3 00 00 e0 ff 48 8d 43 ff 48 39 f0 49 0f 43 dd <49> f7 04 24 9f ff ff ff 0f 84 e2 fd ff ff 48 8b 45 c0 41 c7 47 20
>> > [171053.341406] RSP: 0018:ff59d95d70e6b748 EFLAGS: 00010287
>> > [171053.341416] RAX: 00007a22401fffff RBX: 00007a2240200000 RCX: 0000000000000000
>> > [171053.341425] RDX: 0000000000000000 RSI: 00007a227fffffff RDI: 800008dfc00002b7
>> > [171053.341435] RBP: ff59d95d70e6b828 R08: 0000000000000080 R09: 0000000000000000
>> > [171053.341444] R10: ffffffff8de588c0 R11: 0000000000000000 R12: ff469ae640000000
>> > [171053.341454] R13: 00007a2280000000 R14: 00007a2240000000 R15: ff59d95d70e6b8a8
>> > [171053.341464] FS: 00007d4e8ec94b80(0000) GS:ff4692876ae7e000(0000) knlGS:0000000000000000
>> > [171053.341476] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [171053.341485] CR2: ff469ae640000000 CR3: 0000008241eed006 CR4: 0000000000f71ef0
>> > [171053.341495] PKRU: 55555554
>> > [171053.341501] Call Trace:
>> > [171053.341508] <TASK>
>> > [171053.341518] __walk_page_range+0x8e/0x220
>> > [171053.341529] ? sysvec_apic_timer_interrupt+0x57/0xc0
>> > [171053.341541] walk_page_vma+0x92/0xe0
>> > [171053.341551] smap_gather_stats.part.0+0x8c/0xd0
>> > [171053.341563] show_smaps_rollup+0x258/0x420
>>
>> Hm, so someone is reading /proc/$PID/smaps_rollup and we stumble
>> somewhere into something unexpected while doing a page table walk.
>>
>> [171053.341288] BUG: unable to handle page fault for address: ff469ae640000000
>> [171053.341310] #PF: supervisor read access in kernel mode
>> [171053.341319] #PF: error_code(0x0000) - not-present page
>> [171053.341328] PGD 4602067 P4D 0
>>
>> There is not a lot of information there :(
>>
>> Did you have other splats/symptoms or was it always that?
>>
>> --
>> Cheers,
>>
>> David
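PS: in case the exact hugetlb setup matters, it is along these lines on the 1 TiB hosts (the 970-page count matches what I described above; the QEMU snippet is only a sketch, and the backend ID and per-VM size are examples):

  # kernel command line: 970 x 1 GiB pages reserved at boot
  default_hugepagesz=1G hugepagesz=1G hugepages=970

  # per-VM QEMU side, backing guest RAM from the hugetlb pool
  -object memory-backend-file,id=ram0,size=64G,mem-path=/dev/hugepages,share=on,prealloc=on \
  -numa node,memdev=ram0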