On Wed, Feb 4, 2026 at 22:52 David Hildenbrand (arm) <david@kernel.org> wrote:

On 1/28/26 15:14, Tytus Rogalewski wrote:
> Hello guys,
>

Hi!

> Recently I reported a slab memory leak and it was fixed.
>
> I am having yet another issue and wondering where to report it.
> Would you be able to tell me if this is the right place or should I send
> it to someone else?
> The issue also looks like a memory leak.
>
> It happens on multiple servers (less on 6.18.6, more on 6.19-rc4+).
> All servers are doing KVM with vfio GPU PCIe passthrough, and it happens
> when I am using 1GB hugepages + qemu.

Okay, so we'll longterm-pin all guest memory into the iommu.

> Basically I am allocating 970GB into hugepages, leaving 37GB to kvm.
> In normal operation I have about 20GB free, but when this issue
> occurs, all RAM is taken, and even when I added 100GB of swap, it was
> also consumed.

When you say hugepage you mean 1 GiB hugetlb, correct?
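
If it helps to answer that, the per-size hugetlb pools can be read from
sysfs. A rough, untested sketch (the helper name hugetlb_pools is made up
for illustration; the directory layout under /sys/kernel/mm/hugepages is
the standard one):

```python
import os
import re


def hugetlb_pools(root="/sys/kernel/mm/hugepages"):
    """Return {page_size_kB: {"nr_hugepages": n, "free_hugepages": n}}.

    Returns an empty dict if the directory does not exist (for example
    on a kernel built without hugetlb support).
    """
    pools = {}
    if not os.path.isdir(root):
        return pools
    for entry in sorted(os.listdir(root)):
        # Pool directories are named like "hugepages-1048576kB".
        match = re.fullmatch(r"hugepages-(\d+)kB", entry)
        if match is None:
            continue
        counts = {}
        for name in ("nr_hugepages", "free_hugepages"):
            with open(os.path.join(root, entry, name)) as f:
                counts[name] = int(f.read().strip())
        pools[int(match.group(1))] = counts
    return pools
```

Calling hugetlb_pools() on the host and looking at the 1048576 (1 GiB)
and 2048 (2 MiB) entries should show which pool the guest memory
actually comes from.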

> It can work for days or week without issue and
>
> I have not seen that issue when I had hugepages disabled (on normal 2KB
> pages allocation in kvm).

I assume you meant 4k pages. What about 2 MiB hugetlb?

> And I am using hugepages as it is impossible to boot a VM with >200GB RAM.

Oh, really? That's odd.

> When that issue happens, the ps process hangs and only top shows
> something, but the machine needs to be rebooted due to many zombie processes.
>
> Hardware:
> Motherboard: ASRockRack GENOA2D24G-2L
> CPU: 2x AMD EPYC 9654 96-Core Processor
> System ram: 1024 GB
> GPUs: 8x RTX5090 vfio passthrough
>
> root@pve14:~# uname -a
> Linux pve14 6.18.6-pbk #1 SMP PREEMPT_DYNAMIC Mon Jan 19 20:59:46 UTC 2026 x86_64 GNU/Linux
>
> [171053.341288] BUG: unable to handle page fault for address: ff469ae640000000
> [171053.341310] #PF: supervisor read access in kernel mode
> [171053.341319] #PF: error_code(0x0000) - not-present page
> [171053.341328] PGD 4602067 P4D 0
> [171053.341337] Oops: Oops: 0000 [#1] SMP NOPTI
> [171053.341348] CPU: 16 UID: 0 PID: 3250869 Comm: qm Not tainted 6.18.6-pbk #1 PREEMPT(voluntary)
> [171053.341362] Hardware name: TURIN2D24G-2L+/500W/TURIN2D24G-2L+/500W, BIOS 10.20 05/05/2025
> [171053.341373] RIP: 0010:walk_pgd_range+0x6ff/0xbb0
> [171053.341386] Code: 08 49 39 dd 0f 84 8c 01 00 00 49 89 de 49 8d 9e 00 00 20 00 48 8b 75 b8 48 81 e3 00 00 e0 ff 48 8d 43 ff 48 39 f0 49 0f 43 dd <49> f7 04 24 9f ff ff ff 0f 84 e2 fd ff ff 48 8b 45 c0 41 c7 47 20
> [171053.341406] RSP: 0018:ff59d95d70e6b748 EFLAGS: 00010287
> [171053.341416] RAX: 00007a22401fffff RBX: 00007a2240200000 RCX: 0000000000000000
> [171053.341425] RDX: 0000000000000000 RSI: 00007a227fffffff RDI: 800008dfc00002b7
> [171053.341435] RBP: ff59d95d70e6b828 R08: 0000000000000080 R09: 0000000000000000
> [171053.341444] R10: ffffffff8de588c0 R11: 0000000000000000 R12: ff469ae640000000
> [171053.341454] R13: 00007a2280000000 R14: 00007a2240000000 R15: ff59d95d70e6b8a8
> [171053.341464] FS: 00007d4e8ec94b80(0000) GS:ff4692876ae7e000(0000) knlGS:0000000000000000
> [171053.341476] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [171053.341485] CR2: ff469ae640000000 CR3: 0000008241eed006 CR4: 0000000000f71ef0
> [171053.341495] PKRU: 55555554
> [171053.341501] Call Trace:
> [171053.341508] <TASK>
> [171053.341518] __walk_page_range+0x8e/0x220
> [171053.341529] ? sysvec_apic_timer_interrupt+0x57/0xc0
> [171053.341541] walk_page_vma+0x92/0xe0
> [171053.341551] smap_gather_stats.part.0+0x8c/0xd0
> [171053.341563] show_smaps_rollup+0x258/0x420

Hm, so someone is reading /proc/$PID/smaps_rollup and we stumble
somewhere into something unexpected while doing a page table walk.
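
To find out which process the reader (here "qm") was inspecting, and
to compare its numbers before things go wrong, the same file can be
sampled by hand. A minimal sketch, assuming the usual "Key: value kB"
layout of smaps_rollup; the function name parse_smaps_rollup is only
illustrative:

```python
def parse_smaps_rollup(text):
    """Parse the contents of /proc/<pid>/smaps_rollup into {field: kB}.

    The first line (the address range ending in "[rollup]") and any
    line that is not of the form "Key: <number> kB" are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def read_smaps_rollup(pid):
    """Read and parse smaps_rollup for one PID (needs permission to read it)."""
    with open(f"/proc/{pid}/smaps_rollup") as f:
        return parse_smaps_rollup(f.read())
```

Note that reading this file for the QEMU process is what triggers the
page table walk seen in the trace, so it may well reproduce the oops on
an affected host.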

[171053.341288] BUG: unable to handle page fault for address: ff469ae640000000
[171053.341310] #PF: supervisor read access in kernel mode
[171053.341319] #PF: error_code(0x0000) - not-present page
[171053.341328] PGD 4602067 P4D 0

There is not a lot of information there :(

Did you have other splats/symptoms or was it always that?

--
Cheers,

David