From: "David Hildenbrand (arm)" <david@kernel.org>
To: Tytus Rogalewski <tytanick@gmail.com>,
	linux-mm@kvack.org, muchun.song@linux.dev, osalvador@suse.de
Subject: Re: walk_pgd_range BUG: unable to handle page fault
Date: Wed, 4 Feb 2026 22:52:27 +0100
Message-ID: <5948f3a6-8f30-4c45-9b86-2af9a6b37405@kernel.org>
In-Reply-To: <CANfXJzt4P+FCkdL_=FfmG80_bY8FkzSocJSPeksSQ_vXObRNOQ@mail.gmail.com>

On 1/28/26 15:14, Tytus Rogalewski wrote:
> Hello guys,
> 

Hi!

> Recently I reported a slab memory leak and it was fixed.
> 
> I am having yet another issue and am wondering where to report it.
> Could you tell me whether this is the right place, or should I send
> it to someone else?
> This issue also looks like a memory leak.
> 
> It happens on multiple servers (less on 6.18.6, more on 6.19-rc4+).
> All servers run KVM with vfio GPU PCIe passthrough, and it happens
> when I am using 1GB hugepages + qemu.

Okay, so we'll longterm-pin all guest memory into the iommu.

> Basically I am allocating 970GB to hugepages, leaving 37GB for kvm.
> In normal operation I have about 20GB free, but when this issue
> occurs, all RAM is taken, and even the 100GB of swap I added was
> also consumed.

When you say hugepage you mean 1 GiB hugetlb, correct?

> It can work for days or a week without issue and
> 
> I have not seen that issue with hugepages disabled (with normal 2KB
> page allocation in kvm).

I assume you meant 4k pages. What about 2 MiB hugetlb?

> And I am using hugepages as it is impossible to boot a VM with >200GB RAM.

Oh, really? That's odd.

> When that issue happens, ps hangs and only top shows
> something, but the machine needs to be rebooted due to many zombie processes.
> 
> *Hardware: *
> Motherboard: ASRockRack GENOA2D24G-2L
> CPU: 2x AMD EPYC 9654 96-Core Processor
> System ram: 1024 GB
> GPUs: 8x RTX5090 vfio passthrough
> 
> root@pve14:~# uname -a
> *Linux pve14 6.18.6-pbk* #1 SMP PREEMPT_DYNAMIC Mon Jan 19 20:59:46 UTC 
> 2026 x86_64 GNU/Linux
> 
> [171053.341288] *BUG: unable to handle page fault for address*: 
> ff469ae640000000
> [171053.341310] #PF: supervisor read access in kernel mode
> [171053.341319] #PF: error_code(0x0000) - not-present page
> [171053.341328] PGD 4602067 P4D 0
> [171053.341337] *Oops*: Oops: 0000 [#1] SMP NOPTI
> [171053.341348] CPU: 16 UID: 0 PID: 3250869 Comm: qm Not tainted 6.18.6- 
> pbk #1 PREEMPT(voluntary)
> [171053.341362] Hardware name:  TURIN2D24G-2L+/500W/TURIN2D24G-2L+/500W, 
> BIOS 10.20 05/05/2025
> [171053.341373] RIP: 0010:*walk_pgd_range*+0x6ff/0xbb0
> [171053.341386] Code: 08 49 39 dd 0f 84 8c 01 00 00 49 89 de 49 8d 9e 00 
> 00 20 00 48 8b 75 b8 48 81 e3 00 00 e0 ff 48 8d 43 ff 48 39 f0 49 0f 43 
> dd <49> f7 04 24 9f ff ff ff 0f 84 e2 fd ff ff 48 8b 45 c0 41 c7 47 20
> [171053.341406] RSP: 0018:ff59d95d70e6b748 EFLAGS: 00010287
> [171053.341416] RAX: 00007a22401fffff RBX: 00007a2240200000 RCX: 
> 0000000000000000
> [171053.341425] RDX: 0000000000000000 RSI: 00007a227fffffff RDI: 
> 800008dfc00002b7
> [171053.341435] RBP: ff59d95d70e6b828 R08: 0000000000000080 R09: 
> 0000000000000000
> [171053.341444] R10: ffffffff8de588c0 R11: 0000000000000000 R12: 
> ff469ae640000000
> [171053.341454] R13: 00007a2280000000 R14: 00007a2240000000 R15: 
> ff59d95d70e6b8a8
> [171053.341464] FS:  00007d4e8ec94b80(0000) GS:ff4692876ae7e000(0000) 
> knlGS:0000000000000000
> [171053.341476] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [171053.341485] CR2: ff469ae640000000 CR3: 0000008241eed006 CR4: 
> 0000000000f71ef0
> [171053.341495] PKRU: 55555554
> [171053.341501] Call Trace:
> [171053.341508]  <TASK>
> [171053.341518]  __walk_page_range+0x8e/0x220
> [171053.341529]  ? sysvec_apic_timer_interrupt+0x57/0xc0
> [171053.341541]  walk_page_vma+0x92/0xe0
> [171053.341551]  smap_gather_stats.part.0+0x8c/0xd0
> [171053.341563]  show_smaps_rollup+0x258/0x420

Hm, so someone is reading /proc/$PID/smaps_rollup and we stumble 
somewhere into something unexpected while doing a page table walk.
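
For anyone who wants to exercise the same read path by hand, here is a
minimal sketch (not a confirmed reproducer) that reads and parses
/proc/$PID/smaps_rollup. The helper names parse_smaps_rollup and
read_smaps_rollup are made up for illustration; the only real interface
assumed is the procfs file itself, which reports "Key: N kB" lines after
one header line.

```python
from pathlib import Path


def parse_smaps_rollup(text: str) -> dict[str, int]:
    """Turn 'Key:   123 kB' lines into {key: value_in_kB}.

    The first line of smaps_rollup is an address-range header
    ending in '[rollup]'; it does not match the pattern and is skipped.
    """
    stats: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and len(fields) == 2 and fields[1] == "kB" and fields[0].isdigit():
            stats[key.strip()] = int(fields[0])
    return stats


def read_smaps_rollup(pid: str = "self") -> dict[str, int]:
    """Read /proc/<pid>/smaps_rollup; the read makes the kernel walk
    the page tables of every VMA in that process."""
    return parse_smaps_rollup(Path(f"/proc/{pid}/smaps_rollup").read_text())
```

Calling read_smaps_rollup() with the PID of the qemu process (as root)
would take the show_smaps_rollup path seen in the trace; whether that
alone is enough to hit the fault is unknown.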

[171053.341288] BUG: unable to handle page fault for address: 
ff469ae640000000
[171053.341310] #PF: supervisor read access in kernel mode
[171053.341319] #PF: error_code(0x0000) - not-present page
[171053.341328] PGD 4602067 P4D 0
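
As a reading aid for the "#PF" lines, here is a small sketch that decodes
the low bits of an x86 page-fault error code. The bit positions follow
the architectural definition (bit 0 present, bit 1 write, bit 2 user,
bit 3 reserved-bit, bit 4 instruction fetch, bit 5 protection key); the
function name and the description strings are my own wording, not the
kernel's.

```python
# (bit, meaning when set, meaning when clear or None if nothing to report)
PF_BITS = [
    (0, "protection violation", "not-present page"),
    (1, "write access", "read access"),
    (2, "user mode", "supervisor (kernel) mode"),
    (3, "reserved bit set in a page table entry", None),
    (4, "instruction fetch", None),
    (5, "protection-key violation", None),
]


def decode_pf_error_code(code: int) -> list[str]:
    """Return human-readable descriptions for the low error-code bits."""
    out = []
    for bit, if_set, if_clear in PF_BITS:
        desc = if_set if code & (1 << bit) else if_clear
        if desc is not None:
            out.append(desc)
    return out


print(decode_pf_error_code(0x0000))
```

For error_code(0x0000) this yields a not-present page, a read access,
and supervisor mode, which matches the two "#PF" lines above.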

There is not a lot of information there :(

Did you have other splats/symptoms or was it always that?

-- 
Cheers,

David

