Re: walk_pgd_range BUG: unable to handle page fault


From: Tytus Rogalewski <tytanick@gmail.com>
To: "David Hildenbrand (arm)" <david@kernel.org>
Cc: linux-mm@kvack.org, muchun.song@linux.dev, osalvador@suse.de
Subject: Re: walk_pgd_range BUG: unable to handle page fault
Date: Wed, 4 Feb 2026 23:24:05 +0100
Message-ID: <CANfXJzsWFyKXJKsESM+7JXoGkDSeQt+Qaimy3FV1-neyXiHZBg@mail.gmail.com>
In-Reply-To: <5948f3a6-8f30-4c45-9b86-2af9a6b37405@kernel.org>


Hi,

"hugepages" is probably the QEMU term.

Yeah, 4k is the default, and booting is hard with that much memory,
especially if you boot and stop a few times.
But this issue might be strictly related to the vfio passthrough mix.
I have not actually tested 2 MB pages, because why use them if I have 1 GB
ones?
Do you think 2 MB could be more stable than 1 GB, or should the logic be
the same for both?

Well, I started using the 1 GB ones recently, as I had to get through this
whole IOMMU/CPU labyrinth of binding the proper GPU to the proper memory and
the proper CPU affinity in KVM. And Proxmox VE does not have such logic.

If you tell me what to collect, I can collect it.
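
As one possible starting point (only a sketch, nobody in this thread asked
for exactly this), the hugetlb counters from /proc/meminfo could be captured
before and after the problem appears. The helper name parse_hugetlb_meminfo
is made up for illustration. Note that the HugePages_* lines in /proc/meminfo
cover only the default hugepage size, so pools of other sizes would have to
be read from /sys/kernel/mm/hugepages/ instead.

```python
import re
from pathlib import Path

# Matches hugetlb-related lines of /proc/meminfo, for example
# "HugePages_Total:     970" or "Hugepagesize:    1048576 kB".
_FIELD = re.compile(r"^(HugePages_\w+|Hugepagesize|Hugetlb):\s+(\d+)(?:\s+(kB))?$")


def parse_hugetlb_meminfo(text):
    """Return hugetlb counters from /proc/meminfo text as a dict of ints.

    Values carrying a kB unit are converted to bytes; page counts are
    returned unchanged. (Hypothetical helper, for illustration only.)
    """
    counters = {}
    for line in text.splitlines():
        match = _FIELD.match(line.strip())
        if not match:
            continue
        name, value, unit = match.group(1), int(match.group(2)), match.group(3)
        counters[name] = value * 1024 if unit == "kB" else value
    return counters


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # only present on Linux
        for name, value in sorted(parse_hugetlb_meminfo(meminfo.read_text()).items()):
            print(f"{name}: {value}")
```

Running it periodically and keeping the output with a timestamp would show
whether the hugetlb pool itself changes when all RAM gets consumed.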

I actually have another symptom. Hmm, maybe it's related, maybe not.
Still, I have had this second symptom from the beginning, and I did not have
such crashes on 4k.
I am using the distributed network storage MooseFS and mounting it via FUSE.
Then I use qcow2 VM images.
I sometimes get freezes in VMs, but that might be related to FUSE, as I
mount one FUSE share and start as many as 8 VMs from that one mount.
And from time to time some VMs stop responding or freeze.
I will soon rewrite it to use NBD instead, and that should fix it if it
was caused by FUSE.
Still, I am not actually sure whether these are separate issues or related,
and which triggers which.
If a FUSE process is blocked by VM A, is it possible that VM B might throw
this page walk bug, or should they be unrelated even if the disk slows
down?

--
tel. 790 202 300
Tytus Rogalewski
Dolina Krzemowa 6A
83-010 Jagatowo
NIP: 9570976234


On Wed, 4 Feb 2026 at 22:52, David Hildenbrand (arm) <david@kernel.org>
wrote:

> On 1/28/26 15:14, Tytus Rogalewski wrote:
> > Hello guys,
> >
>
> Hi!
>
> > Recently I reported a slab memory leak and it was fixed.
> >
> > I am having yet another issue and am wondering where to report it.
> > Would you be able to tell me if this is the right place, or should I send
> > it to someone else?
> > The issue also looks like a memory leak.
> >
> > It happens on multiple servers (less on 6.18.6, more on 6.19-rc4+).
> > All servers are running KVM with vfio GPU PCIe passthrough, and it happens
> > when I am using 1GB hugepages + qemu
>
> Okay, so we'll longterm-pin all guest memory into the iommu.
>
> > Basically I am allocating 970GB to hugepages, leaving 37GB for kvm.
> > In normal operation I have about 20GB free, but when this issue
> > occurs, all RAM is taken, and even when I added 100GB of swap, it was
> > also consumed.
>
> When you say hugepage you mean 1 GiB hugetlb, correct?
>
> > It can work for days or week without issue and
> >
> > I have not seen that issue when I had hugepages disabled (on normal 2KB
> > page allocation in kvm).
>
> I assume you meant 4k pages. What about 2 MiB hugetlb?
>
> > And I am using hugepages because otherwise it is impossible to boot a VM
> > with >200GB of RAM.
>
> Oh, really? That's odd.
>
> > When that issue happens, the ps process hangs and only top shows
> > something, but the machine needs to be rebooted due to many zombie
> > processes.
> >
> > Hardware:
> > Motherboard: ASRockRack GENOA2D24G-2L
> > CPU: 2x AMD EPYC 9654 96-Core Processor
> > System ram: 1024 GB
> > GPUs: 8x RTX5090 vfio passthrough
> >
> > root@pve14:~# uname -a
> > Linux pve14 6.18.6-pbk #1 SMP PREEMPT_DYNAMIC Mon Jan 19 20:59:46 UTC 2026 x86_64 GNU/Linux
> >
> > [171053.341288] BUG: unable to handle page fault for address: ff469ae640000000
> > [171053.341310] #PF: supervisor read access in kernel mode
> > [171053.341319] #PF: error_code(0x0000) - not-present page
> > [171053.341328] PGD 4602067 P4D 0
> > [171053.341337] Oops: Oops: 0000 [#1] SMP NOPTI
> > [171053.341348] CPU: 16 UID: 0 PID: 3250869 Comm: qm Not tainted 6.18.6-pbk #1 PREEMPT(voluntary)
> > [171053.341362] Hardware name:  TURIN2D24G-2L+/500W/TURIN2D24G-2L+/500W, BIOS 10.20 05/05/2025
> > [171053.341373] RIP: 0010:walk_pgd_range+0x6ff/0xbb0
> > [171053.341386] Code: 08 49 39 dd 0f 84 8c 01 00 00 49 89 de 49 8d 9e 00 00 20 00 48 8b 75 b8 48 81 e3 00 00 e0 ff 48 8d 43 ff 48 39 f0 49 0f 43 dd <49> f7 04 24 9f ff ff ff 0f 84 e2 fd ff ff 48 8b 45 c0 41 c7 47 20
> > [171053.341406] RSP: 0018:ff59d95d70e6b748 EFLAGS: 00010287
> > [171053.341416] RAX: 00007a22401fffff RBX: 00007a2240200000 RCX: 0000000000000000
> > [171053.341425] RDX: 0000000000000000 RSI: 00007a227fffffff RDI: 800008dfc00002b7
> > [171053.341435] RBP: ff59d95d70e6b828 R08: 0000000000000080 R09: 0000000000000000
> > [171053.341444] R10: ffffffff8de588c0 R11: 0000000000000000 R12: ff469ae640000000
> > [171053.341454] R13: 00007a2280000000 R14: 00007a2240000000 R15: ff59d95d70e6b8a8
> > [171053.341464] FS:  00007d4e8ec94b80(0000) GS:ff4692876ae7e000(0000) knlGS:0000000000000000
> > [171053.341476] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [171053.341485] CR2: ff469ae640000000 CR3: 0000008241eed006 CR4: 0000000000f71ef0
> > [171053.341495] PKRU: 55555554
> > [171053.341501] Call Trace:
> > [171053.341508]  <TASK>
> > [171053.341518]  __walk_page_range+0x8e/0x220
> > [171053.341529]  ? sysvec_apic_timer_interrupt+0x57/0xc0
> > [171053.341541]  walk_page_vma+0x92/0xe0
> > [171053.341551]  smap_gather_stats.part.0+0x8c/0xd0
> > [171053.341563]  show_smaps_rollup+0x258/0x420
>
> Hm, so someone is reading /proc/$PID/smaps_rollup and we stumble
> somewhere into something unexpected while doing a page table walk.
>
> [171053.341288] BUG: unable to handle page fault for address: ff469ae640000000
> [171053.341310] #PF: supervisor read access in kernel mode
> [171053.341319] #PF: error_code(0x0000) - not-present page
> [171053.341328] PGD 4602067 P4D 0
>
> There is not a lot of information there :(
>
> Did you have other splats/symptoms or was it always that?
>
> --
> Cheers,
>
> David
>
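
For readers following the quoted oops: the "#PF: error_code(0x0000)" line can
be decoded bit by bit. The sketch below is only an illustration of the x86
page-fault error code layout (the bits the kernel names X86_PF_PROT,
X86_PF_WRITE, X86_PF_USER, X86_PF_RSVD, X86_PF_INSTR and X86_PF_PK); the
function name decode_pf_error_code is hypothetical, and the result says
nothing about the root cause of the crash.

```python
# x86 page-fault error code bits. Each entry maps a bit number to the
# meaning when the bit is set and, where it is informative, when it is clear.
PF_BITS = {
    0: ("protection violation", "not-present page"),
    1: ("write access", "read access"),
    2: ("user mode", "supervisor mode"),
    3: ("reserved bit set in a paging entry", None),
    4: ("instruction fetch", None),
    5: ("protection-key violation", None),
}


def decode_pf_error_code(code):
    """Return human-readable labels for an x86 #PF error code.

    Hypothetical helper for illustration; it only covers bits 0 to 5.
    """
    labels = []
    for bit, (if_set, if_clear) in PF_BITS.items():
        label = if_set if code & (1 << bit) else if_clear
        if label:
            labels.append(label)
    return labels


# The value reported in the oops above.
print(decode_pf_error_code(0x0000))
# prints ['not-present page', 'read access', 'supervisor mode']
```

This matches the two "#PF:" lines in the log: a read from kernel (supervisor)
mode that hit an address with no present mapping.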


