[syzbot] [mm?] kernel BUG in __khugepaged_enter

linux-mm.kvack.org archive mirror

* [syzbot] [mm?] kernel BUG in __khugepaged_enter
@ 2026-02-14 16:40 syzbot
  2026-02-16 14:40 ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 5+ messages in thread
From: syzbot @ 2026-02-14 16:40 UTC
  To: Liam.Howlett, akpm, baohua, baolin.wang, david, dev.jain,
	lance.yang, linux-kernel, linux-mm, lorenzo.stoakes, npache,
	ryan.roberts, syzkaller-bugs, ziy

Hello,

syzbot found the following issue on:

HEAD commit:    1e83ccd5921a sched/mmcid: Don't assume CID is CPU owned on..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1169dae6580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=54ae71b284dd0e13
dashboard link: https://syzkaller.appspot.com/bug?extid=6b554d491efbe066b701
compiler:       gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/ed43f42e3ea1/disk-1e83ccd5.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/d8af54a32588/vmlinux-1e83ccd5.xz
kernel image: https://storage.googleapis.com/syzbot-assets/34e6a8cc1037/bzImage-1e83ccd5.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+6b554d491efbe066b701@syzkaller.appspotmail.com

------------[ cut here ]------------
kernel BUG at mm/khugepaged.c:438!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 16472 Comm: syz.3.2372 Tainted: G     U  W    L XTNJ syzkaller #0 PREEMPT(full) 
Tainted: [U]=USER, [W]=WARN, [L]=SOFTLOCKUP, [X]=AUX, [T]=RANDSTRUCT, [N]=TEST, [J]=FWCTL
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026
RIP: 0010:__khugepaged_enter+0x30a/0x380 mm/khugepaged.c:438
Code: 64 7e 8e e8 a8 dc 66 ff e8 93 e6 8d ff 5b 5d 41 5c 41 5d 41 5e 41 5f e9 04 6c 04 09 e8 7f e6 8d ff 48 89 df e8 17 33 d9 ff 90 <0f> 0b 48 89 ef e8 dc 51 f8 ff e9 3b fd ff ff e8 f2 52 f8 ff e9 e1
RSP: 0018:ffffc9000e98fba8 EFLAGS: 00010292
RAX: 000000000000031f RBX: ffff888079b24980 RCX: 0000000000000000
RDX: 000000000000031f RSI: ffffffff81e5b2c9 RDI: fffff52001d31f1c
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000080000000 R11: 0000000000000001 R12: 0000000008100177
R13: ffff88804adf9510 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f06093436c0(0000) GS:ffff8881245b1000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fff341d3f52 CR3: 00000000319b0000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 khugepaged_enter_vma mm/khugepaged.c:467 [inline]
 khugepaged_enter_vma+0x137/0x2c0 mm/khugepaged.c:461
 do_huge_pmd_anonymous_page+0x1c8/0x1c00 mm/huge_memory.c:1469
 create_huge_pmd mm/memory.c:6102 [inline]
 __handle_mm_fault+0x1e96/0x2b50 mm/memory.c:6376
 handle_mm_fault+0x36d/0xa20 mm/memory.c:6583
 do_user_addr_fault+0x5a3/0x12f0 arch/x86/mm/fault.c:1334
 handle_page_fault arch/x86/mm/fault.c:1474 [inline]
 exc_page_fault+0x6f/0xd0 arch/x86/mm/fault.c:1527
 asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:618
RIP: 0033:0x87560
Code: Unable to access opcode bytes at 0x87536.
RSP: 002b:000000000000000e EFLAGS: 00010202
RAX: 0000000000000000 RBX: 00007f0608615fa0 RCX: 00007f060839bf79
RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0002000020003b4a
RBP: 00007f06084327e0 R08: 0000000000000103 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f0608616038 R14: 00007f0608615fa0 R15: 00007ffee482e7a8
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:__khugepaged_enter+0x30a/0x380 mm/khugepaged.c:438
Code: 64 7e 8e e8 a8 dc 66 ff e8 93 e6 8d ff 5b 5d 41 5c 41 5d 41 5e 41 5f e9 04 6c 04 09 e8 7f e6 8d ff 48 89 df e8 17 33 d9 ff 90 <0f> 0b 48 89 ef e8 dc 51 f8 ff e9 3b fd ff ff e8 f2 52 f8 ff e9 e1
RSP: 0018:ffffc9000e98fba8 EFLAGS: 00010292
RAX: 000000000000031f RBX: ffff888079b24980 RCX: 0000000000000000
RDX: 000000000000031f RSI: ffffffff81e5b2c9 RDI: fffff52001d31f1c
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000080000000 R11: 0000000000000001 R12: 0000000008100177
R13: ffff88804adf9510 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f06093436c0(0000) GS:ffff8881245b1000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055906703f168 CR3: 00000000319b0000 CR4: 00000000003526f0


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup



* Re: [syzbot] [mm?] kernel BUG in __khugepaged_enter
  2026-02-14 16:40 [syzbot] [mm?] kernel BUG in __khugepaged_enter syzbot
@ 2026-02-16 14:40 ` David Hildenbrand (Arm)
  2026-02-16 14:43   ` Lorenzo Stoakes
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-16 14:40 UTC
  To: syzbot, Liam.Howlett, akpm, baohua, baolin.wang, dev.jain,
	lance.yang, linux-kernel, linux-mm, lorenzo.stoakes, npache,
	ryan.roberts, syzkaller-bugs, ziy

On 2/14/26 17:40, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    1e83ccd5921a sched/mmcid: Don't assume CID is CPU owned on..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1169dae6580000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=54ae71b284dd0e13
> dashboard link: https://syzkaller.appspot.com/bug?extid=6b554d491efbe066b701
> compiler:       gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/ed43f42e3ea1/disk-1e83ccd5.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/d8af54a32588/vmlinux-1e83ccd5.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/34e6a8cc1037/bzImage-1e83ccd5.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+6b554d491efbe066b701@syzkaller.appspotmail.com
> 
> ------------[ cut here ]------------
> kernel BUG at mm/khugepaged.c:438!
> Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
> CPU: 0 UID: 0 PID: 16472 Comm: syz.3.2372 Tainted: G     U  W    L XTNJ syzkaller #0 PREEMPT(full)
> Tainted: [U]=USER, [W]=WARN, [L]=SOFTLOCKUP, [X]=AUX, [T]=RANDSTRUCT, [N]=TEST, [J]=FWCTL
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026
> RIP: 0010:__khugepaged_enter+0x30a/0x380 mm/khugepaged.c:438
> Code: 64 7e 8e e8 a8 dc 66 ff e8 93 e6 8d ff 5b 5d 41 5c 41 5d 41 5e 41 5f e9 04 6c 04 09 e8 7f e6 8d ff 48 89 df e8 17 33 d9 ff 90 <0f> 0b 48 89 ef e8 dc 51 f8 ff e9 3b fd ff ff e8 f2 52 f8 ff e9 e1
> RSP: 0018:ffffc9000e98fba8 EFLAGS: 00010292
> RAX: 000000000000031f RBX: ffff888079b24980 RCX: 0000000000000000
> RDX: 000000000000031f RSI: ffffffff81e5b2c9 RDI: fffff52001d31f1c
> RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
> R10: 0000000080000000 R11: 0000000000000001 R12: 0000000008100177
> R13: ffff88804adf9510 R14: 0000000000000000 R15: 0000000000000000
> FS:  00007f06093436c0(0000) GS:ffff8881245b1000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fff341d3f52 CR3: 00000000319b0000 CR4: 00000000003526f0
> Call Trace:
>   <TASK>
>   khugepaged_enter_vma mm/khugepaged.c:467 [inline]
>   khugepaged_enter_vma+0x137/0x2c0 mm/khugepaged.c:461
>   do_huge_pmd_anonymous_page+0x1c8/0x1c00 mm/huge_memory.c:1469
>   create_huge_pmd mm/memory.c:6102 [inline]
>   __handle_mm_fault+0x1e96/0x2b50 mm/memory.c:6376
>   handle_mm_fault+0x36d/0xa20 mm/memory.c:6583
>   do_user_addr_fault+0x5a3/0x12f0 arch/x86/mm/fault.c:1334
>   handle_page_fault arch/x86/mm/fault.c:1474 [inline]
>   exc_page_fault+0x6f/0xd0 arch/x86/mm/fault.c:1527
>   asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:618

This is the VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm), which checks

	atomic_read(&mm->mm_users) == 0;

So we have mm->mm_users == 0 while processing a page fault. Weird.
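
For readers less familiar with the check, here is a minimal userspace sketch of
what trips. The mm_model struct and its helpers are hypothetical stand-ins, not
kernel code; only the "mm_users == 0" test mirrors the expression quoted above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one mm_struct field the check reads. */
struct mm_model {
	atomic_int mm_users;	/* users of the address space */
};

/* Mirrors the quoted test: true once the last user has gone away. */
static bool mm_model_test_exit(struct mm_model *mm)
{
	return atomic_load(&mm->mm_users) == 0;
}

/* Take a user reference, e.g. on behalf of a task sharing the mm. */
static void mm_model_get(struct mm_model *mm)
{
	atomic_fetch_add(&mm->mm_users, 1);
}

/* Drop a user reference; returns true if this was the last one. */
static bool mm_model_put(struct mm_model *mm)
{
	int old = atomic_fetch_sub(&mm->mm_users, 1);

	assert(old > 0);	/* a put without a matching get is a bug */
	return old == 1;
}
```

With balanced get/put pairs, a task that is still faulting holds its own
reference, so the test should not be able to return true underneath it. That is
what makes the report surprising.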

-- 
Cheers,

David



* Re: [syzbot] [mm?] kernel BUG in __khugepaged_enter
  2026-02-16 14:40 ` David Hildenbrand (Arm)
@ 2026-02-16 14:43   ` Lorenzo Stoakes
  2026-02-16 16:08     ` Lorenzo Stoakes
  0 siblings, 1 reply; 5+ messages in thread
From: Lorenzo Stoakes @ 2026-02-16 14:43 UTC
  To: David Hildenbrand (Arm)
  Cc: syzbot, Liam.Howlett, akpm, baohua, baolin.wang, dev.jain,
	lance.yang, linux-kernel, linux-mm, npache, ryan.roberts,
	syzkaller-bugs, ziy

On Mon, Feb 16, 2026 at 03:40:21PM +0100, David Hildenbrand (Arm) wrote:
> On 2/14/26 17:40, syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:    1e83ccd5921a sched/mmcid: Don't assume CID is CPU owned on..
> > git tree:       upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1169dae6580000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=54ae71b284dd0e13
> > dashboard link: https://syzkaller.appspot.com/bug?extid=6b554d491efbe066b701
> > compiler:       gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> >
> > Downloadable assets:
> > disk image: https://storage.googleapis.com/syzbot-assets/ed43f42e3ea1/disk-1e83ccd5.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/d8af54a32588/vmlinux-1e83ccd5.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/34e6a8cc1037/bzImage-1e83ccd5.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+6b554d491efbe066b701@syzkaller.appspotmail.com
> >
> > ------------[ cut here ]------------
> > kernel BUG at mm/khugepaged.c:438!
> > Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
> > CPU: 0 UID: 0 PID: 16472 Comm: syz.3.2372 Tainted: G     U  W    L XTNJ syzkaller #0 PREEMPT(full)
> > Tainted: [U]=USER, [W]=WARN, [L]=SOFTLOCKUP, [X]=AUX, [T]=RANDSTRUCT, [N]=TEST, [J]=FWCTL
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026
> > RIP: 0010:__khugepaged_enter+0x30a/0x380 mm/khugepaged.c:438
> > Code: 64 7e 8e e8 a8 dc 66 ff e8 93 e6 8d ff 5b 5d 41 5c 41 5d 41 5e 41 5f e9 04 6c 04 09 e8 7f e6 8d ff 48 89 df e8 17 33 d9 ff 90 <0f> 0b 48 89 ef e8 dc 51 f8 ff e9 3b fd ff ff e8 f2 52 f8 ff e9 e1
> > RSP: 0018:ffffc9000e98fba8 EFLAGS: 00010292
> > RAX: 000000000000031f RBX: ffff888079b24980 RCX: 0000000000000000
> > RDX: 000000000000031f RSI: ffffffff81e5b2c9 RDI: fffff52001d31f1c
> > RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
> > R10: 0000000080000000 R11: 0000000000000001 R12: 0000000008100177
> > R13: ffff88804adf9510 R14: 0000000000000000 R15: 0000000000000000
> > FS:  00007f06093436c0(0000) GS:ffff8881245b1000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007fff341d3f52 CR3: 00000000319b0000 CR4: 00000000003526f0
> > Call Trace:
> >   <TASK>
> >   khugepaged_enter_vma mm/khugepaged.c:467 [inline]
> >   khugepaged_enter_vma+0x137/0x2c0 mm/khugepaged.c:461
> >   do_huge_pmd_anonymous_page+0x1c8/0x1c00 mm/huge_memory.c:1469
> >   create_huge_pmd mm/memory.c:6102 [inline]
> >   __handle_mm_fault+0x1e96/0x2b50 mm/memory.c:6376
> >   handle_mm_fault+0x36d/0xa20 mm/memory.c:6583
> >   do_user_addr_fault+0x5a3/0x12f0 arch/x86/mm/fault.c:1334
> >   handle_page_fault arch/x86/mm/fault.c:1474 [inline]
> >   exc_page_fault+0x6f/0xd0 arch/x86/mm/fault.c:1527
> >   asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:618
>
> This is the VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm), which checks
>
> 	atomic_read(&mm->mm_users) == 0;
>
> So we have mm->mm_users == 0 while processing a page fault. Weird.

mm lifecycle... pinging Liam :)

>
> --
> Cheers,
>
> David

Cheers, Lorenzo



* Re: [syzbot] [mm?] kernel BUG in __khugepaged_enter
  2026-02-16 14:43   ` Lorenzo Stoakes
@ 2026-02-16 16:08     ` Lorenzo Stoakes
  2026-02-17 17:30       ` Liam R. Howlett
  0 siblings, 1 reply; 5+ messages in thread
From: Lorenzo Stoakes @ 2026-02-16 16:08 UTC
  To: David Hildenbrand (Arm)
  Cc: syzbot, Liam.Howlett, akpm, baohua, baolin.wang, dev.jain,
	lance.yang, linux-kernel, linux-mm, npache, ryan.roberts,
	syzkaller-bugs, ziy, Thomas Gleixner

+cc Thomas in case the commit it's sat at is indicative. There does seem to be
some weirdness with MMF_MULTIPROCESS processes (i.e. CLONE_VM but !CLONE_THREAD),
possibly resulting in memory corruption?

We kinda need a repro to be sure though I think...

On Mon, Feb 16, 2026 at 02:43:17PM +0000, Lorenzo Stoakes wrote:
> On Mon, Feb 16, 2026 at 03:40:21PM +0100, David Hildenbrand (Arm) wrote:
> > On 2/14/26 17:40, syzbot wrote:
> > > Hello,
> > >
> > > syzbot found the following issue on:
> > >
> > > HEAD commit:    1e83ccd5921a sched/mmcid: Don't assume CID is CPU owned on..
> > > git tree:       upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=1169dae6580000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=54ae71b284dd0e13
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=6b554d491efbe066b701
> > > compiler:       gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44
> > >
> > > Unfortunately, I don't have any reproducer for this issue yet.

We're going to need one I fear :)

> > >
> > > Downloadable assets:
> > > disk image: https://storage.googleapis.com/syzbot-assets/ed43f42e3ea1/disk-1e83ccd5.raw.xz
> > > vmlinux: https://storage.googleapis.com/syzbot-assets/d8af54a32588/vmlinux-1e83ccd5.xz
> > > kernel image: https://storage.googleapis.com/syzbot-assets/34e6a8cc1037/bzImage-1e83ccd5.xz
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > Reported-by: syzbot+6b554d491efbe066b701@syzkaller.appspotmail.com
> > >
> > > ------------[ cut here ]------------
> > > kernel BUG at mm/khugepaged.c:438!
> > > Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
> > > CPU: 0 UID: 0 PID: 16472 Comm: syz.3.2372 Tainted: G     U  W    L XTNJ syzkaller #0 PREEMPT(full)
> > > Tainted: [U]=USER, [W]=WARN, [L]=SOFTLOCKUP, [X]=AUX, [T]=RANDSTRUCT, [N]=TEST, [J]=FWCTL
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026
> > > RIP: 0010:__khugepaged_enter+0x30a/0x380 mm/khugepaged.c:438
> > > Code: 64 7e 8e e8 a8 dc 66 ff e8 93 e6 8d ff 5b 5d 41 5c 41 5d 41 5e 41 5f e9 04 6c 04 09 e8 7f e6 8d ff 48 89 df e8 17 33 d9 ff 90 <0f> 0b 48 89 ef e8 dc 51 f8 ff e9 3b fd ff ff e8 f2 52 f8 ff e9 e1
> > > RSP: 0018:ffffc9000e98fba8 EFLAGS: 00010292
> > > RAX: 000000000000031f RBX: ffff888079b24980 RCX: 0000000000000000
> > > RDX: 000000000000031f RSI: ffffffff81e5b2c9 RDI: fffff52001d31f1c
> > > RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
> > > R10: 0000000080000000 R11: 0000000000000001 R12: 0000000008100177
> > > R13: ffff88804adf9510 R14: 0000000000000000 R15: 0000000000000000
> > > FS:  00007f06093436c0(0000) GS:ffff8881245b1000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007fff341d3f52 CR3: 00000000319b0000 CR4: 00000000003526f0
> > > Call Trace:
> > >   <TASK>
> > >   khugepaged_enter_vma mm/khugepaged.c:467 [inline]
> > >   khugepaged_enter_vma+0x137/0x2c0 mm/khugepaged.c:461
> > >   do_huge_pmd_anonymous_page+0x1c8/0x1c00 mm/huge_memory.c:1469
> > >   create_huge_pmd mm/memory.c:6102 [inline]
> > >   __handle_mm_fault+0x1e96/0x2b50 mm/memory.c:6376
> > >   handle_mm_fault+0x36d/0xa20 mm/memory.c:6583
> > >   do_user_addr_fault+0x5a3/0x12f0 arch/x86/mm/fault.c:1334

	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		goto lock_mmap; <--- didn't jump there, so we hold a VMA lock.

	if (unlikely(access_error(error_code, vma))) {
		bad_area_access_error(regs, error_code, address, NULL, vma);
		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
		return;
	}
	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); <-- here

> > >   handle_page_fault arch/x86/mm/fault.c:1474 [inline]
> > >   exc_page_fault+0x6f/0xd0 arch/x86/mm/fault.c:1527
> > >   asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:618
> >
> > This is the VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm), which checks
> >
> > 	atomic_read(&mm->mm_users) == 0;

Yeah, and that just shouldn't be possible, so maybe memory corruption?

The crash log indicates the system is tainted by a soft lockup
https://syzkaller.appspot.com/text?tag=CrashLog&x=1169dae6580000 so something's
gone horribly wrong there...

(from crash log)
[  696.104336][T16472] pgd ffff8880319b0000 mm_users 0 mm_count 2 pgtables_bytes 155648 map_count 32

VMAs are still there (map_count 32), so exit_mmap() hasn't run yet...
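
As a small aid for reading such dump lines, here is a userspace sketch that
pulls the fields out and flags the "no users but still mapped" state. The
struct and function names are hypothetical; the field layout is taken only from
the single line quoted above, so other kernels may print it differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Fields picked out of the dump line; the struct itself is hypothetical. */
struct mm_dump {
	unsigned long long pgd;
	int mm_users;
	int mm_count;
	unsigned long pgtables_bytes;
	int map_count;
};

/* Parse one "pgd ... map_count N" line; true when all five fields were read. */
static bool parse_mm_dump(const char *line, struct mm_dump *out)
{
	const char *p;

	assert(line != NULL && out != NULL);
	p = strstr(line, "pgd ");	/* skip the timestamp/thread prefix */
	if (!p)
		return false;
	return sscanf(p,
		      "pgd %llx mm_users %d mm_count %d pgtables_bytes %lu map_count %d",
		      &out->pgd, &out->mm_users, &out->mm_count,
		      &out->pgtables_bytes, &out->map_count) == 5;
}

/* No users left, yet mappings remain: teardown has not removed them. */
static bool users_gone_but_mapped(const struct mm_dump *d)
{
	return d->mm_users == 0 && d->map_count > 0;
}
```

Fed the line above, users_gone_but_mapped() is true, which is the state being
puzzled over here.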

But hmm we injected a fault :)

[  696.293779][T16475] FAULT_INJECTION: forcing a failure.
[  696.293779][T16475] name fail_page_alloc, interval 1, probability 0, space 0, times 0

[  696.332139][T16475]  dump_stack_lvl+0x100/0x190
[  696.332164][T16475]  should_fail_ex.cold+0x5/0xa
[  696.332178][T16475]  ? prepare_alloc_pages+0x16d/0x5f0
[  696.332200][T16475]  should_fail_alloc_page+0xeb/0x140
[  696.332219][T16475]  prepare_alloc_pages+0x1f0/0x5f0
[  696.332241][T16475]  __alloc_frozen_pages_noprof+0x193/0x2410
[  696.332258][T16475]  ? stack_trace_save+0x8e/0xc0
[  696.332277][T16475]  ? __pfx_stack_trace_save+0x10/0x10
[  696.332297][T16475]  ? stack_depot_save_flags+0x27/0x9d0
[  696.332315][T16475]  ? __lock_acquire+0x4a5/0x2630
[  696.332331][T16475]  ? kasan_save_stack+0x3f/0x50
[  696.332346][T16475]  ? __pfx___alloc_frozen_pages_noprof+0x10/0x10
[  696.332360][T16475]  ? copy_time_ns+0xf6/0x800
[  696.332379][T16475]  ? unshare_nsproxy_namespaces+0xc3/0x1f0
[  696.332408][T16475]  ? __x64_sys_unshare+0x31/0x40
[  696.332423][T16475]  ? do_syscall_64+0x106/0xf80
[  696.332437][T16475]  ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  696.332460][T16475]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  696.332480][T16475]  ? policy_nodemask+0xed/0x4f0
[  696.332500][T16475]  alloc_pages_mpol+0x1fb/0x550
[  696.332519][T16475]  ? __pfx_alloc_pages_mpol+0x10/0x10
[  696.332542][T16475]  alloc_pages_noprof+0x131/0x390
[  696.332560][T16475]  copy_time_ns+0x11a/0x800

So:

static struct nsproxy *create_new_namespaces(u64 flags,
	struct task_struct *tsk, struct user_namespace *user_ns,
	struct fs_struct *new_fs)
{
	...
	new_nsp->time_ns_for_children = copy_time_ns(flags, user_ns,
				tsk->nsproxy->time_ns_for_children); <- -ENOMEM
	if (IS_ERR(new_nsp->time_ns_for_children)) {
		err = PTR_ERR(new_nsp->time_ns_for_children);
		goto out_time;
	}

	...

out_time:
	put_net(new_nsp->net_ns);
out_net:
	put_cgroup_ns(new_nsp->cgroup_ns);
out_cgroup:
	put_pid_ns(new_nsp->pid_ns_for_children);
out_pid:
	put_ipc_ns(new_nsp->ipc_ns);
out_ipc:
	put_uts_ns(new_nsp->uts_ns);
out_uts:
	put_mnt_ns(new_nsp->mnt_ns);
out_ns:
	kmem_cache_free(nsproxy_cachep, new_nsp);
	return ERR_PTR(err);
}

So we're putting the world... maybe some of this is buggy?
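
One way to sanity-check this kind of unwind is to model it. Below is a
userspace sketch of the same goto-label pattern with hypothetical counters
(NR_STEPS, struct refs and the helpers are made up for illustration, and the
labels are named after the last step that succeeded rather than the one that
failed). It only demonstrates the pattern, it says nothing about whether the
real put_*() helpers are correct.

```c
#include <assert.h>

#define NR_STEPS 4	/* hypothetical number of acquisition steps */

/* Hypothetical per-resource reference counts, all starting at zero. */
struct refs {
	int count[NR_STEPS];
};

/* Take a reference for one step, or fail if this is the injected failure. */
static int acquire(struct refs *r, int step, int fail_at)
{
	if (step == fail_at)
		return -1;	/* stands in for an injected -ENOMEM */
	r->count[step]++;
	return 0;
}

static void release(struct refs *r, int step)
{
	assert(r->count[step] > 0);	/* catches a put without a get */
	r->count[step]--;
}

/*
 * Acquire steps 0..3 in order. On failure, fall through the labels so that
 * only the steps that already succeeded are released, in reverse order.
 * Returns 0 on success (references kept) or -1 on failure (all dropped).
 */
static int setup(struct refs *r, int fail_at)
{
	if (acquire(r, 0, fail_at))
		goto out;
	if (acquire(r, 1, fail_at))
		goto out_0;
	if (acquire(r, 2, fail_at))
		goto out_1;
	if (acquire(r, 3, fail_at))
		goto out_2;
	return 0;

out_2:
	release(r, 2);
out_1:
	release(r, 1);
out_0:
	release(r, 0);
out:
	return -1;
}

/* Sum of outstanding references; zero when everything is balanced. */
static int outstanding(const struct refs *r)
{
	int i, sum = 0;

	for (i = 0; i < NR_STEPS; i++)
		sum += r->count[i];
	return sum;
}
```

Failing at any step should leave outstanding() at zero. A label that releases
one step too many would trip the assert in release().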

[  696.332578][T16475]  ? copy_cgroup_ns+0x71/0x970
[  696.332601][T16475]  create_new_namespaces+0x48a/0xac0

So:

int unshare_nsproxy_namespaces(unsigned long unshare_flags,
	struct nsproxy **new_nsp, struct cred *new_cred, struct fs_struct *new_fs)
{
	...
	*new_nsp = create_new_namespaces(unshare_flags, current, user_
					 new_fs ? new_fs : current->fs);
	if (IS_ERR(*new_nsp)) {
		err = PTR_ERR(*new_nsp);
		goto out;
	}
	...
out:
	return err;
}


[  696.332626][T16475]  unshare_nsproxy_namespaces+0xc3/0x1f0
[  696.332648][T16475]  ksys_unshare+0x455/0xab0

So:

int ksys_unshare(unsigned long unshare_flags)
{
	...
	err = unshare_nsproxy_namespaces(unshare_flags, &new_nsproxy,
					 new_cred, new_fs);
	if (err)
		goto bad_unshare_cleanup_cred;
	...
bad_unshare_cleanup_cred:
	if (new_cred)
		put_cred(new_cred);
bad_unshare_cleanup_fd:
	if (new_fd)
		put_files_struct(new_fd);

bad_unshare_cleanup_fs:
	if (new_fs)
		free_fs_struct(new_fs);

bad_unshare_out:
	return err;
}

And again we're putting all the things... maybe something buggy here?

Perhaps this unshare is racing with something else?

OTOH, we _already_ had mm_users == 0 at this point (as per the mm dump), so
probably something earlier got us into this state?

[  696.332664][T16475]  ? __pfx_ksys_unshare+0x10/0x10
[  696.332679][T16475]  ? xfd_validate_state+0x129/0x190
[  696.332702][T16475]  __x64_sys_unshare+0x31/0x40
[  696.332717][T16475]  do_syscall_64+0x106/0xf80
[  696.332730][T16475]  ? clear_bhb_loop+0x40/0x90
[  696.332747][T16475]  entry_SYSCALL_64_after_hwframe+0x77/0x7f

Also from mm dump:

flags: 00000000,840007fd

MMF_TOPDOWN | MMF_MULTIPROCESS | (core dump flags)

No MMF_VM_HUGEPAGE...
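
To make the decoding reproducible, here is a small userspace sketch that lists
the set bits of the dumped value. The bit numbers in the macros are assumptions
for illustration: bits 26 and 31 are simply the two high bits set in
0x840007fd, which the decoding above names MMF_MULTIPROCESS and MMF_TOPDOWN,
and bit 17 is assumed for MMF_VM_HUGEPAGE. Check include/linux/mm_types.h in
the tree under test before relying on them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, see the note above; not taken from the report. */
#define ASSUMED_MMF_VM_HUGEPAGE		17
#define ASSUMED_MMF_MULTIPROCESS	26
#define ASSUMED_MMF_TOPDOWN		31

/* True if the given bit is set in the flags word. */
static bool flag_set(uint64_t flags, unsigned int bit)
{
	assert(bit < 64);
	return (flags >> bit) & 1;
}

/* Fill bits[] with the indices of set bits, lowest first; returns how many. */
static int set_bits(uint64_t flags, unsigned int bits[64])
{
	int n = 0;

	for (unsigned int b = 0; b < 64; b++)
		if (flag_set(flags, b))
			bits[n++] = b;
	return n;
}
```

For 0x840007fd this gives bit 0, bits 2 through 10, bit 26 and bit 31. Under
the assumed numbering, bit 17 is clear.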

MMF_MULTIPROCESS marks this as shared between processes, as set in
copy_process() -> copy_oom_score_adj() which has a guard:

	/* Skip if spawning a thread or using vfork */
	if ((clone_flags & (CLONE_VM | CLONE_THREAD | CLONE_VFORK)) != CLONE_VM)
		return;

The mm is then grabbed in __set_oom_adj(), and commit 44a70adec910 ("mm,
oom_adj: make sure processes sharing mm have same view of oom_score_adj")
suggests the processes were cloned with CLONE_VM but not CLONE_SIGHAND (which
presumably implies !CLONE_THREAD).
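
To spell out which clone() flag combinations get past that guard, here is a
userspace sketch of the same predicate. The function name is hypothetical; the
flag values are repeated from the Linux uapi <linux/sched.h> so the sketch
builds without kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as in the Linux uapi <linux/sched.h>. */
#define CLONE_VM	0x00000100UL
#define CLONE_SIGHAND	0x00000800UL
#define CLONE_VFORK	0x00004000UL
#define CLONE_THREAD	0x00010000UL

static_assert((CLONE_VM & (CLONE_THREAD | CLONE_VFORK)) == 0,
	      "the masked flags must be distinct bits");

/*
 * The quoted guard returns early when the masked flags differ from CLONE_VM.
 * This is the inverse: true when the guard does NOT return early, i.e. the
 * child shares the mm but is neither a thread nor a vfork child.
 */
static bool shares_mm_across_processes(unsigned long clone_flags)
{
	return (clone_flags & (CLONE_VM | CLONE_THREAD | CLONE_VFORK)) == CLONE_VM;
}
```

Note that CLONE_SIGHAND is not part of the mask, so CLONE_VM | CLONE_SIGHAND
without CLONE_THREAD still passes the guard in this sketch.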

Anyway it's hard to know without a repro.

Cheers, Lorenzo



* Re: [syzbot] [mm?] kernel BUG in __khugepaged_enter
  2026-02-16 16:08     ` Lorenzo Stoakes
@ 2026-02-17 17:30       ` Liam R. Howlett
  0 siblings, 0 replies; 5+ messages in thread
From: Liam R. Howlett @ 2026-02-17 17:30 UTC
  To: Lorenzo Stoakes
  Cc: David Hildenbrand (Arm),
	syzbot, akpm, baohua, baolin.wang, dev.jain, lance.yang,
	linux-kernel, linux-mm, npache, ryan.roberts, syzkaller-bugs,
	ziy, Thomas Gleixner

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [260216 11:08]:
> +cc Thomas in case the commit it's sat at is indicative, there does seem to be
> some weirdness with MMF_MULTIPROCESS processes (i.e. CLONE_VM but !CLONE_THREAD)
> resulting in possible memory corruption?
> 
> We kinda need a repro to be sure though I think...
> 
> On Mon, Feb 16, 2026 at 02:43:17PM +0000, Lorenzo Stoakes wrote:
> > On Mon, Feb 16, 2026 at 03:40:21PM +0100, David Hildenbrand (Arm) wrote:
> > > On 2/14/26 17:40, syzbot wrote:
> > > > Hello,
> > > >
> > > > syzbot found the following issue on:
> > > >
> > > > HEAD commit:    1e83ccd5921a sched/mmcid: Don't assume CID is CPU owned on..
> > > > git tree:       upstream
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=1169dae6580000
> > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=54ae71b284dd0e13
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=6b554d491efbe066b701
> > > > compiler:       gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44
> > > >
> > > > Unfortunately, I don't have any reproducer for this issue yet.
> 
> We're going to need one I fear :)

Indeed.  No clear smoking gun.

> 
> > > >
> > > > Downloadable assets:
> > > > disk image: https://storage.googleapis.com/syzbot-assets/ed43f42e3ea1/disk-1e83ccd5.raw.xz
> > > > vmlinux: https://storage.googleapis.com/syzbot-assets/d8af54a32588/vmlinux-1e83ccd5.xz
> > > > kernel image: https://storage.googleapis.com/syzbot-assets/34e6a8cc1037/bzImage-1e83ccd5.xz
> > > >
> > > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > > Reported-by: syzbot+6b554d491efbe066b701@syzkaller.appspotmail.com
> > > >
> > > > ------------[ cut here ]------------
> > > > kernel BUG at mm/khugepaged.c:438!
> > > > Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
> > > > CPU: 0 UID: 0 PID: 16472 Comm: syz.3.2372 Tainted: G     U  W    L XTNJ syzkaller #0 PREEMPT(full)
> > > > Tainted: [U]=USER, [W]=WARN, [L]=SOFTLOCKUP, [X]=AUX, [T]=RANDSTRUCT, [N]=TEST, [J]=FWCTL
> > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026
> > > > RIP: 0010:__khugepaged_enter+0x30a/0x380 mm/khugepaged.c:438
> > > > Code: 64 7e 8e e8 a8 dc 66 ff e8 93 e6 8d ff 5b 5d 41 5c 41 5d 41 5e 41 5f e9 04 6c 04 09 e8 7f e6 8d ff 48 89 df e8 17 33 d9 ff 90 <0f> 0b 48 89 ef e8 dc 51 f8 ff e9 3b fd ff ff e8 f2 52 f8 ff e9 e1
> > > > RSP: 0018:ffffc9000e98fba8 EFLAGS: 00010292
> > > > RAX: 000000000000031f RBX: ffff888079b24980 RCX: 0000000000000000
> > > > RDX: 000000000000031f RSI: ffffffff81e5b2c9 RDI: fffff52001d31f1c
> > > > RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
> > > > R10: 0000000080000000 R11: 0000000000000001 R12: 0000000008100177
> > > > R13: ffff88804adf9510 R14: 0000000000000000 R15: 0000000000000000
> > > > FS:  00007f06093436c0(0000) GS:ffff8881245b1000(0000) knlGS:0000000000000000
> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 00007fff341d3f52 CR3: 00000000319b0000 CR4: 00000000003526f0
> > > > Call Trace:
> > > >   <TASK>
> > > >   khugepaged_enter_vma mm/khugepaged.c:467 [inline]
> > > >   khugepaged_enter_vma+0x137/0x2c0 mm/khugepaged.c:461
> > > >   do_huge_pmd_anonymous_page+0x1c8/0x1c00 mm/huge_memory.c:1469
> > > >   create_huge_pmd mm/memory.c:6102 [inline]
> > > >   __handle_mm_fault+0x1e96/0x2b50 mm/memory.c:6376
> > > >   handle_mm_fault+0x36d/0xa20 mm/memory.c:6583
> > > >   do_user_addr_fault+0x5a3/0x12f0 arch/x86/mm/fault.c:1334
> 
> 	vma = lock_vma_under_rcu(mm, address);
> 	if (!vma)
> 		goto lock_mmap; <--- didn't jump there, so is a VMA lock.
> 
> 	if (unlikely(access_error(error_code, vma))) {
> 		bad_area_access_error(regs, error_code, address, NULL, vma);
> 		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
> 		return;
> 	}
> 	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); <-- here
> 
> > > >   handle_page_fault arch/x86/mm/fault.c:1474 [inline]
> > > >   exc_page_fault+0x6f/0xd0 arch/x86/mm/fault.c:1527
> > > >   asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:618
> > >
> > > This is the VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm), which checks
> > >
> > > 	atomic_read(&mm->mm_users) == 0;
> 
> Yeah, and that just shouldn't be possible, so maybe memory corruption?

tglx had some changes to the mm_users accounting for the CID stuff [1].

There is also a potential double mmput() fix [2].
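
To illustrate why an unbalanced put would match the symptom, here is a
userspace sketch. The struct and helpers are hypothetical, and it models a
generic extra put, not the specific path addressed by [2].

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter standing in for mm_users; no underflow protection. */
struct users {
	atomic_int count;
};

static void users_get(struct users *u)
{
	atomic_fetch_add(&u->count, 1);
}

/* Returns true when this put dropped the count to zero. */
static bool users_put(struct users *u)
{
	return atomic_fetch_sub(&u->count, 1) == 1;
}

/*
 * Holder B (think: the task that later faults) owns the initial reference.
 * Holder A takes one temporary reference and then drops it `puts` times;
 * anything above one is the bug. Returns true if the count hit zero even
 * though holder B never dropped its own reference.
 */
static bool extra_put_strands_holder(int puts)
{
	struct users u;
	bool hit_zero = false;

	assert(puts >= 0);
	atomic_init(&u.count, 1);	/* holder B */
	users_get(&u);			/* holder A */
	for (int i = 0; i < puts; i++)
		hit_zero |= users_put(&u);
	return hit_zero;
}
```

With one put the count stays at one. With two it reaches zero while holder B
is still running, which is the shape of what the VM_BUG_ON_MM() caught.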


> 
> The crash log indicates the system is tainted by softlock
> https://syzkaller.appspot.com/text?tag=CrashLog&x=1169dae6580000 so something's
> gone horribly wrong there...
> 
> (from crash log)
> [  696.104336][T16472] pgd ffff8880319b0000 mm_users 0 mm_count 2 pgtables_bytes 155648 map_count 32
> 
> VMA's still there so exit_mmap() hasn't run yet...
> 
> But hmm we injected a fault :)
> 
> [  696.293779][T16475] FAULT_INJECTION: forcing a failure.
> [  696.293779][T16475] name fail_page_alloc, interval 1, probability 0, space 0, times 0
> 
> [  696.332139][T16475]  dump_stack_lvl+0x100/0x190
> [  696.332164][T16475]  should_fail_ex.cold+0x5/0xa
> [  696.332178][T16475]  ? prepare_alloc_pages+0x16d/0x5f0
> [  696.332200][T16475]  should_fail_alloc_page+0xeb/0x140
> [  696.332219][T16475]  prepare_alloc_pages+0x1f0/0x5f0
> [  696.332241][T16475]  __alloc_frozen_pages_noprof+0x193/0x2410
> [  696.332258][T16475]  ? stack_trace_save+0x8e/0xc0
> [  696.332277][T16475]  ? __pfx_stack_trace_save+0x10/0x10
> [  696.332297][T16475]  ? stack_depot_save_flags+0x27/0x9d0
> [  696.332315][T16475]  ? __lock_acquire+0x4a5/0x2630
> [  696.332331][T16475]  ? kasan_save_stack+0x3f/0x50
> [  696.332346][T16475]  ? __pfx___alloc_frozen_pages_noprof+0x10/0x10
> [  696.332360][T16475]  ? copy_time_ns+0xf6/0x800
> [  696.332379][T16475]  ? unshare_nsproxy_namespaces+0xc3/0x1f0
> [  696.332408][T16475]  ? __x64_sys_unshare+0x31/0x40
> [  696.332423][T16475]  ? do_syscall_64+0x106/0xf80
> [  696.332437][T16475]  ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [  696.332460][T16475]  ? __sanitizer_cov_trace_switch+0x54/0x90
> [  696.332480][T16475]  ? policy_nodemask+0xed/0x4f0
> [  696.332500][T16475]  alloc_pages_mpol+0x1fb/0x550
> [  696.332519][T16475]  ? __pfx_alloc_pages_mpol+0x10/0x10
> [  696.332542][T16475]  alloc_pages_noprof+0x131/0x390
> [  696.332560][T16475]  copy_time_ns+0x11a/0x800


There are a number of failures reported; all seem to be the same
injections.

...

> Perhaps this unshare is racing with something else?

Yeah, I cannot find a particularly obvious bug either.

> 
> OTOH, we _already_ had mm_users = 0 at this point (as per mm dump) so. Probably
> something before got us into this state?

Could this be a previous failure that was missed or expected?  Is the
syzbot run a clean one, and is there full terminal output?  The outputs seem
to indicate it's the 'last test run' but not the entire log..?

Could we have already failed via the bug addressed by the previous fix [2]?
That fix is not in the tree where this bug is being reported.

> 
> [  696.332664][T16475]  ? __pfx_ksys_unshare+0x10/0x10
> [  696.332679][T16475]  ? xfd_validate_state+0x129/0x190
> [  696.332702][T16475]  __x64_sys_unshare+0x31/0x40
> [  696.332717][T16475]  do_syscall_64+0x106/0xf80
> [  696.332730][T16475]  ? clear_bhb_loop+0x40/0x90
> [  696.332747][T16475]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Also from mm dump:
> 
> flags: 00000000,840007fd
> 
> MMF_TOPDOWN | MMF_MULTIPROCESS | (core dump flags)
> 
> No MMF_VM_HUGEPAGE...
> 
> MMF_MULTIPROCESS marks this as shared between processes, as set in
> copy_process() -> copy_oom_score_adj() which has a guard:
> 
> 	/* Skip if spawning a thread or using vfork */
> 	if ((clone_flags & (CLONE_VM | CLONE_THREAD | CLONE_VFORK)) != CLONE_VM)
> 		return;
> 
> Which grabs the mm in __set_oom_adj() which as per commit 44a70adec910 ("mm,
> oom_adj: make sure processes sharing mm have same view of oom_score_adj")
> suggests processes were cloned with CLONE_VM but not CLONE_SIGHAND (which
> presumably implies !CLONE_THREAD).
> 
> Anyway it's hard to know with a repro.
> 

Thanks,
Liam

[1] https://lore.kernel.org/all/20251119172549.832764634@linutronix.de/
[2] https://lore.kernel.org/all/20260210192738.3041609-1-andrii@kernel.org/

