+Vlastimil Babka, Hyeonggon Yoo, Feng Tang and Fengwei Yin

Hi, Mike Kravetz,

we reported
"[linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h" [1].
Vlastimil, Hyeonggon, Feng and Fengwei gave us a lot of great guidance based
on it. In particular, after enabling the configs below per Vlastimil's
suggestion

    CONFIG_DEBUG_PAGEALLOC
    CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
    CONFIG_SLUB_DEBUG
    CONFIG_SLUB_DEBUG_ON

and running more tests, we realized that "0af8489b02" is not the real culprit.

A new bisection was triggered, and it finally pointed to this "7118fc2906".

Although the reported issues differ ("kernel_BUG_at_include/linux/mm.h" for
0af8489b02 vs. "kernel_BUG_at_lib/list_debug.c" for this commit), Feng and
Fengwei helped further to confirm that they are similar. They will supply a
more detailed technical analysis later.

Please note that the issues do not always happen (~10% of runs on this commit
or on 0af8489b02).

=========================================================================================
compiler/kconfig/rootfs/sleep/tbox_group/testcase:
  gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT+CONFIG_SLUB_DEBUG_ON/debian-11.1-i386-20220923.cgz/1/vm-snb/boot

48b8d744ea841b8a 7118fc2906e2925d7edb5ed9c8a 0af8489b0216fa1dd83e264bef8
---------------- --------------------------- ---------------------------
       fail:runs  %reproduction    fail:runs  %reproduction    fail:runs
           |             |             |             |             |
           :999         10%          97:999          9%          94:999   dmesg.invalid_opcode:#[##]
           :999          0%            :999          0%           4:999   dmesg.kernel_BUG_at_include/linux/mm.h
           :999          0%           2:999          1%           5:999   dmesg.kernel_BUG_at_include/linux/page-flags.h
           :999          9%          90:999          9%          85:999   dmesg.kernel_BUG_at_lib/list_debug.c
           :999          0%           4:999          0%            :999   dmesg.kernel_BUG_at_mm/page_alloc.c
           :999          0%           1:999          0%            :999   dmesg.kernel_BUG_at_mm/slub.c

[1] https://lore.kernel.org/all/202212312021.bc1efe86-oliver.sang@intel.com/

Below is the detailed report.

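Because the failure only shows up in roughly one boot out of ten, a handful of
clean boots says little about whether a commit is good. The sketch below is
only an illustration of that arithmetic, not part of the LKP tooling: it
assumes every boot fails independently with the same probability, and the
function names and the 1% threshold are hypothetical. The fail:runs counts are
the ones from the table above.

```python
import math


def miss_probability(p_fail: float, runs: int) -> float:
    """Chance that `runs` boots all pass although the bug is present.

    Assumption: boots are independent and each fails with probability p_fail.
    """
    return (1.0 - p_fail) ** runs


def runs_needed(p_fail: float, max_miss: float) -> int:
    """Smallest number of clean boots that keeps the miss chance <= max_miss."""
    return math.ceil(math.log(max_miss) / math.log(1.0 - p_fail))


# fail:runs for dmesg.invalid_opcode, copied from the table above.
observed = {
    "7118fc2906e2": (97, 999),
    "0af8489b0216": (94, 999),
}

# Observed failure rate per commit.
rates = {commit: fails / runs for commit, (fails, runs) in observed.items()}

if __name__ == "__main__":
    for commit, rate in rates.items():
        print(f"{commit}: {rate:.1%} of boots failed, "
              f"{runs_needed(rate, 0.01)} clean boots needed for a <=1% miss chance")
```

Under these assumptions, with a 10% failure rate ten clean boots would still
miss the bug about 35% of the time, and about 44 clean boots are needed to
bring that below 1%. This is why a parent commit with 0 failures in 999 runs is
strong evidence that it is good.
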
Greeting,

FYI, we noticed kernel_BUG_at_lib/list_debug.c due to commit (built with gcc-11):

commit: 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849 ("hugetlb: address ref count racing in prep_compound_gigantic_page")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linux-next/master c12e2e5b76b2e739ccdf196bee960412b45d5f85]

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot
| Link: https://lore.kernel.org/oe-lkp/202301170941.49728982-oliver.sang@intel.com


[   31.031172][  T210] ------------[ cut here ]------------
[   31.032147][  T210] kernel BUG at lib/list_debug.c:54!
[   31.033124][  T210] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[   31.034237][  T210] CPU: 1 PID: 210 Comm: systemd-udevd Tainted: G S 5.13.0-00219-g7118fc2906e2 #1
[   31.036108][  T210] EIP: __list_del_entry_valid.cold (lib/list_debug.c:54 (discriminator 3))
[   31.037237][  T210] Code: 01 89 54 24 08 c7 04 24 08 c7 04 24 83 15 ec 4b 83 15 ec 4b 61 80 f9 ff 61 80 f9 ff 9a c5 01 83 9a c5 01 83 c5 00 0f 0b c5 00 <0f> 0b 9a c5 01 83 9a c5 01 83 c5 00 83 05 c5 00 83 05 01 b8 44 f4
All code
========
   0:   01 89 54 24 08 c7       add    %ecx,-0x38f7dbac(%rcx)
   6:   04 24                   add    $0x24,%al
   8:   08 c7                   or     %al,%bh
   a:   04 24                   add    $0x24,%al
   c:   83 15 ec 4b 83 15 ec    adcl   $0xffffffec,0x15834bec(%rip)        # 0x15834bff
  13:   4b 61                   rex.WXB (bad)
  15:   80 f9 ff                cmp    $0xff,%cl
  18:   61                      (bad)
  19:   80 f9 ff                cmp    $0xff,%cl
  1c:   9a                      (bad)
  1d:   c5 01 83                (bad)
  20:   9a                      (bad)
  21:   c5 01 83                (bad)
  24:   c5 00 0f                (bad)
  27:   0b c5                   or     %ebp,%eax
  29:*  00 0f                   add    %cl,(%rdi)          <-- trapping instruction
  2b:   0b 9a c5 01 83 9a       or     -0x657cfe3b(%rdx),%ebx
  31:   c5 01 83                (bad)
  34:   c5 00 83                (bad)
  37:   05 c5 00 83 05          add    $0x58300c5,%eax
  3c:   01                      .byte 0x1
  3d:   b8                      .byte 0xb8
  3e:   44 f4                   rex.R hlt

Code starting with the faulting instruction
===========================================
   0:   0f 0b                   ud2
   2:   9a                      (bad)
   3:   c5 01 83                (bad)
   6:   9a                      (bad)
   7:   c5 01 83                (bad)
   a:   c5 00 83                (bad)
   d:   05 c5 00 83 05          add    $0x58300c5,%eax
  12:   01                      .byte 0x1
  13:   b8                      .byte 0xb8
  14:   44 f4                   rex.R hlt
[   31.044796][  T210] EAX: 00000044 EBX: e6d0e564 ECX: 00000000 EDX: 00000001
[   31.046040][  T210] ESI: ee7f4360 EDI: ee7f4328 EBP: ca1b795c ESP: ca1b7950
[   31.047314][  T210] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046
[   31.048702][  T210] CR0: 80050033 CR2: 005d5dbc CR3: 0a287000 CR4: 000406d0
[   31.049987][  T210] Call Trace:
[   31.050583][  T210] __rmqueue_pcplist (include/linux/list.h:132 include/linux/list.h:146 mm/page_alloc.c:3644)
[   31.051469][  T210] rmqueue_pcplist+0x13c/0x3a0
[   31.052536][  T210] ? rmqueue_pcplist+0x4b/0x3a0
[   31.053479][  T210] rmqueue+0x323/0xd20
[   31.054230][  T210] get_page_from_freelist (mm/page_alloc.c:4162)
[   31.055219][  T210] __alloc_pages (mm/page_alloc.c:5374)
[   31.055998][  T210] allocate_slab (include/linux/gfp.h:558 include/linux/gfp.h:572 include/linux/gfp.h:585 mm/slub.c:1702 mm/slub.c:1842)
[   31.056834][  T210] new_slab (mm/slub.c:1907)
[   31.057538][  T210] new_slab_objects (mm/slub.c:2652)
[   31.058396][  T210] ___slab_alloc+0xf8/0x520
[   31.059340][  T210] ? lock_release (kernel/locking/lockdep.c:5534)
[   31.060073][  T210] ? __d_alloc (fs/dcache.c:1745)
[   31.060817][  T210] ? rcu_read_unlock (include/linux/rcupdate.h:272 (discriminator 7) include/linux/rcupdate.h:711 (discriminator 7))
[   31.061621][  T210] ? get_obj_cgroup_from_current (mm/memcontrol.c:2931)
[   31.062657][  T210] __slab_alloc+0x9b/0x100
[   31.063622][  T210] ? __d_alloc (fs/dcache.c:1745)
[   31.064402][  T210] kmem_cache_alloc (mm/slub.c:2936 mm/slub.c:2978 mm/slub.c:2983)
[   31.065230][  T210] ? __d_alloc (fs/dcache.c:1745)
[   31.065990][  T210] __d_alloc (fs/dcache.c:1745)
[   31.066733][  T210] d_alloc (fs/dcache.c:1824)
[   31.067415][  T210] d_alloc_parallel (fs/dcache.c:2575)
[   31.068288][  T210] ? __init_waitqueue_head (kernel/sched/wait.c:13)
[   31.069218][  T210] __lookup_slow (fs/namei.c:1615)
[   31.070026][  T210] lookup_slow (fs/namei.c:1646)
[   31.070786][  T210] walk_component (fs/namei.c:1942)
[   31.071614][  T210] ? inode_permission (fs/namei.c:522)
[   31.072484][  T210] link_path_walk (fs/namei.c:2269)
[   31.073324][  T210] path_openat (fs/namei.c:3490 (discriminator 2))
[   31.074100][  T210] ? __lock_acquired (kernel/locking/lockdep.c:5723)
[   31.074949][  T210] do_filp_open (fs/namei.c:3521)
[   31.075754][  T210] do_sys_openat2 (fs/open.c:1188)
[   31.076615][  T210] do_sys_open (fs/open.c:1203)
[   31.077356][  T210] __ia32_sys_openat (fs/open.c:1214)
[   31.078206][  T210] __do_fast_syscall_32 (arch/x86/entry/common.c:78 arch/x86/entry/common.c:143)
[   31.079126][  T210] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4109 kernel/locking/lockdep.c:4169)
[   31.080178][  T210] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 19))
[   31.081085][  T210] ? __call_rcu (kernel/rcu/tree.c:3072 (discriminator 1))
[   31.081899][  T210] ? __fput (fs/file_table.c:58 fs/file_table.c:298)
[   31.082630][  T210] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4109 kernel/locking/lockdep.c:4169)
[   31.083685][  T210] ? syscall_exit_to_user_mode (kernel/entry/common.c:132 kernel/entry/common.c:304)
[   31.084726][  T210] ? __do_fast_syscall_32 (arch/x86/entry/common.c:147)
[   31.085681][  T210] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4109 kernel/locking/lockdep.c:4169)
[   31.086738][  T210] ? irqentry_exit_to_user_mode (kernel/entry/common.c:132 kernel/entry/common.c:317)
[   31.087741][  T210] do_fast_syscall_32 (arch/x86/entry/common.c:168)
[   31.088627][  T210] do_SYSENTER_32 (arch/x86/entry/common.c:211)
[   31.089427][  T210] entry_SYSENTER_32 (arch/x86/entry/entry_32.S:872)
[   31.090288][  T210] EIP: 0xb7f04549
[   31.090926][  T210] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
All code
========
   0:   03 74 c0 01             add    0x1(%rax,%rax,8),%esi
   4:   10 05 03 74 b8 01       adc    %al,0x1b87403(%rip)        # 0x1b8740d
   a:   10 06                   adc    %al,(%rsi)
   c:   03 74 b4 01             add    0x1(%rsp,%rsi,4),%esi
  10:   10 07                   adc    %al,(%rdi)
  12:   03 74 b0 01             add    0x1(%rax,%rsi,4),%esi
  16:   10 08                   adc    %cl,(%rax)
  18:   03 74 d8 01             add    0x1(%rax,%rbx,8),%esi
  1c:   00 00                   add    %al,(%rax)
  1e:   00 00                   add    %al,(%rax)
  20:   00 51 52                add    %dl,0x52(%rcx)
  23:   55                      push   %rbp
  24:   89 e5                   mov    %esp,%ebp
  26:   0f 34                   sysenter
  28:   cd 80                   int    $0x80
  2a:*  5d                      pop    %rbp                <-- trapping instruction
  2b:   5a                      pop    %rdx
  2c:   59                      pop    %rcx
  2d:   c3                      retq
  2e:   90                      nop
  2f:   90                      nop
  30:   90                      nop
  31:   90                      nop
  32:   8d 76 00                lea    0x0(%rsi),%esi
  35:   58                      pop    %rax
  36:   b8 77 00 00 00          mov    $0x77,%eax
  3b:   cd 80                   int    $0x80
  3d:   90                      nop
  3e:   8d                      .byte 0x8d
  3f:   76                      .byte 0x76

Code starting with the faulting instruction
===========================================
   0:   5d                      pop    %rbp
   1:   5a                      pop    %rdx
   2:   59                      pop    %rcx
   3:   c3                      retq
   4:   90                      nop
   5:   90                      nop
   6:   90                      nop
   7:   90                      nop
   8:   8d 76 00                lea    0x0(%rsi),%esi
   b:   58                      pop    %rax
   c:   b8 77 00 00 00          mov    $0x77,%eax
  11:   cd 80                   int    $0x80
  13:   90                      nop
  14:   8d                      .byte 0x8d
  15:   76                      .byte 0x76


To reproduce:

        # build kernel
        cd linux
        cp config-5.13.0-00219-g7118fc2906e2 .config
        make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
        make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 INSTALL_MOD_PATH= modules_install
        cd
        find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp qemu -k -m modules.cgz job-script # job-script is attached in this email

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests