Hi Gleb, On Mon, Jun 04, 2012 at 02:56:50PM +0300, Gleb Natapov wrote: > On Mon, Jun 04, 2012 at 07:46:03PM +0800, Fengguang Wu wrote: > > Hi, > > > > I'm running lots of kvm instances for doing kernel boot tests. > > Unfortunately the test system itself is not stable enough, I got scary > > errors in both kvm and the host kernel. Like this. > > > What do you mean by "in both kvm and the host kernel". Do you have I mean the host side's kvm user space process and kernel seem to have problems. > similar Oopses inside your guests? If yes can you post one? There are all kinds of problems in the guest kernel, too. Probably I built in too many modules (take a debian config and s/=m/=y/) and enabled too many debug options. Many of the bugs I ran into have already been reported by others in LKML. Here are more weird things. Kernel freeze in the rcu-torture test: (x86_64-nfsroot config) [ 4.584367] Initializing RT-Tester: OK [ 4.598820] audit: initializing netlink socket (disabled) [ 4.616364] type=2000 audit(1338804588.615:1): initialized [ 4.695469] Kprobe smoke test started [ 4.726854] Kprobe smoke test passed successfully [ 4.745166] rcu-torture:--- Start of test: nreaders=2 nfakewriters=4 stat_interval=0 verbose=0 test_no_idle_hz=0 shuffle_interval=3 stutter=5 +irqreader=1 fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0 test_boost_interval=7 test_boost_duration=4 shutdown_secs=0 onoff_interval=0 +onoff_holdoff=0 iput_final() on mtdchar_close(), somehow after unclean umount: (x86_64-allyesdebian config) [ 277.167130] VFS: Busy inodes after unmount of mtd_inodefs. Self-destruct in 5 seconds. Have a nice day... [ 277.167202] VFS: Busy inodes after unmount of mtd_inodefs. Self-destruct in 5 seconds. Have a nice day... [ 277.175814] BUG: unable to handle kernel paging request at ffff88000ad0c820 [ 277.175818] IP: [] iput_final+0x1d/0x17c [ 277.175825] PGD 4015063 PUD 4019063 PMD 17f91067 PTE ad0c160 [ 277.175830] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [ 277.175834] CPU 0 [ 277.175835] Modules linked in: [ 277.175837] [ 277.175839] Pid: 3071, comm: mtd_probe Not tainted 3.4.0-rc1+ #1 Bochs Bochs [ 277.175843] RIP: 0010:[] [] iput_final+0x1d/0x17c [ 277.175847] RSP: 0018:ffff880005ee3d38 EFLAGS: 00010246 [ 277.175849] RAX: 0000000000000001 RBX: ffff8800087b7be8 RCX: 0000000000000001 [ 277.175864] RDX: 8c6318c6318c6301 RSI: ffff8800087b7c80 RDI: ffff8800087b7be8 [ 277.175866] RBP: ffff880005ee3d58 R08: 0000000000000000 R09: 0000000000004c86 [ 277.175868] R10: ffffffff816ce939 R11: ffff8800087b7c80 R12: ffff88000ad0c7f0 [ 277.175870] R13: ffff88000dd347f0 R14: ffff88000ae56be8 R15: ffff880009b95ee8 [ 277.175873] FS: 00007f7b52261700(0000) GS:ffff880017400000(0000) knlGS:0000000000000000 [ 277.175876] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 277.175878] CR2: ffff88000ad0c820 CR3: 0000000007c1a000 CR4: 00000000000006f0 [ 277.175884] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 277.175889] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 277.175892] Process mtd_probe (pid: 3071, threadinfo ffff880005ee2000, task ffff8800004d8740) [ 277.175894] Stack: [ 277.175895] 0000000000000001 ffff8800087b7be8 ffff880005e4ae90 ffff88000dd347f0 [ 277.175900] ffff880005ee3d78 ffffffff811a717e 0000000000000000 ffff880007cac730 [ 277.175904] ffff880005ee3da8 ffffffff81ccf9ea ffff88000ae56be8 ffff880005e4ae90 [ 277.175908] Call Trace: [ 277.175912] [] iput+0x3d/0x41 [ 277.175918] [] mtdchar_close+0x3f/0x78 [ 277.175922] [] __fput+0x117/0x1b8 [ 277.175925] [] fput+0x1a/0x1c [ 277.175929] [] filp_close+0x70/0x7b [ 277.175933] [] close_files+0xc6/0xff [ 277.175936] [] put_files_struct.part.8+0x14/0xa9 [ 277.175940] [] put_files_struct+0x1f/0x23 [ 277.175943] [] exit_files+0x48/0x50 [ 277.175946] [] do_exit+0x2c2/0x463 [ 277.175951] [] ? trace_hardirqs_on_caller+0x12d/0x164 [ 277.175955] [] do_group_exit+0x88/0xb6 [ 277.175958] [] sys_exit_group+0x17/0x17 [ 277.175963] [] system_call_fastpath+0x16/0x1b [ 277.175965] Code: df e8 98 fd ff ff 58 5b 41 5c 41 5d 5d c3 55 48 89 e5 41 55 41 54 53 50 66 66 66 66 90 f6 87 e0 00 00 00 08 4c 8b 67 28 48 89 fb <4d> 8b 6c 24 30 74 11 be 61 05 00 00 48 c7 c7 9e f6 a6 83 e8 1b [ 277.176016] RIP [] iput_final+0x1d/0x17c [ 277.176020] RSP [ 277.176020] CR2: ffff88000ad0c820 [ 277.176020] ---[ end trace 7383ee643f887e33 ]--- [ 277.176020] Fixing recursive fault but reboot is needed! [ 277.176020] BUG: scheduling while atomic: mtd_probe/3071/0x00000002 [ 277.176020] INFO: lockdep is turned off. [ 277.176020] Modules linked in: [ 277.176020] irq event stamp: 3696 [ 277.176020] hardirqs last enabled at (3695): [] __call_rcu+0x2fe/0x316 [ 277.176020] hardirqs last disabled at (3696): [] error_sti+0x5/0x6 [ 277.176020] softirqs last enabled at (2044): [] __do_softirq+0x239/0x24f [ 277.176020] softirqs last disabled at (1965): [] call_softirq+0x1c/0x30 [ 277.176020] Pid: 3071, comm: mtd_probe Tainted: G D 3.4.0-rc1+ #1 [ 277.176020] Call Trace: [ 277.176020] [] ? print_irqtrace_events+0x9d/0xa1 [ 277.176020] [] __schedule_bug+0x66/0x6a [ 277.176020] [] __schedule+0x8c/0x59a [ 277.176020] [] schedule+0x64/0x66 [ 277.176020] [] do_exit+0x111/0x463 [ 277.176020] [] oops_end+0xbe/0xc6 [ 277.176020] [] no_context+0x1ae/0x1bd [ 277.176020] [] __bad_area_nosemaphore+0x1de/0x1ff [ 277.176020] [] ? pte_offset_kernel+0x14/0x3a [ 277.176020] [] bad_area_nosemaphore+0x13/0x15 [ 277.176020] [] do_page_fault+0x1bc/0x390 [ 277.176020] [] ? local_clock+0x36/0x4d [ 277.176020] [] ? error_sti+0x5/0x6 [ 277.176020] [] ? time_hardirqs_off+0x26/0x29 [ 277.176020] [] ? error_sti+0x5/0x6 [ 277.176020] [] ? trace_hardirqs_off_caller+0x3f/0x9c [ 277.176020] [] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 277.176020] [] do_async_page_fault+0x31/0x4f [ 277.176020] [] async_page_fault+0x25/0x30 [ 277.176020] [] ? _atomic_dec_and_lock+0x2d/0x4c [ 277.176020] [] ? iput_final+0x1d/0x17c [ 277.176020] [] iput+0x3d/0x41 [ 277.176020] [] mtdchar_close+0x3f/0x78 [ 277.176020] [] __fput+0x117/0x1b8 [ 277.176020] [] fput+0x1a/0x1c [ 277.176020] [] filp_close+0x70/0x7b [ 277.176020] [] close_files+0xc6/0xff [ 277.176020] [] put_files_struct.part.8+0x14/0xa9 [ 277.176020] [] put_files_struct+0x1f/0x23 [ 277.176020] [] exit_files+0x48/0x50 [ 277.176020] [] do_exit+0x2c2/0x463 [ 277.176020] [] ? trace_hardirqs_on_caller+0x12d/0x164 [ 277.176020] [] do_group_exit+0x88/0xb6 [ 277.176020] [] sys_exit_group+0x17/0x17 [ 277.176020] [] system_call_fastpath+0x16/0x1b [ 277.176265] VFS: Busy inodes after unmount of mtd_inodefs. Self-destruct in 5 seconds. Have a nice day... [ 277.184584] BUG: unable to handle kernel paging request at ffff88000ac7d820 [ 277.184588] IP: [] iput_final+0x1d/0x17c [ 277.184594] PGD 4015063 PUD 4019063 PMD 17f91067 PTE ac7d160 This looks like a real bug in some device driver: (x86_64-allyesdebian config) [ 503.061353] BUG: unable to handle kernel NULL pointer dereference at 0000000000000015 [ 503.065177] IP: [] usb_hcd_flush_endpoint+0x2f/0xec [ 503.065177] PGD 0 [ 503.065177] Oops: 0000 [#1] SMP [ 503.065177] CPU 0 [ 503.065177] Modules linked in: [ 503.065177] [ 503.065177] Pid: 20, comm: kworker/0:1 Not tainted 3.4.0-rc7+ #320 Bochs Bochs [ 503.065177] RIP: 0010:[] [] usb_hcd_flush_endpoint+0x2f/0xec [ 503.065177] RSP: 0018:ffff88000e013d20 EFLAGS: 00010046 [ 503.065177] RAX: 0000000000380038 RBX: ffff88000ace0000 RCX: 0000000000000000 [ 503.065177] RDX: 0000000000000038 RSI: ffff88000ad001c0 RDI: ffffffff83e1d86c [ 503.065177] RBP: 0000000000000005 R08: ffff88000e012000 R09: 0000000000000007 [ 503.065177] R10: 000000000000086e R11: 0000000000000000 R12: 0000000000000015 [ 503.065177] R13: 0000000600000005 R14: 0000000000000001 R15: 0000000000000400 [ 503.065177] FS: 0000000000000000(0000) GS:ffff88000f400000(0000) knlGS:0000000000000000 [ 503.065177] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 503.065177] CR2: 0000000000000015 CR3: 0000000003813000 CR4: 00000000000006f0 [ 503.065177] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 503.065177] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 503.065177] Process kworker/0:1 (pid: 20, threadinfo ffff88000e012000, task ffff88000f3ea140) [ 503.065177] Stack: [ 503.065177] ffff88000ace0000 ffff88000ace0000 ffff88000ace0008 ffff88000ace0080 [ 503.065177] 0000000000000400 ffffffff81fb969d 000000000000001d ffff88000ace8400 [ 503.065177] 0000000000000720 ffff88000acea888 ffff88000acea800 ffff88000e013df8 [ 503.065177] Call Trace: [ 503.065177] [] ? usb_suspend_both+0x155/0x184 [ 503.065177] [] ? usb_runtime_suspend+0x27/0x54 [ 503.065177] [] ? usb_probe_device+0x33/0x33 [ 503.065177] [] ? __rpm_callback+0x2b/0x4f [ 503.065177] [] ? rpm_suspend+0x2f1/0x47c [ 503.065177] [] ? mod_timer+0x7b/0x89 [ 503.065177] [] ? pm_schedule_suspend+0xab/0xab [ 503.065177] [] ? pm_runtime_work+0x71/0x8c [ 503.065177] [] ? process_one_work+0x125/0x296 [ 503.065177] [] ? worker_thread+0xc4/0x164 [ 503.065177] [] ? rescuer_thread+0x16e/0x16e [ 503.065177] [] ? kthread+0x7d/0x85 [ 503.065177] [] ? kernel_thread_helper+0x4/0x10 [ 503.065177] [] ? flush_kthread_worker+0x92/0x92 [ 503.065177] [] ? gs_change+0x13/0x13 [ 503.065177] Code: f6 41 54 55 48 89 f5 53 53 0f 84 d0 00 00 00 4c 8d 65 10 48 89 fb e8 b2 d9 b3 00 48 c7 c7 6c d8 e1 83 4c 8b 6b 40 e8 28 e4 b3 00 <48> 8b 5d 10 eb 0a 83 7b 18 00 74 0a 48 8b 5b 20 48 83 eb 20 eb [ 503.065177] RIP [] usb_hcd_flush_endpoint+0x2f/0xec [ 503.065177] RSP [ 503.065177] CR2: 0000000000000015 [ 503.065177] ---[ end trace e6f1441a2b05b7c3 ]--- Another trace: [ 109.687421] intel_oaktrail: Platform not recognized (You could try the module's force-parameter) [ 109.687421] ieee802154hardmac ieee802154hardmac: Added ieee802154 HardMAC hardware [ 109.974115] vhci_hcd: changed 0 [ 110.046770] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC [ 110.050713] CPU 0 [ 110.050713] Modules linked in: [ 110.050713] [ 110.050713] Pid: 20, comm: kworker/0:1 Not tainted 3.5.0-rc1+ #13 Bochs Bochs [ 110.050713] RIP: 0010:[] [] usb_hcd_flush_endpoint+0x4c/0x10b [ 110.050713] RSP: 0018:ffff880015811c40 EFLAGS: 00010046 [ 110.050713] RAX: ffffffff822b1d35 RBX: ffff88000bde0000 RCX: 0000000000000001 [ 110.050713] RDX: ffff88000bde0000 RSI: ffffffff841d76e8 RDI: 0000000000000046 [ 110.050713] RBP: ffff880015811c60 R08: ffffffff84a434b0 R09: ffffffff84a434b0 [ 110.050713] R10: ffffffff822b1d35 R11: ffffffff841d76e8 R12: 5a5a0030305f7065 [ 110.050713] R13: 5a5a0030305f7075 R14: fffffffdfffffffd R15: 0000000000000001 [ 110.050713] FS: 0000000000000000(0000) GS:ffff880017400000(0000) knlGS:0000000000000000 [ 110.050713] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 110.050713] CR2: 0000000000000000 CR3: 0000000004014000 CR4: 00000000000006f0 [ 110.050713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 110.050713] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 110.050713] Process kworker/0:1 (pid: 20, threadinfo ffff880015810000, task ffff88001580c500) [ 110.050713] Stack: [ 110.050713] ffff88000bde0000 ffff88000bde0000 ffff88000bde0080 0000000000000400 [ 110.050713] ffff880015811cb0 ffffffff822b5b46 ffff88000be2d7f0 0000040015811d40 [ 110.050713] ffff880015811ca0 ffff88000bde2878 ffff88000bde27f0 ffff880015811d28 [ 110.050713] Call Trace: [ 110.050713] Call Trace: [ 110.050713] [] usb_suspend_both+0x16f/0x19d [ 110.050713] [] usb_runtime_suspend+0x36/0x5d [ 110.050713] [] ? usb_probe_device+0x37/0x37 [ 110.050713] [] __rpm_callback+0x34/0x5b [ 110.050713] [] rpm_suspend+0x2ff/0x49d [ 110.050713] [] ? lock_acquired+0x5f/0x68 [ 110.050713] [] ? pm_runtime_work+0x1e/0xa1 [ 110.050713] [] pm_runtime_work+0x88/0xa1 [ 110.050713] [] process_one_work+0x240/0x401 [ 110.050713] [] ? process_one_work+0x165/0x401 [ 110.050713] [] ? worker_thread+0x38/0x15f [ 110.050713] [] ? pm_schedule_suspend+0xb1/0xb1 [ 110.050713] [] worker_thread+0xdb/0x15f [ 110.050713] [] ? manage_workers.isra.24+0xaa/0xaa [ 110.050713] [] kthread+0xaf/0xb7 [ 110.050713] [] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 110.050713] [] kernel_thread_helper+0x4/0x10 [ 110.050713] [] ? retint_restore_args+0x13/0x13 [ 110.050713] [] ? __init_kthread_worker+0x5a/0x5a [ 110.050713] [] ? gs_change+0x13/0x13 [ 110.050713] Code: be 5b 06 00 00 48 c7 c7 81 40 cd 83 4d 8d 6c 24 10 e8 1c e1 e0 fe e8 0f f0 cc 00 48 c7 c7 d0 76 1d 84 4c 8b 73 40 e8 a6 fe cc 00 <49> 8b 5c 24 10 eb 0a 83 7b 18 00 74 13 48 8b 5b 20 48 83 eb 20 [ 110.050713] RIP [] usb_hcd_flush_endpoint+0x4c/0x10b [ 110.050713] RSP [ 110.050713] ---[ end trace 3ffa21620df1b13a ]--- [ 110.050713] note: kworker/0:1[20] exited with preempt_count 1 [ 114.071820] BUG: scheduling while atomic: kworker/0:1/20/0x10000002 [ 114.132169] INFO: lockdep is turned off. [ 114.192540] Modules linked in: [ 114.252134] Pid: 20, comm: kworker/0:1 Tainted: G D 3.5.0-rc1+ #13 [ 114.315119] Call Trace: [ 114.376805] [] __schedule_bug+0x66/0x75 [ 114.440581] [] __schedule+0x8c/0x599 [ 114.504267] [] ? exit_fs+0x3a/0x75 [ 114.568691] [] __cond_resched+0x27/0x32 [ 114.633880] [] _cond_resched+0x19/0x22 [ 114.698817] [] mutex_lock_nested+0x2a/0x45 [ 114.764387] [] perf_event_exit_task+0x2a/0x9d [ 114.830649] [] do_exit+0x360/0x477 [ 114.896676] [] oops_end+0xbe/0xc6 [ 114.962852] [] die+0x5a/0x64 [ 115.028695] [] do_general_protection+0x131/0x13a [ 115.096531] [] general_protection+0x25/0x30 [ 115.164060] [] ? usb_hcd_flush_endpoint+0x4c/0x10b [ 115.232190] [] ? usb_hcd_flush_endpoint+0x4c/0x10b [ 115.298747] [] ? usb_hcd_flush_endpoint+0x4c/0x10b [ 115.364073] [] usb_suspend_both+0x16f/0x19d [ 115.428984] [] usb_runtime_suspend+0x36/0x5d [ 115.494194] [] ? usb_probe_device+0x37/0x37 [ 115.558063] [] __rpm_callback+0x34/0x5b [ 115.620646] [] rpm_suspend+0x2ff/0x49d [ 115.679620] [] ? lock_acquired+0x5f/0x68 [ 115.737314] [] ? pm_runtime_work+0x1e/0xa1 [ 115.794345] [] pm_runtime_work+0x88/0xa1 [ 115.851304] [] process_one_work+0x240/0x401 [ 115.908428] [] ? process_one_work+0x165/0x401 [ 115.964883] [] ? worker_thread+0x38/0x15f [ 116.020499] [] ? pm_schedule_suspend+0xb1/0xb1 [ 116.077215] [] worker_thread+0xdb/0x15f [ 116.131873] [] ? manage_workers.isra.24+0xaa/0xaa [ 116.185794] [] kthread+0xaf/0xb7 [ 116.238823] [] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 116.293488] [] kernel_thread_helper+0x4/0x10 [ 116.348421] [] ? retint_restore_args+0x13/0x13 [ 116.403763] [] ? __init_kthread_worker+0x5a/0x5a [ 116.459768] [] ? gs_change+0x13/0x13 A funny bug triggered at kernel panic time: (x86_64-allyesdebian config) [ 1526.520230] =============================== [ 1526.520230] [ INFO: suspicious RCU usage. ] [ 1526.520230] 3.5.0-rc1+ #12 Not tainted [ 1526.520230] ------------------------------- [ 1526.520230] /c/kernel-tests/mm/include/linux/rcupdate.h:436 Illegal context switch in RCU read-side critical section! [ 1526.520230] [ 1526.520230] other info that might help us debug this: [ 1526.520230] [ 1526.520230] [ 1526.520230] rcu_scheduler_active = 1, debug_locks = 0 [ 1526.520230] 3 locks held by net.agent/3279: [ 1526.520230] #0: (&mm->mmap_sem){++++++}, at: [] do_page_fault+0x193/0x390 [ 1526.520230] #1: (panic_lock){+.+...}, at: [] panic+0x37/0x1d3 [ 1526.520230] #2: (rcu_read_lock){.+.+..}, at: [] rcu_lock_acquire+0x0/0x29 [ 1526.520230] [ 1526.520230] stack backtrace: [ 1526.520230] Pid: 3279, comm: net.agent Not tainted 3.5.0-rc1+ #12 [ 1526.520230] Call Trace: [ 1526.520230] [] lockdep_rcu_suspicious+0x109/0x112 [ 1526.520230] [] rcu_preempt_sleep_check+0x45/0x47 [ 1526.520230] [] __might_sleep+0x1e/0x19a [ 1526.520230] [] down_write+0x26/0x81 [ 1526.520230] [] led_trigger_unregister+0x1f/0x9c [ 1526.520230] [] heartbeat_reboot_notifier+0x15/0x19 [ 1526.520230] [] notifier_call_chain+0x96/0xcd [ 1526.520230] [] __atomic_notifier_call_chain+0x8e/0xff [ 1526.520230] [] ? kmsg_dump+0x37/0x1eb [ 1526.520230] [] atomic_notifier_call_chain+0x14/0x16 [ 1526.520230] [] panic+0xe8/0x1d3 [ 1526.520230] [] out_of_memory+0x15d/0x1d3 [ 1526.520230] [] __alloc_pages_nodemask+0x5c2/0x762 [ 1526.520230] [] alloc_pages_current+0xd2/0xf3 [ 1526.520230] [] __page_cache_alloc+0xa1/0xad [ 1526.520230] [] filemap_fault+0x1f9/0x2f6 [ 1526.520230] [] ? time_hardirqs_off+0x26/0x29 [ 1526.520230] [] ? trace_hardirqs_off_caller+0x1f/0x9c [ 1526.520230] [] __do_fault+0xc8/0x37d [ 1526.520230] [] ? lock_release_holdtime.part.23+0x4e/0x55 [ 1526.520230] [] handle_pte_fault+0xf1/0x196 [ 1526.520230] [] ? pmd_offset+0x14/0x3a [ 1526.520230] [] handle_mm_fault+0x1b1/0x1cb [ 1526.520230] [] do_page_fault+0x36b/0x390 [ 1526.520230] [] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 1526.520230] [] ? handle_pte_fault+0x124/0x196 [ 1526.520230] [] ? handle_pte_fault+0x124/0x196 [ 1526.520230] [] ? error_sti+0x5/0x6 [ 1526.520230] [] ? time_hardirqs_off+0x26/0x29 [ 1526.520230] [] ? error_sti+0x5/0x6 [ 1526.520230] [] ? trace_hardirqs_off_caller+0x3f/0x9c [ 1526.520230] [] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 1526.520230] [] do_async_page_fault+0x31/0x5e [ 1526.520230] [] async_page_fault+0x25/0x30 Thanks, Fengguang