* Re: kernel BUG at mm/vmscan.c:1114
From: Andrew Morton @ 2011-08-02 7:22 UTC
To: Xiaotian Feng; +Cc: linux-mm, linux-kernel, mgorman

On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:

> I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
> was trying to build my kernel. The photo of crash screen and my config
> is attached.

hm, now why has that started happening?

Perhaps you could apply this debug patch, see if we can narrow it down?

--- a/mm/vmscan.c~a
+++ a/mm/vmscan.c
@@ -54,6 +54,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
+#define D() do { printk("%s:%d\n", __FILE__, __LINE__); } while (0)
+
 /*
  * reclaim_mode determines how the inactive list is shrunk
  * RECLAIM_MODE_SINGLE: Reclaim only order-0 pages
@@ -1018,27 +1020,37 @@ int __isolate_lru_page(struct page *page
 	int ret = -EINVAL;
 
 	/* Only take pages on the LRU. */
-	if (!PageLRU(page))
+	if (!PageLRU(page)) {
+		D();
 		return ret;
+	}
 
 	/*
 	 * When checking the active state, we need to be sure we are
 	 * dealing with comparible boolean values.  Take the logical not
 	 * of each.
 	 */
-	if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
+	if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode)) {
+		printk("mode:%d\n", mode);
+		D();
 		return ret;
+	}
 
-	if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file)
+	if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file) {
+		printk("mode: %d, pifc: %d, file: %d\n", mode,
+			page_is_file_cache(page), file);
+		D();
 		return ret;
-
+	}
 	/*
 	 * When this function is being called for lumpy reclaim, we
 	 * initially look into all LRU pages, active, inactive and
 	 * unevictable; only give shrink_page_list evictable pages.
 	 */
-	if (PageUnevictable(page))
+	if (PageUnevictable(page)) {
+		D();
 		return ret;
+	}
 
 	ret = -EBUSY;
 
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: kernel BUG at mm/vmscan.c:1114
From: Xiaotian Feng @ 2011-08-03 6:44 UTC
To: Andrew Morton; +Cc: linux-mm, linux-kernel, mgorman

On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
>
>> I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>> was trying to build my kernel. The photo of crash screen and my config
>> is attached.
>
> hm, now why has that started happening?
>
> Perhaps you could apply this debug patch, see if we can narrow it down?
>

I will try it then, but it isn't very reproducible :(
But my system hung after some list corruption warnings... I hit the
corruption 4 times...

Dozens of corruption warnings followed this one:

[ 3641.495875] ------------[ cut here ]------------
[ 3641.495885] WARNING: at lib/list_debug.c:53 __list_del_entry+0xa1/0xd0()
[ 3641.495888] Hardware name: 42424XC
[ 3641.495891] list_del corruption. prev->next should be
ffffea00000a6c20, but was ffff880033edde70
[ 3641.495893] Modules linked in: ip6table_filter ip6_tables
ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
iptable_filter ip_tables x_tables bridge stp binfmt_misc parport_pc
ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
arc4 snd_hwdep snd_pcm snd_seq_midi snd_rawmidi cryptd aes_x86_64
iwlagn snd_seq_midi_event aes_generic snd_seq snd_timer snd_seq_device
mac80211 btusb bluetooth snd cfg80211 snd_page_alloc i915 uvcvideo
videodev drm_kms_helper psmouse v4l2_compat_ioctl32 drm tpm_tis tpm lp
soundcore tpm_bios nvram i2c_algo_bit serio_raw joydev parport video
usbhid hid ahci libahci firewire_ohci firewire_core crc_itu_t
sdhci_pci sdhci e1000e
[ 3641.495987] Pid: 22709, comm: skype Tainted: G W 3.0.0+ #23
[ 3641.495989] Call Trace:
[ 3641.495996] [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
[ 3641.496001] [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
[ 3641.496006] [<ffffffff81332a71>] __list_del_entry+0xa1/0xd0
[ 3641.496010] [<ffffffff81332ab1>] list_del+0x11/0x40
[ 3641.496015] [<ffffffff8117a212>] __slab_free+0x362/0x3d0
[ 3641.496020] [<ffffffff81519ac9>] ? __sk_free+0xf9/0x1d0
[ 3641.496025] [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
[ 3641.496028] [<ffffffff81519ac9>] ? __sk_free+0xf9/0x1d0
[ 3641.496032] [<ffffffff81519ac9>] ? __sk_free+0xf9/0x1d0
[ 3641.496036] [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
[ 3641.496040] [<ffffffff81519ac9>] __sk_free+0xf9/0x1d0
[ 3641.496044] [<ffffffff81519c35>] sk_free+0x25/0x30
[ 3641.496049] [<ffffffff81576ac9>] tcp_close+0x239/0x440
[ 3641.496054] [<ffffffff815a10ef>] inet_release+0xcf/0x150
[ 3641.496058] [<ffffffff815a1042>] ? inet_release+0x22/0x150
[ 3641.496063] [<ffffffff81513f19>] sock_release+0x29/0x90
[ 3641.496067] [<ffffffff81513f97>] sock_close+0x17/0x30
[ 3641.496072] [<ffffffff8119151d>] fput+0xfd/0x240
[ 3641.496077] [<ffffffff8118c9b6>] filp_close+0x66/0x90
[ 3641.496081] [<ffffffff8118d412>] sys_close+0xc2/0x1a0
[ 3641.496087] [<ffffffff81652b60>] sysenter_dispatch+0x7/0x33
[ 3641.496093] [<ffffffff8132c8ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f

After I rebooted and resumed building my kernel, the system hung
again, and I got the following warnings:

[ 1220.468089] ------------[ cut here ]------------
[ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
[ 1220.468102] Hardware name: 42424XC
[ 1220.468104] list_del corruption. next->prev should be
ffffea0000e069a0, but was ffff880100216c78
[ 1220.468106] Modules linked in: ip6table_filter ip6_tables
ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
sdhci_pci sdhci crc_itu_t
[ 1220.468185] Pid: 1168, comm: Xorg Tainted: G W 3.0.0+ #23
[ 1220.468188] Call Trace:
[ 1220.468190] <IRQ> [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
[ 1220.468201] [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
[ 1220.468206] [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
[ 1220.468210] [<ffffffff81332ab1>] list_del+0x11/0x40
[ 1220.468216] [<ffffffff8117a212>] __slab_free+0x362/0x3d0
[ 1220.468222] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
[ 1220.468226] [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
[ 1220.468230] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
[ 1220.468234] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
[ 1220.468239] [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
[ 1220.468243] [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
[ 1220.468247] [<ffffffff811c6654>] bio_free+0x34/0x70
[ 1220.468250] [<ffffffff811c66a5>] bio_fs_de

So is it possible that my previous BUG is triggered by slab list corruption?
* Re: kernel BUG at mm/vmscan.c:1114
From: Mel Gorman @ 2011-08-03 8:54 UTC
To: Xiaotian Feng; +Cc: Andrew Morton, linux-mm, linux-kernel

On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
> >
> >> I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
> >> was trying to build my kernel. The photo of crash screen and my config
> >> is attached.
> >
> > hm, now why has that started happening?
> >
> > Perhaps you could apply this debug patch, see if we can narrow it down?
> >
>
> I will try it then, but it isn't very reproducible :(
> But my system hung after some list corruption warnings... I hit the
> corruption 4 times...
>

That is very unexpected but if lists are being corrupted, it could
explain the previously reported bug as that bug looked like an active
page on an inactive list.

What was the last working kernel? Can you bisect?

> [ 1220.468089] ------------[ cut here ]------------
> [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
> [ 1220.468102] Hardware name: 42424XC
> [ 1220.468104] list_del corruption. next->prev should be
> ffffea0000e069a0, but was ffff880100216c78
> [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
> sdhci_pci sdhci crc_itu_t
> [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G W 3.0.0+ #23
> [ 1220.468188] Call Trace:
> [ 1220.468190] <IRQ> [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
> [ 1220.468201] [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
> [ 1220.468206] [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
> [ 1220.468210] [<ffffffff81332ab1>] list_del+0x11/0x40
> [ 1220.468216] [<ffffffff8117a212>] __slab_free+0x362/0x3d0
> [ 1220.468222] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> [ 1220.468226] [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
> [ 1220.468230] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> [ 1220.468234] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> [ 1220.468239] [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
> [ 1220.468243] [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
> [ 1220.468247] [<ffffffff811c6654>] bio_free+0x34/0x70
> [ 1220.468250] [<ffffffff811c66a5>] bio_fs_de
>

This warning and the page reclaim warning are on paths that are
commonly used and I would expect to see multiple reports. I wonder
what is happening on your machine that is so unusual.

Have you run memtest on this machine for a few hours and badblocks
on the disk to ensure this is not hardware trouble?

> So is it possible that my previous BUG is triggered by slab list corruption?

Not directly, but clearly there is something very wrong.

If slub corruption reports are very common and kernel 3.0 is fine, my
strongest candidate for the corruption would be the SLUB lockless
patches. Try

git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R

They should revert cleanly with offsets.

--
Mel Gorman
SUSE Labs
* Re: kernel BUG at mm/vmscan.c:1114
From: Li Zefan @ 2011-08-03 9:02 UTC
To: Mel Gorman; +Cc: Xiaotian Feng, Andrew Morton, linux-mm, linux-kernel

16:54, Mel Gorman wrote:
> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>>> On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
>>>
>>>> I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>>>> was trying to build my kernel. The photo of crash screen and my config
>>>> is attached.
>>>
>>> hm, now why has that started happening?
>>>
>>> Perhaps you could apply this debug patch, see if we can narrow it down?
>>>
>>
>> I will try it then, but it isn't very reproducible :(
>> But my system hung after some list corruption warnings... I hit the
>> corruption 4 times...
>>
>
> That is very unexpected but if lists are being corrupted, it could
> explain the previously reported bug as that bug looked like an active
> page on an inactive list.
>
> What was the last working kernel? Can you bisect?
>

I just triggered the same BUG_ON() while running xfstests to test btrfs,
but I did not note which test case was running when it happened; it was
case 134 or thereabouts.

--
Li Zefan
* Re: kernel BUG at mm/vmscan.c:1114
From: Xiaotian Feng @ 2011-08-04 3:54 UTC
To: Mel Gorman; +Cc: Andrew Morton, linux-mm, linux-kernel

On Wed, Aug 3, 2011 at 4:54 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
>> >
>> >> I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>> >> was trying to build my kernel. The photo of crash screen and my config
>> >> is attached.
>> >
>> > hm, now why has that started happening?
>> >
>> > Perhaps you could apply this debug patch, see if we can narrow it down?
>> >
>>
>> I will try it then, but it isn't very reproducible :(
>> But my system hung after some list corruption warnings... I hit the
>> corruption 4 times...
>>
>
> That is very unexpected but if lists are being corrupted, it could
> explain the previously reported bug as that bug looked like an active
> page on an inactive list.
>
> What was the last working kernel? Can you bisect?
>
>> [ 1220.468089] ------------[ cut here ]------------
>> [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
>> [ 1220.468102] Hardware name: 42424XC
>> [ 1220.468104] list_del corruption. next->prev should be
>> ffffea0000e069a0, but was ffff880100216c78
>> [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
>> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
>> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
>> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
>> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
>> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
>> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
>> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
>> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
>> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
>> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
>> sdhci_pci sdhci crc_itu_t
>> [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G W 3.0.0+ #23
>> [ 1220.468188] Call Trace:
>> [ 1220.468190] <IRQ> [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
>> [ 1220.468201] [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
>> [ 1220.468206] [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
>> [ 1220.468210] [<ffffffff81332ab1>] list_del+0x11/0x40
>> [ 1220.468216] [<ffffffff8117a212>] __slab_free+0x362/0x3d0
>> [ 1220.468222] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>> [ 1220.468226] [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
>> [ 1220.468230] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>> [ 1220.468234] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>> [ 1220.468239] [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
>> [ 1220.468243] [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
>> [ 1220.468247] [<ffffffff811c6654>] bio_free+0x34/0x70
>> [ 1220.468250] [<ffffffff811c66a5>] bio_fs_de
>>
>

I'm hitting this again today, when I'm trying to rebuild my kernel....
Looking at it a bit:

list_del corruption. next->prev should be ffffea0000e069a0, but was
ffff880100216c78

I find something interesting from my syslog:

PERCPU: Embedded 28 pages/cpu @ffff880100200000 s83456 r8192 d23040 u262144

> This warning and the page reclaim warning are on paths that are
> commonly used and I would expect to see multiple reports. I wonder
> what is happening on your machine that is so unusual.
>
> Have you run memtest on this machine for a few hours and badblocks
> on the disk to ensure this is not hardware trouble?
>
>> So is it possible that my previous BUG is triggered by slab list corruption?
>
> Not directly, but clearly there is something very wrong.
>
> If slub corruption reports are very common and kernel 3.0 is fine, my
> strongest candidate for the corruption would be the SLUB lockless
> patches. Try
>
> git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R
>

I will try it now, thanks.

> They should revert cleanly with offsets.
>
> --
> Mel Gorman
> SUSE Labs
>
* Re: kernel BUG at mm/vmscan.c:1114
From: Xiaotian Feng @ 2011-08-05 8:42 UTC
To: Mel Gorman; +Cc: Andrew Morton, linux-mm, linux-kernel

On Thu, Aug 4, 2011 at 11:54 AM, Xiaotian Feng <xtfeng@gmail.com> wrote:
> On Wed, Aug 3, 2011 at 4:54 PM, Mel Gorman <mgorman@suse.de> wrote:
>> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
>>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>>> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
>>> >
>>> >> I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>>> >> was trying to build my kernel. The photo of crash screen and my config
>>> >> is attached.
>>> >
>>> > hm, now why has that started happening?
>>> >
>>> > Perhaps you could apply this debug patch, see if we can narrow it down?
>>> >
>>>
>>> I will try it then, but it isn't very reproducible :(
>>> But my system hung after some list corruption warnings... I hit the
>>> corruption 4 times...
>>>
>>
>> That is very unexpected but if lists are being corrupted, it could
>> explain the previously reported bug as that bug looked like an active
>> page on an inactive list.
>>
>> What was the last working kernel? Can you bisect?
>>
>>> [ 1220.468089] ------------[ cut here ]------------
>>> [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
>>> [ 1220.468102] Hardware name: 42424XC
>>> [ 1220.468104] list_del corruption. next->prev should be
>>> ffffea0000e069a0, but was ffff880100216c78
>>> [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
>>> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
>>> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
>>> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
>>> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
>>> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
>>> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
>>> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
>>> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
>>> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
>>> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
>>> sdhci_pci sdhci crc_itu_t
>>> [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G W 3.0.0+ #23
>>> [ 1220.468188] Call Trace:
>>> [ 1220.468190] <IRQ> [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
>>> [ 1220.468201] [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
>>> [ 1220.468206] [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
>>> [ 1220.468210] [<ffffffff81332ab1>] list_del+0x11/0x40
>>> [ 1220.468216] [<ffffffff8117a212>] __slab_free+0x362/0x3d0
>>> [ 1220.468222] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>> [ 1220.468226] [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
>>> [ 1220.468230] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>> [ 1220.468234] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>> [ 1220.468239] [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
>>> [ 1220.468243] [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
>>> [ 1220.468247] [<ffffffff811c6654>] bio_free+0x34/0x70
>>> [ 1220.468250] [<ffffffff811c66a5>] bio_fs_de
>>>
>>
>
> I'm hitting this again today, when I'm trying to rebuild my kernel....
> Looking at it a bit:
>
> list_del corruption. next->prev should be ffffea0000e069a0, but was
> ffff880100216c78
>
> I find something interesting from my syslog:
>
> PERCPU: Embedded 28 pages/cpu @ffff880100200000 s83456 r8192 d23040 u262144
>
>> This warning and the page reclaim warning are on paths that are
>> commonly used and I would expect to see multiple reports. I wonder
>> what is happening on your machine that is so unusual.
>>
>> Have you run memtest on this machine for a few hours and badblocks
>> on the disk to ensure this is not hardware trouble?
>>
>>> So is it possible that my previous BUG is triggered by slab list corruption?
>>
>> Not directly, but clearly there is something very wrong.
>>
>> If slub corruption reports are very common and kernel 3.0 is fine, my
>> strongest candidate for the corruption would be the SLUB lockless
>> patches. Try
>>
>> git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R
>>
>

Here's an update for the results:

3.0.0-rc7: running for hours without a crash
upstream kernel: list corruption happened while building kernel within
10 mins (I'm running some apps chrome/firefox/thunderbird/... as well)
upstream kernel with above revert: running for hours without a crash

Trying to bisect but rebuild is slow ....

> I will try it now, thanks.
>
>> They should revert cleanly with offsets.
>>
>> --
>> Mel Gorman
>> SUSE Labs
>>
>
* Re: kernel BUG at mm/vmscan.c:1114
From: Mel Gorman @ 2011-08-05 9:19 UTC
To: Xiaotian Feng
Cc: Andrew Morton, linux-mm, linux-kernel, Pekka Enberg, Christoph Lameter

(Adding patch author to cc)

On Fri, Aug 05, 2011 at 04:42:43PM +0800, Xiaotian Feng wrote:
> On Thu, Aug 4, 2011 at 11:54 AM, Xiaotian Feng <xtfeng@gmail.com> wrote:
> > On Wed, Aug 3, 2011 at 4:54 PM, Mel Gorman <mgorman@suse.de> wrote:
> >> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
> >>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> >>> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
> >>> >
> >>> >> I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
> >>> >> was trying to build my kernel. The photo of crash screen and my config
> >>> >> is attached.
> >>> >
> >>> > hm, now why has that started happening?
> >>> >
> >>> > Perhaps you could apply this debug patch, see if we can narrow it down?
> >>> >
> >>>
> >>> I will try it then, but it isn't very reproducible :(
> >>> But my system hung after some list corruption warnings... I hit the
> >>> corruption 4 times...
> >>>
> >>
> >> That is very unexpected but if lists are being corrupted, it could
> >> explain the previously reported bug as that bug looked like an active
> >> page on an inactive list.
> >>
> >> What was the last working kernel? Can you bisect?
> >>
> >>> [ 1220.468089] ------------[ cut here ]------------
> >>> [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
> >>> [ 1220.468102] Hardware name: 42424XC
> >>> [ 1220.468104] list_del corruption. next->prev should be
> >>> ffffea0000e069a0, but was ffff880100216c78
> >>> [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
> >>> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
> >>> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
> >>> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
> >>> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
> >>> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
> >>> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
> >>> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
> >>> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
> >>> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
> >>> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
> >>> sdhci_pci sdhci crc_itu_t
> >>> [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G W 3.0.0+ #23
> >>> [ 1220.468188] Call Trace:
> >>> [ 1220.468190] <IRQ> [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
> >>> [ 1220.468201] [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
> >>> [ 1220.468206] [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
> >>> [ 1220.468210] [<ffffffff81332ab1>] list_del+0x11/0x40
> >>> [ 1220.468216] [<ffffffff8117a212>] __slab_free+0x362/0x3d0
> >>> [ 1220.468222] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> >>> [ 1220.468226] [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
> >>> [ 1220.468230] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> >>> [ 1220.468234] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
> >>> [ 1220.468239] [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
> >>> [ 1220.468243] [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
> >>> [ 1220.468247] [<ffffffff811c6654>] bio_free+0x34/0x70
> >>> [ 1220.468250] [<ffffffff811c66a5>] bio_fs_de
> >>>
> >>
> >
> > I'm hitting this again today, when I'm trying to rebuild my kernel....
> > Looking at it a bit:
> >
> > list_del corruption. next->prev should be ffffea0000e069a0, but was
> > ffff880100216c78
> >
> > I find something interesting from my syslog:
> >
> > PERCPU: Embedded 28 pages/cpu @ffff880100200000 s83456 r8192 d23040 u262144
> >
> >> This warning and the page reclaim warning are on paths that are
> >> commonly used and I would expect to see multiple reports. I wonder
> >> what is happening on your machine that is so unusual.
> >>
> >> Have you run memtest on this machine for a few hours and badblocks
> >> on the disk to ensure this is not hardware trouble?
> >>
> >>> So is it possible that my previous BUG is triggered by slab list corruption?
> >>
> >> Not directly, but clearly there is something very wrong.
> >>
> >> If slub corruption reports are very common and kernel 3.0 is fine, my
> >> strongest candidate for the corruption would be the SLUB lockless
> >> patches. Try
> >>
> >> git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R
> >>
> >
>
> Here's an update for the results:
>
> 3.0.0-rc7: running for hours without a crash
> upstream kernel: list corruption happened while building kernel within
> 10 mins (I'm running some apps chrome/firefox/thunderbird/... as well)
> upstream kernel with above revert: running for hours without a crash
>
> Trying to bisect but rebuild is slow ....
>

If you have not done so already, I strongly suggest your bisection
starts within that range of patches to isolate which one is at fault.
It'll cut down on the number of builds you need to do. Thanks for
testing.

--
Mel Gorman
SUSE Labs
* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-05  9:19 ` Mel Gorman
@ 2011-08-05 12:09 ` Xiaotian Feng
  2011-08-05 12:30 ` Xiaotian Feng
  0 siblings, 1 reply; 14+ messages in thread
From: Xiaotian Feng @ 2011-08-05 12:09 UTC
To: Mel Gorman
Cc: Andrew Morton, linux-mm, linux-kernel, Pekka Enberg, Christoph Lameter

On Fri, Aug 5, 2011 at 5:19 PM, Mel Gorman <mgorman@suse.de> wrote:
> (Adding patch author to cc)
>
> On Fri, Aug 05, 2011 at 04:42:43PM +0800, Xiaotian Feng wrote:
>> On Thu, Aug 4, 2011 at 11:54 AM, Xiaotian Feng <xtfeng@gmail.com> wrote:
>> > On Wed, Aug 3, 2011 at 4:54 PM, Mel Gorman <mgorman@suse.de> wrote:
>> >> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
>> >>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>> >>> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@gmail.com> wrote:
>> >>> >
>> >>> >> I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>> >>> >> was trying to build my kernel. The photo of crash screen and my config
>> >>> >> is attached.
>> >>> >
>> >>> > hm, now why has that started happening?
>> >>> >
>> >>> > Perhaps you could apply this debug patch, see if we can narrow it down?
>> >>> >
>> >>>
>> >>> I will try it then, but it isn't very reproducible :(
>> >>> But my system hung after some list corruption warnings... I hit the
>> >>> corruption 4 times...
>> >>>
>> >>
>> >> That is very unexpected but if lists are being corrupted, it could
>> >> explain the previously reported bug as that bug looked like an active
>> >> page on an inactive list.
>> >>
>> >> What was the last working kernel? Can you bisect?
>> >>
>> >>> [ 1220.468089] ------------[ cut here ]------------
>> >>> [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
>> >>> [ 1220.468102] Hardware name: 42424XC
>> >>> [ 1220.468104] list_del corruption. next->prev should be
>> >>> ffffea0000e069a0, but was ffff880100216c78
>> >>> [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
>> >>> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
>> >>> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
>> >>> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
>> >>> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
>> >>> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
>> >>> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
>> >>> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
>> >>> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
>> >>> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
>> >>> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
>> >>> sdhci_pci sdhci crc_itu_t
>> >>> [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G W 3.0.0+ #23
>> >>> [ 1220.468188] Call Trace:
>> >>> [ 1220.468190] <IRQ> [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
>> >>> [ 1220.468201] [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
>> >>> [ 1220.468206] [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
>> >>> [ 1220.468210] [<ffffffff81332ab1>] list_del+0x11/0x40
>> >>> [ 1220.468216] [<ffffffff8117a212>] __slab_free+0x362/0x3d0
>> >>> [ 1220.468222] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>> >>> [ 1220.468226] [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
>> >>> [ 1220.468230] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>> >>> [ 1220.468234] [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>> >>> [ 1220.468239] [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
>> >>> [ 1220.468243] [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
>> >>> [ 1220.468247] [<ffffffff811c6654>] bio_free+0x34/0x70
>> >>> [ 1220.468250] [<ffffffff811c66a5>] bio_fs_de
>> >>>
>> >>
>> >
>> > I'm hitting this again today, when I'm trying to rebuild my kernel....
>> > Looking it a bit
>> >
>> > list_del corruption. next->prev should be ffffea0000e069a0, but was
>> > ffff880100216c78
>> >
>> > I find something interesting from my syslog:
>> >
>> > PERCPU: Embedded 28 pages/cpu @ffff880100200000 s83456 r8192 d23040 u262144
>> >
>> >> This warning and the page reclaim warning are on paths that are
>> >> commonly used and I would expect to see multiple reports. I wonder
>> >> what is happening on your machine that is so unusual.
>> >>
>> >> Have you run memtest on this machine for a few hours and badblocks
>> >> on the disk to ensure this is not hardware trouble?
>> >>
>> >>> So is it possible that my previous BUG is triggered by slab list corruption?
>> >>
>> >> Not directly, but clearly there is something very wrong.
>> >>
>> >> If slub corruption reports are very common and kernel 3.0 is fine, my
>> >> strongest candidate for the corruption would be the SLUB lockless
>> >> patches. Try
>> >>
>> >> git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R
>> >>
>> >
>>
>> Here's a update for the results:
>>
>> 3.0.0-rc7: running for hours without a crash
>> upstream kernel: list corruption happened while building kernel within
>> 10 mins (I'm running some app chrome/firefox/thunderbird/... as well)
>> upstream kernel with above revert: running for hours without a crash
>>
>> Trying to bisect but rebuild is slow ....
>>
>
> If you have not done so already, I strongly suggest your bisection
> starts within that range of patches to isolate which one is at fault.
> It'll cut down on the number of builds you need to do. Thanks for
> testing.
>

This is interesting, I just changed the following:

diff --git a/mm/slub.c b/mm/slub.c
index eb5a8f9..616b78e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2104,8 +2104,9 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 						"__slab_alloc"));

 	if (unlikely(!object)) {
-		c->page = NULL;
+		//c->page = NULL;
 		stat(s, DEACTIVATE_BYPASS);
+		deactivate_slab(s, c);
 		goto new_slab;
 	}

Then my system doesn't print any list corruption warnings and my build
then succeeded. So this means a revert of 03e404af2 could cure this.
I'll do more tests next week to see if the list corruption still exists, thanks.

> --
> Mel Gorman
> SUSE Labs
>
* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-05 12:09 ` Xiaotian Feng
@ 2011-08-05 12:30 ` Xiaotian Feng
  2011-08-05 12:55 ` Mel Gorman
  0 siblings, 1 reply; 14+ messages in thread
From: Xiaotian Feng @ 2011-08-05 12:30 UTC
To: Mel Gorman
Cc: Andrew Morton, linux-mm, linux-kernel, Pekka Enberg, Christoph Lameter

On Fri, Aug 5, 2011 at 8:09 PM, Xiaotian Feng <xtfeng@gmail.com> wrote:
> On Fri, Aug 5, 2011 at 5:19 PM, Mel Gorman <mgorman@suse.de> wrote:
>> (Adding patch author to cc)
>>
>> [...]
>>
>> If you have not done so already, I strongly suggest your bisection
>> starts within that range of patches to isolate which one is at fault.
>> It'll cut down on the number of builds you need to do. Thanks for
>> testing.
>>
>
> This is interesting, I just changed the following:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index eb5a8f9..616b78e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2104,8 +2104,9 @@ static void *__slab_alloc(struct kmem_cache *s,
> gfp_t gfpflags, int node,
>                                                "__slab_alloc"));
>
>        if (unlikely(!object)) {
> -               c->page = NULL;
> +               //c->page = NULL;
>                stat(s, DEACTIVATE_BYPASS);
> +               deactivate_slab(s, c);
>                goto new_slab;
>        }
>
> Then my system doesn't print any list corruption warnings and my build
> then succeeded. So this means a revert of 03e404af2 could cure this.
> I'll do more tests next week to see if the list corruption still exists, thanks.
>

Sorry, please ignore it... My system got corrupted before I left ....

>
>
>> --
>> Mel Gorman
>> SUSE Labs
>>
>
* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-05 12:30 ` Xiaotian Feng
@ 2011-08-05 12:55 ` Mel Gorman
  2011-08-05 15:51 ` Christoph Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2011-08-05 12:55 UTC
To: Xiaotian Feng
Cc: Andrew Morton, linux-mm, linux-kernel, Pekka Enberg, Christoph Lameter

On Fri, Aug 05, 2011 at 08:30:44PM +0800, Xiaotian Feng wrote:
> On Fri, Aug 5, 2011 at 8:09 PM, Xiaotian Feng <xtfeng@gmail.com> wrote:
> > On Fri, Aug 5, 2011 at 5:19 PM, Mel Gorman <mgorman@suse.de> wrote:
> >> (Adding patch author to cc)
> >>
> >> [...]
> >>
> >>> Trying to bisect but rebuild is slow ....
> >>>
> >>
> >> If you have not done so already, I strongly suggest your bisection
> >> starts within that range of patches to isolate which one is at fault.
> >> It'll cut down on the number of builds you need to do. Thanks for
> >> testing.
> >>
> >
> > This is interesting, I just changed the following:
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index eb5a8f9..616b78e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2104,8 +2104,9 @@ static void *__slab_alloc(struct kmem_cache *s,
> > gfp_t gfpflags, int node,
> >                                                "__slab_alloc"));
> >
> >        if (unlikely(!object)) {
> > -               c->page = NULL;
> > +               //c->page = NULL;
> >                stat(s, DEACTIVATE_BYPASS);
> > +               deactivate_slab(s, c);
> >                goto new_slab;
> >        }
> >
> > Then my system doesn't print any list corruption warnings and my build
> > then succeeded. So this means a revert of 03e404af2 could cure this.
> > I'll do more tests next week to see if the list corruption still exists, thanks.
> >
>
> Sorry, please ignore it... My system got corrupted before I left ....
>

Please continue the bisection in that case and establish for sure if the
problem is in that series or not. Thanks.

--
Mel Gorman
SUSE Labs
* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-05 12:55 ` Mel Gorman
@ 2011-08-05 15:51 ` Christoph Lameter
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2011-08-05 15:51 UTC
To: Mel Gorman
Cc: Xiaotian Feng, Andrew Morton, linux-mm, linux-kernel, Pekka Enberg

On Fri, 5 Aug 2011, Mel Gorman wrote:

> > > This is interesting, I just changed the following:
> > >
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index eb5a8f9..616b78e 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -2104,8 +2104,9 @@ static void *__slab_alloc(struct kmem_cache *s,
> > > gfp_t gfpflags, int node,
> > >                                                "__slab_alloc"));
> > >
> > >        if (unlikely(!object)) {
> > > -               c->page = NULL;
> > > +               //c->page = NULL;
> > >                stat(s, DEACTIVATE_BYPASS);
> > > +               deactivate_slab(s, c);
> > >                goto new_slab;
> > >        }
> > >
> > > Then my system doesn't print any list corruption warnings and my build
> > > then succeeded. So this means a revert of 03e404af2 could cure this.
> > > I'll do more tests next week to see if the list corruption still exists, thanks.
> > >
> >
> > Sorry, please ignore it... My system got corrupted before I left ....
> >
>
> Please continue the bisection in that case and establish for sure if the
> problem is in that series or not. Thanks.

The above fix should not affect anything since a per cpu slab is not on
any partial lists. And since there are no objects remaining in the slab
there is then also no point in putting it back. It won't be on any lists
before and after the action so no list processing is needed.

Hmmm.... There may be a race with slab_free from a remote processor. I
don't see any problem here since we convert the page from frozen to
nonfrozen in __slab_alloc and __slab_free will ignore the partial list
management if it sees it to be frozen.

Maybe we need some memory barriers here. Right now we are relying on the
cmpxchg_double for sync of the state in the page struct but we also need
the c->page variable to be consistent with that state. But we disable
interrupts in __slab_alloc so there are no races possible with slab_free,
only with remote __slab_free invocations which will not touch c->page.
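The frozen/nonfrozen rule described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: it packs an in-use counter and a frozen bit into one atomic word and uses a C11 compare-and-exchange where the kernel relies on cmpxchg_double, and every name in it (`fake_slab`, `fake_slab_free`, `fake_slab_unfreeze`, `FROZEN_BIT`) is hypothetical rather than taken from mm/slub.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Top bit marks the slab as frozen, i.e. owned by one CPU (assumption). */
#define FROZEN_BIT 0x80000000u

struct fake_slab {
	/* Low 31 bits: objects in use. Top bit: frozen flag. */
	_Atomic uint32_t state;
};

/* Number of objects currently handed out from the slab. */
static uint32_t fake_slab_inuse(struct fake_slab *slab)
{
	return atomic_load(&slab->state) & ~FROZEN_BIT;
}

/*
 * Return one object to the slab. The counter update and the frozen check
 * come from the same atomically observed word, so a remote freer cannot
 * see a half-updated state. The caller must ensure inuse > 0.
 *
 * Returns true when the caller would have to do partial-list management,
 * which in this model is whenever the slab was NOT frozen.
 */
static bool fake_slab_free(struct fake_slab *slab)
{
	uint32_t old = atomic_load(&slab->state);
	uint32_t new;

	do {
		new = old - 1; /* only the low counter bits change */
	} while (!atomic_compare_exchange_weak(&slab->state, &old, new));

	return !(old & FROZEN_BIT);
}

/* The owning CPU gives the slab up: clear the frozen flag. */
static void fake_slab_unfreeze(struct fake_slab *slab)
{
	atomic_fetch_and(&slab->state, ~FROZEN_BIT);
}
```

The model deliberately leaves out the per-cpu pointer (c->page in the thread). The concern raised above is exactly that this second piece of state must stay consistent with the atomically updated word, which a single compare-and-exchange on the word alone does not guarantee.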
* Re: kernel BUG at mm/vmscan.c:1114
  [not found] ` <CAJn8CcG-pNbg88+HLB=tRr26_R+A0RxZEWsJQg4iGe4eY2noXA@mail.gmail.com>
  2011-08-02  7:22 ` kernel BUG at mm/vmscan.c:1114 Andrew Morton
@ 2011-08-02 14:24 ` Mel Gorman
  2011-08-02 17:15 ` Andrew Morton
  2011-08-03  6:45 ` Xiaotian Feng
  1 sibling, 2 replies; 14+ messages in thread
From: Mel Gorman @ 2011-08-02 14:24 UTC
To: Xiaotian Feng; +Cc: linux-mm, linux-kernel, Andrew Morton

On Tue, Aug 02, 2011 at 03:09:57PM +0800, Xiaotian Feng wrote:
> Hi,
>    I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
> was trying to build my kernel. The photo of crash screen and my config
> is attached. Thanks.
> Regards
> Xiaotian

I am obviously blind because in 3.0, I cannot see what BUG is at
mm/vmscan.c:1114 :(. I see

1109:                   /*
1110:                    * If we don't have enough swap space, reclaiming of
1111:                    * anon page which don't already have a swap slot is
1112:                    * pointless.
1113:                    */
1114:                   if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
1115:                       !PageSwapCache(cursor_page))
1116:                           break;
1117:
1118:                   if (__isolate_lru_page(cursor_page, mode, file) == 0) {
1119:                           list_move(&cursor_page->lru, dst);
1120:                           mem_cgroup_del_lru(cursor_page);

Is this 3.0 vanilla or are there some other patches applied?

--
Mel Gorman
SUSE Labs
* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-02 14:24 ` Mel Gorman
@ 2011-08-02 17:15 ` Andrew Morton
  2011-08-03  6:45 ` Xiaotian Feng
  1 sibling, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2011-08-02 17:15 UTC
To: Mel Gorman; +Cc: Xiaotian Feng, linux-mm, linux-kernel

On Tue, 2 Aug 2011 15:24:59 +0100 Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Aug 02, 2011 at 03:09:57PM +0800, Xiaotian Feng wrote:
> > Hi,
> >    I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
> > was trying to build my kernel. The photo of crash screen and my config
> > is attached. Thanks.
> > Regards
> > Xiaotian
>
> I am obviously blind because in 3.0, I cannot see what BUG is at
> mm/vmscan.c:1114 :(. I see
>
> 1109:                   /*
> 1110:                    * If we don't have enough swap space, reclaiming of
> 1111:                    * anon page which don't already have a swap slot is
> 1112:                    * pointless.
> 1113:                    */
> 1114:                   if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> 1115:                       !PageSwapCache(cursor_page))
> 1116:                           break;
> 1117:
> 1118:                   if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> 1119:                           list_move(&cursor_page->lru, dst);
> 1120:                           mem_cgroup_del_lru(cursor_page);
>
> Is this 3.0 vanilla or are there some other patches applied?
>

"3.0.0+": Current mainline.

static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
		struct list_head *src, struct list_head *dst,
		unsigned long *scanned, int order, int mode, int file)
{
	unsigned long nr_taken = 0;
	unsigned long nr_lumpy_taken = 0;
	unsigned long nr_lumpy_dirty = 0;
	unsigned long nr_lumpy_failed = 0;
	unsigned long scan;

	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
		struct page *page;
		unsigned long pfn;
		unsigned long end_pfn;
		unsigned long page_pfn;
		int zone_id;

		page = lru_to_page(src);
		prefetchw_prev_lru_page(page, src, flags);

		VM_BUG_ON(!PageLRU(page));

		switch (__isolate_lru_page(page, mode, file)) {
		case 0:
			list_move(&page->lru, dst);
			mem_cgroup_del_lru(page);
			nr_taken += hpage_nr_pages(page);
			break;

		case -EBUSY:
			/* else it is being freed elsewhere */
			list_move(&page->lru, src);
			mem_cgroup_rotate_lru_list(page, page_lru(page));
			continue;

		default:
-->>			BUG();
		}
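To make the path to that BUG() concrete, here is a hedged userspace model of the checks shown in the debug patch at the top of the thread, followed by the three-way switch from the listing above. It is a sketch, not kernel code: `struct fake_page`, the `FAKE_ISOLATE_*` values, the `being_freed` flag (standing in for a failed reference grab) and `classify` are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical mode values for this model. */
#define FAKE_ISOLATE_INACTIVE 0
#define FAKE_ISOLATE_ACTIVE   1
#define FAKE_ISOLATE_BOTH     2

/* Simplified page state; the kernel keeps this in page flags (assumption). */
struct fake_page {
	bool on_lru;
	bool active;
	bool file_backed;
	bool unevictable;
	bool being_freed; /* stands in for a failed reference grab */
};

enum isolate_action { ISOLATE_MOVE, ISOLATE_ROTATE, ISOLATE_BUG };

/* Mirrors the order of the checks in the debug patch: -EINVAL first. */
static int fake_isolate_lru_page(const struct fake_page *page, int mode, int file)
{
	if (!page->on_lru)
		return -EINVAL;
	/* Compare logical-not of both sides so they are plain booleans. */
	if (mode != FAKE_ISOLATE_BOTH && (!page->active != !mode))
		return -EINVAL;
	if (mode != FAKE_ISOLATE_BOTH && page->file_backed != !!file)
		return -EINVAL;
	if (page->unevictable)
		return -EINVAL;
	if (page->being_freed)
		return -EBUSY;
	return 0;
}

/* The switch from isolate_lru_pages(): anything unexpected is fatal. */
static enum isolate_action classify(int ret)
{
	switch (ret) {
	case 0:
		return ISOLATE_MOVE;
	case -EBUSY:
		return ISOLATE_ROTATE;
	default:
		return ISOLATE_BUG;
	}
}
```

In this model an active page found while scanning with the inactive mode yields -EINVAL and therefore lands in the default branch, which matches the "active page on an inactive list" reading suggested elsewhere in the thread.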
* Re: kernel BUG at mm/vmscan.c:1114
  2011-08-02 14:24 ` Mel Gorman
  2011-08-02 17:15 ` Andrew Morton
@ 2011-08-03  6:45 ` Xiaotian Feng
  1 sibling, 0 replies; 14+ messages in thread
From: Xiaotian Feng @ 2011-08-03 6:45 UTC
To: Mel Gorman; +Cc: linux-mm, linux-kernel, Andrew Morton

On Tue, Aug 2, 2011 at 10:24 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Aug 02, 2011 at 03:09:57PM +0800, Xiaotian Feng wrote:
>> Hi,
>>    I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>> was trying to build my kernel. The photo of crash screen and my config
>> is attached. Thanks.
>> Regards
>> Xiaotian
>
> I am obviously blind because in 3.0, I cannot see what BUG is at
> mm/vmscan.c:1114 :(. I see
>
> 1109:                   /*
> 1110:                    * If we don't have enough swap space, reclaiming of
> 1111:                    * anon page which don't already have a swap slot is
> 1112:                    * pointless.
> 1113:                    */
> 1114:                   if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> 1115:                       !PageSwapCache(cursor_page))
> 1116:                           break;
> 1117:
> 1118:                   if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> 1119:                           list_move(&cursor_page->lru, dst);
> 1120:                           mem_cgroup_del_lru(cursor_page);
>
> Is this 3.0 vanilla or are there some other patches applied?

No, I'm using a freshly cloned upstream kernel, without any changes. Thanks.

>
> --
> Mel Gorman
> SUSE Labs
>
Thread overview: 14+ messages
[not found] <CAJn8CcE20-co4xNOD8c+0jMeABrc1mjmGzju3xT34QwHHHFsUA@mail.gmail.com>
[not found] ` <CAJn8CcG-pNbg88+HLB=tRr26_R+A0RxZEWsJQg4iGe4eY2noXA@mail.gmail.com>
2011-08-02 7:22 ` kernel BUG at mm/vmscan.c:1114 Andrew Morton
2011-08-03 6:44 ` Xiaotian Feng
2011-08-03 8:54 ` Mel Gorman
2011-08-03 9:02 ` Li Zefan
2011-08-04 3:54 ` Xiaotian Feng
2011-08-05 8:42 ` Xiaotian Feng
2011-08-05 9:19 ` Mel Gorman
2011-08-05 12:09 ` Xiaotian Feng
2011-08-05 12:30 ` Xiaotian Feng
2011-08-05 12:55 ` Mel Gorman
2011-08-05 15:51 ` Christoph Lameter
2011-08-02 14:24 ` Mel Gorman
2011-08-02 17:15 ` Andrew Morton
2011-08-03 6:45 ` Xiaotian Feng