From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19])
    by kanga.kvack.org (Postfix) with ESMTP id 73B7D6B003D
    for ; Tue, 10 Mar 2009 18:05:27 -0400 (EDT)
Date: Tue, 10 Mar 2009 22:05:06 +0000 (GMT)
From: Hugh Dickins
Subject: Re: PROBLEM: kernel BUG at mm/slab.c:3002!
In-Reply-To: <49B6B72B.7070408@hp.com>
Message-ID:
References: <49B68450.9000505@hp.com> <1236705532.3205.14.camel@calx>
    <49B6A374.6040805@hp.com> <1236707030.3205.21.camel@calx>
    <49B6B72B.7070408@hp.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: "Alan D. Brunelle"
Cc: Matt Mackall, "linux-kernel@vger.kernel.org", cl@linux-foundation.org,
    penberg@cs.helsinki.fi, linux-mm@kvack.org
List-ID:

On Tue, 10 Mar 2009, Alan D. Brunelle wrote:
> Matt Mackall wrote:
> > On Tue, 2009-03-10 at 13:29 -0400, Alan D. Brunelle wrote:
> >> Matt Mackall wrote:
> >>> On Tue, 2009-03-10 at 11:16 -0400, Alan D. Brunelle wrote:
> >>>> Running blktrace & I/O loads cause a kernel BUG at mm/slab.c:3002!.
> >>> Pid: 11346, comm: blktrace Tainted: G B 2.6.29-rc7 #3 ProLiant
> >>> DL585 G5
> >>>
> >>> That 'B' there indicates you've hit 'bad page' before this. That bug
> >>> seems to be strongly correlated with some form of hardware trouble.
> >>> Unfortunately, that makes everything after that point a little suspect.
> >>
> >> /If/ it were a hardware issue, that might explain the subsequent issue
> >> when I switched to SLUB instead...
> >
> > Well it was almost certainly not a bug in SLAB itself (and your SLUB
> > test is obviously quite conclusive there). We'd have lots of reports.
> > It's probably too early to conclude it's hardware though.
> >
> >> How does one look for "bad page reports"?
> >
> > It'll look something like this (pasted from Google):
> >
> >>> kernel: Bad page state at free_hot_cold_page (in process 'beam',
> >>> page c1a95320)
> >>> kernel: flags:0x40020118 mapping:f401adc0 mapped:0 count:0
> >>> private:0x00000000
> >
>
> Interestingly enough, I'm not seeing the kernel detect such things - but
> in going into the hardware server logs, a co-worker found "unrecoverable
> system errors" being detected at about the same times we're seeing the
> panics.

In 2.6.29-rc, the "B" taint should be associated with
mm/page_alloc.c's bad_page()
    KERN_ALERT "BUG: Bad page state in process %s pfn:%05lx\n",
but it could also now come from
mm/memory.c's print_bad_pte()
    KERN_ALERT "BUG: Bad page map in process %s pte:%08llx pmd:%08llx\n",
which replaces the old mm/rmap.c Eeeks, and some other cases too.

Hugh
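
For anyone who wants to search a saved kernel log for the two messages named above, here is a minimal sketch in Python. The regular expressions are derived only from the two format strings quoted in this mail. The function names and the tuple layout are illustrative assumptions, not anything taken from the kernel tree. The taint check assumes the 'B' flag corresponds to bit 5 (value 32) of the kernel's taint mask, as the kernel's tainted-kernels documentation describes it.

```python
import re

# Patterns built from the two printk format strings quoted in the mail.
# \s+ is used between fields so that one or more spaces both match.
BAD_PAGE_STATE = re.compile(
    r"BUG: Bad page state in process (?P<comm>.+?)\s+pfn:(?P<pfn>[0-9a-f]+)"
)
BAD_PAGE_MAP = re.compile(
    r"BUG: Bad page map in process (?P<comm>.+?)\s+"
    r"pte:(?P<pte>[0-9a-f]+)\s+pmd:(?P<pmd>[0-9a-f]+)"
)

# Assumption: the 'B' taint flag is bit 5 of the taint mask.
TAINT_BAD_PAGE_BIT = 5


def find_bad_page_reports(log_text):
    """Return a list of (kind, process_name, fields) for each matching line.

    kind is "state" for bad_page() messages and "map" for
    print_bad_pte() messages; fields holds the hex values as strings.
    """
    reports = []
    for line in log_text.splitlines():
        for kind, pattern in (("state", BAD_PAGE_STATE), ("map", BAD_PAGE_MAP)):
            match = pattern.search(line)
            if match:
                fields = {
                    key: value
                    for key, value in match.groupdict().items()
                    if key != "comm"
                }
                reports.append((kind, match.group("comm"), fields))
                break
    return reports


def has_bad_page_taint(tainted_value):
    """True if the integer taint mask has the bad-page ('B') bit set."""
    return bool(tainted_value & (1 << TAINT_BAD_PAGE_BIT))
```

On a live system the log text could come from the output of dmesg, and the taint mask from the integer in /proc/sys/kernel/tainted. Older kernels that print the "Bad page state at ..." form quoted earlier in the thread would need an additional pattern.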