From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243])
	by kanga.kvack.org (Postfix) with SMTP id 5DBEE6B005A
	for ; Tue, 25 Aug 2009 15:32:50 -0400 (EDT)
Received: from coyote.coyote.den ([141.153.112.139])
	by vms173005.mailsrvcs.net (Sun Java(tm) System Messaging Server 6.3-7.04
	(built Sep 26 2008; 32bit)) with ESMTPA id
	<0KOU00HAUZY90J30@vms173005.mailsrvcs.net> for linux-mm@kvack.org;
	Sun, 23 Aug 2009 21:22:57 -0500 (CDT)
From: Gene Heskett
Subject: Re: Bad page state (was Re: Linux 2.6.31-rc7)
Date: Sun, 23 Aug 2009 22:22:40 -0400
References: <200908230420.46228.gene.heskett@verizon.net>
In-reply-to:
MIME-version: 1.0
Content-type: Text/Plain; charset=iso-8859-1
Content-transfer-encoding: 7bit
Content-disposition: inline
Message-id: <200908232222.40377.gene.heskett@verizon.net>
Sender: owner-linux-mm@kvack.org
To: Linus Torvalds
Cc: Wu Fengguang, Andrew Morton, Hugh Dickins, Mel Gorman, "linux-mm@kvack.org"
List-ID:

On Sunday 23 August 2009, Linus Torvalds wrote:
>Gene - good news and bad news.
>
>The good news is that this is almost certainly not a kernel bug.
>
>The bad news is that your machine is almost certainly buggy and you'll
>need to replace your RAM (although it's possible that just removing it
>and re-seating it could fix things). See for details below.
>
Spot on! When I had a chance today I ran memtest, and it found a stuck bit
00004000 at 2 separate addresses fairly early in the testing. I let it run
for another 6 hours while I painted some shutters, and it ran without
incrementing the counts of those 2 addresses.

So after dinner I stripped out the cards, as all slots are full, and laid
the mobo out the right side of the beast, then pulled all 4 1GB modules,
putting the front one in the back slot, sorta rotating the tires. Then I
let memtest run for about 45 minutes with no further errors.

That faint knocking sound?
Me, rapping on my head, cuz I should have grokked that. I'll catch up on my
email & let memtest run for a few hours again tomorrow.

More...

>On Sun, 23 Aug 2009, Gene Heskett wrote:
>> I changed the vmlinuz compression to gzip and rebooted to it last night,
>> and got this shortly after the bootup to -rc7 with the kernel cli
>> argument that makes sensors work on an asus board again:
>>
>> Aug 22 22:29:07 coyote kernel: [ 2449.053652] BUG: Bad page state in process python pfn:a0e93
>> Aug 22 22:29:07 coyote kernel: [ 2449.053658] page:c28fc260 flags:80004000 count:0 mapcount:0 mapping:(null) index:0
>> Aug 22 22:29:07 coyote kernel: [ 2449.053662] Pid: 4818, comm: python Not tainted 2.6.31-rc7 #3
>> Aug 22 22:29:07 coyote kernel: [ 2449.053664] Call Trace:
>> Aug 22 22:29:07 coyote kernel: [ 2449.053672] [] ? printk+0x23/0x40
>> Aug 22 22:29:07 coyote kernel: [ 2449.053678] [] bad_page+0xcf/0x150
>> Aug 22 22:29:07 coyote kernel: [ 2449.053682] [] get_page_from_freelist+0x37d/0x480
>> Aug 22 22:29:07 coyote kernel: [ 2449.053686] [] __alloc_pages_nodemask+0xdf/0x520
>> Aug 22 22:29:07 coyote kernel: [ 2449.053691] [] handle_mm_fault+0x4a9/0x9f0
>> Aug 22 22:29:07 coyote kernel: [ 2449.053695] [] ? tick_dev_program_event+0x43/0xf0
>> Aug 22 22:29:07 coyote kernel: [ 2449.053699] [] ? tick_program_event+0x36/0x60
>> Aug 22 22:29:07 coyote kernel: [ 2449.053703] [] do_page_fault+0x141/0x290
>> Aug 22 22:29:07 coyote kernel: [ 2449.053707] [] ? do_page_fault+0x0/0x290
>> Aug 22 22:29:07 coyote kernel: [ 2449.053710] [] error_code+0x73/0x78
>> Aug 22 22:29:07 coyote kernel: [ 2449.053712] Disabling lock debugging due to kernel taint
>>
>> This doesn't look exactly like the previous one but the result is
>> similar.
>
>Actually, it looks _too_ much like the previous one in one very specific
>regard: that 'page' pointer is identical. And that is where the 'flags'
>came from.
>
>Look here:
>> Aug 21 22:37:47 coyote kernel: [ 1030.152737] BUG: Bad page state in process lzma pfn:a1093
>> Aug 21 22:37:47 coyote kernel: [ 1030.152743] page:c28fc260 flags:80004000 count:0 mapcount:0 mapping:(null) index:0
>>
>> Aug 22 22:29:07 coyote kernel: [ 2449.053652] BUG: Bad page state in process python pfn:a0e93
>> Aug 22 22:29:07 coyote kernel: [ 2449.053658] page:c28fc260 flags:80004000 count:0 mapcount:0 mapping:(null) index:0
>
>and notice how "page:c28fc260" is the same, even though 'pfn' is not.

Yes, and that is, I believe, the same address that memtest triggered on for
the 1st and 3rd errors; there was another, at about 1/2 meg higher address,
in between. I shot a pic of the memtest screen just for the record. It's
PNY memory, and possibly still in warranty; I'll find out tomorrow for sure.

>Gene - I can almost guarantee that you have bad memory. Why?
>
> - 'pfn' is the Linux kernel "page index" - so when the two 'pfn' numbers
>   are different, that means that we're talking about different
>   physical pages, and indexes into the 'struct page[]' array.
>
> - but because the page array was allocated at different addresses
>   (probably because of slightly different configurations and timings
>   during boot), the actual physical memory location that describes those
>   different pages happens to be the same.
>
> - and I can almost guarantee that you have a bit that is stuck to 1 in
>   that RAM location. The 'flags' field is the first one in 'struct page',
>   and so it's the memory location at kernel virtual address c28fc260 that
>   is corrupt - and the way the kernel mappings work on x86, that's
>   physical address 28fc260 (at around the 40MB mark).

40.7, and 41.3 according to memtest. :)

>There is almost certainly no way that this is a kernel bug - that memory
>location is smack dab in the middle of that 'struct page[]' array, and
>there is absolutely no reason why two different kernels with clearly
>different allocations would set the same incorrect bug.
>I mean - it _could_ happen, and maybe there's some really subtle idiotic
>thing going on, but it's really unlikely.
>
>The address is just so random, and so non-special - and yet it's exactly
>the same physical address in both cases, even though it actually describes
>different things as far as the kernel is concerned. That's an almost 100%
>sure sign of a hard-error in your memory.
>
>And depending on kernel config options, that bad RAM location will be used
>for different things. In your two cases, it's been used for the 'struct
>page[]' array both times, but in other cases it could have been used for
>something else - and maybe resulted in random crashes or other odd things,
>rather than happen to get noticed by a debug test.
>
>The good news about hard memory errors is that if you boot into a memory
>tester like memtest86, it's going to find it. So we're not going to have
>to guess about whether I'm right or not - I would suggest you go download
>memtest86+ from www.memtest.org and run it. I'd just get the bootable ISO
>image of memtest86+ v2.11 and burn it to a CD, and boot it, but there are
>other ways to run that thing.
>
>It's even possible that depending on which distro you have, you may
>already have a "memtest" entry in your LILO or grub setup. I think SuSE
>installs memtest as one of the bootable options, for example.

I had an entry for it from a Kubuntu install that has since committed
suicide. I'm about up to my armpits in Fedora, so I have a Mandriva 2009.1
install on another drive that may well become the main OS. Not near as much
trouble with codecs with an offshore version of this. All I need to do is
get all my scripts and a 7GB email corpus moved. Unforch, by the time I get
that done the next Mandriva will be out. So many things to do, and
relatively little time to do them. Too many other hobbies and honeydo's. :)

Thanks for not spanking me on this one Linus, I blew it, badly.
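
For anyone following along, the address arithmetic quoted above can be
sketched roughly as below. This is only an illustration, not kernel code: it
assumes a 32-bit x86 kernel with the common 3G/1G split (a PAGE_OFFSET of
0xC0000000, which is a configurable value), and the helper names are made up
for the example. The page address and flags come from the two bug reports;
the stuck-bit mask is the one memtest reported.

```python
# Assumption: 32-bit x86, 3G/1G split, so lowmem kernel virtual addresses
# sit at a fixed offset above physical addresses. Real kernels may be
# configured with a different PAGE_OFFSET.
PAGE_OFFSET = 0xC0000000


def lowmem_virt_to_phys(vaddr):
    """Hypothetical helper: translate a lowmem kernel virtual address."""
    if vaddr < PAGE_OFFSET:
        raise ValueError("not a kernel lowmem address")
    return vaddr - PAGE_OFFSET


def set_bits(value):
    """Hypothetical helper: positions of the bits set in value, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


page_vaddr = 0xC28FC260  # 'page:' field, identical in both reports
flags = 0x80004000       # 'flags:' field, identical in both reports
stuck_mask = 0x00004000  # stuck bit pattern reported by memtest

phys = lowmem_virt_to_phys(page_vaddr)
print(f"physical address: {phys:#x} (about {phys / 2**20:.1f} MiB)")
print(f"bits set in flags: {set_bits(flags)}")
print(f"memtest's stuck bit is set in flags: {bool(flags & stuck_mask)}")
```

The subtraction gives the same physical address quoted above, 28fc260, and
the memtest mask lines up with one of the bits set in the bogus 'flags'
value.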
Linus

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
The NRA is offering FREE Associate memberships to anyone who wants them.

I marvel at the strength of human weakness.