From: Linus Torvalds <torvalds@linux-foundation.org>
To: Gene Heskett <gene.heskett@verizon.net>
Cc: Wu Fengguang <fengguang.wu@intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
Hugh Dickins <hugh.dickins@tiscali.co.uk>,
Mel Gorman <mel@csn.ul.ie>,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: Bad page state (was Re: Linux 2.6.31-rc7)
Date: Sun, 23 Aug 2009 09:44:14 -0700 (PDT) [thread overview]
Message-ID: <alpine.LFD.2.01.0908230943490.3158@localhost.localdomain> (raw)
In-Reply-To: <200908230420.46228.gene.heskett@verizon.net>
Gene - good news and bad news.
The good news is that this is almost certainly not a kernel bug.
The bad news is that your machine is almost certainly buggy and you'll
need to replace your RAM (although it's possible that just removing it
and re-seating it could fix things). See for details below.
On Sun, 23 Aug 2009, Gene Heskett wrote:
>
> I changed the vmlinuz compression to gzip and rebooted to it last night, and
> got this shortly after the bootup to -rc7 with the kernal cli argument that
> makes sensors work on an asus board again:
>
> Aug 22 22:29:07 coyote kernel: [ 2449.053652] BUG: Bad page state in process python pfn:a0e93
> Aug 22 22:29:07 coyote kernel: [ 2449.053658] page:c28fc260 flags:80004000 count:0 mapcount:0 mapping:(null) index:0
> Aug 22 22:29:07 coyote kernel: [ 2449.053662] Pid: 4818, comm: python Not tainted 2.6.31-rc7 #3
> Aug 22 22:29:07 coyote kernel: [ 2449.053664] Call Trace:
> Aug 22 22:29:07 coyote kernel: [ 2449.053672] [<c130fb33>] ? printk+0x23/0x40
> Aug 22 22:29:07 coyote kernel: [ 2449.053678] [<c108352f>] bad_page+0xcf/0x150
> Aug 22 22:29:07 coyote kernel: [ 2449.053682] [<c10845cd>] get_page_from_freelist+0x37d/0x480
> Aug 22 22:29:07 coyote kernel: [ 2449.053686] [<c10848af>] __alloc_pages_nodemask+0xdf/0x520
> Aug 22 22:29:07 coyote kernel: [ 2449.053691] [<c1095ff9>] handle_mm_fault+0x4a9/0x9f0
> Aug 22 22:29:07 coyote kernel: [ 2449.053695] [<c105ca83>] ? tick_dev_program_event+0x43/0xf0
> Aug 22 22:29:07 coyote kernel: [ 2449.053699] [<c105cbd6>] ? tick_program_event+0x36/0x60
> Aug 22 22:29:07 coyote kernel: [ 2449.053703] [<c1020d61>] do_page_fault+0x141/0x290
> Aug 22 22:29:07 coyote kernel: [ 2449.053707] [<c1020c20>] ? do_page_fault+0x0/0x290
> Aug 22 22:29:07 coyote kernel: [ 2449.053710] [<c131339b>] error_code+0x73/0x78
> Aug 22 22:29:07 coyote kernel: [ 2449.053712] Disabling lock debugging due to kernel taint
>
> This doesn't look exactly like the previous one but the result is similar.
Actually, it looks _too_ much like the previous one in one very specific
regard: that 'page' pointer is identical. Anf that is where the 'flags'
came from.
Look here:
> Aug 21 22:37:47 coyote kernel: [ 1030.152737] BUG: Bad page state in process lzma pfn:a1093
> Aug 21 22:37:47 coyote kernel: [ 1030.152743] page:c28fc260 flags:80004000 count:0 mapcount:0 mapping:(null) index:0
> Aug 22 22:29:07 coyote kernel: [ 2449.053652] BUG: Bad page state in process python pfn:a0e93
> Aug 22 22:29:07 coyote kernel: [ 2449.053658] page:c28fc260 flags:80004000 count:0 mapcount:0 mapping:(null) index:0
and notice how "page:c28fc260" is the same, even though 'pfn' is not.
Gene - I can almost guarantee that you have bad memory. Why?
- 'pfn' is the Linux kernel "page index" - so when the two 'pfn' numbers
are different, that means that we're talking about different
physical pages, and indexes into the 'struct page[]' array.
- but because the page array was allocated at different addresses
(probably because of slightly different configurations and timings
during boot), the actual physical memory location that describes those
different pages happens to be the same.
- and I can almost guarantee that you have a bit that is stuck to 1 in
that RAM location. The 'flags' field is the first one in 'struct page',
and so it's the memory location at kernel virtual address c28fc260 that
is corrupt - and the way the kernel mappings work on x86, that's
physical address 28fc260 (at around the 40MB mark).
There is almost certainly no way that this is a kernel bug - that memory
location is smack dab in the middle of that 'struct page[]' array, and
there is absolutely no reason why two different kernels with clearly
different allocations would set the same incorrect bug. I mean - it
_could_ happen, and maybe there's some really subtle idiotic thing going
on, but it's really unlikely.
The address is just so random, and so non-special - and yet it's exactly
the same physical address in both cases, even though it actually describes
different things as far as the kernel is concerned. That's an almost 100%
sure sign of a hard-error in your memory.
And depending on kernel config options, that bad RAM location will be used
for different things. In your two cases, it's been used for the 'struct
page[]' array both times, but in other cases it could have been used for
something else - and maybe resulted in random crashes or other odd things,
rather than happen to get noticed by a debug test.
The good news about hard memory errors is that if you boot into a memory
tester like memtest86, it's going to find it. So we're not going to have
to guess about whether I'm right or not - I would suggest you go download
memtest86+ from www.memtest.org and run it. I'd just get the bootable ISO
image of memtest86+ v2.11 and burn it to a CD, and boot it, but there are
other ways to run that thing.
It's even possible that depending on which distro you have, you may
already have a "memtest" entry in your LILO or grub setup. I think SuSE
installs memtest as one of the bootable options, for example.
Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2009-08-25 19:37 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <alpine.LFD.2.01.0908211810390.3158@localhost.localdomain>
[not found] ` <200908212248.40987.gene.heskett@verizon.net>
2009-08-22 4:17 ` Linus Torvalds
2009-08-23 7:22 ` Wu Fengguang
2009-08-23 8:20 ` Gene Heskett
2009-08-23 16:44 ` Linus Torvalds [this message]
2009-08-23 17:04 ` Gene Heskett
2009-08-24 2:22 ` Gene Heskett
2009-08-24 13:55 ` Mel Gorman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LFD.2.01.0908230943490.3158@localhost.localdomain \
--to=torvalds@linux-foundation.org \
--cc=akpm@linux-foundation.org \
--cc=fengguang.wu@intel.com \
--cc=gene.heskett@verizon.net \
--cc=hugh.dickins@tiscali.co.uk \
--cc=linux-mm@kvack.org \
--cc=mel@csn.ul.ie \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox