Subject: Re: Oops in VMA code
From: Alexander Graf
Date: Thu, 16 Jun 2011 08:20:43 +0200
Message-Id: <96D27CEC-8492-49F2-913F-F587DEC5E95E@suse.de>
References: <47FAB15C-B113-40FD-9CE0-49566AACC0DF@suse.de>
To: Linus Torvalds
Cc: Benjamin Herrenschmidt, linux-mm@kvack.org, "linux-kernel@vger.kernel.org List"

On 16.06.2011, at 07:59, Linus Torvalds wrote:

> On Wed, Jun 15, 2011 at 10:32 PM, Alexander Graf wrote:
>>
>> 0xc000000000190580 :  ld      r9,16(r9)
>> 0xc000000000190584 :  mr      r26,r11
>> 0xc000000000190588 :  cmpdi   cr7,r9,0
>> 0xc00000000019058c :  mr      r11,r26
>> 0xc000000000190590 :  beq     cr7,0xc0000000001905c4
>> 0xc000000000190594 :  addi    r26,r9,-56
>> 0xc000000000190598 :  ld      r0,16(r26)
>> 0xc00000000019059c :  cmpld   cr7,r31,r0
>> 0xc0000000001905a0 :  blt     cr7,0xc000000000190580
>
> That's the inner loop in find_vma_prev(), and yes, it was inlined into
> do_munmap.
>
> And the fault happens in that "ld r0,16(r26)", and it looks like you
> have memory corruption.
>
> r26 has the value 0xc00090026236bbb0, and that "90" byte in the middle
> there looks bogus. It's not a valid pointer any more, but if that "9"
> had been a zero, it would have been.

Please see my reply to Ben here.

> So it looks like the rbtree has become corrupt, and it _looks_ like
> it's just a couple of bits that are set in what otherwise looks like a
> reasonable pointer.
> It *could* be a two-bit error that wasn't
> corrected (I assume you have ECC or parity on your RAM or caches), so
> it's theoretically possible that it's hardware, but generally memory
> corruption is due to software bugs, so that's a pretty far-fetched
> thing.

I'm not running on ECC memory IIRC, but this really doesn't look like a
memory bit flip. Maybe a flip somewhere else resulted in that code
overwriting memory here, but I tend to not want to blame hardware for
failures. Usually these bugs are software-made :)

> At a guess, there's not a lot more to be had from the oops. The
> corruption probably came from some totally unrelated code. Without
> more of a pattern, it's pretty much impossible to even guess.
>
> It may be that somebody can see something I'm missing, but unless you
> can find an ECC error report in your logs and say "oh, that's it", I
> suspect that you're better off ignoring it, and hoping that it will
> happen again (and again) so that we'd get enough of a pattern to start
> making any educated guesses about what's going on.
>
> That's why I often google oops reports - one report may not give much
> of a pattern, but if google finds lots of them that all look roughly
> similar, you end up possibly seeing what the common issue is.

Yup, so let's keep this documented for now. Actually, the more I think
about it, the more it looks like simple random memory corruption by
someone else in the kernel - and that's basically impossible to track
and will give completely different bugs next time around :(.

Either way, thanks a lot for looking at it!

Alex
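
The remark about the faulting pointer can be checked with a few lines of
arithmetic. The sketch below XORs the value reported in r26 with the same
value after the "9" nibble is replaced by zero, then counts the bits that
differ. The "expected" value is only the hypothetical "had been a zero"
pointer from the message, not something taken from the oops, and the
helper names are made up for illustration.

```c
#include <stdint.h>

/* Count set bits in a 64-bit word (portable, no compiler builtins). */
static int sketch_popcount64(uint64_t x)
{
    int n = 0;

    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Mask of the bits that differ between the observed and expected values. */
uint64_t sketch_diff_bits(uint64_t observed, uint64_t expected)
{
    return observed ^ expected;
}

/* Number of bits that differ between the observed and expected values. */
int sketch_flipped_count(uint64_t observed, uint64_t expected)
{
    return sketch_popcount64(sketch_diff_bits(observed, expected));
}
```

With observed = 0xc00090026236bbb0 and expected = 0xc00000026236bbb0, the
difference mask is 0x0000900000000000, i.e. two bits (47 and 44). That
matches the "couple of bits" observation, but says nothing about whether
hardware or software set them.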
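
For readers mapping the quoted disassembly back to C, here is a minimal
user-space sketch of the kind of rbtree descent identified above as the
inner loop of find_vma_prev(). The struct layouts are simplified stand-ins
(everything prefixed sketch_ is hypothetical), arranged so that on an LP64
target the field offsets line up with the constants in the disassembly:
16 for the left-child pointer, 16 for vm_end, and 56 for the embedded tree
node. It leaves out the kernel's early exit through vm_next and is an
illustration under those assumptions, not the kernel source.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's rb_node (assumed layout). */
struct sketch_rb_node {
    unsigned long parent_color;       /* offset 0 */
    struct sketch_rb_node *right;     /* offset 8 on LP64 */
    struct sketch_rb_node *left;      /* offset 16: "ld r9,16(r9)" */
};

/* Simplified stand-in for vm_area_struct (assumed layout). */
struct sketch_vma {
    void *vm_mm;                      /* offset 0 */
    unsigned long vm_start;           /* offset 8 */
    unsigned long vm_end;             /* offset 16: "ld r0,16(r26)" */
    struct sketch_vma *vm_next;       /* offset 24 */
    struct sketch_vma *vm_prev;       /* offset 32 */
    unsigned long vm_page_prot;       /* offset 40 */
    unsigned long vm_flags;           /* offset 48 */
    struct sketch_rb_node vm_rb;      /* offset 56: "addi r26,r9,-56" */
};

/* Recover the containing vma from a pointer to its embedded tree node. */
static struct sketch_vma *sketch_vma_of(struct sketch_rb_node *node)
{
    return (struct sketch_vma *)((char *)node -
                                 offsetof(struct sketch_vma, vm_rb));
}

/*
 * Walk down the tree and return the last vma whose vm_end is <= addr,
 * or NULL if there is none. If a child pointer has been corrupted, the
 * vm_end load on the next iteration is the first access through it.
 */
struct sketch_vma *sketch_find_prev(struct sketch_rb_node *root,
                                    unsigned long addr)
{
    struct sketch_vma *prev = NULL;
    struct sketch_rb_node *node = root;

    while (node) {
        struct sketch_vma *vma = sketch_vma_of(node);

        if (addr < vma->vm_end) {
            node = node->left;        /* the branch back to the loop top */
        } else {
            prev = vma;               /* best predecessor seen so far */
            node = node->right;
        }
    }
    return prev;
}
```

Read this way, a bad value loaded by "ld r9,16(r9)" would survive the NULL
check, be adjusted by -56, and then fault on the following
"ld r0,16(r26)", which is the instruction named in the reply.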