From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 83B456B0012 for ; Thu, 16 Jun 2011 02:55:09 -0400 (EDT) Received: from mail-wy0-f169.google.com (mail-wy0-f169.google.com [74.125.82.169]) (authenticated bits=0) by smtp1.linux-foundation.org (8.14.2/8.13.5/Debian-3ubuntu1.1) with ESMTP id p5G6sXo1004486 (version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=FAIL) for ; Wed, 15 Jun 2011 23:54:35 -0700 Received: by wyf19 with SMTP id 19so1053415wyf.14 for ; Wed, 15 Jun 2011 23:54:33 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <96D27CEC-8492-49F2-913F-F587DEC5E95E@suse.de> References: <47FAB15C-B113-40FD-9CE0-49566AACC0DF@suse.de> <96D27CEC-8492-49F2-913F-F587DEC5E95E@suse.de> From: Linus Torvalds Date: Wed, 15 Jun 2011 23:54:13 -0700 Message-ID: Subject: Re: Oops in VMA code Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Alexander Graf Cc: Benjamin Herrenschmidt , linux-mm@kvack.org, "linux-kernel@vger.kernel.org List" On Wed, Jun 15, 2011 at 11:20 PM, Alexander Graf wrote: > > On 16.06.2011, at 07:59, Linus Torvalds wrote: >> >> r26 has the value 0xc00090026236bbb0, and that "90" byte in the middle >> there looks bogus. It's not a valid pointer any more, but if that "9" >> had been a zero, it would have been. > > Please see my reply to Ben here. Your reply to Ben seems to say that 0xc00000026236bbb0 wouldn't have been a valid address, because you don't have that much memory. But that's clearly not true. All the other registers have valid pointers in them, and the stack pointer (r1) is c000000262987cd0, for example. And that stack is clearly valid - if the kernel stack pointer was corrupted, you'd never have gotten as far as reporting the oops. So you may have only 8GB of RAM in that machine, but if so, there's some empty unmapped physical space. Because clearly your RAM is _not_ limited to being mapped to below 0xc000000200000000. To recap: I'm pretty sure the memory corruption is just the "90" byte. The rest of the pointer looks too much like a pointer to be otherwise. Whether that's due to a two-bit error (unlikely) or a wild byte write (or 16-bit write with zeroes) is hard to say. USUALLY when we have wild pointer errors, the corruption is more than just a few bits, but it could have been something that sets a few bits in software, and just sets them using a stale pointer. > Yup, so let's keep this documented for now. Actually, the more I think ab= out it the more it looks like simple random memory corruption by someone el= se in the kernel - and that's basically impossible to track and will give c= ompletely different bugs next time around :(. We've had several bugs found by the pattern of the corruption, so I wouldn't say "impossible to track". Even if the next time ends up being a completely different oops (because the corruption happened in a totally different kind of data structure), it might be possible that there's that same "90" byte pattern, for example. But it needs more than one bug report to see what the pattern is. Usually it takes a _lot_ more.. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org