From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 0662D6B005C for ; Mon, 15 Jun 2009 11:26:46 -0400 (EDT) Date: Mon, 15 Jun 2009 16:28:04 +0100 From: Alan Cox Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5) Message-ID: <20090615162804.4cb75b30@lxorguk.ukuu.org.uk> In-Reply-To: <20090615152427.GF31969@one.firstfloor.org> References: <20090615024520.786814520@intel.com> <4A35BD7A.9070208@linux.vnet.ibm.com> <20090615042753.GA20788@localhost> <20090615140019.4e405d37@lxorguk.ukuu.org.uk> <20090615132934.GE31969@one.firstfloor.org> <20090615154832.73c89733@lxorguk.ukuu.org.uk> <20090615152427.GF31969@one.firstfloor.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andi Kleen Cc: Hugh Dickins , Wu Fengguang , Balbir Singh , Andrew Morton , LKML , Ingo Molnar , Mel Gorman , Thomas Gleixner , "H. Peter Anvin" , Peter Zijlstra , Nick Piggin , "riel@redhat.com" , "chris.mason@oracle.com" , "linux-mm@kvack.org" List-ID: > oops=panic already implies panic on all machine check exceptions, so they will > be fine then (assuming this is the best strategy for availability > for them, which I personally find quite doubtful, but we can discuss this some > other time) You can have the argument with all the people who deploy large systems. Providing their boxes can be persuaded to panic they don't care about the masses. > That's because unpoisioning is quite hard -- you need some kind > of synchronization point for all the error handling and that's > the poisoned page and if it unposions itself then you need > some very heavy weight synchronization to avoid handling errors > multiple time. I looked at it, but it's quite messy. > > Also it's of somewhat dubious value. On a system running under a hypervisor or with hot swappable memory its of rather higher value. In the hypervisor case the guest system can acquire a new virtual page to replace the faulty one. In fact the hypervisor case is even more complex as the guest may get migrated at which point knowledge of "poisoned" memory is not at all connected to information on hardware failings. > > > > (You can unfail pages on x86 as well it appears by scrubbing them via DMA > > - yes ?) > > Not architectually. Also the other problem is not just unpoisoning them, > but finding out if the page is permenantly bad or just temporarily. Small detail you are overlooking: Hot swap mirrorable memory. Second detail you are overlooking curse a lot suspend to disk remove dirt from fans, clean/replace RAM resume from disk The very act of making the ECC error not take out the box creates the environment whereby the underlying hardware error (if there was one) can be cured. In all these cases the fact you've got to shoot stuff because a page has been lost becomes totally disconnected from the idea that the page is somehow not recoverable and "contaminated" forever. Which to me says your model is wrong. Alan -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org