From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andi Kleen Subject: Re: ECC error correction - page isolation Date: Fri, 2 Jun 2006 05:10:49 +0200 References: <069061BE1B26524C85EC01E0F5CC3CC30163E1F1@rigel.headquarters.spacedev.com> <200606020146.33703.ak@suse.de> <447F94B3.7030807@yahoo.com.au> In-Reply-To: <447F94B3.7030807@yahoo.com.au> MIME-Version: 1.0 Content-Disposition: inline Message-Id: <200606020510.49877.ak@suse.de> Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: Nick Piggin Cc: Brian Lindahl , linux-mm@kvack.org List-ID: > Good summary. I'll just add a couple of things: in recent kernels > we have a page migration facility which should be able to take care > of moving process and pagecache pages for you, without walking rmap > or killing the process (assuming you're talking about correctable > ECC errors). I think he means uncorrected errors. Correctable errors can be fixed up by a scrubber without anything else noticing. Ok if your system doesn't support getting rid of them without an atomic operation you might need to "stop the world" on MP, but that's relatively easy using stop_machine(). > This may not quite have the right in-kernel API for you use yet, but > it shouldn't be difficult to add. > > > > > If it's kernel space there are several cases: > > - Free page (count == 0). Easy: ignore it. > > Also, if you want to isolate the free page, you can allocate it, > and tuck it away in a list somewhere (or just forget about it > completely). Normally it's rare that a bit breaks completely. Usually they just toggle for some reason and are ok again if you rewrite them (how to do the rewrite without triggering an MCE can be tricky BTW). Or the glitch wasn't in the RAM transistors itself, but on some bus, then it might also be ok again on retry. What more often happens is that a DIMM (or rather a chip on a DIMM) breaks completely. In this case you need to remove the whole chip. This can be often done in hardware using "chipkill" (which is kind a special case of hardware RAM RAID). Anyways you usually need to remove a large memory area, much bigger than a page, in this case and it's more like memory hot unplug (which we don't quite support yet, but it's being worked on ...) Of course that's all for normal systems. If you're in a space craft (as I gather from the original poster's domain name) crossing the Van Allen belts or doing a solar storm it might be very different. But even then I would expect bits to more often just switch than break completely. Maybe for a Jupiter probe it's different and chips might really spoil. -Andi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org