linux-mm.kvack.org archive mirror
* ECC error correction - page isolation
@ 2006-06-01 18:06 Brian Lindahl
  2006-06-01 23:46 ` Andi Kleen
  0 siblings, 1 reply; 6+ messages in thread
From: Brian Lindahl @ 2006-06-01 18:06 UTC
  To: linux-mm

We have a board that gives us access to ECC error counts and ECC error status (4 bits, each corresponding to a different error). A background process performs a scrub (read, rewrite) on individual raw memory pages to activate the ECC. When the error count changes (an error is detected), I'd like to be able to isolate the page, if unused. The pages are scrubbed as raw physical addresses (page numbers) via an ioctl command on /dev/mem.

Is there a facility that will allow me to map this physical address range to a page entity in the kernel so that I can isolate it and mark it as unusable, or reboot if it's active? Is there a better way to do this (i.e. avoiding the mapping phase and interacting directly with physical page entities in the kernel)? Where should I begin my journey into mm in the kernel? What structures, functions and globals should I be looking at?

Going this deep in the kernel is pretty foreign to me, so any help would be appreciated. Thanks in advance!

Brian Lindahl 
Embedded Software Engineer 
858-375-2077 
brian.lindahl@spacedev.com 
SpaceDev, Inc. 
"We Make Space Happen"
 
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* RE: ECC error correction - page isolation
@ 2006-06-05 23:36 Brian Lindahl
  0 siblings, 0 replies; 6+ messages in thread
From: Brian Lindahl @ 2006-06-05 23:36 UTC
  To: linux-mm

> If it's kernel space there are several cases:
> - Free page (count == 0). Easy: ignore it.
> - Reserved - e.g. page itself or kernel code - panic
> - Slab (slab bit set) - panic
> - Page table (cannot be detected right now, but you could
> change your architecture to set special bits) - handle like
> process error
> - buffer cache: toss or IO/error if it was dirty
> - Probably more cases
> Most can be figured out by looking at the various bits in struct page

Right, this sort of activity will be the main guts of error recovery.
Nothing too fancy I'm guessing, just requires a bit of digging. If we
can do something moderately intelligent (toss it), do that, otherwise
panic.
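
The "do something smart, otherwise panic" policy could be sketched as a small decision table. This is a userspace illustration only, not kernel code: every `sketch_` name is hypothetical, and in the kernel the page kind would have to be derived from the count and flag bits in struct page, as the quoted list suggests.

```c
/* Hypothetical classification of a page that reported an uncorrected
 * error. In the kernel this would come from struct page (count, flags). */
enum sketch_page_kind {
    SKETCH_FREE,        /* count == 0 */
    SKETCH_RESERVED,    /* struct page itself, kernel code */
    SKETCH_SLAB,        /* slab bit set */
    SKETCH_PAGE_TABLE,  /* not detectable without extra bits */
    SKETCH_CACHE_CLEAN, /* buffer cache, not dirty */
    SKETCH_CACHE_DIRTY, /* buffer cache, dirty */
    SKETCH_UNKNOWN      /* "probably more cases" */
};

enum sketch_action {
    SKETCH_IGNORE,
    SKETCH_PANIC,
    SKETCH_PROCESS_ERROR,
    SKETCH_TOSS,
    SKETCH_IO_ERROR
};

/* Map each page kind to the action named in the quoted list.
 * Anything unrecognised falls through to panic. */
static enum sketch_action sketch_decide(enum sketch_page_kind kind)
{
    switch (kind) {
    case SKETCH_FREE:        return SKETCH_IGNORE;
    case SKETCH_RESERVED:    return SKETCH_PANIC;
    case SKETCH_SLAB:        return SKETCH_PANIC;
    case SKETCH_PAGE_TABLE:  return SKETCH_PROCESS_ERROR;
    case SKETCH_CACHE_CLEAN: return SKETCH_TOSS;
    case SKETCH_CACHE_DIRTY: return SKETCH_IO_ERROR;
    default:                 return SKETCH_PANIC;
    }
}
```

Keeping the policy in one function makes it easy to swap the free-page case from "ignore" to "isolate" later without touching the detection code.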

> I think he means uncorrected errors. Correctable errors can be fixed 
> up by a scrubber without anything else noticing.

This is correct in our environment.

> Ok if your system doesn't support getting rid of them without 
> an atomic operation you might need to "stop the world" on MP, 
> but that's relatively easy using stop_machine().

It's a UP, but I have no qualms about extending it to MP as we go. I
assume "start_machine()" brings us back up again?

> Interesting background, Brian might find it useful. He did say 
> he wanted to isolate the pages if they're unused, so perhaps non-transient
> errors can be detected. Or the system just wants to be overly paranoid?

It's more of a "nice to have" feature in case our customers are overly
paranoid :) The main idea here is to retest pages that have been
isolated when memory gets tight (if it ever does). After several retests
with no errors, we'll release the pages back to the kernel. This is
mostly to avoid tossing the same page(s) over and over in case they're
susceptible for some reason.
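
A minimal sketch of that retest bookkeeping, in plain userspace C. The names are hypothetical, and two details are assumptions rather than anything decided above: "several" is taken to be 3, and the clean retests are required to be consecutive, so any new error resets the count.

```c
#include <stdbool.h>

/* Assumption: "several" clean retests means 3. */
#define SKETCH_CLEAN_RETESTS_NEEDED 3u

/* Hypothetical record for one isolated page. */
struct sketch_quarantined_page {
    unsigned long pfn;          /* page frame number of the isolated page */
    unsigned int clean_retests; /* consecutive retests without an error */
};

/* Record the outcome of one retest.
 * Returns true once the page has passed enough consecutive clean
 * retests to be released; an error resets the count to zero. */
static bool sketch_record_retest(struct sketch_quarantined_page *q,
                                 bool error_seen)
{
    if (error_seen) {
        q->clean_retests = 0;
        return false;
    }
    if (q->clean_retests < SKETCH_CLEAN_RETESTS_NEEDED)
        q->clean_retests++;
    return q->clean_retests >= SKETCH_CLEAN_RETESTS_NEEDED;
}
```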

For a sanity check, so far, I have something like this:

u32 pfn; /* = some page number */
struct page *page = pfn_to_page(pfn);

To get an address for the read/rewrite cycle:

atomic_long_t * p = (atomic_long_t *) page_address(page);

To do the read/rewrite cycle, for each atomic_long_t, p, in the page:

atomic_long_add(0, p);

That should trigger the ECC without muddling with the data in an MP-safe
fashion (this should be a fun test, we get to make some RAM physically
fail). So check the ECC error count, and if it changed, do something
smart with 'page'.
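
For anyone who wants to try the loop outside the kernel, here is a rough userspace analogue using C11 atomics. It is a sketch under assumptions: a 4 KiB page, `atomic_fetch_add` standing in for the kernel's `atomic_long_add`, and a hypothetical `ecc_count` pointer standing in for however the board exposes its error counter. One caveat: a C compiler is allowed to turn an atomic add of zero into a fenced load, so the generated code would need checking to confirm it really writes the word back.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB pages; the real size is architecture-specific. */
#define SKETCH_PAGE_SIZE 4096u

/* Read and rewrite every word of one page with an atomic add of zero,
 * leaving the data unchanged. ecc_count is a hypothetical stand-in for
 * the board's ECC error counter. Returns true if the counter changed
 * while the page was being scrubbed. */
static bool sketch_scrub_page(atomic_long *page,
                              const volatile unsigned long *ecc_count)
{
    unsigned long before = *ecc_count;
    size_t words = SKETCH_PAGE_SIZE / sizeof(atomic_long);

    for (size_t i = 0; i < words; i++)
        atomic_fetch_add(&page[i], 0L); /* read-modify-write, value kept */

    return *ecc_count != before;
}
```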

One thing I'm having trouble with is finding out what page number to
start with and end with to make the scrubbing simple for the user (the
ioctl returns two u32s). Is there a better way to do this (i.e. existing
globals)?

pfn_beg = ULONG_MAX; /* start high, otherwise min() always yields 0 */
pfn_end = 0;
for_each_pgdat(pgdat)
{
  pfn_beg = min(pfn_beg, pgdat->node_start_pfn);
  pfn_end = max(pfn_end, pgdat->node_start_pfn + pgdat->node_spanned_pages);
}

I also validate the page number using 'pfn_valid(pfn)' before retrieving
the struct page from the page number (fails silently to act like
contiguous memory to the user).
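
The range computation and the validity check can be modelled in userspace to see how they interact. This is only a sketch: `struct sketch_node` is a hypothetical stand-in for the two pgdat fields used above, and `sketch_pfn_valid` is a simplified model of a validity check, not the kernel's `pfn_valid()`. The point it illustrates is that a spanned range can contain holes between nodes, so the check is what keeps the scrubber off page numbers with no memory behind them.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for node_start_pfn / node_spanned_pages. */
struct sketch_node {
    unsigned long start_pfn;
    unsigned long spanned_pages;
};

/* Compute the half-open PFN range [*beg, *end) covering all nodes.
 * *beg starts at ULONG_MAX so the first node can lower it. */
static void sketch_pfn_span(const struct sketch_node *nodes, size_t n,
                            unsigned long *beg, unsigned long *end)
{
    *beg = ULONG_MAX;
    *end = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned long node_end = nodes[i].start_pfn + nodes[i].spanned_pages;
        if (nodes[i].start_pfn < *beg)
            *beg = nodes[i].start_pfn;
        if (node_end > *end)
            *end = node_end;
    }
    if (*beg > *end) /* no nodes at all: report an empty range */
        *beg = *end;
}

/* Simplified validity model: a PFN is valid if some node spans it. */
static bool sketch_pfn_valid(const struct sketch_node *nodes, size_t n,
                             unsigned long pfn)
{
    for (size_t i = 0; i < n; i++) {
        if (pfn >= nodes[i].start_pfn &&
            pfn - nodes[i].start_pfn < nodes[i].spanned_pages)
            return true;
    }
    return false;
}
```

With two nodes covering PFNs 16..31 and 64..95, the span is 16..96 and PFN 40 falls in the hole, which is the case the silent failure is meant to cover.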

Does this hit every physical page? Or am I missing pages that may have
been allocated by the bootmem allocator?


Thread overview: 6+ messages
2006-06-01 18:06 ECC error correction - page isolation Brian Lindahl
2006-06-01 23:46 ` Andi Kleen
2006-06-02  1:30   ` Nick Piggin
2006-06-02  3:10     ` Andi Kleen
2006-06-02  3:15       ` Nick Piggin
2006-06-05 23:36 Brian Lindahl
