linux-mm.kvack.org archive mirror
* ECC error correction - page isolation
@ 2006-06-01 18:06 Brian Lindahl
  2006-06-01 23:46 ` Andi Kleen
  0 siblings, 1 reply; 6+ messages in thread
From: Brian Lindahl @ 2006-06-01 18:06 UTC (permalink / raw)
  To: linux-mm

We have a board that gives us access to ECC error counts and ECC error status (4 bits, each corresponding to a different error). A background process performs a scrub (read, rewrite) on individual raw memory pages to activate the ECC. When the error count changes (an error is detected), I'd like to be able to isolate the page, if unused. The pages are scrubbed as raw physical addresses (page numbers) via an ioctl command on /dev/mem. Is there a facility that will allow me to map this physical address range to a page entity in the kernel so that I can isolate it and mark it as unusable, or reboot if it's active? Is there a better way to do this (i.e. avoiding the mapping phase and interacting directly with physical page entities in the kernel)? Where should I begin my journey into mm in the kernel? What structures, functions and globals should I be looking at?

Going this deep in the kernel is pretty foreign to me, so any help would be appreciated. Thanks in advance!

Brian Lindahl 
Embedded Software Engineer 
858-375-2077 
brian.lindahl@spacedev.com 
SpaceDev, Inc. 
"We Make Space Happen"
 
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: ECC error correction - page isolation
  2006-06-01 18:06 ECC error correction - page isolation Brian Lindahl
@ 2006-06-01 23:46 ` Andi Kleen
  2006-06-02  1:30   ` Nick Piggin
  0 siblings, 1 reply; 6+ messages in thread
From: Andi Kleen @ 2006-06-01 23:46 UTC (permalink / raw)
  To: Brian Lindahl; +Cc: linux-mm

On Thursday 01 June 2006 20:06, Brian Lindahl wrote:
> We have a board that gives us access to ECC error counts and ECC error
> status (4 bits, each corresponding to a different error). A background
> process performs a scrub (read, rewrite) on individual raw memory pages to
> activate the ECC. When the error count changes (an error is detected), I'd
> like to be able to isolate the page, if unused. The pages are scrubbed as
> raw physical addresses (page numbers) via an ioctl command on /dev/mem. Is
> there a facility that will allow me to map this physical address range to a
> page entity in the kernel so that I can isolate it and mark it as unusable,
> or reboot if it's active? Is there a better way to do this (i.e. avoiding
> the mapping phase and interact directly with physical page entities in the
> kernel)? Where should I begin my journey into mm in the kernel? What
> structures, functions and globals should I be looking at?
>
> Going this deep in the kernel is pretty foreign to me, so any help would be
> appreciated. Thanks in advance!

I did a prototype for something like this years ago. It is relatively 
complicated. 

If you get machine checks in normal accesses you have to bootstrap
yourself. This means it has to be handed off to a thread to be able
to take locks safely. For a scrubber that can be ignored. Doing
it from arbitrary context requires some tricks.

Then you have to take a look at the struct page associated with
the address. If it's an rmap page (you'll need a 2.6 kernel) you
can walk the rmap chains to find the processes that have
the page mapped. You can look at the PTEs and
the page bits to see if it's dirty or not. For clean pages
the page can be just dropped. Otherwise you have
to kill the process (or send it a signal it could handle).

There is no generic function to do the rmap walk right now, but it's not too
hard.

If it's kernel space there are several cases:
- Free page (count == 0). Easy: ignore it.
- Reserved - e.g. page itself or kernel code - panic
- Slab (slab bit set) - panic
- Page table (cannot be detected right now, but you could
  change your architecture to set special bits) - handle like
  process error
- Buffer cache: toss, or I/O error if it was dirty
- Probably more cases

Most can be figured out by looking at the various bits in struct page
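
The case analysis above can be sketched as a small decision function. This is a userspace model only, not kernel code: `struct page_model`, `enum ecc_action` and `ecc_classify()` are hypothetical names, and the boolean fields stand in for whatever reference count and flag bits struct page provides on a given kernel. The page-table case is left out because, as noted, it cannot be detected, and falling back to panic for an unrecognised page is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the state read from struct page. */
struct page_model {
    int  count;        /* reference count; 0 means the page is free */
    bool reserved;     /* reserved page, e.g. kernel code */
    bool slab;         /* slab bit set */
    bool user_mapped;  /* mapped by at least one process (found via rmap) */
    bool page_cache;   /* buffer cache page */
    bool dirty;        /* contents differ from the backing store */
};

enum ecc_action {
    ECC_IGNORE,        /* free page: nothing to lose */
    ECC_DROP,          /* clean page: toss it, it can be re-read */
    ECC_IO_ERROR,      /* dirty buffer cache page: report an I/O error */
    ECC_KILL_PROCESS,  /* dirty process page: kill or signal the owners */
    ECC_PANIC          /* unrecoverable kernel data */
};

/* Map a page's state to a recovery action, following the list above. */
enum ecc_action ecc_classify(const struct page_model *pg)
{
    assert(pg != NULL);

    if (pg->count == 0)
        return ECC_IGNORE;
    if (pg->reserved || pg->slab)
        return ECC_PANIC;
    if (pg->user_mapped)
        return pg->dirty ? ECC_KILL_PROCESS : ECC_DROP;
    if (pg->page_cache)
        return pg->dirty ? ECC_IO_ERROR : ECC_DROP;

    /* Assumption: anything unrecognised is treated as fatal. */
    return ECC_PANIC;
}
```

The order of the tests matters: the free and panic cases are checked first so that a recoverable-looking flag never masks kernel data.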

-Andi


* Re: ECC error correction - page isolation
  2006-06-01 23:46 ` Andi Kleen
@ 2006-06-02  1:30   ` Nick Piggin
  2006-06-02  3:10     ` Andi Kleen
  0 siblings, 1 reply; 6+ messages in thread
From: Nick Piggin @ 2006-06-02  1:30 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Brian Lindahl, linux-mm

Andi Kleen wrote:

> If you get machine checks in normal accesses you have to bootstrap
> yourself. This means it has to be handed off to a thread to be able
> to take locks safely. For a scrubber that can be ignored. Doing 
> it from arbitrary context requires some tricks.
> 
> Then you have to take a look at the struct page associated with
> the address. If it's a rmap page (you'll need a 2.6 kernel) you
> can walk the rmap chains to find the processes that have 
> the page mapped. You can look at the PTEs and 
> the page bits to see if it's dirty or not. For clean pages
> the page can be just dropped. Otherwise you have
> to kill the process (or send them a signal they could handle) 
> 
> There is no generic function to do the rmap walk right now, but it's not too 
> hard. 

Good summary. I'll just add a couple of things: in recent kernels
we have a page migration facility which should be able to take care
of moving process and pagecache pages for you, without walking rmap
or killing the process (assuming you're talking about correctable
ECC errors).

This may not quite have the right in-kernel API for your use yet, but
it shouldn't be difficult to add.

> 
> If it's kernel space there are several cases:
> - Free page (count == 0). Easy: ignore it.

Also, if you want to isolate the free page, you can allocate it,
and tuck it away in a list somewhere (or just forget about it
completely).

-- 
SUSE Labs, Novell Inc.

* Re: ECC error correction - page isolation
  2006-06-02  1:30   ` Nick Piggin
@ 2006-06-02  3:10     ` Andi Kleen
  2006-06-02  3:15       ` Nick Piggin
  0 siblings, 1 reply; 6+ messages in thread
From: Andi Kleen @ 2006-06-02  3:10 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Brian Lindahl, linux-mm

> Good summary. I'll just add a couple of things: in recent kernels
> we have a page migration facility which should be able to take care
> of moving process and pagecache pages for you, without walking rmap
> or killing the process (assuming you're talking about correctable
> ECC errors).

I think he means uncorrected errors. Correctable errors can be fixed up
by a scrubber without anything else noticing.

Ok if your system doesn't support getting rid of them without an atomic
operation you might need to "stop the world" on MP, but that's relatively
easy using stop_machine().

> This may not quite have the right in-kernel API for your use yet, but
> it shouldn't be difficult to add.
> 
> > 
> > If it's kernel space there are several cases:
> > - Free page (count == 0). Easy: ignore it.
> 
> Also, if you want to isolate the free page, you can allocate it,
> and tuck it away in a list somewhere (or just forget about it
> completely).

Normally it's rare that a bit breaks completely. Usually they just toggle
for some reason and are ok again if you rewrite them (how to do the rewrite without
triggering an MCE can be tricky BTW). Or the glitch wasn't in the RAM transistors
themselves, but on some bus, then it might also be ok again on retry.

What more often happens is that a DIMM (or rather a chip on a DIMM) breaks
completely. In this case you need to remove the whole chip. This
can often be done in hardware using "chipkill" (which is kind of a special
case of hardware RAM RAID).

Anyway, in this case you usually need to remove a large memory area, much bigger
than a page, and it's more like memory hot unplug (which we don't quite
support yet, but it's being worked on ...).

Of course that's all for normal systems. If you're in a spacecraft (as I
gather from the original poster's domain name)
crossing the Van Allen belts or during a solar storm it might be very different.
But even then I would expect bits to more often just switch than break completely.
Maybe for a Jupiter probe it's different and chips might really spoil.

-Andi


* Re: ECC error correction - page isolation
  2006-06-02  3:10     ` Andi Kleen
@ 2006-06-02  3:15       ` Nick Piggin
  0 siblings, 0 replies; 6+ messages in thread
From: Nick Piggin @ 2006-06-02  3:15 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Brian Lindahl, linux-mm

Andi Kleen wrote:
>>Good summary. I'll just add a couple of things: in recent kernels
>>we have a page migration facility which should be able to take care
>>of moving process and pagecache pages for you, without walking rmap
>>or killing the process (assuming you're talking about correctable
>>ECC errors).
> 
> 
> I think he means uncorrected errors. Correctable errors can be fixed up
> by a scrubber without anything else noticing.

Oh you're probably right.

> 
> Ok if your system doesn't support getting rid of them without an atomic
> operation you might need to "stop the world" on MP, but that's relatively
> easy using stop_machine().
> 
> 
>>This may not quite have the right in-kernel API for your use yet, but
>>it shouldn't be difficult to add.
>>
>>
>>>If it's kernel space there are several cases:
>>>- Free page (count == 0). Easy: ignore it.
>>
>>Also, if you want to isolate the free page, you can allocate it,
>>and tuck it away in a list somewhere (or just forget about it
>>completely).
> 
> 
> Normally it's rare that a bit breaks completely. Usually they just toggle
> for some reason and are ok again if you rewrite them (how to do the rewrite without
> triggering an MCE can be tricky BTW). Or the glitch wasn't in the RAM transistors
> itself, but on some bus, then it might also be ok again on retry. 
> 
> What more often happens is that a DIMM (or rather a chip on a DIMM) breaks 
> completely. In this case you need to remove the whole chip. This
> can often be done in hardware using "chipkill" (which is kind of a special
> case of hardware RAM RAID).
> 
> Anyways you usually need to remove a large memory area, much bigger than a page, 
> in this case  and it's more like memory hot unplug (which we don't quite 
> support yet, but it's being worked on ...) 
> 
> Of course that's all for normal systems. If you're in a space craft (as I 
> gather from the original poster's domain name) 
> crossing the Van Allen belts or during a solar storm it might be very different.
> But even then I would expect bits to more often just switch than break completely. 
> Maybe for a Jupiter probe it's different and chips might really spoil.

Interesting background, Brian might find it useful. He did say he wanted
to isolate the pages if they're unused, so perhaps non-transient errors
can be detected. Or the system just wants to be overly paranoid?

-- 
SUSE Labs, Novell Inc.

* RE: ECC error correction - page isolation
@ 2006-06-05 23:36 Brian Lindahl
  0 siblings, 0 replies; 6+ messages in thread
From: Brian Lindahl @ 2006-06-05 23:36 UTC (permalink / raw)
  To: linux-mm

> If it's kernel space there are several cases:
> - Free page (count == 0). Easy: ignore it.
> - Reserved - e.g. page itself or kernel code - panic
> - Slab (slab bit set) - panic
> - Page table (cannot be detected right now, but you could
> change your architecture to set special bits) - handle like
> process error
> - buffer cache: toss or IO/error if it was dirty
> - Probably more cases
> Most can be figured out by looking at the various bits in struct page

Right, this sort of activity will be the main guts of error recovery.
Nothing too fancy I'm guessing, just requires a bit of digging. If we
can do something moderately intelligent (toss it), do that, otherwise
panic.

> I think he means uncorrected errors. Correctable errors can be fixed 
> up by a scrubber without anything else noticing.

This is correct in our environment.

> Ok if your system doesn't support getting rid of them without 
> an atomic operation you might need to "stop the world" on MP, 
> but that's relatively easy using stop_machine().

It's a UP, but I have no qualms about extending it to MP as we go. I
assume "start_machine()" brings us back up again?

> Interesting background, Brian might find it useful. He did say
> he wanted to isolate the pages if they're unused, so perhaps
> non-transient errors can be detected. Or the system just wants
> to be overly paranoid?

It's more of a "nice to have" feature in case our customers are overly
paranoid :) The main idea here is to retest pages that have been
isolated when memory gets tight (if it ever does). After several retests
with no errors, we'll be releasing the pages back to the kernel. This is
mostly to avoid tossing the same page(s) over and over in case they're
susceptible, for some reason.
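
The retest-and-release policy could be tracked per isolated page roughly as below. This is a sketch in plain userspace C; `struct quarantined_page`, `quarantine_retest()` and the threshold of 3 are all assumptions ("several" is not quantified above), as is the choice to restart the count whenever an error reappears.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed threshold; the description only says "several" retests. */
#define CLEAN_RETESTS_TO_RELEASE 3u

/* Hypothetical bookkeeping for one isolated page. */
struct quarantined_page {
    unsigned long pfn;            /* page frame number of the isolated page */
    unsigned int  clean_retests;  /* consecutive retests without an ECC error */
};

/*
 * Record the outcome of one retest.
 * Returns true once the page has passed enough consecutive clean retests
 * to be handed back; any new error restarts the count.
 */
bool quarantine_retest(struct quarantined_page *qp, bool ecc_error_seen)
{
    assert(qp != NULL);

    if (ecc_error_seen) {
        qp->clean_retests = 0;
        return false;
    }

    qp->clean_retests++;
    return qp->clean_retests >= CLEAN_RETESTS_TO_RELEASE;
}
```

Counting consecutive clean passes rather than total passes keeps a page that fails intermittently in quarantine, which matches the goal of not tossing the same page over and over.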

For a sanity check, so far, I have something like this:

u32 pfn; /* = some page number */
struct page *page = pfn_to_page(pfn);

To get an address for the read/rewrite cycle:

atomic_long_t * p = (atomic_long_t *) page_address(page);

To do the read/rewrite cycle, for each atomic_long_t, p, in the page:

atomic_long_add(0, p);

That should trigger the ECC without muddling the data, in an MP-safe
fashion (this should be a fun test, we get to make some RAM physically
fail). So check the ECC error count, and if it changed, do something
smart with 'page'.
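
The same read/rewrite loop can be tried out in userspace with C11 atomics. This is only an analogue of the kernel snippet: `scrub_page()` and `SCRUB_PAGE_SIZE` are hypothetical names, the 4096-byte page size is an assumption, and `atomic_fetch_add()` stands in for the kernel's atomic_long_add().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SCRUB_PAGE_SIZE 4096u  /* assumed page size */

/*
 * Read and rewrite every word of one page without changing its value.
 * Adding zero atomically leaves the data intact even if another CPU
 * updates the same word concurrently.
 */
void scrub_page(atomic_long *page_base)
{
    size_t nwords = SCRUB_PAGE_SIZE / sizeof(atomic_long);

    assert(page_base != NULL);

    for (size_t i = 0; i < nwords; i++)
        atomic_fetch_add(&page_base[i], 0);
}
```

One caveat with the userspace version: a compiler is allowed to notice that adding zero does not change the value, and may lower it to a fenced load with no store. It is worth checking the generated code before relying on this to force a write-back.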

One thing I'm having trouble with is finding out what page number to
start with and end with to make the scrubbing simple for the user (the
ioctl returns two u32s). Is there a better way to do this (i.e. existing
globals)?

pfn_beg = ULONG_MAX; /* start high, otherwise min() always yields 0 */
pfn_end = 0;
for_each_pgdat(pgdat)
{
  pfn_beg = min(pfn_beg, pgdat->node_start_pfn);
  pfn_end = max(pfn_end, pgdat->node_start_pfn +
                         pgdat->node_spanned_pages);
}
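
The range computation can be checked in isolation with a small userspace model. `struct node_span` and `pfn_range()` are hypothetical stand-ins for the pgdat fields used above; the point of the sketch is that the lower bound has to start at the largest possible value, since a minimum taken against 0 never moves.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the two pgdat fields used above. */
struct node_span {
    unsigned long start_pfn;      /* first page frame of the node */
    unsigned long spanned_pages;  /* size of the node's range, holes included */
};

/*
 * Compute the half-open range [*pfn_beg, *pfn_end) covering all nodes.
 * An empty node list yields [0, 0).
 */
void pfn_range(const struct node_span *nodes, size_t n,
               unsigned long *pfn_beg, unsigned long *pfn_end)
{
    unsigned long beg = ULONG_MAX;  /* so the first node lowers it */
    unsigned long end = 0;

    assert(nodes != NULL || n == 0);
    assert(pfn_beg != NULL && pfn_end != NULL);

    for (size_t i = 0; i < n; i++) {
        unsigned long node_end = nodes[i].start_pfn + nodes[i].spanned_pages;

        if (nodes[i].start_pfn < beg)
            beg = nodes[i].start_pfn;
        if (node_end > end)
            end = node_end;
    }

    *pfn_beg = (n == 0) ? 0 : beg;
    *pfn_end = end;
}
```

Because a spanned range can contain holes, every page number inside the result still needs its own validity check before it is scrubbed.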

I also validate the page number using 'pfn_valid(pfn)' before retrieving
the struct page from the page number (fails silently to act like
contiguous memory to the user).

Does this hit every physical page? Or am I missing pages that may have
been allocated by the bootmem allocator?

