RE: ECC error correction - page isolation

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: "Brian Lindahl" <Brian.Lindahl@SpaceDev.com>
To: linux-mm@kvack.org
Subject: RE: ECC error correction - page isolation
Date: Mon, 5 Jun 2006 16:36:50 -0700	[thread overview]
Message-ID: <069061BE1B26524C85EC01E0F5CC3CC30163E1F3@rigel.headquarters.spacedev.com> (raw)

> If it's kernel space there are several cases:
> - Free page (count == 0). Easy: ignore it.
> - Reserved - e.g. page itself or kernel code - panic
> - Slab (slab bit set) - panic
> - Page table (cannot be detected right now, but you could
> change your architecture to set special bits) - handle like
> process error
> - buffer cache: toss or IO/error if it was dirty
> - Probably more cases
> Most can be figured out by looking at the various bits in struct page

Right, this sort of activity will be the main guts of error recovery.
Nothing too fancy I'm guessing, just requires a bit of digging. If we
can do something moderately intelligent (toss it), do that, otherwise
panic.

> I think he means uncorrected errors. Correctable errors can be fixed 
> up by a scrubber without anything else noticing.

This is correct in our environment.

> Ok if your system doesn't support getting rid of them without 
> an atomic operation you might need to "stop the world" on MP, 
> but that's relatively easy using stop_machine().

It's a UP, but I have no qualms about extending it to MP as we go. I
assume "start_machine()" brings us back up again?

> Interesting background, Brian might find it useful. He did say 
> he wanted to isolate the pages if they're unused, so perhaps
non-transient
> errors can be detected. Or the system just wants to be overly
paranoid?

It's more of a "nice to have" feature in case our customers are overly
paranoid :) The main idea here, is to retest pages that have been
isolated when memory gets tight (if it ever does). After several retests
with no errors, we'll be releasing the pages back to the kernel. This is
mostly to avoid tossing the same page(s) over and over in case they're
susceptible, for some reason.

For a sanity check, so far, I have something like this:

u32 pfn; /* = some page number */
struct * page = pfn_to_page(pfn);

To get an address for the read/rewrite cycle:

atomic_long_t * p = (atomic_long_t *) page_address(page);

To do the read/rewrite cycle, for each atomic_long_t, p, in the page:

atomic_long_add(0, p);

That should trigger the ECC without muddling with the data in a MP-safe
fashion (this should be a fun test, we get to make some RAM physically
fail). So check the ECC error count, and if it changed, do something
smart with 'page'.

One thing I'm having trouble with is finding out what page number to
start with and end with to make the scrubbing simple for the user (the
ioctl returns two u32s). Is there a better way to do this (i.e. existing
globals)?

pfn_beg = pfn_end = 0;
for_each_pgdat(pgdat)
{
  pfn_beg = min(pfn_beg, pgdat->node_start_pfn);
  pfn_end = max(pfn_end, pgdat->node_start_pfn +
pgdat->node_spanned_pages);
}

I also validate the page number using 'pfn_valid(pfn)' before retrieving
the struct page from the page number (fails silently to act like
contiguous memory to the user).

Does this hit every physical page? Or am I missing pages that may have
been allocated by the bootmem allocator?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next             reply	other threads:[~2006-06-05 23:36 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-06-05 23:36 Brian Lindahl [this message]
  -- strict thread matches above, loose matches on Subject: below --
2006-06-01 18:06 Brian Lindahl
2006-06-01 23:46 ` Andi Kleen
2006-06-02  1:30   ` Nick Piggin
2006-06-02  3:10     ` Andi Kleen
2006-06-02  3:15       ` Nick Piggin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=069061BE1B26524C85EC01E0F5CC3CC30163E1F3@rigel.headquarters.spacedev.com \
    --to=brian.lindahl@spacedev.com \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox