linux-mm.kvack.org archive mirror
* Failing memory auto-hotremove support?
@ 2008-07-03 12:25 Tim Small
  2008-07-03 17:42 ` Doug Thompson
  0 siblings, 1 reply; 4+ messages in thread
From: Tim Small @ 2008-07-03 12:25 UTC
  To: bluesmoke-devel, linux-mm

Hello,

I just noticed that there is memory hotplug / hotremove support in the 
kernel.org kernel now.

I was thinking that it may be desirable (e.g. on large NUMA systems) to 
automatically trigger the removal of memory modules (or just take a 
section of the memory module out of use, if applicable), if a memory 
module exceeded a pre-set correctable error rate (or RIGHT-NOW, if an 
uncorrectable memory error was detected).

Tim.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: Failing memory auto-hotremove support?
  2008-07-03 12:25 Failing memory auto-hotremove support? Tim Small
@ 2008-07-03 17:42 ` Doug Thompson
  2008-07-04  5:24   ` Yasunori Goto
  0 siblings, 1 reply; 4+ messages in thread
From: Doug Thompson @ 2008-07-03 17:42 UTC
  To: Tim Small, bluesmoke-devel, linux-mm

--- Tim Small <tim@buttersideup.com> wrote:

> Hello,
> 
> I just noticed that there is memory hotplug / hotremove support in the 
> kernel.org kernel now.

Cool, good to hear. Now I (or others) need some cycles to review it and modify EDAC to utilize it if
possible and/or provide feedback to the memory guys.

> 
> I was thinking that it may be desirable (e.g. on large NUMA systems) to 
> automatically trigger the removal of memory modules (or just take a 
> section of the memory module out of use, if applicable), if a memory 
> module exceeded a pre-set correctable error rate (or RIGHT-NOW, if an 
> uncorrectable memory error was detected).

THAT is exactly one of the goals EDAC (then bluesmoke) had in mind years ago, but there
was no easy mechanism, within the kernel, to perform those types of controls (take a section of
memory out of commission).

When you have a NUMA node with 64 or 128 gigabytes of memory and have 5,000 such nodes, rebooting
is not a very good thing to do.

BUT being able to detect a bad DIMM (or a pair) via EDAC and then notify the memory subsystem to
de-activate that DIMM (pair) from active use is a GREAT feature to have. The node gracefully handles
the downed memory and stays UP running that big cluster task, all the while notifying the admin
that a DIMM needs replacement at the next maintenance cycle.

doug t

> 
> Tim.
> 


W1DUG


* Re: Failing memory auto-hotremove support?
  2008-07-03 17:42 ` Doug Thompson
@ 2008-07-04  5:24   ` Yasunori Goto
  2008-07-09 23:34     ` Badari Pulavarty
  0 siblings, 1 reply; 4+ messages in thread
From: Yasunori Goto @ 2008-07-04  5:24 UTC
  To: Doug Thompson; +Cc: Tim Small, bluesmoke-devel, linux-mm

Hi.

> > I just noticed that there is memory hotplug / hotremove support in the 
> > kernel.org kernel now.
> 
> Cool, good to hear. Now I (or others) need some cycles to review it and modify EDAC to utilize it if
> possible and/or provide feedback to the memory guys.

There is documentation about memory hotplug. I hope it will help you.
(Documentation/memory-hotplug.txt)


> > I was thinking that it may be desirable (e.g. on large NUMA systems) to 
> > automatically trigger the removal of memory modules (or just take a 
> > section of the memory module out of use, if applicable), if a memory 
> > module exceeded a pre-set correctable error rate (or RIGHT-NOW, if an 
> > uncorrectable memory error was detected).
> 
> THAT is exactly one of the goals EDAC (then bluesmoke) had in mind years ago, but there
> was no easy mechanism, within the kernel, to perform those types of controls (take a section of
> memory out of commission).

At least, each memory section can be offlined "logically".
So, if there is a (correctable) error in a section, the section will not be used
once it has been offlined.
There is no code for automatic offlining yet, but I think it would not be difficult to write.
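
For illustration only (this sketch is not from the thread, and it is not existing kernel or EDAC code): a userspace helper could apply the policy Tim describes and then offline the affected section through the sysfs interface covered in Documentation/memory-hotplug.txt. The Python below assumes the physical address of the error is already known (for example from an EDAC report), and that memoryN directories are numbered by physical address divided by the value in block_size_bytes. The function names and the threshold are hypothetical.

```python
from pathlib import Path

# Standard sysfs location of the memory hotplug interface.
SYSFS_MEMORY = Path("/sys/devices/system/memory")


def block_index(phys_addr: int, root: Path = SYSFS_MEMORY) -> int:
    """Map a physical address to a memoryN index.

    Assumption: N == phys_addr // block size, where the block size is the
    hexadecimal value stored in block_size_bytes.
    """
    block_size = int((root / "block_size_bytes").read_text().strip(), 16)
    return phys_addr // block_size


def should_offline(ce_count: int, ue_count: int, ce_threshold: int) -> bool:
    """Hypothetical policy: act immediately on any uncorrectable error,
    or once correctable errors exceed a pre-set threshold."""
    return ue_count > 0 or ce_count > ce_threshold


def offline_block(index: int, root: Path = SYSFS_MEMORY) -> bool:
    """Request a logical offline by writing "offline" to the state file.

    Returns False if the write is rejected (for example because pages in
    the block cannot be migrated) or the block does not exist.
    """
    state = root / f"memory{index}" / "state"
    try:
        state.write_text("offline")
    except OSError:
        return False
    return state.read_text().strip() == "offline"
```

A caller would check should_offline() against the error counters it collects and, if it returns True, pass block_index(addr) to offline_block(). This only takes the section out of use logically. It does not touch the hardware.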


Physical (in other words, electrical) removal needs more work (except on powerpc boxes).
On x86-64/ia64, the ACPI memory device (or container device) must support the
_EJD method, and the physical removal code must be called. But I think that code is
not complete yet.


> When you have a NUMA node with 64 or 128 gigabytes of memory and have 5,000 such nodes, rebooting
> is not a very good thing to do.
> BUT being able to detect a bad DIMM (or a pair) via EDAC and then notify the memory subsystem to
> de-activate that DIMM (pair) from active use is a GREAT feature to have. The node gracefully handles
> the downed memory and stays UP running that big cluster task, all the while notifying the admin
> that a DIMM needs replacement at the next maintenance cycle.


Unfortunately, NUMA nodes can't be removed yet.
The pgdat and other per-node structures can't be freed yet.

I'm now planning how to remove them, and will make it possible step by step.
Please wait.

I'm encouraged that there are new people interested in memory hotplug. :-)

Thanks.

-- 
Yasunori Goto 



* Re: Failing memory auto-hotremove support?
  2008-07-04  5:24   ` Yasunori Goto
@ 2008-07-09 23:34     ` Badari Pulavarty
  0 siblings, 0 replies; 4+ messages in thread
From: Badari Pulavarty @ 2008-07-09 23:34 UTC
  To: Yasunori Goto; +Cc: Doug Thompson, Tim Small, bluesmoke-devel, linux-mm

On Fri, 2008-07-04 at 14:24 +0900, Yasunori Goto wrote:
> Hi.
> 
> > > I just noticed that there is memory hotplug / hotremove support in the 
> > > kernel.org kernel now.
> > 
> > Cool, good to hear. Now I (or others) need some cycles to review it and modify EDAC to utilize it if
> > possible and/or provide feedback to the memory guys.
> 
> There is documentation about memory hotplug. I hope it will help you.
> (Documentation/memory-hotplug.txt)
> 
> 
> > > I was thinking that it may be desirable (e.g. on large NUMA systems) to 
> > > automatically trigger the removal of memory modules (or just take a 
> > > section of the memory module out of use, if applicable), if a memory 
> > > module exceeded a pre-set correctable error rate (or RIGHT-NOW, if an 
> > > uncorrectable memory error was detected).
> > 
> > THAT is exactly one of the goals EDAC (then bluesmoke) had in mind years ago, but there
> > was no easy mechanism, within the kernel, to perform those types of controls (take a section of
> > memory out of commission).
> 
> At least, each memory section can be offlined "logically".
> So, if there is a (correctable) error in a section, the section will not be used
> once it has been offlined.
> There is no code for automatic offlining yet, but I think it would not be difficult to write.
> 
> 
> Physical (in other words, electrical) removal needs more work (except on powerpc boxes).
> On x86-64/ia64, the ACPI memory device (or container device) must support the
> _EJD method, and the physical removal code must be called. But I think that code is
> not complete yet.
> 
> 
> > When you have a NUMA node with 64 or 128 gigabytes of memory and have 5,000 such nodes, rebooting
> > is not a very good thing to do.
> > BUT being able to detect a bad DIMM (or a pair) via EDAC and then notify the memory subsystem to
> > de-activate that DIMM (pair) from active use is a GREAT feature to have. The node gracefully handles
> > the downed memory and stays UP running that big cluster task, all the while notifying the admin
> > that a DIMM needs replacement at the next maintenance cycle.
> 
> 
> Unfortunately, NUMA nodes can't be removed yet.
> The pgdat and other per-node structures can't be freed yet.
> 
> I'm now planning how to remove them, and will make it possible step by step.
> Please wait.

While trying to test hot-removing nodes, we ran into this: there
are allocations in the first memblock on each node, which prevent
the node from being removed.

If you have ideas or code to move these allocations out of the way, I
will be more than happy to test/verify/help :)

Thanks,
Badari

