From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id l6N6dZJf031582 for (envelope-from y-goto@jp.fujitsu.com); Mon, 23 Jul 2007 15:39:36 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id C5D881B801E for ; Mon, 23 Jul 2007 15:39:35 +0900 (JST) Received: from s7.gw.fujitsu.co.jp (s7.gw.fujitsu.co.jp [10.0.50.97]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 9EDCA2DC07E for ; Mon, 23 Jul 2007 15:39:35 +0900 (JST) Received: from s7.gw.fujitsu.co.jp (s7 [127.0.0.1]) by s7.gw.fujitsu.co.jp (Postfix) with ESMTP id 870FE161C009 for ; Mon, 23 Jul 2007 15:39:35 +0900 (JST) Received: from fjm501.ms.jp.fujitsu.com (fjm501.ms.jp.fujitsu.com [10.56.99.71]) by s7.gw.fujitsu.co.jp (Postfix) with ESMTP id 1D0E2161C007 for ; Mon, 23 Jul 2007 15:39:35 +0900 (JST) Received: from fjmscan502.ms.jp.fujitsu.com (fjmscan502.ms.jp.fujitsu.com [10.56.99.142])by fjm501.ms.jp.fujitsu.com with ESMTP id l6N6ciog016057 for ; Mon, 23 Jul 2007 15:38:45 +0900 Date: Mon, 23 Jul 2007 15:38:48 +0900 From: Yasunori Goto Subject: Re: [RFC][Doc] memory hotplug documentaion In-Reply-To: <20070720155346.33ca523b.kamezawa.hiroyu@jp.fujitsu.com> References: <20070720155346.33ca523b.kamezawa.hiroyu@jp.fujitsu.com> Message-Id: <20070723152107.35EA.Y-GOTO@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: KAMEZAWA Hiroyuki Cc: "linux-mm@kvack.org" List-ID: Hmm. It should be mentioned that there are 2 phases of memory hotplug. (Physical hot-add/remove and online/offline) If it is written in early section clearly, I think it will be helpful for readers. Bye. > Hi, > > I'm considering to add text file for memory hotplug to -mm kernel to which memory > unplug base patches are merged now. like Documentation/vm/memory_hotplug > This is not patch style yet. > > I wrote this. But I know I'm not a good writer (even in Japanese...) and > I have no skilled reviewer. > > This is RFC for memory hotplug documentation. This documentation describes > how-to-use and current development status. (Of course, I'll update this when > I post patches.) If development status is unnecessary, I'll remove them. > > Any comments and questions are helpful. > > Thanks, > -Kame > == > Memory Hotplug > -------------- > > Last Updated: Jul 20 2007 > > This document is about memory hotplug including how-to-use and current status. > Because Memory Hotplug is still under development, contents of this text will > be changed often. > > > > 1. Introduction > 2. SPARSEMEM and Section > 3. Hardware(Firmware) Support. > 4. Notify memory hotplug event by hand > 5. State of memory > 6. How to online memory > 7. Memory offline and ZONE_MOVABLE > 8. How to offline memory > 9. Future Work List > > Note(1): x86_64's special memory hotplug is not described. > Note(2): This text assumes that sysfs is mounted at /sys. > > > 1. Introduction > ------------ > Memory Hotplug allows users to increase/decrease the amount of memory. > Generally, there are two purposes. > > (A) For changing the amount of memory > (B) For installing/removing DIMM or for helping hardware support of memory > power consumption reduction or DIMM exchanges, and dynamic hardware > reconfiguration like NUMA-node-hotadd. > > (A) is required by highly virtualized environment and (B) is required by > hardwares which support memory power management. > > Linux's memory hotplug divides the memory into a logical group of "section". > Memory Hotplug allows onlining/offlining of sections. > > When a user onlines a secion, the whole memory in it are installed into the > system. When a user offlines a section, the whole memory in it is removed from > the system. > > > 2. SPARSEMEM and Section > ------------ > Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory > into chunks of the same size. The chunk is called as "section". The size of > section is architecture dependent. For example, power uses 16MiB, ia64 uses > 1GiB. > > Memory hotplug onlines/offlines this "section". > > To know the size of section, please read this file > /sys/devices/system/memory/block_size_bytes > > This file shows the size of section in byte. > > All section has its device information under /sys/devices/system/memory as > > /sys/devices/system/memory/memoryXXX/ > (XXX is section id.) > > Now, XXX is defined as "start_address_of_section/secion_size". > > Under each section, you can see 3 files. > > /sys/devices/system/memory/memoryXXX/phys_index > /sys/devices/system/memory/memoryXXX/phys_device > /sys/devices/system/memory/memoryXXX/state > > 'phys_index' : read-only and contains section id, same as XXX. > 'state' : read-write > at read: contains online/offline state of memory. > at write: user can specify "online", "offline" command > 'phys_device': read-only: designed to show the name of physical memory device. > This is not well implemented now. > > 3. Hardware(Firmware) Support > ------------ > On x86_64/ia64 platform, memory hotplug by ACPI is supported. > > In general, the firmware (ACPI), which supports memory hotplug, defines > memory class object of _HID "PNP0C80". When a notify is asserted to PNP0C80, > Linux's acpi handler does hotadd memory to the system and call hotplug udev > script.This sequence will be done in automatically. > > But scripts for memory hotplug is not contained in generic udev package(now). > You may have to write it by yourself or online/offline memory by hand. > Please see "How to online memory", "How to offline memory" in this text. > > > 4. Notify memory hotplug event by hand > ------------ > In some environment, especially virtualized environment, firmware will not > notify memory hotplug event to the kernel. For such environment, "probe" > interface is supported. This interface depends on CONFIG_ARCH_MEMORY_PROBE. > > Now, CONFIG_ARCH_MEMORY_PROBE is supported only by powerpc but it does not > includes highly architecture codes. Please add config if you need "probe" > interface. > > Probe interface is located at > /sys/devices/system/memory/probe > > You can tell the physical address of new memory to the kernel by > > %echo start_address_of_new_memory > /sys/devices/system/memory/probe > > Then, [start_address_of_new_memory, start_address_of_new_memory + section_size) > memory range is hot-added. In this case, hotplug script is not called (in > current implementation.). You'll have to online memory by yourself. > Please see "How to online memory" in this text. > > 5. State of memory > ------------ > To see (online/offline) state of memory section, read 'state' file. > > %cat /sys/device/system/memory/memoryXXX/state > > > If the memory section is online, you'll read "online". > If the memory section is offline, you'll read "offline". > > > 6. How to online memory > ------------ > Even if the memory is hot-added, it is not at ready-to-use state. > For using newly added memory, you have to "online" memory section. > > For onlining, you have to write "online" to section's state file as: > > %echo online > /sys/devices/system/memory/memoryXXX/state > > After this, section memoryXXX's state will be 'online'. And the amount of > available memory will be increased. > > Currently, newly added memory is added as ZONE_NORMAL (for powerpc, ZONE_DMA). > This may be changed in future. > > > 7. Memory offline and ZONE_MOVABLE > ------------ > Memory offlining is complicated than memory online. Because memory offline > has to make the whole memory section to be unused, memory offline can be > failed if the section includes memory which is never freed. > > In general, memory offline can use 2 techniques. > > (1) reclaim and free all memory in the section. > (2) migrate all pages in the section. > > In current implementation, Linux's memory offline uses method (2), freeing > the whole pages in the section by page migration. But not all pages are > migratable. Under current Linux, migratable pages are anonymous pages and > page caches. For offlining a section by migration, the kernel has to guarantee > that the section contains just only migratable pages. > > Now, a boot option for making a section which consists of migratable pages is > supported. By specifying "kernelcore=" or "movablecore=" boot option, you can > create ZONE_MOVABLE...a zone which is just used for movable pages. > (See also Documentation/kernel-parameters.txt) > > Assume the system has "TOTAL" amount of memory at boot time, this boot option > creates ZONE_MOVABLE as following. > > 1) When kernelcore=YYYY boot option is used, > Size of memory not for movable pages (not for offline) is YYYY. > Size of memory for movable pages (for offline) is TOTAL-YYYY. > > 2) When movablecore=ZZZZ boot option is used, > Size of memory not for movable pages (not for offline) is TOTAL - YYYY. > Size of memory for movable pages (for offline) is YYYY. > > > Note) Unfortunately, there is no information to show which section is belongs > to ZONE_MOVABLE. This is TBD. > > > 8. How to offline memory > ------------ > You can offline section by sysfs interface as memory onlining. > > %echo offline > /sys/devices/system/memory/memoryXXX/state > > If offline succeed, state of memory section is changed to be "offline". > If fail, some error core (like -EBUSY) will be returned be the kernel. > Even if a section is not belongs to ZONE_MOVABLE, you can try to offline it. > If it doesn't contain 'unmovable' memory, you'll get success. > > A section under ZONE_MOVABLE is considered to be able to be offlined easily. > But under some buzy state, it may return -EBUSY. Even if a memory section > cannot be offlined with -EBUSY, you can retry offline and will be able to > offline (soon?). (For example, a page is referred by some kernel internal call.) > > Consideration: > Memory hotplug's design direction is to make possibility of memory offlining > bigger and to guarantee unplugging memory under any situation. But it needs > more work. Returning -EBUSY under some situation may be good because the user > can decide to retry more or not by himself. Currently, memory offlining code > does some amount of retry with 120 secs timeout. > > 9. Future Work > ------------ > - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like > sysctl or new control file. > - showing memory section and physical device relation ship. > - showing memory section and node relation ship (maybe good for NUMA) > - showing memory section is under ZONE_MOVABLE or not > - test and make it better memory offlining. > - support HugeTLB page migration and offlining. > - memmap removing at memory offline. > > -- Yasunori Goto -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org