Hi Tang, On 03/16/2013 06:35 PM, Tang Chen wrote: > Hi Yinghai, all, > > As Yinghai have implemented parsing numa info early more considerately, > I think we can introduce the movablemem_map boot option again. > > This patch-set is based on Linux 3.9 rc-2, but need to apply Yinghai's > "x86, ACPI, numa: Parse numa info early" patch-set first. > Please refer to: > v1: https://lkml.org/lkml/2013/3/7/642 > v2: https://lkml.org/lkml/2013/3/10/47 > > > In this part1 patch-set, we reimplemented movablemem_map boot option > based on Yinghai's SRAT work. The path is like this: > 1) parse SRAT, fill only existing memory into numa_meminfo, like: > numa_cleanup_meminfo() { > 251 const u64 low = 0; > 252 const u64 high = PFN_PHYS(max_pfn); > ...... > 255 /* first, trim all entries */ > 256 for (i = 0; i < mi->nr_blks; i++) { > 257 struct numa_memblk *bi = &mi->blk[i]; > 258 > 259 /* make sure all blocks are inside the limits */ > 260 bi->start = max(bi->start, low); > 261 bi->end = min(bi->end, high); > 262 > 263 /* and there's no empty block */ > 264 if (bi->start >= bi->end) > 265 numa_remove_memblk_from(i--, mi); > 266 } > ...... > } > > Those non-existing memory, such as memory not added yet, won't be > stored in numa_meminfo. > > 2) initialize memory mapping for the existing memory, putting pagetables > and vmemmap on local node. > > Since not all memory info is kept, we have to sanitize movablemem_map.map[] > when we parse SRAT, so we may prevent allocating pagetables or vmemmap on > local node if user specified the whole node as movable. > > To avoid this problem, here is my idea: > 1) Store not only existing memory ranges in numa_mem_info, but all the > memory info from SRAT; > 2) Map only existing memory as before; > 3) Do memblock limitation after memory mapping initialization using > numa_meminfo, so that movablemem_map will be able to exclude pagetables > and vmemmap ranges on local node. > > This will be done in part2 soon. > > How do you think? > > Part2 of this patch-set is under development. > > ======================================================================== > [What we are doing] > This patchset introduces a boot option for user to specify ZONE_MOVABLE > memory map for each node in the system. Users can use it in two ways: > > 1. movablecore_map=nn[KMG]@ss[KMG] > In this way, the kernel will make sure memory range from ss to ss+nn is > on ZONE_MOVABLE. The hotplug info provided by SRAT will be ignored. > > 2. movablecore_map=acpi > In this way, the kernel will use memory hotplug info in SRAT to determine > ZONE_MOVABLE for each node. All the ranges user has specified will be > ignored. > > > [Why we do this] > If we hot remove a memroy device, it cannot have kernel memory, > because Linux cannot migrate kernel memory currently. Therefore, > we have to guarantee that the hot removed memory has only movable > memoroy. > (Here is an exception: When we implement the node hotplug functionality, > for those kernel memory whose life cycle is the same as the node, such as > pagetables, vmemmap and so on, although the kernel cannot migrate them, > we can still put them on local node because we can free them before we > hot-remove the node. This is not implemented yet.) > > Linux has two boot options, kernelcore= and movablecore=, for > creating movable memory. These boot options can specify the amount > of memory use as kernel or movable memory. Using them, we can > create ZONE_MOVABLE which has only movable memory. > (NOTE: doing this will cause NUMA performance because the kernel won't > be able to distribute kernel memory evenly to each node.) > > But it does not fulfill a requirement of memory hot remove, because > even if we specify the boot options, movable memory is distributed > in each node evenly. So when we want to hot remove memory which > memory range is 0x80000000-0c0000000, we have no way to specify > the memory as movable memory. > > Furthermore, even if we can use SRAT, users still need an interface > to enable/disable this functionality if they don't want to lose their > NUMA performance. So I think, a user interface is always needed. > > So we proposed this new feature which specifies memory range to use as > movable memory. http://marc.info/?l=linux-mm&m=136014458829566&w=2 It seems that Mel don't like this idea. > > [Ways to do this] > There may be 2 ways to specify movable memory. > 1. use firmware information > 2. use boot option > > 1. use firmware information > According to ACPI spec 5.0, SRAT table has memory affinity structure > and the structure has Hot Pluggable Filed. See "5.2.16.2 Memory > Affinity Structure". If we use the information, we might be able to > specify movable memory by firmware. For example, if Hot Pluggable > Filed is enabled, Linux sets the memory as movable memory. > > 2. use boot option > This is our proposal. New boot option can specify memory range to use > as movable memory. > > > [How we do this] > We chose second way, because if we use first way, users cannot change > memory range to use as movable memory easily. We think if we create > movable memory, performance regression may occur by NUMA. In this case, > user can turn off the feature easily if we prepare the boot option. > And if we prepare the boot optino, the user can select which memory > to use as movable memory easily. > > > [How to use] > 1. For movablecore_map=nn[KMG]@ss[KMG]: > * > * SRAT: |_____| |_____| |_________| |_________| ...... > * node id: 0 1 1 2 > * user specified: |__| |___| > * ZONE_MOVABLE: |___| |_________| |______| ...... > * > NOTE: 1) User can specify this option more than once, but at most MAX_NUMNODES > times. The extra options will be ignored. > 2) In this case, SRAT info will be ingored. > > 2. For movablemem_map=acpi: > * > * SRAT: |_____| |_____| |_________| |_________| ...... > * node id: 0 1 1 2 > * hotpluggable: n y y n > * ZONE_MOVABLE: |_____| |_________| > * > NOTE: 1) Before parsing SRAT, memblock has already reserve some memory ranges > for other purposes, such as for kernel image. We cannot prevent > kernel from using these memory, so we need to exclude these memory > even if it is hotpluggable. > Furthermore, to ensure the kernel has enough memory to boot, we make > all the memory on the node which the kernel resides in should be > un-hotpluggable. > 2) In this case, all the user specified memory ranges will be ingored. > > We also need to consider the following points: > 1) Using this boot option could cause NUMA performance down because the kernel > memory will not be distributed on each node evenly. So for users who don't > want to lose their NUMA performance, just don't use it. > 2) If kernelcore or movablecore is also specified, movablecore_map will have > higher priority to be satisfied. > 3) This option has no conflict with memmap option. > > > Tang Chen (8): > acpi: Print hotplug info in SRAT. > x86, mm, numa, acpi: Add movable_memmap boot option. > x86, mm, numa, acpi: Introduce zone_movable_limit[] to store start > pfn of ZONE_MOVABLE. > x86, mm, numa, acpi: Extend movablemem_map to the end of each node. > x86, mm, numa, acpi: Support getting hotplug info from SRAT. > x86, mm, numa, acpi: Sanitize zone_movable_limit[]. > x86, mm, numa, acpi: make movablemem_map have higher priority > x86, mm, numa, acpi: Memblock limit with movablemem_map > > Yasuaki Ishimatsu (1): > x86: get pg_data_t's memory from other node > > Documentation/kernel-parameters.txt | 36 +++++ > arch/x86/mm/numa.c | 5 +- > arch/x86/mm/srat.c | 130 +++++++++++++++++- > include/linux/memblock.h | 2 + > include/linux/mm.h | 22 +++ > mm/memblock.c | 50 +++++++ > mm/page_alloc.c | 265 ++++++++++++++++++++++++++++++++++- > 7 files changed, 500 insertions(+), 10 deletions(-) > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org