From: Yanfei Zhang
Date: Fri, 20 Sep 2013 00:57:22 +0800
Subject: Re: [PATCH v3 0/5] x86, memblock: Allocate memory near kernel image before SRAT parsed.
To: Zhang Yanfei, tj@kernel.org, "H. Peter Anvin", yinghai@kernel.org
Cc: Tang Chen, rjw@sisk.pl, lenb@kernel.org, tglx@linutronix.de,
    mingo@elte.hu, akpm@linux-foundation.org, toshi.kani@hp.com,
    liwanp@linux.vnet.ibm.com, trenn@suse.de, jiang.liu@huawei.com,
    wency@cn.fujitsu.com, laijs@cn.fujitsu.com,
    isimatu.yasuaki@jp.fujitsu.com, izumi.taku@jp.fujitsu.com,
    mgorman@suse.de, mina86@mina86.com, gong.chen@linux.intel.com,
    vasilis.liaskovitis@profitbricks.com, lwoodman@redhat.com,
    riel@redhat.com, jweiner@redhat.com, prarit@redhat.com,
    x86@kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-acpi@vger.kernel.org
In-Reply-To: <5236E4D9.6010502@cn.fujitsu.com>
References: <1379064655-20874-1-git-send-email-tangchen@cn.fujitsu.com> <5236E4D9.6010502@cn.fujitsu.com>
Sender: owner-linux-mm@kvack.org

ping......

2013/9/16 Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Hello tejun,

Could you please help review the patchset? As you suggested,
we've made the patchset much simpler and cleaner.

Thanks in advance!

On 09/13/2013 05:30 PM, Tang Chen wrote:
> This patch-set is based on tj's suggestion, and not fully tested.
> Just for review and discussion.
>
> This patch-set is based on the latest kernel (3.11)
> HEAD is:
> commit d5d04bb48f0eb89c14e76779bb46212494de0bec
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Wed Sep 11 19:55:12 2013 -0700
>
>
> [Problem]
>
> Linux currently cannot migrate pages used by the kernel because
> of the kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET.
> When the pa changes, we cannot simply update the page table and
> keep the va unmodified. So kernel pages are not migratable.
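
To illustrate the direct-mapping constraint quoted above, here is a minimal C sketch. It is not kernel code: the helper names are hypothetical, and the PAGE_OFFSET value is an assumption (the x86_64 direct-map base commonly used around 3.11). It shows that a direct-mapped virtual address is a pure function of the physical address, so moving a page to another physical frame necessarily changes its virtual address.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: x86_64 direct-map base of kernels from that era. */
#define PAGE_OFFSET 0xffff880000000000ULL

/* Direct mapping: va is a fixed offset from pa (va = pa + PAGE_OFFSET). */
static inline uint64_t phys_to_virt_sketch(uint64_t pa)
{
    return pa + PAGE_OFFSET;
}

/* Inverse of the direct mapping. */
static inline uint64_t virt_to_phys_sketch(uint64_t va)
{
    assert(va >= PAGE_OFFSET);
    return va - PAGE_OFFSET;
}

/*
 * Returns 1 if a page moved from old_pa to new_pa would keep the same
 * direct-mapped va, 0 otherwise. Because the offset is fixed, this only
 * holds when the page does not move at all.
 */
static inline int va_survives_migration(uint64_t old_pa, uint64_t new_pa)
{
    return phys_to_virt_sketch(old_pa) == phys_to_virt_sketch(new_pa);
}
```

User-space pages avoid this problem because their virtual addresses are translated through per-process page tables that can be repointed at a new frame; the direct map has no such freedom.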
>
> There are also some other issues that make kernel pages not migratable.
> For example, the physical address may be cached somewhere and used later.
> It is not practical to update all of these cached copies.
>
> When doing memory hotplug in Linux, we first migrate all the pages in one
> memory device somewhere else, and then remove the device. But if pages are
> used by the kernel, they are not migratable. As a result, memory used by
> the kernel cannot be hot-removed.
>
> Modifying the kernel direct mapping mechanism is too difficult to do, and
> it could degrade kernel performance and stability. So we take the following
> approach to memory hotplug.
>
>
> [What we are doing]
>
> In Linux, memory in one NUMA node is divided into several zones. One of the
> zones is ZONE_MOVABLE, which the kernel won't use.
>
> In order to implement memory hotplug in Linux, we are going to arrange all
> hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this memory.
> To do this, we need ACPI's help.
>
> In ACPI, the SRAT (System Resource Affinity Table) contains NUMA info. The
> memory affinity entries in the SRAT record every memory range in the system,
> along with flags specifying whether the memory range is hotpluggable.
> (Please refer to ACPI spec 5.0, section 5.2.16.)
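
As a rough illustration of how the hotpluggable flag described above might be tested, here is a short C sketch. The struct is a simplified, hypothetical stand-in for an SRAT memory affinity entry (it does not reproduce the on-disk ACPI layout), and the names are made up for this example. The bit positions follow the ACPI definition as I understand it: bit 0 means the entry is enabled, bit 1 means the range is hot-pluggable.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits of a memory affinity entry (bit 0: enabled, bit 1: hot-pluggable). */
#define SRAT_MEM_ENABLED_SKETCH        (1u << 0)
#define SRAT_MEM_HOT_PLUGGABLE_SKETCH  (1u << 1)

/* Hypothetical, simplified view of one memory affinity entry. */
struct mem_affinity_sketch {
    uint64_t base;              /* start of the physical range */
    uint64_t length;            /* size of the range in bytes */
    uint32_t proximity_domain;  /* NUMA proximity domain of the range */
    uint32_t flags;             /* SRAT_MEM_*_SKETCH bits */
};

/* Returns 1 if the entry is enabled and marked hot-pluggable, else 0. */
static int range_is_hotpluggable(const struct mem_affinity_sketch *ma)
{
    return (ma->flags & SRAT_MEM_ENABLED_SKETCH) &&
           (ma->flags & SRAT_MEM_HOT_PLUGGABLE_SKETCH);
}
```

Until the SRAT has been parsed, a check like this cannot be made at all, which is why the early allocations discussed below are the hard part.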
>
> With the help of SRAT, we have to do the following two things to achieve our
> goal:
>
> 1. When doing memory hot-add, allow users to arrange hotpluggable memory as
>    ZONE_MOVABLE.
>    (This has been done by the MOVABLE_NODE functionality in Linux.)
>
> 2. When the system is booting, prevent the bootmem allocator from allocating
>    hotpluggable memory for the kernel before memory initialization
>    finishes.
>
> Problem 2 is the key problem we are going to solve. But before solving it,
> we need some preparation. Please see below.
>
>
> [Preparation]
>
> The bootloader has to load the kernel image into memory, and this memory
> must not be hotpluggable. We cannot avoid this. So in a memory hotplug system,
> we can assume any node the kernel resides in is not hotpluggable.
>
> Before SRAT is parsed, we don't know which memory ranges are hotpluggable. But
> memblock has already started to work. In the current kernel, memblock allocates
> the following memory before SRAT is parsed:
>
> setup_arch()
>  |->memblock_x86_fill()            /* memblock is ready */
>  |......
>  |->early_reserve_e820_mpc_new()   /* allocate memory under 1MB */
>  |->reserve_real_mode()            /* allocate memory under 1MB */
>  |->init_mem_mapping()             /* allocate page tables, about 2MB to map 1GB memory */
>  |->dma_contiguous_reserve()       /* specified by user, should be low */
>  |->setup_log_buf()                /* specified by user, several megabytes */
>  |->relocate_initrd()              /* could be large, but will be freed after boot, should reorder */
>  |->acpi_initrd_override()         /* several megabytes */
>  |->reserve_crashkernel()          /* could be large, should reorder */
>  |......
>  |->initmem_init()                 /* Parse SRAT */
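
The "about 2MB to map 1GB" figure in the call tree above can be checked with a little arithmetic. The C sketch below assumes 4KB pages and 8-byte page table entries (typical for x86_64) and counts only the last-level entries; the function name is hypothetical. Upper-level tables add a few more kilobytes, and large-page mappings would need far less.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: 4KB pages and 8-byte last-level page table entries. */
#define PAGE_SIZE_BYTES 4096ULL
#define PTE_SIZE_BYTES  8ULL

/*
 * Bytes of last-level page table entries needed to map `mapped_bytes`
 * of memory with 4KB pages (one entry per page, rounded up).
 */
static uint64_t pte_bytes_needed(uint64_t mapped_bytes)
{
    uint64_t pages = (mapped_bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
    return pages * PTE_SIZE_BYTES;
}
```

For 1GB this gives 262144 pages times 8 bytes, which is 2MB, matching the estimate in the comment on init_mem_mapping().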
>
> According to Tejun's advice, before SRAT is parsed, we should try our best to
> allocate memory near the kernel image. Since the whole node the kernel resides
> in won't be hotpluggable, and a node in a modern server may have at least 16GB
> of memory, allocating several megabytes of memory around the kernel image won't
> cross into hotpluggable memory.
>
>
> [About this patch-set]
>
> So this patch-set does the following:
>
> 1. Make memblock able to allocate memory from low addresses to high addresses.
>    1) Keep all the memblock APIs' prototypes unmodified.
>    2) When the direction is bottom up, keep the start address greater than the
>       end of kernel image.
>
> 2. Improve init_mem_mapping() to support allocating page tables in the bottom-up direction.
>
> 3. Introduce the "movablenode" boot option to enable and disable this functionality.
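
The bottom-up search in item 1 above can be sketched in plain C as follows. This is a toy model, not the memblock implementation from the patch-set: the struct, the function names, and the convention of returning 0 on failure are all assumptions for illustration. It walks free ranges sorted by ascending address and returns the lowest suitably aligned address that does not start below the end of the kernel image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free range, half-open: [start, end). */
struct free_range {
    uint64_t start;
    uint64_t end;
};

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t x, uint64_t align)
{
    return (x + align - 1) & ~(align - 1);
}

/*
 * Bottom-up search: scan ranges (assumed sorted by ascending start) and
 * return the lowest aligned address at or above kernel_end where `size`
 * bytes fit. Returns 0 if nothing fits.
 */
static uint64_t find_bottom_up(const struct free_range *r, size_t n,
                               uint64_t kernel_end, uint64_t size,
                               uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    for (size_t i = 0; i < n; i++) {
        /* Never start below the end of the kernel image. */
        uint64_t lo = r[i].start > kernel_end ? r[i].start : kernel_end;
        uint64_t cand = align_up(lo, align);
        if (cand < r[i].end && r[i].end - cand >= size)
            return cand;
    }
    return 0;
}
```

A top-down search would instead start from the highest range and work downwards, which is what risks landing early allocations in memory that later turns out to be hotpluggable.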
>
> PS: Reordering of relocate_initrd() has not been done yet. acpi_initrd_override()
>     needs to access the initrd through its virtual address, so relocate_initrd()
>     must be done before acpi_initrd_override().
>
>
> Change log v2 -> v3:
> 1. According to Toshi's suggestion, move the direction checking logic into memblock,
>    and simplify the code further.
>
> Change log v1 -> v2:
> 1. According to tj's suggestion, implemented a new function memblock_alloc_bottom_up()
>    to allocate memory from the bottom upwards, which simplifies the code.
>
>
> Tang Chen (5):
>   memblock: Introduce allocation direction to memblock.
>   memblock: Improve memblock to support allocation from lower address.
>   x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is
>     parsed.
>   x86, mem-hotplug: Support initialize page tables from low to high.
>   mem-hotplug: Introduce movablenode boot option to control memblock
>     allocation direction.
>
>  Documentation/kernel-parameters.txt |   15 ++++
>  arch/x86/kernel/setup.c             |   44 ++++++++++++-
>  arch/x86/mm/init.c                  |  121 ++++++++++++++++++++++++++--------
>  include/linux/memblock.h            |   22 ++++++
>  include/linux/memory_hotplug.h      |    5 ++
>  mm/memblock.c                       |  120 +++++++++++++++++++++++++++++++----
>  mm/memory_hotplug.c                 |    9 +++
>  7 files changed, 293 insertions(+), 43 deletions(-)
>
>


--
Thanks.
Zhang Yanfei