From: Shameerali Kolothum Thodi
To: Jonathan Cameron, Oscar Salvador
CC: linux-mm@kvack.org, mhocko@suse.com, dan.j.williams@intel.com,
 Pavel.Tatashin@microsoft.com, david@redhat.com, linux-kernel@vger.kernel.org,
 dave.hansen@intel.com, Linuxarm, Robin Murphy
Subject: RE: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory
Date: Tue, 12 Feb 2019 13:21:38 +0000
Message-ID: <5FC3163CFD30C246ABAA99954A238FA8392B5DB6@lhreml524-mbs.china.huawei.com>
References: <20190122103708.11043-1-osalvador@suse.de> <20190212124707.000028ea@huawei.com>
In-Reply-To: <20190212124707.000028ea@huawei.com>

> -----Original Message-----
> From: Jonathan Cameron
> Sent: 12 February 2019 12:47
> To: Oscar Salvador
> Cc: linux-mm@kvack.org; mhocko@suse.com; dan.j.williams@intel.com;
> Pavel.Tatashin@microsoft.com; david@redhat.com;
> linux-kernel@vger.kernel.org; dave.hansen@intel.com; Shameerali Kolothum
> Thodi; Linuxarm; Robin Murphy
> Subject: Re: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from
> hotadded memory
>
> On Tue, 22 Jan 2019 11:37:04 +0100
> Oscar Salvador wrote:
>
> > Hi,
> >
> > this is the v2 of the first RFC I sent back then in October [1].
> > In this new version I tried to reduce the complexity as much as possible,
> > plus some clean-ups.
> >
> > [Testing]
> >
> > I have tested it on "x86_64" (small/big memblocks) and on "powerpc".
> > On both architectures hot-add/hot-remove and online/offline operations
> > worked as expected using vmemmap pages; I have not seen any issues so far.
> > I wanted to try it out on Hyper-V/Xen, but I did not manage to.
> > I plan to do so along this week (if time allows).
> > I would also like to test it on arm64, but I am not sure I can grab
> > an arm64 box anytime soon.
>
> Hi Oscar,
>
> I ran tests on one of our arm64 machines. The particular machine doesn't
> actually have the mechanics for hotplug, so it was all 'faked', but
> software-wise it's all the same.
>
> Upshot: it seems to work as expected on arm64 as well.
> Tested-by: Jonathan Cameron
>
> Remove currently relies on some out-of-tree patches (and dirty hacks) due
> to the usual issue with how arm64 does pfn_valid. It's not even vaguely
> ready for upstream. I'll aim to post an informational set for anyone else
> testing in this area (it's more or less just a rebase of the patches from
> a few years ago).
>
> +CC Shameer, who has been testing the virtualization side, for more details
> on that,

Right, I have sent out an RFC series [1] to enable memory hotplug for the
QEMU ARM virt platform. Using this QEMU, I ran a few tests with your patches
on a HiSilicon ARM64 platform. It looks like it is doing the job.

root@ubuntu:~# uname -a
Linux ubuntu 5.0.0-rc1-mm1-00173-g22b0744 #5 SMP PREEMPT Tue Feb 5 10:32:26 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux
root@ubuntu:~# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0
node 0 size: 981 MB
node 0 free: 854 MB
node 1 cpus:
node 1 size: 0 MB
node 1 free: 0 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
root@ubuntu:~#

(qemu)
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1,node=1

root@ubuntu:~#
root@ubuntu:~# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0
node 0 size: 981 MB
node 0 free: 853 MB
node 1 cpus:
node 1 size: 1008 MB
node 1 free: 1008 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
root@ubuntu:~#

FWIW,
Tested-by: Shameer Kolothum

Thanks,
Shameer

[1] https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06966.html

> and Robin, who is driving forward memory hotplug in general on the arm64
> side.
>
> Thanks,
>
> Jonathan
>
> >
> > [Coverletter]:
> >
> > This is another step to make memory hotplug more usable. The primary
> > goal of this patchset is to reduce the memory overhead of hot-added
> > memory (at least for the SPARSE_VMEMMAP memory model). The current way
> > we populate the memmap (struct page array) has two main drawbacks:
> >
> > a) it consumes additional memory until the hotadded memory itself is
> >    onlined, and
> > b) the memmap might end up on a different NUMA node, which is especially
> >    true for the movable_node configuration.
> >
> > a) is a problem especially for memory-hotplug-based memory "ballooning"
> >    solutions, where the delay between physical memory hotplug and
> >    onlining can lead to OOM; that led to the introduction of hacks like
> >    auto-onlining (see 31bc3858ea3e ("memory-hotplug: add automatic
> >    onlining policy for the newly added memory")).
> >
> > b) can have performance drawbacks.
> >
> > I have also seen hot-add operations failing on powerpc due to the fact
> > that we try to use order-8 pages when populating the memmap array.
> > Given the 64KB base pagesize, that is 16MB.
> > If we run out of those, we just fail the operation and cannot add
> > more memory.
> > We could fall back to base pages as x86_64 does, but we can do better.
> >
> > One way to mitigate all these issues is to simply allocate the memmap
> > array (which is the largest memory footprint of physical memory hotplug)
> > from the hotadded memory itself. The VMEMMAP memory model allows us to map
> > any pfn range, so the memory doesn't need to be online to be usable
> > for the array. See patch 3 for more details. In short, I am reusing the
> > existing vmem_altmap mechanism, which achieves the same thing for nvdimm
> > device memory.
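Just to make the idea above concrete for anyone following along, it boils
down to roughly the following (an illustrative sketch only: the helper name
and the way the range is passed in are invented here; the actual patches wire
this through the sparse/hotplug code rather than a standalone function):

#include <linux/memremap.h>
#include <linux/mm.h>

/*
 * Sketch: describe the hot-added range with a vmem_altmap so that
 * vmemmap_populate() takes the pages backing its memmap from the
 * beginning of that same range instead of from the page allocator.
 */
static int populate_memmap_from_hotadded_range(unsigned long start_pfn,
					       unsigned long nr_pages, int nid)
{
	struct vmem_altmap altmap = {
		.base_pfn = start_pfn,	/* first pfn of the hot-added range   */
		.free	  = nr_pages,	/* pfns the memmap may be carved from */
	};
	struct page *start_page = pfn_to_page(start_pfn);

	/*
	 * The vmemmap allocator consumes pfns from 'altmap' instead of the
	 * buddy allocator; altmap.alloc ends up holding the number of
	 * vmemmap pages that were taken out of the range.
	 */
	return vmemmap_populate((unsigned long)start_page,
				(unsigned long)(start_page + nr_pages),
				nid, &altmap);
}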
> >
> > There is also one potential drawback, though. If somebody uses memory
> > hotplug for 1G (gigantic) hugetlb pages, then this scheme will obviously
> > not work for them, because each memory block will contain a reserved
> > area. Large x86 machines will use 2G memblocks, so at least one 1G page
> > will be available, but this is still not 2G...
> >
> > I am not really sure somebody does that, or how reliably that can work.
> > Nevertheless, I _believe_ that onlining more memory into virtual machines
> > is a much more common usecase. Anyway, if there ever is a strong demand
> > for such a usecase, we have basically 3 options: a) enlarge memory blocks
> > even more, b) enhance the altmap allocation strategy and reuse low memory
> > sections to host memmaps of other sections on the same NUMA node, c) make
> > the memmap allocation strategy configurable to fall back to the current
> > allocation.
> >
> > [Overall design]:
> >
> > Let us say we hot-add 2GB of memory on an x86_64 (memblock size = 128M).
> > That is:
> >
> > - 16 sections
> > - 524288 pages
> > - 8192 vmemmap pages (out of those 524288; we spend 512 pages for each
> >   section)
> >
> > The range of pages is:  0xffffea0004000000 - 0xffffea0006000000
> > The vmemmap range is:   0xffffea0004000000 - 0xffffea0004080000
> >
> > 0xffffea0004000000 is the head vmemmap page (first page), while all the
> > others are "tails".
> >
> > We keep the following information in them:
> >
> > - Head page:
> >   - head->_refcount: number of sections
> >   - head->private:   number of vmemmap pages
> > - Tail page:
> >   - tail->freelist:  pointer to the head
> >
> > This is done because it eases the work in cases where we have to compute
> > the number of vmemmap pages to know how much we have to skip etc., and to
> > keep the right accounting for present_pages.
> >
> > When we want to hot-remove the range, we need to be careful because the
> > first pages of that range are used for the memmap mapping, so if we
> > removed those first, we would blow up while accessing the others later on.
> > For that reason we keep the number of sections in head->_refcount, to know
> > how long we have to defer the freeing.
> >
> > Since in a hot-remove operation sections are removed sequentially, the
> > approach taken here is that every time we hit free_section_memmap(), we
> > decrease the refcount of the head.
> > When it reaches 0, we know that we hit the last section, so we call
> > vmemmap_free() for the whole memory range backwards, making sure that
> > the pages used for the mapping are the last to be freed.
> >
> > The accounting is as follows:
> >
> > Vmemmap pages are charged to spanned_pages/present_pages, but not to
> > managed_pages.
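Put as code, the head/tail bookkeeping and the deferred free described above
amount to roughly this (again only a sketch with invented helper names; the
real accounting lives in the sparse/hotplug code itself):

#include <linux/mm.h>
#include <linux/page_ref.h>

/* Sketch: stash the metadata in the head and tail vmemmap pages. */
static void mark_vmemmap_pages(struct page *head, unsigned long nr_sections,
			       unsigned long nr_vmemmap_pages)
{
	unsigned long i;

	set_page_count(head, nr_sections);		/* head->_refcount */
	set_page_private(head, nr_vmemmap_pages);	/* head->private   */

	/* Every tail page points back at the head. */
	for (i = 1; i < nr_vmemmap_pages; i++)
		(head + i)->freelist = head;
}

/*
 * Sketch of the deferred free: called once per removed section.  Only when
 * the last section is gone is it safe to tear down the vmemmap range that
 * was carved out of the first pages of the hot-added block.
 */
static void put_vmemmap_head(struct page *head, unsigned long start,
			     unsigned long end)
{
	if (page_ref_dec_and_test(head))
		vmemmap_free(start, end, NULL);
}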
> >
> > I still have to check a couple of things, like creating an accounting
> > item such as VMEMMAP_PAGES to show in /proc/meminfo to make it easier to
> > spot the memory that went in there, testing Hyper-V/Xen to see how they
> > react to the fact that we are using the beginning of the memory range for
> > our own purposes, and checking the thing about gigantic pages + hotplug.
> > I also have to check that there are no compilation/runtime errors with
> > CONFIG_SPARSEMEM but !CONFIG_SPARSEMEM_VMEMMAP.
> > But before that, I would like to get people's feedback about the overall
> > design, and ideas/suggestions.
> >
> >
> > [1] https://patchwork.kernel.org/cover/10685835/
> >
> > Michal Hocko (3):
> >   mm, memory_hotplug: cleanup memory offline path
> >   mm, memory_hotplug: provide a more generic restrictions for memory
> >     hotplug
> >   mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap
> >
> > Oscar Salvador (1):
> >   mm, memory_hotplug: allocate memmap from the added memory range for
> >     sparse-vmemmap
> >
> >  arch/arm64/mm/mmu.c            |  10 ++-
> >  arch/ia64/mm/init.c            |   5 +-
> >  arch/powerpc/mm/init_64.c      |   7 ++
> >  arch/powerpc/mm/mem.c          |   6 +-
> >  arch/s390/mm/init.c            |  12 ++-
> >  arch/sh/mm/init.c              |   6 +-
> >  arch/x86/mm/init_32.c          |   6 +-
> >  arch/x86/mm/init_64.c          |  20 +++--
> >  drivers/hv/hv_balloon.c        |   1 +
> >  drivers/xen/balloon.c          |   1 +
> >  include/linux/memory_hotplug.h |  42 ++++++++--
> >  include/linux/memremap.h       |   2 +-
> >  include/linux/page-flags.h     |  23 +++++
> >  kernel/memremap.c              |   9 +-
> >  mm/compaction.c                |   8 ++
> >  mm/memory_hotplug.c            | 186 +++++++++++++++++++++++++++++------------
> >  mm/page_alloc.c                |  47 ++++++++++-
> >  mm/page_isolation.c            |  13 +++
> >  mm/sparse.c                    | 124 +++++++++++++++++++++++++--
> >  mm/util.c                      |   2 +
> >  20 files changed, 431 insertions(+), 99 deletions(-)
> >
>