To: Oscar Salvador, Andrew Morton
Cc: Michal Hocko, Anshuman Khandual, Pavel Tatashin, Vlastimil Babka,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20210408121804.10440-1-osalvador@suse.de>
            <20210408121804.10440-5-osalvador@suse.de>
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Subject: Re: [PATCH v7 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
Message-ID: <54bed4d3-631f-7d30-aa2c-f8dd2f2c6804@redhat.com>
Date: Thu, 15 Apr 2021 13:19:59 +0200
In-Reply-To: <20210408121804.10440-5-osalvador@suse.de>

On 08.04.21 14:18, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
> a) an existing memory is consumed for that purpose
>     (eg: ~2MB per 128MB memory section on x86_64)
> b) if the whole node is movable then we have off-node struct pages
>     which has performance drawbacks.
> c) It might be there are no PMD_ALIGNED chunks so memmap array gets
>     populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemmap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section
> and map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
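
(To make the ~2MB figure above concrete: assuming 4 KiB base pages and a
64-byte struct page on x86_64, a 128 MiB section holds 128 MiB / 4 KiB =
32768 pages, so its memmap is 32768 * 64 bytes = 2 MiB -- i.e., 512
vmemmap pages taken from the beginning of the hot-added range with this
approach.)
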
> 
> This patch also introduces the following functions:
> 
>  - vmemmap_init_space: Initializes vmemmap pages by calling
>    move_pfn_range_to_zone(), calls kasan_add_zero_shadow() for the
>    vmemmap range and marks online as many sections as vmemmap pages
>    fully span.
>  - vmemmap_adjust_pages: Accounts/subtracts vmemmap_pages to node and
>    zone present_pages
>  - vmemmap_deinit_space: Undoes what vmemmap_init_space does.
> 

This is a bit asynchronous, and the function names are not really
expressing what is being done :) I'll try to come up with better names
below.

It is worth mentioning that the real "mess" is that we want
offline_pages() to properly handle zone->present_pages going to 0.
Therefore, we want to manually mess with the present page count.

> Signed-off-by: Oscar Salvador
> ---
>   drivers/base/memory.c          |  64 ++++++++++++++--
>   include/linux/memory.h         |   8 +-
>   include/linux/memory_hotplug.h |  13 ++++
>   include/linux/memremap.h       |   2 +-
>   include/linux/mmzone.h         |   7 +-
>   mm/Kconfig                     |   5 ++
>   mm/memory_hotplug.c            | 162 ++++++++++++++++++++++++++++++++++++-
>   mm/sparse.c                    |   2 -
>   8 files changed, 247 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index f209925a5d4e..a5e536a3e9a4 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -173,16 +173,65 @@ static int memory_block_online(struct memory_block *mem)
>   {
>   	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
>   	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> +	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> +	int ret;
> +
> +	/*
> +	 * Although vmemmap pages have a different lifecycle than the pages
> +	 * they describe (they remain until the memory is unplugged), doing
> +	 * its initialization and accounting at hot-{online,offline} stage
> +	 * simplifies things a lot
> +	 */

I suggest detecting the zone in here and just passing it down to
online_pages().

> +	if (nr_vmemmap_pages) {
> +		ret = vmemmap_init_space(start_pfn, nr_vmemmap_pages, mem->nid,
> +					 mem->online_type);
> +		if (ret)
> +			return ret;
> +	}
>   
> -	return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
> +	ret = online_pages(start_pfn + nr_vmemmap_pages,
> +			   nr_pages - nr_vmemmap_pages, mem->online_type,
> +			   mem->nid);
> +
> +	/*
> +	 * Undo the work if online_pages() fails.
> +	 */
> +	if (ret && nr_vmemmap_pages) {
> +		vmemmap_adjust_pages(start_pfn, -nr_vmemmap_pages);
> +		vmemmap_deinit_space(start_pfn, nr_vmemmap_pages);
> +	}
> +
> +	return ret;
>   }

My take would be doing the present page adjustment after onlining
succeeded:

static int memory_block_online(struct memory_block *mem)
{
	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
	struct zone *zone;
	int ret;

	zone = mhp_get_target_zone(mem->nid, mem->online_type);

	if (nr_vmemmap_pages) {
		ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
		if (ret)
			return ret;
	}

	ret = online_pages(start_pfn + nr_vmemmap_pages,
			   nr_pages - nr_vmemmap_pages, zone);
	if (ret) {
		if (nr_vmemmap_pages)
			mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
		return ret;
	}

	/*
	 * Account once onlining succeeded. If the zone was unpopulated,
	 * it is now already properly populated.
	 */
	if (nr_vmemmap_pages)
		adjust_present_page_count(zone, nr_vmemmap_pages);

	return 0;
}
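
adjust_present_page_count() is only a suggested name here -- the idea is
that the zone/node present_pages accounting that vmemmap_adjust_pages()
does (and that online_pages() already does for ordinary pages) would live
in one generic helper. A minimal sketch, assuming it only has to touch
the two counters:

void adjust_present_page_count(struct zone *zone, long nr_pages)
{
	unsigned long flags;

	/* nr_pages is negative when unaccounting on offline */
	zone->present_pages += nr_pages;

	/* update the node total under the pgdat resize lock, as elsewhere */
	pgdat_resize_lock(zone->zone_pgdat, &flags);
	zone->zone_pgdat->node_present_pages += nr_pages;
	pgdat_resize_unlock(zone->zone_pgdat, &flags);
}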
And the opposite:

static int memory_block_offline(struct memory_block *mem)
{
	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
	struct zone *zone;
	int ret;

	zone = page_zone(pfn_to_page(start_pfn));

	/*
	 * Unaccount before offlining, such that unpopulated zones can
	 * properly be torn down in offline_pages().
	 */
	if (nr_vmemmap_pages)
		adjust_present_page_count(zone, -nr_vmemmap_pages);

	ret = offline_pages(start_pfn + nr_vmemmap_pages,
			    nr_pages - nr_vmemmap_pages);
	if (ret) {
		if (nr_vmemmap_pages)
			adjust_present_page_count(zone, +nr_vmemmap_pages);
		return ret;
	}

	if (nr_vmemmap_pages)
		mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);

	return 0;
}

Having to do the present page adjustment manually is not completely nice,
but it's easier than manually having to mess with zones becoming
populated/unpopulated outside of online_pages()/offline_pages().

-- 
Thanks,

David / dhildenb