From: David Hildenbrand <david@redhat.com>
To: Oscar Salvador, Andrew Morton
Cc: Michal Hocko, Anshuman Khandual, Pavel Tatashin, Vlastimil Babka,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v8 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
Date: Fri, 16 Apr 2021 12:33:34 +0200
In-Reply-To: <20210416102153.8794-5-osalvador@suse.de>
References: <20210416102153.8794-1-osalvador@suse.de> <20210416102153.8794-5-osalvador@suse.de>

On 16.04.21 12:21, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
> a) existing memory is consumed for that purpose
>    (e.g., ~2MB per 128MB memory section on x86_64)
> b) if the whole node is movable then we have off-node struct pages,
>    which has performance drawbacks
> c) it might be that there are no PMD_ALIGNED chunks, so the memmap
>    array gets populated with base pages
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemmap page tables can map arbitrary memory. That means we can simply
> use the beginning of each memory section and map struct pages there.
> The struct pages which back the allocated space then just need to be
> treated carefully.
> 
> Implementation-wise, we will reuse the vmem_altmap infrastructure to
> override the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on the memory_block structure
> gaining a new field which specifies the number of vmemmap_pages at
> the beginning.
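For readers who haven't seen the later hunks: the allocation side boils
down to handing an altmap down to the memmap-population code, along
these lines. Untested sketch pieced together from the description above;
"start", "size", "params" and the MHP_MEMMAP_ON_MEMORY flag refer to the
usual add_memory_resource() locals and this series' new flag, and the
third argument of create_memory_block_devices() is the new field
mentioned above -- not the exact patch hunks:

	struct vmem_altmap mhp_altmap = {};

	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
		/*
		 * Let the first pages of the hot-added range back their own
		 * memmap: on x86-64, a 128MB block has 32768 4KB pages, and
		 * at 64 bytes per struct page that is 2MB == 512 pages.
		 */
		mhp_altmap.base_pfn = PHYS_PFN(start);
		mhp_altmap.free = PHYS_PFN(size);
		params.altmap = &mhp_altmap;
	}

	/* ... arch_add_memory(nid, start, size, &params) ... */

	/* Remember how many pages the altmap handed out, for online/offline. */
	ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
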
> This patch also introduces the following functions:
> 
> - mhp_init_memmap_on_memory:
>   Initializes vmemmap pages by calling move_pfn_range_to_zone(),
>   calls kasan_add_zero_shadow(), and onlines as many sections
>   as vmemmap pages fully span.
> - mhp_deinit_memmap_on_memory:
>   Undoes what mhp_init_memmap_on_memory did.
> 
> The new function memory_block_online() calls mhp_init_memmap_on_memory()
> before doing the actual online_pages(). Should online_pages() fail, we
> clean up by calling mhp_deinit_memmap_on_memory(). Adjusting of
> present_pages is done at the end, once we know that online_pages()
> succeeded.
> 
> On offline, memory_block_offline() needs to unaccount vmemmap pages from
> present_pages before calling offline_pages(). This is necessary because
> offline_pages() tears down some structures based on whether the node or
> the zone becomes empty. If offline_pages() fails, we account the vmemmap
> pages back. If it succeeds, we call mhp_deinit_memmap_on_memory().
> 
> Hot-remove:
> 
>  We need to be careful when removing memory, as adding and
>  removing memory needs to be done with the same granularity.
>  To check that this assumption is not violated, we check the
>  memory range we want to remove and if a) any memory block has
>  vmemmap pages and b) the range spans more than a single memory
>  block, we scream out loud and refuse to proceed.
> 
>  If all is good and the range was using memmap on memory (aka vmemmap
>  pages), we construct an altmap structure so free_hugepage_table does
>  the right thing and calls vmem_altmap_free instead of free_pagetable.
> 
> Signed-off-by: Oscar Salvador
> ---
>  drivers/base/memory.c          |  75 ++++++++++++++++--
>  include/linux/memory.h         |   8 +-
>  include/linux/memory_hotplug.h |  17 +++-
>  include/linux/memremap.h       |   2 +-
>  include/linux/mmzone.h         |   7 +-
>  mm/Kconfig                     |   5 ++
>  mm/memory_hotplug.c            | 171 ++++++++++++++++++++++++++++++++++++---
>  mm/sparse.c                    |   2 -
>  8 files changed, 265 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index f209925a5d4e..179857d53982 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -173,16 +173,76 @@ static int memory_block_online(struct memory_block *mem)
>  {
>  	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
>  	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> +	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> +	struct zone *zone;
> +	int ret;
> +
> +	zone = mhp_get_target_zone(start_pfn, nr_pages, mem->nid,
> +				   mem->online_type);
> +
> +	/*
> +	 * Although vmemmap pages have a different lifecycle than the pages
> +	 * they describe (they remain until the memory is unplugged), doing
> +	 * its initialization and accounting at hot-{online,offline} stage

s/its/their/

s/hot-{online,offline} stage/memory onlining\/offlining stage/

> +	 * simplifies things a lot
> +	 */
> +	if (nr_vmemmap_pages) {
> +		ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	ret = online_pages(start_pfn + nr_vmemmap_pages,
> +			   nr_pages - nr_vmemmap_pages, zone);
> +	if (ret) {
> +		if (nr_vmemmap_pages)
> +			mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
> +		return ret;
> +	}
> +
> +	/*
> +	 * Account once onlining succeeded. If the page was unpopulated, it is

s/page/zone/

> +	 * now already properly populated.
> +	 */
> +	if (nr_vmemmap_pages)
> +		adjust_present_page_count(zone, nr_vmemmap_pages);
> 
> -	return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
> +	return ret;
>  }
> 
>  static int memory_block_offline(struct memory_block *mem)
>  {
>  	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
>  	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> +	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> +	struct zone *zone;
> +	int ret;
> +
> +	zone = page_zone(pfn_to_page(start_pfn));
> 
> -	return offline_pages(start_pfn, nr_pages);
> +	/*
> +	 * Unaccount before offlining, such that unpopulated zone and kthreads
> +	 * can properly be torn down in offline_pages().
> +	 */
> +	if (nr_vmemmap_pages)
> +		adjust_present_page_count(zone, -nr_vmemmap_pages);
> +
> +	ret = offline_pages(start_pfn + nr_vmemmap_pages,
> +			    nr_pages - nr_vmemmap_pages);
> +	if (ret) {
> +		/* offline_pages() failed. Account back. */
> +		if (nr_vmemmap_pages)
> +			adjust_present_page_count(zone, nr_vmemmap_pages);
> +		return ret;
> +	}
> +
> +	/*
> +	 * Re-adjust present pages if offline_pages() fails.
> +	 */

That comment is stale. I'd just drop it.

> +	if (nr_vmemmap_pages)
> +		mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
> +
> +	return ret;
>  }

[...]

> -static void adjust_present_page_count(struct zone *zone, long nr_pages)
> +/*
> + * This function should only be called by memory_block_{online,offline},
> + * and {online,offline}_pages.
> + */
> +void adjust_present_page_count(struct zone *zone, long nr_pages)
>  {
>  	unsigned long flags;
> 
> @@ -839,12 +850,64 @@ static void adjust_present_page_count(struct zone *zone, long nr_pages)
>  	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>  }
> 
> -int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> -		       int online_type, int nid)
> +struct zone *mhp_get_target_zone(unsigned long pfn, unsigned long nr_pages,
> +				 int nid, int online_type)
> +{
> +	return zone_for_pfn_range(online_type, nid, pfn, nr_pages);
> +}
> +

Oh, you can just use zone_for_pfn_range() directly for now. No need for
mhp_get_target_zone(). Sorry for not realizing this.

> +int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
> +			      struct zone *zone)
> +{
> +	unsigned long end_pfn = pfn + nr_pages;
> +	int ret;
> +
> +	/*
> +	 * Initialize vmemmap pages with the corresponding node, zone links set.
> +	 */
> +	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
> +
> +	ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
> +	if (ret) {
> +		remove_pfn_range_from_zone(zone, pfn, nr_pages);
> +		return ret;
> +	}

IIRC, we have to add the zero shadow first, before touching the memory.
This is also what mm/memremap.c does.

In mhp_deinit_memmap_on_memory(), you already remove in the proper
(reversed) order :)
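I.e., something like this (untested sketch, same calls as above, just
with the zero shadow established before the vmemmap pages are touched):

int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
			      struct zone *zone)
{
	int ret;

	/* Cover the range with zero shadow before writing any struct page. */
	ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
	if (ret)
		return ret;

	/* Now safe to initialize vmemmap pages with node/zone links set. */
	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
	...
}
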
> +
> +int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone)
>  {
>  	unsigned long flags;
> -	struct zone *zone;
>  	int need_zonelists_rebuild = 0;
> +	int nid;
>  	int ret;
>  	struct memory_notify arg;
> 
> @@ -860,8 +923,9 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> 
>  	mem_hotplug_begin();
> 
> +	nid = zone_to_nid(zone);

I'd do that right above:

	const int nid = zone_to_nid(zone);

[...]
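I.e., the top of the function would then read something like this
(sketch only; just the declaration moves, everything else unchanged):

int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
		       struct zone *zone)
{
	const int nid = zone_to_nid(zone);
	unsigned long flags;
	int need_zonelists_rebuild = 0;
	int ret;
	struct memory_notify arg;
	...

-- 
Thanks,

David / dhildenb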