Date: Wed, 24 Mar 2021 15:52:38 +0100
From: David Hildenbrand
Organization: Red Hat GmbH
To: Michal Hocko, Oscar Salvador
Cc: Andrew Morton, Anshuman Khandual, Vlastimil Babka, Pavel Tatashin, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 1/5] mm,memory_hotplug: Allocate memmap from the added memory range
Message-ID: <3bc4168c-fd31-0c9a-44ac-88e25d524eef@redhat.com>
References: <20210319092635.6214-1-osalvador@suse.de> <20210319092635.6214-2-osalvador@suse.de> <20210324101259.GB16560@linux>
On 24.03.21 15:42, Michal Hocko wrote:
> On Wed 24-03-21 13:03:29, Michal Hocko wrote:
>> On Wed 24-03-21 11:12:59, Oscar Salvador wrote:
> [...]
>>> I kind of understand being reluctant to use the vmemmap_pages
>>> terminology here, but unfortunately we need to know about it.
>>> We could rename nr_vmemmap_pages to offset_buddy_pages or something
>>> like that.
>>
>> I am not convinced. It seems you are just trying to graft the new
>> functionality in. But I still believe that {on,off}lining shouldn't care
>> about where their vmemmaps come from at all. It should be the
>> responsibility of the code which reserves that space to compensate for
>> the accounting. Otherwise we will end up with hard-to-maintain code,
>> because expectations would be spread across way too many places. Not to
>> mention the different pfns that the code should care about.
>
> The below is a quick hack on top of this patch to illustrate my
> thinking. I have dug all the vmemmap pieces out of the {on,off}lining
> paths and hooked up all the accounting where the space is reserved.
> This just compiles without any deeper look, so there are likely some
> minor problems, but I haven't really encountered any major problems or
> hacks to introduce into the code. The separation seems to be possible.
> The diffstat also looks promising. Am I missing something fundamental
> in this?
>

From a quick glimpse, this touches on two things discussed in the past:

1. If the underlying memory block is offline, all sections are offline.
Zone shrinking code will happily skip over the vmemmap pages, and you can
end up with out-of-zone pages assigned to the zone. This can happen in
corner cases. There is no way to know that the memmap of these pages was
initialized and is of value.

2. You heavily fragment the zone layout, even though you might otherwise
end up with consecutive zones (e.g., when onlining all hotplugged memory
as movable).

> ---
>  drivers/base/memory.c          |   8 +--
>  include/linux/memory_hotplug.h |   6 +-
>  mm/memory_hotplug.c            | 151 ++++++++++++++++++++--------------------
>  3 files changed, 80 insertions(+), 85 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 5ea2b3fbce02..9697acfe96eb 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -181,15 +181,15 @@ memory_block_action(unsigned long start_section_nr, unsigned long action,
>  	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
>  	int ret;
>  
> -	start_pfn = section_nr_to_pfn(start_section_nr);
> +	start_pfn = section_nr_to_pfn(start_section_nr) + nr_vmemmap_pages;
> +	nr_pages -= nr_vmemmap_pages;
>  
>  	switch (action) {
>  	case MEM_ONLINE:
> -		ret = online_pages(start_pfn, nr_pages, nr_vmemmap_pages,
> -				   online_type, nid);
> +		ret = online_pages(start_pfn, nr_pages, online_type, nid);
>  		break;
>  	case MEM_OFFLINE:
> -		ret = offline_pages(start_pfn, nr_pages, nr_vmemmap_pages);
> +		ret = offline_pages(start_pfn, nr_pages);
>  		break;
>  	default:
>  		WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index a85d4b7d15c2..673d2d4a8443 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -109,8 +109,7 @@ extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
>  extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
>  /* VM interface that may be used by firmware interface */
>  extern int online_pages(unsigned long pfn, unsigned long nr_pages,
> -			unsigned long nr_vmemmap_pages, int online_type,
> -			int nid);
> +			int online_type, int nid);
>  extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
>  					 unsigned long end_pfn);
>  extern void __offline_isolated_pages(unsigned long start_pfn,
> @@ -317,8 +316,7 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  
>  extern void try_offline_node(int nid);
> -extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> -			 unsigned long nr_vmemmap_pages);
> +extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>  extern int remove_memory(int nid, u64 start, u64 size);
>  extern void __remove_memory(int nid, u64 start, u64 size);
>  extern int offline_and_remove_memory(int nid, u64 start, u64 size);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0c3a98cb8cde..754026a9164d 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -844,30 +844,19 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
>  }
>  
>  int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> -		       unsigned long nr_vmemmap_pages, int online_type, int nid)
> +		       int online_type, int nid)
>  {
> -	unsigned long flags, buddy_start_pfn, buddy_nr_pages;
> +	unsigned long flags;
>  	struct zone *zone;
>  	int need_zonelists_rebuild = 0;
>  	int ret;
>  	struct memory_notify arg;
>  
> -	/* We can only online full sections (e.g., SECTION_IS_ONLINE) */
> -	if (WARN_ON_ONCE(!nr_pages ||
> -			 !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
> -		return -EINVAL;
> -
> -	buddy_start_pfn = pfn + nr_vmemmap_pages;
> -	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
> -
>  	mem_hotplug_begin();
>  
>  	/* associate pfn range with the zone */
>  	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
> -	if (nr_vmemmap_pages)
> -		move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
> -				       MIGRATE_UNMOVABLE);
> -	move_pfn_range_to_zone(zone, buddy_start_pfn, buddy_nr_pages, NULL,
> +	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL,
>  			       MIGRATE_ISOLATE);
>  
>  	arg.start_pfn = pfn;
> @@ -884,7 +873,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>  	 * onlining, such that undo_isolate_page_range() works correctly.
>  	 */
>  	spin_lock_irqsave(&zone->lock, flags);
> -	zone->nr_isolate_pageblock += buddy_nr_pages / pageblock_nr_pages;
> +	zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
>  	/*
> @@ -897,7 +886,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>  		setup_zone_pageset(zone);
>  	}
>  
> -	online_pages_range(pfn, nr_pages, buddy_start_pfn);
> +	online_pages_range(pfn, nr_pages, pfn);
>  	zone->present_pages += nr_pages;
>  
>  	pgdat_resize_lock(zone->zone_pgdat, &flags);
> @@ -910,9 +899,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>  	zone_pcp_update(zone);
>  
>  	/* Basic onlining is complete, allow allocation of onlined pages. */
> -	undo_isolate_page_range(buddy_start_pfn,
> -				buddy_start_pfn + buddy_nr_pages,
> -				MIGRATE_MOVABLE);
> +	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
>  
>  	/*
>  	 * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
> @@ -1126,6 +1113,59 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>  	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
>  }
>  
> +static void reserve_vmmemmap_space(int nid, unsigned long pfn, unsigned long nr_pages, struct vmem_altmap *altmap)
> +{
> +	struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_NORMAL];
> +
> +	altmap->free = nr_pages;
> +	altmap->base_pfn = pfn;
> +
> +	/* initialize struct pages and account for this space */
> +	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
> +}
> +
> +static void unaccount_vmemmap_space(int nid, unsigned long start_pfn, unsigned long nr_pages)
> +{
> +	struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_NORMAL];
> +	unsigned long flags;
> +
> +	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
> +	zone->present_pages -= nr_pages;
> +
> +	pgdat_resize_lock(zone->zone_pgdat, &flags);
> +	zone->zone_pgdat->node_present_pages -= nr_pages;
> +	pgdat_resize_unlock(zone->zone_pgdat, &flags);
> +
> +	remove_pfn_range_from_zone(zone, start_pfn, nr_pages);
> +}
> +
> +static int remove_memory_block_cb(struct memory_block *mem, void *arg)
> +{
> +	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> +	struct vmem_altmap mhp_altmap = {};
> +	struct vmem_altmap *altmap = NULL;
> +	u64 start = PFN_PHYS(section_nr_to_pfn(mem->start_section_nr));
> +	u64 size = memory_block_size_bytes();
> +
> +	if (!mem->nr_vmemmap_pages) {
> +		arch_remove_memory(mem->nid, start, size, NULL);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Let remove_pmd_table->free_hugepage_table
> +	 * do the right thing if we used vmem_altmap
> +	 * when hot-adding the range.
> +	 */
> +	mhp_altmap.alloc = nr_vmemmap_pages;
> +	altmap = &mhp_altmap;
> +
> +	unaccount_vmemmap_space(mem->nid, PHYS_PFN(start), nr_vmemmap_pages);
> +	arch_remove_memory(mem->nid, start, size, altmap);
> +
> +	return 0;
> +}
> +
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>   * and online/offline operations (triggered e.g. by sysfs).
> @@ -1170,8 +1210,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  			ret = -EINVAL;
>  			goto error;
>  		}
> -		mhp_altmap.free = PHYS_PFN(size);
> -		mhp_altmap.base_pfn = PHYS_PFN(start);
> +		reserve_vmmemmap_space(nid, PHYS_PFN(start), PHYS_PFN(size), &mhp_altmap);
>  		params.altmap = &mhp_altmap;
>  	}
>  
> @@ -1639,25 +1678,16 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
>  	return 0;
>  }
>  
> -int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> -			unsigned long nr_vmemmap_pages)
> +int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  {
>  	const unsigned long end_pfn = start_pfn + nr_pages;
> -	unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
> +	unsigned long pfn, system_ram_pages = 0;
>  	unsigned long flags;
>  	struct zone *zone;
>  	struct memory_notify arg;
>  	int ret, node;
>  	char *reason;
>  
> -	/* We can only offline full sections (e.g., SECTION_IS_ONLINE) */
> -	if (WARN_ON_ONCE(!nr_pages ||
> -			 !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
> -		return -EINVAL;
> -
> -	buddy_start_pfn = start_pfn + nr_vmemmap_pages;
> -	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
> -
>  	mem_hotplug_begin();
>  
>  	/*
> @@ -1693,7 +1723,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>  	zone_pcp_disable(zone);
>  
>  	/* set above range as isolated */
> -	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
> +	ret = start_isolate_page_range(start_pfn, end_pfn,
>  				       MIGRATE_MOVABLE,
>  				       MEMORY_OFFLINE | REPORT_FAILURE);
>  	if (ret) {
> @@ -1713,7 +1743,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>  	}
>  
>  	do {
> -		pfn = buddy_start_pfn;
> +		pfn = start_pfn;
>  		do {
>  			if (signal_pending(current)) {
>  				ret = -EINTR;
> @@ -1744,18 +1774,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>  		 * offlining actually in order to make hugetlbfs's object
>  		 * counting consistent.
>  		 */
> -		ret = dissolve_free_huge_pages(buddy_start_pfn, end_pfn);
> +		ret = dissolve_free_huge_pages(start_pfn, end_pfn);
>  		if (ret) {
>  			reason = "failure to dissolve huge pages";
>  			goto failed_removal_isolated;
>  		}
>  
> -		ret = test_pages_isolated(buddy_start_pfn, end_pfn, MEMORY_OFFLINE);
> +		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
>  
>  	} while (ret);
>  
>  	/* Mark all sections offline and remove free pages from the buddy. */
> -	__offline_isolated_pages(start_pfn, end_pfn, buddy_start_pfn);
> +	__offline_isolated_pages(start_pfn, end_pfn, start_pfn);
>  	pr_debug("Offlined Pages %ld\n", nr_pages);
>  
>  	/*
> @@ -1764,13 +1794,13 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>  	 * of isolated pageblocks, memory onlining will properly revert this.
>  	 */
>  	spin_lock_irqsave(&zone->lock, flags);
> -	zone->nr_isolate_pageblock -= buddy_nr_pages / pageblock_nr_pages;
> +	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
>  	zone_pcp_enable(zone);
>  
>  	/* removal success */
> -	adjust_managed_page_count(pfn_to_page(start_pfn), -buddy_nr_pages);
> +	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
>  	zone->present_pages -= nr_pages;
>  
>  	pgdat_resize_lock(zone->zone_pgdat, &flags);
> @@ -1799,7 +1829,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>  	return 0;
>  
>  failed_removal_isolated:
> -	undo_isolate_page_range(buddy_start_pfn, end_pfn, MIGRATE_MOVABLE);
> +	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>  	memory_notify(MEM_CANCEL_OFFLINE, &arg);
> failed_removal_pcplists_disabled:
>  	zone_pcp_enable(zone);
> @@ -1830,14 +1860,6 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
>  	return 0;
>  }
>  
> -static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
> -{
> -	/*
> -	 * If not set, continue with the next block.
> -	 */
> -	return mem->nr_vmemmap_pages;
> -}
> -
>  static int check_cpu_on_node(pg_data_t *pgdat)
>  {
>  	int cpu;
> @@ -1912,9 +1934,6 @@ EXPORT_SYMBOL(try_offline_node);
>  static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  {
>  	int rc = 0;
> -	struct vmem_altmap mhp_altmap = {};
> -	struct vmem_altmap *altmap = NULL;
> -	unsigned long nr_vmemmap_pages = 0;
>  
>  	BUG_ON(check_hotplug_memory_range(start, size));
>  
> @@ -1927,31 +1946,6 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  	if (rc)
>  		return rc;
>  
> -	/*
> -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> -	 * the same granularity it was added - a single memory block.
> -	 */
> -	if (memmap_on_memory) {
> -		nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> -						      get_nr_vmemmap_pages_cb);
> -		if (nr_vmemmap_pages) {
> -			if (size != memory_block_size_bytes()) {
> -				pr_warn("Refuse to remove %#llx - %#llx,"
> -					"wrong granularity\n",
> -					start, start + size);
> -				return -EINVAL;
> -			}
> -
> -			/*
> -			 * Let remove_pmd_table->free_hugepage_table
> -			 * do the right thing if we used vmem_altmap
> -			 * when hot-adding the range.
> -			 */
> -			mhp_altmap.alloc = nr_vmemmap_pages;
> -			altmap = &mhp_altmap;
> -		}
> -	}
> -
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
>  
> @@ -1963,7 +1957,10 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  
>  	mem_hotplug_begin();
>  
> -	arch_remove_memory(nid, start, size, altmap);
> +	if (!memmap_on_memory)
> +		arch_remove_memory(nid, start, size, NULL);
> +	else
> +		walk_memory_blocks(start, size, NULL, remove_memory_block_cb);
>  
>  	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
>  		memblock_free(start, size);
>

-- 
Thanks,

David / dhildenb