From: David Hildenbrand <david@redhat.com>
To: Oscar Salvador, Andrew Morton
Cc: Michal Hocko, Anshuman Khandual, Pavel Tatashin, Vlastimil Babka,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v8 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
Date: Fri, 16 Apr 2021 12:33:34 +0200
In-Reply-To: <20210416102153.8794-5-osalvador@suse.de>
References: <20210416102153.8794-1-osalvador@suse.de> <20210416102153.8794-5-osalvador@suse.de>

On 16.04.21 12:21, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
> a) existing memory is consumed for that purpose
>    (e.g., ~2MB per 128MB memory section on x86_64)
> b) if the whole node is movable then we have off-node struct pages,
>    which has performance drawbacks
> c) it might be that there are no PMD_ALIGNED chunks, so the memmap
>    array gets populated with base pages
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemmap page tables can map arbitrary memory. That means we can simply
> use the beginning of each memory section and map struct pages there.
> The struct pages which back the allocated space then just need to be
> treated carefully.
> 
> Implementation-wise, we will reuse the vmem_altmap infrastructure to
> override the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on the memory_block structure
> gaining a new field which specifies the number of vmemmap_pages at
> the beginning.
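For readers who haven't seen the later hunks: the allocation side boils
down to handing an altmap down to the memmap-population code, along
these lines. Untested sketch pieced together from the description above;
"start", "size", "params" and the MHP_MEMMAP_ON_MEMORY flag refer to the
usual add_memory_resource() locals and this series' new flag, and the
third argument of create_memory_block_devices() is the new field
mentioned above -- not the exact patch hunks:

	struct vmem_altmap mhp_altmap = {};

	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
		/*
		 * Let the first pages of the hot-added range back their own
		 * memmap: on x86-64, a 128MB block has 32768 4KB pages, and
		 * at 64 bytes per struct page that is 2MB == 512 pages.
		 */
		mhp_altmap.base_pfn = PHYS_PFN(start);
		mhp_altmap.free = PHYS_PFN(size);
		params.altmap = &mhp_altmap;
	}

	/* ... arch_add_memory(nid, start, size, &params) ... */

	/* Remember how many pages the altmap handed out, for online/offline. */
	ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
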
> This patch also introduces the following functions:
> 
> - mhp_init_memmap_on_memory:
>   Initializes vmemmap pages by calling move_pfn_range_to_zone(),
>   calls kasan_add_zero_shadow(), and onlines as many sections
>   as vmemmap pages fully span.
> - mhp_deinit_memmap_on_memory:
>   Undoes what mhp_init_memmap_on_memory did.
> 
> The new function memory_block_online() calls mhp_init_memmap_on_memory()
> before doing the actual online_pages(). Should online_pages() fail, we
> clean up by calling mhp_deinit_memmap_on_memory(). Adjusting of
> present_pages is done at the end, once we know that online_pages()
> succeeded.
> 
> On offline, memory_block_offline() needs to unaccount vmemmap pages from
> present_pages before calling offline_pages(). This is necessary because
> offline_pages() tears down some structures based on whether the node or
> the zone becomes empty. If offline_pages() fails, we account the vmemmap
> pages back. If it succeeds, we call mhp_deinit_memmap_on_memory().
> 
> Hot-remove:
> 
>  We need to be careful when removing memory, as adding and
>  removing memory needs to be done with the same granularity.
>  To check that this assumption is not violated, we check the
>  memory range we want to remove and if a) any memory block has
>  vmemmap pages and b) the range spans more than a single memory
>  block, we scream out loud and refuse to proceed.
> 
>  If all is good and the range was using memmap on memory (aka vmemmap
>  pages), we construct an altmap structure so free_hugepage_table does
>  the right thing and calls vmem_altmap_free instead of free_pagetable.
> 
> Signed-off-by: Oscar Salvador
> ---
>  drivers/base/memory.c          |  75 ++++++++++++++++--
>  include/linux/memory.h         |   8 +-
>  include/linux/memory_hotplug.h |  17 +++-
>  include/linux/memremap.h       |   2 +-
>  include/linux/mmzone.h         |   7 +-
>  mm/Kconfig                     |   5 ++
>  mm/memory_hotplug.c            | 171 ++++++++++++++++++++++++++++++++++++---
>  mm/sparse.c                    |   2 -
>  8 files changed, 265 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index f209925a5d4e..179857d53982 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -173,16 +173,76 @@ static int memory_block_online(struct memory_block *mem)
>  {
>  	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
>  	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> +	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> +	struct zone *zone;
> +	int ret;
> +
> +	zone = mhp_get_target_zone(start_pfn, nr_pages, mem->nid,
> +				   mem->online_type);
> +
> +	/*
> +	 * Although vmemmap pages have a different lifecycle than the pages
> +	 * they describe (they remain until the memory is unplugged), doing
> +	 * its initialization and accounting at hot-{online,offline} stage

s/its/their/

s/hot-{online,offline} stage/memory onlining\/offlining stage/

> +	 * simplifies things a lot
> +	 */
> +	if (nr_vmemmap_pages) {
> +		ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	ret = online_pages(start_pfn + nr_vmemmap_pages,
> +			   nr_pages - nr_vmemmap_pages, zone);
> +	if (ret) {
> +		if (nr_vmemmap_pages)
> +			mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
> +		return ret;
> +	}
> +
> +	/*
> +	 * Account once onlining succeeded. If the page was unpopulated, it is

s/page/zone/

> +	 * now already properly populated.
> +	 */
> +	if (nr_vmemmap_pages)
> +		adjust_present_page_count(zone, nr_vmemmap_pages);
> 
> -	return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
> +	return ret;
>  }
> 
>  static int memory_block_offline(struct memory_block *mem)
>  {
>  	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
>  	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> +	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> +	struct zone *zone;
> +	int ret;
> +
> +	zone = page_zone(pfn_to_page(start_pfn));
> 
> -	return offline_pages(start_pfn, nr_pages);
> +	/*
> +	 * Unaccount before offlining, such that unpopulated zone and kthreads
> +	 * can properly be torn down in offline_pages().
> +	 */
> +	if (nr_vmemmap_pages)
> +		adjust_present_page_count(zone, -nr_vmemmap_pages);
> +
> +	ret = offline_pages(start_pfn + nr_vmemmap_pages,
> +			    nr_pages - nr_vmemmap_pages);
> +	if (ret) {
> +		/* offline_pages() failed. Account back. */
> +		if (nr_vmemmap_pages)
> +			adjust_present_page_count(zone, nr_vmemmap_pages);
> +		return ret;
> +	}
> +
> +	/*
> +	 * Re-adjust present pages if offline_pages() fails.
> +	 */

That comment is stale. I'd just drop it.

> +	if (nr_vmemmap_pages)
> +		mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
> +
> +	return ret;
>  }

[...]

> -static void adjust_present_page_count(struct zone *zone, long nr_pages)
> +/*
> + * This function should only be called by memory_block_{online,offline},
> + * and {online,offline}_pages.
> + */
> +void adjust_present_page_count(struct zone *zone, long nr_pages)
>  {
>  	unsigned long flags;
> 
> @@ -839,12 +850,64 @@ static void adjust_present_page_count(struct zone *zone, long nr_pages)
>  	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>  }
> 
> -int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> -		       int online_type, int nid)
> +struct zone *mhp_get_target_zone(unsigned long pfn, unsigned long nr_pages,
> +				 int nid, int online_type)
> +{
> +	return zone_for_pfn_range(online_type, nid, pfn, nr_pages);
> +}
> +

Oh, you can just use zone_for_pfn_range() directly for now. No need for
mhp_get_target_zone(). Sorry for not realizing this.

> +int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
> +			      struct zone *zone)
> +{
> +	unsigned long end_pfn = pfn + nr_pages;
> +	int ret;
> +
> +	/*
> +	 * Initialize vmemmap pages with the corresponding node, zone links set.
> +	 */
> +	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
> +
> +	ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
> +	if (ret) {
> +		remove_pfn_range_from_zone(zone, pfn, nr_pages);
> +		return ret;
> +	}

IIRC, we have to add the zero shadow first, before touching the memory.
This is also what mm/memremap.c does.

In mhp_deinit_memmap_on_memory(), you already remove in the proper
(reversed) order :)
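I.e., something like this (untested sketch, same calls as above, just
with the zero shadow established before the vmemmap pages are touched):

int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
			      struct zone *zone)
{
	int ret;

	/* Cover the range with zero shadow before writing any struct page. */
	ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
	if (ret)
		return ret;

	/* Now safe to initialize vmemmap pages with node/zone links set. */
	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
	...
}
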
> +
> +int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone)
>  {
>  	unsigned long flags;
> -	struct zone *zone;
>  	int need_zonelists_rebuild = 0;
> +	int nid;
>  	int ret;
>  	struct memory_notify arg;
> 
> @@ -860,8 +923,9 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> 
>  	mem_hotplug_begin();
> 
> +	nid = zone_to_nid(zone);

I'd do that right above:

	const int nid = zone_to_nid(zone);

[...]
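I.e., the top of the function would then read something like this
(sketch only; just the declaration moves, everything else unchanged):

int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
		       struct zone *zone)
{
	const int nid = zone_to_nid(zone);
	unsigned long flags;
	int need_zonelists_rebuild = 0;
	int ret;
	struct memory_notify arg;
	...

-- 
Thanks,

David / dhildenb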