To: Oscar Salvador, Andrew Morton
Cc: Michal Hocko, Vlastimil Babka <vbabka@suse.cz>, pasha.tatashin@soleen.com,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, Anshuman Khandual
References: <20210209133854.17399-1-osalvador@suse.de>
 <20210209133854.17399-2-osalvador@suse.de>
From: David Hildenbrand
Organization: Red Hat GmbH
Subject: Re: [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range
Message-ID: <60afb5ca-230e-265f-9579-dac66a152c33@redhat.com>
Date: Thu, 25 Feb 2021 19:58:01 +0100
In-Reply-To: <20210209133854.17399-2-osalvador@suse.de>

On 09.02.21 14:38, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
> a) existing memory is consumed for that purpose
>    (eg: ~2MB per 128MB memory section on x86_64)
> b) if the whole node is movable then we have off-node struct pages,
>    which has performance drawbacks.
> c) it might be that there are no PMD_ALIGNED chunks, so the memmap array
>    gets populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemmap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation wise we will reuse the vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on the memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This comes in handy as in {online,offline}_pages, all the isolation and
> migration is being done on the (buddy_start_pfn, end_pfn] range,
> where buddy_start_pfn = start_pfn + nr_vmemmap_pages.
> 
> In this way, we have:
> 
> (start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> (buddy_start_pfn, end_pfn] = Initialized and sent to buddy

nit: shouldn't it be

[start_pfn, buddy_start_pfn - 1]
[buddy_start_pfn, end_pfn - 1]

or

[start_pfn, buddy_start_pfn)
[buddy_start_pfn, end_pfn)

(I remember that "[" means inclusive and "(" means exclusive; I might be
wrong :) )

I actually prefer the first variant.
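Side note for anyone following along: a rough, stand-alone userspace sketch
of the split being described. The x86-64 numbers are assumptions (4 KiB
pages, 64-byte struct page, 128 MiB memory blocks), and the start_pfn is
made up - this is purely illustrative, not code from the patch.

/* layout_sketch.c - illustrates the vmemmap/buddy split of one block */
#include <stdio.h>

#define PAGE_SIZE               4096UL
#define STRUCT_PAGE_SIZE        64UL            /* sizeof(struct page) on x86-64 */
#define BLOCK_BYTES             (128UL << 20)   /* memory_block_size_bytes() */

int main(void)
{
        unsigned long nr_pages = BLOCK_BYTES / PAGE_SIZE;               /* 32768 */
        unsigned long nr_vmemmap_pages =
                nr_pages * STRUCT_PAGE_SIZE / PAGE_SIZE;                /* 512 -> 2 MiB */
        unsigned long start_pfn = 0x800000;             /* made-up hotplugged block */
        unsigned long buddy_start_pfn = start_pfn + nr_vmemmap_pages;
        unsigned long end_pfn = start_pfn + nr_pages;

        printf("[%#lx, %#lx)  vmemmap, initialized + PageReserved\n",
               start_pfn, buddy_start_pfn);
        printf("[%#lx, %#lx)  initialized and handed to the buddy\n",
               buddy_start_pfn, end_pfn);
        return 0;
}

With those assumptions the vmemmap occupies the first 512 pfns of the block -
the ~2MB per 128MB mentioned above - and only the remainder is exposed to the
buddy.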
> 
> Hot-remove:
> 
>  We need to be careful when removing memory, as adding and
>  removing memory needs to be done with the same granularity.
>  To check that this assumption is not violated, we check the
>  memory range we want to remove and if a) any memory block has
>  vmemmap pages and b) the range spans more than a single memory
>  block, we scream out loud and refuse to proceed.
> 
>  If all is good and the range was using memmap on memory (aka vmemmap pages),
>  we construct an altmap structure so free_hugepage_table does the right
>  thing and calls vmem_altmap_free instead of free_pagetable.
> 
> Signed-off-by: Oscar Salvador

[...]

>  /*
>   * online_page_callback contains pointer to current page onlining function.
>   * Initially it is generic_online_page(). If it is required it could be
> @@ -638,10 +640,20 @@ void generic_online_page(struct page *page, unsigned int order)
>  }
>  EXPORT_SYMBOL_GPL(generic_online_page);
>  
> -static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> +static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> +                               unsigned long buddy_start_pfn)
>  {
>         const unsigned long end_pfn = start_pfn + nr_pages;
> -       unsigned long pfn;
> +       unsigned long pfn = buddy_start_pfn;
> +
> +       /*
> +        * When using memmap_on_memory, the range might be unaligned as the
> +        * first pfns are used for vmemmap pages. Align it in case we need to.
> +        */
> +       if (pfn & ((1 << (MAX_ORDER - 1)) - 1)) {

if (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES))

> +               (*online_page_callback)(pfn_to_page(pfn), pageblock_order);
> +               pfn += 1 << pageblock_order;

pfn += pageblock_nr_pages;

Can you add a comment why we can be sure that we are off by a single
pageblock? What about s390x, where MAX_ORDER_NR_PAGES == 4 * pageblock_nr_pages?

Would it make things simpler to just do a

while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
        (*online_page_callback)(pfn_to_page(pfn), 0);
        pfn++;
}
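To make the s390x concern concrete, a stand-alone sketch - not kernel code -
with constants that roughly mirror s390x (pageblock_order 8, MAX_ORDER 11, so
MAX_ORDER_NR_PAGES == 4 * pageblock_nr_pages); the pfn values and the 512
vmemmap pages are hypothetical:

/* alignment_sketch.c - why a single pageblock fixup is not always enough */
#include <stdio.h>

#define MAX_ORDER               11
#define MAX_ORDER_NR_PAGES      (1UL << (MAX_ORDER - 1))        /* 1024 */
#define pageblock_order         8
#define pageblock_nr_pages      (1UL << pageblock_order)        /* 256 */

#define IS_ALIGNED(x, a)        (((x) & ((a) - 1)) == 0)

int main(void)
{
        unsigned long start_pfn = 0x200000;                     /* MAX_ORDER aligned */
        unsigned long buddy_start_pfn = start_pfn + 512;        /* two pageblocks in */
        unsigned long pfn = buddy_start_pfn;
        unsigned long onlined = 0;

        /*
         * A single pageblock-sized fixup would only get us to start_pfn + 768,
         * which is still not MAX_ORDER aligned.  The loop keeps onlining
         * order-0 pages (think (*online_page_callback)(pfn_to_page(pfn), 0))
         * until the boundary is reached, whatever the ratio of
         * MAX_ORDER_NR_PAGES to pageblock_nr_pages.
         */
        while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
                pfn++;
                onlined++;
        }

        printf("onlined %lu order-0 pages, MAX_ORDER aligned again at %#lx\n",
               onlined, pfn);
        return 0;
}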
[...]

>  
>         /*
>          * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
> @@ -1064,6 +1085,12 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>         return device_online(&mem->dev);
>  }
>  
> +bool mhp_supports_memmap_on_memory(unsigned long size)
> +{
> +       return memmap_on_memory_enabled &&
> +              size == memory_block_size_bytes();

Regarding my other comments in reply to the other patches, I'd move all the
magic you have when trying to enable it right here.

>  
> -int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> +int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> +                       unsigned long nr_vmemmap_pages)
>  {
>         const unsigned long end_pfn = start_pfn + nr_pages;
> -       unsigned long pfn, system_ram_pages = 0;
> +       unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
>         unsigned long flags;
>         struct zone *zone;
>         struct memory_notify arg;
> @@ -1578,6 +1620,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>             !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
>                 return -EINVAL;
>  
> +       buddy_start_pfn = start_pfn + nr_vmemmap_pages;
> +       buddy_nr_pages = nr_pages - nr_vmemmap_pages;
> +
>         mem_hotplug_begin();
>  
>         /*
> @@ -1613,7 +1658,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>         zone_pcp_disable(zone);
>  
>         /* set above range as isolated */
> -       ret = start_isolate_page_range(start_pfn, end_pfn,
> +       ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
>                                        MIGRATE_MOVABLE,
>                                        MEMORY_OFFLINE | REPORT_FAILURE);

Did you take care to properly adjust undo_isolate_page_range() as well? I
can't spot it.

>         if (ret) {
> @@ -1633,7 +1678,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>         }
>  
>         do {
> -               pfn = start_pfn;
> +               pfn = buddy_start_pfn;
>                 do {
>                         if (signal_pending(current)) {
>                                 ret = -EINTR;
> @@ -1664,18 +1709,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>                  * offlining actually in order to make hugetlbfs's object
>                  * counting consistent.
>                  */
> -               ret = dissolve_free_huge_pages(start_pfn, end_pfn);
> +               ret = dissolve_free_huge_pages(buddy_start_pfn, end_pfn);
>                 if (ret) {
>                         reason = "failure to dissolve huge pages";
>                         goto failed_removal_isolated;
>                 }
>  
> -               ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
> +               ret = test_pages_isolated(buddy_start_pfn, end_pfn, MEMORY_OFFLINE);
>  
>         } while (ret);
>  
>         /* Mark all sections offline and remove free pages from the buddy. */
> -       __offline_isolated_pages(start_pfn, end_pfn);
> +       __offline_isolated_pages(start_pfn, end_pfn, buddy_start_pfn);
>         pr_debug("Offlined Pages %ld\n", nr_pages);
>  
>         /*
> @@ -1684,13 +1729,13 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>          * of isolated pageblocks, memory onlining will properly revert this.
>          */
>         spin_lock_irqsave(&zone->lock, flags);
> -       zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
> +       zone->nr_isolate_pageblock -= buddy_nr_pages / pageblock_nr_pages;
>         spin_unlock_irqrestore(&zone->lock, flags);
>  
>         zone_pcp_enable(zone);
>  
>         /* removal success */
> -       adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
> +       adjust_managed_page_count(pfn_to_page(start_pfn), -buddy_nr_pages);
>         zone->present_pages -= nr_pages;
>  
>         pgdat_resize_lock(zone->zone_pgdat, &flags);
> @@ -1750,6 +1795,14 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
>         return 0;
>  }
>  

[...]

> +static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
> +{
> +       unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> +
> +       *nr_vmemmap_pages += mem->nr_vmemmap_pages;
> +       return mem->nr_vmemmap_pages;
> +}
> +

I think you can do this easier: all you want to know is whether there is any
block that has nr_vmemmap_pages set - and return the value.

static int first_set_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
{
        /* If not set, continue with the next block. */
        return mem->nr_vmemmap_pages;
}

...

> +       walk_memory_blocks(start, size, &nr_vmemmap_pages,
> +                          get_nr_vmemmap_pages_cb);

...

mem->nr_vmemmap_pages = walk_memory_blocks(start ...)

Looks quite promising, only a couple of things to fine-tune :)

Thanks!

-- 
Thanks,

David / dhildenb