From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sun, 8 Oct 2023 16:57:22 +0800
From: Yajun Deng <yajun.deng@linux.dev>
To: David Hildenbrand, Mike Rapoport
Cc: akpm@linux-foundation.org, mike.kravetz@oracle.com, muchun.song@linux.dev, willy@infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 2/2] mm: Init page count in reserve_bootmem_region when MEMINIT_EARLY
Message-ID: <2f8c4741-5c7f-272d-9cef-9fda9fbc7ca6@linux.dev>
References: <20230928083302.386202-1-yajun.deng@linux.dev> <20230928083302.386202-3-yajun.deng@linux.dev> <20230929083018.GU3303@kernel.org>

On 2023/10/2 16:30, David Hildenbrand wrote:
> On 29.09.23 10:30, Mike Rapoport wrote:
>> On Thu, Sep 28, 2023 at 04:33:02PM +0800, Yajun Deng wrote:
>>> memmap_init_range() would init page count of all pages, but the free
>>> pages count would be reset in __free_pages_core(). There are opposite
>>> operations.
>>> It's unnecessary and time-consuming when it's MEMINIT_EARLY context.
>>>
>>> Init page count in reserve_bootmem_region when in MEMINIT_EARLY context,
>>> and check the page count before reset it.
>>>
>>> At the same time, the INIT_LIST_HEAD in reserve_bootmem_region isn't
>>> need, as it already done in __init_single_page.
>>>
>>> The following data was tested on an x86 machine with 190GB of RAM.
>>>
>>> before:
>>> free_low_memory_core_early()    341ms
>>>
>>> after:
>>> free_low_memory_core_early()    285ms
>>>
>>> Signed-off-by: Yajun Deng
>>> ---
>>> v4: same with v2.
>>> v3: same with v2.
>>> v2: check page count instead of check context before reset it.
>>> v1: https://lore.kernel.org/all/20230922070923.355656-1-yajun.deng@linux.dev/
>>> ---
>>>  mm/mm_init.c    | 18 +++++++++++++-----
>>>  mm/page_alloc.c | 20 ++++++++++++--------
>>>  2 files changed, 25 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/mm/mm_init.c b/mm/mm_init.c
>>> index 9716c8a7ade9..3ab8861e1ef3 100644
>>> --- a/mm/mm_init.c
>>> +++ b/mm/mm_init.c
>>> @@ -718,7 +718,7 @@ static void __meminit init_reserved_page(unsigned long pfn, int nid)
>>>          if (zone_spans_pfn(zone, pfn))
>>>              break;
>>>      }
>>> -    __init_single_page(pfn_to_page(pfn), pfn, zid, nid, INIT_PAGE_COUNT);
>>> +    __init_single_page(pfn_to_page(pfn), pfn, zid, nid, 0);
>>>  }
>>>  #else
>>>  static inline void pgdat_set_deferred_range(pg_data_t *pgdat) {}
>>> @@ -756,8 +756,8 @@ void __meminit reserve_bootmem_region(phys_addr_t start,
>>>
>>>              init_reserved_page(start_pfn, nid);
>>>
>>> -            /* Avoid false-positive PageTail() */
>>> -            INIT_LIST_HEAD(&page->lru);
>>> +            /* Init page count for reserved region */
>>
>> Please add a comment that describes _why_ we initialize the page
>> count here.
>>
>>> +            init_page_count(page);
>>>
>>>              /*
>>>               * no need for atomic set_bit because the struct
>>> @@ -888,9 +888,17 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
>>>          }
>>>
>>>          page = pfn_to_page(pfn);
>>> -        __init_single_page(page, pfn, zone, nid, INIT_PAGE_COUNT);
>>> -        if (context == MEMINIT_HOTPLUG)
>>> +
>>> +        /* If the context is MEMINIT_EARLY, we will init page count and
>>> +         * mark page reserved in reserve_bootmem_region, the free region
>>> +         * wouldn't have page count and we will check the pages count
>>> +         * in __free_pages_core.
>>> +         */
>>> +        __init_single_page(page, pfn, zone, nid, 0);
>>> +        if (context == MEMINIT_HOTPLUG) {
>>> +            init_page_count(page);
>>>              __SetPageReserved(page);
>>
>> Rather than calling init_page_count() and __SetPageReserved() for
>> MEMINIT_HOTPLUG you can set flags to INIT_PAGE_COUNT | INIT_PAGE_RESERVED
>> an call __init_single_page() after the check for MEMINIT_HOTPLUG.
>>
>> But more generally, I wonder if we have to differentiate HOTPLUG here
>> at all.
>> @David, can you comment please?
>
> There are a lot of details to that, and I'll share some I can briefly
> think of.
>
> 1) __SetPageReserved()
>
> I tried removing that a while ago, but there was a blocker (IIRC
> something about ZONE_DEVICE). I still have the patches at [1] and I
> could probably take a look if that blocker still exists (I recall that
> something changed at some point, but I never had the time to follow up).
>
> But once we stop setting the pages reserved, we might run into issues
> with ...
>
>
> 2) init_page_count()
>
> virtio-mem, XEN balloon and HV-balloon add memory blocks that can
> contain holes.
> set_online_page_callback() is used to intercept memory onlining and to
> expose only the pages that are not holes to the buddy: calling
> generic_online_page() on !hole.
>
> Holes are PageReserved but with an initialized page count. Memory
> offlining will fail on PageReserved pages -- has_unmovable_pages().
>
>
> At least virtio-mem clears the PageReserved flag of holes when
> onlining memory, and currently relies in the page count to be
> reasonable (so memory offlining can work).
>
> static void virtio_mem_set_fake_offline(unsigned long pfn,
>                     unsigned long nr_pages, bool onlined)
> {
>     page_offline_begin();
>     for (; nr_pages--; pfn++) {
>         struct page *page = pfn_to_page(pfn);
>
>         __SetPageOffline(page);
>         if (!onlined) {
>             SetPageDirty(page);
>             /* FIXME: remove after cleanups */
>             ClearPageReserved(page);
>         }
>     }
>     page_offline_end();
> }
>
>
> For virtio-mem, we could initialize the page count there instead. The
> other PV drivers might require a bit more thought.
>
>
> [1] https://github.com/davidhildenbrand/linux/tree/online_reserved_cleanup
>
>>
>>> +        }
>>>
>>>          /*
>>>           * Usually, we want to mark the pageblock MIGRATE_MOVABLE,
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 06be8821d833..b868caabe8dc 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1285,18 +1285,22 @@ void __free_pages_core(struct page *page, unsigned int order)
>>>      unsigned int loop;
>>>
>>>      /*
>>> -     * When initializing the memmap, __init_single_page() sets the refcount
>>> -     * of all pages to 1 ("allocated"/"not free"). We have to set the
>>> -     * refcount of all involved pages to 0.
>>> +     * When initializing the memmap, memmap_init_range sets the refcount
>>> +     * of all pages to 1 ("reserved" and "free") in hotplug context. We
>>> +     * have to set the refcount of all involved pages to 0. Otherwise,
>>> +     * we don't do it, as reserve_bootmem_region only set the refcount on
>>> +     * reserve region ("reserved") in early context.
>>>       */
>>
>> Again, why hotplug and early init should be different?
>>
>>> -    prefetchw(p);
>>> -    for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>>> -        prefetchw(p + 1);
>>> +    if (page_count(page)) {
>>> +        prefetchw(p);
>>> +        for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>>> +            prefetchw(p + 1);
>>> +            __ClearPageReserved(p);
>>> +            set_page_count(p, 0);
>>> +        }
>>>          __ClearPageReserved(p);
>>>          set_page_count(p, 0);
>
> That looks wrong. If the page count would by pure luck be 0 already
> for hotplugged memory, you wouldn't clear the reserved flag.
>
> These changes make me a bit nervous.

Would 'if (page_count(page) || PageReserved(page))' be safer? Or do I need to do something else?