From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 98417CD68E3 for ; Tue, 10 Oct 2023 02:32:09 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 32FC88003C; Mon, 9 Oct 2023 22:32:09 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 2E05C80027; Mon, 9 Oct 2023 22:32:09 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 163E48003C; Mon, 9 Oct 2023 22:32:09 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 0362580027 for ; Mon, 9 Oct 2023 22:32:09 -0400 (EDT) Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id BB02BA04AC for ; Tue, 10 Oct 2023 02:32:08 +0000 (UTC) X-FDA: 81327977136.25.A445130 Received: from out-190.mta0.migadu.com (out-190.mta0.migadu.com [91.218.175.190]) by imf16.hostedemail.com (Postfix) with ESMTP id C5F29180009 for ; Tue, 10 Oct 2023 02:32:05 +0000 (UTC) Authentication-Results: imf16.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=wenNksZz; dmarc=pass (policy=none) header.from=linux.dev; spf=pass (imf16.hostedemail.com: domain of yajun.deng@linux.dev designates 91.218.175.190 as permitted sender) smtp.mailfrom=yajun.deng@linux.dev ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1696905126; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=S1Ce34sHbx25arW1V0gGJiDhHu5zWENIwtHy+sRGyWc=; b=uy4nlnYUe5TVWYR3pt1Gjrhq5KLueu8p299fM4FiNaAg46/dYHemQoLJuZfsUkHLJoLi0T hznBIb2WgYUz1K0+qvdREJeCFIb7Ho6Mc2/t/bPL/EiVMt/NmeUo0QEM1pzQVthiT84RAS E8EQf7pPBEg7EAsklUGxed3dztp1Sck= ARC-Authentication-Results: i=1; imf16.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=wenNksZz; dmarc=pass (policy=none) header.from=linux.dev; spf=pass (imf16.hostedemail.com: domain of yajun.deng@linux.dev designates 91.218.175.190 as permitted sender) smtp.mailfrom=yajun.deng@linux.dev ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1696905126; a=rsa-sha256; cv=none; b=t/tO68BrrA8u5kKuqaGfZ8OBmbOGiK2MF6FmjE5mzf8oWC9yCYf6c6S8sV1AlEQtWTbIf5 DtFF8r5esnT1UiQd6vn+GUSwJDEgdVrVoust+ug380uDZnI6Tk2WoGCG2wcVUEO4KX6D50 6TRoXqolXne3AG6nkjn6eKaLfJktorI= Message-ID: <5382bf2d-5aa0-1498-8169-3248be4b5af3@linux.dev> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1696905120; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=S1Ce34sHbx25arW1V0gGJiDhHu5zWENIwtHy+sRGyWc=; b=wenNksZzeirtnFp1tgWbklzaGplbIUcUeEA9pNwNn+JiWJ/LQT7tgNypudji5N7PdnJ9ek pI65+HwtlACzr87fOCRafkoRpd9DpktG/0cDs/QAdNT4iGHgFDFw5WgtQmzkRZTXxvqfvr 8cgunGWwdbCj0Mwocb6tODY4Vl4WyE4= Date: Tue, 10 Oct 2023 10:31:51 +0800 MIME-Version: 1.0 Subject: Re: [PATCH v4 2/2] mm: Init page count in reserve_bootmem_region when MEMINIT_EARLY Content-Language: en-US X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yajun Deng To: David Hildenbrand , Mike Rapoport Cc: akpm@linux-foundation.org, mike.kravetz@oracle.com, muchun.song@linux.dev, willy@infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <20230928083302.386202-1-yajun.deng@linux.dev> <20230928083302.386202-3-yajun.deng@linux.dev> <20230929083018.GU3303@kernel.org> <2f8c4741-5c7f-272d-9cef-9fda9fbc7ca6@linux.dev> In-Reply-To: <2f8c4741-5c7f-272d-9cef-9fda9fbc7ca6@linux.dev> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: C5F29180009 X-Stat-Signature: ywfopazy7ysgj1trpamffupkno8izgnd X-Rspam-User: X-HE-Tag: 1696905125-632767 X-HE-Meta: U2FsdGVkX19oIwv4O+TetPauExF7AVsqZ2LvU+o9JE3AgOqZHokEviv1pxC0rWayU49nekPDoDjbKzUmywSsoMh0QNJ8xYJA+kQaEnMmfjwjtFZPD48aBdItoHmLcpmVWXgUPZA9/utH6v8pu8uc+hnF8mnbC2wp+BFKhM5PJm6/LgxsbwwFmvvwqPIm9JTa77stBo7QsAYQcFTKH1pP+mAbLBj9A0cKcTBBHMYwZw6AXy8c1ea4gZWeYZ5lK1uEIl5DMlrlWNfPO9YS35EtB31emrcft1ZL3TiI69JdwjLC+5VttiexGLeaE8QHzC4qU9+iRRsZ6DsA9Dd2NGPyLWIIRzYdcKuGIRIEB/J4RorPaVqL2uXPZHOK/fdA1LSwbns26nMjtODCDjfInUNXPqihfKwGzLsHZc7OjuWw5f78l+2VK6bhK41LHNC63t6kVbnWhgYKLAxEC6aPZUTVXCHS5dbPB1zEpqV5JTwGGvktzVQS7rPfBSw+qAeEuDyz++jfYSWzIvoqc4/1f7iScPg637awftof8PxJtn1aWVEjEAKEzqr4TVzf8LO0GzMkeOPvoRICVZrO2l6NlqQR/B/QIe8xW7bWOXaCzqkrkGvGgrtB9AIt1GyT+LbhAeJ/ZROYMqwHzRi8OoKmdrQ0OQJPOosIUrgcQgV+LSLg1j0eFYjFwes2LZhg5Q1kb0mPDM8gC0SYy6dQLPtboLErc7kDaeQ8zqRJYnIkmkDOeJ5MWKPEi3o2sEgi3RYTJAjGe5KcyZtDxtKc0uT8pDB8VObPpOcJGG7gfNYBqtGJLWdO7NRajpJsKT3MBDlvx7DPiQ8DOLfOb8Gr6U0dtDCmTwbEBjBHpPM2K6CFJKvrxJrcM4NTZg4d3/cSaHBR7d0XKK4imATIXOZyncgXKHmLhzjewJhk5YMrnOsKMMr44rbcG2slKWZ5czgCtqzq7LuIzYaxVS8xFSX9LNdjSvc kzlSk3qk FWl2e+kHV8uqKIbPeheZ+gFLL2j7oNofoskVg86YDvRQM+stK2+wT0bOXisZEpqTC5caZap4y+CzQAICNn+dYHya1CP5r70x8eBvc9hhWjazOjZytC3n/Td9zdSWOBqNzX6rYjp08iyeO7H1SOoQg4yZ2uwnzm1JRpv9STtfUXL9eYUOXcA8OLd+cv+IIkp5XzC3Yw5TlDmVuyO6zc0V3wUo9tFE30OoTjLPE X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2023/10/8 16:57, Yajun Deng wrote: > > On 2023/10/2 16:30, David Hildenbrand wrote: >> On 29.09.23 10:30, Mike Rapoport wrote: >>> On Thu, Sep 28, 2023 at 04:33:02PM +0800, Yajun Deng wrote: >>>> memmap_init_range() would init page count of all pages, but the free >>>> pages count would be reset in __free_pages_core(). There are opposite >>>> operations. It's unnecessary and time-consuming when it's >>>> MEMINIT_EARLY >>>> context. >>>> >>>> Init page count in reserve_bootmem_region when in MEMINIT_EARLY >>>> context, >>>> and check the page count before reset it. >>>> >>>> At the same time, the INIT_LIST_HEAD in reserve_bootmem_region isn't >>>> need, as it already done in __init_single_page. >>>> >>>> The following data was tested on an x86 machine with 190GB of RAM. >>>> >>>> before: >>>> free_low_memory_core_early()    341ms >>>> >>>> after: >>>> free_low_memory_core_early()    285ms >>>> >>>> Signed-off-by: Yajun Deng >>>> --- >>>> v4: same with v2. >>>> v3: same with v2. >>>> v2: check page count instead of check context before reset it. >>>> v1: >>>> https://lore.kernel.org/all/20230922070923.355656-1-yajun.deng@linux.dev/ >>>> --- >>>>   mm/mm_init.c    | 18 +++++++++++++----- >>>>   mm/page_alloc.c | 20 ++++++++++++-------- >>>>   2 files changed, 25 insertions(+), 13 deletions(-) >>>> >>>> diff --git a/mm/mm_init.c b/mm/mm_init.c >>>> index 9716c8a7ade9..3ab8861e1ef3 100644 >>>> --- a/mm/mm_init.c >>>> +++ b/mm/mm_init.c >>>> @@ -718,7 +718,7 @@ static void __meminit >>>> init_reserved_page(unsigned long pfn, int nid) >>>>           if (zone_spans_pfn(zone, pfn)) >>>>               break; >>>>       } >>>> -    __init_single_page(pfn_to_page(pfn), pfn, zid, nid, >>>> INIT_PAGE_COUNT); >>>> +    __init_single_page(pfn_to_page(pfn), pfn, zid, nid, 0); >>>>   } >>>>   #else >>>>   static inline void pgdat_set_deferred_range(pg_data_t *pgdat) {} >>>> @@ -756,8 +756,8 @@ void __meminit >>>> reserve_bootmem_region(phys_addr_t start, >>>>                 init_reserved_page(start_pfn, nid); >>>>   -            /* Avoid false-positive PageTail() */ >>>> -            INIT_LIST_HEAD(&page->lru); >>>> +            /* Init page count for reserved region */ >>> >>> Please add a comment that describes _why_ we initialize the page >>> count here. >>> >>>> +            init_page_count(page); >>>>                 /* >>>>                * no need for atomic set_bit because the struct >>>> @@ -888,9 +888,17 @@ void __meminit memmap_init_range(unsigned long >>>> size, int nid, unsigned long zone >>>>           } >>>>             page = pfn_to_page(pfn); >>>> -        __init_single_page(page, pfn, zone, nid, INIT_PAGE_COUNT); >>>> -        if (context == MEMINIT_HOTPLUG) >>>> + >>>> +        /* If the context is MEMINIT_EARLY, we will init page >>>> count and >>>> +         * mark page reserved in reserve_bootmem_region, the free >>>> region >>>> +         * wouldn't have page count and we will check the pages count >>>> +         * in __free_pages_core. >>>> +         */ >>>> +        __init_single_page(page, pfn, zone, nid, 0); >>>> +        if (context == MEMINIT_HOTPLUG) { >>>> +            init_page_count(page); >>>>               __SetPageReserved(page); >>> >>> Rather than calling init_page_count() and __SetPageReserved() for >>> MEMINIT_HOTPLUG you can set flags to INIT_PAGE_COUNT | >>> INIT_PAGE_RESERVED >>> an call __init_single_page() after the check for MEMINIT_HOTPLUG. >>> >>> But more generally, I wonder if we have to differentiate HOTPLUG >>> here at all. >>> @David, can you comment please? >> >> There are a lot of details to that, and I'll share some I can briefly >> think of. >> >> 1) __SetPageReserved() >> >> I tried removing that a while ago, but there was a blocker (IIRC >> something about >> ZONE_DEVICE). I still have the patches at [1] and I could probably >> take a look >> if that blocker still exists (I recall that something changed at some >> point, but >> I never had the time to follow up). >> >> But once we stop setting the pages reserved, we might run into issues >> with ... >> >> >> 2) init_page_count() >> >> virtio-mem, XEN balloon and HV-balloon add memory blocks that can >> contain holes. >> set_online_page_callback() is used to intercept memory onlining and >> to expose >> only the pages that are not holes to the buddy: calling >> generic_online_page() on !hole. >> >> Holes are PageReserved but with an initialized page count. Memory >> offlining will fail on >> PageReserved pages -- has_unmovable_pages(). >> >> >> At least virtio-mem clears the PageReserved flag of holes when >> onlining memory, >> and currently relies in the page count to be reasonable (so memory >> offlining can work). >> >> static void virtio_mem_set_fake_offline(unsigned long pfn, >>                     unsigned long nr_pages, bool onlined) >> { >>     page_offline_begin(); >>     for (; nr_pages--; pfn++) { >>         struct page *page = pfn_to_page(pfn); >> >>         __SetPageOffline(page); >>         if (!onlined) { >>             SetPageDirty(page); >>             /* FIXME: remove after cleanups */ >>             ClearPageReserved(page); >>         } >>     } >>     page_offline_end(); >> } >> >> >> For virtio-mem, we could initialize the page count there instead. The >> other PV drivers >> might require a bit more thought. >> >> >> [1] >> https://github.com/davidhildenbrand/linux/tree/online_reserved_cleanup >> >>> >>>> +        } >>>>             /* >>>>            * Usually, we want to mark the pageblock MIGRATE_MOVABLE, >>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>> index 06be8821d833..b868caabe8dc 100644 >>>> --- a/mm/page_alloc.c >>>> +++ b/mm/page_alloc.c >>>> @@ -1285,18 +1285,22 @@ void __free_pages_core(struct page *page, >>>> unsigned int order) >>>>       unsigned int loop; >>>>         /* >>>> -     * When initializing the memmap, __init_single_page() sets the >>>> refcount >>>> -     * of all pages to 1 ("allocated"/"not free"). We have to set the >>>> -     * refcount of all involved pages to 0. >>>> +     * When initializing the memmap, memmap_init_range sets the >>>> refcount >>>> +     * of all pages to 1 ("reserved" and "free") in hotplug >>>> context. We >>>> +     * have to set the refcount of all involved pages to 0. >>>> Otherwise, >>>> +     * we don't do it, as reserve_bootmem_region only set the >>>> refcount on >>>> +     * reserve region ("reserved") in early context. >>>>        */ >>> >>> Again, why hotplug and early init should be different? >>> >>>> -    prefetchw(p); >>>> -    for (loop = 0; loop < (nr_pages - 1); loop++, p++) { >>>> -        prefetchw(p + 1); >>>> +    if (page_count(page)) { >>>> +        prefetchw(p); >>>> +        for (loop = 0; loop < (nr_pages - 1); loop++, p++) { >>>> +            prefetchw(p + 1); >>>> +            __ClearPageReserved(p); >>>> +            set_page_count(p, 0); >>>> +        } >>>>           __ClearPageReserved(p); >>>>           set_page_count(p, 0); >> >> That looks wrong. if the page count would by pure luck be 0 already >> for hotplugged memory, >> you wouldn't clear the reserved flag. >> >> These changes make me a bit nervous. > > > Is 'if (page_count(page) || PageReserved(page))' be safer? Or do I > need to do something else? > How about the following if statement? But it needs to add more patch like v1 ([PATCH 2/4] mm: Introduce MEMINIT_LATE context). It'll be safer, but more complex. Please comment... if (context != MEMINIT_EARLY || (page_count(page) || PageReserved(page)) {