Date: Thu, 12 Oct 2023 11:23:33 +0200
Subject: Re: [PATCH v4 2/2] mm: Init page count in reserve_bootmem_region when MEMINIT_EARLY
To: Yajun Deng, Mike Rapoport
Cc: akpm@linux-foundation.org, mike.kravetz@oracle.com, muchun.song@linux.dev,
 willy@infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20230928083302.386202-1-yajun.deng@linux.dev>
 <20230928083302.386202-3-yajun.deng@linux.dev>
 <20230929083018.GU3303@kernel.org>
 <2f8c4741-5c7f-272d-9cef-9fda9fbc7ca6@linux.dev>
 <5382bf2d-5aa0-1498-8169-3248be4b5af3@linux.dev>
From: David Hildenbrand
Organization: Red Hat
In-Reply-To: <5382bf2d-5aa0-1498-8169-3248be4b5af3@linux.dev>

On 10.10.23 04:31, Yajun Deng wrote:
>
> On 2023/10/8 16:57, Yajun Deng wrote:
>>
>> On 2023/10/2 16:30, David Hildenbrand wrote:
>>> On 29.09.23 10:30, Mike Rapoport wrote:
>>>> On Thu, Sep 28, 2023 at 04:33:02PM +0800, Yajun Deng wrote:
>>>>> memmap_init_range() would init page count of all pages, but the free
>>>>> pages count would be reset in __free_pages_core(). There are opposite
>>>>> operations. It's unnecessary and time-consuming when it's MEMINIT_EARLY
>>>>> context.
>>>>>
>>>>> Init page count in reserve_bootmem_region when in MEMINIT_EARLY context,
>>>>> and check the page count before reset it.
>>>>>
>>>>> At the same time, the INIT_LIST_HEAD in reserve_bootmem_region isn't
>>>>> need, as it already done in __init_single_page.
>>>>>
>>>>> The following data was tested on an x86 machine with 190GB of RAM.
>>>>>
>>>>> before:
>>>>> free_low_memory_core_early()    341ms
>>>>>
>>>>> after:
>>>>> free_low_memory_core_early()    285ms
>>>>>
>>>>> Signed-off-by: Yajun Deng
>>>>> ---
>>>>> v4: same with v2.
>>>>> v3: same with v2.
>>>>> v2: check page count instead of check context before reset it.
>>>>> v1: https://lore.kernel.org/all/20230922070923.355656-1-yajun.deng@linux.dev/
>>>>> ---
>>>>>   mm/mm_init.c    | 18 +++++++++++++-----
>>>>>   mm/page_alloc.c | 20 ++++++++++++--------
>>>>>   2 files changed, 25 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/mm/mm_init.c b/mm/mm_init.c
>>>>> index 9716c8a7ade9..3ab8861e1ef3 100644
>>>>> --- a/mm/mm_init.c
>>>>> +++ b/mm/mm_init.c
>>>>> @@ -718,7 +718,7 @@ static void __meminit init_reserved_page(unsigned long pfn, int nid)
>>>>>           if (zone_spans_pfn(zone, pfn))
>>>>>               break;
>>>>>       }
>>>>> -    __init_single_page(pfn_to_page(pfn), pfn, zid, nid, INIT_PAGE_COUNT);
>>>>> +    __init_single_page(pfn_to_page(pfn), pfn, zid, nid, 0);
>>>>>   }
>>>>>   #else
>>>>>   static inline void pgdat_set_deferred_range(pg_data_t *pgdat) {}
>>>>> @@ -756,8 +756,8 @@ void __meminit reserve_bootmem_region(phys_addr_t start,
>>>>>
>>>>>               init_reserved_page(start_pfn, nid);
>>>>>
>>>>> -            /* Avoid false-positive PageTail() */
>>>>> -            INIT_LIST_HEAD(&page->lru);
>>>>> +            /* Init page count for reserved region */
>>>>
>>>> Please add a comment that describes _why_ we initialize the page
>>>> count here.
>>>>
>>>>> +            init_page_count(page);
>>>>>
>>>>>               /*
>>>>>                * no need for atomic set_bit because the struct
>>>>> @@ -888,9 +888,17 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
>>>>>           }
>>>>>
>>>>>           page = pfn_to_page(pfn);
>>>>> -        __init_single_page(page, pfn, zone, nid, INIT_PAGE_COUNT);
>>>>> -        if (context == MEMINIT_HOTPLUG)
>>>>> +
>>>>> +        /* If the context is MEMINIT_EARLY, we will init page count and
>>>>> +         * mark page reserved in reserve_bootmem_region, the free region
>>>>> +         * wouldn't have page count and we will check the pages count
>>>>> +         * in __free_pages_core.
>>>>> +         */
>>>>> +        __init_single_page(page, pfn, zone, nid, 0);
>>>>> +        if (context == MEMINIT_HOTPLUG) {
>>>>> +            init_page_count(page);
>>>>>               __SetPageReserved(page);
>>>>
>>>> Rather than calling init_page_count() and __SetPageReserved() for
>>>> MEMINIT_HOTPLUG you can set flags to INIT_PAGE_COUNT | INIT_PAGE_RESERVED
>>>> and call __init_single_page() after the check for MEMINIT_HOTPLUG.
>>>>
>>>> But more generally, I wonder if we have to differentiate HOTPLUG here at all.
>>>> @David, can you comment please?
>>>
>>> There are a lot of details to that, and I'll share some I can briefly
>>> think of.
>>>
>>> 1) __SetPageReserved()
>>>
>>> I tried removing that a while ago, but there was a blocker (IIRC
>>> something about ZONE_DEVICE). I still have the patches at [1] and I
>>> could probably take a look if that blocker still exists (I recall that
>>> something changed at some point, but I never had the time to follow up).
>>>
>>> But once we stop setting the pages reserved, we might run into issues
>>> with ...
>>>
>>>
>>> 2) init_page_count()
>>>
>>> virtio-mem, XEN balloon and HV-balloon add memory blocks that can
>>> contain holes. set_online_page_callback() is used to intercept memory
>>> onlining and to expose only the pages that are not holes to the buddy:
>>> calling generic_online_page() on !hole.
>>>
>>> Holes are PageReserved but with an initialized page count. Memory
>>> offlining will fail on PageReserved pages -- has_unmovable_pages().
>>>
>>>
>>> At least virtio-mem clears the PageReserved flag of holes when
>>> onlining memory, and currently relies on the page count to be
>>> reasonable (so memory offlining can work).
>>>
>>> static void virtio_mem_set_fake_offline(unsigned long pfn,
>>>                     unsigned long nr_pages, bool onlined)
>>> {
>>>     page_offline_begin();
>>>     for (; nr_pages--; pfn++) {
>>>         struct page *page = pfn_to_page(pfn);
>>>
>>>         __SetPageOffline(page);
>>>         if (!onlined) {
>>>             SetPageDirty(page);
>>>             /* FIXME: remove after cleanups */
>>>             ClearPageReserved(page);
>>>         }
>>>     }
>>>     page_offline_end();
>>> }
>>>
>>>
>>> For virtio-mem, we could initialize the page count there instead. The
>>> other PV drivers might require a bit more thought.
>>>
>>>
>>> [1] https://github.com/davidhildenbrand/linux/tree/online_reserved_cleanup
>>>
>>>>
>>>>> +        }
>>>>>
>>>>>           /*
>>>>>            * Usually, we want to mark the pageblock MIGRATE_MOVABLE,
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index 06be8821d833..b868caabe8dc 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -1285,18 +1285,22 @@ void __free_pages_core(struct page *page, unsigned int order)
>>>>>       unsigned int loop;
>>>>>
>>>>>       /*
>>>>> -     * When initializing the memmap, __init_single_page() sets the refcount
>>>>> -     * of all pages to 1 ("allocated"/"not free"). We have to set the
>>>>> -     * refcount of all involved pages to 0.
>>>>> +     * When initializing the memmap, memmap_init_range sets the refcount
>>>>> +     * of all pages to 1 ("reserved" and "free") in hotplug context. We
>>>>> +     * have to set the refcount of all involved pages to 0. Otherwise,
>>>>> +     * we don't do it, as reserve_bootmem_region only set the refcount on
>>>>> +     * reserve region ("reserved") in early context.
>>>>>        */
>>>>
>>>> Again, why hotplug and early init should be different?
>>>>
>>>>> -    prefetchw(p);
>>>>> -    for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>>>>> -        prefetchw(p + 1);
>>>>> +    if (page_count(page)) {
>>>>> +        prefetchw(p);
>>>>> +        for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>>>>> +            prefetchw(p + 1);
>>>>> +            __ClearPageReserved(p);
>>>>> +            set_page_count(p, 0);
>>>>> +        }
>>>>>           __ClearPageReserved(p);
>>>>>           set_page_count(p, 0);
>>>
>>> That looks wrong. If the page count would by pure luck be 0 already
>>> for hotplugged memory, you wouldn't clear the reserved flag.
>>>
>>> These changes make me a bit nervous.
>>
>>
>> Would 'if (page_count(page) || PageReserved(page))' be safer? Or do I
>> need to do something else?
>>
>
> How about the following if statement? But it needs to add more patch
> like v1 ([PATCH 2/4] mm: Introduce MEMINIT_LATE context).
>
> It'll be safer, but more complex. Please comment...
>
> if (context != MEMINIT_EARLY || (page_count(page) || PageReserved(page))) {
>

Ideally we could make initialization only depend on the context, and not
check for count or the reserved flag.

-- 
Cheers,

David / dhildenb
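
For illustration only (this sketch is not part of the thread above): the
single-call shape Mike suggests for memmap_init_range() could look roughly
like the snippet below. It assumes the extra flags argument that the v4
series adds to __init_single_page(); INIT_PAGE_RESERVED is a hypothetical
flag named here for the sketch, only INIT_PAGE_COUNT exists in the series
under discussion.

        /*
         * Sketch: pick the init flags based on the context and make a
         * single __init_single_page() call, instead of calling
         * init_page_count() and __SetPageReserved() separately for
         * MEMINIT_HOTPLUG. INIT_PAGE_RESERVED is hypothetical.
         */
        unsigned long flags = 0;

        if (context == MEMINIT_HOTPLUG)
                flags = INIT_PAGE_COUNT | INIT_PAGE_RESERVED;

        page = pfn_to_page(pfn);
        __init_single_page(page, pfn, zone, nid, flags);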