Date: Mon, 2 Oct 2023 10:30:21 +0200
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Mike Rapoport, Yajun Deng
Cc: akpm@linux-foundation.org, mike.kravetz@oracle.com, muchun.song@linux.dev,
 willy@infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 2/2] mm: Init page count in reserve_bootmem_region when MEMINIT_EARLY
In-Reply-To: <20230929083018.GU3303@kernel.org>
References: <20230928083302.386202-1-yajun.deng@linux.dev>
 <20230928083302.386202-3-yajun.deng@linux.dev>
 <20230929083018.GU3303@kernel.org>
On 29.09.23 10:30, Mike Rapoport wrote:
> On Thu, Sep 28, 2023 at 04:33:02PM +0800, Yajun Deng wrote:
>> memmap_init_range() initializes the page count of all pages, but the
>> page count of free pages is then reset again in __free_pages_core().
>> These are opposite operations, which is unnecessary and time-consuming
>> in the MEMINIT_EARLY context.
>>
>> Instead, initialize the page count in reserve_bootmem_region() when in
>> the MEMINIT_EARLY context, and check the page count before resetting it.
>>
>> At the same time, the INIT_LIST_HEAD in reserve_bootmem_region() isn't
>> needed, as it is already done in __init_single_page().
>>
>> The following data was measured on an x86 machine with 190GB of RAM.
>>
>> before:
>> free_low_memory_core_early()    341ms
>>
>> after:
>> free_low_memory_core_early()    285ms
>>
>> Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
>> ---
>> v4: same as v2.
>> v3: same as v2.
>> v2: check the page count instead of the context before resetting it.
>> v1: https://lore.kernel.org/all/20230922070923.355656-1-yajun.deng@linux.dev/
>> ---
>>  mm/mm_init.c    | 18 +++++++++++++-----
>>  mm/page_alloc.c | 20 ++++++++++++--------
>>  2 files changed, 25 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/mm_init.c b/mm/mm_init.c
>> index 9716c8a7ade9..3ab8861e1ef3 100644
>> --- a/mm/mm_init.c
>> +++ b/mm/mm_init.c
>> @@ -718,7 +718,7 @@ static void __meminit init_reserved_page(unsigned long pfn, int nid)
>>  		if (zone_spans_pfn(zone, pfn))
>>  			break;
>>  	}
>> -	__init_single_page(pfn_to_page(pfn), pfn, zid, nid, INIT_PAGE_COUNT);
>> +	__init_single_page(pfn_to_page(pfn), pfn, zid, nid, 0);
>>  }
>>  #else
>>  static inline void pgdat_set_deferred_range(pg_data_t *pgdat) {}
>> @@ -756,8 +756,8 @@ void __meminit reserve_bootmem_region(phys_addr_t start,
>>
>>  		init_reserved_page(start_pfn, nid);
>>
>> -		/* Avoid false-positive PageTail() */
>> -		INIT_LIST_HEAD(&page->lru);
>> +		/* Init page count for reserved region */
>
> Please add a comment that describes _why_ we initialize the page count here.
>
>> +		init_page_count(page);
>>
>>  		/*
>>  		 * no need for atomic set_bit because the struct
>> @@ -888,9 +888,17 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
>>  	}
>>
>>  	page = pfn_to_page(pfn);
>> -	__init_single_page(page, pfn, zone, nid, INIT_PAGE_COUNT);
>> -	if (context == MEMINIT_HOTPLUG)
>> +
>> +	/* If the context is MEMINIT_EARLY, we will init page count and
>> +	 * mark page reserved in reserve_bootmem_region, the free region
>> +	 * wouldn't have page count and we will check the pages count
>> +	 * in __free_pages_core.
>> +	 */
>> +	__init_single_page(page, pfn, zone, nid, 0);
>> +	if (context == MEMINIT_HOTPLUG) {
>> +		init_page_count(page);
>>  		__SetPageReserved(page);
>
> Rather than calling init_page_count() and __SetPageReserved() for
> MEMINIT_HOTPLUG you can set flags to INIT_PAGE_COUNT | INIT_PAGE_RESERVED
> and call __init_single_page() after the check for MEMINIT_HOTPLUG.
>
> But more generally, I wonder if we have to differentiate HOTPLUG here at all.
> @David, can you comment please?

There are a lot of details to that; I'll share some that I can briefly
think of.

1) __SetPageReserved()

I tried removing that a while ago, but there was a blocker (IIRC something
about ZONE_DEVICE). I still have the patches at [1] and I could probably
take a look at whether that blocker still exists (I recall that something
changed at some point, but I never had the time to follow up).

But once we stop setting the pages reserved, we might run into issues with ...

2) init_page_count()

virtio-mem, the XEN balloon and the HV balloon add memory blocks that can
contain holes. set_online_page_callback() is used to intercept memory
onlining and to expose only the pages that are not holes to the buddy:
calling generic_online_page() on !hole.

Holes are PageReserved but have an initialized page count.

Memory offlining will fail on PageReserved pages -- has_unmovable_pages().

At least virtio-mem clears the PageReserved flag of holes when onlining
memory, and it currently relies on the page count being reasonable (so that
memory offlining can work):

static void virtio_mem_set_fake_offline(unsigned long pfn,
					unsigned long nr_pages, bool onlined)
{
	page_offline_begin();
	for (; nr_pages--; pfn++) {
		struct page *page = pfn_to_page(pfn);

		__SetPageOffline(page);
		if (!onlined) {
			SetPageDirty(page);
			/* FIXME: remove after cleanups */
			ClearPageReserved(page);
		}
	}
	page_offline_end();
}

For virtio-mem, we could initialize the page count there instead.
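Something along these lines would probably do it -- a completely untested
sketch, just to illustrate the idea. The only change compared to the function
quoted above is the added init_page_count() call, assuming the memmap init
code would no longer give these holes a refcount for us:

static void virtio_mem_set_fake_offline(unsigned long pfn,
					unsigned long nr_pages, bool onlined)
{
	page_offline_begin();
	for (; nr_pages--; pfn++) {
		struct page *page = pfn_to_page(pfn);

		__SetPageOffline(page);
		if (!onlined) {
			SetPageDirty(page);
			/* FIXME: remove after cleanups */
			ClearPageReserved(page);
			/*
			 * Untested: give fake-offline holes a sane refcount
			 * here, instead of relying on the memmap init code
			 * having done it, so memory offlining keeps working.
			 */
			init_page_count(page);
		}
	}
	page_offline_end();
}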
The other PV drivers might require a bit more thought.

[1] https://github.com/davidhildenbrand/linux/tree/online_reserved_cleanup

>
>> +	}
>>
>>  	/*
>>  	 * Usually, we want to mark the pageblock MIGRATE_MOVABLE,
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 06be8821d833..b868caabe8dc 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1285,18 +1285,22 @@ void __free_pages_core(struct page *page, unsigned int order)
>>  	unsigned int loop;
>>
>>  	/*
>> -	 * When initializing the memmap, __init_single_page() sets the refcount
>> -	 * of all pages to 1 ("allocated"/"not free"). We have to set the
>> -	 * refcount of all involved pages to 0.
>> +	 * When initializing the memmap, memmap_init_range sets the refcount
>> +	 * of all pages to 1 ("reserved" and "free") in hotplug context. We
>> +	 * have to set the refcount of all involved pages to 0. Otherwise,
>> +	 * we don't do it, as reserve_bootmem_region only set the refcount on
>> +	 * reserve region ("reserved") in early context.
>> +	 */
>
> Again, why should hotplug and early init be different?
>
>> -	prefetchw(p);
>> -	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>> -		prefetchw(p + 1);
>> +	if (page_count(page)) {
>> +		prefetchw(p);
>> +		for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
>> +			prefetchw(p + 1);
>> +			__ClearPageReserved(p);
>> +			set_page_count(p, 0);
>> +		}
>>  		__ClearPageReserved(p);
>>  		set_page_count(p, 0);

That looks wrong: if the page count happened, by pure luck, to already be 0
for hotplugged memory, you wouldn't clear the reserved flag.

These changes make me a bit nervous.

-- 
Cheers,

David / dhildenb