From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ryan Roberts <ryan.roberts@arm.com>
Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Brendan Jackman , Johannes Weiner , Zi Yan , Uladzislau Rezki , "Vishal Moola (Oracle)" Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range() Date: Mon, 5 Jan 2026 16:17:37 +0000 Message-ID: <20260105161741.3952456-2-ryan.roberts@arm.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20260105161741.3952456-1-ryan.roberts@arm.com> References: <20260105161741.3952456-1-ryan.roberts@arm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Stat-Signature: e7g7r69o661bn5a9xgbsqsgs1d9dkbot X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 4B73B18000F X-Rspam-User: X-HE-Tag: 1767629876-119813 X-HE-Meta: U2FsdGVkX18Ho3HEs4I7HR1XNOhuIBoDgcEpY1HvzFCPoPQKBOXoJOSm9qgJP5Iqfl983JKtfVAdaWJVlHCYuMSCf0cFULvac/bTmtAffQ8YPZ4Z+4F8X62VtsviaLOZLW4222M7hGETp4xg6YSG9Jnl/vTDY2gOwNns2+oQJRC1dMXCV0HJJJm20JTCUQ4BkF8nFv7R+sc6zLQfLUTyv3n8uKOKXiVsv/BMcoHjI5fzXfyR9TCg2wueM7irZJA/BX/AiCULKql+lJxsqcxaLYbYLkFTAqJO/4KEpoiqG5YLtovaRqqqPSBSC9APpMUZqKiWLU+j0ermyy34BUqKtLe6BueA8hBAgOZb9HJetaOLFY8X4FfpBT+KqFhbkrzKQg4dQwe4Gx0dtD+oo8Q1GKDrlAjDLkpJAHeONAWQXK4WHqIZmvdCXPbAdCbjDnznipchCV95NhhBCYJgIht/l0FWlnwjTJeo8XDa+TnM9oWHY/3IXNK5dPDR6AUjju6MZtt1lHESHglqwOMyxXY+x6itTbpU83QS+DKZTsTbbgfh0qd06M1XZUvWZnkr104aKhSGO4K55EBEE6EsAyMOAfx1phH19CcFX0/MId7EcSY8buzCrBtibPd8aTV0vhtRQMIZX/ONPMamiL3ZUNPTZCYf5m0nCUZ6jXXcer/tVLIShv0jZ+JhzKMaCUbryl3Y3K9KOH41l36f0MFw1yDZHZXfp5velbttsQzkHK1kN6fOBqWRm/fYchM+Bd71StOlexBYX2qxzsNs0XVCvqUdj64OfJDKEDi/8kfFgEvfk85KVKZh5BgnqsF4Wy0rAaGe7BEFs76YXl/zgQia9a6rQTkQuZ7rBNK/taPl0TP5zCtnRtAwCd5q1ROIQRJHClaAaaCcZQ3T3wcVlUMGlePe4yitFfDQxYNHQOk+mBb0z4ZChC88eoEsJLgHDCbIX/Dm8tDw52iC4iq1FuGD9Bt yV8JhBa7 SBY9a69hQoTtuxa/4IkqzGz1u5KRNaMHIVuJOm9xRWXLSU+i+zfZCWFYUzhxheGFlNlLX8g+swiv5MpRxjxc4K6e8PulJbQUFhtSor6JLg+7P84oqvGepkYZDSBMWOjoXZd77aD3dYL9x/2SvV6WEtBS8d084VqzU1eiVAbCl8bx1u8cnfxpMkWCRjVJEtdMTH26346e5WGzLFvr9hIn6WsX1Yju+DtvVfN9AhqyeeGbKNDgS9M2tPvPcX0Je5NDDloHtlghKSBK8xHfPq/vdNkj1yA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Decompose the range of order-0 pages to be freed into the set of largest possible power-of-2 size and aligned chunks and free them to the pcp or buddy. This improves on the previous approach which freed each order-0 page individually in a loop. Testing shows performance to be improved by more than 10x in some cases. Since each page is order-0, we must decrement each page's reference count individually and only consider the page for freeing as part of a high order chunk if the reference count goes to zero. Additionally free_pages_prepare() must be called for each individual order-0 page too, so that the struct page state and global accounting state can be appropriately managed. But once this is done, the resulting high order chunks can be freed as a unit to the pcp or buddy. This significiantly speeds up the free operation but also has the side benefit that high order blocks are added to the pcp instead of each page ending up on the pcp order-0 list; memory remains more readily available in high orders. vmalloc will shortly become a user of this new optimized free_contig_range() since it agressively allocates high order non-compound pages, but then calls split_page() to end up with contiguous order-0 pages. These can now be freed much more efficiently. 
The execution time of the following function was measured in a VM on an Apple
M2 system:

static int page_alloc_high_ordr_test(void)
{
	unsigned int order = HPAGE_PMD_ORDER;
	struct page *page;
	int i;

	for (i = 0; i < 100000; i++) {
		page = alloc_pages(GFP_KERNEL, order);
		if (!page)
			return -1;
		split_page(page, order);
		free_contig_range(page_to_pfn(page), 1UL << order);
	}

	return 0;
}

Execution time before: 1684366 usec
Execution time after:   136216 usec

Perf trace before:

    60.93%     0.00%  kthreadd  [kernel.kallsyms]  [k] ret_from_fork
            |
            ---ret_from_fork
               kthread
               0xffffbba283e63980
               |
               |--60.01%--0xffffbba283e636dc
               |          |
               |          |--58.57%--free_contig_range
               |          |          |
               |          |          |--57.19%--___free_pages
               |          |          |          |
               |          |          |          |--46.65%--__free_frozen_pages
               |          |          |          |          |
               |          |          |          |          |--28.08%--free_pcppages_bulk
               |          |          |          |          |
               |          |          |          |           --12.05%--free_frozen_page_commit.constprop.0
               |          |          |          |
               |          |          |          |--5.10%--__get_pfnblock_flags_mask.isra.0
               |          |          |          |
               |          |          |          |--1.13%--_raw_spin_unlock
               |          |          |          |
               |          |          |          |--0.78%--free_frozen_page_commit.constprop.0
               |          |          |          |
               |          |          |           --0.75%--_raw_spin_trylock
               |          |          |
               |          |           --0.95%--__free_frozen_pages
               |          |
               |           --1.44%--___free_pages
               |
                --0.78%--0xffffbba283e636c0
                          split_page

Perf trace after:

    10.62%     0.00%  kthreadd  [kernel.kallsyms]  [k] ret_from_fork
            |
            ---ret_from_fork
               kthread
               0xffffbbd55ef74980
               |
               |--8.74%--0xffffbbd55ef746dc
               |          free_contig_range
               |          |
               |           --8.72%--__free_contig_range
               |
                --1.56%--0xffffbbd55ef746c0
                          |
                           --1.54%--split_page

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/gfp.h |   1 +
 mm/page_alloc.c     | 116 +++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 106 insertions(+), 11 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b155929af5b1..3ed0bef34d0c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -439,6 +439,7 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
 #define alloc_contig_pages(...)	alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__))
 #endif
 
+unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
 void free_contig_range(unsigned long pfn, unsigned long nr_pages);
 
 #ifdef CONFIG_CONTIG_ALLOC
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a045d728ae0f..1015c8edf8a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -91,6 +91,9 @@ typedef int __bitwise fpi_t;
 /* Free the page without taking locks. Rely on trylock only. */
 #define FPI_TRYLOCK		((__force fpi_t)BIT(2))
 
+/* free_pages_prepare() has already been called for page(s) being freed. */
+#define FPI_PREPARED		((__force fpi_t)BIT(3))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1582,8 +1585,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	unsigned long pfn = page_to_pfn(page);
 	struct zone *zone = page_zone(page);
 
-	if (free_pages_prepare(page, order))
-		free_one_page(zone, page, pfn, order, fpi_flags);
+	if (!(fpi_flags & FPI_PREPARED)) {
+		if (!free_pages_prepare(page, order))
+			return;
+	}
+
+	free_one_page(zone, page, pfn, order, fpi_flags);
 }
 
 void __meminit __free_pages_core(struct page *page, unsigned int order,
@@ -2943,8 +2950,10 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 		return;
 	}
 
-	if (!free_pages_prepare(page, order))
-		return;
+	if (!(fpi_flags & FPI_PREPARED)) {
+		if (!free_pages_prepare(page, order))
+			return;
+	}
 
 	/*
 	 * We only track unmovable, reclaimable and movable on pcp lists.
@@ -7250,9 +7259,99 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
 }
 #endif /* CONFIG_CONTIG_ALLOC */
 
+static void free_prepared_contig_range(struct page *page,
+				       unsigned long nr_pages)
+{
+	while (nr_pages) {
+		unsigned int fit_order, align_order, order;
+		unsigned long pfn;
+
+		/*
+		 * Find the largest aligned power-of-2 number of pages that
+		 * starts at the current page, does not exceed nr_pages and is
+		 * less than or equal to pageblock_order.
+		 */
+		pfn = page_to_pfn(page);
+		fit_order = ilog2(nr_pages);
+		align_order = pfn ? __ffs(pfn) : fit_order;
+		order = min3(fit_order, align_order, pageblock_order);
+
+		/*
+		 * Free the chunk as a single block. Our caller has already
+		 * called free_pages_prepare() for each order-0 page.
+		 */
+		__free_frozen_pages(page, order, FPI_PREPARED);
+
+		page += 1UL << order;
+		nr_pages -= 1UL << order;
+	}
+}
+
+/**
+ * __free_contig_range - Free contiguous range of order-0 pages.
+ * @pfn: Page frame number of the first page in the range.
+ * @nr_pages: Number of pages to free.
+ *
+ * For each order-0 struct page in the physically contiguous range, put a
+ * reference. Free any page whose reference count falls to zero. The
+ * implementation is functionally equivalent to, but significantly faster than
+ * calling __free_page() for each struct page in a loop.
+ *
+ * Memory allocated with alloc_pages(order>=1) then subsequently split to
+ * order-0 with split_page() is an example of appropriate contiguous pages that
+ * can be freed with this API.
+ *
+ * Returns the number of pages which were not freed, because their reference
+ * count did not fall to zero.
+ *
+ * Context: May be called in interrupt context or while holding a normal
+ * spinlock, but not in NMI context or while holding a raw spinlock.
+ */
+unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
+{
+	struct page *page = pfn_to_page(pfn);
+	unsigned long not_freed = 0;
+	struct page *start = NULL;
+	unsigned long i;
+	bool can_free;
+
+	/*
+	 * Chunk the range into contiguous runs of pages for which the refcount
+	 * went to zero and for which free_pages_prepare() succeeded. If
+	 * free_pages_prepare() fails, we consider the page to have been freed
+	 * and deliberately leak it.
+	 *
+	 * Code assumes contiguous PFNs have contiguous struct pages, but not
+	 * vice versa.
+	 */
+	for (i = 0; i < nr_pages; i++, page++) {
+		VM_BUG_ON_PAGE(PageHead(page), page);
+		VM_BUG_ON_PAGE(PageTail(page), page);
+
+		can_free = put_page_testzero(page);
+		if (!can_free)
+			not_freed++;
+		else if (!free_pages_prepare(page, 0))
+			can_free = false;
+
+		if (!can_free && start) {
+			free_prepared_contig_range(start, page - start);
+			start = NULL;
+		} else if (can_free && !start) {
+			start = page;
+		}
+	}
+
+	if (start)
+		free_prepared_contig_range(start, page - start);
+
+	return not_freed;
+}
+EXPORT_SYMBOL(__free_contig_range);
+
 void free_contig_range(unsigned long pfn, unsigned long nr_pages)
 {
-	unsigned long count = 0;
+	unsigned long count;
 	struct folio *folio = pfn_folio(pfn);
 
 	if (folio_test_large(folio)) {
@@ -7266,12 +7365,7 @@ void free_contig_range(unsigned long pfn, unsigned long nr_pages)
 		return;
 	}
 
-	for (; nr_pages--; pfn++) {
-		struct page *page = pfn_to_page(pfn);
-
-		count += page_count(page) != 1;
-		__free_page(page);
-	}
+	count = __free_contig_range(pfn, nr_pages);
 	WARN(count != 0, "%lu pages are still in use!\n", count);
 }
 EXPORT_SYMBOL(free_contig_range);
-- 
2.43.0
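
For illustration only (not part of the patch), a caller following the
alloc_pages() + split_page() pattern described in the commit message could use
the new interface like this; the function name and error handling are
hypothetical:

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static int example_alloc_split_free(unsigned int order)
{
	struct page *page;
	unsigned long not_freed;

	/* High order, non-compound allocation. */
	page = alloc_pages(GFP_KERNEL, order);
	if (!page)
		return -ENOMEM;

	/* Convert it into (1 << order) independent order-0 pages. */
	split_page(page, order);

	/*
	 * Put one reference on each order-0 page; the return value is the
	 * number of pages whose refcount did not drop to zero and which
	 * therefore remain allocated.
	 */
	not_freed = __free_contig_range(page_to_pfn(page), 1UL << order);

	return not_freed ? -EBUSY : 0;
}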