* [PATCH v1 0/2] Free contiguous order-0 pages efficiently
@ 2026-01-05 16:17 Ryan Roberts
2026-01-05 16:17 ` [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range() Ryan Roberts
` (4 more replies)
0 siblings, 5 replies; 15+ messages in thread
From: Ryan Roberts @ 2026-01-05 16:17 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Vishal Moola (Oracle)
Cc: Ryan Roberts, linux-mm, linux-kernel
Hi All,
A recent change to vmalloc caused some performance benchmark regressions (see
[1]). I'm attempting to fix that (and at the same time significantly improve
beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.
At the same time I observed that free_contig_range() was essentially doing the
same thing as vfree() so I've fixed it there too.
I think I've convinced myself that calling free_pages_prepare() per order-0 page
followed by a single free_frozen_page_commit() or free_one_page() for the high
order block is safe/correct, but it would be good if a page_alloc expert could
confirm!
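For reference, the pattern the series relies on is roughly the following
(pseudo-code sketch only; the real code in patch 1 additionally handles
refcounts, pcp vs buddy placement and the pageblock_order limit):

	/* sketch: prepare each order-0 page, then commit the block in one go */
	for (i = 0; i < (1UL << order); i++)
		free_pages_prepare(page + i, 0);
	free_one_page(zone, page, pfn, order, fpi_flags);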
Applies against today's mm-unstable (344d3580dacd). All mm selftests run and
pass.
Thanks,
Ryan
Ryan Roberts (2):
mm/page_alloc: Optimize free_contig_range()
vmalloc: Optimize vfree
include/linux/gfp.h | 1 +
mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++-----
mm/vmalloc.c | 29 +++++++----
3 files changed, 125 insertions(+), 21 deletions(-)
--
2.43.0
* [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range()
2026-01-05 16:17 [PATCH v1 0/2] Free contiguous order-0 pages efficiently Ryan Roberts
@ 2026-01-05 16:17 ` Ryan Roberts
2026-01-05 17:15 ` Zi Yan
2026-01-05 16:17 ` [PATCH v1 2/2] vmalloc: Optimize vfree Ryan Roberts
` (3 subsequent siblings)
4 siblings, 1 reply; 15+ messages in thread
From: Ryan Roberts @ 2026-01-05 16:17 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Vishal Moola (Oracle)
Cc: Ryan Roberts, linux-mm, linux-kernel
Decompose the range of order-0 pages to be freed into the largest
possible power-of-2 sized and aligned chunks and free each chunk to the
pcp or buddy. This improves on the previous approach, which freed each
order-0 page individually in a loop. Testing shows performance improves
by more than 10x in some cases.
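As an illustration of the decomposition (hypothetical pfn values, and
assuming pageblock_order >= 3), a run of 13 pages starting at pfn 0x1003
is freed as:

	pfn 0x1003: order 0  (start alignment limits the chunk to 1 page)
	pfn 0x1004: order 2  (4-page aligned; 12 pages remaining)
	pfn 0x1008: order 3  (8-page aligned; 8 pages remaining)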
Since each page is order-0, we must decrement each page's reference
count individually and only consider the page for freeing as part of a
high order chunk if the reference count goes to zero. Additionally
free_pages_prepare() must be called for each individual order-0 page
too, so that the struct page state and global accounting state can be
appropriately managed. But once this is done, the resulting high order
chunks can be freed as a unit to the pcp or buddy.
This significantly speeds up the free operation and also has the side
benefit that high order blocks are added to the pcp instead of each page
ending up on the pcp order-0 list; memory remains more readily available
in high orders.
vmalloc will shortly become a user of this new optimized
free_contig_range() since it aggressively allocates high order
non-compound pages, but then calls split_page() to end up with
contiguous order-0 pages. These can now be freed much more efficiently.
The execution time of the following function was measured in a VM on an
Apple M2 system:
static int page_alloc_high_ordr_test(void)
{
unsigned int order = HPAGE_PMD_ORDER;
struct page *page;
int i;
for (i = 0; i < 100000; i++) {
page = alloc_pages(GFP_KERNEL, order);
if (!page)
return -1;
split_page(page, order);
free_contig_range(page_to_pfn(page), 1UL << order);
}
return 0;
}
Execution time before: 1684366 usec
Execution time after: 136216 usec
Perf trace before:
60.93% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
|
---ret_from_fork
kthread
0xffffbba283e63980
|
|--60.01%--0xffffbba283e636dc
| |
| |--58.57%--free_contig_range
| | |
| | |--57.19%--___free_pages
| | | |
| | | |--46.65%--__free_frozen_pages
| | | | |
| | | | |--28.08%--free_pcppages_bulk
| | | | |
| | | | --12.05%--free_frozen_page_commit.constprop.0
| | | |
| | | |--5.10%--__get_pfnblock_flags_mask.isra.0
| | | |
| | | |--1.13%--_raw_spin_unlock
| | | |
| | | |--0.78%--free_frozen_page_commit.constprop.0
| | | |
| | | --0.75%--_raw_spin_trylock
| | |
| | --0.95%--__free_frozen_pages
| |
| --1.44%--___free_pages
|
--0.78%--0xffffbba283e636c0
split_page
Perf trace after:
10.62% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
|
---ret_from_fork
kthread
0xffffbbd55ef74980
|
|--8.74%--0xffffbbd55ef746dc
| free_contig_range
| |
| --8.72%--__free_contig_range
|
--1.56%--0xffffbbd55ef746c0
|
--1.54%--split_page
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/gfp.h | 1 +
mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++-----
2 files changed, 106 insertions(+), 11 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b155929af5b1..3ed0bef34d0c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -439,6 +439,7 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
#define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__))
#endif
+unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
void free_contig_range(unsigned long pfn, unsigned long nr_pages);
#ifdef CONFIG_CONTIG_ALLOC
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a045d728ae0f..1015c8edf8a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -91,6 +91,9 @@ typedef int __bitwise fpi_t;
/* Free the page without taking locks. Rely on trylock only. */
#define FPI_TRYLOCK ((__force fpi_t)BIT(2))
+/* free_pages_prepare() has already been called for page(s) being freed. */
+#define FPI_PREPARED ((__force fpi_t)BIT(3))
+
/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1582,8 +1585,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
unsigned long pfn = page_to_pfn(page);
struct zone *zone = page_zone(page);
- if (free_pages_prepare(page, order))
- free_one_page(zone, page, pfn, order, fpi_flags);
+ if (!(fpi_flags & FPI_PREPARED)) {
+ if (!free_pages_prepare(page, order))
+ return;
+ }
+
+ free_one_page(zone, page, pfn, order, fpi_flags);
}
void __meminit __free_pages_core(struct page *page, unsigned int order,
@@ -2943,8 +2950,10 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
return;
}
- if (!free_pages_prepare(page, order))
- return;
+ if (!(fpi_flags & FPI_PREPARED)) {
+ if (!free_pages_prepare(page, order))
+ return;
+ }
/*
* We only track unmovable, reclaimable and movable on pcp lists.
@@ -7250,9 +7259,99 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
}
#endif /* CONFIG_CONTIG_ALLOC */
+static void free_prepared_contig_range(struct page *page,
+ unsigned long nr_pages)
+{
+ while (nr_pages) {
+ unsigned int fit_order, align_order, order;
+ unsigned long pfn;
+
+ /*
+ * Find the largest aligned power-of-2 number of pages that
+ * starts at the current page, does not exceed nr_pages and is
+ * less than or equal to pageblock_order.
+ */
+ pfn = page_to_pfn(page);
+ fit_order = ilog2(nr_pages);
+ align_order = pfn ? __ffs(pfn) : fit_order;
+ order = min3(fit_order, align_order, pageblock_order);
+
+ /*
+ * Free the chunk as a single block. Our caller has already
+ * called free_pages_prepare() for each order-0 page.
+ */
+ __free_frozen_pages(page, order, FPI_PREPARED);
+
+ page += 1UL << order;
+ nr_pages -= 1UL << order;
+ }
+}
+
+/**
+ * __free_contig_range - Free contiguous range of order-0 pages.
+ * @pfn: Page frame number of the first page in the range.
+ * @nr_pages: Number of pages to free.
+ *
+ * For each order-0 struct page in the physically contiguous range, put a
+ * reference. Free any page who's reference count falls to zero. The
+ * implementation is functionally equivalent to, but significantly faster than
+ * calling __free_page() for each struct page in a loop.
+ *
+ * Memory allocated with alloc_pages(order>=1) then subsequently split to
+ * order-0 with split_page() is an example of appropriate contiguous pages that
+ * can be freed with this API.
+ *
+ * Returns the number of pages which were not freed, because their reference
+ * count did not fall to zero.
+ *
+ * Context: May be called in interrupt context or while holding a normal
+ * spinlock, but not in NMI context or while holding a raw spinlock.
+ */
+unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
+{
+ struct page *page = pfn_to_page(pfn);
+ unsigned long not_freed = 0;
+ struct page *start = NULL;
+ unsigned long i;
+ bool can_free;
+
+ /*
+ * Chunk the range into contiguous runs of pages for which the refcount
+ * went to zero and for which free_pages_prepare() succeeded. If
+ * free_pages_prepare() fails we consider the page to have been freed and
+ * deliberately leak it.
+ *
+ * Code assumes contiguous PFNs have contiguous struct pages, but not
+ * vice versa.
+ */
+ for (i = 0; i < nr_pages; i++, page++) {
+ VM_BUG_ON_PAGE(PageHead(page), page);
+ VM_BUG_ON_PAGE(PageTail(page), page);
+
+ can_free = put_page_testzero(page);
+ if (!can_free)
+ not_freed++;
+ else if (!free_pages_prepare(page, 0))
+ can_free = false;
+
+ if (!can_free && start) {
+ free_prepared_contig_range(start, page - start);
+ start = NULL;
+ } else if (can_free && !start) {
+ start = page;
+ }
+ }
+
+ if (start)
+ free_prepared_contig_range(start, page - start);
+
+ return not_freed;
+}
+EXPORT_SYMBOL(__free_contig_range);
+
void free_contig_range(unsigned long pfn, unsigned long nr_pages)
{
- unsigned long count = 0;
+ unsigned long count;
struct folio *folio = pfn_folio(pfn);
if (folio_test_large(folio)) {
@@ -7266,12 +7365,7 @@ void free_contig_range(unsigned long pfn, unsigned long nr_pages)
return;
}
- for (; nr_pages--; pfn++) {
- struct page *page = pfn_to_page(pfn);
-
- count += page_count(page) != 1;
- __free_page(page);
- }
+ count = __free_contig_range(pfn, nr_pages);
WARN(count != 0, "%lu pages are still in use!\n", count);
}
EXPORT_SYMBOL(free_contig_range);
--
2.43.0
* [PATCH v1 2/2] vmalloc: Optimize vfree
2026-01-05 16:17 [PATCH v1 0/2] Free contiguous order-0 pages efficiently Ryan Roberts
2026-01-05 16:17 ` [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range() Ryan Roberts
@ 2026-01-05 16:17 ` Ryan Roberts
2026-01-06 4:36 ` Matthew Wilcox
2026-01-05 16:26 ` [PATCH v1 0/2] Free contiguous order-0 pages efficiently David Hildenbrand (Red Hat)
` (2 subsequent siblings)
4 siblings, 1 reply; 15+ messages in thread
From: Ryan Roberts @ 2026-01-05 16:17 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Vishal Moola (Oracle)
Cc: Ryan Roberts, linux-mm, linux-kernel
Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it
must immediately split them to order-0 with split_page() so that they
remain compatible with users that want to access the underlying struct
pages.
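In other words the allocation side ends up doing roughly this (a sketch
only; the real path in mm/vmalloc.c also deals with node placement, bulk
allocation and falling back to smaller orders):

	/* sketch: high-order non-compound allocation split into order-0 pages */
	page = alloc_pages(gfp, order);
	if (page)
		split_page(page, order);	/* 1 << order independent order-0 pages */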
Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
allocator") recently made it much more likely for vmalloc to allocate
high order pages which are subsequently split to order-0.
Unfortunately this had the side effect of causing performance
regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko
benchmarks); see the Closes: tag. This happens because the high order
pages must be taken from the buddy, but since they are split to order-0,
they end up being freed back to the order-0 pcp. Previously the
allocations were order-0, so pages were recycled directly from the pcp.
It would be preferable if, when vmalloc allocates an (e.g.) order-3
page, it also freed that order-3 block to the order-3 pcp; then the
regression would go away.
So let's do exactly that; use the new __free_contig_range() API to
batch-free contiguous ranges of pfns. This not only removes the
regression, but significantly improves performance of vfree beyond the
baseline.
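The call pattern on the free side is then simply (sketch only; the pfn
values and pages[] array are hypothetical, and the real loop in the diff
below also handles discontiguous runs):

	/* pages[0..n) are order-0 and happen to be pfn-contiguous */
	__free_contig_range(page_to_pfn(pages[0]), n);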
A selection of test_vmalloc benchmarks was run on an AWS m7g.metal
(arm64) system. v6.18 is the baseline. Commit a06157804399 ("mm/vmalloc:
request large order pages from buddy allocator") was added in v6.19-rc1,
where we see regressions. With this change on top, performance is much
better. (>0 is faster, <0 is slower, (R)/(I) = statistically significant
Regression/Improvement):
+----------------------------------------------------------+-------------+-------------+
| test_vmalloc benchmark | v6.19-rc1 | v6.19-rc1 |
| | | + change |
+==========================================================+=============+=============+
| fix_align_alloc_test: p:1, h:0, l:500000 (usec) | (R) -40.69% | (I) 4.85% |
| fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 0.10% | -1.04% |
| fix_size_alloc_test: p:4, h:0, l:500000 (usec) | (R) -22.74% | (I) 14.12% |
| fix_size_alloc_test: p:16, h:0, l:500000 (usec) | (R) -23.63% | (I) 43.81% |
| fix_size_alloc_test: p:16, h:1, l:500000 (usec) | -1.58% | (I) 102.28% |
| fix_size_alloc_test: p:64, h:0, l:100000 (usec) | (R) -24.39% | (I) 89.64% |
| fix_size_alloc_test: p:64, h:1, l:100000 (usec) | (I) 2.34% | (I) 181.42% |
| fix_size_alloc_test: p:256, h:0, l:100000 (usec) | (R) -23.29% | (I) 111.05% |
| fix_size_alloc_test: p:256, h:1, l:100000 (usec) | (I) 3.74% | (I) 213.52% |
| fix_size_alloc_test: p:512, h:0, l:100000 (usec) | (R) -23.80% | (I) 118.28% |
| fix_size_alloc_test: p:512, h:1, l:100000 (usec) | (R) -2.84% | (I) 427.65% |
| full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 2.74% | -1.12% |
| kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 0.58% | -0.79% |
| kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | -0.66% | -0.91% |
| long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | (R) -25.24% | (I) 70.62% |
| pcpu_alloc_test: p:1, h:0, l:500000 (usec) | -0.58% | -1.27% |
| random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | (R) -45.75% | (I) 11.11% |
| random_size_alloc_test: p:1, h:0, l:500000 (usec) | (R) -28.16% | (I) 59.47% |
| vm_map_ram_test: p:1, h:0, l:500000 (usec) | -0.54% | -0.85% |
+----------------------------------------------------------+-------------+-------------+
Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/vmalloc.c | 29 +++++++++++++++++++----------
1 file changed, 19 insertions(+), 10 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 32d6ee92d4ff..86407178b6d1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3434,7 +3434,8 @@ void vfree_atomic(const void *addr)
void vfree(const void *addr)
{
struct vm_struct *vm;
- int i;
+ unsigned long start_pfn;
+ int i, nr;
if (unlikely(in_interrupt())) {
vfree_atomic(addr);
@@ -3460,17 +3461,25 @@ void vfree(const void *addr)
/* All pages of vm should be charged to same memcg, so use first one. */
if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES))
mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages);
- for (i = 0; i < vm->nr_pages; i++) {
- struct page *page = vm->pages[i];
- BUG_ON(!page);
- /*
- * High-order allocs for huge vmallocs are split, so
- * can be freed as an array of order-0 allocations
- */
- __free_page(page);
- cond_resched();
+ if (vm->nr_pages) {
+ start_pfn = page_to_pfn(vm->pages[0]);
+ nr = 1;
+ for (i = 1; i < vm->nr_pages; i++) {
+ unsigned long pfn = page_to_pfn(vm->pages[i]);
+
+ if (start_pfn + nr != pfn) {
+ __free_contig_range(start_pfn, nr);
+ start_pfn = pfn;
+ nr = 1;
+ cond_resched();
+ } else {
+ nr++;
+ }
+ }
+ __free_contig_range(start_pfn, nr);
}
+
if (!(vm->flags & VM_MAP_PUT_PAGES))
atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
kvfree(vm->pages);
--
2.43.0
* Re: [PATCH v1 0/2] Free contiguous order-0 pages efficiently
2026-01-05 16:17 [PATCH v1 0/2] Free contiguous order-0 pages efficiently Ryan Roberts
2026-01-05 16:17 ` [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range() Ryan Roberts
2026-01-05 16:17 ` [PATCH v1 2/2] vmalloc: Optimize vfree Ryan Roberts
@ 2026-01-05 16:26 ` David Hildenbrand (Red Hat)
2026-01-05 16:36 ` Zi Yan
2026-01-06 4:38 ` Matthew Wilcox
4 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-05 16:26 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, Uladzislau Rezki,
Vishal Moola (Oracle),
Ackerley Tng, Michael Roth
Cc: linux-mm, linux-kernel
On 1/5/26 17:17, Ryan Roberts wrote:
> Hi All,
Hi,
>
> A recent change to vmalloc caused some performance benchmark regressions (see
> [1]). I'm attempting to fix that (and at the same time signficantly improve
> beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.
I recently raised the utility of something like that for the case where
we have to split a large folio in the page cache for guestmemfd and then
want to punch-hole the whole (original) thing while making sure that the
whole thing ends up back in the buddy.
Freeing individual chunks (e.g., order-0 pages) has the problem that
some pages might get reallocated before being merged, consequently
fragmenting the bigger chunk.
--
Cheers
David
* Re: [PATCH v1 0/2] Free contiguous order-0 pages efficiently
2026-01-05 16:17 [PATCH v1 0/2] Free contiguous order-0 pages efficiently Ryan Roberts
` (2 preceding siblings ...)
2026-01-05 16:26 ` [PATCH v1 0/2] Free contiguous order-0 pages efficiently David Hildenbrand (Red Hat)
@ 2026-01-05 16:36 ` Zi Yan
2026-01-05 16:41 ` Ryan Roberts
2026-01-06 4:38 ` Matthew Wilcox
4 siblings, 1 reply; 15+ messages in thread
From: Zi Yan @ 2026-01-05 16:36 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Vishal Moola (Oracle),
linux-mm, linux-kernel, Kefeng Wang
On 5 Jan 2026, at 11:17, Ryan Roberts wrote:
> Hi All,
>
> A recent change to vmalloc caused some performance benchmark regressions (see
> [1]). I'm attempting to fix that (and at the same time signficantly improve
> beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.
>
> At the same time I observed that free_contig_range() was essentially doing the
> same thing as vfree() so I've fixed it there too.
>
> I think I've convinced myself that free_pages_prepare() per order-0 page
> followed by a single free_frozen_page_commit() or free_one_page() for the high
> order block is safe/correct, but would be good if a page_alloc expert can
> confirm!
>
> Applies against today's mm-unstable (344d3580dacd). All mm selftests run and
> pass.
Kefeng has a series in mm-new on using frozen pages for alloc_contig*()
which touches free_contig_range() as well. You might want to rebase on
top of that.
I like your approach of freeing multiple order-0 pages as a batch, since
they are essentially a non-compound high order page. I also pointed out
a similar optimization when reviewing Kefeng’s patchset[1] (see my comment
on __free_contig_frozen_range()).
In terms of the rebase, the conflicts for free_contig_range() should be minor.
In addition, maybe your free_prepared_contig_range() can replace
__free_contig_frozen_range() in Kefeng’s version to improve performance for
both code paths.
I will take a look at the patches. Thanks.
[1] https://lore.kernel.org/linux-mm/D90F7769-F3A8-4234-A9CE-F97BC48CCACE@nvidia.com/
>
> Thanks,
> Ryan
>
> Ryan Roberts (2):
> mm/page_alloc: Optimize free_contig_range()
> vmalloc: Optimize vfree
>
> include/linux/gfp.h | 1 +
> mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++-----
> mm/vmalloc.c | 29 +++++++----
> 3 files changed, 125 insertions(+), 21 deletions(-)
>
> --
> 2.43.0
Best Regards,
Yan, Zi
* Re: [PATCH v1 0/2] Free contiguous order-0 pages efficiently
2026-01-05 16:36 ` Zi Yan
@ 2026-01-05 16:41 ` Ryan Roberts
0 siblings, 0 replies; 15+ messages in thread
From: Ryan Roberts @ 2026-01-05 16:41 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Vishal Moola (Oracle),
linux-mm, linux-kernel, Kefeng Wang
On 05/01/2026 16:36, Zi Yan wrote:
> On 5 Jan 2026, at 11:17, Ryan Roberts wrote:
>
>> Hi All,
>>
>> A recent change to vmalloc caused some performance benchmark regressions (see
>> [1]). I'm attempting to fix that (and at the same time signficantly improve
>> beyond the baseline) by freeing a contiguous set of order-0 pages as a batch.
>>
>> At the same time I observed that free_contig_range() was essentially doing the
>> same thing as vfree() so I've fixed it there too.
>>
>> I think I've convinced myself that free_pages_prepare() per order-0 page
>> followed by a single free_frozen_page_commit() or free_one_page() for the high
>> order block is safe/correct, but would be good if a page_alloc expert can
>> confirm!
>>
>> Applies against today's mm-unstable (344d3580dacd). All mm selftests run and
>> pass.
>
> Kefeng has a series on using frozen pages for alloc_contig*() in mm-new
> and touches free_contig_range() as well. You might want to rebase on top
> of that.
>
> I like your approach of freeing multiple order-0 pages as a batch, since
> they are essentially a non-compound high order page. I also pointed out
> a similar optimization when reviewing Kefeng’s patchset[1] (see my comment
> on __free_contig_frozen_range()).
>
> In terms of rebase, there should be minor for free_contig_range(). In addition,
> maybe your free_prepared_contig_range() can replace __free_contig_frozen_range()
> in Kefeng’s version to improve performance for both code paths.
OK, great! I'll hold off on the rebase until I get some code review feedback on
this version (I'd like to hear someone agree that what I'm doing is actually
sound!). Assuming feedback is positive, I'll rebase v2 onto mm-new and look at
the extra optimization opportunities as you suggest.
Thanks,
Ryan
>
> I will take a look at the patches. Thanks.
>
> [1] https://lore.kernel.org/linux-mm/D90F7769-F3A8-4234-A9CE-F97BC48CCACE@nvidia.com/
>
>>
>> Thanks,
>> Ryan
>>
>> Ryan Roberts (2):
>> mm/page_alloc: Optimize free_contig_range()
>> vmalloc: Optimize vfree
>>
>> include/linux/gfp.h | 1 +
>> mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++-----
>> mm/vmalloc.c | 29 +++++++----
>> 3 files changed, 125 insertions(+), 21 deletions(-)
>>
>> --
>> 2.43.0
>
>
> Best Regards,
> Yan, Zi
* Re: [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range()
2026-01-05 16:17 ` [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range() Ryan Roberts
@ 2026-01-05 17:15 ` Zi Yan
2026-01-05 17:31 ` Ryan Roberts
0 siblings, 1 reply; 15+ messages in thread
From: Zi Yan @ 2026-01-05 17:15 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Vishal Moola (Oracle),
linux-mm, linux-kernel, Jiaqi Yan
On 5 Jan 2026, at 11:17, Ryan Roberts wrote:
> Decompose the range of order-0 pages to be freed into the set of largest
> possible power-of-2 size and aligned chunks and free them to the pcp or
> buddy. This improves on the previous approach which freed each order-0
> page individually in a loop. Testing shows performance to be improved by
> more than 10x in some cases.
>
> Since each page is order-0, we must decrement each page's reference
> count individually and only consider the page for freeing as part of a
> high order chunk if the reference count goes to zero. Additionally
> free_pages_prepare() must be called for each individual order-0 page
> too, so that the struct page state and global accounting state can be
> appropriately managed. But once this is done, the resulting high order
> chunks can be freed as a unit to the pcp or buddy.
>
> This significiantly speeds up the free operation but also has the side
> benefit that high order blocks are added to the pcp instead of each page
> ending up on the pcp order-0 list; memory remains more readily available
> in high orders.
>
> vmalloc will shortly become a user of this new optimized
> free_contig_range() since it agressively allocates high order
> non-compound pages, but then calls split_page() to end up with
> contiguous order-0 pages. These can now be freed much more efficiently.
>
> The execution time of the following function was measured in a VM on an
> Apple M2 system:
>
> static int page_alloc_high_ordr_test(void)
> {
> unsigned int order = HPAGE_PMD_ORDER;
> struct page *page;
> int i;
>
> for (i = 0; i < 100000; i++) {
> page = alloc_pages(GFP_KERNEL, order);
> if (!page)
> return -1;
> split_page(page, order);
> free_contig_range(page_to_pfn(page), 1UL << order);
> }
>
> return 0;
> }
>
> Execution time before: 1684366 usec
> Execution time after: 136216 usec
>
> Perf trace before:
>
> 60.93% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
> |
> ---ret_from_fork
> kthread
> 0xffffbba283e63980
> |
> |--60.01%--0xffffbba283e636dc
> | |
> | |--58.57%--free_contig_range
> | | |
> | | |--57.19%--___free_pages
> | | | |
> | | | |--46.65%--__free_frozen_pages
> | | | | |
> | | | | |--28.08%--free_pcppages_bulk
> | | | | |
> | | | | --12.05%--free_frozen_page_commit.constprop.0
> | | | |
> | | | |--5.10%--__get_pfnblock_flags_mask.isra.0
> | | | |
> | | | |--1.13%--_raw_spin_unlock
> | | | |
> | | | |--0.78%--free_frozen_page_commit.constprop.0
> | | | |
> | | | --0.75%--_raw_spin_trylock
> | | |
> | | --0.95%--__free_frozen_pages
> | |
> | --1.44%--___free_pages
> |
> --0.78%--0xffffbba283e636c0
> split_page
>
> Perf trace after:
>
> 10.62% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
> |
> ---ret_from_fork
> kthread
> 0xffffbbd55ef74980
> |
> |--8.74%--0xffffbbd55ef746dc
> | free_contig_range
> | |
> | --8.72%--__free_contig_range
> |
> --1.56%--0xffffbbd55ef746c0
> |
> --1.54%--split_page
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> include/linux/gfp.h | 1 +
> mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 106 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index b155929af5b1..3ed0bef34d0c 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -439,6 +439,7 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
> #define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__))
>
> #endif
> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>
> #ifdef CONFIG_CONTIG_ALLOC
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a045d728ae0f..1015c8edf8a4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -91,6 +91,9 @@ typedef int __bitwise fpi_t;
> /* Free the page without taking locks. Rely on trylock only. */
> #define FPI_TRYLOCK ((__force fpi_t)BIT(2))
>
> +/* free_pages_prepare() has already been called for page(s) being freed. */
> +#define FPI_PREPARED ((__force fpi_t)BIT(3))
> +
> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> static DEFINE_MUTEX(pcp_batch_high_lock);
> #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> @@ -1582,8 +1585,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
> unsigned long pfn = page_to_pfn(page);
> struct zone *zone = page_zone(page);
>
> - if (free_pages_prepare(page, order))
> - free_one_page(zone, page, pfn, order, fpi_flags);
> + if (!(fpi_flags & FPI_PREPARED)) {
> + if (!free_pages_prepare(page, order))
> + return;
> + }
> +
> + free_one_page(zone, page, pfn, order, fpi_flags);
> }
>
> void __meminit __free_pages_core(struct page *page, unsigned int order,
> @@ -2943,8 +2950,10 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> return;
> }
>
> - if (!free_pages_prepare(page, order))
> - return;
> + if (!(fpi_flags & FPI_PREPARED)) {
> + if (!free_pages_prepare(page, order))
> + return;
> + }
>
> /*
> * We only track unmovable, reclaimable and movable on pcp lists.
> @@ -7250,9 +7259,99 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
> }
> #endif /* CONFIG_CONTIG_ALLOC */
>
> +static void free_prepared_contig_range(struct page *page,
> + unsigned long nr_pages)
> +{
> + while (nr_pages) {
> + unsigned int fit_order, align_order, order;
> + unsigned long pfn;
> +
> + /*
> + * Find the largest aligned power-of-2 number of pages that
> + * starts at the current page, does not exceed nr_pages and is
> + * less than or equal to pageblock_order.
> + */
> + pfn = page_to_pfn(page);
> + fit_order = ilog2(nr_pages);
> + align_order = pfn ? __ffs(pfn) : fit_order;
> + order = min3(fit_order, align_order, pageblock_order);
> +
> + /*
> + * Free the chunk as a single block. Our caller has already
> + * called free_pages_prepare() for each order-0 page.
> + */
> + __free_frozen_pages(page, order, FPI_PREPARED);
> +
> + page += 1UL << order;
> + nr_pages -= 1UL << order;
> + }
> +}
> +
> +/**
> + * __free_contig_range - Free contiguous range of order-0 pages.
> + * @pfn: Page frame number of the first page in the range.
> + * @nr_pages: Number of pages to free.
> + *
> + * For each order-0 struct page in the physically contiguous range, put a
> + * reference. Free any page who's reference count falls to zero. The
> + * implementation is functionally equivalent to, but significantly faster than
> + * calling __free_page() for each struct page in a loop.
> + *
> + * Memory allocated with alloc_pages(order>=1) then subsequently split to
> + * order-0 with split_page() is an example of appropriate contiguous pages that
> + * can be freed with this API.
> + *
> + * Returns the number of pages which were not freed, because their reference
> + * count did not fall to zero.
> + *
> + * Context: May be called in interrupt context or while holding a normal
> + * spinlock, but not in NMI context or while holding a raw spinlock.
> + */
> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
> +{
> + struct page *page = pfn_to_page(pfn);
> + unsigned long not_freed = 0;
> + struct page *start = NULL;
> + unsigned long i;
> + bool can_free;
> +
> + /*
> + * Chunk the range into contiguous runs of pages for which the refcount
> + * went to zero and for which free_pages_prepare() succeeded. If
> + * free_pages_prepare() fails we consider the page to have been freed
> + * deliberately leak it.
> + *
> + * Code assumes contiguous PFNs have contiguous struct pages, but not
> + * vice versa.
> + */
> + for (i = 0; i < nr_pages; i++, page++) {
> + VM_BUG_ON_PAGE(PageHead(page), page);
> + VM_BUG_ON_PAGE(PageTail(page), page);
> +
> + can_free = put_page_testzero(page);
> + if (!can_free)
> + not_freed++;
> + else if (!free_pages_prepare(page, 0))
> + can_free = false;
I understand you use free_pages_prepare() here to catch early failures.
I wonder if we could let __free_frozen_pages() handle the failure of
non-compound >0 order pages instead of a new FPI flag.
Looking at free_pages_prepare(), three cases would cause failures:
1. PageHWPoison(page): the code excludes >0 order pages, so it needs
to be fixed. BTW, Jiaqi Yan has a series trying to tackle it[1].
2. uncleared PageNetpp(page): probably need to check every individual
page of this >0 order page and call bad_page() for any violator.
3. bad free page: probably need to do it for individual page as well.
I think it might be too much effort for you to get the above done.
Can you leave a TODO at FPI_PREPARED? I might try to do it
once Jiaqi’s series is merged.
Otherwise, the rest of the patch looks good to me.
Thanks.
[1] https://lore.kernel.org/linux-mm/20251219183346.3627510-1-jiaqiyan@google.com/
> +
> + if (!can_free && start) {
> + free_prepared_contig_range(start, page - start);
> + start = NULL;
> + } else if (can_free && !start) {
> + start = page;
> + }
> + }
> +
> + if (start)
> + free_prepared_contig_range(start, page - start);
> +
> + return not_freed;
> +}
> +EXPORT_SYMBOL(__free_contig_range);
> +
> void free_contig_range(unsigned long pfn, unsigned long nr_pages)
> {
> - unsigned long count = 0;
> + unsigned long count;
> struct folio *folio = pfn_folio(pfn);
>
> if (folio_test_large(folio)) {
> @@ -7266,12 +7365,7 @@ void free_contig_range(unsigned long pfn, unsigned long nr_pages)
> return;
> }
>
> - for (; nr_pages--; pfn++) {
> - struct page *page = pfn_to_page(pfn);
> -
> - count += page_count(page) != 1;
> - __free_page(page);
> - }
> + count = __free_contig_range(pfn, nr_pages);
> WARN(count != 0, "%lu pages are still in use!\n", count);
> }
> EXPORT_SYMBOL(free_contig_range);
> --
> 2.43.0
Best Regards,
Yan, Zi
* Re: [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range()
2026-01-05 17:15 ` Zi Yan
@ 2026-01-05 17:31 ` Ryan Roberts
2026-01-07 3:32 ` Zi Yan
0 siblings, 1 reply; 15+ messages in thread
From: Ryan Roberts @ 2026-01-05 17:31 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Vishal Moola (Oracle),
linux-mm, linux-kernel, Jiaqi Yan
On 05/01/2026 17:15, Zi Yan wrote:
> On 5 Jan 2026, at 11:17, Ryan Roberts wrote:
>
>> Decompose the range of order-0 pages to be freed into the set of largest
>> possible power-of-2 size and aligned chunks and free them to the pcp or
>> buddy. This improves on the previous approach which freed each order-0
>> page individually in a loop. Testing shows performance to be improved by
>> more than 10x in some cases.
>>
>> Since each page is order-0, we must decrement each page's reference
>> count individually and only consider the page for freeing as part of a
>> high order chunk if the reference count goes to zero. Additionally
>> free_pages_prepare() must be called for each individual order-0 page
>> too, so that the struct page state and global accounting state can be
>> appropriately managed. But once this is done, the resulting high order
>> chunks can be freed as a unit to the pcp or buddy.
>>
>> This significiantly speeds up the free operation but also has the side
>> benefit that high order blocks are added to the pcp instead of each page
>> ending up on the pcp order-0 list; memory remains more readily available
>> in high orders.
>>
>> vmalloc will shortly become a user of this new optimized
>> free_contig_range() since it agressively allocates high order
>> non-compound pages, but then calls split_page() to end up with
>> contiguous order-0 pages. These can now be freed much more efficiently.
>>
>> The execution time of the following function was measured in a VM on an
>> Apple M2 system:
>>
>> static int page_alloc_high_ordr_test(void)
>> {
>> unsigned int order = HPAGE_PMD_ORDER;
>> struct page *page;
>> int i;
>>
>> for (i = 0; i < 100000; i++) {
>> page = alloc_pages(GFP_KERNEL, order);
>> if (!page)
>> return -1;
>> split_page(page, order);
>> free_contig_range(page_to_pfn(page), 1UL << order);
>> }
>>
>> return 0;
>> }
>>
>> Execution time before: 1684366 usec
>> Execution time after: 136216 usec
>>
>> Perf trace before:
>>
>> 60.93% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
>> |
>> ---ret_from_fork
>> kthread
>> 0xffffbba283e63980
>> |
>> |--60.01%--0xffffbba283e636dc
>> | |
>> | |--58.57%--free_contig_range
>> | | |
>> | | |--57.19%--___free_pages
>> | | | |
>> | | | |--46.65%--__free_frozen_pages
>> | | | | |
>> | | | | |--28.08%--free_pcppages_bulk
>> | | | | |
>> | | | | --12.05%--free_frozen_page_commit.constprop.0
>> | | | |
>> | | | |--5.10%--__get_pfnblock_flags_mask.isra.0
>> | | | |
>> | | | |--1.13%--_raw_spin_unlock
>> | | | |
>> | | | |--0.78%--free_frozen_page_commit.constprop.0
>> | | | |
>> | | | --0.75%--_raw_spin_trylock
>> | | |
>> | | --0.95%--__free_frozen_pages
>> | |
>> | --1.44%--___free_pages
>> |
>> --0.78%--0xffffbba283e636c0
>> split_page
>>
>> Perf trace after:
>>
>> 10.62% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
>> |
>> ---ret_from_fork
>> kthread
>> 0xffffbbd55ef74980
>> |
>> |--8.74%--0xffffbbd55ef746dc
>> | free_contig_range
>> | |
>> | --8.72%--__free_contig_range
>> |
>> --1.56%--0xffffbbd55ef746c0
>> |
>> --1.54%--split_page
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> include/linux/gfp.h | 1 +
>> mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++-----
>> 2 files changed, 106 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index b155929af5b1..3ed0bef34d0c 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -439,6 +439,7 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
>> #define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__))
>>
>> #endif
>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
>> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>
>> #ifdef CONFIG_CONTIG_ALLOC
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index a045d728ae0f..1015c8edf8a4 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -91,6 +91,9 @@ typedef int __bitwise fpi_t;
>> /* Free the page without taking locks. Rely on trylock only. */
>> #define FPI_TRYLOCK ((__force fpi_t)BIT(2))
>>
>> +/* free_pages_prepare() has already been called for page(s) being freed. */
>> +#define FPI_PREPARED ((__force fpi_t)BIT(3))
>> +
>> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>> static DEFINE_MUTEX(pcp_batch_high_lock);
>> #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
>> @@ -1582,8 +1585,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>> unsigned long pfn = page_to_pfn(page);
>> struct zone *zone = page_zone(page);
>>
>> - if (free_pages_prepare(page, order))
>> - free_one_page(zone, page, pfn, order, fpi_flags);
>> + if (!(fpi_flags & FPI_PREPARED)) {
>> + if (!free_pages_prepare(page, order))
>> + return;
>> + }
>> +
>> + free_one_page(zone, page, pfn, order, fpi_flags);
>> }
>>
>> void __meminit __free_pages_core(struct page *page, unsigned int order,
>> @@ -2943,8 +2950,10 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
>> return;
>> }
>>
>> - if (!free_pages_prepare(page, order))
>> - return;
>> + if (!(fpi_flags & FPI_PREPARED)) {
>> + if (!free_pages_prepare(page, order))
>> + return;
>> + }
>>
>> /*
>> * We only track unmovable, reclaimable and movable on pcp lists.
>> @@ -7250,9 +7259,99 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
>> }
>> #endif /* CONFIG_CONTIG_ALLOC */
>>
>> +static void free_prepared_contig_range(struct page *page,
>> + unsigned long nr_pages)
>> +{
>> + while (nr_pages) {
>> + unsigned int fit_order, align_order, order;
>> + unsigned long pfn;
>> +
>> + /*
>> + * Find the largest aligned power-of-2 number of pages that
>> + * starts at the current page, does not exceed nr_pages and is
>> + * less than or equal to pageblock_order.
>> + */
>> + pfn = page_to_pfn(page);
>> + fit_order = ilog2(nr_pages);
>> + align_order = pfn ? __ffs(pfn) : fit_order;
>> + order = min3(fit_order, align_order, pageblock_order);
>> +
>> + /*
>> + * Free the chunk as a single block. Our caller has already
>> + * called free_pages_prepare() for each order-0 page.
>> + */
>> + __free_frozen_pages(page, order, FPI_PREPARED);
>> +
>> + page += 1UL << order;
>> + nr_pages -= 1UL << order;
>> + }
>> +}
>> +
>> +/**
>> + * __free_contig_range - Free contiguous range of order-0 pages.
>> + * @pfn: Page frame number of the first page in the range.
>> + * @nr_pages: Number of pages to free.
>> + *
>> + * For each order-0 struct page in the physically contiguous range, put a
>> + * reference. Free any page who's reference count falls to zero. The
>> + * implementation is functionally equivalent to, but significantly faster than
>> + * calling __free_page() for each struct page in a loop.
>> + *
>> + * Memory allocated with alloc_pages(order>=1) then subsequently split to
>> + * order-0 with split_page() is an example of appropriate contiguous pages that
>> + * can be freed with this API.
>> + *
>> + * Returns the number of pages which were not freed, because their reference
>> + * count did not fall to zero.
>> + *
>> + * Context: May be called in interrupt context or while holding a normal
>> + * spinlock, but not in NMI context or while holding a raw spinlock.
>> + */
>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
>> +{
>> + struct page *page = pfn_to_page(pfn);
>> + unsigned long not_freed = 0;
>> + struct page *start = NULL;
>> + unsigned long i;
>> + bool can_free;
>> +
>> + /*
>> + * Chunk the range into contiguous runs of pages for which the refcount
>> + * went to zero and for which free_pages_prepare() succeeded. If
>> + * free_pages_prepare() fails we consider the page to have been freed
>> + * deliberately leak it.
>> + *
>> + * Code assumes contiguous PFNs have contiguous struct pages, but not
>> + * vice versa.
>> + */
>> + for (i = 0; i < nr_pages; i++, page++) {
>> + VM_BUG_ON_PAGE(PageHead(page), page);
>> + VM_BUG_ON_PAGE(PageTail(page), page);
>> +
>> + can_free = put_page_testzero(page);
>> + if (!can_free)
>> + not_freed++;
>> + else if (!free_pages_prepare(page, 0))
>> + can_free = false;
>
> I understand you use free_pages_prepare() here to catch early failures.
> I wonder if we could let __free_frozen_pages() handle the failure of
> non-compound >0 order pages instead of a new FPI flag.
I'm not sure I follow. You would still need to provide a flag to
__free_frozen_pages() to tell it "this is a set of order-0 pages". Otherwise it
will treat it as a non-compound high order page, which would be wrong;
free_pages_prepare() would only be called for the head page (with the order
passed in) and that won't do the right thing.
I guess you could pass the flag all the way to free_pages_prepare(), then it
could be modified to do the right thing for contiguous order-0 pages; that would
probably ultimately be more efficient than calling free_pages_prepare() for
every order-0 page. Is that what you are suggesting?
>
> Looking at free_pages_prepare(), three cases would cause failures:
> 1. PageHWPoison(page): the code excludes >0 order pages, so it needs
> to be fixed. BTW, Jiaqi Yan has a series trying to tackle it[1].
>
> 2. uncleared PageNetpp(page): probably need to check every individual
> page of this >0 order page and call bad_page() for any violator.
>
> 3. bad free page: probably need to do it for individual page as well.
It's not just handling the failures, it's accounting; e.g.
__memcg_kmem_uncharge_page().
>
> I think it might be too much effort for you to get the above done.
Indeed, I'd prefer to consider that an additional improvement opportunity :)
> Can you leave a TODO at FPI_PREPARED? I might try to do it
> if Jiaqi’s series can be merged?
Yes no problem!
>
> Otherwise, the rest of the patch looks good to me.
Thanks for the quick review. I'll wait to see if others chime in then rebase
onto mm-new at the ~end of the week.
Thanks,
Ryan
>
> Thanks.
>
>
> [1] https://lore.kernel.org/linux-mm/20251219183346.3627510-1-jiaqiyan@google.com/
>
>> +
>> + if (!can_free && start) {
>> + free_prepared_contig_range(start, page - start);
>> + start = NULL;
>> + } else if (can_free && !start) {
>> + start = page;
>> + }
>> + }
>> +
>> + if (start)
>> + free_prepared_contig_range(start, page - start);
>> +
>> + return not_freed;
>> +}
>> +EXPORT_SYMBOL(__free_contig_range);
>> +
>> void free_contig_range(unsigned long pfn, unsigned long nr_pages)
>> {
>> - unsigned long count = 0;
>> + unsigned long count;
>> struct folio *folio = pfn_folio(pfn);
>>
>> if (folio_test_large(folio)) {
>> @@ -7266,12 +7365,7 @@ void free_contig_range(unsigned long pfn, unsigned long nr_pages)
>> return;
>> }
>>
>> - for (; nr_pages--; pfn++) {
>> - struct page *page = pfn_to_page(pfn);
>> -
>> - count += page_count(page) != 1;
>> - __free_page(page);
>> - }
>> + count = __free_contig_range(pfn, nr_pages);
>> WARN(count != 0, "%lu pages are still in use!\n", count);
>> }
>> EXPORT_SYMBOL(free_contig_range);
>> --
>> 2.43.0
>
>
> Best Regards,
> Yan, Zi
* Re: [PATCH v1 2/2] vmalloc: Optimize vfree
2026-01-05 16:17 ` [PATCH v1 2/2] vmalloc: Optimize vfree Ryan Roberts
@ 2026-01-06 4:36 ` Matthew Wilcox
2026-01-06 9:47 ` David Laight
2026-01-06 11:04 ` Ryan Roberts
0 siblings, 2 replies; 15+ messages in thread
From: Matthew Wilcox @ 2026-01-06 4:36 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Vishal Moola (Oracle),
linux-mm, linux-kernel
On Mon, Jan 05, 2026 at 04:17:38PM +0000, Ryan Roberts wrote:
> + if (vm->nr_pages) {
> + start_pfn = page_to_pfn(vm->pages[0]);
> + nr = 1;
> + for (i = 1; i < vm->nr_pages; i++) {
> + unsigned long pfn = page_to_pfn(vm->pages[i]);
> +
> + if (start_pfn + nr != pfn) {
> + __free_contig_range(start_pfn, nr);
> + start_pfn = pfn;
> + nr = 1;
> + cond_resched();
> + } else {
> + nr++;
> + }
It kind of feels like __free_contig_range() and this routine do the same
thing -- iterate over each page and make sure that it's compatible with
being freed. What if we did ...
+ for (i = 0; i < vm->nr_pages; i++) {
+ struct page *page = vm->pages[i];
+
+ if (!put_page_testzero(page)) {
+ __free_frozen_contig_pages(start_page, nr);
+ nr = 0;
+ continue;
+ }
+
+ if (!nr) {
+ start_page = page;
+ nr = 1;
+ continue;
+ }
+
+ if (start_page + nr != page) {
+ __free_frozen_contig_pages(start_page, nr);
+ start_page = page;
+ nr = 1;
+ cond_resched();
+ } else {
+ nr++;
+ }
+ }
+
+ __free_frozen_contig_pages(start_page, nr);
That way we don't need to mess around with returning the number of pages
not freed.
* Re: [PATCH v1 0/2] Free contiguous order-0 pages efficiently
2026-01-05 16:17 [PATCH v1 0/2] Free contiguous order-0 pages efficiently Ryan Roberts
` (3 preceding siblings ...)
2026-01-05 16:36 ` Zi Yan
@ 2026-01-06 4:38 ` Matthew Wilcox
2026-01-06 11:10 ` Ryan Roberts
2026-01-06 11:34 ` Uladzislau Rezki
4 siblings, 2 replies; 15+ messages in thread
From: Matthew Wilcox @ 2026-01-06 4:38 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Vishal Moola (Oracle),
linux-mm, linux-kernel
On Mon, Jan 05, 2026 at 04:17:36PM +0000, Ryan Roberts wrote:
> Hi All,
>
> A recent change to vmalloc caused some performance benchmark regressions (see
> [1]). I'm attempting to fix that (and at the same time signficantly improve
Unfortunately, there was no [1] ... I'm not sure this benchmark is
really doing anything representative. But the performance improvement
is certainly welcome; we'd deferred work on that for later.
* Re: [PATCH v1 2/2] vmalloc: Optimize vfree
2026-01-06 4:36 ` Matthew Wilcox
@ 2026-01-06 9:47 ` David Laight
2026-01-06 11:04 ` Ryan Roberts
1 sibling, 0 replies; 15+ messages in thread
From: David Laight @ 2026-01-06 9:47 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ryan Roberts, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Vishal Moola (Oracle),
linux-mm, linux-kernel
On Tue, 6 Jan 2026 04:36:23 +0000
Matthew Wilcox <willy@infradead.org> wrote:
> On Mon, Jan 05, 2026 at 04:17:38PM +0000, Ryan Roberts wrote:
> > + if (vm->nr_pages) {
> > + start_pfn = page_to_pfn(vm->pages[0]);
> > + nr = 1;
> > + for (i = 1; i < vm->nr_pages; i++) {
> > + unsigned long pfn = page_to_pfn(vm->pages[i]);
> > +
> > + if (start_pfn + nr != pfn) {
> > + __free_contig_range(start_pfn, nr);
> > + start_pfn = pfn;
> > + nr = 1;
> > + cond_resched();
> > + } else {
> > + nr++;
> > + }
>
> It kind of feels like __free_contig_range() and this routine do the same
> thing -- iterate over each page and make sure that it's compatible with
> being freed. What if we did ...
>
nr = 0;
> + for (i = 0; i < vm->nr_pages; i++) {
> + struct page *page = vm->pages[i];
> +
> + if (!put_page_testzero(page)) {
Do you need an if (nr) here?
If nothing else something might complain about start_page being unset.
Is this a common/expected path?
If not, I think you can just 'continue' and it all still looks ok.
There is also a cond_resched() below; if this is a common path, should
there be one here?
> + __free_frozen_contig_pages(start_page, nr);
> + nr = 0;
> + continue;
> + }
> +
> + if (!nr) {
> + start_page = page;
> + nr = 1;
> + continue;
> + }
> +
> + if (start_page + nr != page) {
> + __free_frozen_contig_pages(start_page, nr);
> + start_page = page;
> + nr = 1;
> + cond_resched();
> + } else {
> + nr++;
> + }
> + }
> +
> + __free_frozen_contig_pages(start_page, nr);
>
> That way we don't need to mess around with returning the number of pages
> not freed.
>
I think this shorter form is equivalent.
nr = 0;
start_page = NULL;
for (i = 0; i < vm->nr_pages; i++) {
struct page *page = vm->pages[i];
if (!put_page_testzero(page))
continue;
if (start_page + nr != page) {
if (nr) {
__free_frozen_contig_pages(start_page, nr);
cond_resched();
}
start_page = page;
nr = 1;
} else {
nr++;
}
}
if (nr)
__free_frozen_contig_pages(start_page, nr);
David
* Re: [PATCH v1 2/2] vmalloc: Optimize vfree
2026-01-06 4:36 ` Matthew Wilcox
2026-01-06 9:47 ` David Laight
@ 2026-01-06 11:04 ` Ryan Roberts
1 sibling, 0 replies; 15+ messages in thread
From: Ryan Roberts @ 2026-01-06 11:04 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Vishal Moola (Oracle),
linux-mm, linux-kernel
On 06/01/2026 04:36, Matthew Wilcox wrote:
> On Mon, Jan 05, 2026 at 04:17:38PM +0000, Ryan Roberts wrote:
>> + if (vm->nr_pages) {
>> + start_pfn = page_to_pfn(vm->pages[0]);
>> + nr = 1;
>> + for (i = 1; i < vm->nr_pages; i++) {
>> + unsigned long pfn = page_to_pfn(vm->pages[i]);
>> +
>> + if (start_pfn + nr != pfn) {
>> + __free_contig_range(start_pfn, nr);
>> + start_pfn = pfn;
>> + nr = 1;
>> + cond_resched();
>> + } else {
>> + nr++;
>> + }
>
> It kind of feels like __free_contig_range() and this routine do the same
> thing -- iterate over each page and make sure that it's compatible with
> being freed. What if we did ...
__free_contig_range() as I implemented it is common to vfree() and
free_contig_range() so more users benefit from the optimization. If we move
put_page_testzero() into vfree() we would also need a loop in
free_contig_range() to do the same thing.
Additionally where do you propose to put free_pages_prepare()? That's currently
handled by the loop in __free_contig_range() for my implementation. I don't
think we want to export that outside of page_alloc.c really. Zi was suggesting
the long term solution might be to make free_pages_prepare() "contiguous range
of order-0 pages" aware, but that's a future improvement I wasn't planning to do
here, so currently it needs to be called for each order-0 page.
>
> + for (i = 0; i < vm->nr_pages; i++) {
> + struct page *page = vm->pages[i];
> +
> + if (!put_page_testzero(page)) {
> + __free_frozen_contig_pages(start_page, nr);
> + nr = 0;
> + continue;
> + }
> +
> + if (!nr) {
> + start_page = page;
> + nr = 1;
> + continue;
> + }
> +
> + if (start_page + nr != page) {
It was my understanding that a contiguous run of PFNs guarantees a
corresponding contiguous run of struct pages, but not vice versa; I thought
there was a memory model where holes in PFNs are closed in the vmemmap, meaning
that just because 2 struct pages are virtually contiguous doesn't mean the
PFNs are physically contiguous? That's why I was using PFNs here.
Perhaps I'm wrong?
Thanks,
Ryan
> + __free_frozen_contig_pages(start_page, nr);
> + start_page = page;
> + nr = 1;
> + cond_resched();
> + } else {
> + nr++;
> + }
> + }
> +
> + __free_frozen_contig_pages(start_page, nr);
>
> That way we don't need to mess around with returning the number of pages
> not freed.
* Re: [PATCH v1 0/2] Free contiguous order-0 pages efficiently
2026-01-06 4:38 ` Matthew Wilcox
@ 2026-01-06 11:10 ` Ryan Roberts
2026-01-06 11:34 ` Uladzislau Rezki
1 sibling, 0 replies; 15+ messages in thread
From: Ryan Roberts @ 2026-01-06 11:10 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Vishal Moola (Oracle),
linux-mm, linux-kernel
On 06/01/2026 04:38, Matthew Wilcox wrote:
> On Mon, Jan 05, 2026 at 04:17:36PM +0000, Ryan Roberts wrote:
>> Hi All,
>>
>> A recent change to vmalloc caused some performance benchmark regressions (see
>> [1]). I'm attempting to fix that (and at the same time signficantly improve
>
> Unfortunately, there was no [1] ...
Oops:
[1] https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
(it's the same link as Closes: tag in patch 2).
> I'm not sure this benchmark is
> really doing anything representative.
Yes that's probably fair, but my argument is that we should either care about
the numbers or delete the tests. It seems we don't want to delete the tests.
> But the performance improvement
> is certainly welcome; we'd deferred work on that for later.
OK, let's focus on the "performance improvement" motivation instead of the
"regression fixing" part :)
Thanks,
Ryan
* Re: [PATCH v1 0/2] Free contiguous order-0 pages efficiently
2026-01-06 4:38 ` Matthew Wilcox
2026-01-06 11:10 ` Ryan Roberts
@ 2026-01-06 11:34 ` Uladzislau Rezki
1 sibling, 0 replies; 15+ messages in thread
From: Uladzislau Rezki @ 2026-01-06 11:34 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Ryan Roberts, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Uladzislau Rezki, Vishal Moola (Oracle),
linux-mm, linux-kernel
On Tue, Jan 06, 2026 at 04:38:39AM +0000, Matthew Wilcox wrote:
> On Mon, Jan 05, 2026 at 04:17:36PM +0000, Ryan Roberts wrote:
> > Hi All,
> >
> > A recent change to vmalloc caused some performance benchmark regressions (see
> > [1]). I'm attempting to fix that (and at the same time signficantly improve
>
> Unfortunately, there was no [1] ... I'm not sure this benchmark is
> really doing anything representative. But the performance improvement
> is certainly welcome; we'd deferred work on that for later.
>
When the high-order preference allocation patch was discussed, I noticed
a difference in behaviour right away. Further investigation showed that
the free path also needs to be improved to fully benefit from that change.
I can document the test-cases in test_vmalloc.c if it helps. I can also
add more focused benchmarks for the allocation and free paths.
--
Uladzislau Rezki
* Re: [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range()
2026-01-05 17:31 ` Ryan Roberts
@ 2026-01-07 3:32 ` Zi Yan
0 siblings, 0 replies; 15+ messages in thread
From: Zi Yan @ 2026-01-07 3:32 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Uladzislau Rezki, Vishal Moola (Oracle),
linux-mm, linux-kernel, Jiaqi Yan
On 5 Jan 2026, at 12:31, Ryan Roberts wrote:
> On 05/01/2026 17:15, Zi Yan wrote:
>> On 5 Jan 2026, at 11:17, Ryan Roberts wrote:
>>
>>> Decompose the range of order-0 pages to be freed into the set of largest
>>> possible power-of-2 size and aligned chunks and free them to the pcp or
>>> buddy. This improves on the previous approach which freed each order-0
>>> page individually in a loop. Testing shows performance to be improved by
>>> more than 10x in some cases.
>>>
>>> Since each page is order-0, we must decrement each page's reference
>>> count individually and only consider the page for freeing as part of a
>>> high order chunk if the reference count goes to zero. Additionally
>>> free_pages_prepare() must be called for each individual order-0 page
>>> too, so that the struct page state and global accounting state can be
>>> appropriately managed. But once this is done, the resulting high order
>>> chunks can be freed as a unit to the pcp or buddy.
>>>
>>> This significiantly speeds up the free operation but also has the side
>>> benefit that high order blocks are added to the pcp instead of each page
>>> ending up on the pcp order-0 list; memory remains more readily available
>>> in high orders.
>>>
>>> vmalloc will shortly become a user of this new optimized
>>> free_contig_range() since it agressively allocates high order
>>> non-compound pages, but then calls split_page() to end up with
>>> contiguous order-0 pages. These can now be freed much more efficiently.
>>>
>>> The execution time of the following function was measured in a VM on an
>>> Apple M2 system:
>>>
>>> static int page_alloc_high_ordr_test(void)
>>> {
>>> unsigned int order = HPAGE_PMD_ORDER;
>>> struct page *page;
>>> int i;
>>>
>>> for (i = 0; i < 100000; i++) {
>>> page = alloc_pages(GFP_KERNEL, order);
>>> if (!page)
>>> return -1;
>>> split_page(page, order);
>>> free_contig_range(page_to_pfn(page), 1UL << order);
>>> }
>>>
>>> return 0;
>>> }
>>>
>>> Execution time before: 1684366 usec
>>> Execution time after: 136216 usec
>>>
>>> Perf trace before:
>>>
>>> 60.93% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
>>> |
>>> ---ret_from_fork
>>> kthread
>>> 0xffffbba283e63980
>>> |
>>> |--60.01%--0xffffbba283e636dc
>>> | |
>>> | |--58.57%--free_contig_range
>>> | | |
>>> | | |--57.19%--___free_pages
>>> | | | |
>>> | | | |--46.65%--__free_frozen_pages
>>> | | | | |
>>> | | | | |--28.08%--free_pcppages_bulk
>>> | | | | |
>>> | | | | --12.05%--free_frozen_page_commit.constprop.0
>>> | | | |
>>> | | | |--5.10%--__get_pfnblock_flags_mask.isra.0
>>> | | | |
>>> | | | |--1.13%--_raw_spin_unlock
>>> | | | |
>>> | | | |--0.78%--free_frozen_page_commit.constprop.0
>>> | | | |
>>> | | | --0.75%--_raw_spin_trylock
>>> | | |
>>> | | --0.95%--__free_frozen_pages
>>> | |
>>> | --1.44%--___free_pages
>>> |
>>> --0.78%--0xffffbba283e636c0
>>> split_page
>>>
>>> Perf trace after:
>>>
>>> 10.62% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
>>> |
>>> ---ret_from_fork
>>> kthread
>>> 0xffffbbd55ef74980
>>> |
>>> |--8.74%--0xffffbbd55ef746dc
>>> | free_contig_range
>>> | |
>>> | --8.72%--__free_contig_range
>>> |
>>> --1.56%--0xffffbbd55ef746c0
>>> |
>>> --1.54%--split_page
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>> include/linux/gfp.h | 1 +
>>> mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++-----
>>> 2 files changed, 106 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>> index b155929af5b1..3ed0bef34d0c 100644
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -439,6 +439,7 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
>>> #define alloc_contig_pages(...) alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__))
>>>
>>> #endif
>>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>>
>>> #ifdef CONFIG_CONTIG_ALLOC
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index a045d728ae0f..1015c8edf8a4 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -91,6 +91,9 @@ typedef int __bitwise fpi_t;
>>> /* Free the page without taking locks. Rely on trylock only. */
>>> #define FPI_TRYLOCK ((__force fpi_t)BIT(2))
>>>
>>> +/* free_pages_prepare() has already been called for page(s) being freed. */
>>> +#define FPI_PREPARED ((__force fpi_t)BIT(3))
>>> +
>>> /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>>> static DEFINE_MUTEX(pcp_batch_high_lock);
>>> #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
>>> @@ -1582,8 +1585,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>>> unsigned long pfn = page_to_pfn(page);
>>> struct zone *zone = page_zone(page);
>>>
>>> - if (free_pages_prepare(page, order))
>>> - free_one_page(zone, page, pfn, order, fpi_flags);
>>> + if (!(fpi_flags & FPI_PREPARED)) {
>>> + if (!free_pages_prepare(page, order))
>>> + return;
>>> + }
>>> +
>>> + free_one_page(zone, page, pfn, order, fpi_flags);
>>> }
>>>
>>> void __meminit __free_pages_core(struct page *page, unsigned int order,
>>> @@ -2943,8 +2950,10 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
>>> return;
>>> }
>>>
>>> - if (!free_pages_prepare(page, order))
>>> - return;
>>> + if (!(fpi_flags & FPI_PREPARED)) {
>>> + if (!free_pages_prepare(page, order))
>>> + return;
>>> + }
>>>
>>> /*
>>> * We only track unmovable, reclaimable and movable on pcp lists.
>>> @@ -7250,9 +7259,99 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
>>> }
>>> #endif /* CONFIG_CONTIG_ALLOC */
>>>
>>> +static void free_prepared_contig_range(struct page *page,
>>> + unsigned long nr_pages)
>>> +{
>>> + while (nr_pages) {
>>> + unsigned int fit_order, align_order, order;
>>> + unsigned long pfn;
>>> +
>>> + /*
>>> + * Find the largest aligned power-of-2 number of pages that
>>> + * starts at the current page, does not exceed nr_pages and is
>>> + * less than or equal to pageblock_order.
>>> + */
>>> + pfn = page_to_pfn(page);
>>> + fit_order = ilog2(nr_pages);
>>> + align_order = pfn ? __ffs(pfn) : fit_order;
>>> + order = min3(fit_order, align_order, pageblock_order);
>>> +
>>> + /*
>>> + * Free the chunk as a single block. Our caller has already
>>> + * called free_pages_prepare() for each order-0 page.
>>> + */
>>> + __free_frozen_pages(page, order, FPI_PREPARED);
>>> +
>>> + page += 1UL << order;
>>> + nr_pages -= 1UL << order;
>>> + }
>>> +}
>>> +
>>> +/**
>>> + * __free_contig_range - Free contiguous range of order-0 pages.
>>> + * @pfn: Page frame number of the first page in the range.
>>> + * @nr_pages: Number of pages to free.
>>> + *
>>> + * For each order-0 struct page in the physically contiguous range, put a
>>> + * reference. Free any page whose reference count falls to zero. The
>>> + * implementation is functionally equivalent to, but significantly faster than
>>> + * calling __free_page() for each struct page in a loop.
>>> + *
>>> + * Memory allocated with alloc_pages(order>=1) then subsequently split to
>>> + * order-0 with split_page() is an example of appropriate contiguous pages that
>>> + * can be freed with this API.
>>> + *
>>> + * Returns the number of pages which were not freed, because their reference
>>> + * count did not fall to zero.
>>> + *
>>> + * Context: May be called in interrupt context or while holding a normal
>>> + * spinlock, but not in NMI context or while holding a raw spinlock.
>>> + */
>>> +unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
>>> +{
>>> + struct page *page = pfn_to_page(pfn);
>>> + unsigned long not_freed = 0;
>>> + struct page *start = NULL;
>>> + unsigned long i;
>>> + bool can_free;
>>> +
>>> + /*
>>> + * Chunk the range into contiguous runs of pages for which the refcount
>>> + * went to zero and for which free_pages_prepare() succeeded. If
>>> + * free_pages_prepare() fails we consider the page to have been freed
>>> + * and deliberately leak it.
>>> + *
>>> + * Code assumes contiguous PFNs have contiguous struct pages, but not
>>> + * vice versa.
>>> + */
>>> + for (i = 0; i < nr_pages; i++, page++) {
>>> + VM_BUG_ON_PAGE(PageHead(page), page);
>>> + VM_BUG_ON_PAGE(PageTail(page), page);
>>> +
>>> + can_free = put_page_testzero(page);
>>> + if (!can_free)
>>> + not_freed++;
>>> + else if (!free_pages_prepare(page, 0))
>>> + can_free = false;
>>
>> I understand you use free_pages_prepare() here to catch early failures.
>> I wonder if we could let __free_frozen_pages() handle the failure of
>> non-compound >0 order pages instead of a new FPI flag.
>
> I'm not sure I follow. You would still need to provide a flag to
> __free_frozen_pages() to tell it "this is a set of order-0 pages". Otherwise it
> will treat it as a non-compound high order page, which would be wrong;
> free_pages_prepare() would only be called for the head page (with the order
> passed in) and that won't do the right thing.
>
> I guess you could pass the flag all the way to free_pages_prepare() then it
> could be modified to do the right thing for contiguous order-0 pages; that would
> probably ultimately be more efficient then calling free_pages_prepare() for
> every order-0 page. Is that what you are suggesting?
Yes. I mistakenly mixed up a non-compound high order page and a set of
order-0 pages. There is alloc_pages_bulk() to get a list of order-0 pages,
but free_pages_bulk() does not exist. Maybe that is what we need here?
Using __free_frozen_pages() for a set of order-0 pages looks like
shoehorning. I admit that adding free_pages_bulk() with maximal code
reuse and a good interface will take some effort, so it is probably a
long-term goal. free_pages_bulk() is also slightly different from what
you want to do, since, if it uses the same interface as alloc_pages_bulk(),
it will need to accept a page array instead of page + order.
I am not suggesting you should do this; I am just thinking out loud.
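Just to illustrate the shape of the interface (purely hypothetical, nothing
like this exists in the tree; a naive loop is shown, whereas the whole point
of adding it would be to batch the pcp/zone work the way your series does
for a contiguous range):

  /* Hypothetical counterpart to alloc_pages_bulk(): free an array of
   * order-0 pages. */
  void free_pages_bulk(unsigned long nr_pages, struct page **page_array)
  {
          unsigned long i;

          for (i = 0; i < nr_pages; i++) {
                  if (!page_array[i])
                          continue;
                  __free_pages(page_array[i], 0);
                  page_array[i] = NULL;
          }
  }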
>
>>
>> Looking at free_pages_prepare(), three cases would cause failures:
>> 1. PageHWPoison(page): the code excludes >0 order pages, so it needs
>> to be fixed. BTW, Jiaqi Yan has a series trying to tackle it[1].
>>
>> 2. uncleared PageNetpp(page): probably need to check every individual
>> page of this >0 order page and call bad_page() for any violator.
>>
>> 3. bad free page: probably need to do it for individual page as well.
>
> It's not just handling the failures, it's accounting; e.g.
> __memcg_kmem_uncharge_page().
Got it. Another idea comes to mind.
Is it doable to
1) use put_page_testzero() to bring all pages’ refs to 0,
2) unsplit/merge these contiguous order-0 pages back to non-compound
high order pages,
3) free unsplit/merged pages with __free_frozen_pages()?
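Roughly like this, as a sketch of the shape only (it ignores the per-page
accounting you point out above, which is exactly the open question, and it
assumes the caller holds the last reference to every page; the helper name
is made up):

  static void free_merged_contig_pages(struct page *page, unsigned int order)
  {
          unsigned long i, nr = 1UL << order;

          /* 1) drop the (assumed last) reference on each order-0 page; a
           * real implementation would need a fallback for still-referenced
           * pages */
          for (i = 0; i < nr; i++)
                  VM_WARN_ON_ONCE(!put_page_testzero(page + i));

          /* 2) + 3) hand the range back as one non-compound high order page */
          __free_frozen_pages(page, order, FPI_NONE);
  }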
Since your example is 1) allocate a non-compound high order page, then
2) split_page() it, the above approach is doing the reverse steps.
Does your example represent the actual use cases?
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2026-01-07 3:32 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-05 16:17 [PATCH v1 0/2] Free contiguous order-0 pages efficiently Ryan Roberts
2026-01-05 16:17 ` [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range() Ryan Roberts
2026-01-05 17:15 ` Zi Yan
2026-01-05 17:31 ` Ryan Roberts
2026-01-07 3:32 ` Zi Yan
2026-01-05 16:17 ` [PATCH v1 2/2] vmalloc: Optimize vfree Ryan Roberts
2026-01-06 4:36 ` Matthew Wilcox
2026-01-06 9:47 ` David Laight
2026-01-06 11:04 ` Ryan Roberts
2026-01-05 16:26 ` [PATCH v1 0/2] Free contiguous order-0 pages efficiently David Hildenbrand (Red Hat)
2026-01-05 16:36 ` Zi Yan
2026-01-05 16:41 ` Ryan Roberts
2026-01-06 4:38 ` Matthew Wilcox
2026-01-06 11:10 ` Ryan Roberts
2026-01-06 11:34 ` Uladzislau Rezki