[PATCH 1/2] mm/vmalloc: Add large-order allocation helper

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/2] mm/vmalloc: Add large-order allocation helper
@ 2025-12-16 21:19 Uladzislau Rezki (Sony)
  2025-12-16 21:19 ` [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter Uladzislau Rezki (Sony)
  0 siblings, 1 reply; 27+ messages in thread
From: Uladzislau Rezki (Sony) @ 2025-12-16 21:19 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Vishal Moola, Ryan Roberts, Dev Jain, Baoquan He, LKML, Uladzislau Rezki

Refactor vm_area_alloc_pages() by moving the high-order allocation
code into a separate function, vm_area_alloc_pages_large_order().

No functional changes.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ecbac900c35f..d3a4725e15ca 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3624,18 +3624,20 @@ static inline gfp_t vmalloc_gfp_adjust(gfp_t flags, const bool large)
 	return flags;
 }
 
-static inline unsigned int
-vm_area_alloc_pages(gfp_t gfp, int nid,
-		unsigned int order, unsigned int nr_pages, struct page **pages)
+static unsigned int
+vm_area_alloc_pages_large_order(gfp_t gfp, int nid, unsigned int order,
+		unsigned int nr_pages, struct page **pages)
 {
 	unsigned int nr_allocated = 0;
 	unsigned int nr_remaining = nr_pages;
 	unsigned int max_attempt_order = MAX_PAGE_ORDER;
 	struct page *page;
+	unsigned int large_order;
+	gfp_t large_gfp;
 	int i;
-	unsigned int large_order = ilog2(nr_remaining);
-	gfp_t large_gfp = vmalloc_gfp_adjust(gfp, large_order) & ~__GFP_DIRECT_RECLAIM;
 
+	large_order = ilog2(nr_remaining);
+	large_gfp = vmalloc_gfp_adjust(gfp, large_order) & ~__GFP_DIRECT_RECLAIM;
 	large_order = min(max_attempt_order, large_order);
 
 	/*
@@ -3666,6 +3668,20 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		large_order = min(max_attempt_order, large_order);
 	}
 
+	return nr_allocated;
+}
+
+static inline unsigned int
+vm_area_alloc_pages(gfp_t gfp, int nid,
+		unsigned int order, unsigned int nr_pages, struct page **pages)
+{
+	unsigned int nr_allocated = 0;
+	struct page *page;
+	int i;
+
+	nr_allocated = vm_area_alloc_pages_large_order(gfp, nid,
+		order, nr_pages, pages);
+
 	/*
 	 * For order-0 pages we make use of bulk allocator, if
 	 * the page array is partly or not at all populated due
-- 
2.47.3



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-16 21:19 [PATCH 1/2] mm/vmalloc: Add large-order allocation helper Uladzislau Rezki (Sony)
@ 2025-12-16 21:19 ` Uladzislau Rezki (Sony)
  2025-12-16 23:36   ` Andrew Morton
                     ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Uladzislau Rezki (Sony) @ 2025-12-16 21:19 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Vishal Moola, Ryan Roberts, Dev Jain, Baoquan He, LKML, Uladzislau Rezki

Introduce a module parameter to enable or disable the large-order
allocation path in vmalloc. High-order allocations are disabled by
default so far, but users may explicitly enable them at runtime if
desired.

High-order pages allocated for vmalloc are immediately split into
order-0 pages and later freed as order-0, which means they do not
feed the per-CPU page caches. As a result, high-order attempts tend
to bypass the PCP fastpath and fall back to the buddy allocator that
can affect performance.

However, when the PCP caches are empty, high-order allocations may
show better performance characteristics especially for larger
allocation requests.

Since the best strategy is workload-dependent, this patch adds a
parameter letting users to choose whether vmalloc should try
high-order allocations or stay strictly on the order-0 fastpath.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d3a4725e15ca..f66543896b16 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -43,6 +43,7 @@
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
 #include <linux/page_owner.h>
+#include <linux/moduleparam.h>

 #define CREATE_TRACE_POINTS
 #include <trace/events/vmalloc.h>
@@ -3671,6 +3672,9 @@ vm_area_alloc_pages_large_order(gfp_t gfp, int nid, unsigned int order,
 	return nr_allocated;
 }

+static int attempt_larger_order_alloc;
+module_param(attempt_larger_order_alloc, int, 0644);
+
 static inline unsigned int
 vm_area_alloc_pages(gfp_t gfp, int nid,
 		unsigned int order, unsigned int nr_pages, struct page **pages)
@@ -3679,8 +3683,9 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 	struct page *page;
 	int i;

-	nr_allocated = vm_area_alloc_pages_large_order(gfp, nid,
-		order, nr_pages, pages);
+	if (attempt_larger_order_alloc)
+		nr_allocated = vm_area_alloc_pages_large_order(gfp, nid,
+			order, nr_pages, pages);

 	/*
 	 * For order-0 pages we make use of bulk allocator, if
-- 
2.47.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-16 21:19 ` [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter Uladzislau Rezki (Sony)
@ 2025-12-16 23:36   ` Andrew Morton
  2025-12-17 11:37     ` Uladzislau Rezki
  2025-12-17  3:54   ` Baoquan He
  2025-12-17  8:27   ` Ryan Roberts
  2 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2025-12-16 23:36 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Vishal Moola, Ryan Roberts, Dev Jain, Baoquan He, LKML

On Tue, 16 Dec 2025 22:19:21 +0100 "Uladzislau Rezki (Sony)" <urezki@gmail.com> wrote:

> Introduce a module parameter to enable or disable the large-order
> allocation path in vmalloc. High-order allocations are disabled by
> default so far, but users may explicitly enable them at runtime if
> desired.
> 
> High-order pages allocated for vmalloc are immediately split into
> order-0 pages and later freed as order-0, which means they do not
> feed the per-CPU page caches. As a result, high-order attempts tend
> to bypass the PCP fastpath and fall back to the buddy allocator that
> can affect performance.
> 
> However, when the PCP caches are empty, high-order allocations may
> show better performance characteristics especially for larger
> allocation requests.
> 
> Since the best strategy is workload-dependent, this patch adds a
> parameter letting users to choose whether vmalloc should try
> high-order allocations or stay strictly on the order-0 fastpath.
> 
> ...
>
> +module_param(attempt_larger_order_alloc, int, 0644);

We should have user docs, please.  Probably in kernel-parameters.txt.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-16 23:36   ` Andrew Morton
@ 2025-12-17 11:37     ` Uladzislau Rezki
  0 siblings, 0 replies; 27+ messages in thread
From: Uladzislau Rezki @ 2025-12-17 11:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Vishal Moola, Ryan Roberts, Dev Jain, Baoquan He, LKML

On Tue, Dec 16, 2025 at 03:36:04PM -0800, Andrew Morton wrote:
> On Tue, 16 Dec 2025 22:19:21 +0100 "Uladzislau Rezki (Sony)" <urezki@gmail.com> wrote:
> 
> > Introduce a module parameter to enable or disable the large-order
> > allocation path in vmalloc. High-order allocations are disabled by
> > default so far, but users may explicitly enable them at runtime if
> > desired.
> > 
> > High-order pages allocated for vmalloc are immediately split into
> > order-0 pages and later freed as order-0, which means they do not
> > feed the per-CPU page caches. As a result, high-order attempts tend
> > to bypass the PCP fastpath and fall back to the buddy allocator that
> > can affect performance.
> > 
> > However, when the PCP caches are empty, high-order allocations may
> > show better performance characteristics especially for larger
> > allocation requests.
> > 
> > Since the best strategy is workload-dependent, this patch adds a
> > parameter letting users to choose whether vmalloc should try
> > high-order allocations or stay strictly on the order-0 fastpath.
> > 
> > ...
> >
> > +module_param(attempt_larger_order_alloc, int, 0644);
> 
> We should have user docs, please.  Probably in kernel-parameters.txt.
>
Thanks. I will add it.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-16 21:19 ` [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter Uladzislau Rezki (Sony)
  2025-12-16 23:36   ` Andrew Morton
@ 2025-12-17  3:54   ` Baoquan He
  2025-12-17 11:44     ` Uladzislau Rezki
  2025-12-17  8:27   ` Ryan Roberts
  2 siblings, 1 reply; 27+ messages in thread
From: Baoquan He @ 2025-12-17  3:54 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Andrew Morton, Vishal Moola, Ryan Roberts, Dev Jain, LKML

Hi Uladzislau,

On 12/16/25 at 10:19pm, Uladzislau Rezki (Sony) wrote:
> Introduce a module parameter to enable or disable the large-order
> allocation path in vmalloc. High-order allocations are disabled by
> default so far, but users may explicitly enable them at runtime if
> desired.
> 
> High-order pages allocated for vmalloc are immediately split into
> order-0 pages and later freed as order-0, which means they do not
> feed the per-CPU page caches. As a result, high-order attempts tend

I don't get why order-0 do not feed the PCP caches.

> to bypass the PCP fastpath and fall back to the buddy allocator that
> can affect performance.
> 
> However, when the PCP caches are empty, high-order allocations may
> show better performance characteristics especially for larger
> allocation requests.

And when PCP is empty, high-order alloc show better performance. Could
you please help elaborate a little more about them? Thanks.

Thanks
Baoquan

> 
> Since the best strategy is workload-dependent, this patch adds a
> parameter letting users to choose whether vmalloc should try
> high-order allocations or stay strictly on the order-0 fastpath.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d3a4725e15ca..f66543896b16 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -43,6 +43,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/shmparam.h>
>  #include <linux/page_owner.h>
> +#include <linux/moduleparam.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/vmalloc.h>
> @@ -3671,6 +3672,9 @@ vm_area_alloc_pages_large_order(gfp_t gfp, int nid, unsigned int order,
>  	return nr_allocated;
>  }
>  
> +static int attempt_larger_order_alloc;
> +module_param(attempt_larger_order_alloc, int, 0644);
> +
>  static inline unsigned int
>  vm_area_alloc_pages(gfp_t gfp, int nid,
>  		unsigned int order, unsigned int nr_pages, struct page **pages)
> @@ -3679,8 +3683,9 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  	struct page *page;
>  	int i;
>  
> -	nr_allocated = vm_area_alloc_pages_large_order(gfp, nid,
> -		order, nr_pages, pages);
> +	if (attempt_larger_order_alloc)
> +		nr_allocated = vm_area_alloc_pages_large_order(gfp, nid,
> +			order, nr_pages, pages);
>  
>  	/*
>  	 * For order-0 pages we make use of bulk allocator, if
> -- 
> 2.47.3
> 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-17  3:54   ` Baoquan He
@ 2025-12-17 11:44     ` Uladzislau Rezki
  2025-12-17 11:49       ` Dev Jain
  2025-12-18 10:34       ` Baoquan He
  0 siblings, 2 replies; 27+ messages in thread
From: Uladzislau Rezki @ 2025-12-17 11:44 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, Vishal Moola, Ryan Roberts, Dev Jain,
	LKML

On Wed, Dec 17, 2025 at 11:54:26AM +0800, Baoquan He wrote:
> Hi Uladzislau,
> 
> On 12/16/25 at 10:19pm, Uladzislau Rezki (Sony) wrote:
> > Introduce a module parameter to enable or disable the large-order
> > allocation path in vmalloc. High-order allocations are disabled by
> > default so far, but users may explicitly enable them at runtime if
> > desired.
> > 
> > High-order pages allocated for vmalloc are immediately split into
> > order-0 pages and later freed as order-0, which means they do not
> > feed the per-CPU page caches. As a result, high-order attempts tend
> 
> I don't get why order-0 do not feed the PCP caches.
> 
"they" -> high-order pages. I should improve it.

> > to bypass the PCP fastpath and fall back to the buddy allocator that
> > can affect performance.
> > 
> > However, when the PCP caches are empty, high-order allocations may
> > show better performance characteristics especially for larger
> > allocation requests.
> 
> And when PCP is empty, high-order alloc show better performance. Could
> you please help elaborate a little more about them? Thanks.
> 
This is what i/we measured. See below example:

# default order-3
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3718592 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3740495 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3737213 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3740765 usec

# patch order-3
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3350391 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3374568 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3286374 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3261335 usec

why higher-order wins, i think it is less cyclesto get one big chunk from the
buddy instead of looping and pick one by one.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-17 11:44     ` Uladzislau Rezki
@ 2025-12-17 11:49       ` Dev Jain
  2025-12-17 11:53         ` Uladzislau Rezki
  2025-12-18 10:34       ` Baoquan He
  1 sibling, 1 reply; 27+ messages in thread
From: Dev Jain @ 2025-12-17 11:49 UTC (permalink / raw)
  To: Uladzislau Rezki, Baoquan He
  Cc: linux-mm, Andrew Morton, Vishal Moola, Ryan Roberts, LKML


On 17/12/25 5:14 pm, Uladzislau Rezki wrote:
> On Wed, Dec 17, 2025 at 11:54:26AM +0800, Baoquan He wrote:
>> Hi Uladzislau,
>>
>> On 12/16/25 at 10:19pm, Uladzislau Rezki (Sony) wrote:
>>> Introduce a module parameter to enable or disable the large-order
>>> allocation path in vmalloc. High-order allocations are disabled by
>>> default so far, but users may explicitly enable them at runtime if
>>> desired.
>>>
>>> High-order pages allocated for vmalloc are immediately split into
>>> order-0 pages and later freed as order-0, which means they do not
>>> feed the per-CPU page caches. As a result, high-order attempts tend
>> I don't get why order-0 do not feed the PCP caches.
>>
> "they" -> high-order pages. I should improve it.
>
>>> to bypass the PCP fastpath and fall back to the buddy allocator that
>>> can affect performance.
>>>
>>> However, when the PCP caches are empty, high-order allocations may
>>> show better performance characteristics especially for larger
>>> allocation requests.
>> And when PCP is empty, high-order alloc show better performance. Could
>> you please help elaborate a little more about them? Thanks.
>>
> This is what i/we measured. See below example:
>
> # default order-3
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3718592 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3740495 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3737213 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3740765 usec
>
> # patch order-3
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3350391 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3374568 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3286374 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3261335 usec
>
> why higher-order wins, i think it is less cyclesto get one big chunk from the
> buddy instead of looping and pick one by one.

I have the same observation that getting a higher-order chunk is faster than bulk allocating basepages.
(btw, I had resent my RFC, in case you missed!)

>
> --
> Uladzislau Rezki


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-17 11:49       ` Dev Jain
@ 2025-12-17 11:53         ` Uladzislau Rezki
  0 siblings, 0 replies; 27+ messages in thread
From: Uladzislau Rezki @ 2025-12-17 11:53 UTC (permalink / raw)
  To: Dev Jain
  Cc: Baoquan He, linux-mm, Andrew Morton, Vishal Moola, Ryan Roberts, LKML

Hello, Jain!

>> (btw, I had resent my RFC, in case you missed!)
I have not :) I saw that. Give me some time.

--
Uladzislau Rezki

On Wed, Dec 17, 2025 at 12:49 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 17/12/25 5:14 pm, Uladzislau Rezki wrote:
> > On Wed, Dec 17, 2025 at 11:54:26AM +0800, Baoquan He wrote:
> >> Hi Uladzislau,
> >>
> >> On 12/16/25 at 10:19pm, Uladzislau Rezki (Sony) wrote:
> >>> Introduce a module parameter to enable or disable the large-order
> >>> allocation path in vmalloc. High-order allocations are disabled by
> >>> default so far, but users may explicitly enable them at runtime if
> >>> desired.
> >>>
> >>> High-order pages allocated for vmalloc are immediately split into
> >>> order-0 pages and later freed as order-0, which means they do not
> >>> feed the per-CPU page caches. As a result, high-order attempts tend
> >> I don't get why order-0 do not feed the PCP caches.
> >>
> > "they" -> high-order pages. I should improve it.
> >
> >>> to bypass the PCP fastpath and fall back to the buddy allocator that
> >>> can affect performance.
> >>>
> >>> However, when the PCP caches are empty, high-order allocations may
> >>> show better performance characteristics especially for larger
> >>> allocation requests.
> >> And when PCP is empty, high-order alloc show better performance. Could
> >> you please help elaborate a little more about them? Thanks.
> >>
> > This is what i/we measured. See below example:
> >
> > # default order-3
> > Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3718592 usec
> > Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3740495 usec
> > Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3737213 usec
> > Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3740765 usec
> >
> > # patch order-3
> > Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3350391 usec
> > Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3374568 usec
> > Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3286374 usec
> > Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3261335 usec
> >
> > why higher-order wins, i think it is less cyclesto get one big chunk from the
> > buddy instead of looping and pick one by one.
>
> I have the same observation that getting a higher-order chunk is faster than bulk allocating basepages.
> (btw, I had resent my RFC, in case you missed!)
>
> >
> > --
> > Uladzislau Rezki



-- 
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-17 11:44     ` Uladzislau Rezki
  2025-12-17 11:49       ` Dev Jain
@ 2025-12-18 10:34       ` Baoquan He
  1 sibling, 0 replies; 27+ messages in thread
From: Baoquan He @ 2025-12-18 10:34 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, Vishal Moola, Ryan Roberts, Dev Jain, LKML

On 12/17/25 at 12:44pm, Uladzislau Rezki wrote:
> On Wed, Dec 17, 2025 at 11:54:26AM +0800, Baoquan He wrote:
> > Hi Uladzislau,
> > 
> > On 12/16/25 at 10:19pm, Uladzislau Rezki (Sony) wrote:
> > > Introduce a module parameter to enable or disable the large-order
> > > allocation path in vmalloc. High-order allocations are disabled by
> > > default so far, but users may explicitly enable them at runtime if
> > > desired.
> > > 
> > > High-order pages allocated for vmalloc are immediately split into
> > > order-0 pages and later freed as order-0, which means they do not
> > > feed the per-CPU page caches. As a result, high-order attempts tend
> > 
> > I don't get why order-0 do not feed the PCP caches.
> > 
> "they" -> high-order pages. I should improve it.

Ah, git it now, thanks.

> 
> > > to bypass the PCP fastpath and fall back to the buddy allocator that
> > > can affect performance.
> > > 
> > > However, when the PCP caches are empty, high-order allocations may
> > > show better performance characteristics especially for larger
> > > allocation requests.
> > 
> > And when PCP is empty, high-order alloc show better performance. Could
> > you please help elaborate a little more about them? Thanks.
> > 
> This is what i/we measured. See below example:
> 
> # default order-3
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3718592 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3740495 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3737213 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3740765 usec
> 
> # patch order-3
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3350391 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3374568 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3286374 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3261335 usec
> 
> why higher-order wins, i think it is less cyclesto get one big chunk from the
> buddy instead of looping and pick one by one.

Thanks a lot for the details.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-16 21:19 ` [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter Uladzislau Rezki (Sony)
  2025-12-16 23:36   ` Andrew Morton
  2025-12-17  3:54   ` Baoquan He
@ 2025-12-17  8:27   ` Ryan Roberts
  2025-12-17 12:02     ` Uladzislau Rezki
  2 siblings, 1 reply; 27+ messages in thread
From: Ryan Roberts @ 2025-12-17  8:27 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony), linux-mm, Andrew Morton
  Cc: Vishal Moola, Dev Jain, Baoquan He, LKML

On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
> Introduce a module parameter to enable or disable the large-order
> allocation path in vmalloc. High-order allocations are disabled by
> default so far, but users may explicitly enable them at runtime if
> desired.
> 
> High-order pages allocated for vmalloc are immediately split into
> order-0 pages and later freed as order-0, which means they do not
> feed the per-CPU page caches. As a result, high-order attempts tend
> to bypass the PCP fastpath and fall back to the buddy allocator that
> can affect performance.
> 
> However, when the PCP caches are empty, high-order allocations may
> show better performance characteristics especially for larger
> allocation requests.

I wonder if a better solution would be "allocate order-0 if available in pcp,
else try large order, else fallback to order-0" Could that provide the best of
all worlds without needing a configuration knob?

> 
> Since the best strategy is workload-dependent, this patch adds a
> parameter letting users to choose whether vmalloc should try
> high-order allocations or stay strictly on the order-0 fastpath.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d3a4725e15ca..f66543896b16 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -43,6 +43,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/shmparam.h>
>  #include <linux/page_owner.h>
> +#include <linux/moduleparam.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/vmalloc.h>
> @@ -3671,6 +3672,9 @@ vm_area_alloc_pages_large_order(gfp_t gfp, int nid, unsigned int order,
>  	return nr_allocated;
>  }
>  
> +static int attempt_larger_order_alloc;
> +module_param(attempt_larger_order_alloc, int, 0644);

Would this be better as a bool? Docs say that you can then specify 0/1, y/n or
Y/N as the value; that's probably more intuitive?

nit: I'd favour a shorter name. Perhaps large_order_alloc?

Thanks,
Ryan

> +
>  static inline unsigned int
>  vm_area_alloc_pages(gfp_t gfp, int nid,
>  		unsigned int order, unsigned int nr_pages, struct page **pages)
> @@ -3679,8 +3683,9 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  	struct page *page;
>  	int i;
>  
> -	nr_allocated = vm_area_alloc_pages_large_order(gfp, nid,
> -		order, nr_pages, pages);
> +	if (attempt_larger_order_alloc)
> +		nr_allocated = vm_area_alloc_pages_large_order(gfp, nid,
> +			order, nr_pages, pages);
>  
>  	/*
>  	 * For order-0 pages we make use of bulk allocator, if



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-17  8:27   ` Ryan Roberts
@ 2025-12-17 12:02     ` Uladzislau Rezki
  2025-12-17 15:20       ` Ryan Roberts
  0 siblings, 1 reply; 27+ messages in thread
From: Uladzislau Rezki @ 2025-12-17 12:02 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, Vishal Moola, Dev Jain, Baoquan He,
	LKML

> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
> > Introduce a module parameter to enable or disable the large-order
> > allocation path in vmalloc. High-order allocations are disabled by
> > default so far, but users may explicitly enable them at runtime if
> > desired.
> > 
> > High-order pages allocated for vmalloc are immediately split into
> > order-0 pages and later freed as order-0, which means they do not
> > feed the per-CPU page caches. As a result, high-order attempts tend
> > to bypass the PCP fastpath and fall back to the buddy allocator that
> > can affect performance.
> > 
> > However, when the PCP caches are empty, high-order allocations may
> > show better performance characteristics especially for larger
> > allocation requests.
> 
> I wonder if a better solution would be "allocate order-0 if available in pcp,
> else try large order, else fallback to order-0" Could that provide the best of
> all worlds without needing a configuration knob?
> 
I am not sure, to me it looks like a bit odd. Ideally it would be
good just free it as high-order page and not order-0 peaces.

> > 
> > Since the best strategy is workload-dependent, this patch adds a
> > parameter letting users to choose whether vmalloc should try
> > high-order allocations or stay strictly on the order-0 fastpath.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index d3a4725e15ca..f66543896b16 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -43,6 +43,7 @@
> >  #include <asm/tlbflush.h>
> >  #include <asm/shmparam.h>
> >  #include <linux/page_owner.h>
> > +#include <linux/moduleparam.h>
> >  
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/vmalloc.h>
> > @@ -3671,6 +3672,9 @@ vm_area_alloc_pages_large_order(gfp_t gfp, int nid, unsigned int order,
> >  	return nr_allocated;
> >  }
> >  
> > +static int attempt_larger_order_alloc;
> > +module_param(attempt_larger_order_alloc, int, 0644);
> 
> Would this be better as a bool? Docs say that you can then specify 0/1, y/n or
> Y/N as the value; that's probably more intuitive?
> 
> nit: I'd favour a shorter name. Perhaps large_order_alloc?
> 
Thanks! We can switch to bool and use shorter name for sure.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-17 12:02     ` Uladzislau Rezki
@ 2025-12-17 15:20       ` Ryan Roberts
  2025-12-17 17:01         ` Ryan Roberts
  2025-12-18  4:55         ` Dev Jain
  0 siblings, 2 replies; 27+ messages in thread
From: Ryan Roberts @ 2025-12-17 15:20 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, Vishal Moola, Dev Jain, Baoquan He, LKML

On 17/12/2025 12:02, Uladzislau Rezki wrote:
>> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
>>> Introduce a module parameter to enable or disable the large-order
>>> allocation path in vmalloc. High-order allocations are disabled by
>>> default so far, but users may explicitly enable them at runtime if
>>> desired.
>>>
>>> High-order pages allocated for vmalloc are immediately split into
>>> order-0 pages and later freed as order-0, which means they do not
>>> feed the per-CPU page caches. As a result, high-order attempts tend
>>> to bypass the PCP fastpath and fall back to the buddy allocator that
>>> can affect performance.
>>>
>>> However, when the PCP caches are empty, high-order allocations may
>>> show better performance characteristics especially for larger
>>> allocation requests.
>>
>> I wonder if a better solution would be "allocate order-0 if available in pcp,
>> else try large order, else fallback to order-0" Could that provide the best of
>> all worlds without needing a configuration knob?
>>
> I am not sure, to me it looks like a bit odd. 

Perhaps it would feel better if it was generalized to "first try allocation from
PCP list, highest to lowest order, then try allocation from the buddy, highest
to lowest order"?

> Ideally it would be
> good just free it as high-order page and not order-0 peaces.

Yeah perhaps that's better. How about something like this (very lightly tested
and no performance results yet):

(And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
with a contiguous run of order-0 pages, but I'm not seeing any warnings or
memory leaks when running mm selftests...)

---8<---
commit caa3e5eb5bfade81a32fa62d1a8924df1eb0f619
Author: Ryan Roberts <ryan.roberts@arm.com>
Date:   Wed Dec 17 15:11:08 2025 +0000

    WIP

    Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b155929af5b1..d25f5b867e6b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -383,6 +383,8 @@ extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages_nolock(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);

+void free_pages_bulk(struct page *page, int nr_pages);
+
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 822e05f1a964..5f11224cf353 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5304,6 +5304,48 @@ static void ___free_pages(struct page *page, unsigned int
order,
 	}
 }

+static void free_frozen_pages_bulk(struct page *page, int nr_pages)
+{
+	while (nr_pages) {
+		unsigned int fit_order, align_order, order;
+		unsigned long pfn;
+
+		pfn = page_to_pfn(page);
+		fit_order = ilog2(nr_pages);
+		align_order = pfn ? __ffs(pfn) : fit_order;
+		order = min3(fit_order, align_order, MAX_PAGE_ORDER);
+
+		free_frozen_pages(page, order);
+
+		page += 1U << order;
+		nr_pages -= 1U << order;
+	}
+}
+
+void free_pages_bulk(struct page *page, int nr_pages)
+{
+	struct page *start = NULL;
+	bool can_free;
+	int i;
+
+	for (i = 0; i < nr_pages; i++, page++) {
+		VM_BUG_ON_PAGE(PageHead(page), page);
+		VM_BUG_ON_PAGE(PageTail(page), page);
+
+		can_free = put_page_testzero(page);
+
+		if (!can_free && start) {
+			free_frozen_pages_bulk(start, page - start);
+			start = NULL;
+		} else if (can_free && !start) {
+			start = page;
+		}
+	}
+
+	if (start)
+		free_frozen_pages_bulk(start, page - start);
+}
+
 /**
  * __free_pages - Free pages allocated with alloc_pages().
  * @page: The page pointer returned from alloc_pages().
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ecbac900c35f..8f782bac1ece 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3429,7 +3429,8 @@ void vfree_atomic(const void *addr)
 void vfree(const void *addr)
 {
 	struct vm_struct *vm;
-	int i;
+	struct page *start;
+	int i, nr;

 	if (unlikely(in_interrupt())) {
 		vfree_atomic(addr);
@@ -3455,17 +3456,26 @@ void vfree(const void *addr)
 	/* All pages of vm should be charged to same memcg, so use first one. */
 	if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES))
 		mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages);
-	for (i = 0; i < vm->nr_pages; i++) {
+
+	start = vm->pages[0];
+	BUG_ON(!start);
+	nr = 1;
+	for (i = 1; i < vm->nr_pages; i++) {
 		struct page *page = vm->pages[i];

 		BUG_ON(!page);
-		/*
-		 * High-order allocs for huge vmallocs are split, so
-		 * can be freed as an array of order-0 allocations
-		 */
-		__free_page(page);
-		cond_resched();
+
+		if (start + nr != page) {
+			free_pages_bulk(start, nr);
+			start = page;
+			nr = 1;
+			cond_resched();
+		} else {
+			nr++;
+		}
 	}
+	free_pages_bulk(start, nr);
+
 	if (!(vm->flags & VM_MAP_PUT_PAGES))
 		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
 	kvfree(vm->pages);
---8<---

> 
>>>
>>> Since the best strategy is workload-dependent, this patch adds a
>>> parameter letting users to choose whether vmalloc should try
>>> high-order allocations or stay strictly on the order-0 fastpath.
>>>
>>> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>> ---
>>>  mm/vmalloc.c | 9 +++++++--
>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index d3a4725e15ca..f66543896b16 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -43,6 +43,7 @@
>>>  #include <asm/tlbflush.h>
>>>  #include <asm/shmparam.h>
>>>  #include <linux/page_owner.h>
>>> +#include <linux/moduleparam.h>
>>>  
>>>  #define CREATE_TRACE_POINTS
>>>  #include <trace/events/vmalloc.h>
>>> @@ -3671,6 +3672,9 @@ vm_area_alloc_pages_large_order(gfp_t gfp, int nid, unsigned int order,
>>>  	return nr_allocated;
>>>  }
>>>  
>>> +static int attempt_larger_order_alloc;
>>> +module_param(attempt_larger_order_alloc, int, 0644);
>>
>> Would this be better as a bool? Docs say that you can then specify 0/1, y/n or
>> Y/N as the value; that's probably more intuitive?
>>
>> nit: I'd favour a shorter name. Perhaps large_order_alloc?
>>
> Thanks! We can switch to bool and use shorter name for sure.
> 
> --
> Uladzislau Rezki



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-17 15:20       ` Ryan Roberts
@ 2025-12-17 17:01         ` Ryan Roberts
  2025-12-17 19:22           ` Uladzislau Rezki
  2025-12-17 20:08           ` Uladzislau Rezki
  2025-12-18  4:55         ` Dev Jain
  1 sibling, 2 replies; 27+ messages in thread
From: Ryan Roberts @ 2025-12-17 17:01 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, Vishal Moola, Dev Jain, Baoquan He, LKML

On 17/12/2025 15:20, Ryan Roberts wrote:
> On 17/12/2025 12:02, Uladzislau Rezki wrote:
>>> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
>>>> Introduce a module parameter to enable or disable the large-order
>>>> allocation path in vmalloc. High-order allocations are disabled by
>>>> default so far, but users may explicitly enable them at runtime if
>>>> desired.
>>>>
>>>> High-order pages allocated for vmalloc are immediately split into
>>>> order-0 pages and later freed as order-0, which means they do not
>>>> feed the per-CPU page caches. As a result, high-order attempts tend
>>>> to bypass the PCP fastpath and fall back to the buddy allocator that
>>>> can affect performance.
>>>>
>>>> However, when the PCP caches are empty, high-order allocations may
>>>> show better performance characteristics especially for larger
>>>> allocation requests.
>>>
>>> I wonder if a better solution would be "allocate order-0 if available in pcp,
>>> else try large order, else fallback to order-0" Could that provide the best of
>>> all worlds without needing a configuration knob?
>>>
>> I am not sure, to me it looks like a bit odd. 
> 
> Perhaps it would feel better if it was generalized to "first try allocation from
> PCP list, highest to lowest order, then try allocation from the buddy, highest
> to lowest order"?
> 
>> Ideally it would be
>> good just free it as high-order page and not order-0 peaces.
> 
> Yeah perhaps that's better. How about something like this (very lightly tested
> and no performance results yet):
> 
> (And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
> with a contiguous run of order-0 pages, but I'm not seeing any warnings or
> memory leaks when running mm selftests...)
> 
> ---8<---
> commit caa3e5eb5bfade81a32fa62d1a8924df1eb0f619
> Author: Ryan Roberts <ryan.roberts@arm.com>
> Date:   Wed Dec 17 15:11:08 2025 +0000
> 
>     WIP
> 
>     Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index b155929af5b1..d25f5b867e6b 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -383,6 +383,8 @@ extern void __free_pages(struct page *page, unsigned int order);
>  extern void free_pages_nolock(struct page *page, unsigned int order);
>  extern void free_pages(unsigned long addr, unsigned int order);
> 
> +void free_pages_bulk(struct page *page, int nr_pages);
> +
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 822e05f1a964..5f11224cf353 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5304,6 +5304,48 @@ static void ___free_pages(struct page *page, unsigned int
> order,
>  	}
>  }
> 
> +static void free_frozen_pages_bulk(struct page *page, int nr_pages)
> +{
> +	while (nr_pages) {
> +		unsigned int fit_order, align_order, order;
> +		unsigned long pfn;
> +
> +		pfn = page_to_pfn(page);
> +		fit_order = ilog2(nr_pages);
> +		align_order = pfn ? __ffs(pfn) : fit_order;
> +		order = min3(fit_order, align_order, MAX_PAGE_ORDER);
> +
> +		free_frozen_pages(page, order);
> +
> +		page += 1U << order;
> +		nr_pages -= 1U << order;
> +	}
> +}
> +
> +void free_pages_bulk(struct page *page, int nr_pages)
> +{
> +	struct page *start = NULL;
> +	bool can_free;
> +	int i;
> +
> +	for (i = 0; i < nr_pages; i++, page++) {
> +		VM_BUG_ON_PAGE(PageHead(page), page);
> +		VM_BUG_ON_PAGE(PageTail(page), page);
> +
> +		can_free = put_page_testzero(page);
> +
> +		if (!can_free && start) {
> +			free_frozen_pages_bulk(start, page - start);
> +			start = NULL;
> +		} else if (can_free && !start) {
> +			start = page;
> +		}
> +	}
> +
> +	if (start)
> +		free_frozen_pages_bulk(start, page - start);
> +}
> +
>  /**
>   * __free_pages - Free pages allocated with alloc_pages().
>   * @page: The page pointer returned from alloc_pages().
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ecbac900c35f..8f782bac1ece 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3429,7 +3429,8 @@ void vfree_atomic(const void *addr)
>  void vfree(const void *addr)
>  {
>  	struct vm_struct *vm;
> -	int i;
> +	struct page *start;
> +	int i, nr;
> 
>  	if (unlikely(in_interrupt())) {
>  		vfree_atomic(addr);
> @@ -3455,17 +3456,26 @@ void vfree(const void *addr)
>  	/* All pages of vm should be charged to same memcg, so use first one. */
>  	if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES))
>  		mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages);
> -	for (i = 0; i < vm->nr_pages; i++) {
> +
> +	start = vm->pages[0];
> +	BUG_ON(!start);
> +	nr = 1;
> +	for (i = 1; i < vm->nr_pages; i++) {
>  		struct page *page = vm->pages[i];
> 
>  		BUG_ON(!page);
> -		/*
> -		 * High-order allocs for huge vmallocs are split, so
> -		 * can be freed as an array of order-0 allocations
> -		 */
> -		__free_page(page);
> -		cond_resched();
> +
> +		if (start + nr != page) {
> +			free_pages_bulk(start, nr);
> +			start = page;
> +			nr = 1;
> +			cond_resched();
> +		} else {
> +			nr++;
> +		}
>  	}
> +	free_pages_bulk(start, nr);
> +
>  	if (!(vm->flags & VM_MAP_PUT_PAGES))
>  		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
>  	kvfree(vm->pages);
> ---8<---

I tested this on a performance monitoring system and see a huge improvement for 
the test_vmalloc tests.

Both columns are compared to v6.18. 6-19-0-rc1 has Vishal's change to allocate 
large orders, which I previously reported the regressions for. vfree-high-order 
adds the above patch to free contiguous order-0 pages in bulk.

(R)/(I) means statistically significant regression/improvement. Results are 
normalized so that less than zero is regression and greater than zero is 
improvement.

+-----------------+----------------------------------------------------------+--------------+------------------+
| Benchmark       | Result Class                                             |   6-19-0-rc1 | vfree-high-order |
+=================+==========================================================+==============+==================+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |  (R) -40.69% |        (I) 3.98% |
|                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |        0.10% |           -1.47% |
|                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |  (R) -22.74% |       (I) 11.57% |
|                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |  (R) -23.63% |       (I) 47.42% |
|                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |       -1.58% |      (I) 106.01% |
|                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |  (R) -24.39% |       (I) 99.12% |
|                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |    (I) 2.34% |      (I) 196.87% |
|                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         |  (R) -23.29% |      (I) 125.42% |
|                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |    (I) 3.74% |      (I) 238.59% |
|                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         |  (R) -23.80% |      (I) 132.38% |
|                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         |   (R) -2.84% |      (I) 514.75% |
|                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |        2.74% |            0.33% |
|                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |        0.58% |            1.36% |
|                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |       -0.66% |            1.48% |
|                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     |  (R) -25.24% |       (I) 77.95% |
|                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |       -0.58% |            0.60% |
|                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |  (R) -45.75% |        (I) 8.51% |
|                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        |  (R) -28.16% |       (I) 65.34% |
|                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |       -0.54% |           -0.33% |
+-----------------+----------------------------------------------------------+--------------+------------------+

What do you think?

Thanks,
Ryan

> 
>>
>>>>
>>>> Since the best strategy is workload-dependent, this patch adds a
>>>> parameter letting users to choose whether vmalloc should try
>>>> high-order allocations or stay strictly on the order-0 fastpath.
>>>>
>>>> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>>> ---
>>>>  mm/vmalloc.c | 9 +++++++--
>>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>>> index d3a4725e15ca..f66543896b16 100644
>>>> --- a/mm/vmalloc.c
>>>> +++ b/mm/vmalloc.c
>>>> @@ -43,6 +43,7 @@
>>>>  #include <asm/tlbflush.h>
>>>>  #include <asm/shmparam.h>
>>>>  #include <linux/page_owner.h>
>>>> +#include <linux/moduleparam.h>
>>>>  
>>>>  #define CREATE_TRACE_POINTS
>>>>  #include <trace/events/vmalloc.h>
>>>> @@ -3671,6 +3672,9 @@ vm_area_alloc_pages_large_order(gfp_t gfp, int nid, unsigned int order,
>>>>  	return nr_allocated;
>>>>  }
>>>>  
>>>> +static int attempt_larger_order_alloc;
>>>> +module_param(attempt_larger_order_alloc, int, 0644);
>>>
>>> Would this be better as a bool? Docs say that you can then specify 0/1, y/n or
>>> Y/N as the value; that's probably more intuitive?
>>>
>>> nit: I'd favour a shorter name. Perhaps large_order_alloc?
>>>
>> Thanks! We can switch to bool and use shorter name for sure.
>>
>> --
>> Uladzislau Rezki
> 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-17 17:01         ` Ryan Roberts
@ 2025-12-17 19:22           ` Uladzislau Rezki
  2025-12-18 11:12             ` Ryan Roberts
  2025-12-17 20:08           ` Uladzislau Rezki
  1 sibling, 1 reply; 27+ messages in thread
From: Uladzislau Rezki @ 2025-12-17 19:22 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Uladzislau Rezki, linux-mm, Andrew Morton, Vishal Moola,
	Dev Jain, Baoquan He, LKML

On Wed, Dec 17, 2025 at 05:01:19PM +0000, Ryan Roberts wrote:
> On 17/12/2025 15:20, Ryan Roberts wrote:
> > On 17/12/2025 12:02, Uladzislau Rezki wrote:
> >>> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
> >>>> Introduce a module parameter to enable or disable the large-order
> >>>> allocation path in vmalloc. High-order allocations are disabled by
> >>>> default so far, but users may explicitly enable them at runtime if
> >>>> desired.
> >>>>
> >>>> High-order pages allocated for vmalloc are immediately split into
> >>>> order-0 pages and later freed as order-0, which means they do not
> >>>> feed the per-CPU page caches. As a result, high-order attempts tend
> >>>> to bypass the PCP fastpath and fall back to the buddy allocator that
> >>>> can affect performance.
> >>>>
> >>>> However, when the PCP caches are empty, high-order allocations may
> >>>> show better performance characteristics especially for larger
> >>>> allocation requests.
> >>>
> >>> I wonder if a better solution would be "allocate order-0 if available in pcp,
> >>> else try large order, else fallback to order-0" Could that provide the best of
> >>> all worlds without needing a configuration knob?
> >>>
> >> I am not sure, to me it looks like a bit odd. 
> > 
> > Perhaps it would feel better if it was generalized to "first try allocation from
> > PCP list, highest to lowest order, then try allocation from the buddy, highest
> > to lowest order"?
> > 
> >> Ideally it would be
> >> good just free it as high-order page and not order-0 peaces.
> > 
> > Yeah perhaps that's better. How about something like this (very lightly tested
> > and no performance results yet):
> > 
> > (And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
> > with a contiguous run of order-0 pages, but I'm not seeing any warnings or
> > memory leaks when running mm selftests...)
> > 
> > ---8<---
> > commit caa3e5eb5bfade81a32fa62d1a8924df1eb0f619
> > Author: Ryan Roberts <ryan.roberts@arm.com>
> > Date:   Wed Dec 17 15:11:08 2025 +0000
> > 
> >     WIP
> > 
> >     Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index b155929af5b1..d25f5b867e6b 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -383,6 +383,8 @@ extern void __free_pages(struct page *page, unsigned int order);
> >  extern void free_pages_nolock(struct page *page, unsigned int order);
> >  extern void free_pages(unsigned long addr, unsigned int order);
> > 
> > +void free_pages_bulk(struct page *page, int nr_pages);
> > +
> >  #define __free_page(page) __free_pages((page), 0)
> >  #define free_page(addr) free_pages((addr), 0)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 822e05f1a964..5f11224cf353 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5304,6 +5304,48 @@ static void ___free_pages(struct page *page, unsigned int
> > order,
> >  	}
> >  }
> > 
> > +static void free_frozen_pages_bulk(struct page *page, int nr_pages)
> > +{
> > +	while (nr_pages) {
> > +		unsigned int fit_order, align_order, order;
> > +		unsigned long pfn;
> > +
> > +		pfn = page_to_pfn(page);
> > +		fit_order = ilog2(nr_pages);
> > +		align_order = pfn ? __ffs(pfn) : fit_order;
> > +		order = min3(fit_order, align_order, MAX_PAGE_ORDER);
> > +
> > +		free_frozen_pages(page, order);
> > +
> > +		page += 1U << order;
> > +		nr_pages -= 1U << order;
> > +	}
> > +}
> > +
> > +void free_pages_bulk(struct page *page, int nr_pages)
> > +{
> > +	struct page *start = NULL;
> > +	bool can_free;
> > +	int i;
> > +
> > +	for (i = 0; i < nr_pages; i++, page++) {
> > +		VM_BUG_ON_PAGE(PageHead(page), page);
> > +		VM_BUG_ON_PAGE(PageTail(page), page);
> > +
> > +		can_free = put_page_testzero(page);
> > +
> > +		if (!can_free && start) {
> > +			free_frozen_pages_bulk(start, page - start);
> > +			start = NULL;
> > +		} else if (can_free && !start) {
> > +			start = page;
> > +		}
> > +	}
> > +
> > +	if (start)
> > +		free_frozen_pages_bulk(start, page - start);
> > +}
> > +
> >  /**
> >   * __free_pages - Free pages allocated with alloc_pages().
> >   * @page: The page pointer returned from alloc_pages().
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index ecbac900c35f..8f782bac1ece 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3429,7 +3429,8 @@ void vfree_atomic(const void *addr)
> >  void vfree(const void *addr)
> >  {
> >  	struct vm_struct *vm;
> > -	int i;
> > +	struct page *start;
> > +	int i, nr;
> > 
> >  	if (unlikely(in_interrupt())) {
> >  		vfree_atomic(addr);
> > @@ -3455,17 +3456,26 @@ void vfree(const void *addr)
> >  	/* All pages of vm should be charged to same memcg, so use first one. */
> >  	if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES))
> >  		mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages);
> > -	for (i = 0; i < vm->nr_pages; i++) {
> > +
> > +	start = vm->pages[0];
> > +	BUG_ON(!start);
> > +	nr = 1;
> > +	for (i = 1; i < vm->nr_pages; i++) {
> >  		struct page *page = vm->pages[i];
> > 
> >  		BUG_ON(!page);
> > -		/*
> > -		 * High-order allocs for huge vmallocs are split, so
> > -		 * can be freed as an array of order-0 allocations
> > -		 */
> > -		__free_page(page);
> > -		cond_resched();
> > +
> > +		if (start + nr != page) {
> > +			free_pages_bulk(start, nr);
> > +			start = page;
> > +			nr = 1;
> > +			cond_resched();
> > +		} else {
> > +			nr++;
> > +		}
> >  	}
> > +	free_pages_bulk(start, nr);
> > +
> >  	if (!(vm->flags & VM_MAP_PUT_PAGES))
> >  		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
> >  	kvfree(vm->pages);
> > ---8<---
> 
> I tested this on a performance monitoring system and see a huge improvement for 
> the test_vmalloc tests.
> 
> Both columns are compared to v6.18. 6-19-0-rc1 has Vishal's change to allocate 
> large orders, which I previously reported the regressions for. vfree-high-order 
> adds the above patch to free contiguous order-0 pages in bulk.
> 
> (R)/(I) means statistically significant regression/improvement. Results are 
> normalized so that less than zero is regression and greater than zero is 
> improvement.
> 
> +-----------------+----------------------------------------------------------+--------------+------------------+
> | Benchmark       | Result Class                                             |   6-19-0-rc1 | vfree-high-order |
> +=================+==========================================================+==============+==================+
> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |  (R) -40.69% |        (I) 3.98% |
> |                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |        0.10% |           -1.47% |
> |                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |  (R) -22.74% |       (I) 11.57% |
> |                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |  (R) -23.63% |       (I) 47.42% |
> |                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |       -1.58% |      (I) 106.01% |
> |                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |  (R) -24.39% |       (I) 99.12% |
> |                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |    (I) 2.34% |      (I) 196.87% |
> |                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         |  (R) -23.29% |      (I) 125.42% |
> |                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |    (I) 3.74% |      (I) 238.59% |
> |                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         |  (R) -23.80% |      (I) 132.38% |
> |                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         |   (R) -2.84% |      (I) 514.75% |
> |                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |        2.74% |            0.33% |
> |                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |        0.58% |            1.36% |
> |                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |       -0.66% |            1.48% |
> |                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     |  (R) -25.24% |       (I) 77.95% |
> |                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |       -0.58% |            0.60% |
> |                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |  (R) -45.75% |        (I) 8.51% |
> |                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        |  (R) -28.16% |       (I) 65.34% |
> |                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |       -0.54% |           -0.33% |
> +-----------------+----------------------------------------------------------+--------------+------------------+
> 
> What do you think?
> 
You were first :)

Some figures from me:

# Default(3 pages)
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 541868 usec
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 542515 usec
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 541561 usec
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 542951 usec

# Patch(3 pages)
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 585266 usec
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 594301 usec
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 598912 usec
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 589345 usec

Now the perf figures are almost settled and aligned with default!
We do use per-cpu-cache for 3 pages allocations.

# Default(100 pages)
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 5724919 usec
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 5721430 usec
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 5717224 usec

# Patch(100 pages)
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2629600 usec
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2622811 usec
fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2629324 usec

~2x faster! It is because of freeing now occurs much more efficient
so we spent less cycles on free path comparing with default case.

See below, perf also confirms that vfree() ~2x consumes less cycles:

# Default
+   96.99%     0.49%  [test_vmalloc]        [k] fix_size_alloc_test
+   59.64%     2.38%  [kernel]              [k] vfree.part.0
+   45.69%    15.80%  [kernel]              [k] __free_frozen_pages
+   39.83%     0.00%  [kernel]              [k] ret_from_fork_asm
+   39.83%     0.00%  [kernel]              [k] ret_from_fork
+   39.83%     0.00%  [kernel]              [k] kthread
+   38.67%     0.00%  [test_vmalloc]        [k] test_func
+   36.64%     0.01%  [kernel]              [k] __vmalloc_node_noprof
+   36.63%     0.20%  [kernel]              [k] __vmalloc_node_range_noprof
+   17.55%     4.94%  [kernel]              [k] alloc_pages_bulk_noprof
+   16.46%    12.21%  [kernel]              [k] free_frozen_page_commit.isra.0
+   16.06%     8.09%  [kernel]              [k] vmap_small_pages_range_noflush
+   12.56%    10.82%  [kernel]              [k] __rmqueue_pcplist
+    9.45%     9.43%  [kernel]              [k] __get_pfnblock_flags_mask.isra.0
+    7.95%     7.95%  [kernel]              [k] pfn_valid
+    5.77%     0.03%  [kernel]              [k] remove_vm_area
+    5.44%     5.44%  [kernel]              [k] ___free_pages
+    4.67%     4.59%  [kernel]              [k] __vunmap_range_noflush
+    4.30%     4.30%  [kernel]              [k] __list_add_valid_or_report

# Patch
+   94.28%     1.00%  [test_vmalloc]        [k] fix_size_alloc_test
+   55.63%     0.03%  [kernel]              [k] __vmalloc_node_noprof
+   55.60%     3.78%  [kernel]              [k] __vmalloc_node_range_noprof
+   37.26%    19.29%  [kernel]              [k] vmap_small_pages_range_noflush
+   37.12%     5.63%  [kernel]              [k] vfree.part.0
+   30.59%     0.00%  [kernel]              [k] ret_from_fork_asm
+   30.59%     0.00%  [kernel]              [k] ret_from_fork
+   30.59%     0.00%  [kernel]              [k] kthread
+   28.79%     0.00%  [test_vmalloc]        [k] test_func
+   17.90%    17.88%  [kernel]              [k] pfn_valid
+   13.24%     0.02%  [kernel]              [k] remove_vm_area
+   10.90%    10.68%  [kernel]              [k] __vunmap_range_noflush
+   10.81%    10.80%  [kernel]              [k] free_pages_bulk
+    7.09%     0.51%  [kernel]              [k] alloc_pages_noprof
+    6.58%     0.41%  [kernel]              [k] alloc_pages_mpol
+    6.50%     0.30%  [kernel]              [k] free_frozen_pages_bulk
+    5.74%     0.97%  [kernel]              [k] __alloc_frozen_pages_noprof
+    5.70%     0.00%  [kernel]              [k] worker_thread
+    5.62%     0.02%  [kernel]              [k] process_one_work
+    5.57%     0.01%  [kernel]              [k] __purge_vmap_area_lazy
+    4.76%     2.55%  [kernel]              [k] get_page_from_freelist

So it is nice :)

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-17 19:22           ` Uladzislau Rezki
@ 2025-12-18 11:12             ` Ryan Roberts
  2025-12-18 11:33               ` Uladzislau Rezki
  0 siblings, 1 reply; 27+ messages in thread
From: Ryan Roberts @ 2025-12-18 11:12 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, Vishal Moola, Dev Jain, Baoquan He, LKML

On 17/12/2025 19:22, Uladzislau Rezki wrote:
> On Wed, Dec 17, 2025 at 05:01:19PM +0000, Ryan Roberts wrote:
>> On 17/12/2025 15:20, Ryan Roberts wrote:
>>> On 17/12/2025 12:02, Uladzislau Rezki wrote:
>>>>> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
>>>>>> Introduce a module parameter to enable or disable the large-order
>>>>>> allocation path in vmalloc. High-order allocations are disabled by
>>>>>> default so far, but users may explicitly enable them at runtime if
>>>>>> desired.
>>>>>>
>>>>>> High-order pages allocated for vmalloc are immediately split into
>>>>>> order-0 pages and later freed as order-0, which means they do not
>>>>>> feed the per-CPU page caches. As a result, high-order attempts tend
>>>>>> to bypass the PCP fastpath and fall back to the buddy allocator that
>>>>>> can affect performance.
>>>>>>
>>>>>> However, when the PCP caches are empty, high-order allocations may
>>>>>> show better performance characteristics especially for larger
>>>>>> allocation requests.
>>>>>
>>>>> I wonder if a better solution would be "allocate order-0 if available in pcp,
>>>>> else try large order, else fallback to order-0" Could that provide the best of
>>>>> all worlds without needing a configuration knob?
>>>>>
>>>> I am not sure, to me it looks like a bit odd. 
>>>
>>> Perhaps it would feel better if it was generalized to "first try allocation from
>>> PCP list, highest to lowest order, then try allocation from the buddy, highest
>>> to lowest order"?
>>>
>>>> Ideally it would be
>>>> good just free it as high-order page and not order-0 peaces.
>>>
>>> Yeah perhaps that's better. How about something like this (very lightly tested
>>> and no performance results yet):
>>>
>>> (And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
>>> with a contiguous run of order-0 pages, but I'm not seeing any warnings or
>>> memory leaks when running mm selftests...)
>>>
>>> ---8<---
>>> commit caa3e5eb5bfade81a32fa62d1a8924df1eb0f619
>>> Author: Ryan Roberts <ryan.roberts@arm.com>
>>> Date:   Wed Dec 17 15:11:08 2025 +0000
>>>
>>>     WIP
>>>
>>>     Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>
>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>> index b155929af5b1..d25f5b867e6b 100644
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -383,6 +383,8 @@ extern void __free_pages(struct page *page, unsigned int order);
>>>  extern void free_pages_nolock(struct page *page, unsigned int order);
>>>  extern void free_pages(unsigned long addr, unsigned int order);
>>>
>>> +void free_pages_bulk(struct page *page, int nr_pages);
>>> +
>>>  #define __free_page(page) __free_pages((page), 0)
>>>  #define free_page(addr) free_pages((addr), 0)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 822e05f1a964..5f11224cf353 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -5304,6 +5304,48 @@ static void ___free_pages(struct page *page, unsigned int
>>> order,
>>>  	}
>>>  }
>>>
>>> +static void free_frozen_pages_bulk(struct page *page, int nr_pages)
>>> +{
>>> +	while (nr_pages) {
>>> +		unsigned int fit_order, align_order, order;
>>> +		unsigned long pfn;
>>> +
>>> +		pfn = page_to_pfn(page);
>>> +		fit_order = ilog2(nr_pages);
>>> +		align_order = pfn ? __ffs(pfn) : fit_order;
>>> +		order = min3(fit_order, align_order, MAX_PAGE_ORDER);
>>> +
>>> +		free_frozen_pages(page, order);
>>> +
>>> +		page += 1U << order;
>>> +		nr_pages -= 1U << order;
>>> +	}
>>> +}
>>> +
>>> +void free_pages_bulk(struct page *page, int nr_pages)
>>> +{
>>> +	struct page *start = NULL;
>>> +	bool can_free;
>>> +	int i;
>>> +
>>> +	for (i = 0; i < nr_pages; i++, page++) {
>>> +		VM_BUG_ON_PAGE(PageHead(page), page);
>>> +		VM_BUG_ON_PAGE(PageTail(page), page);
>>> +
>>> +		can_free = put_page_testzero(page);
>>> +
>>> +		if (!can_free && start) {
>>> +			free_frozen_pages_bulk(start, page - start);
>>> +			start = NULL;
>>> +		} else if (can_free && !start) {
>>> +			start = page;
>>> +		}
>>> +	}
>>> +
>>> +	if (start)
>>> +		free_frozen_pages_bulk(start, page - start);
>>> +}
>>> +
>>>  /**
>>>   * __free_pages - Free pages allocated with alloc_pages().
>>>   * @page: The page pointer returned from alloc_pages().
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index ecbac900c35f..8f782bac1ece 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -3429,7 +3429,8 @@ void vfree_atomic(const void *addr)
>>>  void vfree(const void *addr)
>>>  {
>>>  	struct vm_struct *vm;
>>> -	int i;
>>> +	struct page *start;
>>> +	int i, nr;
>>>
>>>  	if (unlikely(in_interrupt())) {
>>>  		vfree_atomic(addr);
>>> @@ -3455,17 +3456,26 @@ void vfree(const void *addr)
>>>  	/* All pages of vm should be charged to same memcg, so use first one. */
>>>  	if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES))
>>>  		mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages);
>>> -	for (i = 0; i < vm->nr_pages; i++) {
>>> +
>>> +	start = vm->pages[0];
>>> +	BUG_ON(!start);
>>> +	nr = 1;
>>> +	for (i = 1; i < vm->nr_pages; i++) {
>>>  		struct page *page = vm->pages[i];
>>>
>>>  		BUG_ON(!page);
>>> -		/*
>>> -		 * High-order allocs for huge vmallocs are split, so
>>> -		 * can be freed as an array of order-0 allocations
>>> -		 */
>>> -		__free_page(page);
>>> -		cond_resched();
>>> +
>>> +		if (start + nr != page) {
>>> +			free_pages_bulk(start, nr);
>>> +			start = page;
>>> +			nr = 1;
>>> +			cond_resched();
>>> +		} else {
>>> +			nr++;
>>> +		}
>>>  	}
>>> +	free_pages_bulk(start, nr);
>>> +
>>>  	if (!(vm->flags & VM_MAP_PUT_PAGES))
>>>  		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
>>>  	kvfree(vm->pages);
>>> ---8<---
>>
>> I tested this on a performance monitoring system and see a huge improvement for 
>> the test_vmalloc tests.
>>
>> Both columns are compared to v6.18. 6-19-0-rc1 has Vishal's change to allocate 
>> large orders, which I previously reported the regressions for. vfree-high-order 
>> adds the above patch to free contiguous order-0 pages in bulk.
>>
>> (R)/(I) means statistically significant regression/improvement. Results are 
>> normalized so that less than zero is regression and greater than zero is 
>> improvement.
>>
>> +-----------------+----------------------------------------------------------+--------------+------------------+
>> | Benchmark       | Result Class                                             |   6-19-0-rc1 | vfree-high-order |
>> +=================+==========================================================+==============+==================+
>> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |  (R) -40.69% |        (I) 3.98% |
>> |                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |        0.10% |           -1.47% |
>> |                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |  (R) -22.74% |       (I) 11.57% |
>> |                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |  (R) -23.63% |       (I) 47.42% |
>> |                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |       -1.58% |      (I) 106.01% |
>> |                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |  (R) -24.39% |       (I) 99.12% |
>> |                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |    (I) 2.34% |      (I) 196.87% |
>> |                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         |  (R) -23.29% |      (I) 125.42% |
>> |                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |    (I) 3.74% |      (I) 238.59% |
>> |                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         |  (R) -23.80% |      (I) 132.38% |
>> |                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         |   (R) -2.84% |      (I) 514.75% |
>> |                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |        2.74% |            0.33% |
>> |                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |        0.58% |            1.36% |
>> |                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |       -0.66% |            1.48% |
>> |                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     |  (R) -25.24% |       (I) 77.95% |
>> |                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |       -0.58% |            0.60% |
>> |                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |  (R) -45.75% |        (I) 8.51% |
>> |                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        |  (R) -28.16% |       (I) 65.34% |
>> |                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |       -0.54% |           -0.33% |
>> +-----------------+----------------------------------------------------------+--------------+------------------+
>>
>> What do you think?
>>
> You were first :)
> 
> Some figures from me:
> 
> # Default(3 pages)

What is Default? I'm guessing it's the state prior to Vishal's patch?

> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 541868 usec
> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 542515 usec
> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 541561 usec
> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 542951 usec
> 
> # Patch(3 pages)

What is Patch? I'm guessing state after applying both Vishal's and my patches?

> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 585266 usec
> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 594301 usec
> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 598912 usec
> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 589345 usec
> 
> Now the perf figures are almost settled and aligned with default!
> We do use per-cpu-cache for 3 pages allocations.
> 
> # Default(100 pages)
> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 5724919 usec
> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 5721430 usec
> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 5717224 usec
> 
> # Patch(100 pages)
> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2629600 usec
> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2622811 usec
> fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2629324 usec
> 
> ~2x faster! It is because of freeing now occurs much more efficient
> so we spent less cycles on free path comparing with default case.
> 
> See below, perf also confirms that vfree() ~2x consumes less cycles:
> 
> # Default
> +   96.99%     0.49%  [test_vmalloc]        [k] fix_size_alloc_test
> +   59.64%     2.38%  [kernel]              [k] vfree.part.0
> +   45.69%    15.80%  [kernel]              [k] __free_frozen_pages
> +   39.83%     0.00%  [kernel]              [k] ret_from_fork_asm
> +   39.83%     0.00%  [kernel]              [k] ret_from_fork
> +   39.83%     0.00%  [kernel]              [k] kthread
> +   38.67%     0.00%  [test_vmalloc]        [k] test_func
> +   36.64%     0.01%  [kernel]              [k] __vmalloc_node_noprof
> +   36.63%     0.20%  [kernel]              [k] __vmalloc_node_range_noprof
> +   17.55%     4.94%  [kernel]              [k] alloc_pages_bulk_noprof
> +   16.46%    12.21%  [kernel]              [k] free_frozen_page_commit.isra.0
> +   16.06%     8.09%  [kernel]              [k] vmap_small_pages_range_noflush
> +   12.56%    10.82%  [kernel]              [k] __rmqueue_pcplist
> +    9.45%     9.43%  [kernel]              [k] __get_pfnblock_flags_mask.isra.0
> +    7.95%     7.95%  [kernel]              [k] pfn_valid
> +    5.77%     0.03%  [kernel]              [k] remove_vm_area
> +    5.44%     5.44%  [kernel]              [k] ___free_pages
> +    4.67%     4.59%  [kernel]              [k] __vunmap_range_noflush
> +    4.30%     4.30%  [kernel]              [k] __list_add_valid_or_report
> 
> # Patch
> +   94.28%     1.00%  [test_vmalloc]        [k] fix_size_alloc_test
> +   55.63%     0.03%  [kernel]              [k] __vmalloc_node_noprof
> +   55.60%     3.78%  [kernel]              [k] __vmalloc_node_range_noprof
> +   37.26%    19.29%  [kernel]              [k] vmap_small_pages_range_noflush
> +   37.12%     5.63%  [kernel]              [k] vfree.part.0
> +   30.59%     0.00%  [kernel]              [k] ret_from_fork_asm
> +   30.59%     0.00%  [kernel]              [k] ret_from_fork
> +   30.59%     0.00%  [kernel]              [k] kthread
> +   28.79%     0.00%  [test_vmalloc]        [k] test_func
> +   17.90%    17.88%  [kernel]              [k] pfn_valid
> +   13.24%     0.02%  [kernel]              [k] remove_vm_area
> +   10.90%    10.68%  [kernel]              [k] __vunmap_range_noflush
> +   10.81%    10.80%  [kernel]              [k] free_pages_bulk
> +    7.09%     0.51%  [kernel]              [k] alloc_pages_noprof
> +    6.58%     0.41%  [kernel]              [k] alloc_pages_mpol
> +    6.50%     0.30%  [kernel]              [k] free_frozen_pages_bulk
> +    5.74%     0.97%  [kernel]              [k] __alloc_frozen_pages_noprof
> +    5.70%     0.00%  [kernel]              [k] worker_thread
> +    5.62%     0.02%  [kernel]              [k] process_one_work
> +    5.57%     0.01%  [kernel]              [k] __purge_vmap_area_lazy
> +    4.76%     2.55%  [kernel]              [k] get_page_from_freelist
> 
> So it is nice :)
> 
> --
> Uladzislau Rezki



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-18 11:12             ` Ryan Roberts
@ 2025-12-18 11:33               ` Uladzislau Rezki
  0 siblings, 0 replies; 27+ messages in thread
From: Uladzislau Rezki @ 2025-12-18 11:33 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Uladzislau Rezki, linux-mm, Andrew Morton, Vishal Moola,
	Dev Jain, Baoquan He, LKML

On Thu, Dec 18, 2025 at 11:12:15AM +0000, Ryan Roberts wrote:
> On 17/12/2025 19:22, Uladzislau Rezki wrote:
> > On Wed, Dec 17, 2025 at 05:01:19PM +0000, Ryan Roberts wrote:
> >> On 17/12/2025 15:20, Ryan Roberts wrote:
> >>> On 17/12/2025 12:02, Uladzislau Rezki wrote:
> >>>>> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
> >>>>>> Introduce a module parameter to enable or disable the large-order
> >>>>>> allocation path in vmalloc. High-order allocations are disabled by
> >>>>>> default so far, but users may explicitly enable them at runtime if
> >>>>>> desired.
> >>>>>>
> >>>>>> High-order pages allocated for vmalloc are immediately split into
> >>>>>> order-0 pages and later freed as order-0, which means they do not
> >>>>>> feed the per-CPU page caches. As a result, high-order attempts tend
> >>>>>> to bypass the PCP fastpath and fall back to the buddy allocator that
> >>>>>> can affect performance.
> >>>>>>
> >>>>>> However, when the PCP caches are empty, high-order allocations may
> >>>>>> show better performance characteristics especially for larger
> >>>>>> allocation requests.
> >>>>>
> >>>>> I wonder if a better solution would be "allocate order-0 if available in pcp,
> >>>>> else try large order, else fallback to order-0" Could that provide the best of
> >>>>> all worlds without needing a configuration knob?
> >>>>>
> >>>> I am not sure, to me it looks like a bit odd. 
> >>>
> >>> Perhaps it would feel better if it was generalized to "first try allocation from
> >>> PCP list, highest to lowest order, then try allocation from the buddy, highest
> >>> to lowest order"?
> >>>
> >>>> Ideally it would be
> >>>> good just free it as high-order page and not order-0 peaces.
> >>>
> >>> Yeah perhaps that's better. How about something like this (very lightly tested
> >>> and no performance results yet):
> >>>
> >>> (And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
> >>> with a contiguous run of order-0 pages, but I'm not seeing any warnings or
> >>> memory leaks when running mm selftests...)
> >>>
> >>> ---8<---
> >>> commit caa3e5eb5bfade81a32fa62d1a8924df1eb0f619
> >>> Author: Ryan Roberts <ryan.roberts@arm.com>
> >>> Date:   Wed Dec 17 15:11:08 2025 +0000
> >>>
> >>>     WIP
> >>>
> >>>     Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>
> >>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >>> index b155929af5b1..d25f5b867e6b 100644
> >>> --- a/include/linux/gfp.h
> >>> +++ b/include/linux/gfp.h
> >>> @@ -383,6 +383,8 @@ extern void __free_pages(struct page *page, unsigned int order);
> >>>  extern void free_pages_nolock(struct page *page, unsigned int order);
> >>>  extern void free_pages(unsigned long addr, unsigned int order);
> >>>
> >>> +void free_pages_bulk(struct page *page, int nr_pages);
> >>> +
> >>>  #define __free_page(page) __free_pages((page), 0)
> >>>  #define free_page(addr) free_pages((addr), 0)
> >>>
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 822e05f1a964..5f11224cf353 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -5304,6 +5304,48 @@ static void ___free_pages(struct page *page, unsigned int
> >>> order,
> >>>  	}
> >>>  }
> >>>
> >>> +static void free_frozen_pages_bulk(struct page *page, int nr_pages)
> >>> +{
> >>> +	while (nr_pages) {
> >>> +		unsigned int fit_order, align_order, order;
> >>> +		unsigned long pfn;
> >>> +
> >>> +		pfn = page_to_pfn(page);
> >>> +		fit_order = ilog2(nr_pages);
> >>> +		align_order = pfn ? __ffs(pfn) : fit_order;
> >>> +		order = min3(fit_order, align_order, MAX_PAGE_ORDER);
> >>> +
> >>> +		free_frozen_pages(page, order);
> >>> +
> >>> +		page += 1U << order;
> >>> +		nr_pages -= 1U << order;
> >>> +	}
> >>> +}
> >>> +
> >>> +void free_pages_bulk(struct page *page, int nr_pages)
> >>> +{
> >>> +	struct page *start = NULL;
> >>> +	bool can_free;
> >>> +	int i;
> >>> +
> >>> +	for (i = 0; i < nr_pages; i++, page++) {
> >>> +		VM_BUG_ON_PAGE(PageHead(page), page);
> >>> +		VM_BUG_ON_PAGE(PageTail(page), page);
> >>> +
> >>> +		can_free = put_page_testzero(page);
> >>> +
> >>> +		if (!can_free && start) {
> >>> +			free_frozen_pages_bulk(start, page - start);
> >>> +			start = NULL;
> >>> +		} else if (can_free && !start) {
> >>> +			start = page;
> >>> +		}
> >>> +	}
> >>> +
> >>> +	if (start)
> >>> +		free_frozen_pages_bulk(start, page - start);
> >>> +}
> >>> +
> >>>  /**
> >>>   * __free_pages - Free pages allocated with alloc_pages().
> >>>   * @page: The page pointer returned from alloc_pages().
> >>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> >>> index ecbac900c35f..8f782bac1ece 100644
> >>> --- a/mm/vmalloc.c
> >>> +++ b/mm/vmalloc.c
> >>> @@ -3429,7 +3429,8 @@ void vfree_atomic(const void *addr)
> >>>  void vfree(const void *addr)
> >>>  {
> >>>  	struct vm_struct *vm;
> >>> -	int i;
> >>> +	struct page *start;
> >>> +	int i, nr;
> >>>
> >>>  	if (unlikely(in_interrupt())) {
> >>>  		vfree_atomic(addr);
> >>> @@ -3455,17 +3456,26 @@ void vfree(const void *addr)
> >>>  	/* All pages of vm should be charged to same memcg, so use first one. */
> >>>  	if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES))
> >>>  		mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages);
> >>> -	for (i = 0; i < vm->nr_pages; i++) {
> >>> +
> >>> +	start = vm->pages[0];
> >>> +	BUG_ON(!start);
> >>> +	nr = 1;
> >>> +	for (i = 1; i < vm->nr_pages; i++) {
> >>>  		struct page *page = vm->pages[i];
> >>>
> >>>  		BUG_ON(!page);
> >>> -		/*
> >>> -		 * High-order allocs for huge vmallocs are split, so
> >>> -		 * can be freed as an array of order-0 allocations
> >>> -		 */
> >>> -		__free_page(page);
> >>> -		cond_resched();
> >>> +
> >>> +		if (start + nr != page) {
> >>> +			free_pages_bulk(start, nr);
> >>> +			start = page;
> >>> +			nr = 1;
> >>> +			cond_resched();
> >>> +		} else {
> >>> +			nr++;
> >>> +		}
> >>>  	}
> >>> +	free_pages_bulk(start, nr);
> >>> +
> >>>  	if (!(vm->flags & VM_MAP_PUT_PAGES))
> >>>  		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
> >>>  	kvfree(vm->pages);
> >>> ---8<---
> >>
> >> I tested this on a performance monitoring system and see a huge improvement for 
> >> the test_vmalloc tests.
> >>
> >> Both columns are compared to v6.18. 6-19-0-rc1 has Vishal's change to allocate 
> >> large orders, which I previously reported the regressions for. vfree-high-order 
> >> adds the above patch to free contiguous order-0 pages in bulk.
> >>
> >> (R)/(I) means statistically significant regression/improvement. Results are 
> >> normalized so that less than zero is regression and greater than zero is 
> >> improvement.
> >>
> >> +-----------------+----------------------------------------------------------+--------------+------------------+
> >> | Benchmark       | Result Class                                             |   6-19-0-rc1 | vfree-high-order |
> >> +=================+==========================================================+==============+==================+
> >> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |  (R) -40.69% |        (I) 3.98% |
> >> |                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |        0.10% |           -1.47% |
> >> |                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |  (R) -22.74% |       (I) 11.57% |
> >> |                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |  (R) -23.63% |       (I) 47.42% |
> >> |                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |       -1.58% |      (I) 106.01% |
> >> |                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |  (R) -24.39% |       (I) 99.12% |
> >> |                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |    (I) 2.34% |      (I) 196.87% |
> >> |                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         |  (R) -23.29% |      (I) 125.42% |
> >> |                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |    (I) 3.74% |      (I) 238.59% |
> >> |                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         |  (R) -23.80% |      (I) 132.38% |
> >> |                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         |   (R) -2.84% |      (I) 514.75% |
> >> |                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |        2.74% |            0.33% |
> >> |                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |        0.58% |            1.36% |
> >> |                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |       -0.66% |            1.48% |
> >> |                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     |  (R) -25.24% |       (I) 77.95% |
> >> |                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |       -0.58% |            0.60% |
> >> |                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |  (R) -45.75% |        (I) 8.51% |
> >> |                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        |  (R) -28.16% |       (I) 65.34% |
> >> |                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |       -0.54% |           -0.33% |
> >> +-----------------+----------------------------------------------------------+--------------+------------------+
> >>
> >> What do you think?
> >>
> > You were first :)
> > 
> > Some figures from me:
> > 
> > # Default(3 pages)
> 
> What is Default? I'm guessing it's the state prior to Vishal's patch?
> 
Right.

> > fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 541868 usec
> > fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 542515 usec
> > fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 541561 usec
> > fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 542951 usec
> > 
> > # Patch(3 pages)
> 
> What is Patch? I'm guessing state after applying both Vishal's and my patches?
> 
Right.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-17 17:01         ` Ryan Roberts
  2025-12-17 19:22           ` Uladzislau Rezki
@ 2025-12-17 20:08           ` Uladzislau Rezki
  2025-12-18 11:14             ` Ryan Roberts
  1 sibling, 1 reply; 27+ messages in thread
From: Uladzislau Rezki @ 2025-12-17 20:08 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Uladzislau Rezki, linux-mm, Andrew Morton, Vishal Moola,
	Dev Jain, Baoquan He, LKML

> 
> What do you think?
> 
I think with such big improvement we do not need a configuration knob.
Your change will fully complete Vishal's work, i.e an idea to allocate
using high-order pages.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-17 20:08           ` Uladzislau Rezki
@ 2025-12-18 11:14             ` Ryan Roberts
  2025-12-18 11:29               ` Uladzislau Rezki
  0 siblings, 1 reply; 27+ messages in thread
From: Ryan Roberts @ 2025-12-18 11:14 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, Vishal Moola, Dev Jain, Baoquan He, LKML

On 17/12/2025 20:08, Uladzislau Rezki wrote:
>>
>> What do you think?
>>
> I think with such big improvement we do not need a configuration knob.
> Your change will fully complete Vishal's work, i.e an idea to allocate
> using high-order pages.

Yes agreed. How do you want to proceed? I'll tidy up my patch and post it
properly if you like? (likely won't be until Tuesday though). Or if you prefer
to work on it, that's fine by me too.

Personally I think we should aim to get the fix into 6.19 to avoid the
performance regression (even if we think the allocation pattern of those
benchmarks is not the common case).

Thanks,
Ryan

> 
> --
> Uladzislau Rezki

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-18 11:14             ` Ryan Roberts
@ 2025-12-18 11:29               ` Uladzislau Rezki
  0 siblings, 0 replies; 27+ messages in thread
From: Uladzislau Rezki @ 2025-12-18 11:29 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Uladzislau Rezki, linux-mm, Andrew Morton, Vishal Moola,
	Dev Jain, Baoquan He, LKML

> On 17/12/2025 20:08, Uladzislau Rezki wrote:
> >>
> >> What do you think?
> >>
> > I think with such big improvement we do not need a configuration knob.
> > Your change will fully complete Vishal's work, i.e an idea to allocate
> > using high-order pages.
> 
> Yes agreed. How do you want to proceed? I'll tidy up my patch and post it
> properly if you like? (likely won't be until Tuesday though). Or if you prefer
> to work on it, that's fine by me too.
> 
I think it is worth you proceed with the change. Also i forgot to
mention that allocation path is improved also. If you have a look at the
perf figures you will see that a pressure on the buddy really has been
reduced. I think it is because the state of the buddy is much more healthy
now.

> Personally I think we should aim to get the fix into 6.19 to avoid the
> performance regression (even if we think the allocation pattern of those
> benchmarks is not the common case).
> 
Personally, i do not mind, let's see how it goes and what Andrew says.
 
Thank you and i am looking forward the patches!

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-17 15:20       ` Ryan Roberts
  2025-12-17 17:01         ` Ryan Roberts
@ 2025-12-18  4:55         ` Dev Jain
  2025-12-18 11:53           ` Ryan Roberts
  1 sibling, 1 reply; 27+ messages in thread
From: Dev Jain @ 2025-12-18  4:55 UTC (permalink / raw)
  To: Ryan Roberts, Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, Vishal Moola, Baoquan He, LKML


On 17/12/25 8:50 pm, Ryan Roberts wrote:
> On 17/12/2025 12:02, Uladzislau Rezki wrote:
>>> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
>>>> Introduce a module parameter to enable or disable the large-order
>>>> allocation path in vmalloc. High-order allocations are disabled by
>>>> default so far, but users may explicitly enable them at runtime if
>>>> desired.
>>>>
>>>> High-order pages allocated for vmalloc are immediately split into
>>>> order-0 pages and later freed as order-0, which means they do not
>>>> feed the per-CPU page caches. As a result, high-order attempts tend
>>>> to bypass the PCP fastpath and fall back to the buddy allocator that
>>>> can affect performance.
>>>>
>>>> However, when the PCP caches are empty, high-order allocations may
>>>> show better performance characteristics especially for larger
>>>> allocation requests.
>>> I wonder if a better solution would be "allocate order-0 if available in pcp,
>>> else try large order, else fallback to order-0" Could that provide the best of
>>> all worlds without needing a configuration knob?
>>>
>> I am not sure, to me it looks like a bit odd. 
> Perhaps it would feel better if it was generalized to "first try allocation from
> PCP list, highest to lowest order, then try allocation from the buddy, highest
> to lowest order"?
>
>> Ideally it would be
>> good just free it as high-order page and not order-0 peaces.
> Yeah perhaps that's better. How about something like this (very lightly tested
> and no performance results yet):
>
> (And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
> with a contiguous run of order-0 pages, but I'm not seeing any warnings or
> memory leaks when running mm selftests...)

Wow I wasn't aware that we can do this. I see that free_hotplug_page_range() in
arm64/mmu.c already does this - it computes order from size and passes it to
__free_pages().

>
> ---8<---
> commit caa3e5eb5bfade81a32fa62d1a8924df1eb0f619
> Author: Ryan Roberts <ryan.roberts@arm.com>
> Date:   Wed Dec 17 15:11:08 2025 +0000
>
>     WIP
>
>     Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index b155929af5b1..d25f5b867e6b 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -383,6 +383,8 @@ extern void __free_pages(struct page *page, unsigned int order);
>  extern void free_pages_nolock(struct page *page, unsigned int order);
>  extern void free_pages(unsigned long addr, unsigned int order);
>
> +void free_pages_bulk(struct page *page, int nr_pages);
> +
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 822e05f1a964..5f11224cf353 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5304,6 +5304,48 @@ static void ___free_pages(struct page *page, unsigned int
> order,
>  	}
>  }
>
> +static void free_frozen_pages_bulk(struct page *page, int nr_pages)
> +{
> +	while (nr_pages) {
> +		unsigned int fit_order, align_order, order;
> +		unsigned long pfn;
> +
> +		pfn = page_to_pfn(page);
> +		fit_order = ilog2(nr_pages);
> +		align_order = pfn ? __ffs(pfn) : fit_order;
> +		order = min3(fit_order, align_order, MAX_PAGE_ORDER);
> +
> +		free_frozen_pages(page, order);
> +
> +		page += 1U << order;
> +		nr_pages -= 1U << order;
> +	}
> +}
> +
> +void free_pages_bulk(struct page *page, int nr_pages)
> +{
> +	struct page *start = NULL;
> +	bool can_free;
> +	int i;
> +
> +	for (i = 0; i < nr_pages; i++, page++) {
> +		VM_BUG_ON_PAGE(PageHead(page), page);
> +		VM_BUG_ON_PAGE(PageTail(page), page);
> +
> +		can_free = put_page_testzero(page);
> +
> +		if (!can_free && start) {
> +			free_frozen_pages_bulk(start, page - start);
> +			start = NULL;
> +		} else if (can_free && !start) {
> +			start = page;
> +		}
> +	}
> +
> +	if (start)
> +		free_frozen_pages_bulk(start, page - start);
> +}
> +
>  /**
>   * __free_pages - Free pages allocated with alloc_pages().
>   * @page: The page pointer returned from alloc_pages().
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ecbac900c35f..8f782bac1ece 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3429,7 +3429,8 @@ void vfree_atomic(const void *addr)
>  void vfree(const void *addr)
>  {
>  	struct vm_struct *vm;
> -	int i;
> +	struct page *start;
> +	int i, nr;
>
>  	if (unlikely(in_interrupt())) {
>  		vfree_atomic(addr);
> @@ -3455,17 +3456,26 @@ void vfree(const void *addr)
>  	/* All pages of vm should be charged to same memcg, so use first one. */
>  	if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES))
>  		mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages);
> -	for (i = 0; i < vm->nr_pages; i++) {
> +
> +	start = vm->pages[0];
> +	BUG_ON(!start);
> +	nr = 1;
> +	for (i = 1; i < vm->nr_pages; i++) {
>  		struct page *page = vm->pages[i];
>
>  		BUG_ON(!page);
> -		/*
> -		 * High-order allocs for huge vmallocs are split, so
> -		 * can be freed as an array of order-0 allocations
> -		 */
> -		__free_page(page);
> -		cond_resched();
> +
> +		if (start + nr != page) {
> +			free_pages_bulk(start, nr);
> +			start = page;
> +			nr = 1;
> +			cond_resched();
> +		} else {
> +			nr++;
> +		}
>  	}
> +	free_pages_bulk(start, nr);
> +
>  	if (!(vm->flags & VM_MAP_PUT_PAGES))
>  		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
>  	kvfree(vm->pages);
> ---8<---
>
>>>> Since the best strategy is workload-dependent, this patch adds a
>>>> parameter letting users to choose whether vmalloc should try
>>>> high-order allocations or stay strictly on the order-0 fastpath.
>>>>
>>>> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>>> ---
>>>>  mm/vmalloc.c | 9 +++++++--
>>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>>> index d3a4725e15ca..f66543896b16 100644
>>>> --- a/mm/vmalloc.c
>>>> +++ b/mm/vmalloc.c
>>>> @@ -43,6 +43,7 @@
>>>>  #include <asm/tlbflush.h>
>>>>  #include <asm/shmparam.h>
>>>>  #include <linux/page_owner.h>
>>>> +#include <linux/moduleparam.h>
>>>>  
>>>>  #define CREATE_TRACE_POINTS
>>>>  #include <trace/events/vmalloc.h>
>>>> @@ -3671,6 +3672,9 @@ vm_area_alloc_pages_large_order(gfp_t gfp, int nid, unsigned int order,
>>>>  	return nr_allocated;
>>>>  }
>>>>  
>>>> +static int attempt_larger_order_alloc;
>>>> +module_param(attempt_larger_order_alloc, int, 0644);
>>> Would this be better as a bool? Docs say that you can then specify 0/1, y/n or
>>> Y/N as the value; that's probably more intuitive?
>>>
>>> nit: I'd favour a shorter name. Perhaps large_order_alloc?
>>>
>> Thanks! We can switch to bool and use shorter name for sure.
>>
>> --
>> Uladzislau Rezki


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-18  4:55         ` Dev Jain
@ 2025-12-18 11:53           ` Ryan Roberts
  2025-12-18 11:56             ` Ryan Roberts
                               ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Ryan Roberts @ 2025-12-18 11:53 UTC (permalink / raw)
  To: Dev Jain, Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, Vishal Moola, Baoquan He, LKML

On 18/12/2025 04:55, Dev Jain wrote:
> 
> On 17/12/25 8:50 pm, Ryan Roberts wrote:
>> On 17/12/2025 12:02, Uladzislau Rezki wrote:
>>>> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
>>>>> Introduce a module parameter to enable or disable the large-order
>>>>> allocation path in vmalloc. High-order allocations are disabled by
>>>>> default so far, but users may explicitly enable them at runtime if
>>>>> desired.
>>>>>
>>>>> High-order pages allocated for vmalloc are immediately split into
>>>>> order-0 pages and later freed as order-0, which means they do not
>>>>> feed the per-CPU page caches. As a result, high-order attempts tend
>>>>> to bypass the PCP fastpath and fall back to the buddy allocator that
>>>>> can affect performance.
>>>>>
>>>>> However, when the PCP caches are empty, high-order allocations may
>>>>> show better performance characteristics especially for larger
>>>>> allocation requests.
>>>> I wonder if a better solution would be "allocate order-0 if available in pcp,
>>>> else try large order, else fallback to order-0" Could that provide the best of
>>>> all worlds without needing a configuration knob?
>>>>
>>> I am not sure, to me it looks like a bit odd. 
>> Perhaps it would feel better if it was generalized to "first try allocation from
>> PCP list, highest to lowest order, then try allocation from the buddy, highest
>> to lowest order"?
>>
>>> Ideally it would be
>>> good just free it as high-order page and not order-0 peaces.
>> Yeah perhaps that's better. How about something like this (very lightly tested
>> and no performance results yet):
>>
>> (And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
>> with a contiguous run of order-0 pages, but I'm not seeing any warnings or
>> memory leaks when running mm selftests...)
> 
> Wow I wasn't aware that we can do this. I see that free_hotplug_page_range() in
> arm64/mmu.c already does this - it computes order from size and passes it to
> __free_pages().

Hmm that looks dodgy to me. But I'm not sure I actually understand what is going
on...

Prior to looking at this yesterday, my understanding was this: At the struct
page level, you can either allocate compond or non-compound. order-0 is
non-compound by definition. A high-order non-compound page is just a contiguous
set of order-0 pages, each with individual reference counts and other meta data.
A compound page is one where all the pages are tied together and managed as one
- the meta data is stored in the head page and all the tail pages point to the
head (this concept is wrapped by struct folio).

But after looking through the comments in page_alloc.c, it would seem that a
non-compound high-order page is NOT just a set of order-0 pages, but they still
share some meta data, including a shared refcount?? alloc_pages() will return
one of these things, and __free_pages() requires the exact same unit to be
provided to it.

vmalloc calls alloc_pages() to get a non-compound high-order page, then calls
split_page() to convert to a set of order-0 pages. See this comment:

/*
 * split_page takes a non-compound higher-order page, and splits it into
 * n (1<<order) sub-pages: page[0..n]
 * Each sub-page must be freed individually.
 *
 * Note: this is probably too low level an operation for use in drivers.
 * Please consult with lkml before using this in your driver.
 */
void split_page(struct page *page, unsigned int order)

So just passing all the order-0 pages directly to __free_pages() in one go is
definitely not the right thing to do ("Each sub-page must be freed
individually"). They may have different reference counts so you can only
actually free the ones that go to zero surely?

But it looked to me like free_frozen_pages() just wants a naturally aligned
power-of-2 number of pages to free, so my patch below is decrementing the
refcount on each struct page and accumulating the ones where the refcounts goto
zero into suitable blocks for free_frozen_pages().

So I *think* my patch is correct, but I'm not totally sure.

Then we have the ___free_pages(), which I find very difficult to understand:

static void ___free_pages(struct page *page, unsigned int order,
			  fpi_t fpi_flags)
{
	/* get PageHead before we drop reference */
	int head = PageHead(page);
	/* get alloc tag in case the page is released by others */
	struct alloc_tag *tag = pgalloc_tag_get(page);

	if (put_page_testzero(page))
		__free_frozen_pages(page, order, fpi_flags);

We only test the refcount for the first page, then free all the pages. So that
implies that non-compound high-order pages share a single refcount? Or we just
ignore the refcount of all the other pages in a non-compound high-order page?

	else if (!head) {

What? If the first page still has references but but it's a non-compond
high-order page (i.e. no head page) then we free all the trailing sub-pages
without caring about their references?

		pgalloc_tag_sub_pages(tag, (1 << order) - 1);
		while (order-- > 0) {
			/*
			 * The "tail" pages of this non-compound high-order
			 * page will have no code tags, so to avoid warnings
			 * mark them as empty.
			 */
			clear_page_tag_ref(page + (1 << order));
			__free_frozen_pages(page + (1 << order), order,
					    fpi_flags);
		}
	}
}

For the arm64 case that you point out, surely __free_pages() is the wrong thing
to call, because it's going to decrement the refcount. But we are freeing based
on their presence in the pagetable and we never took a reference in the first place.

HELP!

> 
>>
>> ---8<---
>> commit caa3e5eb5bfade81a32fa62d1a8924df1eb0f619
>> Author: Ryan Roberts <ryan.roberts@arm.com>
>> Date:   Wed Dec 17 15:11:08 2025 +0000
>>
>>     WIP
>>
>>     Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index b155929af5b1..d25f5b867e6b 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -383,6 +383,8 @@ extern void __free_pages(struct page *page, unsigned int order);
>>  extern void free_pages_nolock(struct page *page, unsigned int order);
>>  extern void free_pages(unsigned long addr, unsigned int order);
>>
>> +void free_pages_bulk(struct page *page, int nr_pages);
>> +
>>  #define __free_page(page) __free_pages((page), 0)
>>  #define free_page(addr) free_pages((addr), 0)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 822e05f1a964..5f11224cf353 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5304,6 +5304,48 @@ static void ___free_pages(struct page *page, unsigned int
>> order,
>>  	}
>>  }
>>
>> +static void free_frozen_pages_bulk(struct page *page, int nr_pages)
>> +{
>> +	while (nr_pages) {
>> +		unsigned int fit_order, align_order, order;
>> +		unsigned long pfn;
>> +
>> +		pfn = page_to_pfn(page);
>> +		fit_order = ilog2(nr_pages);
>> +		align_order = pfn ? __ffs(pfn) : fit_order;
>> +		order = min3(fit_order, align_order, MAX_PAGE_ORDER);
>> +
>> +		free_frozen_pages(page, order);
>> +
>> +		page += 1U << order;
>> +		nr_pages -= 1U << order;
>> +	}
>> +}
>> +
>> +void free_pages_bulk(struct page *page, int nr_pages)
>> +{
>> +	struct page *start = NULL;
>> +	bool can_free;
>> +	int i;
>> +
>> +	for (i = 0; i < nr_pages; i++, page++) {
>> +		VM_BUG_ON_PAGE(PageHead(page), page);
>> +		VM_BUG_ON_PAGE(PageTail(page), page);
>> +
>> +		can_free = put_page_testzero(page);
>> +
>> +		if (!can_free && start) {
>> +			free_frozen_pages_bulk(start, page - start);
>> +			start = NULL;
>> +		} else if (can_free && !start) {
>> +			start = page;
>> +		}
>> +	}
>> +
>> +	if (start)
>> +		free_frozen_pages_bulk(start, page - start);
>> +}
>> +
>>  /**
>>   * __free_pages - Free pages allocated with alloc_pages().
>>   * @page: The page pointer returned from alloc_pages().
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index ecbac900c35f..8f782bac1ece 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -3429,7 +3429,8 @@ void vfree_atomic(const void *addr)
>>  void vfree(const void *addr)
>>  {
>>  	struct vm_struct *vm;
>> -	int i;
>> +	struct page *start;
>> +	int i, nr;
>>
>>  	if (unlikely(in_interrupt())) {
>>  		vfree_atomic(addr);
>> @@ -3455,17 +3456,26 @@ void vfree(const void *addr)
>>  	/* All pages of vm should be charged to same memcg, so use first one. */
>>  	if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES))
>>  		mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages);
>> -	for (i = 0; i < vm->nr_pages; i++) {
>> +
>> +	start = vm->pages[0];
>> +	BUG_ON(!start);
>> +	nr = 1;
>> +	for (i = 1; i < vm->nr_pages; i++) {
>>  		struct page *page = vm->pages[i];
>>
>>  		BUG_ON(!page);
>> -		/*
>> -		 * High-order allocs for huge vmallocs are split, so
>> -		 * can be freed as an array of order-0 allocations
>> -		 */
>> -		__free_page(page);
>> -		cond_resched();
>> +
>> +		if (start + nr != page) {
>> +			free_pages_bulk(start, nr);
>> +			start = page;
>> +			nr = 1;
>> +			cond_resched();
>> +		} else {
>> +			nr++;
>> +		}
>>  	}
>> +	free_pages_bulk(start, nr);
>> +
>>  	if (!(vm->flags & VM_MAP_PUT_PAGES))
>>  		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
>>  	kvfree(vm->pages);
>> ---8<---
>>
>>>>> Since the best strategy is workload-dependent, this patch adds a
>>>>> parameter letting users to choose whether vmalloc should try
>>>>> high-order allocations or stay strictly on the order-0 fastpath.
>>>>>
>>>>> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>>>> ---
>>>>>  mm/vmalloc.c | 9 +++++++--
>>>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>>>> index d3a4725e15ca..f66543896b16 100644
>>>>> --- a/mm/vmalloc.c
>>>>> +++ b/mm/vmalloc.c
>>>>> @@ -43,6 +43,7 @@
>>>>>  #include <asm/tlbflush.h>
>>>>>  #include <asm/shmparam.h>
>>>>>  #include <linux/page_owner.h>
>>>>> +#include <linux/moduleparam.h>
>>>>>  
>>>>>  #define CREATE_TRACE_POINTS
>>>>>  #include <trace/events/vmalloc.h>
>>>>> @@ -3671,6 +3672,9 @@ vm_area_alloc_pages_large_order(gfp_t gfp, int nid, unsigned int order,
>>>>>  	return nr_allocated;
>>>>>  }
>>>>>  
>>>>> +static int attempt_larger_order_alloc;
>>>>> +module_param(attempt_larger_order_alloc, int, 0644);
>>>> Would this be better as a bool? Docs say that you can then specify 0/1, y/n or
>>>> Y/N as the value; that's probably more intuitive?
>>>>
>>>> nit: I'd favour a shorter name. Perhaps large_order_alloc?
>>>>
>>> Thanks! We can switch to bool and use shorter name for sure.
>>>
>>> --
>>> Uladzislau Rezki



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-18 11:53           ` Ryan Roberts
@ 2025-12-18 11:56             ` Ryan Roberts
  2025-12-19  8:33               ` David Hildenbrand (Red Hat)
  2025-12-19  0:34             ` Vishal Moola (Oracle)
  2025-12-24  6:35             ` Dev Jain
  2 siblings, 1 reply; 27+ messages in thread
From: Ryan Roberts @ 2025-12-18 11:56 UTC (permalink / raw)
  To: Dev Jain, Uladzislau Rezki, David Hildenbrand (Red Hat),
	Lorenzo Stoakes, Matthew Wilcox
  Cc: linux-mm, Andrew Morton, Vishal Moola, Baoquan He, LKML

+ David, Lorenzo, Matthew

Hoping someone might be able to explain to me how this all really works! :-|

On 18/12/2025 11:53, Ryan Roberts wrote:
> On 18/12/2025 04:55, Dev Jain wrote:
>>
>> On 17/12/25 8:50 pm, Ryan Roberts wrote:
>>> On 17/12/2025 12:02, Uladzislau Rezki wrote:
>>>>> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
>>>>>> Introduce a module parameter to enable or disable the large-order
>>>>>> allocation path in vmalloc. High-order allocations are disabled by
>>>>>> default so far, but users may explicitly enable them at runtime if
>>>>>> desired.
>>>>>>
>>>>>> High-order pages allocated for vmalloc are immediately split into
>>>>>> order-0 pages and later freed as order-0, which means they do not
>>>>>> feed the per-CPU page caches. As a result, high-order attempts tend
>>>>>> to bypass the PCP fastpath and fall back to the buddy allocator that
>>>>>> can affect performance.
>>>>>>
>>>>>> However, when the PCP caches are empty, high-order allocations may
>>>>>> show better performance characteristics especially for larger
>>>>>> allocation requests.
>>>>> I wonder if a better solution would be "allocate order-0 if available in pcp,
>>>>> else try large order, else fallback to order-0" Could that provide the best of
>>>>> all worlds without needing a configuration knob?
>>>>>
>>>> I am not sure, to me it looks like a bit odd. 
>>> Perhaps it would feel better if it was generalized to "first try allocation from
>>> PCP list, highest to lowest order, then try allocation from the buddy, highest
>>> to lowest order"?
>>>
>>>> Ideally it would be
>>>> good just free it as high-order page and not order-0 peaces.
>>> Yeah perhaps that's better. How about something like this (very lightly tested
>>> and no performance results yet):
>>>
>>> (And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
>>> with a contiguous run of order-0 pages, but I'm not seeing any warnings or
>>> memory leaks when running mm selftests...)
>>
>> Wow I wasn't aware that we can do this. I see that free_hotplug_page_range() in
>> arm64/mmu.c already does this - it computes order from size and passes it to
>> __free_pages().
> 
> Hmm that looks dodgy to me. But I'm not sure I actually understand what is going
> on...
> 
> Prior to looking at this yesterday, my understanding was this: At the struct
> page level, you can either allocate compond or non-compound. order-0 is
> non-compound by definition. A high-order non-compound page is just a contiguous
> set of order-0 pages, each with individual reference counts and other meta data.
> A compound page is one where all the pages are tied together and managed as one
> - the meta data is stored in the head page and all the tail pages point to the
> head (this concept is wrapped by struct folio).
> 
> But after looking through the comments in page_alloc.c, it would seem that a
> non-compound high-order page is NOT just a set of order-0 pages, but they still
> share some meta data, including a shared refcount?? alloc_pages() will return
> one of these things, and __free_pages() requires the exact same unit to be
> provided to it.
> 
> vmalloc calls alloc_pages() to get a non-compound high-order page, then calls
> split_page() to convert to a set of order-0 pages. See this comment:
> 
> /*
>  * split_page takes a non-compound higher-order page, and splits it into
>  * n (1<<order) sub-pages: page[0..n]
>  * Each sub-page must be freed individually.
>  *
>  * Note: this is probably too low level an operation for use in drivers.
>  * Please consult with lkml before using this in your driver.
>  */
> void split_page(struct page *page, unsigned int order)
> 
> So just passing all the order-0 pages directly to __free_pages() in one go is
> definitely not the right thing to do ("Each sub-page must be freed
> individually"). They may have different reference counts so you can only
> actually free the ones that go to zero surely?
> 
> But it looked to me like free_frozen_pages() just wants a naturally aligned
> power-of-2 number of pages to free, so my patch below is decrementing the
> refcount on each struct page and accumulating the ones where the refcounts goto
> zero into suitable blocks for free_frozen_pages().
> 
> So I *think* my patch is correct, but I'm not totally sure.
> 
> Then we have the ___free_pages(), which I find very difficult to understand:
> 
> static void ___free_pages(struct page *page, unsigned int order,
> 			  fpi_t fpi_flags)
> {
> 	/* get PageHead before we drop reference */
> 	int head = PageHead(page);
> 	/* get alloc tag in case the page is released by others */
> 	struct alloc_tag *tag = pgalloc_tag_get(page);
> 
> 	if (put_page_testzero(page))
> 		__free_frozen_pages(page, order, fpi_flags);
> 
> We only test the refcount for the first page, then free all the pages. So that
> implies that non-compound high-order pages share a single refcount? Or we just
> ignore the refcount of all the other pages in a non-compound high-order page?
> 
> 	else if (!head) {
> 
> What? If the first page still has references but but it's a non-compond
> high-order page (i.e. no head page) then we free all the trailing sub-pages
> without caring about their references?
> 
> 		pgalloc_tag_sub_pages(tag, (1 << order) - 1);
> 		while (order-- > 0) {
> 			/*
> 			 * The "tail" pages of this non-compound high-order
> 			 * page will have no code tags, so to avoid warnings
> 			 * mark them as empty.
> 			 */
> 			clear_page_tag_ref(page + (1 << order));
> 			__free_frozen_pages(page + (1 << order), order,
> 					    fpi_flags);
> 		}
> 	}
> }
> 
> For the arm64 case that you point out, surely __free_pages() is the wrong thing
> to call, because it's going to decrement the refcount. But we are freeing based
> on their presence in the pagetable and we never took a reference in the first place.
> 
> HELP!
> 
>>
>>>
>>> ---8<---
>>> commit caa3e5eb5bfade81a32fa62d1a8924df1eb0f619
>>> Author: Ryan Roberts <ryan.roberts@arm.com>
>>> Date:   Wed Dec 17 15:11:08 2025 +0000
>>>
>>>     WIP
>>>
>>>     Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>
>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>> index b155929af5b1..d25f5b867e6b 100644
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -383,6 +383,8 @@ extern void __free_pages(struct page *page, unsigned int order);
>>>  extern void free_pages_nolock(struct page *page, unsigned int order);
>>>  extern void free_pages(unsigned long addr, unsigned int order);
>>>
>>> +void free_pages_bulk(struct page *page, int nr_pages);
>>> +
>>>  #define __free_page(page) __free_pages((page), 0)
>>>  #define free_page(addr) free_pages((addr), 0)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 822e05f1a964..5f11224cf353 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -5304,6 +5304,48 @@ static void ___free_pages(struct page *page, unsigned int
>>> order,
>>>  	}
>>>  }
>>>
>>> +static void free_frozen_pages_bulk(struct page *page, int nr_pages)
>>> +{
>>> +	while (nr_pages) {
>>> +		unsigned int fit_order, align_order, order;
>>> +		unsigned long pfn;
>>> +
>>> +		pfn = page_to_pfn(page);
>>> +		fit_order = ilog2(nr_pages);
>>> +		align_order = pfn ? __ffs(pfn) : fit_order;
>>> +		order = min3(fit_order, align_order, MAX_PAGE_ORDER);
>>> +
>>> +		free_frozen_pages(page, order);
>>> +
>>> +		page += 1U << order;
>>> +		nr_pages -= 1U << order;
>>> +	}
>>> +}
>>> +
>>> +void free_pages_bulk(struct page *page, int nr_pages)
>>> +{
>>> +	struct page *start = NULL;
>>> +	bool can_free;
>>> +	int i;
>>> +
>>> +	for (i = 0; i < nr_pages; i++, page++) {
>>> +		VM_BUG_ON_PAGE(PageHead(page), page);
>>> +		VM_BUG_ON_PAGE(PageTail(page), page);
>>> +
>>> +		can_free = put_page_testzero(page);
>>> +
>>> +		if (!can_free && start) {
>>> +			free_frozen_pages_bulk(start, page - start);
>>> +			start = NULL;
>>> +		} else if (can_free && !start) {
>>> +			start = page;
>>> +		}
>>> +	}
>>> +
>>> +	if (start)
>>> +		free_frozen_pages_bulk(start, page - start);
>>> +}
>>> +
>>>  /**
>>>   * __free_pages - Free pages allocated with alloc_pages().
>>>   * @page: The page pointer returned from alloc_pages().
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index ecbac900c35f..8f782bac1ece 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -3429,7 +3429,8 @@ void vfree_atomic(const void *addr)
>>>  void vfree(const void *addr)
>>>  {
>>>  	struct vm_struct *vm;
>>> -	int i;
>>> +	struct page *start;
>>> +	int i, nr;
>>>
>>>  	if (unlikely(in_interrupt())) {
>>>  		vfree_atomic(addr);
>>> @@ -3455,17 +3456,26 @@ void vfree(const void *addr)
>>>  	/* All pages of vm should be charged to same memcg, so use first one. */
>>>  	if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES))
>>>  		mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages);
>>> -	for (i = 0; i < vm->nr_pages; i++) {
>>> +
>>> +	start = vm->pages[0];
>>> +	BUG_ON(!start);
>>> +	nr = 1;
>>> +	for (i = 1; i < vm->nr_pages; i++) {
>>>  		struct page *page = vm->pages[i];
>>>
>>>  		BUG_ON(!page);
>>> -		/*
>>> -		 * High-order allocs for huge vmallocs are split, so
>>> -		 * can be freed as an array of order-0 allocations
>>> -		 */
>>> -		__free_page(page);
>>> -		cond_resched();
>>> +
>>> +		if (start + nr != page) {
>>> +			free_pages_bulk(start, nr);
>>> +			start = page;
>>> +			nr = 1;
>>> +			cond_resched();
>>> +		} else {
>>> +			nr++;
>>> +		}
>>>  	}
>>> +	free_pages_bulk(start, nr);
>>> +
>>>  	if (!(vm->flags & VM_MAP_PUT_PAGES))
>>>  		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
>>>  	kvfree(vm->pages);
>>> ---8<---
>>>
>>>>>> Since the best strategy is workload-dependent, this patch adds a
>>>>>> parameter letting users to choose whether vmalloc should try
>>>>>> high-order allocations or stay strictly on the order-0 fastpath.
>>>>>>
>>>>>> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>>>>> ---
>>>>>>  mm/vmalloc.c | 9 +++++++--
>>>>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>>>>> index d3a4725e15ca..f66543896b16 100644
>>>>>> --- a/mm/vmalloc.c
>>>>>> +++ b/mm/vmalloc.c
>>>>>> @@ -43,6 +43,7 @@
>>>>>>  #include <asm/tlbflush.h>
>>>>>>  #include <asm/shmparam.h>
>>>>>>  #include <linux/page_owner.h>
>>>>>> +#include <linux/moduleparam.h>
>>>>>>  
>>>>>>  #define CREATE_TRACE_POINTS
>>>>>>  #include <trace/events/vmalloc.h>
>>>>>> @@ -3671,6 +3672,9 @@ vm_area_alloc_pages_large_order(gfp_t gfp, int nid, unsigned int order,
>>>>>>  	return nr_allocated;
>>>>>>  }
>>>>>>  
>>>>>> +static int attempt_larger_order_alloc;
>>>>>> +module_param(attempt_larger_order_alloc, int, 0644);
>>>>> Would this be better as a bool? Docs say that you can then specify 0/1, y/n or
>>>>> Y/N as the value; that's probably more intuitive?
>>>>>
>>>>> nit: I'd favour a shorter name. Perhaps large_order_alloc?
>>>>>
>>>> Thanks! We can switch to bool and use shorter name for sure.
>>>>
>>>> --
>>>> Uladzislau Rezki
> 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-18 11:56             ` Ryan Roberts
@ 2025-12-19  8:33               ` David Hildenbrand (Red Hat)
  2025-12-19 11:17                 ` Ryan Roberts
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-19  8:33 UTC (permalink / raw)
  To: Ryan Roberts, Dev Jain, Uladzislau Rezki, Lorenzo Stoakes,
	Matthew Wilcox
  Cc: linux-mm, Andrew Morton, Vishal Moola, Baoquan He, LKML

On 12/18/25 12:56, Ryan Roberts wrote:
> + David, Lorenzo, Matthew
> 
> Hoping someone might be able to explain to me how this all really works! :-|
> 
> On 18/12/2025 11:53, Ryan Roberts wrote:
>> On 18/12/2025 04:55, Dev Jain wrote:
>>>
>>> On 17/12/25 8:50 pm, Ryan Roberts wrote:
>>>> On 17/12/2025 12:02, Uladzislau Rezki wrote:
>>>>>> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
>>>>>>> Introduce a module parameter to enable or disable the large-order
>>>>>>> allocation path in vmalloc. High-order allocations are disabled by
>>>>>>> default so far, but users may explicitly enable them at runtime if
>>>>>>> desired.
>>>>>>>
>>>>>>> High-order pages allocated for vmalloc are immediately split into
>>>>>>> order-0 pages and later freed as order-0, which means they do not
>>>>>>> feed the per-CPU page caches. As a result, high-order attempts tend
>>>>>>> to bypass the PCP fastpath and fall back to the buddy allocator that
>>>>>>> can affect performance.
>>>>>>>
>>>>>>> However, when the PCP caches are empty, high-order allocations may
>>>>>>> show better performance characteristics especially for larger
>>>>>>> allocation requests.
>>>>>> I wonder if a better solution would be "allocate order-0 if available in pcp,
>>>>>> else try large order, else fallback to order-0" Could that provide the best of
>>>>>> all worlds without needing a configuration knob?
>>>>>>
>>>>> I am not sure, to me it looks like a bit odd.
>>>> Perhaps it would feel better if it was generalized to "first try allocation from
>>>> PCP list, highest to lowest order, then try allocation from the buddy, highest
>>>> to lowest order"?
>>>>
>>>>> Ideally it would be
>>>>> good just free it as high-order page and not order-0 peaces.
>>>> Yeah perhaps that's better. How about something like this (very lightly tested
>>>> and no performance results yet):
>>>>
>>>> (And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
>>>> with a contiguous run of order-0 pages, but I'm not seeing any warnings or
>>>> memory leaks when running mm selftests...)
>>>
>>> Wow I wasn't aware that we can do this. I see that free_hotplug_page_range() in
>>> arm64/mmu.c already does this - it computes order from size and passes it to
>>> __free_pages().
>>
>> Hmm that looks dodgy to me. But I'm not sure I actually understand what is going
>> on...
>>
>> Prior to looking at this yesterday, my understanding was this: At the struct
>> page level, you can either allocate compond or non-compound. order-0 is
>> non-compound by definition. A high-order non-compound page is just a contiguous
>> set of order-0 pages, each with individual reference counts and other meta data.

Not quite. A high-order non-compound allocation will only use the 
refcount of page[0].

When not returning that memory in the same order to the buddy, we first 
have to split that high-order allocation. That will initialize the 
refcounts and split page-owner data, alloc tag tracking etc.

>> A compound page is one where all the pages are tied together and managed as one
>> - the meta data is stored in the head page and all the tail pages point to the
>> head (this concept is wrapped by struct folio).
>>
>> But after looking through the comments in page_alloc.c, it would seem that a
>> non-compound high-order page is NOT just a set of order-0 pages, but they still
>> share some meta data, including a shared refcount?? alloc_pages() will return
>> one of these things, and __free_pages() requires the exact same unit to be
>> provided to it.

Right.

>>
>> vmalloc calls alloc_pages() to get a non-compound high-order page, then calls
>> split_page() to convert to a set of order-0 pages. See this comment:
>>
>> /*
>>   * split_page takes a non-compound higher-order page, and splits it into
>>   * n (1<<order) sub-pages: page[0..n]
>>   * Each sub-page must be freed individually.
>>   *
>>   * Note: this is probably too low level an operation for use in drivers.
>>   * Please consult with lkml before using this in your driver.
>>   */
>> void split_page(struct page *page, unsigned int order)
>>
>> So just passing all the order-0 pages directly to __free_pages() in one go is
>> definitely not the right thing to do ("Each sub-page must be freed
>> individually"). They may have different reference counts so you can only
>> actually free the ones that go to zero surely?

Yes.

>>
>> But it looked to me like free_frozen_pages() just wants a naturally aligned
>> power-of-2 number of pages to free, so my patch below is decrementing the
>> refcount on each struct page and accumulating the ones where the refcounts goto
>> zero into suitable blocks for free_frozen_pages().
>>
>> So I *think* my patch is correct, but I'm not totally sure.

Free in the granularity you allocated. :)

>>
>> Then we have the ___free_pages(), which I find very difficult to understand:
>>
>> static void ___free_pages(struct page *page, unsigned int order,
>> 			  fpi_t fpi_flags)
>> {
>> 	/* get PageHead before we drop reference */
>> 	int head = PageHead(page);
>> 	/* get alloc tag in case the page is released by others */
>> 	struct alloc_tag *tag = pgalloc_tag_get(page);
>>
>> 	if (put_page_testzero(page))
>> 		__free_frozen_pages(page, order, fpi_flags);
>>
>> We only test the refcount for the first page, then free all the pages. So that
>> implies that non-compound high-order pages share a single refcount? Or we just
>> ignore the refcount of all the other pages in a non-compound high-order page?
>>
>> 	else if (!head) {
>>
>> What? If the first page still has references but but it's a non-compond
>> high-order page (i.e. no head page) then we free all the trailing sub-pages
>> without caring about their references?

Again, free in the granularity we allocated.

>>
>> 		pgalloc_tag_sub_pages(tag, (1 << order) - 1);
>> 		while (order-- > 0) {
>> 			/*
>> 			 * The "tail" pages of this non-compound high-order
>> 			 * page will have no code tags, so to avoid warnings
>> 			 * mark them as empty.
>> 			 */
>> 			clear_page_tag_ref(page + (1 << order));
>> 			__free_frozen_pages(page + (1 << order), order,
>> 					    fpi_flags);
>> 		}
>> 	}
>> }
>>
>> For the arm64 case that you point out, surely __free_pages() is the wrong thing
>> to call, because it's going to decrement the refcount. But we are freeing based
>> on their presence in the pagetable and we never took a reference in the first place.
>>
>> HELP!
Hope my input helped, not sure if I answered the real question? :)

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-19  8:33               ` David Hildenbrand (Red Hat)
@ 2025-12-19 11:17                 ` Ryan Roberts
  0 siblings, 0 replies; 27+ messages in thread
From: Ryan Roberts @ 2025-12-19 11:17 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat),
	Dev Jain, Uladzislau Rezki, Lorenzo Stoakes, Matthew Wilcox
  Cc: linux-mm, Andrew Morton, Vishal Moola, Baoquan He, LKML

On 19/12/2025 08:33, David Hildenbrand (Red Hat) wrote:
> On 12/18/25 12:56, Ryan Roberts wrote:
>> + David, Lorenzo, Matthew
>>
>> Hoping someone might be able to explain to me how this all really works! :-|
>>
>> On 18/12/2025 11:53, Ryan Roberts wrote:
>>> On 18/12/2025 04:55, Dev Jain wrote:
>>>>
>>>> On 17/12/25 8:50 pm, Ryan Roberts wrote:
>>>>> On 17/12/2025 12:02, Uladzislau Rezki wrote:
>>>>>>> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
>>>>>>>> Introduce a module parameter to enable or disable the large-order
>>>>>>>> allocation path in vmalloc. High-order allocations are disabled by
>>>>>>>> default so far, but users may explicitly enable them at runtime if
>>>>>>>> desired.
>>>>>>>>
>>>>>>>> High-order pages allocated for vmalloc are immediately split into
>>>>>>>> order-0 pages and later freed as order-0, which means they do not
>>>>>>>> feed the per-CPU page caches. As a result, high-order attempts tend
>>>>>>>> to bypass the PCP fastpath and fall back to the buddy allocator that
>>>>>>>> can affect performance.
>>>>>>>>
>>>>>>>> However, when the PCP caches are empty, high-order allocations may
>>>>>>>> show better performance characteristics especially for larger
>>>>>>>> allocation requests.
>>>>>>> I wonder if a better solution would be "allocate order-0 if available in
>>>>>>> pcp,
>>>>>>> else try large order, else fallback to order-0" Could that provide the
>>>>>>> best of
>>>>>>> all worlds without needing a configuration knob?
>>>>>>>
>>>>>> I am not sure, to me it looks like a bit odd.
>>>>> Perhaps it would feel better if it was generalized to "first try allocation
>>>>> from
>>>>> PCP list, highest to lowest order, then try allocation from the buddy, highest
>>>>> to lowest order"?
>>>>>
>>>>>> Ideally it would be
>>>>>> good just free it as high-order page and not order-0 peaces.
>>>>> Yeah perhaps that's better. How about something like this (very lightly tested
>>>>> and no performance results yet):
>>>>>
>>>>> (And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
>>>>> with a contiguous run of order-0 pages, but I'm not seeing any warnings or
>>>>> memory leaks when running mm selftests...)
>>>>
>>>> Wow I wasn't aware that we can do this. I see that free_hotplug_page_range() in
>>>> arm64/mmu.c already does this - it computes order from size and passes it to
>>>> __free_pages().
>>>
>>> Hmm that looks dodgy to me. But I'm not sure I actually understand what is going
>>> on...
>>>
>>> Prior to looking at this yesterday, my understanding was this: At the struct
>>> page level, you can either allocate compond or non-compound. order-0 is
>>> non-compound by definition. A high-order non-compound page is just a contiguous
>>> set of order-0 pages, each with individual reference counts and other meta data.
> 
> Not quite. A high-order non-compound allocation will only use the refcount of
> page[0].
> 
> When not returning that memory in the same order to the buddy, we first have to
> split that high-order allocation. That will initialize the refcounts and split
> page-owner data, alloc tag tracking etc.

Ahha, yes, this all makes sense now thanks!

> 
>>> A compound page is one where all the pages are tied together and managed as one
>>> - the meta data is stored in the head page and all the tail pages point to the
>>> head (this concept is wrapped by struct folio).
>>>
>>> But after looking through the comments in page_alloc.c, it would seem that a
>>> non-compound high-order page is NOT just a set of order-0 pages, but they still
>>> share some meta data, including a shared refcount?? alloc_pages() will return
>>> one of these things, and __free_pages() requires the exact same unit to be
>>> provided to it.
> 
> Right.
> 
>>>
>>> vmalloc calls alloc_pages() to get a non-compound high-order page, then calls
>>> split_page() to convert to a set of order-0 pages. See this comment:
>>>
>>> /*
>>>   * split_page takes a non-compound higher-order page, and splits it into
>>>   * n (1<<order) sub-pages: page[0..n]
>>>   * Each sub-page must be freed individually.
>>>   *
>>>   * Note: this is probably too low level an operation for use in drivers.
>>>   * Please consult with lkml before using this in your driver.
>>>   */
>>> void split_page(struct page *page, unsigned int order)
>>>
>>> So just passing all the order-0 pages directly to __free_pages() in one go is
>>> definitely not the right thing to do ("Each sub-page must be freed
>>> individually"). They may have different reference counts so you can only
>>> actually free the ones that go to zero surely?
> 
> Yes.
> 
>>>
>>> But it looked to me like free_frozen_pages() just wants a naturally aligned
>>> power-of-2 number of pages to free, so my patch below is decrementing the
>>> refcount on each struct page and accumulating the ones where the refcounts goto
>>> zero into suitable blocks for free_frozen_pages().
>>>
>>> So I *think* my patch is correct, but I'm not totally sure.
> 
> Free in the granularity you allocated. :)

Or in order-0 chunks if split_page() was called... Yes, I understand the
requirements of the current __free_pages() and co, but my questions are all
intended to help me figure out if we can do something better...

background:

In v6.18 vmalloc would (mostly) allocate order-0 pages then __free_page() on
each order-0 page. In v6.19-rc1 Vishal has added a patch that allocates a set of
high-order non-compound pages to satisfy the request then calls split_page() for
each one. We end up with order-0 pages as before that get freed that same way as
before. This is intended to 1) speed up allocation and 2) prevent fragmentation
because the lifetimes of larger contiguous chunks are tied together.

But for some (unrealistic?) allocation patterns it turns out there is a big
perforamcne regression; the large allocation is always going to the buddy
whereas before it could usually get it's order-0 pages from the pcp list.

So I'm looking for ways to fix it, and have a patch that not only fixes the
regression but improves vmalloc performance 2x-5x in some cases.

The patch basically looks for power-of-2 sized and aligned contiguous chunks of
order-0 pages whose refcounts were all decremented to 0 then calls
free_frozen_pages() for each of those chunks. So instead of posting each order-0
page into the pcp or buddy, we post the power-of-2 chunk.

Anyway, from your explanation, I believe this is all safe and correct in
principle so I will post a proper patch for review.

But out of this discussion it has also emerged that arm64 is likely using
__free_pages() incorrectly in it's memory hot unplug path; Dev is going to take
a look at that.

> 
>>>
>>> Then we have the ___free_pages(), which I find very difficult to understand:
>>>
>>> static void ___free_pages(struct page *page, unsigned int order,
>>>               fpi_t fpi_flags)
>>> {
>>>     /* get PageHead before we drop reference */
>>>     int head = PageHead(page);
>>>     /* get alloc tag in case the page is released by others */
>>>     struct alloc_tag *tag = pgalloc_tag_get(page);
>>>
>>>     if (put_page_testzero(page))
>>>         __free_frozen_pages(page, order, fpi_flags);
>>>
>>> We only test the refcount for the first page, then free all the pages. So that
>>> implies that non-compound high-order pages share a single refcount? Or we just
>>> ignore the refcount of all the other pages in a non-compound high-order page?
>>>
>>>     else if (!head) {
>>>
>>> What? If the first page still has references but but it's a non-compond
>>> high-order page (i.e. no head page) then we free all the trailing sub-pages
>>> without caring about their references?
> 
> Again, free in the granularity we allocated.
> 
>>>
>>>         pgalloc_tag_sub_pages(tag, (1 << order) - 1);
>>>         while (order-- > 0) {
>>>             /*
>>>              * The "tail" pages of this non-compound high-order
>>>              * page will have no code tags, so to avoid warnings
>>>              * mark them as empty.
>>>              */
>>>             clear_page_tag_ref(page + (1 << order));
>>>             __free_frozen_pages(page + (1 << order), order,
>>>                         fpi_flags);
>>>         }
>>>     }
>>> }
>>>
>>> For the arm64 case that you point out, surely __free_pages() is the wrong thing
>>> to call, because it's going to decrement the refcount. But we are freeing based
>>> on their presence in the pagetable and we never took a reference in the first
>>> place.
>>>
>>> HELP!
> Hope my input helped, not sure if I answered the real question? :)

Yes, it definitely helped! I saw Vishal's reponse too, which is much appreciated!

Thanks,
Ryan




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-18 11:53           ` Ryan Roberts
  2025-12-18 11:56             ` Ryan Roberts
@ 2025-12-19  0:34             ` Vishal Moola (Oracle)
  2025-12-19 11:23               ` Ryan Roberts
  2025-12-24  6:35             ` Dev Jain
  2 siblings, 1 reply; 27+ messages in thread
From: Vishal Moola (Oracle) @ 2025-12-19  0:34 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Dev Jain, Uladzislau Rezki, linux-mm, Andrew Morton, Baoquan He, LKML

On Thu, Dec 18, 2025 at 11:53:00AM +0000, Ryan Roberts wrote:
> On 18/12/2025 04:55, Dev Jain wrote:
> > 
> > On 17/12/25 8:50 pm, Ryan Roberts wrote:
> >> On 17/12/2025 12:02, Uladzislau Rezki wrote:
> >>>> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
> >>>>> Introduce a module parameter to enable or disable the large-order
> >>>>> allocation path in vmalloc. High-order allocations are disabled by
> >>>>> default so far, but users may explicitly enable them at runtime if
> >>>>> desired.
> >>>>>
> >>>>> High-order pages allocated for vmalloc are immediately split into
> >>>>> order-0 pages and later freed as order-0, which means they do not
> >>>>> feed the per-CPU page caches. As a result, high-order attempts tend
> >>>>> to bypass the PCP fastpath and fall back to the buddy allocator that
> >>>>> can affect performance.
> >>>>>
> >>>>> However, when the PCP caches are empty, high-order allocations may
> >>>>> show better performance characteristics especially for larger
> >>>>> allocation requests.
> >>>> I wonder if a better solution would be "allocate order-0 if available in pcp,
> >>>> else try large order, else fallback to order-0" Could that provide the best of
> >>>> all worlds without needing a configuration knob?
> >>>>
> >>> I am not sure, to me it looks like a bit odd. 
> >> Perhaps it would feel better if it was generalized to "first try allocation from
> >> PCP list, highest to lowest order, then try allocation from the buddy, highest
> >> to lowest order"?
> >>
> >>> Ideally it would be
> >>> good just free it as high-order page and not order-0 peaces.
> >> Yeah perhaps that's better. How about something like this (very lightly tested
> >> and no performance results yet):
> >>
> >> (And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
> >> with a contiguous run of order-0 pages, but I'm not seeing any warnings or
> >> memory leaks when running mm selftests...)
> > 
> > Wow I wasn't aware that we can do this. I see that free_hotplug_page_range() in
> > arm64/mmu.c already does this - it computes order from size and passes it to
> > __free_pages().
> 
> Hmm that looks dodgy to me. But I'm not sure I actually understand what is going
> on...
> 
> Prior to looking at this yesterday, my understanding was this: At the struct
> page level, you can either allocate compond or non-compound. order-0 is
> non-compound by definition. A high-order non-compound page is just a contiguous
> set of order-0 pages, each with individual reference counts and other meta data.
> A compound page is one where all the pages are tied together and managed as one
> - the meta data is stored in the head page and all the tail pages point to the
> head (this concept is wrapped by struct folio).
> 
> But after looking through the comments in page_alloc.c, it would seem that a
> non-compound high-order page is NOT just a set of order-0 pages, but they still
> share some meta data, including a shared refcount?? alloc_pages() will return
> one of these things, and __free_pages() requires the exact same unit to be
> provided to it.

For high-order non-compound pages, the tail pages don't get initialized.
They don't share anything, and its up to the caller to keep track of
those tail pages.

Historically, we split the pages down to order-0 here to simplify things for
the callers. See commit 3b8000ae185cb0 (stating some callers want to use
some page fields).

Tail pages being uninitialized meant that when using page apis, we could
easily hit 'bad' page states. Splitting to order-0, meant that each page
is now completely independent, and *actually* initialized to an expected
state.

> vmalloc calls alloc_pages() to get a non-compound high-order page, then calls
> split_page() to convert to a set of order-0 pages. See this comment:
> 
> /*
>  * split_page takes a non-compound higher-order page, and splits it into
>  * n (1<<order) sub-pages: page[0..n]
>  * Each sub-page must be freed individually.
>  *
>  * Note: this is probably too low level an operation for use in drivers.
>  * Please consult with lkml before using this in your driver.
>  */
> void split_page(struct page *page, unsigned int order)
> 
> So just passing all the order-0 pages directly to __free_pages() in one go is
> definitely not the right thing to do ("Each sub-page must be freed
> individually"). They may have different reference counts so you can only
> actually free the ones that go to zero surely?
>
> But it looked to me like free_frozen_pages() just wants a naturally aligned
> power-of-2 number of pages to free, so my patch below is decrementing the
> refcount on each struct page and accumulating the ones where the refcounts goto
> zero into suitable blocks for free_frozen_pages().

Frozen pages are just pages without a refcount. I doubt this is the
intended use, but it should work: you're effectively handling the
refcount here instead of letting the page allocator do so.

> So I *think* my patch is correct, but I'm not totally sure. 

I haven't looked at your patch yet, but I do like the idea of freeing
the pages as larger orders together. My only concern is whether page
migration could mess any of this up (I'm completely unfamiliar with that).

> Then we have the ___free_pages(), which I find very difficult to understand:
> 
> static void ___free_pages(struct page *page, unsigned int order,
> 			  fpi_t fpi_flags)
> {
> 	/* get PageHead before we drop reference */
> 	int head = PageHead(page);
> 	/* get alloc tag in case the page is released by others */
> 	struct alloc_tag *tag = pgalloc_tag_get(page);
> 
> 	if (put_page_testzero(page))
> 		__free_frozen_pages(page, order, fpi_flags);
> 
> We only test the refcount for the first page, then free all the pages. So that
> implies that non-compound high-order pages share a single refcount? Or we just
> ignore the refcount of all the other pages in a non-compound high-order page?

We ignore the refcount of all other pages - see __free_pages():
	 * This function can free multi-page allocations that are not compound
	 * pages.  It does not check that the @order passed in matches that of
	 * the allocation, so it is easy to leak memory.  Freeing more memory
	 * than was allocated will probably emit a warning.

> 	else if (!head) {
> 
> What? If the first page still has references but but it's a non-compond
> high-order page (i.e. no head page) then we free all the trailing sub-pages
> without caring about their references?

I think this has to do with racy refcount stuff. For non-compound pages:
if we take a reference with get_page() before put_page_testzero()
happens, we will end up ONLY freeing that page once the caller reaches
its put_page() call. So we're freeing the rest here to prevent leaking
memory that way.

> 		pgalloc_tag_sub_pages(tag, (1 << order) - 1);
> 		while (order-- > 0) {
> 			/*
> 			 * The "tail" pages of this non-compound high-order
> 			 * page will have no code tags, so to avoid warnings
> 			 * mark them as empty.
> 			 */
> 			clear_page_tag_ref(page + (1 << order));
> 			__free_frozen_pages(page + (1 << order), order,
> 					    fpi_flags);
> 		}
> 	}
> }
> 
> For the arm64 case that you point out, surely __free_pages() is the wrong thing
> to call, because it's going to decrement the refcount. But we are freeing based
> on their presence in the pagetable and we never took a reference in the first place.
>
> HELP!
> 
> > 
> >>
> >> ---8<---
> >> commit caa3e5eb5bfade81a32fa62d1a8924df1eb0f619
> >> Author: Ryan Roberts <ryan.roberts@arm.com>
> >> Date:   Wed Dec 17 15:11:08 2025 +0000
> >>
> >>     WIP
> >>
> >>     Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>
> >> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >> index b155929af5b1..d25f5b867e6b 100644
> >> --- a/include/linux/gfp.h
> >> +++ b/include/linux/gfp.h
> >> @@ -383,6 +383,8 @@ extern void __free_pages(struct page *page, unsigned int order);
> >>  extern void free_pages_nolock(struct page *page, unsigned int order);
> >>  extern void free_pages(unsigned long addr, unsigned int order);
> >>
> >> +void free_pages_bulk(struct page *page, int nr_pages);
> >> +
> >>  #define __free_page(page) __free_pages((page), 0)
> >>  #define free_page(addr) free_pages((addr), 0)
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 822e05f1a964..5f11224cf353 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -5304,6 +5304,48 @@ static void ___free_pages(struct page *page, unsigned int
> >> order,
> >>  	}
> >>  }
> >>
> >> +static void free_frozen_pages_bulk(struct page *page, int nr_pages)
> >> +{
> >> +	while (nr_pages) {
> >> +		unsigned int fit_order, align_order, order;
> >> +		unsigned long pfn;
> >> +
> >> +		pfn = page_to_pfn(page);
> >> +		fit_order = ilog2(nr_pages);
> >> +		align_order = pfn ? __ffs(pfn) : fit_order;
> >> +		order = min3(fit_order, align_order, MAX_PAGE_ORDER);
> >> +
> >> +		free_frozen_pages(page, order);
> >> +
> >> +		page += 1U << order;
> >> +		nr_pages -= 1U << order;
> >> +	}
> >> +}
> >> +
> >> +void free_pages_bulk(struct page *page, int nr_pages)
> >> +{
> >> +	struct page *start = NULL;
> >> +	bool can_free;
> >> +	int i;
> >> +
> >> +	for (i = 0; i < nr_pages; i++, page++) {
> >> +		VM_BUG_ON_PAGE(PageHead(page), page);
> >> +		VM_BUG_ON_PAGE(PageTail(page), page);
> >> +
> >> +		can_free = put_page_testzero(page);
> >> +
> >> +		if (!can_free && start) {
> >> +			free_frozen_pages_bulk(start, page - start);
> >> +			start = NULL;
> >> +		} else if (can_free && !start) {
> >> +			start = page;
> >> +		}
> >> +	}
> >> +
> >> +	if (start)
> >> +		free_frozen_pages_bulk(start, page - start);
> >> +}
> >> +
> >>  /**
> >>   * __free_pages - Free pages allocated with alloc_pages().
> >>   * @page: The page pointer returned from alloc_pages().
> >> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> >> index ecbac900c35f..8f782bac1ece 100644
> >> --- a/mm/vmalloc.c
> >> +++ b/mm/vmalloc.c
> >> @@ -3429,7 +3429,8 @@ void vfree_atomic(const void *addr)
> >>  void vfree(const void *addr)
> >>  {
> >>  	struct vm_struct *vm;
> >> -	int i;
> >> +	struct page *start;
> >> +	int i, nr;
> >>
> >>  	if (unlikely(in_interrupt())) {
> >>  		vfree_atomic(addr);
> >> @@ -3455,17 +3456,26 @@ void vfree(const void *addr)
> >>  	/* All pages of vm should be charged to same memcg, so use first one. */
> >>  	if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES))
> >>  		mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages);
> >> -	for (i = 0; i < vm->nr_pages; i++) {
> >> +
> >> +	start = vm->pages[0];
> >> +	BUG_ON(!start);
> >> +	nr = 1;
> >> +	for (i = 1; i < vm->nr_pages; i++) {
> >>  		struct page *page = vm->pages[i];
> >>
> >>  		BUG_ON(!page);
> >> -		/*
> >> -		 * High-order allocs for huge vmallocs are split, so
> >> -		 * can be freed as an array of order-0 allocations
> >> -		 */
> >> -		__free_page(page);
> >> -		cond_resched();
> >> +
> >> +		if (start + nr != page) {
> >> +			free_pages_bulk(start, nr);
> >> +			start = page;
> >> +			nr = 1;
> >> +			cond_resched();
> >> +		} else {
> >> +			nr++;
> >> +		}
> >>  	}
> >> +	free_pages_bulk(start, nr);
> >> +
> >>  	if (!(vm->flags & VM_MAP_PUT_PAGES))
> >>  		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
> >>  	kvfree(vm->pages);
> >> ---8<---
> >>
> >>>>> Since the best strategy is workload-dependent, this patch adds a
> >>>>> parameter letting users to choose whether vmalloc should try
> >>>>> high-order allocations or stay strictly on the order-0 fastpath.
> >>>>>
> >>>>> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> >>>>> ---
> >>>>>  mm/vmalloc.c | 9 +++++++--
> >>>>>  1 file changed, 7 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> >>>>> index d3a4725e15ca..f66543896b16 100644
> >>>>> --- a/mm/vmalloc.c
> >>>>> +++ b/mm/vmalloc.c
> >>>>> @@ -43,6 +43,7 @@
> >>>>>  #include <asm/tlbflush.h>
> >>>>>  #include <asm/shmparam.h>
> >>>>>  #include <linux/page_owner.h>
> >>>>> +#include <linux/moduleparam.h>
> >>>>>  
> >>>>>  #define CREATE_TRACE_POINTS
> >>>>>  #include <trace/events/vmalloc.h>
> >>>>> @@ -3671,6 +3672,9 @@ vm_area_alloc_pages_large_order(gfp_t gfp, int nid, unsigned int order,
> >>>>>  	return nr_allocated;
> >>>>>  }
> >>>>>  
> >>>>> +static int attempt_larger_order_alloc;
> >>>>> +module_param(attempt_larger_order_alloc, int, 0644);
> >>>> Would this be better as a bool? Docs say that you can then specify 0/1, y/n or
> >>>> Y/N as the value; that's probably more intuitive?
> >>>>
> >>>> nit: I'd favour a shorter name. Perhaps large_order_alloc?
> >>>>
> >>> Thanks! We can switch to bool and use shorter name for sure.
> >>>
> >>> --
> >>> Uladzislau Rezki
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-19  0:34             ` Vishal Moola (Oracle)
@ 2025-12-19 11:23               ` Ryan Roberts
  0 siblings, 0 replies; 27+ messages in thread
From: Ryan Roberts @ 2025-12-19 11:23 UTC (permalink / raw)
  To: Vishal Moola (Oracle)
  Cc: Dev Jain, Uladzislau Rezki, linux-mm, Andrew Morton, Baoquan He, LKML

On 19/12/2025 00:34, Vishal Moola (Oracle) wrote:
> On Thu, Dec 18, 2025 at 11:53:00AM +0000, Ryan Roberts wrote:
>> On 18/12/2025 04:55, Dev Jain wrote:
>>>
>>> On 17/12/25 8:50 pm, Ryan Roberts wrote:
>>>> On 17/12/2025 12:02, Uladzislau Rezki wrote:
>>>>>> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
>>>>>>> Introduce a module parameter to enable or disable the large-order
>>>>>>> allocation path in vmalloc. High-order allocations are disabled by
>>>>>>> default so far, but users may explicitly enable them at runtime if
>>>>>>> desired.
>>>>>>>
>>>>>>> High-order pages allocated for vmalloc are immediately split into
>>>>>>> order-0 pages and later freed as order-0, which means they do not
>>>>>>> feed the per-CPU page caches. As a result, high-order attempts tend
>>>>>>> to bypass the PCP fastpath and fall back to the buddy allocator that
>>>>>>> can affect performance.
>>>>>>>
>>>>>>> However, when the PCP caches are empty, high-order allocations may
>>>>>>> show better performance characteristics especially for larger
>>>>>>> allocation requests.
>>>>>> I wonder if a better solution would be "allocate order-0 if available in pcp,
>>>>>> else try large order, else fallback to order-0" Could that provide the best of
>>>>>> all worlds without needing a configuration knob?
>>>>>>
>>>>> I am not sure, to me it looks like a bit odd. 
>>>> Perhaps it would feel better if it was generalized to "first try allocation from
>>>> PCP list, highest to lowest order, then try allocation from the buddy, highest
>>>> to lowest order"?
>>>>
>>>>> Ideally it would be
>>>>> good just free it as high-order page and not order-0 peaces.
>>>> Yeah perhaps that's better. How about something like this (very lightly tested
>>>> and no performance results yet):
>>>>
>>>> (And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
>>>> with a contiguous run of order-0 pages, but I'm not seeing any warnings or
>>>> memory leaks when running mm selftests...)
>>>
>>> Wow I wasn't aware that we can do this. I see that free_hotplug_page_range() in
>>> arm64/mmu.c already does this - it computes order from size and passes it to
>>> __free_pages().
>>
>> Hmm that looks dodgy to me. But I'm not sure I actually understand what is going
>> on...
>>
>> Prior to looking at this yesterday, my understanding was this: At the struct
>> page level, you can either allocate compond or non-compound. order-0 is
>> non-compound by definition. A high-order non-compound page is just a contiguous
>> set of order-0 pages, each with individual reference counts and other meta data.
>> A compound page is one where all the pages are tied together and managed as one
>> - the meta data is stored in the head page and all the tail pages point to the
>> head (this concept is wrapped by struct folio).
>>
>> But after looking through the comments in page_alloc.c, it would seem that a
>> non-compound high-order page is NOT just a set of order-0 pages, but they still
>> share some meta data, including a shared refcount?? alloc_pages() will return
>> one of these things, and __free_pages() requires the exact same unit to be
>> provided to it.
> 
> For high-order non-compound pages, the tail pages don't get initialized.
> They don't share anything, and its up to the caller to keep track of
> those tail pages.
> 
> Historically, we split the pages down to order-0 here to simplify things for
> the callers. See commit 3b8000ae185cb0 (stating some callers want to use
> some page fields).
> 
> Tail pages being uninitialized meant that when using page apis, we could
> easily hit 'bad' page states. Splitting to order-0, meant that each page
> is now completely independent, and *actually* initialized to an expected
> state.

Thanks for the explanation - consider my mental model of all of this to be
corrected!

> 
>> vmalloc calls alloc_pages() to get a non-compound high-order page, then calls
>> split_page() to convert to a set of order-0 pages. See this comment:
>>
>> /*
>>  * split_page takes a non-compound higher-order page, and splits it into
>>  * n (1<<order) sub-pages: page[0..n]
>>  * Each sub-page must be freed individually.
>>  *
>>  * Note: this is probably too low level an operation for use in drivers.
>>  * Please consult with lkml before using this in your driver.
>>  */
>> void split_page(struct page *page, unsigned int order)
>>
>> So just passing all the order-0 pages directly to __free_pages() in one go is
>> definitely not the right thing to do ("Each sub-page must be freed
>> individually"). They may have different reference counts so you can only
>> actually free the ones that go to zero surely?
>>
>> But it looked to me like free_frozen_pages() just wants a naturally aligned
>> power-of-2 number of pages to free, so my patch below is decrementing the
>> refcount on each struct page and accumulating the ones where the refcounts goto
>> zero into suitable blocks for free_frozen_pages().
> 
> Frozen pages are just pages without a refcount. I doubt this is the
> intended use, but it should work: you're effectively handling the
> refcount here instead of letting the page allocator do so.

Yep, agreed. thanks.

> 
>> So I *think* my patch is correct, but I'm not totally sure. 
> 
> I haven't looked at your patch yet, but I do like the idea of freeing
> the pages as larger orders together. My only concern is whether page
> migration could mess any of this up (I'm completely unfamiliar with that).

I'm not sure I see where migration comes into the picture. We have the struct
page pointers, so nothing can migrated, surely? This is vmalloc memory we are
talking about so it's all unmovable, right?

> 
>> Then we have the ___free_pages(), which I find very difficult to understand:
>>
>> static void ___free_pages(struct page *page, unsigned int order,
>> 			  fpi_t fpi_flags)
>> {
>> 	/* get PageHead before we drop reference */
>> 	int head = PageHead(page);
>> 	/* get alloc tag in case the page is released by others */
>> 	struct alloc_tag *tag = pgalloc_tag_get(page);
>>
>> 	if (put_page_testzero(page))
>> 		__free_frozen_pages(page, order, fpi_flags);
>>
>> We only test the refcount for the first page, then free all the pages. So that
>> implies that non-compound high-order pages share a single refcount? Or we just
>> ignore the refcount of all the other pages in a non-compound high-order page?
> 
> We ignore the refcount of all other pages - see __free_pages():
> 	 * This function can free multi-page allocations that are not compound
> 	 * pages.  It does not check that the @order passed in matches that of
> 	 * the allocation, so it is easy to leak memory.  Freeing more memory
> 	 * than was allocated will probably emit a warning.
> 
>> 	else if (!head) {
>>
>> What? If the first page still has references but but it's a non-compond
>> high-order page (i.e. no head page) then we free all the trailing sub-pages
>> without caring about their references?
> 
> I think this has to do with racy refcount stuff. For non-compound pages:
> if we take a reference with get_page() before put_page_testzero()
> happens, we will end up ONLY freeing that page once the caller reaches
> its put_page() call. So we're freeing the rest here to prevent leaking
> memory that way.

Yeah this makes sense now that I understand that the "tail" pages never got
initialized. Feels odd that get_page() should be allowed on the "head" page
though for this type of user-managed non-compound high-order page, but I won't
worry about that.

> 
>> 		pgalloc_tag_sub_pages(tag, (1 << order) - 1);
>> 		while (order-- > 0) {
>> 			/*
>> 			 * The "tail" pages of this non-compound high-order
>> 			 * page will have no code tags, so to avoid warnings
>> 			 * mark them as empty.
>> 			 */
>> 			clear_page_tag_ref(page + (1 << order));
>> 			__free_frozen_pages(page + (1 << order), order,
>> 					    fpi_flags);
>> 		}
>> 	}
>> }
>>
>> For the arm64 case that you point out, surely __free_pages() is the wrong thing
>> to call, because it's going to decrement the refcount. But we are freeing based
>> on their presence in the pagetable and we never took a reference in the first place.
>>
>> HELP!
>>
>>>
>>>>
>>>> ---8<---
>>>> commit caa3e5eb5bfade81a32fa62d1a8924df1eb0f619
>>>> Author: Ryan Roberts <ryan.roberts@arm.com>
>>>> Date:   Wed Dec 17 15:11:08 2025 +0000
>>>>
>>>>     WIP
>>>>
>>>>     Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>
>>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>>> index b155929af5b1..d25f5b867e6b 100644
>>>> --- a/include/linux/gfp.h
>>>> +++ b/include/linux/gfp.h
>>>> @@ -383,6 +383,8 @@ extern void __free_pages(struct page *page, unsigned int order);
>>>>  extern void free_pages_nolock(struct page *page, unsigned int order);
>>>>  extern void free_pages(unsigned long addr, unsigned int order);
>>>>
>>>> +void free_pages_bulk(struct page *page, int nr_pages);
>>>> +
>>>>  #define __free_page(page) __free_pages((page), 0)
>>>>  #define free_page(addr) free_pages((addr), 0)
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 822e05f1a964..5f11224cf353 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -5304,6 +5304,48 @@ static void ___free_pages(struct page *page, unsigned int
>>>> order,
>>>>  	}
>>>>  }
>>>>
>>>> +static void free_frozen_pages_bulk(struct page *page, int nr_pages)
>>>> +{
>>>> +	while (nr_pages) {
>>>> +		unsigned int fit_order, align_order, order;
>>>> +		unsigned long pfn;
>>>> +
>>>> +		pfn = page_to_pfn(page);
>>>> +		fit_order = ilog2(nr_pages);
>>>> +		align_order = pfn ? __ffs(pfn) : fit_order;
>>>> +		order = min3(fit_order, align_order, MAX_PAGE_ORDER);
>>>> +
>>>> +		free_frozen_pages(page, order);
>>>> +
>>>> +		page += 1U << order;
>>>> +		nr_pages -= 1U << order;
>>>> +	}
>>>> +}
>>>> +
>>>> +void free_pages_bulk(struct page *page, int nr_pages)
>>>> +{
>>>> +	struct page *start = NULL;
>>>> +	bool can_free;
>>>> +	int i;
>>>> +
>>>> +	for (i = 0; i < nr_pages; i++, page++) {
>>>> +		VM_BUG_ON_PAGE(PageHead(page), page);
>>>> +		VM_BUG_ON_PAGE(PageTail(page), page);
>>>> +
>>>> +		can_free = put_page_testzero(page);
>>>> +
>>>> +		if (!can_free && start) {
>>>> +			free_frozen_pages_bulk(start, page - start);
>>>> +			start = NULL;
>>>> +		} else if (can_free && !start) {
>>>> +			start = page;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	if (start)
>>>> +		free_frozen_pages_bulk(start, page - start);
>>>> +}
>>>> +
>>>>  /**
>>>>   * __free_pages - Free pages allocated with alloc_pages().
>>>>   * @page: The page pointer returned from alloc_pages().
>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>>> index ecbac900c35f..8f782bac1ece 100644
>>>> --- a/mm/vmalloc.c
>>>> +++ b/mm/vmalloc.c
>>>> @@ -3429,7 +3429,8 @@ void vfree_atomic(const void *addr)
>>>>  void vfree(const void *addr)
>>>>  {
>>>>  	struct vm_struct *vm;
>>>> -	int i;
>>>> +	struct page *start;
>>>> +	int i, nr;
>>>>
>>>>  	if (unlikely(in_interrupt())) {
>>>>  		vfree_atomic(addr);
>>>> @@ -3455,17 +3456,26 @@ void vfree(const void *addr)
>>>>  	/* All pages of vm should be charged to same memcg, so use first one. */
>>>>  	if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES))
>>>>  		mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages);
>>>> -	for (i = 0; i < vm->nr_pages; i++) {
>>>> +
>>>> +	start = vm->pages[0];
>>>> +	BUG_ON(!start);
>>>> +	nr = 1;
>>>> +	for (i = 1; i < vm->nr_pages; i++) {
>>>>  		struct page *page = vm->pages[i];
>>>>
>>>>  		BUG_ON(!page);
>>>> -		/*
>>>> -		 * High-order allocs for huge vmallocs are split, so
>>>> -		 * can be freed as an array of order-0 allocations
>>>> -		 */
>>>> -		__free_page(page);
>>>> -		cond_resched();
>>>> +
>>>> +		if (start + nr != page) {
>>>> +			free_pages_bulk(start, nr);
>>>> +			start = page;
>>>> +			nr = 1;
>>>> +			cond_resched();
>>>> +		} else {
>>>> +			nr++;
>>>> +		}
>>>>  	}
>>>> +	free_pages_bulk(start, nr);
>>>> +
>>>>  	if (!(vm->flags & VM_MAP_PUT_PAGES))
>>>>  		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
>>>>  	kvfree(vm->pages);
>>>> ---8<---
>>>>
>>>>>>> Since the best strategy is workload-dependent, this patch adds a
>>>>>>> parameter letting users to choose whether vmalloc should try
>>>>>>> high-order allocations or stay strictly on the order-0 fastpath.
>>>>>>>
>>>>>>> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>>>>>> ---
>>>>>>>  mm/vmalloc.c | 9 +++++++--
>>>>>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>>>>>> index d3a4725e15ca..f66543896b16 100644
>>>>>>> --- a/mm/vmalloc.c
>>>>>>> +++ b/mm/vmalloc.c
>>>>>>> @@ -43,6 +43,7 @@
>>>>>>>  #include <asm/tlbflush.h>
>>>>>>>  #include <asm/shmparam.h>
>>>>>>>  #include <linux/page_owner.h>
>>>>>>> +#include <linux/moduleparam.h>
>>>>>>>  
>>>>>>>  #define CREATE_TRACE_POINTS
>>>>>>>  #include <trace/events/vmalloc.h>
>>>>>>> @@ -3671,6 +3672,9 @@ vm_area_alloc_pages_large_order(gfp_t gfp, int nid, unsigned int order,
>>>>>>>  	return nr_allocated;
>>>>>>>  }
>>>>>>>  
>>>>>>> +static int attempt_larger_order_alloc;
>>>>>>> +module_param(attempt_larger_order_alloc, int, 0644);
>>>>>> Would this be better as a bool? Docs say that you can then specify 0/1, y/n or
>>>>>> Y/N as the value; that's probably more intuitive?
>>>>>>
>>>>>> nit: I'd favour a shorter name. Perhaps large_order_alloc?
>>>>>>
>>>>> Thanks! We can switch to bool and use shorter name for sure.
>>>>>
>>>>> --
>>>>> Uladzislau Rezki
>>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter
  2025-12-18 11:53           ` Ryan Roberts
  2025-12-18 11:56             ` Ryan Roberts
  2025-12-19  0:34             ` Vishal Moola (Oracle)
@ 2025-12-24  6:35             ` Dev Jain
  2 siblings, 0 replies; 27+ messages in thread
From: Dev Jain @ 2025-12-24  6:35 UTC (permalink / raw)
  To: Ryan Roberts, Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, Vishal Moola, Baoquan He, LKML


On 18/12/25 5:23 pm, Ryan Roberts wrote:
> On 18/12/2025 04:55, Dev Jain wrote:
>> On 17/12/25 8:50 pm, Ryan Roberts wrote:
>>> On 17/12/2025 12:02, Uladzislau Rezki wrote:
>>>>> On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:
>>>>>> Introduce a module parameter to enable or disable the large-order
>>>>>> allocation path in vmalloc. High-order allocations are disabled by
>>>>>> default so far, but users may explicitly enable them at runtime if
>>>>>> desired.
>>>>>>
>>>>>> High-order pages allocated for vmalloc are immediately split into
>>>>>> order-0 pages and later freed as order-0, which means they do not
>>>>>> feed the per-CPU page caches. As a result, high-order attempts tend
>>>>>> to bypass the PCP fastpath and fall back to the buddy allocator that
>>>>>> can affect performance.
>>>>>>
>>>>>> However, when the PCP caches are empty, high-order allocations may
>>>>>> show better performance characteristics especially for larger
>>>>>> allocation requests.
>>>>> I wonder if a better solution would be "allocate order-0 if available in pcp,
>>>>> else try large order, else fallback to order-0" Could that provide the best of
>>>>> all worlds without needing a configuration knob?
>>>>>
>>>> I am not sure, to me it looks like a bit odd. 
>>> Perhaps it would feel better if it was generalized to "first try allocation from
>>> PCP list, highest to lowest order, then try allocation from the buddy, highest
>>> to lowest order"?
>>>
>>>> Ideally it would be
>>>> good just free it as high-order page and not order-0 peaces.
>>> Yeah perhaps that's better. How about something like this (very lightly tested
>>> and no performance results yet):
>>>
>>> (And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
>>> with a contiguous run of order-0 pages, but I'm not seeing any warnings or
>>> memory leaks when running mm selftests...)
>> Wow I wasn't aware that we can do this. I see that free_hotplug_page_range() in
>> arm64/mmu.c already does this - it computes order from size and passes it to
>> __free_pages().
> Hmm that looks dodgy to me. But I'm not sure I actually understand what is going
> on...

I think this is fine. This function frees either the altmap (in which no struct page is
freed), or the array of struct pages in the vmemmap:

free_map_bootmem -> vmmemap_free (altmap=NULL) -> unmap_hotplug_range(free_mapped=true, altmap=NULL) -> ultimately __free_pages.

free_map_bootmem is called from section_deactivate, and takes in a virtual address corresponding to the vmemmap struct pages.
This virtual address is retrieved from sparse_decode_mem_map (note that the return value of this function is misleading).



^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2025-12-24  6:35 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-16 21:19 [PATCH 1/2] mm/vmalloc: Add large-order allocation helper Uladzislau Rezki (Sony)
2025-12-16 21:19 ` [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter Uladzislau Rezki (Sony)
2025-12-16 23:36   ` Andrew Morton
2025-12-17 11:37     ` Uladzislau Rezki
2025-12-17  3:54   ` Baoquan He
2025-12-17 11:44     ` Uladzislau Rezki
2025-12-17 11:49       ` Dev Jain
2025-12-17 11:53         ` Uladzislau Rezki
2025-12-18 10:34       ` Baoquan He
2025-12-17  8:27   ` Ryan Roberts
2025-12-17 12:02     ` Uladzislau Rezki
2025-12-17 15:20       ` Ryan Roberts
2025-12-17 17:01         ` Ryan Roberts
2025-12-17 19:22           ` Uladzislau Rezki
2025-12-18 11:12             ` Ryan Roberts
2025-12-18 11:33               ` Uladzislau Rezki
2025-12-17 20:08           ` Uladzislau Rezki
2025-12-18 11:14             ` Ryan Roberts
2025-12-18 11:29               ` Uladzislau Rezki
2025-12-18  4:55         ` Dev Jain
2025-12-18 11:53           ` Ryan Roberts
2025-12-18 11:56             ` Ryan Roberts
2025-12-19  8:33               ` David Hildenbrand (Red Hat)
2025-12-19 11:17                 ` Ryan Roberts
2025-12-19  0:34             ` Vishal Moola (Oracle)
2025-12-19 11:23               ` Ryan Roberts
2025-12-24  6:35             ` Dev Jain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox