* [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
@ 2025-10-14 18:27 Vishal Moola (Oracle)
  2025-10-15  3:56 ` Matthew Wilcox
  2025-10-15  8:23 ` Uladzislau Rezki
From: Vishal Moola (Oracle) @ 2025-10-14 18:27 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Uladzislau Rezki, Andrew Morton, Vishal Moola (Oracle)

Sometimes, vm_area_alloc_pages() will want many pages from the buddy
allocator. Rather than making requests to the buddy allocator for at
most 100 pages at a time, we can eagerly request large order pages a
smaller number of times.

We still split the large order pages down to order-0 as the rest of the
vmalloc code (and some callers) depend on it. We still defer to the bulk
allocator and fallback path in case of order-0 pages or failure.

Running 1000 iterations of allocations on a small 4GB system finds:

1000 2mb allocations:
	[Baseline]			[This patch]
	real    46.310s			real    34.380s
	user    0.001s			user    0.008s
	sys     46.058s			sys     34.152s

10000 200kb allocations:
	[Baseline]			[This patch]
	real    56.104s			real    43.946s
	user    0.001s			user    0.003s
	sys     55.375s			sys     43.259s

10000 20kb allocations:
	[Baseline]			[This patch]
	real    0m8.438s		real    0m9.160s
	user    0m0.001s		user    0m0.002s
	sys     0m7.936s		sys     0m8.671s

This is an RFC, comments and thoughts are welcomed. There is a
clear benefit to be had for large allocations, but there is
some regression for smaller allocations.

Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
---
 mm/vmalloc.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 97cef2cc14d3..0a25e5cf841c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3621,6 +3621,38 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 	unsigned int nr_allocated = 0;
 	struct page *page;
 	int i;
+	gfp_t large_gfp = (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN;
+	unsigned int large_order = ilog2(nr_pages - nr_allocated);
+
+	/*
+	 * Initially, attempt to have the page allocator give us large order
+	 * pages. Do not attempt allocating smaller than order chunks since
+	 * __vmap_pages_range() expects physically contiguous pages of exactly
+	 * order long chunks.
+	 */
+	while (large_order > order && nr_allocated < nr_pages) {
+		/*
+		 * High-order nofail allocations are really expensive and
+		 * potentially dangerous (premature OOM, disruptive reclaim
+		 * and compaction etc.).
+		 */
+		if (gfp & __GFP_NOFAIL)
+			break;
+		if (nid == NUMA_NO_NODE)
+			page = alloc_pages_noprof(large_gfp, large_order);
+		else
+			page = alloc_pages_node_noprof(nid, large_gfp, large_order);
+
+		if (unlikely(!page))
+			break;
+
+		split_page(page, large_order);
+		for (i = 0; i < (1U << large_order); i++)
+			pages[nr_allocated + i] = page + i;
+
+		nr_allocated += 1U << large_order;
+		large_order = ilog2(nr_pages - nr_allocated);
+	}
 
 	/*
 	 * For order-0 pages we make use of bulk allocator, if
@@ -3665,7 +3697,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		}
 	}
 
-	/* High-order pages or fallback path if "bulk" fails. */
+	/* High-order arch pages or fallback path if "bulk" fails. */
 	while (nr_allocated < nr_pages) {
 		if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
 			break;
-- 
2.51.0




* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-14 18:27 [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator Vishal Moola (Oracle)
@ 2025-10-15  3:56 ` Matthew Wilcox
  2025-10-15  9:28   ` Vishal Moola (Oracle)
  2025-10-15  8:23 ` Uladzislau Rezki
From: Matthew Wilcox @ 2025-10-15  3:56 UTC (permalink / raw)
  To: Vishal Moola (Oracle)
  Cc: linux-mm, linux-kernel, Uladzislau Rezki, Andrew Morton

On Tue, Oct 14, 2025 at 11:27:54AM -0700, Vishal Moola (Oracle) wrote:
> Running 1000 iterations of allocations on a small 4GB system finds:
> 
> 1000 2mb allocations:
> 	[Baseline]			[This patch]
> 	real    46.310s			real    34.380s
> 	user    0.001s			user    0.008s
> 	sys     46.058s			sys     34.152s
> 
> 10000 200kb allocations:
> 	[Baseline]			[This patch]
> 	real    56.104s			real    43.946s
> 	user    0.001s			user    0.003s
> 	sys     55.375s			sys     43.259s
> 
> 10000 20kb allocations:
> 	[Baseline]			[This patch]
> 	real    0m8.438s		real    0m9.160s
> 	user    0m0.001s		user    0m0.002s
> 	sys     0m7.936s		sys     0m8.671s

I'd be more confident in the 20kB numbers if you'd done 10x more
iterations.

Also, I think 20kB is probably an _interesting_ number, but it's not
going to display your change to its best advantage.  A 32kB allocation
will look much better, for example.

Also, can you go into more detail of the test?  Based on our off-list
conversation, we were talking about allocating something like 100MB
of memory (in these various sizes) then freeing it, just to be sure
that we're measuring the performance of the buddy allocator and
not the PCP list.

> This is an RFC, comments and thoughts are welcomed. There is a
> clear benefit to be had for large allocations, but there is
> some regression for smaller allocations.

Also we think that there's probably a later win to be had by
not splitting the page we allocated.

At some point, we should also start allocating frozen pages
for vmalloc.  That's going to be interesting for the users which
map vmalloc pages to userspace.

> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 97cef2cc14d3..0a25e5cf841c 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3621,6 +3621,38 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  	unsigned int nr_allocated = 0;
>  	struct page *page;
>  	int i;
> +	gfp_t large_gfp = (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN;
> +	unsigned int large_order = ilog2(nr_pages - nr_allocated);
> +
> +	/*
> +	 * Initially, attempt to have the page allocator give us large order
> +	 * pages. Do not attempt allocating smaller than order chunks since
> +	 * __vmap_pages_range() expects physically contiguous pages of exactly
> +	 * order long chunks.
> +	 */
> +	while (large_order > order && nr_allocated < nr_pages) {
> +		/*
> +		 * High-order nofail allocations are really expensive and
> +		 * potentially dangerous (premature OOM, disruptive reclaim
> +		 * and compaction etc.).
> +		 */
> +		if (gfp & __GFP_NOFAIL)
> +			break;

sure, but we could just clear NOFAIL from the large_gfp flags instead
of giving up on this path so quickly?
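
E.g., a minimal sketch of that idea (untested):

	/* The large order attempts may fail; we always fall back below,
	 * so never pass __GFP_NOFAIL down to them. */
	gfp_t large_gfp = (gfp & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL)) |
			  __GFP_NOWARN;

with the __GFP_NOFAIL check inside the loop dropped.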

> +		if (nid == NUMA_NO_NODE)
> +			page = alloc_pages_noprof(large_gfp, large_order);
> +		else
> +			page = alloc_pages_node_noprof(nid, large_gfp, large_order);
> +
> +		if (unlikely(!page))
> +			break;

I'm not entirely convinced here.  We might want to fall back to the next
largest size.  eg if we try to allocate an order-6 page, and there's not
one readily available, perhaps we should try to allocate an order-5 page
instead of falling back to the bulk allocator?
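
Something like this, perhaps (sketch only):

	if (unlikely(!page)) {
		/* Nothing readily available at this order; step down one
		 * order and retry before falling back to order-0. */
		large_order--;
		continue;
	}

The existing loop condition then stops at 'order' and hands the rest to
the bulk/fallback paths.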

> @@ -3665,7 +3697,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  		}
>  	}
>  
> -	/* High-order pages or fallback path if "bulk" fails. */
> +	/* High-order arch pages or fallback path if "bulk" fails. */

I'm not quite clear what this comment change is meant to convey?



* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-14 18:27 [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator Vishal Moola (Oracle)
  2025-10-15  3:56 ` Matthew Wilcox
@ 2025-10-15  8:23 ` Uladzislau Rezki
  2025-10-15 10:44   ` Vishal Moola (Oracle)
From: Uladzislau Rezki @ 2025-10-15  8:23 UTC (permalink / raw)
  To: Vishal Moola (Oracle)
  Cc: linux-mm, linux-kernel, Uladzislau Rezki, Andrew Morton

On Tue, Oct 14, 2025 at 11:27:54AM -0700, Vishal Moola (Oracle) wrote:
> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
> allocator. Rather than making requests to the buddy allocator for at
> most 100 pages at a time, we can eagerly request large order pages a
> smaller number of times.
> 
> We still split the large order pages down to order-0 as the rest of the
> vmalloc code (and some callers) depend on it. We still defer to the bulk
> allocator and fallback path in case of order-0 pages or failure.
> 
> Running 1000 iterations of allocations on a small 4GB system finds:
> 
> 1000 2mb allocations:
> 	[Baseline]			[This patch]
> 	real    46.310s			real    34.380s
> 	user    0.001s			user    0.008s
> 	sys     46.058s			sys     34.152s
> 
> 10000 200kb allocations:
> 	[Baseline]			[This patch]
> 	real    56.104s			real    43.946s
> 	user    0.001s			user    0.003s
> 	sys     55.375s			sys     43.259s
> 
> 10000 20kb allocations:
> 	[Baseline]			[This patch]
> 	real    0m8.438s		real    0m9.160s
> 	user    0m0.001s		user    0m0.002s
> 	sys     0m7.936s		sys     0m8.671s
> 
> This is an RFC, comments and thoughts are welcomed. There is a
> clear benefit to be had for large allocations, but there is
> some regression for smaller allocations.
> 
> Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
> ---
>  mm/vmalloc.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 97cef2cc14d3..0a25e5cf841c 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3621,6 +3621,38 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  	unsigned int nr_allocated = 0;
>  	struct page *page;
>  	int i;
> +	gfp_t large_gfp = (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN;
> +	unsigned int large_order = ilog2(nr_pages - nr_allocated);
>
If large_order is > MAX_ORDER - 1 then there is no need to even try
the large_order attempt.

>> unsigned int large_order = ilog2(nr_pages - nr_allocated);
I think it is better to introduce a "remaining" variable, which is
nr_pages - nr_allocated. On entry, "remaining" can be set to just
nr_pages because "nr_allocated" is zero.
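
Something along these lines perhaps, as a rough sketch in the
vm_area_alloc_pages() context (with the MAX_ORDER clamp folded in;
untested):

	unsigned int remaining = nr_pages;
	unsigned int large_order = min_t(unsigned int, MAX_ORDER,
					 ilog2(remaining));

	while (remaining && large_order > order) {
		if (nid == NUMA_NO_NODE)
			page = alloc_pages_noprof(large_gfp, large_order);
		else
			page = alloc_pages_node_noprof(nid, large_gfp,
						       large_order);
		if (unlikely(!page))
			break;

		split_page(page, large_order);
		for (i = 0; i < (1U << large_order); i++)
			pages[nr_allocated + i] = page + i;

		nr_allocated += 1U << large_order;
		remaining -= 1U << large_order;
		if (!remaining)
			break;
		large_order = min_t(unsigned int, MAX_ORDER,
				    ilog2(remaining));
	}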

Maybe it is also worth dropping/warning if __GFP_COMP is set?

> +
> +	/*
> +	 * Initially, attempt to have the page allocator give us large order
> +	 * pages. Do not attempt allocating smaller than order chunks since
> +	 * __vmap_pages_range() expects physically contiguous pages of exactly
> +	 * order long chunks.
> +	 */
> +	while (large_order > order && nr_allocated < nr_pages) {
> +		/*
> +		 * High-order nofail allocations are really expensive and
> +		 * potentially dangerous (premature OOM, disruptive reclaim
> +		 * and compaction etc.).
> +		 */
> +		if (gfp & __GFP_NOFAIL)
> +			break;
> +		if (nid == NUMA_NO_NODE)
> +			page = alloc_pages_noprof(large_gfp, large_order);
> +		else
> +			page = alloc_pages_node_noprof(nid, large_gfp, large_order);
> +
> +		if (unlikely(!page))
> +			break;
> +
> +		split_page(page, large_order);
> +		for (i = 0; i < (1U << large_order); i++)
> +			pages[nr_allocated + i] = page + i;
> +
> +		nr_allocated += 1U << large_order;
> +		large_order = ilog2(nr_pages - nr_allocated);
> +	}
>  
So this is a third path for page allocation. The question is: should we
try all orders? As Matthew already noted, what if there is no order-5
page but there is an order-4 page? We could keep trying until we have
checked all orders; that way we can use pages of different orders to
fulfill the request.

The concern then is whether it is a waste of high-order pages. We can
easily go with the single page allocator, whereas someone else in the
system can not.

Apart from that, maybe we can drop the bulk path instead of having three paths?

--
Uladzislau Rezki



* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-15  3:56 ` Matthew Wilcox
@ 2025-10-15  9:28   ` Vishal Moola (Oracle)
  2025-10-16 16:12     ` Uladzislau Rezki
From: Vishal Moola (Oracle) @ 2025-10-15  9:28 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-kernel, Uladzislau Rezki, Andrew Morton

On Wed, Oct 15, 2025 at 04:56:42AM +0100, Matthew Wilcox wrote:
> On Tue, Oct 14, 2025 at 11:27:54AM -0700, Vishal Moola (Oracle) wrote:
> > Running 1000 iterations of allocations on a small 4GB system finds:
> > 
> > 1000 2mb allocations:
> > 	[Baseline]			[This patch]
> > 	real    46.310s			real    34.380s
> > 	user    0.001s			user    0.008s
> > 	sys     46.058s			sys     34.152s
> > 
> > 10000 200kb allocations:
> > 	[Baseline]			[This patch]
> > 	real    56.104s			real    43.946s
> > 	user    0.001s			user    0.003s
> > 	sys     55.375s			sys     43.259s
> > 
> > 10000 20kb allocations:
> > 	[Baseline]			[This patch]
> > 	real    0m8.438s		real    0m9.160s
> > 	user    0m0.001s		user    0m0.002s
> > 	sys     0m7.936s		sys     0m8.671s
> 
> I'd be more confident in the 20kB numbers if you'd done 10x more
> iterations.

I actually ran my test a number of times to mitigate the effects of possibly
too small sample sizes, so I do have that number for you too:

[Baseline]			[This patch]
real    1m28.119s		real    1m32.630s
user    0m0.012s		user    0m0.011s
sys     1m23.270s		sys     1m28.529s

> Also, I think 20kB is probably an _interesting_ number, but it's not
> going to display your change to its best advantage.  A 32kB allocation
> will look much better, for example.

I provided those particular numbers to showcase the beneficial cases as
well as the regression case.

I ended up finding that allocating sizes <=20k had noticeable
regressions, while [20k, 90k] was approximately the same, and >= 90k had
improvements (getting more and more noticeable as size grows in
magnitude).

> Also, can you go into more detail of the test?  Based on our off-list
> conversation, we were talking about allocating something like 100MB
> of memory (in these various sizes) then freeing it, just to be sure
> that we're measuring the performance of the buddy allocator and
> not the PCP list.

Yup.

What I did to get the numbers above was: call vmalloc() n times at that
particular size, then free all those allocations. Then I did 1000
iterations of that to get a better average.

So none of these allocations were freed until all the allocations were
done, every single time.
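
For reference, a minimal sketch of that loop (kernel-module context;
the names are illustrative, not the actual test code):

	#include <linux/slab.h>
	#include <linux/vmalloc.h>

	static void vmalloc_bench_once(size_t size, unsigned int n)
	{
		void **ptrs = kvmalloc_array(n, sizeof(*ptrs), GFP_KERNEL);
		unsigned int i;

		if (!ptrs)
			return;
		/* Allocate everything up front ... */
		for (i = 0; i < n; i++)
			ptrs[i] = vmalloc(size);
		/* ... and only then free, so the next round is served by
		 * the buddy allocator rather than warm PCP lists. */
		for (i = 0; i < n; i++)
			vfree(ptrs[i]);
		kvfree(ptrs);
	}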

> > This is an RFC, comments and thoughts are welcomed. There is a
> > clear benefit to be had for large allocations, but there is
> > some regression for smaller allocations.
> 
> Also we think that there's probably a later win to be had by
> not splitting the page we allocated.
> 
> At some point, we should also start allocating frozen pages
> for vmalloc.  That's going to be interesting for the users which
> map vmalloc pages to userspace.
> 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 97cef2cc14d3..0a25e5cf841c 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3621,6 +3621,38 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> >  	unsigned int nr_allocated = 0;
> >  	struct page *page;
> >  	int i;
> > +	gfp_t large_gfp = (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN;
> > +	unsigned int large_order = ilog2(nr_pages - nr_allocated);
> > +
> > +	/*
> > +	 * Initially, attempt to have the page allocator give us large order
> > +	 * pages. Do not attempt allocating smaller than order chunks since
> > +	 * __vmap_pages_range() expects physically contiguous pages of exactly
> > +	 * order long chunks.
> > +	 */
> > +	while (large_order > order && nr_allocated < nr_pages) {
> > +		/*
> > +		 * High-order nofail allocations are really expensive and
> > +		 * potentially dangerous (premature OOM, disruptive reclaim
> > +		 * and compaction etc.).
> > +		 */
> > +		if (gfp & __GFP_NOFAIL)
> > +			break;
> 
> sure, but we could just clear NOFAIL from the large_gfp flags instead
> of giving up on this path so quickly?

Yeah I'll do that.

> > +		if (nid == NUMA_NO_NODE)
> > +			page = alloc_pages_noprof(large_gfp, large_order);
> > +		else
> > +			page = alloc_pages_node_noprof(nid, large_gfp, large_order);
> > +
> > +		if (unlikely(!page))
> > +			break;
> 
> I'm not entirely convinced here.  We might want to fall back to the next
> largest size.  eg if we try to allocate an order-6 page, and there's not
> one readily available, perhaps we should try to allocate an order-5 page
> instead of falling back to the bulk allocator?

I'll try that out and see how that affects the numbers.

> > @@ -3665,7 +3697,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> >  		}
> >  	}
> >  
> > -	/* High-order pages or fallback path if "bulk" fails. */
> > +	/* High-order arch pages or fallback path if "bulk" fails. */
> 
> I'm not quite clear what this comment change is meant to convey?

Ah that was a comment I had inserted to remind myself that the passed in
order is tied to the HAVE_ARCH_HUGE_VMALLOC config. I meant to leave
that out.



* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-15  8:23 ` Uladzislau Rezki
@ 2025-10-15 10:44   ` Vishal Moola (Oracle)
  2025-10-15 12:42     ` Matthew Wilcox
From: Vishal Moola (Oracle) @ 2025-10-15 10:44 UTC (permalink / raw)
  To: Uladzislau Rezki; +Cc: linux-mm, linux-kernel, Andrew Morton

On Wed, Oct 15, 2025 at 10:23:19AM +0200, Uladzislau Rezki wrote:
> On Tue, Oct 14, 2025 at 11:27:54AM -0700, Vishal Moola (Oracle) wrote:
> > Sometimes, vm_area_alloc_pages() will want many pages from the buddy
> > allocator. Rather than making requests to the buddy allocator for at
> > most 100 pages at a time, we can eagerly request large order pages a
> > smaller number of times.
> > 
> > We still split the large order pages down to order-0 as the rest of the
> > vmalloc code (and some callers) depend on it. We still defer to the bulk
> > allocator and fallback path in case of order-0 pages or failure.
> > 
> > Running 1000 iterations of allocations on a small 4GB system finds:
> > 
> > 1000 2mb allocations:
> > 	[Baseline]			[This patch]
> > 	real    46.310s			real    34.380s
> > 	user    0.001s			user    0.008s
> > 	sys     46.058s			sys     34.152s
> > 
> > 10000 200kb allocations:
> > 	[Baseline]			[This patch]
> > 	real    56.104s			real    43.946s
> > 	user    0.001s			user    0.003s
> > 	sys     55.375s			sys     43.259s
> > 
> > 10000 20kb allocations:
> > 	[Baseline]			[This patch]
> > 	real    0m8.438s		real    0m9.160s
> > 	user    0m0.001s		user    0m0.002s
> > 	sys     0m7.936s		sys     0m8.671s
> > 
> > This is an RFC, comments and thoughts are welcomed. There is a
> > clear benefit to be had for large allocations, but there is
> > some regression for smaller allocations.
> > 
> > Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
> > ---
> >  mm/vmalloc.c | 34 +++++++++++++++++++++++++++++++++-
> >  1 file changed, 33 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 97cef2cc14d3..0a25e5cf841c 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3621,6 +3621,38 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> >  	unsigned int nr_allocated = 0;
> >  	struct page *page;
> >  	int i;
> > +	gfp_t large_gfp = (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN;
> > +	unsigned int large_order = ilog2(nr_pages - nr_allocated);
> >
> If large_order is > MAX_ORDER - 1 then there is no need to even try
> the large_order attempt.
> 
> >> unsigned int large_order = ilog2(nr_pages - nr_allocated);
> I think it is better to introduce a "remaining" variable, which is
> nr_pages - nr_allocated. On entry, "remaining" can be set to just
> nr_pages because "nr_allocated" is zero.

I like the idea too.

> Maybe it is also worth dropping/warning if __GFP_COMP is set?

split_page() has a BUG_ON(PageCompound) within, so we don't need one out
here for now.

> > +
> > +	/*
> > +	 * Initially, attempt to have the page allocator give us large order
> > +	 * pages. Do not attempt allocating smaller than order chunks since
> > +	 * __vmap_pages_range() expects physically contiguous pages of exactly
> > +	 * order long chunks.
> > +	 */
> > +	while (large_order > order && nr_allocated < nr_pages) {
> > +		/*
> > +		 * High-order nofail allocations are really expensive and
> > +		 * potentially dangerous (premature OOM, disruptive reclaim
> > +		 * and compaction etc.).
> > +		 */
> > +		if (gfp & __GFP_NOFAIL)
> > +			break;
> > +		if (nid == NUMA_NO_NODE)
> > +			page = alloc_pages_noprof(large_gfp, large_order);
> > +		else
> > +			page = alloc_pages_node_noprof(nid, large_gfp, large_order);
> > +
> > +		if (unlikely(!page))
> > +			break;
> > +
> > +		split_page(page, large_order);
> > +		for (i = 0; i < (1U << large_order); i++)
> > +			pages[nr_allocated + i] = page + i;
> > +
> > +		nr_allocated += 1U << large_order;
> > +		large_order = ilog2(nr_pages - nr_allocated);
> > +	}
> >  
> So this is a third path for page allocation. The question is: should we
> try all orders? As Matthew already noted, what if there is no order-5
> page but there is an order-4 page? We could keep trying until we have
> checked all orders; that way we can use pages of different orders to
> fulfill the request.
>
> The concern then is whether it is a waste of high-order pages. We can
> easily go with the single page allocator, whereas someone else in the
> system can not.

I feel like if we have high order pages available we'd rather allocate
those. Since the buddy allocator just coalesces the pages when they're
freed again, as soon as these allocations free up we are much more
likely to have large order pages ready to go again.

> Apart from that, maybe we can drop the bulk path instead of having three paths?

Probably. I'd say that just depends on whether we care about maintaining
the optimizations for smaller vmallocs() - which I have no strong opinion
on.

> --
> Uladzislau Rezki



* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-15 10:44   ` Vishal Moola (Oracle)
@ 2025-10-15 12:42     ` Matthew Wilcox
  2025-10-15 13:42       ` Uladzislau Rezki
From: Matthew Wilcox @ 2025-10-15 12:42 UTC (permalink / raw)
  To: Vishal Moola (Oracle)
  Cc: Uladzislau Rezki, linux-mm, linux-kernel, Andrew Morton

On Wed, Oct 15, 2025 at 03:44:22AM -0700, Vishal Moola (Oracle) wrote:
> On Wed, Oct 15, 2025 at 10:23:19AM +0200, Uladzislau Rezki wrote:
> > >  	int i;
> > > +	gfp_t large_gfp = (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN;
> > > +	unsigned int large_order = ilog2(nr_pages - nr_allocated);
> > >
> > If large_order is > MAX_ORDER - 1 then there is no need to even try
> > the large_order attempt.

Oh, I meant to mention that too.  Yes, this should be min(MAX_ORDER, ilog2()).

> > Maybe it is also worth dropping/warning if __GFP_COMP is set?
> 
> split_page() has a BUG_ON(PageCompound) within, so we don't need one out
> here for now.

I don't think people actually call vmalloc() with __GFP_COMP set, but
clearing it would do no harm here.

> > The concern then is whether it is a waste of high-order pages. We can
> > easily go with the single page allocator, whereas someone else in the
> > system can not.
> 
> I feel like if we have high order pages available we'd rather allocate
> those. Since the buddy allocator just coalesces the pages when they're
> freed again, as soon as these allocations free up we are much more
> likely to have large order pages ready to go again.

My PoV is different from either of you -- that we actually want
to allocate the high-order pages when we can because it reduces
fragmentation.  If we allocate five separate pages to satisfy a 20kB
allocation, those may come from five different 2MB pages (since they're
probably coming from the pcp lists which after a sufficiently long period
of running will be a jumble).  Whereas if we allocate an order-2 page
and an order-0 page, those can come from at most two 2MB pages.

I understand the "allocating order-0 pages helps by using up the remnants
of previous allocations" argument.  But I think on the whole we need to
be doing larger allocations where possible, not smaller ones.




* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-15 12:42     ` Matthew Wilcox
@ 2025-10-15 13:42       ` Uladzislau Rezki
  2025-10-16  6:57         ` Christoph Hellwig
From: Uladzislau Rezki @ 2025-10-15 13:42 UTC (permalink / raw)
  To: Matthew Wilcox, Vishal Moola (Oracle)
  Cc: Vishal Moola (Oracle),
	Uladzislau Rezki, linux-mm, linux-kernel, Andrew Morton

On Wed, Oct 15, 2025 at 01:42:46PM +0100, Matthew Wilcox wrote:
> On Wed, Oct 15, 2025 at 03:44:22AM -0700, Vishal Moola (Oracle) wrote:
> > On Wed, Oct 15, 2025 at 10:23:19AM +0200, Uladzislau Rezki wrote:
> > > >  	int i;
> > > > +	gfp_t large_gfp = (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN;
> > > > +	unsigned int large_order = ilog2(nr_pages - nr_allocated);
> > > >
> > > If large_order is > MAX_ORDER - 1 then there is no need to even try
> > > the large_order attempt.
> 
> Oh, I meant to mention that too.  Yes, this should be min(MAX_ORDER, ilog2()).
> 
> > > Maybe it is also worth dropping/warning if __GFP_COMP is set?
> > 
> > split_page() has a BUG_ON(PageCompound) within, so we don't need one out
> > here for now.
> 
> I don't think people actually call vmalloc() with __GFP_COMP set, but
> clearing it would do no harm here.
> 
Agreed. We do not want to hit the BUG_ON() in split_page(). I think it
is better to handle this explicitly even though nobody invokes vmalloc()
with __GFP_COMP.

> > > The concern then is whether it is a waste of high-order pages. We can
> > > easily go with the single page allocator, whereas someone else in the
> > > system can not.
> > 
> > I feel like if we have high order pages available we'd rather allocate
> > those. Since the buddy allocator just coalesces the pages when they're
> > freed again, as soon as these allocations free up we are much more
> > likely to have large order pages ready to go again.
> 
> My PoV is different from either of you -- that we actually want
> to allocate the high-order pages when we can because it reduces
> fragmentation.  If we allocate five separate pages to satisfy a 20kB
> allocation, those may come from five different 2MB pages (since they're
> probably coming from the pcp lists which after a sufficiently long period
> of running will be a jumble).  Whereas if we allocate an order-2 page
> and an order-0 page, those can come from at most two 2MB pages.
> 
> I understand the "allocating order-0 pages helps by using up the remnants
> of previous allocations" argument.  But I think on the whole we need to
> be doing larger allocations where possible, not smaller ones.
> 
OK, I see. Then we can start from the highest "fit" order and switch to
lower ones on failure, falling back to the bulk or single page allocator
if there are still not enough pages.

Thank you!

--
Uladzislau Rezki



* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-15 13:42       ` Uladzislau Rezki
@ 2025-10-16  6:57         ` Christoph Hellwig
  2025-10-16 11:53           ` Uladzislau Rezki
From: Christoph Hellwig @ 2025-10-16  6:57 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Matthew Wilcox, Vishal Moola (Oracle),
	linux-mm, linux-kernel, Andrew Morton

On Wed, Oct 15, 2025 at 03:42:50PM +0200, Uladzislau Rezki wrote:
> Agreed. We do not want to hit the BUG_ON() in split_page(). I think it
> is better to handle this explicitly even though nobody invokes vmalloc()
> with __GFP_COMP.

Please explicitly warn about and reject vmalloc calls with unsupported
flags.  The fact that many flags get silently ignored or dropped, or
lead to unexpected behavior, in gfp_t based interfaces is a constant
source of problems.
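
E.g., something along these lines early in the vmalloc entry point
(a sketch; the exact set of flags to reject is to be decided):

	if (WARN_ON_ONCE(gfp_mask & __GFP_COMP))
		return NULL;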




* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-16  6:57         ` Christoph Hellwig
@ 2025-10-16 11:53           ` Uladzislau Rezki
From: Uladzislau Rezki @ 2025-10-16 11:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Uladzislau Rezki, Matthew Wilcox, Vishal Moola (Oracle),
	linux-mm, linux-kernel, Andrew Morton

On Wed, Oct 15, 2025 at 11:57:56PM -0700, Christoph Hellwig wrote:
> On Wed, Oct 15, 2025 at 03:42:50PM +0200, Uladzislau Rezki wrote:
> > Agreed. We do not want to hit the BUG_ON() in split_page(). I think it
> > is better to handle this explicitly even though nobody invokes vmalloc()
> > with __GFP_COMP.
> 
> Please explicitly warn about and reject vmalloc calls with unsupported
> flags.  The fact that many flags get silently ignored or dropped, or
> lead to unexpected behavior, in gfp_t based interfaces is a constant
> source of problems.
> 
Thank you for the input.

--
Uladzislau Rezki



* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-15  9:28   ` Vishal Moola (Oracle)
@ 2025-10-16 16:12     ` Uladzislau Rezki
  2025-10-16 17:42       ` Vishal Moola (Oracle)
From: Uladzislau Rezki @ 2025-10-16 16:12 UTC (permalink / raw)
  To: Vishal Moola (Oracle)
  Cc: Matthew Wilcox, linux-mm, linux-kernel, Uladzislau Rezki, Andrew Morton

On Wed, Oct 15, 2025 at 02:28:49AM -0700, Vishal Moola (Oracle) wrote:
> On Wed, Oct 15, 2025 at 04:56:42AM +0100, Matthew Wilcox wrote:
> > On Tue, Oct 14, 2025 at 11:27:54AM -0700, Vishal Moola (Oracle) wrote:
> > > Running 1000 iterations of allocations on a small 4GB system finds:
> > > 
> > > 1000 2mb allocations:
> > > 	[Baseline]			[This patch]
> > > 	real    46.310s			real    34.380s
> > > 	user    0.001s			user    0.008s
> > > 	sys     46.058s			sys     34.152s
> > > 
> > > 10000 200kb allocations:
> > > 	[Baseline]			[This patch]
> > > 	real    56.104s			real    43.946s
> > > 	user    0.001s			user    0.003s
> > > 	sys     55.375s			sys     43.259s
> > > 
> > > 10000 20kb allocations:
> > > 	[Baseline]			[This patch]
> > > 	real    0m8.438s		real    0m9.160s
> > > 	user    0m0.001s		user    0m0.002s
> > > 	sys     0m7.936s		sys     0m8.671s
> > 
> > I'd be more confident in the 20kB numbers if you'd done 10x more
> > iterations.
> 
> I actually ran my test a number of times to mitigate the effects of possibly
> too small sample sizes, so I do have that number for you too:
> 
> [Baseline]			[This patch]
> real    1m28.119s		real    1m32.630s
> user    0m0.012s		user    0m0.011s
> sys     1m23.270s		sys     1m28.529s
> 
I have just had a look at the performance figures of this patch. The test
case is 16K allocations by a single thread, 1 000 000 loops, 10 runs:

sudo ./test_vmalloc.sh run_test_mask=1 nr_threads=1 nr_pages=4

BOX: AMD Milan, 256 CPUs, 512GB of memory

# default 16K alloc
[   15.823704] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955334 usec
[   17.751685] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1158739 usec
[   19.443759] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1016522 usec
[   21.035701] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 911381 usec
[   22.727688] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 987286 usec
[   24.199694] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955112 usec
[   25.755675] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 926393 usec
[   27.355670] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 937875 usec
[   28.979671] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1006985 usec
[   30.531674] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 941088 usec

# the patch 16K alloc
[   44.343380] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2296849 usec
[   47.171290] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2014678 usec
[   50.007258] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2094184 usec
[   52.651141] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1953046 usec
[   55.455089] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2209423 usec
[   57.943153] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1941747 usec
[   60.799043] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2038504 usec
[   63.299007] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1788588 usec
[   65.843011] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2137055 usec
[   68.647031] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2193022 usec

2X slower.

perf-cycles, same test but on 64 CPUs:

+   97.02%     0.13%  [test_vmalloc]    [k] fix_size_alloc_test
-   82.11%    82.10%  [kernel]          [k] native_queued_spin_lock_slowpath
     26.19% ret_from_fork_asm
        ret_from_fork
      - kthread
         - 25.96% test_func
            - fix_size_alloc_test
               - 23.49% __vmalloc_node_noprof
                  - __vmalloc_node_range_noprof
                     - 54.70% alloc_pages_noprof
                          alloc_pages_mpol
                          __alloc_frozen_pages_noprof
                          get_page_from_freelist
                          __rmqueue_pcplist
                     - 5.58% __get_vm_area_node
                          alloc_vmap_area
               - 20.54% vfree.part.0
                  - 20.43% __free_frozen_pages
                       free_frozen_page_commit
                       free_pcppages_bulk
                       _raw_spin_lock_irqsave
                       native_queued_spin_lock_slowpath
         - 0.77% worker_thread
            - process_one_work
               - 0.76% vmstat_update
                    refresh_cpu_vm_stats
                    decay_pcp_high
                    free_pcppages_bulk
                    _raw_spin_lock_irqsave
                    native_queued_spin_lock_slowpath
+   76.57%     0.16%  [kernel]          [k] _raw_spin_lock_irqsave
+   71.62%     0.00%  [kernel]          [k] __vmalloc_node_noprof
+   71.61%     0.58%  [kernel]          [k] __vmalloc_node_range_noprof
+   62.35%     0.06%  [kernel]          [k] alloc_pages_mpol
+   62.27%     0.17%  [kernel]          [k] __alloc_frozen_pages_noprof
+   62.20%     0.02%  [kernel]          [k] alloc_pages_noprof
+   62.10%     0.05%  [kernel]          [k] get_page_from_freelist
+   55.63%     0.19%  [kernel]          [k] __rmqueue_pcplist
+   32.11%     0.00%  [kernel]          [k] ret_from_fork_asm
+   32.11%     0.00%  [kernel]          [k] ret_from_fork
+   32.11%     0.00%  [kernel]          [k] kthread

I would say the bottleneck is the page allocator. It seems high-order
allocations are not good for it.

--
Uladzislau Rezki



* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-16 16:12     ` Uladzislau Rezki
@ 2025-10-16 17:42       ` Vishal Moola (Oracle)
  2025-10-16 19:02         ` Vishal Moola (Oracle)
From: Vishal Moola (Oracle) @ 2025-10-16 17:42 UTC (permalink / raw)
  To: Uladzislau Rezki; +Cc: Matthew Wilcox, linux-mm, linux-kernel, Andrew Morton

On Thu, Oct 16, 2025 at 06:12:36PM +0200, Uladzislau Rezki wrote:
> On Wed, Oct 15, 2025 at 02:28:49AM -0700, Vishal Moola (Oracle) wrote:
> > On Wed, Oct 15, 2025 at 04:56:42AM +0100, Matthew Wilcox wrote:
> > > On Tue, Oct 14, 2025 at 11:27:54AM -0700, Vishal Moola (Oracle) wrote:
> > > > Running 1000 iterations of allocations on a small 4GB system finds:
> > > > 
> > > > 1000 2mb allocations:
> > > > 	[Baseline]			[This patch]
> > > > 	real    46.310s			real    34.380s
> > > > 	user    0.001s			user    0.008s
> > > > 	sys     46.058s			sys     34.152s
> > > > 
> > > > 10000 200kb allocations:
> > > > 	[Baseline]			[This patch]
> > > > 	real    56.104s			real    43.946s
> > > > 	user    0.001s			user    0.003s
> > > > 	sys     55.375s			sys     43.259s
> > > > 
> > > > 10000 20kb allocations:
> > > > 	[Baseline]			[This patch]
> > > > 	real    0m8.438s		real    0m9.160s
> > > > 	user    0m0.001s		user    0m0.002s
> > > > 	sys     0m7.936s		sys     0m8.671s
> > > 
> > > I'd be more confident in the 20kB numbers if you'd done 10x more
> > > iterations.
> > 
> > I actually ran my test a number of times to mitigate the effects of possibly
> > too small sample sizes, so I do have that number for you too:
> > 
> > [Baseline]			[This patch]
> > real    1m28.119s		real    1m32.630s
> > user    0m0.012s		user    0m0.011s
> > sys     1m23.270s		sys     1m28.529s
> > 
> I have just had a look at the performance figures of this patch. The test
> case is 16K allocations by a single thread, 1 000 000 loops, 10 runs:
> 
> sudo ./test_vmalloc.sh run_test_mask=1 nr_threads=1 nr_pages=4

The reason I didn't use this test module is the same concern Matthew
brought up earlier about testing the PCP list rather than the buddy
allocator. The test module allocates, then frees, over and over again,
making it incredibly prone to reusing the same pages.

> BOX: AMD Milan, 256 CPUs, 512GB of memory
> 
> # default 16K alloc
> [   15.823704] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955334 usec
> [   17.751685] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1158739 usec
> [   19.443759] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1016522 usec
> [   21.035701] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 911381 usec
> [   22.727688] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 987286 usec
> [   24.199694] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955112 usec
> [   25.755675] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 926393 usec
> [   27.355670] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 937875 usec
> [   28.979671] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1006985 usec
> [   30.531674] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 941088 usec
> 
> # the patch 16K alloc
> [   44.343380] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2296849 usec
> [   47.171290] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2014678 usec
> [   50.007258] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2094184 usec
> [   52.651141] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1953046 usec
> [   55.455089] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2209423 usec
> [   57.943153] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1941747 usec
> [   60.799043] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2038504 usec
> [   63.299007] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1788588 usec
> [   65.843011] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2137055 usec
> [   68.647031] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2193022 usec
> 
> 2X slower.
> 
> perf-cycles, same test but on 64 CPUs:
> 
> +   97.02%     0.13%  [test_vmalloc]    [k] fix_size_alloc_test
> -   82.11%    82.10%  [kernel]          [k] native_queued_spin_lock_slowpath
>      26.19% ret_from_fork_asm
>         ret_from_fork
>       - kthread
>          - 25.96% test_func
>             - fix_size_alloc_test
>                - 23.49% __vmalloc_node_noprof
>                   - __vmalloc_node_range_noprof
>                      - 54.70% alloc_pages_noprof
>                           alloc_pages_mpol
>                           __alloc_frozen_pages_noprof
>                           get_page_from_freelist
>                           __rmqueue_pcplist
>                      - 5.58% __get_vm_area_node
>                           alloc_vmap_area
>                - 20.54% vfree.part.0
>                   - 20.43% __free_frozen_pages
>                        free_frozen_page_commit
>                        free_pcppages_bulk
>                        _raw_spin_lock_irqsave
>                        native_queued_spin_lock_slowpath
>          - 0.77% worker_thread
>             - process_one_work
>                - 0.76% vmstat_update
>                     refresh_cpu_vm_stats
>                     decay_pcp_high
>                     free_pcppages_bulk
>                     _raw_spin_lock_irqsave
>                     native_queued_spin_lock_slowpath
> +   76.57%     0.16%  [kernel]          [k] _raw_spin_lock_irqsave
> +   71.62%     0.00%  [kernel]          [k] __vmalloc_node_noprof
> +   71.61%     0.58%  [kernel]          [k] __vmalloc_node_range_noprof
> +   62.35%     0.06%  [kernel]          [k] alloc_pages_mpol
> +   62.27%     0.17%  [kernel]          [k] __alloc_frozen_pages_noprof
> +   62.20%     0.02%  [kernel]          [k] alloc_pages_noprof
> +   62.10%     0.05%  [kernel]          [k] get_page_from_freelist
> +   55.63%     0.19%  [kernel]          [k] __rmqueue_pcplist
> +   32.11%     0.00%  [kernel]          [k] ret_from_fork_asm
> +   32.11%     0.00%  [kernel]          [k] ret_from_fork
> +   32.11%     0.00%  [kernel]          [k] kthread
> 
> I would say the bottleneck is the page allocator. It seems high-order
> allocations are not good for it.
> 
> --
> Uladzislau Rezki



* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-16 17:42       ` Vishal Moola (Oracle)
@ 2025-10-16 19:02         ` Vishal Moola (Oracle)
  2025-10-17 16:15           ` Uladzislau Rezki
From: Vishal Moola (Oracle) @ 2025-10-16 19:02 UTC (permalink / raw)
  To: Uladzislau Rezki; +Cc: Matthew Wilcox, linux-mm, linux-kernel, Andrew Morton

On Thu, Oct 16, 2025 at 10:42:04AM -0700, Vishal Moola (Oracle) wrote:
> On Thu, Oct 16, 2025 at 06:12:36PM +0200, Uladzislau Rezki wrote:
> > On Wed, Oct 15, 2025 at 02:28:49AM -0700, Vishal Moola (Oracle) wrote:
> > > On Wed, Oct 15, 2025 at 04:56:42AM +0100, Matthew Wilcox wrote:
> > > > On Tue, Oct 14, 2025 at 11:27:54AM -0700, Vishal Moola (Oracle) wrote:
> > > > > Running 1000 iterations of allocations on a small 4GB system finds:
> > > > > 
> > > > > 1000 2mb allocations:
> > > > > 	[Baseline]			[This patch]
> > > > > 	real    46.310s			real    34.380s
> > > > > 	user    0.001s			user    0.008s
> > > > > 	sys     46.058s			sys     34.152s
> > > > > 
> > > > > 10000 200kb allocations:
> > > > > 	[Baseline]			[This patch]
> > > > > 	real    56.104s			real    43.946s
> > > > > 	user    0.001s			user    0.003s
> > > > > 	sys     55.375s			sys     43.259s
> > > > > 
> > > > > 10000 20kb allocations:
> > > > > 	[Baseline]			[This patch]
> > > > > 	real    0m8.438s		real    0m9.160s
> > > > > 	user    0m0.001s		user    0m0.002s
> > > > > 	sys     0m7.936s		sys     0m8.671s
> > > > 
> > > > I'd be more confident in the 20kB numbers if you'd done 10x more
> > > > iterations.
> > > 
> > > I actually ran my test a number of times to mitigate the effects of possibly
> > > too small sample sizes, so I do have that number for you too:
> > > 
> > > [Baseline]			[This patch]
> > > real    1m28.119s		real    1m32.630s
> > > user    0m0.012s		user    0m0.011s
> > > sys     1m23.270s		sys     1m28.529s
> > > 
> > I have just had a look at the performance figures of this patch. The test
> > case is 16K allocations by a single thread, 1 000 000 loops, 10 runs:
> > 
> > sudo ./test_vmalloc.sh run_test_mask=1 nr_threads=1 nr_pages=4
> 
> The reason I didn't use this test module is the same concern Matthew
> brought up earlier about testing the PCP list rather than buddy
> allocator. The test module allocates, then frees over and over again,
> making it incredibly prone to reuse the pages over and over again.
> 
> > BOX: AMD Milan, 256 CPUs, 512GB of memory
> > 
> > # default 16K alloc
> > [   15.823704] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955334 usec
> > [   17.751685] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1158739 usec
> > [   19.443759] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1016522 usec
> > [   21.035701] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 911381 usec
> > [   22.727688] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 987286 usec
> > [   24.199694] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955112 usec
> > [   25.755675] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 926393 usec
> > [   27.355670] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 937875 usec
> > [   28.979671] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1006985 usec
> > [   30.531674] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 941088 usec
> > 
> > # the patch 16K alloc
> > [   44.343380] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2296849 usec
> > [   47.171290] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2014678 usec
> > [   50.007258] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2094184 usec
> > [   52.651141] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1953046 usec
> > [   55.455089] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2209423 usec
> > [   57.943153] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1941747 usec
> > [   60.799043] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2038504 usec
> > [   63.299007] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1788588 usec
> > [   65.843011] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2137055 usec
> > [   68.647031] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2193022 usec
> > 
> > 2X slower.
> > 
> > perf-cycles, same test but on 64 CPUs:
> > 
> > +   97.02%     0.13%  [test_vmalloc]    [k] fix_size_alloc_test
> > -   82.11%    82.10%  [kernel]          [k] native_queued_spin_lock_slowpath
> >      26.19% ret_from_fork_asm
> >         ret_from_fork
> >       - kthread
> >          - 25.96% test_func
> >             - fix_size_alloc_test
> >                - 23.49% __vmalloc_node_noprof
> >                   - __vmalloc_node_range_noprof
> >                      - 54.70% alloc_pages_noprof
> >                           alloc_pages_mpol
> >                           __alloc_frozen_pages_noprof
> >                           get_page_from_freelist
> >                           __rmqueue_pcplist
> >                      - 5.58% __get_vm_area_node
> >                           alloc_vmap_area
> >                - 20.54% vfree.part.0
> >                   - 20.43% __free_frozen_pages
> >                        free_frozen_page_commit
> >                        free_pcppages_bulk
> >                        _raw_spin_lock_irqsave
> >                        native_queued_spin_lock_slowpath
> >          - 0.77% worker_thread
> >             - process_one_work
> >                - 0.76% vmstat_update
> >                     refresh_cpu_vm_stats
> >                     decay_pcp_high
> >                     free_pcppages_bulk
> >                     _raw_spin_lock_irqsave
> >                     native_queued_spin_lock_slowpath
> > +   76.57%     0.16%  [kernel]          [k] _raw_spin_lock_irqsave
> > +   71.62%     0.00%  [kernel]          [k] __vmalloc_node_noprof
> > +   71.61%     0.58%  [kernel]          [k] __vmalloc_node_range_noprof
> > +   62.35%     0.06%  [kernel]          [k] alloc_pages_mpol
> > +   62.27%     0.17%  [kernel]          [k] __alloc_frozen_pages_noprof
> > +   62.20%     0.02%  [kernel]          [k] alloc_pages_noprof
> > +   62.10%     0.05%  [kernel]          [k] get_page_from_freelist
> > +   55.63%     0.19%  [kernel]          [k] __rmqueue_pcplist
> > +   32.11%     0.00%  [kernel]          [k] ret_from_fork_asm
> > +   32.11%     0.00%  [kernel]          [k] ret_from_fork
> > +   32.11%     0.00%  [kernel]          [k] kthread
> > 
> > I would say the bottleneck is the page allocator. It seems high-order
> > allocations are not good for it.

Ah, I also just took a closer look at this. I realize that you did 16k
allocations (which is at most order-2), so they may not be a good
representation of high-order allocations either.

Plus, that falls into the regression range I detailed in my response to
Matthew elsewhere (I've copy-pasted it here for reference):

  I ended up finding that allocating sizes <=20k had noticeable
  regressions, while [20k, 90k] was approximately the same, and >= 90k had
  improvements (getting more and more noticeable as size grows in
  magnitude).

> > --
> > Uladzislau Rezki



* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-16 19:02         ` Vishal Moola (Oracle)
@ 2025-10-17 16:15           ` Uladzislau Rezki
  2025-10-17 17:19             ` Uladzislau Rezki
From: Uladzislau Rezki @ 2025-10-17 16:15 UTC (permalink / raw)
  To: Vishal Moola (Oracle)
  Cc: Uladzislau Rezki, Matthew Wilcox, linux-mm, linux-kernel, Andrew Morton

On Thu, Oct 16, 2025 at 12:02:59PM -0700, Vishal Moola (Oracle) wrote:
> On Thu, Oct 16, 2025 at 10:42:04AM -0700, Vishal Moola (Oracle) wrote:
> > On Thu, Oct 16, 2025 at 06:12:36PM +0200, Uladzislau Rezki wrote:
> > > On Wed, Oct 15, 2025 at 02:28:49AM -0700, Vishal Moola (Oracle) wrote:
> > > > On Wed, Oct 15, 2025 at 04:56:42AM +0100, Matthew Wilcox wrote:
> > > > > On Tue, Oct 14, 2025 at 11:27:54AM -0700, Vishal Moola (Oracle) wrote:
> > > > > > Running 1000 iterations of allocations on a small 4GB system finds:
> > > > > > 
> > > > > > 1000 2mb allocations:
> > > > > > 	[Baseline]			[This patch]
> > > > > > 	real    46.310s			real    34.380s
> > > > > > 	user    0.001s			user    0.008s
> > > > > > 	sys     46.058s			sys     34.152s
> > > > > > 
> > > > > > 10000 200kb allocations:
> > > > > > 	[Baseline]			[This patch]
> > > > > > 	real    56.104s			real    43.946s
> > > > > > 	user    0.001s			user    0.003s
> > > > > > 	sys     55.375s			sys     43.259s
> > > > > > 
> > > > > > 10000 20kb allocations:
> > > > > > 	[Baseline]			[This patch]
> > > > > > 	real    0m8.438s		real    0m9.160s
> > > > > > 	user    0m0.001s		user    0m0.002s
> > > > > > 	sys     0m7.936s		sys     0m8.671s
> > > > > 
> > > > > I'd be more confident in the 20kB numbers if you'd done 10x more
> > > > > iterations.
> > > > 
> > > > I actually ran my test a number of times to mitigate the effects of possibly
> > > > too small sample sizes, so I do have that number for you too:
> > > > 
> > > > [Baseline]			[This patch]
> > > > real    1m28.119s		real    1m32.630s
> > > > user    0m0.012s		user    0m0.011s
> > > > sys     1m23.270s		sys     1m28.529s
> > > > 
> > > I have just had a look at the performance figures of this patch. The test
> > > case is 16K allocations by a single thread, 1 000 000 loops, 10 runs:
> > > 
> > > sudo ./test_vmalloc.sh run_test_mask=1 nr_threads=1 nr_pages=4
> > 
> > The reason I didn't use this test module is the same concern Matthew
> > brought up earlier about testing the PCP list rather than the buddy
> > allocator. The test module allocates, then frees, over and over again,
> > making it incredibly prone to reusing the same pages.
> > 
> > > BOX: AMD Milan, 256 CPUs, 512GB of memory
> > > 
> > > # default 16K alloc
> > > [   15.823704] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955334 usec
> > > [   17.751685] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1158739 usec
> > > [   19.443759] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1016522 usec
> > > [   21.035701] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 911381 usec
> > > [   22.727688] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 987286 usec
> > > [   24.199694] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955112 usec
> > > [   25.755675] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 926393 usec
> > > [   27.355670] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 937875 usec
> > > [   28.979671] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1006985 usec
> > > [   30.531674] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 941088 usec
> > > 
> > > # the patch 16K alloc
> > > [   44.343380] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2296849 usec
> > > [   47.171290] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2014678 usec
> > > [   50.007258] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2094184 usec
> > > [   52.651141] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1953046 usec
> > > [   55.455089] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2209423 usec
> > > [   57.943153] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1941747 usec
> > > [   60.799043] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2038504 usec
> > > [   63.299007] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1788588 usec
> > > [   65.843011] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2137055 usec
> > > [   68.647031] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2193022 usec
> > > 
> > > 2X slower.
> > > 
> > > perf-cycles, same test but on 64 CPUs:
> > > 
> > > +   97.02%     0.13%  [test_vmalloc]    [k] fix_size_alloc_test
> > > -   82.11%    82.10%  [kernel]          [k] native_queued_spin_lock_slowpath
> > >      26.19% ret_from_fork_asm
> > >         ret_from_fork
> > >       - kthread
> > >          - 25.96% test_func
> > >             - fix_size_alloc_test
> > >                - 23.49% __vmalloc_node_noprof
> > >                   - __vmalloc_node_range_noprof
> > >                      - 54.70% alloc_pages_noprof
> > >                           alloc_pages_mpol
> > >                           __alloc_frozen_pages_noprof
> > >                           get_page_from_freelist
> > >                           __rmqueue_pcplist
> > >                      - 5.58% __get_vm_area_node
> > >                           alloc_vmap_area
> > >                - 20.54% vfree.part.0
> > >                   - 20.43% __free_frozen_pages
> > >                        free_frozen_page_commit
> > >                        free_pcppages_bulk
> > >                        _raw_spin_lock_irqsave
> > >                        native_queued_spin_lock_slowpath
> > >          - 0.77% worker_thread
> > >             - process_one_work
> > >                - 0.76% vmstat_update
> > >                     refresh_cpu_vm_stats
> > >                     decay_pcp_high
> > >                     free_pcppages_bulk
> > >                     _raw_spin_lock_irqsave
> > >                     native_queued_spin_lock_slowpath
> > > +   76.57%     0.16%  [kernel]          [k] _raw_spin_lock_irqsave
> > > +   71.62%     0.00%  [kernel]          [k] __vmalloc_node_noprof
> > > +   71.61%     0.58%  [kernel]          [k] __vmalloc_node_range_noprof
> > > +   62.35%     0.06%  [kernel]          [k] alloc_pages_mpol
> > > +   62.27%     0.17%  [kernel]          [k] __alloc_frozen_pages_noprof
> > > +   62.20%     0.02%  [kernel]          [k] alloc_pages_noprof
> > > +   62.10%     0.05%  [kernel]          [k] get_page_from_freelist
> > > +   55.63%     0.19%  [kernel]          [k] __rmqueue_pcplist
> > > +   32.11%     0.00%  [kernel]          [k] ret_from_fork_asm
> > > +   32.11%     0.00%  [kernel]          [k] ret_from_fork
> > > +   32.11%     0.00%  [kernel]          [k] kthread
> > > 
> > > I would say the bottleneck is the page allocator. It seems high-order
> > > allocations are not good for it.
> 
> Ah, I also just took a closer look at this. I realize that you also did 16k
> allocations (which is at most order-2), so it may not be a good
> representation of high-order allocations either.
> 
I agree. But then we should not optimize "small" orders and should focus
on the highest ones instead, because of the double degradation. I assume
a stress-ng fork test would also notice this.
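
Something like this is what I have in mind, just to illustrate the
gating (a hypothetical helper, not tested):

/*
 * Sketch only: take the eager buddy path just for orders the pcp lists
 * cannot cache anyway (PAGE_ALLOC_COSTLY_ORDER is 3), so that small
 * allocations keep the existing fast path.
 */
static inline bool want_eager_high_order(unsigned int order)
{
	return order > PAGE_ALLOC_COSTLY_ORDER;
}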

> Plus that falls into the regression range I found, which I detailed in
> response to Matthew elsewhere (I've copy-pasted it here for reference):
> 
>   I ended up finding that allocating sizes <=20k had noticeable
>   regressions, while [20k, 90k] was approximately the same, and >= 90k had
>   improvements (getting more and more noticeable as size grows in
>   magnitude).
> 
Yes, I did order-2 allocations:

# default
+   35.87%     4.24%  [kernel]            [k] alloc_pages_bulk_noprof
+   31.94%     0.88%  [kernel]            [k] vfree.part.0
-   27.38%    27.36%  [kernel]            [k] clear_page_rep
     27.36% ret_from_fork_asm
        ret_from_fork
        kthread
        test_func
        fix_size_alloc_test
        __vmalloc_node_noprof
        __vmalloc_node_range_noprof
        alloc_pages_bulk_noprof
        clear_page_rep

# patch
+   53.32%     1.12%  [kernel]        [k] get_page_from_freelist
+   49.41%     0.71%  [kernel]        [k] prep_new_page
-   48.70%    48.64%  [kernel]        [k] clear_page_rep
     48.64% ret_from_fork_asm
        ret_from_fork
        kthread
        test_func
        fix_size_alloc_test
        __vmalloc_node_noprof
        __vmalloc_node_range_noprof
        alloc_pages_noprof
        alloc_pages_mpol
        __alloc_frozen_pages_noprof
        get_page_from_freelist
        prep_new_page
        clear_page_rep

I noticed it is because of clear_page_rep(), which with the patch
consumes double the cycles.

Both versions should mostly go through the pcp cache; as far as I
remember, order-2 is allowed to be cached.
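
For reference, the check I mean is pcp_allowed_order() in
mm/page_alloc.c, roughly like this (quoting from memory, so treat it as
a sketch):

static inline bool pcp_allowed_order(unsigned int order)
{
	/* Anything up to PAGE_ALLOC_COSTLY_ORDER (3) can sit on pcp lists. */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return true;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/* PMD-sized pages are cached as well. */
	if (order == HPAGE_PMD_ORDER)
		return true;
#endif
	return false;
}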

I wonder why the patch gives 2x the cycles to clear_page_rep()...

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-17 16:15           ` Uladzislau Rezki
@ 2025-10-17 17:19             ` Uladzislau Rezki
  2025-10-20 18:23               ` Vishal Moola (Oracle)
  0 siblings, 1 reply; 15+ messages in thread
From: Uladzislau Rezki @ 2025-10-17 17:19 UTC (permalink / raw)
  To: Vishal Moola (Oracle)
  Cc: Vishal Moola (Oracle),
	Matthew Wilcox, linux-mm, linux-kernel, Andrew Morton

On Fri, Oct 17, 2025 at 06:15:21PM +0200, Uladzislau Rezki wrote:
> On Thu, Oct 16, 2025 at 12:02:59PM -0700, Vishal Moola (Oracle) wrote:
> > On Thu, Oct 16, 2025 at 10:42:04AM -0700, Vishal Moola (Oracle) wrote:
> > > On Thu, Oct 16, 2025 at 06:12:36PM +0200, Uladzislau Rezki wrote:
> > > > On Wed, Oct 15, 2025 at 02:28:49AM -0700, Vishal Moola (Oracle) wrote:
> > > > > On Wed, Oct 15, 2025 at 04:56:42AM +0100, Matthew Wilcox wrote:
> > > > > > On Tue, Oct 14, 2025 at 11:27:54AM -0700, Vishal Moola (Oracle) wrote:
> > > > > > > Running 1000 iterations of allocations on a small 4GB system finds:
> > > > > > > 
> > > > > > > 1000 2mb allocations:
> > > > > > > 	[Baseline]			[This patch]
> > > > > > > 	real    46.310s			real    34.380s
> > > > > > > 	user    0.001s			user    0.008s
> > > > > > > 	sys     46.058s			sys     34.152s
> > > > > > > 
> > > > > > > 10000 200kb allocations:
> > > > > > > 	[Baseline]			[This patch]
> > > > > > > 	real    56.104s			real    43.946s
> > > > > > > 	user    0.001s			user    0.003s
> > > > > > > 	sys     55.375s			sys     43.259s
> > > > > > > 
> > > > > > > 10000 20kb allocations:
> > > > > > > 	[Baseline]			[This patch]
> > > > > > > 	real    0m8.438s		real    0m9.160s
> > > > > > > 	user    0m0.001s		user    0m0.002s
> > > > > > > 	sys     0m7.936s		sys     0m8.671s
> > > > > > 
> > > > > > I'd be more confident in the 20kB numbers if you'd done 10x more
> > > > > > iterations.
> > > > > 
> > > > > I actually ran my tests a number of times to mitigate the effects of
> > > > > possibly too-small sample sizes, so I do have that number for you too:
> > > > > 
> > > > > [Baseline]			[This patch]
> > > > > real    1m28.119s		real    1m32.630s
> > > > > user    0m0.012s		user    0m0.011s
> > > > > sys     1m23.270s		sys     1m28.529s
> > > > > 
> > > > I have just had a look at the performance figures for this patch. The test
> > > > case is a 16K allocation by a single thread, 1,000,000 loops, 10 runs:
> > > > 
> > > > sudo ./test_vmalloc.sh run_test_mask=1 nr_threads=1 nr_pages=4
> > > 
> > > The reason I didn't use this test module is the same concern Matthew
> > > brought up earlier about testing the PCP list rather than the buddy
> > > allocator. The test module allocates, then frees, over and over again,
> > > making it incredibly prone to reusing the same pages.
> > > 
> > > > BOX: AMD Milan, 256 CPUs, 512GB of memory
> > > > 
> > > > # default 16K alloc
> > > > [   15.823704] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955334 usec
> > > > [   17.751685] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1158739 usec
> > > > [   19.443759] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1016522 usec
> > > > [   21.035701] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 911381 usec
> > > > [   22.727688] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 987286 usec
> > > > [   24.199694] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955112 usec
> > > > [   25.755675] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 926393 usec
> > > > [   27.355670] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 937875 usec
> > > > [   28.979671] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1006985 usec
> > > > [   30.531674] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 941088 usec
> > > > 
> > > > # the patch 16K alloc
> > > > [   44.343380] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2296849 usec
> > > > [   47.171290] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2014678 usec
> > > > [   50.007258] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2094184 usec
> > > > [   52.651141] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1953046 usec
> > > > [   55.455089] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2209423 usec
> > > > [   57.943153] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1941747 usec
> > > > [   60.799043] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2038504 usec
> > > > [   63.299007] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1788588 usec
> > > > [   65.843011] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2137055 usec
> > > > [   68.647031] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2193022 usec
> > > > 
> > > > 2X slower.
> > > > 
> > > > perf-cycles, same test but on 64 CPUs:
> > > > 
> > > > +   97.02%     0.13%  [test_vmalloc]    [k] fix_size_alloc_test
> > > > -   82.11%    82.10%  [kernel]          [k] native_queued_spin_lock_slowpath
> > > >      26.19% ret_from_fork_asm
> > > >         ret_from_fork
> > > >       - kthread
> > > >          - 25.96% test_func
> > > >             - fix_size_alloc_test
> > > >                - 23.49% __vmalloc_node_noprof
> > > >                   - __vmalloc_node_range_noprof
> > > >                      - 54.70% alloc_pages_noprof
> > > >                           alloc_pages_mpol
> > > >                           __alloc_frozen_pages_noprof
> > > >                           get_page_from_freelist
> > > >                           __rmqueue_pcplist
> > > >                      - 5.58% __get_vm_area_node
> > > >                           alloc_vmap_area
> > > >                - 20.54% vfree.part.0
> > > >                   - 20.43% __free_frozen_pages
> > > >                        free_frozen_page_commit
> > > >                        free_pcppages_bulk
> > > >                        _raw_spin_lock_irqsave
> > > >                        native_queued_spin_lock_slowpath
> > > >          - 0.77% worker_thread
> > > >             - process_one_work
> > > >                - 0.76% vmstat_update
> > > >                     refresh_cpu_vm_stats
> > > >                     decay_pcp_high
> > > >                     free_pcppages_bulk
> > > >                     _raw_spin_lock_irqsave
> > > >                     native_queued_spin_lock_slowpath
> > > > +   76.57%     0.16%  [kernel]          [k] _raw_spin_lock_irqsave
> > > > +   71.62%     0.00%  [kernel]          [k] __vmalloc_node_noprof
> > > > +   71.61%     0.58%  [kernel]          [k] __vmalloc_node_range_noprof
> > > > +   62.35%     0.06%  [kernel]          [k] alloc_pages_mpol
> > > > +   62.27%     0.17%  [kernel]          [k] __alloc_frozen_pages_noprof
> > > > +   62.20%     0.02%  [kernel]          [k] alloc_pages_noprof
> > > > +   62.10%     0.05%  [kernel]          [k] get_page_from_freelist
> > > > +   55.63%     0.19%  [kernel]          [k] __rmqueue_pcplist
> > > > +   32.11%     0.00%  [kernel]          [k] ret_from_fork_asm
> > > > +   32.11%     0.00%  [kernel]          [k] ret_from_fork
> > > > +   32.11%     0.00%  [kernel]          [k] kthread
> > > > 
> > > > I would say the bottleneck is the page allocator. It seems high-order
> > > > allocations are not good for it.
> > 
> > Ah, I also just took a closer look at this. I realize that you also did 16k
> > allocations (which is at most order-2), so it may not be a good
> > representation of high-order allocations either.
> > 
> I agree. But then we should not optimize "small" orders and should focus
> on the highest ones instead, because of the double degradation. I assume
> a stress-ng fork test would also notice this.
> 
> > Plus that falls into the regression range I found, which I detailed in
> > response to Matthew elsewhere (I've copy-pasted it here for reference):
> > 
> >   I ended up finding that allocating sizes <=20k had noticeable
> >   regressions, while [20k, 90k] was approximately the same, and >= 90k had
> >   improvements (getting more and more noticeable as size grows in
> >   magnitude).
> > 
> Yes, I did order-2 allocations:
> 
> # default
> +   35.87%     4.24%  [kernel]            [k] alloc_pages_bulk_noprof
> +   31.94%     0.88%  [kernel]            [k] vfree.part.0
> -   27.38%    27.36%  [kernel]            [k] clear_page_rep
>      27.36% ret_from_fork_asm
>         ret_from_fork
>         kthread
>         test_func
>         fix_size_alloc_test
>         __vmalloc_node_noprof
>         __vmalloc_node_range_noprof
>         alloc_pages_bulk_noprof
>         clear_page_rep
> 
> # patch
> +   53.32%     1.12%  [kernel]        [k] get_page_from_freelist
> +   49.41%     0.71%  [kernel]        [k] prep_new_page
> -   48.70%    48.64%  [kernel]        [k] clear_page_rep
>      48.64% ret_from_fork_asm
>         ret_from_fork
>         kthread
>         test_func
>         fix_size_alloc_test
>         __vmalloc_node_noprof
>         __vmalloc_node_range_noprof
>         alloc_pages_noprof
>         alloc_pages_mpol
>         __alloc_frozen_pages_noprof
>         get_page_from_freelist
>         prep_new_page
>         clear_page_rep
> 
> I noticed it is because of clear_page_rep(), which with the patch
> consumes double the cycles.
> 
> Both versions should mostly go through the pcp cache; as far as I
> remember, order-2 is allowed to be cached.
> 
> I wonder why the patch gives 2x the cycles to clear_page_rep()...
> 
And here we go with some results "without" the pcp exercise:

static int fix_size_alloc_test(void)
{
	void **ptr;
	int i;

	/* Pin to one CPU so every allocation hits the same pcp lists. */
	if (set_cpus_allowed_ptr(current, cpumask_of(1)) < 0)
		pr_err("Failed to set affinity to CPU %d\n", 1);

	ptr = vmalloc(sizeof(void *) * test_loop_count);
	if (!ptr)
		return -1;

	/* Allocate everything up front so vfree() cannot refill the pcp cache. */
	for (i = 0; i < test_loop_count; i++)
		ptr[i] = vmalloc((nr_pages > 0 ? nr_pages : 1) * PAGE_SIZE);

	for (i = 0; i < test_loop_count; i++) {
		if (ptr[i])
			vfree(ptr[i]);
	}

	vfree(ptr);
	return 0;
}

time sudo ./test_vmalloc.sh run_test_mask=1 nr_threads=1 nr_pages=nr-pages-in-order
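
(Here "nr-pages-in-order" stands for the page count of the order under
test: 2 for order-1, 4 for order-2, 8 for order-3, 64 for order-6.)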

# default order-1
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1423862 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1453518 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1451734 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1455142 usec

# patch order-1
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1431082 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1454855 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1476372 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1433379 usec

# default order-2
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2198130 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2208504 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2219533 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2214151 usec

# patch order-2
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2110344 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2044186 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2083308 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2073572 usec

# default order-3
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3718592 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3740495 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3737213 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3740765 usec

# patch order-3
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3350391 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3374568 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3286374 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3261335 usec

# default order-6(64 pages)
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 23847773 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 24015706 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 24226268 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 24078102 usec

# patch order-6
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 20128225 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 19968964 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 20067469 usec
Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 19928870 usec

Now I see that the results align with my initial thoughts from when I
first saw your patch.

The question that is still not clear to me is why the pcp case does
better even for cached orders.

Do you have any thoughts?

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator
  2025-10-17 17:19             ` Uladzislau Rezki
@ 2025-10-20 18:23               ` Vishal Moola (Oracle)
  0 siblings, 0 replies; 15+ messages in thread
From: Vishal Moola (Oracle) @ 2025-10-20 18:23 UTC (permalink / raw)
  To: Uladzislau Rezki; +Cc: Matthew Wilcox, linux-mm, linux-kernel, Andrew Morton

On Fri, Oct 17, 2025 at 07:19:16PM +0200, Uladzislau Rezki wrote:
> On Fri, Oct 17, 2025 at 06:15:21PM +0200, Uladzislau Rezki wrote:
> > On Thu, Oct 16, 2025 at 12:02:59PM -0700, Vishal Moola (Oracle) wrote:
> > > On Thu, Oct 16, 2025 at 10:42:04AM -0700, Vishal Moola (Oracle) wrote:
> > > > On Thu, Oct 16, 2025 at 06:12:36PM +0200, Uladzislau Rezki wrote:
> > > > > On Wed, Oct 15, 2025 at 02:28:49AM -0700, Vishal Moola (Oracle) wrote:
> > > > > > On Wed, Oct 15, 2025 at 04:56:42AM +0100, Matthew Wilcox wrote:
> > > > > > > On Tue, Oct 14, 2025 at 11:27:54AM -0700, Vishal Moola (Oracle) wrote:
> > > > > > > > Running 1000 iterations of allocations on a small 4GB system finds:
> > > > > > > > 
> > > > > > > > 1000 2mb allocations:
> > > > > > > > 	[Baseline]			[This patch]
> > > > > > > > 	real    46.310s			real    34.380s
> > > > > > > > 	user    0.001s			user    0.008s
> > > > > > > > 	sys     46.058s			sys     34.152s
> > > > > > > > 
> > > > > > > > 10000 200kb allocations:
> > > > > > > > 	[Baseline]			[This patch]
> > > > > > > > 	real    56.104s			real    43.946s
> > > > > > > > 	user    0.001s			user    0.003s
> > > > > > > > 	sys     55.375s			sys     43.259s
> > > > > > > > 
> > > > > > > > 10000 20kb allocations:
> > > > > > > > 	[Baseline]			[This patch]
> > > > > > > > 	real    0m8.438s		real    0m9.160s
> > > > > > > > 	user    0m0.001s		user    0m0.002s
> > > > > > > > 	sys     0m7.936s		sys     0m8.671s
> > > > > > > 
> > > > > > > I'd be more confident in the 20kB numbers if you'd done 10x more
> > > > > > > iterations.
> > > > > > 
> > > > > > I actually ran my tests a number of times to mitigate the effects of
> > > > > > possibly too-small sample sizes, so I do have that number for you too:
> > > > > > 
> > > > > > [Baseline]			[This patch]
> > > > > > real    1m28.119s		real    1m32.630s
> > > > > > user    0m0.012s		user    0m0.011s
> > > > > > sys     1m23.270s		sys     1m28.529s
> > > > > > 
> > > > > I have just had a look at the performance figures for this patch. The test
> > > > > case is a 16K allocation by a single thread, 1,000,000 loops, 10 runs:
> > > > > 
> > > > > sudo ./test_vmalloc.sh run_test_mask=1 nr_threads=1 nr_pages=4
> > > > 
> > > > The reason I didn't use this test module is the same concern Matthew
> > > > brought up earlier about testing the PCP list rather than the buddy
> > > > allocator. The test module allocates, then frees, over and over again,
> > > > making it incredibly prone to reusing the same pages.
> > > > 
> > > > > BOX: AMD Milan, 256 CPUs, 512GB of memory
> > > > > 
> > > > > # default 16K alloc
> > > > > [   15.823704] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955334 usec
> > > > > [   17.751685] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1158739 usec
> > > > > [   19.443759] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1016522 usec
> > > > > [   21.035701] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 911381 usec
> > > > > [   22.727688] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 987286 usec
> > > > > [   24.199694] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955112 usec
> > > > > [   25.755675] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 926393 usec
> > > > > [   27.355670] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 937875 usec
> > > > > [   28.979671] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1006985 usec
> > > > > [   30.531674] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 941088 usec
> > > > > 
> > > > > # the patch 16K alloc
> > > > > [   44.343380] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2296849 usec
> > > > > [   47.171290] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2014678 usec
> > > > > [   50.007258] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2094184 usec
> > > > > [   52.651141] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1953046 usec
> > > > > [   55.455089] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2209423 usec
> > > > > [   57.943153] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1941747 usec
> > > > > [   60.799043] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2038504 usec
> > > > > [   63.299007] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1788588 usec
> > > > > [   65.843011] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2137055 usec
> > > > > [   68.647031] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2193022 usec
> > > > > 
> > > > > 2X slower.
> > > > > 
> > > > > perf-cycles, same test but on 64 CPUs:
> > > > > 
> > > > > +   97.02%     0.13%  [test_vmalloc]    [k] fix_size_alloc_test
> > > > > -   82.11%    82.10%  [kernel]          [k] native_queued_spin_lock_slowpath
> > > > >      26.19% ret_from_fork_asm
> > > > >         ret_from_fork
> > > > >       - kthread
> > > > >          - 25.96% test_func
> > > > >             - fix_size_alloc_test
> > > > >                - 23.49% __vmalloc_node_noprof
> > > > >                   - __vmalloc_node_range_noprof
> > > > >                      - 54.70% alloc_pages_noprof
> > > > >                           alloc_pages_mpol
> > > > >                           __alloc_frozen_pages_noprof
> > > > >                           get_page_from_freelist
> > > > >                           __rmqueue_pcplist
> > > > >                      - 5.58% __get_vm_area_node
> > > > >                           alloc_vmap_area
> > > > >                - 20.54% vfree.part.0
> > > > >                   - 20.43% __free_frozen_pages
> > > > >                        free_frozen_page_commit
> > > > >                        free_pcppages_bulk
> > > > >                        _raw_spin_lock_irqsave
> > > > >                        native_queued_spin_lock_slowpath
> > > > >          - 0.77% worker_thread
> > > > >             - process_one_work
> > > > >                - 0.76% vmstat_update
> > > > >                     refresh_cpu_vm_stats
> > > > >                     decay_pcp_high
> > > > >                     free_pcppages_bulk
> > > > >                     _raw_spin_lock_irqsave
> > > > >                     native_queued_spin_lock_slowpath
> > > > > +   76.57%     0.16%  [kernel]          [k] _raw_spin_lock_irqsave
> > > > > +   71.62%     0.00%  [kernel]          [k] __vmalloc_node_noprof
> > > > > +   71.61%     0.58%  [kernel]          [k] __vmalloc_node_range_noprof
> > > > > +   62.35%     0.06%  [kernel]          [k] alloc_pages_mpol
> > > > > +   62.27%     0.17%  [kernel]          [k] __alloc_frozen_pages_noprof
> > > > > +   62.20%     0.02%  [kernel]          [k] alloc_pages_noprof
> > > > > +   62.10%     0.05%  [kernel]          [k] get_page_from_freelist
> > > > > +   55.63%     0.19%  [kernel]          [k] __rmqueue_pcplist
> > > > > +   32.11%     0.00%  [kernel]          [k] ret_from_fork_asm
> > > > > +   32.11%     0.00%  [kernel]          [k] ret_from_fork
> > > > > +   32.11%     0.00%  [kernel]          [k] kthread
> > > > > 
> > > > > I would say the bottleneck is the page allocator. It seems high-order
> > > > > allocations are not good for it.
> > > 
> > > Ah, I also just took a closer look at this. I realize that you also did 16k
> > > allocations (which is at most order-2), so it may not be a good
> > > representation of high-order allocations either.
> > > 
> > I agree. But then we should not optimize "small" orders and should focus
> > on the highest ones instead, because of the double degradation. I assume
> > a stress-ng fork test would also notice this.
> > 
> > > Plus that falls into the regression range I found, which I detailed in
> > > response to Matthew elsewhere (I've copy-pasted it here for reference):
> > > 
> > >   I ended up finding that allocating sizes <=20k had noticeable
> > >   regressions, while [20k, 90k] was approximately the same, and >= 90k had
> > >   improvements (getting more and more noticeable as size grows in
> > >   magnitude).
> > > 
> > Yes, I did order-2 allocations:
> > 
> > # default
> > +   35.87%     4.24%  [kernel]            [k] alloc_pages_bulk_noprof
> > +   31.94%     0.88%  [kernel]            [k] vfree.part.0
> > -   27.38%    27.36%  [kernel]            [k] clear_page_rep
> >      27.36% ret_from_fork_asm
> >         ret_from_fork
> >         kthread
> >         test_func
> >         fix_size_alloc_test
> >         __vmalloc_node_noprof
> >         __vmalloc_node_range_noprof
> >         alloc_pages_bulk_noprof
> >         clear_page_rep
> > 
> > # patch
> > +   53.32%     1.12%  [kernel]        [k] get_page_from_freelist
> > +   49.41%     0.71%  [kernel]        [k] prep_new_page
> > -   48.70%    48.64%  [kernel]        [k] clear_page_rep
> >      48.64% ret_from_fork_asm
> >         ret_from_fork
> >         kthread
> >         test_func
> >         fix_size_alloc_test
> >         __vmalloc_node_noprof
> >         __vmalloc_node_range_noprof
> >         alloc_pages_noprof
> >         alloc_pages_mpol
> >         __alloc_frozen_pages_noprof
> >         get_page_from_freelist
> >         prep_new_page
> >         clear_page_rep
> > 
> > I noticed it is because of clear_page_rep(), which with the patch
> > consumes double the cycles.
> > 
> > Both versions should mostly go through the pcp cache; as far as I
> > remember, order-2 is allowed to be cached.
> > 
> > I wonder why the patch gives 2x the cycles to clear_page_rep()...
> > 
> And here we go with some results "without" the pcp exercise:
> 
> static int fix_size_alloc_test(void)
> {
> 	void **ptr;
> 	int i;
> 
> 	/* Pin to one CPU so every allocation hits the same pcp lists. */
> 	if (set_cpus_allowed_ptr(current, cpumask_of(1)) < 0)
> 		pr_err("Failed to set affinity to CPU %d\n", 1);
> 
> 	ptr = vmalloc(sizeof(void *) * test_loop_count);
> 	if (!ptr)
> 		return -1;
> 
> 	/* Allocate everything up front so vfree() cannot refill the pcp cache. */
> 	for (i = 0; i < test_loop_count; i++)
> 		ptr[i] = vmalloc((nr_pages > 0 ? nr_pages : 1) * PAGE_SIZE);
> 
> 	for (i = 0; i < test_loop_count; i++) {
> 		if (ptr[i])
> 			vfree(ptr[i]);
> 	}
> 
> 	vfree(ptr);
> 	return 0;
> }
> 
> time sudo ./test_vmalloc.sh run_test_mask=1 nr_threads=1 nr_pages=nr-pages-in-order
> 
> # default order-1
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1423862 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1453518 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1451734 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1455142 usec
> 
> # patch order-1
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1431082 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1454855 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1476372 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1433379 usec
> 
> # default order-2
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2198130 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2208504 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2219533 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2214151 usec
> 
> # patch order-2
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2110344 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2044186 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2083308 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2073572 usec
> 
> # default order-3
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3718592 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3740495 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3737213 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3740765 usec
> 
> # patch order-3
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3350391 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3374568 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3286374 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 3261335 usec
> 
> # default order-6(64 pages)
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 23847773 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 24015706 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 24226268 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 24078102 usec
> 
> # patch order-6
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 20128225 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 19968964 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 20067469 usec
> Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 19928870 usec
> 
> > Now I see that the results align with my initial thoughts from when I
> > first saw your patch.
 
It's reassuring that your test results show similar performance even in
the order-1 and order-2 cases. This is what I was expecting as well.

I'm assuming this happened because you tested sizes that are an exact
power-of-two number of pages (whereas somehow I hadn't thought to do that).
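
To put numbers on that (my arithmetic, not from your logs): a 20kb
request is 5 pages, so the eager path can take only one order-2 chunk
(4 pages) and must cover the remaining page through the fallback path,
while a 16kb request is exactly 4 pages and is satisfied by a single
order-2 allocation with no remainder.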

> The question that is still not clear to me is why the pcp case does
> better even for cached orders.
>
> Do you have any thoughts?

I'm not sure either. I'm not familiar with the optimization differences
between the standard and bulk allocators :(

Looking at the code, it appears that although the pcp lists can cache
up to PAGE_ALLOC_COSTLY_ORDER (3), the bulk allocator doesn't support
anything other than order-0. And whenever order-0 pages are available,
the bulk allocator appears incredibly efficient at grabbing them.
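
Roughly the contrast I mean, as a sketch from memory (assuming gfp,
order, nr_pages and the pages array from the discussion above; names
and signatures may differ by kernel version, so this is not the exact
vmalloc code):

/* Order-0: one batched call can fill the whole array, mostly straight
 * off the pcp lists under a single round of locking. */
nr_allocated = alloc_pages_bulk(gfp, nr_pages, pages);

/* Order > 0: no bulk interface, so each chunk is a separate trip
 * through the allocator (storing the head page of every chunk). */
for (i = 0; i < nr_pages; i += 1U << order)
	pages[i] = alloc_pages(gfp, order);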


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-10-20 18:23 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-14 18:27 [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator Vishal Moola (Oracle)
2025-10-15  3:56 ` Matthew Wilcox
2025-10-15  9:28   ` Vishal Moola (Oracle)
2025-10-16 16:12     ` Uladzislau Rezki
2025-10-16 17:42       ` Vishal Moola (Oracle)
2025-10-16 19:02         ` Vishal Moola (Oracle)
2025-10-17 16:15           ` Uladzislau Rezki
2025-10-17 17:19             ` Uladzislau Rezki
2025-10-20 18:23               ` Vishal Moola (Oracle)
2025-10-15  8:23 ` Uladzislau Rezki
2025-10-15 10:44   ` Vishal Moola (Oracle)
2025-10-15 12:42     ` Matthew Wilcox
2025-10-15 13:42       ` Uladzislau Rezki
2025-10-16  6:57         ` Christoph Hellwig
2025-10-16 11:53           ` Uladzislau Rezki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox