* [PATCH] mm/vmalloc: request large order pages from buddy allocator
@ 2025-10-21 19:44 Vishal Moola (Oracle)
2025-10-21 21:24 ` Andrew Morton
2025-10-22 17:50 ` Uladzislau Rezki
0 siblings, 2 replies; 4+ messages in thread
From: Vishal Moola (Oracle) @ 2025-10-21 19:44 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Uladzislau Rezki, Andrew Morton, Vishal Moola (Oracle)
Sometimes, vm_area_alloc_pages() will want many pages from the buddy
allocator. Rather than making requests to the buddy allocator for at
most 100 pages at a time, we can eagerly request large order pages a
smaller number of times.
We still split the large order pages down to order-0 as the rest of the
vmalloc code (and some callers) depend on it. We still defer to the bulk
allocator and fallback path in case of order-0 pages or failure.
Running 1000 iterations of allocations on a small 4GB system finds:
1000 2MB allocations:
        [Baseline]   [This patch]
real    46.310s      34.582s
user     0.001s       0.006s
sys     46.058s      34.365s

10000 200KB allocations:
        [Baseline]   [This patch]
real    56.104s      43.696s
user     0.001s       0.003s
sys     55.375s      42.995s
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
-----
RFC:
https://lore.kernel.org/linux-mm/20251014182754.4329-1-vishal.moola@gmail.com/
Changes since rfc:
- Mask off __GFP_NOFAIL in large_gfp
- Mask off __GFP_COMP in large_gfp
There was discussion about warning on and rejecting unsupported GFP
flags in vmalloc; I'll have a separate patch for that.
- Introduce nr_remaining variable to track total pages
- Calculate large order as min(max_attempt_order, ilog2(nr_remaining))
- Attempt lower orders on failure before falling back to original path
- Drop unnecessary fallback comment change
---
mm/vmalloc.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index adde450ddf5e..0832f944544c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3619,8 +3619,44 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
unsigned int order, unsigned int nr_pages, struct page **pages)
{
unsigned int nr_allocated = 0;
+ unsigned int nr_remaining = nr_pages;
+ unsigned int max_attempt_order = MAX_PAGE_ORDER;
struct page *page;
int i;
+ gfp_t large_gfp = (gfp &
+ ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL | __GFP_COMP))
+ | __GFP_NOWARN;
+ unsigned int large_order = ilog2(nr_remaining);
+
+ large_order = min(max_attempt_order, large_order);
+
+ /*
+ * Initially, attempt to have the page allocator give us large order
+ * pages. Do not attempt orders smaller than the requested order, since
+ * __vmap_pages_range() expects physically contiguous chunks of exactly
+ * 1 << order pages.
+ */
+ while (large_order > order && nr_remaining) {
+ if (nid == NUMA_NO_NODE)
+ page = alloc_pages_noprof(large_gfp, large_order);
+ else
+ page = alloc_pages_node_noprof(nid, large_gfp, large_order);
+
+ if (unlikely(!page)) {
+ max_attempt_order = --large_order;
+ continue;
+ }
+
+ split_page(page, large_order);
+ for (i = 0; i < (1U << large_order); i++)
+ pages[nr_allocated + i] = page + i;
+
+ nr_allocated += 1U << large_order;
+ nr_remaining = nr_pages - nr_allocated;
+
+ large_order = ilog2(nr_remaining);
+ large_order = min(max_attempt_order, large_order);
+ }
/*
* For order-0 pages we make use of bulk allocator, if
--
2.51.0
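
[Editor's note: the userspace sketch below is illustrative only and not
part of the patch; ilog2_u() is a hypothetical stand-in for the kernel's
ilog2(). It simulates the greedy decomposition the new loop performs when
every allocation succeeds (the max_attempt_order lowering on failure is
omitted). For a 50-page (200KB) request with order == 0, it prints one
order-5, one order-4, and one order-1 allocation:]

#include <stdio.h>

/* ilog2() stand-in: index of the highest set bit */
static unsigned int ilog2_u(unsigned int n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

int main(void)
{
	unsigned int nr_pages = 50;	/* 200KB in 4KiB pages */
	unsigned int order = 0;		/* mapping order requested */
	unsigned int nr_remaining = nr_pages;
	unsigned int large_order = ilog2_u(nr_remaining);

	/* Greedy decomposition, mirroring the loop in the patch */
	while (large_order > order && nr_remaining) {
		printf("order-%u allocation (%u pages)\n",
		       large_order, 1U << large_order);
		nr_remaining -= 1U << large_order;
		if (!nr_remaining)
			break;
		large_order = ilog2_u(nr_remaining);
	}
	if (nr_remaining)
		printf("%u page(s) left for the bulk/fallback path\n",
		       nr_remaining);
	return 0;
}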
* Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator
2025-10-21 19:44 [PATCH] mm/vmalloc: request large order pages from buddy allocator Vishal Moola (Oracle)
@ 2025-10-21 21:24 ` Andrew Morton
2025-10-22 14:33 ` Matthew Wilcox
2025-10-22 17:50 ` Uladzislau Rezki
1 sibling, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2025-10-21 21:24 UTC (permalink / raw)
To: Vishal Moola (Oracle); +Cc: linux-mm, linux-kernel, Uladzislau Rezki
On Tue, 21 Oct 2025 12:44:56 -0700 "Vishal Moola (Oracle)" <vishal.moola@gmail.com> wrote:
> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
> allocator. Rather than making requests to the buddy allocator for at
> most 100 pages at a time, we can eagerly request large order pages a
> smaller number of times.
Does this have potential to inadvertently reduce the availability of
hugepages?
> We still split the large order pages down to order-0 as the rest of the
> vmalloc code (and some callers) depend on it. We still defer to the bulk
> allocator and fallback path in case of order-0 pages or failure.
>
> Running 1000 iterations of allocations on a small 4GB system finds:
>
> 1000 2MB allocations:
>         [Baseline]   [This patch]
> real    46.310s      34.582s
> user     0.001s       0.006s
> sys     46.058s      34.365s
>
> 10000 200KB allocations:
>         [Baseline]   [This patch]
> real    56.104s      43.696s
> user     0.001s       0.003s
> sys     55.375s      42.995s
Nice, but how significant is this change likely to be for a real workload?
> ...
>
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3619,8 +3619,44 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> unsigned int order, unsigned int nr_pages, struct page **pages)
> {
> unsigned int nr_allocated = 0;
> + unsigned int nr_remaining = nr_pages;
> + unsigned int max_attempt_order = MAX_PAGE_ORDER;
> struct page *page;
> int i;
> + gfp_t large_gfp = (gfp &
> + ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL | __GFP_COMP))
> + | __GFP_NOWARN;
Gee, why is this so complicated?
> + unsigned int large_order = ilog2(nr_remaining);
Should nr_remaining be rounded up to next-power-of-two?
> + large_order = min(max_attempt_order, large_order);
> +
> + /*
> + * Initially, attempt to have the page allocator give us large order
> + * pages. Do not attempt orders smaller than the requested order, since
> + * __vmap_pages_range() expects physically contiguous chunks of exactly
> + * 1 << order pages.
> + */
> + while (large_order > order && nr_remaining) {
> + if (nid == NUMA_NO_NODE)
> + page = alloc_pages_noprof(large_gfp, large_order);
> + else
> + page = alloc_pages_node_noprof(nid, large_gfp, large_order);
> +
> + if (unlikely(!page)) {
> + max_attempt_order = --large_order;
> + continue;
> + }
> +
> + split_page(page, large_order);
> + for (i = 0; i < (1U << large_order); i++)
> + pages[nr_allocated + i] = page + i;
> +
> + nr_allocated += 1U << large_order;
> + nr_remaining = nr_pages - nr_allocated;
> +
> + large_order = ilog2(nr_remaining);
> + large_order = min(max_attempt_order, large_order);
> + }
>
> /*
> * For order-0 pages we make use of bulk allocator, if
> --
> 2.51.0
* Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator
2025-10-21 21:24 ` Andrew Morton
@ 2025-10-22 14:33 ` Matthew Wilcox
0 siblings, 0 replies; 4+ messages in thread
From: Matthew Wilcox @ 2025-10-22 14:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Vishal Moola (Oracle), linux-mm, linux-kernel, Uladzislau Rezki
On Tue, Oct 21, 2025 at 02:24:36PM -0700, Andrew Morton wrote:
> On Tue, 21 Oct 2025 12:44:56 -0700 "Vishal Moola (Oracle)" <vishal.moola@gmail.com> wrote:
>
> > Sometimes, vm_area_alloc_pages() will want many pages from the buddy
> > allocator. Rather than making requests to the buddy allocator for at
> > most 100 pages at a time, we can eagerly request large order pages a
> > smaller number of times.
>
> Does this have potential to inadvertently reduce the availability of
> hugepages?
Quite the opposite. Let's say we're doing a 40KiB allocation. If we
just take the first 10 pages off the PCP list, those could be from
ten different 2MB chunks, preventing ten different hugepages from
forming until the allocation is freed. If instead we do an order-3
allocation and an order-1 allocation, those can be from at most two
different 2MB chunks and prevent at most two hugepages from forming.
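
[Editor's note: to make the arithmetic concrete — when every allocation
succeeds, the greedy ilog2() decomposition selects exactly the set bits
of the remaining page count, so an N-page request comes from at most
popcount(N) buddy blocks. A tiny check, assuming the GCC/Clang
__builtin_popcount() builtin:]

#include <stdio.h>

int main(void)
{
	unsigned int nr_pages = 10;	/* 40KiB in 4KiB pages: 0b1010 */

	/* 10 = 8 + 2 -> one order-3 plus one order-1 allocation */
	printf("at most %d buddy blocks touched\n",
	       __builtin_popcount(nr_pages));
	return 0;
}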
> > 1000 2MB allocations:
> >         [Baseline]   [This patch]
> > real    46.310s      34.582s
> > user     0.001s       0.006s
> > sys     46.058s      34.365s
> >
> > 10000 200KB allocations:
> >         [Baseline]   [This patch]
> > real    56.104s      43.696s
> > user     0.001s       0.003s
> > sys     55.375s      42.995s
>
> Nice, but how significant is this change likely to be for a real workload?
Ulad has numbers for the last iteration of this patch showing an
improvement for a 16KiB allocation, which is an improvement for fork()
now that we all have VMAP_STACK.
> > + gfp_t large_gfp = (gfp &
> > + ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL | __GFP_COMP))
> > + | __GFP_NOWARN;
>
> Gee, why is this so complicated?
Because GFP flags suck as an interface? Look at kmalloc_gfp_adjust().
> > + unsigned int large_order = ilog2(nr_remaining);
>
> Should nr_remaining be rounded up to next-power-of-two?
No, we don't want to overallocate, we want to precisely allocate.
To use our 40KiB example from earlier, we want to satisfy the allocation
by allocating a 32KiB chunk and an 8KiB chunk, not by allocating 64KiB
and only using part of it.
(I suppose there's an argument for using alloc_pages_exact() here, but
I think it's a fairly weak one)
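
[Editor's note: for context, alloc_pages_exact() takes the overallocation
route: it allocates the next power-of-two order, splits it, and frees the
unused tail pages, so a 40KiB request would briefly take 64KiB. A
simplified kernel-style sketch of that strategy, not the actual
mm/page_alloc.c implementation:]

/*
 * Sketch only: round the request up to a power-of-two order, split
 * the result, and hand back the pages past what the caller asked for.
 */
static struct page *alloc_exact_sketch(gfp_t gfp, unsigned int nr_pages)
{
	/* get_order() rounds the byte size up to an allocation order */
	unsigned int order = get_order((unsigned long)nr_pages << PAGE_SHIFT);
	struct page *page = alloc_pages(gfp, order);
	unsigned int i;

	if (!page)
		return NULL;

	split_page(page, order);
	/* Free the overallocated tail */
	for (i = nr_pages; i < (1U << order); i++)
		__free_page(page + i);

	return page;
}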
* Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator
2025-10-21 19:44 [PATCH] mm/vmalloc: request large order pages from buddy allocator Vishal Moola (Oracle)
2025-10-21 21:24 ` Andrew Morton
@ 2025-10-22 17:50 ` Uladzislau Rezki
1 sibling, 0 replies; 4+ messages in thread
From: Uladzislau Rezki @ 2025-10-22 17:50 UTC (permalink / raw)
To: Vishal Moola (Oracle)
Cc: linux-mm, linux-kernel, Uladzislau Rezki, Andrew Morton
On Tue, Oct 21, 2025 at 12:44:56PM -0700, Vishal Moola (Oracle) wrote:
> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
> allocator. Rather than making requests to the buddy allocator for at
> most 100 pages at a time, we can eagerly request large order pages a
> smaller number of times.
>
> We still split the large order pages down to order-0 as the rest of the
> vmalloc code (and some callers) depend on it. We still defer to the bulk
> allocator and fallback path in case of order-0 pages or failure.
>
> Running 1000 iterations of allocations on a small 4GB system finds:
>
> 1000 2MB allocations:
>         [Baseline]   [This patch]
> real    46.310s      34.582s
> user     0.001s       0.006s
> sys     46.058s      34.365s
>
> 10000 200KB allocations:
>         [Baseline]   [This patch]
> real    56.104s      43.696s
> user     0.001s       0.003s
> sys     55.375s      42.995s
>
> Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
>
> -----
> RFC:
> https://lore.kernel.org/linux-mm/20251014182754.4329-1-vishal.moola@gmail.com/
>
> Changes since rfc:
> - Mask off __GFP_NOFAIL in large_gfp
> - Mask off __GFP_COMP in large_gfp
> There was discussion about warning on and rejecting unsupported GFP
> flags in vmalloc; I'll have a separate patch for that.
>
> - Introduce nr_remaining variable to track total pages
> - Calculate large order as min(max_attempt_order, ilog2(nr_remaining))
> - Attempt lower orders on failure before falling back to original path
> - Drop unnecessary fallback comment change
> ---
> mm/vmalloc.c | 36 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index adde450ddf5e..0832f944544c 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3619,8 +3619,44 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> unsigned int order, unsigned int nr_pages, struct page **pages)
> {
> unsigned int nr_allocated = 0;
> + unsigned int nr_remaining = nr_pages;
> + unsigned int max_attempt_order = MAX_PAGE_ORDER;
> struct page *page;
> int i;
> + gfp_t large_gfp = (gfp &
> + ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL | __GFP_COMP))
> + | __GFP_NOWARN;
> + unsigned int large_order = ilog2(nr_remaining);
> +
> + large_order = min(max_attempt_order, large_order);
> +
> + /*
> + * Initially, attempt to have the page allocator give us large order
> + * pages. Do not attempt orders smaller than the requested order, since
> + * __vmap_pages_range() expects physically contiguous chunks of exactly
> + * 1 << order pages.
> + */
> + while (large_order > order && nr_remaining) {
> + if (nid == NUMA_NO_NODE)
> + page = alloc_pages_noprof(large_gfp, large_order);
> + else
> + page = alloc_pages_node_noprof(nid, large_gfp, large_order);
> +
> + if (unlikely(!page)) {
> + max_attempt_order = --large_order;
> + continue;
> + }
> +
> + split_page(page, large_order);
> + for (i = 0; i < (1U << large_order); i++)
> + pages[nr_allocated + i] = page + i;
> +
> + nr_allocated += 1U << large_order;
> + nr_remaining = nr_pages - nr_allocated;
> +
> + large_order = ilog2(nr_remaining);
> + large_order = min(max_attempt_order, large_order);
> + }
>
> /*
> * For order-0 pages we make use of bulk allocator, if
> --
> 2.51.0
>
I like the idea of page allocation using larger orders :)
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
--
Uladzislau Rezki