[RESEND RFC PATCH 0/2] Enable vmalloc huge mappings by default on arm64

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RESEND RFC PATCH 0/2] Enable vmalloc huge mappings by default on arm64
@ 2025-12-12  4:26 Dev Jain
  2025-12-12  4:27 ` [RESEND RFC PATCH 1/2] mm/vmalloc: Do not align size to huge size Dev Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Dev Jain @ 2025-12-12  4:26 UTC (permalink / raw)
  To: catalin.marinas, will, urezki, akpm, tytso, adilger.kernel, cem
  Cc: ryan.roberts, anshuman.khandual, shijie, yang, linux-arm-kernel,
	linux-kernel, linux-mm, npiggin, willy, david, ziy, Dev Jain

In the quest for reducing TLB pressure via block mappings, enable huge
vmalloc by default on arm64 for BBML2-noabort systems which support kernel
live mapping split.

This series is an RFC, because I cannot get a performance improvement for
the usual benchmarks which we have. Currently, vmalloc follows an opt-in
approach for block mappings - the users calling vmalloc_huge() are the ones
which expect the most advantage from block mappings. Most users of
vmalloc(), kvmalloc() and kvzalloc() map a single page. After applying this
series, it is expected that a considerable number of users will produce
cont mappings, and probably none will produce PMD mappings.

I am asking for help from the community in testing - I believe that one of
the testing methods is xfstests: a lot of code uses the APIs mentioned
above. I am hoping that someone can jump in and run at least xfstests, and
probably some other tests which can take advantage of the reduced TLB
pressure from vmalloc cont mappings.

---
Patchset applies on Linus' master (d358e5254674).

Dev Jain (2):
  mm/vmalloc: Do not align size to huge size
  arm64/mm: Enable huge-vmalloc by default

 arch/arm64/include/asm/vmalloc.h |  6 +++++
 arch/arm64/mm/pageattr.c         |  4 +--
 include/linux/vmalloc.h          |  7 ++++++
 mm/vmalloc.c                     | 43 +++++++++++++++++++++++++-------
 4 files changed, 48 insertions(+), 12 deletions(-)

-- 
2.30.2

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RESEND RFC PATCH 1/2] mm/vmalloc: Do not align size to huge size
  2025-12-12  4:26 [RESEND RFC PATCH 0/2] Enable vmalloc huge mappings by default on arm64 Dev Jain
@ 2025-12-12  4:27 ` Dev Jain
  2025-12-22 11:47   ` Uladzislau Rezki
  2025-12-12  4:27 ` [RESEND RFC PATCH 2/2] arm64/mm: Enable huge-vmalloc by default Dev Jain
  2026-01-12 10:49 ` [RESEND RFC PATCH 0/2] Enable vmalloc huge mappings by default on arm64 Uladzislau Rezki
  2 siblings, 1 reply; 6+ messages in thread
From: Dev Jain @ 2025-12-12  4:27 UTC (permalink / raw)
  To: catalin.marinas, will, urezki, akpm, tytso, adilger.kernel, cem
  Cc: ryan.roberts, anshuman.khandual, shijie, yang, linux-arm-kernel,
	linux-kernel, linux-mm, npiggin, willy, david, ziy, Dev Jain

vmalloc() consists of the following:

(1) find empty space in the vmalloc space -> (2) get physical pages from
the buddy system -> (3) map the pages into the pagetable.

It turns out that the cost of (1) and (3) is pretty insignificant. Hence,
the cost of vmalloc becomes highly sensitive to physical memory allocation
time.

Currently, if we decide to use huge mappings, apart from aligning the start
of the target vm_struct region to the huge-alignment, we also align the
size. This does not seem to produce any benefit (apart from simplification
of the code), and there is a clear disadvantage - as mentioned above, the
main cost of vmalloc comes from its interaction with the buddy system, and
thus requesting more memory than was requested by the caller is suboptimal
and unnecessary.

This change is also motivated due to the next patch ("arm64/mm: Enable
vmalloc-huge by default"). Suppose that some user of vmalloc maps 17 pages,
uses that mapping for an extremely short time, and vfree's it. That patch,
without this patch, on arm64 will ultimately map 16 * 2 = 32 pages in a
contiguous way. Since the mapping is used for a very short time, it is
likely that the extra cost of mapping 15 pages defeats any benefit from
reduced TLB pressure, and regresses that code path. 

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/vmalloc.c | 38 ++++++++++++++++++++++++++++++--------
 1 file changed, 30 insertions(+), 8 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ecbac900c35f..389225a6f7ef 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -654,7 +654,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
 int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
 		pgprot_t prot, struct page **pages, unsigned int page_shift)
 {
-	unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
+	unsigned int i, step, nr = (end - addr) >> PAGE_SHIFT;
 
 	WARN_ON(page_shift < PAGE_SHIFT);
 
@@ -662,7 +662,8 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
 			page_shift == PAGE_SHIFT)
 		return vmap_small_pages_range_noflush(addr, end, prot, pages);
 
-	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
+	step = 1U << (page_shift - PAGE_SHIFT);
+	for (i = 0; i < ALIGN_DOWN(nr, step); i += step) {
 		int err;
 
 		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
@@ -673,8 +674,9 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
 
 		addr += 1UL << page_shift;
 	}
-
-	return 0;
+	if (IS_ALIGNED(nr, step))
+		return 0;
+	return vmap_small_pages_range_noflush(addr, end, prot, pages + i);
 }
 
 int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
@@ -3197,7 +3199,7 @@ struct vm_struct *__get_vm_area_node(unsigned long size,
 	unsigned long requested_size = size;
 
 	BUG_ON(in_interrupt());
-	size = ALIGN(size, 1ul << shift);
+	size = PAGE_ALIGN(size);
 	if (unlikely(!size))
 		return NULL;
 
@@ -3353,7 +3355,7 @@ static void vm_reset_perms(struct vm_struct *area)
 	 * Find the start and end range of the direct mappings to make sure that
 	 * the vm_unmap_aliases() flush includes the direct map.
 	 */
-	for (i = 0; i < area->nr_pages; i += 1U << page_order) {
+	for (i = 0; i < ALIGN_DOWN(area->nr_pages, 1U << page_order); i += (1U << page_order)) {
 		unsigned long addr = (unsigned long)page_address(area->pages[i]);
 
 		if (addr) {
@@ -3365,6 +3367,18 @@ static void vm_reset_perms(struct vm_struct *area)
 			flush_dmap = 1;
 		}
 	}
+	for (; i < area->nr_pages; ++i) {
+		unsigned long addr = (unsigned long)page_address(area->pages[i]);
+
+		if (addr) {
+			unsigned long page_size;
+
+			page_size = PAGE_SIZE;
+			start = min(addr, start);
+			end = max(addr + page_size, end);
+			flush_dmap = 1;
+		}
+	}
 
 	/*
 	 * Set direct map to something invalid so that it won't be cached if
@@ -3673,6 +3687,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 	 * more permissive.
 	 */
 	if (!order) {
+single_page:
 		while (nr_allocated < nr_pages) {
 			unsigned int nr, nr_pages_request;
 
@@ -3704,13 +3719,18 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 			 * If zero or pages were obtained partly,
 			 * fallback to a single page allocator.
 			 */
-			if (nr != nr_pages_request)
+			if (nr != nr_pages_request) {
+				order = 0;
 				break;
+			}
 		}
 	}
 
 	/* High-order pages or fallback path if "bulk" fails. */
 	while (nr_allocated < nr_pages) {
+		if (nr_pages - nr_allocated < (1UL << order)) {
+			goto single_page;
+		}
 		if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
 			break;
 
@@ -5179,7 +5199,9 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v,
 
 	memset(counters, 0, nr_node_ids * sizeof(unsigned int));
 
-	for (nr = 0; nr < v->nr_pages; nr += step)
+	for (nr = 0; nr < ALIGN_DOWN(v->nr_pages, step); nr += step)
+		counters[page_to_nid(v->pages[nr])] += step;
+	for (; nr < v->nr_pages; ++nr)
 		counters[page_to_nid(v->pages[nr])] += step;
 	for_each_node_state(nr, N_HIGH_MEMORY)
 		if (counters[nr])
-- 
2.30.2



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RESEND RFC PATCH 1/2] mm/vmalloc: Do not align size to huge size
  2025-12-12  4:27 ` [RESEND RFC PATCH 1/2] mm/vmalloc: Do not align size to huge size Dev Jain
@ 2025-12-22 11:47   ` Uladzislau Rezki
  2025-12-24  5:05     ` Dev Jain
  0 siblings, 1 reply; 6+ messages in thread
From: Uladzislau Rezki @ 2025-12-22 11:47 UTC (permalink / raw)
  To: Dev Jain
  Cc: catalin.marinas, will, urezki, akpm, tytso, adilger.kernel, cem,
	ryan.roberts, anshuman.khandual, shijie, yang, linux-arm-kernel,
	linux-kernel, linux-mm, npiggin, willy, david, ziy

On Fri, Dec 12, 2025 at 09:57:00AM +0530, Dev Jain wrote:
> vmalloc() consists of the following:
> 
> (1) find empty space in the vmalloc space -> (2) get physical pages from
> the buddy system -> (3) map the pages into the pagetable.
> 
> It turns out that the cost of (1) and (3) is pretty insignificant. Hence,
> the cost of vmalloc becomes highly sensitive to physical memory allocation
> time.
> 
> Currently, if we decide to use huge mappings, apart from aligning the start
> of the target vm_struct region to the huge-alignment, we also align the
> size. This does not seem to produce any benefit (apart from simplification
> of the code), and there is a clear disadvantage - as mentioned above, the
> main cost of vmalloc comes from its interaction with the buddy system, and
> thus requesting more memory than was requested by the caller is suboptimal
> and unnecessary.
> 
> This change is also motivated due to the next patch ("arm64/mm: Enable
> vmalloc-huge by default"). Suppose that some user of vmalloc maps 17 pages,
> uses that mapping for an extremely short time, and vfree's it. That patch,
> without this patch, on arm64 will ultimately map 16 * 2 = 32 pages in a
> contiguous way. Since the mapping is used for a very short time, it is
> likely that the extra cost of mapping 15 pages defeats any benefit from
> reduced TLB pressure, and regresses that code path. 
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/vmalloc.c | 38 ++++++++++++++++++++++++++++++--------
>  1 file changed, 30 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ecbac900c35f..389225a6f7ef 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -654,7 +654,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>  int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>  		pgprot_t prot, struct page **pages, unsigned int page_shift)
>  {
> -	unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
> +	unsigned int i, step, nr = (end - addr) >> PAGE_SHIFT;
>  
>  	WARN_ON(page_shift < PAGE_SHIFT);
>  
> @@ -662,7 +662,8 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>  			page_shift == PAGE_SHIFT)
>  		return vmap_small_pages_range_noflush(addr, end, prot, pages);
>  
> -	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> +	step = 1U << (page_shift - PAGE_SHIFT);
> +	for (i = 0; i < ALIGN_DOWN(nr, step); i += step) {
>  		int err;
>  
>  		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> @@ -673,8 +674,9 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>  
>  		addr += 1UL << page_shift;
>  	}
> -
> -	return 0;
> +	if (IS_ALIGNED(nr, step))
> +		return 0;
> +	return vmap_small_pages_range_noflush(addr, end, prot, pages + i);
>  }
>  
Can we improve the readability?

<snip>
index 25a4178188ee..14ca019b57af 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -655,6 +655,8 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
                pgprot_t prot, struct page **pages, unsigned int page_shift)
 {
        unsigned int i, step, nr = (end - addr) >> PAGE_SHIFT;
+       unsigned int nr_aligned;
+       unsigned long chunk_size;

        WARN_ON(page_shift < PAGE_SHIFT);

@@ -662,20 +664,24 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
                        page_shift == PAGE_SHIFT)
                return vmap_small_pages_range_noflush(addr, end, prot, pages);

-       step = 1U << (page_shift - PAGE_SHIFT);
-       for (i = 0; i < ALIGN_DOWN(nr, step); i += step) {
-               int err;
+       step = 1U << (page_shift - PAGE_SHIFT); /* small pages per huge chunk. */
+       nr_aligned = ALIGN_DOWN(nr, step);
+       chunk_size = 1UL << page_shift;

-               err = vmap_range_noflush(addr, addr + (1UL << page_shift),
+       for (i = 0; i < nr_aligned; i += step) {
+               int err = vmap_range_noflush(addr, addr + chunk_size,
                                        page_to_phys(pages[i]), prot,
                                        page_shift);
                if (err)
                        return err;

-               addr += 1UL << page_shift;
+               addr += chunk_size;
        }
-       if (IS_ALIGNED(nr, step))
+
+       if (i == nr)
                return 0;
+
+       /* Map the tail using small pages. */
        return vmap_small_pages_range_noflush(addr, end, prot, pages + i);
 }
<snip>


>  int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> @@ -3197,7 +3199,7 @@ struct vm_struct *__get_vm_area_node(unsigned long size,
>  	unsigned long requested_size = size;
>  
>  	BUG_ON(in_interrupt());
> -	size = ALIGN(size, 1ul << shift);
> +	size = PAGE_ALIGN(size);
>  	if (unlikely(!size))
>  		return NULL;
>  
> @@ -3353,7 +3355,7 @@ static void vm_reset_perms(struct vm_struct *area)
>  	 * Find the start and end range of the direct mappings to make sure that
>  	 * the vm_unmap_aliases() flush includes the direct map.
>  	 */
> -	for (i = 0; i < area->nr_pages; i += 1U << page_order) {
> +	for (i = 0; i < ALIGN_DOWN(area->nr_pages, 1U << page_order); i += (1U << page_order)) {
>
nr_blocks?

>  		unsigned long addr = (unsigned long)page_address(area->pages[i]);
>  
>  		if (addr) {
> @@ -3365,6 +3367,18 @@ static void vm_reset_perms(struct vm_struct *area)
>  			flush_dmap = 1;
>  		}
>  	}
> +	for (; i < area->nr_pages; ++i) {
> +		unsigned long addr = (unsigned long)page_address(area->pages[i]);
> +
> +		if (addr) {
> +			unsigned long page_size;
> +
> +			page_size = PAGE_SIZE;
> +			start = min(addr, start);
> +			end = max(addr + page_size, end);
> +			flush_dmap = 1;
> +		}
> +	}
>  
>  	/*
>  	 * Set direct map to something invalid so that it won't be cached if
> @@ -3673,6 +3687,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  	 * more permissive.
>  	 */
>  	if (!order) {
> +single_page:
>  		while (nr_allocated < nr_pages) {
>  			unsigned int nr, nr_pages_request;
>  
> @@ -3704,13 +3719,18 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  			 * If zero or pages were obtained partly,
>  			 * fallback to a single page allocator.
>  			 */
> -			if (nr != nr_pages_request)
> +			if (nr != nr_pages_request) {
> +				order = 0;
>  				break;
> +			}
>  		}
>  	}
>  
>  	/* High-order pages or fallback path if "bulk" fails. */
>  	while (nr_allocated < nr_pages) {
> +		if (nr_pages - nr_allocated < (1UL << order)) {
> +			goto single_page;
> +		}
>  		if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
>  			break;
>
Yes, it requires more attention. That "goto single_page" should be
eliminated, IMO. We should not jump between blocks, logically the
single_page belongs to "order-0 alloc path".

Probably it requires more refactoring to simplify it.

>  
> @@ -5179,7 +5199,9 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v,
>  
>  	memset(counters, 0, nr_node_ids * sizeof(unsigned int));
>  
> -	for (nr = 0; nr < v->nr_pages; nr += step)
> +	for (nr = 0; nr < ALIGN_DOWN(v->nr_pages, step); nr += step)
> +		counters[page_to_nid(v->pages[nr])] += step;
> +	for (; nr < v->nr_pages; ++nr)
>  		counters[page_to_nid(v->pages[nr])] += step;
>
Can we fit it into one loop? Last tail loop continuous adding step?

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RESEND RFC PATCH 1/2] mm/vmalloc: Do not align size to huge size
  2025-12-22 11:47   ` Uladzislau Rezki
@ 2025-12-24  5:05     ` Dev Jain
  0 siblings, 0 replies; 6+ messages in thread
From: Dev Jain @ 2025-12-24  5:05 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: catalin.marinas, will, akpm, tytso, adilger.kernel, cem,
	ryan.roberts, anshuman.khandual, shijie, yang, linux-arm-kernel,
	linux-kernel, linux-mm, npiggin, willy, david, ziy


On 22/12/25 5:17 pm, Uladzislau Rezki wrote:
> On Fri, Dec 12, 2025 at 09:57:00AM +0530, Dev Jain wrote:
>> vmalloc() consists of the following:
>>
>> (1) find empty space in the vmalloc space -> (2) get physical pages from
>> the buddy system -> (3) map the pages into the pagetable.
>>
>> It turns out that the cost of (1) and (3) is pretty insignificant. Hence,
>> the cost of vmalloc becomes highly sensitive to physical memory allocation
>> time.
>>
>> Currently, if we decide to use huge mappings, apart from aligning the start
>> of the target vm_struct region to the huge-alignment, we also align the
>> size. This does not seem to produce any benefit (apart from simplification
>> of the code), and there is a clear disadvantage - as mentioned above, the
>> main cost of vmalloc comes from its interaction with the buddy system, and
>> thus requesting more memory than was requested by the caller is suboptimal
>> and unnecessary.
>>
>> This change is also motivated due to the next patch ("arm64/mm: Enable
>> vmalloc-huge by default"). Suppose that some user of vmalloc maps 17 pages,
>> uses that mapping for an extremely short time, and vfree's it. That patch,
>> without this patch, on arm64 will ultimately map 16 * 2 = 32 pages in a
>> contiguous way. Since the mapping is used for a very short time, it is
>> likely that the extra cost of mapping 15 pages defeats any benefit from
>> reduced TLB pressure, and regresses that code path. 
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>  mm/vmalloc.c | 38 ++++++++++++++++++++++++++++++--------
>>  1 file changed, 30 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index ecbac900c35f..389225a6f7ef 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -654,7 +654,7 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
>>  int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>  		pgprot_t prot, struct page **pages, unsigned int page_shift)
>>  {
>> -	unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
>> +	unsigned int i, step, nr = (end - addr) >> PAGE_SHIFT;
>>  
>>  	WARN_ON(page_shift < PAGE_SHIFT);
>>  
>> @@ -662,7 +662,8 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>  			page_shift == PAGE_SHIFT)
>>  		return vmap_small_pages_range_noflush(addr, end, prot, pages);
>>  
>> -	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
>> +	step = 1U << (page_shift - PAGE_SHIFT);
>> +	for (i = 0; i < ALIGN_DOWN(nr, step); i += step) {
>>  		int err;
>>  
>>  		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
>> @@ -673,8 +674,9 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>  
>>  		addr += 1UL << page_shift;
>>  	}
>> -
>> -	return 0;
>> +	if (IS_ALIGNED(nr, step))
>> +		return 0;
>> +	return vmap_small_pages_range_noflush(addr, end, prot, pages + i);
>>  }
>>  
> Can we improve the readability?
>
> <snip>
> index 25a4178188ee..14ca019b57af 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -655,6 +655,8 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>                 pgprot_t prot, struct page **pages, unsigned int page_shift)
>  {
>         unsigned int i, step, nr = (end - addr) >> PAGE_SHIFT;
> +       unsigned int nr_aligned;
> +       unsigned long chunk_size;
>
>         WARN_ON(page_shift < PAGE_SHIFT);
>
> @@ -662,20 +664,24 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>                         page_shift == PAGE_SHIFT)
>                 return vmap_small_pages_range_noflush(addr, end, prot, pages);
>
> -       step = 1U << (page_shift - PAGE_SHIFT);
> -       for (i = 0; i < ALIGN_DOWN(nr, step); i += step) {
> -               int err;
> +       step = 1U << (page_shift - PAGE_SHIFT); /* small pages per huge chunk. */
> +       nr_aligned = ALIGN_DOWN(nr, step);
> +       chunk_size = 1UL << page_shift;
>
> -               err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> +       for (i = 0; i < nr_aligned; i += step) {
> +               int err = vmap_range_noflush(addr, addr + chunk_size,
>                                         page_to_phys(pages[i]), prot,
>                                         page_shift);
>                 if (err)
>                         return err;
>
> -               addr += 1UL << page_shift;
> +               addr += chunk_size;
>         }
> -       if (IS_ALIGNED(nr, step))
> +
> +       if (i == nr)
>                 return 0;
> +
> +       /* Map the tail using small pages. */
>         return vmap_small_pages_range_noflush(addr, end, prot, pages + i);
>  }
> <snip>

Indeed I can do this.

>
>>  int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>> @@ -3197,7 +3199,7 @@ struct vm_struct *__get_vm_area_node(unsigned long size,
>>  	unsigned long requested_size = size;
>>  
>>  	BUG_ON(in_interrupt());
>> -	size = ALIGN(size, 1ul << shift);
>> +	size = PAGE_ALIGN(size);
>>  	if (unlikely(!size))
>>  		return NULL;
>>  
>> @@ -3353,7 +3355,7 @@ static void vm_reset_perms(struct vm_struct *area)
>>  	 * Find the start and end range of the direct mappings to make sure that
>>  	 * the vm_unmap_aliases() flush includes the direct map.
>>  	 */
>> -	for (i = 0; i < area->nr_pages; i += 1U << page_order) {
>> +	for (i = 0; i < ALIGN_DOWN(area->nr_pages, 1U << page_order); i += (1U << page_order)) {
>>
> nr_blocks?
>
>>  		unsigned long addr = (unsigned long)page_address(area->pages[i]);
>>  
>>  		if (addr) {
>> @@ -3365,6 +3367,18 @@ static void vm_reset_perms(struct vm_struct *area)
>>  			flush_dmap = 1;
>>  		}
>>  	}
>> +	for (; i < area->nr_pages; ++i) {
>> +		unsigned long addr = (unsigned long)page_address(area->pages[i]);
>> +
>> +		if (addr) {
>> +			unsigned long page_size;
>> +
>> +			page_size = PAGE_SIZE;
>> +			start = min(addr, start);
>> +			end = max(addr + page_size, end);
>> +			flush_dmap = 1;
>> +		}
>> +	}
>>  
>>  	/*
>>  	 * Set direct map to something invalid so that it won't be cached if
>> @@ -3673,6 +3687,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>  	 * more permissive.
>>  	 */
>>  	if (!order) {
>> +single_page:
>>  		while (nr_allocated < nr_pages) {
>>  			unsigned int nr, nr_pages_request;
>>  
>> @@ -3704,13 +3719,18 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>  			 * If zero or pages were obtained partly,
>>  			 * fallback to a single page allocator.
>>  			 */
>> -			if (nr != nr_pages_request)
>> +			if (nr != nr_pages_request) {
>> +				order = 0;
>>  				break;
>> +			}
>>  		}
>>  	}
>>  
>>  	/* High-order pages or fallback path if "bulk" fails. */
>>  	while (nr_allocated < nr_pages) {
>> +		if (nr_pages - nr_allocated < (1UL << order)) {
>> +			goto single_page;
>> +		}
>>  		if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
>>  			break;
>>
> Yes, it requires more attention. That "goto single_page" should be
> eliminated, IMO. We should not jump between blocks, logically the
> single_page belongs to "order-0 alloc path".
>
> Probably it requires more refactoring to simplify it.

I can think about refactoring this.

>
>>  
>> @@ -5179,7 +5199,9 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v,
>>  
>>  	memset(counters, 0, nr_node_ids * sizeof(unsigned int));
>>  
>> -	for (nr = 0; nr < v->nr_pages; nr += step)
>> +	for (nr = 0; nr < ALIGN_DOWN(v->nr_pages, step); nr += step)
>> +		counters[page_to_nid(v->pages[nr])] += step;
>> +	for (; nr < v->nr_pages; ++nr)
>>  		counters[page_to_nid(v->pages[nr])] += step;
>>
> Can we fit it into one loop? Last tail loop continuous adding step?

I don't see any other way.

>
> --
> Uladzislau Rezki


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RESEND RFC PATCH 2/2] arm64/mm: Enable huge-vmalloc by default
  2025-12-12  4:26 [RESEND RFC PATCH 0/2] Enable vmalloc huge mappings by default on arm64 Dev Jain
  2025-12-12  4:27 ` [RESEND RFC PATCH 1/2] mm/vmalloc: Do not align size to huge size Dev Jain
@ 2025-12-12  4:27 ` Dev Jain
  2026-01-12 10:49 ` [RESEND RFC PATCH 0/2] Enable vmalloc huge mappings by default on arm64 Uladzislau Rezki
  2 siblings, 0 replies; 6+ messages in thread
From: Dev Jain @ 2025-12-12  4:27 UTC (permalink / raw)
  To: catalin.marinas, will, urezki, akpm, tytso, adilger.kernel, cem
  Cc: ryan.roberts, anshuman.khandual, shijie, yang, linux-arm-kernel,
	linux-kernel, linux-mm, npiggin, willy, david, ziy, Dev Jain

For BBML2-noabort arm64 systems, enable vmalloc cont mappings and PMD
mappings by default.

There is benefit to be gained in any code path which maps >= 16 pages using
vmalloc, since any usage of that mapping will now come with reduced TLB
pressure.

Currently, I am not being able to produce a reliable statistically
significant improvement for the benchmarks which we have. I am optimistic
that xfs benchmarks should give some benefit.

Upon running test_vmalloc.sh, this series produces an optimization and
some regressions. I conclude that we should ignore the results of this
testsuite. I explain the regression in the long_busy_list_alloc_test below:
upon running ./test_vmalloc.sh run_test_mask=4 nr_threads=1, a regression
of approx 17% is observed (which increases to 31% if we do *not* apply the
previous patch ("mm/vmalloc: Do not align size to huge size")).

The long_busy_list_alloc_test first maps a lot of single pages to fragment
the vmalloc space. Then, it does the following in a loop: map 100 pages,
map a single page, then vfree both of them. My investigation reveals that
the majority of time is *not* spent in finding a free space in the vmalloc
region (which is exactly the time which the setup of this particular
test wants to increase), but in the interaction with the physical memory
allocator.

It turns out that mapping 100 pages in a contiguous way is *faster* than
bulk mapping 100 single pages. The regression is actually carried by
vfree(). When we contpte map 100 pages, we get 6 * 16 = 96 pages from
the free lists of the buddy allocator, and not the pcp lists. Then, vmalloc
subsystem splits this page into individual pages because drivers can
operate on individual pages, messing up the refcounts. As a result, vfree
frees these pages as single 4k pages, freeing them into the pcp lists.
Thus, now we have got a behaviour of taking from the freelists of the
buddy, and freeing into the pcp lists, which forces pcp draining into the
freelists. By playing with the following code in mm/page_alloc.c:

	high = nr_pcp_high(pcp, zone, batch, free_high);
	if (pcp->count < high)
		return;
The time taken by the test is highly sensitive to the value returned by
nr_pcp_high (although, increasing the value of high does not reduce the
regression).

Summarizing, the regression is due to messing up the state of the buddy
system by rapidly stealing from the freelists and not giving back to them.

If we insert an msleep(1) just before we vfree() both the regions, the
regression reduces. If we reduce the number of iterations in the test, the
regression is gone. This proves that the regression is due to the unnatural
behaviour of the test - it allocates memory, does absolutely nothing with
that memory, and releases it. No workload is expected to map memory without
actually utilizing it for some time. The time between vmalloc() and vfree()
will give time for the buddy to stabilize, and the regression is
eliminated.

The optimization is observed in fix_size_alloc_test with nr_pages = 512,
because both vmalloc() and vfree() will now operate to and from the pcp.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 arch/arm64/include/asm/vmalloc.h | 6 ++++++
 arch/arm64/mm/pageattr.c         | 4 +---
 include/linux/vmalloc.h          | 7 +++++++
 mm/vmalloc.c                     | 5 ++++-
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 4ec1acd3c1b3..c72ae9bd7360 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -6,6 +6,12 @@

 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP

+#define arch_wants_vmalloc_huge_always arch_wants_vmalloc_huge_always
+static inline bool arch_wants_vmalloc_huge_always(void)
+{
+	return system_supports_bbml2_noabort();
+}
+
 #define arch_vmap_pud_supported arch_vmap_pud_supported
 static inline bool arch_vmap_pud_supported(pgprot_t prot)
 {
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index f0e784b963e6..eddbc202ffdd 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -163,8 +163,6 @@ static int change_memory_common(unsigned long addr, int numpages,
 	 * we are operating on does not result in such splitting.
 	 *
 	 * Let's restrict ourselves to mappings created by vmalloc (or vmap).
-	 * Disallow VM_ALLOW_HUGE_VMAP mappings to guarantee that only page
-	 * mappings are updated and splitting is never needed.
 	 *
 	 * So check whether the [addr, addr + size) interval is entirely
 	 * covered by precisely one VM area that has the VM_ALLOC flag set.
@@ -172,7 +170,7 @@ static int change_memory_common(unsigned long addr, int numpages,
 	area = find_vm_area((void *)addr);
 	if (!area ||
 	    end > (unsigned long)kasan_reset_tag(area->addr) + area->size ||
-	    ((area->flags & (VM_ALLOC | VM_ALLOW_HUGE_VMAP)) != VM_ALLOC))
+	    ((area->flags & VM_ALLOC) != VM_ALLOC))
 		return -EINVAL;

 	if (!numpages)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index e8e94f90d686..59bd6ce96706 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -88,6 +88,13 @@ struct vmap_area {
 	unsigned long flags; /* mark type of vm_map_ram area */
 };

+#ifndef arch_wants_vmalloc_huge_always
+static inline bool arch_wants_vmalloc_huge_always(void)
+{
+	return false;
+}
+#endif
+
 /* archs that select HAVE_ARCH_HUGE_VMAP should override one or more of these */
 #ifndef arch_vmap_p4d_supported
 static inline bool arch_vmap_p4d_supported(pgprot_t prot)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 389225a6f7ef..88004e803adc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4011,7 +4011,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
 		return NULL;
 	}

-	if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
+	if (vmap_allow_huge && ((arch_wants_vmalloc_huge_always()) || (vm_flags & VM_ALLOW_HUGE_VMAP))) {
 		/*
 		 * Try huge pages. Only try for PAGE_KERNEL allocations,
 		 * others like modules don't yet expect huge pages in
@@ -4025,6 +4025,9 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
 			shift = arch_vmap_pte_supported_shift(size);

 		align = max(original_align, 1UL << shift);
+
+		/* If arch wants huge by default, set flag unconditionally */
+		vm_flags |= VM_ALLOW_HUGE_VMAP;
 	}

 again:
-- 
2.30.2

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RESEND RFC PATCH 0/2] Enable vmalloc huge mappings by default on arm64
  2025-12-12  4:26 [RESEND RFC PATCH 0/2] Enable vmalloc huge mappings by default on arm64 Dev Jain
  2025-12-12  4:27 ` [RESEND RFC PATCH 1/2] mm/vmalloc: Do not align size to huge size Dev Jain
  2025-12-12  4:27 ` [RESEND RFC PATCH 2/2] arm64/mm: Enable huge-vmalloc by default Dev Jain
@ 2026-01-12 10:49 ` Uladzislau Rezki
  2 siblings, 0 replies; 6+ messages in thread
From: Uladzislau Rezki @ 2026-01-12 10:49 UTC (permalink / raw)
  To: Dev Jain
  Cc: catalin.marinas, will, urezki, akpm, tytso, adilger.kernel, cem,
	ryan.roberts, anshuman.khandual, shijie, yang, linux-arm-kernel,
	linux-kernel, linux-mm, npiggin, willy, david, ziy

On Fri, Dec 12, 2025 at 09:56:59AM +0530, Dev Jain wrote:
> In the quest for reducing TLB pressure via block mappings, enable huge
> vmalloc by default on arm64 for BBML2-noabort systems which support kernel
> live mapping split.
> 
> This series is an RFC, because I cannot get a performance improvement for
> the usual benchmarks which we have. Currently, vmalloc follows an opt-in
> approach for block mappings - the users calling vmalloc_huge() are the ones
> which expect the most advantage from block mappings. Most users of
> vmalloc(), kvmalloc() and kvzalloc() map a single page. After applying this
> series, it is expected that a considerable number of users will produce
> cont mappings, and probably none will produce PMD mappings.
> 
> I am asking for help from the community in testing - I believe that one of
> the testing methods is xfstests: a lot of code uses the APIs mentioned
> above. I am hoping that someone can jump in and run at least xfstests, and
> probably some other tests which can take advantage of the reduced TLB
> pressure from vmalloc cont mappings.
> 
I checked how often vmalloc/vmap is triggered when i run xfstests. I think
it also depends on env. and can be different from one setup to another.

"echo vmalloc:alloc_vmap_area > set_event"

urezki@milan:~/data/optane/xfs-test/xfstests.git$ wc -l ./vmalloc_traces/*.trace
    2875 ./vmalloc_traces/generic_036.trace
   30117 ./vmalloc_traces/generic_038.trace
    8481 ./vmalloc_traces/generic_051.trace
   16986 ./vmalloc_traces/generic_055.trace
    6079 ./vmalloc_traces/generic_068.trace
    2792 ./vmalloc_traces/generic_070.trace
   26945 ./vmalloc_traces/generic_072.trace
    2772 ./vmalloc_traces/generic_076.trace
    2750 ./vmalloc_traces/generic_083.trace
    3319 ./vmalloc_traces/generic_095.trace
    2855 ./vmalloc_traces/generic_232.trace
    3537 ./vmalloc_traces/generic_269.trace
   21265 ./vmalloc_traces/generic_299.trace
    3231 ./vmalloc_traces/generic_300.trace
    3050 ./vmalloc_traces/generic_323.trace
    2831 ./vmalloc_traces/generic_390.trace
    4296 ./vmalloc_traces/generic_461.trace
    4807 ./vmalloc_traces/generic_476.trace
    3198 ./vmalloc_traces/generic_551.trace
    3096 ./vmalloc_traces/generic_616.trace
    6495 ./vmalloc_traces/generic_627.trace
   11232 ./vmalloc_traces/generic_642.trace
   11706 ./vmalloc_traces/generic_650.trace
    3135 ./vmalloc_traces/generic_750.trace
    5926 ./vmalloc_traces/generic_751.trace
   77623 ./vmalloc_traces/xfs_013.trace
    9172 ./vmalloc_traces/xfs_017.trace
    4145 ./vmalloc_traces/xfs_068.trace
    2982 ./vmalloc_traces/xfs_104.trace
    7293 ./vmalloc_traces/xfs_167.trace
   18851 ./vmalloc_traces/xfs_168.trace
    4373 ./vmalloc_traces/xfs_442.trace
    3550 ./vmalloc_traces/xfs_609.trace
  321765 total
urezki@milan:~/data/optane/xfs-test/xfstests.git$

Time execution is different for each test. For example "xfs_013" test
takes around 200 seconds on my system and is in top of number of calls:

77623 / 200 = 388.115 calls/sec
200 / 77623 = 0.002576 = ~each 2.5ms

Please note, i have not checked impact of your patch on time execution
or how TLB pressure is affected.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-01-12 10:58 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-12  4:26 [RESEND RFC PATCH 0/2] Enable vmalloc huge mappings by default on arm64 Dev Jain
2025-12-12  4:27 ` [RESEND RFC PATCH 1/2] mm/vmalloc: Do not align size to huge size Dev Jain
2025-12-22 11:47   ` Uladzislau Rezki
2025-12-24  5:05     ` Dev Jain
2025-12-12  4:27 ` [RESEND RFC PATCH 2/2] arm64/mm: Enable huge-vmalloc by default Dev Jain
2026-01-12 10:49 ` [RESEND RFC PATCH 0/2] Enable vmalloc huge mappings by default on arm64 Uladzislau Rezki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox