linux-mm.kvack.org archive mirror
* [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
@ 2026-02-13 15:08 Luke Yang
  2026-02-13 15:47 ` David Hildenbrand (Arm)
  2026-02-18 13:29 ` David Hildenbrand (Arm)
  0 siblings, 2 replies; 23+ messages in thread
From: Luke Yang @ 2026-02-13 15:08 UTC (permalink / raw)
  To: dev.jain
  Cc: jhladky, akpm, Liam.Howlett, willy, surenb, vbabka, linux-mm,
	linux-kernel

Hello,

we have bisected a significant mprotect() performance regression in
6.17-rc1 to:

cac1db8c3aad ("mm: optimize mprotect() by PTE batching")

The regression becomes clearly visible at region sizes of roughly 400 KiB
and above, and it is still present in the latest 6.19 kernel.

## Test description

The reproducer repeatedly toggles protection (PROT_NONE <->
PROT_READ|PROT_WRITE) over a single mapped region in a tight loop. All
pages change protection in each iteration.
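
In essence, the hot path is just the following (simplified from the full
reproducer at the end of this mail, with region_count = 1 as used by the
sweep script):

    for (j = 0; j < iterations; j++) {
        int prot = (j % 2 == 0) ? PROT_NONE : (PROT_READ | PROT_WRITE);

        if (mprotect((void *)regions[0], region_size, prot) != 0)
            perror("mprotect");
    }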

The benchmark sweeps region sizes from 4 KiB up to 40 GiB.

We bisected between 6.16 and 6.17-rc1 and confirmed that reverting
cac1db8c3aad on top of 6.17-rc1 largely restores the 6.16 performance
characteristics.

## perf observations

In 6.17-rc1, commit_anon_folio_batch() becomes hot and accounts for a
significant portion of cycles inside change_pte_range(). Instruction
count in change_pte_range() increases noticeably in 6.17-rc1.
commit_anon_folio_batch() was added as part of cac1db8c3aad.

The regression is also present on the following machines: AMD EPYC 2 (Rome),
AMD EPYC 3 (Milan), AMD EPYC 3 (Milan-X), AMD EPYC 4 (Zen 4c Bergamo), Ampere
Mt. Snow Altra with KVM virtualization (Arm Neoverse-N1), and a Lenovo
ThinkPad T460p (Intel Skylake 6820HQ).

## Results (nsec per mprotect call), collected on an AMD EPYC Zen 3 (Milan) server

v6.16
size_kib | nsec_per_call
4        | 1713
40       | 2071
400      | 3453
4000     | 18804
40000    | 172613
400000   | 1699301
4000000  | 17021882
40000000 | 169677478

v6.17-rc1
size_kib | nsec_per_call
4        | 1775
40       | 2362
400      | 5993
4000     | 44116
40000    | 427731
400000   | 4252714
4000000  | 42512805
40000000 | 424995500

v6.17-rc1 with cac1db8c3aad reverted
size_kib | nsec_per_call
4        | 1750
40       | 2126
400      | 3800
4000     | 22227
40000    | 205446
400000   | 2011634
4000000  | 20144468
40000000 | 200764472

This workload appears to be the worst case for the new batching logic:
batching overhead dominates and no amortization benefit is achieved. At the
400000 KiB size, for example, the per-call cost grows from ~1.70 ms to
~4.25 ms, a roughly 2.5x slowdown.

We include the following minimal reproducers below:

* mprot_tw4m_regsize_sweep_one_region.sh
* mprot_tw4m_regsize.c

Please let us know if additional data would be useful.

Reported-by: Luke Yang <luyang@redhat.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>

Thank you
Luke

Reproducer
----------


mprot_tw4m_regsize_sweep_one_region.sh
--- cut here ---
#!/bin/bash
gcc -Wall -Wextra -O1 -o mprot_tw4m_regsize mprot_tw4m_regsize.c
if ! [ -x "./mprot_tw4m_regsize" ]; then
 echo "No ./mprot_tw4m_regsize binary, compilation failed?"
 exit 1
fi

DIR="$(date '+%Y-%b-%d_%Hh%Mm%Ss')_$(uname -r)"
mkdir -p "$DIR"

# Sweep region size from 4K to 4G (10x each step), 1 region.
# Iterations decrease by 10x to keep runtime roughly constant.
#   size_kib   iterations
runs=(
   "4          40000000"
   "40         4000000"
   "400        400000"
   "4000       40000"
   "40000      4000"
   "400000     400"
   "4000000    40"
   "40000000   4"
)

for entry in "${runs[@]}"; do
   read -r size_kib iters <<< "$entry"
   logfile="$DIR/regsize_${size_kib}k.log"
   echo "=== Region size: ${size_kib} KiB, iterations: ${iters} ==="
   sync; sync
   echo 3 > /proc/sys/vm/drop_caches
   taskset -c 0 ./mprot_tw4m_regsize "$size_kib" 1 "$iters" 2>&1 | tee "$logfile"
   echo ""
done

# Create CSV summary from log files
csv="$DIR/summary.csv"
echo "size_kib,runtime_sec,nsec_per_call" > "$csv"
for entry in "${runs[@]}"; do
   read -r size_kib _ <<< "$entry"
   logfile="$DIR/regsize_${size_kib}k.log"
   runtime=$(grep -oP 'Runtime: \K[0-9.]+' "$logfile")
   nsec=$(grep -oP 'Avg: \K[0-9.]+(?= nsec/call)' "$logfile")
   echo "${size_kib},${runtime},${nsec}" >> "$csv"
done

echo "Results saved in $DIR/"
echo "CSV summary:"
cat "$csv"
--- cut here ---

mprot_tw4m_regsize.c
--- cut here ---
/*
* Reproduce libmicro mprot_tw4m benchmark - Time mprotect() with configurable region size
* gcc -Wall -Wextra -O1 mprot_tw4m_regsize.c -o mprot_tw4m_regsize
* DEBUG: gcc -Wall -Wextra -g -fsanitize=undefined -O1 mprot_tw4m_regsize.c -o mprot_tw4m_regsize
* ./mprot_tw4m_regsize <region_size_kib> <region_count> <iterations>
*/

#include <sys/mman.h>
#include <sys/types.h>  /* u_int64_t / u_int32_t */
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <strings.h>
#include <time.h>

typedef volatile char vchar_t;

static __inline__ u_int64_t start_clock();
static __inline__ u_int64_t stop_clock();

int main(int argc, char **argv)
{
   int i, j, ret;
   long long k;

   if (argc < 4) {
       printf("USAGE: %s region_size_kib region_count iterations\n", argv[0]);
       printf("Creates multiple regions and times mprotect() calls\n");
       return 1;
   }

   long region_size = atol(argv[1]) * 1024L;
   int region_count = atoi(argv[2]);
   int iterations = atoi(argv[3]);

   int pagesize = sysconf(_SC_PAGESIZE);

   vchar_t **regions = malloc(region_count * sizeof(vchar_t*));
   if (!regions) {
       perror("malloc");
       return 1;
   }

   for (i = 0; i < region_count; i++) {
       regions[i] = (vchar_t *) mmap(NULL, region_size,
                     PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0L);

       if (regions[i] == MAP_FAILED) {
           perror("mmap");
           exit(1);
       }

       for (k = 0; k < region_size; k += pagesize) {
           regions[i][k] = 1;
       }
   }

   printf("Created %d regions of %ldKiB each. Starting %d mprotect operations per region...\n",
          region_count, region_size / 1024, iterations);

   struct timespec start_time, end_time;
   clock_gettime(CLOCK_MONOTONIC, &start_time);
   u_int64_t start_rdtsc = start_clock();

   for (j = 0; j < iterations; j++) {
       for (i = 0; i < region_count; i++) {
           int prot;

           if ((i + j) % 2 == 0) {
               prot = PROT_NONE;
           } else {
               prot = PROT_READ | PROT_WRITE;
           }

           ret = mprotect((void *)regions[i], region_size, prot);
           if (ret != 0) {
               perror("mprotect");
               printf("mprotect error at region %d, iteration %d\n", i, j);
           }
       }
   }

   u_int64_t stop_rdtsc = stop_clock();
   clock_gettime(CLOCK_MONOTONIC, &end_time);
   u_int64_t diff = stop_rdtsc - start_rdtsc;

   long total_calls = (long)region_count * iterations;
   double runtime_sec = (end_time.tv_sec - start_time.tv_sec) +
                       (end_time.tv_nsec - start_time.tv_nsec) / 1000000000.0;

   double nsec_per_call = (runtime_sec * 1e9) / total_calls;

   printf("TSC for %ld mprotect calls on %d x %ldKiB regions: %ld K-cycles.  Avg: %g K-cycles/call\n",
          total_calls,
          region_count,
          region_size / 1024,
          diff/1000,
          ((double)(diff)/(double)(total_calls))/1000.0);
   printf("Runtime: %.6f seconds.  Avg: %.3f nsec/call\n", runtime_sec, nsec_per_call);

   for (i = 0; i < region_count; i++) {
       munmap((void *)regions[i], region_size);
   }
   free(regions);

   return 0;
}

static __inline__ u_int64_t start_clock() {
   // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
   u_int32_t hi, lo;
   __asm__ __volatile__ (
       "CPUID\n\t"
       "RDTSC\n\t"
       "mov %%edx, %0\n\t"
       "mov %%eax, %1\n\t": "=r" (hi), "=r" (lo)::
       "%rax", "%rbx", "%rcx", "%rdx");
   return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
}

static __inline__ u_int64_t stop_clock() {
   // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
   u_int32_t hi, lo;
   __asm__ __volatile__(
       "RDTSCP\n\t"
       "mov %%edx, %0\n\t"
       "mov %%eax, %1\n\t"
       "CPUID\n\t": "=r" (hi), "=r" (lo)::
       "%rax", "%rbx", "%rcx", "%rdx");
   return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
}
--- cut here ---



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-13 15:08 [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad) Luke Yang
@ 2026-02-13 15:47 ` David Hildenbrand (Arm)
  2026-02-13 16:24   ` Pedro Falcato
  2026-02-18 13:29 ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 23+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-13 15:47 UTC (permalink / raw)
  To: Luke Yang, dev.jain
  Cc: jhladky, akpm, Liam.Howlett, willy, surenb, vbabka, linux-mm,
	linux-kernel

On 2/13/26 16:08, Luke Yang wrote:
> Hello,

Hi!

> 
> we have bisected a significant mprotect() performance regression in
> 6.17-rc1 to:
> 
> cac1db8c3aad ("mm: optimize mprotect() by PTE batching")
> 
> The regression becomes clearly visible starting around 400 KiB region
> sizes and above. It is also still present in the latest 6.19 kernel.
> 

Micro-benchmark results are nice. But what is the real world impact? IOW,
why should we care?

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-13 15:47 ` David Hildenbrand (Arm)
@ 2026-02-13 16:24   ` Pedro Falcato
  2026-02-13 17:16     ` Suren Baghdasaryan
  0 siblings, 1 reply; 23+ messages in thread
From: Pedro Falcato @ 2026-02-13 16:24 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Luke Yang, dev.jain, jhladky, akpm, Liam.Howlett, willy, surenb,
	vbabka, linux-mm, linux-kernel

On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
> On 2/13/26 16:08, Luke Yang wrote:
> > Hello,
> 
> Hi!
> 
> > 
> > we have bisected a significant mprotect() performance regression in
> > 6.17-rc1 to:
> > 
> > cac1db8c3aad ("mm: optimize mprotect() by PTE batching")
> > 
> > The regression becomes clearly visible starting around 400 KiB region
> > sizes and above. It is also still present in the latest 6.19 kernel.
> > 
> 
> Micro-benchmark results are nice. But what is the real word impact? IOW, why
> should we care?

Well, mprotect is widely used in thread spawning, code JITting,
and even process startup. And we don't want to pay for a feature we can't
even use (on x86).

In any case, I think I see the problem. Namely, that we now need to call
vm_normal_folio() for every single PTE (this seems similar to the mremap
problem caught in 0b5be138ce00f421bd7cc5a226061bd62c4ab850). I'll try to
draft up a patch over the weekend if I can.

-- 
Pedro


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-13 16:24   ` Pedro Falcato
@ 2026-02-13 17:16     ` Suren Baghdasaryan
  2026-02-13 17:26       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-02-13 17:16 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: David Hildenbrand (Arm),
	Luke Yang, dev.jain, jhladky, akpm, Liam.Howlett, willy, vbabka,
	linux-mm, linux-kernel

On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato <pfalcato@suse.de> wrote:
>
> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
> > On 2/13/26 16:08, Luke Yang wrote:
> > > Hello,
> >
> > Hi!
> >
> > >
> > > we have bisected a significant mprotect() performance regression in
> > > 6.17-rc1 to:
> > >
> > > cac1db8c3aad ("mm: optimize mprotect() by PTE batching")
> > >
> > > The regression becomes clearly visible starting around 400 KiB region
> > > sizes and above. It is also still present in the latest 6.19 kernel.
> > >
> >
> > Micro-benchmark results are nice. But what is the real word impact? IOW, why
> > should we care?
>
> Well, mprotect is widely used in thread spawning, code JITting,
> and even process startup. And we don't want to pay for a feature we can't
> even use (on x86).

I agree. When I straced Android's zygote a while ago, mprotect() came
up #30 in the list of most frequently used syscalls and one of the
most used mm-related syscalls due to its use during process creation.
However, I don't know how often it's used on VMAs of size >=400KiB.

>
> In any case, I think I see the problem. Namely, that we now need to call
> vm_normal_folio() for every single PTE (this seems similar to the mremap
> problem caught in 0b5be138ce00f421bd7cc5a226061bd62c4ab850). I'll try to
> draft up a patch over the weekend if I can.
>
> --
> Pedro


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-13 17:16     ` Suren Baghdasaryan
@ 2026-02-13 17:26       ` David Hildenbrand (Arm)
  2026-02-16 10:12         ` Dev Jain
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-13 17:26 UTC (permalink / raw)
  To: Suren Baghdasaryan, Pedro Falcato
  Cc: Luke Yang, dev.jain, jhladky, akpm, Liam.Howlett, willy, vbabka,
	linux-mm, linux-kernel

On 2/13/26 18:16, Suren Baghdasaryan wrote:
> On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato <pfalcato@suse.de> wrote:
>>
>> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
>>>
>>> Hi!
>>>
>>>
>>> Micro-benchmark results are nice. But what is the real word impact? IOW, why
>>> should we care?
>>
>> Well, mprotect is widely used in thread spawning, code JITting,
>> and even process startup. And we don't want to pay for a feature we can't
>> even use (on x86).
> 
> I agree. When I straced Android's zygote a while ago, mprotect() came
> up #30 in the list of most frequently used syscalls and one of the
> most used mm-related syscalls due to its use during process creation.
> However, I don't know how often it's used on VMAs of size >=400KiB.

See my point? :) If this is apparently so widespread then finding a real 
reproducer is likely not a problem. Otherwise it's just speculation.

It would also be interesting to know whether the reproducer ran with any 
sort of mTHP enabled or not.
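
Something as simple as dumping the relevant sysfs knobs next to the results
would already help. A minimal sketch (the hugepages-64kB entry is an
assumption about which mTHP sizes exist on your setup):

    #include <stdio.h>

    static void dump(const char *path)
    {
        char buf[128];
        FILE *f = fopen(path, "r");

        if (!f)
            return;
        if (fgets(buf, sizeof(buf), f))
            printf("%s: %s", path, buf);
        fclose(f);
    }

    int main(void)
    {
        dump("/sys/kernel/mm/transparent_hugepage/enabled");
        dump("/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled");
        dump("/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled");
        return 0;
    }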

> 
>>
>> In any case, I think I see the problem. Namely, that we now need to call
>> vm_normal_folio() for every single PTE (this seems similar to the mremap
>> problem caught in 0b5be138ce00f421bd7cc5a226061bd62c4ab850). I'll try to
>> draft up a patch over the weekend if I can.

I think we excessively discussed that during review and fixups of the 
commit in question. You might want to dig through that because I could 
have sworn we might already have discussed how to optimize this.

When going from none -> writable we always did a vm_normal_folio() for
anonymous folios; for the other direction we did not.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-13 17:26       ` David Hildenbrand (Arm)
@ 2026-02-16 10:12         ` Dev Jain
  2026-02-16 14:56           ` Pedro Falcato
  2026-02-17 17:43           ` Luke Yang
  0 siblings, 2 replies; 23+ messages in thread
From: Dev Jain @ 2026-02-16 10:12 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Suren Baghdasaryan, Pedro Falcato
  Cc: Luke Yang, jhladky, akpm, Liam.Howlett, willy, vbabka, linux-mm,
	linux-kernel


On 13/02/26 10:56 pm, David Hildenbrand (Arm) wrote:
> On 2/13/26 18:16, Suren Baghdasaryan wrote:
>> On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato <pfalcato@suse.de> wrote:
>>>
>>> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
>>>>
>>>> Hi!
>>>>
>>>>
>>>> Micro-benchmark results are nice. But what is the real word impact?
>>>> IOW, why
>>>> should we care?
>>>
>>> Well, mprotect is widely used in thread spawning, code JITting,
>>> and even process startup. And we don't want to pay for a feature we can't
>>> even use (on x86).
>>
>> I agree. When I straced Android's zygote a while ago, mprotect() came
>> up #30 in the list of most frequently used syscalls and one of the
>> most used mm-related syscalls due to its use during process creation.
>> However, I don't know how often it's used on VMAs of size >=400KiB.
>
> See my point? :) If this is apparently so widespread then finding a real
> reproducer is likely not a problem. Otherwise it's just speculation.
>
> It would also be interesting to know whether the reproducer ran with any
> sort of mTHP enabled or not. 

Yes. Luke, can you experiment with the following microbenchmark:

https://pastebin.com/3hNtYirT

and see if there is an optimization for pte-mapped 2M folios, before and
after the commit?

(set transparent_hugepage/enabled=always and hugepages-2048kB/enabled=always)
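
(For reference, this is roughly how such a test can set up a pte-mapped 2M
folio -- a sketch of what I have in mind, not the pastebin code, and it
assumes a 4K base page size: fault in a PMD-mapped THP, then split only the
PMD mapping, e.g. by mprotect()ing a single page, which should leave the 2M
folio intact but PTE-mapped.)

    #include <sys/mman.h>
    #include <stdint.h>
    #include <string.h>

    #define SZ_2M (2UL * 1024 * 1024)

    static char *map_pte_mapped_thp(void)
    {
        /* Over-allocate so we can pick a 2M-aligned start. */
        char *raw = mmap(NULL, 2 * SZ_2M, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *p;

        if (raw == MAP_FAILED)
            return NULL;
        p = (char *)(((uintptr_t)raw + SZ_2M - 1) & ~(SZ_2M - 1));

        madvise(p, SZ_2M, MADV_HUGEPAGE);
        memset(p, 1, SZ_2M);        /* fault in, ideally as a 2M THP */

        /* Protect and restore a single 4K page: the PMD mapping should be
         * split into PTEs while the 2M folio itself stays intact. */
        mprotect(p, 4096, PROT_READ);
        mprotect(p, 4096, PROT_READ | PROT_WRITE);
        return p;
    }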


>
>>
>>>
>>> In any case, I think I see the problem. Namely, that we now need to call
>>> vm_normal_folio() for every single PTE (this seems similar to the mremap
>>> problem caught in 0b5be138ce00f421bd7cc5a226061bd62c4ab850). I'll try to
>>> draft up a patch over the weekend if I can.
>
> I think we excessively discussed that during review and fixups of the
> commit in question. You might want to dig through that because I could
> have sworn we might already have discussed how to optimize this. 

I have written a patch to call vm_normal_folio only when required, and use
pte_batch_hint instead of vm_normal_folio + folio_pte_batch. The results,
testing with https://pastebin.com/3hNtYirT on Apple M3:

without-thp (small 4K folio case): patched beats vanilla by 6.89% (patched
avoids vm_normal_folio overhead)

64k-thp: no diff

pte-mapped-2M thp: vanilla beats patched by 10.71% (vanilla batches over
2M, patched batches over 64K)

Interestingly, I don't see an obvious reason why the last case should have
a win. Batching over 16 ptes or 512 ptes in this code path, AFAIU, is *not*
going to batch over TLB flushes, atomic ops etc (the tlb_flush_pte_range in
prot_commit_flush_ptes is an mmu-gather extension and not a TLB flush). So,
the fact that similar operations are now getting batched should imply better
memory access locality, fewer function calls etc.


>
> When going from none -> writable we always did a vm_normal_folio() with
> anonymous folios. For the other direction not.
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-16 10:12         ` Dev Jain
@ 2026-02-16 14:56           ` Pedro Falcato
  2026-02-17 17:43           ` Luke Yang
  1 sibling, 0 replies; 23+ messages in thread
From: Pedro Falcato @ 2026-02-16 14:56 UTC (permalink / raw)
  To: Dev Jain
  Cc: David Hildenbrand (Arm),
	Suren Baghdasaryan, Luke Yang, jhladky, akpm, Liam.Howlett,
	willy, vbabka, linux-mm, linux-kernel

On Mon, Feb 16, 2026 at 03:42:08PM +0530, Dev Jain wrote:
> 
> On 13/02/26 10:56 pm, David Hildenbrand (Arm) wrote:
> > On 2/13/26 18:16, Suren Baghdasaryan wrote:
> >> On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato <pfalcato@suse.de> wrote:
> >>>
> >>> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
> >>>>
> >>>> Hi!
> >>>>
> >>>>
> >>>> Micro-benchmark results are nice. But what is the real word impact?
> >>>> IOW, why
> >>>> should we care?
> >>>
> >>> Well, mprotect is widely used in thread spawning, code JITting,
> >>> and even process startup. And we don't want to pay for a feature we can't
> >>> even use (on x86).
> >>
> >> I agree. When I straced Android's zygote a while ago, mprotect() came
> >> up #30 in the list of most frequently used syscalls and one of the
> >> most used mm-related syscalls due to its use during process creation.
> >> However, I don't know how often it's used on VMAs of size >=400KiB.
> >
> > See my point? :) If this is apparently so widespread then finding a real
> > reproducer is likely not a problem. Otherwise it's just speculation.
> >
> > It would also be interesting to know whether the reproducer ran with any
> > sort of mTHP enabled or not. 
> 
> Yes. Luke, can you experiment with the following microbenchmark:
> 
> https://pastebin.com/3hNtYirT
> 
> and see if there is an optimization for pte-mapped 2M folios, before and
> after the commit?
> 
> (set transparent_hugepages/enabled=always, hugepages-2048Kb/enabled=always)
> 
> 
> >
> >>
> >>>
> >>> In any case, I think I see the problem. Namely, that we now need to call
> >>> vm_normal_folio() for every single PTE (this seems similar to the mremap
> >>> problem caught in 0b5be138ce00f421bd7cc5a226061bd62c4ab850). I'll try to
> >>> draft up a patch over the weekend if I can.
> >
> > I think we excessively discussed that during review and fixups of the
> > commit in question. You might want to dig through that because I could
> > have sworn we might already have discussed how to optimize this. 
> 
> I have written a patch to call vm_normal_folio only when required, and use
> pte_batch_hint
> 
> instead of vm_normal_folio + folio_pte_batch. The results, testing with 
> 
> https://pastebin.com/3hNtYirT on Apple M3:
> 
> without-thp (small 4K folio case): patched beats vanilla by 6.89% (patched
> avoids vm_normal_folio overhead)
>

For what it's worth, I tried to avoid vm_normal_page() as much as possible
and realized that the code is extremely timing sensitive (perhaps due to
being in a hot loop); even a small attempt at writing something that doesn't
offend the eyes (and the soul) makes it much slower.

FWIW my benchmark was something of the sort:
size_t size = 400UL << 20;      /* 400 MiB */
char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
int i = 0;

while (do_benchmark()) {
	if (i & 1)
		mprotect(buf, size, PROT_NONE);
	else
		mprotect(buf, size, PROT_READ | PROT_WRITE);
	i++;
}

Probably worth chucking in a few "do not THP" calls, which I totally forgot
about, though it didn't seem to be relevant in my testing, somehow.
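
(By "do not THP" calls I mean something like the below -- assuming
MADV_NOHUGEPAGE on the anonymous buffer is the right knob, applied before
the first fault so that the buffer is backed by order-0 folios only:)

    #include <sys/mman.h>
    #include <string.h>

    static void *map_no_thp(size_t size)
    {
        void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
            return NULL;
        /* Must happen before the first fault, so no MAP_POPULATE here. */
        madvise(buf, size, MADV_NOHUGEPAGE);
        memset(buf, 1, size);   /* populate with small folios */
        return buf;
    }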

-- 
Pedro


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-16 10:12         ` Dev Jain
  2026-02-16 14:56           ` Pedro Falcato
@ 2026-02-17 17:43           ` Luke Yang
  2026-02-17 18:08             ` Pedro Falcato
  2026-02-18  4:50             ` Dev Jain
  1 sibling, 2 replies; 23+ messages in thread
From: Luke Yang @ 2026-02-17 17:43 UTC (permalink / raw)
  To: Dev Jain
  Cc: pfalcato, david, surenb, jhladky, akpm, Liam.Howlett, willy,
	vbabka, linux-mm, linux-kernel

On Mon, Feb 16, 2026 at 03:42:08PM +0530, Dev Jain wrote:
> 
> On 13/02/26 10:56 pm, David Hildenbrand (Arm) wrote:
> > On 2/13/26 18:16, Suren Baghdasaryan wrote:
> >> On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato <pfalcato@suse.de> wrote:
> >>>
> >>> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
> >>>>
> >>>> Hi!
> >>>>
> >>>>
> >>>> Micro-benchmark results are nice. But what is the real word impact?
> >>>> IOW, why
> >>>> should we care?
> >>>
> >>> Well, mprotect is widely used in thread spawning, code JITting,
> >>> and even process startup. And we don't want to pay for a feature we can't
> >>> even use (on x86).
> >>
> >> I agree. When I straced Android's zygote a while ago, mprotect() came
> >> up #30 in the list of most frequently used syscalls and one of the
> >> most used mm-related syscalls due to its use during process creation.
> >> However, I don't know how often it's used on VMAs of size >=400KiB.
> >
> > See my point? :) If this is apparently so widespread then finding a real
> > reproducer is likely not a problem. Otherwise it's just speculation.
> >
> > It would also be interesting to know whether the reproducer ran with any
> > sort of mTHP enabled or not. 
> 
> Yes. Luke, can you experiment with the following microbenchmark:
> 
> https://pastebin.com/3hNtYirT
> 
> and see if there is an optimization for pte-mapped 2M folios, before and
> after the commit?
> 
> (set transparent_hugepages/enabled=always, hugepages-2048Kb/enabled=always)

----------
amd-epyc2-rome

# before commit
$ uname -r
6.16.0-65.eln150.x86_64
# after commit
$ uname -r
6.17.0-0.rc1.17.eln150.x86_64

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

$ cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
[always] inherit madvise never

Before commit: Total = 6895988972
After commit: Total = 2303697782
Percentage change: -66.6%

----------
amd-epyc3-milanx

# before commit
$ uname -r
6.16.0-65.eln150.x86_64
# after commit
$ uname -r
6.17.0-0.rc1.17.eln150.x86_64

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

$ cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
[always] inherit madvise never

Before commit: Total = 4006750392
After commit: Total = 1497733191
Percentage change: -62.6%
----------

Luke



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-17 17:43           ` Luke Yang
@ 2026-02-17 18:08             ` Pedro Falcato
  2026-02-18  5:01               ` Dev Jain
  2026-02-18  4:50             ` Dev Jain
  1 sibling, 1 reply; 23+ messages in thread
From: Pedro Falcato @ 2026-02-17 18:08 UTC (permalink / raw)
  To: Luke Yang
  Cc: Dev Jain, david, surenb, jhladky, akpm, Liam.Howlett, willy,
	vbabka, linux-mm, linux-kernel

On Tue, Feb 17, 2026 at 12:43:38PM -0500, Luke Yang wrote:
> On Mon, Feb 16, 2026 at 03:42:08PM +0530, Dev Jain wrote:
> > 
> > On 13/02/26 10:56 pm, David Hildenbrand (Arm) wrote:
> > > On 2/13/26 18:16, Suren Baghdasaryan wrote:
> > >> On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato <pfalcato@suse.de> wrote:
> > >>>
> > >>> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
> > >>>>
> > >>>> Hi!
> > >>>>
> > >>>>
> > >>>> Micro-benchmark results are nice. But what is the real word impact?
> > >>>> IOW, why
> > >>>> should we care?
> > >>>
> > >>> Well, mprotect is widely used in thread spawning, code JITting,
> > >>> and even process startup. And we don't want to pay for a feature we can't
> > >>> even use (on x86).
> > >>
> > >> I agree. When I straced Android's zygote a while ago, mprotect() came
> > >> up #30 in the list of most frequently used syscalls and one of the
> > >> most used mm-related syscalls due to its use during process creation.
> > >> However, I don't know how often it's used on VMAs of size >=400KiB.
> > >
> > > See my point? :) If this is apparently so widespread then finding a real
> > > reproducer is likely not a problem. Otherwise it's just speculation.
> > >
> > > It would also be interesting to know whether the reproducer ran with any
> > > sort of mTHP enabled or not. 
> > 
> > Yes. Luke, can you experiment with the following microbenchmark:
> > 
> > https://pastebin.com/3hNtYirT
> > 
> > and see if there is an optimization for pte-mapped 2M folios, before and
> > after the commit?
> > 
> > (set transparent_hugepages/enabled=always, hugepages-2048Kb/enabled=always)
> 

Since you're testing stuff, could you please test the changes in:
https://github.com/heatd/linux/tree/mprotect-opt ?

Not posting them yet since merge window, etc. Plus I think there's some
further optimization work we can pull off.

With the benchmark in https://gist.github.com/heatd/25eb2edb601719d22bfb514bcf06a132
(compiled with g++ -O2 file.cpp -lbenchmark, needs google/benchmark) I've measured
about an 18% speedup between original vs with patches.

-- 
Pedro


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-17 17:43           ` Luke Yang
  2026-02-17 18:08             ` Pedro Falcato
@ 2026-02-18  4:50             ` Dev Jain
  1 sibling, 0 replies; 23+ messages in thread
From: Dev Jain @ 2026-02-18  4:50 UTC (permalink / raw)
  To: Luke Yang
  Cc: pfalcato, david, surenb, jhladky, akpm, Liam.Howlett, willy,
	vbabka, linux-mm, linux-kernel


On 17/02/26 11:13 pm, Luke Yang wrote:
> On Mon, Feb 16, 2026 at 03:42:08PM +0530, Dev Jain wrote:
>> On 13/02/26 10:56 pm, David Hildenbrand (Arm) wrote:
>>> On 2/13/26 18:16, Suren Baghdasaryan wrote:
>>>> On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato <pfalcato@suse.de> wrote:
>>>>> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
>>>>>> Hi!
>>>>>>
>>>>>>
>>>>>> Micro-benchmark results are nice. But what is the real word impact?
>>>>>> IOW, why
>>>>>> should we care?
>>>>> Well, mprotect is widely used in thread spawning, code JITting,
>>>>> and even process startup. And we don't want to pay for a feature we can't
>>>>> even use (on x86).
>>>> I agree. When I straced Android's zygote a while ago, mprotect() came
>>>> up #30 in the list of most frequently used syscalls and one of the
>>>> most used mm-related syscalls due to its use during process creation.
>>>> However, I don't know how often it's used on VMAs of size >=400KiB.
>>> See my point? :) If this is apparently so widespread then finding a real
>>> reproducer is likely not a problem. Otherwise it's just speculation.
>>>
>>> It would also be interesting to know whether the reproducer ran with any
>>> sort of mTHP enabled or not. 
>> Yes. Luke, can you experiment with the following microbenchmark:
>>
>> https://pastebin.com/3hNtYirT
>>
>> and see if there is an optimization for pte-mapped 2M folios, before and
>> after the commit?
>>
>> (set transparent_hugepages/enabled=always, hugepages-2048Kb/enabled=always)
> ----------
> amd-epyc2-rome
>
> # before commit
> $ uname -r
> 6.16.0-65.eln150.x86_64
> # after commit
> $ uname -r
> 6.17.0-0.rc1.17.eln150.x86_64
>
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
>
> $ cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> [always] inherit madvise never
>
> Before commit: Total = 6895988972
> After commit: Total = 2303697782
> Percentage change: -66.6%
>
> ----------
> amd-epyc3-milanx
>
> # before commit
> $ uname -r
> 6.16.0-65.eln150.x86_64
> # after commit
> $ uname -r
> 6.17.0-0.rc1.17.eln150.x86_64
>
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
>
> $ cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> [always] inherit madvise never
>
> Before commit: Total = 4006750392
> After commit: Total = 1497733191
> Percentage change: -62.6%
> ----------

Thanks. So after all, batching improves stuff :)

>
> Luke
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-17 18:08             ` Pedro Falcato
@ 2026-02-18  5:01               ` Dev Jain
  2026-02-18 10:06                 ` Pedro Falcato
  0 siblings, 1 reply; 23+ messages in thread
From: Dev Jain @ 2026-02-18  5:01 UTC (permalink / raw)
  To: Pedro Falcato, Luke Yang
  Cc: david, surenb, jhladky, akpm, Liam.Howlett, willy, vbabka,
	linux-mm, linux-kernel


On 17/02/26 11:38 pm, Pedro Falcato wrote:
> On Tue, Feb 17, 2026 at 12:43:38PM -0500, Luke Yang wrote:
>> On Mon, Feb 16, 2026 at 03:42:08PM +0530, Dev Jain wrote:
>>> On 13/02/26 10:56 pm, David Hildenbrand (Arm) wrote:
>>>> On 2/13/26 18:16, Suren Baghdasaryan wrote:
>>>>> On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato <pfalcato@suse.de> wrote:
>>>>>> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>>
>>>>>>> Micro-benchmark results are nice. But what is the real word impact?
>>>>>>> IOW, why
>>>>>>> should we care?
>>>>>> Well, mprotect is widely used in thread spawning, code JITting,
>>>>>> and even process startup. And we don't want to pay for a feature we can't
>>>>>> even use (on x86).
>>>>> I agree. When I straced Android's zygote a while ago, mprotect() came
>>>>> up #30 in the list of most frequently used syscalls and one of the
>>>>> most used mm-related syscalls due to its use during process creation.
>>>>> However, I don't know how often it's used on VMAs of size >=400KiB.
>>>> See my point? :) If this is apparently so widespread then finding a real
>>>> reproducer is likely not a problem. Otherwise it's just speculation.
>>>>
>>>> It would also be interesting to know whether the reproducer ran with any
>>>> sort of mTHP enabled or not. 
>>> Yes. Luke, can you experiment with the following microbenchmark:
>>>
>>> https://pastebin.com/3hNtYirT
>>>
>>> and see if there is an optimization for pte-mapped 2M folios, before and
>>> after the commit?
>>>
>>> (set transparent_hugepages/enabled=always, hugepages-2048Kb/enabled=always)
> Since you're testing stuff, could you please test the changes in:
> https://github.com/heatd/linux/tree/mprotect-opt ?
>
> Not posting them yet since merge window, etc. Plus I think there's some
> further optimization work we can pull off.
>
> With the benchmark in https://gist.github.com/heatd/25eb2edb601719d22bfb514bcf06a132
> (compiled with g++ -O2 file.cpp -lbenchmark, needs google/benchmark) I've measured
> about an 18% speedup between original vs with patches.

Thanks for working on this. Some comments -

1. Rejecting batching with pte_batch_hint() means that we also don't batch 16K and 32K large
folios on arm64, since the cont bit is only set starting at 64K. Not sure how important this is.

2. Did you measure whether there is an optimization due to just the first commit ("prefetch the next pte")?
I actually had prefetch in mind - is it possible to do some kind of prefetch(pfn_to_page(pte_pfn(pte)))
to optimize the call to vm_normal_folio()?

>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-18  5:01               ` Dev Jain
@ 2026-02-18 10:06                 ` Pedro Falcato
  2026-02-18 10:38                   ` Dev Jain
  0 siblings, 1 reply; 23+ messages in thread
From: Pedro Falcato @ 2026-02-18 10:06 UTC (permalink / raw)
  To: Dev Jain
  Cc: Luke Yang, david, surenb, jhladky, akpm, Liam.Howlett, willy,
	vbabka, linux-mm, linux-kernel

On Wed, Feb 18, 2026 at 10:31:19AM +0530, Dev Jain wrote:
> 
> On 17/02/26 11:38 pm, Pedro Falcato wrote:
> > On Tue, Feb 17, 2026 at 12:43:38PM -0500, Luke Yang wrote:
> >> On Mon, Feb 16, 2026 at 03:42:08PM +0530, Dev Jain wrote:
> >>> On 13/02/26 10:56 pm, David Hildenbrand (Arm) wrote:
> >>>> On 2/13/26 18:16, Suren Baghdasaryan wrote:
> >>>>> On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato <pfalcato@suse.de> wrote:
> >>>>>> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
> >>>>>>> Hi!
> >>>>>>>
> >>>>>>>
> >>>>>>> Micro-benchmark results are nice. But what is the real word impact?
> >>>>>>> IOW, why
> >>>>>>> should we care?
> >>>>>> Well, mprotect is widely used in thread spawning, code JITting,
> >>>>>> and even process startup. And we don't want to pay for a feature we can't
> >>>>>> even use (on x86).
> >>>>> I agree. When I straced Android's zygote a while ago, mprotect() came
> >>>>> up #30 in the list of most frequently used syscalls and one of the
> >>>>> most used mm-related syscalls due to its use during process creation.
> >>>>> However, I don't know how often it's used on VMAs of size >=400KiB.
> >>>> See my point? :) If this is apparently so widespread then finding a real
> >>>> reproducer is likely not a problem. Otherwise it's just speculation.
> >>>>
> >>>> It would also be interesting to know whether the reproducer ran with any
> >>>> sort of mTHP enabled or not. 
> >>> Yes. Luke, can you experiment with the following microbenchmark:
> >>>
> >>> https://pastebin.com/3hNtYirT
> >>>
> >>> and see if there is an optimization for pte-mapped 2M folios, before and
> >>> after the commit?
> >>>
> >>> (set transparent_hugepages/enabled=always, hugepages-2048Kb/enabled=always)
> > Since you're testing stuff, could you please test the changes in:
> > https://github.com/heatd/linux/tree/mprotect-opt ?
> >
> > Not posting them yet since merge window, etc. Plus I think there's some
> > further optimization work we can pull off.
> >
> > With the benchmark in https://gist.github.com/heatd/25eb2edb601719d22bfb514bcf06a132
> > (compiled with g++ -O2 file.cpp -lbenchmark, needs google/benchmark) I've measured
> > about an 18% speedup between original vs with patches.
> 
> Thanks for working on this. Some comments -
> 
> 1. Rejecting batching with pte_batch_hint() means that we also don't batch 16K and 32K large
> folios on arm64, since the cont bit is on starting only at 64K. Not sure how imp this is.

I don't understand what you mean. Is ARM64 doing large folio optimization,
even when there's no special MMU support for it (the aforementioned 16K and
32K cases)? If so, perhaps it's time for a ARCH_SUPPORTS_PTE_BATCHING flag.
Though if you could provide numbers in that case it would be much appreciated.

> 2. Did you measure if there is an optimization due to just the first commit ("prefetch the next pte")?

Yes, I could measure a sizeable improvement (perhaps some 5%). I tested on
zen5 (which is a pretty beefy uarch) and the loop is so full of ~~crap~~
features that the prefetcher seems to be doing a poor job, at least per my
results.

> I actually had prefetch in mind - is it possible to do some kind of prefetch(pfn_to_page(pte_pfn(pte)))
> to optimize the call to vm_normal_folio()?

Certainly possible, but I suspect it doesn't make too much sense. You want to
avoid bringing in the cacheline if possible. In the pte's case, I know we're
probably going to look at it and modify it, and if I'm wrong it's just one
cacheline we misprefetched (though I had some parallel convos and it might
be that we need a branch there to avoid prefetching out of the PTE table).
We would like to avoid bringing in the folio cacheline at all, even if we
don't stall through some fancy prefetching or sheer CPU magic.

-- 
Pedro


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-18 10:06                 ` Pedro Falcato
@ 2026-02-18 10:38                   ` Dev Jain
  2026-02-18 10:46                     ` David Hildenbrand (Arm)
  2026-02-18 11:52                     ` Pedro Falcato
  0 siblings, 2 replies; 23+ messages in thread
From: Dev Jain @ 2026-02-18 10:38 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Luke Yang, david, surenb, jhladky, akpm, Liam.Howlett, willy,
	vbabka, linux-mm, linux-kernel


On 18/02/26 3:36 pm, Pedro Falcato wrote:
> On Wed, Feb 18, 2026 at 10:31:19AM +0530, Dev Jain wrote:
>> On 17/02/26 11:38 pm, Pedro Falcato wrote:
>>> On Tue, Feb 17, 2026 at 12:43:38PM -0500, Luke Yang wrote:
>>>> On Mon, Feb 16, 2026 at 03:42:08PM +0530, Dev Jain wrote:
>>>>> On 13/02/26 10:56 pm, David Hildenbrand (Arm) wrote:
>>>>>> On 2/13/26 18:16, Suren Baghdasaryan wrote:
>>>>>>> On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato <pfalcato@suse.de> wrote:
>>>>>>>> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Micro-benchmark results are nice. But what is the real word impact?
>>>>>>>>> IOW, why
>>>>>>>>> should we care?
>>>>>>>> Well, mprotect is widely used in thread spawning, code JITting,
>>>>>>>> and even process startup. And we don't want to pay for a feature we can't
>>>>>>>> even use (on x86).
>>>>>>> I agree. When I straced Android's zygote a while ago, mprotect() came
>>>>>>> up #30 in the list of most frequently used syscalls and one of the
>>>>>>> most used mm-related syscalls due to its use during process creation.
>>>>>>> However, I don't know how often it's used on VMAs of size >=400KiB.
>>>>>> See my point? :) If this is apparently so widespread then finding a real
>>>>>> reproducer is likely not a problem. Otherwise it's just speculation.
>>>>>>
>>>>>> It would also be interesting to know whether the reproducer ran with any
>>>>>> sort of mTHP enabled or not. 
>>>>> Yes. Luke, can you experiment with the following microbenchmark:
>>>>>
>>>>> https://pastebin.com/3hNtYirT
>>>>>
>>>>> and see if there is an optimization for pte-mapped 2M folios, before and
>>>>> after the commit?
>>>>>
>>>>> (set transparent_hugepages/enabled=always, hugepages-2048Kb/enabled=always)
>>> Since you're testing stuff, could you please test the changes in:
>>> https://github.com/heatd/linux/tree/mprotect-opt ?
>>>
>>> Not posting them yet since merge window, etc. Plus I think there's some
>>> further optimization work we can pull off.
>>>
>>> With the benchmark in https://gist.github.com/heatd/25eb2edb601719d22bfb514bcf06a132
>>> (compiled with g++ -O2 file.cpp -lbenchmark, needs google/benchmark) I've measured
>>> about an 18% speedup between original vs with patches.
>> Thanks for working on this. Some comments -
>>
>> 1. Rejecting batching with pte_batch_hint() means that we also don't batch 16K and 32K large
>> folios on arm64, since the cont bit is on starting only at 64K. Not sure how imp this is.
> I don't understand what you mean. Is ARM64 doing large folio optimization,
> even when there's no special MMU support for it (the aforementioned 16K and
> 32K cases)? If so, perhaps it's time for a ARCH_SUPPORTS_PTE_BATCHING flag.
> Though if you could provide numbers in that case it would be much appreciated.

There are two things at play here:

1. All arches are expected to benefit from pte batching on large folios, because
similar operations are done together in one shot. For code paths other than
mprotect and mremap, that benefit is far clearer due to:

a) batching across atomic operations etc. For example, see copy_present_ptes -> folio_ref_add.
   Instead of bumping the reference by 1 nr times, we bump it by nr in one shot.

b) vm_normal_folio was already being invoked, so all in all the only new overhead
   we introduce is folio_pte_batch(_flags). In fact, since we already have the
   folio, I recall that we even special-case the large folio path separately from
   the small folio path, so 4K folio processing has no overhead.

2. Due to the requirements of contpte, ptep_get() on arm64 needs to fetch a/d bits
across a cont block. Thus, for each ptep_get, it does 16 pte accesses. To avoid this,
it becomes critical to batch on arm64.


>
>> 2. Did you measure if there is an optimization due to just the first commit ("prefetch the next pte")?
> Yes, I could measure a sizeable improvement (perhaps some 5%). I tested on
> zen5 (which is a pretty beefy uarch) and the loop is so full of ~~crap~~
> features that the prefetcher seems to be doing a poor job, at least per my
> results.

Nice.

>
>> I actually had prefetch in mind - is it possible to do some kind of prefetch(pfn_to_page(pte_pfn(pte)))
>> to optimize the call to vm_normal_folio()?
> Certainly possible, but I suspect it doesn't make too much sense. You want to
> avoid bringing in the cacheline if possible. In the pte's case, I know we're
> probably going to look at it and modify it, and if I'm wrong it's just one
> cacheline we misprefetched (though I had some parallel convos and it might
> be that we need a branch there to avoid prefetching out of the PTE table).
> We would like to avoid bringing in the folio cacheline at all, even if we
> don't stall through some fancy prefetching or sheer CPU magic.

I dunno, need other opinions.

The question then becomes: should we prefer performance on 4K folios or on
large folios? As Luke reports in the other email, the benefit on pte-mapped
THP was staggering.

I believe that if the sysadmin is enabling CONFIG_TRANSPARENT_HUGEPAGE, they
know that the kernel will contain code which accounts for the fact that it
will see large folios. So, is it reasonable to penalize the order-0 folio
case in preference to the order > 0 case? If yes, we can simply stop batching
if !IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE).

>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-18 10:38                   ` Dev Jain
@ 2026-02-18 10:46                     ` David Hildenbrand (Arm)
  2026-02-18 11:58                       ` Pedro Falcato
  2026-02-18 11:52                     ` Pedro Falcato
  1 sibling, 1 reply; 23+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-18 10:46 UTC (permalink / raw)
  To: Dev Jain, Pedro Falcato
  Cc: Luke Yang, surenb, jhladky, akpm, Liam.Howlett, willy, vbabka,
	linux-mm, linux-kernel

On 2/18/26 11:38, Dev Jain wrote:
> 
> On 18/02/26 3:36 pm, Pedro Falcato wrote:
>> On Wed, Feb 18, 2026 at 10:31:19AM +0530, Dev Jain wrote:
>>> Thanks for working on this. Some comments -
>>>
>>> 1. Rejecting batching with pte_batch_hint() means that we also don't batch 16K and 32K large
>>> folios on arm64, since the cont bit is on starting only at 64K. Not sure how imp this is.
>> I don't understand what you mean. Is ARM64 doing large folio optimization,
>> even when there's no special MMU support for it (the aforementioned 16K and
>> 32K cases)? If so, perhaps it's time for a ARCH_SUPPORTS_PTE_BATCHING flag.
>> Though if you could provide numbers in that case it would be much appreciated.
> 
> There are two things at play here:
> 
> 1. All arches are expected to benefit from pte batching on large folios, because
> of doing similar operations together in one shot. For code paths except mprotect
> and mremap, that benefit is far more clear due to:
> 
> a) batching across atomic operations etc. For example, see copy_present_ptes -> folio_ref_add.
>     Instead of bumping the reference by 1 nr times, we bump it by nr in one shot.
> 
> b) vm_normal_folio was already being invoked. So, all in all the only new overhead
>     we introduce is of folio_pte_batch(_flags). In fact, since we already have the
>     folio, I recall that we even just special case the large folio case, out from
>     the small folio case. Thus 4K folio processing will have no overhead.
> 
> 2. Due to the requirements of contpte, ptep_get() on arm64 needs to fetch a/d bits
> across a cont block. Thus, for each ptep_get, it does 16 pte accesses. To avoid this,
> it becomes critical to batch on arm64.
> 
> 
>>
>>> 2. Did you measure if there is an optimization due to just the first commit ("prefetch the next pte")?
>> Yes, I could measure a sizeable improvement (perhaps some 5%). I tested on
>> zen5 (which is a pretty beefy uarch) and the loop is so full of ~~crap~~
>> features that the prefetcher seems to be doing a poor job, at least per my
>> results.
> 
> Nice.
> 
>>
>>> I actually had prefetch in mind - is it possible to do some kind of prefetch(pfn_to_page(pte_pfn(pte)))
>>> to optimize the call to vm_normal_folio()?
>> Certainly possible, but I suspect it doesn't make too much sense. You want to
>> avoid bringing in the cacheline if possible. In the pte's case, I know we're
>> probably going to look at it and modify it, and if I'm wrong it's just one
>> cacheline we misprefetched (though I had some parallel convos and it might
>> be that we need a branch there to avoid prefetching out of the PTE table).
>> We would like to avoid bringing in the folio cacheline at all, even if we
>> don't stall through some fancy prefetching or sheer CPU magic.
> 
> I dunno, need other opinions.

Let me repeat my question: what, besides the micro-benchmark in some cases
with all small folios, are we trying to optimize here? No hand waving
("Android does this or that"), please.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-18 10:38                   ` Dev Jain
  2026-02-18 10:46                     ` David Hildenbrand (Arm)
@ 2026-02-18 11:52                     ` Pedro Falcato
  1 sibling, 0 replies; 23+ messages in thread
From: Pedro Falcato @ 2026-02-18 11:52 UTC (permalink / raw)
  To: Dev Jain
  Cc: Luke Yang, david, surenb, jhladky, akpm, Liam.Howlett, willy,
	vbabka, linux-mm, linux-kernel

On Wed, Feb 18, 2026 at 04:08:11PM +0530, Dev Jain wrote:
> 
> There are two things at play here:
> 
> 1. All arches are expected to benefit from pte batching on large folios, because
> of doing similar operations together in one shot. For code paths except mprotect
> and mremap, that benefit is far more clear due to:
> 
> a) batching across atomic operations etc. For example, see copy_present_ptes -> folio_ref_add.
>    Instead of bumping the reference by 1 nr times, we bump it by nr in one shot.
> 
> b) vm_normal_folio was already being invoked. So, all in all the only new overhead
>    we introduce is of folio_pte_batch(_flags). In fact, since we already have the
>    folio, I recall that we even just special case the large folio case, out from
>    the small folio case. Thus 4K folio processing will have no overhead.
> 
> 2. Due to the requirements of contpte, ptep_get() on arm64 needs to fetch a/d bits
> across a cont block. Thus, for each ptep_get, it does 16 pte accesses. To avoid this,
> it becomes critical to batch on arm64.
>

Understood.
 
> 
> >
> >> 2. Did you measure if there is an optimization due to just the first commit ("prefetch the next pte")?
> > Yes, I could measure a sizeable improvement (perhaps some 5%). I tested on
> > zen5 (which is a pretty beefy uarch) and the loop is so full of ~~crap~~
> > features that the prefetcher seems to be doing a poor job, at least per my
> > results.
> 
> Nice.
> 
> >
> >> I actually had prefetch in mind - is it possible to do some kind of prefetch(pfn_to_page(pte_pfn(pte)))
> >> to optimize the call to vm_normal_folio()?
> > Certainly possible, but I suspect it doesn't make too much sense. You want to
> > avoid bringing in the cacheline if possible. In the pte's case, I know we're
> > probably going to look at it and modify it, and if I'm wrong it's just one
> > cacheline we misprefetched (though I had some parallel convos and it might
> > be that we need a branch there to avoid prefetching out of the PTE table).
> > We would like to avoid bringing in the folio cacheline at all, even if we
> > don't stall through some fancy prefetching or sheer CPU magic.
> 
> I dunno, need other opinions.
> 
> The question here becomes that - should we prefer performance on 4K folios or
> large folios? As Luke reports in the other email, the benefit on pte-mapped-thp
> was staggering.

We want order-0 folios to be as performant as we can make them, since they
are the bulk of all folios in an mTHP-less system (especially anon folios; I
know the page cache is a little more complex these days).

> 
> I believe that if the sysadmin is enabling CONFIG_TRANSPARENT_HUGEPAGE, they know
> that the kernel will contain code which incorporates this fact that it will see
> large folios. So, is it reasonable to penalize folio order-0 case, in preference
> to folio order > 0? If yes, we can simply stop batching if !IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE).

No, the sysadmin does not enable CONFIG_TRANSPARENT_HUGEPAGE. We're lucky if
the distribution knows what CONFIG_THP does. It is not reasonable, IMO, to
penalize anything.

-- 
Pedro


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-18 10:46                     ` David Hildenbrand (Arm)
@ 2026-02-18 11:58                       ` Pedro Falcato
  2026-02-18 12:24                         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 23+ messages in thread
From: Pedro Falcato @ 2026-02-18 11:58 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Dev Jain, Luke Yang, surenb, jhladky, akpm, Liam.Howlett, willy,
	vbabka, linux-mm, linux-kernel

On Wed, Feb 18, 2026 at 11:46:29AM +0100, David Hildenbrand (Arm) wrote:
> On 2/18/26 11:38, Dev Jain wrote:
> > 
> > On 18/02/26 3:36 pm, Pedro Falcato wrote:
> > > On Wed, Feb 18, 2026 at 10:31:19AM +0530, Dev Jain wrote:
> > > > Thanks for working on this. Some comments -
> > > > 
> > > > 1. Rejecting batching with pte_batch_hint() means that we also don't batch 16K and 32K large
> > > > folios on arm64, since the cont bit is on starting only at 64K. Not sure how imp this is.
> > > I don't understand what you mean. Is ARM64 doing large folio optimization,
> > > even when there's no special MMU support for it (the aforementioned 16K and
> > > 32K cases)? If so, perhaps it's time for a ARCH_SUPPORTS_PTE_BATCHING flag.
> > > Though if you could provide numbers in that case it would be much appreciated.
> > 
> > There are two things at play here:
> > 
> > 1. All arches are expected to benefit from pte batching on large folios, because
> > of doing similar operations together in one shot. For code paths except mprotect
> > and mremap, that benefit is far more clear due to:
> > 
> > a) batching across atomic operations etc. For example, see copy_present_ptes -> folio_ref_add.
> >     Instead of bumping the reference by 1 nr times, we bump it by nr in one shot.
> > 
> > b) vm_normal_folio was already being invoked. So, all in all the only new overhead
> >     we introduce is of folio_pte_batch(_flags). In fact, since we already have the
> >     folio, I recall that we even just special case the large folio case, out from
> >     the small folio case. Thus 4K folio processing will have no overhead.
> > 
> > 2. Due to the requirements of contpte, ptep_get() on arm64 needs to fetch a/d bits
> > across a cont block. Thus, for each ptep_get, it does 16 pte accesses. To avoid this,
> > it becomes critical to batch on arm64.
> > 
> > 
> > > 
> > > > 2. Did you measure if there is an optimization due to just the first commit ("prefetch the next pte")?
> > > Yes, I could measure a sizeable improvement (perhaps some 5%). I tested on
> > > zen5 (which is a pretty beefy uarch) and the loop is so full of ~~crap~~
> > > features that the prefetcher seems to be doing a poor job, at least per my
> > > results.
> > 
> > Nice.
> > 
> > > 
> > > > I actually had prefetch in mind - is it possible to do some kind of prefetch(pfn_to_page(pte_pfn(pte)))
> > > > to optimize the call to vm_normal_folio()?
> > > Certainly possible, but I suspect it doesn't make too much sense. You want to
> > > avoid bringing in the cacheline if possible. In the pte's case, I know we're
> > > probably going to look at it and modify it, and if I'm wrong it's just one
> > > cacheline we misprefetched (though I had some parallel convos and it might
> > > be that we need a branch there to avoid prefetching out of the PTE table).
> > > We would like to avoid bringing in the folio cacheline at all, even if we
> > > don't stall through some fancy prefetching or sheer CPU magic.
> > 
> > I dunno, need other opinions.
> 
> Let's repeat my question: what, besides the micro-benchmark in some cases
> with all small-folios, are we trying to optimize here. No hand waving
> (Androids does this or that) please.

I don't understand what you're looking for. An mprotect-based workload? Those
obviously don't really exist, apart from something like a JIT engine cranking
out a lot of mprotect() calls in an aggressive fashion, or perhaps some of the
mprotect usage that our DB friends like sometimes (discussed in
$OTHER_CONTEXTS), though those are generally hugepages.

I don't see how this can justify large performance regressions in a system
call, for something every-architecture-not-named-arm64 does not have.

-- 
Pedro


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-18 11:58                       ` Pedro Falcato
@ 2026-02-18 12:24                         ` David Hildenbrand (Arm)
  2026-02-19 12:15                           ` Pedro Falcato
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-18 12:24 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Dev Jain, Luke Yang, surenb, jhladky, akpm, Liam.Howlett, willy,
	vbabka, linux-mm, linux-kernel

On 2/18/26 12:58, Pedro Falcato wrote:
> On Wed, Feb 18, 2026 at 11:46:29AM +0100, David Hildenbrand (Arm) wrote:
>> On 2/18/26 11:38, Dev Jain wrote:
>>>
>>>
>>> There are two things at play here:
>>>
>>> 1. All arches are expected to benefit from pte batching on large folios, because
>>> of doing similar operations together in one shot. For code paths except mprotect
>>> and mremap, that benefit is far more clear due to:
>>>
>>> a) batching across atomic operations etc. For example, see copy_present_ptes -> folio_ref_add.
>>>      Instead of bumping the reference by 1 nr times, we bump it by nr in one shot.
>>>
>>> b) vm_normal_folio was already being invoked. So, all in all the only new overhead
>>>      we introduce is of folio_pte_batch(_flags). In fact, since we already have the
>>>      folio, I recall that we even just special case the large folio case, out from
>>>      the small folio case. Thus 4K folio processing will have no overhead.
>>>
>>> 2. Due to the requirements of contpte, ptep_get() on arm64 needs to fetch a/d bits
>>> across a cont block. Thus, for each ptep_get, it does 16 pte accesses. To avoid this,
>>> it becomes critical to batch on arm64.
>>>
>>>
>>>
>>> Nice.
>>>
>>>
>>> I dunno, need other opinions.
>>
>> Let's repeat my question: what, besides the micro-benchmark in some cases
>> with all small-folios, are we trying to optimize here. No hand waving
>> (Androids does this or that) please.
> 
> I don't understand what you're looking for. an mprotect-based workload? those
> obviously don't really exist, apart from something like a JIT engine cranking
> out a lot of mprotect() calls in an aggressive fashion. Or perhaps some of that
> usage of mprotect that our DB friends like to use sometimes (discussed in
> $OTHER_CONTEXTS), though those are generally hugepages.
> 

Anything besides a homemade micro-benchmark that highlights why we 
should care about this exact fast and repeated sequence of events.

I'm surprised that such a "large regression" does not show up in any
other non-homemade benchmark that people/bots are running. That's
really what I am questioning.

Having said that, I'm all for optimizing it if there is a real problem
there.

> I don't see how this can justify large performance regressions in a system
> call, for something every-architecture-not-named-arm64 does not have.
Take a look at the reported performance improvements on AMD with large 
folios.

The issue really is that small folios don't perform well, on any 
architecture. But to detect large vs. small folios we need the ... folio.

So once we optimize for small folios (== don't try to detect large 
folios) we'll degrade large folios.


For fork() and unmap() we were able to avoid most of the performance
regressions for small folios by special-casing the implementation into two
variants: nr_pages == 1 (incl. small folios) vs. nr_pages != 1 (large
folios).
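
Schematically that split was (hypothetical helper names, just to illustrate
the idea; not the actual fork()/zap code):

        if (likely(nr_ptes == 1)) {
                /* small folio (or single PTE): no batch-detection overhead */
                process_single_pte(vma, addr, ptep, pte);
        } else {
                /* large folio: amortize refcount/mapcount/TLB work over the batch */
                process_pte_batch(vma, addr, ptep, pte, nr_ptes);
        }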

We cannot avoid the vm_normal_folio(). Maybe the function-call overhead 
could be avoided by providing an inlined variant -- if that is the real 
problem.

But likely it's also just the access to the folio itself in cases where we
really don't need it.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-13 15:08 [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad) Luke Yang
  2026-02-13 15:47 ` David Hildenbrand (Arm)
@ 2026-02-18 13:29 ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 23+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-18 13:29 UTC (permalink / raw)
  To: Luke Yang, dev.jain
  Cc: jhladky, akpm, Liam.Howlett, willy, surenb, vbabka, linux-mm,
	linux-kernel


> int main(int argc, char **argv)
> {
>     int i, j, ret;
>     long long k;
> 
>     if (argc < 4) {
>         printf("USAGE: %s region_size_kib region_count iterations\n", argv[0]);
>         printf("Creates multiple regions and times mprotect() calls\n");
>         return 1;
>     }
> 
>     long region_size = atol(argv[1]) * 1024L;
>     int region_count = atoi(argv[2]);
>     int iterations = atoi(argv[3]);
> 
>     int pagesize = sysconf(_SC_PAGESIZE);
> 
>     vchar_t **regions = malloc(region_count * sizeof(vchar_t*));
>     if (!regions) {
>         perror("malloc");
>         return 1;
>     }
> 
>     for (i = 0; i < region_count; i++) {
>         regions[i] = (vchar_t *) mmap(NULL, region_size,
>                       PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0L);
> 

I assume that the regression might be more pronounced with MAP_SHARED,
because there we never really required the page/folio during
mprotect() IIRC.
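
FWIW, testing that only needs a one-flag change in the reproducer's mmap()
call, roughly (untested):

        /* MAP_SHARED instead of MAP_PRIVATE, to test the theory above */
        regions[i] = (vchar_t *) mmap(NULL, region_size,
                      PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0L);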

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-18 12:24                         ` David Hildenbrand (Arm)
@ 2026-02-19 12:15                           ` Pedro Falcato
  2026-02-19 13:02                             ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 23+ messages in thread
From: Pedro Falcato @ 2026-02-19 12:15 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Dev Jain, Luke Yang, surenb, jhladky, akpm, Liam.Howlett, willy,
	vbabka, linux-mm, linux-kernel

On Wed, Feb 18, 2026 at 01:24:28PM +0100, David Hildenbrand (Arm) wrote:
> On 2/18/26 12:58, Pedro Falcato wrote:
> > On Wed, Feb 18, 2026 at 11:46:29AM +0100, David Hildenbrand (Arm) wrote:
> > > On 2/18/26 11:38, Dev Jain wrote:
> > > > 
> > > > 
> > > > There are two things at play here:
> > > > 
> > > > 1. All arches are expected to benefit from pte batching on large folios, because
> > > > of doing similar operations together in one shot. For code paths except mprotect
> > > > and mremap, that benefit is far more clear due to:
> > > > 
> > > > a) batching across atomic operations etc. For example, see copy_present_ptes -> folio_ref_add.
> > > >      Instead of bumping the reference by 1 nr times, we bump it by nr in one shot.
> > > > 
> > > > b) vm_normal_folio was already being invoked. So, all in all the only new overhead
> > > >      we introduce is of folio_pte_batch(_flags). In fact, since we already have the
> > > >      folio, I recall that we even just special case the large folio case, out from
> > > >      the small folio case. Thus 4K folio processing will have no overhead.
> > > > 
> > > > 2. Due to the requirements of contpte, ptep_get() on arm64 needs to fetch a/d bits
> > > > across a cont block. Thus, for each ptep_get, it does 16 pte accesses. To avoid this,
> > > > it becomes critical to batch on arm64.
> > > > 
> > > > 
> > > > 
> > > > Nice.
> > > > 
> > > > 
> > > > I dunno, need other opinions.
> > > 
> > > Let's repeat my question: what, besides the micro-benchmark in some cases
> > > with all small-folios, are we trying to optimize here. No hand waving
> > > (Androids does this or that) please.
> > 
> > I don't understand what you're looking for. an mprotect-based workload? those
> > obviously don't really exist, apart from something like a JIT engine cranking
> > out a lot of mprotect() calls in an aggressive fashion. Or perhaps some of that
> > usage of mprotect that our DB friends like to use sometimes (discussed in
> > $OTHER_CONTEXTS), though those are generally hugepages.
> > 
> 
> Anything besides a homemade micro-benchmark that highlights why we should
> care about this exact fast and repeated sequence of events.
> 
> I'm surprise that such a "large regression" does not show up in any other
> non-home-made benchmark that people/bots are running. That's really what I
> am questioning.

I don't know, perhaps there isn't a will-it-scale test for this. That's
alright. Even the standard will-it-scale and stress-ng tests people use
to detect regressions usually have glaring problems and are insanely
microbenchey.

> 
> Having that said, I'm all for optimizing it if there is a real problem
> there.
> 
> > I don't see how this can justify large performance regressions in a system
> > call, for something every-architecture-not-named-arm64 does not have.
> Take a look at the reported performance improvements on AMD with large
> folios.

Sure, but pte-mapped 2M folios are almost a worst case (why not use a PMD at
that point...)

> 
> The issue really is that small folios don't perform well, on any
> architecture. But to detect large vs. small folios we need the ... folio.
> 
> So once we optimize for small folios (== don't try to detect large folios)
> we'll degrade large folios.

I suspect it's not that huge of a deal. Worst case you can always provide a
software PTE_CONT bit that would e.g. be set when mapping a large folio. Or
perhaps "if this pte has a PFN, and the next pte has PFN + 1, then we're
probably in a large folio, thus do the proper batching stuff". I think that
could satisfy everyone. There are heuristics we can use, and perhaps
pte_batch_hint() does not need to be that simple and useless in the !arm64
case then. I'll try to look into a cromulent solution for everyone.
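
Roughly, the lookahead I have in mind is something like this (hypothetical
helper, completely untested):

static inline bool pte_maybe_large_folio(pte_t *ptep, pte_t pte,
                                         unsigned int max_nr_ptes)
{
        pte_t next;

        /* Don't peek past the range (and thus past the PTE table). */
        if (max_nr_ptes < 2)
                return false;

        next = ptep_get(ptep + 1);
        /*
         * Physically contiguous present PTEs probably belong to one large
         * folio. Two unrelated but physically adjacent order-0 folios get
         * misdetected, which only costs the folio lookup we used to do
         * unconditionally anyway.
         */
        return pte_present(next) && pte_pfn(next) == pte_pfn(pte) + 1;
}

change_pte_range() could then gate the folio lookup on something like this
instead of (or in addition to) pte_batch_hint().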

(shower thought: do we always get wins when batching large folios, or do these
need to be of a significant order to get wins?)

But personally I would err on the side of small folios, like we did for mremap()
a few months back.

> 
> 
> For fork() and unmap() we were able to avoid most of the performance
> regressions for small folios by special-casing the implementation on two
> variants: nr_pages == 1 (incl. small folios) vs. nr_pages != 1 (large
> folios).
> 
> We cannot avoid the vm_normal_folio(). Maybe the function-call overhead
> could be avoided by providing an inlined variant -- if that is the real
> problem.
> 
> But likely it's also just access to the folio when we really don't need it
> in some cases.

/me shrieks at the thought of the extra cacheline accesses in the glorious
memdesc future :)

-- 
Pedro


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-19 12:15                           ` Pedro Falcato
@ 2026-02-19 13:02                             ` David Hildenbrand (Arm)
  2026-02-19 15:00                               ` Pedro Falcato
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-19 13:02 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Dev Jain, Luke Yang, surenb, jhladky, akpm, Liam.Howlett, willy,
	vbabka, linux-mm, linux-kernel

On 2/19/26 13:15, Pedro Falcato wrote:
> On Wed, Feb 18, 2026 at 01:24:28PM +0100, David Hildenbrand (Arm) wrote:
>> On 2/18/26 12:58, Pedro Falcato wrote:
>>>
>>> I don't understand what you're looking for. an mprotect-based workload? those
>>> obviously don't really exist, apart from something like a JIT engine cranking
>>> out a lot of mprotect() calls in an aggressive fashion. Or perhaps some of that
>>> usage of mprotect that our DB friends like to use sometimes (discussed in
>>> $OTHER_CONTEXTS), though those are generally hugepages.
>>>
>>
>> Anything besides a homemade micro-benchmark that highlights why we should
>> care about this exact fast and repeated sequence of events.
>>
>> I'm surprise that such a "large regression" does not show up in any other
>> non-home-made benchmark that people/bots are running. That's really what I
>> am questioning.
> 
> I don't know, perhaps there isn't a will-it-scale test for this. That's
> alright. Even the standard will-it-scale and stress-ng tests people use
> to detect regressions usually have glaring problems and are insanely
> microbenchey.

My theory is that most heavy mprotect users (like JITs), i.e. the high-frequency
ones where it would really hit performance, perform mprotect on very small
ranges (e.g., a single page), where all the other overhead (syscall, TLB flush)
dominates.

That's why I was wondering which use cases exist that behave similarly to the
reproducer.

> 
>>
>> Having that said, I'm all for optimizing it if there is a real problem
>> there.
>>
>>> I don't see how this can justify large performance regressions in a system
>>> call, for something every-architecture-not-named-arm64 does not have.
>> Take a look at the reported performance improvements on AMD with large
>> folios.
> 
> Sure, but pte-mapped 2M folios is almost a worst-case (why not a PMD at that
> point...)

Well, 1M and all the way down will similarly benefit. 2M is just always the extreme case.

> 
>>
>> The issue really is that small folios don't perform well, on any
>> architecture. But to detect large vs. small folios we need the ... folio.
>>
>> So once we optimize for small folios (== don't try to detect large folios)
>> we'll degrade large folios.
> 
> I suspect it's not that huge of a deal. Worst case you can always provide a
> software PTE_CONT bit that would e.g be set when mapping a large folio. Or
> perhaps "if this pte has a PFN, and the next pte has PFN + 1, then we're
> probably in a large folio, thus do the proper batching stuff". I think that
> could satisfy everyone. There are heuristics we can use, and perhaps
> pte_batch_hint() does not need to be that simple and useless in the !arm64
> case then. I'll try to look into a cromulent solution for everyone.

Software bits are generally -ENOSPC, but maybe we are lucky on some architectures.

We'd run into issues similar to aarch64's when shattering contiguity etc., so
there is quite some complexity to it that might not be worth it.

> 
> (shower thought: do we always get wins when batching large folios, or do these
> need to be of a significant order to get wins?)

For mprotect(), I don't know. For fork() and unmap() batching there was always a
win even with order-2 folios. (I never measured order-1, because it doesn't apply
to anonymous memory.)

I assume for mprotect() it depends on whether we really needed the folio before,
or whether it's just not required, like for mremap().

> 
> But personally I would err on the side of small folios, like we did for mremap()
> a few months back.

The following (completely untested) might make most people happy by looking up
the folio only if (a) it is required or (b) the architecture indicates that there is a large folio.

I assume for some large folio use cases it might perform worse than before. But for
the write-upgrade case with large anon folios the performance improvement should remain.

Not sure if some regression would remain for which we'd have to special-case the implementation
to take a separate path for nr_ptes == 1.

Maybe you had something similar already:


diff --git a/mm/mprotect.c b/mm/mprotect.c
index c0571445bef7..0b3856ad728e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -211,6 +211,25 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
         commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
  }
  
+static bool mprotect_wants_folio_for_pte(unsigned long cp_flags, pte_t *ptep,
+               pte_t pte, unsigned long max_nr_ptes)
+{
+       /* NUMA hinting needs to decide whether working on the folio is ok. */
+       if (cp_flags & MM_CP_PROT_NUMA)
+               return true;
+
+       /* We want the folio for possible write-upgrade. */
+       if (!pte_write(pte) && (cp_flags & MM_CP_TRY_CHANGE_WRITABLE))
+               return true;
+
+       /* There is nothing to batch. */
+       if (max_nr_ptes == 1)
+               return false;
+
+       /* For guaranteed large folios it's usually a win. */
+       return pte_batch_hint(ptep, pte) > 1;
+}
+
  static long change_pte_range(struct mmu_gather *tlb,
                 struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
                 unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -241,16 +260,18 @@ static long change_pte_range(struct mmu_gather *tlb,
                         const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
                         int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
                         struct folio *folio = NULL;
-                       struct page *page;
+                       struct page *page = NULL;
                         pte_t ptent;
  
                         /* Already in the desired state. */
                         if (prot_numa && pte_protnone(oldpte))
                                 continue;
  
-                       page = vm_normal_page(vma, addr, oldpte);
-                       if (page)
-                               folio = page_folio(page);
+                       if (mprotect_wants_folio_for_pte(cp_flags, pte, oldpte, max_nr_ptes)) {
+                               page = vm_normal_page(vma, addr, oldpte);
+                               if (page)
+                                       folio = page_folio(page);
+                       }
  
                         /*
                          * Avoid trapping faults against the zero or KSM


-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-19 13:02                             ` David Hildenbrand (Arm)
@ 2026-02-19 15:00                               ` Pedro Falcato
  2026-02-19 15:29                                 ` David Hildenbrand (Arm)
  2026-02-20  4:12                                 ` Dev Jain
  0 siblings, 2 replies; 23+ messages in thread
From: Pedro Falcato @ 2026-02-19 15:00 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Dev Jain, Luke Yang, surenb, jhladky, akpm, Liam.Howlett, willy,
	vbabka, linux-mm, linux-kernel

On Thu, Feb 19, 2026 at 02:02:42PM +0100, David Hildenbrand (Arm) wrote:
> On 2/19/26 13:15, Pedro Falcato wrote:
> > On Wed, Feb 18, 2026 at 01:24:28PM +0100, David Hildenbrand (Arm) wrote:
> > > On 2/18/26 12:58, Pedro Falcato wrote:
> > > > 
> > > > I don't understand what you're looking for. an mprotect-based workload? those
> > > > obviously don't really exist, apart from something like a JIT engine cranking
> > > > out a lot of mprotect() calls in an aggressive fashion. Or perhaps some of that
> > > > usage of mprotect that our DB friends like to use sometimes (discussed in
> > > > $OTHER_CONTEXTS), though those are generally hugepages.
> > > > 
> > > 
> > > Anything besides a homemade micro-benchmark that highlights why we should
> > > care about this exact fast and repeated sequence of events.
> > > 
> > > I'm surprise that such a "large regression" does not show up in any other
> > > non-home-made benchmark that people/bots are running. That's really what I
> > > am questioning.
> > 
> > I don't know, perhaps there isn't a will-it-scale test for this. That's
> > alright. Even the standard will-it-scale and stress-ng tests people use
> > to detect regressions usually have glaring problems and are insanely
> > microbenchey.
> 
> My theory is that most heavy (high frequency where it would really hit performance)
> mprotect users (like JITs) perform mprotect on very small ranges (e.g., single page),
> where all the other overhead (syscall, TLB flush) dominates.
> 
> That's why I was wondering which use cases that behave similar to the reproducer exist.
> 
> > 
> > > 
> > > Having that said, I'm all for optimizing it if there is a real problem
> > > there.
> > > 
> > > > I don't see how this can justify large performance regressions in a system
> > > > call, for something every-architecture-not-named-arm64 does not have.
> > > Take a look at the reported performance improvements on AMD with large
> > > folios.
> > 
> > Sure, but pte-mapped 2M folios is almost a worst-case (why not a PMD at that
> > point...)
> 
> Well, 1M and all the way down will similarly benefit. 2M is just always the extreme case.
> 
> > 
> > > 
> > > The issue really is that small folios don't perform well, on any
> > > architecture. But to detect large vs. small folios we need the ... folio.
> > > 
> > > So once we optimize for small folios (== don't try to detect large folios)
> > > we'll degrade large folios.
> > 
> > I suspect it's not that huge of a deal. Worst case you can always provide a
> > software PTE_CONT bit that would e.g be set when mapping a large folio. Or
> > perhaps "if this pte has a PFN, and the next pte has PFN + 1, then we're
> > probably in a large folio, thus do the proper batching stuff". I think that
> > could satisfy everyone. There are heuristics we can use, and perhaps
> > pte_batch_hint() does not need to be that simple and useless in the !arm64
> > case then. I'll try to look into a cromulent solution for everyone.
> 
> Software bits are generally -ENOSPC, but maybe we are lucky on some architectures.
> 
> We'd run into similar issues like aarch64 when shattering contiguity etc, so
> there is quite some complexity too it that might not be worth it.
> 
> > 
> > (shower thought: do we always get wins when batching large folios, or do these
> > need to be of a significant order to get wins?)
> 
> For mprotect(), I don't know. For fork() and unmap() batching there was always a
> win even with order-2 folios. (never measured order-1, because they don't apply to
> anonymous memory)
> 
> I assume for mprotect() it depends whether we really needed the folio before, or
> whether it's just not required like for mremap().
> 
> > 
> > But personally I would err on the side of small folios, like we did for mremap()
> > a few months back.
> 
> The following (completely untested) might make most people happy by looking up
> the folio only if (a) required or (b) if the architecture indicates that there is a large folio.
> 
> I assume for some large folio use cases it might perform worse than before. But for
> the write-upgrade case with large anon folios the performance improvement should remain.
> 
> Not sure if some regression would remain for which we'd have to special-case the implementation
> to take a separate path for nr_ptes == 1.
> 
> Maybe you had something similar already:
> 
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index c0571445bef7..0b3856ad728e 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -211,6 +211,25 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
>         commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
>  }
> +static bool mprotect_wants_folio_for_pte(unsigned long cp_flags, pte_t *ptep,
> +               pte_t pte, unsigned long max_nr_ptes)
> +{
> +       /* NUMA hinting needs decide whether working on the folio is ok. */
> +       if (cp_flags & MM_CP_PROT_NUMA)
> +               return true;
> +
> +       /* We want the folio for possible write-upgrade. */
> +       if (!pte_write(pte) && (cp_flags & MM_CP_TRY_CHANGE_WRITABLE))
> +               return true;
> +
> +       /* There is nothing to batch. */
> +       if (max_nr_ptes == 1)
> +               return false;
> +
> +       /* For guaranteed large folios it's usually a win. */
> +       return pte_batch_hint(ptep, pte) > 1;
> +}
> +
>  static long change_pte_range(struct mmu_gather *tlb,
>                 struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>                 unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -241,16 +260,18 @@ static long change_pte_range(struct mmu_gather *tlb,
>                         const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
>                         int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>                         struct folio *folio = NULL;
> -                       struct page *page;
> +                       struct page *page = NULL;
>                         pte_t ptent;
>                         /* Already in the desired state. */
>                         if (prot_numa && pte_protnone(oldpte))
>                                 continue;
> -                       page = vm_normal_page(vma, addr, oldpte);
> -                       if (page)
> -                               folio = page_folio(page);
> +                       if (mprotect_wants_folio_for_pte(cp_flags, pte, oldpte, max_nr_ptes)) {
> +                               page = vm_normal_page(vma, addr, oldpte);
> +                               if (page)
> +                                       folio = page_folio(page);
> +                       }
>                         /*
>                          * Avoid trapping faults against the zero or KSM
> 

Yes, this is a better version than what I had; I'll take this hunk if you don't mind :)

Note that it still doesn't handle large folios on !contpte architectures, which
is partly the issue. I suspect some sort of PTE lookahead might work well in
practice, aside from the issues where e.g. two order-0 folios that are
contiguous in memory are separately mapped.

Though perhaps inlining vm_normal_folio() might also be interesting and side-step
most of the issue. I'll play around with that.

-- 
Pedro


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-19 15:00                               ` Pedro Falcato
@ 2026-02-19 15:29                                 ` David Hildenbrand (Arm)
  2026-02-20  4:12                                 ` Dev Jain
  1 sibling, 0 replies; 23+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-19 15:29 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Dev Jain, Luke Yang, surenb, jhladky, akpm, Liam.Howlett, willy,
	vbabka, linux-mm, linux-kernel

On 2/19/26 16:00, Pedro Falcato wrote:
> On Thu, Feb 19, 2026 at 02:02:42PM +0100, David Hildenbrand (Arm) wrote:
>> On 2/19/26 13:15, Pedro Falcato wrote:
>>>
>>> I don't know, perhaps there isn't a will-it-scale test for this. That's
>>> alright. Even the standard will-it-scale and stress-ng tests people use
>>> to detect regressions usually have glaring problems and are insanely
>>> microbenchey.
>>
>> My theory is that most heavy (high frequency where it would really hit performance)
>> mprotect users (like JITs) perform mprotect on very small ranges (e.g., single page),
>> where all the other overhead (syscall, TLB flush) dominates.
>>
>> That's why I was wondering which use cases that behave similar to the reproducer exist.
>>
>>>
>>>
>>> Sure, but pte-mapped 2M folios is almost a worst-case (why not a PMD at that
>>> point...)
>>
>> Well, 1M and all the way down will similarly benefit. 2M is just always the extreme case.
>>
>>>
>>>
>>> I suspect it's not that huge of a deal. Worst case you can always provide a
>>> software PTE_CONT bit that would e.g be set when mapping a large folio. Or
>>> perhaps "if this pte has a PFN, and the next pte has PFN + 1, then we're
>>> probably in a large folio, thus do the proper batching stuff". I think that
>>> could satisfy everyone. There are heuristics we can use, and perhaps
>>> pte_batch_hint() does not need to be that simple and useless in the !arm64
>>> case then. I'll try to look into a cromulent solution for everyone.
>>
>> Software bits are generally -ENOSPC, but maybe we are lucky on some architectures.
>>
>> We'd run into similar issues like aarch64 when shattering contiguity etc, so
>> there is quite some complexity too it that might not be worth it.
>>
>>>
>>> (shower thought: do we always get wins when batching large folios, or do these
>>> need to be of a significant order to get wins?)
>>
>> For mprotect(), I don't know. For fork() and unmap() batching there was always a
>> win even with order-2 folios. (never measured order-1, because they don't apply to
>> anonymous memory)
>>
>> I assume for mprotect() it depends whether we really needed the folio before, or
>> whether it's just not required like for mremap().
>>
>>>
>>> But personally I would err on the side of small folios, like we did for mremap()
>>> a few months back.
>>
>> The following (completely untested) might make most people happy by looking up
>> the folio only if (a) required or (b) if the architecture indicates that there is a large folio.
>>
>> I assume for some large folio use cases it might perform worse than before. But for
>> the write-upgrade case with large anon folios the performance improvement should remain.
>>
>> Not sure if some regression would remain for which we'd have to special-case the implementation
>> to take a separate path for nr_ptes == 1.
>>
>> Maybe you had something similar already:
>>
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index c0571445bef7..0b3856ad728e 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -211,6 +211,25 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
>>          commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
>>   }
>> +static bool mprotect_wants_folio_for_pte(unsigned long cp_flags, pte_t *ptep,
>> +               pte_t pte, unsigned long max_nr_ptes)
>> +{
>> +       /* NUMA hinting needs decide whether working on the folio is ok. */
>> +       if (cp_flags & MM_CP_PROT_NUMA)
>> +               return true;
>> +
>> +       /* We want the folio for possible write-upgrade. */
>> +       if (!pte_write(pte) && (cp_flags & MM_CP_TRY_CHANGE_WRITABLE))
>> +               return true;
>> +
>> +       /* There is nothing to batch. */
>> +       if (max_nr_ptes == 1)
>> +               return false;
>> +
>> +       /* For guaranteed large folios it's usually a win. */
>> +       return pte_batch_hint(ptep, pte) > 1;
>> +}
>> +
>>   static long change_pte_range(struct mmu_gather *tlb,
>>                  struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>                  unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -241,16 +260,18 @@ static long change_pte_range(struct mmu_gather *tlb,
>>                          const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
>>                          int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>                          struct folio *folio = NULL;
>> -                       struct page *page;
>> +                       struct page *page = NULL;
>>                          pte_t ptent;
>>                          /* Already in the desired state. */
>>                          if (prot_numa && pte_protnone(oldpte))
>>                                  continue;
>> -                       page = vm_normal_page(vma, addr, oldpte);
>> -                       if (page)
>> -                               folio = page_folio(page);
>> +                       if (mprotect_wants_folio_for_pte(cp_flags, pte, oldpte, max_nr_ptes)) {
>> +                               page = vm_normal_page(vma, addr, oldpte);
>> +                               if (page)
>> +                                       folio = page_folio(page);
>> +                       }
>>                          /*
>>                           * Avoid trapping faults against the zero or KSM
>>
> 
> Yes, this is a better version than what I had, I'll take this hunk if you don't mind :)

Not at all, thanks for working on this.

> Note that it still doesn't handle large folios on !contpte architectures, which
> is partly the issue. 

It should when we really need the folio (write-upgrade, NUMA faults). So 
I guess the benchmark with THP will still show the benefit (as it does 
the write upgrade).

> I suspect some sort of PTE lookahead might work well in
> practice, aside from the issues where e.g two order-0 folios that are
> contiguous in memory are separately mapped.
> 
> Though perhaps inlining vm_normal_folio() might also be interesting and side-step
> most of the issue. I'll play around with that.


I'd assume that it could also help fork/munmap() etc. For common 
architectures with vmemmap, vm_normal_page() is extremely short code.
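
For reference, with CONFIG_ARCH_HAS_PTE_SPECIAL the interesting part of
vm_normal_page() boils down to roughly this (grossly simplified; the real code
also handles the VM_PFNMAP/VM_MIXEDMAP cases and bad-PTE reporting):

        if (pte_special(pte))
                return NULL;    /* zero page, PFN mappings, ... */
        return pfn_to_page(pte_pfn(pte));

So an inlined fast path for the common case doesn't look unreasonable.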

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
  2026-02-19 15:00                               ` Pedro Falcato
  2026-02-19 15:29                                 ` David Hildenbrand (Arm)
@ 2026-02-20  4:12                                 ` Dev Jain
  1 sibling, 0 replies; 23+ messages in thread
From: Dev Jain @ 2026-02-20  4:12 UTC (permalink / raw)
  To: Pedro Falcato, David Hildenbrand (Arm)
  Cc: Luke Yang, surenb, jhladky, akpm, Liam.Howlett, willy, vbabka,
	linux-mm, linux-kernel


On 19/02/26 8:30 pm, Pedro Falcato wrote:
> On Thu, Feb 19, 2026 at 02:02:42PM +0100, David Hildenbrand (Arm) wrote:
>> On 2/19/26 13:15, Pedro Falcato wrote:
>>> On Wed, Feb 18, 2026 at 01:24:28PM +0100, David Hildenbrand (Arm) wrote:
>>>> On 2/18/26 12:58, Pedro Falcato wrote:
>>>>> I don't understand what you're looking for. an mprotect-based workload? those
>>>>> obviously don't really exist, apart from something like a JIT engine cranking
>>>>> out a lot of mprotect() calls in an aggressive fashion. Or perhaps some of that
>>>>> usage of mprotect that our DB friends like to use sometimes (discussed in
>>>>> $OTHER_CONTEXTS), though those are generally hugepages.
>>>>>
>>>> Anything besides a homemade micro-benchmark that highlights why we should
>>>> care about this exact fast and repeated sequence of events.
>>>>
>>>> I'm surprise that such a "large regression" does not show up in any other
>>>> non-home-made benchmark that people/bots are running. That's really what I
>>>> am questioning.
>>> I don't know, perhaps there isn't a will-it-scale test for this. That's
>>> alright. Even the standard will-it-scale and stress-ng tests people use
>>> to detect regressions usually have glaring problems and are insanely
>>> microbenchey.
>> My theory is that most heavy (high frequency where it would really hit performance)
>> mprotect users (like JITs) perform mprotect on very small ranges (e.g., single page),
>> where all the other overhead (syscall, TLB flush) dominates.
>>
>> That's why I was wondering which use cases that behave similar to the reproducer exist.
>>
>>>> Having that said, I'm all for optimizing it if there is a real problem
>>>> there.
>>>>
>>>>> I don't see how this can justify large performance regressions in a system
>>>>> call, for something every-architecture-not-named-arm64 does not have.
>>>> Take a look at the reported performance improvements on AMD with large
>>>> folios.
>>> Sure, but pte-mapped 2M folios is almost a worst-case (why not a PMD at that
>>> point...)
>> Well, 1M and all the way down will similarly benefit. 2M is just always the extreme case.
>>
>>>> The issue really is that small folios don't perform well, on any
>>>> architecture. But to detect large vs. small folios we need the ... folio.
>>>>
>>>> So once we optimize for small folios (== don't try to detect large folios)
>>>> we'll degrade large folios.
>>> I suspect it's not that huge of a deal. Worst case you can always provide a
>>> software PTE_CONT bit that would e.g be set when mapping a large folio. Or
>>> perhaps "if this pte has a PFN, and the next pte has PFN + 1, then we're
>>> probably in a large folio, thus do the proper batching stuff". I think that
>>> could satisfy everyone. There are heuristics we can use, and perhaps
>>> pte_batch_hint() does not need to be that simple and useless in the !arm64
>>> case then. I'll try to look into a cromulent solution for everyone.
>> Software bits are generally -ENOSPC, but maybe we are lucky on some architectures.
>>
>> We'd run into similar issues like aarch64 when shattering contiguity etc, so
>> there is quite some complexity too it that might not be worth it.
>>
>>> (shower thought: do we always get wins when batching large folios, or do these
>>> need to be of a significant order to get wins?)
>> For mprotect(), I don't know. For fork() and unmap() batching there was always a
>> win even with order-2 folios. (never measured order-1, because they don't apply to
>> anonymous memory)
>>
>> I assume for mprotect() it depends whether we really needed the folio before, or
>> whether it's just not required like for mremap().
>>
>>> But personally I would err on the side of small folios, like we did for mremap()
>>> a few months back.
>> The following (completely untested) might make most people happy by looking up
>> the folio only if (a) required or (b) if the architecture indicates that there is a large folio.
>>
>> I assume for some large folio use cases it might perform worse than before. But for
>> the write-upgrade case with large anon folios the performance improvement should remain.
>>
>> Not sure if some regression would remain for which we'd have to special-case the implementation
>> to take a separate path for nr_ptes == 1.
>>
>> Maybe you had something similar already:
>>
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index c0571445bef7..0b3856ad728e 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -211,6 +211,25 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
>>         commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
>>  }
>> +static bool mprotect_wants_folio_for_pte(unsigned long cp_flags, pte_t *ptep,
>> +               pte_t pte, unsigned long max_nr_ptes)
>> +{
>> +       /* NUMA hinting needs decide whether working on the folio is ok. */
>> +       if (cp_flags & MM_CP_PROT_NUMA)
>> +               return true;
>> +
>> +       /* We want the folio for possible write-upgrade. */
>> +       if (!pte_write(pte) && (cp_flags & MM_CP_TRY_CHANGE_WRITABLE))
>> +               return true;
>> +
>> +       /* There is nothing to batch. */
>> +       if (max_nr_ptes == 1)
>> +               return false;
>> +
>> +       /* For guaranteed large folios it's usually a win. */
>> +       return pte_batch_hint(ptep, pte) > 1;
>> +}
>> +
>>  static long change_pte_range(struct mmu_gather *tlb,
>>                 struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>                 unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -241,16 +260,18 @@ static long change_pte_range(struct mmu_gather *tlb,
>>                         const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
>>                         int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>                         struct folio *folio = NULL;
>> -                       struct page *page;
>> +                       struct page *page = NULL;
>>                         pte_t ptent;
>>                         /* Already in the desired state. */
>>                         if (prot_numa && pte_protnone(oldpte))
>>                                 continue;
>> -                       page = vm_normal_page(vma, addr, oldpte);
>> -                       if (page)
>> -                               folio = page_folio(page);
>> +                       if (mprotect_wants_folio_for_pte(cp_flags, pte, oldpte, max_nr_ptes)) {
>> +                               page = vm_normal_page(vma, addr, oldpte);
>> +                               if (page)
>> +                                       folio = page_folio(page);
>> +                       }
>>                         /*
>>                          * Avoid trapping faults against the zero or KSM
>>
> Yes, this is a better version than what I had, I'll take this hunk if you don't mind :)
> Note that it still doesn't handle large folios on !contpte architectures, which
> is partly the issue. I suspect some sort of PTE lookahead might work well in
> practice, aside from the issues where e.g two order-0 folios that are
> contiguous in memory are separately mapped.
>
> Though perhaps inlining vm_normal_folio() might also be interesting and side-step
> most of the issue. I'll play around with that.

Indeed, this is one option.

You can also experiment with
https://lore.kernel.org/all/20250506050056.59250-3-dev.jain@arm.com/
which approximates the presence of a large folio when the PFNs are contiguous.



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2026-02-20  4:12 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-13 15:08 [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad) Luke Yang
2026-02-13 15:47 ` David Hildenbrand (Arm)
2026-02-13 16:24   ` Pedro Falcato
2026-02-13 17:16     ` Suren Baghdasaryan
2026-02-13 17:26       ` David Hildenbrand (Arm)
2026-02-16 10:12         ` Dev Jain
2026-02-16 14:56           ` Pedro Falcato
2026-02-17 17:43           ` Luke Yang
2026-02-17 18:08             ` Pedro Falcato
2026-02-18  5:01               ` Dev Jain
2026-02-18 10:06                 ` Pedro Falcato
2026-02-18 10:38                   ` Dev Jain
2026-02-18 10:46                     ` David Hildenbrand (Arm)
2026-02-18 11:58                       ` Pedro Falcato
2026-02-18 12:24                         ` David Hildenbrand (Arm)
2026-02-19 12:15                           ` Pedro Falcato
2026-02-19 13:02                             ` David Hildenbrand (Arm)
2026-02-19 15:00                               ` Pedro Falcato
2026-02-19 15:29                                 ` David Hildenbrand (Arm)
2026-02-20  4:12                                 ` Dev Jain
2026-02-18 11:52                     ` Pedro Falcato
2026-02-18  4:50             ` Dev Jain
2026-02-18 13:29 ` David Hildenbrand (Arm)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox