linux-mm.kvack.org archive mirror
* [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
@ 2026-02-13 15:08 Luke Yang
  2026-02-13 15:47 ` David Hildenbrand (Arm)
  2026-02-18 13:29 ` David Hildenbrand (Arm)
  0 siblings, 2 replies; 23+ messages in thread
From: Luke Yang @ 2026-02-13 15:08 UTC (permalink / raw)
  To: dev.jain
  Cc: jhladky, akpm, Liam.Howlett, willy, surenb, vbabka, linux-mm,
	linux-kernel

Hello,

we have bisected a significant mprotect() performance regression in
6.17-rc1 to:

cac1db8c3aad ("mm: optimize mprotect() by PTE batching")

The regression becomes clearly visible starting around 400 KiB region
sizes and above. It is also still present in the latest 6.19 kernel.

## Test description

The reproducer repeatedly toggles protection (PROT_NONE <->
PROT_READ|PROT_WRITE) over a single mapped region in a tight loop. All
pages change protection in each iteration.

The benchmark sweeps region sizes from 4 KiB up to 40 GiB.

We bisected between 6.16 and 6.17-rc1 and confirmed that reverting
cac1db8c3aad on top of 6.17-rc1 largely restores the 6.16 performance
characteristics.

## perf observations

In 6.17-rc1, commit_anon_folio_batch() becomes hot and accounts for a
significant portion of cycles inside change_pte_range(). Instruction
count in change_pte_range() increases noticeably in 6.17-rc1.
commit_anon_folio_batch() was added as part of cac1db8c3aad.

The regression is also present on the following systems: AMD EPYC 2nd gen
(Rome), AMD EPYC 3rd gen (Milan), AMD EPYC 3rd gen (Milan-X), AMD EPYC
4th gen (Bergamo, Zen 4c), Ampere Altra "Mt. Snow" as a KVM guest (Arm
Neoverse N1), and a Lenovo ThinkPad T460p (Intel Skylake i7-6820HQ).

## Results (nsec per mprotect() call), collected on an AMD EPYC Zen3 (Milan) server

v6.16
size_kib | nsec_per_call
4        | 1713
40       | 2071
400      | 3453
4000     | 18804
40000    | 172613
400000   | 1699301
4000000  | 17021882
40000000 | 169677478

v6.17-rc1
size_kib | nsec_per_call
4        | 1775
40       | 2362
400      | 5993
4000     | 44116
40000    | 427731
400000   | 4252714
4000000  | 42512805
40000000 | 424995500

v6.17-rc1 with cac1db8c3aad reverted
size_kib | nsec_per_call
4        | 1750
40       | 2126
400      | 3800
4000     | 22227
40000    | 205446
400000   | 2011634
4000000  | 20144468
40000000 | 200764472

This workload appears to be the worst case for the new batching logic,
where batching overhead dominates, and no amortization benefit is
achieved.
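For reference, the per-call slowdown of v6.17-rc1 relative to v6.16 can be
computed directly from the two tables above (a plain awk one-liner; not part
of the reproducer itself):

```shell
# Ratio of nsec_per_call, v6.17-rc1 / v6.16, for each region size in the
# tables above (4 KiB ... 40 GiB, in order).
ratios=$(awk 'BEGIN {
    n = split("1713 2071 3453 18804 172613 1699301 17021882 169677478", v616)
    split("1775 2362 5993 44116 427731 4252714 42512805 424995500", v617)
    for (i = 1; i <= n; i++) printf "%.2fx\n", v617[i] / v616[i]
}')
echo "$ratios"
```

The ratio climbs from ~1.04x at 4 KiB through ~1.74x at 400 KiB and
saturates around 2.5x from 4 MiB upwards, consistent with the 2x+ slowdown
in the subject line.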

The following minimal reproducers are included below:

* mprot_tw4m_regsize_sweep_one_region.sh
* mprot_tw4m_regsize.c

Please let us know if additional data would be useful.

Reported-by: Luke Yang <luyang@redhat.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>

Thank you
Luke

Reproducers
-----------

mprot_tw4m_regsize_sweep_one_region.sh
--- cut here ---
#!/bin/bash
gcc -Wall -Wextra -O1 -o mprot_tw4m_regsize mprot_tw4m_regsize.c
if ! [ -x "./mprot_tw4m_regsize" ]; then
 echo "No ./mprot_tw4m_regsize binary, compilation failed?"
 exit 1
fi

DIR="$(date '+%Y-%b-%d_%Hh%Mm%Ss')_$(uname -r)"
mkdir -p "$DIR"

# Sweep region size from 4K to 4G (10x each step), 1 region.
# Iterations decrease by 10x to keep runtime roughly constant.
#   size_kib   iterations
runs=(
   "4          40000000"
   "40         4000000"
   "400        400000"
   "4000       40000"
   "40000      4000"
   "400000     400"
   "4000000    40"
   "40000000   4"
)

for entry in "${runs[@]}"; do
   read -r size_kib iters <<< "$entry"
   logfile="$DIR/regsize_${size_kib}k.log"
   echo "=== Region size: ${size_kib} KiB, iterations: ${iters} ==="
   sync; sync
   echo 3 > /proc/sys/vm/drop_caches  # requires root
   taskset -c 0 ./mprot_tw4m_regsize "$size_kib" 1 "$iters" 2>&1 | tee "$logfile"
   echo ""
done

# Create CSV summary from log files
csv="$DIR/summary.csv"
echo "size_kib,runtime_sec,nsec_per_call" > "$csv"
for entry in "${runs[@]}"; do
   read -r size_kib _ <<< "$entry"
   logfile="$DIR/regsize_${size_kib}k.log"
   runtime=$(grep -oP 'Runtime: \K[0-9.]+' "$logfile")
   nsec=$(grep -oP 'Avg: \K[0-9.]+(?= nsec/call)' "$logfile")
   echo "${size_kib},${runtime},${nsec}" >> "$csv"
done

echo "Results saved in $DIR/"
echo "CSV summary:"
cat "$csv"
--- cut here ---

mprot_tw4m_regsize.c
--- cut here ---
/*
* Reproduce libmicro mprot_tw4m benchmark - Time mprotect() with configurable region size
* gcc -Wall -Wextra -O1 mprot_tw4m_regsize.c -o mprot_tw4m_regsize
* DEBUG: gcc -Wall -Wextra -g -fsanitize=undefined -O1 mprot_tw4m_regsize.c -o mprot_tw4m_regsize
* ./mprot_tw4m_regsize <region_size_kib> <region_count> <iterations>
*/

#include <sys/mman.h>
#include <sys/types.h>  /* u_int64_t / u_int32_t */
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

typedef volatile char vchar_t;

/* TSC read helpers, x86-64 only (RDTSC/RDTSCP). */
static __inline__ u_int64_t start_clock(void);
static __inline__ u_int64_t stop_clock(void);

int main(int argc, char **argv)
{
   int i, j, ret;
   long long k;

   if (argc < 4) {
       printf("USAGE: %s region_size_kib region_count iterations\n", argv[0]);
       printf("Creates multiple regions and times mprotect() calls\n");
       return 1;
   }

   long region_size = atol(argv[1]) * 1024L;
   int region_count = atoi(argv[2]);
   int iterations = atoi(argv[3]);

   int pagesize = sysconf(_SC_PAGESIZE);

   vchar_t **regions = malloc(region_count * sizeof(vchar_t*));
   if (!regions) {
       perror("malloc");
       return 1;
   }

   for (i = 0; i < region_count; i++) {
       regions[i] = (vchar_t *) mmap(NULL, region_size,
                     PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0L);

       if (regions[i] == MAP_FAILED) {
           perror("mmap");
           exit(1);
       }

       for (k = 0; k < region_size; k += pagesize) {
           regions[i][k] = 1;
       }
   }

   printf("Created %d regions of %ldKiB each. Starting %d mprotect operations per region...\n",
          region_count, region_size / 1024, iterations);

   struct timespec start_time, end_time;
   clock_gettime(CLOCK_MONOTONIC, &start_time);
   u_int64_t start_rdtsc = start_clock();

   for (j = 0; j < iterations; j++) {
       for (i = 0; i < region_count; i++) {
           int prot;

           if ((i + j) % 2 == 0) {
               prot = PROT_NONE;
           } else {
               prot = PROT_READ | PROT_WRITE;
           }

           ret = mprotect((void *)regions[i], region_size, prot);
           if (ret != 0) {
               perror("mprotect");
               printf("mprotect error at region %d, iteration %d\n", i, j);
           }
       }
   }

   u_int64_t stop_rdtsc = stop_clock();
   clock_gettime(CLOCK_MONOTONIC, &end_time);
   u_int64_t diff = stop_rdtsc - start_rdtsc;

   long total_calls = (long)region_count * iterations;
   double runtime_sec = (end_time.tv_sec - start_time.tv_sec) +
                       (end_time.tv_nsec - start_time.tv_nsec) / 1000000000.0;

   double nsec_per_call = (runtime_sec * 1e9) / total_calls;

   printf("TSC for %ld mprotect calls on %d x %ldKiB regions: %lu K-cycles.  Avg: %g K-cycles/call\n",
          total_calls,
          region_count,
          region_size / 1024,
          (unsigned long)(diff / 1000),
          (double)diff / (double)total_calls / 1000.0);
   printf("Runtime: %.6f seconds.  Avg: %.3f nsec/call\n", runtime_sec, nsec_per_call);

   for (i = 0; i < region_count; i++) {
       munmap((void *)regions[i], region_size);
   }
   free(regions);

   return 0;
}

static __inline__ u_int64_t start_clock(void) {
   // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
   u_int32_t hi, lo;
   __asm__ __volatile__ (
       "CPUID\n\t"
       "RDTSC\n\t"
       "mov %%edx, %0\n\t"
       "mov %%eax, %1\n\t": "=r" (hi), "=r" (lo)::
       "%rax", "%rbx", "%rcx", "%rdx");
   return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
}

static __inline__ u_int64_t stop_clock(void) {
   // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
   u_int32_t hi, lo;
   __asm__ __volatile__(
       "RDTSCP\n\t"
       "mov %%edx, %0\n\t"
       "mov %%eax, %1\n\t"
       "CPUID\n\t": "=r" (hi), "=r" (lo)::
       "%rax", "%rbx", "%rcx", "%rdx");
   return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
}
--- cut here ---




Thread overview: 23+ messages
2026-02-13 15:08 [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad) Luke Yang
2026-02-13 15:47 ` David Hildenbrand (Arm)
2026-02-13 16:24   ` Pedro Falcato
2026-02-13 17:16     ` Suren Baghdasaryan
2026-02-13 17:26       ` David Hildenbrand (Arm)
2026-02-16 10:12         ` Dev Jain
2026-02-16 14:56           ` Pedro Falcato
2026-02-17 17:43           ` Luke Yang
2026-02-17 18:08             ` Pedro Falcato
2026-02-18  5:01               ` Dev Jain
2026-02-18 10:06                 ` Pedro Falcato
2026-02-18 10:38                   ` Dev Jain
2026-02-18 10:46                     ` David Hildenbrand (Arm)
2026-02-18 11:58                       ` Pedro Falcato
2026-02-18 12:24                         ` David Hildenbrand (Arm)
2026-02-19 12:15                           ` Pedro Falcato
2026-02-19 13:02                             ` David Hildenbrand (Arm)
2026-02-19 15:00                               ` Pedro Falcato
2026-02-19 15:29                                 ` David Hildenbrand (Arm)
2026-02-20  4:12                                 ` Dev Jain
2026-02-18 11:52                     ` Pedro Falcato
2026-02-18  4:50             ` Dev Jain
2026-02-18 13:29 ` David Hildenbrand (Arm)
