From: Luke Yang <luyang@redhat.com>
To: dev.jain@arm.com
Cc: jhladky@redhat.com, akpm@linux-foundation.org,
Liam.Howlett@oracle.com, willy@infradead.org, surenb@google.com,
vbabka@suse.cz, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
Date: Fri, 13 Feb 2026 10:08:14 -0500
Message-ID: <aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb>

Hello,

We have bisected a significant mprotect() performance regression in
6.17-rc1 to:

  cac1db8c3aad ("mm: optimize mprotect() by PTE batching")

The regression becomes clearly visible at region sizes of roughly
400 KiB and above, and it is still present in the latest 6.19 kernel.

## Test description
The reproducer repeatedly toggles protection (PROT_NONE <->
PROT_READ|PROT_WRITE) over a single mapped region in a tight loop. All
pages change protection in each iteration.
The benchmark sweeps region sizes from 4 KiB up to 40 GiB.
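
For a quick single data point, the reproducer can also be run directly,
pinned to one CPU (the size/iteration count below is just one entry from
the sweep driven by the attached script):

  $ gcc -Wall -Wextra -O1 -o mprot_tw4m_regsize mprot_tw4m_regsize.c
  $ taskset -c 0 ./mprot_tw4m_regsize 4000 1 40000
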
We bisected between 6.16 and 6.17-rc1 and confirmed that reverting
cac1db8c3aad on top of 6.17-rc1 largely restores the 6.16 performance
characteristics.
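
For reference, the bisection and the revert check were done roughly as
follows (exact build/boot steps and any trivial conflict resolution
omitted):

  $ git bisect start v6.17-rc1 v6.16
    ... build, boot and run the sweep at each step ...
  $ git checkout v6.17-rc1
  $ git revert cac1db8c3aad
    ... rebuild, boot and rerun the sweep ...
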
## perf observations
In 6.17-rc1, commit_anon_folio_batch() becomes hot and accounts for a
significant portion of cycles inside change_pte_range(). Instruction
count in change_pte_range() increases noticeably in 6.17-rc1.
commit_anon_folio_batch() was added as part of cac1db8c3aad.
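
The profiles were collected roughly along these lines (exact perf
options may have differed between runs; annotating kernel symbols
needs debug info):

  $ perf record -g -- taskset -c 0 ./mprot_tw4m_regsize 400000 1 400
  $ perf report --no-children
  $ perf annotate change_pte_range
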
The regression is also present on the following systems: AMD EPYC 2
(Rome), AMD EPYC 3 (Milan), AMD EPYC 3 (Milan-X), AMD EPYC 4 (Zen 4c
Bergamo), Ampere Altra Mt. Snow with KVM virtualization (Arm
Neoverse-N1), and a Lenovo ThinkPad T460p (Intel Skylake 6820HQ).

## Results (nsec per mprotect call), collected on an AMD EPYC Zen 3 (Milan) server

v6.16
size_kib | nsec_per_call
       4 |          1713
      40 |          2071
     400 |          3453
    4000 |         18804
   40000 |        172613
  400000 |       1699301
 4000000 |      17021882
40000000 |     169677478

v6.17-rc1
size_kib | nsec_per_call
       4 |          1775
      40 |          2362
     400 |          5993
    4000 |         44116
   40000 |        427731
  400000 |       4252714
 4000000 |      42512805
40000000 |     424995500

v6.17-rc1 with cac1db8c3aad reverted
size_kib | nsec_per_call
       4 |          1750
      40 |          2126
     400 |          3800
    4000 |         22227
   40000 |        205446
  400000 |       2011634
 4000000 |      20144468
40000000 |     200764472

This workload appears to be a worst case for the new batching logic:
the batching overhead dominates and no amortization benefit is
realized. At the 400000 KiB point, for example, the per-call cost
roughly doubles compared to 6.17-rc1 with the commit reverted
(~2.01 ms -> ~4.25 ms) and is ~2.5x the v6.16 cost (~1.70 ms).

The minimal reproducers are appended below:
* mprot_tw4m_regsize_sweep_one_region.sh
* mprot_tw4m_regsize.c
Please let us know if additional data would be useful.
Reported-by: Luke Yang <luyang@redhat.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>

Thank you,
Luke

Reproducer
----------
mprot_tw4m_regsize_sweep_one_region.sh
--- cut here ---
#!/bin/bash
gcc -Wall -Wextra -O1 -o mprot_tw4m_regsize mprot_tw4m_regsize.c
if ! [ -x "./mprot_tw4m_regsize" ]; then
    echo "No ./mprot_tw4m_regsize binary, compilation failed?"
    exit 1
fi
DIR="$(date '+%Y-%b-%d_%Hh%Mm%Ss')_$(uname -r)"
mkdir -p "$DIR"
# Sweep region size from 4 KiB to 40 GiB (10x each step), 1 region.
# Iterations decrease by 10x to keep runtime roughly constant.
# size_kib iterations
runs=(
    "4 40000000"
    "40 4000000"
    "400 400000"
    "4000 40000"
    "40000 4000"
    "400000 400"
    "4000000 40"
    "40000000 4"
)
for entry in "${runs[@]}"; do
    read -r size_kib iters <<< "$entry"
    logfile="$DIR/regsize_${size_kib}k.log"
    echo "=== Region size: ${size_kib} KiB, iterations: ${iters} ==="
    sync; sync
    # Dropping caches needs root; skip it when not permitted.
    if [ -w /proc/sys/vm/drop_caches ]; then
        echo 3 > /proc/sys/vm/drop_caches
    fi
    taskset -c 0 ./mprot_tw4m_regsize "$size_kib" 1 "$iters" 2>&1 | tee "$logfile"
    echo ""
done
# Create CSV summary from log files
csv="$DIR/summary.csv"
echo "size_kib,runtime_sec,nsec_per_call" > "$csv"
for entry in "${runs[@]}"; do
    read -r size_kib _ <<< "$entry"
    logfile="$DIR/regsize_${size_kib}k.log"
    runtime=$(grep -oP 'Runtime: \K[0-9.]+' "$logfile")
    nsec=$(grep -oP 'Avg: \K[0-9.]+(?= nsec/call)' "$logfile")
    echo "${size_kib},${runtime},${nsec}" >> "$csv"
done
echo "Results saved in $DIR/"
echo "CSV summary:"
cat "$csv"
--- cut here ---
mprot_tw4m_regsize.c
--- cut here ---
/*
* Reproduce libmicro mprot_tw4m benchmark - Time mprotect() with configurable region size
* gcc -Wall -Wextra -O1 mprot_tw4m_regsize.c -o mprot_tw4m_regsize
* DEBUG: gcc -Wall -Wextra -g -fsanitize=undefined -O1 mprot_tw4m_regsize.c -o mprot_tw4m_regsize
* ./mprot_tw4m_regsize <region_size_kib> <region_count> <iterations>
*/
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <strings.h>
#include <time.h>

typedef volatile char vchar_t;

static __inline__ u_int64_t start_clock(void);
static __inline__ u_int64_t stop_clock(void);

int main(int argc, char **argv)
{
    int i, j, ret;
    long long k;

    if (argc < 4) {
        printf("USAGE: %s region_size_kib region_count iterations\n", argv[0]);
        printf("Creates multiple regions and times mprotect() calls\n");
        return 1;
    }

    long region_size = atol(argv[1]) * 1024L;
    int region_count = atoi(argv[2]);
    int iterations = atoi(argv[3]);
    int pagesize = sysconf(_SC_PAGESIZE);

    vchar_t **regions = malloc(region_count * sizeof(vchar_t *));
    if (!regions) {
        perror("malloc");
        return 1;
    }

    for (i = 0; i < region_count; i++) {
        regions[i] = (vchar_t *) mmap(NULL, region_size,
                PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0L);
        if (regions[i] == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
        /* Touch every page so the region is fully populated before timing. */
        for (k = 0; k < region_size; k += pagesize) {
            regions[i][k] = 1;
        }
    }

printf("Created %d regions of %ldKiB each. Starting %d mprotect operations per region...\n",
region_count, region_size / 1024, iterations);
struct timespec start_time, end_time;
clock_gettime(CLOCK_MONOTONIC, &start_time);
u_int64_t start_rdtsc = start_clock();
for (j = 0; j < iterations; j++) {
for (i = 0; i < region_count; i++) {
int prot;
if ((i + j) % 2 == 0) {
prot = PROT_NONE;
} else {
prot = PROT_READ | PROT_WRITE;
}
ret = mprotect((void *)regions[i], region_size, prot);
if (ret != 0) {
perror("mprotect");
printf("mprotect error at region %d, iteration %d\n", i, j);
}
}
}
u_int64_t stop_rdtsc = stop_clock();
clock_gettime(CLOCK_MONOTONIC, &end_time);
u_int64_t diff = stop_rdtsc - start_rdtsc;
long total_calls = (long)region_count * iterations;
double runtime_sec = (end_time.tv_sec - start_time.tv_sec) +
(end_time.tv_nsec - start_time.tv_nsec) / 1000000000.0;
double nsec_per_call = (runtime_sec * 1e9) / total_calls;
printf("TSC for %ld mprotect calls on %d x %ldKiB regions: %ld K-cycles. Avg: %g K-cycles/call\n",
total_calls,
region_count,
region_size / 1024,
diff/1000,
((double)(diff)/(double)(total_calls))/1000.0);
printf("Runtime: %.6f seconds. Avg: %.3f nsec/call\n", runtime_sec, nsec_per_call);
for (i = 0; i < region_count; i++) {
munmap((void *)regions[i], region_size);
}
free(regions);
return 0;
}
/*
 * x86-only TSC helpers (CPUID/RDTSC/RDTSCP); the reported Runtime and
 * nsec/call numbers come from clock_gettime() above.
 */
static __inline__ u_int64_t start_clock(void) {
    // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
    u_int32_t hi, lo;

    __asm__ __volatile__ (
        "CPUID\n\t"
        "RDTSC\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t": "=r" (hi), "=r" (lo)::
        "%rax", "%rbx", "%rcx", "%rdx");
    return ((u_int64_t)lo) | (((u_int64_t)hi) << 32);
}

static __inline__ u_int64_t stop_clock(void) {
    // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
    u_int32_t hi, lo;

    __asm__ __volatile__(
        "RDTSCP\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        "CPUID\n\t": "=r" (hi), "=r" (lo)::
        "%rax", "%rbx", "%rcx", "%rdx");
    return ((u_int64_t)lo) | (((u_int64_t)hi) << 32);
}
--- cut here ---