Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
@ 2025-08-29 14:30 Uschakow, Stanislav
  2025-09-01 10:58 ` Jann Horn
  0 siblings, 1 reply; 35+ messages in thread
From: Uschakow, Stanislav @ 2025-08-29 14:30 UTC (permalink / raw)
  To: linux-mm
  Cc: trix, ndesaulniers, nathan, akpm, muchun.song, mike.kravetz,
	jannh, lorenzo.stoakes, liam.howlett, muchun.song, osalvador,
	vbabka, stable, akpm, jannh

Hello.

We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.

Comparing the a kernel without this patch with a kernel with this patch applied when spawning 1000 children we see those execution times:


Patched kernel: 
$ time make stress
...
real    0m11.275s
user    0m0.177s
sys     0m23.905s

Original kernel : 

$ time make stress
...real    0m2.475s
user    0m1.398s
sys     0m2.501s


The patch in question: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=1013af4f585fccc4d3e5c5824d174de2257f7d6d


My observation/assumption is:

each child touches 100 random pages and despawns
on each despawn `huge_pmd_unshare()` is called
each call to `huge_pmd_unshare()` syncrhonizes all threads using `tlb_remove_table_sync_one()` leading to the regression



I'm happy to provide more information.




Thank you
Stanislav Uschakow








=== Reproducer ===

Setup:


#!/bin/bash
echo "Setting up hugepages for reproduction..."

# hugepages (1.2TB / 2MB = 614400 pages)
REQUIRED_PAGES=614400

# Check current hugepage allocation
CURRENT_PAGES=$(cat /proc/sys/vm/nr_hugepages)
echo "Current hugepages: $CURRENT_PAGES"

if [ "$CURRENT_PAGES" -lt "$REQUIRED_PAGES" ]; then
    echo "Allocating $REQUIRED_PAGES hugepages..."
    echo $REQUIRED_PAGES | sudo tee /proc/sys/vm/nr_hugepages

    ALLOCATED=$(cat /proc/sys/vm/nr_hugepages)
    echo "Allocated hugepages: $ALLOCATED"
    
    if [ "$ALLOCATED" -lt "$REQUIRED_PAGES" ]; then
        echo "Warning: Could not allocate all required hugepages"
        echo "Available: $ALLOCATED, Required: $REQUIRED_PAGES"
    fi
fi

echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

echo -e "\nHugepage information:"
cat /proc/meminfo | grep -i huge

echo -e "\nSetup complete. You can now run the reproduction test."



Makefile:


CXX = gcc
CXXFLAGS = -O2 -Wall
TARGET = hugepage_repro
SOURCE = hugepage_repro.c

$(TARGET): $(SOURCE)
    $(CXX) $(CXXFLAGS) -o $(TARGET) $(SOURCE)

clean:
    rm -f $(TARGET)

setup:
    chmod +x setup_hugepages.sh
    ./setup_hugepages.sh

test: $(TARGET)
    ./$(TARGET) 20 3

stress: $(TARGET)
    ./$(TARGET) 1000 1

.PHONY: clean setup test stress



hugepage_repro.c:


#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <stdio.h>

#define HUGEPAGE_SIZE (2 * 1024 * 1024) // 2MB
#define TOTAL_SIZE (1200ULL * 1024 * 1024 * 1024) // 1.2TB
#define NUM_HUGEPAGES (TOTAL_SIZE / HUGEPAGE_SIZE)

void* create_hugepage_mapping() {
    void* addr = mmap(NULL, TOTAL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap hugepages failed");
        exit(1);
    }
    return addr;
}

void touch_random_pages(void* addr, int num_touches) {
    char* base = (char*)addr;
    for (int i = 0; i < num_touches; ++i) {
        size_t offset = (rand() % NUM_HUGEPAGES) * HUGEPAGE_SIZE;
        volatile char val = base[offset];
        (void)val;
    }
}

void child_process(void* shared_mem, int child_id) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    touch_random_pages(shared_mem, 100);
    
    clock_gettime(CLOCK_MONOTONIC, &end);
    long duration = (end.tv_sec - start.tv_sec) * 1000000 + 
                   (end.tv_nsec - start.tv_nsec) / 1000;
    
    printf("Child %d completed in %ld μs\n", child_id, duration);
}

int main(int argc, char* argv[]) {
    int num_processes = argc > 1 ? atoi(argv[1]) : 50;
    int iterations = argc > 2 ? atoi(argv[2]) : 5;
    
    printf("Creating %lluGB hugepage mapping...\n", TOTAL_SIZE / (1024*1024*1024));
    void* shared_mem = create_hugepage_mapping();
    
    for (int iter = 0; iter < iterations; ++iter) {
        printf("\nIteration %d: Forking %d processes\n", iter + 1, num_processes);
        
        pid_t children[num_processes];
        struct timespec iter_start, iter_end;
        clock_gettime(CLOCK_MONOTONIC, &iter_start);
        
        for (int i = 0; i < num_processes; ++i) {
            pid_t pid = fork();
            if (pid == 0) {
                child_process(shared_mem, i);
                exit(0);
            } else if (pid > 0) {
                children[i] = pid;
            }
        }
        
        for (int i = 0; i < num_processes; ++i) {
            waitpid(children[i], NULL, 0);
        }
        
        clock_gettime(CLOCK_MONOTONIC, &iter_end);
        long iter_duration = (iter_end.tv_sec - iter_start.tv_sec) * 1000 + 
                            (iter_end.tv_nsec - iter_start.tv_nsec) / 1000000;
        printf("Iteration completed in %ld ms\n", iter_duration);
    }
    
    munmap(shared_mem, TOTAL_SIZE);
    return 0;
}




Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-08-29 14:30 Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race Uschakow, Stanislav
@ 2025-09-01 10:58 ` Jann Horn
  2025-09-01 11:26   ` David Hildenbrand
  2025-10-09  7:40   ` David Hildenbrand
  0 siblings, 2 replies; 35+ messages in thread
From: Jann Horn @ 2025-09-01 10:58 UTC (permalink / raw)
  To: Uschakow, Stanislav
  Cc: linux-mm, trix, ndesaulniers, nathan, akpm, muchun.song,
	mike.kravetz, lorenzo.stoakes, liam.howlett, osalvador, vbabka,
	stable

Hi!

On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
> We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.

Yeah, every 1G virtual address range you unshare on unmap will do an
extra synchronous IPI broadcast to all CPU cores, so it's not very
surprising that doing this would be a bit slow on a machine with 196
cores.

> My observation/assumption is:
>
> each child touches 100 random pages and despawns
> on each despawn `huge_pmd_unshare()` is called
> each call to `huge_pmd_unshare()` syncrhonizes all threads using `tlb_remove_table_sync_one()` leading to the regression

Yeah, makes sense that that'd be slow.

There are probably several ways this could be optimized - like maybe
changing tlb_remove_table_sync_one() to rely on the MM's cpumask
(though that would require thinking about whether this interacts with
remote MM access somehow), or batching the refcount drops for hugetlb
shared page tables through something like struct mmu_gather, or doing
something special for the unmap path, or changing the semantics of
hugetlb page tables such that they can never turn into normal page
tables again. However, I'm not planning to work on optimizing this.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-09-01 10:58 ` Jann Horn
@ 2025-09-01 11:26   ` David Hildenbrand
  2025-09-04 12:39     ` Uschakow, Stanislav
  2025-10-08 22:54     ` Prakash Sangappa
  2025-10-09  7:40   ` David Hildenbrand
  1 sibling, 2 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-09-01 11:26 UTC (permalink / raw)
  To: Jann Horn, Uschakow, Stanislav
  Cc: linux-mm, trix, ndesaulniers, nathan, akpm, muchun.song,
	mike.kravetz, lorenzo.stoakes, liam.howlett, osalvador, vbabka,
	stable

On 01.09.25 12:58, Jann Horn wrote:
> Hi!
> 
> On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
>> We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.
> 
> Yeah, every 1G virtual address range you unshare on unmap will do an
> extra synchronous IPI broadcast to all CPU cores, so it's not very
> surprising that doing this would be a bit slow on a machine with 196
> cores.

What is the use case for this extreme usage of fork() in that context? 
Is it just something people noticed and it's suboptimal, or is this a 
real problem for some use cases?

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-09-01 11:26   ` David Hildenbrand
@ 2025-09-04 12:39     ` Uschakow, Stanislav
  2025-10-08 22:54     ` Prakash Sangappa
  1 sibling, 0 replies; 35+ messages in thread
From: Uschakow, Stanislav @ 2025-09-04 12:39 UTC (permalink / raw)
  To: David Hildenbrand, Jann Horn
  Cc: linux-mm, trix, ndesaulniers, nathan, akpm, muchun.song,
	mike.kravetz, lorenzo.stoakes, liam.howlett, osalvador, vbabka,
	stable, gregkh, linux-kernel

Hi David,

> From: David Hildenbrand <david@redhat.com>
> Sent: Monday, September 1, 2025 1:26 PM
> To: Jann Horn; Uschakow, Stanislav
> Cc: linux-mm@kvack.org; trix@redhat.com; ndesaulniers@google.com; nathan@kernel.org; akpm@linux-foundation.org; muchun.song@linux.dev; mike.kravetz@oracle.com; lorenzo.stoakes@oracle.com; liam.howlett@oracle.com; osalvador@suse.de; vbabka@suse.cz; stable@vger.kernel.org
> Subject: RE: [EXTERNAL] Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
>     
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On 01.09.25 12:58, Jann Horn wrote:
> > Hi!
> >
> > On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
> >> We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified  mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.
> >
> > Yeah, every 1G virtual address range you unshare on unmap will do an
> > extra synchronous IPI broadcast to all CPU cores, so it's not very
> > surprising that doing this would be a bit slow on a machine with 196
> > cores.
> 
> What is the use case for this extreme usage of fork() in that context?
> Is it just something people noticed and it's suboptimal, or is this a
> real problem for some use cases?
> 

Yes, we have customer reporting huge performance regressions on their workloads. I don't know the software architecture or actual use case for their application though. A execution time increase of at least a factor of 4 is noticeable even with few forks() on those machines.

> --
> Cheers
> 
> David / dhildenb


Thanks

Stanislav

    



Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-09-01 11:26   ` David Hildenbrand
  2025-09-04 12:39     ` Uschakow, Stanislav
@ 2025-10-08 22:54     ` Prakash Sangappa
  2025-10-09  7:23       ` David Hildenbrand
  1 sibling, 1 reply; 35+ messages in thread
From: Prakash Sangappa @ 2025-10-08 22:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jann Horn, Uschakow, Stanislav, linux-mm, trix, ndesaulniers,
	nathan, akpm, muchun.song, mike.kravetz, Lorenzo Stoakes,
	Liam Howlett, osalvador, vbabka, stable



> On Sep 1, 2025, at 4:26 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 01.09.25 12:58, Jann Horn wrote:
>> Hi!
>> On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
>>> We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.
>> Yeah, every 1G virtual address range you unshare on unmap will do an
>> extra synchronous IPI broadcast to all CPU cores, so it's not very
>> surprising that doing this would be a bit slow on a machine with 196
>> cores.
> 
> What is the use case for this extreme usage of fork() in that context? Is it just something people noticed and it's suboptimal, or is this a real problem for some use cases?

Our DB team is reporting performance issues due to this change. While running TPCC,  Database 
timeouts & shuts down(crashes). This is seen when there are a large number of 
processes(thousands) involved. It is not so prominent when there are lesser number of 
processes. 

Backing out this change addresses the problem.

-Prakash

> 
> -- 
> Cheers
> 
> David / dhildenb
> 
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-08 22:54     ` Prakash Sangappa
@ 2025-10-09  7:23       ` David Hildenbrand
  2025-10-09 15:06         ` Prakash Sangappa
  0 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand @ 2025-10-09  7:23 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Jann Horn, Uschakow, Stanislav, linux-mm, trix, ndesaulniers,
	nathan, akpm, muchun.song, mike.kravetz, Lorenzo Stoakes,
	Liam Howlett, osalvador, vbabka, stable

On 09.10.25 00:54, Prakash Sangappa wrote:
> 
> 
>> On Sep 1, 2025, at 4:26 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 01.09.25 12:58, Jann Horn wrote:
>>> Hi!
>>> On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
>>>> We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.
>>> Yeah, every 1G virtual address range you unshare on unmap will do an
>>> extra synchronous IPI broadcast to all CPU cores, so it's not very
>>> surprising that doing this would be a bit slow on a machine with 196
>>> cores.
>>
>> What is the use case for this extreme usage of fork() in that context? Is it just something people noticed and it's suboptimal, or is this a real problem for some use cases?
> 
> Our DB team is reporting performance issues due to this change. While running TPCC,  Database
> timeouts & shuts down(crashes). This is seen when there are a large number of
> processes(thousands) involved. It is not so prominent when there are lesser number of
> processes.
> 
> Backing out this change addresses the problem.

I suspect the timeouts are due to fork() taking longer, and there is no 
kernel crash etc, right?

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-09-01 10:58 ` Jann Horn
  2025-09-01 11:26   ` David Hildenbrand
@ 2025-10-09  7:40   ` David Hildenbrand
  2025-10-09  8:19     ` David Hildenbrand
                       ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-10-09  7:40 UTC (permalink / raw)
  To: Jann Horn, Uschakow, Stanislav
  Cc: linux-mm, trix, ndesaulniers, nathan, akpm, muchun.song,
	mike.kravetz, lorenzo.stoakes, liam.howlett, osalvador, vbabka,
	stable

On 01.09.25 12:58, Jann Horn wrote:
> Hi!
> 
> On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
>> We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.
> 
> Yeah, every 1G virtual address range you unshare on unmap will do an
> extra synchronous IPI broadcast to all CPU cores, so it's not very
> surprising that doing this would be a bit slow on a machine with 196
> cores.
> 
>> My observation/assumption is:
>>
>> each child touches 100 random pages and despawns
>> on each despawn `huge_pmd_unshare()` is called
>> each call to `huge_pmd_unshare()` syncrhonizes all threads using `tlb_remove_table_sync_one()` leading to the regression
> 
> Yeah, makes sense that that'd be slow.
> 
> There are probably several ways this could be optimized - like maybe
> changing tlb_remove_table_sync_one() to rely on the MM's cpumask
> (though that would require thinking about whether this interacts with
> remote MM access somehow), or batching the refcount drops for hugetlb
> shared page tables through something like struct mmu_gather, or doing
> something special for the unmap path, or changing the semantics of
> hugetlb page tables such that they can never turn into normal page
> tables again. However, I'm not planning to work on optimizing this.

I'm currently looking at the fix and what sticks out is "Fix it with an 
explicit broadcast IPI through tlb_remove_table_sync_one()".

(I don't understand how the page table can be used for "normal, 
non-hugetlb". I could only see how it is used for the remaining user for 
hugetlb stuff, but that's different question)

How does the fix work when an architecture does not issue IPIs for TLB 
shootdown? To handle gup-fast on these architectures, we use RCU.

So I'm wondering whether we use RCU somehow.

But note that in gup_fast_pte_range(), we are validating whether the PMD 
changed:

if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
     unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
	gup_put_folio(folio, 1, flags);
	goto pte_unmap;
}


So in case the page table got reused in the meantime, we should just 
back off and be fine, right?

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-09  7:40   ` David Hildenbrand
@ 2025-10-09  8:19     ` David Hildenbrand
  2025-10-16  9:21     ` Lorenzo Stoakes
  2025-10-16 18:44     ` Jann Horn
  2 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-10-09  8:19 UTC (permalink / raw)
  To: Jann Horn, Uschakow, Stanislav
  Cc: linux-mm, trix, ndesaulniers, nathan, akpm, muchun.song,
	mike.kravetz, lorenzo.stoakes, liam.howlett, osalvador, vbabka,
	stable

On 09.10.25 09:40, David Hildenbrand wrote:
> On 01.09.25 12:58, Jann Horn wrote:
>> Hi!
>>
>> On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
>>> We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.
>>
>> Yeah, every 1G virtual address range you unshare on unmap will do an
>> extra synchronous IPI broadcast to all CPU cores, so it's not very
>> surprising that doing this would be a bit slow on a machine with 196
>> cores.
>>
>>> My observation/assumption is:
>>>
>>> each child touches 100 random pages and despawns
>>> on each despawn `huge_pmd_unshare()` is called
>>> each call to `huge_pmd_unshare()` syncrhonizes all threads using `tlb_remove_table_sync_one()` leading to the regression
>>
>> Yeah, makes sense that that'd be slow.
>>
>> There are probably several ways this could be optimized - like maybe
>> changing tlb_remove_table_sync_one() to rely on the MM's cpumask
>> (though that would require thinking about whether this interacts with
>> remote MM access somehow), or batching the refcount drops for hugetlb
>> shared page tables through something like struct mmu_gather, or doing
>> something special for the unmap path, or changing the semantics of
>> hugetlb page tables such that they can never turn into normal page
>> tables again. However, I'm not planning to work on optimizing this.
> 
> I'm currently looking at the fix and what sticks out is "Fix it with an
> explicit broadcast IPI through tlb_remove_table_sync_one()".
> 
> (I don't understand how the page table can be used for "normal,
> non-hugetlb". I could only see how it is used for the remaining user for
> hugetlb stuff, but that's different question)
> 
> How does the fix work when an architecture does not issue IPIs for TLB
> shootdown? To handle gup-fast on these architectures, we use RCU.
> 
> So I'm wondering whether we use RCU somehow.
> 
> But note that in gup_fast_pte_range(), we are validating whether the PMD
> changed:
> 
> if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>       unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
> 	gup_put_folio(folio, 1, flags);
> 	goto pte_unmap;
> }
> 
> 
> So in case the page table got reused in the meantime, we should just
> back off and be fine, right?

Wrong page table level. We'd have to check when processing a PMD leave 
whether the PUD changed as well.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-09  7:23       ` David Hildenbrand
@ 2025-10-09 15:06         ` Prakash Sangappa
  0 siblings, 0 replies; 35+ messages in thread
From: Prakash Sangappa @ 2025-10-09 15:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jann Horn, Uschakow, Stanislav, linux-mm, trix, ndesaulniers,
	nathan, akpm, muchun.song, mike.kravetz, Lorenzo Stoakes,
	Liam Howlett, osalvador, vbabka, stable



> On Oct 9, 2025, at 12:23 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 09.10.25 00:54, Prakash Sangappa wrote:
>>> On Sep 1, 2025, at 4:26 AM, David Hildenbrand <david@redhat.com> wrote:
>>> 
>>> On 01.09.25 12:58, Jann Horn wrote:
>>>> Hi!
>>>> On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
>>>>> We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.
>>>> Yeah, every 1G virtual address range you unshare on unmap will do an
>>>> extra synchronous IPI broadcast to all CPU cores, so it's not very
>>>> surprising that doing this would be a bit slow on a machine with 196
>>>> cores.
>>> 
>>> What is the use case for this extreme usage of fork() in that context? Is it just something people noticed and it's suboptimal, or is this a real problem for some use cases?
>> Our DB team is reporting performance issues due to this change. While running TPCC,  Database
>> timeouts & shuts down(crashes). This is seen when there are a large number of
>> processes(thousands) involved. It is not so prominent when there are lesser number of
>> processes.
>> Backing out this change addresses the problem.
> 
> I suspect the timeouts are due to fork() taking longer, and there is no kernel crash etc, right?

That is correct, there is no kernel crash.
-Prakash

> 
> -- 
> Cheers
> 
> David / dhildenb



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-09  7:40   ` David Hildenbrand
  2025-10-09  8:19     ` David Hildenbrand
@ 2025-10-16  9:21     ` Lorenzo Stoakes
  2025-10-16 19:13       ` David Hildenbrand
  2025-10-16 18:44     ` Jann Horn
  2 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-10-16  9:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jann Horn, Uschakow, Stanislav, linux-mm, trix, ndesaulniers,
	nathan, akpm, muchun.song, mike.kravetz, liam.howlett, osalvador,
	vbabka, stable

On Thu, Oct 09, 2025 at 09:40:34AM +0200, David Hildenbrand wrote:
> On 01.09.25 12:58, Jann Horn wrote:
> > Hi!
> >
> > On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
> > > We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.
> >
> > Yeah, every 1G virtual address range you unshare on unmap will do an
> > extra synchronous IPI broadcast to all CPU cores, so it's not very
> > surprising that doing this would be a bit slow on a machine with 196
> > cores.
> >
> > > My observation/assumption is:
> > >
> > > each child touches 100 random pages and despawns
> > > on each despawn `huge_pmd_unshare()` is called
> > > each call to `huge_pmd_unshare()` syncrhonizes all threads using `tlb_remove_table_sync_one()` leading to the regression
> >
> > Yeah, makes sense that that'd be slow.
> >
> > There are probably several ways this could be optimized - like maybe
> > changing tlb_remove_table_sync_one() to rely on the MM's cpumask
> > (though that would require thinking about whether this interacts with
> > remote MM access somehow), or batching the refcount drops for hugetlb
> > shared page tables through something like struct mmu_gather, or doing
> > something special for the unmap path, or changing the semantics of
> > hugetlb page tables such that they can never turn into normal page
> > tables again. However, I'm not planning to work on optimizing this.
>
> I'm currently looking at the fix and what sticks out is "Fix it with an
> explicit broadcast IPI through tlb_remove_table_sync_one()".
>
> (I don't understand how the page table can be used for "normal,
> non-hugetlb". I could only see how it is used for the remaining user for
> hugetlb stuff, but that's different question)

Right, this surely is related only to hugetlb PTS, otherwise the refcount
shouldn't be a factor no?

>
> How does the fix work when an architecture does not issue IPIs for TLB
> shootdown? To handle gup-fast on these architectures, we use RCU.
>
> So I'm wondering whether we use RCU somehow.

Presumably you mean whether we _can_ use RCU somehow?

>
> But note that in gup_fast_pte_range(), we are validating whether the PMD
> changed:
>
> if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>     unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
> 	gup_put_folio(folio, 1, flags);
> 	goto pte_unmap;
> }

Right and as per the comment there:

/*
 ...
 * For THP collapse, it's a bit more complicated because GUP-fast may be
 * walking a pgtable page that is being freed (pte is still valid but pmd
 * can be cleared already).  To avoid race in such condition, we need to
 * also check pmd here to make sure pmd doesn't change (corresponds to
 * pmdp_collapse_flush() in the THP collapse code path).
 ...
 */

So if this can correctly handle a cleared PMD entry in the teardown case, surely
it can handle it in this case also?

>
>
> So in case the page table got reused in the meantime, we should just back
> off and be fine, right?

Yeah seems to be the case to me.

>
> --
> Cheers
>
> David / dhildenb
>

So it seems like you have a proposal here - could you send a patch so we can
assess it please? :)

I'm guessing we need only consider the 'remaining user' case for hugetlb PTS
right? And perhaps stabilise via RCU somehow?

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-09  7:40   ` David Hildenbrand
  2025-10-09  8:19     ` David Hildenbrand
  2025-10-16  9:21     ` Lorenzo Stoakes
@ 2025-10-16 18:44     ` Jann Horn
  2025-10-16 19:10       ` David Hildenbrand
  2025-10-20 15:00       ` Lorenzo Stoakes
  2 siblings, 2 replies; 35+ messages in thread
From: Jann Horn @ 2025-10-16 18:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Uschakow, Stanislav, linux-mm, trix, ndesaulniers, nathan, akpm,
	muchun.song, mike.kravetz, lorenzo.stoakes, liam.howlett,
	osalvador, vbabka, stable

On Thu, Oct 9, 2025 at 9:40 AM David Hildenbrand <david@redhat.com> wrote:
> On 01.09.25 12:58, Jann Horn wrote:
> > Hi!
> >
> > On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
> >> We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.
> >
> > Yeah, every 1G virtual address range you unshare on unmap will do an
> > extra synchronous IPI broadcast to all CPU cores, so it's not very
> > surprising that doing this would be a bit slow on a machine with 196
> > cores.
> >
> >> My observation/assumption is:
> >>
> >> each child touches 100 random pages and despawns
> >> on each despawn `huge_pmd_unshare()` is called
> >> each call to `huge_pmd_unshare()` syncrhonizes all threads using `tlb_remove_table_sync_one()` leading to the regression
> >
> > Yeah, makes sense that that'd be slow.
> >
> > There are probably several ways this could be optimized - like maybe
> > changing tlb_remove_table_sync_one() to rely on the MM's cpumask
> > (though that would require thinking about whether this interacts with
> > remote MM access somehow), or batching the refcount drops for hugetlb
> > shared page tables through something like struct mmu_gather, or doing
> > something special for the unmap path, or changing the semantics of
> > hugetlb page tables such that they can never turn into normal page
> > tables again. However, I'm not planning to work on optimizing this.
>
> I'm currently looking at the fix and what sticks out is "Fix it with an
> explicit broadcast IPI through tlb_remove_table_sync_one()".
>
> (I don't understand how the page table can be used for "normal,
> non-hugetlb". I could only see how it is used for the remaining user for
> hugetlb stuff, but that's different question)

If I remember correctly:
When a hugetlb shared page table drops to refcount 1, it turns into a
normal page table. If you then afterwards split the hugetlb VMA, unmap
one half of it, and place a new unrelated VMA in its place, the same
page table will be reused for PTEs of this new unrelated VMA.

So the scenario would be:

1. Initially, we have a hugetlb shared page table covering 1G of
address space which maps hugetlb 2M pages, which is used by two
hugetlb VMAs in different processes (processes P1 and P2).
2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and
walks down through the PUD entry that points to the shared page table,
then when it reaches the loop in gup_fast_pmd_range() gets interrupted
for a while by an NMI or preempted by the hypervisor or something.
3. P2 removes its VMA, and the hugetlb shared page table effectively
becomes a normal page table in P1.
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
example an anonymous private VMA.
6. P1 populates VMA3 with page table entries.
7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now
uses the new PMD/PTE entries created for VMA3.

> How does the fix work when an architecture does not issue IPIs for TLB
> shootdown? To handle gup-fast on these architectures, we use RCU.

gup-fast disables interrupts, which synchronizes against both RCU and IPI.

> So I'm wondering whether we use RCU somehow.
>
> But note that in gup_fast_pte_range(), we are validating whether the PMD
> changed:
>
> if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>      unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
>         gup_put_folio(folio, 1, flags);
>         goto pte_unmap;
> }
>
>
> So in case the page table got reused in the meantime, we should just
> back off and be fine, right?

The shared page table is mapped with a PUD entry, and we don't check
whether the PUD entry changed here.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-16 18:44     ` Jann Horn
@ 2025-10-16 19:10       ` David Hildenbrand
  2025-10-16 19:26         ` Jann Horn
  2025-10-20 15:00       ` Lorenzo Stoakes
  1 sibling, 1 reply; 35+ messages in thread
From: David Hildenbrand @ 2025-10-16 19:10 UTC (permalink / raw)
  To: Jann Horn
  Cc: Uschakow, Stanislav, linux-mm, trix, ndesaulniers, nathan, akpm,
	muchun.song, mike.kravetz, lorenzo.stoakes, liam.howlett,
	osalvador, vbabka, stable

>> I'm currently looking at the fix and what sticks out is "Fix it with an
>> explicit broadcast IPI through tlb_remove_table_sync_one()".
>>
>> (I don't understand how the page table can be used for "normal,
>> non-hugetlb". I could only see how it is used for the remaining user for
>> hugetlb stuff, but that's different question)
> 
> If I remember correctly:
> When a hugetlb shared page table drops to refcount 1, it turns into a
> normal page table. If you then afterwards split the hugetlb VMA, unmap
> one half of it, and place a new unrelated VMA in its place, the same
> page table will be reused for PTEs of this new unrelated VMA.

That makes sense.

> 
> So the scenario would be:
> 
> 1. Initially, we have a hugetlb shared page table covering 1G of
> address space which maps hugetlb 2M pages, which is used by two
> hugetlb VMAs in different processes (processes P1 and P2).
> 2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and
> walks down through the PUD entry that points to the shared page table,
> then when it reaches the loop in gup_fast_pmd_range() gets interrupted
> for a while by an NMI or preempted by the hypervisor or something.
> 3. P2 removes its VMA, and the hugetlb shared page table effectively
> becomes a normal page table in P1.
> 4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
> leaving two VMAs VMA1 and VMA2.
> 5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
> example an anonymous private VMA.
> 6. P1 populates VMA3 with page table entries.
> 7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now
> uses the new PMD/PTE entries created for VMA3.

Yeah, sounds possible. And nasty.

> 
>> How does the fix work when an architecture does not issue IPIs for TLB
>> shootdown? To handle gup-fast on these architectures, we use RCU.
> 
> gup-fast disables interrupts, which synchronizes against both RCU and IPI.

Right, but RCU is only used for prevent walking a page table that has 
been freed+reused in the meantime (prevent us from de-referencing 
garbage entries).

It does not prevent walking the now-unshared page table that has been 
modified by the other process.

For that, we need the back-off described below. IIRC we implemented that 
in the PMD case for khugepaged.

Or is there somewhere a guaranteed RCU sync before the shared page table 
gets reused?

> 
>> So I'm wondering whether we use RCU somehow.
>>
>> But note that in gup_fast_pte_range(), we are validating whether the PMD
>> changed:
>>
>> if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>>       unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
>>          gup_put_folio(folio, 1, flags);
>>          goto pte_unmap;
>> }
>>
>>
>> So in case the page table got reused in the meantime, we should just
>> back off and be fine, right?
> 
> The shared page table is mapped with a PUD entry, and we don't check
> whether the PUD entry changed here.

Yes, see my follow-up mail, that's what we'd have to add.

On an arch without IPI, page tables will be freed with RCU and it just 
works. We walk the wrong page table, realize that the PUD changed and 
back off.

On an arch with IPI it's tricky: if we don't issue the IPI you added, we 
might still back off once we check the PUD entry didn't changee, but I'm 
afraid nothing would stop us from walking the previous page table that 
was freed in the meantime, containing garbage.

Easy fix would be never reusing a page table once shared once?

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-16  9:21     ` Lorenzo Stoakes
@ 2025-10-16 19:13       ` David Hildenbrand
  0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-10-16 19:13 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jann Horn, Uschakow, Stanislav, linux-mm, trix, ndesaulniers,
	nathan, akpm, muchun.song, mike.kravetz, liam.howlett, osalvador,
	vbabka, stable


>>
>> (I don't understand how the page table can be used for "normal,
>> non-hugetlb". I could only see how it is used for the remaining user for
>> hugetlb stuff, but that's different question)
> 
> Right, this surely is related only to hugetlb PTS, otherwise the refcount
> shouldn't be a factor no?

The example from Jann is scary. But I think it checks out.

> 
>>
>> How does the fix work when an architecture does not issue IPIs for TLB
>> shootdown? To handle gup-fast on these architectures, we use RCU.
>>
>> So I'm wondering whether we use RCU somehow.
> 
> Presumably you mean whether we _can_ use RCU somehow?

No, whether there is an implied RCU sync before the page table gets 
reused, see my reply from Jann.

> 
>>
>> But note that in gup_fast_pte_range(), we are validating whether the PMD
>> changed:
>>
>> if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>>      unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
>> 	gup_put_folio(folio, 1, flags);
>> 	goto pte_unmap;
>> }
> 
> Right and as per the comment there:
> 
> /*
>   ...
>   * For THP collapse, it's a bit more complicated because GUP-fast may be
>   * walking a pgtable page that is being freed (pte is still valid but pmd
>   * can be cleared already).  To avoid race in such condition, we need to
>   * also check pmd here to make sure pmd doesn't change (corresponds to
>   * pmdp_collapse_flush() in the THP collapse code path).
>   ...
>   */
> 
> So if this can correctly handle a cleared PMD entry in the teardown case, surely
> it can handle it in this case also?

Right.

But see my other mail, on architectures that don't free page tables with 
RCU we still need the IPI, so that is nasty.

> 
>>
>>
>> So in case the page table got reused in the meantime, we should just back
>> off and be fine, right?
> 
> Yeah seems to be the case to me.
> 
>>
>> --
>> Cheers
>>
>> David / dhildenb
>>
> 
> So it seems like you have a proposal here - could you send a patch so we can
> assess it please? :)

It's a bit tricky, I think I have to discuss with Jann some more first. 
But right now my understanding is that Janns fix might not have taken 
care of arch without the IPI sync -- I might be wrong.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-16 19:10       ` David Hildenbrand
@ 2025-10-16 19:26         ` Jann Horn
  2025-10-16 19:44           ` David Hildenbrand
  0 siblings, 1 reply; 35+ messages in thread
From: Jann Horn @ 2025-10-16 19:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Uschakow, Stanislav, linux-mm, trix, ndesaulniers, nathan, akpm,
	muchun.song, mike.kravetz, lorenzo.stoakes, liam.howlett,
	osalvador, vbabka, stable

On Thu, Oct 16, 2025 at 9:10 PM David Hildenbrand <david@redhat.com> wrote:
> >> I'm currently looking at the fix and what sticks out is "Fix it with an
> >> explicit broadcast IPI through tlb_remove_table_sync_one()".
> >>
> >> (I don't understand how the page table can be used for "normal,
> >> non-hugetlb". I could only see how it is used for the remaining user for
> >> hugetlb stuff, but that's different question)
> >
> > If I remember correctly:
> > When a hugetlb shared page table drops to refcount 1, it turns into a
> > normal page table. If you then afterwards split the hugetlb VMA, unmap
> > one half of it, and place a new unrelated VMA in its place, the same
> > page table will be reused for PTEs of this new unrelated VMA.
>
> That makes sense.
>
> >
> > So the scenario would be:
> >
> > 1. Initially, we have a hugetlb shared page table covering 1G of
> > address space which maps hugetlb 2M pages, which is used by two
> > hugetlb VMAs in different processes (processes P1 and P2).
> > 2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and
> > walks down through the PUD entry that points to the shared page table,
> > then when it reaches the loop in gup_fast_pmd_range() gets interrupted
> > for a while by an NMI or preempted by the hypervisor or something.
> > 3. P2 removes its VMA, and the hugetlb shared page table effectively
> > becomes a normal page table in P1.
> > 4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
> > leaving two VMAs VMA1 and VMA2.
> > 5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
> > example an anonymous private VMA.
> > 6. P1 populates VMA3 with page table entries.
> > 7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now
> > uses the new PMD/PTE entries created for VMA3.
>
> Yeah, sounds possible. And nasty.
>
> >
> >> How does the fix work when an architecture does not issue IPIs for TLB
> >> shootdown? To handle gup-fast on these architectures, we use RCU.
> >
> > gup-fast disables interrupts, which synchronizes against both RCU and IPI.
>
> Right, but RCU is only used for prevent walking a page table that has
> been freed+reused in the meantime (prevent us from de-referencing
> garbage entries).
>
> It does not prevent walking the now-unshared page table that has been
> modified by the other process.

Hm, I'm a bit lost... which page table walk implementation are you
worried about that accesses page tables purely with RCU? I believe all
page table walks should be happening either with interrupts off (in
gup_fast()) or under the protection of higher-level locks; in
particular, hugetlb page walks take an extra hugetlb specific lock
(for hugetlb VMAs that are eligible for page table sharing, that is
the rw_sema in hugetlb_vma_lock).

Regarding gup_fast():

In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is defined, the fix
commit 1013af4f585f uses a synchronous IPI with
tlb_remove_table_sync_one() to wait for any concurrent GUP-fast
software page table walks, and some time after the call to
huge_pmd_unshare() we will do a TLB flush that synchronizes against
hardware page table walks.

In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is not defined, I
believe the expectation is that the TLB flush implicitly does an IPI
which synchronizes against both software and hardware page table
walks.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-16 19:26         ` Jann Horn
@ 2025-10-16 19:44           ` David Hildenbrand
  2025-10-16 20:25             ` Jann Horn
  0 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand @ 2025-10-16 19:44 UTC (permalink / raw)
  To: Jann Horn
  Cc: Uschakow, Stanislav, linux-mm, trix, ndesaulniers, nathan, akpm,
	muchun.song, mike.kravetz, lorenzo.stoakes, liam.howlett,
	osalvador, vbabka, stable

On 16.10.25 21:26, Jann Horn wrote:
> On Thu, Oct 16, 2025 at 9:10 PM David Hildenbrand <david@redhat.com> wrote:
>>>> I'm currently looking at the fix and what sticks out is "Fix it with an
>>>> explicit broadcast IPI through tlb_remove_table_sync_one()".
>>>>
>>>> (I don't understand how the page table can be used for "normal,
>>>> non-hugetlb". I could only see how it is used for the remaining user for
>>>> hugetlb stuff, but that's different question)
>>>
>>> If I remember correctly:
>>> When a hugetlb shared page table drops to refcount 1, it turns into a
>>> normal page table. If you then afterwards split the hugetlb VMA, unmap
>>> one half of it, and place a new unrelated VMA in its place, the same
>>> page table will be reused for PTEs of this new unrelated VMA.
>>
>> That makes sense.
>>
>>>
>>> So the scenario would be:
>>>
>>> 1. Initially, we have a hugetlb shared page table covering 1G of
>>> address space which maps hugetlb 2M pages, which is used by two
>>> hugetlb VMAs in different processes (processes P1 and P2).
>>> 2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and
>>> walks down through the PUD entry that points to the shared page table,
>>> then when it reaches the loop in gup_fast_pmd_range() gets interrupted
>>> for a while by an NMI or preempted by the hypervisor or something.
>>> 3. P2 removes its VMA, and the hugetlb shared page table effectively
>>> becomes a normal page table in P1.
>>> 4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
>>> leaving two VMAs VMA1 and VMA2.
>>> 5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
>>> example an anonymous private VMA.
>>> 6. P1 populates VMA3 with page table entries.
>>> 7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now
>>> uses the new PMD/PTE entries created for VMA3.
>>
>> Yeah, sounds possible. And nasty.
>>
>>>
>>>> How does the fix work when an architecture does not issue IPIs for TLB
>>>> shootdown? To handle gup-fast on these architectures, we use RCU.
>>>
>>> gup-fast disables interrupts, which synchronizes against both RCU and IPI.
>>
>> Right, but RCU is only used for prevent walking a page table that has
>> been freed+reused in the meantime (prevent us from de-referencing
>> garbage entries).
>>
>> It does not prevent walking the now-unshared page table that has been
>> modified by the other process.
> 
> Hm, I'm a bit lost... which page table walk implementation are you
> worried about that accesses page tables purely with RCU? I believe all
> page table walks should be happening either with interrupts off (in
> gup_fast()) or under the protection of higher-level locks; in
> particular, hugetlb page walks take an extra hugetlb specific lock
> (for hugetlb VMAs that are eligible for page table sharing, that is
> the rw_sema in hugetlb_vma_lock).

I'm only concerned about gup-fast, but your comment below explains why 
your fix works as it triggers an IPI in any case, not just during the 
TLB flush.

Sorry for missing that detail.

> 
> Regarding gup_fast():
> 
> In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is defined, the fix
> commit 1013af4f585f uses a synchronous IPI with
> tlb_remove_table_sync_one() to wait for any concurrent GUP-fast
> software page table walks, and some time after the call to
> huge_pmd_unshare() we will do a TLB flush that synchronizes against
> hardware page table walks.

Right, so we definetly issue an IPI.

> 
> In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is not defined, I
> believe the expectation is that the TLB flush implicitly does an IPI
> which synchronizes against both software and hardware page table
> walks.

Yes, that's what I had in mind, not an explicit sync.


So the big question is whether we could avoid this IPI on every unsharing.

Assume we would ever reuse a page table that was shared, we'd have to do 
this IPI only before freeing the page table I guess, or free the page 
table through RCU.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-16 19:44           ` David Hildenbrand
@ 2025-10-16 20:25             ` Jann Horn
  0 siblings, 0 replies; 35+ messages in thread
From: Jann Horn @ 2025-10-16 20:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Uschakow, Stanislav, linux-mm, trix, ndesaulniers, nathan, akpm,
	muchun.song, mike.kravetz, lorenzo.stoakes, liam.howlett,
	osalvador, vbabka, stable

On Thu, Oct 16, 2025 at 9:45 PM David Hildenbrand <david@redhat.com> wrote:
> On 16.10.25 21:26, Jann Horn wrote:
> > On Thu, Oct 16, 2025 at 9:10 PM David Hildenbrand <david@redhat.com> wrote:
> >>>> I'm currently looking at the fix and what sticks out is "Fix it with an
> >>>> explicit broadcast IPI through tlb_remove_table_sync_one()".
> >>>>
> >>>> (I don't understand how the page table can be used for "normal,
> >>>> non-hugetlb". I could only see how it is used for the remaining user for
> >>>> hugetlb stuff, but that's different question)
> >>>
> >>> If I remember correctly:
> >>> When a hugetlb shared page table drops to refcount 1, it turns into a
> >>> normal page table. If you then afterwards split the hugetlb VMA, unmap
> >>> one half of it, and place a new unrelated VMA in its place, the same
> >>> page table will be reused for PTEs of this new unrelated VMA.
> >>
> >> That makes sense.
> >>
> >>>
> >>> So the scenario would be:
> >>>
> >>> 1. Initially, we have a hugetlb shared page table covering 1G of
> >>> address space which maps hugetlb 2M pages, which is used by two
> >>> hugetlb VMAs in different processes (processes P1 and P2).
> >>> 2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and
> >>> walks down through the PUD entry that points to the shared page table,
> >>> then when it reaches the loop in gup_fast_pmd_range() gets interrupted
> >>> for a while by an NMI or preempted by the hypervisor or something.
> >>> 3. P2 removes its VMA, and the hugetlb shared page table effectively
> >>> becomes a normal page table in P1.
> >>> 4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
> >>> leaving two VMAs VMA1 and VMA2.
> >>> 5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
> >>> example an anonymous private VMA.
> >>> 6. P1 populates VMA3 with page table entries.
> >>> 7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now
> >>> uses the new PMD/PTE entries created for VMA3.
> >>
> >> Yeah, sounds possible. And nasty.
> >>
> >>>
> >>>> How does the fix work when an architecture does not issue IPIs for TLB
> >>>> shootdown? To handle gup-fast on these architectures, we use RCU.
> >>>
> >>> gup-fast disables interrupts, which synchronizes against both RCU and IPI.
> >>
> >> Right, but RCU is only used for prevent walking a page table that has
> >> been freed+reused in the meantime (prevent us from de-referencing
> >> garbage entries).
> >>
> >> It does not prevent walking the now-unshared page table that has been
> >> modified by the other process.
> >
> > Hm, I'm a bit lost... which page table walk implementation are you
> > worried about that accesses page tables purely with RCU? I believe all
> > page table walks should be happening either with interrupts off (in
> > gup_fast()) or under the protection of higher-level locks; in
> > particular, hugetlb page walks take an extra hugetlb specific lock
> > (for hugetlb VMAs that are eligible for page table sharing, that is
> > the rw_sema in hugetlb_vma_lock).
>
> I'm only concerned about gup-fast, but your comment below explains why
> your fix works as it triggers an IPI in any case, not just during the
> TLB flush.
>
> Sorry for missing that detail.
>
> >
> > Regarding gup_fast():
> >
> > In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is defined, the fix
> > commit 1013af4f585f uses a synchronous IPI with
> > tlb_remove_table_sync_one() to wait for any concurrent GUP-fast
> > software page table walks, and some time after the call to
> > huge_pmd_unshare() we will do a TLB flush that synchronizes against
> > hardware page table walks.
>
> Right, so we definetly issue an IPI.
>
> >
> > In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is not defined, I
> > believe the expectation is that the TLB flush implicitly does an IPI
> > which synchronizes against both software and hardware page table
> > walks.
>
> Yes, that's what I had in mind, not an explicit sync.
>
>
> So the big question is whether we could avoid this IPI on every unsharing.
>
> Assume we would ever reuse a page table that was shared, we'd have to do
> this IPI only before freeing the page table I guess, or free the page
> table through RCU.

Yeah, that would make things a lot neater. Prevent hugetlb shared page
tables from ever being reused for normal mappings, perhaps by changing
huge_pmd_unshare() so that if the page table has a share count of 1,
we zap it instead of doing nothing. (Though that has to be restricted
to shared hugetlb mappings, which are the ones eligible for page table
sharing.)

I thiiiink doing it at huge_pmd_unshare() would probably be enough to
prevent formerly-shared page tables from being reused for new stuff,
but I haven't looked in detail.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-16 18:44     ` Jann Horn
  2025-10-16 19:10       ` David Hildenbrand
@ 2025-10-20 15:00       ` Lorenzo Stoakes
  2025-10-20 15:33         ` Jann Horn
  2025-10-20 17:18         ` David Hildenbrand
  1 sibling, 2 replies; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-10-20 15:00 UTC (permalink / raw)
  To: Jann Horn
  Cc: David Hildenbrand, Uschakow, Stanislav, linux-mm, trix,
	ndesaulniers, nathan, akpm, muchun.song, mike.kravetz,
	liam.howlett, osalvador, vbabka, stable

Jann,

Please bear with my questions below, want to get a good mental model of this. :)

Thanks!

On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
> On Thu, Oct 9, 2025 at 9:40 AM David Hildenbrand <david@redhat.com> wrote:
> > On 01.09.25 12:58, Jann Horn wrote:
> > > Hi!
> > >
> > > On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
> > >> We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.
> > >
> > > Yeah, every 1G virtual address range you unshare on unmap will do an
> > > extra synchronous IPI broadcast to all CPU cores, so it's not very
> > > surprising that doing this would be a bit slow on a machine with 196
> > > cores.
> > >
> > >> My observation/assumption is:
> > >>
> > >> each child touches 100 random pages and despawns
> > >> on each despawn `huge_pmd_unshare()` is called
> > >> each call to `huge_pmd_unshare()` syncrhonizes all threads using `tlb_remove_table_sync_one()` leading to the regression
> > >
> > > Yeah, makes sense that that'd be slow.
> > >
> > > There are probably several ways this could be optimized - like maybe
> > > changing tlb_remove_table_sync_one() to rely on the MM's cpumask
> > > (though that would require thinking about whether this interacts with
> > > remote MM access somehow), or batching the refcount drops for hugetlb
> > > shared page tables through something like struct mmu_gather, or doing
> > > something special for the unmap path, or changing the semantics of
> > > hugetlb page tables such that they can never turn into normal page
> > > tables again. However, I'm not planning to work on optimizing this.
> >
> > I'm currently looking at the fix and what sticks out is "Fix it with an
> > explicit broadcast IPI through tlb_remove_table_sync_one()".
> >
> > (I don't understand how the page table can be used for "normal,
> > non-hugetlb". I could only see how it is used for the remaining user for
> > hugetlb stuff, but that's different question)
>
> If I remember correctly:
> When a hugetlb shared page table drops to refcount 1, it turns into a
> normal page table. If you then afterwards split the hugetlb VMA, unmap
> one half of it, and place a new unrelated VMA in its place, the same
> page table will be reused for PTEs of this new unrelated VMA.
>
> So the scenario would be:
>
> 1. Initially, we have a hugetlb shared page table covering 1G of
> address space which maps hugetlb 2M pages, which is used by two
> hugetlb VMAs in different processes (processes P1 and P2).
> 2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and
> walks down through the PUD entry that points to the shared page table,
> then when it reaches the loop in gup_fast_pmd_range() gets interrupted
> for a while by an NMI or preempted by the hypervisor or something.
> 3. P2 removes its VMA, and the hugetlb shared page table effectively
> becomes a normal page table in P1.

This is a bit confusing, are we talking about 2 threads in P2 on different CPUs?

P2/T1 on CPU A is doing the gup_fast() walk,
P2/T2 on CPU B is simultaneously 'removing' this VMA?

Because surely the interrupts being disabled on CPU A means that ordinary
preemption won't happen right?

By remove what do you mean? Unmap? But won't this result in a TLB flush synced
by IPI that is stalled by P2'S CPU having interrupts diabled?

Or is it removed in the sense of hugetlb? As in something that invokes
huge_pmd_unshare()?

But I guess this doesn't matter as the page table teardown will succeed, just
the final tlb_finish_mmu() will stall.

And I guess GUP fast is trying to protect against the clear down by checking pmd
!= *pmdp.

> 4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
> leaving two VMAs VMA1 and VMA2.
> 5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
> example an anonymous private VMA.

Hmm, can it though?

P1 mmap write lock will be held, and VMA lock will be held too for VMA1,

In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu()
for IPI-synced architectures, and in that case the unmap won't finish and the
mmap write lock won't be released so nobody an map a new VMA yet can they?

> 6. P1 populates VMA3 with page table entries.

ofc this requires the mmap/vma write lock above to be released first.

> 7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now
> uses the new PMD/PTE entries created for VMA3.
>
> > How does the fix work when an architecture does not issue IPIs for TLB
> > shootdown? To handle gup-fast on these architectures, we use RCU.
>
> gup-fast disables interrupts, which synchronizes against both RCU and IPI.
>
> > So I'm wondering whether we use RCU somehow.
> >
> > But note that in gup_fast_pte_range(), we are validating whether the PMD
> > changed:
> >
> > if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> >      unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
> >         gup_put_folio(folio, 1, flags);
> >         goto pte_unmap;
> > }
> >
> >
> > So in case the page table got reused in the meantime, we should just
> > back off and be fine, right?
>
> The shared page table is mapped with a PUD entry, and we don't check
> whether the PUD entry changed here.

Could we simply put a PUD check in there sensibly?

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-20 15:00       ` Lorenzo Stoakes
@ 2025-10-20 15:33         ` Jann Horn
  2025-10-24 12:24           ` Lorenzo Stoakes
  2025-10-20 17:18         ` David Hildenbrand
  1 sibling, 1 reply; 35+ messages in thread
From: Jann Horn @ 2025-10-20 15:33 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Uschakow, Stanislav, linux-mm, trix,
	ndesaulniers, nathan, akpm, muchun.song, mike.kravetz,
	liam.howlett, osalvador, vbabka, stable

On Mon, Oct 20, 2025 at 5:01 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
> > On Thu, Oct 9, 2025 at 9:40 AM David Hildenbrand <david@redhat.com> wrote:
> > > On 01.09.25 12:58, Jann Horn wrote:
> > > > Hi!
> > > >
> > > > On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
> > > >> We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.
> > > >
> > > > Yeah, every 1G virtual address range you unshare on unmap will do an
> > > > extra synchronous IPI broadcast to all CPU cores, so it's not very
> > > > surprising that doing this would be a bit slow on a machine with 196
> > > > cores.
> > > >
> > > >> My observation/assumption is:
> > > >>
> > > >> each child touches 100 random pages and despawns
> > > >> on each despawn `huge_pmd_unshare()` is called
> > > >> each call to `huge_pmd_unshare()` syncrhonizes all threads using `tlb_remove_table_sync_one()` leading to the regression
> > > >
> > > > Yeah, makes sense that that'd be slow.
> > > >
> > > > There are probably several ways this could be optimized - like maybe
> > > > changing tlb_remove_table_sync_one() to rely on the MM's cpumask
> > > > (though that would require thinking about whether this interacts with
> > > > remote MM access somehow), or batching the refcount drops for hugetlb
> > > > shared page tables through something like struct mmu_gather, or doing
> > > > something special for the unmap path, or changing the semantics of
> > > > hugetlb page tables such that they can never turn into normal page
> > > > tables again. However, I'm not planning to work on optimizing this.
> > >
> > > I'm currently looking at the fix and what sticks out is "Fix it with an
> > > explicit broadcast IPI through tlb_remove_table_sync_one()".
> > >
> > > (I don't understand how the page table can be used for "normal,
> > > non-hugetlb". I could only see how it is used for the remaining user for
> > > hugetlb stuff, but that's different question)
> >
> > If I remember correctly:
> > When a hugetlb shared page table drops to refcount 1, it turns into a
> > normal page table. If you then afterwards split the hugetlb VMA, unmap
> > one half of it, and place a new unrelated VMA in its place, the same
> > page table will be reused for PTEs of this new unrelated VMA.
> >
> > So the scenario would be:
> >
> > 1. Initially, we have a hugetlb shared page table covering 1G of
> > address space which maps hugetlb 2M pages, which is used by two
> > hugetlb VMAs in different processes (processes P1 and P2).
> > 2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and
> > walks down through the PUD entry that points to the shared page table,
> > then when it reaches the loop in gup_fast_pmd_range() gets interrupted
> > for a while by an NMI or preempted by the hypervisor or something.
> > 3. P2 removes its VMA, and the hugetlb shared page table effectively
> > becomes a normal page table in P1.
>
> This is a bit confusing, are we talking about 2 threads in P2 on different CPUs?
>
> P2/T1 on CPU A is doing the gup_fast() walk,
> P2/T2 on CPU B is simultaneously 'removing' this VMA?

Ah, yes.

> Because surely the interrupts being disabled on CPU A means that ordinary
> preemption won't happen right?

Yeah.

> By remove what do you mean? Unmap? But won't this result in a TLB flush synced
> by IPI that is stalled by P2'S CPU having interrupts diabled?

The case I had in mind is munmap(). This is only an issue on platforms
where TLB flushes can be done without IPI. That includes:

 - KVM guests on x86 (where TLB flush IPIs can be elided if the target
vCPU has been preempted by the host, in which case the host promises
to do a TLB flush on guest re-entry)
 - modern AMD CPUs with INVLPGB
 - arm64

That is the whole point of tlb_remove_table_sync_one() - it forces an
IPI on architectures where TLB flush doesn't guarantee an IPI.

(The config option "CONFIG_MMU_GATHER_RCU_TABLE_FREE", which is only
needed on architectures that don't guarantee that an IPI is involved
in TLB flushing, is set on the major architectures nowadays -
unconditionally on x86 and arm64, and in SMP builds of 32-bit arm.)

> Or is it removed in the sense of hugetlb? As in something that invokes
> huge_pmd_unshare()?

I think that could also trigger it, though I wasn't thinking of that case.

> But I guess this doesn't matter as the page table teardown will succeed, just
> the final tlb_finish_mmu() will stall.
>
> And I guess GUP fast is trying to protect against the clear down by checking pmd
> != *pmdp.

The pmd recheck is done because of THP, IIRC because THP can deposit
and reuse page tables without following the normal page table life
cycle.

> > 4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
> > leaving two VMAs VMA1 and VMA2.
> > 5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
> > example an anonymous private VMA.
>
> Hmm, can it though?
>
> P1 mmap write lock will be held, and VMA lock will be held too for VMA1,
>
> In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu()
> for IPI-synced architectures, and in that case the unmap won't finish and the
> mmap write lock won't be released so nobody an map a new VMA yet can they?

Yeah, I think it can't happen on configurations that always use IPI
for TLB synchronization. My patch also doesn't change anything on
those architectures - tlb_remove_table_sync_one() is a no-op on
architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.

> > 6. P1 populates VMA3 with page table entries.
>
> ofc this requires the mmap/vma write lock above to be released first.
>
> > 7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now
> > uses the new PMD/PTE entries created for VMA3.
> >
> > > How does the fix work when an architecture does not issue IPIs for TLB
> > > shootdown? To handle gup-fast on these architectures, we use RCU.
> >
> > gup-fast disables interrupts, which synchronizes against both RCU and IPI.
> >
> > > So I'm wondering whether we use RCU somehow.
> > >
> > > But note that in gup_fast_pte_range(), we are validating whether the PMD
> > > changed:
> > >
> > > if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> > >      unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
> > >         gup_put_folio(folio, 1, flags);
> > >         goto pte_unmap;
> > > }
> > >
> > >
> > > So in case the page table got reused in the meantime, we should just
> > > back off and be fine, right?
> >
> > The shared page table is mapped with a PUD entry, and we don't check
> > whether the PUD entry changed here.
>
> Could we simply put a PUD check in there sensibly?

Uuuh... maybe? But I'm not sure if there is a good way to express the
safety rules after that change any more nicely than we can do with the
current safety rules, it feels like we're just tacking on an
increasing number of special cases. As I understand it, the current
rules are something like:

Freeing a page table needs RCU delay or IPI to synchronize against
gup_fast(). Randomly moving page tables to different locations (which
khugepaged does) is specially allowed only for PTE tables, thanks to
the PMD entry recheck. mremap() is kind of an weird case because it
can also move PMD tables without locking, but that's fine because
nothing in the region covered by the source virtual address range can
be part of a VMA other than the VMA being moved, so userspace has no
legitimate reason to access it.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-20 15:00       ` Lorenzo Stoakes
  2025-10-20 15:33         ` Jann Horn
@ 2025-10-20 17:18         ` David Hildenbrand
  2025-10-24  9:59           ` Lorenzo Stoakes
  1 sibling, 1 reply; 35+ messages in thread
From: David Hildenbrand @ 2025-10-20 17:18 UTC (permalink / raw)
  To: Lorenzo Stoakes, Jann Horn
  Cc: Uschakow, Stanislav, linux-mm, trix, ndesaulniers, nathan, akpm,
	muchun.song, mike.kravetz, liam.howlett, osalvador, vbabka,
	stable

>>>
>>> So in case the page table got reused in the meantime, we should just
>>> back off and be fine, right?
>>
>> The shared page table is mapped with a PUD entry, and we don't check
>> whether the PUD entry changed here.
> 
> Could we simply put a PUD check in there sensibly?

A PUD check would only work if we are guaranteed that the page table 
will not get freed in the meantime, otherwise we might be walking 
garbage, trying to interpret garbage as PMDs etc.

That would require RCU freeing of page tables, which we are not 
guaranteed to have IIRC.

The easiest approach is probably to simply never reuse shared page tables.

If there is consensus on that I can try to see if I can make it fly easily.

-- 
Cheers

David / dhildenb

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-20 17:18         ` David Hildenbrand
@ 2025-10-24  9:59           ` Lorenzo Stoakes
  0 siblings, 0 replies; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-10-24  9:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jann Horn, Uschakow, Stanislav, linux-mm, trix, ndesaulniers,
	nathan, akpm, muchun.song, mike.kravetz, liam.howlett, osalvador,
	vbabka, stable

On Mon, Oct 20, 2025 at 07:18:18PM +0200, David Hildenbrand wrote:
>
> > > >
> > > > So in case the page table got reused in the meantime, we should just
> > > > back off and be fine, right?
> > >
> > > The shared page table is mapped with a PUD entry, and we don't check
> > > whether the PUD entry changed here.
> >
> > Could we simply put a PUD check in there sensibly?
>
> A PUD check would only work if we are guaranteed that the page table will
> not get freed in the meantime, otherwise we might be walking garbage, trying
> to interpret garbage as PMDs etc.
>
> That would require RCU freeing of page tables, which we are not guaranteed
> to have IIRC.

Ack. Yeah Suren is working on this :) but don't know when that'll land.

>
> The easiest approach is probably to simply never reuse shared page tables.
>
> If there is consensus on that I can try to see if I can make it fly easily.

That'd be good, I'd like to see that if viable for you to put something
forward :)

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-20 15:33         ` Jann Horn
@ 2025-10-24 12:24           ` Lorenzo Stoakes
  2025-10-24 18:22             ` Jann Horn
  0 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-10-24 12:24 UTC (permalink / raw)
  To: Jann Horn
  Cc: David Hildenbrand, Uschakow, Stanislav, linux-mm, trix,
	ndesaulniers, nathan, akpm, muchun.song, mike.kravetz,
	liam.howlett, osalvador, vbabka, stable

On Mon, Oct 20, 2025 at 05:33:22PM +0200, Jann Horn wrote:
> On Mon, Oct 20, 2025 at 5:01 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
> > > On Thu, Oct 9, 2025 at 9:40 AM David Hildenbrand <david@redhat.com> wrote:
> > > > On 01.09.25 12:58, Jann Horn wrote:
> > > > > Hi!
> > > > >
> > > > > On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav <suschako@amazon.de> wrote:
> > > > >> We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.
> > > > >
> > > > > Yeah, every 1G virtual address range you unshare on unmap will do an
> > > > > extra synchronous IPI broadcast to all CPU cores, so it's not very
> > > > > surprising that doing this would be a bit slow on a machine with 196
> > > > > cores.
> > > > >
> > > > >> My observation/assumption is:
> > > > >>
> > > > >> each child touches 100 random pages and despawns
> > > > >> on each despawn `huge_pmd_unshare()` is called
> > > > >> each call to `huge_pmd_unshare()` syncrhonizes all threads using `tlb_remove_table_sync_one()` leading to the regression
> > > > >
> > > > > Yeah, makes sense that that'd be slow.
> > > > >
> > > > > There are probably several ways this could be optimized - like maybe
> > > > > changing tlb_remove_table_sync_one() to rely on the MM's cpumask
> > > > > (though that would require thinking about whether this interacts with
> > > > > remote MM access somehow), or batching the refcount drops for hugetlb
> > > > > shared page tables through something like struct mmu_gather, or doing
> > > > > something special for the unmap path, or changing the semantics of
> > > > > hugetlb page tables such that they can never turn into normal page
> > > > > tables again. However, I'm not planning to work on optimizing this.
> > > >
> > > > I'm currently looking at the fix and what sticks out is "Fix it with an
> > > > explicit broadcast IPI through tlb_remove_table_sync_one()".
> > > >
> > > > (I don't understand how the page table can be used for "normal,
> > > > non-hugetlb". I could only see how it is used for the remaining user for
> > > > hugetlb stuff, but that's different question)
> > >
> > > If I remember correctly:
> > > When a hugetlb shared page table drops to refcount 1, it turns into a
> > > normal page table. If you then afterwards split the hugetlb VMA, unmap
> > > one half of it, and place a new unrelated VMA in its place, the same
> > > page table will be reused for PTEs of this new unrelated VMA.
> > >
> > > So the scenario would be:
> > >
> > > 1. Initially, we have a hugetlb shared page table covering 1G of
> > > address space which maps hugetlb 2M pages, which is used by two
> > > hugetlb VMAs in different processes (processes P1 and P2).
> > > 2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and
> > > walks down through the PUD entry that points to the shared page table,
> > > then when it reaches the loop in gup_fast_pmd_range() gets interrupted
> > > for a while by an NMI or preempted by the hypervisor or something.
> > > 3. P2 removes its VMA, and the hugetlb shared page table effectively
> > > becomes a normal page table in P1.
> >
> > This is a bit confusing, are we talking about 2 threads in P2 on different CPUs?
> >
> > P2/T1 on CPU A is doing the gup_fast() walk,
> > P2/T2 on CPU B is simultaneously 'removing' this VMA?
>
> Ah, yes.

Thanks

>
> > Because surely the interrupts being disabled on CPU A means that ordinary
> > preemption won't happen right?
>
> Yeah.
>
> > By remove what do you mean? Unmap? But won't this result in a TLB flush synced
> > by IPI that is stalled by P2'S CPU having interrupts diabled?
>
> The case I had in mind is munmap(). This is only an issue on platforms
> where TLB flushes can be done without IPI. That includes:
>
>  - KVM guests on x86 (where TLB flush IPIs can be elided if the target
> vCPU has been preempted by the host, in which case the host promises
> to do a TLB flush on guest re-entry)
>  - modern AMD CPUs with INVLPGB
>  - arm64
>
> That is the whole point of tlb_remove_table_sync_one() - it forces an
> IPI on architectures where TLB flush doesn't guarantee an IPI.

Right.

>
> (The config option "CONFIG_MMU_GATHER_RCU_TABLE_FREE", which is only
> needed on architectures that don't guarantee that an IPI is involved
> in TLB flushing, is set on the major architectures nowadays -
> unconditionally on x86 and arm64, and in SMP builds of 32-bit arm.)

Yes.

>
> > Or is it removed in the sense of hugetlb? As in something that invokes
> > huge_pmd_unshare()?
>
> I think that could also trigger it, though I wasn't thinking of that case.
>
> > But I guess this doesn't matter as the page table teardown will succeed, just
> > the final tlb_finish_mmu() will stall.
> >
> > And I guess GUP fast is trying to protect against the clear down by checking pmd
> > != *pmdp.
>
> The pmd recheck is done because of THP, IIRC because THP can deposit
> and reuse page tables without following the normal page table life
> cycle.

Right.

>
> > > 4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
> > > leaving two VMAs VMA1 and VMA2.
> > > 5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
> > > example an anonymous private VMA.
> >
> > Hmm, can it though?
> >
> > P1 mmap write lock will be held, and VMA lock will be held too for VMA1,
> >
> > In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu()
> > for IPI-synced architectures, and in that case the unmap won't finish and the
> > mmap write lock won't be released so nobody an map a new VMA yet can they?
>
> Yeah, I think it can't happen on configurations that always use IPI
> for TLB synchronization. My patch also doesn't change anything on
> those architectures - tlb_remove_table_sync_one() is a no-op on
> architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.

Hmm but in that case wouldn't:

tlb_finish_mmu()
-> tlb_flush_mmu()
-> tlb_flush_mmu_free()
-> tlb_table_flush()
-> tlb_remove_table()
-> __tlb_remove_table_one()
-> tlb_remove_table_sync_one()

prevent the unmapping on non-IPI architectures, thereby mitigating the
issue?

Also doesn't CONFIG_MMU_GATHER_RCU_TABLE_FREE imply that RCU is being used
for page table teardown whose grace period would be disallowed until
gup_fast() finishes and therefore that also mitigate?

Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?

It seems you're predicating the issue on an unmap happening without waiting
for GUP fast, but it seems that it always will?

Am I missing something here?

>
> > > 6. P1 populates VMA3 with page table entries.
> >
> > ofc this requires the mmap/vma write lock above to be released first.
> >
> > > 7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now
> > > uses the new PMD/PTE entries created for VMA3.
> > >
> > > > How does the fix work when an architecture does not issue IPIs for TLB
> > > > shootdown? To handle gup-fast on these architectures, we use RCU.
> > >
> > > gup-fast disables interrupts, which synchronizes against both RCU and IPI.
> > >
> > > > So I'm wondering whether we use RCU somehow.
> > > >
> > > > But note that in gup_fast_pte_range(), we are validating whether the PMD
> > > > changed:
> > > >
> > > > if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> > > >      unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
> > > >         gup_put_folio(folio, 1, flags);
> > > >         goto pte_unmap;
> > > > }
> > > >
> > > >
> > > > So in case the page table got reused in the meantime, we should just
> > > > back off and be fine, right?
> > >
> > > The shared page table is mapped with a PUD entry, and we don't check
> > > whether the PUD entry changed here.
> >
> > Could we simply put a PUD check in there sensibly?
>
> Uuuh... maybe? But I'm not sure if there is a good way to express the
> safety rules after that change any more nicely than we can do with the
> current safety rules, it feels like we're just tacking on an
> increasing number of special cases. As I understand it, the current
> rules are something like:

Yeah David covered off in other sub-thread, not really viable I guess :)

>
> Freeing a page table needs RCU delay or IPI to synchronize against
> gup_fast(). Randomly moving page tables to different locations (which
> khugepaged does) is specially allowed only for PTE tables, thanks to
> the PMD entry recheck. mremap() is kind of an weird case because it
> can also move PMD tables without locking, but that's fine because
> nothing in the region covered by the source virtual address range can
> be part of a VMA other than the VMA being moved, so userspace has no
> legitimate reason to access it.

I will need to document these somewhere :)

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-24 12:24           ` Lorenzo Stoakes
@ 2025-10-24 18:22             ` Jann Horn
  2025-10-24 19:02               ` Lorenzo Stoakes
  0 siblings, 1 reply; 35+ messages in thread
From: Jann Horn @ 2025-10-24 18:22 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Uschakow, Stanislav, linux-mm, trix,
	ndesaulniers, nathan, akpm, muchun.song, mike.kravetz,
	liam.howlett, osalvador, vbabka, stable

On Fri, Oct 24, 2025 at 2:25 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Oct 20, 2025 at 05:33:22PM +0200, Jann Horn wrote:
> > On Mon, Oct 20, 2025 at 5:01 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
> > > > 4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
> > > > leaving two VMAs VMA1 and VMA2.
> > > > 5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
> > > > example an anonymous private VMA.
> > >
> > > Hmm, can it though?
> > >
> > > P1 mmap write lock will be held, and VMA lock will be held too for VMA1,
> > >
> > > In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu()
> > > for IPI-synced architectures, and in that case the unmap won't finish and the
> > > mmap write lock won't be released so nobody an map a new VMA yet can they?
> >
> > Yeah, I think it can't happen on configurations that always use IPI
> > for TLB synchronization. My patch also doesn't change anything on
> > those architectures - tlb_remove_table_sync_one() is a no-op on
> > architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.
>
> Hmm but in that case wouldn't:
>
> tlb_finish_mmu()
> -> tlb_flush_mmu()
> -> tlb_flush_mmu_free()
> -> tlb_table_flush()

And then from there we call tlb_remove_table_free(), which does a
call_rcu() to tlb_remove_table_rcu(), which will asynchronously run
later and do __tlb_remove_table_free(), which does
__tlb_remove_table()?

> -> tlb_remove_table()

I don't see any way we end up in tlb_remove_table() from here.
tlb_remove_table() is a much higher-level function, we end up there
from something like pte_free_tlb(). I think you mixed up
tlb_remove_table_free and tlb_remove_table.

> -> __tlb_remove_table_one()

Heh, I think you made the same mistake as Linus made years ago when he
was looking at tlb_remove_table(). In that function, the call to
tlb_remove_table_one() leading to __tlb_remove_table_one() **is a
slowpath only taken when memory allocation fails** - it's a fallback
from the normal path that queues up batch items in (*batch)->tables[]
(and occasionally calls tlb_table_flush() when it runs out of space in
there).

> -> tlb_remove_table_sync_one()
>
> prevent the unmapping on non-IPI architectures, thereby mitigating the
> issue?

> Also doesn't CONFIG_MMU_GATHER_RCU_TABLE_FREE imply that RCU is being used
> for page table teardown whose grace period would be disallowed until
> gup_fast() finishes and therefore that also mitigate?

I'm not sure I understand your point. CONFIG_MMU_GATHER_RCU_TABLE_FREE
implies that "Semi RCU" is used to protect page table *freeing*, but
page table freeing is irrelevant to this bug, and there is no RCU
delay involved in dropping a reference on a shared hugetlb page table.
"Semi RCU" is not used to protect against page table *reuse* at a
different address by THP. Also, as explained in the big comment block
in m/mmu_gather.c, "Semi RCU" doesn't mean RCU is definitely used -
when memory allocations fail, the __tlb_remove_table_one() fallback
path, when used on !PT_RECLAIM, will fall back to an IPI broadcast
followed by directly freeing the page table. RCU is just used as the
more polite way to do something equivalent to an IPI broadcast (RCU
will wait for other cores to go through regions where they _could_
receive an IPI as part of RCU-sched).

But also: At which point would you expect any page table to actually
be freed, triggering any of this logic? When unmapping VMA1 in step 5,
I think there might not be any page tables that exist and are fully
covered by VMA1 (or its adjacent free space, if there is any) so that
they are eligible to be freed.

> Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?

Because nothing else on that path is guaranteed to send any IPIs
before the page table becomes reusable in another process.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-24 18:22             ` Jann Horn
@ 2025-10-24 19:02               ` Lorenzo Stoakes
  2025-10-24 19:43                 ` Jann Horn
  0 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-10-24 19:02 UTC (permalink / raw)
  To: Jann Horn
  Cc: David Hildenbrand, Uschakow, Stanislav, linux-mm, trix,
	ndesaulniers, nathan, akpm, muchun.song, mike.kravetz,
	liam.howlett, osalvador, vbabka, stable

On Fri, Oct 24, 2025 at 08:22:15PM +0200, Jann Horn wrote:
> On Fri, Oct 24, 2025 at 2:25 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Mon, Oct 20, 2025 at 05:33:22PM +0200, Jann Horn wrote:
> > > On Mon, Oct 20, 2025 at 5:01 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
> > > > > 4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
> > > > > leaving two VMAs VMA1 and VMA2.
> > > > > 5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
> > > > > example an anonymous private VMA.
> > > >
> > > > Hmm, can it though?
> > > >
> > > > P1 mmap write lock will be held, and VMA lock will be held too for VMA1,
> > > >
> > > > In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu()
> > > > for IPI-synced architectures, and in that case the unmap won't finish and the
> > > > mmap write lock won't be released so nobody an map a new VMA yet can they?
> > >
> > > Yeah, I think it can't happen on configurations that always use IPI
> > > for TLB synchronization. My patch also doesn't change anything on
> > > those architectures - tlb_remove_table_sync_one() is a no-op on
> > > architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.
> >
> > Hmm but in that case wouldn't:
> >
> > tlb_finish_mmu()
> > -> tlb_flush_mmu()
> > -> tlb_flush_mmu_free()
> > -> tlb_table_flush()
>
> And then from there we call tlb_remove_table_free(), which does a
> call_rcu() to tlb_remove_table_rcu(), which will asynchronously run
> later and do __tlb_remove_table_free(), which does
> __tlb_remove_table()?

Yeah my bad!

>
> > -> tlb_remove_table()
>
> I don't see any way we end up in tlb_remove_table() from here.
> tlb_remove_table() is a much higher-level function, we end up there
> from something like pte_free_tlb(). I think you mixed up
> tlb_remove_table_free and tlb_remove_table.

Yeah sorry my mistake you're right!

>
> > -> __tlb_remove_table_one()
>
> Heh, I think you made the same mistake as Linus made years ago when he
> was looking at tlb_remove_table(). In that function, the call to
> tlb_remove_table_one() leading to __tlb_remove_table_one() **is a
> slowpath only taken when memory allocation fails** - it's a fallback
> from the normal path that queues up batch items in (*batch)->tables[]
> (and occasionally calls tlb_table_flush() when it runs out of space in
> there).
>

At least in good company ;)

> > -> tlb_remove_table_sync_one()
> >
> > prevent the unmapping on non-IPI architectures, thereby mitigating the
> > issue?
>
> > Also doesn't CONFIG_MMU_GATHER_RCU_TABLE_FREE imply that RCU is being used
> > for page table teardown whose grace period would be disallowed until
> > gup_fast() finishes and therefore that also mitigate?
>
> I'm not sure I understand your point. CONFIG_MMU_GATHER_RCU_TABLE_FREE
> implies that "Semi RCU" is used to protect page table *freeing*, but
> page table freeing is irrelevant to this bug, and there is no RCU
> delay involved in dropping a reference on a shared hugetlb page table.

It's this step:

5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
example an anonymous private VMA.

But see below, I have had the 'aha' moment... this is really horrible.

Sigh hugetlb...

> "Semi RCU" is not used to protect against page table *reuse* at a
> different address by THP. Also, as explained in the big comment block
> in m/mmu_gather.c, "Semi RCU" doesn't mean RCU is definitely used -
> when memory allocations fail, the __tlb_remove_table_one() fallback
> path, when used on !PT_RECLAIM, will fall back to an IPI broadcast
> followed by directly freeing the page table. RCU is just used as the
> more polite way to do something equivalent to an IPI broadcast (RCU
> will wait for other cores to go through regions where they _could_
> receive an IPI as part of RCU-sched).

I guess for IPI we're ok as _any_ of the TLB flushing will cause a
shootdown + thus delay on GUP-fast.

Are there any scenarios where the shootdown wouldn't happen even for the
IPI case?

>
> But also: At which point would you expect any page table to actually
> be freed, triggering any of this logic? When unmapping VMA1 in step 5,
> I think there might not be any page tables that exist and are fully
> covered by VMA1 (or its adjacent free space, if there is any) so that
> they are eligible to be freed.

Hmmm yeah, ok now I see - the PMD would remain in place throughout, we
don't actually need to free anything, that's the crux of this isn't
it... yikes.

"Initially, we have a hugetlb shared page table covering 1G of
address space which maps hugetlb 2M pages, which is used by two
hugetlb VMAs in different processes (processes P1 and P2)."

"Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
leaving two VMAs VMA1 and VMA2."

So the 1 GB would have to be aligned and (xxx = PUD entry, y = VMA1
entries, z = VMA2 entries)

      PUD
    |-----|
    \     \
    /     /
    \     \      PMD
    /     /    |-----|
    | xxx |--->| y1  |
    /     /    | y2  |
    \     \    | ... |
    /     /    |y255 |
    \     \    |y256 |
    |-----|    | z1  |
	       | z2  |
               | ... |
	       |z255 |
	       |z256 |
	       |-----|

So the hugetlb page sharing stuff defeats all assumptions and
checks... sigh.

>
> > Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?
>
> Because nothing else on that path is guaranteed to send any IPIs
> before the page table becomes reusable in another process.

I feel that David's suggestion of just disallowing the use of shared page
tables like this (I mean really does it actually come up that much?) is the
right one then.

I wonder whether we shouldn't just free the PMD after it becomes unshared?
It's kind of crazy to think we'll allow a reuse like this, it's asking for
trouble.

Moving on to another point:

One point here I'd like to raise - this seems like a 'just so'
scenario. I'm not saying we shouldn't fix it, but we're paying a _very
heavy_ penalty here for a scenario that really does require some unusual
things to happen in GUP fast and an _extremely_ tight and specific window
in which to do it.

Plus isn't it going to be difficult to mediate exactly when an unshare will
happen?

Since you can't pre-empt and IRQs are disabled, to even get the scenario to
happen is surely very very difficult, you really have to have some form of
(para?)virtualisation preemption or a NMI which would have to be very long
lasting (the operations you mention in P2 are hardly small ones) which
seems very very unlikely for an attacker to be able to achieve.

So my question is - would it be reasonable to consider this at the very
least a vanishingly small, 'paranoid' fixup? I think it's telling you
couldn't come up with a repro, and you are usually very good at that :)

Another question, perhaps silly one, is - what is the attack scenario here?
I'm not so familiar with hugetlb page table sharing, but is it in any way
feasible that you'd access another process's mappings? If not, the attack
scenario is that you end up accidentally accessing some other part of the
process's memory (which doesn't seem so bad right?).

Thanks, sorry for all the questions but really want to make sure I
understand what's going on here (and can later extract some of this into
documentation also potentially! :)

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-24 19:02               ` Lorenzo Stoakes
@ 2025-10-24 19:43                 ` Jann Horn
  2025-10-24 19:58                   ` Lorenzo Stoakes
  2025-10-29 16:19                   ` David Hildenbrand
  0 siblings, 2 replies; 35+ messages in thread
From: Jann Horn @ 2025-10-24 19:43 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Uschakow, Stanislav, linux-mm, trix,
	ndesaulniers, nathan, akpm, muchun.song, mike.kravetz,
	liam.howlett, osalvador, vbabka, stable

On Fri, Oct 24, 2025 at 9:03 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Fri, Oct 24, 2025 at 08:22:15PM +0200, Jann Horn wrote:
> > On Fri, Oct 24, 2025 at 2:25 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Mon, Oct 20, 2025 at 05:33:22PM +0200, Jann Horn wrote:
> > > > On Mon, Oct 20, 2025 at 5:01 PM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
> > > > > > 4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
> > > > > > leaving two VMAs VMA1 and VMA2.
> > > > > > 5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
> > > > > > example an anonymous private VMA.
> > > > >
> > > > > Hmm, can it though?
> > > > >
> > > > > P1 mmap write lock will be held, and VMA lock will be held too for VMA1,
> > > > >
> > > > > In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu()
> > > > > for IPI-synced architectures, and in that case the unmap won't finish and the
> > > > > mmap write lock won't be released so nobody an map a new VMA yet can they?
> > > >
> > > > Yeah, I think it can't happen on configurations that always use IPI
> > > > for TLB synchronization. My patch also doesn't change anything on
> > > > those architectures - tlb_remove_table_sync_one() is a no-op on
> > > > architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.
> > >
> > > Hmm but in that case wouldn't:
> > >
> > > tlb_finish_mmu()
> > > -> tlb_flush_mmu()
> > > -> tlb_flush_mmu_free()
> > > -> tlb_table_flush()
> >
> > And then from there we call tlb_remove_table_free(), which does a
> > call_rcu() to tlb_remove_table_rcu(), which will asynchronously run
> > later and do __tlb_remove_table_free(), which does
> > __tlb_remove_table()?
>
> Yeah my bad!
>
> >
> > > -> tlb_remove_table()
> >
> > I don't see any way we end up in tlb_remove_table() from here.
> > tlb_remove_table() is a much higher-level function, we end up there
> > from something like pte_free_tlb(). I think you mixed up
> > tlb_remove_table_free and tlb_remove_table.
>
> Yeah sorry my mistake you're right!
>
> >
> > > -> __tlb_remove_table_one()
> >
> > Heh, I think you made the same mistake as Linus made years ago when he
> > was looking at tlb_remove_table(). In that function, the call to
> > tlb_remove_table_one() leading to __tlb_remove_table_one() **is a
> > slowpath only taken when memory allocation fails** - it's a fallback
> > from the normal path that queues up batch items in (*batch)->tables[]
> > (and occasionally calls tlb_table_flush() when it runs out of space in
> > there).
> >
>
> At least in good company ;)
>
> > > -> tlb_remove_table_sync_one()
> > >
> > > prevent the unmapping on non-IPI architectures, thereby mitigating the
> > > issue?
> >
> > > Also doesn't CONFIG_MMU_GATHER_RCU_TABLE_FREE imply that RCU is being used
> > > for page table teardown whose grace period would be disallowed until
> > > gup_fast() finishes and therefore that also mitigate?
> >
> > I'm not sure I understand your point. CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > implies that "Semi RCU" is used to protect page table *freeing*, but
> > page table freeing is irrelevant to this bug, and there is no RCU
> > delay involved in dropping a reference on a shared hugetlb page table.
>
> It's this step:
>
> 5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
> example an anonymous private VMA.
>
> But see below, I have had the 'aha' moment... this is really horrible.
>
> Sigh hugetlb...
>
> > "Semi RCU" is not used to protect against page table *reuse* at a
> > different address by THP. Also, as explained in the big comment block
> > in m/mmu_gather.c, "Semi RCU" doesn't mean RCU is definitely used -
> > when memory allocations fail, the __tlb_remove_table_one() fallback
> > path, when used on !PT_RECLAIM, will fall back to an IPI broadcast
> > followed by directly freeing the page table. RCU is just used as the
> > more polite way to do something equivalent to an IPI broadcast (RCU
> > will wait for other cores to go through regions where they _could_
> > receive an IPI as part of RCU-sched).
>
> I guess for IPI we're ok as _any_ of the TLB flushing will cause a
> shootdown + thus delay on GUP-fast.
>
> Are there any scenarios where the shootdown wouldn't happen even for the
> IPI case?



> > But also: At which point would you expect any page table to actually
> > be freed, triggering any of this logic? When unmapping VMA1 in step 5,
> > I think there might not be any page tables that exist and are fully
> > covered by VMA1 (or its adjacent free space, if there is any) so that
> > they are eligible to be freed.
>
> Hmmm yeah, ok now I see - the PMD would remain in place throughout, we
> don't actually need to free anything, that's the crux of this isn't
> it... yikes.
>
> "Initially, we have a hugetlb shared page table covering 1G of
> address space which maps hugetlb 2M pages, which is used by two
> hugetlb VMAs in different processes (processes P1 and P2)."
>
> "Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
> leaving two VMAs VMA1 and VMA2."
>
> So the 1 GB would have to be aligned and (xxx = PUD entry, y = VMA1
> entries, z = VMA2 entries)
>
>
>       PUD
>     |-----|
>     \     \
>     /     /
>     \     \      PMD
>     /     /    |-----|
>     | xxx |--->| y1  |
>     /     /    | y2  |
>     \     \    | ... |
>     /     /    |y255 |
>     \     \    |y256 |
>     |-----|    | z1  |
>                | z2  |
>                | ... |
>                |z255 |
>                |z256 |
>                |-----|
>
> So the hugetlb page sharing stuff defeats all assumptions and
> checks... sigh.
>
> >
> > > Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?
> >
> > Because nothing else on that path is guaranteed to send any IPIs
> > before the page table becomes reusable in another process.
>
> I feel that David's suggestion of just disallowing the use of shared page
> tables like this (I mean really does it actually come up that much?) is the
> right one then.

Yeah, I also like that suggestion.

> I wonder whether we shouldn't just free the PMD after it becomes unshared?
> It's kind of crazy to think we'll allow a reuse like this, it's asking for
> trouble.
>
> Moving on to another point:
>
> One point here I'd like to raise - this seems like a 'just so'
> scenario. I'm not saying we shouldn't fix it, but we're paying a _very
> heavy_ penalty here for a scenario that really does require some unusual
> things to happen in GUP fast and an _extremely_ tight and specific window
> in which to do it.

Yes.

> Plus isn't it going to be difficult to mediate exactly when an unshare will
> happen?
>
> Since you can't pre-empt and IRQs are disabled, to even get the scenario to
> happen is surely very very difficult, you really have to have some form of
> (para?)virtualisation preemption or a NMI which would have to be very long
> lasting (the operations you mention in P2 are hardly small ones) which
> seems very very unlikely for an attacker to be able to achieve.

Yeah, I think it would have to be something like a hypervisor
rescheduling to another vCPU, or potentially it could happen if
someone is doing kernel performance profiling with perf_event_open()
(which might do stuff like copying large amounts of userspace stack
memory from NMI context depending on runtime configuration).

> So my question is - would it be reasonable to consider this at the very
> least a vanishingly small, 'paranoid' fixup? I think it's telling you
> couldn't come up with a repro, and you are usually very good at that :)

I mean, how hard this is to hit probably partly depends on what
choices hypervisors make about vCPU scheduling. And it would probably
also be easier to hit for an attacker with CAP_PERFMON, though that's
true of many bugs.

But yeah, it's not the kind of bug I would choose to target if I
wanted to write an exploit and had a larger selection of bugs to
choose from.

> Another question, perhaps silly one, is - what is the attack scenario here?
> I'm not so familiar with hugetlb page table sharing, but is it in any way
> feasible that you'd access another process's mappings? If not, the attack
> scenario is that you end up accidentally accessing some other part of the
> process's memory (which doesn't seem so bad right?).

I think the impact would be P2 being able to read/write unrelated data
in P1. Though with the way things are currently implemented, I think
that requires P1 to do this weird unmap of half of a hugetlb mapping.

We're also playing with fire because if P2 is walking page tables of
P1 while P1 is concurrently freeing page tables, normal TLB flush IPIs
issued by P1 wouldn't be sent to P2. I think that's not exploitable in
the current implementation because CONFIG_MMU_GATHER_RCU_TABLE_FREE
unconditionally either frees page tables through RCU or does IPI
broadcasts sent to the whole system, but it is scary because
sensible-looking optimizations could turn this into a user-to-kernel
privilege escalation bug. For example, if we decided that in cases
where we already did an IPI-based TLB flush, or in cases where we are
single-threaded, we don't need to free page tables with Semi-RCU delay
to synchronize against gup_fast().

> Thanks, sorry for all the questions but really want to make sure I
> understand what's going on here (and can later extract some of this into
> documentation also potentially! :)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-24 19:43                 ` Jann Horn
@ 2025-10-24 19:58                   ` Lorenzo Stoakes
  2025-10-24 21:41                     ` Jann Horn
  2025-10-29 16:19                   ` David Hildenbrand
  1 sibling, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-10-24 19:58 UTC (permalink / raw)
  To: Jann Horn
  Cc: David Hildenbrand, Uschakow, Stanislav, linux-mm, trix,
	ndesaulniers, nathan, akpm, muchun.song, mike.kravetz,
	liam.howlett, osalvador, vbabka, stable

On Fri, Oct 24, 2025 at 09:43:43PM +0200, Jann Horn wrote:
> > So my question is - would it be reasonable to consider this at the very
> > least a vanishingly small, 'paranoid' fixup? I think it's telling you
> > couldn't come up with a repro, and you are usually very good at that :)
>
> I mean, how hard this is to hit probably partly depends on what
> choices hypervisors make about vCPU scheduling. And it would probably
> also be easier to hit for an attacker with CAP_PERFMON, though that's
> true of many bugs.
>
> But yeah, it's not the kind of bug I would choose to target if I
> wanted to write an exploit and had a larger selection of bugs to
> choose from.
>
> > Another question, perhaps silly one, is - what is the attack scenario here?
> > I'm not so familiar with hugetlb page table sharing, but is it in any way
> > feasible that you'd access another process's mappings? If not, the attack
> > scenario is that you end up accidentally accessing some other part of the
> > process's memory (which doesn't seem so bad right?).
>
> I think the impact would be P2 being able to read/write unrelated data
> in P1. Though with the way things are currently implemented, I think
> that requires P1 to do this weird unmap of half of a hugetlb mapping.
>
> We're also playing with fire because if P2 is walking page tables of
> P1 while P1 is concurrently freeing page tables, normal TLB flush IPIs
> issued by P1 wouldn't be sent to P2. I think that's not exploitable in
> the current implementation because CONFIG_MMU_GATHER_RCU_TABLE_FREE
> unconditionally either frees page tables through RCU or does IPI
> broadcasts sent to the whole system, but it is scary because
> sensible-looking optimizations could turn this into a user-to-kernel
> privilege escalation bug. For example, if we decided that in cases
> where we already did an IPI-based TLB flush, or in cases where we are
> single-threaded, we don't need to free page tables with Semi-RCU delay
> to synchronize against gup_fast().

Would it therefore be reasonable to say that this is more of a preventative
measure against future kernel changes (which otherwise seem reasonable)
which might lead to exploitable bugs rather than being a practiclaly
exploitable bug in itself?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-24 19:58                   ` Lorenzo Stoakes
@ 2025-10-24 21:41                     ` Jann Horn
  0 siblings, 0 replies; 35+ messages in thread
From: Jann Horn @ 2025-10-24 21:41 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Uschakow, Stanislav, linux-mm, trix, nathan,
	akpm, muchun.song, liam.howlett, osalvador, vbabka, stable

On Fri, Oct 24, 2025 at 9:59 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Fri, Oct 24, 2025 at 09:43:43PM +0200, Jann Horn wrote:
> > > So my question is - would it be reasonable to consider this at the very
> > > least a vanishingly small, 'paranoid' fixup? I think it's telling you
> > > couldn't come up with a repro, and you are usually very good at that :)
> >
> > I mean, how hard this is to hit probably partly depends on what
> > choices hypervisors make about vCPU scheduling. And it would probably
> > also be easier to hit for an attacker with CAP_PERFMON, though that's
> > true of many bugs.
> >
> > But yeah, it's not the kind of bug I would choose to target if I
> > wanted to write an exploit and had a larger selection of bugs to
> > choose from.
> >
> > > Another question, perhaps silly one, is - what is the attack scenario here?
> > > I'm not so familiar with hugetlb page table sharing, but is it in any way
> > > feasible that you'd access another process's mappings? If not, the attack
> > > scenario is that you end up accidentally accessing some other part of the
> > > process's memory (which doesn't seem so bad right?).
> >
> > I think the impact would be P2 being able to read/write unrelated data
> > in P1. Though with the way things are currently implemented, I think
> > that requires P1 to do this weird unmap of half of a hugetlb mapping.
> >
> > We're also playing with fire because if P2 is walking page tables of
> > P1 while P1 is concurrently freeing page tables, normal TLB flush IPIs
> > issued by P1 wouldn't be sent to P2. I think that's not exploitable in
> > the current implementation because CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > unconditionally either frees page tables through RCU or does IPI
> > broadcasts sent to the whole system, but it is scary because
> > sensible-looking optimizations could turn this into a user-to-kernel
> > privilege escalation bug. For example, if we decided that in cases
> > where we already did an IPI-based TLB flush, or in cases where we are
> > single-threaded, we don't need to free page tables with Semi-RCU delay
> > to synchronize against gup_fast().
>
> Would it therefore be reasonable to say that this is more of a preventative
> measure against future kernel changes (which otherwise seem reasonable)
> which might lead to exploitable bugs rather than being a practiclaly
> exploitable bug in itself?

I would say it is a security fix for theoretical userspace that either
intentionally partially unmaps hugetlb mappings (which would probably
be weird), or maps and partially unmaps attacker-supplied file
descriptors (without necessarily expecting them to be hugetlb). (I
know of userspace that mmap()s file descriptors coming from untrusted
code, though I don't know examples that would then partially unmap
them.) Admittedly there is some perfectionism involved here on my
part. In particular, it irks me to make qualitative distinctions
between bugs based on how hard to hit the timing requirements for them
are.

But yes, a large part of my motivation for writing this patch was to
prevent reasonable future changes to the rest of MM from making this a
worse bug.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-24 19:43                 ` Jann Horn
  2025-10-24 19:58                   ` Lorenzo Stoakes
@ 2025-10-29 16:19                   ` David Hildenbrand
  2025-10-29 18:02                     ` Lorenzo Stoakes
  1 sibling, 1 reply; 35+ messages in thread
From: David Hildenbrand @ 2025-10-29 16:19 UTC (permalink / raw)
  To: Jann Horn, Lorenzo Stoakes
  Cc: Uschakow, Stanislav, linux-mm, trix, ndesaulniers, nathan, akpm,
	muchun.song, mike.kravetz, liam.howlett, osalvador, vbabka,
	stable

>>>> Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?
>>>
>>> Because nothing else on that path is guaranteed to send any IPIs
>>> before the page table becomes reusable in another process.
>>
>> I feel that David's suggestion of just disallowing the use of shared page
>> tables like this (I mean really does it actually come up that much?) is the
>> right one then.
> 
> Yeah, I also like that suggestion.

I started hacking on this (only found a bit of time this week), and in 
essence, we'll be using the mmu_gather when unsharing to collect the 
pages and handle the TLB flushing etc.

(TLB flushing in that hugetlb area is a mess)

It almost looks like a cleanup.

Having that said, it will take a bit longer to finish it and, of course, 
I first have to test it then to see if it even works.

But it looks doable. :)

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-29 16:19                   ` David Hildenbrand
@ 2025-10-29 18:02                     ` Lorenzo Stoakes
  2025-11-18 10:03                       ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-10-29 18:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jann Horn, Uschakow, Stanislav, linux-mm, trix, ndesaulniers,
	nathan, akpm, muchun.song, mike.kravetz, liam.howlett, osalvador,
	vbabka, stable

On Wed, Oct 29, 2025 at 05:19:54PM +0100, David Hildenbrand wrote:
> > > > > Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?
> > > >
> > > > Because nothing else on that path is guaranteed to send any IPIs
> > > > before the page table becomes reusable in another process.
> > >
> > > I feel that David's suggestion of just disallowing the use of shared page
> > > tables like this (I mean really does it actually come up that much?) is the
> > > right one then.
> >
> > Yeah, I also like that suggestion.
>
> I started hacking on this (only found a bit of time this week), and in
> essence, we'll be using the mmu_gather when unsharing to collect the pages
> and handle the TLB flushing etc.
>
> (TLB flushing in that hugetlb area is a mess)
>
> It almost looks like a cleanup.
>
> Having that said, it will take a bit longer to finish it and, of course, I
> first have to test it then to see if it even works.
>
> But it looks doable. :)

Ohhhh nice :)

I look forward to it!

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-10-29 18:02                     ` Lorenzo Stoakes
@ 2025-11-18 10:03                       ` David Hildenbrand (Red Hat)
  2025-11-19 16:08                         ` Lorenzo Stoakes
  0 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-18 10:03 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jann Horn, Uschakow, Stanislav, linux-mm, trix, ndesaulniers,
	nathan, akpm, muchun.song, mike.kravetz, liam.howlett, osalvador,
	vbabka, stable

On 29.10.25 19:02, Lorenzo Stoakes wrote:
> On Wed, Oct 29, 2025 at 05:19:54PM +0100, David Hildenbrand wrote:
>>>>>> Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?
>>>>>
>>>>> Because nothing else on that path is guaranteed to send any IPIs
>>>>> before the page table becomes reusable in another process.
>>>>
>>>> I feel that David's suggestion of just disallowing the use of shared page
>>>> tables like this (I mean really does it actually come up that much?) is the
>>>> right one then.
>>>
>>> Yeah, I also like that suggestion.
>>
>> I started hacking on this (only found a bit of time this week), and in
>> essence, we'll be using the mmu_gather when unsharing to collect the pages
>> and handle the TLB flushing etc.
>>
>> (TLB flushing in that hugetlb area is a mess)
>>
>> It almost looks like a cleanup.
>>
>> Having that said, it will take a bit longer to finish it and, of course, I
>> first have to test it then to see if it even works.
>>
>> But it looks doable. :)
> 
> Ohhhh nice :)
> 
> I look forward to it!

As shared offline already, it looked simple, but there is one nasty 
corner case: if we never reuse a shared page table, who will take care 
of unmapping all pages?

I played with various ideas, but it just ended up looking more 
complicated and possibly even slower.

So what I am currently looking into is simply reducing (batching) the 
number of IPIs.

In essence, we only have to send one IPI when unsharing multiple page 
tables, and we only have to send one when we are the last one sharing 
the page table (before it can get reused).

While at it, I'm looking into making also the TLB flushing easier to 
understand here.

I'm hacking on a prototype and should likely have something to test this 
week.

[I guess what I am doing now is aligned with Jann's initial ideas to 
optimize this ]

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-11-18 10:03                       ` David Hildenbrand (Red Hat)
@ 2025-11-19 16:08                         ` Lorenzo Stoakes
  2025-11-19 16:29                           ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-11-19 16:08 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Jann Horn, Uschakow, Stanislav, linux-mm, trix, ndesaulniers,
	nathan, akpm, muchun.song, mike.kravetz, liam.howlett, osalvador,
	vbabka, stable

On Tue, Nov 18, 2025 at 11:03:07AM +0100, David Hildenbrand (Red Hat) wrote:
> On 29.10.25 19:02, Lorenzo Stoakes wrote:
> > On Wed, Oct 29, 2025 at 05:19:54PM +0100, David Hildenbrand wrote:
> > > > > > > Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?
> > > > > >
> > > > > > Because nothing else on that path is guaranteed to send any IPIs
> > > > > > before the page table becomes reusable in another process.
> > > > >
> > > > > I feel that David's suggestion of just disallowing the use of shared page
> > > > > tables like this (I mean really does it actually come up that much?) is the
> > > > > right one then.
> > > >
> > > > Yeah, I also like that suggestion.
> > >
> > > I started hacking on this (only found a bit of time this week), and in
> > > essence, we'll be using the mmu_gather when unsharing to collect the pages
> > > and handle the TLB flushing etc.
> > >
> > > (TLB flushing in that hugetlb area is a mess)
> > >
> > > It almost looks like a cleanup.
> > >
> > > Having that said, it will take a bit longer to finish it and, of course, I
> > > first have to test it then to see if it even works.
> > >
> > > But it looks doable. :)
> >
> > Ohhhh nice :)
> >
> > I look forward to it!
>
> As shared offline already, it looked simple, but there is one nasty corner
> case: if we never reuse a shared page table, who will take care of unmapping
> all pages?

Right. That is nasty... :)

>
> I played with various ideas, but it just ended up looking more complicated
> and possibly even slower.

Yeah...

>
> So what I am currently looking into is simply reducing (batching) the number
> of IPIs.

As in the IPIs we are now generating in tlb_remove_table_sync_one()?

Or something else?

As this bug is only an issue when we don't use IPIs for pgtable freeing right
(e.g. CONFIG_MMU_GATHER_RCU_TABLE_FREE is set), as otherwise
tlb_remove_table_sync_one() is a no-op?

>
> In essence, we only have to send one IPI when unsharing multiple page
> tables, and we only have to send one when we are the last one sharing the
> page table (before it can get reused).

Right, hopefully that significantly cuts down on the amount genrated.

>
> While at it, I'm looking into making also the TLB flushing easier to
> understand here.

Good idea ;)

>
> I'm hacking on a prototype and should likely have something to test this
> week.

Thanks!

>
> [I guess what I am doing now is aligned with Jann's initial ideas to
> optimize this ]
>
> --
> Cheers
>
> David

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-11-19 16:08                         ` Lorenzo Stoakes
@ 2025-11-19 16:29                           ` David Hildenbrand (Red Hat)
  2025-11-19 16:31                             ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-19 16:29 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jann Horn, Uschakow, Stanislav, linux-mm, trix, ndesaulniers,
	nathan, akpm, muchun.song, mike.kravetz, liam.howlett, osalvador,
	vbabka, stable

>>
>> So what I am currently looking into is simply reducing (batching) the number
>> of IPIs.
> 
> As in the IPIs we are now generating in tlb_remove_table_sync_one()?
> 
> Or something else?

Yes, for now. I'm essentially reducing the number of 
tlb_remove_table_sync_one() calls.

> 
> As this bug is only an issue when we don't use IPIs for pgtable freeing right
> (e.g. CONFIG_MMU_GATHER_RCU_TABLE_FREE is set), as otherwise
> tlb_remove_table_sync_one() is a no-op?

Right. But it's still confusing: I think for page table unsharing we 
always need an IPI one way or the other to make sure GUP-fast was called.

At least for preventing that anybody would be able to reuse the page 
table in the meantime.

That is either:

(a) The TLB shootdown implied an IPI

(b) We manually send one

But that's where it gets confusing: nowadays x86 also selects 
MMU_GATHER_RCU_TABLE_FREE, meaning we would get a double IPI?

This is so complicated, so I might be missing something.

But it's the same behavior we have in collapse_huge_page() where we first

> 
>>
>> In essence, we only have to send one IPI when unsharing multiple page
>> tables, and we only have to send one when we are the last one sharing the
>> page table (before it can get reused).
> 
> Right, hopefully that significantly cuts down on the amount genrated.

I'd assume that the problem of the current approach is that when we fork 
a child and it quits, that we call __unmap_hugepage_range(). If the 
range is large enough to cover many PMD tables (multiple gigabytes?), we
essentially send one IPI per PMD table we are unsharing, when we really 
only have to send one.

That's the theory ...

-- 
Cheers

David

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-11-19 16:29                           ` David Hildenbrand (Red Hat)
@ 2025-11-19 16:31                             ` David Hildenbrand (Red Hat)
  2025-11-20 15:47                               ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-19 16:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jann Horn, Uschakow, Stanislav, linux-mm, trix, ndesaulniers,
	nathan, akpm, muchun.song, mike.kravetz, liam.howlett, osalvador,
	vbabka, stable

On 19.11.25 17:29, David Hildenbrand (Red Hat) wrote:
>>>
>>> So what I am currently looking into is simply reducing (batching) the number
>>> of IPIs.
>>
>> As in the IPIs we are now generating in tlb_remove_table_sync_one()?
>>
>> Or something else?
> 
> Yes, for now. I'm essentially reducing the number of
> tlb_remove_table_sync_one() calls.
> 
>>
>> As this bug is only an issue when we don't use IPIs for pgtable freeing right
>> (e.g. CONFIG_MMU_GATHER_RCU_TABLE_FREE is set), as otherwise
>> tlb_remove_table_sync_one() is a no-op?
> 
> Right. But it's still confusing: I think for page table unsharing we
> always need an IPI one way or the other to make sure GUP-fast was called.
> 
> At least for preventing that anybody would be able to reuse the page
> table in the meantime.
> 
> That is either:
> 
> (a) The TLB shootdown implied an IPI
> 
> (b) We manually send one
> 
> But that's where it gets confusing: nowadays x86 also selects
> MMU_GATHER_RCU_TABLE_FREE, meaning we would get a double IPI?
> 
> This is so complicated, so I might be missing something.
> 
> But it's the same behavior we have in collapse_huge_page() where we first

... flush and then call tlb_remove_table_sync_one().

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-11-19 16:31                             ` David Hildenbrand (Red Hat)
@ 2025-11-20 15:47                               ` David Hildenbrand (Red Hat)
  2025-12-03 17:22                                 ` Prakash Sangappa
  0 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-20 15:47 UTC (permalink / raw)
  To: Lorenzo Stoakes, Uschakow, Stanislav, Prakash Sangappa
  Cc: Jann Horn, linux-mm, trix, nathan, akpm, muchun.song,
	mike.kravetz, liam.howlett, osalvador, vbabka, stable

On 11/19/25 17:31, David Hildenbrand (Red Hat) wrote:
> On 19.11.25 17:29, David Hildenbrand (Red Hat) wrote:
>>>>
>>>> So what I am currently looking into is simply reducing (batching) the number
>>>> of IPIs.
>>>
>>> As in the IPIs we are now generating in tlb_remove_table_sync_one()?
>>>
>>> Or something else?
>>
>> Yes, for now. I'm essentially reducing the number of
>> tlb_remove_table_sync_one() calls.
>>
>>>
>>> As this bug is only an issue when we don't use IPIs for pgtable freeing right
>>> (e.g. CONFIG_MMU_GATHER_RCU_TABLE_FREE is set), as otherwise
>>> tlb_remove_table_sync_one() is a no-op?
>>
>> Right. But it's still confusing: I think for page table unsharing we
>> always need an IPI one way or the other to make sure GUP-fast was called.
>>
>> At least for preventing that anybody would be able to reuse the page
>> table in the meantime.
>>
>> That is either:
>>
>> (a) The TLB shootdown implied an IPI
>>
>> (b) We manually send one
>>
>> But that's where it gets confusing: nowadays x86 also selects
>> MMU_GATHER_RCU_TABLE_FREE, meaning we would get a double IPI?
>>
>> This is so complicated, so I might be missing something.
>>
>> But it's the same behavior we have in collapse_huge_page() where we first
> 
> ... flush and then call tlb_remove_table_sync_one().
> 

Okay, I pushed something to

https://github.com/davidhildenbrand/linux.git hugetlb_unshare

I did a quick test and my house did not burn down. But I don't have a 
beefy machine to really stress+benchmark PMD table unsharing.

Could one of the original reporters (Stanislav? Prakash?) try it out to 
see if that would help fix the regression or if it would be a dead end?

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-11-20 15:47                               ` David Hildenbrand (Red Hat)
@ 2025-12-03 17:22                                 ` Prakash Sangappa
  2025-12-03 19:45                                   ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 35+ messages in thread
From: Prakash Sangappa @ 2025-12-03 17:22 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Lorenzo Stoakes, Uschakow, Stanislav, Jann Horn, linux-mm, trix,
	nathan, akpm, muchun.song, mike.kravetz, Liam Howlett, osalvador,
	vbabka, stable



> On Nov 20, 2025, at 7:47 AM, David Hildenbrand (Red Hat) <david@kernel.org> wrote:
> 
> On 11/19/25 17:31, David Hildenbrand (Red Hat) wrote:
>> On 19.11.25 17:29, David Hildenbrand (Red Hat) wrote:
>>>>> 
>>>>> So what I am currently looking into is simply reducing (batching) the number
>>>>> of IPIs.
>>>> 
>>>> As in the IPIs we are now generating in tlb_remove_table_sync_one()?
>>>> 
>>>> Or something else?
>>> 
>>> Yes, for now. I'm essentially reducing the number of
>>> tlb_remove_table_sync_one() calls.
>>> 
>>>> 
>>>> As this bug is only an issue when we don't use IPIs for pgtable freeing right
>>>> (e.g. CONFIG_MMU_GATHER_RCU_TABLE_FREE is set), as otherwise
>>>> tlb_remove_table_sync_one() is a no-op?
>>> 
>>> Right. But it's still confusing: I think for page table unsharing we
>>> always need an IPI one way or the other to make sure GUP-fast was called.
>>> 
>>> At least for preventing that anybody would be able to reuse the page
>>> table in the meantime.
>>> 
>>> That is either:
>>> 
>>> (a) The TLB shootdown implied an IPI
>>> 
>>> (b) We manually send one
>>> 
>>> But that's where it gets confusing: nowadays x86 also selects
>>> MMU_GATHER_RCU_TABLE_FREE, meaning we would get a double IPI?
>>> 
>>> This is so complicated, so I might be missing something.
>>> 
>>> But it's the same behavior we have in collapse_huge_page() where we first
>> ... flush and then call tlb_remove_table_sync_one().
> 
> Okay, I pushed something to
> 
> https://github.com/davidhildenbrand/linux.git hugetlb_unshare

For testing had to backport the fix to v5.15. Used top 8 commits from the above tree.
v5.15 kernel does not have ptdesc and hugetlb vma locking.

With that change, our DB team has verified that it fixes the regression.

Will you push this fix to LTS trees after it is reviewed and merged?

Thanks,
-Prakash

> 
> I did a quick test and my house did not burn down. But I don't have a beefy machine to really stress+benchmark PMD table unsharing.
> 
> Could one of the original reporters (Stanislav? Prakash?) try it out to see if that would help fix the regression or if it would be a dead end?
> 
> -- 
> Cheers
> 
> David


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
  2025-12-03 17:22                                 ` Prakash Sangappa
@ 2025-12-03 19:45                                   ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-03 19:45 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Lorenzo Stoakes, Uschakow, Stanislav, Jann Horn, linux-mm, trix,
	nathan, akpm, muchun.song, mike.kravetz, Liam Howlett, osalvador,
	vbabka, stable

On 12/3/25 18:22, Prakash Sangappa wrote:
> 
> 
>> On Nov 20, 2025, at 7:47 AM, David Hildenbrand (Red Hat) <david@kernel.org> wrote:
>>
>> On 11/19/25 17:31, David Hildenbrand (Red Hat) wrote:
>>> On 19.11.25 17:29, David Hildenbrand (Red Hat) wrote:
>>>>>>
>>>>>> So what I am currently looking into is simply reducing (batching) the number
>>>>>> of IPIs.
>>>>>
>>>>> As in the IPIs we are now generating in tlb_remove_table_sync_one()?
>>>>>
>>>>> Or something else?
>>>>
>>>> Yes, for now. I'm essentially reducing the number of
>>>> tlb_remove_table_sync_one() calls.
>>>>
>>>>>
>>>>> As this bug is only an issue when we don't use IPIs for pgtable freeing right
>>>>> (e.g. CONFIG_MMU_GATHER_RCU_TABLE_FREE is set), as otherwise
>>>>> tlb_remove_table_sync_one() is a no-op?
>>>>
>>>> Right. But it's still confusing: I think for page table unsharing we
>>>> always need an IPI one way or the other to make sure GUP-fast was called.
>>>>
>>>> At least for preventing that anybody would be able to reuse the page
>>>> table in the meantime.
>>>>
>>>> That is either:
>>>>
>>>> (a) The TLB shootdown implied an IPI
>>>>
>>>> (b) We manually send one
>>>>
>>>> But that's where it gets confusing: nowadays x86 also selects
>>>> MMU_GATHER_RCU_TABLE_FREE, meaning we would get a double IPI?
>>>>
>>>> This is so complicated, so I might be missing something.
>>>>
>>>> But it's the same behavior we have in collapse_huge_page() where we first
>>> ... flush and then call tlb_remove_table_sync_one().
>>
>> Okay, I pushed something to
>>
>> https://github.com/davidhildenbrand/linux.git hugetlb_unshare
> 
> For testing had to backport the fix to v5.15. Used top 8 commits from the above tree.
> v5.15 kernel does not have ptdesc and hugetlb vma locking.
> 
> With that change, our DB team has verified that it fixes the regression.

Great, thanks for testing!

> 
> Will you push this fix to LTS trees after it is reviewed and merged?

I can further clean this up and send it out. There is something about 
the mmu_gather integration that I don't enjoy, but I didn't find a 
better solution so far.

I can try backporting it, I would likely have to try to minimize the 
prereq cleanups. Let me see to which degree this can be done in a 
sensible way!

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2025-12-03 19:45 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-29 14:30 Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race Uschakow, Stanislav
2025-09-01 10:58 ` Jann Horn
2025-09-01 11:26   ` David Hildenbrand
2025-09-04 12:39     ` Uschakow, Stanislav
2025-10-08 22:54     ` Prakash Sangappa
2025-10-09  7:23       ` David Hildenbrand
2025-10-09 15:06         ` Prakash Sangappa
2025-10-09  7:40   ` David Hildenbrand
2025-10-09  8:19     ` David Hildenbrand
2025-10-16  9:21     ` Lorenzo Stoakes
2025-10-16 19:13       ` David Hildenbrand
2025-10-16 18:44     ` Jann Horn
2025-10-16 19:10       ` David Hildenbrand
2025-10-16 19:26         ` Jann Horn
2025-10-16 19:44           ` David Hildenbrand
2025-10-16 20:25             ` Jann Horn
2025-10-20 15:00       ` Lorenzo Stoakes
2025-10-20 15:33         ` Jann Horn
2025-10-24 12:24           ` Lorenzo Stoakes
2025-10-24 18:22             ` Jann Horn
2025-10-24 19:02               ` Lorenzo Stoakes
2025-10-24 19:43                 ` Jann Horn
2025-10-24 19:58                   ` Lorenzo Stoakes
2025-10-24 21:41                     ` Jann Horn
2025-10-29 16:19                   ` David Hildenbrand
2025-10-29 18:02                     ` Lorenzo Stoakes
2025-11-18 10:03                       ` David Hildenbrand (Red Hat)
2025-11-19 16:08                         ` Lorenzo Stoakes
2025-11-19 16:29                           ` David Hildenbrand (Red Hat)
2025-11-19 16:31                             ` David Hildenbrand (Red Hat)
2025-11-20 15:47                               ` David Hildenbrand (Red Hat)
2025-12-03 17:22                                 ` Prakash Sangappa
2025-12-03 19:45                                   ` David Hildenbrand (Red Hat)
2025-10-20 17:18         ` David Hildenbrand
2025-10-24  9:59           ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox