From: David Hildenbrand <david@redhat.com>
To: Jann Horn <jannh@google.com>
Cc: "Uschakow, Stanislav" <suschako@amazon.de>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"trix@redhat.com" <trix@redhat.com>,
"ndesaulniers@google.com" <ndesaulniers@google.com>,
"nathan@kernel.org" <nathan@kernel.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"muchun.song@linux.dev" <muchun.song@linux.dev>,
"mike.kravetz@oracle.com" <mike.kravetz@oracle.com>,
"lorenzo.stoakes@oracle.com" <lorenzo.stoakes@oracle.com>,
"liam.howlett@oracle.com" <liam.howlett@oracle.com>,
"osalvador@suse.de" <osalvador@suse.de>,
"vbabka@suse.cz" <vbabka@suse.cz>,
"stable@vger.kernel.org" <stable@vger.kernel.org>
Subject: Re: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
Date: Thu, 16 Oct 2025 21:10:03 +0200 [thread overview]
Message-ID: <e4277c1a-c8d4-429d-b455-8daa9f4bbd14@redhat.com> (raw)
In-Reply-To: <CAG48ez2dqOF9mM2bAQv1uDGBPWndwOswB0VAkKG7LGkrTXzmzQ@mail.gmail.com>
>> I'm currently looking at the fix and what sticks out is "Fix it with an
>> explicit broadcast IPI through tlb_remove_table_sync_one()".
>>
>> (I don't understand how the page table can be used for "normal,
>> non-hugetlb". I could only see how it is used for the remaining user for
>> hugetlb stuff, but that's different question)
>
> If I remember correctly:
> When a hugetlb shared page table drops to refcount 1, it turns into a
> normal page table. If you then afterwards split the hugetlb VMA, unmap
> one half of it, and place a new unrelated VMA in its place, the same
> page table will be reused for PTEs of this new unrelated VMA.
That makes sense.
>
> So the scenario would be:
>
> 1. Initially, we have a hugetlb shared page table covering 1G of
> address space which maps hugetlb 2M pages, which is used by two
> hugetlb VMAs in different processes (processes P1 and P2).
> 2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and
> walks down through the PUD entry that points to the shared page table,
> then when it reaches the loop in gup_fast_pmd_range() gets interrupted
> for a while by an NMI or preempted by the hypervisor or something.
> 3. P2 removes its VMA, and the hugetlb shared page table effectively
> becomes a normal page table in P1.
> 4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary),
> leaving two VMAs VMA1 and VMA2.
> 5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for
> example an anonymous private VMA.
> 6. P1 populates VMA3 with page table entries.
> 7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now
> uses the new PMD/PTE entries created for VMA3.
Yeah, sounds possible. And nasty.
>
>> How does the fix work when an architecture does not issue IPIs for TLB
>> shootdown? To handle gup-fast on these architectures, we use RCU.
>
> gup-fast disables interrupts, which synchronizes against both RCU and IPI.
Right, but RCU is only used for prevent walking a page table that has
been freed+reused in the meantime (prevent us from de-referencing
garbage entries).
It does not prevent walking the now-unshared page table that has been
modified by the other process.
For that, we need the back-off described below. IIRC we implemented that
in the PMD case for khugepaged.
Or is there somewhere a guaranteed RCU sync before the shared page table
gets reused?
>
>> So I'm wondering whether we use RCU somehow.
>>
>> But note that in gup_fast_pte_range(), we are validating whether the PMD
>> changed:
>>
>> if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>> unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
>> gup_put_folio(folio, 1, flags);
>> goto pte_unmap;
>> }
>>
>>
>> So in case the page table got reused in the meantime, we should just
>> back off and be fine, right?
>
> The shared page table is mapped with a PUD entry, and we don't check
> whether the PUD entry changed here.
Yes, see my follow-up mail, that's what we'd have to add.
On an arch without IPI, page tables will be freed with RCU and it just
works. We walk the wrong page table, realize that the PUD changed and
back off.
On an arch with IPI it's tricky: if we don't issue the IPI you added, we
might still back off once we check the PUD entry didn't changee, but I'm
afraid nothing would stop us from walking the previous page table that
was freed in the meantime, containing garbage.
Easy fix would be never reusing a page table once shared once?
--
Cheers
David / dhildenb
next prev parent reply other threads:[~2025-10-16 19:10 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-29 14:30 Uschakow, Stanislav
2025-09-01 10:58 ` Jann Horn
2025-09-01 11:26 ` David Hildenbrand
2025-09-04 12:39 ` Uschakow, Stanislav
2025-10-08 22:54 ` Prakash Sangappa
2025-10-09 7:23 ` David Hildenbrand
2025-10-09 15:06 ` Prakash Sangappa
2025-10-09 7:40 ` David Hildenbrand
2025-10-09 8:19 ` David Hildenbrand
2025-10-16 9:21 ` Lorenzo Stoakes
2025-10-16 19:13 ` David Hildenbrand
2025-10-16 18:44 ` Jann Horn
2025-10-16 19:10 ` David Hildenbrand [this message]
2025-10-16 19:26 ` Jann Horn
2025-10-16 19:44 ` David Hildenbrand
2025-10-16 20:25 ` Jann Horn
2025-10-20 15:00 ` Lorenzo Stoakes
2025-10-20 15:33 ` Jann Horn
2025-10-24 12:24 ` Lorenzo Stoakes
2025-10-24 18:22 ` Jann Horn
2025-10-24 19:02 ` Lorenzo Stoakes
2025-10-24 19:43 ` Jann Horn
2025-10-24 19:58 ` Lorenzo Stoakes
2025-10-24 21:41 ` Jann Horn
2025-10-29 16:19 ` David Hildenbrand
2025-10-29 18:02 ` Lorenzo Stoakes
2025-11-18 10:03 ` David Hildenbrand (Red Hat)
2025-11-19 16:08 ` Lorenzo Stoakes
2025-11-19 16:29 ` David Hildenbrand (Red Hat)
2025-11-19 16:31 ` David Hildenbrand (Red Hat)
2025-11-20 15:47 ` David Hildenbrand (Red Hat)
2025-12-03 17:22 ` Prakash Sangappa
2025-12-03 19:45 ` David Hildenbrand (Red Hat)
2025-10-20 17:18 ` David Hildenbrand
2025-10-24 9:59 ` Lorenzo Stoakes
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e4277c1a-c8d4-429d-b455-8daa9f4bbd14@redhat.com \
--to=david@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=jannh@google.com \
--cc=liam.howlett@oracle.com \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=mike.kravetz@oracle.com \
--cc=muchun.song@linux.dev \
--cc=nathan@kernel.org \
--cc=ndesaulniers@google.com \
--cc=osalvador@suse.de \
--cc=stable@vger.kernel.org \
--cc=suschako@amazon.de \
--cc=trix@redhat.com \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox