From: Nico Pache <npache@redhat.com>
To: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@redhat.com>,
linux-mm@kvack.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-trace-kernel@vger.kernel.org, akpm@linux-foundation.org,
corbet@lwn.net, rostedt@goodmis.org, mhiramat@kernel.org,
mathieu.desnoyers@efficios.com, baohua@kernel.org,
baolin.wang@linux.alibaba.com, ryan.roberts@arm.com,
willy@infradead.org, peterx@redhat.com, ziy@nvidia.com,
wangkefeng.wang@huawei.com, usamaarif642@gmail.com,
sunnanyong@huawei.com, vishal.moola@gmail.com,
thomas.hellstrom@linux.intel.com, yang@os.amperecomputing.com,
kirill.shutemov@linux.intel.com, aarcange@redhat.com,
raquini@redhat.com, dev.jain@arm.com, anshuman.khandual@arm.com,
catalin.marinas@arm.com, tiwai@suse.de, will@kernel.org,
dave.hansen@linux.intel.com, jack@suse.cz, cl@gentwo.org,
jglisse@google.com, surenb@google.com, zokeefe@google.com,
hannes@cmpxchg.org, rientjes@google.com, mhocko@suse.com,
rdunlap@infradead.org, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com
Subject: Re: [PATCH v5 07/12] khugepaged: add mTHP support
Date: Fri, 2 May 2025 09:30:15 -0600 [thread overview]
Message-ID: <CAA1CXcBn3eZpcdzsG5DRNBa+rFBg1dQjssC=SYyXW5_7fDmwXg@mail.gmail.com> (raw)
In-Reply-To: <CAG48ez25Wo7BSt4dn3v+xO0mojvjtBpd02kaAu7kn_+AfgSrNQ@mail.gmail.com>
On Fri, May 2, 2025 at 9:27 AM Jann Horn <jannh@google.com> wrote:
>
> On Fri, May 2, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 02.05.25 14:50, Jann Horn wrote:
> > > On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote:
> > >> On 02.05.25 00:29, Nico Pache wrote:
> > >>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
> > >>>>
> > >>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
> > >>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > >>>>> While scanning PMD ranges for potential collapse candidates, keep track
> > >>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > >>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > >>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the
> > >>>>> scan phase so we dont bailout early and miss potential mTHP candidates.
> > >>>>>
> > >>>>> After the scan is complete we will perform binary recursion on the
> > >>>>> bitmap to determine which mTHP size would be most efficient to collapse
> > >>>>> to. max_ptes_none will be scaled by the attempted collapse order to
> > >>>>> determine how full a THP must be to be eligible.
> > >>>>>
> > >>>>> If a mTHP collapse is attempted, but contains swapped out, or shared
> > >>>>> pages, we dont perform the collapse.
> > >>>> [...]
> > >>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > >>>>> vma_start_write(vma);
> > >>>>> anon_vma_lock_write(vma->anon_vma);
> > >>>>>
> > >>>>> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > >>>>> - address + HPAGE_PMD_SIZE);
> > >>>>> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > >>>>> + _address + (PAGE_SIZE << order));
> > >>>>> mmu_notifier_invalidate_range_start(&range);
> > >>>>>
> > >>>>> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > >>>>> +
> > >>>>> /*
> > >>>>> * This removes any huge TLB entry from the CPU so we won't allow
> > >>>>> * huge and small TLB entries for the same virtual address to
> > >>>>
> > >>>> It's not visible in this diff, but we're about to do a
> > >>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
> > >>>> entire page table, meaning it tears down 2MiB of address space; and it
> > >>>> assumes that the entire page table exclusively corresponds to the
> > >>>> current VMA.
> > >>>>
> > >>>> I think you'll need to ensure that the pmdp_collapse_flush() only
> > >>>> happens for full-size THP, and that mTHP only tears down individual
> > >>>> PTEs in the relevant range. (That code might get a bit messy, since
> > >>>> the existing THP code tears down PTEs in a detached page table, while
> > >>>> mTHP would have to do it in a still-attached page table.)
> > >>> Hi Jann!
> > >>>
> > >>> I was under the impression that this is needed to prevent GUP-fast
> > >>> races (and potentially others).
> > >
> > > Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
> > >
> > >>> As you state here, conceptually the PMD case is, detach the PMD, do
> > >>> the collapse, then reinstall the PMD (similarly to how the system
> > >>> recovers from a failed PMD collapse). I tried to keep the current
> > >>> locking behavior as it seemed the easiest way to get it right (and not
> > >>> break anything). So I keep the PMD detaching and reinstalling for the
> > >>> mTHP case too. As Hugh points out I am releasing the anon lock too
> > >>> early. I will comment further on his response.
> > >
> > > As I see it, you're not "keeping" the current locking behavior; you're
> > > making a big implicit locking change by reusing a codepath designed
> > > for PMD THP for mTHP, where the page table may not be exclusively
> > > owned by one VMA.
> >
> > That is not the intention. The intention in this series (at least as we
> > discussed) was to not do it across VMAs; that is considered the next
> > logical step (which will be especially relevant on arm64 IMHO).
>
> Ah, so for now this is supposed to only work for PTEs which are in a
> PMD which is fully covered by the VMA? So if I make a 16KiB VMA and
> then try to collapse its contents to an order-2 mTHP page, that should
> just not work?
Correct! As I started in reply to Hugh, the locking conditions explode
if we drop that requirement. A simple workaround we've considered is
only collapsing if a single VMA intersects a PMD. I can make sure this
is more clear in the coverletter + this patch.
Cheers,
-- Nico
>
next prev parent reply other threads:[~2025-05-02 15:30 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-28 18:12 [PATCH v5 00/12] khugepaged: " Nico Pache
2025-04-28 18:12 ` [PATCH v5 01/12] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
2025-04-29 13:41 ` Zi Yan
2025-04-30 7:47 ` Baolin Wang
2025-04-28 18:12 ` [PATCH v5 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
2025-04-30 7:56 ` Baolin Wang
2025-04-28 18:12 ` [PATCH v5 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-04-28 18:12 ` [PATCH v5 04/12] khugepaged: generalize alloc_charge_folio() Nico Pache
2025-04-28 18:12 ` [PATCH v5 05/12] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
2025-04-28 18:12 ` [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
2025-04-30 10:08 ` Baolin Wang
2025-04-30 18:56 ` Nico Pache
2025-05-01 23:03 ` Nico Pache
2025-05-02 1:23 ` Baolin Wang
2025-06-06 16:37 ` Dev Jain
2025-06-07 12:48 ` Nico Pache
2025-04-28 18:12 ` [PATCH v5 07/12] khugepaged: add " Nico Pache
2025-04-30 20:51 ` Jann Horn
2025-05-01 22:29 ` Nico Pache
2025-05-02 6:29 ` David Hildenbrand
2025-05-02 12:50 ` Jann Horn
2025-05-02 15:18 ` David Hildenbrand
2025-05-02 15:24 ` Lorenzo Stoakes
2025-05-02 15:39 ` David Hildenbrand
2025-05-02 15:26 ` Jann Horn
2025-05-02 15:30 ` Nico Pache [this message]
2025-05-02 15:42 ` David Hildenbrand
2025-05-01 12:58 ` Hugh Dickins
2025-05-01 16:15 ` Andrew Morton
2025-05-01 22:30 ` Nico Pache
2025-04-28 18:12 ` [PATCH v5 08/12] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2025-04-30 10:09 ` Baolin Wang
2025-04-28 18:12 ` [PATCH v5 09/12] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
2025-04-30 10:12 ` Baolin Wang
2025-04-30 18:43 ` Nico Pache
2025-04-28 18:12 ` [PATCH v5 10/12] khugepaged: improve tracepoints for mTHP orders Nico Pache
2025-04-28 18:12 ` [PATCH v5 11/12] khugepaged: add per-order mTHP khugepaged stats Nico Pache
2025-04-28 18:12 ` [PATCH v5 12/12] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
2025-04-29 5:10 ` Bagas Sanjaya
2025-04-29 16:38 ` Christoph Lameter (Ampere)
2025-04-29 17:15 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAA1CXcBn3eZpcdzsG5DRNBa+rFBg1dQjssC=SYyXW5_7fDmwXg@mail.gmail.com' \
--to=npache@redhat.com \
--cc=Liam.Howlett@oracle.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=anshuman.khandual@arm.com \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=catalin.marinas@arm.com \
--cc=cl@gentwo.org \
--cc=corbet@lwn.net \
--cc=dave.hansen@linux.intel.com \
--cc=david@redhat.com \
--cc=dev.jain@arm.com \
--cc=hannes@cmpxchg.org \
--cc=jack@suse.cz \
--cc=jannh@google.com \
--cc=jglisse@google.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-trace-kernel@vger.kernel.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=mathieu.desnoyers@efficios.com \
--cc=mhiramat@kernel.org \
--cc=mhocko@suse.com \
--cc=peterx@redhat.com \
--cc=raquini@redhat.com \
--cc=rdunlap@infradead.org \
--cc=rientjes@google.com \
--cc=rostedt@goodmis.org \
--cc=ryan.roberts@arm.com \
--cc=sunnanyong@huawei.com \
--cc=surenb@google.com \
--cc=thomas.hellstrom@linux.intel.com \
--cc=tiwai@suse.de \
--cc=usamaarif642@gmail.com \
--cc=vishal.moola@gmail.com \
--cc=wangkefeng.wang@huawei.com \
--cc=will@kernel.org \
--cc=willy@infradead.org \
--cc=yang@os.amperecomputing.com \
--cc=ziy@nvidia.com \
--cc=zokeefe@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox