linux-mm.kvack.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Jann Horn <jannh@google.com>, Nico Pache <npache@redhat.com>
Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	akpm@linux-foundation.org, corbet@lwn.net, rostedt@goodmis.org,
	mhiramat@kernel.org, mathieu.desnoyers@efficios.com,
	baohua@kernel.org, baolin.wang@linux.alibaba.com,
	ryan.roberts@arm.com, willy@infradead.org, peterx@redhat.com,
	ziy@nvidia.com, wangkefeng.wang@huawei.com,
	usamaarif642@gmail.com, sunnanyong@huawei.com,
	vishal.moola@gmail.com, thomas.hellstrom@linux.intel.com,
	yang@os.amperecomputing.com, kirill.shutemov@linux.intel.com,
	aarcange@redhat.com, raquini@redhat.com, dev.jain@arm.com,
	anshuman.khandual@arm.com, catalin.marinas@arm.com,
	tiwai@suse.de, will@kernel.org, dave.hansen@linux.intel.com,
	jack@suse.cz, cl@gentwo.org, jglisse@google.com,
	surenb@google.com, zokeefe@google.com, hannes@cmpxchg.org,
	rientjes@google.com, mhocko@suse.com, rdunlap@infradead.org,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com
Subject: Re: [PATCH v5 07/12] khugepaged: add mTHP support
Date: Fri, 2 May 2025 17:18:54 +0200
Message-ID: <b91af3df-643e-4131-96e7-2680f322194d@redhat.com>
In-Reply-To: <CAG48ez0oe6oZ0QTDWr1rtUJFFTLO69sVpdoCmo+CxZUb6L1seg@mail.gmail.com>

On 02.05.25 14:50, Jann Horn wrote:
> On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote:
>> On 02.05.25 00:29, Nico Pache wrote:
>>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
>>>>
>>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
>>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
>>>>> While scanning PMD ranges for potential collapse candidates, keep track
>>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
>>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
>>>>> mTHPs are enabled, we remove the restriction of max_ptes_none during the
>>>>> scan phase so we don't bail out early and miss potential mTHP candidates.
>>>>>
>>>>> After the scan is complete we will perform binary recursion on the
>>>>> bitmap to determine which mTHP size would be most efficient to collapse
>>>>> to. max_ptes_none will be scaled by the attempted collapse order to
>>>>> determine how full a THP must be to be eligible.
>>>>>
>>>>> If an mTHP collapse is attempted but contains swapped-out or shared
>>>>> pages, we don't perform the collapse.
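The order-based scaling of max_ptes_none described above can be sketched as a small userspace helper. This is an illustrative sketch, not the patch's actual code: scaled_max_ptes_none is a hypothetical name, and the constant assumes x86-64 with 4K pages.

```c
#include <assert.h>

/* Illustrative constant: 512 PTEs per PMD (x86-64, 4K pages). */
#define HPAGE_PMD_ORDER 9

/*
 * Hypothetical helper: max_ptes_none is tuned for a full PMD-order
 * collapse; for a smaller order-`order` mTHP attempt, scale the
 * threshold down proportionally so that "how full a THP must be to
 * be eligible" stays comparable across orders.
 */
static int scaled_max_ptes_none(int order, int max_ptes_none)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}
```

With the default max_ptes_none of 511, an order-4 attempt would tolerate at most 15 empty PTEs out of 16.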
>>>> [...]
>>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>>>           vma_start_write(vma);
>>>>>           anon_vma_lock_write(vma->anon_vma);
>>>>>
>>>>> -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>>>>> -                               address + HPAGE_PMD_SIZE);
>>>>> +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
>>>>> +                               _address + (PAGE_SIZE << order));
>>>>>           mmu_notifier_invalidate_range_start(&range);
>>>>>
>>>>>           pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>>>> +
>>>>>           /*
>>>>>            * This removes any huge TLB entry from the CPU so we won't allow
>>>>>            * huge and small TLB entries for the same virtual address to
>>>>
>>>> It's not visible in this diff, but we're about to do a
>>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
>>>> entire page table, meaning it tears down 2MiB of address space; and it
>>>> assumes that the entire page table exclusively corresponds to the
>>>> current VMA.
>>>>
>>>> I think you'll need to ensure that the pmdp_collapse_flush() only
>>>> happens for full-size THP, and that mTHP only tears down individual
>>>> PTEs in the relevant range. (That code might get a bit messy, since
>>>> the existing THP code tears down PTEs in a detached page table, while
>>>> mTHP would have to do it in a still-attached page table.)
>>> Hi Jann!
>>>
>>> I was under the impression that this is needed to prevent GUP-fast
>>> races (and potentially others).
> 
> Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
> 
>>> As you state here, conceptually the PMD case is, detach the PMD, do
>>> the collapse, then reinstall the PMD (similarly to how the system
>>> recovers from a failed PMD collapse). I tried to keep the current
>>> locking behavior as it seemed the easiest way to get it right (and not
>>> break anything). So I keep the PMD detaching and reinstalling for the
>>> mTHP case too. As Hugh points out I am releasing the anon lock too
>>> early. I will comment further on his response.
> 
> As I see it, you're not "keeping" the current locking behavior; you're
> making a big implicit locking change by reusing a codepath designed
> for PMD THP for mTHP, where the page table may not be exclusively
> owned by one VMA.

That is not the intention. The intention in this series (at least as we 
discussed) was to not do it across VMAs; that is considered the next 
logical step (which will be especially relevant on arm64 IMHO).

> 
>>> As I familiarize myself with the code more, I do see potential code
>>> improvements/cleanups and locking improvements, but I was going to
>>> leave those to a later series.
>>
>> Right, the simplest approach on top of the current PMD collapse is to do
>> exactly what we do in the PMD case, including the locking: which
>> apparently is not completely the same yet :).
>>
>> Instead of installing a PMD THP, we modify the page table and remap that.
>>
>> Moving from the PMD lock to the PTE lock will not make a big change in
>> practice for most cases: we already must disable essentially all page
>> table walkers (vma lock, mmap lock in write, rmap lock in write).
>>
>> The PMDP clear+flush is primarily to disable the last possible set of
>> page table walkers: (1) HW modifications and (2) GUP-fast.
>>
>> So after the PMDP clear+flush we know that (A) HW can not modify the
>> pages concurrently and (B) GUP-fast cannot succeed anymore.
>>
>> The issue with PTEP clear+flush is that we will have to remember all PTE
>> values, to reset them if anything goes wrong. Using a single PMD value
>> is arguably simpler. And then, the benefit vs. complexity is unclear.
>>
>> Certainly something to look into later, but not a requirement for the
>> first support.
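The bookkeeping a PTE-granularity clear+flush would require can be sketched in userspace as follows. Names and types here are hypothetical stand-ins, not kernel code; the point is only that every old PTE value must be remembered so a failed collapse can restore the range exactly.

```c
#include <assert.h>
#include <string.h>

typedef unsigned long pte_t;	/* simplified stand-in for the kernel type */

/*
 * Sketch: before clearing the PTEs covering the collapse candidate,
 * save their old values so a failed mTHP collapse can be undone.
 */
static void save_and_clear_ptes(pte_t *ptes, pte_t *backup, int nr)
{
	memcpy(backup, ptes, nr * sizeof(*ptes));	/* remember old values */
	memset(ptes, 0, nr * sizeof(*ptes));		/* leave pte_none() entries */
}

static void restore_ptes(pte_t *ptes, const pte_t *backup, int nr)
{
	memcpy(ptes, backup, nr * sizeof(*ptes));	/* undo after failure */
}
```

Compared to stashing a single PMD value, this needs O(nr) extra state per attempt, which is the complexity trade-off mentioned above.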
> 
> As I understand, one rule we currently have in MM is that an operation
> that logically operates on one VMA (VMA 1) does not touch the page
> tables of other VMAs (VMA 2) in any way, except that it may walk page
> tables that cover address space that intersects with both VMA 1 and
> VMA 2, and create such page tables if they are missing.

Yes, absolutely. That must not happen. And I think I raised it as a 
problem in reply to one of Dev's series.

If this series does not honor that rule, it must be fixed.

> 
> This proposed patch changes that, without explicitly discussing this
> locking change.

Yes, that must not happen. We must not zap a PMD to temporarily replace 
it with a pmd_none() entry if any other sane page table walker could 
stumble over it.

This includes another VMA that is not write-locked that could span the PMD.
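The invariant can be sketched as a hypothetical range check: a PMD may only be transiently zapped if its entire 2 MiB range lies within the write-locked VMA. The helper name and the PMD_SIZE value (assuming x86-64 with 4K pages) are illustrative, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

#define PMD_SIZE (1UL << 21)	/* 2 MiB, assuming x86-64 with 4K pages */

/*
 * Sketch of the safety condition: only detach a PMD if the whole PMD
 * range lies inside this (write-locked) VMA, so no walker of a
 * neighbouring VMA can stumble over the transient pmd_none() entry.
 */
static bool pmd_range_covered_by_vma(unsigned long vma_start,
				     unsigned long vma_end,
				     unsigned long addr)
{
	unsigned long pmd_start = addr & ~(PMD_SIZE - 1);

	return vma_start <= pmd_start && pmd_start + PMD_SIZE <= vma_end;
}
```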

-- 
Cheers,

David / dhildenb


