Re: [RFC 00/11] khugepaged: mTHP support

linux-mm.kvack.org archive mirror

From: Dev Jain <dev.jain@arm.com>
To: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	anshuman.khandual@arm.com, catalin.marinas@arm.com,
	cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com,
	apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org,
	baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu,
	haowenchao22@gmail.com, hughd@google.com,
	aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
	peterx@redhat.com, ioworker0@gmail.com,
	wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
	surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
	zhengqi.arch@bytedance.com, jhubbard@nvidia.com,
	21cnbao@gmail.com, willy@infradead.org,
	kirill.shutemov@linux.intel.com, david@redhat.com,
	aarcange@redhat.com, raquini@redhat.com, sunnanyong@huawei.com,
	usamaarif642@gmail.com, audra@redhat.com,
	akpm@linux-foundation.org
Subject: Re: [RFC 00/11] khugepaged: mTHP support
Date: Fri, 24 Jan 2025 13:08:11 +0530	[thread overview]
Message-ID: <f3ffda64-93e4-42c9-bf3a-dabcca070ada@arm.com> (raw)
In-Reply-To: <8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com>



On 24/01/25 12:43 pm, Dev Jain wrote:
> 
> 
> On 24/01/25 1:54 am, Nico Pache wrote:
>> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain <dev.jain@arm.com> wrote:
>>>
>>>
>>>
>>> --- snip ---
>>>>>
>>>>> Although to be honest, it's not super clear to me what the benefit of
>>>>> the bitmap is vs just iterating through the PTEs like Dev does; is there
>>>>> a significant cost saving in practice? On the face of it, it seems like
>>>>> it might be unneeded complexity.
>>>> The bitmap was to encode the state of PMD without needing rescanning
>>>> (or refactor a lot of code). We keep the scan runtime constant at 512
>>>> (for x86). Dev did some good analysis for this here
>>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>>
>>> I think I swayed away and over-analyzed, and probably did not make my
>>> main objection clear enough, so let us cut to the chase.
>>> *Why* is it correct to remember the state of the PMD?
>>>
>>> In __collapse_huge_page_isolate(), we check the PTEs against the sysfs
>>> tunables again, since we dropped the lock. The bitmap thingy which you
>>> are doing, and in general, any algorithm which tries to remember the
>>> state of the PMD, violates the entire point of max_ptes_*. Take for
>>> example: Suppose the PTE table had a lot of shared ptes. After you drop
>>> the PTL, you do this: scan_bitmap() -> read_unlock() ->
>>> alloc_charge_folio() -> read_lock() -> read_unlock()....which is a lot
>> per your recommendation I dropped the read_lock() -> read_unlock() and
>> made it a conditional unlock
> 
> That's not the one I was talking about here...
> 
>>> of stuff. Now, you do write_lock(), which means that you need to wait
>>> for all faulting/forking/mremap/mmap etc to stop. Suppose this process
>>> forks and then a lot of PTEs become shared. The point of max_ptes_shared
>>> is to stop the collapse here, since we do not want memory bloat
>>> (collapse will grab more memory from the buddy and the old memory won't
>>> be freed because it has a reference from the parent/child).
>>
>> That's a fair point, but given the other feedback, my current
>> implementation now requires mTHPs to have no shared/swap, and I've
>> improved the sysctl interactions for the set_bitmap and the
>> max_ptes_none check in the _isolate function.
> 
> I am guessing you are following the policy of letting the creep happen 
> for none ptes, and assuming shared and swap to be zero.

Ah sorry, I read the thread again and it seems we decided on skipping
mTHP collapse unless max_ptes_none is 0 or 511. In any case, we need to
scan the range to check that at least one PTE is filled (max_ptes_none ==
511) or that all PTEs are filled (max_ptes_none == 0), and that none of
them are shared or swap entries.
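
To make that check concrete, here is a minimal user-space sketch, not
kernel code. enum pte_state and range_eligible() are hypothetical names
invented for illustration; a real implementation would walk actual PTEs
under the PTL with the usual helpers. It only shows the shape of the
policy: reject on any shared or swap entry, then require either one
filled entry or all filled entries.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a PTE's state; not a kernel type. */
enum pte_state { PTE_NONE, PTE_PRESENT, PTE_SHARED, PTE_SWAP };

/*
 * Return true if the range [ptes, ptes + nr) may be collapsed under the
 * policy above: no shared or swap entries, and either all entries filled
 * (the max_ptes_none == 0 case) or at least one entry filled (the
 * max_ptes_none == 511 case for a full PMD range).
 */
static bool range_eligible(const enum pte_state *ptes, size_t nr,
			   bool require_all_present)
{
	size_t present = 0;

	for (size_t i = 0; i < nr; i++) {
		switch (ptes[i]) {
		case PTE_SHARED:
		case PTE_SWAP:
			/* Shared and swap entries disqualify the range. */
			return false;
		case PTE_PRESENT:
			present++;
			break;
		case PTE_NONE:
			break;
		}
	}

	return require_all_present ? present == nr : present > 0;
}
```

The scan is a single linear pass over the range, and it has to run
against the current PTEs for its answer to mean anything.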

> 
>>
>> As for *why* remembering the state is correct. It just prevents
>> needing to rescan.
> 
> That is what I am saying...if collapse_huge_page() fails, then you have 
> dropped the mmap write lock, so now the state of the PTEs may have 
> changed, so you must rescan...
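
A toy user-space model of the staleness problem, under loud
assumptions: the PTE table is just an array of booleans, and
count_none(), collapse_justified() and model_munmap() are hypothetical
helpers, not kernel functions. It only illustrates that a scan result
taken before the lock is dropped can disagree with a fresh scan taken
afterwards.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a PTE table: true = filled, false = none. */
#define NR_PTES 8

/* Linear scan: count the none entries in the table. */
static size_t count_none(const bool *filled, size_t nr)
{
	size_t none = 0;

	for (size_t i = 0; i < nr; i++)
		if (!filled[i])
			none++;
	return none;
}

/* The max_ptes_none policy check applied to a scan result. */
static bool collapse_justified(size_t none, size_t max_ptes_none)
{
	return none <= max_ptes_none;
}

/* Models a munmap() of [start, start + len) while no lock is held. */
static void model_munmap(bool *filled, size_t start, size_t len)
{
	for (size_t i = start; i < start + len; i++)
		filled[i] = false;
}
```

If the caller keeps the count from before model_munmap() runs,
collapse_justified() still answers yes for that stale count, while a
fresh count_none() over the same table can answer no.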
> 
>>
>>> Another example would be, a sysadmin does not want too much memory
>>> wastage from khugepaged, so we decide to set max_ptes_none low. When you
>>> scan the PTE table you justify the collapse. After you drop the PTL and
>>> the mmap_lock, a munmap() happens in the region, no longer justifying
>>> the collapse. If you have a lot of VMAs of size <= 2MB, then any
>>> munmap() on a VMA will happen on the single PTE table present.
>>>
>>> So, IMHO before even jumping on analyzing the bitmap algorithm, we need
>>> to ask whether any algorithm remembering the state of the PMD is even
>>> conceptually right.
>>
>> Both the issues you raised don't really have to do with the bitmap...
> 
> Correct, my issue is with any general algorithm remembering PTE state.
> 
>> they are fair points, but they are more of a criticism of my sysctl
>> handling. I've cleaned up the max_ptes_none interactions, and now that
>> we don't plan to initially support swap/shared both these problems are
>> 'gone'.
>>>
>>> Then, you have the harder task of proving that your optimization is
>>> actually an optimization, that it is not turned into being futile
>>> because of overhead. From a high-level mathematical PoV, you are saving
>>> iterations. Any mathematical analysis has the underlying assumption that
>>> every iteration is equal. But the list [pte, pte + 1, ....., pte + (1 <<
>>> order)] is virtually and physically contiguous in memory so prefetching
>>> helps us. You are trying to save on pte memory references, but then look
>>> at the number of bitmap memory references you have created, not to
>>> mention that you are doing a (costly?) division operation in there, you
>>> have a while loop, a stack, new structs, and if conditions. I do not see
>>> how this is any faster than a naive linear scan.
>>
>> Yeah it's hard to say without real performance testing. I hope to
>> include some performance results with my next post.
>>
>>>
>>>> This prevents needing to hold the read lock for longer, and prevents
>>>> needing to reacquire it too.
>>>
>>> My implementation does not hold the read lock for longer. What you mean
>>> to say is, I need to reacquire the lock, and this is by design, to
>> yes sorry.
>>> ensure correctness, which boils down to what I wrote above.
>> The write lock is what ensures correctness, not the read lock. The
>> read lock is to gain insight of potential collapse candidates while
>> avoiding the cost of the write lock.
>>
>> Cheers!
>> -- Nico
>>>
>>
> 
>

