Re: [RFC 00/11] khugepaged: mTHP support


From: Dev Jain <dev.jain@arm.com>
To: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	anshuman.khandual@arm.com, catalin.marinas@arm.com,
	cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com,
	apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org,
	baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu,
	haowenchao22@gmail.com, hughd@google.com,
	aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
	peterx@redhat.com, ioworker0@gmail.com,
	wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
	surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
	zhengqi.arch@bytedance.com, jhubbard@nvidia.com,
	21cnbao@gmail.com, willy@infradead.org,
	kirill.shutemov@linux.intel.com, david@redhat.com,
	aarcange@redhat.com, raquini@redhat.com, sunnanyong@huawei.com,
	usamaarif642@gmail.com, audra@redhat.com,
	akpm@linux-foundation.org
Subject: Re: [RFC 00/11] khugepaged: mTHP support
Date: Fri, 24 Jan 2025 12:43:39 +0530
Message-ID: <8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com>
In-Reply-To: <CAA1CXcCOWNDsVOWj8apTqoFaJJsG=op4637qgAJ1k6GzmTtXEA@mail.gmail.com>



On 24/01/25 1:54 am, Nico Pache wrote:
> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> --- snip ---
>>>>
>>>> Although to be honest, it's not super clear to me what the benefit of the bitmap
>>>> is vs just iterating through the PTEs like Dev does; is there a significant cost
>>>> saving in practice? On the face of it, it seems like it might be unneeded complexity.
>>> The bitmap was to encode the state of PMD without needing rescanning
>>> (or refactor a lot of code). We keep the scan runtime constant at 512
>>> (for x86). Dev did some good analysis for this here
>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>
>> I think I strayed off course and over-analyzed, and probably did not make my
>> main objection clear enough, so let us cut to the chase.
>> *Why* is it correct to remember the state of the PMD?
>>
>> In __collapse_huge_page_isolate(), we check the PTEs against the sysfs
>> tunables again, since we dropped the lock. The bitmap thingy which you
>> are doing, and in general, any algorithm which tries to remember the
>> state of the PMD, violates the entire point of max_ptes_*. Take for
>> example: Suppose the PTE table had a lot of shared ptes. After you drop
>> the PTL, you do this: scan_bitmap() -> read_unlock() ->
>> alloc_charge_folio() -> read_lock() -> read_unlock()....which is a lot
> per your recommendation I dropped the read_lock() -> read_unlock() and
> made it a conditional unlock

That's not the one I was talking about here...

>> of stuff. Now, you do write_lock(), which means that you need to wait
>> for all faulting/forking/mremap/mmap etc to stop. Suppose this process
>> forks and then a lot of PTEs become shared. The point of max_ptes_shared
>> is to stop the collapse here, since we do not want memory bloat
>> (collapse will grab more memory from the buddy and the old memory won't
>> be freed because it has a reference from the parent/child).
> 
> That's a fair point, but given the other feedback, my current
> implementation now requires mTHPs to have no shared/swap, and I've
> improved the sysctl interactions for the set_bitmap and the
> max_ptes_none check in the _isolate function.

I am guessing you are following the policy of letting the creep happen 
for none ptes, and assuming shared and swap to be zero.

> 
> As for *why* remembering the state is correct. It just prevents
> needing to rescan.

That is what I am saying: if collapse_huge_page() fails, you have
dropped the mmap write lock, so the state of the PTEs may have changed
and you must rescan.
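
To make the staleness argument concrete, here is a minimal userspace
sketch. It is not kernel code: the enum, the struct and every function
name below are hypothetical stand-ins invented for illustration, and the
PTE table is just an array. It only shows that a scan result taken
before a fork() or munmap() can stop being valid afterwards, so a
remembered verdict has to be recomputed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for what a PTE can look like to the scanner. */
enum pte_state { PTE_NONE, PTE_PRIVATE, PTE_SHARED };

/* Hypothetical mirror of the max_ptes_none / max_ptes_shared tunables. */
struct limits {
	size_t max_ptes_none;
	size_t max_ptes_shared;
};

/* One linear pass: does the range still satisfy both limits? */
bool scan_allows_collapse(const enum pte_state *ptes, size_t nr,
			  const struct limits *lim)
{
	size_t none = 0, shared = 0;

	assert(ptes && lim);
	for (size_t i = 0; i < nr; i++) {
		if (ptes[i] == PTE_NONE && ++none > lim->max_ptes_none)
			return false;
		if (ptes[i] == PTE_SHARED && ++shared > lim->max_ptes_shared)
			return false;
	}
	return true;
}

/* Stand-in for fork(): every private mapping becomes shared. */
void simulate_fork(enum pte_state *ptes, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (ptes[i] == PTE_PRIVATE)
			ptes[i] = PTE_SHARED;
}

/* Stand-in for munmap() on [start, start + len). */
void simulate_munmap(enum pte_state *ptes, size_t start, size_t len)
{
	for (size_t i = start; i < start + len; i++)
		ptes[i] = PTE_NONE;
}
```

Calling scan_allows_collapse() once, then simulate_fork() or
simulate_munmap(), then scan_allows_collapse() again gives a different
answer whenever the change pushes a counter past its limit. That second
call is the rescan.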

> 
>> Another example would be, a sysadmin does not want too much memory
>> wastage from khugepaged, so we decide to set max_ptes_none low. When you
>> scan the PTE table you justify the collapse. After you drop the PTL and
>> the mmap_lock, a munmap() happens in the region, no longer justifying
>> the collapse. If you have a lot of VMAs of size <= 2MB, then any
>> munmap() on a VMA will happen on the single PTE table present.
>>
>> So, IMHO before even jumping on analyzing the bitmap algorithm, we need
>> to ask whether any algorithm remembering the state of the PMD is even
>> conceptually right.
> 
> Both the issues you raised don't really have to do with the bitmap...

Correct, my issue is with any general algorithm remembering PTE state.

> they are fair points, but they are more of a criticism of my sysctl
> handling. I've cleaned up the max_ptes_none interactions, and now that
> we don't plan to initially support swap/shared both these problems are
> 'gone'.
>>
>> Then, you have the harder task of proving that your optimization is
>> actually an optimization, that it is not rendered futile by its
>> overhead. From a high-level mathematical PoV, you are saving
>> iterations. Any mathematical analysis has the underlying assumption that
>> every iteration is equal. But the list [pte, pte + 1, ....., pte + (1 <<
>> order)] is virtually and physically contiguous in memory so prefetching
>> helps us. You are trying to save on pte memory references, but then look
>> at the number of bitmap memory references you have created, not to
>> mention that you are doing a (costly?) division operation in there, you
>> have a while loop, a stack, new structs, and if conditions. I do not see
>> how this is any faster than a naive linear scan.
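
The naive linear scan referred to above can be sketched in userspace C
as follows. This is a sketch under assumptions, not code from either
series: count_none(), pick_order(), the bool array standing in for a
PTE table, and the way the none limit is scaled per order are all made
up for illustration. The point is only that each candidate window is
one sequential pass over contiguous memory, with no auxiliary bitmap or
stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PTES 512	/* entries in one PTE table (4K pages, 2M PMD) */

/* Count empty entries in [start, start + (1 << order)) in one pass. */
size_t count_none(const bool *present, size_t start, unsigned int order)
{
	size_t nr = (size_t)1 << order, none = 0;

	for (size_t i = 0; i < nr; i++)
		none += !present[start + i];
	return none;
}

/*
 * Try orders from max_order down to min_order for the naturally aligned
 * window containing idx. Scaling the limit as max_none >> (max_order -
 * order) is an assumption made for this sketch only.
 * Returns the first order that qualifies, or -1 if none does.
 */
int pick_order(const bool *present, size_t idx, unsigned int max_order,
	       unsigned int min_order, size_t max_none)
{
	assert(idx < NR_PTES);
	assert(min_order <= max_order && ((size_t)1 << max_order) <= NR_PTES);

	for (unsigned int order = max_order; ; order--) {
		size_t start = idx & ~(((size_t)1 << order) - 1);
		size_t limit = max_none >> (max_order - order);

		if (count_none(present, start, order) <= limit)
			return (int)order;
		if (order == min_order)
			break;
	}
	return -1;
}
```

Whether this beats a bitmap-driven search in practice is exactly what
would need measuring; the sketch makes no claim either way.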
> 
> Yeah it's hard to say without real performance testing. I hope to
> include some performance results with my next post.
> 
>>
>>> This prevents needing to hold the read lock for longer, and prevents
>>> needing to reacquire it too.
>>
>> My implementation does not hold the read lock for longer. What you mean
>> to say is, I need to reacquire the lock, and this is by design, to
> yes sorry.
>> ensure correctness, which boils down to what I wrote above.
> The write lock is what ensures correctness, not the read lock. The
> read lock is to gain insight into potential collapse candidates while
> avoiding the cost of the write lock.
> 
> Cheers!
> -- Nico
>>
>
